1. Defining the Question

1.1) Specifying the Data Analytic Question

An online cryptography course has been created and needs to be advertised on the blog of the entrepreneur who created it. The target audience for the advert originates from various countries. In the past, the entrepreneur has run ads for a related course on the same blog and collected data in the process. She therefore wishes to employ my services as a Data Science Consultant to help her identify which individuals are most likely to click on her ads.

1.2) Defining the Metric of Success

The metric of success for this project is identifying the individuals who are most likely to click on her ads, using Univariate and Bivariate Analysis.

1.3) Understanding the Context

This project is set in the advertising industry. Clicks give an indication of how well an ad appeals to the people who see it. Relevant, highly targeted ads are more likely to receive clicks.

1.4) Recording the Experimental Design

For this analysis, I will perform the following actions:

  1. Loading the Data.

  2. Reading the Data.

  3. Cleaning the Dataset.

  4. Performing EDA:

    • Univariate Analysis.

    • Bivariate Analysis.

  5. Modelling.

  6. Conclusion.

  7. Recommendation.

1.5) Data Relevance

This data is relevant because it provides a consolidated view of the different audiences, which makes audience management and optimization simpler.

2. Reading the Data

# Reading our data from a csv file.
advert<-read.csv('advertising.csv')
# Checking the class of the file.
class(advert)
## [1] "data.frame"
# Checking the dimension of the dataframe.
dim(advert)
## [1] 1000   10

The data frame has 1000 rows (entries) and 10 columns (fields).

# Previewing the first six records of the dataframe (head() returns six rows by default).
head(advert)
##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1                    68.95  35    61833.90               256.09
## 2                    80.23  31    68441.85               193.77
## 3                    69.47  26    59785.94               236.50
## 4                    74.15  29    54806.18               245.89
## 5                    68.37  35    73889.99               225.58
## 6                    59.99  23    59761.56               226.74
##                           Ad.Topic.Line           City Male    Country
## 1    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2    Monitored national standardization      West Jodi    1      Nauru
## 3      Organic bottom-line service-desk       Davidton    0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5         Robust logistical utilization   South Manuel    0    Iceland
## 6       Sharable client-driven software      Jamieberg    1     Norway
##             Timestamp Clicked.on.Ad
## 1 2016-03-27 00:53:11             0
## 2 2016-04-04 01:39:02             0
## 3 2016-03-13 20:35:42             0
## 4 2016-01-10 02:31:19             0
## 5 2016-06-03 03:36:18             0
## 6 2016-05-19 14:30:17             0
# Previewing the last six records of the dataframe (tail() returns six rows by default).
tail(advert)
##      Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 995                     43.70  28    63126.96               173.01
## 996                     72.97  30    71384.57               208.58
## 997                     51.30  45    67782.17               134.42
## 998                     51.63  51    42415.72               120.37
## 999                     55.55  19    41920.79               187.95
## 1000                    45.01  26    29875.80               178.35
##                             Ad.Topic.Line          City Male
## 995         Front-line bifurcated ability  Nicholasland    0
## 996         Fundamental modular algorithm     Duffystad    1
## 997       Grass-roots cohesive monitoring   New Darlene    1
## 998          Expanded intangible solution South Jessica    1
## 999  Proactive bandwidth-monitored policy   West Steven    0
## 1000      Virtual 5thgeneration emulation   Ronniemouth    0
##                     Country           Timestamp Clicked.on.Ad
## 995                 Mayotte 2016-04-04 03:57:48             1
## 996                 Lebanon 2016-02-11 21:49:00             1
## 997  Bosnia and Herzegovina 2016-04-22 02:07:01             1
## 998                Mongolia 2016-02-01 17:24:57             1
## 999               Guatemala 2016-03-24 02:35:54             0
## 1000                 Brazil 2016-06-03 21:43:21             1
# Printing information on the structure of a data frame.
str(advert)
## 'data.frame':    1000 obs. of  10 variables:
##  $ Daily.Time.Spent.on.Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area.Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily.Internet.Usage    : num  256 194 236 246 226 ...
##  $ Ad.Topic.Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : chr  "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
##  $ Male                    : int  0 1 0 1 0 1 0 1 1 1 ...
##  $ Country                 : chr  "Tunisia" "Nauru" "San Marino" "Italy" ...
##  $ Timestamp               : chr  "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
##  $ Clicked.on.Ad           : int  0 0 0 0 0 0 0 1 0 0 ...

The Timestamp column is stored as character rather than as a date-time type. During data cleaning, we will convert it to the appropriate data type.
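
As a sketch of what such a conversion can look like, as.POSIXct() parses strings in this layout into a date-time vector from which parts such as the month or hour can be extracted. This is one possible approach and not necessarily the one used later in this report; the time zone "UTC" is an assumption, since the data does not state one.

```r
# Two timestamp strings in the same layout as the Timestamp column.
stamps <- c("2016-03-27 00:53:11", "2016-04-04 01:39:02")

# Parse the strings; tz = "UTC" is an assumption made for this sketch.
parsed <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Inspect the class and pull out the month and the hour as text.
print(class(parsed))
months <- format(parsed, "%m")
hours <- format(parsed, "%H")
print(months)
print(hours)
```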

3. Tidying the Dataset

3.1 Dealing with Outliers
# Checking for outliers in every numerical column
outliers <- function(data){
  # Draw the boxplot and return the values it flags as outliers.
  stats <- boxplot(data)
  return (stats$out)
}
boxplot(advert$Daily.Time.Spent.on.Site, main="Boxplot on Daily Time Spent on Site")

boxplot(advert$Age, main="Boxplot on Age")

boxplot(advert$Area.Income, main="Boxplot on Area Income")

boxplot(advert$Daily.Internet.Usage, main="Boxplot on Daily Internet Usage")

boxplot(advert$Male, main="Boxplot on Male")

boxplot(advert$Clicked.on.Ad, main="Boxplot on Clicked on Ad")

From the boxplots, we see that the column with outliers is Area.Income.

# Displaying all the outliers in the Area.Income column.
boxplot.stats(advert$Area.Income)$out
## [1] 17709.98 18819.34 15598.29 15879.10 14548.06 13996.50 14775.50 18368.57
# Extracting the row numbers where these outliers are found.
out <- boxplot.stats(advert$Area.Income)$out
out_ind <- which(advert$Area.Income %in% c(out))
out_ind
## [1] 136 511 641 666 693 769 779 953
# Displaying the rows with the outliers in the Area.Income column.
advert[out_ind,]
##     Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 136                    49.89  39    17709.98               160.03
## 511                    57.86  30    18819.34               166.86
## 641                    64.63  45    15598.29               158.80
## 666                    58.05  32    15879.10               195.54
## 693                    66.26  47    14548.06               179.04
## 769                    68.58  41    13996.50               171.54
## 779                    52.67  44    14775.50               191.26
## 953                    62.79  36    18368.57               231.87
##                                    Ad.Topic.Line             City Male
## 136           Enhanced system-worthy application     East Michele    1
## 511                   Horizontal modular success        Estesfurt    0
## 641 Triple-buffered high-level Internet solution     Isaacborough    1
## 666              Total asynchronous architecture      Sanderstown    1
## 693               Optional full-range projection      Matthewtown    1
## 769                  Exclusive discrete firmware New Williamville    1
## 779     Persevering 5thgeneration knowledge user    New Hollyberg    0
## 953                       Total coherent archive        New James    1
##         Country           Timestamp Clicked.on.Ad
## 136      Belize 2016-04-16 12:09:25             1
## 511     Algeria 2016-07-08 17:14:01             1
## 641  Azerbaijan 2016-06-12 03:11:04             1
## 666  Tajikistan 2016-02-12 10:39:10             1
## 693     Lebanon 2016-04-25 19:31:39             1
## 769 El Salvador 2016-07-06 12:04:29             1
## 779      Jersey 2016-05-19 06:37:38             1
## 953  Luxembourg 2016-05-30 20:08:51             1
# Dealing with outliers
# Lower limit of the 1.5 * IQR rule; 47032 is the first quartile of Area.Income (rounded).
lower_limit <- 47032 - 1.5 * IQR(advert$Area.Income)
advert$Area.Income[advert$Area.Income < lower_limit]<- lower_limit
# Plotting a boxplot
boxplot(advert$Area.Income, main="Boxplot of Area.Income")

From the second boxplot, we can see that no outliers remain in the Area.Income column.
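
The capping step follows the usual 1.5 × IQR rule. Below is a minimal sketch of that rule on a made-up vector, computing the first quartile from the data instead of typing it in. It assumes quartiles from quantile() (default type 7), which can differ slightly from the hinges that boxplot() uses. The helper name cap_low_outliers and the sample values are hypothetical.

```r
# Hypothetical helper: cap values below Q1 - 1.5 * IQR at that lower fence.
cap_low_outliers <- function(x) {
  q1 <- quantile(x, 0.25, names = FALSE)  # first quartile (type 7 by default)
  lower_fence <- q1 - 1.5 * IQR(x)        # lower fence of the 1.5 * IQR rule
  x[x < lower_fence] <- lower_fence       # replace low outliers with the fence value
  x
}

# Made-up incomes with one unusually low value.
income <- c(1000, 52000, 55000, 58000, 60000, 61000, 63000, 65000, 70000)
capped <- cap_low_outliers(income)
print(capped)
```

Capping keeps all rows in the dataset, at the cost of replacing the extreme values with an artificial one; dropping the rows is the alternative when the extreme values are believed to be errors.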

3.2 Dealing with Missing Values
# Calculating the total number of missing values in our dataset.
sum(is.na(advert))
## [1] 0

The dataset has no missing values.
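
A total of zero settles the question here. When the total is not zero, it helps to see which columns and rows are affected. A small sketch on a made-up data frame (the toy data is an assumption, not part of the advertising data):

```r
# Made-up data frame with two missing values.
toy <- data.frame(
  age = c(35, NA, 26),
  city = c("Wrightburgh", "West Jodi", NA),
  income = c(61833.90, 68441.85, 59785.94)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Row numbers that contain at least one missing value.
rows_with_na <- which(!complete.cases(toy))
print(rows_with_na)
```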

3.3 Dealing with Duplicates
sum(duplicated(advert))
## [1] 0

The dataset has no duplicated rows.

3.4 Changing the datatype of timestamp
advert$Timestamp <- strptime(advert$Timestamp, "%Y-%m-%d %H:%M:%S")
# Checking the class of the converted column, then the structure of the data frame.
class(advert$Timestamp)
## [1] "POSIXlt" "POSIXt"
str(advert)
## 'data.frame':    1000 obs. of  10 variables:
##  $ Daily.Time.Spent.on.Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area.Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily.Internet.Usage    : num  256 194 236 246 226 ...
##  $ Ad.Topic.Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : chr  "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
##  $ Male                    : int  0 1 0 1 0 1 0 1 1 1 ...
##  $ Country                 : chr  "Tunisia" "Nauru" "San Marino" "Italy" ...
##  $ Timestamp               : POSIXlt, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
##  $ Clicked.on.Ad           : int  0 0 0 0 0 0 0 1 0 0 ...

The data type of timestamp has been converted from character to “POSIXlt” “POSIXt”.

3.5 Ensuring Uniformity of Column Names

# Converting the column names to lower case.
names(advert) <- tolower(names(advert))
# Removing (.) in column names and replacing them with (_).
names(advert) <- gsub("\\.", "_", names(advert))
# Previewing the cleaned dataset
head(advert)
##   daily_time_spent_on_site age area_income daily_internet_usage
## 1                    68.95  35    61833.90               256.09
## 2                    80.23  31    68441.85               193.77
## 3                    69.47  26    59785.94               236.50
## 4                    74.15  29    54806.18               245.89
## 5                    68.37  35    73889.99               225.58
## 6                    59.99  23    59761.56               226.74
##                           ad_topic_line           city male    country
## 1    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2    Monitored national standardization      West Jodi    1      Nauru
## 3      Organic bottom-line service-desk       Davidton    0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5         Robust logistical utilization   South Manuel    0    Iceland
## 6       Sharable client-driven software      Jamieberg    1     Norway
##             timestamp clicked_on_ad
## 1 2016-03-27 00:53:11             0
## 2 2016-04-04 01:39:02             0
## 3 2016-03-13 20:35:42             0
## 4 2016-01-10 02:31:19             0
## 5 2016-06-03 03:36:18             0
## 6 2016-05-19 14:30:17             0

4. Exploratory Data Analysis

4.1 Univariate Analysis
# Renaming the male column to gender
names(advert)[names(advert) == "male"] <- "gender"
# Checking the statistical summary of each field in our dataframe.
summary(advert)
##  daily_time_spent_on_site      age         area_income    daily_internet_usage
##  Min.   :32.60            Min.   :19.00   Min.   :19374   Min.   :104.8       
##  1st Qu.:51.36            1st Qu.:29.00   1st Qu.:47032   1st Qu.:138.8       
##  Median :68.22            Median :35.00   Median :57012   Median :183.1       
##  Mean   :65.00            Mean   :36.01   Mean   :55025   Mean   :180.0       
##  3rd Qu.:78.55            3rd Qu.:42.00   3rd Qu.:65471   3rd Qu.:218.8       
##  Max.   :91.43            Max.   :61.00   Max.   :79485   Max.   :270.0       
##  ad_topic_line          city               gender        country         
##  Length:1000        Length:1000        Min.   :0.000   Length:1000       
##  Class :character   Class :character   1st Qu.:0.000   Class :character  
##  Mode  :character   Mode  :character   Median :0.000   Mode  :character  
##                                        Mean   :0.481                     
##                                        3rd Qu.:1.000                     
##                                        Max.   :1.000                     
##    timestamp                      clicked_on_ad
##  Min.   :2016-01-01 02:52:10.00   Min.   :0.0  
##  1st Qu.:2016-02-18 02:55:42.00   1st Qu.:0.0  
##  Median :2016-04-07 17:27:29.50   Median :0.5  
##  Mean   :2016-04-10 10:34:06.64   Mean   :0.5  
##  3rd Qu.:2016-05-31 03:18:14.00   3rd Qu.:1.0  
##  Max.   :2016-07-24 00:22:16.00   Max.   :1.0

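From the statistical summary, we can see the minimum, quartiles, median, mean and maximum of the numerical columns. For the character columns, summary() reports only the length, class and storage mode; the "Mode" shown there is not the statistical mode.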

# Getting the standard deviation of some numerical variables.
sapply(advert[,c("daily_time_spent_on_site", "age", "area_income", "daily_internet_usage")], sd)
## daily_time_spent_on_site                      age              area_income 
##                15.853615                 8.785562             13343.223865 
##     daily_internet_usage 
##                43.902339
# Comparing the standard deviation and mean of daily_time_spent_on_site
print(sd(advert$daily_time_spent_on_site))
## [1] 15.85361
print(mean(advert$daily_time_spent_on_site))
## [1] 65.0002
# Comparing the standard deviation and mean of area_income
print(sd(advert$area_income))
## [1] 13343.22
print(mean(advert$area_income))
## [1] 55025.32
# Comparing the standard deviation and mean of age
print(sd(advert$age))
## [1] 8.785562
print(mean(advert$age))
## [1] 36.009

For these variables, the standard deviation is roughly 24% of the mean in each case, which indicates a moderate spread of the data points around their means. Spread alone does not tell us whether a distribution is normal; the skewness and kurtosis computed below are better indicators of shape.
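
One way to put a standard deviation in context is to divide it by the mean, which gives the coefficient of variation. The sketch below does that arithmetic with the rounded figures printed above; the vector names are illustrative.

```r
# Standard deviations and means as printed above (rounded).
sds <- c(daily_time_spent_on_site = 15.85361, area_income = 13343.22, age = 8.785562)
means <- c(daily_time_spent_on_site = 65.0002, area_income = 55025.32, age = 36.009)

# Coefficient of variation: spread expressed as a fraction of the mean.
cv <- sds / means
print(round(cv, 3))
```

Each ratio comes out near 0.24, so the three variables have a similar relative spread even though their units differ widely.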

# Getting the variance of some numerical variables.
sapply(advert[,c("daily_time_spent_on_site", "age", "area_income", "daily_internet_usage")], var)
## daily_time_spent_on_site                      age              area_income 
##             2.513371e+02             7.718611e+01             1.780416e+08 
##     daily_internet_usage 
##             1.927415e+03

The variance is the square of the standard deviation, so it carries the same information about spread, expressed in squared units.

# Getting the interquartile range of some numerical variables.
sapply(advert[,c("daily_time_spent_on_site", "age", "area_income", "daily_internet_usage")], IQR)
## daily_time_spent_on_site                      age              area_income 
##                  27.1875                  13.0000               18438.8325 
##     daily_internet_usage 
##                  79.9625

The interquartile range gives the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile).
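
A minimal sketch of that definition, checked against R's built-in IQR(). The five values are made up for illustration.

```r
# Illustrative values only.
x <- c(19, 29, 35, 42, 61)

# Lower and upper quartiles (quantile() uses type 7 by default, as IQR() does).
q <- quantile(x, c(0.25, 0.75), names = FALSE)

# Interquartile range as upper quartile minus lower quartile.
manual_iqr <- q[2] - q[1]
print(manual_iqr)
print(IQR(x))
```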

# Getting the skewness of some numerical variables.
library(moments)
skew_val <- apply(advert[,c("daily_time_spent_on_site", "age", "area_income", "daily_internet_usage")], 2, skewness)
print(skew_val)
## daily_time_spent_on_site                      age              area_income 
##              -0.37120261               0.47842268              -0.62004808 
##     daily_internet_usage 
##              -0.03348703

From the skewness values, only area_income (−0.62) falls in the moderately skewed range between −1 and −½. daily_time_spent_on_site (−0.37) and age (0.48) have absolute skewness below ½, so they are only mildly skewed (to the left and to the right respectively), while daily_internet_usage (−0.03) is approximately symmetric.
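
For reference, the moment-based skewness is the third central moment divided by the second central moment raised to the power 3/2, which to my understanding is the quantity moments::skewness() reports. A base-R sketch under that definition; the function name skew_manual and the sample vectors are made up for illustration.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2).
skew_manual <- function(x) {
  d <- x - mean(x)   # deviations from the mean
  m2 <- mean(d^2)    # second central moment
  m3 <- mean(d^3)    # third central moment
  m3 / m2^(3/2)
}

symmetric <- c(1, 2, 3, 4, 5)         # mirror image around 3
right_tailed <- c(1, 1, 2, 2, 3, 10)  # one large value stretches the right tail

print(skew_manual(symmetric))
print(skew_manual(right_tailed))
```

A symmetric sample gives a skewness of zero, and a long right tail gives a positive value; a long left tail would give a negative one.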

# Getting the kurtosis of some numerical variables.
kurt_val <- apply(advert[,c("daily_time_spent_on_site", "age", "area_income", "daily_internet_usage")], 2, kurtosis)
print(kurt_val)
## daily_time_spent_on_site                      age              area_income 
##                 1.903942                 2.595482                 2.789881 
##     daily_internet_usage 
##                 1.727701

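The coefficient of kurtosis is below 3 for every variable. On this scale (Pearson's kurtosis, as reported by the moments package) a normal distribution scores 3, so all four distributions are platykurtic: flatter and lighter-tailed than a normal distribution.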
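
As a rough illustration of the scale, Pearson's kurtosis is the fourth central moment divided by the squared second central moment. The base-R sketch below applies that definition to evenly spread values, which behave like a uniform distribution (theoretical kurtosis 1.8). The function name kurt_manual and the sample are hypothetical.

```r
# Hypothetical helper: Pearson's kurtosis, m4 / m2^2.
kurt_manual <- function(x) {
  d <- x - mean(x)           # deviations from the mean
  mean(d^4) / mean(d^2)^2    # fourth central moment over squared variance
}

flat <- 1:100  # evenly spread values, close to a uniform distribution

kurt <- kurt_manual(flat)
excess <- kurt - 3  # excess kurtosis: negative means platykurtic
print(kurt)
print(excess)
```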

# load library ggplot2
library(ggplot2)
# Plotting the histogram of Age Distribution
p<-ggplot(advert, aes(x=age)) + 
  geom_histogram(color="black", fill="light blue")+
  ggtitle("Histogram for Age Distribution")
p 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the histogram above, most of the people in the dataset are between 25 and 50 years old. Since the histogram counts every visitor and not only those who clicked, it suggests that the bulk of the audience, rather than the clickers specifically, falls in this age group.
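
A histogram of age shows how many people fall in each bracket; relating age to clicks needs the click rate per bracket. A minimal sketch on made-up data (the toy values and the bracket boundaries are assumptions chosen for illustration, not results from the advertising data):

```r
# Made-up ages and click flags.
toy <- data.frame(
  age = c(22, 24, 31, 38, 45, 47, 52, 58),
  clicked_on_ad = c(0, 0, 0, 1, 1, 1, 1, 1)
)

# Group ages into brackets: (18,25], (25,50], (50,65].
toy$age_band <- cut(toy$age, breaks = c(18, 25, 50, 65))

# Share of people in each bracket who clicked.
click_rate <- tapply(toy$clicked_on_ad, toy$age_band, mean)
print(click_rate)
```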

# Counting the number of male and female 
ggplot(advert, aes(x = gender)) +
  geom_bar(fill="light blue", color="black")+
  ggtitle("Bar Graph for The count of Male and Female")

From the bar plot, we see that the audience has slightly more females than males: the mean of gender is 0.481, so about 48% of the records are male.
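
The same comparison can be read off as counts and shares. The sketch below uses a made-up 0/1 vector and assumes that 1 stands for male, as the original column name Male suggests.

```r
# Made-up gender flags; 1 = male is an assumption based on the column name.
gender <- c(0, 1, 0, 1, 0, 0, 1, 0, 1, 0)

# Attach readable labels to the 0/1 codes.
gender_label <- factor(gender, levels = c(0, 1), labels = c("Female", "Male"))

# Counts and proportions per category.
counts <- table(gender_label)
shares <- prop.table(counts)
print(counts)
print(shares)
```

Labelling the codes as a factor also gives a bar plot a readable axis instead of the numeric values 0 and 1.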

# Ranking the countries according to count
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
advert %>%
  count(country, sort=TRUE)
##                                                 country n
## 1                                        Czech Republic 9
## 2                                                France 9
## 3                                           Afghanistan 8
## 4                                             Australia 8
## 5                                                Cyprus 8
## 6                                                Greece 8
## 7                                               Liberia 8
## 8                                            Micronesia 8
## 9                                                  Peru 8
## 10                                              Senegal 8
## 11                                         South Africa 8
## 12                                               Turkey 8
## 13                                              Albania 7
## 14                                              Bahamas 7
## 15                               Bosnia and Herzegovina 7
## 16                                              Burundi 7
## 17                                             Cambodia 7
## 18                                              Eritrea 7
## 19                                             Ethiopia 7
## 20                                                 Fiji 7
## 21                                           Luxembourg 7
## 22                                               Taiwan 7
## 23                                            Venezuela 7
## 24                                       Western Sahara 7
## 25                                              Algeria 6
## 26                                             Anguilla 6
## 27                                              Belarus 6
## 28                                              Bolivia 6
## 29                                             Bulgaria 6
## 30                                                China 6
## 31                                     Christmas Island 6
## 32                                           Costa Rica 6
## 33                                              Croatia 6
## 34                                          El Salvador 6
## 35                                                Gabon 6
## 36                                            Hong Kong 6
## 37                                              Hungary 6
## 38                                            Indonesia 6
## 39                                               Jersey 6
## 40                                      Kyrgyz Republic 6
## 41                                              Lebanon 6
## 42                                        Liechtenstein 6
## 43                                           Madagascar 6
## 44                                                Malta 6
## 45                                              Mayotte 6
## 46                                               Mexico 6
## 47                                              Moldova 6
## 48                                             Mongolia 6
## 49                                 Netherlands Antilles 6
## 50                                          Philippines 6
## 51                                               Poland 6
## 52                                          Puerto Rico 6
## 53                                                Qatar 6
## 54                     Saint Vincent and the Grenadines 6
## 55                                                Samoa 6
## 56                                            Singapore 6
## 57                         Svalbard & Jan Mayen Islands 6
## 58                                         Turkmenistan 6
## 59                                 United Arab Emirates 6
## 60                                              Vanuatu 6
## 61                                             Zimbabwe 6
## 62                                       American Samoa 5
## 63                                  Antigua and Barbuda 5
## 64                                              Austria 5
## 65                                              Bahrain 5
## 66                                             Barbados 5
## 67                                              Belgium 5
## 68                                               Belize 5
## 69                            Bouvet Island (Bouvetoya) 5
## 70                                               Brazil 5
## 71                                    Brunei Darussalam 5
## 72                                             Cameroon 5
## 73                                               Canada 5
## 74                                       Cayman Islands 5
## 75                                                 Cuba 5
## 76                                             Dominica 5
## 77                                              Ecuador 5
## 78                                                Egypt 5
## 79                                              Finland 5
## 80                                     French Polynesia 5
## 81                          French Southern Territories 5
## 82                                            Greenland 5
## 83                                               Guyana 5
## 84                                             Honduras 5
## 85                                                 Iran 5
## 86                                                Italy 5
## 87                                              Jamaica 5
## 88                                                Korea 5
## 89                                              Myanmar 5
## 90                                       Norfolk Island 5
## 91                                             Pakistan 5
## 92                                     Papua New Guinea 5
## 93                                               Rwanda 5
## 94                                         Saint Helena 5
## 95                            Saint Pierre and Miquelon 5
## 96                                               Serbia 5
## 97                                              Somalia 5
## 98                                          Timor-Leste 5
## 99                                                Tonga 5
## 100                            Turks and Caicos Islands 5
## 101                                             Ukraine 5
## 102                            United States of America 5
## 103                                             Uruguay 5
## 104                                              Angola 4
## 105                                          Bangladesh 4
## 106                                        Burkina Faso 4
## 107                                                Chad 4
## 108                                               Chile 4
## 109                                               Congo 4
## 110                                       Cote d'Ivoire 4
## 111                                  Dominican Republic 4
## 112                                   Equatorial Guinea 4
## 113                         Falkland Islands (Malvinas) 4
## 114                                       French Guiana 4
## 115                                             Georgia 4
## 116                                               Ghana 4
## 117                                             Grenada 4
## 118                                                Guam 4
## 119                                           Guatemala 4
## 120                                              Israel 4
## 121                                               Japan 4
## 122                                          Kazakhstan 4
## 123                                               Kenya 4
## 124                    Lao People's Democratic Republic 4
## 125                                              Latvia 4
## 126                              Libyan Arab Jamahiriya 4
## 127                                              Malawi 4
## 128                                            Maldives 4
## 129                                                Mali 4
## 130                                          Martinique 4
## 131                                           Mauritius 4
## 132                                         Netherlands 4
## 133                                         New Zealand 4
## 134                                               Palau 4
## 135                                        Saint Martin 4
## 136                                        Saudi Arabia 4
## 137                                           Sri Lanka 4
## 138                                              Sweden 4
## 139                                         Switzerland 4
## 140                                            Thailand 4
## 141                                             Tokelau 4
## 142                                             Tunisia 4
## 143                                              Tuvalu 4
## 144                                              Uganda 4
## 145                United States Minor Outlying Islands 4
## 146                        United States Virgin Islands 4
## 147                                   Wallis and Futuna 4
## 148                                              Zambia 4
## 149        Antarctica (the territory South of 60 deg S) 3
## 150                                             Armenia 3
## 151                                          Azerbaijan 3
## 152                              British Virgin Islands 3
## 153                                        Cook Islands 3
## 154                                             Denmark 3
## 155                                             Estonia 3
## 156                                       Faroe Islands 3
## 157                                           Gibraltar 3
## 158                                            Guernsey 3
## 159                                              Guinea 3
## 160                   Heard Island and McDonald Islands 3
## 161                       Holy See (Vatican City State) 3
## 162                                             Iceland 3
## 163                                             Ireland 3
## 164                                         Isle of Man 3
## 165                                           Lithuania 3
## 166                                               Macao 3
## 167                                            Malaysia 3
## 168                                              Monaco 3
## 169                                             Morocco 3
## 170                                               Nauru 3
## 171                                               Nepal 3
## 172                                           Nicaragua 3
## 173                                               Niger 3
## 174                                                Niue 3
## 175                            Northern Mariana Islands 3
## 176                               Palestinian Territory 3
## 177                                            Paraguay 3
## 178                                            Portugal 3
## 179                                  Russian Federation 3
## 180                                          San Marino 3
## 181                                          Seychelles 3
## 182                                               Spain 3
## 183                                Syrian Arab Republic 3
## 184                                          Tajikistan 3
## 185                                            Tanzania 3
## 186                                                Togo 3
## 187                                 Trinidad and Tobago 3
## 188                                      United Kingdom 3
## 189                                             Vietnam 3
## 190                                               Yemen 3
## 191                                             Andorra 2
## 192                                           Argentina 2
## 193                                               Benin 2
## 194                                              Bhutan 2
## 195                            Central African Republic 2
## 196                                            Colombia 2
## 197                                             Comoros 2
## 198                                            Djibouti 2
## 199                                              Gambia 2
## 200                                          Guadeloupe 2
## 201                                       Guinea-Bissau 2
## 202                                               Haiti 2
## 203                                               India 2
## 204                                              Kuwait 2
## 205                                           Macedonia 2
## 206                                          Mauritania 2
## 207                                          Montenegro 2
## 208                                             Namibia 2
## 209                                       New Caledonia 2
## 210                                              Norway 2
## 211                                              Panama 2
## 212                                    Pitcairn Islands 2
## 213                                             Reunion 2
## 214                                    Saint Barthelemy 2
## 215                                         Saint Lucia 2
## 216                               Sao Tome and Principe 2
## 217                                        Sierra Leone 2
## 218                          Slovakia (Slovak Republic) 2
## 219        South Georgia and the South Sandwich Islands 2
## 220                                               Sudan 2
## 221                                            Suriname 2
## 222                                           Swaziland 2
## 223                                          Uzbekistan 2
## 224                                               Aruba 1
## 225                                             Bermuda 1
## 226 British Indian Ocean Territory (Chagos Archipelago) 1
## 227                                          Cape Verde 1
## 228                                             Germany 1
## 229                                              Jordan 1
## 230                                            Kiribati 1
## 231                                             Lesotho 1
## 232                                    Marshall Islands 1
## 233                                          Montserrat 1
## 234                                          Mozambique 1
## 235                                             Romania 1
## 236                               Saint Kitts and Nevis 1
## 237                                            Slovenia 1

From the list, the top countries with the highest count are: Czech Republic, France, Afghanistan, Australia, Cyprus, Greece, Liberia, Micronesia, Peru, Senegal.

4.2 Bivariate Analysis
# Separating the date and time from the timestamp column.
advert$date <- as.Date(advert$timestamp)
advert$time <- format(advert$timestamp,"%H:%M:%S")
# Extracting month from the date
advert$month <- format(advert$date, "%m") 
# Previewing the dataset
head(advert)
##   daily_time_spent_on_site age area_income daily_internet_usage
## 1                    68.95  35    61833.90               256.09
## 2                    80.23  31    68441.85               193.77
## 3                    69.47  26    59785.94               236.50
## 4                    74.15  29    54806.18               245.89
## 5                    68.37  35    73889.99               225.58
## 6                    59.99  23    59761.56               226.74
##                           ad_topic_line           city gender    country
## 1    Cloned 5thgeneration orchestration    Wrightburgh      0    Tunisia
## 2    Monitored national standardization      West Jodi      1      Nauru
## 3      Organic bottom-line service-desk       Davidton      0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt      1      Italy
## 5         Robust logistical utilization   South Manuel      0    Iceland
## 6       Sharable client-driven software      Jamieberg      1     Norway
##             timestamp clicked_on_ad       date     time month
## 1 2016-03-27 00:53:11             0 2016-03-27 00:53:11    03
## 2 2016-04-04 01:39:02             0 2016-04-04 01:39:02    04
## 3 2016-03-13 20:35:42             0 2016-03-13 20:35:42    03
## 4 2016-01-10 02:31:19             0 2016-01-10 02:31:19    01
## 5 2016-06-03 03:36:18             0 2016-06-03 03:36:18    06
## 6 2016-05-19 14:30:17             0 2016-05-19 14:30:17    05
library(ggplot2)
ggplot(advert, aes(x = gender, fill=clicked_on_ad)) +
  geom_bar(color="black")+
  ggtitle("Bar Graph for The count of Male and Female Vs. their Clicked_on_ad Status")

From the graph, we see that more females than males clicked on the ad. Females also slightly outnumber males in the dataset.

ggplot(advert, aes(x = month, fill=clicked_on_ad)) +
  geom_bar(color="black")+
  ggtitle("Bar Graph for The count of month Vs. Clicked_on_ad")

From the bar graph, we can see that most clicks occurred in February, followed closely by May. The month with the highest overall count of records is also February, followed closely by March.
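
The counts behind bar graphs like these can also be read off numerically with a cross-tabulation. The following is a minimal sketch on a small made-up data frame, `toy` (hypothetical values, not the advertising data), that only borrows the `month` and `clicked_on_ad` column names:

```r
# Hypothetical toy data; only the column names mirror the advert data frame.
toy <- data.frame(
  month = c("01", "02", "02", "03", "02", "01"),
  clicked_on_ad = c(0, 1, 1, 0, 0, 1)
)

# Rows are months, columns are the click status (0 = no click, 1 = click).
counts <- table(toy$month, toy$clicked_on_ad)
counts

# Share of each click status within a month (each row sums to 1).
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

Reading the shares alongside the raw counts helps separate "a month with many records" from "a month with a high click rate".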

# Plotting a histogram of clicked_on_ad and daily_time_spent_on_site 
advert$clicked_on_ad <- recode_factor(advert$clicked_on_ad, '0' = 'No', '1' = 'Yes')
# Use clicked_on_ad as the faceting variable
ggplot(advert, aes(x = daily_time_spent_on_site)) +
  geom_histogram(fill = "light blue", colour = "black") +
  ggtitle("Histogram of Clicked_on_ad and Daily_time_spent_on_site")+
  facet_grid(clicked_on_ad ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the graph, we see that the people who spend the most time on the site tend not to click on the ad.
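
A faceted histogram can be backed up with a per-group summary. As a rough sketch, assuming a made-up data frame `toy` (hypothetical values, not the advertising data) with the same two column names, the mean and median time on site per click status could be computed like this:

```r
# Hypothetical toy data (not the advertising data).
toy <- data.frame(
  daily_time_spent_on_site = c(80, 75, 70, 45, 50, 40),
  clicked_on_ad = factor(c("No", "No", "No", "Yes", "Yes", "Yes"))
)

# Mean and median time on site for each click status.
mean_time <- tapply(toy$daily_time_spent_on_site, toy$clicked_on_ad, mean)
median_time <- tapply(toy$daily_time_spent_on_site, toy$clicked_on_ad, median)
mean_time
median_time
```

A clearly lower mean and median in one group would support the same reading as the histogram.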

advert$gender <- recode_factor(advert$gender, '0' = 'Female', '1' = 'Male')
# Use gender as the faceting variable
ggplot(advert, aes(x = daily_time_spent_on_site)) +
  geom_histogram(fill = "light blue", colour = "black") +
  ggtitle("Histogram of Gender and Daily_time_spent_on_site")+
  facet_grid(gender ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the graph, we see that females tend to spend more time on the site than males.

# Plotting a jitter plot of daily_time_spent_on_site, age, and gender 
ggplot(advert, aes(x = daily_time_spent_on_site, y = age, color = gender)) +
  ggtitle("Jitterplot of daily_time_spent_on_site, age, and gender")+
  geom_jitter(width = .2) 

From the jitterplot, we can deduce that most people spend time on the site.

# Use clicked_on_ad as the faceting variable
ggplot(advert, aes(x = daily_internet_usage)) +
  geom_histogram(fill = "light blue", colour = "black") +
  ggtitle("Histogram of daily_internet_usage and clicked_on_ad")+
  facet_grid(clicked_on_ad ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the graphs, we see that people with higher daily internet usage tend not to click on the ad.

# Plotting a jitter plot of daily_time_spent_on_site, area_income, and clicked_on_ad.
ggplot(advert, aes(x = daily_time_spent_on_site, y = area_income, color = clicked_on_ad)) +
  geom_jitter(width = .2)+
  ggtitle("Jitter plot of daily_time_spent_on_site, area_income and clicked_on_ad")

From the jitter plot, we see that most people with a higher area income who also spend the most time on the site tend not to click on the ad.

# Creating a copy of the dataset
adverts <-advert
head(adverts)
##   daily_time_spent_on_site age area_income daily_internet_usage
## 1                    68.95  35    61833.90               256.09
## 2                    80.23  31    68441.85               193.77
## 3                    69.47  26    59785.94               236.50
## 4                    74.15  29    54806.18               245.89
## 5                    68.37  35    73889.99               225.58
## 6                    59.99  23    59761.56               226.74
##                           ad_topic_line           city gender    country
## 1    Cloned 5thgeneration orchestration    Wrightburgh Female    Tunisia
## 2    Monitored national standardization      West Jodi   Male      Nauru
## 3      Organic bottom-line service-desk       Davidton Female San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt   Male      Italy
## 5         Robust logistical utilization   South Manuel Female    Iceland
## 6       Sharable client-driven software      Jamieberg   Male     Norway
##             timestamp clicked_on_ad       date     time month
## 1 2016-03-27 00:53:11            No 2016-03-27 00:53:11    03
## 2 2016-04-04 01:39:02            No 2016-04-04 01:39:02    04
## 3 2016-03-13 20:35:42            No 2016-03-13 20:35:42    03
## 4 2016-01-10 02:31:19            No 2016-01-10 02:31:19    01
## 5 2016-06-03 03:36:18            No 2016-06-03 03:36:18    06
## 6 2016-05-19 14:30:17            No 2016-05-19 14:30:17    05
# Filtering the top ten records as per area_income
adverts <- adverts %>% slice_max(area_income, n = 10)
# Creating a jitter plot of daily_time_spent_on_site,area_income, and clicked_on_ad
ggplot(adverts, aes(x = daily_time_spent_on_site, y = area_income, color = clicked_on_ad )) +
  geom_jitter(width = .2)+
  ggtitle("Jitter plot of daily_time_spent_on_site,area_income, and clicked_on_ad")

From the graph, we see that, among the ten records with the highest area_income, those who spend the most time on the site tend not to click on the ad.

#Jitter plot of daily_time_spent_on_site,gender, and clicked_on_ad
ggplot(adverts, aes(x = clicked_on_ad, y = daily_time_spent_on_site , color = gender )) +
  geom_jitter(width = .2)+
  ggtitle("Jitter plot of daily_time_spent_on_site,gender, and clicked_on_ad")

From the graph, we see that people who spend the most time on the site tend not to click on the ad, and that this group is a mix of both males and females.

# Line Plot of daily_time_spent_on_site and date
adverts %>%
  ggplot( aes(x=date, y=daily_time_spent_on_site)) +
  geom_line() +
  ggtitle("Line Plot of daily_time_spent_on_site and date")+
  geom_point()

From the line graph, we can see the trend of daily_time_spent_on_site over time. It started high in February, dipped from April, rose again towards June, dipped in June, and then spiked towards July.
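
A trend over months is easier to compare when the values are first averaged per month. The sketch below uses a made-up data frame `toy` (hypothetical dates and values, not the advertising data) and the same `format(date, "%m")` step used earlier to extract the month:

```r
# Hypothetical toy data (not the advertising data).
toy <- data.frame(
  date = as.Date(c("2016-02-01", "2016-02-15", "2016-03-10",
                   "2016-04-05", "2016-04-20")),
  daily_time_spent_on_site = c(80, 70, 60, 50, 40)
)

# Extract the month as a two-digit string, as done for advert$month.
toy$month <- format(toy$date, "%m")

# Average time on site per month.
monthly <- aggregate(daily_time_spent_on_site ~ month, data = toy, FUN = mean)
monthly
```

Plotting `monthly` rather than individual records would smooth out day-to-day noise in the line graph.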

library(corrplot)
## corrplot 0.92 loaded
advert_num <- Filter(is.numeric, advert)
corrplot(cor(advert_num))

# correlation for all numerical variables
round(cor(advert_num),
  digits = 2 # rounded to 2 decimals
)
##                          daily_time_spent_on_site   age area_income
## daily_time_spent_on_site                     1.00 -0.33        0.31
## age                                         -0.33  1.00       -0.18
## area_income                                  0.31 -0.18        1.00
## daily_internet_usage                         0.52 -0.37        0.34
##                          daily_internet_usage
## daily_time_spent_on_site                 0.52
## age                                     -0.37
## area_income                              0.34
## daily_internet_usage                     1.00

A negative correlation implies that the two variables under consideration vary in opposite directions, that is, if a variable increases the other decreases and vice versa. On the other hand, a positive correlation implies that the two variables under consideration vary in the same direction, i.e., if a variable increases the other one increases and if one decreases the other one decreases as well.

The more extreme the correlation coefficient (the closer to -1 or 1), the stronger the linear relationship. A correlation close to 0 indicates that there is no linear relationship between the two variables, that is, as one variable increases, there is no linear tendency in the other variable to either decrease or increase. It does not by itself mean that the variables are independent, because a non-linear relationship can still exist.
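
One caveat is worth checking numerically: the Pearson correlation computed by `cor()` measures linear association only. In the minimal sketch below (made-up vectors, not the advertising data), `y` is completely determined by `x`, yet the correlation is essentially zero because the relationship is symmetric rather than linear:

```r
# Made-up vectors for illustration only.
x <- seq(-3, 3, by = 0.5)

# y depends entirely on x, but not linearly.
y <- x^2
r_quadratic <- cor(x, y)
round(r_quadratic, 2)

# For comparison, a perfectly linear relationship.
r_linear <- cor(x, 2 * x + 1)
round(r_linear, 2)
```

This is why a near-zero entry in the correlation matrix is best followed up with a scatter plot before concluding that two variables are unrelated.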

5. Modelling

5.1 Using Support Vector Machine (SVM)

# Creating a copy of our dataset
advert_copy <- advert
# Dropping the columns that we do not need for modelling.
drop <- c("timestamp", "ad_topic_line","city","country","time","month")
advert_copy = advert_copy[,!(names(advert_copy) %in% drop)] 
# Mapping clicked_on_ad from 0 to 'No' and from 1 to 'Yes' (the same recoding applied in section 4.2)
advert$clicked_on_ad <- recode_factor(advert$clicked_on_ad, '0' = 'No', '1' = 'Yes')
library(caret)
## Loading required package: lattice
# One hot encoding our categorical variables
advert_dmy <- dummyVars(" ~ .", data = advert_copy, fullRank = T)
dat_transformed <- data.frame(predict(advert_dmy, newdata = advert_copy))

Splitting the dataset into the Training set and Test set

# Splitting the dataset into the Training set and Test set
library(caTools)
 
set.seed(123)
split = sample.split(dat_transformed$clicked_on_ad, SplitRatio = 0.75)
 
training_set = subset(dat_transformed, split == TRUE)
test_set = subset(dat_transformed, split == FALSE)
# Checking the first six records of the train set
head(training_set)
##    daily_time_spent_on_site age area_income daily_internet_usage gender.Male
## 1                     68.95  35    61833.90               256.09           0
## 3                     69.47  26    59785.94               236.50           0
## 6                     59.99  23    59761.56               226.74           1
## 7                     88.91  33    53852.85               208.36           0
## 8                     66.00  48    24593.33               131.76           1
## 10                    69.88  20    55642.32               183.82           1
##    clicked_on_ad.Yes  date
## 1                  0 16887
## 3                  0 16873
## 6                  0 16940
## 7                  0 16828
## 8                  1 16867
## 10                 0 16993
# Fitting SVM to the Training set
library(e1071)
## 
## Attaching package: 'e1071'
## The following objects are masked from 'package:moments':
## 
##     kurtosis, moment, skewness
classifier = svm(formula = clicked_on_ad.Yes ~ .,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'linear')
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set)
y_pred
##   2   4   5   9  14  23  26  29  36  38  39  43  46  56  58  59  62  65  66  68 
##   0   0   0   0   0   1   0   1   0   0   1   0   1   0   1   0   0   1   0   1 
##  73  79  84  88  91  94  95 103 106 108 116 120 121 124 130 133 136 138 139 140 
##   1   1   1   1   1   1   1   0   0   1   0   1   0   1   0   1   1   1   0   0 
## 141 145 146 157 158 160 166 172 175 176 178 180 182 187 191 201 206 210 213 219 
##   0   0   1   1   1   0   1   0   1   0   0   0   0   1   1   0   1   1   0   1 
## 221 222 223 224 227 231 232 233 234 241 243 248 249 251 258 262 266 267 270 278 
##   0   0   1   1   1   0   1   1   0   1   0   0   1   0   1   1   1   1   0   0 
## 280 288 289 292 294 300 302 303 310 312 313 327 337 346 347 354 357 360 366 371 
##   0   0   1   0   0   0   1   1   1   0   1   1   0   0   0   0   1   0   1   1 
## 372 375 376 381 384 394 402 409 414 432 434 437 444 452 454 457 459 462 467 469 
##   1   0   0   0   0   0   0   1   1   0   0   0   1   1   0   1   1   1   1   1 
## 473 476 478 479 485 493 495 496 501 510 517 518 523 525 528 538 542 543 545 550 
##   0   0   1   1   1   0   1   0   1   0   0   1   0   0   0   0   0   0   0   0 
## 554 555 561 572 573 575 577 578 579 583 587 589 591 592 594 601 619 622 624 630 
##   1   1   1   0   0   1   1   0   0   1   0   0   1   1   0   1   1   0   0   0 
## 631 637 639 643 649 652 653 660 664 676 680 687 691 698 701 706 707 717 721 726 
##   0   1   1   0   0   0   0   0   1   0   1   0   0   0   1   0   1   1   0   0 
## 727 730 732 736 738 740 745 748 753 756 769 773 775 778 781 783 786 792 800 804 
##   0   0   0   0   1   0   1   1   0   0   1   0   1   0   0   0   1   1   0   1 
## 808 809 815 818 821 827 828 832 833 834 840 845 846 848 854 855 861 870 873 884 
##   1   1   0   1   1   0   1   1   1   1   1   0   1   0   0   0   0   0   0   1 
## 886 888 889 891 892 894 897 899 900 902 920 924 927 929 930 943 949 951 953 958 
##   1   1   0   0   1   0   0   1   1   1   0   1   0   0   1   1   0   1   1   0 
## 961 962 972 974 975 976 979 987 995 999 
##   1   0   1   0   1   1   0   0   1   1 
## Levels: 0 1
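# Making the Confusion Matrix
# Rows: column 5 of test_set; columns: predicted class.
cm = table(test_set[, 5], y_pred)
cm
##    y_pred
##      0  1
##   0 66 74
##   1 65 45

Column 5 of test_set is gender.Male (clicked_on_ad.Yes is column 6, as the head() output of the training set shows), so this table cross-tabulates gender against the predicted class rather than actual against predicted clicks. Its row totals, 140 and 110, are the numbers of females and males in the test set, and the cells cannot be read as correct or incorrect predictions. To evaluate the model, the table should be built from the label column, for example table(test_set$clicked_on_ad.Yes, y_pred); in that table the diagonal cells count the correct predictions (0 = did not click, 1 = clicked).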
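
A confusion matrix is easiest to read when the actual labels are selected by name and both dimensions are labelled. In the head() output of the training set, clicked_on_ad.Yes is the sixth column, so selecting it by name avoids picking a neighbouring column by position. The sketch below uses made-up `actual` and `predicted` vectors (hypothetical values, not the model's output) to show the layout and how accuracy follows from the diagonal:

```r
# Made-up labels for illustration only (0 = did not click, 1 = clicked).
actual    <- c(0, 0, 1, 1, 1, 0, 1, 0)
predicted <- c(0, 1, 1, 1, 0, 0, 1, 0)

# Rows are the actual labels, columns are the predicted labels.
cm_toy <- table(actual = actual, predicted = predicted)
cm_toy

# Accuracy is the share of observations on the diagonal.
accuracy <- sum(diag(cm_toy)) / sum(cm_toy)
accuracy
```

With the real data, the same pattern would use `test_set$clicked_on_ad.Yes` as the actual labels and `y_pred` as the predictions.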

6. Conclusion

  • The variables exhibit a high standard deviation, which indicates that the data points tend to be spread far from their mean. Spread alone does not tell us whether the data are normally distributed; that has to be judged from the shape of each distribution, for example its skewness and kurtosis.

  • People aged between 25 and 50 account for the most clicks. This implies that the target audience falls within this age group.

  • The female audience is slightly larger than the male audience, so our target audience contains more females than males.

  • The countries with the highest counts are: Czech Republic, France, Afghanistan, Australia, Cyprus, Greece, Liberia, Micronesia, Peru, Senegal.

  • Most people with a higher area income who also spend the most time on the site tend not to click on the ad.

  • People with higher daily internet usage tend not to click on the ad.

  • More females than males clicked on the ad. Females also slightly outnumber males.

  • Among the ten records with the highest area_income, those who spend the most time on the site tend not to click on the ad.

  • Most clicks occurred in February, followed closely by May. The month with the highest overall count is also February, followed closely by March.

7. Recommendation

  • I recommend that further analysis be done to find out why people who spend more time on the site do not click on the ad. Is it that the ad is not catchy, or that it does not resonate with them?