An online cryptography course has been created and needs to be advertised on the blog of the entrepreneur who created it. The target audience for the advert comes from various countries. In the past, the entrepreneur has run ads for a related course on the same blog and collected data in the process. She therefore wishes to employ my services as a Data Science Consultant to help her identify the individuals who are most likely to click on her ads.
The metric of success for this project is identifying the individuals who are most likely to click on her ads, using Univariate and Bivariate Analysis.
The project is centered around the advertising industry. Clicks give an insight into how well an ad appeals to the people who see it; relevant, highly targeted ads are more likely to receive clicks.
For this analysis, I will perform the following actions:
Loading the Data.
Reading the Data.
Cleaning the Dataset.
Performing EDA:
Univariate Analysis.
Bivariate Analysis.
Modelling.
Conclusion.
Recommendation.
This data is relevant because it provides a comprehensive, consolidated view of the different audiences, which makes audience management and optimization simpler.
# Reading our data from a csv file.
advert<-read.csv('advertising.csv')
# Checking the class of the file.
class(advert)
## [1] "data.frame"
# Checking the dimension of the dataframe.
dim(advert)
## [1] 1000 10
The dataframe comprises 1000 entries and 10 fields.
# Previewing the first six records of the dataframe.
head(advert)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1 68.95 35 61833.90 256.09
## 2 80.23 31 68441.85 193.77
## 3 69.47 26 59785.94 236.50
## 4 74.15 29 54806.18 245.89
## 5 68.37 35 73889.99 225.58
## 6 59.99 23 59761.56 226.74
## Ad.Topic.Line City Male Country
## 1 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2 Monitored national standardization West Jodi 1 Nauru
## 3 Organic bottom-line service-desk Davidton 0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5 Robust logistical utilization South Manuel 0 Iceland
## 6 Sharable client-driven software Jamieberg 1 Norway
## Timestamp Clicked.on.Ad
## 1 2016-03-27 00:53:11 0
## 2 2016-04-04 01:39:02 0
## 3 2016-03-13 20:35:42 0
## 4 2016-01-10 02:31:19 0
## 5 2016-06-03 03:36:18 0
## 6 2016-05-19 14:30:17 0
# Previewing the last six records of the dataframe.
tail(advert)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 995 43.70 28 63126.96 173.01
## 996 72.97 30 71384.57 208.58
## 997 51.30 45 67782.17 134.42
## 998 51.63 51 42415.72 120.37
## 999 55.55 19 41920.79 187.95
## 1000 45.01 26 29875.80 178.35
## Ad.Topic.Line City Male
## 995 Front-line bifurcated ability Nicholasland 0
## 996 Fundamental modular algorithm Duffystad 1
## 997 Grass-roots cohesive monitoring New Darlene 1
## 998 Expanded intangible solution South Jessica 1
## 999 Proactive bandwidth-monitored policy West Steven 0
## 1000 Virtual 5thgeneration emulation Ronniemouth 0
## Country Timestamp Clicked.on.Ad
## 995 Mayotte 2016-04-04 03:57:48 1
## 996 Lebanon 2016-02-11 21:49:00 1
## 997 Bosnia and Herzegovina 2016-04-22 02:07:01 1
## 998 Mongolia 2016-02-01 17:24:57 1
## 999 Guatemala 2016-03-24 02:35:54 0
## 1000 Brazil 2016-06-03 21:43:21 1
# Printing information on the structure of the data frame.
str(advert)
## 'data.frame': 1000 obs. of 10 variables:
## $ Daily.Time.Spent.on.Site: num 69 80.2 69.5 74.2 68.4 ...
## $ Age : int 35 31 26 29 35 23 33 48 30 20 ...
## $ Area.Income : num 61834 68442 59786 54806 73890 ...
## $ Daily.Internet.Usage : num 256 194 236 246 226 ...
## $ Ad.Topic.Line : chr "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
## $ City : chr "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
## $ Male : int 0 1 0 1 0 1 0 1 1 1 ...
## $ Country : chr "Tunisia" "Nauru" "San Marino" "Italy" ...
## $ Timestamp : chr "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
## $ Clicked.on.Ad : int 0 0 0 0 0 0 0 1 0 0 ...
The Timestamp column has the wrong data type (character). During data cleaning, we will convert it to an appropriate date-time type.
# Checking for outliers in every numerical column.
# Helper function that draws a boxplot for a single numeric vector.
outliers <- function(x, title) {
  boxplot(x, main = title)
}
outliers(advert$Daily.Time.Spent.on.Site, "Boxplot on Daily Time Spent on Site")
outliers(advert$Age, "Boxplot on Age")
outliers(advert$Area.Income, "Boxplot on Area Income")
outliers(advert$Daily.Internet.Usage, "Boxplot on Daily Internet Usage")
outliers(advert$Male, "Boxplot on Male")
outliers(advert$Clicked.on.Ad, "Boxplot on Clicked on Ad")
From the boxplots, we see that the only column with outliers is Area.Income.
# Displaying all the outliers in the Area.Income column.
boxplot.stats(advert$Area.Income)$out
## [1] 17709.98 18819.34 15598.29 15879.10 14548.06 13996.50 14775.50 18368.57
# Extracting the row numbers where these outliers are found.
out <- boxplot.stats(advert$Area.Income)$out
out_ind <- which(advert$Area.Income %in% c(out))
out_ind
## [1] 136 511 641 666 693 769 779 953
# Displaying the rows with the outliers in the Area.Income column.
advert[out_ind,]
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 136 49.89 39 17709.98 160.03
## 511 57.86 30 18819.34 166.86
## 641 64.63 45 15598.29 158.80
## 666 58.05 32 15879.10 195.54
## 693 66.26 47 14548.06 179.04
## 769 68.58 41 13996.50 171.54
## 779 52.67 44 14775.50 191.26
## 953 62.79 36 18368.57 231.87
## Ad.Topic.Line City Male
## 136 Enhanced system-worthy application East Michele 1
## 511 Horizontal modular success Estesfurt 0
## 641 Triple-buffered high-level Internet solution Isaacborough 1
## 666 Total asynchronous architecture Sanderstown 1
## 693 Optional full-range projection Matthewtown 1
## 769 Exclusive discrete firmware New Williamville 1
## 779 Persevering 5thgeneration knowledge user New Hollyberg 0
## 953 Total coherent archive New James 1
## Country Timestamp Clicked.on.Ad
## 136 Belize 2016-04-16 12:09:25 1
## 511 Algeria 2016-07-08 17:14:01 1
## 641 Azerbaijan 2016-06-12 03:11:04 1
## 666 Tajikistan 2016-02-12 10:39:10 1
## 693 Lebanon 2016-04-25 19:31:39 1
## 769 El Salvador 2016-07-06 12:04:29 1
## 779 Jersey 2016-05-19 06:37:38 1
## 953 Luxembourg 2016-05-30 20:08:51 1
# Dealing with outliers: capping values below the lower Tukey fence.
lower_limit <- quantile(advert$Area.Income, 0.25) - 1.5 * IQR(advert$Area.Income) # Defining the lower limit (Q1 - 1.5 * IQR)
advert$Area.Income[advert$Area.Income < lower_limit] <- lower_limit
# Plotting a boxplot
boxplot(advert$Area.Income, main="Boxplot of Area.Income")
From this second boxplot, we can see that the column no longer has outliers.
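As a quick sanity check, we could also confirm numerically that no values remain below the lower fence; a small sketch using the lower_limit defined above:
# Counting values still below the lower fence; this should return 0.
sum(advert$Area.Income < lower_limit)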
# Calculating the total number of missing values in our dataset.
sum(is.na(advert))
## [1] 0
The dataset has no missing values.
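If we wanted a per-column breakdown rather than a single total, a one-liner like the following would show the count of missing values in each field:
# Counting missing values per column.
colSums(is.na(advert))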
sum(duplicated(advert))
## [1] 0
The dataset has no duplicated values.
advert$Timestamp <- strptime(advert$Timestamp, "%Y-%m-%d %H:%M:%S")
# Confirming the new class of Timestamp and re-checking the structure of the data frame.
class(advert$Timestamp)
## [1] "POSIXlt" "POSIXt"
str(advert)
## 'data.frame': 1000 obs. of 10 variables:
## $ Daily.Time.Spent.on.Site: num 69 80.2 69.5 74.2 68.4 ...
## $ Age : int 35 31 26 29 35 23 33 48 30 20 ...
## $ Area.Income : num 61834 68442 59786 54806 73890 ...
## $ Daily.Internet.Usage : num 256 194 236 246 226 ...
## $ Ad.Topic.Line : chr "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
## $ City : chr "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
## $ Male : int 0 1 0 1 0 1 0 1 1 1 ...
## $ Country : chr "Tunisia" "Nauru" "San Marino" "Italy" ...
## $ Timestamp : POSIXlt, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
## $ Clicked.on.Ad : int 0 0 0 0 0 0 0 1 0 0 ...
The data type of Timestamp has been converted from character to "POSIXlt"/"POSIXt".
##### 3.5 Ensuring Uniformity of Column Names
# Converting the column names to lower case.
names(advert) <- tolower(names(advert))
# Removing (.) in column names and replacing them with (_).
names(advert) <- gsub("\\.", "_", names(advert))
# Previewing the cleaned dataset
head(advert)
## daily_time_spent_on_site age area_income daily_internet_usage
## 1 68.95 35 61833.90 256.09
## 2 80.23 31 68441.85 193.77
## 3 69.47 26 59785.94 236.50
## 4 74.15 29 54806.18 245.89
## 5 68.37 35 73889.99 225.58
## 6 59.99 23 59761.56 226.74
## ad_topic_line city male country
## 1 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2 Monitored national standardization West Jodi 1 Nauru
## 3 Organic bottom-line service-desk Davidton 0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5 Robust logistical utilization South Manuel 0 Iceland
## 6 Sharable client-driven software Jamieberg 1 Norway
## timestamp clicked_on_ad
## 1 2016-03-27 00:53:11 0
## 2 2016-04-04 01:39:02 0
## 3 2016-03-13 20:35:42 0
## 4 2016-01-10 02:31:19 0
## 5 2016-06-03 03:36:18 0
## 6 2016-05-19 14:30:17 0
# Renaming the male column to gender
names(advert)[names(advert) == "male"] <- "gender"
# Checking the statistical summary of each field in our dataframe.
summary(advert)
## daily_time_spent_on_site age area_income daily_internet_usage
## Min. :32.60 Min. :19.00 Min. :19374 Min. :104.8
## 1st Qu.:51.36 1st Qu.:29.00 1st Qu.:47032 1st Qu.:138.8
## Median :68.22 Median :35.00 Median :57012 Median :183.1
## Mean :65.00 Mean :36.01 Mean :55025 Mean :180.0
## 3rd Qu.:78.55 3rd Qu.:42.00 3rd Qu.:65471 3rd Qu.:218.8
## Max. :91.43 Max. :61.00 Max. :79485 Max. :270.0
## ad_topic_line city gender country
## Length:1000 Length:1000 Min. :0.000 Length:1000
## Class :character Class :character 1st Qu.:0.000 Class :character
## Mode :character Mode :character Median :0.000 Mode :character
## Mean :0.481
## 3rd Qu.:1.000
## Max. :1.000
## timestamp clicked_on_ad
## Min. :2016-01-01 02:52:10.00 Min. :0.0
## 1st Qu.:2016-02-18 02:55:42.00 1st Qu.:0.0
## Median :2016-04-07 17:27:29.50 Median :0.5
## Mean :2016-04-10 10:34:06.64 Mean :0.5
## 3rd Qu.:2016-05-31 03:18:14.00 3rd Qu.:1.0
## Max. :2016-07-24 00:22:16.00 Max. :1.0
From the statistical summary, we can see the minimum, maximum, quartiles, mean, and median of each numerical column. (For the character columns, summary() only reports their length and storage mode.)
# Getting the standard deviation of some numerical variables.
sapply(advert[,c("daily_time_spent_on_site", "age", "area_income", "daily_internet_usage")], sd)
## daily_time_spent_on_site age area_income
## 15.853615 8.785562 13343.223865
## daily_internet_usage
## 43.902339
# Comparing the standard deviation and mean of daily_time_spent_on_site
print(sd(advert$daily_time_spent_on_site))
## [1] 15.85361
print(mean(advert$daily_time_spent_on_site))
## [1] 65.0002
# Comparing the standard deviation and mean of area_income
print(sd(advert$area_income))
## [1] 13343.22
print(mean(advert$area_income))
## [1] 55025.32
# Comparing the standard deviation and mean of age
print(sd(advert$age))
## [1] 8.785562
print(mean(advert$age))
## [1] 36.009
From these variables, we can see that they exhibit a high standard deviation, which indicates that the data points tend to be spread far from their means. Spread alone does not establish whether a distribution is normal, so we examine skewness and kurtosis below.
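Because the variables sit on very different scales, one way to compare their spread is the coefficient of variation (standard deviation divided by the mean); a short sketch:
# Computing the coefficient of variation for the numerical variables.
num_cols <- c("daily_time_spent_on_site", "age", "area_income", "daily_internet_usage")
sapply(advert[, num_cols], function(x) sd(x) / mean(x))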
# Getting the variance of some numerical variables.
sapply(advert[,c("daily_time_spent_on_site", "age", "area_income", "daily_internet_usage")], var)
## daily_time_spent_on_site age area_income
## 2.513371e+02 7.718611e+01 1.780416e+08
## daily_internet_usage
## 1.927415e+03
The variance is simply the square of the standard deviation, so it carries the same interpretation.
# Getting the interquartile range of some numerical variables.
sapply(advert[,c("daily_time_spent_on_site", "age", "area_income", "daily_internet_usage")], IQR)
## daily_time_spent_on_site age area_income
## 27.1875 13.0000 18438.8325
## daily_internet_usage
## 79.9625
The interquartile range gives the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile).
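To make this concrete, we could verify the reported IQR for age directly from its quartiles:
# The IQR of age should equal Q3 - Q1 (75th minus 25th percentile).
quantile(advert$age, 0.75) - quantile(advert$age, 0.25)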
# Getting the skewness of some numerical variables.
library(moments)
skew_val <- apply(advert[,c("daily_time_spent_on_site", "age", "area_income", "daily_internet_usage")], 2, skewness)
print(skew_val)
## daily_time_spent_on_site age area_income
## -0.37120261 0.47842268 -0.62004808
## daily_internet_usage
## -0.03348703
From the skewness values, area_income is moderately negatively skewed (its coefficient lies between −1 and −½), while daily_time_spent_on_site, age, and daily_internet_usage all lie between −½ and +½ and are therefore approximately symmetric.
# Getting the kurtosis of some numerical variables.
kurt_val <- apply(advert[,c("daily_time_spent_on_site", "age", "area_income", "daily_internet_usage")], 2, kurtosis)
print(kurt_val)
## daily_time_spent_on_site age area_income
## 1.903942 2.595482 2.789881
## daily_internet_usage
## 1.727701
The coefficient of kurtosis is less than 3 for each of the variables, so the data distributions are platykurtic (flatter-tailed than a normal distribution).
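We can confirm this directly from the values already computed:
# Checking that every kurtosis coefficient is below 3; all should be TRUE.
kurt_val < 3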
# Loading the ggplot2 library.
library(ggplot2)
# Plotting the histogram of Age Distribution
p<-ggplot(advert, aes(x=age)) +
geom_histogram(color="black", fill="light blue")+
ggtitle("Histogram for Age Distribution")
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the histogram above, we find that most of the audience falls within the 25–50 age bracket, which implies that the target audience lies in this age group.
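Since ggplot2 warns about the default of 30 bins, we could re-plot with an explicit binwidth; 5-year bins are one reasonable choice here, not the only one:
# Re-plotting the age distribution with an explicit 5-year binwidth.
ggplot(advert, aes(x = age)) +
  geom_histogram(binwidth = 5, color = "black", fill = "light blue") +
  ggtitle("Histogram for Age Distribution (binwidth = 5)")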
# Counting the number of male and female
ggplot(advert, aes(x = gender)) +
geom_bar(fill="light blue", color="black")+
ggtitle("Bar Graph for The count of Male and Female")
From the bar plot, we see that there are slightly more females than males, meaning our target audience contains slightly more females than males.
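For the exact counts behind the bar graph, a simple frequency table suffices (at this point gender is still the original 0/1 encoding, where 1 indicates male):
# Tabulating the gender column (0 = female, 1 = male).
table(advert$gender)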
# Ranking the countries according to count
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
advert %>%
count(country, sort=TRUE)
## country n
## 1 Czech Republic 9
## 2 France 9
## 3 Afghanistan 8
## 4 Australia 8
## 5 Cyprus 8
## 6 Greece 8
## 7 Liberia 8
## 8 Micronesia 8
## 9 Peru 8
## 10 Senegal 8
## 11 South Africa 8
## 12 Turkey 8
## 13 Albania 7
## 14 Bahamas 7
## 15 Bosnia and Herzegovina 7
## 16 Burundi 7
## 17 Cambodia 7
## 18 Eritrea 7
## 19 Ethiopia 7
## 20 Fiji 7
## 21 Luxembourg 7
## 22 Taiwan 7
## 23 Venezuela 7
## 24 Western Sahara 7
## 25 Algeria 6
## 26 Anguilla 6
## 27 Belarus 6
## 28 Bolivia 6
## 29 Bulgaria 6
## 30 China 6
## 31 Christmas Island 6
## 32 Costa Rica 6
## 33 Croatia 6
## 34 El Salvador 6
## 35 Gabon 6
## 36 Hong Kong 6
## 37 Hungary 6
## 38 Indonesia 6
## 39 Jersey 6
## 40 Kyrgyz Republic 6
## 41 Lebanon 6
## 42 Liechtenstein 6
## 43 Madagascar 6
## 44 Malta 6
## 45 Mayotte 6
## 46 Mexico 6
## 47 Moldova 6
## 48 Mongolia 6
## 49 Netherlands Antilles 6
## 50 Philippines 6
## 51 Poland 6
## 52 Puerto Rico 6
## 53 Qatar 6
## 54 Saint Vincent and the Grenadines 6
## 55 Samoa 6
## 56 Singapore 6
## 57 Svalbard & Jan Mayen Islands 6
## 58 Turkmenistan 6
## 59 United Arab Emirates 6
## 60 Vanuatu 6
## 61 Zimbabwe 6
## 62 American Samoa 5
## 63 Antigua and Barbuda 5
## 64 Austria 5
## 65 Bahrain 5
## 66 Barbados 5
## 67 Belgium 5
## 68 Belize 5
## 69 Bouvet Island (Bouvetoya) 5
## 70 Brazil 5
## 71 Brunei Darussalam 5
## 72 Cameroon 5
## 73 Canada 5
## 74 Cayman Islands 5
## 75 Cuba 5
## 76 Dominica 5
## 77 Ecuador 5
## 78 Egypt 5
## 79 Finland 5
## 80 French Polynesia 5
## 81 French Southern Territories 5
## 82 Greenland 5
## 83 Guyana 5
## 84 Honduras 5
## 85 Iran 5
## 86 Italy 5
## 87 Jamaica 5
## 88 Korea 5
## 89 Myanmar 5
## 90 Norfolk Island 5
## 91 Pakistan 5
## 92 Papua New Guinea 5
## 93 Rwanda 5
## 94 Saint Helena 5
## 95 Saint Pierre and Miquelon 5
## 96 Serbia 5
## 97 Somalia 5
## 98 Timor-Leste 5
## 99 Tonga 5
## 100 Turks and Caicos Islands 5
## 101 Ukraine 5
## 102 United States of America 5
## 103 Uruguay 5
## 104 Angola 4
## 105 Bangladesh 4
## 106 Burkina Faso 4
## 107 Chad 4
## 108 Chile 4
## 109 Congo 4
## 110 Cote d'Ivoire 4
## 111 Dominican Republic 4
## 112 Equatorial Guinea 4
## 113 Falkland Islands (Malvinas) 4
## 114 French Guiana 4
## 115 Georgia 4
## 116 Ghana 4
## 117 Grenada 4
## 118 Guam 4
## 119 Guatemala 4
## 120 Israel 4
## 121 Japan 4
## 122 Kazakhstan 4
## 123 Kenya 4
## 124 Lao People's Democratic Republic 4
## 125 Latvia 4
## 126 Libyan Arab Jamahiriya 4
## 127 Malawi 4
## 128 Maldives 4
## 129 Mali 4
## 130 Martinique 4
## 131 Mauritius 4
## 132 Netherlands 4
## 133 New Zealand 4
## 134 Palau 4
## 135 Saint Martin 4
## 136 Saudi Arabia 4
## 137 Sri Lanka 4
## 138 Sweden 4
## 139 Switzerland 4
## 140 Thailand 4
## 141 Tokelau 4
## 142 Tunisia 4
## 143 Tuvalu 4
## 144 Uganda 4
## 145 United States Minor Outlying Islands 4
## 146 United States Virgin Islands 4
## 147 Wallis and Futuna 4
## 148 Zambia 4
## 149 Antarctica (the territory South of 60 deg S) 3
## 150 Armenia 3
## 151 Azerbaijan 3
## 152 British Virgin Islands 3
## 153 Cook Islands 3
## 154 Denmark 3
## 155 Estonia 3
## 156 Faroe Islands 3
## 157 Gibraltar 3
## 158 Guernsey 3
## 159 Guinea 3
## 160 Heard Island and McDonald Islands 3
## 161 Holy See (Vatican City State) 3
## 162 Iceland 3
## 163 Ireland 3
## 164 Isle of Man 3
## 165 Lithuania 3
## 166 Macao 3
## 167 Malaysia 3
## 168 Monaco 3
## 169 Morocco 3
## 170 Nauru 3
## 171 Nepal 3
## 172 Nicaragua 3
## 173 Niger 3
## 174 Niue 3
## 175 Northern Mariana Islands 3
## 176 Palestinian Territory 3
## 177 Paraguay 3
## 178 Portugal 3
## 179 Russian Federation 3
## 180 San Marino 3
## 181 Seychelles 3
## 182 Spain 3
## 183 Syrian Arab Republic 3
## 184 Tajikistan 3
## 185 Tanzania 3
## 186 Togo 3
## 187 Trinidad and Tobago 3
## 188 United Kingdom 3
## 189 Vietnam 3
## 190 Yemen 3
## 191 Andorra 2
## 192 Argentina 2
## 193 Benin 2
## 194 Bhutan 2
## 195 Central African Republic 2
## 196 Colombia 2
## 197 Comoros 2
## 198 Djibouti 2
## 199 Gambia 2
## 200 Guadeloupe 2
## 201 Guinea-Bissau 2
## 202 Haiti 2
## 203 India 2
## 204 Kuwait 2
## 205 Macedonia 2
## 206 Mauritania 2
## 207 Montenegro 2
## 208 Namibia 2
## 209 New Caledonia 2
## 210 Norway 2
## 211 Panama 2
## 212 Pitcairn Islands 2
## 213 Reunion 2
## 214 Saint Barthelemy 2
## 215 Saint Lucia 2
## 216 Sao Tome and Principe 2
## 217 Sierra Leone 2
## 218 Slovakia (Slovak Republic) 2
## 219 South Georgia and the South Sandwich Islands 2
## 220 Sudan 2
## 221 Suriname 2
## 222 Swaziland 2
## 223 Uzbekistan 2
## 224 Aruba 1
## 225 Bermuda 1
## 226 British Indian Ocean Territory (Chagos Archipelago) 1
## 227 Cape Verde 1
## 228 Germany 1
## 229 Jordan 1
## 230 Kiribati 1
## 231 Lesotho 1
## 232 Marshall Islands 1
## 233 Montserrat 1
## 234 Mozambique 1
## 235 Romania 1
## 236 Saint Kitts and Nevis 1
## 237 Slovenia 1
From the list, the countries with the highest counts are the Czech Republic and France (9 records each), followed by Afghanistan, Australia, Cyprus, Greece, Liberia, Micronesia, Peru, Senegal, South Africa, and Turkey (8 each).
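Rather than printing all 237 countries, we could also pull just the top of the ranking, for example:
# Keeping only the ten largest country counts (slice_max includes ties).
advert %>%
  count(country, sort = TRUE) %>%
  slice_max(n, n = 10)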
# Separating the date and time from the timestamp column.
advert$date <- as.Date(advert$timestamp)
advert$time <- format(advert$timestamp,"%H:%M:%S")
# Extracting month from the date
advert$month <- format(advert$date, "%m")
# Previewing the dataset
head(advert)
## daily_time_spent_on_site age area_income daily_internet_usage
## 1 68.95 35 61833.90 256.09
## 2 80.23 31 68441.85 193.77
## 3 69.47 26 59785.94 236.50
## 4 74.15 29 54806.18 245.89
## 5 68.37 35 73889.99 225.58
## 6 59.99 23 59761.56 226.74
## ad_topic_line city gender country
## 1 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2 Monitored national standardization West Jodi 1 Nauru
## 3 Organic bottom-line service-desk Davidton 0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5 Robust logistical utilization South Manuel 0 Iceland
## 6 Sharable client-driven software Jamieberg 1 Norway
## timestamp clicked_on_ad date time month
## 1 2016-03-27 00:53:11 0 2016-03-27 00:53:11 03
## 2 2016-04-04 01:39:02 0 2016-04-04 01:39:02 04
## 3 2016-03-13 20:35:42 0 2016-03-13 20:35:42 03
## 4 2016-01-10 02:31:19 0 2016-01-10 02:31:19 01
## 5 2016-06-03 03:36:18 0 2016-06-03 03:36:18 06
## 6 2016-05-19 14:30:17 0 2016-05-19 14:30:17 05
library(ggplot2)
# Treating clicked_on_ad as a factor so the bars split by click status.
ggplot(advert, aes(x = gender, fill = factor(clicked_on_ad))) +
  geom_bar(color = "black") +
  ggtitle("Bar Graph for The Count of Male and Female vs. Their Clicked_on_ad Status")
From the graph, we see that more females than males clicked on the ad, although the difference is slight.
ggplot(advert, aes(x = month, fill = factor(clicked_on_ad))) +
  geom_bar(color = "black") +
  ggtitle("Bar Graph for The Count of Month vs. Clicked_on_ad")
From the bar graph, we can see that the most clicks were recorded in February, followed closely by May. February also has the highest overall record count, followed closely by March.
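The underlying numbers can be obtained with a grouped count, e.g.:
# Counting records per month, split by click status.
advert %>%
  count(month, clicked_on_ad)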
# Recoding clicked_on_ad into a labelled factor (0 = No, 1 = Yes).
advert$clicked_on_ad <- recode_factor(advert$clicked_on_ad, '0' = 'No', '1' = 'Yes')
# Use clicked_on_ad as the faceting variable
ggplot(advert, aes(x = daily_time_spent_on_site)) +
geom_histogram(fill = "light blue", colour = "black") +
ggtitle("Histogram of Clicked_on_ad and Daily_time_spent_on_site")+
facet_grid(clicked_on_ad ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the graph, we see that the people who spend the most time on the site tend not to click on the ad.
advert$gender <- recode_factor(advert$gender, '0' = 'Female', '1' = 'Male')
# Use gender as the faceting variable
ggplot(advert, aes(x = daily_time_spent_on_site)) +
geom_histogram(fill = "light blue", colour = "black") +
ggtitle("Histogram of Gender and Daily_time_spent_on_site")+
facet_grid(gender ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the graph, we see that females tend to spend slightly more time on the site than males.
# Plotting a jitter plot of daily_time_spent_on_site, age, and gender
ggplot(advert, aes(x = daily_time_spent_on_site, y = age, color = gender)) +
ggtitle("Jitterplot of daily_time_spent_on_site, age, and gender")+
geom_jitter(width = .2)
From the jitter plot, we see that time spent on the site varies widely across all ages for both genders.
# Use clicked_on_ad as the faceting variable
ggplot(advert, aes(x = daily_internet_usage)) +
geom_histogram(fill = "light blue", colour = "black") +
ggtitle("Histogram of daily_internet_usage and clicked_on_ad")+
facet_grid(clicked_on_ad ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the graphs, we see that the heaviest daily internet users tend not to click on the ad.
# Plotting a jitter plot of daily_time_spent_on_site, area_income, and clicked_on_ad.
ggplot(advert, aes(x = daily_time_spent_on_site, y = area_income, color = clicked_on_ad )) +
geom_jitter(width = .2)+
ggtitle("Jitter plot of daily_time_spend_on_site, area_income and clicked_on_ad")
From the jitter plot, we see that people with a higher area income who spend more time on the site tend not to click on the ad.
# Creating a copy of the dataset
adverts <-advert
head(adverts)
## daily_time_spent_on_site age area_income daily_internet_usage
## 1 68.95 35 61833.90 256.09
## 2 80.23 31 68441.85 193.77
## 3 69.47 26 59785.94 236.50
## 4 74.15 29 54806.18 245.89
## 5 68.37 35 73889.99 225.58
## 6 59.99 23 59761.56 226.74
## ad_topic_line city gender country
## 1 Cloned 5thgeneration orchestration Wrightburgh Female Tunisia
## 2 Monitored national standardization West Jodi Male Nauru
## 3 Organic bottom-line service-desk Davidton Female San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt Male Italy
## 5 Robust logistical utilization South Manuel Female Iceland
## 6 Sharable client-driven software Jamieberg Male Norway
## timestamp clicked_on_ad date time month
## 1 2016-03-27 00:53:11 No 2016-03-27 00:53:11 03
## 2 2016-04-04 01:39:02 No 2016-04-04 01:39:02 04
## 3 2016-03-13 20:35:42 No 2016-03-13 20:35:42 03
## 4 2016-01-10 02:31:19 No 2016-01-10 02:31:19 01
## 5 2016-06-03 03:36:18 No 2016-06-03 03:36:18 06
## 6 2016-05-19 14:30:17 No 2016-05-19 14:30:17 05
# Keeping the ten records with the highest area_income.
adverts <- adverts %>% slice_max(area_income, n = 10)
# Creating a jitter plot of daily_time_spent_on_site,area_income, and clicked_on_ad
ggplot(adverts, aes(x = daily_time_spent_on_site, y = area_income, color = clicked_on_ad )) +
geom_jitter(width = .2)+
ggtitle("Jitter plot of daily_time_spent_on_site,area_income, and clicked_on_ad")
From the graph, we see that people with a high area_income who spend most of their time on the site tend not to click on the ad.
# Jitter plot of daily_time_spent_on_site, gender, and clicked_on_ad
ggplot(adverts, aes(x = clicked_on_ad, y = daily_time_spent_on_site , color = gender )) +
geom_jitter(width = .2)+
ggtitle("Jitter plot of daily_time_spent_on_site,gender, and clicked_on_ad")
From the graph, we see that the people who spend the most time on the site tend not to click on the ad, and this group is a mix of both males and females.
# Line Plot of daily_time_spent_on_site and date
adverts %>%
ggplot( aes(x=date, y=daily_time_spent_on_site)) +
geom_line() +
ggtitle("Line Plot of daily_time_spent_on_site and date")+
geom_point()
From the line graph, we can see the trend of daily_time_spent_on_site over time: it starts high in February, dips from April, rises again towards June, takes a dip within June, and then spikes towards July.
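Because this plot covers only the ten highest-income records, a fuller picture could come from averaging over the whole dataset, for example:
# Average daily time spent on site per month, over all 1000 records.
advert %>%
  group_by(month) %>%
  summarise(mean_time_on_site = mean(daily_time_spent_on_site))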
library(corrplot)
## corrplot 0.92 loaded
advert_num <- Filter(is.numeric, advert)
corrplot(cor(advert_num))
# correlation for all numerical variables
round(cor(advert_num),
digits = 2 # rounded to 2 decimals
)
## daily_time_spent_on_site age area_income
## daily_time_spent_on_site 1.00 -0.33 0.31
## age -0.33 1.00 -0.18
## area_income 0.31 -0.18 1.00
## daily_internet_usage 0.52 -0.37 0.34
## daily_internet_usage
## daily_time_spent_on_site 0.52
## age -0.37
## area_income 0.34
## daily_internet_usage 1.00
A negative correlation implies that the two variables under consideration vary in opposite directions, that is, if a variable increases the other decreases and vice versa. On the other hand, a positive correlation implies that the two variables under consideration vary in the same direction, i.e., if a variable increases the other one increases and if one decreases the other one decreases as well.
The more extreme the correlation coefficient (the closer to −1 or 1), the stronger the relationship. A correlation close to 0 indicates little or no linear relationship between the two variables: as one variable increases, there is no consistent tendency in the other to either increase or decrease.
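If we wanted to test whether a particular correlation is statistically significant, base R's cor.test() could be applied, for example to the strongest pair above:
# Testing the correlation between time on site and daily internet usage.
cor.test(advert$daily_time_spent_on_site, advert$daily_internet_usage)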
# Creating a copy of our dataset
advert_copy <- advert
# Dropping the columns that will not be used for modelling.
drop <- c("timestamp", "ad_topic_line", "city", "country", "time", "month")
advert_copy <- advert_copy[, !(names(advert_copy) %in% drop)]
# clicked_on_ad and gender are already labelled factors; dummyVars below will
# one-hot encode them (clicked_on_ad becomes the 0/1 indicator clicked_on_ad.Yes).
library(caret)
## Loading required package: lattice
# One hot encoding our categorical variables
advert_dmy <- dummyVars(" ~ .", data = advert_copy, fullRank = T)
dat_transformed <- data.frame(predict(advert_dmy, newdata = advert_copy))
# Splitting the dataset into the Training set and Test set
library(caTools)
set.seed(123)
split = sample.split(dat_transformed$clicked_on_ad, SplitRatio = 0.75)
training_set = subset(dat_transformed, split == TRUE)
test_set = subset(dat_transformed, split == FALSE)
# Checking the first six records of the train set
head(training_set)
## daily_time_spent_on_site age area_income daily_internet_usage gender.Male
## 1 68.95 35 61833.90 256.09 0
## 3 69.47 26 59785.94 236.50 0
## 6 59.99 23 59761.56 226.74 1
## 7 88.91 33 53852.85 208.36 0
## 8 66.00 48 24593.33 131.76 1
## 10 69.88 20 55642.32 183.82 1
## clicked_on_ad.Yes date
## 1 0 16887
## 3 0 16873
## 6 0 16940
## 7 0 16828
## 8 1 16867
## 10 0 16993
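Before fitting, it may be worth confirming that sample.split preserved the 50/50 class balance of the target in both partitions; a quick sketch:
# Class counts of the target in the training and test sets.
table(training_set$clicked_on_ad.Yes)
table(test_set$clicked_on_ad.Yes)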
# Fitting SVM to the Training set
library(e1071)
##
## Attaching package: 'e1071'
## The following objects are masked from 'package:moments':
##
## kurtosis, moment, skewness
classifier = svm(formula = clicked_on_ad.Yes ~ .,
data = training_set,
type = 'C-classification',
kernel = 'linear')
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set)
y_pred
## 2 4 5 9 14 23 26 29 36 38 39 43 46 56 58 59 62 65 66 68
## 0 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 0 1 0 1
## 73 79 84 88 91 94 95 103 106 108 116 120 121 124 130 133 136 138 139 140
## 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 1 1 0 0
## 141 145 146 157 158 160 166 172 175 176 178 180 182 187 191 201 206 210 213 219
## 0 0 1 1 1 0 1 0 1 0 0 0 0 1 1 0 1 1 0 1
## 221 222 223 224 227 231 232 233 234 241 243 248 249 251 258 262 266 267 270 278
## 0 0 1 1 1 0 1 1 0 1 0 0 1 0 1 1 1 1 0 0
## 280 288 289 292 294 300 302 303 310 312 313 327 337 346 347 354 357 360 366 371
## 0 0 1 0 0 0 1 1 1 0 1 1 0 0 0 0 1 0 1 1
## 372 375 376 381 384 394 402 409 414 432 434 437 444 452 454 457 459 462 467 469
## 1 0 0 0 0 0 0 1 1 0 0 0 1 1 0 1 1 1 1 1
## 473 476 478 479 485 493 495 496 501 510 517 518 523 525 528 538 542 543 545 550
## 0 0 1 1 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0
## 554 555 561 572 573 575 577 578 579 583 587 589 591 592 594 601 619 622 624 630
## 1 1 1 0 0 1 1 0 0 1 0 0 1 1 0 1 1 0 0 0
## 631 637 639 643 649 652 653 660 664 676 680 687 691 698 701 706 707 717 721 726
## 0 1 1 0 0 0 0 0 1 0 1 0 0 0 1 0 1 1 0 0
## 727 730 732 736 738 740 745 748 753 756 769 773 775 778 781 783 786 792 800 804
## 0 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 1 1 0 1
## 808 809 815 818 821 827 828 832 833 834 840 845 846 848 854 855 861 870 873 884
## 1 1 0 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 1
## 886 888 889 891 892 894 897 899 900 902 920 924 927 929 930 943 949 951 953 958
## 1 1 0 0 1 0 0 1 1 1 0 1 0 0 1 1 0 1 1 0
## 961 962 972 974 975 976 979 987 995 999
## 1 0 1 0 1 1 0 0 1 1
## Levels: 0 1
# Making the confusion matrix: actual clicked_on_ad.Yes vs. predicted values.
cm = table(test_set$clicked_on_ad.Yes, y_pred)
cm
## y_pred
## 0 1
## 0 66 74
## 1 65 45
In the confusion matrix, the diagonal holds the correct predictions: the model correctly classified 66 people who did not click on the ad and 45 people who did, while misclassifying 74 non-clickers and 65 clickers.
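From the matrix, the overall accuracy can be computed directly:
# Overall accuracy: correct predictions over all test cases, i.e. (66 + 45) / 250.
sum(diag(cm)) / sum(cm)
An accuracy of roughly 44% on a balanced target suggests the model would benefit from feature scaling (SVMs are sensitive to unscaled inputs such as area_income) or further tuning.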
The variables exhibit a high standard deviation, which indicates that the data points tend to be spread far from their means.
People in the 25–50 age bracket account for most of the audience, implying that the target audience falls within this age group.
There are slightly more females than males in the audience, and slightly more females than males clicked on the ad.
The countries with the highest record counts are the Czech Republic and France, followed by Afghanistan, Australia, Cyprus, Greece, Liberia, Micronesia, Peru, Senegal, South Africa, and Turkey.
People with a higher area income who spend more time on the site tend not to click on the ad.
The heaviest daily internet users tend not to click on the ad.
Most clicks were recorded in February, followed closely by May; February also has the highest overall record count, followed closely by March.