Matilda Kadzo 25/05/2022
A Kenyan entrepreneur has created an online cryptography course and would like to advertise it on her blog. She currently targets audiences originating from various countries. In the past, she ran adverts for a related course on the same blog and collected data in the process. She would now like to employ your services as a Data Science Consultant to help her identify which individuals are most likely to click on her adverts.
The objective is to provide an accurate depiction of the people most likely to view the client's advertisements and to provide recommendations to the client based on the results of the univariate and bivariate analysis conducted on the dataset.
Clicks on adverts can help you understand how appealing your advert is to the people who see it, and highly targeted ads are more likely to receive clicks. In this case, the number of clicks on our client's blog would help us know how many people would be interested in the online cryptography course.
Steps taken:
advert <- read.csv("/home/binti/Downloads/R/advertising.csv")
head(advert)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1 68.95 35 61833.90 256.09
## 2 80.23 31 68441.85 193.77
## 3 69.47 26 59785.94 236.50
## 4 74.15 29 54806.18 245.89
## 5 68.37 35 73889.99 225.58
## 6 59.99 23 59761.56 226.74
## Ad.Topic.Line City Male Country
## 1 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2 Monitored national standardization West Jodi 1 Nauru
## 3 Organic bottom-line service-desk Davidton 0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5 Robust logistical utilization South Manuel 0 Iceland
## 6 Sharable client-driven software Jamieberg 1 Norway
## Timestamp Clicked.on.Ad
## 1 2016-03-27 00:53:11 0
## 2 2016-04-04 01:39:02 0
## 3 2016-03-13 20:35:42 0
## 4 2016-01-10 02:31:19 0
## 5 2016-06-03 03:36:18 0
## 6 2016-05-19 14:30:17 0
tail(advert)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 995 43.70 28 63126.96 173.01
## 996 72.97 30 71384.57 208.58
## 997 51.30 45 67782.17 134.42
## 998 51.63 51 42415.72 120.37
## 999 55.55 19 41920.79 187.95
## 1000 45.01 26 29875.80 178.35
## Ad.Topic.Line City Male
## 995 Front-line bifurcated ability Nicholasland 0
## 996 Fundamental modular algorithm Duffystad 1
## 997 Grass-roots cohesive monitoring New Darlene 1
## 998 Expanded intangible solution South Jessica 1
## 999 Proactive bandwidth-monitored policy West Steven 0
## 1000 Virtual 5thgeneration emulation Ronniemouth 0
## Country Timestamp Clicked.on.Ad
## 995 Mayotte 2016-04-04 03:57:48 1
## 996 Lebanon 2016-02-11 21:49:00 1
## 997 Bosnia and Herzegovina 2016-04-22 02:07:01 1
## 998 Mongolia 2016-02-01 17:24:57 1
## 999 Guatemala 2016-03-24 02:35:54 0
## 1000 Brazil 2016-06-03 21:43:21 1
names(advert)
## [1] "Daily.Time.Spent.on.Site" "Age"
## [3] "Area.Income" "Daily.Internet.Usage"
## [5] "Ad.Topic.Line" "City"
## [7] "Male" "Country"
## [9] "Timestamp" "Clicked.on.Ad"
Finding the total missing values in our dataset.
colSums(is.na(advert))
## Daily.Time.Spent.on.Site Age Area.Income
## 0 0 0
## Daily.Internet.Usage Ad.Topic.Line City
## 0 0 0
## Male Country Timestamp
## 0 0 0
## Clicked.on.Ad
## 0
# There are no missing values in our dataset.
Checking for duplicates across our rows.
# duplicated() flags repeated rows; summing the flags counts them
sum(duplicated(advert))
## [1] 0
# There are no duplicates in this dataset.
The dataset had neither missing values nor duplicated rows.
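To make the two checks concrete, here is a minimal sketch on a small data frame invented for illustration (`toy` and its columns are not part of the advertising data). It shows what the checks report when problems do exist.

```r
# Toy data, invented for illustration: row 3 repeats row 2, and score has two NAs
toy <- data.frame(id = c(1, 2, 2, 4), score = c(10, NA, NA, 7))

# Missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows that repeat an earlier row
n_dup <- sum(duplicated(toy))
print(n_dup)

# Dropping the repeated rows, if that is the desired clean-up
toy_unique <- toy[!duplicated(toy), ]
print(nrow(toy_unique))
```

On the advertising data both counts are zero, so no clean-up of this kind is needed.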
Checking the descriptive statistics of the dataset
summary(advert)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## Min. :32.60 Min. :19.00 Min. :13996 Min. :104.8
## 1st Qu.:51.36 1st Qu.:29.00 1st Qu.:47032 1st Qu.:138.8
## Median :68.22 Median :35.00 Median :57012 Median :183.1
## Mean :65.00 Mean :36.01 Mean :55000 Mean :180.0
## 3rd Qu.:78.55 3rd Qu.:42.00 3rd Qu.:65471 3rd Qu.:218.8
## Max. :91.43 Max. :61.00 Max. :79485 Max. :270.0
## Ad.Topic.Line City Male Country
## Length:1000 Length:1000 Min. :0.000 Length:1000
## Class :character Class :character 1st Qu.:0.000 Class :character
## Mode :character Mode :character Median :0.000 Mode :character
## Mean :0.481
## 3rd Qu.:1.000
## Max. :1.000
## Timestamp Clicked.on.Ad
## Length:1000 Min. :0.0
## Class :character 1st Qu.:0.0
## Mode :character Median :0.5
## Mean :0.5
## 3rd Qu.:1.0
## Max. :1.0
Checking the structure of the dataframe
str(advert)
## 'data.frame': 1000 obs. of 10 variables:
## $ Daily.Time.Spent.on.Site: num 69 80.2 69.5 74.2 68.4 ...
## $ Age : int 35 31 26 29 35 23 33 48 30 20 ...
## $ Area.Income : num 61834 68442 59786 54806 73890 ...
## $ Daily.Internet.Usage : num 256 194 236 246 226 ...
## $ Ad.Topic.Line : chr "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
## $ City : chr "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
## $ Male : int 0 1 0 1 0 1 0 1 1 1 ...
## $ Country : chr "Tunisia" "Nauru" "San Marino" "Italy" ...
## $ Timestamp : chr "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
## $ Clicked.on.Ad : int 0 0 0 0 0 0 0 1 0 0 ...
Checking for outliers in the dataset. Boxplots give a visual summary of each column's distribution and mark outliers as separate points.
boxplot(advert$Area.Income,
        main = "Area Income",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)
# There are a few outliers in the area income column.
boxplot(advert$Daily.Time.Spent.on.Site,
        main = "Daily Time Spent on Site",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)
# There are no outliers in the daily time spent on site column.
boxplot(advert$Age,
        main = "Age",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)
# There are no outliers in the age column.
boxplot(advert$Daily.Internet.Usage,
        main = "Daily Internet Usage",
        col = "orange",
        border = "brown",
        horizontal = TRUE,
        notch = TRUE)
# There are no outliers in the daily internet usage column.
Mean of the numeric columns
colMeans(advert[sapply(advert, is.numeric)])
## Daily.Time.Spent.on.Site Age Area.Income
## 65.0002 36.0090 55000.0001
## Daily.Internet.Usage Male Clicked.on.Ad
## 180.0001 0.4810 0.5000
Median of our numeric columns
ad_time_median <- median(advert$Daily.Time.Spent.on.Site)
print(ad_time_median)
## [1] 68.215
ad_age_median <- median(advert$Age)
ad_age_median
## [1] 35
ad_income_median <- median(advert$Area.Income)
ad_income_median
## [1] 57012.3
ad_internet_usage_median <- median(advert$Daily.Internet.Usage)
ad_internet_usage_median
## [1] 183.13
Mode of our numeric columns.
Let’s create the mode function
getmode <- function(v) {
  # Distinct values, in order of first appearance
  uniqv <- unique(v)
  # Count each distinct value and return the most frequent one
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
Finding the mode of each column
getmode(advert$Age)
## [1] 31
getmode(advert$Daily.Time.Spent.on.Site)
## [1] 62.26
getmode(advert$Area.Income)
## [1] 61833.9
getmode(advert$Daily.Internet.Usage)
## [1] 167.22
getmode(advert$City)
## [1] "Lisamouth"
getmode(advert$Ad.Topic.Line)
## [1] "Cloned 5thgeneration orchestration"
getmode(advert$Male)
## [1] 0
getmode(advert$Country)
## [1] "Czech Republic"
getmode(advert$Timestamp)
## [1] "2016-03-27 00:53:11"
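A caution when reading these modes: the values reported for Area.Income, Ad.Topic.Line and Timestamp are the same as the first row of the dataset, which suggests that no value in those columns repeats. The sketch below uses small vectors invented for illustration (`ages` and `incomes` are not the real columns). It shows that `getmode` returns the first element when every value is unique, in which case the "mode" carries no information.

```r
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# A vector with a genuine mode: 31 appears three times
ages <- c(31, 25, 31, 40, 25, 31)
mode_ages <- getmode(ages)
print(mode_ages)

# A vector where every value is unique: which.max() picks the first tie
incomes <- c(61833.90, 68441.85, 59785.94)
mode_incomes <- getmode(incomes)
print(mode_incomes)

# The highest frequency tells us whether the mode is meaningful
max_freq_incomes <- max(table(incomes))
print(max_freq_incomes)
```

Checking `max(table(x))` alongside the mode is one simple way to tell a real mode from an artefact of row order.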
Minimum values in the numeric columns
min(advert$Age)
## [1] 19
min(advert$Daily.Time.Spent.on.Site)
## [1] 32.6
min(advert$Area.Income)
## [1] 13996.5
min(advert$Daily.Internet.Usage)
## [1] 104.78
Maximum values in the numeric columns
max(advert$Age)
## [1] 61
max(advert$Daily.Time.Spent.on.Site)
## [1] 91.43
max(advert$Area.Income)
## [1] 79484.8
max(advert$Daily.Internet.Usage)
## [1] 269.96
Range in the numeric columns
range(advert$Age)
## [1] 19 61
range(advert$Daily.Time.Spent.on.Site)
## [1] 32.60 91.43
range(advert$Area.Income)
## [1] 13996.5 79484.8
range(advert$Daily.Internet.Usage)
## [1] 104.78 269.96
Summary
* The youngest respondent is 19 and the oldest is 61 years of age.
* The least time spent on her site is 32.6 minutes and the most is 91.43 minutes.
* The lowest area income among the respondents is 13,996.50 and the highest is 79,484.80.
* Daily internet usage ranges from 104.78 to 269.96.
Quantiles in the columns
quantile(advert$Age)
## 0% 25% 50% 75% 100%
## 19 29 35 42 61
quantile(advert$Daily.Time.Spent.on.Site)
## 0% 25% 50% 75% 100%
## 32.6000 51.3600 68.2150 78.5475 91.4300
quantile(advert$Area.Income)
## 0% 25% 50% 75% 100%
## 13996.50 47031.80 57012.30 65470.64 79484.80
quantile(advert$Daily.Internet.Usage)
## 0% 25% 50% 75% 100%
## 104.7800 138.8300 183.1300 218.7925 269.9600
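These quartiles also explain the boxplot findings. Under the usual 1.5 × IQR rule, a value is an outlier if it falls below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. The sketch below applies that rule to the quartiles printed above; the data frame `q` and its column names are introduced here for illustration. Note that `boxplot()` computes its hinges slightly differently from `quantile()`, so treat this as an approximation of what the plots show.

```r
# Minimum, quartiles and maximum copied from the quantile() output above
q <- data.frame(
  column = c("Age", "Daily.Time.Spent.on.Site", "Area.Income", "Daily.Internet.Usage"),
  min    = c(19, 32.60, 13996.50, 104.78),
  q1     = c(29, 51.36, 47031.80, 138.83),
  q3     = c(42, 78.5475, 65470.64, 218.7925),
  max    = c(61, 91.43, 79484.80, 269.96)
)

# 1.5 * IQR fences
q$iqr         <- q$q3 - q$q1
q$lower_fence <- q$q1 - 1.5 * q$iqr
q$upper_fence <- q$q3 + 1.5 * q$iqr

# A column has outliers if its extremes fall outside the fences
q$low_outliers  <- q$min < q$lower_fence
q$high_outliers <- q$max > q$upper_fence

print(q[, c("column", "lower_fence", "upper_fence", "low_outliers", "high_outliers")])
```

Only Area.Income has a minimum below its lower fence, which is consistent with the boxplots: the outliers are unusually low incomes, and the other three columns have none.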
Variance of the numeric columns.
This shows how the data values are dispersed around the mean.
var(advert$Age)
## [1] 77.18611
var(advert$Daily.Time.Spent.on.Site)
## [1] 251.3371
var(advert$Area.Income)
## [1] 179952406
var(advert$Daily.Internet.Usage)
## [1] 1927.415
Finding the standard deviation of the columns.
sd(advert$Age)
## [1] 8.785562
sd(advert$Daily.Time.Spent.on.Site)
## [1] 15.85361
sd(advert$Area.Income)
## [1] 13414.63
sd(advert$Daily.Internet.Usage)
## [1] 43.90234
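The standard deviation is the square root of the variance, and both `var()` and `sd()` use the sample formula with n − 1 in the denominator. The sketch below checks this by hand on a small vector invented for illustration (`x` is not from the dataset), then confirms that the printed Age figures are consistent with each other.

```r
# Toy vector, invented for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)
manual_sd  <- sqrt(manual_var)

print(c(manual_var = manual_var, builtin_var = var(x)))
print(c(manual_sd = manual_sd, builtin_sd = sd(x)))

# Consistency check on the Age figures printed in this report
age_var <- 77.18611
age_sd  <- 8.785562
print(sqrt(age_var) - age_sd)  # should be close to zero
```

Because the standard deviation is in the same units as the column itself, it is usually the easier of the two to interpret: ages vary by roughly 9 years around the mean.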
Frequency distribution of the age column
table(advert$Age)
##
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
## 6 6 6 13 19 21 27 37 33 48 48 39 60 38 43 39 39 50 36 37 30 36 32 26 23 21
## 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61
## 30 18 13 16 18 20 12 15 10 9 7 2 6 4 2 4 1
# Most respondents fall in the age bracket 25-42. The most common age is 31, with 60 respondents.
Plotting histograms for the columns
hist(advert$Age, col = "Cyan")
# Most respondents fall in the age bracket 25-40.
hist(advert$Area.Income, col = "Purple")
# The respondents mostly earn between 50K and 70K.
hist(advert$Daily.Time.Spent.on.Site, col = "gold")
hist(advert$Daily.Internet.Usage, col = "pink")
### Plotting count plots for Categorical data
library(ggplot2)
ggplot(advert, aes(x=Male)) + geom_bar(fill=rgb(0.4,0.1,0.5))
Slightly more female than male users visited the site: the mean of the Male
column is 0.481, so 481 of the 1,000 users are male and 519 are female.
ggplot(advert, aes(x=factor(`Clicked.on.Ad`))) + geom_bar( fill=rgb(0.6,0.4,0.4))
The number of users who clicked on the advert is equal to the number who did
not (the mean of Clicked.on.Ad is 0.5).
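The counts behind both bar charts can be recovered from the column means reported earlier, because the mean of a 0/1 column is the share of ones. A minimal sketch, using the figures printed by `str()` and `colMeans()` (the variable names are introduced here for illustration):

```r
# Figures copied from the output earlier in this report
n_rows       <- 1000    # str(advert): 1000 obs.
mean_male    <- 0.4810  # colMeans(): Male
mean_clicked <- 0.5000  # colMeans(): Clicked.on.Ad

# Share of ones times number of rows gives the count of ones
n_male        <- round(n_rows * mean_male)
n_female      <- n_rows - n_male
n_clicked     <- round(n_rows * mean_clicked)
n_not_clicked <- n_rows - n_clicked

print(c(male = n_male, female = n_female))
print(c(clicked = n_clicked, not_clicked = n_not_clicked))
```

With the data loaded, `table(advert$Male)` and `table(advert$Clicked.on.Ad)` would give the same counts directly.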
library(ggplot2)
# Clicked.on.Ad is stored as an integer, so it is converted to a factor to get a discrete fill
ggplot(data = advert, aes(x = Area.Income, fill = factor(Clicked.on.Ad))) +
  geom_histogram(bins = 20, col = "orange") +
  labs(title = "Income Distribution", x = "Area Income", y = "Frequency", fill = "Clicked on Ad") +
  scale_fill_brewer(palette = "Set1")
ggplot(data = advert, aes(x = Age, fill = factor(Clicked.on.Ad))) +
  geom_histogram(bins = 20, col = "orange") +
  labs(title = "Age Distribution", x = "Age", y = "Frequency", fill = "Clicked on Ad") +
  scale_fill_brewer(palette = "Set1")
ggplot(data = advert, aes(x = Daily.Time.Spent.on.Site, fill = factor(Clicked.on.Ad))) +
  geom_histogram(bins = 20, col = "orange") +
  labs(title = "Daily Time Spent on Site", x = "Time Spent on Site", y = "Frequency", fill = "Clicked on Ad") +
  scale_fill_brewer(palette = "Set1")
Covariance is a statistical representation of the degree to which two variables vary together.
cov(advert$Age, advert$Daily.Time.Spent.on.Site)
## [1] -46.17415
# There is a negative relationship between age and time spent on site: as age increases, daily time spent on the site tends to decrease, and vice versa.
cov(advert$Age, advert$Daily.Internet.Usage)
## [1] -141.6348
# There is a negative relationship between age and daily internet usage as well.
cov(advert$Area.Income, advert$Daily.Time.Spent.on.Site)
## [1] 66130.81
# There is a positive relationship between income and daily time spent on site: higher income tends to go with more time on the site. The size of a covariance depends on the units of the variables, so it does not measure strength on its own; the correlation of 0.31 for this pair indicates a moderate relationship.
cov(advert$Age, advert$Area.Income)
## [1] -21520.93
# There is a negative relationship between the age and income variables.
cor(advert$Age, advert$Daily.Time.Spent.on.Site)
## [1] -0.3315133
cor(advert$Age, advert$Daily.Internet.Usage)
## [1] -0.3672086
cor(advert$Area.Income, advert$Daily.Internet.Usage)
## [1] 0.3374955
cor(advert$Area.Income, advert$Daily.Time.Spent.on.Site)
## [1] 0.3109544
cor(advert$Age, advert$Area.Income)
## [1] -0.182605
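Covariance and correlation are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations, which removes the units and bounds the result between -1 and 1. The sketch below recomputes two of the correlations from the `cov()` and `sd()` values printed in this report (the variable names are introduced here for illustration).

```r
# Values copied from the cov() and sd() output printed in this report
cov_age_time    <- -46.17415
cov_income_time <- 66130.81
sd_age          <- 8.785562
sd_time         <- 15.85361
sd_income       <- 13414.63

# Pearson correlation = covariance / (sd_x * sd_y)
r_age_time    <- cov_age_time / (sd_age * sd_time)
r_income_time <- cov_income_time / (sd_income * sd_time)

print(c(age_time = r_age_time, income_time = r_income_time))
```

Both results agree with the `cor()` output up to rounding of the printed inputs. The income covariance is large only because income is measured in the tens of thousands; once rescaled, its relationship with time on site is of similar strength to that of age.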
cor(advert[, c("Age", "Daily.Time.Spent.on.Site", "Daily.Internet.Usage")])
## Age Daily.Time.Spent.on.Site
## Age 1.0000000 -0.3315133
## Daily.Time.Spent.on.Site -0.3315133 1.0000000
## Daily.Internet.Usage -0.3672086 0.5186585
## Daily.Internet.Usage
## Age -0.3672086
## Daily.Time.Spent.on.Site 0.5186585
## Daily.Internet.Usage 1.0000000
cor(advert[, unlist(lapply(advert, is.numeric))])
## Daily.Time.Spent.on.Site Age Area.Income
## Daily.Time.Spent.on.Site 1.00000000 -0.33151334 0.310954413
## Age -0.33151334 1.00000000 -0.182604955
## Area.Income 0.31095441 -0.18260496 1.000000000
## Daily.Internet.Usage 0.51865848 -0.36720856 0.337495533
## Male -0.01895085 -0.02104406 0.001322359
## Clicked.on.Ad -0.74811656 0.49253127 -0.476254628
## Daily.Internet.Usage Male Clicked.on.Ad
## Daily.Time.Spent.on.Site 0.51865848 -0.018950855 -0.74811656
## Age -0.36720856 -0.021044064 0.49253127
## Area.Income 0.33749553 0.001322359 -0.47625463
## Daily.Internet.Usage 1.00000000 0.028012326 -0.78653918
## Male 0.02801233 1.000000000 -0.03802747
## Clicked.on.Ad -0.78653918 -0.038027466 1.00000000
Plotting a correlation heatmap for the numerical variables
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggcorrplot)
# Selecting the numerical variables of the dataset
# (dplyr::select is written out in full because MASS masks select)
corr <- dplyr::select(advert, Age, Area.Income, Clicked.on.Ad, Daily.Internet.Usage, Daily.Time.Spent.on.Site, Male)
# Plotting the correlation heatmap
ggcorrplot(cor(corr), hc.order = FALSE, type = "lower", lab = TRUE,
           ggtheme = ggplot2::theme_gray,
           colors = c("#00798c", "violet", "#edae49"))
Here, it was noted that:
There was a strong negative correlation (-0.79) between the Daily Internet Usage and Clicked on Ad variables. This means that the higher one's daily internet usage, the less likely they are to click on the blog ads. The same can be said for the Daily Time Spent on Site and Clicked on Ad variables (-0.75).
The Clicked on Ad variable had a moderate positive correlation (0.49) with the Age variable: older users were more likely to click on the ad.
The Clicked on Ad variable was also moderately negatively correlated (-0.48) with Area Income: the higher one's income, the less likely they were to click on the ad.
Scatter plots are used when we want to see a graphical representation of two different variables. They show how the variables are correlated.
Let’s plot a scatter plot for area income and age.
ggplot(advert, aes(Area.Income,Age))+geom_point(aes(colour= factor(`Clicked.on.Ad`)))+
labs(title = "Scatter Plot of Age Distribution vs Area Income",
x = "Area Income",
y = "Age")
The scatter plot of Area Income against Age showed that the majority
of the users who did not click on the ad were high income earners,
and many of these were aged between 20 and 40 years.
Scatter plot for Income and Daily Internet Usage
ggplot(advert, aes(Area.Income, Daily.Internet.Usage))+
geom_point(aes(colour= factor(`Clicked.on.Ad`)))+
labs(title = "Scatter Plot of Area Income vs Daily Internet Usage",
x = "Area Income",
y = "Daily Internet Usage")
Scatter Plot of Age Distribution vs Time Spent on Site
ggplot(advert, aes(Age, Daily.Time.Spent.on.Site))+
geom_point(aes(colour= factor(`Clicked.on.Ad`)))+
labs(title = "Scatter Plot of Age Distribution vs Time Spent on Site",
x = "Age",
y = "Time Spent on Site")
Plotting Age against Time Spent on Site, we see that
younger users click on ads less often despite spending
significant amounts of time on the site.
One possible reason is that younger people are more tech savvy and therefore more likely to detect ads and avoid them while using the internet than their older counterparts.
Scatter plot for Income Distribution and Daily time spent on site.
ggplot(advert, aes(Daily.Time.Spent.on.Site, Area.Income))+
geom_point(aes(colour= factor(`Clicked.on.Ad`)))+
labs(title = "Time spent on site vs Income",
x = "Daily Time Spent on Site",
y = "Income Distribution")
The people who were least likely to click on the ad were the higher
income earners, despite the fact that they seemed to spend
over an hour a day on the site.
The same holds for Daily Internet Usage, the total internet usage per day. When it is plotted against income, we see that those who spend over 200 minutes online per day and earn more than 50,000 are the least likely to click on ads.
Scatter plot for Age and Daily Internet Usage
ggplot(advert, aes(Age, Daily.Internet.Usage))+
geom_point(aes(colour= factor(`Clicked.on.Ad`)))+
labs(title = "Scatter Plot of Age Distribution vs Daily Usage",
x = "Age",
y = "Daily Usage")
Focusing on users in lower-income areas, i.e. below 60,000, would be more beneficial, as these consumers click on adverts more.
Users aged over 35 should be targeted more, as they were more likely to click on the ad.
Finally, users who spend less time on the site and on the internet in general would be a better demographic for the ads.
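One way to back recommendations like these with numbers is to compare click rates across the suggested cut-offs. The sketch below is only an illustration: `toy` is a small data frame invented here, not the advertising dataset, and the cut-offs of 35 years and 60,000 are taken from the recommendations above. Because Clicked.on.Ad is 0/1, its mean within a group is that group's click rate.

```r
# Toy data, invented for illustration only (not the advertising dataset)
toy <- data.frame(
  Age           = c(22, 27, 31, 34, 38, 41, 45, 52),
  Area.Income   = c(68000, 72000, 61000, 55000, 48000, 39000, 64000, 30000),
  Clicked.on.Ad = c(0, 0, 0, 1, 1, 1, 0, 1)
)

# Split users at the cut-offs used in the recommendations
toy$Age.Group <- cut(toy$Age, breaks = c(-Inf, 35, Inf),
                     labels = c("35 and under", "over 35"))
toy$Income.Group <- cut(toy$Area.Income, breaks = c(-Inf, 60000, Inf),
                        labels = c("under 60,000", "60,000 and over"),
                        right = FALSE)

# Mean of a 0/1 column within each group = click rate of that group
rate_by_age    <- tapply(toy$Clicked.on.Ad, toy$Age.Group, mean)
rate_by_income <- tapply(toy$Clicked.on.Ad, toy$Income.Group, mean)

print(rate_by_age)
print(rate_by_income)
```

Running the same two `tapply()` calls on `advert` would show how large the difference in click rate actually is on either side of each cut-off.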