1. Defining the question.

Which individuals among a blog’s audience are likely to click on ads within the blog?

2. Defining the metrics of success

The analysis will be declared a success once we get the characteristics of individuals who are more likely to click on ads.

3. Defining the context.

A Kenyan entrepreneur has created an online cryptography course and would want to advertise it on her blog. She currently targets audiences originating from various countries. In the past, she ran ads to advertise a related course on the same blog and collected data in the process. She would now like to employ your services as a Data Science Consultant to help her identify which individuals are most likely to click on her ads.

4. Experimental Design.

  1. Exploratory Data Analysis
  2. Data Cleaning
  3. Univariate Analysis
  4. Bivariate Analysis
  5. Supervised Learning
  6. Conclusion
  7. Recommendation

5. Appropriateness of the available data.

The dataset provided contains appropriate variables that will provide the insigts we need. The dataset can be found here

6. Reading the data

## Installing the required packages
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(caretEnsemble)
## 
## Attaching package: 'caretEnsemble'
## The following object is masked from 'package:ggplot2':
## 
##     autoplot
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(rpart)
library(randomForest)
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
## 
##     outlier
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(superml) # for label encoding
## Loading required package: R6
library(e1071) # Holds the Naive Bayes function.

data <- read.csv("http://bit.ly/IPAdvertisingData")

7. Data Understanding.

## 7.1 Previewing the head of the dataset
head(data)
##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1                    68.95  35    61833.90               256.09
## 2                    80.23  31    68441.85               193.77
## 3                    69.47  26    59785.94               236.50
## 4                    74.15  29    54806.18               245.89
## 5                    68.37  35    73889.99               225.58
## 6                    59.99  23    59761.56               226.74
##                           Ad.Topic.Line           City Male    Country
## 1    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2    Monitored national standardization      West Jodi    1      Nauru
## 3      Organic bottom-line service-desk       Davidton    0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5         Robust logistical utilization   South Manuel    0    Iceland
## 6       Sharable client-driven software      Jamieberg    1     Norway
##             Timestamp Clicked.on.Ad
## 1 2016-03-27 00:53:11             0
## 2 2016-04-04 01:39:02             0
## 3 2016-03-13 20:35:42             0
## 4 2016-01-10 02:31:19             0
## 5 2016-06-03 03:36:18             0
## 6 2016-05-19 14:30:17             0
## 7.2 Previewing the tail of the dataset
tail(data)
##      Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 995                     43.70  28    63126.96               173.01
## 996                     72.97  30    71384.57               208.58
## 997                     51.30  45    67782.17               134.42
## 998                     51.63  51    42415.72               120.37
## 999                     55.55  19    41920.79               187.95
## 1000                    45.01  26    29875.80               178.35
##                             Ad.Topic.Line          City Male
## 995         Front-line bifurcated ability  Nicholasland    0
## 996         Fundamental modular algorithm     Duffystad    1
## 997       Grass-roots cohesive monitoring   New Darlene    1
## 998          Expanded intangible solution South Jessica    1
## 999  Proactive bandwidth-monitored policy   West Steven    0
## 1000      Virtual 5thgeneration emulation   Ronniemouth    0
##                     Country           Timestamp Clicked.on.Ad
## 995                 Mayotte 2016-04-04 03:57:48             1
## 996                 Lebanon 2016-02-11 21:49:00             1
## 997  Bosnia and Herzegovina 2016-04-22 02:07:01             1
## 998                Mongolia 2016-02-01 17:24:57             1
## 999               Guatemala 2016-03-24 02:35:54             0
## 1000                 Brazil 2016-06-03 21:43:21             1
## 7.3 Checking the data types of the variables
str(data)
## 'data.frame':    1000 obs. of  10 variables:
##  $ Daily.Time.Spent.on.Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area.Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily.Internet.Usage    : num  256 194 236 246 226 ...
##  $ Ad.Topic.Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : chr  "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
##  $ Male                    : int  0 1 0 1 0 1 0 1 1 1 ...
##  $ Country                 : chr  "Tunisia" "Nauru" "San Marino" "Italy" ...
##  $ Timestamp               : chr  "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
##  $ Clicked.on.Ad           : int  0 0 0 0 0 0 0 1 0 0 ...

8. Data Cleaning

8.1 Dealing with duplicates

duplicated_rows <- data[duplicated(data),]
duplicated_rows
##  [1] Daily.Time.Spent.on.Site Age                      Area.Income             
##  [4] Daily.Internet.Usage     Ad.Topic.Line            City                    
##  [7] Male                     Country                  Timestamp               
## [10] Clicked.on.Ad           
## <0 rows> (or 0-length row.names)

There are no duplicated rows in the data set.

7.2 Dealing with missing data.

sum(is.na(data))
## [1] 0

The data set has no missing data

9 Exploratory Data Analysis.

9.1 Univariate Analysis.

9.1.1 Numerical Variables

# A brief summary of our dataset
 # the psych package gives more statistical summaries.
library(psych)
describe(data)
##                          vars    n     mean       sd   median  trimmed      mad
## Daily.Time.Spent.on.Site    1 1000    65.00    15.85    68.22    65.74    17.92
## Age                         2 1000    36.01     8.79    35.00    35.51     8.90
## Area.Income                 3 1000 55000.00 13414.63 57012.30 56038.94 13316.62
## Daily.Internet.Usage        4 1000   180.00    43.90   183.13   179.99    58.61
## Ad.Topic.Line*              5 1000   500.50   288.82   500.50   500.50   370.65
## City*                       6 1000   487.32   279.31   485.50   487.51   356.57
## Male                        7 1000     0.48     0.50     0.00     0.48     0.00
## Country*                    8 1000   116.41    69.94   114.50   115.82    89.70
## Timestamp*                  9 1000   500.50   288.82   500.50   500.50   370.65
## Clicked.on.Ad              10 1000     0.50     0.50     0.50     0.50     0.74
##                               min      max    range  skew kurtosis     se
## Daily.Time.Spent.on.Site    32.60    91.43    58.83 -0.37    -1.10   0.50
## Age                         19.00    61.00    42.00  0.48    -0.41   0.28
## Area.Income              13996.50 79484.80 65488.30 -0.65    -0.11 424.21
## Daily.Internet.Usage       104.78   269.96   165.18 -0.03    -1.28   1.39
## Ad.Topic.Line*               1.00  1000.00   999.00  0.00    -1.20   9.13
## City*                        1.00   969.00   968.00  0.00    -1.19   8.83
## Male                         0.00     1.00     1.00  0.08    -2.00   0.02
## Country*                     1.00   237.00   236.00  0.08    -1.23   2.21
## Timestamp*                   1.00  1000.00   999.00  0.00    -1.20   9.13
## Clicked.on.Ad                0.00     1.00     1.00  0.00    -2.00   0.02

9.1.2 Categorical Variables

# Getting the modes
# Creating a function to get the modes
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Mode of the city
city.mode <- getmode(data$City)
city.mode
## [1] "Lisamouth"
# Mode of age
age.mode <- getmode(data$Age)
age.mode
## [1] 31
# Mode of Country
country.mode <- getmode(data$Country)
country.mode
## [1] "Czech Republic"
# Mode of Gender
gender.mode <- getmode(data$Male)
gender.mode
## [1] 0

The gender that clicked on the ads the most were female.

# Mode of Clicked on Ad
clicked.mode <- getmode(data$Clicked.on.Ad)
clicked.mode
## [1] 0

Most people did not click on the ad

9.1.3 Graphical Anlaysis

a) Boxplots

# boxplots
library(tidyverse)
data%>%
  ggplot(aes(data$Age, data$Daily.Time.Spent.on.Site))+
  geom_boxplot()+
  geom_point(alpha=0.5, aes(colour=data$Clicked.on.Ad,
                            size= data$Area.Income))+
  facet_wrap(~Male)+
  coord_flip()+
  theme_bw()+
  labs(title="Daily Time Spent on Site by Age")
## Warning: Use of `data$Age` is discouraged. Use `Age` instead.
## Warning: Use of `data$Daily.Time.Spent.on.Site` is discouraged. Use
## `Daily.Time.Spent.on.Site` instead.
## Warning: Use of `data$Clicked.on.Ad` is discouraged. Use `Clicked.on.Ad`
## instead.
## Warning: Use of `data$Area.Income` is discouraged. Use `Area.Income` instead.
## Warning: Use of `data$Age` is discouraged. Use `Age` instead.
## Warning: Use of `data$Daily.Time.Spent.on.Site` is discouraged. Use
## `Daily.Time.Spent.on.Site` instead.
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

Part of the audience that was young(below 40) and that spent more time on the site did not click on the ads.

b) Histograms

# Histogram of age
data%>%
  ggplot(aes(Age))+
  geom_histogram(binwidth = 2, fill = "#0C8E7A")+
  theme_bw()+
  labs(title="Age")

# Histogram of Daily Internet Usage
library(tidyverse)
data%>%
  ggplot(aes(data$Daily.Internet.Usage))+
  geom_histogram(binwidth = 4, fill = "#0C8E7A")+
  theme_bw()+
  labs(title="Daily Internet Usage")
## Warning: Use of `data$Daily.Internet.Usage` is discouraged. Use
## `Daily.Internet.Usage` instead.

# Histogram of Daily Time Spent on Site 
data%>%
ggplot(aes(data$'Daily.Time.Spent.on.Site'))+ 
  geom_histogram(fill="#A2DC58")+
  theme_bw()+
  labs(title= "Histogram of Daily Time Spent on Site",
       x = "Daily Time Spent on Site",
       y = "Count")
## Warning: Use of `data$Daily.Time.Spent.on.Site` is discouraged. Use
## `Daily.Time.Spent.on.Site` instead.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

9.2 Bivariate and Multivariate Analysis

9.2.1 Covariance

print(cor(data$Daily.Time.Spent.on.Site, data$Age))
## [1] -0.3315133
print("Age and daily time spent have a low negative correlation")
## [1] "Age and daily time spent have a low negative correlation"
#
print(cor(data$Age, data$Clicked.on.Ad))
## [1] 0.4925313
print("Age and clicked on data have a medium positve covariance")
## [1] "Age and clicked on data have a medium positve covariance"
#
print(cor(data$Area.Income, data$Clicked.on.Ad))
## [1] -0.4762546
print("Area income has a medium negative correlation with clicked on ad ")
## [1] "Area income has a medium negative correlation with clicked on ad "
#
print(cor(data$Daily.Internet.Usage, data$Clicked.on.Ad))
## [1] -0.7865392
print("Daily internet usage has a high negative correlation with clicked on ad")
## [1] "Daily internet usage has a high negative correlation with clicked on ad"
#
print(cor(data$Daily.Time.Spent.on.Site, data$Clicked.on.Ad))
## [1] -0.7481166
print("Daily time spent on site has a negative correlation with clicked on ad")
## [1] "Daily time spent on site has a negative correlation with clicked on ad"

9.2.2 Scatterplot

# Scatterpot of daily time spent on site per Area Income
data%>%
  ggplot(aes(Daily.Time.Spent.on.Site, Area.Income))+
  geom_point(alpha=0.5,aes(size=Age, 
                 colour= Clicked.on.Ad))+
  facet_wrap(~Male)+
  coord_flip()+
  theme_bw()+
  labs(title="Daily Time Spent on Site per Area Income")

People with less income clicked more on ads than people with high income. Both genders had nearly the same behaviors.

10 Supervised Learning

10.1 Naive Bayes

# Label encoding some of the columns
# initializing the label encoder
label <- LabelEncoder$new()
#
# encoding
data$Ad.Topic.Line <- label$fit_transform(data$Ad.Topic.Line)
data$City <- label$fit_transform(data$City)
data$Country <- label$fit_transform(data$Country)
data$Timestamp <- label$fit_transform(data$Timestamp)
head(data)
##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage Ad.Topic.Line
## 1                    68.95  35    61833.90               256.09             0
## 2                    80.23  31    68441.85               193.77             1
## 3                    69.47  26    59785.94               236.50             2
## 4                    74.15  29    54806.18               245.89             3
## 5                    68.37  35    73889.99               225.58             4
## 6                    59.99  23    59761.56               226.74             5
##   City Male Country Timestamp Clicked.on.Ad
## 1    0    0       0         0             0
## 2    1    1       1         1             0
## 3    2    0       2         2             0
## 4    3    1       3         3             0
## 5    4    0       4         4             0
## 6    5    1       5         5             0
# Converting target variable into a categorical variable
data$Clicked.on.Ad <- factor(data$Clicked.on.Ad, levels=c("1","0"), labels= c("True","False"))
head(data)
##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage Ad.Topic.Line
## 1                    68.95  35    61833.90               256.09             0
## 2                    80.23  31    68441.85               193.77             1
## 3                    69.47  26    59785.94               236.50             2
## 4                    74.15  29    54806.18               245.89             3
## 5                    68.37  35    73889.99               225.58             4
## 6                    59.99  23    59761.56               226.74             5
##   City Male Country Timestamp Clicked.on.Ad
## 1    0    0       0         0         False
## 2    1    1       1         1         False
## 3    2    0       2         2         False
## 4    3    1       3         3         False
## 5    4    0       4         4         False
## 6    5    1       5         5         False
# Splitting data into training and test datasets

indxTrain <- createDataPartition(y = data$Clicked.on.Ad ,p = 0.70,list = FALSE)
training <- data[indxTrain,]
testing <- data[-indxTrain,]
# Checking dimensions of the split
# ---
#
prop.table(table(data$Clicked.on.Ad)) * 100
## 
##  True False 
##    50    50
prop.table(table(training$Clicked.on.Ad)) * 100
## 
##  True False 
##    50    50
prop.table(table(testing$Clicked.on.Ad)) * 100
## 
##  True False 
##    50    50
# Comparing the outcome of the training and testing phase
# Creating objects x which holds the predictor variables and y which holds the response variables
#
x = training[,1:9]
y = training$Clicked.on.Ad
# Fitting the model
set.seed(100)
model <- naiveBayes(Clicked.on.Ad ~Daily.Time.Spent.on.Site+Age+
Area.Income+Daily.Internet.Usage+City+Male+Country,
data = training)
model
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##  True False 
##   0.5   0.5 
## 
## Conditional probabilities:
##        Daily.Time.Spent.on.Site
## Y           [,1]      [,2]
##   True  53.48157 12.823274
##   False 76.95651  7.779927
## 
##        Age
## Y           [,1]     [,2]
##   True  40.65714 8.985958
##   False 31.81429 6.245899
## 
##        Area.Income
## Y           [,1]      [,2]
##   True  48373.41 14665.832
##   False 61257.96  8723.281
## 
##        Daily.Internet.Usage
## Y           [,1]     [,2]
##   True  146.1902 29.78609
##   False 214.8703 24.07377
## 
##        City
## Y           [,1]     [,2]
##   True  491.6486 291.3963
##   False 472.2514 272.7930
## 
##        Male
## Y            [,1]      [,2]
##   True  0.4771429 0.5001923
##   False 0.4685714 0.4997257
## 
##        Country
## Y           [,1]     [,2]
##   True  105.7943 63.36873
##   False 107.9286 67.43948
# Model Evaluation
# Predicting our testing set
# 
Predict <- predict(model,newdata = testing )

# Getting the confusion matrix to see accuracy value and other parameter values
#
confusionMatrix(Predict, testing$Clicked.on.Ad)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction True False
##      True   143     3
##      False    7   147
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.9396, 0.9839)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9333          
##                                           
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.9533          
##             Specificity : 0.9800          
##          Pos Pred Value : 0.9795          
##          Neg Pred Value : 0.9545          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4767          
##    Detection Prevalence : 0.4867          
##       Balanced Accuracy : 0.9667          
##                                           
##        'Positive' Class : True            
## 

The model has a high accuracy of 95.67% and most of the records have been correctly classified.

10.2 KNN

#calling the relevant library
library(class)

# fitting the base model
knn.model <- knn(train = training[,c(-10)],
                 test = testing[,c(-10)],
                 cl = training$Clicked.on.Ad,
                 k = 2)

# Getting the confusion matrix
confusionMatrix(knn.model, testing$Clicked.on.Ad)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction True False
##      True    92    56
##      False   58    94
##                                           
##                Accuracy : 0.62            
##                  95% CI : (0.5624, 0.6752)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 1.915e-05       
##                                           
##                   Kappa : 0.24            
##                                           
##  Mcnemar's Test P-Value : 0.9254          
##                                           
##             Sensitivity : 0.6133          
##             Specificity : 0.6267          
##          Pos Pred Value : 0.6216          
##          Neg Pred Value : 0.6184          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3067          
##    Detection Prevalence : 0.4933          
##       Balanced Accuracy : 0.6200          
##                                           
##        'Positive' Class : True            
## 

The accuracy dropped to 0.6067 and many records were wrongly classified.

# Find optimal k values
i=1
k.optm=1
for (i in 1:30){
knn.model2 <- knn(train=training[,c(-10)], test=testing[,c(-10)],
               cl=training$Clicked.on.Ad, k=i)
k.optm[i] <- 100 * sum(testing$Clicked.on.Ad == knn.model2)/
  NROW(testing$Clicked.on.Ad)

k=i
cat(k,'=',k.optm[i],'\n')
}
## 1 = 63 
## 2 = 63.66667 
## 3 = 64 
## 4 = 66.33333 
## 5 = 66 
## 6 = 64.33333 
## 7 = 65.33333 
## 8 = 66.33333 
## 9 = 66.33333 
## 10 = 67 
## 11 = 67.66667 
## 12 = 68.33333 
## 13 = 68.33333 
## 14 = 68.66667 
## 15 = 67 
## 16 = 66.66667 
## 17 = 67.33333 
## 18 = 68 
## 19 = 68.33333 
## 20 = 68 
## 21 = 68.66667 
## 22 = 68 
## 23 = 68.66667 
## 24 = 69 
## 25 = 69 
## 26 = 69 
## 27 = 69.33333 
## 28 = 68.66667 
## 29 = 69.33333 
## 30 = 68.33333

The optimal k value in our KNN model is 28 with an accuracy of 69.33.

11. Challenging our solution

Decision Tree

# Calling the required libraries
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: sandwich
## 
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
## 
##     boundary
library(rpart)
library(rpart.plot)
#
# Fitting the model
model3 <- rpart(Clicked.on.Ad ~ ., data = training, method = 'class')
#
# Plotting the decision tree
rpart.plot(model3, extra = 100)

# Model Evaluation
# Predicting our testing set
# 
Predict3 <- predict(model3,newdata = testing, type= "class" )

# Getting the confusion matrix to see accuracy value and other parameter values
#
confusionMatrix(Predict3, testing$Clicked.on.Ad)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction True False
##      True   141     4
##      False    9   146
##                                          
##                Accuracy : 0.9567         
##                  95% CI : (0.927, 0.9767)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9133         
##                                          
##  Mcnemar's Test P-Value : 0.2673         
##                                          
##             Sensitivity : 0.9400         
##             Specificity : 0.9733         
##          Pos Pred Value : 0.9724         
##          Neg Pred Value : 0.9419         
##              Prevalence : 0.5000         
##          Detection Rate : 0.4700         
##    Detection Prevalence : 0.4833         
##       Balanced Accuracy : 0.9567         
##                                          
##        'Positive' Class : True           
## 

The Decision tree model has an accuracy of 94.67%.

12. Conclusions

Model Conclusions: Naive Bayes has an accuracy of 95.67%, KNN has 69.335 and Decision Tree has 94.67%.

#13. Recommendations 1. Target people from the lower income areas more. 2. Increase the amount of content on the site so as to increase the amount of time spent on the site by people. 3. Target the site on older people since age has a positive correlation with ad clicking. 4. Naive Bayes should be used when making predictions since it has the highest accuracy.