1. DEFINING THE QUESTION

a) Specifying the Question

Which individuals are most likely to click on an ad?

b) Defining the Metric of Success

Give conclusive recommendations based on the EDA results and the modeling results; these should include the factors that would make a person click on the advertisement.

c) Understanding the Context

A Kenyan entrepreneur has created an online cryptography course and would want to advertise it on her blog. She currently targets audiences originating from various countries. In the past, she ran ads to advertise a related course on the same blog and collected data in the process. She would now like to identify which individuals are most likely to click on her ads.

d) Recording the Experimental Design

  1. Load, preview and understand the data
  2. Perform data cleaning and manipulation
  3. Perform Exploratory Data Analysis
  4. Conclusion
  5. Recommendations
  6. Follow up questions

e) Data Relevance

The dataset provided will be considered relevant if we are able to show that the factors given actually lead a person to click on an ad.

2. LOADING AND PREVIEWING THE DATA

Loading the dataset

library("data.table")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()   masks data.table::between()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::first()     masks data.table::first()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::last()      masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()
ads <- fread('http://bit.ly/IPAdvertisingData')

Previewing the dataset

# Determining the number of rows and columns in the dataset
cat('Number of rows are', nrow(ads), 'and the number of columns are', ncol(ads))
## Number of rows are 1000 and the number of columns are 10
# Preview the top of the dataset
head(ads)
##    Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1:                    68.95  35    61833.90               256.09
## 2:                    80.23  31    68441.85               193.77
## 3:                    69.47  26    59785.94               236.50
## 4:                    74.15  29    54806.18               245.89
## 5:                    68.37  35    73889.99               225.58
## 6:                    59.99  23    59761.56               226.74
##                            Ad Topic Line           City Male    Country
## 1:    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2:    Monitored national standardization      West Jodi    1      Nauru
## 3:      Organic bottom-line service-desk       Davidton    0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5:         Robust logistical utilization   South Manuel    0    Iceland
## 6:       Sharable client-driven software      Jamieberg    1     Norway
##              Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11             0
## 2: 2016-04-04 01:39:02             0
## 3: 2016-03-13 20:35:42             0
## 4: 2016-01-10 02:31:19             0
## 5: 2016-06-03 03:36:18             0
## 6: 2016-05-19 14:30:17             0
# Previewing the bottom of the dataset
tail(ads)
##    Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1:                    43.70  28    63126.96               173.01
## 2:                    72.97  30    71384.57               208.58
## 3:                    51.30  45    67782.17               134.42
## 4:                    51.63  51    42415.72               120.37
## 5:                    55.55  19    41920.79               187.95
## 6:                    45.01  26    29875.80               178.35
##                           Ad Topic Line          City Male
## 1:        Front-line bifurcated ability  Nicholasland    0
## 2:        Fundamental modular algorithm     Duffystad    1
## 3:      Grass-roots cohesive monitoring   New Darlene    1
## 4:         Expanded intangible solution South Jessica    1
## 5: Proactive bandwidth-monitored policy   West Steven    0
## 6:      Virtual 5thgeneration emulation   Ronniemouth    0
##                   Country           Timestamp Clicked on Ad
## 1:                Mayotte 2016-04-04 03:57:48             1
## 2:                Lebanon 2016-02-11 21:49:00             1
## 3: Bosnia and Herzegovina 2016-04-22 02:07:01             1
## 4:               Mongolia 2016-02-01 17:24:57             1
## 5:              Guatemala 2016-03-24 02:35:54             0
## 6:                 Brazil 2016-06-03 21:43:21             1
# checking the structure of the dataset and datatype of each column
str(ads)
## Classes 'data.table' and 'data.frame':   1000 obs. of  10 variables:
##  $ Daily Time Spent on Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily Internet Usage    : num  256 194 236 246 226 ...
##  $ Ad Topic Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : chr  "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
##  $ Male                    : int  0 1 0 1 0 1 0 1 1 1 ...
##  $ Country                 : chr  "Tunisia" "Nauru" "San Marino" "Italy" ...
##  $ Timestamp               : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
##  $ Clicked on Ad           : int  0 0 0 0 0 0 0 1 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>

All the columns have the appropriate datatypes. However, I will convert the character and encoded columns to factors, because factors store the data as a vector of integer values with character labels, which is more convenient for analysis and modeling.

# converting the character columns to factors
ads <- as.data.frame(unclass(ads),
                     stringsAsFactors = TRUE)
# converting the encoded columns(male and clicked on ad) to factors
names <- c(7,10)
ads[,names] <- lapply(ads[,names] , factor)
# checking the changes made
str(ads)
## 'data.frame':    1000 obs. of  10 variables:
##  $ Daily.Time.Spent.on.Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area.Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily.Internet.Usage    : num  256 194 236 246 226 ...
##  $ Ad.Topic.Line           : Factor w/ 1000 levels "Adaptive 24hour Graphic Interface",..: 92 465 567 904 767 806 223 724 108 455 ...
##  $ City                    : Factor w/ 969 levels "Adamsbury","Adamside",..: 962 904 112 940 806 283 47 672 885 713 ...
##  $ Male                    : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 2 2 ...
##  $ Country                 : Factor w/ 237 levels "Afghanistan",..: 216 148 185 104 97 159 146 13 83 79 ...
##  $ Timestamp               : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
##  $ Clicked.on.Ad           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...

All the changes to the column datatypes have been implemented. From this, we can see that every ad topic line is unique, and that there are 969 unique cities and 237 unique countries.
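
As a quick verification sketch (not part of the original workflow), the number of distinct levels in each of these factor columns can be counted directly:

# counting the distinct levels of the factor columns to confirm the observation above
sapply(ads[, c("Ad.Topic.Line", "City", "Country")], nlevels)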

3. DATA CLEANING AND MANIPULATION

Missing values

# Check for sum of missing values per column
colSums(is.na(ads))
## Daily.Time.Spent.on.Site                      Age              Area.Income 
##                        0                        0                        0 
##     Daily.Internet.Usage            Ad.Topic.Line                     City 
##                        0                        0                        0 
##                     Male                  Country                Timestamp 
##                        0                        0                        0 
##            Clicked.on.Ad 
##                        0

There are no missing values in the dataset.

Duplicates

# Checking for any duplicated records 
duplicates <- ads[duplicated(ads),]
duplicates
##  [1] Daily.Time.Spent.on.Site Age                      Area.Income             
##  [4] Daily.Internet.Usage     Ad.Topic.Line            City                    
##  [7] Male                     Country                  Timestamp               
## [10] Clicked.on.Ad           
## <0 rows> (or 0-length row.names)

There are no duplicates in the dataset.

Outliers

num_cols <- ads[, c(1,2,3,4)]
# Box plots to visualize outliers
par(mfrow = c(2,2))
for (i in 1:length(num_cols)){
  boxplot(num_cols[i], main = paste('Boxplot for',  names(num_cols)[i]), 
          ylab = 'Count')
}

There are outliers only in the area income column; these will not be removed, as the low incomes appear to be genuine values rather than data errors.
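
As a supplementary check (a sketch, not part of the original cleaning step), the points flagged by the boxplot can be counted with the usual 1.5 * IQR rule:

# counting Area.Income values that fall outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
q <- quantile(ads$Area.Income, c(0.25, 0.75))
iqr <- IQR(ads$Area.Income)
sum(ads$Area.Income < q[1] - 1.5 * iqr | ads$Area.Income > q[2] + 1.5 * iqr)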

4. EXPLORATORY DATA ANALYSIS

Univariate Analysis

summary(ads)
##  Daily.Time.Spent.on.Site      Age         Area.Income    Daily.Internet.Usage
##  Min.   :32.60            Min.   :19.00   Min.   :13996   Min.   :104.8       
##  1st Qu.:51.36            1st Qu.:29.00   1st Qu.:47032   1st Qu.:138.8       
##  Median :68.22            Median :35.00   Median :57012   Median :183.1       
##  Mean   :65.00            Mean   :36.01   Mean   :55000   Mean   :180.0       
##  3rd Qu.:78.55            3rd Qu.:42.00   3rd Qu.:65471   3rd Qu.:218.8       
##  Max.   :91.43            Max.   :61.00   Max.   :79485   Max.   :270.0       
##                                                                               
##                                  Ad.Topic.Line              City     Male   
##  Adaptive 24hour Graphic Interface      :  1   Lisamouth      :  3   0:519  
##  Adaptive asynchronous attitude         :  1   Williamsport   :  3   1:481  
##  Adaptive context-sensitive application :  1   Benjaminchester:  2          
##  Adaptive contextually-based methodology:  1   East John      :  2          
##  Adaptive demand-driven knowledgebase   :  1   East Timothy   :  2          
##  Adaptive uniform capability            :  1   Johnstad       :  2          
##  (Other)                                :994   (Other)        :986          
##            Country      Timestamp                      Clicked.on.Ad
##  Czech Republic:  9   Min.   :2016-01-01 02:52:10.00   0:500        
##  France        :  9   1st Qu.:2016-02-18 02:55:42.00   1:500        
##  Afghanistan   :  8   Median :2016-04-07 17:27:29.50                
##  Australia     :  8   Mean   :2016-04-10 10:34:06.64                
##  Cyprus        :  8   3rd Qu.:2016-05-31 03:18:14.00                
##  Greece        :  8   Max.   :2016-07-24 00:22:16.00                
##  (Other)       :950

From the output above, we can deduce the following:

  • The mean daily time spent on site is 65.00

  • The mean age is 36

  • The mean area income is 55000

  • The mean daily internet usage is 180.0

  • The median daily time spent on site is 68.22

  • The median age is 35

  • The median area income is 57012

  • The median daily internet usage is 183.1

  • The maximum and minimum daily time spent on site are 91.43 and 32.60

  • The maximum and minimum age are 61 and 19

  • The maximum and minimum area income are 79485 and 13996

  • The maximum and minimum daily internet usage are 270.0 and 104.8

  • The first and third quartiles in daily time spent on site are 51.36 and 78.55

  • The first and third quartiles in age are 29 and 42

  • The first and third quartiles in area income are 47032 and 65471

  • The first and third quartiles in daily internet usage are 138.8 and 218.8

  • Czech Republic and France are the most frequently occurring countries (9 records each).

  • Lisamouth and Williamsport are the most frequently occurring cities (3 records each).

# getting the mode
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

age.mode <- getmode(ads$Age)
age.mode
## [1] 31
time.mode <- getmode(ads$Daily.Time.Spent.on.Site)
time.mode
## [1] 62.26
income.mode <- getmode(ads$Area.Income)
income.mode
## [1] 61833.9
usage.mode <- getmode(ads$Daily.Internet.Usage)
usage.mode
## [1] 167.22
  • The most common age is 31
  • The most common daily time spent on the site is 62.26
  • The most common area income is 61833.9
  • The most common daily internet usage is 167.22
# getting the range - the difference between the maximum and minimum values in specified columns
age.range <- max(ads$Age) - min(ads$Age)
age.range
## [1] 42
time.range <- max(ads$Daily.Time.Spent.on.Site) - min(ads$Daily.Time.Spent.on.Site)
time.range
## [1] 58.83
income.range <- max(ads$Area.Income) - min(ads$Area.Income)
income.range
## [1] 65488.3
usage.range <- max(ads$Daily.Internet.Usage) - min(ads$Daily.Internet.Usage)
usage.range
## [1] 165.18
# getting the variance (how the data values are dispersed around the mean)
age.var <- var(ads$Age)
age.var
## [1] 77.18611
time.var <- var(ads$Daily.Time.Spent.on.Site) 
time.var
## [1] 251.3371
income.var <- var(ads$Area.Income) 
income.var
## [1] 179952406
usage.var <- var(ads$Daily.Internet.Usage) 
usage.var
## [1] 1927.415

Area income has the highest variance. This implies that the data in this column is greatly spread about its mean.

# getting the standard deviation
age.stdev <- sd(ads$Age)
age.stdev
## [1] 8.785562
time.stdev <- sd(ads$Daily.Time.Spent.on.Site) 
time.stdev
## [1] 15.85361
income.stdev <- sd(ads$Area.Income) 
income.stdev
## [1] 13414.63
usage.stdev <- sd(ads$Daily.Internet.Usage) 
usage.stdev
## [1] 43.90234

Age has a low standard deviation, indicating that its values lie close to the column mean. Area income has a high standard deviation, indicating that its values are spread over a much wider range.
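
Because the columns are on very different scales, raw standard deviations are not directly comparable. A scale-free alternative is the coefficient of variation (standard deviation divided by the mean); the sketch below is illustrative and not part of the original analysis:

# coefficient of variation (sd / mean) for each numeric column, a scale-free measure of spread
sapply(ads[, c("Daily.Time.Spent.on.Site", "Age", "Area.Income", "Daily.Internet.Usage")],
       function(x) sd(x) / mean(x))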

# plotting a histogram for age 
hist(ads$Age)

Most users in the dataset are between 25 and 40 years old, and the age distribution is skewed to the right.

# plotting a bar graph for male
sex <- ads$Male

sex_freq <- table(sex)
sex_freq
## sex
##   0   1 
## 519 481
barplot(sex_freq)

There are slightly more females (519) than males (481) in the dataset.

# plotting a bar graph for clicked on ad
click <- ads$Clicked.on.Ad

click_freq <- table(click)
click_freq
## click
##   0   1 
## 500 500
barplot(click_freq)

The target variable is perfectly balanced: 500 users clicked on the ad and 500 did not.

# plotting a histogram for daily time spent on site  
hist(ads$Daily.Time.Spent.on.Site)

Most users spend between 75 and 85 units of time per day on the site carrying the ad.

# plotting a histogram for area income 
hist(ads$Area.Income)

The distribution of area income is skewed to the left, with most incomes falling between 50,000 and 70,000.

# plotting a histogram for daily internet usage 
hist(ads$Daily.Internet.Usage)

Most users' daily internet usage is slightly above 150 units.

# Unique years
year <- format(ads$Timestamp, format="%y")
unique(year)
## [1] "16"

The data was collected in 2016 only.

# Unique months
month <- format(ads$Timestamp, format="%m")
# adding months column to the dataset
ads$Month <- month
sort(unique(month))
## [1] "01" "02" "03" "04" "05" "06" "07"

These are the months in the dataset, 1-7. The data was collected during the first 7 months of the year.

# Unique days
day <- wday(ads$Timestamp)
# adding days column to the dataset 
ads$Day <- day
sort(unique(day))
## [1] 1 2 3 4 5 6 7

The day codes in the dataset run from 1 to 7, so data was collected on every day of the week.

# plotting a bar graph for months
month <- ads$Month

month_freq <- table(month)
month_freq
## month
##  01  02  03  04  05  06  07 
## 147 160 156 147 147 142 101
barplot(month_freq)

February has the highest frequency, followed by March.

# plotting a bar graph for days
day <- ads$Day

day_freq <- table(day)
day_freq
## day
##   1   2   3   4   5   6   7 
## 159 140 122 156 142 155 126
barplot(day_freq)

With the default wday() encoding, day 1 corresponds to Sunday and day 7 to Saturday. Day 1 (Sunday) has the highest frequency, followed by day 4 (Wednesday) and day 6 (Friday), although the counts are fairly even across the week.
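
To keep the encoding unambiguous, the numeric day codes can be labelled directly; the sketch below assumes the default wday() convention in which 1 corresponds to Sunday and 7 to Saturday:

# labelling the day-of-week codes before plotting (1 = Sunday ... 7 = Saturday for wday())
day_names <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
barplot(table(factor(ads$Day, levels = 1:7, labels = day_names)))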

Bivariate Analysis

# covariance
cov(ads$Daily.Time.Spent.on.Site, ads$Age)
## [1] -46.17415
cov(ads$Daily.Time.Spent.on.Site, ads$Area.Income)
## [1] 66130.81
cov(ads$Daily.Time.Spent.on.Site, ads$Daily.Internet.Usage)
## [1] 360.9919
cov(ads$Age, ads$Area.Income)
## [1] -21520.93
cov(ads$Age, ads$Daily.Internet.Usage)
## [1] -141.6348
cov(ads$Area.Income, ads$Daily.Internet.Usage)
## [1] 198762.5

The covariances among daily time spent on site, area income and daily internet usage are positive, indicating positive linear relationships between these pairs of variables. The covariances between age and each of the other three variables are negative, indicating negative linear relationships.
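
The same pairwise covariances can be obtained in a single call by passing all four numeric columns to cov(); this is shown only as a compact alternative to the individual calls above:

# covariance matrix of the four numeric columns in one call
cov(ads[, c("Daily.Time.Spent.on.Site", "Age", "Area.Income", "Daily.Internet.Usage")])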

# Correlation Coefficient 
cor(ads$Daily.Time.Spent.on.Site, ads$Age)
## [1] -0.3315133
cor(ads$Daily.Time.Spent.on.Site, ads$Area.Income)
## [1] 0.3109544
cor(ads$Daily.Time.Spent.on.Site, ads$Daily.Internet.Usage)
## [1] 0.5186585
cor(ads$Age, ads$Area.Income)
## [1] -0.182605
cor(ads$Age, ads$Daily.Internet.Usage)
## [1] -0.3672086
cor(ads$Area.Income, ads$Daily.Internet.Usage)
## [1] 0.3374955

Age is negatively correlated with the other three numeric variables, while the remaining pairs show weak to moderate positive correlations; the strongest relationship (0.52) is between daily time spent on site and daily internet usage.

# Computing correlation
x <- ads %>%
  select(Daily.Time.Spent.on.Site, Age, Area.Income, Daily.Internet.Usage)
corr = cor(x)
corr
##                          Daily.Time.Spent.on.Site        Age Area.Income
## Daily.Time.Spent.on.Site                1.0000000 -0.3315133   0.3109544
## Age                                    -0.3315133  1.0000000  -0.1826050
## Area.Income                             0.3109544 -0.1826050   1.0000000
## Daily.Internet.Usage                    0.5186585 -0.3672086   0.3374955
##                          Daily.Internet.Usage
## Daily.Time.Spent.on.Site            0.5186585
## Age                                -0.3672086
## Area.Income                         0.3374955
## Daily.Internet.Usage                1.0000000
# Correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(corr, method="number", sig.level = 0.01, insig = "blank")

# Scatter plot for daily time spent on site and area income
plot(ads$Daily.Time.Spent.on.Site, ads$Area.Income, xlab="Daily Time Spent on Site", ylab="Area Income")

There is a weak positive correlation (0.31) between daily time spent on site and area income.

# Scatter plot for daily time spent on site and daily internet usage
plot(ads$Daily.Time.Spent.on.Site, ads$Daily.Internet.Usage, xlab="Daily Time Spent on Site", ylab="Daily Internet Usage")

There is a moderate positive correlation (0.52) between daily time spent on site and daily internet usage.
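
Since the goal is to understand what drives clicks, it can also help to colour the same scatter plot by the target variable. The following is an illustrative ggplot2 sketch, not part of the original output:

# scatter plot of time on site against internet usage, coloured by click status
ggplot(ads, aes(x = Daily.Time.Spent.on.Site, y = Daily.Internet.Usage, colour = Clicked.on.Ad)) +
  geom_point(alpha = 0.6) +
  labs(x = "Daily Time Spent on Site", y = "Daily Internet Usage", colour = "Clicked on Ad")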

From the analysis, we can conclude the following:

  • Audiences between the ages of 25 and 40 are more interested in the ads on the blog.

  • More females than males were observed clicking on the ads.

  • Most of the users interested in the ads have an area income between 50,000 and 70,000.

  • Users who spend between 75 and 85 units of time on the site were also more drawn towards clicking an ad.

  • There is no particular topic that the audience prefers over another; each ad topic line is unique to a particular user.

  • Records are spread fairly evenly across the days of the week, with days 1, 4 and 6 (Sunday, Wednesday and Friday under the default wday() encoding) slightly the most frequent.

  • February and March have the most records, with January, April and May close behind; July has by far the fewest.
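
Several of these conclusions can be checked at once by comparing the numeric features between clickers and non-clickers; the grouped boxplots below are a supplementary sketch, not part of the original analysis:

# boxplots of each numeric feature, split by whether the user clicked on the ad
par(mfrow = c(2, 2))
for (col in c("Daily.Time.Spent.on.Site", "Age", "Area.Income", "Daily.Internet.Usage")) {
  boxplot(ads[[col]] ~ ads$Clicked.on.Ad, main = col, xlab = "Clicked on Ad", ylab = col)
}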

5. IMPLEMENTING THE SOLUTION

Performing predictive analysis

head(ads)
##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1                    68.95  35    61833.90               256.09
## 2                    80.23  31    68441.85               193.77
## 3                    69.47  26    59785.94               236.50
## 4                    74.15  29    54806.18               245.89
## 5                    68.37  35    73889.99               225.58
## 6                    59.99  23    59761.56               226.74
##                           Ad.Topic.Line           City Male    Country
## 1    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2    Monitored national standardization      West Jodi    1      Nauru
## 3      Organic bottom-line service-desk       Davidton    0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5         Robust logistical utilization   South Manuel    0    Iceland
## 6       Sharable client-driven software      Jamieberg    1     Norway
##             Timestamp Clicked.on.Ad Month Day
## 1 2016-03-27 00:53:11             0    03   1
## 2 2016-04-04 01:39:02             0    04   2
## 3 2016-03-13 20:35:42             0    03   1
## 4 2016-01-10 02:31:19             0    01   1
## 5 2016-06-03 03:36:18             0    06   6
## 6 2016-05-19 14:30:17             0    05   5

From the exploratory analysis, the ad topic line is unique to every user, so I will not use it for predictive analysis. The timestamp column will also be dropped, since its useful information has already been extracted into the month and day columns.

# Dropping the 2 columns
df = subset(ads, select = -c(Ad.Topic.Line, Timestamp) )

head(df)
##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage           City
## 1                    68.95  35    61833.90               256.09    Wrightburgh
## 2                    80.23  31    68441.85               193.77      West Jodi
## 3                    69.47  26    59785.94               236.50       Davidton
## 4                    74.15  29    54806.18               245.89 West Terrifurt
## 5                    68.37  35    73889.99               225.58   South Manuel
## 6                    59.99  23    59761.56               226.74      Jamieberg
##   Male    Country Clicked.on.Ad Month Day
## 1    0    Tunisia             0    03   1
## 2    1      Nauru             0    04   2
## 3    0 San Marino             0    03   1
## 4    1      Italy             0    01   1
## 5    0    Iceland             0    06   6
## 6    1     Norway             0    05   5
# label encoding city and country
df$City = as.integer(as.factor(df$City))
df$Country = as.integer(as.factor(df$Country))

# changing the factor and character columns to integer 
df$Male = as.integer(df$Male)
df$Month = as.integer(df$Month)
# scaling the data
cols <- df[, c(1,2,3,4,5,7,9)]
data <- scale(cols)
head(data)
##   Daily.Time.Spent.on.Site        Age Area.Income Daily.Internet.Usage
## 1                0.2491419 -0.1148475  0.50943618            1.7331628
## 2                0.9606516 -0.5701399  1.00202882            0.3136484
## 3                0.2819420 -1.1392555  0.35677007            1.2869451
## 4                0.5771428 -0.7977862 -0.01444841            1.5008289
## 5                0.2125572 -0.1148475  1.40816290            1.0382112
## 6               -0.3160289 -1.4807248  0.35495265            1.0646335
##         City    Country       Month
## 1  1.6994534  1.4239054 -0.42398107
## 2  1.4918000  0.4516733  0.09496761
## 3 -1.3437431  0.9806820 -0.42398107
## 4  1.6206883 -0.1774181 -1.46187843
## 5  1.1409373 -0.2775008  1.13286497
## 6 -0.7315236  0.6089462  0.61391629
# appending the scaled data to the original data 
final_data <- cbind(df, data)

# dropping the unscaled columns
df <- final_data[ -c(1:5, 7,9) ]
head(df)
##   Male Clicked.on.Ad Day Daily.Time.Spent.on.Site        Age Area.Income
## 1    1             0   1                0.2491419 -0.1148475  0.50943618
## 2    2             0   2                0.9606516 -0.5701399  1.00202882
## 3    1             0   1                0.2819420 -1.1392555  0.35677007
## 4    2             0   1                0.5771428 -0.7977862 -0.01444841
## 5    1             0   6                0.2125572 -0.1148475  1.40816290
## 6    2             0   5               -0.3160289 -1.4807248  0.35495265
##   Daily.Internet.Usage       City    Country       Month
## 1            1.7331628  1.6994534  1.4239054 -0.42398107
## 2            0.3136484  1.4918000  0.4516733  0.09496761
## 3            1.2869451 -1.3437431  0.9806820 -0.42398107
## 4            1.5008289  1.6206883 -0.1774181 -1.46187843
## 5            1.0382112  1.1409373 -0.2775008  1.13286497
## 6            1.0646335 -0.7315236  0.6089462  0.61391629
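
For reference, the scale/cbind/drop sequence above can be written more compactly with dplyr. The sketch below is an equivalent alternative rather than the code that produced the output above; df_raw is a hypothetical name for the data frame as it existed just before the scaling step:

# equivalent one-step standardisation of the same seven columns with dplyr
# (df_raw is a hypothetical stand-in for the data frame before the scale()/cbind() steps)
df_scaled <- df_raw %>%
  mutate(across(c(Daily.Time.Spent.on.Site, Age, Area.Income, Daily.Internet.Usage,
                  City, Country, Month),
                ~ as.numeric(scale(.x))))
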
library('caret')
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# Splitting the data into train and test sets 

# y is the variable on which to stratify the partition (here, Clicked.on.Ad).
# p is the proportion of data assigned to the training set; p = 0.8 gives an 80:20 train/test split.
# list parameter is for whether to return a list or matrix. 
intrain <- createDataPartition(y = df$Clicked.on.Ad, p= 0.8, list = FALSE)
train <- df[intrain,]
test <- df[-intrain,]
# Checking the dimensions of the train dataframe and test dataframe

dim(train); 
## [1] 800  10
dim(test);
## [1] 200  10
# trainControl() specifies the resampling scheme that caret's train() will use; here, 10-fold cross-validation repeated 3 times.

train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

svm_Linear <- train(Clicked.on.Ad ~., data = train, method = "svmLinear",
trControl=train_control,
preProcess = c("center", "scale"),
tuneLength = 10)
# Checking the results of the train model
svm_Linear
## Support Vector Machines with Linear Kernel 
## 
## 800 samples
##   9 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (9), scaled (9) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9670833  0.9341667
## 
## Tuning parameter 'C' was held constant at a value of 1
# Predicting 
pred <- predict(svm_Linear, newdata = test)
pred
##   [1] 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 1 1 0 0 0 0 0 1 1 0 1 0 1 0 1 1 1 0 0 0
##  [38] 0 1 1 1 0 1 1 1 1 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 1
##  [75] 0 0 1 1 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0
## [112] 1 0 0 1 1 0 1 1 0 0 1 0 1 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 1 1 1 0 1 0 0 0
## [149] 1 0 1 0 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 1 0 0 0 1 1 1 1 1 0 0 0 1 1 0 0 1 0
## [186] 0 0 1 0 1 0 1 1 0 1 1 0 1 1 1
## Levels: 0 1
# Checking the accuracy using a confusion matrix
cm <- confusionMatrix(table(pred, test$Clicked.on.Ad))
cm
## Confusion Matrix and Statistics
## 
##     
## pred  0  1
##    0 98  4
##    1  2 96
##                                           
##                Accuracy : 0.97            
##                  95% CI : (0.9358, 0.9889)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.94            
##                                           
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##             Sensitivity : 0.9800          
##             Specificity : 0.9600          
##          Pos Pred Value : 0.9608          
##          Neg Pred Value : 0.9796          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4900          
##    Detection Prevalence : 0.5100          
##       Balanced Accuracy : 0.9700          
##                                           
##        'Positive' Class : 0               
## 

The linear SVM achieves an accuracy of 97% on the test set, with 194 of the 200 test observations correctly classified (98 non-clicks and 96 clicks). The p-value for accuracy exceeding the no-information rate of 0.5 is below 2e-16, so the model performs significantly better than chance.
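
The headline figures above can also be pulled out of the confusionMatrix object programmatically, for example:

# extracting individual metrics from the caret confusionMatrix object
cm$overall[c("Accuracy", "Kappa")]
cm$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")]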

6. CHALLENGING THE SOLUTION

This involves searching for the optimal cost of misclassification by tuning the model's cost parameter.

set.seed(10)

# Attach Packages
library(tidyverse)    # data manipulation and visualization
library(kernlab)      # SVM methodology
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(e1071)        # SVM methodology
library(RColorBrewer) # customized coloring of plots
# find optimal cost of misclassification
tune.out <- tune(svm, Clicked.on.Ad ~., data = df, kernel = "linear",
                 ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
# extract the best model
(bestmod_linear <- tune.out$best.model)
## 
## Call:
## best.tune(method = svm, train.x = Clicked.on.Ad ~ ., data = df, ranges = list(cost = c(0.001, 
##     0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  90
# find optimal cost of misclassification
tune.out <- tune(svm, Clicked.on.Ad ~., data = df, kernel = "radial",
                 ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
# extract the best model
(bestmod_radial <- tune.out$best.model)
## 
## Call:
## best.tune(method = svm, train.x = Clicked.on.Ad ~ ., data = df, ranges = list(cost = c(0.001, 
##     0.01, 0.1, 1, 5, 10, 100)), kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.1 
## 
## Number of Support Vectors:  366
# constructing a table of predicted classes against true classes using the predict() command 
# Create a table of misclassified observations
ypred <- predict(bestmod_linear, df)
(misclass <- table(predict = ypred, truth = df$Clicked.on.Ad))
##        truth
## predict   0   1
##       0 491  18
##       1   9 482
# constructing a table of predicted classes against true classes using the predict() command 
# Create a table of misclassified observations
ypred <- predict(bestmod_radial, df)
(misclass <- table(predict = ypred, truth = df$Clicked.on.Ad))
##        truth
## predict   0   1
##       0 496  25
##       1   4 475

The linear kernel uses 90 support vectors, while the radial kernel uses 366. With the linear kernel, 491 non-clicks and 482 clicks are correctly classified; with the radial kernel, 496 and 475 respectively. This gives an accuracy of 97.3% for the linear kernel and 97.1% for the radial kernel on the full dataset.
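
Note that both tables are computed on the same data the tuned models were fitted on, so these are training accuracies and may be optimistic. A stricter check, sketched below under the assumption that the earlier train and test objects are still in scope, tunes on the training split only and scores on the held-out test split:

# re-tuning on the training split only, then measuring accuracy on the held-out test split
tune_train <- tune(svm, Clicked.on.Ad ~ ., data = train, kernel = "linear",
                   ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
test_pred <- predict(tune_train$best.model, newdata = test)
mean(test_pred == test$Clicked.on.Ad)  # held-out accuracy for the tuned linear SVM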

7. CONCLUSION

From the EDA, we found that the classes are balanced: 500 users clicked on the ad and 500 did not, so the no-information rate is 50%. The roughly 97% accuracy achieved by the SVM models is therefore a substantial improvement over chance, and the prediction results can be considered good.

8. RECOMMENDATIONS

In light of the results from the EDA and the modeling, and with all other features held constant, I would recommend the following:

  • Target audiences between the ages of 25 and 40, as they showed the most interest in the ads.

  • Focus on users with an area income between 50,000 and 70,000.

  • Prioritise users who spend between 75 and 85 units of time on the site, as they are more drawn towards clicking an ad.

  • Since every ad topic line is unique and no topic stands out, invest in audience targeting rather than in rewording the topic line.

9. FOLLOW-UP QUESTIONS

Did we have the right data?

Yes. The data was provided by the client and it is the right data for the problem.

Do we need any other data to answer our question?

Yes. More data to train the model on would be really helpful.

Do we have the right question?

Yes.