Which individuals are most likely to click on an ad?
Give conclusive recommendations based on the EDA results and the modeling results; these should include the factors that would make a person click on the advertisement.
A Kenyan entrepreneur has created an online cryptography course and would want to advertise it on her blog. She currently targets audiences originating from various countries. In the past, she ran ads to advertise a related course on the same blog and collected data in the process. She would now like to identify which individuals are most likely to click on her ads.
The dataset provided would be considered relevant if we are able to prove that the given factors actually lead a person to click on an ad.
library("data.table")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between() masks data.table::between()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks data.table::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::last() masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()
ads <- fread('http://bit.ly/IPAdvertisingData')
# Determining the number of rows and columns in the dataset
cat('Number of rows are', nrow(ads), 'and the number of columns are', ncol(ads))
## Number of rows are 1000 and the number of columns are 10
# Preview the top of the dataset
head(ads)
## Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1: 68.95 35 61833.90 256.09
## 2: 80.23 31 68441.85 193.77
## 3: 69.47 26 59785.94 236.50
## 4: 74.15 29 54806.18 245.89
## 5: 68.37 35 73889.99 225.58
## 6: 59.99 23 59761.56 226.74
## Ad Topic Line City Male Country
## 1: Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2: Monitored national standardization West Jodi 1 Nauru
## 3: Organic bottom-line service-desk Davidton 0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5: Robust logistical utilization South Manuel 0 Iceland
## 6: Sharable client-driven software Jamieberg 1 Norway
## Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11 0
## 2: 2016-04-04 01:39:02 0
## 3: 2016-03-13 20:35:42 0
## 4: 2016-01-10 02:31:19 0
## 5: 2016-06-03 03:36:18 0
## 6: 2016-05-19 14:30:17 0
# Previewing the bottom of the dataset
tail(ads)
## Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1: 43.70 28 63126.96 173.01
## 2: 72.97 30 71384.57 208.58
## 3: 51.30 45 67782.17 134.42
## 4: 51.63 51 42415.72 120.37
## 5: 55.55 19 41920.79 187.95
## 6: 45.01 26 29875.80 178.35
## Ad Topic Line City Male
## 1: Front-line bifurcated ability Nicholasland 0
## 2: Fundamental modular algorithm Duffystad 1
## 3: Grass-roots cohesive monitoring New Darlene 1
## 4: Expanded intangible solution South Jessica 1
## 5: Proactive bandwidth-monitored policy West Steven 0
## 6: Virtual 5thgeneration emulation Ronniemouth 0
## Country Timestamp Clicked on Ad
## 1: Mayotte 2016-04-04 03:57:48 1
## 2: Lebanon 2016-02-11 21:49:00 1
## 3: Bosnia and Herzegovina 2016-04-22 02:07:01 1
## 4: Mongolia 2016-02-01 17:24:57 1
## 5: Guatemala 2016-03-24 02:35:54 0
## 6: Brazil 2016-06-03 21:43:21 1
# checking the structure of the dataset and datatype of each column
str(ads)
## Classes 'data.table' and 'data.frame': 1000 obs. of 10 variables:
## $ Daily Time Spent on Site: num 69 80.2 69.5 74.2 68.4 ...
## $ Age : int 35 31 26 29 35 23 33 48 30 20 ...
## $ Area Income : num 61834 68442 59786 54806 73890 ...
## $ Daily Internet Usage : num 256 194 236 246 226 ...
## $ Ad Topic Line : chr "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
## $ City : chr "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
## $ Male : int 0 1 0 1 0 1 0 1 1 1 ...
## $ Country : chr "Tunisia" "Nauru" "San Marino" "Italy" ...
## $ Timestamp : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
## $ Clicked on Ad : int 0 0 0 0 0 0 0 1 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
All the columns have appropriate datatypes. However, I will convert the character and encoded columns to factors, since factors store their values as integer codes with an associated set of levels.
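As a toy illustration (not drawn from the dataset), a factor is stored internally as integer codes plus a character vector of levels:
# Toy example: a factor is an integer vector with a levels attribute
f <- factor(c("no", "yes", "no"))
typeof(f)      # "integer"
levels(f)      # "no" "yes"
as.integer(f)  # 1 2 1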
# converting the character columns to factors
ads <- as.data.frame(unclass(ads),
stringsAsFactors = TRUE)
# converting the encoded columns (Male and Clicked on Ad) to factors
factor_cols <- c(7, 10)
ads[, factor_cols] <- lapply(ads[, factor_cols], factor)
# checking the changes made
str(ads)
## 'data.frame': 1000 obs. of 10 variables:
## $ Daily.Time.Spent.on.Site: num 69 80.2 69.5 74.2 68.4 ...
## $ Age : int 35 31 26 29 35 23 33 48 30 20 ...
## $ Area.Income : num 61834 68442 59786 54806 73890 ...
## $ Daily.Internet.Usage : num 256 194 236 246 226 ...
## $ Ad.Topic.Line : Factor w/ 1000 levels "Adaptive 24hour Graphic Interface",..: 92 465 567 904 767 806 223 724 108 455 ...
## $ City : Factor w/ 969 levels "Adamsbury","Adamside",..: 962 904 112 940 806 283 47 672 885 713 ...
## $ Male : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 2 2 ...
## $ Country : Factor w/ 237 levels "Afghanistan",..: 216 148 185 104 97 159 146 13 83 79 ...
## $ Timestamp : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
## $ Clicked.on.Ad : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
All the changes to the column datatypes have been implemented. The factor levels show that every ad topic line is unique (1000 levels for 1000 records), and that there are 969 unique cities and 237 unique countries.
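As a quick confirmation of these cardinalities (a minimal sketch; the counts simply restate the factor level counts shown by str()):
# Number of distinct values in the three high-cardinality factor columns
sapply(ads[, c("Ad.Topic.Line", "City", "Country")], nlevels)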
# Check for sum of missing values per column
colSums(is.na(ads))
## Daily.Time.Spent.on.Site Age Area.Income
## 0 0 0
## Daily.Internet.Usage Ad.Topic.Line City
## 0 0 0
## Male Country Timestamp
## 0 0 0
## Clicked.on.Ad
## 0
There are no missing values in the dataset.
### Duplicates
# Checking for any duplicated records
duplicates <- ads[duplicated(ads),]
duplicates
## [1] Daily.Time.Spent.on.Site Age Area.Income
## [4] Daily.Internet.Usage Ad.Topic.Line City
## [7] Male Country Timestamp
## [10] Clicked.on.Ad
## <0 rows> (or 0-length row.names)
There are no duplicates in the dataset.
### Outliers
num_cols <- ads[, c(1,2,3,4)]
# Box plots to visualize outliers in the numeric columns
par(mfrow = c(2,2))
for (i in 1:length(num_cols)){
  boxplot(num_cols[i], main = paste('Boxplot for', names(num_cols)[i]),
          ylab = 'Value')
}
Only the area income column contains outliers; these will not be removed.
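For reference, the points that boxplot() flags (values beyond 1.5 times the IQR from the quartiles) can be listed with boxplot.stats(); a minimal sketch:
# Area income values flagged as outliers by the boxplot rule
income_out <- boxplot.stats(ads$Area.Income)$out
length(income_out)
summary(income_out)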
summary(ads)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## Min. :32.60 Min. :19.00 Min. :13996 Min. :104.8
## 1st Qu.:51.36 1st Qu.:29.00 1st Qu.:47032 1st Qu.:138.8
## Median :68.22 Median :35.00 Median :57012 Median :183.1
## Mean :65.00 Mean :36.01 Mean :55000 Mean :180.0
## 3rd Qu.:78.55 3rd Qu.:42.00 3rd Qu.:65471 3rd Qu.:218.8
## Max. :91.43 Max. :61.00 Max. :79485 Max. :270.0
##
## Ad.Topic.Line City Male
## Adaptive 24hour Graphic Interface : 1 Lisamouth : 3 0:519
## Adaptive asynchronous attitude : 1 Williamsport : 3 1:481
## Adaptive context-sensitive application : 1 Benjaminchester: 2
## Adaptive contextually-based methodology: 1 East John : 2
## Adaptive demand-driven knowledgebase : 1 East Timothy : 2
## Adaptive uniform capability : 1 Johnstad : 2
## (Other) :994 (Other) :986
## Country Timestamp Clicked.on.Ad
## Czech Republic: 9 Min. :2016-01-01 02:52:10.00 0:500
## France : 9 1st Qu.:2016-02-18 02:55:42.00 1:500
## Afghanistan : 8 Median :2016-04-07 17:27:29.50
## Australia : 8 Mean :2016-04-10 10:34:06.64
## Cyprus : 8 3rd Qu.:2016-05-31 03:18:14.00
## Greece : 8 Max. :2016-07-24 00:22:16.00
## (Other) :950
From the output above, we can deduce the following:
- The mean daily time spent on site is 65.00 and the median is 68.22.
- The mean age is 36 and the median is 35.
- The mean area income is 55000 and the median is 57012.
- The mean daily internet usage is 180.0 and the median is 183.1.
- Daily time spent on site ranges from 32.60 to 91.43, with first and third quartiles of 51.36 and 78.55.
- Age ranges from 19 to 61, with first and third quartiles of 29 and 42.
- Area income ranges from 13996 to 79485, with first and third quartiles of 47032 and 65471.
- Daily internet usage ranges from 104.8 to 270.0, with first and third quartiles of 138.8 and 218.8.
- Czech Republic and France are the most frequently occurring countries.
- Lisamouth and Williamsport are the most frequently occurring cities.
# getting the mode
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
age.mode <- getmode(ads$Age)
age.mode
## [1] 31
time.mode <- getmode(ads$Daily.Time.Spent.on.Site)
time.mode
## [1] 62.26
income.mode <- getmode(ads$Area.Income)
income.mode
## [1] 61833.9
usage.mode <- getmode(ads$Daily.Internet.Usage)
usage.mode
## [1] 167.22
# getting the range - the difference between the maximum and minimum values in specified columns
age.range <- max(ads$Age) - min(ads$Age)
age.range
## [1] 42
time.range <- max(ads$Daily.Time.Spent.on.Site) - min(ads$Daily.Time.Spent.on.Site)
time.range
## [1] 58.83
income.range <- max(ads$Area.Income) - min(ads$Area.Income)
income.range
## [1] 65488.3
usage.range <- max(ads$Daily.Internet.Usage) - min(ads$Daily.Internet.Usage)
usage.range
## [1] 165.18
# getting the variance (how the data values are dispersed around the mean)
age.var <- var(ads$Age)
age.var
## [1] 77.18611
time.var <- var(ads$Daily.Time.Spent.on.Site)
time.var
## [1] 251.3371
income.var <- var(ads$Area.Income)
income.var
## [1] 179952406
usage.var <- var(ads$Daily.Internet.Usage)
usage.var
## [1] 1927.415
Area income has by far the highest variance, implying that its values are widely dispersed about the mean; note, however, that variance is scale-dependent and area income is measured on a much larger scale than the other columns.
# getting the standard deviation
age.stdev <- sd(ads$Age)
age.stdev
## [1] 8.785562
time.stdev <- sd(ads$Daily.Time.Spent.on.Site)
time.stdev
## [1] 15.85361
income.stdev <- sd(ads$Area.Income)
income.stdev
## [1] 13414.63
usage.stdev <- sd(ads$Daily.Internet.Usage)
usage.stdev
## [1] 43.90234
Age has a low standard deviation, which indicates that its data points lie close to the column mean. Area income has a high standard deviation, which indicates that its data points are spread over a much wider range of values.
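Because the four numeric columns are measured on very different scales, the coefficient of variation (standard deviation divided by the mean) gives a scale-free comparison of spread; a short sketch:
# Coefficient of variation for each numeric column
num_vars <- ads[, c("Daily.Time.Spent.on.Site", "Age", "Area.Income", "Daily.Internet.Usage")]
sapply(num_vars, function(x) sd(x) / mean(x))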
# plotting a histogram for age
hist(ads$Age)
Most people in the dataset are between 25 and 40 years old, and the age distribution is skewed to the right.
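To see whether the 25–40 band also dominates among users who actually clicked, the histogram can be split by click status; a sketch using ggplot2 (already loaded with the tidyverse):
# Age distribution of clickers versus non-clickers
ggplot(ads, aes(x = Age, fill = Clicked.on.Ad)) +
  geom_histogram(binwidth = 5, position = "dodge") +
  labs(fill = "Clicked on Ad")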
# plotting a bar graph for male
sex <- ads$Male
sex_freq <- table(sex)
sex_freq
## sex
## 0 1
## 519 481
barplot(sex_freq)
There are slightly more females (Male = 0) than males in the dataset.
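The barplot only shows overall gender counts; a quick cross-tabulation (a sketch) gives the click rate within each gender:
# Proportion of clicks within each gender (rows sum to 1)
prop.table(table(Male = ads$Male, Clicked = ads$Clicked.on.Ad), margin = 1)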
# plotting a bar graph for clicked on ad
click <- ads$Clicked.on.Ad
click_freq <- table(click)
click_freq
## click
## 0 1
## 500 500
barplot(click_freq)
The two classes are perfectly balanced: 500 users clicked on the ad and 500 did not.
# plotting a histogram for daily time spent on site
hist(ads$Daily.Time.Spent.on.Site)
Most users spend between 75 and 85 units of time per day on the site.
# plotting a histogram for area income
hist(ads$Area.Income)
The distribution of area income is skewed to the left, with most incomes falling between 50000 and 70000.
# plotting a histogram for daily internet usage
hist(ads$Daily.Internet.Usage)
Most users spend slightly over 150 units of time on the internet daily.
# Unique years
year <- format(ads$Timestamp, format="%y")
unique(year)
## [1] "16"
The data was collected in 2016 only.
# Unique months
month <- format(ads$Timestamp, format="%m")
# adding months column to the dataset
ads$Month <- month
sort(unique(month))
## [1] "01" "02" "03" "04" "05" "06" "07"
These are the months in the dataset, 1-7. The data was collected during the first 7 months of the year.
# Unique days
day <- wday(ads$Timestamp)
# adding days column to the dataset
ads$Day <- day
sort(unique(day))
## [1] 1 2 3 4 5 6 7
The day codes run from 1 to 7, so the data was collected on every day of the week.
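Since the numeric day codes are easy to misread (data.table's wday() counts Sunday as day 1), tabulating the weekday names directly is a simple sanity check; a sketch:
# Record counts by weekday name
table(weekdays(ads$Timestamp))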
# plotting a bar graph for months
month <- ads$Month
month_freq <- table(month)
month_freq
## month
## 01 02 03 04 05 06 07
## 147 160 156 147 147 142 101
barplot(month_freq)
February has the highest frequency (160 records), followed by March (156).
# plotting a bar graph for days
day <- ads$Day
day_freq <- table(day)
day_freq
## day
## 1 2 3 4 5 6 7
## 159 140 122 156 142 155 126
barplot(day_freq)
With data.table's wday(), day 1 is Sunday and day 7 is Saturday. Sunday has the highest frequency (159), followed by Wednesday (156) and Friday (155).
# covariance
cov(ads$Daily.Time.Spent.on.Site, ads$Age)
## [1] -46.17415
cov(ads$Daily.Time.Spent.on.Site, ads$Area.Income)
## [1] 66130.81
cov(ads$Daily.Time.Spent.on.Site, ads$Daily.Internet.Usage)
## [1] 360.9919
cov(ads$Age, ads$Area.Income)
## [1] -21520.93
cov(ads$Age, ads$Daily.Internet.Usage)
## [1] -141.6348
cov(ads$Area.Income, ads$Daily.Internet.Usage)
## [1] 198762.5
The covariances among daily time spent on site, area income and daily internet usage are positive, indicating positive linear relationships between those variables. The covariances between age and each of the other three variables are negative, indicating negative linear relationships.
# Correlation Coefficient
cor(ads$Daily.Time.Spent.on.Site, ads$Age)
## [1] -0.3315133
cor(ads$Daily.Time.Spent.on.Site, ads$Area.Income)
## [1] 0.3109544
cor(ads$Daily.Time.Spent.on.Site, ads$Daily.Internet.Usage)
## [1] 0.5186585
cor(ads$Age, ads$Area.Income)
## [1] -0.182605
cor(ads$Age, ads$Daily.Internet.Usage)
## [1] -0.3672086
cor(ads$Area.Income, ads$Daily.Internet.Usage)
## [1] 0.3374955
Age is negatively correlated with the other three variables, and most pairs are only weakly correlated; the strongest relationship is the moderate positive correlation (0.52) between daily time spent on site and daily internet usage.
# Computing the correlation matrix for the numeric columns
x <- ads %>%
  select(Daily.Time.Spent.on.Site, Age, Area.Income, Daily.Internet.Usage)
corr <- cor(x)
corr
## Daily.Time.Spent.on.Site Age Area.Income
## Daily.Time.Spent.on.Site 1.0000000 -0.3315133 0.3109544
## Age -0.3315133 1.0000000 -0.1826050
## Area.Income 0.3109544 -0.1826050 1.0000000
## Daily.Internet.Usage 0.5186585 -0.3672086 0.3374955
## Daily.Internet.Usage
## Daily.Time.Spent.on.Site 0.5186585
## Age -0.3672086
## Area.Income 0.3374955
## Daily.Internet.Usage 1.0000000
# Correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(corr, method="number", sig.level = 0.01, insig = "blank")
# Scatter plot for daily time spent on site and area income
plot(ads$Daily.Time.Spent.on.Site, ads$Area.Income, xlab="Daily Time Spent on Site", ylab="Area Income")
There is a weak positive correlation between the two variables.
# Scatter plot for daily time spent on site and daily internet usage
plot(ads$Daily.Time.Spent.on.Site, ads$Daily.Internet.Usage, xlab="Daily Time Spent on Site", ylab="Daily Internet Usage")
There is a moderate positive correlation (0.52) between the two variables.
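Colouring the same scatter plot by click status (a sketch using ggplot2) makes the separation between clickers and non-clickers much clearer:
# Daily time on site vs daily internet usage, coloured by click status
ggplot(ads, aes(x = Daily.Time.Spent.on.Site, y = Daily.Internet.Usage, colour = Clicked.on.Ad)) +
  geom_point() +
  labs(x = "Daily Time Spent on Site", y = "Daily Internet Usage", colour = "Clicked on Ad")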
From the analysis, we can conclude the following:
- Audiences between the ages of 25 and 40 are the most interested in the ads on the blog.
- Slightly more females than males clicked on the ads.
- Users with an area income between 50000 and 70000 showed the most interest in the ads.
- Users who spent between 75 and 85 units of time on the site were more drawn towards clicking an ad.
- There is no particular topic that the audience prefers over another; each ad topic line is unique to a single user.
- Most activity occurred on Sundays, Wednesdays and Fridays (days 1, 4 and 6 in the data).
- February and March recorded the most activity, with January, April and May close behind.
## 5. IMPLEMENTING THE SOLUTION
### Performing predictive analysis
head(ads)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1 68.95 35 61833.90 256.09
## 2 80.23 31 68441.85 193.77
## 3 69.47 26 59785.94 236.50
## 4 74.15 29 54806.18 245.89
## 5 68.37 35 73889.99 225.58
## 6 59.99 23 59761.56 226.74
## Ad.Topic.Line City Male Country
## 1 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2 Monitored national standardization West Jodi 1 Nauru
## 3 Organic bottom-line service-desk Davidton 0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5 Robust logistical utilization South Manuel 0 Iceland
## 6 Sharable client-driven software Jamieberg 1 Norway
## Timestamp Clicked.on.Ad Month Day
## 1 2016-03-27 00:53:11 0 03 1
## 2 2016-04-04 01:39:02 0 04 2
## 3 2016-03-13 20:35:42 0 03 1
## 4 2016-01-10 02:31:19 0 01 1
## 5 2016-06-03 03:36:18 0 06 6
## 6 2016-05-19 14:30:17 0 05 5
From the exploratory analysis, the ad topic line is unique to every user, so I will not use it for predictive analysis. The timestamp column will also be dropped, since the month and day have already been extracted from it.
# Dropping the 2 columns
df = subset(ads, select = -c(Ad.Topic.Line, Timestamp) )
head(df)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage City
## 1 68.95 35 61833.90 256.09 Wrightburgh
## 2 80.23 31 68441.85 193.77 West Jodi
## 3 69.47 26 59785.94 236.50 Davidton
## 4 74.15 29 54806.18 245.89 West Terrifurt
## 5 68.37 35 73889.99 225.58 South Manuel
## 6 59.99 23 59761.56 226.74 Jamieberg
## Male Country Clicked.on.Ad Month Day
## 1 0 Tunisia 0 03 1
## 2 1 Nauru 0 04 2
## 3 0 San Marino 0 03 1
## 4 1 Italy 0 01 1
## 5 0 Iceland 0 06 6
## 6 1 Norway 0 05 5
# label encoding city and country
df$City = as.integer(as.factor(df$City))
df$Country = as.integer(as.factor(df$Country))
# changing the factor and character columns to integer
df$Male = as.integer(df$Male)
df$Month = as.integer(df$Month)
# scaling the data
cols <- df[, c(1,2,3,4,5,7,9)]
data <- scale(cols)
head(data)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1 0.2491419 -0.1148475 0.50943618 1.7331628
## 2 0.9606516 -0.5701399 1.00202882 0.3136484
## 3 0.2819420 -1.1392555 0.35677007 1.2869451
## 4 0.5771428 -0.7977862 -0.01444841 1.5008289
## 5 0.2125572 -0.1148475 1.40816290 1.0382112
## 6 -0.3160289 -1.4807248 0.35495265 1.0646335
## City Country Month
## 1 1.6994534 1.4239054 -0.42398107
## 2 1.4918000 0.4516733 0.09496761
## 3 -1.3437431 0.9806820 -0.42398107
## 4 1.6206883 -0.1774181 -1.46187843
## 5 1.1409373 -0.2775008 1.13286497
## 6 -0.7315236 0.6089462 0.61391629
# appending the scaled data to the original data
final_data <- cbind(df, data)
# dropping the unscaled columns
df <- final_data[ -c(1:5, 7,9) ]
head(df)
## Male Clicked.on.Ad Day Daily.Time.Spent.on.Site Age Area.Income
## 1 1 0 1 0.2491419 -0.1148475 0.50943618
## 2 2 0 2 0.9606516 -0.5701399 1.00202882
## 3 1 0 1 0.2819420 -1.1392555 0.35677007
## 4 2 0 1 0.5771428 -0.7977862 -0.01444841
## 5 1 0 6 0.2125572 -0.1148475 1.40816290
## 6 2 0 5 -0.3160289 -1.4807248 0.35495265
## Daily.Internet.Usage City Country Month
## 1 1.7331628 1.6994534 1.4239054 -0.42398107
## 2 0.3136484 1.4918000 0.4516733 0.09496761
## 3 1.2869451 -1.3437431 0.9806820 -0.42398107
## 4 1.5008289 1.6206883 -0.1774181 -1.46187843
## 5 1.0382112 1.1409373 -0.2775008 1.13286497
## 6 1.0646335 -0.7315236 0.6089462 0.61391629
library('caret')
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Splitting the data into train and test sets
# 'y' is the outcome variable according to which the data is partitioned (stratified split).
# p is the proportion of data that goes to the training set; p = 0.8 gives an 80:20 split.
# list = FALSE returns a matrix of row indices rather than a list.
intrain <- createDataPartition(y = df$Clicked.on.Ad, p= 0.8, list = FALSE)
train <- df[intrain,]
test <- df[-intrain,]
# Checking the dimensions of the train dataframe and test dataframe
dim(train);
## [1] 800 10
dim(test);
## [1] 200 10
# trainControl() specifies the resampling scheme used by caret's train() function: here, 10-fold cross-validation repeated 3 times.
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
svm_Linear <- train(Clicked.on.Ad ~., data = train, method = "svmLinear",
trControl=train_control,
preProcess = c("center", "scale"),
tuneLength = 10)
# Checking the results of the train model
svm_Linear
## Support Vector Machines with Linear Kernel
##
## 800 samples
## 9 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (9), scaled (9)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9670833 0.9341667
##
## Tuning parameter 'C' was held constant at a value of 1
# Predicting
pred <- predict(svm_Linear, newdata = test)
pred
## [1] 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 1 1 0 0 0 0 0 1 1 0 1 0 1 0 1 1 1 0 0 0
## [38] 0 1 1 1 0 1 1 1 1 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 0 1 0 0 0 0 0 0 1 0 1 0 1
## [75] 0 0 1 1 0 1 0 0 1 1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0
## [112] 1 0 0 1 1 0 1 1 0 0 1 0 1 0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 1 1 1 0 1 0 0 0
## [149] 1 0 1 0 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 1 0 0 0 1 1 1 1 1 0 0 0 1 1 0 0 1 0
## [186] 0 0 1 0 1 0 1 1 0 1 1 0 1 1 1
## Levels: 0 1
# Checking the accuracy using a confusion matrix
cm <- confusionMatrix(table(pred, test$Clicked.on.Ad))
cm
## Confusion Matrix and Statistics
##
##
## pred 0 1
## 0 98 4
## 1 2 96
##
## Accuracy : 0.97
## 95% CI : (0.9358, 0.9889)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.94
##
## Mcnemar's Test P-Value : 0.6831
##
## Sensitivity : 0.9800
## Specificity : 0.9600
## Pos Pred Value : 0.9608
## Neg Pred Value : 0.9796
## Prevalence : 0.5000
## Detection Rate : 0.4900
## Detection Prevalence : 0.5100
## Balanced Accuracy : 0.9700
##
## 'Positive' Class : 0
##
SVM gives an accuracy of 97% on the test set, with 194 of the 200 observations correctly classified. The p-value [Acc > NIR] is below 2e-16, so the accuracy is significantly better than the no-information rate of 0.5.
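The same figures can be pulled programmatically from the confusionMatrix object; a minimal sketch:
# Extract the headline metrics from the caret confusionMatrix object
cm$overall["Accuracy"]
cm$byClass[c("Sensitivity", "Specificity")]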
## 6. CHALLENGING THE SOLUTION
This involves searching for the optimal cost of misclassification by tuning the model's parameters.
set.seed(10)
# Attach Packages
library(tidyverse) # data manipulation and visualization
library(kernlab) # SVM methodology
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
library(e1071) # SVM methodology
library(RColorBrewer) # customized coloring of plots
# find optimal cost of misclassification
tune.out <- tune(svm, Clicked.on.Ad ~., data = df, kernel = "linear",
ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
# extract the best model
(bestmod_linear <- tune.out$best.model)
##
## Call:
## best.tune(method = svm, train.x = Clicked.on.Ad ~ ., data = df, ranges = list(cost = c(0.001,
## 0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 90
# find optimal cost of misclassification
tune.out <- tune(svm, Clicked.on.Ad ~., data = df, kernel = "radial",
ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
# extract the best model
(bestmod_radial <- tune.out$best.model)
##
## Call:
## best.tune(method = svm, train.x = Clicked.on.Ad ~ ., data = df, ranges = list(cost = c(0.001,
## 0.01, 0.1, 1, 5, 10, 100)), kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.1
##
## Number of Support Vectors: 366
# constructing a table of predicted classes against true classes using the predict() command
# Create a table of misclassified observations
ypred <- predict(bestmod_linear, df)
(misclass <- table(predict = ypred, truth = df$Clicked.on.Ad))
## truth
## predict 0 1
## 0 491 18
## 1 9 482
# constructing a table of predicted classes against true classes using the predict() command
# Create a table of misclassified observations
ypred <- predict(bestmod_radial, df)
(misclass <- table(predict = ypred, truth = df$Clicked.on.Ad))
## truth
## predict 0 1
## 0 496 25
## 1 4 475
The linear kernel uses 90 support vectors while the radial kernel uses 366. With the linear kernel, 491 + 482 observations are correctly classified, an accuracy of 97.3%; with the radial kernel, 496 + 475 are correct, an accuracy of 97.1%.
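Note that these tables are computed on the same data used for tuning, so the accuracies are slightly optimistic. A fairer check would tune on the training split only and score the held-out test set created earlier; a minimal sketch (the cost grid here is an assumption, not the one used above):
# Tune on the training split and evaluate on the untouched test split
tune.train <- tune(svm, Clicked.on.Ad ~ ., data = train, kernel = "radial",
                   ranges = list(cost = c(0.01, 0.1, 1, 10)))
ypred_test <- predict(tune.train$best.model, newdata = test)
mean(ypred_test == test$Clicked.on.Ad)  # held-out accuracy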
From the EDA, the classes are balanced (500 clicks and 500 non-clicks), so the no-information rate is 0.5 and the roughly 97% accuracy achieved by the models is a substantial improvement over chance.
In light of the results from modeling the problem, and with all other factors held constant, I would recommend the following:
- The entrepreneur should prioritize advertising to an audience aged 25 to 40.
- She should make the ads visible to more females than males.
- She can use any topic line for the ad, since users did not favor particular topics, provided it is relevant to the content of her blog.
- She should target the audience with an income between 50000 and 70000, as they were noted to be very interested in the ads.
- She should target the audience who spend between 75 and 85 units of time on the site, as they were also noted to be very interested in the ads.
- She should run the advertisements on Sundays, Wednesdays and Fridays, the days with the most activity in the data.
Yes. The data was provided by the client and it is the right data for the problem.
Yes. More data to train the model on would be really helpful.
Yes.