1 Defining Data Question


1.1 a) Specifying the Question

A Kenyan entrepreneur has created an online cryptography course and wants to advertise it on her blog. She currently targets audiences from various countries. In the past, she ran ads for a related course on the same blog and collected data in the process. She would now like to employ our services as a Data Science Consultant to help her identify which individuals are most likely to click on her ads.

1.2 b) Defining the Metric for success.

This project will be considered a success once we have thoroughly cleaned our data, performed both univariate and bivariate analysis, and offered summaries of our dataset.

1.3 c) Understanding the context

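The dataset that we will be using is an advertisement dataset with 1000 rows and 10 columns. It records each visitor's daily time spent on site, age, area income, daily internet usage, ad topic line, city, gender, country and timestamp, and whether they clicked on the ad.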

1.4 d) Recording the experimental design.

The following steps will be followed in conducting this study:

  • Define the question, the metric for success, the context, and the experimental design.

  • Read and explore the given dataset. Define the appropriateness of the available data to answer the given question.

  • Find and deal with outliers, anomalies, and missing data within the dataset.

  • Perform univariate and bivariate analysis and record our observations.

  • From our insights we will provide a conclusion and recommendation.

1.6 Viewing top entries

head(df)
##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1                    68.95  35    61833.90               256.09
## 2                    80.23  31    68441.85               193.77
## 3                    69.47  26    59785.94               236.50
## 4                    74.15  29    54806.18               245.89
## 5                    68.37  35    73889.99               225.58
## 6                    59.99  23    59761.56               226.74
##                           Ad.Topic.Line           City Male    Country
## 1    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2    Monitored national standardization      West Jodi    1      Nauru
## 3      Organic bottom-line service-desk       Davidton    0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5         Robust logistical utilization   South Manuel    0    Iceland
## 6       Sharable client-driven software      Jamieberg    1     Norway
##             Timestamp Clicked.on.Ad
## 1 2016-03-27 00:53:11             0
## 2 2016-04-04 01:39:02             0
## 3 2016-03-13 20:35:42             0
## 4 2016-01-10 02:31:19             0
## 5 2016-06-03 03:36:18             0
## 6 2016-05-19 14:30:17             0
# checking data composition
str(df)
## 'data.frame':    1000 obs. of  10 variables:
##  $ Daily.Time.Spent.on.Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area.Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily.Internet.Usage    : num  256 194 236 246 226 ...
##  $ Ad.Topic.Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : chr  "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
##  $ Male                    : int  0 1 0 1 0 1 0 1 1 1 ...
##  $ Country                 : chr  "Tunisia" "Nauru" "San Marino" "Italy" ...
##  $ Timestamp               : chr  "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
##  $ Clicked.on.Ad           : int  0 0 0 0 0 0 0 1 0 0 ...
#checking dimension of our dataset
dim(df)
## [1] 1000   10
#confirming our dataset is a dataframe
class(df)
## [1] "data.frame"

2 Cleaning our data

2.1 Checking for missing values

sum(is.na(df))
## [1] 0
# there are no missing values

2.2 Checking for duplicates

sum(duplicated(df))
## [1] 0
# there are no duplicates
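
The two checks above return a single total each. As a small sketch on a made-up data frame (the `toy` values are hypothetical, not taken from our dataset), the same checks can be reported per column and per row, which helps once a total is not zero:

```r
# Toy data frame (hypothetical values) with one missing entry and one repeated row
toy <- data.frame(
  Age = c(35, 31, NA, 31),
  Male = c(0, 1, 0, 1)
)

# Missing values per column rather than one overall total
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of fully duplicated rows, and the rows themselves
n_duplicates <- sum(duplicated(toy))
print(toy[duplicated(toy), ])
```

On our own data both totals are 0, so the per-column view would simply show zeros.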

2.3 Checking and dealing with outliers

boxplot(df$`Area.Income`,main="Boxplot for Area.Income",col = "grey")

boxplot(df$`Age`,main="Boxplot for Age",col = "orange")

boxplot(df$`Daily.Time.Spent.on.Site`,main="Boxplot for Daily.Time.Spent.on.Site",col = "green")

boxplot(df$`Male`,main="Boxplot for Male",col = "blue")

boxplot(df$`Daily.Internet.Usage`,main="Boxplot for Daily.Internet.Usage",col = "yellow")

boxplot(df$`Clicked.on.Ad`,main="Boxplot for Clicked.on.Ad",col = "red")

# We don't have many outliers in our columns, so we will leave them as they are
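
Reading outliers off a boxplot is a judgment by eye. A minimal sketch of counting them instead, assuming we accept the whisker rule that `boxplot()` uses by default (1.5 times the box length); the vector `x`, the data frame `toy` and the helper name `count_outliers` are made up for illustration:

```r
# Hypothetical vector: most values are close together, one is far away
x <- c(21, 23, 25, 26, 28, 30, 31, 33, 95)

# boxplot.stats() returns the points that boxplot() would draw beyond the whiskers
outliers <- boxplot.stats(x)$out
print(outliers)

# Illustrative helper: count flagged points in each numeric column of a data frame
count_outliers <- function(data) {
  numeric_cols <- data[sapply(data, is.numeric)]
  sapply(numeric_cols, function(col) length(boxplot.stats(col)$out))
}

toy <- data.frame(a = x, b = seq_along(x), label = letters[seq_along(x)])
print(count_outliers(toy))
```

For 0/1 columns such as Male and Clicked.on.Ad a count like this says little, since a binary column has only two possible values.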

3 Univariate Analysis

summary(df)
##  Daily.Time.Spent.on.Site      Age         Area.Income    Daily.Internet.Usage
##  Min.   :32.60            Min.   :19.00   Min.   :13996   Min.   :104.8       
##  1st Qu.:51.36            1st Qu.:29.00   1st Qu.:47032   1st Qu.:138.8       
##  Median :68.22            Median :35.00   Median :57012   Median :183.1       
##  Mean   :65.00            Mean   :36.01   Mean   :55000   Mean   :180.0       
##  3rd Qu.:78.55            3rd Qu.:42.00   3rd Qu.:65471   3rd Qu.:218.8       
##  Max.   :91.43            Max.   :61.00   Max.   :79485   Max.   :270.0       
##  Ad.Topic.Line          City                Male         Country         
##  Length:1000        Length:1000        Min.   :0.000   Length:1000       
##  Class :character   Class :character   1st Qu.:0.000   Class :character  
##  Mode  :character   Mode  :character   Median :0.000   Mode  :character  
##                                        Mean   :0.481                     
##                                        3rd Qu.:1.000                     
##                                        Max.   :1.000                     
##   Timestamp         Clicked.on.Ad
##  Length:1000        Min.   :0.0  
##  Class :character   1st Qu.:0.0  
##  Mode  :character   Median :0.5  
##                     Mean   :0.5  
##                     3rd Qu.:1.0  
##                     Max.   :1.0
# summary() gives the minimum, quartiles, median, mean and maximum of each numeric column

3.1 Getting important measures of dispersion(range and standard deviation)

cat("the range of age  is",range(df$'Age'))
## the range of age  is 19 61
cat("\n")
cat("the range of  Area.Income is",range(df$'Area.Income'))
## the range of  Area.Income is 13996.5 79484.8
cat("\n")
cat("the range of Daily.Time.Spent.on.Site  is",range(df$'Daily.Time.Spent.on.Site'))
## the range of Daily.Time.Spent.on.Site  is 32.6 91.43
cat("\n")
cat("the range of male  is",range(df$'Male'))
## the range of male  is 0 1
cat("\n")
cat("the range of  Daily.Internet.Usage is",range(df$'Daily.Internet.Usage'))
## the range of  Daily.Internet.Usage is 104.78 269.96
cat("\n")
cat("the standard deviation of age  is",sd(df$'Age'))
## the standard deviation of age  is 8.785562
cat("\n")
cat("the standard deviation of Area.Income  is",sd(df$'Area.Income'))
## the standard deviation of Area.Income  is 13414.63
cat("\n")
cat("the standard deviation of Daily.Time.Spent.on.Site is",sd(df$'Daily.Time.Spent.on.Site'))
## the standard deviation of Daily.Time.Spent.on.Site is 15.85361
cat("\n")
cat("the standard deviation of male is",sd(df$'Male'))
## the standard deviation of male is 0.4998889
cat("\n")
cat("the standard deviation of Daily.Internet.Usage  is",sd(df$'Daily.Internet.Usage'))
## the standard deviation of Daily.Internet.Usage  is 43.90234
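
The range and standard deviation calls above repeat the same pattern for each column. A possible shorter form, sketched on a hypothetical `toy` data frame standing in for the numeric columns of our dataset, computes all of them in one pass:

```r
# Hypothetical stand-in for the numeric columns of df
toy <- data.frame(
  Age = c(19, 35, 61),
  Male = c(0, 1, 0)
)

# One row per column: minimum, maximum and standard deviation
dispersion <- t(sapply(toy, function(col) {
  c(min = min(col), max = max(col), sd = sd(col))
}))
print(dispersion)
```

The result is a matrix, so a single value can be picked out by name, for example `dispersion["Age", "sd"]`.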

3.2 Getting a histogram of our columns

hist(df$`Area.Income`,main="histogram for Area.Income",col = "grey")

hist(df$`Age`,main="histogram for Age",col = "orange")

hist(df$`Daily.Time.Spent.on.Site`,main="histogram for Daily.Time.Spent.on.Site",col = "green")

hist(df$`Male`,main="histogram for Male",col = "blue")

hist(df$`Daily.Internet.Usage`,main="histogram for Daily.Internet.Usage",col = "yellow")

hist(df$`Clicked.on.Ad`,main="histogram for Clicked.on.Ad",col = "red")

3.3 Univariate Summary

  1. In our dataset, many people are aged between 25 and 40.
  2. In our dataset, the most common daily time spent on site is between 75 and 85.
  3. In our dataset, the common area income is between 50,000 and 70,000.
  4. In our dataset, there is averagely distributed

4 Bivariate analysis

#assigning columns to respective variables
ts<-df$Daily.Time.Spent.on.Site
age<-df$Age
ai<-df$Area.Income
dis<-df$Daily.Internet.Usage
mal<-df$Male
ca<-df$Clicked.on.Ad

4.1 Getting variance between columns

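# note: var(x, y) with two vectors returns their covariance, so these values match the cov() results in 4.3
cat("the variance between age and daily time spent on site is",var(ts,age))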
## the variance between age and daily time spent on site is -46.17415
cat("\n")
cat("the variance between age and Area.Income is",var(age,ai))
## the variance between age and Area.Income is -21520.93
cat("\n")
cat("the variance between age and daily internet usage is",var(age,dis))
## the variance between age and daily internet usage is -141.6348
cat("\n")
cat("the variance between age and Clicked.on.Ad is",var(ca,age))
## the variance between age and Clicked.on.Ad is 2.164665
cat("\n")
cat("the variance between area income and daily time spent on site is",var(ts,ai))
## the variance between area income and daily time spent on site is 66130.81
cat("\n")
cat("the variance between daily internet usage and daily time spent on site is",var(ts,dis))
## the variance between daily internet usage and daily time spent on site is 360.9919
cat("\n")
cat("the variance between clicked on ad and daily time spent on site is",var(ts,ca))
## the variance between clicked on ad and daily time spent on site is -5.933143
cat("\n")
cat("the variance between daily internet usage and area income is",var(ai,dis))
## the variance between daily internet usage and area income is 198762.5
cat("\n")
cat("the variance between daily internet usage and clicked on ad is",var(ca,dis))
## the variance between daily internet usage and clicked on ad is -17.27409
cat("\n")
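
The figures in this section are the same as the ones that `cov()` returns in 4.3. A short sketch on two made-up vectors shows why: with two arguments `var()` computes a covariance, and a correlation is that covariance rescaled by both standard deviations.

```r
# Two short hypothetical vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

# With two arguments, var() returns the covariance, so it equals cov()
print(var(x, y))
print(cov(x, y))

# The variance of a single column is its covariance with itself
print(var(x))
print(cov(x, x))

# Correlation is covariance divided by the product of the standard deviations
print(cov(x, y) / (sd(x) * sd(y)))
print(cor(x, y))
```

The same relation holds for our data: -46.17415 / (15.85361 * 8.785562) is about -0.3315, the correlation between age and daily time spent on site reported in 4.2.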

4.2 Getting correlation between columns

cat("the correlation between age and daily time spent on site is",cor(ts,age))
## the correlation between age and daily time spent on site is -0.3315133
cat("\n")
cat("the correlation between age and Area.Income is",cor(age,ai))
## the correlation between age and Area.Income is -0.182605
cat("\n")
cat("the correlation between age and daily internet usage is",cor(age,dis))
## the correlation between age and daily internet usage is -0.3672086
cat("\n")
cat("the correlation between age and Clicked.on.Ad is",cor(ca,age))
## the correlation between age and Clicked.on.Ad is 0.4925313
cat("\n")
cat("the correlation between area income and daily time spent on site is",cor(ts,ai))
## the correlation between area income and daily time spent on site is 0.3109544
cat("\n")
cat("the correlation between daily internet usage and daily time spent on site is",cor(ts,dis))
## the correlation between daily internet usage and daily time spent on site is 0.5186585
cat("\n")
cat("the correlation between clicked on ad and daily time spent on site is",cor(ts,ca))
## the correlation between clicked on ad and daily time spent on site is -0.7481166
cat("\n")
cat("the correlation between daily internet usage and area income is",cor(ai,dis))
## the correlation between daily internet usage and area income is 0.3374955
cat("\n")
cat("the correlation between daily internet usage and clicked on ad is",cor(ca,dis))
## the correlation between daily internet usage and clicked on ad is -0.7865392
cat("\n")
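
Calling `cor()` pair by pair makes it easy to repeat or mislabel a pair. As a sketch, assuming the numeric columns sit together in one data frame (the `toy` values and column names below are hypothetical), all pairwise correlations can be obtained in one call:

```r
# Hypothetical stand-in for the numeric columns of df
toy <- data.frame(
  time_on_site = c(68.9, 80.2, 69.5, 74.2, 45.0, 40.1),
  age = c(35, 31, 26, 29, 50, 48),
  clicked = c(0, 0, 0, 0, 1, 1)
)

# cor() on a data frame returns every pairwise correlation at once
cor_matrix <- round(cor(toy), 3)
print(cor_matrix)

# Correlations of each predictor with the target column, strongest first
with_target <- cor_matrix[, "clicked"]
with_target <- with_target[names(with_target) != "clicked"]
print(with_target[order(abs(with_target), decreasing = TRUE)])
```

For our dataset the equivalent call would be on the numeric columns only, since `cor()` does not accept character columns.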

4.3 Getting covariance between columns

cat("the covariance between age and daily time spent on site is",cov(ts,age))
## the covariance between age and daily time spent on site is -46.17415
cat("\n")
cat("the covariance between age and Area.Income is",cov(age,ai))
## the covariance between age and Area.Income is -21520.93
cat("\n")
cat("the covariance between age and daily internet usage is",cov(age,dis))
## the covariance between age and daily internet usage is -141.6348
cat("\n")
cat("the covariance between age and Clicked.on.Ad is",cov(ca,age))
## the covariance between age and Clicked.on.Ad is 2.164665
cat("\n")
cat("the covariance between area income and daily time spent on site is",cov(ts,ai))
## the covariance between area income and daily time spent on site is 66130.81
cat("\n")
cat("the covariance between daily internet usage and daily time spent on site is",cov(ts,dis))
## the covariance between daily internet usage and daily time spent on site is 360.9919
cat("\n")
cat("the covariance between clicked on ad and daily time spent on site is",cov(ts,ca))
## the covariance between clicked on ad and daily time spent on site is -5.933143
cat("\n")
cat("the covariance between daily internet usage and area income is",cov(ai,dis))
## the covariance between daily internet usage and area income is 198762.5
cat("\n")
cat("the covariance between daily internet usage and clicked on ad is",cov(ca,dis))
## the covariance between daily internet usage and clicked on ad is -17.27409
cat("\n")

4.4 Plotting scatterplots between columns

plot(age, dis, xlab="age", ylab="daily internet usage",col = "orange")

plot(age,ai, xlab="age", ylab="area income",col="blue")

plot(age, ts, xlab="age", ylab="Time spent on site",col="red")

plot(age,ca, xlab="age", ylab="clicked on ad",col="yellow")

plot(ts,ai, xlab="Time spent on site", ylab="area income",col="pink")

plot(ts,dis, xlab="Time spent on site", ylab="daily internet usage",col="grey")

plot(ts,ca, xlab="Time spent on site", ylab="clicked on ad",col="green")

plot(ai,dis, xlab="area income", ylab="daily internet usage",col="purple")

plot(ca,dis, xlab="clicked on ad", ylab="daily internet usage",col="black")

4.5 Bivariate summary

  1. There is a strong negative correlation between clicked on ad and daily internet usage (about -0.79).
  2. There is a strong negative correlation between clicked on ad and time spent on site (about -0.75).
  3. There is a moderate positive correlation between clicked on ad and age (about 0.49).
  4. The other pairs we computed, none of which involve clicked on ad or male, have weak to moderate correlations (absolute values between about 0.18 and 0.52).

5 Conclusion

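Looking at our data analysis, we can see that clicked on ad is correlated with our main columns: strongly and negatively with daily internet usage and time spent on site, and moderately and positively with age.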


6 Recommendation

  1. More emphasis should be put on all ages, not just those between 25 and 40 years. This will ensure that we get an accurate representation.

7 Modelling

7.0.0.1 feature selection

#selecting numerical columns for our dataset
df1 <- df[,c(1,2,3,4,7,10)]
head(df1)
##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage Male
## 1                    68.95  35    61833.90               256.09    0
## 2                    80.23  31    68441.85               193.77    1
## 3                    69.47  26    59785.94               236.50    0
## 4                    74.15  29    54806.18               245.89    1
## 5                    68.37  35    73889.99               225.58    0
## 6                    59.99  23    59761.56               226.74    1
##   Clicked.on.Ad
## 1             0
## 2             0
## 3             0
## 4             0
## 5             0
## 6             0
set.seed(1234)
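# drawing 240 uniform random numbers; order(random) is a permutation of 1:240,
# so this keeps the first 240 rows of df1 in shuffled order
random <- runif(240)
df1_random <- df1[order(random),]
# viewing the shuffled subset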
head(df1_random)
##     Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage Male
## 7                      88.91  33    53852.85               208.36    0
## 64                     86.06  32    61601.05               178.92    1
## 73                     55.35  39    75509.61               153.17    1
## 186                    46.88  54    43444.86               136.64    0
## 98                     39.94  41    64927.19               156.30    0
## 222                    75.83  27    67516.07               200.59    0
##     Clicked.on.Ad
## 7               0
## 64              0
## 73              1
## 186             1
## 98              1
## 222             0
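
All the row numbers printed above are at most 240, because `order()` applied to 240 random numbers returns a permutation of 1 to 240. A small sketch on a hypothetical 10-row data frame shows the difference between that approach and shuffling every row with `sample()`:

```r
set.seed(1234)

# order() of n random numbers is a permutation of 1..n
perm <- order(runif(5))
print(perm)

# Hypothetical data frame with 10 rows
toy <- data.frame(id = 1:10, value = (1:10) * 10)

# Indexing with a permutation of 1..5 reorders the first 5 rows only
first_five_shuffled <- toy[perm, ]
print(first_five_shuffled)

# sample(nrow(toy)) permutes every row index, so no rows are left out
all_shuffled <- toy[sample(nrow(toy)), ]
print(all_shuffled)
```

Whether to model on a 240-row subset or on all 1000 rows is a design choice; the sketch only shows how each would be written.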

7.0.0.2 Normalizing

# min-max scaling: maps the smallest value to 0 and the largest to 1
normal <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
normal(1:5)
## [1] 0.00 0.25 0.50 0.75 1.00
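# scaling every column except column 5 (Male); the models below are trained on df1_random, not on df_new
df_new <- as.data.frame(lapply(df1_random[,-5], normal))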
summary(df_new)
##  Daily.Time.Spent.on.Site      Age          Area.Income    
##  Min.   :0.0000           Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2970           1st Qu.:0.2500   1st Qu.:0.4749  
##  Median :0.6016           Median :0.3750   Median :0.6722  
##  Mean   :0.5457           Mean   :0.4101   Mean   :0.6281  
##  3rd Qu.:0.7724           3rd Qu.:0.5500   3rd Qu.:0.7948  
##  Max.   :1.0000           Max.   :1.0000   Max.   :1.0000  
##  Daily.Internet.Usage Clicked.on.Ad   
##  Min.   :0.0000       Min.   :0.0000  
##  1st Qu.:0.2010       1st Qu.:0.0000  
##  Median :0.4194       Median :1.0000  
##  Mean   :0.4421       Mean   :0.5292  
##  3rd Qu.:0.6608       3rd Qu.:1.0000  
##  Max.   :1.0000       Max.   :1.0000
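
The summary above shows that the 0/1 target was passed through the scaling function as well. A minimal sketch, on a hypothetical `toy` data frame, of scaling only the predictor columns and leaving the target untouched (the names `min_max`, `predictors` and `scaled` are our own for this example):

```r
# Min-max scaling maps the smallest value to 0 and the largest to 1
min_max <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical data: two predictors and a 0/1 target
toy <- data.frame(
  Age = c(20, 30, 40, 60),
  Income = c(14000, 47000, 57000, 79000),
  Clicked = c(0, 1, 0, 1)
)

# Scale only the predictor columns and keep the target as it is
predictors <- setdiff(names(toy), "Clicked")
scaled <- toy
scaled[predictors] <- lapply(toy[predictors], min_max)
print(scaled)
```

Note that a constant column would make `max(x) - min(x)` zero, so this function assumes each column has at least two distinct values.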

7.0.1 Model training

7.0.2 SVM

#Loading libraries
library(rpart,quietly = TRUE)
library(caret,quietly = TRUE)
library(rpart.plot,quietly = TRUE)
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

7.0.2.1 Splitting the data

intrain <- createDataPartition(y = df1_random$Clicked.on.Ad, p= 0.7, list = FALSE)
training <- df1_random[intrain,]
testing <- df1_random[-intrain,]
#checking dimensions of our sets
dim(training)
## [1] 168   6
dim(testing)
## [1] 72  6
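
To make the idea of a 70/30 split that keeps the class balance more concrete, here is a base-R sketch on a hypothetical `toy` data frame. It is not caret's implementation of `createDataPartition`, only an illustration of sampling 70% of the rows within each class:

```r
set.seed(1234)

# Hypothetical data with 10 rows per class
toy <- data.frame(x = 1:20, Clicked = rep(c(0, 1), each = 10))

# Draw 70% of the row indices within each class so both sets keep the class balance
rows_by_class <- split(seq_len(nrow(toy)), toy$Clicked)
train_idx <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
}))

training_toy <- toy[train_idx, ]
testing_toy <- toy[-train_idx, ]
print(table(training_toy$Clicked))
print(table(testing_toy$Clicked))
```

With 10 rows per class this gives 7 training and 3 testing rows for each class, and no row appears in both sets.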

7.0.2.2 Changing our target variable category to factor

training[["Clicked.on.Ad"]] = factor(training[["Clicked.on.Ad"]])

7.0.2.3 Training our model

# we train our model with 10-fold cross-validation, repeated 3 times
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
svm_Linear <- train(Clicked.on.Ad ~ ., data = training, method = "svmLinear",
                    trControl = trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 10)
#checking results of our model
svm_Linear
## Support Vector Machines with Linear Kernel 
## 
## 168 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (5), scaled (5) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 151, 151, 151, 151, 151, 151, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9783088  0.9566439
## 
## Tuning parameter 'C' was held constant at a value of 1
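
The sample sizes of 151 reported above follow from holding out one fold of the 168 training rows at a time. The sketch below only illustrates that bookkeeping with plain random folds in base R; caret builds its folds with its own routine, so the assignments here are an assumption for illustration, not what `train()` does internally:

```r
set.seed(1234)

n <- 168      # number of training rows reported above
k <- 10       # folds per repeat
repeats <- 3  # number of repeats

# For each repeat, shuffle the fold labels 1..k across the n rows
fold_ids <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# Each fold is held out once per repeat, giving k * repeats model fits in total
holdout_sizes <- sapply(fold_ids, function(ids) as.vector(table(ids)))
print(holdout_sizes)

# Rows left for fitting when the largest or smallest fold is held out
print(n - range(holdout_sizes))
```

Since 168 is not a multiple of 10, the folds hold 16 or 17 rows, which leaves 152 or 151 rows for fitting in each of the 30 resamples.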

7.0.2.4 Performing predictions with our model

test_pred <- predict(svm_Linear, newdata = testing)
test_pred
##  [1] 0 1 1 0 0 0 0 1 1 0 1 1 0 0 1 1 0 1 0 1 0 1 0 1 1 0 1 1 1 0 0 1 1 1 1 1 0 1
## [39] 0 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 0 1 1 0 0
## Levels: 0 1

7.0.2.5 Checking the accuracy

confusionMatrix(table(test_pred, testing$Clicked.on.Ad))
## Confusion Matrix and Statistics
## 
##          
## test_pred  0  1
##         0 33  0
##         1  1 38
##                                          
##                Accuracy : 0.9861         
##                  95% CI : (0.925, 0.9996)
##     No Information Rate : 0.5278         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9721         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9706         
##             Specificity : 1.0000         
##          Pos Pred Value : 1.0000         
##          Neg Pred Value : 0.9744         
##              Prevalence : 0.4722         
##          Detection Rate : 0.4583         
##    Detection Prevalence : 0.4583         
##       Balanced Accuracy : 0.9853         
##                                          
##        'Positive' Class : 0              
## 
# we can see that our model has achieved a decent accuracy of 98.6% on the test set (kappa 0.972)
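
The main figures in the report can be recomputed by hand from the four cells of the confusion matrix. A short sketch in base R, using the counts printed above (the variable names are our own):

```r
# Confusion matrix reported above: rows are predictions, columns are actual classes
cm <- matrix(c(33, 1, 0, 38), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

total <- sum(cm)
accuracy <- sum(diag(cm)) / total

# Class "0" is the positive class in the report
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])

# Cohen's kappa compares observed agreement with agreement expected by chance
expected <- sum(rowSums(cm) * colSums(cm)) / total^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, kappa = cohen_kappa), 4))
```

Accuracy is 71 correct predictions out of 72, about 0.9861, while 0.9721 is the kappa value, which is why the two should not be mixed up when quoting the result.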

7.0.2.6 Customizing our model

# candidate values for the cost parameter C; C = 0 is not a usable cost, so its fits fail and give NaN
grid <- expand.grid(C = c(0,0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2,5))
svm_Linear_Grid <- train(Clicked.on.Ad ~ ., data = training, method = "svmLinear",
                         trControl = trctrl,
                         preProcess = c("center", "scale"),
                         tuneGrid = grid,
                         tuneLength = 10)
## Warning: model fit failed for Fold01.Rep1: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold02.Rep1: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold03.Rep1: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold04.Rep1: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold05.Rep1: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold06.Rep1: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold07.Rep1: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold08.Rep1: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold09.Rep1: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold10.Rep1: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold01.Rep2: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold02.Rep2: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold03.Rep2: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold04.Rep2: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold05.Rep2: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold06.Rep2: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold07.Rep2: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold08.Rep2: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold09.Rep2: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold10.Rep2: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold01.Rep3: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold02.Rep3: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold03.Rep3: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold04.Rep3: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold05.Rep3: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold06.Rep3: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold07.Rep3: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold08.Rep3: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold09.Rep3: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning: model fit failed for Fold10.Rep3: C=0.00 Error in .local(x, ...) : 
##   No Support Vectors found. You may want to change your parameters
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
## Warning in train.default(x, y, weights = w, ...): missing values found in
## aggregated results
svm_Linear_Grid
## Support Vector Machines with Linear Kernel 
## 
## 168 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## Pre-processing: centered (5), scaled (5) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 151, 151, 152, 151, 151, 152, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.00        NaN        NaN
##   0.01  0.9501225  0.9008833
##   0.05  0.9621324  0.9246358
##   0.10  0.9621324  0.9246358
##   0.25  0.9678922  0.9361013
##   0.50  0.9719363  0.9441760
##   0.75  0.9738971  0.9480294
##   1.00  0.9758578  0.9519921
##   1.25  0.9738971  0.9480294
##   1.50  0.9738971  0.9480294
##   1.75  0.9759804  0.9521960
##   2.00  0.9740196  0.9482880
##   5.00  0.9698529  0.9398203
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 1.75.
plot(svm_Linear_Grid)

# we can see that our model is giving best cross-validated accuracy when C = 1.75
test_pred_grid <- predict(svm_Linear_Grid, newdata = testing)
test_pred_grid
##  [1] 0 1 1 0 0 0 0 1 1 0 1 1 0 0 1 1 0 1 0 1 0 1 0 1 1 0 1 1 1 0 0 1 1 1 1 1 0 1
## [39] 0 1 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 0 1 1 0 0
## Levels: 0 1
confusionMatrix(table(test_pred_grid, testing$Clicked.on.Ad))
## Confusion Matrix and Statistics
## 
##               
## test_pred_grid  0  1
##              0 33  0
##              1  1 38
##                                          
##                Accuracy : 0.9861         
##                  95% CI : (0.925, 0.9996)
##     No Information Rate : 0.5278         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9721         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9706         
##             Specificity : 1.0000         
##          Pos Pred Value : 1.0000         
##          Neg Pred Value : 0.9744         
##              Prevalence : 0.4722         
##          Detection Rate : 0.4583         
##    Detection Prevalence : 0.4583         
##       Balanced Accuracy : 0.9853         
##                                          
##        'Positive' Class : 0              
## 
# here our test accuracy is 98.6% (kappa 0.972), the same as the previous model
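
The choice of C can be checked directly against the resampling table printed above. As a sketch, with the accuracy values copied from that table into a small data frame of our own (C = 0 is left out because those fits failed):

```r
# Cross-validated accuracy per cost value, copied from the table above
results <- data.frame(
  C = c(0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 2.00, 5.00),
  Accuracy = c(0.9501225, 0.9621324, 0.9621324, 0.9678922, 0.9719363, 0.9738971,
               0.9758578, 0.9738971, 0.9738971, 0.9759804, 0.9740196, 0.9698529)
)

# The selected value is the one with the largest resampled accuracy
best <- results[which.max(results$Accuracy), ]
print(best)

# Gap between the best value and the default C = 1
print(best$Accuracy - results$Accuracy[results$C == 1])
```

The gap between C = 1.75 and the default C = 1 is very small, which fits with both models making the same predictions on the test set.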

7.1 Conclusion

We are able to see that by using the SVM modelling method we get good accuracy scores: about 98.6% on the 72 test rows, both before and after tuning C.