This section of the assessment covers supervised learning with R.
We revisit last week's case study and, using those learnings and the given dataset, build a supervised learning model to help identify which individuals are most likely to click on the ads in the blog.
Note that you are required to include last week's IP insights, so you can add a modeling section to your previous submission and submit that.
Some of the steps below were already performed in the previous week's project.
# Importing the required libraries
library("data.table")
## Warning: package 'data.table' was built under R version 4.1.3
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("rpart")
# Loading our dataset
df <- read.csv('http://bit.ly/IPAdvertisingData')
head(df, n=10)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1 68.95 35 61833.90 256.09
## 2 80.23 31 68441.85 193.77
## 3 69.47 26 59785.94 236.50
## 4 74.15 29 54806.18 245.89
## 5 68.37 35 73889.99 225.58
## 6 59.99 23 59761.56 226.74
## 7 88.91 33 53852.85 208.36
## 8 66.00 48 24593.33 131.76
## 9 74.53 30 68862.00 221.51
## 10 69.88 20 55642.32 183.82
## Ad.Topic.Line City Male Country
## 1 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2 Monitored national standardization West Jodi 1 Nauru
## 3 Organic bottom-line service-desk Davidton 0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5 Robust logistical utilization South Manuel 0 Iceland
## 6 Sharable client-driven software Jamieberg 1 Norway
## 7 Enhanced dedicated support Brandonstad 0 Myanmar
## 8 Reactive local challenge Port Jefferybury 1 Australia
## 9 Configurable coherent function West Colin 1 Grenada
## 10 Mandatory homogeneous architecture Ramirezton 1 Ghana
## Timestamp Clicked.on.Ad
## 1 2016-03-27 00:53:11 0
## 2 2016-04-04 01:39:02 0
## 3 2016-03-13 20:35:42 0
## 4 2016-01-10 02:31:19 0
## 5 2016-06-03 03:36:18 0
## 6 2016-05-19 14:30:17 0
## 7 2016-01-28 20:59:32 0
## 8 2016-03-07 01:40:15 1
## 9 2016-04-18 09:33:42 0
## 10 2016-07-11 01:42:51 0
# Dataset description
dim(df)
## [1] 1000 10
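Beyond the dimensions, a compact view of each column's type helps describe the dataset; a quick base-R sketch (output omitted here):
# Compact structure summary: column names, types and example values
str(df)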
# Checking for missing values
colSums(is.na(df))
## Daily.Time.Spent.on.Site Age Area.Income
## 0 0 0
## Daily.Internet.Usage Ad.Topic.Line City
## 0 0 0
## Male Country Timestamp
## 0 0 0
## Clicked.on.Ad
## 0
# Checking for duplicates
df[duplicated(df),]
## [1] Daily.Time.Spent.on.Site Age Area.Income
## [4] Daily.Internet.Usage Ad.Topic.Line City
## [7] Male Country Timestamp
## [10] Clicked.on.Ad
## <0 rows> (or 0-length row.names)
# Checking for outliers
# There are 4 numeric variables (Daily.Time.Spent.on.Site, Age, Area.Income, Daily.Internet.Usage).
# Plotting boxplots to check for outliers.
# Boxplot for age column.
bxplt_Age = boxplot(df$Age,
main = "Boxplot for Age variable",
xlab = "Age",
col = "pink",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
There are no outliers in the Age variable.
# Boxplot for Area.Income column.
#
bxplt_Area.Income = boxplot(df$Area.Income,
main = "Boxplot for Area.Income variable",
xlab = "Area Income",
col = "green",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
There are some outliers in the Area.Income variable.
# Boxplot for Daily.Time.Spent.on.Site column
#
bxplt_Daily.Time.Spent.on.Site = boxplot(df$Daily.Time.Spent.on.Site,
main = "Boxplot for Daily.Time.Spent.on.Site",
xlab = "Daily Time Spent on Site",
col = "gold",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
There are no outliers in the Daily.Time.Spent.on.Site variable.
# Boxplot for Daily.Internet.Usage column
#
bxplt_Daily.Internet.Usage = boxplot(df$Daily.Internet.Usage,
main = "Boxplot for Daily.Internet.Usage variable",
xlab = "Daily Internet Usage",
col = "blue",
border = "brown",
horizontal = TRUE,
notch = TRUE
)
There are no outliers in the Daily.Internet.Usage variable.
# Handling the outliers in the area income variable
# we store the outliers in a variable outliers
#
outliers = bxplt_Area.Income$out
outliers
## [1] 17709.98 18819.34 15598.29 15879.10 14548.06 13996.50 14775.50 18368.57
# This vector is to be excluded from our dataset.
# The which() function identifies the rows in which the outliers occur;
# these rows are then removed from the data set.
# The cleaned data is stored in a new data frame so as not to alter the original dataset.
#
df_new = df
df_new = df_new[-which(df_new$Area.Income %in% outliers),]
# Checking if the new data frame has outliers.
#
boxplot(df_new$Area.Income)
The outliers in the Area.Income variable have been removed; the data is now free of outliers.
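For reference, the same cleaning can be approximately reproduced without boxplot(), using the usual 1.5 × IQR fences (a sketch; df_iqr is a hypothetical name, and boxplot's hinges can differ slightly from quantile()):
# Compute the quartiles and the 1.5 * IQR fence for Area.Income
q <- quantile(df$Area.Income, c(0.25, 0.75))
fence <- 1.5 * IQR(df$Area.Income)
# Keep only rows inside the fences; should closely match df_new above
df_iqr <- df[df$Area.Income >= q[1] - fence & df$Area.Income <= q[2] + fence, ]
dim(df_iqr)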
Since the data is not normally distributed, a non-parametric model (a decision tree) will be used for this modeling. As seen in the previous week's EDA, the Ad.Topic.Line, City, and Country columns have very high cardinality (nearly one unique value per row), so they are dropped for modelling, along with the Timestamp column.
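The cardinality claim is easy to verify by counting distinct values per column; a quick sketch:
# Number of unique values in each categorical column
sapply(df_new[, c("Ad.Topic.Line", "City", "Country")],
       function(x) length(unique(x)))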
# Subsetting our data to contain the label and features.
df_subset = df_new[, c(1, 2, 3, 4, 7, 10)]
head(df_subset)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage Male
## 1 68.95 35 61833.90 256.09 0
## 2 80.23 31 68441.85 193.77 1
## 3 69.47 26 59785.94 236.50 0
## 4 74.15 29 54806.18 245.89 1
## 5 68.37 35 73889.99 225.58 0
## 6 59.99 23 59761.56 226.74 1
## Clicked.on.Ad
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
# Data splitting into train and test sets
set.seed(0)
# Note: this draws 800 rows (80% of the original 1000) from the 992 remaining rows
train <- sample(1:nrow(df_subset), size = ceiling(0.80 * nrow(df)), replace = FALSE)
# training set
ad_train <- df_subset[train,]
# test set
ad_test <- df_subset[-train,]
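As a quick sanity check (not in the original write-up), the random split should roughly preserve the class balance:
# Proportion of each class (0 = no click, 1 = click) in the two splits
prop.table(table(ad_train$Clicked.on.Ad))
prop.table(table(ad_test$Clicked.on.Ad))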
set.seed(0)
ad_tree <- rpart(Clicked.on.Ad~., data= ad_train,
method = "class")
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.1.3
rpart.plot(ad_tree)
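If the tree were overfitting, rpart's built-in cross-validation results could guide pruning; a minimal sketch (ad_tree_pruned is a hypothetical name; the default fit proved adequate here):
# Cross-validated error at each complexity-parameter (cp) value
printcp(ad_tree)
# Prune back to the subtree with the lowest cross-validated error
best_cp <- ad_tree$cptable[which.min(ad_tree$cptable[, "xerror"]), "CP"]
ad_tree_pruned <- prune(ad_tree, cp = best_cp)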
The feature that splits the data best is daily internet usage, followed by daily time spent on the site and area income.
# Prediction on the test set
ad_pred <- predict(ad_tree, ad_test, type = 'class')
# Confusion matrix (rows: actual, columns: predicted)
t <- table(ad_test$Clicked.on.Ad, ad_pred)
t
##    ad_pred
##      0  1
##   0 90  2
##   1  4 96
accuracy_Test <- sum(diag(t)) / sum(t)
print(paste('Accuracy for test', accuracy_Test))
## [1] "Accuracy for test 0.96875"
An accuracy of roughly 97% on the test set is satisfactory for the decision tree model.
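Accuracy alone can hide class-specific errors; precision and recall follow directly from the confusion matrix above (rows are actual labels, columns are predictions):
# Precision: of the 98 predicted clicks, 96 were real
precision <- t["1", "1"] / sum(t[, "1"])
# Recall: of the 100 actual clicks, 96 were caught
recall <- t["1", "1"] / sum(t["1", ])
c(precision = precision, recall = recall)
## precision    recall 
## 0.9795918 0.9600000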
ad_tree$variable.importance
## Daily.Internet.Usage Daily.Time.Spent.on.Site Area.Income
## 274.03264 219.05370 104.35657
## Age
## 99.72096
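The same importance scores can be visualised as a bar chart; a quick sketch:
# Bar chart of the importance scores computed by rpart
barplot(ad_tree$variable.importance,
        main = "Variable importance",
        col = "skyblue",
        cex.names = 0.7)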
The confusion matrix above shows that the model performed well in classifying the target variable.
# Verifying the accuracy against the test-set labels
mean(ad_test$Clicked.on.Ad == ad_pred)
## [1] 0.96875
The decision tree algorithm has performed well, classifying clicks with a test accuracy of roughly 97%.
From the decision tree, we see that the most important features for determining whether a potential customer will click on the advertisement are, in order: daily internet usage, daily time spent on the site, area income, and age. The timestamp of the visit was not needed for this prediction.
We have now created a model that can help identify which individuals are most likely to click on the ads in the blog.
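As an illustration of how the model could be used going forward, one can score a hypothetical new visitor (the values below are made up for demonstration):
# A hypothetical blog visitor (illustrative values only)
new_visitor <- data.frame(
  Daily.Time.Spent.on.Site = 70.0,
  Age = 30,
  Area.Income = 60000,
  Daily.Internet.Usage = 230.0,
  Male = 1
)
# Predicted probability of each class (0 = no click, 1 = click)
predict(ad_tree, new_visitor, type = "prob")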