Sample Solution - Moringa Data Science Independent Project - Week 12 (Part 1)

Research Question

This section of the assessment covers supervised learning with R.

We will revisit last week's case study and, using what we learned and the given dataset, create a supervised learning model to help identify which individuals are most likely to click on the ads in the blog.

Note that you are required to include last week's IP insights, so you can add a modeling section to last week's submission and submit it.

Solution

We had already performed some of the steps below in the previous week’s project.

Loading our Dataset

# Importing the required libraries
library("data.table") 
library("dplyr")
library("rpart")
# Loading our dataset
df <- read.csv('http://bit.ly/IPAdvertisingData') 
head(df, n=10)
##    Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1                     68.95  35    61833.90               256.09
## 2                     80.23  31    68441.85               193.77
## 3                     69.47  26    59785.94               236.50
## 4                     74.15  29    54806.18               245.89
## 5                     68.37  35    73889.99               225.58
## 6                     59.99  23    59761.56               226.74
## 7                     88.91  33    53852.85               208.36
## 8                     66.00  48    24593.33               131.76
## 9                     74.53  30    68862.00               221.51
## 10                    69.88  20    55642.32               183.82
##                            Ad.Topic.Line             City Male    Country
## 1     Cloned 5thgeneration orchestration      Wrightburgh    0    Tunisia
## 2     Monitored national standardization        West Jodi    1      Nauru
## 3       Organic bottom-line service-desk         Davidton    0 San Marino
## 4  Triple-buffered reciprocal time-frame   West Terrifurt    1      Italy
## 5          Robust logistical utilization     South Manuel    0    Iceland
## 6        Sharable client-driven software        Jamieberg    1     Norway
## 7             Enhanced dedicated support      Brandonstad    0    Myanmar
## 8               Reactive local challenge Port Jefferybury    1  Australia
## 9         Configurable coherent function       West Colin    1    Grenada
## 10    Mandatory homogeneous architecture       Ramirezton    1      Ghana
##              Timestamp Clicked.on.Ad
## 1  2016-03-27 00:53:11             0
## 2  2016-04-04 01:39:02             0
## 3  2016-03-13 20:35:42             0
## 4  2016-01-10 02:31:19             0
## 5  2016-06-03 03:36:18             0
## 6  2016-05-19 14:30:17             0
## 7  2016-01-28 20:59:32             0
## 8  2016-03-07 01:40:15             1
## 9  2016-04-18 09:33:42             0
## 10 2016-07-11 01:42:51             0
# Dataset description
dim(df)
## [1] 1000   10

Data Preparation

# Checking for missing values
colSums(is.na(df))
## Daily.Time.Spent.on.Site                      Age              Area.Income 
##                        0                        0                        0 
##     Daily.Internet.Usage            Ad.Topic.Line                     City 
##                        0                        0                        0 
##                     Male                  Country                Timestamp 
##                        0                        0                        0 
##            Clicked.on.Ad 
##                        0
# Checking for duplicates
df[duplicated(df),]
##  [1] Daily.Time.Spent.on.Site Age                      Area.Income             
##  [4] Daily.Internet.Usage     Ad.Topic.Line            City                    
##  [7] Male                     Country                  Timestamp               
## [10] Clicked.on.Ad           
## <0 rows> (or 0-length row.names)
# Checking for Outliers 
# There are 4 numeric variables (Daily.Time.Spent.on.Site, Age, Area.Income, Daily.Internet.Usage).
# Plotting boxplots to check for outliers.
# Boxplot for age column.

bxplt_Age = boxplot(df$Age,
                   main = "Boxplot for Age variable",
                   xlab = "Age",
                   col = "pink",
                   border = "brown",
                   horizontal = TRUE,
                   notch = TRUE
)

There are no outliers in the Age variable.

# Boxplot for Area.Income column.
#
bxplt_Area.Income = boxplot(df$Area.Income,
                    main = "Boxplot for Area.Income variable",
                    xlab = "Area Income",   
                    col = "green",
                    border = "brown",
                    horizontal = TRUE,
                    notch = TRUE
)

There are some outliers in the Area.Income variable.
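For reference, boxplot() flags a point as an outlier when it lies more than 1.5 times the box length beyond either hinge. The sketch below reproduces that rule on made-up income values (the vector income is an assumption, not taken from the dataset) using boxplot.stats(), the helper that boxplot() relies on.

```r
# Hypothetical income values; one of them is far below the rest
income <- c(52000, 55000, 58000, 60000, 61000, 63000, 65000, 68000, 70000, 14000)

# boxplot.stats() returns the five-number summary and the flagged points
stats <- boxplot.stats(income)          # default coef = 1.5

# Rebuild the fences by hand from the lower and upper hinges
hinges      <- stats$stats[c(2, 4)]
box_length  <- diff(hinges)
lower_fence <- hinges[1] - 1.5 * box_length
upper_fence <- hinges[2] + 1.5 * box_length

# Points outside the fences are the ones boxplot() draws individually
manual_out <- income[income < lower_fence | income > upper_fence]
print(manual_out)
print(stats$out)
```

The hinges come from fivenum(), so they can differ slightly from the quartiles returned by quantile() on small samples.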

# Boxplot for Daily.Time.Spent.on.Site column
#
bxplt_Daily.Time.Spent.on.Site = boxplot(df$Daily.Time.Spent.on.Site,
                    main = "Boxplot for Daily.Time.Spent.on.Site",
                    xlab = "Daily Time Spent on Site",
                    col = "gold",
                    border = "brown",
                    horizontal = TRUE,
                    notch = TRUE
)

No outliers in the Daily.Time.Spent.on.Site column

# Boxplot for Daily.Internet.Usage column
#
bxplt_Daily.Internet.Usage = boxplot(df$Daily.Internet.Usage,
                            main = "Boxplot for Daily.Internet.Usage variable",
                            xlab = "Daily Internet Usage",
                            
                            col = "blue",
                            border = "brown",
                            horizontal = TRUE,
                            notch = TRUE
)

No outliers in the Daily.Internet.Usage column

# Handling the outliers in the area income variable
# we store the outliers in a variable outliers
#
outliers = bxplt_Area.Income$out
outliers
## [1] 17709.98 18819.34 15598.29 15879.10 14548.06 13996.50 14775.50 18368.57
# These values are to be excluded from our dataset.
# The %in% operator flags the rows in which the outliers exist;
# these rows will be removed from our data set.

# The dataset is copied to a new variable so that the original data frame is preserved
#
df_new = df
df_new = df_new[!(df_new$Area.Income %in% outliers), ]
# Checking if the new data frame has outliers.
#
boxplot(df_new$Area.Income)

The outliers flagged in Area.Income have been removed from df_new.
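A caveat about removing rows by condition in R: negative indexing with which() returns an empty data frame when no row matches, whereas a logical mask leaves the data untouched. A minimal sketch on a hypothetical toy data frame (the names toy and no_match are illustrative assumptions):

```r
# Toy data frame with three rows
toy <- data.frame(income = c(50000, 60000, 70000))

# A condition that matches no rows at all
no_match <- toy$income %in% c(1, 2, 3)

# which() returns integer(0); negating it is still integer(0),
# so this selects zero rows instead of keeping everything
dropped_which <- toy[-which(no_match), , drop = FALSE]

# A negated logical mask keeps every row when nothing matches
dropped_mask <- toy[!no_match, , drop = FALSE]

print(nrow(dropped_which))
print(nrow(dropped_mask))
```

Both forms give the same result whenever at least one row matches, which is the case for the outliers removed here.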

Data Modeling

Since the data is not normally distributed, a non-parametric model (a decision tree) will be used for modeling. As seen in the previous week's EDA, the City and Country columns have high cardinality and low variance, so they are dropped for modeling. The subset below also leaves out Ad.Topic.Line and Timestamp.
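Cardinality can be checked by counting the distinct values in each column. The sketch below does this on a small toy data frame built from the first four rows shown earlier; the names toy, n_unique and unique_ratio are illustrative assumptions.

```r
# First four rows of three columns, copied from the head() output above
toy <- data.frame(
  City    = c("Wrightburgh", "West Jodi", "Davidton", "West Terrifurt"),
  Country = c("Tunisia", "Nauru", "San Marino", "Italy"),
  Male    = c(0, 1, 0, 1)
)

# Number of distinct values per column
n_unique <- sapply(toy, function(col) length(unique(col)))

# Share of rows that carry a distinct value; a ratio near 1 means
# almost every row has its own level, which a tree cannot generalise from
unique_ratio <- n_unique / nrow(toy)

print(n_unique)
print(unique_ratio)
```

On the full data, the same call would be sapply(df, function(col) length(unique(col))).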

# Subsetting our data to keep the features (columns 1-4 and 7) and the label (column 10).
df_suset = df_new[, c(1,2,3,4,7,10)]
head(df_suset)
##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage Male
## 1                    68.95  35    61833.90               256.09    0
## 2                    80.23  31    68441.85               193.77    1
## 3                    69.47  26    59785.94               236.50    0
## 4                    74.15  29    54806.18               245.89    1
## 5                    68.37  35    73889.99               225.58    0
## 6                    59.99  23    59761.56               226.74    1
##   Clicked.on.Ad
## 1             0
## 2             0
## 3             0
## 4             0
## 5             0
## 6             0
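# Data splitting: random train/test split.
# Note: size is computed from nrow(df) (1000 rows), so 800 rows of df_suset
# are drawn for training; the 192 test rows scored below are what remains.
set.seed(0)
train <- sample(1:nrow(df_suset), size = ceiling(0.80 * nrow(df)), replace = FALSE)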
# training set
ad_train <- df_suset[train,]
# test set
ad_test <- df_suset[-train,]
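The split above can be sketched on a toy data frame. In this version the training size is derived from the same data frame that is sampled, which keeps the proportion at the intended 80%. The names toy, n_train and train_idx are illustrative assumptions.

```r
set.seed(0)

# Toy data frame with 50 rows and an alternating 0/1 label
toy <- data.frame(x = 1:50, y = rep(0:1, 25))

# Training size taken from the data frame that is actually sampled
n_train   <- ceiling(0.80 * nrow(toy))
train_idx <- sample(seq_len(nrow(toy)), size = n_train)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

print(nrow(toy_train))
print(nrow(toy_test))
```

seq_len() is used instead of 1:nrow() because it behaves sensibly even for an empty data frame.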
# Fitting a classification tree on the training set
set.seed(0)
ad_tree <- rpart(Clicked.on.Ad ~ ., data = ad_train, method = "class")
library(rpart.plot)
rpart.plot(ad_tree)

The feature that splits the data best is daily internet usage, followed by daily time spent on the site and area income.
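To make "splits the data best" concrete: for classification, rpart by default scores candidate splits with the Gini index and prefers the split with the largest impurity decrease. The sketch below computes that decrease by hand on made-up values; the helpers gini() and gini_gain() and the vectors usage and clicked are hypothetical and not part of rpart.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Impurity decrease obtained by splitting x at a threshold
gini_gain <- function(x, labels, threshold) {
  left  <- labels[x <  threshold]
  right <- labels[x >= threshold]
  n <- length(labels)
  gini(labels) -
    (length(left)  / n) * gini(left) -
    (length(right) / n) * gini(right)
}

# Made-up data: low internet usage goes with clicking (assumption)
usage   <- c(120, 135, 150, 160, 200, 220, 240, 255)
clicked <- c(1, 1, 1, 1, 0, 0, 0, 0)

gain_clean <- gini_gain(usage, clicked, 180)  # separates the classes fully
gain_poor  <- gini_gain(usage, clicked, 130)  # isolates a single row

print(gain_clean)
print(gain_poor)
```

A threshold that separates the two classes completely removes all impurity, which is why it scores higher than a threshold that isolates only one observation.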

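# Prediction
ad_pred <- predict(ad_tree, ad_test, type = 'class')
# Calculating accuracy
# Confusion matrix: rows are the actual labels, columns the predicted labels
conf_mat <- table(ad_test$Clicked.on.Ad, ad_pred)
# paste() flattens the matrix column by column:
# actual 0/predicted 0, actual 1/predicted 0, actual 0/predicted 1, actual 1/predicted 1
paste(conf_mat)
## [1] "90" "4"  "2"  "96"
accuracy_Test <- sum(diag(conf_mat)) / sum(conf_mat)
print(paste('Accuracy for test', accuracy_Test))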
## [1] "Accuracy for test 0.96875"

An accuracy of about 97% (0.96875) on the test set is satisfactory for the decision tree model.
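Accuracy alone hides how the errors are distributed between the two classes. The sketch below rebuilds the confusion matrix from the four counts printed above and derives recall, precision and specificity for the "clicked" class. It assumes the counts are in column-major order, which is how paste() flattens a table; the name cm is illustrative.

```r
# Counts from the output above, filled column by column
cm <- matrix(c(90, 4, 2, 96), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
recall      <- cm["1", "1"] / sum(cm["1", ])   # clickers correctly identified
precision   <- cm["1", "1"] / sum(cm[, "1"])   # predicted clickers that clicked
specificity <- cm["0", "0"] / sum(cm["0", ])   # non-clickers correctly identified

print(cm)
print(round(c(accuracy = accuracy, recall = recall,
              precision = precision, specificity = specificity), 4))
```

The six misclassified observations split into four missed clickers and two false alarms.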

ad_tree$variable.importance
##     Daily.Internet.Usage Daily.Time.Spent.on.Site              Area.Income 
##                274.03264                219.05370                104.35657 
##                      Age 
##                 99.72096
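The raw importance scores are easier to compare as shares of the total. The sketch below copies the four values printed above into a named vector (importance is an illustrative name) and rescales them to percentages.

```r
# Values copied from the ad_tree$variable.importance output above
importance <- c(
  Daily.Internet.Usage     = 274.03264,
  Daily.Time.Spent.on.Site = 219.05370,
  Area.Income              = 104.35657,
  Age                      = 99.72096
)

# Share of total importance, in percent, sorted from largest to smallest
importance_pct <- round(100 * sort(importance, decreasing = TRUE) / sum(importance), 1)
print(importance_pct)
```

With the fitted model at hand, the same rescaling can be applied directly to ad_tree$variable.importance. Male does not appear in the output above at all.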

The confusion matrix above shows that the model performed well in classifying the target variable, Clicked.on.Ad: only 6 of the 192 test observations were misclassified.

# Cross-checking the accuracy against the test labels
mean(ad_test$Clicked.on.Ad == ad_pred)
## [1] 0.96875

The decision tree algorithm performed well in classification, with a test accuracy of about 97%.

Conclusion

From the decision tree, we see that the most important features for determining whether a potential customer will click on the advertisement for the course are, in decreasing order of importance, daily internet usage, daily time spent on the site, area income, and age. The timestamp was left out of the model along with the ad topic line, city, and country, so this analysis does not measure the importance of the time and date of the click.

We have now created a model that can help identify which individuals are most likely to click on the ads in the blog.
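To apply such a model, predict() can return class probabilities for new visitors. The sketch below is self-contained and uses synthetic data with an assumed labelling rule, so toy, toy_tree and new_visitors are illustrative only and the fitted tree is not the ad_tree above.

```r
library(rpart)

set.seed(1)  # arbitrary seed for the synthetic data
n <- 200
toy <- data.frame(
  Daily.Internet.Usage     = runif(n, 100, 270),
  Daily.Time.Spent.on.Site = runif(n, 30, 90)
)
# Assumed labelling rule, for illustration only: low usage means a click
toy$Clicked.on.Ad <- as.integer(toy$Daily.Internet.Usage < 180)

toy_tree <- rpart(Clicked.on.Ad ~ ., data = toy, method = "class")

# Two hypothetical new visitors to score
new_visitors <- data.frame(
  Daily.Internet.Usage     = c(120, 250),
  Daily.Time.Spent.on.Site = c(45, 80)
)

# One row per visitor, one column per class ("0" and "1")
click_prob <- predict(toy_tree, new_visitors, type = "prob")
print(click_prob)
```

Ranking visitors by the "1" column gives an ordering of who is most likely to click, which is more useful for targeting than the hard class labels.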