Research Question
A Kenyan entrepreneur has created an online cryptography course and would want to advertise it on her blog. She currently targets audiences originating from various countries. In the past, she ran ads to advertise a related course on the same blog and collected data in the process. She would now like to employ your services as a Data Science Consultant to help her identify which individuals are most likely to click on her ads.
Determine which kind of people are likely to click on the ads based on characteristics given in the dataset.
Performing univariate and bivariate analysis on the cleaned dataset and later determining the attributes of individuals who are likely to click on our client’s ads.
The dataset provided contains attributes of people who veiw the client’s blog and to whether they clicked on a specific ad about her previous cryptography course. Through analysis, we are able to determine the demographics of these people so as to govern strategy of attracting more customers for her new course.
This is an analysis problem which can be solved by understanding the preferences of the client’s viewers. We need to determine the kind of viewer that is most likely to click on a cryptography course ad and most likely sign up for it as well.
# load Tibble library
library(tibble)
# load the dataset as dataframe
df <- read.csv('http://bit.ly/IPAdvertisingData')
# convert dataframe to tibble and set options to show all columns
df <- as_tibble(df)
#check data structure of our dataset
class(df)
## [1] "tbl_df" "tbl" "data.frame"
# preview the first 6 rows of the dataset
head(df)
# preview the last 6 rows of the dataset
tail(df)
# get number of records
nrow(df); ncol(df)
## [1] 1000
## [1] 10
The dataset contains 1000 customer attributes and 10 variables.
# get datatypes of each column using class
sapply(df, class)
## Daily.Time.Spent.on.Site Age Area.Income
## "numeric" "integer" "numeric"
## Daily.Internet.Usage Ad.Topic.Line City
## "numeric" "character" "character"
## Male Country Timestamp
## "integer" "character" "character"
## Clicked.on.Ad
## "integer"
# inspect variable classes
str(df)
## tibble [1,000 × 10] (S3: tbl_df/tbl/data.frame)
## $ Daily.Time.Spent.on.Site: num [1:1000] 69 80.2 69.5 74.2 68.4 ...
## $ Age : int [1:1000] 35 31 26 29 35 23 33 48 30 20 ...
## $ Area.Income : num [1:1000] 61834 68442 59786 54806 73890 ...
## $ Daily.Internet.Usage : num [1:1000] 256 194 236 246 226 ...
## $ Ad.Topic.Line : chr [1:1000] "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
## $ City : chr [1:1000] "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
## $ Male : int [1:1000] 0 1 0 1 0 1 0 1 1 1 ...
## $ Country : chr [1:1000] "Tunisia" "Nauru" "San Marino" "Italy" ...
## $ Timestamp : chr [1:1000] "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
## $ Clicked.on.Ad : int [1:1000] 0 0 0 0 0 0 0 1 0 0 ...
# create function to convert categorical columns to factor datatype
tofactor <- function(column){
as.factor(column)
}
# columns to be converted to factors
df$Male <- tofactor(df$Male)
df$Clicked.on.Ad <- tofactor(df$Clicked.on.Ad)
df$City <- tofactor(df$City)
df$Country <- tofactor(df$Country)
# convert timestamp to datetime
df$Timestamp <- as.POSIXct(df$Timestamp, format = "%Y-%m-%d %H:%M:%S")
# split date and time variables
df$Date <- as.Date(df$Timestamp)
df$Time <- format(df$Timestamp,"%H:%M:%S")
# check datatypes again
sapply(df, class)
## $Daily.Time.Spent.on.Site
## [1] "numeric"
##
## $Age
## [1] "integer"
##
## $Area.Income
## [1] "numeric"
##
## $Daily.Internet.Usage
## [1] "numeric"
##
## $Ad.Topic.Line
## [1] "character"
##
## $City
## [1] "factor"
##
## $Male
## [1] "factor"
##
## $Country
## [1] "factor"
##
## $Timestamp
## [1] "POSIXct" "POSIXt"
##
## $Clicked.on.Ad
## [1] "factor"
##
## $Date
## [1] "Date"
##
## $Time
## [1] "character"
head(df)
# check for outliers
# select only numerical variables
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data_num <- select_if(df, is.numeric)
head(data_num)
# boxplot for the numerical values except for Area.Income column which is on another scale
library(ggplot2)
#ggplot(stack(select(data_num, -Area.Income)), aes(x = ind, y = values)) + geom_boxplot()
# height=10, boxwex = 0.3, outwex = 1
columns <- c(names(data_num))
boxplots <- function(column_name){
column <- c(data_num[,column_name])
boxplot(column, width=1, main=paste("Boxplot for", columns[x]))
}
# create 4 subplots
par(mfrow=c(2,2))
for (x in 1:length(columns)){
boxplots(x)
}
There are no outliers for Daily.Time.Spent.on.Site, Age and Daily.Internet.Usage columns.
# boxplot for Area.Income
boxplot(Area.Income~Male, data=df)
There are outliers in the Area Income column especially for the Male gender whose mean income is a bit higher than for the female. Outliers are men earning less than the rest. We shall check if the outliers are valid income figures.
# check for Area Income less than 0 which would be invalid
sum(df[,"Area.Income"] < 0)
## [1] 0
# import Amelia and Rcpp libraries
library(Amelia,Rcpp)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.0, built: 2021-05-26)
## ## Copyright (C) 2005-2022 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
#checking missing data visualization
missmap(df)
## Warning: Unknown or uninitialised column: `arguments`.
## Unknown or uninitialised column: `arguments`.
## Warning: Unknown or uninitialised column: `imputations`.
There are no missing values in our dataframe.
# confirming we have no missing values
colSums(is.na(df))
## Daily.Time.Spent.on.Site Age Area.Income
## 0 0 0
## Daily.Internet.Usage Ad.Topic.Line City
## 0 0 0
## Male Country Timestamp
## 0 0 0
## Clicked.on.Ad Date Time
## 0 0 0
# check for duplicates
anyDuplicated(df)
## [1] 0
There are no duplicated values in our dataset, otherwise we would have to drop them.
# check univariate summary of our dataset.
summary(df)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## Min. :32.60 Min. :19.00 Min. :13996 Min. :104.8
## 1st Qu.:51.36 1st Qu.:29.00 1st Qu.:47032 1st Qu.:138.8
## Median :68.22 Median :35.00 Median :57012 Median :183.1
## Mean :65.00 Mean :36.01 Mean :55000 Mean :180.0
## 3rd Qu.:78.55 3rd Qu.:42.00 3rd Qu.:65471 3rd Qu.:218.8
## Max. :91.43 Max. :61.00 Max. :79485 Max. :270.0
##
## Ad.Topic.Line City Male Country
## Length:1000 Lisamouth : 3 0:519 Czech Republic: 9
## Class :character Williamsport : 3 1:481 France : 9
## Mode :character Benjaminchester: 2 Afghanistan : 8
## East John : 2 Australia : 8
## East Timothy : 2 Cyprus : 8
## Johnstad : 2 Greece : 8
## (Other) :986 (Other) :950
## Timestamp Clicked.on.Ad Date
## Min. :2016-01-01 02:52:10.00 0:500 Min. :2015-12-31
## 1st Qu.:2016-02-18 02:55:42.00 1:500 1st Qu.:2016-02-17
## Median :2016-04-07 17:27:29.50 Median :2016-04-07
## Mean :2016-04-10 10:34:06.64 Mean :2016-04-09
## 3rd Qu.:2016-05-31 03:18:14.00 3rd Qu.:2016-05-30
## Max. :2016-07-24 00:22:16.00 Max. :2016-07-23
##
## Time
## Length:1000
## Class :character
## Mode :character
##
##
##
##
From the table above we are able to get the mean, median, quantiles and range of all numerical variables.
The average daily time spent on the site is 65. The most time a client spent online is 91.43 and the least time is 32.60.
The average area income is 55000, the highest being 79485 and the lowest being 13996.
The average age of clients in our dataset is 36. Max age is 61 and min age is 19.
The average daily internet usage is 180, highest being 270 and the lowest being 104.8.
The data was colected from 1st January 2016 to 24th July, 2016.
Our dataset is balanced as it contains an equal share of people who clicked on the ad and those who did not.
# let's define a function that will print results for numeric columns
library(dplyr)
printing <- function(func){
print(paste("Daily.Time.Spent.on.Site: ", func(df$Daily.Time.Spent.on.Site)))
print(paste("Daily.Internet.Usage: ", func(df$Daily.Internet.Usage)))
print(paste("Age: ", func(df$Age)))
print(paste("Area.Income: ", func(df$Area.Income)))
}
# use above function to print range
paste("The range of each column is as shown below:-")
## [1] "The range of each column is as shown below:-"
printing(range)
## [1] "Daily.Time.Spent.on.Site: 32.6" "Daily.Time.Spent.on.Site: 91.43"
## [1] "Daily.Internet.Usage: 104.78" "Daily.Internet.Usage: 269.96"
## [1] "Age: 19" "Age: 61"
## [1] "Area.Income: 13996.5" "Area.Income: 79484.8"
# other columns whose range is obtainable
print(paste("Ad.Topic.Line: ", range(df$Ad.Topic.Line)))
## [1] "Ad.Topic.Line: Adaptive 24hour Graphic Interface"
## [2] "Ad.Topic.Line: Visionary reciprocal circuit"
print(paste("Date: ", range(df$Date)))
## [1] "Date: 2015-12-31" "Date: 2016-07-23"
paste("The interquartile range of each column is shown beow:-")
## [1] "The interquartile range of each column is shown beow:-"
# use IQR function
printing(IQR)
## [1] "Daily.Time.Spent.on.Site: 27.1875"
## [1] "Daily.Internet.Usage: 79.9625"
## [1] "Age: 13"
## [1] "Area.Income: 18438.8325"
The interquatile range explains the range where the bulk of the values lie.
# mode for all columns which is not included in the summary
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
paste("The mode of the columns is as shown below:-")
## [1] "The mode of the columns is as shown below:-"
printing(getmode)
## [1] "Daily.Time.Spent.on.Site: 62.26"
## [1] "Daily.Internet.Usage: 167.22"
## [1] "Age: 31"
## [1] "Area.Income: 61833.9"
# add the rest of the columns whosemode can be obtained
print(paste("Male: ", getmode(df$Male)))
## [1] "Male: 0"
print(paste("Clicked.on.Ad: ", getmode(df$Clicked.on.Ad)))
## [1] "Clicked.on.Ad: 0"
print(paste("Ad.Topic.Line: ", getmode(df$Ad.Topic.Line)))
## [1] "Ad.Topic.Line: Cloned 5thgeneration orchestration"
print(paste("Country: ", getmode(df$Country)))
## [1] "Country: Czech Republic"
print(paste("City: ", getmode(df$City)))
## [1] "City: Lisamouth"
print(paste("Date: ", getmode(df$Date)))
## [1] "Date: 2016-02-14"
The mode of the columns as shown above represents the column values that are most repeated for each column.
print("The variance of numerical columns is as shown below:-")
## [1] "The variance of numerical columns is as shown below:-"
printing(var)
## [1] "Daily.Time.Spent.on.Site: 251.337094854855"
## [1] "Daily.Internet.Usage: 1927.41539618619"
## [1] "Age: 77.1861051051051"
## [1] "Area.Income: 179952405.951775"
The variance explains the measure of spread of the data points within the data set.
print("The standard deviation of the numerical columns is as shown below:-")
## [1] "The standard deviation of the numerical columns is as shown below:-"
printing(sd)
## [1] "Daily.Time.Spent.on.Site: 15.8536145675002"
## [1] "Daily.Internet.Usage: 43.9023393019801"
## [1] "Age: 8.78556231012592"
## [1] "Area.Income: 13414.6340222824"
The standard deviation measures the distribution of data points around the mean.
# Pie chart function
# import plotrix for 3D plot
library(plotrix)
piechart <- function(column, title){
# define labels and values
slices <- c(tabulate(column))
lbls <- c(unique(column))
# convert values to percentage
pct <- round(slices/sum(slices)*100)
# add percentages to labels
lbls <- paste(lbls,":", pct)
# ad % to labels
lbls <- paste(lbls,"%",sep="")
# plot pie chart in 3D
pie3D(slices,labels=lbls, col=rainbow(length(lbls)),
explode=0.1, radius=1, start=0, main=title)
}
# Clicked.on.ad representation
piechart(df$Clicked.on.Ad, "Pie Chart for Clicked on ad representation")
Half of the population of the dataset clicked on the ad.
# Gender representation
piechart(df$Male, "Pie Chart for Gender representation")
52% are female while 48% of the population is male.
# filter data to only have those who clicked on ad
data <- df %>% filter(Clicked.on.Ad == 1)
data
# filter data to only those who didn't click on the ad
data_ <- df %>% filter(Clicked.on.Ad == 0)
data_
# Gender representation of those who clicked on ad
piechart(data$Male, "Pie Chart for Gender representation(Clicked on Ad)")
More female clicked on the ad than male.
# histograms function
histogram <- function(column,title, xlab){
hist(column, main= title, xlab=xlab, ylab = "Frequency", col = "lightgreen")
}
# age representation
histogram(df$Age, "Histogram for Age representation", "Age")
Age representation in the dataset is skewed to the right. Most people lie between the age of 25 and 40.
# age representation of those who clicked on ad
histogram(data$Age, "Histogram for Age representation(Clicked on Ad)", "Age")
The data seems normally distributed. Those who clicked on the ad are aged between 35 and 45.
# area income representation
histogram(df$Area.Income, "Histogram for Area Income representation", "Area income")
Area Income representation is skewed to the left, most people earning between 50k and 70k.
# Area representation of those who clicked on ad
histogram(data$Area.Income, "Histogram for Area Income representation(Clicked on Ad)", "Area Income")
Those who clicked on ad earn between 40k and 60k.
# Daily time spent on site representation
histogram(df$Daily.Time.Spent.on.Site, "Histogram for Daily Time spent on site representation", "Daily Time spent on site")
Most people spend between 65 to 85 minutes on site.
# Daily time spent on site representation of those who clicked on ad
histogram(data$Daily.Time.Spent.on.Site, "Histogram for Daily Time spent on site representation(Clicked on Ad)", "Daily Time spent on site")
Most people who click on the ad spend between 35 to 60 minutes on site.
# Daily internet usage representation
# customizing bin width
bin_width <- 10
nbins <- seq(min(df$Daily.Internet.Usage) - bin_width,
max(df$Daily.Internet.Usage) + bin_width, by = bin_width)
# plot histogram
hist(df$Daily.Internet.Usage, breaks = nbins, main = "Histogram for Daily Internet Usage representation", xlab = "Daily Internet Usage", ylab="Frequency", col="lightgreen")
Most people use internet bundles between 115 to 145MBs and between 195 and 235MBs.
# Daily internet usage representation of those who clicked on ad
hist(data$Daily.Internet.Usage, breaks = nbins, main = "Histogram for Daily Internet Usage representation(Clicked on Ad)", xlab = "Daily Internet Usage", ylab="Frequency", col="lightgreen")
Most people who click on ads use internet data between 105 and 185MBs.
# barplot function
bar <- function(column, title, xlab){
# create frequency table
freq <- table(column)
# sort frequency table
sorted_freq <- (freq[order(freq,decreasing=TRUE)])
# adjust margins of frequency table
par(mar=c(11,4,1,0))
# plot bar graph for first 20 values with the highest count
barplot(sorted_freq[1:20], main=title, ylab="Frequency", las=2, col="lightblue")
title(xlab = xlab, line=10)
}
# City representation
bar(df$City, "Barplot for City representation", "City")
Most people come from Lisamouth and Williamsport cities both with 3 counts each.
# City representation of those who clicked on ad
bar(data$City, "Barplot for City representation(Clicked on ad)", "City")
Most people who click the ads are in 10 cities namely: - Lake David - Lake James - Lisamouth - Michelleside - Millerbury - Robertfurt - South Lisa - West Amanda - West Shannon - Williamsport
# Country representation
bar(df$Country, "Barplot for Country representation", "Country")
Czech Republic and France have the highest representation of the population in our dataset both with 9 counts.
# Country representation of those who clicked on ad
bar(data$Country, "Barplot for Country representation(Clicked on ad)", "Country")
Most people who clicked on the ad are in 3 countries namely: Australia, Ethopia and Turkey.
# Timestamp representation
bar(df$Date, "Barplot for DateTime representation", "Date")
Data was collected most for 2016-02-14 and 2016-04-04 each with a frequency of 12.
# Timestamp representation for those who clicked on ad
bar(data$Date, "Barplot for DateTime representation(Clicked on Ad)", "Date")
2016-02-14 recorded the highest number of people who clicked on the ad.
length(unique(df$Ad.Topic.Line))
## [1] 1000
The topics of the Ads are all unique hence cannot help us determine which ind of ad topics our viewers are likely to click on.
# check covariance in numerical variables with covariance matrix
cov(data_num)
## Daily.Time.Spent.on.Site Age Area.Income
## Daily.Time.Spent.on.Site 251.33709 -46.17415 66130.81
## Age -46.17415 77.18611 -21520.93
## Area.Income 66130.81091 -21520.92580 179952405.95
## Daily.Internet.Usage 360.99188 -141.63482 198762.53
## Daily.Internet.Usage
## Daily.Time.Spent.on.Site 360.9919
## Age -141.6348
## Area.Income 198762.5315
## Daily.Internet.Usage 1927.4154
Covariance is the measure of how two random variables vary together. A high negative covariance indicates negative correlation while a high positive covariance indicates positive correlation. A value close to zero indicates weak covariance.
We can say the following have positive covariance indicating positive correlation:- * Area.Income and Daily.Time.Spent.On.Site * Area.Income and Daily.Internet.Usage * Daily.Time.Spent.On.Site and Daily.Internet.Usage
The following have negative covariance indicating negative correlation:- Age and Area.Income Age and Daily.Internet.Usage
In order to measure the strength of this relationship, we shall use a correlation matrix.
# correlation matrix
cormat <- cor(data_num)
# melt correlation matrix to enable heatmap plot
library(reshape2)
melted_cormat <- melt(cormat)
# plot heatmap
library(ggplot2)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() + geom_text(aes(Var2, Var1, label = value), color = "white", size = 2) + ggtitle("Correlation Heatmap")
The correlation matrix tries to explain the covariance between variables. There is a strong positive correlation between Daily.Time.Spent.On.Site and Daily.Internet.Usage since time spent on site depends on internet accessibility.
All other variables are weakly correlated.
# scatterplot for the two correlated variables
plot(df$Daily.Time.Spent.on.Site, df$Daily.Internet.Usage, col =df$Clicked.on.Ad, xlab="Daily Time Spent on site", ylab="Daily Internet Usage", main="Daily Internet Usage vs Daily Time Spent on Site")
# Legend
legend("topleft",
legend = levels(df$Clicked.on.Ad),
pch = 19,
col = factor(levels(df$Clicked.on.Ad)))
Most people with high internet usage who spent most daily time on site did not click on the ad.
# create funtion to plot stacked barchart
stacked <- function(column1, column2, title, value1, value2, legend){
# vectorize the two columns to plot
attribute1 <- c(column1)
attribute2 <- c(column2)
# create dataframe of the two columns
data <- data.frame(attribute1, attribute2)
# plot stacked bargraph
ggplot(data=data) + geom_bar(aes(fill=attribute2, x=attribute1)) + ggtitle(title) + xlab(value1) + ylab(value2) + theme(plot.title = element_text(hjust=0.5)) + labs(fill=legend) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust=0.5))
}
# Gender against clicking on ad
stacked(df$Male, df$Clicked.on.Ad, "Gender vs Clicked.on.Ad representation", "Gender", "Count", "Clicked.on.Ad")
There is almost equal representation of male and female who clicked and those who did not click on the ad. This implies that gender may not influence who clicks on the ad.
# Age representation vs clicking on ad
stacked(df$Age, df$Clicked.on.Ad, "Age vs Clicked.on.Ad representation", "Age", "Count", "Clicked.on.Ad")
Most people who clicked the ad were 45 years old. Almost all people aged between 47 and 65 clicked on the ad. Most people aged 35 and below seem uninterested in clicking on the ads.
# City representation vs clicking on ad
stacked(df$City[1:20], df$Clicked.on.Ad, "City vs Clicked.on.Ad representation", "City", "Count", "Clicked.on.Ad")
# Date representation
# group days by number of clicks on ad.
times <- aggregate(x=as.factor(data$Clicked.on.Ad), by = list(data$Date), FUN=length)
# group days by number of no clicks on ad.
times_ <- aggregate(x=as.factor(data_$Clicked.on.Ad), by = list(data_$Date), FUN=length)
# plot line graph
ggplot() +
geom_line(data=times, aes(x=Group.1, y=x, group=1), col = "blue")+
geom_line(data=times_, aes(x=Group.1, y=x, group=1), col = "red") +
geom_point() + scale_x_date(date_labels = "%Y %b %d", date_breaks = "5 days") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust=0.5)) + xlab("Day") + ggtitle("Trend for clicks on Ad") + ylab("Frequency of Ad clicks")
The highest number of ad clicks was recorded in the week of Februrary 2016 between 9th and 14th followed by the first week of June, 2016.
ggplot() +
geom_line(data=times, aes(x=Group.1, y=x, group=1), col = "blue")+
geom_line(data=times_, aes(x=Group.1, y=x, group=1), col = "red") +
geom_point() + scale_x_date(date_labels = "%b", date_breaks = "1 month") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust=0.5)) + xlab("Month") + ggtitle("Trend for clicks on Ad") + ylab("Frequency of Ad clicks")
The ad was mostly clicked in early February and towards the beginning of June.
We shall build a classification model which will determine the kind of people who are likely to click the ad based on certain given attributes.
# convert categorical columns to numeric
df[, c("Country", "City")] <- sapply(df[, c("Country", "City")], unclass)
# convert country and city directly to factor
df$Country <- as.integer(as.character(df$Country))
df$City <- as.integer(as.character(df$City))
df$Male <- as.integer(as.character(df$Male))
head(df)
# extract month, day of the week and hour from date and time columns
# convert to integer
# import package
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
# Take Monday as first day of the week
df$Weekday <- as.integer(wday(df$Date, week_start=1))
# Month
df$Month <- as.integer(format(as.Date(df$Date), "%m"))
# Hour
df$Hour <- as.integer(format(as.POSIXct(df$Timestamp), format = "%H"))
head(df)
# check column datatypes
sapply(df, class)
## $Daily.Time.Spent.on.Site
## [1] "numeric"
##
## $Age
## [1] "integer"
##
## $Area.Income
## [1] "numeric"
##
## $Daily.Internet.Usage
## [1] "numeric"
##
## $Ad.Topic.Line
## [1] "character"
##
## $City
## [1] "integer"
##
## $Male
## [1] "integer"
##
## $Country
## [1] "integer"
##
## $Timestamp
## [1] "POSIXct" "POSIXt"
##
## $Clicked.on.Ad
## [1] "factor"
##
## $Date
## [1] "Date"
##
## $Time
## [1] "character"
##
## $Weekday
## [1] "integer"
##
## $Month
## [1] "integer"
##
## $Hour
## [1] "integer"
# Select numerical columns only
df1 <- df[ , purrr::map_lgl(df, is.numeric)]
head(df1)
df2 <- df[ , purrr::map_lgl(df, is.factor)]
head(df2)
# Bind the two dataframes
df <- cbind(df1,df2)
tail(df)
# Normalize data since our data is skewed.
normal <- function(x) (
return( ((x - min(x)) /(max(x)-min(x))) )
)
normal(1:11)
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
data.scaled <- as.data.frame(lapply(select(df, -Clicked.on.Ad), normal))
# return label column
data.scaled$Clicked.on.Ad <- df$Clicked.on.Ad
# Split the dataset into 70/30 ratio
set.seed(234) # Ensures the same random sample is given each time
split <- sample(c(rep(0, 0.7 * nrow(df)), rep(1, 0.3 * nrow(df))))
# define train and test
train.data <- data.scaled[split == 0, ]
test.data <- data.scaled[split == 1, ]
# baseline model
# load packages
library(dplyr)
library(caTools)
library(ROCR)
# Training the model
logistic_model <- glm(Clicked.on.Ad ~.,
data = train.data,
family = binomial(link="logit"))
logistic_model
##
## Call: glm(formula = Clicked.on.Ad ~ ., family = binomial(link = "logit"),
## data = train.data)
##
## Coefficients:
## (Intercept) Daily.Time.Spent.on.Site Age
## 18.1389 -13.2369 7.7323
## Area.Income Daily.Internet.Usage City
## -10.3287 -10.9263 -0.8950
## Male Country Weekday
## -0.3358 -0.1055 0.5672
## Month Hour
## 0.2008 -0.4208
##
## Degrees of Freedom: 699 Total (i.e. Null); 689 Residual
## Null Deviance: 970.1
## Residual Deviance: 110.4 AIC: 132.4
The significant difference in Null deviance and residual difference shows the improved performance of our model when varaible coefficients are considered in addition to the intercept.
# Summary
summary(logistic_model)
##
## Call:
## glm(formula = Clicked.on.Ad ~ ., family = binomial(link = "logit"),
## data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8168 -0.1029 -0.0451 0.0079 3.2854
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 18.1389 2.5493 7.115 1.12e-12 ***
## Daily.Time.Spent.on.Site -13.2369 1.7516 -7.557 4.12e-14 ***
## Age 7.7323 1.4394 5.372 7.80e-08 ***
## Area.Income -10.3287 1.6371 -6.309 2.80e-10 ***
## Daily.Internet.Usage -10.9263 1.4716 -7.425 1.13e-13 ***
## City -0.8950 0.9539 -0.938 0.348
## Male -0.3358 0.5219 -0.644 0.520
## Country -0.1055 0.8523 -0.124 0.901
## Weekday 0.5672 0.7462 0.760 0.447
## Month 0.2008 1.5049 0.133 0.894
## Hour -0.4208 0.9191 -0.458 0.647
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 970.13 on 699 degrees of freedom
## Residual deviance: 110.42 on 689 degrees of freedom
## AIC: 132.42
##
## Number of Fisher Scoring iterations: 8
The above model suggests that Daily.Time.Spent.on.site(4.12e-14), Daily.Internet.Usage(1.13e-13), Area.Income(2.80e-10) and Age(7.80e-08) in that order are most statistically significant when determining whether a user clicks on the ad as seen by the low p_value in that order.
The least significant are Country(0.901), Month(0.894), Hour(0.647) and Gender(0.520) in that order.
Negative estimate values(variable coefficients) are interpreted as having a decrease in the whole function when increased by a unit and other variables are kept constant.
# Run the anova() function on the model to analyze the table of deviance
anova(logistic_model, test="Chisq")
The significant drop in residual deviance for the first 4 variables contributes to the large difference between the null deviance and the residual deviance overall. The null deviance is the deviance of the model with only the intercept without variable coefficients.
# assess the model fit with pseudo-rsquared.
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(logistic_model)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -55.2094927 -485.0630171 859.7070487 0.8861808 0.7071670 0.9430152
McFadden index is 88.62%.
# prediction
# Predict test data based on model
pred <- predict(logistic_model,
test.data, type = "response")
# Changing probabilities to binary class
pred <- ifelse(pred >0.5, 1, 0)
# Evaluating model accuracy using confusion matrix
table(test.data$Clicked.on.Ad, pred)
## pred
## 0 1
## 0 141 2
## 1 8 149
Correct predictions are 290/300, an accuracy of 96.67% which is pretty good.
# ROC curve plot
pr <- prediction(pred, test.data$Clicked.on.Ad)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
# Area under the curve(auc)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.9675293
link :https://www.r-bloggers.com/2015/09/how-to-perform-a-logistic-regression-in-r/
# load libraries
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked _by_ '.GlobalEnv':
##
## histogram
# Training the model
rf <- randomForest(as.factor(Clicked.on.Ad)~., data=train.data, importance=TRUE)
print(rf)
##
## Call:
## randomForest(formula = as.factor(Clicked.on.Ad) ~ ., data = train.data, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 3.43%
## Confusion matrix:
## 0 1 class.error
## 0 346 11 0.03081232
## 1 13 330 0.03790087
Out of bag(OOB) error is 2.86%. This means that our model is 97.14% accurate on the train data.
# prediction on test data
p1 <- predict(rf, test.data)
confusionMatrix(p1, as.factor(test.data$Clicked.on.Ad))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 141 11
## 1 2 146
##
## Accuracy : 0.9567
## 95% CI : (0.927, 0.9767)
## No Information Rate : 0.5233
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9134
##
## Mcnemar's Test P-Value : 0.0265
##
## Sensitivity : 0.9860
## Specificity : 0.9299
## Pos Pred Value : 0.9276
## Neg Pred Value : 0.9865
## Prevalence : 0.4767
## Detection Rate : 0.4700
## Detection Prevalence : 0.5067
## Balanced Accuracy : 0.9580
##
## 'Positive' Class : 0
##
The model accurately predicts 286/300 giving an accuracy of 95.33%.
# plot error rate
plot(rf)
# plot histogram for number of nodes for the trees
hist(treesize(rf),
main = "No. of Nodes for the Trees",
col = "darkgreen")
# plot for variable importance
varImpPlot(rf,
sort = T,
n.var = 10,
main = "Top 10 - Variable Importance")
Variables that contribute to highly prediction results are
Daily.Time.Spent.on.site, Daily.Internet.Usage, Area.Income and Age.
# Variable importance table
importance(rf)
## 0 1 MeanDecreaseAccuracy
## Daily.Time.Spent.on.Site 70.34756620 43.9237489 74.6417855
## Age 25.61975557 14.5936398 28.3155211
## Area.Income 33.87424027 16.6940831 35.8757079
## Daily.Internet.Usage 72.07179889 47.8056818 75.7550585
## City -0.03163044 -1.9528844 -1.3024274
## Male -1.64125001 0.7121310 -0.8089057
## Country -1.91312199 -1.4012224 -2.5439292
## Weekday 0.84543797 -1.3807511 -0.1250853
## Month -0.40545072 1.0491655 0.3166916
## Hour -0.50677050 -0.6480012 -0.7820694
## MeanDecreaseGini
## Daily.Time.Spent.on.Site 117.0112002
## Age 32.8555732
## Area.Income 36.0218761
## Daily.Internet.Usage 140.7513451
## City 5.6945610
## Male 0.9232822
## Country 5.8342330
## Weekday 3.2211116
## Month 2.8348992
## Hour 4.1770765
Link: https://www.r-bloggers.com/2021/04/random-forest-in-r/
# install class package
library(class)
# define k to be the the square root of the number of observations
k <- sqrt(length(df))
# build knn classifier
knn_model_pred <- knn(train = train.data, test = test.data ,cl = train.data[,11], k=k)
# Evaluate model performance
# import gmodels package
library(gmodels)
# plot crosstable for correct labels vs predicted values
CrossTable(x=test.data[,11], y = knn_model_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 300
##
##
## | knn_model_pred
## test.data[, 11] | 0 | 1 | Row Total |
## ----------------|-----------|-----------|-----------|
## 0 | 143 | 0 | 143 |
## | 1.000 | 0.000 | 0.477 |
## | 1.000 | 0.000 | |
## | 0.477 | 0.000 | |
## ----------------|-----------|-----------|-----------|
## 1 | 0 | 157 | 157 |
## | 0.000 | 1.000 | 0.523 |
## | 0.000 | 1.000 | |
## | 0.000 | 0.523 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 143 | 157 | 300 |
## | 0.477 | 0.523 | |
## ----------------|-----------|-----------|-----------|
##
##
There are no False Negatives or False positives, leading to accuracy result of 100%. This is an overfitting case hence we need to improve our model. This can be done by altering the value of k.
# build knn classifier
knn_model_pred <- knn(train = train.data, test = test.data ,cl = train.data[,11], k=200)
# plot crosstable for correct labels vs predicted values
CrossTable(x=test.data[,11], y = knn_model_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 300
##
##
## | knn_model_pred
## test.data[, 11] | 0 | 1 | Row Total |
## ----------------|-----------|-----------|-----------|
## 0 | 143 | 0 | 143 |
## | 1.000 | 0.000 | 0.477 |
## | 1.000 | 0.000 | |
## | 0.477 | 0.000 | |
## ----------------|-----------|-----------|-----------|
## 1 | 0 | 157 | 157 |
## | 0.000 | 1.000 | 0.523 |
## | 0.000 | 1.000 | |
## | 0.000 | 0.523 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 143 | 157 | 300 |
## | 0.477 | 0.523 | |
## ----------------|-----------|-----------|-----------|
##
##
No matter the value of k, the model still attains an accuracy of 100%. Therefore KNN classifier is a poor choice for this machine learning problem.
Link: https://www.analyticsvidhya.com/blog/2015/08/learning-concept-knn-algorithms-programming/
# import library
library(e1071)
# build model classifier
svm_model <- svm(Clicked.on.Ad ~ .,
data = train.data,
type = 'C-classification',
kernel = 'linear')
svm_model
##
## Call:
## svm(formula = Clicked.on.Ad ~ ., data = train.data, type = "C-classification",
## kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 60
# predict on the test data
# Predicting the Test set results
y_pred <- predict(svm_model, test.data)
# evaluate results with Confusion Matrix
table(test.data[,11], y_pred)
## y_pred
## 0 1
## 0 141 2
## 1 9 148
The model correctly predicted 289/300 results accurately achieving an accuracy of 96.33%.
Link: https://www.geeksforgeeks.org/classifying-data-using-support-vector-machinessvms-in-r/
# import library
library(naivebayes)
## naivebayes 0.9.7 loaded
# build naive bayes model
nb_model <- naive_bayes(as.factor(Clicked.on.Ad) ~ ., data = train.data, usekernel = T)
nb_model
##
## ================================== Naive Bayes ==================================
##
## Call:
## naive_bayes.formula(formula = as.factor(Clicked.on.Ad) ~ ., data = train.data,
## usekernel = T)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
##
## 0 1
## 0.51 0.49
##
## ---------------------------------------------------------------------------------
##
## Tables:
##
## ---------------------------------------------------------------------------------
## ::: Daily.Time.Spent.on.Site::0 (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (357 obs.); Bandwidth 'bw' = 0.03499
##
## x y
## Min. :0.1605 Min. :0.000402
## 1st Qu.:0.3966 1st Qu.:0.066860
## Median :0.6328 Median :0.601003
## Mean :0.6328 Mean :1.057785
## 3rd Qu.:0.8689 3rd Qu.:1.921940
## Max. :1.1050 Max. :3.258957
##
## ---------------------------------------------------------------------------------
## ::: Daily.Time.Spent.on.Site::1 (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (343 obs.); Bandwidth 'bw' = 0.0626
##
## x y
## Min. :-0.1878 Min. :0.0002387
## 1st Qu.: 0.1559 1st Qu.:0.1151764
## Median : 0.4995 Median :0.5877182
## Mean : 0.4995 Mean :0.7267807
## 3rd Qu.: 0.8431 3rd Qu.:1.3986312
## Max. : 1.1868 Max. :1.7890429
##
## ---------------------------------------------------------------------------------
## ::: Age::0 (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (357 obs.); Bandwidth 'bw' = 0.04093
##
## x y
## Min. :-0.1228 Min. :0.0003176
## 1st Qu.: 0.1410 1st Qu.:0.0723495
## Median : 0.4048 Median :0.5354041
## Mean : 0.4048 Mean :0.9468385
## 3rd Qu.: 0.6685 3rd Qu.:1.7892283
## Max. : 0.9323 Max. :2.6997800
##
## ---------------------------------------------------------------------------------
## ::: Age::1 (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (343 obs.); Bandwidth 'bw' = 0.05833
##
## x y
## Min. :-0.1512 Min. :0.000243
## 1st Qu.: 0.1804 1st Qu.:0.107486
## Median : 0.5119 Median :0.677683
## Mean : 0.5119 Mean :0.753323
## 3rd Qu.: 0.8434 3rd Qu.:1.330641
## Max. : 1.1750 Max. :1.702461
##
## ---------------------------------------------------------------------------------
## ::: Area.Income::0 (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (357 obs.); Bandwidth 'bw' = 0.03699
##
## x y
## Min. :0.1992 Min. :0.000419
## 1st Qu.:0.4271 1st Qu.:0.156402
## Median :0.6551 Median :0.803906
## Mean :0.6551 Mean :1.095657
## 3rd Qu.:0.8830 3rd Qu.:1.842427
## Max. :1.1110 Max. :3.238919
##
## ---------------------------------------------------------------------------------
## ::: Area.Income::1 (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (343 obs.); Bandwidth 'bw' = 0.05721
##
## x y
## Min. :-0.1716 Min. :0.0005004
## 1st Qu.: 0.1552 1st Qu.:0.1198769
## Median : 0.4821 Median :0.6874327
## Mean : 0.4821 Mean :0.7640373
## 3rd Qu.: 0.8090 3rd Qu.:1.3237183
## Max. : 1.1359 Max. :1.8983892
##
## ---------------------------------------------------------------------------------
## ::: Daily.Internet.Usage::0 (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (357 obs.); Bandwidth 'bw' = 0.0406
##
## x y
## Min. :0.1289 Min. :0.0003578
## 1st Qu.:0.3727 1st Qu.:0.1207824
## Median :0.6164 Median :0.6923053
## Mean :0.6164 Mean :1.0245869
## 3rd Qu.:0.8602 3rd Qu.:1.9726890
## Max. :1.1039 Max. :2.5895565
##
## ---------------------------------------------------------------------------------
## ::: Daily.Internet.Usage::1 (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (343 obs.); Bandwidth 'bw' = 0.04779
##
## x y
## Min. :-0.1434 Min. :0.000277
## 1st Qu.: 0.1783 1st Qu.:0.090494
## Median : 0.5000 Median :0.366081
## Mean : 0.5000 Mean :0.776370
## 3rd Qu.: 0.8217 3rd Qu.:1.443802
## Max. : 1.1434 Max. :2.587663
##
## ---------------------------------------------------------------------------------
## ::: City::0 (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (357 obs.); Bandwidth 'bw' = 0.08019
##
## x y
## Min. :-0.2385 Min. :0.001563
## 1st Qu.: 0.1310 1st Qu.:0.261023
## Median : 0.5005 Median :0.904067
## Mean : 0.5005 Mean :0.675850
## 3rd Qu.: 0.8700 3rd Qu.:0.968428
## Max. : 1.2396 Max. :1.101864
##
## ---------------------------------------------------------------------------------
## ::: City::1 (KDE)
## ---------------------------------------------------------------------------------
##
## Call:
## density.default(x = x, na.rm = TRUE)
##
## Data: x (343 obs.); Bandwidth 'bw' = 0.08241
##
## x y
## Min. :-0.2472 Min. :0.001645
## 1st Qu.: 0.1259 1st Qu.:0.255874
## Median : 0.4990 Median :0.916110
## Mean : 0.4990 Mean :0.669351
## 3rd Qu.: 0.8721 3rd Qu.:0.962704
## Max. : 1.2452 Max. :1.064139
##
## ---------------------------------------------------------------------------------
##
## # ... and 5 more tables
##
## ---------------------------------------------------------------------------------
# Make prediction
test.data$Clicked.on.Ad <- as.factor(test.data$Clicked.on.Ad)
y_pred <- predict(nb_model, test.data)
## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.
# Evaluate results with confusion matrix
table(test.data$Clicked.on.Ad, y_pred)
## y_pred
## 0 1
## 0 140 3
## 1 12 145
The confusion matrix shows that the model is able to accurately classify 285 observations out of 300 resulting in an accuracy of 95%.
The performance of all our models is very good. The best performing model is the Logistic Regression model. Cross validation method such as k-fold cross validation can be used to validate this.
All models except KNN classifier are good enough to solve this classification problem.
According to our analysis, gender contributes the least to who clicks on the ad or not. This is seen from the small change in percentage from the whole dataset to those who clicked on the ad. However, the rest of the attributes to some extent have an impact to this effect. Geographical location, Internet accessibility, Age and Date determine who clicks the ad.
From our findings, we are able to deduce the following in an attempt to determine a target market for our client:
i.) More females are likely to click on the ad than male. ii.) Mid
Income earners between 40k and 60k are a likely target market. iii.) The
ratio of the elderly who may click on the ad is higher than that of the
younger group. This age ranges from 45 and above.
iv.) Client should target people from Australia, Ethopia and Turkey
countries who seem most interested in the cryptography course. v.)
Client should also narrow her target market to the following cities: -
Lake David - Lake James - Lisamouth - Michelleside - Millerbury -
Robertfurt - South Lisa - West Amanda - West Shannon - Williamsport vi.)
In order to attract more customers, the client should consider discounts
for her course in February and June in order to attract more customers.
This way, she is likely to attract more people to her blog and
course.