1. Defining the Question

Research Question

A Kenyan entrepreneur has created an online cryptography course and would want to advertise it on her blog. She currently targets audiences originating from various countries. In the past, she ran ads to advertise a related course on the same blog and collected data in the process. She would now like to employ your services as a Data Science Consultant to help her identify which individuals are most likely to click on her ads.

a.) Specifying the question

Determine which kind of people are likely to click on the ads based on characteristics given in the dataset.

b.) The metric for success

Performing univariate and bivariate analysis on the cleaned dataset and later determining the attributes of individuals who are likely to click on our client’s ads.

c.) Understanding the context

The dataset provided contains attributes of people who veiw the client’s blog and to whether they clicked on a specific ad about her previous cryptography course. Through analysis, we are able to determine the demographics of these people so as to govern strategy of attracting more customers for her new course.

d.) Experimental design taken

Reading the data
Checking the data - data understanding
Data cleaning - checking for outliers and anomalies, missing values and duplicates.
Performing univariate analysis on the dataset.
Performing bivariate analysis on the dataset.
Conclusion.
Recommendations.

e.) Data appropriateness to answer the given question.

This is an analysis problem which can be solved by understanding the preferences of the client’s viewers. We need to determine the kind of viewer that is most likely to click on a cryptography course ad and most likely sign up for it as well.

2. Loading the dataset

# load Tibble library
library(tibble)

# load the dataset as dataframe

df <- read.csv('http://bit.ly/IPAdvertisingData') 

# convert dataframe to tibble and set options to show all columns

df <- as_tibble(df)

#check data structure of our dataset

class(df)

## [1] "tbl_df"     "tbl"        "data.frame"

3. Checking the data

# preview the first 6 rows of the dataset
head(df)

# preview the last 6 rows of the dataset
tail(df)

# get number of records
nrow(df); ncol(df)

## [1] 1000

## [1] 10

The dataset contains 1000 customer attributes and 10 variables.

# get datatypes of each column using class
sapply(df, class)

## Daily.Time.Spent.on.Site                      Age              Area.Income 
##                "numeric"                "integer"                "numeric" 
##     Daily.Internet.Usage            Ad.Topic.Line                     City 
##                "numeric"              "character"              "character" 
##                     Male                  Country                Timestamp 
##                "integer"              "character"              "character" 
##            Clicked.on.Ad 
##                "integer"

# inspect variable classes
str(df)

## tibble [1,000 × 10] (S3: tbl_df/tbl/data.frame)
##  $ Daily.Time.Spent.on.Site: num [1:1000] 69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int [1:1000] 35 31 26 29 35 23 33 48 30 20 ...
##  $ Area.Income             : num [1:1000] 61834 68442 59786 54806 73890 ...
##  $ Daily.Internet.Usage    : num [1:1000] 256 194 236 246 226 ...
##  $ Ad.Topic.Line           : chr [1:1000] "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : chr [1:1000] "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
##  $ Male                    : int [1:1000] 0 1 0 1 0 1 0 1 1 1 ...
##  $ Country                 : chr [1:1000] "Tunisia" "Nauru" "San Marino" "Italy" ...
##  $ Timestamp               : chr [1:1000] "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
##  $ Clicked.on.Ad           : int [1:1000] 0 0 0 0 0 0 0 1 0 0 ...

# create function to convert categorical columns to factor datatype

tofactor <- function(column){
  as.factor(column)
}

# columns to be converted to factors

df$Male <- tofactor(df$Male)
df$Clicked.on.Ad <- tofactor(df$Clicked.on.Ad)
df$City <- tofactor(df$City)
df$Country <- tofactor(df$Country)


# convert timestamp to datetime


df$Timestamp <- as.POSIXct(df$Timestamp, format = "%Y-%m-%d  %H:%M:%S")

# split date and time variables 
df$Date <- as.Date(df$Timestamp)

df$Time <- format(df$Timestamp,"%H:%M:%S")

# check datatypes again
sapply(df, class)

## $Daily.Time.Spent.on.Site
## [1] "numeric"
## 
## $Age
## [1] "integer"
## 
## $Area.Income
## [1] "numeric"
## 
## $Daily.Internet.Usage
## [1] "numeric"
## 
## $Ad.Topic.Line
## [1] "character"
## 
## $City
## [1] "factor"
## 
## $Male
## [1] "factor"
## 
## $Country
## [1] "factor"
## 
## $Timestamp
## [1] "POSIXct" "POSIXt" 
## 
## $Clicked.on.Ad
## [1] "factor"
## 
## $Date
## [1] "Date"
## 
## $Time
## [1] "character"

head(df)

4. Data Cleaning

a.) Outliers

# check for outliers

# select only numerical variables
library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

data_num <- select_if(df, is.numeric) 
head(data_num)

# boxplot for the numerical values except for Area.Income column which is on another scale
library(ggplot2)
#ggplot(stack(select(data_num, -Area.Income)), aes(x = ind, y = values)) + geom_boxplot()
# height=10, boxwex = 0.3, outwex = 1

columns <- c(names(data_num))

boxplots <- function(column_name){
  column <- c(data_num[,column_name])
  boxplot(column, width=1, main=paste("Boxplot for", columns[x]))
}

# create 4 subplots
par(mfrow=c(2,2))
for (x in 1:length(columns)){
   boxplots(x)
}

There are no outliers for Daily.Time.Spent.on.Site, Age and Daily.Internet.Usage columns.

# boxplot for Area.Income
boxplot(Area.Income~Male, data=df)

There are outliers in the Area Income column especially for the Male gender whose mean income is a bit higher than for the female. Outliers are men earning less than the rest. We shall check if the outliers are valid income figures.

# check for Area Income less than 0 which would be invalid
sum(df[,"Area.Income"] < 0)

## [1] 0

b.) Missing values

# import Amelia and Rcpp libraries  

library(Amelia,Rcpp)

## Loading required package: Rcpp

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.0, built: 2021-05-26)
## ## Copyright (C) 2005-2022 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

#checking missing data visualization
missmap(df)

## Warning: Unknown or uninitialised column: `arguments`.
## Unknown or uninitialised column: `arguments`.

## Warning: Unknown or uninitialised column: `imputations`.

There are no missing values in our dataframe.

# confirming we have no missing values

colSums(is.na(df))

## Daily.Time.Spent.on.Site                      Age              Area.Income 
##                        0                        0                        0 
##     Daily.Internet.Usage            Ad.Topic.Line                     City 
##                        0                        0                        0 
##                     Male                  Country                Timestamp 
##                        0                        0                        0 
##            Clicked.on.Ad                     Date                     Time 
##                        0                        0                        0

c.) Duplicate values

# check for duplicates
anyDuplicated(df)

## [1] 0

There are no duplicated values in our dataset, otherwise we would have to drop them.

5. Univariate Analysis

# check univariate summary of our dataset.
summary(df)

##  Daily.Time.Spent.on.Site      Age         Area.Income    Daily.Internet.Usage
##  Min.   :32.60            Min.   :19.00   Min.   :13996   Min.   :104.8       
##  1st Qu.:51.36            1st Qu.:29.00   1st Qu.:47032   1st Qu.:138.8       
##  Median :68.22            Median :35.00   Median :57012   Median :183.1       
##  Mean   :65.00            Mean   :36.01   Mean   :55000   Mean   :180.0       
##  3rd Qu.:78.55            3rd Qu.:42.00   3rd Qu.:65471   3rd Qu.:218.8       
##  Max.   :91.43            Max.   :61.00   Max.   :79485   Max.   :270.0       
##                                                                               
##  Ad.Topic.Line                   City     Male              Country   
##  Length:1000        Lisamouth      :  3   0:519   Czech Republic:  9  
##  Class :character   Williamsport   :  3   1:481   France        :  9  
##  Mode  :character   Benjaminchester:  2           Afghanistan   :  8  
##                     East John      :  2           Australia     :  8  
##                     East Timothy   :  2           Cyprus        :  8  
##                     Johnstad       :  2           Greece        :  8  
##                     (Other)        :986           (Other)       :950  
##    Timestamp                      Clicked.on.Ad      Date           
##  Min.   :2016-01-01 02:52:10.00   0:500         Min.   :2015-12-31  
##  1st Qu.:2016-02-18 02:55:42.00   1:500         1st Qu.:2016-02-17  
##  Median :2016-04-07 17:27:29.50                 Median :2016-04-07  
##  Mean   :2016-04-10 10:34:06.64                 Mean   :2016-04-09  
##  3rd Qu.:2016-05-31 03:18:14.00                 3rd Qu.:2016-05-30  
##  Max.   :2016-07-24 00:22:16.00                 Max.   :2016-07-23  
##                                                                     
##      Time          
##  Length:1000       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

From the table above we are able to get the mean, median, quantiles and range of all numerical variables.

The average daily time spent on the site is 65. The most time a client spent online is 91.43 and the least time is 32.60.

The average area income is 55000, the highest being 79485 and the lowest being 13996.

The average age of clients in our dataset is 36. Max age is 61 and min age is 19.

The average daily internet usage is 180, highest being 270 and the lowest being 104.8.

The data was colected from 1st January 2016 to 24th July, 2016.

Our dataset is balanced as it contains an equal share of people who clicked on the ad and those who did not.

# let's define a function that will print results for numeric columns

library(dplyr)

printing <- function(func){
  
   print(paste("Daily.Time.Spent.on.Site: ", func(df$Daily.Time.Spent.on.Site)))
   print(paste("Daily.Internet.Usage: ", func(df$Daily.Internet.Usage)))
   print(paste("Age: ", func(df$Age)))
   print(paste("Area.Income: ", func(df$Area.Income)))

}

# use above function to print range
paste("The range of each column is as shown below:-")

## [1] "The range of each column is as shown below:-"

printing(range)

## [1] "Daily.Time.Spent.on.Site:  32.6"  "Daily.Time.Spent.on.Site:  91.43"
## [1] "Daily.Internet.Usage:  104.78" "Daily.Internet.Usage:  269.96"
## [1] "Age:  19" "Age:  61"
## [1] "Area.Income:  13996.5" "Area.Income:  79484.8"

# other columns whose range is obtainable
print(paste("Ad.Topic.Line: ", range(df$Ad.Topic.Line)))

## [1] "Ad.Topic.Line:  Adaptive 24hour Graphic Interface"
## [2] "Ad.Topic.Line:  Visionary reciprocal circuit"

print(paste("Date: ", range(df$Date)))

## [1] "Date:  2015-12-31" "Date:  2016-07-23"

paste("The interquartile range of each column is shown beow:-")

## [1] "The interquartile range of each column is shown beow:-"

# use IQR function
printing(IQR)

## [1] "Daily.Time.Spent.on.Site:  27.1875"
## [1] "Daily.Internet.Usage:  79.9625"
## [1] "Age:  13"
## [1] "Area.Income:  18438.8325"

The interquatile range explains the range where the bulk of the values lie.

# mode for all columns which is not included in the summary
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

paste("The mode of the columns is as shown below:-")

## [1] "The mode of the columns is as shown below:-"

printing(getmode)

## [1] "Daily.Time.Spent.on.Site:  62.26"
## [1] "Daily.Internet.Usage:  167.22"
## [1] "Age:  31"
## [1] "Area.Income:  61833.9"

# add the rest of the columns whosemode can be obtained
print(paste("Male: ", getmode(df$Male)))

## [1] "Male:  0"

print(paste("Clicked.on.Ad: ", getmode(df$Clicked.on.Ad)))

## [1] "Clicked.on.Ad:  0"

print(paste("Ad.Topic.Line: ", getmode(df$Ad.Topic.Line)))

## [1] "Ad.Topic.Line:  Cloned 5thgeneration orchestration"

print(paste("Country: ", getmode(df$Country)))

## [1] "Country:  Czech Republic"

print(paste("City: ", getmode(df$City)))

## [1] "City:  Lisamouth"

print(paste("Date: ", getmode(df$Date)))

## [1] "Date:  2016-02-14"

The mode of the columns as shown above represents the column values that are most repeated for each column.

print("The variance of numerical columns is as shown below:-")

## [1] "The variance of numerical columns is as shown below:-"

printing(var)

## [1] "Daily.Time.Spent.on.Site:  251.337094854855"
## [1] "Daily.Internet.Usage:  1927.41539618619"
## [1] "Age:  77.1861051051051"
## [1] "Area.Income:  179952405.951775"

The variance explains the measure of spread of the data points within the data set.

print("The standard deviation of the numerical columns is as shown below:-")

## [1] "The standard deviation of the numerical columns is as shown below:-"

printing(sd)

## [1] "Daily.Time.Spent.on.Site:  15.8536145675002"
## [1] "Daily.Internet.Usage:  43.9023393019801"
## [1] "Age:  8.78556231012592"
## [1] "Area.Income:  13414.6340222824"

The standard deviation measures the distribution of data points around the mean.

# Pie chart function

# import plotrix for 3D plot
library(plotrix)

piechart <- function(column, title){
  
  # define labels and values
  slices <- c(tabulate(column))
  lbls <- c(unique(column))
  
  # convert values to percentage
  pct <- round(slices/sum(slices)*100)
  
  # add percentages to labels
  lbls <- paste(lbls,":", pct) 
  
  # ad % to labels
  lbls <- paste(lbls,"%",sep="") 
  
  # plot pie chart in 3D
  pie3D(slices,labels=lbls, col=rainbow(length(lbls)), 
        explode=0.1, radius=1, start=0, main=title)
}

# Clicked.on.ad representation
piechart(df$Clicked.on.Ad, "Pie Chart for Clicked on ad representation")

Half of the population of the dataset clicked on the ad.

# Gender representation
piechart(df$Male, "Pie Chart for Gender representation")

52% are female while 48% of the population is male.

# filter data to only have those who clicked on ad
data <- df %>% filter(Clicked.on.Ad == 1)
data

# filter data to only those who didn't click on the ad
data_ <- df %>% filter(Clicked.on.Ad == 0)
data_

# Gender representation of those who clicked on ad
piechart(data$Male, "Pie Chart for Gender representation(Clicked on Ad)")

More female clicked on the ad than male.

# histograms function
histogram <- function(column,title, xlab){
  
  hist(column, main= title, xlab=xlab, ylab = "Frequency", col = "lightgreen")
  
}

# age representation
histogram(df$Age, "Histogram for Age representation", "Age")

Age representation in the dataset is skewed to the right. Most people lie between the age of 25 and 40.

# age representation of those who clicked on ad
histogram(data$Age, "Histogram for Age representation(Clicked on Ad)", "Age")

The data seems normally distributed. Those who clicked on the ad are aged between 35 and 45.

# area income representation
histogram(df$Area.Income, "Histogram for Area Income representation", "Area income")

Area Income representation is skewed to the left, most people earning between 50k and 70k.

# Area representation of those who clicked on ad
histogram(data$Area.Income, "Histogram for Area Income representation(Clicked on Ad)", "Area Income")

Those who clicked on ad earn between 40k and 60k.

# Daily time spent on site representation
histogram(df$Daily.Time.Spent.on.Site, "Histogram for Daily Time spent on site representation", "Daily Time spent on site")

Most people spend between 65 to 85 minutes on site.

# Daily time spent on site representation of those who clicked on ad
histogram(data$Daily.Time.Spent.on.Site, "Histogram for Daily Time spent on site representation(Clicked on Ad)", "Daily Time spent on site")

Most people who click on the ad spend between 35 to 60 minutes on site.

# Daily internet usage representation

# customizing bin width
bin_width <- 10

nbins <- seq(min(df$Daily.Internet.Usage) - bin_width,
             max(df$Daily.Internet.Usage) + bin_width, by = bin_width)

# plot histogram
hist(df$Daily.Internet.Usage, breaks = nbins, main = "Histogram for Daily Internet Usage representation", xlab = "Daily Internet Usage", ylab="Frequency", col="lightgreen")

Most people use internet bundles between 115 to 145MBs and between 195 and 235MBs.

# Daily internet usage representation of those who clicked on ad

hist(data$Daily.Internet.Usage, breaks = nbins, main = "Histogram for Daily Internet Usage representation(Clicked on Ad)", xlab = "Daily Internet Usage", ylab="Frequency", col="lightgreen")

Most people who click on ads use internet data between 105 and 185MBs.

# barplot function
bar <- function(column, title, xlab){
  
  # create frequency table
  freq <- table(column)
  
  # sort frequency table
  sorted_freq <- (freq[order(freq,decreasing=TRUE)])
  
  # adjust margins of frequency table
  par(mar=c(11,4,1,0))
  
  # plot bar graph for first 20 values with the highest count
  barplot(sorted_freq[1:20], main=title, ylab="Frequency", las=2, col="lightblue")
  title(xlab = xlab, line=10)
}

# City representation
bar(df$City, "Barplot for City representation", "City")

Most people come from Lisamouth and Williamsport cities both with 3 counts each.

# City representation of those who clicked on ad

bar(data$City, "Barplot for City representation(Clicked on ad)", "City")

Most people who click the ads are in 10 cities namely: - Lake David - Lake James - Lisamouth - Michelleside - Millerbury - Robertfurt - South Lisa - West Amanda - West Shannon - Williamsport

# Country representation
bar(df$Country, "Barplot for Country representation", "Country")

Czech Republic and France have the highest representation of the population in our dataset both with 9 counts.

# Country representation of those who clicked on ad 

bar(data$Country, "Barplot for Country representation(Clicked on ad)", "Country")

Most people who clicked on the ad are in 3 countries namely: Australia, Ethopia and Turkey.

# Timestamp representation
bar(df$Date, "Barplot for DateTime representation", "Date")

Data was collected most for 2016-02-14 and 2016-04-04 each with a frequency of 12.

# Timestamp representation for those who clicked on ad
bar(data$Date, "Barplot for DateTime representation(Clicked on Ad)", "Date")

2016-02-14 recorded the highest number of people who clicked on the ad.

length(unique(df$Ad.Topic.Line))

## [1] 1000

The topics of the Ads are all unique hence cannot help us determine which ind of ad topics our viewers are likely to click on.

6. Bivariate analysis

# check covariance in numerical variables with covariance matrix
cov(data_num)

##                          Daily.Time.Spent.on.Site          Age  Area.Income
## Daily.Time.Spent.on.Site                251.33709    -46.17415     66130.81
## Age                                     -46.17415     77.18611    -21520.93
## Area.Income                           66130.81091 -21520.92580 179952405.95
## Daily.Internet.Usage                    360.99188   -141.63482    198762.53
##                          Daily.Internet.Usage
## Daily.Time.Spent.on.Site             360.9919
## Age                                 -141.6348
## Area.Income                       198762.5315
## Daily.Internet.Usage                1927.4154

Covariance is the measure of how two random variables vary together. A high negative covariance indicates negative correlation while a high positive covariance indicates positive correlation. A value close to zero indicates weak covariance.

We can say the following have positive covariance indicating positive correlation:- * Area.Income and Daily.Time.Spent.On.Site * Area.Income and Daily.Internet.Usage * Daily.Time.Spent.On.Site and Daily.Internet.Usage

The following have negative covariance indicating negative correlation:- Age and Area.Income Age and Daily.Internet.Usage

In order to measure the strength of this relationship, we shall use a correlation matrix.

# correlation matrix

cormat <- cor(data_num)

# melt correlation matrix to enable heatmap plot 
library(reshape2)
melted_cormat <- melt(cormat)

# plot heatmap
library(ggplot2)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
geom_tile() + geom_text(aes(Var2, Var1, label = value), color = "white", size = 2) + ggtitle("Correlation Heatmap")

The correlation matrix tries to explain the covariance between variables. There is a strong positive correlation between Daily.Time.Spent.On.Site and Daily.Internet.Usage since time spent on site depends on internet accessibility.

All other variables are weakly correlated.

# scatterplot for the two correlated variables

plot(df$Daily.Time.Spent.on.Site, df$Daily.Internet.Usage, col =df$Clicked.on.Ad, xlab="Daily Time Spent on site", ylab="Daily Internet Usage", main="Daily Internet Usage vs Daily Time Spent on Site")

# Legend
legend("topleft",
       legend = levels(df$Clicked.on.Ad),
       pch = 19,
       col = factor(levels(df$Clicked.on.Ad)))

Most people with high internet usage who spent most daily time on site did not click on the ad.

# create funtion to plot stacked barchart
stacked <- function(column1, column2, title, value1, value2, legend){
  
    # vectorize the two columns to plot
    attribute1 <- c(column1) 
    attribute2 <- c(column2)
    
    # create dataframe of the two columns
    data <- data.frame(attribute1, attribute2)
    
    # plot stacked bargraph
    ggplot(data=data) + geom_bar(aes(fill=attribute2, x=attribute1)) + ggtitle(title) + xlab(value1) + ylab(value2) + theme(plot.title = element_text(hjust=0.5)) + labs(fill=legend) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust=0.5)) 
    
    
}

# Gender against clicking on ad

stacked(df$Male, df$Clicked.on.Ad, "Gender vs Clicked.on.Ad representation", "Gender", "Count", "Clicked.on.Ad")

There is almost equal representation of male and female who clicked and those who did not click on the ad. This implies that gender may not influence who clicks on the ad.

# Age representation vs clicking on ad
stacked(df$Age, df$Clicked.on.Ad, "Age vs Clicked.on.Ad representation", "Age", "Count", "Clicked.on.Ad")

Most people who clicked the ad were 45 years old. Almost all people aged between 47 and 65 clicked on the ad. Most people aged 35 and below seem uninterested in clicking on the ads.

# City representation vs clicking on ad
stacked(df$City[1:20], df$Clicked.on.Ad, "City vs Clicked.on.Ad representation", "City", "Count", "Clicked.on.Ad")

# Date representation


# group days by number of clicks on ad.
times <- aggregate(x=as.factor(data$Clicked.on.Ad), by = list(data$Date), FUN=length)

# group days by number of no clicks on ad.
times_ <- aggregate(x=as.factor(data_$Clicked.on.Ad), by = list(data_$Date), FUN=length)

# plot line graph

ggplot() +
  geom_line(data=times, aes(x=Group.1, y=x, group=1), col = "blue")+
  geom_line(data=times_, aes(x=Group.1, y=x, group=1), col = "red") +
  geom_point() + scale_x_date(date_labels = "%Y %b %d", date_breaks = "5 days") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust=0.5)) + xlab("Day") + ggtitle("Trend for clicks on Ad") + ylab("Frequency of Ad clicks")

The highest number of ad clicks was recorded in the week of Februrary 2016 between 9th and 14th followed by the first week of June, 2016.

ggplot() +
  geom_line(data=times, aes(x=Group.1, y=x, group=1), col = "blue")+
  geom_line(data=times_, aes(x=Group.1, y=x, group=1), col = "red") +
  geom_point() + scale_x_date(date_labels = "%b", date_breaks = "1 month") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust=0.5)) + xlab("Month") + ggtitle("Trend for clicks on Ad") + ylab("Frequency of Ad clicks")

The ad was mostly clicked in early February and towards the beginning of June.

7. Modelling

We shall build a classification model which will determine the kind of people who are likely to click the ad based on certain given attributes.

Feature Engineering

# convert categorical columns to numeric

df[, c("Country", "City")] <- sapply(df[, c("Country", "City")], unclass)

# convert country and city directly to factor

df$Country  <-  as.integer(as.character(df$Country))
df$City  <-  as.integer(as.character(df$City))
df$Male <- as.integer(as.character(df$Male))

head(df)

# extract month, day of the week and hour from date and time columns

# convert to integer

# import package
library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

# Take Monday as first day of the week

df$Weekday <- as.integer(wday(df$Date, week_start=1))

# Month
df$Month <- as.integer(format(as.Date(df$Date), "%m"))

# Hour
df$Hour <- as.integer(format(as.POSIXct(df$Timestamp), format = "%H"))

head(df)

# check column datatypes
sapply(df, class)

## $Daily.Time.Spent.on.Site
## [1] "numeric"
## 
## $Age
## [1] "integer"
## 
## $Area.Income
## [1] "numeric"
## 
## $Daily.Internet.Usage
## [1] "numeric"
## 
## $Ad.Topic.Line
## [1] "character"
## 
## $City
## [1] "integer"
## 
## $Male
## [1] "integer"
## 
## $Country
## [1] "integer"
## 
## $Timestamp
## [1] "POSIXct" "POSIXt" 
## 
## $Clicked.on.Ad
## [1] "factor"
## 
## $Date
## [1] "Date"
## 
## $Time
## [1] "character"
## 
## $Weekday
## [1] "integer"
## 
## $Month
## [1] "integer"
## 
## $Hour
## [1] "integer"

# Select numerical columns only

df1 <- df[ , purrr::map_lgl(df, is.numeric)]
head(df1)

df2 <- df[ , purrr::map_lgl(df, is.factor)]
head(df2)

# Bind the two dataframes

df <- cbind(df1,df2)
tail(df)

# Normalize data since our data is skewed.
normal <- function(x) (
  return( ((x - min(x)) /(max(x)-min(x))) )
)

normal(1:11)

##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

data.scaled <- as.data.frame(lapply(select(df,  -Clicked.on.Ad), normal))

# return label column
data.scaled$Clicked.on.Ad <- df$Clicked.on.Ad

# Split the dataset into 70/30 ratio
set.seed(234) # Ensures the same random sample is given each time
split <- sample(c(rep(0, 0.7 * nrow(df)), rep(1, 0.3 * nrow(df))))

# define train and test 

train.data <- data.scaled[split == 0, ]
test.data <- data.scaled[split == 1, ]

i.) Logisitic Regression Model

# baseline model 

# load packages
library(dplyr)
library(caTools)
library(ROCR)

   
# Training the model
logistic_model <- glm(Clicked.on.Ad ~., 
                      data = train.data, 
                      family = binomial(link="logit"))

logistic_model

## 
## Call:  glm(formula = Clicked.on.Ad ~ ., family = binomial(link = "logit"), 
##     data = train.data)
## 
## Coefficients:
##              (Intercept)  Daily.Time.Spent.on.Site                       Age  
##                  18.1389                  -13.2369                    7.7323  
##              Area.Income      Daily.Internet.Usage                      City  
##                 -10.3287                  -10.9263                   -0.8950  
##                     Male                   Country                   Weekday  
##                  -0.3358                   -0.1055                    0.5672  
##                    Month                      Hour  
##                   0.2008                   -0.4208  
## 
## Degrees of Freedom: 699 Total (i.e. Null);  689 Residual
## Null Deviance:       970.1 
## Residual Deviance: 110.4     AIC: 132.4

The significant difference in Null deviance and residual difference shows the improved performance of our model when varaible coefficients are considered in addition to the intercept.

# Summary
summary(logistic_model)

## 
## Call:
## glm(formula = Clicked.on.Ad ~ ., family = binomial(link = "logit"), 
##     data = train.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8168  -0.1029  -0.0451   0.0079   3.2854  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               18.1389     2.5493   7.115 1.12e-12 ***
## Daily.Time.Spent.on.Site -13.2369     1.7516  -7.557 4.12e-14 ***
## Age                        7.7323     1.4394   5.372 7.80e-08 ***
## Area.Income              -10.3287     1.6371  -6.309 2.80e-10 ***
## Daily.Internet.Usage     -10.9263     1.4716  -7.425 1.13e-13 ***
## City                      -0.8950     0.9539  -0.938    0.348    
## Male                      -0.3358     0.5219  -0.644    0.520    
## Country                   -0.1055     0.8523  -0.124    0.901    
## Weekday                    0.5672     0.7462   0.760    0.447    
## Month                      0.2008     1.5049   0.133    0.894    
## Hour                      -0.4208     0.9191  -0.458    0.647    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 970.13  on 699  degrees of freedom
## Residual deviance: 110.42  on 689  degrees of freedom
## AIC: 132.42
## 
## Number of Fisher Scoring iterations: 8

The above model suggests that Daily.Time.Spent.on.site(4.12e-14), Daily.Internet.Usage(1.13e-13), Area.Income(2.80e-10) and Age(7.80e-08) in that order are most statistically significant when determining whether a user clicks on the ad as seen by the low p_value in that order.

The least significant are Country(0.901), Month(0.894), Hour(0.647) and Gender(0.520) in that order.

Negative estimate values(variable coefficients) are interpreted as having a decrease in the whole function when increased by a unit and other variables are kept constant.

# Run the anova() function on the model to analyze the table of deviance

anova(logistic_model, test="Chisq")

The significant drop in residual deviance for the first 4 variables contributes to the large difference between the null deviance and the residual deviance overall. The null deviance is the deviance of the model with only the intercept without variable coefficients.

# assess the model fit with pseudo-rsquared. 
library(pscl)

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

pR2(logistic_model)

## fitting null model for pseudo-r2

##          llh      llhNull           G2     McFadden         r2ML         r2CU 
##  -55.2094927 -485.0630171  859.7070487    0.8861808    0.7071670    0.9430152

McFadden index is 88.62%.

# prediction

# Predict test data based on model
pred <- predict(logistic_model, 
                       test.data, type = "response")

# Changing probabilities to binary class
pred <- ifelse(pred >0.5, 1, 0) 

# Evaluating model accuracy using confusion matrix
table(test.data$Clicked.on.Ad, pred)

##    pred
##       0   1
##   0 141   2
##   1   8 149

Correct predictions are 290/300, an accuracy of 96.67% which is pretty good.

# ROC curve plot
pr <- prediction(pred, test.data$Clicked.on.Ad)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)

# Area under the curve(auc)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

## [1] 0.9675293

link :https://www.r-bloggers.com/2015/09/how-to-perform-a-logistic-regression-in-r/

ii.) Random Forest Classifier Model

# load libraries

library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked _by_ '.GlobalEnv':
## 
##     histogram

# Training the model

rf <- randomForest(as.factor(Clicked.on.Ad)~., data=train.data, importance=TRUE) 
print(rf)

## 
## Call:
##  randomForest(formula = as.factor(Clicked.on.Ad) ~ ., data = train.data,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 3.43%
## Confusion matrix:
##     0   1 class.error
## 0 346  11  0.03081232
## 1  13 330  0.03790087

Out of bag(OOB) error is 2.86%. This means that our model is 97.14% accurate on the train data.

# prediction on test data

p1 <- predict(rf, test.data)
confusionMatrix(p1, as.factor(test.data$Clicked.on.Ad))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 141  11
##          1   2 146
##                                          
##                Accuracy : 0.9567         
##                  95% CI : (0.927, 0.9767)
##     No Information Rate : 0.5233         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9134         
##                                          
##  Mcnemar's Test P-Value : 0.0265         
##                                          
##             Sensitivity : 0.9860         
##             Specificity : 0.9299         
##          Pos Pred Value : 0.9276         
##          Neg Pred Value : 0.9865         
##              Prevalence : 0.4767         
##          Detection Rate : 0.4700         
##    Detection Prevalence : 0.5067         
##       Balanced Accuracy : 0.9580         
##                                          
##        'Positive' Class : 0              
##

The model accurately predicts 286/300 giving an accuracy of 95.33%.

# plot error rate

plot(rf)

# plot histogram for number of nodes for the trees

hist(treesize(rf),
     main = "No. of Nodes for the Trees",
     col = "darkgreen")

# plot for variable importance
varImpPlot(rf,
           sort = T,
           n.var = 10,
           main = "Top 10 - Variable Importance")

Variables that contribute to highly prediction results are Daily.Time.Spent.on.site, Daily.Internet.Usage, Area.Income and Age.

# Variable importance table 

importance(rf)

##                                    0          1 MeanDecreaseAccuracy
## Daily.Time.Spent.on.Site 70.34756620 43.9237489           74.6417855
## Age                      25.61975557 14.5936398           28.3155211
## Area.Income              33.87424027 16.6940831           35.8757079
## Daily.Internet.Usage     72.07179889 47.8056818           75.7550585
## City                     -0.03163044 -1.9528844           -1.3024274
## Male                     -1.64125001  0.7121310           -0.8089057
## Country                  -1.91312199 -1.4012224           -2.5439292
## Weekday                   0.84543797 -1.3807511           -0.1250853
## Month                    -0.40545072  1.0491655            0.3166916
## Hour                     -0.50677050 -0.6480012           -0.7820694
##                          MeanDecreaseGini
## Daily.Time.Spent.on.Site      117.0112002
## Age                            32.8555732
## Area.Income                    36.0218761
## Daily.Internet.Usage          140.7513451
## City                            5.6945610
## Male                            0.9232822
## Country                         5.8342330
## Weekday                         3.2211116
## Month                           2.8348992
## Hour                            4.1770765

Link: https://www.r-bloggers.com/2021/04/random-forest-in-r/

iii.) KNN classifier model

# install class package
library(class)


# define k to be the the square root of the number of observations
k <- sqrt(length(df))

# build knn classifier
knn_model_pred <- knn(train = train.data, test = test.data ,cl = train.data[,11], k=k)

# Evaluate model performance

# import gmodels package
library(gmodels)

# plot crosstable for correct labels vs predicted values
CrossTable(x=test.data[,11], y = knn_model_pred, prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  300 
## 
##  
##                 | knn_model_pred 
## test.data[, 11] |         0 |         1 | Row Total | 
## ----------------|-----------|-----------|-----------|
##               0 |       143 |         0 |       143 | 
##                 |     1.000 |     0.000 |     0.477 | 
##                 |     1.000 |     0.000 |           | 
##                 |     0.477 |     0.000 |           | 
## ----------------|-----------|-----------|-----------|
##               1 |         0 |       157 |       157 | 
##                 |     0.000 |     1.000 |     0.523 | 
##                 |     0.000 |     1.000 |           | 
##                 |     0.000 |     0.523 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       143 |       157 |       300 | 
##                 |     0.477 |     0.523 |           | 
## ----------------|-----------|-----------|-----------|
## 
##

There are no False Negatives or False positives, leading to accuracy result of 100%. This is an overfitting case hence we need to improve our model. This can be done by altering the value of k.

# build knn classifier
knn_model_pred <- knn(train = train.data, test = test.data ,cl = train.data[,11], k=200)


# plot crosstable for correct labels vs predicted values
CrossTable(x=test.data[,11], y = knn_model_pred, prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  300 
## 
##  
##                 | knn_model_pred 
## test.data[, 11] |         0 |         1 | Row Total | 
## ----------------|-----------|-----------|-----------|
##               0 |       143 |         0 |       143 | 
##                 |     1.000 |     0.000 |     0.477 | 
##                 |     1.000 |     0.000 |           | 
##                 |     0.477 |     0.000 |           | 
## ----------------|-----------|-----------|-----------|
##               1 |         0 |       157 |       157 | 
##                 |     0.000 |     1.000 |     0.523 | 
##                 |     0.000 |     1.000 |           | 
##                 |     0.000 |     0.523 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       143 |       157 |       300 | 
##                 |     0.477 |     0.523 |           | 
## ----------------|-----------|-----------|-----------|
## 
##

No matter the value of k, the model still attains an accuracy of 100%. Therefore KNN classifier is a poor choice for this machine learning problem.

Link: https://www.analyticsvidhya.com/blog/2015/08/learning-concept-knn-algorithms-programming/

iv.) Support Vector Machine Classifier

# import library
library(e1071)

# build model classifier 
svm_model <- svm(Clicked.on.Ad ~ .,
                 data = train.data,
                 type = 'C-classification',
                 kernel = 'linear')
svm_model

## 
## Call:
## svm(formula = Clicked.on.Ad ~ ., data = train.data, type = "C-classification", 
##     kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  60

# predict on the test data

# Predicting the Test set results
y_pred <- predict(svm_model, test.data)

# evaluate results with Confusion Matrix

table(test.data[,11], y_pred)

##    y_pred
##       0   1
##   0 141   2
##   1   9 148

The model correctly predicted 289/300 results accurately achieving an accuracy of 96.33%.

Link: https://www.geeksforgeeks.org/classifying-data-using-support-vector-machinessvms-in-r/

v.) Naive Bayes Classifier model

# import library
library(naivebayes)

## naivebayes 0.9.7 loaded

# build naive bayes model
nb_model <- naive_bayes(as.factor(Clicked.on.Ad) ~ ., data = train.data, usekernel = T) 
nb_model

## 
## ================================== Naive Bayes ================================== 
##  
##  Call: 
## naive_bayes.formula(formula = as.factor(Clicked.on.Ad) ~ ., data = train.data, 
##     usekernel = T)
## 
## --------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## --------------------------------------------------------------------------------- 
##  
##  A priori probabilities: 
## 
##    0    1 
## 0.51 0.49 
## 
## --------------------------------------------------------------------------------- 
##  
##  Tables: 
## 
## --------------------------------------------------------------------------------- 
##  ::: Daily.Time.Spent.on.Site::0 (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (357 obs.);  Bandwidth 'bw' = 0.03499
## 
##        x                y           
##  Min.   :0.1605   Min.   :0.000402  
##  1st Qu.:0.3966   1st Qu.:0.066860  
##  Median :0.6328   Median :0.601003  
##  Mean   :0.6328   Mean   :1.057785  
##  3rd Qu.:0.8689   3rd Qu.:1.921940  
##  Max.   :1.1050   Max.   :3.258957  
## 
## --------------------------------------------------------------------------------- 
##  ::: Daily.Time.Spent.on.Site::1 (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (343 obs.);  Bandwidth 'bw' = 0.0626
## 
##        x                 y            
##  Min.   :-0.1878   Min.   :0.0002387  
##  1st Qu.: 0.1559   1st Qu.:0.1151764  
##  Median : 0.4995   Median :0.5877182  
##  Mean   : 0.4995   Mean   :0.7267807  
##  3rd Qu.: 0.8431   3rd Qu.:1.3986312  
##  Max.   : 1.1868   Max.   :1.7890429  
## 
## --------------------------------------------------------------------------------- 
##  ::: Age::0 (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (357 obs.);  Bandwidth 'bw' = 0.04093
## 
##        x                 y            
##  Min.   :-0.1228   Min.   :0.0003176  
##  1st Qu.: 0.1410   1st Qu.:0.0723495  
##  Median : 0.4048   Median :0.5354041  
##  Mean   : 0.4048   Mean   :0.9468385  
##  3rd Qu.: 0.6685   3rd Qu.:1.7892283  
##  Max.   : 0.9323   Max.   :2.6997800  
## 
## --------------------------------------------------------------------------------- 
##  ::: Age::1 (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (343 obs.);  Bandwidth 'bw' = 0.05833
## 
##        x                 y           
##  Min.   :-0.1512   Min.   :0.000243  
##  1st Qu.: 0.1804   1st Qu.:0.107486  
##  Median : 0.5119   Median :0.677683  
##  Mean   : 0.5119   Mean   :0.753323  
##  3rd Qu.: 0.8434   3rd Qu.:1.330641  
##  Max.   : 1.1750   Max.   :1.702461  
## 
## --------------------------------------------------------------------------------- 
##  ::: Area.Income::0 (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (357 obs.);  Bandwidth 'bw' = 0.03699
## 
##        x                y           
##  Min.   :0.1992   Min.   :0.000419  
##  1st Qu.:0.4271   1st Qu.:0.156402  
##  Median :0.6551   Median :0.803906  
##  Mean   :0.6551   Mean   :1.095657  
##  3rd Qu.:0.8830   3rd Qu.:1.842427  
##  Max.   :1.1110   Max.   :3.238919  
## 
## --------------------------------------------------------------------------------- 
##  ::: Area.Income::1 (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (343 obs.);  Bandwidth 'bw' = 0.05721
## 
##        x                 y            
##  Min.   :-0.1716   Min.   :0.0005004  
##  1st Qu.: 0.1552   1st Qu.:0.1198769  
##  Median : 0.4821   Median :0.6874327  
##  Mean   : 0.4821   Mean   :0.7640373  
##  3rd Qu.: 0.8090   3rd Qu.:1.3237183  
##  Max.   : 1.1359   Max.   :1.8983892  
## 
## --------------------------------------------------------------------------------- 
##  ::: Daily.Internet.Usage::0 (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (357 obs.);  Bandwidth 'bw' = 0.0406
## 
##        x                y            
##  Min.   :0.1289   Min.   :0.0003578  
##  1st Qu.:0.3727   1st Qu.:0.1207824  
##  Median :0.6164   Median :0.6923053  
##  Mean   :0.6164   Mean   :1.0245869  
##  3rd Qu.:0.8602   3rd Qu.:1.9726890  
##  Max.   :1.1039   Max.   :2.5895565  
## 
## --------------------------------------------------------------------------------- 
##  ::: Daily.Internet.Usage::1 (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (343 obs.);  Bandwidth 'bw' = 0.04779
## 
##        x                 y           
##  Min.   :-0.1434   Min.   :0.000277  
##  1st Qu.: 0.1783   1st Qu.:0.090494  
##  Median : 0.5000   Median :0.366081  
##  Mean   : 0.5000   Mean   :0.776370  
##  3rd Qu.: 0.8217   3rd Qu.:1.443802  
##  Max.   : 1.1434   Max.   :2.587663  
## 
## --------------------------------------------------------------------------------- 
##  ::: City::0 (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (357 obs.);  Bandwidth 'bw' = 0.08019
## 
##        x                 y           
##  Min.   :-0.2385   Min.   :0.001563  
##  1st Qu.: 0.1310   1st Qu.:0.261023  
##  Median : 0.5005   Median :0.904067  
##  Mean   : 0.5005   Mean   :0.675850  
##  3rd Qu.: 0.8700   3rd Qu.:0.968428  
##  Max.   : 1.2396   Max.   :1.101864  
## 
## --------------------------------------------------------------------------------- 
##  ::: City::1 (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (343 obs.);  Bandwidth 'bw' = 0.08241
## 
##        x                 y           
##  Min.   :-0.2472   Min.   :0.001645  
##  1st Qu.: 0.1259   1st Qu.:0.255874  
##  Median : 0.4990   Median :0.916110  
##  Mean   : 0.4990   Mean   :0.669351  
##  3rd Qu.: 0.8721   3rd Qu.:0.962704  
##  Max.   : 1.2452   Max.   :1.064139  
## 
## ---------------------------------------------------------------------------------
## 
## # ... and 5 more tables
## 
## ---------------------------------------------------------------------------------

# Make prediction
test.data$Clicked.on.Ad <- as.factor(test.data$Clicked.on.Ad)
y_pred <- predict(nb_model, test.data)

## Warning: predict.naive_bayes(): more features in the newdata are provided as
## there are probability tables in the object. Calculation is performed based on
## features to be found in the tables.

# Evaluate results with confusion matrix
table(test.data$Clicked.on.Ad, y_pred)

##    y_pred
##       0   1
##   0 140   3
##   1  12 145

The confusion matrix shows that the model is able to accurately classify 285 observations out of 300 resulting in an accuracy of 95%.

Conclusion

The performance of all our models is very good. The best performing model is the Logistic Regression model. Cross validation method such as k-fold cross validation can be used to validate this.

All models except KNN classifier are good enough to solve this classification problem.

8. Conclusion

According to our analysis, gender contributes the least to who clicks on the ad or not. This is seen from the small change in percentage from the whole dataset to those who clicked on the ad. However, the rest of the attributes to some extent have an impact to this effect. Geographical location, Internet accessibility, Age and Date determine who clicks the ad.

9. Recommendations

From our findings, we are able to deduce the following in an attempt to determine a target market for our client:

i.) More females are likely to click on the ad than male. ii.) Mid Income earners between 40k and 60k are a likely target market. iii.) The ratio of the elderly who may click on the ad is higher than that of the younger group. This age ranges from 45 and above.
iv.) Client should target people from Australia, Ethopia and Turkey countries who seem most interested in the cryptography course. v.) Client should also narrow her target market to the following cities: - Lake David - Lake James - Lisamouth - Michelleside - Millerbury - Robertfurt - South Lisa - West Amanda - West Shannon - Williamsport vi.) In order to attract more customers, the client should consider discounts for her course in February and June in order to attract more customers. This way, she is likely to attract more people to her blog and course.

Cleaning And EDA with R

2022-05-27