A Kenyan entrepreneur has created an online cryptography course and wants to advertise it on her blog. She currently targets audiences from various countries. In the past, she ran ads for a related course on the same blog and collected data in the process. She would now like to employ your services as a Data Science Consultant to help her identify which individuals are most likely to click on her ads.
The individuals most likely to click on her advertisements are correctly identified
Defining the research questions and work plan
Loading the dataset
Previewing the dataset
Cleaning the dataset which will entail dealing with outliers, duplicates and missing values appropriately
Feature engineering
Performing univariate, bivariate and multivariate analysis on the data set
Creating a supervised learning model
Challenging the solution
Concluding based on the findings of the research
Providing recommendations based on the conclusions arrived at
Further questions
The dataset used is an advertising dataset that contains a total of 10 features.
Age - The age of the individual that viewed the ad
Daily Time Spent on Site - The average time an individual spends on the site
Area Income - The average income of the area from which the ad was viewed
Daily Internet Usage - The daily internet usage information for the area in which the ad was viewed
Ad Topic Line - The topic line of the advertisement
City - The city from which the ad was viewed
Male - The gender of the individual that viewed the ad (0 - Female, 1 - Male)
Country - The country from which the ad was viewed
Timestamp - The time at which the ad was viewed
Clicked on Ad - Indicates whether the individual clicked on the ad or not (0 - Did not click on the ad, 1 - Clicked on the ad)
# Loading the relevant libraries for this study
library(stringr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
library(ggplot2)
library(countrycode)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ tibble 3.1.7 ✔ purrr 0.3.4
## ✔ tidyr 1.2.0 ✔ forcats 0.5.1
## ✔ readr 2.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(moments)
library(paletteer)
library(rpart,quietly = TRUE)
library(caret,quietly = TRUE)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
## The following object is masked from 'package:purrr':
##
## lift
library(rpart.plot,quietly = TRUE)
library(rattle)
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(e1071)
##
## Attaching package: 'e1071'
## The following objects are masked from 'package:moments':
##
## kurtosis, moment, skewness
## The following object is masked from 'package:Hmisc':
##
## impute
library(caTools)
# Reading the advertisement dataset
#
ad_dataset <- read.csv("http://bit.ly/IPAdvertisingData")
Here the structure and shape of the dataset and the data types of the various attributes shall be investigated.
# Previewing the first six records of dataset
head(ad_dataset)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1 68.95 35 61833.90 256.09
## 2 80.23 31 68441.85 193.77
## 3 69.47 26 59785.94 236.50
## 4 74.15 29 54806.18 245.89
## 5 68.37 35 73889.99 225.58
## 6 59.99 23 59761.56 226.74
## Ad.Topic.Line City Male Country
## 1 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2 Monitored national standardization West Jodi 1 Nauru
## 3 Organic bottom-line service-desk Davidton 0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5 Robust logistical utilization South Manuel 0 Iceland
## 6 Sharable client-driven software Jamieberg 1 Norway
## Timestamp Clicked.on.Ad
## 1 2016-03-27 00:53:11 0
## 2 2016-04-04 01:39:02 0
## 3 2016-03-13 20:35:42 0
## 4 2016-01-10 02:31:19 0
## 5 2016-06-03 03:36:18 0
## 6 2016-05-19 14:30:17 0
# view the number of rows and columns in the dataset
#
dim(ad_dataset)
## [1] 1000 10
The data set has a total of 1000 records and 10 attributes/columns.
# Previewing the structure of the ad dataset
#
str(ad_dataset)
## 'data.frame': 1000 obs. of 10 variables:
## $ Daily.Time.Spent.on.Site: num 69 80.2 69.5 74.2 68.4 ...
## $ Age : int 35 31 26 29 35 23 33 48 30 20 ...
## $ Area.Income : num 61834 68442 59786 54806 73890 ...
## $ Daily.Internet.Usage : num 256 194 236 246 226 ...
## $ Ad.Topic.Line : chr "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
## $ City : chr "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
## $ Male : int 0 1 0 1 0 1 0 1 1 1 ...
## $ Country : chr "Tunisia" "Nauru" "San Marino" "Italy" ...
## $ Timestamp : chr "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
## $ Clicked.on.Ad : int 0 0 0 0 0 0 0 1 0 0 ...
There are three data types in the data set: numeric (num), integer (int) and character (chr). Four attributes need conversion to factors: Country and City are stored as character, while Male and Clicked.on.Ad are stored as integers although they take only two values (0 or 1). Timestamp is stored as character and needs conversion to a date-time type.
# Establish the data set class
#
class(ad_dataset)
## [1] "data.frame"
The advertisement data set is a data frame
Some features in the data set are categorical (City, Country, Male and Clicked.on.Ad) but are stored in character or integer format. They shall be converted to factors.
# Converting the attribute male from integer to factor
#
as.factor(ad_dataset$Male) -> ad_dataset$Male
# Converting the attribute Clicked.on.Ad from integer to factor
#
as.factor(ad_dataset$Clicked.on.Ad) -> ad_dataset$Clicked.on.Ad
# Converting the attribute Country from character to factor
#
as.factor(ad_dataset$Country) -> ad_dataset$Country
# Converting the attribute City from character to factor
#
as.factor(ad_dataset$City) -> ad_dataset$City
# converting to datetime object
#
ad_dataset[['Timestamp']] <- as.POSIXct(ad_dataset[['Timestamp']],
format = "%Y-%m-%d %H:%M:%S")
# Check the structure after reassigning the data types
str(ad_dataset)
## 'data.frame': 1000 obs. of 10 variables:
## $ Daily.Time.Spent.on.Site: num 69 80.2 69.5 74.2 68.4 ...
## $ Age : int 35 31 26 29 35 23 33 48 30 20 ...
## $ Area.Income : num 61834 68442 59786 54806 73890 ...
## $ Daily.Internet.Usage : num 256 194 236 246 226 ...
## $ Ad.Topic.Line : chr "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
## $ City : Factor w/ 969 levels "Adamsbury","Adamside",..: 962 904 112 940 806 283 47 672 885 713 ...
## $ Male : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 2 2 ...
## $ Country : Factor w/ 237 levels "Afghanistan",..: 216 148 185 104 97 159 146 13 83 79 ...
## $ Timestamp : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
## $ Clicked.on.Ad : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
All columns now have appropriate data types: numeric, integer, factor, character and POSIXct (date-time).
# Checking the number of missing values per column in the data set
#
colSums(is.na(ad_dataset))
## Daily.Time.Spent.on.Site Age Area.Income
## 0 0 0
## Daily.Internet.Usage Ad.Topic.Line City
## 0 0 0
## Male Country Timestamp
## 0 0 0
## Clicked.on.Ad
## 0
The dataset has no missing values in any of the attributes
# finding the duplicated rows in the data set and assign to a variable duplicated_rows below
#
duplicated_rows = ad_dataset[duplicated(ad_dataset),]
# Printing out the duplicated rows
duplicated_rows
## [1] Daily.Time.Spent.on.Site Age Area.Income
## [4] Daily.Internet.Usage Ad.Topic.Line City
## [7] Male Country Timestamp
## [10] Clicked.on.Ad
## <0 rows> (or 0-length row.names)
The advertisement data set has no duplicate records
# number of rows in data frame
#
num_rows = nrow(ad_dataset)
# creating ID column vector
#
ID <- c(1:num_rows)
# binding id column to the data frame
#
ad_dataset1 <- cbind(ID , ad_dataset)
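# Applying names function to get the column names of the numeric columns in
# the dataset as a character vector (named numeric_cols so that it does not
# shadow the base function colnames())
#
numeric_cols <- names(select_if(ad_dataset1, is.numeric))
# Print vector of column names
#
numeric_cols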
## [1] "ID" "Daily.Time.Spent.on.Site"
## [3] "Age" "Area.Income"
## [5] "Daily.Internet.Usage"
# creating the modified data frame
#
data_mod1 <- melt(ad_dataset1, id.vars='ID',
measure.vars=c("Area.Income"))
# creating a plot of area income
#
p <- ggplot(data_mod1) +
geom_boxplot(aes(x=variable, y=value, color=variable)) # one box per variable
# printing the plot
#
print(p)
# creating the modified data frame
#
data_mod2 <- melt(ad_dataset1, id.vars='ID',
measure.vars=c("Daily.Time.Spent.on.Site", "Age",
"Daily.Internet.Usage" ))
# creating a plot of three other numerical columns
#
p <- ggplot(data_mod2) +
geom_boxplot(aes(x=variable, y=value, color=variable)) # one box per variable
# printing the plot
#
print(p)
Outliers were observed only in the area income attribute. This is plausible given the great disparity in development and GDP levels between countries globally, so the outliers are retained.
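The box plots flag outliers visually; the same check can be made numerically. The following is a minimal sketch on made-up values standing in for a column such as Area.Income (the numbers are not from the dataset). It uses base R's boxplot.stats(), which reports points lying more than 1.5 times the interquartile range beyond the hinges, alongside a similar manual check based on quantiles.

```r
# Made-up values standing in for a numeric column such as Area.Income
income <- c(52000, 55000, 57000, 58000, 60000, 61000, 63000, 65000, 14000)

# boxplot.stats() returns the points beyond 1.5 * IQR from the hinges
outliers <- boxplot.stats(income)$out

# A similar manual check using the quartiles
q <- quantile(income, c(0.25, 0.75))
iqr <- q[2] - q[1]
manual <- income[income < q[1] - 1.5 * iqr | income > q[2] + 1.5 * iqr]

print(outliers)
print(manual)
```

Applied to the real data, the same call on each numeric column would give a count of flagged points per attribute rather than relying on the plot alone.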
# Creating a new column that counts the number of words per ad topic line
#
ad_dataset <- ad_dataset %>%
mutate(word.Counter = str_count(Ad.Topic.Line, pattern = "\\w+"))
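One detail of the "\\w+" pattern is worth illustrating: it counts runs of word characters, so a hyphenated term such as "bottom-line" contributes two words. The sketch below uses base R only, on two topic lines taken from the preview of the dataset, and compares that count with a count of whitespace-separated tokens. Which definition is preferable is a judgment call; this is only meant to show the difference.

```r
titles <- c("Cloned 5thgeneration orchestration",
            "Organic bottom-line service-desk")

# Runs of word characters: hyphenated terms split into several "words"
regex_words <- lengths(regmatches(titles, gregexpr("\\w+", titles)))

# Whitespace-separated tokens: hyphenated terms count once
space_words <- lengths(strsplit(titles, "\\s+"))

print(regex_words)  # 3 5
print(space_words)  # 3 3
```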
# Grouping the countries according to continent
#
ad_dataset$continent <- countrycode(sourcevar = ad_dataset[, "Country"],
origin = "country.name",
destination = "continent")
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Antarctica (the territory South of 60 deg S), Bouvet Island (Bouvetoya), British Indian Ocean Territory (Chagos Archipelago), French Southern Territories, Heard Island and McDonald Islands, Micronesia, Saint Martin, South Georgia and the South Sandwich Islands, United States Minor Outlying Islands
# Getting unique values in continent feature
#
unique(ad_dataset$continent)
## [1] "Africa" "Oceania" "Europe" "Asia" "Americas" NA
# Finding out the total number of null values in the ad_dataset
#
sum(is.na(ad_dataset))
## [1] 35
There are 35 missing values in the ad data set.
# Isolating the records with null values in the ad data set to investigate
# them further
#
test <-
ad_dataset %>%
filter(is.na(continent))
# previewing first six records of the test data set
#
head(test)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1 54.70 36 31087.54 118.39
## 2 76.02 22 46179.97 209.82
## 3 50.33 50 62657.53 133.20
## 4 46.13 31 60248.97 139.01
## 5 70.79 31 74535.94 184.10
## 6 43.67 31 25686.34 166.29
## Ad.Topic.Line City Male
## 1 Grass-roots solution-oriented conglomeration Jessicastad 1
## 2 Business-focused value-added definition West Guybury 0
## 3 Sharable analyzing alliance South Lauraton 1
## 4 Customer-focused optimizing moderator Davidmouth 0
## 5 Distributed tertiary system engine Sharpberg 0
## 6 Automated directional function New Theresa 1
## Country Timestamp
## 1 British Indian Ocean Territory (Chagos Archipelago) 2016-02-13 07:53:55
## 2 Bouvet Island (Bouvetoya) 2016-01-27 12:38:16
## 3 Micronesia 2016-03-02 04:57:51
## 4 Bouvet Island (Bouvetoya) 2016-02-01 09:00:55
## 5 Bouvet Island (Bouvetoya) 2016-03-15 15:49:14
## 6 Antarctica (the territory South of 60 deg S) 2016-02-28 06:41:44
## Clicked.on.Ad word.Counter continent
## 1 1 5 <NA>
## 2 0 5 <NA>
## 3 1 3 <NA>
## 4 1 4 <NA>
## 5 0 4 <NA>
## 6 1 3 <NA>
All missing values occurred in the continent column. These are regions that the countrycode package could not assign to one of its five continents. These regions shall be explored further for appropriate classification.
# Previewing the unique regions in the test data set that have missing
# continent data
#
unique(test$Country)
## [1] British Indian Ocean Territory (Chagos Archipelago)
## [2] Bouvet Island (Bouvetoya)
## [3] Micronesia
## [4] Antarctica (the territory South of 60 deg S)
## [5] Saint Martin
## [6] United States Minor Outlying Islands
## [7] French Southern Territories
## [8] Heard Island and McDonald Islands
## [9] South Georgia and the South Sandwich Islands
## 237 Levels: Afghanistan Albania Algeria American Samoa Andorra ... Zimbabwe
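Not all of these regions are in Antarctica. Some are Antarctic or sub-Antarctic territories (for example Antarctica itself, Bouvet Island, and Heard Island and McDonald Islands), while others such as Micronesia (Oceania) and Saint Martin (the Caribbean) belong elsewhere. Since they account for only 35 records, the null values shall be grouped under a single "Antarctica" label as a simplification.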
# replacing NA with Antarctica
#
ad_dataset$continent[is.na(ad_dataset$continent)] <- "Antarctica"
# Preview unique continent values
#
unique(ad_dataset$continent)
## [1] "Africa" "Oceania" "Europe" "Asia" "Americas"
## [6] "Antarctica"
The missing continent values have been successfully replaced.
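A finer-grained alternative to a single catch-all label is to override specific regions before falling back to it. The sketch below uses base R and a small made-up vector; the override table is a hypothetical illustration (the two assignments shown are assumptions chosen for the example), and it is not what the analysis above does.

```r
# Continent lookup result with unmatched entries (NA), on made-up rows
country   <- c("Tunisia", "Micronesia", "Saint Martin",
               "Bouvet Island (Bouvetoya)")
continent <- c("Africa", NA, NA, NA)

# Hypothetical manual overrides; these assignments are illustrative
overrides <- c("Micronesia"   = "Oceania",
               "Saint Martin" = "Americas")

# Apply overrides where available, then fall back to a catch-all label
hit <- is.na(continent) & country %in% names(overrides)
continent[hit] <- overrides[country[hit]]
continent[is.na(continent)] <- "Antarctica"

print(continent)
```

With only 35 affected records the choice is unlikely to change the modelling results much, but it keeps the continent groups geographically meaningful.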
# Extracting the year from the time stamp
# Note: as.Date() on a POSIXct value converts to UTC by default, so a time stamp
# close to midnight in another time zone can land on the adjacent date;
# format(ad_dataset$Timestamp, "%Y") would use the time stamp's own time zone
#
ad_dataset$year <- format(as.Date(ad_dataset$Timestamp), "%Y")
# Extracting the month from the time stamp
#
ad_dataset$month <- format(as.Date(ad_dataset$Timestamp), "%m")
# Converting the time stamp to the day of the week (weekdays function)
#
ad_dataset$weekday <- weekdays(ad_dataset$Timestamp)
# Extracting the hour of day from the time stamp
#
ad_dataset$hour <- format(ad_dataset$Timestamp, "%H")
# Checking data types of newly created columns
#
str(ad_dataset)
## 'data.frame': 1000 obs. of 16 variables:
## $ Daily.Time.Spent.on.Site: num 69 80.2 69.5 74.2 68.4 ...
## $ Age : int 35 31 26 29 35 23 33 48 30 20 ...
## $ Area.Income : num 61834 68442 59786 54806 73890 ...
## $ Daily.Internet.Usage : num 256 194 236 246 226 ...
## $ Ad.Topic.Line : chr "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
## $ City : Factor w/ 969 levels "Adamsbury","Adamside",..: 962 904 112 940 806 283 47 672 885 713 ...
## $ Male : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 2 2 ...
## $ Country : Factor w/ 237 levels "Afghanistan",..: 216 148 185 104 97 159 146 13 83 79 ...
## $ Timestamp : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
## $ Clicked.on.Ad : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ word.Counter : int 3 3 5 5 3 4 3 3 3 3 ...
## $ continent : chr "Africa" "Oceania" "Europe" "Europe" ...
## $ year : chr "2016" "2016" "2016" "2016" ...
## $ month : chr "03" "04" "03" "01" ...
## $ weekday : chr "Sunday" "Monday" "Sunday" "Sunday" ...
## $ hour : chr "00" "01" "20" "02" ...
The year, month, weekday and hour have inappropriate data types. The year shall be converted to integer data type while the remaining three shall be converted to factor data types.
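One subtlety when extracting date parts: as.Date() applied to a POSIXct value converts it to UTC first, so a time stamp shortly after midnight in a zone ahead of UTC falls on the previous calendar day. The sketch below illustrates this with a hypothetical time stamp and an assumed session time zone of "Africa/Nairobi" (both are assumptions for illustration; the time zone in which the data was processed is not stated).

```r
# Hypothetical time stamp shortly after midnight in a UTC+3 time zone
ts <- as.POSIXct("2016-01-01 02:19:06", tz = "Africa/Nairobi")

# as.Date() on POSIXct converts to UTC (written out explicitly here)
year_via_date  <- format(as.Date(ts, tz = "UTC"), "%Y")
month_via_date <- format(as.Date(ts, tz = "UTC"), "%m")

# format() on the POSIXct value itself keeps the value's own time zone
year_direct  <- format(ts, "%Y")
month_direct <- format(ts, "%m")

print(c(year_via_date, month_via_date))  # "2015" "12"
print(c(year_direct, month_direct))      # "2016" "01"
```

Formatting the POSIXct value directly avoids the shift and keeps the year, month, weekday and hour consistent with each other.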
# Converting the attribute year from character to integer
#
as.integer(ad_dataset$year) -> ad_dataset$year
# Converting continent, hour, month, and weekday from character data type
# to factor data type
#
cols <- c("hour", "weekday", "continent", "month")
# Applying factor conversion
#
ad_dataset[cols] <- lapply(ad_dataset[cols], factor)
# Checking the datatypes of the different columns after reassigning
# The datatypes
#
str(ad_dataset)
## 'data.frame': 1000 obs. of 16 variables:
## $ Daily.Time.Spent.on.Site: num 69 80.2 69.5 74.2 68.4 ...
## $ Age : int 35 31 26 29 35 23 33 48 30 20 ...
## $ Area.Income : num 61834 68442 59786 54806 73890 ...
## $ Daily.Internet.Usage : num 256 194 236 246 226 ...
## $ Ad.Topic.Line : chr "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
## $ City : Factor w/ 969 levels "Adamsbury","Adamside",..: 962 904 112 940 806 283 47 672 885 713 ...
## $ Male : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 2 2 ...
## $ Country : Factor w/ 237 levels "Afghanistan",..: 216 148 185 104 97 159 146 13 83 79 ...
## $ Timestamp : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
## $ Clicked.on.Ad : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ word.Counter : int 3 3 5 5 3 4 3 3 3 3 ...
## $ continent : Factor w/ 6 levels "Africa","Americas",..: 1 6 5 5 5 5 4 6 2 1 ...
## $ year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ month : Factor w/ 8 levels "01","02","03",..: 3 4 3 1 6 5 1 3 4 7 ...
## $ weekday : Factor w/ 7 levels "Friday","Monday",..: 4 2 4 4 1 5 5 2 2 2 ...
## $ hour : Factor w/ 24 levels "00","01","02",..: 1 2 21 3 4 15 21 2 10 2 ...
Two columns are being dropped (Ad.Topic.Line and Country):
• The ad topic line is dropped since, with the word counter column in place, it ceases adding value to the study
• The country column is dropped since the continent data makes it redundant
# Selecting non numeric columns in the ad data set
#
non_num <- ad_dataset %>% select_if(negate(is.numeric))
# Previewing first six records of non_numeric columns in data frame
#
head(non_num)
## Ad.Topic.Line City Male Country
## 1 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2 Monitored national standardization West Jodi 1 Nauru
## 3 Organic bottom-line service-desk Davidton 0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5 Robust logistical utilization South Manuel 0 Iceland
## 6 Sharable client-driven software Jamieberg 1 Norway
## Timestamp Clicked.on.Ad continent month weekday hour
## 1 2016-03-27 00:53:11 0 Africa 03 Sunday 00
## 2 2016-04-04 01:39:02 0 Oceania 04 Monday 01
## 3 2016-03-13 20:35:42 0 Europe 03 Sunday 20
## 4 2016-01-10 02:31:19 0 Europe 01 Sunday 02
## 5 2016-06-03 03:36:18 0 Europe 06 Friday 03
## 6 2016-05-19 14:30:17 0 Europe 05 Thursday 14
# Finding unique values of the non_numeric columns
#
rapply(non_num,function(x)length(unique(x)))
## Ad.Topic.Line City Male Country Timestamp
## 1000 969 2 237 1000
## Clicked.on.Ad continent month weekday hour
## 2 6 8 7 24
From the summary above, the ad topic line has 1000 unique values, which equals the number of records in the data set. This makes it hard to draw any insights from this feature. Since the word count column was derived from it, the column is redundant and shall be dropped.
The city feature also has 969 unique values out of 1000 records. Insights drawn from it would not be helpful, hence the city column shall be dropped too.
The target variable has two unique values.
# Dropping the country, time stamp,city and ad topic line columns
#
ad_dataset2 <- subset(ad_dataset, select = -c(City, Ad.Topic.Line,
Country, Timestamp))
# Confirming to see whether the columns have been dropped
#
colnames(ad_dataset2)
## [1] "Daily.Time.Spent.on.Site" "Age"
## [3] "Area.Income" "Daily.Internet.Usage"
## [5] "Male" "Clicked.on.Ad"
## [7] "word.Counter" "continent"
## [9] "year" "month"
## [11] "weekday" "hour"
The city, ad topic line, country and timestamp columns have been successfully dropped.
# Creating data set with numeric variables only
# Identifying the numeric class in the data and evaluating if there are any
# outliers
#
num_cols <- unlist(lapply(ad_dataset2, is.numeric))
# Subset numeric columns of data
#
num_dataset <- ad_dataset2[ , num_cols]
# Printing the subset to RStudio console
#
head(num_dataset)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage word.Counter
## 1 68.95 35 61833.90 256.09 3
## 2 80.23 31 68441.85 193.77 3
## 3 69.47 26 59785.94 236.50 5
## 4 74.15 29 54806.18 245.89 5
## 5 68.37 35 73889.99 225.58 3
## 6 59.99 23 59761.56 226.74 4
## year
## 1 2016
## 2 2016
## 3 2016
## 4 2016
## 5 2016
## 6 2016
# Creating the mode function that will perform our mode operation for us
# ---
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Computing some descriptive statistics
# ---
#
desc_stats <- data.frame(
Mode = apply(num_dataset, 2, getmode), # Mode
Med = apply(num_dataset, 2, median), # median
Mean = apply(num_dataset, 2, mean), # mean
SD = apply(num_dataset, 2, sd), # Standard deviation
Var = apply(num_dataset, 2, var), # Variance
Min = apply(num_dataset, 2, min), # minimum
Max = apply(num_dataset, 2, max), # Maximum
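# Note: e1071 (loaded after moments) masks skewness() and kurtosis(); its
# versions expect a vector, so a whole data frame yields NA here (see the
# warnings below). sapply(num_dataset, moments::skewness) gives per-column values.
skewness = skewness(num_dataset), # skewness
kurtosis = kurtosis(num_dataset) # kurtosis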
)
## Warning in mean.default(x): argument is not numeric or logical: returning NA
## Warning in mean.default(x): argument is not numeric or logical: returning NA
desc_stats <- round(desc_stats, 2)
desc_stats
## Mode Med Mean SD Var
## Daily.Time.Spent.on.Site 62.26 68.22 65.00 15.85 251.34
## Age 31.00 35.00 36.01 8.79 77.19
## Area.Income 61833.90 57012.30 55000.00 13414.63 179952405.95
## Daily.Internet.Usage 167.22 183.13 180.00 43.90 1927.42
## word.Counter 4.00 4.00 3.98 0.87 0.75
## year 2016.00 2016.00 2016.00 0.03 0.00
## Min Max skewness kurtosis
## Daily.Time.Spent.on.Site 32.60 91.43 NA NA
## Age 19.00 61.00 NA NA
## Area.Income 13996.50 79484.80 NA NA
## Daily.Internet.Usage 104.78 269.96 NA NA
## word.Counter 3.00 7.00 NA NA
## year 2015.00 2016.00 NA NA
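From the descriptive statistics:
The modal age was 31 years. The modes reported for the continuous variables (62.26 seconds on the site, 167.22 MB of daily internet usage and an area income of 61833.90 USD) are less informative, since such values rarely repeat. The area income mode equals the value in the first record, which is consistent with getmode returning the first value when no value occurs more often than any other.
All numeric columns except year had comparable spread relative to their means, with standard deviations of roughly a quarter of the mean. The minimum and maximum values show a wide disparity in area income, age, daily internet usage and time spent on the site across users.
The extracted year takes the values 2015 and 2016, and most ad topic lines had 4 words. The previewed time stamps all fall in 2016, so the 2015 value may be an artifact of as.Date() converting the time stamps to UTC before the year was extracted.
The skewness and kurtosis columns in the table are NA because skewness() and kurtosis() were called on the whole data frame after e1071 masked the moments versions. Computed per column, all numeric columns are skewed to the left or right aside from daily internet usage, which had a very small skew to the left (-0.03).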
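Skewness and kurtosis need to be computed one column at a time. The sketch below defines moment-based versions in base R (the helper names skew and kurt are made up for this example, and the small data frame is synthetic) and applies them column by column with sapply(). To the best of my knowledge this matches the population-moment definition used by the moments package, but treat that as an assumption.

```r
# Moment-based skewness and kurtosis for a single numeric vector
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
kurt <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2   # a normal distribution scores 3
}

# Synthetic data frame standing in for num_dataset
df <- data.frame(symmetric  = c(1, 2, 3, 4, 5),
                 right_tail = c(1, 1, 1, 2, 10))

col_skew <- sapply(df, skew)   # one value per column
col_kurt <- sapply(df, kurt)

print(round(col_skew, 2))
print(round(col_kurt, 2))
```

A positive value indicates a longer right tail and a negative value a longer left tail, which gives a direct check on what the histograms suggest.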
# Histogram plots of numeric data in the ad_dataset
hist.data.frame(num_dataset)
From the histogram plots and skewness data, the numeric columns are not normally distributed, aside from the column containing information on daily internet usage.
Daily time spent on site and area income are skewed to the left (their means, 65.00 and 55000.00, fall below their medians, 68.22 and 57012.30), while age is skewed to the right (mean 36.01 above median 35.00). Word count ranges from 3 to 7 with a mode of 4, so its longer tail is on the right.
From the histograms, most people spend around 80 seconds on the site, most ads have 4 words, and most people are from areas with annual incomes of 60,000-70,000 USD. Most individuals are aged 30-40 years.
# Bar chart of the genders in data set
ggplot(ad_dataset2, aes(x = Male)) +
geom_bar(fill = "coral") +
theme_classic()
# Bar chart of the individuals who clicked and those who did not click on ad
ggplot(ad_dataset2, aes(x = Clicked.on.Ad)) +
geom_bar(fill = "coral") +
theme_classic()
# Bar chart of the months the data was collected
ggplot(ad_dataset2, aes(x = month)) +
geom_bar(fill = "coral") +
theme_classic()
# Bar chart of the hours the data was collected
ggplot(ad_dataset2, aes(x = hour)) +
geom_bar(fill = "coral") +
theme_classic()
# Bar chart of the weekdays the data was collected
ggplot(ad_dataset2, aes(x = weekday)) +
geom_bar(fill = "coral") +
theme_classic()
# Bar chart of the continent the data was collected
ggplot(ad_dataset2, aes(x = continent)) +
geom_bar(fill = "coral") +
theme_classic()
From the bar plots:
The site was most visited at 0700hrs and least visited at 1000hrs and 0100hrs. Most of the visitors to the site were female. The data set had an equal number of records in which the ad was clicked and not clicked. The site had the highest number of visitors in February and the least in December. It had the highest number of visitors on a Sunday and the least on a Tuesday. The highest number of visitors originated from Asia and the Americas, and the least from the Antarctica group.
Covariance is a statistical measure of the degree to which two variables vary together. Here the covariance between the numerical variables in the data frame shall be calculated.
# Create Covariance matrix of the numerical data in dataset
#
cov(num_dataset)
## Daily.Time.Spent.on.Site Age Area.Income
## Daily.Time.Spent.on.Site 2.513371e+02 -4.617415e+01 6.613081e+04
## Age -4.617415e+01 7.718611e+01 -2.152093e+04
## Area.Income 6.613081e+04 -2.152093e+04 1.799524e+08
## Daily.Internet.Usage 3.609919e+02 -1.416348e+02 1.987625e+05
## word.Counter -3.507864e-01 -2.260280e-01 4.037198e+02
## year -1.568549e-02 2.011011e-03 -3.913273e+00
## Daily.Internet.Usage word.Counter year
## Daily.Time.Spent.on.Site 3.609919e+02 -3.507864e-01 -1.568549e-02
## Age -1.416348e+02 -2.260280e-01 2.011011e-03
## Area.Income 1.987625e+05 4.037198e+02 -3.913273e+00
## Daily.Internet.Usage 1.927415e+03 2.486909e-01 -5.981972e-02
## word.Counter 2.486909e-01 7.542703e-01 -2.202202e-05
## year -5.981972e-02 -2.202202e-05 1.000000e-03
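From the covariance matrix, age had a negative covariance with daily time spent on site, area income, daily internet usage and word counter. Year had a negative covariance with every variable except age, although the magnitudes are close to zero. Daily time spent on site, area income and daily internet usage had positive covariances with each other. Word counter covaried negatively with daily time spent on site and positively with area income and daily internet usage.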
# Correlation matrix of numerical data in the ad dataset
#
cor(num_dataset)
## Daily.Time.Spent.on.Site Age Area.Income
## Daily.Time.Spent.on.Site 1.00000000 -0.331513343 0.310954413
## Age -0.33151334 1.000000000 -0.182604955
## Area.Income 0.31095441 -0.182604955 1.000000000
## Daily.Internet.Usage 0.51865848 -0.367208560 0.337495533
## word.Counter -0.02547716 -0.029623014 0.034652749
## year -0.03128741 0.007238438 -0.009224893
## Daily.Internet.Usage word.Counter year
## Daily.Time.Spent.on.Site 0.518658475 -0.025477156 -0.031287414
## Age -0.367208560 -0.029623014 0.007238438
## Area.Income 0.337495533 0.034652749 -0.009224893
## Daily.Internet.Usage 1.000000000 0.006522419 -0.043088037
## word.Counter 0.006522419 1.000000000 -0.000801851
## year -0.043088037 -0.000801851 1.000000000
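Age has a negative correlation with every other numerical variable except year (0.007, effectively zero). Daily time spent on site, area income and daily internet usage are positively correlated with each other. The strongest relationship is the moderate positive correlation between daily time spent on site and daily internet usage (0.52). Word counter and year have correlations close to zero with all other variables.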
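The correlation matrix is the covariance matrix rescaled by the standard deviations, r = cov(x, y) / (sd(x) * sd(y)). As a quick check, the sketch below plugs in the values printed in the covariance matrix above for daily time spent on site and daily internet usage (the variable names are chosen for this example) and recovers the correlation reported for that pair.

```r
# Values read off the covariance matrix above
var_time     <- 251.3371   # variance of Daily.Time.Spent.on.Site
var_internet <- 1927.415   # variance of Daily.Internet.Usage
cov_pair     <- 360.9919   # covariance of the two variables

# Pearson correlation from covariance and variances
r <- cov_pair / sqrt(var_time * var_internet)

print(round(r, 4))  # 0.5187
```

Unlike covariance, the result does not depend on the units of either variable, which is why the correlation matrix is easier to compare across pairs.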
# pair plot of variables with numeric data
#
pairs(num_dataset, # Data frame of variables
labels = colnames(num_dataset), # Variable names
pch = 21, # Pch symbol
main = "Advertisement dataset", # Title of the plot
row1attop = TRUE, # If FALSE, changes the direction of the diagonal
gap = 1, # Distance between subplots
cex.labels = NULL, # Size of the diagonal text
font.labels = 1) # Font style of the diagonal text
From the pair plot, daily time spent on the site, user age, daily internet usage and area income produced plots from which insights could be drawn. These shall be investigated further by performing multivariate analysis that factors in the target variable. No pattern was observable between word counter or year and the other numeric variables.
# Bar chart side by side of genders to know ratios of those that clicked the
# ads and those that did not (0 - female, 1- male)
#
ggplot(ad_dataset2, aes(x = Male, fill = Clicked.on.Ad)) +
geom_bar(position = position_dodge()) +
theme_classic()
From the bar plot above more women who visit the site click the ad compared to those who don’t. The opposite is true for men.
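Side-by-side bars compare counts, which can mislead when the groups differ in size. A complementary view is the click rate within each group. The sketch below uses base R on a small made-up sample (the values are not from the dataset) and row-wise proportions from prop.table().

```r
# Made-up sample: gender flag (0 - female, 1 - male) and click outcome
male    <- factor(c(0, 0, 0, 1, 1, 1, 1, 0))
clicked <- factor(c(1, 1, 0, 0, 1, 0, 0, 1))

# Cross-tabulate, then convert each row to proportions
counts <- table(Male = male, Clicked = clicked)
rates  <- prop.table(counts, margin = 1)

print(counts)
print(rates)
```

The same two lines applied to any of the categorical features (continent, hour, month, weekday) would give click rates that are directly comparable between groups.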
# Bar chart side by side of continents comparing ad clicks
#
ggplot(ad_dataset2, aes(x = continent, fill = Clicked.on.Ad)) +
geom_bar(position = position_dodge()) +
theme_classic()
For all continents aside from Antarctica and Asia, more people who visited the site clicked on the ad compared to those who did not.
# Bar chart side by side of hour comparing ad clicks
#
ggplot(ad_dataset2, aes(x = hour, fill = Clicked.on.Ad)) +
geom_bar(position = position_dodge()) +
theme_classic()
At 0200hrs, 0500hrs, 0700hrs, 1000hrs, 1200hrs, 1600hrs, 2100hrs, 2200hrs and 2300hrs a lot more of the individuals that visited the site did not click on the ad compared to those that did
At 0000hrs, 0300hrs, 0900hrs, 1100hrs, 1700hrs, and 1800hrs more of the individuals that visited the site clicked on the ad compared to those that didn’t.
# Bar chart side by side of month comparing ad clicks
#
ggplot(ad_dataset2, aes(x = month, fill = Clicked.on.Ad)) +
geom_bar(position = position_dodge()) +
theme_classic()
January, March, July and December were the months in which the ads that were not clicked exceeded the ads that were clicked.
# Bar chart side by side of weekday comparing ad clicks
#
ggplot(ad_dataset2, aes(x = weekday, fill = Clicked.on.Ad)) +
geom_bar(position = position_dodge()) +
theme_classic()
Friday and Tuesday were the days of the week on which more ads were viewed without being clicked than were clicked.
Here the relationships between the feature variables shall be explored further.
# Scatter plot of daily time spent on site vs daily internet usage
#
ggplot(ad_dataset, aes(Daily.Time.Spent.on.Site, Daily.Internet.Usage,
color = Clicked.on.Ad)) + geom_point() + scale_color_paletteer_d("nord::aurora")
From the scatter plot above, individuals who have lower daily internet usage and spend less time on the site are more likely to click on the ad.
# scatter plot of daily time spent on the site vs age
#
ggplot(ad_dataset, aes(Daily.Time.Spent.on.Site, Age,
color = Clicked.on.Ad)) + geom_point() + scale_color_paletteer_d("nord::aurora")
From the scatter plot above, older individuals who spend less time on the site are more likely to click on the ad.
# Scatter plot of daily internet usage versus age
#
ggplot(ad_dataset, aes(Daily.Internet.Usage, Age,
color = Clicked.on.Ad)) + geom_point() + scale_color_paletteer_d("nord::aurora")
From the scatter plot above, older individuals with low daily internet usage are most likely to click on an ad
# Scatter plot of area income vs age
#
ggplot(ad_dataset, aes(Area.Income, Age,
color = Clicked.on.Ad)) + geom_point() + scale_color_paletteer_d("nord::aurora")
From the scatter plot above, younger individuals (below 35 years) from areas with high incomes are the least likely to click on an ad.
# scatter plot of daily internet usage vs area income
#
ggplot(ad_dataset, aes(Area.Income, Daily.Internet.Usage,
color = Clicked.on.Ad)) + geom_point() + scale_color_paletteer_d("nord::aurora")
From the scatter plot, individuals with high daily internet usage from areas with high incomes are the least likely to click on an advertisement.
To predict whether a site visitor clicked the ad or not, we shall implement a decision tree model, since our data set has few variables and the numeric variables do not have a normal distribution. The choice of this model is also driven by the desire to obtain interpretable results.
Data splitting is the process of dividing the data into a training set and a testing set. The training set is used to build the decision tree model and the testing set is used to validate the model's performance. The splitting is performed in the code snippet below:
# Moving the target variable to the start of the data set (first column)
#
ad_dataset2 <- ad_dataset2 %>% relocate(Clicked.on.Ad, .before= Daily.Time.Spent.on.Site)
# Splitting the data
#
set.seed(12345)
# 80% of the data shall be used to train model
#
train <- sample(1:nrow(ad_dataset2),size = ceiling(0.80*nrow(ad_dataset2)),replace = FALSE)
# training set
#
ad_train <- ad_dataset2[train,]
# test set
#
ad_test <- ad_dataset2[-train,]
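The split above samples rows at random, which for a balanced target usually keeps the class proportions close to equal in both sets. Where that needs to be guaranteed, a stratified split samples within each class. The sketch below is one way to do this in base R on a synthetic target (the vector target is a stand-in for Clicked.on.Ad, not the real column).

```r
set.seed(12345)

# Synthetic target with a 50/50 class balance
target <- factor(rep(c(0, 1), each = 50))

# Sample 80% of the row indices within each class
by_class  <- split(seq_along(target), target)
train_idx <- unlist(lapply(by_class, function(idx) {
  sample(idx, size = ceiling(0.8 * length(idx)))
}))
test_idx <- setdiff(seq_along(target), train_idx)

print(table(target[train_idx]))
print(table(target[test_idx]))
```

Both sets then carry exactly the same class ratio as the full data, which makes the test accuracy easier to compare against the no-information rate.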
In this stage, we’re going to build a Decision Tree by using the rpart (Recursive Partitioning And Regression Trees) algorithm:
# building the classification tree with rpart
tree <- rpart(Clicked.on.Ad ~ .,
data=ad_train,
method = "class")
In this step, we’ll be using the rpart.plot library to plot our final Decision Tree:
# Visualize the decision tree with rpart.plot
rpart.plot(tree, nn=TRUE)
From the decision tree plot, the most important feature in determining whether a visitor to the site clicks the ad is the individual’s daily internet usage.
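Feature importance can also be read from the fitted object rather than from the plot alone: an rpart fit with at least one split carries a `variable.importance` component. The sketch below illustrates this on hypothetical toy data in which clicks depend only on daily internet usage; the data and the names `toy`, `toy_tree` and `importance` are assumptions for illustration and do not reproduce the study's tree.

```r
library(rpart)

# Hypothetical toy data: clicks are driven by low daily internet usage, Age is noise
set.seed(1)
n <- 400
toy <- data.frame(
  Daily.Internet.Usage = runif(n, 100, 270),
  Age = sample(19:61, n, replace = TRUE)
)
toy$Clicked.on.Ad <- factor(ifelse(toy$Daily.Internet.Usage < 180, 1, 0))

toy_tree <- rpart(Clicked.on.Ad ~ ., data = toy, method = "class")

# Named numeric vector, sorted from most to least important
importance <- toy_tree$variable.importance
importance
```

Applied to the real model, the same component would give a numeric check of what the plot suggests.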
To test the decision tree model, we apply it to the testing data set, dropping the first column (the target variable):
#Testing the model
pred <- predict(object=tree,ad_test[-1],type="class")
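Since the goal is to find the visitors most likely to click, class probabilities can be more useful than hard class labels. `predict()` on an rpart classification tree accepts `type = "prob"` for this. The sketch below uses hypothetical toy data and hypothetical new visitors (`toy`, `new_visitors` and `p_click` are illustrative names), so it shows the mechanics only:

```r
library(rpart)

# Hypothetical toy data standing in for the training set
set.seed(42)
n <- 300
toy <- data.frame(
  Daily.Internet.Usage = runif(n, 100, 270),
  Age = sample(19:61, n, replace = TRUE)
)
toy$Clicked.on.Ad <- factor(ifelse(toy$Daily.Internet.Usage < 180, 1, 0))

toy_tree <- rpart(Clicked.on.Ad ~ ., data = toy, method = "class")

# type = "prob" returns a matrix with one probability column per class level
new_visitors <- data.frame(Daily.Internet.Usage = c(120, 250), Age = c(45, 25))
probs <- predict(toy_tree, newdata = new_visitors, type = "prob")

# Column "1" holds the estimated probability of a click; rank visitors by it
new_visitors$p_click <- probs[, "1"]
new_visitors[order(-new_visitors$p_click), ]
```

Ranking by `p_click` would let the entrepreneur target the top of the list first.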
We’ll be using a confusion matrix to calculate the accuracy of the model. Here’s the code:
#Calculating accuracy
t <- table(ad_test$Clicked.on.Ad,pred)
confusionMatrix(t)
## Confusion Matrix and Statistics
##
## pred
## 0 1
## 0 92 7
## 1 7 94
##
## Accuracy : 0.93
## 95% CI : (0.8853, 0.9612)
## No Information Rate : 0.505
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.86
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9293
## Specificity : 0.9307
## Pos Pred Value : 0.9293
## Neg Pred Value : 0.9307
## Prevalence : 0.4950
## Detection Rate : 0.4600
## Detection Prevalence : 0.4950
## Balanced Accuracy : 0.9300
##
## 'Positive' Class : 0
##
The output shows that the model correctly classified 93% of the observations in the test data set (186 of 200), with a 95% confidence interval of (0.8853, 0.9612). The model can therefore distinguish visitors who clicked the ad from those who did not with high accuracy.
Interpretation of the confusion matrix (rows are actual classes, columns are predictions, and the positive class is 0, not clicked):
TP - 92 ads were correctly classified as not clicked
TN - 94 ads were correctly classified as clicked
FP - 7 ads were incorrectly classified as not clicked
FN - 7 ads were incorrectly classified as clicked
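The headline figures can be recomputed by hand from the four counts in the confusion matrix. The following sketch uses only base R; the counts are copied from the output above, and the variable names are illustrative.

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
cm <- matrix(c(92, 7,
               7, 94),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), pred = c("0", "1")))

accuracy <- sum(diag(cm)) / sum(cm)         # (92 + 94) / 200
recall_0 <- cm["0", "0"] / sum(cm["0", ])   # share of actual non-clicks predicted as non-clicks
recall_1 <- cm["1", "1"] / sum(cm["1", ])   # share of actual clicks predicted as clicks
balanced_accuracy <- (recall_0 + recall_1) / 2

round(c(accuracy = accuracy, recall_0 = recall_0,
        recall_1 = recall_1, balanced_accuracy = balanced_accuracy), 4)
```

Because this matrix is symmetric, the values agree with the reported accuracy, sensitivity, specificity and balanced accuracy whichever way the rows and columns are read.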
The decision tree model provides us with a very high accuracy with minimal pre-processing steps in a very short time.
McNemar’s test p-value of 1 is greater than 0.05, so the test is not statistically significant. McNemar’s test compares the two off-diagonal counts of the confusion matrix, so we have no evidence that the model makes one kind of error (7 false positives) more often than the other (7 false negatives).
We shall challenge this solution by employing a naive Bayes classifier model and comparing the performance metrics of the two models.
We shall attempt to correctly classify the ads as clicked or not clicked by using a naive Bayes classifier model. The model is selected because:
It is simple and easy to implement
It doesn’t require as much training data
It handles both continuous and discrete data
It is highly scalable with the number of predictors and data points
It is fast and can be used to make real-time predictions
It is not sensitive to irrelevant features
As before, the data is split into a training set and a testing set. The training set is used to fit the naive Bayes model and the testing set is used to evaluate how well the model performs on data it has not seen. The split is performed in the code snippet below:
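# Splitting the dataset, intending 80 percent of the data for training.
# Note: sample.split() is documented to take the vector of labels
# (e.g. ad_dataset2$Clicked.on.Ad) rather than a data frame; the test set
# produced here has 250 rows, not the 200 an 80/20 split of these data gives.
split <- sample.split(ad_dataset2, SplitRatio = 0.8)
# Train set
#
train_cl <- subset(ad_dataset2, split == "TRUE")
# test set
#
test_cl <- subset(ad_dataset2, split == "FALSE")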
# Feature Scaling
# (train_scale and test_scale are not used below; the model is fit on train_cl)
cols <- c("Daily.Time.Spent.on.Site", "Age", "Area.Income", "Daily.Internet.Usage", "word.Counter", "year")
train_scale <- scale(train_cl[cols])
test_scale <- scale(test_cl[cols])
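When scaled features are actually fed to a model, a common practice is to scale the test set with the means and standard deviations of the training set, so that both sets share one transformation. The sketch below shows one way to do this with base R's `scale()`, using hypothetical toy data (`toy_train`, `toy_test` and the derived names are illustrative assumptions):

```r
# Hypothetical toy split standing in for train_cl / test_cl
set.seed(7)
toy_train <- data.frame(Age = rnorm(80, 36, 9), Area.Income = rnorm(80, 55000, 13000))
toy_test  <- data.frame(Age = rnorm(20, 36, 9), Area.Income = rnorm(20, 55000, 13000))

# Centre and scale the training set, then recover the statistics scale() used
train_scaled <- scale(toy_train)
train_means  <- attr(train_scaled, "scaled:center")
train_sds    <- attr(train_scaled, "scaled:scale")

# Apply the training means and standard deviations to the test set
test_scaled <- scale(toy_test, center = train_means, scale = train_sds)

# Training columns now have mean 0 and standard deviation 1
round(colMeans(train_scaled), 10)
apply(train_scaled, 2, sd)
```

For a naive Bayes model with Gaussian likelihoods, rescaling each numeric feature linearly should not change the predicted classes, so this step matters more for distance-based or regularised models.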
# Fitting Naive Bayes Model to training data set
#
set.seed(120) # Setting Seed
classifier_cl <- naiveBayes(Clicked.on.Ad ~ ., data = train_cl)
classifier_cl
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.4946667 0.5053333
##
## Conditional probabilities:
## Daily.Time.Spent.on.Site
## Y [,1] [,2]
## 0 77.07520 7.694517
## 1 52.92715 12.726188
##
## Age
## Y [,1] [,2]
## 0 31.89757 6.146875
## 1 40.52770 8.779856
##
## Area.Income
## Y [,1] [,2]
## 0 61779.52 8769.018
## 1 48237.24 14396.965
##
## Daily.Internet.Usage
## Y [,1] [,2]
## 0 213.5033 23.69419
## 1 145.7495 29.89758
##
## Male
## Y 0 1
## 0 0.4797844 0.5202156
## 1 0.5461741 0.4538259
##
## word.Counter
## Y [,1] [,2]
## 0 3.989218 0.9211580
## 1 3.960422 0.8300034
##
## continent
## Y Africa Americas Antarctica Asia Europe Oceania
## 0 0.20215633 0.22371968 0.03234501 0.24528302 0.19137466 0.10512129
## 1 0.21635884 0.21899736 0.03693931 0.20844327 0.21635884 0.10290237
##
## year
## Y [,1] [,2]
## 0 2015.997 0.05191741
## 1 2016.000 0.00000000
##
## month
## Y 01 02 03 04 05 06
## 0 0.153638814 0.142857143 0.180592992 0.142857143 0.145552561 0.121293801
## 1 0.137203166 0.166226913 0.155672823 0.147757256 0.160949868 0.139841689
## month
## Y 07 12
## 0 0.110512129 0.002695418
## 1 0.092348285 0.000000000
##
## weekday
## Y Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 0 0.1698113 0.1401617 0.1293801 0.1671159 0.1212938 0.1239892 0.1482480
## 1 0.1398417 0.1451187 0.1160950 0.1503958 0.1688654 0.1187335 0.1609499
##
## hour
## Y 00 01 02 03 04 05
## 0 0.04582210 0.02425876 0.04043127 0.04312668 0.03504043 0.05390836
## 1 0.05277045 0.03166227 0.02638522 0.05013193 0.03957784 0.04485488
## hour
## Y 06 07 08 09 10 11
## 0 0.03773585 0.05121294 0.04582210 0.03773585 0.04312668 0.03234501
## 1 0.05804749 0.06332454 0.03957784 0.06068602 0.02374670 0.04749340
## hour
## Y 12 13 14 15 16 17
## 0 0.03773585 0.03773585 0.04043127 0.03504043 0.03234501 0.02695418
## 1 0.03166227 0.02902375 0.04485488 0.03166227 0.03430079 0.04485488
## hour
## Y 18 19 20 21 22 23
## 0 0.03773585 0.04043127 0.05121294 0.06738544 0.05121294 0.05121294
## 1 0.04749340 0.03430079 0.05013193 0.04749340 0.03693931 0.02902375
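In the output above, each numeric predictor is summarised per class by its mean (column `[,1]`) and standard deviation (column `[,2]`), while categorical predictors are summarised by conditional proportions. To illustrate how these numbers turn into a prediction, the sketch below computes the posterior class probabilities from a single predictor, Daily.Internet.Usage, using the priors and the mean and standard deviation printed above. The visitor value of 150 is hypothetical, and the full model multiplies in the likelihoods of all the other predictors as well.

```r
# Values copied from the printed model: class priors, then the mean and
# standard deviation of Daily.Internet.Usage within each class
prior      <- c("0" = 0.4946667, "1" = 0.5053333)
usage_mean <- c("0" = 213.5033,  "1" = 145.7495)
usage_sd   <- c("0" = 23.69419,  "1" = 29.89758)

# Hypothetical visitor with a daily internet usage of 150
x <- 150

# Gaussian likelihood of x under each class, weighted by the class prior
joint <- prior * dnorm(x, mean = usage_mean, sd = usage_sd)

# Normalise so that the two posterior probabilities sum to 1
posterior <- joint / sum(joint)
round(posterior, 3)
```

A usage of 150 lies close to the class 1 mean and far from the class 0 mean, so on this predictor alone the posterior strongly favours a click.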
# Predicting on test data
#
y_pred <- predict(classifier_cl, newdata = test_cl)
We’ll be using a confusion matrix to calculate the accuracy of the model. Here’s the code:
#Calculating accuracy
t <- table(test_cl$Clicked.on.Ad, y_pred)
confusionMatrix(t)
## Confusion Matrix and Statistics
##
## y_pred
## 0 1
## 0 106 23
## 1 3 118
##
## Accuracy : 0.896
## 95% CI : (0.8513, 0.9309)
## No Information Rate : 0.564
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7928
##
## Mcnemar's Test P-Value : 0.0001944
##
## Sensitivity : 0.9725
## Specificity : 0.8369
## Pos Pred Value : 0.8217
## Neg Pred Value : 0.9752
## Prevalence : 0.4360
## Detection Rate : 0.4240
## Detection Prevalence : 0.5160
## Balanced Accuracy : 0.9047
##
## 'Positive' Class : 0
##
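The output shows that the model correctly classified 89.6% of the observations in the test data set (224 of 250), with a 95% confidence interval of (0.8513, 0.9309). The model can therefore distinguish visitors who clicked the ad from those who did not reasonably well.
Interpretation of the confusion matrix (rows are actual classes, columns are predictions, and the positive class is 0, not clicked):
TP - 106 ads were correctly classified as not clicked
TN - 118 ads were correctly classified as clicked
FP - 3 ads that were clicked were incorrectly classified as not clicked
FN - 23 ads that were not clicked were incorrectly classified as clicked
Note that confusionMatrix() treats the rows of a table as predictions, whereas the table here has the actual classes in its rows. The reported Sensitivity (0.9725 = 106/109) is therefore the precision for class 0, and the reported Pos Pred Value (0.8217 = 106/129) is its recall.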
McNemar’s test p-value for this model (0.0001944) is less than 0.05, so the test is statistically significant. The two kinds of error are not balanced: the model misclassifies non-clicks as clicks (23 cases) far more often than clicks as non-clicks (3 cases).
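The McNemar p-values for both models can be reproduced from the confusion matrices alone with `mcnemar.test()` from base R's stats package, which makes it clear that the test depends only on the two off-diagonal counts. The matrices below are copied from the two outputs above; the variable names are illustrative.

```r
# Confusion matrices copied from the decision tree and naive Bayes outputs
tree_cm <- matrix(c(92, 7,
                    7, 94), nrow = 2, byrow = TRUE)
nb_cm   <- matrix(c(106, 23,
                    3, 118), nrow = 2, byrow = TRUE)

# Only the off-diagonal cells (7 vs 7, and 23 vs 3) enter the test
tree_p <- mcnemar.test(tree_cm)$p.value
nb_p   <- mcnemar.test(nb_cm)$p.value

# Continuity-corrected statistic for the naive Bayes matrix, computed by hand
nb_stat <- (abs(23 - 3) - 1)^2 / (23 + 3)

c(tree_p = tree_p, nb_p = nb_p, nb_stat = nb_stat)
```

Equal off-diagonal counts give a p-value of 1, while the 23 versus 3 imbalance gives the small p-value reported for the naive Bayes model.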
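Compared with the decision tree model, the naive Bayes model performs worse on most metrics: accuracy (0.896 against 0.93), kappa (0.7928 against 0.86) and balanced accuracy (0.9047 against 0.93). The comparison is indicative rather than exact, since the two models were evaluated on different test sets (250 and 200 observations).
Hence, for this classification problem, the decision tree model is the better choice for helping the entrepreneur determine which visitors are most likely to click on her ads.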
It can be concluded that the decision tree model is the better of the two models for predicting whether a site visitor clicks an ad or not. On the test data set it predicted correctly whether an individual clicked the ad 93% of the time.
The most important feature in determining whether a visitor to the site clicks the ad is the individual’s daily internet usage.
The entrepreneur is hence advised to incorporate the decision tree model in order to improve the targeting of her advertisements.
For this study, the data provided relevant information and was sufficient to meet the objectives set by the entrepreneur.
Yes. Developing a machine learning algorithm to help her determine who is most likely to click on her ads will improve her customer targeting and increase her returns in the long run.