1. Research question

A Kenyan entrepreneur has created an online cryptography course and wants to advertise it on her blog. She currently targets audiences from various countries. In the past, she ran ads for a related course on the same blog and collected data in the process. She would now like to employ your services as a Data Science Consultant to help her identify which individuals are most likely to click on her ads.

2. Success criteria

The individuals most likely to click on her advertisements are correctly identified.

3. Research Methodology

Defining the research questions and work plan
Loading the dataset
Previewing the dataset
Cleaning the dataset, which entails dealing with outliers, duplicates and missing values appropriately
Feature engineering
Performing univariate, bivariate and multivariate analysis on the dataset
Building a supervised learning model
Challenging the solution
Concluding based on the findings of the research
Providing recommendations based on the conclusions arrived at
Further questions

4. Understanding the data provided

The study uses an advertising dataset that contains a total of 10 features.

Age - The age of the individual that clicked the ad
Daily Time Spent on Site - The average time an individual spends on the site
Area Income - The average income of the area from which the ad was clicked
Daily Internet Usage - The daily internet usage information for the area in which the ad was clicked
Ad Topic Line - The topic line of the advertisement
City - The city from where the ad was clicked
Male - The gender of the individual that clicked the ad (0 - Female, 1 - Male)
Country - The country from which the ad was clicked
Timestamp - The time that the ad was clicked
Clicked on Ad - Indicates whether the individual clicked on the ad or not (0 - Did not click on the ad, 1 - Clicked on the ad)

Loading libraries

# Loading the relevant libraries for this study
library(stringr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(reshape2)
library(ggplot2)
library(countrycode)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✔ tibble  3.1.7     ✔ purrr   0.3.4
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ✔ readr   2.1.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(moments)
library(paletteer)
library(rpart,quietly = TRUE)
library(caret,quietly = TRUE)

## 
## Attaching package: 'caret'

## The following object is masked from 'package:survival':
## 
##     cluster

## The following object is masked from 'package:purrr':
## 
##     lift

library(rpart.plot,quietly = TRUE)
library(rattle)

## Loading required package: bitops

## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

library(e1071)

## 
## Attaching package: 'e1071'

## The following objects are masked from 'package:moments':
## 
##     kurtosis, moment, skewness

## The following object is masked from 'package:Hmisc':
## 
##     impute

library(caTools)

Loading the dataset

# Reading the advertisement dataset
#
ad_dataset <- read.csv("http://bit.ly/IPAdvertisingData")

Previewing the dataset

Here the structure and shape of the dataset and the data types of its attributes are investigated.

# Previewing the first six records of dataset
head(ad_dataset)

##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1                    68.95  35    61833.90               256.09
## 2                    80.23  31    68441.85               193.77
## 3                    69.47  26    59785.94               236.50
## 4                    74.15  29    54806.18               245.89
## 5                    68.37  35    73889.99               225.58
## 6                    59.99  23    59761.56               226.74
##                           Ad.Topic.Line           City Male    Country
## 1    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2    Monitored national standardization      West Jodi    1      Nauru
## 3      Organic bottom-line service-desk       Davidton    0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5         Robust logistical utilization   South Manuel    0    Iceland
## 6       Sharable client-driven software      Jamieberg    1     Norway
##             Timestamp Clicked.on.Ad
## 1 2016-03-27 00:53:11             0
## 2 2016-04-04 01:39:02             0
## 3 2016-03-13 20:35:42             0
## 4 2016-01-10 02:31:19             0
## 5 2016-06-03 03:36:18             0
## 6 2016-05-19 14:30:17             0

# view the number of rows and columns in the dataset
#
dim(ad_dataset)

## [1] 1000   10

The data set has a total of 1000 records and 10 attributes/columns.

# Previewing the structure of the ad dataset
#
str(ad_dataset)

## 'data.frame':    1000 obs. of  10 variables:
##  $ Daily.Time.Spent.on.Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area.Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily.Internet.Usage    : num  256 194 236 246 226 ...
##  $ Ad.Topic.Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : chr  "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
##  $ Male                    : int  0 1 0 1 0 1 0 1 1 1 ...
##  $ Country                 : chr  "Tunisia" "Nauru" "San Marino" "Italy" ...
##  $ Timestamp               : chr  "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
##  $ Clicked.on.Ad           : int  0 0 0 0 0 0 0 1 0 0 ...

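There are three data types in the dataset: numeric (num), integer (int) and character (chr). Most attributes have appropriate data types. The exceptions are Male and Clicked.on.Ad, which are stored as integers but take only two values (0 or 1) and are really factors; Country and City, which are stored as characters but are categorical; and Timestamp, which is stored as a character rather than a date-time.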

# Establish the data set class
#
class(ad_dataset)

## [1] "data.frame"

The advertisement data set is a data frame

Cleaning Dataset

Data Validity

Four features are categorical (City, Country, Male and Clicked.on.Ad) but are stored as character or integer. They shall be converted to factors, and Timestamp shall be converted to a date-time object.

# Converting the attribute Male from integer to factor
#
ad_dataset$Male <- as.factor(ad_dataset$Male)

# Converting the attribute Clicked.on.Ad from integer to factor
#
ad_dataset$Clicked.on.Ad <- as.factor(ad_dataset$Clicked.on.Ad)

# Converting the attribute Country from character to factor
#
ad_dataset$Country <- as.factor(ad_dataset$Country)

# Converting the attribute City from character to factor
#
ad_dataset$City <- as.factor(ad_dataset$City)

# converting to datetime object
#
ad_dataset[['Timestamp']] <- as.POSIXct(ad_dataset[['Timestamp']],
format = "%Y-%m-%d %H:%M:%S")

# Check the structure after reassigning the data types
str(ad_dataset)

## 'data.frame':    1000 obs. of  10 variables:
##  $ Daily.Time.Spent.on.Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area.Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily.Internet.Usage    : num  256 194 236 246 226 ...
##  $ Ad.Topic.Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : Factor w/ 969 levels "Adamsbury","Adamside",..: 962 904 112 940 806 283 47 672 885 713 ...
##  $ Male                    : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 2 2 ...
##  $ Country                 : Factor w/ 237 levels "Afghanistan",..: 216 148 185 104 97 159 146 13 83 79 ...
##  $ Timestamp               : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
##  $ Clicked.on.Ad           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...

All columns now have appropriate data types: numeric, integer, factor, character and POSIXct (date-time).
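The four separate factor conversions above can also be written as one step. The following is a minimal sketch on a made-up data frame (`toy` and its values are illustrative assumptions, not the study data), using the same `lapply` pattern that appears later in this report:

```r
# Toy data frame standing in for ad_dataset (values are made up)
toy <- data.frame(
  Male          = c(0L, 1L, 0L),
  Clicked.on.Ad = c(0L, 0L, 1L),
  Country       = c("Tunisia", "Nauru", "Italy"),
  Age           = c(35L, 31L, 29L),
  stringsAsFactors = FALSE
)

# Columns that should be treated as categorical
factor_cols <- c("Male", "Clicked.on.Ad", "Country")

# Convert all of them to factors in a single step
toy[factor_cols] <- lapply(toy[factor_cols], factor)

# Inspect the resulting classes; Age is left untouched
sapply(toy, class)
```

Listing the columns in one vector keeps the conversion and its description in sync, which avoids the kind of copy-paste slip that separate statements invite.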

Dealing with missing values

# Checking the number of missing values per column in the data set
#
colSums(is.na(ad_dataset))

## Daily.Time.Spent.on.Site                      Age              Area.Income 
##                        0                        0                        0 
##     Daily.Internet.Usage            Ad.Topic.Line                     City 
##                        0                        0                        0 
##                     Male                  Country                Timestamp 
##                        0                        0                        0 
##            Clicked.on.Ad 
##                        0

The dataset has no missing values in any of the attributes

Checking for duplicate records

# finding the duplicated rows in the data set and assign to a variable duplicated_rows below
#
duplicated_rows = ad_dataset[duplicated(ad_dataset),]

# Printing out the duplicated rows
duplicated_rows

##  [1] Daily.Time.Spent.on.Site Age                      Area.Income             
##  [4] Daily.Internet.Usage     Ad.Topic.Line            City                    
##  [7] Male                     Country                  Timestamp               
## [10] Clicked.on.Ad           
## <0 rows> (or 0-length row.names)

The advertisement data set has no duplicate records

Checking for outliers in the numeric data

# number of rows in data frame
#
num_rows = nrow(ad_dataset)
  
# creating ID column vector
#
ID <- c(1:num_rows)
 
# binding id column to the data frame
#
ad_dataset1 <- cbind(ID , ad_dataset)

# Getting the names of the numeric columns in the dataset
# as a character vector (named numeric_cols so that it does not
# shadow the base function colnames)
#
numeric_cols <- names(select_if(ad_dataset1, is.numeric))

# Print vector of column names
#
numeric_cols

## [1] "ID"                       "Daily.Time.Spent.on.Site"
## [3] "Age"                      "Area.Income"             
## [5] "Daily.Internet.Usage"

# creating the modified data frame
#
data_mod1 <- melt(ad_dataset1, id.vars='ID',
                  measure.vars=c("Area.Income"))

# creating a box plot of area income (one box per variable)
#
p <- ggplot(data_mod1) +
  geom_boxplot(aes(x = variable, y = value, color = variable))
  
# printing the plot
#
print(p)

# creating the modified data frame
#
data_mod2 <- melt(ad_dataset1, id.vars='ID',
                  measure.vars=c("Daily.Time.Spent.on.Site", "Age",
                                 "Daily.Internet.Usage" ))

# creating box plots of the three other numerical columns (one box per variable)
#
p <- ggplot(data_mod2) +
  geom_boxplot(aes(x = variable, y = value, color = variable))
  
# printing the plot
#
print(p)

Outliers were observed only in the area income attribute. This is expected given the great disparity in development and GDP levels across countries.
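The box plots flag outliers visually. As a complement, the flagged points can be counted numerically. The sketch below uses base R's `boxplot.stats()`, which applies the same 1.5 × IQR rule as a box plot; the helper `count_outliers` and the sample values are illustrative assumptions, not part of the original study.

```r
# Made-up income values; the last one is deliberately extreme
income <- c(52000, 55000, 57000, 58000, 60000, 61000, 63000, 65000, 14000)

# boxplot.stats() returns the points beyond the whiskers in $out
flagged <- boxplot.stats(income)$out
flagged

# Hypothetical helper: number of flagged points per numeric column
count_outliers <- function(df) {
  sapply(Filter(is.numeric, df),
         function(col) length(boxplot.stats(col)$out))
}

toy <- data.frame(
  income = income,
  age    = c(25, 30, 31, 33, 35, 36, 38, 40, 41)
)
outlier_counts <- count_outliers(toy)
outlier_counts
```

Applied to the study data, a call such as `count_outliers(ad_dataset)` would give a per-column count to set beside the plots.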

Feature engineering

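# Creating a new column that counts the number of words per ad topic line
# (the pattern counts runs of word characters, so a hyphenated term such as
# "bottom-line" counts as two words)
#
ad_dataset <- ad_dataset %>%
  mutate(word.Counter = str_count(Ad.Topic.Line, pattern = "\\w+"))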
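
How the word count treats hyphens affects the resulting feature. The sketch below reproduces the counting rule in base R on two topic lines taken from the preview above, and contrasts it with a whitespace-based count; the alternative is shown only for comparison and is not what the study uses.

```r
topic_lines <- c("Cloned 5thgeneration orchestration",
                 "Organic bottom-line service-desk")

# Count maximal runs of word characters; a hyphen is not a word character,
# so "bottom-line" contributes two runs
word_runs <- lengths(regmatches(topic_lines, gregexpr("\\w+", topic_lines)))
word_runs

# Alternative for comparison: count whitespace-separated tokens,
# which treats each hyphenated term as a single word
tokens <- lengths(strsplit(topic_lines, "\\s+"))
tokens
```

The first count matches the word.Counter values shown for these two records later in the report (3 and 5), while the whitespace-based count would give 3 for both.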

# Grouping the countries according to continent
#
ad_dataset$continent <- countrycode(sourcevar = ad_dataset[, "Country"],
                                    origin = "country.name", 
                                    destination = "continent")

## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Antarctica (the territory South of 60 deg S), Bouvet Island (Bouvetoya), British Indian Ocean Territory (Chagos Archipelago), French Southern Territories, Heard Island and McDonald Islands, Micronesia, Saint Martin, South Georgia and the South Sandwich Islands, United States Minor Outlying Islands

# Getting unique values in continent feature
#
unique(ad_dataset$continent)

## [1] "Africa"   "Oceania"  "Europe"   "Asia"     "Americas" NA

# Finding out the total number of missing values in the ad_dataset
#
sum(is.na(ad_dataset))

## [1] 35

There are 35 missing values in the ad dataset.

# Isolating the records with null values in the ad data set to investigate
# them further
#
test <- 
  ad_dataset %>% 
  filter(is.na(continent))

# previewing first six records of the test data set
#
head(test)

##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1                    54.70  36    31087.54               118.39
## 2                    76.02  22    46179.97               209.82
## 3                    50.33  50    62657.53               133.20
## 4                    46.13  31    60248.97               139.01
## 5                    70.79  31    74535.94               184.10
## 6                    43.67  31    25686.34               166.29
##                                  Ad.Topic.Line           City Male
## 1 Grass-roots solution-oriented conglomeration    Jessicastad    1
## 2      Business-focused value-added definition   West Guybury    0
## 3                  Sharable analyzing alliance South Lauraton    1
## 4        Customer-focused optimizing moderator     Davidmouth    0
## 5           Distributed tertiary system engine      Sharpberg    0
## 6               Automated directional function    New Theresa    1
##                                               Country           Timestamp
## 1 British Indian Ocean Territory (Chagos Archipelago) 2016-02-13 07:53:55
## 2                           Bouvet Island (Bouvetoya) 2016-01-27 12:38:16
## 3                                          Micronesia 2016-03-02 04:57:51
## 4                           Bouvet Island (Bouvetoya) 2016-02-01 09:00:55
## 5                           Bouvet Island (Bouvetoya) 2016-03-15 15:49:14
## 6        Antarctica (the territory South of 60 deg S) 2016-02-28 06:41:44
##   Clicked.on.Ad word.Counter continent
## 1             1            5      <NA>
## 2             0            5      <NA>
## 3             1            3      <NA>
## 4             1            4      <NA>
## 5             0            4      <NA>
## 6             1            3      <NA>

All missing values occur in the continent column. They belong to regions that the countrycode package could not assign to one of its five continents. These regions shall be explored further for appropriate classification.

# Previewing the unique regions in the test data set that have missing
# continent data
#
unique(test$Country)

## [1] British Indian Ocean Territory (Chagos Archipelago)
## [2] Bouvet Island (Bouvetoya)                          
## [3] Micronesia                                         
## [4] Antarctica (the territory South of 60 deg S)       
## [5] Saint Martin                                       
## [6] United States Minor Outlying Islands               
## [7] French Southern Territories                        
## [8] Heard Island and McDonald Islands                  
## [9] South Georgia and the South Sandwich Islands       
## 237 Levels: Afghanistan Albania Algeria American Samoa Andorra ... Zimbabwe

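Several of these regions (Antarctica, Bouvet Island, Heard Island and McDonald Islands, South Georgia and the South Sandwich Islands, and the French Southern Territories) are Antarctic or sub-Antarctic, and Antarctica is not one of the continents returned by the countrycode package. Others, such as Micronesia (Oceania) and Saint Martin (the Caribbean), are not, but for simplicity all unmatched values shall be assigned to an Antarctica group.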

# replacing NA with Antarctica 
#
ad_dataset$continent[is.na(ad_dataset$continent)] <- "Antarctica"

# Preview unique continent values
#
unique(ad_dataset$continent)

## [1] "Africa"     "Oceania"    "Europe"     "Asia"       "Americas"  
## [6] "Antarctica"

The missing continent values have been successfully replaced.
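A finer-grained alternative to a blanket replacement is a small lookup table that fills only the unmatched countries. The sketch below is illustrative: the `manual_continent` table and the toy data frame are assumptions, and only three of the nine unmatched regions are mapped, so the rest would still need a decision.

```r
# Hypothetical manual overrides; extend or change to suit the analysis
manual_continent <- c(
  "Micronesia"   = "Oceania",
  "Saint Martin" = "Americas",
  "Antarctica (the territory South of 60 deg S)" = "Antarctica"
)

# Toy input: country names plus the continent already assigned (NA = unmatched)
toy <- data.frame(
  Country   = c("Tunisia", "Micronesia", "Saint Martin",
                "Antarctica (the territory South of 60 deg S)"),
  continent = c("Africa", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Fill only the missing entries by looking each country up in the table
is_missing <- is.na(toy$continent)
toy$continent[is_missing] <- unname(manual_continent[toy$Country[is_missing]])
toy
```

Countries absent from the lookup table stay NA, which keeps unresolved cases visible instead of silently grouping them.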

# Extracting the year from the time stamp
# (note: as.Date() on a POSIXct value converts to UTC by default, so a
# timestamp close to midnight can land on the neighbouring date;
# format(ad_dataset$Timestamp, "%Y") would keep the local date)
#
ad_dataset$year <- format(as.Date(ad_dataset$Timestamp), "%Y")

# Extracting the month from the time stamp
#
ad_dataset$month <- format(as.Date(ad_dataset$Timestamp), "%m")

# Extracting the day of the week from the time stamp
#
ad_dataset$weekday <- weekdays(ad_dataset$Timestamp)

# Extracting the hour of day from the time stamp
#
ad_dataset$hour <- format(ad_dataset$Timestamp, "%H")
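
Going through `as.Date()` to extract the year and month can shift dates when the session time zone is not UTC. The sketch below illustrates the effect with an assumed time zone (Africa/Nairobi, UTC+3) and a made-up timestamp; the time zone in which the original analysis ran is not stated, so this is a possible explanation for unexpected year or month values, not a confirmed one.

```r
# A timestamp shortly after midnight on New Year's Day, in an assumed
# local time zone (Africa/Nairobi, UTC+3)
stamp <- as.POSIXct("2016-01-01 01:30:00", tz = "Africa/Nairobi")

# Converting to a Date in UTC moves the instant back to the previous day
via_date <- format(as.Date(stamp, tz = "UTC"), "%Y-%m")
via_date

# Formatting the POSIXct value directly keeps the local calendar date
direct <- format(stamp, "%Y-%m")
direct
```

Formatting the timestamp directly, as is done for the hour, avoids the intermediate conversion.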

# Checking data types of newly created columns
#
str(ad_dataset)

## 'data.frame':    1000 obs. of  16 variables:
##  $ Daily.Time.Spent.on.Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area.Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily.Internet.Usage    : num  256 194 236 246 226 ...
##  $ Ad.Topic.Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : Factor w/ 969 levels "Adamsbury","Adamside",..: 962 904 112 940 806 283 47 672 885 713 ...
##  $ Male                    : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 2 2 ...
##  $ Country                 : Factor w/ 237 levels "Afghanistan",..: 216 148 185 104 97 159 146 13 83 79 ...
##  $ Timestamp               : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
##  $ Clicked.on.Ad           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ word.Counter            : int  3 3 5 5 3 4 3 3 3 3 ...
##  $ continent               : chr  "Africa" "Oceania" "Europe" "Europe" ...
##  $ year                    : chr  "2016" "2016" "2016" "2016" ...
##  $ month                   : chr  "03" "04" "03" "01" ...
##  $ weekday                 : chr  "Sunday" "Monday" "Sunday" "Sunday" ...
##  $ hour                    : chr  "00" "01" "20" "02" ...

The year, month, weekday, hour and continent columns have inappropriate data types. The year shall be converted to integer while the remaining four shall be converted to factors.

# Converting the attribute year from character to integer
#
ad_dataset$year <- as.integer(ad_dataset$year)

# Converting continent, hour, month, and weekday from character data type
# to factor data type
#
cols <- c("hour", "weekday", "continent", "month")

# Applying factor conversion
#
ad_dataset[cols] <- lapply(ad_dataset[cols], factor)

# Checking the datatypes of the different columns after reassigning
# The datatypes
#
str(ad_dataset)

## 'data.frame':    1000 obs. of  16 variables:
##  $ Daily.Time.Spent.on.Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area.Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily.Internet.Usage    : num  256 194 236 246 226 ...
##  $ Ad.Topic.Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : Factor w/ 969 levels "Adamsbury","Adamside",..: 962 904 112 940 806 283 47 672 885 713 ...
##  $ Male                    : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 2 2 ...
##  $ Country                 : Factor w/ 237 levels "Afghanistan",..: 216 148 185 104 97 159 146 13 83 79 ...
##  $ Timestamp               : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
##  $ Clicked.on.Ad           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ word.Counter            : int  3 3 5 5 3 4 3 3 3 3 ...
##  $ continent               : Factor w/ 6 levels "Africa","Americas",..: 1 6 5 5 5 5 4 6 2 1 ...
##  $ year                    : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ month                   : Factor w/ 8 levels "01","02","03",..: 3 4 3 1 6 5 1 3 4 7 ...
##  $ weekday                 : Factor w/ 7 levels "Friday","Monday",..: 4 2 4 4 1 5 5 2 2 2 ...
##  $ hour                    : Factor w/ 24 levels "00","01","02",..: 1 2 21 3 4 15 21 2 10 2 ...

Two columns are to be dropped (Ad.Topic.Line and Country):
• The ad topic line is dropped since, with the word counter column in place, it no longer adds value to the study
• The country column is dropped since the continent column makes it redundant

Univariate analysis

Non-numeric data

# Selecting non numeric columns in the ad data set
#
non_num <- ad_dataset %>% select_if(negate(is.numeric))

# Previewing first six records of non_numeric columns in data frame
#
head(non_num)

##                           Ad.Topic.Line           City Male    Country
## 1    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2    Monitored national standardization      West Jodi    1      Nauru
## 3      Organic bottom-line service-desk       Davidton    0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5         Robust logistical utilization   South Manuel    0    Iceland
## 6       Sharable client-driven software      Jamieberg    1     Norway
##             Timestamp Clicked.on.Ad continent month  weekday hour
## 1 2016-03-27 00:53:11             0    Africa    03   Sunday   00
## 2 2016-04-04 01:39:02             0   Oceania    04   Monday   01
## 3 2016-03-13 20:35:42             0    Europe    03   Sunday   20
## 4 2016-01-10 02:31:19             0    Europe    01   Sunday   02
## 5 2016-06-03 03:36:18             0    Europe    06   Friday   03
## 6 2016-05-19 14:30:17             0    Europe    05 Thursday   14

# Finding unique values of the non_numeric columns
#
rapply(non_num,function(x)length(unique(x)))

## Ad.Topic.Line          City          Male       Country     Timestamp 
##          1000           969             2           237          1000 
## Clicked.on.Ad     continent         month       weekday          hour 
##             2             6             8             7            24

From the summary above, the ad topic line has 1000 unique values, equal to the number of records in the dataset, which makes it hard to draw any insight from this feature. With the word count column already derived from it, the column is redundant and shall be dropped.

The city feature likewise has 969 unique values across 1000 records. Insights drawn from it would not be helpful, hence the city column shall also be dropped.

The target variable has two unique values.

# Dropping the country, time stamp,city and ad topic line columns
#
ad_dataset2 <- subset(ad_dataset, select = -c(City, Ad.Topic.Line, 
                                               Country, Timestamp))

# Confirming to see whether the columns have been dropped
#
colnames(ad_dataset2)

##  [1] "Daily.Time.Spent.on.Site" "Age"                     
##  [3] "Area.Income"              "Daily.Internet.Usage"    
##  [5] "Male"                     "Clicked.on.Ad"           
##  [7] "word.Counter"             "continent"               
##  [9] "year"                     "month"                   
## [11] "weekday"                  "hour"

The city, country, time stamp and ad topic line columns have been successfully dropped.

Numerical Data

# Creating data set with numeric variables only

# Identifying the numeric class in the data and evaluating if there are any
# outliers
#
num_cols <- unlist(lapply(ad_dataset2, is.numeric)) 

# Subset numeric columns of data
#
num_dataset <- ad_dataset2[ , num_cols]
# Printing the first six rows of the subset
#
head(num_dataset)

##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage word.Counter
## 1                    68.95  35    61833.90               256.09            3
## 2                    80.23  31    68441.85               193.77            3
## 3                    69.47  26    59785.94               236.50            5
## 4                    74.15  29    54806.18               245.89            5
## 5                    68.37  35    73889.99               225.58            3
## 6                    59.99  23    59761.56               226.74            4
##   year
## 1 2016
## 2 2016
## 3 2016
## 4 2016
## 5 2016
## 6 2016

Measures of central tendency and dispersion

# Creating the mode function that will perform our mode operation for us
# ---
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}

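# Computing some descriptive statistics
# ---
# Note: skewness() and kurtosis() resolve here to the e1071 versions (e1071
# was attached after moments and masks it). They expect a numeric vector, so
# passing the whole data frame returns NA, hence the warnings and NA columns
# below. Applying them per column, e.g. sapply(num_dataset, skewness), or
# calling moments::skewness(num_dataset), would return values.
desc_stats <- data.frame(
  Mode = apply(num_dataset, 2, getmode), # Mode
  Med = apply(num_dataset, 2, median), # median
  Mean = apply(num_dataset, 2, mean),  # mean
  SD = apply(num_dataset, 2, sd),      # Standard deviation
  Var = apply(num_dataset, 2, var),     # Variance
  Min = apply(num_dataset, 2, min),     # minimum
  Max = apply(num_dataset, 2, max),      # Maximum
  skewness = skewness(num_dataset),      # skewness (NA, see note above)
  kurtosis = kurtosis(num_dataset)      # kurtosis (NA, see note above)
)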

## Warning in mean.default(x): argument is not numeric or logical: returning NA

## Warning in mean.default(x): argument is not numeric or logical: returning NA

desc_stats <- round(desc_stats, 2)
desc_stats

##                              Mode      Med     Mean       SD          Var
## Daily.Time.Spent.on.Site    62.26    68.22    65.00    15.85       251.34
## Age                         31.00    35.00    36.01     8.79        77.19
## Area.Income              61833.90 57012.30 55000.00 13414.63 179952405.95
## Daily.Internet.Usage       167.22   183.13   180.00    43.90      1927.42
## word.Counter                 4.00     4.00     3.98     0.87         0.75
## year                      2016.00  2016.00  2016.00     0.03         0.00
##                               Min      Max skewness kurtosis
## Daily.Time.Spent.on.Site    32.60    91.43       NA       NA
## Age                         19.00    61.00       NA       NA
## Area.Income              13996.50 79484.80       NA       NA
## Daily.Internet.Usage       104.78   269.96       NA       NA
## word.Counter                 3.00     7.00       NA       NA
## year                      2015.00  2016.00       NA       NA

From the descriptive statistics,

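The modal values were an age of 31 years, daily internet usage of 167.22 MB, 62.26 seconds spent on the site, and an area income of 61,833.90 USD. For the continuous variables the mode should be read with caution, since values rarely repeat (the area income mode is simply the value of the first record).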

Relative to their means, the four main numeric columns have similar spread: in each case the standard deviation is about 24% of the mean. From the max and min values it is clear that there is a great disparity in area incomes, age, daily internet usage and time spent on the site among the site's users.

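The year column contains only 2015 and 2016, and its variance of 0.001 over 1000 records implies a single 2015 record. That record may be an artifact: as.Date() converts timestamps to UTC by default, which can move a timestamp from early on 1 January 2016 back to 31 December 2015. The most common ad topic line length was 4 words.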

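The skewness and kurtosis columns are NA in the table above, so skew is judged here from the gap between mean and median. Daily time spent on site and area income have means below their medians, which points to a left (negative) skew, while age has a mean above its median, which points to a right (positive) skew. Daily internet usage had a very small, insignificant skew to the left (-0.03).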
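
Skewness can be computed per column without relying on which package's `skewness()` is currently attached. The sketch below defines a small moment-based helper (`skew` is a hypothetical name, and the toy columns are made up) and applies it column by column with `sapply`.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Toy columns: one symmetric, one with a long right tail
toy <- data.frame(
  symmetric  = c(1, 2, 3, 4, 5),
  right_tail = c(1, 1, 1, 2, 10)
)

# Apply the helper to each column; a data frame is a list of columns
skew_by_column <- sapply(toy, skew)
skew_by_column
```

A value near zero indicates symmetry, a positive value a longer right tail, and a negative value a longer left tail. On the study data the equivalent call would be `sapply(num_dataset, skew)`.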

Graphicals

# Histogram plots of numeric data in the ad_dataset
hist.data.frame(num_dataset)

From the histogram plots and the descriptive statistics, none of the numeric columns appears to be normally distributed aside from the column containing information on daily internet usage.

Daily time spent on site and area income are skewed to the left (their mass sits at the high end, with means below the medians), while age is skewed to the right. Word count, which ranges from 3 to 7 around a mode of 4, also has its longer tail to the right.

From the histograms, most people spend around 80 seconds on the site, most ads have 4 words, and most people are from areas with annual incomes of between 60,000 and 70,000 USD. The age of most individuals ranges between 30 and 40 years.

# Bar chart of the genders in data set
ggplot(ad_dataset2, aes(x = Male)) +
    geom_bar(fill = "coral") +
    theme_classic()

# Bar chart of the individuals who clicked and those who did not click on ad 
ggplot(ad_dataset2, aes(x = Clicked.on.Ad)) +
    geom_bar(fill = "coral") +
    theme_classic()

# Bar chart of the months the data was collected
ggplot(ad_dataset2, aes(x = month)) +
    geom_bar(fill = "coral") +
    theme_classic()

# Bar chart of the hours the data was collected
ggplot(ad_dataset2, aes(x = hour)) +
    geom_bar(fill = "coral") +
    theme_classic()

# Bar chart of the weekdays the data was collected
ggplot(ad_dataset2, aes(x = weekday)) +
    geom_bar(fill = "coral") +
    theme_classic()

# Bar chart of the continent the data was collected
ggplot(ad_dataset2, aes(x = continent)) +
    geom_bar(fill = "coral") +
    theme_classic()

From the bar plots;

The site was most visited at 0700hrs and least visited at 1000hrs and 0100hrs.
Most of the visitors to the site were female.
The dataset had an equal number of records in which the ad was clicked and not clicked.
The site had the highest number of visitors in February and the least in December. Since the year column holds a single 2015 record, the December count may reflect a date shifted by the UTC conversion in as.Date() rather than a genuine December visit.
The site had the highest number of visitors on a Sunday and the least on a Tuesday.
The highest number of visitors to the site originated from Asia and the Americas, and the least from the Antarctica group.

Bivariate analysis

Covariance

Covariance is a statistical measure of the degree to which two variables vary together. Here the covariance between each pair of numerical variables in the data frame shall be calculated.

# Create Covariance matrix of the numerical data in dataset
#
cov(num_dataset)

##                          Daily.Time.Spent.on.Site           Age   Area.Income
## Daily.Time.Spent.on.Site             2.513371e+02 -4.617415e+01  6.613081e+04
## Age                                 -4.617415e+01  7.718611e+01 -2.152093e+04
## Area.Income                          6.613081e+04 -2.152093e+04  1.799524e+08
## Daily.Internet.Usage                 3.609919e+02 -1.416348e+02  1.987625e+05
## word.Counter                        -3.507864e-01 -2.260280e-01  4.037198e+02
## year                                -1.568549e-02  2.011011e-03 -3.913273e+00
##                          Daily.Internet.Usage  word.Counter          year
## Daily.Time.Spent.on.Site         3.609919e+02 -3.507864e-01 -1.568549e-02
## Age                             -1.416348e+02 -2.260280e-01  2.011011e-03
## Area.Income                      1.987625e+05  4.037198e+02 -3.913273e+00
## Daily.Internet.Usage             1.927415e+03  2.486909e-01 -5.981972e-02
## word.Counter                     2.486909e-01  7.542703e-01 -2.202202e-05
## year                            -5.981972e-02 -2.202202e-05  1.000000e-03

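From the covariance matrix, age varies negatively with daily time spent on site, area income, daily internet usage and word count. Daily time spent on site, area income and daily internet usage vary positively with one another. Word count varies negatively with daily time spent on site and age, and positively with area income and daily internet usage. Year varies negatively with every variable except age, but it is almost constant, so these values carry little information. Because covariance depends on the units of each variable, the magnitudes are not comparable across pairs.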

Correlation

# Correlation matrix of numerical data in the ad dataset
#
cor(num_dataset)

##                          Daily.Time.Spent.on.Site          Age  Area.Income
## Daily.Time.Spent.on.Site               1.00000000 -0.331513343  0.310954413
## Age                                   -0.33151334  1.000000000 -0.182604955
## Area.Income                            0.31095441 -0.182604955  1.000000000
## Daily.Internet.Usage                   0.51865848 -0.367208560  0.337495533
## word.Counter                          -0.02547716 -0.029623014  0.034652749
## year                                  -0.03128741  0.007238438 -0.009224893
##                          Daily.Internet.Usage word.Counter         year
## Daily.Time.Spent.on.Site          0.518658475 -0.025477156 -0.031287414
## Age                              -0.367208560 -0.029623014  0.007238438
## Area.Income                       0.337495533  0.034652749 -0.009224893
## Daily.Internet.Usage              1.000000000  0.006522419 -0.043088037
## word.Counter                      0.006522419  1.000000000 -0.000801851
## year                             -0.043088037 -0.000801851  1.000000000

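Age is negatively correlated with every other numerical variable except year, where the correlation is effectively zero (0.007). Daily time spent on site, area income and daily internet usage are positively correlated with one another; the strongest relationship is the moderate correlation of 0.52 between daily time spent on site and daily internet usage. Word counter and year show near-zero correlation with all other variables.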
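The link between the two matrices can be checked by hand: a correlation is a covariance divided by the product of the two standard deviations. The sketch below copies a 3 x 3 block of the printed covariance matrix (it does not recompute anything from `num_dataset`) and rescales it.

```r
# Covariances copied from the printed covariance matrix above
vars <- c("Daily.Internet.Usage", "word.Counter", "year")
cov_sub <- matrix(
  c( 1.927415e+03,  2.486909e-01, -5.981972e-02,
     2.486909e-01,  7.542703e-01, -2.202202e-05,
    -5.981972e-02, -2.202202e-05,  1.000000e-03),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

# Correlation = covariance divided by the product of the standard deviations
sds <- sqrt(diag(cov_sub))
cor_manual <- cov_sub / outer(sds, sds)

# Base R's cov2cor() performs the same rescaling
cor_sub <- cov2cor(cov_sub)
round(cor_sub, 6)
```

The rescaled values agree with the corresponding entries of the `cor(num_dataset)` output, which is why the correlation matrix is the easier of the two to compare across variables.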

Bivariate graphicals

# pair plot of variables with numeric data
#
pairs(num_dataset,                     # Data frame of variables
      labels = colnames(num_dataset),  # Variable names
      pch = 21,                 # Pch symbol
      main = "Advertisement dataset",    # Title of the plot
      row1attop = TRUE,         # If FALSE, changes the direction of the diagonal
      gap = 1,                  # Distance between subplots
      cex.labels = NULL,        # Size of the diagonal text
      font.labels = 1)          # Font style of the diagonal text

From the pair plot, daily time spent on the site, user age, daily internet usage and area income produced plots from which insights could be drawn. These shall be investigated further through multivariate analysis that factors in the target variable. No pattern was observable between word counter or year and the other numeric variables.

# Bar chart side by side of genders to know ratios of those that clicked the
# ads and those that did not (0 - female, 1- male)
#
ggplot(ad_dataset2, aes(x = Male, fill = Clicked.on.Ad)) +
    geom_bar(position = position_dodge()) +
    theme_classic()

From the bar plot above, more of the women who visit the site click the ad than do not. The opposite is true for men.

# Bar chart side by side of continents comparing ad clicks
#
ggplot(ad_dataset2, aes(x = continent, fill = Clicked.on.Ad)) +
    geom_bar(position = position_dodge()) +
    theme_classic()

For all continents aside from Antarctica and Asia, more people who visited the site clicked on the ad compared to those who did not.

# Bar chart side by side of hour comparing ad clicks
#
ggplot(ad_dataset2, aes(x = hour, fill = Clicked.on.Ad)) +
    geom_bar(position = position_dodge()) +
    theme_classic()

At 0200hrs, 0500hrs, 0700hrs, 1000hrs, 1200hrs, 1600hrs, 2100hrs, 2200hrs and 2300hrs, noticeably more of the individuals that visited the site did not click on the ad compared to those that did.

At 0000hrs, 0300hrs, 0900hrs, 1100hrs, 1700hrs, and 1800hrs more of the individuals that visited the site clicked on the ad compared to those that didn’t.

# Bar chart side by side of month comparing ad clicks
#
ggplot(ad_dataset2, aes(x = month, fill = Clicked.on.Ad)) +
    geom_bar(position = position_dodge()) +
    theme_classic()

January, March, July and December were the months in which the ads that were not clicked exceeded the ads that were clicked.

# Bar chart side by side of weekday comparing ad clicks
#
ggplot(ad_dataset2, aes(x = weekday, fill = Clicked.on.Ad)) +
    geom_bar(position = position_dodge()) +
    theme_classic()

Friday and Tuesday were the days of the week on which more ads were viewed without being clicked than were clicked.
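Reading dodged bars by eye can be error-prone when the two bars are close in height, so the same comparison can be made numerically with row-wise proportions. The sketch below uses a small made-up data frame `toy` as a stand-in for `ad_dataset2`; the values are hypothetical and only illustrate the technique.

```r
# Hypothetical toy data standing in for ad_dataset2 (values are made up)
toy <- data.frame(
  weekday = factor(c("Friday", "Friday", "Friday", "Monday", "Monday",
                     "Monday", "Tuesday", "Tuesday", "Tuesday", "Tuesday")),
  Clicked.on.Ad = factor(c(0, 0, 1, 1, 1, 0, 0, 0, 0, 1))
)

# Counts of clicked / not clicked per weekday
counts <- table(toy$weekday, toy$Clicked.on.Ad)

# Row-wise proportions: share of visits on each day that ended in a click
click_share <- prop.table(counts, margin = 1)
round(click_share, 2)

# Days on which non-clicks outnumber clicks
nonclick_days <- names(which(click_share[, "0"] > click_share[, "1"]))
nonclick_days
```

The same pattern should work for `Male`, `continent`, `hour` and `month` by swapping the grouping column.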

Multivariate analysis

Here the relationships between the feature variables and the target variable shall be explored further.

# Scatter plot of daily time spent on site vs daily internet usage
#
ggplot(ad_dataset, aes(Daily.Time.Spent.on.Site, Daily.Internet.Usage,
color = Clicked.on.Ad)) + geom_point() + scale_color_paletteer_d("nord::aurora")

From the scatter plot above, individuals with lower daily internet usage who also spend less time on the site are more likely to click on the ad.

# scatter plot of daily time spent on the site vs age
#
ggplot(ad_dataset, aes(Daily.Time.Spent.on.Site, Age,
color = Clicked.on.Ad)) + geom_point() + scale_color_paletteer_d("nord::aurora")

From the scatter plot above, older individuals who spend less time on the site are more likely to click on the ad.

# Scatter plot of daily internet usage versus age
#
ggplot(ad_dataset, aes(Daily.Internet.Usage, Age,
color = Clicked.on.Ad)) + geom_point() + scale_color_paletteer_d("nord::aurora")

From the scatter plot above, older individuals with low daily internet usage are most likely to click on an ad.

# Scatter plot of area income vs age
#
ggplot(ad_dataset, aes(Area.Income, Age,
color = Clicked.on.Ad)) + geom_point() + scale_color_paletteer_d("nord::aurora")

From the scatter plot above, younger individuals (below 35 years) from areas with high incomes are the least likely to click on an ad.

# scatter plot of daily internet usage vs area income
#
ggplot(ad_dataset, aes(Area.Income, Daily.Internet.Usage,
color = Clicked.on.Ad)) + geom_point() + scale_color_paletteer_d("nord::aurora")

From the scatter plot, individuals with high daily internet usage from areas with high incomes are least likely to click on an advertisement.
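The scatter plot readings can be backed by a simple number: the click rate within bands of a continuous feature. The following is a minimal sketch on made-up values; the `toy` data frame and the cut-off of 180 are hypothetical and not taken from the ad dataset.

```r
# Hypothetical toy data (made-up values) standing in for ad_dataset
toy <- data.frame(
  Daily.Internet.Usage = c(110, 125, 140, 150, 190, 205, 220, 235, 250, 260),
  Clicked.on.Ad        = c(1,   1,   1,   0,   1,   0,   0,   0,   0,   0)
)

# Bin the continuous feature into low / high usage at an illustrative cut-off
toy$usage_band <- cut(toy$Daily.Internet.Usage,
                      breaks = c(-Inf, 180, Inf),
                      labels = c("low", "high"))

# The mean of a 0/1 target within each bin is the click rate for that bin
click_rate <- tapply(toy$Clicked.on.Ad, toy$usage_band, mean)
click_rate
```

Note that the target must be numeric 0/1 for `mean()` to give a rate; a factor target would need converting first.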

Implementing solution

To predict whether a site visitor clicked the ad or not, we shall implement a decision tree model, since our data set has few variables and the numeric variables are not normally distributed. The choice of model is also driven by the desire to obtain interpretable results.

Step 1: Data Splicing

Data splicing is the process of splitting the data into a training set and a testing set. The training set is used to build the decision tree model and the testing set is used to evaluate how well the model performs on unseen data. The split is performed in the code snippet below:

# Moving the target variable to the front of the data set (first column),
# so that it can be dropped with [-1] when predicting
ad_dataset2 <- ad_dataset2 %>% relocate(Clicked.on.Ad, .before = Daily.Time.Spent.on.Site)

# data splicing. 
#
set.seed(12345)

# 80% of the data shall be used to train model
#
train <- sample(1:nrow(ad_dataset2),size = ceiling(0.80*nrow(ad_dataset2)),replace = FALSE)

# training set
#
ad_train <- ad_dataset2[train,]

# test set
#
ad_test <- ad_dataset2[-train,]
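A quick way to sanity-check a split like the one above is to confirm that the two sets are disjoint, that their sizes add up, and that the class balance is similar in both. The sketch below does this on a made-up 50-row data frame; `toy` and its `id` column are hypothetical.

```r
# Hypothetical toy data frame with 50 rows (made-up values)
set.seed(12345)
toy <- data.frame(id = 1:50,
                  Clicked.on.Ad = factor(rep(c(0, 1), each = 25)))

# Draw 80% of the row indices without replacement for training
train_idx <- sample(seq_len(nrow(toy)), size = ceiling(0.80 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Sanity checks: sizes add up and no row appears in both sets
c(train = nrow(toy_train), test = nrow(toy_test))
length(intersect(toy_train$id, toy_test$id))

# Class balance in each set; a large gap would argue for a stratified split
prop.table(table(toy_train$Clicked.on.Ad))
prop.table(table(toy_test$Clicked.on.Ad))
```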

Step 2: Building a model

In this stage, we’re going to build a Decision Tree by using the rpart (Recursive Partitioning And Regression Trees) algorithm:

# building the classification tree with rpart
tree <- rpart(Clicked.on.Ad ~ .,
data=ad_train,
method = "class")

Step 3: Visualising the tree

In this step, we’ll be using the rpart.plot library to plot our final Decision Tree:

# Visualize the decision tree with rpart.plot
rpart.plot(tree, nn=TRUE)

From the decision tree plot, the most important feature in determining whether a visitor to the site clicks the ad or not is the individual's daily internet usage.
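To give a feel for why one feature ends up at the root of the tree, the sketch below searches for the threshold on a single variable that minimises the weighted Gini impurity, which is the default splitting criterion for `rpart` classification trees. It is a simplified illustration on made-up values, not a reimplementation of `rpart`; the helper names `gini` and `split_impurity` are hypothetical.

```r
# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Weighted impurity after splitting x at a threshold
split_impurity <- function(x, y, threshold) {
  left  <- y[x <  threshold]
  right <- y[x >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

# Hypothetical toy data (made-up values)
usage   <- c(110, 125, 140, 150, 190, 205, 220, 235)
clicked <- c(1,   1,   1,   1,   0,   0,   0,   1)

# Candidate thresholds: midpoints between consecutive sorted values
sorted <- sort(unique(usage))
candidates <- (head(sorted, -1) + tail(sorted, -1)) / 2

impurities <- sapply(candidates, function(thr) split_impurity(usage, clicked, thr))
best_threshold <- candidates[which.min(impurities)]
best_threshold
```

A tree builder repeats this search over every feature and keeps the split with the largest impurity reduction, so the feature chosen at the root is the one that separates the classes best on its own.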

Step 4: Testing the model

Now in order to test our Decision Tree model, we’ll be applying the testing data set on our model like so:

#Testing the model
pred <- predict(object=tree,ad_test[-1],type="class")

Step 5: Calculating accuracy

We’ll be using a confusion matrix to calculate the accuracy of the model. Here’s the code:

#Calculating accuracy
t <- table(ad_test$Clicked.on.Ad,pred) 
confusionMatrix(t)

## Confusion Matrix and Statistics
## 
##    pred
##      0  1
##   0 92  7
##   1  7 94
##                                           
##                Accuracy : 0.93            
##                  95% CI : (0.8853, 0.9612)
##     No Information Rate : 0.505           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.86            
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9293          
##             Specificity : 0.9307          
##          Pos Pred Value : 0.9293          
##          Neg Pred Value : 0.9307          
##              Prevalence : 0.4950          
##          Detection Rate : 0.4600          
##    Detection Prevalence : 0.4950          
##       Balanced Accuracy : 0.9300          
##                                           
##        'Positive' Class : 0               
##

The output shows that 93% of the samples in the test data set were correctly classified, with a 95% confidence interval of (0.8853, 0.9612) for the accuracy. The model can therefore classify whether an ad was clicked or not with high accuracy on unseen data.

Interpretation of the confusion matrix (rows are actual classes, columns are predicted classes, and 0, not clicked, is the positive class):
TP - 92 ads were correctly classified as not clicked
TN - 94 ads were correctly classified as clicked
FP - 7 ads were incorrectly classified as not clicked
FN - 7 ads were incorrectly classified as clicked
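The headline statistics can be reproduced from the four counts alone. The sketch below copies the counts from the printed matrix and recomputes accuracy and Cohen's kappa; the interval uses `binom.test()`, on the assumption that this exact binomial interval is what the 95% CI in the output reports.

```r
# Counts copied from the decision tree confusion matrix above
cm <- matrix(c(92, 7,
               7, 94),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), pred = c("0", "1")))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Exact binomial interval for the accuracy (assumed to mirror the reported 95% CI)
ci <- binom.test(sum(diag(cm)), n)$conf.int

# Cohen's kappa: observed agreement corrected for chance agreement
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, ci_low = ci[1], ci_high = ci[2], kappa = kappa_hat), 4)
```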

The decision tree model provides us with a very high accuracy with minimal pre-processing steps in a very short time.

McNemar's test p-value of 1 is greater than 0.05, so the test is not statistically significant. McNemar's test compares the two off-diagonal counts of the confusion matrix (7 and 7), so there is no evidence that the model makes one kind of error more often than the other.

We shall challenge this solution by employing a naive Bayes classifier and comparing the performance metrics of the two models.

Challenging solution

We shall attempt to correctly classify the ads as clicked or not clicked by using a naive bayes classifier model. The model is selected because:

It is simple and easy to implement
It doesn’t require as much training data
It handles both continuous and discrete data
It is highly scalable with the number of predictors and data points
It is fast and can be used to make real-time predictions
It is not sensitive to irrelevant features

Step 1: Data Splicing

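# Splitting the dataset. 80 percent of data shall be used to train
# model. sample.split() expects the label vector, not the whole data frame;
# given a data frame it would split over columns and recycle the mask across rows
split <- sample.split(ad_dataset2$Clicked.on.Ad, SplitRatio = 0.8)

# Train set
#
train_cl <- subset(ad_dataset2, split == TRUE)

# test set
#
test_cl <- subset(ad_dataset2, split == FALSE)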

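Step 2: Feature scaling

# Feature Scaling (the scaled copies are not used by naiveBayes() below,
# which is fitted on the unscaled train_cl)
cols <- c("Daily.Time.Spent.on.Site", "Age", "Area.Income", "Daily.Internet.Usage", "word.Counter", "year")
train_scale <- scale(train_cl[cols])
# Scale the test set with the training set's centre and spread
test_scale <- scale(test_cl[cols],
                    center = attr(train_scale, "scaled:center"),
                    scale = attr(train_scale, "scaled:scale"))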
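When scaling is used, the centring and scaling parameters are normally learned on the training set and then reused on the test set, so that both sets live on the same scale. The sketch below shows this with base R's `scale()` on two made-up data frames; `toy_train` and `toy_test` are hypothetical.

```r
# Hypothetical toy train/test sets (made-up values)
toy_train <- data.frame(Age = c(25, 35, 45), Area.Income = c(40000, 55000, 70000))
toy_test  <- data.frame(Age = c(30, 50),     Area.Income = c(50000, 65000))

# Learn centring and scaling parameters on the training set only
train_scaled <- scale(toy_train)
centers <- attr(train_scaled, "scaled:center")
scales  <- attr(train_scaled, "scaled:scale")

# Apply the training parameters to the test set
test_scaled <- scale(toy_test, center = centers, scale = scales)
test_scaled
```

Scaling each set with its own mean and standard deviation would map the same raw value to different scaled values in the two sets.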

Step 3: Fitting Naive Bayes Model

# Fitting Naive Bayes Model to training data set
#
set.seed(120)  # Setting Seed
classifier_cl <- naiveBayes(Clicked.on.Ad ~ ., data = train_cl)
classifier_cl

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.4946667 0.5053333 
## 
## Conditional probabilities:
##    Daily.Time.Spent.on.Site
## Y       [,1]      [,2]
##   0 77.07520  7.694517
##   1 52.92715 12.726188
## 
##    Age
## Y       [,1]     [,2]
##   0 31.89757 6.146875
##   1 40.52770 8.779856
## 
##    Area.Income
## Y       [,1]      [,2]
##   0 61779.52  8769.018
##   1 48237.24 14396.965
## 
##    Daily.Internet.Usage
## Y       [,1]     [,2]
##   0 213.5033 23.69419
##   1 145.7495 29.89758
## 
##    Male
## Y           0         1
##   0 0.4797844 0.5202156
##   1 0.5461741 0.4538259
## 
##    word.Counter
## Y       [,1]      [,2]
##   0 3.989218 0.9211580
##   1 3.960422 0.8300034
## 
##    continent
## Y       Africa   Americas Antarctica       Asia     Europe    Oceania
##   0 0.20215633 0.22371968 0.03234501 0.24528302 0.19137466 0.10512129
##   1 0.21635884 0.21899736 0.03693931 0.20844327 0.21635884 0.10290237
## 
##    year
## Y       [,1]       [,2]
##   0 2015.997 0.05191741
##   1 2016.000 0.00000000
## 
##    month
## Y            01          02          03          04          05          06
##   0 0.153638814 0.142857143 0.180592992 0.142857143 0.145552561 0.121293801
##   1 0.137203166 0.166226913 0.155672823 0.147757256 0.160949868 0.139841689
##    month
## Y            07          12
##   0 0.110512129 0.002695418
##   1 0.092348285 0.000000000
## 
##    weekday
## Y      Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday
##   0 0.1698113 0.1401617 0.1293801 0.1671159 0.1212938 0.1239892 0.1482480
##   1 0.1398417 0.1451187 0.1160950 0.1503958 0.1688654 0.1187335 0.1609499
## 
##    hour
## Y           00         01         02         03         04         05
##   0 0.04582210 0.02425876 0.04043127 0.04312668 0.03504043 0.05390836
##   1 0.05277045 0.03166227 0.02638522 0.05013193 0.03957784 0.04485488
##    hour
## Y           06         07         08         09         10         11
##   0 0.03773585 0.05121294 0.04582210 0.03773585 0.04312668 0.03234501
##   1 0.05804749 0.06332454 0.03957784 0.06068602 0.02374670 0.04749340
##    hour
## Y           12         13         14         15         16         17
##   0 0.03773585 0.03773585 0.04043127 0.03504043 0.03234501 0.02695418
##   1 0.03166227 0.02902375 0.04485488 0.03166227 0.03430079 0.04485488
##    hour
## Y           18         19         20         21         22         23
##   0 0.03773585 0.04043127 0.05121294 0.06738544 0.05121294 0.05121294
##   1 0.04749340 0.03430079 0.05013193 0.04749340 0.03693931 0.02902375
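For the numeric predictors, the two columns in each table above are the per-class mean and standard deviation, and the classifier combines them as a prior multiplied by one Gaussian density per feature. The sketch below works this through by hand for two of the features, using values copied from the output; the visitor's feature values are made up, and the remaining predictors are left out for brevity.

```r
# Priors, means and standard deviations copied from the classifier output above
prior      <- c("0" = 0.4946667, "1" = 0.5053333)
time_mean  <- c("0" = 77.07520,  "1" = 52.92715)
time_sd    <- c("0" = 7.694517,  "1" = 12.726188)
usage_mean <- c("0" = 213.5033,  "1" = 145.7495)
usage_sd   <- c("0" = 23.69419,  "1" = 29.89758)

# Hypothetical visitor (made-up values for the two features)
visitor <- list(Daily.Time.Spent.on.Site = 55, Daily.Internet.Usage = 150)

# Naive Bayes: prior times the product of per-feature Gaussian densities
unnormalised <- prior *
  dnorm(visitor$Daily.Time.Spent.on.Site, mean = time_mean,  sd = time_sd) *
  dnorm(visitor$Daily.Internet.Usage,     mean = usage_mean, sd = usage_sd)

# Normalise so the two class probabilities sum to one
posterior <- unnormalised / sum(unnormalised)
round(posterior, 4)
```

A visitor whose values sit close to the class 1 means and far from the class 0 means, as here, receives almost all of the posterior mass for class 1 (clicked).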

Step 4: Testing the model

# Predicting on test data
#
y_pred <- predict(classifier_cl, newdata = test_cl)

Step 5: Calculating accuracy

We’ll be using a confusion matrix to calculate the accuracy of the model. Here’s the code:

#Calculating accuracy
t <- table(test_cl$Clicked.on.Ad, y_pred) 
confusionMatrix(t)

## Confusion Matrix and Statistics
## 
##    y_pred
##       0   1
##   0 106  23
##   1   3 118
##                                           
##                Accuracy : 0.896           
##                  95% CI : (0.8513, 0.9309)
##     No Information Rate : 0.564           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7928          
##                                           
##  Mcnemar's Test P-Value : 0.0001944       
##                                           
##             Sensitivity : 0.9725          
##             Specificity : 0.8369          
##          Pos Pred Value : 0.8217          
##          Neg Pred Value : 0.9752          
##              Prevalence : 0.4360          
##          Detection Rate : 0.4240          
##    Detection Prevalence : 0.5160          
##       Balanced Accuracy : 0.9047          
##                                           
##        'Positive' Class : 0               
##

The output shows that 89.6% of the samples in the test data set were correctly classified, with a 95% confidence interval of (0.8513, 0.9309) for the accuracy. The model can therefore classify whether an ad was clicked or not with reasonably high accuracy on unseen data.

Interpretation of the confusion matrix (rows are actual classes, columns are predicted classes, and 0, not clicked, is the positive class):
TP - 106 ads were correctly classified as not clicked
TN - 118 ads were correctly classified as clicked
FP - 3 ads were incorrectly classified as not clicked
FN - 23 ads were incorrectly classified as clicked
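Because this matrix is not symmetric, it matters which dimension holds the predictions. `confusionMatrix()` applied to a table reads the rows as predictions and the columns as the reference, whereas the table above was built with the actual classes on the rows. The sketch below recomputes the per-class rates directly from the counts, reading rows as actual classes, so they can be compared with the labelled figures in the output.

```r
# Counts copied from the naive Bayes confusion matrix above
cm <- matrix(c(106, 23,
                 3, 118),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), pred = c("0", "1")))

# Reading rows as actual classes
recall_0    <- cm["0", "0"] / sum(cm["0", ])   # share of actual 0s predicted as 0
precision_0 <- cm["0", "0"] / sum(cm[, "0"])   # share of predicted 0s that are actual 0s
recall_1    <- cm["1", "1"] / sum(cm["1", ])   # share of actual 1s predicted as 1
precision_1 <- cm["1", "1"] / sum(cm[, "1"])   # share of predicted 1s that are actual 1s

round(c(recall_0 = recall_0, precision_0 = precision_0,
        recall_1 = recall_1, precision_1 = precision_1), 4)
```

With this reading, the figure printed as Sensitivity (0.9725) is the precision for the not-clicked class and the figure printed as Pos Pred Value (0.8217) is its recall. Building the table as `table(y_pred, test_cl$Clicked.on.Ad)` would be one way to make the printed labels line up.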

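McNemar's test p-value for this model (0.0001944) is less than 0.05, so the test is statistically significant. McNemar's test compares the two off-diagonal counts of the confusion matrix, so this is evidence that the model's two kinds of error occur at different rates (23 versus 3): it mislabels not-clicked ads as clicked far more often than the reverse.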
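The McNemar p-values reported for both models can be reproduced with base R's `mcnemar.test()`, which looks only at the two off-diagonal cells of a 2 x 2 table. The sketch below copies the counts from the two confusion matrices in this report.

```r
# Counts copied from the two confusion matrices in this report
dt_cm <- matrix(c(92, 7,
                  7, 94), nrow = 2, byrow = TRUE)
nb_cm <- matrix(c(106, 23,
                    3, 118), nrow = 2, byrow = TRUE)

# McNemar's test asks whether the two off-diagonal counts differ
dt_test <- mcnemar.test(dt_cm)
nb_test <- mcnemar.test(nb_cm)

c(decision_tree = dt_test$p.value, naive_bayes = nb_test$p.value)
```

Equal off-diagonal counts (7 and 7) give a p-value of 1, while the lopsided 23 versus 3 gives a small one, which matches the reading of the test as a check on whether the errors are balanced.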

Compared to the decision tree model, the naive Bayes model has lower overall accuracy (89.6% versus 93%), kappa (0.79 versus 0.86) and balanced accuracy (0.90 versus 0.93), and its errors are less evenly spread across the two classes.

Hence, for this classification problem, the decision tree model is the better choice for helping the entrepreneur determine which visitors are most likely to click on her ads.

Conclusion

It can be concluded that the decision tree model is the better of the two models for predicting whether a site visitor clicks an ad or not. On the held-out test set it classified 93% of visits correctly.

The most important feature in determining whether a visitor to the site clicks the ad or not is the individual's daily internet usage.

The entrepreneur is hence advised to adopt the decision tree model in order to improve the targeting of her advertisements.

Further Questions

A) Do we have the right data?

For this study, the data provides the information needed to meet the objectives set by the entrepreneur.

B) Do we have the right question?

Yes. Developing a machine learning model that helps her determine who is most likely to click on her ads should improve her customer targeting and increase her returns in the long run.

IP week 13

Kiprop Amos

2022-06-03
