TARGETED ADVERTISING

1. Defining the Question

a) Specifying the Question

Which individuals are most likely to click on the course ads?

b) Defining the Metric for Success

The project will be considered successful if the users who are likely to click on the ads, and the factors that determine whether a user clicks, are correctly identified.

c) Understanding the Context

A Kenyan entrepreneur has created an online cryptography course. The client wishes to advertise the course on her blog. She currently targets users from all continents, in various countries.

The client provided historical data on a previous related course that was run on the same blog. This data is to be analyzed to determine the users who are most likely to click on the ads, to improve targeted advertising.

To ensure that targeted advertising is carried out correctly for a good ROI, the most relevant characteristics of users who click on ads need to be identified.

d) Recording the Experimental Design

  1. Data sourcing/loading
  2. Data Understanding
  3. Data Relevance
  4. External Dataset Validation
  5. Data Preparation
  6. Univariate Analysis
  7. Bivariate Analysis
  8. Multivariate Analysis
  9. Implementing the solution
  10. Challenging the solution
  11. Conclusion
  12. Follow up questions

e) Data Relevance

For the data to be relevant, it should have meaningful insights that can be used to identify the characteristics of users likely to click on the course ads.

2. Data Understanding

Libraries

# Libraries
library(data.table)
library(plyr)
library(ggplot2)
library(moments)
library(ggcorrplot)

a) Reading the Data

# Loading the data.
ad <- fread('advertising.csv')
# Countries and continents data set used to create a continent column.
cont <- fread('https://raw.githubusercontent.com/dbouquin/IS_608/master/NanosatDB_munging/Countries-Continents.csv')
# Dataset preview
head(cont)
##    Continent  Country
## 1:    Africa  Algeria
## 2:    Africa   Angola
## 3:    Africa    Benin
## 4:    Africa Botswana
## 5:    Africa  Burkina
## 6:    Africa  Burundi

b) Checking the Data

Number of Records

# Number of rows and columns of the advertisement dataset.
cat('Number of rows = ', nrow(ad), 'and the number of columns = ', ncol(ad),'.')
## Number of rows =  1000 and the number of columns =  10 .

Top Dataset Preview

# First 5 records.
head(ad, 5)
##    Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1:                    68.95  35    61833.90               256.09
## 2:                    80.23  31    68441.85               193.77
## 3:                    69.47  26    59785.94               236.50
## 4:                    74.15  29    54806.18               245.89
## 5:                    68.37  35    73889.99               225.58
##                            Ad Topic Line           City Male    Country
## 1:    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2:    Monitored national standardization      West Jodi    1      Nauru
## 3:      Organic bottom-line service-desk       Davidton    0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5:         Robust logistical utilization   South Manuel    0    Iceland
##              Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11             0
## 2: 2016-04-04 01:39:02             0
## 3: 2016-03-13 20:35:42             0
## 4: 2016-01-10 02:31:19             0
## 5: 2016-06-03 03:36:18             0

Bottom Dataset Preview

# Last 5 records.
tail(ad, 5)
##    Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1:                    72.97  30    71384.57               208.58
## 2:                    51.30  45    67782.17               134.42
## 3:                    51.63  51    42415.72               120.37
## 4:                    55.55  19    41920.79               187.95
## 5:                    45.01  26    29875.80               178.35
##                           Ad Topic Line          City Male
## 1:        Fundamental modular algorithm     Duffystad    1
## 2:      Grass-roots cohesive monitoring   New Darlene    1
## 3:         Expanded intangible solution South Jessica    1
## 4: Proactive bandwidth-monitored policy   West Steven    0
## 5:      Virtual 5thgeneration emulation   Ronniemouth    0
##                   Country           Timestamp Clicked on Ad
## 1:                Lebanon 2016-02-11 21:49:00             1
## 2: Bosnia and Herzegovina 2016-04-22 02:07:01             1
## 3:               Mongolia 2016-02-01 17:24:57             1
## 4:              Guatemala 2016-03-24 02:35:54             0
## 5:                 Brazil 2016-06-03 21:43:21             1

At first glance, no anomalies are visible in the data set.

c) Checking Datatypes

# Data set structure.
str(ad)
## Classes 'data.table' and 'data.frame':   1000 obs. of  10 variables:
##  $ Daily Time Spent on Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily Internet Usage    : num  256 194 236 246 226 ...
##  $ Ad Topic Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : chr  "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
##  $ Male                    : int  0 1 0 1 0 1 0 1 1 1 ...
##  $ Country                 : chr  "Tunisia" "Nauru" "San Marino" "Italy" ...
##  $ Timestamp               : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
##  $ Clicked on Ad           : int  0 0 0 0 0 0 0 1 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>

All columns have appropriate data types; however, the categorical columns will be converted to factors.

# Type conversion to factors

# Categorical columns
num <- unlist(lapply(ad, is.numeric))
cat_cols <- ad[, !num]
# Excluding the Timestamp column
cat_cols['Timestamp'] <- FALSE
# Coercing character columns to factors
  # Data frame with character columns only
char_df <- ad[, ..cat_cols]
 # Getting character vector from the original logical vector.
c <- as.vector(colnames(char_df))
# Target data set columns
a <- ad[ , ..c]
# Converting target character columns to factors
ad[ ,c] <- lapply(a, factor)
# Checking changes
head(ad, 5)
##    Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1:                    68.95  35    61833.90               256.09
## 2:                    80.23  31    68441.85               193.77
## 3:                    69.47  26    59785.94               236.50
## 4:                    74.15  29    54806.18               245.89
## 5:                    68.37  35    73889.99               225.58
##                            Ad Topic Line           City Male    Country
## 1:    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2:    Monitored national standardization      West Jodi    1      Nauru
## 3:      Organic bottom-line service-desk       Davidton    0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5:         Robust logistical utilization   South Manuel    0    Iceland
##              Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11             0
## 2: 2016-04-04 01:39:02             0
## 3: 2016-03-13 20:35:42             0
## 4: 2016-01-10 02:31:19             0
## 5: 2016-06-03 03:36:18             0
# Converting the encoded Male and Clicked on Ad columns to factors.
ad[ ,c('Male', 'Clicked on Ad')] <- lapply(ad[, c('Male', 'Clicked on Ad')], factor)
# Checking changes
head(ad, 5)
##    Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1:                    68.95  35    61833.90               256.09
## 2:                    80.23  31    68441.85               193.77
## 3:                    69.47  26    59785.94               236.50
## 4:                    74.15  29    54806.18               245.89
## 5:                    68.37  35    73889.99               225.58
##                            Ad Topic Line           City Male    Country
## 1:    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2:    Monitored national standardization      West Jodi    1      Nauru
## 3:      Organic bottom-line service-desk       Davidton    0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5:         Robust logistical utilization   South Manuel    0    Iceland
##              Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11             0
## 2: 2016-04-04 01:39:02             0
## 3: 2016-03-13 20:35:42             0
## 4: 2016-01-10 02:31:19             0
## 5: 2016-06-03 03:36:18             0
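
The same conversion can be written more compactly with data.table's `.SDcols`. The sketch below is a possible alternative, not the code used for this analysis. It runs on a small made-up table (`toy`) instead of `advertising.csv`, and the variable names `chr_cols` and `to_factor` are introduced only for illustration.

```r
library(data.table)

# Toy stand-in for the advertising data (values are illustrative only).
toy <- data.table(
  City = c("Wrightburgh", "West Jodi", "Davidton"),
  Country = c("Tunisia", "Nauru", "San Marino"),
  Male = c(0L, 1L, 0L),
  `Clicked on Ad` = c(0L, 0L, 1L),
  Age = c(35L, 31L, 26L)
)

# Character columns plus the 0/1 encoded columns to treat as categorical.
chr_cols <- names(toy)[vapply(toy, is.character, logical(1))]
to_factor <- c(chr_cols, "Male", "Clicked on Ad")

# Convert the selected columns in place, by reference, in a single step.
toy[, (to_factor) := lapply(.SD, factor), .SDcols = to_factor]

str(toy)
```

Because `:=` modifies the table by reference, no intermediate copies of the character columns are needed, and the Timestamp column is left alone simply because it is never selected.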

3. External Dataset Validation

The data was provided by the client and is based on a previous related course ad that was run on the same blog. Therefore, there is no need for external dataset validation.

4. Data Preparation

a) Validation

Column Validity

Checking for invalid/unnecessary columns that do not contribute relevant information to the study.

# Column names 
colnames(ad)
##  [1] "Daily Time Spent on Site" "Age"                     
##  [3] "Area Income"              "Daily Internet Usage"    
##  [5] "Ad Topic Line"            "City"                    
##  [7] "Male"                     "Country"                 
##  [9] "Timestamp"                "Clicked on Ad"

All columns are valid.

Checking for invalid values

# Checking for anomalies
# Data set summary
summary(ad)
##  Daily Time Spent on Site      Age         Area Income    Daily Internet Usage
##  Min.   :32.60            Min.   :19.00   Min.   :13996   Min.   :104.8       
##  1st Qu.:51.36            1st Qu.:29.00   1st Qu.:47032   1st Qu.:138.8       
##  Median :68.22            Median :35.00   Median :57012   Median :183.1       
##  Mean   :65.00            Mean   :36.01   Mean   :55000   Mean   :180.0       
##  3rd Qu.:78.55            3rd Qu.:42.00   3rd Qu.:65471   3rd Qu.:218.8       
##  Max.   :91.43            Max.   :61.00   Max.   :79485   Max.   :270.0       
##                                                                               
##                                  Ad Topic Line              City     Male   
##  Adaptive 24hour Graphic Interface      :  1   Lisamouth      :  3   0:519  
##  Adaptive asynchronous attitude         :  1   Williamsport   :  3   1:481  
##  Adaptive context-sensitive application :  1   Benjaminchester:  2          
##  Adaptive contextually-based methodology:  1   East John      :  2          
##  Adaptive demand-driven knowledgebase   :  1   East Timothy   :  2          
##  Adaptive uniform capability            :  1   Johnstad       :  2          
##  (Other)                                :994   (Other)        :986          
##            Country      Timestamp                      Clicked on Ad
##  Czech Republic:  9   Min.   :2016-01-01 02:52:10.00   0:500        
##  France        :  9   1st Qu.:2016-02-18 02:55:42.00   1:500        
##  Afghanistan   :  8   Median :2016-04-07 17:27:29.50                
##  Australia     :  8   Mean   :2016-04-10 10:34:06.64                
##  Cyprus        :  8   3rd Qu.:2016-05-31 03:18:14.00                
##  Greece        :  8   Max.   :2016-07-24 00:22:16.00                
##  (Other)       :950

All numeric values are non-negative and the columns have the correct data types; therefore, no anomalies are apparent in the numeric columns.

# Checking unique categorical column values.
str(ad)
## Classes 'data.table' and 'data.frame':   1000 obs. of  10 variables:
##  $ Daily Time Spent on Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily Internet Usage    : num  256 194 236 246 226 ...
##  $ Ad Topic Line           : Factor w/ 1000 levels "Adaptive 24hour Graphic Interface",..: 92 465 567 904 767 806 223 724 108 455 ...
##  $ City                    : Factor w/ 969 levels "Adamsbury","Adamside",..: 962 904 112 940 806 283 47 672 885 713 ...
##  $ Male                    : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 2 2 ...
##  $ Country                 : Factor w/ 237 levels "Afghanistan",..: 216 148 185 104 97 159 146 13 83 79 ...
##  $ Timestamp               : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
##  $ Clicked on Ad           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

No anomalies are present in the categorical columns.

b) Consistency

# Checking for missing values
colSums(is.na(ad))
## Daily Time Spent on Site                      Age              Area Income 
##                        0                        0                        0 
##     Daily Internet Usage            Ad Topic Line                     City 
##                        0                        0                        0 
##                     Male                  Country                Timestamp 
##                        0                        0                        0 
##            Clicked on Ad 
##                        0

There are no missing values present in the data set.

c) Completeness

# Checking for duplicates.
sum(duplicated(ad))
## [1] 0

There are no duplicated records.

d) Uniformity

# Checking uniformity of column names.
colnames(ad)
##  [1] "Daily Time Spent on Site" "Age"                     
##  [3] "Area Income"              "Daily Internet Usage"    
##  [5] "Ad Topic Line"            "City"                    
##  [7] "Male"                     "Country"                 
##  [9] "Timestamp"                "Clicked on Ad"

Column names have the same case, therefore, they are uniform. However, the spaces will be replaced with ’_’ for easier dataset manipulation and record access.

# Replacing white spaces with underscores.
colnames(ad) <- gsub(' ', '_', colnames(ad))
# Checking changes
colnames(ad)
##  [1] "Daily_Time_Spent_on_Site" "Age"                     
##  [3] "Area_Income"              "Daily_Internet_Usage"    
##  [5] "Ad_Topic_Line"            "City"                    
##  [7] "Male"                     "Country"                 
##  [9] "Timestamp"                "Clicked_on_Ad"

e) Outliers

# Numerical columns
num_df <- ad[ , ..num]
# Removing the encoded categorical columns from the numerical columns set.
num_df <- num_df[ , !c('Male', 'Clicked_on_Ad') ]
# Checking for outliers

# Plotting boxplots
  # Number of plots
length(num_df)
## [1] 4
# Boxplots
par(mfrow = c(2,2))
for (i in seq_along(num_df)){
  # The y axis shows the column's values, not counts.
  boxplot(num_df[ , ..i], main = paste('Boxplot of',  names(num_df)[i]), 
          ylab = 'Value')
}

From the box plots of the numerical columns, it can be seen that only the ‘Area Income’ column has outliers. Outliers will be retained for further analysis.
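
To back up the visual reading of the box plots with numbers, the points outside the whiskers can be counted per column. This is a minimal sketch on made-up data; `count_outliers` is a hypothetical helper, and it relies on base R's `boxplot.stats()`, which applies the same 1.5 * IQR whisker rule that `boxplot()` uses by default.

```r
# Count points outside the boxplot whiskers (1.5 * IQR rule) per column.
count_outliers <- function(df) {
  vapply(df, function(x) length(boxplot.stats(x)$out), integer(1))
}

# Toy data: `income` contains one deliberately extreme low value.
toy <- data.frame(
  age = c(25, 30, 35, 40, 45, 50),
  income = c(100, 52000, 55000, 57000, 60000, 63000)
)

count_outliers(toy)
```

On the real data, the equivalent call would be `count_outliers(num_df)`.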

f) Feature Engineering

Adding a continent column for easier generalization.

# Continents and country vectors
continents <- unlist(unique(cont$Continent))
country <- unlist(unique(ad$Country))
# Countries in each continent
africa <- unlist(unique(cont[cont$Continent == 'Africa'][ ,2]))
asia <- unlist(unique(cont[cont$Continent == 'Asia'][ ,2]))
europe <- unlist(unique(cont[cont$Continent == 'Europe'][ ,2]))
north.america <- unlist(unique(cont[cont$Continent == 'North America'][ ,2]))
oceania <- unlist(unique(cont[cont$Continent == 'Oceania'][ ,2]))
south.america <- unlist(unique(cont[cont$Continent == 'South America'][ ,2]))
# Copy of original dataset
ad2 <- data.frame(ad)

# Introducing a continent column
ad2$Continent <- ''

# Filling the continent column with respect to the countries in each record.
for (i in country){
  if (i %in% africa){
    ad2[(ad2$Country == i),]$Continent <- 'Africa'
  }
  else if(i %in% asia){
    ad2[(ad2$Country == i),]$Continent <- 'Asia'
  }
  else if(i %in% europe){
    ad2[(ad2$Country == i),]$Continent <- 'Europe'
  }
  else if(i %in% north.america){
    ad2[(ad2$Country == i),]$Continent <- 'North America'
  }
  else if(i %in% oceania){
    ad2[(ad2$Country == i),]$Continent <- 'Oceania'
  }
  else if(i %in% south.america){
    ad2[(ad2$Country == i),]$Continent <- 'South America'
  }
}

# Confirming that changes have been made.
head(ad2[ , c('Country', 'Continent')])
##      Country Continent
## 1    Tunisia    Africa
## 2      Nauru   Oceania
## 3 San Marino    Europe
## 4      Italy    Europe
## 5    Iceland    Europe
## 6     Norway    Europe
# Ensuring that all continent rows have been filled.
nrow(ad2[ad2$Continent == '',])
## [1] 264

264 rows have missing continents.

# Unique countries that have  missing continent values.
unique(ad2[ad2$Continent == '',]$Country)
##  [1] Myanmar                                            
##  [2] Palestinian Territory                              
##  [3] British Indian Ocean Territory (Chagos Archipelago)
##  [4] Korea                                              
##  [5] Tokelau                                            
##  [6] British Virgin Islands                             
##  [7] Bouvet Island (Bouvetoya)                          
##  [8] Aruba                                              
##  [9] Saint Helena                                       
## [10] Svalbard & Jan Mayen Islands                       
## [11] Christmas Island                                   
## [12] Turks and Caicos Islands                           
## [13] Norfolk Island                                     
## [14] Cook Islands                                       
## [15] Cote d'Ivoire                                      
## [16] Faroe Islands                                      
## [17] Montserrat                                         
## [18] Timor-Leste                                        
## [19] Puerto Rico                                        
## [20] Wallis and Futuna                                  
## [21] Jersey                                             
## [22] Antarctica (the territory South of 60 deg S)       
## [23] Hong Kong                                          
## [24] Western Sahara                                     
## [25] Czech Republic                                     
## [26] Guernsey                                           
## [27] Martinique                                         
## [28] Falkland Islands (Malvinas)                        
## [29] Saint Martin                                       
## [30] United States Minor Outlying Islands               
## [31] Gibraltar                                          
## [32] Holy See (Vatican City State)                      
## [33] Mayotte                                            
## [34] Guam                                               
## [35] Kyrgyz Republic                                    
## [36] Brunei Darussalam                                  
## [37] Taiwan                                             
## [38] Saint Pierre and Miquelon                          
## [39] French Southern Territories                        
## [40] Greenland                                          
## [41] Guadeloupe                                         
## [42] French Guiana                                      
## [43] Northern Mariana Islands                           
## [44] American Samoa                                     
## [45] New Caledonia                                      
## [46] United States of America                           
## [47] Niue                                               
## [48] Pitcairn Islands                                   
## [49] Anguilla                                           
## [50] Libyan Arab Jamahiriya                             
## [51] Saint Barthelemy                                   
## [52] Reunion                                            
## [53] Burkina Faso                                       
## [54] Heard Island and McDonald Islands                  
## [55] Netherlands Antilles                               
## [56] French Polynesia                                   
## [57] Lao People's Democratic Republic                   
## [58] Isle of Man                                        
## [59] Macao                                              
## [60] United States Virgin Islands                       
## [61] Cayman Islands                                     
## [62] Syrian Arab Republic                               
## [63] Slovakia (Slovak Republic)                         
## [64] South Georgia and the South Sandwich Islands       
## [65] Bermuda                                            
## 237 Levels: Afghanistan Albania Algeria American Samoa Andorra ... Zimbabwe

Some countries and territories are named differently in, or are absent from, the continents dataset. These will be assigned a continent manually.

# Continent vectors and their countries
africa2 <- c('Saint Helena', 'Cote d\'Ivoire', 'Mayotte', 
             'Burkina Faso', 'Western Sahara', 'Libyan Arab Jamahiriya', 'Reunion')
asia2 <- c('Myanmar', 'British Indian Ocean Territory (Chagos Archipelago)', 
           'Christmas Island', 'Hong Kong', 'Kyrgyz Republic', 'Taiwan',
           'Lao People\'s Democratic Republic', 'Macao', 'Palestinian Territory',
           'Korea', 'Timor-Leste', 'Brunei Darussalam', 'Syrian Arab Republic')
europe2 <- c('Jersey', 'Czech Republic', 'Gibraltar',
            'Slovakia (Slovak Republic)', 'Svalbard & Jan Mayen Islands', 
            'Faroe Islands', 'Guernsey', 'Holy See (Vatican City State)',
            'Isle of Man')
oceania2 <- c('Tokelau', 'Norfolk Island', 'Northern Mariana Islands', 
              'New Caledonia', 'Niue', 'Cook Islands', 'Wallis and Futuna', 
              'Guam', 'American Samoa', 'Pitcairn Islands', 'French Polynesia')
north.america2 <- c('Montserrat', 'Puerto Rico', 'Martinique', 'Saint Martin',
                    'Guadeloupe', 'Anguilla', 'Saint Barthelemy', 'Cayman Islands',
                    'Bermuda','British Virgin Islands', 'Turks and Caicos Islands',
                    'United States Minor Outlying Islands', 
                    'Saint Pierre and Miquelon', 'Greenland', 
                    'United States of America', 'United States Virgin Islands')
south.america2 <- c('Netherlands Antilles', 'Aruba', 'Falkland Islands (Malvinas)',
                    'French Guiana')
antarctica <- c('French Southern Territories', 'Bouvet Island (Bouvetoya)', 
                'Antarctica (the territory South of 60 deg S)',
                'Heard Island and McDonald Islands',
                'South Georgia and the South Sandwich Islands')
# Filling in missing continent records
for (i in country){
  if (i %in% africa2){
    ad2[(ad2$Country == i),]$Continent <- 'Africa'
  }
  else if(i %in% asia2){
    ad2[(ad2$Country == i),]$Continent <- 'Asia'
  }
  else if(i %in% europe2){
    ad2[(ad2$Country == i),]$Continent <- 'Europe'
  }
  else if(i %in% north.america2){
    ad2[(ad2$Country == i),]$Continent <- 'North America'
  }
  else if(i %in% oceania2){
    ad2[(ad2$Country == i),]$Continent <- 'Oceania'
  }
  else if(i %in% south.america2){
    ad2[(ad2$Country == i),]$Continent <- 'South America'
  }
  else if(i %in% antarctica){
    ad2[(ad2$Country == i),]$Continent <- 'Antarctica'
  }
}


# Ensuring that all continent rows have been filled.
nrow(ad2[ad2$Continent == '',])
## [1] 0

All continent rows have been filled according to their respective countries.
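
The two fill loops can also be expressed as a single table lookup. The sketch below is an assumption about how that might look, using small made-up tables (`lookup`, `extra`, `records`) in place of the real datasets. `match()` returns the position of each country in the lookup, or `NA` when it is not found, so the unmatched names fall out directly.

```r
# Toy lookup table in the same shape as the Countries-Continents file.
lookup <- data.frame(
  Continent = c("Africa", "Europe", "Oceania"),
  Country = c("Tunisia", "Italy", "Nauru"),
  stringsAsFactors = FALSE
)

# Manual additions for names the lookup does not contain (illustrative).
extra <- data.frame(
  Continent = "Europe",
  Country = "Czech Republic",
  stringsAsFactors = FALSE
)
lookup <- rbind(lookup, extra)

# Toy records; "Atlantis" is intentionally absent from the lookup.
records <- data.frame(
  Country = c("Tunisia", "Nauru", "Czech Republic", "Atlantis"),
  stringsAsFactors = FALSE
)

# Look up each record's continent; countries not found become NA.
records$Continent <- lookup$Continent[match(records$Country, lookup$Country)]

# Countries that still need a manual mapping.
unmatched <- unique(records$Country[is.na(records$Continent)])
unmatched
```

If the country column is a factor, as it is in this analysis, wrapping it in `as.character()` before the lookup keeps the comparison explicit.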

5. Descriptive Analysis

a) Univariate Analysis

Categorical

# Data set of categorical columns
  # Selecting categorical columns
cat_cols <- unlist(lapply(ad, is.factor))
cat_df <- ad[, ..cat_cols]
names(cat_df)
## [1] "Ad_Topic_Line" "City"          "Male"          "Country"      
## [5] "Clicked_on_Ad"

Ad Topic Line

# Ad Topic Line

# Checking unique values
nrow(unique(cat_df[, 1]))
## [1] 1000

The topic lines are all unique; therefore, they will not be plotted.

City

# Checking unique values.
nrow(unique(cat_df[,2]))
## [1] 969

There are 969 unique cities. Therefore, the ads are displayed to users in numerous cities.

# Frequency table function
frequencies <- function(col_no, col){
  # Counting the occurrences of each value with plyr::count
  freq <- lapply(cat_df[, ..col_no], count)
  # Converting list to data frame
  a <- data.frame(freq)
  # Only selecting values with a frequency > 1
  a <- a[a[, col] > 1,]
  # Ordering by frequency
  high_f <- a[order(a[, col], decreasing = TRUE),]
  high_f
}
# City Frequencies
head(frequencies(2, 'City.freq'))
##              City.x City.freq
## 427       Lisamouth         3
## 956    Williamsport         3
## 31  Benjaminchester         2
## 158       East John         2
## 183    East Timothy         2
## 307        Johnstad         2

Lisamouth and Williamsport had the highest frequencies in the dataset, with 3 records each.

Checking if these cities belong to one country.

Lisamouth

# Checking if these cities belong to one country.
ad2[ad2$City == 'Lisamouth',  c('Country', 'Continent')]
##       Country     Continent
## 236    Norway        Europe
## 778   Bolivia South America
## 830 Indonesia          Asia

Williamsport

# Checking if these cities belong to one country.
ad2[ad2$City == 'Williamsport', c('Country', 'Continent')]
##              Country Continent
## 429 Papua New Guinea   Oceania
## 572            India      Asia
## 809 Marshall Islands   Oceania

Cities with the highest frequency belong to multiple countries, therefore, the cities will not be used in the study. Location information will be derived from the country and continent columns.

Country

# Country frequencies
head(frequencies(4, 'Country.freq'))
##         Country.x Country.freq
## 55 Czech Republic            9
## 71         France            9
## 1     Afghanistan            8
## 13      Australia            8
## 54         Cyprus            8
## 81         Greece            8

The Czech Republic and France have the highest frequencies.
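
Frequency tables like the ones for City and Country can also be produced with data.table alone, without the plyr dependency. This is a minimal sketch on a made-up table (`toy`), offered as an alternative and not as the method used above.

```r
library(data.table)

# Toy country column (values are illustrative only).
toy <- data.table(
  Country = c("France", "Greece", "France", "Cyprus", "France", "Greece")
)

# .N counts rows per group; order(-N) sorts by descending frequency.
freq <- toy[, .N, by = Country][order(-N)]
freq
```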

Continent

# Count plot and normal bar plot function.
bar.plt <- function(data, col1, title, legend, colors, method, col2 = NULL){
  if (method == 'count'){
    ggplot(data, aes(x = {{col1}}, fill = {{col1}})) + geom_bar() +
    ggtitle(paste(title, 'Frequency Plot')) + 
    theme(plot.title = element_text(hjust = 0.5))+
    scale_fill_manual(legend, values = colors)}
  else if (method == 'bar'){
    ggplot(data, aes(x = {{col1}}, y = {{col2}}, fill = {{col1}})) + geom_bar(stat = 'identity') + ggtitle(paste(title, 'Bar Plot')) + 
    theme(plot.title = element_text(hjust = 0.5))+
    scale_fill_manual(legend, values = colors)}
  
}
# Continent column bar plot
bar.plt (ad2, Continent, title = 'Continent Column', method = 'count', legend = 
           'Continent', colors = c('Africa' = '#0276AB', 'Antarctica' = '#026592',
                                   'Asia' = '#02557A', 'Europe' = '#014462',
                                   'North America' = '#013349', 'Oceania' = '#012231',
                                   'South America' = '#001118'), col2 = NULL)

Europe, Africa and Asia had the highest frequency.

# Gender column
bar.plt (cat_df, Male, title = 'Male Column', legend = 'Gender',
         colors = c("0" = '#757C88', "1" = "#1338BE"), method = 'count', col2 = NULL)

The frequency of female observations/records is higher.

# Clicked on ad column bar plot
bar.plt (cat_df, Clicked_on_Ad, col2 = NULL, method = 'count',
         title = 'Clicked on Ad Column', legend = 'Clicked',
         colors = c("0" = '#757C88', "1" = "#1338BE"))

The label/target column has balanced classes.
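
The balance can also be stated as proportions. The sketch below rebuilds the target column from the class counts reported by `summary(ad)` earlier (500 records per class), so it runs without the dataset; on the real data the equivalent call would be `prop.table(table(ad$Clicked_on_Ad))`.

```r
# Class counts taken from the summary() output: 500 records per class.
clicked <- factor(rep(c(0, 1), each = 500))

# Share of records in each class.
prop.table(table(clicked))
```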

Year

# Dataset copy 
ad3 <- data.frame(ad2)
# Unique years
year<- format(ad2$Timestamp, format="%Y")
unique(year)
## [1] "2016"
# Creating a year column
ad3$Year <- year

The data only contains records from 2016.

Months

# Unique months
month <- format(ad2$Timestamp, format="%m")
# Adding a month column
ad3$Month <- month
sort(unique(month))
## [1] "01" "02" "03" "04" "05" "06" "07"

The data was only taken during the first 7 months of 2016.

# Month Column
bar.plt (ad3, Month, title = 'Months', legend = 'Month',
         colors = c('01' = '#0276AB', '02' = '#026592',
                                   '03' = '#02557A', '04' = '#014462',
                                   '05' = '#013349', '06' = '#012231',
                                   '07' = '#001118'), method = 'count', col2 = NULL)

The months of February, March and January had the highest frequencies.

Day of the Week

# Adding a weekday column
weekday <- wday(ad3$Timestamp)
ad3$Weekday <- weekday
# Converting to a factor for easy manipulation
ad3$Weekday <- as.factor(ad3$Weekday)
sort(unique(ad3$Weekday))
## [1] 1 2 3 4 5 6 7
## Levels: 1 2 3 4 5 6 7
# Weekday Column
bar.plt (ad3, Weekday, title = 'Day of the Week', legend = 'Day',
         colors = c('1' = '#0276AB', '2' = '#026592', '3' = '#02557A',
                    '4' = '#014462', '5' = '#013349', '6' = '#012231', 
                    '7' = '#001118'), method = 'count', col2 = NULL)

data.table's wday() numbers the days from Sunday, so 1 indicates Sunday and 7 indicates Saturday.

Sunday (1) has the highest frequency, followed by Wednesday (4) then Friday (6). Tuesday (3) had the lowest.

# Adding day and hour columns to the new dataset
# Days column
day <- format(ad2$Timestamp, format="%d")
ad3$Day <- day
# Hour column
hour <- format(ad2$Timestamp, format="%H")
ad3$Hour <- hour
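
The timestamp-derived columns can also be built with data.table's date helpers, which return integers instead of the zero-padded strings produced by `format()`. The sketch below is a hedged illustration on a single timestamp taken from the data preview. The table name `features` and the labels in `day_labels` are assumptions made for this example.

```r
library(data.table)

# One timestamp from the data preview; 2016-03-27 was a Sunday.
ts <- as.POSIXct("2016-03-27 00:53:11", tz = "UTC")

features <- data.table(
  Year = year(ts),
  Month = month(ts),
  Day = mday(ts),
  Hour = hour(ts),
  Weekday = wday(ts)  # data.table convention: 1 = Sunday, ..., 7 = Saturday
)

# Attach readable labels so plots do not depend on remembering the encoding.
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
features[, Weekday := factor(Weekday, levels = 1:7, labels = day_labels)]
features
```

Labelling the weekday factor at creation time avoids off-by-one misreadings when interpreting the bar plot.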

Numerical

# Mode function
mode <- function(col, data) {
   unique.value <- unique(data[ , c(col)])
   unique.value[which.max(tabulate(match(data[ , c(col)], unique.value)))]
   
}

central.tendency <- function(col, data){
  cat('Measures of Central Tendency \n')
  # Mean
  cat('Mean = ', mean(data[, col]), '\n')
  # Median
  cat('Median = ', median(data[,col]), '\n')
  # Mode
  cat('Mode = ', mode(col, data), '\n')
  
}

dispersion <- function(col, data){
  # Range
  cat('Range = ', min(data[ ,col]), '-', max(data[ ,col]), '\n')
  # IQR
  cat('IQR = ', IQR(data[ ,col]), '\n')
  # Variance
  cat('Variance = ', var(data[ ,col]), '\n')
  # Standard Deviation
  cat('Standard Deviation = ', sd(data[ ,col]), '\n')
  # Skewness
  cat('Skewness = ', skewness(data[ ,col]), '\n')
  # Kurtosis
  cat('Kurtosis = ', kurtosis(data[ ,col]), '\n')
}
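
When reading the skewness and kurtosis values below, note that `moments::kurtosis()` reports Pearson's kurtosis, for which a normal distribution scores 3, and not excess kurtosis. The sketch below spells out the moment definitions in base R and adds a small labelling helper. The function names and the tolerance of 0.25 are assumptions made for illustration, not a standard.

```r
# Pearson's moment coefficients, matching the definitions used by
# moments::skewness() and moments::kurtosis() (population moments).
moment_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
moment_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Label a kurtosis value relative to the normal distribution's value of 3.
# The tolerance of 0.25 is an arbitrary choice for this illustration.
kurtosis_label <- function(k, tol = 0.25) {
  if (k > 3 + tol) {
    "leptokurtic"
  } else if (k < 3 - tol) {
    "platykurtic"
  } else {
    "mesokurtic"
  }
}

kurtosis_label(1.903942)  # Daily Time Spent on Site
kurtosis_label(2.894694)  # Area Income
```

Values below 3 indicate lighter tails than a normal distribution (platykurtic), and values above 3 indicate heavier tails (leptokurtic).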

Daily Time Spent on Site

# Measures of central tendency
central.tendency(names(num_df)[1], ad2)
## Measures of Central Tendency 
## Mean =  65.0002 
## Median =  68.215 
## Mode =  62.26

The average time spent on the site is 65 minutes. The median is greater than the mean, therefore, the distribution is negatively skewed.

# Measures of dispersion
dispersion(names(num_df)[1], ad2)
## Range =  32.6 - 91.43 
## IQR =  27.1875 
## Variance =  251.3371 
## Standard Deviation =  15.85361 
## Skewness =  -0.3712026 
## Kurtosis =  1.903942
  • The skewness value is negative, corroborating the previous observation. The bulk of users therefore sit in the upper part of the range, with a tail towards shorter visits.
  • The distribution is also platykurtic.
# Daily Time Spent on Site Histogram
hist(num_df$Daily_Time_Spent_on_Site, 
     main = 'Histogram of Daily Time Spent on Site', 
     xlab = 'Daily Time Spent on Site')

The most frequent bin in the histogram is around 80 minutes.

Age

# Measures of Central Tendency
central.tendency('Age', ad2)
## Measures of Central Tendency 
## Mean =  36.009 
## Median =  35 
## Mode =  31

The average age of users is 36. The distribution has a positive skew as median < mean.

# Measures of Dispersion
dispersion('Age', ad2)
## Range =  19 - 61 
## IQR =  13 
## Variance =  77.18611 
## Standard Deviation =  8.785562 
## Skewness =  0.4784227 
## Kurtosis =  2.595482
  • From the skewness value, the column distribution is fairly symmetrical, with a slight positive skew.
  • The kurtosis value (2.60) is below 3, so the distribution is slightly platykurtic, though close to mesokurtic.
# Age Histogram
hist(num_df$Age, main = 'Histogram of Age', xlab = 'Age')

The most frequent ages are within the range of 25 to 35.

Area Income

# Measures of central tendency
central.tendency('Area_Income', ad2)
## Measures of Central Tendency 
## Mean =  55000 
## Median =  57012.3 
## Mode =  61833.9

The average area income is 55,000. The distribution has a negative skew as median > mean.

# Measures of Dispersion
dispersion('Area_Income', ad2)
## Range =  13996.5 - 79484.8 
## IQR =  18438.83 
## Variance =  179952406 
## Standard Deviation =  13414.63 
## Skewness =  -0.6493967 
## Kurtosis =  2.894694
  • From the skewness value (-0.65), the column distribution has a moderate negative skew.
  • The kurtosis value (2.89) is just below 3, so the distribution is approximately mesokurtic.
# Area Income Histogram
hist(num_df$Area_Income, main = 'Histogram of Area Income', xlab = 'Area Income')

The distribution leans towards higher area incomes, with 65,000 having the highest frequency.

Daily Internet Usage

# Measures of central tendency
central.tendency('Daily_Internet_Usage', ad2)
## Measures of Central Tendency 
## Mean =  180.0001 
## Median =  183.13 
## Mode =  167.22

The average daily internet usage is 180 minutes. The distribution has a negative skew as mean < median.

# Measures of Dispersion
dispersion('Daily_Internet_Usage', ad2)
## Range =  104.78 - 269.96 
## IQR =  79.9625 
## Variance =  1927.415 
## Standard Deviation =  43.90234 
## Skewness =  -0.03348703 
## Kurtosis =  1.727701
  • From the skewness value, the column distribution is almost perfectly symmetrical, with a negligible negative skew.
  • The kurtosis value (1.73) is well below 3, so the distribution is platykurtic.
# Daily Internet Usage Histogram
hist(num_df$Daily_Internet_Usage, main = 'Histogram of Daily Internet Usage',
     xlab = 'Daily Internet Usage')

The most common daily internet usage is around 125 minutes.

Summary

The univariate analysis provided insights into the distribution of the data. To derive the characteristics of users who clicked on the previous course ads, bivariate analysis will be done.

b) Bivariate Analysis

Categorical-Categorical

Clicked on Ad Vs Continent

# Target columns
Continent <- ad2$Continent
Clicked_on_Ad <- ad2$Clicked_on_Ad

# Contingency table
contingency.table <- table(Clicked_on_Ad, Continent)
contingency.table
##              Continent
## Clicked_on_Ad Africa Antarctica Asia Europe North America Oceania South America
##             0    105         10  109    114            74      50            38
##             1    109          8  101    115            84      52            31
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Continent', ylab='Clicked on Ad',
           main='Clicked on Ad Vs Continent', color = '#1338BE', las = 1)

Europe had the highest number of clicks (115), followed by Africa (109), then Asia (101).
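Raw click counts also depend on how many users each continent contributes, so one possible complement is the share of users per continent who clicked. The sketch below rebuilds the contingency table from the counts printed above and uses `prop.table()`; the object names `counts` and `click.share` are chosen for this example only.

```r
# Counts copied from the contingency table printed above.
continents <- c('Africa', 'Antarctica', 'Asia', 'Europe',
                'North America', 'Oceania', 'South America')
counts <- matrix(c(105, 10, 109, 114, 74, 50, 38,
                   109,  8, 101, 115, 84, 52, 31),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Clicked_on_Ad = c('0', '1'),
                                 Continent = continents))

# Proportions within each continent (every column sums to 1).
click.share <- prop.table(counts, margin = 2)

# Share of users in each continent who clicked on the ad.
round(click.share['1', ], 3)
```

Reading the shares next to the counts helps separate continents with many users from continents whose users click more often.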

Clicked on Ad Vs Gender

# Target columns
Gender <- ad2$Male
Clicked_on_Ad <- ad2$Clicked_on_Ad

# Contingency table
contingency.table <- table(Gender, Clicked_on_Ad)
contingency.table
##       Clicked_on_Ad
## Gender   0   1
##      0 250 269
##      1 250 231
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Gender', ylab='Clicked on Ad',
           main='Clicked on Ad Vs Gender', color = '#1338BE', las = 1)

Females clicked more on the ads (269 clicks against 231), but by a small margin (38).
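Whether a margin of this size is more than chance variation can be checked with a chi-squared test of independence. The following is a minimal sketch using base R's `chisq.test()` on the counts from the table above; the object names are assumptions for this example, and the test is offered as an optional check rather than part of the original analysis.

```r
# Counts copied from the Gender x Clicked_on_Ad table printed above.
gender.counts <- matrix(c(250, 269,
                          250, 231),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(Gender = c('0', '1'),
                                        Clicked_on_Ad = c('0', '1')))

# Chi-squared test of independence (Yates' correction is applied
# by default for 2 x 2 tables).
gender.test <- chisq.test(gender.counts)
gender.test$statistic
gender.test$p.value
```

A large p-value would suggest that the difference between the genders is within what chance alone could produce.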

Clicked on Ad Vs Month

# Target columns
Month <- ad3$Month
Clicked_on_Ad <- ad3$Clicked_on_Ad

# Contingency table
contingency.table <- table(Clicked_on_Ad, Month)
contingency.table
##              Month
## Clicked_on_Ad 01 02 03 04 05 06 07
##             0 78 77 82 73 68 71 51
##             1 69 83 74 74 79 71 50
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Month', ylab='Clicked on Ad',
           main='Clicked on Ad Vs Month', color = '#1338BE', las = 1)

The months of February (83), May (79), March (74) and April (74) had the highest number of ad clicks.

Clicked on Ad Vs Day of Week

# Target columns
Day <- ad3$Weekday
Clicked_on_Ad <- ad3$Clicked_on_Ad

# Contingency Table
contingency.table <- table(Clicked_on_Ad, Day)
contingency.table
##              Day
## Clicked_on_Ad  1  2  3  4  5  6  7
##             0 79 68 67 77 63 84 62
##             1 80 72 55 79 79 71 64
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Day', ylab='Clicked on Ad',
           main='Clicked on Ad Vs Day of Week', color = '#1338BE', las = 1)

The ads were mostly clicked on Monday (80), Thursday (79) and Friday (79).
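Reading day 1 as Monday depends on how the Weekday column was derived, which is not shown in this section. As a hedged reminder, the sketch below shows how two common base R format codes number the days; the two dates are arbitrary reference dates picked for this example.

```r
# Two arbitrary reference dates, used only to show the numbering schemes.
monday <- as.Date('2016-01-04')
sunday <- as.Date('2016-01-03')

# '%u': ISO 8601 weekday number, Monday = 1 ... Sunday = 7
format(c(monday, sunday), '%u')

# '%w': weekday number with Sunday = 0 ... Saturday = 6
format(c(monday, sunday), '%w')

# POSIXlt stores the same 0-based, Sunday-first value in $wday
as.POSIXlt(c(monday, sunday))$wday
```

If the column was built with a Sunday-first scheme, the day labels in the interpretation would shift accordingly.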

Clicked on Ad Vs Hour

# Target columns
Hour <- ad3$Hour
Clicked_on_Ad <- ad3$Clicked_on_Ad

# Contingency table
contingency.table <- table(Clicked_on_Ad, Hour)
contingency.table
##              Hour
## Clicked_on_Ad 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21
##             0 19 16 19 19 21 23 16 28 22 21 17 16 22 21 22 16 23 18 16 20 26 29
##             1 26 16 17 23 21 21 23 26 21 28 14 24 16 21 21 19 16 23 25 19 24 19
##              Hour
## Clicked_on_Ad 22 23
##             0 24 26
##             1 19 18

The highest ad clicks occurred at 9 AM (28), 12 AM (26), 7 AM (26) and 6 PM (25), in decreasing order.
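With 24 columns, the busiest hours are easier to pick out programmatically than by eye. A minimal sketch, using the click counts (row 1) from the table above; the names `clicks` and `top.hours` are assumptions for this example.

```r
# Clicks per hour (row "1" of the table above), hours 00 to 23.
clicks <- c(26, 16, 17, 23, 21, 21, 23, 26, 21, 28, 14, 24,
            16, 21, 21, 19, 16, 23, 25, 19, 24, 19, 19, 18)
names(clicks) <- sprintf('%02d', 0:23)

# Four hours with the most clicks, largest first.
top.hours <- head(sort(clicks, decreasing = TRUE), 4)
top.hours
```

The same pattern applies to the month and weekday tables by swapping in their counts.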

Numerical-Numerical

# Scatter plot and correlation function
scatter.plt <- function(col1, col2, corr1, corr2, data, title){
  # Scatter plot
  plt <- ggplot(data, aes(x = {{col1}}, y = {{col2}})) + 
    geom_point(color = '#281E5D') + ggtitle(paste(title, 'Scatter Plot')) + 
    theme(plot.title = element_text(hjust = 0.5))
  # Correlation, computed on the data argument rather than the global ad2
  correlation <- cor(data[[corr1]], data[[corr2]])
  plot(plt)
  print(paste0('Correlation = ', correlation, '.'))
}

Age Vs Area Income

# Scatter plot and correlation
scatter.plt(Area_Income, Age, data = ad2, corr1 = 'Area_Income', 
            corr2 = 'Age', 'Age Vs Area Income')

## [1] "Correlation = -0.182604955032622."

From the scatter plot and the correlation value, it can be seen that the age and area income have a very weak and negative correlation.

Age Vs Daily Time Spent on Site

# Scatter plot and correlation
scatter.plt(Daily_Time_Spent_on_Site, Age, data = ad2, 
            corr1 = 'Daily_Time_Spent_on_Site', 
            corr2 = 'Age', 'Age Vs Daily Time Spent on Site')

## [1] "Correlation = -0.331513342786584."

There is a weak and negative correlation between the time spent on the site and the age of the user.

Age Vs Daily Internet Usage

# Scatter plot and correlation
scatter.plt(Daily_Internet_Usage, Age, data = ad2, 
            corr1 = 'Daily_Internet_Usage', 
            corr2 = 'Age', 'Age Vs Daily Internet Usage')

## [1] "Correlation = -0.367208560147359."

The correlation between the age of the user and the daily internet usage is also weak and negative.

Daily Time Spent on Site Vs Area Income

# Scatter plot and correlation
scatter.plt(Daily_Time_Spent_on_Site, Area_Income, data = ad2, 
            corr1 = 'Daily_Time_Spent_on_Site', corr2 = 'Area_Income',
            'Daily Time Spent on Site Vs Area Income')

## [1] "Correlation = 0.310954412522883."

There is a weak and positive correlation between the daily time spent on the site and the area income.
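A weak correlation can still be statistically distinguishable from zero when the sample is large. The sketch below converts the four coefficients reported above into t statistics and two-sided p-values using the standard relation t = r * sqrt(n - 2) / sqrt(1 - r^2). The sample size n = 1000 is an assumption inferred from the contingency table totals, and the labels are chosen for this example.

```r
# Assumption: n = 1000 rows, inferred from the contingency table totals.
n <- 1000

# Pearson correlations reported above.
r <- c(Age_vs_Area_Income          = -0.182604955032622,
       Age_vs_Time_on_Site         = -0.331513342786584,
       Age_vs_Internet_Usage       = -0.367208560147359,
       Time_on_Site_vs_Area_Income =  0.310954412522883)

# t statistic and two-sided p-value for H0: correlation = 0.
t.stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p.value <- 2 * pt(-abs(t.stat), df = n - 2)

data.frame(r = round(r, 3), t = round(t.stat, 2), p.value = signif(p.value, 3))
```

With the original data at hand, `cor.test()` on the two columns gives the same test directly.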

# Line plot function
line.plt <- function(col1, col2, data, title){
  ggplot(data, aes(x = {{col1}}, y = {{col2}})) + 
    geom_line(color = '#281E5D') + ggtitle(paste(title, 'Line Plot')) +
    theme(plot.title = element_text(hjust = 0.5))
}

Daily Time Spent on Site Vs Month

line.plt(data = ad3, col1 =  Timestamp, col2 = Daily_Time_Spent_on_Site,
         title = 'Daily Time Spent on Site Vs Month')

The overall trend of daily time spent on the site is constant across the months. However, the daily time spent varies drastically within every month, which could be a result of hourly changes within each day.

Numerical-Categorical

# Bar plot of the median of the y-axis variable for each category
bar.plt.summary <- function(data, col1, col2, title, legend, colors){
  ggplot(data, mapping = aes(x = {{col1}}, y = {{col2}}, fill = {{col1}})) + 
    stat_summary(fun = median, geom = "bar") + ggtitle(paste(title, 'Bar Plot')) + 
    theme(plot.title = element_text(hjust = 0.5)) +
    scale_fill_manual(name = legend, values = colors)
}

The median will be used for the following plots as most of the distributions are skewed.
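To illustrate why the median is the safer summary for skewed columns, the sketch below compares group medians and group means with `tapply()`. The `toy` data frame is made up for this example and is not taken from advertising.csv.

```r
# Made-up rows for illustration only; not taken from advertising.csv.
toy <- data.frame(
  Clicked_on_Ad = factor(c(0, 0, 0, 1, 1, 1)),
  Age = c(20, 30, 70, 35, 40, 45)
)

# Median and mean of Age within each Clicked_on_Ad group.
group.median <- tapply(toy$Age, toy$Clicked_on_Ad, median)
group.mean   <- tapply(toy$Age, toy$Clicked_on_Ad, mean)

group.median
group.mean
```

In the toy data the single large value in group 0 pulls its mean up to match group 1, while the medians still show the two groups apart.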

Clicked on Ad Vs Age

# Bar plot
bar.plt.summary(ad2, Clicked_on_Ad, Age,'Clicked on Ad Vs Age', 'Clicked on Ad',
                c("0" = '#757C88', "1" = "#1338BE"))

The median age of users who clicked on ads was 40, which was higher than those who didn’t click on the ads (31).

Clicked on Ad Vs Daily Time Spent on Site

# Bar plot
bar.plt.summary(ad2, Clicked_on_Ad, Daily_Time_Spent_on_Site,
                'Clicked on Ad Vs Daily Time Spent on Site', 'Clicked on Ad',
                c("0" = '#757C88', "1" = "#1338BE"))

  • The median time spent on the site by people who did not click on the ads was higher (~79 minutes).
  • Those who clicked on the ads spent less time on the site.

Clicked on Ad Vs Daily Internet Usage

# Bar plot
bar.plt.summary(ad2, Clicked_on_Ad, Daily_Internet_Usage,
                'Clicked on Ad Vs Daily Internet Usage', 'Clicked on Ad',
                c("0" = '#757C88', "1" = "#1338BE"))

  • Most people who spent more time on the internet did not click the ads.
  • Those who clicked the ads spent a median of 148 minutes on the internet.

Clicked on Ad Vs Area Income

# Bar plot
bar.plt.summary(ad2, Clicked_on_Ad, Area_Income,'Clicked on Ad Vs Area Income',
                'Clicked on Ad', c("0" = '#757C88', "1" = "#1338BE"))

Users who did not click on the ads had a higher median area income than those who did.

Day of the Week Vs Daily Time Spent on Site

# Bar plot
bar.plt.summary (ad3, Weekday, title = 'Day of the Week Vs Daily Time Spent on Site',
         legend = 'Day', colors = c('1' = '#0276AB', '2' = '#026592',
                                    '3' = '#02557A', '4' = '#014462',
                                    '5' = '#013349', '6' = '#012231', 
                                    '7' = '#001118'), 
         col2 = Daily_Time_Spent_on_Site)

Users spend more time on the site on Monday and Wednesday, followed by Saturday and Sunday, then Tuesday.

Summary

Factors that determine whether a user clicked on an ad were successfully identified from the data. For a more in-depth understanding of user action, more demographic information can be collected in the future. This will provide a more granular understanding of what influences a user’s decision to click on ads.

c) Multivariate Analysis

Correlation matrix

# Visualize correlation matrix
ggcorrplot(cor(num_df), lab = TRUE, title = 'Correlation Heatmap',
           colors = c('#022D36', 'white', '#48AAAD'))

There is only moderate correlation between the daily internet usage and the time spent on the site.

Clicked on Ad Vs Age Vs Daily Time Spent on Site

# Multivariate Scatter plot
scatter.plt.multi <- function(data, col1, col2, col3, legend, colors){
  ggplot(data, aes(x = {{col1}}, y = {{col2}}, color = {{col3}}, 
                   shape = {{col3}})) + geom_point() + ggtitle('Scatter Plot') +
    theme(plot.title = element_text(hjust = 0.5)) + 
    scale_color_manual(values = colors) +
    # Use the legend argument as the title of the color and shape legend
    labs(color = legend, shape = legend)
}
# Function call
scatter.plt.multi(ad2, Daily_Time_Spent_on_Site, Age, Clicked_on_Ad,
                  'Clicked on Ad', colors = c('black', 'blue3'))

Across all ages, most people who clicked on the ads spent less time on the site. Those who stayed longer mostly did not click, and most of them are below 40.

Clicked on Ad Vs Area Income Vs Daily Time Spent on Site

# Function call
scatter.plt.multi(ad2, Daily_Time_Spent_on_Site, Area_Income, Clicked_on_Ad,
                  'Clicked on Ad', colors = c('black', 'blue3'))

  • People of varied area income who spent a lower amount of time on the site clicked on ads. More of them had an area income > 40,000.
  • People with an area income within the range of 50,000 - 80,000 who spent a higher amount of time on the site did not click on ads.

Summary

In this section, the relationships between various user habits and traits were analyzed to understand how they lead to a user’s choice to click, or not to click, on an ad. Similar to the bivariate conclusion, more granular data can be collected to provide a clearer view of the overall situation.

6. Implementing the Solution

The solution will be a summary of the analysis.

Analysis Summary

Bivariate Analysis

  1. Europe had the highest number of clicks (115), followed by Africa (109), then Asia (101).
  2. Females clicked more on the ads, but by a small margin (38).
  3. The median age of users who clicked on ads was 40, which was higher than those who didn’t click on the ads (31).
  4. The median time spent on the site by people who did not click on the ads was higher (~79 minutes). Those who clicked on the ads spent less time on the site.
  5. Most people who spent more time on the internet did not click the ads. Those who clicked the ads spent a median of 148 minutes on the internet.
  6. Users who did not click on the ads had a higher median area income than those who did.
  7. The months of February, May, March and April had the highest number of ad clicks.
  8. The ads were mostly clicked on Monday, Thursday and Friday.
  9. Users spend more time on the site on Monday and Wednesday, followed by Saturday and Sunday, then Tuesday.
  10. The highest ad clicks occurred at 9 AM, 12 AM, 7 AM and 6 PM, in decreasing order.
  11. The overall trend of daily time spent on the site is constant across the months. However, the daily time spent varies drastically within every month, which could be a result of hourly changes within each day.

Multivariate Analysis

  1. Across all ages, most people who clicked on the ads spent less time on the site. Those who stayed longer mostly did not click, and most of them are below 40.
  2. People of varied area income who spent a lower amount of time on the site clicked on ads. More of them had an area income > 40,000. People with an area income within the range of 50,000 - 80,000 who spent a higher amount of time on the site did not click on ads.

7. Challenging the Solution

The project was an analysis question. The solution will therefore not be challenged, as it is based on observations derived from the data.

8. Conclusion

In conclusion, from the analysis, the major factors that determine if a user will click an ad are the:

  1. Continent
  2. Gender
  3. Daily time spent on the site
  4. Area Income
  5. Time of day and month

9. Recommendations

The following are recommended for the targeted advertising:

  1. Prioritize users in European, African and Asian countries.
  2. Target both genders, as the margin between the genders in ad clicks is small.
  3. Invest more money in the months of February, May, March and April, as they had the highest number of ad clicks, meaning that interest spikes during these months. Further research can be done to determine why.
  4. Prioritize Monday, Thursday and Friday, as the ads were mostly clicked on these days.
  5. Prioritize ad display at 9 AM, 12 AM, 7 AM and 6 PM, the hours with the highest ad clicks in decreasing order. Concentrating on these times can also save money.
  6. Source more data to provide more insight into the factors that determine if a user will click an ad.

10. Follow Up Questions

a) Did we have the right data?

Yes, as the data was previously collected by the client in order to derive insights from previous ads that ran to advertise a related course on the same blog.

b) Do we need other data to answer our question?

More data on other user habits, such as the frequency of course enrollment, and on other demographics can be used to fine-tune the targeted advertising.

c) Did we have the right question?

Yes, as the aim of the project is to identify individuals who are likely to click on the ads, as per the client’s request.
