1. DEFINING THE QUESTION

1.1. Specifying The Objective

Identify, from the given dataset, which individuals are most likely to click on the ads.

1.2. The Metric For Success

Appropriate recommendations drawn from the analysis.

1.3. The Context

A Kenyan entrepreneur has created an online cryptography course and wants to advertise it on her blog. She currently targets audiences originating from various countries. In the past, she ran ads to advertise a related course on the same blog and collected data in the process. She would now like to employ your services as a Data Science Consultant to help her identify which individuals are most likely to click on her ads.

1.4. Experimental Design Taken

  1. Loading Data into RStudio.
  2. Checking the Data and Cleaning it.
  3. Conducting Univariate Analysis.
  4. Conducting Bivariate Analysis.
  5. Challenging the Solution.
  6. Recommendations.

1.5. Appropriateness Of The Available Data

The data was relevant and appropriate: it had no missing values and few outliers. Furthermore, the recorded variables describe internet activity, which suits a course that would be offered online.

2. DATA PREPARATION

Loading the Data and Previewing it

library(data.table)

ad = fread("~/Desktop/advertising.csv")
head(ad)
##    Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1:                    68.95  35    61833.90               256.09
## 2:                    80.23  31    68441.85               193.77
## 3:                    69.47  26    59785.94               236.50
## 4:                    74.15  29    54806.18               245.89
## 5:                    68.37  35    73889.99               225.58
## 6:                    59.99  23    59761.56               226.74
##                            Ad Topic Line           City Male    Country
## 1:    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2:    Monitored national standardization      West Jodi    1      Nauru
## 3:      Organic bottom-line service-desk       Davidton    0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5:         Robust logistical utilization   South Manuel    0    Iceland
## 6:       Sharable client-driven software      Jamieberg    1     Norway
##              Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11             0
## 2: 2016-04-04 01:39:02             0
## 3: 2016-03-13 20:35:42             0
## 4: 2016-01-10 02:31:19             0
## 5: 2016-06-03 03:36:18             0
## 6: 2016-05-19 14:30:17             0

Checking its Type

class(ad)
## [1] "data.table" "data.frame"

Taking a Glimpse of the data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
glimpse(ad)
## Rows: 1,000
## Columns: 10
## $ `Daily Time Spent on Site` <dbl> 68.95, 80.23, 69.47, 74.15, 68.37, 59.99, 8…
## $ Age                        <int> 35, 31, 26, 29, 35, 23, 33, 48, 30, 20, 49,…
## $ `Area Income`              <dbl> 61833.90, 68441.85, 59785.94, 54806.18, 738…
## $ `Daily Internet Usage`     <dbl> 256.09, 193.77, 236.50, 245.89, 225.58, 226…
## $ `Ad Topic Line`            <chr> "Cloned 5thgeneration orchestration", "Moni…
## $ City                       <chr> "Wrightburgh", "West Jodi", "Davidton", "We…
## $ Male                       <int> 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0…
## $ Country                    <chr> "Tunisia", "Nauru", "San Marino", "Italy", …
## $ Timestamp                  <dttm> 2016-03-27 00:53:11, 2016-04-04 01:39:02, …
## $ `Clicked on Ad`            <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1…

Summary Stats of the Data

summary(ad)
##  Daily Time Spent on Site      Age         Area Income    Daily Internet Usage
##  Min.   :32.60            Min.   :19.00   Min.   :13996   Min.   :104.8       
##  1st Qu.:51.36            1st Qu.:29.00   1st Qu.:47032   1st Qu.:138.8       
##  Median :68.22            Median :35.00   Median :57012   Median :183.1       
##  Mean   :65.00            Mean   :36.01   Mean   :55000   Mean   :180.0       
##  3rd Qu.:78.55            3rd Qu.:42.00   3rd Qu.:65471   3rd Qu.:218.8       
##  Max.   :91.43            Max.   :61.00   Max.   :79485   Max.   :270.0       
##  Ad Topic Line          City                Male         Country         
##  Length:1000        Length:1000        Min.   :0.000   Length:1000       
##  Class :character   Class :character   1st Qu.:0.000   Class :character  
##  Mode  :character   Mode  :character   Median :0.000   Mode  :character  
##                                        Mean   :0.481                     
##                                        3rd Qu.:1.000                     
##                                        Max.   :1.000                     
##    Timestamp                   Clicked on Ad
##  Min.   :2016-01-01 02:52:10   Min.   :0.0  
##  1st Qu.:2016-02-18 02:55:42   1st Qu.:0.0  
##  Median :2016-04-07 17:27:29   Median :0.5  
##  Mean   :2016-04-10 10:34:06   Mean   :0.5  
##  3rd Qu.:2016-05-31 03:18:14   3rd Qu.:1.0  
##  Max.   :2016-07-24 00:22:16   Max.   :1.0

Checking Number of Rows and Columns

dim(ad)
## [1] 1000   10
cat("Rows:", nrow(ad), "\nCols:", ncol(ad))
## Rows: 1000 
## Cols: 10

3. DATA CLEANING

Changing column names to lowercase for easier manipulation

colnames(ad) = tolower(colnames(ad))
colnames(ad)
##  [1] "daily time spent on site" "age"                     
##  [3] "area income"              "daily internet usage"    
##  [5] "ad topic line"            "city"                    
##  [7] "male"                     "country"                 
##  [9] "timestamp"                "clicked on ad"

Replacing the spaces in column names for easier manipulation

library(stringr)
colnames(ad) = str_replace_all(colnames(ad), c(' ' = '_'))
colnames(ad)
##  [1] "daily_time_spent_on_site" "age"                     
##  [3] "area_income"              "daily_internet_usage"    
##  [5] "ad_topic_line"            "city"                    
##  [7] "male"                     "country"                 
##  [9] "timestamp"                "clicked_on_ad"
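
The two renaming steps above could also be combined into a single pass. A minimal sketch using base R only, run on a small stand-in data frame (the object `toy` and its values are illustrative assumptions, not part of the advertising data):

```r
# Toy data frame standing in for `ad`; only the column names matter here.
toy <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(toy) <- c("Daily Time Spent on Site", "Area Income", "Clicked on Ad")

# Lowercase the names, then replace every space with an underscore.
clean_names <- gsub(" ", "_", tolower(names(toy)), fixed = TRUE)
names(toy) <- clean_names
print(names(toy))
```

Doing both transformations in one expression avoids an intermediate state in which the names are lowercase but still contain spaces.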

Checking For Missing Data

colSums(is.na(ad))
## daily_time_spent_on_site                      age              area_income 
##                        0                        0                        0 
##     daily_internet_usage            ad_topic_line                     city 
##                        0                        0                        0 
##                     male                  country                timestamp 
##                        0                        0                        0 
##            clicked_on_ad 
##                        0

There seems to be no missing data.

Checking For Duplicates

anyDuplicated(ad)
## [1] 0

There are no duplicates.

4. EDA

4.1. UNIVARIATE

boxplot(ad$daily_time_spent_on_site)

boxplot(ad$age)

boxplot(ad$area_income)

There are a few outliers at the lower end of area income.
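
To see which values a boxplot flags rather than only where they sit, the outliers can be listed directly. A minimal sketch, assuming the usual 1.5 * IQR whisker rule and using an illustrative vector (`income` is a hypothetical stand-in, not the real area_income column):

```r
# Illustrative vector, not the real area_income column.
income <- c(14000, 18000, 47000, 52000, 55000, 57000,
            60000, 63000, 65000, 70000, 79000)

# boxplot.stats() applies the same 1.5 * IQR whisker rule that boxplot() draws;
# its `out` element holds the points lying beyond the whiskers.
outliers <- boxplot.stats(income)$out
print(outliers)
```

On the real data, the same call on ad$area_income would return the low-income points seen in the plot, which could then be inspected before deciding whether to keep them.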

boxplot(ad$daily_internet_usage)

boxplot(ad$male)

boxplot(ad$timestamp)

boxplot(ad$clicked_on_ad)

Means of Numeric Columns

numeric_columns = c("daily_time_spent_on_site", "age", "area_income", "daily_internet_usage", 
                    "male", "timestamp")

mean(ad$daily_time_spent_on_site)
## [1] 65.0002
mean(ad$age)
## [1] 36.009
mean(ad$area_income)
## [1] 55000
mean(ad$daily_internet_usage)
## [1] 180.0001
mean(ad$male)
## [1] 0.481
mean(ad$timestamp)
## [1] "2016-04-10 10:34:06 UTC"

Medians of Numeric Columns

median(ad$daily_time_spent_on_site)
## [1] 68.215
median(ad$age)
## [1] 35
median(ad$area_income)
## [1] 57012.3
median(ad$daily_internet_usage)
## [1] 183.13
median(ad$male)
## [1] 0
median(ad$timestamp)
## [1] "2016-04-07 17:27:29 UTC"

Modes of Numeric Columns

# We create the mode function that will perform our mode operation for us
# ---
# 
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

getmode(ad$daily_time_spent_on_site)
## [1] 62.26
getmode(ad$age)
## [1] 31
getmode(ad$area_income)
## [1] 61833.9
getmode(ad$daily_internet_usage)
## [1] 167.22
getmode(ad$male)
## [1] 0
getmode(ad$timestamp)
## [1] "2016-03-27 00:53:11 UTC"
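
A caution on reading these modes: when every value in a column is unique, which.max() returns the first position, so getmode() simply returns the first observation. That is probably why the area income and timestamp modes above match the first row of the data. A small sketch on illustrative vectors (not taken from the data set) shows the behaviour and one way to check for it:

```r
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Illustrative vectors, not taken from the data set.
all_unique  <- c(3.2, 1.5, 9.9)   # every value occurs once
with_repeat <- c(3.2, 1.5, 1.5)   # 1.5 occurs twice

print(getmode(all_unique))    # ties resolve to the first value seen
print(getmode(with_repeat))

# A highest count of 1 means the column has no meaningful mode.
print(max(table(all_unique)))
```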

Minimums of Numeric Columns

min(ad$daily_time_spent_on_site)
## [1] 32.6
min(ad$age)
## [1] 19
min(ad$area_income)
## [1] 13996.5
min(ad$daily_internet_usage)
## [1] 104.78
min(ad$male)
## [1] 0
min(ad$timestamp)
## [1] "2016-01-01 02:52:10 UTC"

Maximums of Numeric Columns

max(ad$daily_time_spent_on_site)
## [1] 91.43
max(ad$age)
## [1] 61
max(ad$area_income)
## [1] 79484.8
max(ad$daily_internet_usage)
## [1] 269.96
max(ad$male)
## [1] 1
max(ad$timestamp)
## [1] "2016-07-24 00:22:16 UTC"

Ranges of Numeric Columns

range(ad$daily_time_spent_on_site)
## [1] 32.60 91.43
range(ad$age)
## [1] 19 61
range(ad$area_income)
## [1] 13996.5 79484.8
range(ad$daily_internet_usage)
## [1] 104.78 269.96
range(ad$male)
## [1] 0 1
range(ad$timestamp)
## [1] "2016-01-01 02:52:10 UTC" "2016-07-24 00:22:16 UTC"

Quantiles of Numeric Columns

quantile(ad$daily_time_spent_on_site)
##      0%     25%     50%     75%    100% 
## 32.6000 51.3600 68.2150 78.5475 91.4300
quantile(ad$age)
##   0%  25%  50%  75% 100% 
##   19   29   35   42   61
quantile(ad$area_income)
##       0%      25%      50%      75%     100% 
## 13996.50 47031.80 57012.30 65470.64 79484.80
quantile(ad$daily_internet_usage)
##       0%      25%      50%      75%     100% 
## 104.7800 138.8300 183.1300 218.7925 269.9600
quantile(ad$male)
##   0%  25%  50%  75% 100% 
##    0    0    0    1    1
quantile(ad$timestamp)
##                        0%                       25%                       50% 
## "2016-01-01 02:52:10 UTC" "2016-02-18 02:55:42 UTC" "2016-04-07 17:27:29 UTC" 
##                       75%                      100% 
## "2016-05-31 03:18:14 UTC" "2016-07-24 00:22:16 UTC"

Variances of Numeric Columns

var(ad$daily_time_spent_on_site)
## [1] 251.3371
var(ad$age)
## [1] 77.18611
var(ad$area_income)
## [1] 179952406
var(ad$daily_internet_usage)
## [1] 1927.415
var(ad$male)
## [1] 0.2498889
var(ad$timestamp)
## [1] 2.590788e+13

Standard Deviations of Numeric Columns

sd(ad$daily_time_spent_on_site)
## [1] 15.85361
sd(ad$age)
## [1] 8.785562
sd(ad$area_income)
## [1] 13414.63
sd(ad$daily_internet_usage)
## [1] 43.90234
sd(ad$male)
## [1] 0.4998889
sd(ad$timestamp)
## [1] 5089978
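
The statistics above were computed one column at a time. As a more compact alternative, they could be gathered into a single table. A minimal sketch with sapply(), using a toy data frame whose values are illustrative assumptions:

```r
# Toy stand-in for the numeric columns of `ad`; values are illustrative.
toy <- data.frame(
  age                  = c(19, 29, 35, 42, 61),
  daily_internet_usage = c(104.8, 138.8, 183.1, 218.8, 270.0)
)

# One row per statistic, one column per variable.
stats <- sapply(toy, function(x) {
  c(mean = mean(x), median = median(x),
    min = min(x), max = max(x),
    var = var(x), sd = sd(x))
})
print(round(stats, 2))
```

The same pattern should work on the real data by selecting the numeric columns first; the timestamp column would need separate handling because its statistics are date-times rather than plain numbers.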

4.2. BIVARIATE

numeric_columns
## [1] "daily_time_spent_on_site" "age"                     
## [3] "area_income"              "daily_internet_usage"    
## [5] "male"                     "timestamp"

Covariances with daily time spent

cov(ad$daily_time_spent_on_site, ad$age)
## [1] -46.17415
cov(ad$daily_time_spent_on_site, ad$area_income)
## [1] 66130.81
cov(ad$daily_time_spent_on_site, ad$daily_internet_usage)
## [1] 360.9919
cov(ad$daily_time_spent_on_site, ad$male)
## [1] -0.1501864
cov(ad$daily_time_spent_on_site, ad$clicked_on_ad)
## [1] -5.933143

Covariances with age

cov(ad$age, ad$area_income)
## [1] -21520.93
cov(ad$age, ad$daily_internet_usage)
## [1] -141.6348
cov(ad$age, ad$male)
## [1] -0.09242142
cov(ad$age, ad$clicked_on_ad)
## [1] 2.164665

Covariances with area income

cov(ad$area_income, ad$daily_internet_usage)
## [1] 198762.5
cov(ad$area_income, ad$male)
## [1] 8.867509
cov(ad$area_income, ad$clicked_on_ad)
## [1] -3195.989

Covariances with daily internet usage

cov(ad$daily_internet_usage, ad$male)
## [1] 0.6147667
cov(ad$daily_internet_usage, ad$clicked_on_ad)
## [1] -17.27409

Covariances with male

cov(ad$male, ad$clicked_on_ad)
## [1] -0.00950951

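The sign of a covariance gives the direction of the relationship: positive values mean the two variables tend to increase together, negative values mean one tends to decrease as the other increases. The magnitude depends on the units of both variables, so covariances cannot be compared across pairs; correlations are used for that.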
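
As a worked check of how covariance relates to correlation, dividing a covariance by the two standard deviations reported in section 4.1 should reproduce the corresponding correlation coefficient. A minimal sketch using values printed in this report (the variable names are chosen here for illustration):

```r
# Values reported earlier in this analysis.
cov_time_age <- -46.17415   # cov(daily_time_spent_on_site, age)
sd_time      <- 15.85361    # sd(daily_time_spent_on_site)
sd_age       <- 8.785562    # sd(age)

# Correlation is covariance rescaled by both standard deviations,
# which removes the units and bounds the result between -1 and 1.
r_time_age <- cov_time_age / (sd_time * sd_age)
print(round(r_time_age, 4))
```

The result should agree with the cor() value for the same pair.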

Correlations with daily time spent

cor(ad$daily_time_spent_on_site, ad$age)
## [1] -0.3315133
cor(ad$daily_time_spent_on_site, ad$area_income)
## [1] 0.3109544
cor(ad$daily_time_spent_on_site, ad$daily_internet_usage)
## [1] 0.5186585
cor(ad$daily_time_spent_on_site, ad$male)
## [1] -0.01895085
cor(ad$daily_time_spent_on_site, ad$clicked_on_ad)
## [1] -0.7481166

Correlations with age

cor(ad$age, ad$area_income)
## [1] -0.182605
cor(ad$age, ad$daily_internet_usage)
## [1] -0.3672086
cor(ad$age, ad$male)
## [1] -0.02104406
cor(ad$age, ad$clicked_on_ad)
## [1] 0.4925313

Correlations with area income

cor(ad$area_income, ad$daily_internet_usage)
## [1] 0.3374955
cor(ad$area_income, ad$male)
## [1] 0.001322359
cor(ad$area_income, ad$clicked_on_ad)
## [1] -0.4762546

Correlations with daily internet usage

cor(ad$daily_internet_usage, ad$male)
## [1] 0.02801233
cor(ad$daily_internet_usage, ad$clicked_on_ad)
## [1] -0.7865392

Correlations with male

cor(ad$male, ad$clicked_on_ad)
## [1] -0.03802747

Plots of highly correlated values

# Most of the correlations are low. The strongest are -0.786 and -0.748, which indicate
# moderate-to-strong negative correlation.

plot(ad$daily_time_spent_on_site, ad$clicked_on_ad, xlab="Time Spent On Site Daily", 
     ylab="Clicked On Ad")

plot(ad$daily_internet_usage, ad$clicked_on_ad, xlab="Internet Usage Daily", ylab="Clicked On Ad")
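
Because the target takes only the values 0 and 1, a scatter plot against it collapses into two horizontal bands. Comparing the predictor across the two classes can be easier to read. A minimal sketch on toy data (column names mirror `ad`; the values are illustrative assumptions):

```r
# Toy data; the column names mirror `ad`, the values are illustrative only.
toy <- data.frame(
  daily_internet_usage = c(250, 230, 240, 120, 140, 130),
  clicked_on_ad        = c(0, 0, 0, 1, 1, 1)
)

# Mean usage within each level of the binary target.
group_means <- tapply(toy$daily_internet_usage, toy$clicked_on_ad, mean)
print(group_means)

# The same comparison as side-by-side boxplots.
boxplot(daily_internet_usage ~ clicked_on_ad, data = toy,
        xlab = "Clicked On Ad", ylab = "Internet Usage Daily")
```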

Correlation Matrix

# Selecting only the Numeric Columns
ad_subset = subset(ad,select = -c(ad_topic_line,city,country,timestamp))
ad_subset
##       daily_time_spent_on_site age area_income daily_internet_usage male
##    1:                    68.95  35    61833.90               256.09    0
##    2:                    80.23  31    68441.85               193.77    1
##    3:                    69.47  26    59785.94               236.50    0
##    4:                    74.15  29    54806.18               245.89    1
##    5:                    68.37  35    73889.99               225.58    0
##   ---                                                                   
##  996:                    72.97  30    71384.57               208.58    1
##  997:                    51.30  45    67782.17               134.42    1
##  998:                    51.63  51    42415.72               120.37    1
##  999:                    55.55  19    41920.79               187.95    0
## 1000:                    45.01  26    29875.80               178.35    0
##       clicked_on_ad
##    1:             0
##    2:             0
##    3:             0
##    4:             0
##    5:             0
##   ---              
##  996:             1
##  997:             1
##  998:             1
##  999:             0
## 1000:             1
res <- cor(ad_subset)
round(res, 2)
##                          daily_time_spent_on_site   age area_income
## daily_time_spent_on_site                     1.00 -0.33        0.31
## age                                         -0.33  1.00       -0.18
## area_income                                  0.31 -0.18        1.00
## daily_internet_usage                         0.52 -0.37        0.34
## male                                        -0.02 -0.02        0.00
## clicked_on_ad                               -0.75  0.49       -0.48
##                          daily_internet_usage  male clicked_on_ad
## daily_time_spent_on_site                 0.52 -0.02         -0.75
## age                                     -0.37 -0.02          0.49
## area_income                              0.34  0.00         -0.48
## daily_internet_usage                     1.00  0.03         -0.79
## male                                     0.03  1.00         -0.04
## clicked_on_ad                           -0.79 -0.04          1.00

The cor() function returns only the correlation coefficients between variables.

# Using Hmisc R package to calculate the correlation p-values.
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
res2 <- rcorr(as.matrix(ad_subset))
res2
##                          daily_time_spent_on_site   age area_income
## daily_time_spent_on_site                     1.00 -0.33        0.31
## age                                         -0.33  1.00       -0.18
## area_income                                  0.31 -0.18        1.00
## daily_internet_usage                         0.52 -0.37        0.34
## male                                        -0.02 -0.02        0.00
## clicked_on_ad                               -0.75  0.49       -0.48
##                          daily_internet_usage  male clicked_on_ad
## daily_time_spent_on_site                 0.52 -0.02         -0.75
## age                                     -0.37 -0.02          0.49
## area_income                              0.34  0.00         -0.48
## daily_internet_usage                     1.00  0.03         -0.79
## male                                     0.03  1.00         -0.04
## clicked_on_ad                           -0.79 -0.04          1.00
## 
## n= 1000 
## 
## 
## P
##                          daily_time_spent_on_site age    area_income
## daily_time_spent_on_site                          0.0000 0.0000     
## age                      0.0000                          0.0000     
## area_income              0.0000                   0.0000            
## daily_internet_usage     0.0000                   0.0000 0.0000     
## male                     0.5495                   0.5062 0.9667     
## clicked_on_ad            0.0000                   0.0000 0.0000     
##                          daily_internet_usage male   clicked_on_ad
## daily_time_spent_on_site 0.0000               0.5495 0.0000       
## age                      0.0000               0.5062 0.0000       
## area_income              0.0000               0.9667 0.0000       
## daily_internet_usage                          0.3762 0.0000       
## male                     0.3762                      0.2296       
## clicked_on_ad            0.0000               0.2296

The output of rcorr() is a list containing:

  - r : the correlation matrix
  - n : the matrix of the number of observations used in analyzing each pair of variables
  - P : the p-values corresponding to the significance levels of the correlations

# Extract p-values
res2$P
##                          daily_time_spent_on_site          age  area_income
## daily_time_spent_on_site                       NA 0.000000e+00 0.000000e+00
## age                                     0.0000000           NA 6.019232e-09
## area_income                             0.0000000 6.019232e-09           NA
## daily_internet_usage                    0.0000000 0.000000e+00 0.000000e+00
## male                                    0.5494511 5.062341e-01 9.666865e-01
## clicked_on_ad                           0.0000000 0.000000e+00 0.000000e+00
##                          daily_internet_usage      male clicked_on_ad
## daily_time_spent_on_site            0.0000000 0.5494511      0.000000
## age                                 0.0000000 0.5062341      0.000000
## area_income                         0.0000000 0.9666865      0.000000
## daily_internet_usage                       NA 0.3762142      0.000000
## male                                0.3762142        NA      0.229571
## clicked_on_ad                       0.0000000 0.2295710            NA
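
The p-values printed as 0 above are too small to be distinguished from zero numerically; they should be read as "very small" rather than exactly zero. For a single pair of variables, a similar test is available in base R without Hmisc. A minimal sketch with cor.test() on simulated data (the vectors `x` and `y` are assumptions made up for illustration):

```r
set.seed(1)  # reproducible toy data, not the advertising data
x <- rnorm(100)
y <- -0.8 * x + rnorm(100, sd = 0.5)

# cor.test() returns the Pearson estimate together with its p-value
# and a confidence interval for the correlation.
ct <- cor.test(x, y)
print(ct$estimate)
print(ct$p.value)
print(ct$conf.int)
```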

Visualizing Correlation Matrix

symnum(res, abbr.colnames = FALSE)
##                          daily_time_spent_on_site age area_income
## daily_time_spent_on_site 1                                       
## age                      .                        1              
## area_income              .                            1          
## daily_internet_usage     .                        .   .          
## male                                                             
## clicked_on_ad            ,                        .   .          
##                          daily_internet_usage male clicked_on_ad
## daily_time_spent_on_site                                        
## age                                                             
## area_income                                                     
## daily_internet_usage     1                                      
## male                                          1                 
## clicked_on_ad            ,                         1            
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

The function above summarizes the correlation matrix as symbols, using the cutoff points shown below:

symnum(x, cutpoints = c(0.3, 0.6, 0.8, 0.9, 0.95), symbols = c(" ", ".", ",", "+", "*", "B"), abbr.colnames = TRUE)

This shows that most columns are not strongly correlated.

library(corrplot)
## corrplot 0.84 loaded
corrplot(res, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)

In the call above, the argument type = “upper” displays only the upper triangle of the correlation matrix. Possible values for type are “upper”, “lower” and “full”. Positive correlations are in blue and negative correlations in red. Color intensity and the size of the circle are proportional to the correlation coefficients. The correlation matrix is reordered by hierarchical clustering of the coefficients (order = “hclust”). tl.col (text label color) and tl.srt (text label string rotation) change the label color and rotation.

Considering our target variable “clicked_on_ad”, the features with the strongest relationship to it are “daily_internet_usage” and “daily_time_spent_on_site”, both negatively: as these two features increase, ad clicks become less frequent. The next related feature is “age”, which is moderately and positively related to the target variable: older users were more likely to click. The final feature that correlates with the target variable is “area_income”, which is also moderately related to it: the higher the income, the lower the probability of a click.

# Insignificant correlations are left blank (insig = "blank")
corrplot(res2$r, type="upper", order="hclust", p.mat = res2$P, sig.level = 0.005, insig = "blank")

Correlations with a p-value > 0.005 are considered insignificant and are left blank. The correlogram is combined with the significance test using the result res2 generated earlier with the rcorr() function from the Hmisc package.

The observations of the correlation are similar to the previous code chunk that didn’t involve levels of significance.

library("PerformanceAnalytics")
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## The following objects are masked from 'package:data.table':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
chart.Correlation(ad_subset, histogram=TRUE, pch=19)

The distribution of each variable is shown on the diagonal. Below the diagonal, the bivariate scatter plots with a fitted line are displayed. Above the diagonal, the value of the correlation is shown together with its significance level as stars. Each significance level is associated with a symbol: p-values(0, 0.001, 0.01, 0.05, 0.1, 1) <=> symbols(“***”, “**”, “*”, “.”, “ ”)

As can be seen, this confirms the correlations seen in the previous plots: the correlated variables have p-values well below 0.05, confirming their significance. “daily_internet_usage” and “daily_time_spent_on_site” had the highest correlations with the target, in that order, followed by “age” and finally “area_income”.

4.3. EDA

plot(ad$age, ad$area_income, xlab = 'Age', ylab = 'Area Income')

From the above plot we can see that most of the people who earned above 60,000 were between 25 and 45 years old.

#clicked_ad = ad$clicked_on_ad[ad$clicked_on_ad == 1]
#plot(ad$age, clicked_ad, xlab = 'Age', ylab = 'Area Income')

library(ggplot2)
# factor() makes the integer 0/1 target a discrete fill; scale_fill_brewer() colours that fill
ggplot(data = ad, aes(x = age, fill = factor(clicked_on_ad))) + 
    geom_histogram(bins = 27, color = 'cyan') + 
    labs(title = 'Distribution of Age with Ad clicks', x = 'Age', y = 'Frequency', 
         fill = 'Clicked on Ad') + scale_fill_brewer(palette = 'Set2')

The plot above shows that most ad click activity was also among people between 25 and 45 years old, with the most activity among people in their 30s.

# factor() makes the integer 0/1 target a discrete fill; scale_fill_brewer() colours that fill
ggplot(data = ad, aes(x = area_income, fill = factor(clicked_on_ad))) + 
  geom_histogram(bins = 27, color = 'cyan') + 
  labs(title = 'Distribution of Income with Ad clicks', x = 'Income', y = 'Frequency', 
       fill = 'Clicked on Ad') +
  scale_fill_brewer(palette = 'Set2')

Most click activity happened with those who earned above 40,000.

# factor() makes the integer 0/1 target a discrete fill; scale_fill_brewer() colours that fill
ggplot(data = ad, aes(x = daily_time_spent_on_site, fill = factor(clicked_on_ad))) + 
  geom_histogram(bins = 27, color = 'cyan') + 
  labs(title = 'Daily Time Spent On Site with Ad clicks', 
       x = 'Daily Time Spent On Site', y = 'Frequency', fill = 'Clicked on Ad') +
  scale_fill_brewer(palette = 'Set3')

Most activity happened with people who spent more than 60 minutes on the site.

# factor() makes the integer 0/1 target a discrete fill; scale_fill_brewer() colours that fill
ggplot(data = ad, aes(x = daily_internet_usage, fill = factor(clicked_on_ad))) + 
  geom_histogram(bins = 27, color = 'cyan') + 
  labs(title = 'Daily Internet Usage with Ad clicks', 
       x = 'Daily Internet Usage', y = 'Frequency', fill = 'Clicked on Ad') +
  scale_fill_brewer(palette = 'Set3')

Daily internet usage has two regions of extensive ad click activity: the first among those who spend 100 to 150 minutes and the second among those who spend 200 to 230 minutes.
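
Histogram counts show where most users are, which is not necessarily where the click rate is highest. One way to separate the two is to compute the share of users who clicked within each band of a variable. A minimal sketch on toy data (column names mirror `ad`; the values and the band boundaries are illustrative assumptions):

```r
# Toy data; column names mirror `ad`, values are illustrative only.
toy <- data.frame(
  age           = c(22, 27, 33, 38, 41, 47, 52, 58),
  clicked_on_ad = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# Bin age into bands, then take the mean of the 0/1 target per band,
# which is the share of users in that band who clicked.
toy$age_band <- cut(toy$age, breaks = c(18, 30, 45, 65))
click_rate <- tapply(toy$clicked_on_ad, toy$age_band, mean)
print(click_rate)
```

The same approach could be applied to income, time on site and internet usage to check whether the ranges with the most activity are also the ranges with the highest click rate.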

5. CONCLUSION

The factors that seem to contribute the most to ad click activity are “daily_internet_usage”, “daily_time_spent_on_site”, “age” and “area_income”, in that order.

Daily internet usage has a strong negative correlation with clicked ads (-0.79): the more time spent on the internet, the fewer the ads clicked. The histogram broadly supports this trend: activity is higher between 100 and 150 minutes, decreases between 150 and 200 minutes, then increases between 200 and 250 minutes before dropping drastically.

Daily time spent on site also has a strong negative correlation with clicked ads (-0.75). In the histogram, ad click activity increases up to 45 minutes, drops between 45 and 64 minutes, and increases again between 64 and 80 minutes. It then drops drastically after 80 minutes.

Age was the only positively correlated feature, and the correlation was moderate (0.49). Most ad click activity was among people between 25 and 45 years old, with the most activity among people in their 30s.

Finally, area income showed a moderate negative relationship with ad click activity (-0.48). Most click activity came from those who earned above 40,000; however, earners from 66,000 and above showed a drastic decline in activity.

6. CHALLENGING THE SOLUTION

To better understand which factors contribute most to ad click activity, and to identify more precisely which individuals are worth targeting, a predictive model should be built.
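
As one possible starting point, a logistic regression is a common choice for a binary target such as clicked_on_ad. The sketch below is not part of this analysis: it runs on simulated data, and the names `usage`, `clicked`, `sim` and `new_users`, as well as the coefficients used to simulate the clicks, are assumptions made for illustration.

```r
set.seed(42)  # simulated data, not the advertising data
n <- 200
usage <- runif(n, 100, 270)            # stand-in for daily_internet_usage

# Simulate clicks that become less likely as usage grows.
p_click <- plogis(8 - 0.045 * usage)
clicked <- rbinom(n, 1, p_click)
sim <- data.frame(usage = usage, clicked = clicked)

# Logistic regression of the binary target on one predictor.
fit <- glm(clicked ~ usage, data = sim, family = binomial)
print(coef(fit))

# Predicted click probability for two hypothetical users.
new_users <- data.frame(usage = c(120, 250))
preds <- predict(fit, newdata = new_users, type = "response")
print(preds)
```

On the real data, the other features would be added to the formula and the model evaluated on held-out rows before its predictions are used for targeting.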

7. RECOMMENDATIONS

From our analysis, the target audience for the course appears to be people earning between 40,000 and 66,000, aged 25 to 45 years, with a focus on people in their 30s. They should be spending either up to 45 minutes or between 64 and 80 minutes on the site daily, and their daily time on the internet should be either between 100 and 150 minutes or between 200 and 250 minutes. Targeting users who fit these criteria should increase the likelihood that the course ad is clicked.