Identify which individuals are most likely to click on the ads from the given dataset.
Appropriate Recommendations gained from the Analysis
A Kenyan entrepreneur has created an online cryptography course and wants to advertise it on her blog. She currently targets audiences originating from various countries. In the past, she ran ads to advertise a related course on the same blog and collected data in the process. She would now like to employ your services as a Data Science Consultant to help her identify which individuals are most likely to click on her ads.
The data was relevant and appropriate since it did not have many outliers or missing data. Furthermore, the recorded features relate to internet activity, which suits a course that will be delivered online.
Loading the Data and Previewing it
library(data.table)
ad = fread("~/Desktop/advertising.csv")
head(ad)
## Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1: 68.95 35 61833.90 256.09
## 2: 80.23 31 68441.85 193.77
## 3: 69.47 26 59785.94 236.50
## 4: 74.15 29 54806.18 245.89
## 5: 68.37 35 73889.99 225.58
## 6: 59.99 23 59761.56 226.74
## Ad Topic Line City Male Country
## 1: Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2: Monitored national standardization West Jodi 1 Nauru
## 3: Organic bottom-line service-desk Davidton 0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5: Robust logistical utilization South Manuel 0 Iceland
## 6: Sharable client-driven software Jamieberg 1 Norway
## Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11 0
## 2: 2016-04-04 01:39:02 0
## 3: 2016-03-13 20:35:42 0
## 4: 2016-01-10 02:31:19 0
## 5: 2016-06-03 03:36:18 0
## 6: 2016-05-19 14:30:17 0
Checking its Type
class(ad)
## [1] "data.table" "data.frame"
Taking a Glimpse of the data
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(ad)
## Rows: 1,000
## Columns: 10
## $ `Daily Time Spent on Site` <dbl> 68.95, 80.23, 69.47, 74.15, 68.37, 59.99, 8…
## $ Age <int> 35, 31, 26, 29, 35, 23, 33, 48, 30, 20, 49,…
## $ `Area Income` <dbl> 61833.90, 68441.85, 59785.94, 54806.18, 738…
## $ `Daily Internet Usage` <dbl> 256.09, 193.77, 236.50, 245.89, 225.58, 226…
## $ `Ad Topic Line` <chr> "Cloned 5thgeneration orchestration", "Moni…
## $ City <chr> "Wrightburgh", "West Jodi", "Davidton", "We…
## $ Male <int> 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0…
## $ Country <chr> "Tunisia", "Nauru", "San Marino", "Italy", …
## $ Timestamp <dttm> 2016-03-27 00:53:11, 2016-04-04 01:39:02, …
## $ `Clicked on Ad` <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1…
Summary Stats of the Data
summary(ad)
## Daily Time Spent on Site Age Area Income Daily Internet Usage
## Min. :32.60 Min. :19.00 Min. :13996 Min. :104.8
## 1st Qu.:51.36 1st Qu.:29.00 1st Qu.:47032 1st Qu.:138.8
## Median :68.22 Median :35.00 Median :57012 Median :183.1
## Mean :65.00 Mean :36.01 Mean :55000 Mean :180.0
## 3rd Qu.:78.55 3rd Qu.:42.00 3rd Qu.:65471 3rd Qu.:218.8
## Max. :91.43 Max. :61.00 Max. :79485 Max. :270.0
## Ad Topic Line City Male Country
## Length:1000 Length:1000 Min. :0.000 Length:1000
## Class :character Class :character 1st Qu.:0.000 Class :character
## Mode :character Mode :character Median :0.000 Mode :character
## Mean :0.481
## 3rd Qu.:1.000
## Max. :1.000
## Timestamp Clicked on Ad
## Min. :2016-01-01 02:52:10 Min. :0.0
## 1st Qu.:2016-02-18 02:55:42 1st Qu.:0.0
## Median :2016-04-07 17:27:29 Median :0.5
## Mean :2016-04-10 10:34:06 Mean :0.5
## 3rd Qu.:2016-05-31 03:18:14 3rd Qu.:1.0
## Max. :2016-07-24 00:22:16 Max. :1.0
Checking Number of Rows and Columns
dim(ad)
## [1] 1000 10
cat("Rows:", nrow(ad), "\nCols:", ncol(ad))
## Rows: 1000
## Cols: 10
Changing column names to lowercase for easier manipulation
colnames(ad) = tolower(colnames(ad))
colnames(ad)
## [1] "daily time spent on site" "age"
## [3] "area income" "daily internet usage"
## [5] "ad topic line" "city"
## [7] "male" "country"
## [9] "timestamp" "clicked on ad"
Replacing the spaces in column names for easier manipulation
library(stringr)
colnames(ad) = str_replace_all(colnames(ad), c(' ' = '_'))
colnames(ad)
## [1] "daily_time_spent_on_site" "age"
## [3] "area_income" "daily_internet_usage"
## [5] "ad_topic_line" "city"
## [7] "male" "country"
## [9] "timestamp" "clicked_on_ad"
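The two renaming steps above could also be done in a single pass with base R. A minimal sketch, assuming a small hypothetical data frame `demo` whose column names are styled like the raw CSV headers:

```r
# Hypothetical two-row data frame; check.names = FALSE keeps the spaces
demo <- data.frame(
  `Daily Time Spent on Site` = c(68.95, 80.23),
  `Clicked on Ad` = c(0, 0),
  check.names = FALSE
)

# Lowercase the names, then replace every space with an underscore
names(demo) <- gsub(" ", "_", tolower(names(demo)), fixed = TRUE)
names(demo)
```

This avoids loading stringr for a single substitution; the stringr version used above gives the same names.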
Checking For Missing Data
colSums(is.na(ad))
## daily_time_spent_on_site age area_income
## 0 0 0
## daily_internet_usage ad_topic_line city
## 0 0 0
## male country timestamp
## 0 0 0
## clicked_on_ad
## 0
There seems to be no missing data.
Checking For Duplicates
anyDuplicated(ad)
## [1] 0
There are no Duplicates
boxplot(ad$daily_time_spent_on_site)
boxplot(ad$age)
boxplot(ad$area_income)
There are a few outliers at the lower end of area_income.
boxplot(ad$daily_internet_usage)
boxplot(ad$male)
boxplot(ad$timestamp)
boxplot(ad$clicked_on_ad)
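To list the outlying values rather than only reading them off the plot, `boxplot.stats()` returns the points a boxplot would draw beyond the whiskers. A minimal sketch on a hypothetical vector (the numbers are invented, not taken from the dataset):

```r
# Hypothetical incomes with one unusually low value
income <- c(52000, 55000, 57000, 58000, 60000, 61000, 63000, 14000)

# $out holds the values lying more than 1.5 * IQR beyond the hinges
outliers <- boxplot.stats(income)$out
outliers
```

On the real data the equivalent call would be `boxplot.stats(ad$area_income)$out`.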
Means of Numeric Columns
numeric_columns = c("daily_time_spent_on_site", "age", "area_income", "daily_internet_usage",
"male", "timestamp")
mean(ad$daily_time_spent_on_site)
## [1] 65.0002
mean(ad$age)
## [1] 36.009
mean(ad$area_income)
## [1] 55000
mean(ad$daily_internet_usage)
## [1] 180.0001
mean(ad$male)
## [1] 0.481
mean(ad$timestamp)
## [1] "2016-04-10 10:34:06 UTC"
Medians of Numeric Columns
median(ad$daily_time_spent_on_site)
## [1] 68.215
median(ad$age)
## [1] 35
median(ad$area_income)
## [1] 57012.3
median(ad$daily_internet_usage)
## [1] 183.13
median(ad$male)
## [1] 0
median(ad$timestamp)
## [1] "2016-04-07 17:27:29 UTC"
Modes of Numeric Columns
# We create the mode function that will perform our mode operation for us
# ---
#
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(ad$daily_time_spent_on_site)
## [1] 62.26
getmode(ad$age)
## [1] 31
getmode(ad$area_income)
## [1] 61833.9
getmode(ad$daily_internet_usage)
## [1] 167.22
getmode(ad$male)
## [1] 0
getmode(ad$timestamp)
## [1] "2016-03-27 00:53:11 UTC"
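Because `getmode()` returns the first value among those tied for the highest count, its result for a column whose values are all distinct (such as `timestamp`) is simply the first observation. The sketch below shows that behaviour on the first five `daily_time_spent_on_site` values from the preview. Rounding before taking the mode is one possible workaround and is an assumption here, not part of the original analysis:

```r
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# First five daily_time_spent_on_site values from the preview; all distinct
x <- c(68.95, 80.23, 69.47, 74.15, 68.37)
first_value_mode <- getmode(x)   # every value occurs once, so x[1] is returned

# Rounding to the nearest 10 groups similar values before counting
rounded_mode <- getmode(round(x, -1))

first_value_mode
rounded_mode
```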
Minimums of Numeric Columns
min(ad$daily_time_spent_on_site)
## [1] 32.6
min(ad$age)
## [1] 19
min(ad$area_income)
## [1] 13996.5
min(ad$daily_internet_usage)
## [1] 104.78
min(ad$male)
## [1] 0
min(ad$timestamp)
## [1] "2016-01-01 02:52:10 UTC"
Maximums of Numeric Columns
max(ad$daily_time_spent_on_site)
## [1] 91.43
max(ad$age)
## [1] 61
max(ad$area_income)
## [1] 79484.8
max(ad$daily_internet_usage)
## [1] 269.96
max(ad$male)
## [1] 1
max(ad$timestamp)
## [1] "2016-07-24 00:22:16 UTC"
Ranges of Numeric Columns
range(ad$daily_time_spent_on_site)
## [1] 32.60 91.43
range(ad$age)
## [1] 19 61
range(ad$area_income)
## [1] 13996.5 79484.8
range(ad$daily_internet_usage)
## [1] 104.78 269.96
range(ad$male)
## [1] 0 1
range(ad$timestamp)
## [1] "2016-01-01 02:52:10 UTC" "2016-07-24 00:22:16 UTC"
Quantiles of Numeric Columns
quantile(ad$daily_time_spent_on_site)
## 0% 25% 50% 75% 100%
## 32.6000 51.3600 68.2150 78.5475 91.4300
quantile(ad$age)
## 0% 25% 50% 75% 100%
## 19 29 35 42 61
quantile(ad$area_income)
## 0% 25% 50% 75% 100%
## 13996.50 47031.80 57012.30 65470.64 79484.80
quantile(ad$daily_internet_usage)
## 0% 25% 50% 75% 100%
## 104.7800 138.8300 183.1300 218.7925 269.9600
quantile(ad$male)
## 0% 25% 50% 75% 100%
## 0 0 0 1 1
quantile(ad$timestamp)
## 0% 25% 50%
## "2016-01-01 02:52:10 UTC" "2016-02-18 02:55:42 UTC" "2016-04-07 17:27:29 UTC"
## 75% 100%
## "2016-05-31 03:18:14 UTC" "2016-07-24 00:22:16 UTC"
Variances of Numeric Columns
var(ad$daily_time_spent_on_site)
## [1] 251.3371
var(ad$age)
## [1] 77.18611
var(ad$area_income)
## [1] 179952406
var(ad$daily_internet_usage)
## [1] 1927.415
var(ad$male)
## [1] 0.2498889
var(ad$timestamp)
## [1] 2.590788e+13
Standard Deviations of Numeric Columns
sd(ad$daily_time_spent_on_site)
## [1] 15.85361
sd(ad$age)
## [1] 8.785562
sd(ad$area_income)
## [1] 13414.63
sd(ad$daily_internet_usage)
## [1] 43.90234
sd(ad$male)
## [1] 0.4998889
sd(ad$timestamp)
## [1] 5089978
numeric_columns
## [1] "daily_time_spent_on_site" "age"
## [3] "area_income" "daily_internet_usage"
## [5] "male" "timestamp"
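The per-column calls above can be condensed with `sapply()`, which applies one function to every column. A minimal sketch using the six preview rows of two columns (copied from the `head(ad)` output) as stand-in data; the helper name `describe` is hypothetical:

```r
demo <- data.frame(
  age = c(35, 31, 26, 29, 35, 23),
  daily_internet_usage = c(256.09, 193.77, 236.50, 245.89, 225.58, 226.74)
)

# Hypothetical helper returning several summary statistics for one column
describe <- function(col) {
  c(mean = mean(col), median = median(col), sd = sd(col),
    min = min(col), max = max(col))
}

# One column of results per input column, one row per statistic
stats <- sapply(demo, describe)
round(stats, 2)
```

The same helper could be applied to the numeric columns of `ad` to reproduce the figures above in one table.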
Covariances with daily time spent
cov(ad$daily_time_spent_on_site, ad$age)
## [1] -46.17415
cov(ad$daily_time_spent_on_site, ad$area_income)
## [1] 66130.81
cov(ad$daily_time_spent_on_site, ad$daily_internet_usage)
## [1] 360.9919
cov(ad$daily_time_spent_on_site, ad$male)
## [1] -0.1501864
cov(ad$daily_time_spent_on_site, ad$clicked_on_ad)
## [1] -5.933143
Covariances with age
cov(ad$age, ad$area_income)
## [1] -21520.93
cov(ad$age, ad$daily_internet_usage)
## [1] -141.6348
cov(ad$age, ad$male)
## [1] -0.09242142
cov(ad$age, ad$clicked_on_ad)
## [1] 2.164665
Covariances with area income
cov(ad$area_income, ad$daily_internet_usage)
## [1] 198762.5
cov(ad$area_income, ad$male)
## [1] 8.867509
cov(ad$area_income, ad$clicked_on_ad)
## [1] -3195.989
Covariances with daily internet usage
cov(ad$daily_internet_usage, ad$male)
## [1] 0.6147667
cov(ad$daily_internet_usage, ad$clicked_on_ad)
## [1] -17.27409
Covariances with male
cov(ad$male, ad$clicked_on_ad)
## [1] -0.00950951
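The sign of a covariance gives the direction of the relationship: positive values mean the two variables tend to increase together, while negative values mean one tends to decrease as the other increases. The magnitude depends on the units of the variables, so covariances cannot be compared across pairs; the correlations below are used for that.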
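Dividing a covariance by the product of the two standard deviations gives the unit-free Pearson correlation. A short sketch that checks this using the values reported above for `daily_time_spent_on_site` and `age`:

```r
cov_xy <- -46.17415   # cov(daily_time_spent_on_site, age), reported above
sd_x   <- 15.85361    # sd(daily_time_spent_on_site), reported above
sd_y   <- 8.785562    # sd(age), reported above

# Pearson correlation = covariance / (sd_x * sd_y)
r <- cov_xy / (sd_x * sd_y)
round(r, 4)
```

The result agrees, up to rounding of the inputs, with the value that `cor()` reports for the same pair.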
Correlations with daily time spent
cor(ad$daily_time_spent_on_site, ad$age)
## [1] -0.3315133
cor(ad$daily_time_spent_on_site, ad$area_income)
## [1] 0.3109544
cor(ad$daily_time_spent_on_site, ad$daily_internet_usage)
## [1] 0.5186585
cor(ad$daily_time_spent_on_site, ad$male)
## [1] -0.01895085
cor(ad$daily_time_spent_on_site, ad$clicked_on_ad)
## [1] -0.7481166
Correlations with age
cor(ad$age, ad$area_income)
## [1] -0.182605
cor(ad$age, ad$daily_internet_usage)
## [1] -0.3672086
cor(ad$age, ad$male)
## [1] -0.02104406
cor(ad$age, ad$clicked_on_ad)
## [1] 0.4925313
Correlations with area income
cor(ad$area_income, ad$daily_internet_usage)
## [1] 0.3374955
cor(ad$area_income, ad$male)
## [1] 0.001322359
cor(ad$area_income, ad$clicked_on_ad)
## [1] -0.4762546
Correlations with daily internet usage
cor(ad$daily_internet_usage, ad$male)
## [1] 0.02801233
cor(ad$daily_internet_usage, ad$clicked_on_ad)
## [1] -0.7865392
Correlations with male
cor(ad$male, ad$clicked_on_ad)
## [1] -0.03802747
Plots of highly correlated values
# Most of the correlations are low. The strongest are -0.786 and -0.748 (both with
# clicked_on_ad), which are moderate-to-strong negative correlations.
plot(ad$daily_time_spent_on_site, ad$clicked_on_ad, xlab="Time Spent On Site Daily",
ylab="Clicked On Ad")
plot(ad$daily_internet_usage, ad$clicked_on_ad, xlab="Internet Usage Daily", ylab="Clicked On Ad")
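Scatter plots against a 0/1 variable collapse into two horizontal bands, so comparing the two groups directly can be easier to read. A sketch using ten rows copied from the printed output of the dataset (rows 1-5 and 996-1000), which are far too few to draw conclusions from:

```r
demo <- data.frame(
  daily_internet_usage = c(256.09, 193.77, 236.50, 245.89, 225.58,
                           208.58, 134.42, 120.37, 187.95, 178.35),
  clicked_on_ad = c(0, 0, 0, 0, 0, 1, 1, 1, 0, 1)
)

# Mean daily internet usage within each click group
group_means <- tapply(demo$daily_internet_usage, demo$clicked_on_ad, mean)
group_means
```

For a graphical version of the same comparison, `boxplot(daily_internet_usage ~ clicked_on_ad, data = ad)` draws one box per group.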
Correlation Matrix
# Selecting only the Numeric Columns
ad_subset = subset(ad,select = -c(ad_topic_line,city,country,timestamp))
ad_subset
## daily_time_spent_on_site age area_income daily_internet_usage male
## 1: 68.95 35 61833.90 256.09 0
## 2: 80.23 31 68441.85 193.77 1
## 3: 69.47 26 59785.94 236.50 0
## 4: 74.15 29 54806.18 245.89 1
## 5: 68.37 35 73889.99 225.58 0
## ---
## 996: 72.97 30 71384.57 208.58 1
## 997: 51.30 45 67782.17 134.42 1
## 998: 51.63 51 42415.72 120.37 1
## 999: 55.55 19 41920.79 187.95 0
## 1000: 45.01 26 29875.80 178.35 0
## clicked_on_ad
## 1: 0
## 2: 0
## 3: 0
## 4: 0
## 5: 0
## ---
## 996: 1
## 997: 1
## 998: 1
## 999: 0
## 1000: 1
res <- cor(ad_subset)
round(res, 2)
## daily_time_spent_on_site age area_income
## daily_time_spent_on_site 1.00 -0.33 0.31
## age -0.33 1.00 -0.18
## area_income 0.31 -0.18 1.00
## daily_internet_usage 0.52 -0.37 0.34
## male -0.02 -0.02 0.00
## clicked_on_ad -0.75 0.49 -0.48
## daily_internet_usage male clicked_on_ad
## daily_time_spent_on_site 0.52 -0.02 -0.75
## age -0.37 -0.02 0.49
## area_income 0.34 0.00 -0.48
## daily_internet_usage 1.00 0.03 -0.79
## male 0.03 1.00 -0.04
## clicked_on_ad -0.79 -0.04 1.00
The cor() function returns only the correlation coefficients between variables.
# Using Hmisc R package to calculate the correlation p-values.
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
res2 <- rcorr(as.matrix(ad_subset))
res2
## daily_time_spent_on_site age area_income
## daily_time_spent_on_site 1.00 -0.33 0.31
## age -0.33 1.00 -0.18
## area_income 0.31 -0.18 1.00
## daily_internet_usage 0.52 -0.37 0.34
## male -0.02 -0.02 0.00
## clicked_on_ad -0.75 0.49 -0.48
## daily_internet_usage male clicked_on_ad
## daily_time_spent_on_site 0.52 -0.02 -0.75
## age -0.37 -0.02 0.49
## area_income 0.34 0.00 -0.48
## daily_internet_usage 1.00 0.03 -0.79
## male 0.03 1.00 -0.04
## clicked_on_ad -0.79 -0.04 1.00
##
## n= 1000
##
##
## P
## daily_time_spent_on_site age area_income
## daily_time_spent_on_site 0.0000 0.0000
## age 0.0000 0.0000
## area_income 0.0000 0.0000
## daily_internet_usage 0.0000 0.0000 0.0000
## male 0.5495 0.5062 0.9667
## clicked_on_ad 0.0000 0.0000 0.0000
## daily_internet_usage male clicked_on_ad
## daily_time_spent_on_site 0.0000 0.5495 0.0000
## age 0.0000 0.5062 0.0000
## area_income 0.0000 0.9667 0.0000
## daily_internet_usage 0.3762 0.0000
## male 0.3762 0.2296
## clicked_on_ad 0.0000 0.2296
The output of rcorr() is a list containing:
- r : the correlation matrix
- n : the matrix of the number of observations used in analyzing each pair of variables
- P : the p-values corresponding to the significance levels of the correlations
# Extract p-values
res2$P
## daily_time_spent_on_site age area_income
## daily_time_spent_on_site NA 0.000000e+00 0.000000e+00
## age 0.0000000 NA 6.019232e-09
## area_income 0.0000000 6.019232e-09 NA
## daily_internet_usage 0.0000000 0.000000e+00 0.000000e+00
## male 0.5494511 5.062341e-01 9.666865e-01
## clicked_on_ad 0.0000000 0.000000e+00 0.000000e+00
## daily_internet_usage male clicked_on_ad
## daily_time_spent_on_site 0.0000000 0.5494511 0.000000
## age 0.0000000 0.5062341 0.000000
## area_income 0.0000000 0.9666865 0.000000
## daily_internet_usage NA 0.3762142 0.000000
## male 0.3762142 NA 0.229571
## clicked_on_ad 0.0000000 0.2295710 NA
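For a single pair of variables, base R's `cor.test()` reports the Pearson coefficient together with its p-value and a confidence interval. A sketch on simulated data; the variable names and the relationship between them are made up for illustration:

```r
set.seed(1)

# Simulated data: site_time is built to depend on usage, plus noise
usage <- runif(100, min = 100, max = 270)
site_time <- 30 + 0.2 * usage + rnorm(100, sd = 10)

test <- cor.test(usage, site_time)   # Pearson is the default method
test$estimate   # correlation coefficient
test$p.value    # p-value for the null hypothesis of zero correlation
test$conf.int   # 95% confidence interval for the coefficient
```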
Visualizing Correlation Matrix
symnum(res, abbr.colnames = FALSE)
## daily_time_spent_on_site age area_income
## daily_time_spent_on_site 1
## age . 1
## area_income . 1
## daily_internet_usage . . .
## male
## clicked_on_ad , . .
## daily_internet_usage male clicked_on_ad
## daily_time_spent_on_site
## age
## area_income
## daily_internet_usage 1
## male 1
## clicked_on_ad , 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
The function above checks the correlation matrix using the cutoff points shown below:
symnum(x, cutpoints = c(0.3, 0.6, 0.8, 0.9, 0.95), symbols = c(" ", ".", ",", "+", "*", "B"), abbr.colnames = TRUE)
This shows that most columns are not strongly correlated.
library(corrplot)
## corrplot 0.84 loaded
corrplot(res, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)
In the call above, the second argument (type=“upper”) displays only the upper triangle of the correlation matrix. Possible values for the argument type are: “upper”, “lower”, “full”. Positive correlations are in blue and negative correlations in red. Color intensity and the size of the circle are proportional to the correlation coefficients. The correlation matrix is reordered according to the correlation coefficient using the “hclust” method. tl.col (for text label color) and tl.srt (for text label string rotation) are used to change text colors and rotations.
Considering our target variable “clicked_on_ad”, the features with the strongest relationship to it are “daily_internet_usage” and “daily_time_spent_on_site”, both negatively. They suggest that as these two features increase, the number of “clicked_on_ad” observations decreases. The next feature is “age”, which is moderately related to the target variable: an increase in age tended to go with an increase in “clicked_on_ad” observations. The final feature that correlates with the target variable is “area_income”, which is also moderately related to it: the higher the income, the lower the probability of a “clicked_on_ad” observation.
# Insignificant correlations are left blank
corrplot(res2$r, type="upper", order="hclust", p.mat = res2$P, sig.level = 0.005, insig = "blank")
Correlations with p-value > 0.005 are considered as insignificant and are left blank. We have combined correlogram with the significance test using the result res2 generated in the previous section with rcorr() function in Hmisc package.
These observations match those from the previous plot, which did not take significance levels into account.
library("PerformanceAnalytics")
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
## The following objects are masked from 'package:data.table':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
chart.Correlation(ad_subset, histogram=TRUE, pch=19)
The distribution of each variable is shown on the diagonal. Below the diagonal, the bivariate scatter plots with a fitted line are displayed. Above the diagonal, the value of the correlation is shown together with the significance level as stars. Each significance level is associated with a symbol: p-values(0, 0.001, 0.01, 0.05, 0.1, 1) <=> symbols(“***”, “**”, “*”, “.”, “ ”)
As can be seen, this confirms the correlations from the previous plots: the correlated variables have p-values well below 0.05, which confirms their significance. “daily_internet_usage” and “daily_time_spent_on_site” had the highest correlations with the target, followed by “age” and finally “area_income”.
plot(ad$age, ad$area_income, xlab = 'Age', ylab = 'Area Income')
From the above plot we can see that most of the people who earned above 60,000 were between 25 and 45 years old.
#clicked_ad = ad$clicked_on_ad[ad$clicked_on_ad == 1]
#plot(ad$age, clicked_ad, xlab = 'Age', ylab = 'Area Income')
library(ggplot2)
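# clicked_on_ad is an integer, so convert it to a factor to get one fill colour per class
ggplot(data = ad, aes(x = age, fill = factor(clicked_on_ad))) +
  geom_histogram(bins = 27, color = 'cyan') +
  labs(title = 'Distribution of Age with Ad clicks', x = 'Age', y = 'Frequency',
       fill = 'Clicked on Ad') +
  scale_fill_brewer(palette = 'Set2')  # the palette applies to fill, not color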
The plot above shows that most ad-click activity also came from people aged 25 to 45, with the most activity among people in their 30s.
# Factor fill gives one colour per click class; scale_fill_brewer sets the fill palette
ggplot(data = ad, aes(x = area_income, fill = factor(clicked_on_ad))) +
  geom_histogram(bins = 27, color = 'cyan') +
  labs(title = 'Distribution of Income with Ad clicks',
       x = 'Income', y = 'Frequency',
       fill = 'Clicked on Ad') +
  scale_fill_brewer(palette = 'Set2')
Most click activity came from those who earned above 40,000.
# Factor fill gives one colour per click class; scale_fill_brewer sets the fill palette
ggplot(data = ad, aes(x = daily_time_spent_on_site, fill = factor(clicked_on_ad))) +
  geom_histogram(bins = 27, color = 'cyan') +
  labs(title = 'Daily Time Spent On Site with Ad clicks',
       x = 'Daily Time Spent On Site', y = 'Frequency',
       fill = 'Clicked on Ad') +
  scale_fill_brewer(palette = 'Set3')
Most activity came from people who spent more than 60 minutes on the site.
# Factor fill gives one colour per click class; scale_fill_brewer sets the fill palette
ggplot(data = ad, aes(x = daily_internet_usage, fill = factor(clicked_on_ad))) +
  geom_histogram(bins = 27, color = 'cyan') +
  labs(title = 'Daily Internet Usage with Ad clicks',
       x = 'Daily Internet Usage', y = 'Frequency',
       fill = 'Clicked on Ad') +
  scale_fill_brewer(palette = 'Set3')
Daily internet usage has two regions of extensive ad-click activity: the first among those who spend 100 to 150 minutes and the second between 200 and 230 minutes.
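The histograms show counts, so a tall bar may simply reflect a range with many visitors. The share of visitors who clicked within each band can be computed with `cut()` and `tapply()`. A sketch on ten rows copied from the printed output (rows 1-5 and 996-1000); the age breaks are an arbitrary choice and the sample is too small to interpret:

```r
demo <- data.frame(
  age = c(35, 31, 26, 29, 35, 30, 45, 51, 19, 26),
  clicked_on_ad = c(0, 0, 0, 0, 0, 1, 1, 1, 0, 1)
)

# Assign each row to an age band (intervals are open on the left, closed on the right)
demo$age_band <- cut(demo$age, breaks = c(18, 25, 35, 45, 61))

# The mean of a 0/1 column is the proportion of ones, i.e. the click rate
click_rate <- tapply(demo$clicked_on_ad, demo$age_band, mean)
click_rate
```

The same pattern works for income, time on site and internet usage by changing the column and the breaks.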
The factors that seem to contribute most to ad-click activity are “daily_internet_usage”, “daily_time_spent_on_site”, “age” and “area_income”, in that order.
Daily internet usage has a strong negative correlation with clicked ads, showing that the more time spent on the internet, the fewer ads are clicked. The histogram broadly agrees: there is more activity between 100 and 150 minutes, which decreases between 150 and 200 minutes, then increases between 200 and 250 minutes before dropping drastically.
Daily time spent on site also has a strong negative correlation with clicked ads. The histogram broadly agrees here too: ad-click activity increases up to 45 minutes, drops between 45 and 64 minutes, increases again between 64 and 80 minutes, and then drops drastically after 80 minutes.
Age was the only positively correlated feature, although the correlation was moderate. Most ad-click activity came from people aged 25 to 45, with the most activity among people in their 30s.
Finally, area income showed a moderate negative relationship with ad-click activity. Most click activity came from those who earned above 40,000, but earners from 66,000 and above showed a drastic decline in activity.
To better understand which factors contribute most to ad-click activity, and to identify more precisely which individuals to target, predictive modeling should be carried out as a next step.
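One common first model for a binary outcome such as `clicked_on_ad` is logistic regression. The sketch below fits one with `glm()` on simulated data, because the original file is not available here. The coefficients used to generate the data are invented, and only the column names follow the dataset:

```r
set.seed(42)
n <- 500

# Simulated predictors within the ranges seen in the dataset
sim <- data.frame(
  daily_internet_usage = runif(n, 100, 270),
  age = sample(19:61, n, replace = TRUE)
)

# Invented relationship: heavier internet users click less, older users click more
log_odds <- 6 - 0.045 * sim$daily_internet_usage + 0.06 * sim$age
sim$clicked_on_ad <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression of the click outcome on both predictors
fit <- glm(clicked_on_ad ~ daily_internet_usage + age,
           data = sim, family = binomial)
coef(fit)

# Predicted click probability for a hypothetical 30-year-old using 150 minutes a day
new_user <- data.frame(daily_internet_usage = 150, age = 30)
p_click <- predict(fit, newdata = new_user, type = "response")
p_click
```

On the real data, the same formula could be extended with `daily_time_spent_on_site` and `area_income`, and the model should be evaluated on held-out rows before its predictions are used for targeting.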
From our analysis, the target audience for the course appears to be people earning between 40,000 and 66,000 and aged 25 to 45 years, with a focus on those in their 30s. They should spend either up to 45 minutes or between 64 and 80 minutes on the site, and either 100 to 150 minutes or 200 to 250 minutes on the internet daily. Targeting people who match these metrics should increase the likelihood that the course ad is clicked.
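The recommended profile can be written as a row filter. A sketch applying it to ten rows copied from the printed output (rows 1-5 and 996-1000), with the thresholds taken from the paragraph above; the object names `demo`, `in_profile` and `target` are hypothetical:

```r
demo <- data.frame(
  daily_time_spent_on_site = c(68.95, 80.23, 69.47, 74.15, 68.37,
                               72.97, 51.30, 51.63, 55.55, 45.01),
  age = c(35, 31, 26, 29, 35, 30, 45, 51, 19, 26),
  area_income = c(61833.90, 68441.85, 59785.94, 54806.18, 73889.99,
                  71384.57, 67782.17, 42415.72, 41920.79, 29875.80),
  daily_internet_usage = c(256.09, 193.77, 236.50, 245.89, 225.58,
                           208.58, 134.42, 120.37, 187.95, 178.35)
)

# TRUE for rows that satisfy every condition of the recommended profile
in_profile <- with(demo,
  area_income >= 40000 & area_income <= 66000 &
  age >= 25 & age <= 45 &
  (daily_time_spent_on_site <= 45 |
     (daily_time_spent_on_site >= 64 & daily_time_spent_on_site <= 80)) &
  ((daily_internet_usage >= 100 & daily_internet_usage <= 150) |
     (daily_internet_usage >= 200 & daily_internet_usage <= 250))
)

target <- demo[in_profile, ]
nrow(target)
```

Running the same filter on the full dataset and comparing the click rate inside and outside the profile would show how well the recommendation holds up.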