Analysis on Advertisements

Matilda Kadzo 25/05/2022

Defining The Question

A Kenyan entrepreneur has created an online cryptography course and wants to advertise it on her blog. She currently targets audiences originating from various countries. In the past, she ran adverts for a related course on the same blog and collected data in the process. She would now like to employ your services as a Data Science Consultant to help her identify which individuals are most likely to click on her adverts.

Metric of Success

To provide an accurate depiction of the people most likely to click on the client's advertisements, and to give the client recommendations based on the results of the univariate and bivariate analysis conducted on the dataset.

Understanding the context

Clicks on an advert indicate how appealing it is to the people who see it, and highly targeted adverts are more likely to receive clicks. In this case, the number of clicks on the adverts on our client’s blog would help us gauge how many people are interested in the online cryptography course.

Experimental Design

Steps taken:

Loading the dataset
Performing data cleaning
Exploratory Data Analysis
Conclusion and recommendation

Data Relevance

Daily Time Spent on Site - Time spent per day on the blog
Age - Age of the respondents
Area Income - Income Distribution of the respondents’ area
Daily Internet Usage - How much the internet is used on a daily basis
Ad Topic Line - Topic of the advert
City - City of respondents
Male - gender of respondents; 1 if male and 0 if female.
Country - Country of respondents
Timestamp - The time the data was recorded
Clicked on Ad - whether the respondents click on the ads; 0 if they don’t and 1 if they do.

Loading the Dataset

advert <- read.csv("/home/binti/Downloads/R/advertising.csv")

Previewing the top of our dataset

head(advert)

##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1                    68.95  35    61833.90               256.09
## 2                    80.23  31    68441.85               193.77
## 3                    69.47  26    59785.94               236.50
## 4                    74.15  29    54806.18               245.89
## 5                    68.37  35    73889.99               225.58
## 6                    59.99  23    59761.56               226.74
##                           Ad.Topic.Line           City Male    Country
## 1    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2    Monitored national standardization      West Jodi    1      Nauru
## 3      Organic bottom-line service-desk       Davidton    0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5         Robust logistical utilization   South Manuel    0    Iceland
## 6       Sharable client-driven software      Jamieberg    1     Norway
##             Timestamp Clicked.on.Ad
## 1 2016-03-27 00:53:11             0
## 2 2016-04-04 01:39:02             0
## 3 2016-03-13 20:35:42             0
## 4 2016-01-10 02:31:19             0
## 5 2016-06-03 03:36:18             0
## 6 2016-05-19 14:30:17             0

Previewing the tail of our dataset

tail(advert)

##      Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 995                     43.70  28    63126.96               173.01
## 996                     72.97  30    71384.57               208.58
## 997                     51.30  45    67782.17               134.42
## 998                     51.63  51    42415.72               120.37
## 999                     55.55  19    41920.79               187.95
## 1000                    45.01  26    29875.80               178.35
##                             Ad.Topic.Line          City Male
## 995         Front-line bifurcated ability  Nicholasland    0
## 996         Fundamental modular algorithm     Duffystad    1
## 997       Grass-roots cohesive monitoring   New Darlene    1
## 998          Expanded intangible solution South Jessica    1
## 999  Proactive bandwidth-monitored policy   West Steven    0
## 1000      Virtual 5thgeneration emulation   Ronniemouth    0
##                     Country           Timestamp Clicked.on.Ad
## 995                 Mayotte 2016-04-04 03:57:48             1
## 996                 Lebanon 2016-02-11 21:49:00             1
## 997  Bosnia and Herzegovina 2016-04-22 02:07:01             1
## 998                Mongolia 2016-02-01 17:24:57             1
## 999               Guatemala 2016-03-24 02:35:54             0
## 1000                 Brazil 2016-06-03 21:43:21             1

Dataset Columns

names(advert)

##  [1] "Daily.Time.Spent.on.Site" "Age"                     
##  [3] "Area.Income"              "Daily.Internet.Usage"    
##  [5] "Ad.Topic.Line"            "City"                    
##  [7] "Male"                     "Country"                 
##  [9] "Timestamp"                "Clicked.on.Ad"

Cleaning Data

Finding the total missing values in our dataset.

colSums(is.na(advert))

## Daily.Time.Spent.on.Site                      Age              Area.Income 
##                        0                        0                        0 
##     Daily.Internet.Usage            Ad.Topic.Line                     City 
##                        0                        0                        0 
##                     Male                  Country                Timestamp 
##                        0                        0                        0 
##            Clicked.on.Ad 
##                        0

#There are no missing values in our dataset

Checking for duplicates across our rows.

sum(duplicated(advert))

## [1] 0

#There are no duplicates in this dataset.

The dataset had neither missing values nor duplicated rows.
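The two cleaning checks above can be wrapped into one small helper. This is only a sketch on a made-up toy data frame; `quality_report` and `toy` are hypothetical names that do not appear in the original analysis.

```r
# Hypothetical helper: counts missing cells per column and fully duplicated rows.
quality_report <- function(df) {
  list(
    missing_per_column = colSums(is.na(df)),
    duplicated_rows    = sum(duplicated(df))
  )
}

# Toy data (invented for illustration): one NA and one repeated row.
toy <- data.frame(
  Age  = c(35, 31, NA, 31),
  City = c("A", "B", "C", "B"),
  stringsAsFactors = FALSE
)

report <- quality_report(toy)
print(report)
```

Calling `quality_report(advert)` on the real data should reproduce the zero counts reported in this section.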

Exploring the dataset

Checking the descriptive statistics of the dataset

summary(advert)

##  Daily.Time.Spent.on.Site      Age         Area.Income    Daily.Internet.Usage
##  Min.   :32.60            Min.   :19.00   Min.   :13996   Min.   :104.8       
##  1st Qu.:51.36            1st Qu.:29.00   1st Qu.:47032   1st Qu.:138.8       
##  Median :68.22            Median :35.00   Median :57012   Median :183.1       
##  Mean   :65.00            Mean   :36.01   Mean   :55000   Mean   :180.0       
##  3rd Qu.:78.55            3rd Qu.:42.00   3rd Qu.:65471   3rd Qu.:218.8       
##  Max.   :91.43            Max.   :61.00   Max.   :79485   Max.   :270.0       
##  Ad.Topic.Line          City                Male         Country         
##  Length:1000        Length:1000        Min.   :0.000   Length:1000       
##  Class :character   Class :character   1st Qu.:0.000   Class :character  
##  Mode  :character   Mode  :character   Median :0.000   Mode  :character  
##                                        Mean   :0.481                     
##                                        3rd Qu.:1.000                     
##                                        Max.   :1.000                     
##   Timestamp         Clicked.on.Ad
##  Length:1000        Min.   :0.0  
##  Class :character   1st Qu.:0.0  
##  Mode  :character   Median :0.5  
##                     Mean   :0.5  
##                     3rd Qu.:1.0  
##                     Max.   :1.0

Checking the structure of the dataframe

str(advert)

## 'data.frame':    1000 obs. of  10 variables:
##  $ Daily.Time.Spent.on.Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area.Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily.Internet.Usage    : num  256 194 236 246 226 ...
##  $ Ad.Topic.Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : chr  "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
##  $ Male                    : int  0 1 0 1 0 1 0 1 1 1 ...
##  $ Country                 : chr  "Tunisia" "Nauru" "San Marino" "Italy" ...
##  $ Timestamp               : chr  "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
##  $ Clicked.on.Ad           : int  0 0 0 0 0 0 0 1 0 0 ...
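The structure shows that Timestamp is stored as character, so it cannot be used for time-based summaries as it stands. A minimal sketch of parsing it, using three values copied from the preview above; the UTC time zone is an assumption, since the dataset does not state one.

```r
# Example values copied from the head() preview of the dataset.
ts_chr <- c("2016-03-27 00:53:11", "2016-04-04 01:39:02", "2016-03-13 20:35:42")

# Assumption: timestamps are treated as UTC (the source does not give a time zone).
ts <- as.POSIXct(ts_chr, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Derive simple features that could be tabulated against Clicked.on.Ad.
hour_of_day <- as.integer(format(ts, "%H"))
month_num   <- as.integer(format(ts, "%m"))

print(data.frame(ts, hour_of_day, month_num))
```

On the full data frame the same conversion would be `advert$Timestamp <- as.POSIXct(advert$Timestamp, tz = "UTC")`.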

Checking for Outliers

Checking for outliers in the numeric columns using boxplots. These also give a visual sense of the shape of each distribution.

boxplot(advert$Area.Income,
        main ="Area Income",
        col = "orange",
        border  = 'brown',
        horizontal = TRUE,
        notch = TRUE)

#There are a few outliers in the area income column.

boxplot(advert$Daily.Time.Spent.on.Site,
        main ="Daily Time Spent on Site",
        col = "orange",
        border  = 'brown',
        horizontal = TRUE,
        notch = TRUE)

#There are no outliers in the daily time spent on site column.

boxplot(advert$Age,
        main ="Age",
        col = "orange",
        border  = 'brown',
        horizontal = TRUE,
        notch = TRUE)

#There are no outliers in the age column.

boxplot(advert$Daily.Internet.Usage,
        main ="Daily Internet Usage",
        col = "orange",
        border  = 'brown',
        horizontal = TRUE,
        notch = TRUE)

#There are no outliers in the daily internet usage column
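The boxplot observations can also be checked numerically with the usual 1.5 x IQR rule. The sketch below uses the Area.Income quartiles, minimum and maximum reported by `quantile()` in this report; `find_outliers` and `toy_income` are hypothetical names. Note that `boxplot()` uses hinges rather than quantiles, so the fences may differ very slightly.

```r
# Quartiles of Area.Income copied from the quantile() output in this report.
q1  <- 47031.80
q3  <- 65470.64
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
print(c(lower = lower_fence, upper = upper_fence))

# Reported minimum and maximum of Area.Income compared with the fences.
min_below <- 13996.50 < lower_fence  # low incomes fall outside the lower fence
max_above <- 79484.80 > upper_fence  # the maximum stays inside the upper fence

# Hypothetical helper returning the points a boxplot would draw as outliers.
find_outliers <- function(x) boxplot.stats(x)$out

# Toy vector (invented) with one unusually low value.
toy_income <- c(52000, 55000, 57000, 58000, 60000, 61000, 63000, 14000)
out <- find_outliers(toy_income)
print(out)
```

`find_outliers(advert$Area.Income)` would list the actual outlying incomes in the dataset.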

Exploratory Data Analysis

Univariate Analysis

Measures of Central Tendency

Mean of the numeric columns

colMeans(advert[sapply(advert,is.numeric)])

## Daily.Time.Spent.on.Site                      Age              Area.Income 
##                  65.0002                  36.0090               55000.0001 
##     Daily.Internet.Usage                     Male            Clicked.on.Ad 
##                 180.0001                   0.4810                   0.5000

Median of our numeric columns

ad_time_median <- median(advert$Daily.Time.Spent.on.Site)
print(ad_time_median)

## [1] 68.215

ad_age_median <- median(advert$Age)
ad_age_median

## [1] 35

ad_income_median <- median(advert$Area.Income)
ad_income_median

## [1] 57012.3

ad_internet_usage_median <- median(advert$Daily.Internet.Usage)
ad_internet_usage_median

## [1] 183.13

Mode of our numeric columns.

Let’s create the mode function

getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

Finding the mode of each column

getmode(advert$Age)

## [1] 31

getmode(advert$Daily.Time.Spent.on.Site)

## [1] 62.26

getmode(advert$Area.Income)

## [1] 61833.9

getmode(advert$Daily.Internet.Usage)

## [1] 167.22

getmode(advert$City)

## [1] "Lisamouth"

getmode(advert$Ad.Topic.Line)

## [1] "Cloned 5thgeneration orchestration"

getmode(advert$Male)

## [1] 0

getmode(advert$Country)

## [1] "Czech Republic"

getmode(advert$Timestamp)

## [1] "2016-03-27 00:53:11"
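A mode is only informative when a value actually repeats. For columns such as Timestamp or Ad.Topic.Line, the mode returned above is simply the first row of the dataset, which suggests every value may occur once. The sketch below returns the count alongside the mode so that this case is visible; `getmode_with_count` is a hypothetical helper and the input vectors are invented for illustration. Because `table()` sorts its values, ties are broken by sorted order rather than by first appearance as in `getmode`.

```r
# Hypothetical helper: most frequent value together with how often it occurs.
getmode_with_count <- function(v) {
  counts <- table(v)
  top    <- which.max(counts)
  list(value = names(counts)[top], count = as.integer(counts[top]))
}

# Toy vector with a genuine mode.
m_age <- getmode_with_count(c(31, 31, 35, 29, 31))

# Toy vector where every value is unique: the "mode" has a count of 1.
m_time <- getmode_with_count(c(62.26, 80.23, 69.47))

print(m_age)
print(m_time)
```

A count of 1 signals that the column has no meaningful mode.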

Minimum values in the numeric columns

min(advert$Age)

## [1] 19

min(advert$Daily.Time.Spent.on.Site)

## [1] 32.6

min(advert$Area.Income)

## [1] 13996.5

min(advert$Daily.Internet.Usage)

## [1] 104.78

Maximum values in the numeric columns

max(advert$Age)

## [1] 61

max(advert$Daily.Time.Spent.on.Site)

## [1] 91.43

max(advert$Area.Income)

## [1] 79484.8

max(advert$Daily.Internet.Usage)

## [1] 269.96

Range in the numeric columns

range(advert$Age)

## [1] 19 61

range(advert$Daily.Time.Spent.on.Site)

## [1] 32.60 91.43

range(advert$Area.Income)

## [1] 13996.5 79484.8

range(advert$Daily.Internet.Usage)

## [1] 104.78 269.96

Summary

- The youngest respondent is 19 and the oldest is 61 years of age.
- The least daily time spent on the site is 32.60 minutes and the most is 91.43 minutes.
- The lowest area income is 13,996.50 and the highest is 79,484.80.
- Daily internet usage ranges from 104.78 to 269.96.

Quantiles in the columns

quantile(advert$Age)

##   0%  25%  50%  75% 100% 
##   19   29   35   42   61

quantile(advert$Daily.Time.Spent.on.Site)

##      0%     25%     50%     75%    100% 
## 32.6000 51.3600 68.2150 78.5475 91.4300

quantile(advert$Area.Income)

##       0%      25%      50%      75%     100% 
## 13996.50 47031.80 57012.30 65470.64 79484.80

quantile(advert$Daily.Internet.Usage)

##       0%      25%      50%      75%     100% 
## 104.7800 138.8300 183.1300 218.7925 269.9600

Variance of the numeric columns.

This shows how the data values are dispersed around the mean.

var(advert$Age)

## [1] 77.18611

var(advert$Daily.Time.Spent.on.Site)

## [1] 251.3371

var(advert$Area.Income)

## [1] 179952406

var(advert$Daily.Internet.Usage)

## [1] 1927.415

Finding the standard deviation of the columns.

sd(advert$Age)

## [1] 8.785562

sd(advert$Daily.Time.Spent.on.Site)

## [1] 15.85361

sd(advert$Area.Income)

## [1] 13414.63

sd(advert$Daily.Internet.Usage)

## [1] 43.90234
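Variances and standard deviations are hard to compare across columns measured on very different scales (age in years versus income in currency units). One common way to compare them is the coefficient of variation, sd divided by mean. The sketch below uses the means and standard deviations reported in this section; the vector names are illustrative.

```r
# Means and standard deviations copied from the output in this report.
col_mean <- c(Age = 36.0090, Daily.Time.Spent.on.Site = 65.0002,
              Area.Income = 55000.0001, Daily.Internet.Usage = 180.0001)
col_sd   <- c(Age = 8.785562, Daily.Time.Spent.on.Site = 15.85361,
              Area.Income = 13414.63, Daily.Internet.Usage = 43.90234)

# Coefficient of variation: spread relative to the size of the mean.
cv <- col_sd / col_mean
print(round(cv, 3))
```

All four columns come out at roughly 0.24, so relative to their means they are about equally dispersed despite the very different variances.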

Frequency Distribution

Frequency Distribution in the age column

table(advert$Age)

## 
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 
##  6  6  6 13 19 21 27 37 33 48 48 39 60 38 43 39 39 50 36 37 30 36 32 26 23 21 
## 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 
## 30 18 13 16 18 20 12 15 10  9  7  2  6  4  2  4  1

# Most respondents fall within the 25-42 age bracket. The most common age is 31, with a total of 60 respondents.
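The reading of the frequency table can be verified directly. The sketch below re-enters the counts from the `table(advert$Age)` output above; `ages`, `counts` and the derived names are illustrative.

```r
# Ages and their frequencies copied from the table() output above.
ages   <- 19:61
counts <- c(6, 6, 6, 13, 19, 21, 27, 37, 33, 48, 48, 39, 60, 38, 43, 39, 39, 50,
            36, 37, 30, 36, 32, 26, 23, 21, 30, 18, 13, 16, 18, 20, 12, 15, 10,
            9, 7, 2, 6, 4, 2, 4, 1)
stopifnot(length(ages) == length(counts))

modal_age   <- ages[which.max(counts)]   # most common age
modal_count <- max(counts)               # number of respondents at that age

# Share of respondents aged 25 to 42 inclusive.
share_25_42 <- sum(counts[ages >= 25 & ages <= 42]) / sum(counts)

print(c(modal_age = modal_age, modal_count = modal_count, share_25_42 = share_25_42))
```

On the data frame itself the same quantities would come from `tab <- table(advert$Age)` followed by `names(tab)[which.max(tab)]` and `mean(advert$Age >= 25 & advert$Age <= 42)`.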

Histogram

Plotting histograms for the columns

hist(advert$Age, col  = "Cyan")

#Most respondents fall in the age bracket 25-40.

hist(advert$Area.Income, col = "Purple")

#The respondents mostly earn between 50K - 70K

hist(advert$Daily.Time.Spent.on.Site, col = "gold")

hist(advert$Daily.Internet.Usage, col = "pink")

Plotting count plots for categorical data

Countplots for categorical data were plotted and it was observed that:

library(ggplot2)
ggplot(advert, aes(x=Male)) + geom_bar(fill=rgb(0.4,0.1,0.5))

There were slightly more female (519) than male (481) users in the dataset. This plot shows the gender split only, not who clicked on the advert.

ggplot(advert, aes(x=factor(`Clicked.on.Ad`))) + geom_bar( fill=rgb(0.6,0.4,0.4))

The number of users who clicked on the advert is equal to the number who did not (500 each).
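Because Male and Clicked.on.Ad are stored as 0/1 integers, count plots and tables are easier to read once the codes are given labels. A minimal sketch, using the first ten Male values shown in the `str()` output; the label names follow the coding given under Data Relevance (1 if male, 0 if female).

```r
# First ten Male values copied from the str() output of the dataset.
male <- c(0, 1, 0, 1, 0, 1, 0, 1, 1, 1)

# Convert the 0/1 code into a labelled factor.
gender <- factor(male, levels = c(0, 1), labels = c("Female", "Male"))

gender_counts <- table(gender)
print(gender_counts)
print(prop.table(gender_counts))

# With ggplot2 loaded, the labelled factor could be plotted as:
# ggplot(advert, aes(x = factor(Male, levels = c(0, 1),
#                               labels = c("Female", "Male")))) + geom_bar()
```

Ten rows are far too few to say anything about the gender split; the point is only the labelling step.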

Bivariate Analysis

Ggplots

library(ggplot2)

# factor() makes the fill discrete so the two click groups are coloured separately
ggplot(data = advert, aes(x = Area.Income, fill = factor(Clicked.on.Ad))) +
        geom_histogram(bins = 20, col = "orange") +
        labs(title = "Income Distribution", x = "Area Income", y = "Frequency", fill = "Clicked on Ad") +
        scale_fill_brewer(palette = "Set1")

# factor() makes the fill discrete so the two click groups are coloured separately
ggplot(data = advert, aes(x = Age, fill = factor(Clicked.on.Ad))) +
        geom_histogram(bins = 20, col = "orange") +
        labs(title = "Age Distribution", x = "Age", y = "Frequency", fill = "Clicked on Ad") +
        scale_fill_brewer(palette = "Set1")

# factor() makes the fill discrete so the two click groups are coloured separately
ggplot(data = advert, aes(x = Daily.Time.Spent.on.Site, fill = factor(Clicked.on.Ad))) +
        geom_histogram(bins = 20, col = "orange") +
        labs(title = "Daily Time Spent on Site", x = "Time Spent on Site", y = "Frequency", fill = "Clicked on Ad") +
        scale_fill_brewer(palette = "Set1")
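The histograms compare distributions between users who clicked and those who did not; a numeric counterpart is the group mean of each variable. The sketch below uses only six rows copied from the head and tail previews of the dataset (three per group), so the numbers illustrate the method and are not representative of the full data. `sample_rows` and `group_means` are illustrative names.

```r
# Rows 1-3 (did not click) and rows 995-997 (clicked), copied from the previews.
sample_rows <- data.frame(
  Daily.Time.Spent.on.Site = c(68.95, 80.23, 69.47, 43.70, 72.97, 51.30),
  Age                      = c(35, 31, 26, 28, 30, 45),
  Clicked.on.Ad            = c(0, 0, 0, 1, 1, 1)
)

# Mean of each variable within each click group.
group_means <- aggregate(
  cbind(Daily.Time.Spent.on.Site, Age) ~ Clicked.on.Ad,
  data = sample_rows, FUN = mean
)
print(group_means)
```

Replacing `sample_rows` with `advert` would give the group means for all 1000 respondents.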

Covariance

Covariance is a statistical representation of the degree to which two variables vary together.

cov(advert$Age, advert$Daily.Time.Spent.on.Site)

## [1] -46.17415

#There is a negative relationship between age and the time spent on the site: as age increases, the daily time spent on the site tends to decrease, and vice versa.

cov(advert$Age, advert$Daily.Internet.Usage)

## [1] -141.6348

#There is a negative relationship between the age and the daily internet usage as well.

cov(advert$Area.Income,advert$Daily.Time.Spent.on.Site)

## [1] 66130.81

#There is a positive relationship between the income and daily time spent on site variables: the higher the income, the more time tends to be spent on the site. The covariance is large only because income is measured on a large scale, so its size does not by itself indicate a strong relationship (the correlation computed below is about 0.31).

cov(advert$Age,advert$Area.Income)

## [1] -21520.93

#There is a negative covariance between the age and income variables, so older respondents tend to live in lower income areas.
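Covariance depends on the units of both variables, so its magnitude cannot be read as strength. Dividing by the two standard deviations gives the Pearson correlation, which is unit-free. The sketch below applies that formula to the covariances and standard deviations already reported in this document; the variable names are illustrative.

```r
# Values copied from the cov() and sd() output in this report.
cov_age_time    <- -46.17415
cov_income_time <- 66130.81
sd_age    <- 8.785562
sd_time   <- 15.85361
sd_income <- 13414.63

# Pearson correlation = covariance / (sd_x * sd_y)
cor_age_time    <- cov_age_time / (sd_age * sd_time)
cor_income_time <- cov_income_time / (sd_income * sd_time)

print(round(c(age_time = cor_age_time, income_time = cor_income_time), 3))
```

Both results agree with the `cor()` output in the next section: the two relationships are of similar, moderate strength even though the covariances differ by three orders of magnitude.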

Correlation matrix

cor(advert$Age, advert$Daily.Time.Spent.on.Site)

## [1] -0.3315133

cor(advert$Age,advert$Daily.Internet.Usage)

## [1] -0.3672086

cor(advert$Area.Income,advert$Daily.Internet.Usage)

## [1] 0.3374955

cor(advert$Area.Income,advert$Daily.Time.Spent.on.Site)

## [1] 0.3109544

cor(advert$Age,advert$Area.Income)

## [1] -0.182605

cor(advert[, c("Age","Daily.Time.Spent.on.Site","Daily.Internet.Usage")])

##                                 Age Daily.Time.Spent.on.Site
## Age                       1.0000000               -0.3315133
## Daily.Time.Spent.on.Site -0.3315133                1.0000000
## Daily.Internet.Usage     -0.3672086                0.5186585
##                          Daily.Internet.Usage
## Age                                -0.3672086
## Daily.Time.Spent.on.Site            0.5186585
## Daily.Internet.Usage                1.0000000

cor(advert[,unlist(lapply(advert, is.numeric))])

##                          Daily.Time.Spent.on.Site         Age  Area.Income
## Daily.Time.Spent.on.Site               1.00000000 -0.33151334  0.310954413
## Age                                   -0.33151334  1.00000000 -0.182604955
## Area.Income                            0.31095441 -0.18260496  1.000000000
## Daily.Internet.Usage                   0.51865848 -0.36720856  0.337495533
## Male                                  -0.01895085 -0.02104406  0.001322359
## Clicked.on.Ad                         -0.74811656  0.49253127 -0.476254628
##                          Daily.Internet.Usage         Male Clicked.on.Ad
## Daily.Time.Spent.on.Site           0.51865848 -0.018950855   -0.74811656
## Age                               -0.36720856 -0.021044064    0.49253127
## Area.Income                        0.33749553  0.001322359   -0.47625463
## Daily.Internet.Usage               1.00000000  0.028012326   -0.78653918
## Male                               0.02801233  1.000000000   -0.03802747
## Clicked.on.Ad                     -0.78653918 -0.038027466    1.00000000
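Since the goal is to identify who clicks, the most useful part of the matrix is the Clicked.on.Ad column. The sketch below ranks the variables by the absolute size of their correlation with the target, using the values printed above; `cor_with_click` and `ranked` are illustrative names.

```r
# Correlations with Clicked.on.Ad, copied from the matrix above.
cor_with_click <- c(
  Daily.Time.Spent.on.Site = -0.74811656,
  Age                      =  0.49253127,
  Area.Income              = -0.47625463,
  Daily.Internet.Usage     = -0.78653918,
  Male                     = -0.03802747
)

# Rank by strength of association, ignoring the sign.
ranked <- cor_with_click[order(abs(cor_with_click), decreasing = TRUE)]
print(round(ranked, 2))

# On the data frame, the same column could be extracted with:
# cor(advert[sapply(advert, is.numeric)])[, "Clicked.on.Ad"]
```

Daily internet usage and time on site are the strongest (negative) associations, age and area income are moderate, and gender is close to zero.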

Plotting a correlation heatmap for the numerical variables

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(ggcorrplot)

# Selecting the Numerical Variables of the dataset
corr <- dplyr::select(advert,Age,Area.Income,Clicked.on.Ad,Daily.Internet.Usage,Daily.Time.Spent.on.Site,Male )

# Plotting the Correlation Heatmap
library(ggcorrplot)
ggcorrplot(cor(corr), hc.order = FALSE, type = "lower", lab = TRUE,
  ggtheme = ggplot2::theme_gray,
  colors = c("#00798c", "violet", "#edae49"))

Here, it was noted that:

There was a strong negative correlation (-0.79) between the Daily Internet Usage and Clicked on Ad variables. This means that the higher one's daily internet usage, the less likely they are to click on the blog ads. The same can be said for the Daily Time Spent on Site and Clicked on Ad variables (-0.75).

The Clicked on Ad variable had a moderate positive correlation (0.49) with the Age variable: older users were more likely to click on the ad, as we observed above in our analysis.

The Clicked on Ad variable was also moderately negatively correlated (-0.48) with Area Income: the higher one's income, the less likely they were to click on the ad.

Scatter Plots

Scatter plots are used when we want to see a graphical representation of two different variables. They show how the variables are correlated.

Let’s plot a scatter plot for area income and age.

ggplot(advert, aes(Area.Income,Age))+geom_point(aes(colour= factor(`Clicked.on.Ad`)))+
  labs(title = "Scatter Plot of Age Distribution vs Area Income",
       x = "Area Income",
       y = "Age")

The scatter plot of Area Income against Age showed that the majority of the users who did not click on the ad were high income earners, and many of them were aged between 20 and 40 years.

Scatter plot for Income and Daily Internet Usage

ggplot(advert, aes(Area.Income, Daily.Internet.Usage))+
  geom_point(aes(colour= factor(`Clicked.on.Ad`)))+
  labs(title = "Scatter Plot of Area Income vs Daily Internet Usage",
       x = "Area Income",
       y = "Daily Internet Usage")

Scatter Plot of Age Distribution vs Time Spent on Site

ggplot(advert, aes(Age, Daily.Time.Spent.on.Site))+
  geom_point(aes(colour= factor(`Clicked.on.Ad`)))+
  labs(title = "Scatter Plot of Age Distribution vs Time Spent on Site",
       x = "Age",
       y = "Time Spent on Site")

Plotting Age against Time Spent on Site, we see that the younger demographic is less likely to click on ads despite spending significant amounts of time on the site.

The reason for this may be that younger people are more tech savvy and are therefore more likely to detect ads and avoid them while using the internet, compared to their older counterparts.

Scatter plot for Income Distribution and Daily time spent on site.

ggplot(advert, aes(Daily.Time.Spent.on.Site, Area.Income))+
  geom_point(aes(colour= factor(`Clicked.on.Ad`)))+
  labs(title = "Time spent on site vs Income",
       x = "Daily Time Spent on Site",
       y = "Income Distribution")

The people who were least likely to click on the ad were the higher income earners, despite the fact that they seemed to spend over an hour a day on the site.

The same can be said for the Daily Internet Usage variable. When it is plotted against income, we see that those who spend over 200 minutes online per day and earn more than 50,000 are the least likely to click on ads.

Scatter plot for Age and Daily Internet Usage

ggplot(advert, aes(Age, Daily.Internet.Usage))+
  geom_point(aes(colour= factor(`Clicked.on.Ad`)))+
  labs(title = "Scatter Plot of Age Distribution vs Daily Usage",
       x = "Age",
       y = "Daily Usage")

Conclusion and Recommendations

In analyzing this data it was deduced that:
- Older people, those over 35, were more likely to click on the course advert.
- Individuals in higher income areas were less likely to click on the advert.
- Half of the users in the dataset clicked on the advert, an overall click rate of 0.5.
- The more time users spent on the blog, the less likely they were to click on the advertisement.

Thus, given these observations I would recommend that:
- The client focus more on those earning a lower income, i.e. less than 60,000, as these consumers click on adverts more.
- Users aged over 35 be targeted more, as they were more likely to click on the ad.
- Users who spend less time on the site and on the internet in general be prioritised, as they are a better demographic for the ads.