A Kenyan entrepreneur has created an online cryptography course and wants to advertise it on her blog. She currently targets audiences from various countries. In the past, she ran ads for a related course on the same blog and collected data in the process. She would now like to employ your services as a Data Science Consultant to help her identify which individuals are most likely to click on her ads.
Our main aim is to carry out a thorough univariate and bivariate exploratory data analysis and come up with recommendations for our client.
We also aim to build detailed visualizations for the univariate and bivariate analysis.
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.2 v dplyr 1.0.0
## v tidyr 1.1.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ads <- read.csv("http://bit.ly/IPAdvertisingData")
view(ads)
Checking the data summary
summary(ads)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## Min. :32.60 Min. :19.00 Min. :13996 Min. :104.8
## 1st Qu.:51.36 1st Qu.:29.00 1st Qu.:47032 1st Qu.:138.8
## Median :68.22 Median :35.00 Median :57012 Median :183.1
## Mean :65.00 Mean :36.01 Mean :55000 Mean :180.0
## 3rd Qu.:78.55 3rd Qu.:42.00 3rd Qu.:65471 3rd Qu.:218.8
## Max. :91.43 Max. :61.00 Max. :79485 Max. :270.0
## Ad.Topic.Line City Male Country
## Length:1000 Length:1000 Min. :0.000 Length:1000
## Class :character Class :character 1st Qu.:0.000 Class :character
## Mode :character Mode :character Median :0.000 Mode :character
## Mean :0.481
## 3rd Qu.:1.000
## Max. :1.000
## Timestamp Clicked.on.Ad
## Length:1000 Min. :0.0
## Class :character 1st Qu.:0.0
## Mode :character Median :0.5
## Mean :0.5
## 3rd Qu.:1.0
## Max. :1.0
From the table above, we can see the measures of central tendency (mean and median) for each numeric column, along with the minimum, maximum and quartiles.
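summary() reports the mean and median but not the mode. As a minimal sketch, the three measures can be computed by hand on a small vector. The values below are the six Age entries shown by head(ads) further down, and stat_mode() is a hypothetical helper, since base R has no built-in function for the statistical mode.

```r
# Six Age values taken from the first rows of the dataset (illustration only)
age <- c(35, 31, 26, 29, 35, 23)

# Hypothetical helper: returns the most frequent value(s) in x
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

age_mean   <- mean(age)      # arithmetic mean
age_median <- median(age)    # middle value of the sorted vector
age_mode   <- stat_mode(age) # most frequent value

cat("Mean:", age_mean, "Median:", age_median, "Mode:", age_mode, "\n")
```

On the full dataset the same calls would take ads$Age instead of the sample vector.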
Checking top and bottom rows and columns
tail(ads)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 995 43.70 28 63126.96 173.01
## 996 72.97 30 71384.57 208.58
## 997 51.30 45 67782.17 134.42
## 998 51.63 51 42415.72 120.37
## 999 55.55 19 41920.79 187.95
## 1000 45.01 26 29875.80 178.35
## Ad.Topic.Line City Male
## 995 Front-line bifurcated ability Nicholasland 0
## 996 Fundamental modular algorithm Duffystad 1
## 997 Grass-roots cohesive monitoring New Darlene 1
## 998 Expanded intangible solution South Jessica 1
## 999 Proactive bandwidth-monitored policy West Steven 0
## 1000 Virtual 5thgeneration emulation Ronniemouth 0
## Country Timestamp Clicked.on.Ad
## 995 Mayotte 2016-04-04 03:57:48 1
## 996 Lebanon 2016-02-11 21:49:00 1
## 997 Bosnia and Herzegovina 2016-04-22 02:07:01 1
## 998 Mongolia 2016-02-01 17:24:57 1
## 999 Guatemala 2016-03-24 02:35:54 0
## 1000 Brazil 2016-06-03 21:43:21 1
head(ads)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1 68.95 35 61833.90 256.09
## 2 80.23 31 68441.85 193.77
## 3 69.47 26 59785.94 236.50
## 4 74.15 29 54806.18 245.89
## 5 68.37 35 73889.99 225.58
## 6 59.99 23 59761.56 226.74
## Ad.Topic.Line City Male Country
## 1 Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2 Monitored national standardization West Jodi 1 Nauru
## 3 Organic bottom-line service-desk Davidton 0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5 Robust logistical utilization South Manuel 0 Iceland
## 6 Sharable client-driven software Jamieberg 1 Norway
## Timestamp Clicked.on.Ad
## 1 2016-03-27 00:53:11 0
## 2 2016-04-04 01:39:02 0
## 3 2016-03-13 20:35:42 0
## 4 2016-01-10 02:31:19 0
## 5 2016-06-03 03:36:18 0
## 6 2016-05-19 14:30:17 0
Checking the classes
class(ads)
## [1] "data.frame"
Checking the number of rows and columns in our dataset
cat("Rows in dataset:", nrow(ads), "\nCols in dataset:", ncol(ads))
## Rows in dataset: 1000
## Cols in dataset: 10
cat("\nThe dimension of the dataset is:", dim(ads))
##
## The dimension of the dataset is: 1000 10
Range of Time Spent on Site by users
site.time.range <- range(ads$Daily.Time.Spent.on.Site)
cat("The Range of Time Spent on Site by users is:",site.time.range)
## The Range of Time Spent on Site by users is: 32.6 91.43
Range of Daily Internet Usage
internet.time.range <- range(ads$Daily.Internet.Usage)
cat("The Range of Daily Internet Usage is:", internet.time.range)
## The Range of Daily Internet Usage is: 104.78 269.96
Range of Age
age.range <- range(ads$Age)
cat("The Range of Users' age is:",age.range)
## The Range of Users' age is: 19 61
Range of Income
income.range <- range(ads$Area.Income)
cat("The Range of Users' income is:",income.range)
## The Range of Users' income is: 13996.5 79484.8
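Note that range() returns the minimum and maximum rather than a single number. A short sketch of how the width of the range and other measures of spread could be derived is shown here, using the six Daily.Time.Spent.on.Site values from head(ads) as an illustrative sample. The variable names are assumptions made for this example.

```r
# Six Daily.Time.Spent.on.Site values from the first rows (illustration only)
site_time <- c(68.95, 80.23, 69.47, 74.15, 68.37, 59.99)

limits <- range(site_time)  # c(min, max)
spread <- diff(limits)      # max - min, the range as a single number

# Other common measures of dispersion
site_sd  <- sd(site_time)   # sample standard deviation
site_iqr <- IQR(site_time)  # interquartile range

cat("Min and max:", limits, "\n")
cat("Width of range:", spread, "\n")
cat("Standard deviation:", site_sd, "IQR:", site_iqr, "\n")
```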
Structure of our dataframe
str(ads)
## 'data.frame': 1000 obs. of 10 variables:
## $ Daily.Time.Spent.on.Site: num 69 80.2 69.5 74.2 68.4 ...
## $ Age : int 35 31 26 29 35 23 33 48 30 20 ...
## $ Area.Income : num 61834 68442 59786 54806 73890 ...
## $ Daily.Internet.Usage : num 256 194 236 246 226 ...
## $ Ad.Topic.Line : chr "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
## $ City : chr "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
## $ Male : int 0 1 0 1 0 1 0 1 1 1 ...
## $ Country : chr "Tunisia" "Nauru" "San Marino" "Italy" ...
## $ Timestamp : chr "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
## $ Clicked.on.Ad : int 0 0 0 0 0 0 0 1 0 0 ...
Our dataset is a data frame with 1000 records and 10 variables: 3 numeric, 3 integer and 4 character. The character columns include the timestamp, which will be converted to a standard date-time format.
Converting the date and time
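The timestamp looks like a date and time when displayed, but its data type is character, so we convert it to a date-time class. strptime() returns an object of class “POSIXlt” “POSIXt” (it has two classes); wrapping the call in as.POSIXct() would give the more compact “POSIXct” form that is usually preferred for data frame columns.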
class(ads$Timestamp)
## [1] "character"
ads$Timestamp <- strptime(ads$Timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(ads$Timestamp)
## [1] "POSIXlt" "POSIXt"
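Once the timestamp is a proper date-time object, its components can be extracted for analysis. The sketch below is an illustration rather than part of the original analysis: it parses three timestamps copied from head(ads) with as.POSIXct() and pulls out the hour and the month. The names stamps, ts, hour_of_day and month_num are assumptions made for this example.

```r
# Three timestamps copied from the first rows of the dataset
stamps <- c("2016-03-27 00:53:11", "2016-04-04 01:39:02", "2016-03-13 20:35:42")

# as.POSIXct() stores each time as seconds since the epoch,
# which behaves well as a data frame column
ts <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(ts)

# Extract components with format() and convert them to integers
hour_of_day <- as.integer(format(ts, "%H"))
month_num   <- as.integer(format(ts, "%m"))

print(data.frame(timestamp = ts, hour_of_day, month_num))
```

Columns such as the hour of day could then be compared against Clicked.on.Ad in the same way as the other numeric variables.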
Checking for duplicates
sum(duplicated(ads))
## [1] 0
There are no duplicate rows in our data.
Checking for missing values
colSums(is.na(ads))
## Daily.Time.Spent.on.Site Age Area.Income
## 0 0 0
## Daily.Internet.Usage Ad.Topic.Line City
## 0 0 0
## Male Country Timestamp
## 0 0 0
## Clicked.on.Ad
## 0
The dataset has no missing values in any of the columns.
install.packages("dataMaid", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/ruth/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'dataMaid' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\ruth\AppData\Local\Temp\RtmpML9PSf\downloaded_packages
install.packages("inspectdf", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/ruth/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'inspectdf' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'inspectdf'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying C:
## \Users\ruth\Documents\R\win-library\4.0\00LOCK\inspectdf\libs\x64\inspectdf.dll
## to C:\Users\ruth\Documents\R\win-library\4.0\inspectdf\libs\x64\inspectdf.dll:
## Permission denied
## Warning: restored 'inspectdf'
##
## The downloaded binary packages are in
## C:\Users\ruth\AppData\Local\Temp\RtmpML9PSf\downloaded_packages
Calling the libraries
library(dplyr)
library(inspectdf)
## Warning: package 'inspectdf' was built under R version 4.0.3
The dplyr and inspectdf packages will give us more insight into our data.
inspect_cat(ads)
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 3 x 5
## col_name cnt common common_pcnt levels
## <chr> <int> <chr> <dbl> <named list>
## 1 Ad.Topic.Line 1000 Adaptive 24hour Graphic Int~ 0.1 <tibble [1,000 x~
## 2 City 969 Lisamouth 0.3 <tibble [969 x 3~
## 3 Country 237 Czech Republic 0.900 <tibble [237 x 3~
common_pcnt is the percentage of each column occupied by its most common level, which is shown in the common column.
Here we check the correlation between the numeric columns and the target variable Clicked.on.Ad.
inspect_cor(ads, df2 = NULL, method = "pearson", with_col = 'Clicked.on.Ad', alpha = 0.05)
## Warning: Columns with 0 variance found: Male, Clicked.on.Ad
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 5 x 7
## col_1 col_2 corr p_value lower upper pcnt_nna
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Clicked.on.Ad Daily.Internet.Usage -0.787 3.74e-136 -0.809 -0.762 100
## 2 Clicked.on.Ad Daily.Time.Spent.on.~ -0.748 2.29e-123 -0.774 -0.719 100
## 3 Clicked.on.Ad Age 0.493 1.55e- 54 0.444 0.538 100
## 4 Clicked.on.Ad Area.Income -0.476 4.15e- 51 -0.523 -0.427 100
## 5 Clicked.on.Ad Male -0.0380 2.30e- 1 -0.0998 0.0240 100
The summary above covers Pearson’s correlation coefficients for all the numeric columns, compared against the Clicked.on.Ad column.
Clicked.on.Ad has strong negative correlations with Daily.Internet.Usage (-0.787) and Daily.Time.Spent.on.Site (-0.748), and a moderate negative correlation with Area.Income (-0.476). The only positive correlation is with Age (0.493). The correlation with Male (-0.038) is close to zero and not statistically significant (p = 0.23).
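For a single pair of columns, the same kind of figures (coefficient, p-value and confidence bounds) can be obtained in base R with cor.test(). The sketch below uses the twelve Daily.Internet.Usage and Clicked.on.Ad values printed by head(ads) and tail(ads), so the numbers it produces are illustrative only and will not match the full-data table. It assumes that the lower and upper columns above are 95% confidence bounds, corresponding to alpha = 0.05.

```r
# Twelve rows copied from head(ads) and tail(ads); illustration only
usage   <- c(256.09, 193.77, 236.50, 245.89, 225.58, 226.74,
             173.01, 208.58, 134.42, 120.37, 187.95, 178.35)
clicked <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)

# Pearson correlation between a numeric and a 0/1 variable
# (also known as the point-biserial correlation)
result <- cor.test(usage, clicked, method = "pearson", conf.level = 0.95)

r_value <- unname(result$estimate)  # correlation coefficient
p_value <- result$p.value           # p-value for H0: correlation is zero
ci      <- result$conf.int          # lower and upper confidence bounds

cat("r =", r_value, " p =", p_value, " CI =", ci[1], "to", ci[2], "\n")
```

A negative coefficient here means that the users who clicked had lower internet usage on average than those who did not.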
inspect_cor(ads, df2 = NULL, method = "pearson", alpha = 0.05)
## Warning: Columns with 0 variance found: Male, Clicked.on.Ad
## Warning: `...` is not empty.
##
## We detected these problematic arguments:
## * `needs_dots`
##
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?
## # A tibble: 15 x 7
## col_1 col_2 corr p_value lower upper pcnt_nna
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Clicked.on.Ad Daily.Internet.U~ -0.787 3.74e-136 -0.809 -0.762 100
## 2 Clicked.on.Ad Daily.Time.Spent~ -0.748 2.29e-123 -0.774 -0.719 100
## 3 Daily.Internet~ Daily.Time.Spent~ 0.519 2.80e- 60 0.472 0.563 100
## 4 Clicked.on.Ad Age 0.493 1.55e- 54 0.444 0.538 100
## 5 Clicked.on.Ad Area.Income -0.476 4.15e- 51 -0.523 -0.427 100
## 6 Daily.Internet~ Age -0.367 4.38e- 31 -0.420 -0.312 100
## 7 Daily.Internet~ Area.Income 0.337 1.63e- 26 0.281 0.391 100
## 8 Age Daily.Time.Spent~ -0.332 1.22e- 25 -0.386 -0.275 100
## 9 Area.Income Daily.Time.Spent~ 0.311 9.37e- 23 0.254 0.366 100
## 10 Area.Income Age -0.183 8.13e- 9 -0.242 -0.122 100
## 11 Clicked.on.Ad Male -0.0380 2.30e- 1 -0.0998 0.0240 100
## 12 Male Daily.Internet.U~ 0.0280 3.76e- 1 -0.0340 0.0898 100
## 13 Male Age -0.0210 5.06e- 1 -0.0829 0.0410 100
## 14 Male Daily.Time.Spent~ -0.0190 5.50e- 1 -0.0808 0.0431 100
## 15 Male Area.Income 0.00132 9.67e- 1 -0.0607 0.0633 100
install.packages("PerformanceAnalytics")
## Installing package into 'C:/Users/ruth/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'PerformanceAnalytics' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\ruth\AppData\Local\Temp\RtmpML9PSf\downloaded_packages
install.packages("corrplot")
## Installing package into 'C:/Users/ruth/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)
## package 'corrplot' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\ruth\AppData\Local\Temp\RtmpML9PSf\downloaded_packages
library(corrplot)
## corrplot 0.84 loaded
ads_num <- Filter(is.numeric, ads)
corrplot(cor(ads_num))
The Daily.Internet.Usage and Daily.Time.Spent.on.Site columns have a moderate positive correlation (about 0.52), and so do the Clicked.on.Ad and Age columns (about 0.49).
We use the Clicked.on.Ad feature to determine the fill colors for these graphs. That won’t work while the column is stored as an integer, because ggplot2 needs a discrete variable for grouped fills, so we wrap it in factor() inside aes().
library(ggplot2)
# factor() makes the 0/1 integer discrete; scale_fill_brewer() matches the fill aesthetic
ggplot(data = ads, aes(x = Age, fill = factor(Clicked.on.Ad))) +
geom_histogram(bins = 27, color = 'cyan') +
labs(title = 'Age distribution with Ad clicks', x = 'Age', y = 'Frequency', fill = 'Clicked.on.Ad') +
scale_fill_brewer(palette = 'Set2')
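A minimal sketch of the integer-to-factor conversion described above, on a small made-up vector. The values and the "No"/"Yes" labels are assumptions chosen for illustration; the dataset itself only stores 0 and 1.

```r
# Hypothetical 0/1 click indicator, stored as integers like Clicked.on.Ad
clicked <- c(0L, 0L, 1L, 1L, 0L, 1L)

# Convert to a factor so plotting functions treat it as discrete groups
clicked_f <- factor(clicked, levels = c(0, 1), labels = c("No", "Yes"))

is.numeric(clicked)    # TRUE: treated as a continuous variable
is.factor(clicked_f)   # TRUE: treated as categories
levels(clicked_f)      # the two groups available for fill colors

click_counts <- table(clicked_f)  # number of observations per group
print(click_counts)
```

Converting the column once, for example with ads$Clicked.on.Ad <- factor(ads$Clicked.on.Ad), is an alternative to calling factor() in every plot, but numeric summaries such as correlations then need the original 0/1 values.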
Income and Click on Ad distribution
ggplot(data = ads, aes(x = Area.Income, fill = factor(Clicked.on.Ad))) +
geom_histogram(bins = 20, color = 'cyan') +
labs(title = 'Income distribution', x = 'Income', y = 'Frequency', fill = 'Clicked.on.Ad') +
scale_fill_brewer(palette = 'Set1')
Daily Internet Use and the clicked on ad relationship
ggplot(data = ads, aes(x = Daily.Internet.Usage, fill = factor(Clicked.on.Ad))) +
geom_histogram(bins = 35, color = 'cyan') +
labs(title = 'Daily Internet Use distribution', x = 'Daily Internet Usage (minutes)', y = 'Frequency', fill = 'Clicked.on.Ad') +
scale_fill_brewer(palette = 'Set1')
Daily Time Spent on Site and the clicked on ad relationship
ggplot(data = ads, aes(x = Daily.Time.Spent.on.Site, fill = factor(Clicked.on.Ad))) +
geom_histogram(bins = 25, color = 'cyan') +
labs(title = 'Daily Time Spent On Site', x = 'Time Spent (minutes)', y = 'Frequency', fill = 'Clicked.on.Ad') +
scale_fill_brewer(palette = 'Set1')
# Both variables are continuous, so a scatter plot is used here, colored by click outcome
ggplot(data = ads, aes(x = Area.Income, y = Daily.Time.Spent.on.Site, color = factor(Clicked.on.Ad))) +
geom_point(alpha = 0.6) +
labs(title = 'Daily Time Spent On Site vs Income', x = 'Income', y = 'Time Spent (minutes)', color = 'Clicked.on.Ad') +
scale_color_brewer(palette = 'Set1')
# Both variables are continuous, so a scatter plot is used here, colored by click outcome
ggplot(data = ads, aes(x = Age, y = Daily.Time.Spent.on.Site, color = factor(Clicked.on.Ad))) +
geom_point(alpha = 0.6) +
labs(title = 'Daily Time Spent On Site vs Age', x = 'Age', y = 'Time Spent (minutes)', color = 'Clicked.on.Ad') +
scale_color_brewer(palette = 'Set1')
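The plots above can be complemented with a numeric summary of clicks per age band. The sketch below is one possible approach, not part of the original analysis: it uses the twelve Age and Clicked.on.Ad values printed by head(ads) and tail(ads), and the band boundaries (18, 25, 42, 61) are assumptions chosen for illustration. With only twelve rows the rates are not representative of the full dataset.

```r
# Twelve rows copied from head(ads) and tail(ads); illustration only
age     <- c(35, 31, 26, 29, 35, 23, 28, 30, 45, 51, 19, 26)
clicked <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)

# Assumed band boundaries; intervals are closed on the right, e.g. (25, 42]
age_band <- cut(age, breaks = c(18, 25, 42, 61),
                labels = c("19-25", "26-42", "43-61"))

# Number of clicks and share of users who clicked within each band
clicks_per_band <- tapply(clicked, age_band, sum)
rate_per_band   <- tapply(clicked, age_band, mean)

print(data.frame(clicks = clicks_per_band, click_rate = rate_per_band))
```

On the full data, the same calls with ads$Age and ads$Clicked.on.Ad would show both how many clicks each band contributes and how likely users in each band are to click.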
Users aged between 26 and 42 record the highest frequency of ad clicks on the site and also the highest amount of time spent on the internet. Income levels between 50k and 70k record the highest frequency of ad clicks on the site. People who spend more time on the internet tend to have a higher income (a correlation of about 0.34).
The ads posted on the client’s site should therefore be made more relevant to this demographic, from the late twenties to the early forties. Her users also skew towards the high-income end of the spectrum, which is expected given her age demographic data.
Perhaps she could maximize the revenue gained from her advertising by raising the cost of the courses, or by introducing tiered lesson levels structured so that users are more likely to select the courses that cost more. She should be able to do this without losing users. Her demographic is older and has more spending money, and is more likely to assess quality before balking at higher prices.