Research Question

A Kenyan entrepreneur has created an online cryptography course and wants to advertise it on her blog. She currently targets audiences from various countries. In the past, she ran ads for a related course on the same blog and collected data in the process. She would now like to employ your services as a Data Science Consultant to help her identify which individuals are most likely to click on her ads.

1. Defining the Question

1.1 Specifying the data analytic objective

Our main aim is to do thorough exploratory data analysis for univariate and bivariate data and come up with recommendations for our client.

1.2 Defining the metric for success

We aim to build elaborate visualizations for univariate and bivariate analysis.

2. Loading and Reading Our Datasets

library(tidyverse)

## -- Attaching packages ---------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.2     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

ads <- read.csv("http://bit.ly/IPAdvertisingData")
view(ads)

Checking the data summary

summary(ads)

##  Daily.Time.Spent.on.Site      Age         Area.Income    Daily.Internet.Usage
##  Min.   :32.60            Min.   :19.00   Min.   :13996   Min.   :104.8       
##  1st Qu.:51.36            1st Qu.:29.00   1st Qu.:47032   1st Qu.:138.8       
##  Median :68.22            Median :35.00   Median :57012   Median :183.1       
##  Mean   :65.00            Mean   :36.01   Mean   :55000   Mean   :180.0       
##  3rd Qu.:78.55            3rd Qu.:42.00   3rd Qu.:65471   3rd Qu.:218.8       
##  Max.   :91.43            Max.   :61.00   Max.   :79485   Max.   :270.0       
##  Ad.Topic.Line          City                Male         Country         
##  Length:1000        Length:1000        Min.   :0.000   Length:1000       
##  Class :character   Class :character   1st Qu.:0.000   Class :character  
##  Mode  :character   Mode  :character   Median :0.000   Mode  :character  
##                                        Mean   :0.481                     
##                                        3rd Qu.:1.000                     
##                                        Max.   :1.000                     
##   Timestamp         Clicked.on.Ad
##  Length:1000        Min.   :0.0  
##  Class :character   1st Qu.:0.0  
##  Mode  :character   Median :0.5  
##                     Mean   :0.5  
##                     3rd Qu.:1.0  
##                     Max.   :1.0

From the table above, we can see all our measures of central tendency (median, mean).

Checking top and bottom rows and columns

tail(ads)

##      Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 995                     43.70  28    63126.96               173.01
## 996                     72.97  30    71384.57               208.58
## 997                     51.30  45    67782.17               134.42
## 998                     51.63  51    42415.72               120.37
## 999                     55.55  19    41920.79               187.95
## 1000                    45.01  26    29875.80               178.35
##                             Ad.Topic.Line          City Male
## 995         Front-line bifurcated ability  Nicholasland    0
## 996         Fundamental modular algorithm     Duffystad    1
## 997       Grass-roots cohesive monitoring   New Darlene    1
## 998          Expanded intangible solution South Jessica    1
## 999  Proactive bandwidth-monitored policy   West Steven    0
## 1000      Virtual 5thgeneration emulation   Ronniemouth    0
##                     Country           Timestamp Clicked.on.Ad
## 995                 Mayotte 2016-04-04 03:57:48             1
## 996                 Lebanon 2016-02-11 21:49:00             1
## 997  Bosnia and Herzegovina 2016-04-22 02:07:01             1
## 998                Mongolia 2016-02-01 17:24:57             1
## 999               Guatemala 2016-03-24 02:35:54             0
## 1000                 Brazil 2016-06-03 21:43:21             1

head(ads)

##   Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## 1                    68.95  35    61833.90               256.09
## 2                    80.23  31    68441.85               193.77
## 3                    69.47  26    59785.94               236.50
## 4                    74.15  29    54806.18               245.89
## 5                    68.37  35    73889.99               225.58
## 6                    59.99  23    59761.56               226.74
##                           Ad.Topic.Line           City Male    Country
## 1    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2    Monitored national standardization      West Jodi    1      Nauru
## 3      Organic bottom-line service-desk       Davidton    0 San Marino
## 4 Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5         Robust logistical utilization   South Manuel    0    Iceland
## 6       Sharable client-driven software      Jamieberg    1     Norway
##             Timestamp Clicked.on.Ad
## 1 2016-03-27 00:53:11             0
## 2 2016-04-04 01:39:02             0
## 3 2016-03-13 20:35:42             0
## 4 2016-01-10 02:31:19             0
## 5 2016-06-03 03:36:18             0
## 6 2016-05-19 14:30:17             0

Checking the classes

class(ads)

## [1] "data.frame"

Checking the number of rows and columns in our dataset

cat("Rows in dataset:", nrow(ads), "\nCols in dataset:", ncol(ads))

## Rows in dataset: 1000 
## Cols in dataset: 10

cat("\nThe dimension of the dataset is:", dim(ads))

## 
## The dimension of the dataset is: 1000 10

Range of Time Spent on Site by users

site.time.range <- range(ads$Daily.Time.Spent.on.Site)
cat("The Range of Time Spent on Site by users is:",site.time.range)

## The Range of Time Spent on Site by users is: 32.6 91.43

Range of Daily Internet Usage

internet.time.range <- range(ads$Daily.Internet.Usage)
cat("The Range of Daily Internet Usage is:", internet.time.range)

## The Range of Daily Internet Usage is: 104.78 269.96

Range of Age

age.range <- range(ads$Age)
cat("The Range of Users' age is:",age.range)

## The Range of Users' age is: 19 61

Range of Income

income.range <- range(ads$Area.Income)
cat("The Range of Users' income is:",income.range)

## The Range of Users' income is: 13996.5 79484.8
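The four range checks above can also be done in a single step. The following is a minimal sketch, assuming a stand-in data frame (the hypothetical `toy_ads`, built from the six rows printed by head(ads)) so that it runs without downloading the full dataset:

```r
# Stand-in for the full data: the six rows printed by head(ads) above.
# `toy_ads` is a hypothetical name used only for this illustration.
toy_ads <- data.frame(
  Daily.Time.Spent.on.Site = c(68.95, 80.23, 69.47, 74.15, 68.37, 59.99),
  Age                      = c(35L, 31L, 26L, 29L, 35L, 23L),
  Area.Income              = c(61833.90, 68441.85, 59785.94,
                               54806.18, 73889.99, 59761.56),
  Daily.Internet.Usage     = c(256.09, 193.77, 236.50, 245.89, 225.58, 226.74)
)

# sapply() applies range() to every column and binds the results into a
# 2-row matrix: row 1 holds the minima, row 2 the maxima.
ranges <- sapply(toy_ads, range)
rownames(ranges) <- c("min", "max")
print(ranges)
```

On the real data, the same call could be applied to the numeric columns only, for example `sapply(Filter(is.numeric, ads), range)`.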

Structure of our dataframe

str(ads)

## 'data.frame':    1000 obs. of  10 variables:
##  $ Daily.Time.Spent.on.Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area.Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily.Internet.Usage    : num  256 194 236 246 226 ...
##  $ Ad.Topic.Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : chr  "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
##  $ Male                    : int  0 1 0 1 0 1 0 1 1 1 ...
##  $ Country                 : chr  "Tunisia" "Nauru" "San Marino" "Italy" ...
##  $ Timestamp               : chr  "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
##  $ Clicked.on.Ad           : int  0 0 0 0 0 0 0 1 0 0 ...

Our dataset is a data frame with 1000 records and 10 variables: 3 of type numeric, 3 of type integer and 4 of type character. The character variables include the date and time, which will be converted to a standard date-time format.

Converting the date and time

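The timestamp looks like a date and time when displayed, but it is stored as a character string. It should belong to a date-time class such as “POSIXct” “POSIXt” (such objects carry two classes). Note that strptime(), used below, returns the closely related “POSIXlt” “POSIXt”.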

class(ads$Timestamp)

## [1] "character"

ads$Timestamp <- strptime(ads$Timestamp, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
class(ads$Timestamp)

## [1] "POSIXlt" "POSIXt"
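For comparison, here is a minimal sketch of the POSIXct route, which is often the more convenient form for data frame columns. It uses three timestamps copied from the head(ads) output; the names `ts_chr`, `ts` and `parts` are illustrative and not part of the original analysis:

```r
# Three timestamps copied from the head(ads) output above.
ts_chr <- c("2016-03-27 00:53:11", "2016-04-04 01:39:02", "2016-03-13 20:35:42")

# as.POSIXct() stores each time as a single number (seconds since the epoch),
# whereas strptime() returns POSIXlt, a list of date-time components.
ts <- as.POSIXct(ts_chr, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# format() pulls individual components back out as strings;
# as.integer() turns them into numbers that can be summarised or plotted.
parts <- data.frame(
  hour  = as.integer(format(ts, "%H")),
  month = as.integer(format(ts, "%m"))
)
print(class(ts))
print(parts)
```

Components such as the hour of day or the month could then be compared against Clicked.on.Ad in the same way as the other numeric columns.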

Checking for duplicates

sum(duplicated(ads))

## [1] 0

There are no duplicates in our data.

Checking for missing values

colSums(is.na(ads))

## Daily.Time.Spent.on.Site                      Age              Area.Income 
##                        0                        0                        0 
##     Daily.Internet.Usage            Ad.Topic.Line                     City 
##                        0                        0                        0 
##                     Male                  Country                Timestamp 
##                        0                        0                        0 
##            Clicked.on.Ad 
##                        0

The dataset has no missing values in any of the columns.

Exploratory Data Analysis

install.packages("dataMaid", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/ruth/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'dataMaid' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\ruth\AppData\Local\Temp\RtmpML9PSf\downloaded_packages

install.packages("inspectdf", repos = "http://cran.us.r-project.org")

## Installing package into 'C:/Users/ruth/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'inspectdf' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'inspectdf'

## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying C:
## \Users\ruth\Documents\R\win-library\4.0\00LOCK\inspectdf\libs\x64\inspectdf.dll
## to C:\Users\ruth\Documents\R\win-library\4.0\inspectdf\libs\x64\inspectdf.dll:
## Permission denied

## Warning: restored 'inspectdf'

## 
## The downloaded binary packages are in
##  C:\Users\ruth\AppData\Local\Temp\RtmpML9PSf\downloaded_packages

Calling the libraries

library(dplyr)
library(inspectdf)

## Warning: package 'inspectdf' was built under R version 4.0.3

The 2 packages will give us more insights on our data.

inspect_cat(ads)

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 3 x 5
##   col_name        cnt common                       common_pcnt levels           
##   <chr>         <int> <chr>                              <dbl> <named list>     
## 1 Ad.Topic.Line  1000 Adaptive 24hour Graphic Int~       0.1   <tibble [1,000 x~
## 2 City            969 Lisamouth                          0.3   <tibble [969 x 3~
## 3 Country         237 Czech Republic                     0.900 <tibble [237 x 3~

common_pcnt is the percentage of each column occupied by its most common level, which is shown in common.

Bivariate Analysis visualization

Here we check for correlation between the different columns and the target variable Clicked.on.Ad.

inspect_cor(ads, df2 = NULL, method = "pearson", with_col = 'Clicked.on.Ad', alpha = 0.05)

## Warning: Columns with 0 variance found: Male, Clicked.on.Ad

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 5 x 7
##   col_1         col_2                    corr   p_value   lower   upper pcnt_nna
##   <chr>         <chr>                   <dbl>     <dbl>   <dbl>   <dbl>    <dbl>
## 1 Clicked.on.Ad Daily.Internet.Usage  -0.787  3.74e-136 -0.809  -0.762       100
## 2 Clicked.on.Ad Daily.Time.Spent.on.~ -0.748  2.29e-123 -0.774  -0.719       100
## 3 Clicked.on.Ad Age                    0.493  1.55e- 54  0.444   0.538       100
## 4 Clicked.on.Ad Area.Income           -0.476  4.15e- 51 -0.523  -0.427       100
## 5 Clicked.on.Ad Male                  -0.0380 2.30e-  1 -0.0998  0.0240      100

The summary above lists Pearson’s correlation coefficients between each numeric column and the Clicked.on.Ad column.

Clicked.on.Ad is negatively correlated with Daily.Internet.Usage (-0.787), Daily.Time.Spent.on.Site (-0.748) and Area.Income (-0.476). The only positive correlation is between Clicked.on.Ad and Age (0.493). The correlation with Male (-0.038) is not statistically significant (p = 0.23).
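To make the sign of these coefficients concrete: with a 0/1 target, a negative Pearson coefficient means the group that clicked has the lower mean on the other variable. The sketch below uses toy values invented purely for illustration (they are not taken from the dataset) and base R's cor() and cor.test(), not inspectdf:

```r
# Toy data, invented for illustration only (not taken from the dataset).
clicked <- c(0, 0, 0, 0, 1, 1, 1, 1)
usage   <- c(250, 230, 240, 220, 150, 130, 160, 140)

# Pearson correlation between the binary target and the numeric variable.
r <- cor(clicked, usage, method = "pearson")

# Group means: a negative r means the clicked == 1 group has the lower mean.
means <- tapply(usage, clicked, mean)

# cor.test() adds a p-value and a confidence interval for the coefficient.
ct <- cor.test(clicked, usage, method = "pearson")

print(r)
print(means)
print(ct$conf.int)
```

Read this way, the negative coefficient for Daily.Internet.Usage indicates that users who clicked on the ad have lower average internet usage than those who did not.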

inspect_cor(ads, df2 = NULL, method = "pearson", alpha = 0.05)

## Warning: Columns with 0 variance found: Male, Clicked.on.Ad

## Warning: `...` is not empty.
## 
## We detected these problematic arguments:
## * `needs_dots`
## 
## These dots only exist to allow future extensions and should be empty.
## Did you misspecify an argument?

## # A tibble: 15 x 7
##    col_1           col_2                 corr   p_value   lower   upper pcnt_nna
##    <chr>           <chr>                <dbl>     <dbl>   <dbl>   <dbl>    <dbl>
##  1 Clicked.on.Ad   Daily.Internet.U~ -0.787   3.74e-136 -0.809  -0.762       100
##  2 Clicked.on.Ad   Daily.Time.Spent~ -0.748   2.29e-123 -0.774  -0.719       100
##  3 Daily.Internet~ Daily.Time.Spent~  0.519   2.80e- 60  0.472   0.563       100
##  4 Clicked.on.Ad   Age                0.493   1.55e- 54  0.444   0.538       100
##  5 Clicked.on.Ad   Area.Income       -0.476   4.15e- 51 -0.523  -0.427       100
##  6 Daily.Internet~ Age               -0.367   4.38e- 31 -0.420  -0.312       100
##  7 Daily.Internet~ Area.Income        0.337   1.63e- 26  0.281   0.391       100
##  8 Age             Daily.Time.Spent~ -0.332   1.22e- 25 -0.386  -0.275       100
##  9 Area.Income     Daily.Time.Spent~  0.311   9.37e- 23  0.254   0.366       100
## 10 Area.Income     Age               -0.183   8.13e-  9 -0.242  -0.122       100
## 11 Clicked.on.Ad   Male              -0.0380  2.30e-  1 -0.0998  0.0240      100
## 12 Male            Daily.Internet.U~  0.0280  3.76e-  1 -0.0340  0.0898      100
## 13 Male            Age               -0.0210  5.06e-  1 -0.0829  0.0410      100
## 14 Male            Daily.Time.Spent~ -0.0190  5.50e-  1 -0.0808  0.0431      100
## 15 Male            Area.Income        0.00132 9.67e-  1 -0.0607  0.0633      100

install.packages("PerformanceAnalytics")

## Installing package into 'C:/Users/ruth/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'PerformanceAnalytics' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\ruth\AppData\Local\Temp\RtmpML9PSf\downloaded_packages

install.packages("corrplot")

## Installing package into 'C:/Users/ruth/Documents/R/win-library/4.0'
## (as 'lib' is unspecified)

## package 'corrplot' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\ruth\AppData\Local\Temp\RtmpML9PSf\downloaded_packages

library(corrplot)

## corrplot 0.84 loaded

ads_num <- Filter(is.numeric, ads)
corrplot(cor(ads_num))

Daily.Internet.Usage and Daily.Time.Spent.on.Site have a moderate positive correlation (0.519), and so do Clicked.on.Ad and Age (0.493).

We plan to use the Clicked.on.Ad feature to determine the fill colors for these graphs, but that will not work while it is stored as an integer, because ggplot2 treats integers as continuous and will not split the bars by group. The following code chunk converts it to a factor first.

library(ggplot2)

# Convert the 0/1 target to a factor so it can drive a discrete fill scale
ads$Clicked.on.Ad <- as.factor(ads$Clicked.on.Ad)

ggplot(data = ads, aes(x = Age, fill = Clicked.on.Ad)) +
    geom_histogram(bins = 27, color = 'cyan') +
    labs(title = 'Age distribution with Ad clicks', x = 'Age', y = 'Frequency', fill = 'Clicked.on.Ad') +
    scale_fill_brewer(palette = 'Set2')  # fill scale, since the mapped aesthetic is fill

Income and Click on Ad distribution

# factor() keeps the fill discrete even if Clicked.on.Ad is still an integer
ggplot(data = ads, aes(x = Area.Income, fill = factor(Clicked.on.Ad))) +
    geom_histogram(bins = 20, color = 'cyan') +
    labs(title = 'Income distribution', x = 'Income', y = 'Frequency', fill = 'Clicked.on.Ad') +
    scale_fill_brewer(palette = 'Set1')

Daily Internet Use and the clicked on ad relationship

# factor() keeps the fill discrete even if Clicked.on.Ad is still an integer
ggplot(data = ads, aes(x = Daily.Internet.Usage, fill = factor(Clicked.on.Ad))) +
    geom_histogram(bins = 35, color = 'cyan') +
    labs(title = 'Daily Internet Use distribution', x = 'Daily Internet Usage (minutes)', y = 'Frequency', fill = 'Clicked.on.Ad') +
    scale_fill_brewer(palette = 'Set1')

Daily Time Spent on Site and the clicked on ad relationship

# factor() keeps the fill discrete even if Clicked.on.Ad is still an integer
ggplot(data = ads, aes(x = Daily.Time.Spent.on.Site, fill = factor(Clicked.on.Ad))) +
    geom_histogram(bins = 25, color = 'cyan') +
    labs(title = 'Daily Time Spent On Site', x = 'Time Spent (minutes)', y = 'Frequency', fill = 'Clicked.on.Ad') +
    scale_fill_brewer(palette = 'Set1')

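# Two continuous variables: a scatter plot shows the relationship directly;
# a histogram cannot use a continuous variable as its fill
ggplot(data = ads, aes(x = Area.Income, y = Daily.Time.Spent.on.Site, color = factor(Clicked.on.Ad))) +
    geom_point(alpha = 0.6) +
    labs(title = 'Daily Time Spent On Site vs Income', x = 'Income', y = 'Time Spent (minutes)', color = 'Clicked.on.Ad') +
    scale_color_brewer(palette = 'Set1')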

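# Two continuous variables: a scatter plot shows the relationship directly;
# a histogram cannot use a continuous variable as its fill
ggplot(data = ads, aes(x = Age, y = Daily.Time.Spent.on.Site, color = factor(Clicked.on.Ad))) +
    geom_point(alpha = 0.6) +
    labs(title = 'Daily Time Spent On Site vs Age', x = 'Age', y = 'Time Spent (minutes)', color = 'Clicked.on.Ad') +
    scale_color_brewer(palette = 'Set1')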

Conclusions

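Users aged between 26 and 42 record the highest frequency of ad clicks on the site and also the highest amount of time spent on the internet. Income levels between 50k and 70k record the highest frequency of ad clicks on the site. People who spend more time on the internet tend to have a higher income (correlation of 0.337 between Daily.Internet.Usage and Area.Income). These are raw frequencies; per the correlation table, the likelihood of a click rises with age (0.493) and falls with area income (-0.476).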
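Frequency of clicks and rate of clicks can point in different directions, because a band containing more users will record more clicks even if each user is less likely to click. The sketch below shows one way to compute both per age band. The values are toy data invented for illustration (not taken from the dataset), and the band boundaries and the names `toy`, `Age.Band`, `clicks` and `rate` are assumptions:

```r
# Toy data, invented for illustration only (not taken from the dataset).
toy <- data.frame(
  Age           = c(22, 25, 28, 31, 34, 37, 40, 45, 50, 58),
  Clicked.on.Ad = c(0,  0,  0,  1,  0,  1,  1,  1,  1,  1)
)

# cut() assigns each age to a band; right = FALSE gives bands such as [26, 43),
# i.e. ages 26 to 42 inclusive for whole-number ages.
toy$Age.Band <- cut(toy$Age, breaks = c(19, 26, 43, 62), right = FALSE,
                    labels = c("19-25", "26-42", "43-61"))

# Per band: number of clicks (a count) and share of users who clicked (a rate).
clicks <- tapply(toy$Clicked.on.Ad, toy$Age.Band, sum)
rate   <- tapply(toy$Clicked.on.Ad, toy$Age.Band, mean)
print(cbind(clicks, rate))
```

Applied to the real data, with Clicked.on.Ad kept as (or converted back to) a 0/1 number, the same two summaries would show whether the 26 to 42 band leads on clicks because of its size or because of a higher click rate.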

Recommendations

The ads posted on the client’s site should be tailored to this demographic, users between their late twenties and early forties. Her users also skew toward the high-income end of the spectrum. This was expected given the age demographic data.

Perhaps she could maximize revenue from her advertising by raising the cost of the courses, or by introducing tiered lesson levels structured so that users are more likely to select the courses that cost more. She should be able to do this without losing users. Her demographic is older and has more spending money, and is more likely to assess quality before balking at higher prices.

Independent Project: Exploratory Data Analysis using R

MR. N