###–>> Which individuals are more likely to click on adverts on cryptography?
###–>> The project will be considered a success when we can identify which individuals will click on the advert.
###–>> A Kenyan entrepreneur has created an online cryptography course and wants to advertise it on her blog. She currently targets audiences from various countries. In the past, she ran ads for a related course on the same blog and collected data in the process. She would now like to employ your services as a Data Science Consultant to help her identify which individuals are most likely to click on her ads.
###–>> (i) Find and deal with outliers, anomalies, and missing data within the dataset. (ii) Perform univariate and bivariate analysis. (iii) From your insights, provide a conclusion and recommendation.
###–>> The data is valid and has been provided by the entrepreneur. It was collected from the previous adverts.
adverts <- read.csv("advertising.csv")
head(adverts)
tail(adverts)
dim(adverts)
## [1] 1000 10
###–>> The dataframe has 1000 observations and 10 variables
str(adverts)
## 'data.frame': 1000 obs. of 10 variables:
## $ Daily.Time.Spent.on.Site: num 69 80.2 69.5 74.2 68.4 ...
## $ Age : int 35 31 26 29 35 23 33 48 30 20 ...
## $ Area.Income : num 61834 68442 59786 54806 73890 ...
## $ Daily.Internet.Usage : num 256 194 236 246 226 ...
## $ Ad.Topic.Line : chr "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
## $ City : chr "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
## $ Male : int 0 1 0 1 0 1 0 1 1 1 ...
## $ Country : chr "Tunisia" "Nauru" "San Marino" "Italy" ...
## $ Timestamp : chr "2016-03-27 00:53:11" "2016-04-04 01:39:02" "2016-03-13 20:35:42" "2016-01-10 02:31:19" ...
## $ Clicked.on.Ad : int 0 0 0 0 0 0 0 1 0 0 ...
###–>> R stores a data frame as a list of columns, so we use the str function to see the data type of each column.
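###–>> As a compact alternative to str, the column types can be collected into a named vector. The sketch below runs on a small made-up data frame; toy and col_types are hypothetical names, not part of the advertising data.

```r
# Toy data frame for illustration only (not the advertising data).
toy <- data.frame(
  age = c(35L, 31L),
  city = c("Davidton", "West Jodi"),
  income = c(61834.0, 68442.0),
  stringsAsFactors = FALSE
)

# A data frame is a list whose elements are the columns.
is.list(toy)

# sapply() applies class() to every column and returns a named character vector.
col_types <- sapply(toy, class)
col_types
```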
duplicates <- adverts[duplicated(adverts), ]
duplicates
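###–>> Counting duplicated rows is often easier to read than printing them. This is a minimal sketch on made-up data; toy, n_duplicates and toy_unique are hypothetical names used only for illustration.

```r
# Toy data frame for illustration only; the third row repeats the first.
toy <- data.frame(age = c(35, 31, 35), male = c(0, 1, 0))

# duplicated() flags every row that repeats an earlier row.
n_duplicates <- sum(duplicated(toy))
n_duplicates

# unique() keeps only the first occurrence of each row.
toy_unique <- unique(toy)
dim(toy_unique)
```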
colSums(is.na(adverts))
## Daily.Time.Spent.on.Site Age Area.Income
## 0 0 0
## Daily.Internet.Usage Ad.Topic.Line City
## 0 0 0
## Male Country Timestamp
## 0 0 0
## Clicked.on.Ad
## 0
###–>> None of the dataset's columns has missing data.
num_cols <- adverts[,unlist(lapply(adverts, is.numeric))]
num_cols
###–>> 6 columns are numerical in nature
boxplot(num_cols, notch = TRUE)
## Warning in (function (z, notch = FALSE, width = NULL, varwidth = FALSE, : some
## notches went outside hinges ('box'): maybe set notch=FALSE
###–>> The Area.Income variable has outliers, which will be removed.
boxplot.stats(adverts$Area.Income)$out
## [1] 17709.98 18819.34 15598.29 15879.10 14548.06 13996.50 14775.50 18368.57
# Absolute z-score of every value, computed column by column
z_scores <- as.data.frame(sapply(num_cols, function(x) abs(x - mean(x)) / sd(x)))
z_scores
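###–>> We will drop every row in which any variable has an absolute z-score greater than 3. These rows are treated as outliers.
# Index the original numeric columns (not the z-score table) so the retained rows keep their real values
no_outliers <- num_cols[rowSums(z_scores > 3) == 0, ]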
head(no_outliers)
dim(num_cols)
## [1] 1000 6
dim(no_outliers)
## [1] 998 6
###–>> We removed 2 observations.
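###–>> The z-score filter can be sketched end to end on made-up data. Here scale() is swapped in for the sapply expression; for numeric columns it also centres on the mean and divides by the standard deviation. The names toy, z, keep and toy_clean are hypothetical.

```r
# Toy data for illustration only: x contains one extreme value (200), y has none.
toy <- data.frame(x = c(1:19, 200), y = 50:69)

# scale() centres each column on its mean and divides by its standard deviation;
# abs() turns the result into absolute z-scores.
z <- abs(scale(toy))

# Keep a row only if none of its z-scores exceeds 3.
keep <- rowSums(z > 3) == 0
toy_clean <- toy[keep, ]
dim(toy_clean)
```

###–>> Note that the filter is applied to the original values, so toy_clean still holds data on its real scale.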
boxplot(no_outliers)
###–>> There are still outliers, so we will use the interquartile range method to remove them.
# Apply the Interquartile Range, IQR(), function on the area income column
income.IQR <- IQR(adverts$Area.Income)
income.IQR
## [1] 18438.83
# 47032 and 65471 are the first and third quartiles of Area.Income, as reported by summary()
adverts_2 <- subset(adverts, Area.Income > (47032 - 1.5 * income.IQR) & Area.Income < (65471 + 1.5 * income.IQR))
dim(adverts_2)
## [1] 991 10
###–>> Filtering on the IQR fences removed 9 observations, which were the Area.Income outliers. We proceed with the analysis.
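###–>> The quartiles do not have to be typed in by hand. The sketch below derives the fences with quantile() on a made-up income vector, assuming the usual 1.5 x IQR rule; toy_income, q, iqr, lower_fence, upper_fence and kept are hypothetical names.

```r
# Toy incomes for illustration only: 19 regular values plus one very low value.
toy_income <- c(seq(40000, 58000, by = 1000), 1000)

# First and third quartiles (R's default quantile type, the same one IQR() uses).
q <- quantile(toy_income, probs = c(0.25, 0.75))
iqr <- unname(q[2] - q[1])

# Fences at 1.5 * IQR below Q1 and above Q3.
lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[2]) + 1.5 * iqr

# Keep only the values strictly inside the fences.
kept <- toy_income[toy_income > lower_fence & toy_income < upper_fence]
length(kept)
```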
summary(num_cols)
## Daily.Time.Spent.on.Site Age Area.Income Daily.Internet.Usage
## Min. :32.60 Min. :19.00 Min. :13996 Min. :104.8
## 1st Qu.:51.36 1st Qu.:29.00 1st Qu.:47032 1st Qu.:138.8
## Median :68.22 Median :35.00 Median :57012 Median :183.1
## Mean :65.00 Mean :36.01 Mean :55000 Mean :180.0
## 3rd Qu.:78.55 3rd Qu.:42.00 3rd Qu.:65471 3rd Qu.:218.8
## Max. :91.43 Max. :61.00 Max. :79485 Max. :270.0
## Male Clicked.on.Ad
## Min. :0.000 Min. :0.0
## 1st Qu.:0.000 1st Qu.:0.0
## Median :0.000 Median :0.5
## Mean :0.481 Mean :0.5
## 3rd Qu.:1.000 3rd Qu.:1.0
## Max. :1.000 Max. :1.0
###–>> The summary shows, for each numerical variable:
###–>> 1. The minimum value.
###–>> 2. The first quartile.
###–>> 3. The median.
###–>> 4. The mean.
###–>> 5. The third quartile.
###–>> 6. The maximum value.
variance <- var(num_cols)
variance
## Daily.Time.Spent.on.Site Age Area.Income
## Daily.Time.Spent.on.Site 251.3370949 -4.617415e+01 6.613081e+04
## Age -46.1741459 7.718611e+01 -2.152093e+04
## Area.Income 66130.8109082 -2.152093e+04 1.799524e+08
## Daily.Internet.Usage 360.9918827 -1.416348e+02 1.987625e+05
## Male -0.1501864 -9.242142e-02 8.867509e+00
## Clicked.on.Ad -5.9331431 2.164665e+00 -3.195989e+03
## Daily.Internet.Usage Male Clicked.on.Ad
## Daily.Time.Spent.on.Site 3.609919e+02 -0.15018639 -5.933143e+00
## Age -1.416348e+02 -0.09242142 2.164665e+00
## Area.Income 1.987625e+05 8.86750903 -3.195989e+03
## Daily.Internet.Usage 1.927415e+03 0.61476667 -1.727409e+01
## Male 6.147667e-01 0.24988889 -9.509510e-03
## Clicked.on.Ad -1.727409e+01 -0.00950951 2.502503e-01
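###–>> Called on a data frame, var returns the variance-covariance matrix, so the variance of each column sits on the diagonal. Variance measures how far the values in a column are spread around their mean. For example, Area.Income (about 1.8e+08) is far more spread out than Age (about 77.2), although the two are measured in different units.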
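###–>> When only the per-column variances are needed, the diagonal of the matrix can be extracted with diag(). This is a minimal sketch on made-up data; toy and variances are hypothetical names.

```r
# Toy data frame for illustration only.
toy <- data.frame(
  age = c(20, 30, 40, 50),
  income = c(20000, 40000, 60000, 80000)
)

# var() on a data frame returns the full variance-covariance matrix.
vcov_matrix <- var(toy)

# The diagonal holds the variance of each column.
variances <- diag(vcov_matrix)
variances
```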
# Print (and return) the standard deviation of a single column
sd.function <- function(column) {
  standard.deviations <- sd(column)
  print(standard.deviations)
}
sd.function(adverts_2$Daily.Time.Spent.on.Site)
## [1] 15.9005
sd.function(adverts_2$Age)
## [1] 8.804716
sd.function(adverts_2$Area.Income)
## [1] 12961.5
sd.function(adverts_2$Daily.Internet.Usage)
## [1] 44.05386
###–>> A low standard deviation indicates that values lie close to the mean, while a high one indicates that they are spread far from it. For example, the Age column's standard deviation of 8.8 shows that its values sit closer to their mean than those of the Area.Income column, whose standard deviation is 12961. The two columns are in different units, so this compares absolute spread only.
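###–>> The repeated calls can be collapsed into one sapply(), and dividing by the mean gives a unit-free measure of spread (the coefficient of variation, which is not part of the original analysis). A sketch on made-up data, with hypothetical names toy, std_devs and coef_var:

```r
# Toy data frame for illustration only.
toy <- data.frame(
  age = c(20, 30, 40, 50),
  income = c(20000, 40000, 60000, 80000)
)

# Standard deviation of every column in one call.
std_devs <- sapply(toy, sd)
std_devs

# Coefficient of variation: standard deviation relative to the mean,
# which allows columns in different units to be compared.
coef_var <- std_devs / sapply(toy, mean)
coef_var
```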
library(moments)
skewness(num_cols)
## Daily.Time.Spent.on.Site Age Area.Income
## -0.37120261 0.47842268 -0.64939670
## Daily.Internet.Usage Male Clicked.on.Ad
## -0.03348703 0.07605493 0.00000000
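###–>> Age has positive skewness (0.48), so its distribution has a longer right tail. Daily.Time.Spent.on.Site (-0.37) and Area.Income (-0.65) have negative skewness, so their left tails are longer, while Daily.Internet.Usage (-0.03) is close to symmetric. Male (0.08) is nearly balanced and Clicked.on.Ad is exactly 0, which is what a binary variable split evenly between 0 and 1 gives.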
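###–>> To make the sign of the skewness concrete, the sketch below computes it by hand, assuming the moment-based definition (third central moment divided by the second central moment raised to 3/2), which to our understanding is what the moments package uses. skew_manual and the three toy vectors are hypothetical.

```r
# Moment-based skewness: m3 / m2^(3/2), using central moments that divide by n.
skew_manual <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}

# Toy vectors for illustration only.
right_tailed <- c(1, 2, 2, 3, 3, 3, 20)   # one large value stretches the right tail
left_tailed <- -right_tailed              # mirror image, so the left tail is longer
balanced_binary <- c(0, 0, 1, 1)          # evenly split 0/1 variable

skew_manual(right_tailed)
skew_manual(left_tailed)
skew_manual(balanced_binary)
```

###–>> The first value is positive, the second is its negative, and the balanced binary vector gives 0.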
cov(num_cols)
## Daily.Time.Spent.on.Site Age Area.Income
## Daily.Time.Spent.on.Site 251.3370949 -4.617415e+01 6.613081e+04
## Age -46.1741459 7.718611e+01 -2.152093e+04
## Area.Income 66130.8109082 -2.152093e+04 1.799524e+08
## Daily.Internet.Usage 360.9918827 -1.416348e+02 1.987625e+05
## Male -0.1501864 -9.242142e-02 8.867509e+00
## Clicked.on.Ad -5.9331431 2.164665e+00 -3.195989e+03
## Daily.Internet.Usage Male Clicked.on.Ad
## Daily.Time.Spent.on.Site 3.609919e+02 -0.15018639 -5.933143e+00
## Age -1.416348e+02 -0.09242142 2.164665e+00
## Area.Income 1.987625e+05 8.86750903 -3.195989e+03
## Daily.Internet.Usage 1.927415e+03 0.61476667 -1.727409e+01
## Male 6.147667e-01 0.24988889 -9.509510e-03
## Clicked.on.Ad -1.727409e+01 -0.00950951 2.502503e-01
###–>> Age is the only column with a positive covariance with Clicked.on.Ad. The rest have negative covariances. Covariance depends on the units of each variable, so the magnitudes here are not comparable across columns.
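###–>> Dividing a covariance by the two standard deviations removes the units and gives the correlation. The sketch below checks this on made-up data, both by hand and with cov2cor() from the stats package; toy, cov_mat, manual_cor and via_cov2cor are hypothetical names.

```r
# Toy data for illustration only: heavier internet users did not click.
toy <- data.frame(
  usage = c(250, 200, 180, 120, 110),
  clicked = c(0, 0, 0, 1, 1)
)

cov_mat <- cov(toy)

# Correlation = covariance / (sd of x * sd of y).
manual_cor <- cov_mat["usage", "clicked"] / (sd(toy$usage) * sd(toy$clicked))

# cov2cor() applies the same rescaling to the whole covariance matrix.
via_cov2cor <- cov2cor(cov_mat)["usage", "clicked"]

manual_cor
via_cov2cor
```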
cor(num_cols)
## Daily.Time.Spent.on.Site Age Area.Income
## Daily.Time.Spent.on.Site 1.00000000 -0.33151334 0.310954413
## Age -0.33151334 1.00000000 -0.182604955
## Area.Income 0.31095441 -0.18260496 1.000000000
## Daily.Internet.Usage 0.51865848 -0.36720856 0.337495533
## Male -0.01895085 -0.02104406 0.001322359
## Clicked.on.Ad -0.74811656 0.49253127 -0.476254628
## Daily.Internet.Usage Male Clicked.on.Ad
## Daily.Time.Spent.on.Site 0.51865848 -0.018950855 -0.74811656
## Age -0.36720856 -0.021044064 0.49253127
## Area.Income 0.33749553 0.001322359 -0.47625463
## Daily.Internet.Usage 1.00000000 0.028012326 -0.78653918
## Male 0.02801233 1.000000000 -0.03802747
## Clicked.on.Ad -0.78653918 -0.038027466 1.00000000
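###–>> Daily.Internet.Usage (-0.79) and Daily.Time.Spent.on.Site (-0.75) are strongly negatively correlated with the target variable, Area.Income moderately (-0.48) and Male negligibly (-0.04). Age is the only variable with a positive correlation (0.49). Let's see that in the correlogram below.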
library(corrplot)
## corrplot 0.92 loaded
corr_ <- cor(num_cols)
corrplot(corr_, method = 'color')
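###–>> Alongside the correlogram, the predictors can be ranked by the strength of their correlation with the target. This is a minimal sketch on made-up data; toy, target_cor and ranked are hypothetical names, and the numbers it produces say nothing about the advertising dataset.

```r
# Toy data for illustration only.
toy <- data.frame(
  usage = c(250, 200, 180, 120, 110, 240),
  age = c(25, 30, 28, 45, 50, 27),
  male = c(0, 1, 0, 1, 0, 1),
  clicked = c(0, 0, 0, 1, 1, 0)
)

# Correlation of every column with the target, dropping the target itself.
target_cor <- cor(toy)[, "clicked"]
target_cor <- target_cor[names(target_cor) != "clicked"]

# Sort by absolute value so strong negative and strong positive
# relationships both rise to the top.
ranked <- target_cor[order(abs(target_cor), decreasing = TRUE)]
ranked
```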