Which individuals are most likely to click on the course advertisements?
# Loading the data.
library(data.table)
ad <- fread('advertising.csv')
Number of Records
# Number of rows and columns.
cat('Number of rows = ', nrow(ad), 'and the number of columns = ', ncol(ad),'.')
## Number of rows = 1000 and the number of columns = 10 .
Top Dataset Preview
# First 5 records.
head(ad, 5)
## Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1: 68.95 35 61833.90 256.09
## 2: 80.23 31 68441.85 193.77
## 3: 69.47 26 59785.94 236.50
## 4: 74.15 29 54806.18 245.89
## 5: 68.37 35 73889.99 225.58
## Ad Topic Line City Male Country
## 1: Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2: Monitored national standardization West Jodi 1 Nauru
## 3: Organic bottom-line service-desk Davidton 0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5: Robust logistical utilization South Manuel 0 Iceland
## Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11 0
## 2: 2016-04-04 01:39:02 0
## 3: 2016-03-13 20:35:42 0
## 4: 2016-01-10 02:31:19 0
## 5: 2016-06-03 03:36:18 0
Bottom Dataset Preview
# Last 5 records.
tail(ad, 5)
## Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1: 72.97 30 71384.57 208.58
## 2: 51.30 45 67782.17 134.42
## 3: 51.63 51 42415.72 120.37
## 4: 55.55 19 41920.79 187.95
## 5: 45.01 26 29875.80 178.35
## Ad Topic Line City Male
## 1: Fundamental modular algorithm Duffystad 1
## 2: Grass-roots cohesive monitoring New Darlene 1
## 3: Expanded intangible solution South Jessica 1
## 4: Proactive bandwidth-monitored policy West Steven 0
## 5: Virtual 5thgeneration emulation Ronniemouth 0
## Country Timestamp Clicked on Ad
## 1: Lebanon 2016-02-11 21:49:00 1
## 2: Bosnia and Herzegovina 2016-04-22 02:07:01 1
## 3: Mongolia 2016-02-01 17:24:57 1
## 4: Guatemala 2016-03-24 02:35:54 0
## 5: Brazil 2016-06-03 21:43:21 1
At first glance, the data set shows no anomalies.
# Data set structure.
str(ad)
## Classes 'data.table' and 'data.frame': 1000 obs. of 10 variables:
## $ Daily Time Spent on Site: num 69 80.2 69.5 74.2 68.4 ...
## $ Age : int 35 31 26 29 35 23 33 48 30 20 ...
## $ Area Income : num 61834 68442 59786 54806 73890 ...
## $ Daily Internet Usage : num 256 194 236 246 226 ...
## $ Ad Topic Line : chr "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
## $ City : chr "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
## $ Male : int 0 1 0 1 0 1 0 1 1 1 ...
## $ Country : chr "Tunisia" "Nauru" "San Marino" "Italy" ...
## $ Timestamp : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
## $ Clicked on Ad : int 0 0 0 0 0 0 0 1 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
# Logical mask of the numeric columns (named by column)
num <- sapply(ad, is.numeric)
# Logical mask of the non-numeric columns
cat_cols <- !num
# Excluding the Timestamp column
cat_cols['Timestamp'] <- FALSE
# Names of the remaining character columns
char_cols <- names(cat_cols)[cat_cols]
# Converting the character columns to factors by reference
ad[, (char_cols) := lapply(.SD, factor), .SDcols = char_cols]
# Checking changes
head(ad, 5)
## Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1: 68.95 35 61833.90 256.09
## 2: 80.23 31 68441.85 193.77
## 3: 69.47 26 59785.94 236.50
## 4: 74.15 29 54806.18 245.89
## 5: 68.37 35 73889.99 225.58
## Ad Topic Line City Male Country
## 1: Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2: Monitored national standardization West Jodi 1 Nauru
## 3: Organic bottom-line service-desk Davidton 0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5: Robust logistical utilization South Manuel 0 Iceland
## Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11 0
## 2: 2016-04-04 01:39:02 0
## 3: 2016-03-13 20:35:42 0
## 4: 2016-01-10 02:31:19 0
## 5: 2016-06-03 03:36:18 0
# Converting the encoded Male and Clicked on Ad columns to factors.
ad[ ,c('Male', 'Clicked on Ad')] <- lapply(ad[, c('Male', 'Clicked on Ad')], factor)
# Checking changes
head(ad, 5)
## Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1: 68.95 35 61833.90 256.09
## 2: 80.23 31 68441.85 193.77
## 3: 69.47 26 59785.94 236.50
## 4: 74.15 29 54806.18 245.89
## 5: 68.37 35 73889.99 225.58
## Ad Topic Line City Male Country
## 1: Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2: Monitored national standardization West Jodi 1 Nauru
## 3: Organic bottom-line service-desk Davidton 0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5: Robust logistical utilization South Manuel 0 Iceland
## Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11 0
## 2: 2016-04-04 01:39:02 0
## 3: 2016-03-13 20:35:42 0
## 4: 2016-01-10 02:31:19 0
## 5: 2016-06-03 03:36:18 0
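The two conversion steps above can also be collapsed into a single in-place update. The following is a minimal sketch on a hypothetical toy table (the name `toy` and its values are illustrative assumptions, not the advertising data); it relies on data.table's `:=` together with `.SDcols`.

```r
library(data.table)

# Hypothetical toy table standing in for `ad`; the values are made up.
toy <- data.table(
  Age = c(35L, 31L, 26L),
  City = c("Wrightburgh", "West Jodi", "Davidton"),
  Male = c(0L, 1L, 0L),
  `Clicked on Ad` = c(0L, 0L, 1L)
)

# Character columns plus the 0/1 encoded columns that should become factors.
char_cols <- names(toy)[sapply(toy, is.character)]
factor_cols <- c(char_cols, "Male", "Clicked on Ad")

# Convert by reference: the table is updated in place, without helper copies.
toy[, (factor_cols) := lapply(.SD, factor), .SDcols = factor_cols]

# Column classes after the conversion.
col_classes <- sapply(toy, class)
print(col_classes)
```

Checking the classes (or calling `str()`) shows the type change directly, which `head()` alone does not reveal.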
Column Validity
Checking for invalid/unnecessary columns that do not contribute relevant information to the study.
# Column names
colnames(ad)
## [1] "Daily Time Spent on Site" "Age"
## [3] "Area Income" "Daily Internet Usage"
## [5] "Ad Topic Line" "City"
## [7] "Male" "Country"
## [9] "Timestamp" "Clicked on Ad"
All columns are valid.
Checking for invalid values
# Checking for anomalies
# Data set summary
summary(ad)
## Daily Time Spent on Site Age Area Income Daily Internet Usage
## Min. :32.60 Min. :19.00 Min. :13996 Min. :104.8
## 1st Qu.:51.36 1st Qu.:29.00 1st Qu.:47032 1st Qu.:138.8
## Median :68.22 Median :35.00 Median :57012 Median :183.1
## Mean :65.00 Mean :36.01 Mean :55000 Mean :180.0
## 3rd Qu.:78.55 3rd Qu.:42.00 3rd Qu.:65471 3rd Qu.:218.8
## Max. :91.43 Max. :61.00 Max. :79485 Max. :270.0
##
## Ad Topic Line City Male
## Adaptive 24hour Graphic Interface : 1 Lisamouth : 3 0:519
## Adaptive asynchronous attitude : 1 Williamsport : 3 1:481
## Adaptive context-sensitive application : 1 Benjaminchester: 2
## Adaptive contextually-based methodology: 1 East John : 2
## Adaptive demand-driven knowledgebase : 1 East Timothy : 2
## Adaptive uniform capability : 1 Johnstad : 2
## (Other) :994 (Other) :986
## Country Timestamp Clicked on Ad
## Czech Republic: 9 Min. :2016-01-01 02:52:10.00 0:500
## France : 9 1st Qu.:2016-02-18 02:55:42.00 1:500
## Afghanistan : 8 Median :2016-04-07 17:27:29.50
## Australia : 8 Mean :2016-04-10 10:34:06.64
## Cyprus : 8 3rd Qu.:2016-05-31 03:18:14.00
## Greece : 8 Max. :2016-07-24 00:22:16.00
## (Other) :950
All numeric columns have minimum values above 0, so none contain negative values.
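The same check can be done programmatically instead of reading the minimums off the summary. A minimal sketch in base R, using a hypothetical toy data frame (`toy`, with made-up values and simplified column names) in place of `ad`:

```r
# Hypothetical toy data; the values are made up for illustration.
toy <- data.frame(
  Age = c(35, 31, 26),
  Area.Income = c(61833.90, 68441.85, 59785.94),
  City = c("Wrightburgh", "West Jodi", "Davidton"),
  stringsAsFactors = FALSE
)

# Logical mask of the numeric columns.
is_num <- sapply(toy, is.numeric)

# TRUE for each numeric column whose values are all non-negative.
non_negative <- sapply(toy[is_num], function(x) all(x >= 0, na.rm = TRUE))
print(non_negative)
```

Any column reporting `FALSE` would need a closer look before further analysis.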
# Checking unique categorical column values.
str(ad)
## Classes 'data.table' and 'data.frame': 1000 obs. of 10 variables:
## $ Daily Time Spent on Site: num 69 80.2 69.5 74.2 68.4 ...
## $ Age : int 35 31 26 29 35 23 33 48 30 20 ...
## $ Area Income : num 61834 68442 59786 54806 73890 ...
## $ Daily Internet Usage : num 256 194 236 246 226 ...
## $ Ad Topic Line : Factor w/ 1000 levels "Adaptive 24hour Graphic Interface",..: 92 465 567 904 767 806 223 724 108 455 ...
## $ City : Factor w/ 969 levels "Adamsbury","Adamside",..: 962 904 112 940 806 283 47 672 885 713 ...
## $ Male : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 2 2 ...
## $ Country : Factor w/ 237 levels "Afghanistan",..: 216 148 185 104 97 159 146 13 83 79 ...
## $ Timestamp : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
## $ Clicked on Ad : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
The categorical columns have plausible level counts (2 levels each for Male and Clicked on Ad, 237 countries, 969 cities and 1000 distinct ad topic lines), so no anomalies are apparent.
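Level counts alone will not reveal labels that differ only by case or stray whitespace. A minimal sketch of such a check in base R follows; the `city` vector is a hypothetical example with deliberately introduced problems, not values taken from the data set.

```r
# Hypothetical city labels with deliberate problems; not from the data set.
city <- factor(c("West Jodi", "west jodi", "Davidton ", "Davidton", "Wrightburgh"))

lv <- levels(city)

# Levels with leading or trailing whitespace.
padded <- lv[lv != trimws(lv)]

# Levels that collide once case and surrounding whitespace are ignored.
key <- tolower(trimws(lv))
collisions <- lv[key %in% key[duplicated(key)]]

print(padded)
print(collisions)
```

If both results are empty for a real column, its levels are at least free of these two kinds of inconsistency.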
# Checking for missing values
colSums(is.na(ad))
## Daily Time Spent on Site Age Area Income
## 0 0 0
## Daily Internet Usage Ad Topic Line City
## 0 0 0
## Male Country Timestamp
## 0 0 0
## Clicked on Ad
## 0
There are no missing values present in the data set.
# Checking for duplicates.
sum(duplicated(ad))
## [1] 0
There are no duplicated records.
# Checking uniformity of column names.
colnames(ad)
## [1] "Daily Time Spent on Site" "Age"
## [3] "Area Income" "Daily Internet Usage"
## [5] "Ad Topic Line" "City"
## [7] "Male" "Country"
## [9] "Timestamp" "Clicked on Ad"
All column names follow the same capitalisation pattern with space-separated words, so they are consistent.
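Names containing spaces must be wrapped in backticks or quotes whenever they are used in code. As an optional alternative, the names could be normalised to snake case. The sketch below only shows the transformation on the names printed above; the `snake` vector is a hypothetical helper, and the rest of this analysis keeps the original names.

```r
# Column names as printed above.
cols <- c("Daily Time Spent on Site", "Age", "Area Income",
          "Daily Internet Usage", "Ad Topic Line", "City",
          "Male", "Country", "Timestamp", "Clicked on Ad")

# Lower-case everything and replace runs of whitespace with an underscore.
snake <- gsub("\\s+", "_", tolower(trimws(cols)))
print(snake)
```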
# Numerical columns
num_df <- ad[ , ..num]
# Removing the encoded categorical columns from the numerical columns set.
num_df <- num_df[ , !c('Male', 'Clicked on Ad') ]
# Checking for outliers
# Plotting boxplots
# Number of plots
length(num_df)
## [1] 4
# Boxplots: one panel per numeric column, in a 2 x 2 grid
par(mfrow = c(2, 2))
for (col in names(num_df)) {
  # The y-axis shows the column's values, not counts
  boxplot(num_df[[col]], main = paste('Boxplot of', col),
          ylab = 'Value')
}
From the box plots of the numerical columns, it can be seen that only the ‘Area Income’ column has outliers. Outliers will be retained for further analysis.
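The outliers drawn on a box plot can also be listed numerically. A minimal sketch using `boxplot.stats()`, which applies the same 1.5 * IQR whisker rule as `boxplot()`; the `income` vector is a hypothetical example with one deliberately low value, not the actual Area Income column.

```r
# Hypothetical income values with one deliberately low value; not the real data.
income <- c(13996, 47032, 52000, 55000, 57012, 60000, 65471, 70000, 79485)

# Points lying beyond the whiskers (more than 1.5 * IQR from the hinges).
out <- boxplot.stats(income)$out
print(out)

# Number of flagged points.
n_out <- length(out)
print(n_out)
```

Applying the same call to each numeric column gives a per-column outlier count to accompany the plots.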
# Names of the cat_cols mask (a named logical vector, so every column is listed)
names(cat_cols)
## [1] "Daily Time Spent on Site" "Age"
## [3] "Area Income" "Daily Internet Usage"
## [5] "Ad Topic Line" "City"
## [7] "Male" "Country"
## [9] "Timestamp" "Clicked on Ad"
# Temporary Directory
dir.create(tempdir())
## Warning in dir.create(tempdir()): 'C:\Users\HP\AppData\Local\Temp\Rtmp6hkgNv'
## already exists