ADVERTISING

1. Defining the Question

a) Specifying the Question

Which individuals are most likely to clink on the course advertisement ads?

b) Defining the Metric for Success

c) Understanding the Context

d) Recording the Experimental Design

e) Data Relevance

2. Data Understanding

a) Reading the Data

# Loading the data.
library(data.table)
ad <- fread('advertising.csv')

b) Checking the Data

Number of Records

# Number of rows and columns.
cat('Number of rows = ', nrow(ad), 'and the number of columns = ', ncol(ad),'.')

## Number of rows =  1000 and the number of columns =  10 .

Top Dataset Preview

# First 5 records.
head(ad, 5)

##    Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1:                    68.95  35    61833.90               256.09
## 2:                    80.23  31    68441.85               193.77
## 3:                    69.47  26    59785.94               236.50
## 4:                    74.15  29    54806.18               245.89
## 5:                    68.37  35    73889.99               225.58
##                            Ad Topic Line           City Male    Country
## 1:    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2:    Monitored national standardization      West Jodi    1      Nauru
## 3:      Organic bottom-line service-desk       Davidton    0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5:         Robust logistical utilization   South Manuel    0    Iceland
##              Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11             0
## 2: 2016-04-04 01:39:02             0
## 3: 2016-03-13 20:35:42             0
## 4: 2016-01-10 02:31:19             0
## 5: 2016-06-03 03:36:18             0

Bottom Dataset Preview

# Last 5 records.
tail(ad, 5)

##    Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1:                    72.97  30    71384.57               208.58
## 2:                    51.30  45    67782.17               134.42
## 3:                    51.63  51    42415.72               120.37
## 4:                    55.55  19    41920.79               187.95
## 5:                    45.01  26    29875.80               178.35
##                           Ad Topic Line          City Male
## 1:        Fundamental modular algorithm     Duffystad    1
## 2:      Grass-roots cohesive monitoring   New Darlene    1
## 3:         Expanded intangible solution South Jessica    1
## 4: Proactive bandwidth-monitored policy   West Steven    0
## 5:      Virtual 5thgeneration emulation   Ronniemouth    0
##                   Country           Timestamp Clicked on Ad
## 1:                Lebanon 2016-02-11 21:49:00             1
## 2: Bosnia and Herzegovina 2016-04-22 02:07:01             1
## 3:               Mongolia 2016-02-01 17:24:57             1
## 4:              Guatemala 2016-03-24 02:35:54             0
## 5:                 Brazil 2016-06-03 21:43:21             1

At first glance of the data set, no anomalies can be seen.

c) Checking Datatypes

# Data set structure.
str(ad)

## Classes 'data.table' and 'data.frame':   1000 obs. of  10 variables:
##  $ Daily Time Spent on Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily Internet Usage    : num  256 194 236 246 226 ...
##  $ Ad Topic Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : chr  "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
##  $ Male                    : int  0 1 0 1 0 1 0 1 1 1 ...
##  $ Country                 : chr  "Tunisia" "Nauru" "San Marino" "Italy" ...
##  $ Timestamp               : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
##  $ Clicked on Ad           : int  0 0 0 0 0 0 0 1 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>

# Categorical columns
num <- unlist(lapply(ad, is.numeric))
cat_cols <- ad[, !num]
# Excluding the Timestamp column
cat_cols['Timestamp'] <- FALSE
# Coercing character columns to factors
  # Data frame with character columns only
char_df <- ad[, ..cat_cols]
 # Getting character vector from the original logical vector.
c <- as.vector(colnames(char_df))
# Target data set columns
a <- ad[ , ..c]
# Converting target character columns to factors
ad[ ,c] <- lapply(a, factor)
# Checking changes
head(ad, 5)

##    Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1:                    68.95  35    61833.90               256.09
## 2:                    80.23  31    68441.85               193.77
## 3:                    69.47  26    59785.94               236.50
## 4:                    74.15  29    54806.18               245.89
## 5:                    68.37  35    73889.99               225.58
##                            Ad Topic Line           City Male    Country
## 1:    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2:    Monitored national standardization      West Jodi    1      Nauru
## 3:      Organic bottom-line service-desk       Davidton    0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5:         Robust logistical utilization   South Manuel    0    Iceland
##              Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11             0
## 2: 2016-04-04 01:39:02             0
## 3: 2016-03-13 20:35:42             0
## 4: 2016-01-10 02:31:19             0
## 5: 2016-06-03 03:36:18             0

# Converting the encoded Male and Clicked on Ad columns to factors.
ad[ ,c('Male', 'Clicked on Ad')] <- lapply(ad[, c('Male', 'Clicked on Ad')], factor)
# Checking changes
head(ad, 5)

##    Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1:                    68.95  35    61833.90               256.09
## 2:                    80.23  31    68441.85               193.77
## 3:                    69.47  26    59785.94               236.50
## 4:                    74.15  29    54806.18               245.89
## 5:                    68.37  35    73889.99               225.58
##                            Ad Topic Line           City Male    Country
## 1:    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2:    Monitored national standardization      West Jodi    1      Nauru
## 3:      Organic bottom-line service-desk       Davidton    0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5:         Robust logistical utilization   South Manuel    0    Iceland
##              Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11             0
## 2: 2016-04-04 01:39:02             0
## 3: 2016-03-13 20:35:42             0
## 4: 2016-01-10 02:31:19             0
## 5: 2016-06-03 03:36:18             0

3. External Dataset Validation

4. Data Preperation

a) Validation

Column Validity

Checking for invalid/unnecessary columns that do not contribute relevant information to the study.

# Column names 
colnames(ad)

##  [1] "Daily Time Spent on Site" "Age"                     
##  [3] "Area Income"              "Daily Internet Usage"    
##  [5] "Ad Topic Line"            "City"                    
##  [7] "Male"                     "Country"                 
##  [9] "Timestamp"                "Clicked on Ad"

All columns are valid.

Checking for invalid values

# Checking for anomalies
# Data set summary
summary(ad)

##  Daily Time Spent on Site      Age         Area Income    Daily Internet Usage
##  Min.   :32.60            Min.   :19.00   Min.   :13996   Min.   :104.8       
##  1st Qu.:51.36            1st Qu.:29.00   1st Qu.:47032   1st Qu.:138.8       
##  Median :68.22            Median :35.00   Median :57012   Median :183.1       
##  Mean   :65.00            Mean   :36.01   Mean   :55000   Mean   :180.0       
##  3rd Qu.:78.55            3rd Qu.:42.00   3rd Qu.:65471   3rd Qu.:218.8       
##  Max.   :91.43            Max.   :61.00   Max.   :79485   Max.   :270.0       
##                                                                               
##                                  Ad Topic Line              City     Male   
##  Adaptive 24hour Graphic Interface      :  1   Lisamouth      :  3   0:519  
##  Adaptive asynchronous attitude         :  1   Williamsport   :  3   1:481  
##  Adaptive context-sensitive application :  1   Benjaminchester:  2          
##  Adaptive contextually-based methodology:  1   East John      :  2          
##  Adaptive demand-driven knowledgebase   :  1   East Timothy   :  2          
##  Adaptive uniform capability            :  1   Johnstad       :  2          
##  (Other)                                :994   (Other)        :986          
##            Country      Timestamp                      Clicked on Ad
##  Czech Republic:  9   Min.   :2016-01-01 02:52:10.00   0:500        
##  France        :  9   1st Qu.:2016-02-18 02:55:42.00   1:500        
##  Afghanistan   :  8   Median :2016-04-07 17:27:29.50                
##  Australia     :  8   Mean   :2016-04-10 10:34:06.64                
##  Cyprus        :  8   3rd Qu.:2016-05-31 03:18:14.00                
##  Greece        :  8   Max.   :2016-07-24 00:22:16.00                
##  (Other)       :950

All numeric columns are >= 0.

# Checking unique categorical column values.
str(ad)

## Classes 'data.table' and 'data.frame':   1000 obs. of  10 variables:
##  $ Daily Time Spent on Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily Internet Usage    : num  256 194 236 246 226 ...
##  $ Ad Topic Line           : Factor w/ 1000 levels "Adaptive 24hour Graphic Interface",..: 92 465 567 904 767 806 223 724 108 455 ...
##  $ City                    : Factor w/ 969 levels "Adamsbury","Adamside",..: 962 904 112 940 806 283 47 672 885 713 ...
##  $ Male                    : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 2 2 ...
##  $ Country                 : Factor w/ 237 levels "Afghanistan",..: 216 148 185 104 97 159 146 13 83 79 ...
##  $ Timestamp               : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
##  $ Clicked on Ad           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

No anomalies can be seen in the categorical columns, therefore, no anomalies are present.

b) Consistency

# Checking for missing values
colSums(is.na(ad))

## Daily Time Spent on Site                      Age              Area Income 
##                        0                        0                        0 
##     Daily Internet Usage            Ad Topic Line                     City 
##                        0                        0                        0 
##                     Male                  Country                Timestamp 
##                        0                        0                        0 
##            Clicked on Ad 
##                        0

There are no missing values present in the data set.

c) Completeness

# Checking for duplicates.
sum(duplicated(ad))

## [1] 0

There are no duplicated records.

d) Uniformity

# Checking uniformity of column names.
colnames(ad)

##  [1] "Daily Time Spent on Site" "Age"                     
##  [3] "Area Income"              "Daily Internet Usage"    
##  [5] "Ad Topic Line"            "City"                    
##  [7] "Male"                     "Country"                 
##  [9] "Timestamp"                "Clicked on Ad"

Column names have the same case, therefore, they are uniform.

e) Outliers

# Numerical columns
num_df <- ad[ , ..num]
# Removing the encoded categorical columns from the numerical columns set.
num_df <- num_df[ , !c('Male', 'Clicked on Ad') ]
# Checking for outliers

# Plotting boxplots
  # Number of plots
length(num_df)

## [1] 4

# Boxplots
par(mfrow = c(2,2))
for (i in 1:length(num_df)){
  boxplot(num_df[ , ..i], main = paste('Boxplot of',  names(num_df)[i]), 
          ylab = 'Count')
}

From the box plots of the numerical columns, it can be seen that only the ‘Area Income’ column has outliers. Outliers will be retained for further analysis.

5. Descritpive Analysis

a) Univariate Analysis

Categorical

# Categorical columns
names(cat_cols)

##  [1] "Daily Time Spent on Site" "Age"                     
##  [3] "Area Income"              "Daily Internet Usage"    
##  [5] "Ad Topic Line"            "City"                    
##  [7] "Male"                     "Country"                 
##  [9] "Timestamp"                "Clicked on Ad"

Moringa School Module 3 Week 1 IP

Deborah Masibo

2022-05-26

ADVERTISING

1. Defining the Question

a) Specifying the Question

b) Defining the Metric for Success

c) Understanding the Context

d) Recording the Experimental Design

e) Data Relevance

2. Data Understanding

a) Reading the Data

b) Checking the Data

c) Checking Datatypes

3. External Dataset Validation

4. Data Preperation

a) Validation

b) Consistency

c) Completeness

d) Uniformity

e) Outliers

5. Descritpive Analysis

a) Univariate Analysis

Categorical

Numerical

b) Bivariate Analysis

Categorical-Categorical

Numeircal-Numerical

Numerical-Categorical

Analysis Summary

c) Multivariate Analysis

6. Implemeting the Solution

7. Challenging the Solution

8. Conclusion

9. Follow Up Questions

a) Did we have the right data?

b) Do we need other data to answer our question?

c) Did we have the right question?