| title: “Cryptography course analysis” |
| author: “Kipsang Mutai Nicholas” |
| date: “01/22/2022” |
This research aims to identify the individuals who are most likely to click on an advertisement.
Success means correctly identifying the individuals who are likely to click on an ad. This will be achieved through an in-depth review of the data at hand: analysing the factors that influence ad clicks, and assessing the individual independent variables, their distributions within the set and their relationships with one another.
The online cryptography course covers a field concerned with ensuring secure communication between two parties. Though the course is new, it could help improve information security. Using data collected from an earlier advertisement for a related course posted on a blog, we can study how recipients responded to the ad. This lets us focus on high-priority recipients, so that advertising is effective and the return on investment is not diluted by targeting customers with low potential.
This will involve exhaustive techniques to understand our data inside and out. We will find and deal with extreme values, anomalies, missing values and duplicated values to ensure that the data used is truly representative of the actual observations. This will be followed by an exhaustive analysis of the attributes (variables) of the data. Using the inferences drawn from our analysis, we will answer our specific question, and then challenge the solution by suggesting improvements that would help achieve optimum marketing.
A brief review of the data to inform us what we are working with and its importance for analysis.
# Loading our dataset.
library(data.table)
advert_df<-fread("http://bit.ly/IPAdvertisingData")
# Checking the first 6 observations
head(advert_df)
## Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1: 68.95 35 61833.90 256.09
## 2: 80.23 31 68441.85 193.77
## 3: 69.47 26 59785.94 236.50
## 4: 74.15 29 54806.18 245.89
## 5: 68.37 35 73889.99 225.58
## 6: 59.99 23 59761.56 226.74
## Ad Topic Line City Male Country
## 1: Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2: Monitored national standardization West Jodi 1 Nauru
## 3: Organic bottom-line service-desk Davidton 0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5: Robust logistical utilization South Manuel 0 Iceland
## 6: Sharable client-driven software Jamieberg 1 Norway
## Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11 0
## 2: 2016-04-04 01:39:02 0
## 3: 2016-03-13 20:35:42 0
## 4: 2016-01-10 02:31:19 0
## 5: 2016-06-03 03:36:18 0
## 6: 2016-05-19 14:30:17 0
# Reviewing the dimensions of the table
dimensions <-dim(advert_df)
dimensions
## [1] 1000 10
The dataset has 1000 observations and 10 variables of numeric, integer and character data types. There is also a timestamp variable within the data.
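The column types can be checked directly rather than read off the printed table. A minimal sketch, assuming a small hypothetical data frame `toy_df` whose values are copied from the first rows shown above (the name and the reduced set of columns are illustrative only):

```r
# Hypothetical toy data frame mimicking three of the columns
toy_df <- data.frame(
  Age = c(35L, 31L, 26L),
  Area_Income = c(61833.90, 68441.85, 59785.94),
  City = c("Wrightburgh", "West Jodi", "Davidton"),
  stringsAsFactors = FALSE
)

# Class of every column, returned as a named character vector
col_types <- sapply(toy_df, class)
print(col_types)
```

The same `sapply(..., class)` call should work on `advert_df`, since a data.table is also a list of columns.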
This step is aimed at identifying extreme values, anomalies, missing values and duplicated values in the set.
colnames(advert_df)
## [1] "Daily Time Spent on Site" "Age"
## [3] "Area Income" "Daily Internet Usage"
## [5] "Ad Topic Line" "City"
## [7] "Male" "Country"
## [9] "Timestamp" "Clicked on Ad"
# Changing the column names for easy readability
colnames(advert_df)<-c("Time_on_site","Age","A_income","Internet_Usage","Ad_Topic","City","Male","Country","Timestamp","Clicked")
# Identifying outliers in the numerical columns
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
nums_df<-select_if(advert_df,is.numeric)
nums_df
## Time_on_site Age A_income Internet_Usage Male Clicked
## 1: 68.95 35 61833.90 256.09 0 0
## 2: 80.23 31 68441.85 193.77 1 0
## 3: 69.47 26 59785.94 236.50 0 0
## 4: 74.15 29 54806.18 245.89 1 0
## 5: 68.37 35 73889.99 225.58 0 0
## ---
## 996: 72.97 30 71384.57 208.58 1 1
## 997: 51.30 45 67782.17 134.42 1 1
## 998: 51.63 51 42415.72 120.37 1 1
## 999: 55.55 19 41920.79 187.95 0 0
## 1000: 45.01 26 29875.80 178.35 0 1
colnames(advert_df)
## [1] "Time_on_site" "Age" "A_income" "Internet_Usage"
## [5] "Ad_Topic" "City" "Male" "Country"
## [9] "Timestamp" "Clicked"
# Boxplot of the individual variables
box_plot<-function(data,var,main){
boxplot(data[[var]],ylab="Distribution of values",main=main)
}
box_plot(nums_df,1,"A boxplot of Daily Time Spent on Site")
# There are no outliers in the values in the daily time spent on the site
box_plot(nums_df,2,"A boxplot of Age")
# There are no outliers in the Age distribution
box_plot(nums_df,3,"A boxplot of Area Income")
# There are several outliers below the lower whisker, i.e. more than 1.5 * IQR below the 25th percentile
box_plot(nums_df,4,"A boxplot of Daily Internet Usage")
# There are no outliers in the daily internet usage values
box_plot(nums_df,5,"A boxplot of Males")
# There are no anomalies or outliers in the Male column, since the values are only 0 or 1 (presumably 1 for male and 0 for not male)
box_plot(nums_df,6,"A boxplot of Clicked on Ads")
# No anomaly nor outliers detected in this set
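The visual checks above can also be summarised numerically. The sketch below is one possible approach, not part of the original analysis: it counts the values that `boxplot.stats()` flags in each column of a hypothetical toy data frame (`toy_nums` is an illustrative name).

```r
# Hypothetical toy data: column a has one extreme value, column b has none
toy_nums <- data.frame(
  a = c(1, 2, 3, 4, 5, 100),
  b = c(1, 2, 3, 4, 5, 6)
)

# boxplot.stats()$out holds the points lying beyond 1.5 * IQR from the hinges,
# which are the same points a boxplot draws as individual dots
outlier_counts <- sapply(toy_nums, function(col) length(boxplot.stats(col)$out))
print(outlier_counts)
```

Applied to `nums_df`, a non-zero count would point to the columns worth inspecting more closely.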
Numerics<-noquote(names(nums_df))
Numerics
## [1] Time_on_site Age A_income Internet_Usage Male
## [6] Clicked
non_nums<-subset(advert_df,select = -c(Time_on_site,Age,A_income,Internet_Usage,Male,Clicked))
head(non_nums)
## Ad_Topic City Country
## 1: Cloned 5thgeneration orchestration Wrightburgh Tunisia
## 2: Monitored national standardization West Jodi Nauru
## 3: Organic bottom-line service-desk Davidton San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt Italy
## 5: Robust logistical utilization South Manuel Iceland
## 6: Sharable client-driven software Jamieberg Norway
## Timestamp
## 1: 2016-03-27 00:53:11
## 2: 2016-04-04 01:39:02
## 3: 2016-03-13 20:35:42
## 4: 2016-01-10 02:31:19
## 5: 2016-06-03 03:36:18
## 6: 2016-05-19 14:30:17
print(length(unique(non_nums[[1]])))
## [1] 1000
print(length(unique(non_nums[[2]])))
## [1] 969
print(length(unique(non_nums[[3]])))
## [1] 237
There are a lot of unique values in this set. Every ad topic line is unique (1000 distinct values). Some cities appear more than once, since the 969 unique cities fall short of the total number of rows. There are 237 unique countries in this set. No anomalies were evident.
boxplot.stats(advert_df$A_income)$out
## [1] 17709.98 18819.34 15598.29 15879.10 14548.06 13996.50 14775.50 18368.57
print(max(advert_df$A_income))
## [1] 79484.8
print(min(advert_df$A_income))
## [1] 13996.5
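The flagged incomes can be checked against the usual 1.5 * IQR rule by hand. This is a rough sketch that assumes the rounded quartiles of `A_income` reported by the summary output later in this report (1st Qu. 47032, 3rd Qu. 65471); `boxplot.stats()` works from hinges, which may differ slightly from these quartiles.

```r
# Rounded quartiles of A_income, taken from the summary output in this report
q1 <- 47032
q3 <- 65471

# Interquartile range and the two fences of the 1.5 * IQR rule
iqr <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values reported by boxplot.stats(advert_df$A_income)$out above
flagged <- c(17709.98, 18819.34, 15598.29, 15879.10,
             14548.06, 13996.50, 14775.50, 18368.57)

# Every flagged value should sit below the lower fence
print(lower_fence)
print(all(flagged < lower_fence))
```

The maximum income of 79484.8 is below the upper fence, which is consistent with the boxplot showing outliers on the low side only.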
# Checking for duplicated rows
sum(duplicated(advert_df))
## [1] 0
There are no duplicated rows in this set.
### Dealing with duplicates
clean_df<-advert_df[!duplicated(advert_df),]
clean_df
## Time_on_site Age A_income Internet_Usage
## 1: 68.95 35 61833.90 256.09
## 2: 80.23 31 68441.85 193.77
## 3: 69.47 26 59785.94 236.50
## 4: 74.15 29 54806.18 245.89
## 5: 68.37 35 73889.99 225.58
## ---
## 996: 72.97 30 71384.57 208.58
## 997: 51.30 45 67782.17 134.42
## 998: 51.63 51 42415.72 120.37
## 999: 55.55 19 41920.79 187.95
## 1000: 45.01 26 29875.80 178.35
## Ad_Topic City Male
## 1: Cloned 5thgeneration orchestration Wrightburgh 0
## 2: Monitored national standardization West Jodi 1
## 3: Organic bottom-line service-desk Davidton 0
## 4: Triple-buffered reciprocal time-frame West Terrifurt 1
## 5: Robust logistical utilization South Manuel 0
## ---
## 996: Fundamental modular algorithm Duffystad 1
## 997: Grass-roots cohesive monitoring New Darlene 1
## 998: Expanded intangible solution South Jessica 1
## 999: Proactive bandwidth-monitored policy West Steven 0
## 1000: Virtual 5thgeneration emulation Ronniemouth 0
## Country Timestamp Clicked
## 1: Tunisia 2016-03-27 00:53:11 0
## 2: Nauru 2016-04-04 01:39:02 0
## 3: San Marino 2016-03-13 20:35:42 0
## 4: Italy 2016-01-10 02:31:19 0
## 5: Iceland 2016-06-03 03:36:18 0
## ---
## 996: Lebanon 2016-02-11 21:49:00 1
## 997: Bosnia and Herzegovina 2016-04-22 02:07:01 1
## 998: Mongolia 2016-02-01 17:24:57 1
## 999: Guatemala 2016-03-24 02:35:54 0
## 1000: Brazil 2016-06-03 21:43:21 1
clean_df[duplicated(clean_df),]
## Empty data.table (0 rows and 10 cols): Time_on_site,Age,A_income,Internet_Usage,Ad_Topic,City...
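A note on counting duplicated rows: `length()` applied to a data frame returns its number of columns, so `nrow()` or `sum(duplicated(...))` is needed to count rows. A small sketch with a hypothetical toy data frame (`toy` is an illustrative name) shows the difference:

```r
# Hypothetical toy data frame with one repeated row (rows 2 and 3 are identical)
toy <- data.frame(
  id = c(1, 2, 2, 3),
  grp = c("a", "b", "b", "c"),
  stringsAsFactors = FALSE
)

# Rows that repeat an earlier row
dups <- toy[duplicated(toy), ]

print(length(dups))          # number of columns in the subset
print(nrow(dups))            # number of duplicated rows
print(sum(duplicated(toy)))  # same count without building the subset

# Dropping the repeats keeps the first occurrence of each row
deduped <- toy[!duplicated(toy), ]
print(nrow(deduped))
```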
# Counting the missing values in the whole table
sum(is.na(clean_df))
## [1] 0
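When the total is not zero, it helps to know which columns the gaps are in. A minimal sketch on a hypothetical toy data frame (`toy_na` is an illustrative name, with missing values planted on purpose):

```r
# Hypothetical toy data frame with two planted missing values
toy_na <- data.frame(
  Age = c(35, NA, 26),
  City = c("Wrightburgh", "West Jodi", NA),
  stringsAsFactors = FALSE
)

# Missing values per column, and in the whole table
na_per_column <- colSums(is.na(toy_na))
total_na <- sum(is.na(toy_na))

print(na_per_column)
print(total_na)
```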
# Re-extracting numerical values from the clean data with no duplicates
library("dplyr")
clean_nums_df<-select_if(clean_df,is.numeric)
clean_nums_df
## Time_on_site Age A_income Internet_Usage Male Clicked
## 1: 68.95 35 61833.90 256.09 0 0
## 2: 80.23 31 68441.85 193.77 1 0
## 3: 69.47 26 59785.94 236.50 0 0
## 4: 74.15 29 54806.18 245.89 1 0
## 5: 68.37 35 73889.99 225.58 0 0
## ---
## 996: 72.97 30 71384.57 208.58 1 1
## 997: 51.30 45 67782.17 134.42 1 1
## 998: 51.63 51 42415.72 120.37 1 1
## 999: 55.55 19 41920.79 187.95 0 0
## 1000: 45.01 26 29875.80 178.35 0 1
# Numerical values
head(clean_nums_df)
## Time_on_site Age A_income Internet_Usage Male Clicked
## 1: 68.95 35 61833.90 256.09 0 0
## 2: 80.23 31 68441.85 193.77 1 0
## 3: 69.47 26 59785.94 236.50 0 0
## 4: 74.15 29 54806.18 245.89 1 0
## 5: 68.37 35 73889.99 225.58 0 0
## 6: 59.99 23 59761.56 226.74 1 0
# Mean of individual columns
colMeans(clean_nums_df)
## Time_on_site Age A_income Internet_Usage Male
## 65.0002 36.0090 55000.0001 180.0001 0.4810
## Clicked
## 0.5000
The values look valid. For the two columns of 0s and 1s (Male and Clicked), the means are approximately 0.5, which gives us a clue that the two classes in each are roughly balanced.
#A summary of everything in numerical set
summary.data.frame(clean_nums_df)
## Time_on_site Age A_income Internet_Usage
## Min. :32.60 Min. :19.00 Min. :13996 Min. :104.8
## 1st Qu.:51.36 1st Qu.:29.00 1st Qu.:47032 1st Qu.:138.8
## Median :68.22 Median :35.00 Median :57012 Median :183.1
## Mean :65.00 Mean :36.01 Mean :55000 Mean :180.0
## 3rd Qu.:78.55 3rd Qu.:42.00 3rd Qu.:65471 3rd Qu.:218.8
## Max. :91.43 Max. :61.00 Max. :79485 Max. :270.0
## Male Clicked
## Min. :0.000 Min. :0.0
## 1st Qu.:0.000 1st Qu.:0.0
## Median :0.000 Median :0.5
## Mean :0.481 Mean :0.5
## 3rd Qu.:1.000 3rd Qu.:1.0
## Max. :1.000 Max. :1.0
For continuous variables, a histogram is a useful way to display the distribution of values in the set while giving a visual indication of the skewness and kurtosis of our data.
hist(clean_nums_df$Age,xlab="Age",main="Distribution of age")
We can see from this that the ages of participants are mostly concentrated in the early to mid 30s. The distribution is skewed slightly to the right: the mean (36.01) lies above the median (35), and the tail extends towards the older ages.
hist(clean_nums_df$Time_on_site,xlab = "Time spent on site",main="The distribution of the time on site")
The frequency of time spent on site is highest around the 3rd quartile of the set (about 78.55). This tells us that many visitors spend a significantly large amount of time on the site.
# Distribution of area income
hist(clean_nums_df$A_income,xlab="Area Income",main="This is the distribution of the Area Income")
The Area Income is skewed to the left: the mean (55000) lies below the median (57012). This tells us that most of the population sits at the higher end of income, with a smaller population at the lower incomes.
# Internet usage
hist(clean_nums_df$Internet_Usage,xlab="Internet Usage",main="Distribution of Internet usage")
The distribution is roughly normal, with fewer observations at the higher end.
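The direction of skew read off a histogram can be cross-checked numerically. The sketch below uses the moment-based skewness coefficient written out in base R; `moment_skewness` is a hypothetical helper defined here, not a function from any package used in this report, and the two vectors are made-up examples.

```r
# Moment-based skewness: third central moment over the 1.5 power of the variance
# (population form, no small-sample correction)
moment_skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Made-up examples: one long tail to the right, one to the left
right_tailed <- c(1, 2, 2, 3, 10)
left_tailed <- -right_tailed

print(moment_skewness(right_tailed))  # positive for a right (positive) skew
print(moment_skewness(left_tailed))   # negative for a left (negative) skew

# A quick rule of thumb: mean above median suggests a right skew
print(mean(right_tailed) > median(right_tailed))
```

Running the helper on `clean_nums_df$Age` or `clean_nums_df$A_income` would give a sign to compare against the histograms.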
# Getting the values in the male variable
males<-clean_nums_df$Male
# Getting the frequency table of the male set
male_dist<-table(males)
# Plotting a bar plot to understand the distribution of discrete values in the males column
barplot(male_dist,main = "The Distribution of Males",xlab = "Males and Non males")
# The non-males, which I would presume to be female, were higher in count than the males
# Getting the values in the clicks
ad_clicks<-clean_nums_df$Clicked
# Getting the frequency table of the clicks set
ad_clicks_dist<-table(ad_clicks)
# Plotting a bar plot to understand the distribution of discrete values in the clicks column
barplot(ad_clicks_dist,main = "The Distribution of Adclicks",xlab = "Yes and No")
The distribution was balanced.
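Balance in a 0/1 column can be stated as proportions instead of being judged from bar heights. A minimal sketch on a made-up vector (`toy_clicks` is an illustrative name):

```r
# Made-up click indicator: three clicks and three non-clicks
toy_clicks <- c(0, 1, 1, 0, 1, 0)

# Share of each class; prop.table() divides the counts by their total
click_share <- prop.table(table(toy_clicks))
print(click_share)
```

The same call on `ad_clicks` or `males` would return the class shares behind the bar plots.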
Looking at the relationship between two variables and their variations within the set.
# Hmisc must be installed once beforehand, e.g. with install.packages("Hmisc")
library("Hmisc")
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
# Checking the correlations and their corresponding significance levels
correlations<-rcorr(as.matrix(clean_nums_df))
correlations
## Time_on_site Age A_income Internet_Usage Male Clicked
## Time_on_site 1.00 -0.33 0.31 0.52 -0.02 -0.75
## Age -0.33 1.00 -0.18 -0.37 -0.02 0.49
## A_income 0.31 -0.18 1.00 0.34 0.00 -0.48
## Internet_Usage 0.52 -0.37 0.34 1.00 0.03 -0.79
## Male -0.02 -0.02 0.00 0.03 1.00 -0.04
## Clicked -0.75 0.49 -0.48 -0.79 -0.04 1.00
##
## n= 1000
##
##
## P
## Time_on_site Age A_income Internet_Usage Male Clicked
## Time_on_site 0.0000 0.0000 0.0000 0.5495 0.0000
## Age 0.0000 0.0000 0.0000 0.5062 0.0000
## A_income 0.0000 0.0000 0.0000 0.9667 0.0000
## Internet_Usage 0.0000 0.0000 0.0000 0.3762 0.0000
## Male 0.5495 0.5062 0.9667 0.3762 0.2296
## Clicked 0.0000 0.0000 0.0000 0.0000 0.2296
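A single pair from the matrix can be examined on its own with base R's `cor.test()`, which returns the Pearson coefficient together with its p-value. The sketch below uses made-up values (`toy_time` and `toy_clicked` are illustrative names) in which longer visits go with fewer clicks, mirroring the sign of the Time_on_site and Clicked correlation above.

```r
# Made-up data: longer time on site paired with no click
toy_time <- c(80, 75, 70, 50, 45, 40)
toy_clicked <- c(0, 0, 0, 1, 1, 1)

# Pearson correlation with a test of the null hypothesis of zero correlation
pair_test <- cor.test(toy_time, toy_clicked)

print(pair_test$estimate)  # correlation coefficient
print(pair_test$p.value)   # significance of the coefficient
```

A p-value below the chosen threshold (0.05 is a common convention) indicates that the correlation is unlikely to be zero, as is the case for every pair above except those involving Male.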
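# Subsetting the rows where the ad was clicked (yes_clicked is not defined earlier in this report; it is assumed here to be this subset)
yes_clicked<-clean_df[clean_df$Clicked==1,]
city_tbl<-as.matrix(table(yes_clicked$City))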
unique(city_tbl)
## [,1]
## Adamsbury 1
## Lake David 2
# unique() keeps the first row for each distinct count, so a city has either 1 or 2 ad clicks; Lake David is the first city listed with 2 clicks
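To list every city that reaches the highest click count, the frequency table can be sorted and filtered directly. A minimal sketch on a made-up vector of city names (`toy_cities` is illustrative; the names are borrowed from the output above):

```r
# Made-up city column for rows where the ad was clicked
toy_cities <- c("Adamsbury", "Lake David", "Lake David", "Davidton")

# Frequency of each city, largest first
city_counts <- sort(table(toy_cities), decreasing = TRUE)
print(city_counts)

# Names of all cities tied for the highest count
top_cities <- names(city_counts)[city_counts == max(city_counts)]
print(top_cities)
```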
Based on our findings, we get to conclude that: