# run once if the packages used in this report are not installed yet
# install.packages(c("tinytex", "data.table", "moments", "corrplot"), repos = "http://cran.us.r-project.org")
library(tinytex)
1.) DEFINING THE QUESTION: Which individuals are most likely to click on my ads?
2.) METRIC OF SUCCESS: Variables with a strong correlation with clicks (positive or negative) will indicate which individuals are most likely to click on the new course’s ads.
3.) EXPERIMENTAL DESIGN TAKEN:
a. Load the dataset and preview its first 6 and last 6 rows
b. Check the shape of the data and the data types of the columns
c. Check for duplicates
d. Detect outliers in the columns using the interquartile range (IQR)
e. Remove outliers in the Area Income column
f. Univariate analysis: calculate the mean, median, mode, range, IQR, standard deviation, variance, skewness, kurtosis and quantiles
g. Bivariate analysis: calculate the covariance and correlation among the columns, build a correlation matrix and visualise it
h. Recommendations
4.) APPROPRIATENESS OF THE DATA: Our dataset contains all the variables required to undertake our study, i.e. daily time spent on site, age, area income, daily internet usage, male, and clicked on ad.
5.) EXPLORATORY DATA ANALYSIS:
# loading the advertising dataset using the fread function
library(data.table)
#import data
df <- fread("C:\\Users\\Gakungi\\OneDrive\\Desktop\\R\\advertising.csv")
# previewing the first 6 rows of our dataset
head(df)
## Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1: 68.95 35 61833.90 256.09
## 2: 80.23 31 68441.85 193.77
## 3: 69.47 26 59785.94 236.50
## 4: 74.15 29 54806.18 245.89
## 5: 68.37 35 73889.99 225.58
## 6: 59.99 23 59761.56 226.74
## Ad Topic Line City Male Country
## 1: Cloned 5thgeneration orchestration Wrightburgh 0 Tunisia
## 2: Monitored national standardization West Jodi 1 Nauru
## 3: Organic bottom-line service-desk Davidton 0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt 1 Italy
## 5: Robust logistical utilization South Manuel 0 Iceland
## 6: Sharable client-driven software Jamieberg 1 Norway
## Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11 0
## 2: 2016-04-04 01:39:02 0
## 3: 2016-03-13 20:35:42 0
## 4: 2016-01-10 02:31:19 0
## 5: 2016-06-03 03:36:18 0
## 6: 2016-05-19 14:30:17 0
# previewing the last 6 rows
tail(df)
## Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1: 43.70 28 63126.96 173.01
## 2: 72.97 30 71384.57 208.58
## 3: 51.30 45 67782.17 134.42
## 4: 51.63 51 42415.72 120.37
## 5: 55.55 19 41920.79 187.95
## 6: 45.01 26 29875.80 178.35
## Ad Topic Line City Male
## 1: Front-line bifurcated ability Nicholasland 0
## 2: Fundamental modular algorithm Duffystad 1
## 3: Grass-roots cohesive monitoring New Darlene 1
## 4: Expanded intangible solution South Jessica 1
## 5: Proactive bandwidth-monitored policy West Steven 0
## 6: Virtual 5thgeneration emulation Ronniemouth 0
## Country Timestamp Clicked on Ad
## 1: Mayotte 2016-04-04 03:57:48 1
## 2: Lebanon 2016-02-11 21:49:00 1
## 3: Bosnia and Herzegovina 2016-04-22 02:07:01 1
## 4: Mongolia 2016-02-01 17:24:57 1
## 5: Guatemala 2016-03-24 02:35:54 0
## 6: Brazil 2016-06-03 21:43:21 1
# checking the shape of our dataset
dim(df)
## [1] 1000 10
# we have 1000 rows and 10 columns
# checking the data types of our 10 columns
str(df)
## Classes 'data.table' and 'data.frame': 1000 obs. of 10 variables:
## $ Daily Time Spent on Site: num 69 80.2 69.5 74.2 68.4 ...
## $ Age : int 35 31 26 29 35 23 33 48 30 20 ...
## $ Area Income : num 61834 68442 59786 54806 73890 ...
## $ Daily Internet Usage : num 256 194 236 246 226 ...
## $ Ad Topic Line : chr "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
## $ City : chr "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
## $ Male : int 0 1 0 1 0 1 0 1 1 1 ...
## $ Country : chr "Tunisia" "Nauru" "San Marino" "Italy" ...
## $ Timestamp : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
## $ Clicked on Ad : int 0 0 0 0 0 0 0 1 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
# our columns have the appropriate data types attached to them
# checking for duplicates in the data
dup <- df[duplicated(df),]
dup
## Empty data.table (0 rows and 10 cols): Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City...
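The duplicate check above can also be condensed into a single number. The following is a minimal sketch on a small made-up data frame (`toy` is hypothetical, not the advertising data): `anyDuplicated()` returns 0 when no row repeats, and `sum(duplicated())` counts the repeated rows.

```r
# Hypothetical toy data: row 3 repeats row 1
toy <- data.frame(age = c(35, 31, 35), male = c(0, 1, 0))

# Index of the first duplicated row (0 would mean "no duplicates")
first_dup <- anyDuplicated(toy)

# Number of rows that repeat an earlier row
n_dup <- sum(duplicated(toy))

first_dup
n_dup
```

On the advertising data both calls would be expected to return 0, matching the empty table shown above.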
OUTLIER DETECTION USING IQR
# Compute the interquartile range of the daily time spent column with IQR()
time.IQR <- IQR(df$`Daily Time Spent on Site`)
time.IQR
## [1] 27.1875
# Lower fence: hard-coded first quartile (51.36) minus 1.5 * IQR
lowertimespent <- 51.36 - 1.5 * time.IQR
# Upper fence: hard-coded third quartile (78.55) plus 1.5 * IQR
uppertimespent <- 78.55 + 1.5 * time.IQR
lowertimespent
## [1] 10.57875
uppertimespent
## [1] 119.3312
# Check all the points above the uppertime spent using which() function
print(which(df$`Daily Time Spent on Site` > uppertimespent))
## integer(0)
# Check all the points below the lowertime spent using which() function
print(which(df$`Daily Time Spent on Site` < lowertimespent))
## integer(0)
NO OUTLIERS EXIST IN DAILY TIME SPENT ON SITE COLUMN
# Compute the interquartile range of the age column with IQR()
age.IQR <- IQR(df$Age)
age.IQR
## [1] 13
# Lower fence: hard-coded first quartile (29) minus 1.5 * IQR
lowerage <- 29 - 1.5 * age.IQR
# Upper fence: hard-coded third quartile (42) plus 1.5 * IQR
upperage <- 42 + 1.5 * age.IQR
lowerage
## [1] 9.5
upperage
## [1] 61.5
# Check all the points above the upper age using which() function
print(which(df$Age > upperage))
## integer(0)
# Check all the points below the lower age using which() function
print(which(df$Age < lowerage))
## integer(0)
NO OUTLIERS EXIST IN THE AGE COLUMN
# Compute the interquartile range of the area income column with IQR()
income.IQR <- IQR(df$`Area Income`)
income.IQR
## [1] 18438.83
# Lower fence: hard-coded first quartile (47032) minus 1.5 * IQR
lowerincome <- 47032 - 1.5 * income.IQR
# Upper fence: hard-coded third quartile (65471) plus 1.5 * IQR
upperincome <- 65471 + 1.5 * income.IQR
lowerincome
## [1] 19373.75
upperincome
## [1] 93129.25
# Check all the points above the upper area income using which() function
print(which(df$`Area Income` > upperincome))
## integer(0)
# Check all the points below the lower area income using which() function
print(which(df$`Area Income` < lowerincome))
## [1] 136 411 511 641 666 693 769 779 953
WE HAVE OUTLIERS IN THE AREA INCOME COLUMN
# VISUALIZING THE AREA INCOME COLUMN OUTLIERS
(boxplot(df$`Area Income`))
## $stats
## [,1]
## [1,] 19345.36
## [2,] 47012.58
## [3,] 57012.30
## [4,] 65479.35
## [5,] 79484.80
##
## $n
## [1] 1000
##
## $conf
## [,1]
## [1,] 56089.63
## [2,] 57934.97
##
## $out
## [1] 17709.98 18819.34 15598.29 15879.10 14548.06 13996.50 14775.50 18368.57
##
## $group
## [1] 1 1 1 1 1 1 1 1
##
## $names
## [1] "1"
# REMOVING OUTLIERS IN THE AREA INCOME COLUMN
# keep only the rows whose Area Income lies inside the 1.5 * IQR fences
df2 <- subset(df, df$`Area Income` > lowerincome & df$`Area Income` < upperincome)
# previewing our datasets new shape
dim(df2)
## [1] 991 10
# we now have 991 rows and 10 columns
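The manual fence flagged nine rows, while `boxplot()` listed only eight values in `$out`. The `$stats` output shows why: `boxplot()` builds its fences from the hinges (47012.58 and 65479.35), which gives a lower fence of roughly 19312, slightly below the manual 19373.75, so the value 19345.36 is an outlier under one rule but not the other. The sketch below illustrates the hinge-versus-quantile difference on a made-up vector (`x` is hypothetical, not the advertising data).

```r
# Hypothetical toy vector
x <- 1:10

# Quartiles as computed by quantile() (default type 7)
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)

# Tukey's hinges, which boxplot() uses for its box and fences
h <- fivenum(x)[c(2, 4)]

q  # 3.25 7.75
h  # 3 8
```

Either rule is defensible. What matters is using the same rule for detection and removal.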
# Compute the interquartile range of the daily internet usage column with IQR()
internet.IQR <- IQR(df2$`Daily Internet Usage`)
internet.IQR
## [1] 80.27
# Lower fence: hard-coded first quartile (138.8) minus 1.5 * IQR
lowerinternet <- 138.8 - 1.5 * internet.IQR
# Upper fence: hard-coded third quartile (218.8) plus 1.5 * IQR
upperinternet <- 218.8 + 1.5 * internet.IQR
lowerinternet
## [1] 18.395
upperinternet
## [1] 339.205
# Check all the points above the upper internet usage using which() function
print(which(df2$`Daily Internet Usage` > upperinternet))
## integer(0)
# Check all the points below the lower internet usage using which() function
print(which(df2$`Daily Internet Usage` < lowerinternet))
## integer(0)
THERE ARE NO OUTLIERS IN THE DAILY INTERNET USAGE COLUMN
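# Compute the interquartile range of the male column with IQR()
# (Male is a 0/1 indicator, so fences of -1.5 and 2.5 cannot flag any value; the check is kept for completeness)
male.IQR <- IQR(df2$Male)
male.IQR
## [1] 1
# Lower fence: first quartile (0) minus 1.5 * IQR
lowermale <- 0 - 1.5 * male.IQR
# Upper fence: third quartile (1) plus 1.5 * IQR
uppermale <- 1 + 1.5 * male.IQR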
lowermale
## [1] -1.5
uppermale
## [1] 2.5
# Check all the points above the upper male using which() function
print(which(df2$Male > uppermale))
## integer(0)
# Check all the points below the lower male using which() function
print(which(df2$Male < lowermale))
## integer(0)
NO OUTLIERS EXIST IN THE MALE COLUMN
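# Compute the interquartile range of the clicked on ad column with IQR()
# (Clicked on Ad is a 0/1 indicator, so fences of -1.5 and 2.5 cannot flag any value; the check is kept for completeness)
ad.IQR <- IQR(df2$`Clicked on Ad`)
ad.IQR
## [1] 1
# Lower fence: first quartile (0) minus 1.5 * IQR
lowerad <- 0 - 1.5 * ad.IQR
# Upper fence: third quartile (1) plus 1.5 * IQR
upperad <- 1 + 1.5 * ad.IQR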
lowerad
## [1] -1.5
upperad
## [1] 2.5
# Check all the points above the upper ad using which() function
print(which(df2$`Clicked on Ad` > upperad))
## integer(0)
# Check all the points below the lower ad using which() function
print(which(df2$`Clicked on Ad` < lowerad))
## integer(0)
NO OUTLIERS EXIST IN THE CLICKED ON AD COLUMN
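The per-column checks above all follow the same pattern, so they could be wrapped in one function. This is a minimal sketch, not the code used to produce the results above: `iqr_outliers()` and `toy` are hypothetical names, and the quartiles are computed with `quantile()` instead of being typed in by hand.

```r
# Hypothetical helper: positions of values outside the k * IQR fences
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]        # same value as IQR(x)
  lower <- q[1] - k * spread   # lower fence
  upper <- q[2] + k * spread   # upper fence
  which(x < lower | x > upper)
}

# Toy data: 100 lies far above the other values and should be flagged
toy <- c(10, 12, 11, 13, 12, 11, 100)
flagged <- iqr_outliers(toy)
flagged
```

Applied to the report's data it would be called as `iqr_outliers(df$Age)`, and so on for each column. Because it uses exact rather than rounded quartiles, the flagged rows may differ slightly from those listed above.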
6.) UNIVARIATE ANALYSIS:
# getting the mean of relevant columns
mean_timespent <- mean(df2$`Daily Time Spent on Site`)
mean_age <- mean(df2$Age)
mean_income <- mean(df2$`Area Income`)
mean_internet <- mean(df2$`Daily Internet Usage`)
print(mean_timespent)
## [1] 65.05689
print(mean_age)
## [1] 35.98587
print(mean_income)
## [1] 55349.1
print(mean_internet)
## [1] 179.9846
# the mean is the arithmetic average of the values in each column
# getting the median of the relevant columns
median_timespent <- median(df2$`Daily Time Spent on Site`)
median_age <- median(df2$Age)
median_income <- median(df2$`Area Income`)
median_internet <- median(df2$`Daily Internet Usage`)
print(median_timespent)
## [1] 68.41
print(median_age)
## [1] 35
print(median_income)
## [1] 57260.41
print(median_internet)
## [1] 183.43
# the median is the middle value of each column once its values are sorted
# getting the mode of the relevant columns
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
mode_timespent <- getmode(df2$`Daily Time Spent on Site`)
mode_age <- getmode(df2$Age)
mode_income <- getmode(df2$`Area Income`)
mode_internet <- getmode(df2$`Daily Internet Usage`)
print(mode_timespent)
## [1] 62.26
print(mode_age)
## [1] 31
print(mode_income)
## [1] 61833.9
print(mode_internet)
## [1] 167.22
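# the mode is the most frequent value in each column; getmode() returns the first such value when several are tied, so it is less informative for near-continuous columns (the Area Income mode, 61833.9, is also the first row's value, suggesting that no income value repeats)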
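To judge whether a reported mode is meaningful, it helps to see how often it occurs. The sketch below is a hypothetical variant of `getmode()` (the name `mode_with_count()` and the toy vectors are made up for illustration) that returns the count alongside the value.

```r
# Hypothetical variant of getmode() that also reports the frequency
mode_with_count <- function(v) {
  uniqv <- unique(v)
  counts <- tabulate(match(v, uniqv))  # occurrences of each unique value
  i <- which.max(counts)               # first value with the highest count
  list(value = uniqv[i], count = counts[i])
}

# A repeated value gives a meaningful mode
repeated <- mode_with_count(c(31, 28, 31, 45, 31, 28))

# With all-unique values the "mode" is just the first element, seen once
all_unique <- mode_with_count(c(5.5, 1.2, 9.9))

repeated
all_unique
```

A count of 1 signals that every value is unique and the mode carries no information for that column.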
# getting the range of the relevant columns
range_timespent <- range(df2$`Daily Time Spent on Site`)
range_age <- range(df2$Age)
range_income <- range(df2$`Area Income`)
range_internet <- range(df2$`Daily Internet Usage`)
print(range_timespent)
## [1] 32.60 91.43
print(range_age)
## [1] 19 61
print(range_income)
## [1] 19991.72 79484.80
print(range_internet)
## [1] 104.78 269.96
# range() returns the minimum (1st value) and the maximum (2nd value) of each column
# getting the quantiles of relevant columns
quant_timespent <- quantile(df2$`Daily Time Spent on Site`)
quant_age <- quantile(df2$Age)
quant_income <- quantile(df2$`Area Income`)
quant_internet <- quantile(df2$`Daily Internet Usage`)
print(quant_timespent)
## 0% 25% 50% 75% 100%
## 32.60 51.34 68.41 78.59 91.43
print(quant_age)
## 0% 25% 50% 75% 100%
## 19 29 35 42 61
print(quant_income)
## 0% 25% 50% 75% 100%
## 19991.72 47348.17 57260.41 65537.99 79484.80
print(quant_internet)
## 0% 25% 50% 75% 100%
## 104.780 138.615 183.430 218.885 269.960
# the quantiles are the cut points that split each column's sorted values into four equal-sized groups: minimum (0%), first quartile (25%), median (50%), third quartile (75%) and maximum (100%)
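The statistics computed so far repeat the same calls for each column. As a sketch of a more compact layout, the block below gathers them into one table with `sapply()`. The data frame `toy` and the helper `summarise_column()` are hypothetical and stand in for the relevant columns of `df2`.

```r
# Hypothetical stand-in for a few numeric columns of the dataset
toy <- data.frame(
  age   = c(23, 35, 31, 48, 29),
  usage = c(226.7, 256.1, 193.8, 120.4, 245.9)
)

# One named vector of statistics per column
summarise_column <- function(col) {
  c(mean = mean(col), median = median(col), min = min(col), max = max(col))
}

# sapply() binds the per-column vectors into a matrix (statistics x columns)
stats <- sapply(toy, summarise_column)
round(stats, 2)
```

Each column of the resulting matrix corresponds to one variable, which makes side-by-side comparison easier than separate `print()` calls.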
# getting the variance of the relevant columns
variance_timespent <- var(df2$`Daily Time Spent on Site`)
variance_age <- var(df2$Age)
variance_income <- var(df2$`Area Income`)
variance_internet <- var(df2$`Daily Internet Usage`)
print(variance_timespent)
## [1] 252.8258
print(variance_age)
## [1] 77.52303
print(variance_income)
## [1] 168000385
print(variance_internet)
## [1] 1940.743
# variance measures how far the values in a column are spread out from their mean, in squared units of that column; because the columns are on very different scales, the raw variances (e.g. area income vs. age) are not directly comparable
# getting the standard deviation of the relevant columns
sd_timespent <- sd(df2$`Daily Time Spent on Site`)
sd_age <- sd(df2$Age)
sd_income <- sd(df2$`Area Income`)
sd_internet <- sd(df2$`Daily Internet Usage`)
print(sd_timespent)
## [1] 15.9005
print(sd_age)
## [1] 8.804716
print(sd_income)
## [1] 12961.5
print(sd_internet)
## [1] 44.05386
# a low standard deviation indicates that values lie close to the mean while a high one indicates they are spread far from it; it is expressed in the column's own units, so the age value (8.8) and the area income value (12961.5) can only be compared relative to their respective means
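One scale-free way to compare spread across columns is the coefficient of variation (standard deviation divided by mean). This is an addition for illustration, not part of the original analysis, and `cv()` is a hypothetical helper. The second half reuses the means and standard deviations printed above.

```r
# Hypothetical helper: coefficient of variation
cv <- function(x) sd(x) / mean(x)

# Toy check: sd = 2, mean = 4
cv_toy <- cv(c(2, 4, 6))

# Using the values printed above for df2
cv_age    <- 8.804716 / 35.98587   # sd_age / mean_age
cv_income <- 12961.5 / 55349.1     # sd_income / mean_income

round(c(age = cv_age, income = cv_income), 3)
```

Relative to their means, age and area income turn out to be similarly spread (both roughly 0.23 to 0.25), even though their raw standard deviations differ by orders of magnitude.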
library(moments)
# getting the skewness of the relevant columns
sk_timespent <- skewness(df2$`Daily Time Spent on Site`)
sk_age <- skewness(df2$Age)
sk_income <- skewness(df2$`Area Income`)
sk_internet <- skewness(df2$`Daily Internet Usage`)
print(sk_timespent)
## [1] -0.3792563
print(sk_age)
## [1] 0.4839501
print(sk_income)
## [1] -0.5683297
print(sk_internet)
## [1] -0.03385825
# age is positively skewed (longer right tail); daily time spent and area income are negatively skewed (longer left tail); daily internet usage is close to symmetric (skewness near 0)
# getting the kurtosis of the relevant columns
kt_timespent <- kurtosis(df2$`Daily Time Spent on Site`)
kt_age <- kurtosis(df2$Age)
kt_income <- kurtosis(df2$`Area Income`)
kt_internet <- kurtosis(df2$`Daily Internet Usage`)
print(kt_timespent)
## [1] 1.901479
print(kt_age)
## [1] 2.596795
print(kt_income)
## [1] 2.694045
print(kt_internet)
## [1] 1.717443
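# all kurtosis values are below 3, the value for a normal distribution (kurtosis() from the moments package reports non-excess kurtosis), so the columns' distributions have lighter tails than a normal distribution, which is consistent with few or no remaining outliers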
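For reference, the kurtosis reported above can be reproduced in base R. This sketch assumes the Pearson (non-excess) definition, the fourth central moment divided by the squared second central moment, which to my understanding is what `moments::kurtosis()` computes. `pearson_kurtosis()` is a hypothetical name.

```r
# Hypothetical base-R version of Pearson (non-excess) kurtosis
pearson_kurtosis <- function(x) {
  d <- x - mean(x)          # deviations from the mean
  mean(d^4) / mean(d^2)^2   # m4 / m2^2
}

k <- pearson_kurtosis(c(1, 2, 3, 4, 5))
k        # 1.7
k - 3    # excess kurtosis: -1.3 (lighter tails than a normal distribution)
```

Subtracting 3 gives excess kurtosis, where 0 corresponds to a normal distribution and negative values indicate lighter tails.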
7.) BIVARIATE ANALYSIS:
# assigning the relevant columns to variables
time <- df2$`Daily Time Spent on Site`
age <- df2$Age
income <- df2$`Area Income`
internet <- df2$`Daily Internet Usage`
male <- df2$Male
ad <- df2$`Clicked on Ad`
# checking the covariance between the relevant columns and the clicked on ad column
print(cov(time,ad))
## [1] -5.958448
print(cov(age,ad))
## [1] 2.172663
print(cov(income,ad))
## [1] -3048.73
print(cov(internet,ad))
## [1] -17.43896
print(cov(male,ad))
## [1] -0.01044756
#daily time spent and clicks on ad have a negative relationship
#age and clicks on ad have a positive relationship
#area income and clicks on ad have a negative relationship
#internet usage and clicks on ad have a negative relationship
#male and clicks on ad have a negative relationship
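Covariance gives the direction of a relationship, but its size depends on the units of each column, which is why the area income covariance looks so large. Dividing by both standard deviations gives the unit-free correlation. The sketch below checks this on made-up vectors (`x` and `y` are hypothetical, not columns of the dataset).

```r
# Hypothetical toy data: a continuous variable and a 0/1 outcome
x <- c(80, 75, 70, 40, 45, 50)
y <- c(0, 0, 0, 1, 1, 1)

# Correlation is covariance scaled by both standard deviations
manual_cor <- cov(x, y) / (sd(x) * sd(y))

# Changing the units of x (e.g. minutes to seconds) scales the covariance
# by the same factor but leaves the correlation unchanged
cov_ratio  <- cov(x * 60, y) / cov(x, y)
cor_scaled <- cor(x * 60, y)

manual_cor
cor(x, y)
cov_ratio
```

This is why the correlation values, not the covariances, are the better basis for comparing variables.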
# checking the correlation between the relevant columns and the clicked on ad column
print(cor(time,ad))
## [1] -0.7491196
print(cor(age,ad))
## [1] 0.4932938
print(cor(income,ad))
## [1] -0.4702107
print(cor(internet,ad))
## [1] -0.7913439
print(cor(male,ad))
## [1] -0.04178558
# daily time spent and clicks on ad have a strong negative relationship
# age and clicks on ad have a moderate positive relationship
# area income and clicks on ad have a moderate negative relationship
# internet usage and clicks on ad have a strong negative relationship
# male and clicks on ad have a negligible relationship
# creating a matrix of the relevant numeric columns and previewing it
df3 <- cbind(time,age,income,internet,male,ad)
head(df3)
## time age income internet male ad
## [1,] 68.95 35 61833.90 256.09 0 0
## [2,] 80.23 31 68441.85 193.77 1 0
## [3,] 69.47 26 59785.94 236.50 0 0
## [4,] 74.15 29 54806.18 245.89 1 0
## [5,] 68.37 35 73889.99 225.58 0 0
## [6,] 59.99 23 59761.56 226.74 1 0
# getting the correlation matrix of the relevant columns in the new matrix
cor(df3)
## time age income internet male
## time 1.00000000 -0.33285145 0.31345211 0.52003198 -0.01997511
## age -0.33285145 1.00000000 -0.18177184 -0.36795375 -0.02416672
## income 0.31345211 -0.18177184 1.00000000 0.35221300 0.01199546
## internet 0.52003198 -0.36795375 0.35221300 1.00000000 0.02773955
## male -0.01997511 -0.02416672 0.01199546 0.02773955 1.00000000
## ad -0.74911960 0.49329383 -0.47021067 -0.79134386 -0.04178558
## ad
## time -0.74911960
## age 0.49329383
## income -0.47021067
## internet -0.79134386
## male -0.04178558
## ad 1.00000000
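Since the question is about clicks, the column of interest in this matrix is `ad`. As a sketch, the block below pulls that column out of a correlation matrix and ranks the other variables by the strength of their relationship with it. The matrix `toy` is made up for illustration and only mimics the layout of `df3`.

```r
# Hypothetical toy matrix with the same layout idea as df3
toy <- cbind(
  time = c(80, 75, 70, 40, 45, 50),
  age  = c(25, 30, 28, 45, 50, 40),
  ad   = c(0, 0, 0, 1, 1, 1)
)

# Correlation of every column with "ad", dropping ad's correlation with itself
cors <- cor(toy)[, "ad"]
cors <- cors[names(cors) != "ad"]

# Rank by absolute strength, keeping the sign visible
ranked <- cors[order(abs(cors), decreasing = TRUE)]
ranked
```

Applied to the report's matrix as `cor(df3)[, "ad"]`, this would put daily internet usage and daily time spent at the top, in line with the values shown above.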
# Correlogram in R
# required packages
library(corrplot)
## corrplot 0.92 loaded
#correlation matrix
x <- cor(df3)
#visualizing correlogram
# as colour
corrplot(x, method="color")
8.) RECOMMENDATIONS: Since our data has revealed the correlation between the relevant columns and clicks on ads, we conclude that the entrepreneur should focus on:
a. The older population, since the correlation between age and clicks is moderately positive (0.49), indicating that clicks become more likely as age increases
b. The regions with a lower area income, since the correlation between area income and clicks on ads is moderately negative (-0.47), indicating that clicks become more likely as area income decreases
c. Users with low daily internet usage, since the correlation between daily internet usage and clicks on ads is strongly negative (-0.79), indicating that clicks become more likely as internet use decreases
d. Users who spend little time on the site each day, since the correlation between daily time spent on site and clicks on ads is strongly negative (-0.75), indicating that clicks become more likely as daily time spent decreases