R WEEK 1 IP

# install the package only if it is missing; the original call passed the
# placeholder name "r package", which is not a CRAN package and only produced
# a "package is not available" warning (tinytex is assumed to be the intended
# package because it is the one loaded next)
if (!requireNamespace("tinytex", quietly = TRUE)) {
  install.packages("tinytex", repos = "https://cran.us.r-project.org")
}

library(tinytex)

1.) DEFINING THE QUESTION: Which individuals are most likely to click on my ads?

2.) METRIC OF SUCCESS: Variables with a strong positive correlation with clicks will indicate which individuals are most likely to click on the new course’s ads.

3.) EXPERIMENTAL DESIGN TAKEN:

a. Load the dataset and preview its first 6 and last 6 rows

b. Check the shape of the data and the data types of the columns

c. Check for duplicates

d. Detect outliers in the columns using the interquartile range (IQR)

e. Remove outliers in the Area Income column

f. Univariate analysis: calculate the mean, median, mode, range, IQR, standard deviation, variance, skewness, kurtosis and quantiles

g. Bivariate analysis: calculate the covariance and correlation among the columns, then plot and visualise the correlation matrix

h. Recommendations

4.) APPROPRIATENESS OF THE DATA: Our dataset contains all the variables required to undertake our study, i.e. daily time spent on site, age, area income, daily internet usage, male, and clicked on ad.

5.) EXPLORATORY DATA ANALYSIS:

# loading the advertising dataset using the fread function
library(data.table)

#import data
df <- fread("C:\\Users\\Gakungi\\OneDrive\\Desktop\\R\\advertising.csv")

# previewing the first 6 rows of our dataset
head(df)

##    Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1:                    68.95  35    61833.90               256.09
## 2:                    80.23  31    68441.85               193.77
## 3:                    69.47  26    59785.94               236.50
## 4:                    74.15  29    54806.18               245.89
## 5:                    68.37  35    73889.99               225.58
## 6:                    59.99  23    59761.56               226.74
##                            Ad Topic Line           City Male    Country
## 1:    Cloned 5thgeneration orchestration    Wrightburgh    0    Tunisia
## 2:    Monitored national standardization      West Jodi    1      Nauru
## 3:      Organic bottom-line service-desk       Davidton    0 San Marino
## 4: Triple-buffered reciprocal time-frame West Terrifurt    1      Italy
## 5:         Robust logistical utilization   South Manuel    0    Iceland
## 6:       Sharable client-driven software      Jamieberg    1     Norway
##              Timestamp Clicked on Ad
## 1: 2016-03-27 00:53:11             0
## 2: 2016-04-04 01:39:02             0
## 3: 2016-03-13 20:35:42             0
## 4: 2016-01-10 02:31:19             0
## 5: 2016-06-03 03:36:18             0
## 6: 2016-05-19 14:30:17             0

# previewing the last 6 rows
tail(df)

##    Daily Time Spent on Site Age Area Income Daily Internet Usage
## 1:                    43.70  28    63126.96               173.01
## 2:                    72.97  30    71384.57               208.58
## 3:                    51.30  45    67782.17               134.42
## 4:                    51.63  51    42415.72               120.37
## 5:                    55.55  19    41920.79               187.95
## 6:                    45.01  26    29875.80               178.35
##                           Ad Topic Line          City Male
## 1:        Front-line bifurcated ability  Nicholasland    0
## 2:        Fundamental modular algorithm     Duffystad    1
## 3:      Grass-roots cohesive monitoring   New Darlene    1
## 4:         Expanded intangible solution South Jessica    1
## 5: Proactive bandwidth-monitored policy   West Steven    0
## 6:      Virtual 5thgeneration emulation   Ronniemouth    0
##                   Country           Timestamp Clicked on Ad
## 1:                Mayotte 2016-04-04 03:57:48             1
## 2:                Lebanon 2016-02-11 21:49:00             1
## 3: Bosnia and Herzegovina 2016-04-22 02:07:01             1
## 4:               Mongolia 2016-02-01 17:24:57             1
## 5:              Guatemala 2016-03-24 02:35:54             0
## 6:                 Brazil 2016-06-03 21:43:21             1

# checking the shape of our dataset
dim(df)

## [1] 1000   10

# we have 1000 rows and 10 columns

# checking the data types of our 10 columns
str(df)

## Classes 'data.table' and 'data.frame':   1000 obs. of  10 variables:
##  $ Daily Time Spent on Site: num  69 80.2 69.5 74.2 68.4 ...
##  $ Age                     : int  35 31 26 29 35 23 33 48 30 20 ...
##  $ Area Income             : num  61834 68442 59786 54806 73890 ...
##  $ Daily Internet Usage    : num  256 194 236 246 226 ...
##  $ Ad Topic Line           : chr  "Cloned 5thgeneration orchestration" "Monitored national standardization" "Organic bottom-line service-desk" "Triple-buffered reciprocal time-frame" ...
##  $ City                    : chr  "Wrightburgh" "West Jodi" "Davidton" "West Terrifurt" ...
##  $ Male                    : int  0 1 0 1 0 1 0 1 1 1 ...
##  $ Country                 : chr  "Tunisia" "Nauru" "San Marino" "Italy" ...
##  $ Timestamp               : POSIXct, format: "2016-03-27 00:53:11" "2016-04-04 01:39:02" ...
##  $ Clicked on Ad           : int  0 0 0 0 0 0 0 1 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>

# our columns have the appropriate data types attached to them

# checking for duplicates in the data
dup <- df[duplicated(df),]
dup

## Empty data.table (0 rows and 10 cols): Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City...
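The same check can be reduced to a single count, which is easier to scan than an empty table. A minimal sketch on a small made-up data frame (`toy` is hypothetical and only stands in for `df`):

```r
# hypothetical toy data standing in for the advertising data
toy <- data.frame(
  age  = c(35, 31, 35, 26),
  male = c(0, 1, 0, 0)
)

# duplicated() flags every row that repeats an earlier row
n_dup <- sum(duplicated(toy))
print(n_dup)

# unique() keeps the first occurrence of each row
toy_clean <- unique(toy)
print(nrow(toy_clean))
```

A count of zero on the real data would correspond to the empty data.table shown above.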

OUTLIER DETECTION USING IQR

# Compute the interquartile range (IQR) of the daily time spent column
time.IQR <- IQR(df$`Daily Time Spent on Site`)
time.IQR

## [1] 27.1875

# Lower fence: first quartile (51.36, entered by hand) minus 1.5 * IQR
lowertimespent <- 51.36 - 1.5 * time.IQR
# Upper fence: third quartile (78.55, entered by hand) plus 1.5 * IQR
uppertimespent <- 78.55 + 1.5 * time.IQR

lowertimespent

## [1] 10.57875

uppertimespent

## [1] 119.3312

# Check all the points above the uppertime spent using which() function
print(which(df$`Daily Time Spent on Site` > uppertimespent))

## integer(0)

# Check all the points below the lowertime spent using which() function
print(which(df$`Daily Time Spent on Site` < lowertimespent))

## integer(0)

NO OUTLIERS EXIST IN DAILY TIME SPENT ON SITE COLUMN
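The fences above rely on quartiles typed in by hand. A minimal sketch of the same 1.5 * IQR rule with the quartiles computed by `quantile()` instead; `iqr_fences` is a hypothetical helper and `x` is made-up data, so the numbers will not match the output above:

```r
# hypothetical helper: lower and upper fences of the 1.5 * IQR rule
iqr_fences <- function(v, k = 1.5) {
  q <- quantile(v, probs = c(0.25, 0.75), names = FALSE, na.rm = TRUE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

# made-up values with one obviously extreme observation
x <- c(10, 12, 13, 15, 16, 18, 20, 95)
fences <- iqr_fences(x)
print(fences)

# positions of the values falling outside the fences
outlier_idx <- which(x < fences[["lower"]] | x > fences[["upper"]])
print(outlier_idx)
```

Computing the quartiles inside the helper avoids rounding drift between the hand-typed values and the result of `IQR()`, and the same function could be reused for every numeric column.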

# Compute the interquartile range (IQR) of the age column
age.IQR <- IQR(df$Age)
age.IQR

## [1] 13

# Lower fence: first quartile (29) minus 1.5 * IQR
lowerage <- 29 - 1.5 * age.IQR
# Upper fence: third quartile (42) plus 1.5 * IQR
upperage <- 42 + 1.5 * age.IQR

lowerage

## [1] 9.5

upperage

## [1] 61.5

# Check all the points above the upper age using which() function
print(which(df$Age > upperage))

## integer(0)

# Check all the points below the lower age  using which() function
print(which(df$Age < lowerage))

## integer(0)

NO OUTLIERS EXIST IN THE AGE COLUMN

# Compute the interquartile range (IQR) of the area income column
income.IQR <- IQR(df$`Area Income`)
income.IQR

## [1] 18438.83

# Lower fence: first quartile (47032, entered by hand) minus 1.5 * IQR
lowerincome <- 47032 - 1.5 * income.IQR
# Upper fence: third quartile (65471, entered by hand) plus 1.5 * IQR
upperincome <- 65471 + 1.5 * income.IQR

lowerincome

## [1] 19373.75

upperincome

## [1] 93129.25

# Check all the points above the upper area income using which() function
print(which(df$`Area Income` > upperincome))

## integer(0)

# Check all the points below the lower area income using which() function
print(which(df$`Area Income` < lowerincome))

## [1] 136 411 511 641 666 693 769 779 953

WE HAVE OUTLIERS IN THE AREA INCOME COLUMN

# VISUALIZING THE AREA INCOME COLUMN OUTLIERS

(boxplot(df$`Area Income`))

## $stats
##          [,1]
## [1,] 19345.36
## [2,] 47012.58
## [3,] 57012.30
## [4,] 65479.35
## [5,] 79484.80
## 
## $n
## [1] 1000
## 
## $conf
##          [,1]
## [1,] 56089.63
## [2,] 57934.97
## 
## $out
## [1] 17709.98 18819.34 15598.29 15879.10 14548.06 13996.50 14775.50 18368.57
## 
## $group
## [1] 1 1 1 1 1 1 1 1
## 
## $names
## [1] "1"
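The box plot flags 8 points while `which()` above flagged 9 rows. The likely reason is that `boxplot()` builds its fences from Tukey's hinges (the values returned by `fivenum()`), whereas the manual check used hand-typed quartiles. The hinges in `$stats`, 47012.58 and 65479.35, differ slightly from 47032 and 65471, so the row with 19345.36 sits at the end of the box plot's whisker but below the manual lower fence of 19373.75. A minimal sketch on made-up data showing that the two quartile definitions can disagree:

```r
x <- c(1, 2, 3, 4)  # made-up data

# Tukey's hinges, the quartile definition behind boxplot() and boxplot.stats()
hinges <- fivenum(x)[c(2, 4)]
print(hinges)

# default quantile() quartiles (type 7), the definition behind IQR()
quartiles <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
print(quartiles)
```

Whichever definition is chosen, using the same one for both the detection and the plot keeps the outlier counts consistent.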

# REMOVING OUTLIERS IN THE AREA INCOME COLUMN
# only keep rows whose Area Income lies strictly between the lower and upper fences
df2 <- subset(df, df$`Area Income` > lowerincome & df$`Area Income` < upperincome)

# previewing our datasets new shape
dim(df2)

## [1] 991  10

# we now have 991 rows and 10 columns

# Compute the interquartile range (IQR) of the daily internet usage column
internet.IQR <- IQR(df2$`Daily Internet Usage`)
internet.IQR

## [1] 80.27

# Lower fence: first quartile (138.8, entered by hand) minus 1.5 * IQR
lowerinternet <- 138.8 - 1.5 * internet.IQR
# Upper fence: third quartile (218.8, entered by hand) plus 1.5 * IQR
upperinternet <- 218.8 + 1.5 * internet.IQR

lowerinternet

## [1] 18.395

upperinternet

## [1] 339.205

# Check all the points above the upper internet usage using which() function
print(which(df2$`Daily Internet Usage` > upperinternet))

## integer(0)

# Check all the points below the lower internet usage using which() function
print(which(df2$`Daily Internet Usage` < lowerinternet))

## integer(0)

THERE ARE NO OUTLIERS IN THE DAILY INTERNET USAGE COLUMN

# Compute the interquartile range (IQR) of the male column
male.IQR <- IQR(df2$Male)
male.IQR

## [1] 1

# Lower fence: first quartile (0) minus 1.5 * IQR
lowermale <- 0 - 1.5 * male.IQR
# Upper fence: third quartile (1) plus 1.5 * IQR
uppermale <- 1 + 1.5 * male.IQR

lowermale

## [1] -1.5

uppermale

## [1] 2.5

# Check all the points above the upper male using which() function
print(which(df2$Male > uppermale))

## integer(0)

# Check all the points below the lower male using which() function
print(which(df2$Male < lowermale))

## integer(0)

NO OUTLIERS EXIST IN THE MALE COLUMN

# Compute the interquartile range (IQR) of the clicked on ad column
ad.IQR <- IQR(df2$`Clicked on Ad`)
ad.IQR

## [1] 1

# Lower fence: first quartile (0) minus 1.5 * IQR
lowerad <- 0 - 1.5 * ad.IQR
# Upper fence: third quartile (1) plus 1.5 * IQR
upperad <- 1 + 1.5 * ad.IQR

lowerad

## [1] -1.5

upperad

## [1] 2.5

# Check all the points above the upper ad using which() function
print(which(df2$`Clicked on Ad` > upperad))

## integer(0)

# Check all the points below the lower ad using which() function
print(which(df2$`Clicked on Ad` < lowerad))

## integer(0)

NO OUTLIERS EXIST IN THE CLICKED ON AD COLUMN
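For 0/1 columns such as Male and Clicked on Ad the IQR rule cannot flag anything, because both possible values lie inside the fences of -1.5 and 2.5 computed above. A frequency table is a more informative sanity check for such columns. A minimal sketch on a made-up binary vector (`clicked` is hypothetical):

```r
clicked <- c(0, 1, 1, 0, 1, 0, 0, 1)  # made-up 0/1 values

# counts per level; any value other than 0 or 1 would appear as an extra level
counts <- table(clicked)
print(counts)

# share of each level
shares <- prop.table(counts)
print(shares)
```

On the real data, `table(df2$Male)` and `table(df2$"Clicked on Ad")` would show whether the two classes are balanced.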

6.) UNIVARIATE ANALYSIS:

# getting the mean of relevant columns
mean_timespent <- mean(df2$`Daily Time Spent on Site`)
mean_age <- mean(df2$Age)
mean_income <- mean(df2$`Area Income`)
mean_internet <- mean(df2$`Daily Internet Usage`)

print(mean_timespent)

## [1] 65.05689

print(mean_age)

## [1] 35.98587

print(mean_income)

## [1] 55349.1

print(mean_internet)

## [1] 179.9846

# the mean represents the average of the values per column respectively

# getting the median of the relevant columns
median_timespent <- median(df2$`Daily Time Spent on Site`)
median_age <- median(df2$Age)
median_income <- median(df2$`Area Income`)
median_internet <- median(df2$`Daily Internet Usage`)

print(median_timespent)

## [1] 68.41

print(median_age)

## [1] 35

print(median_income)

## [1] 57260.41

print(median_internet)

## [1] 183.43

# the median is the middle value of each column once its values are sorted

# getting the mode of the relevant columns
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

mode_timespent <- getmode(df2$`Daily Time Spent on Site`)
mode_age <- getmode(df2$Age)
mode_income <- getmode(df2$`Area Income`)
mode_internet <- getmode(df2$`Daily Internet Usage`)

print(mode_timespent)

## [1] 62.26

print(mode_age)

## [1] 31

print(mode_income)

## [1] 61833.9

print(mode_internet)

## [1] 167.22

# the mode represents the most repeated value per column respectively
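Note that `getmode()` returns the first value it meets when several values tie for the highest count. In a continuous column where every value is distinct, that is simply the first element, which would explain why the reported Area Income mode, 61833.9, equals the Area Income of the first row shown by `head(df)`. A small check on made-up vectors (the function is repeated so the snippet runs on its own):

```r
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# all values distinct: the "mode" is just the first element
mode_distinct <- getmode(c(61.8, 68.4, 59.7))
print(mode_distinct)

# a genuine repeat: the most frequent value is returned
mode_repeat <- getmode(c(29, 31, 31, 35))
print(mode_repeat)
```

Comparing `length(unique(v))` with `length(v)` before reporting a mode is one way to tell whether the column has any repeated values at all.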

# getting the range of the relevant columns
range_timespent <- range(df2$`Daily Time Spent on Site`)
range_age <- range(df2$Age)
range_income <- range(df2$`Area Income`)
range_internet <- range(df2$`Daily Internet Usage`)

print(range_timespent)

## [1] 32.60 91.43

print(range_age)

## [1] 19 61

print(range_income)

## [1] 19991.72 79484.80

print(range_internet)

## [1] 104.78 269.96

# the range gives the maximum and minimum figure for each column with the 1st value representing the minimum value and the 2nd value representing the maximum value for each column respectively

# getting the quantiles of relevant columns
quant_timespent <- quantile(df2$`Daily Time Spent on Site`)
quant_age <- quantile(df2$Age)
quant_income <- quantile(df2$`Area Income`)
quant_internet <- quantile(df2$`Daily Internet Usage`)

print(quant_timespent)

##    0%   25%   50%   75%  100% 
## 32.60 51.34 68.41 78.59 91.43

print(quant_age)

##   0%  25%  50%  75% 100% 
##   19   29   35   42   61

print(quant_income)

##       0%      25%      50%      75%     100% 
## 19991.72 47348.17 57260.41 65537.99 79484.80

print(quant_internet)

##      0%     25%     50%     75%    100% 
## 104.780 138.615 183.430 218.885 269.960

# the quantiles are the cut points that split each column's sorted values into four equal-sized groups: the minimum (0%), the three quartiles (25%, 50%, 75%) and the maximum (100%)

# getting the variance of the relevant columns
variance_timespent <- var(df2$`Daily Time Spent on Site`)
variance_age <- var(df2$Age)
variance_income <- var(df2$`Area Income`)
variance_internet <- var(df2$`Daily Internet Usage`)

print(variance_timespent)

## [1] 252.8258

print(variance_age)

## [1] 77.52303

print(variance_income)

## [1] 168000385

print(variance_internet)

## [1] 1940.743

# variance measures how far the values in each column are spread out from their mean, e.g. the Area Income values are far more spread out from their mean than the Age values

# getting the standard deviation of the relevant columns
sd_timespent <- sd(df2$`Daily Time Spent on Site`)
sd_age <- sd(df2$Age)
sd_income <- sd(df2$`Area Income`)
sd_internet <- sd(df2$`Daily Internet Usage`)

print(sd_timespent)

## [1] 15.9005

print(sd_age)

## [1] 8.804716

print(sd_income)

## [1] 12961.5

print(sd_internet)

## [1] 44.05386

# a low standard deviation indicates that values lie close to the mean, while a high one indicates they are spread far from it, e.g. the Age column's standard deviation of 8.8 versus 12961.5 for the Area Income column (note that the two columns are measured on very different scales)
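Because Age and Area Income are measured in different units, their standard deviations are not directly comparable. One common way to compare spread across scales is the coefficient of variation (standard deviation divided by mean). A minimal sketch using the means and standard deviations printed above; the `cv_*` names are introduced here for illustration only:

```r
# values copied from the output above
mean_age <- 35.98587
sd_age <- 8.804716
mean_income <- 55349.1
sd_income <- 12961.5

# coefficient of variation: spread relative to the mean (unitless)
cv_age <- sd_age / mean_age
cv_income <- sd_income / mean_income

print(round(cv_age, 2))
print(round(cv_income, 2))
```

Both ratios come out at roughly 0.24 and 0.23, which suggests the two columns are similarly dispersed relative to their means even though their raw standard deviations differ by orders of magnitude.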

library(moments)
# getting the skewness of the relevant columns
sk_timespent <- skewness(df2$`Daily Time Spent on Site`)
sk_age <- skewness(df2$Age)
sk_income <- skewness(df2$`Area Income`)
sk_internet <- skewness(df2$`Daily Internet Usage`)

print(sk_timespent)

## [1] -0.3792563

print(sk_age)

## [1] 0.4839501

print(sk_income)

## [1] -0.5683297

print(sk_internet)

## [1] -0.03385825

# the positive skewness of the age column indicates that its distribution has a longer right tail than left tail, while the other columns are negatively skewed (longer left tails), although daily internet usage (-0.03) is close to symmetric

# getting the kurtosis of the relevant columns
kt_timespent <- kurtosis(df2$`Daily Time Spent on Site`)
kt_age <- kurtosis(df2$Age)
kt_income <- kurtosis(df2$`Area Income`)
kt_internet <- kurtosis(df2$`Daily Internet Usage`)

print(kt_timespent)

## [1] 1.901479

print(kt_age)

## [1] 2.596795

print(kt_income)

## [1] 2.694045

print(kt_internet)

## [1] 1.717443

# all kurtosis values are below 3, the value for a normal distribution, hence our columns' distributions have lighter tails than a normal distribution, indicating few or no outliers
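The `kurtosis()` function from the moments package returns Pearson's kurtosis rather than excess kurtosis, so its values are read against 3 rather than 0. A minimal base-R sketch of the same moment ratio, with `pearson_kurtosis` as a hypothetical helper name and made-up data:

```r
# hypothetical helper: fourth central moment over squared second central moment
pearson_kurtosis <- function(v) {
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2
}

x <- c(1, 2, 3, 4, 5)  # made-up, evenly spread values
k <- pearson_kurtosis(x)
print(k)

# excess kurtosis subtracts the normal benchmark of 3
excess_k <- k - 3
print(excess_k)
```

A negative excess kurtosis, as in this toy vector, corresponds to the light-tailed reading given for the four columns above.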

7.) BIVARIATE ANALYSIS:

# assigning the relevant columns to variables
time <- df2$`Daily Time Spent on Site`
age <- df2$Age
income <- df2$`Area Income`
internet <- df2$`Daily Internet Usage`
male <- df2$Male
ad <- df2$`Clicked on Ad`

# checking the covariance between the relevant columns and the clicked on ad column
print(cov(time,ad))

## [1] -5.958448

print(cov(age,ad))

## [1] 2.172663

print(cov(income,ad))

## [1] -3048.73

print(cov(internet,ad))

## [1] -17.43896

print(cov(male,ad))

## [1] -0.01044756

#daily time spent and clicks on ad have a negative relationship
#age and clicks on ad have a positive relationship
#area income and clicks on ad have a negative relationship
#internet usage and clicks on ad have a negative relationship
#male and clicks on ad have a negative relationship

# checking the correlation between the relevant columns and the clicked on ad column
print(cor(time,ad))

## [1] -0.7491196

print(cor(age,ad))

## [1] 0.4932938

print(cor(income,ad))

## [1] -0.4702107

print(cor(internet,ad))

## [1] -0.7913439

print(cor(male,ad))

## [1] -0.04178558

# daily time spent and clicks on ad have a strong negative relationship
# age and clicks on ad have a moderate positive relationship
# area income and clicks on ad have a moderate negative relationship
# internet usage and clicks on ad have a strong negative relationship
# male and clicks on ad have a negligible negative relationship
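Correlation is covariance rescaled by the two standard deviations, which is why the signs match the covariances above while the magnitudes become comparable across columns. A minimal sketch on made-up vectors (`usage` and `clicked` are hypothetical):

```r
usage   <- c(250, 230, 190, 140, 120, 110)  # made-up daily usage
clicked <- c(0, 0, 0, 1, 1, 1)              # made-up 0/1 outcome

# Pearson correlation computed directly
r_direct <- cor(usage, clicked)

# the same value from the covariance and the two standard deviations
r_manual <- cov(usage, clicked) / (sd(usage) * sd(clicked))

print(all.equal(r_direct, r_manual))
print(r_direct < 0)
```

The rescaling is what makes a value such as -0.79 interpretable on a fixed -1 to 1 scale, unlike a raw covariance of -3048.73 that mostly reflects the units of Area Income.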

# creating a matrix of the relevant numeric columns and previewing it
df3 <- cbind(time,age,income,internet,male,ad)
head(df3)

##       time age   income internet male ad
## [1,] 68.95  35 61833.90   256.09    0  0
## [2,] 80.23  31 68441.85   193.77    1  0
## [3,] 69.47  26 59785.94   236.50    0  0
## [4,] 74.15  29 54806.18   245.89    1  0
## [5,] 68.37  35 73889.99   225.58    0  0
## [6,] 59.99  23 59761.56   226.74    1  0

# getting the correlation matrix of the relevant columns in the new matrix
cor(df3)

##                 time         age      income    internet        male
## time      1.00000000 -0.33285145  0.31345211  0.52003198 -0.01997511
## age      -0.33285145  1.00000000 -0.18177184 -0.36795375 -0.02416672
## income    0.31345211 -0.18177184  1.00000000  0.35221300  0.01199546
## internet  0.52003198 -0.36795375  0.35221300  1.00000000  0.02773955
## male     -0.01997511 -0.02416672  0.01199546  0.02773955  1.00000000
## ad       -0.74911960  0.49329383 -0.47021067 -0.79134386 -0.04178558
##                   ad
## time     -0.74911960
## age       0.49329383
## income   -0.47021067
## internet -0.79134386
## male     -0.04178558
## ad        1.00000000

# Correlogram in R
# required packages
library(corrplot)

## corrplot 0.92 loaded

#correlation matrix
x <- cor(df3)

  
#visualizing correlogram
# as colour
corrplot(x, method="color")

8.) RECOMMENDATIONS: Since our data has revealed the correlation between the relevant columns and clicks on ads, we conclude that the entrepreneur should focus on:

a. The older population, since the correlation between age and clicks is moderately positive (0.49), indicating that clicks become more likely as age increases

b. The regions with a lower area income, since the correlation between area income and clicks on ads is moderately negative (-0.47), indicating that clicks become more likely as area income decreases

c. The regions with low daily internet usage, since the correlation between daily internet usage and clicks on ads is strongly negative (-0.79), indicating that clicks become more likely as internet usage decreases

d. The regions with low daily time spent on site, since the correlation between daily time spent on site and clicks on ads is strongly negative (-0.75), indicating that clicks become more likely as daily time spent decreases
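The priorities above can be cross-checked by ranking the variables by the absolute size of their correlation with clicks. A minimal sketch using the correlation values printed in the bivariate analysis; the vector name `cor_with_ad` is introduced here for illustration only:

```r
# correlations with the clicked on ad column, copied from the output above
cor_with_ad <- c(
  time     = -0.7491196,
  age      =  0.4932938,
  income   = -0.4702107,
  internet = -0.7913439,
  male     = -0.04178558
)

# strongest relationships first, regardless of sign
ranked <- cor_with_ad[order(abs(cor_with_ad), decreasing = TRUE)]
print(ranked)
```

The ordering puts daily internet usage and daily time spent on site first, followed by age and area income, with the male indicator contributing almost nothing.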

Gakungi

2022-05-27
