1.1 Data
The dataset used for the first research question of this homework comes
from the UCI Machine Learning Repository and is called the Online
Shoppers Purchasing Intention Dataset (link to the dataset).
The dataset describes an online bookstore that runs on an e-commerce
platform and records the behavior of its online shoppers. It covers
12330 sessions collected over one year, which can be used to predict a
visitor's likelihood of making a purchase (Revenue=True). This
information is valuable for sellers who want to understand what attracts
their customers and makes them buy their products.
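As a minimal sketch of how the data could be read in (the file name
online_shoppers_intention.csv is an assumption; adjust the path to
wherever the CSV from the UCI repository was saved):
#loading the CSV downloaded from the UCI repository (assumed file name)
mydata <- read.csv("online_shoppers_intention.csv", stringsAsFactors = FALSE)
str(mydata) #inspecting the structure of the data frame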
#this code creates a new column called ID and adds it as the first column, since the dataset does not provide one; it will help with the later analysis
#cbind() is used to bind columns together, i.e. to combine vectors, matrices, or data frames by columns
mydata <- cbind(ID = 1:nrow(mydata), mydata)
head(mydata) #showing the first six rows
## ID Administrative Administrative_Duration Informational
## 1 1 0 0 0
## 2 2 0 0 0
## 3 3 0 0 0
## 4 4 0 0 0
## 5 5 0 0 0
## 6 6 0 0 0
## Informational_Duration ProductRelated ProductRelated_Duration BounceRates
## 1 0 1 0 0.2
## 2 0 2 64 0
## 3 0 1 0 0.2
## 4 0 2 2.666666667 0.05
## 5 0 10 627.5 0.02
## 6 0 19 154.2166667 0.015789474
## ExitRates PageValues SpecialDay Month OperatingSystems Browser Region
## 1 0.2 0 0 Feb 1 1 1
## 2 0.1 0 0 Feb 2 2 1
## 3 0.2 0 0 Feb 4 1 9
## 4 0.14 0 0 Feb 3 2 2
## 5 0.05 0 0 Feb 3 3 1
## 6 0.024561404 0 0 Feb 2 2 1
## TrafficType VisitorType Weekend Revenue
## 1 1 Returning_Visitor FALSE FALSE
## 2 2 Returning_Visitor FALSE FALSE
## 3 3 Returning_Visitor FALSE FALSE
## 4 4 Returning_Visitor FALSE FALSE
## 5 4 Returning_Visitor TRUE FALSE
## 6 3 Returning_Visitor FALSE FALSE
tail(mydata) #showing the last six rows; we can see that the dataset has 12330 observations
## ID Administrative Administrative_Duration Informational
## 12325 12325 0 0 1
## 12326 12326 3 145 0
## 12327 12327 0 0 0
## 12328 12328 0 0 0
## 12329 12329 4 75 0
## 12330 12330 0 0 0
## Informational_Duration ProductRelated ProductRelated_Duration BounceRates
## 12325 0 16 503 0
## 12326 0 53 1783.791667 0.007142857
## 12327 0 5 465.75 0
## 12328 0 6 184.25 0.083333333
## 12329 0 15 346 0
## 12330 0 3 21.25 0
## ExitRates PageValues SpecialDay Month OperatingSystems Browser Region
## 12325 0.037647059 0 0 Nov 2 2 1
## 12326 0.029030612 12.24171745 0 Dec 4 6 1
## 12327 0.021333333 0 0 Nov 3 2 1
## 12328 0.086666667 0 0 Nov 3 2 1
## 12329 0.021052632 0 0 Nov 2 2 3
## 12330 0.066666667 0 0 Nov 3 2 1
## TrafficType VisitorType Weekend Revenue
## 12325 1 Returning_Visitor FALSE FALSE
## 12326 1 Returning_Visitor TRUE FALSE
## 12327 8 Returning_Visitor TRUE FALSE
## 12328 13 Returning_Visitor TRUE FALSE
## 12329 11 Returning_Visitor FALSE FALSE
## 12330 2 New_Visitor TRUE FALSE
#showing the number of columns; the dataset has 19 columns,
#one of which is the dependent variable Revenue
ncol(mydata)
## [1] 19
nrow(mydata) #showing the number of rows
## [1] 12330
#creating a vector named categorical_columns that contains the names of the 12th through 19th columns of mydata
categorical_columns <- names(mydata)[12:19]
for (col in categorical_columns) #for each column in the vector named categorical_columns do this
{
#for each column, find the unique values and print them as a single comma-separated string
print(toString(unique(mydata[[col]])))
}
## [1] "Feb, Mar, May, Oct, June, Jul, Aug, Nov, Sep, Dec"
## [1] "1, 2, 4, 3, 7, 6, 8, 5"
## [1] "1, 2, 3, 4, 5, 6, 7, 10, 8, 9, 12, 13, 11"
## [1] "1, 9, 2, 3, 4, 5, 6, 7, 8"
## [1] "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18, 19, 16, 17, 20"
## [1] "Returning_Visitor, New_Visitor, Other"
## [1] "FALSE, TRUE"
## [1] "FALSE, TRUE"
#is.na() finds missing values (NA) in the data frame mydata; it returns TRUE where an element is NA and FALSE otherwise
#sum() counts the number of missing values (NA); the result is 0, so there are no missing values
sum(is.na(mydata))
## [1] 0
#creating a vector named char_columns that contains the names of the 8th through 11th columns of mydata
char_columns <- names(mydata)[8:11]
for (col in char_columns) #for each column in the vector named char_columns do the following
{
#convert each of these columns from character to integer
mydata[[col]] <- as.integer(mydata[[col]])
}
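The analysis in the next section uses the factor variables VisitorTypeF
and RevenueF, which are not created in the code shown above. A plausible
sketch of the missing preparation step is given below; the exclusion of
the "Other" visitor type is an assumption, suggested by the margin total
of 12245 (rather than 12330) in the contingency tables of the next
section.
#keeping only returning and new visitors (assumption: "Other" is dropped)
mydata <- mydata[mydata$VisitorType != "Other", ]
#recoding the two variables of interest as factors
mydata$VisitorTypeF <- factor(mydata$VisitorType,
levels = c("Returning_Visitor", "New_Visitor"))
mydata$RevenueF <- factor(mydata$Revenue,
levels = c(FALSE, TRUE),
labels = c("False", "True"))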
1.2 Analysis
Parametric test - Pearson chi-square test
Assumptions for Pearson chi-square test
1. Observations must be independent.
2. Check that all expected frequencies are greater than 5.
3. In larger contingency tables (at least one categorical variable
has more than two categories), up to 20% of the expected frequencies can
be between 1 and 5.
Hypothesis for chisq.test
H0: There is no association between visitor type and making a purchase.
H1: There is an association between visitor type and making a purchase.
Since p < 0.001, H0 is rejected.
#Yates' continuity correction is used since the contingency table is 2x2
chi_square <- chisq.test(mydata$VisitorTypeF, mydata$RevenueF,
correct = TRUE) #correct = TRUE for 2x2 contingency table
chi_square
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$VisitorTypeF and mydata$RevenueF
## X-squared = 133.84, df = 1, p-value < 2.2e-16
addmargins(chi_square$observed) #observed
## mydata$RevenueF
## mydata$VisitorTypeF False True Sum
## Returning_Visitor 9081 1470 10551
## New_Visitor 1272 422 1694
## Sum 10353 1892 12245
addmargins(round(chi_square$expected, 2)) #expected
## mydata$RevenueF
## mydata$VisitorTypeF False True Sum
## Returning_Visitor 8920.74 1630.26 10551
## New_Visitor 1432.26 261.74 1694
## Sum 10353.00 1892.00 12245
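Assumption 2 can be checked directly from the table of expected
frequencies:
#all four expected frequencies are well above 5, so assumption 2 holds
all(chi_square$expected > 5)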
For the first cell, the standardized residual is
(9081 - 8920.74)/sqrt(8920.74) = 1.70, which is not statistically
significant at alpha = 0.05 (|1.70| < 1.96).
round(chi_square$residuals, 2) #standardized (Pearson) residuals
## mydata$RevenueF
## mydata$VisitorTypeF False True
## Returning_Visitor 1.70 -3.97
## New_Visitor -4.23 9.91
0.742 means that 74% of all sessions were returning visitors who did not
make a purchase.
addmargins(round(prop.table(chi_square$observed), 3))
## mydata$RevenueF
## mydata$VisitorTypeF False True Sum
## Returning_Visitor 0.742 0.120 0.862
## New_Visitor 0.104 0.034 0.138
## Sum 0.846 0.154 1.000
0.861 means that, out of all returning visitors, 86% did not make a
purchase.
addmargins(round(prop.table(chi_square$observed, 1), 3), 2)
## mydata$RevenueF
## mydata$VisitorTypeF False True Sum
## Returning_Visitor 0.861 0.139 1.000
## New_Visitor 0.751 0.249 1.000
0.877 means that, out of all visitors who did not make a purchase, about
88% were returning visitors.
addmargins(round(prop.table(chi_square$observed, 2), 3), 1)
## mydata$RevenueF
## mydata$VisitorTypeF False True
## Returning_Visitor 0.877 0.777
## New_Visitor 0.123 0.223
## Sum 1.000 1.000
#install.packages("effectsize")
library(effectsize)
## Warning: package 'effectsize' was built under R version 4.3.2
effectsize::cramers_v(mydata$VisitorTypeF, mydata$RevenueF)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.10 | [0.09, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.10)
## [1] "small"
## (Rules: funder2019)
oddsratio(mydata$VisitorTypeF, mydata$RevenueF)
## Odds ratio | 95% CI
## -------------------------
## 2.05 | [1.81, 2.32]
The odds of making a purchase for new visitors are 2.05 times the odds
of making a purchase for returning visitors.
interpret_oddsratio(2.05)
## [1] "small"
## (Rules: chen2010)
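The odds ratio can also be reproduced by hand from the observed 2x2
table:
#odds of purchase for new visitors divided by odds for returning visitors
(422 / 1272) / (1470 / 9081) #approximately 2.05, matching oddsratio()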
Non-parametric test - Fisher's exact probability test
Hypothesis for fisher.test
H0: The true odds ratio is equal to 1 (OR = 1).
H1: The true odds ratio is not equal to 1 (OR != 1).
Since p < 0.001, H0 is rejected.
fisher.test(mydata$VisitorTypeF, mydata$RevenueF)
##
## Fisher's Exact Test for Count Data
##
## data: mydata$VisitorTypeF and mydata$RevenueF
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.807397 2.320985
## sample estimates:
## odds ratio
## 2.049337
2.1 Data
For this research question several variables will not be used, since
they are considered not useful for the analysis:
"Administrative" & "Administrative Duration",
"Informational" & "Informational Duration", and
"Product Related" & "Product Related Duration".
These columns represent the number of pages visited by the user in each
of these categories, and this information is also captured by the
variable "Page Values".
Also, the type of OS or browser a user uses is not very meaningful,
since people buy online regardless of which OS or internet browser they
use. Month is removed as well, because some months are missing and we
cannot be sure whether the month makes a big difference. The variables
Weekend and Region will not be used either.
The descriptive statistics for VisitorTypeF & RevenueF were
explained under RQ 1.
The summary for PageValues shows that this variable ranges from a
minimum of 0 to a maximum of 361. Based on the mean, the average value
of the pages a user visited before completing a transaction is 5.69.
The variable SpecialDay ranges from a minimum of 0 to a maximum of 1. A
value close to 0 means that the visit did not take place close to a
special day, while a value close to 1 means that the user visited the
page close to, or even on, a special day. The mean of 0.01 tells us
that, on average, user visits are not close to special days.
summary(mydata[colnames(mydata) %in% c("PageValues", "SpecialDay")])
## PageValues SpecialDay
## Min. : 0.000 Min. :0.00000
## 1st Qu.: 0.000 1st Qu.:0.00000
## Median : 0.000 Median :0.00000
## Mean : 5.694 Mean :0.01258
## 3rd Qu.: 0.000 3rd Qu.:0.00000
## Max. :361.000 Max. :1.00000
2.2 Analysis
fit0
#only the regression constant is included, because we want to know the starting point without explanatory variables
fit0 <- glm(RevenueF ~ 1,
family = binomial, #binary logistic regression
data = mydata)
summary(fit0)
##
## Call:
## glm(formula = RevenueF ~ 1, family = binomial, data = mydata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.700 0.025 -67.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10542 on 12244 degrees of freedom
## Residual deviance: 10542 on 12244 degrees of freedom
## AIC: 10544
##
## Number of Fisher Scoring iterations: 3
The probability that a visitor will make a purchase on the site is
0.183 times the probability that the visitor will not make a purchase
on the site.
exp(cbind(odds = fit0$coefficients, confint.default(fit0))) #odds for Revenue=True
## odds 2.5 % 97.5 %
## (Intercept) 0.182749 0.1740097 0.1919271
head(fitted(fit0)) #estimated probability for Revenue=True
## 1 2 3 4 5 6
## 0.154512 0.154512 0.154512 0.154512 0.154512 0.154512
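The estimated probability of 0.1545 can be recovered from the baseline
odds, since p = odds / (1 + odds):
#converting the baseline odds into a probability
0.182749 / (1 + 0.182749) #approximately 0.1545, matching fitted(fit0)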
Pseudo_R2_fit0 <- 10353/12245 #prop. of correctly classified units for the null model (all units are classified as False)
Pseudo_R2_fit0
## [1] 0.845488
fit1
H0: beta1 = 0.
H1: beta1 != 0.
PageValues is significant at p < 0.001.
If PageValues increases by one unit, the odds that the visitor will
make a purchase on the site become e^0.089242 = 1.093 times the initial
odds (p < 0.001). So, as PageValues increases, the probability that a
visitor makes a purchase increases.
fit1 <- glm(RevenueF ~ PageValues, #dependent and explanatory variables
family = binomial, #binary logistic regression
data = mydata)
summary(fit1)
##
## Call:
## glm(formula = RevenueF ~ PageValues, family = binomial, data = mydata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.399622 0.034278 -70.01 <2e-16 ***
## PageValues 0.089242 0.002373 37.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10541.9 on 12244 degrees of freedom
## Residual deviance: 7888.9 on 12243 degrees of freedom
## AIC: 7892.9
##
## Number of Fisher Scoring iterations: 5
#simpler model: deviance 10541.9
#more complex model: deviance 7888.9
#the more complex model is better, and the improvement is significant
anova(fit0, fit1, test = "Chi") #comparison of models based on -2LL statistics
## Analysis of Deviance Table
##
## Model 1: RevenueF ~ 1
## Model 2: RevenueF ~ PageValues
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 12244 10541.9
## 2 12243 7888.9 1 2653 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
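As a side note, another common summary of this improvement is
McFadden's pseudo-R2, computed from the two deviances (this is a
different quantity from the classification-based pseudo-R2 used
elsewhere in this analysis):
#McFadden pseudo-R2 for fit1: 1 - residual deviance / null deviance
1 - 7888.9 / 10541.9 #approximately 0.25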
exp(cbind(OR = fit1$coefficients, confint.default(fit1))) #odds ratio for Revenue=1 (with 95% CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.09075223 0.08485555 0.09705869
## PageValues 1.09334488 1.08827082 1.09844258
fit2
#install.packages("car")
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
fit2 <- glm(RevenueF ~ PageValues + SpecialDay + VisitorTypeF, #dependent and explanatory variables
family = binomial, #binary logistic regression
data = mydata)
Assumptions for logistic regression
1. The dependent variable is dichotomous. - Yes, Revenue has the values
False & True.
2. The explanatory variables are numerical or in the form of dummy
variables. - Yes, PageValues and SpecialDay are numerical &
VisitorTypeF is a dummy variable.
3. No outliers and/or units with high impact.
mydata$StdResid <- round(rstandard(fit2), 3) #standardized residuals | remove above 3 and below -3
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")

head(mydata[order(-mydata$StdResid), c("ID", "StdResid")])
## ID StdResid
## 2670 2670 2.337
## 4902 4902 2.266
## 1189 1189 2.255
## 5491 5491 2.255
## 5515 5515 2.255
## 5555 5555 2.255
head(mydata[order(mydata$StdResid), c("ID", "StdResid")], n=30) #showing the first thirty rows
## ID StdResid
## 5774 5774 -6.288
## 8346 8346 -5.802
## 10641 10641 -5.139
## 6430 6430 -5.038
## 7939 7939 -4.417
## 12248 12248 -4.315
## 5878 5878 -4.233
## 10026 10026 -4.040
## 3201 3201 -4.018
## 7008 7008 -3.861
## 8895 8895 -3.838
## 3272 3272 -3.526
## 9055 9055 -3.239
## 7037 7037 -3.212
## 9250 9250 -3.212
## 10106 10106 -3.212
## 1790 1790 -3.185
## 11599 11599 -3.129
## 5543 5543 -3.105
## 8417 8417 -3.101
## 11520 11520 -3.072
## 3629 3629 -3.044
## 4195 4195 -3.015
## 9960 9960 -3.015
## 10436 10436 -2.956
## 8097 8097 -2.926
## 8706 8706 -2.926
## 2957 2957 -2.897
## 10068 10068 -2.805
## 7728 7728 -2.774
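The +-3 cut-off can also be checked programmatically instead of reading
the sorted listing:
#counting units whose standardized residuals fall outside the +-3 cut-off
sum(abs(mydata$StdResid) > 3)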
mydata$CooksD <- round(cooks.distance(fit2), 3) #cooks distances | remove above 1 or gaps
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")

head(mydata[order(-mydata$CooksD), c("ID", "CooksD")])
## ID CooksD
## 5774 5774 0.083
## 8346 8346 0.064
## 10641 10641 0.042
## 6430 6430 0.037
## 2670 2670 0.026
## 4902 4902 0.026
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
dplyr::filter(!(ID %in% c(5689, 8249, 10496, 6344, 5688, 7850, 12043, 5792, 9896, 3148, 6921, 8787, 3218, 8944, 6950, 9139, 9974, 1749, 11424, 5458, 8319, 11347, 3572, 4132, 9833)))
nrow(mydata)
## [1] 12220
4. Sufficiently large sample. - Yes, the dataset has more than 400
observations.
There is not enough evidence for an effect of SpecialDay, so its
coefficient cannot be interpreted, since it is not significant.
fit1 <- glm(RevenueF ~ PageValues,
family = binomial,
data = mydata)
fit2 <- glm(RevenueF ~ PageValues + SpecialDay + VisitorTypeF,
family = binomial,
data = mydata)
anova(fit1, fit2, test = "Chi") #comparison of models based on -2LL statistics
## Analysis of Deviance Table
##
## Model 1: RevenueF ~ PageValues
## Model 2: RevenueF ~ PageValues + SpecialDay + VisitorTypeF
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 12218 7875.0
## 2 12216 7841.8 2 33.263 5.986e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit2)
##
## Call:
## glm(formula = RevenueF ~ PageValues + SpecialDay + VisitorTypeF,
## family = binomial, data = mydata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.457519 0.036979 -66.457 < 2e-16 ***
## PageValues 0.088391 0.002387 37.026 < 2e-16 ***
## SpecialDay -0.537325 0.344977 -1.558 0.119
## VisitorTypeFNew_Visitor 0.450792 0.079997 5.635 1.75e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10516.5 on 12219 degrees of freedom
## Residual deviance: 7841.8 on 12216 degrees of freedom
## AIC: 7849.8
##
## Number of Fisher Scoring iterations: 5
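The car package loaded above is presumably intended for a
multicollinearity check; this step does not appear in the output, but
it could be done as follows:
#checking multicollinearity among the explanatory variables of fit2;
#values close to 1 indicate no multicollinearity problem
car::vif(fit2)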
If PageValues increases by one unit, the odds that the visitor will
make a purchase on the site become 1.092 times the initial odds
(p < 0.001), under the assumption that the remaining explanatory
variables do not change. So the probability of making a purchase on the
site increases.
Given the values of the other explanatory variables, the odds of making
a purchase for a visitor in the new-visitor group are 1.569 times the
odds for the returning-visitor group. New visitors are more likely to
purchase a product on the site.
exp(cbind(OR = fit2$coefficients, confint.default(fit2))) #odds ratio for Revenue=True (with 95% CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.08564717 0.07965924 0.09208521
## PageValues 1.09241505 1.08731569 1.09753833
## SpecialDay 0.58430947 0.29716507 1.14891551
## VisitorTypeFNew_Visitor 1.56955542 1.34178511 1.83599012
mydata$EstProb <- fitted(fit2) #estimates of probabilities for Revenue=True: P(Y=1)
head(mydata)
## ID Administrative Administrative_Duration Informational
## 1 1 0 0 0
## 2 2 0 0 0
## 3 3 0 0 0
## 4 4 0 0 0
## 5 5 0 0 0
## 6 6 0 0 0
## Informational_Duration ProductRelated ProductRelated_Duration BounceRates
## 1 0 1 0 0
## 2 0 2 64 0
## 3 0 1 0 0
## 4 0 2 2.666666667 0
## 5 0 10 627.5 0
## 6 0 19 154.2166667 0
## ExitRates PageValues SpecialDay Month OperatingSystems Browser Region
## 1 0 0 0 Feb 1 1 1
## 2 0 0 0 Feb 2 2 1
## 3 0 0 0 Feb 4 1 9
## 4 0 0 0 Feb 3 2 2
## 5 0 0 0 Feb 3 3 1
## 6 0 0 0 Feb 2 2 1
## TrafficType VisitorType Weekend Revenue VisitorTypeF RevenueF
## 1 1 Returning_Visitor FALSE FALSE Returning_Visitor False
## 2 2 Returning_Visitor FALSE FALSE Returning_Visitor False
## 3 3 Returning_Visitor FALSE FALSE Returning_Visitor False
## 4 4 Returning_Visitor FALSE FALSE Returning_Visitor False
## 5 4 Returning_Visitor TRUE FALSE Returning_Visitor False
## 6 3 Returning_Visitor FALSE FALSE Returning_Visitor False
## StdResid CooksD EstProb
## 1 -0.405 0 0.07889043
## 2 -0.405 0 0.07889043
## 3 -0.405 0 0.07889043
## 4 -0.405 0 0.07889043
## 5 -0.405 0 0.07889043
## 6 -0.405 0 0.07889043
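The estimated probability of 0.0789 shown for the first six rows can be
reproduced from the fit2 coefficients: these sessions are returning
visitors with PageValues = 0 and SpecialDay = 0, so only the intercept
contributes to the linear predictor.
#inverse logit of the intercept: p = 1 / (1 + exp(-eta))
eta <- -2.457519 #only the intercept is non-zero for these rows
1 / (1 + exp(-eta)) #approximately 0.0789, matching EstProb above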
mydata$Classification <- ifelse(test = mydata$EstProb < 0.50,
yes = "False",
no = "True")
#if estimated probability is below 0.50, visitor is classified into the group of people that have not made a purchase (Revenue=False), otherwise True
mydata$ClassificationF <- factor(mydata$Classification,
levels = c("False", "True"),
labels = c("False", "True"))
#classification table based on 2 categorical variables
ClassificationTable <- table(mydata$RevenueF, mydata$ClassificationF)
ClassificationTable
##
## False True
## False 10129 204
## True 1206 681
The Pseudo_R2_fit2 is 0.8846154, which means that approximately 88% of
the units are correctly classified by the logistic regression model,
i.e. for 88% of the sessions the predicted class matches the observed
value of the dependent variable RevenueF (whether a purchase is made or
not).
#prop. of correctly classified units
Pseudo_R2_fit2 <- (ClassificationTable[1, 1] + ClassificationTable[2, 2] )/ nrow(mydata)
Pseudo_R2_fit2
## [1] 0.8846154
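The same classification table can also be summarized by sensitivity and
specificity, which give a more nuanced picture than the overall
proportion of correctly classified units:
#sensitivity: proportion of actual purchasers classified as True
681 / (681 + 1206) #approximately 0.36
#specificity: proportion of non-purchasers classified as False
10129 / (10129 + 204) #approximately 0.98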