1.1 Data
The dataset used for the first research question of this homework comes
from the UCI Machine Learning Repository and is called the Online
Shoppers Purchasing Intention Dataset (link to the dataset).
The dataset describes an online bookstore that runs on an e-commerce
platform and records the behavior of its online shoppers. It covers
12330 sessions collected over one year, which can be used to predict a
visitor's likelihood of making a purchase (Revenue=True). This
information is valuable for sellers who want to understand what attracts
their customers and makes them buy their products.
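As a minimal sketch of how the data could be read in (the file name
online_shoppers_intention.csv is an assumption; adjust the path to
wherever the CSV from the UCI repository was saved):
#loading the CSV downloaded from the UCI repository (assumed file name)
mydata <- read.csv("online_shoppers_intention.csv", stringsAsFactors = FALSE)
str(mydata) #inspecting the structure of the data frame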
#this code creates a new column called ID and adds it as the first column, since the dataset does not provide one; it will help with the later analysis
#cbind() is used to bind columns together, i.e. to combine vectors, matrices, or data frames by columns
mydata <- cbind(ID = 1:nrow(mydata), mydata)
head(mydata) #showing the first six rows
## ID Administrative Administrative_Duration Informational
## 1 1 0 0 0
## 2 2 0 0 0
## 3 3 0 0 0
## 4 4 0 0 0
## 5 5 0 0 0
## 6 6 0 0 0
## Informational_Duration ProductRelated ProductRelated_Duration BounceRates
## 1 0 1 0 0.2
## 2 0 2 64 0
## 3 0 1 0 0.2
## 4 0 2 2.666666667 0.05
## 5 0 10 627.5 0.02
## 6 0 19 154.2166667 0.015789474
## ExitRates PageValues SpecialDay Month OperatingSystems Browser Region
## 1 0.2 0 0 Feb 1 1 1
## 2 0.1 0 0 Feb 2 2 1
## 3 0.2 0 0 Feb 4 1 9
## 4 0.14 0 0 Feb 3 2 2
## 5 0.05 0 0 Feb 3 3 1
## 6 0.024561404 0 0 Feb 2 2 1
## TrafficType VisitorType Weekend Revenue
## 1 1 Returning_Visitor FALSE FALSE
## 2 2 Returning_Visitor FALSE FALSE
## 3 3 Returning_Visitor FALSE FALSE
## 4 4 Returning_Visitor FALSE FALSE
## 5 4 Returning_Visitor TRUE FALSE
## 6 3 Returning_Visitor FALSE FALSE
tail(mydata) #showing the last six rows; we can see that the dataset has 12330 observations
## ID Administrative Administrative_Duration Informational
## 12325 12325 0 0 1
## 12326 12326 3 145 0
## 12327 12327 0 0 0
## 12328 12328 0 0 0
## 12329 12329 4 75 0
## 12330 12330 0 0 0
## Informational_Duration ProductRelated ProductRelated_Duration BounceRates
## 12325 0 16 503 0
## 12326 0 53 1783.791667 0.007142857
## 12327 0 5 465.75 0
## 12328 0 6 184.25 0.083333333
## 12329 0 15 346 0
## 12330 0 3 21.25 0
## ExitRates PageValues SpecialDay Month OperatingSystems Browser Region
## 12325 0.037647059 0 0 Nov 2 2 1
## 12326 0.029030612 12.24171745 0 Dec 4 6 1
## 12327 0.021333333 0 0 Nov 3 2 1
## 12328 0.086666667 0 0 Nov 3 2 1
## 12329 0.021052632 0 0 Nov 2 2 3
## 12330 0.066666667 0 0 Nov 3 2 1
## TrafficType VisitorType Weekend Revenue
## 12325 1 Returning_Visitor FALSE FALSE
## 12326 1 Returning_Visitor TRUE FALSE
## 12327 8 Returning_Visitor TRUE FALSE
## 12328 13 Returning_Visitor TRUE FALSE
## 12329 11 Returning_Visitor FALSE FALSE
## 12330 2 New_Visitor TRUE FALSE
#showing the number of columns; the dataset has 19 columns,
#one of which is the dependent variable Revenue
ncol(mydata)
## [1] 19
nrow(mydata) #showing the number of rows
## [1] 12330
#creating a vector named categorical_columns that contains the names of the 12th through 19th columns of mydata
categorical_columns <- names(mydata)[12:19]
for (col in categorical_columns) #for each column in the vector named categorical_columns do this
{
#for each column, find the unique values and print them as a single comma-separated string
print(toString(unique(mydata[[col]])))
}
## [1] "Feb, Mar, May, Oct, June, Jul, Aug, Nov, Sep, Dec"
## [1] "1, 2, 4, 3, 7, 6, 8, 5"
## [1] "1, 2, 3, 4, 5, 6, 7, 10, 8, 9, 12, 13, 11"
## [1] "1, 9, 2, 3, 4, 5, 6, 7, 8"
## [1] "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18, 19, 16, 17, 20"
## [1] "Returning_Visitor, New_Visitor, Other"
## [1] "FALSE, TRUE"
## [1] "FALSE, TRUE"
#is.na() finds missing values (NA) in the data frame mydata; it returns TRUE where an element is NA and FALSE otherwise
#sum() counts the number of missing values (NA); the result is 0, so there are no missing values
sum(is.na(mydata))
## [1] 0
#creating a vector named char_columns that contains the names of the 8th through 11th columns of mydata
char_columns <- names(mydata)[8:11]
for (col in char_columns) #for each column in the vector named char_columns do the following
{
#convert each of these columns from character to integer
mydata[[col]] <- as.integer(mydata[[col]])
}
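The analysis in the next section uses the factor variables VisitorTypeF
and RevenueF, which are not created in the code shown above. A plausible
sketch of the missing preparation step is given below; the exclusion of
the "Other" visitor type is an assumption, suggested by the margin total
of 12245 (rather than 12330) in the contingency tables of the next
section.
#keeping only returning and new visitors (assumption: "Other" is dropped)
mydata <- mydata[mydata$VisitorType != "Other", ]
#recoding the two variables of interest as factors
mydata$VisitorTypeF <- factor(mydata$VisitorType,
levels = c("Returning_Visitor", "New_Visitor"))
mydata$RevenueF <- factor(mydata$Revenue,
levels = c(FALSE, TRUE),
labels = c("False", "True"))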
1.2 Analysis
Parametric test - Pearson chi-square test
Assumptions for Pearson chi-square test
1. Observations must be independent.
2. Check that all expected frequencies are greater than 5.
3. In larger contingency tables (at least one categorical variable
has more than two categories), up to 20% of the expected frequencies can
be between 1 and 5.
Hypothesis for chisq.test
H0: There is no association between visitor type and making a purchase.
H1: There is an association between visitor type and making a purchase.
Since p < 0.001, H0 is rejected.
#Yates' continuity correction is used since the contingency table is 2x2
chi_square <- chisq.test(mydata$VisitorTypeF, mydata$RevenueF,
correct = TRUE) #correct = TRUE for 2x2 contingency table
chi_square
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mydata$VisitorTypeF and mydata$RevenueF
## X-squared = 133.84, df = 1, p-value < 2.2e-16
addmargins(chi_square$observed) #observed
## mydata$RevenueF
## mydata$VisitorTypeF False True Sum
## Returning_Visitor 9081 1470 10551
## New_Visitor 1272 422 1694
## Sum 10353 1892 12245
addmargins(round(chi_square$expected, 2)) #expected
## mydata$RevenueF
## mydata$VisitorTypeF False True Sum
## Returning_Visitor 8920.74 1630.26 10551
## New_Visitor 1432.26 261.74 1694
## Sum 10353.00 1892.00 12245
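Assumption 2 can be checked directly from the table of expected
frequencies:
#all four expected frequencies are well above 5, so assumption 2 holds
all(chi_square$expected > 5)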
For the first cell, the standardized residual is
(9081 - 8920.74)/sqrt(8920.74) = 1.70, which is not statistically
significant at alpha = 0.05 (|1.70| < 1.96).
round(chi_square$residuals, 2) #standardized (Pearson) residuals
## mydata$RevenueF
## mydata$VisitorTypeF False True
## Returning_Visitor 1.70 -3.97
## New_Visitor -4.23 9.91
0.742 means that 74% of all sessions were returning visitors who did not
make a purchase.
addmargins(round(prop.table(chi_square$observed), 3))
## mydata$RevenueF
## mydata$VisitorTypeF False True Sum
## Returning_Visitor 0.742 0.120 0.862
## New_Visitor 0.104 0.034 0.138
## Sum 0.846 0.154 1.000
0.861 means that, out of all returning visitors, 86% did not make a
purchase.
addmargins(round(prop.table(chi_square$observed, 1), 3), 2)
## mydata$RevenueF
## mydata$VisitorTypeF False True Sum
## Returning_Visitor 0.861 0.139 1.000
## New_Visitor 0.751 0.249 1.000
0.877 means that, out of all visitors who did not make a purchase, about
88% were returning visitors.
addmargins(round(prop.table(chi_square$observed, 2), 3), 1)
## mydata$RevenueF
## mydata$VisitorTypeF False True
## Returning_Visitor 0.877 0.777
## New_Visitor 0.123 0.223
## Sum 1.000 1.000
#install.packages("effectsize")
library(effectsize)
## Warning: package 'effectsize' was built under R version 4.3.2
effectsize::cramers_v(mydata$VisitorTypeF, mydata$RevenueF)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.10 | [0.09, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.10)
## [1] "small"
## (Rules: funder2019)
oddsratio(mydata$VisitorTypeF, mydata$RevenueF)
## Odds ratio | 95% CI
## -------------------------
## 2.05 | [1.81, 2.32]
The odds of making a purchase for new visitors are 2.05 times the odds
of making a purchase for returning visitors.
interpret_oddsratio(2.05)
## [1] "small"
## (Rules: chen2010)
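The odds ratio can also be reproduced by hand from the observed 2x2
table:
#odds of purchase for new visitors divided by odds for returning visitors
(422 / 1272) / (1470 / 9081) #approximately 2.05, matching oddsratio()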
Non-parametric test - Fisher's exact probability test
Hypothesis for fisher.test
H0: The true odds ratio is equal to 1 (OR = 1).
H1: The true odds ratio is not equal to 1 (OR != 1).
Since p < 0.001, H0 is rejected.
fisher.test(mydata$VisitorTypeF, mydata$RevenueF)
##
## Fisher's Exact Test for Count Data
##
## data: mydata$VisitorTypeF and mydata$RevenueF
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.807397 2.320985
## sample estimates:
## odds ratio
## 2.049337
2.1 Data
For this research question several variables will not be used, since
they are considered not useful for the analysis:
"Administrative" & "Administrative Duration",
"Informational" & "Informational Duration", and
"Product Related" & "Product Related Duration".
These columns represent the number of pages visited by the user in each
of these categories, and this information is also captured by the
variable "Page Values".
Also, the type of OS or browser a user uses is not very meaningful,
since people buy online regardless of which OS or internet browser they
use. Month is removed as well, because some months are missing and we
cannot be sure whether the month makes a big difference. The variables
Weekend and Region will not be used either.
The descriptive statistics for VisitorTypeF & RevenueF were
explained under RQ 1.
The summary for PageValues shows that this variable ranges from a
minimum of 0 to a maximum of 361. Based on the mean, the average value
of the pages a user visited before completing a transaction is 5.69.
The variable SpecialDay ranges from a minimum of 0 to a maximum of 1. A
value close to 0 means that the visit did not take place close to a
special day, while a value close to 1 means that the user visited the
page close to, or even on, a special day. The mean of 0.01 tells us
that, on average, user visits are not close to special days.
summary(mydata[colnames(mydata) %in% c("PageValues", "SpecialDay")])
## PageValues SpecialDay
## Min. : 0.000 Min. :0.00000
## 1st Qu.: 0.000 1st Qu.:0.00000
## Median : 0.000 Median :0.00000
## Mean : 5.694 Mean :0.01258
## 3rd Qu.: 0.000 3rd Qu.:0.00000
## Max. :361.000 Max. :1.00000
2.2 Analysis
fit0
#only the regression constant is included, because we want to know the starting point without explanatory variables
fit0 <- glm(RevenueF ~ 1,
family = binomial, #binary logistic regression
data = mydata)
summary(fit0)
##
## Call:
## glm(formula = RevenueF ~ 1, family = binomial, data = mydata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.700 0.025 -67.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10542 on 12244 degrees of freedom
## Residual deviance: 10542 on 12244 degrees of freedom
## AIC: 10544
##
## Number of Fisher Scoring iterations: 3
The probability that a visitor will make a purchase on the site is
0.183 times the probability that the visitor will not make a purchase
on the site.
exp(cbind(odds = fit0$coefficients, confint.default(fit0))) #odds for Revenue=True
## odds 2.5 % 97.5 %
## (Intercept) 0.182749 0.1740097 0.1919271
head(fitted(fit0)) #estimated probability for Revenue=True
## 1 2 3 4 5 6
## 0.154512 0.154512 0.154512 0.154512 0.154512 0.154512
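The estimated probability of 0.1545 can be recovered from the baseline
odds, since p = odds / (1 + odds):
#converting the baseline odds into a probability
0.182749 / (1 + 0.182749) #approximately 0.1545, matching fitted(fit0)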
Pseudo_R2_fit0 <- 10353/12245 #prop. of correctly classified units for the null model (all units are classified as False)
Pseudo_R2_fit0
## [1] 0.845488
fit1
H0: beta1 = 0.
H1: beta1 != 0.
PageValues is significant at p < 0.001.
If PageValues increases by one unit, the odds that the visitor will
make a purchase on the site become e^0.089242 = 1.093 times the initial
odds (p < 0.001). So, as PageValues increases, the probability that a
visitor makes a purchase increases.
fit1 <- glm(RevenueF ~ PageValues, #dependent and explanatory variables
family = binomial, #binary logistic regression
data = mydata)
summary(fit1)
##
## Call:
## glm(formula = RevenueF ~ PageValues, family = binomial, data = mydata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.399622 0.034278 -70.01 <2e-16 ***
## PageValues 0.089242 0.002373 37.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10541.9 on 12244 degrees of freedom
## Residual deviance: 7888.9 on 12243 degrees of freedom
## AIC: 7892.9
##
## Number of Fisher Scoring iterations: 5
#simpler model: deviance 10541.9
#more complex model: deviance 7888.9
#the more complex model is better, and the improvement is significant
anova(fit0, fit1, test = "Chi") #comparison of models based on -2LL statistics
## Analysis of Deviance Table
##
## Model 1: RevenueF ~ 1
## Model 2: RevenueF ~ PageValues
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 12244 10541.9
## 2 12243 7888.9 1 2653 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
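As a side note, another common summary of this improvement is
McFadden's pseudo-R2, computed from the two deviances (this is a
different quantity from the classification-based pseudo-R2 used
elsewhere in this analysis):
#McFadden pseudo-R2 for fit1: 1 - residual deviance / null deviance
1 - 7888.9 / 10541.9 #approximately 0.25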
exp(cbind(OR = fit1$coefficients, confint.default(fit1))) #odds ratio for Revenue=1 (with 95% CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.09075223 0.08485555 0.09705869
## PageValues 1.09334488 1.08827082 1.09844258
fit2
#install.packages("car")
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.2
fit2 <- glm(RevenueF ~ PageValues + SpecialDay + VisitorTypeF, #dependent and explanatory variables
family = binomial, #binary logistic regression
data = mydata)
Assumptions for logistic regression
1. The dependent variable is dichotomous. - Yes, Revenue has the values
False & True.
2. The explanatory variables are numerical or in the form of dummy
variables. - Yes, PageValues and SpecialDay are numerical &
VisitorTypeF is a dummy variable.
3. No outliers and/or units with high impact.
mydata$StdResid <- round(rstandard(fit2), 3) #standardized residuals | remove above 3 and below -3
hist(mydata$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of standardized residuals")

head(mydata[order(-mydata$StdResid), c("ID", "StdResid")])
## ID StdResid
## 2670 2670 2.337
## 4902 4902 2.266
## 1189 1189 2.255
## 5491 5491 2.255
## 5515 5515 2.255
## 5555 5555 2.255
head(mydata[order(mydata$StdResid), c("ID", "StdResid")], n=30) #showing the first thirty rows
## ID StdResid
## 5774 5774 -6.288
## 8346 8346 -5.802
## 10641 10641 -5.139
## 6430 6430 -5.038
## 7939 7939 -4.417
## 12248 12248 -4.315
## 5878 5878 -4.233
## 10026 10026 -4.040
## 3201 3201 -4.018
## 7008 7008 -3.861
## 8895 8895 -3.838
## 3272 3272 -3.526
## 9055 9055 -3.239
## 7037 7037 -3.212
## 9250 9250 -3.212
## 10106 10106 -3.212
## 1790 1790 -3.185
## 11599 11599 -3.129
## 5543 5543 -3.105
## 8417 8417 -3.101
## 11520 11520 -3.072
## 3629 3629 -3.044
## 4195 4195 -3.015
## 9960 9960 -3.015
## 10436 10436 -2.956
## 8097 8097 -2.926
## 8706 8706 -2.926
## 2957 2957 -2.897
## 10068 10068 -2.805
## 7728 7728 -2.774
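The +-3 cut-off can also be checked programmatically instead of reading
the sorted listing:
#counting units whose standardized residuals fall outside the +-3 cut-off
sum(abs(mydata$StdResid) > 3)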
mydata$CooksD <- round(cooks.distance(fit2), 3) #cooks distances | remove above 1 or gaps
hist(mydata$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")

head(mydata[order(-mydata$CooksD), c("ID", "CooksD")])
## ID CooksD
## 5774 5774 0.083
## 8346 8346 0.064
## 10641 10641 0.042
## 6430 6430 0.037
## 2670 2670 0.026
## 4902 4902 0.026
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
mydata <- mydata %>%
dplyr::filter(!(ID %in% c(5689, 8249, 10496, 6344, 5688, 7850, 12043, 5792, 9896, 3148, 6921, 8787, 3218, 8944, 6950, 9139, 9974, 1749, 11424, 5458, 8319, 11347, 3572, 4132, 9833)))
nrow(mydata)
## [1] 12220
4. Sufficiently large sample. - Yes, the dataset has more than 400
observations.
There is not enough evidence for an effect of SpecialDay, so its
coefficient cannot be interpreted, since it is not significant.
fit1 <- glm(RevenueF ~ PageValues,
family = binomial,
data = mydata)
fit2 <- glm(RevenueF ~ PageValues + SpecialDay + VisitorTypeF,
family = binomial,
data = mydata)
anova(fit1, fit2, test = "Chi") #comparison of models based on -2LL statistics
## Analysis of Deviance Table
##
## Model 1: RevenueF ~ PageValues
## Model 2: RevenueF ~ PageValues + SpecialDay + VisitorTypeF
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 12218 7875.0
## 2 12216 7841.8 2 33.263 5.986e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit2)
##
## Call:
## glm(formula = RevenueF ~ PageValues + SpecialDay + VisitorTypeF,
## family = binomial, data = mydata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.457519 0.036979 -66.457 < 2e-16 ***
## PageValues 0.088391 0.002387 37.026 < 2e-16 ***
## SpecialDay -0.537325 0.344977 -1.558 0.119
## VisitorTypeFNew_Visitor 0.450792 0.079997 5.635 1.75e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10516.5 on 12219 degrees of freedom
## Residual deviance: 7841.8 on 12216 degrees of freedom
## AIC: 7849.8
##
## Number of Fisher Scoring iterations: 5
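The car package loaded above is presumably intended for a
multicollinearity check; this step does not appear in the output, but
it could be done as follows:
#checking multicollinearity among the explanatory variables of fit2;
#values close to 1 indicate no multicollinearity problem
car::vif(fit2)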
If PageValues increases by one unit, the odds that the visitor will
make a purchase on the site become 1.092 times the initial odds
(p < 0.001), under the assumption that the remaining explanatory
variables do not change. So the probability of making a purchase on the
site increases.
Given the values of the other explanatory variables, the odds of making
a purchase for a visitor in the new-visitor group are 1.569 times the
odds for the returning-visitor group. New visitors are more likely to
purchase a product on the site.
exp(cbind(OR = fit2$coefficients, confint.default(fit2))) #odds ratio for Revenue=True (with 95% CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.08564717 0.07965924 0.09208521
## PageValues 1.09241505 1.08731569 1.09753833
## SpecialDay 0.58430947 0.29716507 1.14891551
## VisitorTypeFNew_Visitor 1.56955542 1.34178511 1.83599012
mydata$EstProb <- fitted(fit2) #estimates of probabilities for Revenue=True: P(Y=1)
head(mydata)
## ID Administrative Administrative_Duration Informational
## 1 1 0 0 0
## 2 2 0 0 0
## 3 3 0 0 0
## 4 4 0 0 0
## 5 5 0 0 0
## 6 6 0 0 0
## Informational_Duration ProductRelated ProductRelated_Duration BounceRates
## 1 0 1 0 0
## 2 0 2 64 0
## 3 0 1 0 0
## 4 0 2 2.666666667 0
## 5 0 10 627.5 0
## 6 0 19 154.2166667 0
## ExitRates PageValues SpecialDay Month OperatingSystems Browser Region
## 1 0 0 0 Feb 1 1 1
## 2 0 0 0 Feb 2 2 1
## 3 0 0 0 Feb 4 1 9
## 4 0 0 0 Feb 3 2 2
## 5 0 0 0 Feb 3 3 1
## 6 0 0 0 Feb 2 2 1
## TrafficType VisitorType Weekend Revenue VisitorTypeF RevenueF
## 1 1 Returning_Visitor FALSE FALSE Returning_Visitor False
## 2 2 Returning_Visitor FALSE FALSE Returning_Visitor False
## 3 3 Returning_Visitor FALSE FALSE Returning_Visitor False
## 4 4 Returning_Visitor FALSE FALSE Returning_Visitor False
## 5 4 Returning_Visitor TRUE FALSE Returning_Visitor False
## 6 3 Returning_Visitor FALSE FALSE Returning_Visitor False
## StdResid CooksD EstProb
## 1 -0.405 0 0.07889043
## 2 -0.405 0 0.07889043
## 3 -0.405 0 0.07889043
## 4 -0.405 0 0.07889043
## 5 -0.405 0 0.07889043
## 6 -0.405 0 0.07889043
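The estimated probability of 0.0789 shown for the first six rows can be
reproduced from the fit2 coefficients: these sessions are returning
visitors with PageValues = 0 and SpecialDay = 0, so only the intercept
contributes to the linear predictor.
#inverse logit of the intercept: p = 1 / (1 + exp(-eta))
eta <- -2.457519 #only the intercept is non-zero for these rows
1 / (1 + exp(-eta)) #approximately 0.0789, matching EstProb above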
mydata$Classification <- ifelse(test = mydata$EstProb < 0.50,
yes = "False",
no = "True")
#if estimated probability is below 0.50, visitor is classified into the group of people that have not made a purchase (Revenue=False), otherwise True
mydata$ClassificationF <- factor(mydata$Classification,
levels = c("False", "True"),
labels = c("False", "True"))
#classification table based on 2 categorical variables
ClassificationTable <- table(mydata$RevenueF, mydata$ClassificationF)
ClassificationTable
##
## False True
## False 10129 204
## True 1206 681
The Pseudo_R2_fit2 is 0.8846154, which means that approximately 88% of
the units are correctly classified by the logistic regression model,
i.e. for 88% of the sessions the predicted class matches the observed
value of the dependent variable RevenueF (whether a purchase is made or
not).
#prop. of correctly classified units
Pseudo_R2_fit2 <- (ClassificationTable[1, 1] + ClassificationTable[2, 2] )/ nrow(mydata)
Pseudo_R2_fit2
## [1] 0.8846154
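The same classification table can also be summarized by sensitivity and
specificity, which give a more nuanced picture than the overall
proportion of correctly classified units:
#sensitivity: proportion of actual purchasers classified as True
681 / (681 + 1206) #approximately 0.36
#specificity: proportion of non-purchasers classified as False
10129 / (10129 + 204) #approximately 0.98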