Abstract

This study investigates factors influencing online purchase decisions using the Online Shoppers Purchasing Intention Dataset, which contains 12,330 instances and 18 variables. The analysis centers on four attributes: Revenue, bounce rates, product page engagement, and visitor type (new or returning). The Revenue variable indicates whether a purchase is made. Findings reveal that, holding the other predictors constant, the odds that a returning visitor generates revenue are about 50% of the odds for a new visitor. A one-unit increase in BounceRates1_sqrt decreases the odds by 99.81%, while a one-unit increase in ProductRelated_Duration_trans increases the odds by 14.30%. These results offer insights for businesses to optimize website design, personalize marketing campaigns, and plan strategic promotions to drive revenue growth.

Introduction

Knowing which factors drive revenue generation helps business owners tailor website content and offers to what motivates each customer to buy. For instance, the results of this study can inform personalized marketing campaigns, product pages with more detailed descriptions, images, and reviews, weekend or seasonal promotions, and website interfaces designed to reduce bounce rates. My initial hypothesis is that purchases are more common on weekends, returning users contribute more revenue, and longer time on product pages increases purchase likelihood. Therefore, the research question I aim to answer is: which factors most influence consumers’ purchasing decisions, and how significant is each predictor?

Methods

A. Data

The dataset is imported from the [Online Shoppers Purchasing Intention Dataset](https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset). It contains 12,330 observations, each representing a unique online user session within a one-year period. Each session belongs to a different user, which limits the influence of any specific campaign, special day, or user profile and reduces bias from recurring user behavior.

B. Variables

Response Variable: Revenue is a binary categorical response variable with two levels (“TRUE” - coded as “1”, “FALSE” - coded as “0”). It indicates whether or not a purchase is made.

Predictor Variables:

1. VisitorType is a categorical variable with 3 levels: Returning_Visitor, New_Visitor, and Other. It captures how different types of visitors affect revenue generation.
2. Weekend is a categorical variable with two levels (“TRUE” - coded as “1”, “FALSE” - coded as “0”). It shows whether the user session happened on a weekend or a workday.
3. BounceRates is a quantitative variable that shows the percentage of visitors who land on the website and leave without clicking on anything else or visiting other webpages.
4. ProductRelated_Duration is a quantitative variable that represents the total time, measured in seconds, spent on product-related pages.

C. Statistical Methods

This study models the odds of an online retail purchase using two models: a simple logistic regression model assessing the impact of Weekend shopping on the likelihood of generating revenue, and a multiple logistic regression model evaluating additional predictors, including BounceRates, ProductRelated_Duration, and VisitorType. Model selection relies on Akaike Information Criterion (AIC) values, explained deviance, concordance, and likelihood ratio testing (LRT). An interaction term was considered, but it did not significantly improve the model. The analysis also checks for linearity between the predictors and the log-odds, and addresses departures from linearity with transformations.

Exploratory Data Analysis (EDA)

Univariate EDA: The revenue distribution is heavily imbalanced between successful and unsuccessful sessions, as 84.5% of customer sessions do not generate any revenue.

Multi-variable EDA

Bivariate EDA:

  • BounceRates vs. Revenue

The boxplot shows the distribution of Bounce Rates grouped by the variable Revenue. When revenue is generated, bounce rates are typically lower, with values concentrated near zero. In contrast, when no revenue is generated, bounce rates have a wider spread and more outliers. Summary statistics show the mean Bounce Rate is 0.51% when revenue is generated and 2.53% when not, with medians of 0% and 0.04%, respectively. Given the skewed distribution, a square root transformation is appropriate to see the difference between groups clearly.

  • ProductRelated_Duration vs. Revenue

When revenue is generated, ProductRelated_Duration is higher, with a mean of 1876.2 seconds, compared to a mean of 1059.9 seconds for sessions where revenue is not generated. The median is also higher for sessions where a purchase was made. The difference is clearer after a cube root transformation.

  • VisitorType vs. Revenue

Based on the bar graph, returning visitors form the largest group with 10,551 observations, new visitors are the second largest with 1,694 observations, and the remaining 85 observations belong to other users. To draw a cleaner distinction between returning and new visitors, I drop the Other level, which removes fewer than 1% of the observations and leaves 12,245. The mosaic plot of revenue by visitor type shows that returning visitors make up 86% of the remaining observations, while new visitors account for 14%. 24.9% of new visitors make a purchase, compared to 13.9% of returning visitors. However, sessions by returning visitors that end in a purchase make up 12% of all sessions, compared to 3.4% for new visitors. Therefore, while a new visitor is more likely to purchase, returning visitors account for the majority of revenue-generating sessions because there are far more of them.
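The arithmetic behind the mosaic plot can be retraced with a short R sketch. It uses only the group counts and purchase rates quoted above; the variable names are illustrative and not taken from the original analysis.

```r
# Group sizes and within-group purchase rates quoted in the text
n_sessions    <- c(Returning = 10551, New = 1694)
purchase_rate <- c(Returning = 0.139, New = 0.249)

# Share of each visitor type among the retained sessions
group_share <- n_sessions / sum(n_sessions)

# Share of ALL sessions that are purchases by each visitor type
joint_share <- group_share * purchase_rate

# Share of purchases attributable to each visitor type
purchase_split <- joint_share / sum(joint_share)

round(100 * joint_share, 1)
round(100 * purchase_split, 1)
```

Because the inputs are rounded, the last line is only an approximation, but it suggests that roughly three quarters of revenue-generating sessions come from returning visitors.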

  • Weekend vs. Revenue

Based on the distribution of revenue by day type, sales are slightly more likely on weekends than on workdays: 17.4% of weekend sessions generate revenue, compared to 14.9% of workday sessions.

Model 1: Simple logistic regression

I start by asking whether a weekend session is more likely to lead to a sale than a workday session, using a simple logistic regression. The linearity condition is met because the predictor is binary. The independence condition is met under the assumption that each shopping session in the dataset is independent of the others. For randomness, the response variable, Revenue, is binary and the sample size is fixed at 12,245 observations. The constant-probability condition raises slight concerns: even if a predictor like Weekend is significant, it may not account for all the variability in Revenue. Overall, the conditions are sufficiently satisfied to proceed with fitting the logistic regression model.

The null hypothesis is that Weekend is not a significant predictor of Revenue, and the alternative hypothesis is that Weekend is a significant predictor. The p-value for Weekend is 0.00096, which is below the significance level of 0.05. I reject the null hypothesis and conclude that Weekend is a significant predictor of revenue generation in online retail.

Reporting the model:

\[\pi = \text{probability of generating revenue}\]

\[\log\left(\frac{\widehat{\pi}}{1-\widehat{\pi}}\right) = -1.7460 + 0.1889\,(\text{WeekendTRUE})\]

Based on the regression output, the intercept of -1.7460 is the log-odds of revenue generation on a workday. The corresponding odds are 0.17447, so the probability of revenue generation on a workday is 14.9%. The odds ratio when moving from a workday to a weekend is exp(0.1889) = 1.2079, so the odds of revenue generation are 20.8% higher on weekends than on workdays. The log-odds on a weekend are -1.7460 + 0.1889 = -1.5571, which gives odds of 0.2107 and a probability of approximately 17.4%, in line with the proportions seen in the EDA.
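As a check on these conversions, the following R sketch recomputes the odds, odds ratio, and probabilities from the two coefficients in the fitted equation. The helper `logit_to_prob` is a name introduced here for illustration only.

```r
# Coefficients of the simple logistic regression, as reported in the equation
b0 <- -1.7460   # intercept: log-odds of revenue on a workday
b1 <-  0.1889   # WeekendTRUE: change in log-odds on a weekend

# Hypothetical helper: convert log-odds to a probability
logit_to_prob <- function(log_odds) exp(log_odds) / (1 + exp(log_odds))

odds_workday <- exp(b0)                 # odds of revenue on a workday
odds_ratio   <- exp(b1)                 # weekend vs. workday odds ratio
p_workday    <- logit_to_prob(b0)       # probability on a workday
p_weekend    <- logit_to_prob(b0 + b1)  # probability on a weekend

round(c(odds_workday = odds_workday, odds_ratio = odds_ratio,
        p_workday = p_workday, p_weekend = p_weekend), 4)
```

The weekend probability is computed from the sum of both coefficients, because the intercept has to be included before log-odds are converted to a probability.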

Model 2: Multiple Logistic Regression

Check Conditions

I now use multiple logistic regression to predict the likelihood of generating revenue with additional predictors: two categorical variables (VisitorType, Weekend) and two continuous variables (BounceRates, ProductRelated_Duration). I assume the independence condition is met, since all shopping sessions are independent of each other, as well as the randomness condition. The constant-probability condition raises slight concerns because the predictors may not be the only factors influencing revenue generation, so the model might not capture all relationships.

The linearity condition is met for the categorical predictors by design. Continuous predictors are checked with empirical logit plots. BounceRates has many repeated values, so the data fall into a small number of groups (Fig. 1). To break these ties and make a potential relationship between Revenue and BounceRates visible, I use the jitter() function, which adds a small amount of random noise to numeric values and reduces overplotting when data points overlap. The resulting plot shows slight curvature and potential outliers, suggesting that the relationship between BounceRates and the log-odds of Revenue is not perfectly linear. The same holds for ProductRelated_Duration, whose points follow a concave pattern (Fig. 2).
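To make the idea of an empirical logit plot concrete, here is a minimal base-R sketch on simulated data (not the study data). It groups a continuous predictor into bins and computes an adjusted empirical logit per bin. The 0.5 adjustment is a common convention and an assumption here; the exact adjustment used by emplogitplot1 may differ.

```r
set.seed(42)

# Simulated data: the probability of success falls as x rises
n <- 2000
x <- runif(n, 0, 0.2)
y <- rbinom(n, 1, plogis(-1 - 15 * x))

# Split x into groups of roughly equal size
ngroups <- 10
breaks  <- quantile(x, probs = seq(0, 1, length.out = ngroups + 1))
group   <- cut(x, breaks = breaks, include.lowest = TRUE)

# Per-group mean of x and adjusted empirical logit
x_mean    <- as.numeric(tapply(x, group, mean))
successes <- as.numeric(tapply(y, group, sum))
trials    <- as.numeric(tapply(y, group, length))
emp_logit <- log((successes + 0.5) / (trials - successes + 0.5))

# Points close to a straight line support the linearity condition
plot(x_mean, emp_logit, xlab = "x (group mean)", ylab = "Empirical logit")
abline(lm(emp_logit ~ x_mean))
```

Curvature in such a plot is what motivates transforming the predictor before fitting the model.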

Transformation

To reduce the curved patterns, I transform the continuous predictors to bring the points closer to the reference line (Fig. 3). After a square root transformation, BounceRates1 shows a stronger negative linear relationship with the log-odds of Revenue. ProductRelated_Duration, raised to the power of 0.33 (approximately a cube root), shows a more linear positive relationship, although the points still fluctuate around the reference line. I consider the linearity condition satisfied for all four predictors.
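The transformation code itself is not shown in the appendix, so the following is only a sketch of how the derived columns could be built, using a small simulated data frame named `shop_demo` (a hypothetical stand-in for shop). It also illustrates one plausible source of the NA values removed before model fitting: jitter() can push zero bounce rates slightly below zero, and the square root of a negative number is NaN.

```r
set.seed(1)  # jitter() is random, so a seed makes the result reproducible

# Simulated stand-in for the two continuous predictors (not the real data)
shop_demo <- data.frame(
  BounceRates             = c(0, 0, 0, 0.01, 0.02, 0.05, 0.2),
  ProductRelated_Duration = c(0, 35, 120, 600, 1500, 4000, 9000)
)

# Break ties with a small amount of random noise
shop_demo$BounceRates1 <- jitter(shop_demo$BounceRates)

# Square root; negative jittered values produce NaN (normally with a warning)
shop_demo$BounceRates1_sqrt <- suppressWarnings(sqrt(shop_demo$BounceRates1))

# Power transformation, approximately a cube root
shop_demo$ProductRelated_Duration_trans <-
  shop_demo$ProductRelated_Duration^0.33

# na.omit() drops any rows where the square root is NaN
cleaned_demo <- na.omit(shop_demo)
```

Because the noise is random, fitted coefficients and the number of retained rows can change between runs unless a seed is fixed.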

Variable selection

Next, I perform a stepwise regression as an automated variable selection procedure (Fig. 4), with all four predictors as candidates: ProductRelated_Duration_trans, BounceRates1_sqrt, VisitorType, and Weekend. The selected model retains the first three and does not add Weekend. I nevertheless fit a multiple logistic regression model with all four predictors (Fig. 6.2). All predictors except Weekend are statistically significant, with p-values below 0.05. Weekend has a p-value of 0.17, which is above even the 0.1 significance level, but I still carry it into the model comparison below. Additionally, I check for multicollinearity using the vif function. Since all variance inflation factors (VIFs) are below 5, there are no significant concerns regarding multicollinearity.

Possible interaction

Now, I consider potential interactions between the predictors. The emplogitplot2 output suggests a potential interaction between ProductRelated_Duration and Weekend, as the lines showing the association between the predictor and response are not parallel, indicating that the effect of ProductRelated_Duration on the response may differ depending on whether it is a weekend (Fig. 5).

Model Fitting

Before proceeding with model fitting, I filtered the shop data to remove the NA values that appeared in the dataset after transforming the variables. The first model, glm1, includes three predictors: ProductRelated_Duration_trans, BounceRates1_sqrt, and VisitorType (Fig. 6.1). The second model, glm2, includes all four predictors (Fig. 6.2). The third model, glm3, includes all four predictors and an interaction term (Fig. 6.3). Based on the msummary output, the interaction term has a p-value of 0.45, which is above the significance level of 0.05, so it does not add meaningful explanatory power to the model. I include it in the likelihood ratio test below to confirm this finding.

Nested LRT tests

The likelihood ratio test compares the goodness of fit of nested models. Adding Weekend (Model 2 vs. Model 1) gives a p-value of 0.17, and adding the interaction term (Model 3 vs. Model 2) gives a p-value of 0.45. Neither addition significantly improves the fit, so the smallest model (glm1) is preferred for explaining variation in Revenue.

## Likelihood ratio test
## 
## Model 1: Revenue ~ VisitorType + BounceRates1_sqrt + ProductRelated_Duration_trans
## Model 2: Revenue ~ VisitorType + BounceRates1_sqrt + ProductRelated_Duration_trans + 
##     Weekend
## Model 3: Revenue ~ VisitorType + BounceRates1_sqrt + ProductRelated_Duration_trans + 
##     Weekend + ProductRelated_Duration_trans * Weekend
##   #Df LogLik Df Chisq Pr(>Chisq)
## 1   4  -3561                    
## 2   5  -3560  1  1.88       0.17
## 3   6  -3560  1  0.58       0.45
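The p-values in this table can be recovered from the chi-square statistics alone. The sketch below assumes only the Chisq and Df columns printed above; the vector names are illustrative.

```r
# Chi-square statistics and degrees of freedom taken from the table above
chisq <- c(model2_vs_model1 = 1.88, model3_vs_model2 = 0.58)
df    <- c(1, 1)

# Upper-tail probability of the chi-square distribution
p_values <- pchisq(chisq, df = df, lower.tail = FALSE)

round(p_values, 2)
```

Each statistic is twice the difference in log-likelihood between the two nested models, which is why one extra parameter corresponds to one degree of freedom.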

Model Selection

To support the selection of glm1 (Revenue ~ VisitorType + BounceRates1_sqrt + ProductRelated_Duration_trans), I compare its explained deviance, AIC, and concordance with those of glm2 (Revenue ~ VisitorType + BounceRates1_sqrt + ProductRelated_Duration_trans + Weekend). Explained deviance is an \(R^2\)-like measure for logistic regression. Using the deviances in Fig. 6.1 and Fig. 6.2, the explained deviance is 0.1035 for glm1 and 0.1038 for glm2, and both models have an AIC of 7130. I also consider the c-statistic (concordance), which measures a model’s ability to rank predictions correctly for binary outcomes; it is 0.73 for both models (Fig. 7). Because the two models perform almost identically, the simpler model glm1 is the better choice.
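Explained deviance and AIC can both be derived from the deviances printed in Fig. 6.1 and Fig. 6.2. The sketch below uses those printed values; it relies on the fact that for a 0/1 response the AIC of a binomial model equals the residual deviance plus twice the number of coefficients. The variable names are illustrative.

```r
# Deviances as printed in the model summaries (Fig. 6.1 and Fig. 6.2)
null_dev  <- 7944.5
resid_dev <- c(glm1 = 7122.1, glm2 = 7120.2)
n_coef    <- c(glm1 = 4, glm2 = 5)   # intercept plus slopes

# Explained deviance: proportional reduction relative to the null model
explained_deviance <- 1 - resid_dev / null_dev

# AIC for a binary-response logistic model: deviance + 2 * parameters
aic <- resid_dev + 2 * n_coef

round(explained_deviance, 4)
round(aic)
```

Adding Weekend lowers the deviance by less than the two-point AIC penalty for the extra coefficient, which is why the AIC does not favor the larger model.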

Identifying influential points

Before reporting and interpreting the final model, I check the 95% confidence intervals for the exponentiated coefficients (odds ratios) of my logistic regression model (Fig. 8). None of the intervals contains 1, meaning that all coefficients are statistically significant. Additionally, I check for unusual points, such as high-leverage and influential points. According to the Cook’s distance vs. leverage plot (Fig. 9), there are no points with a Cook’s distance above 0.5. Additionally, no points have severely high leverage, since the leverage values all range from 0 to 0.0045 and the points lie close to the reference line.

Reporting the final model (coefficients from Fig. 6.1): \(\pi = \text{probability of revenue generation}\)

\[\log\left(\frac{\widehat{\pi}}{1-\widehat{\pi}}\right) = -1.9566 - 0.6926\,(\text{VisitorTypeReturning Visitor}) - 6.2593\,(\text{BounceRates1 sqrt}) + 0.1337\,(\text{ProductRelated Duration trans})\]

Model interpretation

When all predictor variables are equal to 0, the odds of revenue generation are 0.1413, corresponding to a probability of 12.4%. Each coefficient represents the change in the log-odds of revenue generation associated with a one-unit increase in the respective variable, holding the other predictors constant. The odds ratio for returning visitors is exp(-0.6926) = 0.5003, so returning visitors have about 50% lower odds of revenue generation than new visitors. A one-unit increase in BounceRates1_sqrt decreases the odds of revenue generation by approximately 99.81%. For each one-unit increase in ProductRelated_Duration_trans, the odds of revenue generation increase by approximately 14.30%.
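The step from coefficients to percentage changes in the odds can be sketched in R. The values below are the glm1 estimates printed in Fig. 6.1, copied by hand; with the fitted object available, `exp(coef(glm1))` would give the same odds ratios.

```r
# glm1 coefficient estimates as printed in Fig. 6.1
coefs <- c("(Intercept)"                 = -1.95655,
           VisitorTypeReturning_Visitor  = -0.69259,
           BounceRates1_sqrt             = -6.25926,
           ProductRelated_Duration_trans =  0.13367)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios <- exp(coefs)

# Percentage change in the odds per one-unit increase in the predictor
pct_change <- 100 * (odds_ratios - 1)

# Baseline probability when all predictors equal 0
baseline_prob <- plogis(coefs["(Intercept)"])

round(odds_ratios, 4)
round(pct_change[-1], 2)
```

An odds ratio of about 0.50 means the odds are halved, which is a 50% reduction in the odds rather than a 50% reduction in the probability.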

Conclusions & Discussion

This study addressed the question of which factors matter most for revenue generation in online retail. The findings revealed that new visitors are more likely to generate revenue than returning visitors. Higher bounce rates reduce the likelihood of revenue generation. Increased engagement with product-related pages positively influences revenue generation. The type of day, weekend or workday, was not a significant predictor once the other variables were included, suggesting that there may be additional factors influencing shoppers’ behavior that this model does not capture. The small explained deviance of 0.1035 indicates that the model explains only about 10% of the variability in the data. This could be due to complex relationships between variables that the model does not capture.

In summary, the findings suggest that business owners should tailor their promotions to reach new clients, while trying to engage returning clients with personalized offerings. Better website usability increases engagement and can help reduce bounce rates and boost revenue. Interactive features, better descriptions, photos, and videos can streamline the purchasing process and reduce decision fatigue for highly engaged visitors.

References

Sakar, C., and Yomi Kastro. “Online Shoppers Purchasing Intention Dataset.” UCI Machine Learning Repository, 30 Aug. 2018, archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset.

Appendix

Fig. 1:

The graph shows the relationship between BounceRates and the likelihood of generating Revenue across all groups in the dataset:

library(Stat2Data)  # provides emplogitplot1() and emplogitplot2()
plot1<-emplogitplot1(factor(Revenue) ~ (BounceRates), data = shop, 
                     ngroups = "all",ylab = "Bounce Rates full data")

Fig. 2:

Plot 1: The graph shows the relationship between the jittered BounceRates (denoted as BounceRates1) and the likelihood of generating Revenue. Plot 2: The graph shows the relationship between ProductRelated_Duration and the likelihood of generating Revenue.

plot2<-emplogitplot1(factor(Revenue) ~ 
                       (BounceRates1), data = shop, ngroups = 100)

plot3<-emplogitplot1(factor(Revenue) ~ (ProductRelated_Duration), 
                     data = shop, ngroups = 20)

Fig. 3:

Plot 1: The graph shows the relationship between the square root transformation of BounceRates (BounceRates1_sqrt) and the likelihood of generating Revenue. Plot 2: The graph shows the relationship between the transformed ProductRelated_Duration (ProductRelated_Duration_trans) and the likelihood of generating Revenue.

plot1<-emplogitplot1(factor(Revenue) ~ (BounceRates1_sqrt), 
                     data = shop, ngroups = 90)

plot2<-emplogitplot1(factor(Revenue) ~ (ProductRelated_Duration_trans), 
                     data = shop, ngroups = 20)

Fig. 4:

Stepwise regression using the AIC (Akaike Information Criterion) to identify the most significant variables for predicting Revenue:

cleaned_shop <- na.omit(shop)
require(MASS)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
mod.small <- glm(Revenue ~ 1, data = cleaned_shop, family = binomial)

mod.all <- glm(Revenue ~ ProductRelated_Duration_trans + 
                 BounceRates1_sqrt + VisitorType + Weekend, 
               data = cleaned_shop, family = binomial)

# Perform stepwise regression

stepAIC(mod.small, scope = list(lower = mod.small, upper = mod.all), 
direction = "both", trace = FALSE)$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## Revenue ~ 1
## 
## Final Model:
## Revenue ~ ProductRelated_Duration_trans + BounceRates1_sqrt + 
##     VisitorType
## 
## 
##                              Step Df Deviance Resid. Df Resid. Dev    AIC
## 1                                                  9522     7944.5 7946.5
## 2 + ProductRelated_Duration_trans  1  474.730      9521     7469.8 7473.8
## 3             + BounceRates1_sqrt  1  291.062      9520     7178.7 7184.7
## 4                   + VisitorType  1   56.592      9519     7122.1 7130.1

Fig. 5:

Summary of the results of a logistic regression model with an interaction term between ProductRelated_Duration_trans and Weekend:

## Coefficients:
##                                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               -3.05004    0.07519  -40.56   <2e-16
## ProductRelated_Duration_trans              0.14034    0.00683   20.54   <2e-16
## WeekendTRUE                                0.42234    0.14803    2.85   0.0043
## ProductRelated_Duration_trans:WeekendTRUE -0.02496    0.01371   -1.82   0.0686
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10541.9  on 12244  degrees of freedom
## Residual deviance:  9984.8  on 12241  degrees of freedom
## AIC: 9993
## 
## Number of Fisher Scoring iterations: 5

Fig. 6.1:

Summary of the results of a logistic regression model with VisitorType, BounceRates1_sqrt, ProductRelated_Duration_trans:

glm1 <- glm(Revenue ~ VisitorType+BounceRates1_sqrt+
              ProductRelated_Duration_trans, data = cleaned_shop,
            family = binomial)
msummary(glm1)
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -1.95655    0.09772  -20.02  < 2e-16 ***
## VisitorTypeReturning_Visitor  -0.69259    0.08972   -7.72  1.2e-14 ***
## BounceRates1_sqrt             -6.25926    0.49451  -12.66  < 2e-16 ***
## ProductRelated_Duration_trans  0.13367    0.00734   18.20  < 2e-16 ***
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7944.5  on 9522  degrees of freedom
## Residual deviance: 7122.1  on 9519  degrees of freedom
## AIC: 7130
## 
## Number of Fisher Scoring iterations: 6

Fig. 6.2:

Summary of the results of a logistic regression model with VisitorType, BounceRates1_sqrt, ProductRelated_Duration_trans, and Weekend:

glm2 <- glm(Revenue ~ VisitorType+BounceRates1_sqrt+
              ProductRelated_Duration_trans+Weekend, data = cleaned_shop, 
            family = binomial)
msummary(glm2)
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -1.98331    0.09976  -19.88  < 2e-16 ***
## VisitorTypeReturning_Visitor  -0.68925    0.08976   -7.68  1.6e-14 ***
## BounceRates1_sqrt             -6.25064    0.49487  -12.63  < 2e-16 ***
## ProductRelated_Duration_trans  0.13361    0.00735   18.19  < 2e-16 ***
## WeekendTRUE                    0.09546    0.06926    1.38     0.17    
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7944.5  on 9522  degrees of freedom
## Residual deviance: 7120.2  on 9518  degrees of freedom
## AIC: 7130
## 
## Number of Fisher Scoring iterations: 6

Fig. 6.3:

Summary of the results of a logistic regression model with VisitorType, BounceRates1_sqrt, ProductRelated_Duration_trans, Weekend, and an interaction term:

glm3 <- glm(Revenue ~ VisitorType+BounceRates1_sqrt+
              ProductRelated_Duration_trans+Weekend+
              ProductRelated_Duration_trans*Weekend, data = cleaned_shop, 
            family = binomial)
msummary(glm3)
## Coefficients:
##                                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               -2.01723    0.10954  -18.42  < 2e-16
## VisitorTypeReturning_Visitor              -0.68789    0.08979   -7.66  1.8e-14
## BounceRates1_sqrt                         -6.24591    0.49506  -12.62  < 2e-16
## ProductRelated_Duration_trans              0.13664    0.00837   16.32  < 2e-16
## WeekendTRUE                                0.22888    0.18828    1.22     0.22
## ProductRelated_Duration_trans:WeekendTRUE -0.01254    0.01650   -0.76     0.45
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7944.5  on 9522  degrees of freedom
## Residual deviance: 7119.7  on 9517  degrees of freedom
## AIC: 7132
## 
## Number of Fisher Scoring iterations: 6

Fig. 7:

Concordance index (C-index) tables:

require(survival)
concordance(glm1)
## Call:
## concordance.lm(object = glm1)
## 
## n= 9523 
## Concordance= 0.73 se= 0.0067
## concordant discordant     tied.x     tied.y    tied.xy 
##    8304136    3054614          0   33980253          0
concordance(glm2)
## Call:
## concordance.lm(object = glm2)
## 
## n= 9523 
## Concordance= 0.73 se= 0.0067
## concordant discordant     tied.x     tied.y    tied.xy 
##    8310224    3048526          0   33980253          0

Fig. 8:

95% confidence intervals for the odds ratios (exponentiated coefficients) of glm1:

exp(confint(glm1))
## Waiting for profiling to be done...
##                                    2.5 %    97.5 %
## (Intercept)                   0.11652541 0.1709325
## VisitorTypeReturning_Visitor  0.41997625 0.5970675
## BounceRates1_sqrt             0.00071561 0.0049729
## ProductRelated_Duration_trans 1.12674246 1.1596501

Fig. 9:

Cook’s distance vs. leverage plot for glm1, used to screen for influential and high-leverage points:

mplot(glm1, which =6)
## `geom_smooth()` using formula = 'y ~ x'

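# Leverage cutoff 3(k+1)/n with k = 3 predictors; nobs(glm1) is the number of rows used in the fit (9523)
pt<-which(hatvalues(glm1) > (3*(3+1)/nobs(glm1)))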
which(cooks.distance(glm1) > 0.5)
## named integer(0)