This study investigates factors influencing online purchase decisions
using the Online Shoppers Purchasing Intention Dataset with 12,330
instances and 18 variables. The analysis centers on key attributes such
as Revenue, Bounce Rates, Product Page Engagement, and user type (new or
returning visitors). The Revenue variable indicates whether a purchase is
made. The final model shows that the odds of generating revenue for
returning visitors are about 54% of the odds for new visitors. A one-unit
increase in BounceRates1_sqrt decreases the odds by 99.79%, while a
one-unit increase in ProductRelated_Duration_trans increases
the odds by 14.41%. These results offer insights for businesses to
optimize website design, personalize marketing campaigns, and plan
strategic promotions to drive revenue growth.
Insights into the main drivers of revenue generation help business owners tailor website content and offers to what motivates each customer to buy. For instance, the results of the study can inform personalized marketing campaigns, optimization of product pages through more detailed descriptions, images, and reviews, planning of weekend or seasonal promotions, and website interface design that reduces bounce rates. My initial hypothesis is that purchases are more common on weekends, returning users contribute more revenue, and longer time on product pages increases purchase likelihood. Therefore, the research question I aim to answer is: what are the most influential factors in consumers’ purchasing decisions, and how significant is each predictor?
The dataset is imported from the [Online Shoppers Purchasing Intention Dataset](https://archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset). It contains 12,330 observations, each representing a unique online user session recorded over the course of one year. Each data point corresponds to a different user, which minimizes the potential influence of specific campaigns, special days, or user profiles and reduces bias from recurring user behavior.
Response Variable: Revenue is a binary categorical
response variable with two levels (“TRUE” - coded as “1”, “FALSE”
- coded as “0”). It indicates whether a purchase was made.
Predictor Variables:
1. VisitorType is a categorical variable with 3 levels: Returning_Visitor, New_Visitor, and Other. It identifies the type of visitor, which allows revenue generation to be compared across visitor types.
2. Weekend is a categorical variable with two levels (“TRUE” - coded as “1”, “FALSE” - coded as “0”). It shows whether the user session happened on a weekend or a workday.
3. BounceRates is a quantitative variable that shows the percentage of visitors who land on the website and leave without clicking on anything else or visiting other webpages.
4. ProductRelated_Duration is a quantitative variable that represents the total time, measured in seconds, spent on product-related pages.
This study models the odds of an online retail purchase
using two models: a simple logistic regression model assessing the
impact of Weekend shopping on the likelihood of generating
revenue and a multiple logistic regression model evaluating additional
predictors, including BounceRates,
ProductRelated_Duration, and VisitorType.
Model selection relies on Akaike Information Criterion (AIC) values,
explained deviance, concordance, and likelihood ratio testing (LRT).
Interaction terms were considered, but none were found to significantly
improve the model. The analysis also checks for linearity between
predictors and log-odds, addressing any inconsistencies with
transformations.
Univariate EDA: The revenue distribution is heavily imbalanced between successful and unsuccessful purchases: 84.5% of customer sessions do not generate any revenue.
Bivariate EDA: BounceRates vs. Revenue.
The boxplot shows the distribution of Bounce Rates
grouped by the variable Revenue. When revenue is generated,
bounce rates are typically lower, with values concentrated near zero. In
contrast, when no revenue is generated, bounce rates have a wider spread
and more outliers. Summary statistics show the mean Bounce Rate is 0.51%
when revenue is generated and 2.53% when not, with medians of 0% and
0.04%, respectively. Given the skewed distribution, a square root
transformation is appropriate to see the difference between groups
clearly.
When revenue is generated, ProductRelated_Duration is higher, with a mean of 1876.2 seconds, compared to a mean of 1059.9 seconds for sessions where no revenue is generated. The median is also higher for sessions where a purchase was made. The difference is clearer after a cube root transformation.
Based on the bar graph, the largest group of customers are returning
customers, with 10551 observations, the second largest group are
new customers, with 1694 observations, and the remaining 85 observations
come from other users. To create a more straightforward distinction between
returning and new clients, while preserving the significance of the
model, I will drop the Other level. The mosaic plot of revenue by
Visitor Type shows that returning visitors make up 86% of the remaining
observations, while new visitors account for 14%. 24.9% of new visitors
make a purchase, compared to 13.9% of returning visitors. However, as a
share of all sessions, purchases by returning visitors account for 12%,
compared to 3.4% for purchases by new visitors. Therefore, while new
visitors are more likely to purchase on their first visit, returning
visitors drive the majority of revenue-generating sessions due to their
higher numbers.
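The shares quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch that uses only the counts and purchase rates reported in this section; the variable names are my own and are not part of the original analysis.

```r
# Counts reported above, after dropping the Other level
n_returning <- 10551
n_new <- 1694
n_total <- n_returning + n_new

# Purchase rates within each visitor type, as read from the mosaic plot
rate_returning <- 0.139
rate_new <- 0.249

# Share of each visitor type among all retained sessions
share_returning <- n_returning / n_total
share_new <- n_new / n_total

# Share of all sessions that are purchases made by each visitor type
joint_returning <- share_returning * rate_returning
joint_new <- share_new * rate_new

# Approximate share of purchasing sessions that come from returning visitors
purchase_share_returning <- joint_returning / (joint_returning + joint_new)

round(c(joint_returning, joint_new, purchase_share_returning), 3)
```

Under these inputs, a little over three quarters of purchasing sessions come from returning visitors, even though their purchase rate per session is lower.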
Based on the distribution of revenue by day, there is a slight increase in sales on the weekend compared to workdays: 17.4% of weekend sessions result in revenue generation, compared to 14.9% of workday sessions.
I start by asking whether a weekend session has a higher likelihood of a
sale than a workday session, using a simple logistic regression. The
linearity condition of this model is met because the predictor is binary.
The independence condition is met, based on the assumption that each
shopping session in the dataset is independent of the others. For
randomness, the response variable, Revenue, is binary and the sample size
is fixed at 12,245 observations. The assumption of a constant probability
raises slight concerns because, even when predictors like Weekend are
significant, they may not fully account for all variability in Revenue.
However, the conditions are sufficiently satisfied to proceed with
fitting the logistic regression model.
The null hypothesis is that Weekend is not a significant
predictor of Revenue, and the alternative hypothesis is
that Weekend is a significant predictor. The p-value for the
predictor Weekend is 0.00096, which is below the significance level of
0.05. I reject the null hypothesis and conclude that Weekend is
a significant predictor of revenue generation in online retail when it is
considered on its own.
Reporting the model:
\[\pi = \text{probability of generating revenue}\] \[\log\left(\frac{\widehat{\pi}}{1-\widehat{\pi}}\right) = -1.7460 + 0.1889(\text{WeekendTRUE}) \]
Based on the regression output, the intercept of -1.7460 represents the log-odds of revenue generation on a workday. This corresponds to odds of exp(-1.7460) = 0.17447, so the probability of revenue generation on a workday is 0.17447 / 1.17447 = 14.9%. The odds ratio when moving from a workday to a weekend is exp(0.1889) = 1.2079, which means that the odds of revenue generation on weekends are 20.8% higher than on workdays. The log-odds of revenue generation on the weekend are -1.7460 + 0.1889 = -1.5571, which corresponds to a probability of approximately 17.4%, matching the proportion observed in the EDA.
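The conversion from log-odds to odds and probabilities can be checked directly in R. This is a minimal sketch that takes the two coefficients from the reported equation as given; the object names are illustrative and do not come from the original analysis.

```r
# Coefficients of the simple logistic regression reported above
b0 <- -1.7460   # intercept: log-odds of revenue on a workday
b1 <- 0.1889    # WeekendTRUE: change in log-odds on a weekend

odds_workday <- exp(b0)           # odds of revenue on a workday
p_workday <- plogis(b0)           # same as odds / (1 + odds)
odds_ratio <- exp(b1)             # weekend odds relative to workday odds
p_weekend <- plogis(b0 + b1)      # probability uses the full linear predictor

round(c(odds_workday = odds_workday, p_workday = p_workday,
        odds_ratio = odds_ratio, p_weekend = p_weekend), 4)
```

Note that the weekend probability is obtained from the sum b0 + b1, not from b1 alone, because b1 is only the difference in log-odds between the two groups.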
I proceed with a multiple logistic regression to predict the likelihood
of generating revenue using additional predictors. The predictors
include categorical (VisitorType, Weekend) and continuous
variables (BounceRates,
ProductRelated_Duration). I assume the independence
condition is met (since all shopping sessions are independent of each
other), as well as the randomness condition. The assumption of constant
probability raises some slight concerns because the predictors may not be
the only factors influencing revenue generation, suggesting the model
might not capture all relationships. The variable BounceRates has many
repeated values, which form groups of tied data (Fig. 1). To break up
these ties and reveal a potential relationship between Revenue and
BounceRates, I use the jitter() function, which adds a small amount of
random noise to numeric values and is commonly used to reduce
overplotting when data points overlap. The linearity
condition is met for the categorical predictors by design. Continuous
predictors are checked for linearity with empirical logit plots.
Based on the output, there is a slight curvature, along with potential
outliers, suggesting that the relationship between BounceRates and the
log-odds of Revenue is not perfectly linear. The same holds for the
ProductRelated_Duration variable, as the data points follow a concave
pattern (Fig. 2).
In order to reduce curved patterns, I will use transformations to
achieve a better fit with respect to the reference line (Fig. 3). With
the square root transformation, BounceRates1 appears to have
a stronger negative linear relationship with the log-odds of Revenue.
ProductRelated_Duration, after being raised to the power of
0.33, achieves better positive linearity, but it still
fluctuates around the reference line. I will consider the linearity
conditions satisfied for all 4 predictors now.
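The jittering and transformation steps can be sketched as follows. This is an illustration on a small made-up data frame (shop_toy), not the code used for the analysis; the seed value and the order of operations are assumptions. Fixing a seed before jitter() keeps the random noise, and therefore every downstream estimate, reproducible between runs. Because jitter() adds noise in both directions, values at zero can become slightly negative, and sqrt() then returns NaN for them, which is one plausible source of missing values after transformation.

```r
set.seed(42)  # hypothetical seed; any fixed value makes jitter() reproducible

# Made-up toy data standing in for the shop data
shop_toy <- data.frame(
  BounceRates = c(0, 0, 0, 0.02, 0.02, 0.05, 0.2),
  ProductRelated_Duration = c(0, 64, 627.5, 1200, 2630, 10, 154)
)

# Add a small amount of random noise to break ties in BounceRates
shop_toy$BounceRates1 <- jitter(shop_toy$BounceRates)

# Square root of the jittered values; negative inputs yield NaN (warning suppressed)
shop_toy$BounceRates1_sqrt <- suppressWarnings(sqrt(shop_toy$BounceRates1))

# Power transformation of the duration variable
shop_toy$ProductRelated_Duration_trans <- shop_toy$ProductRelated_Duration^0.33

# Drop rows where a transformation produced a missing value
cleaned_toy <- na.omit(shop_toy)
nrow(shop_toy) - nrow(cleaned_toy)  # number of rows lost to NaN
```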
Next, I perform a stepwise regression as an automated variable
selection procedure (Fig. 4), which retains three
predictors: ProductRelated_Duration_trans,
BounceRates1_sqrt, and VisitorType. Because
Weekend is part of my initial hypothesis, I proceed to fit a
multiple logistic regression model with all four predictors. All
predictors except Weekend are statistically
significant, with p-values below 0.05. Weekend has a p-value
of 0.11, which is above the significance level of 0.05, but I will still
consider it in the model comparison. Additionally, I check for potential
multicollinearity using the vif function. Since all
Variance Inflation Factors (VIFs) are below 5, there are no significant
concerns regarding multicollinearity.
Now, I consider potential interactions between the predictors. The
emplogitplot2 output suggests a potential interaction
between ProductRelated_Duration and Weekend,
as the lines showing the association between the predictor and response
are not parallel, indicating that the effect of
ProductRelated_Duration on the response may differ
depending on whether it is a weekend (Fig. 5).
Before proceeding with model fitting, I filtered the shop data to
remove the NA values which appeared in the dataset after transforming the
variables. Model 1 includes three predictors:
ProductRelated_Duration_trans,
BounceRates1_sqrt, and VisitorType (Fig. 6.1).
Model 2 includes all four predictors (Fig. 6.2).
Model 3 includes all four predictors and an interaction
term between ProductRelated_Duration_trans and Weekend (Fig. 6.3).
Based on the msummary output, the interaction
term has a p-value of 0.49, which is above the significance level of
0.05, so it does not add meaningful explanatory power to the model. I will
include it in the likelihood ratio test below to confirm this
finding.
Based on the likelihood ratio test, which compares the goodness of fit
of nested models, adding Weekend (Model 2) gives a p-value of 0.17
and adding the interaction term (Model 3) gives a p-value of 0.45.
Neither addition significantly improves the fit, so the smallest model
(glm1) is preferred for explaining variation in Revenue.
## Likelihood ratio test
##
## Model 1: Revenue ~ VisitorType + BounceRates1_sqrt + ProductRelated_Duration_trans
## Model 2: Revenue ~ VisitorType + BounceRates1_sqrt + ProductRelated_Duration_trans +
## Weekend
## Model 3: Revenue ~ VisitorType + BounceRates1_sqrt + ProductRelated_Duration_trans +
## Weekend + ProductRelated_Duration_trans * Weekend
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 4 -3561
## 2 5 -3560 1 1.88 0.17
## 3 6 -3560 1 0.58 0.45
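The p-values in the table follow from comparing each chi-square statistic with a chi-square distribution whose degrees of freedom equal the number of added parameters. As a minimal sketch, the two statistics printed above (1.88 and 0.58, each with 1 degree of freedom) can be converted by hand; the object names are illustrative.

```r
# Chi-square statistics from the likelihood ratio test output above
# (each is twice the difference in log-likelihood between nested models)
lr_weekend <- 1.88       # Model 2 vs. Model 1: adds Weekend
lr_interaction <- 0.58   # Model 3 vs. Model 2: adds the interaction term

# Upper-tail probability with 1 degree of freedom per added parameter
p_weekend <- pchisq(lr_weekend, df = 1, lower.tail = FALSE)
p_interaction <- pchisq(lr_interaction, df = 1, lower.tail = FALSE)

round(c(p_weekend, p_interaction), 2)
```

Both values are well above 0.05, so neither added term gives a statistically significant improvement in fit.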
To support the selection of the model glm1 (Revenue ~ VisitorType + BounceRates1_sqrt + ProductRelated_Duration_trans), I compare its explained deviance, AIC, and concordance with those of the second model, glm2 (Revenue ~ VisitorType + BounceRates1_sqrt + ProductRelated_Duration_trans + Weekend). Explained deviance is an \(R^2\)-like measure for logistic regression. The explained deviance of model glm1 is 0.10081, while for model glm2 it is 0.10114. Both models have an AIC value of 7080. I also consider the c-statistic, which measures a model’s ability to correctly rank predictions for binary outcomes. Concordance is 0.73 for both models (Fig. 7). Since the two models have very similar values, the simpler model is the better choice in this case.
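The three comparison measures can all be computed from a fitted glm object. The sketch below uses the built-in mtcars data purely as a stand-in, since it is not the shop data, and the concordance is computed by hand in base R rather than with the survival package used in Fig. 7. It is meant to show what each quantity measures, not to reproduce the values above.

```r
# Stand-in logistic regression on a built-in dataset (not the shop data)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Explained deviance: proportional reduction from the null deviance
explained_deviance <- 1 - deviance(fit) / fit$null.deviance

# AIC; for a 0/1 response this equals residual deviance + 2 * number of coefficients
aic_value <- AIC(fit)

# Concordance (c-statistic): share of (event, non-event) pairs in which the
# event has the higher fitted probability, counting ties as one half
p_hat <- fitted(fit)
y <- mtcars$am
diffs <- outer(p_hat[y == 1], p_hat[y == 0], "-")
c_stat <- mean((diffs > 0) + 0.5 * (diffs == 0))

c(explained_deviance = explained_deviance, AIC = aic_value, concordance = c_stat)
```

A concordance of 0.5 corresponds to random ranking and 1 to perfect ranking, which is why a value of 0.73 indicates moderate discrimination.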
Before reporting and interpreting the final model, I check the 95% confidence intervals for the exponentiated coefficients of my logistic regression model (Fig. 8). None of the intervals contains 1, meaning that all of the coefficients are statistically significant. Additionally, I check for unusual points, such as high leverage points and influential points. According to the Cook’s distance vs. leverage plot (Fig. 9), there are no points with a Cook’s distance above 0.5. Additionally, no points have severely high leverage, since the leverage values all range from 0 to 0.0045 and the points are located close to the reference line.
Reporting the final model: \(\pi = \text{Probability of Revenue Generation}\) \[\log\left(\frac{\widehat{\pi}}{1-\widehat{\pi}}\right) = -2.0581 - 0.6116(\text{VisitorTypeReturning\_Visitor}) - 6.1502(\text{BounceRates1\_sqrt})\] \[+ 0.1346(\text{ProductRelated\_Duration\_trans}) \]
For a new visitor with both transformed continuous predictors equal to 0,
the odds of revenue generation are 0.1277, corresponding to a probability
of 11.3%. The coefficient of each variable represents the change in the
log-odds of revenue generation associated with a one-unit increase in
that variable, holding the other variables constant. Therefore, the odds
of revenue generation for returning visitors are 0.5425 times the odds
for new visitors, that is, 45.75% lower, holding other
factors constant. A one-unit increase in BounceRates1_sqrt
decreases the odds of revenue generation by approximately 99.79%. For
each one-unit increase in ProductRelated_Duration_trans, the
odds of revenue generation increase by approximately 14.41%.
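The percentage changes follow from exponentiating the coefficients of the reported equation. This is a minimal sketch with illustrative object names; the 0.1-unit example at the end is my own addition, on the assumption that BounceRates is stored as a proportion, so that its square root varies over a range much narrower than one unit.

```r
# Coefficients of the final model as reported in the equation above
coefs <- c(intercept = -2.0581,
           returning_visitor = -0.6116,
           bounce_sqrt = -6.1502,
           duration_trans = 0.1346)

# Odds ratios and percentage change in odds per one-unit increase
odds_ratios <- exp(coefs)
pct_change <- (odds_ratios[-1] - 1) * 100

# Baseline odds and probability when all predictors are at 0
baseline_odds <- odds_ratios[["intercept"]]
baseline_prob <- baseline_odds / (1 + baseline_odds)

# Percentage change in odds for a 0.1-unit increase in BounceRates1_sqrt
pct_bounce_tenth <- (exp(0.1 * coefs[["bounce_sqrt"]]) - 1) * 100

round(pct_change, 2)
```

Under that assumption, a 0.1-unit increase in BounceRates1_sqrt lowers the odds by about 46%, which may be easier to relate to observed bounce rates than the one-unit figure.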
This study addressed the question of the most important factors in revenue generation in online retail. The findings revealed that new visitors are more likely to generate revenue than returning visitors. Higher bounce rates reduce the likelihood of revenue generation. Increased engagement with product-related pages positively influences revenue generation. The type of day, weekend or workday, was not a significant predictor once the other variables were included, suggesting that there may be additional factors that influence shoppers’ behavior that are not captured by this model. A small explained deviance of about 0.10 indicates that the model does not explain a large portion of the variability in the data. This could be due to complex relationships between variables not captured by the model.
In summary, the findings suggest that business owners should tailor their promotions to reach new clients, while trying to engage returning clients with personalized offerings. Better website usability increases engagement and can help reduce bounce rates and boost revenue. Interactive features, better descriptions, photos, and videos can streamline the purchasing process and reduce decision fatigue for highly engaged visitors.
Sakar, C., and Yomi Kastro. “Online Shoppers Purchasing Intention Dataset.” UCI Machine Learning Repository, 30 Aug. 2018, archive.ics.uci.edu/dataset/468/online+shoppers+purchasing+intention+dataset.
Fig. 1:
The graph shows the relationship between BounceRates and the likelihood of generating Revenue across all groups in the dataset:
plot1<-emplogitplot1(factor(Revenue) ~ (BounceRates), data = shop,
ngroups = "all",ylab = "Bounce Rates full data")
Fig. 2:
Plot 1: The graph shows the relationship between the jittered BounceRates (denoted as BounceRates1) and the likelihood of generating Revenue. Plot 2: The graph shows the relationship between ProductRelated_Duration and the likelihood of generating Revenue.
plot2<-emplogitplot1(factor(Revenue) ~
(BounceRates1), data = shop, ngroups = 100)
plot3<-emplogitplot1(factor(Revenue) ~ (ProductRelated_Duration),
data = shop, ngroups = 20)
Fig. 3:
Plot 1: The graph shows the relationship between the square root transformation of BounceRates (BounceRates1_sqrt) and the likelihood of generating Revenue. Plot 2: The graph shows the relationship between the transformed ProductRelated_Duration (ProductRelated_Duration_trans) and the likelihood of generating Revenue.
plot1<-emplogitplot1(factor(Revenue) ~ (BounceRates1_sqrt),
data = shop, ngroups = 90)
plot2<-emplogitplot1(factor(Revenue) ~ (ProductRelated_Duration_trans),
data = shop, ngroups = 20)
Fig. 4:
Stepwise regression using the AIC (Akaike Information Criterion) to identify the most significant variables for predicting Revenue:
cleaned_shop <- na.omit(shop)
require(MASS)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
mod.small <- glm(Revenue ~ 1, data = cleaned_shop, family = binomial)
mod.all <- glm(Revenue ~ ProductRelated_Duration_trans +
BounceRates1_sqrt + VisitorType + Weekend,
data = cleaned_shop, family = binomial)
# Perform stepwise regression
stepAIC(mod.small, scope = list(lower = mod.small, upper = mod.all),
direction = "both", trace = FALSE)$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## Revenue ~ 1
##
## Final Model:
## Revenue ~ ProductRelated_Duration_trans + BounceRates1_sqrt +
## VisitorType
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 9522 7944.5 7946.5
## 2 + ProductRelated_Duration_trans 1 474.730 9521 7469.8 7473.8
## 3 + BounceRates1_sqrt 1 291.062 9520 7178.7 7184.7
## 4 + VisitorType 1 56.592 9519 7122.1 7130.1
Fig. 5:
Summary of the results of a logistic regression model with an interaction term between ProductRelated_Duration_trans and Weekend:
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.05004 0.07519 -40.56 <2e-16
## ProductRelated_Duration_trans 0.14034 0.00683 20.54 <2e-16
## WeekendTRUE 0.42234 0.14803 2.85 0.0043
## ProductRelated_Duration_trans:WeekendTRUE -0.02496 0.01371 -1.82 0.0686
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10541.9 on 12244 degrees of freedom
## Residual deviance: 9984.8 on 12241 degrees of freedom
## AIC: 9993
##
## Number of Fisher Scoring iterations: 5
Fig. 6.1:
Summary of the results of a logistic regression model with VisitorType, BounceRates1_sqrt, ProductRelated_Duration_trans:
glm1 <- glm(Revenue ~ VisitorType+BounceRates1_sqrt+
ProductRelated_Duration_trans, data = cleaned_shop,
family = binomial)
msummary(glm1)
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.95655 0.09772 -20.02 < 2e-16 ***
## VisitorTypeReturning_Visitor -0.69259 0.08972 -7.72 1.2e-14 ***
## BounceRates1_sqrt -6.25926 0.49451 -12.66 < 2e-16 ***
## ProductRelated_Duration_trans 0.13367 0.00734 18.20 < 2e-16 ***
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7944.5 on 9522 degrees of freedom
## Residual deviance: 7122.1 on 9519 degrees of freedom
## AIC: 7130
##
## Number of Fisher Scoring iterations: 6
Fig. 6.2:
Summary of the results of a logistic regression model with VisitorType, BounceRates1_sqrt, ProductRelated_Duration_trans, and Weekend:
glm2 <- glm(Revenue ~ VisitorType+BounceRates1_sqrt+
ProductRelated_Duration_trans+Weekend, data = cleaned_shop,
family = binomial)
msummary(glm2)
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.98331 0.09976 -19.88 < 2e-16 ***
## VisitorTypeReturning_Visitor -0.68925 0.08976 -7.68 1.6e-14 ***
## BounceRates1_sqrt -6.25064 0.49487 -12.63 < 2e-16 ***
## ProductRelated_Duration_trans 0.13361 0.00735 18.19 < 2e-16 ***
## WeekendTRUE 0.09546 0.06926 1.38 0.17
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7944.5 on 9522 degrees of freedom
## Residual deviance: 7120.2 on 9518 degrees of freedom
## AIC: 7130
##
## Number of Fisher Scoring iterations: 6
Fig. 6.3:
Summary of the results of a logistic regression model with VisitorType, BounceRates1_sqrt, ProductRelated_Duration_trans, Weekend, and an interaction term:
glm3 <- glm(Revenue ~ VisitorType+BounceRates1_sqrt+
ProductRelated_Duration_trans+Weekend+
ProductRelated_Duration_trans*Weekend, data = cleaned_shop,
family = binomial)
msummary(glm3)
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.01723 0.10954 -18.42 < 2e-16
## VisitorTypeReturning_Visitor -0.68789 0.08979 -7.66 1.8e-14
## BounceRates1_sqrt -6.24591 0.49506 -12.62 < 2e-16
## ProductRelated_Duration_trans 0.13664 0.00837 16.32 < 2e-16
## WeekendTRUE 0.22888 0.18828 1.22 0.22
## ProductRelated_Duration_trans:WeekendTRUE -0.01254 0.01650 -0.76 0.45
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7944.5 on 9522 degrees of freedom
## Residual deviance: 7119.7 on 9517 degrees of freedom
## AIC: 7132
##
## Number of Fisher Scoring iterations: 6
Fig. 7:
Concordance index (C-index) tables:
require(survival)
concordance(glm1)
## Call:
## concordance.lm(object = glm1)
##
## n= 9523
## Concordance= 0.73 se= 0.0067
## concordant discordant tied.x tied.y tied.xy
## 8304136 3054614 0 33980253 0
concordance(glm2)
## Call:
## concordance.lm(object = glm2)
##
## n= 9523
## Concordance= 0.73 se= 0.0067
## concordant discordant tied.x tied.y tied.xy
## 8310224 3048526 0 33980253 0
Fig. 8:
95% Confidence Interval:
exp(confint(glm1))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.11652541 0.1709325
## VisitorTypeReturning_Visitor 0.41997625 0.5970675
## BounceRates1_sqrt 0.00071561 0.0049729
## ProductRelated_Duration_trans 1.12674246 1.1596501
Fig. 9:
Cook’s distance vs. leverage diagnostic plot, used to check for influential data points and high-leverage points:
mplot(glm1, which = 6)
## `geom_smooth()` using formula = 'y ~ x'
# Flag high-leverage points with the 3(k+1)/n rule (k = 3 predictors, n = number of fitted observations)
pt <- which(hatvalues(glm1) > (3 * (3 + 1) / nobs(glm1)))
# Flag influential points with a Cook's distance above 0.5
which(cooks.distance(glm1) > 0.5)
## named integer(0)