This project investigates a binary classification problem in e-commerce: predicting purchase outcomes from session-level behaviour. The goal is to identify key behavioural and visitor-related factors associated with conversion, supporting improved marketing efficiency and user experience (Smith & Palmatier, 2020). The Online Shoppers Purchasing Intention dataset is appropriate for this analysis, as it provides session-level observations with a clear binary outcome and a diverse set of behavioural and categorical features.
This project uses the Online Shoppers Purchasing Intention
Dataset, which is publicly accessible via Kaggle and mirrors
the version originally released through the UCI Machine Learning
Repository. The dataset contains 12,330 user sessions
and 18 variables describing browsing behaviour, page
interaction, session timing, and whether the session resulted in a
purchase (Revenue). Each row represents a single online
shopping session, making the dataset appropriate for analysing factors
associated with purchase intention at the session level.
This dataset was selected as it meets the project requirements in terms of size, complexity, and data quality. It contains 12,330 observations, providing a sufficient sample for model training and evaluation. The data also exhibits skewness and outliers in several numerical features, requiring appropriate preprocessing. In addition, the dataset includes a mix of numerical and categorical variables, increasing its complexity and making it suitable for feature engineering and classification analysis.
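library(tidyverse)  # dplyr, tidyr and ggplot2 are used in later chunks
library(gridExtra)  # grid.arrange() for multi-panel figures
raw_data <- read.csv("online_shoppers_intention.csv")
row_count <- nrow(raw_data)
cat("The dataset contains", row_count, "rows.")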
## The dataset contains 12330 rows.
Data integrity checks confirmed that there are no missing values and 125 duplicated rows. Given their small proportion and potential to represent valid repeated behaviour, duplicates were retained for analysis.
# Check if dataset is loaded, if not, skip this chunk
if (exists("raw_data")) {
  # Check missing values
  total_missing <- sum(is.na(raw_data))
  cat("Total missing values in the dataset:", total_missing, "\n")
  # Check duplicate rows
  duplicate_count <- sum(duplicated(raw_data))
  cat("Total duplicated rows in the dataset:", duplicate_count, "\n")
} else {
  cat("Dataset not loaded. Please ensure the file 'online_shoppers_intention.csv' is available and rerun this chunk.\n")
}
## Total missing values in the dataset: 0
## Total duplicated rows in the dataset: 125
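To put the duplicate count in context, the share of duplicated rows can be worked out from the two figures reported above. This is a minimal sketch that uses only those reported counts; the variable names are illustrative and not part of the original analysis.

```r
# Counts taken from the console output above
n_rows <- 12330
n_duplicates <- 125

# Share of rows flagged by duplicated(), expressed as a percentage
duplicate_share <- 100 * n_duplicates / n_rows
cat(sprintf("Duplicated rows: %.2f%% of the dataset\n", duplicate_share))
```

At roughly one percent of sessions, retaining or dropping these rows is unlikely to change the overall picture much.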
Before conducting exploratory analysis and later modelling, a basic preprocessing plan was applied to improve data consistency and make the variables easier to interpret. The main steps were converting categorical variables to factors, reducing skewness in heavily right-skewed duration variables, and standardising selected numerical variables. These steps were intended to prepare the dataset for clearer analysis while retaining the original structure of the data.
The first step was to standardise variable types. Several categorical variables were originally stored as integers or character-like values, which may be misinterpreted as continuous variables if left unchanged. These variables were therefore converted to factors so that they would be treated as categorical levels in subsequent analysis.
clean_data <- raw_data %>%
  mutate(across(c(Month, OperatingSystems, Browser, Region,
                  TrafficType, VisitorType, Weekend, Revenue), as.factor))
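The risk of leaving integer-coded categories as numbers can be seen in a toy example. The sketch below uses made-up codes (the `region_codes` vector is hypothetical, not taken from the dataset) to show that R will readily average category labels unless they are converted to a factor.

```r
# Hypothetical integer codes for a categorical variable such as Region
region_codes <- c(1, 3, 3, 2, 1, 3)

# As a numeric vector, R computes a mean even though it has no meaning for labels
print(mean(region_codes))

# As a factor, the values are treated as discrete levels and can be counted
region_factor <- as.factor(region_codes)
print(levels(region_factor))
print(table(region_factor))
```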
Initial inspection of the dataset showed that several duration-related variables were highly right-skewed, with many small values and a small number of very large observations. To reduce the influence of this skewness and make the distributions more stable for analysis, a log1p() transformation was applied to the main duration variables. The log1p() function was used instead of a standard log transformation because it handles zero values safely.
if (exists("raw_data")) {
  # Largest raw duration, kept for reference before transforming
  before_max <- max(raw_data$ProductRelated_Duration, na.rm = TRUE)
  # log1p(x) = log(1 + x), so zero-length durations map to 0
  clean_data <- clean_data %>%
    mutate(across(c(ProductRelated_Duration, Administrative_Duration, Informational_Duration), log1p))
  # Setting bins explicitly avoids the default-bin message from geom_histogram()
  p1 <- ggplot(raw_data, aes(x = ProductRelated_Duration)) +
    geom_histogram(bins = 30, fill = "grey") +
    labs(title = "Before") +
    theme_minimal(base_size = 8)
  p2 <- ggplot(clean_data, aes(x = ProductRelated_Duration)) +
    geom_histogram(bins = 30, fill = "blue") +
    labs(title = "After") +
    theme_minimal(base_size = 8)
  grid.arrange(p1, p2, ncol = 2)
}
The figure above compares the distribution of ProductRelated_Duration before and after applying log1p().
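As a small illustration of why log1p() suits durations that include zeros, the sketch below applies both transformations to a handful of invented values. The `durations` vector is an assumption made for the example and does not come from the dataset.

```r
# Illustrative durations in seconds, including a zero-length visit
durations <- c(0, 10, 100, 1000, 10000)

log_values   <- log(durations)    # log(0) evaluates to -Inf
log1p_values <- log1p(durations)  # log(1 + 0) evaluates to 0

# The plain log produces a non-finite value for the zero duration
print(is.finite(log_values))

# log1p keeps every value finite while still compressing the long right tail
print(is.finite(log1p_values))
print(round(log1p_values, 2))
```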
The numerical variables were measured on very different scales. For example, count variables, duration variables, and page-related indicators had substantially different ranges. To make these variables more comparable, selected numerical features were standardised using z-score scaling. This centres each variable around zero and rescales it to have unit variance.
# Define numeric columns
numeric_cols <- c("Administrative", "Administrative_Duration",
                  "Informational", "Informational_Duration",
                  "ProductRelated", "ProductRelated_Duration",
                  "BounceRates", "ExitRates", "PageValues")
# Remove problematic columns (sd = 0)
valid_cols <- numeric_cols[
  sapply(clean_data[numeric_cols], function(x) sd(x, na.rm = TRUE) > 0)
]
# Apply scaling
clean_data[valid_cols] <- scale(clean_data[valid_cols])
All selected numerical variables were standardised to have approximately zero mean and unit variance.
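The z-score transformation performed by scale() can be checked by hand on a small vector. The following is a minimal sketch with hypothetical values; it only confirms that scale() matches the manual formula (subtract the mean, divide by the sample standard deviation).

```r
# Hypothetical page counts for five sessions
x <- c(2, 4, 4, 6, 9)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops the extra attributes
z_scaled <- as.numeric(scale(x))

print(all.equal(z_manual, z_scaled))
print(c(mean = mean(z_scaled), sd = sd(z_scaled)))
```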
Exploratory Data Analysis (EDA) was conducted to identify key patterns associated with purchasing behaviour. The dataset contains 12,330 observations and 18 variables with no missing values. To ensure clarity and conciseness, the analysis is summarised using four compact figures covering numerical distributions, correlations between numerical variables, categorical patterns, and key relationships with the target variable (Revenue).
num_vars <- raw_data %>% select(where(is.numeric))
# Histogram
p1 <- num_vars %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  facet_wrap(~Variable, scales = "free", ncol = 4) +
  theme_minimal(base_size = 7) +
  labs(title = "Numerical Distributions")
# Boxplot
p2 <- num_vars %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "orange") +
  coord_flip() +
  theme_minimal(base_size = 7) +
  labs(title = "Outliers")
gridExtra::grid.arrange(p1, p2, ncol = 2)
Figure 1. Most numerical variables are right-skewed with noticeable
outliers, especially in duration-related features, indicating
heterogeneous browsing behaviour.
cor_matrix <- cor(num_vars)
cor_df <- as.data.frame(as.table(cor_matrix))
ggplot(cor_df, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  theme_minimal(base_size = 7) +
  labs(title = "Correlation Heatmap", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Figure 2. Strong correlations exist between related features such as
ProductRelated and ProductRelated_Duration, suggesting overlapping
behavioural information.
cat_vars_small <- raw_data %>%
  select(VisitorType, Weekend) %>%
  mutate(across(everything(), as.character))
p3 <- cat_vars_small %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_bar(fill = "steelblue") +
  facet_wrap(~Variable, scales = "free", ncol = 2) +
  theme_minimal(base_size = 7) +
  labs(title = "Categorical Distributions") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
p4 <- ggplot(raw_data, aes(x = VisitorType, fill = Revenue)) +
  geom_bar(position = "fill") +
  theme_minimal(base_size = 7) +
  labs(title = "VisitorType vs Revenue", y = "Proportion")
p5 <- ggplot(raw_data, aes(x = Revenue, fill = Revenue)) +
  geom_bar() +
  theme_minimal(base_size = 7) +
  theme(legend.position = "none") +
  labs(title = "Revenue Distribution")
gridExtra::grid.arrange(p3, p4, p5, ncol = 2)
Figure 3. Returning visitors dominate the dataset and show higher
conversion proportions. Revenue distribution indicates class
imbalance.
p6 <- ggplot(raw_data, aes(x = Revenue, y = PageValues, fill = Revenue)) +
  geom_boxplot() +
  theme_minimal(base_size = 7) +
  labs(title = "PageValues")
p7 <- ggplot(raw_data, aes(x = Revenue, y = BounceRates, fill = Revenue)) +
  geom_boxplot() +
  theme_minimal(base_size = 7) +
  labs(title = "BounceRates")
p8 <- ggplot(raw_data, aes(x = Revenue, y = ProductRelated, fill = Revenue)) +
  geom_boxplot() +
  theme_minimal(base_size = 7) +
  labs(title = "ProductRelated")
gridExtra::grid.arrange(p6, p7, p8, ncol = 3)
Figure 4. PageValues shows the strongest positive relationship with
Revenue, while BounceRates is negatively associated with conversion.
The EDA suggests that engagement-related variables are the features most strongly associated with purchasing behaviour. PageValues shows the strongest positive association with conversion, while BounceRates shows a negative one. These findings guide feature selection for subsequent modelling.
The target variable, Revenue, is a binary classification
outcome indicating whether an online shopping session resulted in a
purchase. Because the EDA shows nonlinear patterns, skewed numeric
variables, categorical predictors, and class imbalance, the modelling
stage will compare a set of interpretable baseline models and more
flexible machine learning models. All models will be trained using the
same training/test split and cross-validation procedure in RStudio after
the preprocessing steps described above.
The following classification models are proposed:
[...] PageValues, ExitRates, and VisitorType increase or decrease the probability of a purchase. Factor encoding and scaling will be applied before training.
[...] PageValues, low BounceRates, or returning visitor status.
[...] caret or tidymodels; in particular, caret provides a consistent framework for model training and tuning in R. The same resampling folds should be reused across models so that performance comparisons are fair. Class imbalance will be addressed through stratified sampling and, if needed, resampling methods such as up-sampling, down-sampling, or SMOTE applied only within the training folds.
The primary evaluation concern is that most sessions do not generate
revenue, so overall accuracy may be misleading under class imbalance. A
model that predicts Revenue = FALSE for nearly all sessions
could achieve reasonable accuracy while failing to identify valuable
purchasing sessions. Therefore, model performance will be assessed using
metrics that reflect both class imbalance and the business value of
identifying high-intent shoppers.
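The point about misleading accuracy can be made concrete with a toy calculation. The sketch below assumes a purchase rate of 15% purely for illustration (the actual class proportions should be read from the data) and evaluates a trivial classifier that never predicts a purchase.

```r
# Hypothetical labels: 150 purchasing sessions out of 1,000 (assumed rate)
actual <- c(rep(TRUE, 150), rep(FALSE, 850))

# A trivial "model" that predicts Revenue = FALSE for every session
predicted <- rep(FALSE, length(actual))

# Accuracy rewards the majority class; recall exposes the failure
accuracy <- mean(predicted == actual)
recall   <- sum(predicted & actual) / sum(actual)

cat("Accuracy:", accuracy, "\n")
cat("Recall for Revenue = TRUE:", recall, "\n")
```

Under these assumed proportions the classifier scores 85% accuracy while identifying none of the purchasing sessions.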
The dataset will be split into training and test sets using stratified sampling so that the proportion of revenue and non-revenue sessions is preserved in both sets. Model tuning will be performed on the training data using repeated k-fold cross-validation. The final model comparison will then be reported on the unseen test set to provide an unbiased estimate of performance.
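A stratified split of this kind can be sketched in base R as follows. The data frame `sessions`, the 80/20 ratio, the 15% positive rate and the seed are all assumptions made for the example; in practice a helper such as caret's createDataPartition() could be used instead.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Synthetic stand-in for the session data (15% positives is an assumption)
sessions <- data.frame(
  id = 1:1000,
  Revenue = factor(c(rep("TRUE", 150), rep("FALSE", 850)))
)

# Sample 80% of the row indices separately within each Revenue class
train_idx <- unlist(lapply(
  split(seq_len(nrow(sessions)), sessions$Revenue),
  function(idx) sample(idx, size = round(0.8 * length(idx)))
), use.names = FALSE)

train_set <- sessions[train_idx, ]
test_set  <- sessions[-train_idx, ]

# The class proportions should match in both subsets
print(prop.table(table(train_set$Revenue)))
print(prop.table(table(test_set$Revenue)))
```

Because sampling happens within each class, both subsets keep the same share of purchasing sessions as the full data.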
The following evaluation metrics will be used:
Confusion Matrix: This will summarise true positives, false positives, true negatives, and false negatives. It is important for understanding the business consequences of each error type.
Recall/Sensitivity for
Revenue = TRUE: This will measure how many actual
purchasing sessions the model correctly identifies. It is important
because missing high-intent customers could reduce the value of targeted
marketing actions.
Precision for Revenue = TRUE: This
will measure how reliable the positive predictions are. It matters
because too many false positives could waste promotional discounts or
marketing resources on visitors unlikely to purchase.
F1-score: This balances precision and recall and is more appropriate than accuracy when the minority class is important.
ROC-AUC: This will assess the model’s ability to rank purchasing sessions above non-purchasing sessions across different classification thresholds.
PR-AUC: Precision-recall AUC will be included because it is especially informative for imbalanced binary classification, where the positive class is the main focus.
Balanced Accuracy: This will average sensitivity and specificity, reducing the risk that the majority non-revenue class dominates the evaluation.
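The metrics listed above can all be derived from predicted scores and true labels. The following base R sketch uses ten invented sessions and a 0.5 threshold, both assumptions for illustration; the rank-based ROC-AUC shown here is the standard Mann-Whitney formulation rather than output from any specific package.

```r
# Hypothetical true labels and predicted purchase probabilities for ten sessions
actual <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
scores <- c(0.90, 0.80, 0.70, 0.40, 0.60, 0.30, 0.25, 0.20, 0.10, 0.05)

# Classify as a purchase when the score exceeds the (assumed) 0.5 threshold
predicted <- scores > 0.5

# Confusion matrix cells
tp <- sum(predicted & actual)
fp <- sum(predicted & !actual)
fn <- sum(!predicted & actual)
tn <- sum(!predicted & !actual)

precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)   # sensitivity
specificity <- tn / (tn + fp)
f1          <- 2 * precision * recall / (precision + recall)
balanced_accuracy <- (recall + specificity) / 2

# Rank-based ROC-AUC: share of positive/negative pairs ranked correctly
n_pos <- sum(actual)
n_neg <- sum(!actual)
auc <- (sum(rank(scores)[actual]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(round(c(precision = precision, recall = recall, f1 = f1,
              balanced_accuracy = balanced_accuracy, auc = auc), 3))
```

Changing the threshold moves precision and recall in opposite directions, while the AUC depends only on how the scores rank the sessions.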
The final model will not be selected only by the highest accuracy. Priority will be given to a model with strong recall and F1-score for the purchasing class, a competitive PR-AUC, and acceptable precision. If two models perform similarly, the simpler and more interpretable model will be preferred, especially if it provides clear insights into which session attributes most influence purchase intention.
I acknowledge the use of ChatGPT (OpenAI, GPT-5.3, https://chat.openai.com/) to assist with grammar refinement, code debugging, and minor data-related queries. All analytical decisions, interpretations, and final conclusions were developed independently.