1. Framing & Dataset Selection

This project investigates a binary classification problem in e-commerce: predicting whether an online shopping session ends in a purchase, based on session-level behaviour. The goal is to identify the key behavioural and visitor-related factors associated with conversion, supporting improved marketing efficiency and user experience (Smith & Palmatier, 2020). The Online Shoppers Purchasing Intention dataset is appropriate for this analysis, as it provides session-level observations with a clear binary outcome and a diverse set of behavioural and categorical features.

2. Consider & Gather Data

2.1 Data Source

This project uses the Online Shoppers Purchasing Intention Dataset (Sakar et al., 2019), which is publicly accessible via Kaggle and mirrors the version originally released through the UCI Machine Learning Repository. The dataset contains 12,330 user sessions and 18 variables describing browsing behaviour, page interaction, session timing, and whether the session resulted in a purchase (Revenue). Each row represents a single online shopping session, making the dataset appropriate for analysing factors associated with purchase intention at the session level.

2.2 Why This Dataset Was Selected

This dataset was selected as it meets the project requirements in terms of size, complexity, and data quality. It contains 12,330 observations, providing a sufficient sample for model training and evaluation. The data also exhibits skewness and outliers in several numerical features, requiring appropriate preprocessing. In addition, the dataset includes a mix of numerical and categorical variables, increasing its complexity and making it suitable for feature engineering and classification analysis.

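library(tidyverse)  # dplyr, tidyr and ggplot2 are used in later chunks
library(gridExtra)  # grid.arrange() for multi-panel figures
raw_data <- read.csv("online_shoppers_intention.csv")
row_count <- nrow(raw_data)
cat("The dataset contains", row_count, "rows.")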

## The dataset contains 12330 rows.

2.3 Data Integrity and Suitability

Data integrity checks found no missing values and 125 duplicated rows (about 1% of the 12,330 sessions). Because this share is small and identical rows can reflect valid repeated behaviour rather than recording errors, the duplicates were retained for analysis.

# Check if dataset is loaded, if not, skip this chunk
if (exists("raw_data")) {
  # Check missing values
  total_missing <- sum(is.na(raw_data))
  cat("Total missing values in the dataset:", total_missing, "\n")
  
  # Check duplicate rows
  duplicate_count <- sum(duplicated(raw_data))
  cat("Total duplicated rows in the dataset:", duplicate_count, "\n")
} else {
  cat("Dataset not loaded. Please ensure the file 'online_shoppers_intention.csv' is available and rerun this chunk.\n")
}

## Total missing values in the dataset: 0 
## Total duplicated rows in the dataset: 125
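As a small illustration of what the duplicate count above measures, the sketch below uses a toy data frame with made-up values (not taken from the real dataset). It shows that duplicated() flags only the second and later copies of a row, so the reported count refers to extra copies rather than to every row that has a twin.

```r
# Toy data with hypothetical values; rows 1-3 are identical
toy <- data.frame(
  ProductRelated = c(1, 1, 1, 5, 7),
  BounceRates    = c(0.2, 0.2, 0.2, 0.0, 0.1)
)

# duplicated() marks the second and later copies of a row, never the first
dup_flags <- duplicated(toy)
n_dup     <- sum(dup_flags)      # number of extra copies
dup_share <- n_dup / nrow(toy)   # share of rows that are repeats

print(dup_flags)
cat("Duplicated rows:", n_dup, "| share:", dup_share, "\n")

# If removal were preferred instead, the repeats could be dropped like this
deduped <- toy[!dup_flags, ]
cat("Rows after removing repeats:", nrow(deduped), "\n")
```

In this report the duplicates are kept, so the last step is shown only as the alternative that was not taken.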

3. Data Preprocessing Plan

Before exploratory analysis and later modelling, a set of basic preprocessing steps was applied to improve data consistency and make the variables easier to interpret. The main steps were converting categorical variables to factors, reducing skewness in the heavily right-skewed duration variables, and standardising selected numerical variables. These steps were intended to prepare the dataset for clearer analysis while retaining the original structure of the data.

3.1 Factor Conversion

The first step was to make variable types consistent. Several categorical variables were originally stored as integer codes or character-like values, and the integer-coded ones could be misinterpreted as continuous variables if left unchanged. These variables were therefore converted to factors so that they would be treated as categorical levels in subsequent analysis.

clean_data <- raw_data %>%
  mutate(across(c(Month, OperatingSystems, Browser, Region,
                  TrafficType, VisitorType, Weekend, Revenue), as.factor))

3.2 Handling Outliers and Skewness

Initial inspection of the dataset showed that several duration-related variables were highly right-skewed, with many small values and a small number of very large observations. To reduce the influence of this skewness and make the distributions more stable for analysis, a log1p() transformation was applied to the main duration variables. The log1p() function computes log(1 + x) and was used instead of a standard log transformation because it maps zero durations to zero, whereas log(0) is not finite.

if (exists("raw_data")) {
  # Log-transform the three duration variables; log1p(x) = log(1 + x) keeps zeros at zero
  clean_data <- clean_data %>%
    mutate(across(c(ProductRelated_Duration, Administrative_Duration, Informational_Duration), log1p))

  # Distribution on the original scale
  p1 <- ggplot(raw_data, aes(x = ProductRelated_Duration)) +
    geom_histogram(bins = 30, fill = "grey") +
    labs(title = "Before") +
    theme_minimal(base_size = 8)

  # Distribution after the log1p() transformation
  p2 <- ggplot(clean_data, aes(x = ProductRelated_Duration)) +
    geom_histogram(bins = 30, fill = "blue") +
    labs(title = "After") +
    theme_minimal(base_size = 8)

  # Show both panels side by side
  grid.arrange(p1, p2, ncol = 2)
}

The two histograms above compare the distribution of ProductRelated_Duration before and after applying log1p().
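The difference between log() and log1p() at zero can be checked directly. The sketch below uses a short vector of hypothetical session durations (assumed values, not drawn from the dataset) and also shows that expm1() reverses the transformation when results need to be reported in the original units.

```r
# Hypothetical durations in seconds, including zero-length sessions
durations <- c(0, 0, 12, 64, 300, 2500, 64000)

# log(0) is -Inf, which would break histograms and scaling
print(log(durations)[1])

# log1p(x) computes log(1 + x), so a zero duration stays at zero
print(log1p(durations)[1])

transformed <- log1p(durations)

# expm1() is the inverse of log1p(), recovering the original values
recovered <- expm1(transformed)

# The long right tail is compressed to a much narrower range
print(range(durations))
print(range(transformed))
```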

3.3 Scaling Numerical Features

The numerical variables were measured on very different scales. For example, count variables, duration variables, and page-related indicators had substantially different ranges. To make these variables more comparable, selected numerical features were standardised using z-score scaling, z = (x - mean) / sd. This centres each variable around zero and rescales it to have unit variance.

# Define numeric columns
numeric_cols <- c("Administrative", "Administrative_Duration", 
                  "Informational", "Informational_Duration", 
                  "ProductRelated", "ProductRelated_Duration", 
                  "BounceRates", "ExitRates", "PageValues")

# Remove problematic columns (sd = 0)
valid_cols <- numeric_cols[
  sapply(clean_data[numeric_cols], function(x) sd(x, na.rm = TRUE) > 0)
]

# Apply scaling
clean_data[valid_cols] <- scale(clean_data[valid_cols])

All selected numerical variables were standardised to have approximately zero mean and unit variance.
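As a quick check of what scale() does, the sketch below standardises a small vector of hypothetical values by hand and compares the result with scale(). It assumes the default arguments, under which scale() subtracts the mean and divides by the sample standard deviation.

```r
# Hypothetical values for a single numeric feature
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; convert it to a plain vector to compare
z_scale <- as.numeric(scale(x))

print(all.equal(z_manual, z_scale))

# After standardisation the mean is zero (up to rounding) and the sd is one
print(mean(z_manual))
print(sd(z_manual))
```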

4. Exploratory Data Analysis (EDA)

4.1 Overview

Exploratory Data Analysis (EDA) was conducted to identify key patterns associated with purchasing behaviour. As noted above, the dataset contains 12,330 observations and 18 variables with no missing values. For clarity and concision, the analysis is summarised in four compact figures covering numerical distributions, correlations between numerical variables, categorical patterns, and key relationships with the target variable (Revenue).

4.2 Numerical Overview

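# Keep genuinely numeric variables; integer-coded categoricals are excluded
num_vars <- raw_data %>%
  select(where(is.numeric)) %>%
  select(-any_of(c("OperatingSystems", "Browser", "Region", "TrafficType")))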

# Histogram
p1 <- num_vars %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  facet_wrap(~Variable, scales = "free", ncol = 4) +
  theme_minimal(base_size = 7) +
  labs(title = "Numerical Distributions")

# Boxplot
p2 <- num_vars %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "orange") +
  coord_flip() +
  theme_minimal(base_size = 7) +
  labs(title = "Outliers")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Figure 1. Most numerical variables are right-skewed with noticeable outliers, especially in duration-related features, indicating heterogeneous browsing behaviour.

cor_matrix <- cor(num_vars)
cor_df <- as.data.frame(as.table(cor_matrix))

ggplot(cor_df, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  theme_minimal(base_size = 7) +
  labs(title = "Correlation Heatmap", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure 2. Strong correlations exist between related features such as ProductRelated and ProductRelated_Duration, suggesting overlapping behavioural information.

4.3 Categorical Overview

cat_vars_small <- raw_data %>%
  select(VisitorType, Weekend) %>%
  mutate(across(everything(), as.character))

p3 <- cat_vars_small %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_bar(fill = "steelblue") +
  facet_wrap(~Variable, scales = "free", ncol = 2) +
  theme_minimal(base_size = 7) +
  labs(title = "Categorical Distributions") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

p4 <- ggplot(raw_data, aes(x = VisitorType, fill = Revenue)) +
  geom_bar(position = "fill") +
  theme_minimal(base_size = 7) +
  labs(title = "VisitorType vs Revenue", y = "Proportion")

p5 <- ggplot(raw_data, aes(x = Revenue, fill = Revenue)) +
  geom_bar() +
  theme_minimal(base_size = 7) +
  theme(legend.position = "none") +
  labs(title = "Revenue Distribution")

gridExtra::grid.arrange(p3, p4, p5, ncol = 2)

Figure 3. Returning visitors dominate the dataset and show higher conversion proportions. Revenue distribution indicates class imbalance.

4.4 Key Relationships with Revenue

p6 <- ggplot(raw_data, aes(x = Revenue, y = PageValues, fill = Revenue)) +
  geom_boxplot() +
  theme_minimal(base_size = 7) +
  labs(title = "PageValues")

p7 <- ggplot(raw_data, aes(x = Revenue, y = BounceRates, fill = Revenue)) +
  geom_boxplot() +
  theme_minimal(base_size = 7) +
  labs(title = "BounceRates")

p8 <- ggplot(raw_data, aes(x = Revenue, y = ProductRelated, fill = Revenue)) +
  geom_boxplot() +
  theme_minimal(base_size = 7) +
  labs(title = "ProductRelated")

gridExtra::grid.arrange(p6, p7, p8, ncol = 3)

Figure 4. PageValues shows the strongest positive relationship with Revenue, while BounceRates is negatively associated with conversion.

4.5 Conclusion

EDA indicates that engagement-related variables are the ones most strongly associated with purchasing behaviour. Among the variables examined, PageValues shows the strongest positive association with conversion, while BounceRates is negatively associated with it. Because these are descriptive patterns rather than model-based estimates, they will guide feature selection and be tested in subsequent modelling.

5. Project Planning

5.1 Model Plan

The target variable, Revenue, is a binary classification outcome indicating whether an online shopping session resulted in a purchase. Because the EDA shows nonlinear patterns, skewed numeric variables, categorical predictors, and class imbalance, the modelling stage will compare a set of interpretable baseline models and more flexible machine learning models. All models will be trained using the same training/test split and cross-validation procedure in RStudio after the preprocessing steps described above.

The following classification models are proposed:

Logistic Regression: This will be used as the baseline model because the response variable is binary and the fitted coefficients can be interpreted directly. It will help identify whether variables such as PageValues, ExitRates, and VisitorType increase or decrease the probability of a purchase. Factor encoding and scaling will be applied before training.
Penalised Logistic Regression (LASSO/Ridge): This model is suitable because the dataset contains a mixture of numeric and categorical predictors, and one-hot encoding can increase the number of model terms. Regularisation can reduce overfitting, support feature selection, and improve generalisation compared with an unpenalised logistic regression.
Decision Tree: A decision tree is appropriate because it can capture nonlinear relationships and interaction effects without requiring strong parametric assumptions. It is also easy to explain to business stakeholders, for example through rules involving high PageValues, low BounceRates, or returning visitor status.
Random Forest: A random forest will be used to improve predictive performance over a single decision tree. It is robust to outliers and nonlinear relationships and can provide variable importance scores, which directly supports the research question about the most significant predictors of purchase intention.
Gradient Boosting Machine, such as XGBoost: Boosting is well suited to this problem because it can model complex interactions among browsing behaviour variables and usually performs well on structured tabular data. Hyperparameters such as tree depth, learning rate, and number of trees will be tuned using cross-validation to manage overfitting.
Support Vector Machine: An SVM with a radial basis kernel will be considered because the purchase boundary may be nonlinear. Scaled numeric features and encoded categorical variables will be required. The SVM result will be compared with tree-based models to assess whether a margin-based classifier improves performance on the minority purchase class.

For implementation in RStudio, the modelling workflow can be managed using packages such as caret or tidymodels; in particular, caret provides a consistent framework for model training and tuning in R. The same resampling folds should be reused across models so that performance comparisons are fair. Class imbalance will be addressed through stratified sampling and, if needed, resampling methods such as up-sampling, down-sampling, or SMOTE applied only within the training folds.
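To illustrate how the baseline model's coefficients would be read, the sketch below fits a logistic regression with base R's glm() on simulated data. Everything about the data is an assumption made for illustration: the variable names page_value, bounce and returning, the sample size, and the coefficients used to generate the outcome are hypothetical and do not come from the real dataset.

```r
set.seed(42)
n <- 2000

# Simulated predictors (hypothetical stand-ins for PageValues, BounceRates, VisitorType)
sim <- data.frame(
  page_value = rexp(n, rate = 0.2),
  bounce     = runif(n, 0, 0.2),
  returning  = factor(sample(c("New", "Returning"), n,
                             replace = TRUE, prob = c(0.15, 0.85)))
)

# Assumed data-generating process: higher page value raises purchase odds,
# higher bounce rate lowers them, visitor type has no effect
lin_pred     <- -2.5 + 0.15 * sim$page_value - 8 * sim$bounce
sim$purchase <- rbinom(n, 1, plogis(lin_pred))

# Baseline logistic regression
fit <- glm(purchase ~ page_value + bounce + returning,
           data = sim, family = binomial)

# Exponentiated coefficients are odds ratios:
# above 1 means higher purchase odds, below 1 means lower
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 3))
```

On the real data the same call would use the preprocessed predictors, and the signs and sizes of the coefficients would be an empirical result rather than something fixed in advance as it is here.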

5.2 Evaluation Plan

The primary evaluation concern is that most sessions do not generate revenue, so overall accuracy may be misleading under class imbalance. A model that predicts Revenue = FALSE for nearly all sessions could achieve reasonable accuracy while failing to identify valuable purchasing sessions. Therefore, model performance will be assessed using metrics that reflect both class imbalance and the business value of identifying high-intent shoppers.
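The point about misleading accuracy can be made concrete with a short sketch. It assumes a hypothetical set of 1,000 sessions in which 150 end in a purchase; this share is chosen for illustration and is not a figure reported in this document.

```r
# Hypothetical outcome: 150 purchasing sessions out of 1,000 (assumed share)
actual <- c(rep(TRUE, 150), rep(FALSE, 850))

# A trivial classifier that predicts "no purchase" for every session
always_false <- rep(FALSE, length(actual))

# Accuracy looks respectable because the majority class dominates
accuracy <- mean(always_false == actual)

# Recall for the purchase class shows that no buyer is ever identified
recall <- sum(always_false & actual) / sum(actual)

cat("Accuracy:", accuracy, "| Recall for purchases:", recall, "\n")
```

Under these assumptions the classifier reaches 85% accuracy with a recall of zero, which is why accuracy alone is not used for model selection.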

The dataset will be split into training and test sets using stratified sampling so that the proportion of revenue and non-revenue sessions is preserved in both sets. Model tuning will be performed on the training data using repeated k-fold cross-validation. The final model comparison will then be reported on the unseen test set to provide an unbiased estimate of performance.
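A minimal base R sketch of the split described above is given below. It assumes an 80/20 split, five folds, and a hypothetical outcome vector with 15% positives; none of these values is fixed by this document, and in practice caret or tidymodels would provide equivalent helpers.

```r
set.seed(123)

# Hypothetical outcome with an assumed 15% purchase share
revenue <- factor(c(rep("TRUE", 150), rep("FALSE", 850)))

# Stratified 80/20 split: sample row indices separately within each class
train_idx <- unlist(lapply(split(seq_along(revenue), revenue), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))
test_idx <- setdiff(seq_along(revenue), train_idx)

# Class proportions are preserved in both sets
print(prop.table(table(revenue[train_idx])))
print(prop.table(table(revenue[test_idx])))

# Stratified 5-fold assignment on the training rows only;
# the same fold_id vector would be reused for every model
k       <- 5
train_y <- revenue[train_idx]
fold_id <- integer(length(train_idx))
for (cls in levels(train_y)) {
  rows <- which(train_y == cls)
  fold_id[rows] <- sample(rep(seq_len(k), length.out = length(rows)))
}
print(table(fold_id, train_y))
```

Fixing the fold assignment once, before any model is fitted, is what keeps the later comparisons fair: each model sees exactly the same training and validation rows.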

The following evaluation metrics will be used:

Confusion Matrix: This will summarise true positives, false positives, true negatives, and false negatives. It is important for understanding the business consequences of each error type.
Recall/Sensitivity for Revenue = TRUE: This will measure how many actual purchasing sessions the model correctly identifies. It is important because missing high-intent customers could reduce the value of targeted marketing actions.
Precision for Revenue = TRUE: This will measure how reliable the positive predictions are. It matters because too many false positives could waste promotional discounts or marketing resources on visitors unlikely to purchase.
F1-score: This balances precision and recall and is more appropriate than accuracy when the minority class is important.
ROC-AUC: This will assess the model’s ability to rank purchasing sessions above non-purchasing sessions across different classification thresholds.
PR-AUC: Precision-recall AUC will be included because it is especially informative for imbalanced binary classification, where the positive class is the main focus.
Balanced Accuracy: This will average sensitivity and specificity, reducing the risk that the majority non-revenue class dominates the evaluation.
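The threshold-based metrics in this list can all be derived from the four cells of the confusion matrix. The sketch below works through them for 20 hypothetical sessions with made-up labels and predictions; ROC-AUC and PR-AUC are not shown because they require predicted scores rather than hard class labels.

```r
# Hypothetical labels and predictions for 20 sessions (TRUE = purchase)
actual    <- c(rep(TRUE, 5), rep(FALSE, 15))
predicted <- c(TRUE, TRUE, TRUE, FALSE, FALSE,   # purchases: 3 found, 2 missed
               TRUE, TRUE, rep(FALSE, 13))       # non-purchases: 2 false alarms

# Cells of the confusion matrix
tp <- sum(predicted & actual)
fp <- sum(predicted & !actual)
fn <- sum(!predicted & actual)
tn <- sum(!predicted & !actual)

# Metrics for the purchase class
accuracy          <- (tp + tn) / length(actual)
precision         <- tp / (tp + fp)
recall            <- tp / (tp + fn)
specificity       <- tn / (tn + fp)
f1                <- 2 * precision * recall / (precision + recall)
balanced_accuracy <- (recall + specificity) / 2

print(table(Predicted = predicted, Actual = actual))
print(round(c(accuracy = accuracy, precision = precision, recall = recall,
              f1 = f1, balanced_accuracy = balanced_accuracy), 3))
```

In this toy case accuracy is 0.80 while precision, recall and F1 for the purchase class are all 0.60, which shows how accuracy can overstate performance on the minority class.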

The final model will not be selected only by the highest accuracy. Priority will be given to a model with strong recall and F1-score for the purchasing class, a competitive PR-AUC, and acceptable precision. If two models perform similarly, the simpler and more interpretable model will be preferred, especially if it provides clear insights into which session attributes most influence purchase intention.

References

Sakar, C. O., Polat, S. O., Katircioglu, M., & Kastro, Y. (2019). Real-time prediction of online shoppers’ purchasing intention using multilayer perceptron and LSTM recurrent neural networks. Neural Computing and Applications, 31(10), 6893–6908. https://doi.org/10.1007/s00521-018-3523-0
Smith, A. P., & Palmatier, R. W. (2020). Harnessing AI and machine learning for personalized marketing in e-commerce. Journal of Marketing Research, 57(3), 401–418. https://doi.org/10.1177/0022243720910078

AI Use Statement

I acknowledge the use of ChatGPT (OpenAI, GPT-5.3, https://chat.openai.com/) to assist with grammar refinement, code debugging, and minor data-related queries. All analytical decisions, interpretations, and final conclusions were developed independently.

Online Shopper Analysis - Data Cleaning and EDA Report

W06G02

2026-04-11