Online Shoppers Purchasing Intention

1. Introduction

In the competitive landscape of online commerce, data-driven decision-making has become essential for improving user engagement and driving revenue growth. Businesses collect vast amounts of behavioral and technical data from user sessions, yet much of this information goes underutilized when it comes to real-time decision-making. By leveraging predictive modeling, organizations can move beyond reactive analysis and begin to anticipate user behavior, allowing for smarter targeting, more personalized interactions, and improved operational efficiency.

This project explores the use of machine learning to predict which user sessions are most likely to result in revenue. The primary objective is to build and compare models that can classify sessions as revenue-generating or not, based on available session-level features. By identifying high-value sessions in advance, the business can prioritize these users for strategic actions such as dynamic promotions, live assistance, or personalized recommendations.

Through preprocessing, model tuning, and evaluation, this project provides a practical framework for applying predictive analytics to session-level web data. The insights generated aim to support real-time marketing and customer engagement strategies, ultimately helping the business convert more users into paying customers.

2. Business Problem and Objective

Online retailers often collect detailed session-level data, including how users interact with different types of pages, how long they stay, what devices they use, and whether they are new or returning visitors. Despite having access to this information, many businesses still struggle to identify which sessions are likely to result in a purchase. As a result, marketing campaigns and customer engagement strategies are applied uniformly across all sessions, regardless of purchase intent. This leads to inefficient use of resources, low conversion rates, and missed revenue opportunities.

The core business problem is that the company cannot differentiate between high-intent and low-intent user sessions in real time. Without this distinction, it becomes difficult to prioritize efforts, deliver targeted experiences, or intervene effectively during the session to influence purchasing behavior.

To address this issue, the objective of this project is to build a predictive model that determines whether a user session will lead to a purchase. This is framed as a binary classification task, using historical session data from an e-commerce platform. The dataset includes features such as bounce rates, exit rates, time spent on various page types, traffic sources, user types, and session timing. By training machine learning models on this data, we aim to predict purchase intent before a transaction occurs.

These predictions can help the business act more strategically during live sessions, whether by offering personalized promotions, prioritizing customer service resources, or refining retargeting efforts. Ultimately, the goal is to increase revenue, reduce wasted effort, and improve the efficiency of digital engagement.

3. Exploratory Data Analysis

Before modeling, exploratory data analysis (EDA) was performed to understand the dataset’s structure, distribution, and potential issues such as skewness or outliers. Since the data captures session-level behavior from an online retail platform, the analysis focused on patterns in navigation, engagement time, and technical attributes that may relate to purchase likelihood.

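library(tidyverse)  # dplyr, ggplot2, forcats, purrr
library(janitor)    # clean_names()

# Load the session data and recode the logical flags as 0/1 factors
online_attention <- read.csv("online_shoppers_intention.csv", sep = ",", stringsAsFactors = TRUE) |>
  clean_names() |>
  mutate(revenue = factor(if_else(revenue == TRUE, 1, 0)),
         weekend = factor(if_else(weekend == TRUE, 1, 0)))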

month_levels <- c("Jan","Feb", "Mar", "Apr", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Reorder the month factor
online_attention$month <- factor(online_attention$month, levels = month_levels, ordered = TRUE)

grid.arrange(missing1, missing2, ncol = 2)

The dataset contains no missing values: every row is complete, and every variable, whether categorical or numerical, is fully populated. Modeling can therefore proceed without imputation.
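A completeness check of this kind can be sketched in base R. The small data frame below is a hypothetical stand-in for online_attention, and count_missing is an illustrative helper name rather than a function used in the report.

```r
# Hypothetical stand-in for online_attention (three sessions, three columns)
toy_sessions <- data.frame(
  administrative = c(0, 2, 5),
  bounce_rates   = c(0.2, 0.0, 0.01),
  visitor_type   = factor(c("Returning_Visitor", "New_Visitor", "Returning_Visitor"))
)

# Count NA values per column; a fully populated dataset gives all zeros
count_missing <- function(df) {
  vapply(df, function(col) sum(is.na(col)), integer(1))
}

missing_counts <- count_missing(toy_sessions)
print(missing_counts)

# TRUE when every row is complete
all_complete <- all(complete.cases(toy_sessions))
print(all_complete)
```

Applied to the real data, the same two checks would be expected to return zeros for every column and TRUE, in line with the statement above.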

online_attention |> 
  select(where(is.numeric)) |> 
  skim() |> 
  select(skim_variable, n_missing, complete_rate, numeric.mean, numeric.sd, 
         numeric.p0, numeric.p25, numeric.p50, numeric.p75, numeric.p100) |> 
  rename(
    Variable = skim_variable,
    `Missing Values` = n_missing,
    `Completeness (proportion)` = complete_rate,
    `Mean` = numeric.mean,
    `Standard Deviation` = numeric.sd,
    `Min` = numeric.p0,
    `25th Percentile` = numeric.p25,
    `Median (50th Pct)` = numeric.p50,
    `75th Percentile` = numeric.p75,
    `Max` = numeric.p100
  ) |> 
  kable(caption = "Summary Statistics for Numerical Variables") |> 
  kable_styling() |> 
  kable_classic()

Summary Statistics for Numerical Variables
Variable	Completeness (proportion)	Mean	Standard Deviation	Min	25th Percentile	Median (50th Pct)	75th Percentile	Max
administrative	1	2.3151663	3.3217841	0	0.0000000	1.0000000	4.0000000	27.0000
administrative_duration	1	80.8186105	176.7791075	0	0.0000000	7.5000000	93.2562500	3398.7500
informational	1	0.5035685	1.2701564	0	0.0000000	0.0000000	0.0000000	24.0000
informational_duration	1	34.4723979	140.7492944	0	0.0000000	0.0000000	0.0000000	2549.3750
product_related	1	31.7314680	44.4755033	0	7.0000000	18.0000000	38.0000000	705.0000
product_related_duration	1	1194.7462200	1913.6692879	0	184.1375000	598.9369047	1464.1572135	63973.5222
bounce_rates	1	0.0221914	0.0484883	0	0.0000000	0.0031125	0.0168126	0.2000
exit_rates	1	0.0430728	0.0485965	0	0.0142857	0.0251564	0.0500000	0.2000
page_values	1	5.8892579	18.5684366	0	0.0000000	0.0000000	0.0000000	361.7637
special_day	1	0.0614274	0.1989173	0	0.0000000	0.0000000	0.0000000	1.0000
operating_systems	1	2.1240065	0.9113248	1	2.0000000	2.0000000	3.0000000	8.0000
browser	1	2.3570965	1.7172767	1	2.0000000	2.0000000	2.0000000	13.0000
region	1	3.1473642	2.4015912	1	1.0000000	3.0000000	4.0000000	9.0000
traffic_type	1	4.0695864	4.0251692	1	2.0000000	2.0000000	4.0000000	20.0000

The numerical variables capture user behavior, page interactions, and technical session attributes. Their key characteristics are summarized below:

All numerical variables are complete, with no missing values.
Several variables, such as informational, informational_duration, page_values, and special_day, have medians of zero, so at least half of the sessions had no activity in those categories. Even administrative has a median of only 1.
Duration-related features like administrative_duration, informational_duration, and especially product_related_duration exhibit high variability and long right tails. The maximum of product_related_duration is about 63,974 seconds (nearly 18 hours), suggesting the presence of extreme session durations.
page_values also has a median of zero but a high maximum (361.76) and a standard deviation (18.57) roughly three times its mean (5.89), implying that while most sessions had no page value, a few had very high values.
Technical attributes (operating_systems, browser, region, and traffic_type) are integer-coded categories rather than true quantities: operating_systems ranges from 1 to 8, browser from 1 to 13, region from 1 to 9, and traffic_type from 1 to 20. Their means and standard deviations therefore have no direct interpretation.
Both bounce_rates and exit_rates are bounded between 0 and 0.2, with low means (about 0.022 and 0.043), indicating that most sessions did not experience high bounce or exit behavior.
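The long right tails described above can be quantified with a skewness statistic. The sketch below is illustrative only: skewness is a hypothetical helper (a simple moment-based estimator), and the durations vector is synthetic log-normal data, not the real session durations.

```r
# Moment-based sample skewness (illustrative helper, not from the report)
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / (mean(z^2))^1.5
}

set.seed(1)
# Synthetic right-skewed "durations" in seconds (not the real data)
durations <- rlnorm(1000, meanlog = 6, sdlog = 1.5)

skew_raw <- skewness(durations)          # strongly positive for a long right tail
skew_log <- skewness(log1p(durations))   # much closer to zero after log(1 + x)

print(c(raw = skew_raw, log1p = skew_log))
```

A large positive value signals a long right tail; a monotone transform such as log1p is one common way to pull such a tail in before fitting models that are sensitive to feature scale.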

missing_data <-  online_attention |> mutate(across(where(is.factor), ~ fct_recode(.x, NULL = "unknown")))


missing_data |> 
  select(!where(is.numeric)) |> 
  skim() |> 
  select(skim_variable, n_missing, complete_rate, factor.n_unique, factor.top_counts) |> 
  rename(
    Variable = skim_variable,
    `Missing Values` = n_missing,
    `Completeness (proportion)` = complete_rate,
    `Unique Categories` = factor.n_unique,
    `Top Categories (Counts)` = factor.top_counts
  ) |> 
  kable(caption = "Summary Statistics for Categorical Variables") |> 
  kable_styling() |>  
  kable_classic()

Summary Statistics for Categorical Variables
Variable	Completeness (proportion)	Unique Categories	Top Categories (Counts)
month	1	10	May: 3364, Nov: 2998, Mar: 1907, Dec: 1727
visitor_type	1	3	Ret: 10551, New: 1694, Oth: 85
weekend	1	2	0: 9462, 1: 2868
revenue	1	2	0: 10422, 1: 1908

Finally, the dataset includes four categorical variables, all of which are complete. The month variable has 10 observed categories, with session activity peaking in May and November, suggesting seasonal usage trends. The visitor_type variable shows that returning visitors dominate the dataset (10,551 sessions), followed by new visitors (1,694). The weekend variable indicates that most sessions occurred on weekdays, with only about 23% taking place on weekends. The target variable revenue is highly imbalanced, with only 15.5% of sessions resulting in a purchase. This imbalance should be considered during model training and evaluation so that performance on the minority class is not masked by the majority class.
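The shares quoted above follow directly from the counts in the categorical summary table. As a quick sketch (variable names are illustrative), the same counts also give the accuracy of a trivial classifier that always predicts "no purchase", which is the baseline any model has to beat:

```r
# Class counts taken from the categorical summary table
n_no_purchase <- 10422
n_purchase    <- 1908
n_total       <- n_no_purchase + n_purchase

purchase_rate     <- n_purchase / n_total      # share of sessions with revenue
baseline_accuracy <- n_no_purchase / n_total   # always predicting "no purchase"
weekend_rate      <- 2868 / (9462 + 2868)      # share of weekend sessions

print(round(c(purchase_rate     = purchase_rate,
              baseline_accuracy = baseline_accuracy,
              weekend_rate      = weekend_rate), 3))
```

Because the trivial baseline already reaches roughly 84.5% accuracy, accuracy alone says little about how well purchases are detected.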

3.1 Numerical Variable Distributions

The Online Shoppers’ Intention Dataset includes numerical features that reflect user behavior across administrative, informational, and product-related pages. These variables capture page views, time spent, and key session metrics like exit rates, bounce rates, and page values. Most features are skewed, with many users showing little activity and a few showing high engagement. This section examines these distributions to uncover patterns and potential outliers linked to purchasing behavior.

3.1.1. Administrative

plot_grid(
  ggplot(online_attention, aes(administrative)) + 
    geom_histogram(binwidth = 1, fill = "steelblue", color = "white"),
  
  ggplot(online_attention, aes( administrative)) + 
    geom_boxplot(outlier.colour = "red", outlier.shape = 8),
  ncol = 2, align = "v"
)

Most sessions include very few or no views of administrative pages. The distribution is right-skewed, with a sharp drop as the number of page views increases. The boxplot shows several outliers, indicating that a small number of users navigate through many administrative pages. Overall, administrative content is rarely visited, which may reflect limited user interest or relevance during typical sessions.
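The points drawn as outliers follow the usual boxplot rule of 1.5 times the interquartile range beyond the quartiles. A minimal sketch, assuming the quartiles reported in the summary table for administrative (25th percentile 0, 75th percentile 4):

```r
# Quartiles of administrative as reported in the summary table
q1 <- 0
q3 <- 4

# Default boxplot rule: values above Q3 + 1.5 * IQR are drawn as outliers
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
print(upper_fence)
```

Under this rule, sessions with more than 10 administrative page views are the ones flagged in the boxplot.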

3.1.2. Administrative Duration

plot_grid(
  ggplot(online_attention, aes(administrative_duration)) + 
    geom_histogram(fill = "steelblue", color = "white"),
  
  ggplot(online_attention, aes( administrative_duration)) + 
    geom_boxplot(outlier.colour = "red", outlier.shape = 8),
  ncol = 2, align = "v"
)

Most users spend little or no time on administrative pages, with a few spending over 2,000 seconds. The distribution is highly skewed, and the presence of extreme outliers suggests that some users engage heavily with administrative content, though this is uncommon.

3.1.3. Informational

plot_grid(
  ggplot(online_attention, aes(informational)) + 
    geom_histogram(binwidth = 1, fill = "steelblue", color = "white"),
  
  ggplot(online_attention, aes( informational)) + 
    geom_boxplot(outlier.colour = "red", outlier.shape = 8),
  ncol = 2, align = "v"
)

The majority of users do not visit informational pages at all. Those who do usually view only a few, with rare cases reaching higher values. The data is heavily skewed toward zero, indicating that most sessions skip this type of content.

3.1.4. Informational Duration

plot_grid(
  ggplot(online_attention, aes(informational_duration)) + 
    geom_histogram(fill = "steelblue", color = "white"),
  
  ggplot(online_attention, aes( informational_duration)) + 
    geom_boxplot(outlier.colour = "red", outlier.shape = 8),
  ncol = 2, align = "v"
)

Most sessions have little to no time spent on informational pages. A small number of users spend significantly more time, creating a long right tail and multiple outliers. This suggests that informational content is not frequently engaged with during most visits.

3.1.5. Product Related

plot_grid(
  ggplot(online_attention, aes(product_related)) + 
    geom_histogram(binwidth = 1, fill = "steelblue"),
  
  ggplot(online_attention, aes( product_related)) + 
    geom_boxplot(outlier.colour = "red", outlier.shape = 8),
  ncol = 2, align = "v"
)

Product-related pages are the most visited content type: the median session includes 18 product-related page views, and a quarter of sessions include 7 or fewer. The distribution is right-skewed with many outliers, so although some users explore hundreds of product pages (up to 705), this behavior is not typical.

3.1.6. Product Related Duration

plot_grid(
  ggplot(online_attention, aes(product_related_duration)) + 
    geom_histogram(fill = "steelblue", color = "white"),
  
  ggplot(online_attention, aes( product_related_duration)) + 
    geom_boxplot(outlier.colour = "red", outlier.shape = 8),
  ncol = 2, align = "v"
)

Time spent on product-related pages is highly right-skewed. The median session spends about 599 seconds (roughly 10 minutes) on these pages and three quarters of sessions stay under about 1,464 seconds, while a few extreme outliers exceed 60,000 seconds. The boxplot confirms that these very long durations are rare, which implies that prolonged product exploration is uncommon.

3.1.7. Bounce Rates

plot_grid(
  ggplot(online_attention, aes(bounce_rates)) + 
    geom_histogram(fill = "steelblue", color = "white"),
  
  ggplot(online_attention, aes( bounce_rates)) + 
    geom_boxplot(outlier.colour = "red", outlier.shape = 8),
  ncol = 2, align = "v"
)

The majority of sessions have bounce rates close to zero, with a long tail of higher values and another visible spike at 0.2. The presence of many outliers indicates that while most users interact with multiple pages, some leave almost immediately after landing on the site.

3.1.8. Exit Rates

plot_grid(
  ggplot(online_attention, aes(exit_rates)) + 
    geom_histogram(fill = "steelblue", color = "white"),
  
  ggplot(online_attention, aes( exit_rates)) + 
    geom_boxplot(outlier.colour = "red", outlier.shape = 8),
  ncol = 2, align = "v"
)

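Most sessions have low exit rates, concentrated below 0.1, with a secondary spike at the maximum of 0.2. The boxplot flags many outliers, but because the variable is capped at 0.2, none of them is extreme. The spike at 0.2 mirrors the one seen for bounce_rates and is consistent with a small group of sessions that ended almost immediately, whereas the typical session has a low exit rate.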

3.1.9. Page Values

plot_grid(
  ggplot(online_attention, aes(page_values)) + 
    geom_histogram(fill = "steelblue", color = "white"),
  
  ggplot(online_attention, aes( page_values)) + 
    geom_boxplot(outlier.colour = "red", outlier.shape = 8),
  ncol = 2, align = "v"
)

Most sessions had a page value of zero, with a long tail of positive values and some extreme outliers. This indicates that while many users browse without triggering revenue-related actions, a small portion contributes significantly to potential transaction value.

3.1.10. Special Day

plot_grid(
  ggplot(online_attention, aes(special_day)) + 
    geom_histogram(fill = "steelblue", color = "white"),
  
  ggplot(online_attention, aes( special_day)) + 
    geom_boxplot(outlier.colour = "red", outlier.shape = 8),
  ncol = 2, align = "v"
)

Nearly all visits occurred on days not close to a special day, with very few sessions showing special day values greater than zero. This pattern suggests that special occasions do not strongly influence visit frequency, at least in this dataset.

3.1.11. Operating Systems

plot_grid(
  ggplot(online_attention, aes(operating_systems)) + 
    geom_histogram(binwidth = 1, fill = "steelblue", color = "white"),
  
  ggplot(online_attention, aes(operating_systems)) + 
    geom_boxplot(outlier.colour = "red", outlier.shape = 8),
  ncol = 2, align = "v"
)

The majority of users accessed the site using two main operating systems. A few other OS types appear as outliers, with very few users. This concentration implies that tailoring the site to just a couple of systems could serve most visitors.

3.1.12. Browser

plot_grid(
  ggplot(online_attention, aes(browser)) + 
    geom_histogram(binwidth = 1, fill = "steelblue", color = "white"),
  
  ggplot(online_attention, aes(browser)) + 
    geom_boxplot(outlier.colour = "red", outlier.shape = 8),
  ncol = 2, align = "v"
)

Most sessions came from just a few browser types, especially the first two categories. The remaining browser types had very low usage and appear as outliers in the boxplot. This suggests that users mainly browse from a limited number of platforms, which may simplify optimization decisions for web compatibility.

3.1.13. Region

plot_grid(
  ggplot(online_attention, aes(region)) + 
    geom_histogram(binwidth = 1, fill = "steelblue", color = "white"),
  
  ggplot(online_attention, aes(region)) + 
    geom_boxplot(outlier.colour = "red", outlier.shape = 8),
  ncol = 2, align = "v"
)

Traffic is heavily concentrated in one region, with a few others contributing moderate counts. The boxplot shows some regional outliers, but most sessions come from regions 1 through 4. This may reflect geographic targeting or the distribution of the website’s user base.

3.1.14. Traffic Type

plot_grid(
  ggplot(online_attention, aes(traffic_type)) + 
    geom_histogram(binwidth = 1, fill = "steelblue", color = "white"),
  
  ggplot(online_attention, aes(traffic_type)) + 
    geom_boxplot(outlier.colour = "red", outlier.shape = 8),
  ncol = 2, align = "v"
)

The majority of sessions came from a small number of traffic types, especially types 1 through 5. The boxplot confirms that most values are concentrated in the lower range, with a few outliers beyond type 10. This suggests that only a few traffic sources drive most of the website traffic.

3.2 Categorical Variable Distributions

In addition to the numeric metrics, the dataset includes categorical variables that describe the context of each session: the month of the visit, the type of visitor, and whether the session occurred on a weekend. The target variable revenue is also categorical. Browser, operating system, region, and traffic type are categorical in nature as well, but they are stored as integer codes and were covered in Section 3.1. This section summarizes the frequency distribution of the categorical variables to uncover patterns in the conditions under which users browse and potentially convert.

3.2.1. Month

ggplot(online_attention, aes(month)) + 
  geom_bar(fill = "steelblue") +
  labs(title = "Visit Distribution by Month")

Visits peaked in May, November, and March. Only 10 of the 12 calendar months appear in the data, with no visits recorded in January or April. The peaks may relate to marketing campaigns or seasonal shopping behavior. Some months, such as February, June, and August, had very low activity, indicating uneven traffic throughout the year.

3.2.2. Visitor Type

ggplot(online_attention, aes(visitor_type)) + 
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Visitor Types")

Most sessions were from returning visitors. New visitors made up a much smaller portion, and the “Other” category was minimal. This indicates that a large share of traffic comes from users who have visited the site before, which could suggest brand familiarity or customer loyalty.

3.2.3. Weekend

ggplot(online_attention, aes(as.factor(weekend))) + 
  geom_bar(fill = "steelblue") +
  labs(title = "Weekend Indicator (1 = Weekend)", x = "Weekend")

The majority of online shopping sessions took place on weekdays: only about 23% (2,868 of 12,330) occurred on weekends. This is slightly below the share of about 29% (two days out of seven) that would be expected if traffic were spread evenly across the week, which suggests that users are somewhat more active during the workweek.

3.2.4. Revenue

ggplot(online_attention, aes(as.factor(revenue))) + 
  geom_bar(fill = "steelblue") +
  labs(title = "Revenue Outcome", x = "Revenue (1 = Yes)")

The bar plot shows that most sessions did not lead to a purchase: 10,422 sessions are labeled 0 for the revenue variable, while only 1,908 (about 15.5%) are labeled 1. Purchases are therefore relatively rare in this dataset, and most users leave the site without completing a transaction.

3.3 Key Insights from the Correlation and Scatterplot Matrix

Understanding how numerical variables interact with each other is crucial for identifying potential predictors and refining our model. The scatterplot and correlation matrix help reveal trends, clusters, and outliers, guiding decisions on feature selection and transformation. Below are the key takeaways from this analysis.

online_attention |>
  keep(is.numeric) |> 
  # select(1:7) |> 
  ggpairs(
    lower = list(
      continuous = wrap("smooth", method = "lm", se = FALSE)
    )
  )

The correlation matrix reveals several meaningful relationships among the numerical variables in the Online Shoppers’ Intention Dataset. The strongest pattern appears between product_related and product_related_duration, indicating that users who view more product pages also spend more time on them. Similar but slightly weaker patterns are seen for administrative and administrative_duration, as well as informational and informational_duration, suggesting that user engagement tends to be consistent within each content type. Cross-behavioral correlations, such as those between product-related activity and administrative or informational pages, are positive but moderate, indicating that some users explore multiple sections of the site, though not always extensively.

Bounce rates and exit rates are negatively correlated with most engagement metrics, particularly with product-related activity. Sessions with more page views and longer durations tend to bounce less and are less likely to end abruptly. Bounce rates are highly correlated with exit rates, which reinforces the idea that early exits and limited interaction often occur together. Page values, which indicate potential revenue from a session, show weak positive correlations with engagement features and are slightly higher when users browse more product pages or visit around special days. On the other hand, sessions with high bounce or exit rates tend to contribute less to page value.

Special days are weakly associated with increased engagement and slightly higher page values, suggesting that promotional timing or seasonal factors may influence user behavior. Technical session variables such as browser, region, and traffic type have very low correlations with most engagement metrics, showing little influence on user activity or outcomes. Overall, the matrix highlights that deeper exploration across the site, especially of product content, is linked to more meaningful interactions and higher session value, while brief visits and early exits are less likely to lead to conversion.

corr_matrix <- online_attention |> 
  keep(is.numeric) |> cor()

corrplot(corr_matrix)

# corr_matrix

As a follow-up to the earlier scatterplot matrix, the correlation heatmap confirms the main relationships previously discussed. The strongest positive correlations are still between content page views and their corresponding durations, while bounce and exit rates remain negatively associated with engagement. Most technical features continue to show little to no correlation with behavioral variables. The heatmap reinforces the earlier findings and provides a clearer visual summary of the strength and direction of these relationships.
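To read the strongest relationships off a correlation matrix without scanning the plot, the pairs can be ranked by absolute correlation. The following is a minimal base R sketch on synthetic data; the toy columns (pages, duration, bounce) are hypothetical stand-ins and not the real variables.

```r
# Synthetic stand-in for a few numeric columns (not the real data)
set.seed(1)
n <- 500
pages    <- rpois(n, lambda = 20)
duration <- pages * 40 + rnorm(n, sd = 100)   # constructed to track page counts
bounce   <- runif(n, 0, 0.2)                  # independent of the other two
toy <- data.frame(pages, duration, bounce)

cm <- cor(toy)

# Keep each pair once (upper triangle) and sort by absolute correlation
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs)
```

With the report's corr_matrix in place of cm, the same ranking would list the page-count and duration pairs near the top if the relationships described above hold.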

4. Algorithm Selection

Given the structure and challenges present in the Online Shoppers’ Intention Dataset, we selected two algorithms for this project: a neural network and XGBoost. This decision is guided by both the nature of the data and the predictive task at hand.

One of the key challenges in the dataset is the imbalance in the target variable (Revenue), where the majority of sessions do not result in a purchase. This makes accuracy an unreliable metric and calls for models that can handle imbalanced classification problems effectively. Both XGBoost and neural networks offer strategies to address this. XGBoost supports class weighting and has built-in mechanisms to focus learning on harder-to-predict samples. Neural networks can be optimized with weighted loss functions or by applying techniques like focal loss to make the model more sensitive to the minority class.
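The idea behind a weighted loss can be sketched in a few lines of base R. Here weighted_log_loss is an illustrative helper, the labels and probabilities are toy values, and using the negative-to-positive ratio as the positive-class weight is one common convention (XGBoost exposes a similar idea through its scale_pos_weight parameter), not necessarily the setting used in this project.

```r
# Weighted binary cross-entropy: errors on positives are scaled by pos_weight
weighted_log_loss <- function(y, p, pos_weight = 1, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)   # clip probabilities to avoid log(0)
  -mean(pos_weight * y * log(p) + (1 - y) * log(1 - p))
}

# Negative-to-positive ratio from the class counts reported in Section 3
pos_weight <- 10422 / 1908

y <- c(1, 0, 0, 0)            # toy labels: one purchase, three non-purchases
p <- c(0.3, 0.1, 0.2, 0.1)    # toy predicted purchase probabilities

unweighted <- weighted_log_loss(y, p)
weighted   <- weighted_log_loss(y, p, pos_weight = pos_weight)

print(c(unweighted = unweighted, weighted = weighted))
```

With the weight applied, the poorly predicted purchase (probability 0.3) contributes far more to the loss, so a model minimizing it is pushed to pay more attention to the minority class.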

The dataset contains a mix of highly skewed numerical features, such as page durations and counts, along with several categorical variables, including visitor type, month, and traffic type. As a tree-based method, XGBoost splits on feature thresholds, so it is insensitive to monotonic transformations of the inputs and relatively robust to skewed distributions and outliers. It also handles missing values natively, although categorical variables still have to be encoded numerically (one-hot encoding is used here). Its ensemble of trees captures non-linear interactions between variables, which makes it well suited to the diverse behaviors seen in user session data.

On the other hand, neural networks are capable of learning complex patterns in the data, especially when interactions between variables are subtle or nonlinear. Although they typically require more preprocessing, including scaling of numerical features and encoding of categorical variables, they offer flexibility in architecture and activation functions that can be tuned to match the structure of the data. The presence of high-cardinality features and potential latent patterns in user behavior (e.g., time spent on specific page types) makes a neural network a valuable candidate for capturing deeper representations of user intent.

Together, these two models provide a balanced approach. XGBoost offers strong baseline performance, interpretability through feature importance, and robustness to data issues. Neural networks complement this with their capacity for representation learning and modeling complex nonlinearities. Using both models allows us to evaluate trade-offs between interpretability and predictive power, and ultimately select the model that performs best on metrics appropriate for imbalanced classification such as AUC, F1-score, and precision-recall.

5. Modeling

Before applying any modeling techniques, the dataset was prepared through a series of preprocessing steps to ensure data quality, improve feature relevance, and reduce noise. Since the dataset includes both numerical and categorical variables, it required cleaning, transformation, and encoding to make it suitable for machine learning algorithms.

Features considered irrelevant or low in predictive value, including browser, traffic_type, region, special_day, and operating_systems, were removed.
The target variable revenue was converted into a factor to support classification tasks.
Categorical predictors were one-hot encoded to convert them into numeric format.
Variables with near-zero variance were identified using nearZeroVar() and removed. These included indicators such as month.Jan, month.Feb, month.Apr, month.June, month.Jul, month.Aug, month.Sep, month.Oct, and visitor_type.Other.
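Features with high pairwise correlation (above 0.75) were flagged with findCorrelation() as candidates for removal, to reduce redundancy. Note that the code shown in this section computes these indices but does not apply them to the modeling data.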
The final dataset was split into a training set (70%) and a test set (30%) to allow proper evaluation on unseen data.
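The correlation-based filtering step in the list above can be illustrated with a small base R sketch. This is a simplified greedy filter written for illustration, not the algorithm implemented by caret's findCorrelation(); drop_correlated and the toy data frame are hypothetical.

```r
# Greedy filter: while some pair exceeds the cutoff, drop the member of the
# most correlated pair that has the larger mean absolute correlation overall
drop_correlated <- function(df, cutoff = 0.75) {
  cm <- abs(cor(df))
  diag(cm) <- 0
  dropped <- character(0)
  while (max(cm) > cutoff) {
    idx    <- which(cm == max(cm), arr.ind = TRUE)[1, ]
    means  <- rowMeans(cm[idx, , drop = FALSE])
    victim <- rownames(cm)[idx[which.max(means)]]
    dropped <- c(dropped, victim)
    keep <- setdiff(rownames(cm), victim)
    cm <- cm[keep, keep, drop = FALSE]
  }
  dropped
}

# Toy data: y is nearly a copy of x, z is independent
set.seed(1)
x <- rnorm(200)
toy <- data.frame(x = x, y = x + rnorm(200, sd = 0.1), z = rnorm(200))

dropped <- drop_correlated(toy, cutoff = 0.75)
kept    <- setdiff(names(toy), dropped)
print(list(dropped = dropped, kept = kept))
```

In this toy case one of the two near-duplicate columns is dropped and the independent column is kept; applied to the session data, the same idea would retain only one of each highly correlated count and duration pair.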

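library(caret)  # findCorrelation(), nearZeroVar()

# Indices of numeric columns whose pairwise correlation exceeds 0.75
highCorrelation <- findCorrelation(corr_matrix, cutoff = 0.75)

# Columns of the raw data with near-zero variance
noVar <- nearZeroVar(online_attention)
columns_to <- names(online_attention)[noVar]
columns_to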

## [1] "special_day"

set.seed(42)

# Drop the technical and session attributes judged to have low predictive value
df <- online_attention |> select(-browser, -traffic_type, -region, -special_day, -operating_systems)
df$revenue <- as.factor(df$revenue)

# One-hot encode all other factor variables (createDummyFeatures() is from mlr)
df_mlr_dummy <- createDummyFeatures(df, target = "revenue")

# Remove near-zero-variance columns; the guard avoids dropping every column
# when nearZeroVar() returns an empty index vector
nZPreproc <- nearZeroVar(df_mlr_dummy)
df_mlr <- if (length(nZPreproc) > 0) df_mlr_dummy[, -nZPreproc] else df_mlr_dummy

# 70/30 train/test split
train_idx <- sample(nrow(df_mlr), floor(0.7 * nrow(df_mlr)))
train_data <- df_mlr[train_idx, ]
test_data <- df_mlr[-train_idx, ]

metric_labels <- c(
  "Accuracy",
  "AUC",
  "F1",
  "Precision",
  "Recall",
  "Specificity"
)

With the dataset fully prepared, cleaned, encoded, and split into training and test sets, the next step was to train and evaluate two machine learning models: XGBoost and a neural network. These algorithms were chosen for their ability to capture complex patterns and to handle class imbalance effectively.

5.1 XGBoost

To optimize the performance of the XGBoost model, a set of key hyperparameters was tuned over the following ranges:

Learning rate (eta): 0.001 to 0.5
Tree depth (max_depth): 3 to 20
Row sampling rate (subsample): 0.5 to 1
Column sampling rate (colsample_bytree): 0.1 to 1
Number of boosting rounds (nrounds): 100 to 500

The tuning process was guided by a robust evaluation strategy designed to explore the parameter space and assess model stability:

A randomized search strategy was used to evaluate 50 distinct combinations of hyperparameters
5-fold cross-validation was employed to ensure reliable performance across different data splits
Model selection during tuning was based on AUC (Area Under the Curve), emphasizing the ability to distinguish between positive and negative classes
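The tuning setup described above can be sketched with mlr, the package whose measures are used in the evaluation code. This is a hedged sketch under stated assumptions: it requires the mlr and xgboost packages, the toy data frame is a synthetic stand-in for the encoded training data, and names such as xgb_learner, param_space, and tuned are illustrative. The objects final_xgb, train_task, inner_cv, and trained_xgb are named to match the evaluation code, but how the report actually created them is not shown.

```r
library(mlr)

# Synthetic stand-in for the encoded training data (not the real sessions)
set.seed(1)
n <- 300
toy <- data.frame(
  page_values = rexp(n, rate = 0.2),
  exit_rates  = runif(n, 0, 0.2)
)
toy$revenue <- factor(ifelse(toy$page_values - 20 * toy$exit_rates + rnorm(n) > 4, 1, 0))

train_task  <- makeClassifTask(data = toy, target = "revenue", positive = "1")
xgb_learner <- makeLearner("classif.xgboost", predict.type = "prob")

# Search ranges as listed in the text
param_space <- makeParamSet(
  makeNumericParam("eta",              lower = 0.001, upper = 0.5),
  makeIntegerParam("max_depth",        lower = 3,     upper = 20),
  makeNumericParam("subsample",        lower = 0.5,   upper = 1),
  makeNumericParam("colsample_bytree", lower = 0.1,   upper = 1),
  makeIntegerParam("nrounds",          lower = 100,   upper = 500)
)

inner_cv <- makeResampleDesc("CV", iters = 5, stratify = TRUE)  # 5-fold CV
ctrl     <- makeTuneControlRandom(maxit = 50)                   # 50 random draws

# Random search, selecting on AUC
tuned <- tuneParams(
  learner    = xgb_learner,
  task       = train_task,
  resampling = inner_cv,
  measures   = mlr::auc,
  par.set    = param_space,
  control    = ctrl,
  show.info  = FALSE
)

# Refit on the full training task with the selected hyperparameters
final_xgb   <- setHyperPars(xgb_learner, par.vals = tuned$x)
trained_xgb <- mlr::train(final_xgb, train_task)
```

Stratified folds are an assumption made here because of the class imbalance; lowering maxit gives a quicker trial run at the cost of a coarser search.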

set.seed(65661)

res_train <- resample(
  learner = final_xgb,     
  task = train_task,       
  resampling = inner_cv,   
  measures = list(
    mlr::acc,
    mlr::auc,
    mlr::f1,
    mlr::ppv,  
    mlr::tpr,  
    mlr::tnr
  ),
  show.info = TRUE
)

train_metrics <- as_tibble(as.list(res_train$aggr))

names(train_metrics) <- metric_labels  

train_metrics <- bind_cols(Model = "XGBoost", Dataset = "Train", train_metrics)

# Step 1: Predict on the test set using the final trained XGBoost model
pred_xgb <- predict(trained_xgb, task = test_task)

# Step 2: Evaluate test performance using the same metrics
test_metrics <- as_tibble(as.list(performance(pred_xgb, measures = list(
  mlr::acc,
  mlr::auc,
  mlr::f1,
  mlr::ppv,   # precision
  mlr::tpr,  # recall
  mlr::tnr  # specificity
))))

# Step 3: Apply consistent column names (assuming metric_labels is predefined)
names(test_metrics) <- metric_labels

# Step 4: Bind model and dataset labels
test_metrics <- bind_cols(Model = "XGBoost", Dataset = "Test", test_metrics)




xgboostMetrics <- bind_rows(train_metrics, test_metrics)

xgboostMetrics |> 
  kable(caption = "XGBoost Metrics", digits = 3) |> 
  kable_styling(full_width = TRUE, position = "center") |> 
  kable_classic()

XGBoost Metrics
Model	Dataset	Accuracy	AUC	F1	Precision	Recall	Specificity
XGBoost	Train	0.903	0.934	0.665	0.736	0.608	0.959
XGBoost	Test	0.897	0.927	0.637	0.667	0.610	0.947

The XGBoost model performed consistently across the training and test data. On the training set, where the metrics are resampled estimates obtained with resample() and inner_cv, it achieved an accuracy of 90.3% and an AUC of 0.934. Since about 84.5% of all sessions are non-purchases, accuracy has a high floor, and the AUC is the stronger evidence that the model separates the two classes well. The F1 score was 0.665, with a precision of 0.736 and a recall of 0.608: roughly three in four predicted purchases were correct, while about 39% of actual purchase sessions were missed. Specificity was high at 0.959, reflecting strong performance in identifying non-revenue sessions.

On the test set, performance remained stable, with an accuracy of 89.7% and an AUC of 0.927. The F1 score dropped slightly to 0.637, with precision and recall closer to each other at 0.667 and 0.610, respectively. Specificity remained high at 0.947, reinforcing the model's reliability in classifying non-revenue sessions.
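As a consistency check, the F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name f1_score is illustrative.

```r
# F1 is the harmonic mean of precision and recall
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

test_f1  <- f1_score(0.667, 0.610)   # test-set precision and recall
train_f1 <- f1_score(0.736, 0.608)   # training (resampled) precision and recall

print(round(c(train = train_f1, test = test_f1), 3))
```

The test value reproduces the reported 0.637. The training value comes out about 0.001 above the reported 0.665, a gap that is plausible when metrics are aggregated across resampling folds and rounded to three decimals.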

5.2 Neural Network (MLP)

A Multilayer Perceptron (MLP) was used for binary classification. Prior to model training, numeric predictors were preprocessed using a sequence of transformations. A Yeo-Johnson transformation was applied to reduce skewness and normalize the distribution of features, followed by centering and scaling to standardize all numeric inputs. These steps help ensure that the neural network converges more efficiently and avoids biases introduced by varying feature scales.
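For reference, the Yeo-Johnson transformation for a fixed parameter lambda can be written out directly. The sketch below is illustrative: yeo_johnson is a hand-written helper, lambda = 0 is chosen only because it reduces to log(1 + y) for non-negative inputs, and the example values are merely on the scale of administrative_duration. In the actual pipeline, lambda is estimated per column from the training data.

```r
# Yeo-Johnson transform for a fixed lambda (illustrative helper)
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative inputs: Box-Cox applied to (y + 1)
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(y[pos])
  }
  # Negative inputs: mirrored Box-Cox with parameter (2 - lambda)
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -(((-y[!pos] + 1)^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log1p(-y[!pos])
  }
  out
}

durations    <- c(0, 7.5, 93.3, 3398.75)           # example values in seconds
transformed  <- yeo_johnson(durations, lambda = 0)  # equals log(1 + y) here
standardized <- as.numeric(scale(transformed))      # center, then scale

print(round(cbind(durations, transformed, standardized), 3))
```

After the transformation the largest value is no longer orders of magnitude away from the rest, which is what makes the subsequent centering and scaling meaningful for the network.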

The architecture consisted of two hidden layers with ReLU activations, each followed by dropout layers to reduce the risk of overfitting. The output layer used a sigmoid activation function to produce probability estimates for the binary target variable.
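The forward pass of such a network can be sketched in base R, without torch, to show how the pieces fit together. This is an illustration under stated assumptions rather than the trained model: the weights are random, the input is toy data, and the width of the second hidden layer (16 here) is a placeholder because the report does not state it.

```r
relu <- function(z) {
  z[z < 0] <- 0
  z
}
sigmoid <- function(z) 1 / (1 + exp(-z))

# Inverted dropout: active only in training mode, identity at inference.
dropout <- function(h, rate, training) {
  if (!training || rate == 0) return(h)
  mask <- matrix(rbinom(length(h), 1, 1 - rate), nrow = nrow(h))
  h * mask / (1 - rate)
}

# Random initial weights for input -> hidden1 -> hidden2 -> output.
init_mlp <- function(n_in, hidden1, hidden2) {
  list(
    W1 = matrix(rnorm(n_in * hidden1, sd = 0.1), n_in, hidden1),
    b1 = rep(0, hidden1),
    W2 = matrix(rnorm(hidden1 * hidden2, sd = 0.1), hidden1, hidden2),
    b2 = rep(0, hidden2),
    W3 = matrix(rnorm(hidden2, sd = 0.1), hidden2, 1),
    b3 = 0
  )
}

forward <- function(p, X, rate = 0.2, training = FALSE) {
  h1 <- dropout(relu(sweep(X %*% p$W1, 2, p$b1, "+")), rate, training)
  h2 <- dropout(relu(sweep(h1 %*% p$W2, 2, p$b2, "+")), rate, training)
  sigmoid(h2 %*% p$W3 + p$b3)  # one probability per row of X
}

set.seed(1)
X <- matrix(rnorm(5 * 10), nrow = 5)                        # 5 toy sessions, 10 features
params <- init_mlp(n_in = 10, hidden1 = 64, hidden2 = 16)   # hidden2 is an assumption
probs <- forward(params, X)
```

Because dropout is switched off when `training = FALSE`, repeated calls on the same input return identical probabilities, which is the behavior wanted at evaluation time.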

Hyperparameter tuning was conducted using a grid search strategy, evaluating all possible combinations of the following values:

Learning rate (lr): 0.001, 0.003, 0.005
Dropout rate: 0, 0.1, 0.2
Hidden units in the first layer: 32, 64, 128
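
The full grid implied by these values can be enumerated with `expand.grid`. This is only a sketch of the search space; the tuning loop itself is not reproduced here.

```r
# All combinations of the three hyperparameters listed above.
grid <- expand.grid(
  lr = c(0.001, 0.003, 0.005),
  dropout = c(0, 0.1, 0.2),
  hidden_units = c(32, 64, 128)
)

n_configs <- nrow(grid)  # 3 * 3 * 3 candidate configurations
print(n_configs)
print(head(grid, 3))
```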

set.seed(123)

# online_attention_sub = online_attention |>  select(-special_day,-browser,-operating_systems,-weekend)
# 
# nnetDummy = dummyVars(revenue~.,online_attention_sub)
# 
# session_data <- predict(nnetDummy, newdata = online_attention_sub) |> as.data.frame() 
# session_data$revenue = online_attention$revenue
# session_data = session_data %>% mutate(revenue = if_else(revenue == "0", 0, 1))
# session_data

session_data = df_mlr %>% mutate(revenue = if_else(revenue == "0", 0, 1))


# Step 2: Stratified Split -----------------------------------------------

# First, create training index (70%).
# Pass the outcome as a factor so caret stratifies by class; a numeric 0/1
# outcome is binned by quantiles instead, which may not preserve the class ratio.
train_idx <- createDataPartition(factor(session_data$revenue), p = 0.7, list = FALSE)
train_set <- session_data[train_idx, ]
remaining_set <- session_data[-train_idx, ]

# The remaining 30% is used as the test set; the 15% validation / 15% test split below is disabled
# val_idx <- createDataPartition(remaining_set$revenue, p = 0.5, list = FALSE)
# val_set <- remaining_set[val_idx, ]
# test_set <- remaining_set[-val_idx, ]


rec <- recipe(revenue ~ ., data = train_set) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors()) 

# Learn from training data only
prepped <- prep(rec, training = train_set)

# Apply to all sets
train_data <- bake(prepped, new_data = train_set)
# val_data   <- bake(prepped, new_data = val_set)
test_data  <- bake(prepped, new_data = remaining_set)

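prepare_tensors <- function(df) {
  # Inputs are already centered and scaled by the recipe (fit on training data),
  # so they are not re-scaled here with the statistics of the set being converted
  X <- df %>% select(-revenue) %>% as.matrix()
  y <- df$revenue
  list(
    X_tensor = torch_tensor(X, dtype = torch_float()),
    y_tensor = torch_tensor(as.numeric(y), dtype = torch_float())$unsqueeze(2)
  )
}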

evaluate_model <- function(model, X_tensor, y_tensor, dataset_name = "Unknown", best_threshold = 0.5) {
  model$eval()
  with_no_grad({
    preds <- model(X_tensor)
  })
  # Convert torch tensors to R
  probs <- as_array(preds$squeeze())
  labels <- as_array(y_tensor$squeeze())
  # Use >= so the cut-off matches the threshold search used to pick best_thresh
  pred_classes <- ifelse(probs >= best_threshold, 1, 0)
  
  # Wrap in tibble
  results <- tibble(
    truth = factor(labels, levels = c(0, 1)),
    .pred = probs,
    .pred_class = factor(pred_classes, levels = c(0, 1))
  )
  
  # Compute metrics
  acc  <- accuracy(results, truth = truth, estimate = .pred_class)[[".estimate"]]
  auc  <- as.numeric(pROC::auc(response = labels, predictor = probs))
  f1   <- f_meas(results, truth = truth, estimate = .pred_class, event_level = "second")[[".estimate"]]
  prec <- precision(results, truth = truth, estimate = .pred_class, event_level = "second")[[".estimate"]]
  rec  <- recall(results, truth = truth, estimate = .pred_class, event_level = "second")[[".estimate"]]
  spec <- specificity(results, truth = truth, estimate = .pred_class, event_level = "second")[[".estimate"]]
  
  # Return tidy row
  tibble(
    Model = "Torch NN",
    Dataset = dataset_name,
    Accuracy = acc,
    AUC = auc,
    F1 = f1,
    Precision = prec,
    Recall = rec,
    Specificity = spec
  )
}

best_config |> remove_rownames() |> kable(caption = "Best NN Hyperparameters", digits = 3) |> 
  kable_styling(full_width = TRUE, position = "center") |> 
  kable_classic()

Best NN Hyperparameters
lr	dropout	hidden_units
0.001	0.2	128

set.seed(546)

test_tensors <- prepare_tensors(test_data)

X_test_tensor <- test_tensors$X_tensor
y_test_tensor <- test_tensors$y_tensor


# Get raw probabilities on the test set
best_model$eval()
with_no_grad({
  probs <- best_model(X_test_tensor)$squeeze(2) %>% as_array()
  true_labels <- as_array(y_test_tensor)
})

# Evaluate at various thresholds
thresholds <- seq(0.3, 0.7, by = 0.01)
results <- data.frame(Threshold = thresholds)

results$F1 <- sapply(thresholds, function(thresh) {
  preds <- as.integer(probs >= thresh)
  tp <- sum(preds == 1 & true_labels == 1)
  # Guard against 0/0 when no session is predicted positive at this threshold
  if (tp == 0) return(0)
  precision <- tp / sum(preds)
  recall <- tp / sum(true_labels)
  2 * precision * recall / (precision + recall)
})

# Best threshold
best_thresh <- results$Threshold[which.max(results$F1)]
cat("Best F1 at threshold:", best_thresh, "\n")

## Best F1 at threshold: 0.47

nnt_test_metrics <- evaluate_model(best_model, X_test_tensor, y_test_tensor,
                                   dataset_name = "Test", best_threshold = best_thresh)

metrics_combined <- bind_rows(
  nnt_train_metrics,
  nnt_test_metrics)
metrics_combined |>   kable(caption = "Neural Network Metrics", digits = 3) |> 
  kable_styling(full_width = T, position = "center") |> 
  kable_classic()

Neural Network Metrics
Model	Dataset	Accuracy	AUC	F1	Precision	Recall	Specificity
Torch NN	Train	0.914	0.949	0.716	0.733	0.700	0.953
Torch NN	Test	0.901	0.932	0.685	0.673	0.699	0.938

The Multilayer Perceptron (Torch NN) achieved strong and balanced performance on both the training and test datasets. The best-performing configuration used a learning rate of 0.001, a dropout rate of 0.2, and 128 hidden units in the first layer. The classification threshold was optimized for F1 score, with the best result at a threshold of 0.47 rather than the default 0.5. Because this threshold was selected on the test set itself, the test-set F1 is likely to be slightly optimistic.

On the training set, the model reached an accuracy of 91.4% and an AUC of 0.949, with an F1 score of 0.716, precision of 0.733, recall of 0.700, and specificity of 0.953.

On the test set, the model maintained consistent performance, achieving 90.1% accuracy, 0.932 AUC, an F1 score of 0.685, precision of 0.673, recall of 0.699, and specificity of 0.938, demonstrating strong generalization to unseen data.
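
The F1-based threshold search used above can be sketched in base R as follows. The function names `f1_at` and `best_f1_threshold`, and the toy probabilities and labels, are illustrative assumptions and not the report's data. A search like this is commonly run on a separate validation split so that the test set is left untouched for the final estimate.

```r
# F1 at a single threshold, returning 0 when there are no true positives.
f1_at <- function(probs, labels, thresh) {
  preds <- as.integer(probs >= thresh)
  tp <- sum(preds == 1 & labels == 1)
  if (tp == 0) return(0)
  precision <- tp / sum(preds)
  recall <- tp / sum(labels)
  2 * precision * recall / (precision + recall)
}

# Scan a grid of thresholds and keep the first one with the highest F1.
best_f1_threshold <- function(probs, labels,
                              thresholds = seq(0.3, 0.7, by = 0.01)) {
  f1 <- vapply(thresholds, function(t) f1_at(probs, labels, t), numeric(1))
  list(threshold = thresholds[which.max(f1)], f1 = max(f1))
}

# Toy validation scores and labels (made up for illustration).
val_probs  <- c(0.05, 0.20, 0.355, 0.425, 0.485, 0.555, 0.625, 0.80)
val_labels <- c(0,    0,    0,     1,     0,     1,     1,     1)

best <- best_f1_threshold(val_probs, val_labels)
print(best)
```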

6. Model Comparison

bind_rows(
  train_metrics, test_metrics,
  nnt_train_metrics,
  nnt_test_metrics) |>   kable(caption = "All Models Evaluation Metrics", digits = 3) |> 
  kable_styling(full_width = T, position = "center") |> 
  kable_classic()

All Models Evaluation Metrics
Model	Dataset	Accuracy	AUC	F1	Precision	Recall	Specificity
XGBoost	Train	0.903	0.934	0.665	0.736	0.608	0.959
XGBoost	Test	0.897	0.927	0.637	0.667	0.610	0.947
Torch NN	Train	0.914	0.949	0.716	0.733	0.700	0.953
Torch NN	Test	0.901	0.932	0.685	0.673	0.699	0.938

Both the XGBoost and Multilayer Perceptron (MLP) models demonstrated strong predictive performance, each offering strengths aligned with different business priorities. On the test dataset, XGBoost achieved an accuracy of 89.7%, an AUC of 0.927, and an F1 score of 0.637. The MLP was slightly ahead on all three, with an accuracy of 90.1%, an AUC of 0.932, and an F1 score of 0.685. This suggests that while both models generalize well, the MLP provides a better balance between precision and recall when using its optimized threshold of 0.47.

In terms of precision, the two models were nearly identical (0.667 for XGBoost vs. 0.673 for the MLP), so a similar share of the sessions each model flags are false positives. However, the MLP showed a clear advantage in recall (0.699 vs. 0.610), making it more effective at identifying true revenue-generating sessions. This distinction is important in contexts where missing a potential conversion carries more cost than occasionally targeting a non-converting user.

Regarding specificity, both models performed reliably, with XGBoost at 0.947 and the MLP close behind at 0.938, confirming their consistency in recognizing non-revenue sessions.

In summary, if the business goal is to identify as many high-value sessions as possible, the MLP is the better choice. If the priority is to flag as few non-revenue sessions as possible, XGBoost's slightly higher specificity makes it the more conservative option, although the margin is small. The final model selection should reflect the organization’s risk tolerance and strategic objectives.
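
One way to make this trade-off concrete is to attach a cost to each type of error. The sketch below does this with the test-set recall and specificity from the comparison table. The prevalence of revenue sessions (15%) and both unit costs are hypothetical placeholders that a business would need to replace with its own figures, and `expected_cost` is a helper introduced only for this illustration.

```r
# Expected misclassification cost per session.
#   recall, specificity: test-set values from the comparison table
#   prevalence, cost_fn, cost_fp: hypothetical values for illustration only
expected_cost <- function(recall, specificity, prevalence, cost_fn, cost_fp) {
  fn_rate <- prevalence * (1 - recall)             # revenue sessions that are missed
  fp_rate <- (1 - prevalence) * (1 - specificity)  # non-revenue sessions that are flagged
  cost_fn * fn_rate + cost_fp * fp_rate
}

models <- data.frame(
  model = c("XGBoost", "Torch NN"),
  recall = c(0.610, 0.699),
  specificity = c(0.947, 0.938)
)

# Scenario A: a missed conversion is assumed to cost 10x a wasted offer.
models$cost_miss_heavy <- expected_cost(models$recall, models$specificity,
                                        prevalence = 0.15, cost_fn = 10, cost_fp = 1)

# Scenario B: a wasted offer is assumed to cost 10x a missed conversion.
models$cost_offer_heavy <- expected_cost(models$recall, models$specificity,
                                         prevalence = 0.15, cost_fn = 1, cost_fp = 10)

print(models)
```

Under these assumed numbers, the MLP has the lower expected cost in scenario A and XGBoost has the lower expected cost in scenario B, which mirrors the qualitative recommendation above. With different prevalence or cost ratios the ranking can change.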

7. Business Impact

The predictive models developed in this project can support key areas of business operations, particularly digital marketing, customer engagement, and revenue growth. By identifying which user sessions are likely to result in purchases, the business can move from broad, generic strategies to more focused, data-informed targeting. This enables better use of marketing budgets, higher return on ad spend, and more efficient conversion funnels.

The Multilayer Perceptron (MLP), with its higher recall and F1 score, is well-suited for maximizing potential conversions by flagging high-intent sessions in real time. This allows the business to trigger timely interventions, such as personalized offers, live chat support, or adaptive pricing. In contrast, XGBoost, with its slightly higher specificity, is the more conservative option when minimizing false positives is a priority, for example when resources are limited or incentives are costly to distribute.

Beyond marketing, model predictions can be used to prioritize leads for sales teams, inform product recommendation systems, and improve CRM workflows. High-likelihood users could be enrolled in targeted nurture tracks, receive priority outreach, or influence demand forecasting and inventory planning.

These predictions also create opportunities for ongoing analytics and experimentation. Teams can segment users by conversion likelihood, measure the impact of targeted campaigns, and refine strategies based on user behavior patterns. Over time, feedback from the model can help improve user journeys, reduce friction, and optimize customer experience design.
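
As a small illustration of segmenting sessions by conversion likelihood, predicted probabilities can be binned into tiers with `cut`. The scores and the tier boundaries below are made-up assumptions; real cut-offs would need to be chosen together with the business.

```r
# Hypothetical predicted probabilities from either model.
scores <- c(0.03, 0.12, 0.28, 0.47, 0.51, 0.66, 0.91)

# Assumed tier boundaries: [0, 0.2), [0.2, 0.5), [0.5, 1].
tiers <- cut(scores,
             breaks = c(0, 0.2, 0.5, 1),
             labels = c("low", "medium", "high"),
             include.lowest = TRUE, right = FALSE)

print(table(tiers))
```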

Deploying one of these models in production would give the business a scalable, real-time decision engine for boosting conversions, optimizing resource allocation, and enabling more proactive, personalized engagement across multiple departments.

8. Conclusion and Recommendations

This project set out to address a key business challenge: predicting which user sessions are most likely to generate revenue. By translating this objective into a binary classification task, we evaluated two machine learning approaches, XGBoost and a Multilayer Perceptron (MLP), using carefully preprocessed data and extensive hyperparameter tuning.

Both models performed well. The MLP delivered better recall and F1 scores, while XGBoost offered slightly higher specificity at nearly identical precision. The choice between them depends on business priorities. If the goal is to maximize conversion opportunities and reduce missed revenue, the MLP is the more suitable option. If the focus is on controlling false positives and favoring high-certainty predictions, XGBoost may be the preferred model, although its margin on the test set is small.

Based on the evaluation, we recommend adopting the MLP in scenarios where capturing more potential revenue is critical, such as personalized marketing, live customer engagement, or upselling during high-traffic periods. For situations with limited operational capacity or expensive incentives, XGBoost could be a more cautious alternative.

In addition to the model results, the exploratory analysis highlighted that bounce rates and exit rates are heavily right-skewed. This suggests that many users leave pages quickly, which may reflect deeper problems in usability or page design. From my perspective as a software developer, I believe this behavior could be linked to poor layout, unclear calls to action, or overall friction in the user experience. These issues may directly impact conversion rates. While the model captures the downstream effect of this behavior, addressing the root causes could lead to both improved user satisfaction and stronger business performance.

To maximize the value of this predictive system, we recommend the following:

Integrate the model into a real-time session tracking pipeline
Use prediction scores to personalize marketing and user experiences
Retrain the model regularly using updated data to maintain performance
Monitor key performance indicators to track the impact of predictions
Investigate high-bounce and short-duration pages for potential UI and UX improvements

Implementing this system would help the business make better-informed decisions, improve targeting efforts, and identify broader areas of opportunity in the customer journey that go beyond the scope of modeling alone.