In the competitive landscape of online commerce, data-driven decision-making has become essential for improving user engagement and driving revenue growth. Businesses collect vast amounts of behavioral and technical data from user sessions, yet much of this information goes underutilized when it comes to real-time decision-making. By leveraging predictive modeling, organizations can move beyond reactive analysis and begin to anticipate user behavior, allowing for smarter targeting, more personalized interactions, and improved operational efficiency.
This project explores the use of machine learning to predict which user sessions are most likely to result in revenue. The primary objective is to build and compare models that can classify sessions as revenue-generating or not, based on available session-level features. By identifying high-value sessions in advance, the business can prioritize these users for strategic actions such as dynamic promotions, live assistance, or personalized recommendations.
Through preprocessing, model tuning, and evaluation, this project provides a practical framework for applying predictive analytics to session-level web data. The insights generated aim to support real-time marketing and customer engagement strategies, ultimately helping the business convert more users into paying customers.
Online retailers often collect detailed session-level data, including how users interact with different types of pages, how long they stay, what devices they use, and whether they are new or returning visitors. Despite having access to this information, many businesses still struggle to identify which sessions are likely to result in a purchase. As a result, marketing campaigns and customer engagement strategies are applied uniformly across all sessions, regardless of purchase intent. This leads to inefficient use of resources, low conversion rates, and missed revenue opportunities.
The core business problem is that the company cannot differentiate between high-intent and low-intent user sessions in real time. Without this distinction, it becomes difficult to prioritize efforts, deliver targeted experiences, or intervene effectively during the session to influence purchasing behavior.
To address this issue, the objective of this project is to build a predictive model that determines whether a user session will lead to a purchase. This is framed as a binary classification task, using historical session data from an e-commerce platform. The dataset includes features such as bounce rates, exit rates, time spent on various page types, traffic sources, user types, and session timing. By training machine learning models on this data, we aim to predict purchase intent before a transaction occurs.
These predictions can help the business act more strategically during live sessions, whether by offering personalized promotions, prioritizing customer service resources, or refining retargeting efforts. Ultimately, the goal is to increase revenue, reduce wasted effort, and improve the efficiency of digital engagement.
Before modeling, exploratory data analysis (EDA) was performed to understand the dataset’s structure, distribution, and potential issues such as skewness or outliers. Since the data captures session-level behavior from an online retail platform, the analysis focused on patterns in navigation, engagement time, and technical attributes that may relate to purchase likelihood.
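library(dplyr); library(janitor)  # mutate()/if_else() and clean_names()
# Load the sessions, standardise column names, and recode the logical flags as 0/1 factors
online_attention <- read.csv("online_shoppers_intention.csv", sep = ",", stringsAsFactors = TRUE) |>
  clean_names() |>
  mutate(revenue = factor(if_else(revenue == TRUE, 1, 0)),
         weekend = factor(if_else(weekend == TRUE, 1, 0)))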
month_levels <- c("Jan","Feb", "Mar", "Apr", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
# Reorder the month factor
online_attention$month <- factor(online_attention$month, levels = month_levels, ordered = TRUE)
grid.arrange(missing1, missing2, ncol = 2)
The dataset contains no missing values across any features or observations. All rows are complete, and every variable, whether categorical or numerical, is fully populated. This allows the modeling process to begin without the need for imputation or data cleaning.
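A completeness check of this kind can be reproduced in a few lines of base R. The sketch below runs on a small hypothetical data frame (`toy`) rather than the project data, and the helper name `missing_summary` is an assumption introduced only for illustration.

```r
# Count missing values per column and report the share of fully complete rows.
missing_summary <- function(df) {
  list(
    n_missing_by_column = colSums(is.na(df)),
    complete_row_rate   = mean(complete.cases(df))
  )
}

# Hypothetical toy data: one NA in `duration`, none elsewhere.
toy <- data.frame(
  pages    = c(1, 4, 0, 7),
  duration = c(12.5, NA, 0, 93.2),
  visitor  = factor(c("New", "Returning", "Returning", "Other"))
)

res <- missing_summary(toy)
print(res$n_missing_by_column)
print(res$complete_row_rate)
```

On a dataset with no missing values, every per-column count would be zero and the complete-row rate would be 1.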
online_attention |>
select(where(is.numeric)) |>
skim() |>
select(skim_variable, n_missing, complete_rate, numeric.mean, numeric.sd,
numeric.p0, numeric.p25, numeric.p50, numeric.p75, numeric.p100) |>
rename(
Variable = skim_variable,
`Missing Values` = n_missing,
`Completeness (%)` = complete_rate,
`Mean` = numeric.mean,
`Standard Deviation` = numeric.sd,
`Min` = numeric.p0,
`25th Percentile` = numeric.p25,
`Median (50th Pct)` = numeric.p50,
`75th Percentile` = numeric.p75,
`Max` = numeric.p100
) |>
kable(caption = "Summary Statistics for Numerical Variables") |>
kable_styling() |>
kable_classic()
| Variable | Missing Values | Completeness (%) | Mean | Standard Deviation | Min | 25th Percentile | Median (50th Pct) | 75th Percentile | Max |
|---|---|---|---|---|---|---|---|---|---|
| administrative | 0 | 1 | 2.3151663 | 3.3217841 | 0 | 0.0000000 | 1.0000000 | 4.0000000 | 27.0000 |
| administrative_duration | 0 | 1 | 80.8186105 | 176.7791075 | 0 | 0.0000000 | 7.5000000 | 93.2562500 | 3398.7500 |
| informational | 0 | 1 | 0.5035685 | 1.2701564 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 24.0000 |
| informational_duration | 0 | 1 | 34.4723979 | 140.7492944 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 2549.3750 |
| product_related | 0 | 1 | 31.7314680 | 44.4755033 | 0 | 7.0000000 | 18.0000000 | 38.0000000 | 705.0000 |
| product_related_duration | 0 | 1 | 1194.7462200 | 1913.6692879 | 0 | 184.1375000 | 598.9369047 | 1464.1572135 | 63973.5222 |
| bounce_rates | 0 | 1 | 0.0221914 | 0.0484883 | 0 | 0.0000000 | 0.0031125 | 0.0168126 | 0.2000 |
| exit_rates | 0 | 1 | 0.0430728 | 0.0485965 | 0 | 0.0142857 | 0.0251564 | 0.0500000 | 0.2000 |
| page_values | 0 | 1 | 5.8892579 | 18.5684366 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 361.7637 |
| special_day | 0 | 1 | 0.0614274 | 0.1989173 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 1.0000 |
| operating_systems | 0 | 1 | 2.1240065 | 0.9113248 | 1 | 2.0000000 | 2.0000000 | 3.0000000 | 8.0000 |
| browser | 0 | 1 | 2.3570965 | 1.7172767 | 1 | 2.0000000 | 2.0000000 | 2.0000000 | 13.0000 |
| region | 0 | 1 | 3.1473642 | 2.4015912 | 1 | 1.0000000 | 3.0000000 | 4.0000000 | 9.0000 |
| traffic_type | 0 | 1 | 4.0695864 | 4.0251692 | 1 | 2.0000000 | 2.0000000 | 4.0000000 | 20.0000 |
The dataset also includes a range of numerical variables that capture user behavior, page interactions, and technical session attributes. Below is a summary of key characteristics:

- informational and special_day have medians of zero, and administrative has a median of just 1, indicating that a large proportion of sessions had little or no activity in those categories.
- administrative_duration, informational_duration, and especially product_related_duration exhibit high variability and long right tails, with maximum values reaching up to 63,973 seconds, suggesting the presence of extreme session durations.
- page_values also has a median of zero but a high maximum (361.76) and standard deviation, implying that while most sessions had no page value, some had very high values.
- operating_systems, browser, region, and traffic_type are encoded as numeric category codes: operating_systems ranges from 1 to 8, browser from 1 to 13, region from 1 to 9, and traffic_type from 1 to 20.
- bounce_rates and exit_rates are bounded between 0 and 0.2, with low means and standard deviations, indicating that most sessions did not experience high bounce or exit behavior.

missing_data <- online_attention |> mutate(across(where(is.factor), ~ fct_recode(.x, NULL = "unknown")))
missing_data |>
select(!where(is.numeric)) |>
skim() |>
select(skim_variable, n_missing, complete_rate, factor.n_unique, factor.top_counts) |>
rename(
Variable = skim_variable,
`Missing Values` = n_missing,
`Completeness (%)` = complete_rate,
`Unique Categories` = factor.n_unique,
`Top Categories (Counts)` = factor.top_counts
) |>
kable(caption = "Summary Statistics for Categorical Variables") |>
kable_styling() |>
kable_classic()
| Variable | Missing Values | Completeness (%) | Unique Categories | Top Categories (Counts) |
|---|---|---|---|---|
| month | 0 | 1 | 10 | May: 3364, Nov: 2998, Mar: 1907, Dec: 1727 |
| visitor_type | 0 | 1 | 3 | Ret: 10551, New: 1694, Oth: 85 |
| weekend | 0 | 1 | 2 | 0: 9462, 1: 2868 |
| revenue | 0 | 1 | 2 | 0: 10422, 1: 1908 |
Finally, the dataset includes four categorical variables, all of which
are complete and contain no missing values. The month
variable has 10 categories, with session activity peaking in
May and November, suggesting seasonal
usage trends. The visitor_type variable shows that
returning visitors dominate the dataset, accounting for
the majority of sessions, followed by new visitors. The
weekend variable indicates that most sessions occurred on
weekdays, with only about 23% taking place during
weekends. The target variable revenue is highly imbalanced,
with only 15.5% of sessions resulting in a purchase.
This imbalance should be considered during model training to ensure fair
performance across both classes.
The Online Shoppers’ Intention Dataset includes numerical features that reflect user behavior across administrative, informational, and product-related pages. These variables capture page views, time spent, and key session metrics like exit rates, bounce rates, and page values. Most features are skewed, with many users showing little activity and a few showing high engagement. This section examines these distributions to uncover patterns and potential outliers linked to purchasing behavior.
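To put a number on the skew described here, one option is to compute a moment-based skewness coefficient and the share of exact zeros for each feature. The sketch below is illustrative only: the `durations` vector is made up, and the helper names `skewness` and `zero_share` are assumptions, not functions used in the project.

```r
# Moment-based sample skewness: positive values indicate a long right tail.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Share of observations that are exactly zero.
zero_share <- function(x) mean(x == 0)

# Hypothetical session durations in seconds (not taken from the dataset).
durations <- c(0, 0, 0, 0, 0, 12, 35, 60, 180, 2400)

cat("skewness:  ", round(skewness(durations), 2), "\n")
cat("zero share:", zero_share(durations), "\n")
```

A large positive skewness together with a high zero share is the pattern the histograms and boxplots in this section display visually.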
plot_grid(
ggplot(online_attention, aes(administrative)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white"),
ggplot(online_attention, aes( administrative)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8),
ncol = 2, align = "v"
)
Most sessions include very few or no views of administrative pages. The distribution is right-skewed, with a sharp drop as the number of page views increases. The boxplot shows several outliers, indicating that a small number of users navigate through many administrative pages. Overall, administrative content is rarely visited, which may reflect limited user interest or relevance during typical sessions.
plot_grid(
ggplot(online_attention, aes(administrative_duration)) +
geom_histogram(fill = "steelblue", color = "white"),
ggplot(online_attention, aes( administrative_duration)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8),
ncol = 2, align = "v"
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Most users spend little or no time on administrative pages, with a few spending over 2,000 seconds. The distribution is highly skewed, and the presence of extreme outliers suggests that some users engage heavily with administrative content, though this is uncommon.
plot_grid(
ggplot(online_attention, aes(informational)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white"),
ggplot(online_attention, aes( informational)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8),
ncol = 2, align = "v"
)
The majority of users do not visit informational pages at all. Those who
do usually view only a few, with rare cases reaching higher values. The
data is heavily skewed toward zero, indicating that most sessions skip
this type of content.
plot_grid(
ggplot(online_attention, aes(informational_duration)) +
geom_histogram(fill = "steelblue", color = "white"),
ggplot(online_attention, aes( informational_duration)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8),
ncol = 2, align = "v"
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Most sessions have little to no time spent on informational pages. A small number of users spend significantly more time, creating a long right tail and multiple outliers. This suggests that informational content is not frequently engaged with during most visits.
plot_grid(
ggplot(online_attention, aes(bounce_rates)) +
geom_histogram(fill = "steelblue", color = "white"),
ggplot(online_attention, aes( bounce_rates)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8),
ncol = 2, align = "v"
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The majority of sessions have bounce rates close to zero, with a long tail of higher values and another visible spike at 0.2. The presence of many outliers indicates that while most users interact with multiple pages, some leave almost immediately after landing on the site.
plot_grid(
ggplot(online_attention, aes(exit_rates)) +
geom_histogram(fill = "steelblue", color = "white"),
ggplot(online_attention, aes( exit_rates)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8),
ncol = 2, align = "v"
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
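Most sessions have low exit rates, concentrated below 0.1, with a secondary spike around 0.2, the maximum observed value. The boxplot flags many points above the upper whisker, but none beyond that 0.2 maximum. This suggests that most sessions move through pages that are rarely the last page of a visit, while a smaller group sits at the maximum, likely corresponding to visits that end almost immediately.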
plot_grid(
ggplot(online_attention, aes(page_values)) +
geom_histogram(fill = "steelblue", color = "white"),
ggplot(online_attention, aes( page_values)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8),
ncol = 2, align = "v"
)
Most sessions had a page value of zero, with a long tail of positive values and some extreme outliers. This indicates that while many users browse without triggering revenue-related actions, a small portion contributes significantly to potential transaction value.
plot_grid(
ggplot(online_attention, aes(special_day)) +
geom_histogram(fill = "steelblue", color = "white"),
ggplot(online_attention, aes( special_day)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8),
ncol = 2, align = "v"
)
Nearly all visits occurred on days not close to a special day, with very
few sessions showing special day values greater than zero. This pattern
suggests that special occasions do not strongly influence visit
frequency, at least in this dataset.
plot_grid(
ggplot(online_attention, aes(operating_systems)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white"),
ggplot(online_attention, aes(operating_systems)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8),
ncol = 2, align = "v"
)
The majority of users accessed the site using two main operating systems. A few other OS types appear as outliers, with very few users. This concentration implies that tailoring the site to just a couple of systems could serve most visitors.
plot_grid(
ggplot(online_attention, aes(browser)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white"),
ggplot(online_attention, aes(browser)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8),
ncol = 2, align = "v"
)
Most sessions came from just a few browser types, especially the first
two categories. The remaining browser types had very low usage and
appear as outliers in the boxplot. This suggests that users mainly
browse from a limited number of platforms, which may simplify
optimization decisions for web compatibility.
plot_grid(
ggplot(online_attention, aes(region)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white"),
ggplot(online_attention, aes(region)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8),
ncol = 2, align = "v"
)
Traffic is heavily concentrated in one region, with a few others contributing moderate counts. The boxplot shows some regional outliers, but most sessions come from regions 1 through 4. This may reflect geographic targeting or the distribution of the website’s user base.
plot_grid(
ggplot(online_attention, aes(traffic_type)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white"),
ggplot(online_attention, aes(traffic_type)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8),
ncol = 2, align = "v"
)
The majority of sessions came from a small number of traffic types, especially types 1 through 5. The boxplot confirms that most values are concentrated in the lower range, with a few outliers beyond type 10. This suggests that only a few traffic sources drive most of the website traffic.
In addition to continuous metrics, the dataset includes several categorical variables that describe the context of each session, such as the type of visitor, the month of the visit, the browser and operating system used, and whether the session occurred on a weekend or near a special day. These features help provide a broader understanding of the circumstances under which users browse and potentially convert. This section summarizes the frequency and distribution of these categorical variables to uncover patterns in user demographics and session conditions.
ggplot(online_attention, aes(month)) +
geom_bar(fill = "steelblue") +
labs(title = "Visit Distribution by Month")
Visits peaked in May, November, and March, with no visits recorded in January or April (the month variable has only 10 observed categories). These peaks may relate to marketing campaigns or seasonal shopping behavior. Some months like February, June, and August had very low activity, indicating uneven traffic throughout the year.
ggplot(online_attention, aes(visitor_type)) +
geom_bar(fill = "steelblue") +
labs(title = "Distribution of Visitor Types")
Most sessions were from returning visitors. New visitors made up a much
smaller portion, and the “Other” category was minimal. This indicates
that a large share of traffic comes from users who have visited the site
before, which could suggest brand familiarity or customer loyalty.
ggplot(online_attention, aes(as.factor(weekend))) +
geom_bar(fill = "steelblue") +
labs(title = "Weekend Indicator (1 = Weekend)", x = "Weekend")
The majority of online shopping sessions took place on weekdays. Only about 23% of the sessions (2,868 of 12,330) occurred on weekends, somewhat below the roughly 29% that two days out of seven would imply if traffic were spread evenly. This suggests that users are slightly more active during the workweek.
ggplot(online_attention, aes(as.factor(revenue))) +
geom_bar(fill = "steelblue") +
labs(title = "Revenue Outcome", x = "Revenue (1 = Yes)")
The bar plot shows that most online shopping sessions did not lead to a purchase, with the majority of records labeled as 0 for the revenue variable. Only a small portion of sessions resulted in revenue, labeled as 1. This indicates a clear imbalance in the data, where purchases are relatively rare compared to non-purchase sessions. The plot helps us understand that buying behavior is not common in the dataset and that most users leave the site without completing a transaction.
Understanding how numerical variables interact with each other is crucial for identifying potential predictors and refining our model. The scatterplot and correlation matrix help reveal trends, clusters, and outliers, guiding decisions on feature selection and transformation. Below are the key takeaways from this analysis.
online_attention |>
keep(is.numeric) |>
# select(1:7) |>
ggpairs(
lower = list(
continuous = wrap("smooth", method = "lm", se = FALSE)
)
)
The correlation matrix reveals several meaningful relationships among
the numerical variables in the Online Shoppers’ Intention Dataset. The
strongest pattern appears between product_related and
product_related_duration, indicating that users who view more product
pages also spend more time on them. Similar but slightly weaker patterns
are seen for administrative and administrative_duration, as well as
informational and informational_duration, suggesting that user
engagement tends to be consistent within each content type.
Cross-behavioral correlations, such as those between product-related
activity and administrative or informational pages, are positive but
moderate, indicating that some users explore multiple sections of the
site, though not always extensively.
Bounce rates and exit rates are negatively correlated with most engagement metrics, particularly with product-related activity. Sessions with more page views and longer durations tend to bounce less and are less likely to end abruptly. Bounce rates are highly correlated with exit rates, which reinforces the idea that early exits and limited interaction often occur together. Page values, which indicate potential revenue from a session, show weak positive correlations with engagement features and are slightly higher when users browse more product pages or visit around special days. On the other hand, sessions with high bounce or exit rates tend to contribute less to page value.
Special days are weakly associated with increased engagement and slightly higher page values, suggesting that promotional timing or seasonal factors may influence user behavior. Technical session variables such as browser, region, and traffic type have very low correlations with most engagement metrics, showing little influence on user activity or outcomes. Overall, the matrix highlights that deeper exploration across the site, especially of product content, is linked to more meaningful interactions and higher session value, while brief visits and early exits are less likely to lead to conversion.
corr_matrix <- online_attention |>
keep(is.numeric) |> cor()
corrplot(corr_matrix)
# corr_matrix
As a follow-up to the earlier scatterplot matrix, the correlation heatmap confirms the main relationships previously discussed. The strongest positive correlations are still between content page views and their corresponding durations, while bounce and exit rates remain negatively associated with engagement. Most technical features continue to show little to no correlation with behavioral variables. The heatmap reinforces the earlier findings and provides a clearer visual summary of the strength and direction of these relationships.
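Reading strongly correlated pairs off a heatmap can be error-prone, so it may help to list them explicitly. The following base R sketch extracts every pair whose absolute correlation exceeds a cutoff, using the same 0.75 value that appears in the preprocessing code. The `toy` data frame and the name `high_pairs` are hypothetical and stand in for the project data.

```r
# Hypothetical toy data: b is a monotone function of a, c is unrelated to both.
toy <- data.frame(
  a = 1:50,
  b = (1:50)^2,
  c = rep(c(1, -1), 25)
)

cm <- cor(toy)

# Keep only the upper triangle so that each pair is reported once.
high <- which(abs(cm) > 0.75 & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
print(high_pairs)
```

Applied to the session data, a listing like this would be expected to surface the page-count and duration pairs and the bounce and exit rate pair discussed in this section.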
Given the structure and challenges present in the Online Shoppers’ Intention Dataset, we selected two algorithms for this project: a neural network and XGBoost. This decision is guided by both the nature of the data and the predictive task at hand.
One of the key challenges in the dataset is the imbalance in the target variable (Revenue), where the majority of sessions do not result in a purchase. This makes accuracy an unreliable metric and calls for models that can handle imbalanced classification problems effectively. Both XGBoost and neural networks offer strategies to address this. XGBoost supports class weighting and has built-in mechanisms to focus learning on harder-to-predict samples. Neural networks can be optimized with weighted loss functions or by applying techniques like focal loss to make the model more sensitive to the minority class.
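One common way to apply class weighting is to weight the positive class by the ratio of negative to positive sessions; XGBoost exposes this ratio as its `scale_pos_weight` parameter. The sketch below derives that ratio from the class counts in the categorical summary table. Whether this project used this exact weighting is not stated, so it should be read as an assumption.

```r
# Class counts reported in the categorical summary table.
n_neg <- 10422  # sessions without revenue
n_pos <- 1908   # sessions with revenue

pos_share  <- n_pos / (n_pos + n_neg)  # share of purchasing sessions
pos_weight <- n_neg / n_pos            # candidate value for scale_pos_weight

# Per-observation weights that give both classes the same total weight.
y <- c(rep(0, n_neg), rep(1, n_pos))
obs_weights <- ifelse(y == 1, pos_weight, 1)

cat("positive share: ", round(pos_share, 3), "\n")
cat("positive weight:", round(pos_weight, 2), "\n")
```

With these counts the positive share is about 0.155 and the weight about 5.46, so each purchasing session would count roughly five and a half times as much as a non-purchasing one.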
The dataset contains a mix of highly skewed numerical features, such as page durations and counts, along with several categorical variables, including visitor type, month, and traffic type. XGBoost handles missing values natively and, as a tree-based method, is insensitive to monotone transformations of skewed features, so it needs little numerical preprocessing; categorical variables still have to be encoded numerically, which is done here through one-hot encoding. It is also robust to outliers and captures non-linear interactions between variables through its ensemble of trees. This makes it well suited to model the diverse behaviors seen in user session data.
On the other hand, neural networks are capable of learning complex patterns in the data, especially when interactions between variables are subtle or nonlinear. Although they typically require more preprocessing, including scaling of numerical features and encoding of categorical variables, they offer flexibility in architecture and activation functions that can be tuned to match the structure of the data. The presence of high-cardinality features and potential latent patterns in user behavior (e.g., time spent on specific page types) makes a neural network a valuable candidate for capturing deeper representations of user intent.
Together, these two models provide a balanced approach. XGBoost offers strong baseline performance, interpretability through feature importance, and robustness to data issues. Neural networks complement this with their capacity for representation learning and modeling complex nonlinearities. Using both models allows us to evaluate trade-offs between interpretability and predictive power, and ultimately select the model that performs best on metrics appropriate for imbalanced classification such as AUC, F1-score, and precision-recall.
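To see why accuracy alone is misleading at this class balance, consider a trivial classifier that predicts "no purchase" for every session. The sketch below computes the usual confusion-matrix metrics for that baseline, using the class counts from the summary table. The function name `classification_metrics` is a hypothetical helper, not part of the project code.

```r
# Confusion-matrix metrics for 0/1 labels; undefined ratios are returned as NA.
classification_metrics <- function(truth, pred) {
  tp <- sum(pred == 1 & truth == 1)
  tn <- sum(pred == 0 & truth == 0)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  safe_div <- function(a, b) if (b == 0) NA_real_ else a / b
  c(
    accuracy    = (tp + tn) / length(truth),
    precision   = safe_div(tp, tp + fp),
    recall      = safe_div(tp, tp + fn),
    specificity = safe_div(tn, tn + fp),
    f1          = safe_div(2 * tp, 2 * tp + fp + fn)
  )
}

# Class counts from the summary table: 10,422 non-purchases and 1,908 purchases.
truth <- c(rep(0, 10422), rep(1, 1908))
always_negative <- rep(0, length(truth))

m <- classification_metrics(truth, always_negative)
print(round(m, 3))
```

This baseline reaches about 84.5% accuracy while its recall and F1 are both zero, which is why the comparison relies on AUC, F1, precision, and recall rather than accuracy.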
Before applying any modeling techniques, the dataset was prepared through a series of preprocessing steps to ensure data quality, improve feature relevance, and reduce noise. Since the dataset includes both numerical and categorical variables, it required cleaning, transformation, and encoding to make it suitable for machine learning algorithms.
- The variables browser, traffic_type, region, special_day, and operating_systems were removed.
- The target variable revenue was converted into a factor to support classification tasks.
- Categorical predictors were one-hot encoded, and the resulting near-zero-variance indicators were identified with nearZeroVar() and removed. These included indicators such as month.Jan, month.Feb, month.Apr, month.June, month.Jul, month.Aug, month.Sep, month.Oct, and visitor_type.Other.

highCorrelation <- findCorrelation(corr_matrix,cutoff = 0.75)
# noVar <- nearZeroVar(pdaysDist)
noVar <- nearZeroVar(online_attention)
columns_to <- names(online_attention)[noVar]
columns_to
## [1] "special_day"
set.seed(42)
df <- online_attention |> select(-browser,-traffic_type,-region,-special_day,-operating_systems)
df$revenue <- as.factor(df$revenue)
# One-hot encode all other factor variables
df_mlr_dummy <- createDummyFeatures(df, target = "revenue")
# df_mlr$revenue <- as.numeric(df$revenue)
nZPreproc = nearZeroVar(df_mlr_dummy)
# names(df_mlr_dummy[,nZPreproc])
df_mlr = df_mlr_dummy[,-nZPreproc]
train_idx <- sample(nrow(df_mlr), 0.7 * nrow(df_mlr))
train_data <- df_mlr[train_idx, ]
test_data <- df_mlr[-train_idx, ]
metric_labels <- c(
"Accuracy",
"AUC",
"F1",
"Precision",
"Recall",
"Specificity"
)
With the dataset fully prepared, cleaned, encoded, and split into training and test sets, the next step was to train and evaluate two machine learning models: XGBoost and a neural network. These algorithms were chosen for their ability to capture complex patterns and to handle class imbalance effectively.
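The 70/30 split used for XGBoost is a simple random sample, so the share of purchasing sessions can differ slightly between the two partitions. If exact class proportions matter, a stratified split is one alternative. The following is a minimal base R sketch under that assumption; `stratified_split` is a hypothetical helper, and the toy labels are not the project data.

```r
# Sample a fraction p of the indices within each class separately.
stratified_split <- function(y, p = 0.7, seed = 42) {
  set.seed(seed)
  by_class <- split(seq_along(y), y)
  idx <- lapply(by_class, function(i) {
    i[sample.int(length(i), round(p * length(i)))]
  })
  sort(unlist(idx, use.names = FALSE))
}

# Hypothetical imbalanced labels: 100 negatives and 20 positives.
y <- factor(c(rep(0, 100), rep(1, 20)))

train_idx_demo <- stratified_split(y)
print(table(y[train_idx_demo]))   # class counts in the training part
print(table(y[-train_idx_demo]))  # class counts in the held-out part
```

Both partitions keep the original ratio of five negatives to one positive, which makes train and test metrics easier to compare on imbalanced data.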
To optimize the performance of the XGBoost model, a set of key hyperparameters was tuned over the following ranges:

- Learning rate (eta): 0.001 to 0.5
- Maximum tree depth (max_depth): 3 to 20
- Row subsampling ratio (subsample): 0.5 to 1
- Column subsampling ratio per tree (colsample_bytree): 0.1 to 1
- Number of boosting rounds (nrounds): 100 to 500

The tuning process was guided by a robust evaluation strategy designed to explore the parameter space and assess model stability.
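As an illustration of how candidate configurations could be drawn from these ranges, the sketch below samples them uniformly at random in base R. Random search is an assumption here, since the search strategy actually used is not shown, and `sample_xgb_params` is a hypothetical helper name.

```r
# Draw n candidate configurations uniformly from the stated ranges.
sample_xgb_params <- function(n, seed = 1) {
  set.seed(seed)
  data.frame(
    eta              = runif(n, min = 0.001, max = 0.5),
    max_depth        = sample(3:20, n, replace = TRUE),
    subsample        = runif(n, min = 0.5, max = 1),
    colsample_bytree = runif(n, min = 0.1, max = 1),
    nrounds          = sample(100:500, n, replace = TRUE)
  )
}

candidates <- sample_xgb_params(5)
print(candidates)
```

Each row would then be scored with cross-validation on the training set, and the best-scoring configuration retained as the final model.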
set.seed(65661)
res_train <- resample(
learner = final_xgb,
task = train_task,
resampling = inner_cv,
measures = list(
mlr::acc,
mlr::auc,
mlr::f1,
mlr::ppv,
mlr::tpr,
mlr::tnr
),
show.info = TRUE
)
train_metrics <- as_tibble(as.list(res_train$aggr))
names(train_metrics) <- metric_labels
train_metrics <- bind_cols(Model = "XGBoost", Dataset = "Train", train_metrics)
# Step 1: Predict on the test set using the final trained XGBoost model
pred_xgb <- predict(trained_xgb, task = test_task)
# Step 2: Evaluate test performance using the same metrics
test_metrics <- as_tibble(as.list(performance(pred_xgb, measures = list(
mlr::acc,
mlr::auc,
mlr::f1,
mlr::ppv, # precision
mlr::tpr, # recall
mlr::tnr # specificity
))))
# Step 3: Apply consistent column names (assuming metric_labels is predefined)
names(test_metrics) <- metric_labels
# Step 4: Bind model and dataset labels
test_metrics <- bind_cols(Model = "XGBoost", Dataset = "Test", test_metrics)
xgboostMetrics <- bind_rows(train_metrics, test_metrics)
xgboostMetrics |>
kable(caption = "XGBoost Metrics", digits = 3) |>
kable_styling(full_width = TRUE, position = "center") |>
kable_classic()
| Model | Dataset | Accuracy | AUC | F1 | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|---|
| XGBoost | Train | 0.903 | 0.934 | 0.665 | 0.736 | 0.608 | 0.959 |
| XGBoost | Test | 0.897 | 0.927 | 0.637 | 0.667 | 0.610 | 0.947 |
The XGBoost model demonstrated consistently strong performance across both the training and test datasets. On the training set, it achieved an accuracy of 90.3% and an AUC of 0.934, indicating excellent ability to distinguish between classes. The F1 score was 0.665, supported by a precision of 0.736 and a recall of 0.608, suggesting the model made relatively few false positive predictions while maintaining a reasonable detection rate for the positive class. Specificity was notably high at 0.959, reflecting strong performance in identifying negative cases.
On the test set, performance remained stable, with an accuracy of 89.7% and an AUC of 0.927. While the F1 score dropped slightly to 0.637, precision and recall remained well-balanced at 0.667 and 0.610, respectively. Specificity remained high at 0.947, reinforcing the model’s reliability in classifying non-revenue sessions.
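The F1 score is the harmonic mean of precision and recall, so the test-set figure can be checked directly from the other two columns of the table. A short sketch, with `f1_from_pr` as a hypothetical helper name:

```r
# F1 is the harmonic mean of precision and recall.
f1_from_pr <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

# Test-set precision and recall as reported in the XGBoost metrics table.
f1_test <- f1_from_pr(precision = 0.667, recall = 0.610)
print(round(f1_test, 3))
```

This reproduces the reported test F1 of 0.637. The training row comes from cross-validated resampling, where metrics are averaged across folds, so it need not satisfy the identity exactly.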
A Multilayer Perceptron (MLP) was used for binary classification. Prior to model training, numeric predictors were preprocessed using a sequence of transformations. A Yeo-Johnson transformation was applied to reduce skewness and normalize the distribution of features, followed by centering and scaling to standardize all numeric inputs. These steps help ensure that the neural network converges more efficiently and avoids biases introduced by varying feature scales.
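For reference, the Yeo-Johnson transformation itself is a short piecewise function. The sketch below implements it in base R for a fixed, user-supplied `lambda`. In the project, `step_YeoJohnson()` estimates a separate lambda for each predictor from the training data, so the fixed value used here is purely an illustrative assumption.

```r
# Yeo-Johnson transform for a single, fixed lambda.
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  # Non-negative inputs: Box-Cox applied to (y + 1).
  if (lambda != 0) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(y[pos])
  }
  # Negative inputs: mirrored Box-Cox with exponent (2 - lambda).
  if (lambda != 2) {
    out[!pos] <- -((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log1p(-y[!pos])
  }
  out
}

# Hypothetical right-skewed durations in seconds.
durations <- c(0, 10, 100, 1000, 60000)
print(round(yeo_johnson(durations, lambda = 0), 2))
```

With `lambda = 0` the transform reduces to `log(1 + y)` for non-negative inputs, which pulls a five-orders-of-magnitude range into a far narrower one; with `lambda = 1` it leaves the data unchanged.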
The architecture consisted of two hidden layers with ReLU activations, each followed by dropout layers to reduce the risk of overfitting. The output layer used a sigmoid activation function to produce probability estimates for the binary target variable.
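The forward pass of such an architecture can be written out in base R to make the layer structure explicit. This is a sketch under stated assumptions rather than the project's torch implementation: the weights are random and untrained, the defaults of 128 hidden units and a dropout rate of 0.2 are taken from the best-configuration table, and all function names are hypothetical.

```r
relu    <- function(z) pmax(z, 0)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Randomly initialised weights for: input -> hidden -> hidden -> 1 output.
init_mlp <- function(n_in, hidden_units = 128, seed = 1) {
  set.seed(seed)
  w <- function(nr, nc) matrix(rnorm(nr * nc, sd = 0.1), nrow = nr, ncol = nc)
  list(
    W1 = w(n_in, hidden_units),         b1 = rep(0, hidden_units),
    W2 = w(hidden_units, hidden_units), b2 = rep(0, hidden_units),
    W3 = w(hidden_units, 1),            b3 = 0
  )
}

# Inverted dropout: active only in training mode, identity otherwise.
apply_dropout <- function(H, rate, training) {
  if (!training || rate == 0) return(H)
  mask <- matrix(rbinom(length(H), 1, 1 - rate), nrow(H), ncol(H))
  H * mask / (1 - rate)
}

# Two ReLU hidden layers, each followed by dropout, then a sigmoid output.
forward <- function(p, X, dropout = 0.2, training = FALSE) {
  H1 <- apply_dropout(relu(sweep(X %*% p$W1, 2, p$b1, "+")), dropout, training)
  H2 <- apply_dropout(relu(sweep(H1 %*% p$W2, 2, p$b2, "+")), dropout, training)
  sigmoid(sweep(H2 %*% p$W3, 2, p$b3, "+"))
}

set.seed(2)
X <- matrix(rnorm(5 * 10), nrow = 5)  # 5 hypothetical sessions, 10 standardised features
probs <- forward(init_mlp(n_in = 10), X)
print(dim(probs))
```

The output is one probability per session, strictly between 0 and 1. Dropout is switched off at evaluation time, which mirrors the effect of calling `model$eval()` in torch.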
Hyperparameter tuning was conducted using a grid search strategy, evaluating all possible combinations of the following values:
- Learning rate (lr): 0.001, 0.003, 0.005

set.seed(123)
# online_attention_sub = online_attention |> select(-special_day,-browser,operating_systems,-weekend)
#
# nnetDummy = dummyVars(revenue~.,online_attention_sub)
#
# session_data <- predict(nnetDummy, newdata = online_attention_sub) |> as.data.frame()
# session_data$revenue = online_attention$revenue
# session_data = session_data %>% mutate(revenue = if_else(revenue == "0", 0, 1))
# session_data
session_data = df_mlr %>% mutate(revenue = if_else(revenue == "0", 0, 1))
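# Step 2: Stratified Split -----------------------------------------------
# Create the training index (70%). revenue is wrapped in factor() so that caret
# stratifies by class; a numeric 0/1 vector would be binned by quantiles instead.
train_idx <- createDataPartition(factor(session_data$revenue), p = 0.7, list = FALSE)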
train_set <- session_data[train_idx, ]
remaining_set <- session_data[-train_idx, ]
# remaining_set <- session_data[-train_idx, ]
# Split remaining 30% into 15% validation and 15% test (relative to total)
# val_idx <- createDataPartition(remaining_set$revenue, p = 0.5, list = FALSE)
# val_set <- remaining_set[val_idx, ]
# test_set <- remaining_set[-val_idx, ]
rec <- recipe(revenue ~ ., data = train_set) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors())
# Learn from training data only
prepped <- prep(rec, training = train_set)
# Apply to all sets
train_data <- bake(prepped, new_data = train_set)
# val_data <- bake(prepped, new_data = val_set)
test_data <- bake(prepped, new_data = remaining_set)
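# Convert a baked data frame into torch tensors (features and an n x 1 label column).
# The recipe has already centred and scaled the predictors using training-set
# statistics, so no further scale() call is applied here.
prepare_tensors <- function(df) {
  X <- df %>% select(-revenue) %>% as.matrix()
  y <- df$revenue
  list(
    X_tensor = torch_tensor(X, dtype = torch_float()),
    y_tensor = torch_tensor(as.numeric(y), dtype = torch_float())$unsqueeze(2)
  )
}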
evaluate_model <- function(model, X_tensor, y_tensor, dataset_name = "Unknown", best_threshold = 0.5) {
  # Evaluation mode disables dropout; no gradients are needed for prediction
  model$eval()
  with_no_grad({
    preds <- model(X_tensor)
  })
  # Convert torch tensors to R vectors
  probs <- as_array(preds$squeeze())
  labels <- as_array(y_tensor$squeeze())
  # Use >= so the cutoff rule matches the one used in the threshold search
  pred_classes <- ifelse(probs >= best_threshold, 1, 0)
  # Wrap in tibble
  results <- tibble(
    truth = factor(labels, levels = c(0, 1)),
    .pred = probs,
    .pred_class = factor(pred_classes, levels = c(0, 1))
  )
  # Compute metrics; yardstick:: is explicit because caret exports functions
  # with the same names (precision, recall, specificity)
  acc <- yardstick::accuracy(results, truth = truth, estimate = .pred_class)[[".estimate"]]
  # Fix the class order and direction so pROC does not choose them automatically
  auc <- as.numeric(pROC::auc(response = labels, predictor = probs,
                              levels = c(0, 1), direction = "<", quiet = TRUE))
  f1 <- yardstick::f_meas(results, truth = truth, estimate = .pred_class, event_level = "second")[[".estimate"]]
  prec <- yardstick::precision(results, truth = truth, estimate = .pred_class, event_level = "second")[[".estimate"]]
  rec_val <- yardstick::recall(results, truth = truth, estimate = .pred_class, event_level = "second")[[".estimate"]]
  spec <- yardstick::specificity(results, truth = truth, estimate = .pred_class, event_level = "second")[[".estimate"]]
  # Return tidy row
  tibble(
    Model = "Torch NN",
    Dataset = dataset_name,
    Accuracy = acc,
    AUC = auc,
    F1 = f1,
    Precision = prec,
    Recall = rec_val,
    Specificity = spec
  )
}
best_config |> remove_rownames() |> kable(caption = "Best NN Hyperparameters", digits = 3) |>
  kable_styling(full_width = TRUE, position = "center") |>
  kable_classic()
| lr | dropout | hidden_units |
|---|---|---|
| 0.001 | 0.2 | 128 |
set.seed(546)
test_tensors <- prepare_tensors(test_data)
X_test_tensor <- test_tensors$X_tensor
y_test_tensor <- test_tensors$y_tensor
# Get raw probabilities on the test set
best_model$eval()
with_no_grad({
  probs <- best_model(X_test_tensor)$squeeze(2) %>% as_array()
  true_labels <- y_test_tensor$squeeze(2) %>% as_array()
})
# Evaluate F1 at various thresholds
# Note: the threshold is selected on the test set, so test metrics at this
# threshold are optimistic; a separate validation split would avoid this.
thresholds <- seq(0.3, 0.7, by = 0.01)
results <- data.frame(Threshold = thresholds)
results$F1 <- sapply(thresholds, function(thresh) {
  preds <- ifelse(probs >= thresh, 1, 0)
  tp <- sum(preds == 1 & true_labels == 1)
  # Guard against division by zero when nothing is predicted positive
  precision <- if (sum(preds) == 0) 0 else tp / sum(preds)
  recall <- tp / sum(true_labels)
  if ((precision + recall) == 0) 0 else 2 * precision * recall / (precision + recall)
})
# Best threshold
best_thresh <- results$Threshold[which.max(results$F1)]
cat("Best F1 at threshold:", best_thresh, "\n")
## Best F1 at threshold: 0.47
nnt_test_metrics <- evaluate_model(best_model, X_test_tensor, y_test_tensor,
                                   dataset_name = "Test", best_threshold = best_thresh)
metrics_combined <- bind_rows(
  nnt_train_metrics,
  nnt_test_metrics
)
metrics_combined |> kable(caption = "Neural Network Metrics", digits = 3) |>
  kable_styling(full_width = TRUE, position = "center") |>
  kable_classic()
| Model | Dataset | Accuracy | AUC | F1 | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|---|
| Torch NN | Train | 0.914 | 0.949 | 0.716 | 0.733 | 0.700 | 0.953 |
| Torch NN | Test | 0.901 | 0.932 | 0.685 | 0.673 | 0.699 | 0.938 |
The Multilayer Perceptron (Torch NN) achieved strong and balanced performance on both the training and test datasets. The best-performing configuration used a learning rate of 0.001, a dropout rate of 0.2, and 128 hidden units in the first layer. The classification threshold was optimized for F1 score, with the best result at a threshold of 0.47 rather than the default 0.5.
On the training set, the model reached an accuracy of 91.4% and an AUC of 0.949, with an F1 score of 0.716, precision of 0.733, recall of 0.700, and specificity of 0.953.
On the test set, performance was consistent: 90.1% accuracy, 0.932 AUC, an F1 score of 0.685, precision of 0.673, recall of 0.699, and specificity of 0.938. The small gap between training and test metrics indicates that the model generalizes well to unseen data.
bind_rows(
  train_metrics, test_metrics,
  nnt_train_metrics,
  nnt_test_metrics
) |> kable(caption = "All Models Evaluation Metrics", digits = 3) |>
  kable_styling(full_width = TRUE, position = "center") |>
  kable_classic()
| Model | Dataset | Accuracy | AUC | F1 | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|---|
| XGBoost | Train | 0.903 | 0.934 | 0.665 | 0.736 | 0.608 | 0.959 |
| XGBoost | Test | 0.897 | 0.927 | 0.637 | 0.667 | 0.610 | 0.947 |
| Torch NN | Train | 0.914 | 0.949 | 0.716 | 0.733 | 0.700 | 0.953 |
| Torch NN | Test | 0.901 | 0.932 | 0.685 | 0.673 | 0.699 | 0.938 |
Both the XGBoost and Multilayer Perceptron (MLP) models demonstrated strong predictive performance, each with strengths aligned with different business priorities. On the test dataset, XGBoost achieved an accuracy of 89.7%, an AUC of 0.927, and an F1 score of 0.637. The MLP was slightly ahead on all three, with 90.1% accuracy, an AUC of 0.932, and an F1 score of 0.685. This suggests that while both models generalize well, the MLP provides a better balance between precision and recall, especially when using the optimized threshold of 0.47.
In terms of precision, the two models were nearly identical on the test set (0.667 for XGBoost vs. 0.673 for the MLP), so a positive prediction from either is about equally reliable. The MLP showed a clear advantage in recall (0.699 vs. 0.610), making it more effective at identifying true revenue-generating sessions. This distinction is important in contexts where missing a potential conversion costs more than occasionally targeting a non-converting user.
Regarding specificity, both models performed reliably, with XGBoost at 0.947 and the MLP close behind at 0.938, confirming their consistency in recognizing non-revenue sessions. XGBoost therefore raises slightly fewer false alarms among sessions that do not convert.
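For readers less familiar with these metrics, the sketch below shows how each one is derived from the four cells of a confusion matrix. The helper `confusion_metrics` and the counts passed to it are illustrative assumptions, not values from this project.

```r
# Hypothetical helper: derive the reported metric types from confusion counts
confusion_metrics <- function(tp, fp, fn, tn) {
  precision   <- tp / (tp + fp)   # share of flagged sessions that convert
  recall      <- tp / (tp + fn)   # share of converting sessions that are flagged
  specificity <- tn / (tn + fp)   # share of non-converting sessions left unflagged
  f1          <- 2 * precision * recall / (precision + recall)
  accuracy    <- (tp + tn) / (tp + fp + fn + tn)
  c(accuracy = accuracy, precision = precision, recall = recall,
    specificity = specificity, f1 = f1)
}

# Illustrative counts only (1,000 sessions, 100 of which convert)
m <- confusion_metrics(tp = 70, fp = 30, fn = 30, tn = 870)
round(m, 3)
```

Because non-converting sessions dominate, accuracy and specificity stay high even when a noticeable share of buyers is missed, which is why recall and F1 are the more informative metrics for this problem.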
In summary, if the business goal is to identify as many high-value sessions as possible, the MLP is the better choice. If the priority is to minimize false positives and focus on high-confidence predictions, XGBoost may be more appropriate. The final model selection should reflect the organization’s risk tolerance and strategic objectives.
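One way to make this trade-off explicit is to attach a cost to each type of error and compare the expected cost per session. The sketch below is a rough illustration: the recall and specificity values come from the test rows of the comparison table, while the helper `expected_cost`, the 15% conversion rate, and the cost figures are assumptions chosen only to show the mechanics.

```r
# Hypothetical helper: expected cost per session from misclassifications
expected_cost <- function(recall, specificity, prevalence, cost_fn, cost_fp) {
  missed_buyers <- prevalence * (1 - recall)            # false negatives per session
  false_alarms  <- (1 - prevalence) * (1 - specificity) # false positives per session
  missed_buyers * cost_fn + false_alarms * cost_fp
}

# Test-set recall and specificity from the comparison table
models <- data.frame(
  model       = c("XGBoost", "Torch NN"),
  recall      = c(0.610, 0.699),
  specificity = c(0.947, 0.938)
)

# Assumed values for illustration: 15% of sessions convert, a missed buyer
# costs 10 units, and a wasted intervention costs 1 unit
models$cost_recall_focus <- expected_cost(models$recall, models$specificity,
                                          prevalence = 0.15, cost_fn = 10, cost_fp = 1)
# Same prevalence, but interventions are expensive relative to missed buyers
models$cost_fp_focus <- expected_cost(models$recall, models$specificity,
                                      prevalence = 0.15, cost_fn = 1, cost_fp = 10)
models
```

Under these assumed costs, the neural network is cheaper when missed buyers dominate and XGBoost is cheaper when wasted interventions dominate. Real cost estimates from the business would be needed before drawing a firm conclusion.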
The predictive models developed in this project can support key areas of business operations, particularly digital marketing, customer engagement, and revenue growth. By identifying which user sessions are likely to result in purchases, the business can move from broad, generic strategies to more focused, data-informed targeting. This enables better use of marketing budgets, higher return on ad spend, and more efficient conversion funnels.
The Multilayer Perceptron (MLP), with its higher recall and F1 score, is well-suited for maximizing potential conversions by flagging high-intent sessions in real time. This allows the business to trigger timely interventions, such as personalized offers, live chat support, or adaptive pricing. In contrast, XGBoost, with its higher specificity and comparable precision, is better suited to cases where minimizing false positives is a priority, for example when resources are limited or incentives are costly to distribute.
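As a rough sketch of how a predicted probability could drive an in-session intervention, the function below maps scores to actions. The 0.47 cutoff is the value printed by the threshold search; the function name `choose_action`, the 0.80 cutoff, and the action labels are placeholders, not part of the project.

```r
# Hypothetical rule: map a predicted conversion probability to an action.
# The 0.80 cutoff and the action names are placeholders.
choose_action <- function(prob, flag_threshold = 0.47, priority_threshold = 0.80) {
  ifelse(prob >= priority_threshold, "live_chat",
         ifelse(prob >= flag_threshold, "personalized_offer", "none"))
}

# Example scores for five in-progress sessions (made-up values)
session_probs <- c(0.05, 0.46, 0.47, 0.79, 0.93)
data.frame(prob = session_probs, action = choose_action(session_probs))
```

Keeping the rule separate from the model makes it straightforward to adjust cutoffs as costs or capacity change without retraining.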
Beyond marketing, model predictions can be used to prioritize leads for sales teams, inform product recommendation systems, and improve CRM workflows. High-likelihood users could be enrolled in targeted nurture tracks, receive priority outreach, or influence demand forecasting and inventory planning.
These predictions also create opportunities for ongoing analytics and experimentation. Teams can segment users by conversion likelihood, measure the impact of targeted campaigns, and refine strategies based on user behavior patterns. Over time, feedback from the model can help improve user journeys, reduce friction, and optimize customer experience design.
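The segmentation and campaign measurement described above could look roughly like the following. Everything here is simulated: the scores, the treatment assignment, the purchase outcomes, and the tier boundaries are assumptions for illustration only.

```r
set.seed(3)
# Simulated scores and outcomes; not the project data
n <- 5000
sessions <- data.frame(
  prob    = runif(n),                        # predicted conversion probability
  treated = rbinom(n, size = 1, prob = 0.5)  # 1 = session received the campaign
)
# Simulated purchases: more likely at higher scores, with a small lift when treated
sessions$purchased <- rbinom(n, size = 1,
                             prob = pmin(0.8 * sessions$prob + 0.05 * sessions$treated, 1))

# Hypothetical tier boundaries
sessions$tier <- cut(sessions$prob,
                     breaks = c(0, 0.2, 0.47, 1),
                     labels = c("low", "medium", "high"),
                     include.lowest = TRUE)

# Conversion rate by tier and treatment group
rates <- aggregate(purchased ~ tier + treated, data = sessions, FUN = mean)
rates
```

Comparing treated and untreated sessions within the same tier, ideally with random assignment, separates the effect of the campaign from the effect of the model simply selecting sessions that were likely to convert anyway.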
Deploying one of these models in production would give the business a scalable, real-time decision engine for boosting conversions, optimizing resource allocation, and enabling more proactive, personalized engagement across multiple departments.
This project set out to address a key business challenge: predicting which user sessions are most likely to generate revenue. By translating this objective into a binary classification task, we evaluated two machine learning approaches, XGBoost and a Multilayer Perceptron (MLP), using carefully preprocessed data and extensive hyperparameter tuning.
Both models performed well, with XGBoost offering slightly higher specificity and the MLP delivering better recall and F1 scores at comparable precision. The choice between them depends on business priorities. If the goal is to maximize conversion opportunities and reduce missed revenue, the MLP is the more suitable option. If the focus is on controlling false positives and favoring high-certainty predictions, XGBoost may be the preferred model.
Based on the evaluation, we recommend adopting the MLP in scenarios where capturing more potential revenue is critical, such as personalized marketing, live customer engagement, or upselling during high-traffic periods. For situations with limited operational capacity or expensive incentives, XGBoost could be a more cautious alternative.
In addition to the model results, the exploratory analysis highlighted that bounce rates and exit rates are heavily right-skewed. This suggests that many users leave pages quickly, which may reflect deeper problems in usability or page design. From my perspective as a software developer, I believe this behavior could be linked to poor layout, unclear calls to action, or overall friction in the user experience. These issues may directly impact conversion rates. While the model captures the downstream effect of this behavior, addressing the root causes could lead to both improved user satisfaction and stronger business performance.
To maximize the value of this predictive system, we recommend the following:
Implementing this system would help the business make better-informed decisions, improve targeting efforts, and identify broader areas of opportunity in the customer journey that go beyond the scope of modeling alone.