library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.2
library(broom)
library(lindia)
## Warning: package 'lindia' was built under R version 4.3.3
library(dplyr)
project_data <- read.csv("online_shoppers_intention.csv")
Objective: The primary goal of this project is to understand and model customer behavior in an e-commerce setting to predict positive shopping outcomes (purchases). By analyzing various features of user sessions and their relationship with revenue generation, we aim to identify actionable insights that businesses can use to improve their conversion rates and overall customer experience.
First, I’ll give a brief summary of our data set.
The data set includes 12,000+ instances of shopping sessions on our company's e-commerce platform (specific industry not defined).
dim(project_data)
## [1] 12330 18
It includes 18 features related either to the customer's context (Region, Browser, Month, VisitorType, etc.) or to how the session went (ExitRates, ProductRelated_Duration, PageValues, etc.).
unique(names(project_data))
## [1] "Administrative" "Administrative_Duration"
## [3] "Informational" "Informational_Duration"
## [5] "ProductRelated" "ProductRelated_Duration"
## [7] "BounceRates" "ExitRates"
## [9] "PageValues" "SpecialDay"
## [11] "Month" "OperatingSystems"
## [13] "Browser" "Region"
## [15] "TrafficType" "VisitorType"
## [17] "Weekend" "Revenue"
Our target variable is ‘Revenue’, and it is a boolean variable that states whether a shopper did (TRUE) or did not (FALSE) make a purchase.
unique(project_data$Revenue)
## [1] FALSE TRUE
There are no missing values.
any(is.na(project_data))
## [1] FALSE
Here’s an initial summary for more details:
summary(project_data)
## Administrative Administrative_Duration Informational
## Min. : 0.000 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0000
## Median : 1.000 Median : 7.50 Median : 0.0000
## Mean : 2.315 Mean : 80.82 Mean : 0.5036
## 3rd Qu.: 4.000 3rd Qu.: 93.26 3rd Qu.: 0.0000
## Max. :27.000 Max. :3398.75 Max. :24.0000
## Informational_Duration ProductRelated ProductRelated_Duration
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 7.00 1st Qu.: 184.1
## Median : 0.00 Median : 18.00 Median : 598.9
## Mean : 34.47 Mean : 31.73 Mean : 1194.8
## 3rd Qu.: 0.00 3rd Qu.: 38.00 3rd Qu.: 1464.2
## Max. :2549.38 Max. :705.00 Max. :63973.5
## BounceRates ExitRates PageValues SpecialDay
## Min. :0.000000 Min. :0.00000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.01429 1st Qu.: 0.000 1st Qu.:0.00000
## Median :0.003112 Median :0.02516 Median : 0.000 Median :0.00000
## Mean :0.022191 Mean :0.04307 Mean : 5.889 Mean :0.06143
## 3rd Qu.:0.016813 3rd Qu.:0.05000 3rd Qu.: 0.000 3rd Qu.:0.00000
## Max. :0.200000 Max. :0.20000 Max. :361.764 Max. :1.00000
## Month OperatingSystems Browser Region
## Length:12330 Min. :1.000 Min. : 1.000 Min. :1.000
## Class :character 1st Qu.:2.000 1st Qu.: 2.000 1st Qu.:1.000
## Mode :character Median :2.000 Median : 2.000 Median :3.000
## Mean :2.124 Mean : 2.357 Mean :3.147
## 3rd Qu.:3.000 3rd Qu.: 2.000 3rd Qu.:4.000
## Max. :8.000 Max. :13.000 Max. :9.000
## TrafficType VisitorType Weekend Revenue
## Min. : 1.00 Length:12330 Mode :logical Mode :logical
## 1st Qu.: 2.00 Class :character FALSE:9462 FALSE:10422
## Median : 2.00 Mode :character TRUE :2868 TRUE :1908
## Mean : 4.07
## 3rd Qu.: 4.00
## Max. :20.00
This provides a lot of information, but at this step here’s what seems most important for my purposes:
The target variable, 'Revenue', is heavily imbalanced (10,422 negative cases vs. 1,908 positive, or 84.5% vs. 15.5%).
# Create a new column with labels for the Revenue variable
project_data <- project_data |>
dplyr::mutate(Revenue_Label = ifelse(Revenue == TRUE, "Purchase", "No Purchase"))
# Calculate the counts for each label
revenue_counts <- project_data |>
dplyr::count(Revenue_Label) |>
dplyr::mutate(percentage = n / sum(n) * 100)
# Plot a bar chart with counts and percentage labels inside the bars
ggplot(revenue_counts, aes(x = Revenue_Label, y = n, fill = Revenue_Label)) +
geom_bar(stat = "identity", color = "black") +
geom_text(aes(label = paste0(round(percentage, 1), "%")),
position = position_stack(vjust = 0.5), size = 6, color = "black") +
labs(
title = "Histogram of Revenue Outcomes",
x = "Outcome",
y = "Count"
) +
scale_fill_manual(values = c("No Purchase" = "lightcoral", "Purchase" = "lightgreen")) +
theme_minimal() +
theme(
legend.position = "none", # Removes the legend
panel.background = element_rect(fill = "white", color = "black"), # White background with black border
plot.background = element_rect(fill = "white") # White background around the plot
)
Some of the categorical variables are encoded as numeric codes (OperatingSystems, Browser, and Region).
Most session measurements are concentrated near zero and skewed right. Here's one example of how they're distributed (I won't print all of them to save space since, as the summary above shows, they're all similar):
# Plot a histogram of ProductRelated page counts
ggplot(project_data, aes(x = ProductRelated)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
labs(title = "Distribution of Product Pages", x = "ProductRelated", y = "Count")
Most features are intuitive but some need further explanation:
Page types are divided into Administrative, Informational, and ProductRelated. We have two variables for each: one for the total number of pages of that type visited, and one for the duration in seconds spent on that page type.
BounceRates: For a given page, the percentage of visitors who enter the site on that page and leave without triggering any further requests during the session.
ExitRates: For a given page, the percentage of pageviews on the site that were the last pageview of their session.
PageValues: The average value of a page, averaged over the value of the target page and/or the completion of an e-commerce transaction.
SpecialDay: The closeness of the browsing date to special days or holidays (e.g. Mother's Day or Valentine's Day) on which transactions are more likely to be finalized.
VisitorType: New, Returning, or Other
Next, we’ll look at how all the session metrics interact with each other correlation-wise.
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.3.3
# Select the session features for the correlation matrix
selected_features <- project_data |>
dplyr::select(Revenue, BounceRates, ExitRates, PageValues,
Administrative_Duration,
Informational_Duration,
ProductRelated_Duration,
Administrative, Informational,
ProductRelated)
# Compute the correlation matrix
cor_matrix <- cor(selected_features, method = "spearman", use = "complete.obs")
# Plot the upper triangle of the correlation matrix
ggcorrplot(cor_matrix,
method = "square",
type = "upper",
lab = TRUE,
title = "Correlation Matrix of Key Features",
show.legend = FALSE) +
theme(axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 8),
plot.title = element_text(size = 14, hjust = 0.5))
From this I notice a few more important insights:
ExitRates and BounceRates have a correlation of 0.91, which is high enough that I feel one of them should be removed going forward.
As I'd expect, there's a lot of multicollinearity among the page-type features, especially between the count and duration variables for the same page type.
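A quick spot check of that pairwise value (a sketch; it's the same figure shown in the matrix above):
# Spearman correlation between the two near-duplicate rate features
cor(project_data$BounceRates, project_data$ExitRates, method = "spearman")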
Next, I want to analyze some of the categorical features in our user metrics.
These include TrafficType, Browser, OperatingSystems, and Region, so I'll write a plotting function to analyze each of them.
plot_categorical_distribution <- function(data, categorical_var) {
library(ggplot2)
library(dplyr)
# Normalize data to make the bars take up the full plot area (100%)
summary_data <- data %>%
group_by(!!sym(categorical_var), Revenue) %>%
summarize(Count = n(), .groups = "drop") %>%
group_by(!!sym(categorical_var)) %>%
mutate(Percentage = Count / sum(Count) * 100)
# Automatically generate the title based on the categorical variable name
title <- paste(categorical_var, "versus Revenue")
# Create the proportional stacked bar chart
ggplot(summary_data, aes(x = factor(!!sym(categorical_var)), y = Percentage, fill = Revenue)) +
geom_bar(stat = "identity", position = "fill", color = "black") +
geom_text(
aes(label = Count),
position = position_fill(vjust = 0.5), # Center the count labels within the bars
size = 3
) +
labs(
title = title,
x = categorical_var,
y = "Percentage",
fill = "Revenue"
) +
theme_minimal() +
scale_fill_manual(values = c("lightpink", "lightgreen"))
}
# Example usage:
plot_categorical_distribution(data = project_data, categorical_var = "TrafficType")
plot_categorical_distribution(data = project_data, categorical_var = "Browser")
plot_categorical_distribution(data = project_data, categorical_var = "OperatingSystems")
plot_categorical_distribution(data = project_data, categorical_var = "Region")
Before breaking these down, I'll clarify that to my knowledge there's no documentation mapping these numeric codes to specific real-world browsers, regions, and so on.
This becomes an issue for how my final insights can be interpreted: if we don't know the browser type, how can we make suggestions to the company's executives?
For this reason I'll drop all of these features except one, since we're instructed to use at least one categorical feature.
I'll use TrafficType and assume that someone in the company can match the classes to real groups of customers.
I chose this one because it appears to affect conversion more than the others and, in my eyes, it's more directly related to the shopping session itself since it shows how the customer ended up on our site.
Next, we’ll take a look at our temporal features Month, SpecialDay, and Weekend
# Ensure the month order is set correctly
month_order <- c("Feb", "Mar", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
project_data$Month <- factor(project_data$Month, levels = month_order)
# Create a table of just Revenue and Month
month_revenue_counts <- table(project_data$Revenue, project_data$Month)
# Calculate the percentage of True for each month
true_percents <- prop.table(month_revenue_counts, 2)[2, ] * 100
# Convert the table into a data frame for ggplot
df_month_revenue <- as.data.frame(month_revenue_counts)
# Plot the stacked bar chart with percentage labels for True %
ggplot(df_month_revenue, aes(x = Var2, y = Freq, fill = Var1)) +
geom_bar(stat = "identity", color = "black") +
scale_fill_manual(values = c("lightpink", "lightgreen"), labels = c( "False", "True")) +
labs(title = "Distribution of Revenue by Month",
x = "Month",
y = "Count",
fill = "Revenue") +
theme_minimal() +
geom_text(data = subset(df_month_revenue, Var1 == "TRUE"),
aes(x = Var2, y = Freq / 2, label = paste0(round(true_percents, 1), "%")),
color = "black", size = 3)
Takeaways:
Conversion rates and visit volume both change a lot from month to month.
Volume increases by a large degree in spring and right before Christmas.
Conversion improves in the months leading up to Christmas, but the same isn't true for spring.
The only issue is that April and January are missing, so the feature loses some of its temporal value, but it still appears useful for drawing insights.
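The monthly conversion rates behind that chart can be tabulated directly; a quick sketch (Revenue is still logical here, so its mean is the conversion rate):
# Sessions and conversion rate by month
project_data |>
dplyr::group_by(Month) |>
dplyr::summarize(sessions = dplyr::n(), conversion_rate = mean(Revenue), .groups = "drop")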
Now I’ll look at weekend:
plot_categorical_distribution(data = project_data, categorical_var = "Weekend")
This shows that there's not much difference in how users shop based on whether it's the weekend.
More people shop on weekdays, but we can divide the weekend counts by 2 days and the weekday counts by 5 to compare sessions per day, as sketched below.
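Here's a rough sketch of that per-day comparison (Weekend is still the original logical column):
# Sessions per day: weekend counts spread over 2 days, weekday counts over 5
project_data |>
dplyr::count(Weekend) |>
dplyr::mutate(days = ifelse(Weekend, 2, 5), sessions_per_day = n / days)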
Because there's no difference in conversion, I'll drop this feature from my model.
Now I’ll check special day
plot_categorical_distribution(data = project_data, categorical_var = "SpecialDay")
Interestingly, the likelihood of conversion decreases the closer the session is to a "Special Day".
As with the other categorical features, there is no documentation on what counts as a special day; the only description the source gives is "This value represents the closeness of the browsing date to special days or holidays (e.g. Mother's Day or Easter)", and it says nothing else.
For this reason I won't use this feature.
Finally my last feature in user metrics is VisitorType
plot_categorical_distribution(data = project_data, categorical_var = "VisitorType")
# Remove the ambiguous "Other" visitor type and replot
project_data <- project_data |>
filter(VisitorType != "Other")
plot_categorical_distribution(data = project_data, categorical_var = "VisitorType")
It looks like there are far more returning visitors, but they don't convert as often as new visitors.
This seems like useful information, so I'll use this feature later in the project.
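The conversion gap behind those plots can also be tabulated directly; a quick sketch (output not shown):
# Sessions and conversion rate by visitor type
project_data |>
dplyr::group_by(VisitorType) |>
dplyr::summarize(sessions = dplyr::n(), conversion_rate = mean(Revenue), .groups = "drop")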
At this point we've removed Browser, OperatingSystems, SpecialDay, Weekend, Region, and BounceRates.
Other than VisitorType, the remaining features need to be engineered in some way.
First, I’ll handle the Month feature:
# Create a Pre_Christmas indicator from Month: 1 for Sep-Nov, 0 otherwise
project_data <- project_data |>
mutate(Pre_Christmas = if_else(Month %in% c("Sep","Oct", "Nov"), 1, 0))
# Convert Pre_Christmas to a factor for proper visualization
project_data$Pre_Christmas <- factor(project_data$Pre_Christmas, levels = c(1, 0), labels = c("Pre-Christmas", "Other"))
# Calculate percentages for labels
pre_christmas_counts <- project_data |>
group_by(Pre_Christmas, Revenue) |>
summarise(count = n(), .groups = "drop") |>
group_by(Pre_Christmas) |>
mutate(percentage = count / sum(count) * 100)
# Create the plot
ggplot(pre_christmas_counts, aes(x = Pre_Christmas, y = count, fill = Revenue)) +
geom_bar(stat = "identity", position = "stack", color = "black") +
geom_text(aes(label = paste0(round(percentage, 1), "%")),
position = position_stack(vjust = 0.5), size = 3.5) +
scale_fill_manual(values = c("lightpink", "lightgreen"), labels = c("False", "True")) +
labs(
title = "Seasonality Impact on Purchase Status",
x = "Seasonality",
y = "Count",
fill = "Revenue"
) +
theme_minimal()
Next I'll look at the page data.
Based on insights I gathered in the Data Dives, I learned that product-related pages are the most useful for understanding whether a customer converts.
For this reason I'm inclined to create a feature measuring the share of a session's duration spent on product-related pages versus the other page types.
So I divided the product-related duration by the sum of the durations across all page types; the code below implements the formula.
(I also add 0.01 to the denominator to keep it from ever being zero, which would cause a division-by-zero issue.)
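As a quick worked example with made-up durations:
# Hypothetical session: 600 s on product pages, 50 s administrative, 0 s informational
600 / (600 + 50 + 0 + 0.01) # ~0.923, i.e. about 92% of browsing time on product pages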
Here’s the new feature distribution:
# Compute ProductRatio: the share of session duration spent on product-related pages
project_data <- project_data |>
mutate(ProductRatio = ProductRelated_Duration /
(ProductRelated_Duration + Administrative_Duration + Informational_Duration + .01))
# Plot the histogram of ProductRatio with bins at every 1%
ggplot(project_data, aes(x = ProductRatio)) +
geom_histogram(binwidth = 0.01, fill = "blue", color = "black", alpha = 0.7) +
labs(
title = "Distribution of Product Ratio",
x = "Product Ratio",
y = "Count"
) +
theme_minimal()
Now that I've covered how product pages affect the data, we'll make a feature to represent the sheer quantity of pages visited by adding up all the page-count features to get "total_pages":
project_data <- project_data |>
mutate(
total_pages = Administrative + Informational + ProductRelated
)
At this point I've effectively consolidated the six features with high multicollinearity into two features that describe the data in distinctly different ways.
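As a quick check that the two engineered features aren't themselves strongly collinear (a sketch; output not shown):
# Spearman correlation between the two new features
cor(project_data$ProductRatio, project_data$total_pages, method = "spearman")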
The only remaining feature that needs work before modeling is TrafficType, which I'll represent with dummy variables.
There were originally 20 traffic types, but based on the earlier plot I've decided to consolidate the classes with few instances into a single TrafficType_Other class.
I'll first create the dummy variables and then condense the feature space:
# One-hot encoding with fastDummies
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.3.3
project_data <- dummy_cols(project_data, select_columns = "TrafficType")
# Create the "TrafficType_Other" variable by summing the selected TrafficTypes
project_data$TrafficType_Other <- (
project_data$TrafficType_19 +
project_data$TrafficType_18 +
project_data$TrafficType_17 +
project_data$TrafficType_16 +
project_data$TrafficType_15 +
project_data$TrafficType_14 +
project_data$TrafficType_12 +
project_data$TrafficType_9 +
project_data$TrafficType_7
)
# Ensure TrafficType_Other is binary (0 or 1)
project_data$TrafficType_Other <- ifelse(project_data$TrafficType_Other > 0, 1, 0)
# Remove the original dummy variables for TrafficTypes included in "Other"
project_data <- project_data |>
dplyr::select(-TrafficType_19, -TrafficType_18, -TrafficType_17, -TrafficType_16,
-TrafficType_15, -TrafficType_14, -TrafficType_12, -TrafficType_9, -TrafficType_7, -TrafficType_4)
Now the traffic types are represented by a condensed set of dummy variables (with the sparse classes grouped into TrafficType_Other), and I can move on to transformations.
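To double-check which traffic-type columns remain after the consolidation, a quick sketch (output not shown):
# List the remaining TrafficType dummy columns
grep("^TrafficType_", names(project_data), value = TRUE)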
Next, I need to handle the skewness and outliers present in many of my continuous variables.
This issue affects ExitRates, PageValues, and both of the features we just made.
To fix it, we'll start by finding the optimal lambda with the Box-Cox method to determine the best transformation.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Shift the skewed variables by a small constant so all values are strictly positive (Box-Cox requires positive inputs)
project_data$ExitRates_shifted <- project_data$ExitRates + 0.001
project_data$pagevalues_shifted <- project_data$PageValues + 0.001
project_data$total_pages_shifted <- project_data$total_pages + 0.001
# Reflect ProductRatio (1 - x) to get a strictly positive variable for Box-Cox
project_data$ProductRatio_shifted <- 1 - (project_data$ProductRatio)
# Apply the Box-Cox transformation to determine the optimal lambda for the shifted variable
boxcox_result <- boxcox(lm(ExitRates_shifted ~ 1, data = project_data),
lambda = seq(-2, 2, by = 0.1))
# Apply Box-Cox for PageValues
boxcox_pagevalues <- boxcox(lm(pagevalues_shifted ~ 1, data = project_data),
lambda = seq(-2, 2, by = 0.1))
# Apply Box-Cox for total_pages
boxcox_total_pages <- boxcox(lm(total_pages_shifted ~ 1, data = project_data),
lambda = seq(-2, 2, by = 0.1))
# Apply Box-Cox for product_ratio
boxcox_ProductRatio <- boxcox(lm(ProductRatio_shifted ~ 1, data = project_data),
lambda = seq(-2, 2, by = 0.1))
# Find the optimal lambda
optimal_lambda_exitrates <- boxcox_result$x[which.max(boxcox_result$y)]
cat("ExitRates lambda:", optimal_lambda_exitrates, "\n")
## ExitRates lambda: 0.02020202
# Identify optimal lambda for PageValues
optimal_lambda_pagevalues <- boxcox_pagevalues$x[which.max(boxcox_pagevalues$y)]
cat("Optimal lambda for PageValues:", optimal_lambda_pagevalues, "\n")
## Optimal lambda for PageValues: -0.4242424
# Identify optimal lambda for total_pages
optimal_lambda_total_pages <- boxcox_total_pages$x[which.max(boxcox_total_pages$y)]
cat("Optimal lambda for Total Pages:", optimal_lambda_total_pages, "\n")
## Optimal lambda for Total Pages: 0.1414141
# Find the optimal lambda
optimal_lambda_ProductRatio <- boxcox_ProductRatio$x[which.max(boxcox_ProductRatio$y)]
cat("ProductRatio lambda:", optimal_lambda_ProductRatio, "\n")
## ProductRatio lambda: 0.1010101
The near-zero lambdas tell us that a log-like transformation is appropriate for ExitRates, total_pages, and ProductRatio to make their distributions more normal (as lambda approaches zero, the Box-Cox transform (x^lambda - 1) / lambda approaches log(x)).
The optimal lambda for PageValues is negative (about -0.42), so I'll apply the same Box-Cox transform to it as well, but given the huge spike of zero values I don't expect a normal distribution.
First, I’ll apply my transformations
# Apply the Box-Cox transformation ((x^lambda - 1) / lambda) to ExitRates
project_data$ExitRates_transformed <-
(project_data$ExitRates_shifted ^ optimal_lambda_exitrates - 1) /
optimal_lambda_exitrates
# Apply the Box-Cox transformation to PageValues
project_data$PageValues_transformed <-
(project_data$pagevalues_shifted ^ optimal_lambda_pagevalues - 1) /
optimal_lambda_pagevalues
# Apply the Box-Cox transformation to total_pages
project_data$total_pages_transformed <-
(project_data$total_pages_shifted ^ optimal_lambda_total_pages - 1) /
optimal_lambda_total_pages
# Apply the Box-Cox transformation to ProductRatio
project_data$ProductRatio_transformed <-
(project_data$ProductRatio_shifted ^ optimal_lambda_ProductRatio - 1) /
optimal_lambda_ProductRatio
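Since the same formula is applied four times, a small helper function could cut the repetition; this is just a sketch (the boxcox_transform name is mine, not from any package):
# Box-Cox transform: (x^lambda - 1) / lambda, falling back to log(x) as lambda approaches zero
boxcox_transform <- function(x, lambda, eps = 1e-6) {
if (abs(lambda) < eps) log(x) else (x^lambda - 1) / lambda
}
# Equivalent to the ExitRates line above:
# project_data$ExitRates_transformed <- boxcox_transform(project_data$ExitRates_shifted, optimal_lambda_exitrates)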
Now I’ll plot the distributions and see how it worked out
# Create a list of transformed features and their names
transformed_features <- list(
ExitRates = project_data$ExitRates_transformed,
PageValues = project_data$PageValues_transformed,
TotalPages = project_data$total_pages_transformed,
ProductRatio = project_data$ProductRatio_transformed
)
# Combine the features into a data frame for plotting
transformed_df <- do.call(rbind, lapply(names(transformed_features), function(name) {
data.frame(Value = transformed_features[[name]], Feature = name)
}))
# Plot the distribution of all transformed features
ggplot(transformed_df, aes(x = Value)) +
geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
facet_wrap(~Feature, scales = "free", ncol = 2) +
labs(
title = "Distributions of Transformed Features",
x = "Value",
y = "Count"
) +
theme_minimal()
This does a good job normalizing ExitRates and TotalPages.
ProductRatio becomes more bimodal than normal, but I'm alright with that; the distribution is still more interpretable in this form.
PageValues separates into two distinct spikes, which is interesting (the large spike on the left is the sessions where PageValues was zero).
Now I’ll plot these again, but this time I’ll incorporate the revenue feature and see what happened
# Ensure Revenue is a factor
project_data$Revenue <- as.factor(project_data$Revenue)
# Create a list of transformed features and their names
transformed_features <- list(
ExitRates = project_data$ExitRates_transformed,
PageValues = project_data$PageValues_transformed,
TotalPages = project_data$total_pages_transformed,
ProductRatio = project_data$ProductRatio_transformed
)
# Combine the features into a long data frame for plotting
transformed_df <- do.call(rbind, lapply(names(transformed_features), function(name) {
data.frame(Value = transformed_features[[name]],
Feature = name,
Revenue = project_data$Revenue)
}))
# Plot the stacked bar plots in a 2x2 grid
ggplot(transformed_df, aes(x = Value, fill = Revenue)) +
geom_histogram(bins = 30, color = "black", position = "stack") +
facet_wrap(~Feature, scales = "free", ncol = 2) + # Set ncol = 2 for a 2x2 grid
labs(
title = "Distributions of Transformed Features with Revenue Interaction",
x = "Value",
y = "Count"
) +
scale_fill_manual(values = c("lightpink", "lightgreen"), labels = c("No Purchase", "Purchase")) +
theme_minimal()
Here's what we can take from this:
ExitRates: Lower exit rates appear to moderately increase the likelihood of purchase, and values above roughly -2.5 (on the transformed scale) almost never convert. This is what I'd expect from exit rates' effect on purchasing behavior, and the now near-normal distribution makes the relationship much clearer.
PageValues: This was a highly correlated feature in its original form, but that was impossible to see when plotting it raw. The transformation makes the correlation much clearer: instances in the spike on the right are majority positive conversions, while the spike at the lower extreme almost never results in a purchase.
ProductRatio: This one is interesting. I expected the upper range of the distribution to convert much better, but it's only slightly higher (proportion-wise), since there's a reasonably sized group of positive instances in the lower range.
TotalPages: The purchasing distribution is shifted slightly toward higher values relative to the dataset as a whole, which matches my expectation of a moderate increase in conversion the more pages a user visits.
I'll accept all of these transformations as my new features for the upcoming model, as they make the features more interpretable and should make the model more reliable.
Now I can move on to the next part of my inferential analysis.
Now that I have a good understanding of my variables, I can state a null hypothesis about how changes in these features affect our target variable. I will use t-tests and an ANOVA test to evaluate this hypothesis and further understand how well these features explain purchases in our data set.
Null Hypothesis (H₀):
A customer’s session behavior doesn’t significantly predict whether they will make a purchase.
Alternative Hypothesis (H₁):
A customer’s session behavior significantly predicts whether they will make a purchase.
Our two binary features, VisitorType and Pre_Christmas, are best analyzed with t-tests:
# Convert Revenue from a factor back to numeric for the t-tests
project_data$Revenue <- as.numeric(as.factor(project_data$Revenue)) - 1 # Convert Revenue to 0 and 1
# Subset data by VisitorType
visitor_new <- project_data$Revenue[project_data$VisitorType == "New_Visitor"]
visitor_returning <- project_data$Revenue[project_data$VisitorType == "Returning_Visitor"]
# Perform a two-sample t-test
t_test_visitor <- t.test(visitor_new, visitor_returning,
alternative = "two.sided",
var.equal = FALSE)
# Print the results
print(t_test_visitor)
##
## Welch Two Sample t-test
##
## data: visitor_new and visitor_returning
## t = 9.946, df = 2055.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.08814291 0.13143956
## sample estimates:
## mean of x mean of y
## 0.2491145 0.1393233
Now Pre_Christmas:
# Subset data by Pre_Christmas
pre_christmas <- project_data$Revenue[project_data$Pre_Christmas == "Pre-Christmas"]
other <- project_data$Revenue[project_data$Pre_Christmas == "Other"]
# Perform a two-sample t-test
t_test_pre_christmas <- t.test(pre_christmas, other,
alternative = "two.sided",
var.equal = FALSE)
# Print the results
print(t_test_pre_christmas)
##
## Welch Two Sample t-test
##
## data: pre_christmas and other
## t = 16.81, df = 6131.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1132640 0.1431692
## sample estimates:
## mean of x mean of y
## 0.2411276 0.1129110
Takeaways:
Both tests show strong evidence of significant differences between the compared groups, highlighting the role of seasonality (Pre-Christmas) and customer type (New vs. Returning) in driving revenue.
On the basis of these two features, we can confidently reject the null hypothesis as both p-values are well below 0.05
The higher conversion proportions for Pre-Christmas sessions and new visitors suggest actionable insights:
Pre-Christmas campaigns should be a focus, as they align with increased likelihoods of revenue generation.
Attracting new visitors through targeted marketing or first-time purchase discounts may yield higher conversion rates.
Next, I'll run an ANOVA test on my categorical feature, TrafficType.
# Consolidate the sparse traffic types into a single grouped category (mirroring the earlier dummy-variable grouping) so we can run an ANOVA on one column
project_data$TrafficTypes_Grouped <- as.factor(ifelse(
project_data$TrafficType %in% c("7", "9", "12", "14", "15", "16", "17", "18", "19", "20"),
"Other",
as.character(project_data$TrafficType)
))
# Create a subset of the data with necessary variables
traffic_data <- project_data %>%
dplyr::select(TrafficTypes_Grouped, Revenue) %>%
mutate(Revenue = as.numeric(as.factor(Revenue)) - 1) # Convert Revenue to numeric (0 = FALSE, 1 = TRUE)
# Conduct ANOVA test
anova_result <- aov(Revenue ~ as.factor(TrafficTypes_Grouped), data = traffic_data)
# Summarize the results
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(TrafficTypes_Grouped) 10 44.8 4.484 35.28 <2e-16 ***
## Residuals 12234 1554.8 0.127
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This feature also yields a very low p-value (< 2e-16), indicating that traffic type has a significant effect on the Revenue variable.
Again, we can reject the null hypothesis.
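For context on what drives that F statistic, the per-group conversion rates can be tabulated like this (a sketch; output not shown):
# Mean conversion rate per grouped traffic type (the group means behind the ANOVA)
traffic_data |>
dplyr::group_by(TrafficTypes_Grouped) |>
dplyr::summarize(sessions = dplyr::n(), conversion_rate = mean(Revenue), .groups = "drop")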
Next, I'll move on to my logistic model.
# Convert VisitorType to binary numeric values: 1 for New, 0 for Returning
project_data <- project_data |>
mutate(
VisitorType = if_else(VisitorType == "New_Visitor", 1, 0)
)
# Ensure VisitorType is numeric
project_data$VisitorType <- as.numeric(project_data$VisitorType)
# Select the variables to include in the modeling dataset
modeling_dataset <- project_data |>
dplyr::select(
ExitRates_transformed,
PageValues_transformed,
total_pages_transformed,
ProductRatio_transformed,
Pre_Christmas,
VisitorType,
dplyr::starts_with("TrafficType_",),
Revenue
)
# Recode the Pre_Christmas variable to binary
modeling_dataset <- modeling_dataset |>
dplyr::mutate(Pre_Christmas = ifelse(Pre_Christmas == "Other", 0, 1))
# Ensure Revenue is a factor
modeling_dataset$Revenue <- as.factor(modeling_dataset$Revenue)
Next, I'll display a correlation matrix of the remaining features to ensure there are no lingering multicollinearity issues:
# Load necessary library
library(ggcorrplot)
# Select numeric columns for correlation analysis
numeric_vars <- modeling_dataset |>
dplyr::select(where(is.numeric))
# Compute the correlation matrix
cor_matrix <- cor(numeric_vars, method = "spearman", use = "complete.obs")
# Plot the upper triangle of the correlation matrix
ggcorrplot(
cor_matrix,
method = "square",
type = "upper",
lab = TRUE,
title = "Correlation Matrix of Modeling Dataset",
show.legend = FALSE
)
# Load necessary libraries
library(dplyr)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ROCR)
# Min-max scale the selected variables to [0, 1] (new columns get a "_scaled" suffix)
scale_features <- function(data, features) {
data %>%
mutate(across(all_of(features), ~ (. - min(.)) / (max(.) - min(.)), .names = "{col}_scaled"))
}
# Select the features to scale
features_to_scale <- c("ExitRates_transformed", "PageValues_transformed",
"total_pages_transformed", "ProductRatio_transformed")
# Apply scaling
modeling_dataset_scaled <- scale_features(modeling_dataset, features_to_scale)
# Ensure 'Revenue' is a binary factor
modeling_dataset_scaled$Revenue <- as.factor(modeling_dataset_scaled$Revenue)
# Define weights for the imbalanced classes
class_counts <- table(modeling_dataset_scaled$Revenue)
total_count <- nrow(modeling_dataset_scaled)
weight_0 <- total_count / (2 * class_counts[1]) # Weight for class 0 (no revenue)
weight_1 <- total_count / (2 * class_counts[2]) # Weight for class 1 (revenue)
modeling_dataset_scaled$weights <- ifelse(modeling_dataset_scaled$Revenue == "0", weight_0, weight_1)
# Fit the weighted logistic regression model
weighted_model <- glm(
Revenue ~ ExitRates_transformed + PageValues_transformed +
total_pages_transformed + ProductRatio_transformed +
VisitorType + Pre_Christmas + TrafficType_1 +
TrafficType_2 + TrafficType_3 + TrafficType_5 + TrafficType_6 +
TrafficType_8 + TrafficType_10 + TrafficType_11 +
TrafficType_13 + TrafficType_20 + TrafficType_Other,
family = binomial,
data = modeling_dataset_scaled,
weights = weights
)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
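Before interpreting the model, it's worth a quick sanity check that the weights balance the classes: with weight = total / (2 x class count), each class's weights should sum to half the total. A one-line sketch (output not shown):
# Total weight per class; both should equal total_count / 2
tapply(modeling_dataset_scaled$weights, modeling_dataset_scaled$Revenue, sum)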
# Make predictions
predicted_probabilities <- predict(weighted_model, type = "response")
predicted_classes <- ifelse(predicted_probabilities > 0.5, "1", "0")
# Create a confusion matrix
conf_matrix <- table(Predicted = predicted_classes, Actual = modeling_dataset_scaled$Revenue)
# Calculate evaluation metrics
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
precision <- conf_matrix["1", "1"] / sum(conf_matrix["1", ])
recall <- conf_matrix["1", "1"] / sum(conf_matrix[, "1"])
f1_score <- 2 * ((precision * recall) / (precision + recall))
# Output the evaluation metrics
cat("\n","Accuracy:", round(accuracy, 4), "\n")
##
## Accuracy: 0.8577
cat("Precision:", round(precision, 4), "\n")
## Precision: 0.5251
cat("Recall:", round(recall, 4), "\n")
## Recall: 0.8282
cat("F1 Score:", round(f1_score, 4), "\n")
## F1 Score: 0.6427
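Since ROCR was loaded above but not yet used, a threshold-free check of the model via AUC could look like this (a sketch; output not shown):
# Compute AUC from the predicted probabilities with ROCR
pred_obj <- ROCR::prediction(predicted_probabilities, modeling_dataset_scaled$Revenue)
auc_value <- ROCR::performance(pred_obj, measure = "auc")@y.values[[1]]
cat("AUC:", round(auc_value, 4), "\n")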
summary(weighted_model)
##
## Call:
## glm(formula = Revenue ~ ExitRates_transformed + PageValues_transformed +
## total_pages_transformed + ProductRatio_transformed + VisitorType +
## Pre_Christmas + TrafficType_1 + TrafficType_2 + TrafficType_3 +
## TrafficType_5 + TrafficType_6 + TrafficType_8 + TrafficType_10 +
## TrafficType_11 + TrafficType_13 + TrafficType_20 + TrafficType_Other,
## family = binomial, data = modeling_dataset_scaled, weights = weights)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.252761 0.176330 1.433 0.15173
## ExitRates_transformed -0.214223 0.041314 -5.185 2.16e-07 ***
## PageValues_transformed 0.084853 0.001474 57.576 < 2e-16 ***
## total_pages_transformed 0.026868 0.018864 1.424 0.15436
## ProductRatio_transformed -0.019126 0.012154 -1.574 0.11557
## VisitorType 0.598996 0.080489 7.442 9.92e-14 ***
## Pre_Christmas 1.419522 0.061129 23.222 < 2e-16 ***
## TrafficType_1 -0.285957 0.112713 -2.537 0.01118 *
## TrafficType_2 0.265408 0.102677 2.585 0.00974 **
## TrafficType_3 -0.371080 0.119820 -3.097 0.00195 **
## TrafficType_5 0.533282 0.186775 2.855 0.00430 **
## TrafficType_6 -0.151421 0.171409 -0.883 0.37703
## TrafficType_8 0.765536 0.164172 4.663 3.12e-06 ***
## TrafficType_10 0.281268 0.160355 1.754 0.07942 .
## TrafficType_11 0.383121 0.192131 1.994 0.04615 *
## TrafficType_13 -0.930483 0.169054 -5.504 3.71e-08 ***
## TrafficType_20 0.254133 0.246943 1.029 0.30342
## TrafficType_Other -0.215734 0.268319 -0.804 0.42139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16975.2 on 12244 degrees of freedom
## Residual deviance: 9204.7 on 12227 degrees of freedom
## AIC: 12228
##
## Number of Fisher Scoring iterations: 5
# Extract and rank feature importance
feature_importance <- as.data.frame(summary(weighted_model)$coefficients)
feature_importance <- feature_importance[-1, ] # Remove the intercept
feature_importance <- feature_importance[order(abs(feature_importance[, "Estimate"]), decreasing = TRUE), ]
feature_importance$Rank <- seq_len(nrow(feature_importance))
# Add feature names for clarity
feature_importance <- tibble::rownames_to_column(feature_importance, var = "Feature")
# Round the relevant columns to 3 decimal places
feature_importance <- feature_importance %>%
dplyr::mutate(
Estimate = round(Estimate, 3),
`Std. Error` = round(`Std. Error`, 3),
`z value` = round(`z value`, 3),
`Pr(>|z|)` = round(`Pr(>|z|)`, 3)
)
# Optionally visualize the feature importance
library(ggplot2)
ggplot(feature_importance, aes(x = reorder(Feature, abs(Estimate)), y = abs(Estimate))) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(
title = "Feature Importance in Logistic Regression Model",
x = "Features",
y = "Absolute Coefficient Value"
) +
theme_minimal()
# Summary of the model to get coefficients and p-values
model_summary <- summary(weighted_model)
# Extract coefficients and p-values
coefficients <- model_summary$coefficients
significant_coefficients <- coefficients[coefficients[, 4] < 1, ] # Keep all rows (p < 1) so every coefficient appears with its odds ratio
# Create a data frame to hold results
results <- data.frame(
Coefficient = significant_coefficients[, 1], # Extract the coefficient
Odds_Change = exp(significant_coefficients[, 1]), # Calculate e^(coefficient)
P_Value = significant_coefficients[, 4] # Include p-values
)
# Display results
print(results)
## Coefficient Odds_Change P_Value
## (Intercept) 0.25276123 1.2875758 1.517288e-01
## ExitRates_transformed -0.21422349 0.8071680 2.157204e-07
## PageValues_transformed 0.08485326 1.0885573 0.000000e+00
## total_pages_transformed 0.02686843 1.0272326 1.543593e-01
## ProductRatio_transformed -0.01912564 0.9810561 1.155665e-01
## VisitorType 0.59899590 1.8202901 9.919369e-14
## Pre_Christmas 1.41952195 4.1351431 2.736636e-119
## TrafficType_1 -0.28595653 0.7512953 1.117974e-02
## TrafficType_2 0.26540805 1.3039630 9.741534e-03
## TrafficType_3 -0.37107956 0.6899890 1.954999e-03
## TrafficType_5 0.53328193 1.7045173 4.300814e-03
## TrafficType_6 -0.15142065 0.8594861 3.770273e-01
## TrafficType_8 0.76553623 2.1501470 3.115941e-06
## TrafficType_10 0.28126789 1.3248085 7.942472e-02
## TrafficType_11 0.38312086 1.4668553 4.614569e-02
## TrafficType_13 -0.93048304 0.3943632 3.711662e-08
## TrafficType_20 0.25413332 1.2893437 3.034244e-01
## TrafficType_Other -0.21573364 0.8059499 4.213860e-01
Significant Predictors: Most features are statistically significant (p < 0.05), except for total_pages_transformed, ProductRatio_transformed, and some traffic types (TrafficType_6, _10, _20, and _Other).
Top Positive Predictors:
Pre_Christmas: Increases the odds of purchase by 313.51% (odds ratio of about 4.14), making it the most influential feature.
TrafficType_8 and TrafficType_5: Drive significant increases in the odds of purchase (115.01% and 70.45%, respectively).
Top Negative Predictors:
ExitRates_transformed: Reduces the odds of purchase by 19.28% per unit increase on the transformed scale, indicating a strong association between higher exit rates and fewer purchases.
TrafficType_1 and TrafficType_3: Decrease the odds of purchase by 24.87% and 31.00%, respectively. (See the sketch below for how these percentages are derived from the coefficients.)
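For reference, the percentage changes quoted above are just the odds ratios re-expressed as (e^coefficient - 1) x 100; a quick sketch:
# Percentage change in odds implied by each coefficient
odds_pct_change <- (exp(coef(weighted_model)) - 1) * 100
round(odds_pct_change[c("Pre_Christmas", "TrafficType_8", "ExitRates_transformed")], 2)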
Impact of Feature Scaling: The min-max scaled copies ("_scaled" columns) were created, but the model above was fit on the unscaled transformed features, so the coefficients are on the Box-Cox scale rather than a common [0, 1] scale.
Business Recommendations:
Refine advertising strategies based on traffic sources: Traffic sources like 8 and 2 have higher odds of conversion. Allocate more advertising budget and effort to these traffic types to optimize ROI while deprioritizing less effective sources.
Improve loyalty programs for returning customers: Returning visitors are less likely to convert than new visitors. Enhance loyalty programs, provide personalized recommendations, and introduce retention offers to increase conversion rates among returning customers.
Increase targeted marketing for new customers: New visitors are significantly more likely to convert. Design campaigns aimed at acquiring first-time customers, such as exclusive welcome offers or discounts tailored to encourage initial purchases.
Leverage key features for optimization: Features like PageValues and ExitRates significantly impact purchase behavior. Streamline website navigation, improve page engagement, and reduce exit rates to optimize the shopping experience and increase conversion odds.
Invest in high-converting traffic channels: Channels like TrafficType 8 and TrafficType 5 showed the highest odds of generating revenue. Focus resources on these channels while experimenting with strategies to improve the performance of other channels.