library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.3.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.2
library(broom)
library(lindia)
## Warning: package 'lindia' was built under R version 4.3.3
library(dplyr)
project_data <- read.csv("online_shoppers_intention.csv")

Project Breakdown

Scope & Context:

Objective: The primary goal of this project is to understand and model customer behavior in an e-commerce setting to predict positive shopping outcomes (purchases). By analyzing various features of user sessions and their relationship with revenue generation, we aim to identify actionable insights that businesses can use to improve their conversion rates and overall customer experience.

First, I’ll give a brief summary of our data set.

  • The data set includes 12,000+ instances of shopping sessions on our company's e-commerce platform (specific industry not defined).

    dim(project_data)
    ## [1] 12330    18
  • It includes 18 features related either to the customer's context (Region, Browser, Month, VisitorType, etc.) or to how the session went (ExitRates, ProductRelated_Duration, PageValues, etc.).

    unique(names(project_data))
    ##  [1] "Administrative"          "Administrative_Duration"
    ##  [3] "Informational"           "Informational_Duration" 
    ##  [5] "ProductRelated"          "ProductRelated_Duration"
    ##  [7] "BounceRates"             "ExitRates"              
    ##  [9] "PageValues"              "SpecialDay"             
    ## [11] "Month"                   "OperatingSystems"       
    ## [13] "Browser"                 "Region"                 
    ## [15] "TrafficType"             "VisitorType"            
    ## [17] "Weekend"                 "Revenue"
  • Our target variable is ‘Revenue’, a logical variable indicating whether a shopper did (TRUE) or did not (FALSE) make a purchase.

    unique(project_data$Revenue)
    ## [1] FALSE  TRUE
  • There are no missing values.

    any(is.na(project_data))
    ## [1] FALSE

EDA:

Here’s an initial summary for more details:

summary(project_data)
##  Administrative   Administrative_Duration Informational    
##  Min.   : 0.000   Min.   :   0.00         Min.   : 0.0000  
##  1st Qu.: 0.000   1st Qu.:   0.00         1st Qu.: 0.0000  
##  Median : 1.000   Median :   7.50         Median : 0.0000  
##  Mean   : 2.315   Mean   :  80.82         Mean   : 0.5036  
##  3rd Qu.: 4.000   3rd Qu.:  93.26         3rd Qu.: 0.0000  
##  Max.   :27.000   Max.   :3398.75         Max.   :24.0000  
##  Informational_Duration ProductRelated   ProductRelated_Duration
##  Min.   :   0.00        Min.   :  0.00   Min.   :    0.0        
##  1st Qu.:   0.00        1st Qu.:  7.00   1st Qu.:  184.1        
##  Median :   0.00        Median : 18.00   Median :  598.9        
##  Mean   :  34.47        Mean   : 31.73   Mean   : 1194.8        
##  3rd Qu.:   0.00        3rd Qu.: 38.00   3rd Qu.: 1464.2        
##  Max.   :2549.38        Max.   :705.00   Max.   :63973.5        
##   BounceRates         ExitRates         PageValues        SpecialDay     
##  Min.   :0.000000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.01429   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.003112   Median :0.02516   Median :  0.000   Median :0.00000  
##  Mean   :0.022191   Mean   :0.04307   Mean   :  5.889   Mean   :0.06143  
##  3rd Qu.:0.016813   3rd Qu.:0.05000   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.200000   Max.   :0.20000   Max.   :361.764   Max.   :1.00000  
##     Month           OperatingSystems    Browser           Region     
##  Length:12330       Min.   :1.000    Min.   : 1.000   Min.   :1.000  
##  Class :character   1st Qu.:2.000    1st Qu.: 2.000   1st Qu.:1.000  
##  Mode  :character   Median :2.000    Median : 2.000   Median :3.000  
##                     Mean   :2.124    Mean   : 2.357   Mean   :3.147  
##                     3rd Qu.:3.000    3rd Qu.: 2.000   3rd Qu.:4.000  
##                     Max.   :8.000    Max.   :13.000   Max.   :9.000  
##   TrafficType    VisitorType         Weekend         Revenue       
##  Min.   : 1.00   Length:12330       Mode :logical   Mode :logical  
##  1st Qu.: 2.00   Class :character   FALSE:9462      FALSE:10422    
##  Median : 2.00   Mode  :character   TRUE :2868      TRUE :1908     
##  Mean   : 4.07                                                     
##  3rd Qu.: 4.00                                                     
##  Max.   :20.00

This provides a lot of information, but at this step here’s what seems most important for my purposes:

  • The target variable, ‘Revenue’, is heavily imbalanced (10,422 negative cases vs 1,908 positive, or 84.5% vs 15.5%)

    # Create a new column with labels for the Revenue variable
    project_data <- project_data |>
      dplyr::mutate(Revenue_Label = ifelse(Revenue == TRUE, "Purchase", "No Purchase"))
    
    # Calculate the counts for each label
    revenue_counts <- project_data |>
      dplyr::count(Revenue_Label) |>
      dplyr::mutate(percentage = n / sum(n) * 100)
    
    # Plot a bar chart with counts and percentage labels inside the bars
    ggplot(revenue_counts, aes(x = Revenue_Label, y = n, fill = Revenue_Label)) +
      geom_bar(stat = "identity", color = "black") +
      geom_text(aes(label = paste0(round(percentage, 1), "%")), 
                position = position_stack(vjust = 0.5), size = 6, color = "black") +
      labs(
        title = "Histogram of Revenue Outcomes",
        x = "Outcome",
        y = "Count"
      ) +
      scale_fill_manual(values = c("No Purchase" = "lightcoral", "Purchase" = "lightgreen")) +
        theme_minimal() +
      theme(
        legend.position = "none",            # Removes the legend
        panel.background = element_rect(fill = "white", color = "black"),  # White background with black border
        plot.background = element_rect(fill = "white")  # White background around the plot
      )

  • Some of the categorical variables are encoded as integers (OperatingSystems, Browser, and Region)

  • Most session measurements are concentrated near zero and skewed right. Here's one example of how they're distributed (I won't print all of them to save space since they're all similar, as the summary above shows)

    # Plot a histogram of ProductRelated (product page counts)
    ggplot(project_data, aes(x = ProductRelated)) +
      geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
      labs(title = "Distribution of Product Pages", x = "ProductRelated", y = "Count")

  • Most features are intuitive but some need further explanation:

    • Page types are divided into Administrative, Informational, and ProductRelated. We have two variables for each: one for the total number of pages of that type visited, and one for the time in seconds spent on pages of that type.

    • BounceRates: The percentage of visitors who enter the website through that page and leave without triggering any other requests during that session.

    • ExitRates: For a given page, the percentage of pageviews on the website for which that page was the last one viewed in the session.

    • PageValues: The average value of a page, based on the pages a user visited before reaching the target page or completing an e-commerce transaction.

    • SpecialDay: The closeness of the browsing date to a special day or holiday (e.g. Mother's Day or Valentine's Day) on which transactions are more likely to be finalized.

    • VisitorType: New, Returning, or Other
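
To ground these descriptions, here's a quick look at the actual encodings (a sketch; output not shown):

# Inspect the encodings described above
table(project_data$VisitorType)
sort(unique(project_data$SpecialDay))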

Next, we’ll look at how all the session metrics interact with each other correlation-wise.

library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.3.3
# Select the session features for the correlation matrix
selected_features <- project_data |>
  dplyr::select(Revenue, BounceRates, ExitRates, PageValues, 
                Administrative_Duration, 
                Informational_Duration, 
                ProductRelated_Duration,
                Administrative, Informational, 
                ProductRelated)

# Compute the correlation matrix
cor_matrix <- cor(selected_features, use = "complete.obs", method = "spearman")

# Plot the upper triangle of the correlation matrix
ggcorrplot(cor_matrix, 
           method = "square", 
           type = "upper", 
           lab = TRUE,
           title = "Correlation Matrix of Key Features",
           show.legend = FALSE) +
    theme(axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 8),
        plot.title = element_text(size = 14, hjust = 0.5))

From this I notice a few more important insights:

  • ExitRates and BounceRates have a correlation of 0.91, which is high enough that one of them should be removed going forward

    • Since BounceRates is slightly less correlated with Revenue, I'll remove BounceRates (a quick check of this follows this list)
  • As I'd expect, there's also a lot of multicollinearity among the page-type features, especially between the count and duration variables for the same page type

    • These six features should be consolidated at some point to handle this issue
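
To back up the decision to drop BounceRates rather than ExitRates, the quick check mentioned above could look like this (a sketch; output not shown):

# Compare each rate's Spearman correlation with Revenue
cor(as.numeric(project_data$Revenue), project_data$BounceRates, method = "spearman")
cor(as.numeric(project_data$Revenue), project_data$ExitRates, method = "spearman")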

Next, I want to analyze some of the categorical features in our user metrics.

  • These include traffic type, browser, operating system, and region, so I'll make a plot to analyze them

    • I wanted to compare the proportions of the target for each class easily while still seeing the count for each group, so I did it like this:
plot_categorical_distribution <- function(data, categorical_var) {
  library(ggplot2)
  library(dplyr)
  
  # Normalize data to make the bars take up the full plot area (100%)
  summary_data <- data %>%
    group_by(!!sym(categorical_var), Revenue) %>%
    summarize(Count = n(), .groups = "drop") %>%
    group_by(!!sym(categorical_var)) %>%
    mutate(Percentage = Count / sum(Count) * 100)
  
  # Automatically generate the title based on the categorical variable name
  title <- paste(categorical_var, "versus Revenue")
  
  # Create the proportional stacked bar chart
  ggplot(summary_data, aes(x = factor(!!sym(categorical_var)), y = Percentage, fill = Revenue)) +
    geom_bar(stat = "identity", position = "fill", color = "black") +
    geom_text(
      aes(label = paste0(round(Count, 1))),
      position = position_fill(vjust = 0.5), # Center the count labels within the bars
      size = 3
    ) +
    labs(
      title = title,
      x = categorical_var,
      y = "Percentage",
      fill = "Revenue"
    ) +
    theme_minimal() +
    scale_fill_manual(values = c("lightpink", "lightgreen"))
}

# Example usage:
plot_categorical_distribution(data = project_data, categorical_var = "TrafficType")

plot_categorical_distribution(data = project_data, categorical_var = "Browser")

plot_categorical_distribution(data = project_data, categorical_var = "OperatingSystems")

plot_categorical_distribution(data = project_data, categorical_var = "Region")

  • Before breaking these down, I'll clarify that, to my knowledge, there's no documentation for any of these features that maps the numbers to specific real-world browsers, regions, etc.

  • This becomes an issue for how my final insights can be interpreted. If we don't know the browser type, how can we make suggestions to the company's executives?

  • For this reason I'll set aside all of these features except one, since we're instructed to use at least one categorical feature

  • I'll use TrafficType and assume that someone in the company can match its classes to groups of customers

  • I chose this one because it appears to affect conversion more than the others and, in my eyes, is more directly related to the shopping session, since it shows how the customer ended up on our site (a quick check of this follows below)
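
As a quick check of that claim (a sketch; output not shown), conversion can be summarized per traffic type:

# Conversion rate and session count by traffic type
project_data |>
  dplyr::group_by(TrafficType) |>
  dplyr::summarise(conversion_rate = mean(Revenue), sessions = dplyr::n())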

Next, we’ll take a look at our temporal features Month, SpecialDay, and Weekend

  • First, a distribution of conversions by month:
# Ensure the month order is set correctly
month_order <- c("Feb", "Mar", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
project_data$Month <- factor(project_data$Month, levels = month_order)

# Create a table of just Revenue and Month
month_revenue_counts <- table(project_data$Revenue, project_data$Month)

# Calculate the percentage of True for each month
true_percents <- prop.table(month_revenue_counts, 2)[2, ] * 100

# Convert the table into a data frame for ggplot
df_month_revenue <- as.data.frame(month_revenue_counts)

# Plot the stacked bar chart with percentage labels for True %
ggplot(df_month_revenue, aes(x = Var2, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity", color = "black") +
  scale_fill_manual(values = c("lightpink", "lightgreen"), labels = c( "False", "True")) +
  labs(title = "Distribution of Revenue by Month",
       x = "Month",
       y = "Count",
       fill = "Revenue") +
  theme_minimal() +
  geom_text(data = subset(df_month_revenue, Var1 == "TRUE"),
            aes(x = Var2, y = Freq / 2, label = paste0(round(true_percents, 1), "%")),
            color = "black", size = 3)

  • Takeaways:

    • Conversion rates and visit volume change a lot based on the month

      • Volume increases by a large degree in spring and right before Christmas

      • Conversion improves in the months prior to Christmas, but the same isn't true for spring

    • The only issue is that April and January are missing, so the feature loses some of its temporal value, but it still appears useful for drawing insights

Now I’ll look at weekend:

plot_categorical_distribution(data = project_data, categorical_var = "Weekend")

  • This shows that there's not much difference in how users shop based on whether it's the weekend

  • More sessions happen on weekdays, but let's divide the weekday (FALSE) count by 5 and the weekend (TRUE) count by 2 to compare visits per day (a small sketch of this follows below).

    • weekday ≈ 1,892 per day, weekend ≈ 1,434 per day, so fewer people seem to visit on weekends, but conversion is consistent for both
  • Because there's no difference in conversion, I'll drop this feature from my model
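
Here's that per-day arithmetic as a small sketch (output not shown):

# Rough visits-per-day comparison: weekday count over 5 days vs weekend count over 2 days
weekend_counts <- table(project_data$Weekend)
c(weekday_per_day = weekend_counts[["FALSE"]] / 5,
  weekend_per_day = weekend_counts[["TRUE"]] / 2)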

  • Now I’ll check special day

plot_categorical_distribution(data = project_data, categorical_var = "SpecialDay")

  • Interestingly, the likelihood of conversion decreases the closer the session is to a "Special Day"

  • As with the other categorical features, there is no documentation on what denotes a special day; the only explanation the source gives is "This value represents the closeness of the browsing date to special days or holidays (eg Mother's Day or Easter)", and nothing else.

  • For this reason I won’t use this feature

Finally my last feature in user metrics is VisitorType

plot_categorical_distribution(data = project_data, categorical_var = "VisitorType")

  • I don’t see how a visitor could be in a group other than new or returning so I’m going to remove those instances from the data set and print it again.
 project_data <- project_data |>
   filter(VisitorType != "Other")
 
 plot_categorical_distribution(data = project_data, categorical_var = "VisitorType")

  • It looks like there are a lot more returning visitors, but they don't convert as often as new visitors

  • This seems like useful information, so I'll use this feature further into the project

Feature Engineering:

At this point we’ve removed Browser, OperatingSystem, SpecialDay, Weekend, Region, BounceRate

Other than VisitorType, the remaining features all need to be engineered in some way

First, I’ll handle the Month feature:

Month

  • My plan was to group the months leading up to Christmas, since we see an increase in conversion rates in those months:
# Create a Pre_Christmas indicator from Month (1 = Sep/Oct/Nov, 0 = other months)
project_data <- project_data |>
  mutate(Pre_Christmas = if_else(Month %in% c("Sep","Oct", "Nov"), 1, 0))

# Convert Pre_Christmas to a factor for proper visualization
project_data$Pre_Christmas <- factor(project_data$Pre_Christmas, levels = c(1, 0), labels = c("Pre-Christmas", "Other"))

# Calculate percentages for labels
pre_christmas_counts <- project_data |>
  group_by(Pre_Christmas, Revenue) |>
  summarise(count = n(), .groups = "drop") |>
  group_by(Pre_Christmas) |>
  mutate(percentage = count / sum(count) * 100)

# Create the plot
ggplot(pre_christmas_counts, aes(x = Pre_Christmas, y = count, fill = Revenue)) +
  geom_bar(stat = "identity", position = "stack", color = "black") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            position = position_fill(vjust = 2000), size = 3.5) +
  scale_fill_manual(values = c("lightpink", "lightgreen"), labels = c("False", "True")) +
  labs(
    title = "Seasonality Impact on Purchase Status",
    x = "Seasonality",
    y = "Count",
    fill = "Revenue"
  ) +
  theme_minimal()

Next I’ll look at the page data

Page Metrics

  • Based on insights I gathered in the Data Dives, I learned that product-related pages are the most useful for understanding customer conversion

  • For this reason I'm inclined to create a feature measuring the ratio of product-page duration to the combined duration across all page types

  • So I summed all the durations and divided the product-related duration by that total; here's the formula I used:

    ProductRatio = ProductRelated_Duration / (ProductRelated_Duration + Administrative_Duration + Informational_Duration + 0.01)

(I also add 0.01 to the denominator to keep it from being zero, which would cause issues)

Here’s the new feature distribution:

# Calculate ProductRatio: the share of session duration spent on product-related pages
project_data <- project_data |>
  mutate(ProductRatio = ProductRelated_Duration / 
           (ProductRelated_Duration + Administrative_Duration + Informational_Duration + .01))

# Plot the histogram of ProductRatio with bins at every 1%
ggplot(project_data, aes(x = ProductRatio)) +
  geom_histogram(binwidth = 0.01, fill = "blue", color = "black", alpha = 0.7) +
  labs(
    title = "Distribution of Product Ratio",
    x = "Product Ratio",
    y = "Count"
  ) +
  theme_minimal()

Now that I’ve covered how product pages effect the data, we’ll make a feature to represent the sheer quantity of pages visited by adding up all the page count features to get “total_pages”:

project_data <- project_data |>
  mutate(
    total_pages = Administrative + Informational + ProductRelated
  )

At this point I’ve effectively consolidated the 6 features with high multicollinearity into two features that describe the data in distinctly different way.

The only remaining feature that needs work before modeling is TrafficType, which I'll encode with dummy variables.

TrafficType

  • There were originally 20 traffic-type classes, but based on the earlier plot I've decided to consolidate the classes with small numbers of instances into TrafficType_Other

  • I'll first create dummy variables and then condense the feature space:

# One-hot encoding with fastDummies
library(fastDummies)
## Warning: package 'fastDummies' was built under R version 4.3.3
project_data <- dummy_cols(project_data, select_columns = "TrafficType")

# Create the "TrafficType_Other" variable by summing the selected TrafficTypes
project_data$TrafficType_Other <- (
  project_data$TrafficType_19 +
  project_data$TrafficType_18 +
  project_data$TrafficType_17 +
  project_data$TrafficType_16 +
  project_data$TrafficType_15 +
  project_data$TrafficType_14 +
  project_data$TrafficType_12 +
  project_data$TrafficType_9 +
  project_data$TrafficType_7
)

# Ensure TrafficType_Other is binary (0 or 1)
project_data$TrafficType_Other <- ifelse(project_data$TrafficType_Other > 0, 1, 0)

# Remove the original dummy variables for the TrafficTypes folded into "Other"
# (TrafficType_4 is also dropped here and left out of the later model, effectively treating it as the baseline)
project_data <- project_data |>
  dplyr::select(-TrafficType_19, -TrafficType_18, -TrafficType_17, -TrafficType_16,
                -TrafficType_15, -TrafficType_14, -TrafficType_12, -TrafficType_9, -TrafficType_7, -TrafficType_4)

Now I’ve got 12 features to represent these classes and I can move on to transformations,

Transformations

Next, I need to handle the skewness and outlier problems in many of my continuous variables

  • This issue affects ExitRates, PageValues, and both of the features we just made

  • To fix it, we'll start by estimating the optimal lambda with the Box-Cox method to determine the best power transformation

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# Shift the variables so all values are strictly positive (Box-Cox requires positive values)
project_data$ExitRates_shifted <- project_data$ExitRates + 0.001
project_data$pagevalues_shifted <- project_data$PageValues + 0.001
project_data$total_pages_shifted <- project_data$total_pages + 0.001
# Reflect ProductRatio so its mass near 1 becomes mass near 0 and values stay positive
project_data$ProductRatio_shifted <- 1 - (project_data$ProductRatio)

# Apply the Box-Cox transformation to determine the optimal lambda for the shifted variable
boxcox_result <- boxcox(lm(ExitRates_shifted ~ 1, data = project_data), 
                        lambda = seq(-2, 2, by = 0.1))

# Apply Box-Cox for PageValues
boxcox_pagevalues <- boxcox(lm(pagevalues_shifted ~ 1, data = project_data), 
                            lambda = seq(-2, 2, by = 0.1))

# Apply Box-Cox for total_pages
boxcox_total_pages <- boxcox(lm(total_pages_shifted ~ 1, data = project_data), 
                             lambda = seq(-2, 2, by = 0.1))

# Apply Box-Cox for product_ratio
boxcox_ProductRatio <- boxcox(lm(ProductRatio_shifted ~ 1, data = project_data), 
                             lambda = seq(-2, 2, by = 0.1))

# Find the optimal lambda
optimal_lambda_exitrates <- boxcox_result$x[which.max(boxcox_result$y)]
cat("ExitRates lambda:", optimal_lambda_exitrates, "\n")
## ExitRates lambda: 0.02020202
# Identify optimal lambda for PageValues
optimal_lambda_pagevalues <- boxcox_pagevalues$x[which.max(boxcox_pagevalues$y)]
cat("Optimal lambda for PageValues:", optimal_lambda_pagevalues, "\n")
## Optimal lambda for PageValues: -0.4242424
# Identify optimal lambda for total_pages
optimal_lambda_total_pages <- boxcox_total_pages$x[which.max(boxcox_total_pages$y)]
cat("Optimal lambda for Total Pages:", optimal_lambda_total_pages, "\n")
## Optimal lambda for Total Pages: 0.1414141
# Find the optimal lambda
optimal_lambda_ProductRatio <- boxcox_ProductRatio$x[which.max(boxcox_ProductRatio$y)]
cat("ProductRatio lambda:", optimal_lambda_ProductRatio, "\n")
## ProductRatio lambda: 0.1010101
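
Lambdas this close to zero behave almost exactly like a log transform; here's a tiny illustration (a sketch with made-up values, output not shown):

# For lambda near 0, the Box-Cox transform (y^lambda - 1) / lambda is nearly identical to log(y)
y_demo <- c(0.01, 0.1, 1, 10, 100)
lambda_demo <- 0.02
cbind(boxcox = (y_demo^lambda_demo - 1) / lambda_demo, log = log(y_demo))
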
  • These results suggest that a log-like transformation is appropriate for ExitRates, total_pages, and ProductRatio to make their distributions more normal, since the estimated lambdas are all close to zero

  • The lambda for PageValues is negative (about -0.42), so I'll apply its Box-Cox transformation as well, but I don't expect the result to look normal

  • First, I’ll apply my transformations

# Apply the Box-Cox transformation to ExitRates using its estimated lambda
project_data$ExitRates_transformed <- 
  (project_data$ExitRates_shifted ^ optimal_lambda_exitrates - 1) / 
  optimal_lambda_exitrates

# Apply the Box-Cox transformation to PageValues
project_data$PageValues_transformed <- 
  (project_data$pagevalues_shifted ^ optimal_lambda_pagevalues - 1) / 
  optimal_lambda_pagevalues

# Apply the Box-Cox transformation to total_pages
project_data$total_pages_transformed <- 
  (project_data$total_pages_shifted ^ optimal_lambda_total_pages - 1) / 
  optimal_lambda_total_pages

# Apply the Box-Cox transformation to ProductRatio
project_data$ProductRatio_transformed <- 
  (project_data$ProductRatio_shifted ^ optimal_lambda_ProductRatio - 1) / 
  optimal_lambda_ProductRatio

Now I’ll plot the distributions and see how it worked out

# Create a list of transformed features and their names
transformed_features <- list(
  ExitRates = project_data$ExitRates_transformed,
  PageValues = project_data$PageValues_transformed,
  TotalPages = project_data$total_pages_transformed,
  ProductRatio = project_data$ProductRatio_transformed
)

# Combine the features into a data frame for plotting
transformed_df <- do.call(rbind, lapply(names(transformed_features), function(name) {
  data.frame(Value = transformed_features[[name]], Feature = name)
}))

# Plot the distribution of all transformed features
ggplot(transformed_df, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
  facet_wrap(~Feature, scales = "free", ncol = 2) +
  labs(
    title = "Distributions of Transformed Features",
    x = "Value",
    y = "Count"
  ) +
  theme_minimal()

  • This does a good job of normalizing ExitRates and TotalPages

  • ProductRatio becomes bimodal rather than normal, but I'm alright with that; the distribution is still more interpretable in this form

  • PageValues forms two distinct spikes, which is interesting

Now I’ll plot these again, but this time I’ll incorporate the revenue feature and see what happened

# Ensure Revenue is a factor
project_data$Revenue <- as.factor(project_data$Revenue)

# Create a list of transformed features and their names
transformed_features <- list(
  ExitRates = project_data$ExitRates_transformed,
  PageValues = project_data$PageValues_transformed,
  TotalPages = project_data$total_pages_transformed,
  ProductRatio = project_data$ProductRatio_transformed
)

# Combine the features into a long data frame for plotting
transformed_df <- do.call(rbind, lapply(names(transformed_features), function(name) {
  data.frame(Value = transformed_features[[name]], 
             Feature = name, 
             Revenue = project_data$Revenue)
}))

# Plot the stacked bar plots in a 2x2 grid
ggplot(transformed_df, aes(x = Value, fill = Revenue)) +
  geom_histogram(bins = 30, color = "black", position = "stack") +
  facet_wrap(~Feature, scales = "free", ncol = 2) +  # Set ncol = 2 for a 2x2 grid
  labs(
    title = "Distributions of Transformed Features with Revenue Interaction",
    x = "Value",
    y = "Count"
  ) +
  scale_fill_manual(values = c("lightpink", "lightgreen"), labels = c("No Purchase", "Purchase")) +
  theme_minimal()

Here’s we a take from this:

  • ExitRates: Lower exit rates appear to moderately increase the likelihood of purchase, and values above roughly -2.5 (on this transformed scale) have nearly no chance of conversion. This is what I'd expect from ExitRates' effect on purchasing behavior, and the now near-normal distribution makes the relationship much clearer.

  • PageValues: This was a highly correlated feature in its raw form, but that was impossible to see when plotting it as it was. The transformation makes the relationship much clearer: instances in the spike on the right are a majority positive conversions, while the spike at the lower extreme almost never results in a purchase.

  • ProductRatio: This one is interesting. I expected the upper range of the distribution to have much better conversion, but it's only slightly higher (proportion-wise), as there's a somewhat sizable group of positive instances in the lower range.

  • TotalPages: The purchasing distribution is shifted slightly toward higher values compared to the distribution of the dataset as a whole. I expected something like this: a moderate increase in conversion the more pages the user visited.

I’ll accept all these transformations as my new features in my upcoming model as they will make them more interpratable and will make my model more reliable.

Now I can move on to the next part of my inferential analysis.

Hypothesis Testing:

Now that I have a good understanding of my variables, I can state a null hypothesis about how changes in these features relate to our target variable. I will use t-tests and an ANOVA test to evaluate this hypothesis and further understand how well these features explain purchases in our data set.

Null Hypothesis (H₀):

A customer’s session behavior doesn’t significantly predict whether they will make a purchase.

Alternative Hypothesis (H₁):

A customer’s session behavior significantly predicts whether they will make a purchase.

Two-Sample T-Test

Our two binary features, VisitorType and Pre_Christmas, are best analyzed with two-sample t-tests:

# Convert Revenue back to numeric 0/1 for the t-tests
project_data$Revenue <- as.numeric(as.factor(project_data$Revenue)) - 1 # Convert Revenue to 0 and 1

# Subset data by VisitorType
visitor_new <- project_data$Revenue[project_data$VisitorType == "New_Visitor"]
visitor_returning <- project_data$Revenue[project_data$VisitorType == "Returning_Visitor"]

# Perform a two-sample t-test
t_test_visitor <- t.test(visitor_new, visitor_returning, 
                         alternative = "two.sided", 
                         var.equal = FALSE)

# Print the results
print(t_test_visitor)
## 
##  Welch Two Sample t-test
## 
## data:  visitor_new and visitor_returning
## t = 9.946, df = 2055.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.08814291 0.13143956
## sample estimates:
## mean of x mean of y 
## 0.2491145 0.1393233
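
Since Revenue here is coded 0/1, the same comparison can also be framed as a two-proportion test; as a sketch (output not shown), it should tell essentially the same story:

# Equivalent framing: compare conversion proportions for new vs returning visitors
prop.test(
  x = c(sum(visitor_new), sum(visitor_returning)),
  n = c(length(visitor_new), length(visitor_returning))
)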

Now pre_christmas:

# Subset data by Pre_Christmas
pre_christmas <- project_data$Revenue[project_data$Pre_Christmas == "Pre-Christmas"]
other <- project_data$Revenue[project_data$Pre_Christmas == "Other"]

# Perform a two-sample t-test
t_test_pre_christmas <- t.test(pre_christmas, other, 
                               alternative = "two.sided", 
                               var.equal = FALSE)

# Print the results
print(t_test_pre_christmas)
## 
##  Welch Two Sample t-test
## 
## data:  pre_christmas and other
## t = 16.81, df = 6131.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1132640 0.1431692
## sample estimates:
## mean of x mean of y 
## 0.2411276 0.1129110

Takeaways:

  • Both tests show strong evidence of significant differences between the compared groups, highlighting the role of seasonality (Pre-Christmas) and customer type (New vs. Returning) in driving revenue.

  • On the basis of these two features, we can confidently reject the null hypothesis as both p-values are well below 0.05

  • The higher proportions for Pre-Christmas and New Visitors suggest actionable insights:

    • Pre-Christmas campaigns should be a focus, as they align with increased likelihoods of revenue generation.

    • Attracting new visitors through targeted marketing or first-time purchase discounts may yield higher conversion rates.

Next, I'll run an ANOVA test on my categorical feature, TrafficType.

ANOVA Test

# Create a new column that consolidates the sparse traffic types (as we did for the dummy variables) into a single factor we can run the ANOVA on

# Group the TrafficType categories into a new categorical feature
project_data$TrafficTypes_Grouped <- as.factor(ifelse(
  project_data$TrafficType %in% c("7", "9", "12", "14", "15", "16", "17", "18", "19", "20"),
  "Other",
  as.character(project_data$TrafficType)
))

# Create a subset of the data with necessary variables
traffic_data <- project_data %>%
  dplyr::select(TrafficTypes_Grouped, Revenue) %>%
  mutate(Revenue = as.numeric(as.factor(Revenue)) - 1)  # Convert Revenue to numeric (0 = FALSE, 1 = TRUE)

# Conduct ANOVA test
anova_result <- aov(Revenue ~ as.factor(TrafficTypes_Grouped), data = traffic_data)

# Summarize the results
summary(anova_result)
##                                    Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(TrafficTypes_Grouped)    10   44.8   4.484   35.28 <2e-16 ***
## Residuals                       12234 1554.8   0.127                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • This feature also yields a very small p-value (< 2e-16), indicating a significant effect of traffic type on the revenue variable.

  • Again, we can reject the null hypothesis
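
The ANOVA only tells us that at least one traffic-type group differs from the others; as an optional follow-up (a sketch, output not shown), Tukey's HSD on the same fitted model would show which pairs of groups differ:

# Pairwise comparisons between the grouped traffic types
TukeyHSD(anova_result)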

Next, I’ll move on to my logistic model,

Weighted Logistic Model:

  • First, I’ll do some preprocessing and make our features work for modelling and define the modelling dataset
# Convert VisitorType to binary numeric values: 1 for New, 0 for Returning
project_data <- project_data |>
  mutate(
    VisitorType = if_else(VisitorType == "New_Visitor", 1, 0)
  )

# Ensure VisitorType is numeric
project_data$VisitorType <- as.numeric(project_data$VisitorType)
# Select the variables to include in the modeling dataset
modeling_dataset <- project_data |>
  dplyr::select(
    ExitRates_transformed,
    PageValues_transformed,
    total_pages_transformed,
    ProductRatio_transformed,
    Pre_Christmas,
    VisitorType,
    dplyr::starts_with("TrafficType_"),
    Revenue
  )

# Recode the Pre_Christmas variable to binary
modeling_dataset <- modeling_dataset |>
  dplyr::mutate(Pre_Christmas = ifelse(Pre_Christmas == "Other", 0, 1))

# Ensure Revenue is a factor
modeling_dataset$Revenue <- as.factor(modeling_dataset$Revenue)

Next, I'll display a correlation matrix of the remaining features to confirm there are no lingering multicollinearity issues:

# Load necessary library
library(ggcorrplot)

# Select numeric columns for correlation analysis
numeric_vars <- modeling_dataset |>
  dplyr::select(where(is.numeric))

# Compute the correlation matrix
cor_matrix <- cor(numeric_vars, use = "complete.obs", method = "spearman")

# Plot the upper triangle of the correlation matrix
ggcorrplot(
  cor_matrix,
  method = "square",
  type = "upper",
  lab = TRUE,
  title = "Correlation Matrix of Modeling Dataset",
  show.legend = FALSE
)

Results

# Load necessary libraries
library(dplyr)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(ROCR)

# Scale the selected features to [0, 1] (min-max scaling)
scale_features <- function(data, features) {
  data %>%
    mutate(across(all_of(features), ~ (. - min(.)) / (max(.) - min(.)), .names = "{col}_scaled"))
}

# Select the features to scale
features_to_scale <- c("ExitRates_transformed", "PageValues_transformed", 
                       "total_pages_transformed", "ProductRatio_transformed")

# Apply scaling
modeling_dataset_scaled <- scale_features(modeling_dataset, features_to_scale)

# Ensure 'Revenue' is a binary factor
modeling_dataset_scaled$Revenue <- as.factor(modeling_dataset_scaled$Revenue)

# Define weights for the imbalanced classes
class_counts <- table(modeling_dataset_scaled$Revenue)
total_count <- nrow(modeling_dataset_scaled)
weight_0 <- total_count / (2 * class_counts[1])  # Weight for class 0 (no revenue)
weight_1 <- total_count / (2 * class_counts[2])  # Weight for class 1 (revenue)
modeling_dataset_scaled$weights <- ifelse(modeling_dataset_scaled$Revenue == "0", weight_0, weight_1)
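
# (Sketch, output not shown) Sanity check of the weighting scheme: each class should
# contribute the same total weight (total_count / 2 per class)
tapply(modeling_dataset_scaled$weights, modeling_dataset_scaled$Revenue, sum)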

# Fit the weighted logistic regression model
weighted_model <- glm(
  Revenue ~ ExitRates_transformed + PageValues_transformed + 
    total_pages_transformed + ProductRatio_transformed + 
    VisitorType + Pre_Christmas + TrafficType_1 + 
    TrafficType_2 + TrafficType_3 + TrafficType_5 + TrafficType_6 +
    TrafficType_8 + TrafficType_10 + TrafficType_11 +
    TrafficType_13 + TrafficType_20 + TrafficType_Other,
  family = binomial,
  data = modeling_dataset_scaled,
  weights = weights
)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
# Make predictions
predicted_probabilities <- predict(weighted_model, type = "response")
predicted_classes <- ifelse(predicted_probabilities > 0.5, "1", "0")

# Create a confusion matrix
conf_matrix <- table(Predicted = predicted_classes, Actual = modeling_dataset_scaled$Revenue)

# Calculate evaluation metrics
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
precision <- conf_matrix["1", "1"] / sum(conf_matrix["1", ])
recall <- conf_matrix["1", "1"] / sum(conf_matrix[, "1"])
f1_score <- 2 * ((precision * recall) / (precision + recall))

# Output the evaluation metrics
cat("\n","Accuracy:", round(accuracy, 4), "\n")
## 
##  Accuracy: 0.8577
cat("Precision:", round(precision, 4), "\n")
## Precision: 0.5251
cat("Recall:", round(recall, 4), "\n")
## Recall: 0.8282
cat("F1 Score:", round(f1_score, 4), "\n")
## F1 Score: 0.6427
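
# (Sketch, output not shown) ROCR was loaded above but not otherwise used; a threshold-free
# check of the same predictions via ROC/AUC could look like this:
pred_obj <- ROCR::prediction(predicted_probabilities, modeling_dataset_scaled$Revenue)
ROCR::performance(pred_obj, measure = "auc")@y.values[[1]]
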
summary(weighted_model)
## 
## Call:
## glm(formula = Revenue ~ ExitRates_transformed + PageValues_transformed + 
##     total_pages_transformed + ProductRatio_transformed + VisitorType + 
##     Pre_Christmas + TrafficType_1 + TrafficType_2 + TrafficType_3 + 
##     TrafficType_5 + TrafficType_6 + TrafficType_8 + TrafficType_10 + 
##     TrafficType_11 + TrafficType_13 + TrafficType_20 + TrafficType_Other, 
##     family = binomial, data = modeling_dataset_scaled, weights = weights)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.252761   0.176330   1.433  0.15173    
## ExitRates_transformed    -0.214223   0.041314  -5.185 2.16e-07 ***
## PageValues_transformed    0.084853   0.001474  57.576  < 2e-16 ***
## total_pages_transformed   0.026868   0.018864   1.424  0.15436    
## ProductRatio_transformed -0.019126   0.012154  -1.574  0.11557    
## VisitorType               0.598996   0.080489   7.442 9.92e-14 ***
## Pre_Christmas             1.419522   0.061129  23.222  < 2e-16 ***
## TrafficType_1            -0.285957   0.112713  -2.537  0.01118 *  
## TrafficType_2             0.265408   0.102677   2.585  0.00974 ** 
## TrafficType_3            -0.371080   0.119820  -3.097  0.00195 ** 
## TrafficType_5             0.533282   0.186775   2.855  0.00430 ** 
## TrafficType_6            -0.151421   0.171409  -0.883  0.37703    
## TrafficType_8             0.765536   0.164172   4.663 3.12e-06 ***
## TrafficType_10            0.281268   0.160355   1.754  0.07942 .  
## TrafficType_11            0.383121   0.192131   1.994  0.04615 *  
## TrafficType_13           -0.930483   0.169054  -5.504 3.71e-08 ***
## TrafficType_20            0.254133   0.246943   1.029  0.30342    
## TrafficType_Other        -0.215734   0.268319  -0.804  0.42139    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16975.2  on 12244  degrees of freedom
## Residual deviance:  9204.7  on 12227  degrees of freedom
## AIC: 12228
## 
## Number of Fisher Scoring iterations: 5
# Extract and rank feature importance
feature_importance <- as.data.frame(summary(weighted_model)$coefficients)
feature_importance <- feature_importance[-1, ]  # Remove the intercept
feature_importance <- feature_importance[order(abs(feature_importance[, "Estimate"]), decreasing = TRUE), ]
feature_importance$Rank <- seq_len(nrow(feature_importance))

# Add feature names for clarity
feature_importance <- tibble::rownames_to_column(feature_importance, var = "Feature")

# Round the relevant columns to 3 decimal places
feature_importance <- feature_importance %>%
  dplyr::mutate(
    Estimate = round(Estimate, 3),
    `Std. Error` = round(`Std. Error`, 3),
    `z value` = round(`z value`, 3),
    `Pr(>|z|)` = round(`Pr(>|z|)`, 3)
  )

# Optionally visualize the feature importance
library(ggplot2)
ggplot(feature_importance, aes(x = reorder(Feature, abs(Estimate)), y = abs(Estimate))) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Feature Importance in Logistic Regression Model",
    x = "Features",
    y = "Absolute Coefficient Value"
  ) +
  theme_minimal()

# Summary of the model to get coefficients and p-values
model_summary <- summary(weighted_model)

# Extract coefficients and p-values
coefficients <- model_summary$coefficients
significant_coefficients <- coefficients[coefficients[, 4] < 1, ]  # Threshold of 1 keeps every term; tighten to 0.05 to keep only significant ones

# Create a data frame to hold results
results <- data.frame(
  Coefficient = significant_coefficients[, 1],  # Extract the coefficient
  Odds_Change = exp(significant_coefficients[, 1]),  # Calculate e^(coefficient)
  P_Value = significant_coefficients[, 4]  # Include p-values
)

# Display results
print(results)
##                          Coefficient Odds_Change       P_Value
## (Intercept)               0.25276123   1.2875758  1.517288e-01
## ExitRates_transformed    -0.21422349   0.8071680  2.157204e-07
## PageValues_transformed    0.08485326   1.0885573  0.000000e+00
## total_pages_transformed   0.02686843   1.0272326  1.543593e-01
## ProductRatio_transformed -0.01912564   0.9810561  1.155665e-01
## VisitorType               0.59899590   1.8202901  9.919369e-14
## Pre_Christmas             1.41952195   4.1351431 2.736636e-119
## TrafficType_1            -0.28595653   0.7512953  1.117974e-02
## TrafficType_2             0.26540805   1.3039630  9.741534e-03
## TrafficType_3            -0.37107956   0.6899890  1.954999e-03
## TrafficType_5             0.53328193   1.7045173  4.300814e-03
## TrafficType_6            -0.15142065   0.8594861  3.770273e-01
## TrafficType_8             0.76553623   2.1501470  3.115941e-06
## TrafficType_10            0.28126789   1.3248085  7.942472e-02
## TrafficType_11            0.38312086   1.4668553  4.614569e-02
## TrafficType_13           -0.93048304   0.3943632  3.711662e-08
## TrafficType_20            0.25413332   1.2893437  3.034244e-01
## TrafficType_Other        -0.21573364   0.8059499  4.213860e-01
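
The Odds_Change column above is just e^coefficient; the percentage figures quoted below come from converting that into a percent change in the odds of purchase (a small sketch, output not shown):

# Convert a logistic coefficient into a percent change in the odds of purchase
percent_odds_change <- function(coef) (exp(coef) - 1) * 100
percent_odds_change(1.41952195)   # Pre_Christmas: about +313.5%
percent_odds_change(-0.21422349)  # ExitRates_transformed: about -19.3%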

Interpretations and Suggestions:

  • Significant Predictors: Most features are statistically significant (p < 0.05), except for total_pages_transformed, ProductRatio_transformed, and some traffic types (TrafficType_6, 10, 20, and Other).

  • Top Positive Predictors:

    • Pre_Christmas: Increases the odds of purchase by about 313.5% (odds ratio ≈ 4.14), making it the most influential feature.

    • TrafficType_8 and TrafficType_5: Drive significant increases in the odds of purchase (about 115.0% and 70.5%, respectively).

  • Top Negative Predictors:

    • ExitRates_transformed: Reduces the odds of purchase by about 19.3% per unit increase, indicating a strong association between higher exit rates and fewer purchases.

    • TrafficType_1 and TrafficType_3: Decrease the odds of purchase by about 24.9% and 31.0%, respectively.

  • Impact of Feature Transformation:

    • The numerical features enter the model on their transformed (Box-Cox) scales, so a one-unit increase corresponds to an increment on that transformed scale rather than a real-world measure. This impacts interpretation, especially for features like ExitRates_transformed, where practical implications must be carefully framed.
  • Business Recommendations:

    • Focus promotions and discounts on seasonal trends:
      The model indicates a significant increase in purchase likelihood during the Pre-Christmas period. Businesses should plan promotional campaigns and discounts to capitalize on this seasonal behavior.
    • Refine advertising strategies based on traffic sources:
      Traffic sources like 8 and 2 have higher odds of conversion. Allocate more advertising budget and efforts to these traffic types to optimize ROI while deprioritizing less effective sources.

    • Improve loyalty programs for returning customers:
      Returning visitors are less likely to convert compared to new visitors. Enhance loyalty programs, provide personalized recommendations, and introduce retention offers to increase conversion rates among returning customers.

    • Increase targeted marketing for new customers:
      New visitors are significantly more likely to convert. Design campaigns aimed at acquiring first-time customers, such as exclusive welcome offers or discounts tailored to encourage initial purchases.

    • Leverage key features for optimization:
      Features like Page Values and Exit Rates significantly impact purchase behavior. Streamline website navigation, improve page engagement, and reduce exit rates to optimize the shopping experience and increase conversion odds.

    • Invest in high-converting traffic channels:
      Channels like TrafficType 8 and TrafficType 5 showed the highest odds of generating revenue. Focus resources on these channels, while experimenting with strategies to improve the performance of other channels.