library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
project_data <- read.csv("online_shoppers_intention.csv")

Data Discovery

SUMMARY

Description:

The Online Shoppers’ Purchasing Intention dataset from the UCI repository (UCI - online shoppers purchasing intention dataset) consists of data collected from online shopping sessions with the goal of predicting whether a user will complete a purchase. It includes a variety of features such as time spent on different types of pages (Administrative, Informational, and Product-related), as well as behavioral metrics like Bounce Rates, Exit Rates, and Page Values, which estimate how much a page contributes to revenue. The dataset also includes categorical variables, such as Visitor Type (New or Returning Visitors), Traffic Type, Browser, and Region, and an indicator for sessions occurring near special days, which may influence purchasing behavior. The primary target variable, Revenue, is a boolean field that identifies whether the session resulted in a purchase. This combination of numeric and categorical features provides a detailed view of user behavior and website navigation efficiency.

Main Purpose:

Main Question: “What patterns in user behavior, such as time spent on various page types and special day effects, contribute most to online purchases?”

Goal: This project aims to analyze the relationship between different user engagement metrics and their likelihood of completing a purchase, helping e-commerce platforms optimize website design and user experience.

Visualization:

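This stacked bar chart shows the number of sessions in each month, split by whether the session ended in a purchase. The label on each bar is the percentage of that month's sessions that ended in a purchase.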

# Ensure the month order is set correctly
month_order <- c("Feb", "Mar", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
project_data$Month <- factor(project_data$Month, levels = month_order)

# Create a table of just Revenue and Month
month_revenue_counts <- table(project_data$Revenue, project_data$Month)

# Calculate the percentage of True for each month
true_percents <- prop.table(month_revenue_counts, 2)[2, ] * 100

# Convert the table into a data frame for ggplot
df_month_revenue <- as.data.frame(month_revenue_counts)

# Plot the stacked bar chart with percentage labels for True %
ggplot(df_month_revenue, aes(x = Var2, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("lightcoral", "lightblue"), labels = c( "False", "True")) +
  labs(title = "Distribution of Revenue by Month",
       x = "Month",
       y = "Count",
       fill = "Revenue") +
  theme_minimal() +
  geom_text(data = subset(df_month_revenue, Var1 == "TRUE"),
            aes(x = Var2, y = Freq / 2, label = paste0(round(true_percents, 1), "%")),
            color = "black", size = 3)

  • Shopping activity rises sharply in spring and again in November and December, ahead of Christmas.

  • The highest conversion rate also occurs in November, at 25.4%.
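
The percentages on the chart can also be computed directly as a ranked table, which makes months easier to compare. The following is a minimal sketch, assuming `Revenue` was read as a logical column; `conversion_rate` is a hypothetical helper name and the toy data frame is made up for illustration only.

```r
# Hypothetical helper: share of sessions (in %) that ended in a purchase,
# per group, sorted from highest to lowest.
# Assumes `revenue` is logical (TRUE = purchase).
conversion_rate <- function(revenue, group) {
  rates <- tapply(revenue, group, mean) * 100
  sort(round(rates, 1), decreasing = TRUE)
}

# Toy data for illustration only (not taken from the real dataset)
toy <- data.frame(
  Month   = c("Nov", "Nov", "Nov", "Nov", "May", "May", "May", "May", "May"),
  Revenue = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

toy_rates <- conversion_rate(toy$Revenue, toy$Month)
print(toy_rates)
```

On the real data, `conversion_rate(project_data$Revenue, project_data$Month)` should match the percentage labels on the chart, provided `Revenue` is logical.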

For the next visual, we compare how new and returning visitors shop.

# Filter out instances where VisitorType is 'Other'
filtered_data <- subset(project_data, VisitorType %in% c("New_Visitor", "Returning_Visitor"))
visitor_revenue_counts_filtered <- table(filtered_data$Revenue, filtered_data$VisitorType)

# Calculate percentages for True (assuming True is the second row in the table)
true_percents_visitor_filtered <- prop.table(visitor_revenue_counts_filtered, 2)[2, ] * 100

# Convert the table into a data frame for ggplot
df_visitor_revenue_filtered <- as.data.frame(visitor_revenue_counts_filtered)

# Set the factor levels explicitly; ggplot2 stacks the first level (FALSE) on top, so TRUE sits at the bottom of each bar
df_visitor_revenue_filtered$Var1 <- factor(df_visitor_revenue_filtered$Var1, levels = c("FALSE", "TRUE"))

# Plot the stacked bar chart with percentage labels for the True (blue) portion
ggplot(df_visitor_revenue_filtered, aes(x = Var2, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("lightcoral", "lightblue"), labels = c("False", "True")) +
  labs(title = "New vs Returning Visitors and Purchase Status",
       x = "Visitor Type",
       y = "Count",
       fill = "Revenue") +
  theme_minimal() +
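  # TRUE is the bottom segment of each bar, so Freq / 2 centers the label in it
  # (cumsum() would add counts across different bars and misplace the label)
  geom_text(data = subset(df_visitor_revenue_filtered, Var1 == "TRUE"),
            aes(x = Var2, y = Freq / 2, label = paste0(round(true_percents_visitor_filtered, 1), "%")),
            color = "black", size = 3)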

  • New visitors are much more likely to purchase than returning visitors, but there are far fewer of them.
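
Because the two groups differ a lot in size, it may be worth checking whether the gap in purchase rates is larger than chance alone would explain. The sketch below uses base R's `prop.test()` for a two-sample comparison of proportions. `compare_conversion` is a hypothetical helper, and the counts in `toy_tab` are invented for illustration, not the dataset's actual numbers.

```r
# Hypothetical helper: two-sample test of equal purchase proportions.
# `tab` is a 2 x 2 table with rows "FALSE"/"TRUE" (Revenue) and one column
# per visitor type, shaped like visitor_revenue_counts_filtered.
compare_conversion <- function(tab) {
  purchases <- as.numeric(tab["TRUE", ])   # purchasing sessions per visitor type
  sessions  <- as.numeric(colSums(tab))    # all sessions per visitor type
  prop.test(x = purchases, n = sessions)
}

# Toy counts for illustration only (not the real dataset's numbers)
toy_tab <- as.table(matrix(
  c(80, 20,     # New_Visitor: 80 without purchase, 20 with purchase
    450, 50),   # Returning_Visitor: 450 without purchase, 50 with purchase
  nrow = 2,
  dimnames = list(Revenue = c("FALSE", "TRUE"),
                  VisitorType = c("New_Visitor", "Returning_Visitor"))
))

result <- compare_conversion(toy_tab)
print(result$estimate)  # sample purchase proportion per visitor type
print(result$p.value)   # small values suggest the rates really differ
```

On the real data the call would be `compare_conversion(visitor_revenue_counts_filtered)`. A significant result would only show that the rates differ, not why they differ.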

Plan moving forward:

These charts show spikes within certain segments of the dataset, and I want to discover what user behavior leads these groups to convert so well.

With this information, I will be able to target these behaviors to increase the number of new customers, while also improving the proportion of returning customers who purchase.
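
One possible way to start on this plan is a logistic regression of `Revenue` on a few engagement metrics. This is only a sketch, not the analysis actually carried out here. The column names `ProductRelated_Duration`, `BounceRates`, and `PageValues` are assumptions about the CSV headers, and the data below is simulated so the example runs on its own.

```r
# Simulated toy data for illustration only; the column names are assumed
# to mirror the dataset's headers and are not confirmed by this report.
set.seed(42)
n <- 500
toy <- data.frame(
  ProductRelated_Duration = rexp(n, rate = 1 / 600),  # seconds on product pages
  BounceRates             = runif(n, 0, 0.2),
  PageValues              = rexp(n, rate = 1 / 5)
)

# Simulated outcome: a higher page value raises the purchase probability
toy$Revenue <- rbinom(n, 1, plogis(-2 + 0.15 * toy$PageValues)) == 1

# Logistic regression: log-odds of a purchase as a linear function of the metrics
fit <- glm(Revenue ~ ProductRelated_Duration + BounceRates + PageValues,
           data = toy, family = binomial)

# Coefficient table: estimate, standard error, z value, p-value
print(round(coef(summary(fit)), 3))
```

Replacing `toy` with `project_data` would fit the same model on the real sessions, assuming those columns exist under these names. Positive coefficients indicate metrics associated with a higher chance of purchase.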

INITIAL FINDINGS:

For my initial analysis, I used the two visualizations above to form hypotheses regarding the factors influencing shoppers to complete purchases.

Month vs Revenue:

  • The bar chart clearly shows two significant spikes in purchases: one during the spring and one in November and December. These trends may be explained by seasonal shopping behavior.

  • The spike in conversion rate in November is likely attributable to Black Friday and other holiday sales leading up to Christmas, which typically offer substantial discounts.

  • The increase in spring may correspond to people preparing for summer and the general increase in consumer optimism experienced in spring.

Hypothesis: Shoppers are driven by seasonal factors, particularly major sales events like Black Friday, and a general increase in consumer spending leading up to summer. The success in November may largely be driven by aggressive pricing strategies and promotions around Black Friday. Therefore, targeted marketing and competitive pricing during these key months could further enhance sales.

Customer Type vs Revenue:

  • The second visualization highlights that New Visitors are more likely to complete a purchase than Returning Visitors, though they make up a smaller portion of total visitors.

  • One possible explanation for this is that new customers are often drawn to the site through targeted ads, promotions, or discounts specifically designed to convert them into first-time buyers.

  • Returning visitors, on the other hand, may be more familiar with the site but less incentivized to purchase if no promotions are offered.

Hypothesis: New customers are more likely to make a purchase because they probably arrive through targeted ads or direct promotions that match their immediate needs. Returning visitors, however, may require more engagement or re-incentivization, such as loyalty rewards or personalized recommendations, to encourage repeat purchases. By focusing on retention strategies for returning customers, we could potentially increase their conversion rates.