Description:
The Online Shoppers’ Purchasing Intention dataset
from the UCI Machine Learning Repository consists of data
collected from online shopping sessions with the goal of predicting
whether a user will complete a purchase. It includes a variety of
features such as time spent on different types of pages (Administrative,
Informational, and Product-related), as well as behavioral metrics like
Bounce Rates, Exit Rates, and Page Values, which estimate how much a
page contributes to revenue. The dataset also includes categorical
variables, such as Visitor Type (New or Returning Visitors), Traffic
Type, Browser, and Region, and an indicator for sessions occurring near
special days, which may influence purchasing behavior. The primary
target variable, Revenue, is a boolean field that identifies whether the
session resulted in a purchase. This combination of numeric and
categorical features provides a detailed view of user behavior and
website navigation efficiency.
Main Purpose:
Main Question: “What patterns in user behavior, such as time
spent on various page types and special day effects, contribute most to
online purchases?”
Goal: This project aims to analyze the relationship between
user engagement metrics and the likelihood that a user completes a
purchase, helping e-commerce platforms optimize website design and user
experience.
Visualization:
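This stacked bar chart shows the number of sessions in each month, split by whether the session ended in a purchase. The label on each bar gives the percentage of that month’s sessions that ended in a purchase.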
library(ggplot2)
# Set the month order explicitly (levels as they appear in the data, where June is spelled out)
month_order <- c("Feb", "Mar", "May", "June", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
project_data$Month <- factor(project_data$Month, levels = month_order)
# Count sessions by Revenue (rows) and Month (columns)
month_revenue_counts <- table(project_data$Revenue, project_data$Month)
# Percentage of sessions with Revenue == TRUE in each month (column-wise proportions)
true_percents <- prop.table(month_revenue_counts, 2)["TRUE", ] * 100
# Convert the table into a data frame for ggplot (Var1 = Revenue, Var2 = Month, Freq = count)
df_month_revenue <- as.data.frame(month_revenue_counts)
# Plot the stacked bar chart with percentage labels for True %.
# ggplot stacks the first fill level (FALSE) on top, so the TRUE segment
# starts at zero and Freq / 2 is its midpoint.
ggplot(df_month_revenue, aes(x = Var2, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("lightcoral", "lightblue"), labels = c("False", "True")) +
  labs(title = "Distribution of Revenue by Month",
       x = "Month",
       y = "Count",
       fill = "Revenue") +
  theme_minimal() +
  # Look up each label by month name so labels cannot drift out of order
  geom_text(data = subset(df_month_revenue, Var1 == "TRUE"),
            aes(x = Var2, y = Freq / 2, label = paste0(round(true_percents[as.character(Var2)], 1), "%")),
            color = "black", size = 3)
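
To make the percentage step concrete, here is a minimal sketch on made-up data showing how prop.table() with margin 2 turns a Revenue-by-Month count table into column-wise shares. The toy data frame and its values are hypothetical and are not taken from the dataset.

```r
# Made-up sessions, for illustration only (not taken from the dataset)
toy <- data.frame(
  Month   = c("Feb", "Feb", "Feb", "Feb", "Mar", "Mar", "Mar", "Mar", "Mar"),
  Revenue = c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)
toy$Month <- factor(toy$Month, levels = c("Feb", "Mar"))

# Rows are Revenue (FALSE, TRUE), columns are months
toy_counts <- table(toy$Revenue, toy$Month)

# margin = 2 divides each cell by its column total, so each month sums to 1
toy_shares <- prop.table(toy_counts, 2)

# Keep the TRUE row and convert to percent: Feb is 1 of 4, Mar is 2 of 5
toy_true_percents <- toy_shares["TRUE", ] * 100
print(toy_true_percents)
```

Selecting the row by name ("TRUE") rather than by position avoids picking the wrong row if the table ever has a different row order.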

The next chart compares how often new and returning visitors
complete a purchase.
# Keep only new and returning visitors (drop instances where VisitorType is 'Other')
filtered_data <- subset(project_data, VisitorType %in% c("New_Visitor", "Returning_Visitor"))
# If VisitorType is a factor, drop the now-empty 'Other' level
filtered_data <- droplevels(filtered_data)
visitor_revenue_counts_filtered <- table(filtered_data$Revenue, filtered_data$VisitorType)
# Percentage of sessions with Revenue == TRUE for each visitor type
true_percents_visitor_filtered <- prop.table(visitor_revenue_counts_filtered, 2)["TRUE", ] * 100
# Convert the table into a data frame for ggplot
df_visitor_revenue_filtered <- as.data.frame(visitor_revenue_counts_filtered)
# Fix the level order explicitly. ggplot stacks the first level (FALSE) on top,
# so the TRUE segment sits at the bottom of each bar.
df_visitor_revenue_filtered$Var1 <- factor(df_visitor_revenue_filtered$Var1, levels = c("FALSE", "TRUE"))
# Plot the stacked bar chart with percentage labels for the True (blue) portion
ggplot(df_visitor_revenue_filtered, aes(x = Var2, y = Freq, fill = Var1)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("lightcoral", "lightblue"), labels = c("False", "True")) +
  labs(title = "New vs Returning Visitors and Purchase Status",
       x = "Visitor Type",
       y = "Count",
       fill = "Revenue") +
  theme_minimal() +
  # The TRUE segment starts at zero, so its midpoint is Freq / 2.
  # (A cumsum across rows would add counts from different bars.)
  geom_text(data = subset(df_visitor_revenue_filtered, Var1 == "TRUE"),
            aes(x = Var2, y = Freq / 2, label = paste0(round(true_percents_visitor_filtered[as.character(Var2)], 1), "%")),
            color = "black", size = 3)

- We see that new visitors are much more likely to make a purchase, but there
aren’t as many of them.
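
Percentage labels on a stacked bar have to be positioned within their own bar. As an alternative to computing the y position by hand, the sketch below stores the label text in the data frame and lets position_stack(vjust = 0.5) center each label in its segment. The counts in toy_counts are made up for illustration, and the names toy_counts, Pct, Label, and toy_plot are hypothetical.

```r
library(ggplot2)

# Made-up counts, for illustration only (not taken from the dataset)
toy_counts <- data.frame(
  Revenue     = factor(rep(c("FALSE", "TRUE"), times = 2), levels = c("FALSE", "TRUE")),
  VisitorType = rep(c("New_Visitor", "Returning_Visitor"), each = 2),
  Freq        = c(30, 10, 160, 40)
)

# Share of each segment within its own bar, in percent
bar_totals <- ave(toy_counts$Freq, toy_counts$VisitorType, FUN = sum)
toy_counts$Pct <- 100 * toy_counts$Freq / bar_totals

# Label text only for the TRUE segments; empty strings draw nothing
toy_counts$Label <- ifelse(toy_counts$Revenue == "TRUE",
                           paste0(round(toy_counts$Pct, 1), "%"), "")

toy_plot <- ggplot(toy_counts, aes(x = VisitorType, y = Freq, fill = Revenue)) +
  geom_col() +
  # position_stack(vjust = 0.5) centers each label inside its own segment
  geom_text(aes(label = Label), position = position_stack(vjust = 0.5), size = 3) +
  scale_fill_manual(values = c("lightcoral", "lightblue")) +
  theme_minimal()
```

Printing toy_plot draws the chart. Because the labels are a column of the plotted data, they stay attached to the right bar even if rows are reordered or filtered.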
Plan moving forward:
These charts show that purchase rates spike within certain segments of
our data set, and I want to discover what user behavior is
leading these groups to do well.
With this information, I will be able to target these
tendencies to increase the number of new customers, while also improving
the proportion of returning customers who make a purchase.
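
One possible first step for this plan, sketched here under stated assumptions, is to compare average engagement metrics between sessions that did and did not end in a purchase, within each visitor type. The sessions below are made up for illustration, and the column names ProductRelated_Duration and PageValues are assumed to match the UCI file. This is not a statement of the method the project will use.

```r
# Made-up sessions, for illustration only (not taken from the dataset).
# Column names are assumed to match the UCI file.
toy_sessions <- data.frame(
  VisitorType = c("New_Visitor", "New_Visitor", "Returning_Visitor",
                  "Returning_Visitor", "Returning_Visitor", "New_Visitor"),
  Revenue = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE),
  ProductRelated_Duration = c(600, 100, 900, 200, 400, 800),
  PageValues = c(20, 0, 30, 0, 0, 40)
)

# Mean of each metric for every VisitorType x Revenue combination
behavior_summary <- aggregate(
  cbind(ProductRelated_Duration, PageValues) ~ VisitorType + Revenue,
  data = toy_sessions,
  FUN = mean
)
print(behavior_summary)
```

Large gaps between the purchase and no-purchase rows for a metric would suggest that metric is worth a closer look, although a difference in means alone does not show that the behavior causes the purchase.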