Business Problem:
This analysis addresses a critical business challenge for Regork: identifying and understanding our most valuable customers (High Value Customers or HVCs). These customers represent the top 20% of spenders and are vital to our revenue stream. Understanding their behavior is essential because:
Methodology:
We used the Complete Journey dataset, which contains transactional data, demographic information, and coupon redemption records. Our approach involved:
Proposed Solution:
This analysis will help the Regork CEO make data-driven decisions by:
# Load required libraries
library(completejourney) # Provides the retail transaction dataset
library(tidyverse) # Collection of data manipulation and visualization tools
# Load datasets
transactions <- get_transactions()
demographics <- demographics
promotions <- get_promotions()
coupon_redemptions <- coupon_redemptions
Libraries Explanation:
dplyr
for data manipulation, ggplot2
for
visualization, and lubridate
for date/time operationsWe define High Value Customers as the top 20% by total spend, then select the top 10 households.
# Calculate total spend per household and identify top 20% (quintile 5)
hvc <- transactions %>%
group_by(household_id) %>%
summarize(total_spend = sum(sales_value, na.rm = TRUE)) %>%
mutate(percentile = ntile(total_spend, 5)) %>% # Create 5 groups (quintiles)
filter(percentile == 5) %>% # Keep only top 20% (quintile 5)
arrange(desc(total_spend)) %>%
slice_head(n = 10) # Select top 10 households
print(hvc)
## # A tibble: 10 × 3
## household_id total_spend percentile
## <chr> <dbl> <int>
## 1 1023 24880. 5
## 2 1609 16581. 5
## 3 1453 13376. 5
## 4 2322 13188. 5
## 5 1430 12849. 5
## 6 1111 12368. 5
## 7 400 11802. 5
## 8 707 11492. 5
## 9 1653 11190. 5
## 10 1489 10964. 5
Code Explanation:
group_by(household_id)
: Groups all transactions by
householdsummarize(total_spend = ...)
: Calculates total lifetime
spend per householdntile(total_spend, 5)
: Creates quintiles (5 equal
groups) where 5 = top 20%filter(percentile == 5)
: Keeps only customers in the
top spending quintileslice_head(n = 10)
: Selects the 10 highest
spendersKey Insight: This creates a new derived variable
(percentile
) that segments customers by spending behavior,
revealing our most valuable customer segment.
# Visualize the top 10 HVCs with a horizontal bar chart
plot_top10_hvc_spend <- ggplot(hvc, aes(x = reorder(as.factor(household_id), total_spend), y = total_spend)) +
geom_col(fill = "royalblue") +
coord_flip() + # Horizontal bars for better readability
labs(title = "Top 10 High Value Customers by Total Spend",
x = "Household ID",
y = "Total Spend ($)") +
theme_minimal()
print(plot_top10_hvc_spend)
Visualization Explanation:
reorder()
: Sorts bars by spend amount (not
alphabetically) for easier comparisoncoord_flip()
: Makes bars horizontal for better
readability of household IDsfill = "royalblue"
: Professional, trust-evoking color
schemeKey Insight: This chart reveals the spending hierarchy among our most valuable customers, showing whether there’s a dominant customer or relatively equal distribution of spending power.
# Analyze spending by product category for HVCs
hvc_category_spend <- transactions %>%
filter(household_id %in% hvc$household_id) %>% # Filter to HVC transactions only
group_by(product_id) %>%
summarize(category_spend = sum(sales_value, na.rm = TRUE)) %>%
left_join(products, by = "product_id") %>% # Join with products table to get categories
group_by(product_category) %>%
summarize(total_spend = sum(category_spend)) %>%
arrange(desc(total_spend))
# Select top 10 categories
top_categories <- hvc_category_spend %>%
slice_head(n = 10)
# Visualize top spending categories
plot_top_categories <- ggplot(top_categories, aes(x = reorder(product_category, total_spend), y = total_spend)) +
geom_col(fill = "royalblue") +
coord_flip() +
labs(title = "Top 10 Spend Categories for High Value Customers",
x = "Product Category",
y = "Total Spend ($)") +
theme_minimal()
print(plot_top_categories)
Code Explanation:
filter(household_id %in% hvc$household_id)
: Restricts
analysis to only HVC transactionsleft_join(products, by = "product_id")
:
Combines two datasets to connect product IDs with
category namesKey Insight: This analysis uncovers which product categories drive HVC spending, information not immediately visible in raw transaction data. The CEO can use this to:
# Analyze coupon redemption patterns for HVCs
hvc_coupons <- coupon_redemptions %>%
filter(household_id %in% hvc$household_id) %>%
left_join(coupons, by = "coupon_upc") # Join to get coupon details
# Count redemptions by coupon
top_coupons <- hvc_coupons %>%
count(coupon_upc, sort = TRUE) %>%
slice_head(n = 10)
# Create lollipop chart for visual appeal
ggplot(top_coupons, aes(x = reorder(coupon_upc, n), y = n)) +
geom_segment(aes(xend = coupon_upc, yend = 0), color = "grey") +
geom_point(size = 4, color = "royalblue") +
coord_flip() +
labs(title = "Top 10 Coupons Redeemed by High Value Customers",
x = "Coupon UPC",
y = "Number of Redemptions") +
theme_minimal()
Code Explanation:
count(coupon_upc, sort = TRUE)
: Efficiently counts and
sorts in one stepKey Insight: Reveals which promotional offers resonate most with HVCs, showing:
# Analyze spending patterns by day of week
hvc_weekday_spend <- transactions %>%
filter(household_id %in% hvc$household_id) %>%
mutate(weekday = wday(transaction_timestamp)) %>% # Extract day of week (1=Sun, 7=Sat)
mutate(weekday_label = factor(weekday, levels = 1:7,
labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))) %>%
group_by(weekday_label) %>%
summarize(total_spend = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_spend))
# Visualize with gradient coloring to highlight high-spend days
plot_hvc_weekday <- ggplot(hvc_weekday_spend, aes(x = weekday_label, y = total_spend, fill = total_spend)) +
geom_col(show.legend = FALSE) +
scale_fill_gradient(low = "lightblue", high = "royalblue") +
labs(title = "High Value Customer Spend by Day of Week",
x = "Day of Week",
y = "Total Spend ($)") +
theme_minimal()
print(plot_hvc_weekday)
Code Explanation:
wday(transaction_timestamp)
: Extracts numeric day of
week from datetimefactor(weekday, levels = 1:7, labels = ...)
:
Creates a new derived variable converting numbers to
readable day names while maintaining chronological orderscale_fill_gradient()
: Color intensity visually
represents spending magnitudeKey Insight: Identifies optimal days for promotions and staffing. For example:
# Create detailed dataset with coupon redemptions by weekday
hvc_coupons_weekday <- coupon_redemptions %>%
filter(household_id %in% hvc$household_id) %>%
left_join(coupons, by = "coupon_upc") %>%
mutate(weekday = wday(redemption_date)) %>%
mutate(weekday_label = factor(weekday, levels = 1:7,
labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")))
# Count redemptions by coupon and weekday combination
coupon_weekday_counts <- hvc_coupons_weekday %>%
group_by(coupon_upc, weekday_label) %>%
summarize(redemptions = n(), .groups = "drop") %>%
arrange(coupon_upc, weekday_label)
# Select top 10 most redeemed coupons overall
top_coupons_list <- coupon_weekday_counts %>%
group_by(coupon_upc) %>%
summarize(total_redemptions = sum(redemptions)) %>%
arrange(desc(total_redemptions)) %>%
slice_head(n = 10) %>%
pull(coupon_upc)
# Filter data for top coupons only
top_coupon_weekday <- coupon_weekday_counts %>%
filter(coupon_upc %in% top_coupons_list)
# Plot heatmap showing two-dimensional relationship
ggplot(top_coupon_weekday, aes(x = weekday_label, y = factor(coupon_upc), fill = redemptions)) +
geom_tile(color = "grey") +
scale_fill_gradient(low = "lightblue", high = "royalblue") +
labs(title = "Coupon Redemptions by Weekday for Top 10 Coupons (HVCs)",
x = "Weekday",
y = "Coupon UPC",
fill = "Redemptions") +
theme_minimal()
Code Explanation:
This is the most sophisticated analysis in the report:
geom_tile()
: Creates heatmap grid where color intensity
shows redemption patternsKey Insights Uncovered:
Business Application: The CEO can use this heatmap to:
We investigated who Regork’s most valuable customers are and how they behave, focusing on the top 10 highest-spending households representing the top 20% by revenue.
Current Limitations:
Recommended Improvements:
This analysis provides actionable insights for the Regork CEO to maximize revenue from the most valuable customer segment through targeted marketing, optimized operations, and strategic resource allocation. The data-driven approach ensures decisions are grounded in actual customer behavior rather than assumptions.