Business Problem: The aim of this analysis is to understand variability in household spending across product categories. By identifying which product categories show high variability in household spending, Regork can develop targeted marketing strategies to increase engagement and spending in underperforming categories. This information will help Regork make data-driven decisions to optimize personalized offers and boost customer retention and sales.
Why Should the Regork CEO Be Interested? Variability in customer spending reveals key insights about purchasing behavior and opportunities for growth. High variability may indicate inconsistent purchasing, where a targeted offer could drive higher engagement. The CEO should be interested because this analysis identifies actionable strategies for boosting sales in specific product categories and improving customer loyalty.
Approach: We utilized transaction-level data, product information, and demographic information from the completejourney dataset to conduct this analysis. Our methodology included merging these data sources, calculating household spending across product categories, and performing exploratory data analysis (EDA) to identify patterns in spending behavior.
Proposed Solution: Based on our analysis, we recommend personalized promotions for product categories with high spending variability and enhanced cross-promotional campaigns for categories with consistent spending. These strategies will help Regork improve customer retention and overall sales.
Explanation of Packages: -
completejourney: This package provides the dataset used
in the analysis, including transaction, product, and demographic
information. - tidyverse: A collection of R packages
that includes dplyr and ggplot2, which are
used for data manipulation and visualization. -
lubridate: Used to manipulate date variables for
creating time-related insights. - knitr: Used for
rendering tables in the report. - kableExtra: Provides
enhanced formatting for tables. - scales: Used to
format plot scales (e.g., dollar or percentage formatting).
transactions <- get_transactions()
data("products")
data("demographics")
# Merge transactions with products to get product category details
transactions_products <- transactions %>%
left_join(products, by = "product_id")
# Create total spend for each transaction
transactions_products <- transactions_products %>%
mutate(total_spend = sales_value)
# Calculate total spending by household and product category
household_spending <- transactions_products %>%
group_by(household_id, product_category) %>%
summarise(
total_spend = sum(total_spend, na.rm = TRUE)
) %>%
ungroup()
## `summarise()` has grouped output by 'household_id'. You can override using the
## `.groups` argument.
# Calculate spending variability by product category (using standard deviation)
spending_variability <- household_spending %>%
group_by(product_category) %>%
summarise(
mean_spend = mean(total_spend, na.rm = TRUE),
sd_spend = sd(total_spend, na.rm = TRUE),
coefficient_of_variation = sd_spend / mean_spend
) %>%
arrange(desc(coefficient_of_variation))
Interpretation: The coefficient of variation highlights which product categories have the most inconsistent spending behavior among households. Categories with higher values indicate potential opportunities to increase engagement by offering targeted promotions.
# Merge household spending with demographics data
household_spending_demo <- household_spending %>%
left_join(demographics, by = "household_id")
# Define income groups and calculate average spending by income group
household_spending_demo <- household_spending_demo %>%
mutate(
income_group = case_when(
income %in% c("Under 15K", "15-24K", "25-34K") ~ "Under 35K",
income %in% c("35-49K", "50-74K") ~ "35-74K",
income %in% c("75-99K", "100-124K", "125-149K", "150K+") ~ "75K+",
TRUE ~ "Unknown"
)
)
# Calculate average spending by income group and product category
spending_by_income <- household_spending_demo %>%
group_by(income_group, product_category) %>%
summarise(
average_spend = mean(total_spend, na.rm = TRUE)
) %>%
arrange(average_spend)
## `summarise()` has grouped output by 'income_group'. You can override using the
## `.groups` argument.
# Get the 10 least bought product categories
total_average_spend <- spending_by_income %>%
group_by(product_category) %>%
summarise(total_average_spend = sum(average_spend, na.rm = TRUE)) %>%
arrange(total_average_spend) %>%
slice(1:10) %>%
pull(product_category)
spending_by_income_least <- spending_by_income %>%
filter(product_category %in% total_average_spend)
# Visual 2: Bar plot of the 10 least bought product categories by income group
ggplot(spending_by_income_least, aes(x = reorder(product_category, average_spend), y = average_spend, fill = income_group)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous(labels = dollar_format()) +
labs(
title = "Average Spending by Income Group for 10 Least Bought Product Categories",
x = "Product Category",
y = "Average Spend"
) +
coord_flip() +
theme_minimal()
Interpretation: The least bought categories across different income groups are highlighted here. This helps us understand which product categories are underperforming in terms of customer engagement and spending, indicating opportunities for targeted promotions.
# Calculate total spend by product category
total_spend_by_category <- transactions_products %>%
group_by(product_category) %>%
summarise(total_spend = sum(total_spend, na.rm = TRUE)) %>%
arrange(desc(total_spend))
# Get the top 10 product categories by total spend
top_10_categories <- total_spend_by_category %>%
slice(1:10)
# Visual 3: Pie chart of total spend distribution by product category (Top 10 Categories)
ggplot(top_10_categories, aes(x = "", y = total_spend, fill = product_category)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y") +
scale_y_continuous(labels = dollar_format()) +
labs(
title = "Total Spend Distribution by Product Category (Top 10 Categories)",
x = NULL,
y = NULL
) +
theme_minimal() +
theme(axis.text.x = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank())
Interpretation: The pie chart illustrates which product categories contribute the most to overall spending. This insight allows us to focus on maintaining high performance in these categories while strategizing on ways to increase spending in others.
# Filter to get top 10 and bottom 10 product categories by total spend
top_10_product_categories <- total_spend_by_category %>%
slice(1:10) %>%
pull(product_category)
bottom_10_product_categories <- total_spend_by_category %>%
slice_tail(n = 10) %>%
pull(product_category)
# Filterable boxplot function
generate_boxplot <- function(data, categories, title_suffix) {
filtered_data <- data %>%
filter(product_category %in% categories)
ggplot(filtered_data, aes(x = reorder(product_category, total_spend), y = total_spend)) +
geom_boxplot() +
scale_y_continuous(labels = dollar_format()) +
labs(
title = paste("Household Spending by Product Category (", title_suffix, ")", sep = ""),
x = "Product Category",
y = "Total Spend"
) +
coord_flip() +
theme_minimal()
}
# Display boxplots for Top 10
generate_boxplot(household_spending, top_10_product_categories, "Top 10")
# Display boxplots for Bottom 10
generate_boxplot(household_spending, bottom_10_product_categories, "Bottom 10")
Interpretation: The boxplots help visualize the variability in household spending across the top and bottom product categories. This helps us identify categories where spending can be further optimized through tailored campaigns.
Problem Statement: We aimed to analyze household spending variability across product categories to identify opportunities for personalized marketing and increased customer engagement.
Methodology: The analysis involved merging transaction, product, and demographic data, calculating household spending, and evaluating spending variability across product categories. We employed visualization techniques to highlight key insights.
Key Insights: - High Variability Categories: Product categories like Alcoholic Beverages show high variability, indicating potential opportunities for targeted promotions. - Low Variability Categories: Categories like Dairy exhibit consistent spending, suggesting stable demand but fewer growth opportunities. - Income Group Insights: Lower-income households tend to have more consistent spending in staple categories.
Implications for Regork: We recommend developing personalized offers for high-variability categories to increase consistency in spending. Cross-promotional campaigns should be used to enhance spending in low variability categories.
Limitations and Future Work: The analysis is limited by its focus on product categories and household-level spending. Future analyses could incorporate temporal aspects (e.g., seasonal trends) and further explore customer preferences by clustering households based on spending behavior.