INTRODUCTION

In today’s competitive retail market, the ability to understand customer preferences at a granular level can be the difference between leading the market and being left behind. Retailers are constantly seeking ways to tailor their marketing strategies, optimize inventory, and maximize revenue streams. One key question arises: Which demographic group generates the most revenue for specific product categories, and how can Regork invest in targeted marketing to capture additional market share from this group? This report aims to answer that question by diving into a comprehensive revenue analysis, segmented by demographic group and product category.

Using transaction data, we explore spending patterns across diverse customer segments. We analyze revenue contributions by linking transaction data with product and demographic details, enabling us to build a holistic view of customer spending habits. This approach provides a basis for actionable insights, allowing us to pinpoint high-value segments and popular product categories, thus offering opportunities for strategic marketing and product positioning.

PACKAGES REQUIRED

Load essential packages for data manipulation and visualization library(tidyverse) # Data manipulation and visualization

library(viridis) # Color palettes for ggplot2

library(scales) # Formatting numbers and currency in plots

library(DT) # Creating interactive tables

library(RColorBrewer) # Additional color palettes for plots

Data Prepartion and Integration

To understand customer purchasing behavior across different demographic groups and product categories, we need to create a comprehensive dataset by integrating transaction data with product details and customer demographics. Below are the steps involved in preparing and integrating the data:

Datasets Overview We are working with three datasets:

Transactions Dataset: Contains detailed records of each transaction, including the household_id, product_id, basket_id, sales_value, and quantity purchased. Each row represents a single product purchase. Products Dataset: Provides information on each product, identified by product_id. It includes the product_category, allowing us to analyze revenue by category. Demographics Dataset: Contains demographic information for each household, identified by household_id. It includes variables such as income, age, and other characteristics that help segment the customer base. Data Integration To analyze revenue by demographic group and product category, we need to join these datasets:

Joining Transactions and Products: We start by merging the transactions and products datasets on product_id. This integration adds product details to each transaction record, enabling revenue analysis by category. Data Integration To analyze revenue by demographic group and product category, we need to join these datasets:

After these joins, we have a unified dataset, data_combined, which contains over 421,429 observations and 24 variables, representing all necessary information on transactions, products, and household demographics.

Data Cleaning With the integrated dataset, we proceed to clean the data to ensure accuracy and reliability:

Handling Missing Values: Given that our dataset contains missing values, which could distort the analysis, we remove all rows with NA values. This step simplifies our data and ensures that our calculations are based on complete information.

Summary of Integrated Data After data cleaning, the integrated dataset is ready for analysis. It includes the following key variables:

household_id: Unique identifier for each household, linking demographic details to each transaction. product_id: Unique identifier for each product, enabling category-level revenue analysis. product_category: Product category description, essential for analyzing revenue contributions by product type. income: Household income group, used to segment revenue data by income level. age: Age category of the household, allowing for age-based segmentation. By integrating these datasets and ensuring data quality, we establish a solid foundation for revenue analysis by demographic group and product category. This approach enables us to uncover patterns and insights that can inform targeted marketing strategies and product placement decisions.

Joining datasets to get combined view of transactions, products details and customer demographics

transactions_prod <- left_join(transactions, products, by = "product_id")
data_combined <- left_join(transactions_prod, demographics, by = "household_id")

After combing datasets, there were over 421429 observations and 24 variables

Since the dataset consists of NAs and missing values, lets clean the data

data_combined <- na.omit(data_combined)

Total Revenue by Demographic Group and Product Category

revenue_by_demo_product <- data_combined %>%
  group_by(income, age, product_category) %>%
  summarize(total_revenue = sum(sales_value, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_revenue))

# Display the top revenue-generating combinations
head(revenue_by_demo_product)

## # A tibble: 6 × 4
##   income age   product_category       total_revenue
##   <ord>  <ord> <chr>                          <dbl>
## 1 50-74K 45-54 SOFT DRINKS                    4968.
## 2 75-99K 45-54 SOFT DRINKS                    4391.
## 3 50-74K 45-54 CHEESE                         4211.
## 4 50-74K 45-54 FRZN MEAT/MEAT DINNERS         3583.
## 5 35-49K 35-44 SOFT DRINKS                    3544.
## 6 50-74K 45-54 BEERS/ALES                     3467.

Since we are looking at which demographic generating the most revenue. We proceeded to filter the data with top product category.

top_categories <- revenue_by_demo_product %>%
  group_by(product_category) %>%
  summarize(total_revenue = sum(total_revenue)) %>%
  top_n(10, total_revenue)

filter_data <- revenue_by_demo_product %>%
  filter(product_category %in% top_categories$product_category)

filter_data <- filter_data %>%
  mutate(income_group = case_when(
    income %in% c("Under 15K", "15-24K", "25-34K") ~ "Low",
    income %in% c("35-49K", "50-74K") ~ "Middle",
    income %in% c("75-99K", "100-124K", "125-149K") ~ "High",
    TRUE ~ "Very High"
  ))

ggplot(filter_data, aes(x = income_group, y = total_revenue, fill = income_group)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Total Revenue by Product Category and Income Group",
    x = "Income Group", y = "Total Revenue"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_viridis_d() +
  facet_wrap(~ product_category, scales = "free_y")

heatmap_data <- filter_data %>%
  group_by(product_category, income_group) %>%
  summarise(total_revenue = sum(total_revenue), .groups = "drop")

ggplot(heatmap_data, aes(x = income_group, y = product_category, fill = total_revenue)) +
  geom_tile(color = "white") +
  geom_text(aes(label = scales::dollar(total_revenue)), color = "black", size = 3) +
  labs(
    title = "Revenue Heatmap by Product Category and Income Group",
    x = "Income Group", y = "Product Category"
  ) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

From this we can see that SOFT DRINKS proceed the most revenue from the middle income group, so how can we target our marketing to capture more market share from this group? Our approach to analyze soft drink purchases among middle-class demographic to find correlations that provide actionable insides, we started with filtering for soft drinks purchases. Joining soft drinks with the demographics data and filter for middle-class customers, and than we calculated purchase patterns for soft drinks within the Middle Class

soft_drinks <- products %>%
  filter(grepl("SOFT DRINKS|SODA", product_category, ignore.case = TRUE))
  
soft_drink_transactions <- transactions %>%
  inner_join(soft_drinks, by = "product_id")

middle_class_income <- c("35-49K", "50-74K")

middle_class_soft_drinks <- soft_drink_transactions %>%
  inner_join(demographics, by = "household_id") %>%
  filter(income %in% middle_class_income)

soft_drink_summary <- middle_class_soft_drinks %>%
  group_by(income, age) %>%
  summarise(
    total_revenue = sum(sales_value, na.rm = TRUE),
    purchase_count = n(),
    avg_spend = mean(sales_value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(total_revenue))

soft_drink_summary

## # A tibble: 12 × 5
##    income age   total_revenue purchase_count avg_spend
##    <ord>  <ord>         <dbl>          <int>     <dbl>
##  1 50-74K 45-54         8867.           3252      2.73
##  2 50-74K 35-44         6276.           2346      2.68
##  3 35-49K 45-54         4939.           1921      2.57
##  4 35-49K 35-44         4908.           1570      3.13
##  5 50-74K 25-34         4154.           1455      2.85
##  6 35-49K 25-34         3087.           1290      2.39
##  7 35-49K 65+           2446.            768      3.18
##  8 35-49K 55-64         1441.            439      3.28
##  9 50-74K 19-24         1317.            536      2.46
## 10 35-49K 19-24         1033.            406      2.54
## 11 50-74K 55-64          964.            303      3.18
## 12 50-74K 65+            433.            128      3.38

basket_data <- transactions %>%
  filter(basket_id %in% middle_class_soft_drinks$basket_id) %>%
  inner_join(products, by = "product_id") 

# Exclude soft drinks and focus on complementary products
complementary_products <- basket_data %>%
  filter(!grepl("SOFT DRINKS|SODA", product_category, ignore.case = TRUE)) %>%
  group_by(product_category) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# Display top complementary products
complementary_products

## # A tibble: 288 × 2
##    product_category          count
##    <chr>                     <int>
##  1 BAKED BREAD/BUNS/ROLLS     5017
##  2 BAG SNACKS                 4837
##  3 CHEESE                     4376
##  4 FLUID MILK PRODUCTS        4370
##  5 FRZN MEAT/MEAT DINNERS     3562
##  6 BEEF                       2756
##  7 VEGETABLES - SHELF STABLE  2699
##  8 SOUP                       2673
##  9 FROZEN PIZZA               2541
## 10 YOGURT                     2427
## # ℹ 278 more rows

complementary_products <- transactions %>%
  filter(basket_id %in% middle_class_soft_drinks$basket_id) %>%
  inner_join(products, by = "product_id") %>%
  filter(!grepl("SOFT DRINKS|SODA", product_category, ignore.case = TRUE)) %>%
  group_by(product_category) %>%
  summarise(complementary_count = n()) %>%
  arrange(desc(complementary_count))

# Display top complementary products to soft drinks
datatable(complementary_products, options = list(pageLength = 10), caption = "Top Complementary Products Purchased with Soft Drinks by Middle-Class Customers")

soft_drink_frequency <- transactions %>%
  inner_join(soft_drinks, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(income) %>%
  summarise(soft_drink_purchases = n(),
            avg_purchase_volume = mean(quantity, na.rm = TRUE)) %>%
  arrange(desc(soft_drink_purchases))

# Display soft drink purchase frequency by income group
datatable(soft_drink_frequency, options = list(pageLength = 10), caption = "Soft Drink Purchase Frequency and Average Volume by Income Group")

volume_distribution <- middle_class_soft_drinks %>%
  mutate(volume_category = case_when(
    quantity == 1 ~ "Single Bottle",
    quantity > 1 & quantity <= 6 ~ "Small Pack",
    quantity > 6 ~ "Multipack"
  )) %>%
  group_by(volume_category) %>%
  summarise(total_purchases = n(),
            total_revenue = sum(sales_value, na.rm = TRUE)) %>%
  arrange(desc(total_purchases))

# Display volume distribution of soft drinks for middle-class customers
datatable(volume_distribution, options = list(pageLength = 10), caption = "Soft Drink Volume Distribution for Middle-Class Customers")

# Filter to show only the top 5 complementary products
top_complementary_products <- complementary_products %>%
  slice_max(complementary_count, n = 5)

# Plotting the filtered data with title alignment and data labels on the bars
ggplot(top_complementary_products, aes(x = reorder(product_category, complementary_count), y = complementary_count, fill = product_category)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = complementary_count), hjust = 1.2, color = "white", size = 4) +  # Labels inside bars
  labs(title = "Top 5 Complementary Products with Soft Drinks (Middle Class)", x = "Product Category", y = "Purchase Count") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5)  # Center align the title
  ) +
  scale_fill_brewer(palette = "Set2") +
  coord_flip()

# Soft Drink Purchase Frequency by Income Group
ggplot(soft_drink_frequency, aes(x = reorder(income, -soft_drink_purchases), y = soft_drink_purchases, fill = income)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = soft_drink_purchases), vjust = -0.5) +
  labs(title = "Soft Drink Purchase Frequency by Income Group", x = "Income Group", y = "Purchase Frequency") +
  theme_minimal() +
  scale_fill_viridis_d()

# Volume Distribution for Soft Drink Purchases (Middle Class)
ggplot(volume_distribution, aes(x = reorder(volume_category, -total_purchases), y = total_purchases, fill = volume_category)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = total_purchases), vjust = -0.5) +
  labs(title = "Volume Distribution for Soft Drink Purchases (Middle Class)", x = "Volume Category", y = "Total Purchases") +
  theme_minimal() +
  scale_fill_viridis_d()

Summary and Recommendations

Summary of Findings Through our analysis of customer transaction data by demographic group and product category, several key insights emerged:

Top Revenue-Generating Demographics and Categories

Income Group and Product Category: Middle-income customers (household incomes between $35,000 and $74,000) generate the most revenue in various product categories, especially soft drinks. Age Group: Customers aged 45-54 contribute significantly to the revenue in the top categories, indicating strong purchasing power and possibly a preference for certain types of products like soft drinks, cheese, and frozen meat dinners. Soft Drinks as a Key Product Category:

Soft drinks stand out as a high-revenue category, particularly among middle-income customers. This suggests an opportunity to enhance marketing and promotion efforts around this product category to further boost revenue from this segment. Complementary Product Insights:

Middle-income consumers frequently purchase soft drinks with complementary products such as baked bread/buns/rolls, bag snacks, and cheese. This indicates a preference for quick and convenient meal solutions, which can be leveraged in promotional campaigns. Purchase Patterns:

Most soft drink purchases are for single bottles, with a smaller portion of sales attributed to larger packs. This suggests that customers might prefer purchasing smaller quantities, potentially driven by convenience or price sensitivity. Recommendations Based on these findings, we propose the following recommendations to capitalize on the identified trends and boost revenue:

Targeted Promotions for Middle-Income Customers:

Cross-Promotional Bundling: Leverage the popularity of soft drinks among middle-income consumers by bundling them with top complementary items (e.g., soft drinks with bag snacks or baked goods). This could be implemented through in-store promotions, special discounts, or loyalty program rewards to encourage larger basket sizes. Volume-Based Incentives for Soft Drinks:

Multipack Discounts: Given the high sales of single bottles, introduce multipack discounts or “Buy More, Save More” promotions to incentivize customers to purchase larger quantities. This could help increase the average purchase size and overall revenue. Segmented Marketing Campaigns:

Age and Income-Specific Messaging: Develop targeted marketing campaigns for the 45-54 age group within the middle-income segment, emphasizing popular products like soft drinks and their complementary items. Highlight product benefits that align with this demographic’s lifestyle, such as convenience and value. Loyalty Programs and Personalized Offers:

Personalized Discounts: Use the findings from the revenue analysis to inform personalized offers for customers who frequently purchase high-revenue items. For example, providing targeted discounts on complementary products could encourage more frequent or larger purchases from existing customers. Optimize Product Placement:

In-Store and Online Positioning: Position high-revenue items like soft drinks, bag snacks, and baked goods in prominent locations within the store and on the online platform. This will enhance visibility and accessibility, potentially increasing impulse purchases and overall sales. Conclusion By tailoring promotional strategies and enhancing product placement based on customer demographics and product preferences, Regork can maximize revenue opportunities and deepen customer engagement. This analysis offers actionable insights that can guide marketing, pricing, and merchandising decisions, positioning Regork to capture a larger share of the market. Implementing these strategies will not only boost revenue in the short term but also foster long-term customer loyalty.

Detailed Revenue Analysis by Demographic Group and Product Category

Bishal Dhital, Zaynah Dabdoub

2024-10-5

INTRODUCTION

PACKAGES REQUIRED

Data Prepartion and Integration

Summary and Recommendations

Top Revenue-Generating Demographics and Categories