suppressMessages(suppressWarnings(library(knitr)))
suppressMessages(suppressWarnings(library(tidyverse)))
knitr::opts_chunk$set(echo = TRUE)

Introduction

The following report is addressed to the CEO of Regork for viewership and consideration.

Business Problem Addressed:

There is an opportunity for Regork to experience business growth through a strategic marketing campaign. The business problem at hand is how to leverage the spending habits and coupon usage across different income brackets to target customers and boost sales revenue. The focus here will be on the sales of actual groceries and everyday household items.

Approach:

Using the programming language known as R, robust data analysis was performed on the CompleteJourney data package. I extracted information from datasets such as transactions, products, demographics, coupons, and coupon redemptions for Regork spanning a one-year period. Then, I conducted exploratory analysis of the data through joins, table summaries, and graphical visualizations to find insights.

Usefulness:

The analysis that follows takes raw data, like sales revenue by product and coupon redemption counts, and configures them in an easy-to-interpret format. You will be able to view the extent of sales and coupon usage across demographic groups varying by household income, to see who buys what and who is attracted to the best deals. This will enable us to identify habits and propose a solution for how to best target customers to attract them to spend more money shopping at Regork.

Packages and Libraries Required

The following R packages will be required to run this code:

Import CompleteJourney if you haven’t already:

options(download.file.method = "wininet")
install.packages("remotes")
remotes::install_github("bradleyboehmke/completejourney")

Load the libraries for the following packages:

suppressMessages(suppressWarnings(library(completejourney)))

suppressMessages(suppressWarnings(library(tidyverse)))

suppressMessages(suppressWarnings(library(knitr)))

suppressMessages(suppressWarnings(library(curl)))

Load the CompleteJourney datasets:

transactions <- get_transactions()

promotions <- get_promotions()

If those get_xxx() commands don’t work due to SSL connectivity issues, try these commands for Windows R:

url <- "https://raw.githubusercontent.com/bradleyboehmke/completejourney/master/data/transactions.rds"

tmp <- tempfile(fileext = ".rds")

curl::curl_download(url, tmp)

transactions <- readRDS(tmp)


url2 <- "https://raw.githubusercontent.com/bradleyboehmke/completejourney/master/data/promotions.rds"

tmp2 <- tempfile(fileext = ".rds")

curl::curl_download(url2, tmp2)

promotions <- readRDS(tmp2)

Load the rest of the relevant datasets:

data("demographics")
data("products")
data("coupons")
data("coupon_redemptions")

Exploratory Data Analysis

First, a good habit is to view the names of the columns in the data, and also check to see where there are unknown “NA” values, so that we can avoid NA errors when joining the data.

options(tibble.width = Inf)

transactions %>% summarize_all(~sum(is.na(.))) %>% print()
## # A tibble: 1 × 11
##   household_id store_id basket_id product_id quantity sales_value retail_disc
##          <int>    <int>     <int>      <int>    <int>       <int>       <int>
## 1            0        0         0          0        0           0           0
##   coupon_disc coupon_match_disc  week transaction_timestamp
##         <int>             <int> <int>                 <int>
## 1           0                 0     0                     0
products %>% summarize_all(~sum(is.na(.))) %>% print()
## # A tibble: 1 × 7
##   product_id manufacturer_id department brand product_category product_type
##        <int>           <int>      <int> <int>            <int>        <int>
## 1          0               0          0     0              540          528
##   package_size
##          <int>
## 1        30586
promotions %>% summarize_all(~sum(is.na(.))) %>% print()
## # A tibble: 1 × 5
##   product_id store_id display_location mailer_location  week
##        <int>    <int>            <int>           <int> <int>
## 1          0        0                0               0     0
demographics %>% summarize_all(~sum(is.na(.))) %>% print()
## # A tibble: 1 × 8
##   household_id   age income home_ownership marital_status household_size
##          <int> <int>  <int>          <int>          <int>          <int>
## 1            0     0      0            233            137              0
##   household_comp kids_count
##            <int>      <int>
## 1              0          0
coupons %>% summarize_all(~sum(is.na(.))) %>% print()
## # A tibble: 1 × 3
##   coupon_upc product_id campaign_id
##        <int>      <int>       <int>
## 1          0          0           0
coupon_redemptions %>% summarize_all(~sum(is.na(.))) %>% print()
## # A tibble: 1 × 4
##   household_id coupon_upc campaign_id redemption_date
##          <int>      <int>       <int>           <int>
## 1            0          0           0               0

This reveals useful columns of data i.e. household_id, product_id, sales_value, product_category, product_type, and income.

Information like product_category and product_type have NA values, so we will need to be mindful of this.

We can now begin summarizing and viewing the actual data in meaningful ways, to see where exploration leads us. First, let’s take a look at the top 10 product categories in terms of sales revenue.

Top 10 Product Categories by Revenue

top_categories <- transactions %>% 
  left_join(products, by = "product_id") %>% 
  group_by(product_category) %>% 
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>% 
  arrange(desc(total_sales)) %>% 
  slice_head(n = 10) %>%
  print()
## # A tibble: 10 × 2
##    product_category       total_sales
##    <chr>                        <dbl>
##  1 COUPON/MISC ITEMS          385972.
##  2 SOFT DRINKS                182126.
##  3 BEEF                       176615.
##  4 FLUID MILK PRODUCTS        116361.
##  5 CHEESE                     107012.
##  6 FRZN MEAT/MEAT DINNERS      93024.
##  7 BAG SNACKS                  84944.
##  8 BAKED BREAD/BUNS/ROLLS      82430.
##  9 BEERS/ALES                  82039.
## 10 FROZEN PIZZA                81175.

We can visualize this with a simple table and bar graph:

Table for Top 10

kable(top_categories, caption = "Top 10 Product Categories by Total Sales", align = "lc")
Top 10 Product Categories by Total Sales
product_category total_sales
COUPON/MISC ITEMS 385971.85
SOFT DRINKS 182126.30
BEEF 176614.54
FLUID MILK PRODUCTS 116360.68
CHEESE 107011.61
FRZN MEAT/MEAT DINNERS 93023.74
BAG SNACKS 84943.85
BAKED BREAD/BUNS/ROLLS 82429.70
BEERS/ALES 82039.45
FROZEN PIZZA 81175.04

Visualize with ggplot2

ggplot(top_categories, aes(x = reorder(product_category, total_sales), y = total_sales)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(title = "Top 10 Product Categories by Revenue",
       x = "Product Category", y = "Total Sales ($)") +
  theme_minimal()

At a glance, this shows us that identifiable items like soft drinks, beef, milk, cheese, and frozen meat products have the highest sales revenue, with the largest amount of sales going toward “COUPON / MISC ITEMS.” We will need to explore this product category further to see if it warrants inclusion in general groceries and storebought household items.

Explore product_type within COUPON/MISC ITEMS

coupon_misc_sales <- transactions %>% 
  left_join(products, by = "product_id") %>% 
  filter(product_category == "COUPON/MISC ITEMS") %>% 
  group_by(product_type) %>% 
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>% 
  arrange(desc(total_sales))

Top 10 product categories by revenue, excluding COUPON/MISC ITEMS

top_categories_no_misc <- transactions %>% 
  left_join(products, by = "product_id") %>% 
  filter(product_category != "COUPON/MISC ITEMS") %>% 
  group_by(product_category) %>% 
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>% 
  arrange(desc(total_sales)) %>% 
  slice_head(n = 10)

Create and display table

kable(top_categories_no_misc, 
      caption = "Top 10 Product Categories by Revenue (Excluding COUPON/MISC ITEMS)", align = "lc")
Top 10 Product Categories by Revenue (Excluding COUPON/MISC ITEMS)
product_category total_sales
SOFT DRINKS 182126.30
BEEF 176614.54
FLUID MILK PRODUCTS 116360.68
CHEESE 107011.61
FRZN MEAT/MEAT DINNERS 93023.74
BAG SNACKS 84943.85
BAKED BREAD/BUNS/ROLLS 82429.70
BEERS/ALES 82039.45
FROZEN PIZZA 81175.04
COLD CEREAL 63009.35

Visualize with ggplot2

ggplot(top_categories_no_misc, aes(x = reorder(product_category, total_sales), y = total_sales)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(title = "Top 10 Product Categories by Revenue (Excl. COUPON/MISC ITEMS)",
       x = "Product Category", y = "Total Sales ($)") +
  theme_minimal()

Alternatively, we can also visualize these results as a pie chart, to view the relative portions of sales for each product:

Generate top 10 product categories by total sales as percentages

top_categories_no_misc <- transactions %>% 
  left_join(products, by = "product_id") %>% 
  filter(product_category != "COUPON/MISC ITEMS") %>% 
  group_by(product_category) %>% 
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>% 
  mutate(percentage = total_sales / sum(total_sales) * 100) %>% 
  arrange(desc(total_sales)) %>% 
  slice_head(n = 10)

Pie chart of top 10 product sales

ggplot(top_categories_no_misc, aes(x = "", y = percentage, fill = product_category)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Top 10 Product Categories by Sales (Excluding COUPON/MISC ITEMS)",
       fill = "Product Category") +
  scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow", "brown", "grey")) +  # 10 distinct colors
  theme_minimal() +
  theme(
    axis.title = element_blank(),  # Remove axis titles
    axis.text = element_blank(),   # Remove axis text
    panel.grid = element_blank(),  # Remove grid
    legend.position = "right",     # Legend on right
    legend.text = element_text(size = 8),  # Smaller legend text
    legend.key.size = unit(0.5, "cm"),    # Smaller legend keys
    legend.margin = margin(10, 10, 10, 20),  # Space for legend
    plot.margin = margin(10, 20, 10, 10)    # Extra plot margin
  )

Now that the top 10 product categories have been identified, we will carry these forward and analyze what the top 5 are for each income bracket. This will be split into two graphs, since the total sales revenue for higher income brackets is much lower than that of the lower income brackets. This will make each income group (lower vs. higher) easier to visualize.

Define top 10 product categories (excluding COUPON/MISC ITEMS)

top_10_categories <- c("SOFT DRINKS", "BEEF", "FLUID MILK PRODUCTS", "CHEESE", 
                       "FRZN MEAT/MEAT DINNERS", "BAG SNACKS", 
                       "BAKED BREAD/BUNS/ROLLS", "BEERS/ALES", 
                       "FROZEN PIZZA", "COLD CEREAL")

Sales by product_category and income

sales_by_income <- transactions %>% 
  left_join(products, by = "product_id") %>% 
  left_join(demographics, by = "household_id") %>% 
  filter(product_category %in% top_10_categories, !is.na(income)) %>% 
  group_by(product_category, income) %>% 
  summarize(total_sales = sum(sales_value, na.rm = TRUE), .groups = "drop") %>% 
  arrange(product_category, desc(total_sales))

Top 5 product categories by sales for each income bracket

top_5_by_income <- sales_by_income %>% 
  group_by(income) %>% 
  slice_max(total_sales, n = 5, with_ties = FALSE) %>%  # Top 5 by sales per income
  arrange(income, desc(total_sales))

Display table

kable(top_5_by_income, 
      caption = "Sales for Top 5 Product Categories by Income", 
      align = "llc")
Sales for Top 5 Product Categories by Income
product_category income total_sales
BEEF Under 15K 9567.36
SOFT DRINKS Under 15K 9501.77
CHEESE Under 15K 4903.56
FROZEN PIZZA Under 15K 4839.56
FLUID MILK PRODUCTS Under 15K 4248.32
SOFT DRINKS 15-24K 10811.04
BEEF 15-24K 7813.07
FLUID MILK PRODUCTS 15-24K 4552.21
FRZN MEAT/MEAT DINNERS 15-24K 4382.31
BAG SNACKS 15-24K 3853.93
BEEF 25-34K 10082.68
SOFT DRINKS 25-34K 8814.52
FRZN MEAT/MEAT DINNERS 25-34K 6939.98
FLUID MILK PRODUCTS 25-34K 5663.37
CHEESE 25-34K 5423.52
BEEF 35-49K 19230.10
SOFT DRINKS 35-49K 17853.96
FLUID MILK PRODUCTS 35-49K 12934.51
CHEESE 35-49K 11579.98
FRZN MEAT/MEAT DINNERS 35-49K 11564.07
SOFT DRINKS 50-74K 22010.52
BEEF 50-74K 21663.90
FLUID MILK PRODUCTS 50-74K 15707.60
CHEESE 50-74K 15155.94
FRZN MEAT/MEAT DINNERS 50-74K 12903.46
SOFT DRINKS 75-99K 13621.86
BEEF 75-99K 10869.89
FLUID MILK PRODUCTS 75-99K 8688.98
CHEESE 75-99K 7680.54
BEERS/ALES 75-99K 6822.22
SOFT DRINKS 100-124K 4127.24
BEEF 100-124K 3834.48
FLUID MILK PRODUCTS 100-124K 3132.09
BEERS/ALES 100-124K 2474.89
CHEESE 100-124K 2206.70
SOFT DRINKS 125-149K 4595.07
FLUID MILK PRODUCTS 125-149K 4356.19
CHEESE 125-149K 4041.13
BEEF 125-149K 3989.74
BAKED BREAD/BUNS/ROLLS 125-149K 3285.06
SOFT DRINKS 150-174K 5562.43
BEEF 150-174K 4817.54
FLUID MILK PRODUCTS 150-174K 3208.86
CHEESE 150-174K 2868.25
FRZN MEAT/MEAT DINNERS 150-174K 2385.66
BEEF 175-199K 1475.40
FLUID MILK PRODUCTS 175-199K 953.26
CHEESE 175-199K 882.88
BAKED BREAD/BUNS/ROLLS 175-199K 722.82
BAG SNACKS 175-199K 683.26
SOFT DRINKS 200-249K 508.74
BEEF 200-249K 474.83
CHEESE 200-249K 390.98
COLD CEREAL 200-249K 299.06
BAG SNACKS 200-249K 277.93
BEEF 250K+ 2072.10
SOFT DRINKS 250K+ 1745.54
FLUID MILK PRODUCTS 250K+ 1724.02
COLD CEREAL 250K+ 1295.42
CHEESE 250K+ 1260.95

Filter for income up to 99K

top_5_up_to_99k <- top_5_by_income %>% 
  filter(income %in% c("Under 15K", "15-24K", "25-34K", "35-49K", "50-74K", "75-99K"))

Stacked bar plot - Lower Income

ggplot(top_5_up_to_99k, aes(x = income, y = total_sales, fill = product_category)) +
  geom_bar(stat = "identity", width = 0.4) +  # Narrow bars
  scale_y_continuous(labels = scales::comma, breaks = seq(0, 90000, by = 10000)) +  # 20,000 increments up to 160,000
  labs(title = "Top 5 Product Categories by Income Bracket (Up to 99K)",
       x = "Income Bracket", y = "Total Sales ($)",
       fill = "Product Category") +
  scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow")) +  # 8 distinct colors
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),  # Rotate x-axis
    axis.text.y = element_text(size = 8, margin = margin(r = 10)),  # Space y-axis
    legend.position = "right",  # Legend on right
    legend.text = element_text(size = 8),  # Smaller legend text
    legend.key.size = unit(0.5, "cm"),  # Smaller legend keys
    legend.margin = margin(10, 10, 10, 20),  # Space for legend
    plot.margin = margin(10, 20, 10, 10),  # Extra plot margin
    panel.grid.major.x = element_blank()  # Remove x-axis grid
  )

Filter for income 100K and higher

top_5_100k_plus <- top_5_by_income %>% 
  filter(income %in% c("100-124K", "125-149K", "150-174K", "175-199K", "200-249K", "250K+"))

Stacked bar plot - Higher Income

ggplot(top_5_100k_plus, aes(x = income, y = total_sales, fill = product_category)) +
  geom_bar(stat = "identity", width = 0.4) +  # Narrow bars
  scale_y_continuous(labels = scales::comma, breaks = seq(0, 22000, by = 2000)) +  # 10,000 increments up to 60,000
  labs(title = "Top 5 Product Categories by Income Bracket (100K and Higher)",
       x = "Income Bracket", y = "Total Sales ($)",
       fill = "Product Category") +
  scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow", "brown")) +  # 9 distinct colors
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),  # Rotate x-axis
    axis.text.y = element_text(size = 8, margin = margin(r = 10)),  # Space y-axis
    legend.position = "right",  # Legend on right
    legend.text = element_text(size = 8),  # Smaller legend text
    legend.key.size = unit(0.5, "cm"),  # Smaller legend keys
    legend.margin = margin(10, 10, 10, 20),  # Space for legend
    plot.margin = margin(10, 20, 10, 10),  # Extra plot margin
    panel.grid.major.x = element_blank()  # Remove x-axis grid
  )

What these tables and graphs reveal is that BEEF and SOFT DRINKS consistently rank among the top 5 across all income brackets (Under 15K to 250K+), with sales ranging from $9,567 to $22,011 for BEEF and $509 to $22,011 for SOFT DRINKS. This suggests a universal demand for meat and beverages. These can be considered staples.

Dairy products like FLUID MILK PRODUCTS and CHEESE also appear frequently as top items across almost all income levels. This is indicative of habitual purchases which draw a wide range of customers through the Regork doors.

The graphs also reveale differences between income brackets. For example, lower-income groups (Under 15K to 25-34K) tend to buy FROZEN PIZZA ($4,840 - $4,433) and FRZN MEAT/MEAT DINNERS ($3,943 - $6,940) more often, suggesting a preference for affordable, ready-to-eat options, likely influenced by constrained budgets.

However, middle-income groups (earning 35-49K to 75-99K) tend to buy BEERS/ALES ($6,822 - $8,787) in addition to higher sales in purchases of FRZN MEAT/MEAT DINNERS ($11,564 - $12,903). This indicates a shift toward discretionary spending and convenient foods as disposable income rises.

Higher income groups (earning 100K+) begin to buy things like BAKED BREAD/BUNS/ROLLS ($723 - $3,285) and BAG SNACKS ($278 - $683) especially in the higher brackets (175-199K to 250K+), suggesting preferences for fresh or snack options, while staple sales (like BEEF, $1,475 - $2,072) decline relatively.

Therefore, spending priorities change from budget-friendly convenience in lower incomes to variety and freshness in higher incomes. People who earn more money also tend to spend less money at Regork, implying they might shop elsewhere or use their disposable income to dine out or have groceries delivered to their homes.

Next, let’s evaluate coupon usage, analyzing coupon redemption counts to see which top 10 product categories drive customers to shop for deals at Regork:

Join datasets with many-to-many relationship and filter out NA product_category

coupon_usage <- coupons %>% 
  left_join(coupon_redemptions, by = c("coupon_upc", "campaign_id"), relationship = "many-to-many") %>% 
  left_join(products, by = "product_id") %>% 
  filter(!is.na(product_category)) %>%  # Exclude NA product_category
  group_by(product_category) %>% 
  summarize(coupon_count = n()) %>%  # Count coupon redemptions per category
  arrange(desc(coupon_count)) %>% 
  slice_head(n = 10)  # Top 10 categories

Display table

kable(coupon_usage, 
      caption = "Top 10 Product Categories by Coupon Usage", 
      align = "lc")
Top 10 Product Categories by Coupon Usage
product_category coupon_count
BEEF 133340
LUNCHMEAT 95849
FRZN MEAT/MEAT DINNERS 93991
MAKEUP AND TREATMENT 77488
PORK 71910
FROZEN PIZZA 65555
HAIR CARE PRODUCTS 56719
YOGURT 55586
CHEESE 46387
ICE CREAM/MILK/SHERBTS 44033

Many of the top sold items appear here, including BEEF, FRZN MEAT/MEAT DINNERS, FROZEN PIZZA, and CHEESE. Other popular couponed items include other meat items like LUNCHMEAT and PORK, and other dairy items like YOGURT and ICE CREAM/MILK/SHERBTS. Cosmetic products like MAKEUP AND TREATMENT and HAIR CARE PRODUCTS also show up here.

Now that we have an overall view of coupon usage, let’s see which products each income group tends to redeem coupons for the most:

Join datasets with many-to-many relationship and filter out NA values

coupon_redemptions_by_income <- coupons %>% 
  left_join(coupon_redemptions, by = c("coupon_upc", "campaign_id"), relationship = "many-to-many") %>% 
  left_join(products, by = "product_id") %>% 
  left_join(demographics, by = "household_id") %>% 
  filter(!is.na(product_category) & !is.na(income)) %>%  # Exclude NA product_category and income
  group_by(income, product_category) %>% 
  summarize(coupon_count = n(), .groups = "drop") %>%  # Count coupon redemptions per category per income
  group_by(income) %>% 
  slice_max(coupon_count, n = 5, with_ties = FALSE) %>%  # Top 5 per income bracket
  arrange(income, desc(coupon_count)) %>% 
  filter(income %in% c("Under 15K", "15-24K", "25-34K", "35-49K", "50-74K", "75-99K"))  # Filter up to 99K

Stacked bar plot - Coupon usage for lower incomes

ggplot(coupon_redemptions_by_income, aes(x = income, y = coupon_count, fill = product_category)) +
  geom_bar(stat = "identity", width = 0.4) +  # Narrow bars
  labs(title = "Coupon Redemptions by Product Category (Incomes Up to 99K)",
       x = "Income Bracket", y = "Number of Coupons Redeemed",
       fill = "Product Category") +
  scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow", "brown", "grey", "lightblue", "lightgreen", "lavender", "salmon", "darkgreen")) +  # 15 distinct colors
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),  # Rotate x-axis
    axis.text.y = element_text(size = 8, margin = margin(r = 10)),  # Space y-axis
    legend.position = "right",  # Legend on right
    legend.text = element_text(size = 8),  # Smaller legend text
    legend.key.size = unit(0.5, "cm"),  # Smaller legend keys
    legend.margin = margin(10, 10, 10, 20),  # Space for legend
    plot.margin = margin(10, 20, 10, 10),  # Extra plot margin
    panel.grid.major.x = element_blank()  # Remove x-axis grid
  )

Join datasets with many-to-many relationship and filter out NA values

coupon_redemptions_by_income <- coupons %>% 
  left_join(coupon_redemptions, by = c("coupon_upc", "campaign_id"), relationship = "many-to-many") %>% 
  left_join(products, by = "product_id") %>% 
  left_join(demographics, by = "household_id") %>% 
  filter(!is.na(product_category) & !is.na(income)) %>%  # Exclude NA product_category and income
  group_by(income, product_category) %>% 
  summarise(coupon_count = n(), .groups = "drop") %>%  # Count coupon redemptions per category per income
  group_by(income) %>% 
  slice_max(coupon_count, n = 5, with_ties = FALSE) %>%  # Top 5 per income bracket
  arrange(income, desc(coupon_count)) %>% 
  filter(income %in% c("100-124K", "125-149K", "150-174K", "175-199K", "200-249K", "250K+"))  # Filter above 99K

Stacked bar plot - Coupon usage for higher incomes

ggplot(coupon_redemptions_by_income, aes(x = income, y = coupon_count, fill = product_category)) +
  geom_bar(stat = "identity", width = 0.4) +  # Narrow bars
  labs(title = "Coupon Redemptions by Product Category (Incomes Above 99K)",
       x = "Income Bracket", y = "Number of Coupons Redeemed",
       fill = "Product Category") +
  scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow", "brown", "grey", "lightblue", "lightgreen", "lavender", "salmon", "darkgreen")) +  # 15 distinct colors
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),  # Rotate x-axis
    axis.text.y = element_text(size = 8, margin = margin(r = 10)),  # Space y-axis
    legend.position = "right",  # Legend on right
    legend.text = element_text(size = 8),  # Smaller legend text
    legend.key.size = unit(0.5, "cm"),  # Smaller legend keys
    legend.margin = margin(10, 10, 10, 20),  # Space for legend
    plot.margin = margin(10, 20, 10, 10),  # Extra plot margin
    panel.grid.major.x = element_blank()  # Remove x-axis grid
  )

Analysis of Coupon Usage

BEEF consistently ranks high across all income brackets for coupon usage (i.e. Under 15K: 9,914; 50-74K: 36,370; 250K+: 1,097), reflecting a strong preference for protein-rich foods. LUNCHMEAT and PORK also appear frequently (i.e. 15-24K: 3,164; 150-174K: 4,741), suggesting meat is a coupon-driven priority regardless of income.

There are some differences between income brackets.

Lower-income households (Under 15K to 25-34K) focus their coupon use on convenience foods and essentials, i.e. CHEESE (3,386 - 5,723), CHICKEN (3,106), and FROZEN PIZZA (4,497), alongside other meats. Redemption counts are moderate (e.g., Under 15K: 3,106 - 9,914).

Middle-income groups (35-49K to 75-99K) show by far the highest coupon redemption counts overall, with the 50-74K income range standing out. In addition to the common products like BEEF, coupon use emerges fro MAKEUP AND TREATMENT (5,240 - 31,887) and HAIR CARE PRODUCTS (2,796 - 22,796). This suggests discretionary spending on personal care alongside staples.

Higher-income brackets (100K+ to 250K+) use coupons less, likely because they are less sensitive to prices with more disposable income. Top items include PROCESSED (2,809), YOGURT (705 - 644), and BABY FOODS (635) alongside the meat staples. MAKEUP AND TREATMENT (2,406) persists in the 200-249K income range.

Summary

Problem Statement

The core challenge I tackled was identifying growth opportunities for Regork by evaluating spending habits and coupon usage across income groups to drive higher sales. The focus was on groceries and household items, excluding non-core categories like “COUPON/MISC ITEMS” to target customers effectively.

Methodology

I used R to analyze the CompleteJourney dataset, pulling data from transactions, products, demographics, coupons, and coupon redemptions over a year. I performed exploratory analysis with joins to combine datasets, summarized sales and coupon counts by product category and income, and created tables and graphs (like stacked bars and pie charts) to uncover trends. This approach helped me slice the data by income groups and visualize key patterns.

Insights

My analysis revealed that BEEF and SOFT DRINKS dominate sales and coupon usage across all income brackets, marking them as staples. Dairy (FLUID MILK PRODUCTS, CHEESE) also shows consistent demand. Lower-income groups (Under 15K–25-34K) favor affordable options like FROZEN PIZZA and FRZN MEAT/MEAT DINNERS, while middle-income (35-49K–75-99K) lean toward BEERS/ALES and discretionary items like MAKEUP AND TREATMENT. Higher-income (100K+) shifts to BAKED BREAD/BUNS/ROLLS and BAG SNACKS, with lower overall spending, suggesting alternative shopping habits.

Implications and Recommendations

These insights guide Regork’s growth. For expansion in existing strong sales areas, I recommend heavily marketing BEEF, SOFT DRINKS, and dairy to all income groups, especially middle-income (50-74K), where sales and coupon use both peak. Use targeted ads and in-store promotions to amplify this. For market penetration, focus on lower-income groups with coupons for FROZEN PIZZA and FRZN MEAT/MEAT DINNERS, tapping their budget-driven demand. Middle-income should get coupons for MAKEUP AND TREATMENT and HAIR CARE PRODUCTS to boost discretionary spending. Higher-income needs niche coupons for YOGURT and BABY FOODS to re-engage them, addressing their lower store presence. This dual strategy maximizes revenue and attracts new buyers.

Limitations and Improvements

My analysis lacks time trends, limiting seasonal insights, and excludes unknown “NA” data from product info, potentially missing a segment. Future work could add time-series analysis to spot holiday spikes or include unknown product data. Other demographics, like age and household size, could also be incorporated into future studies.