Final Project - Regork Data Analysis

suppressMessages(suppressWarnings(library(knitr)))
suppressMessages(suppressWarnings(library(tidyverse)))
knitr::opts_chunk$set(echo = TRUE)

Introduction

The following report is addressed to the CEO of Regork for viewership and consideration.

Business Problem Addressed:

There is an opportunity for Regork to experience business growth through a strategic marketing campaign. The business problem at hand is how to leverage the spending habits and coupon usage across different income brackets to target customers and boost sales revenue. The focus here will be on the sales of actual groceries and everyday household items.

Approach:

Using the programming language known as R, robust data analysis was performed on the CompleteJourney data package. I extracted information from datasets such as transactions, products, demographics, coupons, and coupon redemptions for Regork spanning a one-year period. Then, I conducted exploratory analysis of the data through joins, table summaries, and graphical visualizations to find insights.

Usefulness:

The analysis that follows takes raw data, like sales revenue by product and coupon redemption counts, and configures them in an easy-to-interpret format. You will be able to view the extent of sales and coupon usage across demographic groups varying by household income, to see who buys what and who is attracted to the best deals. This will enable us to identify habits and propose a solution for how to best target customers to attract them to spend more money shopping at Regork.

Packages and Libraries Required

The following R packages will be required to run this code:

completejourney - Needed for access to the core datasets (e.g., transactions, products, demographics, coupons, and coupon redemptions).
tidyverse – Needed for data wrangling (i.e. filter, group_by, summarize), visualization (i.e., ggplot2), and glimpse() to view data summaries.
curl – May be needed to circumvent SSL connectivity problems on Windows devices.
knitr – Used for clean R Markdown output (i.e. formatting tables, suppressing messages).

Import CompleteJourney if you haven’t already:

options(download.file.method = "wininet")
install.packages("remotes")
remotes::install_github("bradleyboehmke/completejourney")

Load the libraries for the following packages:

suppressMessages(suppressWarnings(library(completejourney)))

suppressMessages(suppressWarnings(library(tidyverse)))

suppressMessages(suppressWarnings(library(knitr)))

suppressMessages(suppressWarnings(library(curl)))

Load the CompleteJourney datasets:

transactions <- get_transactions()

promotions <- get_promotions()

If those get_xxx() commands don’t work due to SSL connectivity issues, try these commands for Windows R:

url <- "https://raw.githubusercontent.com/bradleyboehmke/completejourney/master/data/transactions.rds"

tmp <- tempfile(fileext = ".rds")

curl::curl_download(url, tmp)

transactions <- readRDS(tmp)


url2 <- "https://raw.githubusercontent.com/bradleyboehmke/completejourney/master/data/promotions.rds"

tmp2 <- tempfile(fileext = ".rds")

curl::curl_download(url2, tmp2)

promotions <- readRDS(tmp2)

Load the rest of the relevant datasets:

data("demographics")
data("products")
data("coupons")
data("coupon_redemptions")

Exploratory Data Analysis

First, a good habit is to view the names of the columns in the data, and also check to see where there are unknown “NA” values, so that we can avoid NA errors when joining the data.

options(tibble.width = Inf)

transactions %>% summarize_all(~sum(is.na(.))) %>% print()

## # A tibble: 1 × 11
##   household_id store_id basket_id product_id quantity sales_value retail_disc
##          <int>    <int>     <int>      <int>    <int>       <int>       <int>
## 1            0        0         0          0        0           0           0
##   coupon_disc coupon_match_disc  week transaction_timestamp
##         <int>             <int> <int>                 <int>
## 1           0                 0     0                     0

products %>% summarize_all(~sum(is.na(.))) %>% print()

## # A tibble: 1 × 7
##   product_id manufacturer_id department brand product_category product_type
##        <int>           <int>      <int> <int>            <int>        <int>
## 1          0               0          0     0              540          528
##   package_size
##          <int>
## 1        30586

promotions %>% summarize_all(~sum(is.na(.))) %>% print()

## # A tibble: 1 × 5
##   product_id store_id display_location mailer_location  week
##        <int>    <int>            <int>           <int> <int>
## 1          0        0                0               0     0

demographics %>% summarize_all(~sum(is.na(.))) %>% print()

## # A tibble: 1 × 8
##   household_id   age income home_ownership marital_status household_size
##          <int> <int>  <int>          <int>          <int>          <int>
## 1            0     0      0            233            137              0
##   household_comp kids_count
##            <int>      <int>
## 1              0          0

coupons %>% summarize_all(~sum(is.na(.))) %>% print()

## # A tibble: 1 × 3
##   coupon_upc product_id campaign_id
##        <int>      <int>       <int>
## 1          0          0           0

coupon_redemptions %>% summarize_all(~sum(is.na(.))) %>% print()

## # A tibble: 1 × 4
##   household_id coupon_upc campaign_id redemption_date
##          <int>      <int>       <int>           <int>
## 1            0          0           0               0

This reveals useful columns of data i.e. household_id, product_id, sales_value, product_category, product_type, and income.

Information like product_category and product_type have NA values, so we will need to be mindful of this.

We can now begin summarizing and viewing the actual data in meaningful ways, to see where exploration leads us. First, let’s take a look at the top 10 product categories in terms of sales revenue.

Top 10 Product Categories by Revenue

top_categories <- transactions %>% 
  left_join(products, by = "product_id") %>% 
  group_by(product_category) %>% 
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>% 
  arrange(desc(total_sales)) %>% 
  slice_head(n = 10) %>%
  print()

## # A tibble: 10 × 2
##    product_category       total_sales
##    <chr>                        <dbl>
##  1 COUPON/MISC ITEMS          385972.
##  2 SOFT DRINKS                182126.
##  3 BEEF                       176615.
##  4 FLUID MILK PRODUCTS        116361.
##  5 CHEESE                     107012.
##  6 FRZN MEAT/MEAT DINNERS      93024.
##  7 BAG SNACKS                  84944.
##  8 BAKED BREAD/BUNS/ROLLS      82430.
##  9 BEERS/ALES                  82039.
## 10 FROZEN PIZZA                81175.

We can visualize this with a simple table and bar graph:

Table for Top 10

kable(top_categories, caption = "Top 10 Product Categories by Total Sales", align = "lc")

Top 10 Product Categories by Total Sales
product_category	total_sales
COUPON/MISC ITEMS	385971.85
SOFT DRINKS	182126.30
BEEF	176614.54
FLUID MILK PRODUCTS	116360.68
CHEESE	107011.61
FRZN MEAT/MEAT DINNERS	93023.74
BAG SNACKS	84943.85
BAKED BREAD/BUNS/ROLLS	82429.70
BEERS/ALES	82039.45
FROZEN PIZZA	81175.04

Visualize with ggplot2

ggplot(top_categories, aes(x = reorder(product_category, total_sales), y = total_sales)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(title = "Top 10 Product Categories by Revenue",
       x = "Product Category", y = "Total Sales ($)") +
  theme_minimal()

At a glance, this shows us that identifiable items like soft drinks, beef, milk, cheese, and frozen meat products have the highest sales revenue, with the largest amount of sales going toward “COUPON / MISC ITEMS.” We will need to explore this product category further to see if it warrants inclusion in general groceries and storebought household items.

Explore product_type within COUPON/MISC ITEMS

coupon_misc_sales <- transactions %>% 
  left_join(products, by = "product_id") %>% 
  filter(product_category == "COUPON/MISC ITEMS") %>% 
  group_by(product_type) %>% 
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>% 
  arrange(desc(total_sales))

Print results

print("Sales by product_type in COUPON/MISC ITEMS:")

## [1] "Sales by product_type in COUPON/MISC ITEMS:"

  coupon_misc_sales %>% print(n = Inf)

## # A tibble: 19 × 2
##    product_type                   total_sales
##    <chr>                                <dbl>
##  1 GASOLINE-REG UNLEADED            364117.  
##  2 ELECTRONIC GIFT CARDS ACTIVATI     5780.  
##  3 OUTSIDE VENDORS GIFT CARDS         2785.  
##  4 MEAT SUPPLIES                      2670.  
##  5 ELECTRONIC GIFT CARDS REFRESH      2268   
##  6 MISCELLANEOUS H & B AIDS           2038.  
##  7 MISC SALES TRANS                   1739.  
##  8 MASTERCARD GIFT CARD               1338.  
##  9 PRODUCE DEPT KEY RING               980.  
## 10 FLORAL DEPT KEY RING                667.  
## 11 AMERICAN EXPRESS GIFT CARD          453.  
## 12 CAN DOG FOOD RATION (TRIX/VETS      435.  
## 13 TICKETS                             302   
## 14 MISC. BATH PRODUCTS                 188.  
## 15 SOFT DRINKS CAN NON-CARB (EXCE      102.  
## 16 BULK CANDY                           67.0 
## 17 JUICE                                16.9 
## 18 CENTRAL SUPPLIES                     15   
## 19 COFF SHOP: RETAIL PACK BEVERAG        9.95

There’s no need for further visualization, because the display shows us that the lion’s share of this COUPON / MISC ITEMS category ($364K out of $386K) is gasoline from the fuel pump. Most of the remainder of the sales are for gift cards and miscellaneous items. Any true grocery items have negligible sales. Therefore we will exclude this product category from our top 10 moving forward.

Running the code again with this in mind:

Top 10 product categories by revenue, excluding COUPON/MISC ITEMS

top_categories_no_misc <- transactions %>% 
  left_join(products, by = "product_id") %>% 
  filter(product_category != "COUPON/MISC ITEMS") %>% 
  group_by(product_category) %>% 
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>% 
  arrange(desc(total_sales)) %>% 
  slice_head(n = 10)

Create and display table

kable(top_categories_no_misc, 
      caption = "Top 10 Product Categories by Revenue (Excluding COUPON/MISC ITEMS)", align = "lc")

Top 10 Product Categories by Revenue (Excluding COUPON/MISC ITEMS)
product_category	total_sales
SOFT DRINKS	182126.30
BEEF	176614.54
FLUID MILK PRODUCTS	116360.68
CHEESE	107011.61
FRZN MEAT/MEAT DINNERS	93023.74
BAG SNACKS	84943.85
BAKED BREAD/BUNS/ROLLS	82429.70
BEERS/ALES	82039.45
FROZEN PIZZA	81175.04
COLD CEREAL	63009.35

Visualize with ggplot2

ggplot(top_categories_no_misc, aes(x = reorder(product_category, total_sales), y = total_sales)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(title = "Top 10 Product Categories by Revenue (Excl. COUPON/MISC ITEMS)",
       x = "Product Category", y = "Total Sales ($)") +
  theme_minimal()

Alternatively, we can also visualize these results as a pie chart, to view the relative portions of sales for each product:

Generate top 10 product categories by total sales as percentages

top_categories_no_misc <- transactions %>% 
  left_join(products, by = "product_id") %>% 
  filter(product_category != "COUPON/MISC ITEMS") %>% 
  group_by(product_category) %>% 
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>% 
  mutate(percentage = total_sales / sum(total_sales) * 100) %>% 
  arrange(desc(total_sales)) %>% 
  slice_head(n = 10)

Pie chart of top 10 product sales

ggplot(top_categories_no_misc, aes(x = "", y = percentage, fill = product_category)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  labs(title = "Top 10 Product Categories by Sales (Excluding COUPON/MISC ITEMS)",
       fill = "Product Category") +
  scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow", "brown", "grey")) +  # 10 distinct colors
  theme_minimal() +
  theme(
    axis.title = element_blank(),  # Remove axis titles
    axis.text = element_blank(),   # Remove axis text
    panel.grid = element_blank(),  # Remove grid
    legend.position = "right",     # Legend on right
    legend.text = element_text(size = 8),  # Smaller legend text
    legend.key.size = unit(0.5, "cm"),    # Smaller legend keys
    legend.margin = margin(10, 10, 10, 20),  # Space for legend
    plot.margin = margin(10, 20, 10, 10)    # Extra plot margin
  )

Now that the top 10 product categories have been identified, we will carry these forward and analyze what the top 5 are for each income bracket. This will be split into two graphs, since the total sales revenue for higher income brackets is much lower than that of the lower income brackets. This will make each income group (lower vs. higher) easier to visualize.

Define top 10 product categories (excluding COUPON/MISC ITEMS)

top_10_categories <- c("SOFT DRINKS", "BEEF", "FLUID MILK PRODUCTS", "CHEESE", 
                       "FRZN MEAT/MEAT DINNERS", "BAG SNACKS", 
                       "BAKED BREAD/BUNS/ROLLS", "BEERS/ALES", 
                       "FROZEN PIZZA", "COLD CEREAL")

Sales by product_category and income

sales_by_income <- transactions %>% 
  left_join(products, by = "product_id") %>% 
  left_join(demographics, by = "household_id") %>% 
  filter(product_category %in% top_10_categories, !is.na(income)) %>% 
  group_by(product_category, income) %>% 
  summarize(total_sales = sum(sales_value, na.rm = TRUE), .groups = "drop") %>% 
  arrange(product_category, desc(total_sales))

Top 5 product categories by sales for each income bracket

top_5_by_income <- sales_by_income %>% 
  group_by(income) %>% 
  slice_max(total_sales, n = 5, with_ties = FALSE) %>%  # Top 5 by sales per income
  arrange(income, desc(total_sales))

Display table

kable(top_5_by_income, 
      caption = "Sales for Top 5 Product Categories by Income", 
      align = "llc")

Sales for Top 5 Product Categories by Income
product_category	income	total_sales
BEEF	Under 15K	9567.36
SOFT DRINKS	Under 15K	9501.77
CHEESE	Under 15K	4903.56
FROZEN PIZZA	Under 15K	4839.56
FLUID MILK PRODUCTS	Under 15K	4248.32
SOFT DRINKS	15-24K	10811.04
BEEF	15-24K	7813.07
FLUID MILK PRODUCTS	15-24K	4552.21
FRZN MEAT/MEAT DINNERS	15-24K	4382.31
BAG SNACKS	15-24K	3853.93
BEEF	25-34K	10082.68
SOFT DRINKS	25-34K	8814.52
FRZN MEAT/MEAT DINNERS	25-34K	6939.98
FLUID MILK PRODUCTS	25-34K	5663.37
CHEESE	25-34K	5423.52
BEEF	35-49K	19230.10
SOFT DRINKS	35-49K	17853.96
FLUID MILK PRODUCTS	35-49K	12934.51
CHEESE	35-49K	11579.98
FRZN MEAT/MEAT DINNERS	35-49K	11564.07
SOFT DRINKS	50-74K	22010.52
BEEF	50-74K	21663.90
FLUID MILK PRODUCTS	50-74K	15707.60
CHEESE	50-74K	15155.94
FRZN MEAT/MEAT DINNERS	50-74K	12903.46
SOFT DRINKS	75-99K	13621.86
BEEF	75-99K	10869.89
FLUID MILK PRODUCTS	75-99K	8688.98
CHEESE	75-99K	7680.54
BEERS/ALES	75-99K	6822.22
SOFT DRINKS	100-124K	4127.24
BEEF	100-124K	3834.48
FLUID MILK PRODUCTS	100-124K	3132.09
BEERS/ALES	100-124K	2474.89
CHEESE	100-124K	2206.70
SOFT DRINKS	125-149K	4595.07
FLUID MILK PRODUCTS	125-149K	4356.19
CHEESE	125-149K	4041.13
BEEF	125-149K	3989.74
BAKED BREAD/BUNS/ROLLS	125-149K	3285.06
SOFT DRINKS	150-174K	5562.43
BEEF	150-174K	4817.54
FLUID MILK PRODUCTS	150-174K	3208.86
CHEESE	150-174K	2868.25
FRZN MEAT/MEAT DINNERS	150-174K	2385.66
BEEF	175-199K	1475.40
FLUID MILK PRODUCTS	175-199K	953.26
CHEESE	175-199K	882.88
BAKED BREAD/BUNS/ROLLS	175-199K	722.82
BAG SNACKS	175-199K	683.26
SOFT DRINKS	200-249K	508.74
BEEF	200-249K	474.83
CHEESE	200-249K	390.98
COLD CEREAL	200-249K	299.06
BAG SNACKS	200-249K	277.93
BEEF	250K+	2072.10
SOFT DRINKS	250K+	1745.54
FLUID MILK PRODUCTS	250K+	1724.02
COLD CEREAL	250K+	1295.42
CHEESE	250K+	1260.95

Filter for income up to 99K

top_5_up_to_99k <- top_5_by_income %>% 
  filter(income %in% c("Under 15K", "15-24K", "25-34K", "35-49K", "50-74K", "75-99K"))

Stacked bar plot - Lower Income

ggplot(top_5_up_to_99k, aes(x = income, y = total_sales, fill = product_category)) +
  geom_bar(stat = "identity", width = 0.4) +  # Narrow bars
  scale_y_continuous(labels = scales::comma, breaks = seq(0, 90000, by = 10000)) +  # 20,000 increments up to 160,000
  labs(title = "Top 5 Product Categories by Income Bracket (Up to 99K)",
       x = "Income Bracket", y = "Total Sales ($)",
       fill = "Product Category") +
  scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow")) +  # 8 distinct colors
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),  # Rotate x-axis
    axis.text.y = element_text(size = 8, margin = margin(r = 10)),  # Space y-axis
    legend.position = "right",  # Legend on right
    legend.text = element_text(size = 8),  # Smaller legend text
    legend.key.size = unit(0.5, "cm"),  # Smaller legend keys
    legend.margin = margin(10, 10, 10, 20),  # Space for legend
    plot.margin = margin(10, 20, 10, 10),  # Extra plot margin
    panel.grid.major.x = element_blank()  # Remove x-axis grid
  )

Filter for income 100K and higher

top_5_100k_plus <- top_5_by_income %>% 
  filter(income %in% c("100-124K", "125-149K", "150-174K", "175-199K", "200-249K", "250K+"))

Stacked bar plot - Higher Income

ggplot(top_5_100k_plus, aes(x = income, y = total_sales, fill = product_category)) +
  geom_bar(stat = "identity", width = 0.4) +  # Narrow bars
  scale_y_continuous(labels = scales::comma, breaks = seq(0, 22000, by = 2000)) +  # 10,000 increments up to 60,000
  labs(title = "Top 5 Product Categories by Income Bracket (100K and Higher)",
       x = "Income Bracket", y = "Total Sales ($)",
       fill = "Product Category") +
  scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow", "brown")) +  # 9 distinct colors
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),  # Rotate x-axis
    axis.text.y = element_text(size = 8, margin = margin(r = 10)),  # Space y-axis
    legend.position = "right",  # Legend on right
    legend.text = element_text(size = 8),  # Smaller legend text
    legend.key.size = unit(0.5, "cm"),  # Smaller legend keys
    legend.margin = margin(10, 10, 10, 20),  # Space for legend
    plot.margin = margin(10, 20, 10, 10),  # Extra plot margin
    panel.grid.major.x = element_blank()  # Remove x-axis grid
  )

What these tables and graphs reveal is that BEEF and SOFT DRINKS consistently rank among the top 5 across all income brackets (Under 15K to 250K+), with sales ranging from $9,567 to $22,011 for BEEF and $509 to $22,011 for SOFT DRINKS. This suggests a universal demand for meat and beverages. These can be considered staples.

Dairy products like FLUID MILK PRODUCTS and CHEESE also appear frequently as top items across almost all income levels. This is indicative of habitual purchases which draw a wide range of customers through the Regork doors.

The graphs also reveale differences between income brackets. For example, lower-income groups (Under 15K to 25-34K) tend to buy FROZEN PIZZA ($4,840 - $4,433) and FRZN MEAT/MEAT DINNERS ($3,943 - $6,940) more often, suggesting a preference for affordable, ready-to-eat options, likely influenced by constrained budgets.

However, middle-income groups (earning 35-49K to 75-99K) tend to buy BEERS/ALES ($6,822 - $8,787) in addition to higher sales in purchases of FRZN MEAT/MEAT DINNERS ($11,564 - $12,903). This indicates a shift toward discretionary spending and convenient foods as disposable income rises.

Higher income groups (earning 100K+) begin to buy things like BAKED BREAD/BUNS/ROLLS ($723 - $3,285) and BAG SNACKS ($278 - $683) especially in the higher brackets (175-199K to 250K+), suggesting preferences for fresh or snack options, while staple sales (like BEEF, $1,475 - $2,072) decline relatively.

Therefore, spending priorities change from budget-friendly convenience in lower incomes to variety and freshness in higher incomes. People who earn more money also tend to spend less money at Regork, implying they might shop elsewhere or use their disposable income to dine out or have groceries delivered to their homes.

Next, let’s evaluate coupon usage, analyzing coupon redemption counts to see which top 10 product categories drive customers to shop for deals at Regork:

Join datasets with many-to-many relationship and filter out NA product_category

coupon_usage <- coupons %>% 
  left_join(coupon_redemptions, by = c("coupon_upc", "campaign_id"), relationship = "many-to-many") %>% 
  left_join(products, by = "product_id") %>% 
  filter(!is.na(product_category)) %>%  # Exclude NA product_category
  group_by(product_category) %>% 
  summarize(coupon_count = n()) %>%  # Count coupon redemptions per category
  arrange(desc(coupon_count)) %>% 
  slice_head(n = 10)  # Top 10 categories

Display table

kable(coupon_usage, 
      caption = "Top 10 Product Categories by Coupon Usage", 
      align = "lc")

Top 10 Product Categories by Coupon Usage
product_category	coupon_count
BEEF	133340
LUNCHMEAT	95849
FRZN MEAT/MEAT DINNERS	93991
MAKEUP AND TREATMENT	77488
PORK	71910
FROZEN PIZZA	65555
HAIR CARE PRODUCTS	56719
YOGURT	55586
CHEESE	46387
ICE CREAM/MILK/SHERBTS	44033

Many of the top sold items appear here, including BEEF, FRZN MEAT/MEAT DINNERS, FROZEN PIZZA, and CHEESE. Other popular couponed items include other meat items like LUNCHMEAT and PORK, and other dairy items like YOGURT and ICE CREAM/MILK/SHERBTS. Cosmetic products like MAKEUP AND TREATMENT and HAIR CARE PRODUCTS also show up here.

Now that we have an overall view of coupon usage, let’s see which products each income group tends to redeem coupons for the most:

Join datasets with many-to-many relationship and filter out NA values

coupon_redemptions_by_income <- coupons %>% 
  left_join(coupon_redemptions, by = c("coupon_upc", "campaign_id"), relationship = "many-to-many") %>% 
  left_join(products, by = "product_id") %>% 
  left_join(demographics, by = "household_id") %>% 
  filter(!is.na(product_category) & !is.na(income)) %>%  # Exclude NA product_category and income
  group_by(income, product_category) %>% 
  summarize(coupon_count = n(), .groups = "drop") %>%  # Count coupon redemptions per category per income
  group_by(income) %>% 
  slice_max(coupon_count, n = 5, with_ties = FALSE) %>%  # Top 5 per income bracket
  arrange(income, desc(coupon_count)) %>% 
  filter(income %in% c("Under 15K", "15-24K", "25-34K", "35-49K", "50-74K", "75-99K"))  # Filter up to 99K

Stacked bar plot - Coupon usage for lower incomes

ggplot(coupon_redemptions_by_income, aes(x = income, y = coupon_count, fill = product_category)) +
  geom_bar(stat = "identity", width = 0.4) +  # Narrow bars
  labs(title = "Coupon Redemptions by Product Category (Incomes Up to 99K)",
       x = "Income Bracket", y = "Number of Coupons Redeemed",
       fill = "Product Category") +
  scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow", "brown", "grey", "lightblue", "lightgreen", "lavender", "salmon", "darkgreen")) +  # 15 distinct colors
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),  # Rotate x-axis
    axis.text.y = element_text(size = 8, margin = margin(r = 10)),  # Space y-axis
    legend.position = "right",  # Legend on right
    legend.text = element_text(size = 8),  # Smaller legend text
    legend.key.size = unit(0.5, "cm"),  # Smaller legend keys
    legend.margin = margin(10, 10, 10, 20),  # Space for legend
    plot.margin = margin(10, 20, 10, 10),  # Extra plot margin
    panel.grid.major.x = element_blank()  # Remove x-axis grid
  )

Join datasets with many-to-many relationship and filter out NA values

coupon_redemptions_by_income <- coupons %>% 
  left_join(coupon_redemptions, by = c("coupon_upc", "campaign_id"), relationship = "many-to-many") %>% 
  left_join(products, by = "product_id") %>% 
  left_join(demographics, by = "household_id") %>% 
  filter(!is.na(product_category) & !is.na(income)) %>%  # Exclude NA product_category and income
  group_by(income, product_category) %>% 
  summarise(coupon_count = n(), .groups = "drop") %>%  # Count coupon redemptions per category per income
  group_by(income) %>% 
  slice_max(coupon_count, n = 5, with_ties = FALSE) %>%  # Top 5 per income bracket
  arrange(income, desc(coupon_count)) %>% 
  filter(income %in% c("100-124K", "125-149K", "150-174K", "175-199K", "200-249K", "250K+"))  # Filter above 99K

Stacked bar plot - Coupon usage for higher incomes

ggplot(coupon_redemptions_by_income, aes(x = income, y = coupon_count, fill = product_category)) +
  geom_bar(stat = "identity", width = 0.4) +  # Narrow bars
  labs(title = "Coupon Redemptions by Product Category (Incomes Above 99K)",
       x = "Income Bracket", y = "Number of Coupons Redeemed",
       fill = "Product Category") +
  scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow", "brown", "grey", "lightblue", "lightgreen", "lavender", "salmon", "darkgreen")) +  # 15 distinct colors
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),  # Rotate x-axis
    axis.text.y = element_text(size = 8, margin = margin(r = 10)),  # Space y-axis
    legend.position = "right",  # Legend on right
    legend.text = element_text(size = 8),  # Smaller legend text
    legend.key.size = unit(0.5, "cm"),  # Smaller legend keys
    legend.margin = margin(10, 10, 10, 20),  # Space for legend
    plot.margin = margin(10, 20, 10, 10),  # Extra plot margin
    panel.grid.major.x = element_blank()  # Remove x-axis grid
  )

Analysis of Coupon Usage

BEEF consistently ranks high across all income brackets for coupon usage (i.e. Under 15K: 9,914; 50-74K: 36,370; 250K+: 1,097), reflecting a strong preference for protein-rich foods. LUNCHMEAT and PORK also appear frequently (i.e. 15-24K: 3,164; 150-174K: 4,741), suggesting meat is a coupon-driven priority regardless of income.

There are some differences between income brackets.

Lower-income households (Under 15K to 25-34K) focus their coupon use on convenience foods and essentials, i.e. CHEESE (3,386 - 5,723), CHICKEN (3,106), and FROZEN PIZZA (4,497), alongside other meats. Redemption counts are moderate (e.g., Under 15K: 3,106 - 9,914).

Middle-income groups (35-49K to 75-99K) show by far the highest coupon redemption counts overall, with the 50-74K income range standing out. In addition to the common products like BEEF, coupon use emerges fro MAKEUP AND TREATMENT (5,240 - 31,887) and HAIR CARE PRODUCTS (2,796 - 22,796). This suggests discretionary spending on personal care alongside staples.

Higher-income brackets (100K+ to 250K+) use coupons less, likely because they are less sensitive to prices with more disposable income. Top items include PROCESSED (2,809), YOGURT (705 - 644), and BABY FOODS (635) alongside the meat staples. MAKEUP AND TREATMENT (2,406) persists in the 200-249K income range.

Summary

Problem Statement

The core challenge I tackled was identifying growth opportunities for Regork by evaluating spending habits and coupon usage across income groups to drive higher sales. The focus was on groceries and household items, excluding non-core categories like “COUPON/MISC ITEMS” to target customers effectively.

Methodology

I used R to analyze the CompleteJourney dataset, pulling data from transactions, products, demographics, coupons, and coupon redemptions over a year. I performed exploratory analysis with joins to combine datasets, summarized sales and coupon counts by product category and income, and created tables and graphs (like stacked bars and pie charts) to uncover trends. This approach helped me slice the data by income groups and visualize key patterns.

Insights

My analysis revealed that BEEF and SOFT DRINKS dominate sales and coupon usage across all income brackets, marking them as staples. Dairy (FLUID MILK PRODUCTS, CHEESE) also shows consistent demand. Lower-income groups (Under 15K–25-34K) favor affordable options like FROZEN PIZZA and FRZN MEAT/MEAT DINNERS, while middle-income (35-49K–75-99K) lean toward BEERS/ALES and discretionary items like MAKEUP AND TREATMENT. Higher-income (100K+) shifts to BAKED BREAD/BUNS/ROLLS and BAG SNACKS, with lower overall spending, suggesting alternative shopping habits.

Implications and Recommendations

These insights guide Regork’s growth. For expansion in existing strong sales areas, I recommend heavily marketing BEEF, SOFT DRINKS, and dairy to all income groups, especially middle-income (50-74K), where sales and coupon use both peak. Use targeted ads and in-store promotions to amplify this. For market penetration, focus on lower-income groups with coupons for FROZEN PIZZA and FRZN MEAT/MEAT DINNERS, tapping their budget-driven demand. Middle-income should get coupons for MAKEUP AND TREATMENT and HAIR CARE PRODUCTS to boost discretionary spending. Higher-income needs niche coupons for YOGURT and BABY FOODS to re-engage them, addressing their lower store presence. This dual strategy maximizes revenue and attracts new buyers.

Limitations and Improvements

My analysis lacks time trends, limiting seasonal insights, and excludes unknown “NA” data from product info, potentially missing a segment. Future work could add time-series analysis to spot holiday spikes or include unknown product data. Other demographics, like age and household size, could also be incorporated into future studies.

Final Project - Regork Data Analysis

George Cain

2025-10-12

Introduction

Packages and Libraries Required

Exploratory Data Analysis

Top 10 Product Categories by Revenue

Table for Top 10

Visualize with ggplot2

Explore product_type within COUPON/MISC ITEMS

Print results

Top 10 product categories by revenue, excluding COUPON/MISC ITEMS

Create and display table

Visualize with ggplot2

Generate top 10 product categories by total sales as percentages

Pie chart of top 10 product sales

Define top 10 product categories (excluding COUPON/MISC ITEMS)

Sales by product_category and income

Top 5 product categories by sales for each income bracket

Display table

Filter for income up to 99K

Stacked bar plot - Lower Income

Filter for income 100K and higher

Stacked bar plot - Higher Income

Join datasets with many-to-many relationship and filter out NA product_category

Display table

Join datasets with many-to-many relationship and filter out NA values

Stacked bar plot - Coupon usage for lower incomes

Join datasets with many-to-many relationship and filter out NA values

Stacked bar plot - Coupon usage for higher incomes

Analysis of Coupon Usage

Summary

Problem Statement

Methodology

Insights

Implications and Recommendations

Limitations and Improvements