suppressMessages(suppressWarnings(library(knitr)))
suppressMessages(suppressWarnings(library(tidyverse)))
knitr::opts_chunk$set(echo = TRUE)
The following report is addressed to the CEO of Regork for viewership and consideration.
Business Problem Addressed:
There is an opportunity for Regork to experience business growth through a strategic marketing campaign. The business problem at hand is how to leverage the spending habits and coupon usage across different income brackets to target customers and boost sales revenue. The focus here will be on the sales of actual groceries and everyday household items.
Approach:
Using the programming language known as R, robust data analysis was performed on the CompleteJourney data package. I extracted information from datasets such as transactions, products, demographics, coupons, and coupon redemptions for Regork spanning a one-year period. Then, I conducted exploratory analysis of the data through joins, table summaries, and graphical visualizations to find insights.
Usefulness:
The analysis that follows takes raw data, like sales revenue by product and coupon redemption counts, and configures them in an easy-to-interpret format. You will be able to view the extent of sales and coupon usage across demographic groups varying by household income, to see who buys what and who is attracted to the best deals. This will enable us to identify habits and propose a solution for how to best target customers to attract them to spend more money shopping at Regork.
The following R packages will be required to run this code:
completejourney - Needed for access to the core
datasets (e.g., transactions, products, demographics, coupons, and
coupon redemptions).
tidyverse – Needed for data wrangling (i.e. filter,
group_by, summarize), visualization (i.e., ggplot2), and glimpse() to
view data summaries.
curl – May be needed to circumvent SSL connectivity
problems on Windows devices.
knitr – Used for clean R Markdown output
(i.e. formatting tables, suppressing messages).
Import CompleteJourney if you haven’t already:
options(download.file.method = "wininet")
install.packages("remotes")
remotes::install_github("bradleyboehmke/completejourney")
Load the libraries for the following packages:
suppressMessages(suppressWarnings(library(completejourney)))
suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(knitr)))
suppressMessages(suppressWarnings(library(curl)))
Load the CompleteJourney datasets:
transactions <- get_transactions()
promotions <- get_promotions()
If those get_xxx() commands don’t work due to SSL connectivity issues, try these commands for Windows R:
url <- "https://raw.githubusercontent.com/bradleyboehmke/completejourney/master/data/transactions.rds"
tmp <- tempfile(fileext = ".rds")
curl::curl_download(url, tmp)
transactions <- readRDS(tmp)
url2 <- "https://raw.githubusercontent.com/bradleyboehmke/completejourney/master/data/promotions.rds"
tmp2 <- tempfile(fileext = ".rds")
curl::curl_download(url2, tmp2)
promotions <- readRDS(tmp2)
Load the rest of the relevant datasets:
data("demographics")
data("products")
data("coupons")
data("coupon_redemptions")
First, a good habit is to view the names of the columns in the data, and also check to see where there are unknown “NA” values, so that we can avoid NA errors when joining the data.
options(tibble.width = Inf)
transactions %>% summarize_all(~sum(is.na(.))) %>% print()
## # A tibble: 1 × 11
## household_id store_id basket_id product_id quantity sales_value retail_disc
## <int> <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0 0
## coupon_disc coupon_match_disc week transaction_timestamp
## <int> <int> <int> <int>
## 1 0 0 0 0
products %>% summarize_all(~sum(is.na(.))) %>% print()
## # A tibble: 1 × 7
## product_id manufacturer_id department brand product_category product_type
## <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 540 528
## package_size
## <int>
## 1 30586
promotions %>% summarize_all(~sum(is.na(.))) %>% print()
## # A tibble: 1 × 5
## product_id store_id display_location mailer_location week
## <int> <int> <int> <int> <int>
## 1 0 0 0 0 0
demographics %>% summarize_all(~sum(is.na(.))) %>% print()
## # A tibble: 1 × 8
## household_id age income home_ownership marital_status household_size
## <int> <int> <int> <int> <int> <int>
## 1 0 0 0 233 137 0
## household_comp kids_count
## <int> <int>
## 1 0 0
coupons %>% summarize_all(~sum(is.na(.))) %>% print()
## # A tibble: 1 × 3
## coupon_upc product_id campaign_id
## <int> <int> <int>
## 1 0 0 0
coupon_redemptions %>% summarize_all(~sum(is.na(.))) %>% print()
## # A tibble: 1 × 4
## household_id coupon_upc campaign_id redemption_date
## <int> <int> <int> <int>
## 1 0 0 0 0
This reveals useful columns of data i.e. household_id, product_id, sales_value, product_category, product_type, and income.
Information like product_category and product_type have NA values, so we will need to be mindful of this.
We can now begin summarizing and viewing the actual data in meaningful ways, to see where exploration leads us. First, let’s take a look at the top 10 product categories in terms of sales revenue.
top_categories <- transactions %>%
left_join(products, by = "product_id") %>%
group_by(product_category) %>%
summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_sales)) %>%
slice_head(n = 10) %>%
print()
## # A tibble: 10 × 2
## product_category total_sales
## <chr> <dbl>
## 1 COUPON/MISC ITEMS 385972.
## 2 SOFT DRINKS 182126.
## 3 BEEF 176615.
## 4 FLUID MILK PRODUCTS 116361.
## 5 CHEESE 107012.
## 6 FRZN MEAT/MEAT DINNERS 93024.
## 7 BAG SNACKS 84944.
## 8 BAKED BREAD/BUNS/ROLLS 82430.
## 9 BEERS/ALES 82039.
## 10 FROZEN PIZZA 81175.
We can visualize this with a simple table and bar graph:
kable(top_categories, caption = "Top 10 Product Categories by Total Sales", align = "lc")
| product_category | total_sales |
|---|---|
| COUPON/MISC ITEMS | 385971.85 |
| SOFT DRINKS | 182126.30 |
| BEEF | 176614.54 |
| FLUID MILK PRODUCTS | 116360.68 |
| CHEESE | 107011.61 |
| FRZN MEAT/MEAT DINNERS | 93023.74 |
| BAG SNACKS | 84943.85 |
| BAKED BREAD/BUNS/ROLLS | 82429.70 |
| BEERS/ALES | 82039.45 |
| FROZEN PIZZA | 81175.04 |
ggplot(top_categories, aes(x = reorder(product_category, total_sales), y = total_sales)) +
geom_bar(stat = "identity", fill = "skyblue") +
coord_flip() +
labs(title = "Top 10 Product Categories by Revenue",
x = "Product Category", y = "Total Sales ($)") +
theme_minimal()
At a glance, this shows us that identifiable items like soft drinks, beef, milk, cheese, and frozen meat products have the highest sales revenue, with the largest amount of sales going toward “COUPON / MISC ITEMS.” We will need to explore this product category further to see if it warrants inclusion in general groceries and storebought household items.
coupon_misc_sales <- transactions %>%
left_join(products, by = "product_id") %>%
filter(product_category == "COUPON/MISC ITEMS") %>%
group_by(product_type) %>%
summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_sales))
print("Sales by product_type in COUPON/MISC ITEMS:")
## [1] "Sales by product_type in COUPON/MISC ITEMS:"
coupon_misc_sales %>% print(n = Inf)
## # A tibble: 19 × 2
## product_type total_sales
## <chr> <dbl>
## 1 GASOLINE-REG UNLEADED 364117.
## 2 ELECTRONIC GIFT CARDS ACTIVATI 5780.
## 3 OUTSIDE VENDORS GIFT CARDS 2785.
## 4 MEAT SUPPLIES 2670.
## 5 ELECTRONIC GIFT CARDS REFRESH 2268
## 6 MISCELLANEOUS H & B AIDS 2038.
## 7 MISC SALES TRANS 1739.
## 8 MASTERCARD GIFT CARD 1338.
## 9 PRODUCE DEPT KEY RING 980.
## 10 FLORAL DEPT KEY RING 667.
## 11 AMERICAN EXPRESS GIFT CARD 453.
## 12 CAN DOG FOOD RATION (TRIX/VETS 435.
## 13 TICKETS 302
## 14 MISC. BATH PRODUCTS 188.
## 15 SOFT DRINKS CAN NON-CARB (EXCE 102.
## 16 BULK CANDY 67.0
## 17 JUICE 16.9
## 18 CENTRAL SUPPLIES 15
## 19 COFF SHOP: RETAIL PACK BEVERAG 9.95
There’s no need for further visualization, because the display shows us that the lion’s share of this COUPON / MISC ITEMS category ($364K out of $386K) is gasoline from the fuel pump. Most of the remainder of the sales are for gift cards and miscellaneous items. Any true grocery items have negligible sales. Therefore we will exclude this product category from our top 10 moving forward.
Running the code again with this in mind:
top_categories_no_misc <- transactions %>%
left_join(products, by = "product_id") %>%
filter(product_category != "COUPON/MISC ITEMS") %>%
group_by(product_category) %>%
summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_sales)) %>%
slice_head(n = 10)
kable(top_categories_no_misc,
caption = "Top 10 Product Categories by Revenue (Excluding COUPON/MISC ITEMS)", align = "lc")
| product_category | total_sales |
|---|---|
| SOFT DRINKS | 182126.30 |
| BEEF | 176614.54 |
| FLUID MILK PRODUCTS | 116360.68 |
| CHEESE | 107011.61 |
| FRZN MEAT/MEAT DINNERS | 93023.74 |
| BAG SNACKS | 84943.85 |
| BAKED BREAD/BUNS/ROLLS | 82429.70 |
| BEERS/ALES | 82039.45 |
| FROZEN PIZZA | 81175.04 |
| COLD CEREAL | 63009.35 |
ggplot(top_categories_no_misc, aes(x = reorder(product_category, total_sales), y = total_sales)) +
geom_bar(stat = "identity", fill = "skyblue") +
coord_flip() +
labs(title = "Top 10 Product Categories by Revenue (Excl. COUPON/MISC ITEMS)",
x = "Product Category", y = "Total Sales ($)") +
theme_minimal()
Alternatively, we can also visualize these results as a pie chart, to view the relative portions of sales for each product:
top_categories_no_misc <- transactions %>%
left_join(products, by = "product_id") %>%
filter(product_category != "COUPON/MISC ITEMS") %>%
group_by(product_category) %>%
summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%
mutate(percentage = total_sales / sum(total_sales) * 100) %>%
arrange(desc(total_sales)) %>%
slice_head(n = 10)
ggplot(top_categories_no_misc, aes(x = "", y = percentage, fill = product_category)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
labs(title = "Top 10 Product Categories by Sales (Excluding COUPON/MISC ITEMS)",
fill = "Product Category") +
scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow", "brown", "grey")) + # 10 distinct colors
theme_minimal() +
theme(
axis.title = element_blank(), # Remove axis titles
axis.text = element_blank(), # Remove axis text
panel.grid = element_blank(), # Remove grid
legend.position = "right", # Legend on right
legend.text = element_text(size = 8), # Smaller legend text
legend.key.size = unit(0.5, "cm"), # Smaller legend keys
legend.margin = margin(10, 10, 10, 20), # Space for legend
plot.margin = margin(10, 20, 10, 10) # Extra plot margin
)
Now that the top 10 product categories have been identified, we will carry these forward and analyze what the top 5 are for each income bracket. This will be split into two graphs, since the total sales revenue for higher income brackets is much lower than that of the lower income brackets. This will make each income group (lower vs. higher) easier to visualize.
top_10_categories <- c("SOFT DRINKS", "BEEF", "FLUID MILK PRODUCTS", "CHEESE",
"FRZN MEAT/MEAT DINNERS", "BAG SNACKS",
"BAKED BREAD/BUNS/ROLLS", "BEERS/ALES",
"FROZEN PIZZA", "COLD CEREAL")
sales_by_income <- transactions %>%
left_join(products, by = "product_id") %>%
left_join(demographics, by = "household_id") %>%
filter(product_category %in% top_10_categories, !is.na(income)) %>%
group_by(product_category, income) %>%
summarize(total_sales = sum(sales_value, na.rm = TRUE), .groups = "drop") %>%
arrange(product_category, desc(total_sales))
top_5_by_income <- sales_by_income %>%
group_by(income) %>%
slice_max(total_sales, n = 5, with_ties = FALSE) %>% # Top 5 by sales per income
arrange(income, desc(total_sales))
kable(top_5_by_income,
caption = "Sales for Top 5 Product Categories by Income",
align = "llc")
| product_category | income | total_sales |
|---|---|---|
| BEEF | Under 15K | 9567.36 |
| SOFT DRINKS | Under 15K | 9501.77 |
| CHEESE | Under 15K | 4903.56 |
| FROZEN PIZZA | Under 15K | 4839.56 |
| FLUID MILK PRODUCTS | Under 15K | 4248.32 |
| SOFT DRINKS | 15-24K | 10811.04 |
| BEEF | 15-24K | 7813.07 |
| FLUID MILK PRODUCTS | 15-24K | 4552.21 |
| FRZN MEAT/MEAT DINNERS | 15-24K | 4382.31 |
| BAG SNACKS | 15-24K | 3853.93 |
| BEEF | 25-34K | 10082.68 |
| SOFT DRINKS | 25-34K | 8814.52 |
| FRZN MEAT/MEAT DINNERS | 25-34K | 6939.98 |
| FLUID MILK PRODUCTS | 25-34K | 5663.37 |
| CHEESE | 25-34K | 5423.52 |
| BEEF | 35-49K | 19230.10 |
| SOFT DRINKS | 35-49K | 17853.96 |
| FLUID MILK PRODUCTS | 35-49K | 12934.51 |
| CHEESE | 35-49K | 11579.98 |
| FRZN MEAT/MEAT DINNERS | 35-49K | 11564.07 |
| SOFT DRINKS | 50-74K | 22010.52 |
| BEEF | 50-74K | 21663.90 |
| FLUID MILK PRODUCTS | 50-74K | 15707.60 |
| CHEESE | 50-74K | 15155.94 |
| FRZN MEAT/MEAT DINNERS | 50-74K | 12903.46 |
| SOFT DRINKS | 75-99K | 13621.86 |
| BEEF | 75-99K | 10869.89 |
| FLUID MILK PRODUCTS | 75-99K | 8688.98 |
| CHEESE | 75-99K | 7680.54 |
| BEERS/ALES | 75-99K | 6822.22 |
| SOFT DRINKS | 100-124K | 4127.24 |
| BEEF | 100-124K | 3834.48 |
| FLUID MILK PRODUCTS | 100-124K | 3132.09 |
| BEERS/ALES | 100-124K | 2474.89 |
| CHEESE | 100-124K | 2206.70 |
| SOFT DRINKS | 125-149K | 4595.07 |
| FLUID MILK PRODUCTS | 125-149K | 4356.19 |
| CHEESE | 125-149K | 4041.13 |
| BEEF | 125-149K | 3989.74 |
| BAKED BREAD/BUNS/ROLLS | 125-149K | 3285.06 |
| SOFT DRINKS | 150-174K | 5562.43 |
| BEEF | 150-174K | 4817.54 |
| FLUID MILK PRODUCTS | 150-174K | 3208.86 |
| CHEESE | 150-174K | 2868.25 |
| FRZN MEAT/MEAT DINNERS | 150-174K | 2385.66 |
| BEEF | 175-199K | 1475.40 |
| FLUID MILK PRODUCTS | 175-199K | 953.26 |
| CHEESE | 175-199K | 882.88 |
| BAKED BREAD/BUNS/ROLLS | 175-199K | 722.82 |
| BAG SNACKS | 175-199K | 683.26 |
| SOFT DRINKS | 200-249K | 508.74 |
| BEEF | 200-249K | 474.83 |
| CHEESE | 200-249K | 390.98 |
| COLD CEREAL | 200-249K | 299.06 |
| BAG SNACKS | 200-249K | 277.93 |
| BEEF | 250K+ | 2072.10 |
| SOFT DRINKS | 250K+ | 1745.54 |
| FLUID MILK PRODUCTS | 250K+ | 1724.02 |
| COLD CEREAL | 250K+ | 1295.42 |
| CHEESE | 250K+ | 1260.95 |
top_5_up_to_99k <- top_5_by_income %>%
filter(income %in% c("Under 15K", "15-24K", "25-34K", "35-49K", "50-74K", "75-99K"))
ggplot(top_5_up_to_99k, aes(x = income, y = total_sales, fill = product_category)) +
geom_bar(stat = "identity", width = 0.4) + # Narrow bars
scale_y_continuous(labels = scales::comma, breaks = seq(0, 90000, by = 10000)) + # 20,000 increments up to 160,000
labs(title = "Top 5 Product Categories by Income Bracket (Up to 99K)",
x = "Income Bracket", y = "Total Sales ($)",
fill = "Product Category") +
scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow")) + # 8 distinct colors
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8), # Rotate x-axis
axis.text.y = element_text(size = 8, margin = margin(r = 10)), # Space y-axis
legend.position = "right", # Legend on right
legend.text = element_text(size = 8), # Smaller legend text
legend.key.size = unit(0.5, "cm"), # Smaller legend keys
legend.margin = margin(10, 10, 10, 20), # Space for legend
plot.margin = margin(10, 20, 10, 10), # Extra plot margin
panel.grid.major.x = element_blank() # Remove x-axis grid
)
top_5_100k_plus <- top_5_by_income %>%
filter(income %in% c("100-124K", "125-149K", "150-174K", "175-199K", "200-249K", "250K+"))
ggplot(top_5_100k_plus, aes(x = income, y = total_sales, fill = product_category)) +
geom_bar(stat = "identity", width = 0.4) + # Narrow bars
scale_y_continuous(labels = scales::comma, breaks = seq(0, 22000, by = 2000)) + # 10,000 increments up to 60,000
labs(title = "Top 5 Product Categories by Income Bracket (100K and Higher)",
x = "Income Bracket", y = "Total Sales ($)",
fill = "Product Category") +
scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow", "brown")) + # 9 distinct colors
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8), # Rotate x-axis
axis.text.y = element_text(size = 8, margin = margin(r = 10)), # Space y-axis
legend.position = "right", # Legend on right
legend.text = element_text(size = 8), # Smaller legend text
legend.key.size = unit(0.5, "cm"), # Smaller legend keys
legend.margin = margin(10, 10, 10, 20), # Space for legend
plot.margin = margin(10, 20, 10, 10), # Extra plot margin
panel.grid.major.x = element_blank() # Remove x-axis grid
)
What these tables and graphs reveal is that BEEF and SOFT DRINKS consistently rank among the top 5 across all income brackets (Under 15K to 250K+), with sales ranging from $9,567 to $22,011 for BEEF and $509 to $22,011 for SOFT DRINKS. This suggests a universal demand for meat and beverages. These can be considered staples.
Dairy products like FLUID MILK PRODUCTS and CHEESE also appear frequently as top items across almost all income levels. This is indicative of habitual purchases which draw a wide range of customers through the Regork doors.
The graphs also reveale differences between income brackets. For example, lower-income groups (Under 15K to 25-34K) tend to buy FROZEN PIZZA ($4,840 - $4,433) and FRZN MEAT/MEAT DINNERS ($3,943 - $6,940) more often, suggesting a preference for affordable, ready-to-eat options, likely influenced by constrained budgets.
However, middle-income groups (earning 35-49K to 75-99K) tend to buy BEERS/ALES ($6,822 - $8,787) in addition to higher sales in purchases of FRZN MEAT/MEAT DINNERS ($11,564 - $12,903). This indicates a shift toward discretionary spending and convenient foods as disposable income rises.
Higher income groups (earning 100K+) begin to buy things like BAKED BREAD/BUNS/ROLLS ($723 - $3,285) and BAG SNACKS ($278 - $683) especially in the higher brackets (175-199K to 250K+), suggesting preferences for fresh or snack options, while staple sales (like BEEF, $1,475 - $2,072) decline relatively.
Therefore, spending priorities change from budget-friendly convenience in lower incomes to variety and freshness in higher incomes. People who earn more money also tend to spend less money at Regork, implying they might shop elsewhere or use their disposable income to dine out or have groceries delivered to their homes.
Next, let’s evaluate coupon usage, analyzing coupon redemption counts to see which top 10 product categories drive customers to shop for deals at Regork:
coupon_usage <- coupons %>%
left_join(coupon_redemptions, by = c("coupon_upc", "campaign_id"), relationship = "many-to-many") %>%
left_join(products, by = "product_id") %>%
filter(!is.na(product_category)) %>% # Exclude NA product_category
group_by(product_category) %>%
summarize(coupon_count = n()) %>% # Count coupon redemptions per category
arrange(desc(coupon_count)) %>%
slice_head(n = 10) # Top 10 categories
kable(coupon_usage,
caption = "Top 10 Product Categories by Coupon Usage",
align = "lc")
| product_category | coupon_count |
|---|---|
| BEEF | 133340 |
| LUNCHMEAT | 95849 |
| FRZN MEAT/MEAT DINNERS | 93991 |
| MAKEUP AND TREATMENT | 77488 |
| PORK | 71910 |
| FROZEN PIZZA | 65555 |
| HAIR CARE PRODUCTS | 56719 |
| YOGURT | 55586 |
| CHEESE | 46387 |
| ICE CREAM/MILK/SHERBTS | 44033 |
Many of the top sold items appear here, including BEEF, FRZN MEAT/MEAT DINNERS, FROZEN PIZZA, and CHEESE. Other popular couponed items include other meat items like LUNCHMEAT and PORK, and other dairy items like YOGURT and ICE CREAM/MILK/SHERBTS. Cosmetic products like MAKEUP AND TREATMENT and HAIR CARE PRODUCTS also show up here.
Now that we have an overall view of coupon usage, let’s see which products each income group tends to redeem coupons for the most:
coupon_redemptions_by_income <- coupons %>%
left_join(coupon_redemptions, by = c("coupon_upc", "campaign_id"), relationship = "many-to-many") %>%
left_join(products, by = "product_id") %>%
left_join(demographics, by = "household_id") %>%
filter(!is.na(product_category) & !is.na(income)) %>% # Exclude NA product_category and income
group_by(income, product_category) %>%
summarize(coupon_count = n(), .groups = "drop") %>% # Count coupon redemptions per category per income
group_by(income) %>%
slice_max(coupon_count, n = 5, with_ties = FALSE) %>% # Top 5 per income bracket
arrange(income, desc(coupon_count)) %>%
filter(income %in% c("Under 15K", "15-24K", "25-34K", "35-49K", "50-74K", "75-99K")) # Filter up to 99K
ggplot(coupon_redemptions_by_income, aes(x = income, y = coupon_count, fill = product_category)) +
geom_bar(stat = "identity", width = 0.4) + # Narrow bars
labs(title = "Coupon Redemptions by Product Category (Incomes Up to 99K)",
x = "Income Bracket", y = "Number of Coupons Redeemed",
fill = "Product Category") +
scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow", "brown", "grey", "lightblue", "lightgreen", "lavender", "salmon", "darkgreen")) + # 15 distinct colors
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8), # Rotate x-axis
axis.text.y = element_text(size = 8, margin = margin(r = 10)), # Space y-axis
legend.position = "right", # Legend on right
legend.text = element_text(size = 8), # Smaller legend text
legend.key.size = unit(0.5, "cm"), # Smaller legend keys
legend.margin = margin(10, 10, 10, 20), # Space for legend
plot.margin = margin(10, 20, 10, 10), # Extra plot margin
panel.grid.major.x = element_blank() # Remove x-axis grid
)
coupon_redemptions_by_income <- coupons %>%
left_join(coupon_redemptions, by = c("coupon_upc", "campaign_id"), relationship = "many-to-many") %>%
left_join(products, by = "product_id") %>%
left_join(demographics, by = "household_id") %>%
filter(!is.na(product_category) & !is.na(income)) %>% # Exclude NA product_category and income
group_by(income, product_category) %>%
summarise(coupon_count = n(), .groups = "drop") %>% # Count coupon redemptions per category per income
group_by(income) %>%
slice_max(coupon_count, n = 5, with_ties = FALSE) %>% # Top 5 per income bracket
arrange(income, desc(coupon_count)) %>%
filter(income %in% c("100-124K", "125-149K", "150-174K", "175-199K", "200-249K", "250K+")) # Filter above 99K
ggplot(coupon_redemptions_by_income, aes(x = income, y = coupon_count, fill = product_category)) +
geom_bar(stat = "identity", width = 0.4) + # Narrow bars
labs(title = "Coupon Redemptions by Product Category (Incomes Above 99K)",
x = "Income Bracket", y = "Number of Coupons Redeemed",
fill = "Product Category") +
scale_fill_manual(values = c("red", "blue", "green", "purple", "orange", "pink", "cyan", "yellow", "brown", "grey", "lightblue", "lightgreen", "lavender", "salmon", "darkgreen")) + # 15 distinct colors
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8), # Rotate x-axis
axis.text.y = element_text(size = 8, margin = margin(r = 10)), # Space y-axis
legend.position = "right", # Legend on right
legend.text = element_text(size = 8), # Smaller legend text
legend.key.size = unit(0.5, "cm"), # Smaller legend keys
legend.margin = margin(10, 10, 10, 20), # Space for legend
plot.margin = margin(10, 20, 10, 10), # Extra plot margin
panel.grid.major.x = element_blank() # Remove x-axis grid
)
BEEF consistently ranks high across all income brackets for coupon usage (i.e. Under 15K: 9,914; 50-74K: 36,370; 250K+: 1,097), reflecting a strong preference for protein-rich foods. LUNCHMEAT and PORK also appear frequently (i.e. 15-24K: 3,164; 150-174K: 4,741), suggesting meat is a coupon-driven priority regardless of income.
There are some differences between income brackets.
Lower-income households (Under 15K to 25-34K) focus their coupon use on convenience foods and essentials, i.e. CHEESE (3,386 - 5,723), CHICKEN (3,106), and FROZEN PIZZA (4,497), alongside other meats. Redemption counts are moderate (e.g., Under 15K: 3,106 - 9,914).
Middle-income groups (35-49K to 75-99K) show by far the highest coupon redemption counts overall, with the 50-74K income range standing out. In addition to the common products like BEEF, coupon use emerges fro MAKEUP AND TREATMENT (5,240 - 31,887) and HAIR CARE PRODUCTS (2,796 - 22,796). This suggests discretionary spending on personal care alongside staples.
Higher-income brackets (100K+ to 250K+) use coupons less, likely because they are less sensitive to prices with more disposable income. Top items include PROCESSED (2,809), YOGURT (705 - 644), and BABY FOODS (635) alongside the meat staples. MAKEUP AND TREATMENT (2,406) persists in the 200-249K income range.
The core challenge I tackled was identifying growth opportunities for Regork by evaluating spending habits and coupon usage across income groups to drive higher sales. The focus was on groceries and household items, excluding non-core categories like “COUPON/MISC ITEMS” to target customers effectively.
I used R to analyze the CompleteJourney dataset, pulling data from transactions, products, demographics, coupons, and coupon redemptions over a year. I performed exploratory analysis with joins to combine datasets, summarized sales and coupon counts by product category and income, and created tables and graphs (like stacked bars and pie charts) to uncover trends. This approach helped me slice the data by income groups and visualize key patterns.
My analysis revealed that BEEF and SOFT DRINKS dominate sales and coupon usage across all income brackets, marking them as staples. Dairy (FLUID MILK PRODUCTS, CHEESE) also shows consistent demand. Lower-income groups (Under 15K–25-34K) favor affordable options like FROZEN PIZZA and FRZN MEAT/MEAT DINNERS, while middle-income (35-49K–75-99K) lean toward BEERS/ALES and discretionary items like MAKEUP AND TREATMENT. Higher-income (100K+) shifts to BAKED BREAD/BUNS/ROLLS and BAG SNACKS, with lower overall spending, suggesting alternative shopping habits.
These insights guide Regork’s growth. For expansion in existing strong sales areas, I recommend heavily marketing BEEF, SOFT DRINKS, and dairy to all income groups, especially middle-income (50-74K), where sales and coupon use both peak. Use targeted ads and in-store promotions to amplify this. For market penetration, focus on lower-income groups with coupons for FROZEN PIZZA and FRZN MEAT/MEAT DINNERS, tapping their budget-driven demand. Middle-income should get coupons for MAKEUP AND TREATMENT and HAIR CARE PRODUCTS to boost discretionary spending. Higher-income needs niche coupons for YOGURT and BABY FOODS to re-engage them, addressing their lower store presence. This dual strategy maximizes revenue and attracts new buyers.
My analysis lacks time trends, limiting seasonal insights, and excludes unknown “NA” data from product info, potentially missing a segment. Future work could add time-series analysis to spot holiday spikes or include unknown product data. Other demographics, like age and household size, could also be incorporated into future studies.