Final Project: BANA 7025

Market Basket Analysis: Charcuterie Box Association

Introduction

Business Questions:

The goal of this project is to identify areas of growth for a national grocery chain named Regork, where the company could invest future resources to increase revenue and profits. We will be using transaction data from the company database. More information about the data set can be found here

In this project, we plan to explore, analyze and visualize the data to ultimately answer the below three questions:

What are the trends of charcuterie board items purchases? Do they tend to peak around at special holidays?
Is there a certain demographic group is generating large amounts of revenue for these products and we could invest to capture market share in this area?
What types of products are usually bought together for a common charcuterie board? How can we design effective promotions that encourage customers to buy more products together as listed in our common charcuterie board list?

This analysis will help the Regork better their marketing strategies to optimize sales and customer retention.

Common items that are usually included in charcuterie board/box are:

Cured Meats like prosciutto, salami, chorizo, or bresaola.
Cheeses: A variety of cheeses with different flavors and textures, such as brie, cheddar, goat cheese, or blue cheese.
Bread and Crackers: Baguette slices, crackers, or crostini.
Fruits: Grapes, figs, berries, or dried fruits like apricots or dates.
Nuts: Almonds, walnuts, pecans, or pistachios.
Olives: Kalamata or other olives.
Spreads: Honey, jam, chutney, or mustard.

Packages Required

The below libraries are used in this project. Most are data wrangling packages except completejourney, which is the transaction data and arules which is association rule mining package. You will need to install these packages to replicate the analysis

library(ggplot2)
library(completejourney)
library(tidyverse)
library(ggrepel)
library(scales)
library(arules)
library(arulesViz)
library(knitr)

Data Prep

Importing the Data

We first import the data from completejournery library with additional of one field called transaction id which is the primary key of the table for us to later remove duplications after joining to products and coupons, household and redemptions.

For the purpose of this analysis - a table charcuterie_board included items mentioned in introduction is created to search againts the product_type

# Create a transaction ID with increment of 1 to remove dups after join
transactions <- get_transactions() %>%
  mutate(transaction_id = row_number()) %>%
  select(transaction_id, everything())

charcuterie_board <- list(
  Cured_Meats = c("Prosciutto", "Salami", "Chorizo", "Bresaola"),
  Cheeses = c("Cheese","Brie", "Cheddar", "Goat Cheese", "Blue Cheese"),
  Bread_and_Crackers = c("Baguette Slices", "Crackers","Bread","Crostini"),
  Fruits = c("Fruits","Grapes", "Figs"),
  Nuts = c("Almonds", "Walnuts", "Pecans", "Pistachios"),
  Olives = c("Kalamata Olives"),
  Spreads = c("Honey", "Mustard"))

data_joined <- transactions %>%
  left_join(products, by = "product_id") %>%
  left_join(coupons, by = "product_id", relationship = "many-to-many") %>%
  left_join(demographics, by = "household_id") %>%
  left_join(campaigns,
            by = c("household_id", "campaign_id"),
            relationship = "many-to-many") %>%
  left_join(
    coupon_redemptions,
    by = c("household_id", "campaign_id", "coupon_upc"),
    relationship = "many-to-many"
  ) %>%
  distinct(transaction_id, .keep_all = TRUE) %>% # Keep one row per transaction_id
  group_by(transaction_id) %>%
  filter(is.na(redemption_date) |
           row_number() == 1) %>%  # Retain one row if redemption_date is NULL
  ungroup()

Exploratory Data Analysis

We filter our common items list for charcuterie board or any combinations of the occurrences and visualize the trend of these purchases throughout the year with the sales values and demographic groups that purchase these products.

Purchase Trends

interested_transactions <- data_joined %>%
  filter(str_detect(product_category, regex(paste(unlist(charcuterie_board), collapse = "|"), ignore_case = TRUE))) %>%
  filter(str_detect(product_type, regex(paste(unlist(charcuterie_board), collapse = "|"), ignore_case = TRUE))) %>%
  mutate(product_category = if_else(product_category=='CHEESE','CHEESES',product_category))

deli_meats <- data_joined %>%
  filter(str_detect(product_category, regex("meats", ignore_case = TRUE)) & department == 'DELI')

interested_transactions <- rbind(interested_transactions,deli_meats) 
print('Selected products transactions total rows count:')

## [1] "Selected products transactions total rows count:"

nrow(interested_transactions)

## [1] 114574

print("First 10 rows instance")

## [1] "First 10 rows instance"

head(interested_transactions, n=10)

## # A tibble: 10 × 28
##    transaction_id household_id store_id basket_id   product_id quantity
##             <int> <chr>        <chr>    <chr>       <chr>         <dbl>
##  1              9 1058         381      31198676055 988791            1
##  2             24 1873         361      31198640134 862854            1
##  3             32 993          32004    31198515122 859075            1
##  4             57 1465         391      31198920207 826144            1
##  5             81 1269         31642    31198520221 1040807           1
##  6             93 58           299      31198665127 965766            1
##  7            101 2084         368      31198690191 9523169           1
##  8            152 1769         319      31198705337 9420140           2
##  9            178 1899         31782    31198500220 833598            1
## 10            196 1899         31782    31198500220 883932            1
## # ℹ 22 more variables: sales_value <dbl>, retail_disc <dbl>, coupon_disc <dbl>,
## #   coupon_match_disc <dbl>, week <int>, transaction_timestamp <dttm>,
## #   manufacturer_id <chr>, department <chr>, brand <fct>,
## #   product_category <chr>, product_type <chr>, package_size <chr>,
## #   coupon_upc <chr>, campaign_id <chr>, age <ord>, income <ord>,
## #   home_ownership <ord>, marital_status <ord>, household_size <ord>,
## #   household_comp <ord>, kids_count <ord>, redemption_date <date>

# Plot the purchasing trends over time
purchasing_trends <- interested_transactions %>%
  mutate(loyalty_sales = sales_value - (retail_disc+coupon_match_disc) / quantity,
         loyalty_sales = if_else(loyalty_sales<0.0,0.0,loyalty_sales )) %>%
      mutate(loyalty_sales = replace_na(loyalty_sales, 0)) %>%  # Replace NA with 0
  group_by(week) %>%
  summarise( total_sales = sum(sales_value),
    total_loyalty_sales = sum(loyalty_sales),
    total_coupon_sales = sum(if_else(!is.na(redemption_date), retail_disc+coupon_match_disc, 0), na.rm = TRUE)  # Sum coupon sales if redemption_date is not NA
) %>% replace(is.na(.), 0.0)

peak_sales <- purchasing_trends %>%
  slice_max(order_by= total_sales, n=4) 

ggplot(purchasing_trends, aes(x = week, y = total_sales)) +
  labs(title = "Purchasing Trends of Selected Products",
       subtitle = "4 peaks of sales throughout the year",
       x = "Week",
       y = "Total Sales") +
  geom_line(aes(color = "Total Sales")) +
  geom_line(aes(y = total_loyalty_sales, color = "Total Loyalty Sales")) +
    geom_line(aes(y = total_coupon_sales, color = "Total Coupon Sales")) +
  geom_point(data = peak_sales, aes(x = week, y = total_sales), color = "red", size = 3, shape = 21, fill = "yellow") +  # Highlight peak sales
  geom_label(data = peak_sales, aes(x = week, y = total_sales, label = scales::dollar(total_sales)), 
           vjust = -0.4, color = "red", fill = "white", label.size = 0.2) +
  scale_y_continuous(labels = dollar) +
  theme_minimal() +
  scale_color_manual(values = c("Total Sales" = "#619CFF", 
                                 "Total Loyalty Sales" = "#00BA38", 
                                 "Total Coupon Sales" = "#F8766D")) +
  theme(legend.title = element_blank())

Based on the visual, we can see that sales peak $7,819 at the very end of the year and stay pretty consistent throughout the year ranges from $4000 to $6000 a week.
The total coupon redemption amount was determined through the retailer coupon and manufacturer coupon if there was a coupon redemption date available.
Most of the week, the sales transaction rarely include any coupon redemption as depicted by the very low red line on the graph that indicates how much total discount was applied. Therefore, we can focus on this to promote more sales through promotions.

Purchase Demographics

# Plot demographic group that purchase items
purchasing_demographics <- interested_transactions %>%
  mutate(loyalty_sales = sales_value - (retail_disc+coupon_match_disc) / quantity,
         loyalty_sales = if_else(loyalty_sales<0.0,0.0,loyalty_sales )) %>%
    mutate(loyalty_sales = replace_na(loyalty_sales, 0)) %>%  # Replace NA with 0
  group_by(age) %>%
  summarise( total_sales = sum(sales_value),
    total_loyalty_sales = sum(loyalty_sales)) 

purchasing_demographics_long <- purchasing_demographics %>%
  pivot_longer(cols = c(total_sales, total_loyalty_sales),
               names_to = "sales_type",
               values_to = "amount")

second_highest_age_group <- purchasing_demographics_long %>%
  filter(!is.na(age)) %>%
  slice_max(order_by= amount, n=1) 

ggplot(purchasing_demographics_long, aes(x = age, y = amount, fill = sales_type)) +
  geom_bar(stat = "identity", position = "dodge") +
    scale_y_continuous(labels = scales::dollar) +
  labs(title = "Total Sales and Loyalty Sales by Age",
       x = "Age Group",
       y = "Sales Amount") +
  theme_minimal() +
  geom_text(data = second_highest_age_group, aes(x = age, y = amount+1000, label = "Age group 45-54 is the second \n highest group in sales for selected products"), size = 5, vjust = 0)

The highest sales generated come from Age Group Unknown ~ $120,000 and Age Group 45-54 ~ $65,000. What interesting is that sales tends to increase with Age then decreases after the middle life mark - possibly they are not the ones doing the shopping but their children?

Association Analysis

In our case, basket id is considered a transaction in terms of association analysis. So, we group baskets together to create dummy variables which indicates if an item is present in the basket to feed into our item set for basic association analysis.
What is the likelihood that for a given basket, the items mentioned are purchased together? What is the likelihood when they are not or what is the likelihood that either any combinations of them are purchased together?

Before we interpret our result, we will use three important metrics used in association rule mining: Support, Confidence, and Lift.

Support measures the proportion of transactions that contain a specific itemset $A$. In this case, what is the proportion of transactions that contains Cheese or Deli Meat? The total number of transactions in this case is the interested transactions that includes selected products, not all transactions at the store - so we don’t introduce noise to the model.

Formula: \[ \text{Support}(A) = \frac{\text{Number of transactions containing } A}{\text{Total number of transactions}} \]

Confidence measures how often items in $B$ appear in transactions that contain $A$. In this case, how often Deli Meat appear in transactions that contains Cheese?

Formula: \[ \text{Confidence}(A \Rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)} \]

Lift measures the increase in the likelihood of purchasing $B$ when $A$ is purchased, compared to the likelihood of purchasing $B$ independently. Determine the effectiveness of a rule. A lift greater than 1 indicates a positive correlation, suggesting that the presence of $A$ increases the likelihood of purchasing $B$. In other words, Cheese might increase the likelihood of purchasing Meat or otherwise.

Formula: \[ \text{Lift}(A \Rightarrow B) = \frac{\text{Confidence}(A \Rightarrow B)}{\text{Support}(B)} \]

interested_basket <- interested_transactions %>%
  group_by(basket_id) %>%
  summarize(product_list = str_c(unique(product_type), collapse = ";"))

#dummy variable creation
model_basket <- interested_basket %>%
  mutate(
    cured_meats = if_else(str_detect(
      product_list, regex(paste(c("meat", unlist(charcuterie_board$Cured_Meats)), collapse = "|"), ignore_case = TRUE)), TRUE, FALSE),
    cheeses = if_else(str_detect(
      product_list, regex(paste(unlist(charcuterie_board$Cheeses), collapse = "|"), ignore_case = TRUE)), TRUE, FALSE),
    bread_and_crackers = if_else(str_detect(
      product_list, regex(paste(unlist(charcuterie_board$Bread_and_Crackers), collapse = "|"), ignore_case = TRUE)), TRUE, FALSE),
    fruits = if_else(str_detect(
      product_list, regex(paste(unlist(charcuterie_board$Fruits), collapse = "|"), ignore_case = TRUE)), TRUE, FALSE),
      nuts = if_else(str_detect(
      product_list, regex(paste(unlist(charcuterie_board$Nuts), collapse = "|"), ignore_case = TRUE)), TRUE, FALSE),
      olives = if_else(str_detect(
      product_list, regex(paste(unlist(charcuterie_board$Olives), collapse = "|"), ignore_case = TRUE)), TRUE, FALSE),
      spreads = if_else(str_detect(
      product_list, regex(paste(unlist(charcuterie_board$Spreads), collapse = "|"), ignore_case = TRUE)), TRUE, FALSE)
  ) %>%
  select(basket_id, cured_meats, cheeses, bread_and_crackers, fruits, nuts, olives, spreads) %>%
  as.data.frame() %>%
  `rownames<-`(.$basket_id) %>%
  select(-basket_id)

We will use the Apriori algorithm from arules library which is used in data mining for discovering frequent itemsets and generating association rules. It is particularly well-known for its application in market basket analysis, where it helps identify relationships between items purchased together.

basket_trans <- arules::transactions(model_basket)
itemFrequencyPlot(basket_trans,topN = 5)

The Frequent Set Items are, suggesting they are viewed as complementary:
- Cheese
- Bread and Crackers
- Cured Meats
- Fruits
Rule: The set of items in the antecedent (left side) leads to the consequent (right side).

rules_table <- head(its, n = 10)

kable(as(rules_table, "data.frame"), format = "markdown")

	rules	support	confidence	coverage	lift	count
2	{} => {cheeses}	0.6334796	0.6334796	1.0000000	1.000000	32348
1	{} => {bread_and_crackers}	0.6264100	0.6264100	1.0000000	1.000000	31987
3	{cured_meats} => {cheeses}	0.1302287	0.6957522	0.1871769	1.098303	6650
8	{cured_meats,bread_and_crackers} => {cheeses}	0.0853635	0.7728723	0.1104496	1.220043	4359
9	{cured_meats,cheeses} => {bread_and_crackers}	0.0853635	0.6554887	0.1302287	1.046421	4359
6	{bread_and_crackers,fruits} => {cheeses}	0.0374432	0.6697023	0.0559102	1.057181	1912
7	{cheeses,fruits} => {bread_and_crackers}	0.0374432	0.6697023	0.0559102	1.069112	1912
5	{cured_meats,fruits} => {cheeses}	0.0164499	0.7872540	0.0208953	1.242746	840
4	{cured_meats,fruits} => {bread_and_crackers}	0.0147266	0.7047798	0.0208953	1.125109	752
10	{cured_meats,bread_and_crackers,fruits} => {cheeses}	0.0123375	0.8377660	0.0147266	1.322483	630

Interpretation:

High Demand for Cheeses and Bread & Crackers
Rule 1 & 2 : {} => {cheeses} & {} => {bread_and_crackers}
- Support: both have very similar support which is ~ 63% of all transactions
- Confidence: both also have very similar confidence ~ 63%, meaning 63% of transactions include cheeses / bread and crackers, regardless of other items.
  
  => This rule suggests cheeses & bread and crackers are a popular choice overall.
Rule 3 & 9: {cured_meats} => {cheeses} & {cured_meats,cheeses} => {bread_and_crackers}
- Cured meats frequently lead to purchases of cheeses (Rule 3: 69.5% confidence) and, to a lesser extent, bread & crackers (Rule 9: 65.5% confidence). Promoting cured meats as part of a charcuterie board could drive sales of complementary items like cheeses & bread and crackers.
Rule 4 & 5: {cured_meats,fruits} => {bread_and_crackers} & {cured_meats,fruits} => {cheeses}
- Although these items have low support, indicating that they occur less frequently in transactions. They have a very high lift, they still offer valuable insights for targeted marketing and product placement.

Summary

Question 1:

What are the trends of charcuterie board items purchases? Do they tend to peak around at special holidays?

Items classified as common charcuterie board products sales reach a peak of $7,819 at the end of the year, while consistently ranging from $4,000 to $6,000 throughout the rest of the year. Notably, most weeks show minimal coupon redemption, as indicated by the low red line on the first graph in representing total discounts applied. This indicates informed marketing strategies - encouraging targeted promotions around major holidays while maintaining regular pricing strategies for consistent sales.

Question 2:

Is there a certain demographic group is generating large amounts of revenue for these products and we could invest to capture market share in this area? * The bar graph comparison suggests that older individuals may not be the primary shoppers; instead, their children might be making these purchases on their behalf. This insight could guide targeted marketing strategies to better capture market share in these demographics.

Personalized Promotions - Paper Coupons or Cashback: Create tailored offers that appeal to the preferences of consumers of 35-45 age group, potentially encouraging them to shop for charcuterie items for gatherings or family events as they might also want to save money if they bundle.

Question 3:

What types of products are usually bought together for a common charcuterie board? How can we design effective promotions that encourage customers to buy more products together as listed in our common charcuterie board introduction?

Based on the rule tables, the commonly Paired Products are typically bought together and have a strong connection indicated by rule table are:
- Cured Meats
- Cheeses
- Bread & Crackers
- Fruits
Effective Promotion Strategies: Curated Charcuterie Kits: Create bundles that include all essential items: cured meats, cheeses, assorted bread & crackers, and fruits. Offer these kits at a slight discount or cashback or double reward points compared to buying items individually to incentivize purchase. This can encourage bulk purchases and reward repeat customers.
For the target group of marketing is 45-54, hosting in-store tasting events where customers can sample combinations of cured meats, cheeses, fruits, and crackers. This can help them discover new pairings and encourage them to purchase the featured items from gourmet charcuterie board brands and store made board

Conclusion:

The analysis reveals distinct purchasing trends for charcuterie board items, highlighting peak sales around significant holidays such as New Year’s, Thanksgiving, and Christmas. These seasonal spikes indicate a strong consumer interest in charcuterie boards during festive occasions, suggesting that targeted promotional strategies should be developed to capitalize on this demand.

To optimize sales throughout the year, brands can create seasonal or holiday-themed charcuterie board/box that feature carefully curated selections of items like Cured Meats, Cheeses, Bread, Crackers, Fruits (indicated in Frequent Set Items). An area for growth could be to add nuts, spreads and olives to the charcuterie board items. There was no purchase or link of these items to our frequent set. This approach not only caters to customers looking for convenient options for gatherings but also enhances the shopping experience by simplifying decision-making during busy times and increase sales by bundling products.

Additionally, the insights regarding demographics suggest that marketing efforts should focus on middle-aged consumers, who may be purchasing on behalf of older family members. Tailored promotions, such as personalized paper coupons or cashback offers, could further engage other groups, encouraging them to buy charcuterie items for family gatherings and events.

Limitations:

There is only one year of transaction data, which might be bias because different year can have a different purchasing trends such as during COVID-19. We could use the same method of analysis to look at other complementary items such as BBQ and explore further on brand vs store products to drive sales.