Business Questions:
In this project, we plan to explore, analyze and visualize the data to ultimately answer the below three questions:
What are the trends of charcuterie board items purchases? Do they tend to peak around at special holidays?
Is there a certain demographic group is generating large amounts of revenue for these products and we could invest to capture market share in this area?
What types of products are usually bought together for a common charcuterie board? How can we design effective promotions that encourage customers to buy more products together as listed in our common charcuterie board list?
This analysis will help the Regork better their marketing strategies to optimize sales and customer retention.
Common items that are usually included in charcuterie board/box are:
Cured Meats like prosciutto, salami, chorizo, or bresaola.
Cheeses: A variety of cheeses with different flavors and textures, such as brie, cheddar, goat cheese, or blue cheese.
Bread and Crackers: Baguette slices, crackers, or crostini.
Fruits: Grapes, figs, berries, or dried fruits like apricots or dates.
Nuts: Almonds, walnuts, pecans, or pistachios.
Olives: Kalamata or other olives.
Spreads: Honey, jam, chutney, or mustard.
The below libraries are used in this project. Most are data wrangling
packages except completejourney
, which is the transaction
data and arules
which is association rule mining package.
You will need to install these packages to replicate the analysis
library(ggplot2)
library(completejourney)
library(tidyverse)
library(ggrepel)
library(scales)
library(arules)
library(arulesViz)
library(knitr)
completejournery
library
with additional of one field called transaction id
which is
the primary key of the table for us to later remove duplications after
joining to products and coupons, household and redemptions.For the purpose of this analysis - a table
charcuterie_board
included items mentioned in introduction
is created to search againts the product_type
# Create a transaction ID with increment of 1 to remove dups after join
transactions <- get_transactions() %>%
mutate(transaction_id = row_number()) %>%
select(transaction_id, everything())
charcuterie_board <- list(
Cured_Meats = c("Prosciutto", "Salami", "Chorizo", "Bresaola"),
Cheeses = c("Cheese","Brie", "Cheddar", "Goat Cheese", "Blue Cheese"),
Bread_and_Crackers = c("Baguette Slices", "Crackers","Bread","Crostini"),
Fruits = c("Fruits","Grapes", "Figs"),
Nuts = c("Almonds", "Walnuts", "Pecans", "Pistachios"),
Olives = c("Kalamata Olives"),
Spreads = c("Honey", "Mustard"))
data_joined <- transactions %>%
left_join(products, by = "product_id") %>%
left_join(coupons, by = "product_id", relationship = "many-to-many") %>%
left_join(demographics, by = "household_id") %>%
left_join(campaigns,
by = c("household_id", "campaign_id"),
relationship = "many-to-many") %>%
left_join(
coupon_redemptions,
by = c("household_id", "campaign_id", "coupon_upc"),
relationship = "many-to-many"
) %>%
distinct(transaction_id, .keep_all = TRUE) %>% # Keep one row per transaction_id
group_by(transaction_id) %>%
filter(is.na(redemption_date) |
row_number() == 1) %>% # Retain one row if redemption_date is NULL
ungroup()
interested_transactions <- data_joined %>%
filter(str_detect(product_category, regex(paste(unlist(charcuterie_board), collapse = "|"), ignore_case = TRUE))) %>%
filter(str_detect(product_type, regex(paste(unlist(charcuterie_board), collapse = "|"), ignore_case = TRUE))) %>%
mutate(product_category = if_else(product_category=='CHEESE','CHEESES',product_category))
deli_meats <- data_joined %>%
filter(str_detect(product_category, regex("meats", ignore_case = TRUE)) & department == 'DELI')
interested_transactions <- rbind(interested_transactions,deli_meats)
print('Selected products transactions total rows count:')
## [1] "Selected products transactions total rows count:"
nrow(interested_transactions)
## [1] 114574
print("First 10 rows instance")
## [1] "First 10 rows instance"
head(interested_transactions, n=10)
## # A tibble: 10 × 28
## transaction_id household_id store_id basket_id product_id quantity
## <int> <chr> <chr> <chr> <chr> <dbl>
## 1 9 1058 381 31198676055 988791 1
## 2 24 1873 361 31198640134 862854 1
## 3 32 993 32004 31198515122 859075 1
## 4 57 1465 391 31198920207 826144 1
## 5 81 1269 31642 31198520221 1040807 1
## 6 93 58 299 31198665127 965766 1
## 7 101 2084 368 31198690191 9523169 1
## 8 152 1769 319 31198705337 9420140 2
## 9 178 1899 31782 31198500220 833598 1
## 10 196 1899 31782 31198500220 883932 1
## # ℹ 22 more variables: sales_value <dbl>, retail_disc <dbl>, coupon_disc <dbl>,
## # coupon_match_disc <dbl>, week <int>, transaction_timestamp <dttm>,
## # manufacturer_id <chr>, department <chr>, brand <fct>,
## # product_category <chr>, product_type <chr>, package_size <chr>,
## # coupon_upc <chr>, campaign_id <chr>, age <ord>, income <ord>,
## # home_ownership <ord>, marital_status <ord>, household_size <ord>,
## # household_comp <ord>, kids_count <ord>, redemption_date <date>
# Plot the purchasing trends over time
purchasing_trends <- interested_transactions %>%
mutate(loyalty_sales = sales_value - (retail_disc+coupon_match_disc) / quantity,
loyalty_sales = if_else(loyalty_sales<0.0,0.0,loyalty_sales )) %>%
mutate(loyalty_sales = replace_na(loyalty_sales, 0)) %>% # Replace NA with 0
group_by(week) %>%
summarise( total_sales = sum(sales_value),
total_loyalty_sales = sum(loyalty_sales),
total_coupon_sales = sum(if_else(!is.na(redemption_date), retail_disc+coupon_match_disc, 0), na.rm = TRUE) # Sum coupon sales if redemption_date is not NA
) %>% replace(is.na(.), 0.0)
peak_sales <- purchasing_trends %>%
slice_max(order_by= total_sales, n=4)
ggplot(purchasing_trends, aes(x = week, y = total_sales)) +
labs(title = "Purchasing Trends of Selected Products",
subtitle = "4 peaks of sales throughout the year",
x = "Week",
y = "Total Sales") +
geom_line(aes(color = "Total Sales")) +
geom_line(aes(y = total_loyalty_sales, color = "Total Loyalty Sales")) +
geom_line(aes(y = total_coupon_sales, color = "Total Coupon Sales")) +
geom_point(data = peak_sales, aes(x = week, y = total_sales), color = "red", size = 3, shape = 21, fill = "yellow") + # Highlight peak sales
geom_label(data = peak_sales, aes(x = week, y = total_sales, label = scales::dollar(total_sales)),
vjust = -0.4, color = "red", fill = "white", label.size = 0.2) +
scale_y_continuous(labels = dollar) +
theme_minimal() +
scale_color_manual(values = c("Total Sales" = "#619CFF",
"Total Loyalty Sales" = "#00BA38",
"Total Coupon Sales" = "#F8766D")) +
theme(legend.title = element_blank())
Based on the visual, we can see that sales peak
$7,819
at the very end of the year and stay pretty
consistent throughout the year ranges from $4000
to
$6000
a week.
The total coupon redemption amount was determined through the retailer coupon and manufacturer coupon if there was a coupon redemption date available.
Most of the week, the sales transaction rarely include any coupon redemption as depicted by the very low red line on the graph that indicates how much total discount was applied. Therefore, we can focus on this to promote more sales through promotions.
# Plot demographic group that purchase items
purchasing_demographics <- interested_transactions %>%
mutate(loyalty_sales = sales_value - (retail_disc+coupon_match_disc) / quantity,
loyalty_sales = if_else(loyalty_sales<0.0,0.0,loyalty_sales )) %>%
mutate(loyalty_sales = replace_na(loyalty_sales, 0)) %>% # Replace NA with 0
group_by(age) %>%
summarise( total_sales = sum(sales_value),
total_loyalty_sales = sum(loyalty_sales))
purchasing_demographics_long <- purchasing_demographics %>%
pivot_longer(cols = c(total_sales, total_loyalty_sales),
names_to = "sales_type",
values_to = "amount")
second_highest_age_group <- purchasing_demographics_long %>%
filter(!is.na(age)) %>%
slice_max(order_by= amount, n=1)
ggplot(purchasing_demographics_long, aes(x = age, y = amount, fill = sales_type)) +
geom_bar(stat = "identity", position = "dodge") +
scale_y_continuous(labels = scales::dollar) +
labs(title = "Total Sales and Loyalty Sales by Age",
x = "Age Group",
y = "Sales Amount") +
theme_minimal() +
geom_text(data = second_highest_age_group, aes(x = age, y = amount+1000, label = "Age group 45-54 is the second \n highest group in sales for selected products"), size = 5, vjust = 0)
Unknown
~ $120,000
and Age Group 45-54
~
$65,000
. What interesting is that sales tends to increase
with Age then decreases after the middle life mark - possibly they are
not the ones doing the shopping but their children?In our case, basket id is considered a transaction in terms of association analysis. So, we group baskets together to create dummy variables which indicates if an item is present in the basket to feed into our item set for basic association analysis.
What is the likelihood that for a given basket, the items mentioned are purchased together? What is the likelihood when they are not or what is the likelihood that either any combinations of them are purchased together?
Before we interpret our result, we will use three important metrics used in association rule mining: Support, Confidence, and Lift.
Support measures the proportion of transactions that contain a specific itemset \(A\). In this case, what is the proportion of transactions that contains Cheese or Deli Meat? The total number of transactions in this case is the interested transactions that includes selected products, not all transactions at the store - so we don’t introduce noise to the model.
Formula: \[ \text{Support}(A) = \frac{\text{Number of transactions containing } A}{\text{Total number of transactions}} \]
Confidence measures how often items in \(B\) appear in transactions that contain \(A\). In this case, how often Deli Meat appear in transactions that contains Cheese?
Formula: \[ \text{Confidence}(A \Rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)} \]
Lift measures the increase in the likelihood of purchasing \(B\) when \(A\) is purchased, compared to the likelihood of purchasing \(B\) independently. Determine the effectiveness of a rule. A lift greater than 1 indicates a positive correlation, suggesting that the presence of \(A\) increases the likelihood of purchasing \(B\). In other words, Cheese might increase the likelihood of purchasing Meat or otherwise.
Formula: \[ \text{Lift}(A \Rightarrow B) = \frac{\text{Confidence}(A \Rightarrow B)}{\text{Support}(B)} \]
interested_basket <- interested_transactions %>%
group_by(basket_id) %>%
summarize(product_list = str_c(unique(product_type), collapse = ";"))
#dummy variable creation
model_basket <- interested_basket %>%
mutate(
cured_meats = if_else(str_detect(
product_list, regex(paste(c("meat", unlist(charcuterie_board$Cured_Meats)), collapse = "|"), ignore_case = TRUE)), TRUE, FALSE),
cheeses = if_else(str_detect(
product_list, regex(paste(unlist(charcuterie_board$Cheeses), collapse = "|"), ignore_case = TRUE)), TRUE, FALSE),
bread_and_crackers = if_else(str_detect(
product_list, regex(paste(unlist(charcuterie_board$Bread_and_Crackers), collapse = "|"), ignore_case = TRUE)), TRUE, FALSE),
fruits = if_else(str_detect(
product_list, regex(paste(unlist(charcuterie_board$Fruits), collapse = "|"), ignore_case = TRUE)), TRUE, FALSE),
nuts = if_else(str_detect(
product_list, regex(paste(unlist(charcuterie_board$Nuts), collapse = "|"), ignore_case = TRUE)), TRUE, FALSE),
olives = if_else(str_detect(
product_list, regex(paste(unlist(charcuterie_board$Olives), collapse = "|"), ignore_case = TRUE)), TRUE, FALSE),
spreads = if_else(str_detect(
product_list, regex(paste(unlist(charcuterie_board$Spreads), collapse = "|"), ignore_case = TRUE)), TRUE, FALSE)
) %>%
select(basket_id, cured_meats, cheeses, bread_and_crackers, fruits, nuts, olives, spreads) %>%
as.data.frame() %>%
`rownames<-`(.$basket_id) %>%
select(-basket_id)
We will use the Apriori algorithm from arules
library
which is used in data mining for discovering frequent itemsets and
generating association rules. It is particularly well-known for its
application in market basket analysis, where it helps identify
relationships between items purchased together.
basket_trans <- arules::transactions(model_basket)
itemFrequencyPlot(basket_trans,topN = 5)
rules_table <- head(its, n = 10)
kable(as(rules_table, "data.frame"), format = "markdown")
rules | support | confidence | coverage | lift | count | |
---|---|---|---|---|---|---|
2 | {} => {cheeses} | 0.6334796 | 0.6334796 | 1.0000000 | 1.000000 | 32348 |
1 | {} => {bread_and_crackers} | 0.6264100 | 0.6264100 | 1.0000000 | 1.000000 | 31987 |
3 | {cured_meats} => {cheeses} | 0.1302287 | 0.6957522 | 0.1871769 | 1.098303 | 6650 |
8 | {cured_meats,bread_and_crackers} => {cheeses} | 0.0853635 | 0.7728723 | 0.1104496 | 1.220043 | 4359 |
9 | {cured_meats,cheeses} => {bread_and_crackers} | 0.0853635 | 0.6554887 | 0.1302287 | 1.046421 | 4359 |
6 | {bread_and_crackers,fruits} => {cheeses} | 0.0374432 | 0.6697023 | 0.0559102 | 1.057181 | 1912 |
7 | {cheeses,fruits} => {bread_and_crackers} | 0.0374432 | 0.6697023 | 0.0559102 | 1.069112 | 1912 |
5 | {cured_meats,fruits} => {cheeses} | 0.0164499 | 0.7872540 | 0.0208953 | 1.242746 | 840 |
4 | {cured_meats,fruits} => {bread_and_crackers} | 0.0147266 | 0.7047798 | 0.0208953 | 1.125109 | 752 |
10 | {cured_meats,bread_and_crackers,fruits} => {cheeses} | 0.0123375 | 0.8377660 | 0.0147266 | 1.322483 | 630 |
High Demand for Cheeses and Bread & Crackers
Rule 1 & 2 : {} => {cheeses} & {} => {bread_and_crackers}
Support: both have very similar support which is ~ 63% of all transactions
Confidence: both also have very similar confidence ~ 63%, meaning 63% of transactions include cheeses / bread and crackers, regardless of other items.
=> This rule suggests cheeses & bread and crackers are a popular choice overall.
Rule 3 & 9: {cured_meats} => {cheeses} & {cured_meats,cheeses} => {bread_and_crackers}
Rule 4 & 5: {cured_meats,fruits} => {bread_and_crackers} & {cured_meats,fruits} => {cheeses}
What are the trends of charcuterie board items purchases? Do they tend to peak around at special holidays?
Is there a certain demographic group is generating large amounts of revenue for these products and we could invest to capture market share in this area? * The bar graph comparison suggests that older individuals may not be the primary shoppers; instead, their children might be making these purchases on their behalf. This insight could guide targeted marketing strategies to better capture market share in these demographics.
What types of products are usually bought together for a common charcuterie board? How can we design effective promotions that encourage customers to buy more products together as listed in our common charcuterie board introduction?
Based on the rule tables, the commonly Paired Products are typically bought together and have a strong connection indicated by rule table are:
Effective Promotion Strategies: Curated Charcuterie Kits: Create bundles that include all essential items: cured meats, cheeses, assorted bread & crackers, and fruits. Offer these kits at a slight discount or cashback or double reward points compared to buying items individually to incentivize purchase. This can encourage bulk purchases and reward repeat customers.
For the target group of marketing is 45-54, hosting in-store tasting events where customers can sample combinations of cured meats, cheeses, fruits, and crackers. This can help them discover new pairings and encourage them to purchase the featured items from gourmet charcuterie board brands and store made board
The analysis reveals distinct purchasing trends for charcuterie board items, highlighting peak sales around significant holidays such as New Year’s, Thanksgiving, and Christmas. These seasonal spikes indicate a strong consumer interest in charcuterie boards during festive occasions, suggesting that targeted promotional strategies should be developed to capitalize on this demand.
To optimize sales throughout the year, brands can create seasonal or holiday-themed charcuterie board/box that feature carefully curated selections of items like Cured Meats, Cheeses, Bread, Crackers, Fruits (indicated in Frequent Set Items). An area for growth could be to add nuts, spreads and olives to the charcuterie board items. There was no purchase or link of these items to our frequent set. This approach not only caters to customers looking for convenient options for gatherings but also enhances the shopping experience by simplifying decision-making during busy times and increase sales by bundling products.
Additionally, the insights regarding demographics suggest that marketing efforts should focus on middle-aged consumers, who may be purchasing on behalf of older family members. Tailored promotions, such as personalized paper coupons or cashback offers, could further engage other groups, encouraging them to buy charcuterie items for family gatherings and events.
There is only one year of transaction data, which might be bias because different year can have a different purchasing trends such as during COVID-19. We could use the same method of analysis to look at other complementary items such as BBQ and explore further on brand vs store products to drive sales.