Have you ever walked into a bakery planning to buy just a croissant but ended up leaving with a selection of pastries? That’s not just impulse buying — it’s also a pattern in customer behavior. This project explores which bakery items are frequently purchased together using market basket analysis. By analyzing sales data, we can uncover patterns that help optimize product placement, bundling strategies, and promotions to increase sales.
library(tidyverse)
library(arules)
library(arulesViz)
For this project I used a French bakery sales dataset obtained from Kaggle.
This dataset consists of transaction details, where each ticket number represents a single purchase, and each article represents a product bought within that purchase.
To begin exploring the dataset, we need to load the data and check its structure.
bakery_data <- read.csv("bakery_sales.csv")
head(bakery_data)
## X date time ticket_number article Quantity unit_price
## 1 0 2021-01-02 08:38 150040 BAGUETTE 1 0.90
## 2 1 2021-01-02 08:38 150040 PAIN AU CHOCOLAT 3 1.20
## 3 4 2021-01-02 09:14 150041 PAIN AU CHOCOLAT 2 1.20
## 4 5 2021-01-02 09:14 150041 PAIN 1 1.15
## 5 8 2021-01-02 09:25 150042 TRADITIONAL BAGUETTE 5 1.20
## 6 11 2021-01-02 09:25 150043 BAGUETTE 2 0.90
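We see that our dataset consists of the following columns: X (which looks like a row index carried over from the source file), date, time, ticket_number, article, Quantity, and unit_price.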
Now we can visualize the frequency of different bakery items to understand which products are sold the most.
ggplot(bakery_data, aes(x = article)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Frequency of Articles in Bakery Sales", x = "Article", y = "Count")
Since the plot above includes all products, the visualization is cluttered and difficult to read. To improve readability, I will focus on the top 15 most frequently sold items and plot only those.
top_articles <- bakery_data %>%
count(article, sort = TRUE) %>%
top_n(15, n)
ggplot(top_articles, aes(x = reorder(article, n), y = n)) +
geom_bar(stat = "identity", fill = "#3fb09d") +
coord_flip() +
labs(title = "Top 15 Most Sold Articles", x = "Article", y = "Count")
This plot gives us an overview of the most popular items in the bakery, measured by the number of rows (ticket lines) on which each article appears. This will be important when interpreting association rules later.
Now that we have a better overview of what is in this dataset, the next step is to see which articles are commonly purchased together.
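The counts plotted above come from count(article), which tallies rows and ignores the Quantity column. As a rough sketch of how that differs from units sold, the six rows from the head() output can be used as toy data in base R. The names toy, line_counts and units_sold are introduced here for illustration only.

```r
# Toy rows copied from the head() output above (not the full dataset)
toy <- data.frame(
  article  = c("BAGUETTE", "PAIN AU CHOCOLAT", "PAIN AU CHOCOLAT",
               "PAIN", "TRADITIONAL BAGUETTE", "BAGUETTE"),
  Quantity = c(1, 3, 2, 1, 5, 2)
)

# Rows per article: the measure used in the bar charts above
line_counts <- table(toy$article)

# Units per article: the sum of Quantity over those rows
units_sold <- tapply(toy$Quantity, toy$article, sum)

line_counts
units_sold
```

In this toy sample BAGUETTE and PAIN AU CHOCOLAT tie on rows (2 each), while PAIN AU CHOCOLAT has more units (5 versus 3). For basket analysis the row-based view is the relevant one, since a transactions object only records whether an item is present in a ticket.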
First we need to group the transactions by ticket number, converting individual items into transaction baskets. I then saved the transformed data as “transactions.csv”, which I will use for association rule mining in the next step.
transactions <- bakery_data %>% group_by(ticket_number) %>% summarise(items = paste(article, collapse = ",")) %>% select(items)
write.table(transactions, "transactions.csv", row.names = FALSE, col.names = FALSE, sep = ",", quote = FALSE)
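transactions_clear <- readLines("transactions.csv") # quick check of the written file, one basket per line; read.csv() would treat the first basket as a header row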
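The same grouping can also be sketched without the intermediate CSV file, by splitting the articles by ticket number and coercing the resulting list to a transactions object. This is an alternative sketch, not the approach used in this project. It assumes the arules package is installed and runs on the six rows from the head() output rather than the full dataset. The names toy, baskets and toy_transactions are hypothetical.

```r
library(arules)

# Toy rows copied from the head() output above (not the full dataset)
toy <- data.frame(
  ticket_number = c(150040, 150040, 150041, 150041, 150042, 150043),
  article = c("BAGUETTE", "PAIN AU CHOCOLAT", "PAIN AU CHOCOLAT",
              "PAIN", "TRADITIONAL BAGUETTE", "BAGUETTE"),
  stringsAsFactors = FALSE
)

# One character vector of articles per ticket; unique() drops repeated articles
baskets <- lapply(split(toy$article, toy$ticket_number), unique)

# Coerce the list directly to an arules transactions object
toy_transactions <- as(baskets, "transactions")

# Number of distinct items in each basket
basket_sizes <- size(toy_transactions)
inspect(toy_transactions)
```

Skipping the text file means article names are never re-split on a delimiter, which may matter if any name contains a comma.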
With the transactions written one basket per line, the data is in the format that arules expects. We can now read it back as a transactions object and apply association rule mining to identify frequent item combinations.
transactions_final <- read.transactions("transactions.csv", format = "basket", sep = ",", header = FALSE)
summary(transactions_final)
## transactions as itemMatrix in sparse format with
## 136451 rows (elements/itemsets/transactions) and
## 150 columns (items) and a density of 0.01134224
##
## most frequent items:
## TRADITIONAL BAGUETTE COUPE BAGUETTE
## 67535 19424 15273
## BANETTE CROISSANT (Other)
## 15107 11446 103364
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12
## 77862 33090 17533 5488 1724 500 155 60 25 9 4 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.701 2.000 12.000
##
## includes extended item information - examples:
## labels
## 1 .
## 2 00
## 3 12 MACARON
inspect(transactions_final[1:5])
## items
## [1] {BAGUETTE, PAIN AU CHOCOLAT}
## [2] {PAIN, PAIN AU CHOCOLAT}
## [3] {TRADITIONAL BAGUETTE}
## [4] {BAGUETTE, CROISSANT}
## [5] {BANETTE}
The plot below shows the distribution of transaction sizes, that is, how many distinct items each transaction contained. The majority of transactions consisted of only 1 item, but a large number still contained 2 or 3 items, and it is these multi-item transactions that the analysis relies on.
transaction_sizes <- size(transactions_final)
ggplot(data.frame(transaction_sizes), aes(x = transaction_sizes)) +
geom_bar(fill = "#622766") +
labs(title = "Distribution of Transaction Sizes", x = "Number of Items per Transaction", y = "Count") +
theme_minimal()
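The figures in the summary() output can be cross-checked with a few lines of base R arithmetic. The sketch below copies the size distribution from that output and recomputes the mean basket size, the share of single-item tickets, and the matrix density. The variable names are introduced here for illustration only.

```r
# Counts per basket size, copied from the summary() output above
sizes  <- 1:12
counts <- c(77862, 33090, 17533, 5488, 1724, 500, 155, 60, 25, 9, 4, 1)

n_transactions <- sum(counts)                        # total number of tickets
mean_size      <- sum(sizes * counts) / n_transactions
single_share   <- counts[1] / n_transactions         # tickets with one item
multi_share    <- 1 - single_share                   # tickets with two or more
density        <- mean_size / 150                    # 150 item columns in the matrix

round(c(mean = mean_size, single = single_share, multi = multi_share), 3)
```

Roughly 57% of tickets hold a single item, so only the remaining 43% or so can contribute to rules with a non-empty left-hand side.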
Now that we have a structured transaction dataset and an understanding of customer purchase patterns, we can proceed with generating association rules using the Apriori algorithm.
The support threshold (0.001) means we only consider itemsets that appear in at least 0.1% of transactions, which is about 136 of the 136,451 tickets. The confidence threshold (0.5) keeps only rules where at least 50% of the transactions containing the left-hand-side items also contain the right-hand-side item.
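For reference, the measures reported for a rule X ⇒ Y can be written as below, where N is the total number of transactions and count(·) is the number of transactions containing every item in the set. These are the standard textbook definitions, given here for convenience. The notation is chosen for this post and is not taken from the arules output.

```latex
\begin{aligned}
\operatorname{supp}(X \Rightarrow Y) &= \frac{\operatorname{count}(X \cup Y)}{N} \\
\operatorname{coverage}(X \Rightarrow Y) &= \operatorname{supp}(X) = \frac{\operatorname{count}(X)}{N} \\
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \Rightarrow Y)}{\operatorname{supp}(X)} \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
\end{aligned}
```

A lift of 1 would mean X and Y occur together about as often as expected if they were bought independently. Values above 1 mean they co-occur more often than that.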
rules <- apriori(transactions_final, parameter = list(supp = 0.001, conf = 0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 136
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[150 item(s), 136451 transaction(s)] done [0.02s].
## sorting and recoding items ... [73 item(s)] done [0.00s].
## creating transaction tree ... done [0.03s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [58 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
inspect(head(sort(rules, by = "lift"), 10))
## lhs rhs support confidence coverage lift count
## [1] {CROISSANT,
## PAIN AUX RAISINS,
## TRADITIONAL BAGUETTE} => {PAIN AU CHOCOLAT} 0.001421756 0.6830986 0.002081333 8.852644 194
## [2] {BAGUETTE,
## CROISSANT,
## TRADITIONAL BAGUETTE} => {PAIN AU CHOCOLAT} 0.001179911 0.6338583 0.001861474 8.214512 161
## [3] {CHAUSSON AUX POMMES,
## CROISSANT} => {PAIN AU CHOCOLAT} 0.001678258 0.6273973 0.002674953 8.130780 229
## [4] {BAGUETTE,
## PAIN AU CHOCOLAT,
## TRADITIONAL BAGUETTE} => {CROISSANT} 0.001179911 0.6652893 0.001773530 7.931101 161
## [5] {CEREAL BAGUETTE,
## PAIN AU CHOCOLAT} => {CROISSANT} 0.002125305 0.6636156 0.003202615 7.911149 290
## [6] {CROISSANT,
## CROISSANT AMANDES} => {PAIN AU CHOCOLAT} 0.001304498 0.5836066 0.002235235 7.563273 178
## [7] {CEREAL BAGUETTE,
## CROISSANT} => {PAIN AU CHOCOLAT} 0.002125305 0.5835010 0.003642333 7.561905 290
## [8] {CROISSANT,
## PAIN AUX RAISINS} => {PAIN AU CHOCOLAT} 0.003195286 0.5729304 0.005577094 7.424914 436
## [9] {FICELLE,
## PAIN AU CHOCOLAT} => {CROISSANT} 0.001363127 0.6200000 0.002198591 7.391195 186
## [10] {COUPE,
## PAIN AU CHOCOLAT,
## TRADITIONAL BAGUETTE} => {CROISSANT} 0.001297169 0.6082474 0.002132634 7.251089 177
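To make the columns of this table concrete, the values for rule [1] can be recomputed by hand from its count and coverage. The sketch below is plain arithmetic in base R on numbers copied from the outputs above, and the variable names are introduced here for illustration. It also checks that the CROISSANT support implied by rule [4] agrees with the item count of 11446 in the summary() output.

```r
n_transactions <- 136451       # from summary(transactions_final)

# Rule [1]: {CROISSANT, PAIN AUX RAISINS, TRADITIONAL BAGUETTE} => {PAIN AU CHOCOLAT}
rule_count    <- 194           # tickets containing both sides of the rule
rule_coverage <- 0.002081333   # share of tickets containing the left-hand side
rule_lift     <- 8.852644

rule_support    <- rule_count / n_transactions    # matches the support column
rule_confidence <- rule_support / rule_coverage   # matches the confidence column
rhs_support     <- rule_confidence / rule_lift    # implied share of tickets with PAIN AU CHOCOLAT

# Cross-check on rule [4], whose right-hand side is CROISSANT
croissant_from_rule  <- 0.6652893 / 7.931101      # confidence / lift
croissant_from_count <- 11446 / n_transactions    # item count from summary()

round(c(support = rule_support, confidence = rule_confidence,
        rhs_support = rhs_support), 4)
```

The implied support of PAIN AU CHOCOLAT is about 0.077, so a confidence of about 0.68 is nearly nine times the baseline rate, which is what the lift of 8.85 expresses.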
To better understand the relationships between items, we can visualize the rules.
plot(rules, method = "graph", shading = "lift") # cex is not a control parameter of the graph method, so it is omitted
The graph above visualizes associations between bakery products, with node size representing support (item frequency) and color intensity indicating lift (strength of association).
Key Insights:
Baguette, Traditional Baguette, Croissant, and Special Bread are highly connected, suggesting that they appear in many transactions and drive sales.
Two pairs stand out as often purchased together: Croissant with Pain au Chocolat, and Baguette with Traditional Baguette.
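I decided to create a subset of the rules by applying a lift threshold (lift > 5) together with a stricter confidence threshold (confidence > 0.6), to focus on the strongest and most actionable rules while eliminating weaker relationships.
Lift compares how often the items are actually bought together with how often that would happen if they were purchased independently, so a lift above 5 means the combination occurs more than five times as often as expected by chance. The confidence condition then keeps only rules whose right-hand item appears in more than 60% of the transactions containing the left-hand items.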
strong_rules <- subset(rules, lift > 5 & confidence > 0.6)
inspect(head(sort(strong_rules, by = "lift"), 10))
## lhs rhs support confidence coverage lift count
## [1] {CROISSANT,
## PAIN AUX RAISINS,
## TRADITIONAL BAGUETTE} => {PAIN AU CHOCOLAT} 0.001421756 0.6830986 0.002081333 8.852644 194
## [2] {BAGUETTE,
## CROISSANT,
## TRADITIONAL BAGUETTE} => {PAIN AU CHOCOLAT} 0.001179911 0.6338583 0.001861474 8.214512 161
## [3] {CHAUSSON AUX POMMES,
## CROISSANT} => {PAIN AU CHOCOLAT} 0.001678258 0.6273973 0.002674953 8.130780 229
## [4] {BAGUETTE,
## PAIN AU CHOCOLAT,
## TRADITIONAL BAGUETTE} => {CROISSANT} 0.001179911 0.6652893 0.001773530 7.931101 161
## [5] {CEREAL BAGUETTE,
## PAIN AU CHOCOLAT} => {CROISSANT} 0.002125305 0.6636156 0.003202615 7.911149 290
## [6] {FICELLE,
## PAIN AU CHOCOLAT} => {CROISSANT} 0.001363127 0.6200000 0.002198591 7.391195 186
## [7] {COUPE,
## PAIN AU CHOCOLAT,
## TRADITIONAL BAGUETTE} => {CROISSANT} 0.001297169 0.6082474 0.002132634 7.251089 177
## [8] {BAGUETTE,
## CAMPAGNE} => {COUPE} 0.002169277 0.9456869 0.002293864 6.643324 296
## [9] {BOULE 200G,
## CROISSANT} => {COUPE} 0.001685587 0.9387755 0.001795516 6.594772 230
## [10] {BAGUETTE,
## BOULE 200G} => {COUPE} 0.001810174 0.9148148 0.001978732 6.426452 247
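To see what the added confidence condition does, the same filter can be applied by hand to the ten rules from the first table. This is only an illustrative sketch in base R on values copied from that table. The data frame top_rules and the names kept and dropped are hypothetical, not objects from the analysis.

```r
# Confidence and lift of the ten rules in the first table, in the same order
top_rules <- data.frame(
  rule       = 1:10,
  confidence = c(0.6830986, 0.6338583, 0.6273973, 0.6652893, 0.6636156,
                 0.5836066, 0.5835010, 0.5729304, 0.6200000, 0.6082474),
  lift       = c(8.852644, 8.214512, 8.130780, 7.931101, 7.911149,
                 7.563273, 7.561905, 7.424914, 7.391195, 7.251089)
)

# Same condition as the subset() call above, applied to the plain data frame
kept    <- top_rules[top_rules$lift > 5 & top_rules$confidence > 0.6, ]
dropped <- setdiff(top_rules$rule, kept$rule)

kept$rule
dropped
```

Rules 6 to 8 of the first table have lift above 7 but confidence below 0.6, which is why they are missing from the filtered list and why rules with COUPE on the right-hand side move up into the top ten.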
The graphs below show us more meaningful connections than we saw before, since weak associations have been removed.
plot(strong_rules, method = "graph", shading = "lift") # cex is not a control parameter of the graph method, so it is omitted
We can see that items like baguette, pain au chocolat and croissant are positioned more centrally, suggesting they play a key role in itemsets.
plot(strong_rules, method = "graph", engine = "htmlwidget")
plot(strong_rules, method = "scatterplot", measure = c("support", "lift"), shading = "confidence")
This market basket analysis of bakery sales data revealed strong associations between key products, highlighting customer purchasing patterns. The most frequently occurring itemsets, such as Croissant and Pain au Chocolat, as well as Baguette and Traditional Baguette, demonstrate clear customer preferences for common bakery combinations.
These insights could be used for strategic decision-making. The bakery could implement bundling strategies by offering discounts or promotions on frequently paired items, encouraging customers to purchase more. It could also optimize shelf placement by placing highly associated products close to one another, which would make the shopping experience easier.