Introduction

Have you ever walked into a bakery planning to buy just a croissant but ended up leaving with a selection of pastries? That’s not just impulse buying; it’s also a pattern in customer behavior. This project explores which bakery items are frequently purchased together using market basket analysis. By analyzing sales data, we can uncover patterns that help optimize product placement, bundling strategies, and promotions to increase sales.

library(tidyverse)
library(arules)
library(arulesViz)

Data

For this project I used a French bakery sales dataset obtained from Kaggle.

This dataset consists of transaction details, where each ticket number represents a single purchase, and each article represents a product bought within that purchase.

To begin exploring the dataset, we need to load the data and check its structure.

bakery_data <- read.csv("bakery_sales.csv")

head(bakery_data)
##    X       date  time ticket_number              article Quantity unit_price
## 1  0 2021-01-02 08:38        150040             BAGUETTE        1       0.90
## 2  1 2021-01-02 08:38        150040     PAIN AU CHOCOLAT        3       1.20
## 3  4 2021-01-02 09:14        150041     PAIN AU CHOCOLAT        2       1.20
## 4  5 2021-01-02 09:14        150041                 PAIN        1       1.15
## 5  8 2021-01-02 09:25        150042 TRADITIONAL BAGUETTE        5       1.20
## 6 11 2021-01-02 09:25        150043             BAGUETTE        2       0.90

The dataset consists of the following columns: X, date, time, ticket_number, article, Quantity, and unit_price.

Now we can visualize the frequency of different bakery items to understand which products are sold the most.

ggplot(bakery_data, aes(x = article)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Frequency of Articles in Bakery Sales", x = "Article", y = "Count")

Since the plot above includes all products, the visualization is cluttered and difficult to read. To improve readability, I will focus on the top 15 most frequently sold items and plot only those.

top_articles <- bakery_data %>%
  count(article, sort = TRUE) %>%
  top_n(15, n)

ggplot(top_articles, aes(x = reorder(article, n), y = n)) +
  geom_bar(stat = "identity", fill = "#3fb09d") +
  coord_flip() +
  labs(title = "Top 15 Most Sold Articles", x = "Article", y = "Count")

This plot gives us an overview of the most popular items in the bakery, which will be important when interpreting association rules later.

Grouping Transactions

Now that we have a better overview of what is in this dataset, the next step is to see which articles are commonly purchased together.

First we need to group the transactions by ticket number, converting individual items into transaction baskets. I then saved the transformed data as “transactions.csv”, which I will use for association rule mining in the next step.

transactions <- bakery_data %>%
  group_by(ticket_number) %>%
  summarise(items = paste(article, collapse = ",")) %>%
  select(items)

write.table(transactions, "transactions.csv", row.names = FALSE, col.names = FALSE, sep = ",", quote = FALSE)
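# The file was written without a header row, so do not treat the first basket as column names
transactions_clear <- read.csv("transactions.csv", header = FALSE)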

This dataset is now ready for market basket analysis, allowing us to uncover patterns in customer purchases and generate useful association rules.
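
The CSV round trip above is one way to build the baskets. As a possible alternative, arules can coerce a list of item vectors directly to a transactions object. The sketch below assumes a small made-up data frame (`toy_sales` and its values are illustrative, not taken from the Kaggle file) with the same two columns, `ticket_number` and `article`.

```r
library(arules)

# Hypothetical miniature version of the sales table (illustrative values only)
toy_sales <- data.frame(
  ticket_number = c(1, 1, 2, 3, 3, 3),
  article = c("BAGUETTE", "CROISSANT", "BAGUETTE", "COUPE", "BAGUETTE", "COUPE"),
  stringsAsFactors = FALSE
)

# One character vector per ticket; unique() drops repeated articles within a ticket
baskets <- lapply(split(toy_sales$article, toy_sales$ticket_number), unique)

# Coerce the list of baskets to an arules transactions object
toy_transactions <- as(baskets, "transactions")

inspect(toy_transactions)
basket_sizes <- size(toy_transactions)
```

Because no separator character is involved, this route would also avoid ambiguity if an article name happened to contain a comma.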

Creating Association Rules

Now that we have our transactions in the correct format, we can apply association rule mining to identify frequent item combinations.

transactions_final <- read.transactions("transactions.csv", format = "basket", sep = ",", header = FALSE)
summary(transactions_final)
## transactions as itemMatrix in sparse format with
##  136451 rows (elements/itemsets/transactions) and
##  150 columns (items) and a density of 0.01134224 
## 
## most frequent items:
## TRADITIONAL BAGUETTE                COUPE             BAGUETTE 
##                67535                19424                15273 
##              BANETTE            CROISSANT              (Other) 
##                15107                11446               103364 
## 
## element (itemset/transaction) length distribution:
## sizes
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 77862 33090 17533  5488  1724   500   155    60    25     9     4     1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.701   2.000  12.000 
## 
## includes extended item information - examples:
##       labels
## 1          .
## 2         00
## 3 12 MACARON
inspect(transactions_final[1:5])
##     items                       
## [1] {BAGUETTE, PAIN AU CHOCOLAT}
## [2] {PAIN, PAIN AU CHOCOLAT}    
## [3] {TRADITIONAL BAGUETTE}      
## [4] {BAGUETTE, CROISSANT}       
## [5] {BANETTE}

The plot below shows the distribution of transaction sizes, that is, how many different items each transaction contains. The majority of transactions (77,862 of 136,451, about 57%) consist of a single item and therefore carry no co-occurrence information, but a substantial number contain 2 or 3 items (33,090 and 17,533, respectively), and these multi-item baskets are what the analysis relies on.

transaction_sizes <- size(transactions_final)
ggplot(data.frame(transaction_sizes), aes(x = transaction_sizes)) +
  geom_bar(fill = "#622766") +
  labs(title = "Distribution of Transaction Sizes", x = "Number of Items per Transaction", y = "Count") +
  theme_minimal()
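
The size distribution printed by `summary(transactions_final)` can also be turned into shares with a few lines of base R. This is only a sketch: the counts are copied by hand from the summary output, and the variable names (`size_counts`, `share_pct`, `mean_size`) are illustrative.

```r
# Transaction counts per basket size, copied from the summary() output
size_counts <- c(`1` = 77862, `2` = 33090, `3` = 17533, `4` = 5488,
                 `5` = 1724, `6` = 500, `7` = 155, `8` = 60,
                 `9` = 25, `10` = 9, `11` = 4, `12` = 1)

n_transactions <- sum(size_counts)

# Share of transactions for each basket size, in percent
share_pct <- round(100 * size_counts / n_transactions, 1)

# Share of transactions with at least two items (the ones that can form rules)
multi_item_pct <- round(100 * (1 - size_counts[["1"]] / n_transactions), 1)

# Mean basket size, which should agree with the Mean reported by summary()
mean_size <- sum(as.integer(names(size_counts)) * size_counts) / n_transactions

print(share_pct[1:3])
print(multi_item_pct)
print(round(mean_size, 3))
```

The total recovers the 136,451 transactions, and the mean agrees with the 1.701 items per transaction reported in the summary.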

Now that we have a structured transaction dataset and an understanding of customer purchase patterns, we can proceed with generating association rules using the Apriori algorithm.

The support threshold (0.001) restricts the search to itemsets that appear in at least 0.1% of transactions (about 136 of the 136,451 transactions), while the confidence threshold (0.5) keeps only rules whose right-hand side appears in at least 50% of the transactions that contain the left-hand side.
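
To make these measures concrete before running Apriori, here is a minimal base R sketch on five made-up baskets. The baskets and the helper names (`contains_all`, `support_of`) are assumptions for illustration only. It computes support, confidence, and lift for the rule {CROISSANT} => {PAIN AU CHOCOLAT} using the standard definitions.

```r
# Five hypothetical baskets (illustrative only, not from the dataset)
baskets <- list(
  c("CROISSANT", "PAIN AU CHOCOLAT"),
  c("CROISSANT", "PAIN AU CHOCOLAT", "BAGUETTE"),
  c("BAGUETTE"),
  c("CROISSANT"),
  c("BAGUETTE", "PAIN AU CHOCOLAT")
)

# TRUE for each basket that contains every item in `items`
contains_all <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

# Support: fraction of baskets that contain all of `items`
support_of <- function(items) mean(contains_all(items))

lhs <- "CROISSANT"
rhs <- "PAIN AU CHOCOLAT"

rule_support    <- support_of(c(lhs, rhs))         # 2 of 5 baskets
rule_confidence <- rule_support / support_of(lhs)  # 2 of the 3 croissant baskets
rule_lift       <- rule_confidence / support_of(rhs)

print(c(support = rule_support,
        confidence = rule_confidence,
        lift = rule_lift))
```

With thresholds of 0.001 and 0.5, this toy rule would pass both filters; a lift slightly above 1 indicates only a weak positive association.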

rules <- apriori(transactions_final, parameter = list(supp = 0.001, conf = 0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 136 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[150 item(s), 136451 transaction(s)] done [0.02s].
## sorting and recoding items ... [73 item(s)] done [0.00s].
## creating transaction tree ... done [0.03s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [58 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
inspect(head(sort(rules, by = "lift"), 10))
##      lhs                       rhs                    support confidence    coverage     lift count
## [1]  {CROISSANT,                                                                                   
##       PAIN AUX RAISINS,                                                                            
##       TRADITIONAL BAGUETTE} => {PAIN AU CHOCOLAT} 0.001421756  0.6830986 0.002081333 8.852644   194
## [2]  {BAGUETTE,                                                                                    
##       CROISSANT,                                                                                   
##       TRADITIONAL BAGUETTE} => {PAIN AU CHOCOLAT} 0.001179911  0.6338583 0.001861474 8.214512   161
## [3]  {CHAUSSON AUX POMMES,                                                                         
##       CROISSANT}            => {PAIN AU CHOCOLAT} 0.001678258  0.6273973 0.002674953 8.130780   229
## [4]  {BAGUETTE,                                                                                    
##       PAIN AU CHOCOLAT,                                                                            
##       TRADITIONAL BAGUETTE} => {CROISSANT}        0.001179911  0.6652893 0.001773530 7.931101   161
## [5]  {CEREAL BAGUETTE,                                                                             
##       PAIN AU CHOCOLAT}     => {CROISSANT}        0.002125305  0.6636156 0.003202615 7.911149   290
## [6]  {CROISSANT,                                                                                   
##       CROISSANT AMANDES}    => {PAIN AU CHOCOLAT} 0.001304498  0.5836066 0.002235235 7.563273   178
## [7]  {CEREAL BAGUETTE,                                                                             
##       CROISSANT}            => {PAIN AU CHOCOLAT} 0.002125305  0.5835010 0.003642333 7.561905   290
## [8]  {CROISSANT,                                                                                   
##       PAIN AUX RAISINS}     => {PAIN AU CHOCOLAT} 0.003195286  0.5729304 0.005577094 7.424914   436
## [9]  {FICELLE,                                                                                     
##       PAIN AU CHOCOLAT}     => {CROISSANT}        0.001363127  0.6200000 0.002198591 7.391195   186
## [10] {COUPE,                                                                                       
##       PAIN AU CHOCOLAT,                                                                            
##       TRADITIONAL BAGUETTE} => {CROISSANT}        0.001297169  0.6082474 0.002132634 7.251089   177
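
As a sanity check, the columns of this table can be tied together with simple arithmetic. The sketch below uses the figures printed for rule [8], {CROISSANT, PAIN AUX RAISINS} => {PAIN AU CHOCOLAT}, and assumes the standard definitions (support = count / number of transactions, confidence = support / coverage, lift = confidence / support of the right-hand side). Variable names are illustrative.

```r
n_transactions <- 136451    # from summary(transactions_final)
rule_count     <- 436       # count column, rule [8]
rule_coverage  <- 0.005577094  # coverage column (support of the left-hand side)
rule_lift      <- 7.424914     # lift column

# Support of the whole rule: baskets containing all three items
rule_support <- rule_count / n_transactions

# Confidence: share of left-hand-side baskets that also contain the right-hand side
rule_confidence <- rule_support / rule_coverage

# Implied overall support of PAIN AU CHOCOLAT, obtained by rearranging the lift formula
rhs_support <- rule_confidence / rule_lift

print(round(c(support = rule_support,
              confidence = rule_confidence,
              rhs_support = rhs_support), 6))
```

The recomputed support and confidence agree with the printed columns, and the implied value suggests PAIN AU CHOCOLAT appears in roughly 7.7% of all transactions, which is why a confidence near 0.57 translates into a lift above 7.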

To better understand the relationships between items, we can visualize the rules.

plot(rules, method="graph", cex=0.7, shading="lift")
## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

The graph above visualizes associations between bakery products, with node size representing the support of each rule (how often its items occur together) and color intensity indicating lift (strength of association).

Key Insights:

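I decided to create a subset of the rules by applying a lift threshold (lift > 5) together with a slightly stricter confidence threshold (confidence > 0.6), to focus on the most meaningful and actionable rules while eliminating weaker relationships. Higher lift indicates a stronger relationship.

A lift above 5 means that the right-hand-side item appears more than five times as often in baskets containing the left-hand-side items as it does across all transactions, while a confidence above 0.6 means it appears in more than 60% of those baskets.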

strong_rules <- subset(rules, lift > 5 & confidence > 0.6)
inspect(head(sort(strong_rules, by = "lift"), 10))
##      lhs                       rhs                    support confidence    coverage     lift count
## [1]  {CROISSANT,                                                                                   
##       PAIN AUX RAISINS,                                                                            
##       TRADITIONAL BAGUETTE} => {PAIN AU CHOCOLAT} 0.001421756  0.6830986 0.002081333 8.852644   194
## [2]  {BAGUETTE,                                                                                    
##       CROISSANT,                                                                                   
##       TRADITIONAL BAGUETTE} => {PAIN AU CHOCOLAT} 0.001179911  0.6338583 0.001861474 8.214512   161
## [3]  {CHAUSSON AUX POMMES,                                                                         
##       CROISSANT}            => {PAIN AU CHOCOLAT} 0.001678258  0.6273973 0.002674953 8.130780   229
## [4]  {BAGUETTE,                                                                                    
##       PAIN AU CHOCOLAT,                                                                            
##       TRADITIONAL BAGUETTE} => {CROISSANT}        0.001179911  0.6652893 0.001773530 7.931101   161
## [5]  {CEREAL BAGUETTE,                                                                             
##       PAIN AU CHOCOLAT}     => {CROISSANT}        0.002125305  0.6636156 0.003202615 7.911149   290
## [6]  {FICELLE,                                                                                     
##       PAIN AU CHOCOLAT}     => {CROISSANT}        0.001363127  0.6200000 0.002198591 7.391195   186
## [7]  {COUPE,                                                                                       
##       PAIN AU CHOCOLAT,                                                                            
##       TRADITIONAL BAGUETTE} => {CROISSANT}        0.001297169  0.6082474 0.002132634 7.251089   177
## [8]  {BAGUETTE,                                                                                    
##       CAMPAGNE}             => {COUPE}            0.002169277  0.9456869 0.002293864 6.643324   296
## [9]  {BOULE 200G,                                                                                  
##       CROISSANT}            => {COUPE}            0.001685587  0.9387755 0.001795516 6.594772   230
## [10] {BAGUETTE,                                                                                    
##       BOULE 200G}           => {COUPE}            0.001810174  0.9148148 0.001978732 6.426452   247
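
To see which of the earlier top-10 rules the stricter filter removes, the same condition can be applied to a plain data frame. This is a sketch in base R: the confidence and lift values are copied by hand from the first `inspect()` output, and `top_rules` is an illustrative name.

```r
# Confidence and lift of the ten highest-lift rules from the unfiltered output
top_rules <- data.frame(
  rank = 1:10,
  confidence = c(0.6830986, 0.6338583, 0.6273973, 0.6652893, 0.6636156,
                 0.5836066, 0.5835010, 0.5729304, 0.6200000, 0.6082474),
  lift = c(8.852644, 8.214512, 8.130780, 7.931101, 7.911149,
           7.563273, 7.561905, 7.424914, 7.391195, 7.251089)
)

# Same condition as in the subset() call on the rules object
kept <- subset(top_rules, lift > 5 & confidence > 0.6)

# Ranks of the rules that did not survive the filter
dropped <- setdiff(top_rules$rank, kept$rank)
print(dropped)
```

All ten rules already have a lift above 5, so it is the confidence condition that removes former rules [6], [7], and [8] and makes room for the {COUPE} rules at the bottom of the new table.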

The graphs below show us more meaningful connections than we saw before, since weak associations have been removed.

plot(strong_rules, method="graph", cex=0.7, shading="lift")
## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

We can see that items like baguette, pain au chocolat and croissant are positioned more centrally, suggesting they play a key role in the itemsets.

plot(strong_rules, method = "graph", engine = "htmlwidget")
plot(strong_rules, method = "scatterplot", measure = c("support", "lift"), shading = "confidence")

Conclusion

This market basket analysis of bakery sales data revealed strong associations between key products, highlighting customer purchasing patterns. The most frequently occurring itemsets, such as Croissant and Pain au Chocolat, as well as Baguette and Traditional Baguette, demonstrate clear customer preferences for common bakery combinations.

These insights could be used for strategic decision-making. The bakery could implement bundling strategies by offering discounts or promotions on frequently paired items, encouraging customers to purchase more. The bakery could also optimize shelf placement by placing highly associated products close to one another, which would enhance the shopping experience.