1 Project description

In this project, we will conduct association rule mining using the Apriori algorithm on a market basket dataset sourced from Kaggle. Market basket analysis is a technique used to discover relationships between products based on customer transactions. By examining these associations, we can identify frequently purchased item combinations, which can help optimize product placement, enhance marketing strategies, and ultimately boost sales. The Apriori algorithm is widely used for extracting association rules, and we will apply it to uncover meaningful insights from our retail transaction dataset.

Data source: https://www.kaggle.com/datasets/ashwinbadi/market-basket-analysist

2 Load Necessary Packages

library(arules)
library(arulesViz)
library(arulesCBA)
library(tidyverse)
library(readxl)
library(RColorBrewer)
library(ggplot2)
library(plotly)
library(ggpubr)  # provides ggarrange(), used to arrange the rule graphs in section 5

3 Load Dataset

# Load data and save as CSV
file_path <- "C:/Users/nijat/Desktop/market_basket_analysis.xlsx"
csv_path <- "C:/Users/nijat/Desktop/csv_market_basket_analysis.csv"
MBA <- read_excel(file_path, sheet = "Worksheet", range = "A1:M1864")
write.csv(MBA, csv_path, row.names = FALSE)
print(head(MBA, 10))

## # A tibble: 10 × 13
##    itemset item1   item2 item3 item4 item5 item6 item7 item8 item9 item10 item11
##      <dbl> <chr>   <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <lgl>  <lgl> 
##  1       1 baking… coff… froz… butt… <NA>  <NA>  <NA>  NA    NA    NA     NA    
##  2       2 ice cr… abra… fish  coff… froz… <NA>  <NA>  NA    NA    NA     NA    
##  3       3 butter  baki… coff… ice … froz… <NA>  <NA>  NA    NA    NA     NA    
##  4       4 frozen… abra… ice … butt… coff… <NA>  <NA>  NA    NA    NA     NA    
##  5       5 baking… ice … butt… froz… <NA>  <NA>  <NA>  NA    NA    NA     NA    
##  6       6 ice cr… froz… coff… cake… froz… <NA>  <NA>  NA    NA    NA     NA    
##  7       7 honey   fish  abra… dome… <NA>  <NA>  <NA>  NA    NA    NA     NA    
##  8       8 butter  froz… fish  ice … froz… <NA>  <NA>  NA    NA    NA     NA    
##  9       9 coffee  honey fish  froz… <NA>  <NA>  <NA>  NA    NA    NA     NA    
## 10      10 honey   froz… fish  <NA>  <NA>  <NA>  <NA>  NA    NA    NA     NA    
## # ℹ 1 more variable: item12 <lgl>

Trans <- read.transactions(csv_path, format = "basket", sep = ",", header = TRUE)

## Warning in asMethod(object): removing duplicated items in transactions

print(summary(Trans))

## transactions as itemMatrix in sparse format with
##  1863 rows (elements/itemsets/transactions) and
##  1875 columns (items) and a density of 0.002464269 
## 
## most frequent items:
##  frozen meals        butter baking powder        coffee          fish 
##          1002           840           663           606           563 
##       (Other) 
##          4934 
## 
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   5   6   7   8 
##  14 210 256 360 421 391 173  38 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   5.000   4.621   6.000   8.000 
## 
## includes extended item information - examples:
##   labels
## 1      1
## 2     10
## 3    100
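
The item labels 1, 10, 100 in the summary suggest that the itemset column was parsed as ordinary items. A minimal sketch of one way to avoid this, assuming a made-up CSV in the same layout (an ID column followed by item columns): for the basket format, the cols argument of read.transactions() names the column that holds the transaction IDs. The file contents and object names below are hypothetical and are not part of the original analysis.

```r
library(arules)

# Hypothetical toy CSV: an ID column, then the purchased items
tmp <- tempfile(fileext = ".csv")
writeLines(c("itemset,item1,item2,item3",
             "1,butter,coffee,fish",
             "2,coffee,honey,fish",
             "3,butter,honey,fish"), tmp)

# Same call as in the report: the IDs end up among the item labels
with_ids <- read.transactions(tmp, format = "basket", sep = ",", header = TRUE)

# cols = 1 treats the first column as the transaction ID instead of an item
without_ids <- read.transactions(tmp, format = "basket", sep = ",",
                                 header = TRUE, cols = 1)

print(itemLabels(with_ids))     # product names plus the IDs
print(itemLabels(without_ids))  # product names only
```

Applied to the real file, this would change the item count and the basket sizes reported above, so the outputs in this report would need to be regenerated.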

4 EDA

4.1 Data Preparation

The dataset Trans consists of market basket transaction data, where each row represents a single shopping transaction made by a customer. Each transaction contains a set of purchased items. Here’s what the data includes:

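Transaction ID (e.g., 1858, 1859, 1860, etc.) – Identifies a unique shopping trip. Because the itemset column was read in together with the item columns, each ID is listed inside its basket as if it were an item.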
List of Purchased Items – Each transaction contains a set of items bought together, such as:

baking powder
butter
frozen meals
grapes
coffee
ice cream
fish, etc.

This data is used in market basket analysis to identify patterns, such as which items are frequently bought together, helping businesses improve product placement and marketing strategies.

inspect(tail(Trans, 6))

##     items               
## [1] {1858,              
##      baking powder,     
##      butter,            
##      frozen meals,      
##      frozen vegetables} 
## [2] {1859,              
##      baking powder,     
##      butter,            
##      frozen meals,      
##      frozen vegetables, 
##      grapes}            
## [3] {1860,              
##      abrasive cleaner,  
##      butter,            
##      coffee,            
##      fish,              
##      frozen vegetables, 
##      ice cream}         
## [4] {1861,              
##      coffee,            
##      fish,              
##      frozen meals,      
##      frozen vegetables, 
##      grapes}            
## [5] {1862,              
##      butter,            
##      fish,              
##      frozen meals,      
##      ice cream}         
## [6] {1863,              
##      baking powder,     
##      butter,            
##      cake bar,          
##      coffee,            
##      frozen meals,      
##      grapes}

4.2 Data Visualization

itemFrequencyPlot(Trans, topN = 10, col = brewer.pal(10, 'Set3'),
                  main = 'Top 10 Frequent Items', type = "absolute", 
                  ylab = "Frequency", xlab = "Items")

X-axis (“Items”): Shows the names of the products in the dataset (e.g., frozen meals, butter, baking powder, etc.). Each label corresponds to a specific item sold in the store.

Y-axis (“Frequency”): Shows the absolute frequency of each item, that is, the number of transactions that contain it.

For example, “frozen meals” has the highest frequency (1002 transactions), while “cake bar” has the lowest frequency among the ten items in this chart.
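
The counts behind a frequency plot can also be listed as numbers. A minimal sketch, assuming a small made-up set of baskets in place of Trans; itemFrequency() comes from arules, and the basket contents are hypothetical.

```r
library(arules)

# Hypothetical toy baskets standing in for Trans
baskets <- list(c("butter", "coffee"),
                c("coffee", "fish"),
                c("coffee", "frozen meals", "fish"))
toy <- as(baskets, "transactions")

# Number of transactions containing each item, most frequent first
counts <- sort(itemFrequency(toy, type = "absolute"), decreasing = TRUE)
print(counts)
```

Replacing toy with Trans would give the values that itemFrequencyPlot() draws as bars.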

5 Apriori algorithm

rules <- apriori(Trans, parameter = list(supp = 0.01, conf = 0.75))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.75    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 18 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1875 item(s), 1863 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [148 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

print(summary(rules))

## set of 148 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3  4  5  6 
##  1 19 75 48  5 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    4.00    4.00    4.25    5.00    6.00 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01020   Min.   :0.7500   Min.   :0.01074   Min.   :1.394  
##  1st Qu.:0.01221   1st Qu.:0.8235   1st Qu.:0.01436   1st Qu.:1.682  
##  Median :0.01771   Median :0.8904   Median :0.02120   Median :1.780  
##  Mean   :0.02908   Mean   :0.8769   Mean   :0.03316   Mean   :1.961  
##  3rd Qu.:0.02644   3rd Qu.:0.9333   3rd Qu.:0.03167   3rd Qu.:2.145  
##  Max.   :0.31186   Max.   :1.0000   Max.   :0.32528   Max.   :3.168  
##      count       
##  Min.   : 19.00  
##  1st Qu.: 22.75  
##  Median : 33.00  
##  Mean   : 54.17  
##  3rd Qu.: 49.25  
##  Max.   :581.00  
## 
## mining info:
##   data ntransactions support confidence
##  Trans          1863    0.01       0.75
##                                                               call
##  apriori(data = Trans, parameter = list(supp = 0.01, conf = 0.75))

Apriori algorithm: Extracts association rules from transactional data.

Parameters: Minimum support = 1%, minimum confidence = 75%.

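Dataset: 1863 transactions and 1875 item labels. The labels include the 1863 transaction IDs that were read in as items, so only 12 are actual products (the 12 items that remain after the support filter in the log above).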

Output: 148 rules generated for frequent item combinations.

Purpose: Identify relationships between products for better business strategies.
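
As a small worked check of the support threshold, the relative minimum support can be converted into a transaction count. The variable names are illustrative; the numbers are taken from the outputs above.

```r
n_transactions <- 1863   # from the transaction summary
min_support <- 0.01      # supp passed to apriori()

threshold <- min_support * n_transactions
print(threshold)

# The log reports the truncated value, while a rule needs at least the
# rounded-up count to reach 1% support.
print(floor(threshold))    # matches "Absolute minimum support count" in the log
print(ceiling(threshold))  # matches the minimum "count" in the rule summary
```

This is why the log shows 18 while the smallest count among the 148 rules is 19: a count of 18 corresponds to a support of about 0.0097, just under the threshold.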

inspect(head(rules, 5))

##     lhs                       rhs                 support    confidence
## [1] {coffee}               => {frozen meals}      0.31186259 0.9587459 
## [2] {baking powder, honey} => {frozen vegetables} 0.01663983 0.7560976 
## [3] {coffee, honey}        => {frozen meals}      0.06548578 0.9682540 
## [4] {baking powder, honey} => {butter}            0.02093398 0.9512195 
## [5] {baking powder, honey} => {frozen meals}      0.01663983 0.7560976 
##     coverage   lift     count
## [1] 0.32528180 1.782578 581  
## [2] 0.02200751 2.603715  31  
## [3] 0.06763285 1.800257 122  
## [4] 0.02200751 2.109669  39  
## [5] 0.02200751 1.405798  31

Rule 1:

If a customer buys coffee, they are likely to also buy frozen meals.

Support: 31.19% of all transactions include both.

Confidence: 95.87% of transactions with coffee also include frozen meals.

Lift: 1.78, meaning frozen meals appear 1.78 times as often in baskets containing coffee as they would if the two items were purchased independently.

Rule 2:

If a customer buys baking powder and honey, they are likely to also buy frozen vegetables.

Support: 1.66% of all transactions include this combination.

Confidence: 75.61% of transactions with baking powder and honey also include frozen vegetables.

Lift: 2.60, meaning frozen vegetables appear 2.6 times as often in these baskets as independence would predict.
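
The three measures for Rule 1 can be reproduced from raw counts. A minimal sketch in base R, using the item counts from the transaction summary (coffee 606, frozen meals 1002) and the rule count of 581; the variable names are illustrative.

```r
n <- 1863               # total transactions
n_coffee <- 606         # transactions containing coffee
n_frozen_meals <- 1002  # transactions containing frozen meals
n_both <- 581           # transactions containing both

support <- n_both / n                      # share of all baskets with both items
confidence <- n_both / n_coffee            # share of coffee baskets with frozen meals
lift <- confidence / (n_frozen_meals / n)  # confidence relative to the base rate

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

The results agree with the first row of the inspect() output. Since lift is confidence divided by the support of the consequent, it can never exceed 1 divided by that support.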

rules_df <- as(rules, "data.frame")
print(head(rules_df, 6))

##                                          rules    support confidence   coverage
## 1                   {coffee} => {frozen meals} 0.31186259  0.9587459 0.32528180
## 2 {baking powder,honey} => {frozen vegetables} 0.01663983  0.7560976 0.02200751
## 3             {coffee,honey} => {frozen meals} 0.06548578  0.9682540 0.06763285
## 4            {baking powder,honey} => {butter} 0.02093398  0.9512195 0.02200751
## 5      {baking powder,honey} => {frozen meals} 0.01663983  0.7560976 0.02200751
## 6  {abrasive cleaner,coffee} => {frozen meals} 0.05367687  0.8620690 0.06226516
##       lift count
## 1 1.782578   581
## 2 2.603715    31
## 3 1.800257   122
## 4 2.109669    39
## 5 1.405798    31
## 6 1.602829   100

Converting the rules to a data frame makes them easier to sort, filter, and visualize with standard data-frame tools.

inspect(sort(rules, by = "support", decreasing = TRUE)[1:5])

##     lhs                              rhs            support   confidence
## [1] {coffee}                      => {frozen meals} 0.3118626 0.9587459 
## [2] {baking powder, frozen meals} => {butter}       0.1669351 0.7585366 
## [3] {butter, coffee}              => {frozen meals} 0.1556629 0.9324759 
## [4] {baking powder, coffee}       => {frozen meals} 0.1336554 0.9576923 
## [5] {coffee, fish}                => {frozen meals} 0.1143317 0.9424779 
##     coverage  lift     count
## [1] 0.3252818 1.782578 581  
## [2] 0.2200751 1.682326 311  
## [3] 0.1669351 1.733735 290  
## [4] 0.1395598 1.780620 249  
## [5] 0.1213097 1.752332 213

Sorted by support, the top five rules are shown above. The rule {coffee} => {frozen meals} again ranks first: with a support of 0.31, it covers far more transactions than any other rule. Four of the five rules have frozen meals as the consequent; in the remaining rule, frozen meals appears in the antecedent.

inspect(sort(rules, by = "confidence", decreasing = FALSE)[1:5])

##     lhs                                         rhs                 support   
## [1] {butter, frozen vegetables, honey}       => {frozen meals}      0.01771337
## [2] {domestic eggs, frozen meals, ice cream} => {coffee}            0.01610306
## [3] {coffee, frozen vegetables, ice cream}   => {butter}            0.01449275
## [4] {butter, coffee, fish, ice cream}        => {frozen meals}      0.01288245
## [5] {baking powder, honey}                   => {frozen vegetables} 0.01663983
##     confidence coverage   lift     count
## [1] 0.7500000  0.02361782 1.394461 33   
## [2] 0.7500000  0.02147075 2.305693 30   
## [3] 0.7500000  0.01932367 1.663393 27   
## [4] 0.7500000  0.01717660 1.394461 24   
## [5] 0.7560976  0.02200751 2.603715 31

Note that this sort is ascending (decreasing = FALSE), so the output lists the five rules with the lowest confidence, all at or just above the 0.75 threshold. Two of the five still have frozen meals as the consequent. Even at the bottom of the ranking, the rules remain informative: when the antecedent occurs, the consequent follows in about three out of four cases. For example, when a customer buys butter, frozen vegetables, and honey together in a transaction, there is a 75% chance that the same transaction also contains frozen meals. The other four rules can be read the same way.

inspect(sort(rules, by = "lift", decreasing = TRUE)[1:5])

##     lhs                    rhs                    support confidence   coverage     lift count
## [1] {baking powder,                                                                           
##      coffee,                                                                                  
##      honey}             => {frozen vegetables} 0.01234568  0.9200000 0.01341922 3.168133    23
## [2] {baking powder,                                                                           
##      butter,                                                                                  
##      coffee,                                                                                  
##      honey}             => {frozen vegetables} 0.01127214  0.9130435 0.01234568 3.144177    21
## [3] {baking powder,                                                                           
##      coffee,                                                                                  
##      frozen meals,                                                                            
##      honey}             => {frozen vegetables} 0.01127214  0.9130435 0.01234568 3.144177    21
## [4] {baking powder,                                                                           
##      butter,                                                                                  
##      coffee,                                                                                  
##      frozen meals,                                                                            
##      honey}             => {frozen vegetables} 0.01019860  0.9047619 0.01127214 3.115659    19
## [5] {abrasive cleaner,                                                                        
##      butter,                                                                                  
##      fish,                                                                                    
##      ice cream}         => {coffee}            0.01180891  1.0000000 0.01180891 3.074257    22

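All of the top five rules by lift have a lift well above 1, so in each case the antecedent and consequent occur together more often than independence would predict. The strongest association is {baking powder, coffee, honey} => {frozen vegetables}, with a lift of 3.168133. The next three rules add butter, frozen meals, or both to the same antecedent and keep the same consequent, so they largely restate the first rule.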
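
Several of the high-lift rules above share a consequent and differ only by extra antecedent items. One way to thin out such overlapping rules is is.redundant() from arules, which flags a rule when a more general rule has the same or higher confidence. A minimal sketch, assuming the Groceries dataset that ships with arules as a stand-in for Trans; the thresholds and object names are illustrative and are not the ones used in this report.

```r
library(arules)

# Stand-in data bundled with arules (not the dataset analyzed in this report)
data("Groceries")

demo_rules <- apriori(Groceries,
                      parameter = list(supp = 0.01, conf = 0.5),
                      control = list(verbose = FALSE))

# Keep only rules that are not made redundant by a more general rule
pruned <- demo_rules[!is.redundant(demo_rules)]

# Strongest remaining rules by lift
inspect(head(sort(pruned, by = "lift"), 5))
```

Running the same filter on rules would show how many of the 148 rules carry information beyond their shorter counterparts.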

plot_data <- rules_df %>% 
  mutate(support = as.numeric(support), 
         confidence = as.numeric(confidence), 
         lift = as.numeric(lift))

ggplot(plot_data, aes(x = support, y = confidence, color = lift)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_gradient(low = "blue", high = "red") +
  theme_minimal() +
  ggtitle("Association Rules: Support vs Confidence")

plot(rules, measure = c("support", "confidence"), shading = "lift", engine = "plotly")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

The charts plot support (x-axis) against confidence (y-axis) for each association rule, with lift shown by color. The first chart uses a gradient from blue to red, where red indicates higher lift, while the second chart uses shades of red. In both charts, rules with higher support tend to have slightly lower confidence. Most rules cluster at low support and high confidence: they apply to few transactions but hold reliably when they do apply. The rule tables above show a similar pattern for lift. The highest-lift rules (lift above 3) have a support of only about 1%, so they describe rare but strong associations, whereas the highest-support rule, {coffee} => {frozen meals} at 31%, has a more moderate lift of 1.78.

goods <- unique(MBA$item1)[1:12]
goods_rules_list <- list()
goods_rules_plots <- list()

for (g in goods) {
  goods_rules <- apriori(data = Trans, parameter = list(supp = 0.001, conf = 0.75),
                         appearance = list(default = "lhs", rhs = g), control = list(verbose = FALSE))
  goods_rules_list[[g]] <- sort(goods_rules, by = "support", decreasing = TRUE)
  # theme_bw() comes first: a complete theme added last would reset the title size
  goods_rules_plots[[g]] <- plot(head(goods_rules_list[[g]]), method = "graph") + 
    labs(title = paste(g, "as a consequent item")) + theme_bw() + theme(plot.title = element_text(size = 9))
}

ggarrange(plotlist = goods_rules_plots, common.legend = TRUE, ncol = 3)

[Figure: four pages of rule graphs, one panel per item as the consequent]

goods_ant_rules_list <- list()
goods_ant_rules_plots <- list()

for (g in goods) {
  goods_rules <- apriori(data = Trans, parameter = list(supp = 0.01, conf = 0.075, minlen = 2),
                         appearance = list(default = "rhs", lhs = g), control = list(verbose = FALSE))
  goods_ant_rules_list[[g]] <- sort(goods_rules, by = "confidence", decreasing = TRUE)
  # theme_bw() comes first: a complete theme added last would reset the title size
  goods_ant_rules_plots[[g]] <- plot(head(goods_ant_rules_list[[g]]), method = "graph") + 
    labs(title = paste(g, "as an antecedent item")) + theme_bw() + theme(plot.title = element_text(size = 9))
}

ggarrange(plotlist = goods_ant_rules_plots, common.legend = TRUE, ncol = 3)

[Figure: four pages of rule graphs, one panel per item as the antecedent]

The graphs illustrate which items are frequently purchased alongside the item named in each chart’s title. For example, they show that when customers purchase baking powder, there is a strong likelihood that they will also buy butter, which aligns with common expectations.

6 Conclusion

In this project, I analyzed transaction data to uncover patterns of items frequently purchased together using association rule mining. The goal was to understand customer buying behavior and identify relationships between products. For example, buying baking powder and honey together is strongly associated with also buying butter, as shown by a confidence of 0.95 and a lift of 2.11. These insights help businesses improve product placement, plan targeted promotions, and enhance customer satisfaction.

This analysis is important because it allows us to make data-driven decisions, improving sales and customer experience. For example, retailers can bundle frequently purchased items or position them closer together on shelves. In the real world, this work can be implemented in retail, e-commerce, and marketing to optimize sales strategies, design recommendation systems, or develop personalized offers for customers based on their purchasing habits. The results, like the high-confidence association rules and visually clear charts, make it easier for businesses to act on these findings.

Market Basket Analysis

Nijat Abiyev