In this project, we will conduct association rule mining using the Apriori algorithm on a market basket dataset sourced from Kaggle. Market basket analysis is a technique used to discover relationships between products based on customer transactions. By examining these associations, we can identify frequently purchased item combinations, which can help optimize product placement, enhance marketing strategies, and ultimately boost sales. The Apriori algorithm is widely used for extracting association rules, and we will apply it to uncover meaningful insights from our retail transaction dataset.
[Data Source](https://www.kaggle.com/datasets/ashwinbadi/market-basket-analysist)
library(arules)
library(arulesViz)
library(arulesCBA)
library(tidyverse)
library(readxl)
library(RColorBrewer)
library(ggplot2)
library(plotly)
library(ggpubr)  # provides ggarrange(), which is called later in this report
# Load data and save as CSV
file_path <- "C:/Users/nijat/Desktop/market_basket_analysis.xlsx"
csv_path <- "C:/Users/nijat/Desktop/csv_market_basket_analysis.csv"
MBA <- read_excel(file_path, sheet = "Worksheet", range = "A1:M1864")
write.csv(MBA, csv_path, row.names = FALSE)
print(head(MBA, 10))
## # A tibble: 10 × 13
## itemset item1 item2 item3 item4 item5 item6 item7 item8 item9 item10 item11
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <lgl> <lgl>
## 1 1 baking… coff… froz… butt… <NA> <NA> <NA> NA NA NA NA
## 2 2 ice cr… abra… fish coff… froz… <NA> <NA> NA NA NA NA
## 3 3 butter baki… coff… ice … froz… <NA> <NA> NA NA NA NA
## 4 4 frozen… abra… ice … butt… coff… <NA> <NA> NA NA NA NA
## 5 5 baking… ice … butt… froz… <NA> <NA> <NA> NA NA NA NA
## 6 6 ice cr… froz… coff… cake… froz… <NA> <NA> NA NA NA NA
## 7 7 honey fish abra… dome… <NA> <NA> <NA> NA NA NA NA
## 8 8 butter froz… fish ice … froz… <NA> <NA> NA NA NA NA
## 9 9 coffee honey fish froz… <NA> <NA> <NA> NA NA NA NA
## 10 10 honey froz… fish <NA> <NA> <NA> <NA> NA NA NA NA
## # ℹ 1 more variable: item12 <lgl>
Trans <- read.transactions(csv_path, format = "basket", sep = ",", header = TRUE)
## Warning in asMethod(object): removing duplicated items in transactions
print(summary(Trans))
## transactions as itemMatrix in sparse format with
## 1863 rows (elements/itemsets/transactions) and
## 1875 columns (items) and a density of 0.002464269
##
## most frequent items:
## frozen meals butter baking powder coffee fish
## 1002 840 663 606 563
## (Other)
## 4934
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8
## 14 210 256 360 421 391 173 38
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 4.621 6.000 8.000
##
## includes extended item information - examples:
## labels
## 1 1
## 2 10
## 3 100
The dataset Trans consists of market basket transaction data, where each row represents a single shopping transaction made by a customer. Here’s what the data includes:
Transaction ID (e.g., 1858, 1859, 1860, etc.) – Represents a unique shopping trip. Because the CSV kept the itemset column, each ID also shows up as an item inside its own transaction.
List of Purchased Items – Each transaction contains a set of items bought together, such as baking powder, butter, frozen meals, and frozen vegetables in transaction 1858.
This data is used in market basket analysis to identify patterns, such as which items are frequently bought together, helping businesses improve product placement and marketing strategies.
inspect(tail(Trans, 6))
## items
## [1] {1858,
## baking powder,
## butter,
## frozen meals,
## frozen vegetables}
## [2] {1859,
## baking powder,
## butter,
## frozen meals,
## frozen vegetables,
## grapes}
## [3] {1860,
## abrasive cleaner,
## butter,
## coffee,
## fish,
## frozen vegetables,
## ice cream}
## [4] {1861,
## coffee,
## fish,
## frozen meals,
## frozen vegetables,
## grapes}
## [5] {1862,
## butter,
## fish,
## frozen meals,
## ice cream}
## [6] {1863,
## baking powder,
## butter,
## cake bar,
## coffee,
## frozen meals,
## grapes}
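The transaction IDs in the output above (1858, 1859, ...) sit inside the item sets, which suggests the itemset column was read as just another item. The sketch below uses a small made-up data frame (the `toy` rows are hypothetical, not taken from the real spreadsheet) to show how keeping the ID column inflates the number of distinct item labels by one per transaction. With the real file, a likely fix is to pass `cols = 1` to `read.transactions()`, which for the basket format names the column holding the transaction IDs, but that has not been re-run here.

```r
# Toy stand-in for the spreadsheet layout (hypothetical rows, not the real data):
# one ID column followed by item columns padded with NA.
toy <- data.frame(
  itemset = 1:3,
  item1 = c("butter", "coffee", "honey"),
  item2 = c("coffee", "frozen meals", "fish"),
  item3 = c(NA, "fish", NA),
  stringsAsFactors = FALSE
)

# Keeping the ID column: every row contributes its own unique "item".
with_id <- lapply(seq_len(nrow(toy)), function(i) {
  row <- unlist(toy[i, ], use.names = FALSE)
  as.character(row[!is.na(row)])
})

# Dropping the ID column first: only real products remain.
without_id <- lapply(seq_len(nrow(toy)), function(i) {
  row <- unlist(toy[i, -1], use.names = FALSE)
  row[!is.na(row)]
})

n_labels_with_id <- length(unique(unlist(with_id)))
n_labels_without_id <- length(unique(unlist(without_id)))
print(c(with_id = n_labels_with_id, without_id = n_labels_without_id))
```

The difference between the two counts equals the number of rows, which matches the pattern in the summary above: 1875 item labels for 1863 transactions.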
itemFrequencyPlot(Trans, topN = 10, col = brewer.pal(10, 'Set3'),
                  main = 'Top 10 Frequent Items', type = "absolute",
                  ylab = "Frequency", xlab = "Items")
X-axis title (“Items”): The names of the products in the dataset (e.g., frozen meals, butter, baking powder). Each label corresponds to a specific item sold in the store.
Y-axis title (“Frequency”): The number of transactions that contain each item. With type = "absolute", the bars show counts rather than proportions.
In other words, the x-axis labels are the distinct product names, and the corresponding y-axis values show how often each product was bought. For example:
“Frozen meals” has the highest frequency (1002 transactions). “Cake bar” has the lowest frequency in this chart.
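To make the difference between absolute and relative frequency concrete, here is a minimal base R sketch on a few made-up baskets (the `baskets` list is hypothetical). It counts how many baskets contain each item, which is the quantity the bars above represent, and then divides by the number of baskets to get the relative version.

```r
# Hypothetical baskets, for illustration only
baskets <- list(
  c("frozen meals", "butter", "coffee"),
  c("frozen meals", "butter"),
  c("frozen meals", "fish"),
  c("coffee")
)

# unique() guards against an item being listed twice in one basket
item_counts <- sort(table(unlist(lapply(baskets, unique))), decreasing = TRUE)

# Relative frequency = share of baskets that contain the item
item_share <- item_counts / length(baskets)

print(item_counts)
print(round(item_share, 2))
```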
rules <- apriori(Trans, parameter = list(supp = 0.01, conf = 0.75))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.75 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 18
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1875 item(s), 1863 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [148 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
print(summary(rules))
## set of 148 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6
## 1 19 75 48 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 4.00 4.00 4.25 5.00 6.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01020 Min. :0.7500 Min. :0.01074 Min. :1.394
## 1st Qu.:0.01221 1st Qu.:0.8235 1st Qu.:0.01436 1st Qu.:1.682
## Median :0.01771 Median :0.8904 Median :0.02120 Median :1.780
## Mean :0.02908 Mean :0.8769 Mean :0.03316 Mean :1.961
## 3rd Qu.:0.02644 3rd Qu.:0.9333 3rd Qu.:0.03167 3rd Qu.:2.145
## Max. :0.31186 Max. :1.0000 Max. :0.32528 Max. :3.168
## count
## Min. : 19.00
## 1st Qu.: 22.75
## Median : 33.00
## Mean : 54.17
## 3rd Qu.: 49.25
## Max. :581.00
##
## mining info:
## data ntransactions support confidence
## Trans 1863 0.01 0.75
## call
## apriori(data = Trans, parameter = list(supp = 0.01, conf = 0.75))
Apriori algorithm: Extracts association rules from transactional data.
Parameters: Minimum support = 1%, minimum confidence = 75%.
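Dataset: 1863 transactions and 1875 item labels. Most of those labels appear to be transaction IDs that were read as items; only 12 items pass the minimum support, as the "sorting and recoding items ... [12 item(s)]" line shows.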
Output: 148 rules generated for frequent item combinations.
Purpose: Identify relationships between products for better business strategies.
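As a quick check on what the 1% minimum support means in absolute terms, the arithmetic can be sketched as follows. The variable names are illustrative. The log line "Absolute minimum support count: 18" appears to show the truncated product, while the smallest rule count in the summary is 19, which is consistent with rounding up.

```r
n_transactions <- 1863   # from the summary of Trans
min_support <- 0.01      # supp = 0.01 in the apriori() call

# An itemset must appear in at least this many transactions
threshold <- min_support * n_transactions   # 18.63
min_count_needed <- ceiling(threshold)      # smallest whole count at or above the threshold

print(threshold)
print(min_count_needed)
```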
inspect(head(rules, 5))
## lhs rhs support confidence
## [1] {coffee} => {frozen meals} 0.31186259 0.9587459
## [2] {baking powder, honey} => {frozen vegetables} 0.01663983 0.7560976
## [3] {coffee, honey} => {frozen meals} 0.06548578 0.9682540
## [4] {baking powder, honey} => {butter} 0.02093398 0.9512195
## [5] {baking powder, honey} => {frozen meals} 0.01663983 0.7560976
## coverage lift count
## [1] 0.32528180 1.782578 581
## [2] 0.02200751 2.603715 31
## [3] 0.06763285 1.800257 122
## [4] 0.02200751 2.109669 39
## [5] 0.02200751 1.405798 31
Rule 1:
If a customer buys coffee, they are likely to also buy frozen meals.
Support: 31.19% of all transactions include both.
Confidence: 95.87% of transactions with coffee also include frozen meals.
Lift: 1.78, meaning frozen meals appear 1.78 times as often in transactions with coffee as they would if the two items were purchased independently.
Rule 2:
If a customer buys baking powder and honey, they are likely to also buy frozen vegetables.
Support: 1.66% of all transactions include this combination.
Confidence: 75.61% of transactions with baking powder and honey also include frozen vegetables.
Lift: 2.60, a stronger positive association than Rule 1.
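The quality measures for Rule 1 can be reproduced by hand from the counts reported earlier: 1863 transactions, 606 containing coffee, 1002 containing frozen meals, and 581 containing both. This is a minimal sketch of the standard definitions, not a replacement for the arules output.

```r
n <- 1863               # total transactions
n_coffee <- 606         # transactions containing coffee (antecedent)
n_frozen_meals <- 1002  # transactions containing frozen meals (consequent)
n_both <- 581           # transactions containing both (rule count)

support <- n_both / n                      # share of all transactions with both items
coverage <- n_coffee / n                   # share of transactions with the antecedent
confidence <- n_both / n_coffee            # share of coffee transactions with frozen meals
lift <- confidence / (n_frozen_meals / n)  # confidence relative to the consequent's base rate

print(round(c(support = support, coverage = coverage,
              confidence = confidence, lift = lift), 4))
```

The values agree with the first row of the `inspect()` output to the precision shown there.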
rules_df <- as(rules, "data.frame")
print(head(rules_df, 6))
## rules support confidence coverage
## 1 {coffee} => {frozen meals} 0.31186259 0.9587459 0.32528180
## 2 {baking powder,honey} => {frozen vegetables} 0.01663983 0.7560976 0.02200751
## 3 {coffee,honey} => {frozen meals} 0.06548578 0.9682540 0.06763285
## 4 {baking powder,honey} => {butter} 0.02093398 0.9512195 0.02200751
## 5 {baking powder,honey} => {frozen meals} 0.01663983 0.7560976 0.02200751
## 6 {abrasive cleaner,coffee} => {frozen meals} 0.05367687 0.8620690 0.06226516
## lift count
## 1 1.782578 581
## 2 2.603715 31
## 3 1.800257 122
## 4 2.109669 39
## 5 1.405798 31
## 6 1.602829 100
Converting the rules to a data frame makes them easier to sort, filter, and visualize with standard data frame tools.
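As a small illustration of why the data frame form is convenient, the sketch below rebuilds the first four rows of the table above by hand (the name `rules_toy` is made up so it does not overwrite `rules_df`), keeps the rules with lift above 2, and orders them by confidence.

```r
# First four rules copied from the printed output above
rules_toy <- data.frame(
  rules = c("{coffee} => {frozen meals}",
            "{baking powder,honey} => {frozen vegetables}",
            "{coffee,honey} => {frozen meals}",
            "{baking powder,honey} => {butter}"),
  support = c(0.31186259, 0.01663983, 0.06548578, 0.02093398),
  confidence = c(0.9587459, 0.7560976, 0.9682540, 0.9512195),
  lift = c(1.782578, 2.603715, 1.800257, 2.109669),
  stringsAsFactors = FALSE
)

# Keep rules with lift above 2, then sort by confidence, highest first
strong <- rules_toy[rules_toy$lift > 2, ]
strong <- strong[order(-strong$confidence), ]
print(strong)
```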
inspect(sort(rules, by = "support", decreasing = TRUE)[1:5])
## lhs rhs support confidence
## [1] {coffee} => {frozen meals} 0.3118626 0.9587459
## [2] {baking powder, frozen meals} => {butter} 0.1669351 0.7585366
## [3] {butter, coffee} => {frozen meals} 0.1556629 0.9324759
## [4] {baking powder, coffee} => {frozen meals} 0.1336554 0.9576923
## [5] {coffee, fish} => {frozen meals} 0.1143317 0.9424779
## coverage lift count
## [1] 0.3252818 1.782578 581
## [2] 0.2200751 1.682326 311
## [3] 0.1669351 1.733735 290
## [4] 0.1395598 1.780620 249
## [5] 0.1213097 1.752332 213
Sorted by support, the top 5 rules are shown above. The top rule is still {coffee} => {frozen meals}, which covers far more transactions than any other rule. Four of the top 5 rules have frozen meals as the consequent.
inspect(sort(rules, by = "confidence", decreasing = FALSE)[1:5])
## lhs rhs support
## [1] {butter, frozen vegetables, honey} => {frozen meals} 0.01771337
## [2] {domestic eggs, frozen meals, ice cream} => {coffee} 0.01610306
## [3] {coffee, frozen vegetables, ice cream} => {butter} 0.01449275
## [4] {butter, coffee, fish, ice cream} => {frozen meals} 0.01288245
## [5] {baking powder, honey} => {frozen vegetables} 0.01663983
## confidence coverage lift count
## [1] 0.7500000 0.02361782 1.394461 33
## [2] 0.7500000 0.02147075 2.305693 30
## [3] 0.7500000 0.01932367 1.663393 27
## [4] 0.7500000 0.01717660 1.394461 24
## [5] 0.7560976 0.02200751 2.603715 31
Because the rules are sorted by confidence in ascending order (decreasing = FALSE), these are the five rules with the lowest confidence, all at or just above the 0.75 minimum. Two of the five still have frozen meals as the consequent. Even at this lower end, the consequent follows the antecedent in at least three out of four cases. For example, when a customer buys butter, frozen vegetables, and honey together in a transaction, there is a 75% chance that the customer will also purchase frozen meals. The other four rules read the same way.
inspect(sort(rules, by = "lift", decreasing = TRUE)[1:5])
## lhs rhs support confidence coverage lift count
## [1] {baking powder,
## coffee,
## honey} => {frozen vegetables} 0.01234568 0.9200000 0.01341922 3.168133 23
## [2] {baking powder,
## butter,
## coffee,
## honey} => {frozen vegetables} 0.01127214 0.9130435 0.01234568 3.144177 21
## [3] {baking powder,
## coffee,
## frozen meals,
## honey} => {frozen vegetables} 0.01127214 0.9130435 0.01234568 3.144177 21
## [4] {baking powder,
## butter,
## coffee,
## frozen meals,
## honey} => {frozen vegetables} 0.01019860 0.9047619 0.01127214 3.115659 19
## [5] {abrasive cleaner,
## butter,
## fish,
## ice cream} => {coffee} 0.01180891 1.0000000 0.01180891 3.074257 22
All of the top 5 rules by lift have a lift well above 1, so the antecedent and consequent occur together more often than independence would predict. The strongest association is {baking powder, coffee, honey} => {frozen vegetables}, with a lift of 3.168133.
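One way to read these lift values is against their ceiling. Since confidence cannot exceed 1, the lift of any rule is at most 1 divided by the support of its consequent. The sketch below works this out from numbers already shown in the report. The frozen vegetables support is not printed directly, so it is inferred here from the top rule's confidence and lift, and that inference is an assumption.

```r
n <- 1863

# Frozen meals: count taken from the "most frequent items" summary
frozen_meals_support <- 1002 / n
max_lift_frozen_meals <- 1 / frozen_meals_support

# Frozen vegetables: support inferred as confidence / lift of the top rule
implied_fv_support <- 0.92 / 3.168133
implied_fv_count <- round(implied_fv_support * n)
max_lift_fv <- 1 / implied_fv_support

print(round(c(max_lift_frozen_meals = max_lift_frozen_meals,
              max_lift_frozen_vegetables = max_lift_fv), 3))
print(implied_fv_count)
```

A less common consequent has a higher ceiling, which helps explain why rules ending in frozen vegetables dominate the lift ranking while rules ending in frozen meals stay below about 1.86.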
plot_data <- rules_df %>%
  mutate(support = as.numeric(support),
         confidence = as.numeric(confidence),
         lift = as.numeric(lift))
ggplot(plot_data, aes(x = support, y = confidence, color = lift)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_gradient(low = "blue", high = "red") +
  theme_minimal() +
  ggtitle("Association Rules: Support vs Confidence")
plot(rules, measure = c("support", "confidence"), shading = "lift", engine = "plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The two charts plot support (x-axis) against confidence (y-axis) for each association rule, with lift shown by color. In both, rules with higher support (more common itemsets) tend to have slightly lower confidence. The first chart uses a blue-to-red gradient for lift, where red indicates stronger associations, while the second uses shades of red. Most rules cluster at low support and high confidence: the combinations are rare, but when the antecedent occurs the consequent usually follows. The highest-lift rules (lift above 3) also sit at the low-support end, with support near 0.01, so the strongest associations are among the rarest. The highest-support rule, {coffee} => {frozen meals}, has a more modest lift of 1.78.
# The first 12 distinct values of item1 are used as the target products
goods <- unique(MBA$item1)[1:12]
goods_rules_list <- list()
goods_rules_plots <- list()
for (g in goods) {
  # Mine rules with the current product fixed as the consequent (rhs)
  goods_rules <- apriori(data = Trans, parameter = list(supp = 0.001, conf = 0.75),
                         appearance = list(default = "lhs", rhs = g), control = list(verbose = FALSE))
  goods_rules_list[[g]] <- sort(goods_rules, by = "support", decreasing = TRUE)
  # theme_bw() goes first so it does not reset the title size set in theme()
  goods_rules_plots[[g]] <- plot(head(goods_rules_list[[g]]), method = "graph") +
    labs(title = paste(g, "as a consequent item")) + theme_bw() + theme(plot.title = element_text(size = 9))
}
# ggarrange() comes from the ggpubr package
ggarrange(plotlist = goods_rules_plots, common.legend = TRUE, ncol = 3)
## $`1`
##
## $`2`
##
## $`3`
##
## $`4`
##
## attr(,"class")
## [1] "list" "ggarrange"
goods_ant_rules_list <- list()
goods_ant_rules_plots <- list()
for (g in goods) {
  # Mine rules with the current product fixed as the antecedent (lhs)
  goods_rules <- apriori(data = Trans, parameter = list(supp = 0.01, conf = 0.075, minlen = 2),
                         appearance = list(default = "rhs", lhs = g), control = list(verbose = FALSE))
  goods_ant_rules_list[[g]] <- sort(goods_rules, by = "confidence", decreasing = TRUE)
  # theme_bw() goes first so it does not reset the title size set in theme()
  goods_ant_rules_plots[[g]] <- plot(head(goods_ant_rules_list[[g]]), method = "graph") +
    labs(title = paste(g, "as an antecedent item")) + theme_bw() + theme(plot.title = element_text(size = 9))
}
# ggarrange() comes from the ggpubr package
ggarrange(plotlist = goods_ant_rules_plots, common.legend = TRUE, ncol = 3)
## $`1`
##
## $`2`
##
## $`3`
##
## $`4`
##
## attr(,"class")
## [1] "list" "ggarrange"
The graphs illustrate the items that are frequently purchased alongside the item named in each chart’s title. For example, they show that when customers purchase baking powder, there is a strong likelihood that they will also buy butter, which aligns with common expectations.
In this project, I analyzed transaction data to uncover patterns of items frequently purchased together using association rule mining. The goal was to understand customer buying behavior and identify relationships between products. For example, baskets containing baking powder and honey also contain butter with a confidence of 0.95 and a lift of 2.11. These insights help businesses improve product placement, plan targeted promotions, and enhance customer satisfaction.
This analysis is important because it allows us to make data-driven decisions, improving sales and customer experience. For example, retailers can bundle frequently purchased items or position them closer together on shelves. In the real world, this work can be implemented in retail, e-commerce, and marketing to optimize sales strategies, design recommendation systems, or develop personalized offers for customers based on their purchasing habits. The results, like the high-confidence association rules and visually clear charts, make it easier for businesses to act on these findings.