Market basket analysis is a data mining technique used to identify relationships between products that customers purchase together. It analyzes transaction data from point-of-sale systems to find which products frequently appear in the same basket. The resulting insights can be used to optimize store layout, product placement, and promotions, as well as to identify new product development opportunities. It is a useful tool for retailers and marketers looking to improve their sales and customer engagement.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(arulesViz)
## Warning: package 'arulesViz' was built under R version 4.2.2
## Loading required package: arules
## Warning: package 'arules' was built under R version 4.2.2
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
This paper is based on a dataset from Kaggle: https://www.kaggle.com/code/nandinibagga/apriori-algorithm/data?select=bread+basket.csv. The dataset belongs to “The Bread Basket”, a bakery located in Edinburgh. The dataset has 20507 entries, over 9000 transactions, and 4 columns.
trans <-read.transactions("C:/Users/User/OneDrive/Desktop/bread basket (2).csv", format = "single", sep=",", header =TRUE, cols = c("Transaction","Item"))
# Summary of the transaction data
summary(trans)
## transactions as itemMatrix in sparse format with
## 6576 rows (elements/itemsets/transactions) and
## 102 columns (items) and a density of 0.01988962
##
## most frequent items:
## Coffee Bread Tea Cake Pastry (Other)
## 3188 2146 941 694 576 5796
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10
## 2652 2155 1029 502 174 47 10 2 4 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.029 3.000 10.000
##
## includes extended item information - examples:
## labels
## 1 Adjustment
## 2 Afternoon with the baker
## 3 Alfajores
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 10
## 3 1000
glimpse(trans)
## Formal class 'transactions' [package "arules"] with 3 slots
## ..@ data :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
## ..@ itemInfo :'data.frame': 102 obs. of 1 variable:
## .. ..$ labels: chr [1:102] "Adjustment" "Afternoon with the baker" "Alfajores" "Argentina Night" ...
## ..@ itemsetInfo:'data.frame': 6576 obs. of 1 variable:
## .. ..$ transactionID: chr [1:6576] "1" "10" "1000" "1001" ...
There are 6576 transactions (itemsets) and 102 unique items in the dataset. The density of 0.01988962 means that only about 1.99% of the cells in the transaction-by-item matrix are non-zero, which confirms that the matrix is sparse. There are 2652 transactions with only one item, 2155 with two items, and so on. The minimum transaction length is 1, the median is 2, and the maximum is 10. Most importantly, the summary shows the most frequent items: Coffee, Bread, Tea, Cake and Pastry.
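As a quick consistency check, the density and the mean transaction length can be recomputed from the length distribution printed by summary(trans). The symbol n_k (the number of transactions of length k) is introduced here only for illustration.

```latex
\[
\sum_{k=1}^{10} k\, n_k
  = 1(2652) + 2(2155) + 3(1029) + 4(502) + 5(174) + 6(47) + 7(10) + 8(2) + 9(4) + 10(1)
  = 13341
\]
\[
\text{density} = \frac{13341}{6576 \times 102} = \frac{13341}{670752} \approx 0.01989,
\qquad
\text{mean length} = \frac{13341}{6576} \approx 2.029
\]
```

Both values agree with the summary output, and 13341 is also the sum of the item frequencies listed under "most frequent items" (3188 + 2146 + 941 + 694 + 576 + 5796).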
df <- read.csv("C:/Users/User/OneDrive/Desktop/bread basket (2).csv")
View(df)
Let's plot a bar graph that shows the distribution of the number of items in each transaction (basket size).
The most frequent basket size is 1 and the mean size is about 2.
# Count the rows recorded for each transaction, then tally how many transactions have each basket size
group_basket <- df %>% group_by(Transaction) %>% summarise(basket_size = n())
basket_sizes <- group_basket %>% group_by(basket_size) %>% summarise(count = n())
ggplot(basket_sizes, aes(x = basket_size, y = count)) + geom_bar(stat = "identity") + scale_x_continuous(breaks = seq(0, 80, by = 1))
As mentioned previously, Coffee is the product that customers bought most often, followed by Bread. Let's plot the 20 most frequent items.
itemFrequencyPlot(trans,topN=20,type="absolute")
In association rule mining, support and confidence are used as measures of the strength of a rule.
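For reference, the usual definitions can be written as follows. The notation is chosen here for illustration: N is the number of transactions and count(X) is the number of transactions that contain every item of the itemset X.

```latex
\[
\operatorname{supp}(X) = \frac{\operatorname{count}(X)}{N},
\qquad
\operatorname{conf}(X \Rightarrow Y) = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)}
\]
```

For example, Coffee appears in 3188 of the 6576 transactions, so its support is 3188/6576 ≈ 0.485.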
In general, the levels of support and confidence should be set high enough to filter out uninteresting or insignificant patterns, but not so high that interesting patterns are excluded. It is common to experiment with different levels of support and confidence to find the most interesting and useful patterns.
One way to identify appropriate levels of support and confidence is to plot the number of rules generated against the support and confidence thresholds and to look for the point where a further increase in either threshold causes a sharp drop in the number of rules. Around this point, the levels can be set to balance the number of rules generated against the importance of the patterns identified.
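The threshold search described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the toy baskets and the threshold grid are made up for the example and are not taken from the bakery data, and the rule counts are obtained with apriori() from arules.

```r
library(arules)

# Hypothetical toy baskets standing in for the real transaction data
baskets <- list(
  c("Coffee", "Bread"), c("Coffee", "Cake"), c("Coffee", "Toast"),
  c("Bread"), c("Coffee"), c("Tea", "Cake"),
  c("Coffee", "Bread", "Pastry"), c("Tea"),
  c("Coffee", "Pastry"), c("Bread", "Pastry")
)
toy <- as(baskets, "transactions")

# Every combination of the (illustrative) support and confidence thresholds
grid <- expand.grid(supp = c(0.1, 0.2, 0.3), conf = c(0.3, 0.5, 0.7))

# Count the rules with a non-empty left-hand side for each combination
grid$n_rules <- mapply(function(s, cf) {
  rules <- apriori(toy,
                   parameter = list(supp = s, conf = cf, minlen = 2),
                   control = list(verbose = FALSE))
  length(rules)
}, grid$supp, grid$conf)

print(grid)
```

Replacing toy with trans and widening the grid would give the rule counts for the bakery data, which can then be plotted against the thresholds.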
# Generating frequent itemsets from our transaction dataset, with a minimum support threshold of 0.05 and a maximum itemset size of 10.
freq_items<-eclat(trans, parameter=list(supp=0.05, maxlen=10))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.05 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 328
##
## create itemset ...
## set transactions ...[102 item(s), 6576 transaction(s)] done [0.00s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating bit matrix ... [9 row(s), 6576 column(s)] done [0.00s].
## writing ... [12 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
# Print the generated frequent itemsets, then sort them by count in descending order
inspect(freq_items)
## items support count
## [1] {Cake, Coffee} 0.05687348 374
## [2] {Coffee, Tea} 0.05200730 342
## [3] {Bread, Coffee} 0.09032847 594
## [4] {Coffee} 0.48479319 3188
## [5] {Bread} 0.32633820 2146
## [6] {Tea} 0.14309611 941
## [7] {Cake} 0.10553528 694
## [8] {Pastry} 0.08759124 576
## [9] {Sandwich} 0.07496959 493
## [10] {Cookies} 0.05687348 374
## [11] {Medialuna} 0.05763382 379
## [12] {Hot chocolate} 0.05200730 342
freq_df <- as.data.frame(inspect(freq_items))
## items support count
## [1] {Cake, Coffee} 0.05687348 374
## [2] {Coffee, Tea} 0.05200730 342
## [3] {Bread, Coffee} 0.09032847 594
## [4] {Coffee} 0.48479319 3188
## [5] {Bread} 0.32633820 2146
## [6] {Tea} 0.14309611 941
## [7] {Cake} 0.10553528 694
## [8] {Pastry} 0.08759124 576
## [9] {Sandwich} 0.07496959 493
## [10] {Cookies} 0.05687348 374
## [11] {Medialuna} 0.05763382 379
## [12] {Hot chocolate} 0.05200730 342
freq_df %>% arrange(desc(count))
## items support count
## [4] {Coffee} 0.48479319 3188
## [5] {Bread} 0.32633820 2146
## [6] {Tea} 0.14309611 941
## [7] {Cake} 0.10553528 694
## [3] {Bread, Coffee} 0.09032847 594
## [8] {Pastry} 0.08759124 576
## [9] {Sandwich} 0.07496959 493
## [11] {Medialuna} 0.05763382 379
## [1] {Cake, Coffee} 0.05687348 374
## [10] {Cookies} 0.05687348 374
## [2] {Coffee, Tea} 0.05200730 342
## [12] {Hot chocolate} 0.05200730 342
The frequent itemsets contain only one or two items. With a minimum support of 0.05, no itemset with more than two different items is frequent in this dataset.
The next step is to find the strongest rules. To obtain a richer set of rules, including itemsets with more than two items, the support threshold needs to be lowered.
# Lowering the minimum support level to 0.01 with the aim to include more useful and interesting patterns
freq_items<-eclat(trans, parameter=list(supp=0.01, maxlen=15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.01 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 65
##
## create itemset ...
## set transactions ...[102 item(s), 6576 transaction(s)] done [0.00s].
## sorting and recoding items ... [30 item(s)] done [0.00s].
## creating sparse bit matrix ... [30 row(s), 6576 column(s)] done [0.00s].
## writing ... [61 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
# Print the generated frequent itemsets, then sort them by count in descending order
inspect(freq_items)
## items support count
## [1] {Coffee, Tiffin} 0.01064477 70
## [2] {Coffee, Spanish Brunch} 0.01414234 93
## [3] {Coffee, Scone} 0.01855231 122
## [4] {Coffee, Toast} 0.02585158 170
## [5] {Coffee, Muffin} 0.01809611 119
## [6] {Coffee, Soup} 0.01718370 113
## [7] {Alfajores, Coffee} 0.02250608 148
## [8] {Alfajores, Bread} 0.01125304 74
## [9] {Brownie, Coffee} 0.02098540 138
## [10] {Bread, Brownie} 0.01186131 78
## [11] {Coffee, Juice} 0.02144161 141
## [12] {Coffee, Hot chocolate} 0.02737226 180
## [13] {Bread, Hot chocolate} 0.01201338 79
## [14] {Cake, Hot chocolate} 0.01034063 68
## [15] {Coffee, Medialuna} 0.03315085 218
## [16] {Bread, Medialuna} 0.01642336 108
## [17] {Coffee, Cookies} 0.02995742 197
## [18] {Bread, Cookies} 0.01520681 100
## [19] {Coffee, Sandwich} 0.04257908 280
## [20] {Bread, Sandwich} 0.01703163 112
## [21] {Sandwich, Tea} 0.01414234 93
## [22] {Bread, Coffee, Pastry} 0.01140511 75
## [23] {Coffee, Pastry} 0.04896594 322
## [24] {Bread, Pastry} 0.02980535 196
## [25] {Cake, Coffee, Tea} 0.01125304 74
## [26] {Cake, Coffee} 0.05687348 374
## [27] {Bread, Cake} 0.02341849 154
## [28] {Cake, Tea} 0.02630779 173
## [29] {Coffee, Tea} 0.05200730 342
## [30] {Bread, Tea} 0.02965328 195
## [31] {Bread, Coffee} 0.09032847 594
## [32] {Coffee} 0.48479319 3188
## [33] {Bread} 0.32633820 2146
## [34] {Tea} 0.14309611 941
## [35] {Cake} 0.10553528 694
## [36] {Pastry} 0.08759124 576
## [37] {Sandwich} 0.07496959 493
## [38] {Cookies} 0.05687348 374
## [39] {Medialuna} 0.05763382 379
## [40] {Hot chocolate} 0.05200730 342
## [41] {Juice} 0.04045012 266
## [42] {Brownie} 0.04409976 290
## [43] {Alfajores} 0.04075426 268
## [44] {Soup} 0.03527981 232
## [45] {Muffin} 0.03649635 240
## [46] {Toast} 0.03543187 233
## [47] {Scone} 0.03421533 225
## [48] {Spanish Brunch} 0.02235401 147
## [49] {Truffles} 0.02311436 152
## [50] {Farm House} 0.03832117 252
## [51] {Tiffin} 0.01946472 128
## [52] {Scandinavian} 0.02934915 193
## [53] {Coke} 0.02083333 137
## [54] {Mineral water} 0.01657543 109
## [55] {Chicken Stew} 0.01475061 97
## [56] {Jammie Dodgers} 0.01414234 93
## [57] {Salad} 0.01292579 85
## [58] {Baguette} 0.01992092 131
## [59] {Jam} 0.01475061 97
## [60] {Hearty & Seasonal} 0.01003650 66
## [61] {Fudge} 0.01246959 82
freq_df <- as.data.frame(inspect(freq_items))
## items support count
## [1] {Coffee, Tiffin} 0.01064477 70
## [2] {Coffee, Spanish Brunch} 0.01414234 93
## [3] {Coffee, Scone} 0.01855231 122
## [4] {Coffee, Toast} 0.02585158 170
## [5] {Coffee, Muffin} 0.01809611 119
## [6] {Coffee, Soup} 0.01718370 113
## [7] {Alfajores, Coffee} 0.02250608 148
## [8] {Alfajores, Bread} 0.01125304 74
## [9] {Brownie, Coffee} 0.02098540 138
## [10] {Bread, Brownie} 0.01186131 78
## [11] {Coffee, Juice} 0.02144161 141
## [12] {Coffee, Hot chocolate} 0.02737226 180
## [13] {Bread, Hot chocolate} 0.01201338 79
## [14] {Cake, Hot chocolate} 0.01034063 68
## [15] {Coffee, Medialuna} 0.03315085 218
## [16] {Bread, Medialuna} 0.01642336 108
## [17] {Coffee, Cookies} 0.02995742 197
## [18] {Bread, Cookies} 0.01520681 100
## [19] {Coffee, Sandwich} 0.04257908 280
## [20] {Bread, Sandwich} 0.01703163 112
## [21] {Sandwich, Tea} 0.01414234 93
## [22] {Bread, Coffee, Pastry} 0.01140511 75
## [23] {Coffee, Pastry} 0.04896594 322
## [24] {Bread, Pastry} 0.02980535 196
## [25] {Cake, Coffee, Tea} 0.01125304 74
## [26] {Cake, Coffee} 0.05687348 374
## [27] {Bread, Cake} 0.02341849 154
## [28] {Cake, Tea} 0.02630779 173
## [29] {Coffee, Tea} 0.05200730 342
## [30] {Bread, Tea} 0.02965328 195
## [31] {Bread, Coffee} 0.09032847 594
## [32] {Coffee} 0.48479319 3188
## [33] {Bread} 0.32633820 2146
## [34] {Tea} 0.14309611 941
## [35] {Cake} 0.10553528 694
## [36] {Pastry} 0.08759124 576
## [37] {Sandwich} 0.07496959 493
## [38] {Cookies} 0.05687348 374
## [39] {Medialuna} 0.05763382 379
## [40] {Hot chocolate} 0.05200730 342
## [41] {Juice} 0.04045012 266
## [42] {Brownie} 0.04409976 290
## [43] {Alfajores} 0.04075426 268
## [44] {Soup} 0.03527981 232
## [45] {Muffin} 0.03649635 240
## [46] {Toast} 0.03543187 233
## [47] {Scone} 0.03421533 225
## [48] {Spanish Brunch} 0.02235401 147
## [49] {Truffles} 0.02311436 152
## [50] {Farm House} 0.03832117 252
## [51] {Tiffin} 0.01946472 128
## [52] {Scandinavian} 0.02934915 193
## [53] {Coke} 0.02083333 137
## [54] {Mineral water} 0.01657543 109
## [55] {Chicken Stew} 0.01475061 97
## [56] {Jammie Dodgers} 0.01414234 93
## [57] {Salad} 0.01292579 85
## [58] {Baguette} 0.01992092 131
## [59] {Jam} 0.01475061 97
## [60] {Hearty & Seasonal} 0.01003650 66
## [61] {Fudge} 0.01246959 82
freq_df %>% arrange(desc(count))
## items support count
## [32] {Coffee} 0.48479319 3188
## [33] {Bread} 0.32633820 2146
## [34] {Tea} 0.14309611 941
## [35] {Cake} 0.10553528 694
## [31] {Bread, Coffee} 0.09032847 594
## [36] {Pastry} 0.08759124 576
## [37] {Sandwich} 0.07496959 493
## [39] {Medialuna} 0.05763382 379
## [26] {Cake, Coffee} 0.05687348 374
## [38] {Cookies} 0.05687348 374
## [29] {Coffee, Tea} 0.05200730 342
## [40] {Hot chocolate} 0.05200730 342
## [23] {Coffee, Pastry} 0.04896594 322
## [42] {Brownie} 0.04409976 290
## [19] {Coffee, Sandwich} 0.04257908 280
## [43] {Alfajores} 0.04075426 268
## [41] {Juice} 0.04045012 266
## [50] {Farm House} 0.03832117 252
## [45] {Muffin} 0.03649635 240
## [46] {Toast} 0.03543187 233
## [44] {Soup} 0.03527981 232
## [47] {Scone} 0.03421533 225
## [15] {Coffee, Medialuna} 0.03315085 218
## [17] {Coffee, Cookies} 0.02995742 197
## [24] {Bread, Pastry} 0.02980535 196
## [30] {Bread, Tea} 0.02965328 195
## [52] {Scandinavian} 0.02934915 193
## [12] {Coffee, Hot chocolate} 0.02737226 180
## [28] {Cake, Tea} 0.02630779 173
## [4] {Coffee, Toast} 0.02585158 170
## [27] {Bread, Cake} 0.02341849 154
## [49] {Truffles} 0.02311436 152
## [7] {Alfajores, Coffee} 0.02250608 148
## [48] {Spanish Brunch} 0.02235401 147
## [11] {Coffee, Juice} 0.02144161 141
## [9] {Brownie, Coffee} 0.02098540 138
## [53] {Coke} 0.02083333 137
## [58] {Baguette} 0.01992092 131
## [51] {Tiffin} 0.01946472 128
## [3] {Coffee, Scone} 0.01855231 122
## [5] {Coffee, Muffin} 0.01809611 119
## [6] {Coffee, Soup} 0.01718370 113
## [20] {Bread, Sandwich} 0.01703163 112
## [54] {Mineral water} 0.01657543 109
## [16] {Bread, Medialuna} 0.01642336 108
## [18] {Bread, Cookies} 0.01520681 100
## [55] {Chicken Stew} 0.01475061 97
## [59] {Jam} 0.01475061 97
## [2] {Coffee, Spanish Brunch} 0.01414234 93
## [21] {Sandwich, Tea} 0.01414234 93
## [56] {Jammie Dodgers} 0.01414234 93
## [57] {Salad} 0.01292579 85
## [61] {Fudge} 0.01246959 82
## [13] {Bread, Hot chocolate} 0.01201338 79
## [10] {Bread, Brownie} 0.01186131 78
## [22] {Bread, Coffee, Pastry} 0.01140511 75
## [8] {Alfajores, Bread} 0.01125304 74
## [25] {Cake, Coffee, Tea} 0.01125304 74
## [1] {Coffee, Tiffin} 0.01064477 70
## [14] {Cake, Hot chocolate} 0.01034063 68
## [60] {Hearty & Seasonal} 0.01003650 66
# Create association rules from the frequent itemsets
freq_rules<-ruleInduction(freq_items, trans, confidence=0.3)
# Provide a summary of the generated rules
summary(freq_rules)
## set of 19 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 17 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.105 2.000 3.000
##
## summary of quality measures:
## support confidence lift itemset
## Min. :0.01064 Min. :0.3403 Min. :0.7497 Min. : 1.00
## 1st Qu.:0.01764 1st Qu.:0.4815 1st Qu.:1.0137 1st Qu.: 5.50
## Median :0.02251 Median :0.5301 Median :1.0934 Median :12.00
## Mean :0.02699 Mean :0.5158 Mean :1.0820 Mean :13.68
## 3rd Qu.:0.03155 3rd Qu.:0.5556 3rd Qu.:1.1461 3rd Qu.:22.50
## Max. :0.05687 Max. :0.7296 Max. :1.5050 Max. :29.00
##
## mining info:
## data ntransactions support
## trans 6576 0.01
## call confidence
## eclat(data = trans, parameter = list(supp = 0.01, maxlen = 15)) 0.3
The rule length distribution indicates that there are 17 rules with a length of 2 (i.e., consisting of two items, one on the left-hand side and one on the right-hand side) and 2 rules with a length of 3 (i.e., consisting of three items, two on the left-hand side and one on the right-hand side).
The itemsets behind the rules appear in a minimum of about 1% and a maximum of about 6% of the transactions. The rules hold in about 34% to 73% of the cases where the items on the left-hand side appear. The median lift is 1.0934, indicating that the items in the rules are, on average, weakly positively associated, although the minimum lift of 0.7497 shows that at least one rule describes a negative association.
High lift values indicate that the occurrence of the antecedent (left-hand side) of the rule is associated with a higher than expected occurrence of the consequent (right-hand side), after taking into account their support levels. So, rules with high lift values can be interpreted as indicating a strong relationship between the antecedent and the consequent, and are often considered to be the most interesting rules.
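In symbols, with supp and conf denoting support and confidence and X ⇒ Y a rule with antecedent X and consequent Y (notation chosen here for illustration), lift is usually defined as:

```latex
\[
\operatorname{lift}(X \Rightarrow Y)
  = \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
  = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)}
\]
```

A lift of 1 corresponds to X and Y occurring independently, values above 1 to a positive association, and values below 1 to a negative one.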
Accordingly, the rules with the highest lift value will be evaluated.
# select and inspect the top 10 rules with the highest lift values
inspect(head(sort(freq_rules, by ="lift"),10))
## lhs rhs support confidence lift itemset
## [1] {Toast} => {Coffee} 0.02585158 0.7296137 1.505000 4
## [2] {Spanish Brunch} => {Coffee} 0.01414234 0.6326531 1.304996 2
## [3] {Medialuna} => {Coffee} 0.03315085 0.5751979 1.186481 15
## [4] {Sandwich} => {Coffee} 0.04257908 0.5679513 1.171533 19
## [5] {Pastry} => {Coffee} 0.04896594 0.5590278 1.153126 23
## [6] {Alfajores} => {Coffee} 0.02250608 0.5522388 1.139122 7
## [7] {Tiffin} => {Coffee} 0.01064477 0.5468750 1.128058 1
## [8] {Scone} => {Coffee} 0.01855231 0.5422222 1.118461 3
## [9] {Cake} => {Coffee} 0.05687348 0.5389049 1.111618 26
## [10] {Juice} => {Coffee} 0.02144161 0.5300752 1.093405 11
The first rule, Toast => Coffee, has a lift of 1.505. This means that customers who buy Toast are 1.505 times as likely to buy Coffee as customers in general. The support for this rule is 0.026, meaning that 2.6% of all transactions contain both Toast and Coffee. The confidence is 0.73, reflecting that 73% of the transactions that contain Toast also contain Coffee.
Similarly, the second rule, Spanish Brunch => Coffee, has a lift of 1.305, which means that customers who buy Spanish Brunch are 1.305 times as likely to buy Coffee as customers in general. The support for this rule is 0.014, so only 1.4% of all transactions contain both Spanish Brunch and Coffee. With a confidence of 0.63, 63% of the transactions that contain Spanish Brunch also contain Coffee.
The rest of the rules can be interpreted similarly. Note that the support for each rule is relatively low, showing that these item combinations are not very common. However, lift values above 1 suggest that these rules still describe a positive association in customer behavior.
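The three measures for Toast => Coffee can be recomputed by hand from the counts printed earlier: 6576 transactions, 233 containing Toast, 3188 containing Coffee, and 170 containing both. The variable names below are chosen for this illustration only.

```r
# Counts taken from the inspect(freq_items) output above
n_trans  <- 6576   # all transactions
n_toast  <- 233    # transactions containing Toast
n_coffee <- 3188   # transactions containing Coffee
n_both   <- 170    # transactions containing both Toast and Coffee

support    <- n_both / n_trans                 # share of all baskets with both items
confidence <- n_both / n_toast                 # share of Toast baskets that also hold Coffee
lift       <- confidence / (n_coffee / n_trans)  # confidence relative to Coffee's base rate

round(c(support = support, confidence = confidence, lift = lift), 4)
```

The rounded values should match the first row of the rule table (support 0.0259, confidence 0.7296, lift 1.505).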
The below plot is created to visualize the relationship between the support, confidence, and lift measures of the association rules generated from the transaction dataset.
plot(freq_rules, measure=c("support", "confidence"), shading="lift")
The plot above shows that almost all of the rules have low support, less than 5%.
# Sort the association rules by support and then by confidence and show the first 15 rules
inspect(head(sort(sort(freq_rules, by ="confidence"),by="support"),15))
## lhs rhs support confidence lift itemset
## [1] {Cake} => {Coffee} 0.05687348 0.5389049 1.1116181 26
## [2] {Tea} => {Coffee} 0.05200730 0.3634431 0.7496870 29
## [3] {Pastry} => {Coffee} 0.04896594 0.5590278 1.1531263 23
## [4] {Sandwich} => {Coffee} 0.04257908 0.5679513 1.1715332 19
## [5] {Medialuna} => {Coffee} 0.03315085 0.5751979 1.1864810 15
## [6] {Cookies} => {Coffee} 0.02995742 0.5267380 1.0865210 17
## [7] {Pastry} => {Bread} 0.02980535 0.3402778 1.0427151 24
## [8] {Hot chocolate} => {Coffee} 0.02737226 0.5263158 1.0856501 12
## [9] {Toast} => {Coffee} 0.02585158 0.7296137 1.5050000 4
## [10] {Alfajores} => {Coffee} 0.02250608 0.5522388 1.1391225 7
## [11] {Juice} => {Coffee} 0.02144161 0.5300752 1.0934048 11
## [12] {Brownie} => {Coffee} 0.02098540 0.4758621 0.9815775 9
## [13] {Scone} => {Coffee} 0.01855231 0.5422222 1.1184609 3
## [14] {Muffin} => {Coffee} 0.01809611 0.4958333 1.0227729 5
## [15] {Soup} => {Coffee} 0.01718370 0.4870690 1.0046943 6
The top 3 rules by support can be interpreted as follows. The 1st rule, which has the highest support, shows that 53.9% of the baskets that contained Cake also contained Coffee; these items were bought together in 374 baskets. The 2nd rule shows that 36.3% of the baskets that contained Tea also contained Coffee, and these items were bought together 342 times. The 3rd rule shows that 55.9% of the baskets that contained Pastry also contained Coffee, and these items were bought together 322 times. Note that the itemset column is only the index of the corresponding itemset in freq_items, not a count.
Rule 9 in the list, Toast => Coffee, has the highest confidence among the top 15 rules (0.73), indicating that a customer who bought Toast is very likely to have also purchased Coffee. However, the support for this rule is relatively low, indicating that these items are not frequently purchased together.
# Plot a matrix of the 19 association rules sorted by support and confidence, and measured by lift
rules_for_plot = head(sort(sort(freq_rules, by ="confidence"),by="support"),19)
plot(rules_for_plot, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
## [1] "{Toast}" "{Spanish Brunch}" "{Medialuna}" "{Sandwich}"
## [5] "{Alfajores}" "{Tiffin}" "{Scone}" "{Cake}"
## [9] "{Pastry}" "{Juice}" "{Cookies}" "{Hot chocolate}"
## [13] "{Muffin}" "{Soup}" "{Brownie}" "{Cake,Tea}"
## [17] "{Bread,Pastry}" "{Tea}"
## Itemsets in Consequent (RHS)
## [1] "{Bread}" "{Coffee}"
We can see that the rule with the highest lift (1.505) is Toast => Coffee: customers who buy Toast are likely to also buy Coffee. This rule has a relatively high confidence (0.73) but a support of only 2.6%, that is, 170 of the 6576 baskets.
Other high lift rules include those involving Sandwich, Medialuna, and Pastry, which have confidence values above 0.5 and support values between 3% and 5%.
Another way to plot the rules for further analysis is the parallel coordinates plot.
#Create parallel coordinates plot
plot(rules_for_plot, method="paracoord")
Next, we analyze which products drive people to buy the two most frequent items, so that we can see which items tend to be associated with Coffee and Bread.
Apriori is used for rule induction here because its appearance argument makes it possible to restrict the right-hand side of the rules to a single item of interest.
# Use Apriori algorithm to generate association rules related to the purchase of coffee in our transaction dataset.
rules_coffee <- apriori(data = trans, parameter = list(supp = 0.03, conf = 0.3),
appearance = list(default = "lhs", rhs = "Coffee"), control = list(verbose = FALSE))
inspect(sort(rules_coffee, by='lift'))
## lhs rhs support confidence coverage lift count
## [1] {Medialuna} => {Coffee} 0.03315085 0.5751979 0.05763382 1.186481 218
## [2] {Sandwich} => {Coffee} 0.04257908 0.5679513 0.07496959 1.171533 280
## [3] {Pastry} => {Coffee} 0.04896594 0.5590278 0.08759124 1.153126 322
## [4] {Cake} => {Coffee} 0.05687348 0.5389049 0.10553528 1.111618 374
## [5] {} => {Coffee} 0.48479319 0.4847932 1.00000000 1.000000 3188
## [6] {Tea} => {Coffee} 0.05200730 0.3634431 0.14309611 0.749687 342
# Significant Rules
is.significant(rules_coffee, trans)
## [1] FALSE TRUE TRUE TRUE TRUE FALSE
is.superset(rules_coffee)
## 6 x 6 sparse Matrix of class "ngCMatrix"
## {Coffee} {Coffee,Medialuna} {Coffee,Sandwich}
## {Coffee} | . .
## {Coffee,Medialuna} | | .
## {Coffee,Sandwich} | . |
## {Coffee,Pastry} | . .
## {Cake,Coffee} | . .
## {Coffee,Tea} | . .
## {Coffee,Pastry} {Cake,Coffee} {Coffee,Tea}
## {Coffee} . . .
## {Coffee,Medialuna} . . .
## {Coffee,Sandwich} . . .
## {Coffee,Pastry} | . .
## {Cake,Coffee} . | .
## {Coffee,Tea} . . |
is.superset(rules_coffee, sparse = FALSE)
## {Coffee} {Coffee,Medialuna} {Coffee,Sandwich}
## {Coffee} TRUE FALSE FALSE
## {Coffee,Medialuna} TRUE TRUE FALSE
## {Coffee,Sandwich} TRUE FALSE TRUE
## {Coffee,Pastry} TRUE FALSE FALSE
## {Cake,Coffee} TRUE FALSE FALSE
## {Coffee,Tea} TRUE FALSE FALSE
## {Coffee,Pastry} {Cake,Coffee} {Coffee,Tea}
## {Coffee} FALSE FALSE FALSE
## {Coffee,Medialuna} FALSE FALSE FALSE
## {Coffee,Sandwich} FALSE FALSE FALSE
## {Coffee,Pastry} TRUE FALSE FALSE
## {Cake,Coffee} FALSE TRUE FALSE
## {Coffee,Tea} FALSE FALSE TRUE
Coffee is the most popular item by far and is paired with many other items. When we raise the support threshold to 3%, the most popular combinations are Coffee and Cake, Coffee and Pastry, and Coffee and Sandwich, which mostly corresponds to having a small snack or dessert along with a coffee.
# cex is not a control parameter of the graph method, so it is omitted here
plot(rules_coffee, method="graph")
# Use Apriori algorithm to generate association rules related to the purchase of bread in our transaction dataset.
rules_bread <- apriori(data = trans, parameter = list(supp = 0.005, conf = 0.3),
appearance = list(default = "lhs", rhs = "Bread"), control = list(verbose = FALSE))
inspect(sort(rules_bread, by="lift"))
## lhs rhs support confidence coverage lift
## [1] {Jam} => {Bread} 0.005474453 0.3711340 0.01475061 1.1372681
## [2] {Jammie Dodgers} => {Bread} 0.005018248 0.3548387 0.01414234 1.0873343
## [3] {Pastry} => {Bread} 0.029805353 0.3402778 0.08759124 1.0427151
## [4] {} => {Bread} 0.326338200 0.3263382 1.00000000 1.0000000
## [5] {Tiffin} => {Bread} 0.006082725 0.3125000 0.01946472 0.9575955
## count
## [1] 36
## [2] 33
## [3] 196
## [4] 2146
## [5] 40
is.significant(rules_bread, trans)
## [1] FALSE FALSE FALSE FALSE FALSE
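None of the rules with Bread as the consequent is statistically significant according to is.significant(). This means that, among the rules that meet the specified thresholds for support (0.5%) and confidence (30%), the data provide no evidence of an association between the antecedent items and Bread beyond what could be expected by chance.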
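To get a feeling for what such a significance check involves, the rule Jam => Bread can be tested by hand on its 2 x 2 contingency table. This is a sketch that assumes a one-sided Fisher's exact test, which is the test the arules documentation names as the default for is.significant(); the exact options the package applies (significance level, multiple-testing adjustment) may differ. The counts come from the outputs above: 6576 transactions, 97 with Jam, 2146 with Bread, 36 with both.

```r
# Counts taken from the outputs above
n_trans <- 6576   # all transactions
n_jam   <- 97     # transactions containing Jam
n_bread <- 2146   # transactions containing Bread
n_both  <- 36     # transactions containing both

# 2 x 2 contingency table: rows = Jam present/absent, columns = Bread present/absent
tab <- matrix(
  c(n_both,           n_jam - n_both,
    n_bread - n_both, n_trans - n_jam - n_bread + n_both),
  nrow = 2, byrow = TRUE,
  dimnames = list(Jam = c("yes", "no"), Bread = c("yes", "no"))
)

# One-sided test: is Bread over-represented among the baskets that contain Jam?
test <- fisher.test(tab, alternative = "greater")
test$p.value
```

A p-value above the chosen significance level would be in line with the FALSE reported for this rule.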
is.superset(rules_bread)
## 5 x 5 sparse Matrix of class "ngCMatrix"
## {Bread} {Bread,Jam} {Bread,Jammie Dodgers}
## {Bread} | . .
## {Bread,Jam} | | .
## {Bread,Jammie Dodgers} | . |
## {Bread,Tiffin} | . .
## {Bread,Pastry} | . .
## {Bread,Tiffin} {Bread,Pastry}
## {Bread} . .
## {Bread,Jam} . .
## {Bread,Jammie Dodgers} . .
## {Bread,Tiffin} | .
## {Bread,Pastry} . |
is.subset(rules_bread)
## 5 x 5 sparse Matrix of class "ngCMatrix"
## {Bread} {Bread,Jam} {Bread,Jammie Dodgers}
## {Bread} | | |
## {Bread,Jam} . | .
## {Bread,Jammie Dodgers} . . |
## {Bread,Tiffin} . . .
## {Bread,Pastry} . . .
## {Bread,Tiffin} {Bread,Pastry}
## {Bread} | |
## {Bread,Jam} . .
## {Bread,Jammie Dodgers} . .
## {Bread,Tiffin} | .
## {Bread,Pastry} . |
In the case of bread, it is a staple item that is likely to be purchased on a regular basis, and so its presence in a transaction may not be a reliable indicator of the potential presence of other items. Therefore, it’s possible that the lack of significant rules involving bread is due to the fact that it is not strongly associated with other items in the dataset.
# cex is not a control parameter of the graph method, so it is omitted here
plot(rules_bread, method="graph")
The Jaccard index is a statistical measure used to determine the similarity between two sets of data. It is calculated as the ratio of the intersection of the sets to the union of the sets. The resulting value ranges from 0 to 1, with 1 indicating that the sets are identical and 0 indicating that they share no common elements.
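Affinity, on the other hand, measures the strength of association between two items; in arules it is defined as the Jaccard similarity between the sets of transactions that contain each item. The dissimilarity() call below returns the Jaccard distance, that is, one minus this similarity, so the two matrices are complementary (for example, 0.875 and 0.125 for Bread and Coffee).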
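As a small worked example, the Jaccard similarity and distance for Bread and Coffee can be computed directly from the counts reported earlier (2146 transactions with Bread, 3188 with Coffee, 594 with both). The variable names are chosen for this illustration only.

```r
# Counts taken from the frequent itemset output above
n_bread  <- 2146   # transactions containing Bread
n_coffee <- 3188   # transactions containing Coffee
n_both   <- 594    # transactions containing both

# Jaccard similarity: size of the intersection divided by size of the union
jaccard_sim  <- n_both / (n_bread + n_coffee - n_both)
# Jaccard distance: one minus the similarity
jaccard_dist <- 1 - jaccard_sim

round(c(similarity = jaccard_sim, distance = jaccard_dist), 3)
```

The rounded results, 0.125 and 0.875, correspond to the Bread and Coffee entries of the affinity and dissimilarity matrices.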
trans.sel<-trans[,itemFrequency(trans)>0.05]
jac<-dissimilarity(trans.sel, which="items")
round(jac,digits=3)
## Bread Cake Coffee Cookies Hot chocolate Medialuna Pastry
## Cake 0.943
## Coffee 0.875 0.893
## Cookies 0.959 0.951 0.941
## Hot chocolate 0.967 0.930 0.946 0.946
## Medialuna 0.955 0.974 0.935 0.978 0.955
## Pastry 0.922 0.974 0.906 0.974 0.964 0.941
## Sandwich 0.956 0.957 0.918 0.978 0.969 0.984 0.991
## Tea 0.933 0.882 0.910 0.951 0.964 0.958 0.958
## Sandwich
## Cake
## Coffee
## Cookies
## Hot chocolate
## Medialuna
## Pastry
## Sandwich
## Tea 0.931
plot(hclust(jac, method = "ward.D2"), main = "Dendrogram for items")
a = affinity(trans.sel)
round(a, digits=3)
## An object of class "ar_similarity"
## Bread Cake Coffee Cookies Hot chocolate Medialuna Pastry
## Bread 0.000 0.057 0.125 0.041 0.033 0.045 0.078
## Cake 0.057 0.000 0.107 0.049 0.070 0.026 0.026
## Coffee 0.125 0.107 0.000 0.059 0.054 0.065 0.094
## Cookies 0.041 0.049 0.059 0.000 0.054 0.022 0.026
## Hot chocolate 0.033 0.070 0.054 0.054 0.000 0.045 0.036
## Medialuna 0.045 0.026 0.065 0.022 0.045 0.000 0.059
## Pastry 0.078 0.026 0.094 0.026 0.036 0.059 0.000
## Sandwich 0.044 0.043 0.082 0.022 0.031 0.016 0.009
## Tea 0.067 0.118 0.090 0.049 0.036 0.042 0.042
## Sandwich Tea
## Bread 0.044 0.067
## Cake 0.043 0.118
## Coffee 0.082 0.090
## Cookies 0.022 0.049
## Hot chocolate 0.031 0.036
## Medialuna 0.016 0.042
## Pastry 0.009 0.042
## Sandwich 0.000 0.069
## Tea 0.069 0.000
## Slot "method":
## [1] "Affinity"
par(mar=c(4,8,4,4))
image(a, axes=FALSE)
axis(1,at=seq(0,1,l=ncol(a)),labels=rownames(a),cex.axis=0.5)
axis(2,at=seq(0,1,l=ncol(a)),labels=rownames(a),cex.axis=0.6, las=2)
In conclusion, the market basket analysis conducted on the bread basket dataset showed that Coffee and Bread are the two most frequently purchased items, with Bread being a staple item for most customers. In addition, the analysis revealed that the purchase of snacks such as Cake, Sandwich, and Pastry is positively associated with the purchase of Coffee (lift between about 1.11 and 1.17). Therefore, it is recommended to implement a marketing strategy that encourages the purchase of coffee with snacks or reminds customers to purchase coffee when buying any snack. This strategy can potentially lead to an increase in sales and overall revenue for the bakery.