Name: Marwan Otrok

1. Introduction: Market Basket Analysis

Market basket analysis is a data mining technique used to identify relationships and patterns between products purchased together by customers. It involves analyzing transaction data from point-of-sale systems to discover which products are frequently bought together. The insights derived from market basket analysis can be used to optimize store layout, product placement, and promotions, as well as to identify new product development opportunities. It is a powerful tool for retailers and marketers looking to improve their sales and customer engagement.

2. Importing Libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(arulesViz)
## Warning: package 'arulesViz' was built under R version 4.2.2
## Loading required package: arules
## Warning: package 'arules' was built under R version 4.2.2
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write

3. Loading Data

This paper is based on a dataset from Kaggle: https://www.kaggle.com/code/nandinibagga/apriori-algorithm/data?select=bread+basket.csv. The dataset belongs to “The Bread Basket”, a bakery located in Edinburgh. The dataset has 20507 entries, over 9000 transactions, and 4 columns.

trans <-read.transactions("C:/Users/User/OneDrive/Desktop/bread basket (2).csv", format = "single", sep=",", header =TRUE, cols = c("Transaction","Item"))
# Summary of the transaction data
summary(trans)
## transactions as itemMatrix in sparse format with
##  6576 rows (elements/itemsets/transactions) and
##  102 columns (items) and a density of 0.01988962 
## 
## most frequent items:
##  Coffee   Bread     Tea    Cake  Pastry (Other) 
##    3188    2146     941     694     576    5796 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10 
## 2652 2155 1029  502  174   47   10    2    4    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.029   3.000  10.000 
## 
## includes extended item information - examples:
##                     labels
## 1               Adjustment
## 2 Afternoon with the baker
## 3                Alfajores
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2            10
## 3          1000
glimpse(trans)
## Formal class 'transactions' [package "arules"] with 3 slots
##   ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
##   ..@ itemInfo   :'data.frame':  102 obs. of  1 variable:
##   .. ..$ labels: chr [1:102] "Adjustment" "Afternoon with the baker" "Alfajores" "Argentina Night" ...
##   ..@ itemsetInfo:'data.frame':  6576 obs. of  1 variable:
##   .. ..$ transactionID: chr [1:6576] "1" "10" "1000" "1001" ...

There are 6576 transactions (or itemsets) and 102 unique items in the dataset. The density of 0.01988962 means that only about 1.99% of the cells in the transaction-by-item matrix are non-zero, which confirms that the matrix is sparse. There are 2652 transactions with only one item, 2155 with two items, and so on. The minimum transaction length is 1, the median is 2, and the maximum is 10. Most importantly, the summary shows the most frequent items: Coffee, Bread, Tea, Cake and Pastry.
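The density and the mean basket size reported by summary() can be cross-checked from the length distribution alone. The following is a minimal base R sketch; the variable names are illustrative and the counts are copied from the summary output above.

```r
# Length distribution copied from summary(trans)
sizes  <- 1:10
counts <- c(2652, 2155, 1029, 502, 174, 47, 10, 2, 4, 1)
n_items <- 102                       # number of distinct items (matrix columns)

n_trans   <- sum(counts)             # number of transactions (matrix rows)
n_nonzero <- sum(sizes * counts)     # total number of item occurrences

# Density = non-zero cells / all cells of the transaction-by-item matrix
density   <- n_nonzero / (n_trans * n_items)
mean_size <- n_nonzero / n_trans

print(c(transactions = n_trans, density = density, mean_size = mean_size))
```

Equivalently, the density is the mean basket size divided by the number of distinct items.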

df <- read.csv("C:/Users/User/OneDrive/Desktop/bread basket (2).csv")
View(df)

Let's plot a bar graph that shows the distribution of the number of items in each transaction (basket size).

The most frequent basket is of size 1 and the mean size is equal to almost 2.

group_basket = df %>% group_by(., Transaction) %>% summarise(basket_size=n())
basket_sizes = group_basket %>% group_by(.,basket_size) %>% summarise(count=n())

ggplot(basket_sizes, aes(x=basket_size, y=count)) + geom_bar(stat = "identity") + scale_x_continuous(breaks = seq(0, 80, by = 1))

As previously mentioned, ‘Coffee’ is the product that customers bought most often, followed by ‘Bread’. Let's plot the 20 most frequent items.

itemFrequencyPlot(trans,topN=20,type="absolute")

4. Association Rule Mining

In association rule mining, support and confidence are used as measures of the strength of a rule.
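For reference, the two measures can be written as follows. The notation is an assumption introduced here for illustration: T is the set of all transactions, and X and Y are disjoint itemsets.

```latex
\[
\mathrm{supp}(X) = \frac{\bigl|\{\, t \in T : X \subseteq t \,\}\bigr|}{|T|},
\qquad
\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}
\]
```

Support is the share of transactions that contain the itemset, and confidence is the share of transactions containing X that also contain Y.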

In general, the levels of support and confidence should be set high enough to filter out uninteresting or insignificant patterns, but not so high that interesting patterns are excluded. It is common to experiment with different levels of support and confidence to find the most interesting and useful patterns.

One way to identify appropriate levels of support and confidence is to plot the number of rules generated against the support and confidence thresholds and look for the point where raising either threshold causes a sharp drop in the number of rules. Around this point, the thresholds can be set to balance the number of rules generated against the importance of the patterns identified.
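The threshold search described above could be sketched as follows. This is only an illustration: the baskets are hypothetical toy data rather than the bakery file, the threshold grid is arbitrary, and apriori() is used here simply because it returns rules directly.

```r
library(arules)

# Hypothetical toy baskets (not the bakery data)
baskets <- list(
  c("Coffee", "Bread"),
  c("Coffee", "Cake"),
  c("Coffee", "Toast"),
  c("Bread", "Pastry"),
  c("Coffee", "Bread", "Pastry"),
  c("Tea", "Cake"),
  c("Coffee", "Tea", "Cake"),
  c("Coffee")
)
toy_trans <- as(baskets, "transactions")

# Every combination of candidate thresholds (arbitrary example values)
grid <- expand.grid(support = c(0.125, 0.25, 0.375),
                    confidence = c(0.3, 0.5, 0.7))

# Count the rules mined at each threshold combination
grid$n_rules <- mapply(function(s, cf) {
  rules <- apriori(toy_trans,
                   parameter = list(supp = s, conf = cf, minlen = 2),
                   control = list(verbose = FALSE))
  length(rules)
}, grid$support, grid$confidence)

print(grid)
```

Plotting n_rules against support, with one line per confidence level, then shows where the rule count drops off.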

# Generating frequent itemsets from our transaction dataset, with a minimum support threshold of 0.05 and a maximum itemset size of 10.
freq_items<-eclat(trans, parameter=list(supp=0.05, maxlen=10))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.05      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 328 
## 
## create itemset ... 
## set transactions ...[102 item(s), 6576 transaction(s)] done [0.00s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating bit matrix ... [9 row(s), 6576 column(s)] done [0.00s].
## writing  ... [12 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
# Print the frequent itemsets, then sort them by count in descending order
inspect(freq_items)
##      items           support    count
## [1]  {Cake, Coffee}  0.05687348  374 
## [2]  {Coffee, Tea}   0.05200730  342 
## [3]  {Bread, Coffee} 0.09032847  594 
## [4]  {Coffee}        0.48479319 3188 
## [5]  {Bread}         0.32633820 2146 
## [6]  {Tea}           0.14309611  941 
## [7]  {Cake}          0.10553528  694 
## [8]  {Pastry}        0.08759124  576 
## [9]  {Sandwich}      0.07496959  493 
## [10] {Cookies}       0.05687348  374 
## [11] {Medialuna}     0.05763382  379 
## [12] {Hot chocolate} 0.05200730  342
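# inspect() prints the itemsets again as a side effect; its invisibly returned value is stored as a data frame
freq_df <- as.data.frame(inspect(freq_items))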
##      items           support    count
## [1]  {Cake, Coffee}  0.05687348  374 
## [2]  {Coffee, Tea}   0.05200730  342 
## [3]  {Bread, Coffee} 0.09032847  594 
## [4]  {Coffee}        0.48479319 3188 
## [5]  {Bread}         0.32633820 2146 
## [6]  {Tea}           0.14309611  941 
## [7]  {Cake}          0.10553528  694 
## [8]  {Pastry}        0.08759124  576 
## [9]  {Sandwich}      0.07496959  493 
## [10] {Cookies}       0.05687348  374 
## [11] {Medialuna}     0.05763382  379 
## [12] {Hot chocolate} 0.05200730  342
freq_df %>% arrange(desc(count))
##                items    support count
## [4]         {Coffee} 0.48479319  3188
## [5]          {Bread} 0.32633820  2146
## [6]            {Tea} 0.14309611   941
## [7]           {Cake} 0.10553528   694
## [3]  {Bread, Coffee} 0.09032847   594
## [8]         {Pastry} 0.08759124   576
## [9]       {Sandwich} 0.07496959   493
## [11]     {Medialuna} 0.05763382   379
## [1]   {Cake, Coffee} 0.05687348   374
## [10]       {Cookies} 0.05687348   374
## [2]    {Coffee, Tea} 0.05200730   342
## [12] {Hot chocolate} 0.05200730   342

The frequent itemsets contain only one or two items. With a minimum support of 0.05, no itemset with more than two different items is frequent in this dataset.

The next step is to identify the strongest rules. To obtain itemsets with more than two items, and therefore a richer set of rules, the minimum support needs to be lowered.

# Lowering the minimum support level to 0.01 with the aim to include more useful and interesting patterns
freq_items<-eclat(trans, parameter=list(supp=0.01, maxlen=15)) 
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.01      1     15 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 65 
## 
## create itemset ... 
## set transactions ...[102 item(s), 6576 transaction(s)] done [0.00s].
## sorting and recoding items ... [30 item(s)] done [0.00s].
## creating sparse bit matrix ... [30 row(s), 6576 column(s)] done [0.00s].
## writing  ... [61 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
# Print the frequent itemsets, then sort them by count in descending order
inspect(freq_items)
##      items                    support    count
## [1]  {Coffee, Tiffin}         0.01064477   70 
## [2]  {Coffee, Spanish Brunch} 0.01414234   93 
## [3]  {Coffee, Scone}          0.01855231  122 
## [4]  {Coffee, Toast}          0.02585158  170 
## [5]  {Coffee, Muffin}         0.01809611  119 
## [6]  {Coffee, Soup}           0.01718370  113 
## [7]  {Alfajores, Coffee}      0.02250608  148 
## [8]  {Alfajores, Bread}       0.01125304   74 
## [9]  {Brownie, Coffee}        0.02098540  138 
## [10] {Bread, Brownie}         0.01186131   78 
## [11] {Coffee, Juice}          0.02144161  141 
## [12] {Coffee, Hot chocolate}  0.02737226  180 
## [13] {Bread, Hot chocolate}   0.01201338   79 
## [14] {Cake, Hot chocolate}    0.01034063   68 
## [15] {Coffee, Medialuna}      0.03315085  218 
## [16] {Bread, Medialuna}       0.01642336  108 
## [17] {Coffee, Cookies}        0.02995742  197 
## [18] {Bread, Cookies}         0.01520681  100 
## [19] {Coffee, Sandwich}       0.04257908  280 
## [20] {Bread, Sandwich}        0.01703163  112 
## [21] {Sandwich, Tea}          0.01414234   93 
## [22] {Bread, Coffee, Pastry}  0.01140511   75 
## [23] {Coffee, Pastry}         0.04896594  322 
## [24] {Bread, Pastry}          0.02980535  196 
## [25] {Cake, Coffee, Tea}      0.01125304   74 
## [26] {Cake, Coffee}           0.05687348  374 
## [27] {Bread, Cake}            0.02341849  154 
## [28] {Cake, Tea}              0.02630779  173 
## [29] {Coffee, Tea}            0.05200730  342 
## [30] {Bread, Tea}             0.02965328  195 
## [31] {Bread, Coffee}          0.09032847  594 
## [32] {Coffee}                 0.48479319 3188 
## [33] {Bread}                  0.32633820 2146 
## [34] {Tea}                    0.14309611  941 
## [35] {Cake}                   0.10553528  694 
## [36] {Pastry}                 0.08759124  576 
## [37] {Sandwich}               0.07496959  493 
## [38] {Cookies}                0.05687348  374 
## [39] {Medialuna}              0.05763382  379 
## [40] {Hot chocolate}          0.05200730  342 
## [41] {Juice}                  0.04045012  266 
## [42] {Brownie}                0.04409976  290 
## [43] {Alfajores}              0.04075426  268 
## [44] {Soup}                   0.03527981  232 
## [45] {Muffin}                 0.03649635  240 
## [46] {Toast}                  0.03543187  233 
## [47] {Scone}                  0.03421533  225 
## [48] {Spanish Brunch}         0.02235401  147 
## [49] {Truffles}               0.02311436  152 
## [50] {Farm House}             0.03832117  252 
## [51] {Tiffin}                 0.01946472  128 
## [52] {Scandinavian}           0.02934915  193 
## [53] {Coke}                   0.02083333  137 
## [54] {Mineral water}          0.01657543  109 
## [55] {Chicken Stew}           0.01475061   97 
## [56] {Jammie Dodgers}         0.01414234   93 
## [57] {Salad}                  0.01292579   85 
## [58] {Baguette}               0.01992092  131 
## [59] {Jam}                    0.01475061   97 
## [60] {Hearty & Seasonal}      0.01003650   66 
## [61] {Fudge}                  0.01246959   82
freq_df <- as.data.frame(inspect(freq_items))
##      items                    support    count
## [1]  {Coffee, Tiffin}         0.01064477   70 
## [2]  {Coffee, Spanish Brunch} 0.01414234   93 
## [3]  {Coffee, Scone}          0.01855231  122 
## [4]  {Coffee, Toast}          0.02585158  170 
## [5]  {Coffee, Muffin}         0.01809611  119 
## [6]  {Coffee, Soup}           0.01718370  113 
## [7]  {Alfajores, Coffee}      0.02250608  148 
## [8]  {Alfajores, Bread}       0.01125304   74 
## [9]  {Brownie, Coffee}        0.02098540  138 
## [10] {Bread, Brownie}         0.01186131   78 
## [11] {Coffee, Juice}          0.02144161  141 
## [12] {Coffee, Hot chocolate}  0.02737226  180 
## [13] {Bread, Hot chocolate}   0.01201338   79 
## [14] {Cake, Hot chocolate}    0.01034063   68 
## [15] {Coffee, Medialuna}      0.03315085  218 
## [16] {Bread, Medialuna}       0.01642336  108 
## [17] {Coffee, Cookies}        0.02995742  197 
## [18] {Bread, Cookies}         0.01520681  100 
## [19] {Coffee, Sandwich}       0.04257908  280 
## [20] {Bread, Sandwich}        0.01703163  112 
## [21] {Sandwich, Tea}          0.01414234   93 
## [22] {Bread, Coffee, Pastry}  0.01140511   75 
## [23] {Coffee, Pastry}         0.04896594  322 
## [24] {Bread, Pastry}          0.02980535  196 
## [25] {Cake, Coffee, Tea}      0.01125304   74 
## [26] {Cake, Coffee}           0.05687348  374 
## [27] {Bread, Cake}            0.02341849  154 
## [28] {Cake, Tea}              0.02630779  173 
## [29] {Coffee, Tea}            0.05200730  342 
## [30] {Bread, Tea}             0.02965328  195 
## [31] {Bread, Coffee}          0.09032847  594 
## [32] {Coffee}                 0.48479319 3188 
## [33] {Bread}                  0.32633820 2146 
## [34] {Tea}                    0.14309611  941 
## [35] {Cake}                   0.10553528  694 
## [36] {Pastry}                 0.08759124  576 
## [37] {Sandwich}               0.07496959  493 
## [38] {Cookies}                0.05687348  374 
## [39] {Medialuna}              0.05763382  379 
## [40] {Hot chocolate}          0.05200730  342 
## [41] {Juice}                  0.04045012  266 
## [42] {Brownie}                0.04409976  290 
## [43] {Alfajores}              0.04075426  268 
## [44] {Soup}                   0.03527981  232 
## [45] {Muffin}                 0.03649635  240 
## [46] {Toast}                  0.03543187  233 
## [47] {Scone}                  0.03421533  225 
## [48] {Spanish Brunch}         0.02235401  147 
## [49] {Truffles}               0.02311436  152 
## [50] {Farm House}             0.03832117  252 
## [51] {Tiffin}                 0.01946472  128 
## [52] {Scandinavian}           0.02934915  193 
## [53] {Coke}                   0.02083333  137 
## [54] {Mineral water}          0.01657543  109 
## [55] {Chicken Stew}           0.01475061   97 
## [56] {Jammie Dodgers}         0.01414234   93 
## [57] {Salad}                  0.01292579   85 
## [58] {Baguette}               0.01992092  131 
## [59] {Jam}                    0.01475061   97 
## [60] {Hearty & Seasonal}      0.01003650   66 
## [61] {Fudge}                  0.01246959   82
freq_df %>% arrange(desc(count))
##                         items    support count
## [32]                 {Coffee} 0.48479319  3188
## [33]                  {Bread} 0.32633820  2146
## [34]                    {Tea} 0.14309611   941
## [35]                   {Cake} 0.10553528   694
## [31]          {Bread, Coffee} 0.09032847   594
## [36]                 {Pastry} 0.08759124   576
## [37]               {Sandwich} 0.07496959   493
## [39]              {Medialuna} 0.05763382   379
## [26]           {Cake, Coffee} 0.05687348   374
## [38]                {Cookies} 0.05687348   374
## [29]            {Coffee, Tea} 0.05200730   342
## [40]          {Hot chocolate} 0.05200730   342
## [23]         {Coffee, Pastry} 0.04896594   322
## [42]                {Brownie} 0.04409976   290
## [19]       {Coffee, Sandwich} 0.04257908   280
## [43]              {Alfajores} 0.04075426   268
## [41]                  {Juice} 0.04045012   266
## [50]             {Farm House} 0.03832117   252
## [45]                 {Muffin} 0.03649635   240
## [46]                  {Toast} 0.03543187   233
## [44]                   {Soup} 0.03527981   232
## [47]                  {Scone} 0.03421533   225
## [15]      {Coffee, Medialuna} 0.03315085   218
## [17]        {Coffee, Cookies} 0.02995742   197
## [24]          {Bread, Pastry} 0.02980535   196
## [30]             {Bread, Tea} 0.02965328   195
## [52]           {Scandinavian} 0.02934915   193
## [12]  {Coffee, Hot chocolate} 0.02737226   180
## [28]              {Cake, Tea} 0.02630779   173
## [4]           {Coffee, Toast} 0.02585158   170
## [27]            {Bread, Cake} 0.02341849   154
## [49]               {Truffles} 0.02311436   152
## [7]       {Alfajores, Coffee} 0.02250608   148
## [48]         {Spanish Brunch} 0.02235401   147
## [11]          {Coffee, Juice} 0.02144161   141
## [9]         {Brownie, Coffee} 0.02098540   138
## [53]                   {Coke} 0.02083333   137
## [58]               {Baguette} 0.01992092   131
## [51]                 {Tiffin} 0.01946472   128
## [3]           {Coffee, Scone} 0.01855231   122
## [5]          {Coffee, Muffin} 0.01809611   119
## [6]            {Coffee, Soup} 0.01718370   113
## [20]        {Bread, Sandwich} 0.01703163   112
## [54]          {Mineral water} 0.01657543   109
## [16]       {Bread, Medialuna} 0.01642336   108
## [18]         {Bread, Cookies} 0.01520681   100
## [55]           {Chicken Stew} 0.01475061    97
## [59]                    {Jam} 0.01475061    97
## [2]  {Coffee, Spanish Brunch} 0.01414234    93
## [21]          {Sandwich, Tea} 0.01414234    93
## [56]         {Jammie Dodgers} 0.01414234    93
## [57]                  {Salad} 0.01292579    85
## [61]                  {Fudge} 0.01246959    82
## [13]   {Bread, Hot chocolate} 0.01201338    79
## [10]         {Bread, Brownie} 0.01186131    78
## [22]  {Bread, Coffee, Pastry} 0.01140511    75
## [8]        {Alfajores, Bread} 0.01125304    74
## [25]      {Cake, Coffee, Tea} 0.01125304    74
## [1]          {Coffee, Tiffin} 0.01064477    70
## [14]    {Cake, Hot chocolate} 0.01034063    68
## [60]      {Hearty & Seasonal} 0.01003650    66
# Create association rules from the frequent itemsets
freq_rules<-ruleInduction(freq_items, trans, confidence=0.3)
# Provide a summary of the generated rules
summary(freq_rules)
## set of 19 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
## 17  2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.105   2.000   3.000 
## 
## summary of quality measures:
##     support          confidence          lift           itemset     
##  Min.   :0.01064   Min.   :0.3403   Min.   :0.7497   Min.   : 1.00  
##  1st Qu.:0.01764   1st Qu.:0.4815   1st Qu.:1.0137   1st Qu.: 5.50  
##  Median :0.02251   Median :0.5301   Median :1.0934   Median :12.00  
##  Mean   :0.02699   Mean   :0.5158   Mean   :1.0820   Mean   :13.68  
##  3rd Qu.:0.03155   3rd Qu.:0.5556   3rd Qu.:1.1461   3rd Qu.:22.50  
##  Max.   :0.05687   Max.   :0.7296   Max.   :1.5050   Max.   :29.00  
## 
## mining info:
##   data ntransactions support
##  trans          6576    0.01
##                                                             call confidence
##  eclat(data = trans, parameter = list(supp = 0.01, maxlen = 15))        0.3

The rule length distribution indicates that there are 17 rules with a length of 2 (i.e., consisting of two items, one on the left-hand side and one on the right-hand side) and 2 rules with a length of 3 (i.e., consisting of three items, two on the left-hand side and one on the right-hand side).

The items in the rules appear together in a minimum of about 1% and a maximum of about 6% of the transactions. The rules are correct in about 34% to 73% of the cases where the items on the left-hand side appear. The median lift is 1.0934, indicating that the items in the rules are, on average, only weakly positively associated.

4.1 Top Association Rules

High lift values indicate that the occurrence of the antecedent (left-hand side) of the rule is associated with a higher than expected occurrence of the consequent (right-hand side), after taking into account their support levels. So, rules with high lift values can be interpreted as indicating a strong relationship between the antecedent and the consequent, and are often considered to be the most interesting rules.
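Written out, lift compares the observed co-occurrence of antecedent and consequent with what would be expected if they were independent. The notation is an assumption introduced here: supp(X) is the share of transactions containing itemset X, and conf(X ⇒ Y) is the share of transactions containing X that also contain Y.

```latex
\[
\mathrm{lift}(X \Rightarrow Y)
  = \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)}
  = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)}
\]
```

A lift of 1 corresponds to independence, values above 1 to a positive association, and values below 1 to a negative association.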

Accordingly, the rules with the highest lift value will be evaluated.

# select and inspect the top 10 rules with the highest lift values
inspect(head(sort(freq_rules, by ="lift"),10))
##      lhs                 rhs      support    confidence lift     itemset
## [1]  {Toast}          => {Coffee} 0.02585158 0.7296137  1.505000  4     
## [2]  {Spanish Brunch} => {Coffee} 0.01414234 0.6326531  1.304996  2     
## [3]  {Medialuna}      => {Coffee} 0.03315085 0.5751979  1.186481 15     
## [4]  {Sandwich}       => {Coffee} 0.04257908 0.5679513  1.171533 19     
## [5]  {Pastry}         => {Coffee} 0.04896594 0.5590278  1.153126 23     
## [6]  {Alfajores}      => {Coffee} 0.02250608 0.5522388  1.139122  7     
## [7]  {Tiffin}         => {Coffee} 0.01064477 0.5468750  1.128058  1     
## [8]  {Scone}          => {Coffee} 0.01855231 0.5422222  1.118461  3     
## [9]  {Cake}           => {Coffee} 0.05687348 0.5389049  1.111618 26     
## [10] {Juice}          => {Coffee} 0.02144161 0.5300752  1.093405 11

The first rule, Toast => Coffee, has a lift of 1.505. This means that customers who buy toast are 1.505 times as likely to buy coffee as customers in general. The support for this rule is 0.026, meaning that 2.6% of all transactions contain both toast and coffee. The confidence is 0.73, reflecting that 73% of customers who buy toast also buy coffee.

Similarly, the second rule, Spanish Brunch => Coffee, has a lift of 1.305, which means that customers who buy Spanish brunch are 1.305 times as likely to buy coffee as customers in general. The support for this rule is 0.014, so only 1.4% of all transactions contain both Spanish brunch and coffee. With a confidence of 0.63, 63% of customers who buy Spanish brunch also buy coffee.
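The figures for these two rules can be reproduced by hand from the itemset counts listed earlier (170 baskets with Toast and Coffee out of 233 with Toast, 93 with Spanish Brunch and Coffee out of 147 with Spanish Brunch, 3188 with Coffee, 6576 in total). The helper rule_metrics() below is a hypothetical name used only for this sketch; it is not part of arules.

```r
# Recompute support, confidence and lift for a rule lhs => rhs from raw counts
rule_metrics <- function(n_both, n_lhs, n_rhs, n_total) {
  support    <- n_both / n_total            # share of baskets with lhs and rhs
  confidence <- n_both / n_lhs              # share of lhs baskets that also hold rhs
  lift       <- confidence / (n_rhs / n_total)
  c(support = support, confidence = confidence, lift = lift)
}

n_total  <- 6576   # all transactions
n_coffee <- 3188   # transactions containing Coffee

toast  <- rule_metrics(n_both = 170, n_lhs = 233, n_rhs = n_coffee, n_total = n_total)
brunch <- rule_metrics(n_both = 93,  n_lhs = 147, n_rhs = n_coffee, n_total = n_total)

print(round(rbind(Toast = toast, `Spanish Brunch` = brunch), 4))
```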

The rest of the rules can be interpreted similarly. Note that the support for each rule is relatively low, so each rule applies to only a small share of transactions. However, lift values above 1 suggest that these pairings occur more often than chance alone would predict.

The below plot is created to visualize the relationship between the support, confidence, and lift measures of the association rules generated from the transaction dataset.

plot(freq_rules, measure=c("support", "confidence"), shading="lift")

The above plot shows that almost all of the rules have a low support of less than 5%.

# Sort the association rules by support and then by confidence and show the first 15 rules
inspect(head(sort(sort(freq_rules, by ="confidence"),by="support"),15))
##      lhs                rhs      support    confidence lift      itemset
## [1]  {Cake}          => {Coffee} 0.05687348 0.5389049  1.1116181 26     
## [2]  {Tea}           => {Coffee} 0.05200730 0.3634431  0.7496870 29     
## [3]  {Pastry}        => {Coffee} 0.04896594 0.5590278  1.1531263 23     
## [4]  {Sandwich}      => {Coffee} 0.04257908 0.5679513  1.1715332 19     
## [5]  {Medialuna}     => {Coffee} 0.03315085 0.5751979  1.1864810 15     
## [6]  {Cookies}       => {Coffee} 0.02995742 0.5267380  1.0865210 17     
## [7]  {Pastry}        => {Bread}  0.02980535 0.3402778  1.0427151 24     
## [8]  {Hot chocolate} => {Coffee} 0.02737226 0.5263158  1.0856501 12     
## [9]  {Toast}         => {Coffee} 0.02585158 0.7296137  1.5050000  4     
## [10] {Alfajores}     => {Coffee} 0.02250608 0.5522388  1.1391225  7     
## [11] {Juice}         => {Coffee} 0.02144161 0.5300752  1.0934048 11     
## [12] {Brownie}       => {Coffee} 0.02098540 0.4758621  0.9815775  9     
## [13] {Scone}         => {Coffee} 0.01855231 0.5422222  1.1184609  3     
## [14] {Muffin}        => {Coffee} 0.01809611 0.4958333  1.0227729  5     
## [15] {Soup}          => {Coffee} 0.01718370 0.4870690  1.0046943  6

The top 3 rules by support can be interpreted as follows. The 1st rule, with the highest support, shows that 53.9% of the baskets that contained Cake also contained Coffee; these items were bought together 374 times. The 2nd rule shows that 36.3% of the baskets that contained Tea also contained Coffee; these items were bought together 342 times, but the lift of 0.75 means that Tea buyers are less likely than the average customer to buy Coffee. The 3rd rule shows that 55.9% of the baskets that contained Pastry also contained Coffee; these items were bought together 322 times. Note that the itemset column is only the index of the source itemset in freq_items, not a purchase count.

Rule 9 in the list, Toast => Coffee, has the highest confidence among the top 15 rules, indicating that a customer who bought Toast is very likely to have also purchased Coffee. However, the support for this rule is relatively low, indicating that these items are not frequently purchased together.

# Plot a matrix of the 19 association rules sorted by support and confidence, and measured by lift
rules_for_plot = head(sort(sort(freq_rules, by ="confidence"),by="support"),19)
plot(rules_for_plot, method="matrix", measure="lift")
## Itemsets in Antecedent (LHS)
##  [1] "{Toast}"          "{Spanish Brunch}" "{Medialuna}"      "{Sandwich}"      
##  [5] "{Alfajores}"      "{Tiffin}"         "{Scone}"          "{Cake}"          
##  [9] "{Pastry}"         "{Juice}"          "{Cookies}"        "{Hot chocolate}" 
## [13] "{Muffin}"         "{Soup}"           "{Brownie}"        "{Cake,Tea}"      
## [17] "{Bread,Pastry}"   "{Tea}"           
## Itemsets in Consequent (RHS)
## [1] "{Bread}"  "{Coffee}"

We can see that the association rule with the highest lift (1.505) is Toast => Coffee: customers who buy Toast are likely to also buy Coffee. This rule has a relatively high confidence (0.73) but a support of only 2.6%, so it applies to a small share of baskets.

Other high lift rules include those involving Sandwich, Medialuna, and Pastry, which have confidence values above 0.5 and support values between 3% and 5%.

Another way to plot the rules for further analysis is the Parallel Coordinates Plot.

#Create parallel coordinates plot
plot(rules_for_plot, method="paracoord")

5. Rule Induction

We will analyze which products drive people to buy the two most frequent items, so that we can see which items tend to be associated with coffee and bread.

Apriori will be used for rule induction because apriori() mines rules directly from the transaction dataset and, through its appearance argument, lets us restrict the consequent (right-hand side) to a chosen item.

5.1 Coffee Rules

# Use Apriori algorithm to generate association rules related to the purchase of coffee in our transaction dataset.
rules_coffee <- apriori(data = trans, parameter = list(supp = 0.03, conf = 0.3),
                        appearance = list(default = "lhs", rhs = "Coffee"), control = list(verbose = FALSE))
inspect(sort(rules_coffee, by='lift'))
##     lhs            rhs      support    confidence coverage   lift     count
## [1] {Medialuna} => {Coffee} 0.03315085 0.5751979  0.05763382 1.186481  218 
## [2] {Sandwich}  => {Coffee} 0.04257908 0.5679513  0.07496959 1.171533  280 
## [3] {Pastry}    => {Coffee} 0.04896594 0.5590278  0.08759124 1.153126  322 
## [4] {Cake}      => {Coffee} 0.05687348 0.5389049  0.10553528 1.111618  374 
## [5] {}          => {Coffee} 0.48479319 0.4847932  1.00000000 1.000000 3188 
## [6] {Tea}       => {Coffee} 0.05200730 0.3634431  0.14309611 0.749687  342
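# Test each rule for statistical significance; results follow the unsorted order of rules_coffee, not the lift-sorted order above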
is.significant(rules_coffee, trans)
## [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE
is.superset(rules_coffee)
## 6 x 6 sparse Matrix of class "ngCMatrix"
##                    {Coffee} {Coffee,Medialuna} {Coffee,Sandwich}
## {Coffee}                  |                  .                 .
## {Coffee,Medialuna}        |                  |                 .
## {Coffee,Sandwich}         |                  .                 |
## {Coffee,Pastry}           |                  .                 .
## {Cake,Coffee}             |                  .                 .
## {Coffee,Tea}              |                  .                 .
##                    {Coffee,Pastry} {Cake,Coffee} {Coffee,Tea}
## {Coffee}                         .             .            .
## {Coffee,Medialuna}               .             .            .
## {Coffee,Sandwich}                .             .            .
## {Coffee,Pastry}                  |             .            .
## {Cake,Coffee}                    .             |            .
## {Coffee,Tea}                     .             .            |
is.superset(rules_coffee, sparse = FALSE)
##                    {Coffee} {Coffee,Medialuna} {Coffee,Sandwich}
## {Coffee}               TRUE              FALSE             FALSE
## {Coffee,Medialuna}     TRUE               TRUE             FALSE
## {Coffee,Sandwich}      TRUE              FALSE              TRUE
## {Coffee,Pastry}        TRUE              FALSE             FALSE
## {Cake,Coffee}          TRUE              FALSE             FALSE
## {Coffee,Tea}           TRUE              FALSE             FALSE
##                    {Coffee,Pastry} {Cake,Coffee} {Coffee,Tea}
## {Coffee}                     FALSE         FALSE        FALSE
## {Coffee,Medialuna}           FALSE         FALSE        FALSE
## {Coffee,Sandwich}            FALSE         FALSE        FALSE
## {Coffee,Pastry}               TRUE         FALSE        FALSE
## {Cake,Coffee}                FALSE          TRUE        FALSE
## {Coffee,Tea}                 FALSE         FALSE         TRUE

Coffee is by far the most popular item and is paired with many other items. With the support threshold raised to 3%, the most popular combinations are Coffee and Cake, Coffee and Pastry, and Coffee and Sandwich, which mostly reflect a small snack or dessert eaten alongside a coffee.

plot(rules_coffee, method="graph")

5.2 Bread Rules

# Use Apriori algorithm to generate association rules related to the purchase of bread in our transaction dataset.
rules_bread <- apriori(data = trans, parameter = list(supp = 0.005, conf = 0.3),
                       appearance = list(default = "lhs", rhs = "Bread"), control = list(verbose = FALSE))
inspect(sort(rules_bread, by="lift"))
##     lhs                 rhs     support     confidence coverage   lift     
## [1] {Jam}            => {Bread} 0.005474453 0.3711340  0.01475061 1.1372681
## [2] {Jammie Dodgers} => {Bread} 0.005018248 0.3548387  0.01414234 1.0873343
## [3] {Pastry}         => {Bread} 0.029805353 0.3402778  0.08759124 1.0427151
## [4] {}               => {Bread} 0.326338200 0.3263382  1.00000000 1.0000000
## [5] {Tiffin}         => {Bread} 0.006082725 0.3125000  0.01946472 0.9575955
##     count
## [1]   36 
## [2]   33 
## [3]  196 
## [4] 2146 
## [5]   40
is.significant(rules_bread, trans)
## [1] FALSE FALSE FALSE FALSE FALSE

None of the rules with Bread as the consequent is significant. This means that no item shows a strong association with Bread at the specified thresholds for support (0.5%) and confidence (30%).
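To make the significance check more tangible, the sketch below runs a one-sided Fisher's exact test in base R on the 2x2 contingency table of a rule, built from counts reported earlier. It assumes this is comparable to what is.significant() does by default (Fisher's exact test at a significance level of 0.01); the exact defaults, including any multiple-comparison adjustment, may differ between arules versions. The helper rule_p_value() is a hypothetical name used only here.

```r
# One-sided Fisher's exact test for a rule lhs => rhs, from raw counts
rule_p_value <- function(n_both, n_lhs, n_rhs, n_total) {
  # Rows: lhs present / absent; columns: rhs present / absent
  tab <- matrix(c(n_both,         n_lhs - n_both,
                  n_rhs - n_both, n_total - n_lhs - n_rhs + n_both),
                nrow = 2, byrow = TRUE)
  fisher.test(tab, alternative = "greater")$p.value
}

n_total <- 6576

# {Pastry} => {Bread}: 196 joint baskets, 576 with Pastry, 2146 with Bread
p_pastry_bread <- rule_p_value(196, 576, 2146, n_total)

# {Medialuna} => {Coffee}: 218 joint baskets, 379 with Medialuna, 3188 with Coffee
p_medialuna_coffee <- rule_p_value(218, 379, 3188, n_total)

alpha <- 0.01  # assumed significance level
print(c(pastry_bread = p_pastry_bread < alpha,
        medialuna_coffee = p_medialuna_coffee < alpha))
```

Under these assumptions, a rule whose confidence is only slightly above the baseline share of the consequent, such as {Pastry} => {Bread} (34.0% against 32.6%), does not clear the threshold even with nearly 200 joint purchases.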

is.superset(rules_bread)
## 5 x 5 sparse Matrix of class "ngCMatrix"
##                        {Bread} {Bread,Jam} {Bread,Jammie Dodgers}
## {Bread}                      |           .                      .
## {Bread,Jam}                  |           |                      .
## {Bread,Jammie Dodgers}       |           .                      |
## {Bread,Tiffin}               |           .                      .
## {Bread,Pastry}               |           .                      .
##                        {Bread,Tiffin} {Bread,Pastry}
## {Bread}                             .              .
## {Bread,Jam}                         .              .
## {Bread,Jammie Dodgers}              .              .
## {Bread,Tiffin}                      |              .
## {Bread,Pastry}                      .              |
is.subset(rules_bread)
## 5 x 5 sparse Matrix of class "ngCMatrix"
##                        {Bread} {Bread,Jam} {Bread,Jammie Dodgers}
## {Bread}                      |           |                      |
## {Bread,Jam}                  .           |                      .
## {Bread,Jammie Dodgers}       .           .                      |
## {Bread,Tiffin}               .           .                      .
## {Bread,Pastry}               .           .                      .
##                        {Bread,Tiffin} {Bread,Pastry}
## {Bread}                             |              |
## {Bread,Jam}                         .              .
## {Bread,Jammie Dodgers}              .              .
## {Bread,Tiffin}                      |              .
## {Bread,Pastry}                      .              |

Bread is a staple item that customers buy regularly: rule [4] shows that it appears in about 32.6% of all transactions, regardless of what else is in the basket. A confidence of 31% to 37% for a rule ending in Bread is therefore roughly what chance alone would produce, which is why every lift is close to 1. The lack of significant rules suggests that bread purchases are largely independent of the other items in the dataset.
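To make the link between confidence, support, and lift explicit, the following sketch recomputes the lift column from the numbers printed by inspect(). The data frame `rules` is only an illustrative container for the copied values, not an arules object.

```r
# Values copied from the inspect() output above
support_bread <- 0.3263382   # rule [4]: share of all baskets that contain Bread
rules <- data.frame(
  lhs        = c("Jam", "Jammie Dodgers", "Pastry", "Tiffin"),
  confidence = c(0.3711340, 0.3548387, 0.3402778, 0.3125000)
)

# lift = confidence(lhs => Bread) / support(Bread)
rules$lift <- rules$confidence / support_bread
rules
```

A lift of 1 means the left-hand item does not change the probability of buying bread at all, so values between roughly 0.96 and 1.14 indicate a very weak relationship.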

# Draw the bread rules as a graph. cex is not a control parameter of this plot method, so it is omitted.
plot(rules_bread, method="graph")

6. Dissimilarity measures

The Jaccard index is a statistical measure of the similarity between two sets. It is calculated as the size of the intersection of the sets divided by the size of their union. The resulting value ranges from 0 to 1, with 1 indicating that the sets are identical and 0 indicating that they share no common elements. The dissimilarity() function used in section 6.1 reports the Jaccard distance, that is, 1 minus the index, so values close to 1 mean that two items are rarely bought together.
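As a small illustration of the definition, the sketch below computes the Jaccard index and the corresponding distance for two made-up baskets. The helper `jaccard_index()` and the basket contents are hypothetical and are not part of the Bread Basket data.

```r
# Jaccard index of two sets: |intersection| / |union|
jaccard_index <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Two hypothetical baskets
basket_1 <- c("Coffee", "Cake", "Bread")
basket_2 <- c("Coffee", "Bread", "Tea", "Pastry")

idx <- jaccard_index(basket_1, basket_2)  # 2 shared items / 5 distinct items
idx
1 - idx                                   # the corresponding Jaccard distance
```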

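Affinity measures the strength of association between two items: the number of transactions that contain both items divided by the number of transactions that contain at least one of them. For a pair of items this is the same as the Jaccard similarity, which is why each off-diagonal value in section 6.2 equals 1 minus the corresponding dissimilarity in section 6.1.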
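The item-level calculation can be sketched in base R on a small binary basket-by-item matrix. The matrix `inc` below is hypothetical toy data, and the code is a minimal illustration of the co-occurrence ratio, not the implementation used by affinity().

```r
# Hypothetical toy data: 5 baskets (rows) by 3 items (columns), 1 = item present
inc <- matrix(
  c(1, 1, 0,
    1, 0, 0,
    1, 1, 1,
    0, 1, 0,
    1, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("Coffee", "Bread", "Cake"))
)

both   <- crossprod(inc)                      # co-occurrence counts; diagonal = item counts
counts <- diag(both)
either <- outer(counts, counts, "+") - both   # baskets containing at least one item of the pair

aff <- both / either
diag(aff) <- 0   # the affinity() output in section 6.2 also shows 0 on the diagonal
round(aff, 3)
```

In this toy example Coffee and Bread appear together in 2 of the 5 baskets that contain either of them, which gives an affinity of 0.4.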

6.1 Jaccard Index

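# Keep only the items that appear in more than 5% of transactions
trans.sel <- trans[, itemFrequency(trans) > 0.05]
# Pairwise dissimilarity between items (Jaccard is the default method)
jac <- dissimilarity(trans.sel, which = "items")
round(jac, digits = 3)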
##               Bread  Cake Coffee Cookies Hot chocolate Medialuna Pastry Sandwich
## Cake          0.943
## Coffee        0.875 0.893
## Cookies       0.959 0.951  0.941
## Hot chocolate 0.967 0.930  0.946   0.946
## Medialuna     0.955 0.974  0.935   0.978         0.955
## Pastry        0.922 0.974  0.906   0.974         0.964     0.941
## Sandwich      0.956 0.957  0.918   0.978         0.969     0.984  0.991
## Tea           0.933 0.882  0.910   0.951         0.964     0.958  0.958    0.931
plot(hclust(jac, method = "ward.D2"), main = "Dendrogram for items")
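For readers without access to the transaction file, the dendrogram can be approximated from the printed dissimilarities alone. The sketch below re-enters the rounded values, rebuilds a dist object, and cuts the tree into groups. The choice of k = 3 is an arbitrary assumption for illustration, and because the inputs are rounded to three digits the tree may differ slightly from the one drawn from the original data.

```r
items <- c("Bread", "Cake", "Coffee", "Cookies", "Hot chocolate",
           "Medialuna", "Pastry", "Sandwich", "Tea")

# Lower-triangle values, column by column, copied from the rounded output above
vals <- c(
  0.943, 0.875, 0.959, 0.967, 0.955, 0.922, 0.956, 0.933,  # Bread vs. the rest
  0.893, 0.951, 0.930, 0.974, 0.974, 0.957, 0.882,         # Cake
  0.941, 0.946, 0.935, 0.906, 0.918, 0.910,                # Coffee
  0.946, 0.978, 0.974, 0.978, 0.951,                       # Cookies
  0.955, 0.964, 0.969, 0.964,                              # Hot chocolate
  0.941, 0.984, 0.958,                                     # Medialuna
  0.991, 0.958,                                            # Pastry
  0.931                                                    # Sandwich
)

m <- matrix(0, 9, 9, dimnames = list(items, items))
m[lower.tri(m)] <- vals          # lower.tri() indexes in column-major order
jac_d <- as.dist(m)

hc <- hclust(jac_d, method = "ward.D2")
groups <- cutree(hc, k = 3)      # k = 3 is an illustrative choice
groups
```

The smallest dissimilarity in the matrix is between Bread and Coffee (0.875), so these two items are the first to be merged in the tree.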

6.2 Affinity measure

a <- affinity(trans.sel)
round(a, digits=3)
## An object of class "ar_similarity"
##               Bread  Cake Coffee Cookies Hot chocolate Medialuna Pastry Sandwich   Tea
## Bread         0.000 0.057  0.125   0.041         0.033     0.045  0.078    0.044 0.067
## Cake          0.057 0.000  0.107   0.049         0.070     0.026  0.026    0.043 0.118
## Coffee        0.125 0.107  0.000   0.059         0.054     0.065  0.094    0.082 0.090
## Cookies       0.041 0.049  0.059   0.000         0.054     0.022  0.026    0.022 0.049
## Hot chocolate 0.033 0.070  0.054   0.054         0.000     0.045  0.036    0.031 0.036
## Medialuna     0.045 0.026  0.065   0.022         0.045     0.000  0.059    0.016 0.042
## Pastry        0.078 0.026  0.094   0.026         0.036     0.059  0.000    0.009 0.042
## Sandwich      0.044 0.043  0.082   0.022         0.031     0.016  0.009    0.000 0.069
## Tea           0.067 0.118  0.090   0.049         0.036     0.042  0.042    0.069 0.000
## Slot "method":
## [1] "Affinity"
# Heatmap of the affinity matrix with item names on both axes
par(mar=c(4,8,4,4))
image(a, axes=FALSE)
axis(1, at=seq(0,1,length.out=ncol(a)), labels=rownames(a), cex.axis=0.5)
axis(2, at=seq(0,1,length.out=ncol(a)), labels=rownames(a), cex.axis=0.6, las=2)

7. Conclusion

In conclusion, the market basket analysis of the Bread Basket dataset shows that coffee and bread are the two most frequently purchased items. Bread behaves as a staple: it appears in about a third of all transactions, but none of the rules that predict it is statistically significant. Coffee, in contrast, is associated with snacks such as cake, sandwich, and pastry. It is therefore recommended to implement a marketing strategy that encourages customers to buy coffee together with a snack, or reminds them to add a coffee when they buy a snack. This strategy could increase sales and overall revenue for the bakery.