Association rule mining is a statistical method for predicting which item a customer is likely to buy, based on the items he or she has already chosen. This helps businesses develop selling strategies that recommend and sell items to their customers more efficiently.
I would like to go through the main measures and explain what each one means.
For that I will be using the Market Basket Data found on Kaggle.com.
First of all, we need to load the arules libraries and read the data.
library(arules)
library(arulesViz)
mbo<-read.transactions("Market_Basket_Optimisation.csv", format="basket", sep=",", skip=0)
Our data has 7501 observations (transactions) and 119 variables (items). To keep the interpretation compact, I'll only be extracting 10 x 10 tables.
We can start by going through the cross tables.
The count table shows us the number of times each combination of two products appeared in the same transaction. The diagonal holds the number of transactions that contain each single product.
ctab<-crossTable(mbo, measure="count", sort=TRUE)
ctab[1:10,1:10]
##                   mineral water eggs spaghetti french fries chocolate green tea
## mineral water              1788  382       448          253       395       233
## eggs                        382 1348       274          273       249       191
## spaghetti                   448  274      1306          207       294       199
## french fries                253  273       207         1282       258       214
## chocolate                   395  249       294          258      1229       176
## green tea                   233  191       199          214       176       991
## milk                        360  231       266          178       241       132
## ground beef                 307  150       294          104       173       111
## frozen vegetables           268  163       209          143       172       108
## pancakes                    253  163       189          151       149       123
##                   milk ground beef frozen vegetables pancakes
## mineral water      360         307               268      253
## eggs               231         150               163      163
## spaghetti          266         294               209      189
## french fries       178         104               143      151
## chocolate          241         173               172      149
## green tea          132         111               108      123
## milk               972         165               177      124
## ground beef        165         737               127      109
## frozen vegetables  177         127               715      101
## pancakes           124         109               101      713
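To make the idea behind the count table more concrete, here is a minimal base R sketch on a made-up incidence matrix. The item names and the four transactions are hypothetical, not the Kaggle data, and this only illustrates where pairwise counts come from; I am not claiming it is how arules computes them internally.

```r
# Toy incidence matrix (hypothetical data): rows are transactions,
# columns are items, 1 means the item is in the transaction.
incidence <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1,
    0, 1, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("water", "eggs", "spaghetti"))
)

# crossprod(x) is t(x) %*% x: for every pair of items it counts the
# transactions containing both; the diagonal holds single-item counts.
counts <- crossprod(incidence)
print(counts)
```

The result is symmetric, just like the 10 x 10 table above, because "water with eggs" and "eggs with water" are the same set of transactions.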
If we would like to see the frequency of each product separately, we could use:
itemFrequencyPlot(mbo, topN=20, type = "absolute")
The support cross table gives us a clearer idea of how often each pair of items occurs together, since support is the share of all transactions that contain both items: \(\frac{transactions\ containing\ both\ items}{total\ number\ of\ transactions}\)
stab<-crossTable(mbo, measure="support", sort=TRUE)
stab[1:10,1:10]
##                   mineral water       eggs  spaghetti french fries  chocolate
## mineral water        0.23836822 0.05092654 0.05972537   0.03372884 0.05265965
## eggs                 0.05092654 0.17970937 0.03652846   0.03639515 0.03319557
## spaghetti            0.05972537 0.03652846 0.17411012   0.02759632 0.03919477
## french fries         0.03372884 0.03639515 0.02759632   0.17091055 0.03439541
## chocolate            0.05265965 0.03319557 0.03919477   0.03439541 0.16384482
## green tea            0.03106252 0.02546327 0.02652980   0.02852953 0.02346354
## milk                 0.04799360 0.03079589 0.03546194   0.02373017 0.03212905
## ground beef          0.04092788 0.01999733 0.03919477   0.01386482 0.02306359
## frozen vegetables    0.03572857 0.02173044 0.02786295   0.01906412 0.02293028
## pancakes             0.03372884 0.02173044 0.02519664   0.02013065 0.01986402
##                    green tea       milk ground beef frozen vegetables
## mineral water     0.03106252 0.04799360  0.04092788        0.03572857
## eggs              0.02546327 0.03079589  0.01999733        0.02173044
## spaghetti         0.02652980 0.03546194  0.03919477        0.02786295
## french fries      0.02852953 0.02373017  0.01386482        0.01906412
## chocolate         0.02346354 0.03212905  0.02306359        0.02293028
## green tea         0.13211572 0.01759765  0.01479803        0.01439808
## milk              0.01759765 0.12958272  0.02199707        0.02359685
## ground beef       0.01479803 0.02199707  0.09825357        0.01693108
## frozen vegetables 0.01439808 0.02359685  0.01693108        0.09532062
## pancakes          0.01639781 0.01653113  0.01453140        0.01346487
##                     pancakes
## mineral water     0.03372884
## eggs              0.02173044
## spaghetti         0.02519664
## french fries      0.02013065
## chocolate         0.01986402
## green tea         0.01639781
## milk              0.01653113
## ground beef       0.01453140
## frozen vegetables 0.01346487
## pancakes          0.09505399
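As a quick sanity check, the support values can be reproduced by hand from the count table: divide a pair's count by the 7501 transactions. The vector names below are only for illustration, and the three pairs are picked arbitrarily.

```r
# Number of transactions and a few pair counts copied from the count table
n_transactions <- 7501
pair_counts <- c(
  "mineral water,spaghetti" = 448,
  "mineral water,eggs"      = 382,
  "ground beef,spaghetti"   = 294
)

# Support = transactions containing both items / all transactions
pair_support <- pair_counts / n_transactions
print(round(pair_support, 8))
```

The printed numbers should agree with the corresponding cells of the support table up to rounding.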
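Lift tells us by what factor the probability of having item Y in our basket changes, given that item X is already there. A lift above 1 means the two items are bought together more often than we would expect if they were independent, and a lift below 1 means less often.
\(\frac{(transactions\ containing\ both\ items)(total\ number\ of\ transactions)}{(transactions\ containing\ X)(transactions\ containing\ Y)}\)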
ltab<-crossTable(mbo, measure="lift", sort=TRUE)
ltab[1:10,1:10]
##                   mineral water     eggs spaghetti french fries chocolate
## mineral water                NA 1.188845 1.4390851    0.8279119  1.348332
## eggs                  1.1888447       NA 1.1674456    1.1849606  1.127397
## spaghetti             1.4390851 1.167446        NA    0.9273812  1.373952
## french fries          0.8279119 1.184961 0.9273812           NA  1.228284
## chocolate             1.3483321 1.127397 1.3739516    1.2282845        NA
## green tea             0.9863565 1.072479 1.1533348    1.2634884  1.083943
## milk                  1.5537741 1.322437 1.5717786    1.0714820  1.513276
## ground beef           1.7475215 1.132539 2.2911622    0.8256519  1.432669
## frozen vegetables     1.5724629 1.268559 1.6788668    1.1702028  1.468215
## pancakes              1.4886159 1.272118 1.5224683    1.2391348  1.275452
##                   green tea     milk ground beef frozen vegetables pancakes
## mineral water     0.9863565 1.553774   1.7475215          1.572463 1.488616
## eggs              1.0724795 1.322437   1.1325387          1.268559 1.272118
## spaghetti         1.1533348 1.571779   2.2911622          1.678867 1.522468
## french fries      1.2634884 1.071482   0.8256519          1.170203 1.239135
## chocolate         1.0839426 1.513276   1.4326691          1.468215 1.275452
## green tea                NA 1.027905   1.1399899          1.143308 1.305753
## milk              1.0279055       NA   1.7277041          1.910382 1.342101
## ground beef       1.1399899 1.727704          NA          1.807796 1.555925
## frozen vegetables 1.1433080 1.910382   1.8077957                NA 1.486090
## pancakes          1.3057532 1.342101   1.5559250          1.486090       NA
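The lift values can also be recomputed from the counts alone. Lift is the pair's support divided by the product of the two single-item supports, which in terms of counts is the pair count times the total number of transactions, divided by the product of the two item counts. The helper name `lift_from_counts` is my own, not an arules function.

```r
# Hypothetical helper: lift from raw counts
#   count_xy : transactions containing both X and Y
#   count_x  : transactions containing X
#   count_y  : transactions containing Y
#   n        : total number of transactions
lift_from_counts <- function(count_xy, count_x, count_y, n) {
  (count_xy * n) / (count_x * count_y)
}

n_transactions <- 7501

# Counts copied from the count table
water_eggs     <- lift_from_counts(382, 1788, 1348, n_transactions)
beef_spaghetti <- lift_from_counts(294, 737, 1306, n_transactions)
fries_water    <- lift_from_counts(253, 1282, 1788, n_transactions)

print(round(c(water_eggs, beef_spaghetti, fries_water), 7))
```

Ground beef with spaghetti comes out well above 1, while french fries with mineral water comes out below 1, which is the same picture the lift table gives.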
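The chi-squared table is a cross table of the independence of the variables, where H0: the row and column items are independent. Judging by the numbers, each cell appears to hold \(\frac{(observed\ support - expected\ support)^2}{expected\ support}\) for the pair, where the expected support is the product of the two single-item supports. So these values are not p-values: the larger the value, the stronger the deviation from independence.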
chtab<-crossTable(mbo, measure="chiSquared", sort=TRUE)
round(chtab[1:10,1:10],3)
##                   mineral water  eggs spaghetti french fries chocolate
## mineral water                NA 0.002     0.008        0.001     0.005
## eggs                      0.002    NA     0.001        0.001     0.000
## spaghetti                 0.008 0.001        NA        0.000     0.004
## french fries              0.001 0.001     0.000           NA     0.001
## chocolate                 0.005 0.000     0.004        0.001        NA
## green tea                 0.000 0.000     0.001        0.002     0.000
## milk                      0.009 0.002     0.007        0.000     0.006
## ground beef               0.013 0.000     0.029        0.001     0.003
## frozen vegetables         0.007 0.001     0.008        0.000     0.003
## pancakes                  0.005 0.001     0.005        0.001     0.001
##                   green tea  milk ground beef frozen vegetables pancakes
## mineral water         0.000 0.009       0.013             0.007    0.005
## eggs                  0.000 0.002       0.000             0.001    0.001
## spaghetti             0.001 0.007       0.029             0.008    0.005
## french fries          0.002 0.000       0.001             0.000    0.001
## chocolate             0.000 0.006       0.003             0.003    0.001
## green tea                NA 0.000       0.000             0.000    0.001
## milk                  0.000    NA       0.007             0.010    0.001
## ground beef           0.000 0.007          NA             0.006    0.003
## frozen vegetables     0.000 0.010       0.006                NA    0.002
## pancakes              0.001 0.001       0.003             0.002       NA
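The numbers in the chi-squared table are small, so it helps to check what they could be. The sketch below is my own reading of the output, not something taken from the arules documentation: it assumes each cell is the squared difference between the observed pair support and the support expected under independence, divided by that expected support. The function name `cell_value` is hypothetical.

```r
# Single-item supports copied from the diagonal of the support table
supp_item <- c(
  "mineral water" = 0.23836822,
  "spaghetti"     = 0.17411012,
  "ground beef"   = 0.09825357
)

# Assumed formula (an interpretation of the output, not a documented one):
# (observed - expected)^2 / expected, with expected = supp(X) * supp(Y)
cell_value <- function(supp_xy, supp_x, supp_y) {
  expected <- supp_x * supp_y
  (supp_xy - expected)^2 / expected
}

# Pair supports copied from the support table
water_spaghetti <- cell_value(0.05972537,
                              supp_item[["mineral water"]],
                              supp_item[["spaghetti"]])
beef_spaghetti  <- cell_value(0.03919477,
                              supp_item[["ground beef"]],
                              supp_item[["spaghetti"]])

print(round(c(water_spaghetti, beef_spaghetti), 3))
```

Under this assumption the two values round to the 0.008 and 0.029 printed in the table, which is why I read the cells as a size of deviation from independence rather than as p-values.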
As we can see, it is quite inconvenient to look for meaningful information when the number of variables is this large, even when we sort the data.
In such cases, to find the significant links above a chosen threshold, we can use the Eclat algorithm.
The Eclat algorithm works only with support (unlike the Apriori algorithm, which also works with confidence) and shows us which items are frequently purchased together.
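To give a feeling for the idea, here is a rough base R sketch of the core step usually associated with Eclat: store, for each item, the list of transaction ids that contain it, and get the support of a pair by intersecting two such lists. The five transactions are made up, the sketch only covers pairs, and it is a simplified illustration rather than the implementation behind `eclat()` in arules.

```r
# Toy transactions (hypothetical data, not the Kaggle file)
transactions <- list(
  c("water", "spaghetti", "beef"),
  c("water", "eggs"),
  c("spaghetti", "beef"),
  c("water", "spaghetti"),
  c("eggs")
)
n <- length(transactions)
items <- sort(unique(unlist(transactions)))

# Vertical layout: for every item, the ids of the transactions containing it
tid_lists <- lapply(items, function(item) {
  which(vapply(transactions, function(tr) item %in% tr, logical(1)))
})
names(tid_lists) <- items

# Support of a pair = size of the intersection of the two id lists / n
pairs <- combn(items, 2, simplify = FALSE)
pair_support <- vapply(pairs, function(p) {
  length(intersect(tid_lists[[p[1]]], tid_lists[[p[2]]])) / n
}, numeric(1))
names(pair_support) <- vapply(pairs, paste, character(1), collapse = ",")

# Keep only the pairs that reach the minimum support
min_supp <- 0.4
frequent_pairs <- pair_support[pair_support >= min_supp]
print(frequent_pairs)
```

Only support is involved here: no rule direction and no confidence, which is the difference from Apriori mentioned above.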
Looking at the support cross table, we can see that the largest pair support is about 0.0597 (mineral water and spaghetti), so the minimum support has to be below 0.06 if we want at least one itemset.
If we choose 0.05 as the minimum support and a minimum length of 2 (because we want at least two items in each itemset), we get 3 itemsets.
freq.items <-eclat(mbo, parameter = list(supp = 0.05,minlen = 2))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.05 2 10 frequent itemsets FALSE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 375
##
## create itemset ...
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [25 item(s)] done [0.00s].
## creating sparse bit matrix ... [25 row(s), 7501 column(s)] done [0.00s].
## writing ... [3 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(sort(freq.items))
## items support count
## [1] {mineral water,spaghetti} 0.05972537 448
## [2] {chocolate,mineral water} 0.05265965 395
## [3] {eggs,mineral water} 0.05092654 382
And if we lower the minimum support to 0.03, we get 18 itemsets.
freq.items <-eclat(mbo, parameter = list(supp = 0.03,minlen = 2))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.03 2 10 frequent itemsets FALSE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 225
##
## create itemset ...
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [36 item(s)] done [0.00s].
## creating sparse bit matrix ... [36 row(s), 7501 column(s)] done [0.00s].
## writing ... [18 set(s)] done [0.01s].
## Creating S4 object ... done [0.00s].
inspect(sort(freq.items))
## items support count
## [1] {mineral water,spaghetti} 0.05972537 448
## [2] {chocolate,mineral water} 0.05265965 395
## [3] {eggs,mineral water} 0.05092654 382
## [4] {milk,mineral water} 0.04799360 360
## [5] {ground beef,mineral water} 0.04092788 307
## [6] {ground beef,spaghetti} 0.03919477 294
## [7] {chocolate,spaghetti} 0.03919477 294
## [8] {eggs,spaghetti} 0.03652846 274
## [9] {eggs,french fries} 0.03639515 273
## [10] {frozen vegetables,mineral water} 0.03572857 268
## [11] {milk,spaghetti} 0.03546194 266
## [12] {chocolate,french fries} 0.03439541 258
## [13] {mineral water,pancakes} 0.03372884 253
## [14] {french fries,mineral water} 0.03372884 253
## [15] {chocolate,eggs} 0.03319557 249
## [16] {chocolate,milk} 0.03212905 241
## [17] {green tea,mineral water} 0.03106252 233
## [18] {eggs,milk} 0.03079589 231
For further analysis, I think it would be better to disregard mineral water, as it is clearly a necessity that can be bought along with almost anything without a particular purpose, unlike, for example, the ground beef and spaghetti combination.
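A simple way to preview what that would leave us with is to drop every itemset containing mineral water from the 18 pairs listed above. The sketch below retypes those pairs and their supports by hand in base R; it filters the already mined result and does not re-run Eclat on the transactions, so treat it as a first look only.

```r
# The 18 frequent pairs and their supports, copied from the Eclat output
itemsets <- c(
  "mineral water,spaghetti", "chocolate,mineral water", "eggs,mineral water",
  "milk,mineral water", "ground beef,mineral water", "ground beef,spaghetti",
  "chocolate,spaghetti", "eggs,spaghetti", "eggs,french fries",
  "frozen vegetables,mineral water", "milk,spaghetti", "chocolate,french fries",
  "mineral water,pancakes", "french fries,mineral water", "chocolate,eggs",
  "chocolate,milk", "green tea,mineral water", "eggs,milk"
)
support <- c(
  0.05972537, 0.05265965, 0.05092654, 0.04799360, 0.04092788, 0.03919477,
  0.03919477, 0.03652846, 0.03639515, 0.03572857, 0.03546194, 0.03439541,
  0.03372884, 0.03372884, 0.03319557, 0.03212905, 0.03106252, 0.03079589
)

# Flag the pairs that involve mineral water and keep the rest
has_water <- grepl("mineral water", itemsets, fixed = TRUE)
without_water <- data.frame(
  items   = itemsets[!has_water],
  support = support[!has_water]
)
print(without_water)
```

Half of the 18 pairs involve mineral water, so removing it leaves 9 pairs, with ground beef and spaghetti at the top.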