Association Rules

Association rule mining is a statistical method for predicting which item a customer is likely to buy, based on the items they have already chosen. This helps businesses develop selling strategies that let them recommend and sell items to their customers more efficiently.

I would like to go through the main measures and explain what each one means.

For that, I will be using the Market Basket data found on Kaggle.com.

First, we need to load the arules libraries and read the data.

library(arules)
library(arulesViz)
# Read the CSV as transactions: one basket per line, items separated by commas
mbo <- read.transactions("Market_Basket_Optimisation.csv", format = "basket", sep = ",", skip = 0)

Our data has 7501 transactions and 119 distinct items. To keep the output compact, I'll only show the first 10 × 10 block of each table.

We can start by going through the cross tables.

Measure = “count”

This shows the number of times each pair of products appeared in the same transaction. The diagonal holds the number of transactions that contain each single product.

ctab<-crossTable(mbo, measure="count", sort=TRUE) 
ctab[1:10,1:10]

##                   mineral water eggs spaghetti french fries chocolate green tea
## mineral water              1788  382       448          253       395       233
## eggs                        382 1348       274          273       249       191
## spaghetti                   448  274      1306          207       294       199
## french fries                253  273       207         1282       258       214
## chocolate                   395  249       294          258      1229       176
## green tea                   233  191       199          214       176       991
## milk                        360  231       266          178       241       132
## ground beef                 307  150       294          104       173       111
## frozen vegetables           268  163       209          143       172       108
## pancakes                    253  163       189          151       149       123
##                   milk ground beef frozen vegetables pancakes
## mineral water      360         307               268      253
## eggs               231         150               163      163
## spaghetti          266         294               209      189
## french fries       178         104               143      151
## chocolate          241         173               172      149
## green tea          132         111               108      123
## milk               972         165               177      124
## ground beef        165         737               127      109
## frozen vegetables  177         127               715      101
## pancakes           124         109               101      713

If we would like to see the frequency of each product separately, we could use:

itemFrequencyPlot(mbo, topN=20, type = "absolute")

Measure=“support”

Support gives us a clearer idea of how often an itemset occurs, because it is expressed as a share of all transactions: \(\frac{transactions\ containing\ both\ items}{total\ number\ of\ transactions}\)

stab<-crossTable(mbo, measure="support", sort=TRUE) 
stab[1:10,1:10]

##                   mineral water       eggs  spaghetti french fries  chocolate
## mineral water        0.23836822 0.05092654 0.05972537   0.03372884 0.05265965
## eggs                 0.05092654 0.17970937 0.03652846   0.03639515 0.03319557
## spaghetti            0.05972537 0.03652846 0.17411012   0.02759632 0.03919477
## french fries         0.03372884 0.03639515 0.02759632   0.17091055 0.03439541
## chocolate            0.05265965 0.03319557 0.03919477   0.03439541 0.16384482
## green tea            0.03106252 0.02546327 0.02652980   0.02852953 0.02346354
## milk                 0.04799360 0.03079589 0.03546194   0.02373017 0.03212905
## ground beef          0.04092788 0.01999733 0.03919477   0.01386482 0.02306359
## frozen vegetables    0.03572857 0.02173044 0.02786295   0.01906412 0.02293028
## pancakes             0.03372884 0.02173044 0.02519664   0.02013065 0.01986402
##                    green tea       milk ground beef frozen vegetables
## mineral water     0.03106252 0.04799360  0.04092788        0.03572857
## eggs              0.02546327 0.03079589  0.01999733        0.02173044
## spaghetti         0.02652980 0.03546194  0.03919477        0.02786295
## french fries      0.02852953 0.02373017  0.01386482        0.01906412
## chocolate         0.02346354 0.03212905  0.02306359        0.02293028
## green tea         0.13211572 0.01759765  0.01479803        0.01439808
## milk              0.01759765 0.12958272  0.02199707        0.02359685
## ground beef       0.01479803 0.02199707  0.09825357        0.01693108
## frozen vegetables 0.01439808 0.02359685  0.01693108        0.09532062
## pancakes          0.01639781 0.01653113  0.01453140        0.01346487
##                     pancakes
## mineral water     0.03372884
## eggs              0.02173044
## spaghetti         0.02519664
## french fries      0.02013065
## chocolate         0.01986402
## green tea         0.01639781
## milk              0.01653113
## ground beef       0.01453140
## frozen vegetables 0.01346487
## pancakes          0.09505399
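
The support values can be checked by hand from the count table. The short sketch below uses only base R and numbers copied from the outputs above; the variable names are my own and are not part of arules.

```r
# Counts copied from the "count" cross table
n_transactions <- 7501
count_water_spaghetti <- 448   # baskets with both mineral water and spaghetti
count_water <- 1788            # diagonal entry: baskets with mineral water

# Support = transactions containing the itemset / total number of transactions
support_pair <- count_water_spaghetti / n_transactions
support_water <- count_water / n_transactions

print(round(support_pair, 8))   # compare with 0.05972537 in the support table
print(round(support_water, 8))  # compare with 0.23836822 on the diagonal
```

So a support of about 0.06 means that roughly 6% of all baskets contain both mineral water and spaghetti.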

Measure=“lift”

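Lift tells us how much more likely item Y is to be in the basket when item X is already there, compared with how often Y is bought overall. A lift above 1 means the two items appear together more often than we would expect if they were independent; a lift below 1 means they appear together less often.

\(\frac{(transactions\ containing\ both\ items) \times (total\ number\ of\ transactions)}{(transactions\ containing\ X) \times (transactions\ containing\ Y)}\)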

ltab<-crossTable(mbo, measure="lift", sort=TRUE) 
ltab[1:10,1:10]

##                   mineral water     eggs spaghetti french fries chocolate
## mineral water                NA 1.188845 1.4390851    0.8279119  1.348332
## eggs                  1.1888447       NA 1.1674456    1.1849606  1.127397
## spaghetti             1.4390851 1.167446        NA    0.9273812  1.373952
## french fries          0.8279119 1.184961 0.9273812           NA  1.228284
## chocolate             1.3483321 1.127397 1.3739516    1.2282845        NA
## green tea             0.9863565 1.072479 1.1533348    1.2634884  1.083943
## milk                  1.5537741 1.322437 1.5717786    1.0714820  1.513276
## ground beef           1.7475215 1.132539 2.2911622    0.8256519  1.432669
## frozen vegetables     1.5724629 1.268559 1.6788668    1.1702028  1.468215
## pancakes              1.4886159 1.272118 1.5224683    1.2391348  1.275452
##                   green tea     milk ground beef frozen vegetables pancakes
## mineral water     0.9863565 1.553774   1.7475215          1.572463 1.488616
## eggs              1.0724795 1.322437   1.1325387          1.268559 1.272118
## spaghetti         1.1533348 1.571779   2.2911622          1.678867 1.522468
## french fries      1.2634884 1.071482   0.8256519          1.170203 1.239135
## chocolate         1.0839426 1.513276   1.4326691          1.468215 1.275452
## green tea                NA 1.027905   1.1399899          1.143308 1.305753
## milk              1.0279055       NA   1.7277041          1.910382 1.342101
## ground beef       1.1399899 1.727704          NA          1.807796 1.555925
## frozen vegetables 1.1433080 1.910382   1.8077957                NA 1.486090
## pancakes          1.3057532 1.342101   1.5559250          1.486090       NA
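
As a quick check, the lift of the ground beef and spaghetti pair can be recomputed from the count table. This is a minimal base R sketch with numbers copied from the outputs above; the variable names are illustrative only.

```r
# Counts copied from the "count" cross table
n_transactions <- 7501
count_both <- 294          # baskets with ground beef and spaghetti
count_beef <- 737          # baskets with ground beef
count_spaghetti <- 1306    # baskets with spaghetti

# Lift = support(X and Y) / (support(X) * support(Y))
support_both <- count_both / n_transactions
support_beef <- count_beef / n_transactions
support_spaghetti <- count_spaghetti / n_transactions
lift <- support_both / (support_beef * support_spaghetti)

# Same number seen as P(spaghetti | beef) / P(spaghetti)
lift_conditional <- (count_both / count_beef) / support_spaghetti

print(round(lift, 7))              # compare with 2.2911622 in the lift table
print(round(lift_conditional, 7))
```

In words: a basket that already contains ground beef is about 2.3 times as likely to contain spaghetti as a basket picked at random.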

Measure=“chiSquared”

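The chi-squared table is a cross table that measures how far each pair of items departs from independence, where H0: rows and columns (the two items) are independent. Larger values indicate a stronger departure from independence.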

chtab<-crossTable(mbo, measure="chiSquared", sort=TRUE) 
round(chtab[1:10,1:10],3)

##                   mineral water  eggs spaghetti french fries chocolate
## mineral water                NA 0.002     0.008        0.001     0.005
## eggs                      0.002    NA     0.001        0.001     0.000
## spaghetti                 0.008 0.001        NA        0.000     0.004
## french fries              0.001 0.001     0.000           NA     0.001
## chocolate                 0.005 0.000     0.004        0.001        NA
## green tea                 0.000 0.000     0.001        0.002     0.000
## milk                      0.009 0.002     0.007        0.000     0.006
## ground beef               0.013 0.000     0.029        0.001     0.003
## frozen vegetables         0.007 0.001     0.008        0.000     0.003
## pancakes                  0.005 0.001     0.005        0.001     0.001
##                   green tea  milk ground beef frozen vegetables pancakes
## mineral water         0.000 0.009       0.013             0.007    0.005
## eggs                  0.000 0.002       0.000             0.001    0.001
## spaghetti             0.001 0.007       0.029             0.008    0.005
## french fries          0.002 0.000       0.001             0.000    0.001
## chocolate             0.000 0.006       0.003             0.003    0.001
## green tea                NA 0.000       0.000             0.000    0.001
## milk                  0.000    NA       0.007             0.010    0.001
## ground beef           0.000 0.007          NA             0.006    0.003
## frozen vegetables     0.000 0.010       0.006                NA    0.002
## pancakes              0.001 0.001       0.003             0.002       NA
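
To get a feel for the scale of these numbers, I compared a few entries with the support table. The values printed here appear to equal the squared gap between observed and expected support, divided by the expected support. This is an observation about this particular output, checked on a handful of cells, and not a statement about how arules defines the measure in general. A base R sketch for the ground beef and spaghetti cell, with illustrative variable names:

```r
# Supports copied from the "support" cross table
supp_both <- 0.03919477        # ground beef and spaghetti together
supp_beef <- 0.09825357
supp_spaghetti <- 0.17411012

# Support we would expect if the two items were independent
expected <- supp_beef * supp_spaghetti

# Squared deviation from independence, relative to the expected support
cell <- (supp_both - expected)^2 / expected

print(round(cell, 3))  # compare with 0.029 in the chiSquared table
```

Read this way, the largest value in the block (ground beef with spaghetti) marks the pair whose co-occurrence deviates most from what independence would predict.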

As we can see, it is quite inconvenient to look for meaningful information when the number of variables is this large, even when we sort the data.

In such cases, to find the significant links by setting a threshold, we can use:

Eclat algorithm & rule induction. We can set the minimum support level, as well as the minimum and maximum length of the itemsets.

The Eclat algorithm works only with support (unlike the Apriori algorithm, which also uses confidence when generating rules) and shows us which items are frequently purchased together.

Looking at the support cross table, we can see that the largest pairwise support is about 0.0597, so the minimum support must be set below 0.06 if we want to get at least one itemset.

If we choose 0.05 as our minimum support and a minimum length of 2 (because we want at least two items in each itemset), we get 3 itemsets.

freq.items <-eclat(mbo, parameter = list(supp = 0.05,minlen = 2))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE    0.05      2     10 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 375 
## 
## create itemset ... 
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [25 item(s)] done [0.00s].
## creating sparse bit matrix ... [25 row(s), 7501 column(s)] done [0.00s].
## writing  ... [3 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

inspect(sort(freq.items))

##     items                     support    count
## [1] {mineral water,spaghetti} 0.05972537 448  
## [2] {chocolate,mineral water} 0.05265965 395  
## [3] {eggs,mineral water}      0.05092654 382

And if we lower the minimum support to 0.03, we get 18 itemsets.

freq.items <-eclat(mbo, parameter = list(supp = 0.03,minlen = 2))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target   ext
##     FALSE    0.03      2     10 frequent itemsets FALSE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 225 
## 
## create itemset ... 
## set transactions ...[119 item(s), 7501 transaction(s)] done [0.01s].
## sorting and recoding items ... [36 item(s)] done [0.00s].
## creating sparse bit matrix ... [36 row(s), 7501 column(s)] done [0.00s].
## writing  ... [18 set(s)] done [0.01s].
## Creating S4 object  ... done [0.00s].

inspect(sort(freq.items))

##      items                             support    count
## [1]  {mineral water,spaghetti}         0.05972537 448  
## [2]  {chocolate,mineral water}         0.05265965 395  
## [3]  {eggs,mineral water}              0.05092654 382  
## [4]  {milk,mineral water}              0.04799360 360  
## [5]  {ground beef,mineral water}       0.04092788 307  
## [6]  {ground beef,spaghetti}           0.03919477 294  
## [7]  {chocolate,spaghetti}             0.03919477 294  
## [8]  {eggs,spaghetti}                  0.03652846 274  
## [9]  {eggs,french fries}               0.03639515 273  
## [10] {frozen vegetables,mineral water} 0.03572857 268  
## [11] {milk,spaghetti}                  0.03546194 266  
## [12] {chocolate,french fries}          0.03439541 258  
## [13] {mineral water,pancakes}          0.03372884 253  
## [14] {french fries,mineral water}      0.03372884 253  
## [15] {chocolate,eggs}                  0.03319557 249  
## [16] {chocolate,milk}                  0.03212905 241  
## [17] {green tea,mineral water}         0.03106252 233  
## [18] {eggs,milk}                       0.03079589 231
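
The effect of the support threshold can be reproduced by hand from the count table. The sketch below rebuilds the 10 × 10 count block in base R and counts the pairs that clear each threshold. It is not the Eclat algorithm itself: it only looks at pairs among the ten most frequent items, which happens to be enough here because every itemset returned above is such a pair. The helper `count_pairs` is my own name.

```r
items <- c("mineral water", "eggs", "spaghetti", "french fries", "chocolate",
           "green tea", "milk", "ground beef", "frozen vegetables", "pancakes")

# Counts copied row by row from the "count" cross table
counts <- matrix(c(
  1788,  382,  448,  253,  395, 233, 360, 307, 268, 253,
   382, 1348,  274,  273,  249, 191, 231, 150, 163, 163,
   448,  274, 1306,  207,  294, 199, 266, 294, 209, 189,
   253,  273,  207, 1282,  258, 214, 178, 104, 143, 151,
   395,  249,  294,  258, 1229, 176, 241, 173, 172, 149,
   233,  191,  199,  214,  176, 991, 132, 111, 108, 123,
   360,  231,  266,  178,  241, 132, 972, 165, 177, 124,
   307,  150,  294,  104,  173, 111, 165, 737, 127, 109,
   268,  163,  209,  143,  172, 108, 177, 127, 715, 101,
   253,  163,  189,  151,  149, 123, 124, 109, 101, 713),
  nrow = 10, byrow = TRUE, dimnames = list(items, items))

support <- counts / 7501

# Count each unordered pair once (upper triangle, diagonal excluded)
count_pairs <- function(min_support) {
  sum(support[upper.tri(support)] >= min_support)
}

print(count_pairs(0.05))  # pairs with support of at least 0.05
print(count_pairs(0.03))  # pairs with support of at least 0.03
```

The two counts agree with the 3 and 18 sets reported by the two Eclat runs.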

For further analysis, I think it would be better to disregard mineral water, as it is clearly a necessity that is bought along with almost everything for no particular reason, unlike, for example, the ground beef & spaghetti combination.
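
A rough sketch of what that looks like, assuming we simply filter the 18 itemsets found at a minimum support of 0.03 rather than re-mining the transactions. Since all of those itemsets are pairs and the number of transactions stays the same, the remaining supports do not change. The data frame and its column names are my own and were typed in from the Eclat output above.

```r
# The 18 frequent pairs and their counts, copied from the Eclat output
pairs <- data.frame(
  item_a = c("mineral water", "chocolate", "eggs", "milk", "ground beef",
             "ground beef", "chocolate", "eggs", "eggs", "frozen vegetables",
             "milk", "chocolate", "mineral water", "french fries", "chocolate",
             "chocolate", "green tea", "eggs"),
  item_b = c("spaghetti", "mineral water", "mineral water", "mineral water",
             "mineral water", "spaghetti", "spaghetti", "spaghetti",
             "french fries", "mineral water", "spaghetti", "french fries",
             "pancakes", "mineral water", "eggs", "milk", "mineral water",
             "milk"),
  count = c(448, 395, 382, 360, 307, 294, 294, 274, 273, 268,
            266, 258, 253, 253, 249, 241, 233, 231)
)

# Drop every pair that involves mineral water
has_water <- pairs$item_a == "mineral water" | pairs$item_b == "mineral water"
without_water <- pairs[!has_water, ]
without_water$support <- without_water$count / 7501

print(without_water)
```

Half of the 18 pairs involve mineral water. Once they are removed, 9 pairs remain, led by ground beef with spaghetti and chocolate with spaghetti.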