Association Rule Mining for Online Retail Consumers

Background

The Online Retail II data set contains all transactions of a UK-based and registered non-store online retailer between December 1, 2009 and December 9, 2011. The company mainly sells unique all-occasion giftware. Many of its customers are wholesalers.

The data set used here can be downloaded from this link.
The goal of this project is to study the purchasing behavior of the retailer's customers.

Setting up packages

library(arules) # Apriori and Eclat algorithms
library(readxl) # Read excel
library(arulesViz) # Association rules visualization
library(dplyr) # Data manipulation
library(stringr) # String processing

Getting and preprocessing data

Only the sheet for 2010 and 2011 is read, since the full data set has too many observations.

dat <- read_excel("online_retail_II.xlsx",
                  sheet='Year 2010-2011',
                  guess_max = 100,
                  range = cell_cols("A:C"),
                  col_types = c("text", "text", "text") 
)

Select only the Invoice and Description columns, which hold the basket information needed. Rows with an empty description or a “Discount” entry are dropped. Invoices that start with the letter ‘C’ indicate a cancellation and are removed too, since we are only interested in purchase transactions.
Leading and trailing spaces are stripped, repeated white space inside each description is collapsed, and the text is converted to lower case.

dat <- select(dat, Invoice, Description) %>%
    filter(Description != "",
           Description != "Discount",
           !grepl("^C", Invoice, ignore.case = TRUE)) %>%
    mutate(Description = tolower(str_squish(Description)))

Remove apostrophes, periods, and commas from the descriptions.

dat <- mutate(dat, Description = str_remove_all(Description, "'|\\.|,"))
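
To make the effect of these cleaning steps concrete, the sketch below applies them to one made-up description and two made-up invoice numbers. These values are illustrative assumptions, not rows taken from the data set.

```r
library(stringr)

# Hypothetical raw values, for illustration only
raw_desc <- "  Children's   Tea Set,  No.5 "
invoices <- c("536365", "C536379")

# Same steps as in the pipeline: squish white space, lower-case,
# then drop apostrophes, periods, and commas
cleaned <- str_remove_all(tolower(str_squish(raw_desc)), "'|\\.|,")
cleaned

# TRUE marks a cancellation invoice that the filter would drop
is_cancelled <- grepl("^C", invoices, ignore.case = TRUE)
is_cancelled
```

The cleaned string no longer contains commas, which also keeps the CSV file simple to parse.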

Write the data to a CSV file so it is easier to read with read.transactions.

write.csv(dat, file="dat.csv", row.names=FALSE)

Convert data frame to transactions format.

tr <- read.transactions("dat.csv", format = "single", sep=",",
                        rm.duplicates=TRUE, header=TRUE, cols=1:2)
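
As an alternative to the CSV round trip, transactions can also be built in memory by splitting the descriptions by invoice and coercing the resulting list. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical). Duplicated items within an invoice are removed first, which plays the role of rm.duplicates=TRUE.

```r
library(arules)

# Hypothetical invoice/description pairs, for illustration only
toy <- data.frame(
  Invoice     = c("1", "1", "2", "2", "2", "3"),
  Description = c("tea", "cake", "tea", "tea", "jam", "cake"),
  stringsAsFactors = FALSE
)

# One character vector of unique items per invoice
baskets <- lapply(split(toy$Description, toy$Invoice), unique)

# Coerce the named list to the transactions class
toy_tr <- as(baskets, "transactions")
basket_sizes <- size(toy_tr)
basket_sizes
```

Whether this is preferable to writing a file depends on the size of the data; the CSV approach used here works equally well.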

Since many customers are wholesalers, we expect to see a number of baskets that contain 100 items or more.

summary(tr)

## transactions as itemMatrix in sparse format with
##  20610 rows (elements/itemsets/transactions) and
##  4157 columns (items) and a density of 0.006073024 
## 
## most frequent items:
## white hanging heart t-light holder            jumbo bag red retrospot 
##                               2260                               2092 
##           regency cakestand 3 tier                      party bunting 
##                               1989                               1686 
##            lunch bag red retrospot                            (Other) 
##                               1564                             510720 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2272  836  689  666  687  615  606  605  611  545  552  493  507  528  547  552 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  467  443  484  435  402  347  345  309  248  259  243  243  270  225  197  188 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##  162  177  136  135  132  122  136  123  123  103   96  104  100   90   85   95 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##   87   86   57   65   78   71   72   50   64   52   35   61   40   29   43   39 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
##   39   43   33   40   29   33   39   24   24   34   26   21   19   27   16   12 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
##   20   21   15   23   17   17    9   17   11   12    9   15   16    7    5   10 
##   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
##    9   13    5   11   10    3    6    9    2    5    6    4    4    4    7    3 
##  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  128 
##    5    6    6    9    5    4    8    5    6   11    4    5    3    4    8    1 
##  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144 
##    2    4    3    3    2    5    4    2    6    6    2    5    6    2    2    5 
##  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  160 
##    5    3    2    4    5    3    5    3    6    2    2    2    4    4    1    2 
##  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175  176 
##    3    3    3    2    5    4    1    4    4    2    2    4    3    4    2    5 
##  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191  192 
##    5    4    2    4    2    6    4    3    3    3    2    3    4    4    2    3 
##  193  194  195  196  197  198  199  202  203  204  205  206  207  208  210  211 
##    2    3    3    4    2    2    3    2    5    5    1    2    1    4    1    4 
##  212  213  214  215  216  217  218  219  220  222  223  224  225  226  227  228 
##    1    1    2    1    2    4    2    2    2    1    1    3    3    1    1    1 
##  229  230  232  233  234  235  237  238  239  241  242  243  244  247  249  250 
##    2    1    1    1    1    1    3    3    1    2    1    2    2    2    3    2 
##  253  254  255  257  259  261  262  263  264  266  267  270  275  279  280  282 
##    1    2    2    2    1    2    2    1    2    1    1    2    1    2    2    1 
##  283  285  286  288  289  291  292  295  296  298  299  301  309  310  315  319 
##    2    2    1    2    1    1    1    1    1    2    1    1    1    1    1    1 
##  320  331  332  333  334  339  341  344  345  347  348  349  352  354  357  358 
##    1    1    4    1    1    1    1    1    1    2    1    1    2    1    1    1 
##  363  369  375  376  379  382  386  388  399  404  408  411  414  415  416  419 
##    1    1    1    1    1    1    1    1    2    2    1    1    1    1    2    1 
##  420  428  433  434  438  439  443  449  453  455  458  460  463  471  482  486 
##    1    1    1    2    1    2    1    1    1    1    1    1    1    1    1    1 
##  487  488  494  499  503  506  514  515  517  518  520  522  524  525  527  529 
##    1    1    1    1    2    1    1    1    1    1    1    1    1    2    1    1 
##  531  536  539  541  543  552  561  567  572  578  585  588  589  593  595  599 
##    1    1    1    1    1    1    1    1    1    1    2    1    1    1    1    1 
##  601  607  622  629  635  645  647  649  661  673  676  687  703  720  731  748 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 1108 
##    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   15.00   25.25   28.00 1108.00 
## 
## includes extended item information - examples:
##                    labels
## 1   *boombox ipod classic
## 2 *usb office mirror ball
## 3                       ?
## 
## includes extended transaction information - examples:
##   transactionID
## 1        536365
## 2        536366
## 3        536367

Plot of the most frequently bought items

itemFrequencyPlot(tr, topN=10, cex=0.7)

There are just 6 items that have a support (relative frequency) of at least 7%. Perhaps this is because the company offers a pretty wide range of products, as these are unique all-occasion gifts.

itemFrequencyPlot(tr, support=0.07, cex=0.8)
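
The supports behind this plot can be checked by hand from the summary(tr) output: the count of each item divided by the 20610 transactions. A quick sketch for the five most frequent items, with the counts copied from that output:

```r
# Number of transactions, from summary(tr)
n_transactions <- 20610

# Counts of the five most frequent items, from summary(tr)
top_counts <- c(
  "white hanging heart t-light holder" = 2260,
  "jumbo bag red retrospot"            = 2092,
  "regency cakestand 3 tier"           = 1989,
  "party bunting"                      = 1686,
  "lunch bag red retrospot"            = 1564
)

# Support = relative frequency of the item among all transactions
item_support <- top_counts / n_transactions
round(item_support, 4)
```

All five items clear the 7% threshold, and only the first two exceed 10%.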

Running apriori algorithm

Support: The probability of a specific event; in this case, the proportion of transactions that contain a specific item or itemset. In this online retail data, the support of the rules is fairly low, with a maximum slightly above 4%.
Confidence: The proportion of transactions containing the LHS of a rule that also contain its RHS. It is defined as \(conf(X \Rightarrow Y)=supp(X \cup Y)/supp(X)\). For example, in this data set, the rule {childrens cutlery dolly girl} => {childrens cutlery spaceboy} has a confidence close to 76%. This means that the rule holds for 76% of the transactions that contain “childrens cutlery dolly girl”. Confidence can be interpreted as an estimate of the conditional probability \(P(Y \mid X)\), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.
Lift: It is defined as \(lift(X \Rightarrow Y)=supp(X \cup Y)/(supp(X)\,supp(Y))\). It can be interpreted as the deviation of the support of the whole rule from the support expected under independence, given the supports of the LHS and the RHS; a lift of 1 corresponds to independence. It helps to filter or rank the rules found. Greater lift values indicate stronger associations (greater dependence among the items in the rule).
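
As a worked check of these definitions, the sketch below recomputes the three measures for the rule {childrens cutlery dolly girl} => {childrens cutlery spaceboy} from raw counts. The joint count (221) and the number of transactions (20610) are taken from the output reported in this document. The single-item counts 291 and 360 are not reported directly; they are reconstructed here by multiplying the reported coverage values (0.01411936 and 0.01746725) by 20610, so treat them as derived.

```r
n    <- 20610  # total transactions, from summary(tr)
n_xy <- 221    # transactions containing both items (rule count)
n_x  <- 291    # derived: 0.01411936 * 20610 (dolly girl)
n_y  <- 360    # derived: 0.01746725 * 20610 (spaceboy)

# Support of the whole rule
supp_xy <- n_xy / n

# Confidence: share of dolly-girl baskets that also contain spaceboy
conf <- supp_xy / (n_x / n)

# Lift: observed joint support relative to what independence predicts
lift <- supp_xy / ((n_x / n) * (n_y / n))

round(c(support = supp_xy, confidence = conf, lift = lift), 4)
```

The results agree with the values that inspect() reports for this rule (support 0.0107, confidence 0.7595, lift 43.4785 after rounding).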

The minimum support should not be too high (only two items have a support above 10%) but also not too low, because we want items that are often bought. A support of 1% and a confidence of 30% were chosen; a rule must satisfy both the minimum support and the minimum confidence constraint at the same time. With these settings, the algorithm found 1381 rules that met both requirements.

rules <- apriori(tr, parameter=list(support=0.01, confidence=0.3, target="rules"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 206 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4157 item(s), 20610 transaction(s)] done [0.47s].
## sorting and recoding items ... [783 item(s)] done [0.01s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 done [0.13s].
## writing ... [1381 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

Basket rules summary

summary(rules)

## set of 1381 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 735 582  64 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.514   3.000   4.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01004   Min.   :0.3002   Min.   :0.01058   Min.   : 2.819  
##  1st Qu.:0.01092   1st Qu.:0.4000   1st Qu.:0.01926   1st Qu.: 7.791  
##  Median :0.01252   Median :0.5144   Median :0.02586   Median :10.087  
##  Mean   :0.01395   Mean   :0.5299   Mean   :0.02880   Mean   :13.144  
##  3rd Qu.:0.01519   3rd Qu.:0.6396   3rd Qu.:0.03474   3rd Qu.:14.675  
##  Max.   :0.04003   Max.   :0.9587   Max.   :0.10150   Max.   :80.078  
##      count      
##  Min.   :207.0  
##  1st Qu.:225.0  
##  Median :258.0  
##  Mean   :287.6  
##  3rd Qu.:313.0  
##  Max.   :825.0  
## 
## mining info:
##  data ntransactions support confidence
##    tr         20610    0.01        0.3

Basket rules of size equal to 2

inspect(head(subset(rules, size(rules) == 2), 10))

##      lhs                                     rhs                                  support confidence   coverage      lift count
## [1]  {childrens cutlery dolly girl}       => {childrens cutlery spaceboy}      0.01072295  0.7594502 0.01411936 43.478522   221
## [2]  {childrens cutlery spaceboy}         => {childrens cutlery dolly girl}    0.01072295  0.6138889 0.01746725 43.478522   221
## [3]  {childrens cutlery polkadot blue}    => {childrens cutlery polkadot pink} 0.01067443  0.7612457 0.01402232 37.090481   220
## [4]  {childrens cutlery polkadot pink}    => {childrens cutlery polkadot blue} 0.01067443  0.5200946 0.02052402 37.090481   220
## [5]  {painted metal pears assorted}       => {assorted colour bird ornament}   0.01256672  0.7000000 0.01795245  9.915464   259
## [6]  {round snack boxes set of4 woodland} => {postage}                         0.01154779  0.3251366 0.03551674  5.945932   238
## [7]  {lunch bag doiley pattern}           => {jumbo bag doiley patterns}       0.01101407  0.4613821 0.02387191 18.500166   227
## [8]  {jumbo bag doiley patterns}          => {lunch bag doiley pattern}        0.01101407  0.4416342 0.02493935 18.500166   227
## [9]  {lunch bag doiley pattern}           => {lunch bag apple design}          0.01004367  0.4207317 0.02387191  8.369962   207
## [10] {pink happy birthday bunting}        => {blue happy birthday bunting}     0.01271228  0.6517413 0.01950509 34.442021   262

Basket rules of size greater than 3

inspect(head(subset(rules, size(rules) > 3), 10))

##      lhs                                  rhs                                  support confidence   coverage      lift count
## [1]  {green regency teacup and saucer,                                                                                      
##       pink regency teacup and saucer,                                                                                       
##       roses regency teacup and saucer} => {regency cakestand 3 tier}        0.01460456  0.5553506 0.02629791  5.754537   301
## [2]  {pink regency teacup and saucer,                                                                                       
##       regency cakestand 3 tier,                                                                                             
##       roses regency teacup and saucer} => {green regency teacup and saucer} 0.01460456  0.9093656 0.01606016 18.465048   301
## [3]  {green regency teacup and saucer,                                                                                      
##       pink regency teacup and saucer,                                                                                       
##       regency cakestand 3 tier}        => {roses regency teacup and saucer} 0.01460456  0.8775510 0.01664241 16.966535   301
## [4]  {green regency teacup and saucer,                                                                                      
##       regency cakestand 3 tier,                                                                                             
##       roses regency teacup and saucer} => {pink regency teacup and saucer}  0.01460456  0.7377451 0.01979622 19.849773   301
## [5]  {lunch bag cars blue,                                                                                                  
##       lunch bag pink polkadot,                                                                                              
##       lunch bag suki design}           => {lunch bag red retrospot}         0.01023775  0.7873134 0.01300340 10.375019   211
## [6]  {lunch bag cars blue,                                                                                                  
##       lunch bag pink polkadot,                                                                                              
##       lunch bag red retrospot}         => {lunch bag suki design}           0.01023775  0.6552795 0.01562348 10.509969   211
## [7]  {lunch bag pink polkadot,                                                                                              
##       lunch bag red retrospot,                                                                                              
##       lunch bag suki design}           => {lunch bag cars blue}             0.01023775  0.6374622 0.01606016 11.424432   211
## [8]  {lunch bag cars blue,                                                                                                  
##       lunch bag red retrospot,                                                                                              
##       lunch bag suki design}           => {lunch bag pink polkadot}         0.01023775  0.6242604 0.01639981 11.803675   211
## [9]  {lunch bag black skull,                                                                                                
##       lunch bag cars blue,                                                                                                  
##       lunch bag pink polkadot}         => {lunch bag red retrospot}         0.01091703  0.7258065 0.01504124  9.564496   225
## [10] {lunch bag cars blue,                                                                                                  
##       lunch bag pink polkadot,                                                                                              
##       lunch bag red retrospot}         => {lunch bag black skull}           0.01091703  0.6987578 0.01562348 11.312960   225

Taking a look at the top 11 rules by lift

inspect(sort(rules, by='lift', decreasing=TRUE)[1:11])

##      lhs                       rhs                    support    confidence
## [1]  {herb marker thyme}    => {herb marker rosemary} 0.01072295 0.9324895 
## [2]  {herb marker rosemary} => {herb marker thyme}    0.01072295 0.9208333 
## [3]  {herb marker thyme}    => {herb marker parsley}  0.01033479 0.8987342 
## [4]  {herb marker parsley}  => {herb marker thyme}    0.01033479 0.8949580 
## [5]  {herb marker parsley}  => {herb marker rosemary} 0.01043183 0.9033613 
## [6]  {herb marker rosemary} => {herb marker parsley}  0.01043183 0.8958333 
## [7]  {herb marker parsley}  => {herb marker mint}     0.01028627 0.8907563 
## [8]  {herb marker mint}     => {herb marker parsley}  0.01028627 0.8833333 
## [9]  {herb marker basil}    => {herb marker rosemary} 0.01038331 0.8842975 
## [10] {herb marker rosemary} => {herb marker basil}    0.01038331 0.8916667 
## [11] {herb marker parsley}  => {herb marker basil}    0.01028627 0.8907563 
##      coverage   lift     count
## [1]  0.01149927 80.07753 221  
## [2]  0.01164483 80.07753 221  
## [3]  0.01149927 77.82736 213  
## [4]  0.01154779 77.82736 213  
## [5]  0.01154779 77.57616 215  
## [6]  0.01164483 77.57616 215  
## [7]  0.01154779 76.49370 212  
## [8]  0.01164483 76.49370 212  
## [9]  0.01174187 75.93905 214  
## [10] 0.01164483 75.93905 214  
## [11] 0.01154779 75.86152 212

Checking the rules that have the product with the highest support (white hanging heart t-light holder) on the right-hand side (rhs).

heart.rhs <- subset(rules, subset = rhs %in% 'white hanging heart t-light holder')
inspect(heart.rhs)

##      lhs                                    rhs                                     support confidence   coverage     lift count
## [1]  {candleholder pink hanging heart}   => {white hanging heart t-light holder} 0.01368268  0.7085427 0.01931101 6.461533   282
## [2]  {zinc metal heart decoration}       => {white hanging heart t-light holder} 0.01014071  0.3906542 0.02595827 3.562559   209
## [3]  {love building block word}          => {white hanging heart t-light holder} 0.01077147  0.3529412 0.03051917 3.218636   222
## [4]  {red hanging heart t-light holder}  => {white hanging heart t-light holder} 0.02401747  0.6680162 0.03595342 6.091953   495
## [5]  {home building block word}          => {white hanging heart t-light holder} 0.01237263  0.3281853 0.03770015 2.992876   255
## [6]  {heart of wicker large}             => {white hanging heart t-light holder} 0.01737021  0.3866091 0.04492965 3.525669   358
## [7]  {hanging heart jar t-light holder}  => {white hanging heart t-light holder} 0.01091703  0.3090659 0.03532266 2.818517   225
## [8]  {hanging heart zinc t-light holder} => {white hanging heart t-light holder} 0.01052887  0.3598673 0.02925764 3.281799   217
## [9]  {wooden frame antique white}        => {white hanging heart t-light holder} 0.01644833  0.3491246 0.04711305 3.183831   339
## [10] {lovebird hanging decoration white} => {white hanging heart t-light holder} 0.01111111  0.4209559 0.02639495 3.838894   229
## [11] {wooden picture frame white finish} => {white hanging heart t-light holder} 0.01979622  0.3709091 0.05337215 3.382494   408
## [12] {bathroom metal sign}               => {white hanging heart t-light holder} 0.01213003  0.3714710 0.03265405 3.387619   250
## [13] {heart of wicker small}             => {white hanging heart t-light holder} 0.01897137  0.3255620 0.05827268 2.968953   391
## [14] {natural slate heart chalkboard}    => {white hanging heart t-light holder} 0.02013586  0.3322658 0.06060165 3.030088   415
## [15] {dotcom postage}                    => {white hanging heart t-light holder} 0.01319748  0.3841808 0.03435226 3.503525   272
## [16] {wooden frame antique white,                                                                                               
##       wooden picture frame white finish} => {white hanging heart t-light holder} 0.01057739  0.4044527 0.02615235 3.688394   218
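
The same subsetting pattern works on either side of a rule. The sketch below shows it on a small set of made-up baskets (the items and thresholds are hypothetical and chosen only so that a few rules exist), filtering once on the rhs and once on the lhs.

```r
library(arules)

# Hypothetical baskets, for illustration only
toy_tr <- as(list(
  c("tea", "cake"),
  c("tea", "cake", "jam"),
  c("tea", "jam"),
  c("cake", "jam"),
  c("tea", "cake")
), "transactions")

# Loose thresholds so the toy data yields some rules
toy_rules <- apriori(toy_tr,
                     parameter = list(support = 0.2, confidence = 0.5, minlen = 2),
                     control = list(verbose = FALSE))

# Rules that predict "cake" versus rules that start from "cake"
cake_rhs <- subset(toy_rules, subset = rhs %in% "cake")
cake_lhs <- subset(toy_rules, subset = lhs %in% "cake")
inspect(cake_rhs)
```

Filtering on the lhs answers a different question: what tends to be bought by customers who already have a given product in the basket.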

Visualizing rules

plot(rules)

Graph visualization works best for small subsets of rules; here, the 11 rules with the highest lift are shown.

plot(sort(rules, by='lift', decreasing=TRUE)[1:11], method='graph')

Running eclat algorithm

Unlike apriori as used above, eclat mines frequent itemsets rather than rules, so it only measures the support of sets of items. It requires only a minimum support level; no confidence or lift is involved. The algorithm outputs itemsets (subsets of items), not rules.
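
Eclat is commonly described in terms of a vertical layout: each item is mapped to the ids of the transactions that contain it, and the support of an itemset is the size of the intersection of those id lists divided by the number of transactions. The base-R sketch below illustrates that idea on made-up baskets. It is a conceptual illustration only, not how arules implements eclat internally.

```r
# Hypothetical baskets, for illustration only
baskets <- list(
  t1 = c("tea", "cake"),
  t2 = c("tea", "cake", "jam"),
  t3 = c("tea"),
  t4 = c("cake", "jam")
)

# Vertical layout: item -> ids of the transactions containing it
items <- sort(unique(unlist(baskets)))
tidlists <- lapply(setNames(items, items), function(item) {
  names(baskets)[vapply(baskets, function(b) item %in% b, logical(1))]
})

# Support of {tea, cake}: intersect the two id lists and divide by n
pair_tids <- intersect(tidlists$tea, tidlists$cake)
supp_tea_cake <- length(pair_tids) / length(baskets)
supp_tea_cake
```

Larger itemsets are handled the same way, by intersecting the id list of a frequent itemset with that of one more item.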

Performing eclat with a minimum subset length of 2.

eclat_sets <- eclat(tr, parameter=list(support=0.01, minlen = 2))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.01      2     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 206 
## 
## create itemset ... 
## set transactions ...[4157 item(s), 20610 transaction(s)] done [0.52s].
## sorting and recoding items ... [783 item(s)] done [0.01s].
## creating sparse bit matrix ... [783 row(s), 20610 column(s)] done [0.05s].
## writing  ... [971 set(s)] done [1.83s].
## Creating S4 object  ... done [0.00s].

There were 971 itemsets (subsets of items) that satisfied the minimum support of 1%.

summary(eclat_sets)

## set of 971 itemsets
## 
## most frequent items:
##     jumbo bag red retrospot     lunch bag red retrospot 
##                         140                          90 
##      jumbo storage bag suki              dotcom postage 
##                          69                          64 
## red retrospot charlotte bag                     (Other) 
##                          61                        1746 
## 
## element (itemset/transaction) length distribution:sizes
##   2   3   4 
## 759 196  16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.235   2.000   4.000 
## 
## summary of quality measures:
##     support        transIdenticalToItemsets     count      
##  Min.   :0.01004   Min.   :207.0            Min.   :207.0  
##  1st Qu.:0.01089   1st Qu.:224.5            1st Qu.:224.5  
##  Median :0.01228   Median :253.0            Median :253.0  
##  Mean   :0.01358   Mean   :280.0            Mean   :280.0  
##  3rd Qu.:0.01460   3rd Qu.:301.0            3rd Qu.:301.0  
##  Max.   :0.04003   Max.   :825.0            Max.   :825.0  
## 
## includes transaction ID lists: FALSE 
## 
## mining info:
##  data ntransactions support
##    tr         20610    0.01

Sets with the highest support

The most frequent combination among all the transactions was jumbo bag pink polkadot and jumbo bag red retrospot, with a support of 0.0400291.

inspect(sort(eclat_sets, by='support', decreasing=TRUE)[1:9])

##     items                                  support transIdenticalToItemsets count
## [1] {jumbo bag pink polkadot,                                                    
##      jumbo bag red retrospot}           0.04002911                      825   825
## [2] {green regency teacup and saucer,                                            
##      roses regency teacup and saucer}   0.03726346                      768   768
## [3] {jumbo bag red retrospot,                                                    
##      jumbo storage bag suki}            0.03512858                      724   724
## [4] {jumbo bag red retrospot,                                                    
##      jumbo shopper vintage red paisley} 0.03299369                      680   680
## [5] {lunch bag red retrospot,                                                    
##      lunch bag suki design}             0.03173217                      654   654
## [6] {lunch bag black skull,                                                      
##      lunch bag red retrospot}           0.03110141                      641   641
## [7] {alarm clock bakelike green,                                                 
##      alarm clock bakelike red}          0.03105289                      640   640
## [8] {green regency teacup and saucer,                                            
##      pink regency teacup and saucer}    0.03071325                      633   633
## [9] {lunch bag pink polkadot,                                                    
##      lunch bag red retrospot}           0.02940320                      606   606

Sets graph

plot(sort(eclat_sets, by='support', decreasing=TRUE)[1:10], method='graph')

Reference

Chen, D., Sain, S.L., and Guo, K. (2012). Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208.