Background

The Online Retail II data set contains all transactions made by a UK-based, registered, non-store online retailer between December 1, 2009 and December 9, 2011. The company mainly sells unique all-occasion giftware. Many of its customers are wholesalers.


Setting up packages

library(arules) # Apriori and Eclat algorithms
library(readxl) # Read excel
library(arulesViz) # Association rules visualization
library(dplyr) # Data manipulation
library(stringr) # String processing

Getting and preprocessing data

Only the observations from the "Year 2010-2011" sheet are read, since the full data set has too many.

dat <- read_excel("online_retail_II.xlsx",
                  sheet='Year 2010-2011',
                  guess_max = 100,
                  range = cell_cols("A:C"),
                  col_types = c("text", "text", "text") 
)
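dat <- select(dat, Invoice, Description) %>%
    filter(Description != "",                             # drop empty descriptions
           Description != "Discount",                     # drop discount lines
           !grepl("^C", Invoice, ignore.case = TRUE)) %>% # drop cancellations (invoice starts with "C")
    mutate(Description = tolower(str_squish(Description))) # normalise whitespace and case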

Remove some punctuation

dat <- mutate(dat, Description = str_remove_all(Description, "'|\\.|,"))
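
The cleaning steps above can be tried on a couple of strings. This is a minimal sketch: the two raw descriptions are made up for illustration, and the reason given in the comment for removing commas and apostrophes (keeping the later comma-separated file easy to parse) is an assumption rather than something stated in the analysis.

```r
library(stringr)

# Hypothetical raw descriptions, not taken from the data set
raw <- c("  WHITE  HANGING HEART T-LIGHT HOLDER ",
         "Children's Cutlery, Dolly Girl.")

# Same steps as above: collapse whitespace, lower-case, then drop ' . and ,
# (assumption: commas and quotes could otherwise confuse a comma-separated file)
clean <- tolower(str_squish(raw))
clean <- str_remove_all(clean, "'|\\.|,")

print(clean)
```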

Write csv so it’s easier to read with read.transactions

write.csv(dat, file="dat.csv", row.names=FALSE)

Convert data frame to transactions format.

tr <- read.transactions("dat.csv", format = "single", sep=",",
                        rm.duplicates=TRUE, header=TRUE, cols=1:2)
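
The CSV round trip is one option; the coercion can also be done in memory. A minimal sketch, assuming a data frame with the same two columns (Invoice and Description). The toy rows are hypothetical and only illustrate the shape of the data.

```r
library(arules)

# Hypothetical toy data with the same two columns as `dat`
toy <- data.frame(
  Invoice     = c("1", "1", "1", "2", "2"),
  Description = c("mug", "plate", "mug", "mug", "spoon"),
  stringsAsFactors = FALSE
)

# Group descriptions by invoice; unique() drops repeated items within a
# basket, which mirrors rm.duplicates=TRUE in read.transactions
baskets <- lapply(split(toy$Description, toy$Invoice), unique)

# Coerce the named list of baskets to a transactions object
tr_toy <- as(baskets, "transactions")
inspect(tr_toy)
```

Each list element becomes one transaction, and the list names are kept as transaction IDs.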

Since many customers are wholesalers, we expect to see a fair number of baskets that contain at least 100 items.

summary(tr)
## transactions as itemMatrix in sparse format with
##  20610 rows (elements/itemsets/transactions) and
##  4157 columns (items) and a density of 0.006073024 
## 
## most frequent items:
## white hanging heart t-light holder            jumbo bag red retrospot 
##                               2260                               2092 
##           regency cakestand 3 tier                      party bunting 
##                               1989                               1686 
##            lunch bag red retrospot                            (Other) 
##                               1564                             510720 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2272  836  689  666  687  615  606  605  611  545  552  493  507  528  547  552 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##  467  443  484  435  402  347  345  309  248  259  243  243  270  225  197  188 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##  162  177  136  135  132  122  136  123  123  103   96  104  100   90   85   95 
##   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64 
##   87   86   57   65   78   71   72   50   64   52   35   61   40   29   43   39 
##   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80 
##   39   43   33   40   29   33   39   24   24   34   26   21   19   27   16   12 
##   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96 
##   20   21   15   23   17   17    9   17   11   12    9   15   16    7    5   10 
##   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112 
##    9   13    5   11   10    3    6    9    2    5    6    4    4    4    7    3 
##  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  128 
##    5    6    6    9    5    4    8    5    6   11    4    5    3    4    8    1 
##  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144 
##    2    4    3    3    2    5    4    2    6    6    2    5    6    2    2    5 
##  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  160 
##    5    3    2    4    5    3    5    3    6    2    2    2    4    4    1    2 
##  161  162  163  164  165  166  167  168  169  170  171  172  173  174  175  176 
##    3    3    3    2    5    4    1    4    4    2    2    4    3    4    2    5 
##  177  178  179  180  181  182  183  184  185  186  187  188  189  190  191  192 
##    5    4    2    4    2    6    4    3    3    3    2    3    4    4    2    3 
##  193  194  195  196  197  198  199  202  203  204  205  206  207  208  210  211 
##    2    3    3    4    2    2    3    2    5    5    1    2    1    4    1    4 
##  212  213  214  215  216  217  218  219  220  222  223  224  225  226  227  228 
##    1    1    2    1    2    4    2    2    2    1    1    3    3    1    1    1 
##  229  230  232  233  234  235  237  238  239  241  242  243  244  247  249  250 
##    2    1    1    1    1    1    3    3    1    2    1    2    2    2    3    2 
##  253  254  255  257  259  261  262  263  264  266  267  270  275  279  280  282 
##    1    2    2    2    1    2    2    1    2    1    1    2    1    2    2    1 
##  283  285  286  288  289  291  292  295  296  298  299  301  309  310  315  319 
##    2    2    1    2    1    1    1    1    1    2    1    1    1    1    1    1 
##  320  331  332  333  334  339  341  344  345  347  348  349  352  354  357  358 
##    1    1    4    1    1    1    1    1    1    2    1    1    2    1    1    1 
##  363  369  375  376  379  382  386  388  399  404  408  411  414  415  416  419 
##    1    1    1    1    1    1    1    1    2    2    1    1    1    1    2    1 
##  420  428  433  434  438  439  443  449  453  455  458  460  463  471  482  486 
##    1    1    1    2    1    2    1    1    1    1    1    1    1    1    1    1 
##  487  488  494  499  503  506  514  515  517  518  520  522  524  525  527  529 
##    1    1    1    1    2    1    1    1    1    1    1    1    1    2    1    1 
##  531  536  539  541  543  552  561  567  572  578  585  588  589  593  595  599 
##    1    1    1    1    1    1    1    1    1    1    2    1    1    1    1    1 
##  601  607  622  629  635  645  647  649  661  673  676  687  703  720  731  748 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 1108 
##    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00   15.00   25.25   28.00 1108.00 
## 
## includes extended item information - examples:
##                    labels
## 1   *boombox ipod classic
## 2 *usb office mirror ball
## 3                       ?
## 
## includes extended transaction information - examples:
##   transactionID
## 1        536365
## 2        536366
## 3        536367

Plot of the most frequently bought items

itemFrequencyPlot(tr, topN=10, cex=0.7)

There are just 6 items with a support (relative frequency) of at least 7%. This is probably because the company offers a wide range of products, as these are unique all-occasion gifts.

itemFrequencyPlot(tr, support=0.07, cex=0.8)
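
The number of items above a support threshold can also be read off numerically rather than from the plot. A minimal sketch on hypothetical toy baskets (not the retail data), using `itemFrequency` from arules:

```r
library(arules)

# Hypothetical toy baskets, not taken from the retail data
toy_tr <- as(list(
  c("mug", "plate"),
  c("mug", "spoon"),
  c("mug"),
  c("plate", "spoon")
), "transactions")

# Relative support of each item: share of baskets that contain it
freq <- itemFrequency(toy_tr)
print(sort(freq, decreasing = TRUE))

# Number of items whose support is at least the chosen threshold
n_frequent <- sum(freq >= 0.5)
print(n_frequent)
```

On the real transactions, the same idea would be `sum(itemFrequency(tr) >= 0.07)`.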


Running apriori algorithm

The minimum support should not be too high (only two items have a support above 10%), but also not too low, because we want items that are bought often. A support of 1% and a confidence of 30% were chosen; a rule must satisfy both the minimum support and the minimum confidence constraints at the same time. With these settings, the algorithm produced 1381 rules.

rules <- apriori(tr, parameter=list(support=0.01, confidence=0.3, target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 206 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4157 item(s), 20610 transaction(s)] done [0.47s].
## sorting and recoding items ... [783 item(s)] done [0.01s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 done [0.13s].
## writing ... [1381 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].
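
The relative support of 1% translates into a minimum number of transactions. A small arithmetic sketch using the figures reported above (20610 transactions, support 0.01); the variable names are illustrative only:

```r
n_transactions <- 20610   # from summary(tr)
min_support    <- 0.01    # support passed to apriori

# Support threshold expressed as a number of transactions
threshold <- min_support * n_transactions
print(threshold)

# Smallest whole number of transactions that meets the threshold
min_count <- ceiling(threshold)
print(min_count)
```

The threshold works out to 206.1 transactions, so an itemset needs at least 207 occurrences, which agrees with the minimum count of 207 in the rules summary.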

Basket rules summary

summary(rules)
## set of 1381 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
## 735 582  64 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.514   3.000   4.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.01004   Min.   :0.3002   Min.   :0.01058   Min.   : 2.819  
##  1st Qu.:0.01092   1st Qu.:0.4000   1st Qu.:0.01926   1st Qu.: 7.791  
##  Median :0.01252   Median :0.5144   Median :0.02586   Median :10.087  
##  Mean   :0.01395   Mean   :0.5299   Mean   :0.02880   Mean   :13.144  
##  3rd Qu.:0.01519   3rd Qu.:0.6396   3rd Qu.:0.03474   3rd Qu.:14.675  
##  Max.   :0.04003   Max.   :0.9587   Max.   :0.10150   Max.   :80.078  
##      count      
##  Min.   :207.0  
##  1st Qu.:225.0  
##  Median :258.0  
##  Mean   :287.6  
##  3rd Qu.:313.0  
##  Max.   :825.0  
## 
## mining info:
##  data ntransactions support confidence
##    tr         20610    0.01        0.3

Basket rules of size equal to 2

inspect(head(subset(rules, size(rules) == 2), 10))
##      lhs                                     rhs                                  support confidence   coverage      lift count
## [1]  {childrens cutlery dolly girl}       => {childrens cutlery spaceboy}      0.01072295  0.7594502 0.01411936 43.478522   221
## [2]  {childrens cutlery spaceboy}         => {childrens cutlery dolly girl}    0.01072295  0.6138889 0.01746725 43.478522   221
## [3]  {childrens cutlery polkadot blue}    => {childrens cutlery polkadot pink} 0.01067443  0.7612457 0.01402232 37.090481   220
## [4]  {childrens cutlery polkadot pink}    => {childrens cutlery polkadot blue} 0.01067443  0.5200946 0.02052402 37.090481   220
## [5]  {painted metal pears assorted}       => {assorted colour bird ornament}   0.01256672  0.7000000 0.01795245  9.915464   259
## [6]  {round snack boxes set of4 woodland} => {postage}                         0.01154779  0.3251366 0.03551674  5.945932   238
## [7]  {lunch bag doiley pattern}           => {jumbo bag doiley patterns}       0.01101407  0.4613821 0.02387191 18.500166   227
## [8]  {jumbo bag doiley patterns}          => {lunch bag doiley pattern}        0.01101407  0.4416342 0.02493935 18.500166   227
## [9]  {lunch bag doiley pattern}           => {lunch bag apple design}          0.01004367  0.4207317 0.02387191  8.369962   207
## [10] {pink happy birthday bunting}        => {blue happy birthday bunting}     0.01271228  0.6517413 0.01950509 34.442021   262

Basket rules of size greater than 3

inspect(head(subset(rules, size(rules) > 3), 10))
##      lhs                                  rhs                                  support confidence   coverage      lift count
## [1]  {green regency teacup and saucer,                                                                                      
##       pink regency teacup and saucer,                                                                                       
##       roses regency teacup and saucer} => {regency cakestand 3 tier}        0.01460456  0.5553506 0.02629791  5.754537   301
## [2]  {pink regency teacup and saucer,                                                                                       
##       regency cakestand 3 tier,                                                                                             
##       roses regency teacup and saucer} => {green regency teacup and saucer} 0.01460456  0.9093656 0.01606016 18.465048   301
## [3]  {green regency teacup and saucer,                                                                                      
##       pink regency teacup and saucer,                                                                                       
##       regency cakestand 3 tier}        => {roses regency teacup and saucer} 0.01460456  0.8775510 0.01664241 16.966535   301
## [4]  {green regency teacup and saucer,                                                                                      
##       regency cakestand 3 tier,                                                                                             
##       roses regency teacup and saucer} => {pink regency teacup and saucer}  0.01460456  0.7377451 0.01979622 19.849773   301
## [5]  {lunch bag cars blue,                                                                                                  
##       lunch bag pink polkadot,                                                                                              
##       lunch bag suki design}           => {lunch bag red retrospot}         0.01023775  0.7873134 0.01300340 10.375019   211
## [6]  {lunch bag cars blue,                                                                                                  
##       lunch bag pink polkadot,                                                                                              
##       lunch bag red retrospot}         => {lunch bag suki design}           0.01023775  0.6552795 0.01562348 10.509969   211
## [7]  {lunch bag pink polkadot,                                                                                              
##       lunch bag red retrospot,                                                                                              
##       lunch bag suki design}           => {lunch bag cars blue}             0.01023775  0.6374622 0.01606016 11.424432   211
## [8]  {lunch bag cars blue,                                                                                                  
##       lunch bag red retrospot,                                                                                              
##       lunch bag suki design}           => {lunch bag pink polkadot}         0.01023775  0.6242604 0.01639981 11.803675   211
## [9]  {lunch bag black skull,                                                                                                
##       lunch bag cars blue,                                                                                                  
##       lunch bag pink polkadot}         => {lunch bag red retrospot}         0.01091703  0.7258065 0.01504124  9.564496   225
## [10] {lunch bag cars blue,                                                                                                  
##       lunch bag pink polkadot,                                                                                              
##       lunch bag red retrospot}         => {lunch bag black skull}           0.01091703  0.6987578 0.01562348 11.312960   225

Taking a look at the top 11 rules by lift

inspect(sort(rules, by='lift', decreasing=TRUE)[1:11])
##      lhs                       rhs                    support    confidence
## [1]  {herb marker thyme}    => {herb marker rosemary} 0.01072295 0.9324895 
## [2]  {herb marker rosemary} => {herb marker thyme}    0.01072295 0.9208333 
## [3]  {herb marker thyme}    => {herb marker parsley}  0.01033479 0.8987342 
## [4]  {herb marker parsley}  => {herb marker thyme}    0.01033479 0.8949580 
## [5]  {herb marker parsley}  => {herb marker rosemary} 0.01043183 0.9033613 
## [6]  {herb marker rosemary} => {herb marker parsley}  0.01043183 0.8958333 
## [7]  {herb marker parsley}  => {herb marker mint}     0.01028627 0.8907563 
## [8]  {herb marker mint}     => {herb marker parsley}  0.01028627 0.8833333 
## [9]  {herb marker basil}    => {herb marker rosemary} 0.01038331 0.8842975 
## [10] {herb marker rosemary} => {herb marker basil}    0.01038331 0.8916667 
## [11] {herb marker parsley}  => {herb marker basil}    0.01028627 0.8907563 
##      coverage   lift     count
## [1]  0.01149927 80.07753 221  
## [2]  0.01164483 80.07753 221  
## [3]  0.01149927 77.82736 213  
## [4]  0.01154779 77.82736 213  
## [5]  0.01154779 77.57616 215  
## [6]  0.01164483 77.57616 215  
## [7]  0.01154779 76.49370 212  
## [8]  0.01164483 76.49370 212  
## [9]  0.01174187 75.93905 214  
## [10] 0.01164483 75.93905 214  
## [11] 0.01154779 75.86152 212
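
The quality measures in this table are tied together by simple ratios. As a sketch, the first rule ({herb marker thyme} => {herb marker rosemary}) can be recomputed by hand. The counts 237 and 240 are not printed in the output; they are back-computed here from the coverage column of rules [1] and [2] times 20610, so treat them as assumptions.

```r
n       <- 20610  # number of transactions, from summary(tr)
n_both  <- 221    # baskets containing thyme and rosemary (count column)
n_thyme <- 237    # assumed: coverage of rule [1] times n, rounded
n_rosem <- 240    # assumed: coverage of rule [2] times n, rounded

support    <- n_both / n                  # share of baskets with both items
confidence <- n_both / n_thyme            # share of thyme baskets that also have rosemary
lift       <- confidence / (n_rosem / n)  # confidence relative to rosemary's own support

print(round(c(support = support, confidence = confidence, lift = lift), 5))
```

A lift of about 80 means that rosemary markers appear in thyme baskets roughly 80 times more often than they would if the two items were bought independently.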

Checking the rules that have the product with the highest support (white hanging heart t-light holder) on the right-hand side (rhs)

heart.rhs <- subset(rules, subset = rhs %in% 'white hanging heart t-light holder')
inspect(heart.rhs)
##      lhs                                    rhs                                     support confidence   coverage     lift count
## [1]  {candleholder pink hanging heart}   => {white hanging heart t-light holder} 0.01368268  0.7085427 0.01931101 6.461533   282
## [2]  {zinc metal heart decoration}       => {white hanging heart t-light holder} 0.01014071  0.3906542 0.02595827 3.562559   209
## [3]  {love building block word}          => {white hanging heart t-light holder} 0.01077147  0.3529412 0.03051917 3.218636   222
## [4]  {red hanging heart t-light holder}  => {white hanging heart t-light holder} 0.02401747  0.6680162 0.03595342 6.091953   495
## [5]  {home building block word}          => {white hanging heart t-light holder} 0.01237263  0.3281853 0.03770015 2.992876   255
## [6]  {heart of wicker large}             => {white hanging heart t-light holder} 0.01737021  0.3866091 0.04492965 3.525669   358
## [7]  {hanging heart jar t-light holder}  => {white hanging heart t-light holder} 0.01091703  0.3090659 0.03532266 2.818517   225
## [8]  {hanging heart zinc t-light holder} => {white hanging heart t-light holder} 0.01052887  0.3598673 0.02925764 3.281799   217
## [9]  {wooden frame antique white}        => {white hanging heart t-light holder} 0.01644833  0.3491246 0.04711305 3.183831   339
## [10] {lovebird hanging decoration white} => {white hanging heart t-light holder} 0.01111111  0.4209559 0.02639495 3.838894   229
## [11] {wooden picture frame white finish} => {white hanging heart t-light holder} 0.01979622  0.3709091 0.05337215 3.382494   408
## [12] {bathroom metal sign}               => {white hanging heart t-light holder} 0.01213003  0.3714710 0.03265405 3.387619   250
## [13] {heart of wicker small}             => {white hanging heart t-light holder} 0.01897137  0.3255620 0.05827268 2.968953   391
## [14] {natural slate heart chalkboard}    => {white hanging heart t-light holder} 0.02013586  0.3322658 0.06060165 3.030088   415
## [15] {dotcom postage}                    => {white hanging heart t-light holder} 0.01319748  0.3841808 0.03435226 3.503525   272
## [16] {wooden frame antique white,                                                                                               
##       wooden picture frame white finish} => {white hanging heart t-light holder} 0.01057739  0.4044527 0.02615235 3.688394   218

Visualizing rules

plot(rules)

Graph visualization for small subsets of rules, in this case, the top 11 with the highest lift.

plot(sort(rules, by='lift', decreasing=TRUE)[1:11], method='graph')


Running eclat algorithm

Unlike the apriori run above, eclat mines frequent itemsets rather than rules, so it only looks at the support of a set of items. It only requires a minimum support level; no confidence or lift is involved.

Performing eclat with a minimum subset length of 2.

eclat_sets <- eclat(tr, parameter=list(support=0.01, minlen = 2))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.01      2     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 206 
## 
## create itemset ... 
## set transactions ...[4157 item(s), 20610 transaction(s)] done [0.52s].
## sorting and recoding items ... [783 item(s)] done [0.01s].
## creating sparse bit matrix ... [783 row(s), 20610 column(s)] done [0.05s].
## writing  ... [971 set(s)] done [1.83s].
## Creating S4 object  ... done [0.00s].
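
To illustrate that eclat returns itemsets rather than rules, the sketch below runs both algorithms on hypothetical toy baskets (not the retail data) and asks apriori for frequent itemsets as well, via `target = "frequent itemsets"`. The support of 0.3 is chosen only to suit the toy data.

```r
library(arules)

# Hypothetical toy baskets, not taken from the retail data
toy_tr <- as(list(
  c("mug", "plate", "spoon"),
  c("mug", "plate"),
  c("mug", "spoon"),
  c("plate", "spoon"),
  c("mug", "plate", "spoon")
), "transactions")

params <- list(support = 0.3, minlen = 2)

# Frequent itemsets from eclat
sets_eclat <- eclat(toy_tr, parameter = params,
                    control = list(verbose = FALSE))

# apriori can return itemsets instead of rules when asked to
sets_apriori <- apriori(toy_tr,
                        parameter = c(params, list(target = "frequent itemsets")),
                        control = list(verbose = FALSE))

# Both algorithms should report the same itemsets
same_sets <- setequal(labels(sets_eclat), labels(sets_apriori))
print(same_sets)
inspect(sort(sets_eclat, by = "support"))
```

The two algorithms differ in how they search for the itemsets, not in which itemsets meet a given minimum support.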

There were 971 subsets (or itemsets) that satisfied a minimum support of 1%.

summary(eclat_sets)
## set of 971 itemsets
## 
## most frequent items:
##     jumbo bag red retrospot     lunch bag red retrospot 
##                         140                          90 
##      jumbo storage bag suki              dotcom postage 
##                          69                          64 
## red retrospot charlotte bag                     (Other) 
##                          61                        1746 
## 
## element (itemset/transaction) length distribution:sizes
##   2   3   4 
## 759 196  16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.235   2.000   4.000 
## 
## summary of quality measures:
##     support        transIdenticalToItemsets     count      
##  Min.   :0.01004   Min.   :207.0            Min.   :207.0  
##  1st Qu.:0.01089   1st Qu.:224.5            1st Qu.:224.5  
##  Median :0.01228   Median :253.0            Median :253.0  
##  Mean   :0.01358   Mean   :280.0            Mean   :280.0  
##  3rd Qu.:0.01460   3rd Qu.:301.0            3rd Qu.:301.0  
##  Max.   :0.04003   Max.   :825.0            Max.   :825.0  
## 
## includes transaction ID lists: FALSE 
## 
## mining info:
##  data ntransactions support
##    tr         20610    0.01

Sets with the highest support

The most frequent combination among all the transactions was jumbo bag pink polkadot and jumbo bag red retrospot, with a support of 0.0400291.

inspect(sort(eclat_sets, by='support', decreasing=TRUE)[1:9])
##     items                                  support transIdenticalToItemsets count
## [1] {jumbo bag pink polkadot,                                                    
##      jumbo bag red retrospot}           0.04002911                      825   825
## [2] {green regency teacup and saucer,                                            
##      roses regency teacup and saucer}   0.03726346                      768   768
## [3] {jumbo bag red retrospot,                                                    
##      jumbo storage bag suki}            0.03512858                      724   724
## [4] {jumbo bag red retrospot,                                                    
##      jumbo shopper vintage red paisley} 0.03299369                      680   680
## [5] {lunch bag red retrospot,                                                    
##      lunch bag suki design}             0.03173217                      654   654
## [6] {lunch bag black skull,                                                      
##      lunch bag red retrospot}           0.03110141                      641   641
## [7] {alarm clock bakelike green,                                                 
##      alarm clock bakelike red}          0.03105289                      640   640
## [8] {green regency teacup and saucer,                                            
##      pink regency teacup and saucer}    0.03071325                      633   633
## [9] {lunch bag pink polkadot,                                                    
##      lunch bag red retrospot}           0.02940320                      606   606

Sets graph

plot(sort(eclat_sets, by='support', decreasing=TRUE)[1:10], method='graph')

Reference

Chen, D., Sain, S.L., and Guo, K. (2012). Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208.