Market Basket Analysis

Market Basket Analysis (MBA) uncovers associations between products by looking for combinations of products that frequently co-occur in transactions. It allows supermarkets to identify relationships between the products that customers buy.

Retail Market Basket Data Set - General Description

This project analyses the retail market basket data set supplied by an anonymous Belgian retail supermarket store. The data were collected over three non-consecutive periods, giving approximately 5 months of data in total. The total number of receipts collected is 88,162. Over the entire data collection period, the store carried 16,470 unique SKUs (Stock Keeping Units). In total, 5,133 customers purchased at least one product in the supermarket during the data collection period.

Project Description

Algorithms and Packages used

+ arules
+ arulesViz
+ eclat
+ Frequent Pattern growth
+ bigmemory

Data Statistics

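library(arules)     # read.transactions(), apriori(), eclat(), interestMeasure()
library(arulesViz)  # plot() methods for rule sets
retail <- read.transactions("Retail.csv", sep = " ")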
summary(retail)
## transactions as itemMatrix in sparse format with
##  88162 rows (elements/itemsets/transactions) and
##  16470 columns (items) and a density of 0.0006257289 
## 
## most frequent items:
##      39      48      38      32      41 (Other) 
##   50675   42135   15596   15167   14945  770058 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 3016 5516 6919 7210 6814 6163 5746 5143 4660 4086 3751 3285 2866 2620 2310 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
## 2115 1874 1645 1469 1290 1205  981  887  819  684  586  582  472  480  355 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
##  310  303  272  234  194  136  153  123  115  112   76   66   71   60   50 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
##   44   37   37   33   22   24   21   21   10   11   10    9   11    4    9 
##   61   62   63   64   65   66   67   68   71   73   74   76 
##    7    4    5    2    2    5    3    3    1    1    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00    8.00   10.31   14.00   76.00 
## 
## includes extended item information - examples:
##   labels
## 1      0
## 2      1
## 3     10

Item Frequency Plot

itemFrequencyPlot(retail, topN = 20, main = "Frequently purchased items", xlab = "Item", ylab = "Item Frequency (Relative)")

The plot shows that the most frequently purchased item is “39”, followed by “48”. The rest of this analysis focuses on item “39”, which has a relative frequency of nearly 0.6, and item “110”, which has a relative frequency of about 0.03.
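The relative frequencies of the top items can be reproduced from the counts printed by summary(retail). A minimal sketch in base R, using only the five item counts and the transaction total from that output (the variable names are illustrative, not part of the original analysis):

```r
# Counts of the five most frequent items, copied from summary(retail)
item_counts <- c("39" = 50675, "48" = 42135, "38" = 15596,
                 "32" = 15167, "41" = 14945)
n_transactions <- 88162

# Relative frequency (support) of an item = its count / number of transactions
rel_freq <- item_counts / n_transactions
print(round(rel_freq, 4))
```

Item “39” comes out at roughly 0.575, which is the "nearly 0.6" read off the plot.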

Apriori algorithm

a.rules <- apriori(retail)
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##         0.8    0.1    1 none FALSE            TRUE     0.1      1     10
##  target   ext
##   rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[16470 item(s), 88162 transaction(s)] done [0.12s].
## sorting and recoding items ... [5 item(s)] done [0.02s].
## creating transaction tree ... done [0.03s].
## checking subsets of size 1 2 3 done [0.05s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.01s].

By default, the Apriori algorithm runs with a minimum support of 10% and a minimum confidence of 80%. The log shows that five items meet the 10% support threshold, but no rule built from them reaches 80% confidence, so zero rules are returned.

Apriori algorithm with support of 1% and confidence of 25%

retailrules <- apriori(retail, parameter = list(support = 0.01, confidence = 0.25, minlen = 2))
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##        0.25    0.1    1 none FALSE            TRUE    0.01      2     10
##  target   ext
##   rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[16470 item(s), 88162 transaction(s)] done [0.12s].
## sorting and recoding items ... [70 item(s)] done [0.00s].
## creating transaction tree ... done [0.05s].
## checking subsets of size 1 2 3 4 done [0.02s].
## writing ... [125 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Lowering the minimum support to 1% and the minimum confidence to 25%, and requiring at least two items per rule (minlen = 2), produces 125 rules.
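To see what the two support thresholds mean in absolute terms, they can be converted into transaction counts. This is a rough sketch in base R, assuming an itemset counts as frequent when count / number of transactions >= support; min_count is a hypothetical helper, not an arules function:

```r
n_transactions <- 88162   # from summary(retail)

# Smallest number of transactions an itemset needs to reach a support level,
# assuming "frequent" means count / n_transactions >= support
min_count <- function(support, n) ceiling(support * n)

default_min <- min_count(0.10, n_transactions)  # default run
lowered_min <- min_count(0.01, n_transactions)  # second run
print(c(default = default_min, lowered = lowered_min))

# The five most frequent items from summary(retail) all clear the 10% bar
item_counts <- c("39" = 50675, "48" = 42135, "38" = 15596,
                 "32" = 15167, "41" = 14945)
print(sum(item_counts >= default_min))
```

All five of the most frequent items clear the 10% threshold, which is consistent with the "[5 item(s)]" line in the first log. Lowering the threshold to 1% raises that to the 70 items reported in the second log.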

Analysing the rules

summary(retailrules)
## set of 125 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3  4 
## 56 51 18 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.696   3.000   4.000 
## 
## summary of quality measures:
##     support          confidence          lift       
##  Min.   :0.01013   Min.   :0.2528   Min.   :0.9698  
##  1st Qu.:0.01287   1st Qu.:0.5751   1st Qu.:1.1618  
##  Median :0.01745   Median :0.6590   Median :1.2632  
##  Mean   :0.03043   Mean   :0.6576   Mean   :1.7518  
##  3rd Qu.:0.02667   3rd Qu.:0.7509   3rd Qu.:1.4591  
##  Max.   :0.33055   Max.   :0.9942   Max.   :5.6202  
## 
## mining info:
##    data ntransactions support confidence
##  retail         88162    0.01       0.25

Sorting by lift

inspect(sort(retailrules,by = "lift")[1:20])
##     lhs            rhs  support    confidence lift    
## 110 {110,39,48} => {38} 0.01169438 0.9942141  5.620153
## 116 {170,39,48} => {38} 0.01353191 0.9892206  5.591925
## 60  {110,39}    => {38} 0.01973639 0.9891984  5.591800
## 72  {170,48}    => {38} 0.01744516 0.9877970  5.583878
## 58  {110,48}    => {38} 0.01543749 0.9862319  5.575030
## 74  {170,39}    => {38} 0.02290102 0.9805731  5.543042
## 33  {170}       => {38} 0.03437989 0.9780574  5.528821
## 19  {110}       => {38} 0.03090901 0.9753042  5.513258
## 1   {37}        => {38} 0.01186452 0.9739292  5.505485
## 113 {36,39,48}  => {38} 0.01225018 0.9677419  5.470509
## 64  {36,48}     => {38} 0.01542615 0.9604520  5.429300
## 66  {36,39}     => {38} 0.02206166 0.9548355  5.397551
## 28  {36}        => {38} 0.03164629 0.9502725  5.371757
## 2   {286}       => {38} 0.01265852 0.9433643  5.332706
## 121 {38,39,48}  => {41} 0.02258343 0.3262865  1.924795
## 125 {32,39,48}  => {41} 0.01867018 0.3047020  1.797466
## 92  {38,48}     => {41} 0.02692770 0.2988419  1.762897
## 95  {38,39}     => {41} 0.03460675 0.2949251  1.739792
## 102 {32,39}     => {41} 0.02675756 0.2790065  1.645886
## 86  {39,89}     => {48} 0.02410336 0.7730084  1.617419

Sorting by confidence

inspect(sort(retailrules,by = "confidence")[1:20])
##     lhs            rhs  support    confidence lift    
## 110 {110,39,48} => {38} 0.01169438 0.9942141  5.620153
## 116 {170,39,48} => {38} 0.01353191 0.9892206  5.591925
## 60  {110,39}    => {38} 0.01973639 0.9891984  5.591800
## 72  {170,48}    => {38} 0.01744516 0.9877970  5.583878
## 58  {110,48}    => {38} 0.01543749 0.9862319  5.575030
## 74  {170,39}    => {38} 0.02290102 0.9805731  5.543042
## 33  {170}       => {38} 0.03437989 0.9780574  5.528821
## 19  {110}       => {38} 0.03090901 0.9753042  5.513258
## 1   {37}        => {38} 0.01186452 0.9739292  5.505485
## 113 {36,39,48}  => {38} 0.01225018 0.9677419  5.470509
## 64  {36,48}     => {38} 0.01542615 0.9604520  5.429300
## 66  {36,39}     => {38} 0.02206166 0.9548355  5.397551
## 28  {36}        => {38} 0.03164629 0.9502725  5.371757
## 2   {286}       => {38} 0.01265852 0.9433643  5.332706
## 119 {38,41,48}  => {39} 0.02258343 0.8386689  1.459077
## 105 {41,48}     => {39} 0.08355074 0.8168108  1.421049
## 83  {225,48}    => {39} 0.01587986 0.8064516  1.403027
## 123 {32,41,48}  => {39} 0.01867018 0.7978672  1.388092
## 79  {310,48}    => {39} 0.01527869 0.7960993  1.385016
## 111 {36,38,48}  => {39} 0.01225018 0.7941176  1.381569

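The two listings share their first 14 rules, all of which predict item “38” with confidence above 0.94 and lift above 5.3. They differ from the 15th rule onward. Sorted by lift, the next five rules predict item “41” with a lift between 1.6 and 1.9 but a confidence of only 0.28 to 0.33. Sorted by confidence, the next rules predict item “39” with a confidence of about 0.8 but a lift of only about 1.4, because “39” already appears in more than half of all transactions. At the top of both lists the confidence is close to 1: customers who purchased items “110”, “39” and “48” almost always (99.4% of the time) also purchased “38”.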
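The relationship between the two measures is lift(X => Y) = confidence(X => Y) / support(Y). The sketch below checks it in base R against two rules from the listings, taking the item counts for “38” and “41” from summary(retail); the lift function here is a hypothetical helper written for illustration:

```r
n_transactions <- 88162
count_38 <- 15596   # from summary(retail)
count_41 <- 14945

# lift(X => Y) = confidence(X => Y) / support(Y)
lift <- function(confidence, rhs_count, n) confidence / (rhs_count / n)

# {110,39,48} => {38}: high confidence and high lift
lift_38 <- lift(0.9942141, count_38, n_transactions)
# {38,39,48} => {41}: low confidence, but still well above the base rate of "41"
lift_41 <- lift(0.3262865, count_41, n_transactions)

print(c(lift_38 = lift_38, lift_41 = lift_41))
```

Both values agree with the lift column printed by inspect(), which shows that a rule can have a modest confidence and still a lift above 1 when its consequent is comparatively rare.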

Plotting the rules

plot(retailrules)

Graph method:

plot(head(sort(retailrules),10), method = "graph", control = list(type ="items"))

Grouped method:

plot(head(sort(retailrules),10), method = "grouped")

Matrix method:

plot(head(sort(retailrules),20), method = "matrix", measure = c("lift", "confidence"), control=list(reorder = T))
## Itemsets in Antecedent (LHS)
##  [1] "{170}"   "{41}"    "{39,41}" "{32,39}" "{39}"    "{32}"    "{38}"   
##  [8] "{41,48}" "{38,41}" "{38,48}" "{48}"    "{32,48}" "{39,48}" "{38,39}"
## Itemsets in Consequent (RHS)
## [1] "{39}" "{48}" "{41}" "{38}"

Double Decker:

samplerule <- head(sort(retailrules, by = "lift"), 1)
inspect(samplerule)
##     lhs            rhs  support    confidence lift    
## 110 {110,39,48} => {38} 0.01169438 0.9942141  5.620153
plot(samplerule, method = "doubledecker", data = retail)  

Looking at some interesting measures

im <- interestMeasure(head(sort(retailrules),20), c("coverage", "oddsRatio", "leverage", "hyperConfidence", "chiSquared"), transactions = retail)
head(im)
##    coverage oddsRatio     leverage hyperConfidence chiSquared
## 1 0.4779270 2.5513209  0.055840945    1.000000e+00 4507.98616
## 2 0.5747941 2.5513209  0.055840945    1.000000e+00 4507.98616
## 3 0.1695175 2.7957306  0.032028558    1.000000e+00 2628.44959
## 4 0.1769016 1.5747130  0.015658796    1.000000e+00  607.43952
## 5 0.1695175 1.8423354  0.021271988    1.000000e+00 1135.69003
## 6 0.1720356 0.9182089 -0.002982039    1.024248e-06   22.51991
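Two of these measures are easy to verify by hand. The first row appears to correspond to the rule {48} => {39}, the rule with the highest support. A minimal sketch in base R, using item counts from summary(retail) and the support of that rule from the rule listings (variable names are illustrative):

```r
n_transactions <- 88162
supp_48    <- 42135 / n_transactions   # item counts from summary(retail)
supp_39    <- 50675 / n_transactions
supp_39_48 <- 0.33055058               # support of {48} => {39} in the rule listing

# coverage(X => Y) = support(X): how often the rule can be applied
coverage <- supp_48

# leverage(X => Y) = support(X and Y) - support(X) * support(Y):
# the excess co-occurrence over what independence would predict
leverage <- supp_39_48 - supp_48 * supp_39

print(c(coverage = coverage, leverage = leverage))
```

A negative leverage, as in row 6, means the two sides of the rule co-occur slightly less often than they would if they were independent.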

Studying the rules for items “39” and “110”

s.retailrules <- sort(retailrules, by = "lift")
rules39 <- subset(s.retailrules, items %in% "39")
top.rules39 <- head(sort(rules39, by = "lift"))
inspect(top.rules39)
##     lhs            rhs  support    confidence lift    
## 110 {110,39,48} => {38} 0.01169438 0.9942141  5.620153
## 116 {170,39,48} => {38} 0.01353191 0.9892206  5.591925
## 60  {110,39}    => {38} 0.01973639 0.9891984  5.591800
## 74  {170,39}    => {38} 0.02290102 0.9805731  5.543042
## 113 {36,39,48}  => {38} 0.01225018 0.9677419  5.470509
## 66  {36,39}     => {38} 0.02206166 0.9548355  5.397551
rules110 <- subset(s.retailrules, items %in% "110")
top.rules110 <- head(sort(rules110, by = "lift"))
inspect(top.rules110)
##     lhs            rhs  support    confidence lift    
## 110 {110,39,48} => {38} 0.01169438 0.9942141  5.620153
## 60  {110,39}    => {38} 0.01973639 0.9891984  5.591800
## 58  {110,48}    => {38} 0.01543749 0.9862319  5.575030
## 19  {110}       => {38} 0.03090901 0.9753042  5.513258
## 108 {110,38,48} => {39} 0.01169438 0.7575312  1.317917
## 61  {110,48}    => {39} 0.01176244 0.7514493  1.307336

Writing the rules to a CSV file and converting the rule set to a data frame

write(retailrules, file = "Retail_Rules.csv", sep = ",", quote = TRUE, row.names = FALSE)
retailrules.df <- as(retailrules, "data.frame")
str(retailrules.df)
## 'data.frame':    125 obs. of  4 variables:
##  $ rules     : Factor w/ 125 levels "{101,39} => {48}",..: 83 51 18 17 123 122 20 19 111 110 ...
##  $ support   : num  0.0119 0.0127 0.0106 0.0111 0.0101 ...
##  $ confidence: num  0.974 0.943 0.639 0.689 0.558 ...
##  $ lift      : num  5.51 5.33 1.11 1.2 1.17 ...

Eclat Algorithm

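# eclat.rules presumably holds frequent itemsets mined with eclat(); the call that created it is not shown
inspect(head(sort(eclat.rules)))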
##    items   support  
## 89 {39,48} 0.3305506
## 87 {39,41} 0.1294662
## 75 {38,39} 0.1173408
## 88 {41,48} 0.1022890
## 83 {32,39} 0.0959030
## 84 {32,48} 0.0911277
inspect(head(sort(retailrules, by = "support"), 12))
##     lhs        rhs  support    confidence lift     
## 55  {48}    => {39} 0.33055058 0.6916340  1.2032726
## 56  {39}    => {48} 0.33055058 0.5750765  1.2032726
## 54  {41}    => {39} 0.12946621 0.7637337  1.3287082
## 50  {38}    => {39} 0.11734080 0.6633111  1.1539977
## 53  {41}    => {48} 0.10228897 0.6034125  1.2625621
## 52  {32}    => {39} 0.09590300 0.5574603  0.9698434
## 51  {32}    => {48} 0.09112770 0.5297026  1.1083338
## 49  {38}    => {48} 0.09010685 0.5093614  1.0657723
## 105 {41,48} => {39} 0.08355074 0.8168108  1.4210493
## 106 {39,41} => {48} 0.08355074 0.6453478  1.3503063
## 107 {39,48} => {41} 0.08355074 0.2527623  1.4910695
## 97  {38,48} => {39} 0.06921349 0.7681269  1.3363513

Frequent Pattern Algorithm from SPMF

# Rules generated by Apriori
inspect(head(sort(retailrules, by = "lift")))
##     lhs            rhs  support    confidence lift    
## 110 {110,39,48} => {38} 0.01169438 0.9942141  5.620153
## 116 {170,39,48} => {38} 0.01353191 0.9892206  5.591925
## 60  {110,39}    => {38} 0.01973639 0.9891984  5.591800
## 72  {170,48}    => {38} 0.01744516 0.9877970  5.583878
## 58  {110,48}    => {38} 0.01543749 0.9862319  5.575030
## 74  {170,39}    => {38} 0.02290102 0.9805731  5.543042
# Rules generated by Frequent Pattern

##37 ==> 38 #SUP: 1046 #CONF: 0.9739292364990689 #LIFT: 5.505485339076103
##110 ==> 38 #SUP: 2725 #CONF: 0.9753042233357194 #LIFT: 5.513257946763509
##170 ==> 38 #SUP: 3031 #CONF: 0.9780574378831881 #LIFT: 5.528821482345322
##39 110 ==> 38 #SUP: 1740 #CONF: 0.9891984081864695 #LIFT: 5.591799824476502
##39 170 ==> 38 #SUP: 2019 #CONF: 0.9805730937348227 #LIFT: 5.543042131947258
##48 110 ==> 38 #SUP: 1361 #CONF: 0.986231884057971 #LIFT: 5.575030479758838
##48 170 ==> 38 #SUP: 1538 #CONF: 0.9877970456005138 #LIFT: 5.583878118378591
##39 48 110 ==> 38 #SUP: 1031 #CONF: 0.9942140790742526 #LIFT: 5.62015270834472
##39 48 170 ==> 38 #SUP: 1193 #CONF: 0.9892205638474295 #LIFT: 5.591925067319639
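SPMF reports support as an absolute transaction count (#SUP), whereas arules reports it as a fraction, so the two outputs are easiest to compare after a conversion. The following base R sketch parses two of the lines above and divides the counts by the 88162 transactions from summary(retail); the field helper and the column names are hypothetical choices made for this illustration:

```r
# Two lines copied from the SPMF output above (leading "##" removed)
spmf_lines <- c(
  "37 ==> 38 #SUP: 1046 #CONF: 0.9739292364990689 #LIFT: 5.505485339076103",
  "39 48 110 ==> 38 #SUP: 1031 #CONF: 0.9942140790742526 #LIFT: 5.62015270834472"
)
n_transactions <- 88162  # from summary(retail)

# Pull out the numeric field that follows a tag such as "#SUP:"
field <- function(lines, tag) {
  as.numeric(sub(paste0(".*", tag, " *([0-9.]+).*"), "\\1", lines))
}

spmf_rules <- data.frame(
  rule       = trimws(sub("#SUP.*", "", spmf_lines)),
  count      = field(spmf_lines, "#SUP:"),
  confidence = field(spmf_lines, "#CONF:"),
  lift       = field(spmf_lines, "#LIFT:")
)

# SPMF reports absolute counts; divide by the number of transactions
spmf_rules$support <- spmf_rules$count / n_transactions
print(spmf_rules)
```

After the conversion, the relative supports line up with the Apriori output for the same rules ({37} => {38} and {110,39,48} => {38}), as do the confidence and lift values.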

Rules with the lowest lift

inspect(tail(sort(retailrules, by = "lift")))
##    lhs         rhs  support    confidence lift     
## 27 {413}    => {39} 0.01281731 0.6010638  1.0457028
## 57 {110,38} => {48} 0.01543749 0.4994495  1.0450331
## 20 {110}    => {48} 0.01565300 0.4939155  1.0334539
## 63 {36,38}  => {48} 0.01542615 0.4874552  1.0199365
## 29 {36}     => {48} 0.01606134 0.4822888  1.0091266
## 52 {32}     => {39} 0.09590300 0.5574603  0.9698434

How the rules could be useful: