Market Basket Analysis (MBA) uncovers associations between products by looking for combinations of products that frequently co-occur in transactions. It lets supermarkets identify relationships between the products that customers buy and put those relationships to practical use.
This project analyses the retail market basket data set supplied by an anonymous Belgian retail supermarket store. The data were collected over three non-consecutive periods, giving approximately 5 months of data. A total of 88,162 receipts were collected. Over the entire data collection period, the supermarket carried 16,470 unique SKUs (Stock Keeping Units), and 5,133 customers purchased at least one product.
The discovered associations can be used for:
+ Grouping products that co-occur when designing a store’s layout, to increase the chance of cross-selling.
+ Targeting marketing campaigns by sending customers promotional offers related to the products they purchased.
For this purpose, we use the Apriori, Eclat and Frequent Pattern Growth algorithms to study customer behaviour.
Packages and algorithms used:
+ arules
+ arulesViz
+ eclat
+ Frequent Pattern Growth
+ bigmemory
library(arules)
library(arulesViz)
retail <- read.transactions("Retail.csv", sep = " ")
summary(retail)
## transactions as itemMatrix in sparse format with
## 88162 rows (elements/itemsets/transactions) and
## 16470 columns (items) and a density of 0.0006257289
##
## most frequent items:
## 39 48 38 32 41 (Other)
## 50675 42135 15596 15167 14945 770058
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 3016 5516 6919 7210 6814 6163 5746 5143 4660 4086 3751 3285 2866 2620 2310
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 2115 1874 1645 1469 1290 1205 981 887 819 684 586 582 472 480 355
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 310 303 272 234 194 136 153 123 115 112 76 66 71 60 50
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 44 37 37 33 22 24 21 21 10 11 10 9 11 4 9
## 61 62 63 64 65 66 67 68 71 73 74 76
## 7 4 5 2 2 5 3 3 1 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 8.00 10.31 14.00 76.00
##
## includes extended item information - examples:
## labels
## 1 0
## 2 1
## 3 10
itemFrequencyPlot(retail, topN = 20, main = "Frequently purchased items", xlab = "Item", ylab = "Item Frequency (Relative)")
The plot shows that the most frequently purchased item is “39”, followed by “48”. We focus on item “39”, which has a relative frequency of nearly 0.6, and item “110”, which has a relative frequency of about 0.03.
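The relative frequencies read off the plot can be cross-checked against the counts in the summary(retail) output. The following is a minimal sketch in base R, assuming only the numbers printed in this report; the variable names are illustrative and not part of the original analysis.

```r
# Counts copied from the summary(retail) output above
n_transactions <- 88162
item_counts <- c("39" = 50675, "48" = 42135, "38" = 15596, "32" = 15167, "41" = 14945)

# Relative frequency = transactions containing the item / all transactions
rel_freq <- item_counts / n_transactions
print(round(rel_freq, 4))

# Item 110 is not listed in the summary. Its frequency is backed out from the
# rule {110} => {38} reported further down:
# support(110) = support(rule) / confidence(rule)
rel_freq_110 <- 0.03090901 / 0.9753042
print(round(rel_freq_110, 4))
```

Because confidence is defined as the rule's support divided by the support of its antecedent, the last step recovers the frequency of item 110 without access to the raw data.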
a.rules <- apriori(retail)
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.8 0.1 1 none FALSE TRUE 0.1 1 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[16470 item(s), 88162 transaction(s)] done [0.12s].
## sorting and recoding items ... [5 item(s)] done [0.02s].
## creating transaction tree ... done [0.03s].
## checking subsets of size 1 2 3 done [0.05s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
By default, the Apriori algorithm uses a minimum support of 10% and a minimum confidence of 80%. Five items meet the support threshold, but no rule in the Retail market data set satisfies both thresholds at once, so zero rules are generated.
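To see why the default thresholds leave nothing, it helps to check them by hand. The sketch below uses base R and figures printed elsewhere in this report (the item counts from the summary and the joint support of items 39 and 48 from the rule listings); the variable names are assumptions made for illustration.

```r
n_transactions <- 88162
item_counts <- c("39" = 50675, "48" = 42135, "38" = 15596, "32" = 15167, "41" = 14945)

min_support <- 0.1
min_confidence <- 0.8

# Items whose relative frequency reaches the support threshold
frequent_items <- names(item_counts)[item_counts / n_transactions >= min_support]
print(frequent_items)

# Best-supported pair rule, {48} => {39}; its support is taken from the rule listings
support_39_48 <- 0.33055058
confidence_48_to_39 <- support_39_48 / (item_counts[["48"]] / n_transactions)
passes <- confidence_48_to_39 >= min_confidence
print(c(confidence = confidence_48_to_39, passes = passes))
```

All five listed items clear the 10% support bar, which is consistent with the "[5 item(s)]" line in the log, yet even the best-supported rule stays below 80% confidence.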
retailrules <- apriori(retail, parameter = list(support = 0.01, confidence = 0.25, minlen = 2))
##
## Parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.25 0.1 1 none FALSE TRUE 0.01 2 10
## target ext
## rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[16470 item(s), 88162 transaction(s)] done [0.12s].
## sorting and recoding items ... [70 item(s)] done [0.00s].
## creating transaction tree ... done [0.05s].
## checking subsets of size 1 2 3 4 done [0.02s].
## writing ... [125 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
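Lowering the minimum support to 1% and the minimum confidence to 25% generates more rules; minlen = 2 restricts the output to rules with at least two items, that is, rules with a non-empty antecedent. This yields 125 rules.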
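The effect of the new parameters can be illustrated with a little arithmetic. This is a sketch in base R using the counts from the summary output; it assumes, as the arules documentation describes, that with minlen = 1 rules with an empty antecedent such as {} => {39} are also reported, and the variable names are illustrative.

```r
n_transactions <- 88162
item_counts <- c("39" = 50675, "48" = 42135, "38" = 15596, "32" = 15167, "41" = 14945)

# Smallest number of receipts an itemset needs at each support threshold
min_count_default <- ceiling(0.10 * n_transactions)
min_count_lowered <- ceiling(0.01 * n_transactions)
print(c(default = min_count_default, lowered = min_count_lowered))

# A rule with an empty antecedent, {} => {item}, has a confidence equal to the
# item's relative frequency, so frequent items would pass the 25% threshold
empty_lhs_confidence <- item_counts / n_transactions
would_pass <- names(empty_lhs_confidence)[empty_lhs_confidence >= 0.25]
print(would_pass)
```

Such rules only restate how common an item is, which is one reason to set minlen = 2 in the call above.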
summary(retailrules)
## set of 125 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 56 51 18
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.696 3.000 4.000
##
## summary of quality measures:
## support confidence lift
## Min. :0.01013 Min. :0.2528 Min. :0.9698
## 1st Qu.:0.01287 1st Qu.:0.5751 1st Qu.:1.1618
## Median :0.01745 Median :0.6590 Median :1.2632
## Mean :0.03043 Mean :0.6576 Mean :1.7518
## 3rd Qu.:0.02667 3rd Qu.:0.7509 3rd Qu.:1.4591
## Max. :0.33055 Max. :0.9942 Max. :5.6202
##
## mining info:
## data ntransactions support confidence
## retail 88162 0.01 0.25
inspect(sort(retailrules,by = "lift")[1:20])
## lhs rhs support confidence lift
## 110 {110,39,48} => {38} 0.01169438 0.9942141 5.620153
## 116 {170,39,48} => {38} 0.01353191 0.9892206 5.591925
## 60 {110,39} => {38} 0.01973639 0.9891984 5.591800
## 72 {170,48} => {38} 0.01744516 0.9877970 5.583878
## 58 {110,48} => {38} 0.01543749 0.9862319 5.575030
## 74 {170,39} => {38} 0.02290102 0.9805731 5.543042
## 33 {170} => {38} 0.03437989 0.9780574 5.528821
## 19 {110} => {38} 0.03090901 0.9753042 5.513258
## 1 {37} => {38} 0.01186452 0.9739292 5.505485
## 113 {36,39,48} => {38} 0.01225018 0.9677419 5.470509
## 64 {36,48} => {38} 0.01542615 0.9604520 5.429300
## 66 {36,39} => {38} 0.02206166 0.9548355 5.397551
## 28 {36} => {38} 0.03164629 0.9502725 5.371757
## 2 {286} => {38} 0.01265852 0.9433643 5.332706
## 121 {38,39,48} => {41} 0.02258343 0.3262865 1.924795
## 125 {32,39,48} => {41} 0.01867018 0.3047020 1.797466
## 92 {38,48} => {41} 0.02692770 0.2988419 1.762897
## 95 {38,39} => {41} 0.03460675 0.2949251 1.739792
## 102 {32,39} => {41} 0.02675756 0.2790065 1.645886
## 86 {39,89} => {48} 0.02410336 0.7730084 1.617419
inspect(sort(retailrules,by = "confidence")[1:20])
## lhs rhs support confidence lift
## 110 {110,39,48} => {38} 0.01169438 0.9942141 5.620153
## 116 {170,39,48} => {38} 0.01353191 0.9892206 5.591925
## 60 {110,39} => {38} 0.01973639 0.9891984 5.591800
## 72 {170,48} => {38} 0.01744516 0.9877970 5.583878
## 58 {110,48} => {38} 0.01543749 0.9862319 5.575030
## 74 {170,39} => {38} 0.02290102 0.9805731 5.543042
## 33 {170} => {38} 0.03437989 0.9780574 5.528821
## 19 {110} => {38} 0.03090901 0.9753042 5.513258
## 1 {37} => {38} 0.01186452 0.9739292 5.505485
## 113 {36,39,48} => {38} 0.01225018 0.9677419 5.470509
## 64 {36,48} => {38} 0.01542615 0.9604520 5.429300
## 66 {36,39} => {38} 0.02206166 0.9548355 5.397551
## 28 {36} => {38} 0.03164629 0.9502725 5.371757
## 2 {286} => {38} 0.01265852 0.9433643 5.332706
## 119 {38,41,48} => {39} 0.02258343 0.8386689 1.459077
## 105 {41,48} => {39} 0.08355074 0.8168108 1.421049
## 83 {225,48} => {39} 0.01587986 0.8064516 1.403027
## 123 {32,41,48} => {39} 0.01867018 0.7978672 1.388092
## 79 {310,48} => {39} 0.01527869 0.7960993 1.385016
## 111 {36,38,48} => {39} 0.01225018 0.7941176 1.381569
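The two listings share their first 14 rules: all have item “38” as the consequent, a confidence above 0.94 and a lift above 5.3. They differ from the 15th rule onward. Sorted by lift, the next five rules have “41” as the consequent; their lift is comparatively high (1.6 to 1.9) but their confidence is low (about 0.3). Sorted by confidence, the next rules have “39” as the consequent; their confidence is about 0.8 but their lift is only about 1.4, because “39” appears in more than half of all transactions anyway. The top rule has a confidence of almost 1, which means that customers who purchased items “110”, “39” and “48” almost always purchased “38” as well.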
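The relationship between the two orderings follows from the definition lift = confidence / support(consequent). The sketch below recomputes lift for three rules from the listings above, using only base R and the item counts from the summary output; the data frame and its column names are made up for illustration.

```r
n_transactions <- 88162
# Relative frequency of each consequent, from the summary(retail) counts
rhs_support <- c("38" = 15596, "39" = 50675, "41" = 14945) / n_transactions

rules <- data.frame(
  rule = c("{110,39,48} => {38}", "{38,39,48} => {41}", "{38,41,48} => {39}"),
  rhs = c("38", "41", "39"),
  confidence = c(0.9942141, 0.3262865, 0.8386689),
  stringsAsFactors = FALSE
)

# Lift compares the rule's confidence with the baseline frequency of the consequent
rules$lift <- unname(rules$confidence / rhs_support[rules$rhs])
print(rules)
```

A rule pointing at a rare consequent such as “41” can earn a high lift with modest confidence, while a rule pointing at the very common item “39” needs a confidence well above 0.57 before its lift moves far from 1.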
plot(retailrules)
# sort() orders rules by support unless "by" is given
plot(head(sort(retailrules, by = "support"), 10), method = "graph", control = list(type = "items"))
plot(head(sort(retailrules, by = "support"), 10), method = "grouped")
plot(head(sort(retailrules, by = "support"), 20), method = "matrix", measure = c("lift", "confidence"), control = list(reorder = TRUE))
## Itemsets in Antecedent (LHS)
## [1] "{170}" "{41}" "{39,41}" "{32,39}" "{39}" "{32}" "{38}"
## [8] "{41,48}" "{38,41}" "{38,48}" "{48}" "{32,48}" "{39,48}" "{38,39}"
## Itemsets in Consequent (RHS)
## [1] "{39}" "{48}" "{41}" "{38}"
samplerule <- head(sort(retailrules, by = "lift"), 1)
inspect(samplerule)
## lhs rhs support confidence lift
## 110 {110,39,48} => {38} 0.01169438 0.9942141 5.620153
plot(samplerule, method = "doubledecker", data = retail)
im <- interestMeasure(head(sort(retailrules),20), c("coverage", "oddsRatio", "leverage", "hyperConfidence", "chiSquared"), transactions = retail)
head(im)
## coverage oddsRatio leverage hyperConfidence chiSquared
## 1 0.4779270 2.5513209 0.055840945 1.000000e+00 4507.98616
## 2 0.5747941 2.5513209 0.055840945 1.000000e+00 4507.98616
## 3 0.1695175 2.7957306 0.032028558 1.000000e+00 2628.44959
## 4 0.1769016 1.5747130 0.015658796 1.000000e+00 607.43952
## 5 0.1695175 1.8423354 0.021271988 1.000000e+00 1135.69003
## 6 0.1720356 0.9182089 -0.002982039 1.024248e-06 22.51991
s.retailrules <- sort(retailrules, by = "lift")
rules39 <- subset(s.retailrules, items %in% "39")
top.rules39 <- head(sort(rules39, by = "lift"))
inspect(top.rules39)
## lhs rhs support confidence lift
## 110 {110,39,48} => {38} 0.01169438 0.9942141 5.620153
## 116 {170,39,48} => {38} 0.01353191 0.9892206 5.591925
## 60 {110,39} => {38} 0.01973639 0.9891984 5.591800
## 74 {170,39} => {38} 0.02290102 0.9805731 5.543042
## 113 {36,39,48} => {38} 0.01225018 0.9677419 5.470509
## 66 {36,39} => {38} 0.02206166 0.9548355 5.397551
rules110 <- subset(s.retailrules, items %in% "110")
top.rules110 <- head(sort(rules110, by = "lift"))
inspect(top.rules110)
## lhs rhs support confidence lift
## 110 {110,39,48} => {38} 0.01169438 0.9942141 5.620153
## 60 {110,39} => {38} 0.01973639 0.9891984 5.591800
## 58 {110,48} => {38} 0.01543749 0.9862319 5.575030
## 19 {110} => {38} 0.03090901 0.9753042 5.513258
## 108 {110,38,48} => {39} 0.01169438 0.7575312 1.317917
## 61 {110,48} => {39} 0.01176244 0.7514493 1.307336
write(retailrules, file = "Retail_Rules.csv", sep = ",", quote = TRUE, row.names = FALSE)
retailrules.df <- as(retailrules, "data.frame")
str(retailrules.df)
## 'data.frame': 125 obs. of 4 variables:
## $ rules : Factor w/ 125 levels "{101,39} => {48}",..: 83 51 18 17 123 122 20 19 111 110 ...
## $ support : num 0.0119 0.0127 0.0106 0.0111 0.0101 ...
## $ confidence: num 0.974 0.943 0.639 0.689 0.558 ...
## $ lift : num 5.51 5.33 1.11 1.2 1.17 ...
# eclat.rules holds the frequent itemsets found by Eclat; the call that created it is not shown
inspect(head(sort(eclat.rules)))
## items support
## 89 {39,48} 0.3305506
## 87 {39,41} 0.1294662
## 75 {38,39} 0.1173408
## 88 {41,48} 0.1022890
## 83 {32,39} 0.0959030
## 84 {32,48} 0.0911277
inspect(head(sort(retailrules, by = "support"), 12))
## lhs rhs support confidence lift
## 55 {48} => {39} 0.33055058 0.6916340 1.2032726
## 56 {39} => {48} 0.33055058 0.5750765 1.2032726
## 54 {41} => {39} 0.12946621 0.7637337 1.3287082
## 50 {38} => {39} 0.11734080 0.6633111 1.1539977
## 53 {41} => {48} 0.10228897 0.6034125 1.2625621
## 52 {32} => {39} 0.09590300 0.5574603 0.9698434
## 51 {32} => {48} 0.09112770 0.5297026 1.1083338
## 49 {38} => {48} 0.09010685 0.5093614 1.0657723
## 105 {41,48} => {39} 0.08355074 0.8168108 1.4210493
## 106 {39,41} => {48} 0.08355074 0.6453478 1.3503063
## 107 {39,48} => {41} 0.08355074 0.2527623 1.4910695
## 97 {38,48} => {39} 0.06921349 0.7681269 1.3363513
# Rules generated by Apriori
inspect(head(sort(retailrules, by = "lift")))
## lhs rhs support confidence lift
## 110 {110,39,48} => {38} 0.01169438 0.9942141 5.620153
## 116 {170,39,48} => {38} 0.01353191 0.9892206 5.591925
## 60 {110,39} => {38} 0.01973639 0.9891984 5.591800
## 72 {170,48} => {38} 0.01744516 0.9877970 5.583878
## 58 {110,48} => {38} 0.01543749 0.9862319 5.575030
## 74 {170,39} => {38} 0.02290102 0.9805731 5.543042
# Rules generated by Frequent Pattern Growth
##37 ==> 38 #SUP: 1046 #CONF: 0.9739292364990689 #LIFT: 5.505485339076103
##110 ==> 38 #SUP: 2725 #CONF: 0.9753042233357194 #LIFT: 5.513257946763509
##170 ==> 38 #SUP: 3031 #CONF: 0.9780574378831881 #LIFT: 5.528821482345322
##39 110 ==> 38 #SUP: 1740 #CONF: 0.9891984081864695 #LIFT: 5.591799824476502
##39 170 ==> 38 #SUP: 2019 #CONF: 0.9805730937348227 #LIFT: 5.543042131947258
##48 110 ==> 38 #SUP: 1361 #CONF: 0.986231884057971 #LIFT: 5.575030479758838
##48 170 ==> 38 #SUP: 1538 #CONF: 0.9877970456005138 #LIFT: 5.583878118378591
##39 48 110 ==> 38 #SUP: 1031 #CONF: 0.9942140790742526 #LIFT: 5.62015270834472
##39 48 170 ==> 38 #SUP: 1193 #CONF: 0.9892205638474295 #LIFT: 5.591925067319639
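The Frequent Pattern output reports support as an absolute transaction count (#SUP), whereas the Apriori listing reports it as a fraction of all transactions. The following base R sketch converts a few of the counts above to relative support and compares them with the Apriori values; the selection of rules and the variable names are illustrative.

```r
n_transactions <- 88162

# #SUP values copied from the Frequent Pattern output above
fp_counts <- c(
  "{110,39,48} => {38}" = 1031,
  "{170,39,48} => {38}" = 1193,
  "{110} => {38}" = 2725
)

# Support of the same rules as reported by Apriori
apriori_support <- c(0.01169438, 0.01353191, 0.03090901)

# Relative support = absolute count / number of transactions
fp_support <- fp_counts / n_transactions
max_diff <- max(abs(fp_support - apriori_support))
print(round(fp_support, 8))
print(max_diff)
```

For these rules the two algorithms agree on support to within rounding, and the confidence and lift columns match as well, so the rules compared here are the same.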
inspect(tail(sort(retailrules, by = "lift")))
## lhs rhs support confidence lift
## 27 {413} => {39} 0.01281731 0.6010638 1.0457028
## 57 {110,38} => {48} 0.01543749 0.4994495 1.0450331
## 20 {110} => {48} 0.01565300 0.4939155 1.0334539
## 63 {36,38} => {48} 0.01542615 0.4874552 1.0199365
## 29 {36} => {48} 0.01606134 0.4822888 1.0091266
## 52 {32} => {39} 0.09590300 0.5574603 0.9698434