The Online Retail II data set contains all the transactions of a UK-based and registered non-store online retailer between December 1, 2009 and December 9, 2011. The company mainly sells unique all-occasion giftware. Many of its customers are wholesalers.
library(arules) # Apriori and Eclat algorithms
library(readxl) # Read excel
library(arulesViz) # Association rules visualization
library(dplyr) # Data manipulation
library(stringr) # String processing
Only the observations from the "Year 2010-2011" sheet are read, since the full data set has too many rows.
dat <- read_excel("online_retail_II.xlsx",
sheet='Year 2010-2011',
guess_max = 100,
range = cell_cols("A:C"),
col_types = c("text", "text", "text")
)
Keep the Invoice and Description columns, which hold the basket information needed. Remove invoices that start with the letter "C" (in either case), which indicates a cancellation, because only purchase transactions are of interest.
dat <- select(dat, Invoice, Description) %>%
filter(Description != "",
Description != "Discount",
!grepl("^C", Invoice, ignore.case = TRUE)) %>%
mutate(Description = tolower(str_squish(Description)))
Remove apostrophes, periods, and commas from the descriptions.
dat <- mutate(dat, Description = str_remove_all(Description, "'|\\.|,"))
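The effect of the cleaning steps above can be checked on a few rows. The following is a minimal sketch on a made-up data frame (the `toy` rows are hypothetical, not taken from the retail file) that applies the same filters and string operations:

```r
library(dplyr)
library(stringr)

# Hypothetical rows covering each case handled by the cleaning steps
toy <- data.frame(
  Invoice = c("536365", "C536379", "536366", "536367", "c536380"),
  Description = c("  WHITE HANGING  HEART T-LIGHT HOLDER ",
                  "Discount",
                  "",
                  "CHILDREN'S APRON, DOLLY GIRL",
                  "REGENCY CAKESTAND 3 TIER"),
  stringsAsFactors = FALSE
)

clean <- toy %>%
  # Drop empty descriptions, discounts, and cancelled invoices
  filter(Description != "",
         Description != "Discount",
         !grepl("^C", Invoice, ignore.case = TRUE)) %>%
  # Collapse repeated whitespace, trim, and lower-case
  mutate(Description = tolower(str_squish(Description))) %>%
  # Strip apostrophes, periods, and commas
  mutate(Description = str_remove_all(Description, "'|\\.|,"))

print(clean)
```

Only the first and fourth rows survive, with their descriptions normalised.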
Write the data to a CSV file so it is easier to read with read.transactions.
write.csv(dat, file="dat.csv", row.names=FALSE)
Convert data frame to transactions format.
tr <- read.transactions("dat.csv", format = "single", sep=",",
rm.duplicates=TRUE, header=TRUE, cols=1:2)
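As an alternative to the CSV round trip, arules can also coerce a list of item vectors to transactions. The sketch below shows this on a made-up data frame (the `toy` rows and item names are hypothetical). Here `unique()` plays the role that `rm.duplicates=TRUE` plays above.

```r
library(arules)

# Hypothetical invoice lines in "single" layout: one row per (invoice, item)
toy <- data.frame(
  Invoice = c("1", "1", "2", "2", "2", "3"),
  Description = c("lunch bag", "jumbo bag",
                  "lunch bag", "jumbo bag", "party bunting",
                  "lunch bag"),
  stringsAsFactors = FALSE
)

# Group the descriptions by invoice and drop repeated items within a basket
baskets <- lapply(split(toy$Description, toy$Invoice), unique)

# Coerce the list of baskets to a transactions object
tr_toy <- as(baskets, "transactions")

inspect(tr_toy)
```

Whether this is preferable to writing a file first depends on the size of the data; the file-based route used above works equally well.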
Since many customers are wholesalers, plenty of baskets are expected to contain at least 100 items.
summary(tr)
## transactions as itemMatrix in sparse format with
## 20610 rows (elements/itemsets/transactions) and
## 4157 columns (items) and a density of 0.006073024
##
## most frequent items:
## white hanging heart t-light holder jumbo bag red retrospot
## 2260 2092
## regency cakestand 3 tier party bunting
## 1989 1686
## lunch bag red retrospot (Other)
## 1564 510720
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2272 836 689 666 687 615 606 605 611 545 552 493 507 528 547 552
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 467 443 484 435 402 347 345 309 248 259 243 243 270 225 197 188
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 162 177 136 135 132 122 136 123 123 103 96 104 100 90 85 95
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 87 86 57 65 78 71 72 50 64 52 35 61 40 29 43 39
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 39 43 33 40 29 33 39 24 24 34 26 21 19 27 16 12
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 20 21 15 23 17 17 9 17 11 12 9 15 16 7 5 10
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 9 13 5 11 10 3 6 9 2 5 6 4 4 4 7 3
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## 5 6 6 9 5 4 8 5 6 11 4 5 3 4 8 1
## 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 2 4 3 3 2 5 4 2 6 6 2 5 6 2 2 5
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 5 3 2 4 5 3 5 3 6 2 2 2 4 4 1 2
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 3 3 3 2 5 4 1 4 4 2 2 4 3 4 2 5
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## 5 4 2 4 2 6 4 3 3 3 2 3 4 4 2 3
## 193 194 195 196 197 198 199 202 203 204 205 206 207 208 210 211
## 2 3 3 4 2 2 3 2 5 5 1 2 1 4 1 4
## 212 213 214 215 216 217 218 219 220 222 223 224 225 226 227 228
## 1 1 2 1 2 4 2 2 2 1 1 3 3 1 1 1
## 229 230 232 233 234 235 237 238 239 241 242 243 244 247 249 250
## 2 1 1 1 1 1 3 3 1 2 1 2 2 2 3 2
## 253 254 255 257 259 261 262 263 264 266 267 270 275 279 280 282
## 1 2 2 2 1 2 2 1 2 1 1 2 1 2 2 1
## 283 285 286 288 289 291 292 295 296 298 299 301 309 310 315 319
## 2 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1
## 320 331 332 333 334 339 341 344 345 347 348 349 352 354 357 358
## 1 1 4 1 1 1 1 1 1 2 1 1 2 1 1 1
## 363 369 375 376 379 382 386 388 399 404 408 411 414 415 416 419
## 1 1 1 1 1 1 1 1 2 2 1 1 1 1 2 1
## 420 428 433 434 438 439 443 449 453 455 458 460 463 471 482 486
## 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1
## 487 488 494 499 503 506 514 515 517 518 520 522 524 525 527 529
## 1 1 1 1 2 1 1 1 1 1 1 1 1 2 1 1
## 531 536 539 541 543 552 561 567 572 578 585 588 589 593 595 599
## 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
## 601 607 622 629 635 645 647 649 661 673 676 687 703 720 731 748
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1108
## 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.00 15.00 25.25 28.00 1108.00
##
## includes extended item information - examples:
## labels
## 1 *boombox ipod classic
## 2 *usb office mirror ball
## 3 ?
##
## includes extended transaction information - examples:
## transactionID
## 1 536365
## 2 536366
## 3 536367
Plot of the most frequently bought items
itemFrequencyPlot(tr, topN=10, cex=0.7)
Just 6 items have a support (relative frequency) of at least 7%. This is perhaps because the company offers a wide range of products, as these are unique all-occasion gifts.
itemFrequencyPlot(tr, support=0.07, cex=0.8)
Support: The probability of a specific event. Here it is the proportion of transactions that contain a specific item (or itemset) out of the total number of transactions. In this online retail data, the support of the basket rules is fairly low, with a maximum a little above 4%.
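As a worked example with the counts from the summary(tr) output above: the most frequent item, white hanging heart t-light holder, appears in 2260 of the 20610 transactions, so its support is

```latex
\[
supp(\{\text{white hanging heart t-light holder}\}) = \frac{2260}{20610} \approx 0.1097
\]
```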
Confidence: The proportion of transactions containing the LHS of a rule that also contain the RHS. It is defined as \(conf(X \Rightarrow Y)=supp(X \cup Y)/supp(X)\). For example, in this data set, the rule {childrens cutlery dolly girl} => {childrens cutlery spaceboy} has a confidence close to 76%. This means that about 76% of the transactions containing “childrens cutlery dolly girl” also contain “childrens cutlery spaceboy”. Confidence can be interpreted as an estimate of the conditional probability \(P(Y \mid X)\), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.
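The 76% figure can be reproduced from the values that inspect() reports for this rule further down: a support of 0.01072295 (count 221) and a coverage of 0.01411936. Coverage is the support of the LHS, which corresponds to about 291 of the 20610 transactions (a count derived here from the coverage, not printed in the output):

```latex
\[
conf(X \Rightarrow Y) = \frac{supp(X \cup Y)}{supp(X)}
  = \frac{0.01072295}{0.01411936}
  = \frac{221}{291} \approx 0.7595
\]
```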
Lift: Defined as \(lift(X \Rightarrow Y)=supp(X \cup Y)/(supp(X)supp(Y))\). It can be interpreted as the deviation of the support of the whole rule from the support expected under independence, given the supports of the LHS and the RHS. A lift of 1 corresponds to independence. Lift helps to filter or rank the rules found: greater values indicate stronger associations (greater dependence among the items in the rule).
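The three measures can also be computed by hand. The following is a minimal base R sketch on made-up baskets (the items and the helper `supp()` are hypothetical and only for illustration; they are not part of arules):

```r
# Hypothetical baskets, not taken from the retail data
baskets <- list(
  c("tea", "cake"),
  c("tea", "cake", "jam"),
  c("tea"),
  c("jam"),
  c("cake", "jam")
)

# Support of an itemset: share of baskets that contain every item in it
supp <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

x <- "tea"   # LHS of the rule {tea} => {cake}
y <- "cake"  # RHS of the rule

support_xy <- supp(c(x, y))            # supp(X u Y)
confidence <- support_xy / supp(x)     # supp(X u Y) / supp(X)
lift       <- confidence / supp(y)     # supp(X u Y) / (supp(X) * supp(Y))

print(c(support = support_xy, confidence = confidence, lift = lift))
```

Here tea and cake each appear in 3 of 5 baskets and together in 2, so the lift is only slightly above 1, which indicates a weak association.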
The support threshold should not be too high (only two items have a support above 10%) but also not too low, because we want items that are often bought. A support of 1% and a confidence of 30% were chosen; association rules must satisfy both the minimum support and the minimum confidence constraints at the same time. With these settings, the algorithm produced 1381 rules that met both requirements.
rules <- apriori(tr, parameter=list(support=0.01, confidence=0.3, target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.3 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 206
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4157 item(s), 20610 transaction(s)] done [0.47s].
## sorting and recoding items ... [783 item(s)] done [0.01s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 done [0.13s].
## writing ... [1381 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
Basket rules summary
summary(rules)
## set of 1381 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 735 582 64
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.514 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01004 Min. :0.3002 Min. :0.01058 Min. : 2.819
## 1st Qu.:0.01092 1st Qu.:0.4000 1st Qu.:0.01926 1st Qu.: 7.791
## Median :0.01252 Median :0.5144 Median :0.02586 Median :10.087
## Mean :0.01395 Mean :0.5299 Mean :0.02880 Mean :13.144
## 3rd Qu.:0.01519 3rd Qu.:0.6396 3rd Qu.:0.03474 3rd Qu.:14.675
## Max. :0.04003 Max. :0.9587 Max. :0.10150 Max. :80.078
## count
## Min. :207.0
## 1st Qu.:225.0
## Median :258.0
## Mean :287.6
## 3rd Qu.:313.0
## Max. :825.0
##
## mining info:
## data ntransactions support confidence
## tr 20610 0.01 0.3
Basket rules of size equal to 2
inspect(head(subset(rules, size(rules) == 2), 10))
## lhs rhs support confidence coverage lift count
## [1] {childrens cutlery dolly girl} => {childrens cutlery spaceboy} 0.01072295 0.7594502 0.01411936 43.478522 221
## [2] {childrens cutlery spaceboy} => {childrens cutlery dolly girl} 0.01072295 0.6138889 0.01746725 43.478522 221
## [3] {childrens cutlery polkadot blue} => {childrens cutlery polkadot pink} 0.01067443 0.7612457 0.01402232 37.090481 220
## [4] {childrens cutlery polkadot pink} => {childrens cutlery polkadot blue} 0.01067443 0.5200946 0.02052402 37.090481 220
## [5] {painted metal pears assorted} => {assorted colour bird ornament} 0.01256672 0.7000000 0.01795245 9.915464 259
## [6] {round snack boxes set of4 woodland} => {postage} 0.01154779 0.3251366 0.03551674 5.945932 238
## [7] {lunch bag doiley pattern} => {jumbo bag doiley patterns} 0.01101407 0.4613821 0.02387191 18.500166 227
## [8] {jumbo bag doiley patterns} => {lunch bag doiley pattern} 0.01101407 0.4416342 0.02493935 18.500166 227
## [9] {lunch bag doiley pattern} => {lunch bag apple design} 0.01004367 0.4207317 0.02387191 8.369962 207
## [10] {pink happy birthday bunting} => {blue happy birthday bunting} 0.01271228 0.6517413 0.01950509 34.442021 262
Basket rules of size greater than 3
inspect(head(subset(rules, size(rules) > 3), 10))
## lhs rhs support confidence coverage lift count
## [1] {green regency teacup and saucer,
## pink regency teacup and saucer,
## roses regency teacup and saucer} => {regency cakestand 3 tier} 0.01460456 0.5553506 0.02629791 5.754537 301
## [2] {pink regency teacup and saucer,
## regency cakestand 3 tier,
## roses regency teacup and saucer} => {green regency teacup and saucer} 0.01460456 0.9093656 0.01606016 18.465048 301
## [3] {green regency teacup and saucer,
## pink regency teacup and saucer,
## regency cakestand 3 tier} => {roses regency teacup and saucer} 0.01460456 0.8775510 0.01664241 16.966535 301
## [4] {green regency teacup and saucer,
## regency cakestand 3 tier,
## roses regency teacup and saucer} => {pink regency teacup and saucer} 0.01460456 0.7377451 0.01979622 19.849773 301
## [5] {lunch bag cars blue,
## lunch bag pink polkadot,
## lunch bag suki design} => {lunch bag red retrospot} 0.01023775 0.7873134 0.01300340 10.375019 211
## [6] {lunch bag cars blue,
## lunch bag pink polkadot,
## lunch bag red retrospot} => {lunch bag suki design} 0.01023775 0.6552795 0.01562348 10.509969 211
## [7] {lunch bag pink polkadot,
## lunch bag red retrospot,
## lunch bag suki design} => {lunch bag cars blue} 0.01023775 0.6374622 0.01606016 11.424432 211
## [8] {lunch bag cars blue,
## lunch bag red retrospot,
## lunch bag suki design} => {lunch bag pink polkadot} 0.01023775 0.6242604 0.01639981 11.803675 211
## [9] {lunch bag black skull,
## lunch bag cars blue,
## lunch bag pink polkadot} => {lunch bag red retrospot} 0.01091703 0.7258065 0.01504124 9.564496 225
## [10] {lunch bag cars blue,
## lunch bag pink polkadot,
## lunch bag red retrospot} => {lunch bag black skull} 0.01091703 0.6987578 0.01562348 11.312960 225
Taking a look at the top 11 rules by lift
inspect(sort(rules, by='lift', decreasing=TRUE)[1:11])
## lhs rhs support confidence
## [1] {herb marker thyme} => {herb marker rosemary} 0.01072295 0.9324895
## [2] {herb marker rosemary} => {herb marker thyme} 0.01072295 0.9208333
## [3] {herb marker thyme} => {herb marker parsley} 0.01033479 0.8987342
## [4] {herb marker parsley} => {herb marker thyme} 0.01033479 0.8949580
## [5] {herb marker parsley} => {herb marker rosemary} 0.01043183 0.9033613
## [6] {herb marker rosemary} => {herb marker parsley} 0.01043183 0.8958333
## [7] {herb marker parsley} => {herb marker mint} 0.01028627 0.8907563
## [8] {herb marker mint} => {herb marker parsley} 0.01028627 0.8833333
## [9] {herb marker basil} => {herb marker rosemary} 0.01038331 0.8842975
## [10] {herb marker rosemary} => {herb marker basil} 0.01038331 0.8916667
## [11] {herb marker parsley} => {herb marker basil} 0.01028627 0.8907563
## coverage lift count
## [1] 0.01149927 80.07753 221
## [2] 0.01164483 80.07753 221
## [3] 0.01149927 77.82736 213
## [4] 0.01154779 77.82736 213
## [5] 0.01154779 77.57616 215
## [6] 0.01164483 77.57616 215
## [7] 0.01154779 76.49370 212
## [8] 0.01164483 76.49370 212
## [9] 0.01174187 75.93905 214
## [10] 0.01164483 75.93905 214
## [11] 0.01154779 75.86152 212
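The lift of the top rule can be checked from the table. Take X = {herb marker thyme} and Y = {herb marker rosemary}. The support of Y can be read from the coverage of rule [2], whose LHS is Y (0.01164483), and the confidence of rule [1] is 0.9324895:

```latex
\[
lift(X \Rightarrow Y) = \frac{conf(X \Rightarrow Y)}{supp(Y)}
  = \frac{0.9324895}{0.01164483} \approx 80.08
\]
```

Because lift is symmetric in X and Y, rules [1] and [2] share the same value, as do the other pairs in the table.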
Checking the rules that have the product with the highest support (white hanging heart t-light holder) on the right-hand side (rhs).
heart.rhs <- subset(rules, subset = rhs %in% 'white hanging heart t-light holder')
inspect(heart.rhs)
## lhs rhs support confidence coverage lift count
## [1] {candleholder pink hanging heart} => {white hanging heart t-light holder} 0.01368268 0.7085427 0.01931101 6.461533 282
## [2] {zinc metal heart decoration} => {white hanging heart t-light holder} 0.01014071 0.3906542 0.02595827 3.562559 209
## [3] {love building block word} => {white hanging heart t-light holder} 0.01077147 0.3529412 0.03051917 3.218636 222
## [4] {red hanging heart t-light holder} => {white hanging heart t-light holder} 0.02401747 0.6680162 0.03595342 6.091953 495
## [5] {home building block word} => {white hanging heart t-light holder} 0.01237263 0.3281853 0.03770015 2.992876 255
## [6] {heart of wicker large} => {white hanging heart t-light holder} 0.01737021 0.3866091 0.04492965 3.525669 358
## [7] {hanging heart jar t-light holder} => {white hanging heart t-light holder} 0.01091703 0.3090659 0.03532266 2.818517 225
## [8] {hanging heart zinc t-light holder} => {white hanging heart t-light holder} 0.01052887 0.3598673 0.02925764 3.281799 217
## [9] {wooden frame antique white} => {white hanging heart t-light holder} 0.01644833 0.3491246 0.04711305 3.183831 339
## [10] {lovebird hanging decoration white} => {white hanging heart t-light holder} 0.01111111 0.4209559 0.02639495 3.838894 229
## [11] {wooden picture frame white finish} => {white hanging heart t-light holder} 0.01979622 0.3709091 0.05337215 3.382494 408
## [12] {bathroom metal sign} => {white hanging heart t-light holder} 0.01213003 0.3714710 0.03265405 3.387619 250
## [13] {heart of wicker small} => {white hanging heart t-light holder} 0.01897137 0.3255620 0.05827268 2.968953 391
## [14] {natural slate heart chalkboard} => {white hanging heart t-light holder} 0.02013586 0.3322658 0.06060165 3.030088 415
## [15] {dotcom postage} => {white hanging heart t-light holder} 0.01319748 0.3841808 0.03435226 3.503525 272
## [16] {wooden frame antique white,
## wooden picture frame white finish} => {white hanging heart t-light holder} 0.01057739 0.4044527 0.02615235 3.688394 218
plot(rules)
Graph visualization for small subsets of rules, in this case, the top 11 with the highest lift.
plot(sort(rules, by='lift', decreasing=TRUE)[1:11], method='graph')
Unlike apriori, eclat only measures the support of sets of items. It requires only a minimum support level; no confidence or lift is involved. The algorithm therefore outputs itemsets, not rules.
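As background that the text above does not spell out, Eclat is usually described as working on a vertical layout: each item keeps the list of transaction ids that contain it, and the support of an itemset is obtained by intersecting those lists. The base R sketch below illustrates the idea on made-up baskets (the data and the helper `itemset_support()` are hypothetical; this is not how arules implements it internally):

```r
# Hypothetical baskets keyed by transaction id
baskets <- list(
  t1 = c("tea", "cake"),
  t2 = c("tea", "cake", "jam"),
  t3 = c("tea"),
  t4 = c("jam"),
  t5 = c("cake", "jam")
)

# Vertical layout: for each item, the ids of the transactions containing it
items <- sort(unique(unlist(baskets)))
tid_lists <- lapply(setNames(items, items), function(item) {
  names(baskets)[vapply(baskets, function(b) item %in% b, logical(1))]
})

# Support of an itemset: size of the intersection of its tid lists,
# divided by the number of transactions
itemset_support <- function(itemset) {
  shared <- Reduce(intersect, tid_lists[itemset])
  length(shared) / length(baskets)
}

pair_support   <- itemset_support(c("tea", "cake"))
triple_support <- itemset_support(c("tea", "cake", "jam"))
print(c(pair = pair_support, triple = triple_support))
```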
Performing eclat with a minimum itemset length of 2.
eclat_sets <- eclat(tr, parameter=list(support=0.01, minlen = 2))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.01 2 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 206
##
## create itemset ...
## set transactions ...[4157 item(s), 20610 transaction(s)] done [0.52s].
## sorting and recoding items ... [783 item(s)] done [0.01s].
## creating sparse bit matrix ... [783 row(s), 20610 column(s)] done [0.05s].
## writing ... [971 set(s)] done [1.83s].
## Creating S4 object ... done [0.00s].
There were 971 subsets (or itemsets) that satisfied a minimum support of 1%.
summary(eclat_sets)
## set of 971 itemsets
##
## most frequent items:
## jumbo bag red retrospot lunch bag red retrospot
## 140 90
## jumbo storage bag suki dotcom postage
## 69 64
## red retrospot charlotte bag (Other)
## 61 1746
##
## element (itemset/transaction) length distribution:sizes
## 2 3 4
## 759 196 16
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.235 2.000 4.000
##
## summary of quality measures:
## support transIdenticalToItemsets count
## Min. :0.01004 Min. :207.0 Min. :207.0
## 1st Qu.:0.01089 1st Qu.:224.5 1st Qu.:224.5
## Median :0.01228 Median :253.0 Median :253.0
## Mean :0.01358 Mean :280.0 Mean :280.0
## 3rd Qu.:0.01460 3rd Qu.:301.0 3rd Qu.:301.0
## Max. :0.04003 Max. :825.0 Max. :825.0
##
## includes transaction ID lists: FALSE
##
## mining info:
## data ntransactions support
## tr 20610 0.01
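The count column in the summary ties back to the support threshold. With 20610 transactions, a minimum support of 0.01 means an itemset must appear in more than 206 baskets, which agrees with the smallest reported count of 207. The largest count, 825, gives the maximum support:

```latex
\[
0.01 \times 20610 = 206.1 \;\Rightarrow\; \text{count} \ge 207,
\qquad
\frac{825}{20610} \approx 0.0400
\]
```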
The most frequent combination among all the transactions was jumbo bag pink polkadot and jumbo bag red retrospot, with a support of 0.0400291.
inspect(sort(eclat_sets, by='support', decreasing=TRUE)[1:9])
## items support transIdenticalToItemsets count
## [1] {jumbo bag pink polkadot,
## jumbo bag red retrospot} 0.04002911 825 825
## [2] {green regency teacup and saucer,
## roses regency teacup and saucer} 0.03726346 768 768
## [3] {jumbo bag red retrospot,
## jumbo storage bag suki} 0.03512858 724 724
## [4] {jumbo bag red retrospot,
## jumbo shopper vintage red paisley} 0.03299369 680 680
## [5] {lunch bag red retrospot,
## lunch bag suki design} 0.03173217 654 654
## [6] {lunch bag black skull,
## lunch bag red retrospot} 0.03110141 641 641
## [7] {alarm clock bakelike green,
## alarm clock bakelike red} 0.03105289 640 640
## [8] {green regency teacup and saucer,
## pink regency teacup and saucer} 0.03071325 633 633
## [9] {lunch bag pink polkadot,
## lunch bag red retrospot} 0.02940320 606 606
Sets graph
plot(sort(eclat_sets, by='support', decreasing=TRUE)[1:10], method='graph')
Chen, D., Sain, S.L., and Guo, K. (2012). Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208.