As part of my Supply Chain Graduate program, we cover data mining and analytics using R. Though I have used multi-linear regression and forecasting models in the past, the concept of Un-Supervised learning / association rules got me interested to the variety of applications for this topic. In this exercise, I have attempted to build association rules on a trans-national data set which contains all the invoice transactions captured between 01/12/2010 and 09/12/2011 (mostly wholesale purchases) for a UK-based and registered non-store online retail.
A special thanks to the UC Irvine Center for Machine Learning and Intelligent Systems, for making the data available for my analysis. Online Retail Dataset
Credit: Dr Daqing Chen, Director: Public Analytics group. chend ‘@’ lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.
The e-commerce invoice transaction dataset provided by UCI is a very structured dataset with 541909 transactions, with 8 attributes describing these transactions.
The read_xlsx() function in the readxl package is used to build the dataset in Rstudio for mining and later analysis.
#
UK.OnlineRetail.df <- read_xlsx("UK_Ecommerce_OnlineRetail.xlsx")
#
str(UK.OnlineRetail.df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 541909 obs. of 8 variables:
## $ InvoiceNo : chr "536365" "536365" "536365" "536365" ...
## $ StockCode : chr "85123A" "71053" "84406B" "84029G" ...
## $ Description: chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : num 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: POSIXct, format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : num 17850 17850 17850 17850 17850 ...
## $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
#
Per the data dictionary, the invoice transactions also include cancellations. They are distinguished by the invoice#, which are prefixed by letter “c”. Part of the data-mining activity would be to cleanse the dataset.
UK.OnlineRetail.df <- as.data.frame(UK.OnlineRetail.df) #confirm the data in a data.frame format
str(UK.OnlineRetail.df)
## 'data.frame': 541909 obs. of 8 variables:
## $ InvoiceNo : chr "536365" "536365" "536365" "536365" ...
## $ StockCode : chr "85123A" "71053" "84406B" "84029G" ...
## $ Description: chr "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : num 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: POSIXct, format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
## $ UnitPrice : num 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : num 17850 17850 17850 17850 17850 ...
## $ Country : chr "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
#some of the transactions are cancellations - to avoid duplication, let us remove the "cancellations"
length(which(substr(str_to_upper(UK.OnlineRetail.df$InvoiceNo),1,1) == "C")) #Number of cancellations
## [1] 9288
Extract valid transactions to prepare them for building association rules
UK.OnlineRetail.nonCancel.df <- UK.OnlineRetail.df[-which(substr(str_to_upper(UK.OnlineRetail.df$InvoiceNo),1,1) == "C"),] #just thus active transactions
# 9288 entries (transactions) are cancellations
#
# validation step - Records excluded (cancelled) + Records in the non-cancel set = total number of records in the parent dataset (541909)
length(which(substr(str_to_upper(UK.OnlineRetail.df$InvoiceNo),1,1) == "C")) + nrow(UK.OnlineRetail.nonCancel.df)
## [1] 541909
Post the sanity-check / record-count validation, the data frame is ready for the next step.
Plotting the products which sold most during the past one year.
Exploring further on the data to list the top 3 best seller SKUs in each country. Criteria: Atleast 50 units in the past 1 year.
## Top 3 SKUs purchased by customers in United Kingdom
##
## CountryName StockCode ProductsSold
## United Kingdom 85123A 2175
## United Kingdom 85099B 1960
## United Kingdom 22423 1714
##
## Top 3 SKUs purchased by customers in Germany
##
## CountryName StockCode ProductsSold
## Germany POST 374
## Germany 22326 113
## Germany 22328 72
##
## Top 3 SKUs purchased by customers in France
##
## CountryName StockCode ProductsSold
## France POST 300
## France 23084 74
## France 21731 71
##
## Top 3 SKUs purchased by customers in EIRE
##
## CountryName StockCode ProductsSold
## EIRE C2 106
## EIRE 22423 73
##
## Top 3 SKUs purchased by customers in Belgium
##
## CountryName StockCode ProductsSold
## Belgium POST 96
##
## Top 3 SKUs purchased by customers in Spain
##
## CountryName StockCode ProductsSold
## Spain POST 62
The above analysis provide an insight into which product sold most, however doesn’t answer the question - Is there any specific product contributing to another product’s purchase? This is where Association rules makes its entry.
The following sections is specific on the steps; data preparation to finally building association rules.
UK.OnlineRetail.tabular <- table(UK.OnlineRetail.nonCancel.df$InvoiceNo,UK.OnlineRetail.nonCancel.df$StockCode) #Invoice# (transactions)-Items ordered
str(UK.OnlineRetail.tabular)
## 'table' int [1:22064, 1:4059] 0 0 0 0 0 1 0 0 0 0 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:22064] "536365" "536366" "536367" "536368" ...
## ..$ : chr [1:4059] "10002" "10080" "10120" "10123C" ...
The type convert function as.data.frame.matrix() is very useful to convert the table format to dataframe
UK.OnlineRetail.tabular.df <- as.data.frame.matrix(UK.OnlineRetail.tabular) #This step is expected to convert a table to Dataframe
#
str(UK.OnlineRetail.tabular.df)
## 'data.frame': 22064 obs. of 4059 variables:
## $ 10002 : int 0 0 0 0 0 1 0 0 0 0 ...
## $ 10080 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 10120 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 10123C : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 10123G : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 10124A : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 10124G : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 10125 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 10133 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 10134 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 10135 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 11001 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15030 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15034 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15036 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15039 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15044A : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15044B : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15044C : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15044D : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15056bl : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15056BL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15056n : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15056N : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15056p : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15056P : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15058A : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15058B : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15058C : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15060b : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 15060B : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16008 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16010 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16011 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16012 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16014 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16015 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16016 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16020C : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16033 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16043 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16045 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16046 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16048 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16049 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16052 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16053 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16054 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16151A : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16156L : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16156S : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16161G : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16161M : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16161P : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16161U : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16162L : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16162M : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16168M : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16169E : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16169K : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16169M : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16169N : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16169P : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16202A : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16202B : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16202E : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16206B : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16207A : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16207B : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16216 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16218 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16219 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16225 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16235 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16236 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16237 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16238 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16244B : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16248B : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16254 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16258A : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 16259 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17001 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17003 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17007B : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17011A : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17011F : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17012A : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17012B : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17012C : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17012D : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17012E : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17012F : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17013D : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17014A : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17021 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17028J : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17038 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ 17084A : int 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
The invoice transactions are successfully captured into a dataframe with purchase flag set to a binary (1 or 0S) format, ideal for type-conversion to transactions
UK.OnlineRetail.matrix <- as.matrix(UK.OnlineRetail.tabular.df)
NOTE :- I advise to perform data validation on the transactions dataset, to confirm if there are duplicate StockCodes or SKUs in a single Invoice. If the Invoice System doesn’t consolidate the invoices at the SKU level, the matrix may have non-binary entries. In case there are non-binary entries, the transactions type-conversion step will force the SKU count on transactions anything other than 0 or 1 to 1. In some cases, this may be due to bad data and may need more cleansing prior to applying analytics on them. Following steps were executed to check for any transactions with duplicate StockCodes in the same invoice.
The following step lists the Invoice#, Record# and Column# (StockCode) which has non-binary values in the matrix
UK.OnlineRetail.duplicate.items.sameInvoice <- which((UK.OnlineRetail.matrix[,]>1 | UK.OnlineRetail.matrix[,]<0),arr.ind = TRUE) #lists the invoice with multiple items in the same invoice (array format - arr.ind=TRUE).
Top 5 entries from the matrix
UK.OnlineRetail.duplicate.items.sameInvoice[1:5,]
## row col
## 548309 5611 3
## 547729 5343 8
## 558700 10813 8
## 558860 10890 8
## 558049 10502 9
Are all the non-binary SKU counts, just a case of duplicate SKUs reported in an invoice?
length(which((UK.OnlineRetail.matrix[,]>1),arr.ind = TRUE)) #Transactions with SKU count greater than 1
## [1] 19198
length(which((UK.OnlineRetail.matrix[,]<0),arr.ind = TRUE)) #Transactions with SKU count less than 0
## [1] 0
We confirm that we dont have any bad data, and we could force the non-binary values to binary.
Prior to type-converting the matrix to a transaction type, the following step will force non-binary values to 1
UK.OnlineRetail.matrix[which(UK.OnlineRetail.matrix[,]>1 | UK.OnlineRetail.matrix[,]<0)] <- 1
#force them to 1 - That means, flag SKUs in transactions as bought by customer
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
#
UK.OnlineRetail.trans <- as(UK.OnlineRetail.matrix,"transactions")
summary(UK.OnlineRetail.trans) #summary of transaction file
## transactions as itemMatrix in sparse format with
## 22064 rows (elements/itemsets/transactions) and
## 4059 columns (items) and a density of 0.0058296
##
## most frequent items:
## 85123A 85099B 22423 47566 20725 (Other)
## 2204 2092 1992 1686 1565 512547
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 3726 834 691 663 688 617 607 604 605 547 555 492 507 524 549
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 553 467 439 483 439 402 348 345 311 247 259 244 242 271 224
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 196 189 162 178 134 134 133 120 135 125 126 101 97 105 99
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 91 81 93 93 85 57 66 77 72 72 49 62 54 36 59
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 43 29 42 37 40 44 34 40 28 34 38 22 27 33 27
## 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## 21 17 28 15 13 19 23 15 21 18 15 11 17 12 10
## 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
## 10 15 16 7 6 9 10 13 4 9 13 3 6 9 2
## 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 4 6 4 4 4 8 3 5 5 5 10 6 4 7 6
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135
## 6 11 4 5 3 4 7 2 2 4 2 4 2 3 6
## 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150
## 2 5 7 2 4 7 2 2 4 5 3 1 6 5 2
## 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165
## 6 3 5 2 3 2 3 5 1 1 4 3 3 2 5
## 166 168 169 170 171 172 173 174 175 176 177 178 179 180 181
## 4 5 4 1 3 3 4 4 2 4 6 3 3 3 3
## 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196
## 6 4 3 2 2 4 2 5 4 2 3 2 3 3 4
## 197 198 199 202 203 204 205 206 207 208 210 211 212 214 215
## 2 2 3 2 5 5 1 2 1 4 1 4 1 1 2
## 216 217 218 219 220 223 224 225 226 228 229 230 232 233 235
## 3 4 1 3 2 2 3 3 1 1 2 2 1 1 1
## 236 237 238 239 241 242 243 244 247 249 250 253 254 255 257
## 1 3 3 1 2 1 2 2 2 3 2 1 2 2 2
## 259 261 262 263 264 266 267 271 275 279 280 282 283 285 287
## 1 2 2 1 2 1 1 2 1 1 2 2 2 1 2
## 288 289 290 292 293 295 297 298 300 301 309 310 315 319 320
## 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
## 331 332 333 334 335 340 341 344 346 347 350 352 353 355 358
## 1 2 1 2 1 1 1 1 1 2 1 2 1 1 1
## 360 363 371 375 378 379 384 386 388 402 404 406 410 411 415
## 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
## 416 418 419 420 429 434 438 439 443 449 455 460 463 471 482
## 2 1 1 1 1 3 1 2 1 1 2 2 1 1 1
## 486 487 488 494 502 503 509 515 516 517 518 520 523 525 526
## 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1
## 527 529 531 536 539 541 543 552 561 568 572 578 585 586 589
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 590 593 595 599 601 607 622 629 635 645 648 650 661 674 676
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 687 704 721 731 749 1110
## 1 1 1 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 13.00 23.66 27.00 1110.00
##
## includes extended item information - examples:
## labels
## 1 10002
## 2 10080
## 3 10120
##
## includes extended transaction information - examples:
## transactionID
## 1 536365
## 2 536366
## 3 536367
#
UK.OnlineRetail.trans
## transactions in sparse format with
## 22064 transactions (rows) and
## 4059 items (columns)
The apriori() function in the arules package is used to generate association rules.
Requirements set for Association rules
Minimum Support is set to 0.015 - i.e. I expect 1.5% chance for StockCodes in the association (lhs = antecedant and rhs = consequent) are in the Purchase.
Minimum Confidence is set to 0.73 - i.e. I expect atleast 73% chance consequent StockCode (rhs) is present in the basket given the antecedant stock (lhs) is added to the purchase.
rules.UK.OnlineRetail <- apriori(UK.OnlineRetail.matrix,parameter = list(supp=0.015,conf = 0.73,target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.73 0.1 1 none FALSE TRUE 5 0.015 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 330
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4059 item(s), 22064 transaction(s)] done [0.08s].
## sorting and recoding items ... [417 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.02s].
## writing ... [18 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
List the top 20 rules sorted on lift
assoc.rules.list.UK.OnlRetail <- inspect(head(sort(rules.UK.OnlineRetail,by="lift"),20))
## lhs rhs support confidence lift count
## [1] {21086} => {21094} 0.01586294 0.8177570 34.237174 350
## [2] {84596F} => {84596B} 0.01581762 0.8022989 33.911728 349
## [3] {22578} => {22577} 0.01758521 0.7563353 30.960634 388
## [4] {22698,22699} => {22697} 0.02456490 0.9048414 19.669380 542
## [5] {22423,22698} => {22697} 0.01554569 0.8750000 19.020690 343
## [6] {22698} => {22697} 0.02868927 0.8252934 17.940170 633
## [7] {22697,22698} => {22699} 0.02456490 0.8562401 17.722404 542
## [8] {22423,22698} => {22699} 0.01500181 0.8443878 17.477084 331
## [9] {20723,22355} => {20724} 0.01513778 0.8048193 17.173629 334
## [10] {22423,22699} => {22697} 0.01849166 0.7771429 16.893478 408
## [11] {22423,22697} => {22699} 0.01849166 0.8015717 16.590880 408
## [12] {22698} => {22699} 0.02714830 0.7809648 16.164360 599
## [13] {20719,22355} => {20724} 0.01522843 0.7368421 15.723099 336
## [14] {22697} => {22699} 0.03480783 0.7566502 15.661099 768
## [15] {22386,22411} => {85099B} 0.01713198 0.8076923 8.518606 378
## [16] {21931,22386} => {85099B} 0.01871827 0.8019417 8.457955 413
## [17] {20712,22386} => {85099B} 0.01545504 0.7948718 8.383390 341
## [18] {21931,22411} => {85099B} 0.01767585 0.7442748 7.849751 390
NOTE: Lift is the measure of efficacy for the association rule to confirm purchase on the consequent item. Technically defined as Confidence / Benchmark Confidence = P(B|A)/P(B), where association rules say, A => B
When I generated the association rules on 12 months worth of wholesale-customer purchases, the following two came at the top,
# lhs rhs support confidence lift count
#[1] {21086} => {21094} 0.01586294 0.8177570 34.237174 350
#[2] {84596F} => {84596B} 0.01581762 0.8022989 33.911728 349
Following are the stock description, which would be useful to interpret the rule generated on the 11-month worth invoice data;
unique(UK.OnlineRetail.df[which(UK.OnlineRetail.df$StockCode %in% c("21086","21094")),c(2,3)])
## StockCode Description
## 93 21094 SET/6 RED SPOTTY PAPER PLATES
## 227 21086 SET/6 RED SPOTTY PAPER CUPS
unique(UK.OnlineRetail.df[which(UK.OnlineRetail.df$StockCode %in% c("84596F","84596B")),c(2,3)])
## StockCode Description
## 9686 84596F SMALL MARSHMALLOWS PINK BOWL
## 11945 84596B SMALL DOLLY MIX DESIGN ORANGE BOWL
The first two association rules can be interpreted as follows,
If a customer has Stock# 21086 (SET/6 RED SPOTTY PAPER CUPS) in his / her basket, Retailer can be 81.78% confident that he / she will add Stock# 21094 (SET/6 RED SPOTTY PAPER PLATES) to his / her purchase.
Also, it is 34 times more likely that a customer will purchase Stock# 21094 (SET/6 RED SPOTTY PAPER PLATES) given Stock# 21086 (SET/6 RED SPOTTY PAPER CUPS) in the basket, than by itself.
If a customer has Stock# 84596F (SMALL MARSHMALLOWS PINK BOWL) in his / her basket, Retailer can be 80.23% confident that he / she will add Stock# 84596B (SMALL DOLLY MIX DESIGN ORANGE BOWL) to his / her purchase.
Also, it is 33.9 times more likely that a customer will purchase Stock# 84596F (SMALL MARSHMALLOWS PINK BOWL) given Stock# 84596B (SMALL DOLLY MIX DESIGN ORANGE BOWL) in the basket, than by itself.
Following is how to interpret the graph:
Arrows represent the association - head-of-arrow represent the rhs SKU, tail points to the lhs SKU
Brighter the bubble, Higher the lift
Bigger the bubble, Higher the support
library(arulesViz)
## Loading required package: grid
UK.OnlineRetail.head20 <- head(sort(rules.UK.OnlineRetail,by="lift"),20) ##capture the rules sorted by "lift" for plotting
plot(UK.OnlineRetail.head20,method = "graph")
As in the case of online transactions where the association rules can be used to push recommendations to the customer on an attempt to boost sales, the same approach could be utilized on brick-and-mortar store to review the planograms and effectiveness of SKU placements in the aisle or gondola shelves. This would be very much relevant on SKUs which are small and could possibly be positioned at the check-out counter, and bulkier items (hardline or softline) which has more chances of a sale if placed together. When Information Systems provide the data (Point Of Sales), Data Analytics can pave the way to provide options for calculated and effective decision-making to improve inventory turn-over.
The applications of market-basket analysis doesn’t limit to recommendations to the SKU placements at the store, but setting-up the SKUs in pick-area in a Warehouse (B2C). On a Keizen (Continuous Improvement) perspective, this focus will reduce the unwanted “Transportation” and “Motion” wastes (two of the 8 mudas defined in Lean), walking up-and-down the aisle or searching the product for picking.
Big Data - For a huge superstore retailer, one of the main challenges would be on the velocity and volume of POS data captured into their enterprise systems. As RStudio and R is capable of leveraging the potential of Big Data, we should be in a good spot to perform data analytics on Big Data.
Review frequency - How often does the retailer review the association rules and SKU placement?
Store Returns - Should the market-basket analysis on retail store POS data, consider the SKUs frequently returned to the store?
The evolution of direct, data and digital marketing, Richard Webber, Journal of Direct, Data and Digital Marketing Practice (2013) 14, 291â???“309.
Clustering Experiments on Big Transaction Data for Market Segmentation, Ashishkumar Singh, Grace Rumantir, Annie South, Blair Bethwaite, Proceedings of the 2014 International Conference on Big Data Science and Computing.
A decision-making framework for precision marketing, Zhen You, Yain-Whar Si, Defu Zhang, XiangXiang Zeng, Stephen C.H. Leung c, Tao Li, Expert Systems with Applications, 42 (2015) 3357â???“3367.
Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197â???“208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).
Visualizing Association Rules: Introduction to the R-extension Package arulesViz; Michael Hahsler & Sudheer Chelluboina (Southern Methodist University)