An overview of the data analysis

As part of my Supply Chain Graduate program, we cover data mining and analytics using R. Though I have used multi-linear regression and forecasting models in the past, the concept of Un-Supervised learning / association rules got me interested to the variety of applications for this topic. In this exercise, I have attempted to build association rules on a trans-national data set which contains all the invoice transactions captured between 01/12/2010 and 09/12/2011 (mostly wholesale purchases) for a UK-based and registered non-store online retail.

Data Source

A special thanks to the UC Irvine Center for Machine Learning and Intelligent Systems, for making the data available for my analysis. Online Retail Dataset

Credit: Dr Daqing Chen, Director: Public Analytics group. chend ‘@’ lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

Attribute Information:

  • InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
  • StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
  • Description: Product (item) name. Nominal.
  • Quantity: The quantities of each product (item) per transaction. Numeric.
  • InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated.
  • UnitPrice: Unit price. Numeric, Product price per unit in sterling.
  • CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
  • Country: Country name. Nominal, the name of the country where each customer resides.

Data Analysis approach

The e-commerce invoice transaction dataset provided by UCI is a very structured dataset with 541909 transactions, with 8 attributes describing these transactions.
The read_xlsx() function in the readxl package is used to build the dataset in Rstudio for mining and later analysis.

#
UK.OnlineRetail.df <- read_xlsx("UK_Ecommerce_OnlineRetail.xlsx")
#
str(UK.OnlineRetail.df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    541909 obs. of  8 variables:
##  $ InvoiceNo  : chr  "536365" "536365" "536365" "536365" ...
##  $ StockCode  : chr  "85123A" "71053" "84406B" "84029G" ...
##  $ Description: chr  "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity   : num  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: POSIXct, format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
##  $ UnitPrice  : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : num  17850 17850 17850 17850 17850 ...
##  $ Country    : chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
#

Data-mining - Pick valid transactions

Per the data dictionary, the invoice transactions also include cancellations. They are distinguished by the invoice#, which are prefixed by letter “c”. Part of the data-mining activity would be to cleanse the dataset.

UK.OnlineRetail.df <- as.data.frame(UK.OnlineRetail.df)  #confirm the data in a data.frame format
str(UK.OnlineRetail.df)
## 'data.frame':    541909 obs. of  8 variables:
##  $ InvoiceNo  : chr  "536365" "536365" "536365" "536365" ...
##  $ StockCode  : chr  "85123A" "71053" "84406B" "84029G" ...
##  $ Description: chr  "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity   : num  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: POSIXct, format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
##  $ UnitPrice  : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : num  17850 17850 17850 17850 17850 ...
##  $ Country    : chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
#some of the transactions are cancellations - to avoid duplication, let us remove the "cancellations"
length(which(substr(str_to_upper(UK.OnlineRetail.df$InvoiceNo),1,1) == "C")) #Number of cancellations
## [1] 9288

Extract valid transactions to prepare them for building association rules

UK.OnlineRetail.nonCancel.df <- UK.OnlineRetail.df[-which(substr(str_to_upper(UK.OnlineRetail.df$InvoiceNo),1,1) == "C"),] #just thus active transactions
# 9288 entries (transactions) are cancellations
#
# validation step - Records excluded (cancelled) + Records in the non-cancel set = total number of records in the parent dataset (541909)
length(which(substr(str_to_upper(UK.OnlineRetail.df$InvoiceNo),1,1) == "C")) + nrow(UK.OnlineRetail.nonCancel.df)
## [1] 541909

Post the sanity-check / record-count validation, the data frame is ready for the next step.

Explore Data - Most selling products

Plotting the products which sold most during the past one year.

Exploring further on the data to list the top 3 best seller SKUs in each country. Criteria: Atleast 50 units in the past 1 year.

## Top 3 SKUs purchased by customers in  United Kingdom 
##  
##     CountryName StockCode ProductsSold
##  United Kingdom    85123A         2175
##  United Kingdom    85099B         1960
##  United Kingdom     22423         1714
## 
## Top 3 SKUs purchased by customers in  Germany 
##  
##  CountryName StockCode ProductsSold
##      Germany      POST          374
##      Germany     22326          113
##      Germany     22328           72
## 
## Top 3 SKUs purchased by customers in  France 
##  
##  CountryName StockCode ProductsSold
##       France      POST          300
##       France     23084           74
##       France     21731           71
## 
## Top 3 SKUs purchased by customers in  EIRE 
##  
##  CountryName StockCode ProductsSold
##         EIRE        C2          106
##         EIRE     22423           73
## 
## Top 3 SKUs purchased by customers in  Belgium 
##  
##  CountryName StockCode ProductsSold
##      Belgium      POST           96
## 
## Top 3 SKUs purchased by customers in  Spain 
##  
##  CountryName StockCode ProductsSold
##        Spain      POST           62

The above analysis provide an insight into which product sold most, however doesn’t answer the question - Is there any specific product contributing to another product’s purchase? This is where Association rules makes its entry.

The following sections is specific on the steps; data preparation to finally building association rules.

Data Transformation - Convert data to a transactions format

Step 1 : Generate 2x2 table matrix on Valid invoice and StockCode

UK.OnlineRetail.tabular <- table(UK.OnlineRetail.nonCancel.df$InvoiceNo,UK.OnlineRetail.nonCancel.df$StockCode) #Invoice# (transactions)-Items ordered
str(UK.OnlineRetail.tabular)
##  'table' int [1:22064, 1:4059] 0 0 0 0 0 1 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:22064] "536365" "536366" "536367" "536368" ...
##   ..$ : chr [1:4059] "10002" "10080" "10120" "10123C" ...

Step 2 : Convert table matrix to a dataframe

The type convert function as.data.frame.matrix() is very useful to convert the table format to dataframe

UK.OnlineRetail.tabular.df <- as.data.frame.matrix(UK.OnlineRetail.tabular) #This step is expected to convert a table to Dataframe
#
str(UK.OnlineRetail.tabular.df)
## 'data.frame':    22064 obs. of  4059 variables:
##  $ 10002       : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ 10080       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 10120       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 10123C      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 10123G      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 10124A      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 10124G      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 10125       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 10133       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 10134       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 10135       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 11001       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15030       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15034       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15036       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15039       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15044A      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15044B      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15044C      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15044D      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15056bl     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15056BL     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15056n      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15056N      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15056p      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15056P      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15058A      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15058B      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15058C      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15060b      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 15060B      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16008       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16010       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16011       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16012       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16014       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16015       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16016       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16020C      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16033       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16043       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16045       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16046       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16048       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16049       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16052       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16053       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16054       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16151A      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16156L      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16156S      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16161G      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16161M      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16161P      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16161U      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16162L      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16162M      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16168M      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16169E      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16169K      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16169M      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16169N      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16169P      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16202A      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16202B      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16202E      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16206B      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16207A      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16207B      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16216       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16218       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16219       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16225       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16235       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16236       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16237       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16238       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16244B      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16248B      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16254       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16258A      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 16259       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17001       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17003       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17007B      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17011A      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17011F      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17012A      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17012B      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17012C      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17012D      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17012E      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17012F      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17013D      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17014A      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17021       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17028J      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17038       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ 17084A      : int  0 0 0 0 0 0 0 0 0 0 ...
##   [list output truncated]

The invoice transactions are successfully captured into a dataframe with purchase flag set to a binary (1 or 0S) format, ideal for type-conversion to transactions

Step 3 : Convert dataframe to a matrix

UK.OnlineRetail.matrix <- as.matrix(UK.OnlineRetail.tabular.df)

NOTE :- I advise to perform data validation on the transactions dataset, to confirm if there are duplicate StockCodes or SKUs in a single Invoice. If the Invoice System doesn’t consolidate the invoices at the SKU level, the matrix may have non-binary entries. In case there are non-binary entries, the transactions type-conversion step will force the SKU count on transactions anything other than 0 or 1 to 1. In some cases, this may be due to bad data and may need more cleansing prior to applying analytics on them. Following steps were executed to check for any transactions with duplicate StockCodes in the same invoice.

Step 3.1 : List the invoices with duplicate SKUs

The following step lists the Invoice#, Record# and Column# (StockCode) which has non-binary values in the matrix

UK.OnlineRetail.duplicate.items.sameInvoice <- which((UK.OnlineRetail.matrix[,]>1 | UK.OnlineRetail.matrix[,]<0),arr.ind = TRUE) #lists the invoice with multiple items in the same invoice (array format - arr.ind=TRUE).

Top 5 entries from the matrix

UK.OnlineRetail.duplicate.items.sameInvoice[1:5,]
##          row col
## 548309  5611   3
## 547729  5343   8
## 558700 10813   8
## 558860 10890   8
## 558049 10502   9

Are all the non-binary SKU counts, just a case of duplicate SKUs reported in an invoice?

length(which((UK.OnlineRetail.matrix[,]>1),arr.ind = TRUE)) #Transactions with SKU count greater than 1
## [1] 19198
length(which((UK.OnlineRetail.matrix[,]<0),arr.ind = TRUE)) #Transactions with SKU count less than 0 
## [1] 0

We confirm that we dont have any bad data, and we could force the non-binary values to binary.

Step 4 : Final transformation prior to type-conversion step

Prior to type-converting the matrix to a transaction type, the following step will force non-binary values to 1

UK.OnlineRetail.matrix[which(UK.OnlineRetail.matrix[,]>1 | UK.OnlineRetail.matrix[,]<0)] <- 1 
#force them to 1 - That means, flag SKUs in transactions as bought by customer 

Step 5: Generate association rules

library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
#
UK.OnlineRetail.trans <- as(UK.OnlineRetail.matrix,"transactions")
summary(UK.OnlineRetail.trans) #summary of transaction file
## transactions as itemMatrix in sparse format with
##  22064 rows (elements/itemsets/transactions) and
##  4059 columns (items) and a density of 0.0058296 
## 
## most frequent items:
##  85123A  85099B   22423   47566   20725 (Other) 
##    2204    2092    1992    1686    1565  512547 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 3726  834  691  663  688  617  607  604  605  547  555  492  507  524  549 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
##  553  467  439  483  439  402  348  345  311  247  259  244  242  271  224 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
##  196  189  162  178  134  134  133  120  135  125  126  101   97  105   99 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
##   91   81   93   93   85   57   66   77   72   72   49   62   54   36   59 
##   61   62   63   64   65   66   67   68   69   70   71   72   73   74   75 
##   43   29   42   37   40   44   34   40   28   34   38   22   27   33   27 
##   76   77   78   79   80   81   82   83   84   85   86   87   88   89   90 
##   21   17   28   15   13   19   23   15   21   18   15   11   17   12   10 
##   91   92   93   94   95   96   97   98   99  100  101  102  103  104  105 
##   10   15   16    7    6    9   10   13    4    9   13    3    6    9    2 
##  106  107  108  109  110  111  112  113  114  115  116  117  118  119  120 
##    4    6    4    4    4    8    3    5    5    5   10    6    4    7    6 
##  121  122  123  124  125  126  127  128  129  130  131  132  133  134  135 
##    6   11    4    5    3    4    7    2    2    4    2    4    2    3    6 
##  136  137  138  139  140  141  142  143  144  145  146  147  148  149  150 
##    2    5    7    2    4    7    2    2    4    5    3    1    6    5    2 
##  151  152  153  154  155  156  157  158  159  160  161  162  163  164  165 
##    6    3    5    2    3    2    3    5    1    1    4    3    3    2    5 
##  166  168  169  170  171  172  173  174  175  176  177  178  179  180  181 
##    4    5    4    1    3    3    4    4    2    4    6    3    3    3    3 
##  182  183  184  185  186  187  188  189  190  191  192  193  194  195  196 
##    6    4    3    2    2    4    2    5    4    2    3    2    3    3    4 
##  197  198  199  202  203  204  205  206  207  208  210  211  212  214  215 
##    2    2    3    2    5    5    1    2    1    4    1    4    1    1    2 
##  216  217  218  219  220  223  224  225  226  228  229  230  232  233  235 
##    3    4    1    3    2    2    3    3    1    1    2    2    1    1    1 
##  236  237  238  239  241  242  243  244  247  249  250  253  254  255  257 
##    1    3    3    1    2    1    2    2    2    3    2    1    2    2    2 
##  259  261  262  263  264  266  267  271  275  279  280  282  283  285  287 
##    1    2    2    1    2    1    1    2    1    1    2    2    2    1    2 
##  288  289  290  292  293  295  297  298  300  301  309  310  315  319  320 
##    1    1    1    1    1    1    1    2    1    1    1    1    1    1    1 
##  331  332  333  334  335  340  341  344  346  347  350  352  353  355  358 
##    1    2    1    2    1    1    1    1    1    2    1    2    1    1    1 
##  360  363  371  375  378  379  384  386  388  402  404  406  410  411  415 
##    1    1    1    1    1    1    1    1    1    2    1    1    1    1    1 
##  416  418  419  420  429  434  438  439  443  449  455  460  463  471  482 
##    2    1    1    1    1    3    1    2    1    1    2    2    1    1    1 
##  486  487  488  494  502  503  509  515  516  517  518  520  523  525  526 
##    1    1    1    1    1    2    1    1    1    1    1    1    1    2    1 
##  527  529  531  536  539  541  543  552  561  568  572  578  585  586  589 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  590  593  595  599  601  607  622  629  635  645  648  650  661  674  676 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  687  704  721  731  749 1110 
##    1    1    1    1    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00   13.00   23.66   27.00 1110.00 
## 
## includes extended item information - examples:
##   labels
## 1  10002
## 2  10080
## 3  10120
## 
## includes extended transaction information - examples:
##   transactionID
## 1        536365
## 2        536366
## 3        536367
#
UK.OnlineRetail.trans
## transactions in sparse format with
##  22064 transactions (rows) and
##  4059 items (columns)

The apriori() function in the arules package is used to generate association rules.

Requirements set for Association rules

  • Minimum Support is set to 0.015 - i.e. I expect 1.5% chance for StockCodes in the association (lhs = antecedant and rhs = consequent) are in the Purchase.

  • Minimum Confidence is set to 0.73 - i.e. I expect atleast 73% chance consequent StockCode (rhs) is present in the basket given the antecedant stock (lhs) is added to the purchase.

rules.UK.OnlineRetail <- apriori(UK.OnlineRetail.matrix,parameter = list(supp=0.015,conf = 0.73,target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.73    0.1    1 none FALSE            TRUE       5   0.015      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 330 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4059 item(s), 22064 transaction(s)] done [0.08s].
## sorting and recoding items ... [417 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.02s].
## writing ... [18 rule(s)] done [0.00s].
## creating S4 object  ... done [0.02s].

List the top 20 rules sorted on lift

assoc.rules.list.UK.OnlRetail <- inspect(head(sort(rules.UK.OnlineRetail,by="lift"),20))
##      lhs              rhs      support    confidence lift      count
## [1]  {21086}       => {21094}  0.01586294 0.8177570  34.237174 350  
## [2]  {84596F}      => {84596B} 0.01581762 0.8022989  33.911728 349  
## [3]  {22578}       => {22577}  0.01758521 0.7563353  30.960634 388  
## [4]  {22698,22699} => {22697}  0.02456490 0.9048414  19.669380 542  
## [5]  {22423,22698} => {22697}  0.01554569 0.8750000  19.020690 343  
## [6]  {22698}       => {22697}  0.02868927 0.8252934  17.940170 633  
## [7]  {22697,22698} => {22699}  0.02456490 0.8562401  17.722404 542  
## [8]  {22423,22698} => {22699}  0.01500181 0.8443878  17.477084 331  
## [9]  {20723,22355} => {20724}  0.01513778 0.8048193  17.173629 334  
## [10] {22423,22699} => {22697}  0.01849166 0.7771429  16.893478 408  
## [11] {22423,22697} => {22699}  0.01849166 0.8015717  16.590880 408  
## [12] {22698}       => {22699}  0.02714830 0.7809648  16.164360 599  
## [13] {20719,22355} => {20724}  0.01522843 0.7368421  15.723099 336  
## [14] {22697}       => {22699}  0.03480783 0.7566502  15.661099 768  
## [15] {22386,22411} => {85099B} 0.01713198 0.8076923   8.518606 378  
## [16] {21931,22386} => {85099B} 0.01871827 0.8019417   8.457955 413  
## [17] {20712,22386} => {85099B} 0.01545504 0.7948718   8.383390 341  
## [18] {21931,22411} => {85099B} 0.01767585 0.7442748   7.849751 390

NOTE: Lift is the measure of efficacy for the association rule to confirm purchase on the consequent item. Technically defined as Confidence / Benchmark Confidence = P(B|A)/P(B), where association rules say, A => B

What does Association rules mean to the Retailer?

When I generated the association rules on 12 months worth of wholesale-customer purchases, the following two came at the top,

#     lhs              rhs      support    confidence lift      count
#[1]  {21086}       => {21094}  0.01586294 0.8177570  34.237174 350
#[2]  {84596F}      => {84596B} 0.01581762 0.8022989  33.911728 349

Following are the stock description, which would be useful to interpret the rule generated on the 11-month worth invoice data;

unique(UK.OnlineRetail.df[which(UK.OnlineRetail.df$StockCode %in% c("21086","21094")),c(2,3)])
##     StockCode                   Description
## 93      21094 SET/6 RED SPOTTY PAPER PLATES
## 227     21086   SET/6 RED SPOTTY PAPER CUPS
unique(UK.OnlineRetail.df[which(UK.OnlineRetail.df$StockCode %in% c("84596F","84596B")),c(2,3)])
##       StockCode                        Description
## 9686     84596F       SMALL MARSHMALLOWS PINK BOWL
## 11945    84596B SMALL DOLLY MIX DESIGN ORANGE BOWL

The first two association rules can be interpreted as follows,

  • If a customer has Stock# 21086 (SET/6 RED SPOTTY PAPER CUPS) in his / her basket, Retailer can be 81.78% confident that he / she will add Stock# 21094 (SET/6 RED SPOTTY PAPER PLATES) to his / her purchase.

  • Also, it is 34 times more likely that a customer will purchase Stock# 21094 (SET/6 RED SPOTTY PAPER PLATES) given Stock# 21086 (SET/6 RED SPOTTY PAPER CUPS) in the basket, than by itself.

  • If a customer has Stock# 84596F (SMALL MARSHMALLOWS PINK BOWL) in his / her basket, Retailer can be 80.23% confident that he / she will add Stock# 84596B (SMALL DOLLY MIX DESIGN ORANGE BOWL) to his / her purchase.

  • Also, it is 33.9 times more likely that a customer will purchase Stock# 84596F (SMALL MARSHMALLOWS PINK BOWL) given Stock# 84596B (SMALL DOLLY MIX DESIGN ORANGE BOWL) in the basket, than by itself.

How to graphically represent the product-association?

Following is how to interpret the graph:

  • Arrows represent the association - head-of-arrow represent the rhs SKU, tail points to the lhs SKU

  • Brighter the bubble, Higher the lift

  • Bigger the bubble, Higher the support

library(arulesViz) 
## Loading required package: grid
UK.OnlineRetail.head20 <- head(sort(rules.UK.OnlineRetail,by="lift"),20) ##capture the rules sorted by "lift" for plotting
plot(UK.OnlineRetail.head20,method = "graph")

Association rules in brick-and-mortar retail environment

As in the case of online transactions where the association rules can be used to push recommendations to the customer on an attempt to boost sales, the same approach could be utilized on brick-and-mortar store to review the planograms and effectiveness of SKU placements in the aisle or gondola shelves. This would be very much relevant on SKUs which are small and could possibly be positioned at the check-out counter, and bulkier items (hardline or softline) which has more chances of a sale if placed together. When Information Systems provide the data (Point Of Sales), Data Analytics can pave the way to provide options for calculated and effective decision-making to improve inventory turn-over.

The applications of market-basket analysis doesn’t limit to recommendations to the SKU placements at the store, but setting-up the SKUs in pick-area in a Warehouse (B2C). On a Keizen (Continuous Improvement) perspective, this focus will reduce the unwanted “Transportation” and “Motion” wastes (two of the 8 mudas defined in Lean), walking up-and-down the aisle or searching the product for picking.

Possible Challenges / Questions

Relevant Papers:

Citation Request: