My objective for this piece of work is to carry out a Market Basket Analysis as an end-to-end data science project. I have split the work into three parts, of which this is the SECOND, organised as follows:
In the first chapter, I will source, explore and format a complex dataset suitable for modelling with recommendation algorithms.
For the second part, I will apply various machine learning algorithms for Product Recommendation and select the best performing model. This will be done with the support of the recommenderlab package.
In the third and final instalment, I will implement the selected model in a Shiny Web Application.
# Importing libraries
library(data.table)
library(tidyverse)
library(knitr)
library(recommenderlab)

For the analysis I will be using the retail dataset prepared and cleansed in Part 1.
glimpse(retail)
## Observations: 528,148
## Variables: 10
## $ InvoiceNo <dbl> 536365, 536365, 536365, 536365, 536365, 536365, 53...
## $ StockCode <chr> "85123A", "71053", "84406B", "84029G", "84029E", "...
## $ Description <fct> WHITE HANGING HEART T-LIGHT HOLDER, WHITE METAL LA...
## $ Quantity <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2...
## $ InvoiceDate <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12...
## $ UnitPrice <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1....
## $ CustomerID <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 1...
## $ Country <fct> United Kingdom, United Kingdom, United Kingdom, Un...
## $ Date <date> 2010-12-01, 2010-12-01, 2010-12-01, 2010-12-01, 2...
## $ Time <fct> 08:26:00, 08:26:00, 08:26:00, 08:26:00, 08:26:00, ...

For the analysis part of this project I am using recommenderlab, an R package which provides a convenient framework to evaluate and compare various recommendation algorithms and quickly establish the best performing one.
I now need to arrange the purchase history in a ratings matrix, with orders in rows and products in columns. This format is often called a user-item matrix because “users” (e.g. customers or orders) tend to be on the rows and “items” (e.g. products) on the columns.
recommenderlab accepts two types of rating matrix for modelling:
a real rating matrix, consisting of actual user ratings, which requires normalisation;
a binary rating matrix, consisting of 0s and 1s, where a 1 indicates that the product was purchased. This is the matrix type needed for this analysis, and it does not require normalisation.
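Just to illustrate the binary format (a toy example of mine, not part of the original analysis), a small 0/1 purchase matrix can be coerced straight into recommenderlab's binaryRatingMatrix class:
# Toy example: 3 orders x 3 items, where 1 means the item is in the order
toy <- matrix(c(1, 0, 1,
                0, 1, 1,
                1, 1, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("order_", 1:3), paste0("item_", 1:3)))
as(toy, "binaryRatingMatrix")
# should print something like: 3 x 3 rating matrix of class 'binaryRatingMatrix' with 6 ratings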
However, when creating the rating matrix, it becomes apparent that some orders contain the same item more than once, as shown in the example below.
retail %>%
# Filtering by an order number which contains the same stock code more than once
filter(InvoiceNo == 557886 & StockCode == 22436) %>%
select(InvoiceNo, StockCode, Quantity, UnitPrice, CustomerID)

## # A tibble: 2 x 5
## InvoiceNo StockCode Quantity UnitPrice CustomerID
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 557886 22436 1 0.65 17799
## 2 557886 22436 3 0.65 17799
It is possible that the company that donated this dataset to the UCI Machine Learning Repository had an order processing system that allowed for an item to be added multiple times to the same order. For this analysis, I only need to know if an item was included in an order, so these duplicates need to be removed.
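Before removing them, a quick sanity check (my own addition, not in the original code) can count how many invoice/description pairs appear more than once:
# Number of invoice/item combinations occurring more than once
retail %>%
  count(InvoiceNo, Description) %>%
  filter(n > 1) %>%
  nrow()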
retail <- retail %>%
# create unique identifier
mutate(InNo_Desc = paste(InvoiceNo, Description, sep = ' '))
# filter out duplicates and drop unique identifier
retail <- retail[!duplicated(retail$InNo_Desc), ] %>%
select(-InNo_Desc)
# CHECK: total row count - 517,354

I can now create the rating matrix.
ratings_matrix <- retail %>%
# Select only needed variables
select(InvoiceNo, Description) %>%
# Add a column of 1s
mutate(value = 1) %>%
# Spread into user-item format
spread(Description, value, fill = 0) %>%
select(-InvoiceNo) %>%
# Convert to matrix
as.matrix() %>%
# Convert to recommenderlab class 'binaryRatingsMatrix'
as("binaryRatingMatrix")
ratings_matrix
## 19792 x 4001 rating matrix of class 'binaryRatingMatrix' with 517354 ratings.

In order to establish the models' effectiveness, recommenderlab implements a number of evaluation schemes. In this scheme, I split the data into a train set and a test set by selecting train = 0.8 for an 80/20 train/test split. I have also set method = “cross” and k = 5 for a 5-fold cross-validation. This means that the data is divided into k subsets of equal size, with 80% of the data used for training and the remaining 20% used for evaluation. The models are estimated 5 times, each time using a different train/test split, which ensures that all users and items are considered for both training and testing. The results can then be averaged to produce a single evaluation.
Selecting given = -1 means that for each test user all but one randomly selected item are given to the recommender, with the remaining item withheld for evaluation (the ‘all-but-1’ scheme).
scheme <- ratings_matrix %>%
evaluationScheme(method = "cross",
k = 5,
train = 0.8,
given = -1)

scheme
## Evaluation scheme using all-but-1 items
## Method: 'cross-validation' with 5 run(s).
## Good ratings: NA
## Data set: 19792 x 4001 rating matrix of class 'binaryRatingMatrix' with 517354 ratings.

One of recommenderlab's main features is the ability to estimate multiple algorithms in one go. First, I create a list with the algorithms I want to estimate, specifying each model's parameters. Here, I consider algorithms that work on a binary rating matrix and include the random items algorithm for benchmarking purposes.
algorithms <- list(
"association rules" = list(name = "AR", param = list(supp = 0.01, conf = 0.01)),
"random items" = list(name = "RANDOM", param = NULL),
"popular items" = list(name = "POPULAR", param = NULL),
"item-based CF" = list(name = "IBCF", param = list(k = 5)),
"user-based CF" = list(name = "UBCF", param = list(method = "Cosine", nn = 500))
)

All I have to do now is to pass scheme and algorithms to the evaluate() function, select type = topNList to evaluate a Top N List of product recommendations and specify how many recommendations to calculate with the parameter n = c(1, 3, 5, 10, 15, 20).
Please note that the CF based algorithms take a few minutes each to estimate.
results <- recommenderlab::evaluate(scheme,
algorithms,
type = "topNList",
n = c(1, 3, 5, 10, 15, 20)
)

## AR run fold/sample [model time/prediction time]
## 1 [0.28sec/75.01sec]
## 2 [0.28sec/80.36sec]
## 3 [0.25sec/79.74sec]
## 4 [0.27sec/78.68sec]
## 5 [0.25sec/79.47sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/22.57sec]
## 2 [0sec/26.23sec]
## 3 [0sec/20.61sec]
## 4 [0sec/27.64sec]
## 5 [0.02sec/20.93sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.02sec/17.54sec]
## 2 [0.02sec/17.41sec]
## 3 [0sec/17.66sec]
## 4 [0.02sec/17.52sec]
## 5 [0sec/17.78sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [581.77sec/3.13sec]
## 2 [587.98sec/3.17sec]
## 3 [584.76sec/3.2sec]
## 4 [578.73sec/3.46sec]
## 5 [585.58sec/3.17sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/323sec]
## 2 [0sec/320.84sec]
## 3 [0sec/321.04sec]
## 4 [0sec/321.85sec]
## 5 [0sec/320.54sec]
The output is stored as a list containing all the evaluations.
results

## List of evaluation results for 5 recommenders:
## Evaluation results for 5 folds/samples using method 'AR'.
## Evaluation results for 5 folds/samples using method 'RANDOM'.
## Evaluation results for 5 folds/samples using method 'POPULAR'.
## Evaluation results for 5 folds/samples using method 'IBCF'.
## Evaluation results for 5 folds/samples using method 'UBCF'.
Results for each single model can be easily retrieved and inspected.
results$`popular items` %>%
getConfusionMatrix()

## [[1]]
## TP FP FN TN precision recall
## 1 0.005808081 0.9093434 0.9093434 4000.176 0.006346578 0.006346578
## 3 0.015151515 2.7303030 0.9000000 3998.355 0.005518764 0.016556291
## 5 0.024494949 4.5512626 0.8906566 3996.534 0.005353201 0.026766004
## 10 0.042676768 9.1088384 0.8724747 3991.976 0.004663355 0.046633554
## 15 0.053535354 13.6737374 0.8616162 3987.411 0.003899926 0.058498896
## 20 0.066666667 18.2363636 0.8484848 3982.848 0.003642384 0.072847682
## TPR FPR
## 1 0.006346578 0.0002272790
## 3 0.016556291 0.0006824052
## 5 0.026766004 0.0011375313
## 10 0.046633554 0.0022766404
## 15 0.058498896 0.0034175799
## 20 0.072847682 0.0045579514
##
## [[2]]
## TP FP FN TN precision recall
## 1 0.005808081 0.9179293 0.9179293 4000.158 0.006287589 0.006287589
## 3 0.019191919 2.7520202 0.9045455 3998.324 0.006925460 0.020776381
## 5 0.026262626 4.5924242 0.8974747 3996.484 0.005686167 0.028430837
## 10 0.041666667 9.1957071 0.8820707 3991.881 0.004510662 0.045106616
## 15 0.058585859 13.7974747 0.8651515 3987.279 0.004228176 0.063422635
## 20 0.070707071 18.4040404 0.8530303 3982.672 0.003827228 0.076544560
## TPR FPR
## 1 0.006287589 0.0002294250
## 3 0.020776381 0.0006878331
## 5 0.028430837 0.0011478191
## 10 0.045106616 0.0022983522
## 15 0.063422635 0.0034485066
## 20 0.076544560 0.0045998601
##
## [[3]]
## TP FP FN TN precision recall
## 1 0.008080808 0.9227273 0.9227273 4000.146 0.008681498 0.008681498
## 3 0.018434343 2.7739899 0.9123737 3998.295 0.006601555 0.019804666
## 5 0.027272727 4.6267677 0.9035354 3996.442 0.005860011 0.029300054
## 10 0.045454545 9.2626263 0.8853535 3991.807 0.004883342 0.048833424
## 15 0.058585859 13.9035354 0.8722222 3987.166 0.004196057 0.062940857
## 20 0.071212121 18.5449495 0.8595960 3982.524 0.003825285 0.076505697
## TPR FPR
## 1 0.008681498 0.0002306242
## 3 0.019804666 0.0006933241
## 5 0.029300054 0.0011564028
## 10 0.048833424 0.0023150778
## 15 0.062940857 0.0034750151
## 20 0.076505697 0.0046350786
##
## [[4]]
## TP FP FN TN precision recall
## 1 0.005050505 0.9209596 0.9209596 4000.153 0.005454050 0.00545405
## 3 0.016919192 2.7611111 0.9090909 3998.313 0.006090355 0.01827107
## 5 0.026262626 4.6037879 0.8997475 3996.470 0.005672212 0.02836106
## 10 0.041666667 9.2184343 0.8843434 3991.856 0.004499591 0.04499591
## 15 0.052525253 13.8376263 0.8734848 3987.236 0.003781474 0.05672212
## 20 0.066414141 18.4537879 0.8595960 3982.620 0.003586038 0.07172075
## TPR FPR
## 1 0.00545405 0.0002301824
## 3 0.01827107 0.0006901053
## 5 0.02836106 0.0011506593
## 10 0.04499591 0.0023040326
## 15 0.05672212 0.0034585419
## 20 0.07172075 0.0046122939
##
## [[5]]
## TP FP FN TN precision recall
## 1 0.007828283 0.9176768 0.9176768 4000.157 0.008458390 0.00845839
## 3 0.019696970 2.7568182 0.9058081 3998.318 0.007094134 0.02128240
## 5 0.027020202 4.6005051 0.8984848 3996.474 0.005839018 0.02919509
## 10 0.045454545 9.2095960 0.8800505 3991.865 0.004911323 0.04911323
## 15 0.055303030 13.8272727 0.8702020 3987.247 0.003983629 0.05975443
## 20 0.068181818 18.4419192 0.8573232 3982.633 0.003683492 0.07366985
## TPR FPR
## 1 0.00845839 0.0002293619
## 3 0.02128240 0.0006890323
## 5 0.02919509 0.0011498388
## 10 0.04911323 0.0023018235
## 15 0.05975443 0.0034559542
## 20 0.07366985 0.0046093275
recommenderlab has a basic plot function that can be used to compare models' performance. However, I prefer to arrange the results in a tidy format for added flexibility and charting customisation.
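For reference, the built-in charts can be obtained along these lines (a quick sketch; the annotate and legend settings are just illustrative):
# ROC curves for all models, annotating a couple of the curves
plot(results, annotate = c(1, 3), legend = "topleft")
# Precision-recall curves
plot(results, "prec/rec", annotate = 3)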
First, I arrange the confusion matrix output for one model in a convenient format.
# Pull into a list all confusion matrix information for one model
tmp <- results$`user-based CF` %>%
getConfusionMatrix() %>%
as.list()
# Calculate average value of 5 cross-validation rounds
as.data.frame( Reduce("+",tmp) / length(tmp)) %>%
# Add a column to mark the number of recommendations calculated
mutate(n = c(1, 3, 5, 10, 15, 20)) %>%
# Select only columns needed and sorting out order
select('n', 'precision', 'recall', 'TPR', 'FPR')

## n precision recall TPR FPR
## 1 1 0.07571073 0.08189699 0.08189699 0.0002309981
## 2 3 0.04464857 0.14489531 0.14489531 0.0007162840
## 3 5 0.03434509 0.18577780 0.18577780 0.0012066822
## 4 10 0.02304660 0.24934013 0.24934013 0.0024416024
## 5 15 0.01811874 0.29403601 0.29403601 0.0036808775
## 6 20 0.01499064 0.32436463 0.32436463 0.0049234725
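As an aside, recommenderlab also provides an avg() method that averages the folds of an evaluation result in a single call; a minimal equivalent of the chunk above (assuming the same results object) would be:
# Fold-averaged confusion matrix for the user-based CF model
avg(results$`user-based CF`)
I will stick with the manual approach, as it slots straight into a data frame for charting.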
Then, I put the previous steps into a function.
avg_conf_matr <- function(results) {
tmp <- results %>%
getConfusionMatrix() %>%
as.list()
as.data.frame( Reduce("+",tmp) / length(tmp)) %>%
mutate(n = c(1, 3, 5, 10, 15, 20)) %>%
select('n', 'precision', 'recall', 'TPR', 'FPR')
}

Next, I use the map() function from the purrr package to get all results in a tidy format, ready for charting.
# Using map() to iterate function across all models
results_tbl <- results %>%
map(avg_conf_matr) %>%
# Turning into an unnested tibble
enframe() %>%
# Unnesting to have all variables on same level
unnest()
results_tbl

## # A tibble: 30 x 6
## name n precision recall TPR FPR
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 association rules 1 0.0434 0.0385 0.0385 0.000196
## 2 association rules 3 0.0316 0.0751 0.0751 0.000579
## 3 association rules 5 0.0279 0.102 0.102 0.000943
## 4 association rules 10 0.0235 0.144 0.144 0.00178
## 5 association rules 15 0.0211 0.167 0.167 0.00254
## 6 association rules 20 0.0198 0.183 0.183 0.00324
## 7 random items 1 0.000101 0.000109 0.000109 0.000250
## 8 random items 3 0.000219 0.000711 0.000711 0.000750
## 9 random items 5 0.000242 0.00131 0.00131 0.00125
## 10 random items 10 0.000283 0.00306 0.00306 0.00250
## # ... with 20 more rows
The performance of classification models can be compared using ROC curves, which plot the true positive rate (TPR) against the false positive rate (FPR).
The item-based collaborative filtering model is the clear winner as it achieves the highest TPR for any given level of FPR. This means that the model is producing the highest number of relevant recommendations (true positives) for the same level of non-relevant recommendations (false positives).
Note that using fct_reorder2() arranges plot legend entries by best final FPR and TPR, aligning them with the curves and making the plot easier to read.
results_tbl %>%
ggplot(aes(FPR, TPR, colour = fct_reorder2(as.factor(name), FPR, TPR))) +
geom_line() +
geom_label(aes(label = n)) +
labs(title = "ROC curves",
colour = "Model") +
theme_grey(base_size = 14)

Another common way to compare the performance of classification models is with Precision vs Recall curves. Precision shows how sensitive models are to False Positives (i.e. recommending an item that is not very likely to be purchased), whereas Recall (which is just another name for the TPR) shows how sensitive models are to False Negatives (i.e. failing to suggest an item that is highly likely to be purchased).
Normally, we care about accurately predicting which items are more likely to be purchased because this would have a positive impact on sales and revenue. In other words, we want to maximise Recall (or minimise FNs) for the same Precision.
The plot confirms that item-based CF is the best model because it has higher Recall (i.e. lower FNs) for any given level of Precision. This means that IBCF minimises FNs (i.e. failing to suggest an item highly likely to be purchased) for any given level of Precision.
results_tbl %>%
ggplot(aes(recall, precision,
colour = fct_reorder2(as.factor(name), precision, recall))) +
geom_line() +
geom_label(aes(label = n)) +
labs(title = "Precision-Recall curves",
colour = "Model") +
theme_grey(base_size = 14)

The final step is to generate a prediction with the best performing model. To do that, I need to create a made-up purchase order.
First, I create a character vector containing a few products selected at random.
customer_order <- c("GREEN REGENCY TEACUP AND SAUCER",
"SET OF 3 BUTTERFLY COOKIE CUTTERS",
"JAM MAKING SET WITH JARS",
"SET OF TEA COFFEE SUGAR TINS PANTRY",
"SET OF 4 PANTRY JELLY MOULDS")Next, I put this order in a format that recommenderlab accepts.
new_order_rat_matrx <- retail %>%
# Select item descriptions from retail dataset
select(Description) %>%
unique() %>%
# Add a 'value' column with 1's for customer order items
mutate(value = as.numeric(Description %in% customer_order)) %>%
# Spread into sparse matrix format
spread(key = Description, value = value) %>%
# Change to a matrix
as.matrix() %>%
# Convert to recommenderlab class 'binaryRatingsMatrix'
as("binaryRatingMatrix")Now, I can create a Recommender. I use getData to retrieve training data and set method = “IBCF” to select the best performing model (“item-based collaborative filtering”).
recomm <- Recommender(getData(scheme, 'train'),
method = "IBCF",
param = list(k = 5))
recomm

## Recommender of type 'IBCF' for 'binaryRatingMatrix'
## learned using 15832 users.
Finally, I can pass the Recommender and the made-up order to the predict function to create a top 10 recommendation list for the new customer.
pred <- predict(recomm,
newdata = new_order_rat_matrx,
n = 10)

Lastly, the suggested items can be inspected as a list.
as(pred, 'list')

## $`1`
## [1] "ROSES REGENCY TEACUP AND SAUCER"
## [2] "PINK REGENCY TEACUP AND SAUCER"
## [3] "SET OF 3 HEART COOKIE CUTTERS"
## [4] "REGENCY CAKESTAND 3 TIER"
## [5] "JAM MAKING SET PRINTED"
## [6] "RECIPE BOX PANTRY YELLOW DESIGN"
## [7] "SET OF 3 CAKE TINS PANTRY DESIGN"
## [8] "GINGERBREAD MAN COOKIE CUTTER"
## [9] "SET OF 3 REGENCY CAKE TINS"
## [10] "SET OF 6 SPICE TINS PANTRY DESIGN"
The full R code can be found on my GitHub profile.
Comments
This brings to an end the modelling and evaluation part of this project, which I found straightforward and quite enjoyable. recommenderlab is intuitive and easy to use, and I particularly appreciated its ability to estimate and compare several recommendation algorithms at the same time. In summary, I have learned how to carry out a Market Basket Analysis with recommenderlab in R, how to interpret the results, and how to choose the best performing model.