Market Basket Analysis and Recommender

Author: Qasim Ahmed

The purpose of this exercise is to work through a real-world example of Market Basket Analysis. We look at goods sold by an actual company and build a list of recommended products. At the end, we also deploy these findings as a Shiny-based web app.

Introduction

Market Basket Analysis is a category of data analytics techniques used to understand customers' behavior, based either on their similarity to each other or on the goods they buy, and to uncover relationships amongst the products they purchase. The idea is for a business to use these learnings to increase engagement by helping customers find relevant products faster. This can lead to more cross-selling and up-selling by suggesting additional goods or services.
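Concretely, these relationships are expressed as association rules of the form {X} => {Y} ("customers who buy X also tend to buy Y"). For reference, the two textbook measures we tune later in the analysis are:

support(X => Y)    = (number of transactions containing both X and Y) / (total number of transactions)
confidence(X => Y) = support(X and Y) / support(X)

A rule is only worth acting on if it has enough support (it occurs often enough to matter) and enough confidence (when X is bought, Y follows often enough).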

Data

In this project we use the Online Retail Data Set, donated to the UCI Machine Learning Repository in 2015 by the School of Engineering at London South Bank University.

This dataset contains transactions occurring between 1 December 2010 and 9 December 2011 for a UK-based and registered online retail company. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

We picked this particular dataset because of its “real life” nature: it comes with many flaws, which also makes it an excellent exercise in data cleaning.

It is reasonable to assume that the dataset came straight from the company’s database with very few alterations. In our experience, this is consistent with the state in which an analyst is most likely to receive data from a client.

Data Prep

Loading and Inspecting the Data

## Loading the libraries, setting the working directory and loading the data file.

library(data.table)
library(readxl)               
library(tidyverse)
library(lubridate)
library(knitr)


setwd("C:/Users/qasim/OneDrive/Desktop/York U/5 - Advanced Methods of Data Analytics/Assignment 1/Recommender in R")

## Clearing variables and importing raw data file as well as trimming leading and trailing whitespaces
rm(list = ls())
retail <- read_excel("Online Retail.xlsx", trim_ws = TRUE)

## First glance at the data
##skim(retail)
summary(retail)
##   InvoiceNo          StockCode         Description       
##  Length:541909      Length:541909      Length:541909     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     Quantity          InvoiceDate                    UnitPrice        
##  Min.   :-80995.00   Min.   :2010-12-01 08:26:00   Min.   :-11062.06  
##  1st Qu.:     1.00   1st Qu.:2011-03-28 11:34:00   1st Qu.:     1.25  
##  Median :     3.00   Median :2011-07-19 17:17:00   Median :     2.08  
##  Mean   :     9.55   Mean   :2011-07-04 13:34:57   Mean   :     4.61  
##  3rd Qu.:    10.00   3rd Qu.:2011-10-19 11:27:00   3rd Qu.:     4.13  
##  Max.   : 80995.00   Max.   :2011-12-09 12:50:00   Max.   : 38970.00  
##                                                                       
##    CustomerID       Country         
##  Min.   :12346    Length:541909     
##  1st Qu.:13953    Class :character  
##  Median :15152    Mode  :character  
##  Mean   :15288                      
##  3rd Qu.:16791                      
##  Max.   :18287                      
##  NA's   :135080

The dataset consists of just over 540,000 observations spread across 8 variables. A few Descriptions and several CustomerIDs are missing, and there are also some odd negative Quantity and UnitPrice values that need to be cleaned up.

Cancelled Orders

Invoice numbers that start with the letter ‘C’ handily indicate a cancellation. By definition, cancellations are products that the customers did not want; they are not needed for the analysis, so they can be removed.

## If the InvoiceNo starts with the letter 'C', it indicates a cancellation
retail %>% 
  filter(grepl("^C", InvoiceNo)) %>% 
  summarise(Total = n())

## Cancellations are not needed for the analysis, so they can be removed
retail <- retail %>% 
  filter(!grepl("^C", InvoiceNo)) 

Negative Quantities

When filtering by non-positive Quantity, the Description shows what looks like a series of manually entered comments (e.g. “thrown away”, “Unsaleable”, “damaged”, “?”). Given that UnitPrice is also set to zero for all of these, it is safe to assume that these were inventory adjustments of some type. We remove these from our usable data as well.

## filtering by non-positive Quantity, Description shows manually entered adjustments codes. 
retail %>% 
  filter(Quantity <= 0) %>% 
  group_by(Description, UnitPrice) %>% 
  summarise(count =n()) %>%
  arrange(desc(count)) %>% 
  ungroup()
## remove all rows with non-positive _Quantity_. 

retail  <- retail %>%
  filter(Quantity > 0)

Non-Product Codes

Turning our attention to StockCode, we see a handful of non-product related codes (‘Postage’, ‘Bank Charges’, ‘Gift Vouchers’, etc.). These are removed as well.

## There are a handful of non-product related codes - creating a filter that can be updated in the future.
stc <- c('AMAZONFEE', 'BANK CHARGES', 'C2', 'DCGSSBOY', 'DCGSSGIRL',
         'DOT', 'gift_0001_', 'PADS', 'POST')

## Summary
retail %>%  
  filter(grepl(paste(stc, collapse="|"), StockCode))  %>% 
  group_by(StockCode, Description) %>% 
  summarise(count =n()) %>%
  arrange(desc(count)) %>% 
  ungroup()
## These can all be removed. 
retail <- filter(retail, !grepl(paste(stc, collapse="|"), StockCode))

Description

Next, we focus on Description. A quick inspection shows around 50 manual annotations entered by employees.

We also see about 600 NAs. Given the small number of these entries (less than 0.1% of the total), we go ahead and remove them.

## Additional adjustment codes to remove - again creating a filter that can be updated in the future.
descr <- c( "check", "check?", "?", "??", "damaged", "found", 
            "adjustment", "Amazon", "AMAZON", "amazon adjust", 
            "Amazon Adjustment", "amazon sales", "Found", "FOUND",
            "found box", "Found by jackie ", "Found in w/hse", "dotcom",
            "dotcom adjust", "allocate stock for dotcom orders ta", "FBA",
            "Dotcomgiftshop Gift Voucher £100.00", "on cargo order",
            "wrongly sold (22719) barcode", "wrongly marked 23343",
            "dotcomstock", "rcvd be air temp fix for dotcom sit", "Manual",
            "John Lewis", "had been put aside", "for online retail orders",  
            "taig adjust", "amazon", "incorrectly credited C550456 see 47",
            "returned", "wrongly coded 20713", "came coded as 20713", 
            "add stock to allocate online orders", "Adjust bad debt",
            "alan hodge cant mamage this section", "website fixed",
            "did  a credit  and did not tick ret", "michel oops",
            "incorrectly credited C550456 see 47", "mailout", "test",
            "Sale error",  "Lighthouse Trading zero invc incorr", "SAMPLES",
            "Marked as 23343", "wrongly coded 23343","Adjustment", 
            "rcvd be air temp fix for dotcom sit", "Had been put aside."
)

## Filtering out the unwanted entries.
retail <- retail %>% 
  filter(!Description %in% descr)

## There are also some 600 NAs in _Description_. 
sum(is.na(retail$Description))
## [1] 584
## Given their small number (around 0.1% of the total), we've opted to remove them.

retail <- retail %>% 
  filter(!is.na(Description))

Customer ID

There is still a significant number of NAs in CustomerID. Since there are almost 5 times as many invoices as there are customers, we plan to use InvoiceNo as our unit of analysis.

We leave Customer ID untouched.

## There is still a significant number of NAs in _CustomerID_. 
##skim(retail$CustomerID)
summary(retail$CustomerID)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   12346   13975   15159   15302   16803   18287  131779
## There are almost 5 times as many Orders as there are customers, so we use `InvoiceNo` for orders 
sapply(retail[,c('InvoiceNo','CustomerID')], function(x) length(unique(x)))
##  InvoiceNo CustomerID 
##      19792       4336

Housekeeping

A few finishing touches to our data set before EDA.

We save the cleaned data set in .rds format.

retail <- retail %>%
  ## Setting 'Description' and 'Country' as factors
  mutate(Description = as.factor(Description)) %>%
  mutate(Country = as.factor(Country)) %>% 
  ## Changing 'InvoiceNo' type to numeric
  mutate(InvoiceNo = as.numeric(InvoiceNo)) %>% 
  ## Extracting 'Date' and 'Time' from 'InvoiceDate'
  mutate(Date = as.Date(InvoiceDate)) %>% 
  mutate(Time = as.factor(format(InvoiceDate, "%H:%M:%S"))) 

glimpse(retail)
## Observations: 528,149
## Variables: 10
## $ InvoiceNo   <dbl> 536365, 536365, 536365, 536365, 536365, 536365, 53...
## $ StockCode   <chr> "85123A", "71053", "84406B", "84029G", "84029E", "...
## $ Description <fct> WHITE HANGING HEART T-LIGHT HOLDER, WHITE METAL LA...
## $ Quantity    <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2...
## $ InvoiceDate <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12...
## $ UnitPrice   <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1....
## $ CustomerID  <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 1...
## $ Country     <fct> United Kingdom, United Kingdom, United Kingdom, Un...
## $ Date        <date> 2010-12-01, 2010-12-01, 2010-12-01, 2010-12-01, 2...
## $ Time        <fct> 08:26:00, 08:26:00, 08:26:00, 08:26:00, 08:26:00, ...
## Saving cleansed data for the analysis phase
saveRDS(retail, "retail.rds")

Exploratory Data Analysis

Here we look at the different features of the data set.

When do customers place orders?

retail %>% 
  ggplot(aes(hour(hms(Time)))) + 
  geom_bar(fill = "#E69F00", colour = "red") +
  labs(x = "Hour of Day", y = "") +
  theme_grey(base_size = 12)

The majority of orders are placed between noon and 3:00 PM.

retail %>% 
  ggplot(aes(wday(Date, 
                  week_start = getOption("lubridate.week.start", 1)))) + 
  geom_bar(fill = "forest green", colour = "dark green") +
  labs(x = "Day of Week", y = "") +
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")) +
  theme_grey(base_size = 14)

Most orders are processed on Thursdays, and none are processed on Saturdays.

How many items does each customer buy?

What is the average value of orders?

retail %>% 
  group_by(InvoiceNo) %>% 
  summarise(n = mean(Quantity)) %>%
  ggplot(aes(x=n)) +
  geom_histogram(binwidth = 1, fill = "purple", colour = "black") + 
  coord_cartesian(xlim=c(0,100)) +
  scale_x_continuous(breaks=seq(0,100,10)) +
  labs(x = "Average Number of Items per Purchase", y = "") +
  theme_grey(base_size = 14)

retail %>% 
  mutate(Value = UnitPrice * Quantity) %>% 
  group_by(InvoiceNo) %>% 
  summarise(n = mean(Value)) %>%
  ggplot(aes(x=n)) +
  geom_histogram(binwidth = 1, fill = "firebrick3", colour = "sandybrown") + 
  coord_cartesian(xlim=c(0,100)) +
  scale_x_continuous(breaks=seq(0,100,10)) +
  labs(x = "Average Value per Purchase", y = "") + 
  theme_grey(base_size = 14)

A typical customer buys between 2 and 15 items per order, with the mode being 2.

We see that the average order value is around GBP 20, with peaks at roughly GBP 6 and GBP 17.

Data Modelling

For the analysis, we selected recommenderlab, an R package which provides a convenient framework to evaluate and compare various recommendation algorithms and quickly establish the best-suited approach.

Data Matrix

Before we can start, we need to arrange the purchase history in a rating matrix, with orders in rows and products in columns. One of the matrix types that recommenderlab accepts is a “binary rating matrix”.

A binary rating matrix consists of 0s and 1s, where a 1 indicates that the product was purchased. This is the matrix type needed for the analysis, and it does not require normalization.
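To make the format concrete, here is a minimal toy example (made-up data: three orders, four hypothetical items) coerced to recommenderlab's binaryRatingMatrix class:

library(recommenderlab)

## Toy example: rows are orders, columns are products; 1 = purchased
toy <- matrix(c(1, 0, 1, 0,
                0, 1, 1, 0,
                1, 1, 0, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("order", 1:3), paste0("item", 1:4)))
as(toy, "binaryRatingMatrix")
## 3 x 4 rating matrix of class 'binaryRatingMatrix' with 7 ratings.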

However, when creating the rating matrix, it becomes apparent that some orders contain the same item more than once, as shown in the example below:

## Filtering by an order number which contains the same stock code more than once
retail %>% 
  filter(InvoiceNo == 557886 & StockCode == 22436) %>% 
  select(InvoiceNo, StockCode, Quantity, UnitPrice, CustomerID)

As we only care about whether an item was ordered or not, we remove these duplicates.

## Create a unique identifier
retail <- retail %>% 
  mutate(InNo_Desc = paste(InvoiceNo, Description, sep = ' ')) 

## Filter out duplicates and drop the identifier
retail <- retail[!duplicated(retail$InNo_Desc), ] %>% 
  select(-InNo_Desc)

We can now create the required matrix.

ratings_matrix <- retail %>%
  ## Select only the needed variables
  select(InvoiceNo, Description) %>% 
  ## Add a column of 1s
  mutate(value = 1) %>%
  ## Spread into user-item format
  spread(Description, value, fill = 0) %>%
  select(-InvoiceNo) %>%
  ## Convert to matrix
  as.matrix()

## Convert to recommenderlab class 'binaryRatingMatrix'
ratings_matrix <- as(ratings_matrix, "binaryRatingMatrix")

Model Evaluation Scheme and Validation

In order to establish a model’s effectiveness, recommenderlab implements a number of evaluation schemes.

In this scheme, we set method = "cross" and k = 5 for a 5-fold cross-validation, with train = 0.8 corresponding to an 80/20 train/test split. The data is divided into 5 subsets of equal size, and the models are estimated 5 times, each time holding out a different subset (20% of the data) for evaluation and training on the remaining 80%. This ensures that all users and items are considered for both training and testing, and the results are then averaged to produce a single evaluation set.

In simple terms, we are using 5-fold cross-validation for our models.

Selecting given = -1 means that for each test user, all but one randomly selected item are given to the recommender, and the withheld item is used for evaluation.

scheme <- ratings_matrix %>% 
  evaluationScheme(method = "cross",
                   k      = 5, 
                   train  = 0.8,  
                   given  = -1)
scheme
## Evaluation scheme using all-but-1 items
## Method: 'cross-validation' with 5 run(s).
## Good ratings: NA
## Data set: 19792 x 4002 rating matrix of class 'binaryRatingMatrix' with 517355 ratings.

Algorithms List

One of recommenderlab's main features, and the reason for selecting it for this analysis, is the ability to estimate multiple algorithms in one go. First, we create a list of the algorithms we want to estimate, specifying each model's parameters.

Here, we use schemes which evaluate on a binary rating matrix. We include the random items algorithm for benchmarking purposes.

Note: trial and error leads us to choose a support of 0.01 and a confidence of 0.01 for the association rules model.

algorithms <- list(
  "association rules" = list(name  = "AR", 
                        param = list(supp = 0.01, conf = 0.01)),
  "random items"      = list(name  = "RANDOM",  param = NULL),
  "popular items"     = list(name  = "POPULAR", param = NULL),
  "item-based CF"     = list(name  = "IBCF", param = list(k = 5)),
  "user-based CF"     = list(name  = "UBCF", 
                        param = list(method = "Cosine", nn = 500))
                   )
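As a side note, these support and confidence thresholds can be sanity-checked by mining the rules directly with the arules package, the engine behind recommenderlab's "AR" method. This is an optional sketch, assuming the cleaned and de-duplicated retail data from above:

library(arules)

## Build transactions: one basket of product Descriptions per InvoiceNo
trans <- as(split(as.character(retail$Description), retail$InvoiceNo),
            "transactions")

## Mine rules at the thresholds chosen above
rules <- apriori(trans,
                 parameter = list(supp = 0.01, conf = 0.01, minlen = 2))

## Inspect the strongest rules by lift
inspect(head(sort(rules, by = "lift"), 5))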

Estimating the Models

We now “pass” the scheme and algorithms to the evaluate() function, select type = "topNList" to evaluate a top-N list of product recommendations, and specify how many recommendations to calculate with the parameter n = c(1, 3, 5, 10, 15, 20).

Note that the CF-based algorithms take a few minutes each to estimate.

The output is stored as a list containing all the evaluations.

results <- recommenderlab::evaluate(scheme, 
                                    algorithms, 
                                    type  = "topNList", 
                                    n     = c(1, 3, 5, 10, 15, 20)
                                    )
## AR run fold/sample [model time/prediction time]
##   1  [0.39sec/99.57sec] 
##   2  [0.27sec/84.26sec] 
##   3  [0.29sec/85.93sec] 
##   4  [0.28sec/83.35sec] 
##   5  [0.26sec/83.19sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/21.35sec] 
##   2  [0sec/21.9sec] 
##   3  [0sec/21.54sec] 
##   4  [0sec/21.99sec] 
##   5  [0sec/22.12sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0.01sec/17.65sec] 
##   2  [0.01sec/17.61sec] 
##   3  [0.02sec/17.39sec] 
##   4  [0.02sec/17.49sec] 
##   5  [0sec/17.84sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [245.18sec/3.05sec] 
##   2  [246.37sec/3.01sec] 
##   3  [245.18sec/2.7sec] 
##   4  [239.61sec/2.67sec] 
##   5  [266.6sec/2.75sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/104.25sec] 
##   2  [0sec/100.52sec] 
##   3  [0sec/102.95sec] 
##   4  [0sec/106.31sec] 
##   5  [0sec/109.7sec]
results
## List of evaluation results for 5 recommenders:
## Evaluation results for 5 folds/samples using method 'AR'.
## Evaluation results for 5 folds/samples using method 'RANDOM'.
## Evaluation results for 5 folds/samples using method 'POPULAR'.
## Evaluation results for 5 folds/samples using method 'IBCF'.
## Evaluation results for 5 folds/samples using method 'UBCF'.

Visualizing and Evaluating Our Models

First, we arrange the confusion matrix output for one model in a convenient format.

## Pull into a list all the confusion matrix information for one model 
tmp <- results$`user-based CF` %>%
  getConfusionMatrix() %>%  
  as.list() 

## Calculate the average value over the 5 cross-validation rounds 
as.data.frame(Reduce("+", tmp) / length(tmp)) %>% 
  ## Add a column to mark the number of recommendations calculated
  mutate(n = c(1, 3, 5, 10, 15, 20)) %>%
  ## Select only the columns needed, in the right order 
  select('n', 'precision', 'recall', 'TPR', 'FPR')

We wrap the previous steps into a function. Next, we use the map() function from the purrr package to get all results in a tidy format, ready for charting.

avg_conf_matr <- function(results) {
  tmp <- results %>%
    getConfusionMatrix() %>%  
    as.list() 
  as.data.frame(Reduce("+", tmp) / length(tmp)) %>% 
    mutate(n = c(1, 3, 5, 10, 15, 20)) %>%
    select('n', 'precision', 'recall', 'TPR', 'FPR') 
}
    
## Using map() to iterate the function across all models,
## then turning the result into an unnested tibble
results_tbl <- results %>%
  map(avg_conf_matr) %>% 
  enframe() %>%
  unnest()

results_tbl

ROC Curve

Classification model performance can be compared using the ROC curve, which plots the true positive rate (TPR = TP / (TP + FN)) against the false positive rate (FPR = FP / (FP + TN)).

results_tbl %>%
  ggplot(aes(FPR, TPR, 
             colour = fct_reorder2(as.factor(name), 
                      FPR, TPR))) +
  geom_line() +
  geom_label(aes(label = n))  +
  labs(title = "ROC curves", colour = "Model") +
  theme_grey(base_size = 14)

The item-based collaborative filtering model is the clear winner, as it achieves the highest TPR for any given level of FPR. This means the model produces the highest number of relevant recommendations (true positives) for the same level of non-relevant recommendations (false positives).

Precision-Recall Curve

Another common way to compare classification model performance is with precision-recall curves. Precision (TP / (TP + FP)) shows how sensitive a model is to false positives (i.e. recommending an item not very likely to be purchased), whereas recall (TP / (TP + FN), which is just another name for the TPR) shows how sensitive a model is to false negatives (i.e. failing to suggest an item highly likely to be purchased). Normally, we care about accurately predicting which items are more likely to be purchased, because this has a positive impact on sales and revenue. In other words, we want to maximize recall (or minimize FNs) for the same level of precision.

results_tbl %>%
  ggplot(aes(recall, precision, 
             colour = fct_reorder2(as.factor(name),  
                      precision, recall))) +
  geom_line() +
  geom_label(aes(label = n))  +
  labs(title = "Precision-Recall curves", colour = "Model") +
  theme_grey(base_size = 14)

The plot confirms that item-based collaborative filtering (IBCF) is the best model, because it has higher recall for any given level of precision. This means that IBCF minimizes FNs (i.e. failing to suggest an item highly likely to be purchased) for all levels of precision.

Test Prediction

The final step is to generate a prediction with the best-performing model. To do that, we create a made-up purchase order, put this order in a format that recommenderlab accepts, create a recommender, and then pass the recommender and our made-up order to the predict() function.

Lastly, the suggested items can be inspected as a list.

## Create a made-up order: a character vector containing 5 products selected at random.
customer_order <- c("GREEN REGENCY TEACUP AND SAUCER",
                    "SET OF 3 BUTTERFLY COOKIE CUTTERS",
                    "JAM MAKING SET WITH JARS",
                    "SET OF TEA COFFEE SUGAR TINS PANTRY",
                    "SET OF 4 PANTRY JELLY MOULDS")



## Put the order in a format that recommenderlab accepts.
new_order_rat_matrix <- retail %>% 
  ## Select the unique item descriptions from the retail dataset
  select(Description) %>%
  unique() %>% 
  ## Add a 'value' column: 1 if the item is in the order, 0 otherwise
  mutate(value = as.numeric(Description %in% customer_order)) %>% 
  ## Spread into user-item format and convert to a matrix
  spread(key = Description, value = value) %>% 
  as.matrix() 

## Convert to recommenderlab class 'binaryRatingMatrix'
new_order_rat_matrix <- as(new_order_rat_matrix, "binaryRatingMatrix")


## Create a `Recommender`
recomm <- Recommender(getData(scheme, 'train'), 
                      method = "IBCF",   
                      param = list(k = 5))


## Pass the `Recommender` and the made-up order to the `predict` function
## to create a top-10 recommendation list for the new customer.
pred <- predict(recomm, 
                newdata = new_order_rat_matrix, 
                n       = 10)


## Inspect this prediction as a list
as(pred, 'list')
## $`1`
##  [1] "ROSES REGENCY TEACUP AND SAUCER"   
##  [2] "PINK REGENCY TEACUP AND SAUCER"    
##  [3] "SET OF 3 HEART COOKIE CUTTERS"     
##  [4] "REGENCY CAKESTAND 3 TIER"          
##  [5] "JAM MAKING SET PRINTED"            
##  [6] "RECIPE BOX PANTRY YELLOW DESIGN"   
##  [7] "SET OF 3 CAKE TINS PANTRY DESIGN"  
##  [8] "SET OF 3 REGENCY CAKE TINS"        
##  [9] "SET OF 6 SPICE TINS PANTRY DESIGN" 
## [10] "3 PIECE SPACEBOY COOKIE CUTTER SET"

Summary

recommenderlab allows us to easily estimate and compare various recommender models. Based on our data set, we select IBCF as the best model for this situation.

Shiny App

In the real world, such a model would have to be deployed to be usable.

App Deployment Prep

For the app deployment, we create two data files: past_orders_matrix and item_list.

past_orders_matrix is a user-item sparse matrix containing the history of past orders. This is needed in the Shiny server.R file for all the calculations.

item_list is a list of all the products available to purchase. This will feed into the Shiny ui.R file to make the products list available for selection.

We save these files for use in the app.

## Create `past_orders_matrix` containing the history of past orders. This is a user-item sparse matrix. 

## This is needed in the `server.R` file for all the calculations. 

past_orders_matrix <- retail %>%

## Select only needed variables
  select(InvoiceNo, Description) %>% 

## Add a column of 1s
  mutate(value = 1) %>%

## Spread into user-item format
  spread(Description, value, fill = 0) %>%
  select(-InvoiceNo) %>% 
  
## Convert to matrix
  as.matrix()

## Convert to class "dgCMatrix"
past_orders_matrix   <- as(past_orders_matrix, "dgCMatrix")

## save the file for use in the app
saveRDS(past_orders_matrix, 
        file = "past_orders_matrix.rds")

## Create `item_list`, a list of all the products available to purchase. 

## This will feed into the `ui.R` part of the app to make the products list available for selection.

item_list <- retail %>% 
  select(Description) %>% 
  unique()

## Save the file for use in the app
saveRDS(item_list, 
        file = "item_list.rds")

Improved Collaborative Filtering

We run into a problem: our model is slow at generating predictions.

Fortunately, while researching sample apps, we came across a brilliant Kaggle kernel by Philipp Spachtholz.

Spachtholz carried out a similar analysis on a non-Kaggle dataset and, crucially, built a Shiny-based Book Recommender using much faster collaborative filtering code. In particular, he uses the similarity_measures.R functions for calculating similarity matrices, and the cf_algorithm.R file with the collaborative filtering algorithm and prediction function. This speeds up prediction considerably.

Testing the new approach

We create a new_order matrix and put it in user-item matrix format. Then, we add the new_order to the past_orders_matrix as its first entry.

## Use `item_list` to put the `new_order` in a user_item matrix format.
new_order <- item_list %>%

## Add a 'value' column with 1's for customer order items
  mutate(value = as.numeric(Description %in% customer_order)) %>%

## Spread into sparse matrix format
  spread(key = Description, value = value) %>%
  
## Change to a matrix
  as.matrix() 

# Convert to class "dgCMatrix"
new_order <-   as(new_order, "dgCMatrix")

# Then, we add the `new_order` to the `past_orders_matrix` as its first order.
all_orders_dgc <- t(rbind(new_order,past_orders_matrix))

We set a number of parameters required by the Improved CF to work. Next, we load the algorithm implementations and similarity calculations.

# Set the range of items to calculate predictions for - here we select them all
items_to_predict <- 1:nrow(all_orders_dgc)
# Set current user to 1, which corresponds to new_order
users <- c(1)
# Set prediction indices
prediction_indices <- as.matrix(expand.grid(items_to_predict, users = users))


# Load algorithm implementations and similarity calculations
source("cf_algorithm.R")
## 
## Attaching package: 'slam'
## The following object is masked from 'package:data.table':
## 
##     rollup
source("similarity_measures.R")

Finally, we fit the item-based CF model with the Improved CF code and check the runtime.

We then estimate the same item-based CF model with recommenderlab and compare performance.

## Force garbage collection to avoid vector memory issues
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  2930943 156.6    9659264  515.9  19462042 1039.4
## Vcells 15445155 117.9  249428734 1903.0 759611508 5795.4
start <- Sys.time()
recomm <- predict_cf(all_orders_dgc, 
                       prediction_indices,
                       "ibcf", FALSE, cal_cos, 3, 
                       FALSE, 4000, 2000)
end <- Sys.time()
cat('runtime', end - start)
## runtime 0.7755589

# Convert `all_orders` to class "binaryRatingMatrix" - by converting to a data frame first.
b <- as.data.frame(summary(all_orders_dgc))
all_orders_brm <- as(b, "binaryRatingMatrix")
rm(b)
    
## Run the IBCF model on recommenderlab

    
## Force garbage collection to avoid vector memory issues
gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  2964739 158.4    9659264  515.9  19462042 1039.4
## Vcells 15570294 118.8  199542987 1522.4 759611508 5795.4
## Commenting out the following code as memory limitations do not allow it to run in R Markdown   
## start <- Sys.time()
## recomm <- Recommender(all_orders_brm, 
##                      method = "IBCF",  
##                      param = list(k = 5))
## end <- Sys.time()
## cat('runtime', end - start)

## runtime 12.75939

Summary

We now have all the necessary pieces in place for our Goods Recommender. We write and deploy our Shiny app in the file shinnyrecommenderapp.R.
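For reference, here is a minimal single-file sketch of how such an app could be wired together. This is a hypothetical layout, not the deployed code: the UI widget names are made up, and it assumes predict_cf() returns an items x users score matrix (flagged in the comments below).

## Minimal sketch of the app (hypothetical; the deployed shinnyrecommenderapp.R
## may differ). Assumes item_list.rds, past_orders_matrix.rds, cf_algorithm.R
## and similarity_measures.R sit in the app directory.
library(shiny)
library(dplyr)
library(tidyr)
library(Matrix)

item_list          <- readRDS("item_list.rds")
past_orders_matrix <- readRDS("past_orders_matrix.rds")
source("cf_algorithm.R")
source("similarity_measures.R")

ui <- fluidPage(
  titlePanel("Goods Recommender"),
  selectInput("products", "Select products:",
              choices = sort(as.character(item_list$Description)),
              multiple = TRUE),
  actionButton("go", "Recommend"),
  tableOutput("recommendations")
)

server <- function(input, output) {
  recs <- eventReactive(input$go, {
    ## Build the new order as a 1-row user-item matrix, as in the analysis
    new_order <- item_list %>%
      mutate(value = as.numeric(Description %in% input$products)) %>%
      spread(key = Description, value = value) %>%
      as.matrix()
    new_order <- as(new_order, "dgCMatrix")

    ## Add the new order as the first user and predict with the fast IBCF,
    ## reusing the parameters from the analysis above
    all_orders <- t(rbind(new_order, past_orders_matrix))
    pred_idx   <- as.matrix(expand.grid(1:nrow(all_orders), users = 1))
    scores     <- predict_cf(all_orders, pred_idx,
                             "ibcf", FALSE, cal_cos, 3, FALSE, 4000, 2000)

    ## Assumption: predict_cf returns an items x users score matrix
    top10 <- sort(scores[, 1], decreasing = TRUE)[1:10]
    data.frame(Recommendation = names(top10))
  })
  output$recommendations <- renderTable(recs())
}

shinyApp(ui, server)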

Shiny App

The above analysis is replicated in a Shiny app, deployed at: https://qasimahmed.shinyapps.io/RGoodsRecommender