Market Basket Analysis is a category of data analytics techniques used to understand customers behaviors, based on either their similarity to each other or on the goods they buy, and uncover relationships amongst the products they buy. The idea would be for a business to use these learning to increase engagement by helping customers find relevant products faster. This can lead to more cross-selling and up-selling by suggesting additional goods or services.
In this project we use the Online Retail Data Set, donated to UCI in 2015 by the School of Engineering at London South Bank University.
This dataset contains transactions occurring between the 1st and 9th of December in the year 2010 for a UK-based and registered online retail company. The company sells mainly unique all-occasion gifts and many of their customers are wholesalers.
We picked this particular dataset because of its “real life†nature. the data set with its many flaws. This allows for an excellent exercise is data cleaning as well.
It’s reasonable to assume that the dataset came straight from the company’s database with very little alterations. From experience this is consistent with the state in which an analyst is most likely to receive data from a client they are to conduct a piece of analysis for.
## Loading the libraries, setting work directory and loading the data file.
library(data.table)
library(readxl)
library(tidyverse)
library(lubridate)
library(knitr)
setwd("C:/Users/qasim/OneDrive/Desktop/York U/5 - Advanced Methods of Data Analytics/Assignment 1/Recommender in R")
## Clearing variables and importing raw data file as well as trimming leading and trailing whitespaces
rm(list = ls())
retail <- read_excel("Online Retail.xlsx", trim_ws = TRUE)
## First glance at the data
##skim(retail)
summary(retail)
## InvoiceNo StockCode Description
## Length:541909 Length:541909 Length:541909
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Quantity InvoiceDate UnitPrice
## Min. :-80995.00 Min. :2010-12-01 08:26:00 Min. :-11062.06
## 1st Qu.: 1.00 1st Qu.:2011-03-28 11:34:00 1st Qu.: 1.25
## Median : 3.00 Median :2011-07-19 17:17:00 Median : 2.08
## Mean : 9.55 Mean :2011-07-04 13:34:57 Mean : 4.61
## 3rd Qu.: 10.00 3rd Qu.:2011-10-19 11:27:00 3rd Qu.: 4.13
## Max. : 80995.00 Max. :2011-12-09 12:50:00 Max. : 38970.00
##
## CustomerID Country
## Min. :12346 Length:541909
## 1st Qu.:13953 Class :character
## Median :15152 Mode :character
## Mean :15288
## 3rd Qu.:16791
## Max. :18287
## NA's :135080
The dataset consists of just over 540,000 observations spread across 8 variables. A few Descriptions and several CustomerIDs are missing and there are also some odd negatives Quantity and UnitPrice that need to be cleaned up.
The handy prefix indicates that the Invoice Numbers that starts with letter ‘C’ are a cancellation. By definition ‘Cancellations’ are products that the customers did not want. they are not needed for the analysis so they can be removed.
## if the InvoiceNo starts with letter 'C', it indicates a cancellation
retail %>%
filter(grepl("C", retail$InvoiceNo)) %>%
summarise(Total = n())
## Cancellations are not needed for the analysis so they can be removed
retail <- retail %>%
filter(!grepl("C", retail$InvoiceNo))
When filtering by non-positive Quantity, the Description shows what looks like a series of manually entered comments (e.g. “thrown awayâ€, “Unsaleableâ€, “damagedâ€,“?â€). Given that UnitPrice is also set to zero for all of these, it is safe to assume that these were inventory adjustments of some type. We remove these from our usable data as well.
## filtering by non-positive Quantity, Description shows manually entered adjustments codes.
retail %>%
filter(Quantity <= 0) %>%
group_by(Description, UnitPrice) %>%
summarise(count =n()) %>%
arrange(desc(count)) %>%
ungroup()
## remove all rows with non-positive _Quantity_.
retail <- retail %>%
filter(Quantity > 0)
Turning our attention to StockCode, we see a handful of non-product related codes (‘Postage’, ‘Bank Charges’, ‘Gift Vouchers’, etc.). These are removed as well.
## There are a handful of non-product related codes - creating a filter that can be updated in the future.
stc <- c('AMAZONFEE', 'BANK CHARGES', 'C2', 'DCGSSBOY', 'DCGSSGIRL',
'DOT', 'gift_0001_', 'PADS', 'POST')
## Summary
retail %>%
filter(grepl(paste(stc, collapse="|"), StockCode)) %>%
group_by(StockCode, Description) %>%
summarise(count =n()) %>%
arrange(desc(count)) %>%
ungroup()
## These can all be removed.
retail <- filter(retail, !grepl(paste(stc, collapse="|"), StockCode))
Next, we focus on description. A quick inspection shows there are about 50 or so manual annotations entered by employees.
We also see about 600 NAs. given the small number of these entries (less than 0.1%), we go ahead and remove these.
## Additional adjustment codes to remove - again creating a filter that can be updated in the future.
descr <- c( "check", "check?", "?", "??", "damaged", "found",
"adjustment", "Amazon", "AMAZON", "amazon adjust",
"Amazon Adjustment", "amazon sales", "Found", "FOUND",
"found box", "Found by jackie ", "Found in w/hse", "dotcom",
"dotcom adjust", "allocate stock for dotcom orders ta", "FBA",
"Dotcomgiftshop Gift Voucher £100.00", "on cargo order",
"wrongly sold (22719) barcode", "wrongly marked 23343",
"dotcomstock", "rcvd be air temp fix for dotcom sit", "Manual",
"John Lewis", "had been put aside", "for online retail orders",
"taig adjust", "amazon", "incorrectly credited C550456 see 47",
"returned", "wrongly coded 20713", "came coded as 20713",
"add stock to allocate online orders", "Adjust bad debt",
"alan hodge cant mamage this section", "website fixed",
"did a credit and did not tick ret", "michel oops",
"incorrectly credited C550456 see 47", "mailout", "test",
"Sale error", "Lighthouse Trading zero invc incorr", "SAMPLES",
"Marked as 23343", "wrongly coded 23343","Adjustment",
"rcvd be air temp fix for dotcom sit", "Had been put aside."
)
## Filtering out the unwanted entries.
retail <- retail %>%
filter(!Description %in% descr)
## There are also some 600 NAs in _Description_.
sum(is.na(retail$Description))
## [1] 584
## Given their small number (around 0.1% of total) We've opted to remove them.
retail <- retail %>%
filter(!is.na(Description))
There are a significant number of NAs in CustomerID. We can see that there are about 5 times as many invoices as there are customers, we plan to use Invoice No. as our unit of analysis.
We leave Customer ID untouched.
## There is still a significant number of NAs in _CustomerID_.
##skim(retail$CustomerID)
summary(retail$CustomerID)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 12346 13975 15159 15302 16803 18287 131779
## There are almost 5 times as many Orders as there are customers, so we use `InvoiceNo` for orders
sapply(retail[,c('InvoiceNo','CustomerID')], function(x) length(unique(x)))
## InvoiceNo CustomerID
## 19792 4336
Final finishing touches to our data set before EDA.
We save our cleaned data set in a .RDS format.
retail <- retail %>%
## Setting 'Description' and 'Country' as factors
mutate(Description = as.factor(Description)) %>%
mutate(Country = as.factor(Country)) %>%
## Changing 'InvoiceNo' type to numeric
mutate(InvoiceNo = as.numeric(InvoiceNo)) %>%
## Extracting 'Date' and 'Time' from 'InvoiceDate'
mutate(Date = as.Date(InvoiceDate)) %>%
mutate(Time = as.factor(format(InvoiceDate,"%H:%M:%S")))
glimpse(retail)
## Observations: 528,149
## Variables: 10
## $ InvoiceNo <dbl> 536365, 536365, 536365, 536365, 536365, 536365, 53...
## $ StockCode <chr> "85123A", "71053", "84406B", "84029G", "84029E", "...
## $ Description <fct> WHITE HANGING HEART T-LIGHT HOLDER, WHITE METAL LA...
## $ Quantity <dbl> 6, 6, 8, 6, 6, 2, 6, 6, 6, 32, 6, 6, 8, 6, 6, 3, 2...
## $ InvoiceDate <dttm> 2010-12-01 08:26:00, 2010-12-01 08:26:00, 2010-12...
## $ UnitPrice <dbl> 2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85, 1....
## $ CustomerID <dbl> 17850, 17850, 17850, 17850, 17850, 17850, 17850, 1...
## $ Country <fct> United Kingdom, United Kingdom, United Kingdom, Un...
## $ Date <date> 2010-12-01, 2010-12-01, 2010-12-01, 2010-12-01, 2...
## $ Time <fct> 08:26:00, 08:26:00, 08:26:00, 08:26:00, 08:26:00, ...
## Saving clensed data for analysis phase
saveRDS(retail, "retail.rds")
Here we look at the different features of the data set.
retail %>%
group_by(Description) %>%
summarize(count = n()) %>%
top_n(10, wt = count) %>%
arrange(desc(count)) %>%
ggplot(aes(x = reorder(Description, count), y = count))+
geom_bar(stat = "identity", fill = "royalblue", colour = "blue") +
labs(x = "", y = "Top 10 Best Sellers", title = "Most Ordered Products") +
coord_flip() +
theme_grey(base_size = 12)
The heart-shaped tea light holder is the most popular item.
retail %>%
group_by(Description) %>%
summarize(count = n()) %>%
mutate(pct=(count/sum(count))*100) %>%
arrange(desc(pct)) %>%
ungroup() %>%
top_n(10, wt=pct)
Top 10 most sold products represent around 3% of total items sold by the company.
retail %>%
ggplot(aes(hour(hms(Time)))) +
geom_histogram(stat = "count",fill = "#E69F00", colour = "red") +
labs(x = "Hour of Day", y = "") +
theme_grey(base_size = 12)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Majority of orders are placed between noon and 3:00 PM.
retail %>%
ggplot(aes(wday(Date,
week_start = getOption("lubridate.week.start", 1)))) +
geom_histogram(stat = "count" , fill = "forest green", colour = "dark green") +
labs(x = "Day of Week", y = "") +
scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")) +
theme_grey(base_size = 14)
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Most orders processed on Thursday and none processed on Saturday.
How and items does each customer buy?
What is the average value of orders?
retail %>%
group_by(InvoiceNo) %>%
summarise(n = mean(Quantity)) %>%
ggplot(aes(x=n)) +
geom_histogram(bins = 100000,fill = "purple",colour = "black") +
coord_cartesian(xlim=c(0,100)) +
scale_x_continuous(breaks=seq(0,100,10)) +
labs(x = "Average Number of Items per Purchase", y = "") +
theme_grey(base_size = 14)
retail %>%
mutate(Value = UnitPrice * Quantity) %>%
group_by(InvoiceNo) %>%
summarise(n = mean(Value)) %>%
ggplot(aes(x=n)) +
geom_histogram(bins = 200000, fill="firebrick3", colour = "sandybrown") +
coord_cartesian(xlim=c(0,100)) +
scale_x_continuous(breaks=seq(0,100,10)) +
labs(x = "Average Value per Purchase", y = "") +
theme_grey(base_size = 14)
A typical customer buys between 2 and 15 items. With the Mode being 2.
We see that the average of orders is around GBP 20. Peaks occur at 6 and 17.
For the analysis, we are selected Recommenderlab, an R package which provides a convenient framework to evaluate and compare various recommendation algorithms and quickly establish the best suited approach.
Before we can start, we need to arrange the purchase history in a rating matrix, with orders in rows and products in columns. One of the types of matrices that Recomenderlab accepts in a “Binary Rating Matrix”.
Binary rating matrix consisting of 0’s and 1’s, where 1’s indicate that the product was purchased. This is the matrix type needed for the analysis and it does not require normalization.
However, when creating the rating matrix, it becomes apparent that some orders contain the same item more than once, as shown in the example below:
## Filtering by an order number which contains the same stock code more than once
retail %>%
filter(InvoiceNo == 557886 & StockCode == 22436) %>%
select(InvoiceNo, StockCode, Quantity, UnitPrice, CustomerID)
As we only care about whether an object was ordered or not, we remove these duplicates.
retail <- retail %>%
## Create unique identifier
mutate(InNo_Desc = paste(InvoiceNo, Description, sep = ' '))
## Filter out duplicates and drop unique identifier
retail <- retail[!duplicated(retail$InNo_Desc), ] %>%
select(-InNo_Desc)
We can now create the required matrix.
ratings_matrix <- retail %>%
## Select only needed variables
select(InvoiceNo, Description) %>%
## Add a column of 1s
mutate(value = 1) %>%
## Spread into user-item format
spread(Description, value, fill = 0) %>%
select(-InvoiceNo) %>%
## Convert to matrix
as.matrix()
## Convert to recommenderlab class 'binaryRatingsMatrix'
ratings_matrix <- as(ratings_matrix, "binaryRatingMatrix")
In order to establish a model’s effectiveness, recommenderlab implements a number of evaluation schemes.
In this scheme, we split the data into a train and a test set by selecting train = 0.8 for a 80/20 train/test split. we have also set method = “cross†and k = 5 for a 5-fold cross validation. This means that the data is divided into k subsets of equal size, with 80% of the data used for training and the remaining 20% used for evaluation. The models are recursively estimated 10 times, each time using a different train/test split, which ensures that all users and items are considered for both training and testing. The results can then be averaged to produce a single evaluation set.
In simple terms we are using a 5-fold cross validation for our model.
Selecting given = -1 means that for the test users ‘all but 1’ randomly selected item is withheld for evaluation.
scheme <- ratings_matrix %>%
evaluationScheme(method = "cross",
k = 5,
train = 0.8,
given = -1)
scheme
## Evaluation scheme using all-but-1 items
## Method: 'cross-validation' with 5 run(s).
## Good ratings: NA
## Data set: 19792 x 4002 rating matrix of class 'binaryRatingMatrix' with 517355 ratings.
One of recommenderlab main features, and reason for selecting it for analysis, is the ability to estimate multiple algorithms in one go. First, we create a list with the algorithms we want to estimate, specifying all the models’ parameters.
Here, we use schemes which evaluate on a binary rating matrix. we include the random items algorithm for benchmarking purposes.
Note: Trail and error leads us to choosing a support of 0.01 and a confidence level of 0.01
algorithms <- list(
"association rules" = list(name = "AR",
param = list(supp = 0.01, conf = 0.01)),
"random items" = list(name = "RANDOM", param = NULL),
"popular items" = list(name = "POPULAR", param = NULL),
"item-based CF" = list(name = "IBCF", param = list(k = 5)),
"user-based CF" = list(name = "UBCF",
param = list(method = "Cosine", nn = 500))
)
We now “pass” the scheme and algorithms to the evaluate() function, select type = topNList to evaluate a Top N List of product recommendations and specify how many recommendations to calculate with the parameter n = c(1, 3, 5, 10, 15, 20).
Note: that the CF based algorithms take a few minutes each to estimate.
The output is stored as a list containing all the evaluations.
results <- recommenderlab::evaluate(scheme,
algorithms,
type = "topNList",
n = c(1, 3, 5, 10, 15, 20)
)
## AR run fold/sample [model time/prediction time]
## 1 [0.39sec/99.57sec]
## 2 [0.27sec/84.26sec]
## 3 [0.29sec/85.93sec]
## 4 [0.28sec/83.35sec]
## 5 [0.26sec/83.19sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/21.35sec]
## 2 [0sec/21.9sec]
## 3 [0sec/21.54sec]
## 4 [0sec/21.99sec]
## 5 [0sec/22.12sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.01sec/17.65sec]
## 2 [0.01sec/17.61sec]
## 3 [0.02sec/17.39sec]
## 4 [0.02sec/17.49sec]
## 5 [0sec/17.84sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [245.18sec/3.05sec]
## 2 [246.37sec/3.01sec]
## 3 [245.18sec/2.7sec]
## 4 [239.61sec/2.67sec]
## 5 [266.6sec/2.75sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/104.25sec]
## 2 [0sec/100.52sec]
## 3 [0sec/102.95sec]
## 4 [0sec/106.31sec]
## 5 [0sec/109.7sec]
results
## List of evaluation results for 5 recommenders:
## Evaluation results for 5 folds/samples using method 'AR'.
## Evaluation results for 5 folds/samples using method 'RANDOM'.
## Evaluation results for 5 folds/samples using method 'POPULAR'.
## Evaluation results for 5 folds/samples using method 'IBCF'.
## Evaluation results for 5 folds/samples using method 'UBCF'.
First, we arrange the confusion matrix output for one model in a convenient format.
## Pull into a list all confusion matrix information for one model
tmp <- results$`user-based CF` %>%
getConfusionMatrix() %>%
as.list()
## Calculate average value of 5 cross-validation rounds
as.data.frame( Reduce("+",tmp) / length(tmp)) %>%
## Add a column to mark the number of recommendations calculated
mutate(n = c(1, 3, 5, 10, 15, 20)) %>%
## Select only columns needed and sorting out order
select('n', 'precision', 'recall', 'TPR', 'FPR')
We put the previous steps into a formula. Next, We use the map() function from the purrr package to get all results in a tidy format, ready for charting.
avg_conf_matr <- function(results) {
tmp <- results %>%
getConfusionMatrix() %>%
as.list()
as.data.frame(Reduce("+",tmp) / length(tmp)) %>%
mutate(n = c(1, 3, 5, 10, 15, 20)) %>%
select('n', 'precision', 'recall', 'TPR', 'FPR')
}
## Using map() to iterate function across all models
results_tbl <- results %>%
map(avg_conf_matr) %>%
## Turning into an unnested tibble
enframe() %>%
## Unnesting to have all variables on same level
unnest()
results_tbl
Classification models’ performance can be compared using the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR).
results_tbl %>%
ggplot(aes(FPR, TPR,
colour = fct_reorder2(as.factor(name),
FPR, TPR))) +
geom_line() +
geom_label(aes(label = n)) +
labs(title = "ROC curves", colour = "Model") +
theme_grey(base_size = 14)
The item-based collaborative filtering model is the clear winner as it achieves the highest TPR for any given level of FPR. This means that the model is producing the highest number of relevant recommendations (true positives) for the same level of non-relevant recommendations (false positives).
Another common way to compare classification models’ performance is with Precision Vs Recall curves. Precision shows how sensitive models are to False Positives (i.e. recommending an item not very likely to be purchased) whereas Recall (which is just another name for the TPR) looks at how sensitive models are to False Negatives (i.e. do not suggest an item which is highly likely to be purchased). Normally, we care about accurately predicting which items are more likely to be purchased because this would have a positive impact on sales and revenue. In other words, we want to maximize Recall (or minimize FNs) for the same level of Precision.
results_tbl %>%
ggplot(aes(recall, precision,
colour = fct_reorder2(as.factor(name),
precision, recall))) +
geom_line() +
geom_label(aes(label = n)) +
labs(title = "Precision-Recall curves", colour = "Model") +
theme_grey(base_size = 14)
The plot confirms that item-based Collaborative Filter (IBCF) is the best model because it has higher Recall for any given level of Precision. This means that IBCF minimizes FNs (i.e. not suggesting an item highly likely to be purchased) for all level of FPs.
The final step is to generate a prediction with the best performing model. To do that, we create a made-up purchase order, put this order in a format that recommenderlab accepts, we create a recommender, and then pass the recommender and our made-up order to predict function.
Lastly, the suggested items can be inspected as a list.
## Create a made-up order with a string containing 6 products selected at random.
customer_order <- c("GREEN REGENCY TEACUP AND SAUCER",
"SET OF 3 BUTTERFLY COOKIE CUTTERS",
"JAM MAKING SET WITH JARS",
"SET OF TEA COFFEE SUGAR TINS PANTRY",
"SET OF 4 PANTRY JELLY MOULDS")
## Put string in a format that recommenderlab accepts.
new_order_rat_matrix <- retail %>%
select(Description) %>%
## Select item descriptions from retail dataset
unique() %>%
mutate(value = as.numeric(Description %in% customer_order)) %>%
## Add a 'value' column
spread(key = Description, value = value) %>%
## Spread into sparse matrix format
as.matrix()
## Change to a matrix
new_order_rat_matrix <- as(new_order_rat_matrix, "binaryRatingMatrix") # Convert to recommenderlab class 'binaryRatingsMatrix'
## Create a `Recommender`
recomm <- Recommender(getData(scheme, 'train'),
method = "IBCF",
param = list(k = 5))
## Pass the `Recommender` and the made-up order to the `predict` function to create
## A top 10 recommendation list for the new customer.
pred <- predict(recomm,
newdata = new_order_rat_matrix,
n = 10)
## Inspect this pediction as a list
as(pred, 'list')
## $`1`
## [1] "ROSES REGENCY TEACUP AND SAUCER"
## [2] "PINK REGENCY TEACUP AND SAUCER"
## [3] "SET OF 3 HEART COOKIE CUTTERS"
## [4] "REGENCY CAKESTAND 3 TIER"
## [5] "JAM MAKING SET PRINTED"
## [6] "RECIPE BOX PANTRY YELLOW DESIGN"
## [7] "SET OF 3 CAKE TINS PANTRY DESIGN"
## [8] "SET OF 3 REGENCY CAKE TINS"
## [9] "SET OF 6 SPICE TINS PANTRY DESIGN"
## [10] "3 PIECE SPACEBOY COOKIE CUTTER SET"
Recommenderlab leads up to easily deploy and compare various recommender models. Based on our data set, we select IBCF as the best model for this situation.
In the real world, such a model would have to be deployed to be usable.
For the app deployment We create 2 data files: past_orders_matrix and item_list.
past_orders_matrix is a user-item sparse matrix containing the history of past orders. This is needed in the Shiny server.R file for all the calculations.
item_list is a list of all the products available to purchase. This will feed in the Shiny ui.R file to make the products list available for selection.
We save these files for use in the app.
## Create `past_orders_matrix` containing the history of past orders. This is a is a user-item sparse matrix.
## This is needed in the `server.R` file for all the calculations.
past_orders_matrix <- retail %>%
## Select only needed variables
select(InvoiceNo, Description) %>%
## Add a column of 1s
mutate(value = 1) %>%
## Spread into user-item format
spread(Description, value, fill = 0) %>%
select(-InvoiceNo) %>%
## Convert to matrix
as.matrix()
## Convert to class "dgCMatrix"
past_orders_matrix <- as(past_orders_matrix, "dgCMatrix")
## save the file for use in the app
saveRDS(past_orders_matrix,
file = "past_orders_matrix.rds")
## Create `item_list` list of all the products available to purchase.
## This will feed in the `ui.R` part of the app to make the producs list available for selection.
item_list <- retail %>%
select(Description) %>%
unique()
## I save the file for use in the app
saveRDS(item_list,
file = "item_list.rds")
We run into the problem with our model being slow in prediction.
Fortunately, during research on sample apps we came across this brilliant Kaggle kernel by Philipp Spachtholz.
Spachtholz carried out a salient analysis on a non-Kaggle dataset and crucially a Shiny-based Book Recommender system using a much faster collaborative filtering code. In particular, he uses simililarity_measures.R functions for calculating similarity matrices, and the cf_algorithm.R file with the collaborative filtering algorithm and prediction function. This speeds up the prediction considerably.
We create a matrix new_order and put it in a a user_item matrix format. Then, we add the new_order to the past_orders_matrix as its first entry.
## Use `item_list` to put the `new_order` in a user_item matrix format.
new_order <- item_list %>%
## Add a 'value' column with 1's for customer order items
mutate(value = as.numeric(Description %in% customer_order)) %>%
## Spread into sparse matrix format
spread(key = Description, value = value) %>%
## Change to a matrix
as.matrix()
# Convert to class "dgCMatrix"
new_order <- as(new_order, "dgCMatrix")
# Then, I add the `new_order` to the `past_orders_matrix` as its first order.
all_orders_dgc <- t(rbind(new_order,past_orders_matrix))
We set a number of parameters required by the Improved CF to work. Next, we load the algorithm implementations and similarity calculations.
# Set range of items to calculate predictions for - here I select them all
items_to_predict <- 1:nrow(all_orders_dgc)
# Set current user to 1, which corresponds to new_order
users <- c(1)
# Set prediction indices
prediction_indices <- as.matrix(expand.grid(items_to_predict, users = users))
# Load algorithm implementations and similarity calculations
source("cf_algorithm.R")
##
## Attaching package: 'slam'
## The following object is masked from 'package:data.table':
##
## rollup
source("similarity_measures.R")
Finally, we fit the item-based CF model with the Improved CF and check the runtime.
We compare this to the item-based CF model with recommenderlab and compare performances.
## Forced garbage Collection for Vector memory issues
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2930943 156.6 9659264 515.9 19462042 1039.4
## Vcells 15445155 117.9 249428734 1903.0 759611508 5795.4
start <- Sys.time()
recomm <- predict_cf(all_orders_dgc,
prediction_indices,
"ibcf", FALSE, cal_cos, 3,
FALSE, 4000, 2000)
end <- Sys.time()
cat('runtime', end - start)
## runtime 0.7755589
## Runtime 0.3677909 - Impressive
# Convert `all_orders` to class "binaryRatingMatrix" - by way of converting to dataframe first.
b = as.data.frame(summary(all_orders_dgc))
all_orders_brm <- as(b, "binaryRatingMatrix")
rm(b)
## Run run IBCF model on recommenderlab
## Forced garbage Collection for Vector memory issues
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2964739 158.4 9659264 515.9 19462042 1039.4
## Vcells 15570294 118.8 199542987 1522.4 759611508 5795.4
## Commenting out following code as memory limitations do not allow code to run in R markdown
## start <- Sys.time()
## recomm <- Recommender(all_orders_brm,
## method = "IBCF",
## param = list(k = 5))
## end <- Sys.time()
## cat('runtime', end - start)
## runtime 12.75939
We have all the necessary pieces in place for our Goods Recommender. We write and deploy our shinny app in the file shinnyrecommenderapp.R.
The above analysis in replicated in a shiny app we deployed using the shiny framework. Shiny app is deployed at: https://qasimahmed.shinyapps.io/RGoodsRecommender