The goal for the final project is to build out a recommender system using a large dataset (e.g., 1M+ ratings or 10k+ users, 10k+ items). There are three deliverables, each with a separate due date:
The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset.
I intend to build a recommendation system that produces quality recommendations using the recommenderlab package we covered in class, and to experiment with the Apriori algorithm for association rules, which goes beyond what we did in class.
The ultimate goal of using both approaches is to compare the items recommended by the best-performing recommenderlab algorithm with those suggested by the Apriori association rules.
recommenderlab accepts two types of rating matrix for modelling:
Real rating matrix consisting of actual user ratings, which requires normalization.
Binary rating matrix consisting of 0's and 1's, where 1 indicates that the product was purchased. This is the matrix type needed for this analysis, and it does not require normalization.
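As a quick illustration of the two classes (a minimal sketch using toy matrices, not the retail data):
library(recommenderlab)
# explicit ratings -> realRatingMatrix, typically normalized before modelling
real_m <- as(matrix(c(5, NA, 3, 4, NA, 1), nrow = 2), "realRatingMatrix")
real_m <- normalize(real_m)
# purchase indicators -> binaryRatingMatrix, no normalization needed
bin_m <- as(matrix(c(1, 0, 1, 0, 1, 1), nrow = 2), "binaryRatingMatrix")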
I used recommenderlab, an R package that provides a convenient framework for evaluating and comparing various recommendation algorithms and quickly establishing the best-suited approach.
I arranged the purchase history in a rating matrix, with orders in rows and products in columns. This format is often called a user-item matrix because "users" (e.g., customers or orders) tend to be on the rows and "items" (e.g., products) on the columns.
recommenderlab can estimate multiple algorithms at a time, so I created a list of algorithms and an evaluation scheme that works on a binary rating matrix.
I compared and evaluated the performance of the algorithms using ROC curves, precision/recall, and confusion-matrix accuracy metrics (TP, FP, FN, TN, TPR, FPR), and used the best-performing algorithm to predict a top-N item list for a new customer.
Finally, I deployed the algorithm as a Shiny app in R.
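As a rough sketch of what such a deployment can look like: the objects Itemmodel and ratingMat refer to the fitted IBCF model and the binary rating matrix built later in this report, while the inputs and layout shown here are illustrative, not the exact app I deployed.
library(shiny)
library(recommenderlab)

ui <- fluidPage(
  titlePanel("Online Retail Recommender"),
  numericInput("order", "Order (row of the rating matrix):", value = 1, min = 1),
  numericInput("n", "Number of recommendations:", value = 5, min = 1, max = 25),
  tableOutput("recs")
)

server <- function(input, output) {
  output$recs <- renderTable({
    # predict the top-n items for the selected order using the fitted IBCF model
    pred <- predict(Itemmodel, newdata = ratingMat[input$order, ], n = input$n)
    data.frame(Recommendation = as(pred, "list")[[1]])
  })
}

shinyApp(ui, server)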
The data for this project comes from the UCI Machine Learning Repository, an online archive of large datasets which includes a wide variety of data types, analysis tasks, and application areas.
In this project I used the Online Retail dataset donated to UCI in 2015 by the School of Engineering at London South Bank University.
This dataset is popular and has been used in many market basket analyses. It contains 541,909 observations with eight variables, covering transactions that occurred between 01/Dec/2010 and 09/Dec/2011 for a UK-based and registered online retailer.
The data is too large for my GitHub but can be downloaded from: http://archive.ics.uci.edu/ml/machine-learning-databases/00352/
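For reproducibility, the file can also be fetched directly in R; a sketch, assuming the file in that UCI directory is named "Online Retail.xlsx":
url  <- "http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
dest <- "Online Retail.xlsx"
if (!file.exists(dest)) download.file(url, destfile = dest, mode = "wb")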
library(ggplot2)
library(plyr)
library(kableExtra)
library(recommenderlab)
library(data.table)
library(readxl)
library(tidyverse)
library(skimr)
library(knitr)
library(treemap)
library(RColorBrewer)
library(arules)
library(arulesViz)
retail <- read_excel("C:/Users/Emahayz_Pro/Desktop/Data_Science/Data 612/Week6-Final Project Proposal/Online Retail.xlsx",trim_ws = TRUE)
str(retail)
tibble [541,909 x 8] (S3: tbl_df/tbl/data.frame)
$ InvoiceNo : chr [1:541909] "536365" "536365" "536365" "536365" ...
$ StockCode : chr [1:541909] "85123A" "71053" "84406B" "84029G" ...
$ Description: chr [1:541909] "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
$ Quantity : num [1:541909] 6 6 8 6 6 2 6 6 6 32 ...
$ InvoiceDate: POSIXct[1:541909], format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
$ UnitPrice : num [1:541909] 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
$ CustomerID : num [1:541909] 17850 17850 17850 17850 17850 ...
$ Country : chr [1:541909] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
retail <- retail %>%
  filter(!grepl("C", InvoiceNo)) %>%   # drop cancelled orders (InvoiceNo containing "C")
  filter(!is.na(Description)) %>%      # drop rows with a missing product description
  filter(!is.na(CustomerID))           # drop rows with a missing customer ID
Create and then drop a unique identifier (invoice number plus description) to filter out duplicate rows.
retail <- retail %>%
mutate(InNo_Desc = paste(InvoiceNo, Description, sep = ' '))
retail <- retail[!duplicated(retail$InNo_Desc), ] %>%
select(-InNo_Desc)
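As a side note, the same de-duplication can be expressed with dplyr alone; a quick sanity check (a sketch, assuming the pipeline above has already run):
# should match nrow(retail) after the helper-column de-duplication above
nrow(distinct(retail, InvoiceNo, Description))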
summary(retail)
InvoiceNo StockCode Description Quantity
Length:387695 Length:387695 Length:387695 Min. : 1.00
Class :character Class :character Class :character 1st Qu.: 2.00
Mode :character Mode :character Mode :character Median : 6.00
Mean : 13.26
3rd Qu.: 12.00
Max. :80995.00
InvoiceDate UnitPrice CustomerID Country
Min. :2010-12-01 08:26:00 Min. : 0.000 Min. :12346 Length:387695
1st Qu.:2011-04-07 10:24:00 1st Qu.: 1.250 1st Qu.:13941 Class :character
Median :2011-07-29 14:29:00 Median : 1.950 Median :15144 Mode :character
Mean :2011-07-10 12:01:53 Mean : 3.121 Mean :15282
3rd Qu.:2011-10-20 11:03:00 3rd Qu.: 3.750 3rd Qu.:16789
Max. :2011-12-09 12:50:00 Max. :8142.750 Max. :18287
Count the unique invoices and customers, then convert the variables to appropriate types
sapply(retail[ ,c('InvoiceNo','CustomerID')],
function(x) length(unique(x)))
InvoiceNo CustomerID
18536 4339
retail <- retail %>%
mutate(Description = as.factor(Description)) %>%
mutate(Country = as.factor(Country)) %>%
mutate(InvoiceNo = as.numeric(InvoiceNo)) %>%
mutate(Date = as.Date(InvoiceDate)) %>%
mutate(Time = as.factor(format(InvoiceDate,"%H:%M:%S")))
View the Distributions
treemap(retail,
index = c("Country"),
vSize = "Quantity",
algorithm = "pivotSize",
title = "The Country with the Most Purchase",
palette = "Set3",
border.col = "grey20")
ratingMat <- retail %>%
select(InvoiceNo, Description) %>%
mutate(value = 1) %>%
spread(Description, value, fill = 0) %>%
select(-InvoiceNo) %>%
as.matrix() %>%
as("binaryRatingMatrix")
ratingMat
18536 x 3866 rating matrix of class ‘binaryRatingMatrix’ with 387695 ratings.
Set up the Evaluation Scheme and List of Algorithms
Scheme <- ratingMat %>%
evaluationScheme(method = "cross",
k = 5,
train = 0.8,
given = -1)
Scheme
Evaluation scheme using all-but-1 items
Method: ‘cross-validation’ with 5 run(s).
Good ratings: NA
Data set: 18536 x 3866 rating matrix of class ‘binaryRatingMatrix’ with 387695 ratings.
ModelAlgorithms <- list(
"Association" = list(name = "AR",
param = list(supp = 0.01, conf = 0.8)),
"UserBased" = list(name="UBCF", param=list(method="Cosine",
nn=500)),
"ItemBased" = list(name="IBCF", param=list(k = 5))
)
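Optionally, baseline recommenders could be evaluated alongside these for context. A sketch (BaselineAlgorithms is a hypothetical name; these baselines are not part of the evaluation reported below):
BaselineAlgorithms <- list(
  "Popular" = list(name = "POPULAR", param = NULL),   # recommend the most purchased items
  "Random"  = list(name = "RANDOM",  param = NULL)    # random recommendations as a floor
)
# e.g. evaluate(Scheme, c(ModelAlgorithms, BaselineAlgorithms), n = c(1, 5, 10, 15, 20, 25))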
Using n=c(1, 5, 10, 15, 20, 25) to evaluate product recommendations
ModelResults <- evaluate(Scheme, ModelAlgorithms, n=c(1, 5, 10, 15, 20, 25))
AR run fold/sample [model time/prediction time]
1 [0.26sec/43.32sec]
2 [0.21sec/40.63sec]
3 [0.22sec/40.91sec]
4 [0.18sec/40.78sec]
5 [0.2sec/42.16sec]
UBCF run fold/sample [model time/prediction time]
1 [0sec/393.43sec]
2 [0sec/382.14sec]
3 [0.02sec/376.9sec]
4 [0sec/406.61sec]
5 [0sec/402.96sec]
IBCF run fold/sample [model time/prediction time]
1 [240.3sec/1.99sec]
2 [239.71sec/2.29sec]
3 [247.3sec/2.64sec]
4 [242.91sec/2.69sec]
5 [242.28sec/2.44sec]
The timings show that IBCF spends by far the longest on model building (around 240 seconds per fold) but predicts almost instantly, while UBCF builds its model instantly but has the longest prediction time (roughly 380-400 seconds per fold).
plot(ModelResults, annotate = TRUE, legend = "topleft")
title("ROC Curve")
plot(ModelResults, "prec/rec", annotate = TRUE) # precision and recall
title("Precision and Recall")
Classification models performance can be compared using the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR).
The item-based collaborative filtering model is the winner as it achieves the highest TPR for any given level of FPR.
This means that the model produces the highest number of relevant recommendations, known as true positives (TP), for the same level of non-relevant recommendations, known as false positives (FP).
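The TPR/FPR values behind these curves can also be read directly from the evaluation results; a sketch, using the list names defined in ModelAlgorithms above:
# averaged confusion matrix (TP, FP, FN, TN, precision, recall, TPR, FPR) across the 5 folds
avg(ModelResults[["ItemBased"]])
# per-fold counts, if a closer look is needed
getConfusionMatrix(ModelResults[["ItemBased"]])[[1]]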
Prepare training dataset
train <- getData(Scheme, "train")
Prepare testing set
test <- getData(Scheme, "known")
Prepare evaluation set
evaluation <- getData(Scheme, "unknown")
Itemmodel <- Recommender(train, method = "IBCF",
param = list(k = 5))
Itempred <- predict(Itemmodel, newdata = test)
Itemaccuracy <- calcPredictionAccuracy(Itempred, evaluation, given=10)
kable(Itemaccuracy,caption = "Performance Metrics")
| Metric | Value |
|---|---|
| TP | 0.2880259 |
| FP | 8.6933657 |
| FN | 0.7119741 |
| TN | 3846.3066343 |
| precision | 0.0343062 |
| recall | 0.2880259 |
| TPR | 0.2880259 |
| FPR | 0.0022551 |
recommendedItems <- as(Itempred, 'list')
recommendedItems <- do.call("rbind", lapply(recommendedItems, as.data.frame))
# write.csv() fixes sep = "," and col.names itself, so only row.names needs to be set
write.csv(recommendedItems,
          file = "ItemBased Recommendation.csv",
          row.names = TRUE)
Now we can see a list of the recommended items as provided by Item-Based collaborative filter. In the second part of this project, I will experiment with Association Rules using Apriori to compare the recommended items based on support, confidence and lift.
Create and view the transaction data, since the Apriori algorithm only works with a transactions object.
retailtransaction <- ddply(retail,c("InvoiceNo","Date"),
function(df1)paste(df1$Description,
collapse = ","))
retailtransaction
Set InvoiceNo and Date to NULL, since they are not required, and create the items basket.
retailtransaction$InvoiceNo <- NULL
retailtransaction$Date <- NULL
colnames(retailtransaction) <- c("items")
retailtransaction
Save the transaction data as a CSV file and read it back in to create the transactions object.
write.csv(retailtransaction,"C:/Users/Emahayz_Pro/Desktop/Data_Science/Data 612/Week7-Final Project Submission/retailtransactions.csv", quote = FALSE, row.names = FALSE)
retailtrans <- read.transactions("C:/Users/Emahayz_Pro/Desktop/Data_Science/Data 612/Week7-Final Project Submission/retailtransactions.csv", format = 'basket', rm.duplicates = FALSE, sep=',')
retailtrans
transactions in sparse format with
18537 transactions (rows) and
7725 items (columns)
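As an aside, an equivalent transactions object can be built entirely in memory with arules, skipping the CSV round trip. A sketch (the counts will not match the output above exactly, since writing and re-reading the CSV splits descriptions containing commas into separate tokens and also picks up the header line):
# build baskets directly from the cleaned data frame; as.character() avoids factor handling
baskets      <- split(as.character(retail$Description), retail$InvoiceNo)
retailtrans2 <- as(baskets, "transactions")
retailtrans2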
summary (retailtrans)
transactions as itemMatrix in sparse format with
18537 rows (elements/itemsets/transactions) and
7725 columns (items) and a density of 0.002292625
most frequent items:
WHITE HANGING HEART T-LIGHT HOLDER REGENCY CAKESTAND 3 TIER
1759 1533
JUMBO BAG RED RETROSPOT PARTY BUNTING
1417 1266
ASSORTED COLOUR BIRD ORNAMENT (Other)
1244 321081
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1574 865 757 758 759 708 653 644 637 584 611 526 510 518 545 515 460 436 483 415
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
400 306 304 275 235 251 233 208 219 209 165 157 133 142 133 109 112 86 109 95
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
94 83 91 67 62 70 64 57 58 51 64 41 42 48 44 37 31 39 31 28
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
25 19 25 25 22 27 25 24 14 19 17 12 12 15 10 16 13 7 7 15
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
15 12 7 9 10 11 12 8 6 5 6 11 6 3 4 3 6 4 1 4
101 102 103 104 105 106 107 108 109 110 111 113 114 116 117 118 120 122 123 125
3 4 4 3 2 2 5 4 4 2 5 3 3 3 3 4 2 2 1 2
126 127 131 132 133 134 136 140 141 142 143 145 146 147 150 151 154 157 168 171
2 2 1 1 2 1 1 1 2 2 1 2 1 1 1 1 2 2 2 2
177 178 180 202 204 228 236 249 250 285 320 400 419
1 1 1 1 1 1 1 1 1 1 1 1 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 5.00 13.00 17.71 23.00 419.00
includes extended item information - examples:
There are 18,537 transactions (rows) and 7,725 items (columns): each transaction is a collection of items, and 7,725 is the number of distinct item strings appearing in the baskets. (This is larger than the 3,866 distinct descriptions in the rating matrix, most likely because descriptions containing commas are split into separate tokens when the CSV is read back with sep = ','.)
Density is the proportion of non-zero cells in the sparse matrix: the total number of items purchased across all transactions divided by the total number of cells (transactions × items).
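A quick sanity check of that definition against the summary above (a sketch):
# density = total item occurrences / (number of transactions * number of items)
sum(size(retailtrans)) / prod(dim(retailtrans))   # should reproduce ~0.00229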
I can use itemFrequencyPlot() to create an item frequency bar plot and view the distribution of items in the itemMatrix by absolute frequency. Here, I view the top 20 items.
itemFrequencyPlot(retailtrans,topN=20,type="absolute",col=brewer.pal(8,'Pastel2'), main="Absolute Item Frequency Plot")
The next step is to mine the rules using the Apriori algorithm via the apriori() function from the arules package.
I use a minimum support of 0.001 and a minimum confidence of 0.8.
Control The Number Of Rules
I can adjust the maxlen, supp and conf arguments in the apriori function to control the number of rules generated.
• To get "strong" rules, I can increase the value of the conf parameter; here I use 0.8, which is high at 80%.
• To get "longer" rules, I can increase maxlen; here I use the minimum length (2) that can generate a rule, as a first attempt for this transaction data. A small sketch of how the confidence threshold affects the rule count follows below.
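A sketch of that trade-off, counting rules at a few confidence thresholds while keeping support and maxlen fixed (the thresholds are illustrative; the resulting counts are not reported here):
sapply(c(0.6, 0.7, 0.8, 0.9), function(conf_level) {
  length(apriori(retailtrans,
                 parameter = list(supp = 0.001, conf = conf_level, maxlen = 2),
                 control   = list(verbose = FALSE)))
})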
retailrules <- apriori(retailtrans, parameter = list(supp = 0.001, conf = 0.8, maxlen = 2))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 18
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[7725 item(s), 18537 transaction(s)] done [0.22s].
sorting and recoding items ... [2442 item(s)] done [0.01s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2
Mining stopped (maxlen reached). Only patterns up to a length of 2 returned!
done [0.11s].
writing ... [113 rule(s)] done [0.01s].
creating S4 object ... done [0.01s].
summary(retailrules)
set of 113 rules
rule length distribution (lhs + rhs):sizes
2
113
Min. 1st Qu. Median Mean 3rd Qu. Max.
2 2 2 2 2 2
summary of quality measures:
support confidence coverage lift count
Min. :0.001079 Min. :0.8000 Min. :0.001187 Min. : 24.33 Min. : 20.00
1st Qu.:0.002643 1st Qu.:0.8448 1st Qu.:0.002751 1st Qu.: 97.73 1st Qu.: 49.00
Median :0.004100 Median :0.9167 Median :0.004262 Median :105.16 Median : 76.00
Mean :0.005169 Mean :0.9187 Mean :0.005682 Mean :193.34 Mean : 95.81
3rd Qu.:0.007984 3rd Qu.:1.0000 3rd Qu.:0.008793 3rd Qu.:337.04 3rd Qu.:148.00
Max. :0.022280 Max. :1.0000 Max. :0.027243 Max. :597.97 Max. :413.00
mining info:
Apriori created 113 rules with maxlen = 2. I chose maxlen = 2 because anything greater generates rules in the thousands, which I wanted to avoid.
The maximum support is 0.022280, the average confidence is 0.9187, and the maximum confidence is 1. The average lift is 193.34 and the maximum lift is 597.97.
inspect(retailrules[1:10])
lhs rhs support confidence coverage lift count
[1] {DOG LICENCE WALL ART} => {BICYCLE SAFTEY WALL ART} 0.001078923 0.9090909 0.001186816 510.66116 20
[2] {WOBBLY CHICKEN} => {METAL} 0.001456546 1.0000000 0.001456546 378.30612 27
[3] {WOBBLY CHICKEN} => {DECORATION} 0.001456546 1.0000000 0.001456546 378.30612 27
[4] {MIXED NUTS LIGHT GREEN BOWL} => {SMALL DOLLY MIX DESIGN ORANGE BOWL} 0.001078923 0.9090909 0.001186816 60.18506 20
[5] {MIRRORED WALL ART LADIES} => {MIRRORED WALL ART GENTS} 0.001132869 0.8400000 0.001348654 556.11000 21
[6] {DECOUPAGE} => {GREETING CARD} 0.001240762 1.0000000 0.001240762 331.01786 23
[7] {BILLBOARD FONTS DESIGN} => {WRAP} 0.001564439 1.0000000 0.001564439 597.96774 29
[8] {WRAP} => {BILLBOARD FONTS DESIGN} 0.001564439 0.9354839 0.001672331 597.96774 29
[9] {SET/4 BLUE FLOWER CANDLES IN BOWL} => {S/4 PINK FLOWER CANDLES IN BOWL} 0.001078923 0.8000000 0.001348654 138.59439 20
[10] {ENAMEL PINK TEA CONTAINER} => {ENAMEL PINK COFFEE CONTAINER} 0.001672331 0.8157895 0.002049954 321.75084 31
I will also try to remove redundant rules
retailrules_Lift <- sort(retailrules, by = "lift", decreasing = TRUE)
# keep the non-redundant rules in a separate object (the full sorted set is still used below)
retailrules_Lift_NonRedundant <- retailrules_Lift[!is.redundant(retailrules_Lift)]
# show the support, lift and confidence for the top rules
inspect(head(retailrules_Lift, 20))
write(retailrules_Lift,
file = "Apriori Recommended Items.csv",
sep = ",",
quote = TRUE,
row.names = FALSE
)
The rules with a confidence of 1, as seen above, imply that whenever the LHS item was purchased, the RHS item was also purchased 100% of the time.
A rule with a lift of 597, as seen above, implies that the LHS and RHS items are purchased together 597 times more often than would be expected if they were purchased independently.
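As a quick check of that interpretation, lift can be recovered from the other quality measures of any single rule; a sketch using the top rule by lift:
q <- quality(retailrules_Lift)[1, ]
# lift = confidence / support(RHS), so confidence / lift recovers the support of the RHS item
q$confidence / q$lift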
Plotting all 113 rules would be too cluttered, so I subset the rules with a confidence greater than 90%.
retailsubrules<-retailrules_Lift[quality(retailrules_Lift)$confidence>0.9]
plot(retailsubrules)
The chart shows the scatter of the reduced number of rules (66).
Let’s select 20 rules having the highest confidence.
Top 20 Rules
topretailsubrules <- head(retailsubrules, n = 20, by = "confidence")
plot(topretailsubrules, method = "graph", layout=igraph::in_circle())
Using the interactive plot, I can see the recommended items from the LHS and the matching items on the RHS with the respective support, lift and confidence.
plotly_arules(topretailsubrules)
I can also interactively select individual rules to view the recommended items.
plot(topretailsubrules, method = "graph", engine = "htmlwidget")
The purpose of this project was to build a recommender system and to practice working with apriori association rules to recommend items for possible purchases.
For the given data set, the Item-based Model within recommenderlab provided the best results. It is worth remembering that the choice of the recommender algorithm depends on the context. The accuracy of the Item-based model is acceptable compared to the other algorithms considered.
In addition to the methods provided within the recommenderlab library, there are other supervised and unsupervised machine learning algorithms, as well as neural network-driven deep learning approaches, that could be used to build recommendation systems.
However, I used the unsupervised Apriori algorithm for association rules and was able to obtain the recommended items. For each rule, the items on the LHS have matching items on the RHS, with the confidence indicating how often these items were purchased together. The recommended items for the top 20 rules, with the corresponding support, lift and confidence, are printed and saved as a CSV file submitted with this project.
Finally, I created an interactive graph to see the recommended items from the LHS and the matching items on the RHS with the respective support, lift and confidence. The second interactive graph lets me select any rule and view the recommended items.
Breese JS, Heckerman D, Kadie C (1998). “Empirical Analysis of Predictive Algorithms for Collaborative Filtering.” In Uncertainty in Artificial Intelligence. Proceedings of the Fourteenth Conference, pp. 43-52.
Jabeen, H. (2018). Market Basket Analysis using R. Retrieved from https://www.datacamp.com/community/tutorials/market-basket-analysis-r
Kohavi, Ron (1995). “A study of cross-validation and bootstrap for accuracy estimation and model selection”. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1137-1143.
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems. Computer, 42(8), 30–37. https://doi.org/10.1109/MC.2009.263
Michael Hahsler (2016). recommenderlab: A Framework for Developing and Testing Recommendation Algorithms, R package. https://CRAN.R-project.org/package=recommenderlab