Introduction

This project compares the accuracy of several recommender models and introduces diversity into a recommender system. It is built on the recommenderlab R package.

# Required libraries
library(recommenderlab)  # Matrix/recommender functions
library(dplyr)           # Data manipulation
library(tidyr)           # Data manipulation
library(ggplot2)         # Plotting
library(tictoc)          # Operation timing

Data Set

The data is a set of product ratings for beauty products sold on Amazon.com, downloaded from Kaggle.com (https://www.kaggle.com/skillsmuggler/amazon-ratings). The original set contains 2,023,070 observations and 4 variables - User ID, Product ID, Rating (from 1 to 5), and Time Stamp - covering 1,210,271 users and 249,274 products. To make the set more manageable, it was reduced to a smaller subset: the final set used in this project includes products with over 30 reviews in the original set and users that rated over 10 products. The final ratings matrix contains 3,562 users and 9,647 products, with 68,565 ratings. The R code used to examine and reduce the original set is available on GitHub (https://github.com/ilyakats/CUNY-DATA643/tree/master/Project%204).
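
The reduction code itself lives in that repository; as a minimal sketch of the filtering logic (the local file name below is only a placeholder for the full Kaggle download), it amounts to something like:

# Sketch of the reduction step; "amazon_ratings_full.csv" is a placeholder
# for the full 2M-row Kaggle file
full <- read.csv("amazon_ratings_full.csv")
reduced <- full %>%
  group_by(ProductId) %>% filter(n() > 30) %>% ungroup() %>%  # products with over 30 reviews
  group_by(UserId) %>% filter(n() > 10) %>% ungroup()         # users who rated over 10 products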

# Data import
ratings <- read.csv(paste0("https://raw.githubusercontent.com/ilyakats/CUNY-DATA643/",
                           "master/Project%204/ratings_Short.csv"))

# Convert IDs to factors so the integer codes and level names line up
# (read.csv no longer creates factors by default in R >= 4.0)
ratings$UserId <- as.factor(ratings$UserId)
ratings$ProductId <- as.factor(ratings$ProductId)

ratingsMatrix <- sparseMatrix(i = as.integer(ratings$UserId),
                              j = as.integer(ratings$ProductId),
                              x = ratings$Rating)
rownames(ratingsMatrix) <- levels(ratings$UserId)
colnames(ratingsMatrix) <- levels(ratings$ProductId)
amazon <- as(ratingsMatrix, "realRatingMatrix")

Each user has rated at least 11 items and at most 187 (median 15, average 19.25). Each item has been rated by at least 1 user and by at most 177 users (median 3, average 7.1).
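
These figures can be reproduced directly from the ratings matrix with recommenderlab's rowCounts and colCounts:

# Ratings per user (min 11, median 15, mean 19.25, max 187)
summary(rowCounts(amazon))
# Ratings per item (min 1, median 3, mean 7.1, max 177)
summary(colCounts(amazon))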

In order to test the models, we need to split our data into training and testing sets (using a common 80-20 split).

# Train/test split
set.seed(88)
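# 80/20 split; each test user contributes 5 "given" ratings to the recommender,
# and ratings of 3 or higher count as "good" for the top-N evaluation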
eval <- evaluationScheme(amazon, method = "split", train = 0.8, given = 5, goodRating = 3)
train <- getData(eval, "train")
known <- getData(eval, "known")
unknown <- getData(eval, "unknown")

# Set up data frame for timing
timing <- data.frame(Model=factor(), Training=double(), Predicting=double())

Recommender Models

We are going to build and compare several models: user-based collaborative filtering, a random recommender as a baseline, and an SVD-based model.

Building Models

The code below builds the three models - a User-Based Collaborative Filtering (UBCF) model, a Random model, and an SVD model. It also records the accuracy metrics and timing figures we will use to compare the models.

# ---------------- USER BASED COLLABORATIVE FILTERING ----------------
model_method <- "UBCF"

# Training
tic()
modelUBCF <- Recommender(train, method = model_method)
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)

# Predicting
tic()
predUBCF <- predict(modelUBCF, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)

timing <- rbind(timing, data.frame(Model = as.factor(model_method), 
                                   Training = as.double(train_time), 
                                   Predicting = as.double(predict_time))) 

# Accuracy
accUBCF <- calcPredictionAccuracy(predUBCF, unknown)
#resultsUBCF <- evaluate(x = eval, method = model_method, n = c(1, 5, 10, 30, 60))

# ---------------- RANDOM ----------------
model_method <- "RANDOM"

# Training
tic()
modelRandom <- Recommender(train, method = model_method)
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)

# Predicting
tic()
predRandom <- predict(modelRandom, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)

timing <- rbind(timing, data.frame(Model = as.factor(model_method), 
                                   Training = as.double(train_time), 
                                   Predicting = as.double(predict_time))) 

# Accuracy
accRandom <- calcPredictionAccuracy(predRandom, unknown)
#resultsRandom <- evaluate(x = eval, method = model_method, n = c(1, 5, 10, 30, 60))

# ---------------- SVD ----------------
model_method <- "SVD"

# Training
tic()
modelSVD <- Recommender(train, method = model_method, parameter = list(k = 50))
t <- toc(quiet = TRUE)
train_time <- round(t$toc - t$tic, 2)

# Predicting
tic()
predSVD <- predict(modelSVD, newdata = known, type = "ratings")
t <- toc(quiet = TRUE)
predict_time <- round(t$toc - t$tic, 2)

timing <- rbind(timing, data.frame(Model = as.factor(model_method), 
                                   Training = as.double(train_time), 
                                   Predicting = as.double(predict_time))) 

# Accuracy
accSVD <- calcPredictionAccuracy(predSVD, unknown)
#resultsSVD <- evaluate(x = eval, method = model_method, n = c(1, 5, 10, 30, 60))

Comparing Models

accuracy <- rbind(accUBCF, accRandom, accSVD)
rownames(accuracy) <- c("UBCF", "Random", "SVD")
knitr::kable(accuracy, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
          RMSE       MSE        MAE
UBCF      1.129307   1.275334   0.838039
Random    1.423150   2.025355   1.025171
SVD       1.123959   1.263285   0.827643

Reviewing the accuracy numbers for the 3 models, we see that the UBCF and SVD models are very close together, with SVD marginally ahead on all three error measures. The Random model is noticeably worse. It is not surprising that random recommendations are not as accurate as recommendations based on prior ratings.

Next we can review the ROC curve and the Precision-Recall plot for all 3 models. On these top-N measures UBCF performs better than SVD and considerably better than the Random model.

models <- list(
  "UBCF" = list(name = "UBCF", param = NULL),
  "Random" = list(name = "RANDOM", param = NULL),
  "SVD" = list(name = "SVD", param = list(k = 50))
  )
evalResults <- evaluate(x = eval, method = models, n = c(1, 5, 10, 30, 60))
# ROC Curve
plot(evalResults, 
     annotate = TRUE, legend = "topleft", main = "ROC Curve")

# Precision-Recall Plot
plot(evalResults, "prec/rec", 
     annotate = TRUE, legend = "topright", main = "Precision-Recall")

Finally, it is important to consider training and prediction time (in seconds, below). The UBCF model is created almost instantly, since UBCF defers most of its work to prediction time, but predicting ratings takes considerable time. The Random model is efficient all around. The SVD model takes longer to build than to predict, but altogether it is much quicker than the UBCF model. This may be a factor in some projects.

rownames(timing) <- timing$Model
knitr::kable(timing[, 2:3], format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
          Training   Predicting
UBCF          0.00       208.62
RANDOM        0.01         3.26
SVD          12.45         4.13

Implementing Business/User Experience Goal

It may not always be desirable to recommend the products a user is most likely to rate highly. Recommending somewhat unexpected products may improve the user experience, expand user preferences, and provide additional knowledge about the user. To account for that, we build a hybrid model that combines the UBCF model and the Random model. To ensure that most recommendations are still likely to be rated highly, we give the Random model only a very minor influence (weights of 0.99 for UBCF and 0.01 for Random).

modelHybrid <- HybridRecommender(
    modelUBCF, 
    modelRandom, 
    weights = c(0.99, 0.01))
predHybrid <- predict(modelHybrid, newdata = known, type = "ratings")
( accHybrid <- calcPredictionAccuracy(predHybrid, unknown) )
##      RMSE       MSE       MAE 
## 1.3008284 1.6921544 0.9147718

The accuracy has gone down. It is not as bad as the purely random model, but it is clearly not as good as the UBCF or SVD models. However, the goal here is to influence the user experience rather than to build the most accurate model, so we need to employ different metrics.

Let us look at the top 10 recommendations for the first user in the test set.

pUBCF <- predict(modelUBCF, newdata = known[1], type = "topNList")
pHybrid <- predict(modelHybrid, newdata = known[1], type = "topNList")
pUBCF@items
## $A100ZQDV7L8PVV
##  [1] 8871 4298 8747 9564 8868 2076 1181 9547 7390 9560
pHybrid@items
## $A100ZQDV7L8PVV
##  [1] 4298 8871 8747 8868 2076 9547 9560 1181 8643 9564

The Hybrid model includes most of the items recommended by the UBCF model, but there are new items and the order is different.
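
A simple way to quantify the shift is the overlap between the two top-10 lists:

# Fraction of the UBCF top 10 that also appears in the hybrid top 10
length(intersect(pUBCF@items[[1]], pHybrid@items[[1]])) / 10

Here 9 of the 10 items are shared; one item introduced by the random component replaces a UBCF recommendation.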

Conclusion

This project compared several recommender models. A similar process can be employed to compare additional models or to tune model parameters in search of the optimal model. Additionally, a random element was introduced in order to diversify recommendations. When recommending a random item, it is difficult to predict how a user will react to it and whether the recommendation has a positive impact on business goals or user experience. To evaluate the impact, you would need access to the online environment.

One approach to measuring the success of diversification is A/B testing. Users are randomly divided into two groups, and each group is offered a slightly different experience. For instance, one group may get recommendations only from the UBCF model while the other group gets recommendations from the hybrid model. The user experience is then measured in some way. The least intrusive way is to monitor user interaction: in this example of Amazon products, a click on a recommendation suggested by the random element of the model indicates that the hybrid model provides valuable recommendations. Of course, it is possible to track other metrics - products bought, time spent on a product page, amount spent, etc. The basic idea is to see whether the hybrid model provides a meaningful improvement over the basic model.
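
As a rough sketch, assuming we logged clicks on recommendations for the two groups (the counts below are made up purely for illustration), a two-proportion test in base R could check whether the hybrid group's click-through rate differs meaningfully:

# Hypothetical A/B test results: clicks and impressions per group
clicks      <- c(UBCF = 420, Hybrid = 465)
impressions <- c(UBCF = 10000, Hybrid = 10000)

# Two-sample test for equality of click-through proportions
prop.test(clicks, impressions)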

It is important to have objective measures when building and optimizing data science models. The evaluation of a model that returns highly relevant but redundant recommendations should reflect that the model may score poorly on user experience. One approach to measuring diversity is described in Novelty and Diversity in Information Retrieval Evaluation (Clarke et al., 2008; https://plg.uwaterloo.ca/~gvcormac/novelty.pdf). This or a similar measurement should be incorporated into projects of this type.
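
Such a measure could be approximated, for example, as the average pairwise dissimilarity of the items within a recommendation list. A minimal sketch, assuming recommenderlab's similarity() helper and reusing the hybrid top-10 list for the first test user (items with little rating overlap may yield NA similarities, which are dropped):

# Intra-list diversity of the hybrid top-10 list: average pairwise
# (1 - cosine similarity) between the recommended items, with similarities
# computed from the item columns of the full ratings matrix
recItems <- pHybrid@items[[1]]
itemSim  <- as.matrix(similarity(amazon[, recItems], method = "cosine", which = "items"))
mean(1 - itemSim[lower.tri(itemSim)], na.rm = TRUE)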