The goal of this assignment is to give you practice working with matrix factorization techniques.
Your task is to implement a matrix factorization method—such as singular value decomposition (SVD) or Alternating Least Squares (ALS)—in the context of a recommender system.
You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember as always to cite your sources, so that you can be graded on what you added, not what you found.
SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items, and use SVD to reduce them to a much smaller set of k latent features (e.g. 20 or 70).
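To make this concrete, here is a small self-contained sketch (my own illustration, separate from the recommenderlab workflow below) of a rank-k approximation using base R's svd(); the toy matrix and the choice of k = 2 are arbitrary.
# Toy example: factor a small dense ratings matrix and keep k latent factors
set.seed(42)
R <- matrix(sample(1:5, 20, replace = TRUE), nrow = 4)   # 4 users x 5 items
k <- 2                                                   # number of latent factors to keep
s <- svd(R)
R_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])   # rank-k approximation of R
dim(s$u[, 1:k])                                          # each user is now described by k features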
# Loading required libraries
library(recommenderlab)
library(reshape2)
library(RCurl)
library(dplyr)
library(ggplot2)
library(knitr)
library(tidyverse)
I have chosen the MovieLense dataset provided by the recommenderlab package.
data_package <- data(package = "recommenderlab")
data_package$results[, "Item"]
## [1] "Jester5k" "JesterJokes (Jester5k)"
## [3] "MSWeb" "MovieLense"
## [5] "MovieLenseMeta (MovieLense)"
The dataset is a 943 x 1664 movie rating matrix containing 99,392 ratings. Each row corresponds to a user and each column to a movie. It is a very sparse matrix, with more than 1.5 million possible user/movie combinations.
data(MovieLense)
MovieLense
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
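As a quick check on how sparse this matrix really is, a one-line sketch (my addition) that computes the fill rate, i.e. the share of user/movie cells that actually contain a rating:
# Share of filled cells: roughly 99,392 ratings over 943 * 1664 possible pairs (about 6%)
sum(rowCounts(MovieLense)) / (nrow(MovieLense) * ncol(MovieLense))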
Finding the similarity between the first 4 users with the similarity function. Here we are calculating similarity using the cosine measure.
similarity_users <- similarity(MovieLense[1:4, ], method ="cosine", which = "users")
as.matrix(similarity_users)
## 1 2 3 4
## 1 0.0000000 0.9605820 0.8339504 0.9192637
## 2 0.9605820 0.0000000 0.9268716 0.9370341
## 3 0.8339504 0.9268716 0.0000000 0.9130323
## 4 0.9192637 0.9370341 0.9130323 0.0000000
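To show where these numbers come from, here is a hand-rolled sketch of the cosine similarity between the first two users; missing ratings are treated as 0, which may not match recommenderlab's internal handling exactly, so it is meant only to illustrate the formula.
# Cosine similarity = dot product / (product of vector norms)
u1 <- as(MovieLense[1, ], "matrix")[1, ]
u2 <- as(MovieLense[2, ], "matrix")[1, ]
u1[is.na(u1)] <- 0   # treat unrated movies as 0 (an assumption for illustration)
u2[is.na(u2)] <- 0
sum(u1 * u2) / (sqrt(sum(u1^2)) * sqrt(sum(u2^2)))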
The recommender_models object contains some information about the available models. First, let’s see which models we have:
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
kable(names(recommender_models))
## Warning in kable_markdown(x = structure(c("ALS_realRatingMatrix",
## "ALS_implicit_realRatingMatrix", : The table should have a header (column
## names)
ALS_realRatingMatrix |
ALS_implicit_realRatingMatrix |
IBCF_realRatingMatrix |
POPULAR_realRatingMatrix |
RANDOM_realRatingMatrix |
RERECOMMEND_realRatingMatrix |
SVD_realRatingMatrix |
SVDF_realRatingMatrix |
UBCF_realRatingMatrix |
Let’s take a look at their descriptions:
lapply(recommender_models, "[[", "description")
## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
##
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
##
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
##
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
##
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
##
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
##
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
##
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend."
##
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."
We create a subset of the MovieLense dataset containing only users who have rated more than 50 movies and movies that have been rated more than 100 times.
ratings_movies <- MovieLense[rowCounts(MovieLense) > 50,colCounts(MovieLense) > 100]
ratings_movies
## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.
Defining the train and test datasets: TRUE marks users assigned to the training set and FALSE marks users assigned to the test set.
which_train <- sample(x = c(TRUE, FALSE), size = nrow(ratings_movies),replace = TRUE, prob = c(0.8, 0.2))
head(which_train)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
recc_data_train <- ratings_movies[which_train, ]
recc_data_test <- ratings_movies[!which_train, ]
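A note on reproducibility: because the split is random, the exact counts and results below will vary between runs; a seed can be set before sampling (the value 1 below is arbitrary), for example:
# Optional: fix the random split so the report is reproducible
set.seed(1)
which_train <- sample(x = c(TRUE, FALSE), size = nrow(ratings_movies), replace = TRUE, prob = c(0.8, 0.2))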
Checking the SVD and ALS recommendation models and their default parameters.
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$ALS_realRatingMatrix$parameters
## $normalize
## NULL
##
## $lambda
## [1] 0.1
##
## $n_factors
## [1] 10
##
## $n_iterations
## [1] 10
##
## $min_item_nr
## [1] 1
##
## $seed
## NULL
recommender_models$SVD_realRatingMatrix$parameters
## $k
## [1] 10
##
## $maxiter
## [1] 100
##
## $normalize
## [1] "center"
Building the SVD recommendation model, using k = 30 latent factors.
SVD.recc_model <- Recommender(data = recc_data_train, method = "SVD",parameter = list(k = 30))
SVD.recc_model
## Recommender of type 'SVD' for 'realRatingMatrix'
## learned using 449 users.
Predicting the top 6 recommendations for each test user:
SVD.recc_predicted <- predict(object = SVD.recc_model, newdata = recc_data_test, n = 6)
SVD.recc_predicted
## Recommendations as 'topNList' with n = 6 for 111 users.
Recommendations (as item indices) for the first test user:
SVD.recc_predicted@items[[1]]
## [1] 132 149 107 197 151 137
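These numbers are column positions in the rating matrix; a short sketch (my addition) of how they can be mapped back to movie titles:
# Titles of the top-6 recommendations for the first test user
as(SVD.recc_predicted, "list")[[1]]
# Equivalent lookup via the column names of the rating matrix
colnames(ratings_movies)[SVD.recc_predicted@items[[1]]]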
Building the ALS recommendation model, using the default parameters.
ALS.recc_model <- Recommender(data = recc_data_train, method = "ALS")
ALS.recc_model
## Recommender of type 'ALS' for 'realRatingMatrix'
## learned using 449 users.
Predicting the top 6 recommendations for each test user:
ALS.recc_predicted <- predict(object = ALS.recc_model, newdata = recc_data_test, n = 6)
ALS.recc_predicted
## Recommendations as 'topNList' with n = 6 for 111 users.
Recommendations (as item indices) for the first test user:
ALS.recc_predicted@items[[1]]
## [1] 23 92 197 260 173 95
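As a quick sanity check (my addition), we can see how much the two models agree for this user by intersecting the two top-6 index vectors:
# Items recommended to the first test user by both SVD and ALS
intersect(SVD.recc_predicted@items[[1]], ALS.recc_predicted@items[[1]])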
To compare the two methods more systematically, we set up an evaluation scheme with four random 80/20 train/test splits; for each test user, 4 ratings are given to the recommender and the rest are withheld, and a rating of 3 or higher counts as a good rating.
eval_sets <- evaluationScheme(ratings_movies, method = "split", train = .8, k = 4, given = 4, goodRating = 3)
algorithms <- list("SVD" = list(name="SVD"), "ALS" = list(name="ALS"))
results <- evaluate(eval_sets, algorithms, n = seq(10, 100, 10))
## SVD run fold/sample [model time/prediction time]
## 1 [0.6sec/0.08sec]
## 2 [0.03sec/0.08sec]
## 3 [0.03sec/0.26sec]
## 4 [0.03sec/0.21sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0.08sec/11.45sec]
## 2 [0sec/11.7sec]
## 3 [0sec/11.88sec]
## 4 [0sec/11.55sec]
class(results)
## [1] "evaluationResultList"
## attr(,"package")
## [1] "recommenderlab"
True Positive Rate (TPR): This is the percentage of relevant items (those rated at or above the good-rating threshold) that have been recommended. It is the number of TP divided by the number of relevant items (TP + FN).
False Positive Rate (FPR): This is the percentage of non-relevant items that have been recommended. It is the number of FP divided by the number of non-relevant items (FP + TN).
kable(head(getConfusionMatrix(results$SVD)[1]))
kable(head(getConfusionMatrix(results$ALS)[1]))
recommenderlab::plot(results, annotate = 1:4, legend="topleft")
columns_to_sum_SVD <- c("TP", "FP", "FN", "TN")
indices_summed_SVD <- Reduce("+", getConfusionMatrix(results$SVD))[, columns_to_sum_SVD]
kable(head(indices_summed_SVD))
| n | TP | FP | FN | TN |
|---|---|---|---|---|
| 10 | 18.08929 | 21.91071 | 312.3482 | 959.6518 |
| 20 | 32.48214 | 47.51786 | 297.9554 | 934.0446 |
| 30 | 45.57143 | 74.42857 | 284.8661 | 907.1339 |
| 40 | 57.31250 | 102.68750 | 273.1250 | 878.8750 |
| 50 | 68.33929 | 131.66071 | 262.0982 | 849.9018 |
| 60 | 79.48214 | 160.51786 | 250.9554 | 821.0446 |
columns_to_sum_ALS <- c("TP", "FP", "FN", "TN")
indices_summed_ALS <- Reduce("+", getConfusionMatrix(results$ALS))[, columns_to_sum_ALS]
kable(head(indices_summed_ALS))
| n | TP | FP | FN | TN |
|---|---|---|---|---|
| 10 | 16.83929 | 23.16071 | 313.5982 | 958.4018 |
| 20 | 32.12500 | 47.87500 | 298.3125 | 933.6875 |
| 30 | 46.19643 | 73.80357 | 284.2411 | 907.7589 |
| 40 | 60.08036 | 99.91964 | 270.3571 | 881.6429 |
| 50 | 73.08036 | 126.91964 | 257.3571 | 854.6429 |
| 60 | 86.10714 | 153.89286 | 244.3304 | 827.6696 |
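From these summed counts we can derive the TPR and FPR directly; the small helper below is my own and mirrors the definitions given above.
# TPR = TP / (TP + FN), FPR = FP / (FP + TN); rows correspond to n = 10, 20, ...
rates <- function(m) cbind(TPR = m[, "TP"] / (m[, "TP"] + m[, "FN"]),
                           FPR = m[, "FP"] / (m[, "FP"] + m[, "TN"]))
head(rates(indices_summed_SVD))
head(rates(indices_summed_ALS))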
Two accuracy metrics are as follows:
Precision: This is the percentage of recommended items that are relevant. It is the number of TP divided by the total number of recommended items (TP + FP).
Recall: This is the percentage of relevant items that have been recommended. It is the number of TP divided by the total number of relevant items (TP + FN). It is also equal to the True Positive Rate.
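These can also be computed from the summed counts above; the helper below is my own sketch of the two formulas.
# Precision = TP / (TP + FP), Recall = TP / (TP + FN)
prec_rec <- function(m) cbind(precision = m[, "TP"] / (m[, "TP"] + m[, "FP"]),
                              recall    = m[, "TP"] / (m[, "TP"] + m[, "FN"]))
head(prec_rec(indices_summed_SVD))
head(prec_rec(indices_summed_ALS))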
recommenderlab::plot(results, "prec/rec", annotate = TRUE, main = "Precision-recall")
Finally, we measure rating prediction accuracy (RMSE, MSE, MAE), first for SVD:
eval_recommender <- Recommender(data = getData(eval_sets, "train"), method = 'SVD')
eval_prediction <- predict(object = eval_recommender, newdata =getData(eval_sets, "known"), n = 10, type = "ratings")
eval_accuracy <- calcPredictionAccuracy(x = eval_prediction, data = getData(eval_sets, "unknown"), byUser =TRUE)
kable(head(eval_accuracy))
| user | RMSE | MSE | MAE |
|---|---|---|---|
| 2 | 1.1526658 | 1.3286384 | 0.8782033 |
| 3 | 1.4749485 | 2.1754730 | 1.3482052 |
| 5 | 1.7868933 | 3.1929878 | 1.4359525 |
| 6 | 1.0900108 | 1.1881236 | 0.7546279 |
| 8 | 1.3583865 | 1.8452138 | 0.9892600 |
| 12 | 0.7778402 | 0.6050354 | 0.6092704 |
And for ALS, this time with more latent factors (n_factors = 200), a fixed seed, and verbose output so that the cost function can be tracked across iterations:
ALS_eval_recommender <- Recommender(data = getData(eval_sets, "train"), method = 'ALS', parameter = list(normalize = NULL, lambda = 0.1, n_factors = 200, n_iterations = 10, seed = 1234, verbose = TRUE))
## Used parameters:
## normalize = NULL
## lambda = 0.1
## n_factors = 200
## n_iterations = 10
## min_item_nr = 1
## seed = 1234
## verbose = TRUE
ALS_eval_prediction <- predict(object = ALS_eval_recommender, newdata =getData(eval_sets, "known"), n = 10, type = "ratings")
## [1] "0th iteration: cost function = 107881.841099455"
## [1] "1th iteration, step 1: cost function = 101674.016235416"
## [1] "1th iteration, step 2: cost function = 96810.452973072"
## [1] "2th iteration, step 1: cost function = 92514.1825860634"
## [1] "2th iteration, step 2: cost function = 89276.657175504"
## [1] "3th iteration, step 1: cost function = 86374.3048562102"
## [1] "3th iteration, step 2: cost function = 84097.7093784984"
## [1] "4th iteration, step 1: cost function = 82021.1671793639"
## [1] "4th iteration, step 2: cost function = 80146.2888907426"
## [1] "5th iteration, step 1: cost function = 77759.9060897439"
## [1] "5th iteration, step 2: cost function = 75743.7901169019"
## [1] "6th iteration, step 1: cost function = 74414.1932963314"
## [1] "6th iteration, step 2: cost function = 73092.3428067162"
## [1] "7th iteration, step 1: cost function = 71822.6860187558"
## [1] "7th iteration, step 2: cost function = 70706.9337741614"
## [1] "8th iteration, step 1: cost function = 69684.4965611434"
## [1] "8th iteration, step 2: cost function = 68807.6455159863"
## [1] "9th iteration, step 1: cost function = 68065.9730200085"
## [1] "9th iteration, step 2: cost function = 67445.9681242909"
## [1] "10th iteration, step 1: cost function = 66911.5703490813"
## [1] "10th iteration, step 2: cost function = 66463.722089732"
ALS_eval_accuracy <- calcPredictionAccuracy(x = ALS_eval_prediction, data = getData(eval_sets, "unknown"), byUser =TRUE)
kable(head(ALS_eval_accuracy))
| user | RMSE | MSE | MAE |
|---|---|---|---|
| 2 | 0.8082209 | 0.6532211 | 0.5640932 |
| 3 | 1.2972581 | 1.6828787 | 1.0810463 |
| 5 | 1.2920681 | 1.6694399 | 1.0167497 |
| 6 | 0.8130227 | 0.6610059 | 0.6522359 |
| 8 | 0.8648391 | 0.7479466 | 0.6862509 |
| 12 | 0.7892665 | 0.6229416 | 0.6092135 |
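The per-user tables are hard to compare at a glance, so here is a short sketch (my addition) that aggregates accuracy across all test users for both models by setting byUser = FALSE:
# Single aggregate RMSE / MSE / MAE per model
rbind(SVD = calcPredictionAccuracy(eval_prediction, getData(eval_sets, "unknown"), byUser = FALSE),
      ALS = calcPredictionAccuracy(ALS_eval_prediction, getData(eval_sets, "unknown"), byUser = FALSE))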
After analyzing the MovieLense dataset from the recommenderlab package, the RMSE values of the SVD and ALS methods stay close to each other. The ALS output also shows that the cost function decreases with each iteration, so a larger number of iterations can lead to a better-optimized model. The TPR-to-FPR ratio is also somewhat better for ALS than for SVD. Thus, for the given dataset, ALS appears to be the better model choice when compared to SVD.