The goal of this assignment is to give you practice working with matrix factorization techniques.

  • Your task is to implement a matrix factorization method, such as singular value decomposition (SVD) or alternating least squares (ALS), in the context of a recommender system.

  • You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember as always to cite your sources, so that you can be graded on what you added, not what you found.

  • SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items and use SVD to reduce them to a much smaller set of k latent features (e.g. 20 or 70).
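
To make the last point concrete, here is a minimal, self-contained sketch (not part of the assignment solution itself) of how a truncated SVD compresses a toy rating matrix down to k latent features. The toy matrix, its dimensions, and k = 2 are all made up for illustration.

# Toy truncated-SVD sketch: reduce a 20 x 50 rating matrix to k = 2 latent features
set.seed(42)
toy_ratings <- matrix(sample(1:5, 20 * 50, replace = TRUE), nrow = 20, ncol = 50)
k <- 2
decomp <- svd(toy_ratings)
U_k <- decomp$u[, 1:k]            # 20 x k user factors
D_k <- diag(decomp$d[1:k])        # k x k leading singular values
V_k <- decomp$v[, 1:k]            # 50 x k item factors
user_features <- U_k %*% D_k      # users projected into the k-dimensional latent space
dim(user_features)                # 20 x 2 instead of 20 x 50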

# Loading required libraries
library(recommenderlab)
library(reshape2)
library(RCurl)
library(dplyr)
library(ggplot2)
library(knitr)
library(tidyverse)

1 Data Loading

I have chosen the MovieLense dataset provided by recommenderlab.

data_package <- data(package = "recommenderlab")
data_package$results[, "Item"]
## [1] "Jester5k"                    "JesterJokes (Jester5k)"     
## [3] "MSWeb"                       "MovieLense"                 
## [5] "MovieLenseMeta (MovieLense)"

The dataset is a 943 x 1664 movie rating matrix containing 99,392 ratings. Each row corresponds to a user and each column to a movie. It is a large sparse matrix with over 1.5 million possible user-movie combinations.

data(MovieLense)
MovieLense
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
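
To quantify how sparse the matrix is, we can compare the number of observed ratings against the number of possible user-movie pairs, using the MovieLense object loaded above.

n_possible <- nrow(MovieLense) * ncol(MovieLense)  # 943 * 1664 = 1,569,152 pairs
n_rated <- length(getRatings(MovieLense))          # 99,392 observed ratings
n_rated / n_possible                               # only about 6% of the cells are filled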

Finding the similarity of the first 4 users with the similarity function, using the cosine method.

similarity_users <- similarity(MovieLense[1:4, ], method ="cosine", which = "users")
as.matrix(similarity_users)
##           1         2         3         4
## 1 0.0000000 0.9605820 0.8339504 0.9192637
## 2 0.9605820 0.0000000 0.9268716 0.9370341
## 3 0.8339504 0.9268716 0.0000000 0.9130323
## 4 0.9192637 0.9370341 0.9130323 0.0000000
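
For reference, cosine similarity between two rating vectors is their dot product divided by the product of their norms. The hand-rolled version below uses two made-up rating vectors purely to show the formula; it will not reproduce the values above, since recommenderlab also handles missing ratings internally.

cosine_sim <- function(u, v) sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
u1 <- c(5, 3, 4, 4)  # hypothetical ratings by one user on four shared movies
u2 <- c(4, 3, 5, 3)  # hypothetical ratings by another user on the same movies
cosine_sim(u1, u2)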

2 Recommendation Models

The recommender_models object contains some information about the available models. First, let's see which models we have:

recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
kable(names(recommender_models))
## Warning in kable_markdown(x = structure(c("ALS_realRatingMatrix",
## "ALS_implicit_realRatingMatrix", : The table should have a header (column
## names)
ALS_realRatingMatrix
ALS_implicit_realRatingMatrix
IBCF_realRatingMatrix
POPULAR_realRatingMatrix
RANDOM_realRatingMatrix
RERECOMMEND_realRatingMatrix
SVD_realRatingMatrix
SVDF_realRatingMatrix
UBCF_realRatingMatrix

Let’s take a look at their descriptions:

lapply(recommender_models, "[[", "description")
## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
## 
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
## 
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
## 
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
## 
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
## 
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
## 
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
## 
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend."
## 
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."

We create a subset of the MovieLense dataset containing only users who have rated more than 50 movies and movies that have been rated by more than 100 users.

ratings_movies <- MovieLense[rowCounts(MovieLense) > 50,colCounts(MovieLense) > 100] 
ratings_movies
## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.

Defining the train and test sets: TRUE marks users placed in the training set and FALSE marks users placed in the test set.

which_train <- sample(x = c(TRUE, FALSE), size = nrow(ratings_movies),replace = TRUE, prob = c(0.8, 0.2))
head(which_train)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
recc_data_train <- ratings_movies[which_train, ]
recc_data_test <- ratings_movies[!which_train, ]
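
One caveat: the sample() call above is not seeded, so the exact split (and the user counts reported later) will differ from run to run. Setting a seed before sampling would make the split reproducible; the value below is arbitrary.

set.seed(1234)  # arbitrary seed; run before the sample() call above for a reproducible split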

Checking the parameters and default values of the SVD and ALS recommendation models.

recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$ALS_realRatingMatrix$parameters
## $normalize
## NULL
## 
## $lambda
## [1] 0.1
## 
## $n_factors
## [1] 10
## 
## $n_iterations
## [1] 10
## 
## $min_item_nr
## [1] 1
## 
## $seed
## NULL
recommender_models$SVD_realRatingMatrix$parameters
## $k
## [1] 10
## 
## $maxiter
## [1] 100
## 
## $normalize
## [1] "center"

3 SVD (Singular Value Decomposition)

Building an SVD recommendation model with k = 30 latent factors.

SVD.recc_model <- Recommender(data = recc_data_train, method = "SVD",parameter = list(k = 30))
SVD.recc_model
## Recommender of type 'SVD' for 'realRatingMatrix' 
## learned using 449 users.

Predict

SVD.recc_predicted <- predict(object = SVD.recc_model, newdata = recc_data_test, n = 6)
SVD.recc_predicted
## Recommendations as 'topNList' with n = 6 for 111 users.

Recommendation for the first user

SVD.recc_predicted@items[[1]]
## [1] 132 149 107 197 151 137
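
The values above are column indices into the rating matrix. To see the corresponding movie titles, the topNList object can be coerced to a plain list of item labels, a standard recommenderlab coercion.

SVD.recc_predicted_list <- as(SVD.recc_predicted, "list")  # item labels instead of indices
SVD.recc_predicted_list[[1]]                               # titles recommended to the first user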

4 ALS (Alternating Least Squares)

Building an ALS recommendation model using the default parameters.

ALS.recc_model <- Recommender(data = recc_data_train, method = "ALS")
ALS.recc_model
## Recommender of type 'ALS' for 'realRatingMatrix' 
## learned using 449 users.

Predict

ALS.recc_predicted <- predict(object = ALS.recc_model, newdata = recc_data_test, n = 6)
ALS.recc_predicted
## Recommendations as 'topNList' with n = 6 for 111 users.

Recommendation for the first user

ALS.recc_predicted@items[[1]]
## [1]  23  92 197 260 173  95
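
For intuition about what ALS is doing, here is a self-contained toy sketch (not recommenderlab's implementation): with the rank k and regularization lambda fixed, it alternately solves ridge-regression problems for the user and item factor matrices, and the regularized squared-error cost falls at every sweep, mirroring the cost-function trace printed later when the model is fit with verbose = TRUE. The toy matrix here is small and fully observed.

set.seed(1)
R <- matrix(sample(1:5, 8 * 6, replace = TRUE), nrow = 8, ncol = 6)  # toy, fully observed ratings
k <- 2; lambda <- 0.1
U <- matrix(rnorm(8 * k), nrow = 8)  # user factors
V <- matrix(rnorm(6 * k), nrow = 6)  # item factors
for (iter in 1:10) {
  U <- t(solve(t(V) %*% V + lambda * diag(k), t(V) %*% t(R)))  # fix V, solve for U (ridge)
  V <- t(solve(t(U) %*% U + lambda * diag(k), t(U) %*% R))     # fix U, solve for V (ridge)
  cost <- sum((R - U %*% t(V))^2) + lambda * (sum(U^2) + sum(V^2))
  cat(sprintf("iteration %d: cost = %.2f\n", iter, cost))
}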

5 Evaluate & Compare SVD and ALS

eval_sets <- evaluationScheme(ratings_movies, method = "split", train = .8, k=4, given = 4, goodRating=3)
algorithms <- list("SVD" = list(name="SVD"), "ALS" = list(name="ALS"))
results <- evaluate(eval_sets, algorithms, n = seq(10, 100, 10))
## SVD run fold/sample [model time/prediction time]
##   1  [0.6sec/0.08sec] 
##   2  [0.03sec/0.08sec] 
##   3  [0.03sec/0.26sec] 
##   4  [0.03sec/0.21sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0.08sec/11.45sec] 
##   2  [0sec/11.7sec] 
##   3  [0sec/11.88sec] 
##   4  [0sec/11.55sec]
class(results)
## [1] "evaluationResultList"
## attr(,"package")
## [1] "recommenderlab"
  • True Positive Rate (TPR): This is the percentage of relevant items (those rated at or above goodRating) that have been recommended. It is the number of TP divided by the number of relevant items (TP + FN).

  • False Positive Rate (FPR): This is the percentage of non-relevant items that have been recommended. It is the number of FP divided by the number of non-relevant items (FP + TN).
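
A toy worked example with made-up counts (one user, 332 candidate movies of which 80 are relevant, and 10 movies recommended):

TP <- 4; FP <- 6; FN <- 76; TN <- 246  # hypothetical counts for a single user
TP / (TP + FN)                         # TPR = 4 / 80 = 0.05
FP / (FP + TN)                         # FPR = 6 / 252, roughly 0.024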

kable(head(getConfusionMatrix(results$SVD)[1]))
n TP FP FN TN precision recall TPR FPR
10 4.776786 5.223214 81.94643 236.0536 0.4776786 0.0619336 0.0619336 0.0211230
20 8.419643 11.580357 78.30357 229.6964 0.4209821 0.1034239 0.1034239 0.0467012
30 11.785714 18.214286 74.93750 223.0625 0.3928571 0.1445970 0.1445970 0.0738392
40 15.071429 24.928571 71.65179 216.3482 0.3767857 0.1840694 0.1840694 0.1015481
50 18.125000 31.875000 68.59821 209.4018 0.3625000 0.2197164 0.2197164 0.1300233
60 20.946429 39.053571 65.77679 202.2232 0.3491071 0.2516999 0.2516999 0.1596566
70 23.776786 46.223214 62.94643 195.0536 0.3396684 0.2838822 0.2838822 0.1891784
80 26.428571 53.571429 60.29464 187.7054 0.3303571 0.3131761 0.3131761 0.2194954
90 29.071429 60.928571 57.65179 180.3482 0.3230159 0.3428524 0.3428524 0.2498391
100 31.785714 68.214286 54.93750 173.0625 0.3178571 0.3728961 0.3728961 0.2796268
kable(head(getConfusionMatrix(results$ALS)[1]))
n TP FP FN TN precision recall TPR FPR
10 4.169643 5.830357 82.55357 235.4464 0.4169643 0.0494304 0.0494304 0.0233722
20 8.312500 11.687500 78.41071 229.5893 0.4156250 0.0979654 0.0979654 0.0469716
30 12.017857 17.982143 74.70536 223.2946 0.4005952 0.1408545 0.1408545 0.0722365
40 15.803571 24.196429 70.91964 217.0804 0.3950893 0.1855815 0.1855815 0.0973926
50 19.366071 30.633929 67.35714 210.6429 0.3873214 0.2262206 0.2262206 0.1234147
60 22.750000 37.250000 63.97321 204.0268 0.3791667 0.2647291 0.2647291 0.1502030
70 26.116071 43.883929 60.60714 197.3929 0.3730867 0.3037462 0.3037462 0.1772907
80 29.366071 50.633929 57.35714 190.6429 0.3670759 0.3392218 0.3392218 0.2046213
90 32.642857 57.357143 54.08036 183.9196 0.3626984 0.3763308 0.3763308 0.2319625
100 35.803571 64.196429 50.91964 177.0804 0.3580357 0.4108417 0.4108417 0.2598226
recommenderlab::plot(results, annotate = 1:4, legend="topleft")

columns_to_sum_SVD <- c("TP", "FP", "FN", "TN")
indices_summed_SVD <- Reduce("+", getConfusionMatrix(results$SVD))[, columns_to_sum_SVD]
kable(head(indices_summed_SVD))
n TP FP FN TN
10 18.08929 21.91071 312.3482 959.6518
20 32.48214 47.51786 297.9554 934.0446
30 45.57143 74.42857 284.8661 907.1339
40 57.31250 102.68750 273.1250 878.8750
50 68.33929 131.66071 262.0982 849.9018
60 79.48214 160.51786 250.9554 821.0446
columns_to_sum_ALS <- c("TP", "FP", "FN", "TN")
indices_summed_ALS <- Reduce("+", getConfusionMatrix(results$ALS))[, columns_to_sum_ALS]
kable(head(indices_summed_ALS))
n TP FP FN TN
10 16.83929 23.16071 313.5982 958.4018
20 32.12500 47.87500 298.3125 933.6875
30 46.19643 73.80357 284.2411 907.7589
40 60.08036 99.91964 270.3571 881.6429
50 73.08036 126.91964 257.3571 854.6429
60 86.10714 153.89286 244.3304 827.6696
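
As an aside, recommenderlab can also average (rather than sum) the confusion matrices across the 4 folds directly with avg(); shown here for the SVD results only.

kable(head(avg(results$SVD)[, c("TP", "FP", "FN", "TN")]))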

Two accuracy metrics are as follows:

  • Precision: This is the percentage of recommended items that are relevant. It is the number of TP divided by the total number of recommended items (TP + FP).

  • Recall: This is the percentage of relevant items that have been recommended. It is the number of TP divided by the total number of relevant items (TP + FN). It is also equal to the True Positive Rate; see the worked numbers below.
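
Reusing the made-up counts from the TPR/FPR example above:

TP <- 4; FP <- 6; FN <- 76; TN <- 246  # same hypothetical counts as before
TP / (TP + FP)                         # precision = 4 / 10 = 0.4
TP / (TP + FN)                         # recall = 4 / 80 = 0.05, identical to the TPR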

recommenderlab::plot(results, "prec/rec", annotate = TRUE, main = "Precision-recall")

eval_recommender <- Recommender(data = getData(eval_sets, "train"),method = 'SVD')
eval_prediction <- predict(object = eval_recommender, newdata =getData(eval_sets, "known"), n = 10, type = "ratings")
eval_accuracy <- calcPredictionAccuracy(x = eval_prediction, data = getData(eval_sets, "unknown"), byUser =TRUE)
kable(head(eval_accuracy))
user RMSE MSE MAE
2 1.1526658 1.3286384 0.8782033
3 1.4749485 2.1754730 1.3482052
5 1.7868933 3.1929878 1.4359525
6 1.0900108 1.1881236 0.7546279
8 1.3583865 1.8452138 0.9892600
12 0.7778402 0.6050354 0.6092704
ALS_eval_recommender <- Recommender(data = getData(eval_sets, "train"),method = 'ALS',parameter = list( normalize=NULL, lambda=0.1, n_factors=200, n_iterations=10, seed = 1234, verbose = TRUE))
## Used parameters:
## normalize     =  NULL
## lambda    =  0.1
## n_factors     =  200
## n_iterations  =  10
## min_item_nr   =  1
## seed  =  1234
## verbose   =  TRUE
ALS_eval_prediction <- predict(object = ALS_eval_recommender, newdata =getData(eval_sets, "known"), n = 10, type = "ratings")
## [1] "0th iteration: cost function = 107881.841099455"
## [1] "1th iteration, step 1: cost function = 101674.016235416"
## [1] "1th iteration, step 2: cost function = 96810.452973072"
## [1] "2th iteration, step 1: cost function = 92514.1825860634"
## [1] "2th iteration, step 2: cost function = 89276.657175504"
## [1] "3th iteration, step 1: cost function = 86374.3048562102"
## [1] "3th iteration, step 2: cost function = 84097.7093784984"
## [1] "4th iteration, step 1: cost function = 82021.1671793639"
## [1] "4th iteration, step 2: cost function = 80146.2888907426"
## [1] "5th iteration, step 1: cost function = 77759.9060897439"
## [1] "5th iteration, step 2: cost function = 75743.7901169019"
## [1] "6th iteration, step 1: cost function = 74414.1932963314"
## [1] "6th iteration, step 2: cost function = 73092.3428067162"
## [1] "7th iteration, step 1: cost function = 71822.6860187558"
## [1] "7th iteration, step 2: cost function = 70706.9337741614"
## [1] "8th iteration, step 1: cost function = 69684.4965611434"
## [1] "8th iteration, step 2: cost function = 68807.6455159863"
## [1] "9th iteration, step 1: cost function = 68065.9730200085"
## [1] "9th iteration, step 2: cost function = 67445.9681242909"
## [1] "10th iteration, step 1: cost function = 66911.5703490813"
## [1] "10th iteration, step 2: cost function = 66463.722089732"
ALS_eval_accuracy <- calcPredictionAccuracy(x = ALS_eval_prediction, data = getData(eval_sets, "unknown"), byUser =TRUE)
kable(head(ALS_eval_accuracy))
user RMSE MSE MAE
2 0.8082209 0.6532211 0.5640932
3 1.2972581 1.6828787 1.0810463
5 1.2920681 1.6694399 1.0167497
6 0.8130227 0.6610059 0.6522359
8 0.8648391 0.7479466 0.6862509
12 0.7892665 0.6229416 0.6092135
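
To back up the comparison in the conclusion, the same accuracy calculation can be run with byUser = FALSE, which gives one aggregate RMSE/MSE/MAE per model instead of one row per user.

kable(rbind(
  SVD = calcPredictionAccuracy(x = eval_prediction, data = getData(eval_sets, "unknown"), byUser = FALSE),
  ALS = calcPredictionAccuracy(x = ALS_eval_prediction, data = getData(eval_sets, "unknown"), byUser = FALSE)
))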

After analyzing the MovieLense dataset from the recommenderlab package, the per-user RMSE values of the SVD and ALS models stay fairly close to each other. The ALS training log shows the cost function decreasing at every iteration, so a larger number of iterations can further improve the fit. The TPR:FPR ratio is also better for ALS than for SVD. Thus, for the given dataset, ALS appears to be the better model choice compared to SVD.