The goal of this assignment is to give you practice working with matrix factorization techniques.
Your task is to implement a matrix factorization method—such as singular value decomposition (SVD) or Alternating Least Squares (ALS)—in the context of a recommender system.
You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember as always to cite your sources, so that you can be graded on what you added, not what you found.
SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items, and use SVD to reduce them to a much smaller set of k latent features (e.g. 20 or 70).
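To make this concrete, here is a small self-contained sketch (my own illustration, separate from the recommenderlab workflow below) of a rank-k approximation using base R's svd(); the toy matrix and the choice of k = 2 are arbitrary.
# Toy example: factor a small dense ratings matrix and keep k latent factors
set.seed(42)
R <- matrix(sample(1:5, 20, replace = TRUE), nrow = 4)   # 4 users x 5 items
k <- 2                                                   # number of latent factors to keep
s <- svd(R)
R_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])   # rank-k approximation of R
dim(s$u[, 1:k])                                          # each user is now described by k features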
# Loading required libraries
library(recommenderlab)
library(reshape2)
library(RCurl)
library(dplyr)
library(ggplot2)
library(knitr)
library(tidyverse)
I have chosen the MovieLense dataset provided by the recommenderlab package.
data_package <- data(package = "recommenderlab")
data_package$results[, "Item"]
## [1] "Jester5k" "JesterJokes (Jester5k)"
## [3] "MSWeb" "MovieLense"
## [5] "MovieLenseMeta (MovieLense)"
The dataset is a 943 x 1664 movie rating matrix containing 99,392 ratings. Each row corresponds to a user and each column to a movie. It is a very sparse matrix, with more than 1.5 million possible user/movie combinations.
data(MovieLense)
MovieLense
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
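As a quick check on how sparse this matrix really is, a one-line sketch (my addition) that computes the fill rate, i.e. the share of user/movie cells that actually contain a rating:
# Share of filled cells: roughly 99,392 ratings over 943 * 1664 possible pairs (about 6%)
sum(rowCounts(MovieLense)) / (nrow(MovieLense) * ncol(MovieLense))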
Finding the similarity between the first 4 users with the similarity function. Here we are calculating similarity using the cosine measure.
similarity_users <- similarity(MovieLense[1:4, ], method ="cosine", which = "users")
as.matrix(similarity_users)
## 1 2 3 4
## 1 0.0000000 0.9605820 0.8339504 0.9192637
## 2 0.9605820 0.0000000 0.9268716 0.9370341
## 3 0.8339504 0.9268716 0.0000000 0.9130323
## 4 0.9192637 0.9370341 0.9130323 0.0000000
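To show where these numbers come from, here is a hand-rolled sketch of the cosine similarity between the first two users; missing ratings are treated as 0, which may not match recommenderlab's internal handling exactly, so it is meant only to illustrate the formula.
# Cosine similarity = dot product / (product of vector norms)
u1 <- as(MovieLense[1, ], "matrix")[1, ]
u2 <- as(MovieLense[2, ], "matrix")[1, ]
u1[is.na(u1)] <- 0   # treat unrated movies as 0 (an assumption for illustration)
u2[is.na(u2)] <- 0
sum(u1 * u2) / (sqrt(sum(u1^2)) * sqrt(sum(u2^2)))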
The recommender_models object contains some information about the available models. First, let’s see which models we have:
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
kable(names(recommender_models))
## Warning in kable_markdown(x = structure(c("ALS_realRatingMatrix",
## "ALS_implicit_realRatingMatrix", : The table should have a header (column
## names)
ALS_realRatingMatrix |
ALS_implicit_realRatingMatrix |
IBCF_realRatingMatrix |
POPULAR_realRatingMatrix |
RANDOM_realRatingMatrix |
RERECOMMEND_realRatingMatrix |
SVD_realRatingMatrix |
SVDF_realRatingMatrix |
UBCF_realRatingMatrix |
Let’s take a look at their descriptions:
lapply(recommender_models, "[[", "description")
## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
##
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
##
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
##
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
##
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
##
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
##
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
##
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend."
##
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."
We create a subset of the MovieLense dataset containing only users who have rated more than 50 movies and movies that have been rated more than 100 times.
ratings_movies <- MovieLense[rowCounts(MovieLense) > 50,colCounts(MovieLense) > 100]
ratings_movies
## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.
Defining the train and test datasets: TRUE marks users assigned to the training set and FALSE marks users assigned to the test set.
which_train <- sample(x = c(TRUE, FALSE), size = nrow(ratings_movies),replace = TRUE, prob = c(0.8, 0.2))
head(which_train)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
recc_data_train <- ratings_movies[which_train, ]
recc_data_test <- ratings_movies[!which_train, ]
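A note on reproducibility: because the split is random, the exact counts and results below will vary between runs; a seed can be set before sampling (the value 1 below is arbitrary), for example:
# Optional: fix the random split so the report is reproducible
set.seed(1)
which_train <- sample(x = c(TRUE, FALSE), size = nrow(ratings_movies), replace = TRUE, prob = c(0.8, 0.2))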
Checking the SVD and ALS recommendation models and their default parameters.
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$ALS_realRatingMatrix$parameters
## $normalize
## NULL
##
## $lambda
## [1] 0.1
##
## $n_factors
## [1] 10
##
## $n_iterations
## [1] 10
##
## $min_item_nr
## [1] 1
##
## $seed
## NULL
recommender_models$SVD_realRatingMatrix$parameters
## $k
## [1] 10
##
## $maxiter
## [1] 100
##
## $normalize
## [1] "center"
Building the SVD recommendation model, using k = 30 latent factors.
SVD.recc_model <- Recommender(data = recc_data_train, method = "SVD",parameter = list(k = 30))
SVD.recc_model
## Recommender of type 'SVD' for 'realRatingMatrix'
## learned using 449 users.
Predicting the top 6 recommendations for each test user:
SVD.recc_predicted <- predict(object = SVD.recc_model, newdata = recc_data_test, n = 6)
SVD.recc_predicted
## Recommendations as 'topNList' with n = 6 for 111 users.
Recommendations (as item indices) for the first test user:
SVD.recc_predicted@items[[1]]
## [1] 132 149 107 197 151 137
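These numbers are column positions in the rating matrix; a short sketch (my addition) of how they can be mapped back to movie titles:
# Titles of the top-6 recommendations for the first test user
as(SVD.recc_predicted, "list")[[1]]
# Equivalent lookup via the column names of the rating matrix
colnames(ratings_movies)[SVD.recc_predicted@items[[1]]]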
Building the ALS recommendation model, using the default parameters.
ALS.recc_model <- Recommender(data = recc_data_train, method = "ALS")
ALS.recc_model
## Recommender of type 'ALS' for 'realRatingMatrix'
## learned using 449 users.
Predicting the top 6 recommendations for each test user:
ALS.recc_predicted <- predict(object = ALS.recc_model, newdata = recc_data_test, n = 6)
ALS.recc_predicted
## Recommendations as 'topNList' with n = 6 for 111 users.
Recommendations (as item indices) for the first test user:
ALS.recc_predicted@items[[1]]
## [1] 23 92 197 260 173 95
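As a quick sanity check (my addition), we can see how much the two models agree for this user by intersecting the two top-6 index vectors:
# Items recommended to the first test user by both SVD and ALS
intersect(SVD.recc_predicted@items[[1]], ALS.recc_predicted@items[[1]])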
To compare the two methods more systematically, we set up an evaluation scheme with four random 80/20 train/test splits; for each test user, 4 ratings are given to the recommender and the rest are withheld, and a rating of 3 or higher counts as a good rating.
eval_sets <- evaluationScheme(ratings_movies, method = "split", train = .8, k = 4, given = 4, goodRating = 3)
algorithms <- list("SVD" = list(name="SVD"), "ALS" = list(name="ALS"))
results <- evaluate(eval_sets, algorithms, n = seq(10, 100, 10))
## SVD run fold/sample [model time/prediction time]
## 1 [0.6sec/0.08sec]
## 2 [0.03sec/0.08sec]
## 3 [0.03sec/0.26sec]
## 4 [0.03sec/0.21sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0.08sec/11.45sec]
## 2 [0sec/11.7sec]
## 3 [0sec/11.88sec]
## 4 [0sec/11.55sec]
class(results)
## [1] "evaluationResultList"
## attr(,"package")
## [1] "recommenderlab"
True Positive Rate (TPR): This is the percentage of relevant items (those rated at or above the good-rating threshold) that have been recommended. It is the number of TP divided by the number of relevant items (TP + FN).
False Positive Rate (FPR): This is the percentage of non-relevant items that have been recommended. It is the number of FP divided by the number of non-relevant items (FP + TN).
kable(head(getConfusionMatrix(results$SVD)[1]))
kable(head(getConfusionMatrix(results$ALS)[1]))
recommenderlab::plot(results, annotate = 1:4, legend="topleft")
columns_to_sum_SVD <- c("TP", "FP", "FN", "TN")
indices_summed_SVD <- Reduce("+", getConfusionMatrix(results$SVD))[, columns_to_sum_SVD]
kable(head(indices_summed_SVD))
| n | TP | FP | FN | TN |
|---|---|---|---|---|
| 10 | 18.08929 | 21.91071 | 312.3482 | 959.6518 |
| 20 | 32.48214 | 47.51786 | 297.9554 | 934.0446 |
| 30 | 45.57143 | 74.42857 | 284.8661 | 907.1339 |
| 40 | 57.31250 | 102.68750 | 273.1250 | 878.8750 |
| 50 | 68.33929 | 131.66071 | 262.0982 | 849.9018 |
| 60 | 79.48214 | 160.51786 | 250.9554 | 821.0446 |
columns_to_sum_ALS <- c("TP", "FP", "FN", "TN")
indices_summed_ALS <- Reduce("+", getConfusionMatrix(results$ALS))[, columns_to_sum_ALS]
kable(head(indices_summed_ALS))
| n | TP | FP | FN | TN |
|---|---|---|---|---|
| 10 | 16.83929 | 23.16071 | 313.5982 | 958.4018 |
| 20 | 32.12500 | 47.87500 | 298.3125 | 933.6875 |
| 30 | 46.19643 | 73.80357 | 284.2411 | 907.7589 |
| 40 | 60.08036 | 99.91964 | 270.3571 | 881.6429 |
| 50 | 73.08036 | 126.91964 | 257.3571 | 854.6429 |
| 60 | 86.10714 | 153.89286 | 244.3304 | 827.6696 |
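From these summed counts we can derive the TPR and FPR directly; the small helper below is my own and mirrors the definitions given above.
# TPR = TP / (TP + FN), FPR = FP / (FP + TN); rows correspond to n = 10, 20, ...
rates <- function(m) cbind(TPR = m[, "TP"] / (m[, "TP"] + m[, "FN"]),
                           FPR = m[, "FP"] / (m[, "FP"] + m[, "TN"]))
head(rates(indices_summed_SVD))
head(rates(indices_summed_ALS))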
Two accuracy metrics are as follows:
Precision: This is the percentage of recommended items that are relevant. It is the number of TP divided by the total number of recommended items (TP + FP).
Recall: This is the percentage of relevant items that have been recommended. It is the number of TP divided by the total number of relevant items (TP + FN). It is also equal to the True Positive Rate.
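These can also be computed from the summed counts above; the helper below is my own sketch of the two formulas.
# Precision = TP / (TP + FP), Recall = TP / (TP + FN)
prec_rec <- function(m) cbind(precision = m[, "TP"] / (m[, "TP"] + m[, "FP"]),
                              recall    = m[, "TP"] / (m[, "TP"] + m[, "FN"]))
head(prec_rec(indices_summed_SVD))
head(prec_rec(indices_summed_ALS))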
recommenderlab::plot(results, "prec/rec", annotate = TRUE, main = "Precision-recall")
Finally, we measure rating prediction accuracy (RMSE, MSE, MAE), first for SVD:
eval_recommender <- Recommender(data = getData(eval_sets, "train"), method = 'SVD')
eval_prediction <- predict(object = eval_recommender, newdata =getData(eval_sets, "known"), n = 10, type = "ratings")
eval_accuracy <- calcPredictionAccuracy(x = eval_prediction, data = getData(eval_sets, "unknown"), byUser =TRUE)
kable(head(eval_accuracy))
| user | RMSE | MSE | MAE |
|---|---|---|---|
| 2 | 1.1526658 | 1.3286384 | 0.8782033 |
| 3 | 1.4749485 | 2.1754730 | 1.3482052 |
| 5 | 1.7868933 | 3.1929878 | 1.4359525 |
| 6 | 1.0900108 | 1.1881236 | 0.7546279 |
| 8 | 1.3583865 | 1.8452138 | 0.9892600 |
| 12 | 0.7778402 | 0.6050354 | 0.6092704 |
And for ALS, this time with more latent factors (n_factors = 200), a fixed seed, and verbose output so that the cost function can be tracked across iterations:
ALS_eval_recommender <- Recommender(data = getData(eval_sets, "train"), method = 'ALS', parameter = list(normalize = NULL, lambda = 0.1, n_factors = 200, n_iterations = 10, seed = 1234, verbose = TRUE))
## Used parameters:
## normalize = NULL
## lambda = 0.1
## n_factors = 200
## n_iterations = 10
## min_item_nr = 1
## seed = 1234
## verbose = TRUE
ALS_eval_prediction <- predict(object = ALS_eval_recommender, newdata =getData(eval_sets, "known"), n = 10, type = "ratings")
## [1] "0th iteration: cost function = 107881.841099455"
## [1] "1th iteration, step 1: cost function = 101674.016235416"
## [1] "1th iteration, step 2: cost function = 96810.452973072"
## [1] "2th iteration, step 1: cost function = 92514.1825860634"
## [1] "2th iteration, step 2: cost function = 89276.657175504"
## [1] "3th iteration, step 1: cost function = 86374.3048562102"
## [1] "3th iteration, step 2: cost function = 84097.7093784984"
## [1] "4th iteration, step 1: cost function = 82021.1671793639"
## [1] "4th iteration, step 2: cost function = 80146.2888907426"
## [1] "5th iteration, step 1: cost function = 77759.9060897439"
## [1] "5th iteration, step 2: cost function = 75743.7901169019"
## [1] "6th iteration, step 1: cost function = 74414.1932963314"
## [1] "6th iteration, step 2: cost function = 73092.3428067162"
## [1] "7th iteration, step 1: cost function = 71822.6860187558"
## [1] "7th iteration, step 2: cost function = 70706.9337741614"
## [1] "8th iteration, step 1: cost function = 69684.4965611434"
## [1] "8th iteration, step 2: cost function = 68807.6455159863"
## [1] "9th iteration, step 1: cost function = 68065.9730200085"
## [1] "9th iteration, step 2: cost function = 67445.9681242909"
## [1] "10th iteration, step 1: cost function = 66911.5703490813"
## [1] "10th iteration, step 2: cost function = 66463.722089732"
ALS_eval_accuracy <- calcPredictionAccuracy(x = ALS_eval_prediction, data = getData(eval_sets, "unknown"), byUser =TRUE)
kable(head(ALS_eval_accuracy))
| user | RMSE | MSE | MAE |
|---|---|---|---|
| 2 | 0.8082209 | 0.6532211 | 0.5640932 |
| 3 | 1.2972581 | 1.6828787 | 1.0810463 |
| 5 | 1.2920681 | 1.6694399 | 1.0167497 |
| 6 | 0.8130227 | 0.6610059 | 0.6522359 |
| 8 | 0.8648391 | 0.7479466 | 0.6862509 |
| 12 | 0.7892665 | 0.6229416 | 0.6092135 |
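The per-user tables are hard to compare at a glance, so here is a short sketch (my addition) that aggregates accuracy across all test users for both models by setting byUser = FALSE:
# Single aggregate RMSE / MSE / MAE per model
rbind(SVD = calcPredictionAccuracy(eval_prediction, getData(eval_sets, "unknown"), byUser = FALSE),
      ALS = calcPredictionAccuracy(ALS_eval_prediction, getData(eval_sets, "unknown"), byUser = FALSE))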
After analyzing the MovieLense dataset from the recommenderlab package, the RMSE values of the SVD and ALS methods stay close to each other. The ALS output also shows that the cost function decreases with each iteration, so a larger number of iterations can lead to a better-optimized model. The TPR-to-FPR ratio is also somewhat better for ALS than for SVD. Thus, for the given dataset, ALS appears to be the better model choice when compared to SVD.