Your task is to implement a matrix factorization method, such as singular value decomposition (SVD) or alternating least squares (ALS), in the context of a recommender system.
You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember as always to cite your sources, so that you can be graded on what you added, not what you found. SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items, and use SVD to create a much smaller set of “k” items (e.g. 20 or 70).
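As a quick illustration of that idea (a minimal sketch on a made-up toy matrix, with k = 2 chosen arbitrarily), a truncated SVD collapses a wide ratings matrix into a small number of latent features per user and per item:
# toy example: 6 users by 8 items, truncated to k = 2 latent features
set.seed(1)
M <- matrix(sample(1:5, 48, replace = TRUE), nrow = 6)
s <- svd(M, nu = 2, nv = 2)                      # keep only the first k = 2 singular vectors
user_features <- s$u %*% diag(s$d[1:2])          # 6 x 2: users described by 2 latent features
item_features <- s$v                             # 8 x 2: items described by 2 latent features
M_approx <- user_features %*% t(item_features)   # rank-2 approximation of the original matrix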
Loading packages and the MovieLense dataset
Similar to last week’s assignment, and our Building a Recommendation System with R book, I’ll use the MovieLense dataset for this project. In order to access this data, we’ll need to load the recommenderlab package and the MovieLense data stored within this package.
require(recommenderlab)
require(ggplot2)
data("MovieLense")
class(MovieLense)
## [1] "realRatingMatrix"
## attr(,"package")
## [1] "recommenderlab"
require(knitr)       # provides kable()
require(kableExtra)
require(devtools)
require(tidyverse)
require(dplyr)
require(stats)
require(irlba)
require(rsvd)
library(tictoc)
Similar to last week, before we implement a matrix factorization method, we can first do a bit of data exploration and identify the sparsity of our user-item matrix:
dim(MovieLense)
## [1] 943 1664
As we can see, there are 943 users and 1,664 movies in this dataset. From this, we can see that there are 1,569,152 possible user-item combinations (943 * 1664).
Next, we can check the sparsity of the dataset by running the following calculation:
movie_matrix_orig <- as.matrix(MovieLense@data)
length(movie_matrix_orig[movie_matrix_orig==0]) / (ncol(movie_matrix_orig)*nrow(movie_matrix_orig))
## [1] 0.9366588
As we can see above, the matrix is quite sparse: about 94% of the user-item entries are zero, which in this dataset indicates that no rating was given. Before running our matrix factorization method, we’ll need to handle these missing values.
Missing values in MovieLense Data
In the MovieLense dataset, the missing values are stored as zero. For our purposes, this is not ideal, since treating them as true zeros would greatly distort our singular value calculations later. Therefore, we will create a matrix that converts the zeros to NA, and then impute the missing values with each user’s row mean. This way, we have a value for every user-movie combination in our matrix, and the matrix is ready for SVD and dimensionality reduction:
# creating movie_matrix, replacing zeros with NAs
movie_matrix <- movie_matrix_orig
is.na(movie_matrix) <- movie_matrix == 0
# keep a copy with the NAs, used later to identify unrated movies
movie_matrix_NA <- movie_matrix
# imputing all missing values with the row (user) mean
na_idx <- which(is.na(movie_matrix), arr.ind = TRUE)
movie_matrix[na_idx] <- rowMeans(movie_matrix, na.rm = TRUE)[na_idx[, 1]]
With our movie_matrix now containing imputed values for our missing data, we can start SVD.
Singular Value Decomposition (SVD)
In order to work through our SVD calculations, we’ll need to keep in mind the following equation:
\[M = U \Sigma V^{T} \] with \(M\) as our original \(m \times n\) matrix, \(U\) as an orthogonal matrix (\(m \times m\)), \(\Sigma\) as a diagonal matrix of singular values (\(m \times n\)), and \(V^{T}\) as the transpose of the orthogonal matrix \(V\) (\(n \times n\)).
We can demonstrate that this equation is valid by performing the following calculations to show that when we calculate \(U\), \(\Sigma\), and \(V^{T}\) of our original movie matrix, and multiply these matrices together, we’ll obtain our original movie matrix (\(M\)).
svd_M <- svd(movie_matrix)
S <- diag(svd_M$d)
U <- svd_M$u
V <- svd_M$v
v_tran <- t(V)
# singular values, used later to find our optimal rank k
d <- svd_M$d
# split sqrt(S) between the two factors: rows of user_profile correspond to users,
# columns of item_profile correspond to movies
user_profile <- U %*% sqrt(S)
item_profile <- sqrt(S) %*% v_tran
item_user_profile <- user_profile %*% item_profile
rownames(item_user_profile) <- rownames(movie_matrix)
colnames(item_user_profile) <- colnames(movie_matrix)
Check to see if our item_user_profile matrix is equal to our original movie_matrix.
round(max(abs(Matrix(item_user_profile) - movie_matrix)), 0)
## [1] 0
As we can see, this is indeed true: the maximum absolute difference between the two matrices rounds to zero.
Now, given this method, we can perform dimensionality reduction to create approximate ratings from our original user-item matrix. The reason for doing this is to cut down on the size of our original movie matrix, which, at 943 by 1664, is quite large. We can take this matrix of rank \(r = 943\) and reduce it to rank \(k\), where \(k < r\).
rankMatrix(movie_matrix)[1]
## [1] 943
From above, we can check that our movie_matrix is indeed currently rank \(r=943\). We will reduce this to a lower rank matrix, in order to make our recommender more efficient, and to allow us to use smaller matrices to make predictions.
To do this, we can follow the steps outlined in the YouTube resources provided for this week’s materials.
First, we’ll need to find our optimal \(k\) value from the \(d\) vector of singular values, which were computed from our original movie_matrix and are sorted from largest to smallest. Since 943 singular values were computed during SVD (matching our initial rank of 943), in practice we reduce the dimensions of our matrix by zeroing out the very small singular values in \(d\). To start, we can plot the singular values of our full rank-943 matrix:
hist(d, breaks = 25, xlab="singular value", ylab="frequency in d", main = "Histogram of singular values")
As we can see above, many of our singular values fall close to zero (with one very large value of about 4,528). To reduce the dimensions of our matrices, we can set these near-zero singular values to zero (and drop the corresponding columns of \(U\) and \(V\)), since these latent concepts are unlikely to carry much weight in our final rating predictions. We can find a suitable \(k\) value with the following calculations:
In order to find the optimal k, we’ll need to first compute the sum of all squares of singular values in our original movies dataset:
ss_all <- sum((d)^2)
Then, we can find the optimal \(k\) by making sure we retain at least 90% of the total energy of the singular values. This is based on the formula:
\[\frac{\sum\limits_{i=1}^{k} \sigma_{i}^2}{\sum\limits_{i=1}^{r} \sigma_{i}^2} \geq 0.9\]
We can plot this out in a graph and see where we reach our minimum energy threshold of 90%:
# cumulative energy: share of the total sum of squared singular values retained by the first i values
energy <- cumsum(d^2) / ss_all
plot(energy, pch=20, cex = 0.5, xlab='# of singular values', ylab='Percent sum of squares of singular values', main = "Dimensionality Reduction - optimal k")
lines(x = c(0,20), y = c(.9961, .9961))
lines(x = c(20,20), y = c(0, .9961))
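As a quick programmatic cross-check of the plot (a small sketch reusing the energy vector computed above; the helper name and thresholds are only illustrative), we can ask directly for the smallest \(k\) that clears a chosen energy threshold:
# smallest number of singular values whose cumulative energy reaches a given threshold
k_for_threshold <- function(threshold) which(energy >= threshold)[1]
k_for_threshold(0.90)   # minimum k preserving at least 90% of the energy
k_for_threshold(0.99)   # a stricter threshold, for comparison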
From the plot above, we can see that the largest singular values capture the bulk of the energy: at \(k = 20\) we already retain about 99.6% of it, comfortably above our 90% threshold. Therefore, to keep things as computationally efficient as possible, we’ll work with a very low-rank matrix of 20; going much lower may increase our error. We can always revisit our \(k\) value if a rank-20 matrix yields high error, but for now we can test this size. We’ll re-run the decomposition using the irlba package rather than the base svd() function, since irlba computes only the requested rank-\(k\) factors instead of the full decomposition. After running the code below, we’ll have reduced the dimensions of our factorization from \(r = 943\) to \(k = 20\), a huge improvement.
tic("SVD Processing")
svd_opt <- irlba(movie_matrix, nu=20, nv=20)
d_opt <- svd_opt$d
Below are the 20 singular values (rank \(k = 20\)) along the diagonal of our reduced \(\Sigma\) matrix.
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4528.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 65.74 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 50.42 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 38.55 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 37.14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 35.78 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 33.91 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 32.68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 30.79 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 29.47 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 29.3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 28.17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 27.89 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 27.83 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 27.52 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 27.26 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 26.9 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 26.73 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 26.34 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 26.32 |
Now, with our optimal \(k\) value, we can re-compute our user-item matrix at a much smaller rank using our updated equation:
\[A = U_k \Sigma_k V^T_k \] where \(A\) is our matrix of predicted ratings, built from \(U\), \(\Sigma\), and \(V^T\) all truncated to rank \(k\). This equation can be expanded to:
\[A = (U_k \sqrt{\Sigma_k})(\sqrt{\Sigma_k}\, V^T_k) \] And we can compute these factors from our movie_matrix as follows:
U_opt <- svd_opt$u
V_opt <- svd_opt$v
S_opt <- diag(d_opt)
Vt_opt <- t(V_opt)
Additionally, we can check the dimensions of our \(U\), \(S\), and \(V^T\) matrices:
paste0('The dimensions of U: ', dim(U_opt)[1], ' by ', dim(U_opt)[2])
## [1] "The dimensions of U: 943 by 20"
paste0('The dimensions of S: ', dim(S_opt)[1], ' by ', dim(S_opt)[2])
## [1] "The dimensions of S: 20 by 20"
paste0('The dimensions of t(V): ', dim(Vt_opt)[1], ' by ', dim(Vt_opt)[2])
## [1] "The dimensions of t(V): 20 by 1664"
As we can see, we’ve created much smaller matrices, all of rank \(k = 20\). Finally, we can use these smaller matrices to create predictions for our full-size dataset (943 by 1664):
Uk <- U_opt %*% sqrt(S_opt)
Vk <- sqrt(S_opt) %*% Vt_opt
item_user_profile_svd <- round((Uk %*% Vk), 2)
movie_matrix_svd <- ifelse((item_user_profile_svd) > 5, 5,
ifelse((item_user_profile_svd) < 0, 0,
(item_user_profile_svd)))
colnames(movie_matrix_svd) <- colnames(movie_matrix)
rownames(movie_matrix_svd) <- rownames(movie_matrix)
toc(log = TRUE, quiet = TRUE)
In the end, we’ve created a new matrix, movie_matrix_svd, containing the predicted ratings built from our truncated SVD.
GoldenEye predictions for User 2 and User 3?
Now, with our singular values computed and our SVD matrices ready, we can go back to our original dataset and predict ratings for users who have not yet rated certain movies. As an example, we can see from our original matrix that User 2 and User 3 have not rated “GoldenEye”. Therefore, using our factor matrices, we can create some predictions.
First, we can grab our initial (unimputed) ratings for Users 2 and 3, and then, using our \(U\), \(S\), and \(V^T\) matrices, take the dot product of each user’s latent-factor row with GoldenEye’s latent-factor column (column 2 of the rating matrix) to calculate predictions for these previously unrated movies.
# initial (unimputed) ratings for Users 2 and 3, confirming GoldenEye is unrated (NA)
user_2_ratings <- movie_matrix_NA[2, ]
user_3_ratings <- movie_matrix_NA[3, ]
# rows of U_opt correspond to users; columns of Vt_opt correspond to movies,
# and GoldenEye is column 2 of the rating matrix (used for both users)
U_opt_user2 <- U_opt[2,]
Vt_opt_user2 <- Vt_opt[,2]
S_opt_user2 <- S_opt
U_opt_user3 <- U_opt[3,]
Vt_opt_user3 <- Vt_opt[,2]
S_opt_user3 <- S_opt
paste0('User 2 predicted rating for Goldeneye: ', round(U_opt_user2 %*% S_opt_user2 %*% Vt_opt_user2, 2))
## [1] "User 2 predicted rating for Goldeneye: 3.73"
paste0('User 3 predicted rating for Goldeneye: ', round(U_opt_user3 %*% S_opt_user3 %*% Vt_opt_user3, 2))
## [1] "User 3 predicted rating for Goldeneye: 2.71"
Obviously these procedures can be done on a much larger scale, and if we needed to reference predicted values for a particular user, it would be very easy to utilize our much lower rank matrices to perform these computations.
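As one way to illustrate this (a sketch reusing the U_opt, S_opt, and Vt_opt factors from above; the function name and the title matching are my own additions), a small helper can return the rank-20 prediction for any user row and movie column or title:
# illustrative helper: rank-k predicted rating for a single user/movie pair
predict_rating_k <- function(user_row, movie, U = U_opt, S = S_opt, Vt = Vt_opt,
                             ratings = movie_matrix) {
  # the movie can be given as a column index or as (part of) its title
  movie_col <- if (is.character(movie)) grep(movie, colnames(ratings))[1] else movie
  as.numeric(U[user_row, ] %*% S %*% Vt[, movie_col])
}
# e.g. predict_rating_k(2, "GoldenEye") should reproduce the User 2 value above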
Now, with our SVD predictions saved in the new matrix movie_matrix_svd, we can calculate the RMSE of these predictions, which were produced using singular value decomposition and dimensionality reduction from rank 943 down to rank 20:
movie_matrix1 <- as(as.matrix(movie_matrix), "realRatingMatrix")
calcPredictionAccuracy(x = as(movie_matrix_svd, "realRatingMatrix"), data = movie_matrix1)
## RMSE MSE MAE
## 0.22599232 0.05107253 0.07723558
From our calculations, we can see that our root mean squared error (RMSE) is about 0.23. We can compare this to the RMSE values we computed last week with our item-based and user-based collaborative filtering algorithms to see which method yielded movie predictions that more closely matched our full matrix.
User-based collaborative filtering algorithm
tic("UBCF Algorithm Processing")
movies_ubcf <- as(movie_matrix_orig, "realRatingMatrix")
evaluation_scheme <- evaluationScheme(data = movies_ubcf, method = "split", train = 0.8, given = 10, goodRating = 3, k = 10)
ubcf <- Recommender(getData(evaluation_scheme, "train"), method='ubcf', parameter= list(method = 'pearson', nn = 100, normalize = 'center'))
ubcf_predict <- predict(ubcf, newdata = getData(evaluation_scheme, "known"), n = 6, type = "ratings")
calcPredictionAccuracy(x = ubcf_predict, data = getData(evaluation_scheme, "unknown"), byUser = FALSE)
## RMSE MSE MAE
## 1.0176856 1.0356841 0.6523523
toc(log = TRUE, quiet = TRUE)
Item-based collaborative filtering algorithm
tic("IBCF Algorithm Processing")
movies_ibcf <- as(movie_matrix_orig, "realRatingMatrix")
evaluation_scheme_ibcf <- evaluationScheme(data = movies_ibcf, method = "split", train = 0.8, given = 10, goodRating = 3, k = 10)
ibcf <- Recommender(getData(evaluation_scheme_ibcf, "train"), method='ibcf', parameter= list(method = 'pearson', k = 98, normalize = 'center'))
ibcf_predict <- predict(ibcf, newdata = getData(evaluation_scheme_ibcf, "known"), n = 6, type = "ratings")
calcPredictionAccuracy(x = ibcf_predict, data = getData(evaluation_scheme_ibcf, "unknown"), byUser = FALSE)
## RMSE MSE MAE
## 1.388195 1.927084 0.578933
toc(log = TRUE, quiet = TRUE)
As we can see from our RMSE calculations, our SVD approach yielded a much lower RMSE than either IBCF or UBCF. One caveat is that the SVD error was measured against the imputed full matrix rather than against held-out ratings, so the comparison is not strictly like-for-like; even so, SVD appears to be a strong option for building a recommendation system with the MovieLense data. We can also test how computationally expensive each algorithm is:
log <- as.data.frame(unlist(tic.log(format = TRUE)))
colnames(log) <- c("Run Time")
knitr::kable(log, format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Run Time |
|---|
| SVD Processing: 0.87 sec elapsed |
| UBCF Algorithm Processing: 7.24 sec elapsed |
| IBCF Algorithm Processing: 11.14 sec elapsed |
As we can see above, our SVD processing was much faster than either the UBCF or IBCF algorithm: roughly eight times faster than UBCF and roughly thirteen times faster than IBCF. In the end, SVD does provide real advantages for a user-item matrix such as movie ratings, and its success was well documented during the Netflix Prize. However, as mentioned in the weekly materials, SVD can produce singular values and latent factors that are difficult to interpret, which leads some to look for other alternatives.
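As a possible follow-up (a sketch only, not re-run here), recommenderlab also ships a built-in "SVD" recommender, so a truncated-SVD model can be evaluated on held-out ratings with the same evaluationScheme pattern used above for UBCF and IBCF, which would give a more directly comparable RMSE; the k = 20 and normalize parameters below mirror the choices made earlier and are assumptions on my part:
# sketch: evaluate recommenderlab's built-in SVD recommender on held-out ratings
tic("SVD Recommender Processing")
movies_svd <- as(movie_matrix_orig, "realRatingMatrix")
evaluation_scheme_svd <- evaluationScheme(data = movies_svd, method = "split", train = 0.8, given = 10, goodRating = 3, k = 10)
svd_rec <- Recommender(getData(evaluation_scheme_svd, "train"), method = "SVD", parameter = list(k = 20, normalize = "center"))
svd_predict <- predict(svd_rec, newdata = getData(evaluation_scheme_svd, "known"), n = 6, type = "ratings")
calcPredictionAccuracy(x = svd_predict, data = getData(evaluation_scheme_svd, "unknown"), byUser = FALSE)
toc(log = TRUE, quiet = TRUE)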