In this project we explore the use of matrix factorization methods to build collaborative filtering recommender systems for movie ratings.
library(dplyr)          # data manipulation and the %>% pipe
library(ggplot2)        # plotting
library(recommenderlab) # MovieLense data and funkSVD()
library(knitr)          # kable() for printing tables
The MovieLens dataset of 100,000 movie ratings is provided in user-item matrix form in the recommenderlab library. Each row represents a user, each column a movie, and NAs mark the movies a user has not rated.
data(MovieLense)
ui_mat <- as(MovieLense, "matrix")
ui_mat[1:5, 1:5] %>% kable()
| Toy Story (1995) | GoldenEye (1995) | Four Rooms (1995) | Get Shorty (1995) | Copycat (1995) |
|---|---|---|---|---|
| 5 | 3 | 4 | 3 | 3 |
| 4 | NA | NA | NA | NA |
| NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA |
| 4 | 3 | NA | NA | NA |
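As a quick sanity check on the scale and sparsity of the matrix (a small aside; output omitted):
### How many users (rows) and movies (columns) are we working with?
dim(ui_mat)
### What fraction of all user-item cells actually contain a rating?
sum(!is.na(ui_mat)) / length(ui_mat)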
Our first step is to normalize the data. We calculate the global mean rating, each user’s bias (that user’s mean rating minus the global mean), and each item’s bias (that item’s mean rating minus the global mean). The sum of these three terms is the “baseline,” and subtracting it from every rating leaves us with a matrix of ratings over and above the baseline.
For example, let’s say Joe rated “Toy Story” a perfect 5. Now assume the overall average rating for all movies is 3.5, but Toy Story is rated half a point higher than that (on average), and Joe tends to rate movies a quarter point higher (on average). Then Joe’s normalized rating for Toy Story would be 5 - 3.5 - 0.5 - 0.25 = 0.75.
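The same arithmetic in R, with the hypothetical values from the example spelled out:
### Hypothetical values for illustration only
joe_rating <- 5       # Joe's raw rating of Toy Story
ex_global_mean <- 3.5 # overall average rating
item_bias <- 0.5      # Toy Story averages half a point above the global mean
user_bias <- 0.25     # Joe averages a quarter point above the global mean
joe_rating - (ex_global_mean + item_bias + user_bias)
## [1] 0.75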
Normalizing the data in this fashion has two benefits. First, it accounts for “biases” in both users and items. Second, it lets us impute zeros for all of the NA values: after normalization, a zero means “exactly as the baseline predicts,” which is a far more reasonable placeholder than a zero on the raw 1-5 scale. The imputation is necessary because Singular Value Decomposition cannot be performed on a matrix with missing values.
global_mean <- mean(ui_mat, na.rm = T)
global_mean
## [1] 3.529982
u_bias_vec <- unname(rowMeans(ui_mat, na.rm=T)) - global_mean
u_bias_mat <- matrix(u_bias_vec, nrow=nrow(ui_mat), ncol=ncol(ui_mat))
i_bias_vec <- unname(colMeans(ui_mat, na.rm=T)) - global_mean
i_bias_mat <- matrix(i_bias_vec, nrow=nrow(ui_mat), ncol=ncol(ui_mat), byrow=T)
baseline <- global_mean + u_bias_mat + i_bias_mat
ui_mat_norm <- ui_mat - baseline
ui_mat_norm[is.na(ui_mat_norm)] <- 0
ui_mat_norm[1:5, 1:5] %>% kable()
| Toy Story (1995) | GoldenEye (1995) | Four Rooms (1995) | Get Shorty (1995) | Copycat (1995) |
|---|---|---|---|---|
| 1.0464977 | -0.2812906 | 0.8914829 | -0.625423 | -0.3775093 |
| -0.0532543 | 0.0000000 | 0.0000000 | 0.000000 | 0.0000000 |
| 0.0000000 | 0.0000000 | 0.0000000 | 0.000000 | 0.0000000 |
| 0.0000000 | 0.0000000 | 0.0000000 | 0.000000 | 0.0000000 |
| 0.7773780 | 0.4495897 | 0.0000000 | 0.000000 | 0.0000000 |
We are now ready to factor our normalized ratings matrix via Singular Value Decomposition. This effectively finds latent “factors” (which can be thought of as themes, concepts, or groupings, though they are not directly interpretable) in the data and tells us how important each factor is, as well as how strongly each user and each item is related to each factor.
Algebraically, this is done by finding the dimension that describes the highest amount of variability, then the second most, third most, and so on, all the way to the r-th, where r is the rank of the matrix. Each dimension comes with a singular value measuring its strength, so we can discard the dimensions that do not contribute enough. One way to do this is to set an arbitrary cutoff on the cumulative percentage of total strength and keep only the leading dimensions needed to reach it.
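To make that concrete, base R’s svd() returns the singular values already sorted from strongest to weakest, along with the two factor matrices that reconstruct the original exactly when every dimension is kept. A toy example, separate from our ratings data:
### Toy 3x2 matrix to illustrate the decomposition X = U D V'
X <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3)
s <- svd(X)
s$d # singular values, largest to smallest
X_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
all.equal(X, X_rebuilt) # TRUE: the full SVD reconstructs X exactly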
Dimensionality reduction essentially compresses the data while keeping most of the important information, leaving us with a much smaller, easier-to-work-with matrix. It has also been shown to produce better prediction models by stripping out some of the “noise” in the data.
svd_list <- svd(ui_mat_norm)
### Keep only the first k dimensions that represent some arbitrary percentage of total strength (variability)
pct_var_threshold <- 0.85
k <- which(cumsum(svd_list$d)/sum(svd_list$d) > pct_var_threshold)[1]
k
## [1] 473
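Since ggplot2 is already loaded, we can also visualize how quickly this cumulative “strength” accumulates. (Note that, as in the code above, we are summing the singular values themselves; summing their squares would instead give the proportion of variance explained.)
### Cumulative proportion of total strength by number of dimensions kept
data.frame(dim = seq_along(svd_list$d),
           cum_strength = cumsum(svd_list$d) / sum(svd_list$d)) %>%
  ggplot(aes(x = dim, y = cum_strength)) +
  geom_line() +
  geom_hline(yintercept = pct_var_threshold, linetype = "dashed")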
u_comp <- svd_list$u[, 1:k]
d_comp <- svd_list$d[1:k]
v_comp <- svd_list$v[, 1:k]
ui_mat_norm_comp <- u_comp %*% diag(d_comp) %*% t(v_comp)
dimnames(ui_mat_norm_comp) <- dimnames(ui_mat_norm)
ui_mat_norm_comp[1:5, 1:5] %>% kable()
| Toy Story (1995) | GoldenEye (1995) | Four Rooms (1995) | Get Shorty (1995) | Copycat (1995) |
|---|---|---|---|---|
| 1.0437897 | -0.2491957 | 0.8754280 | -0.6593391 | -0.3688876 |
| -0.0023077 | -0.0422761 | -0.0509056 | -0.0226108 | -0.0609236 |
| -0.0208480 | -0.0594591 | -0.0182048 | 0.0228926 | -0.0424904 |
| -0.0318934 | -0.0118165 | 0.0393241 | 0.0002459 | -0.0158619 |
| 0.7892781 | 0.4166696 | 0.0143068 | 0.0346760 | 0.0084791 |
Armed with our new-and-improved matrix of normalized ratings, we can add back the baseline that we had previously subtracted. Doing so gives us a matrix of “predicted” ratings for every user-item combination. To stay within our standard 1-5 scale, we clip any values below 1 or above 5 to those bounds.
ui_mat_pred <- ui_mat_norm_comp + baseline
ui_mat_pred[ui_mat_pred<1] <- 1
ui_mat_pred[ui_mat_pred>5] <- 5
ui_mat_pred[1:5, 1:5] %>% kable()
| Toy Story (1995) | GoldenEye (1995) | Four Rooms (1995) | Get Shorty (1995) | Copycat (1995) |
|---|---|---|---|---|
| 4.997292 | 3.032095 | 3.983945 | 2.966084 | 3.008622 |
| 4.050947 | 3.338767 | 3.157363 | 3.702564 | 3.416338 |
| 3.092194 | 2.381371 | 2.249852 | 2.807855 | 2.494559 |
| 4.620791 | 3.968656 | 3.847023 | 4.324851 | 4.060829 |
| 4.011900 | 2.967080 | 2.391944 | 2.929219 | 2.655108 |
We used some fairly advanced math to come up with our ratings predictions. But a reasonable question is: “Are the predictions any good?” To answer that, we will split our original data into train and test sets, repeat the entire process using only the training data, and calculate the RMSE on our predictions using the test set. We can compare that RMSE to the RMSE of a simpler model built using normalization and baselines but no dimensionality reduction, and also to a very simple model built by just guessing the global mean (~3.5) every time.
### Create test and train indexes (the split is random, so exact results will vary between runs)
pct_test <- 0.2
n_test <- round( pct_test * sum(!is.na(ui_mat)), 0)
na_ind <- which(is.na(ui_mat))
test_ind <- sample((1:length(ui_mat))[-na_ind], n_test, replace = F)
train_ind <- (1:length(ui_mat))[-c(na_ind, test_ind)]
### Break matrix into test and train
train_mat <- ui_mat
train_mat[test_ind] <- NA
test_mat <- ui_mat
test_mat[train_ind] <- NA
### Calculate baseline and normalize matrix
global_mean <- mean(train_mat, na.rm = T)
u_bias_vec <- unname(rowMeans(train_mat, na.rm=T)) - global_mean
u_bias_mat <- matrix(u_bias_vec, nrow=nrow(train_mat), ncol=ncol(train_mat))
i_bias_vec <- unname(colMeans(train_mat, na.rm=T)) - global_mean
i_bias_mat <- matrix(i_bias_vec, nrow=nrow(train_mat), ncol=ncol(train_mat), byrow=T)
baseline <- global_mean + u_bias_mat + i_bias_mat
train_mat_norm <- train_mat - baseline
train_mat_norm[is.na(train_mat_norm)] <- 0
### Perform SVD
svd_list_train <- svd(train_mat_norm)
pct_var_threshold <- 0.85
k <- which(cumsum(svd_list_train$d)/sum(svd_list_train$d) > pct_var_threshold)[1]
k
## [1] 471
u_comp <- svd_list_train$u[, 1:k]
d_comp <- svd_list_train$d[1:k]
v_comp <- svd_list_train$v[, 1:k]
train_mat_norm_comp <- u_comp %*% diag(d_comp) %*% t(v_comp)
dimnames(train_mat_norm_comp) <- dimnames(train_mat_norm)
train_mat_pred <- train_mat_norm_comp + baseline
train_mat_pred[train_mat_pred<1] <- 1
train_mat_pred[train_mat_pred>5] <- 5
### Calculate RMSE on Test data
rmse <- function(predicted, observed, rmna=FALSE){
  sqrt(mean((predicted - observed)^2, na.rm=rmna))
}
rmse_test <- rmse(train_mat_pred, test_mat, rmna=T)
rmse_test
## [1] 0.9645763
### RMSE of simpler model (with normalization but without compression)
rmse_simpler <- rmse(train_mat_norm + baseline, test_mat, rmna=T)
rmse_simpler
## [1] 0.9665725
### RMSE of very simple model (prediction of global average every time)
prediction_mat_chance <- matrix(mean(train_mat, na.rm=T), nrow = nrow(train_mat), ncol = ncol(train_mat))
rmse_simplest <- rmse(prediction_mat_chance, test_mat, rmna=T)
rmse_simplest
## [1] 1.12636
As expected, the two more advanced models outperformed the very simple model of predicting the global average every time. However, the dimensionality reduction itself did not seem to make much difference in RMSE. Was the process actually fruitless, or did we simply choose a poor value for our arbitrary number of dimensions to keep?
This is easy enough to determine. We chose to keep 471 dimensions, but since we have already decomposed the whole matrix, we can loop through various values of k and see how the RMSE changes.
possible_dim <- length(svd_list_train$d)
k_values <- c(2:19, seq(20, 99, by=5), seq(100, possible_dim, by=30))
k_rmse_mat <- matrix(NA, nrow=length(k_values), ncol=2)
for (i in 1:length(k_values)){
k <- k_values[i]
u_comp <- svd_list_train$u[, 1:k]
d_comp <- svd_list_train$d[1:k]
v_comp <- svd_list_train$v[, 1:k]
train_mat_norm_comp <- u_comp %*% diag(d_comp) %*% t(v_comp)
train_mat_pred <- train_mat_norm_comp + baseline
train_mat_pred[train_mat_pred<1] <- 1
train_mat_pred[train_mat_pred>5] <- 5
rmse_test <- rmse(train_mat_pred, test_mat, rmna=T)
k_rmse_mat[i, ] <- c(k, rmse_test)
}
k_rmse_df <- k_rmse_mat %>% as.data.frame()
names(k_rmse_df) <- c("k", "RMSE")
k_rmse_df %>% ggplot(aes(x=k, y=RMSE)) + geom_line()
best_k_rmse <- k_rmse_df %>% filter(RMSE==min(RMSE))
best_k_rmse
## k RMSE
## 1 12 0.9397388
The matrix factorization was, in fact, effective; we just chose a poor value for k. We can see above that keeping only 12 dimensions yields the lowest RMSE (0.9397388). It may seem surprising that more compression leads to a better model, but as mentioned earlier, discarding the weakest dimensions also discards some of the “noise” in the data.
Another commonly used approach is matrix factorization via Stochastic Gradient Descent. Known as “Funk SVD” in this context (after Simon Funk, who popularized it during the Netflix Prize), it is an iterative algorithm designed to closely approximate a truncated SVD. Its touted benefits are that it can handle missing values directly and can be computationally faster when reducing a very large number of dimensions down to only a handful. (In this case it was significantly slower than regular SVD, perhaps because our original dimensions are not so enormous.)
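The core update is easy to sketch: for each observed rating, nudge the corresponding user and item factor vectors in the direction that reduces the squared prediction error. Below is a minimal illustration of one pass, not recommenderlab’s actual implementation; the learning rate and regularization values are arbitrary, and U and V would be initialized as small random matrices with k columns.
### One stochastic-gradient-descent pass over the observed (non-NA) ratings
sgd_pass <- function(R, U, V, lr = 0.001, reg = 0.02) {
  obs <- which(!is.na(R), arr.ind = TRUE)
  for (idx in seq_len(nrow(obs))) {
    u <- obs[idx, 1]; i <- obs[idx, 2]
    u_old <- U[u, ]
    err <- R[u, i] - sum(u_old * V[i, ])                # prediction error for this rating
    U[u, ] <- u_old + lr * (err * V[i, ] - reg * u_old) # update user factors
    V[i, ] <- V[i, ] + lr * (err * u_old - reg * V[i, ]) # update item factors
  }
  list(U = U, V = V)
}
funkSVD() repeats updates of this flavor over many passes until the error stops improving.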
We will run Funk SVD on our training data and compare its RMSE on the test data to our previous methods.
train_mat_norm <- train_mat - baseline # keep the NA values this time; funkSVD handles them
f_svd_list_train <- funkSVD(train_mat_norm, k = best_k_rmse$k)
train_mat_norm_comp <- f_svd_list_train$U %*% t(f_svd_list_train$V)
dimnames(train_mat_norm_comp) <- dimnames(train_mat_norm)
train_mat_pred <- train_mat_norm_comp + baseline
train_mat_pred[train_mat_pred<1] <- 1
train_mat_pred[train_mat_pred>5] <- 5
rmse_test <- rmse(train_mat_pred, test_mat, rmna=T)
rmse_test
## [1] 0.9285994
Our test RMSE has improved again. This method has been the most accurate of the ones we have seen in this example.
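To recap, the test RMSEs of all the models tried in this example:
| Model | Test RMSE |
|---|---|
| Global mean every time | 1.1263600 |
| Baseline (normalization, no compression) | 0.9665725 |
| SVD, k = 471 | 0.9645763 |
| SVD, k = 12 | 0.9397388 |
| Funk SVD, k = 12 | 0.9285994 |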
Using a dataset of movie ratings from MovieLens, we built several different collaborative filtering recommender models. First we normalized our data and came up with baselines for users and items. Next we explored dimensionality reduction using different forms of matrix factorization. We evaluated our various models by training them on a subset of the original data and then calculating the RMSE of predictions on the unseen test data.
Not unexpectedly, a simple model of guessing the global average every time was not particularly accurate. Normalizing for user and item biases was helpful. Further improvement was achieved using matrix factorization methods. Stochastic Gradient Descent (i.e., the “Funk SVD” algorithm) proved to be the most accurate in this example.