In our previous assignment, we studied collaborative filtering and content filtering. Both have their shortcomings. Collaborative filtering recommends items based on the similarity of items and/or the similarity of users, but suffers from the cold start problem and an inability to incorporate side features. Content filtering is challenging because a profile containing multiple characteristics must be built for each item.
In addition, users often associate items in ways that are not easily captured by item characteristics. Two movies, for example, might be liked by the same individuals while not sharing a genre, actors, box office success, etc. Many users might associate two movies simply because both felt “true to life”. We could call these associations “concepts”; they are also referred to as latent dimensions or factors.
SVD allows us to take a matrix and decompose it in a way which may allow us to identify such concepts, which can then be used to drive recommendations. It is always possible to decompose a real matrix \(A\) into three matrices: \(U\) (left singular vectors), \(\Sigma\) (singular values), and \(V\) (right singular vectors). In terms of a user-item matrix, we can think of \(U\) as the “user-to-concept” similarity matrix, the diagonal values of \(\Sigma\) as the “strength” of the concepts, and \(V\) as the “item-to-concept” similarity matrix:
\[A = U \Sigma V^T\]
Below we have a matrix of users (viewers), TV shows, and ratings. A zero value means the viewer has not rated the show:
# Create "A"" Matrix
A <- matrix(c(5, 4, 1, 3, 2, 3,
2, 3, 4, 0, 3, 3,
3, 0, 5, 0, 4, 5,
0, 0, 3, 0, 4, 3,
4, 4, 0, 4, 1, 2,
0, 4, 3, 3, 3, 3,
5, 3, 4, 4, 0, 0,
2, 5, 3, 5, 0, 0), nrow = 8, byrow = TRUE)
# Add column and row names
colnames(A) <- c("Westworld","GameOfThrones","BigLittleLies","TheJinx","VicePrincipals","Insecure")
rownames(A) <- c("Viewer1","Viewer2","Viewer3","Viewer4","Viewer5","Viewer6","Viewer7","Viewer8")
kable(A) %>% kable_styling()

| | Westworld | GameOfThrones | BigLittleLies | TheJinx | VicePrincipals | Insecure |
|---|---|---|---|---|---|---|
| Viewer1 | 5 | 4 | 1 | 3 | 2 | 3 |
| Viewer2 | 2 | 3 | 4 | 0 | 3 | 3 |
| Viewer3 | 3 | 0 | 5 | 0 | 4 | 5 |
| Viewer4 | 0 | 0 | 3 | 0 | 4 | 3 |
| Viewer5 | 4 | 4 | 0 | 4 | 1 | 2 |
| Viewer6 | 0 | 4 | 3 | 3 | 3 | 3 |
| Viewer7 | 5 | 3 | 4 | 4 | 0 | 0 |
| Viewer8 | 2 | 5 | 3 | 5 | 0 | 0 |
Below, the matrix is decomposed into the three matrices \(U\), \(\Sigma\), and \(V\) described above.
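As a minimal sketch, the decomposition can be computed with base R’s svd() function (the variable names sv, U, Sigma, and V below are our own; the tables that follow show the resulting matrices):
# Decompose A with base R's svd(); d holds the singular values,
# u and v hold the left and right singular vectors
sv <- svd(A)
U <- sv$u           # 8 x 6 "viewer-to-concept" matrix
Sigma <- diag(sv$d) # 6 x 6 diagonal matrix of singular values
V <- sv$v           # 6 x 6 "show-to-concept" matrix
# Sanity check: U %*% Sigma %*% t(V) reconstructs A
all.equal(A, U %*% Sigma %*% t(V), check.attributes = FALSE)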
The \(U\) matrix is a “viewer-to-concept” similarity matrix:
| | | | | | |
|---|---|---|---|---|---|
| -0.4147750 | 0.1425228 | -0.3927796 | -0.4323190 | -0.1424352 | 0.1425331 |
| -0.3395167 | -0.2829300 | 0.0856209 | 0.0667745 | -0.8397469 | 0.1168574 |
| -0.3670956 | -0.5670215 | -0.3197448 | 0.1976414 | 0.2503563 | -0.5656331 |
| -0.2035054 | -0.4668349 | 0.1706882 | -0.0348001 | 0.3756510 | 0.7161290 |
| -0.3501219 | 0.3099925 | -0.2124608 | -0.4716679 | 0.1971777 | -0.0258139 |
| -0.3581909 | -0.0959223 | 0.6273570 | -0.2474209 | 0.1319163 | -0.1989212 |
| -0.3895683 | 0.3041212 | -0.2540980 | 0.6491364 | 0.1122135 | 0.2604425 |
| -0.3658783 | 0.4029932 | 0.4513754 | 0.2515999 | 0.0432731 | -0.1586847 |
The diagonal values of \(\Sigma\) (as a matrix) provide the “strength” of the concepts:
| | | | | | |
|---|---|---|---|---|---|
| 17.94313 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 |
| 0.00000 | 9.365004 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 |
| 0.00000 | 0.000000 | 4.702391 | 0.000000 | 0.000000 | 0.0000000 |
| 0.00000 | 0.000000 | 0.000000 | 4.220175 | 0.000000 | 0.0000000 |
| 0.00000 | 0.000000 | 0.000000 | 0.000000 | 1.925994 | 0.0000000 |
| 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.8419046 |
And the \(V\) matrix is the “TV show-to-concept” similarity matrix:
| | | | | | |
|---|---|---|---|---|---|
| -0.4421903 | 0.2148693 | -0.8441394 | 0.1012017 | -0.1060627 | 0.1556809 |
| -0.4742203 | 0.3742563 | 0.3912716 | -0.2843243 | -0.6332312 | 0.0114839 |
| -0.4430283 | -0.3296434 | 0.2302764 | 0.7885143 | -0.0769815 | -0.1198035 |
| -0.3960874 | 0.4923887 | 0.2327270 | -0.0169059 | 0.7385139 | -0.0285936 |
| -0.3096002 | -0.4994066 | 0.1158320 | -0.2907176 | 0.1520487 | 0.7305396 |
| -0.3613463 | -0.4617847 | -0.1171715 | -0.4498450 | 0.1154171 | -0.6532786 |
One challenge is interpreting the concepts. What do they represent? In this case, we have six TV shows which arguably fall into four genres (“Westworld” and “Game of Thrones” into “sci-fi/fantasy”, “Big Little Lies” into “drama”, “The Jinx” into “documentary”, and “Vice Principals” and “Insecure” into “comedy”). Perhaps some of the shows share directors, cast members, etc. The \(\Sigma\) matrix shows six singular values, yet roughly 90% of the energy (the squared singular value, a measure of a concept’s relative strength) is contained in the first two. We will find that we can drop the other four singular values to create smaller matrices (dimensionality reduction) and use those to make predictions. But what do the two retained concepts represent?
Ultimately, it doesn’t matter. SVD brings out the latent dimensions/factors and uses them to recommend shows. Knowing exactly what the latent dimensions/factors are is secondary to the power of SVD.
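As a quick check (illustrative code using the sv object from the sketch above, not part of the original analysis), we can confirm the 90% energy figure and build the rank-2 approximation:
# Cumulative energy: squared singular values as a share of the total.
# The second entry should be roughly 0.90.
round(cumsum(sv$d^2) / sum(sv$d^2), 3)
# Rank-2 approximation of A using only the two strongest concepts
A2 <- sv$u[, 1:2] %*% diag(sv$d[1:2]) %*% t(sv$v[, 1:2])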
We can demonstrate this more clearly by applying SVD to a real dataset and using dimensionality reduction to make predictions.
Our dataset for this section is the MovieLense dataset from the recommenderlab R package, subsetted exactly as on pg. 78 of the textbook (see References; code reproduced below):
# naming it 'rm' instead of 'ratings_movies'
data("MovieLense")
rm <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]
rm
## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.
Per the output above, rm is a realRatingMatrix consisting of 560 users and 332 movies. The next step is to convert it to a matrix and decompose the matrix via SVD.
# convert rm to matrix
rm_mx <- as(rm, "matrix")
# percentage of data points in matrix which are NA
sum(is.na(rm_mx)) / (560 * 332)
## [1] 0.702571
A common challenge for any recommender framework is that most items are not rated. As we can see above, in this case 70.26% of the user/movie combinations have no rating. SVD requires that there be no missing values in the matrix. We will address this in four ways: replacing missing values with zeroes, with row (user) averages, with the raw (overall) average, and via normalization:
# matrix with NAs converted to 0
A_0 <- as.matrix(rm@data)
# copy matrix to A_rowavg & replace zero (missing) values with row means
# (note: rowMeans() includes the zero entries, so each mean is taken over all 332 columns)
A_rowavg <- A_0
aa <- rowMeans(A_rowavg)
for (i in 1:560) {A_rowavg[i, (A_rowavg[i, ] == 0)] <- aa[i]}
# using raw average (dataset with 0s)
A_rawavg <- A_0
raw_avg_val <- mean(A_rawavg)
A_rawavg[(A_rawavg == 0)] <- raw_avg_val
# using normalization
A_norm <- normalize(rm)
A_norm <- as.matrix(A_norm@data)
Now that we have created four different matrices, each of which handles missing values differently, we are ready to decompose them:
# SVD for dataset that used zeroes for missing values
x_0 <- svd(A_0)
U_0 <- x_0$u
sig_0 <- x_0$d
V_0 <- x_0$v
# SVD for dataset that used row means for missing values
x_rowavg <- svd(A_rowavg)
U_rowavg <- x_rowavg$u
sig_rowavg <- x_rowavg$d
V_rowavg <- x_rowavg$v
# SVD for dataset that used raw mean for missing values
x_rawavg <- svd(A_rawavg)
U_rawavg <- x_rawavg$u
sig_rawavg <- x_rawavg$d
V_rawavg <- x_rawavg$v
# SVD for normalized dataset
x_norm <- svd(A_norm)
U_norm<- x_norm$u
sig_norm <- x_norm$d
V_norm <- x_norm$v
Now we can reduce the dimensionality. We do this by keeping some, but not all, of the singular values: the rest are set to zero in the \(\Sigma\) matrix, and the reduced \(\Sigma\) is used to create new \(U\) and \(V\) matrices. The decision on which singular values to keep is driven by the amount of “energy” they contribute, where a singular value’s energy is its square.
Fortunately, svd() returns the singular values sorted from highest to lowest, so the diagonal of \(\Sigma\) is already ordered by strength. If we decide to keep 80% of the energy, for example, we keep the leading singular values whose squares sum to about 80% of the sum of all squared singular values. Since svd() provides the singular values as a vector, we can determine how many are needed to reach 80% of the energy, then create new matrices using the reduced \(\Sigma\):
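Note that the code below uses the irlba package, which computes only the leading k singular triplets directly instead of the full SVD; keeping the top k triplets is mathematically equivalent to zeroing the trailing singular values in \(\Sigma\) and reconstructing. A minimal sketch of that equivalence (the small matrix M and the choice of k are purely illustrative):
# Rank-k approximation two ways: zeroing trailing singular values in the
# full SVD vs. keeping only the leading k singular triplets
k <- 2
M <- matrix(rnorm(30), nrow = 6)
full <- svd(M)
d_zeroed <- full$d
d_zeroed[(k + 1):length(d_zeroed)] <- 0
approx_zeroed <- full$u %*% diag(d_zeroed) %*% t(full$v)
approx_trunc <- full$u[, 1:k] %*% diag(full$d[1:k]) %*% t(full$v[, 1:k])
all.equal(approx_zeroed, approx_trunc)  # TRUE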
# Energy threshold calculation for dataset that used zeroes for missing values
d_sq_0 <- sum(sig_0^2)
nrg_0 <- cumsum(sig_0^2) / d_sq_0
plot(nrg_0, pch = 21, col = "red", cex = 0.5, xlab = 'Singular Value', ylab = 'Singular Values Energy')
lines(x = c(0, 332), y = c(0.8, 0.8))
# smallest k whose cumulative energy reaches the 80% threshold
k_0 <- length(nrg_0[nrg_0 <= 0.8]) + 1
x_0_new <- irlba(A_0, nv = k_0)
U_0_new <- x_0_new$u
sig_0_new <- x_0_new$d
V_0_new <- x_0_new$v
# Energy threshold calculation for dataset that used row means for missing values
d_sq_row <- sum(sig_rowavg^2)
nrg_row <- cumsum(sig_rowavg^2) / d_sq_row
plot(nrg_row, pch = 21, col = "red", cex = 0.5, xlab = 'Singular Value', ylab = 'Singular Values Energy')
lines(x = c(0, 332), y = c(0.8, 0.8))
k_row <- length(nrg_row[nrg_row <= 0.8]) + 1
x_row_new <- irlba(A_rowavg, nv = k_row)
U_row_new <- x_row_new$u
sig_row_new <- x_row_new$d
V_row_new <- x_row_new$v
# Energy threshold calculation for dataset that used raw mean for missing values
d_sq_raw <- sum(sig_rawavg^2)
nrg_raw <- cumsum(sig_rawavg^2) / d_sq_raw
plot(nrg_raw, pch = 21, col = "red", cex = 0.5, xlab = 'Singular Value', ylab = 'Singular Values Energy')
lines(x = c(0, 332), y = c(0.8, 0.8))
k_raw <- length(nrg_raw[nrg_raw <= 0.8]) + 1
x_raw_new <- irlba(A_rawavg, nv=k_raw)
U_raw_new <- x_raw_new$u
sig_raw_new <- x_raw_new$d
V_raw_new <- x_raw_new$v
# Energy threshold calculation for normalized dataset
d_sq_norm <- sum(sig_norm^2)
nrg_norm <- cumsum(sig_norm^2) / d_sq_norm
plot(nrg_norm, pch = 21, col = "red", cex = 0.5, xlab = 'Singular Value', ylab = 'Singular Values Energy')
lines(x = c(0, 332), y = c(0.8, 0.8))
k_norm <- length(nrg_norm[nrg_norm <= 0.8]) + 1
x_norm_new <- irlba(A_norm, nv = k_norm)
U_norm_new <- x_norm_new$u
sig_norm_new <- x_norm_new$d
V_norm_new <- x_norm_new$v
# Build table of # of singular values by missing value methodology
k_Method <- c("Using 0s","Using Raw Average","Using Row Averages","Using Normalization")
k_values <- c(k_0, k_raw, k_row, k_norm)
k_table <- rbind(k_Method, k_values)
kable(k_table) %>% kable_styling()

| k_Method | Using 0s | Using Raw Average | Using Row Averages | Using Normalization |
|---|---|---|---|---|
| k_values | 69 | 7 | 6 | 124 |
As can be seen in the table above, the number of singular values retained based on an 80% cut-off varies considerably depending on the way missing values were handled.
Next, we use the four sets of reduced matrices to create predictions:
# predictions
rm_pred <- as(rm_mx, "realRatingMatrix")
rm_norm <- as(A_norm, "realRatingMatrix")
# dataset that uses zeroes for missing values
pred_0 <- U_0_new %*% diag(sig_0_new) %*% t(V_0_new)
# clamp predictions to the valid rating range [0, 5]
pred_0[pred_0 > 5] <- 5
pred_0[pred_0 < 0] <- 0
colnames(pred_0) <- colnames(rm_mx)
rownames(pred_0) <- rownames(rm_mx)
x <- as(pred_0, "realRatingMatrix")
# error calculation
svd_0_er <- calcPredictionAccuracy(x = x, data = rm_pred)
# dataset that uses raw mean for missing values
pred_raw <- U_raw_new %*% diag(sig_raw_new) %*% t(V_raw_new)
pred_raw[pred_raw > 5] <- 5
pred_raw[pred_raw < 0] <- 0
colnames(pred_raw) <- colnames(rm_mx)
rownames(pred_raw) <- rownames(rm_mx)
y <- as(pred_raw, "realRatingMatrix")
# error calculation
svd_raw_er <- calcPredictionAccuracy(x = y, data = rm_pred)
# dataset that uses row means for missing values
pred_row <- U_row_new %*% diag(sig_row_new) %*% t(V_row_new)
pred_row[pred_row > 5] <- 5
pred_row[pred_row < 0] <- 0
colnames(pred_row) <- colnames(rm_mx)
rownames(pred_row) <- rownames(rm_mx)
z <- as(pred_row, "realRatingMatrix")
# error calculation
svd_row_er <- calcPredictionAccuracy(x = z, data = rm_pred)
# normalized dataset
pred_norm <- U_norm_new %*% diag(sig_norm_new) %*% t(V_norm_new)
colnames(pred_norm) <- colnames(rm_mx)
rownames(pred_norm) <- rownames(rm_mx)
w <- as(pred_norm, "realRatingMatrix")
# error calculation
svd_norm_er <- calcPredictionAccuracy(x = w, data = rm_norm)
# Build table showing error rates
# Build table of error rates by missing value methodology, sorted by RMSE
k_table_p <- data.frame(rbind(svd_0_er, svd_raw_er, svd_row_er, svd_norm_er))
k_table_p <- k_table_p[order(k_table_p$RMSE), ]
kable(k_table_p) %>% kable_styling()

| | RMSE | MSE | MAE |
|---|---|---|---|
| svd_norm_er | 0.2386952 | 0.0569754 | 0.1769569 |
| svd_0_er | 1.2155024 | 1.4774462 | 0.9576435 |
| svd_raw_er | 1.4960093 | 2.2380438 | 1.2432131 |
| svd_row_er | 1.5392566 | 2.3693109 | 1.2559457 |
Per the table above, the normalized matrix generates the most accurate predictions, though it is worth noting that its errors are measured on the normalized (mean-centered) scale rather than the raw 0 to 5 rating scale. As a final step, we will compare the SVD error accuracy with results from UBCF and IBCF.
# split dataset into the training and the test set
rm_1 <- as(rm_mx, "realRatingMatrix")
ev <- evaluationScheme(rm_1, method = "split", train = 0.8, given = 15, goodRating = 4)
# UBCF with cosine distance
ubcf_rec <- Recommender(getData(ev, "train"), "UBCF", param = list(normalize = "center", method = "cosine"))
# IBCF with cosine distance
ibcf_rec <- Recommender(getData(ev, "train"), "IBCF", param = list(normalize = "center", method = "cosine"))
# SVD via recommenderlab's built-in "SVD" method (column-mean imputation)
svd_rec <- Recommender(getData(ev, "train"), "SVD")
# predictions on the "known" ratings of the test users
ubcf_pred <- predict(ubcf_rec, getData(ev, "known"), type = "ratings")
ibcf_pred <- predict(ibcf_rec, getData(ev, "known"), type = "ratings")
svd_pred <- predict(svd_rec, getData(ev, "known"), type = "ratings")
#accuracy
ubcf_er <- calcPredictionAccuracy(ubcf_pred, getData(ev, "unknown"))
ibcf_er <- calcPredictionAccuracy(ibcf_pred, getData(ev, "unknown"))
svd_er <- calcPredictionAccuracy(svd_pred, getData(ev, "unknown"))
# Table showing error calcs for SVD vs UBCF and IBCF
k_table_er <- data.frame(rbind(ubcf_er, ibcf_er, svd_er))
error_table <- data.frame(rbind(k_table_er, k_table_p))
error_table <- error_table[order(error_table$RMSE), ]
kable(error_table) %>% kable_styling()

| | RMSE | MSE | MAE |
|---|---|---|---|
| svd_norm_er | 0.2386952 | 0.0569754 | 0.1769569 |
| svd_er | 1.0113761 | 1.0228816 | 0.8108733 |
| ubcf_er | 1.1414305 | 1.3028636 | 0.8979082 |
| svd_0_er | 1.2155024 | 1.4774462 | 0.9576435 |
| ibcf_er | 1.3910465 | 1.9350105 | 1.0558621 |
| svd_raw_er | 1.4960093 | 2.2380438 | 1.2432131 |
| svd_row_er | 1.5392566 | 2.3693109 | 1.2559457 |
We can see that SVD using a normalized matrix provides better predictions than either UBCF or IBCF.