1 Introduction - SVD Explained


In our previous assignment, we studied collaborative filtering and content filtering. Both have their shortcomings. Collaborative filtering recommends items based on the similarity of items and/or the similarity of users, but suffers from the cold start problem and an inability to consider side features. Content filtering is challenging because a profile of multiple characteristics must be built for each item.

In addition, users often associate items in a way which is not easily defined by the characteristics of items. Two movies, for example, might be liked by the same individuals while not having the same genre, actors, box office success, etc. For example, many users might associate two movies because they were both “true to life”. We could call these associations “concepts”. They are also referred to as latent dimensions or factors.

SVD allows us to take a matrix and decompose it in a way which may allow us to identify such concepts, which can then be used to drive recommendations. It is always possible to decompose a real matrix \(A\) into three matrices: \(U\) (left singular vectors), \(\Sigma\) (singular values), and \(V\) (right singular vectors). In terms of a user-item matrix, we can think of \(U\) as the “user-to-concept” similarity matrix, the diagonal values of \(\Sigma\) as the “strength” of the concepts, and \(V\) as the “item-to-concept” similarity matrix:

\[A = U \Sigma V^T\]

1.1 Illustrative “A” Matrix


Below we have a matrix of users (viewers), TV shows, and ratings. A zero value means the viewer has not rated the show:

Westworld GameOfThrones BigLittleLies TheJinx VicePrincipals Insecure
Viewer1 5 4 1 3 2 3
Viewer2 2 3 4 0 3 3
Viewer3 3 0 5 0 4 5
Viewer4 0 0 3 0 4 3
Viewer5 4 4 0 4 1 2
Viewer6 0 4 3 3 3 3
Viewer7 5 3 4 4 0 0
Viewer8 2 5 3 5 0 0


1.2 Decomposition into \(U\), \(\Sigma\), and \(V\)


Below, the illustrative matrix is decomposed into \(U\) (left singular vectors), \(\Sigma\) (singular values), and \(V\) (right singular vectors).
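
The decomposition shown below can be reproduced with the base R function svd. This is a minimal sketch: the matrix values come directly from the table above, and the variable names are only illustrative.

# build the 8 x 6 viewer-by-show rating matrix from the table above
A <- matrix(c(5, 4, 1, 3, 2, 3,
              2, 3, 4, 0, 3, 3,
              3, 0, 5, 0, 4, 5,
              0, 0, 3, 0, 4, 3,
              4, 4, 0, 4, 1, 2,
              0, 4, 3, 3, 3, 3,
              5, 3, 4, 4, 0, 0,
              2, 5, 3, 5, 0, 0),
            nrow = 8, byrow = TRUE)
# svd() returns the left singular vectors (u), the singular values (d),
# and the right singular vectors (v)
s <- svd(A)
s$u  # viewer-to-concept similarity
s$d  # concept strengths (diagonal of Sigma)
s$v  # show-to-concept similarity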

The \(U\) matrix is a “viewer-to-concept” similarity matrix:

-0.4147750 0.1425228 -0.3927796 -0.4323190 -0.1424352 0.1425331
-0.3395167 -0.2829300 0.0856209 0.0667745 -0.8397469 0.1168574
-0.3670956 -0.5670215 -0.3197448 0.1976414 0.2503563 -0.5656331
-0.2035054 -0.4668349 0.1706882 -0.0348001 0.3756510 0.7161290
-0.3501219 0.3099925 -0.2124608 -0.4716679 0.1971777 -0.0258139
-0.3581909 -0.0959223 0.6273570 -0.2474209 0.1319163 -0.1989212
-0.3895683 0.3041212 -0.2540980 0.6491364 0.1122135 0.2604425
-0.3658783 0.4029932 0.4513754 0.2515999 0.0432731 -0.1586847

The diagonal values of \(\Sigma\) (as a matrix) provide the “strength” of the concepts:

17.94313 0.000000 0.000000 0.000000 0.000000 0.0000000
0.00000 9.365004 0.000000 0.000000 0.000000 0.0000000
0.00000 0.000000 4.702391 0.000000 0.000000 0.0000000
0.00000 0.000000 0.000000 4.220175 0.000000 0.0000000
0.00000 0.000000 0.000000 0.000000 1.925994 0.0000000
0.00000 0.000000 0.000000 0.000000 0.000000 0.8419046

And the \(V\) matrix is the “TV show-to-concept” similarity matrix:

-0.4421903 0.2148693 -0.8441394 0.1012017 -0.1060627 0.1556809
-0.4742203 0.3742563 0.3912716 -0.2843243 -0.6332312 0.0114839
-0.4430283 -0.3296434 0.2302764 0.7885143 -0.0769815 -0.1198035
-0.3960874 0.4923887 0.2327270 -0.0169059 0.7385139 -0.0285936
-0.3096002 -0.4994066 0.1158320 -0.2907176 0.1520487 0.7305396
-0.3613463 -0.4617847 -0.1171715 -0.4498450 0.1154171 -0.6532786

One challenge is interpreting the concepts. What do they represent? In this case, we have six TV shows which arguably can be put into four genres (“Westworld” and “Game of Thrones” into “scifi/fantasy”, “Big Little Lies” into “drama”, “The Jinx” into “documentary”, and “Vice Principals” and “Insecure” into “comedy”). Perhaps some of the shows share directors, cast members, etc. The \(\Sigma\) matrix shows six singular values, yet 90% of the energy (relative strength of the singular values) is encompassed in the first two singular values. We will find that we can drop the other four singular values to create smaller matrices (dimensionality reduction) and use those to make predictions. But what do the two selected concepts represent?

Ultimately, it doesn’t matter. SVD brings out the latent dimensions/factors and uses them to recommend shows. Knowing exactly what the latent dimensions/factors are is secondary to the power of SVD.
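
As a quick check of the 90% energy figure mentioned above, we can verify it directly from the singular values shown in the \(\Sigma\) matrix (a minimal sketch in base R):

# singular values from the Sigma matrix above
d <- c(17.94313, 9.365004, 4.702391, 4.220175, 1.925994, 0.8419046)
# cumulative share of the energy (squared singular values)
round(cumsum(d^2) / sum(d^2), 3)
# the first two singular values account for roughly 90% of the total energy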

We can demonstrate this more clearly by using SVD on a real dataset and using dimensionality reduction to make predictions.


2 Using SVD to Make Recommendations


Our data set for this section is the MovieLense dataset from the recommenderlab R package, subsetted in the same manner as on pg. 78 of the textbook (see References; a sketch of the code appears below):
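
A sketch of that subsetting step follows. The thresholds (more than 50 ratings per user, more than 100 ratings per movie) are assumptions based on the standard recommenderlab example; the variable name rm matches the name referenced later in this report.

# load recommenderlab and the MovieLense data, then subset as in the textbook
# (thresholds below are assumptions, not confirmed from the original chunk)
library(recommenderlab)
data(MovieLense)
rm <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]
rm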

## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.


2.1 Create “A” Matrix

Per the output above, rm is a realRatingMatrix consisting of 560 users and 332 movies. The next step is to convert it to a matrix and decompose the matrix via SVD.
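
A minimal sketch of that conversion, assuming the variable names rm and rm_mx used in the prediction code later on:

# convert the realRatingMatrix to a plain R matrix; unrated cells become NA
rm_mx <- as(rm, "matrix")
# proportion of user/movie combinations with no rating
sum(is.na(rm_mx)) / length(rm_mx)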

## [1] 0.702571


2.2 Address Missing Values

A common challenge for any recommender framework is the fact that most items are not rated. As we can see above, in this case, 70.26% of the user/movie combinations have no rating. SVD requires that there be no missing values in the matrix. We will attempt to address this in four ways (a sketch follows the list):

  1. Convert missing values to 0
  2. Convert missing values to the row mean for each row
  3. Convert missing values to the raw average for the matrix
  4. Use a normalized matrix
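
A minimal sketch of the four approaches. The A_* names other than A_norm are illustrative; A_norm and rm_mx match the names used in the prediction code below.

# 1. missing -> 0
A_0 <- rm_mx
A_0[is.na(A_0)] <- 0
# 2. missing -> the user's (row) mean rating
A_row <- rm_mx
row_means <- rowMeans(rm_mx, na.rm = TRUE)
A_row[is.na(A_row)] <- row_means[row(A_row)[is.na(A_row)]]
# 3. missing -> the raw average of all observed ratings
A_raw <- rm_mx
A_raw[is.na(A_raw)] <- mean(rm_mx, na.rm = TRUE)
# 4. normalize (center each user's ratings), then set the remaining NAs to 0
A_norm <- as(normalize(rm), "matrix")
A_norm[is.na(A_norm)] <- 0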

2.3 Dimensionality Reduction on All Four Datasets

Now that we have created four different matrices which handled missing values differently, we are ready to decompose the matrices:
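
A sketch of the decomposition step (the svd_* names are illustrative; each call returns the u, d, and v components used below):

# decompose each imputed matrix with base R's svd()
svd_0    <- svd(A_0)
svd_raw  <- svd(A_raw)
svd_row  <- svd(A_row)
svd_norm <- svd(A_norm)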

Now we can reduce the dimensionality. We do this by keeping some, but not all, of the singular values: the rest are set to zero in the \(\Sigma\) matrix, and the corresponding columns of \(U\) and \(V\) are dropped to form new, smaller matrices. The decision on which singular values to keep is driven by the amount of “energy” they contribute, where the energy of a singular value is its square.

Fortunately, SVD produces a \(\Sigma\) matrix whose diagonal values are the singular values sorted from highest to lowest. If we decide to keep 80% of the energy, for example, we keep the leading singular values whose squares add up to about 80% of the sum of all squared singular values. The R function svd provides all of the singular values in a vector, so we can determine how many are needed to reach 80% of the energy and then create new matrices using the reduced \(\Sigma\) matrices:
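
A sketch of the 80% cut-off applied to the zero-filled matrix. The helper keep_k is hypothetical; the same steps apply to the other three decompositions, and the *_new names match the prediction code below.

# smallest k such that the first k squared singular values cover 80% of the energy
keep_k <- function(d, energy = 0.80) {
  which(cumsum(d^2) / sum(d^2) >= energy)[1]
}
k_0 <- keep_k(svd_0$d)
# truncate U, Sigma, and V to the first k_0 concepts
U_0_new   <- svd_0$u[, 1:k_0]
sig_0_new <- svd_0$d[1:k_0]
V_0_new   <- svd_0$v[, 1:k_0]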

k_Method   Using 0s   Using Raw Average   Using Row Averages   Using Normalization
k_values         69                   7                    6                   124

As can be seen in the table above, the number of singular values retained based on an 80% cut-off varies considerably depending on the way missing values were handled.

Next, we reduce the dimensionality of our four matrices and use them to create predictions:

# observed ratings as realRatingMatrix objects, used as the reference data
# for the error calculations below
rm_pred <- as(rm_mx, "realRatingMatrix")
rm_norm <- as(A_norm, "realRatingMatrix")
# dataset that uses zeroes for missing values
pred_0 <- U_0_new %*% diag(sig_0_new) %*% t(V_0_new)
# clamp predictions to the valid 0-5 rating range
pred_0[pred_0 > 5] <- 5
pred_0[pred_0 < 0] <- 0
colnames(pred_0) <- colnames(rm_mx)
rownames(pred_0) <- rownames(rm_mx)
x <- as(pred_0, "realRatingMatrix")
# error calculation
svd_0_er <- calcPredictionAccuracy(x = x, data = rm_pred)
# dataset that uses raw mean for missing values
pred_raw <- U_raw_new %*% diag(sig_raw_new) %*% t(V_raw_new)
pred_raw[pred_raw > 5] <- 5
pred_raw[pred_raw < 0] <- 0
colnames(pred_raw) <- colnames(rm_mx)
rownames(pred_raw) <- rownames(rm_mx)
y <- as(pred_raw, "realRatingMatrix")
# error calculation
svd_raw_er <- calcPredictionAccuracy(x = y, data = rm_pred)
# dataset that uses row means for missing values
pred_row <- U_row_new %*% diag(sig_row_new) %*% t(V_row_new)
pred_row[pred_row > 5] <- 5
pred_row[pred_row < 0] <- 0
colnames(pred_row) <- colnames(rm_mx)
rownames(pred_row) <- rownames(rm_mx)
z <- as(pred_row, "realRatingMatrix")
# error calculation
svd_row_er <- calcPredictionAccuracy(x = z, data = rm_pred)
# normalized dataset
pred_norm <- U_norm_new %*% diag(sig_norm_new) %*% t(V_norm_new)
colnames(pred_norm) <- colnames(rm_mx)
rownames(pred_norm) <- rownames(rm_mx)
w <- as(pred_norm, "realRatingMatrix")
# error calculation
svd_norm_er <- calcPredictionAccuracy(x = w, data = rm_norm)
# Build table showing error rates
k_Method <- c("Using 0s","Using Raw Average","Using Row Averages","Using Normalization")
k_values_p <- c(svd_0_er, svd_raw_er, svd_row_er, svd_norm_er)
k_table_p <- data.frame(rbind(svd_0_er, svd_raw_er, svd_row_er, svd_norm_er))
k_table_p <- k_table_p[order(k_table_p$RMSE ),]
kable(k_table_p) %>% kable_styling()
                 RMSE        MSE        MAE
svd_norm_er 0.2386952  0.0569754  0.1769569
svd_0_er    1.2155024  1.4774462  0.9576435
svd_raw_er  1.4960093  2.2380438  1.2432131
svd_row_er  1.5392566  2.3693109  1.2559457

Per the table above, the normalized matrix generates the most accurate predictions. As a final step, we will compare the SVD prediction accuracy with the results obtained using UBCF and IBCF.

2.4 Comparing SVD with UBCF and IBCF methods
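
The UBCF and IBCF error rows in the table below were presumably produced with recommenderlab's evaluation framework. Below is a minimal sketch under assumed settings; the split proportion, given, and goodRating values are illustrative assumptions, not necessarily those used to generate the table.

# sketch: evaluate UBCF and IBCF on the same subsetted data (settings are assumptions)
set.seed(1)
es <- evaluationScheme(rm, method = "split", train = 0.8, given = 15, goodRating = 3)
ubcf <- Recommender(getData(es, "train"), method = "UBCF")
ibcf <- Recommender(getData(es, "train"), method = "IBCF")
ubcf_pred <- predict(ubcf, getData(es, "known"), type = "ratings")
ibcf_pred <- predict(ibcf, getData(es, "known"), type = "ratings")
ubcf_er <- calcPredictionAccuracy(ubcf_pred, getData(es, "unknown"))
ibcf_er <- calcPredictionAccuracy(ibcf_pred, getData(es, "unknown"))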

                 RMSE        MSE        MAE
svd_norm_er 0.2386952  0.0569754  0.1769569
svd_er      1.0113761  1.0228816  0.8108733
ubcf_er     1.1414305  1.3028636  0.8979082
svd_0_er    1.2155024  1.4774462  0.9576435
ibcf_er     1.3910465  1.9350105  1.0558621
svd_raw_er  1.4960093  2.2380438  1.2432131
svd_row_er  1.5392566  2.3693109  1.2559457

We can see that SVD using a normalized matrix provides better predictions than UBCF or IBCF.