1 Introduction - SVD Explained


In our previous assignment, we studied collaborative filtering and content filtering. Both have their shortcomings. Collaborative filtering recommends items based on the similarity of items and/or the similarity of users, but suffers from the cold start problem and an inability to consider side features. Content filtering is challenging because a profile of multiple characteristics must be built for each item.

In addition, users often associate items in a way which is not easily defined by the characteristics of items. Two movies, for example, might be liked by the same individuals while not having the same genre, actors, box office success, etc. For example, many users might associate two movies because they were both “true to life”. We could call these associations “concepts”. They are also referred to as latent dimensions or factors.

SVD allows us to take a matrix and decompose it in a way which may allow us to identify such concepts, which can then be used to drive recommendations. It is always possible to decompose a real matrix \(A\) into three matrices: \(U\) (left singular vectors), \(\Sigma\) (singular values), and \(V\) (right singular vectors). In terms of a user-item matrix, we can think of \(U\) as the “user-to-concept” similarity matrix, the diagonal values of \(\Sigma\) as the “strength” of the concepts, and \(V\) as the “item-to-concept” similarity matrix:

\[A = U \Sigma V^T\]

1.1 Illustrative “A” Matrix


Below we have a matrix of users (viewers), TV shows, and ratings. A zero value means the viewer has not rated the show:

Westworld GameOfThrones BigLittleLies TheJinx VicePrincipals Insecure
Viewer1 5 4 1 3 2 3
Viewer2 2 3 4 0 3 3
Viewer3 3 0 5 0 4 5
Viewer4 0 0 3 0 4 3
Viewer5 4 4 0 4 1 2
Viewer6 0 4 3 3 3 3
Viewer7 5 3 4 4 0 0
Viewer8 2 5 3 5 0 0


1.2 Decomposition into \(U\), \(\Sigma\), and \(V\)


Below, the illustrative matrix is decomposed into \(U\) (left singular vectors), \(\Sigma\) (singular values), and \(V\) (right singular vectors).
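
The decomposition shown below can be reproduced with the base R function svd. This is a minimal sketch: the matrix values come directly from the table above, and the variable names are only illustrative.

# build the 8 x 6 viewer-by-show rating matrix from the table above
A <- matrix(c(5, 4, 1, 3, 2, 3,
              2, 3, 4, 0, 3, 3,
              3, 0, 5, 0, 4, 5,
              0, 0, 3, 0, 4, 3,
              4, 4, 0, 4, 1, 2,
              0, 4, 3, 3, 3, 3,
              5, 3, 4, 4, 0, 0,
              2, 5, 3, 5, 0, 0),
            nrow = 8, byrow = TRUE)
# svd() returns the left singular vectors (u), the singular values (d),
# and the right singular vectors (v)
s <- svd(A)
s$u  # viewer-to-concept similarity
s$d  # concept strengths (diagonal of Sigma)
s$v  # show-to-concept similarity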

The \(U\) matrix is a “viewer-to-concept” similarity matrix:

-0.4147750 0.1425228 -0.3927796 -0.4323190 -0.1424352 0.1425331
-0.3395167 -0.2829300 0.0856209 0.0667745 -0.8397469 0.1168574
-0.3670956 -0.5670215 -0.3197448 0.1976414 0.2503563 -0.5656331
-0.2035054 -0.4668349 0.1706882 -0.0348001 0.3756510 0.7161290
-0.3501219 0.3099925 -0.2124608 -0.4716679 0.1971777 -0.0258139
-0.3581909 -0.0959223 0.6273570 -0.2474209 0.1319163 -0.1989212
-0.3895683 0.3041212 -0.2540980 0.6491364 0.1122135 0.2604425
-0.3658783 0.4029932 0.4513754 0.2515999 0.0432731 -0.1586847

The diagonal values of \(\Sigma\) (as a matrix) provide the “strength” of the concepts:

17.94313 0.000000 0.000000 0.000000 0.000000 0.0000000
0.00000 9.365004 0.000000 0.000000 0.000000 0.0000000
0.00000 0.000000 4.702391 0.000000 0.000000 0.0000000
0.00000 0.000000 0.000000 4.220175 0.000000 0.0000000
0.00000 0.000000 0.000000 0.000000 1.925994 0.0000000
0.00000 0.000000 0.000000 0.000000 0.000000 0.8419046

And the \(V\) matrix is the “TV show-to-concept” similarity matrix:

-0.4421903 0.2148693 -0.8441394 0.1012017 -0.1060627 0.1556809
-0.4742203 0.3742563 0.3912716 -0.2843243 -0.6332312 0.0114839
-0.4430283 -0.3296434 0.2302764 0.7885143 -0.0769815 -0.1198035
-0.3960874 0.4923887 0.2327270 -0.0169059 0.7385139 -0.0285936
-0.3096002 -0.4994066 0.1158320 -0.2907176 0.1520487 0.7305396
-0.3613463 -0.4617847 -0.1171715 -0.4498450 0.1154171 -0.6532786

One challenge is interpreting the concepts. What do they represent? In this case, we have six TV shows which arguably can be put into four genres (“Westworld” and “Game of Thrones” into “scifi/fantasy”, “Big Little Lies” into “drama”, “The Jinx” into “documentary”, and “Vice Principals” and “Insecure” into “comedy”). Perhaps some of the shows share directors, cast members, etc. The \(\Sigma\) matrix shows six singular values, yet 90% of the energy (relative strength of the singular values) is encompassed in the first two singular values. We will find that we can drop the other four singular values to create smaller matrices (dimensionality reduction) and use those to make predictions. But what do the two selected concepts represent?

Ultimately, it doesn’t matter. SVD brings out the latent dimensions/factors and uses them to recommend shows. Knowing exactly what the latent dimensions/factors are is secondary to the power of SVD.
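
As a quick check of the 90% energy figure mentioned above, we can verify it directly from the singular values shown in the \(\Sigma\) matrix (a minimal sketch in base R):

# singular values from the Sigma matrix above
d <- c(17.94313, 9.365004, 4.702391, 4.220175, 1.925994, 0.8419046)
# cumulative share of the energy (squared singular values)
round(cumsum(d^2) / sum(d^2), 3)
# the first two singular values account for roughly 90% of the total energy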

We can demonstrate this more clearly by using SVD on a real dataset and using dimensionality reduction to make predictions.


2 Using SVD to Make Recommendations


Our data set for this section is the MovieLense dataset from the recommenderlab R package, subsetted in the same manner as on pg. 78 of the textbook (see References; a sketch of the code appears below):
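
A sketch of that subsetting step follows. The thresholds (more than 50 ratings per user, more than 100 ratings per movie) are assumptions based on the standard recommenderlab example; the variable name rm matches the name referenced later in this report.

# load recommenderlab and the MovieLense data, then subset as in the textbook
# (thresholds below are assumptions, not confirmed from the original chunk)
library(recommenderlab)
data(MovieLense)
rm <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]
rm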

## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.


2.1 Create “A” Matrix

Per the output above, rm is a realRatingMatrix consisting of 560 users and 332 movies. The next step is to convert it to a matrix and decompose the matrix via SVD.
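
A minimal sketch of that conversion, assuming the variable names rm and rm_mx used in the prediction code later on:

# convert the realRatingMatrix to a plain R matrix; unrated cells become NA
rm_mx <- as(rm, "matrix")
# proportion of user/movie combinations with no rating
sum(is.na(rm_mx)) / length(rm_mx)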

## [1] 0.702571


2.2 Address Missing Values

A common challenge for any recommender framework is the fact that most items are not rated. As we can see above, in this case, 70.26% of the user/movie combinations have no rating. SVD requires that there be no missing values in the matrix. We will attempt to address this in four ways (a sketch follows the list):

  1. Convert missing values to 0
  2. Convert missing values to the row mean for each row
  3. Convert missing values to the raw average for the matrix
  4. Use a normalized matrix
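
A minimal sketch of the four approaches. The A_* names other than A_norm are illustrative; A_norm and rm_mx match the names used in the prediction code below.

# 1. missing -> 0
A_0 <- rm_mx
A_0[is.na(A_0)] <- 0
# 2. missing -> the user's (row) mean rating
A_row <- rm_mx
row_means <- rowMeans(rm_mx, na.rm = TRUE)
A_row[is.na(A_row)] <- row_means[row(A_row)[is.na(A_row)]]
# 3. missing -> the raw average of all observed ratings
A_raw <- rm_mx
A_raw[is.na(A_raw)] <- mean(rm_mx, na.rm = TRUE)
# 4. normalize (center each user's ratings), then set the remaining NAs to 0
A_norm <- as(normalize(rm), "matrix")
A_norm[is.na(A_norm)] <- 0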

2.3 Dimensionality Reduction on All Four Datasets

Now that we have created four different matrices which handled missing values differently, we are ready to decompose the matrices:
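
A sketch of the decomposition step (the svd_* names are illustrative; each call returns the u, d, and v components used below):

# decompose each imputed matrix with base R's svd()
svd_0    <- svd(A_0)
svd_raw  <- svd(A_raw)
svd_row  <- svd(A_row)
svd_norm <- svd(A_norm)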

Now we can reduce the dimensionality. We do this by keeping some, but not all, of the singular values: the rest are set to zero in the \(\Sigma\) matrix, and the corresponding columns of \(U\) and \(V\) are dropped to form new, smaller matrices. The decision on which singular values to keep is driven by the amount of “energy” they contribute, where the energy of a singular value is its square.

Fortunately, SVD produces a \(\Sigma\) matrix whose diagonal values are the singular values sorted from highest to lowest. If we decide to keep 80% of the energy, for example, we keep the leading singular values whose squares add up to about 80% of the sum of all squared singular values. The R function svd provides all of the singular values in a vector, so we can determine how many are needed to reach 80% of the energy and then create new matrices using the reduced \(\Sigma\) matrices:
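
A sketch of the 80% cut-off applied to the zero-filled matrix. The helper keep_k is hypothetical; the same steps apply to the other three decompositions, and the *_new names match the prediction code below.

# smallest k such that the first k squared singular values cover 80% of the energy
keep_k <- function(d, energy = 0.80) {
  which(cumsum(d^2) / sum(d^2) >= energy)[1]
}
k_0 <- keep_k(svd_0$d)
# truncate U, Sigma, and V to the first k_0 concepts
U_0_new   <- svd_0$u[, 1:k_0]
sig_0_new <- svd_0$d[1:k_0]
V_0_new   <- svd_0$v[, 1:k_0]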

k_Method   Using 0s   Using Raw Average   Using Row Averages   Using Normalization
k_values         69                   7                    6                   124

As can be seen in the table above, the number of singular values retained based on an 80% cut-off varies considerably depending on the way missing values were handled.

Next, we reduce the dimensionality of our four matrices and use them to create predictions:

# observed ratings as realRatingMatrix objects, used as the reference data
# for the error calculations below
rm_pred <- as(rm_mx, "realRatingMatrix")
rm_norm <- as(A_norm, "realRatingMatrix")
# dataset that uses zeroes for missing values
pred_0 <- U_0_new %*% diag(sig_0_new) %*% t(V_0_new)
# clamp predictions to the valid 0-5 rating range
pred_0[pred_0 > 5] <- 5
pred_0[pred_0 < 0] <- 0
colnames(pred_0) <- colnames(rm_mx)
rownames(pred_0) <- rownames(rm_mx)
x <- as(pred_0, "realRatingMatrix")
# error calculation
svd_0_er <- calcPredictionAccuracy(x = x, data = rm_pred)
# dataset that uses raw mean for missing values
pred_raw <- U_raw_new %*% diag(sig_raw_new) %*% t(V_raw_new)
pred_raw[pred_raw > 5] <- 5
pred_raw[pred_raw < 0] <- 0
colnames(pred_raw) <- colnames(rm_mx)
rownames(pred_raw) <- rownames(rm_mx)
y <- as(pred_raw, "realRatingMatrix")
# error calculation
svd_raw_er <- calcPredictionAccuracy(x = y, data = rm_pred)
# dataset that uses row means for missing values
pred_row <- U_row_new %*% diag(sig_row_new) %*% t(V_row_new)
pred_row[pred_row > 5] <- 5
pred_row[pred_row < 0] <- 0
colnames(pred_row) <- colnames(rm_mx)
rownames(pred_row) <- rownames(rm_mx)
z <- as(pred_row, "realRatingMatrix")
# error calculation
svd_row_er <- calcPredictionAccuracy(x = z, data = rm_pred)
# normalized dataset
pred_norm <- U_norm_new %*% diag(sig_norm_new) %*% t(V_norm_new)
colnames(pred_norm) <- colnames(rm_mx)
rownames(pred_norm) <- rownames(rm_mx)
w <- as(pred_norm, "realRatingMatrix")
# error calculation
svd_norm_er <- calcPredictionAccuracy(x = w, data = rm_norm)
# Build table showing error rates
k_Method <- c("Using 0s","Using Raw Average","Using Row Averages","Using Normalization")
k_values_p <- c(svd_0_er, svd_raw_er, svd_row_er, svd_norm_er)
k_table_p <- data.frame(rbind(svd_0_er, svd_raw_er, svd_row_er, svd_norm_er))
k_table_p <- k_table_p[order(k_table_p$RMSE ),]
kable(k_table_p) %>% kable_styling()
                 RMSE        MSE        MAE
svd_norm_er 0.2386952  0.0569754  0.1769569
svd_0_er    1.2155024  1.4774462  0.9576435
svd_raw_er  1.4960093  2.2380438  1.2432131
svd_row_er  1.5392566  2.3693109  1.2559457

Per the table above, the normalized matrix generates the most accurate predictions. As a final step, we will compare the SVD prediction accuracy with the results obtained using UBCF and IBCF.

2.4 Comparing SVD with UBCF and IBCF methods
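
The UBCF and IBCF error rows in the table below were presumably produced with recommenderlab's evaluation framework. Below is a minimal sketch under assumed settings; the split proportion, given, and goodRating values are illustrative assumptions, not necessarily those used to generate the table.

# sketch: evaluate UBCF and IBCF on the same subsetted data (settings are assumptions)
set.seed(1)
es <- evaluationScheme(rm, method = "split", train = 0.8, given = 15, goodRating = 3)
ubcf <- Recommender(getData(es, "train"), method = "UBCF")
ibcf <- Recommender(getData(es, "train"), method = "IBCF")
ubcf_pred <- predict(ubcf, getData(es, "known"), type = "ratings")
ibcf_pred <- predict(ibcf, getData(es, "known"), type = "ratings")
ubcf_er <- calcPredictionAccuracy(ubcf_pred, getData(es, "unknown"))
ibcf_er <- calcPredictionAccuracy(ibcf_pred, getData(es, "unknown"))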

                 RMSE        MSE        MAE
svd_norm_er 0.2386952  0.0569754  0.1769569
svd_er      1.0113761  1.0228816  0.8108733
ubcf_er     1.1414305  1.3028636  0.8979082
svd_0_er    1.2155024  1.4774462  0.9576435
ibcf_er     1.3910465  1.9350105  1.0558621
svd_raw_er  1.4960093  2.2380438  1.2432131
svd_row_er  1.5392566  2.3693109  1.2559457

We can see that SVD using a normalized matrix provides better predictions than UBCF or IBCF.