Introduction
In this project, we review matrix factorization methods using the same data set as in the earlier project (MovieLense). In the previous project, we used the “realRatingMatrix” class to create the movie ratings matrix. More details on “realRatingMatrix” can be found here: https://www.rdocumentation.org/packages/recommenderlab/versions/0.2-5/topics/realRatingMatrix . In this project, we implement SVD using the realRatingMatrix class and the recommenderlab package, keeping the matrix as a sparse matrix of class dgCMatrix and replacing the NA (or 0) values with the baseline predictor.
# Load Libraries
library(tidyverse)
library(kableExtra)
library(recommenderlab)
library(ggplot2)
library(caTools)
Model Development Approach 1
Matrix Factorization
Users and items are mapped to a joint latent factor space of dimensionality \(f\), so that user-item interactions are modeled as inner products in that space.
\(i \rightarrow \text{item}\)
\(q_{i} \rightarrow \text{item vector}\)
\(p_{u} \rightarrow \text{user vector}\)
For a given item \(i\), the elements of \(q_{i}\) measure the extent to which the item possesses those factors, positive or negative. For a given user \(u\), the elements of \(p_{u}\) measure the extent of interest the user has in items that are high on the corresponding factors, again positive or negative. The resulting dot product, \(q_{i}^{T}p_{u}\), captures the interaction between user \(u\) and item \(i\), that is, the user’s overall interest in the item’s characteristics. This approximates user \(u\)’s rating of item \(i\), which is denoted by \(r_{ui}\), leading to the estimate \(\hat{r}_{ui}=q_{i}^{T}p_{u}\). Singular value decomposition (SVD) is one such model: an approach to identifying latent semantic factors.
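As a small illustration of the estimate described above, the sketch below computes the inner product for one user and one item. The three latent factors and all numeric values are made up for the example; they do not come from the MovieLense data.

```r
# Toy example with f = 3 latent factors (values are invented for illustration)
q_i <- c(0.9, -0.2, 0.4)   # item vector: how strongly item i expresses each factor
p_u <- c(1.5,  0.5, 2.0)   # user vector: how much user u cares about each factor

# Estimated rating: the inner product of the item and user vectors
r_hat_ui <- sum(q_i * p_u)
r_hat_ui
```

A negative entry in \(q_{i}\) paired with a positive entry in \(p_{u}\) lowers the estimate, which is how a factor the user likes but the item lacks reduces the predicted rating.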
In a nutshell, SVD is
\(R=PAQ^{T}\)
We have three matrices, \(P\), \(A\) and \(Q\); multiplying \(P\), \(A\) and \(Q^{T}\) gives the matrix \(R\). \(R\) is the \(m \times n\) ratings matrix, \(P\) is the \(m \times k\) user feature affinity matrix, \(Q\) is the \(n \times k\) item feature relevance matrix and \(A\) is the \(k \times k\) diagonal feature weight matrix. The original matrix \(R\) can be estimated by the product of these matrices.
** SVD describes preference in terms of latent features
** These features are learned from the rating data
** As explained at the beginning, SVD defines a shared vector space for users and items.
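The decomposition can be checked on a small matrix with base R's svd(), which returns the singular values as d and the two factor matrices as u and v. This is a minimal sketch on an invented 4 x 3 ratings matrix, not the MovieLense data; the names P, A and Q simply mirror the notation used here.

```r
# Small made-up ratings matrix: 4 users (rows) x 3 items (columns)
R <- matrix(c(5, 3, 1,
              4, 3, 1,
              1, 1, 5,
              2, 1, 4),
            nrow = 4, byrow = TRUE)

s <- svd(R)      # base R: list with d (singular values), u and v
P <- s$u         # m x k user feature matrix
A <- diag(s$d)   # k x k diagonal weight matrix
Q <- s$v         # n x k item feature matrix

# Multiplying the three factors back together recovers R
R_rebuilt <- P %*% A %*% t(Q)
max(abs(R - R_rebuilt))   # numerically close to zero
```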
We already created movie_lense_matrix, and we know from the previous project that we need to subset it.
# subset the dataset
movies_1 <- movie_lense_matrix[rowCounts(MovieLense) > 100, colCounts(MovieLense) > 100]
#replace 0 with NA
movies_1[,][movies_1[,] == 0] <- NA
summary(as.vector(as.matrix(movies_1)))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 3.00 4.00 3.69 4.00 5.00 73782
Data Preparation
SVD requires that there are no missing values. Our subset matrix has 73,782 missing values, which we could simply replace with the mean value. Instead, we use the baseline predictor approach: we calculate the user and item biases in the matrix, then replace each missing value with the sum of the raw mean, the user bias and the item bias.
# get mean value of the matrix
raw_mean <- mean(as.vector(as.matrix(movies_1)), na.rm = TRUE )
raw_mean
## [1] 3.69468
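# count number of non-NA's in each row of the subset matrix
row_valid <- rowSums(!is.na(movies_1[,]))
# count number of non-NA's in each column of the subset matrix
col_valid <- colSums(!is.na(movies_1[,]))
# calculate user biases
# note: rowMeans() already averages over the non-NA entries, so the extra
# division by row_valid scales the biases down; the output below reflects this
user_biases <- rowMeans(movies_1[,] - raw_mean, na.rm = TRUE) / row_valid
# calculate item biases (same extra scaling, here by col_valid)
item_biases <- colMeans(movies_1[,] - raw_mean, na.rm = TRUE) / col_valid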
# memory cleanup
rm(row_valid, col_valid)
for (i in 1:nrow(movies_1)) {
  for (j in 1:ncol(movies_1)) {
    # if the matrix element has an NA, fill in with baseline predictor
    if (is.na(movies_1[i,j])) {
      movies_1[i,j] <- raw_mean + user_biases[i] + item_biases[j]
      # ensure new values are within valid ratings bounds
      if (movies_1[i,j] > 5) movies_1[i,j] <- 5
      if (movies_1[i,j] < 0) movies_1[i,j] <- 0
    } # end if
  } # end for j
} # end for i
summary(as.vector(as.matrix(movies_1)))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.688 3.695 3.694 3.704 5.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.457e-02 -1.888e-03 3.498e-04 -5.813e-05 2.328e-03 1.136e-02
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0213567 -0.0028758 -0.0002285 -0.0010286 0.0015623 0.0106482
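The cell-by-cell fill can also be sketched without loops. The example below is an illustration on an invented 3 x 3 matrix and makes one assumption that should be stated plainly: it takes each bias to be the plain mean deviation from the raw mean (rowMeans and colMeans with no further scaling), which is the common textbook form and is not identical to the scaled biases computed above. The names mu, b_u, b_i, baseline and filled are hypothetical.

```r
# Made-up 3 x 3 ratings matrix with missing entries
R <- matrix(c( 5, NA,  3,
               4,  2, NA,
              NA,  1,  2),
            nrow = 3, byrow = TRUE)

mu  <- mean(R, na.rm = TRUE)            # raw mean of the observed ratings
b_u <- rowMeans(R - mu, na.rm = TRUE)   # user bias: mean deviation per row (assumed form)
b_i <- colMeans(R - mu, na.rm = TRUE)   # item bias: mean deviation per column (assumed form)

# Baseline for every cell: mu + b_u[row] + b_i[column]
baseline <- mu + outer(b_u, b_i, "+")
baseline <- pmin(pmax(baseline, 0), 5)  # keep values inside the 0 to 5 range

filled <- R
filled[is.na(R)] <- baseline[is.na(R)]  # only the missing cells are replaced
filled
```

Because outer() builds the full grid of user-plus-item offsets in one call, the observed ratings stay untouched and only the NA positions receive baseline values.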
We handled the missing values and now we can start calculating the SVD.
rank <- qr(as.matrix(movies_1))$rank
rank
## [1] 332
The rank is 332, which equals the number of columns of our 358 x 332 subset matrix, so the matrix has full column rank. Referring back to \(R=PAQ^{T}\):
Matrix P will be 358 * 332
Matrix A will be 332 * 332
Matrix Q will be 332 * 332
Calculating the SVD.
# calculate svd
movies_1_svd <- svd(as.matrix(movies_1))

Plotting the singular values shows that they are low across most of the index range from 0 to 300, with the larger values concentrated at the start.
Dimensionality Reduction
Singular value decomposition allows an exact representation of any matrix, and it also lets us eliminate the less important features of that representation to produce an approximate representation with any desired number of dimensions. The fewer dimensions we choose, the less accurate the approximation will be. Suppose we have a huge matrix \(R\) with its components \(P\), \(A\) and \(Q\), all of them large. The best way to reduce the dimensionality of the three matrices is to set the smallest singular values to zero. If we set the smallest singular values to 0, we can also eliminate the corresponding columns of \(P\) and \(Q\). To decide how many to keep, we sum the squares of all singular values and then identify the first \(k\) singular values in matrix \(A\) whose squares account for a chosen share of that total. Based on the plotted singular values, we expect the cutoff to fall near the start of the sequence.
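The trade-off between the number of dimensions and accuracy can be made concrete on a small example. The sketch below uses an invented 6 x 5 random matrix and an arbitrary choice of k = 2; it relies on the standard result that the Frobenius norm of the truncation error equals the square root of the sum of the squared singular values that were dropped.

```r
set.seed(1)
R <- matrix(runif(30, min = 1, max = 5), nrow = 6)   # made-up 6 x 5 matrix
s <- svd(R)

k <- 2   # arbitrary number of dimensions to keep
# Keep the k largest singular values and the matching columns of u and v
R_k <- s$u[, 1:k] %*% diag(s$d[1:k], nrow = k) %*% t(s$v[, 1:k])

# Size of the approximation error (Frobenius norm)
err_frobenius <- sqrt(sum((R - R_k)^2))
# Square root of the summed squares of the dropped singular values
err_expected <- sqrt(sum(s$d[(k + 1):length(s$d)]^2))
c(err_frobenius, err_expected)
```

The two numbers agree, which is why the sum of squared singular values is a natural yardstick for choosing how many dimensions to retain.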
# sum of squares of all singular values
sum_squares <- sum(movies_1_svd$d^2)
sum_squares
## [1] 1671774
# cumulative share of the total sum of squares for the first i singular values
perc_vec <- NULL
for (i in 1:length(movies_1_svd$d)) {
  perc_vec[i] <- sum(movies_1_svd$d[1:i]^2) / sum_squares
}
plot(perc_vec)

k <- length(perc_vec[perc_vec <= .99])
k
## [1] 64
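The way k is counted matters when reading the result. The sketch below, on invented singular values and a 90% threshold chosen only for illustration, contrasts counting the cumulative shares at or below the threshold (the form used for k above) with taking the first index that reaches it. The names share, k_at_most and k_at_least are hypothetical.

```r
d <- c(10, 4, 2, 1, 0.5)            # made-up singular values
share <- cumsum(d^2) / sum(d^2)     # cumulative share of the sum of squares

# Largest k whose cumulative share stays at or below the threshold
k_at_most <- sum(share <= 0.90)
# Smallest k whose cumulative share reaches the threshold
k_at_least <- which(share >= 0.90)[1]

c(k_at_most, k_at_least)
```

The two counts differ by one whenever no cumulative share lands exactly on the threshold, so the first form keeps slightly less than the target share and the second slightly more.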
We can see that the first 64 singular values form the largest leading set whose squares sum to no more than 99% of the sum of squares of all the singular values. Let’s calculate our \(P\), \(A\) and \(Q^{T}\) matrices.
# build the diagonal matrix A_k from the first k singular values
A_k <- Diagonal(x = movies_1_svd$d[1:k])
# P matrix: first k columns of u
P_k <- movies_1_svd$u[, 1:k]
# transpose of the Q matrix: first k rows of t(v)
Q_k <- t(movies_1_svd$v)[1:k,]
# the product of these matrices gives us the estimated ratings matrix
predicted <- P_k %*% A_k %*% Q_k
## [1] 5.571451 3.882264 3.446047 3.431647 3.845288
We see that some values are higher than 5. Let’s clip all the ratings to the range 0 to 5.
# set all vals > 5 to 5
predicted[,][predicted[,] > 5] <- 5
# set all vals < 0 to 0
predicted[,][predicted[,] < 0] <- 0
## [1] 5.000000 3.882264 3.446047 3.431647 3.845288
predicted_matrix <- as.matrix(predicted)
