DATA 612 Project 3 | Matrix Factorization

The goal of this assignment is to give you practice working with matrix factorization techniques. Your task is to implement a matrix factorization method, such as singular value decomposition (SVD) or Alternating Least Squares (ALS), in the context of a recommender system. You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember, as always, to cite your sources so that you can be graded on what you added, not what you found.

SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items and use SVD to create a much smaller set of “k” latent features (e.g. 20 or 70).

Notes/Limitations:

• SVD builds features that may or may not map neatly to items (such as movie genres or news topics). As in many areas of machine learning, the lack of explainability can be an issue.

• SVD requires that there are no missing values. There are various ways to handle this, including (1) imputation of missing values, (2) mean-centering values around 0, or (3) using a more advanced technique, such as stochastic gradient descent, to approximate SVD when populating the factored matrices (see the short sketch after this list).

• Calculating the SVD matrices can be computationally expensive, although calculating ratings once the factorization is completed is very fast. You may need to create a subset of your data for SVD calculations to be successfully performed, especially on a machine with a small RAM footprint.
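
As a brief illustration of option (2) above, the sketch below (my own addition, not part of the assignment text; the toy ratings matrix and variable names are made up) mean-centers each user's ratings and fills the remaining gaps with zeros so that SVD can be applied:

# Hypothetical 3 x 3 ratings matrix with missing entries (NA)
ratings <- matrix(c(5, NA, 3,
                    4,  2, NA,
                    NA, 1,  4), nrow = 3, byrow = TRUE)
row_means <- rowMeans(ratings, na.rm = TRUE)   # each user's average rating
centered  <- sweep(ratings, 1, row_means)      # center each user's ratings around 0
centered[is.na(centered)] <- 0                 # a missing value now stands for the user's mean
svd(centered)$d                                # SVD can now be computed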

Introduction

Recommendations using SVD

The goal of CF-based recommendation algorithms is to suggest new products, or to predict the utility of a product for a customer, based on the customer’s previous behavior and the opinions of other, similar customers. However, these systems suffer from problems such as sparsity, scalability, and synonymy. The weakness of CF algorithms on large, sparse databases led researchers to look for alternatives. To remove noise from a large, sparse database, several dimensionality reduction techniques have been proposed. Latent Semantic Indexing (LSI), a dimensionality reduction technique used in information retrieval (IR), is widely applied to reduce the dimensionality of the user-item ratings matrix. LSI, which uses singular value decomposition (SVD) as its underlying dimension reduction algorithm, maps nicely onto the collaborative filtering recommender problem.

The data is my own sample ratings matrix of 5 users on 5 movies.

set.seed(1)
sample_m <- matrix(sample(1:20,20),5,5)
sample_m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    6   14    3    7    6
## [2,]    8   15    2   12    8
## [3,]   11    9   20   17   11
## [4,]   16   19   10   18   16
## [5,]    4    1    5   13    4

We can now decompose the matrix. We see that the values of the diagonal matrix sigma (returned here as d) are sorted in decreasing order and vary widely in magnitude. These magnitudes represent the strength of each latent feature and will serve as a guide for deciding which ones can be removed.

sample_m.svd <- svd(sample_m)
sample_m.svd
## $d
## [1] 5.528631e+01 1.550274e+01 7.263082e+00 3.651949e+00 2.042991e-15
## 
## $u
##            [,1]       [,2]        [,3]       [,4]       [,5]
## [1,] -0.3014749  0.4152895 -0.22127999  0.5587723  0.6127448
## [2,] -0.3826434  0.4447365  0.29259878  0.3258567 -0.6811731
## [3,] -0.5405446 -0.6767722 -0.38805160  0.2571403 -0.1818950
## [4,] -0.6464569  0.2310308 -0.04505732 -0.7088737  0.1555190
## [5,] -0.2293510 -0.3440178  0.84427893  0.1138766  0.3213644
## 
## $v
##            [,1]        [,2]        [,3]       [,4]          [,5]
## [1,] -0.3993159  0.05970282 -0.08250749 -0.5746085  7.071068e-01
## [2,] -0.4944666  0.67341141 -0.30472087  0.4573456 -2.448129e-16
## [3,] -0.3634160 -0.69728875 -0.56021023  0.2605412 -1.502772e-16
## [4,] -0.5518381 -0.23060055  0.76137686  0.2502067 -2.813505e-17
## [5,] -0.3993159  0.05970282 -0.08250749 -0.5746085 -7.071068e-01
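
As a quick sanity check (my own addition, not part of the original walk-through), multiplying u, diag(d), and the transpose of v back together reproduces the original matrix up to floating-point error:

recon <- sample_m.svd$u %*% diag(sample_m.svd$d) %*% t(sample_m.svd$v)
all.equal(recon, sample_m)   # should report TRUE within numerical tolerance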

If we were to take the product of u, d, and v’, we could reconstruct the original matrix precisely, but this wouldn’t be very helpful. Instead, we can limit our reconstruction to the n largest values in sigma (d). In this first example, we’ll use only the first two terms.

ds <- diag(sample_m.svd$d[1:2])
us <- as.matrix(sample_m.svd$u[, 1:2])
vs <- as.matrix(sample_m.svd$v[, 1:2])
sample_m.1 <- us %*% ds %*% t(vs)
sample_m.1
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,]  7.039947 12.576997  1.567982  7.713092  7.039947
## [2,]  8.859133 15.103336  2.880494 10.084196  8.859133
## [3,] 11.307051  7.711683 18.176410 18.910943 11.307051
## [4,] 14.485469 20.084242 10.491150 18.896892 14.485469
## [5,]  4.744905  2.678372  8.326895  8.227132  4.744905

Even though we are using only two features to reconstruct the data, we can see that the values are very similar to the originals. If we instead use the three largest singular values, the reconstruction approximates the original matrix even more closely.

ds <- diag(sample_m.svd$d[1:3])
us <- as.matrix(sample_m.svd$u[, 1:3])
vs <- as.matrix(sample_m.svd$v[, 1:3])
sample_m.2 <- us %*% ds %*% t(vs)
sample_m.2
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,]  7.172551 13.066737  2.468337  6.489426  7.172551
## [2,]  8.683791 14.455753  1.689953 11.702251  8.683791
## [3,] 11.539594  8.570523 19.755335 16.765040 11.539594
## [4,] 14.512470 20.183963 10.674482 18.647728 14.512470
## [5,]  4.238963  0.809803  4.891648 12.895946  4.238963

As we expect, the RMSE decreases as we add more features, and it would decrease to 0 if we were to include all of them, effectively reconstructing the original matrix.
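
The RMSE() function called below is not defined elsewhere in this write-up; here is a minimal sketch of one possible definition, the root of the mean squared element-wise difference between the original and the reconstructed matrix (an equivalent function is also available in packages such as caret):

RMSE <- function(actual, predicted) {
  # element-wise root-mean-square error between two matrices of the same size
  sqrt(mean((actual - predicted)^2))
}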

RMSE(sample_m, sample_m.1)
## [1] 1.625904
RMSE(sample_m, sample_m.2)
## [1] 0.7303898

More importantly, we see that SVD allows us to uncover latent features from the data. We can measure their relative strength and reconstruct the ratings only using those with the most predictive power.
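
To tie this back to the recommender setting, a small illustration of my own using the rank-2 reconstruction computed above: the low-rank matrix is where predicted ratings would be read from, for example user 5's rating of movie 2.

sample_m.1[5, 2]   # rank-2 estimate of user 5's rating for movie 2
sample_m[5, 2]     # observed rating, for comparison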

Deciding how many features to keep will depend on the particular application, but a rule of thumb is to conserve about 90% of the energy in sigma. That is, the sum of the squares of the retained sigma values should be about 90% of the total sum of squares (Leskovec, Rajaraman, & Ullman, 2014).
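
As a quick sketch of this rule of thumb applied to the toy matrix above (my own addition), we can compute the cumulative share of squared singular values and keep the smallest k that reaches 90%:

energy <- cumsum(sample_m.svd$d^2) / sum(sample_m.svd$d^2)   # cumulative share of the total energy
round(energy, 4)
k <- which(energy >= 0.90)[1]                                # smallest k retaining ~90% of the energy
k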

References

Gorakala, S. K., & Usuelli, M. (2015). Building a recommendation system with R. Retrieved from https://learning.oreilly.com/library/view/building-a-recommendation/9781783554492/

Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets (2nd ed.). Cambridge University Press.