The goal of this assignment is give you practice working with Matrix Factorization techniques.
Your task is implement a matrix factorization method—such as singular value decomposition (SVD) or Alternating Least Squares (ALS)—in the context of a recommender system.
You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember as always to cite your sources, so that you can be graded on what you added, not what you found.
SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items, and use SVD to create a much smaller set of “k” items (e.g. 20 or 70).
library(recommenderlab)
library(dplyr)
library(tidyr)
library(kableExtra)
library(data.table)
data("MovieLense")
In this example to see how SVD decomposition recommender systems work, we will take a small subset of popular movies; Toy Story, Aladdin, Lion King, The Rock, Twister (1996), The Abyss, Alien, Psycho, and The Shining.
Note that these 9 movies are in a specific genre, respectively Animation, Action, and Horror.
To highlight some of the features of SVD decomposition, the users have been selected to have a preference for one of the 3 genres:
# Get subset of movie Lense dataset of most rated movies and most actively rating users
movie_ratings <- getData.frame(MovieLense[rowCounts(MovieLense) > 60, colCounts(MovieLense) > 150])
# Determine 3 movies for each genre of Animation, Action, and Horror
animation <- MovieLenseMeta %>% filter(Animation == 1 & `Children's`== 1 & title %in% movie_ratings$item)
action <- MovieLenseMeta %>% filter(Thriller == 1 & Adventure == 1 & Musical == 0 & title %in% movie_ratings$item)
horror <- MovieLenseMeta %>% filter(Horror == 1 & title %in% movie_ratings$item)
# Create dataframe of the selected 9 movies
df2 <- spread(movie_ratings %>% filter(item %in% animation$title[1:3] | item %in% action$title[1:3] | item %in% horror$title[1:3]), key = item, value = rating)
col.order <- c("user","Toy Story (1995)","Aladdin (1992)", "Lion King, The (1994)", "Rock, The (1996)", "Twister (1996)", "Abyss, The (1989)", "Alien (1979)", "Psycho (1960)", "Shining, The (1980)")
df2 <- df2[,col.order]
df3 <- df2[2:length(df2)]
row.names(df3) <- as.character(df2$user)
df3[is.na(df3)] <- 0
# Determine users that have a visible preference of a specific genre based on ratings
animation_fans <- df3 %>% filter(`Toy Story (1995)` > 3 & `Lion King, The (1994)` > 3 & `Aladdin (1992)` > 3 &
`Rock, The (1996)` <= 3 & `Twister (1996)` <= 3 & `Abyss, The (1989)` <= 3 &
`Alien (1979)` <= 3 & `Psycho (1960)` <= 3 & `Shining, The (1980)` <= 3)
action_fans <- df3 %>% filter(`Toy Story (1995)` <= 3 & `Lion King, The (1994)` <= 2 & `Aladdin (1992)` <= 2 &
`Rock, The (1996)` >= 3 & `Twister (1996)` >= 0 & `Abyss, The (1989)` >= 3 &
`Alien (1979)` <= 3 & `Psycho (1960)` <= 3 & `Shining, The (1980)` <= 3)
horror_fans <- df3 %>% filter(`Toy Story (1995)` %in% c(0) & `Lion King, The (1994)` < 3 & `Aladdin (1992)` < 3 &
`Rock, The (1996)` <= 3 & `Twister (1996)` <= 3 & `Abyss, The (1989)` != 0 &
`Alien (1979)` >= 3 & `Psycho (1960)` >= 3 & `Shining, The (1980)` >= 3)
# Combine users of specific genre preference
m2 <- rbind(animation_fans, action_fans)
m2 <- rbind(m2, horror_fans)
m <- as.matrix(m2)
m %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F) %>% add_header_above(c(" ", "Animation" = 3, "Action" = 3, "Horror" = 3))%>% pack_rows("Animation Fans", 1, 4) %>% pack_rows("Action Fans", 5, 6) %>% pack_rows("Horror Fans", 7, 10)
| Toy Story (1995) | Aladdin (1992) | Lion King, The (1994) | Rock, The (1996) | Twister (1996) | Abyss, The (1989) | Alien (1979) | Psycho (1960) | Shining, The (1980) | |
|---|---|---|---|---|---|---|---|---|---|
| Animation Fans | |||||||||
| 18 | 5 | 4 | 4 | 0 | 0 | 0 | 0 | 3 | 3 |
| 664 | 4 | 4 | 4 | 0 | 3 | 0 | 3 | 0 | 0 |
| 771 | 5 | 4 | 5 | 0 | 0 | 2 | 0 | 0 | 0 |
| 821 | 5 | 5 | 5 | 3 | 3 | 0 | 0 | 0 | 0 |
| Action Fans | |||||||||
| 430 | 0 | 0 | 0 | 3 | 0 | 3 | 0 | 0 | 0 |
| 453 | 0 | 0 | 0 | 4 | 0 | 3 | 0 | 0 | 0 |
| Horror Fans | |||||||||
| 270 | 0 | 0 | 0 | 0 | 3 | 5 | 5 | 5 | 5 |
| 372 | 0 | 0 | 0 | 0 | 0 | 4 | 5 | 5 | 5 |
| 617 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 5 | 5 |
| 833 | 0 | 0 | 0 | 0 | 2 | 2 | 5 | 5 | 4 |
we will decompose out matrix M into 3 matrices where \(M=U\sum V^T\) such that:
#Determine rank of current matrix
rankMatrix(m)[1]
## [1] 9
svd_m<-svd(m)
#U
#round(s$u, 2) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
svd_m$u
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.30980303 0.346022876 0.27923899 0.5587444 0.198941909 0.26999978
## [2,] -0.24798789 0.380454840 0.11496702 -0.5913655 -0.134210073 -0.39014688
## [3,] -0.20260356 0.455247906 -0.04693749 0.2673050 -0.628730047 -0.10806889
## [4,] -0.21925320 0.546067913 -0.19711604 -0.1941723 0.483582175 0.16055517
## [5,] -0.06003199 -0.005796787 -0.57338346 0.1276974 -0.003855702 -0.06663823
## [6,] -0.06301899 0.001149358 -0.67461043 0.1437329 0.180275120 -0.19852322
## [7,] -0.49354250 -0.252315727 -0.16698190 -0.2424484 -0.247073513 0.66657110
## [8,] -0.44612275 -0.263793774 -0.04262905 0.1861726 -0.236944063 -0.35506315
## [9,] -0.37049344 -0.222558541 0.21487165 0.2087524 0.357377159 -0.34491078
## [10,] -0.40737402 -0.217402513 0.08951243 -0.2344180 0.189304919 -0.06500960
## [,7] [,8] [,9]
## [1,] -0.008713349 0.52472726 0.07891848
## [2,] 0.209845438 0.39003902 0.24138032
## [3,] -0.286197537 -0.41145047 0.14094383
## [4,] 0.107914317 -0.38275183 -0.39428187
## [5,] -0.037888031 0.18748759 -0.09860938
## [6,] -0.054112429 0.14476842 0.37314312
## [7,] 0.226203779 -0.05585837 0.21492089
## [8,] 0.280207160 0.10428242 -0.61232733
## [9,] 0.237390292 -0.42695789 0.42732698
## [10,] -0.819714846 0.06590076 -0.08648909
#S
#round(s$d, 2) %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
svd_m$d
## [1] 19.1021537 15.2971192 7.0350655 4.8391352 3.4252460 1.9709804 0.6266352
## [8] 0.3965897 0.1716448
#V*
#t(round(s$v, 2))%>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
t(svd_m$v)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.24344116 -0.216616618 -0.22722294 -0.05705804 -0.19354356 -0.325188725
## [2,] 0.53987373 0.487493231 0.51725360 0.10625601 0.10379837 -0.135814054
## [3,] 0.09037519 0.057354675 0.05068274 -0.71213839 -0.08079070 -0.632458333
## [4,] 0.16406311 -0.006638758 0.04859942 0.07759799 -0.73417965 0.128383737
## [5,] -0.07820463 0.047271942 -0.13628569 0.63069336 0.20013325 -0.635095603
## [6,] 0.02630303 -0.055854498 -0.11068452 -0.25994275 0.59915308 0.216161866
## [7,] -0.15256259 0.318063495 -0.13865761 -0.01016678 -0.01203107 0.002252788
## [8,] 0.53652646 0.250899295 -0.78657218 -0.01694198 -0.03507913 0.041727195
## [9,] 0.54428354 -0.736630677 0.08450579 0.08097388 0.07620290 -0.086615810
## [,7] [,8] [,9]
## [1,] -0.46911641 -0.49821987 -0.4768938
## [2,] -0.22333811 -0.24463980 -0.2304278
## [3,] 0.08584072 0.18643503 0.1737113
## [4,] -0.49441795 0.26172588 0.3101680
## [5,] -0.13040957 0.26571763 0.2104501
## [6,] -0.66849572 0.16130917 0.1942926
## [7,] 0.02007220 -0.64743906 0.6606822
## [8,] 0.08550283 0.02777775 -0.1383909
## [9,] 0.08139614 -0.26850405 0.2353800
The SUV decomposition helps us see some patterns in the underlying data or latent factors.
The \(U\) matrix maps users to genres and can be interpreted as a user to latent factor similarity matrix
The \(V^T\) matrix maps movies to latent factors per user.
The \(\sum\) matrix measure the strength of each of these latent factors within our dataset where the larger the numeber on the diagonal, the stronger the factor. Factors which are very low can be considered noise within the data
#U
u <- round(svd_m$u, 2)
colnames(u) <- c("Animation Latent Factor", "Horror Latent Factor", "Action Latent Factor", " ", " ", " ", " ", " ", " ")
rownames(u) <- rownames(m)
u %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
| Animation Latent Factor | Horror Latent Factor | Action Latent Factor | |||||||
|---|---|---|---|---|---|---|---|---|---|
| 18 | -0.31 | 0.35 | 0.28 | 0.56 | 0.20 | 0.27 | -0.01 | 0.52 | 0.08 |
| 664 | -0.25 | 0.38 | 0.11 | -0.59 | -0.13 | -0.39 | 0.21 | 0.39 | 0.24 |
| 771 | -0.20 | 0.46 | -0.05 | 0.27 | -0.63 | -0.11 | -0.29 | -0.41 | 0.14 |
| 821 | -0.22 | 0.55 | -0.20 | -0.19 | 0.48 | 0.16 | 0.11 | -0.38 | -0.39 |
| 430 | -0.06 | -0.01 | -0.57 | 0.13 | 0.00 | -0.07 | -0.04 | 0.19 | -0.10 |
| 453 | -0.06 | 0.00 | -0.67 | 0.14 | 0.18 | -0.20 | -0.05 | 0.14 | 0.37 |
| 270 | -0.49 | -0.25 | -0.17 | -0.24 | -0.25 | 0.67 | 0.23 | -0.06 | 0.21 |
| 372 | -0.45 | -0.26 | -0.04 | 0.19 | -0.24 | -0.36 | 0.28 | 0.10 | -0.61 |
| 617 | -0.37 | -0.22 | 0.21 | 0.21 | 0.36 | -0.34 | 0.24 | -0.43 | 0.43 |
| 833 | -0.41 | -0.22 | 0.09 | -0.23 | 0.19 | -0.07 | -0.82 | 0.07 | -0.09 |
#S
s <- diag(round(svd_m$d, 2), length(round(svd_m$d, 2)))
colnames(s) <- c("Animation Latent Factor Strength", "Horror Latent Factor Strength", "Action Latent Factor Strength", " ", " ", " ", " ", " ", " ")
s %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
| Animation Latent Factor Strength | Horror Latent Factor Strength | Action Latent Factor Strength | ||||||
|---|---|---|---|---|---|---|---|---|
| 19.1 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 |
| 0.0 | 15.3 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 |
| 0.0 | 0.0 | 7.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 |
| 0.0 | 0.0 | 0.00 | 4.84 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 |
| 0.0 | 0.0 | 0.00 | 0.00 | 3.43 | 0.00 | 0.00 | 0.0 | 0.00 |
| 0.0 | 0.0 | 0.00 | 0.00 | 0.00 | 1.97 | 0.00 | 0.0 | 0.00 |
| 0.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.63 | 0.0 | 0.00 |
| 0.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.4 | 0.00 |
| 0.0 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.17 |
#V*
v <- t(round(svd_m$v, 2))
v %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
| -0.24 | -0.22 | -0.23 | -0.06 | -0.19 | -0.33 | -0.47 | -0.50 | -0.48 |
| 0.54 | 0.49 | 0.52 | 0.11 | 0.10 | -0.14 | -0.22 | -0.24 | -0.23 |
| 0.09 | 0.06 | 0.05 | -0.71 | -0.08 | -0.63 | 0.09 | 0.19 | 0.17 |
| 0.16 | -0.01 | 0.05 | 0.08 | -0.73 | 0.13 | -0.49 | 0.26 | 0.31 |
| -0.08 | 0.05 | -0.14 | 0.63 | 0.20 | -0.64 | -0.13 | 0.27 | 0.21 |
| 0.03 | -0.06 | -0.11 | -0.26 | 0.60 | 0.22 | -0.67 | 0.16 | 0.19 |
| -0.15 | 0.32 | -0.14 | -0.01 | -0.01 | 0.00 | 0.02 | -0.65 | 0.66 |
| 0.54 | 0.25 | -0.79 | -0.02 | -0.04 | 0.04 | 0.09 | 0.03 | -0.14 |
| 0.54 | -0.74 | 0.08 | 0.08 | 0.08 | -0.09 | 0.08 | -0.27 | 0.24 |
Since we know in our dataset we have 3 genres, thus we anticipate 3 latent factors, the other factors can be thought of as noise. We can determine which level of noise to remove and reduce the dimensionality of our dataset.
We start by looking at the \(\sum\) matrix and computing the sum of the squares along the diagonal, the total energy. We can drop 10% of the energy as a general rule of thumb. All 3 matrices can be reduced by the same factor, thus reducing the rank of our utility matrix
sum(svd_m$d^2)
## [1] 688
energy_pct <- round(((svd_m$d^2)/sum(svd_m$d^2))*100,2)
reduce_by <- 0
for(i in length(energy_pct):1)
{
if(sum(energy_pct[i:length(energy_pct)]) < 10)
reduce_by = i
}
#U
u_reduced <- round(svd_m$u[,1:1:reduce_by-1],2)
u_reduced %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
| -0.31 | 0.35 | 0.28 |
| -0.25 | 0.38 | 0.11 |
| -0.20 | 0.46 | -0.05 |
| -0.22 | 0.55 | -0.20 |
| -0.06 | -0.01 | -0.57 |
| -0.06 | 0.00 | -0.67 |
| -0.49 | -0.25 | -0.17 |
| -0.45 | -0.26 | -0.04 |
| -0.37 | -0.22 | 0.21 |
| -0.41 | -0.22 | 0.09 |
#S
s_reduced <- round(diag(c(svd_m$d[1:reduce_by-1]), length(svd_m$d[1:reduce_by-1])),4)
s_reduced %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
| 19.1022 | 0.0000 | 0.0000 |
| 0.0000 | 15.2971 | 0.0000 |
| 0.0000 | 0.0000 | 7.0351 |
#V*
v_reduced <- round(t(svd_m$v)[1:reduce_by-1,],2)
v_reduced %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
| -0.24 | -0.22 | -0.23 | -0.06 | -0.19 | -0.33 | -0.47 | -0.50 | -0.48 |
| 0.54 | 0.49 | 0.52 | 0.11 | 0.10 | -0.14 | -0.22 | -0.24 | -0.23 |
| 0.09 | 0.06 | 0.05 | -0.71 | -0.08 | -0.63 | 0.09 | 0.19 | 0.17 |
The most simple recommendation that can be made is to use the dot product of the SVD matrices to reconstruct as close as possible the original matrix while reducing the Root Mean Squared Error
r <- u_reduced %*% s_reduced %*% v_reduced
r[r < 0] <- 0
r[r > 5] <- 5
r <- round(r)
# Original matrix
m %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
| Toy Story (1995) | Aladdin (1992) | Lion King, The (1994) | Rock, The (1996) | Twister (1996) | Abyss, The (1989) | Alien (1979) | Psycho (1960) | Shining, The (1980) | |
|---|---|---|---|---|---|---|---|---|---|
| 18 | 5 | 4 | 4 | 0 | 0 | 0 | 0 | 3 | 3 |
| 664 | 4 | 4 | 4 | 0 | 3 | 0 | 3 | 0 | 0 |
| 771 | 5 | 4 | 5 | 0 | 0 | 2 | 0 | 0 | 0 |
| 821 | 5 | 5 | 5 | 3 | 3 | 0 | 0 | 0 | 0 |
| 430 | 0 | 0 | 0 | 3 | 0 | 3 | 0 | 0 | 0 |
| 453 | 0 | 0 | 0 | 4 | 0 | 3 | 0 | 0 | 0 |
| 270 | 0 | 0 | 0 | 0 | 3 | 5 | 5 | 5 | 5 |
| 372 | 0 | 0 | 0 | 0 | 0 | 4 | 5 | 5 | 5 |
| 617 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 5 | 5 |
| 833 | 0 | 0 | 0 | 0 | 2 | 2 | 5 | 5 | 4 |
# Reconstructed matrix
r %>% kable() %>% kable_styling(bootstrap_options = "striped", full_width = F)
| 4 | 4 | 4 | 0 | 2 | 0 | 2 | 2 | 2 |
| 4 | 4 | 4 | 0 | 1 | 0 | 1 | 1 | 1 |
| 5 | 4 | 5 | 1 | 1 | 0 | 0 | 0 | 0 |
| 5 | 5 | 5 | 2 | 2 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 3 | 1 | 3 | 0 | 0 | 0 |
| 0 | 0 | 0 | 3 | 1 | 3 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 1 | 4 | 5 | 5 | 5 |
| 0 | 0 | 0 | 0 | 1 | 4 | 5 | 5 | 5 |
| 0 | 0 | 0 | 0 | 1 | 2 | 4 | 5 | 4 |
| 0 | 0 | 0 | 0 | 1 | 3 | 4 | 5 | 5 |
#RMSE
RMSE(m,r)
## [1] 0.7226494
While SVD factorization methods alleviate issues with content based filtering algorithms and help see some underlying patterns within the dataset. However, interpreting latent factors can be difficult.