• Your task is to implement a matrix factorization method, such as singular value decomposition (SVD) or Alternating Least Squares (ALS), in the context of a recommender system.
• You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember as always to cite your sources, so that you can be graded on what you added, not what you found.
SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items, and use SVD to create a much smaller set of “k” items (e.g. 20 or 70).
• This project is based on the work done in Project 2.
• In this project we add SVD to further explore the recommender system, using the recommenderlab package.
The data set is from the MovieLens project and was downloaded from MovieLens (https://grouplens.org/datasets/movielens/).
Movie_Matrix <- ratings %>%
  select(-timestamp) %>%
  spread(movieId, rating)
row.names(Movie_Matrix) <- Movie_Matrix[,1]
Movie_Matrix <- Movie_Matrix[-c(1)]
Movie_Matrix <- as(as.matrix(Movie_Matrix), "realRatingMatrix")
Movie_Matrix
## 610 x 9724 rating matrix of class 'realRatingMatrix' with 100836 ratings.
Our movie matrix contains 610 users and 9,724 items/movies.
Now we will split our data into training and test sets.
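The code for the split is not shown in this write-up, but the later chunks rely on three objects: `train`, `known`, and `unknown`. In recommenderlab these typically come from `evaluationScheme()` and `getData()`. The base-R sketch below only illustrates the idea on a made-up toy matrix; the 80/20 proportion, the number of held-out ratings, and all variable names are assumptions, not the settings behind the results reported here.

```r
# Toy ratings matrix: 10 users x 8 items, NA = not rated (made-up data).
set.seed(1)
ratings_toy <- matrix(sample(c(NA, 1:5), 80, replace = TRUE,
                             prob = c(0.3, rep(0.14, 5))),
                      nrow = 10,
                      dimnames = list(paste0("u", 1:10), paste0("i", 1:8)))

# 1. Split users: 80% for training, the rest for testing (assumed proportion).
train_idx <- sample(nrow(ratings_toy), size = 0.8 * nrow(ratings_toy))
train_toy <- ratings_toy[train_idx, , drop = FALSE]
test_toy  <- ratings_toy[-train_idx, , drop = FALSE]

# 2. For each test user, hide some ratings: the model sees "known"
#    and is scored on "unknown".
known_toy   <- test_toy
unknown_toy <- test_toy
unknown_toy[] <- NA
for (u in seq_len(nrow(test_toy))) {
  rated  <- which(!is.na(test_toy[u, ]))
  n_hide <- min(2, length(rated) - 1)   # hide up to 2, keep at least 1
  if (n_hide < 1) next
  hidden <- rated[sample.int(length(rated), n_hide)]
  unknown_toy[u, hidden] <- test_toy[u, hidden]
  known_toy[u, hidden]   <- NA
}
```

Each test user's ratings end up in exactly one of the two matrices, so a model can be fed the known part and scored on the unknown part.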
First, we will build a user-based collaborative filtering (UBCF) model.
tic("UBCF Model - Training")
modelUBCF <- Recommender(train, method = "UBCF")
toc(log = TRUE, quiet = TRUE)
tic("UBCF Model - Predicting")
predUBCF <- predict(modelUBCF, newdata = known, type = "ratings")
toc(log = TRUE, quiet = TRUE)
( accUBCF <- calcPredictionAccuracy(predUBCF, unknown) )
##      RMSE       MSE       MAE
## 0.9320803 0.8687737 0.7174840
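For reference, the three numbers reported by `calcPredictionAccuracy()` are standard error measures over the held-out ratings. The base-R sketch below computes them on made-up vectors; the data and the helper name `accuracy_sketch` are illustrative assumptions, not part of recommenderlab.

```r
# Hypothetical helper: RMSE, MSE and MAE over the entries where both
# the predicted and the actual rating are available.
accuracy_sketch <- function(predicted, actual) {
  ok  <- !is.na(predicted) & !is.na(actual)
  err <- predicted[ok] - actual[ok]
  c(RMSE = sqrt(mean(err^2)), MSE = mean(err^2), MAE = mean(abs(err)))
}

pred_toy   <- c(4.0, 3.0, NA, 2.5)   # made-up predictions
actual_toy <- c(5.0, 3.0, 4.0, 1.5)  # made-up held-out ratings
acc_toy <- accuracy_sketch(pred_toy, actual_toy)
acc_toy
```

MSE is the square of RMSE, which can be checked against the UBCF output: 0.9320803^2 is approximately 0.8687737.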
Now we will build an SVD model to compare with the UBCF model. We generate the model with 50 concepts/categories, a setting that retains the required information while keeping RMSE low and processing time reasonable.
tic("SVD Model - Training")
modelSVD <- Recommender(train, method = "SVD", parameter = list(k = 50))
toc(log = TRUE, quiet = TRUE)
tic("SVD Model - Predicting")
predSVD <- predict(modelSVD, newdata = known, type = "ratings")
toc(log = TRUE, quiet = TRUE)
( accSVD <- calcPredictionAccuracy(predSVD, unknown) )
##      RMSE       MSE       MAE
## 0.9361910 0.8764536 0.7210165
As we can see, the SVD model's RMSE (0.936) is very close to that of the UBCF model (0.932), so in terms of accuracy the two models appear similar.
One major difference between the SVD and UBCF models is their run time.
Let's look at the timing log to compare the run time of each step.
log <- as.data.frame(unlist(tic.log(format = TRUE)))
colnames(log) <- c("Run Time")
knitr::kable(log, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

| Run Time |
|---|
| UBCF Model - Training: 0.015 sec elapsed |
| UBCF Model - Predicting: 2.556 sec elapsed |
| SVD Model - Training: 2.172 sec elapsed |
| SVD Model - Predicting: 0.542 sec elapsed |
As the log shows, UBCF trains almost instantly (0.015 sec) but is slow to make predictions (2.556 sec), while SVD is the opposite: resource intensive to train (2.172 sec) but quick to predict (0.542 sec).
Now let us evaluate our predictions by looking at the ratings and recommendations for a particular user, here user 400.
mov_rated <- as.data.frame(Movie_Matrix@data[c("400"), ])
colnames(mov_rated) <- c("Rating")
mov_rated$movieId <- as.integer(rownames(mov_rated))
mov_rated <- mov_rated %>% filter(Rating != 0) %>%
  inner_join (movies, by="movieId") %>%
  arrange(Rating) %>%
  select(Movie = "title", Rating)
knitr::kable(mov_rated, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

| Movie | Rating |
|---|---|
| Spider-Man 3 (2007) | 2.5 |
| Indiana Jones and the Kingdom of the Crystal Skull (2008) | 2.5 |
| Back to the Future (1985) | 4.0 |
| Gladiator (2000) | 4.0 |
| Lord of the Rings: The Fellowship of the Ring, The (2001) | 4.0 |
| Lord of the Rings: The Return of the King, The (2003) | 4.0 |
| Lucky Number Slevin (2006) | 4.0 |
| Pursuit of Happyness, The (2006) | 4.0 |
| Departed, The (2006) | 4.0 |
| The Martian (2015) | 4.0 |
| Logan (2017) | 4.0 |
| Forrest Gump (1994) | 4.5 |
| Blade Runner (1982) | 4.5 |
| Die Hard (1988) | 4.5 |
| One Flew Over the Cuckoo’s Nest (1975) | 4.5 |
| Princess Bride, The (1987) | 4.5 |
| Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) | 4.5 |
| Goodfellas (1990) | 4.5 |
| Godfather: Part II, The (1974) | 4.5 |
| Shining, The (1980) | 4.5 |
| Donnie Darko (2001) | 4.5 |
| Dark Knight, The (2008) | 4.5 |
| How to Train Your Dragon (2010) | 4.5 |
| Star Wars: Episode VII - The Force Awakens (2015) | 4.5 |
| Arrival (2016) | 4.5 |
| Heat (1995) | 5.0 |
| Seven (a.k.a. Se7en) (1995) | 5.0 |
| Usual Suspects, The (1995) | 5.0 |
| Star Wars: Episode IV - A New Hope (1977) | 5.0 |
| Léon: The Professional (a.k.a. The Professional) (Léon) (1994) | 5.0 |
| Pulp Fiction (1994) | 5.0 |
| Shawshank Redemption, The (1994) | 5.0 |
| Silence of the Lambs, The (1991) | 5.0 |
| Fargo (1996) | 5.0 |
| Trainspotting (1996) | 5.0 |
| Godfather, The (1972) | 5.0 |
| Star Wars: Episode V - The Empire Strikes Back (1980) | 5.0 |
| Star Wars: Episode VI - Return of the Jedi (1983) | 5.0 |
| Matrix, The (1999) | 5.0 |
| Fight Club (1999) | 5.0 |
| Requiem for a Dream (2000) | 5.0 |
| Inside Man (2006) | 5.0 |
| Inception (2010) | 5.0 |
• User 400's favorites fall mostly under the action and drama genres, with little romance.
• Now let us look at the movies the SVD model suggests for user 400.
mov_recommend <- as.data.frame(predSVD@data[c("400"), ])
colnames(mov_recommend) <- c("Rating")
mov_recommend$movieId <- as.integer(rownames(mov_recommend))
mov_recommend <- mov_recommend %>% arrange(desc(Rating)) %>% head(6) %>%
  inner_join (movies, by="movieId") %>%
  select(Movie = "title")
knitr::kable(mov_recommend, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

| Movie |
|---|
| American Beauty (1999) |
| Pulp Fiction (1994) |
| Schindler’s List (1993) |
| Sixth Sense, The (1999) |
| Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) |
| Saving Private Ryan (1998) |
Looking at the top 6 movies recommended to user 400, we see that they also fall into the action and drama genres.
Next, let us normalize the ratings matrix and perform the SVD on it directly.
# Normalize matrix
movieMatrix <- as.matrix(normalize(Movie_Matrix)@data)
# Perform SVD
movieSVD <- svd(movieMatrix)
rownames(movieSVD$u) <- rownames(movieMatrix)
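rownames(movieSVD$v) <- colnames(movieMatrix)

As we saw earlier, our data has 610 users, so the decomposition yields 610 singular values. To make it usable, we need to reduce the number of dimensions/concepts by setting the smallest singular values in the diagonal matrix Σ to 0. We keep just enough of them to retain 90% of the total energy (the sum of squared singular values).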
# Reduce dimensions
n <- length(movieSVD$d)
total_energy <- sum(movieSVD$d^2)
for (i in (n-1):1) {
  energy <- sum(movieSVD$d[1:i]^2)
  if (energy/total_energy<0.9) {
    n_dims <- i+1
    break
  }
}
trim_mov_D <- movieSVD$d[1:n_dims]
trim_mov_U <- movieSVD$u[, 1:n_dims]
trim_mov_V <- movieSVD$v[, 1:n_dims]

We started with 610 users in our ratings matrix; after reducing the dimensionality of the diagonal matrix Σ, 251 dimensions/concepts remain.
## [1] 76.20047 43.62240 41.77917 39.37051 37.95619 36.54896
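The loop above keeps the smallest number of singular values whose squared sum reaches 90% of the total energy. The same rule can be sketched with `cumsum()`; the toy matrix and the helper name `dims_for_energy` are assumptions made for illustration, not the MovieLens data.

```r
# Hypothetical helper: smallest k such that the first k singular values
# retain at least `threshold` of the total energy (sum of squares).
dims_for_energy <- function(d, threshold = 0.9) {
  energy_share <- cumsum(d^2) / sum(d^2)
  which(energy_share >= threshold)[1]
}

set.seed(42)
toy <- matrix(rnorm(20 * 30), nrow = 20)   # made-up 20 x 30 matrix
toy <- toy - rowMeans(toy)                 # center each row (user)
toy_svd <- svd(toy)

k <- dims_for_energy(toy_svd$d)

# Keep only the first k concepts.
trim_D <- toy_svd$d[1:k]
trim_U <- toy_svd$u[, 1:k, drop = FALSE]
trim_V <- toy_svd$v[, 1:k, drop = FALSE]
```

Unlike the loop, this version also returns 1 when the first singular value alone already reaches the threshold, a case in which the loop would leave `n_dims` unset.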
Consider the first two concepts, with singular values 76.2 and 43.6. Let us pick the 5 movies with the highest and the 5 with the lowest values in each concept and display them.
mov_count <- 5
movies_df <- as.data.frame(trim_mov_V) %>% select(V1, V2)
movies_df$movieId <- as.integer(rownames(movies_df))
mov_sample <- movies_df %>% arrange(V1) %>% head(mov_count)
mov_sample <- rbind(mov_sample, movies_df %>% arrange(desc(V1)) %>% head(mov_count))
mov_sample <- rbind(mov_sample, movies_df %>% arrange(V2) %>% head(mov_count))
mov_sample <- rbind(mov_sample, movies_df %>% arrange(desc(V2)) %>% head(mov_count))
mov_sample <- mov_sample %>% inner_join(movies, by = "movieId") %>%
  select(Movie = "title", Concept1 = "V1", Concept2 = "V2")
mov_sample$Concept1 <- round(mov_sample$Concept1, 4)
mov_sample$Concept2 <- round(mov_sample$Concept2, 4)
knitr::kable(mov_sample, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

| Movie | Concept1 | Concept2 |
|---|---|---|
| Pulp Fiction (1994) | -0.1353 | -0.0097 |
| Star Wars: Episode IV - A New Hope (1977) | -0.1182 | 0.0268 |
| Star Wars: Episode V - The Empire Strikes Back (1980) | -0.1156 | 0.0233 |
| Godfather, The (1972) | -0.1093 | -0.0118 |
| Fight Club (1999) | -0.1068 | -0.0065 |
| Batman & Robin (1997) | 0.0599 | -0.0060 |
| Batman Forever (1995) | 0.0584 | 0.0255 |
| Wild Wild West (1999) | 0.0580 | 0.0242 |
| Hollow Man (2000) | 0.0564 | 0.0003 |
| Nutty Professor, The (1996) | 0.0542 | 0.0157 |
| Charlie’s Angels: Full Throttle (2003) | 0.0366 | -0.0589 |
| Transformers: Dark of the Moon (2011) | 0.0175 | -0.0537 |
| Battlefield Earth (2000) | 0.0337 | -0.0524 |
| Schindler’s List (1993) | -0.0757 | -0.0522 |
| Shawshank Redemption, The (1994) | -0.1057 | -0.0500 |
| Cannonball Run, The (1981) | 0.0061 | 0.0630 |
| Naked Gun: From the Files of Police Squad!, The (1988) | -0.0104 | 0.0588 |
| Blazing Saddles (1974) | -0.0309 | 0.0585 |
| Ace Ventura: Pet Detective (1994) | 0.0255 | 0.0585 |
| Beverly Hills Cop (1984) | -0.0090 | 0.0563 |
Collaborative Filtering:
• Item-based CF avoids the problem posed by changing user preferences, since similarities between items are more static than similarities between users.
• However, several problems remain. The main issue is scalability: the computation grows with both the number of customers and the number of products, with a worst-case complexity of O(mn) for m users and n items.
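To make the scalability point concrete: to predict for one user, a memory-based approach compares that user with every other user across all items, which is where a cost on the order of m times n comes from. A minimal base-R sketch on made-up data follows; cosine similarity and the helper names are assumptions, and recommenderlab's UBCF has its own defaults.

```r
# Made-up ratings: 4 users x 5 items, 0 = not rated.
r <- rbind(u1 = c(5, 4, 0, 1, 0),
           u2 = c(4, 5, 0, 0, 1),
           u3 = c(1, 0, 5, 4, 0),
           u4 = c(0, 1, 4, 5, 0))

# Cosine similarity between two rating vectors.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Similarity of one target user to all m users: each comparison touches
# n items, so a single pass costs on the order of m * n operations.
sims_to_user <- function(r, target) {
  apply(r, 1, cosine_sim, b = r[target, ])
}

sims <- sims_to_user(r, "u1")

# Most similar other user (the target itself is excluded).
nearest <- names(sort(sims[names(sims) != "u1"], decreasing = TRUE))[1]
```

This work is repeated at prediction time for every user, which is consistent with UBCF being cheap to train but slow to predict in the timing log.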
Singular Value Decomposition:
• SVD decreases the dimension of the utility matrix by extracting its latent factors.
• SVD successfully handles the scalability and sparsity problems posed by CF. However, SVD is not without flaws. Its main drawback is that it offers little to no explanation of why an item is recommended to a user. This can be a serious problem if users want to know why a specific item was recommended to them.
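As a rough illustration of how latent factors turn into predictions, the sketch below rebuilds a small ratings matrix from its k strongest concepts using base R's `svd()`. The data, the choice k = 2, and treating unrated cells as 0 after centering are all assumptions made for the example.

```r
# Made-up ratings: 4 users x 5 items, NA = not rated.
r <- rbind(u1 = c(5, 4, NA, 1, NA),
           u2 = c(4, 5, NA, NA, 1),
           u3 = c(1, NA, 5, 4, NA),
           u4 = c(NA, 1, 4, 5, NA))

# Center each user's ratings on that user's mean; unrated cells become 0.
user_means <- rowMeans(r, na.rm = TRUE)
centered <- r - user_means
centered[is.na(centered)] <- 0

s <- svd(centered)
k <- 2  # number of concepts to keep (assumed)

# Rank-k approximation: U_k %*% diag(d_k) %*% t(V_k), then add means back.
approx_k <- s$u[, 1:k, drop = FALSE] %*%
  diag(s$d[1:k], nrow = k) %*%
  t(s$v[, 1:k, drop = FALSE])
predicted <- approx_k + user_means
dimnames(predicted) <- dimnames(r)

round(predicted, 2)
```

The explainability drawback shows up here as well: the two retained concepts are just numeric directions, with no labels saying what they mean.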