library(tidyverse)
library(kableExtra)
library(knitr)
library(recommenderlab)
library(dplyr)
library(ggplot2)
library(ggrepel)
library(tictoc)
The goal of this assignment is to give you practice working with matrix factorization techniques.
Your task is to implement a matrix factorization method, such as Singular Value Decomposition (SVD) or Alternating Least Squares (ALS), in the context of a recommender system.
You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else.
Remember as always to cite your sources, so that you can be graded on what you added, not what you found.
SVD can be thought of as a pre-processing step for feature engineering.
You might easily start with thousands or millions of items, and use SVD to create a much smaller set of “k” items (e.g. 20 or 70).
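The reduction described above can be sketched with base R's `svd()` on a small matrix. This is only an illustration: the toy matrix `R`, its values, and the choice `k = 2` are assumptions made for the example, not part of the MovieLens analysis.

```r
# Toy ratings matrix: 4 users x 5 items (values are made up for illustration).
R <- matrix(c(5, 4, 1, 1, 2,
              4, 5, 1, 2, 1,
              1, 1, 5, 4, 5,
              2, 1, 4, 5, 4),
            nrow = 4, byrow = TRUE)

# Full decomposition: R = U %*% diag(d) %*% t(V)
dec <- svd(R)

# Keep only the k largest singular values (k = 2 is an arbitrary choice here).
k <- 2
U_k <- dec$u[, 1:k, drop = FALSE]
D_k <- diag(dec$d[1:k], nrow = k)
V_k <- dec$v[, 1:k, drop = FALSE]

# Each user is now described by k latent features instead of 5 item columns.
user_features <- U_k %*% D_k

# Rank-k approximation of the original matrix.
R_k <- user_features %*% t(V_k)

dim(user_features)
round(R_k, 1)
```

Each row of `user_features` is a compact description of one user, which is the sense in which SVD acts as a feature engineering step.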
The data set is from the MovieLens project and was downloaded from MovieLens.
ratings <- read.csv("https://raw.githubusercontent.com/josephsimone/Data-612/master/project_2/Movie_Lens/ratings.csv")
movies <- read.csv("https://raw.githubusercontent.com/josephsimone/Data-612/master/project_2/Movie_Lens/movies.csv")
m_m <- ratings %>%
select(-timestamp) %>%
spread(movieId, rating)
row.names(m_m) <- m_m[,1]
m_m <- m_m[-c(1)]
m_m <- as(as.matrix(m_m), "realRatingMatrix")
m_m
## 610 x 9724 rating matrix of class 'realRatingMatrix' with 100836 ratings.
norm_films <- normalize(m_m)
avg_rating <- round(rowMeans(norm_films),5)
table(avg_rating)
## avg_rating
## 0
## 610
Our movie matrix contains 610 users and 9,724 items/movies.
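The row means of exactly zero reported above are what row centering produces. As a rough sketch of that idea in base R (the toy matrix and its values are assumptions; this is not recommenderlab's own code), each user's mean observed rating is subtracted from that user's ratings:

```r
# Toy user-item matrix with missing ratings (NA); values are made up.
R <- matrix(c( 5,  3, NA,  1,
               4, NA,  2, NA,
              NA,  1,  5,  4),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("u", 1:3), paste0("i", 1:4)))

# Mean of each user's observed ratings.
user_means <- rowMeans(R, na.rm = TRUE)

# Subtract each user's mean from that user's ratings (NAs stay NA).
R_centered <- sweep(R, 1, user_means, FUN = "-")

# After centering, every user's mean rating is zero (up to rounding).
round(rowMeans(R_centered, na.rm = TRUE), 5)
```

Centering removes the effect of users who rate everything high or everything low, so the models work with deviations from each user's own average.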
Now we split the data into training and test sets.
set.seed(123)
eval <- evaluationScheme(norm_films, method = "split",
train = 0.8, given= 20, goodRating=3)
movie_train <- getData(eval, "train")
movie_known <- getData(eval, "known")
movie_unknown <- getData(eval, "unknown")
First, let’s compare the complexity of a User-Based Collaborative Filtering (UBCF) model and a Singular Value Decomposition (SVD) model.
tic("UBCF Model - Training")
UBCF_model <- Recommender(movie_train, method = "UBCF")
## Warning in .local(x, ...): x was already normalized by row!
toc(log = TRUE, quiet = TRUE)
tic("UBCF Model - Predicting")
UBCF_predict <- predict(UBCF_model, newdata = movie_known, type = "ratings")
toc(log = TRUE, quiet = TRUE)
(UBCF_accuracy <- calcPredictionAccuracy(UBCF_predict, movie_unknown) )
## RMSE MSE MAE
## 0.9041400 0.8174691 0.6956109
The SVD model is built with 50 latent concepts, or categories (k = 50).
tic("SVD Model - Training")
modelSVD <- Recommender(movie_train, method = "SVD", parameter = list(k = 50))
## Warning in .local(x, ...): x was already normalized by row!
toc(log = TRUE, quiet = TRUE)
tic("SVD Model - Predicting")
predSVD <- predict(modelSVD, newdata = movie_known, type = "ratings")
toc(log = TRUE, quiet = TRUE)
( accSVD <- calcPredictionAccuracy(predSVD, movie_unknown) )
## RMSE MSE MAE
## 0.9069787 0.8226103 0.6985789
At first glance, the SVD and UBCF models perform very similarly: the RMSE is about 0.907 for SVD and 0.904 for UBCF.
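For reference, the three error measures reported above can be computed by hand as follows. This is a minimal sketch: the `actual` and `predicted` vectors are made-up numbers, not values from the MovieLens models.

```r
# Hypothetical held-out ratings and model predictions (made-up numbers).
actual    <- c(4.0, 3.5, 5.0, 2.0, 1.0)
predicted <- c(3.5, 3.0, 4.0, 2.5, 2.0)

err <- actual - predicted

mse  <- mean(err^2)      # mean squared error
rmse <- sqrt(mse)        # root mean squared error
mae  <- mean(abs(err))   # mean absolute error

c(RMSE = rmse, MSE = mse, MAE = mae)
```

RMSE is the square root of MSE, which matches the reported output (for UBCF, 0.9041400 squared is approximately 0.8174691).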
Next, we compare the run times.
Let’s look at the timing log to better understand the cost of each model.
# Collect the tictoc timings; named timing_log to avoid masking base::log()
timing_log <- as.data.frame(unlist(tic.log(format = TRUE)))
colnames(timing_log) <- c("Run Time")
knitr::kable(timing_log, format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Run Time |
|---|
| UBCF Model - Training: 0.07 sec elapsed |
| UBCF Model - Predicting: 8.05 sec elapsed |
| SVD Model - Training: 3.39 sec elapsed |
| SVD Model - Predicting: 1.52 sec elapsed |
One major difference between the SVD and UBCF models is their run time.
UBCF takes far less time to build (0.07 sec versus 3.39 sec for SVD), but it is more resource intensive when making predictions (8.05 sec versus 1.52 sec).
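One plausible reading of these timings is that a user-based model defers most of its work to prediction time: every request compares the new user against all training users. The base R sketch below illustrates that pattern under simplifying assumptions (toy data, zeros standing in for unrated items, cosine similarity, and an arbitrary neighborhood size `n = 2`); it is not recommenderlab's implementation.

```r
# Toy centered ratings: rows are training users, columns are items.
# Zeros stand in for "not rated" in this simplified sketch.
train <- matrix(c( 1.0, -1.0,  0.0,  0.5,
                  -0.5,  0.0,  1.0, -1.0,
                   0.8, -0.7,  0.0,  0.0),
                nrow = 3, byrow = TRUE)
new_user <- c(1.0, -0.5, 0.0, 0.0)   # has not rated items 3 and 4

# Cosine similarity between the new user and every training user.
# This pass over all training rows is repeated for each prediction request.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
sims <- apply(train, 1, cosine, b = new_user)

# Take the n most similar users (n = 2 is arbitrary) and average their
# ratings, weighted by similarity, to score every item.
n <- 2
nearest <- order(sims, decreasing = TRUE)[1:n]
scores <- colSums(train[nearest, , drop = FALSE] * sims[nearest]) /
  sum(abs(sims[nearest]))

scores
```

"Training" here amounts to storing `train`, while the similarity search happens inside the prediction step, so the cost of each prediction grows with the number of stored users and items.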
Let’s evaluate our predictions by looking at the ratings of a specific user,
in this case the \(3^{rd}\) user in the data set.
movie_rating <- as.data.frame(m_m@data[c("3"), ])
colnames(movie_rating) <- c("Rating")
movie_rating$movieId <- as.integer(rownames(movie_rating))
movie_rating <- movie_rating %>% filter(Rating != 0) %>%
inner_join (movies, by="movieId") %>%
arrange(Rating) %>%
select(Movie = "title", Rating)
knitr::kable(movie_rating, format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Movie | Rating |
|---|---|
| Dangerous Minds (1995) | 0.5 |
| Schindler’s List (1993) | 0.5 |
| Courage Under Fire (1996) | 0.5 |
| Operation Dumbo Drop (1995) | 0.5 |
| Wallace & Gromit: The Best of Aardman Animation (1996) | 0.5 |
| My Fair Lady (1964) | 0.5 |
| Doors, The (1991) | 0.5 |
| On Golden Pond (1981) | 0.5 |
| Deer Hunter, The (1978) | 0.5 |
| Patton (1970) | 0.5 |
| Field of Dreams (1989) | 0.5 |
| Bambi (1942) | 0.5 |
| Lady and the Tramp (1955) | 0.5 |
| Rescuers, The (1977) | 0.5 |
| You’ve Got Mail (1998) | 0.5 |
| Fast Times at Ridgemont High (1982) | 0.5 |
| Requiem for a Dream (2000) | 0.5 |
| Snow Dogs (2002) | 0.5 |
| Green Card (1990) | 0.5 |
| 2012 (2009) | 0.5 |
| Tron (1982) | 2.0 |
| Star Trek: The Motion Picture (1979) | 3.0 |
| Highlander (1986) | 3.5 |
| Thing, The (1982) | 4.0 |
| Conan the Barbarian (1982) | 4.5 |
| Piranha (1978) | 4.5 |
| Looker (1981) | 4.5 |
| Master of the Flying Guillotine (Du bi quan wang da po xue di zi) (1975) | 4.5 |
| Clonus Horror, The (1979) | 4.5 |
| Escape from L.A. (1996) | 5.0 |
| Saturn 3 (1980) | 5.0 |
| Road Warrior, The (Mad Max 2) (1981) | 5.0 |
| The Lair of the White Worm (1988) | 5.0 |
| Hangar 18 (1980) | 5.0 |
| Galaxy of Terror (Quest) (1981) | 5.0 |
| Android (1982) | 5.0 |
| Alien Contamination (1980) | 5.0 |
| Death Race 2000 (1975) | 5.0 |
| Troll 2 (1990) | 5.0 |
As the table shows, the movies the \(3^{rd}\) user likes fall mostly under action and horror, with some animation.
On the other hand, this user rated romantic and dramatic films very low.
Next, we explore the movies suggested by the SVD model for the \(3^{rd}\) user.
recommend_movie <- as.data.frame(predSVD@data[c("3"), ])
colnames(recommend_movie) <- c("Rating")
recommend_movie$movieId <- as.integer(rownames(recommend_movie))
recommend_movie <- recommend_movie %>% arrange(desc(Rating)) %>% head(6) %>%
inner_join (movies, by="movieId") %>%
select(Movie = "title")
knitr::kable(recommend_movie, format = "html") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Movie |
|---|
| Dangerous Minds (1995) |
| Courage Under Fire (1996) |
| Operation Dumbo Drop (1995) |
| Wallace & Gromit: The Best of Aardman Animation (1996) |
| Escape from L.A. (1996) |
| My Fair Lady (1964) |
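Looking at the top 6 movies recommended to the \(3^{rd}\) user, we see that they also fall into the action, horror, and animation categories. Note, however, that five of the six appear in this user's rating table above with a rating of 0.5, so the genre match alone does not show that the recommendations suit this user.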
User-Based Collaborative Filtering:
Several problems can occur with UBCF. The first is scalability: the amount of computation grows with the number of customers and products.
Singular Value Decomposition:
An SVD model reduces the dimension of the rating matrix by extracting latent factors. Therefore, this model can handle the problems of scalability and sparsity.
However, SVD is still not a perfect model. One drawback is that there is no clear explanation of why a recommendation was made to a user. This can become problematic if the user wants to know why a recommendation occurred.