Project 3: Matrix Factorization methods

Introduction

This project is based on the work done in Project 2 (https://rpubs.com/ilyakats/data643proj2). It adds SVD to explore recommender systems further. Code is based on the recommenderlab package.

# Required libraries
library(recommenderlab)  # Matrix/recommender functions
library(dplyr)           # Data manipulation
library(tidyr)           # Data manipulation
library(ggplot2)         # Plotting
library(ggrepel)         # Plotting labels without overlapping
library(tictoc)          # Timing

Data Set

The data set is courtesy of MovieLens project and it was downloaded from https://grouplens.org/datasets/movielens/.

# Data import
ratings <- read.csv(paste0("https://raw.githubusercontent.com/ilyakats/CUNY-DATA643/",
                           "master/Project%202/ml-latest-small/ratings.csv"))
titles <- read.csv(paste0("https://raw.githubusercontent.com/ilyakats/CUNY-DATA643/",
                          "master/Project%202/ml-latest-small/movies.csv"))

# Convert to matrix
movieMatrix <- ratings %>% 
  select(-timestamp) %>% 
  spread(movieId, rating)
row.names(movieMatrix) <- movieMatrix[,1]
movieMatrix <- as.matrix(movieMatrix[-c(1)])
movieRealMatrix <- as(movieMatrix, "realRatingMatrix")

Our movie matrix contains 671 users and 9,066 items/movies.

In order to test any models, we need to split our data into training and testing sets.

# Train/test split
set.seed(88)
eval <- evaluationScheme(movieRealMatrix, method = "split", 
                         train = 0.8, given = 20, goodRating = 3)
train <- getData(eval, "train")
known <- getData(eval, "known")
unknown <- getData(eval, "unknown")

User-Based Collaborative Model

So we can compare SVD model against other models, we will build a user-based collaborative filtering model.

# UBCF model
tic("UBCF Model - Training")
modelUBCF <- Recommender(train, method = "UBCF")
toc(log = TRUE, quiet = TRUE)

tic("UBCF Model - Predicting")
predUBCF <- predict(modelUBCF, newdata = known, type = "ratings")
toc(log = TRUE, quiet = TRUE)

( accUBCF <- calcPredictionAccuracy(predUBCF, unknown) )

##      RMSE       MSE       MAE 
## 0.9720656 0.9449115 0.7532431

Singular Value Decomposition (SVD) Model

We are using similar code to generate the SVD model. After testing several parameters, we are generating a model with 50 categories/concepts. This covers enough information and lowers RMSE, but at the same time provides reasonable processing time.

# SVD model
tic("SVD Model - Training")
modelSVD <- Recommender(train, method = "SVD", parameter = list(k = 50))
toc(log = TRUE, quiet = TRUE)

tic("SVD Model - Predicting")
predSVD <- predict(modelSVD, newdata = known, type = "ratings")
toc(log = TRUE, quiet = TRUE)

( accSVD <- calcPredictionAccuracy(predSVD, unknown) )

##      RMSE       MSE       MAE 
## 0.9770621 0.9546503 0.7579945

As we can see RMSE is very similar to the UBCF model - just under 1. On the surface these models appear to be similar.

Run Time

One major difference is run time. This is especially important for scaling the project. UBCF takes less time to build a model, but takes more resources making predictions while SVD model is the opposite - resource intensive to build a model, but quick to make predictions.

# Display log
log <- as.data.frame(unlist(tic.log(format = TRUE)))
colnames(log) <- c("Run Time")
knitr::kable(log, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Run Time
UBCF Model - Training: 0.02 sec elapsed
UBCF Model - Predicting: 6.27 sec elapsed
SVD Model - Training: 2.07 sec elapsed
SVD Model - Predicting: 0.47 sec elapsed

Predictions Example

Let us pick a user to see how predictions would work. We can pick a user from the predictions matrix we already created. For example, user 44. The following list shows all movies user 44 rated - low on romantic, serious or light-hearted movies, high on lowbrow humor and action movies.

mov_rated <- as.data.frame(movieRealMatrix@data[c("44"), ]) 
colnames(mov_rated) <- c("Rating")
mov_rated$movieId <- as.integer(rownames(mov_rated))
mov_rated <- mov_rated %>% filter(Rating != 0) %>% 
  inner_join (titles, by="movieId") %>%
  arrange(Rating) %>%
  select(Movie = "title", Rating)
knitr::kable(mov_rated, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Movie	Rating
Sense and Sensibility (1995)	2
Birdcage, The (1996)	2
Sgt. Bilko (1996)	2
Father of the Bride Part II (1995)	3
Heat (1995)	3
Leaving Las Vegas (1995)	3
Twelve Monkeys (a.k.a. 12 Monkeys) (1995)	3
Broken Arrow (1996)	3
Star Wars: Episode IV - A New Hope (1977)	3
Executive Decision (1996)	3
Primal Fear (1996)	3
Mission: Impossible (1996)	3
Eraser (1996)	3
Nutty Professor, The (1996)	3
Phenomenon (1996)	3
Long Kiss Goodnight, The (1996)	3
Toy Story (1995)	4
Happy Gilmore (1996)	4
Down Periscope (1996)	4
Rock, The (1996)	4
Twister (1996)	4
Time to Kill, A (1996)	4
Grumpier Old Men (1995)	5
Mr. Holland’s Opus (1995)	5
Independence Day (a.k.a. ID4) (1996)	5

Now we can take top 6 movies as suggested by the SVD model. I think these suggestions - Demolition Man, Armaggedon, Jurassic Park - make sense for user 44. The only questionable suggestion in my opinion is Contact.

mov_recommend <- as.data.frame(predSVD@data[c("44"), ]) 
colnames(mov_recommend) <- c("Rating")
mov_recommend$movieId <- as.integer(rownames(mov_recommend))
mov_recommend <- mov_recommend %>% arrange(desc(Rating)) %>% head(6) %>% 
  inner_join (titles, by="movieId") %>%
  select(Movie = "title")
knitr::kable(mov_recommend, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Movie
Home Alone (1990)
Demolition Man (1993)
Contact (1997)
Armageddon (1998)
Jurassic Park (1993)
Lost World: Jurassic Park, The (1997)

Roadblock

If we look at the diaganol matrix \(\Sigma\) from the SVD model, we see that the first concept/dimension seemingly accounts for 99% of the energy. This provided some problems in further evaluation of the SVD method.

modelSVD@model$svd$d

##  [1] 2909.17096   44.21351   40.05544   34.32164   31.64448   29.48242
##  [7]   28.22409   27.40770   26.25308   25.84147   25.27600   25.08950
## [13]   24.08783   23.69596   23.12094   22.77796   22.38026   21.81070
## [19]   21.34381   21.13460   20.82351   20.55284   20.14412   20.03202
## [25]   19.78793   19.72819   19.66162   19.33117   19.10150   18.83120
## [31]   18.68897   18.64556   18.42660   18.16176   18.12811   18.07196
## [37]   17.71729   17.59136   17.36400   17.33684   17.20880   17.04195
## [43]   16.99371   16.84449   16.73244   16.58959   16.47004   16.20081
## [49]   16.15551   16.00560

Manual Singular Value Decomposition

In order to do more testing on the SVD method, I have decided to perform decomposition manually without any additional recommender functionality. I am using base R svd function.

First the ratings matrix is normalized. NA values are replaced with 0 and there are negative and positive ratings. Now we can decompose original matrix.

# Normalize matrix
movieMatrix <- as.matrix(normalize(movieRealMatrix)@data)

# Perform SVD
movieSVD <- svd(movieMatrix)
rownames(movieSVD$u) <- rownames(movieMatrix)
rownames(movieSVD$v) <- colnames(movieMatrix)

This process generates 671 concepts. Clearly to be usable we need to reduce number of dimensions/concepts by setting some singular values in the diagonal matrix \(\Sigma\) to 0. Per Leskovec (Mining of Massive Datasets, 2014, p. 424), we will retain enough singular values to make up 90% of the energy of \(\Sigma\).

# Reduce dimentions
n <- length(movieSVD$d)
total_energy <- sum(movieSVD$d^2)
for (i in (n-1):1) {
  energy <- sum(movieSVD$d[1:i]^2)
  if (energy/total_energy<0.9) {
    n_dims <- i+1
    break
  }
}

This process leaves us with 283 dimensions/concepts. This is still a high number, but much more manageable than 671 (almost 60% reduction).

trim_mov_D <- movieSVD$d[1:n_dims]
trim_mov_U <- movieSVD$u[, 1:n_dims]
trim_mov_V <- movieSVD$v[, 1:n_dims]

Consider two first concepts with singular values 73.6 and 51.9. Let us pick 5 movies with highest and lowest values in each concept and plot them.

mov_count <- 5

movies <- as.data.frame(trim_mov_V) %>% select(V1, V2)
movies$movieId <- as.integer(rownames(movies))

mov_sample <- movies %>% arrange(V1) %>% head(mov_count)
mov_sample <- rbind(mov_sample, movies %>% arrange(desc(V1)) %>% head(mov_count))
mov_sample <- rbind(mov_sample, movies %>% arrange(V2) %>% head(mov_count))
mov_sample <- rbind(mov_sample, movies %>% arrange(desc(V2)) %>% head(mov_count))
mov_sample <- mov_sample %>% inner_join(titles, by = "movieId") %>% 
  select(Movie = "title", Concept1 = "V1", Concept2 = "V2")
mov_sample$Concept1 <- round(mov_sample$Concept1, 4)
mov_sample$Concept2 <- round(mov_sample$Concept2, 4)

knitr::kable(mov_sample, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Movie	Concept1	Concept2
Pulp Fiction (1994)	-0.1455	0.0206
Godfather, The (1972)	-0.1222	0.0207
Silence of the Lambs, The (1991)	-0.1147	0.0153
Shawshank Redemption, The (1994)	-0.1115	-0.0027
Star Wars: Episode IV - A New Hope (1977)	-0.1073	0.0540
Batman Forever (1995)	0.0850	-0.0095
Ace Ventura: Pet Detective (1994)	0.0801	0.0432
Batman & Robin (1997)	0.0757	-0.0331
Ace Ventura: When Nature Calls (1995)	0.0712	-0.0245
Waterworld (1995)	0.0691	-0.0290
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)	-0.0198	-0.0374
Lives of Others, The (Das leben der Anderen) (2006)	-0.0055	-0.0363
Some Like It Hot (1959)	-0.0263	-0.0354
Unbearable Lightness of Being, The (1988)	0.0017	-0.0352
Modern Times (1936)	-0.0182	-0.0346
Star Wars: Episode V - The Empire Strikes Back (1980)	-0.1054	0.0770
Clerks (1994)	-0.0373	0.0767
Chasing Amy (1997)	-0.0111	0.0723
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)	-0.0922	0.0720
Star Wars: Episode VI - Return of the Jedi (1983)	-0.0776	0.0708

ggplot(mov_sample, aes(Concept1, Concept2, label=Movie)) + geom_point() +
  geom_text_repel(aes(label=Movie), hjust=-0.1, vjust=-0.1, size = 3) +
  scale_x_continuous(limits = c(-0.2, 0.2)) +
  scale_y_continuous(limits = c(-0.1, 0.1))

This plot nicely demonstrates one of the biggest disadvantages of the SVD method - inability to connect characteristics/concepts to real-world categories. We can certainly agree that it is meaningful that all original Star Wars movies are close together or that Pulp Fiction and Ace Ventura are on opposites sides, but there is no clear way to categorize these movies.

Summary

This project explored some SVD features. It showed that SVD is faster than collaborative filtering in making predictions. SVD provides good recommendations, but at the same time SVD is difficult to interpret.