This project builds on the work done in Project 2 (https://rpubs.com/ilyakats/data643proj2) and adds SVD to explore recommender systems further. The code is based on the recommenderlab package.
# Required libraries
library(recommenderlab) # Matrix/recommender functions
library(dplyr) # Data manipulation
library(tidyr) # Data manipulation
library(ggplot2) # Plotting
library(ggrepel) # Plotting labels without overlapping
library(tictoc) # Timing
The data set is courtesy of the MovieLens project and was downloaded from https://grouplens.org/datasets/movielens/.
# Data import
ratings <- read.csv(paste0("https://raw.githubusercontent.com/ilyakats/CUNY-DATA643/",
                           "master/Project%202/ml-latest-small/ratings.csv"))
titles <- read.csv(paste0("https://raw.githubusercontent.com/ilyakats/CUNY-DATA643/",
                          "master/Project%202/ml-latest-small/movies.csv"))
# Convert to matrix
movieMatrix <- ratings %>%
  select(-timestamp) %>%
  spread(movieId, rating)
row.names(movieMatrix) <- movieMatrix[,1]
movieMatrix <- as.matrix(movieMatrix[-c(1)])
movieRealMatrix <- as(movieMatrix, "realRatingMatrix")
Our movie matrix contains 671 users and 9,066 items/movies.
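These counts can be read directly off the rating matrix object (a quick check):
# Dimensions of the rating matrix: 671 users x 9,066 movies
dim(movieRealMatrix)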
In order to test any models, we need to split our data into training and testing sets.
# Train/test split
set.seed(88)
eval <- evaluationScheme(movieRealMatrix, method = "split",
                         train = 0.8, given = 20, goodRating = 3)
train <- getData(eval, "train")
known <- getData(eval, "known")
unknown <- getData(eval, "unknown")
So that we can compare the SVD model against another approach, we first build a user-based collaborative filtering (UBCF) model.
# UBCF model
tic("UBCF Model - Training")
modelUBCF <- Recommender(train, method = "UBCF")
toc(log = TRUE, quiet = TRUE)
tic("UBCF Model - Predicting")
predUBCF <- predict(modelUBCF, newdata = known, type = "ratings")
toc(log = TRUE, quiet = TRUE)
( accUBCF <- calcPredictionAccuracy(predUBCF, unknown) )
## RMSE MSE MAE
## 0.9720656 0.9449115 0.7532431
We use similar code to generate the SVD model. After testing several values for the number of categories/concepts, we settled on 50; this captures enough information to lower RMSE while keeping processing time reasonable. (A sketch of such a parameter sweep follows the accuracy output below.)
# SVD model
tic("SVD Model - Training")
modelSVD <- Recommender(train, method = "SVD", parameter = list(k = 50))
toc(log = TRUE, quiet = TRUE)
tic("SVD Model - Predicting")
predSVD <- predict(modelSVD, newdata = known, type = "ratings")
toc(log = TRUE, quiet = TRUE)
( accSVD <- calcPredictionAccuracy(predSVD, unknown) )
## RMSE MSE MAE
## 0.9770621 0.9546503 0.7579945
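For reference, the choice of k could be revisited with a simple sweep that reuses the same evaluation scheme. The sketch below does this for a few illustrative values of k (the candidate values are an assumption, not the exact search performed above).
# Sketch: compare RMSE for several candidate values of k (illustrative values)
k_values <- c(10, 25, 50, 100)
rmse_by_k <- sapply(k_values, function(k) {
  m <- Recommender(train, method = "SVD", parameter = list(k = k))
  p <- predict(m, newdata = known, type = "ratings")
  calcPredictionAccuracy(p, unknown)[["RMSE"]]
})
data.frame(k = k_values, RMSE = rmse_by_k)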
The RMSE of the k = 50 model is very similar to that of the UBCF model - just under 1. On the surface the two models appear comparable.
One major difference is run time, which is especially important when scaling the project. UBCF takes less time to build a model but more resources to make predictions, while the SVD model is the opposite - resource intensive to train, but quick to predict.
# Display log
log <- as.data.frame(unlist(tic.log(format = TRUE)))
colnames(log) <- c("Run Time")
knitr::kable(log, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Run Time |
|---|
| UBCF Model - Training: 0.02 sec elapsed |
| UBCF Model - Predicting: 6.27 sec elapsed |
| SVD Model - Training: 2.07 sec elapsed |
| SVD Model - Predicting: 0.47 sec elapsed |
Let us pick a user to see how predictions work in practice. We can choose any user from the prediction matrix we already created - for example, user 44. The following list shows all the movies user 44 rated: low on romantic, serious, or light-hearted movies, and high on lowbrow humor and action movies.
mov_rated <- as.data.frame(movieRealMatrix@data[c("44"), ])
colnames(mov_rated) <- c("Rating")
mov_rated$movieId <- as.integer(rownames(mov_rated))
mov_rated <- mov_rated %>%
  filter(Rating != 0) %>%
  inner_join(titles, by = "movieId") %>%
  arrange(Rating) %>%
  select(Movie = "title", Rating)
knitr::kable(mov_rated, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Movie | Rating |
|---|---|
| Sense and Sensibility (1995) | 2 |
| Birdcage, The (1996) | 2 |
| Sgt. Bilko (1996) | 2 |
| Father of the Bride Part II (1995) | 3 |
| Heat (1995) | 3 |
| Leaving Las Vegas (1995) | 3 |
| Twelve Monkeys (a.k.a. 12 Monkeys) (1995) | 3 |
| Broken Arrow (1996) | 3 |
| Star Wars: Episode IV - A New Hope (1977) | 3 |
| Executive Decision (1996) | 3 |
| Primal Fear (1996) | 3 |
| Mission: Impossible (1996) | 3 |
| Eraser (1996) | 3 |
| Nutty Professor, The (1996) | 3 |
| Phenomenon (1996) | 3 |
| Long Kiss Goodnight, The (1996) | 3 |
| Toy Story (1995) | 4 |
| Happy Gilmore (1996) | 4 |
| Down Periscope (1996) | 4 |
| Rock, The (1996) | 4 |
| Twister (1996) | 4 |
| Time to Kill, A (1996) | 4 |
| Grumpier Old Men (1995) | 5 |
| Mr. Holland’s Opus (1995) | 5 |
| Independence Day (a.k.a. ID4) (1996) | 5 |
Now we can take the top 6 movies suggested by the SVD model. I think these suggestions - Demolition Man, Armageddon, Jurassic Park - make sense for user 44. The only questionable suggestion, in my opinion, is Contact.
mov_recommend <- as.data.frame(predSVD@data[c("44"), ])
colnames(mov_recommend) <- c("Rating")
mov_recommend$movieId <- as.integer(rownames(mov_recommend))
mov_recommend <- mov_recommend %>%
  arrange(desc(Rating)) %>%
  head(6) %>%
  inner_join(titles, by = "movieId") %>%
  select(Movie = "title")
knitr::kable(mov_recommend, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Movie |
|---|
| Home Alone (1990) |
| Demolition Man (1993) |
| Contact (1997) |
| Armageddon (1998) |
| Jurassic Park (1993) |
| Lost World: Jurassic Park, The (1997) |
If we look at the diagonal matrix \(\Sigma\) from the SVD model, we see that the first concept/dimension seemingly accounts for about 99% of the energy. This posed some problems for further evaluation of the SVD method.
modelSVD@model$svd$d
## [1] 2909.17096 44.21351 40.05544 34.32164 31.64448 29.48242
## [7] 28.22409 27.40770 26.25308 25.84147 25.27600 25.08950
## [13] 24.08783 23.69596 23.12094 22.77796 22.38026 21.81070
## [19] 21.34381 21.13460 20.82351 20.55284 20.14412 20.03202
## [25] 19.78793 19.72819 19.66162 19.33117 19.10150 18.83120
## [31] 18.68897 18.64556 18.42660 18.16176 18.12811 18.07196
## [37] 17.71729 17.59136 17.36400 17.33684 17.20880 17.04195
## [43] 16.99371 16.84449 16.73244 16.58959 16.47004 16.20081
## [49] 16.15551 16.00560
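The dominance of the first concept can be quantified directly from these singular values (a quick check; energy here means the sum of squared singular values):
# Share of total energy captured by the first singular value (close to 99%, as noted above)
d <- modelSVD@model$svd$d
d[1]^2 / sum(d^2)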
To test the SVD method further, I decided to perform the decomposition manually, without any additional recommender functionality, using the base R svd function.
First the ratings matrix is normalized: ratings are centered so that they take both negative and positive values, and NA values are replaced with 0. Now we can decompose the original matrix.
# Normalize matrix
movieMatrix <- as.matrix(normalize(movieRealMatrix)@data)
# Perform SVD
movieSVD <- svd(movieMatrix)
rownames(movieSVD$u) <- rownames(movieMatrix)
rownames(movieSVD$v) <- colnames(movieMatrix)
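A quick sanity check confirms that the normalized matrix now contains both negative and positive values (the exact range depends on the normalization):
# Centered ratings should span negative and positive values
range(movieMatrix)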
This process generates 671 concepts. Clearly, to be usable we need to reduce the number of dimensions/concepts by setting some singular values in the diagonal matrix \(\Sigma\) to 0. Per Leskovec et al. (Mining of Massive Datasets, 2014, p. 424), we retain enough singular values to make up 90% of the energy of \(\Sigma\), where energy is the sum of the squared singular values.
# Reduce dimensions: keep enough singular values to cover 90% of the energy
n <- length(movieSVD$d)
total_energy <- sum(movieSVD$d^2)
for (i in (n-1):1) {
  energy <- sum(movieSVD$d[1:i]^2)
  if (energy/total_energy < 0.9) {
    n_dims <- i+1
    break
  }
}
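The same cut-off can also be computed without a loop, using the cumulative energy directly (an equivalent sketch of the loop above):
# Equivalent vectorized calculation: smallest number of concepts covering 90% of the energy
energy_share <- cumsum(movieSVD$d^2) / sum(movieSVD$d^2)
n_dims <- min(which(energy_share >= 0.9))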
This process leaves us with 283 dimensions/concepts. This is still a high number, but much more manageable than 671 (almost 60% reduction).
# Keep only the retained singular values and the corresponding columns of U and V
trim_mov_D <- movieSVD$d[1:n_dims]
trim_mov_U <- movieSVD$u[, 1:n_dims]
trim_mov_V <- movieSVD$v[, 1:n_dims]
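Multiplying these three components back together yields a rank-reduced approximation of the normalized ratings matrix, which is how the trimmed decomposition would be used to estimate ratings (a sketch; the result is a dense 671 x 9,066 matrix):
# Rank-reduced approximation of the normalized ratings matrix
approxMatrix <- trim_mov_U %*% diag(trim_mov_D) %*% t(trim_mov_V)
dim(approxMatrix)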
Consider the first two concepts, with singular values of 73.6 and 51.9. Let us pick the 5 movies with the highest and lowest values in each concept and plot them.
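These singular values can be read directly off the trimmed decomposition:
# Singular values of the first two concepts (approximately 73.6 and 51.9)
round(trim_mov_D[1:2], 1)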
mov_count <- 5
movies <- as.data.frame(trim_mov_V) %>% select(V1, V2)
movies$movieId <- as.integer(rownames(movies))
mov_sample <- movies %>% arrange(V1) %>% head(mov_count)
mov_sample <- rbind(mov_sample, movies %>% arrange(desc(V1)) %>% head(mov_count))
mov_sample <- rbind(mov_sample, movies %>% arrange(V2) %>% head(mov_count))
mov_sample <- rbind(mov_sample, movies %>% arrange(desc(V2)) %>% head(mov_count))
mov_sample <- mov_sample %>%
  inner_join(titles, by = "movieId") %>%
  select(Movie = "title", Concept1 = "V1", Concept2 = "V2")
mov_sample$Concept1 <- round(mov_sample$Concept1, 4)
mov_sample$Concept2 <- round(mov_sample$Concept2, 4)
knitr::kable(mov_sample, format = "html") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))
| Movie | Concept1 | Concept2 |
|---|---|---|
| Pulp Fiction (1994) | -0.1455 | 0.0206 |
| Godfather, The (1972) | -0.1222 | 0.0207 |
| Silence of the Lambs, The (1991) | -0.1147 | 0.0153 |
| Shawshank Redemption, The (1994) | -0.1115 | -0.0027 |
| Star Wars: Episode IV - A New Hope (1977) | -0.1073 | 0.0540 |
| Batman Forever (1995) | 0.0850 | -0.0095 |
| Ace Ventura: Pet Detective (1994) | 0.0801 | 0.0432 |
| Batman & Robin (1997) | 0.0757 | -0.0331 |
| Ace Ventura: When Nature Calls (1995) | 0.0712 | -0.0245 |
| Waterworld (1995) | 0.0691 | -0.0290 |
| Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | -0.0198 | -0.0374 |
| Lives of Others, The (Das leben der Anderen) (2006) | -0.0055 | -0.0363 |
| Some Like It Hot (1959) | -0.0263 | -0.0354 |
| Unbearable Lightness of Being, The (1988) | 0.0017 | -0.0352 |
| Modern Times (1936) | -0.0182 | -0.0346 |
| Star Wars: Episode V - The Empire Strikes Back (1980) | -0.1054 | 0.0770 |
| Clerks (1994) | -0.0373 | 0.0767 |
| Chasing Amy (1997) | -0.0111 | 0.0723 |
| Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981) | -0.0922 | 0.0720 |
| Star Wars: Episode VI - Return of the Jedi (1983) | -0.0776 | 0.0708 |
ggplot(mov_sample, aes(Concept1, Concept2, label = Movie)) +
  geom_point() +
  geom_text_repel(aes(label = Movie), hjust = -0.1, vjust = -0.1, size = 3) +
  scale_x_continuous(limits = c(-0.2, 0.2)) +
  scale_y_continuous(limits = c(-0.1, 0.1))
This plot nicely demonstrates one of the biggest disadvantages of the SVD method - the inability to connect characteristics/concepts to real-world categories. We can certainly agree that it is meaningful that all of the original Star Wars movies are close together, or that Pulp Fiction and Ace Ventura are on opposite sides, but there is no clear way to categorize these movies.
This project explored some features of SVD. It showed that SVD is faster than user-based collaborative filtering at making predictions and provides good recommendations, but at the same time its concepts are difficult to interpret.