# Required libraries
library(recommenderlab)
library(tidyverse)
library(ggthemes)
library(kableExtra)
library(skimr)
library(ggrepel)
library(tictoc)

The data set is courtesy of the MovieLens project and was downloaded from https://grouplens.org/datasets/movielens/. Please note that I reduced the size of the movie matrix so that my circa 1990 Mac Mini could handle the load. The data set comprises two files: ratings and titles. We used the skimr package to explore the data. SVD models require no missing data, and skimr will let us know where we stand in that regard.

# Data import
setwd("C:/Users/mutue/OneDrive/Documents/Data612")
ratings <- read.csv('ratings150.csv')
titles <- read.csv('movies.csv')

| Name | ratings |
|---|---|
| Number of rows | 18262 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| userId | 0 | 1 | 6.866000e+01 | 43.23 | 1.0 | 31 | 72 | 1.060000e+02 | 150 | ▇▆▇▆▅ |
| movieId | 0 | 1 | 1.796845e+04 | 34504.35 | 1.0 | 1047 | 2540 | 6.952750e+03 | 203519 | ▇▁▁▁▁ |
| rating | 0 | 1 | 3.600000e+00 | 1.02 | 0.5 | 3 | 4 | 4.000000e+00 | 5 | ▁▂▅▇▅ |
| timestamp | 0 | 1 | 1.193247e+09 | 232839079.08 | 828708507.0 | 980644978 | 1169595541 | 1.439474e+09 | 1574195101 | ▆▆▆▂▇ |

| Name | titles |
|---|---|
| Number of rows | 62423 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| title | 0 | 1 | FALSE | 62325 | 9 (: 2, Abs: 2, Ala: 2, Alo: 2 |
| genres | 0 | 1 | FALSE | 1639 | Dra: 9056, Com: 5674, (no: 5062, Doc: 4731 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| movieId | 0 | 1 | 122220.4 | 63264.74 | 1 | 82146.5 | 138022 | 173222 | 209171 | ▅▂▅▇▇ |
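
Next, the ratings are spread into a wide user-by-movie matrix and converted to a realRatingMatrix. The end result is a 150 by 4332 rating matrix with more than 18,000 ratings.

# Drop the timestamp and spread to one row per user, one column per movie
movieMatrix <- ratings %>%
  select(-timestamp) %>%
  spread(movieId, rating)
# Use userId as the row names, then drop the userId column
row.names(movieMatrix) <- movieMatrix[,1]
movieMatrix <- as.matrix(movieMatrix[-c(1)])
# Coerce to recommenderlab's sparse rating class
movieRealMatrix <- as(movieMatrix, "realRatingMatrix")
movieRealMatrix

## 150 x 4332 rating matrix of class 'realRatingMatrix' with 18262 ratings.
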
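With 18,262 ratings spread over 150 x 4332 cells, only about 2.8% of the matrix is filled. The sketch below shows, on a toy data frame, how that density can be checked. It also uses recommenderlab's direct coercion from a long (user, item, rating) data frame as a possible alternative to the spread step. The names `toy_ratings` and `toy_real` and the rating values are hypothetical and are not taken from the MovieLens files.

```r
library(recommenderlab)

# Toy long-format ratings (hypothetical values, not the MovieLens data)
toy_ratings <- data.frame(
  userId  = c(1, 1, 2, 2, 3),
  movieId = c(10, 20, 10, 30, 20),
  rating  = c(4, 3.5, 5, 2, 4.5)
)

# Direct coercion: the three columns are read as user, item, rating
toy_real <- as(toy_ratings, "realRatingMatrix")

# Share of user-movie cells that actually hold a rating
density <- nratings(toy_real) / (nrow(toy_real) * ncol(toy_real))
density
```

The coercion keeps the matrix sparse from the start, which may matter on a memory-constrained machine.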
We will build a User-Based Collaborative Filtering (UBCF) model and an SVD model. We will compare the models’ performance, based on RMSE, as well as the time required to build and predict under each methodology.

See Table 1 below for the performance results of the UBCF model.

| Metric | Value |
|---|---|
| RMSE | 0.8682038 |
| MSE | 0.7537779 |
| MAE | 0.6706194 |
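
The code behind Table 1 is not shown, so the following is only a minimal sketch of how a UBCF model might be trained and scored with recommenderlab. It uses the package's built-in `MovieLense` sample data rather than the author's matrix, and the 80/20 split, `given = 10`, `goodRating = 4` and the seed are assumptions made for illustration.

```r
library(recommenderlab)

data(MovieLense)  # built-in sample data, standing in for the author's matrix
set.seed(612)

# Keep reasonably active users and popular movies so the example runs quickly
small <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]

# 80/20 split; 10 ratings per test user are shown to the model, the rest withheld
scheme <- evaluationScheme(small, method = "split", train = 0.8,
                           given = 10, goodRating = 4)

ubcf_model <- Recommender(getData(scheme, "train"), method = "UBCF")
ubcf_pred  <- predict(ubcf_model, getData(scheme, "known"), type = "ratings")

# Named vector with RMSE, MSE and MAE on the withheld ratings
ubcf_error <- calcPredictionAccuracy(ubcf_pred, getData(scheme, "unknown"))
ubcf_error
```

The error values from this sketch will differ from Table 1 because the data and the split are different.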

Next we create the SVD model. After some tuning, k = 20 was used in the final SVD model. Table 2 below sets forth the performance of the SVD model.

| Metric | Value |
|---|---|
| RMSE | 0.8731257 |
| MSE | 0.7623486 |
| MAE | 0.6761709 |
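
The tuning procedure is not shown. One plausible way to choose k, sketched here under assumptions, is to fit recommenderlab's SVD recommender over a small grid of ranks and compare RMSE on withheld ratings. The grid `k_grid`, the built-in `MovieLense` data, and the split settings are all hypothetical choices for this example.

```r
library(recommenderlab)

data(MovieLense)  # built-in sample data, standing in for the author's matrix
set.seed(612)
small  <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100]
scheme <- evaluationScheme(small, method = "split", train = 0.8,
                           given = 10, goodRating = 4)

k_grid <- c(5, 10, 20, 40)  # hypothetical candidate ranks

# Fit one SVD recommender per rank and record its RMSE on the withheld ratings
rmse_by_k <- sapply(k_grid, function(k) {
  model <- Recommender(getData(scheme, "train"), method = "SVD",
                       parameter = list(k = k))
  pred  <- predict(model, getData(scheme, "known"), type = "ratings")
  calcPredictionAccuracy(pred, getData(scheme, "unknown"))[["RMSE"]]
})
names(rmse_by_k) <- paste0("k=", k_grid)
rmse_by_k
```

A single split gives a noisy estimate, so repeating this over several splits or using cross-validation would make the choice of k more reliable.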

The results of the speed comparison are interesting. The UBCF model trained about 10x faster than the SVD model (0.02 seconds vs 0.20 seconds). However, the UBCF predictions took almost 10x longer (0.82 seconds vs 0.09 seconds). Exact timings vary slightly from run to run, as the table below shows.

| Run Time |
|---|
| UBCF Model - Training: 0.03 sec elapsed |
| UBCF Model - Predicting: 0.69 sec elapsed |
| SVD Model - Training: 0.21 sec elapsed |
| SVD Model - Predicting: 0.08 sec elapsed |
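
The run-time strings above have the format that tictoc produces. The sketch below shows one way such a table could be collected with tictoc's log. The two timed steps are toy matrix operations with hypothetical labels, not the author's model calls.

```r
library(tictoc)

tic.clearlog()  # start from an empty log

tic("Toy step A")  # hypothetical label
a <- svd(matrix(rnorm(200 * 50), 200, 50))
toc(log = TRUE, quiet = TRUE)

tic("Toy step B")  # hypothetical label
b <- crossprod(matrix(rnorm(200 * 50), 200, 50))
toc(log = TRUE, quiet = TRUE)

# One formatted line per timed step, of the form "<label>: <t> sec elapsed"
run_times <- data.frame(`Run Time` = unlist(tic.log(format = TRUE)),
                        check.names = FALSE)
run_times
```

Because sub-second timings are noisy, averaging several repetitions would give a steadier comparison than a single run.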
Now we will make some movie predictions to see if the models produce similar results. Since it’s the 22nd of June, we’ll pick the 22nd user and see how she rated her movies.
Our movie rater appears to be somewhat generous, or someone who simply likes movies. Of the 22 movies rated, 19 were rated either 4 or 5. Dumb & Dumber got a rating of 1, and Pulp Fiction and Ace Ventura each earned a 3. This could indicate that our movie rater does not like violent or comedy movies. There does appear to be a preference for action/suspense, drama and feel-good movies.

| Movie | Rating |
|---|---|
| Dumb & Dumber (Dumb and Dumber) (1994) | 1 |
| Pulp Fiction (1994) | 3 |
| Ace Ventura: Pet Detective (1994) | 3 |
| Crimson Tide (1995) | 4 |
| Waterworld (1995) | 4 |
| Interview with the Vampire: The Vampire Chronicles (1994) | 4 |
| Shawshank Redemption, The (1994) | 4 |
| True Lies (1994) | 4 |
| Cliffhanger (1993) | 4 |
| Beauty and the Beast (1991) | 4 |
| Apollo 13 (1995) | 5 |
| Batman Forever (1995) | 5 |
| Die Hard: With a Vengeance (1995) | 5 |
| Net, The (1995) | 5 |
| Outbreak (1995) | 5 |
| Stargate (1994) | 5 |
| Star Trek: Generations (1994) | 5 |
| While You Were Sleeping (1995) | 5 |
| Clear and Present Danger (1994) | 5 |
| Aladdin (1992) | 5 |
| Dances with Wolves (1990) | 5 |
| Batman (1989) | 5 |

# Row 22 of the UBCF predictions as a one-column data frame
mov_recommend1 <- as.data.frame(predUBCF@data[22, ])
colnames(mov_recommend1) <- c("Rating")
mov_recommend1$movieId <- as.integer(rownames(mov_recommend1))
# Keep the five highest predicted ratings and look up their titles
mov_recommend1 <- mov_recommend1 %>% arrange(desc(Rating)) %>% head(5) %>%
  inner_join(titles, by = "movieId") %>%
  select(Movie = "title")
kable(mov_recommend1) %>%
  kable_styling()

| Movie |
|---|
| Pulp Fiction (1994) |
| Godfather, The (1972) |
| Taxi Driver (1976) |
| Silence of the Lambs, The (1991) |
| Star Wars: Episode IV - A New Hope (1977) |

# Row 22 of the SVD predictions as a one-column data frame
mov_recommend2 <- as.data.frame(predSVD@data[22, ])
colnames(mov_recommend2) <- c("Rating")
mov_recommend2$movieId <- as.integer(rownames(mov_recommend2))
# Keep the five highest predicted ratings and look up their titles
mov_recommend2 <- mov_recommend2 %>% arrange(desc(Rating)) %>% head(5) %>%
  inner_join(titles, by = "movieId") %>%
  select(Movie = "title")
kable(mov_recommend2) %>%
  kable_styling()

| Movie |
|---|
| Butch Cassidy and the Sundance Kid (1969) |
| Crying Game, The (1992) |
| Star Wars: Episode IV - A New Hope (1977) |
| Raising Arizona (1987) |
| Godfather: Part II, The (1974) |

The two approaches yield similar results. Each algorithm recommended a Star Wars movie and a Godfather movie. I can also see similarities between Taxi and Raising Arizona (lighthearted and funny). From here the UBCF recommended two great, albeit violent, movies, and the SVD went with Butch Cassidy and the Sundance Kid and The Crying Game. These pairs don’t seem to be too closely related.

NA values are replaced with 0, and the matrix contains both negative and positive ratings. Next we use the svd function to decompose movieMatrix.

Below we display the five movies with the largest-magnitude (most negative) values for Concept 1 (rows 1-5) and Concept 2 (rows 6-10). Shawshank Redemption through Usual Suspects comprise Concept 1. This grouping includes two Star Wars movies and, in my opinion, three classic movies in Shawshank Redemption, Pulp Fiction and Usual Suspects. One knock against SVD is that the concepts are “anonymous”, or a black box. We take comfort in the fact that they seem to go together fairly well.

Concept 2 is comprised of two Titanic movies, Twister, Mr. Holland’s Opus and Arachnophobia. Aside from the Titanic movies, this grouping is a bit more difficult for me to understand.
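
The decomposition step itself is not shown, so here is a minimal sketch of the idea on a toy matrix using base R's `svd`. It assumes each user's ratings are centered on that user's mean before the NA values are set to 0. The author's exact normalization is not stated, and the names `toy`, `centered` and `toy_V` are hypothetical.

```r
# Toy user-by-movie matrix (hypothetical values); NA marks an unrated movie
toy <- matrix(c(5,  4,  NA, 1,
                4,  NA, 2,  1,
                1,  2,  5,  NA,
                NA, 1,  4,  5),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("u", 1:4), paste0("m", 1:4)))

# Center each user's ratings on that user's mean (assumed normalization),
# then replace the remaining NA values with 0
centered <- toy - rowMeans(toy, na.rm = TRUE)
centered[is.na(centered)] <- 0

decomp <- svd(centered)

# Keep the first two right singular vectors as the movie "concepts"
k <- 2
toy_V <- decomp$v[, 1:k]
rownames(toy_V) <- colnames(toy)
colnames(toy_V) <- paste0("V", 1:k)
round(toy_V, 4)
```

The sign of a singular vector is arbitrary, so a group of movies with strongly negative values on a concept is as meaningful as a group with strongly positive ones.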

mov_count <- 5
# Keep the first two columns (concepts) of the movie-side matrix
movies <- as.data.frame(trim_mov_V) %>% select(V1, V2)
movies$movieId <- as.integer(rownames(movies))
# Ascending sort, so head() returns the most negative values on each concept
mov_sample <- movies %>% arrange(V1) %>% head(mov_count)
mov_sample <- rbind(mov_sample, movies %>% arrange(V2) %>% head(mov_count))
mov_sample <- mov_sample %>% inner_join(titles, by = "movieId") %>%
  select(Movie = "title", Concept1 = "V1", Concept2 = "V2")
mov_sample$Concept1 <- round(mov_sample$Concept1, 4)
mov_sample$Concept2 <- round(mov_sample$Concept2, 4)
knitr::kable(mov_sample) %>%
  kableExtra::kable_styling()

| Movie | Concept1 | Concept2 |
|---|---|---|
| Shawshank Redemption, The (1994) | -0.1278 | 0.0841 |
| Star Wars: Episode V - The Empire Strikes Back (1980) | -0.1258 | 0.0864 |
| Star Wars: Episode IV - A New Hope (1977) | -0.1234 | 0.0835 |
| Pulp Fiction (1994) | -0.1193 | 0.0912 |
| Usual Suspects, The (1995) | -0.1116 | 0.0493 |
| Twister (1996) | -0.0113 | -0.0835 |
| Titanic (1997) | -0.0253 | -0.0780 |
| Titanic (1953) | -0.0228 | -0.0722 |
| Mr. Holland’s Opus (1995) | 0.0096 | -0.0687 |
| Arachnophobia (1990) | 0.0084 | -0.0608 |

The SVD algorithm seems to perform as well as other popular algorithms (or at least UBCF). Training an SVD model appears to be more computationally complex and required more time than the UBCF approach. However, when it comes to prediction, SVD offers a key advantage for deployment: it’s fast. The SVD prediction was almost 10 times faster than UBCF. Finally, SVD is a bit of a black box. Concepts are produced, but there is no guidebook that explains what they mean. As long as one is aware of this and SVD produces good results almost 10x faster than UBCF at prediction time, it seems like a viable alternative for a recommender system.