For this project we will build two simple recommender systems using the MovieLens dataset.
library(dplyr)
library(tidytext)
library(recommenderlab)
library(psych)
library(knitr)
The dataset was procured from kaggle and hosted on Github.
item_df <- read.delim('https://raw.githubusercontent.com/mehtablocker/cuny_612/master/data_files/MovieLens/u.item', header=F, sep="|")
item_names_vec <- "movie id | movie title | release date | video release date |
IMDb URL | unknown | Action | Adventure | Animation |
Children's | Comedy | Crime | Documentary | Drama | Fantasy |
Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
Thriller | War | Western"
names(item_df) <- strsplit(item_names_vec, split = "\\|") %>%
sapply(., trimws) %>% as.vector() %>% gsub(" ", "_", .) %>% gsub("\\'", "", .)
item_df %>% head() %>% kable()
| movie_id | movie_title | release_date | video_release_date | IMDb_URL | unknown | Action | Adventure | Animation | Childrens | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Toy Story (1995) | 01-Jan-1995 | NA | http://us.imdb.com/M/title-exact?Toy%20Story%20(1995) | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | GoldenEye (1995) | 01-Jan-1995 | NA | http://us.imdb.com/M/title-exact?GoldenEye%20(1995) | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | Four Rooms (1995) | 01-Jan-1995 | NA | http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | Get Shorty (1995) | 01-Jan-1995 | NA | http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995) | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | Copycat (1995) | 01-Jan-1995 | NA | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6 | Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) | 01-Jan-1995 | NA | http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ratings_df <- read.delim('https://raw.githubusercontent.com/mehtablocker/cuny_612/master/data_files/MovieLens/u.data', header=F)
names(ratings_df) <- c("user_id", "item_id", "rating", "timestamp")
ratings_df %>% head() %>% kable()
| user_id | item_id | rating | timestamp |
|---|---|---|---|
| 196 | 242 | 3 | 881250949 |
| 186 | 302 | 3 | 891717742 |
| 22 | 377 | 1 | 878887116 |
| 244 | 51 | 2 | 880606923 |
| 166 | 346 | 1 | 886397596 |
| 298 | 474 | 4 | 884182806 |
We use the tidy dataframes to create an item content matrix and a user-item matrix.
item_mat <- as.matrix(item_df[ , 7:24])
ui_mat <- ratings_df %>%
arrange(user_id, item_id) %>%
cast_sparse(user_id, item_id, rating) %>%
as.matrix()
item_mat[1:5, 1:5] %>% kable()
| Action | Adventure | Animation | Childrens | Comedy |
|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 |
ui_mat[1:5, 1:5] %>% kable()
| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 5 | 3 | 4 | 3 | 3 |
| 4 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 |
| 4 | 3 | 0 | 0 | 0 |
We create a simple Content-Based Recommender by first finding the Jaccard similarities between all rows of the item matrix.
jac_mat <- as(similarity(as(item_mat, "realRatingMatrix"), method = "jaccard"), "matrix")
jac_mat[1:5, 1:5] %>% kable()
| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 0.0 | 0.0000000 | 0.0000000 | 0.2 | 0.0000000 |
| 0.0 | 0.0000000 | 0.3333333 | 0.2 | 0.2000000 |
| 0.0 | 0.3333333 | 0.0000000 | 0.0 | 0.3333333 |
| 0.2 | 0.2000000 | 0.0000000 | 0.0 | 0.2000000 |
| 0.0 | 0.2000000 | 0.3333333 | 0.2 | 0.0000000 |
Then we recommend three movies to an example user by finding a movie they rated “5” and looking up the top three Jaccard similarities for that movie which the user has not yet seen.
user_ratings <- ui_mat[1, ]
movie_id <- which(user_ratings==5)[1] %>% unname()
movie_id
## [1] 1
similar_ids <- order(jac_mat[movie_id,], decreasing=T)
sim_ord_user_ratings <- user_ratings[similar_ids]
reco_ids <- sim_ord_user_ratings[sim_ord_user_ratings==0] %>% head(3) %>% names() %>% as.integer()
reco_ids
## [1] 422 1219 377
Let’s informally evaluate our recommendations for the example user by seeing if they make sense. The first movie_id that the user rated “5” is movie_id 1, which above we can see is “Toy Story.” According to our Jaccard similarities cross referenced with the user ratings, the top three most similar movies which the user has not yet seen are:
item_df %>% slice(reco_ids) %>% select(movie_id, movie_title) %>% kable()
| movie_id | movie_title |
|---|---|
| 422 | Aladdin and the King of Thieves (1996) |
| 1219 | Goofy Movie, A (1995) |
| 377 | Heavyweights (1994) |
For a person who rated “Toy Story” very highly, these seem like pretty good recommendations!
Next we create an Item-Based Collaborative Filtering Recommender using the same MovieLens data. To do this we will compare the items’ similarities according to the user ratings rather than any pre-determined features such as genre. Using the user-item matrix from above, we first replace all 0’s with NA’s (to signify movies that were not rated by particular users) and then create a Pearson correlation matrix for all movies (setting a minimum threshold for sample size.) This is just like the Jaccard similarities matrix above except now we are comparing movies by user ratings rather than genre, and we are using a different distance metric since our data is no longer binary.
ui_mat[ui_mat==0] <- NA
min_n <- 20
corr_obj <- suppressWarnings(corr.test(ui_mat, ci=F))
cor_mat <- corr_obj$r
cor_mat[which(corr_obj$n < min_n)] <- NA
We can now, for example, find a user who rated “Toy Story” very highly, just as we did above. Then using the item correlation matrix we can recommend the top three most similar movies that the user has not yet seen.
user_ratings <- ui_mat[1, ]
movie_id <- which(user_ratings==5)[1] %>% unname()
movie_id
## [1] 1
similar_ids <- order(cor_mat[, movie_id], decreasing=T)
sim_ord_user_ratings <- user_ratings[similar_ids]
reco_ids <- sim_ord_user_ratings[is.na(sim_ord_user_ratings)] %>% head(3) %>% names() %>% as.integer()
reco_ids
## [1] 426 539 921
item_df %>% slice(reco_ids) %>% select(movie_id, movie_title) %>% kable()
| movie_id | movie_title |
|---|---|
| 426 | Transformers: The Movie, The (1986) |
| 539 | Mouse Hunt (1997) |
| 921 | Farewell My Concubine (1993) |
The first two seem like reasonable recommendations, but the third is a bit out of place.
We can more formally evaluate our model by calculating numerical ratings predictions for the movies a user has not seen. We do this by splitting the data into train and test sets, re-running the correlation matrix on the training data, using the correlation matrix to calculate predicted ratings, and evaluating the predictions on the test data.
### Create test and train indexes
pct_test <- 0.2
n_test <- round( pct_test * sum(!is.na(ui_mat)), 0)
na_ind <- which(is.na(ui_mat))
test_ind <- sample((1:length(ui_mat))[-na_ind], n_test, replace = F)
train_ind <- (1:length(ui_mat))[-c(na_ind, test_ind)]
### Break matrix into test and train
train_mat <- ui_mat
train_mat[test_ind] <- NA
test_mat <- ui_mat
test_mat[train_ind] <- NA
### Create correlation matrix from training data
corr_obj_train <- suppressWarnings(corr.test(train_mat, ci=F))
cor_mat_train <- corr_obj_train$r
cor_mat_train[which(corr_obj_train$n < min_n)] <- NA
### For each user...
### Hadamard multiply empirical ratings by correlation matrix (re-centered and squared).
### Then remove empirical ratings and take column averages to get predicted ratings for each item.
### Finally, transform the predicted ratings distribution to more closely resemble empirical shape.
prediction_mat <- matrix(NA, nrow = nrow(train_mat), ncol = ncol(train_mat))
for (i in 1:nrow(train_mat)){
user_pred_mat <- train_mat[i, ] * (cor_mat_train - min(cor_mat_train, na.rm=T))^2
diag(user_pred_mat) <- NA
user_pred_vec <- colMeans(user_pred_mat, na.rm=T)
user_pred_vec_trans <- user_pred_vec + (median(train_mat[i, ], na.rm=T) - median(user_pred_vec, na.rm=T))
user_pred_vec_trans <- pmin(user_pred_vec_trans, 5)
prediction_mat[i, ] <- user_pred_vec_trans
}
### Calculate RMSE on Test data and compare to RMSE of simply predicting the global average every time
prediction_mat_chance <- matrix(mean(train_mat, na.rm=T), nrow = nrow(train_mat), ncol = ncol(train_mat))
prediction_mat_chance[which(is.na(prediction_mat))] <- NA
rmse <- function(predicted, observed, rmna=FALSE){
sqrt(mean((predicted - observed)^2, na.rm=rmna))
}
rmse(prediction_mat, test_mat, rmna=T)
## [1] 1.154043
rmse(prediction_mat_chance, test_mat, rmna=T)
## [1] 1.098416
Our prediction model does not fare much differently than just guessing the average (~3.5) every time, so clearly it needs refinement.
We created simple versions of a Content Based Recommender and a Collaborative Filtering Recommender. The benefits of the former are that one does not need other users’ information in order to make a recommendation. The downsides are that the features (“movie genres”, in this example) need to be well determined and tagged in advance. The benefits of the latter are that our model is not determined by how well we feature engineer. The downsides are that we need all other users’ data, so we suffer from a “cold start” problem where we cannot start using the model until we have a reasonably large amount of data.
Our ratings prediction model built with the User-Item Matrix and Item Correlation Matrix did not fare much differently than a rudimentary model of guessing the global average every time.