Intro

For this project we will build two simple recommender systems using the MovieLens dataset.


Load libraries

library(dplyr)
library(tidytext)
library(recommenderlab)
library(psych)
library(knitr)


Get data

The dataset was procured from kaggle and hosted on Github.

item_df <- read.delim('https://raw.githubusercontent.com/mehtablocker/cuny_612/master/data_files/MovieLens/u.item', header=F, sep="|")
item_names_vec <- "movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
Children's | Comedy | Crime | Documentary | Drama | Fantasy |
Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
Thriller | War | Western"
names(item_df) <- strsplit(item_names_vec, split = "\\|") %>% 
  sapply(., trimws) %>% as.vector() %>% gsub(" ", "_", .) %>% gsub("\\'", "", .)
item_df %>% head() %>% kable()
movie_id movie_title release_date video_release_date IMDb_URL unknown Action Adventure Animation Childrens Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
1 Toy Story (1995) 01-Jan-1995 NA http://us.imdb.com/M/title-exact?Toy%20Story%20(1995) 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 GoldenEye (1995) 01-Jan-1995 NA http://us.imdb.com/M/title-exact?GoldenEye%20(1995) 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
3 Four Rooms (1995) 01-Jan-1995 NA http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
4 Get Shorty (1995) 01-Jan-1995 NA http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995) 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
5 Copycat (1995) 01-Jan-1995 NA http://us.imdb.com/M/title-exact?Copycat%20(1995) 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0
6 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 01-Jan-1995 NA http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995) 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
ratings_df <- read.delim('https://raw.githubusercontent.com/mehtablocker/cuny_612/master/data_files/MovieLens/u.data', header=F)
names(ratings_df) <- c("user_id", "item_id", "rating", "timestamp")
ratings_df %>% head() %>% kable()
user_id item_id rating timestamp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806


Create matrices

We use the tidy dataframes to create an item content matrix and a user-item matrix.

item_mat <- as.matrix(item_df[ , 7:24])

ui_mat <- ratings_df %>% 
  arrange(user_id, item_id) %>% 
  cast_sparse(user_id, item_id, rating) %>% 
  as.matrix()

item_mat[1:5, 1:5] %>% kable()
Action Adventure Animation Childrens Comedy
0 0 1 1 1
1 1 0 0 0
0 0 0 0 0
1 0 0 0 1
0 0 0 0 0
ui_mat[1:5, 1:5] %>% kable()
1 2 3 4 5
5 3 4 3 3
4 0 0 0 0
0 0 0 0 0
0 0 0 0 0
4 3 0 0 0


Content based recommender

We create a simple Content-Based Recommender by first finding the Jaccard similarities between all rows of the item matrix.

jac_mat <- as(similarity(as(item_mat, "realRatingMatrix"), method = "jaccard"), "matrix")
jac_mat[1:5, 1:5] %>% kable()
1 2 3 4 5
0.0 0.0000000 0.0000000 0.2 0.0000000
0.0 0.0000000 0.3333333 0.2 0.2000000
0.0 0.3333333 0.0000000 0.0 0.3333333
0.2 0.2000000 0.0000000 0.0 0.2000000
0.0 0.2000000 0.3333333 0.2 0.0000000

Then we recommend three movies to an example user by finding a movie they rated “5” and looking up the top three Jaccard similarities for that movie which the user has not yet seen.

user_ratings <- ui_mat[1, ]
movie_id <- which(user_ratings==5)[1] %>% unname()
movie_id
## [1] 1
similar_ids <- order(jac_mat[movie_id,], decreasing=T)
sim_ord_user_ratings <- user_ratings[similar_ids]
reco_ids <- sim_ord_user_ratings[sim_ord_user_ratings==0] %>% head(3) %>% names() %>% as.integer()
reco_ids
## [1]  422 1219  377


Evaluate content

Let’s informally evaluate our recommendations for the example user by seeing if they make sense. The first movie_id that the user rated “5” is movie_id 1, which above we can see is “Toy Story.” According to our Jaccard similarities cross referenced with the user ratings, the top three most similar movies which the user has not yet seen are:

item_df %>% slice(reco_ids) %>% select(movie_id, movie_title) %>% kable()
movie_id movie_title
422 Aladdin and the King of Thieves (1996)
1219 Goofy Movie, A (1995)
377 Heavyweights (1994)

For a person who rated “Toy Story” very highly, these seem like pretty good recommendations!


Collaborative filtering

Next we create an Item-Based Collaborative Filtering Recommender using the same MovieLens data. To do this we will compare the items’ similarities according to the user ratings rather than any pre-determined features such as genre. Using the user-item matrix from above, we first replace all 0’s with NA’s (to signify movies that were not rated by particular users) and then create a Pearson correlation matrix for all movies (setting a minimum threshold for sample size.) This is just like the Jaccard similarities matrix above except now we are comparing movies by user ratings rather than genre, and we are using a different distance metric since our data is no longer binary.

ui_mat[ui_mat==0] <- NA
min_n <- 20
corr_obj <- suppressWarnings(corr.test(ui_mat, ci=F))
cor_mat <- corr_obj$r
cor_mat[which(corr_obj$n < min_n)] <- NA

We can now, for example, find a user who rated “Toy Story” very highly, just as we did above. Then using the item correlation matrix we can recommend the top three most similar movies that the user has not yet seen.

user_ratings <- ui_mat[1, ]
movie_id <- which(user_ratings==5)[1] %>% unname()
movie_id
## [1] 1
similar_ids <- order(cor_mat[, movie_id], decreasing=T)
sim_ord_user_ratings <- user_ratings[similar_ids]
reco_ids <- sim_ord_user_ratings[is.na(sim_ord_user_ratings)] %>% head(3) %>% names() %>% as.integer()
reco_ids
## [1] 426 539 921
item_df %>% slice(reco_ids) %>% select(movie_id, movie_title) %>% kable()
movie_id movie_title
426 Transformers: The Movie, The (1986)
539 Mouse Hunt (1997)
921 Farewell My Concubine (1993)

The first two seem like reasonable recommendations, but the third is a bit out of place.


Ratings predictions

We can more formally evaluate our model by calculating numerical ratings predictions for the movies a user has not seen. We do this by splitting the data into train and test sets, re-running the correlation matrix on the training data, using the correlation matrix to calculate predicted ratings, and evaluating the predictions on the test data.

### Create test and train indexes
pct_test <- 0.2
n_test <- round( pct_test * sum(!is.na(ui_mat)), 0)
na_ind <- which(is.na(ui_mat))
test_ind <- sample((1:length(ui_mat))[-na_ind], n_test, replace = F)
train_ind <- (1:length(ui_mat))[-c(na_ind, test_ind)]

### Break matrix into test and train
train_mat <- ui_mat
train_mat[test_ind] <- NA
test_mat <- ui_mat
test_mat[train_ind] <- NA

### Create correlation matrix from training data
corr_obj_train <- suppressWarnings(corr.test(train_mat, ci=F))
cor_mat_train <- corr_obj_train$r
cor_mat_train[which(corr_obj_train$n < min_n)] <- NA

### For each user...
### Hadamard multiply empirical ratings by correlation matrix (re-centered and squared).
### Then remove empirical ratings and take column averages to get predicted ratings for each item.
### Finally, transform the predicted ratings distribution to more closely resemble empirical shape.
prediction_mat <- matrix(NA, nrow = nrow(train_mat), ncol = ncol(train_mat))
for (i in 1:nrow(train_mat)){
  user_pred_mat <- train_mat[i, ] * (cor_mat_train - min(cor_mat_train, na.rm=T))^2
  diag(user_pred_mat) <- NA
  user_pred_vec <- colMeans(user_pred_mat, na.rm=T)
  user_pred_vec_trans <- user_pred_vec  + (median(train_mat[i, ], na.rm=T) - median(user_pred_vec, na.rm=T))
  user_pred_vec_trans <- pmin(user_pred_vec_trans, 5)
  prediction_mat[i, ] <- user_pred_vec_trans
}

### Calculate RMSE on Test data and compare to RMSE of simply predicting the global average every time
prediction_mat_chance <- matrix(mean(train_mat, na.rm=T), nrow = nrow(train_mat), ncol = ncol(train_mat))
prediction_mat_chance[which(is.na(prediction_mat))] <- NA

rmse <- function(predicted, observed, rmna=FALSE){
  sqrt(mean((predicted - observed)^2, na.rm=rmna))
}

rmse(prediction_mat, test_mat, rmna=T)
## [1] 1.154043
rmse(prediction_mat_chance, test_mat, rmna=T)
## [1] 1.098416

Our prediction model does not fare much differently than just guessing the average (~3.5) every time, so clearly it needs refinement.


Summary

We created simple versions of a Content Based Recommender and a Collaborative Filtering Recommender. The benefits of the former are that one does not need other users’ information in order to make a recommendation. The downsides are that the features (“movie genres”, in this example) need to be well determined and tagged in advance. The benefits of the latter are that our model is not determined by how well we feature engineer. The downsides are that we need all other users’ data, so we suffer from a “cold start” problem where we cannot start using the model until we have a reasonably large amount of data.

Our ratings prediction model built with the User-Item Matrix and Item Correlation Matrix did not fare much differently than a rudimentary model of guessing the global average every time.