CUNY 612

Intro

For this project we will build two simple recommender systems using the MovieLens dataset.

Load libraries

library(dplyr)
library(tidytext)
library(recommenderlab)
library(psych)
library(knitr)

Get data

The dataset was procured from kaggle and hosted on Github.

item_df <- read.delim('https://raw.githubusercontent.com/mehtablocker/cuny_612/master/data_files/MovieLens/u.item', header=F, sep="|")
item_names_vec <- "movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
Children's | Comedy | Crime | Documentary | Drama | Fantasy |
Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
Thriller | War | Western"
names(item_df) <- strsplit(item_names_vec, split = "\\|") %>% 
  sapply(., trimws) %>% as.vector() %>% gsub(" ", "_", .) %>% gsub("\\'", "", .)
item_df %>% head() %>% kable()

movie_id	movie_title	release_date	video_release_date	IMDb_URL	Action	Adventure	Animation	Childrens	Comedy	Crime	Drama	Thriller
1	Toy Story (1995)	01-Jan-1995	NA	http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)	0	0	1	1	1	0	0	0
2	GoldenEye (1995)	01-Jan-1995	NA	http://us.imdb.com/M/title-exact?GoldenEye%20(1995)	1	1	0	0	0	0	0	1
3	Four Rooms (1995)	01-Jan-1995	NA	http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)	0	0	0	0	0	0	0	1
4	Get Shorty (1995)	01-Jan-1995	NA	http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)	1	0	0	0	1	0	1	0
5	Copycat (1995)	01-Jan-1995	NA	http://us.imdb.com/M/title-exact?Copycat%20(1995)	0	0	0	0	0	1	1	1
6	Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)	01-Jan-1995	NA	http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)	0	0	0	0	0	0	1	0

ratings_df <- read.delim('https://raw.githubusercontent.com/mehtablocker/cuny_612/master/data_files/MovieLens/u.data', header=F)
names(ratings_df) <- c("user_id", "item_id", "rating", "timestamp")
ratings_df %>% head() %>% kable()

user_id	item_id	rating	timestamp
196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806

Create matrices

We use the tidy dataframes to create an item content matrix and a user-item matrix.

item_mat <- as.matrix(item_df[ , 7:24])

ui_mat <- ratings_df %>% 
  arrange(user_id, item_id) %>% 
  cast_sparse(user_id, item_id, rating) %>% 
  as.matrix()

item_mat[1:5, 1:5] %>% kable()

Action	Adventure	Animation	Childrens	Comedy
0	0	1	1	1
1	1	0	0	0
0	0	0	0	0
1	0	0	0	1
0	0	0	0	0

ui_mat[1:5, 1:5] %>% kable()

1	2	3	4	5
5	3	4	3	3
4	0	0	0	0
0	0	0	0	0
0	0	0	0	0
4	3	0	0	0

Content based recommender

We create a simple Content-Based Recommender by first finding the Jaccard similarities between all rows of the item matrix.

jac_mat <- as(similarity(as(item_mat, "realRatingMatrix"), method = "jaccard"), "matrix")
jac_mat[1:5, 1:5] %>% kable()

1	2	3	4	5
0.0	0.0000000	0.0000000	0.2	0.0000000
0.0	0.0000000	0.3333333	0.2	0.2000000
0.0	0.3333333	0.0000000	0.0	0.3333333
0.2	0.2000000	0.0000000	0.0	0.2000000
0.0	0.2000000	0.3333333	0.2	0.0000000

Then we recommend three movies to an example user by finding a movie they rated “5” and looking up the top three Jaccard similarities for that movie which the user has not yet seen.

user_ratings <- ui_mat[1, ]
movie_id <- which(user_ratings==5)[1] %>% unname()
movie_id

## [1] 1

similar_ids <- order(jac_mat[movie_id,], decreasing=T)
sim_ord_user_ratings <- user_ratings[similar_ids]
reco_ids <- sim_ord_user_ratings[sim_ord_user_ratings==0] %>% head(3) %>% names() %>% as.integer()
reco_ids

## [1]  422 1219  377

Evaluate content

Let’s informally evaluate our recommendations for the example user by seeing if they make sense. The first movie_id that the user rated “5” is movie_id 1, which above we can see is “Toy Story.” According to our Jaccard similarities cross referenced with the user ratings, the top three most similar movies which the user has not yet seen are:

item_df %>% slice(reco_ids) %>% select(movie_id, movie_title) %>% kable()

movie_id	movie_title
422	Aladdin and the King of Thieves (1996)
1219	Goofy Movie, A (1995)
377	Heavyweights (1994)

For a person who rated “Toy Story” very highly, these seem like pretty good recommendations!

Collaborative filtering

Next we create an Item-Based Collaborative Filtering Recommender using the same MovieLens data. To do this we will compare the items’ similarities according to the user ratings rather than any pre-determined features such as genre. Using the user-item matrix from above, we first replace all 0’s with NA’s (to signify movies that were not rated by particular users) and then create a Pearson correlation matrix for all movies (setting a minimum threshold for sample size.) This is just like the Jaccard similarities matrix above except now we are comparing movies by user ratings rather than genre, and we are using a different distance metric since our data is no longer binary.

ui_mat[ui_mat==0] <- NA
min_n <- 20
corr_obj <- suppressWarnings(corr.test(ui_mat, ci=F))
cor_mat <- corr_obj$r
cor_mat[which(corr_obj$n < min_n)] <- NA

We can now, for example, find a user who rated “Toy Story” very highly, just as we did above. Then using the item correlation matrix we can recommend the top three most similar movies that the user has not yet seen.

user_ratings <- ui_mat[1, ]
movie_id <- which(user_ratings==5)[1] %>% unname()
movie_id

## [1] 1

similar_ids <- order(cor_mat[, movie_id], decreasing=T)
sim_ord_user_ratings <- user_ratings[similar_ids]
reco_ids <- sim_ord_user_ratings[is.na(sim_ord_user_ratings)] %>% head(3) %>% names() %>% as.integer()
reco_ids

## [1] 426 539 921

item_df %>% slice(reco_ids) %>% select(movie_id, movie_title) %>% kable()

movie_id	movie_title
426	Transformers: The Movie, The (1986)
539	Mouse Hunt (1997)
921	Farewell My Concubine (1993)

The first two seem like reasonable recommendations, but the third is a bit out of place.

Ratings predictions

We can more formally evaluate our model by calculating numerical ratings predictions for the movies a user has not seen. We do this by splitting the data into train and test sets, re-running the correlation matrix on the training data, using the correlation matrix to calculate predicted ratings, and evaluating the predictions on the test data.

### Create test and train indexes
pct_test <- 0.2
n_test <- round( pct_test * sum(!is.na(ui_mat)), 0)
na_ind <- which(is.na(ui_mat))
test_ind <- sample((1:length(ui_mat))[-na_ind], n_test, replace = F)
train_ind <- (1:length(ui_mat))[-c(na_ind, test_ind)]

### Break matrix into test and train
train_mat <- ui_mat
train_mat[test_ind] <- NA
test_mat <- ui_mat
test_mat[train_ind] <- NA

### Create correlation matrix from training data
corr_obj_train <- suppressWarnings(corr.test(train_mat, ci=F))
cor_mat_train <- corr_obj_train$r
cor_mat_train[which(corr_obj_train$n < min_n)] <- NA

### For each user...
### Hadamard multiply empirical ratings by correlation matrix (re-centered and squared).
### Then remove empirical ratings and take column averages to get predicted ratings for each item.
### Finally, transform the predicted ratings distribution to more closely resemble empirical shape.
prediction_mat <- matrix(NA, nrow = nrow(train_mat), ncol = ncol(train_mat))
for (i in 1:nrow(train_mat)){
  user_pred_mat <- train_mat[i, ] * (cor_mat_train - min(cor_mat_train, na.rm=T))^2
  diag(user_pred_mat) <- NA
  user_pred_vec <- colMeans(user_pred_mat, na.rm=T)
  user_pred_vec_trans <- user_pred_vec  + (median(train_mat[i, ], na.rm=T) - median(user_pred_vec, na.rm=T))
  user_pred_vec_trans <- pmin(user_pred_vec_trans, 5)
  prediction_mat[i, ] <- user_pred_vec_trans
}

### Calculate RMSE on Test data and compare to RMSE of simply predicting the global average every time
prediction_mat_chance <- matrix(mean(train_mat, na.rm=T), nrow = nrow(train_mat), ncol = ncol(train_mat))
prediction_mat_chance[which(is.na(prediction_mat))] <- NA

rmse <- function(predicted, observed, rmna=FALSE){
  sqrt(mean((predicted - observed)^2, na.rm=rmna))
}

rmse(prediction_mat, test_mat, rmna=T)

## [1] 1.154043

rmse(prediction_mat_chance, test_mat, rmna=T)

## [1] 1.098416

Our prediction model does not fare much differently than just guessing the average (~3.5) every time, so clearly it needs refinement.

Summary

We created simple versions of a Content Based Recommender and a Collaborative Filtering Recommender. The benefits of the former are that one does not need other users’ information in order to make a recommendation. The downsides are that the features (“movie genres”, in this example) need to be well determined and tagged in advance. The benefits of the latter are that our model is not determined by how well we feature engineer. The downsides are that we need all other users’ data, so we suffer from a “cold start” problem where we cannot start using the model until we have a reasonably large amount of data.

Our ratings prediction model built with the User-Item Matrix and Item Correlation Matrix did not fare much differently than a rudimentary model of guessing the global average every time.