Intro

For this project we will build a simple recommender system using a toy dataset of movie reviews.


SQL Preparation

Make sure you have MySQL installed. Then please run the following sql script in order to create the movie_review database: Github_Link

Load libraries

library(dplyr)
library(RMySQL)
library(dbplyr)
library(tidytext)
library(knitr)

Loading the SQL tables into R

Be sure you use the correct username and password for your particular MySQL setup!

mr_db <- src_mysql(dbname = 'movie_reviews', username = 'user', password = 'password')

movie_df <- tbl(mr_db, "movie") %>% collect(n=Inf)
critic_df <- tbl(mr_db, "critic") %>% collect(n=Inf)
review_df <- tbl(mr_db, "review") %>% collect(n=Inf)

review_enhanced_df <- review_df %>% 
  left_join(movie_df, by="movie_id") %>% 
  left_join(critic_df, by="critic_id") %>% 
  arrange(title, desc(movie_rating))

review_enhanced_df %>% head() %>% kable()
movie_id critic_id movie_rating title critic_name
1 2 3.9 Acquaman Beverly
1 3 3.9 Acquaman Charlie
1 5 3.5 Acquaman Edward
1 1 3.1 Acquaman Aaron
1 4 3.1 Acquaman Dolores
6 1 4.8 Annihilation Aaron


Create matrix

We start by using the tidy dataframe to create a user-item matrix.

ui_mat <- review_enhanced_df %>% 
  arrange(critic_name, title) %>% 
  cast_sparse(critic_name, title, movie_rating) %>% 
  as.matrix()
ui_mat %>% kable()
Acquaman Annihilation Black Panther Mary Poppins Returns Ready Player One The Greatest Showman
Aaron 3.1 4.8 4.8 3.7 2.5 1.6
Beverly 3.9 3.1 4.6 3.5 3.1 2.3
Charlie 3.9 4.3 4.1 3.5 3.8 2.0
Dolores 3.1 4.2 5.0 2.6 3.7 2.3
Edward 3.5 4.1 5.0 3.1 3.8 1.8

Make sparse

We randomly create missing values in the matrix.

pct_na <- 0.2
n_na <- round( pct_na * length(ui_mat), 0)
na_ind <- sample(1:length(ui_mat), n_na, replace = F)
ui_mat[na_ind] <- NA
ui_mat %>% kable()
Acquaman Annihilation Black Panther Mary Poppins Returns Ready Player One The Greatest Showman
Aaron 3.1 4.8 4.8 3.7 2.5 1.6
Beverly 3.9 3.1 NA 3.5 3.1 NA
Charlie 3.9 4.3 4.1 3.5 3.8 2.0
Dolores 3.1 NA 5.0 NA NA 2.3
Edward 3.5 NA 5.0 3.1 3.8 1.8


Split the data

Next we break the matrix into training and test values.

pct_test <- 0.2
n_test <- round( pct_test * sum(!is.na(ui_mat)), 0)
test_ind <- sample((1:length(ui_mat))[-na_ind], n_test, replace = F)
train_ind <- (1:length(ui_mat))[-c(na_ind, test_ind)]

train_mat <- ui_mat
train_mat[test_ind] <- NA
test_mat <- ui_mat
test_mat[train_ind] <- NA

train_mat %>% kable()
Acquaman Annihilation Black Panther Mary Poppins Returns Ready Player One The Greatest Showman
Aaron 3.1 4.8 4.8 3.7 2.5 1.6
Beverly NA 3.1 NA 3.5 NA NA
Charlie 3.9 4.3 4.1 3.5 3.8 2.0
Dolores 3.1 NA 5.0 NA NA NA
Edward 3.5 NA 5.0 NA 3.8 NA
test_mat %>% kable()
Acquaman Annihilation Black Panther Mary Poppins Returns Ready Player One The Greatest Showman
Aaron NA NA NA NA NA NA
Beverly 3.9 NA NA NA 3.1 NA
Charlie NA NA NA NA NA NA
Dolores NA NA NA NA NA 2.3
Edward NA NA NA 3.1 NA 1.8


Raw average model

Using the training data, we compute the overall average and make predictions for user-item combinations with that one value. We calculate the RMSE of our predictions on both the training and test data.

predictions_model_1 <- mean(train_mat, na.rm=T)

rmse <- function(predicted, observed, rmna=FALSE){
  sqrt(mean((predicted - observed)^2, na.rm=rmna))
}

### Train RMSE
train_rmse_model_1 <- rmse(predictions_model_1, train_mat[!is.na(train_mat)])
train_rmse_model_1
## [1] 0.9246396
### Test RMSE
test_rmse_model_1 <- rmse(predictions_model_1, test_mat[!is.na(test_mat)])
test_rmse_model_1
## [1] 1.077663


Baseline predictor model

We try to improve upon the raw average model by calculating a “bias” for each user and each item (movie, in this case) using the training data. We will do this by taking averages for each row and column, and subtracting the overall mean. Effectively we are estimating how much each user (ie, row) and each item (ie, movie, or column) is above or below the overall average.

Then to make predictions for each user-item combination we will take the raw average and add the according user and item bias. We calculate the RMSE of our predictions on both the training and test data.

row_bias <- rowMeans(train_mat, na.rm=T) - mean(train_mat, na.rm=T)
col_bias <- colMeans(train_mat, na.rm=T) - mean(train_mat, na.rm=T)

predictions_df_model_2 <- expand.grid(row_bias=row_bias, col_bias=col_bias) %>% 
  cbind(expand.grid(user=names(row_bias), item=names(col_bias))) %>% 
  mutate(raw_avg = mean(train_mat, na.rm=T), prediction = raw_avg + row_bias + col_bias)

predictions_df_model_2 %>% head() %>% kable()
row_bias col_bias user item raw_avg prediction
-0.2201754 -0.2368421 Aaron Acquaman 3.636842 3.179825
-0.3368421 -0.2368421 Beverly Acquaman 3.636842 3.063158
-0.0368421 -0.2368421 Charlie Acquaman 3.636842 3.363158
0.4131579 -0.2368421 Dolores Acquaman 3.636842 3.813158
0.4631579 -0.2368421 Edward Acquaman 3.636842 3.863158
-0.2201754 0.4298246 Aaron Annihilation 3.636842 3.846491
predictions_mat_model_2 <- predictions_df_model_2 %>% 
  arrange(user, item) %>% 
  cast_sparse(user, item, prediction) %>% 
  as.matrix()

predictions_mat_model_2 %>% kable()
Acquaman Annihilation Black Panther Mary Poppins Returns Ready Player One The Greatest Showman
Aaron 3.179825 3.846491 4.504825 3.346491 3.146491 1.579825
Beverly 3.063158 3.729825 4.388158 3.229825 3.029825 1.463158
Charlie 3.363158 4.029825 4.688158 3.529825 3.329825 1.763158
Dolores 3.813158 4.479825 5.138158 3.979825 3.779825 2.213158
Edward 3.863158 4.529825 5.188158 4.029825 3.829825 2.263158
### Train RMSE
train_rmse_model_2 <- rmse(predictions_mat_model_2, train_mat, rmna=T)
train_rmse_model_2
## [1] 0.4410938
### Test RMSE
test_rmse_model_2 <- rmse(predictions_mat_model_2, test_mat, rmna=T)
test_rmse_model_2
## [1] 0.5986418


Model comparison summary

We see that for the slightly more complicated second model we get a Test RMSE of 0.6 compared to 1.08 for the simple first model, which is a better result.

We would expect the second model to be more accurate, given it contains more pertinent information. But also keep in mind natural variance can play a larger role in a toy dataset with small sample sizes. The next step would be to repeat this analysis on a larger dataset.