CUNY 612

Intro

For this project we will build a simple recommender system using a toy dataset of movie reviews.

SQL Preparation

Make sure you have MySQL installed. Then please run the following sql script in order to create the movie_review database: Github_Link

Load libraries

library(dplyr)
library(RMySQL)
library(dbplyr)
library(tidytext)
library(knitr)

Loading the SQL tables into R

Be sure you use the correct username and password for your particular MySQL setup!

mr_db <- src_mysql(dbname = 'movie_reviews', username = 'user', password = 'password')

movie_df <- tbl(mr_db, "movie") %>% collect(n=Inf)
critic_df <- tbl(mr_db, "critic") %>% collect(n=Inf)
review_df <- tbl(mr_db, "review") %>% collect(n=Inf)

review_enhanced_df <- review_df %>% 
  left_join(movie_df, by="movie_id") %>% 
  left_join(critic_df, by="critic_id") %>% 
  arrange(title, desc(movie_rating))

review_enhanced_df %>% head() %>% kable()

movie_id	critic_id	movie_rating	title	critic_name
1	2	3.9	Acquaman	Beverly
1	3	3.9	Acquaman	Charlie
1	5	3.5	Acquaman	Edward
1	1	3.1	Acquaman	Aaron
1	4	3.1	Acquaman	Dolores
6	1	4.8	Annihilation	Aaron

Create matrix

We start by using the tidy dataframe to create a user-item matrix.

ui_mat <- review_enhanced_df %>% 
  arrange(critic_name, title) %>% 
  cast_sparse(critic_name, title, movie_rating) %>% 
  as.matrix()
ui_mat %>% kable()

	Acquaman	Annihilation	Black Panther	Mary Poppins Returns	Ready Player One	The Greatest Showman
Aaron	3.1	4.8	4.8	3.7	2.5	1.6
Beverly	3.9	3.1	4.6	3.5	3.1	2.3
Charlie	3.9	4.3	4.1	3.5	3.8	2.0
Dolores	3.1	4.2	5.0	2.6	3.7	2.3
Edward	3.5	4.1	5.0	3.1	3.8	1.8

Make sparse

We randomly create missing values in the matrix.

pct_na <- 0.2
n_na <- round( pct_na * length(ui_mat), 0)
na_ind <- sample(1:length(ui_mat), n_na, replace = F)
ui_mat[na_ind] <- NA
ui_mat %>% kable()

	Acquaman	Annihilation	Black Panther	Mary Poppins Returns	Ready Player One	The Greatest Showman
Aaron	3.1	4.8	4.8	3.7	2.5	1.6
Beverly	3.9	3.1	NA	3.5	3.1	NA
Charlie	3.9	4.3	4.1	3.5	3.8	2.0
Dolores	3.1	NA	5.0	NA	NA	2.3
Edward	3.5	NA	5.0	3.1	3.8	1.8

Split the data

Next we break the matrix into training and test values.

pct_test <- 0.2
n_test <- round( pct_test * sum(!is.na(ui_mat)), 0)
test_ind <- sample((1:length(ui_mat))[-na_ind], n_test, replace = F)
train_ind <- (1:length(ui_mat))[-c(na_ind, test_ind)]

train_mat <- ui_mat
train_mat[test_ind] <- NA
test_mat <- ui_mat
test_mat[train_ind] <- NA

train_mat %>% kable()

	Acquaman	Annihilation	Black Panther	Mary Poppins Returns	Ready Player One	The Greatest Showman
Aaron	3.1	4.8	4.8	3.7	2.5	1.6
Beverly	NA	3.1	NA	3.5	NA	NA
Charlie	3.9	4.3	4.1	3.5	3.8	2.0
Dolores	3.1	NA	5.0	NA	NA	NA
Edward	3.5	NA	5.0	NA	3.8	NA

test_mat %>% kable()

	Acquaman	Annihilation	Black Panther	Mary Poppins Returns	Ready Player One	The Greatest Showman
Aaron	NA	NA	NA	NA	NA	NA
Beverly	3.9	NA	NA	NA	3.1	NA
Charlie	NA	NA	NA	NA	NA	NA
Dolores	NA	NA	NA	NA	NA	2.3
Edward	NA	NA	NA	3.1	NA	1.8

Raw average model

Using the training data, we compute the overall average and make predictions for user-item combinations with that one value. We calculate the RMSE of our predictions on both the training and test data.

predictions_model_1 <- mean(train_mat, na.rm=T)

rmse <- function(predicted, observed, rmna=FALSE){
  sqrt(mean((predicted - observed)^2, na.rm=rmna))
}

### Train RMSE
train_rmse_model_1 <- rmse(predictions_model_1, train_mat[!is.na(train_mat)])
train_rmse_model_1

## [1] 0.9246396

### Test RMSE
test_rmse_model_1 <- rmse(predictions_model_1, test_mat[!is.na(test_mat)])
test_rmse_model_1

## [1] 1.077663

Baseline predictor model

We try to improve upon the raw average model by calculating a “bias” for each user and each item (movie, in this case) using the training data. We will do this by taking averages for each row and column, and subtracting the overall mean. Effectively we are estimating how much each user (ie, row) and each item (ie, movie, or column) is above or below the overall average.

Then to make predictions for each user-item combination we will take the raw average and add the according user and item bias. We calculate the RMSE of our predictions on both the training and test data.

row_bias <- rowMeans(train_mat, na.rm=T) - mean(train_mat, na.rm=T)
col_bias <- colMeans(train_mat, na.rm=T) - mean(train_mat, na.rm=T)

predictions_df_model_2 <- expand.grid(row_bias=row_bias, col_bias=col_bias) %>% 
  cbind(expand.grid(user=names(row_bias), item=names(col_bias))) %>% 
  mutate(raw_avg = mean(train_mat, na.rm=T), prediction = raw_avg + row_bias + col_bias)

predictions_df_model_2 %>% head() %>% kable()

row_bias	col_bias	user	item	raw_avg	prediction
-0.2201754	-0.2368421	Aaron	Acquaman	3.636842	3.179825
-0.3368421	-0.2368421	Beverly	Acquaman	3.636842	3.063158
-0.0368421	-0.2368421	Charlie	Acquaman	3.636842	3.363158
0.4131579	-0.2368421	Dolores	Acquaman	3.636842	3.813158
0.4631579	-0.2368421	Edward	Acquaman	3.636842	3.863158
-0.2201754	0.4298246	Aaron	Annihilation	3.636842	3.846491

predictions_mat_model_2 <- predictions_df_model_2 %>% 
  arrange(user, item) %>% 
  cast_sparse(user, item, prediction) %>% 
  as.matrix()

predictions_mat_model_2 %>% kable()

	Acquaman	Annihilation	Black Panther	Mary Poppins Returns	Ready Player One	The Greatest Showman
Aaron	3.179825	3.846491	4.504825	3.346491	3.146491	1.579825
Beverly	3.063158	3.729825	4.388158	3.229825	3.029825	1.463158
Charlie	3.363158	4.029825	4.688158	3.529825	3.329825	1.763158
Dolores	3.813158	4.479825	5.138158	3.979825	3.779825	2.213158
Edward	3.863158	4.529825	5.188158	4.029825	3.829825	2.263158

### Train RMSE
train_rmse_model_2 <- rmse(predictions_mat_model_2, train_mat, rmna=T)
train_rmse_model_2

## [1] 0.4410938

### Test RMSE
test_rmse_model_2 <- rmse(predictions_mat_model_2, test_mat, rmna=T)
test_rmse_model_2

## [1] 0.5986418

Model comparison summary

We see that for the slightly more complicated second model we get a Test RMSE of 0.6 compared to 1.08 for the simple first model, which is a better result.

We would expect the second model to be more accurate, given it contains more pertinent information. But also keep in mind natural variance can play a larger role in a toy dataset with small sample sizes. The next step would be to repeat this analysis on a larger dataset.