For this project we will build a simple recommender system using a toy dataset of movie reviews.
Make sure you have MySQL installed. Then please run the following sql script in order to create the movie_review database: Github_Link
library(dplyr)
library(RMySQL)
library(dbplyr)
library(tidytext)
library(knitr)
Be sure you use the correct username and password for your particular MySQL setup!
mr_db <- src_mysql(dbname = 'movie_reviews', username = 'user', password = 'password')
movie_df <- tbl(mr_db, "movie") %>% collect(n=Inf)
critic_df <- tbl(mr_db, "critic") %>% collect(n=Inf)
review_df <- tbl(mr_db, "review") %>% collect(n=Inf)
review_enhanced_df <- review_df %>%
left_join(movie_df, by="movie_id") %>%
left_join(critic_df, by="critic_id") %>%
arrange(title, desc(movie_rating))
review_enhanced_df %>% head() %>% kable()
| movie_id | critic_id | movie_rating | title | critic_name |
|---|---|---|---|---|
| 1 | 2 | 3.9 | Acquaman | Beverly |
| 1 | 3 | 3.9 | Acquaman | Charlie |
| 1 | 5 | 3.5 | Acquaman | Edward |
| 1 | 1 | 3.1 | Acquaman | Aaron |
| 1 | 4 | 3.1 | Acquaman | Dolores |
| 6 | 1 | 4.8 | Annihilation | Aaron |
We start by using the tidy dataframe to create a user-item matrix.
ui_mat <- review_enhanced_df %>%
arrange(critic_name, title) %>%
cast_sparse(critic_name, title, movie_rating) %>%
as.matrix()
ui_mat %>% kable()
| Acquaman | Annihilation | Black Panther | Mary Poppins Returns | Ready Player One | The Greatest Showman | |
|---|---|---|---|---|---|---|
| Aaron | 3.1 | 4.8 | 4.8 | 3.7 | 2.5 | 1.6 |
| Beverly | 3.9 | 3.1 | 4.6 | 3.5 | 3.1 | 2.3 |
| Charlie | 3.9 | 4.3 | 4.1 | 3.5 | 3.8 | 2.0 |
| Dolores | 3.1 | 4.2 | 5.0 | 2.6 | 3.7 | 2.3 |
| Edward | 3.5 | 4.1 | 5.0 | 3.1 | 3.8 | 1.8 |
We randomly create missing values in the matrix.
pct_na <- 0.2
n_na <- round( pct_na * length(ui_mat), 0)
na_ind <- sample(1:length(ui_mat), n_na, replace = F)
ui_mat[na_ind] <- NA
ui_mat %>% kable()
| Acquaman | Annihilation | Black Panther | Mary Poppins Returns | Ready Player One | The Greatest Showman | |
|---|---|---|---|---|---|---|
| Aaron | 3.1 | 4.8 | 4.8 | 3.7 | 2.5 | 1.6 |
| Beverly | 3.9 | 3.1 | NA | 3.5 | 3.1 | NA |
| Charlie | 3.9 | 4.3 | 4.1 | 3.5 | 3.8 | 2.0 |
| Dolores | 3.1 | NA | 5.0 | NA | NA | 2.3 |
| Edward | 3.5 | NA | 5.0 | 3.1 | 3.8 | 1.8 |
Next we break the matrix into training and test values.
pct_test <- 0.2
n_test <- round( pct_test * sum(!is.na(ui_mat)), 0)
test_ind <- sample((1:length(ui_mat))[-na_ind], n_test, replace = F)
train_ind <- (1:length(ui_mat))[-c(na_ind, test_ind)]
train_mat <- ui_mat
train_mat[test_ind] <- NA
test_mat <- ui_mat
test_mat[train_ind] <- NA
train_mat %>% kable()
| Acquaman | Annihilation | Black Panther | Mary Poppins Returns | Ready Player One | The Greatest Showman | |
|---|---|---|---|---|---|---|
| Aaron | 3.1 | 4.8 | 4.8 | 3.7 | 2.5 | 1.6 |
| Beverly | NA | 3.1 | NA | 3.5 | NA | NA |
| Charlie | 3.9 | 4.3 | 4.1 | 3.5 | 3.8 | 2.0 |
| Dolores | 3.1 | NA | 5.0 | NA | NA | NA |
| Edward | 3.5 | NA | 5.0 | NA | 3.8 | NA |
test_mat %>% kable()
| Acquaman | Annihilation | Black Panther | Mary Poppins Returns | Ready Player One | The Greatest Showman | |
|---|---|---|---|---|---|---|
| Aaron | NA | NA | NA | NA | NA | NA |
| Beverly | 3.9 | NA | NA | NA | 3.1 | NA |
| Charlie | NA | NA | NA | NA | NA | NA |
| Dolores | NA | NA | NA | NA | NA | 2.3 |
| Edward | NA | NA | NA | 3.1 | NA | 1.8 |
Using the training data, we compute the overall average and make predictions for user-item combinations with that one value. We calculate the RMSE of our predictions on both the training and test data.
predictions_model_1 <- mean(train_mat, na.rm=T)
rmse <- function(predicted, observed, rmna=FALSE){
sqrt(mean((predicted - observed)^2, na.rm=rmna))
}
### Train RMSE
train_rmse_model_1 <- rmse(predictions_model_1, train_mat[!is.na(train_mat)])
train_rmse_model_1
## [1] 0.9246396
### Test RMSE
test_rmse_model_1 <- rmse(predictions_model_1, test_mat[!is.na(test_mat)])
test_rmse_model_1
## [1] 1.077663
We try to improve upon the raw average model by calculating a “bias” for each user and each item (movie, in this case) using the training data. We will do this by taking averages for each row and column, and subtracting the overall mean. Effectively we are estimating how much each user (ie, row) and each item (ie, movie, or column) is above or below the overall average.
Then to make predictions for each user-item combination we will take the raw average and add the according user and item bias. We calculate the RMSE of our predictions on both the training and test data.
row_bias <- rowMeans(train_mat, na.rm=T) - mean(train_mat, na.rm=T)
col_bias <- colMeans(train_mat, na.rm=T) - mean(train_mat, na.rm=T)
predictions_df_model_2 <- expand.grid(row_bias=row_bias, col_bias=col_bias) %>%
cbind(expand.grid(user=names(row_bias), item=names(col_bias))) %>%
mutate(raw_avg = mean(train_mat, na.rm=T), prediction = raw_avg + row_bias + col_bias)
predictions_df_model_2 %>% head() %>% kable()
| row_bias | col_bias | user | item | raw_avg | prediction |
|---|---|---|---|---|---|
| -0.2201754 | -0.2368421 | Aaron | Acquaman | 3.636842 | 3.179825 |
| -0.3368421 | -0.2368421 | Beverly | Acquaman | 3.636842 | 3.063158 |
| -0.0368421 | -0.2368421 | Charlie | Acquaman | 3.636842 | 3.363158 |
| 0.4131579 | -0.2368421 | Dolores | Acquaman | 3.636842 | 3.813158 |
| 0.4631579 | -0.2368421 | Edward | Acquaman | 3.636842 | 3.863158 |
| -0.2201754 | 0.4298246 | Aaron | Annihilation | 3.636842 | 3.846491 |
predictions_mat_model_2 <- predictions_df_model_2 %>%
arrange(user, item) %>%
cast_sparse(user, item, prediction) %>%
as.matrix()
predictions_mat_model_2 %>% kable()
| Acquaman | Annihilation | Black Panther | Mary Poppins Returns | Ready Player One | The Greatest Showman | |
|---|---|---|---|---|---|---|
| Aaron | 3.179825 | 3.846491 | 4.504825 | 3.346491 | 3.146491 | 1.579825 |
| Beverly | 3.063158 | 3.729825 | 4.388158 | 3.229825 | 3.029825 | 1.463158 |
| Charlie | 3.363158 | 4.029825 | 4.688158 | 3.529825 | 3.329825 | 1.763158 |
| Dolores | 3.813158 | 4.479825 | 5.138158 | 3.979825 | 3.779825 | 2.213158 |
| Edward | 3.863158 | 4.529825 | 5.188158 | 4.029825 | 3.829825 | 2.263158 |
### Train RMSE
train_rmse_model_2 <- rmse(predictions_mat_model_2, train_mat, rmna=T)
train_rmse_model_2
## [1] 0.4410938
### Test RMSE
test_rmse_model_2 <- rmse(predictions_mat_model_2, test_mat, rmna=T)
test_rmse_model_2
## [1] 0.5986418
We see that for the slightly more complicated second model we get a Test RMSE of 0.6 compared to 1.08 for the simple first model, which is a better result.
We would expect the second model to be more accurate, given it contains more pertinent information. But also keep in mind natural variance can play a larger role in a toy dataset with small sample sizes. The next step would be to repeat this analysis on a larger dataset.