Introduction

The purpose of this assignment is to introduce recommender systems and modeling. The global baseline estimate recommender is a basic recommender system that combines three quantities: the overall average of all users' ratings, the average of a specific user's ratings across all the movies they watched, and the average rating across all users for a specific movie. The formula is:

Global Baseline Estimate = Mean movie rating overall + Specific movie's rating relative to average + User's rating relative to average
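
As a quick illustration of the formula, here is a minimal sketch in R. The function name global_baseline and the three input values are made up for the example and are not part of the assignment data.

```r
# Hypothetical helper: combines the three averages as in the formula above
global_baseline <- function(avg_overall, avg_movie, avg_user) {
  movie_rel <- avg_movie - avg_overall  # movie's rating relative to average
  user_rel  <- avg_user - avg_overall   # user's rating relative to average
  avg_overall + movie_rel + user_rel
}

# Made-up numbers, for illustration only
estimate <- global_baseline(avg_overall = 3.5, avg_movie = 4.2, avg_user = 3.0)
print(estimate)
```

With these made-up values the movie sits 0.7 above the overall average and the user sits 0.5 below it, so the estimate is 3.5 + 0.7 - 0.5 = 3.7.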

As an extension, recommenderlab can be used to implement other types of recommender systems and compare them with one another.

Load packages

library(tidyverse)
library(recommenderlab)
library(kableExtra)

Loading the data

This dataset was previously collected via a Google survey of my friends and family.

path <- "https://raw.githubusercontent.com/klgriffen96/spring23_data607_wk11/main/movie_ratings.csv"
df <- read.csv(path)
kable(head(df))
| Timestamp | Name | Avatar | Inception | Rodents.of.Unusual.Size | Harry.Potter.and.the.Sorcerer.s.Stone | Top.Gun..Maverick | Causeway | Record.ID | Result | Timestamp.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1/29/2023 17:42:04 | Helena | 5 - AMAZING | 4 - GOOD | I did not watch this movie | 5 - AMAZING | I did not watch this movie | I did not watch this movie | Helena | OK | 2023-02-02 15:19 |
| 1/29/2023 17:42:45 | Alvaro | 4 - GOOD | 4 - GOOD | I did not watch this movie | 5 - AMAZING | 4 - GOOD | 2 - BAD | Alvaro | OK | 2023-02-02 15:19 |
| 1/29/2023 17:43:21 | Nolan | 3 - OKAY | 4 - GOOD | I did not watch this movie | 3 - OKAY | 4 - GOOD | I did not watch this movie | Nolan | OK | 2023-02-02 15:19 |
| 1/29/2023 18:03:40 | Emily | 4 - GOOD | 4 - GOOD | 5 - AMAZING | 3 - OKAY | 2 - BAD | I did not watch this movie | Emily | OK | 2023-02-02 15:19 |
| 1/29/2023 18:28:07 | Avocado Toast | 3 - OKAY | 1 - DISSAPOINTING | I did not watch this movie | 4 - GOOD | I did not watch this movie | I did not watch this movie | Avocado Toast | OK | 2023-02-02 15:19 |
| 1/30/2023 10:21:05 | Emma Griffen | 5 - AMAZING | 4 - GOOD | I did not watch this movie | 4 - GOOD | 4 - GOOD | I did not watch this movie | Emma Griffen | OK | 2023-02-02 15:19 |

Clean data

Select only the necessary columns and convert the rating labels to numeric values.

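# Keep the Name column and the six movie rating columns
df <- df[c(2:8)]

# Map the survey labels to numbers; the strings must match the survey text exactly
df[df == "1 - DISSAPOINTING"] <- 1
df[df == "2 - BAD"] <- 2
df[df == "3 - OKAY"] <- 3
df[df == "4 - GOOD"] <- 4
df[df == "5 - AMAZING"] <- 5
df[df == "I did not watch this movie"] <- NA

# Convert the rating columns from character to numeric
df[, 2:ncol(df)] <- lapply(2:ncol(df), function(x) as.numeric(df[[x]]))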

kable(head(df))
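| Name | Avatar | Inception | Rodents.of.Unusual.Size | Harry.Potter.and.the.Sorcerer.s.Stone | Top.Gun..Maverick | Causeway |
|---|---|---|---|---|---|---|
| Helena | 5 | 4 | NA | 5 | NA | NA |
| Alvaro | 4 | 4 | NA | 5 | 4 | 2 |
| Nolan | 3 | 4 | NA | 3 | 4 | NA |
| Emily | 4 | 4 | 5 | 3 | 2 | NA |
| Avocado Toast | 3 | 1 | NA | 4 | NA | NA |
| Emma Griffen | 5 | 4 | NA | 4 | 4 | NA |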
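
The label-to-number conversion could also be written as a single parsing step. The sketch below is one possible alternative, not the approach used in this assignment; the helper name label_to_rating is hypothetical, and it assumes every rating label starts with the number followed by " - ".

```r
# Hypothetical helper: keep the text in front of " - " and convert it to a number.
# Labels without " - " (such as "I did not watch this movie") become NA.
label_to_rating <- function(x) {
  suppressWarnings(as.numeric(sub(" - .*", "", x)))
}

# Labels copied from the survey data above
labels <- c("5 - AMAZING", "1 - DISSAPOINTING", "I did not watch this movie")
ratings <- label_to_rating(labels)
print(ratings)
```

This avoids listing each label by hand, at the cost of silently turning any unexpected text into NA.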

Global Baseline Estimate Recommender

The next step is to implement the global baseline estimate recommender. In order to recommend a movie with this method, three values need to be known:

  1. The overall mean movie rating across all users and all movies (avg overall)
  2. The average of a specific user's ratings for all the movies they watched (avg user)
  3. The average rating across all users for a specific movie (avg movie)

First, I will make a copy of the dataframe, called df_predict, to store my predictions. I will mark the movies that were already rated as “rated” so that they do not receive a prediction. The remaining movies stay NA until they are predicted.

# Make a copy of the df
df_predict <- df
# Mark the movies that were already rated as "rated"
df_predict[df_predict > 0 & df_predict <= 5] <- "rated"

Now I can find the three key values needed to calculate the global baseline estimate.

# Find the user average
df$user_mean <- rowMeans(df[,2:ncol(df)], na.rm = TRUE)

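# Find the movie averages; the last entry is the mean of the user_mean column
cols <- colMeans(df[1:nrow(df),2:ncol(df)], na.rm = TRUE)

# Find the average overall, taken here as the mean of the per-user means
# (this matches the mean of all ratings only if every user rated the same number of movies)
avg_overall <- cols[["user_mean"]]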
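
Note that cols[["user_mean"]] is the mean of the per-user means, which can differ from the mean of all individual ratings when users rated different numbers of movies. The sketch below shows the difference on a small made-up matrix; the object name toy and its values are invented for illustration.

```r
# Made-up ratings: rows are users, columns are movies, NA means not watched
toy <- matrix(c(5,  4, NA,
                3, NA, NA,
                4,  2,  1),
              nrow = 3, byrow = TRUE)

# Mean over every individual rating
mean_of_all <- mean(toy, na.rm = TRUE)

# Mean of the per-user means (the quantity used for avg_overall above)
mean_of_user_means <- mean(rowMeans(toy, na.rm = TRUE))

print(c(mean_of_all = mean_of_all, mean_of_user_means = mean_of_user_means))
```

Here the user with a single rating counts as much as the user with three, so the two averages differ. If the strict definition is wanted, mean(as.matrix(df[, 2:(ncol(df) - 1)]), na.rm = TRUE) would give the mean of all ratings, assuming user_mean is the last column as in the code above.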

Now I can implement the recommender.

# Loop over all rows (users)
for (i in 1:dim(df_predict)[1]) {
  # Loop over the rating columns (column 1 is Name)
  for (ii in 2:(dim(df_predict)[2])) {
    # Only predict where the user has not rated the movie
    if (is.na(df_predict[i, ii])) {
      user_rel <- df$user_mean[i] - avg_overall
      # cols starts at the first movie, so it is offset by one from df_predict
      movie_rel <- cols[[ii - 1]] - avg_overall
      df_predict[i, ii] <- round(avg_overall + user_rel + movie_rel, 1)
    }
  }
}

kable(head(df_predict))
| Name | Avatar | Inception | Rodents.of.Unusual.Size | Harry.Potter.and.the.Sorcerer.s.Stone | Top.Gun..Maverick | Causeway |
|---|---|---|---|---|---|---|
| Helena | rated | rated | 5.8 | rated | 4.4 | 2.8 |
| Alvaro | rated | rated | 5 | rated | rated | rated |
| Nolan | rated | rated | 4.7 | rated | rated | 1.7 |
| Emily | rated | rated | rated | rated | rated | 1.8 |
| Avocado Toast | rated | rated | 3.8 | rated | 2.4 | 0.8 |
| Emma Griffen | rated | rated | 5.4 | rated | rated | 2.4 |
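
The same calculation can be sketched without explicit loops. The example below is an illustrative alternative on made-up data, not the code used above; the matrix ratings and its dimnames are hypothetical, and the overall average here is the mean of all individual ratings.

```r
# Made-up ratings: rows are users, columns are movies, NA means not watched
ratings <- matrix(c(5,  4, NA,
                    3, NA,  2,
                    4,  2,  1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("u1", "u2", "u3"), c("m1", "m2", "m3")))

avg_overall_toy <- mean(ratings, na.rm = TRUE)
user_rel  <- rowMeans(ratings, na.rm = TRUE) - avg_overall_toy  # one value per user
movie_rel <- colMeans(ratings, na.rm = TRUE) - avg_overall_toy  # one value per movie

# outer() adds every user offset to every movie offset, giving a full grid
baseline <- avg_overall_toy + outer(user_rel, movie_rel, "+")

# Keep estimates only where no rating exists; rated cells stay NA here
predicted <- ifelse(is.na(ratings), round(baseline, 1), NA)
print(predicted)
```

Keeping the predictions in a separate numeric matrix, rather than mixing them with the text "rated", makes later numeric steps such as range checks simpler.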

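Looking at the predictions, some fall above 5 and some fall below 1, which are the limits of the rating scale. For this reason I will clamp them to that range.

# Clamp predictions to the 1-5 scale, using only the rating columns so Name is untouched
rating_cols <- 2:ncol(df_predict)
df_predict[rating_cols] <- lapply(df_predict[rating_cols], function(x) {
  ifelse(x == "rated", x, pmin(pmax(suppressWarnings(as.numeric(x)), 1), 5))
})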
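
Limiting values to the rating scale can be sketched on plain numeric values as follows. The helper name clamp_rating is hypothetical; the sample predictions are taken from the table above.

```r
# Hypothetical helper: force values into the interval [lower, upper]
clamp_rating <- function(x, lower = 1, upper = 5) {
  pmin(pmax(x, lower), upper)
}

# Sample predictions from the table above
preds <- c(5.8, 4.4, 2.8, 0.8)
clamped <- clamp_rating(preds)
print(clamped)

# Caution: comparing text with a number compares strings, not values,
# so "10" > 5 is FALSE in R because 5 is first converted to "5"
text_comparison <- "10" > 5
print(text_comparison)
```

Because df_predict holds text once "rated" has been written into it, converting the predictions to numbers before comparing them against 1 and 5 is the safer route.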

Now the global baseline estimator has been implemented.

Alternative Recommender

The global baseline estimator can be compared to other recommender engines. recommenderlab is an R library that “aims at providing a comprehensive research infrastructure for recommender systems”, meaning that it gives researchers a simple mechanism for implementing different types of recommender engines and deciding which is best for their purposes. Below I will compare several algorithms, including User Based Collaborative Filtering, following the examples section in “recommenderlab: An R Framework for Developing and Testing Recommendation Algorithms” (1). User Based Collaborative Filtering recommends items (in this case movies) based on similarities between users (2).
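
To make "similarities between users" concrete, here is a minimal sketch of one common similarity measure, cosine similarity, computed over the movies that both users rated. The function cosine_sim is hypothetical and is not how recommenderlab is called; the package's own similarity computation and normalization may differ.

```r
# Hypothetical helper: cosine similarity over the items both users rated
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

# Ratings for the first two users in the cleaned data above
helena <- c(5, 4, NA, 5, NA, NA)
alvaro <- c(4, 4, NA, 5, 4, 2)

similarity <- cosine_sim(helena, alvaro)
print(similarity)
```

A value close to 1 means the two users rated their shared movies in similar proportions. A user-based method would then lean on the ratings of the most similar users when predicting a movie the target user has not seen.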

The first steps are to create the matrix and turn it into a realRatingMatrix.

# convert dataframe to matrix
m_df <- as.matrix(df[,2:(ncol(df)-1)])

# convert to realRatingMatrix (sparse format)
r <- as(m_df, "realRatingMatrix")

# check out the matrix
head(getRatingMatrix(r))
## 6 x 6 sparse Matrix of class "dgCMatrix"
##      Avatar Inception Rodents.of.Unusual.Size
## [1,]      5         4                       .
## [2,]      4         4                       .
## [3,]      3         4                       .
## [4,]      4         4                       5
## [5,]      3         1                       .
## [6,]      5         4                       .
##      Harry.Potter.and.the.Sorcerer.s.Stone Top.Gun..Maverick Causeway
## [1,]                                     5                 .        .
## [2,]                                     5                 4        2
## [3,]                                     3                 4        .
## [4,]                                     3                 2        .
## [5,]                                     4                 .        .
## [6,]                                     4                 4        .
# check distribution
hist(getRatings(r), breaks=5)

Now the recommenders can be evaluated and the error plot generated.

# set seed for consistency
set.seed(09041996)

# create an evaluation scheme
e <- evaluationScheme(r, method="split", train=0.8, given=-2, goodRating=5)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead
# compare multiple algorithms
algorithms <- list(
  "random items" = list(name="RANDOM", param=NULL),
  "popular items" = list(name="POPULAR", param=NULL),
  "user-based CF" = list(name="UBCF", param=NULL),
  "item-based CF" = list(name="IBCF", param=NULL),
  "SVD approximation" = list(name="SVD", param=list(k = 3))
  )

results <- evaluate(e, algorithms, type = "ratings")
## RANDOM run fold/sample [model time/prediction time]
##   1  [0.02sec/0.01sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0.02sec/0.08sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.01sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.06sec/0sec] 
## SVD run fold/sample [model time/prediction time]
##   1
## Warning in irlba::irlba(m, nv = p$k, maxit = p$maxiter): You're computing too
## large a percentage of total singular values, use a standard svd instead.
## [0sec/0sec]
plot(results, ylim = c(0,5))

Based on this plot, the item-based collaborative filter had the least error on this single train/test split.
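
For rating predictions, recommenderlab reports error measures such as RMSE, MSE and MAE. The sketch below shows how these are defined, using made-up actual and predicted ratings; the vectors are invented for illustration and are not results from the evaluation above.

```r
# Made-up held-out ratings and the predictions made for them
actual    <- c(4, 2, 5)
predicted <- c(3.5, 3, 4)

errors <- actual - predicted

mse  <- mean(errors^2)     # mean squared error
rmse <- sqrt(mse)          # root mean squared error
mae  <- mean(abs(errors))  # mean absolute error

print(c(RMSE = rmse, MSE = mse, MAE = mae))
```

Lower values mean the predictions sit closer to the held-out ratings. RMSE penalizes large misses more heavily than MAE does.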

Conclusion

In conclusion, a global baseline estimate recommender was implemented. Additionally, the basic functionality of recommenderlab was explored to see which type of recommender in its fleet had the least error.

Citations

  1. recommenderlab: An R Framework for Developing and Testing Recommendation Algorithms (https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf)
  2. Recommendation Systems (http://infolab.stanford.edu/~ullman/mmds/ch9.pdf)