Introduction to Modeling

Introduction

The purpose of this assignment is to introduce recommender systems and modeling. The global baseline estimate recommender is a basic recommender system that combines three quantities: the overall average of all users' ratings, the average of a specific user's ratings across the movies they watched, and the average rating a specific movie received from all users. The formula is:

Global Baseline Estimate = Mean movie rating overall + Specific movie's rating relative to average + User's rating relative to average
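As a quick check of the formula, here is a minimal sketch with made-up numbers. The helper name `global_baseline` and the three input values are assumptions for illustration only; they do not come from the survey data.

```r
# Hypothetical helper: global baseline estimate from three averages
global_baseline <- function(avg_overall, avg_movie, avg_user) {
  movie_rel <- avg_movie - avg_overall  # movie's rating relative to average
  user_rel  <- avg_user - avg_overall   # user's rating relative to average
  avg_overall + movie_rel + user_rel
}

# Made-up inputs: overall mean 3.5, a well-liked movie (4.0), a harsh rater (3.0)
estimate <- global_baseline(avg_overall = 3.5, avg_movie = 4.0, avg_user = 3.0)
print(estimate)
```

In this toy case the movie's +0.5 and the user's -0.5 cancel, so the estimate equals the overall mean of 3.5.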

As an extension, recommenderlab can be used to implement other types of recommender systems and compare how they perform relative to each other.

Load packages

library(tidyverse)
library(recommenderlab)
library(kableExtra)

Loading the data

This dataset was previously collected via a Google survey of my friends and family.

path <- "https://raw.githubusercontent.com/klgriffen96/spring23_data607_wk11/main/movie_ratings.csv"
df <- read.csv(path)
kable(head(df))

Timestamp	Name	Avatar	Inception	Rodents.of.Unusual.Size	Harry.Potter.and.the.Sorcerer.s.Stone	Top.Gun..Maverick	Causeway	Record.ID	Result	Timestamp.1
1/29/2023 17:42:04	Helena	5 - AMAZING	4 - GOOD	I did not watch this movie	5 - AMAZING	I did not watch this movie	I did not watch this movie	Helena	OK	2023-02-02 15:19
1/29/2023 17:42:45	Alvaro	4 - GOOD	4 - GOOD	I did not watch this movie	5 - AMAZING	4 - GOOD	2 - BAD	Alvaro	OK	2023-02-02 15:19
1/29/2023 17:43:21	Nolan	3 - OKAY	4 - GOOD	I did not watch this movie	3 - OKAY	4 - GOOD	I did not watch this movie	Nolan	OK	2023-02-02 15:19
1/29/2023 18:03:40	Emily	4 - GOOD	4 - GOOD	5 - AMAZING	3 - OKAY	2 - BAD	I did not watch this movie	Emily	OK	2023-02-02 15:19
1/29/2023 18:28:07	Avocado Toast	3 - OKAY	1 - DISSAPOINTING	I did not watch this movie	4 - GOOD	I did not watch this movie	I did not watch this movie	Avocado Toast	OK	2023-02-02 15:19
1/30/2023 10:21:05	Emma Griffen	5 - AMAZING	4 - GOOD	I did not watch this movie	4 - GOOD	4 - GOOD	I did not watch this movie	Emma Griffen	OK	2023-02-02 15:19

Clean data

Select only the necessary columns and convert the rating labels to numeric values.

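# Keep the Name column and the six movie columns
df <- df[c(2:8)]

# Map each survey label to its numeric rating
# ("DISSAPOINTING" matches the spelling used in the survey)
df[df == "1 - DISSAPOINTING"] <- 1
df[df == "2 - BAD"] <- 2
df[df == "3 - OKAY"] <- 3
df[df == "4 - GOOD"] <- 4
df[df == "5 - AMAZING"] <- 5
df[df == "I did not watch this movie"] <- NA

# Convert the rating columns from character to numeric
df[, 2:ncol(df)] <- lapply(2:ncol(df), function(x) as.numeric(df[[x]]))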

kable(head(df))

Name	Avatar	Inception	Rodents.of.Unusual.Size	Harry.Potter.and.the.Sorcerer.s.Stone	Top.Gun..Maverick	Causeway
Helena	5	4	NA	5	NA	NA
Alvaro	4	4	NA	5	4	2
Nolan	3	4	NA	3	4	NA
Emily	4	4	5	3	2	NA
Avocado Toast	3	1	NA	4	NA	NA
Emma Griffen	5	4	NA	4	4	NA
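The recoding above could also be sketched by reading the leading digit from each label, which avoids one assignment per label. This is only an illustrative alternative on a made-up data frame; the names `toy` and `to_rating` are hypothetical.

```r
# Toy survey responses (made-up, same label format as the survey)
toy <- data.frame(
  Name = c("A", "B"),
  Movie1 = c("5 - AMAZING", "I did not watch this movie"),
  Movie2 = c("2 - BAD", "4 - GOOD"),
  stringsAsFactors = FALSE
)

# Keep the leading digit of each rating label; anything else becomes NA
to_rating <- function(x) {
  digit <- ifelse(grepl("^[1-5] - ", x), substr(x, 1, 1), NA_character_)
  as.numeric(digit)
}

# Apply to every column except Name
toy[-1] <- lapply(toy[-1], to_rating)
str(toy)
```

This approach does not depend on the exact wording or spelling after the digit, at the cost of silently turning any unexpected label into NA.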

Global Baseline Estimate Recommender

The next step is to implement the global baseline estimate recommender. In order to recommend a movie with this method, three values need to be known:

The overall mean movie rating across all users and all movies (avg overall)
The average of a specific user's ratings for all the movies they watched (avg user)
The average rating across all users for a specific movie (avg movie)
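The three values listed above can be sketched on a small made-up ratings matrix. The matrix `toy` and its values are assumptions for illustration. The sketch also shows that the mean of the per-user means is not necessarily the same number as the mean of all observed ratings.

```r
# Made-up ratings: rows are users, columns are movies, NA = not watched
toy <- rbind(
  u1 = c(m1 = 5, m2 = 4, m3 = NA),
  u2 = c(m1 = 3, m2 = NA, m3 = NA),
  u3 = c(m1 = 4, m2 = 2, m3 = 1)
)

avg_overall <- mean(toy, na.rm = TRUE)      # mean of every observed rating
avg_user    <- rowMeans(toy, na.rm = TRUE)  # one mean per user
avg_movie   <- colMeans(toy, na.rm = TRUE)  # one mean per movie

# The mean of the per-user means weights each user equally, so it can
# differ from avg_overall when users rated different numbers of movies
mean_of_user_means <- mean(avg_user)

print(c(avg_overall = avg_overall, mean_of_user_means = mean_of_user_means))
```

Which of the two overall averages to use is a design choice: the first weights every rating equally, the second weights every user equally.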

First, I will make a copy of the data frame, called df_predict, to store my predictions. Movies that a user has already rated are marked “rated” so that they do not receive a prediction. The remaining cells stay NA until they are predicted.

# Make a copy of the df
df_predict <- df
# Mark the movies that were already rated as "rated"
df_predict[df_predict > 0 & df_predict <= 5] <- "rated"

Now I can compute the three values that the global baseline estimate requires.

# Find the user average
df$user_mean <- rowMeans(df[,2:ncol(df)], na.rm = TRUE)

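# Find the movie averages (cols also includes the mean of the user_mean column)
cols <- colMeans(df[1:nrow(df),2:ncol(df)], na.rm = TRUE)

# Find the overall average, taken here as the mean of the per-user means
avg_overall <- cols[["user_mean"]]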

Now I can implement the recommender.

# Loop over every user (row)
for (i in 1:nrow(df_predict)) {
  # Loop over the rating columns (column 1 is Name)
  for (ii in 2:ncol(df_predict)) {
    # Only cells that are still NA need a prediction
    if (is.na(df_predict[i, ii])) {
      user_rel <- df$user_mean[i] - avg_overall
      # cols holds the movie means in column order, so column ii maps to cols[[ii - 1]]
      movie_rel <- cols[[ii - 1]] - avg_overall
      df_predict[i, ii] <- round(avg_overall + user_rel + movie_rel, 1)
    }
  }
}

kable(head(df_predict))

Name	Avatar	Inception	Rodents.of.Unusual.Size	Harry.Potter.and.the.Sorcerer.s.Stone	Top.Gun..Maverick	Causeway
Helena	rated	rated	5.8	rated	4.4	2.8
Alvaro	rated	rated	5	rated	rated	rated
Nolan	rated	rated	4.7	rated	rated	1.7
Emily	rated	rated	rated	rated	rated	1.8
Avocado Toast	rated	rated	3.8	rated	2.4	0.8
Emma Griffen	rated	rated	5.4	rated	rated	2.4
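The same calculation can be sketched without explicit loops by building the full user-by-movie grid with `outer()`. This is a minimal sketch on a made-up matrix (`toy` is hypothetical), and it keeps predictions numeric, with NA marking cells that were already rated.

```r
# Made-up ratings: rows are users, columns are movies, NA = not watched
toy <- rbind(
  u1 = c(m1 = 5, m2 = 4, m3 = NA),
  u2 = c(m1 = 3, m2 = NA, m3 = NA),
  u3 = c(m1 = 4, m2 = 2, m3 = 1)
)

avg_overall <- mean(toy, na.rm = TRUE)
user_rel  <- rowMeans(toy, na.rm = TRUE) - avg_overall  # per-user offset
movie_rel <- colMeans(toy, na.rm = TRUE) - avg_overall  # per-movie offset

# Baseline for every user/movie pair, then keep only the unrated cells
baseline <- avg_overall + outer(user_rel, movie_rel, "+")
predictions <- ifelse(is.na(toy), round(baseline, 1), NA)
print(predictions)
```

Even in this toy example one prediction (u2 for m3) falls below 1, which illustrates why the estimates may need to be limited to the rating scale.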

Looking at the predictions, some ratings go above 5 and some go below 1, which are the limits of the rating scale. For this reason, I will clamp those values to the limits.

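# Clamp predictions to the 1-5 scale, leaving the Name column and "rated" cells untouched
# (comparing the whole data frame to 5 would compare names as strings and overwrite them)
df_predict[-1] <- lapply(df_predict[-1], function(x)
  ifelse(x == "rated", x, pmin(pmax(suppressWarnings(as.numeric(x)), 1), 5)))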
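Limiting values to a range is easiest to reason about on numeric data, because R compares a character string with a number as text rather than by value. A minimal sketch, assuming the predictions are kept in a numeric vector with NA for movies that were already rated (the vector `pred` is made up):

```r
# Made-up predictions; NA marks movies the user already rated
pred <- c(5.8, 4.4, NA, 0.8)

# pmax() raises values below 1, pmin() lowers values above 5; NA stays NA
clamped <- pmin(pmax(pred, 1), 5)
print(clamped)
```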

Now the global baseline estimator has been implemented.

Alternative Recommender

The global baseline estimator can be compared to other recommender methods. recommenderlab is an R package that “aims at providing a comprehensive research infrastructure for recommender systems”; in other words, it gives researchers a simple way to implement different types of recommender engines and decide which is best for their purposes. Below I compare several algorithms, including user-based collaborative filtering, following the examples section of “recommenderlab: An R Framework for Developing and Testing Recommendation Algorithms” (1). User-based collaborative filtering recommends items (in this case movies) based on similarities between users (2).
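To make the idea of user-based collaborative filtering concrete, here is a minimal base R sketch on made-up ratings. It is not how recommenderlab computes its predictions internally; the matrix `toy`, the helper `cosine_sim`, and the choice of a plain similarity-weighted average are all assumptions for illustration.

```r
# Made-up ratings: rows are users, columns are movies, NA = not watched
toy <- rbind(
  u1 = c(m1 = 5, m2 = 4, m3 = NA),
  u2 = c(m1 = 4, m2 = 5, m3 = 5),
  u3 = c(m1 = 1, m2 = 2, m3 = 1)
)

# Cosine similarity between two users over the movies both have rated
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / sqrt(sum(a[both]^2) * sum(b[both]^2))
}

# Predict u1's rating for m3 from the users who did rate m3
target <- "u1"
movie <- "m3"
neighbors <- setdiff(rownames(toy)[!is.na(toy[, movie])], target)
sims <- sapply(neighbors, function(u) cosine_sim(toy[target, ], toy[u, ]))

# Similarity-weighted average of the neighbors' ratings
prediction <- sum(sims * toy[neighbors, movie]) / sum(sims)
print(round(sims, 3))
print(round(prediction, 2))
```

Here u2 rates movies much like u1 does, so u2's rating of m3 carries more weight than u3's. Raw cosine similarity on a 1-5 scale tends to be high for most pairs of users, which is one reason practical implementations usually center each user's ratings before comparing them.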

The first steps are to create the matrix and turn it into a realRatingMatrix.

# convert dataframe to matrix
m_df <- as.matrix(df[,2:(ncol(df)-1)])

# convert to realRatingMatrix (sparse format)
r <- as(m_df, "realRatingMatrix")

# check out the matrix
head(getRatingMatrix(r))

## 6 x 6 sparse Matrix of class "dgCMatrix"
##      Avatar Inception Rodents.of.Unusual.Size
## [1,]      5         4                       .
## [2,]      4         4                       .
## [3,]      3         4                       .
## [4,]      4         4                       5
## [5,]      3         1                       .
## [6,]      5         4                       .
##      Harry.Potter.and.the.Sorcerer.s.Stone Top.Gun..Maverick Causeway
## [1,]                                     5                 .        .
## [2,]                                     5                 4        2
## [3,]                                     3                 4        .
## [4,]                                     3                 2        .
## [5,]                                     4                 .        .
## [6,]                                     4                 4        .

# check distribution
hist(getRatings(r), breaks=5)

Now the recommenders can be fit and evaluated, and the error plot generated.

# set seed for consistency
set.seed(09041996)

# create an evaluation scheme
e <- evaluationScheme(r, method="split", train=0.8, given=-2, goodRating=5)

## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead

# compare multiple algorithms
algorithms <- list(
  "random items" = list(name="RANDOM", param=NULL),
  "popular items" = list(name="POPULAR", param=NULL),
  "user-based CF" = list(name="UBCF", param=NULL),
  "item-based CF" = list(name="IBCF", param=NULL),
  "SVD approximation" = list(name="SVD", param=list(k = 3))
  )

results <- evaluate(e, algorithms, type = "ratings")

## RANDOM run fold/sample [model time/prediction time]
##   1  [0.02sec/0.01sec] 
## POPULAR run fold/sample [model time/prediction time]
##   1  [0.02sec/0.08sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.01sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.06sec/0sec] 
## SVD run fold/sample [model time/prediction time]
##   1

## Warning in irlba::irlba(m, nv = p$k, maxit = p$maxiter): You're computing too
## large a percentage of total singular values, use a standard svd instead.

## [0sec/0sec]

plot(results, ylim = c(0,5))

Based on this plot, the item-based collaborative filter had the least error.
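When predicted ratings are evaluated, the error is usually summarized with measures such as RMSE, MSE, and MAE. A minimal sketch of how these are calculated, using made-up held-out ratings and predictions rather than the results above:

```r
# Made-up held-out ratings and the predictions for them
actual    <- c(4, 2, 5)
predicted <- c(3, 2, 4)

err  <- predicted - actual
mse  <- mean(err^2)      # mean squared error
rmse <- sqrt(mse)        # root mean squared error
mae  <- mean(abs(err))   # mean absolute error

print(round(c(RMSE = rmse, MSE = mse, MAE = mae), 3))
```

Lower values mean the predictions sit closer to the ratings that were held out. RMSE penalizes large misses more heavily than MAE does.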

Conclusion

In conclusion, a global baseline estimate recommender was implemented. Additionally, the basic functionality of recommenderlab was explored to see which type of recommender in its fleet had the least error.

Citations

(1) recommenderlab: An R Framework for Developing and Testing Recommendation Algorithms (https://cran.r-project.org/web/packages/recommenderlab/vignettes/recommenderlab.pdf)
(2) Recommendation Systems (http://infolab.stanford.edu/~ullman/mmds/ch9.pdf)