The purpose of this assignment is to get an introduction to recommender systems and modeling. The global baseline estimate recommender is a basic recommender system that combines three averages: the overall average of all users' ratings, the average of a specific user's ratings across all the movies they watched, and the average rating across all users for a specific movie. The formula is:
Global Baseline Estimate = Mean movie rating overall + Specific movie's rating relative to average + User's rating relative to average
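As a quick illustration of the formula, here is a minimal worked example in R. The numbers are made up for illustration and are not taken from the survey data.

```r
# Hypothetical values, chosen only to illustrate the formula
overall_mean <- 3.5   # mean of all ratings
movie_mean   <- 4.0   # mean rating of one specific movie
user_mean    <- 3.0   # mean rating given by one specific user

movie_rel <- movie_mean - overall_mean   # this movie is rated 0.5 above average
user_rel  <- user_mean - overall_mean    # this user rates 0.5 below average

estimate <- overall_mean + movie_rel + user_rel
print(estimate)
```

In this example the movie's positive offset and the user's negative offset cancel out, so the estimate equals the overall mean of 3.5.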
As an extension, recommenderlab can be used to implement other types of recommender systems to compare how they work in relation to each other.
library(tidyverse)
library(recommenderlab)
library(kableExtra)
This dataset was previously collected via a Google survey of my friends and family.
path <- "https://raw.githubusercontent.com/klgriffen96/spring23_data607_wk11/main/movie_ratings.csv"
df <- read.csv(path)
kable(head(df))
Timestamp | Name | Avatar | Inception | Rodents.of.Unusual.Size | Harry.Potter.and.the.Sorcerer.s.Stone | Top.Gun..Maverick | Causeway | Record.ID | Result | Timestamp.1 |
---|---|---|---|---|---|---|---|---|---|---|
1/29/2023 17:42:04 | Helena | 5 - AMAZING | 4 - GOOD | I did not watch this movie | 5 - AMAZING | I did not watch this movie | I did not watch this movie | Helena | OK | 2023-02-02 15:19 |
1/29/2023 17:42:45 | Alvaro | 4 - GOOD | 4 - GOOD | I did not watch this movie | 5 - AMAZING | 4 - GOOD | 2 - BAD | Alvaro | OK | 2023-02-02 15:19 |
1/29/2023 17:43:21 | Nolan | 3 - OKAY | 4 - GOOD | I did not watch this movie | 3 - OKAY | 4 - GOOD | I did not watch this movie | Nolan | OK | 2023-02-02 15:19 |
1/29/2023 18:03:40 | Emily | 4 - GOOD | 4 - GOOD | 5 - AMAZING | 3 - OKAY | 2 - BAD | I did not watch this movie | Emily | OK | 2023-02-02 15:19 |
1/29/2023 18:28:07 | Avocado Toast | 3 - OKAY | 1 - DISSAPOINTING | I did not watch this movie | 4 - GOOD | I did not watch this movie | I did not watch this movie | Avocado Toast | OK | 2023-02-02 15:19 |
1/30/2023 10:21:05 | Emma Griffen | 5 - AMAZING | 4 - GOOD | I did not watch this movie | 4 - GOOD | 4 - GOOD | I did not watch this movie | Emma Griffen | OK | 2023-02-02 15:19 |
Select only the necessary columns and convert the rating labels to numeric values.
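# Keep only the Name column and the six movie columns
df <- df[c(2:8)]
# Replace each rating label with its score (labels must match the survey text exactly)
df[df == "1 - DISSAPOINTING"] <- 1
df[df == "2 - BAD"] <- 2
df[df == "3 - OKAY"] <- 3
df[df == "4 - GOOD"] <- 4
df[df == "5 - AMAZING"] <- 5
# Movies that were not watched become missing values
df[df == "I did not watch this movie"] <- NA
# The scores are still stored as text, so convert the movie columns to numeric
df[, 2:ncol(df)] <- lapply(2:ncol(df), function(x) as.numeric(df[[x]]))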
kable(head(df))
Name | Avatar | Inception | Rodents.of.Unusual.Size | Harry.Potter.and.the.Sorcerer.s.Stone | Top.Gun..Maverick | Causeway |
---|---|---|---|---|---|---|
Helena | 5 | 4 | NA | 5 | NA | NA |
Alvaro | 4 | 4 | NA | 5 | 4 | 2 |
Nolan | 3 | 4 | NA | 3 | 4 | NA |
Emily | 4 | 4 | 5 | 3 | 2 | NA |
Avocado Toast | 3 | 1 | NA | 4 | NA | NA |
Emma Griffen | 5 | 4 | NA | 4 | 4 | NA |
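Since every rating label starts with its numeric score, the same conversion could also be sketched by parsing that leading digit instead of replacing each label one at a time. This is an alternative, not the approach used above, and the helper name `label_to_score` is hypothetical.

```r
# Hypothetical helper: take the first character of each label and convert it
# to a number. Labels that do not start with a digit become NA.
label_to_score <- function(x) {
  suppressWarnings(as.numeric(substr(x, 1, 1)))
}

labels <- c("5 - AMAZING", "1 - DISSAPOINTING", "I did not watch this movie")
scores <- label_to_score(labels)
print(scores)
```

This relies on the assumption that every score is a single digit, which holds for a 1 to 5 scale but would not for a 1 to 10 scale.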
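The next step is to implement the global baseline estimate recommender. To recommend a movie with this method, three values need to be known: the average rating overall, the average rating given by the specific user, and the average rating of the specific movie.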
First, I will make a copy of the dataframe, called df_predict, to store my predictions. I will mark the movies that were already rated as “rated” so that they will not receive a prediction. The remaining movies will be NA until they are predicted.
# Make a copy of the df
df_predict <- df
# Mark the already rated movies as "rated" (skipping the Name column)
df_predict[, -1][!is.na(df_predict[, -1])] <- "rated"
Now I can find the three key values that need to be known to do the calculation of the global baseline recommender.
# Find the user average
df$user_mean <- rowMeans(df[,2:ncol(df)], na.rm = TRUE)
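# Find the movie averages (cols also holds the mean of the new user_mean column)
cols <- colMeans(df[1:nrow(df),2:ncol(df)], na.rm = TRUE)
# Find the average overall: this is the mean of the per-user means, which can
# differ slightly from the mean of all individual ratings
avg_overall <- cols[["user_mean"]]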
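The overall average above is taken as the mean of the per-user means. When users have rated different numbers of movies, that value can differ from the mean of all individual ratings. The following sketch shows the difference on a small made-up matrix; the values are illustrative only.

```r
# Made-up ratings: rows are users, columns are movies, NA means not rated
ratings <- matrix(
  c(5,  4, NA,
    3, NA, NA),
  nrow = 2, byrow = TRUE
)

# Every individual rating counts equally: (5 + 4 + 3) / 3
mean_of_all <- mean(ratings, na.rm = TRUE)

# Every user counts equally: (4.5 + 3) / 2
mean_of_user_means <- mean(rowMeans(ratings, na.rm = TRUE))

print(c(mean_of_all = mean_of_all, mean_of_user_means = mean_of_user_means))
```

Here the two values are 4 and 3.75. Which one to use is a design choice: the first weights every rating equally, the second weights every user equally.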
Now I can implement the recommender.
# All rows
for (i in seq_len(nrow(df_predict))) {
  # Columns for ratings
  for (ii in 2:ncol(df_predict)) {
    # Check if need to calculate rating
    if (is.na(df_predict[i, ii])) {
      user_rel <- df$user_mean[i] - avg_overall
      movie_rel <- cols[[ii - 1]] - avg_overall
      df_predict[i, ii] <- round(avg_overall + user_rel + movie_rel, 1)
    }
  }
}
kable(head(df_predict))
Name | Avatar | Inception | Rodents.of.Unusual.Size | Harry.Potter.and.the.Sorcerer.s.Stone | Top.Gun..Maverick | Causeway |
---|---|---|---|---|---|---|
Helena | rated | rated | 5.8 | rated | 4.4 | 2.8 |
Alvaro | rated | rated | 5 | rated | rated | rated |
Nolan | rated | rated | 4.7 | rated | rated | 1.7 |
Emily | rated | rated | rated | rated | rated | 1.8 |
Avocado Toast | rated | rated | 3.8 | rated | 2.4 | 0.8 |
Emma Griffen | rated | rated | 5.4 | rated | rated | 2.4 |
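The same calculation can also be sketched without explicit loops by using `outer()` to combine the user and movie offsets. This is a minimal illustration on made-up ratings, not a replacement for the code above, and it assumes the overall average is the mean of all individual ratings rather than the mean of the user means.

```r
# Made-up ratings: rows are users, columns are movies, NA means not rated
ratings <- matrix(
  c( 5,  4, NA,
     4, NA,  2,
    NA,  3,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3"), c("m1", "m2", "m3"))
)

overall   <- mean(ratings, na.rm = TRUE)                # mean of all ratings
user_rel  <- rowMeans(ratings, na.rm = TRUE) - overall  # one offset per user
movie_rel <- colMeans(ratings, na.rm = TRUE) - overall  # one offset per movie

# outer() adds every user offset to every movie offset in one step
baseline <- overall + outer(user_rel, movie_rel, "+")

# Keep predictions only where no rating exists yet
predicted <- ifelse(is.na(ratings), round(baseline, 1), NA)
print(predicted)
```

For example, user u1 has not rated m3, and the sketch predicts 2.8 for that cell, while cells that already hold a rating stay NA in `predicted`.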
Looking at the data, some predictions go above 5 and some go below 1, which were the limits of my rating scale. For this reason I will clamp those values to the limits.
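# Compare as numbers, skipping the Name column; "rated" cells become NA and are left untouched
num <- suppressWarnings(sapply(df_predict[-1], as.numeric))
df_predict[-1][!is.na(num) & num > 5] <- 5
df_predict[-1][!is.na(num) & num < 1] <- 1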
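When predictions are held in a plain numeric vector, limiting them to the rating scale can be sketched with `pmax()` and `pmin()`. The values below are made up for illustration.

```r
# Made-up predictions, two of which fall outside the 1 to 5 scale
preds <- c(5.8, 4.4, 0.8)

# pmax() raises anything below 1 up to 1, then pmin() lowers anything above 5 to 5
clamped <- pmin(pmax(preds, 1), 5)
print(clamped)
```

Values already inside the scale, such as 4.4, pass through unchanged.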
Now the global baseline estimator has been implemented.
The global baseline estimator can be compared to other methods of recommender engines. recommenderlab is a library in R that “aims at providing a comprehensive research infrastructure for recommender systems”, essentially meaning that it gives researchers a simple mechanism for implementing different types of recommender engines and deciding which is best for their purposes. Below I will implement a User Based Collaborative Filtering algorithm, along with several other algorithms for comparison, based on the examples section in “recommenderlab: An R Framework for Developing and Testing Recommendation Algorithms” (1). User Based Collaborative Filtering recommends items (in this case movies) based on similarities between users (2).
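To make the idea of similarity between users concrete, here is a minimal base R sketch of cosine similarity computed over the movies that two users have both rated. The ratings and the helper name `cosine_sim` are made up for illustration, and recommenderlab's own similarity computation may differ in its details.

```r
# Hypothetical helper: cosine similarity over the items both users rated
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

# Made-up ratings for three users across four movies (NA means not rated)
user_a <- c(5, 4, NA, 5)
user_b <- c(4, 4, 5, 5)
user_c <- c(1, 5, NA, 2)

sim_ab <- cosine_sim(user_a, user_b)
sim_ac <- cosine_sim(user_a, user_c)
print(c(sim_ab = sim_ab, sim_ac = sim_ac))
```

In this toy example user_a is more similar to user_b than to user_c, so a user-based approach would lean more on user_b's ratings when predicting for user_a.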
The first steps are to create the matrix and turn it into a realRatingMatrix.
# convert dataframe to matrix
m_df <- as.matrix(df[,2:(ncol(df)-1)])
# convert to realRatingMatrix (sparse format)
r <- as(m_df, "realRatingMatrix")
# check out the matrix
head(getRatingMatrix(r))
## 6 x 6 sparse Matrix of class "dgCMatrix"
## Avatar Inception Rodents.of.Unusual.Size
## [1,] 5 4 .
## [2,] 4 4 .
## [3,] 3 4 .
## [4,] 4 4 5
## [5,] 3 1 .
## [6,] 5 4 .
## Harry.Potter.and.the.Sorcerer.s.Stone Top.Gun..Maverick Causeway
## [1,] 5 . .
## [2,] 5 4 2
## [3,] 3 4 .
## [4,] 3 2 .
## [5,] 4 . .
## [6,] 4 4 .
# check distribution
hist(getRatings(r), breaks=5)
Now the recommenders can be implemented and evaluated, and the error plot generated.
# set seed for consistency
set.seed(09041996)
# create an evaluation scheme
e <- evaluationScheme(r, method="split", train=0.8, given=-2, goodRating=5)
## as(<dgCMatrix>, "dgTMatrix") is deprecated since Matrix 1.5-0; do as(., "TsparseMatrix") instead
# comparing multiple
algorithms <- list(
"random items" = list(name="RANDOM", param=NULL),
"popular items" = list(name="POPULAR", param=NULL),
"user-based CF" = list(name="UBCF", param=NULL),
"item-based CF" = list(name="IBCF", param=NULL),
"SVD approximation" = list(name="SVD", param=list(k = 3))
)
results <- evaluate(e, algorithms, type = "ratings")
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.02sec/0.01sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.02sec/0.08sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.01sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.06sec/0sec]
## SVD run fold/sample [model time/prediction time]
## 1
## Warning in irlba::irlba(m, nv = p$k, maxit = p$maxiter): You're computing too
## large a percentage of total singular values, use a standard svd instead.
## [0sec/0sec]
plot(results, ylim = c(0,5))
Based on this plot, the item-based collaborative filter had the least error.
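For rating predictions, recommenderlab reports error measures such as RMSE, MSE and MAE. As a reminder of what those measure, here is a minimal base R sketch on made-up numbers. It does not reproduce recommenderlab's internals or the results above.

```r
# Made-up held-out ratings and the predictions for them
actual    <- c(4, 2, 5)
predicted <- c(3.5, 3, 4)

err  <- predicted - actual
mse  <- mean(err^2)      # mean squared error
rmse <- sqrt(mse)        # root mean squared error, in rating units
mae  <- mean(abs(err))   # mean absolute error

print(round(c(RMSE = rmse, MSE = mse, MAE = mae), 3))
```

Lower values mean the predictions sit closer to the held-out ratings. RMSE penalizes large misses more heavily than MAE does.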
In conclusion, a global baseline estimate recommender was implemented. Additionally, basic functionality of recommenderlab was explored to see which type of recommender in its fleet had the least error.