EC 5 - Recommender Systems

The goal of this project is to implement a Global Baseline Estimate recommendation system in R based on movie ratings. This is an extension of assignment 2, where movie ratings were collected from different individuals.

For this, we will use the ratings collected in assignment 2 by connecting to the SQL database in which they are stored.

Connecting to PostgreSQL Database

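library(DBI)        # dbConnect(), dbGetQuery()
library(RPostgres)  # Postgres() driver
library(dplyr)      # joins and summaries used below

# Connect to the local PostgreSQL instance; the password comes from an environment variable
con <- dbConnect(
  Postgres(),
  host = "localhost",
  port = 5432,
  user = "postgres",
  password = Sys.getenv("SQL_DB_PASS"),
  dbname = "cuny-sps"
)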

Loading the Tables

movie_ratings <- dbGetQuery(con, "SELECT * FROM movie_ratings")
raters <- dbGetQuery(con, "SELECT * FROM raters")
movies <- dbGetQuery(con, "SELECT * FROM movies")

Movie Ratings Table

rmarkdown::paged_table(movie_ratings)

Raters Table

rmarkdown::paged_table(raters)

Movies Table

rmarkdown::paged_table(movies)

Join Tables

To implement the recommender system, we first join the three tables on the movieid and raterid columns.

movie_ratings_full <- movie_ratings |>
  left_join(movies, by = "movieid") |>
  left_join(raters, by = "raterid") |>
  select(rater = name, movie_title, rating)

rmarkdown::paged_table(movie_ratings_full)

Building the System

We will use the following formula to calculate the predicted rating for each movie a rater has not yet rated:

Global Baseline Estimate = Mean Movie Rating + Movie’s rating relative to average + Rater’s rating relative to average

Here, mean movie rating is the average rating across all movies and raters, movie’s rating relative to average is the movie’s average rating minus the mean movie rating, and rater’s rating relative to average is the rater’s average rating minus the mean movie rating.
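As a quick arithmetic check of the formula, here is a small sketch in base R. The three input numbers are made up for illustration and do not come from the assignment data.

```r
# Hypothetical inputs, for illustration only
mean_movie_rating <- 3.5  # average over every rating in the table
avg_movie_rating  <- 4.2  # this movie's average rating
avg_rater_rating  <- 3.0  # this rater's average rating

# Offsets relative to the overall mean
movie_relative_rating <- avg_movie_rating - mean_movie_rating  # about +0.7
rater_relative_rating <- avg_rater_rating - mean_movie_rating  # -0.5

# Global Baseline Estimate: overall mean plus both offsets
global_baseline_estimate <- mean_movie_rating +
  movie_relative_rating +
  rater_relative_rating

global_baseline_estimate  # about 3.7
```

A well-liked movie pulls the estimate up, while a rater who tends to score low pulls it back down.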

First, let’s calculate the average rating for each rater:

avg_rater_ratings <- movie_ratings_full |>
  group_by(rater) |>
  summarize(avg_rater_rating = round(mean(rating, na.rm = TRUE), 2))

Now, let’s find the average rating for each movie:

avg_movie_ratings <- movie_ratings_full |>
  group_by(movie_title) |>
  summarize(avg_movie_rating = round(mean(rating, na.rm = TRUE), 2))

We can now join both tables to the original ratings table, compute the overall mean rating across all movies and raters, and derive the relative ratings for raters and movies.

predicted_ratings_calc <- movie_ratings_full |>
  left_join(avg_rater_ratings, by="rater") |>
  left_join(avg_movie_ratings, by="movie_title") |>
  mutate(mean_movie_rating = round(mean(rating, na.rm = TRUE), 2),
         rater_relative_rating = avg_rater_rating - mean_movie_rating,
         movie_relative_rating = avg_movie_rating - mean_movie_rating)

# Predict only where no rating exists: overall mean plus the rater and movie offsets
predicted_ratings <- predicted_ratings_calc |>
  mutate(predicted_rating = ifelse(is.na(rating), mean_movie_rating + rater_relative_rating + movie_relative_rating, NA)) |>
  select(rater, movie_title, predicted_rating)

rmarkdown::paged_table(predicted_ratings)
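To sanity-check the calculation, the same Global Baseline Estimate can be worked through on a tiny data set whose answers are easy to verify by hand. This is a minimal sketch: the `toy` data frame, its raters and its movie names are hypothetical, and the intermediate averages are left unrounded.

```r
library(dplyr)

# Hypothetical toy ratings; NA marks a movie the rater has not rated
toy <- data.frame(
  rater       = rep(c("A", "B", "C"), each = 3),
  movie_title = rep(c("M1", "M2", "M3"), times = 3),
  rating      = c(5, 4, NA,  3, NA, 1,  NA, 4, 1)
)

# Overall mean across all observed ratings: 18 / 6 = 3
global_mean <- mean(toy$rating, na.rm = TRUE)

toy_predictions <- toy |>
  group_by(rater) |>
  mutate(rater_relative_rating = mean(rating, na.rm = TRUE) - global_mean) |>
  group_by(movie_title) |>
  mutate(movie_relative_rating = mean(rating, na.rm = TRUE) - global_mean) |>
  ungroup() |>
  filter(is.na(rating)) |>   # keep only the unrated rater/movie pairs
  mutate(predicted_rating = global_mean +
           movie_relative_rating +
           rater_relative_rating) |>
  select(rater, movie_title, predicted_rating)

toy_predictions
# A / M3: 3 + (1 - 3) + (4.5 - 3) = 2.5
# B / M2: 3 + (4 - 3) + (2.0 - 3) = 3.0
# C / M1: 3 + (4 - 3) + (2.5 - 3) = 3.5
```

Rater A scores movies highly overall, but M3 is poorly rated by everyone else, so the two offsets partly cancel.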

Some predicted ratings fall outside the 1-5 rating range. This is probably because some raters were more generous with their ratings than others, so their predicted ratings come out higher. Although such scores are out of range, they still give good insight into which movies to recommend, since the recommendation depends only on which predicted rating is highest.

Let’s rescale these ratings so that the highest one equals 5, by dividing every predicted rating by the maximum predicted rating and then multiplying by 5. Note that this only caps the top of the scale; it does not by itself guarantee that every rescaled rating is at least 1.

stand_predicted_ratings <- predicted_ratings |>
  mutate(predicted_rating = round(predicted_rating / max(predicted_rating, na.rm = TRUE) * 5, 2))
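The effect of this rescaling is easiest to see on a few numbers. The sketch below uses a hypothetical vector of predictions and compares the divide-by-maximum approach with clamping, which is an assumed alternative shown only for contrast and is not the method used in this report.

```r
# Hypothetical predicted ratings, some outside the 1-5 range
predicted <- c(0.4, 2.5, 4.0, 6.0)

# Approach used here: divide by the maximum, then multiply by 5
scaled <- round(predicted / max(predicted, na.rm = TRUE) * 5, 2)
scaled   # 0.33 2.08 3.33 5.00: the top is 5, but the lowest is still below 1

# Assumed alternative: clamp into [1, 5], leaving in-range values untouched
clamped <- pmin(pmax(predicted, 1), 5)
clamped  # 1.0 2.5 4.0 5.0
```

Dividing by the maximum preserves the ordering of the predictions, which is what matters for recommending a top movie. Clamping keeps in-range values unchanged but can create ties at the ends of the scale.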

Since raters can only rate movies on a discrete 1-5 scale, we can round each predicted rating to the nearest whole number to see what the rater might actually give the movie. The decimal predictions are still useful, though, because they show which movies the rater is likely to prefer. For example, 3.8 and 4.3 both round to 4, but according to the Global Baseline Estimate the rater may enjoy the movie predicted at 4.3 more. We therefore keep the decimal rating as well, to pick the top recommendation for each rater or to order several recommendations.

whole_predicted_ratings <- stand_predicted_ratings |>
  mutate(pred_rating_whole = round(predicted_rating, 0))
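One detail worth knowing about `round()` in R: exact halves are rounded to the nearest even digit rather than always up. The short sketch below illustrates this with made-up values; the `floor(x + 0.5)` variant is an assumed alternative for anyone who prefers halves to round up, not something used in this report.

```r
# Both values collapse to the same whole-number rating
whole <- round(c(3.8, 4.3))
whole    # 4 4

# Exact halves go to the even digit
halves <- round(c(2.5, 3.5))
halves   # 2 4

# Assumed alternative: always round halves up
half_up <- floor(c(2.5, 3.5) + 0.5)
half_up  # 3 4
```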

We can now recommend a movie to each person based on their predicted top-rated movie:

whole_predicted_ratings |>
  group_by(rater) |>
  mutate(max_rating = max(predicted_rating, na.rm = TRUE)) |>
  filter(predicted_rating == max_rating) |>
  select(-max_rating, -predicted_rating) |>
  knitr::kable(col.names = c("Rater", "Movie Title", "Predicted Rating"))

Rater	Movie Title	Predicted Rating
Sarah	Knives Out Glass Onion	5
Shani	Ticket to Paradise	2
Leah	Black Panther Wakanda Forever	4
Shimon	Bullet Train	4
Dinah	Black Panther Wakanda Forever	4
Abe	Bullet Train	3
Leeor	Black Panther Wakanda Forever	3
Leeor	Knives Out Glass Onion	3
Talia	Black Panther Wakanda Forever	4
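Leeor appears twice because two movies tie for the highest predicted rating. The same top-pick step could also be written with dplyr's `slice_max()`, which keeps ties by default. This is a sketch on a hypothetical `toy_predictions` data frame rather than the assignment data, and it filters out missing predictions first so that already-rated movies are never selected.

```r
library(dplyr)

# Hypothetical predictions; NA marks a movie the rater has already rated
toy_predictions <- data.frame(
  rater            = c("A", "A", "A", "B", "B", "B"),
  movie_title      = c("M1", "M2", "M3", "M1", "M2", "M3"),
  predicted_rating = c(3.2, 4.1, NA, 2.9, 2.9, 1.5)
)

top_picks <- toy_predictions |>
  filter(!is.na(predicted_rating)) |>   # drop movies with no prediction
  group_by(rater) |>
  slice_max(predicted_rating, n = 1, with_ties = TRUE) |>
  ungroup()

top_picks
# Rater A gets M2; rater B gets both M1 and M2 because they tie at 2.9
```

Setting `with_ties = FALSE` would return exactly one movie per rater instead.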