Assignment 3A Global Baseline Estimate

Author

Long Lin

Overview

For this assignment, I’ll use the movie ratings data provided in the spreadsheet to implement a Global Baseline Estimate recommendation system in R, following the algorithm provided with the assignment. The Global Baseline Estimate is a simple, widely used non-personalized baseline: it predicts a rating from the mean movie rating, adjusted by how far the movie and the user each sit from that mean.

source: https://raw.githubusercontent.com/longflin/DATA-607/refs/heads/main/Assignment%203A/Movie_ratings.csv

Data Input

First, I’ll load the data from the provided spreadsheet into a data frame. I entered the ratings manually, one row per user-movie pair.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ratings_data <- tibble(
  user_id = c("Burton", "Burton",
              "Charley", "Charley", "Charley", "Charley", "Charley", "Charley",
              "Dan", "Dan",
              "Dieudonne", "Dieudonne", "Dieudonne",
              "Matt", "Matt", "Matt", "Matt",
              "Mauricio", "Mauricio", "Mauricio", "Mauricio",
              "Max", "Max", "Max", "Max", "Max", "Max",
              "Nathan",
              "Param", "Param", "Param", "Param",
              "Parshu", "Parshu", "Parshu", "Parshu", "Parshu", "Parshu",
              "Prashanth", "Prashanth", "Prashanth", "Prashanth", "Prashanth",
              "Shipra", "Shipra", "Shipra",
              "Sreejaya", "Sreejaya", "Sreejaya", "Sreejaya", "Sreejaya", "Sreejaya",
              "Steve", "Steve",
              "Vuthy", "Vuthy", "Vuthy", "Vuthy", "Vuthy",
              "Xingjia", "Xingjia"
              ),
  movie_id = c("JungleBook", "StarWarsForce",
              "CaptainAmerica", "Deadpool", "Frozen", "JungleBook", "PitchPerfect", "StarWarsForce",
              "Deadpool", "StarWarsForce",
              "CaptainAmerica", "Deadpool", "StarWarsForce",
              "CaptainAmerica", "Frozen", "PitchPerfect", "StarWarsForce",
              "CaptainAmerica", "Frozen", "JungleBook", "PitchPerfect",
              "CaptainAmerica", "Deadpool", "Frozen", "JungleBook", "PitchPerfect", "StarWarsForce",
              "StarWarsForce",
              "CaptainAmerica", "Deadpool", "Frozen", "StarWarsForce",
              "CaptainAmerica", "Deadpool", "Frozen", "JungleBook", "PitchPerfect", "StarWarsForce",
              "CaptainAmerica", "Deadpool", "Frozen", "JungleBook", "StarWarsForce",
              "Frozen", "JungleBook", "StarWarsForce",
              "CaptainAmerica", "Deadpool", "Frozen", "JungleBook", "PitchPerfect", "StarWarsForce",
              "CaptainAmerica", "StarWarsForce",
              "CaptainAmerica", "Deadpool", "Frozen", "JungleBook", "PitchPerfect",
              "Frozen", "JungleBook"
              ),
  rating  = c(4, 4,
              4, 5, 4, 3, 2, 3,
              5, 5, 
              5, 4, 5,
              4, 2, 2, 5,
              4, 3, 3, 4,
              4, 4, 4, 2, 2, 4,
              4,
              4, 4, 1, 5,
              4, 3, 5, 5, 2, 3,
              5, 5, 5, 5, 4,
              4, 5, 3,
              5, 5, 5, 4, 4, 5,
              4, 4,
              4, 5, 3, 3, 3,
              5, 5
              )
)

print(ratings_data)
# A tibble: 61 × 3
   user_id movie_id       rating
   <chr>   <chr>           <dbl>
 1 Burton  JungleBook          4
 2 Burton  StarWarsForce       4
 3 Charley CaptainAmerica      4
 4 Charley Deadpool            5
 5 Charley Frozen              4
 6 Charley JungleBook          3
 7 Charley PitchPerfect        2
 8 Charley StarWarsForce       3
 9 Dan     Deadpool            5
10 Dan     StarWarsForce       5
# ℹ 51 more rows
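
As an alternative to typing the ratings by hand, the same long-format table could be built by reading a CSV and reshaping it. This is only a sketch under stated assumptions: it assumes a wide layout with one row per critic and one column per movie, and the column name `Critic` is hypothetical. The layout of the source file listed in the Overview has not been checked here, so the two-row excerpt below is inline text that reuses Burton's and Dan's ratings from the table above.

```r
library(readr)
library(tidyr)
library(dplyr)

# Assumed wide layout: one row per critic, one column per movie,
# empty cells for unrated movies. "Critic" is a hypothetical column name.
wide_csv <- "Critic,JungleBook,StarWarsForce,Deadpool
Burton,4,4,
Dan,,5,5"

# I() tells read_csv() to treat the string as literal data, not a file path
ratings_wide <- read_csv(I(wide_csv), show_col_types = FALSE)

# Reshape to one row per user-movie pair and drop the unrated cells
ratings_long <- ratings_wide |>
  pivot_longer(
    cols = -Critic,
    names_to = "movie_id",
    values_to = "rating",
    values_drop_na = TRUE
  ) |>
  rename(user_id = Critic)

print(ratings_long)
```

If the real file follows the same layout, replacing `I(wide_csv)` with the source URL should give a table shaped like `ratings_data`.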

Calculations

I’ll calculate the mean movie rating as the mean of all the ratings, rounded to two decimal places. The biases below are measured against this rounded value.

mean_movie_rating <- round(mean(ratings_data$rating), 2)

# The value is already rounded, so it can be printed directly
print(paste("mean_movie_rating:", mean_movie_rating))
[1] "mean_movie_rating: 3.93"

Next, I’ll calculate the movie biases by subtracting the mean_movie_rating from the mean of the movie’s ratings.

movie_biases <- ratings_data |>
  group_by(movie_id) |>
  summarise(movie_bias = round(mean(rating) - mean_movie_rating, 2))

print(movie_biases)
# A tibble: 6 × 2
  movie_id       movie_bias
  <chr>               <dbl>
1 CaptainAmerica       0.34
2 Deadpool             0.51
3 Frozen              -0.2 
4 JungleBook          -0.03
5 PitchPerfect        -1.22
6 StarWarsForce        0.22

Next, I’ll calculate the user biases by subtracting the mean_movie_rating from the mean of the user’s ratings.

# Movie biases are not needed for this step, so no join is required
user_biases <- ratings_data |>
  group_by(user_id) |>
  summarise(user_bias = mean(rating) - mean_movie_rating)

print(user_biases)
# A tibble: 16 × 2
   user_id   user_bias
   <chr>         <dbl>
 1 Burton       0.0700
 2 Charley     -0.43  
 3 Dan          1.07  
 4 Dieudonne    0.737 
 5 Matt        -0.68  
 6 Mauricio    -0.43  
 7 Max         -0.597 
 8 Nathan       0.0700
 9 Param       -0.43  
10 Parshu      -0.263 
11 Prashanth    0.87  
12 Shipra       0.0700
13 Sreejaya     0.737 
14 Steve        0.0700
15 Vuthy       -0.33  
16 Xingjia      1.07  
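
As a sanity check, one movie bias and one user bias can be recomputed by hand. The sketch below uses the seven PitchPerfect ratings and Param's four ratings, copied from `ratings_data`, together with the rounded mean of 3.93 printed earlier. The variable names are introduced here for illustration only.

```r
# Rounded mean movie rating, as printed earlier
mean_movie_rating <- 3.93

# Ratings copied from ratings_data
pitch_perfect_ratings <- c(2, 2, 4, 2, 2, 4, 3)  # all seven PitchPerfect ratings
param_ratings <- c(4, 4, 1, 5)                   # Param's four ratings

# Movie bias: the movie's mean rating minus the mean movie rating (rounded, as above)
pitch_perfect_bias <- round(mean(pitch_perfect_ratings) - mean_movie_rating, 2)

# User bias: the user's mean rating minus the mean movie rating
param_bias <- mean(param_ratings) - mean_movie_rating

print(pitch_perfect_bias)
print(param_bias)
```

These should reproduce the -1.22 shown for PitchPerfect and the -0.43 shown for Param in the tables above.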

Making a prediction

I’ll use the recommendation system to predict what Param will rate PitchPerfect.

target_user <- "Param"
target_movie <- "PitchPerfect"

target_user_bias <- user_biases |>
  filter(user_id == target_user) |>
  pull(user_bias)

target_movie_bias <- movie_biases |>
  filter(movie_id == target_movie) |>
  pull(movie_bias)

# Global Baseline Estimate = mean movie rating + movie bias (PitchPerfect) + user bias (Param)
predicted_rating <- mean_movie_rating + target_movie_bias + target_user_bias

# Clamp the prediction (ensure it stays between 1 and 5)
predicted_rating <- pmin(pmax(predicted_rating, 1), 5)

print(paste("Predicted rating for", target_user, "on", target_movie, "is", round(predicted_rating, 2)))
[1] "Predicted rating for Param on PitchPerfect is 2.28"
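
The arithmetic behind this prediction can be written out explicitly. The symbols $\mu$ (mean movie rating), $b_i$ (movie bias), and $b_u$ (user bias) are notation introduced here for convenience; the numbers come from the output above.

```latex
\begin{align*}
\hat{r}_{ui} &= \mu + b_i + b_u \\
\hat{r}_{\text{Param},\,\text{PitchPerfect}} &= 3.93 + (-1.22) + (-0.43) = 2.28
\end{align*}
```

Since 2.28 already lies between 1 and 5, the clamping step leaves it unchanged.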

Next, I’ll use the recommendation system to predict what Burton will rate CaptainAmerica.

target_user <- "Burton"
target_movie <- "CaptainAmerica"

target_user_bias <- user_biases |>
  filter(user_id == target_user) |>
  pull(user_bias)

target_movie_bias <- movie_biases |>
  filter(movie_id == target_movie) |>
  pull(movie_bias)

predicted_rating <- mean_movie_rating + target_movie_bias + target_user_bias

# Keep predictions within the typical 1 to 5 rating scale
predicted_rating <- pmin(pmax(predicted_rating, 1), 5)

print(paste("Predicted rating for", target_user, "on", target_movie, "is", round(predicted_rating, 2)))
[1] "Predicted rating for Burton on CaptainAmerica is 4.34"
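
Because the two predictions above repeat the same steps, they could be wrapped in a small helper. The sketch below is one possible way to do that, not part of the assignment's provided algorithm. The function name `predict_baseline` and the named-vector lookups are hypothetical, and the bias values are copied from the printed tables.

```r
# Values copied from the printed output above
mean_movie_rating <- 3.93
movie_bias <- c(CaptainAmerica = 0.34, PitchPerfect = -1.22)
user_bias  <- c(Burton = 0.07, Param = -0.43, Dan = 1.07)

# Hypothetical helper: mean + movie bias + user bias, clamped to the rating scale
predict_baseline <- function(user, movie, lower = 1, upper = 5) {
  # [[ ]] stops with an error if the user or movie name is unknown
  raw <- mean_movie_rating + movie_bias[[movie]] + user_bias[[user]]
  min(max(raw, lower), upper)
}

print(predict_baseline("Param", "PitchPerfect"))
print(predict_baseline("Burton", "CaptainAmerica"))
print(predict_baseline("Dan", "CaptainAmerica"))
```

The first two calls should match the 2.28 and 4.34 printed above. The third shows the clamp: the raw sum for Dan is 5.34, which is capped at 5.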

Generating Recommendations

To generate recommendations, I’ll first find every user-movie combination that has not been rated, then predict a rating for each one.

# tidyr is already attached by library(tidyverse); crossing() de-duplicates and sorts its inputs
all_movie_user_combinations <- crossing(user_id = ratings_data$user_id, movie_id = ratings_data$movie_id)

not_rated <- all_movie_user_combinations |>
  anti_join(ratings_data, by = c("user_id", "movie_id"))

predicted_ratings <- not_rated |>
  left_join(user_biases, by = "user_id") |>
  left_join(movie_biases, by = "movie_id") |>
  mutate(
    predicted_rating = mean_movie_rating + user_bias + movie_bias,
    # Keep predictions within the typical 1 to 5 rating scale
    predicted_rating = pmin(pmax(predicted_rating, 1), 5)
  )

# View predicted ratings for missing entries
head(predicted_ratings)
# A tibble: 6 × 5
  user_id movie_id       user_bias movie_bias predicted_rating
  <chr>   <chr>              <dbl>      <dbl>            <dbl>
1 Burton  CaptainAmerica    0.0700       0.34             4.34
2 Burton  Deadpool          0.0700       0.51             4.51
3 Burton  Frozen            0.0700      -0.2              3.8 
4 Burton  PitchPerfect      0.0700      -1.22             2.78
5 Dan     CaptainAmerica    1.07         0.34             5   
6 Dan     Frozen            1.07        -0.2              4.8 
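
The `crossing()` and `anti_join()` step above is easier to follow on a small scale. The sketch below uses only Burton's and Dan's four ratings from `ratings_data`. The names `small_ratings`, `all_pairs`, and `small_not_rated` are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Four ratings taken from ratings_data
small_ratings <- tibble(
  user_id  = c("Burton", "Burton", "Dan", "Dan"),
  movie_id = c("JungleBook", "StarWarsForce", "Deadpool", "StarWarsForce"),
  rating   = c(4, 4, 5, 5)
)

# Every user paired with every movie: 2 users x 3 movies = 6 rows
all_pairs <- crossing(
  user_id  = small_ratings$user_id,
  movie_id = small_ratings$movie_id
)

# Keep only the pairs that do not already have a rating
small_not_rated <- all_pairs |>
  anti_join(small_ratings, by = c("user_id", "movie_id"))

print(small_not_rated)
```

Two pairs remain, Burton with Deadpool and Dan with JungleBook, and these are the pairs a prediction would be generated for.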

Next, I’ll generate the top three recommendations for Dan by sorting his predicted ratings from highest to lowest.

user_recommendations <- predicted_ratings |>
  filter(user_id == "Dan") |>
  arrange(desc(predicted_rating)) |>
  select("Movie" = movie_id, "Predicted rating" = predicted_rating)

dan_top_three_recommendations <- head(user_recommendations, 3)
print(dan_top_three_recommendations)
# A tibble: 3 × 2
  Movie          `Predicted rating`
  <chr>                       <dbl>
1 CaptainAmerica               5   
2 JungleBook                   4.97
3 Frozen                       4.8 
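
The same idea could be extended to every user at once rather than filtering for one user at a time. The sketch below is one possible approach using `slice_max()` from dplyr. It works on a subset of predictions copied from the printed output above (Burton and Dan only), and the names `predicted_subset` and `top_n_per_user` are hypothetical.

```r
library(dplyr)

# Predicted ratings copied from the printed output above (Burton and Dan only)
predicted_subset <- tibble(
  user_id  = c("Burton", "Burton", "Burton", "Burton", "Dan", "Dan", "Dan"),
  movie_id = c("CaptainAmerica", "Deadpool", "Frozen", "PitchPerfect",
               "CaptainAmerica", "JungleBook", "Frozen"),
  predicted_rating = c(4.34, 4.51, 3.80, 2.78, 5.00, 4.97, 4.80)
)

# Top two predictions per user; with_ties = FALSE returns exactly n rows per user
top_n_per_user <- predicted_subset |>
  group_by(user_id) |>
  slice_max(predicted_rating, n = 2, with_ties = FALSE) |>
  ungroup()

print(top_n_per_user)
```

For this subset, Burton's top two are Deadpool and CaptainAmerica, and Dan's are CaptainAmerica and JungleBook. Setting `n = 3` and using the full `predicted_ratings` table should give Dan the same three movies listed above.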

Conclusion

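The Global Baseline Estimate recommendation system is a straightforward way to generate predictions: it adds a user’s bias and a movie’s bias to the mean movie rating. With this data, it predicted a rating of 2.28 for Param on PitchPerfect and 4.34 for Burton on CaptainAmerica, and it recommended CaptainAmerica, JungleBook, and Frozen as Dan’s top three movies.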