week_3A_global_baseline_estimate

Author

Brandon Chanderban

Introduction/Approach

The goal of this week 3A assignment is to implement a Global Baseline Estimate recommender system in RStudio. The data to be used for this task will come from the professor-provided Movie Ratings Excel file. I selected this option because it contains a larger number of respondents (more than the five in my own dataset), which should provide more observations for testing the recommender system’s output.

Compared with more personalized recommender systems (such as collaborative filtering), the Global Baseline approach builds each prediction from three components:

  • an overall average rating (the global mean),

  • a movie-specific bias term, and

  • a user-specific bias term.

The prediction formula for a specific user’s rating of a particular movie is therefore:

predicted rating = μ + bᵢ + bᵤ

That is, the predicted rating equals the global mean (μ) plus the movie bias (bᵢ) plus the user bias (bᵤ).
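To make the arithmetic concrete, here is a minimal sketch of the formula with hypothetical numbers. The values and the `_demo` variable names are assumptions chosen for illustration only; they are not taken from the ratings data.

```r
# Hypothetical values, chosen only to illustrate the formula
mu_demo  <- 3.5    # global mean of all observed ratings
b_i_demo <- 0.4    # movie bias: this movie averages 0.4 above the global mean
b_u_demo <- -0.3   # user bias: this user averages 0.3 below the global mean

# predicted rating = mu + b_i + b_u
predicted_demo <- mu_demo + b_i_demo + b_u_demo
predicted_demo   # 3.6
```

A well-liked movie raises the estimate, and a strict rater lowers it; the two adjustments are simply added to the global mean.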

Data Preparation

As mentioned above, the dataset will be taken from the Movie Ratings Excel file provided by Professor Catlin. The data is currently in a wide format, where rows represent individual users and columns represent movies. The cells contain the ratings, and blank cells represent movies that were not viewed (or at least not rated).

Therefore, in my data preparation workflow, after importing the Excel file into R (using the readxl package), the dataset will be reshaped into a long format (likely using pivot_longer()), and missing entries will be handled as NA values. This ensures that only observed ratings are used when computing the averages and bias terms.

The resulting dataframe will include the variables user, movie, and rating, similar to how my own dataset was structured in the previous Week 2A assignment.

Computing the Global Baseline Components

The next stage will involve calculating:

  1. The global mean (μ): the average across all observed movie ratings.

  2. The movie bias (bᵢ): for each movie, determine its average rating and subtract the global mean from this figure to discern whether it tends to be rated above or below average.

  3. The user bias (bᵤ): For each user, determine their average rating and subtract the global mean from this figure to discern whether the user tends to rate more stringently or generously than average.

Once these values are computed for all users and movies, it becomes possible to estimate ratings for movies that a reviewer has not rated. Recommendations can then be made by ranking unseen movies by their predicted ratings and selecting the highest.
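The three steps and the ranking can be sketched on a small toy dataset. This is an illustrative base R version under stated assumptions: the `toy` data frame, its values, and every `toy_` name are hypothetical, and the approach is not necessarily how the final implementation will be written.

```r
# Toy long-format ratings (hypothetical values, not the assignment data)
toy <- data.frame(
  user   = c("A", "A", "B", "B", "C", "C"),
  movie  = c("X", "Y", "X", "Z", "Y", "Z"),
  rating = c(5, 3, 4, 2, 4, 3)
)

# 1. Global mean over observed ratings
toy_mu <- mean(toy$rating, na.rm = TRUE)

# 2. Movie bias: each movie's average minus the global mean
toy_b_i <- tapply(toy$rating, toy$movie, mean, na.rm = TRUE) - toy_mu

# 3. User bias: each user's average minus the global mean
toy_b_u <- tapply(toy$rating, toy$user, mean, na.rm = TRUE) - toy_mu

# Build every user-movie pair, then keep only the pairs with no rating
all_pairs <- expand.grid(user = unique(toy$user),
                         movie = unique(toy$movie),
                         stringsAsFactors = FALSE)
already_rated <- paste(all_pairs$user, all_pairs$movie) %in%
  paste(toy$user, toy$movie)
unseen <- all_pairs[!already_rated, ]

# Predict and rank unseen movies within each user
unseen$predicted <- as.numeric(
  toy_mu + toy_b_u[unseen$user] + toy_b_i[unseen$movie]
)
unseen <- unseen[order(unseen$user, -unseen$predicted), ]
unseen
```

In this toy case each user has exactly one unseen movie, so the ranking is trivial; with more unrated movies per user, the first row for each user would be the top recommendation.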

Anticipated Challenges/Data Issues

  • Ensuring missing ratings are properly handled and excluded from mean/bias calculations.

  • Confirming that the RStudio implementation produces results that align with the spreadsheet logic and algorithm.

  • Verifying the resulting predictions against one or more examples worked out in the spreadsheet.
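The first challenge is worth a quick illustration. The sketch below uses a hypothetical vector of one critic's ratings (the values and names are assumptions) to show how the treatment of missing entries changes an average.

```r
# One hypothetical critic: two observed ratings and two unrated movies
ratings_row <- c(4, NA, 2, NA)

mean(ratings_row)                 # NA: by default, missing values propagate
mean(ratings_row, na.rm = TRUE)   # 3: average of the two observed ratings

# Recoding blanks as 0 would treat them as real ratings and drag the mean down
as_zero <- ifelse(is.na(ratings_row), 0, ratings_row)
mean(as_zero)                     # 1.5
```

Only the `na.rm = TRUE` version reflects how the critic actually rated, which is why blanks should stay as NA rather than be filled with a number.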

Optional Endeavor

Once the Global Baseline Estimate recommender system is established, a small function may be developed that takes arguments for a specific user and a specific movie. The function would then use the ratings dataframe and the precomputed baseline components to output the predicted rating for that user-movie pair.

Code Base (Body)

The first step, as with most analytical tasks in RStudio, will call for the loading of the required libraries.

library(readxl)
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'stringr' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Next, we import the data into our working environment. The dataset consists of the critic movie ratings provided by Professor Catlin, read here from a CSV copy hosted on GitHub.

url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/week_3A_assignment/MovieRatings.csv"

raw_ratings <- read.csv(url, stringsAsFactors = FALSE)

glimpse(raw_ratings)
Rows: 16
Columns: 7
$ Critic         <chr> "Burton", "Charley", "Dan", "Dieudonne", "Matt", "Mauri…
$ CaptainAmerica <int> NA, 4, NA, 5, 4, 4, 4, NA, 4, 4, 5, NA, 5, 4, 4, NA
$ Deadpool       <int> NA, 5, 5, 4, NA, NA, 4, NA, 4, 3, 5, NA, 5, NA, 5, NA
$ Frozen         <int> NA, 4, NA, NA, 2, 3, 4, NA, 1, 5, 5, 4, 5, NA, 3, 5
$ JungleBook     <int> 4, 3, NA, NA, NA, 3, 2, NA, NA, 5, 5, 5, 4, NA, 3, 5
$ PitchPerfect2  <int> NA, 2, NA, NA, 2, 4, 2, NA, NA, 2, NA, NA, 4, NA, 3, NA
$ StarWarsForce  <int> 4, 3, 5, 5, 5, NA, 4, 4, 5, 3, 4, 3, 5, 4, NA, NA

Note that the data is still in its initial wide format. While this layout is easier for a person to read, restructuring it into a long format makes grouped calculations (such as per-user and per-movie averages) much simpler.

Specifically, at present, the data is structured such that:

  • The rows contain the critics/users

  • The columns harbor the individual movie titles

  • The cells at the intersection of the two contain the actual ratings (with blanks/NAs representing unrated instances)

For purposes of modeling and calculation, the data must instead be arranged in the following structure: user | movie | rating

long_ratings <- raw_ratings %>%
  pivot_longer(
    cols = -Critic,        #Keeping the Critic column fixed
    names_to = "movie",    #New column for the movie titles 
    values_to = "rating"   #New column for the ratings
  ) %>%
  rename(user = Critic)    #Renaming the Critic column to user

head(long_ratings, 7)
# A tibble: 7 × 3
  user    movie          rating
  <chr>   <chr>           <int>
1 Burton  CaptainAmerica     NA
2 Burton  Deadpool           NA
3 Burton  Frozen             NA
4 Burton  JungleBook          4
5 Burton  PitchPerfect2      NA
6 Burton  StarWarsForce       4
7 Charley CaptainAmerica      4

Now that the data table has been reshaped into a long format, the next step is to ensure that the ratings are stored as numeric values.

long_ratings <- long_ratings %>%
  mutate(rating = as.numeric(rating))
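In this dataset the conversion is a safeguard, since glimpse() already reports the rating columns as integers. It matters more when ratings arrive as text. The sketch below uses a hypothetical character vector (an assumption, not the assignment data) to show how as.numeric() turns unparseable entries into NA, which can then be counted before any averages are computed.

```r
# Hypothetical ratings read in as text, including a blank and a placeholder
raw_text <- c("4", "5", "", "N/A", "3")

# Entries that cannot be parsed become NA (as.numeric() warns about them)
as_num <- suppressWarnings(as.numeric(raw_text))
as_num                # 4 5 NA NA 3

# Count how many entries will be excluded from the means
sum(is.na(as_num))    # 2
```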

Computing Global Baseline Components

Next, we begin the calculations required for the Global Baseline Estimate recommender system. The first of these is the previously mentioned Global Mean (μ). This mean is computed using only the observed ratings (i.e., non-NA values).

mu <- long_ratings %>%
  summarise(global_mean = mean(rating, na.rm = TRUE)) %>%
  pull(global_mean)

mu
[1] 3.934426

With the Global Mean in hand, the next stage is to compute the movie bias (bᵢ).

movie_bias <- long_ratings %>%
  group_by(movie) %>%
  summarise(
    movie_avg = mean(rating, na.rm = TRUE),
    b_i = movie_avg - mu
  )

movie_bias
# A tibble: 6 × 3
  movie          movie_avg     b_i
  <chr>              <dbl>   <dbl>
1 CaptainAmerica      4.27  0.338 
2 Deadpool            4.44  0.510 
3 Frozen              3.73 -0.207 
4 JungleBook          3.9  -0.0344
5 PitchPerfect2       2.71 -1.22  
6 StarWarsForce       4.15  0.219 

The movie bias reflects whether a particular film tends to be rated above or below the overall average rating across all movies.

With the movie bias values computed, the final component is the user bias (bᵤ).

user_bias <- long_ratings %>%
  group_by(user) %>%
  summarise(
    user_avg = mean(rating, na.rm = TRUE),
    b_u = user_avg - mu
  )

user_bias
# A tibble: 16 × 3
   user      user_avg     b_u
   <chr>        <dbl>   <dbl>
 1 Burton        4     0.0656
 2 Charley       3.5  -0.434 
 3 Dan           5     1.07  
 4 Dieudonne     4.67  0.732 
 5 Matt          3.25 -0.684 
 6 Mauricio      3.5  -0.434 
 7 Max           3.33 -0.601 
 8 Nathan        4     0.0656
 9 Param         3.5  -0.434 
10 Parshu        3.67 -0.268 
11 Prashanth     4.8   0.866 
12 Shipra        4     0.0656
13 Sreejaya      4.67  0.732 
14 Steve         4     0.0656
15 Vuthy         3.6  -0.334 
16 Xingjia       5     1.07  

The user bias captures whether a given critic tends to rate more generously or more stringently relative to the global mean.

Making A Prediction

Now that we have all the necessary components for a Global Baseline Estimate recommender system, let us proceed with generating a prediction. Once the prediction is computed within R, we can compare the estimated rating to the corresponding output produced using the spreadsheet algorithm.

Specifically, we will consider the following question:

Which of two movies Matt has not rated, Deadpool or JungleBook, is he predicted to rate more highly?

We begin by computing Matt’s estimated rating for Deadpool.

prediction <- tibble(
  user = "Matt",
  movie = "Deadpool"
) %>%
  left_join(user_bias, by = "user") %>%
  left_join(movie_bias, by = "movie") %>%
  mutate(predicted_rating = round(mu + b_u + b_i, 2))

cat("Matt's predicted rating for Deadpool is:", prediction$predicted_rating)
Matt's predicted rating for Deadpool is: 3.76

Next, we compute his estimated rating for JungleBook.

prediction <- tibble(
  user = "Matt",
  movie = "JungleBook"
) %>%
  left_join(user_bias, by = "user") %>%
  left_join(movie_bias, by = "movie") %>%
  mutate(predicted_rating = round(mu + b_u + b_i, 2))


cat("Matt's predicted rating for JungleBook is:", prediction$predicted_rating)
Matt's predicted rating for JungleBook is: 3.22

Therefore, since Matt’s estimated rating for Deadpool (3.76) exceeds his predicted rating for JungleBook (3.22), the recommendation would be that, of the two, Matt should watch Deadpool.

Moreover, the predictions obtained from the R calculations match the spreadsheet-based implementation of the Global Baseline algorithm for these two cases, which supports the correctness of the R implementation.
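As a further cross-check, the same two numbers can be reproduced by hand from the counts and sums visible in the glimpse() output. This is only a sketch in base R; the variable names are my own, and the per-movie sums were tallied manually from the printed columns.

```r
# Per-movie sums and counts of observed ratings, tallied from glimpse()
# (order: CaptainAmerica, Deadpool, Frozen, JungleBook, PitchPerfect2, StarWarsForce)
total_sum   <- 47 + 40 + 41 + 39 + 19 + 54
total_count <- 11 + 9 + 11 + 10 + 7 + 13
check_mu    <- total_sum / total_count   # 240 / 61, about 3.934

matt_avg       <- (4 + 2 + 2 + 5) / 4    # Matt's four observed ratings
deadpool_avg   <- 40 / 9
junglebook_avg <- 39 / 10

# mu + (user_avg - mu) + (movie_avg - mu) simplifies to user_avg + movie_avg - mu
round(matt_avg + deadpool_avg - check_mu, 2)     # 3.76
round(matt_avg + junglebook_avg - check_mu, 2)   # 3.22
```

The simplification in the last comment is a handy way to sanity-check any single prediction: add the user's average to the movie's average and subtract the global mean.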

Predicting All of the Missing Ratings

Now that we have verified that the Global Baseline calculations are behaving as expected, we can generate predictions for all missing user-movie ratings.

#Calculate the predicted ratings for all user-movie pairings
all_predictions <- long_ratings %>%
  left_join(user_bias, by = "user") %>%
  left_join(movie_bias, by = "movie") %>%
  mutate(predicted_rating = round(mu + b_u + b_i, 2))

#Keep only the user-movie pairs that have no observed rating
recommendations <- all_predictions %>%
  filter(is.na(rating)) %>%
  arrange(user, desc(predicted_rating))

recommendations
# A tibble: 35 × 8
   user      movie     rating user_avg    b_u movie_avg     b_i predicted_rating
   <chr>     <chr>      <dbl>    <dbl>  <dbl>     <dbl>   <dbl>            <dbl>
 1 Burton    Deadpool      NA     4    0.0656      4.44  0.510              4.51
 2 Burton    CaptainA…     NA     4    0.0656      4.27  0.338              4.34
 3 Burton    Frozen        NA     4    0.0656      3.73 -0.207              3.79
 4 Burton    PitchPer…     NA     4    0.0656      2.71 -1.22               2.78
 5 Dan       CaptainA…     NA     5    1.07        4.27  0.338              5.34
 6 Dan       JungleBo…     NA     5    1.07        3.9  -0.0344             4.97
 7 Dan       Frozen        NA     5    1.07        3.73 -0.207              4.79
 8 Dan       PitchPer…     NA     5    1.07        2.71 -1.22               3.78
 9 Dieudonne JungleBo…     NA     4.67 0.732       3.9  -0.0344             4.63
10 Dieudonne Frozen        NA     4.67 0.732       3.73 -0.207              4.46
# ℹ 25 more rows

The table above lists the predicted ratings for all missing user-movie pairs. It is sorted by user and then by predicted rating in descending order, so each critic's most strongly recommended unseen movie appears first within their group of rows.
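Because a baseline prediction is a mean plus two offsets, it can land outside the rating scale, as Dan's 5.34 for CaptainAmerica does. One possible post-processing step, which is not part of the algorithm as implemented here, is to clamp predictions to the scale. The sketch below assumes a 1-to-5 scale (the observed ratings range from 1 to 5, but the scale itself is my assumption) and copies three values from the table.

```r
# Three predicted ratings copied from the recommendations table
preds <- c(Burton_Deadpool = 4.51,
           Dan_CaptainAmerica = 5.34,
           Dan_PitchPerfect2 = 3.78)

# Clamp to the assumed 1-to-5 rating scale
clipped <- pmin(pmax(preds, 1), 5)
clipped   # 4.51 5.00 3.78
```

Clamping does not change the ranking unless several predictions for the same user exceed the bound, in which case they become tied.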

Optional Endeavor: Creating a Predictor Function

Rather than scanning the full table, a user of the recommender system may prefer to query a specific user-movie pairing and receive an estimated rating directly.

To address that, a reusable function may be created.

predict_rating_1 <- function(desired_user, desired_movie){
  
  user_b <- user_bias %>% 
    filter(user == desired_user) %>%
    pull(b_u)
  
  
  movie_b <- movie_bias %>% 
    filter(movie == desired_movie) %>%
    pull(b_i)
   
  if(length(user_b) == 0 || length(movie_b) == 0){
    return("User or movie not found.")
  }
  

  round(mu + user_b + movie_b, 2)
  
}

We can test the function using the same example of Matt and his two unrated movies.

cat("Matt's predicted rating for Deadpool is:", predict_rating_1("Matt","Deadpool"))
Matt's predicted rating for Deadpool is: 3.76
cat("\nMatt's predicted rating for JungleBook is:", predict_rating_1("Matt","JungleBook"))

Matt's predicted rating for JungleBook is: 3.22

Now, we can also simulate a case where the user or movie name is incorrect.

predict_rating_1("Mat","Joonglebook")
[1] "User or movie not found."

At this point, the function operates as desired. However, one thing still remains. If a movie has already been rated by a user, the function should return the actual observed rating, rather than the baseline prediction. So, the function will be adjusted accordingly.

predict_rating_2 <- function(desired_user, desired_movie){

  # Step 1: Check if rating already exists
  existing_rating <- long_ratings %>%
    filter(user == desired_user,
           movie == desired_movie,
           !is.na(rating)) %>%
    pull(rating)

  # If user already rated movie
  if(length(existing_rating) > 0){
    return(paste0(
      desired_user, 
      " has viewed this movie before. A rating of ",
      existing_rating,
      " was given."
    ))
  }

  # Step 2: Otherwise compute baseline prediction
  user_b <- user_bias %>% 
    filter(user == desired_user) %>% 
    pull(b_u)

  movie_b <- movie_bias %>% 
    filter(movie == desired_movie) %>% 
    pull(b_i)
  
  if(length(user_b) == 0 || length(movie_b) == 0){
    return("User or movie not found.")
  }

  predicted <- round(mu + user_b + movie_b, 2)

  return(paste0(
    "Predicted rating for ",
    desired_user,
    " on ",
    desired_movie,
    " is ",
    predicted
  ))
}

Now we can test the three use cases:

1. Predicting an unrated movie:

predict_rating_2("Vuthy", "StarWarsForce")
[1] "Predicted rating for Vuthy on StarWarsForce is 3.82"

2. Querying a movie the user already rated:

predict_rating_2("Vuthy", "CaptainAmerica")
[1] "Vuthy has viewed this movie before. A rating of 4 was given."

3. Incorrect user/movie input:

predict_rating_2("Vooothy", "CaptainAmerica")
[1] "User or movie not found."
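One design trade-off is that predict_rating_2 returns message strings, which read well but cannot be ranked or averaged directly. A possible alternative, sketched below, keeps the return value numeric and signals an unknown name with NA plus a warning. The `demo_` lookup values and the `predict_numeric` name are hypothetical stand-ins so the example runs on its own; they are not the objects computed earlier.

```r
# Hypothetical lookup values standing in for mu, user_bias, and movie_bias
demo_mu         <- 3.5
demo_user_bias  <- c(A = 0.5, B = -0.5)
demo_movie_bias <- c(X = 1.0, Z = -1.0)

predict_numeric <- function(desired_user, desired_movie) {
  # Unknown names yield NA_real_ (with a warning) instead of a message string
  if (!(desired_user %in% names(demo_user_bias)) ||
      !(desired_movie %in% names(demo_movie_bias))) {
    warning("User or movie not found.")
    return(NA_real_)
  }
  round(demo_mu + demo_user_bias[[desired_user]] +
          demo_movie_bias[[desired_movie]], 2)
}

predict_numeric("A", "Z")   # 3
```

A numeric return value can be passed straight into sorting or summary functions, while the string version is better suited to displaying a single answer.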

Conclusion

In completing this assignment, it became clear that the Global Baseline Estimate provides a structured yet straightforward way to generate rating predictions. By combining the global mean with both the user and movie bias terms, the model accounts for overall rating tendencies whilst also adjusting for systematic differences across individual users and films. Although this approach does not incorporate similarity-based personalization like collaborative filtering, it still offers a transparent and computationally efficient foundation for recommendation work. Overall, this implementation highlights how even relatively simple algorithms can produce meaningful and logically grounded predictions when their components are carefully calculated and applied.

LLM Used

  • OpenAI. (2026). ChatGPT (Version 4o) [Large language model]. https://chat.openai.com. Accessed February 14, 2026.