Global Baseline Estimate

Author

Madina Kudanova

Introduction

The goal of this assignment is to implement a Global Baseline Estimate recommendation system in R using the movie rating data collected earlier in the course. The Global Baseline Estimate is a non-personalized recommender approach that predicts ratings based on the overall average rating, along with systematic deviations associated with individual users and individual movies.

Using the provided survey dataset (Movie Ratings XLSX), I will construct a reproducible workflow that calculates these components and generates predicted ratings. This implementation serves as a foundational benchmark model that can be compared to more advanced personalized recommendation techniques in future work.

Approach

To implement the Global Baseline Estimate recommender system, I will begin by preparing the movie ratings dataset in a tidy, structured format suitable for analysis in R. If necessary, I will reshape the data so that each row represents a single user–movie rating pair. This ensures the dataset follows a consistent rectangular structure and supports grouped calculations.

Next, I will compute the overall mean rating across all observations. This value serves as the global baseline estimate. From there, I will calculate user-specific bias terms (how each user tends to rate relative to the global average) and movie-specific bias terms (how each movie tends to deviate from the global average). These components will be combined to generate predicted ratings for each user–movie pair.
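
Concretely, the prediction for user u and movie i follows the spreadsheet formula:

prediction(u, i) = global_avg + user_bias(u) + movie_bias(i)

where user_bias(u) = user_avg(u) - global_avg and movie_bias(i) = movie_avg(i) - global_avg.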

After generating predicted values, I will review the results to confirm that predictions are logically consistent and reflect the structure of the algorithm described in the provided implementation spreadsheet. I will also validate that missing ratings are handled appropriately and that calculations are reproducible.
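
In code, these validation checks could take the following form once the Code Base computations below have run (a sketch, not part of the spreadsheet):

# Sanity checks: no missing bias terms, and every user-movie pair
# receives a prediction.
stopifnot(!any(is.na(user_bias$user_bias)))
stopifnot(!any(is.na(movie_bias$movie_bias)))
stopifnot(!any(is.na(predictions$prediction)))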

Anticipated Challenges

The dataset may contain missing ratings that must be handled carefully.

The data may need to be reshaped from its wide survey layout into a long, tidy format before analysis.

Users with very few ratings may produce unstable bias estimates (a damped variant of the bias calculation is sketched below).

Bias calculations must be performed correctly, without data leakage or double counting.

I will address these challenges through careful data inspection, validation checks, and step-by-step calculation of each model component.
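
One standard mitigation for the unstable-bias challenge is to damp each user bias toward zero, so that critics with few ratings contribute weaker bias terms. The sketch below assumes the ratings_long table and global_avg value constructed in the Code Base section; lambda is a hypothetical damping constant, not part of the assignment spreadsheet.

# Sketch: damped (regularized) user bias; shrinks toward zero when a
# critic has few ratings. lambda is a hypothetical damping constant.
lambda <- 5
user_bias_damped <- ratings_long %>%
  filter(!is.na(rating)) %>%
  group_by(Critic) %>%
  summarise(user_bias = sum(rating - global_avg) / (lambda + n()),
            .groups = "drop")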

Code Base

# Reading and loading the Excel dataset
library(readxl)
library(tidyverse)
movie_ratings <- read_excel("MovieRatings.xlsx")
glimpse(movie_ratings)
Rows: 16
Columns: 7
$ Critic         <chr> "Burton", "Charley", "Dan", "Dieudonne", "Matt", "Mauri…
$ CaptainAmerica <dbl> NA, 4, NA, 5, 4, 4, 4, NA, 4, 4, 5, NA, 5, 4, 4, NA
$ Deadpool       <dbl> NA, 5, 5, 4, NA, NA, 4, NA, 4, 3, 5, NA, 5, NA, 5, NA
$ Frozen         <dbl> NA, 4, NA, NA, 2, 3, 4, NA, 1, 5, 5, 4, 5, NA, 3, 5
$ JungleBook     <dbl> 4, 3, NA, NA, NA, 3, 2, NA, NA, 5, 5, 5, 4, NA, 3, 5
$ PitchPerfect2  <dbl> NA, 2, NA, NA, 2, 4, 2, NA, NA, 2, NA, NA, 4, NA, 3, NA
$ StarWarsForce  <dbl> 4, 3, 5, 5, 5, NA, 4, 4, 5, 3, 4, 3, 5, 4, NA, NA
# Reshaping dataset to long format so each row represents one user–movie rating pair
ratings_long <- movie_ratings %>%
  pivot_longer(
    cols = -Critic,
    names_to = "movie",
    values_to = "rating"
  )
glimpse(ratings_long)
Rows: 96
Columns: 3
$ Critic <chr> "Burton", "Burton", "Burton", "Burton", "Burton", "Burton", "Ch…
$ movie  <chr> "CaptainAmerica", "Deadpool", "Frozen", "JungleBook", "PitchPer…
$ rating <dbl> NA, NA, NA, 4, NA, 4, 4, 5, 4, 3, 2, 3, NA, 5, NA, NA, NA, 5, 5…
# Calculating global average rating
global_avg <- mean(ratings_long$rating, na.rm = TRUE)

global_avg
[1] 3.934426
# Calculating average rating for each movie
movie_avg <- ratings_long %>%
  group_by(movie) %>%
  summarise(movie_avg = mean(rating, na.rm = TRUE), .groups = "drop")

movie_avg
# A tibble: 6 × 2
  movie          movie_avg
  <chr>              <dbl>
1 CaptainAmerica      4.27
2 Deadpool            4.44
3 Frozen              3.73
4 JungleBook          3.9 
5 PitchPerfect2       2.71
6 StarWarsForce       4.15
# Calculating average rating for each user
user_avg <- ratings_long %>%
  group_by(Critic) %>%
  summarise(user_avg = mean(rating, na.rm = TRUE), .groups = "drop")

user_avg
# A tibble: 16 × 2
   Critic    user_avg
   <chr>        <dbl>
 1 Burton        4   
 2 Charley       3.5 
 3 Dan           5   
 4 Dieudonne     4.67
 5 Matt          3.25
 6 Mauricio      3.5 
 7 Max           3.33
 8 Nathan        4   
 9 Param         3.5 
10 Parshu        3.67
11 Prashanth     4.8 
12 Shipra        4   
13 Sreejaya      4.67
14 Steve         4   
15 Vuthy         3.6 
16 Xingjia       5   
# Calculating movie bias as deviation from global average
# (movie_avg - global_avg) — matches spreadsheet row "movie avg - mean movie"
movie_bias <- movie_avg %>%
  mutate(movie_bias = movie_avg - global_avg)
movie_bias
# A tibble: 6 × 3
  movie          movie_avg movie_bias
  <chr>              <dbl>      <dbl>
1 CaptainAmerica      4.27     0.338 
2 Deadpool            4.44     0.510 
3 Frozen              3.73    -0.207 
4 JungleBook          3.9     -0.0344
5 PitchPerfect2       2.71    -1.22  
6 StarWarsForce       4.15     0.219 
# Calculating user bias as deviation from global average
# (user_avg - global_avg) — matches spreadsheet column "user avg - mean movie"
user_bias <- user_avg %>%
  mutate(user_bias = user_avg - global_avg)
user_bias
# A tibble: 16 × 3
   Critic    user_avg user_bias
   <chr>        <dbl>     <dbl>
 1 Burton        4       0.0656
 2 Charley       3.5    -0.434 
 3 Dan           5       1.07  
 4 Dieudonne     4.67    0.732 
 5 Matt          3.25   -0.684 
 6 Mauricio      3.5    -0.434 
 7 Max           3.33   -0.601 
 8 Nathan        4       0.0656
 9 Param         3.5    -0.434 
10 Parshu        3.67   -0.268 
11 Prashanth     4.8     0.866 
12 Shipra        4       0.0656
13 Sreejaya      4.67    0.732 
14 Steve         4       0.0656
15 Vuthy         3.6    -0.334 
16 Xingjia       5       1.07  
# Calculating predicted rating using spreadsheet formula:
# prediction = global_avg + movie_bias + user_bias
# Join biases back to full ratings table
predictions <- ratings_long %>%
  left_join(movie_bias %>% select(movie, movie_bias), by = "movie") %>%
  left_join(user_bias %>% select(Critic, user_bias), by = "Critic") %>%
  mutate(
    prediction = global_avg + movie_bias + user_bias
  )

predictions
# A tibble: 96 × 6
   Critic  movie          rating movie_bias user_bias prediction
   <chr>   <chr>           <dbl>      <dbl>     <dbl>      <dbl>
 1 Burton  CaptainAmerica     NA     0.338     0.0656       4.34
 2 Burton  Deadpool           NA     0.510     0.0656       4.51
 3 Burton  Frozen             NA    -0.207     0.0656       3.79
 4 Burton  JungleBook          4    -0.0344    0.0656       3.97
 5 Burton  PitchPerfect2      NA    -1.22      0.0656       2.78
 6 Burton  StarWarsForce       4     0.219     0.0656       4.22
 7 Charley CaptainAmerica      4     0.338    -0.434        3.84
 8 Charley Deadpool            5     0.510    -0.434        4.01
 9 Charley Frozen              4    -0.207    -0.434        3.29
10 Charley JungleBook          3    -0.0344   -0.434        3.47
# ℹ 86 more rows
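# As a quick fit check (a sketch, not part of the original spreadsheet
# workflow): in-sample RMSE of the baseline over the observed ratings.
# This measures fit on the training data, not held-out accuracy.
predictions %>%
  filter(!is.na(rating)) %>%
  summarise(rmse = sqrt(mean((rating - prediction)^2)))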
# Recommending the single highest-predicted unrated movie for each user
recommendations_top1 <- predictions %>%
  filter(is.na(rating)) %>%
  group_by(Critic) %>%
  slice_max(prediction, n = 1, with_ties = FALSE) %>%
  arrange(Critic)

recommendations_top1
# A tibble: 12 × 6
# Groups:   Critic [12]
   Critic    movie          rating movie_bias user_bias prediction
   <chr>     <chr>           <dbl>      <dbl>     <dbl>      <dbl>
 1 Burton    Deadpool           NA     0.510     0.0656       4.51
 2 Dan       CaptainAmerica     NA     0.338     1.07         5.34
 3 Dieudonne JungleBook         NA    -0.0344    0.732        4.63
 4 Matt      Deadpool           NA     0.510    -0.684        3.76
 5 Mauricio  Deadpool           NA     0.510    -0.434        4.01
 6 Nathan    Deadpool           NA     0.510     0.0656       4.51
 7 Param     JungleBook         NA    -0.0344   -0.434        3.47
 8 Prashanth PitchPerfect2      NA    -1.22      0.866        3.58
 9 Shipra    Deadpool           NA     0.510     0.0656       4.51
10 Steve     Deadpool           NA     0.510     0.0656       4.51
11 Vuthy     StarWarsForce      NA     0.219    -0.334        3.82
12 Xingjia   Deadpool           NA     0.510     1.07         5.51
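# Note: raw baseline predictions can exceed the 1-5 survey scale
# (e.g., 5.51 for Xingjia and Deadpool above). A minimal sketch of
# clamping, if bounded predictions are preferred (optional, not in
# the spreadsheet):
predictions_clamped <- predictions %>%
  mutate(prediction = pmin(pmax(prediction, 1), 5))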

Conclusion

In this analysis, I implemented a Global Baseline Estimate recommender system using the collected movie ratings dataset. The model predicts missing ratings by combining the overall average rating with user-specific and movie-specific deviations, following the implementation structure provided in the spreadsheet. Based on these predicted values, a top recommendation was generated for each of the 12 users who had at least one unrated movie. Although this approach does not use advanced personalized algorithms, it provides a clear and interpretable baseline that can serve as a benchmark for more sophisticated recommendation systems.