Week 3A - Global Baseline Estimate

Author

Sinem K Moschos

Week 3A - Approach

Global Baseline Estimate Recommendation System

Introduction

For this assignment, I needed to build a very basic recommendation system using movie ratings. Instead of making personalized recommendations, I used something called the Global Baseline Estimate. This method does not try to guess what a specific user likes. It just looks at overall rating patterns and uses simple averages to recommend movies.

What data I used

I used the movie ratings data we collected earlier. The dataset includes:
• User IDs
• Movie names
• Ratings given by users

Each row shows how one user rated one movie.

Steps to Approach Problem

I followed the algorithm provided in the spreadsheet and broke the problem into small steps so it would be easier to understand.

First, I calculated the global average rating. This is just the average of all ratings in the dataset. It represents the starting point for every prediction.

Next, I looked at user behavior. Some users usually give high ratings and some are more strict. So I calculated a user bias, which shows how much each user’s average rating is above or below the global average.

Then, I looked at the movies themselves. Some movies are generally liked more than others. So I calculated a movie bias, which shows how much each movie’s average rating is above or below the global average.

How the Global Baseline Estimate works

After calculating the global mean, user bias, and movie bias, I combined them using this idea:

The predicted rating equals:
• the global average rating
• plus the user’s bias
• plus the movie’s bias

This gives a baseline prediction for how a movie should be rated, without using any personalized recommendation logic.
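For example, using values computed later in this report (a global mean of about 3.93, Charley’s user bias of about -0.43, and Deadpool’s movie bias of about +0.51), the baseline prediction for Charley’s rating of Deadpool works out to roughly:

3.934426 + (-0.434) + 0.510
# about 4.01: Charley rates a bit lower than average,
# but Deadpool is rated higher than average overall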

Movie Recommendations

Once I had baseline predictions, I averaged the predicted scores for each movie. Then I sorted the movies from highest to lowest predicted rating.

The movies with the highest predicted ratings are the recommended movies according to the Global Baseline Estimate method.

Why This Approach Makes Sense

This approach is useful because:
• It is simple and easy to understand
• It works even with limited data
• It provides a strong baseline to compare with more advanced recommendation systems later

It helped me understand how recommendation systems can work even without personalization.

Step 1: Load the libraries

First, I loaded the libraries I needed. I used tidyverse for data manipulation and readxl to read the Excel file.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)

Step 2: Load the movie ratings data

Next, I loaded the movie ratings spreadsheet into R. This file contains the user IDs, movie names, and ratings.

ratings_raw <- read_excel("MovieRatings.xlsx")
glimpse(ratings_raw)
Rows: 16
Columns: 7
$ Critic         <chr> "Burton", "Charley", "Dan", "Dieudonne", "Matt", "Mauri…
$ CaptainAmerica <dbl> NA, 4, NA, 5, 4, 4, 4, NA, 4, 4, 5, NA, 5, 4, 4, NA
$ Deadpool       <dbl> NA, 5, 5, 4, NA, NA, 4, NA, 4, 3, 5, NA, 5, NA, 5, NA
$ Frozen         <dbl> NA, 4, NA, NA, 2, 3, 4, NA, 1, 5, 5, 4, 5, NA, 3, 5
$ JungleBook     <dbl> 4, 3, NA, NA, NA, 3, 2, NA, NA, 5, 5, 5, 4, NA, 3, 5
$ PitchPerfect2  <dbl> NA, 2, NA, NA, 2, 4, 2, NA, NA, 2, NA, NA, 4, NA, 3, NA
$ StarWarsForce  <dbl> 4, 3, 5, 5, 5, NA, 4, 4, 5, 3, 4, 3, 5, 4, NA, NA

After loading the data, I reshaped it from wide to long format, dropped the missing ratings, and standardized the column names.

ratings <- ratings_raw %>%
  pivot_longer(
    cols = -Critic,
    names_to = "movie",
    values_to = "rating"
  ) %>%
  filter(!is.na(rating)) %>%
  rename(user_id = Critic)
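As an optional sanity check (not required by the assignment), the reshaped data can be inspected to confirm there is one row per observed rating; with this dataset that should match the 61 ratings that appear again in Step 6.

nrow(ratings)    # 61 observed ratings after dropping the NAs
head(ratings)    # columns: user_id, movie, rating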

Step 3: Calculate the global average rating

Now I calculated the global mean rating. This is the average of all ratings in the dataset. This value is important because it is the starting point for the Global Baseline Estimate.

global_mean <- mean(ratings$rating)
global_mean
[1] 3.934426

Step 4: Calculate user bias

In this step, I looked at how each user rates movies on average. Some users usually give higher ratings, and some give lower ratings. To capture this, I calculated the user bias. First, I grouped the data by user ID and calculated each user’s average rating. Then I subtracted the global mean from each user’s average.

user_bias <- ratings %>%
  group_by(user_id) %>%
  summarise(
    user_avg = mean(rating),
    .groups = "drop"
  ) %>%
  mutate(user_bias = user_avg - global_mean)

user_bias
# A tibble: 16 × 3
   user_id   user_avg user_bias
   <chr>        <dbl>     <dbl>
 1 Burton        4       0.0656
 2 Charley       3.5    -0.434 
 3 Dan           5       1.07  
 4 Dieudonne     4.67    0.732 
 5 Matt          3.25   -0.684 
 6 Mauricio      3.5    -0.434 
 7 Max           3.33   -0.601 
 8 Nathan        4       0.0656
 9 Param         3.5    -0.434 
10 Parshu        3.67   -0.268 
11 Prashanth     4.8     0.866 
12 Shipra        4       0.0656
13 Sreejaya      4.67    0.732 
14 Steve         4       0.0656
15 Vuthy         3.6    -0.334 
16 Xingjia       5       1.07  
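As a quick hand check (optional), one user’s bias can be recomputed directly. Dan’s average rating is 5, so his bias should be 5 minus the global mean, which matches the 1.07 shown above:

ratings %>%
  filter(user_id == "Dan") %>%
  summarise(dan_bias = mean(rating) - global_mean)   # about 1.07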

Step 5: Calculate movie bias

Next, I did the same thing for movies. Some movies are generally liked more than others. So I calculated the movie bias by finding each movie’s average rating and comparing it to the global mean.

movie_bias <- ratings %>%
  group_by(movie) %>%
  summarise(movie_avg = mean(rating)) %>%
  mutate(movie_bias = movie_avg - global_mean)

movie_bias
# A tibble: 6 × 3
  movie          movie_avg movie_bias
  <chr>              <dbl>      <dbl>
1 CaptainAmerica      4.27     0.338 
2 Deadpool            4.44     0.510 
3 Frozen              3.73    -0.207 
4 JungleBook          3.9     -0.0344
5 PitchPerfect2       2.71    -1.22  
6 StarWarsForce       4.15     0.219 

Step 6: Calculate the Global Baseline Estimate

Now I combined everything. I joined the user bias and movie bias back into the original ratings data. Then I applied the Global Baseline Estimate formula.

baseline_estimates <- ratings %>%
  left_join(user_bias, by = "user_id") %>%
  left_join(movie_bias, by = "movie") %>%
  mutate(
    baseline_prediction = global_mean + user_bias + movie_bias
  )

baseline_estimates
# A tibble: 61 × 8
   user_id movie          rating user_avg user_bias movie_avg movie_bias
   <chr>   <chr>           <dbl>    <dbl>     <dbl>     <dbl>      <dbl>
 1 Burton  JungleBook          4      4      0.0656      3.9     -0.0344
 2 Burton  StarWarsForce       4      4      0.0656      4.15     0.219 
 3 Charley CaptainAmerica      4      3.5   -0.434       4.27     0.338 
 4 Charley Deadpool            5      3.5   -0.434       4.44     0.510 
 5 Charley Frozen              4      3.5   -0.434       3.73    -0.207 
 6 Charley JungleBook          3      3.5   -0.434       3.9     -0.0344
 7 Charley PitchPerfect2       2      3.5   -0.434       2.71    -1.22  
 8 Charley StarWarsForce       3      3.5   -0.434       4.15     0.219 
 9 Dan     Deadpool            5      5      1.07        4.44     0.510 
10 Dan     StarWarsForce       5      5      1.07        4.15     0.219 
# ℹ 51 more rows
# ℹ 1 more variable: baseline_prediction <dbl>
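Since one purpose of the Global Baseline Estimate is to serve as a comparison point for more advanced recommenders later, an optional check (not part of the spreadsheet algorithm) is to measure how far the baseline predictions fall from the actual ratings, for example with RMSE:

baseline_estimates %>%
  summarise(rmse = sqrt(mean((rating - baseline_prediction)^2)))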

Step 7: Create movie recommendations

To recommend movies, I averaged the baseline predictions for each movie. Then I sorted the movies from highest to lowest predicted rating.

movie_recommendations <- baseline_estimates %>%
  group_by(movie) %>%
  summarise(
    predicted_rating = mean(baseline_prediction)
  ) %>%
  arrange(desc(predicted_rating))

movie_recommendations
# A tibble: 6 × 2
  movie          predicted_rating
  <chr>                     <dbl>
1 Deadpool                   4.59
2 StarWarsForce              4.25
3 CaptainAmerica             4.20
4 JungleBook                 3.97
5 Frozen                     3.69
6 PitchPerfect2              2.43
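The ranking above is a global one. The same formula can also be used to predict ratings for the movies a specific user has not rated yet, which is how the Global Baseline Estimate is usually turned into per-user recommendations. Below is a minimal sketch, using Burton as an example user from the data:

example_user <- "Burton"

# movies this user has already rated
seen <- ratings %>%
  filter(user_id == example_user) %>%
  pull(movie)

# this user's bias from the table in Step 4
u_b <- user_bias %>%
  filter(user_id == example_user) %>%
  pull(user_bias)

# predict ratings for the unseen movies and rank them
movie_bias %>%
  filter(!movie %in% seen) %>%
  mutate(predicted_rating = global_mean + u_b + movie_bias) %>%
  arrange(desc(predicted_rating))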