This system recommends movies to users of a streaming platform using collaborative filtering. We use the MovieLens dataset of user ratings on movies to build a simple recommender based on global-average and bias-adjusted baseline predictors.
The data comprise 4,185,688 ratings with columns userId, movieId, rating, and a timestamp (tstamp). Ratings are given in half-star increments up to 5 and are sparse across users and items.
The dataset was loaded into R and split into training and test sets, and a small subset was used to verify calculations by hand. All analyses were performed with the tidyverse.
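The full loading and splitting code lives in the repository; below is a minimal sketch, assuming the ratings sit in a local ratings.csv (hypothetical path) and using a random 80/20 split.

library(tidyverse)
set.seed(42)  # any fixed seed, for reproducibility

# Read the ratings file (hypothetical path; point this at your copy of the data)
ratings <- read_csv("ratings.csv",
                    col_types = cols(userId = col_double(),
                                     movieId = col_double(),
                                     rating = col_double(),
                                     tstamp = col_datetime()))

# Randomly assign roughly 80% of rows to train and the rest to test
ratings <- ratings %>%
  mutate(split = if_else(runif(n()) < 0.8, "train", "test"))

train <- ratings %>% filter(split == "train")
test  <- ratings %>% filter(split == "test")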
## Rows: 4185688 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): userId, movieId, rating
## dttm (1): tstamp
A preview of the loaded ratings:
## # A tibble: 6 × 4
## userId movieId rating tstamp
## <dbl> <dbl> <dbl> <dttm>
## 1 393217 1 3.5 2023-01-25 19:45:46
## 2 393217 6 4 2023-02-07 21:17:19
## 3 393217 16 3.5 2023-01-25 19:50:40
## 4 393217 17 4.5 2023-01-25 19:49:45
## 5 393217 32 3 2023-01-25 19:46:01
## 6 393217 47 4 2023-01-25 15:44:44
After assigning each row to a split:
## # A tibble: 6 × 5
## userId movieId rating tstamp split
## <dbl> <dbl> <dbl> <dttm> <chr>
## 1 393217 1 3.5 2023-01-25 19:45:46 train
## 2 393217 6 4 2023-02-07 21:17:19 test
## 3 393217 16 3.5 2023-01-25 19:50:40 train
## 4 393217 17 4.5 2023-01-25 19:49:45 train
## 5 393217 32 3 2023-01-25 19:46:01 test
## 6 393217 47 4 2023-01-25 15:44:44 train
The first rows of the training set:
## # A tibble: 6 × 5
## userId movieId rating tstamp split
## <dbl> <dbl> <dbl> <dttm> <chr>
## 1 393217 1 3.5 2023-01-25 19:45:46 train
## 2 393217 16 3.5 2023-01-25 19:50:40 train
## 3 393217 17 4.5 2023-01-25 19:49:45 train
## 4 393217 47 4 2023-01-25 15:44:44 train
## 5 393217 111 4 2023-01-25 17:14:10 train
## 6 393217 260 3 2023-01-25 17:10:01 train
And the first rows of the test set:
## # A tibble: 6 × 5
## userId movieId rating tstamp split
## <dbl> <dbl> <dbl> <dttm> <chr>
## 1 393217 6 4 2023-02-07 21:17:19 test
## 2 393217 32 3 2023-01-25 19:46:01 test
## 3 393217 50 4.5 2023-01-25 15:38:58 test
## 4 393217 377 3.5 2023-01-25 19:46:12 test
## 5 393217 541 4 2023-01-25 15:42:52 test
## 6 393217 778 2.5 2023-01-25 19:47:24 test
The global average rating from the training data is:
## [1] 2.906781
Using this as a predictor for all unknown ratings, we calculated the RMSE on the test set:
## [1] "Global Average RMSE: 1.7601"
We calculated user and item biases as deviations from the global average, merged them with the test set, and computed the baseline predictor as
\[ \hat{r}_{ui} = \mu + b_u + b_i \]
where \(\mu\) is the global training average, \(b_u\) is the bias of user \(u\), and \(b_i\) is the bias of item \(i\).
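A minimal sketch of how these biases can be computed (simple unregularized means; the repository code may differ in details):

# User bias: average deviation of a user's ratings from the global mean
user_biases <- train %>%
  group_by(userId) %>%
  summarise(user_bias = mean(rating - global_avg))

# Item bias: average deviation of a movie's ratings from the global mean
item_biases <- train %>%
  group_by(movieId) %>%
  summarise(item_bias = mean(rating - global_avg))

# Attach biases to the test set; users/movies unseen in training get bias 0
test <- test %>%
  left_join(user_biases, by = "userId") %>%
  left_join(item_biases, by = "movieId") %>%
  mutate(user_bias = coalesce(user_bias, 0),
         item_bias = coalesce(item_bias, 0),
         pred_baseline = global_avg + user_bias + item_bias)

rmse(test$rating, test$pred_baseline)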
The first few user biases:
## # A tibble: 6 × 2
## userId user_bias
## <dbl> <dbl>
## 1 1892 0.152
## 2 3114 0.387
## 3 12559 0.330
## 4 15893 0.260
## 5 22005 0.608
## 6 41965 0.755
The first few item biases:
## # A tibble: 6 × 2
## movieId item_bias
## <dbl> <dbl>
## 1 1 0.759
## 2 2 0.409
## 3 3 -0.355
## 4 4 -1.09
## 5 5 -0.455
## 6 6 0.947
The test set with predictions attached:
## # A tibble: 6 × 9
## userId movieId rating tstamp split pred_global user_bias
## <dbl> <dbl> <dbl> <dttm> <chr> <dbl> <dbl>
## 1 393217 6 4 2023-02-07 21:17:19 test 2.91 0.917
## 2 393217 32 3 2023-01-25 19:46:01 test 2.91 0.917
## 3 393217 50 4.5 2023-01-25 15:38:58 test 2.91 0.917
## 4 393217 377 3.5 2023-01-25 19:46:12 test 2.91 0.917
## 5 393217 541 4 2023-01-25 15:42:52 test 2.91 0.917
## 6 393217 778 2.5 2023-01-25 19:47:24 test 2.91 0.917
## # ℹ 2 more variables: item_bias <dbl>, pred_baseline <dbl>
After applying this model:
## [1] "Baseline Predictor RMSE: 1.3131"
This is a substantial improvement over the global-average model (RMSE drops from 1.7601 to 1.3131).
# Compute residuals (actual minus predicted)
test <- test %>% mutate(residual = rating - pred_baseline)

# Show the ten predictions with the largest absolute error
test %>% arrange(desc(abs(residual))) %>% head(10)
## # A tibble: 10 × 10
## userId movieId rating tstamp split pred_global user_bias
## <dbl> <dbl> <dbl> <dttm> <chr> <dbl> <dbl>
## 1 327881 109374 -1 2019-07-30 18:15:22 test 2.91 1.48
## 2 393447 5618 -1 2023-01-31 09:48:55 test 2.91 1.34
## 3 328049 3535 -1 2018-10-08 18:45:35 test 2.91 1.53
## 4 394210 48780 -1 2023-03-03 22:26:53 test 2.91 1.13
## 5 394403 195159 -1 2023-03-03 14:51:34 test 2.91 1.41
## 6 394404 281096 -1 2023-03-25 22:53:06 test 2.91 1.88
## 7 395334 527 -1 2023-03-25 20:40:54 test 2.91 1.16
## 8 395470 3949 -1 2023-04-02 11:12:10 test 2.91 1.46
## 9 396300 68954 -1 2023-04-22 23:57:26 test 2.91 1.31
## 10 396377 5618 -1 2023-04-24 15:42:17 test 2.91 1.33
## # ℹ 3 more variables: item_bias <dbl>, pred_baseline <dbl>, residual <dbl>
# Plot residuals against predictions to check for systematic bias
ggplot(test, aes(x = pred_baseline, y = residual)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residual Plot for Baseline Predictor",
       x = "Predicted Rating",
       y = "Residual (Actual - Predicted)")
Examining the residuals makes the RMSE concrete: it shows which specific predictions are furthest off and whether the model systematically over- or under-predicts for certain users or movies. Notably, every row in the top-ten table above has a recorded rating of -1, which lies outside the valid rating scale; these look like sentinel or placeholder values that should be filtered out before training and evaluation. Beyond such data-quality issues, large residuals point to concrete ways to improve the recommender: adjusting the bias calculations, shrinking them with regularization (see the sketch below), or moving to models that capture user-movie interactions directly. In this way, RMSE is not just a number but a diagnostic tool that guides refinement.
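Regularized (shrunken) biases are a common refinement; a minimal sketch, where lambda is a tuning constant chosen by validation (the value 5 below is an arbitrary illustration, not a tuned result):

lambda <- 5  # shrinkage strength; tune on held-out data

# Shrink each bias toward 0; users and movies with few ratings shrink the most
user_biases_reg <- train %>%
  group_by(userId) %>%
  summarise(user_bias = sum(rating - global_avg) / (n() + lambda))

item_biases_reg <- train %>%
  group_by(movieId) %>%
  summarise(item_bias = sum(rating - global_avg) / (n() + lambda))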
Summary of model performance:
## # A tibble: 2 × 2
## Model RMSE
## <chr> <dbl>
## 1 Global Average 1.76
## 2 Baseline Predictor 1.31
The baseline predictor substantially reduces error by accounting for individual user and item biases, which supports the value of incorporating even basic personalization into recommender systems.
This project built a baseline recommender system that predicts ratings using a simple model: the global average rating plus user bias (how much a user’s ratings typically differ from the global average) and item bias (how much a movie’s average rating differs from the global average). This core idea captures systematic tendencies — for example, some users rate higher on average and some movies are generally liked more.
RMSE (root mean squared error) measures how far, on average, predicted ratings deviate from true ratings, so a lower RMSE means better predictions. Plotting the residuals shows which predictions have the largest errors; in this project they flag both out-of-range ratings that should be cleaned and extreme ratings the additive model cannot reach. These insights suggest concrete improvements: regularize the biases to avoid overfitting users and movies with few ratings, and move beyond simple additive effects with methods such as matrix factorization or neighborhood-based collaborative filtering (a minimal matrix-factorization sketch follows). This baseline is a solid start that demonstrates how personalization immediately improves on the global average, while the residuals and RMSE highlight clear next steps toward more reliable recommendations.
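As an illustration of that next step, here is a minimal stochastic-gradient-descent matrix factorization fit to the residuals of the baseline model, building on the earlier sketches (rmse, global_avg, user_biases, item_biases). The factor count, learning rate, regularization strength, and epoch count are arbitrary illustrative choices, and the plain R loop favors readability over speed:

set.seed(42)
k      <- 10    # number of latent factors (illustrative)
lr     <- 0.01  # learning rate
reg    <- 0.05  # L2 regularization strength
epochs <- 5

# Map user and movie ids to consecutive row indices for the factor matrices
uid <- match(train$userId,  unique(train$userId))
mid <- match(train$movieId, unique(train$movieId))
P <- matrix(rnorm(max(uid) * k, sd = 0.1), ncol = k)  # user factors
Q <- matrix(rnorm(max(mid) * k, sd = 0.1), ncol = k)  # movie factors

# Residuals left over after the baseline predictor on the training set
resid_train <- train$rating - (global_avg +
  user_biases$user_bias[match(train$userId,  user_biases$userId)] +
  item_biases$item_bias[match(train$movieId, item_biases$movieId)])

for (epoch in seq_len(epochs)) {
  for (j in sample(length(resid_train))) {
    u <- uid[j]; m <- mid[j]
    err <- resid_train[j] - sum(P[u, ] * Q[m, ])
    pu <- P[u, ]  # keep the old user factors for the movie update
    P[u, ] <- pu + lr * (err * Q[m, ] - reg * pu)
    Q[m, ] <- Q[m, ] + lr * (err * pu - reg * Q[m, ])
  }
}

# Final prediction for a (user u, movie m) pair:
# global_avg + user_bias + item_bias + sum(P[u, ] * Q[m, ])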
Note: All code used for data processing and analysis is available in this GitHub repository: Project1