This system recommends movies to users of a streaming platform using collaborative filtering. We use the MovieLens dataset of user ratings on movies to build a simple recommender based on global-average and bias-adjusted baseline predictors.
The data comprise 4,185,688 ratings with columns userId, movieId, rating, and a timestamp (tstamp). Ratings are given in half-star increments up to 5 and are sparse across users and items.
The dataset was loaded into R and split into training and test sets, and a small subset was used to verify calculations by hand. All analyses were performed with the tidyverse.
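The full loading and splitting code lives in the repository; below is a minimal sketch, assuming the ratings sit in a local ratings.csv (hypothetical path) and using a random 80/20 split.

library(tidyverse)
set.seed(42)  # any fixed seed, for reproducibility

# Read the ratings file (hypothetical path; point this at your copy of the data)
ratings <- read_csv("ratings.csv",
                    col_types = cols(userId = col_double(),
                                     movieId = col_double(),
                                     rating = col_double(),
                                     tstamp = col_datetime()))

# Randomly assign roughly 80% of rows to train and the rest to test
ratings <- ratings %>%
  mutate(split = if_else(runif(n()) < 0.8, "train", "test"))

train <- ratings %>% filter(split == "train")
test  <- ratings %>% filter(split == "test")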
## Rows: 4185688 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): userId, movieId, rating
## dttm (1): tstamp
A preview of the loaded ratings:
## # A tibble: 6 × 4
## userId movieId rating tstamp
## <dbl> <dbl> <dbl> <dttm>
## 1 393217 1 3.5 2023-01-25 19:45:46
## 2 393217 6 4 2023-02-07 21:17:19
## 3 393217 16 3.5 2023-01-25 19:50:40
## 4 393217 17 4.5 2023-01-25 19:49:45
## 5 393217 32 3 2023-01-25 19:46:01
## 6 393217 47 4 2023-01-25 15:44:44
After assigning each row to a split:
## # A tibble: 6 × 5
## userId movieId rating tstamp split
## <dbl> <dbl> <dbl> <dttm> <chr>
## 1 393217 1 3.5 2023-01-25 19:45:46 train
## 2 393217 6 4 2023-02-07 21:17:19 test
## 3 393217 16 3.5 2023-01-25 19:50:40 train
## 4 393217 17 4.5 2023-01-25 19:49:45 train
## 5 393217 32 3 2023-01-25 19:46:01 test
## 6 393217 47 4 2023-01-25 15:44:44 train
The first rows of the training set:
## # A tibble: 6 × 5
## userId movieId rating tstamp split
## <dbl> <dbl> <dbl> <dttm> <chr>
## 1 393217 1 3.5 2023-01-25 19:45:46 train
## 2 393217 16 3.5 2023-01-25 19:50:40 train
## 3 393217 17 4.5 2023-01-25 19:49:45 train
## 4 393217 47 4 2023-01-25 15:44:44 train
## 5 393217 111 4 2023-01-25 17:14:10 train
## 6 393217 260 3 2023-01-25 17:10:01 train
And the first rows of the test set:
## # A tibble: 6 × 5
## userId movieId rating tstamp split
## <dbl> <dbl> <dbl> <dttm> <chr>
## 1 393217 6 4 2023-02-07 21:17:19 test
## 2 393217 32 3 2023-01-25 19:46:01 test
## 3 393217 50 4.5 2023-01-25 15:38:58 test
## 4 393217 377 3.5 2023-01-25 19:46:12 test
## 5 393217 541 4 2023-01-25 15:42:52 test
## 6 393217 778 2.5 2023-01-25 19:47:24 test
The global average rating from the training data is:
## [1] 2.906781
Using this as a predictor for all unknown ratings, we calculated the RMSE on the test set:
## [1] "Global Average RMSE: 1.7601"
We calculated user and item biases as deviations from the global average, merged them with the test set, and computed the baseline predictor as
\[ \hat{r}_{ui} = \mu + b_u + b_i \]
where \(\mu\) is the global training average, \(b_u\) is the bias of user \(u\), and \(b_i\) is the bias of item \(i\).
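A minimal sketch of how these biases can be computed (simple unregularized means; the repository code may differ in details):

# User bias: average deviation of a user's ratings from the global mean
user_biases <- train %>%
  group_by(userId) %>%
  summarise(user_bias = mean(rating - global_avg))

# Item bias: average deviation of a movie's ratings from the global mean
item_biases <- train %>%
  group_by(movieId) %>%
  summarise(item_bias = mean(rating - global_avg))

# Attach biases to the test set; users/movies unseen in training get bias 0
test <- test %>%
  left_join(user_biases, by = "userId") %>%
  left_join(item_biases, by = "movieId") %>%
  mutate(user_bias = coalesce(user_bias, 0),
         item_bias = coalesce(item_bias, 0),
         pred_baseline = global_avg + user_bias + item_bias)

rmse(test$rating, test$pred_baseline)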
The first few user biases:
## # A tibble: 6 × 2
## userId user_bias
## <dbl> <dbl>
## 1 1892 0.152
## 2 3114 0.387
## 3 12559 0.330
## 4 15893 0.260
## 5 22005 0.608
## 6 41965 0.755
The first few item biases:
## # A tibble: 6 × 2
## movieId item_bias
## <dbl> <dbl>
## 1 1 0.759
## 2 2 0.409
## 3 3 -0.355
## 4 4 -1.09
## 5 5 -0.455
## 6 6 0.947
The test set with predictions attached:
## # A tibble: 6 × 9
## userId movieId rating tstamp split pred_global user_bias
## <dbl> <dbl> <dbl> <dttm> <chr> <dbl> <dbl>
## 1 393217 6 4 2023-02-07 21:17:19 test 2.91 0.917
## 2 393217 32 3 2023-01-25 19:46:01 test 2.91 0.917
## 3 393217 50 4.5 2023-01-25 15:38:58 test 2.91 0.917
## 4 393217 377 3.5 2023-01-25 19:46:12 test 2.91 0.917
## 5 393217 541 4 2023-01-25 15:42:52 test 2.91 0.917
## 6 393217 778 2.5 2023-01-25 19:47:24 test 2.91 0.917
## # ℹ 2 more variables: item_bias <dbl>, pred_baseline <dbl>
After applying this model:
## [1] "Baseline Predictor RMSE: 1.3131"
This is a substantial improvement over the global-average model (RMSE drops from 1.7601 to 1.3131).
# Compute residuals (actual minus predicted)
test <- test %>% mutate(residual = rating - pred_baseline)

# Show the ten predictions with the largest absolute error
test %>% arrange(desc(abs(residual))) %>% head(10)
## # A tibble: 10 × 10
## userId movieId rating tstamp split pred_global user_bias
## <dbl> <dbl> <dbl> <dttm> <chr> <dbl> <dbl>
## 1 327881 109374 -1 2019-07-30 18:15:22 test 2.91 1.48
## 2 393447 5618 -1 2023-01-31 09:48:55 test 2.91 1.34
## 3 328049 3535 -1 2018-10-08 18:45:35 test 2.91 1.53
## 4 394210 48780 -1 2023-03-03 22:26:53 test 2.91 1.13
## 5 394403 195159 -1 2023-03-03 14:51:34 test 2.91 1.41
## 6 394404 281096 -1 2023-03-25 22:53:06 test 2.91 1.88
## 7 395334 527 -1 2023-03-25 20:40:54 test 2.91 1.16
## 8 395470 3949 -1 2023-04-02 11:12:10 test 2.91 1.46
## 9 396300 68954 -1 2023-04-22 23:57:26 test 2.91 1.31
## 10 396377 5618 -1 2023-04-24 15:42:17 test 2.91 1.33
## # ℹ 3 more variables: item_bias <dbl>, pred_baseline <dbl>, residual <dbl>
# Plot residuals against predictions to check for systematic bias
ggplot(test, aes(x = pred_baseline, y = residual)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residual Plot for Baseline Predictor",
       x = "Predicted Rating",
       y = "Residual (Actual - Predicted)")
Examining the residuals makes the RMSE concrete: it shows which specific predictions are furthest off and whether the model systematically over- or under-predicts for certain users or movies. Notably, every row in the top-ten table above has a recorded rating of -1, which lies outside the valid rating scale; these look like sentinel or placeholder values that should be filtered out before training and evaluation. Beyond such data-quality issues, large residuals point to concrete ways to improve the recommender: adjusting the bias calculations, shrinking them with regularization (see the sketch below), or moving to models that capture user-movie interactions directly. In this way, RMSE is not just a number but a diagnostic tool that guides refinement.
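Regularized (shrunken) biases are a common refinement; a minimal sketch, where lambda is a tuning constant chosen by validation (the value 5 below is an arbitrary illustration, not a tuned result):

lambda <- 5  # shrinkage strength; tune on held-out data

# Shrink each bias toward 0; users and movies with few ratings shrink the most
user_biases_reg <- train %>%
  group_by(userId) %>%
  summarise(user_bias = sum(rating - global_avg) / (n() + lambda))

item_biases_reg <- train %>%
  group_by(movieId) %>%
  summarise(item_bias = sum(rating - global_avg) / (n() + lambda))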
Summary of model performance:
## # A tibble: 2 × 2
## Model RMSE
## <chr> <dbl>
## 1 Global Average 1.76
## 2 Baseline Predictor 1.31
The baseline predictor substantially reduces error by accounting for individual user and item biases, which supports the value of incorporating even basic personalization into recommender systems.
This project built a baseline recommender system that predicts ratings using a simple model: the global average rating plus user bias (how much a user’s ratings typically differ from the global average) and item bias (how much a movie’s average rating differs from the global average). This core idea captures systematic tendencies — for example, some users rate higher on average and some movies are generally liked more.
RMSE (root mean squared error) measures how far, on average, predicted ratings deviate from true ratings, so a lower RMSE means better predictions. Plotting the residuals shows which predictions have the largest errors; in this project they flag both out-of-range ratings that should be cleaned and extreme ratings the additive model cannot reach. These insights suggest concrete improvements: regularize the biases to avoid overfitting users and movies with few ratings, and move beyond simple additive effects with methods such as matrix factorization or neighborhood-based collaborative filtering (a minimal matrix-factorization sketch follows). This baseline is a solid start that demonstrates how personalization immediately improves on the global average, while the residuals and RMSE highlight clear next steps toward more reliable recommendations.
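As an illustration of that next step, here is a minimal stochastic-gradient-descent matrix factorization fit to the residuals of the baseline model, building on the earlier sketches (rmse, global_avg, user_biases, item_biases). The factor count, learning rate, regularization strength, and epoch count are arbitrary illustrative choices, and the plain R loop favors readability over speed:

set.seed(42)
k      <- 10    # number of latent factors (illustrative)
lr     <- 0.01  # learning rate
reg    <- 0.05  # L2 regularization strength
epochs <- 5

# Map user and movie ids to consecutive row indices for the factor matrices
uid <- match(train$userId,  unique(train$userId))
mid <- match(train$movieId, unique(train$movieId))
P <- matrix(rnorm(max(uid) * k, sd = 0.1), ncol = k)  # user factors
Q <- matrix(rnorm(max(mid) * k, sd = 0.1), ncol = k)  # movie factors

# Residuals left over after the baseline predictor on the training set
resid_train <- train$rating - (global_avg +
  user_biases$user_bias[match(train$userId,  user_biases$userId)] +
  item_biases$item_bias[match(train$movieId, item_biases$movieId)])

for (epoch in seq_len(epochs)) {
  for (j in sample(length(resid_train))) {
    u <- uid[j]; m <- mid[j]
    err <- resid_train[j] - sum(P[u, ] * Q[m, ])
    pu <- P[u, ]  # keep the old user factors for the movie update
    P[u, ] <- pu + lr * (err * Q[m, ] - reg * pu)
    Q[m, ] <- Q[m, ] + lr * (err * pu - reg * Q[m, ])
  }
}

# Final prediction for a (user u, movie m) pair:
# global_avg + user_bias + item_bias + sum(P[u, ] * Q[m, ])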
Note: All code used for data processing and analysis is available in this GitHub repository: Project1