Personalized Movie Recommender System

User-to-User Collaborative Filtering

Author

Mark Hamer

Published

April 24, 2026

1 Introduction

In the previous assignment, I used a Global Baseline Estimate (GBE) to predict movie ratings. The GBE is non-personalized: it combines the overall mean, a user bias, and a movie bias to make predictions, but it does not consider which specific users share similar tastes.

In this project, I build a personalized recommender system using User-to-User Collaborative Filtering. This algorithm identifies users who rate movies similarly to a target user (known as neighbors) and uses their ratings to predict how the target user would rate an unseen movie.

What the recommender outputs:

  • Predicted ratings for every unrated user–movie pair
  • A top-N ranked recommendation list per user
  • RMSE evaluation via leave-one-out cross-validation, compared against the GBE baseline

2 Step 1 — Load and Prepare the Data

library(readxl)

ratings_data <- read_excel("MovieRatings.xlsx", sheet = "MovieRatings")

critic_names <- ratings_data$Critic

ratings_matrix <- as.matrix(ratings_data[, -1])
rownames(ratings_matrix) <- critic_names

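# Treat "?" placeholders as missing, then convert every column to numeric.
ratings_matrix[ratings_matrix == "?"] <- NA
ratings_matrix <- apply(ratings_matrix, 2, as.numeric)
# apply() drops the row names, so restore them.
rownames(ratings_matrix) <- critic_names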

overall_mean <- mean(ratings_matrix, na.rm = TRUE)

user_avg <- rowMeans(ratings_matrix, na.rm = TRUE)
user_bias <- user_avg - overall_mean

movie_avg <- colMeans(ratings_matrix, na.rm = TRUE)
movie_bias <- movie_avg - overall_mean

cat("Ratings Matrix:\n\n")
Ratings Matrix:
print(ratings_matrix)
          CaptainAmerica Deadpool Frozen JungleBook PitchPerfect2 StarWarsForce
Burton                NA       NA     NA          4            NA             4
Charley                4        5      4          3             2             3
Dan                   NA        5     NA         NA            NA             5
Dieudonne              5        4     NA         NA            NA             5
Matt                   4       NA      2         NA             2             5
Mauricio               4       NA      3          3             4            NA
Max                    4        4      4          2             2             4
Nathan                NA       NA     NA         NA            NA             4
Param                  4        4      1         NA            NA             5
Parshu                 4        3      5          5             2             3
Prashanth              5        5      5          5            NA             4
Shipra                NA       NA      4          5            NA             3
Sreejaya               5        5      5          4             4             5
Steve                  4       NA     NA         NA            NA             4
Vuthy                  4        5      3          3             3            NA
Xingjia               NA       NA      5          5            NA            NA
cat("\nOverall mean:", round(overall_mean, 2), "\n")

Overall mean: 3.93 
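
As a quick check on how the bias terms combine, here is a small worked sketch of the Global Baseline Estimate for one unrated pair, Burton and CaptainAmerica. The totals are read off the ratings matrix printed above (240 is the sum of all 61 observed ratings, 47 the sum of the 11 CaptainAmerica ratings), and the `_ex` variable names are illustrative, not part of the report's code.

```r
# Values read off the printed ratings matrix (assumed transcribed correctly).
overall_mean_ex <- 240 / 61      # sum of all observed ratings / their count, about 3.93
burton_mean     <- (4 + 4) / 2   # Burton rated JungleBook and StarWarsForce
captain_mean    <- 47 / 11       # mean of the 11 CaptainAmerica ratings

# Biases are deviations from the overall mean.
user_bias_ex  <- burton_mean  - overall_mean_ex
movie_bias_ex <- captain_mean - overall_mean_ex

# Global Baseline Estimate = overall mean + user bias + movie bias.
gbe_ex <- overall_mean_ex + user_bias_ex + movie_bias_ex
round(gbe_ex, 2)  # 4.34
```

This is the value the recommender falls back to for Burton later in the report, since Burton has no usable neighbors.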

3 Step 2 — Center the Ratings

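# Subtract each user's mean from their row. user_avg is recycled down the
# columns, so row i is shifted by user_avg[i].
centered_matrix <- ratings_matrix - user_avg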

cat("Centered Ratings Matrix:\n\n")
Centered Ratings Matrix:
print(round(centered_matrix, 2))
          CaptainAmerica Deadpool Frozen JungleBook PitchPerfect2 StarWarsForce
Burton                NA       NA     NA       0.00            NA          0.00
Charley             0.50     1.50   0.50      -0.50         -1.50         -0.50
Dan                   NA     0.00     NA         NA            NA          0.00
Dieudonne           0.33    -0.67     NA         NA            NA          0.33
Matt                0.75       NA  -1.25         NA         -1.25          1.75
Mauricio            0.50       NA  -0.50      -0.50          0.50            NA
Max                 0.67     0.67   0.67      -1.33         -1.33          0.67
Nathan                NA       NA     NA         NA            NA          0.00
Param               0.50     0.50  -2.50         NA            NA          1.50
Parshu              0.33    -0.67   1.33       1.33         -1.67         -0.67
Prashanth           0.20     0.20   0.20       0.20            NA         -0.80
Shipra                NA       NA   0.00       1.00            NA         -1.00
Sreejaya            0.33     0.33   0.33      -0.67         -0.67          0.33
Steve               0.00       NA     NA         NA            NA          0.00
Vuthy               0.40     1.40  -0.60      -0.60         -0.60            NA
Xingjia               NA       NA   0.00       0.00            NA            NA
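
The centering step relies on R's recycling rule: a vector with one entry per row is recycled down the columns of a matrix, so each row is shifted by its own mean. The toy sketch below (the matrix `m` and its values are made up for illustration) checks that this matches the more explicit `sweep()` call.

```r
# Toy matrix: 2 users (rows) by 3 movies (columns), NA = not rated.
m <- matrix(c(4, 5, 3,
              2, NA, 4),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("u1", "u2"), c("a", "b", "c")))
row_avg <- rowMeans(m, na.rm = TRUE)   # u1 = 4, u2 = 3

# Implicit recycling: element [i, j] is reduced by row_avg[i].
centered_recycled <- m - row_avg

# Explicit equivalent: subtract row_avg along margin 1 (rows).
centered_sweep <- sweep(m, 1, row_avg, "-")

isTRUE(all.equal(centered_recycled, centered_sweep))  # TRUE
rowMeans(centered_recycled, na.rm = TRUE)             # both rows average 0
```

Recycling only lines up this way because the vector length equals the number of rows; `sweep()` states the intent more explicitly.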

4 Step 3 — Compute User Similarity

n_users <- nrow(ratings_matrix)

sim_matrix <- matrix(NA, nrow = n_users, ncol = n_users)
rownames(sim_matrix) <- critic_names
colnames(sim_matrix) <- critic_names

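# Fill the similarity matrix one user pair at a time.
for (i in 1:n_users) {
  for (j in 1:n_users) {
    # A user is perfectly similar to themselves.
    if (i == j) {
      sim_matrix[i, j] <- 1
      next
    }

    # Movies that both users have rated.
    both_rated <- !is.na(centered_matrix[i, ]) & !is.na(centered_matrix[j, ])
    n_common <- sum(both_rated)

    # With fewer than 2 co-rated movies, the similarity is left undefined.
    if (n_common < 2) {
      sim_matrix[i, j] <- NA
      next
    }

    u_ratings <- centered_matrix[i, both_rated]
    v_ratings <- centered_matrix[j, both_rated]

    # Cosine of the mean-centered ratings over the co-rated movies.
    numerator <- sum(u_ratings * v_ratings)
    denominator <- sqrt(sum(u_ratings^2)) * sqrt(sum(v_ratings^2))

    # A zero denominator means one user shows no variation here; use 0.
    if (denominator == 0) {
      sim_matrix[i, j] <- 0
    } else {
      sim_matrix[i, j] <- numerator / denominator
    }
  }
}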

cat("User Similarity Matrix (Pearson Correlation):\n\n")
User Similarity Matrix (Pearson Correlation):
print(round(sim_matrix, 2))
          Burton Charley Dan Dieudonne  Matt Mauricio   Max Nathan Param Parshu
Burton         1    0.00  NA        NA    NA       NA  0.00     NA    NA   0.00
Charley        0    1.00   0     -0.74  0.17    -0.29  0.74     NA -0.19   0.31
Dan           NA    0.00   1      0.00    NA       NA  0.00     NA  0.00   0.00
Dieudonne     NA   -0.74   0      1.00  0.93       NA  0.00     NA  0.25   0.41
Matt          NA    0.17  NA      0.93  1.00     0.23  0.55     NA  0.91  -0.09
Mauricio      NA   -0.29  NA        NA  0.23     1.00  0.00     NA  0.83  -0.79
Max            0    0.74   0      0.00  0.55     0.00  1.00     NA  0.00   0.11
Nathan        NA      NA  NA        NA    NA       NA    NA      1    NA     NA
Param         NA   -0.19   0      0.25  0.91     0.83  0.00     NA  1.00  -0.90
Parshu         0    0.31   0      0.41 -0.09    -0.79  0.11     NA -0.90   1.00
Prashanth      0    0.50   0     -0.48 -0.78    -0.33 -0.24     NA -0.57   0.52
Shipra         0    0.00  NA        NA -0.81    -0.71 -0.87     NA -0.51   0.71
Sreejaya       0    0.74   0      0.00  0.55     0.00  1.00     NA  0.00   0.11
Steve         NA    0.00  NA      0.00  0.00       NA  0.00     NA  0.00   0.00
Vuthy         NA    0.78  NA     -0.74  1.00     0.45  0.61     NA  0.59  -0.30
Xingjia       NA    0.00  NA        NA    NA     0.00  0.00     NA    NA   0.00
          Prashanth Shipra Sreejaya Steve Vuthy Xingjia
Burton         0.00   0.00     0.00    NA    NA      NA
Charley        0.50   0.00     0.74     0  0.78       0
Dan            0.00     NA     0.00    NA    NA      NA
Dieudonne     -0.48     NA     0.00     0 -0.74      NA
Matt          -0.78  -0.81     0.55     0  1.00      NA
Mauricio      -0.33  -0.71     0.00    NA  0.45       0
Max           -0.24  -0.87     1.00     0  0.61       0
Nathan           NA     NA       NA    NA    NA      NA
Param         -0.57  -0.51     0.00     0  0.59      NA
Parshu         0.52   0.71     0.11     0 -0.30       0
Prashanth      1.00   0.83    -0.24     0  0.18       0
Shipra         0.83   1.00    -0.87    NA -0.71       0
Sreejaya      -0.24  -0.87     1.00     0  0.61       0
Steve          0.00     NA     0.00     1    NA      NA
Vuthy          0.18  -0.71     0.61    NA  1.00       0
Xingjia        0.00   0.00     0.00    NA  0.00       1
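
To make the similarity measure concrete, the sketch below recomputes two entries of the matrix by hand. The helper `centered_cosine()` is a hypothetical name introduced for this example; it mirrors the loop above by centering each user on the mean of all their own ratings and then taking the cosine over the co-rated movies. The ratings are copied from the matrix in Step 1.

```r
# Hypothetical helper mirroring the loop above.
centered_cosine <- function(u, v) {
  u_c <- u - mean(u, na.rm = TRUE)   # center on the user's own mean
  v_c <- v - mean(v, na.rm = TRUE)
  both <- !is.na(u_c) & !is.na(v_c)  # co-rated movies only
  if (sum(both) < 2) return(NA_real_)
  denom <- sqrt(sum(u_c[both]^2)) * sqrt(sum(v_c[both]^2))
  if (denom == 0) return(0)
  sum(u_c[both] * v_c[both]) / denom
}

# Charley and Max rated all six movies.
charley <- c(4, 5, 4, 3, 2, 3)
max_r   <- c(4, 4, 4, 2, 2, 4)
sim_cm  <- centered_cosine(charley, max_r)
round(sim_cm, 2)                               # 0.74, as in the matrix above
isTRUE(all.equal(sim_cm, cor(charley, max_r))) # TRUE: no gaps, so it equals Pearson

# Matt and Vuthy each have gaps.
matt   <- c(4, NA, 2, NA, 2, 5)
vuthy  <- c(4, 5, 3, 3, 3, NA)
sim_mv <- centered_cosine(matt, vuthy)         # slightly below 1
cor(matt, vuthy, use = "complete.obs")         # 1 on the three co-rated movies
```

When both users have rated every movie, the measure coincides with `cor()`. When ratings are missing, it can differ slightly from a Pearson correlation restricted to the co-rated movies, because each user's mean is taken over all of their ratings rather than only the shared ones.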

5 Step 4 — Generate Predictions

predict_cf <- function(user, movie, k = 5) {

  # Already rated: return the observed rating unchanged.
  if (!is.na(ratings_matrix[user, movie])) {
    return(ratings_matrix[user, movie])
  }

  sims <- sim_matrix[user, ]

  # Neighbors must have rated the movie and have a defined, positive similarity.
  candidates <- which(!is.na(ratings_matrix[, movie]) & 
                       !is.na(sims) & 
                       sims > 0 & 
                       names(sims) != user)

  # No usable neighbors: fall back to the Global Baseline Estimate.
  if (length(candidates) == 0) {
    prediction <- overall_mean + user_bias[user] + movie_bias[movie]
    return(prediction)
  }

  # Keep the k most similar neighbors.
  candidates <- candidates[order(sims[candidates], decreasing = TRUE)]
  if (length(candidates) > k) {
    candidates <- candidates[1:k]
  }
  neighbor_sims <- sims[candidates]
  neighbor_centered <- centered_matrix[candidates, movie]

  # User's mean plus the similarity-weighted average of neighbors' centered ratings.
  prediction <- user_avg[user] + 
    sum(neighbor_sims * neighbor_centered) / sum(abs(neighbor_sims))

  return(prediction)
}

prediction_matrix <- ratings_matrix

for (i in 1:nrow(ratings_matrix)) {
  for (j in 1:ncol(ratings_matrix)) {
    if (is.na(ratings_matrix[i, j])) {
      prediction_matrix[i, j] <- predict_cf(rownames(ratings_matrix)[i],
                                             colnames(ratings_matrix)[j])
    }
  }
}

cat("Predicted Ratings for All Missing Values:\n")
Predicted Ratings for All Missing Values:
cat("(Original ratings kept as-is, predictions fill the gaps)\n\n")
(Original ratings kept as-is, predictions fill the gaps)
print(round(prediction_matrix, 2))
          CaptainAmerica Deadpool Frozen JungleBook PitchPerfect2 StarWarsForce
Burton              4.34     4.51   3.79       4.00          2.78          4.00
Charley             4.00     5.00   4.00       3.00          2.00          3.00
Dan                 5.34     5.00   4.79       4.97          3.78          5.00
Dieudonne           5.00     4.00   3.89       6.00          3.29          5.00
Matt                4.00     3.70   2.00       2.49          2.00          5.00
Mauricio            4.00     4.32   3.00       3.00          4.00          5.05
Max                 4.00     4.00   4.00       2.00          2.00          4.00
Nathan              4.34     4.51   3.79       3.97          2.78          4.00
Param               4.00     4.00   1.00       2.96          3.04          5.00
Parshu              4.00     3.00   5.00       5.00          2.00          3.00
Prashanth           5.00     5.00   5.00       5.00          3.36          4.00
Shipra              4.26     3.80   4.00       5.00          2.33          3.00
Sreejaya            5.00     5.00   5.00       4.00          4.00          5.00
Steve               4.00     4.51   3.79       3.97          2.78          4.00
Vuthy               4.00     5.00   3.00       3.00          3.00          4.39
Xingjia             5.34     5.51   5.00       5.00          3.78          5.22
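
One entry in the table above stands out: Dieudonne's predicted 6.00 for JungleBook, which is outside the 1 to 5 scale. The sketch below retraces that prediction by hand using values read from the earlier tables. Of Dieudonne's positive-similarity neighbors (Matt, Parshu, Param), only Parshu rated JungleBook, so Parshu is the single neighbor. The variable names are illustrative.

```r
# Values read from the tables above (assumed transcribed correctly).
dieudonne_avg <- (5 + 4 + 5) / 3               # Dieudonne's mean rating
parshu_avg    <- (4 + 3 + 5 + 5 + 2 + 3) / 6   # Parshu's mean rating
parshu_jungle <- 5                             # Parshu's rating of JungleBook

neighbor_sims     <- c(Parshu = 0.41)          # rounded similarity from Step 3
neighbor_centered <- c(Parshu = parshu_jungle - parshu_avg)

# Same formula as predict_cf: user mean + weighted average of centered ratings.
pred <- dieudonne_avg +
  sum(neighbor_sims * neighbor_centered) / sum(abs(neighbor_sims))
round(pred, 2)  # 6
```

With a single neighbor the similarity weight cancels out, so the prediction is simply Dieudonne's mean (about 4.67) plus Parshu's deviation from his own mean (about 1.33), regardless of how strong the similarity is.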

6 Step 5 — Top-N Recommendations Per User

cat("Top Movie Recommendations Per User:\n")
Top Movie Recommendations Per User:
cat("(Only showing movies the user has NOT rated, ranked by predicted rating)\n\n")
(Only showing movies the user has NOT rated, ranked by predicted rating)
for (i in 1:nrow(ratings_matrix)) {
  user <- rownames(ratings_matrix)[i]
  
  unrated <- which(is.na(ratings_matrix[i, ]))
  
  if (length(unrated) == 0) {
    cat(user, ": Rated all movies — no recommendations needed\n\n")
    next
  }
  
  preds <- setNames(prediction_matrix[i, unrated], colnames(ratings_matrix)[unrated])
  preds_sorted <- sort(preds, decreasing = TRUE)
  
  cat(user, ":\n")
  for (j in 1:length(preds_sorted)) {
    cat("  ", j, ". ", names(preds_sorted)[j], 
        " (predicted: ", round(preds_sorted[j], 2), ")\n", sep = "")
  }
  cat("\n")
}
Burton :
  1. Deadpool (predicted: 4.51)
  2. CaptainAmerica (predicted: 4.34)
  3. Frozen (predicted: 3.79)
  4. PitchPerfect2 (predicted: 2.78)

Charley : Rated all movies — no recommendations needed

Dan :
  1. CaptainAmerica (predicted: 5.34)
  2. JungleBook (predicted: 4.97)
  3. Frozen (predicted: 4.79)
  4. PitchPerfect2 (predicted: 3.78)

Dieudonne :
  1. JungleBook (predicted: 6)
  2. Frozen (predicted: 3.89)
  3. PitchPerfect2 (predicted: 3.29)

Matt :
  1. Deadpool (predicted: 3.7)
  2. JungleBook (predicted: 2.49)

Mauricio :
  1. StarWarsForce (predicted: 5.05)
  2. Deadpool (predicted: 4.32)

Max : Rated all movies — no recommendations needed

Nathan :
  1. Deadpool (predicted: 4.51)
  2. CaptainAmerica (predicted: 4.34)
  3. JungleBook (predicted: 3.97)
  4. Frozen (predicted: 3.79)
  5. PitchPerfect2 (predicted: 2.78)

Param :
  1. PitchPerfect2 (predicted: 3.04)
  2. JungleBook (predicted: 2.96)

Parshu : Rated all movies — no recommendations needed

Prashanth :
  1. PitchPerfect2 (predicted: 3.36)

Shipra :
  1. CaptainAmerica (predicted: 4.26)
  2. Deadpool (predicted: 3.8)
  3. PitchPerfect2 (predicted: 2.33)

Sreejaya : Rated all movies — no recommendations needed

Steve :
  1. Deadpool (predicted: 4.51)
  2. JungleBook (predicted: 3.97)
  3. Frozen (predicted: 3.79)
  4. PitchPerfect2 (predicted: 2.78)

Vuthy :
  1. StarWarsForce (predicted: 4.39)

Xingjia :
  1. Deadpool (predicted: 5.51)
  2. CaptainAmerica (predicted: 5.34)
  3. StarWarsForce (predicted: 5.22)
  4. PitchPerfect2 (predicted: 3.78)
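
The loop above lists every unrated movie for each user. If the list should be capped at a fixed N, a small helper along these lines would do it. `top_n()` is a hypothetical name, and the example reuses Burton's predictions copied from the output above.

```r
# Hypothetical helper: keep the n highest predictions from a named vector.
top_n <- function(preds, n = 3) {
  head(sort(preds, decreasing = TRUE), n)
}

# Burton's predicted ratings for unrated movies, copied from the output above.
burton_preds <- c(CaptainAmerica = 4.34, Deadpool = 4.51,
                  Frozen = 3.79, PitchPerfect2 = 2.78)

top_n(burton_preds, n = 2)  # Deadpool (4.51), then CaptainAmerica (4.34)
```

`head()` returns fewer than `n` items when a user has fewer unrated movies, so no extra length check is needed.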

7 Step 6 — Evaluate with Leave-One-Out Cross-Validation

cf_errors <- c()
gbe_errors <- c()

for (i in 1:nrow(ratings_matrix)) {
  for (j in 1:ncol(ratings_matrix)) {
    
    if (is.na(ratings_matrix[i, j])) next
    
    actual <- ratings_matrix[i, j]
    user <- rownames(ratings_matrix)[i]
    movie <- colnames(ratings_matrix)[j]
    
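    # Hide the rating being predicted.
    temp_matrix <- ratings_matrix
    temp_matrix[i, j] <- NA

    # Skip users left with fewer than 2 ratings after hiding this one.
    if (sum(!is.na(temp_matrix[i, ])) < 2) next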
    
    temp_user_avg <- rowMeans(temp_matrix, na.rm = TRUE)
    temp_centered <- temp_matrix - temp_user_avg
    
    temp_sim <- matrix(NA, nrow = nrow(temp_matrix), ncol = nrow(temp_matrix))
    rownames(temp_sim) <- rownames(temp_matrix)
    colnames(temp_sim) <- rownames(temp_matrix)
    
    for (a in 1:nrow(temp_matrix)) {
      for (b in 1:nrow(temp_matrix)) {
        if (a == b) { temp_sim[a, b] <- 1; next }
        both <- !is.na(temp_centered[a, ]) & !is.na(temp_centered[b, ])
        if (sum(both) < 2) next
        u_r <- temp_centered[a, both]
        v_r <- temp_centered[b, both]
        denom <- sqrt(sum(u_r^2)) * sqrt(sum(v_r^2))
        temp_sim[a, b] <- if (denom == 0) 0 else sum(u_r * v_r) / denom
      }
    }
    
    temp_overall <- mean(temp_matrix, na.rm = TRUE)
    gbe_pred <- temp_overall + 
      (temp_user_avg[user] - temp_overall) + 
      (mean(temp_matrix[, movie], na.rm = TRUE) - temp_overall)
    
    sims <- temp_sim[user, ]
    candidates <- which(!is.na(temp_matrix[, movie]) & 
                         !is.na(sims) & 
                         sims > 0 & 
                         names(sims) != user)
    
    if (length(candidates) == 0) {
      cf_pred <- gbe_pred  # Fall back to GBE
    } else {
      candidates <- candidates[order(sims[candidates], decreasing = TRUE)]
      if (length(candidates) > 5) candidates <- candidates[1:5]
      cf_pred <- temp_user_avg[user] + 
        sum(sims[candidates] * temp_centered[candidates, movie]) / 
        sum(abs(sims[candidates]))
    }
    
    if (!is.na(cf_pred) && !is.nan(cf_pred) && 
        !is.na(gbe_pred) && !is.nan(gbe_pred)) {
      cf_errors <- c(cf_errors, (actual - cf_pred)^2)
      gbe_errors <- c(gbe_errors, (actual - gbe_pred)^2)
    }
  }
}

cf_rmse <- sqrt(mean(cf_errors))
gbe_rmse <- sqrt(mean(gbe_errors))

cat("=== Model Evaluation (Leave-One-Out Cross-Validation) ===\n\n")
=== Model Evaluation (Leave-One-Out Cross-Validation) ===
cat("Ratings evaluated:", length(cf_errors), "\n\n")
Ratings evaluated: 52 
cat("Collaborative Filtering RMSE:", round(cf_rmse, 4), "\n")
Collaborative Filtering RMSE: 1.016 
cat("Global Baseline Estimate RMSE:", round(gbe_rmse, 4), "\n\n")
Global Baseline Estimate RMSE: 1.1008 
if (cf_rmse < gbe_rmse) {
  improvement <- round((1 - cf_rmse / gbe_rmse) * 100, 1)
  cat("Collaborative Filtering improves over GBE by", improvement, "%\n")
} else {
  cat("Global Baseline Estimate performed better on this dataset.\n")
  cat("This is not unusual with very sparse, small datasets where\n")
  cat("similarity estimates are unreliable.\n")
}
Collaborative Filtering improves over GBE by 7.7 %

8 Conclusion

8.1 What Was Built

For this project, I implemented a User-to-User Collaborative Filtering recommender system using the same 16-critic, 6-movie survey dataset from the previous Global Baseline Estimate assignment. The system:

  1. Centers each user’s ratings by subtracting their personal average to remove scale differences
  2. Computes a Pearson-style similarity (the cosine of mean-centered ratings over co-rated movies) between all pairs of users, requiring at least 2 co-rated movies
  3. Predicts missing ratings using a weighted average of the most similar neighbors’ centered ratings (up to k = 5 neighbors with positive similarity)
  4. Falls back to the Global Baseline Estimate when no valid neighbors are available
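
The steps above can be summarized in a single formula. The notation is introduced here for illustration and does not appear in the code: \bar{r}_u is user u's mean rating, s_{uv} is the similarity between users u and v, and N_k(u,m) is the set of up to k most similar users with s_{uv} > 0 who rated movie m.

```latex
\hat{r}_{u,m} = \bar{r}_u +
  \frac{\sum_{v \in N_k(u,m)} s_{uv}\,\left(r_{v,m} - \bar{r}_v\right)}
       {\sum_{v \in N_k(u,m)} \lvert s_{uv} \rvert}
```

When N_k(u,m) is empty, the prediction is instead the Global Baseline Estimate: overall mean plus user bias plus movie bias.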

8.2 What It Outputs

  • Predicted ratings for all 35 unrated user–movie pairs
  • Ranked recommendation lists for each user, showing unseen movies ordered by predicted rating
  • RMSE scores for both models via leave-one-out cross-validation

8.3 Evaluation Results

Model | RMSE |
Global Baseline Estimate | 1.10 |
Collaborative Filtering | 1.02 |

The collaborative filter achieved a 7.7% lower RMSE than the non-personalized baseline across the 52 held-out ratings. This suggests that using user similarity improves prediction accuracy even on a small dataset.
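
For reference, the arithmetic behind these figures can be sketched as follows. `rmse()` is a hypothetical helper, the three-element vectors are made-up toy values, and the improvement is recomputed from the rounded RMSE values printed in Step 6 rather than from the unrounded ones.

```r
# Hypothetical helper: root mean squared error of predictions.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Toy check: squared errors are 0, 1 and 4, so RMSE = sqrt(5 / 3).
toy_rmse <- rmse(c(4, 5, 3), c(4, 4, 5))

# Relative improvement from the rounded RMSE values reported in Step 6.
cf_rmse_reported  <- 1.016
gbe_rmse_reported <- 1.1008
improvement_pct <- round((1 - cf_rmse_reported / gbe_rmse_reported) * 100, 1)
improvement_pct  # 7.7
```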

8.4 Limitations

  • Small and sparse data: With only 16 users and 6 movies, similarity estimates are based on very few co-rated items, making them inherently noisy.
  • Out-of-range predictions: The model can predict ratings outside the 1–5 scale (e.g., Dieudonne’s predicted 6.0 for JungleBook). Clamping predictions to the valid range would address this.
  • Cold-start users: Critics with very few ratings, such as Burton and Nathan, cannot be scored effectively by collaborative filtering, so their predictions rely on the baseline fallback.
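
As a minimal sketch of the clamping fix mentioned above, predictions can be restricted to the rating scale with `pmax()` and `pmin()`. `clamp_rating()` is a hypothetical helper; the first three values are predictions from this report and 0.50 is a made-up value to show the lower bound.

```r
# Hypothetical helper: restrict predictions to the valid rating scale.
clamp_rating <- function(x, lower = 1, upper = 5) {
  pmin(pmax(x, lower), upper)
}

# 6.00, 5.34 and 3.29 appear in the prediction matrix; 0.50 is illustrative.
clamp_rating(c(6.00, 5.34, 3.29, 0.50))  # 5.00 5.00 3.29 1.00
```

Because `pmin()` and `pmax()` keep the dimensions of their first argument, the same call should also work on the whole prediction matrix. Note that clamping creates ties at 5, which can change the order of a ranked list, so it may be preferable to rank on the raw predictions and clamp only for display.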