Assignment 11 CodeBase: Personalized Recommender System
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
Approach Summary
This assignment builds a personalized recommender system using the same survey-based movie rating dataset from the previous assignment. Unlike the Global Baseline Estimate model, which produced non-personalized recommendations, this model uses item-to-item collaborative filtering to generate user-specific recommendations.
The recommender calculates similarity between movies based on user rating patterns and predicts ratings for unseen movies using weighted averages of similar items.
The model outputs a ranked Top-3 list of recommended movies for each user. Performance is evaluated using a hold-out validation approach with RMSE and MAE metrics.
Additional visualizations are included to support data understanding and model evaluation.
Objective
The objective of this assignment is to build a personalized movie recommendation system using the same survey rating data from the previous Global Baseline Estimate assignment. Unlike the previous baseline model, this assignment uses item-to-item collaborative filtering to generate personalized recommendations based on each user’s rating history.
The recommender will output ranked Top-3 movie recommendations for each user and evaluate prediction performance using a hold-out test set with RMSE and MAE.
Loading Libraries
library(DBI)
library(RPostgres)
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
library(ggplot2)
library(knitr)
Connect to PostgreSQL
The PostgreSQL database used in this assignment was created in the previous Global Baseline Estimate assignment. The schema file is included in the GitHub repository for reproducibility, but the database is not recreated in this codebase.
con <- dbConnect(
RPostgres::Postgres(),
host = "localhost",
port = 5432,
dbname = "Movie _Ratings",
user = "postgres",
password = Sys.getenv("PGPASSWORD")
)
dbListTables(con)
[1] "movies"  "ratings" "users"
Load Tables into R
users_df <- dbGetQuery(con, "SELECT * FROM users;")
movies_df <- dbGetQuery(con, "SELECT * FROM movies;")
ratings_df <- dbGetQuery(con, "SELECT * FROM ratings;")
Basic Sanity Checks
glimpse(users_df)
Rows: 5
Columns: 2
$ user_id <int> 1, 2, 3, 4, 5
$ name <chr> "Muhammad", "Ali", "Abdul", "Samir", "Daniyal"
glimpse(movies_df)
Rows: 6
Columns: 3
$ movie_id <int> 1, 2, 3, 4, 5, 6
$ title <chr> "Dune: Part Two", "Oppenheimer", "Barbie", "The Batman", …
$ release_year <int> 2024, 2023, 2023, 2022, 2023, 2022
glimpse(ratings_df)
Rows: 17
Columns: 4
$ rating_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
$ user_id <int> 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5
$ movie_id <int> 1, 2, 5, 2, 3, 6, 1, 4, 5, 6, 2, 3, 5, 1, 4, 2, 6
$ rating <int> 5, 5, 4, 4, 3, 5, 4, 3, 4, 4, 5, 4, 5, 3, 4, 5, 5
Prepare Clean Ratings Data
ratings_clean <- ratings_df %>%
select(user_id, movie_id, rating) %>%
mutate(rating = as.numeric(rating)) %>%
filter(!is.na(rating))
ratings_joined <- ratings_clean %>%
left_join(users_df, by = "user_id") %>%
left_join(movies_df, by = "movie_id") %>%
select(user_id, name, movie_id, title, release_year, rating) %>%
arrange(name, title)
ratings_joined
   user_id     name movie_id                               title release_year rating
1        3    Abdul        1                      Dune: Part Two         2024      4
2        3    Abdul        5 Spider-Man: Across the Spider-Verse         2023      4
3        3    Abdul        4                          The Batman         2022      3
4        3    Abdul        6                   Top Gun: Maverick         2022      4
5        2      Ali        3                              Barbie         2023      3
6        2      Ali        2                         Oppenheimer         2023      4
7        2      Ali        6                   Top Gun: Maverick         2022      5
8        5  Daniyal        1                      Dune: Part Two         2024      3
9        5  Daniyal        2                         Oppenheimer         2023      5
10       5  Daniyal        4                          The Batman         2022      4
11       5  Daniyal        6                   Top Gun: Maverick         2022      5
12       1 Muhammad        1                      Dune: Part Two         2024      5
13       1 Muhammad        2                         Oppenheimer         2023      5
14       1 Muhammad        5 Spider-Man: Across the Spider-Verse         2023      4
15       4    Samir        3                              Barbie         2023      4
16       4    Samir        2                         Oppenheimer         2023      5
17       4    Samir        5 Spider-Man: Across the Spider-Verse         2023      5
This step prepares a clean ratings dataset and joins the relational tables so that user names and movie titles are visible in the analysis.
Validate Rating Scale
ratings_clean %>%
summarise(
total_ratings = n(),
min_rating = min(rating),
max_rating = max(rating),
average_rating = mean(rating),
missing_ratings = sum(is.na(rating))
)
 total_ratings min_rating max_rating average_rating missing_ratings
1 17 3 5 4.235294 0
The validation confirms that the rating values are numeric and fall within the expected 1 to 5 survey scale.
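To make this check enforceable on a future data refresh, the same conditions can be asserted so the pipeline fails loudly on bad input. This is a small optional sketch, not part of the original pipeline:
# Halt the analysis if any rating is missing or falls outside the 1-to-5 survey scale.
stopifnot(
  is.numeric(ratings_clean$rating),
  !anyNA(ratings_clean$rating),
  all(ratings_clean$rating >= 1 & ratings_clean$rating <= 5)
)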
Rating Distribution Visualization
ggplot(ratings_clean, aes(x = factor(rating))) +
geom_bar() +
labs(
title = "Distribution of Movie Ratings",
x = "Rating",
y = "Count"
)
This plot shows how frequently each rating value appears in the dataset. It provides useful context before building the recommender.
Create Training and Testing Sets
set.seed(607)
ratings_split <- ratings_clean %>%
mutate(row_id = row_number())
test_ids <- sample(
ratings_split$row_id,
size = ceiling(0.20 * nrow(ratings_split))
)
train_ratings <- ratings_split %>%
filter(!row_id %in% test_ids) %>%
select(-row_id)
test_ratings <- ratings_split %>%
filter(row_id %in% test_ids) %>%
select(-row_id)
nrow(train_ratings)
[1] 13
nrow(test_ratings)
[1] 4
A hold-out validation method is used. The model is trained on 80% of the observed ratings and evaluated on the remaining 20%.
Create User-Movie Rating Matrix
rating_matrix <- train_ratings %>%
left_join(movies_df, by = "movie_id") %>%
select(user_id, title, rating) %>%
pivot_wider(
names_from = title,
values_from = rating
) %>%
column_to_rownames("user_id") %>%
as.matrix()
rating_matrix
  Dune: Part Two Oppenheimer Spider-Man: Across the Spider-Verse Barbie Top Gun: Maverick The Batman
1              5           5                                   4     NA                NA         NA
2             NA          NA                                  NA      3                 5         NA
3              4          NA                                   4     NA                NA          3
4             NA           5                                   5     NA                NA         NA
5              3           5                                  NA     NA                NA         NA
The rating matrix stores users as rows and movies as columns. Missing values represent movies that a user did not rate in the training data.
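For context on how little signal the similarity step has to work with, the share of empty cells can be computed directly. A quick optional check, assuming rating_matrix from above:
# 5 users x 6 movies = 30 cells; 13 training ratings leave 17 cells empty (about 57%).
dim(rating_matrix)
mean(is.na(rating_matrix))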
Item-to-Item Similarity Function
cosine_similarity <- function(x, y) {
common <- !is.na(x) & !is.na(y)
if (sum(common) < 2) {
return(NA_real_)
}
numerator <- sum(x[common] * y[common])
denominator <- sqrt(sum(x[common]^2)) * sqrt(sum(y[common]^2))
if (denominator == 0) {
return(NA_real_)
}
numerator / denominator
}
Cosine similarity is used to measure how similar two movies are based on the rating patterns of users who rated both movies. Pairs with fewer than two co-raters return NA, since a single shared rating is not enough to estimate similarity.
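To make the computation concrete, consider the two movies with co-raters in the training matrix. An illustrative call, assuming rating_matrix from above; only users who rated both movies contribute:
# Users 1 and 5 rated both Dune: Part Two (5, 3) and Oppenheimer (5, 5), so the
# similarity is (5*5 + 3*5) / (sqrt(5^2 + 3^2) * sqrt(5^2 + 5^2)), about 0.97.
cosine_similarity(rating_matrix[, "Dune: Part Two"], rating_matrix[, "Oppenheimer"])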
Calculate Item-to-Item Similarity Matrix
movie_names <- colnames(rating_matrix)
item_similarity <- matrix(
NA_real_,
nrow = length(movie_names),
ncol = length(movie_names),
dimnames = list(movie_names, movie_names)
)
for (i in seq_along(movie_names)) {
for (j in seq_along(movie_names)) {
if (i == j) {
item_similarity[i, j] <- 1
} else {
item_similarity[i, j] <- cosine_similarity(
rating_matrix[, i],
rating_matrix[, j]
)
}
}
}
item_similarity
                                    Dune: Part Two Oppenheimer Spider-Man: Across the Spider-Verse Barbie Top Gun: Maverick The Batman
Dune: Part Two                           1.0000000   0.9701425                            0.9938837     NA                NA         NA
Oppenheimer                              0.9701425   1.0000000                            0.9938837     NA                NA         NA
Spider-Man: Across the Spider-Verse      0.9938837   0.9938837                            1.0000000     NA                NA         NA
Barbie                                          NA          NA                                   NA      1                NA         NA
Top Gun: Maverick                               NA          NA                                   NA     NA                 1         NA
The Batman                                      NA          NA                                   NA     NA                NA          1
This matrix shows the similarity between each pair of movies, with higher values indicating stronger similarity in user rating behavior. The NA entries occur where fewer than two users rated both movies in the training data; in this small sample that affects every pair involving Barbie, Top Gun: Maverick, or The Batman.
Prediction Function
global_mean <- mean(train_ratings$rating, na.rm = TRUE)
predict_item_cf <- function(user_id_value, movie_id_value, rating_matrix, item_similarity, movies_df, global_mean) {
target_movie <- movies_df %>%
filter(movie_id == movie_id_value) %>%
pull(title)
if (length(target_movie) == 0 || !(target_movie %in% colnames(rating_matrix))) {
return(global_mean)
}
user_key <- as.character(user_id_value)
if (!(user_key %in% rownames(rating_matrix))) {
return(global_mean)
}
user_ratings <- rating_matrix[user_key, ]
rated_movies <- names(user_ratings)[!is.na(user_ratings)]
if (length(rated_movies) == 0) {
return(global_mean)
}
similarities <- item_similarity[target_movie, rated_movies]
valid <- !is.na(similarities) & similarities > 0
if (sum(valid) == 0) {
return(global_mean)
}
predicted <- sum(similarities[valid] * user_ratings[rated_movies][valid]) /
sum(abs(similarities[valid]))
predicted <- max(min(predicted, 5), 1)
predicted
}
The function predicts how a user may rate a target movie using a similarity-weighted average of that user's ratings for similar movies, clamped to the 1-to-5 scale. When no positive similarities are available, it falls back to the global training mean.
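An illustrative call, assuming the objects defined above are in scope: Oppenheimer has no usable similarity to the two movies Ali rated in training (Barbie and Top Gun: Maverick), so the prediction falls back to the global mean:
# Ali is user_id 2 and Oppenheimer is movie_id 2. Both relevant similarities are
# NA, so the function returns global_mean (about 4.31).
predict_item_cf(2, 2, rating_matrix, item_similarity, movies_df, global_mean)
global_mean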
Predict Ratings for Test Set
test_predictions <- test_ratings %>%
rowwise() %>%
mutate(
predicted_rating = predict_item_cf(
user_id,
movie_id,
rating_matrix,
item_similarity,
movies_df,
global_mean
)
) %>%
ungroup() %>%
left_join(users_df, by = "user_id") %>%
left_join(movies_df, by = "movie_id") %>%
select(name, title, actual_rating = rating, predicted_rating)
test_predictions
# A tibble: 4 × 4
name title actual_rating predicted_rating
<chr> <chr> <dbl> <dbl>
1 Ali Oppenheimer 4 4.31
2 Abdul Top Gun: Maverick 4 4.31
3 Samir Barbie 4 4.31
4 Daniyal The Batman 4 4.31
This table compares the held-out actual ratings with the recommender's predicted ratings. In this particular split, every test prediction equals the global training mean (about 4.31), because the sparse similarity matrix provides no usable neighbors for any of the held-out movies.
Model Evaluation
evaluation_metrics <- test_predictions %>%
summarise(
RMSE = sqrt(mean((actual_rating - predicted_rating)^2, na.rm = TRUE)),
MAE = mean(abs(actual_rating - predicted_rating), na.rm = TRUE)
)
evaluation_metrics
# A tibble: 1 × 2
RMSE MAE
<dbl> <dbl>
1 0.308 0.308
RMSE and MAE measure how close the predicted ratings are to the actual ratings in the test set. Here the two metrics coincide because every test residual is identical: each actual rating is 4 and each prediction is the global mean.
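A quick hand check, assuming global_mean from above: with all four residuals equal, both the mean absolute error and the root-mean-square error reduce to a single absolute difference:
# Each residual is |4 - 4.3077|, about 0.308, matching both metrics above.
abs(4 - global_mean)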
Predicted vs Actual Ratings Visualization
ggplot(test_predictions, aes(x = actual_rating, y = predicted_rating)) +
geom_point(size = 3) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(
title = "Predicted vs Actual Ratings on Test Set",
x = "Actual Rating",
y = "Predicted Rating"
)
Points closer to the dashed line represent more accurate predictions. This plot visually supports the RMSE and MAE evaluation results.
Generate All User-Movie Candidate Pairs
all_pairs <- users_df %>%
select(user_id, name) %>%
crossing(movies_df %>% select(movie_id, title, release_year)) %>%
left_join(ratings_clean, by = c("user_id", "movie_id"))
recommendation_candidates <- all_pairs %>%
filter(is.na(rating))
Only movies that a user has not already rated are considered as recommendation candidates.
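An equivalent formulation, offered only as a sketch and not used downstream, drops already-rated pairs in one step with dplyr's anti_join:
# Same candidate set as crossing() + filter(is.na(rating)) above: keep only
# user-movie pairs with no observed rating.
recommendation_candidates_alt <- users_df %>%
  select(user_id, name) %>%
  crossing(movies_df %>% select(movie_id, title, release_year)) %>%
  anti_join(ratings_clean, by = c("user_id", "movie_id"))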
Generate Top-3 Personalized Recommendations
top_n <- 3
recommendations_all <- recommendation_candidates %>%
rowwise() %>%
mutate(
predicted_rating = predict_item_cf(
user_id,
movie_id,
rating_matrix,
item_similarity,
movies_df,
global_mean
)
) %>%
ungroup() %>%
group_by(user_id, name) %>%
arrange(desc(predicted_rating), title, .by_group = TRUE) %>%
mutate(recommendation_rank = row_number()) %>%
slice_head(n = top_n) %>%
ungroup() %>%
select(
name,
recommendation_rank,
title,
release_year,
predicted_rating
)
recommendations_all
# A tibble: 13 × 5
   name     recommendation_rank title                               release_year predicted_rating
   <chr>                  <int> <chr>                                      <int>            <dbl>
 1 Muhammad                   1 Barbie                                      2023             4.31
 2 Muhammad                   2 The Batman                                  2022             4.31
 3 Muhammad                   3 Top Gun: Maverick                           2022             4.31
 4 Ali                        1 Dune: Part Two                              2024             4.31
 5 Ali                        2 Spider-Man: Across the Spider-Verse         2023             4.31
 6 Ali                        3 The Batman                                  2022             4.31
 7 Abdul                      1 Barbie                                      2023             4.31
 8 Abdul                      2 Oppenheimer                                 2023             4
 9 Samir                      1 Dune: Part Two                              2024             5
10 Samir                      2 The Batman                                  2022             4.31
11 Samir                      3 Top Gun: Maverick                           2022             4.31
12 Daniyal                    1 Barbie                                      2023             4.31
13 Daniyal                    2 Spider-Man: Across the Spider-Verse         2023             4
The table above contains the Top-3 personalized movie recommendations for each user; Abdul and Daniyal receive only two because each has already rated four of the six movies. These recommendations are personalized because the predicted ratings depend on each user's previously observed ratings and movie-to-movie similarity patterns.
Top Recommendations Visualization
recommendations_all %>%
ggplot(aes(x = reorder(title, predicted_rating), y = predicted_rating)) +
geom_col() +
coord_flip() +
facet_wrap(~ name) +
labs(
title = "Top-3 Personalized Movie Recommendations by User",
x = "Recommended Movie",
y = "Predicted Rating"
)
This chart visually summarizes the personalized recommendation output for each user.
Discussion
The item-to-item collaborative filtering recommender generates personalized movie recommendations by comparing movies based on rating patterns. Unlike the previous Global Baseline Estimate model, which mainly depended on overall averages and bias terms, this approach uses a user’s own rating history to predict how they may respond to unseen movies.
The hold-out evaluation provides a direct way to compare predicted ratings with actual ratings. Because the dataset is small and sparse, the evaluation metrics should be interpreted carefully. A small number of test cases can make RMSE and MAE sensitive to individual ratings. However, the workflow still demonstrates the main logic of a personalized recommender system: training on observed ratings, predicting unseen or held-out ratings, and ranking recommendations for each user.
Disconnect Database
dbDisconnect(con)
Conclusion
In this assignment, I implemented a personalized movie recommender system using item-to-item collaborative filtering. The recommender calculated movie similarity from user rating patterns, predicted ratings for held-out user-movie pairs, evaluated performance using RMSE and MAE, and generated ranked Top-3 movie recommendations for each user.
Overall, this assignment extends the earlier non-personalized Global Baseline Estimate by producing recommendations that depend on individual user preferences. The final results show both prediction performance and personalized recommendation output, while the visualizations provide additional support for interpreting the dataset, the model evaluation, and the recommendation rankings.