Assignment 11 CodeBase: Personalized Recommender System
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
Approach Summary
This assignment builds a personalized recommender system using the same survey-based movie rating dataset from the previous assignment. Unlike the Global Baseline Estimate model, which produced non-personalized recommendations, this model uses item-to-item collaborative filtering to generate user-specific recommendations.
The recommender calculates similarity between movies based on user rating patterns and predicts ratings for unseen movies using weighted averages of similar items.
The model outputs a ranked Top-3 list of recommended movies for each user. Performance is evaluated using a hold-out validation approach with RMSE and MAE metrics.
Additional visualizations are included to support data understanding and model evaluation.
Objective
The objective of this assignment is to build a personalized movie recommendation system using the same survey rating data from the previous Global Baseline Estimate assignment. Unlike the previous baseline model, this assignment uses item-to-item collaborative filtering to generate personalized recommendations based on each user’s rating history.
The recommender will output ranked Top-3 movie recommendations for each user and evaluate prediction performance using a hold-out test set with RMSE and MAE.
Loading Libraries
library(DBI)
library(RPostgres)
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
library(ggplot2)
library(knitr)
Connect to PostgreSQL
The PostgreSQL database used in this assignment was created in the previous Global Baseline Estimate assignment. The schema file is included in the GitHub repository for reproducibility, but the database is not recreated in this codebase.
con <- dbConnect(
RPostgres::Postgres(),
host = "localhost",
port = 5432,
dbname = "Movie _Ratings",
user = "postgres",
password = Sys.getenv("PGPASSWORD")
)
dbListTables(con)
[1] "movies"  "ratings" "users"
Load Tables into R
users_df <- dbGetQuery(con, "SELECT * FROM users;")
movies_df <- dbGetQuery(con, "SELECT * FROM movies;")
ratings_df <- dbGetQuery(con, "SELECT * FROM ratings;")
Basic Sanity Checks
glimpse(users_df)
Rows: 5
Columns: 2
$ user_id <int> 1, 2, 3, 4, 5
$ name <chr> "Muhammad", "Ali", "Abdul", "Samir", "Daniyal"
glimpse(movies_df)
Rows: 6
Columns: 3
$ movie_id <int> 1, 2, 3, 4, 5, 6
$ title <chr> "Dune: Part Two", "Oppenheimer", "Barbie", "The Batman", …
$ release_year <int> 2024, 2023, 2023, 2022, 2023, 2022
glimpse(ratings_df)
Rows: 17
Columns: 4
$ rating_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
$ user_id <int> 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5
$ movie_id <int> 1, 2, 5, 2, 3, 6, 1, 4, 5, 6, 2, 3, 5, 1, 4, 2, 6
$ rating <int> 5, 5, 4, 4, 3, 5, 4, 3, 4, 4, 5, 4, 5, 3, 4, 5, 5
Prepare Clean Ratings Data
ratings_clean <- ratings_df %>%
select(user_id, movie_id, rating) %>%
mutate(rating = as.numeric(rating)) %>%
filter(!is.na(rating))
ratings_joined <- ratings_clean %>%
left_join(users_df, by = "user_id") %>%
left_join(movies_df, by = "movie_id") %>%
select(user_id, name, movie_id, title, release_year, rating) %>%
arrange(name, title)
ratings_joined
   user_id     name movie_id                               title release_year rating
1        3    Abdul        1                      Dune: Part Two         2024      4
2        3    Abdul        5 Spider-Man: Across the Spider-Verse         2023      4
3        3    Abdul        4                          The Batman         2022      3
4        3    Abdul        6                   Top Gun: Maverick         2022      4
5        2      Ali        3                              Barbie         2023      3
6        2      Ali        2                         Oppenheimer         2023      4
7        2      Ali        6                   Top Gun: Maverick         2022      5
8        5  Daniyal        1                      Dune: Part Two         2024      3
9        5  Daniyal        2                         Oppenheimer         2023      5
10       5  Daniyal        4                          The Batman         2022      4
11       5  Daniyal        6                   Top Gun: Maverick         2022      5
12       1 Muhammad        1                      Dune: Part Two         2024      5
13       1 Muhammad        2                         Oppenheimer         2023      5
14       1 Muhammad        5 Spider-Man: Across the Spider-Verse         2023      4
15       4    Samir        3                              Barbie         2023      4
16       4    Samir        2                         Oppenheimer         2023      5
17       4    Samir        5 Spider-Man: Across the Spider-Verse         2023      5
This step prepares a clean ratings dataset and joins the relational tables so that user names and movie titles are visible in the analysis.
Validate Rating Scale
ratings_clean %>%
summarise(
total_ratings = n(),
min_rating = min(rating),
max_rating = max(rating),
average_rating = mean(rating),
missing_ratings = sum(is.na(rating))
)
 total_ratings min_rating max_rating average_rating missing_ratings
1 17 3 5 4.235294 0
The validation confirms that the rating values are numeric and fall within the expected 1 to 5 survey scale.
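To make this check enforceable on a future data refresh, the same conditions can be asserted so the pipeline fails loudly on bad input. This is a small optional sketch, not part of the original pipeline:
# Halt the analysis if any rating is missing or falls outside the 1-to-5 survey scale.
stopifnot(
  is.numeric(ratings_clean$rating),
  !anyNA(ratings_clean$rating),
  all(ratings_clean$rating >= 1 & ratings_clean$rating <= 5)
)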
Rating Distribution Visualization
ggplot(ratings_clean, aes(x = factor(rating))) +
geom_bar() +
labs(
title = "Distribution of Movie Ratings",
x = "Rating",
y = "Count"
)
This plot shows how frequently each rating value appears in the dataset. It provides useful context before building the recommender.
Create Training and Testing Sets
set.seed(607)
ratings_split <- ratings_clean %>%
mutate(row_id = row_number())
test_ids <- sample(
ratings_split$row_id,
size = ceiling(0.20 * nrow(ratings_split))
)
train_ratings <- ratings_split %>%
filter(!row_id %in% test_ids) %>%
select(-row_id)
test_ratings <- ratings_split %>%
filter(row_id %in% test_ids) %>%
select(-row_id)
nrow(train_ratings)
[1] 13
nrow(test_ratings)
[1] 4
A hold-out validation method is used. The model is trained on 80% of the observed ratings and evaluated on the remaining 20%.
Create User-Movie Rating Matrix
rating_matrix <- train_ratings %>%
left_join(movies_df, by = "movie_id") %>%
select(user_id, title, rating) %>%
pivot_wider(
names_from = title,
values_from = rating
) %>%
column_to_rownames("user_id") %>%
as.matrix()
rating_matrix
  Dune: Part Two Oppenheimer Spider-Man: Across the Spider-Verse Barbie Top Gun: Maverick The Batman
1              5           5                                   4     NA                NA         NA
2             NA          NA                                  NA      3                 5         NA
3              4          NA                                   4     NA                NA          3
4             NA           5                                   5     NA                NA         NA
5              3           5                                  NA     NA                NA         NA
The rating matrix stores users as rows and movies as columns. Missing values represent movies that a user did not rate in the training data.
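For context on how little signal the similarity step has to work with, the share of empty cells can be computed directly. A quick optional check, assuming rating_matrix from above:
# 5 users x 6 movies = 30 cells; 13 training ratings leave 17 cells empty (about 57%).
dim(rating_matrix)
mean(is.na(rating_matrix))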
Item-to-Item Similarity Function
cosine_similarity <- function(x, y) {
common <- !is.na(x) & !is.na(y)
if (sum(common) < 2) {
return(NA_real_)
}
numerator <- sum(x[common] * y[common])
denominator <- sqrt(sum(x[common]^2)) * sqrt(sum(y[common]^2))
if (denominator == 0) {
return(NA_real_)
}
numerator / denominator
}
Cosine similarity is used to measure how similar two movies are based on the rating patterns of users who rated both movies. Pairs with fewer than two co-raters return NA, since a single shared rating is not enough to estimate similarity.
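To make the computation concrete, consider the two movies with co-raters in the training matrix. An illustrative call, assuming rating_matrix from above; only users who rated both movies contribute:
# Users 1 and 5 rated both Dune: Part Two (5, 3) and Oppenheimer (5, 5), so the
# similarity is (5*5 + 3*5) / (sqrt(5^2 + 3^2) * sqrt(5^2 + 5^2)), about 0.97.
cosine_similarity(rating_matrix[, "Dune: Part Two"], rating_matrix[, "Oppenheimer"])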
Calculate Item-to-Item Similarity Matrix
movie_names <- colnames(rating_matrix)
item_similarity <- matrix(
NA_real_,
nrow = length(movie_names),
ncol = length(movie_names),
dimnames = list(movie_names, movie_names)
)
for (i in seq_along(movie_names)) {
for (j in seq_along(movie_names)) {
if (i == j) {
item_similarity[i, j] <- 1
} else {
item_similarity[i, j] <- cosine_similarity(
rating_matrix[, i],
rating_matrix[, j]
)
}
}
}
item_similarity
                                    Dune: Part Two Oppenheimer Spider-Man: Across the Spider-Verse Barbie Top Gun: Maverick The Batman
Dune: Part Two                           1.0000000   0.9701425                            0.9938837     NA                NA         NA
Oppenheimer                              0.9701425   1.0000000                            0.9938837     NA                NA         NA
Spider-Man: Across the Spider-Verse      0.9938837   0.9938837                            1.0000000     NA                NA         NA
Barbie                                          NA          NA                                   NA      1                NA         NA
Top Gun: Maverick                               NA          NA                                   NA     NA                 1         NA
The Batman                                      NA          NA                                   NA     NA                NA          1
This matrix shows the similarity between each pair of movies, with higher values indicating stronger similarity in user rating behavior. The NA entries occur where fewer than two users rated both movies in the training data; in this small sample that affects every pair involving Barbie, Top Gun: Maverick, or The Batman.
Prediction Function
global_mean <- mean(train_ratings$rating, na.rm = TRUE)
predict_item_cf <- function(user_id_value, movie_id_value, rating_matrix, item_similarity, movies_df, global_mean) {
target_movie <- movies_df %>%
filter(movie_id == movie_id_value) %>%
pull(title)
if (length(target_movie) == 0 || !(target_movie %in% colnames(rating_matrix))) {
return(global_mean)
}
user_key <- as.character(user_id_value)
if (!(user_key %in% rownames(rating_matrix))) {
return(global_mean)
}
user_ratings <- rating_matrix[user_key, ]
rated_movies <- names(user_ratings)[!is.na(user_ratings)]
if (length(rated_movies) == 0) {
return(global_mean)
}
similarities <- item_similarity[target_movie, rated_movies]
valid <- !is.na(similarities) & similarities > 0
if (sum(valid) == 0) {
return(global_mean)
}
predicted <- sum(similarities[valid] * user_ratings[rated_movies][valid]) /
sum(abs(similarities[valid]))
predicted <- max(min(predicted, 5), 1)
predicted
}
The function predicts how a user may rate a target movie using a similarity-weighted average of that user's ratings for similar movies, clamped to the 1-to-5 scale. When no positive similarities are available, it falls back to the global training mean.
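An illustrative call, assuming the objects defined above are in scope: Oppenheimer has no usable similarity to the two movies Ali rated in training (Barbie and Top Gun: Maverick), so the prediction falls back to the global mean:
# Ali is user_id 2 and Oppenheimer is movie_id 2. Both relevant similarities are
# NA, so the function returns global_mean (about 4.31).
predict_item_cf(2, 2, rating_matrix, item_similarity, movies_df, global_mean)
global_mean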
Predict Ratings for Test Set
test_predictions <- test_ratings %>%
rowwise() %>%
mutate(
predicted_rating = predict_item_cf(
user_id,
movie_id,
rating_matrix,
item_similarity,
movies_df,
global_mean
)
) %>%
ungroup() %>%
left_join(users_df, by = "user_id") %>%
left_join(movies_df, by = "movie_id") %>%
select(name, title, actual_rating = rating, predicted_rating)
test_predictions
# A tibble: 4 × 4
name title actual_rating predicted_rating
<chr> <chr> <dbl> <dbl>
1 Ali Oppenheimer 4 4.31
2 Abdul Top Gun: Maverick 4 4.31
3 Samir Barbie 4 4.31
4 Daniyal The Batman 4 4.31
This table compares the held-out actual ratings with the recommender's predicted ratings. In this particular split, every test prediction equals the global training mean (about 4.31), because the sparse similarity matrix provides no usable neighbors for any of the held-out movies.
Model Evaluation
evaluation_metrics <- test_predictions %>%
summarise(
RMSE = sqrt(mean((actual_rating - predicted_rating)^2, na.rm = TRUE)),
MAE = mean(abs(actual_rating - predicted_rating), na.rm = TRUE)
)
evaluation_metrics
# A tibble: 1 × 2
RMSE MAE
<dbl> <dbl>
1 0.308 0.308
RMSE and MAE measure how close the predicted ratings are to the actual ratings in the test set. Here the two metrics coincide because every test residual is identical: each actual rating is 4 and each prediction is the global mean.
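A quick hand check, assuming global_mean from above: with all four residuals equal, both the mean absolute error and the root-mean-square error reduce to a single absolute difference:
# Each residual is |4 - 4.3077|, about 0.308, matching both metrics above.
abs(4 - global_mean)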
Predicted vs Actual Ratings Visualization
ggplot(test_predictions, aes(x = actual_rating, y = predicted_rating)) +
geom_point(size = 3) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(
title = "Predicted vs Actual Ratings on Test Set",
x = "Actual Rating",
y = "Predicted Rating"
)
Points closer to the dashed line represent more accurate predictions. This plot visually supports the RMSE and MAE evaluation results.
Generate All User-Movie Candidate Pairs
all_pairs <- users_df %>%
select(user_id, name) %>%
crossing(movies_df %>% select(movie_id, title, release_year)) %>%
left_join(ratings_clean, by = c("user_id", "movie_id"))
recommendation_candidates <- all_pairs %>%
filter(is.na(rating))
Only movies that a user has not already rated are considered as recommendation candidates.
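An equivalent formulation, offered only as a sketch and not used downstream, drops already-rated pairs in one step with dplyr's anti_join:
# Same candidate set as crossing() + filter(is.na(rating)) above: keep only
# user-movie pairs with no observed rating.
recommendation_candidates_alt <- users_df %>%
  select(user_id, name) %>%
  crossing(movies_df %>% select(movie_id, title, release_year)) %>%
  anti_join(ratings_clean, by = c("user_id", "movie_id"))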
Generate Top-3 Personalized Recommendations
top_n <- 3
recommendations_all <- recommendation_candidates %>%
rowwise() %>%
mutate(
predicted_rating = predict_item_cf(
user_id,
movie_id,
rating_matrix,
item_similarity,
movies_df,
global_mean
)
) %>%
ungroup() %>%
group_by(user_id, name) %>%
arrange(desc(predicted_rating), title, .by_group = TRUE) %>%
mutate(recommendation_rank = row_number()) %>%
slice_head(n = top_n) %>%
ungroup() %>%
select(
name,
recommendation_rank,
title,
release_year,
predicted_rating
)
recommendations_all
# A tibble: 13 × 5
   name     recommendation_rank title                               release_year predicted_rating
   <chr>                  <int> <chr>                                      <int>            <dbl>
 1 Muhammad                   1 Barbie                                      2023             4.31
 2 Muhammad                   2 The Batman                                  2022             4.31
 3 Muhammad                   3 Top Gun: Maverick                           2022             4.31
 4 Ali                        1 Dune: Part Two                              2024             4.31
 5 Ali                        2 Spider-Man: Across the Spider-Verse         2023             4.31
 6 Ali                        3 The Batman                                  2022             4.31
 7 Abdul                      1 Barbie                                      2023             4.31
 8 Abdul                      2 Oppenheimer                                 2023             4
 9 Samir                      1 Dune: Part Two                              2024             5
10 Samir                      2 The Batman                                  2022             4.31
11 Samir                      3 Top Gun: Maverick                           2022             4.31
12 Daniyal                    1 Barbie                                      2023             4.31
13 Daniyal                    2 Spider-Man: Across the Spider-Verse         2023             4
The table above contains the Top-3 personalized movie recommendations for each user; Abdul and Daniyal receive only two because each has already rated four of the six movies. These recommendations are personalized because the predicted ratings depend on each user's previously observed ratings and movie-to-movie similarity patterns.
Top Recommendations Visualization
recommendations_all %>%
ggplot(aes(x = reorder(title, predicted_rating), y = predicted_rating)) +
geom_col() +
coord_flip() +
facet_wrap(~ name) +
labs(
title = "Top-3 Personalized Movie Recommendations by User",
x = "Recommended Movie",
y = "Predicted Rating"
)
This chart visually summarizes the personalized recommendation output for each user.
Discussion
The item-to-item collaborative filtering recommender generates personalized movie recommendations by comparing movies based on rating patterns. Unlike the previous Global Baseline Estimate model, which mainly depended on overall averages and bias terms, this approach uses a user’s own rating history to predict how they may respond to unseen movies.
The hold-out evaluation provides a direct way to compare predicted ratings with actual ratings. Because the dataset is small and sparse, the evaluation metrics should be interpreted carefully. A small number of test cases can make RMSE and MAE sensitive to individual ratings. However, the workflow still demonstrates the main logic of a personalized recommender system: training on observed ratings, predicting unseen or held-out ratings, and ranking recommendations for each user.
Disconnect Database
dbDisconnect(con)
Conclusion
In this assignment, I implemented a personalized movie recommender system using item-to-item collaborative filtering. The recommender calculated movie similarity from user rating patterns, predicted ratings for held-out user-movie pairs, evaluated performance using RMSE and MAE, and generated ranked Top-3 movie recommendations for each user.
Overall, this assignment extends the earlier non-personalized Global Baseline Estimate by producing recommendations that depend on individual user preferences. The final results show both prediction performance and personalized recommendation output, while the visualizations provide additional support for interpreting the dataset, the model evaluation, and the recommendation rankings.