Personalized Movie Recommender System

Author

Madina Kudanova

Introduction

This assignment builds a personalized recommendation system using collaborative filtering. Unlike global baseline models that provide the same recommendations to everyone, this system learns individual user preferences to generate customized movie suggestions.

Key Objectives:

  1. Implement item-based collaborative filtering (IBCF)
  2. Generate personalized top-N recommendations
  3. Evaluate model performance using hold-out validation
  4. Compare predicted ratings against actual user ratings

Setup and Data Loading

# Load required libraries
library(recommenderlab)
library(dplyr)
library(ggplot2)
library(tidyr)

# Set seed for reproducibility
set.seed(123)
# Load the movie ratings survey data
ratings_raw <- read.csv("https://raw.githubusercontent.com/MKudanova/Data607/refs/heads/main/w11/Movie%20Ratings%20Survey.csv")

# Display data structure
cat("Dataset dimensions:", dim(ratings_raw), "\n")
Dataset dimensions: 26 7 
cat("Number of users:", nrow(ratings_raw), "\n")
Number of users: 26 
cat("Number of movies:", ncol(ratings_raw) - 1, "\n\n")
Number of movies: 6 
head(ratings_raw[, 1:5])
                  Timestamp
1 2026/02/08 3:42:58 PM EST
2 2026/02/08 3:43:03 PM EST
3 2026/02/08 3:43:10 PM EST
4 2026/02/08 3:43:14 PM EST
5 2026/02/08 3:43:25 PM EST
6 2026/02/08 3:43:31 PM EST
  Please.rate.only.the.movies.you.ve.seen..Leave.the.rest.blank..1...Poor.2...Average.3..Good.4...Very.good.5..Masterpiece...The.Godfather.
1                                                                                                                                         3
2                                                                                                                                         4
3                                                                                                                                        NA
4                                                                                                                                        NA
5                                                                                                                                         4
6                                                                                                                                         3
  Please.rate.only.the.movies.you.ve.seen..Leave.the.rest.blank..1...Poor.2...Average.3..Good.4...Very.good.5..Masterpiece...Casablanca.
1                                                                                                                                     NA
2                                                                                                                                      3
3                                                                                                                                     NA
4                                                                                                                                      3
5                                                                                                                                      3
6                                                                                                                                     NA
  Please.rate.only.the.movies.you.ve.seen..Leave.the.rest.blank..1...Poor.2...Average.3..Good.4...Very.good.5..Masterpiece...Titanic.
1                                                                                                                                   4
2                                                                                                                                   4
3                                                                                                                                   4
4                                                                                                                                   3
5                                                                                                                                  NA
6                                                                                                                                   5
  Please.rate.only.the.movies.you.ve.seen..Leave.the.rest.blank..1...Poor.2...Average.3..Good.4...Very.good.5..Masterpiece...Forrest.Gump.
1                                                                                                                                       NA
2                                                                                                                                        5
3                                                                                                                                        5
4                                                                                                                                        3
5                                                                                                                                       NA
6                                                                                                                                        4

Data Preprocessing

# Remove timestamp column
ratings_df <- ratings_raw[, -1]

# Extract movie names - they come after "...Masterpiece..."
colnames(ratings_df) <- gsub(".*Masterpiece\\.{3}(.*)", "\\1", colnames(ratings_df))

# Check result
cat("Cleaned column names:\n")
Cleaned column names:
print(head(colnames(ratings_df)))
[1] "The.Godfather."      "Casablanca."         "Titanic."           
[4] "Forrest.Gump."       "The.Sound.of.Music." "Gone.with.the.Wind."
cat("\n")
# Convert to numeric matrix
ratings_matrix <- as.matrix(ratings_df)
mode(ratings_matrix) <- "numeric"

# Add user identifiers
rownames(ratings_matrix) <- paste0("User", seq_len(nrow(ratings_matrix)))

# Check for data quality
cat("Missing ratings:", sum(is.na(ratings_matrix)), "\n")
Missing ratings: 52 
cat("Total possible ratings:", nrow(ratings_matrix) * ncol(ratings_matrix), "\n")
Total possible ratings: 156 
cat("Sparsity:", round(sum(is.na(ratings_matrix)) / (nrow(ratings_matrix) * ncol(ratings_matrix)) * 100, 2), "%\n\n")
Sparsity: 33.33 %
# Convert to recommenderlab format
ratings_rrm <- as(ratings_matrix, "realRatingMatrix")
ratings_rrm
26 x 6 rating matrix of class 'realRatingMatrix' with 104 ratings.
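
The cleaned names above still carry the dots that read.csv substitutes for spaces and punctuation (for example "The.Godfather."). The sketch below shows one way to turn them into display titles. `clean_movie_names` is a hypothetical helper that is not part of the original analysis, and it assumes every run of dots stands for a single space. The rest of this report keeps the dotted names.

```r
# Hypothetical helper: turn survey column names into readable movie titles.
clean_movie_names <- function(x) {
  x <- sub(".*Masterpiece\\.{3}", "", x)  # drop the survey prompt prefix, if present
  x <- gsub("\\.+", " ", x)               # runs of dots become single spaces
  trimws(x)                               # remove the space left by the trailing dot
}

clean_movie_names(c("The.Godfather.", "Gone.with.the.Wind."))
```

An alternative is to call read.csv with check.names = FALSE, which keeps the original survey headers, so the titles can be extracted without dots in the first place.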

Data Exploration

# Rating distribution
rating_values <- as.vector(ratings_matrix)
rating_values <- rating_values[!is.na(rating_values)]

cat("Rating statistics:\n")
Rating statistics:
cat("  Mean:", round(mean(rating_values), 2), "\n")
  Mean: 3.47 
cat("  Median:", median(rating_values), "\n")
  Median: 3.5 
cat("  SD:", round(sd(rating_values), 2), "\n\n")
  SD: 1.21 
# Visualize rating distribution
ggplot(data.frame(rating = rating_values), aes(x = rating)) +
  geom_histogram(binwidth = 0.5, fill = "steelblue", color = "white") +
  labs(title = "Distribution of Movie Ratings",
       x = "Rating", y = "Count") +
  theme_minimal()

Model Training

Train-Test Split

# Create evaluation scheme
# - 80% training, 20% testing
# - given = -1 keeps all but one rating per test user as known; the withheld rating is the prediction target
# - goodRating = 3 treats ratings >= 3 ("Good" or better) as positive when evaluating top-N lists
scheme <- evaluationScheme(
  data = ratings_rrm,
  method = "split",
  train = 0.8,
  given = -1,
  goodRating = 3
)

# Extract datasets
train_set <- getData(scheme, "train")
known_set <- getData(scheme, "known")
unknown_set <- getData(scheme, "unknown")

cat("Training set:", dim(train_set), "\n")
Training set: 20 6 
cat("Test set (known):", dim(known_set), "\n")
Test set (known): 6 6 
cat("Test set (unknown):", dim(unknown_set), "\n")
Test set (unknown): 6 6 
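
To make the given = -1 setting concrete, the sketch below mimics the "all but one" idea for a single user in base R: one rated movie is withheld as the unknown target and the rest stay known. This is a conceptual illustration, not recommenderlab's internal code. The helper name `split_all_but_one` and the toy ratings are hypothetical.

```r
# Hypothetical helper: withhold one randomly chosen rating from a user's vector.
split_all_but_one <- function(user_ratings) {
  rated <- which(!is.na(user_ratings))
  stopifnot(length(rated) >= 2)                     # need one known and one withheld rating
  held_out <- rated[sample.int(length(rated), 1)]   # position of the withheld item

  known <- user_ratings
  known[held_out] <- NA                             # hide the target from the model

  unknown <- rep(NA_real_, length(user_ratings))
  names(unknown) <- names(user_ratings)
  unknown[held_out] <- user_ratings[held_out]       # keep only the target for scoring

  list(known = known, unknown = unknown)
}

set.seed(123)
toy_user <- c("The.Godfather." = 4, "Casablanca." = 3,
              "Titanic." = NA, "Forrest.Gump." = 5)   # toy ratings, not survey data
parts <- split_all_but_one(toy_user)
parts$known
parts$unknown
```

The model only sees `known` when predicting, and the error metrics compare its prediction against the single value in `unknown`.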

Build IBCF Model

# Train Item-Based Collaborative Filtering model
# k = 5: use 5 most similar items for prediction
# method = "cosine": use cosine similarity metric
ibcf_model <- Recommender(
  data = train_set,
  method = "IBCF",
  parameter = list(k = 5, method = "cosine")
)

cat("Model trained successfully!\n")
Model trained successfully!
print(ibcf_model)
Recommender of type 'IBCF' for 'realRatingMatrix' 
learned using 20 users.
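
As a rough illustration of the similarity step, the base R sketch below computes cosine similarity between two movie columns using only the users who rated both. The function name `cosine_items` and the toy matrix are hypothetical. As far as I am aware, recommenderlab centers each user's ratings by default before computing similarities and handles missing values in its own way, so these numbers are not expected to match the model's similarity matrix.

```r
# Cosine similarity between two item columns, restricted to co-rating users.
cosine_items <- function(ratings, i, j) {
  both <- !is.na(ratings[, i]) & !is.na(ratings[, j])
  if (!any(both)) return(NA_real_)          # no co-ratings: similarity undefined
  a <- ratings[both, i]
  b <- ratings[both, j]
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy data: rows are users, columns are movies (not survey data).
toy <- cbind(
  A = c(5, 4, NA, 1),
  B = c(5, 4, 2, NA),
  C = c(1, NA, 5, 5)
)

cosine_items(toy, "A", "B")   # users 1 and 2 rated both movies identically
cosine_items(toy, "A", "C")   # users 1 and 4 rated the two movies in opposite directions
```

Movies rated alike by the same users score close to 1, while movies rated in opposite directions score lower.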

Model Evaluation

Generate Predictions

# Predict ratings for test users
predicted_ratings <- predict(
  object = ibcf_model,
  newdata = known_set,
  type = "ratings"
)

# Convert to matrices for comparison
pred_matrix <- as(predicted_ratings, "matrix")
true_matrix <- as(unknown_set, "matrix")

# Verify alignment
cat("Prediction matrix dimensions:", dim(pred_matrix), "\n")
Prediction matrix dimensions: 6 6 
cat("True matrix dimensions:", dim(true_matrix), "\n")
True matrix dimensions: 6 6 
cat("Dimensions match:", identical(dim(pred_matrix), dim(true_matrix)), "\n")
Dimensions match: TRUE 
cat("Row names match:", identical(rownames(pred_matrix), rownames(true_matrix)), "\n")
Row names match: TRUE 
cat("Column names match:", identical(colnames(pred_matrix), colnames(true_matrix)), "\n\n")
Column names match: TRUE 

Calculate Error Metrics

# Extract matching predictions and actuals
# Only compare where we have both predicted and actual ratings
comparison_df <- data.frame(
  actual = as.vector(true_matrix),
  predicted = as.vector(pred_matrix)
) %>%
  filter(!is.na(actual), !is.na(predicted))

cat("Number of comparable predictions:", nrow(comparison_df), "\n\n")
Number of comparable predictions: 7 
# Calculate error metrics
rmse <- sqrt(mean((comparison_df$actual - comparison_df$predicted)^2))
mae <- mean(abs(comparison_df$actual - comparison_df$predicted))
mse <- mean((comparison_df$actual - comparison_df$predicted)^2)

# Display results
cat("MODEL PERFORMANCE\n")
MODEL PERFORMANCE
cat("RMSE (Root Mean Squared Error):", round(rmse, 4), "\n")
RMSE (Root Mean Squared Error): 2.011 
cat("MAE  (Mean Absolute Error):    ", round(mae, 4), "\n")
MAE  (Mean Absolute Error):     1.5994 
cat("MSE  (Mean Squared Error):     ", round(mse, 4), "\n\n")
MSE  (Mean Squared Error):      4.0441 
# Interpretation
cat("Interpretation:\n")
Interpretation:
cat("- On average, predictions are off by", round(mae, 2), "rating points\n")
- On average, predictions are off by 1.6 rating points
cat("- Lower values indicate better performance\n")
- Lower values indicate better performance
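
An error value is easier to judge next to a non-personalized reference point. The sketch below wraps the three metrics in a hypothetical helper, `error_metrics`, and compares toy model predictions with a predictor that always returns a global mean. All numbers here are invented for illustration and are not taken from the survey data.

```r
# Hypothetical helper returning the three error metrics used in this report.
error_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(RMSE = sqrt(mean(err^2)), MAE = mean(abs(err)), MSE = mean(err^2))
}

actual     <- c(4, 3, 5, 2)      # toy held-out ratings
predicted  <- c(3.5, 3, 4, 4)    # toy model predictions
train_mean <- 3.5                # toy global mean of the training ratings

model_scores    <- error_metrics(actual, predicted)
baseline_scores <- error_metrics(actual, rep(train_mean, length(actual)))

round(rbind(model = model_scores, global_mean = baseline_scores), 3)
```

In this toy case the model has the lower MAE but the higher RMSE, because RMSE penalizes its single 2-point miss more heavily. The same reasoning explains why the RMSE reported above is noticeably larger than the MAE.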

Visualize Prediction Accuracy

# Plot actual vs predicted ratings
ggplot(comparison_df, aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  geom_smooth(method = "lm", se = FALSE, color = "orange") +
  labs(title = "Actual vs Predicted Ratings",
       subtitle = paste("RMSE =", round(rmse, 3), "| MAE =", round(mae, 3)),
       x = "Actual Rating",
       y = "Predicted Rating") +
  theme_minimal() +
  coord_fixed(ratio = 1)
`geom_smooth()` using formula = 'y ~ x'

# Residual plot
comparison_df$residual <- comparison_df$actual - comparison_df$predicted

ggplot(comparison_df, aes(x = predicted, y = residual)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residual Plot",
       x = "Predicted Rating",
       y = "Residual (Actual - Predicted)") +
  theme_minimal()

Generate Recommendations

Top-N Recommendations

# Generate top-3 movie recommendations for each user
top_n <- 3

top_recommendations <- predict(
  object = ibcf_model,
  newdata = ratings_rrm,
  type = "topNList",
  n = top_n
)

# Convert to readable format
recommendation_list <- as(top_recommendations, "list")

# Display recommendations for first 10 users
cat("TOP RECOMMENDATIONS\n\n")
TOP RECOMMENDATIONS
for (i in 1:min(10, length(recommendation_list))) {
  user_name <- names(recommendation_list)[i]
  movies <- recommendation_list[[i]]
  
  cat("User:", user_name, "\n")
  if (length(movies) > 0) {
    cat("  Recommended:", paste(movies, collapse = ", "), "\n\n")
  } else {
    cat("  No recommendations available\n\n")
  }
}
User: 0 
  Recommended: Casablanca., Gone.with.the.Wind., Forrest.Gump. 

User: 1 
  Recommended: The.Sound.of.Music., Gone.with.the.Wind. 

User: 2 
  Recommended: The.Sound.of.Music., The.Godfather., Casablanca. 

User: 3 
  Recommended: The.Godfather., The.Sound.of.Music., Gone.with.the.Wind. 

User: 4 
  Recommended: Forrest.Gump., The.Sound.of.Music., Titanic. 

User: 5 
  Recommended: The.Sound.of.Music., Casablanca. 

User: 6 
  Recommended: Titanic., The.Sound.of.Music. 

User: 7 
  Recommended: The.Godfather., Gone.with.the.Wind., Titanic. 

User: 8 
  Recommended: Casablanca., Forrest.Gump., Gone.with.the.Wind. 

User: 9 
  Recommended: Gone.with.the.Wind. 
# Summary statistics
rec_lengths <- sapply(recommendation_list, length)
cat("Recommendation summary:\n")
Recommendation summary:
cat("  Users with", top_n, "recommendations:", sum(rec_lengths == top_n), "\n")
  Users with 3 recommendations: 11 
cat("  Users with <", top_n, "recommendations:", sum(rec_lengths < top_n), "\n")
  Users with < 3 recommendations: 15 
cat("  Users with 0 recommendations:", sum(rec_lengths == 0), "\n")
  Users with 0 recommendations: 5 

Create Recommendation Table

# Create a tidy table of all recommendations
recommendations_tidy <- data.frame(
  user = rep(names(recommendation_list), sapply(recommendation_list, length)),
  movie = unlist(recommendation_list),
  rank = unlist(lapply(recommendation_list, seq_along))
) %>%
  pivot_wider(names_from = rank, values_from = movie, names_prefix = "Rec_")

head(recommendations_tidy, 10)
# A tibble: 10 × 4
   user  Rec_1               Rec_2               Rec_3              
   <chr> <chr>               <chr>               <chr>              
 1 0     Casablanca.         Gone.with.the.Wind. Forrest.Gump.      
 2 1     The.Sound.of.Music. Gone.with.the.Wind. <NA>               
 3 2     The.Sound.of.Music. The.Godfather.      Casablanca.        
 4 3     The.Godfather.      The.Sound.of.Music. Gone.with.the.Wind.
 5 4     Forrest.Gump.       The.Sound.of.Music. Titanic.           
 6 5     The.Sound.of.Music. Casablanca.         <NA>               
 7 6     Titanic.            The.Sound.of.Music. <NA>               
 8 7     The.Godfather.      Gone.with.the.Wind. Titanic.           
 9 8     Casablanca.         Forrest.Gump.       Gone.with.the.Wind.
10 9     Gone.with.the.Wind. <NA>                <NA>               

Model Insights

Item Similarity Matrix

# Extract the learned item-item similarity matrix
similarity_matrix <- getModel(ibcf_model)$sim

# Convert to regular matrix for inspection
sim_matrix_view <- as.matrix(similarity_matrix)

# Display sample of similarities
cat("Sample of item-item similarities:\n\n")
Sample of item-item similarities:
print(round(sim_matrix_view[1:min(6, nrow(sim_matrix_view)),
                             1:min(6, ncol(sim_matrix_view))], 3))
                    The.Godfather. Casablanca. Titanic. Forrest.Gump.
The.Godfather.               0.000       0.605    0.423         0.303
Casablanca.                  0.605       0.000    0.356         0.170
Titanic.                     0.423       0.356    0.000         0.286
Forrest.Gump.                0.303       0.170    0.286         0.000
The.Sound.of.Music.          0.359       0.091    0.311         0.899
Gone.with.the.Wind.          0.250       0.361    0.274         0.429
                    The.Sound.of.Music. Gone.with.the.Wind.
The.Godfather.                    0.359               0.250
Casablanca.                       0.091               0.361
Titanic.                          0.311               0.274
Forrest.Gump.                     0.899               0.429
The.Sound.of.Music.               0.000               0.304
Gone.with.the.Wind.               0.304               0.000
# Find most similar movie pairs
if (ncol(sim_matrix_view) >= 2) {
  # Get upper triangle (avoid duplicates)
  sim_matrix_upper <- sim_matrix_view
  sim_matrix_upper[lower.tri(sim_matrix_upper, diag = TRUE)] <- NA
  
  # Find top 5 most similar pairs
  sim_df <- as.data.frame(as.table(sim_matrix_upper)) %>%
    filter(!is.na(Freq)) %>%
    arrange(desc(Freq)) %>%
    head(5)
  
  cat("\n\nMost similar movie pairs:\n")
  for (i in 1:nrow(sim_df)) {
    cat(i, ". ", as.character(sim_df$Var1[i]), " <-> ", 
        as.character(sim_df$Var2[i]), " (similarity: ", 
        round(sim_df$Freq[i], 3), ")\n", sep = "")
  }
}


Most similar movie pairs:
1. Forrest.Gump. <-> The.Sound.of.Music. (similarity: 0.899)
2. The.Godfather. <-> Casablanca. (similarity: 0.605)
3. Forrest.Gump. <-> Gone.with.the.Wind. (similarity: 0.429)
4. The.Godfather. <-> Titanic. (similarity: 0.423)
5. Casablanca. <-> Gone.with.the.Wind. (similarity: 0.361)

Summary

Key Findings

cat("MODEL SUMMARY\n\n")
MODEL SUMMARY
cat("Algorithm: Item-Based Collaborative Filtering (IBCF)\n")
Algorithm: Item-Based Collaborative Filtering (IBCF)
cat("Training approach: 80/20 hold-out validation\n")
Training approach: 80/20 hold-out validation
cat("Similarity metric: Cosine similarity\n")
Similarity metric: Cosine similarity
cat("Neighborhood size (k): 5\n\n")
Neighborhood size (k): 5
cat("Performance Metrics:\n")
Performance Metrics:
cat("  RMSE:", round(rmse, 4), "\n")
  RMSE: 2.011 
cat("  MAE: ", round(mae, 4), "\n\n")
  MAE:  1.5994 
cat("Output:\n")
Output:
cat("  Top-N recommendations per user: ", top_n, "\n")
  Top-N recommendations per user:  3 
cat("  Total users:", nrow(ratings_matrix), "\n")
  Total users: 26 
cat("  Total movies:", ncol(ratings_matrix), "\n")
  Total movies: 6 

Conclusion

This project implemented a personalized movie recommendation system using Item-Based Collaborative Filtering (IBCF). The system generates customized suggestions for each user based on their individual rating patterns, unlike non-personalized baseline models, which give every user the same list.

I chose IBCF because it calculates similarities between movies based on user ratings, then recommends new movies similar to ones a user already liked. With only 26 users but consistent movie ratings, an item-based approach is more stable than user-based filtering. The model uses cosine similarity and considers the 5 most similar items when making predictions. Because the dataset contains only 6 movies, k = 5 means that every other movie can act as a neighbor.
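
The prediction step described above can be sketched as a similarity-weighted average of the user's own ratings. The helper `predict_item_based` and the toy user are hypothetical. The three similarity values are copied from the item similarity matrix shown earlier. This is a simplified illustration that ignores the rating normalization recommenderlab may apply, not the package's implementation.

```r
# Hypothetical helper: predict a user's rating for `target` from the k most
# similar movies that the user has already rated.
predict_item_based <- function(user_ratings, sim, target, k = 5) {
  rated <- setdiff(names(user_ratings)[!is.na(user_ratings)], target)
  if (length(rated) == 0) return(NA_real_)       # nothing to base a prediction on
  s <- sim[target, rated]                        # similarities to the rated movies
  r <- user_ratings[rated]
  keep <- order(s, decreasing = TRUE)[seq_len(min(k, length(s)))]
  s <- s[keep]
  r <- r[keep]
  if (sum(abs(s)) == 0) return(NA_real_)         # no similar neighbors
  sum(s * r) / sum(abs(s))                       # similarity-weighted average
}

movies <- c("Forrest.Gump.", "The.Sound.of.Music.", "Gone.with.the.Wind.")
sim <- matrix(c(0.000, 0.899, 0.429,
                0.899, 0.000, 0.304,
                0.429, 0.304, 0.000),
              nrow = 3, byrow = TRUE, dimnames = list(movies, movies))

# Toy user who has not rated Forrest Gump.
toy_user <- c("Forrest.Gump." = NA, "The.Sound.of.Music." = 5,
              "Gone.with.the.Wind." = 2)
pred <- predict_item_based(toy_user, sim, target = "Forrest.Gump.")
round(pred, 2)   # about 4.03
```

The prediction lands closer to the 5 than to the 2 because The Sound of Music is far more similar to Forrest Gump (0.899) than Gone with the Wind is (0.429).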

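Using an 80/20 hold-out split (20 training users, 6 test users), the model achieved an RMSE of 2.011 and an MAE of 1.60 across 7 comparable predictions, meaning predictions are off by approximately 1.6 rating points on average. For context, the ratings themselves have a standard deviation of 1.21 on a 1 to 5 scale, so this error is large, and with so few test predictions the estimate is noisy. The actual vs. predicted ratings plot shows a positive correlation, which suggests the model captured some signal in user preferences, although the small test set makes this tentative.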

The system generates up to 3 personalized movie recommendations per user, though coverage varied with how many movies each user had already rated: 11 of the 26 users received a full list of 3, and 15 received fewer, including 5 who received none. The diversity in recommendations (different users receiving different suggestions) confirms that the output is personalized. The similarity matrix revealed interesting patterns, such as Forrest Gump and The Sound of Music being the most similar pair (0.899) based on user ratings, which directly drives the recommendation logic.
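
The link between coverage and rating history is simple arithmetic: only movies a user has not yet rated can be recommended, so the list length is capped at the smaller of top_n and the number of unrated movies. The sketch below shows that upper bound on a toy matrix (invented values, hypothetical user names). The actual list can be shorter still when the model cannot produce a prediction for an unrated movie.

```r
# Toy ratings: rows are users, columns are 6 movies (not survey data).
toy_ratings <- rbind(
  UserA = c(4, NA, NA, NA, 5, NA),   # 4 unrated movies
  UserB = c(3, 4, 5, NA, 2, 4),      # 1 unrated movie
  UserC = c(5, 4, 3, 4, 5, 2)        # rated everything
)
top_n <- 3

unrated  <- rowSums(is.na(toy_ratings))   # candidate movies per user
max_recs <- pmin(unrated, top_n)          # upper bound on list length
max_recs
```

Here UserA can receive a full list, UserB at most one recommendation, and UserC none, which mirrors the pattern in the recommendation summary.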

Main limitations include the cold start problem, small dataset size (6 movies, 26 users), and rating sparsity. Future improvements could include testing different parameters, implementing cross-validation, trying matrix factorization algorithms, and adding diversity metrics.

Overall, this IBCF model delivers personalized recommendations by learning from collective user behavior, which is what distinguishes it from non-personalized approaches.