Project 2: Content-Based and Collaborative Filtering

This code builds and evaluates a different movie recommender systems using the MovieLens dataset. It begins by setting up the environment with essential libraries like tidyverse, data.table, and recommenderlab, then efficiently downloads and loads the MovieLens data. The core of the recommendation process starts with transforming the raw ratings into a user-movie rating matrix and preparing it for evaluation by splitting it into training, known, and unknown sets. This crucial step allows for a fair comparison of how well models predict unseen ratings.

The code then implements and evaluates two primary collaborative filtering approaches: Item-Item Collaborative Filtering (IBCF) and User-User Collaborative Filtering (UBCF). Both methods leverage similarities (cosine similarity in this case) between items or users, respectively, to predict ratings. In parallel, a Content-Based Filtering (CBF) model is developed. This involves creating a feature matrix for movies based on their genres and then using cosine similarity between movies to predict ratings. A key aspect of the CBF implementation is its robust handling of missing predictions by falling back to user mean ratings.

Finally, the script focuses on evaluating the performance of all three recommender models. It calculates common accuracy metrics like RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and MSE (Mean Squared Error) for the content-based model, and RMSE for the collaborative filtering models. The results are then displayed in the console and visualized using a bar chart created with ggplot2, offering a clear comparison of each model’s effectiveness. Optionally, the evaluation results can be saved to a CSV file for further analysis.

# Step 1: Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
library(recommenderlab)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loading required package: arules
## 
## Attaching package: 'arules'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
## 
## Loading required package: proxy
## 
## Attaching package: 'proxy'
## 
## The following object is masked from 'package:Matrix':
## 
##     as.matrix
## 
## The following objects are masked from 'package:stats':
## 
##     as.dist, dist
## 
## The following object is masked from 'package:base':
## 
##     as.matrix
## 
## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy
library(ggplot2)

# Step 2: Download & Load MovieLens Data
if (!file.exists("ml-latest-small.zip")) {
  download.file("https://files.grouplens.org/datasets/movielens/ml-latest-small.zip", "ml-latest-small.zip")
  unzip("ml-latest-small.zip")
}
movies  <- fread("ml-latest-small/movies.csv")
ratings <- fread("ml-latest-small/ratings.csv")

# Step 3: Build User-Movie Rating Matrix & Split
R0 <- dcast(ratings, userId ~ movieId, value.var = "rating") %>%
  column_to_rownames("userId") %>%
  as.matrix()
rrm <- as(R0, "realRatingMatrix")

set.seed(123)
es <- evaluationScheme(rrm, method = "split", train = 0.8, given = 10, goodRating = 4)
train_rrm  <- getData(es, "train")
known_rrm  <- getData(es, "known")
unknown_rrm <- getData(es, "unknown")

# Step 4a: Item-Item Collaborative Filtering IBCF
ibcf_mod <- Recommender(train_rrm, method = "IBCF", param = list(method = "cosine", k = 30))
pred_ibcf <- predict(ibcf_mod, known_rrm, type = "ratings")
acc_ibcf <- calcPredictionAccuracy(pred_ibcf, unknown_rrm)

# Step 4b: User-User Collaborative Filtering UBCF
ubcf_mod <- Recommender(train_rrm, method = "UBCF", param = list(method = "cosine", nn = 30))
pred_ubcf <- predict(ubcf_mod, known_rrm, type = "ratings")
acc_ubcf <- calcPredictionAccuracy(pred_ubcf, unknown_rrm)

# Step 5: Prepare Content-Based Feature Matrix (Genres) 
movies_feat <- movies %>%
  separate_rows(genres, sep = "\\|") %>%
  mutate(flag = 1) %>%
  pivot_wider(names_from = genres, values_from = flag, values_fill = 0) %>%
  select(-title, -`(no genres listed)`) %>%
  column_to_rownames("movieId") %>%
  data.matrix()

# Cosine similarity function
cosine_sim <- function(m) {
  norms <- sqrt(rowSums(m^2))
  norms[norms == 0] <- 1
  m2 <- m / norms
  tcrossprod(m2)
}
item_sim <- cosine_sim(movies_feat)

# Step 6: Prepare Numeric Data for CBF Prediction 
M <- 1 * (!is.na(R0))
R0_na <- R0
R0_na[is.na(R0_na)] <- 0

# Step 7: Align Items Between Ratings and Similarity Matrix
common_items <- intersect(colnames(R0_na), rownames(item_sim))
R0_sub  <- R0_na[, common_items]
M_sub   <- M[, common_items]
sim_sub <- item_sim[common_items, common_items]

# Step 8: Predict Ratings with Content-Based Filtering
Rt <- t(R0_sub)    # items × users
Mt <- t(M_sub)
num <- sim_sub %*% Rt
den <- abs(sim_sub) %*% Mt
pred_sub <- t(num / den)  # users × items

# Step 9: Fallback to User Means
user_means <- rowMeans(R0_sub, na.rm = TRUE)
zero_mask <- (t(den) == 0)
pred_sub[zero_mask] <- rep(user_means, each = ncol(pred_sub))[zero_mask]

#  Step 10: Embed Predictions into Full Matrix
pred_cb_mat <- matrix(NA, nrow = nrow(R0_na), ncol = ncol(R0_na), dimnames = dimnames(R0_na))
pred_cb_mat[, common_items] <- pred_sub

# Step 11: Evaluate Content-Based Filtering Performance
actual_unknown <- as(unknown_rrm, "matrix")
mask <- !is.na(actual_unknown)
errs <- pred_cb_mat[mask] - actual_unknown[mask]

rmse_cb <- sqrt(mean(errs^2))
mae_cb  <- mean(abs(errs))
mse_cb  <- mean(errs^2)

# Step 12: Output Evaluation Metrics
cat("Model Evaluation Metrics:\n")
## Model Evaluation Metrics:
cat(sprintf("  Item-Item CF     - RMSE: %.4f\n", acc_ibcf["RMSE"]))
##   Item-Item CF     - RMSE: 1.0632
cat(sprintf("  User-User CF     - RMSE: %.4f\n", acc_ubcf["RMSE"]))
##   User-User CF     - RMSE: 1.1001
cat(sprintf("  Content-Based    - RMSE: %.4f | MAE: %.4f | MSE: %.4f\n", rmse_cb, mae_cb, mse_cb))
##   Content-Based    - RMSE: 1.2571 | MAE: 0.9508 | MSE: 1.5804
# Step 13: Visualization of RMSEs
results_df <- data.frame(
  Model = c("Item-Item CF", "User-User CF", "Content-Based"),
  RMSE = c(acc_ibcf["RMSE"], acc_ubcf["RMSE"], rmse_cb)
)

ggplot(results_df, aes(x = Model, y = RMSE, fill = Model)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "RMSE Comparison of Recommender Models", y = "RMSE", x = "")

# Step 14 (Optional): Save Results to CSV
write.csv(results_df, "recommender_evaluation_results.csv", row.names = FALSE)

Interpretation

The model evaluation shows a clear accuracy hierarchy: Item-Item Collaborative Filtering (RMSE: 1.0632) is the most accurate, followed by User-User Collaborative Filtering (RMSE: 1.1001). Content-Based Filtering (RMSE: 1.2571) is significantly less accurate, indicating its predictions deviate most from actual ratings.

Collaborative filtering consistently outperforms the content-based approach. This is likely because collaborative methods, by analyzing user behavior, capture more complex preferences than simple genre-based content analysis. Specifically, Item-Item CF slightly edged out User-User CF, suggesting more stable item-to-item relationships in this dataset.

The Content-Based model’s lower performance highlights its limitations when using basic features like genres, which don’t fully encompass all factors influencing ratings. In summary, for this dataset, Item-Item Collaborative Filtering is the most effective and accurate recommendation approach.

Including Plots

This code extends the analysis of the Content-Based Filtering (CBF) model. It begins by visualizing the distribution of prediction errors for the initial CBF model through a histogram, providing insight into the model’s accuracy tendencies. This helps understand the nature of the inaccuracies.

The latter part of the code focuses on optimizing the CBF model by introducing a shrinkage parameter (λ) into the cosine similarity calculation for items. This shrinkage helps to regularize the similarity scores, particularly for items with few co-occurrences or ratings, thereby preventing noisy similarities from unduly influencing predictions. The code then defines functions to predict ratings with this new shrinkage and to evaluate the model’s performance. Finally, it iteratively calculates the RMSE for various λ values and visualizes how shrinkage affects the model’s accuracy, aiming to identify an optimal λ that minimizes prediction error.

# Visualize Error Distribution Histogram (for CBF)
actual_unknown <- as(unknown_rrm, "matrix")
pred_cb_vec    <- pred_cb_mat[!is.na(actual_unknown)]
act_cb_vec     <- actual_unknown[!is.na(actual_unknown)]
errors_cb      <- pred_cb_vec - act_cb_vec

ggplot(data.frame(error = errors_cb), aes(x = error)) +
  geom_histogram(binwidth = 0.25, fill = "steelblue", color = "white") +
  theme_minimal() +
  labs(title = "Error Distribution for Content-Based Model",
       x     = "Prediction Error (pred – actual)",
       y     = "Count")

# Shrinkage λ vs. RMSE Line Plot
# Shrinked Cosine Similarity 
cosine_sim_shrink <- function(m, lambda = 10) {
  norms <- sqrt(rowSums(m^2))
  norms[norms == 0] <- 1
  m2 <- m / norms
  sim <- tcrossprod(m2)
  counts <- tcrossprod(!is.na(m))
  sim * (counts / (counts + lambda))
}

#  Content-Based Prediction Function
predict_cbf_matrix <- function(R, M = NULL, sim_mat) {
  if (is.null(M)) M <- 1 * (!is.na(R))
  R_na <- R
  R_na[is.na(R_na)] <- 0

  common_items <- intersect(colnames(R_na), rownames(sim_mat))
  R_sub <- R_na[, common_items]
  M_sub <- M[, common_items]
  sim_sub <- sim_mat[common_items, common_items]

  Rt <- t(R_sub)
  Mt <- t(M_sub)

  num <- sim_sub %*% Rt
  den <- abs(sim_sub) %*% Mt
  pred <- t(num / den)

  # Handle zero-denominator with fallback to user mean
  um <- rowMeans(R_sub, na.rm = TRUE)
  zeros <- t(den) == 0
  zeros[is.na(zeros)] <- FALSE
  pred[zeros] <- rep(um, each = ncol(pred))[zeros]

  # Embed prediction back into full matrix
  Pmat <- matrix(NA, nrow = nrow(R), ncol = ncol(R), dimnames = dimnames(R))
  Pmat[, common_items] <- pred
  return(Pmat)
}

#  Evaluation Function 
eval_cbf_matrix <- function(Pmat, unknown_rrm) {
  actual <- as(unknown_rrm, "matrix")
  mask <- !is.na(actual)
  errs <- Pmat[mask] - actual[mask]
  rmse <- sqrt(mean(errs^2))
  mae <- mean(abs(errs))
  mse <- mean(errs^2)
  return(c(RMSE = rmse, MAE = mae, MSE = mse))
}

# Run Content-Based Filtering with Shrinkage Tuning 
# Assumes movies_feat (one-hot genre matrix), R0 (ratings matrix), and unknown_rrm already exist

lambdas <- c(0, 5, 10, 25, 50)

cbf_results <- sapply(lambdas, function(lam) {
  sim_mat <- cosine_sim_shrink(movies_feat, lambda = lam)
  pred_mat <- predict_cbf_matrix(R0, NULL, sim_mat)
  eval_cbf_matrix(pred_mat, unknown_rrm)
})

cbf_results <- t(cbf_results)
rownames(cbf_results) <- paste0("lambda = ", lambdas)

# Displaying  Results 
print(round(cbf_results, 4))
##               RMSE    MAE    MSE
## lambda = 0  1.2571 0.9508 1.5804
## lambda = 5  1.2571 0.9508 1.5804
## lambda = 10 1.2571 0.9508 1.5804
## lambda = 25 1.2571 0.9508 1.5804
## lambda = 50 1.2571 0.9508 1.5804
# Plot RMSE across lambdas
library(ggplot2)
plot_df <- data.frame(lambda = lambdas, RMSE = cbf_results[, "RMSE"])
ggplot(plot_df, aes(x = lambda, y = RMSE)) +
  geom_line(color = "steelblue", size = 1.2) +
  geom_point(color = "darkred", size = 2) +
  theme_minimal() +
  labs(title = "Content-Based Filtering RMSE vs. Lambda (Shrinkage)",
       x = "Lambda (Shrinkage Penalty)", y = "RMSE")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Interpretation of error distribution

In conclusion, while the error distribution shows a generally centered and symmetrical pattern for the Content-Based model’s predictions, the attempt to optimize it using a shrinkage penalty on item similarity for genre features proved ineffective. This suggests that the current genre-based feature representation for movies might be too simple or too dense for this particular shrinkage method to yield performance improvements. Further content-based model improvements would likely require richer content features beyond just genres ( or a different similarity calculation method.

Conclusion

This project evaluated Item-Item CF, User-User CF, and Content-Based Filtering on MovieLens data. Item-Item CF emerged as the most accurate (RMSE: 1.0632), outperforming User-User CF (1.1001) and the significantly less accurate Content-Based Filtering (1.2571). Collaborative filtering, leveraging user behavior, proved superior to the genre-based content approach.

While the Content-Based model’s errors centered around zero, its limited feature set hindered overall accuracy. Attempts to optimize it with shrinkage on genre similarities yielded no improvement. This indicates that more detailed item features or advanced content-based methods are needed to enhance its predictive capability.

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.