In this project, I compare memory-based (user-based collaborative filtering) and model-based (SVD matrix factorization) recommender systems. The goals:
Predict ratings as accurately as possible.
Compare model performance (RMSE).
Explore how tuning the number of latent factors (k) affects SVD performance.
I use the MovieLens 100k dataset provided by the recommenderlab package.
# Install and load packages
if (!require(recommenderlab)) install.packages("recommenderlab")
if (!require(ggplot2)) install.packages("ggplot2")
if (!require(knitr)) install.packages("knitr")
library(recommenderlab)
library(ggplot2)
library(knitr)
# Load MovieLens data
data("MovieLense")
# Inspect dataset
MovieLense
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.
Explanation:
I use MovieLense, recommenderlab's copy of the MovieLens 100k data, which contains 99,392 ratings from 943 users on 1664 movies. The data is stored as a realRatingMatrix, where rows = users and columns = items (movies).
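Before splitting the data, a quick look at its sparsity can be helpful. Below is a minimal exploration sketch using standard recommenderlab accessors; the summaries are illustrative and were not part of the original analysis.
# Quick exploration of the rating matrix (illustrative sketch)
dim(MovieLense)                 # 943 users x 1664 movies
summary(rowCounts(MovieLense))  # number of ratings per user
summary(colCounts(MovieLense))  # number of ratings per movie
hist(getRatings(MovieLense), main = "Distribution of Ratings", xlab = "Rating")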
set.seed(123)
scheme <- evaluationScheme(MovieLense, method = "cross-validation", k = 5, given = 10, goodRating = 4)
Explanation:
I use 5-fold cross-validation: for each test user, 10 ratings are given to the recommender (given = 10) and the remaining ratings are withheld for evaluation; ratings of 4 and above are treated as good (goodRating = 4).
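As a sanity check, the three views created by the scheme can be inspected directly. A minimal sketch using recommenderlab's getData() accessor (output omitted):
# Inspect the evaluation scheme splits (sketch)
getData(scheme, "train")    # training users with their full profiles
getData(scheme, "known")    # the 10 given ratings per test user
getData(scheme, "unknown")  # the withheld ratings used to score predictions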
results_ubcf <- evaluate(scheme, method = "UBCF", type = "ratings")
## UBCF run fold/sample [model time/prediction time]
## 1 [0.004sec/0.399sec]
## 2 [0.001sec/0.369sec]
## 3 [0.001sec/0.357sec]
## 4 [0.002sec/0.288sec]
## 5 [0.001sec/0.352sec]
avg(results_ubcf)
## RMSE MSE MAE
## [1,] 1.228683 1.510186 0.9588688
Explanation:
I apply user-based collaborative filtering (UBCF) and report RMSE, MSE, and MAE averaged over the five folds.
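For reference, the reported error metrics follow the usual definitions over the set \(T\) of withheld (user, item) pairs, where \(\hat{r}_{ui}\) is the predicted and \(r_{ui}\) the observed rating:
\[
\mathrm{RMSE} = \sqrt{\frac{1}{|T|}\sum_{(u,i)\in T}\left(\hat{r}_{ui}-r_{ui}\right)^{2}}, \qquad
\mathrm{MSE} = \mathrm{RMSE}^{2}, \qquad
\mathrm{MAE} = \frac{1}{|T|}\sum_{(u,i)\in T}\left|\hat{r}_{ui}-r_{ui}\right|.
\]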
results_svd_default <- evaluate(scheme, method = "SVD", type = "ratings")
## SVD run fold/sample [model time/prediction time]
## 1 [0.092sec/0.09sec]
## 2 [0.161sec/0.085sec]
## 3 [0.089sec/0.085sec]
## 4 [0.093sec/0.019sec]
## 5 [0.089sec/0.085sec]
avg(results_svd_default)
## RMSE MSE MAE
## [1,] 1.019596 1.039863 0.7963561
Explanation:
I apply SVD matrix factorization with its default parameters and compute the same metrics as for UBCF.
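Roughly speaking, the SVD recommender approximates the rating matrix (after normalization and imputation of missing entries) by a rank-\(k\) factorization \(R \approx U_k \Sigma_k V_k^\top\), where \(k\) is the number of latent factors tuned below. The defaults used by the run above are registered in recommenderlab and can be inspected; the exact default values, including k, should be read from the registry output rather than assumed:
# Inspect the registered SVD method and its default parameters (sketch)
recommenderRegistry$get_entry("SVD", dataType = "realRatingMatrix")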
k_values <- c(10, 20, 50)
svd_rmses <- data.frame(k = integer(), RMSE = numeric())
for (k in k_values) {
cat("Evaluating SVD with k =", k, "\n")
res <- evaluate(scheme, method = "SVD", type = "ratings", parameter = list(k = k))
avg_rmse <- avg(res)[1, "RMSE"]
print(avg(res))
svd_rmses <- rbind(svd_rmses, data.frame(k = k, RMSE = avg_rmse))
}
## Evaluating SVD with k = 10
## SVD run fold/sample [model time/prediction time]
## 1 [0.16sec/0.021sec]
## 2 [0.096sec/0.087sec]
## 3 [0.082sec/0.021sec]
## 4 [0.097sec/0.086sec]
## 5 [0.088sec/0.019sec]
## RMSE MSE MAE
## [1,] 1.019596 1.039863 0.7963561
## Evaluating SVD with k = 20
## SVD run fold/sample [model time/prediction time]
## 1 [0.145sec/0.089sec]
## 2 [0.145sec/0.022sec]
## 3 [0.145sec/0.089sec]
## 4 [0.142sec/0.086sec]
## 5 [0.147sec/0.088sec]
## RMSE MSE MAE
## [1,] 1.019174 1.038999 0.7960917
## Evaluating SVD with k = 50
## SVD run fold/sample [model time/prediction time]
## 1 [0.359sec/0.028sec]
## 2 [0.351sec/0.095sec]
## 3 [0.351sec/0.029sec]
## 4 [0.405sec/0.029sec]
## 5 [0.364sec/0.03sec]
## RMSE MSE MAE
## [1,] 1.019363 1.039388 0.7961063
print(svd_rmses)
## k RMSE
## RMSE 10 1.019596
## RMSE1 20 1.019174
## RMSE2 50 1.019363
Explanation:
I evaluate SVD for several latent dimension sizes (k) and store the cross-validated RMSE for each.
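The same kind of numbers can be reproduced without the evaluate() wrapper by training one model on the training split and scoring it against the withheld ratings. A minimal sketch (single split, so the values will differ slightly from the 5-fold averages):
# Manual accuracy check on one train/test split (sketch)
rec_check  <- Recommender(getData(scheme, "train"), method = "SVD", parameter = list(k = 20))
pred_check <- predict(rec_check, getData(scheme, "known"), type = "ratings")
calcPredictionAccuracy(pred_check, getData(scheme, "unknown"))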
ubcf_rmse <- avg(results_ubcf)[1, "RMSE"]
svd_rmse <- svd_rmses$RMSE[svd_rmses$k == 10]
comparison <- data.frame(
Model = c("UBCF", "SVD (k=10)"),
RMSE = c(ubcf_rmse, svd_rmse)
)
ggplot(comparison, aes(x = Model, y = RMSE, fill = Model)) +
geom_bar(stat = "identity") +
ylim(0, max(comparison$RMSE) * 1.1) +
geom_text(aes(label = round(RMSE, 3)), vjust = -0.5) +
ggtitle("RMSE Comparison: Memory-Based vs Model-Based (SVD)") +
theme_minimal()
# Plot SVD tuning results
ggplot(svd_rmses, aes(x = factor(k), y = RMSE)) +
geom_bar(stat = "identity", fill = "steelblue") +
ggtitle("SVD RMSE by Latent Dimension (k)") +
xlab("Number of Latent Features (k)") +
ylab("RMSE") +
theme_minimal() +
geom_text(aes(label = round(RMSE, 3)), vjust = -0.5)
Explanation:
The first plot compares RMSE for UBCF and SVD (default k).
The second plot shows how SVD RMSE varies with latent dimensions (k).
In the following code, I use the best model, SVD with \(k=20\), since it had the lowest RMSE, to generate predicted ratings for users in the test set. Below are example predicted ratings and top-N recommendations for user 1.
# Train final SVD model with the best k (k = 20)
final_svd_model <- Recommender(getData(scheme, "train"), method = "SVD", parameter = list(k = 20))
# Make predictions for users in the test set
final_predictions <- predict(final_svd_model, getData(scheme, "known"), type = "ratings")
# Convert predictions for first user to a data frame
pred_matrix <- as(final_predictions, "matrix")
user1_pred <- pred_matrix[1, , drop = FALSE]
# Convert to long format data frame
user1_df <- data.frame(
Movie = colnames(user1_pred),
Predicted_Rating = as.numeric(user1_pred[1, ])
)
# Remove movies that were not predicted (NA)
user1_df <- na.omit(user1_df)
# Optionally sort by predicted rating
user1_df <- user1_df[order(-user1_df$Predicted_Rating), ]
# Display as a table
kable(head(user1_df, 10), caption = "Top 10 Predicted Ratings for User 1")
|      | Movie                                 | Predicted_Rating |
|------|---------------------------------------|------------------|
| 808  | Great Day in Harlem, A (1994)         | 4.404642 |
| 1526 | Aiqing wansui (1994)                  | 4.209570 |
| 1457 | Saint of Fort Washington, The (1993)  | 4.144248 |
| 1490 | Santa with Muscles (1996)             | 3.951662 |
| 1589 | Someone Else’s America (1995)         | 3.918162 |
| 1628 | Some Mother’s Son (1996)              | 3.799073 |
| 1645 | Rough Magic (1995)                    | 3.699040 |
| 1283 | Star Kid (1997)                       | 3.697260 |
| 1439 | Pather Panchali (1955)                | 3.686491 |
| 1180 | Prefontaine (1997)                    | 3.678274 |
# Predict top 5 recommendations for each user
final_topN <- predict(final_svd_model, getData(scheme, "known"), type = "topNList", n = 5)
Let’s look at other users’ top 5 recommendations:
# Show top 5 for user 2
top5_user2 <- as(final_topN, "list")[[2]]
kable(data.frame(Rank = 1:5, Movie = top5_user2), caption = "Top 5 Movie Recommendations for User 2")
| Rank | Movie |
|------|-------|
| 1 | Great Day in Harlem, A (1994) |
| 2 | Aiqing wansui (1994) |
| 3 | Saint of Fort Washington, The (1993) |
| 4 | Santa with Muscles (1996) |
| 5 | Someone Else’s America (1995) |
kable(data.frame(Rank = 1:5, Movie = as(final_topN, "list")[[3]]), caption = "Top 5 Movie Recommendations for User 3")
| Rank | Movie |
|------|-------|
| 1 | Great Day in Harlem, A (1994) |
| 2 | Aiqing wansui (1994) |
| 3 | Saint of Fort Washington, The (1993) |
| 4 | Santa with Muscles (1996) |
| 5 | Someone Else’s America (1995) |
kable(data.frame(Rank = 1:5, Movie = as(final_topN, "list")[[4]]), caption = "Top 5 Movie Recommendations for User 4")
| Rank | Movie |
|------|-------|
| 1 | Great Day in Harlem, A (1994) |
| 2 | Aiqing wansui (1994) |
| 3 | Saint of Fort Washington, The (1993) |
| 4 | Santa with Muscles (1996) |
| 5 | Someone Else’s America (1995) |
Generating recommendations for 4 different users gave identical lists. Let’s further check the top 5 most-rated movies and see whether any of them appear among the recommendations:
# Get item rating counts
item_counts <- colCounts(MovieLense)
# Top 5 most rated movies as a named vector
top5_items <- head(sort(item_counts, decreasing = TRUE), 5)
# Convert to data frame for kable
top5_df <- data.frame(
Movie = names(top5_items),
Rating_Count = as.integer(top5_items),
row.names = NULL
)
# Display as table
knitr::kable(top5_df, caption = "Top 5 Most Rated Movies")
| Movie | Rating_Count |
|-------|--------------|
| Star Wars (1977) | 583 |
| Contact (1997) | 509 |
| Fargo (1996) | 508 |
| Return of the Jedi (1983) | 507 |
| Liar Liar (1997) | 485 |
Purpose: This code counts how many ratings each movie has received in the dataset and displays the five most-rated (popular) movies, providing a baseline of what a popularity-driven recommender would surface.
Let’s check if the popular movies are among the users’ given ratings:
# For user 1
user1_known <- as(getData(scheme, "known"), "matrix")[1, ]
names(user1_known[!is.na(user1_known)])
## [1] "Client, The (1994)" "Spawn (1997)"
Purpose: This code identifies the specific movies that User 1 has already rated in the “known” portion of the evaluation scheme. This explains why those movies wouldn’t appear in recommendations—they’re already part of the user’s history.
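It is also worth checking what the model predicts for the popular titles for this user; if their predicted ratings fall below those of the obscure recommendations, the ranking is being driven by the latent factors rather than popularity. A small sketch reusing user1_df and top5_items from above (titles user 1 has already rated will simply be absent):
# Predicted ratings for the most-rated movies, as seen by user 1 (sketch)
user1_df[user1_df$Movie %in% names(top5_items), ]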
Let’s check how many ratings the recommended movies have overall:
# Subset item counts for specific movies
selected_items <- item_counts[names(item_counts) %in% c(
"Great Day in Harlem, A (1994)",
"Aiqing wansui (1994)",
"Saint of Fort Washington, The (1993)",
"Santa with Muscles (1996)",
"Someone Else's America (1995)"
)]
# Convert to data frame
selected_df <- data.frame(
Movie = names(selected_items),
Rating_Count = as.integer(selected_items),
row.names = NULL
)
# Display as table
knitr::kable(selected_df, caption = "Rating Counts for Recommended Movies")
| Movie | Rating_Count |
|-------|--------------|
| Great Day in Harlem, A (1994) | 1 |
| Saint of Fort Washington, The (1993) | 2 |
| Santa with Muscles (1996) | 2 |
| Aiqing wansui (1994) | 1 |
| Someone Else’s America (1995) | 1 |
Purpose: This code retrieves how many ratings each of the recommended movies received in the dataset. The low counts reveal that these are obscure items, helping us understand why the SVD model suggested them despite their limited popularity.
When generating top-5 recommendations for Users 1 to 4, I observed that all received identical lists. The recommendations included obscure titles such as Great Day in Harlem, A (1994) and Someone Else’s America (1995), each of which had very few ratings in the dataset.
To explore this, I:
Checked the most-rated (popular) movies, e.g., Star Wars (1977), Contact (1997).
Verified which movies User 1 had already rated to ensure they weren’t recommended again.
Retrieved the rating counts for the recommended movies, confirming their low popularity.
This behavior highlights how matrix factorization, in the presence of sparse user data and no popularity bias, may prioritize items in the latent space that do not have broad appeal or sufficient feedback. A hybrid approach, combining SVD with popularity filtering, could address this issue.
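One lightweight version of such a hybrid is a post-hoc popularity filter: keep the SVD ranking but drop items below a minimum rating count before taking the top N. A minimal sketch reusing item_counts and final_predictions from above (the threshold of 50 ratings is an arbitrary illustration, not a tuned value):
# Post-process SVD predictions with a popularity filter (sketch)
min_ratings   <- 50                            # arbitrary illustrative threshold
popular_items <- names(item_counts[item_counts >= min_ratings])
pred_all     <- as(final_predictions, "matrix")
user1_scores <- pred_all[1, ]
user1_scores <- user1_scores[!is.na(user1_scores) & names(user1_scores) %in% popular_items]
head(sort(user1_scores, decreasing = TRUE), 5) # top 5 among sufficiently popular items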
Memory-Based (UBCF)
\(\checkmark\) Easy to implement
\(\checkmark\) No model training (adapts quickly to new data)
\(\checkmark\) Recommendations are explainable (e.g. “similar users liked X”)
\(\times\) Poor scalability for large datasets
\(\times\) Sensitive to data sparsity
Model-Based (SVD)
\(\checkmark\) Reduces dimensionality → efficiency
\(\checkmark\) Captures hidden preference patterns
\(\checkmark\) Better at generalization, handles sparsity better
\(\times\) Computationally costly to train
\(\times\) Latent features are not interpretable
\(\times\) Needs retraining for new users/items
In this project, I compared two popular recommender system approaches on the MovieLens 100k dataset:
Memory-Based Collaborative Filtering (UBCF)
Model-Based Matrix Factorization using Singular Value Decomposition (SVD)
The SVD model consistently outperformed UBCF in predicting user ratings, achieving a lower RMSE of approximately 1.02 compared to UBCF’s RMSE of 1.23.
This performance gap demonstrates SVD’s strength in capturing latent user-item interaction patterns more effectively than simple similarity-based methods. However, the model’s recommendations for different users were identical and focused on obscure movies with very few ratings. This likely reflects the influence of data sparsity and the model’s tendency to recommend items that are underrepresented in the latent space when user profiles are small (10 known ratings).
Tuning the number of latent factors (k) in the SVD model showed minimal RMSE improvement beyond k=10. The RMSE ranged narrowly between 1.0192 and 1.0196 for k values of 10, 20, and 50.
Increasing k beyond 10 slightly increased computational time but did not significantly improve accuracy, suggesting that a smaller number of latent features is sufficient for this dataset.
The first plot clearly shows the RMSE advantage of SVD over UBCF, highlighting the improved predictive accuracy of the model-based approach.
The second plot illustrates that RMSE stabilizes quickly with increasing latent dimensions, emphasizing the law of diminishing returns in model complexity.
This analysis confirms that model-based matrix factorization is a more accurate and scalable approach for collaborative filtering on sparse rating data, such as MovieLens. The latent factors learned via SVD capture complex patterns that similarity-based methods struggle to exploit. However, model complexity should be balanced with computational cost, as excessive latent factors do not substantially improve prediction quality. Ultimately, blending model accuracy with practical recommendation utility remains a key goal for future recommender system development.
While SVD showed strong predictive accuracy in terms of RMSE, its recommendations favored low-popularity items. This reflects a limitation of pure matrix factorization on sparse data. Future work could:
Implement hybrid recommenders combining SVD with popularity or content-based signals.
Apply post-processing filters to exclude extremely low-rating-count items.
Explore additional tuning of hyperparameters beyond k (see the sketch below).
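For example, recommenderlab also registers a gradient-descent variant, Funk SVD ("SVDF"), which exposes additional hyperparameters such as the learning rate and regularization (see its registry entry for the exact names). A quick comparison under the same scheme could look like the sketch below; note that it is considerably slower to train:
# Compare against Funk SVD ("SVDF") under the same evaluation scheme (sketch)
results_svdf <- evaluate(scheme, method = "SVDF", type = "ratings")
avg(results_svdf)
# List its tunable parameters via the registry
recommenderRegistry$get_entry("SVDF", dataType = "realRatingMatrix")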
recommenderlab documentation: https://cran.r-project.org/web/packages/recommenderlab/recommenderlab.pdf
MovieLens dataset: https://grouplens.org/datasets/movielens/100k/
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.50 ggplot2_3.5.2 recommenderlab_1.0.7
## [4] proxy_0.4-27 arules_1.7-11 Matrix_1.7-3
##
## loaded via a namespace (and not attached):
## [1] float_0.3-3 gtable_0.3.6 jsonlite_2.0.0 dplyr_1.1.4
## [5] compiler_4.5.1 tidyselect_1.2.1 Rcpp_1.0.14 jquerylib_0.1.4
## [9] scales_1.4.0 yaml_2.3.10 fastmap_1.2.0 lattice_0.22-7
## [13] R6_2.6.1 labeling_0.4.3 generics_0.1.4 tibble_3.3.0
## [17] bslib_0.9.0 pillar_1.10.2 RColorBrewer_1.1-3 rlang_1.1.6
## [21] cachem_1.1.0 xfun_0.52 sass_0.4.10 registry_0.5-1
## [25] cli_3.6.5 withr_3.0.2 magrittr_2.0.3 digest_0.6.37
## [29] grid_4.5.1 rstudioapi_0.17.1 irlba_2.3.5.1 lifecycle_1.0.4
## [33] vctrs_0.6.5 evaluate_1.0.3 glue_1.8.0 farver_2.1.2
## [37] recosystem_0.5.1 rmarkdown_2.29 matrixStats_1.5.0 tools_4.5.1
## [41] pkgconfig_2.0.3 htmltools_0.5.8.1
# Expanded set of k values
k_values <- c(5, 10, 15, 20, 30, 40, 50, 60, 80, 100)
svd_rmses <- data.frame(k = integer(), RMSE = numeric())
# Evaluate SVD for each k
for (k in k_values) {
cat("Evaluating SVD with k =", k, "\n")
res <- evaluate(scheme, method = "SVD", type = "ratings", parameter = list(k = k))
avg_rmse <- avg(res)[1, "RMSE"]
svd_rmses <- rbind(svd_rmses, data.frame(k = k, RMSE = avg_rmse))
}
## Evaluating SVD with k = 5
## SVD run fold/sample [model time/prediction time]
## 1 [0.132sec/0.092sec]
## 2 [0.059sec/0.019sec]
## 3 [0.06sec/0.09sec]
## 4 [0.059sec/0.02sec]
## 5 [0.129sec/0.018sec]
## Evaluating SVD with k = 10
## SVD run fold/sample [model time/prediction time]
## 1 [0.098sec/0.024sec]
## 2 [0.09sec/0.096sec]
## 3 [0.089sec/0.093sec]
## 4 [0.095sec/0.019sec]
## 5 [0.093sec/0.022sec]
## Evaluating SVD with k = 15
## SVD run fold/sample [model time/prediction time]
## 1 [0.114sec/0.023sec]
## 2 [0.189sec/0.02sec]
## 3 [0.134sec/0.023sec]
## 4 [0.118sec/0.022sec]
## 5 [0.197sec/0.02sec]
## Evaluating SVD with k = 20
## SVD run fold/sample [model time/prediction time]
## 1 [0.157sec/0.024sec]
## 2 [0.148sec/0.094sec]
## 3 [0.161sec/0.022sec]
## 4 [0.209sec/0.021sec]
## 5 [0.149sec/0.02sec]
## Evaluating SVD with k = 30
## SVD run fold/sample [model time/prediction time]
## 1 [0.305sec/0.022sec]
## 2 [0.281sec/0.024sec]
## 3 [0.215sec/0.096sec]
## 4 [0.226sec/0.025sec]
## 5 [0.195sec/0.024sec]
## Evaluating SVD with k = 40
## SVD run fold/sample [model time/prediction time]
## 1 [0.285sec/0.029sec]
## 2 [0.352sec/0.026sec]
## 3 [0.294sec/0.026sec]
## 4 [0.279sec/0.027sec]
## 5 [0.339sec/0.027sec]
## Evaluating SVD with k = 50
## SVD run fold/sample [model time/prediction time]
## 1 [0.417sec/0.028sec]
## 2 [0.351sec/0.031sec]
## 3 [0.364sec/0.103sec]
## 4 [0.341sec/0.032sec]
## 5 [0.447sec/0.026sec]
## Evaluating SVD with k = 60
## SVD run fold/sample [model time/prediction time]
## 1 [0.421sec/0.103sec]
## 2 [0.446sec/0.034sec]
## 3 [0.459sec/0.033sec]
## 4 [0.441sec/0.037sec]
## 5 [0.537sec/0.033sec]
## Evaluating SVD with k = 80
## SVD run fold/sample [model time/prediction time]
## 1 [0.684sec/0.109sec]
## 2 [0.673sec/0.034sec]
## 3 [0.639sec/0.039sec]
## 4 [0.685sec/0.037sec]
## 5 [0.742sec/0.038sec]
## Evaluating SVD with k = 100
## SVD run fold/sample [model time/prediction time]
## 1 [0.889sec/0.041sec]
## 2 [0.899sec/0.043sec]
## 3 [0.903sec/0.117sec]
## 4 [0.838sec/0.048sec]
## 5 [1.082sec/0.042sec]
# Print the results
print(svd_rmses)
## k RMSE
## RMSE 5 1.020198
## RMSE1 10 1.019596
## RMSE2 15 1.019303
## RMSE3 20 1.019174
## RMSE4 30 1.019170
## RMSE5 40 1.019413
## RMSE6 50 1.019362
## RMSE7 60 1.019377
## RMSE8 80 1.019603
## RMSE9 100 1.019786
# Plot RMSE as a function of k
ggplot(svd_rmses, aes(x = k, y = RMSE)) +
geom_line(color = "blue", linewidth = 1) +
geom_point(color = "darkred", size = 2) +
ggtitle("SVD RMSE by Number of Latent Factors (k)") +
xlab("Latent Dimension (k)") +
ylab("RMSE") +
theme_minimal() +
geom_text(aes(label = round(RMSE, 3)), vjust = -1, size = 3.5)