For this assignment I will be transitioning from the non-personalized Global Baseline Estimate to a personalized system that allows me to leverage the specific user behaviors captured in my survey data. I will maintain use of tidyverse syntax to complete this and utilize the ‘recommenderlab’ package, which is the standard R library for building and evaluating recommendation engines. I suspect that a challenge I may experience is the small dataset used for this that may limit how the recommendations are made.
Data Preparation:
I will read the same survey data from my Global Baseline Estimate and pivot it into a user-item matrix (realRatingMatrix), which is required by the ‘recommenderlab’ package.
# Load necessary librarieslibrary(tidyverse)
Warning: package 'dplyr' was built under R version 4.5.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.1 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(recommenderlab)
Warning: package 'recommenderlab' was built under R version 4.5.3
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Loading required package: arules
Warning: package 'arules' was built under R version 4.5.3
Attaching package: 'arules'
The following object is masked from 'package:dplyr':
recode
The following objects are masked from 'package:base':
abbreviate, write
Loading required package: proxy
Attaching package: 'proxy'
The following object is masked from 'package:Matrix':
as.matrix
The following objects are masked from 'package:stats':
as.dist, dist
The following object is masked from 'package:base':
as.matrix
library(gt)# Read the same data set used for Global Baseline Estimate assignmenturl<-"https://raw.githubusercontent.com/DRA-SPS27/DATA607-Week-3-Assignments/refs/heads/main/D.Atherley%20-%20Movie%20Ratings%20(IDs).csv"df <-read_csv(url)
Rows: 42 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): Ratings_ID, User_ID, Movie_ID, Rating
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Convert the tidy dataframe into a wide matrix (Rows = Users, Columns = Movies)# Then cast it to a 'realRatingMatrix' for recommenderlabrating_matrix <- df |>filter(!is.na(Rating)) |>select(User_ID, Movie_ID, Rating) |>pivot_wider(names_from = Movie_ID, values_from = Rating) |>column_to_rownames(var ="User_ID") |>as.matrix() |>as("realRatingMatrix")
Personalized Recommendation Algorithm:
I will implement User-to-User Collaborative Filtering (UBCF) as my algorithm to find users with similar rating patterns, using Cosine similarity, and recommend movies that those similar users rated highly.
# Using User-Based Collaborative Filtering (UBCF) with Cosine similarityrec_model <-Recommender(data = rating_matrix,method ="UBCF",param =list(method ="Cosine", nn =3))
Recommender Output:
The recommender will output a “Top 2 List” for each user based on their unseen items.
# Predict the Top 2 unseen movies for all users in the datasetpredicted_top_n <-predict(rec_model, newdata = rating_matrix, type ="topNList", n =2)# Convert to a readable list, then to a tidy tibble for presentationrec_list <-as(predicted_top_n, "list")recommendation_table <-tibble(User_ID =as.integer(names(rec_list)),Recommended_Movie_IDs =map_chr(rec_list, ~paste(.x, collapse =", "))) |># Filter out users who have already rated all 6 movies (no new movies to recommend)filter(Recommended_Movie_IDs !="") # Display the recommendations using gtrecommendation_table |>gt() |>tab_header(title ="Personalized Movie Recommendations",subtitle ="Top Unseen Movies via User-Based Collaborative Filtering" )
Personalized Movie Recommendations
Top Unseen Movies via User-Based Collaborative Filtering
User_ID
Recommended_Movie_IDs
0
2
2
6
3
5, 6
5
5
6
5, 3
Evaluate Model Performance:
I will use a 4-Fold Cross Validation and Leave-One-Out method to calculate prediction accuracy metrics, specifically Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). I am using this approach because of the size of my dataset. Using a 4-Fold Cross Validation approach rotates the users through the training and testing sets multiple times, ensuring that every user gets evaluated and maximizing the data pool. Using the Leave-One-Out method tells the model to use all of the test user’s ratings except for one. It will use that single hidden rating for the test.
# Create a Cross-Validation scheme: 4 folds, leave-1-outset.seed(123) eval_scheme <-evaluationScheme(data = rating_matrix,method ="cross-validation",k =4, # 4-fold cross validationgiven =-1, # Use all ratings except 1 for testinggoodRating =4)# Train the model on the cross-validation schemeeval_model <-Recommender(getData(eval_scheme, "train"), method ="UBCF")# Predict the single hidden rating for the test userseval_predictions <-predict(eval_model, getData(eval_scheme, "known"), type ="ratings")# Calculate prediction accuracy metricsaccuracy_metrics <-calcPredictionAccuracy(eval_predictions, getData(eval_scheme, "unknown"))# Display the error metricsaccuracy_metrics
RMSE MSE MAE
1.022961 1.046448 0.935364
Conclusion:
I initially attempted to use a Split Hold Out method and the evaluation metrics returned a value of “NaN”. After researching what that meant, I learned that the model was unable to make any successful predictions because the dataset is very small. This was a challenge that I suspected would happen. To correct this, I switched to a 4-Fold Cross Validation and Leave-One-Out method instead. This approach allowed for this tiny data set to be used. The results of the Cross Validation scheme tell me how far off my User-Based Collaborative Filtering model’s predictions were from the actual ratings. The MAE is 0.935 which means that my model’s predicted rating was off by about 0.94 points. The RMSE is 1.023 which means that while the typical error is less than 1 point, there are a few instances where the model missed by a larger margin.
For an exploratory model running on a small dataset, an average error of about 1 point means that my model is generally capturing the right ballpark of a user’s preference. It is unlikely to strongly recommend a movie a user would hate. As the survey data increases, typically the error metrics should shrink closer to zero.