Assignment #11: Personalized Recommender

Author

Denise Atherley

Approach:

For this assignment I will be transitioning from the non-personalized Global Baseline Estimate to a personalized system that allows me to leverage the specific user behaviors captured in my survey data. I will maintain use of tidyverse syntax to complete this and utilize the ‘recommenderlab’ package, which is the standard R library for building and evaluating recommendation engines. I suspect that a challenge I may experience is the small dataset used for this that may limit how the recommendations are made.

Data Preparation:

I will read the same survey data from my Global Baseline Estimate and pivot it into a user-item matrix (realRatingMatrix), which is required by the ‘recommenderlab’ package.

# Load necessary libraries
library(tidyverse)
Warning: package 'dplyr' was built under R version 4.5.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(recommenderlab)
Warning: package 'recommenderlab' was built under R version 4.5.3
Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loading required package: arules
Warning: package 'arules' was built under R version 4.5.3

Attaching package: 'arules'

The following object is masked from 'package:dplyr':

    recode

The following objects are masked from 'package:base':

    abbreviate, write

Loading required package: proxy

Attaching package: 'proxy'

The following object is masked from 'package:Matrix':

    as.matrix

The following objects are masked from 'package:stats':

    as.dist, dist

The following object is masked from 'package:base':

    as.matrix
library(gt)

# Read the same data set used for Global Baseline Estimate assignment

url<- "https://raw.githubusercontent.com/DRA-SPS27/DATA607-Week-3-Assignments/refs/heads/main/D.Atherley%20-%20Movie%20Ratings%20(IDs).csv"

df <- read_csv(url)
Rows: 42 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (4): Ratings_ID, User_ID, Movie_ID, Rating

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Convert the tidy dataframe into a wide matrix (Rows = Users, Columns = Movies)
# Then cast it to a 'realRatingMatrix' for recommenderlab
rating_matrix <- df |>
  filter(!is.na(Rating)) |>
  select(User_ID, Movie_ID, Rating) |>
  pivot_wider(names_from = Movie_ID, values_from = Rating) |>
  column_to_rownames(var = "User_ID") |>
  as.matrix() |>
  as("realRatingMatrix")

Personalized Recommendation Algorithm:

I will implement User-to-User Collaborative Filtering (UBCF) as my algorithm to find users with similar rating patterns, using Cosine similarity, and recommend movies that those similar users rated highly.

# Using User-Based Collaborative Filtering (UBCF) with Cosine similarity
rec_model <- Recommender(data = rating_matrix,
                         method = "UBCF",
                         param = list(method = "Cosine", nn = 3))

Recommender Output:

The recommender will output a “Top 2 List” for each user based on their unseen items.

# Predict the Top 2 unseen movies for all users in the dataset
predicted_top_n <- predict(rec_model, newdata = rating_matrix, type = "topNList", n = 2)

# Convert to a readable list, then to a tidy tibble for presentation
rec_list <- as(predicted_top_n, "list")

recommendation_table <- tibble(
  User_ID = as.integer(names(rec_list)),
  Recommended_Movie_IDs = map_chr(rec_list, ~ paste(.x, collapse = ", "))
) |>
  # Filter out users who have already rated all 6 movies (no new movies to recommend)
  filter(Recommended_Movie_IDs != "") 

# Display the recommendations using gt
recommendation_table |>
  gt() |>
  tab_header(
    title = "Personalized Movie Recommendations",
    subtitle = "Top Unseen Movies via User-Based Collaborative Filtering"
  )
Personalized Movie Recommendations
Top Unseen Movies via User-Based Collaborative Filtering
User_ID Recommended_Movie_IDs
0 2
2 6
3 5, 6
5 5
6 5, 3

Evaluate Model Performance:

I will use a 4-Fold Cross Validation and Leave-One-Out method to calculate prediction accuracy metrics, specifically Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). I am using this approach because of the size of my dataset. Using a 4-Fold Cross Validation approach rotates the users through the training and testing sets multiple times, ensuring that every user gets evaluated and maximizing the data pool. Using the Leave-One-Out method tells the model to use all of the test user’s ratings except for one. It will use that single hidden rating for the test.

# Create a Cross-Validation scheme: 4 folds, leave-1-out
set.seed(123) 
eval_scheme <- evaluationScheme(data = rating_matrix,
                                method = "cross-validation",
                                k = 4,          # 4-fold cross validation
                                given = -1,     # Use all ratings except 1 for testing
                                goodRating = 4)

# Train the model on the cross-validation scheme
eval_model <- Recommender(getData(eval_scheme, "train"), method = "UBCF")

# Predict the single hidden rating for the test users
eval_predictions <- predict(eval_model, getData(eval_scheme, "known"), type = "ratings")

# Calculate prediction accuracy metrics
accuracy_metrics <- calcPredictionAccuracy(eval_predictions, getData(eval_scheme, "unknown"))

# Display the error metrics
accuracy_metrics
    RMSE      MSE      MAE 
1.022961 1.046448 0.935364 

Conclusion:

I initially attempted to use a Split Hold Out method and the evaluation metrics returned a value of “NaN”. After researching what that meant, I learned that the model was unable to make any successful predictions because the dataset is very small. This was a challenge that I suspected would happen. To correct this, I switched to a 4-Fold Cross Validation and Leave-One-Out method instead. This approach allowed for this tiny data set to be used. The results of the Cross Validation scheme tell me how far off my User-Based Collaborative Filtering model’s predictions were from the actual ratings. The MAE is 0.935 which means that my model’s predicted rating was off by about 0.94 points. The RMSE is 1.023 which means that while the typical error is less than 1 point, there are a few instances where the model missed by a larger margin.

For an exploratory model running on a small dataset, an average error of about 1 point means that my model is generally capturing the right ballpark of a user’s preference. It is unlikely to strongly recommend a movie a user would hate. As the survey data increases, typically the error metrics should shrink closer to zero.