The objective of this assignment is to build a personalized recommendation system using the same movie ratings dataset employed in the Week 3A Global Baseline Estimate project. Whereas the prior assignment produced non-personalized recommendations based on overall average ratings and bias terms, the present assignment calls for a recommender that generates outputs tailored to the preferences of individual users.
To achieve this, the personalized recommendation algorithm that will likely be implemented is user-to-user collaborative filtering. This method works by identifying users with similar rating patterns and using those similarities to estimate how a target user may score movies they have not yet rated.
As with the previous assignment, the movie ratings dataset provided by Professor Catlin will be used. The dataset is arranged in a wide format, where each row represents a user and each column represents a movie, with missing values indicating unrated items.
The data will first be imported into R and reshaped into a long format using functions such as pivot_longer(), producing variables such as user, movie, and rating. The long-format data will then be converted into a user-item matrix for collaborative filtering.
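The reshaping step described above can be sketched on a two-critic toy table. This is a minimal illustration, assuming the tidyr and tibble packages (both part of the tidyverse) are installed; the toy values and the object names wide, long, and user_item are placeholders, not the assignment's actual objects.

```r
library(tidyr)   # pivot_longer(), pivot_wider()
library(tibble)  # column_to_rownames()

# Toy wide table: one row per critic, one column per movie (hypothetical subset)
wide <- data.frame(
  Critic     = c("Burton", "Charley"),
  Frozen     = c(NA, 4),
  JungleBook = c(4, 3)
)

# Wide -> long: one row per (critic, movie) pair, dropping unrated movies
long <- pivot_longer(
  wide,
  cols = -Critic,
  names_to = "movie",
  values_to = "rating",
  values_drop_na = TRUE
)

# Long -> user-item matrix: critics as rows, movies as columns, NA where unrated
wide_again <- as.data.frame(
  pivot_wider(long, names_from = movie, values_from = rating)
)
user_item <- as.matrix(column_to_rownames(wide_again, "Critic"))
```

Dropping the missing ratings in the long format keeps only observed (critic, movie) pairs; the NA values reappear when the data is spread back into a matrix.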
The user-to-user collaborative filtering model will measure similarity between users based on the movies they have both rated. Similarity may be calculated using a metric such as Pearson correlation or cosine similarity.
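Both similarity measures can be sketched in base R by restricting each comparison to the movies that two users have both rated. The helper user_similarity() below is a hypothetical illustration and not recommenderlab's internal implementation; the three rating vectors are taken from the dataset in column order.

```r
# Similarity between two users' rating vectors, using only co-rated movies.
# Simplified sketch; not recommenderlab's internal implementation.
user_similarity <- function(a, b, method = c("pearson", "cosine")) {
  method <- match.arg(method)
  both <- !is.na(a) & !is.na(b)          # movies rated by both users
  if (sum(both) < 2) return(NA_real_)    # too little overlap to compare
  x <- a[both]
  y <- b[both]
  if (method == "pearson") {
    # Pearson correlation is undefined when either user gave constant ratings
    if (sd(x) == 0 || sd(y) == 0) return(NA_real_)
    cor(x, y)
  } else {
    sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
  }
}

# Ratings for three critics, in the dataset's column order
charley <- c(4, 5, 4, 3, 2, 3)
max_r   <- c(4, 4, 4, 2, 2, 4)
nathan  <- c(NA, NA, NA, NA, NA, 4)

sim_pearson <- user_similarity(charley, max_r, "pearson")
sim_cosine  <- user_similarity(charley, max_r, "cosine")
sim_sparse  <- user_similarity(charley, nathan)  # NA: only one co-rated movie
```

Returning NA when fewer than two movies overlap makes the sparsity problem explicit instead of producing a similarity from a single data point.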
Once the similarity scores are determined, the most similar users will be identified, and their ratings will be used to estimate ratings for unseen movies for a target user. The recommender output will likely take the form of a top-N list of recommended movies for each user, based on the highest predicted ratings.
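The neighbor-based prediction step can be sketched as a similarity-weighted average of the ratings given by the most similar users. The function predict_rating() and the four-user toy matrix below are hypothetical, and the sketch omits refinements such as mean-centering, so it should be read as an illustration of the idea and not as the algorithm recommenderlab runs.

```r
# Predict one user's rating for one movie from the k most similar users.
# Simplified sketch: a similarity-weighted average of neighbors' raw ratings.
predict_rating <- function(m, user, movie, k = 5) {
  others <- setdiff(rownames(m), user)
  # Pearson similarity to every other user over co-rated movies
  sims <- sapply(others, function(o) {
    suppressWarnings(cor(m[user, ], m[o, ], use = "pairwise.complete.obs"))
  })
  # Keep neighbors who rated the movie and are positively correlated
  usable <- !is.na(m[others, movie]) & !is.na(sims) & sims > 0
  if (!any(usable)) return(NA_real_)
  sims_ok <- sims[usable]
  top <- names(sort(sims_ok, decreasing = TRUE))[seq_len(min(k, length(sims_ok)))]
  sum(sims_ok[top] * m[top, movie]) / sum(sims_ok[top])
}

# Hypothetical toy data: Ann has not rated movie D
m <- rbind(
  Ann  = c(A = 5, B = 4, C = 1, D = NA),
  Bob  = c(5, 5, 2, 4),
  Cara = c(1, 2, 5, 2),
  Dev  = c(4, 5, 1, 5)
)

pred_k2 <- predict_rating(m, "Ann", "D", k = 2)  # uses Bob and Dev
pred_k1 <- predict_rating(m, "Ann", "D", k = 1)  # uses Bob only
```

Cara's ratings run opposite to Ann's, so she is excluded, and the prediction falls between the ratings that Bob and Dev gave to movie D. Repeating this for every unrated movie and sorting by predicted rating would give a top-N list.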
To evaluate the recommender, a portion of the ratings data will likely be held out and treated as test data. The recommender’s predicted ratings may then be compared against the actual ratings using a metric such as RMSE or MAE.
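Both error metrics are simple to compute by hand, which helps when interpreting the values a package reports. The sketch below uses hypothetical predicted and actual ratings; rmse() and mae() are illustrative helper names.

```r
# Root mean squared error: penalizes large misses more heavily
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2, na.rm = TRUE))
}

# Mean absolute error: average size of a miss, in rating points
mae <- function(predicted, actual) {
  mean(abs(predicted - actual), na.rm = TRUE)
}

# Hypothetical held-out ratings and the model's predictions for them
predicted <- c(4, 3, 5)
actual    <- c(5, 3, 4)

rmse_value <- rmse(predicted, actual)  # sqrt(2/3)
mae_value  <- mae(predicted, actual)   # 2/3
```

RMSE is never smaller than MAE on the same errors, and a wide gap between the two suggests that a few large misses dominate.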
Additionally, the resulting top-N recommendations may be reviewed to determine whether they appear reasonable and personalized.
One anticipated challenge is the sparsity of the ratings matrix, since users may not have rated many of the same movies. This could make the similarity calculations less stable. Another possible challenge is the relatively small size of the ratings dataset, which may limit the effectiveness of collaborative filtering.
The first step, as with most of our analytical tasks in RStudio, is to load the required libraries. In this assignment, the tidyverse package is used for data preparation, while the recommenderlab package is used to construct the personalized recommender system.
library(tidyverse)
library(recommenderlab)
Subsequently, we must import the movie ratings dataset into our working environment. As with the previous Global Baseline Estimate assignment, the dataset originates from the critic movie ratings file provided by Professor Catlin.
url <- "https://raw.githubusercontent.com/bkchanderban/CUNY_SPS/refs/heads/main/DATA607/DATA607/week_3A_assignment/MovieRatings.csv"
raw_ratings <- read.csv(url, stringsAsFactors = FALSE)
glimpse(raw_ratings)
Rows: 16
Columns: 7
$ Critic <chr> "Burton", "Charley", "Dan", "Dieudonne", "Matt", "Mauri…
$ CaptainAmerica <int> NA, 4, NA, 5, 4, 4, 4, NA, 4, 4, 5, NA, 5, 4, 4, NA
$ Deadpool <int> NA, 5, 5, 4, NA, NA, 4, NA, 4, 3, 5, NA, 5, NA, 5, NA
$ Frozen <int> NA, 4, NA, NA, 2, 3, 4, NA, 1, 5, 5, 4, 5, NA, 3, 5
$ JungleBook <int> 4, 3, NA, NA, NA, 3, 2, NA, NA, 5, 5, 5, 4, NA, 3, 5
$ PitchPerfect2 <int> NA, 2, NA, NA, 2, 4, 2, NA, NA, 2, NA, NA, 4, NA, 3, NA
$ StarWarsForce <int> 4, 3, 5, 5, 5, NA, 4, 4, 5, 3, 4, 3, 5, 4, NA, NA
At this stage, the dataset remains in its original wide format, where each row represents a critic/user, each column represents a movie, and each cell contains the rating that the given user assigned to the corresponding movie. Missing values indicate movies that were not rated by that user.
For collaborative filtering, the data must be placed into a user-item matrix, where rows represent users and columns represent movies. Since the imported dataset is already mostly arranged in this structure, the main preparation step entails separating the user names from the numeric movie ratings and converting the ratings portion into a matrix.
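# Move the critic names into row names and convert the ratings to a matrix
ratings_matrix <- raw_ratings %>%
  column_to_rownames("Critic") %>%
  as.matrix()
# Coerce every column to numeric; apply() drops the row names, so restore them
ratings_matrix <- apply(ratings_matrix, 2, as.numeric)
rownames(ratings_matrix) <- raw_ratings$Critic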
ratings_matrix
          CaptainAmerica Deadpool Frozen JungleBook PitchPerfect2 StarWarsForce
Burton                NA       NA     NA          4            NA             4
Charley                4        5      4          3             2             3
Dan                   NA        5     NA         NA            NA             5
Dieudonne              5        4     NA         NA            NA             5
Matt                   4       NA      2         NA             2             5
Mauricio               4       NA      3          3             4            NA
Max                    4        4      4          2             2             4
Nathan                NA       NA     NA         NA            NA             4
Param                  4        4      1         NA            NA             5
Parshu                 4        3      5          5             2             3
Prashanth              5        5      5          5            NA             4
Shipra                NA       NA      4          5            NA             3
Sreejaya               5        5      5          4             4             5
Steve                  4       NA     NA         NA            NA             4
Vuthy                  4        5      3          3             3            NA
Xingjia               NA       NA      5          5            NA            NA
The resulting matrix now contains users as rows, movies as columns, and ratings as the matrix values. This structure is now suitable for conversion into a recommenderlab rating matrix object.
The recommenderlab package requires the ratings data to be stored as a realRatingMatrix. This format is designed specifically for user-item rating data and allows recommender algorithms to be applied more directly.
ratings_rrm <- as(ratings_matrix, "realRatingMatrix")
ratings_rrm
16 x 6 rating matrix of class 'realRatingMatrix' with 61 ratings.
At this stage, the movie ratings data has been converted into the required structure for building a personalized recommender system.
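Before fitting a model, it can help to quantify how sparse the matrix is and how many movies each critic rated. The sketch below re-enters the ratings shown above by hand so that it runs on its own; the object names m, n_rated, density, and fully_rated are illustrative.

```r
critics <- c("Burton", "Charley", "Dan", "Dieudonne", "Matt", "Mauricio",
             "Max", "Nathan", "Param", "Parshu", "Prashanth", "Shipra",
             "Sreejaya", "Steve", "Vuthy", "Xingjia")

# Ratings copied from the matrix above, one vector per movie
m <- cbind(
  CaptainAmerica = c(NA, 4, NA, 5, 4, 4, 4, NA, 4, 4, 5, NA, 5, 4, 4, NA),
  Deadpool       = c(NA, 5, 5, 4, NA, NA, 4, NA, 4, 3, 5, NA, 5, NA, 5, NA),
  Frozen         = c(NA, 4, NA, NA, 2, 3, 4, NA, 1, 5, 5, 4, 5, NA, 3, 5),
  JungleBook     = c(4, 3, NA, NA, NA, 3, 2, NA, NA, 5, 5, 5, 4, NA, 3, 5),
  PitchPerfect2  = c(NA, 2, NA, NA, 2, 4, 2, NA, NA, 2, NA, NA, 4, NA, 3, NA),
  StarWarsForce  = c(4, 3, 5, 5, 5, NA, 4, 4, 5, 3, 4, 3, 5, 4, NA, NA)
)
rownames(m) <- critics

n_rated <- rowSums(!is.na(m))   # number of movies rated by each critic
density <- mean(!is.na(m))      # share of filled cells: 61 of 96

# Critics with nothing left to recommend, and critics with very little overlap
fully_rated  <- names(n_rated)[n_rated == ncol(m)]
barely_rated <- names(n_rated)[n_rated <= 2]
```

Critics in fully_rated cannot receive recommendations because no unrated movie remains, while critics in barely_rated share few co-rated movies with anyone else, which makes their similarity scores unreliable.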
The personalized recommendation method used in this assignment will be user-to-user collaborative filtering. This method identifies users with similar rating patterns and then uses those similarities to recommend items that the target user has not yet rated.
ubcf_model <- Recommender(
  data = ratings_rrm,
  method = "UBCF",
  parameter = list(method = "Pearson", nn = 5)
)
ubcf_model
Recommender of type 'UBCF' for 'realRatingMatrix'
learned using 16 users.
In this model, Pearson correlation was used to measure the similarity between users. The nn = 5 argument indicates that the model will consider up to five nearest neighbors when generating recommendations.
Once the model has been built, it can be used to generate personalized recommendations. In this case, the recommender will output the top three recommended movies for each user.
top_n_recommendations <- predict(
  ubcf_model,
  newdata = ratings_rrm,
  n = 3
)
top_n_recommendations
Recommendations as 'topNList' with n = 3 for 16 users.
Initially, the recommendation output is stored in a recommenderlab-specific format. To make the results easier to inspect, the recommendations can be converted into a list.
# Convert to list
recommendations_list <- as(top_n_recommendations, "list")
# Assign row names from original dataset to recommendation list
names(recommendations_list) <- rownames(ratings_matrix)
# Convert to dataframe for cleaner design
recommendations_df <- data.frame(
  Critic = names(recommendations_list),
  Recommended_Movies = sapply(recommendations_list, toString),
  row.names = NULL
)
recommendations_df
   Critic    Recommended_Movies
1  Burton
2  Charley
3  Dan
4  Dieudonne JungleBook, Frozen, PitchPerfect2
5  Matt      Deadpool, JungleBook
6  Mauricio  StarWarsForce, Deadpool
7  Max
8  Nathan
9  Param     PitchPerfect2, JungleBook
10 Parshu
11 Prashanth PitchPerfect2
12 Shipra    CaptainAmerica, Deadpool, PitchPerfect2
13 Sreejaya
14 Steve
15 Vuthy     StarWarsForce
16 Xingjia
To evaluate the recommender, a hold-out evaluation approach will be used. In this method, a portion of the known ratings will be withheld from the model and then predicted after training. The predicted ratings will then be compared against the withheld ratings using accuracy measures such as RMSE and MAE.
set.seed(6767)
evaluation_scheme <- evaluationScheme(
  ratings_rrm,
  method = "split",
  train = 0.8,
  given = -1,
  goodRating = 4
)
ubcf_eval <- Recommender(
  getData(evaluation_scheme, "train"),
  method = "UBCF",
  parameter = list(method = "Pearson", nn = 5)
)
ubcf_predictions <- predict(
  ubcf_eval,
  getData(evaluation_scheme, "known"),
  type = "ratings"
)
accuracy <- calcPredictionAccuracy(
  ubcf_predictions,
  getData(evaluation_scheme, "unknown")
)
accuracy
     RMSE       MSE       MAE
0.7029084 0.4940802 0.4722222
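Running the evaluation produced a warning indicating that one user (user 8 in the matrix, Nathan) did not have enough ratings for the hold-out evaluation. Nathan rated only one movie, so the all-but-one scheme set by given = -1 leaves no known ratings to predict from. This is not unexpected, given the small and sparse nature of the dataset.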
The model produced an RMSE of 0.703 and an MAE of 0.472, meaning that the predicted ratings differed from the actual held-out ratings by roughly 0.5 to 0.7 points on average. The recommender therefore shows some predictive ability, although the results should be interpreted with caution: the dataset is small, and because given = -1 withholds only one rating from each test user, the error estimates rest on very few held-out ratings.
The recommendations_df table presents the personalized movie recommendations generated through user-to-user collaborative filtering. The model recommends movies that each critic has not yet rated, based on the preferences of other users with similar rating patterns.
Several critics received recommendations. For example, Dieudonne was recommended JungleBook, Frozen, and PitchPerfect2, and Shipra was recommended CaptainAmerica, Deadpool, and PitchPerfect2. These results suggest that the model was able to identify comparable users and use their ratings to generate personalized suggestions.
However, nine of the sixteen critics received no recommendations. Charley, Max, Parshu, and Sreejaya had already rated all six movies, leaving no unseen items to recommend. The remaining five (Burton, Dan, Nathan, Steve, and Xingjia) had each rated only one or two movies, so they likely did not have enough overlapping ratings with other users for the model to identify reliable similarities. This is expected given the small size and sparsity of the dataset.
The evaluation results also support this cautious interpretation. While the RMSE and MAE values suggest that the recommender had some predictive ability, the warning generated during evaluation highlights the limitations caused by limited user ratings and sparse overlap.
In conclusion, the personalized recommender system was able to generate meaningful recommendations for several users, while also highlighting the limitations of collaborative filtering when applied to a small and sparse dataset.