For this assignment, I plan to build a personalized recommendation system using the same survey data from the earlier baseline recommendation assignment. Instead of producing one overall recommendation list for all users, the goal here is to generate recommendations that are tailored to each individual user based on their own responses and patterns in the data.
To tackle the problem, I will first examine how the survey data is structured and identify what should count as the users, the items, and the preference signal. In a recommendation setting, the data usually needs to be organized so that each user has some type of interaction, rating, or preference associated with different items. If the current survey data is not already in that format, I will need to reshape it into a user-item structure that can support recommendation modeling.
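As a rough illustration of the reshaping step, the sketch below turns a long table of ratings into a user-item layout. The data are made up for the example (not the survey responses), and `toy_ratings` and `user_item` are illustrative names; the column names `initials`, `title`, and `rating` are assumed to match the cleaned survey data.

```r
library(dplyr)
library(tidyr)

# Toy long-format ratings (hypothetical values, not the survey data)
toy_ratings <- data.frame(
  initials = c("A", "A", "B", "B", "C"),
  title    = c("Movie X", "Movie Y", "Movie X", "Movie Z", "Movie Y"),
  rating   = c(5, 3, 4, 2, 5),
  stringsAsFactors = FALSE
)

# One row per user, one column per movie; movies a user did not rate stay NA
user_item <- toy_ratings %>%
  pivot_wider(names_from = title, values_from = rating)

user_item
```

Keeping unrated movies as NA rather than 0 matters here, because a 0 would be read as a very low rating instead of a missing one.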
My choice is a collaborative filtering approach because it is a common and effective way to generate personalized recommendations based on similarities in user behavior or item preference patterns. Once the model is built, I want it to produce a ranked list of recommended items for each user, such as the top 3 or top 5 items the user has not already selected but is predicted to like.
To evaluate the recommender, I plan to hold out part of the data and test how well the model predicts unseen preferences. This will help me measure whether the recommendations are meaningful rather than just fitting the original data too closely. Depending on the final format of the output, I may evaluate performance using prediction accuracy or ranking-based measures.
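To make the two kinds of measures concrete, here is a minimal sketch of one prediction-accuracy measure (RMSE) and one ranking-based measure (hit rate at k). The function names `rmse` and `hit_rate_at_k` and all values are assumptions for illustration only; which measure fits best depends on the final output format.

```r
# Root mean squared error between predicted and held-out ratings
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

# Hit rate at k: share of users whose hidden item appears in their top-k list
hit_rate_at_k <- function(top_lists, hidden_items, k = 3) {
  hits <- mapply(
    function(recs, hidden) hidden %in% head(recs, k),
    top_lists, hidden_items
  )
  mean(hits)
}

# Hypothetical values purely for illustration
example_rmse <- rmse(c(4, 3, 5), c(5, 3, 4))

top_lists <- list(c("Movie X", "Movie Y"), c("Movie Z", "Movie X"))
example_hit_rate <- hit_rate_at_k(top_lists, c("Movie Y", "Movie Y"), k = 2)
```

RMSE needs the model to output predicted ratings, while hit rate only needs a ranked list, so the second is likely the better fit if the recommender returns top 3 or top 5 lists.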
Anticipated Data Challenges
One challenge I anticipate is that the survey data may not look like a traditional recommender dataset. Recommendation systems usually rely on clear user-item interactions, but survey data may contain responses in a format that needs interpretation before it can be used. I may need to decide how to convert survey answers into a usable preference measure, such as binary selections, scaled ratings, or other indicators of interest.
Another likely challenge is missing or sparse data. In many recommendation problems, users have only interacted with a small number of items, which can make it harder to identify strong patterns or similarities. If the survey responses are incomplete or uneven across users, that could affect how well the recommender performs.
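One way to check how sparse the data are before modeling is sketched below, again on made-up ratings. The names `toy_ratings`, `sparsity`, and `ratings_per_user` are illustrative, and the column names are assumed to match the cleaned survey data.

```r
library(dplyr)

# Toy ratings with one missing response (hypothetical values)
toy_ratings <- data.frame(
  initials = c("A", "A", "B", "B", "C"),
  title    = c("Movie X", "Movie Y", "Movie X", "Movie Z", "Movie Y"),
  rating   = c(5, 3, 4, NA, 5),
  stringsAsFactors = FALSE
)

observed <- toy_ratings %>% filter(!is.na(rating))

n_users <- n_distinct(toy_ratings$initials)
n_items <- n_distinct(toy_ratings$title)

# Share of user-movie cells that have no rating
sparsity <- 1 - nrow(observed) / (n_users * n_items)

# Ratings per user, to spot users with too little data to compare
ratings_per_user <- observed %>% count(initials, name = "num_ratings")

sparsity
ratings_per_user
```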
I also expect that evaluation may be a challenge. Since personalized recommendation is about suggesting relevant items to each user, I will need to make sure the testing approach reflects that goal. This means I will need to carefully separate training and test data so the model is evaluated fairly.
Overall, my approach is to first make the survey data suitable for recommendation analysis, then apply one personalized recommendation method, and finally assess whether the model can generate useful user-specific recommendations.
Codebase
# Since I have a cleaned version of my survey data, I'm going to reuse that:
library(tidyverse)
library(dplyr)
movie_ratings <- readRDS("movie_ratings.rds")
# First, remove the NA ratings so we get clean data that we can use to calculate the averages
ratings_clean <- movie_ratings %>%
  filter(!is.na(rating))
names(ratings_clean)
[1] "initials" "title" "rating"
head(ratings_clean)
  initials                    title rating
1      D.B                  Sinners      5
2      M.D One Battle After Another      3
3      M.D          Caught Stealing      1
4      M.D             Frankenstein      4
5      M.D         Wake Up Dead Man      5
6      M.D                  Sinners      5
Next, I need an item summary to see which movies are generally rated the highest.
# A tibble: 6 × 3
  title                      avg_rating num_ratings
  <chr>                           <dbl>       <int>
1 Sinners                          4.86           7
2 Wake Up Dead Man                 4.6            5
3 People We Meet on Vacation       4.4            5
4 Frankenstein                     3.71           7
5 One Battle After Another         3.5            4
6 Caught Stealing                  3              5
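The code that produced this summary is not shown, so the following is only a sketch of how a table with the same columns (`title`, `avg_rating`, `num_ratings`) could be built. It runs on made-up ratings, and `toy_ratings` and `item_summary` are illustrative names.

```r
library(dplyr)

# Toy ratings (hypothetical values, not the survey data)
toy_ratings <- data.frame(
  initials = c("A", "A", "B", "B", "C"),
  title    = c("Movie X", "Movie Y", "Movie X", "Movie Z", "Movie Y"),
  rating   = c(5, 3, 4, 2, 5),
  stringsAsFactors = FALSE
)

# Average rating and number of ratings per movie, highest average first
item_summary <- toy_ratings %>%
  group_by(title) %>%
  summarise(
    avg_rating  = mean(rating),
    num_ratings = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_rating))

item_summary
```

Keeping `num_ratings` next to the average helps flag movies whose high average rests on only a few ratings.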
Now we can start using collaborative filtering. Let’s begin by picking a single user.
# A tibble: 1 × 3
  title                      avg_rating times_recommended
  <chr>                           <dbl>             <int>
1 People We Meet on Vacation        4.4                 5
The recommender identified “People We Meet on Vacation” as the personalized recommendation for M.D. Next, we can wrap this logic in a function so that it can be applied to more than one user.
For R.T, the system recommended “Sinners” and “Caught Stealing.” These movies were recommended because they were highly rated by users who had overlapping preferences with R.T, but R.T had not already rated them.
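The recommendation code itself is not shown, so the sketch below is only one plausible way to implement the single-user and multi-user steps described above. It is a simple neighbour-based approach: users who liked at least one of the same movies count as similar, and their ratings of movies the target user has not rated are averaged. The function name `recommend_for_user`, the `like_threshold` of 4, and the toy data are all assumptions, not the logic actually used for the results in this report.

```r
library(dplyr)

# Toy ratings (hypothetical values, not the survey data)
toy_ratings <- data.frame(
  initials = c("A", "A", "B", "B", "B", "C", "C", "D"),
  title    = c("Movie X", "Movie Y", "Movie X", "Movie Z", "Movie W",
               "Movie Y", "Movie Z", "Movie W"),
  rating   = c(5, 4, 5, 5, 2, 4, 4, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: neighbour-based recommendations for one user
recommend_for_user <- function(user, ratings, like_threshold = 4) {
  # Movies the target user has rated, and the subset they liked
  seen  <- ratings %>% filter(initials == user)
  liked <- seen %>% filter(rating >= like_threshold)

  # Neighbours: other users who liked at least one of the same movies
  neighbours <- ratings %>%
    filter(initials != user,
           title %in% liked$title,
           rating >= like_threshold) %>%
    distinct(initials)

  # Movies the neighbours rated that the target user has not rated yet
  ratings %>%
    filter(initials %in% neighbours$initials,
           !(title %in% seen$title)) %>%
    group_by(title) %>%
    summarise(avg_rating = mean(rating),
              times_recommended = n(),
              .groups = "drop") %>%
    arrange(desc(avg_rating))
}

# One user
recs_a <- recommend_for_user("A", toy_ratings)

# Several users at once, as a named list of recommendation tables
users <- c("A", "D")
all_recs <- setNames(
  lapply(users, recommend_for_user, ratings = toy_ratings),
  users
)
```

In this toy data, user D shares no liked movies with anyone, so the function returns an empty table for D. That is the sparse-data problem from earlier showing up in practice, and a fallback such as the overall item summary may be needed for such users.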
Next, we’ll hide one rating, build recommendations from the remaining data, and see whether the hidden movie appears in the recommendation.
# Choose one known movie M.D liked and pretend we do not know it
test_user <- "M.D"
hidden_movie <- "Sinners"
# Create training data without that one rating
train_data <- ratings_clean %>%
  filter(!(initials == test_user & title == hidden_movie))
# Generate recommendations using only the training data
test_recommendations <- get_recommendations(test_user, train_data)
test_recommendations
# A tibble: 2 × 3
  title                      avg_rating times_recommended
  <chr>                           <dbl>             <int>
1 Sinners                          4.75                 4
2 People We Meet on Vacation       4.25                 4
hidden_movie %in% test_recommendations$title
[1] TRUE
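The hidden movie, “Sinners,” appeared in the recommendation list, so the recommender recovered a known preference for M.D. This is an encouraging sign that the model is capturing meaningful patterns in user preferences, although a single hidden rating is limited evidence, and “Sinners” is also the highest-rated movie overall, so part of this result may reflect general popularity rather than personalization.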
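A single hidden rating is only one test case. One way to extend the check is to hide each liked rating in turn and record how often the hidden movie comes back in the recommendations. The sketch below does this on made-up data with a stand-in recommender; `recommend_for_user`, the `like_threshold` of 4, and `toy_ratings` are assumptions and not the `get_recommendations()` function used above.

```r
library(dplyr)

# Toy ratings (hypothetical values, not the survey data)
toy_ratings <- data.frame(
  initials = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D"),
  title    = c("Movie X", "Movie Y", "Movie Z",
               "Movie X", "Movie Y", "Movie Z",
               "Movie X", "Movie Z", "Movie W",
               "Movie W"),
  rating   = c(5, 4, 5, 5, 4, 4, 4, 5, 2, 5),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in recommender: average the ratings that similar users
# (those who liked at least one of the same movies) gave to unseen movies
recommend_for_user <- function(user, ratings, like_threshold = 4) {
  seen  <- ratings %>% filter(initials == user)
  liked <- seen %>% filter(rating >= like_threshold)
  neighbours <- ratings %>%
    filter(initials != user,
           title %in% liked$title,
           rating >= like_threshold) %>%
    distinct(initials)
  ratings %>%
    filter(initials %in% neighbours$initials,
           !(title %in% seen$title)) %>%
    group_by(title) %>%
    summarise(avg_rating = mean(rating),
              times_recommended = n(),
              .groups = "drop") %>%
    arrange(desc(avg_rating))
}

# Every liked rating becomes a hold-out case once
cases <- toy_ratings %>% filter(rating >= 4)

hits <- vapply(seq_len(nrow(cases)), function(i) {
  user   <- cases$initials[i]
  hidden <- cases$title[i]
  # Training data without the one hidden rating
  train  <- toy_ratings %>% filter(!(initials == user & title == hidden))
  hidden %in% recommend_for_user(user, train)$title
}, logical(1))

# Share of hidden liked movies that were recovered
hit_rate <- mean(hits)
hit_rate
```

Comparing this hit rate with the one obtained by always recommending the top-rated movies overall would show how much the personalization adds beyond popularity.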