Assignment 11

Author

Theresa Benny

Approach

Planned Approach

For this assignment, I plan to build a personalized recommendation system using the same survey data from the earlier baseline recommendation assignment. Instead of producing one overall recommendation list for all users, the goal here is to generate recommendations that are tailored to each individual user based on their own responses and patterns in the data.

To tackle the problem, I will first examine how the survey data is structured and identify what should count as the users, the items, and the preference signal. In a recommendation setting, the data usually needs to be organized so that each user has some type of interaction, rating, or preference associated with different items. If the current survey data is not already in that format, I will need to reshape it into a user-item structure that can support recommendation modeling.
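The reshaping step described above can be sketched as follows. This is a minimal illustration, assuming long-format data with one row per user-item rating; the `toy_ratings` values are made up for the example and are not the survey data.

```r
library(dplyr)
library(tidyr)

# Toy long-format ratings (illustrative values, not the survey data)
toy_ratings <- tibble::tribble(
  ~initials, ~title,    ~rating,
  "A.A",     "Movie X", 5,
  "A.A",     "Movie Y", 3,
  "B.B",     "Movie X", 4,
  "B.B",     "Movie Z", 2,
  "C.C",     "Movie Y", 5
)

# One row per user, one column per title; unrated pairs stay NA
user_item <- toy_ratings %>%
  pivot_wider(names_from = title, values_from = rating)

user_item
```

Keeping the long format is often just as workable for small data; the wide form mainly helps when comparing two users' rating vectors side by side.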

My choice is a collaborative filtering approach because it is a common and effective way to generate personalized recommendations based on similarities in user behavior or item preference patterns. Once the model is built, I want it to produce a ranked list of recommended items for each user, such as the top 3 or top 5 items the user has not already selected but is predicted to like.
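One common way to quantify "similarity in user behavior" is cosine similarity over the items two users have both rated. The sketch below is only an illustration of that idea with made-up values; the code later in this document uses a simpler overlap rule instead, and `cosine_sim()` is a hypothetical helper name.

```r
# Toy user-item matrix (illustrative values); NA means "not rated"
ratings <- matrix(
  c(5, 4, NA,
    4, 5, 2,
    1, 5, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A.A", "B.B", "C.C"),
                  c("Movie X", "Movie Y", "Movie Z"))
)

# Cosine similarity computed only over movies both users rated
cosine_sim <- function(u, v) {
  both <- !is.na(u) & !is.na(v)
  if (!any(both)) return(NA_real_)  # no overlap, similarity undefined
  sum(u[both] * v[both]) /
    (sqrt(sum(u[both]^2)) * sqrt(sum(v[both]^2)))
}

sim_ab <- cosine_sim(ratings["A.A", ], ratings["B.B", ])
sim_ac <- cosine_sim(ratings["A.A", ], ratings["C.C", ])
```

With very few co-rated movies this measure can be unstable, which is one reason a threshold-based overlap rule may be a reasonable choice for a small survey.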

To evaluate the recommender, I plan to hold out part of the data and test how well the model predicts unseen preferences. This will help me measure whether the recommendations are meaningful rather than just fitting the original data too closely. Depending on the final format of the output, I may evaluate performance using prediction accuracy or ranking-based measures.
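As an example of a ranking-based measure, precision at k is the share of the top k recommendations that the user actually liked. A minimal sketch, where `precision_at_k()` and the movie names are hypothetical and chosen only for illustration:

```r
# Share of the top-k recommended titles that are in the relevant set
precision_at_k <- function(recommended, relevant, k) {
  top_k <- head(recommended, k)
  sum(top_k %in% relevant) / k
}

p2 <- precision_at_k(
  recommended = c("Movie X", "Movie Y", "Movie Z"),  # ranked list
  relevant    = c("Movie Y", "Movie W"),             # held-out liked titles
  k = 2
)
```

Dividing by k (rather than by the number of recommendations returned) penalizes a recommender that produces fewer than k items, which may or may not be the behavior wanted here.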

Anticipated Data Challenges

One challenge I anticipate is that the survey data may not look like a traditional recommender dataset. Recommendation systems usually rely on clear user-item interactions, but survey data may contain responses in a format that needs interpretation before it can be used. I may need to decide how to convert survey answers into a usable preference measure, such as binary selections, scaled ratings, or other indicators of interest.
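Converting a scaled rating into a binary preference can be sketched like this. The cutoff of 4 on a 1 to 5 scale is an assumption for the example, and the data values are made up.

```r
library(dplyr)

# Toy ratings (illustrative values, not the survey data)
toy_ratings <- tibble::tibble(
  initials = c("A.A", "A.A", "B.B"),
  title    = c("Movie X", "Movie Y", "Movie X"),
  rating   = c(5, 2, 4)
)

like_threshold <- 4  # assumed cutoff on a 1-5 scale

# Add a TRUE/FALSE "liked" flag while keeping the original rating
toy_prefs <- toy_ratings %>%
  mutate(liked = rating >= like_threshold)
```

Keeping both columns makes it easy to try a different threshold later without reloading the data.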

Another likely challenge is missing or sparse data. In many recommendation problems, users have only interacted with a small number of items, which can make it harder to identify strong patterns or similarities. If the survey responses are incomplete or uneven across users, that could affect how well the recommender performs.
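A quick way to gauge sparsity is to compare the number of observed ratings with the number of possible user-item pairs. The sketch below assumes at most one rating per user and title, and uses made-up values.

```r
library(dplyr)

# Toy ratings (illustrative values, not the survey data)
toy_ratings <- tibble::tribble(
  ~initials, ~title,    ~rating,
  "A.A",     "Movie X", 5,
  "A.A",     "Movie Y", 3,
  "B.B",     "Movie X", 4,
  "B.B",     "Movie Z", 2,
  "C.C",     "Movie Y", 5
)

n_users <- n_distinct(toy_ratings$initials)
n_items <- n_distinct(toy_ratings$title)

# Fraction of all possible user-item pairs that have a rating
density <- nrow(toy_ratings) / (n_users * n_items)

# How many items each user rated, to spot users with very little data
ratings_per_user <- toy_ratings %>%
  count(initials, name = "num_rated")
```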

I also expect that evaluation may be a challenge. Since personalized recommendation is about suggesting relevant items to each user, I will need to make sure the testing approach reflects that goal. This means I will need to carefully separate training and test data so the model is evaluated fairly.
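One way to separate training and test data is to hold out a single rating per user, skipping users who have only one rating so they are never left with nothing to learn from. A minimal sketch with made-up values; the seed and the one-per-user rule are assumptions for the example.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the split is reproducible

# Toy ratings (illustrative values, not the survey data)
toy_ratings <- tibble::tribble(
  ~initials, ~title,    ~rating,
  "A.A",     "Movie X", 5,
  "A.A",     "Movie Y", 3,
  "B.B",     "Movie X", 4,
  "B.B",     "Movie Z", 2,
  "C.C",     "Movie Y", 5
)

# Hold out one random rating for each user with at least two ratings
test_set <- toy_ratings %>%
  group_by(initials) %>%
  filter(n() >= 2) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Everything that was not held out becomes training data
train_set <- anti_join(toy_ratings, test_set, by = c("initials", "title"))
```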

Overall, my approach is to first make the survey data suitable for recommendation analysis, then apply one personalized recommendation method, and finally assess whether the model can generate useful user-specific recommendations.

Codebase

# I already have a cleaned version of my survey data, so I will reuse it here:
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
movie_ratings <- readRDS("movie_ratings.rds")
# First, drop rows with a missing rating so the averages are computed only from actual ratings
ratings_clean <- movie_ratings %>%
  filter(!is.na(rating))

names(ratings_clean)

[1] "initials" "title"    "rating"

head(ratings_clean)

  initials                    title rating
1      D.B                  Sinners      5
2      M.D One Battle After Another      3
3      M.D          Caught Stealing      1
4      M.D             Frankenstein      4
5      M.D         Wake Up Dead Man      5
6      M.D                  Sinners      5

Next, I need an item summary to see which movies are generally rated the highest.

item_summary <- ratings_clean %>%
  group_by(title) %>%
  summarize(
    avg_rating = mean(rating),
    num_ratings = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_rating), desc(num_ratings))

item_summary

# A tibble: 6 × 3
  title                      avg_rating num_ratings
  <chr>                           <dbl>       <int>
1 Sinners                          4.86           7
2 Wake Up Dead Man                 4.6            5
3 People We Meet on Vacation       4.4            5
4 Frankenstein                     3.71           7
5 One Battle After Another         3.5            4
6 Caught Stealing                  3              5

Now we can start on collaborative filtering. Let’s begin by picking one user.

target_user <- "M.D"

user_data <- ratings_clean %>%
  filter(initials == target_user)

user_data

  initials                    title rating
1      M.D One Battle After Another      3
2      M.D          Caught Stealing      1
3      M.D             Frankenstein      4
4      M.D         Wake Up Dead Man      5
5      M.D                  Sinners      5

I treat a rating of 4 or higher as a “like.” The movies M.D liked are not the recommendations themselves; they are used to find other users with overlapping tastes.

liked_items <- user_data %>%
  filter(rating >= 4) %>%
  pull(title)

liked_items

[1] "Frankenstein"     "Wake Up Dead Man" "Sinners"

Next, let’s find the other users who rated at least one of these movies 4 or higher.

similar_users <- ratings_clean %>%
  filter(
    title %in% liked_items,
    rating >= 4,
    initials != target_user
  ) %>%
  distinct(initials)

similar_users

  initials
1      D.B
2      R.T
3      K.B
4      M.P
5      Z.S
6      T.C
7      C.S
8      R.Y

Now we recommend movies that similar users rated highly, but remove movies M.D already rated.

already_rated <- user_data %>%
  pull(title)

recommendations <- ratings_clean %>%
  filter(
    initials %in% similar_users$initials,
    rating >= 4,
    !(title %in% already_rated)
  ) %>%
  group_by(title) %>%
  summarize(
    avg_rating = mean(rating),
    times_recommended = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(times_recommended), desc(avg_rating))

recommendations

# A tibble: 1 × 3
  title                      avg_rating times_recommended
  <chr>                           <dbl>             <int>
1 People We Meet on Vacation        4.4                 5

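The recommender identified “People We Meet on Vacation” as the personalized recommendation for M.D. It is the only candidate because M.D has already rated the other five movies in the data. To generate recommendations for any user, the same steps can be wrapped in a function.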

get_recommendations <- function(user_name, data) {
  
  user_data <- data %>%
    filter(initials == user_name)
  
  liked_items <- user_data %>%
    filter(rating >= 4) %>%
    pull(title)
  
  similar_users <- data %>%
    filter(
      title %in% liked_items,
      rating >= 4,
      initials != user_name
    ) %>%
    distinct(initials)
  
  already_rated <- user_data %>%
    pull(title)
  
  recommendations <- data %>%
    filter(
      initials %in% similar_users$initials,
      rating >= 4,
      !(title %in% already_rated)
    ) %>%
    group_by(title) %>%
    summarize(
      avg_rating = mean(rating),
      times_recommended = n(),
      .groups = "drop"
    ) %>%
    arrange(desc(times_recommended), desc(avg_rating))
  
  return(recommendations)
}

Testing the function now.

get_recommendations("R.T", ratings_clean)

# A tibble: 2 × 3
  title           avg_rating times_recommended
  <chr>                <dbl>             <int>
1 Sinners               4.83                 6
2 Caught Stealing       4.5                  2

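For R.T, the system recommended “Sinners” and “Caught Stealing.” These movies were recommended because they were rated 4 or higher by users who shared at least one liked movie with R.T, and R.T had not already rated them. Note that avg_rating here is the average over those ratings of 4 or higher only, not each movie’s overall average.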

Next, we’ll hide one rating, build recommendations from the remaining data, and see whether the hidden movie appears in the recommendation list.

# Choose one known movie M.D liked and pretend we do not know it
test_user <- "M.D"
hidden_movie <- "Sinners"

# Create training data without that one rating
train_data <- ratings_clean %>%
  filter(!(initials == test_user & title == hidden_movie))

# Generate recommendations using only the training data
test_recommendations <- get_recommendations(test_user, train_data)

test_recommendations

# A tibble: 2 × 3
  title                      avg_rating times_recommended
  <chr>                           <dbl>             <int>
1 Sinners                          4.75                 4
2 People We Meet on Vacation       4.25                 4

hidden_movie %in% test_recommendations$title

[1] TRUE

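“Sinners” appeared in the recommendation list, so the recommender recovered the one preference that was hidden. This is a single hold-out test, and “Sinners” is also the highest-rated movie overall (average 4.86), so a non-personalized popularity list would have recovered it as well. The result is encouraging, but on its own it does not show that the model captures user-specific patterns; repeating the test across more users and hidden movies would give stronger evidence.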
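The single hold-out check above can be repeated for every liked rating to get an overall hit rate. This is a minimal sketch under stated assumptions: the data are made up, and `recommend_titles()` is a hypothetical condensed helper that mirrors the logic of `get_recommendations()` but returns only the candidate titles.

```r
library(dplyr)
library(purrr)

# Toy ratings (illustrative values, not the survey data)
toy_ratings <- tibble::tribble(
  ~initials, ~title,    ~rating,
  "A.A",     "Movie X", 5,
  "A.A",     "Movie Y", 4,
  "B.B",     "Movie X", 5,
  "B.B",     "Movie Y", 5,
  "B.B",     "Movie Z", 4,
  "C.C",     "Movie Y", 4,
  "C.C",     "Movie Z", 5,
  "D.D",     "Movie W", 5
)

# Hypothetical condensed version of the overlap-based recommender
recommend_titles <- function(user_name, data, threshold = 4) {
  rated <- data %>% filter(initials == user_name)
  liked <- rated %>% filter(rating >= threshold) %>% pull(title)
  peers <- data %>%
    filter(title %in% liked, rating >= threshold, initials != user_name) %>%
    distinct(initials) %>%
    pull(initials)
  data %>%
    filter(initials %in% peers, rating >= threshold,
           !(title %in% rated$title)) %>%
    distinct(title) %>%
    pull(title)
}

# Every liked rating becomes one hold-out case
cases <- toy_ratings %>% filter(rating >= 4)

# Hide each case in turn and check whether it is recommended back
hits <- map2_lgl(cases$initials, cases$title, function(u, m) {
  train <- toy_ratings %>% filter(!(initials == u & title == m))
  m %in% recommend_titles(u, train)
})

hit_rate <- mean(hits)
```

Because this helper returns every qualifying title rather than a top 3 or top 5 list, the hit rate it produces is optimistic; truncating the list to the top k before checking would give a stricter measure.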