SPS_Data607_Week11_DC

Author

David Chen

Assignment: Build a Personalized Recommendation System

In a previous assignment, you implemented a Global Baseline Estimate, which produced non-personalized recommendations. In this assignment, you will use the same survey data to build a personalized recommender system.

Your task is to:

Choose and implement one personalized recommendation algorithm, such as:
- Content-based filtering
- Item-to-item collaborative filtering
- User-to-user collaborative filtering
- Matrix factorization
Define what your recommender will output (e.g., top-N items per user, predicted ratings, ranked lists).
Evaluate the performance of your recommender using an appropriate method (e.g., hold-out data, cross-validation, ranking metrics).

You may either:

Use an existing recommender package, or
Implement the algorithm from scratch.

Your submission should include the code, the recommendation output, and a brief explanation of how the model was built and evaluated.

Approach

Due to the small size of the dataset, I chose user-to-user collaborative filtering as the recommendation approach. This method is suitable because it can compute similarities directly from user ratings, is easy to implement, and is straightforward to explain.

For each user, I identify the most similar users based on their rating patterns and recommend movies that those similar users rated highly but the target user has not yet seen.

To make the system more practical and realistic, I extend the approach to generate Top-N recommendations, rather than a single item, by ranking unseen movies based on weighted similarity scores.
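The weighted similarity score can be illustrated for a single unseen movie. The sketch below uses made-up similarities and ratings (the names and values are assumptions for illustration, not taken from the survey data): each neighbour's rating is weighted by that neighbour's similarity to the target user, and the total is divided by the sum of the absolute similarities.

```r
# Hypothetical similarities between the target user and three neighbours
sims <- c(Alice = 0.9, Bob = 0.6, Charlie = 0.3)

# Hypothetical ratings those neighbours gave to one unseen movie
neighbour_ratings <- c(Alice = 5, Bob = 3, Charlie = 1)

# Similarity-weighted average: closer neighbours count for more
score <- sum(sims * neighbour_ratings) / sum(abs(sims))
score
```

Because Alice is the most similar neighbour, her rating of 5 pulls the score above the plain average of the three ratings.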

The implementation consists of the following steps:

Load the dataset
Compute user-to-user similarity
Apply a Top-N recommendation function
Output the Top-N recommended movies for each user

Code Base

library(DBI)
library(RPostgres)

# Check for an existing cached CSV file
if (file.exists("w11_rating.csv")) {
  # Load the cached ratings
  print("Cache files exits\n")

  w11_data <- read.csv("w11_rating.csv")
  
} else {
  con <- dbConnect(
    RPostgres::Postgres(),
    dbname = "chatgpt_c",      # your database name
    host = "192.168.100.61",   # or server IP
    port = 5432,               # default PostgreSQL port
    user = "postgres",         # your DB username
    password = "ubuntu"        # your DB password
  )
  dbListTables(con)

  query <- "SELECT m.title, r.rater_name, r.rating
          FROM movies m
          JOIN ratings r ON m.movie_id = r.movie_id
          ORDER BY m.movie_id, r.rater_name;"

  w11_data <- dbGetQuery(con, query)
  write.csv(w11_data, "w11_rating.csv", row.names = FALSE)

  # Close the connection once the data is cached locally
  dbDisconnect(con)
}

[1] "Cache files exits\n"

head(w11_data)

                     title rater_name rating
1 Avatar: The Way of Water      Alice      5
2 Avatar: The Way of Water        Bob      4
3 Avatar: The Way of Water    Charlie      3
4 Avatar: The Way of Water      David      4
5 Avatar: The Way of Water        Eve     NA
6              Oppenheimer      Alice      4

Option 1: Replace NA with 0

library(tidyr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)
# Encode "not rated" as 0 so every user/movie cell holds a number
w11_data <- w11_data %>%
  mutate(rating = ifelse(is.na(rating), 0, rating))
head(w11_data)

                     title rater_name rating
1 Avatar: The Way of Water      Alice      5
2 Avatar: The Way of Water        Bob      4
3 Avatar: The Way of Water    Charlie      3
4 Avatar: The Way of Water      David      4
5 Avatar: The Way of Water        Eve      0
6              Oppenheimer      Alice      4
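Replacing NA with 0 changes what the similarity measure sees, because a 0 is treated as a real (very low) rating rather than as missing. The sketch below compares the two treatments on two made-up rating vectors; the vectors and the helper name `cosine` are assumptions for illustration only.

```r
# Plain cosine similarity between two numeric vectors
cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Two hypothetical users who agree on every movie they both rated
u <- c(5, 4, NA, 3)
v <- c(5, NA, 4, 3)

# Treatment 1: compare only the movies both users rated
both <- !is.na(u) & !is.na(v)
sim_corated <- cosine(u[both], v[both])

# Treatment 2: replace NA with 0 first, then compare all movies
u0 <- ifelse(is.na(u), 0, u)
v0 <- ifelse(is.na(v), 0, v)
sim_zero_filled <- cosine(u0, v0)

c(corated = sim_corated, zero_filled = sim_zero_filled)
```

With zero filling, the two users look less alike simply because each has a movie the other did not rate.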

Compute user-to-user similarity

user_movie_matrix <- w11_data %>%
  pivot_wider(names_from = title, values_from = rating)

user_movie_matrix <- as.data.frame(user_movie_matrix)

rownames(user_movie_matrix) <- user_movie_matrix$rater_name
user_movie_matrix$rater_name <- NULL

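# Cosine similarity between two users' rating rows.
# Only positions where both rows are non-NA are compared; after the
# NA-to-0 replacement above, that is every movie.
cosine_similarity <- function(x, y) {
  common <- which(!is.na(x) & !is.na(y))
  if(length(common) == 0) return(0)
  
  sum(x[common] * y[common]) / 
    (sqrt(sum(x[common]^2)) * sqrt(sum(y[common]^2)))
}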

user_similarity <- matrix(0, 
                          nrow = nrow(user_movie_matrix), 
                          ncol = nrow(user_movie_matrix))

rownames(user_similarity) <- rownames(user_movie_matrix)
colnames(user_similarity) <- rownames(user_movie_matrix)

# Fill the similarity matrix one user pair at a time
for(i in seq_len(nrow(user_movie_matrix))) {
  for(j in seq_len(nrow(user_movie_matrix))) {
    user_similarity[i, j] <- cosine_similarity(
      user_movie_matrix[i, ], 
      user_movie_matrix[j, ]
    )
  }
}
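When the rating matrix contains no NA values (as is the case after the zero replacement in Option 1), the same pairwise cosine similarities can also be computed with matrix algebra instead of a double loop. This is a minimal sketch on a made-up 3 x 3 matrix; the object names `toy_ratings`, `norms`, and `toy_similarity` are hypothetical.

```r
# Hypothetical user x movie matrix with 0 meaning "not rated"
toy_ratings <- matrix(
  c(5, 4, 0,
    4, 0, 3,
    0, 5, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Alice", "Bob", "Eve"), c("M1", "M2", "M3"))
)

# Length (Euclidean norm) of each user's rating row
norms <- sqrt(rowSums(toy_ratings^2))

# Dot products of all row pairs, divided by the product of their norms
toy_similarity <- tcrossprod(toy_ratings) / outer(norms, norms)
toy_similarity
```

The result is a symmetric matrix with 1 on the diagonal, since every user is perfectly similar to themselves.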

Apply a Top-N recommendation function

Output the Top-N recommended movies for each user

recommend_movies <- function(target_user, top_n = 5) {
  
  # Get similarity scores
  sim_scores <- user_similarity[target_user, ]
  
  # Sort similar users (excluding self)
  sim_users <- sort(sim_scores, decreasing = TRUE)
  sim_users <- sim_users[names(sim_users) != target_user]
  
  # Get target user's ratings
  target_ratings <- user_movie_matrix[target_user, ]
  
  # Movies not yet seen (a rating of 0 marks "not rated")
  unseen_movies <- names(target_ratings)[which(target_ratings == 0)]
  
  scores <- c()
  
  for(movie in unseen_movies) {
    weighted_sum <- 0
    sim_sum <- 0
    
    for(user in names(sim_users)) {
      rating <- user_movie_matrix[user, movie]
      
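      # Because NA was replaced with 0, this check passes for every user,
      # so a neighbour's 0 ("not rated") is averaged in like a real rating
      if(!is.na(rating)) {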
        weighted_sum <- weighted_sum + sim_users[user] * rating
        sim_sum <- sim_sum + abs(sim_users[user])
      }
    }
    
    if(sim_sum > 0) {
      scores[movie] <- weighted_sum / sim_sum
    }
  }
  # Return Top-N movies
  recommended <- sort(scores, decreasing = TRUE)
  return(head(recommended, top_n))
}

recommend_movies("David", top_n = 5)

Stranger Things S5      Dune Part Two      Black Panther      The Godfather 
         4.7007046          2.0016862          1.1797700          0.4937011
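One way to evaluate a recommender of this kind is leave-one-out testing: hide one known rating at a time, predict it from the remaining data, and summarise the errors as RMSE. The sketch below runs this on a made-up matrix rather than the survey data, and the helpers `cosine_rows` and `predict_rating` are hypothetical names, not part of the code above. Unlike the scoring loop above, it only averages over neighbours who actually rated the movie (rating greater than 0).

```r
# Hypothetical ratings; 0 marks "not rated"
toy <- matrix(
  c(5, 4, 0, 1,
    4, 5, 2, 0,
    1, 0, 5, 4,
    0, 2, 4, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("U1", "U2", "U3", "U4"), c("M1", "M2", "M3", "M4"))
)

# Cosine similarity between all pairs of rows
cosine_rows <- function(m) {
  norms <- sqrt(rowSums(m^2))
  tcrossprod(m) / outer(norms, norms)
}

# Similarity-weighted average over neighbours who rated the movie
predict_rating <- function(m, user, movie) {
  sim <- cosine_rows(m)[user, ]
  raters <- setdiff(rownames(m)[m[, movie] > 0], user)
  if (length(raters) == 0 || sum(abs(sim[raters])) == 0) return(NA_real_)
  sum(sim[raters] * m[raters, movie]) / sum(abs(sim[raters]))
}

# Hide each observed rating in turn and record the prediction error
observed <- which(toy > 0, arr.ind = TRUE)
errors <- apply(observed, 1, function(idx) {
  held_out <- toy
  held_out[idx["row"], idx["col"]] <- 0
  pred <- predict_rating(held_out,
                         rownames(toy)[idx["row"]],
                         colnames(toy)[idx["col"]])
  pred - toy[idx["row"], idx["col"]]
})

rmse <- sqrt(mean(errors^2, na.rm = TRUE))
rmse
```

A lower RMSE means the predicted ratings sit closer to the hidden ones; on a dataset this small the figure should be read as a rough indication only.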

Conclusion

In this project, a Top-N recommendation approach based on user-to-user collaborative filtering was implemented on a relatively small dataset. The method does generate personalized recommendations from user similarity, but the output shows clear weaknesses: movies with very low predicted scores still appear in the list (for David, The Godfather scores 0.49 and Black Panther 1.18). Two factors contribute. First, with so few users and ratings, the similarity calculations rest on sparse interactions and are unreliable. Second, unrated entries were replaced with 0, and the scoring loop averages those zeros together with real ratings, which pulls predicted scores down. As a result, the ranking of items is less accurate and the quality of the recommendations suffers.
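A simple guard against recommending poorly scored movies is to drop anything below a minimum predicted score before returning the list. The sketch below applies this to the scores printed for David above; the cutoff of 3 and the names `david_scores` and `min_score` are assumptions chosen for illustration.

```r
# Predicted scores for David, copied from the output above
david_scores <- c(
  "Stranger Things S5" = 4.7007046,
  "Dune Part Two"      = 2.0016862,
  "Black Panther"      = 1.1797700,
  "The Godfather"      = 0.4937011
)

# Hypothetical cutoff: only recommend movies predicted at 3 or higher
min_score <- 3
kept <- david_scores[david_scores >= min_score]
kept
```

With this cutoff only Stranger Things S5 remains, which trades a shorter list for recommendations the model is more positive about.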

LLMs used:

• OpenAI. (2025). ChatGPT (Version 5.2) [Large language model]. https://chat.openai.com. Accessed Apr 26, 2026.