Assignment 11 Personalized Recommender

Author

Khandker Qaiduzzaman

Objective

The objective of this assignment is to build a simple personalized movie recommender system using movie rating survey data. The dataset contains movie ratings provided by different users for several 2025 movie titles.

Approach

For this assignment, I will build an item-to-item collaborative filtering recommender system using user rating data collected through a survey-style dataset. The system will focus on identifying relationships between movies based on how similarly users have rated them.

The core idea behind this approach is that if two movies receive similar ratings from many users, they are likely to be similar in terms of audience preference. This similarity can then be used to recommend movies a user has not yet watched.

The main steps of the approach are:

  • Data Cleaning and Preparation: The raw dataset will be cleaned by standardizing column names, renaming variables for clarity, and converting rating values into numeric format. Movie-related columns will be isolated for analysis.

  • User-Item Matrix Construction: A matrix will be created where rows represent users and columns represent movies. Each cell contains the rating a user has given to a specific movie.

  • Similarity Computation: Item-to-item similarity will be calculated using cosine similarity. This will measure how closely related two movies are based on user rating behavior.

  • Recommendation Logic: For a given user, the system will compute a weighted score for each unseen movie based on similarity with movies the user has already rated.

  • Top-N Recommendation: The final step will generate a ranked list of top recommended movies for each user based on predicted preference scores.
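The similarity and scoring steps above can be sketched in base R on a toy matrix. This is a minimal illustration under stated assumptions, not the recommenderlab implementation: the ratings are hypothetical, the helper names `cosine_items` and `predict_item` are made up for this sketch, and cosine similarity is computed only over users who rated both movies.

```r
# Toy user-item matrix (hypothetical ratings, not the survey data).
ratings <- matrix(
  c( 5,  4, NA,
     4,  5,  2,
     1,  2,  5,
    NA,  1,  4),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), c("A", "B", "C"))
)

# Cosine similarity between two item columns, using only co-rated users.
cosine_items <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  if (!any(both)) return(NA_real_)
  sum(x[both] * y[both]) / sqrt(sum(x[both]^2) * sum(y[both]^2))
}

# Item-by-item similarity matrix.
items <- colnames(ratings)
sim <- sapply(items, function(j) {
  sapply(items, function(i) cosine_items(ratings[, i], ratings[, j]))
})

# Score an unseen item as the similarity-weighted mean of the user's own ratings.
predict_item <- function(user, item) {
  r <- ratings[user, ]
  rated <- setdiff(names(r)[!is.na(r)], item)
  w <- sim[item, rated]
  sum(w * r[rated]) / sum(abs(w))
}

pred_u1_C <- predict_item("u1", "C")
pred_u1_C
```

Because the score is a weighted mean of the user's own ratings, it always falls between the lowest and highest rating that user has given.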

Recommender Output Definition

The recommender outputs a ranked list of top-N movies for each user based on predicted ratings. For each user, the system predicts ratings for movies they have not seen, sorts them in descending order, and returns the highest-rated items as personalized recommendations. The final output includes the user name, movie title, and predicted rating, forming a ranked recommendation list.
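As a rough sketch of this output format, the helper below turns one user's predicted ratings into a ranked top-N table. The function name `top_n_recs`, the movie labels, and the predicted values are hypothetical and serve only to illustrate the shape of the result.

```r
# Hypothetical predicted ratings for one user; NA means no prediction is available.
pred <- c(movie_a = 4.1, movie_b = NA, movie_c = 3.2, movie_d = 4.6)

top_n_recs <- function(user, pred, n = 2) {
  pred <- pred[!is.na(pred)]                      # keep only movies with a prediction
  ranked <- head(pred[order(pred, decreasing = TRUE)], n)
  data.frame(
    user = rep(user, length(ranked)),
    movie = names(ranked),
    predicted_rating = unname(ranked),
    rank = seq_along(ranked)
  )
}

recs <- top_n_recs("example_user", pred, n = 2)
recs
```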

Anticipated Challenges

One of the main challenges in this assignment is handling missing ratings, since not all users have watched or rated all movies. These missing values must be properly managed to avoid bias in similarity calculations.

Another challenge is ensuring that similarity scores remain meaningful when the number of common raters between two movies is small, which can sometimes distort cosine similarity results.
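A small example makes the problem concrete. When two movies share a single common rater, cosine similarity over the co-rated entries is exactly 1, however different the two ratings are. The sketch below uses hypothetical ratings; the damping factor `n / (n + lambda)` is one commonly used way to discount similarities backed by few raters. It is shown here as an assumption, not as something recommenderlab applies.

```r
# Cosine similarity over co-rated users, plus the number of common raters.
cosine_corated <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  s <- sum(x[both] * y[both]) / sqrt(sum(x[both]^2) * sum(y[both]^2))
  c(similarity = s, n_common = sum(both))
}

# Hypothetical ratings from five users.
movie_x <- c(5, NA, NA, NA, 2)
movie_y <- c(1,  4,  3, NA, NA)  # shares only user 1 with movie_x
movie_z <- c(4, NA,  3, NA, 1)   # shares users 1 and 5 with movie_x

xy <- cosine_corated(movie_x, movie_y)  # ratings 5 vs 1, yet similarity is 1
xz <- cosine_corated(movie_x, movie_z)

# Hypothetical damping: shrink similarities that rest on few common raters.
shrink <- function(res, lambda = 2) {
  unname(res["similarity"] * res["n_common"] / (res["n_common"] + lambda))
}
shrink(xy)
shrink(xz)
```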

Additionally, converting raw survey data into a clean user-item matrix requires careful data transformation and type conversion to ensure accurate computations.

The dataset can be viewed here: https://raw.githubusercontent.com/NafeesKhandker/Recommender-Systems/refs/heads/main/Movie%20Rating%20(Responses).csv

Data Import and Initial Exploration

The movie rating dataset is read directly from a GitHub repository with read_csv(), which is loaded as part of the tidyverse. The dataset contains user ratings for six 2025 movie titles along with demographic information (age and gender). After loading, head() previews the first rows to check the structure and confirm that the import succeeded.

library(tidyverse)
library(gt)

url <- "https://raw.githubusercontent.com/NafeesKhandker/Recommender-Systems/refs/heads/main/Movie%20Rating%20(Responses).csv"

df <- read_csv(
  file = url,
  show_col_types = FALSE,
  progress = FALSE
)

head(df)
# A tibble: 6 × 11
  Timestamp  `Email Address` Name  Rate Superman (2025)…¹ Rate F1: The Movie (…²
  <chr>      <chr>           <chr>                  <dbl>                  <dbl>
1 2/4/2026 … feva0706@gmail… Foiz…                    3.5                    4.2
2 2/4/2026 … jahidneel10@gm… Jahi…                    4.5                    4  
3 2/5/2026 … mhasanww@gmail… Mahm…                   NA                     NA  
4 2/5/2026 … sadmansobhan@y… Sadm…                   NA                      4.9
5 2/5/2026 … shahjahan.csek… Shah…                    3.6                   NA  
6 2/5/2026 … rhoque.nsu@gma… Read…                   NA                      3.5
# ℹ abbreviated names:
#   ¹​`Rate Superman (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.`,
#   ²​`Rate F1: The Movie (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.`
# ℹ 6 more variables:
#   `Rate Mission: Impossible – The Final Reckoning (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>,
#   `Rate Jurassic World: Rebirth (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>,
#   `Rate Sinners (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>, …

Data Cleaning and Preparation

This section cleans and standardizes the dataset for analysis. Column names are converted to snake_case with clean_names() from the janitor package. The name, age, and gender columns are then renamed to user, age, and gender. The movie rating columns are shortened by stripping the survey question text from their names, leaving only the title and year (for example, superman_2025). Finally, the rating columns are passed through as.numeric(). The rating columns visible in the preview were already parsed as numeric by read_csv(), so this step mainly acts as a safeguard for the recommender modeling that follows.

# install.packages("janitor")
library(tidyverse)
library(janitor)

df_clean <- df %>%
  clean_names() %>%   # makes names snake_case
  rename(
    user = name,
    age = please_enter_your_age,
    gender = please_enter_your_gender
  )

df_clean <- df_clean %>%
  rename_with(
    ~ str_replace_all(., "rate_|_on_a_scale_of_1_to_5_enter_na_if_you_havent_watched_this_movie", ""),
    starts_with("rate_")
  )

df_clean <- df_clean %>%
  mutate(across(contains("2025"), as.numeric))

df_clean |> gt()
timestamp | email_address | user | superman_2025 | f1_the_movie_2025 | mission_impossible_the_final_reckoning_2025 | jurassic_world_rebirth_2025 | sinners_2025 | zootopia_2_2025 | age | gender
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
2/4/2026 20:12 | feva0706@gmail.com | Foizunnesa Eva | 3.5 | 4.2 | 4.5 | 3.0 | NA | NA | 33 | Female
2/4/2026 21:37 | jahidneel10@gmail.com | Jahid Hasan | 4.5 | 4.0 | 4.1 | 3.0 | NA | 4.3 | 23 | Male
2/5/2026 10:18 | mhasanww@gmail.com | Mahmudul Hasan | NA | NA | 4.8 | 3.5 | NA | NA | 36 | Male
2/5/2026 13:20 | sadmansobhan@yahoo.com | Sadman Sobhan | NA | 4.9 | 4.0 | NA | 4.5 | 3.9 | 32 | Male
2/5/2026 16:01 | shahjahan.cseku11@gmail.com | Shahjahan Shahed | 3.6 | NA | 4.8 | NA | 4.0 | NA | 40 | Male
2/5/2026 20:30 | rhoque.nsu@gmail.com | Readwanul Hoque | NA | 3.5 | 3.7 | NA | 4.1 | NA | 49 | Male
2/5/2026 20:32 | kayenath.sdc07@gmail.com | Tabassumul Kayenath | NA | NA | NA | NA | NA | 4.6 | 39 | Female
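To see what the renaming step does to a single column, the sketch below applies the same pattern with base R's gsub(). The first input is the snake_case name implied by the pattern used above. The second title is hypothetical and is included only to show that an unanchored `rate_` also matches inside a word, which anchoring with `^` and `$` would avoid.

```r
# Snake_case column name for one survey question.
raw_name <- "rate_superman_2025_on_a_scale_of_1_to_5_enter_na_if_you_havent_watched_this_movie"

suffix <- "_on_a_scale_of_1_to_5_enter_na_if_you_havent_watched_this_movie"
pattern <- paste0("rate_|", suffix)             # same pattern as in the cleaning code
short_name <- gsub(pattern, "", raw_name)
short_name

# Hypothetical title containing "rate_" mid-name: the unanchored pattern mangles it.
tricky <- paste0("rate_pirate_ship_2025", suffix)
unanchored <- gsub(pattern, "", tricky)
anchored <- gsub(paste0("^rate_|", suffix, "$"), "", tricky)
c(unanchored = unanchored, anchored = anchored)
```

None of the six titles in this survey contains `rate_` inside a word, so both patterns give the same column names on this dataset.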

User-Item Matrix Construction and Train-Test Split

In this section, the cleaned dataset is transformed into a user-item rating matrix, where rows represent users and columns represent movies. The matrix is then converted to the realRatingMatrix class required by the recommenderlab package. Users are divided into training and test sets with an 80/20 split so that model performance can be evaluated. Setting given = -1 selects an all-but-1 scheme: for each test user, all ratings except one are passed to the model as known data, and the single withheld rating is kept as the unknown target for evaluation.

#install.packages("recommenderlab")
library(recommenderlab)
library(tidyverse)

# -----------------------------
# Create User-Item Matrix
# -----------------------------
ratings_matrix <- df_clean %>%
  select(user, contains("2025")) %>%
  column_to_rownames("user") %>%
  as.matrix()

ratings_rrm <- as(ratings_matrix, "realRatingMatrix")

# -----------------------------
# Train/Test Split
# -----------------------------
set.seed(123)

scheme <- evaluationScheme(
  data = ratings_rrm,
  method = "split",
  train = 0.8,
  given = -1
)

train_data <- getData(scheme, "train")
known_data <- getData(scheme, "known")
unknown_data <- getData(scheme, "unknown")
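The all-but-1 idea behind given = -1 can be illustrated for a single test user in base R. This is only a conceptual sketch with illustrative ratings and made-up variable names. It does not reproduce how evaluationScheme() chooses the withheld item internally.

```r
set.seed(123)

# Illustrative ratings for one test user (NA = not rated).
user_ratings <- c(superman = 3.6, f1 = NA, mission_impossible = 4.8,
                  jurassic = NA, sinners = 4.0)

rated <- which(!is.na(user_ratings))
# Pick one rated position to withhold; sample.int() avoids the pitfall of
# sample() on a length-1 vector.
withheld <- rated[sample.int(length(rated), 1)]

known <- user_ratings
known[withheld] <- NA        # what the recommender is allowed to see

unknown <- user_ratings
unknown[-withheld] <- NA     # the single rating used to score the prediction

rbind(known, unknown)
```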

Hyperparameter Tuning (k Selection Using RMSE)

This section evaluates values of k from 1 to 5 to choose the neighborhood size of the item-based collaborative filtering (IBCF) model, that is, the number of most similar movies retained for each movie. With six movies, each movie has at most five neighbors, so 5 is the largest meaningful value. For each k, an IBCF model is trained with cosine similarity, and predictions are generated for the test users from their known ratings. The Root Mean Squared Error (RMSE) is computed by comparing the predicted ratings with the withheld ratings, and the results are collected in a summary table.

# -----------------------------
# Function to compute RMSE for a given k
# -----------------------------
get_rmse <- function(k_value) {
  
  model <- Recommender(
    data = train_data,
    method = "IBCF",
    parameter = list(method = "Cosine", k = k_value)
  )
  
  pred <- predict(model, known_data, type = "ratings")
  
  pred_mat <- as(pred, "matrix")
  true_mat <- as(unknown_data, "matrix")
  
  sqrt(mean((pred_mat - true_mat)^2, na.rm = TRUE))
}

# -----------------------------
# Run for k = 1 to 5
# -----------------------------
k_values <- 1:5

rmse_results <- tibble(
  k = k_values,
  RMSE = sapply(k_values, get_rmse)
)

rmse_results |> gt()
k | RMSE
--- | ---
1 | 0.7000000
2 | 0.7000000
3 | 0.5869545
4 | 0.5869545
5 | 0.5869545
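The RMSE calculation inside get_rmse() compares two matrices cell by cell and skips any cell where either value is NA. A stand-alone sketch of the same idea on plain vectors is shown below. The helper name `rmse` and the numbers are hypothetical, and the count of usable pairs is reported alongside because an RMSE based on very few pairs is a noisy estimate.

```r
rmse <- function(predicted, actual) {
  ok <- !is.na(predicted) & !is.na(actual)   # pairs where both values exist
  if (!any(ok)) return(NA_real_)
  sqrt(mean((predicted[ok] - actual[ok])^2))
}

# Hypothetical predictions and withheld ratings.
predicted <- c(4.0, NA, 3.5, 4.5)
actual    <- c(4.5, 3.0, NA, 3.5)

n_pairs <- sum(!is.na(predicted) & !is.na(actual))
c(rmse = rmse(predicted, actual), n_pairs = n_pairs)
```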

Best k Selection

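The value of k is chosen as the one with the lowest RMSE from the previous step. k = 3, 4, and 5 report the same RMSE (0.587). The tied rows stay in their original order after arrange(), so slice(1) returns k = 3, the smallest of the tied values. Because the test set drawn from a 7-user dataset is very small, this RMSE is a rough guide rather than a reliable estimate of accuracy on unseen data.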

best_k <- rmse_results %>%
  arrange(RMSE) %>%
  slice(1)

best_k |> gt()
k | RMSE
--- | ---
3 | 0.5869545
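The tie among k = 3, 4, and 5 can be made explicit. The sketch below re-enters the RMSE values as printed in the table above and uses base R instead of dplyr. The names `rmse_table`, `best_row`, and `tied_k` are introduced here for illustration only.

```r
# RMSE values copied from the printed results table.
rmse_table <- data.frame(
  k = 1:5,
  RMSE = c(0.7000000, 0.7000000, 0.5869545, 0.5869545, 0.5869545)
)

# which.min() returns the position of the first minimum, so ties go to the smallest k.
best_row <- rmse_table[which.min(rmse_table$RMSE), ]

# All k values that share the lowest RMSE.
tied_k <- rmse_table$k[rmse_table$RMSE == min(rmse_table$RMSE)]

best_row
tied_k
```

Preferring the smallest tied k keeps the neighborhood as small as possible without giving up accuracy on this split.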

Final Model Training

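Using the k value selected during tuning, the final item-based collaborative filtering model is trained on the full rating matrix rather than on the training split alone, so that every available rating contributes to the item similarities. The model then predicts ratings for each user's unrated movies, and the predictions are converted to an ordinary matrix for the next step.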

best_k_value <- best_k$k

final_model <- Recommender(
  data = ratings_rrm,
  method = "IBCF",
  parameter = list(method = "Cosine", k = best_k_value)
)

pred_final <- predict(final_model, ratings_rrm, type = "ratings")

pred_matrix <- as(pred_final, "matrix")
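To make the role of k concrete, the sketch below keeps only the k largest similarities in each row of an item-item similarity matrix and zeroes the rest. This is the general idea behind a k-nearest-item model. The matrix values and the helper `keep_top_k` are hypothetical, and recommenderlab's internal representation may differ.

```r
# Hypothetical item-item similarity matrix for four items.
sim <- matrix(
  c(1.0, 0.9, 0.2, 0.5,
    0.9, 1.0, 0.4, 0.1,
    0.2, 0.4, 1.0, 0.7,
    0.5, 0.1, 0.7, 1.0),
  nrow = 4, byrow = TRUE,
  dimnames = list(LETTERS[1:4], LETTERS[1:4])
)

keep_top_k <- function(sim, k) {
  k <- min(k, ncol(sim) - 1)   # an item has at most ncol - 1 neighbours
  diag(sim) <- NA              # an item is not its own neighbour
  t(apply(sim, 1, function(row) {
    cutoff <- sort(row, decreasing = TRUE)[k]   # k-th largest similarity in the row
    row[is.na(row) | row < cutoff] <- 0
    row
  }))
}

top2 <- keep_top_k(sim, k = 2)
top2
```

With k = 2, each row keeps exactly two non-zero neighbors (assuming no ties at the cutoff), so only those two items can contribute to a prediction for that row's item.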

Personalized Movie Recommendations

This section generates personalized movie recommendations for each user. For every user, only movies they have not rated are considered. Predicted ratings are sorted in descending order to produce a ranked list. Because sort() drops NA values, movies for which the model could not produce a prediction are omitted. The final output is presented as a table of user, movie, and predicted rating.

rec_list <- lapply(rownames(ratings_matrix), function(u) {
  
  user_ratings <- ratings_matrix[u, ]
  user_pred <- pred_matrix[u, ]
  
  # only unseen movies
  unseen <- is.na(user_ratings)
  
  sort(user_pred[unseen], decreasing = TRUE)
})

names(rec_list) <- rownames(ratings_matrix)

rec_df <- bind_rows(
  lapply(names(rec_list), function(u) {
    tibble(
      user = u,
      movie = names(rec_list[[u]]),
      predicted_rating = as.numeric(rec_list[[u]])
    )
  })
) %>%
  # order by user, then by highest predicted rating first
  arrange(user, desc(predicted_rating))

rec_df %>%
  gt() %>%
  tab_header(title = "Personalized Movie Recommendations (Best k Model)")
Personalized Movie Recommendations (Best k Model)
user | movie | predicted_rating
--- | --- | ---
Foizunnesa Eva | zootopia_2_2025 | 4.005594
Foizunnesa Eva | sinners_2025 | 3.835671
Jahid Hasan | sinners_2025 | 4.299487
Mahmudul Hasan | f1_the_movie_2025 | 4.800000
Mahmudul Hasan | sinners_2025 | 4.800000
Mahmudul Hasan | zootopia_2_2025 | 4.800000
Mahmudul Hasan | superman_2025 | 3.500000
Readwanul Hoque | superman_2025 | 4.100000
Readwanul Hoque | zootopia_2_2025 | 3.679588
Readwanul Hoque | jurassic_world_rebirth_2025 | 3.574708
Sadman Sobhan | jurassic_world_rebirth_2025 | 4.563815
Sadman Sobhan | superman_2025 | 4.200000
Shahjahan Shahed | f1_the_movie_2025 | 4.264369
Shahjahan Shahed | zootopia_2_2025 | 4.193597
Shahjahan Shahed | jurassic_world_rebirth_2025 | 3.880326
Tabassumul Kayenath | superman_2025 | 4.600000
Tabassumul Kayenath | mission_impossible_the_final_reckoning_2025 | 4.600000

Conclusion and Limitations

This project implemented an item-based collaborative filtering recommender system using cosine similarity, with k = 3 selected as optimal based on the lowest RMSE.

However, the results are constrained by the small and sparse nature of the dataset (6 movies and 7 users). Many users have rated only a few movies, which limits how many reliable recommendations can be generated.

For example, Tabassumul Kayenath rated only one movie (Zootopia 2, rated 4.6) and left five unrated. In item-based collaborative filtering, predictions for a user are built from the movies that user has already rated, weighted by their similarity to the target movie. With a single rated movie, the model has very little information to work from: only two of her five unrated movies receive a prediction, and both predictions equal 4.6, her one existing rating.
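The shortened list follows from how the recommendation code ranks predictions: base R's sort() removes NA values by default, so movies without a prediction never reach the output table. The sketch below shows this on a hypothetical prediction vector with placeholder movie names and values. It also shows the `na.last = TRUE` option, which keeps unpredicted movies visible at the end of the list.

```r
# Hypothetical predictions for a user with one rated movie; NA = no prediction.
user_pred <- c(movie_a = 4.6, movie_b = NA, movie_c = NA, movie_d = 4.6, movie_e = NA)

# Default behaviour: NA entries are dropped from the result.
ranked <- sort(user_pred, decreasing = TRUE)
length(ranked)

# Alternative: keep the NA entries, placed after the ranked predictions.
ranked_all <- sort(user_pred, decreasing = TRUE, na.last = TRUE)
ranked_all
```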

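Additionally, missing ratings (NA values) and the small number of items weaken the similarity estimates, since each pair of movies shares only a few common raters. With six movies, each movie has at most five neighbors, and the RMSE is identical for k = 3, 4, and 5 (0.587), so increasing k beyond 3 brings no improvement on this test split.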

Overall, while the model successfully generates personalized recommendations, its effectiveness is limited by data sparsity and the small scale of the dataset, which are common challenges in real-world recommender systems with limited user engagement.

Reference