Assignment 11 Personal Recommender

Author

Khandker Qaiduzzaman

Objective

The objective of this assignment is to build a simple personalized movie recommender system from movie rating survey data. In the dataset, each respondent rated up to six 2025 movie titles on a scale of 1 to 5, entering "NA" for any movie not watched.

Approach

For this assignment, I will build an item-to-item collaborative filtering recommender system using ratings collected through a survey. The system will identify relationships between movies based on how similarly users have rated them.

The core idea behind this approach is that if two movies receive similar ratings from many users, they are likely to be similar in terms of audience preference. This similarity can then be used to recommend movies a user has not yet watched.

The main steps of the approach are:

  • Data Cleaning and Preparation: The raw dataset will be cleaned by standardizing column names, renaming variables for clarity, and converting rating values into numeric format. Movie-related columns will be isolated for analysis.

  • User-Item Matrix Construction: A matrix will be created where rows represent users and columns represent movies. Each cell contains the rating a user has given to a specific movie.

  • Similarity Computation: Item-to-item similarity will be calculated using cosine similarity. This will measure how closely related two movies are based on user rating behavior.

  • Recommendation Logic: For a given user, the system will compute a weighted score for each unseen movie based on similarity with movies the user has already rated.

  • Top-N Recommendation: The final step will generate a ranked list of top recommended movies for each user based on predicted preference scores.
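The steps above can be sketched in base R on a small toy matrix. This is only an illustration under stated assumptions: the ratings, the user and movie labels, and the helper names (`cosine_pair`, `item_similarity`, `recommend_top_n`) are all made up, similarity is computed only over users who rated both movies, and the predicted score is taken to be a similarity-weighted average of the user's own ratings. The final system may differ in any of these choices.

```r
# Toy user-item matrix: rows are users, columns are movies, NA = not rated.
# All labels and values are invented for illustration.
ratings <- matrix(
  c( 5,  4, NA,  1,
     4,  5,  2, NA,
     1,  2,  5,  4,
    NA,  1,  4,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:4))
)

# Cosine similarity between two movies, using only users who rated both.
cosine_pair <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  if (!any(both)) return(NA_real_)
  sum(x[both] * y[both]) / sqrt(sum(x[both]^2) * sum(y[both]^2))
}

# Item-to-item similarity matrix (movies x movies).
item_similarity <- function(r) {
  n <- ncol(r)
  s <- matrix(NA_real_, n, n, dimnames = list(colnames(r), colnames(r)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      s[i, j] <- cosine_pair(r[, i], r[, j])
    }
  }
  s
}

# Score each unseen movie as the similarity-weighted average of the
# ratings the user has already given, then keep the n highest scores.
recommend_top_n <- function(r, sim, user, n = 2) {
  rated  <- !is.na(r[user, ])
  unseen <- colnames(r)[!rated]
  scores <- vapply(unseen, function(m) {
    w  <- sim[m, rated]
    ok <- !is.na(w)
    if (!any(ok) || sum(abs(w[ok])) == 0) return(NA_real_)
    sum(w[ok] * r[user, rated][ok]) / sum(abs(w[ok]))
  }, numeric(1))
  head(sort(scores, decreasing = TRUE), n)  # sort() drops NA scores
}

sim <- item_similarity(ratings)
recommend_top_n(ratings, sim, user = "u1", n = 1)
```

Because the weights are normalized by their absolute sum, each predicted score stays on the same 1 to 5 scale as the original ratings, which makes the ranked list easy to interpret.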

Anticipated Challenges

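One of the main challenges in this assignment is handling missing ratings, since not all users have watched or rated all movies. A missing rating means "not watched", not "disliked", so these values must be kept out of the similarity calculations rather than treated as low scores; otherwise the similarities will be biased.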
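To make the effect of missing ratings concrete, the following sketch compares two ways of handling them for a single pair of movies. The two rating vectors are hypothetical, and the helper names `cosine` and `zero_fill` are introduced here only for illustration.

```r
# Two hypothetical movies rated by five users; NA = not watched.
movie_a <- c(5, 4, NA, NA, 3)
movie_b <- c(5, 4,  4,  5, NA)

cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Option 1: replace NA with 0, which treats "not watched" as a rating
# below the lowest value the survey allows.
zero_fill <- function(x) ifelse(is.na(x), 0, x)
sim_zero  <- cosine(zero_fill(movie_a), zero_fill(movie_b))

# Option 2: keep only the users who rated both movies.
both        <- !is.na(movie_a) & !is.na(movie_b)
sim_corated <- cosine(movie_a[both], movie_b[both])

c(zero_fill = sim_zero, co_rated = sim_corated)
```

Here the two users who watched both movies gave identical ratings, so the co-rated similarity is 1, while zero-filling pulls it down to roughly 0.64 purely because of who skipped which movie. Neither option is automatically right: the co-rated version rests on only two users in this example.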

Another challenge is ensuring that similarity scores remain meaningful when the number of common raters between two movies is small, which can sometimes distort cosine similarity results.
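Two common guards against this problem are a minimum number of common raters and shrinking similarities toward zero when the overlap is small. The sketch below shows both on invented data; the thresholds `min_common` and `shrink` are arbitrary values chosen for illustration, not tuned settings.

```r
# Hypothetical ratings: m1 and m3 share only one rater (u3).
ratings <- matrix(
  c( 5, 4, NA,
     4, 5, NA,
     2, 1,  3,
    NA, 4,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:3))
)

# Count, for every pair of movies, how many users rated both.
observed <- !is.na(ratings)
storage.mode(observed) <- "double"   # 1 = rated, 0 = missing
n_common <- crossprod(observed)      # movies x movies overlap counts

# Cosine similarity over co-rated users only.
cosine_pair <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  if (!any(both)) return(NA_real_)
  sum(x[both] * y[both]) / sqrt(sum(x[both]^2) * sum(y[both]^2))
}
movies <- colnames(ratings)
sim <- outer(movies, movies,
             Vectorize(function(a, b) cosine_pair(ratings[, a], ratings[, b])))
dimnames(sim) <- list(movies, movies)

# Guard 1: discard similarities supported by too few common raters.
min_common   <- 2
sim_filtered <- sim
sim_filtered[n_common < min_common] <- NA

# Guard 2: shrink similarities toward 0 when the overlap is small.
shrink     <- 2
sim_shrunk <- sim * n_common / (n_common + shrink)
```

With a single common rater and positive ratings, cosine similarity over the co-rated entries is always exactly 1, which is why `sim["m1", "m3"]` looks perfect here even though it rests on one person's opinion. The filter removes that pair entirely, while shrinkage keeps it but reduces its weight.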

Additionally, converting raw survey data into a clean user-item matrix requires careful data transformation and type conversion to ensure accurate computations.
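As a small illustration of the type-conversion concern, the sketch below turns free-text survey answers into numeric ratings. The input values and the helper `to_rating` are hypothetical, and it assumes that anything outside the 1 to 5 scale should be treated as missing; that rule is a design choice, not something the survey dictates.

```r
# Hypothetical raw answers as they might arrive from a free-text field.
raw <- c("4.5", "NA", " 3 ", "n/a", "five", "6")

to_rating <- function(x, min_rating = 1, max_rating = 5) {
  x <- trimws(x)
  # Treat common "not watched" spellings as missing.
  x[toupper(x) %in% c("", "NA", "N/A")] <- NA
  # Anything that is not a number becomes NA (warning suppressed on purpose).
  value <- suppressWarnings(as.numeric(x))
  # Values outside the survey scale are also treated as missing.
  value[!is.na(value) & (value < min_rating | value > max_rating)] <- NA
  value
}

to_rating(raw)
```

Only "4.5" and " 3 " survive as ratings in this example; every other entry becomes NA instead of silently turning into a wrong number.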

The dataset can be viewed here: https://raw.githubusercontent.com/NafeesKhandker/Recommender-Systems/refs/heads/main/Movie%20Rating%20(Responses).csv

Implementation of Data Import

The following code imports the movie rating dataset in R. The required libraries are loaded first; the dataset is then read directly from GitHub and previewed.

library(tidyverse)
library(gt)

url <- "https://raw.githubusercontent.com/NafeesKhandker/Recommender-Systems/refs/heads/main/Movie%20Rating%20(Responses).csv"

df <- read_csv(
  file = url,
  show_col_types = FALSE,
  progress = FALSE
)

head(df)
# A tibble: 6 × 11
  Timestamp  `Email Address` Name  Rate Superman (2025)…¹ Rate F1: The Movie (…²
  <chr>      <chr>           <chr>                  <dbl>                  <dbl>
1 2/4/2026 … feva0706@gmail… Foiz…                    3.5                    4.2
2 2/4/2026 … jahidneel10@gm… Jahi…                    4.5                    4  
3 2/5/2026 … mhasanww@gmail… Mahm…                   NA                     NA  
4 2/5/2026 … sadmansobhan@y… Sadm…                   NA                      4.9
5 2/5/2026 … shahjahan.csek… Shah…                    3.6                   NA  
6 2/5/2026 … rhoque.nsu@gma… Read…                   NA                      3.5
# ℹ abbreviated names:
#   ¹​`Rate Superman (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.`,
#   ²​`Rate F1: The Movie (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.`
# ℹ 6 more variables:
#   `Rate Mission: Impossible – The Final Reckoning (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>,
#   `Rate Jurassic World: Rebirth (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>,
#   `Rate Sinners (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>, …

Next, the dataset is cleaned and prepared for analysis. Column names are standardized, key variables are renamed for clarity, and movie rating columns are converted into numeric format.

# install.packages("janitor")
library(tidyverse)
library(janitor)
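df_clean <- df %>%
  clean_names() %>%   # convert column names to snake_case
  rename(
    user = name,
    age = please_enter_your_age,
    gender = please_enter_your_gender
  )

# Strip the survey wording so each rating column is named after its movie,
# e.g. "superman_2025". The anchors (^ and $) keep the pattern from matching mid-name.
df_clean <- df_clean %>%
  rename_with(
    ~ str_replace_all(.x, "^rate_|_on_a_scale_of_1_to_5_enter_na_if_you_havent_watched_this_movie$", ""),
    starts_with("rate_")
  )

# Make sure every rating column is numeric (read_csv already parses "NA" as missing).
df_clean <- df_clean %>%
  mutate(across(contains("2025"), as.numeric))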

df_clean |> gt()
| timestamp | email_address | user | superman_2025 | f1_the_movie_2025 | mission_impossible_the_final_reckoning_2025 | jurassic_world_rebirth_2025 | sinners_2025 | zootopia_2_2025 | age | gender |
|---|---|---|---|---|---|---|---|---|---|---|
| 2/4/2026 20:12 | feva0706@gmail.com | Foizunnesa Eva | 3.5 | 4.2 | 4.5 | 3.0 | NA | NA | 33 | Female |
| 2/4/2026 21:37 | jahidneel10@gmail.com | Jahid Hasan | 4.5 | 4.0 | 4.1 | 3.0 | NA | 4.3 | 23 | Male |
| 2/5/2026 10:18 | mhasanww@gmail.com | Mahmudul Hasan | NA | NA | 4.8 | 3.5 | NA | NA | 36 | Male |
| 2/5/2026 13:20 | sadmansobhan@yahoo.com | Sadman Sobhan | NA | 4.9 | 4.0 | NA | 4.5 | 3.9 | 32 | Male |
| 2/5/2026 16:01 | shahjahan.cseku11@gmail.com | Shahjahan Shahed | 3.6 | NA | 4.8 | NA | 4.0 | NA | 40 | Male |
| 2/5/2026 20:30 | rhoque.nsu@gmail.com | Readwanul Hoque | NA | 3.5 | 3.7 | NA | 4.1 | NA | 49 | Male |
| 2/5/2026 20:32 | kayenath.sdc07@gmail.com | Tabassumul Kayenath | NA | NA | NA | NA | NA | 4.6 | 39 | Female |
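A possible next step is to turn the cleaned data into the user-item matrix described in the approach. The sketch below is one way to do it, not necessarily the final implementation. So that it runs on its own, it rebuilds a stand-in for `df_clean` from the user and rating columns of the table above; with the real `df_clean` in memory, only the last part is needed.

```r
library(tidyverse)

# Stand-in for df_clean, copied from the cleaned table (user and rating columns only).
df_clean <- tribble(
  ~user, ~superman_2025, ~f1_the_movie_2025,
  ~mission_impossible_the_final_reckoning_2025,
  ~jurassic_world_rebirth_2025, ~sinners_2025, ~zootopia_2_2025,
  "Foizunnesa Eva",      3.5, 4.2, 4.5, 3.0,  NA,  NA,
  "Jahid Hasan",         4.5, 4.0, 4.1, 3.0,  NA, 4.3,
  "Mahmudul Hasan",       NA,  NA, 4.8, 3.5,  NA,  NA,
  "Sadman Sobhan",        NA, 4.9, 4.0,  NA, 4.5, 3.9,
  "Shahjahan Shahed",    3.6,  NA, 4.8,  NA, 4.0,  NA,
  "Readwanul Hoque",      NA, 3.5, 3.7,  NA, 4.1,  NA,
  "Tabassumul Kayenath",  NA,  NA,  NA,  NA,  NA, 4.6
)

# Users as rows, movies as columns, ratings as cell values.
rating_matrix <- df_clean %>%
  select(user, ends_with("_2025")) %>%
  column_to_rownames("user") %>%
  as.matrix()

# Number of ratings available per movie, useful for judging how much
# evidence each similarity score will rest on.
colSums(!is.na(rating_matrix))
```

With seven respondents and six movies, most movie pairs share only a handful of raters, so the concern about small overlaps raised under Anticipated Challenges applies directly to this dataset.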

Reference

  • OpenAI. (2026). ChatGPT (Apr 23 version) assistance on recommender system Quarto document development. https://chat.openai.com/