Assignment 11 Personal Recommender

Author

Khandker Qaiduzzaman

Objective

The objective of this assignment is to build a simple personalized movie recommender system from movie rating survey data. The dataset contains ratings on a scale of 1 to 5 that survey respondents gave to six movies released in 2025.

Approach

For this assignment, I will build an item-to-item collaborative filtering recommender system from ratings collected through a survey. The system will focus on identifying relationships between movies based on how similarly users have rated them.

The core idea behind this approach is that if two movies receive similar ratings from many users, they are likely to be similar in terms of audience preference. This similarity can then be used to recommend movies a user has not yet watched.

The main steps of the approach are:

1. Data Cleaning and Preparation: The raw dataset will be cleaned by standardizing column names, renaming variables for clarity, and converting rating values into numeric format. Movie-related columns will be isolated for analysis.
2. User-Item Matrix Construction: A matrix will be created where rows represent users and columns represent movies. Each cell contains the rating a user has given to a specific movie.
3. Similarity Computation: Item-to-item similarity will be calculated using cosine similarity. This will measure how closely related two movies are based on user rating behavior.
4. Recommendation Logic: For a given user, the system will compute a weighted score for each unseen movie based on similarity with movies the user has already rated.
5. Top-N Recommendation: The final step will generate a ranked list of top recommended movies for each user based on predicted preference scores.
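
The similarity, scoring, and top-N steps above can be sketched in base R on a toy matrix. This is only a minimal illustration under stated assumptions: the ratings, the user_/movie_ labels, and the helper names cosine_sim, item_similarity, and recommend_top_n are hypothetical and do not come from the survey data or from any package. Cosine similarity is computed here only over users who rated both movies, which is one of several possible ways to treat missing ratings.

```r
# Toy user-item matrix (hypothetical ratings); NA means "not rated"
ratings <- matrix(
  c( 5,  4, NA,  1,
     4,  5,  2, NA,
     1, NA,  5,  4,
    NA,  2,  4,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user_", 1:4), paste0("movie_", letters[1:4]))
)

# Cosine similarity between two movie columns, using only co-rating users
cosine_sim <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  if (sum(both) < 2) return(NA_real_)  # too little overlap to compare
  sum(x[both] * y[both]) / (sqrt(sum(x[both]^2)) * sqrt(sum(y[both]^2)))
}

# Item-to-item similarity matrix over all pairs of movies
item_similarity <- function(m) {
  n <- ncol(m)
  sim <- matrix(NA_real_, n, n, dimnames = list(colnames(m), colnames(m)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) sim[i, j] <- cosine_sim(m[, i], m[, j])
  }
  sim
}

# Score each unseen movie as the similarity-weighted mean of the user's
# existing ratings, then keep the n highest-scoring movies
recommend_top_n <- function(m, sim, user, n = 2) {
  r <- m[user, ]
  rated <- names(r)[!is.na(r)]
  unseen <- names(r)[is.na(r)]
  scores <- vapply(unseen, function(item) {
    w <- sim[item, rated]
    ok <- !is.na(w)
    if (!any(ok) || sum(abs(w[ok])) == 0) return(NA_real_)
    sum(w[ok] * r[rated][ok]) / sum(abs(w[ok]))
  }, numeric(1))
  head(sort(scores, decreasing = TRUE), n)
}

sim <- item_similarity(ratings)
recs <- recommend_top_n(ratings, sim, "user_1")
recs
```

In this toy matrix movie_c is the only title user_1 has not rated, so it is the only candidate returned. Because the score is a weighted mean of existing ratings, it stays within the range of the ratings that user already gave.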

Anticipated Challenges

One of the main challenges in this assignment is handling missing ratings, since not every user has watched or rated every movie. Missing values should stay missing rather than be filled with zeros, because a zero would be read as a rating below the 1-to-5 scale and would bias the similarity calculations.
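
To make the effect of missing ratings concrete, the short base R sketch below compares two ways of handling them for a single pair of movies. The two rating vectors and the helper name cosine are hypothetical and serve only as an illustration; they are not taken from the survey.

```r
# Hypothetical ratings for two movies from five users; NA = not watched
movie_x <- c(5, 4, NA, NA, 5)
movie_y <- c(5, NA, 4, NA, 4)

# Plain cosine similarity for two complete numeric vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Option 1: replace NA with 0, which treats "not watched" as a very low rating
zero_filled <- cosine(
  replace(movie_x, is.na(movie_x), 0),
  replace(movie_y, is.na(movie_y), 0)
)

# Option 2: keep only the users who rated both movies
both <- !is.na(movie_x) & !is.na(movie_y)
co_rated <- cosine(movie_x[both], movie_y[both])

c(zero_filled = zero_filled, co_rated = co_rated)
```

In this example the zero-filled similarity comes out lower than the co-rated one, because every user who skipped one of the two movies is counted as strong disagreement.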

Another challenge is keeping similarity scores meaningful when only a few users have rated both movies in a pair. With so few common raters, cosine similarity rests on very little evidence and can come out misleadingly high or low.
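
One possible safeguard, sketched below in base R, is to count the common raters for every pair of movies and to flag pairs that fall under a minimum overlap. The toy matrix and the min_overlap threshold of 2 are assumptions chosen for illustration, not values derived from the survey.

```r
# Toy user-item matrix (hypothetical ratings); NA means "not rated"
ratings <- matrix(
  c( 5,  4, NA,
     4, NA,  2,
    NA,  5, NA,
     3,  4, NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user_", 1:4), c("movie_a", "movie_b", "movie_c"))
)

# 1 where a rating exists, 0 where it is missing
rated <- (!is.na(ratings)) * 1

# crossprod(rated) is t(rated) %*% rated: entry [i, j] counts the users who
# rated both movie i and movie j (the diagonal counts raters per movie)
co_raters <- crossprod(rated)

# Flag the pairs with enough overlap to trust their similarity score
min_overlap <- 2  # assumed threshold, to be tuned for the real data
reliable <- co_raters >= min_overlap

co_raters
reliable
```

Similarities for pairs where reliable is FALSE could then be set to NA or given less weight before recommendations are scored.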

Additionally, converting raw survey data into a clean user-item matrix requires careful data transformation and type conversion to ensure accurate computations.

The dataset can be viewed here: https://raw.githubusercontent.com/NafeesKhandker/Recommender-Systems/refs/heads/main/Movie%20Rating%20(Responses).csv

Implementation of Data Import

The following code demonstrates how the movie rating dataset is imported and cleaned in R. First, required libraries are loaded. The dataset is then imported directly from GitHub and previewed.

library(tidyverse)

library(gt)

url <- "https://raw.githubusercontent.com/NafeesKhandker/Recommender-Systems/refs/heads/main/Movie%20Rating%20(Responses).csv"

df <- read_csv(
  file = url,
  show_col_types = FALSE,
  progress = FALSE
)

head(df)

# A tibble: 6 × 11
  Timestamp  `Email Address` Name  Rate Superman (2025)…¹ Rate F1: The Movie (…²
  <chr>      <chr>           <chr>                  <dbl>                  <dbl>
1 2/4/2026 … feva0706@gmail… Foiz…                    3.5                    4.2
2 2/4/2026 … jahidneel10@gm… Jahi…                    4.5                    4  
3 2/5/2026 … mhasanww@gmail… Mahm…                   NA                     NA  
4 2/5/2026 … sadmansobhan@y… Sadm…                   NA                      4.9
5 2/5/2026 … shahjahan.csek… Shah…                    3.6                   NA  
6 2/5/2026 … rhoque.nsu@gma… Read…                   NA                      3.5
# ℹ abbreviated names:
#   ¹`Rate Superman (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.`,
#   ²`Rate F1: The Movie (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.`
# ℹ 6 more variables:
#   `Rate Mission: Impossible – The Final Reckoning (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>,
#   `Rate Jurassic World: Rebirth (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>,
#   `Rate Sinners (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>, …

Next, the dataset is cleaned and prepared for analysis. Column names are standardized, key variables are renamed for clarity, and movie rating columns are converted into numeric format.

# install.packages("janitor")  # run once if janitor is not installed
library(janitor)               # tidyverse is already loaded above

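# Standardize column names to snake_case and shorten the demographic columns
df_clean <- df %>%
  clean_names() %>%
  rename(
    user = name,
    age = please_enter_your_age,
    gender = please_enter_your_gender
  )

# Strip the survey wording so each rating column keeps only the movie title;
# the anchors (^ and $) restrict the match to the prefix and the suffix
df_clean <- df_clean %>%
  rename_with(
    ~ str_replace_all(., "^rate_|_on_a_scale_of_1_to_5_enter_na_if_you_havent_watched_this_movie$", ""),
    starts_with("rate_")
  )

# Make sure every movie rating column is numeric
df_clean <- df_clean %>%
  mutate(across(contains("2025"), as.numeric))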

df_clean |> gt()

timestamp	email_address	user	superman_2025	f1_the_movie_2025	mission_impossible_the_final_reckoning_2025	jurassic_world_rebirth_2025	sinners_2025	zootopia_2_2025	age	gender
2/4/2026 20:12	feva0706@gmail.com	Foizunnesa Eva	3.5	4.2	4.5	3.0	NA	NA	33	Female
2/4/2026 21:37	jahidneel10@gmail.com	Jahid Hasan	4.5	4.0	4.1	3.0	NA	4.3	23	Male
2/5/2026 10:18	mhasanww@gmail.com	Mahmudul Hasan	NA	NA	4.8	3.5	NA	NA	36	Male
2/5/2026 13:20	sadmansobhan@yahoo.com	Sadman Sobhan	NA	4.9	4.0	NA	4.5	3.9	32	Male
2/5/2026 16:01	shahjahan.cseku11@gmail.com	Shahjahan Shahed	3.6	NA	4.8	NA	4.0	NA	40	Male
2/5/2026 20:30	rhoque.nsu@gmail.com	Readwanul Hoque	NA	3.5	3.7	NA	4.1	NA	49	Male
2/5/2026 20:32	kayenath.sdc07@gmail.com	Tabassumul Kayenath	NA	NA	NA	NA	NA	4.6	39	Female
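
As a possible next step, the cleaned table can be turned into the user-item matrix described in the approach. The sketch below is an assumption about how that step might look, not the final implementation: to stay self-contained it rebuilds the user and rating columns of df_clean by hand from the table above, and it uses base R rather than the tidyverse. The object names movie_cols and user_item are hypothetical.

```r
# User and rating columns copied from the cleaned table shown above
df_clean <- data.frame(
  user = c("Foizunnesa Eva", "Jahid Hasan", "Mahmudul Hasan", "Sadman Sobhan",
           "Shahjahan Shahed", "Readwanul Hoque", "Tabassumul Kayenath"),
  superman_2025 = c(3.5, 4.5, NA, NA, 3.6, NA, NA),
  f1_the_movie_2025 = c(4.2, 4.0, NA, 4.9, NA, 3.5, NA),
  mission_impossible_the_final_reckoning_2025 = c(4.5, 4.1, 4.8, 4.0, 4.8, 3.7, NA),
  jurassic_world_rebirth_2025 = c(3.0, 3.0, 3.5, NA, NA, NA, NA),
  sinners_2025 = c(NA, NA, NA, 4.5, 4.0, 4.1, NA),
  zootopia_2_2025 = c(NA, 4.3, NA, 3.9, NA, NA, 4.6)
)

# Select the movie columns by their shared "_2025" suffix
movie_cols <- grep("_2025$", names(df_clean), value = TRUE)

# Rows are users, columns are movies, cells are ratings (NA = not watched)
user_item <- as.matrix(df_clean[, movie_cols])
rownames(user_item) <- df_clean$user

dim(user_item)              # number of users, number of movies
colSums(!is.na(user_item))  # how many users rated each movie
```

The per-movie counts show how sparse the matrix is, which matters for the overlap problem described under Anticipated Challenges.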

Reference

OpenAI. (2026). ChatGPT (Apr 23 version) assistance on recommender system Quarto document development. https://chat.openai.com/