The objective of this assignment is to build a simple personalized movie recommender system using movie rating survey data. The dataset contains movie ratings provided by different users for several 2025 movie titles.
Approach
For this assignment, I will build an item-to-item collaborative filtering recommender system using user rating data collected through a survey-style dataset. The system will focus on identifying relationships between movies based on how similarly users have rated them.
The core idea behind this approach is that if two movies receive similar ratings from many users, they are likely to be similar in terms of audience preference. This similarity can then be used to recommend movies a user has not yet watched.
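As a small illustration of this idea, the sketch below compares two movies by the ratings they received from the same users. The ratings and the helper name cosine_sim are made up for illustration and are not taken from the survey data.

```r
# Hypothetical ratings given by the same four users to two movies
movie_a <- c(4.5, 3.0, 5.0, 2.0)
movie_b <- c(4.0, 3.5, 4.5, 2.5)

# Cosine similarity: dot product divided by the product of the vector lengths
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

sim_ab <- cosine_sim(movie_a, movie_b)
sim_ab
```

Because every rating is positive, raw cosine values tend to sit close to 1, so the ranking of movie pairs matters more than the absolute numbers.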
The main steps of the approach are:
Data Cleaning and Preparation: The raw dataset will be cleaned by standardizing column names, renaming variables for clarity, and converting rating values into numeric format. Movie-related columns will be isolated for analysis.
User-Item Matrix Construction: A matrix will be created where rows represent users and columns represent movies. Each cell contains the rating a user has given to a specific movie.
Similarity Computation: Item-to-item similarity will be calculated using cosine similarity. This will measure how closely related two movies are based on user rating behavior.
Recommendation Logic: For a given user, the system will compute a weighted score for each unseen movie based on similarity with movies the user has already rated.
Top-N Recommendation: The final step will generate a ranked list of the top recommended movies for each user based on predicted preference scores.
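The steps above can be sketched end to end on a toy matrix. This is a minimal illustration under stated assumptions, not the final implementation: the ratings, the movie labels A to D, and the function name recommend are hypothetical, and missing ratings are simply filled with 0 before computing similarity.

```r
# Toy user-item matrix (hypothetical ratings); NA means "not watched"
ratings <- matrix(
  c( 5,  4, NA,  1,
     4,  5,  2, NA,
     1,  2,  5,  4,
    NA,  1,  4,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user", 1:4), c("A", "B", "C", "D"))
)

# Item-to-item cosine similarity, with missing ratings filled as 0 (a simplification)
filled <- ratings
filled[is.na(filled)] <- 0
norms <- sqrt(colSums(filled^2))
item_sim <- crossprod(filled) / outer(norms, norms)

# Score each unseen movie as a similarity-weighted average of the user's own ratings
recommend <- function(user, n = 2) {
  r <- ratings[user, ]
  seen <- !is.na(r)
  unseen <- names(r)[!seen]
  scores <- vapply(unseen, function(m) {
    w <- item_sim[m, seen]
    sum(w * r[seen]) / sum(abs(w))
  }, numeric(1))
  # Return the n highest-scoring unseen movies
  head(sort(scores, decreasing = TRUE), n)
}

top_user1 <- recommend("user1")
top_user1
```

Dividing by the sum of the weights keeps each predicted score on the same 1 to 5 scale as the original ratings.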
Anticipated Challenges
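One of the main challenges in this assignment is handling missing ratings, since not all users have watched or rated all movies. If a missing value is treated as a rating of zero, an unwatched movie looks like a strongly disliked one, which biases the similarity calculations. Missing values must therefore be handled explicitly.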
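One possible way to manage missing ratings is to compute similarity only over users who rated both movies. The sketch below compares that with zero-filling; the ratings and the names cosine_corated and zero_fill are hypothetical.

```r
# Cosine similarity restricted to users who rated both movies
cosine_corated <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  if (!any(both)) return(NA_real_)  # no overlap: similarity is undefined
  x <- x[both]
  y <- y[both]
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Hypothetical ratings from five users; NA means "not watched"
movie_a <- c(4.5, NA, 5.0, 2.0, NA)
movie_b <- c(4.0, 3.5, NA, 2.5, 4.9)

# Uses only the two users who rated both movies
sim_corated <- cosine_corated(movie_a, movie_b)

# Zero-filling instead treats "not watched" as a rating of 0, below the 1-to-5 scale
zero_fill <- function(v) ifelse(is.na(v), 0, v)
sim_zero <- cosine_corated(zero_fill(movie_a), zero_fill(movie_b))

c(corated = sim_corated, zero_filled = sim_zero)
```

In this toy case the zero-filled similarity comes out much lower, even though the two users who rated both movies largely agree.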
Another challenge is ensuring that similarity scores remain meaningful when the number of common raters between two movies is small. With only one or two users in common, cosine similarity can come out close to 1 by chance, so it overstates how related the two movies are.
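Two common ways to guard against this are to shrink similarities toward zero when the overlap is small, or to discard pairs below a minimum number of common raters. The sketch below illustrates both. The function names, the shrinkage constant, and the minimum overlap are illustrative assumptions, not values from the assignment.

```r
# Shrink a similarity toward 0 when few users rated both movies.
# `shrinkage` is an illustrative tuning constant.
shrink_similarity <- function(sim, n_common, shrinkage = 5) {
  sim * n_common / (n_common + shrinkage)
}

# Alternatively, drop pairs with fewer than `min_common` common raters
min_overlap_filter <- function(sim, n_common, min_common = 3) {
  ifelse(n_common >= min_common, sim, NA_real_)
}

sims     <- c(0.99, 0.95, 0.90)  # hypothetical raw cosine similarities
n_common <- c(2, 10, 40)         # users who rated both movies in each pair

shrunk   <- shrink_similarity(sims, n_common)
filtered <- min_overlap_filter(sims, n_common)
```

After shrinkage, the pair with only two common raters drops from the highest raw similarity to the lowest, while the well-supported pair is barely affected.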
Additionally, converting raw survey data into a clean user-item matrix requires careful data transformation and type conversion to ensure accurate computations.
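A minimal base R sketch of this transformation is shown below. It uses a two-row stand-in for the survey export with placeholder respondent names, and it assumes the rating columns follow the "Rate <title> (2025) on a scale of 1 to 5 ..." naming pattern and store ratings as text. The object names raw, to_numeric, and rating_matrix are hypothetical.

```r
# Hypothetical two-row stand-in for the raw survey export
raw <- data.frame(
  Name = c("User A", "User B"),
  `Rate Superman (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` = c("3.5", "NA"),
  `Rate F1: The Movie (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` = c("4.2", "4"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Identify the rating columns and shorten each name to the movie title
is_rating <- grepl("^Rate ", names(raw))
titles <- sub("^Rate (.*) \\(2025\\) on a scale.*$", "\\1", names(raw)[is_rating])

# Convert text to numeric; the literal string "NA" becomes a missing value
to_numeric <- function(col) {
  col[col %in% "NA"] <- NA
  as.numeric(col)
}

# Build the user-item matrix: rows are users, columns are movies
rating_matrix <- do.call(cbind, lapply(raw[is_rating], to_numeric))
dimnames(rating_matrix) <- list(raw$Name, titles)
rating_matrix
```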
The following code demonstrates how the movie rating dataset is imported and cleaned in R. First, required libraries are loaded. The dataset is then imported directly from GitHub and previewed.
library(tidyverse)
# A tibble: 6 × 11
Timestamp `Email Address` Name Rate Superman (2025)…¹ Rate F1: The Movie (…²
<chr> <chr> <chr> <dbl> <dbl>
1 2/4/2026 … feva0706@gmail… Foiz… 3.5 4.2
2 2/4/2026 … jahidneel10@gm… Jahi… 4.5 4
3 2/5/2026 … mhasanww@gmail… Mahm… NA NA
4 2/5/2026 … sadmansobhan@y… Sadm… NA 4.9
5 2/5/2026 … shahjahan.csek… Shah… 3.6 NA
6 2/5/2026 … rhoque.nsu@gma… Read… NA 3.5
# ℹ abbreviated names:
# ¹`Rate Superman (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.`,
# ²`Rate F1: The Movie (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.`
# ℹ 6 more variables:
# `Rate Mission: Impossible – The Final Reckoning (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>,
# `Rate Jurassic World: Rebirth (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>,
# `Rate Sinners (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>, …
Next, the dataset is cleaned and prepared for analysis. Column names are standardized, key variables are renamed for clarity, and movie rating columns are converted into numeric format.