The objective of this assignment is to build a simple personalized movie recommender system from movie rating survey data. The dataset contains ratings provided by different users for several movies released in 2025.
Approach
For this assignment, I will build an item-to-item collaborative filtering recommender system using user rating data collected through a survey-style dataset. The system will focus on identifying relationships between movies based on how similarly users have rated them.
The core idea behind this approach is that if two movies receive similar ratings from many users, they are likely to be similar in terms of audience preference. This similarity can then be used to recommend movies a user has not yet watched.
The main steps of the approach are:
Data Cleaning and Preparation: The raw dataset will be cleaned by standardizing column names, renaming variables for clarity, and converting rating values into numeric format. Movie-related columns will be isolated for analysis.
User-Item Matrix Construction: A matrix will be created where rows represent users and columns represent movies. Each cell contains the rating a user has given to a specific movie.
Similarity Computation: Item-to-item similarity will be calculated using cosine similarity. This will measure how closely related two movies are based on user rating behavior.
Recommendation Logic: For a given user, the system will compute a weighted score for each unseen movie based on similarity with movies the user has already rated.
Top-N Recommendation: The final step will generate a ranked list of top recommended movies for each user based on predicted preference scores.
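As a rough illustration of the similarity and scoring steps listed above, the following base R sketch works on a small made-up matrix. The ratings, the helper names `cosine_sim` and `predict_score`, and the choice to compare only users who rated both movies are assumptions for illustration; recommenderlab's internal computation may differ in its details.

```r
# Toy user-item matrix with hypothetical ratings; NA means "not rated"
ratings <- matrix(
  c( 5,  4, NA,
     4,  5,  2,
     1,  2,  5,
    NA,  1,  4),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3", "u4"), c("A", "B", "C"))
)

# Cosine similarity between two movie columns, using only users who rated both
cosine_sim <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  if (sum(both) == 0) return(NA_real_)
  sum(x[both] * y[both]) / sqrt(sum(x[both]^2) * sum(y[both]^2))
}

# Item-to-item similarity matrix
items <- colnames(ratings)
sim <- outer(
  seq_along(items), seq_along(items),
  Vectorize(function(i, j) cosine_sim(ratings[, i], ratings[, j]))
)
dimnames(sim) <- list(items, items)

# Score for an unseen movie: similarity-weighted average of the
# ratings the user has already given
predict_score <- function(user, item) {
  rated <- names(which(!is.na(ratings[user, ])))
  w <- sim[item, rated]
  sum(w * ratings[user, rated]) / sum(abs(w))
}

score_u1_C <- predict_score("u1", "C")
score_u1_C
```

Because the score is a weighted average, it always falls between the lowest and highest rating the user has already given.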
Recommender Output Definition
The recommender outputs a ranked list of top-N movies for each user based on predicted ratings. For each user, the system predicts ratings for movies they have not seen, sorts them in descending order, and returns the highest-rated items as personalized recommendations. The final output includes the user name, movie title, and predicted rating, forming a ranked recommendation list.
Anticipated Challenges
One of the main challenges in this assignment is handling missing ratings, since not all users have watched or rated all movies. These missing values must be properly managed to avoid bias in similarity calculations.
Another challenge is ensuring that similarity scores remain meaningful when the number of common raters between two movies is small, which can sometimes distort cosine similarity results.
Additionally, converting raw survey data into a clean user-item matrix requires careful data transformation and type conversion to ensure accurate computations.
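One way to see how fragile a similarity score might be is to count how many users rated both movies in each pair. The sketch below uses made-up ratings, and the threshold `min_overlap` is a hypothetical choice, not a value taken from this assignment.

```r
# Toy user-item matrix with hypothetical ratings; NA means "not rated"
ratings <- matrix(
  c( 5,  4, NA,
     4,  5,  2,
     1,  2,  5,
    NA,  1,  4),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("u1", "u2", "u3", "u4"), c("A", "B", "C"))
)

# 1 where a rating exists, 0 where it is missing
rated <- (!is.na(ratings)) * 1

# Entry [i, j] counts the users who rated both movie i and movie j;
# the diagonal holds the number of raters per movie
co_raters <- crossprod(rated)
co_raters

# Flag movie pairs whose similarity rests on too few shared raters
min_overlap <- 3  # hypothetical threshold
unreliable <- co_raters < min_overlap
unreliable
```

Pairs flagged here could be excluded or down-weighted before similarities are used for prediction.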
The movie rating dataset is imported from a GitHub repository using the tidyverse package. It contains user ratings for several movies released in 2025, along with demographic information such as age and gender. After loading, the first six rows are previewed to check the structure and confirm that the import succeeded.
# A tibble: 6 × 11
Timestamp `Email Address` Name Rate Superman (2025)…¹ Rate F1: The Movie (…²
<chr> <chr> <chr> <dbl> <dbl>
1 2/4/2026 … feva0706@gmail… Foiz… 3.5 4.2
2 2/4/2026 … jahidneel10@gm… Jahi… 4.5 4
3 2/5/2026 … mhasanww@gmail… Mahm… NA NA
4 2/5/2026 … sadmansobhan@y… Sadm… NA 4.9
5 2/5/2026 … shahjahan.csek… Shah… 3.6 NA
6 2/5/2026 … rhoque.nsu@gma… Read… NA 3.5
# ℹ abbreviated names:
# ¹`Rate Superman (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.`,
# ²`Rate F1: The Movie (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.`
# ℹ 6 more variables:
# `Rate Mission: Impossible – The Final Reckoning (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>,
# `Rate Jurassic World: Rebirth (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>,
# `Rate Sinners (2025) on a scale of 1 to 5. Enter "NA" if you haven't watched this movie.` <dbl>, …
Data Cleaning and Preparation
This section cleans and standardizes the dataset for analysis. Column names are converted into a consistent format using clean_names() from the janitor package. Key variables such as user information and demographic attributes are renamed for clarity. Movie rating columns are simplified by removing the long survey question text from their names. Finally, all movie rating variables are converted to numeric so they can be used in recommender system modeling.
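The name-shortening and type-conversion steps could look roughly like the base R sketch below. It works on the raw survey column names shown in the preview rather than on the output of clean_names(), and the regular expression is an assumption about the question wording, not the code used in this report.

```r
# Two of the raw survey column names from the preview, plus a non-rating column
raw_names <- c(
  "Name",
  "Rate Superman (2025) on a scale of 1 to 5. Enter \"NA\" if you haven't watched this movie.",
  "Rate F1: The Movie (2025) on a scale of 1 to 5. Enter \"NA\" if you haven't watched this movie."
)

# Keep only the title between the leading "Rate " and " (2025) on a scale ...";
# names that do not match the pattern are returned unchanged
short_names <- sub("^Rate (.*) \\(2025\\) on a scale.*$", "\\1", raw_names)
short_names

# Ratings that arrive as text are converted to numbers;
# the text "NA" becomes a real missing value
raw_ratings <- c("3.5", "NA", "4.9")
ratings_num <- as.numeric(raw_ratings)
ratings_num
```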
User-Item Matrix Construction and Train-Test Split
In this section, the cleaned dataset is transformed into a user-item rating matrix, where rows represent users and columns represent movies. This matrix is then converted into the realRatingMatrix format required by the recommenderlab package. The users are split into training and testing sets using an 80/20 split to evaluate model performance. Within the test set, each user's ratings are further divided into known ratings, which are given to the model, and unknown ratings, which are held out and compared with the predictions.
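A minimal sketch of this conversion and split with recommenderlab is shown below. The toy matrix, the seed, and the values `given = 3` and `goodRating = 4` are assumptions for illustration; the settings actually used in this report are not shown.

```r
library(recommenderlab)

set.seed(123)  # the split is random, so fix the seed for reproducibility

# Hypothetical toy data: 10 users x 5 movies, ratings from 1 to 5
toy <- matrix(
  as.numeric(sample(1:5, 50, replace = TRUE)),
  nrow = 10,
  dimnames = list(paste0("user", 1:10), paste0("movie", 1:5))
)
toy[cbind(1:10, rep(1:5, 2))] <- NA  # one missing rating per user

# Convert the plain matrix to recommenderlab's rating matrix class
ratings_rrm <- as(toy, "realRatingMatrix")

# 80/20 split of users; for each test user, 3 ratings are "known"
# and the remaining ones are held out as "unknown"
scheme <- evaluationScheme(
  ratings_rrm,
  method = "split",
  train = 0.8,
  given = 3,
  goodRating = 4
)

train_data   <- getData(scheme, "train")
known_data   <- getData(scheme, "known")
unknown_data <- getData(scheme, "unknown")
```

Every test user needs more ratings than `given`, which can be hard to satisfy in a very sparse survey dataset.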
This section evaluates different values of k (from 1 to 5) to identify the optimal number of nearest neighbors in the item-based collaborative filtering model. For each k value, an IBCF model is trained using cosine similarity, and predictions are generated for the test data. The Root Mean Squared Error (RMSE) is calculated by comparing predicted and actual ratings. The results are stored in a summary table for comparison.
# -----------------------------
# Function to compute RMSE for a given k
# -----------------------------
get_rmse <- function(k_value) {
  model <- Recommender(
    data = train_data,
    method = "IBCF",
    parameter = list(method = "Cosine", k = k_value)
  )
  pred <- predict(model, known_data, type = "ratings")
  pred_mat <- as(pred, "matrix")
  true_mat <- as(unknown_data, "matrix")
  sqrt(mean((pred_mat - true_mat)^2, na.rm = TRUE))
}

# -----------------------------
# Run for k = 1 to 5
# -----------------------------
k_values <- 1:5
rmse_results <- tibble(
  k = k_values,
  RMSE = sapply(k_values, get_rmse)
)
rmse_results |> gt()
k   RMSE
1   0.7000000
2   0.7000000
3   0.5869545
4   0.5869545
5   0.5869545
Best k Selection
The optimal value of k is selected based on the lowest RMSE obtained in the previous step. Among the values tried, this is the one that gives the best predictive accuracy on the held-out test ratings.
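The selection code is not shown in this report, so the following is only one possible way to do it. It re-enters the RMSE values from the table above and uses base R's `which.min()`, which returns the first minimum and therefore resolves ties in favor of the smallest k.

```r
# RMSE values copied from the tuning results above
rmse_results <- data.frame(
  k = 1:5,
  RMSE = c(0.7000000, 0.7000000, 0.5869545, 0.5869545, 0.5869545)
)

# which.min() returns the position of the first minimum,
# so when several k values tie, the smallest k is kept
best_k <- rmse_results[which.min(rmse_results$RMSE), ]
best_k$k
```

Preferring the smallest tied k keeps the neighborhood as small as possible for the same measured error.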
Using the optimal k value identified through tuning, the final Item-Based Collaborative Filtering model is trained on the full dataset. This lets the model use all available user ratings while keeping the neighborhood size chosen during tuning.
best_k_value <- best_k$k

final_model <- Recommender(
  data = ratings_rrm,
  method = "IBCF",
  parameter = list(method = "Cosine", k = best_k_value)
)

pred_final <- predict(final_model, ratings_rrm, type = "ratings")
pred_matrix <- as(pred_final, "matrix")
Personalized Movie Recommendations
This section generates personalized movie recommendations for each user. For every user, only movies they have not rated are considered. Predicted ratings are sorted in descending order to produce a ranked list of recommended movies. The final output is presented in a structured table format for easy interpretation.
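The ranking step could be sketched in base R as follows. The prediction matrix here is made up, with NA marking movies a user has already rated or movies for which no prediction could be made; the column names and `top_n` are hypothetical.

```r
# Hypothetical predicted ratings; NA = already rated or no prediction available
pred_matrix <- matrix(
  c( NA, 4.2, 3.1,
    3.8,  NA,  NA,
     NA,  NA,  NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("user_a", "user_b", "user_c"),
    c("Movie X", "Movie Y", "Movie Z")
  )
)

top_n <- 2  # hypothetical list length

# Reshape to one row per user-movie pair
recs <- as.data.frame(as.table(pred_matrix), stringsAsFactors = FALSE)
names(recs) <- c("user", "movie", "predicted_rating")

# Keep only pairs with a prediction, then sort within each user
recs <- recs[!is.na(recs$predicted_rating), ]
recs <- recs[order(recs$user, -recs$predicted_rating), ]

# Rank within each user and keep the top N
recs$rank <- ave(recs$predicted_rating, recs$user, FUN = seq_along)
recs <- recs[recs$rank <= top_n, ]
rownames(recs) <- NULL
recs
```

A user with no available predictions, like `user_c` in this toy data, simply does not appear in the output.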
This project implemented an item-based collaborative filtering recommender system using cosine similarity. The lowest RMSE (0.5869545) was shared by k = 3, 4, and 5, and k = 3 was selected for the final model.
However, the results are constrained by the small and sparse nature of the dataset (6 movies and 7 users). Many users have rated only a few movies, which limits how many reliable recommendations can be generated.
For example, Tabassumul Kayenath rated only one movie and left five movies unrated. In item-based collaborative filtering, recommendations are derived by comparing a user’s rated items with similar items. Since she has only one rated movie, the model has very limited information to propagate preferences, which results in only a small number of meaningful predicted ratings instead of a full recommendation list.
Additionally, missing ratings (NA values) and the small number of items reduce the strength of similarity estimates. With only 6 movies, each movie has at most 5 possible neighbors, which is likely why increasing k beyond 3 does not change the RMSE.
Overall, while the model successfully generates personalized recommendations, its effectiveness is limited by data sparsity and the small scale of the dataset, which are common challenges in real-world recommender systems with limited user engagement.
Reference
OpenAI. (2026). ChatGPT conversation: Recommender System Approach [Large language model]. https://chat.openai.com/