Project: MovieLens Data Science Capstone
Dataset: MovieLens Latest Small (100,836 ratings)
Date: June 2026
Prepared by: SurajKumar Shetty
This report presents the initial exploratory analysis of the MovieLens dataset, a widely used benchmark for building movie recommendation systems. The goal of this project is to develop a movie rating prediction algorithm and deploy it as an interactive Shiny application.
Key Findings:
- The dataset contains 100,836 ratings from 610 users on 9,724 movies, with a sparsity of 98.3% - typical for recommendation system datasets.
- Ratings are biased toward the high end of the scale (a left-skewed distribution), with an average rating of 3.50/5.0 and the most common rating being 4.0 stars.
- Drama and Comedy are the most prevalent genres, while Film-Noir, War, and Documentary films receive the highest average ratings.
- User activity varies significantly - the most active user contributed 2,698 ratings, while the median is only 70.5 ratings per user.
- 35.4% of movies have only a single rating, presenting a “cold start” challenge for the prediction algorithm.
The MovieLens dataset is provided by GroupLens Research at the University of Minnesota. For this project, we use the “latest-small” dataset, which is appropriate for education and development purposes. The dataset consists of four files:
| File | Description | Rows | Columns |
|---|---|---|---|
| ratings.csv | User-movie ratings | 100,836 | userId, movieId, rating, timestamp |
| movies.csv | Movie metadata | 9,724 | movieId, title, genres |
| tags.csv | User-generated tags | 3,683 | userId, movieId, tag, timestamp |
| links.csv | External IDs (IMDb, TMDb) | 9,724 | movieId, imdbId, tmdbId |
| Metric | Value |
|---|---|
| Number of Users | 610 |
| Number of Movies | 9,724 |
| Number of Ratings | 100,836 |
| Number of Tags | 3,683 |
| Rating Scale | 0.5 - 5.0 (half-star increments) |
| Average Rating | 3.50 |
| Median Rating | 3.5 |
| Matrix Sparsity | 98.30% |
Note on Sparsity: The sparsity of 98.3% means that users have rated only about 1.7% of all possible user-movie combinations. This is a common characteristic of recommendation datasets and is the core challenge our prediction algorithm must address.
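The sparsity figure follows directly from the three counts in the summary table. As a quick check, here is a minimal Python sketch of the arithmetic (Python is used only for illustration; the variable names are our own):

```python
# Counts taken from the summary table above.
n_users = 610
n_movies = 9_724
n_ratings = 100_836

possible_pairs = n_users * n_movies    # every user-movie combination
density = n_ratings / possible_pairs   # fraction of combinations that have a rating
sparsity = 1.0 - density               # fraction that is empty

print(f"possible pairs: {possible_pairs:,}")
print(f"density:  {density:.2%}")
print(f"sparsity: {sparsity:.2%}")
```

In other words, roughly 5.9 million cells exist in the user-movie matrix and only about 100 thousand of them are filled.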
The distribution of ratings across the dataset reveals a clear positive bias - users tend to rate movies they liked more often than movies they disliked.
| Rating | Count | Percentage |
|---|---|---|
| 0.5 | 1,370 | 1.36% |
| 1.0 | 2,811 | 2.79% |
| 1.5 | 1,791 | 1.78% |
| 2.0 | 7,551 | 7.49% |
| 2.5 | 5,550 | 5.50% |
| 3.0 | 20,047 | 19.88% |
| 3.5 | 13,136 | 13.03% |
| 4.0 | 26,818 | 26.60% |
| 4.5 | 8,551 | 8.48% |
| 5.0 | 13,211 | 13.10% |
Key Observations:
- 48.2% of all ratings are 4.0 or higher (26.60% + 8.48% + 13.10% in the table above), indicating a strong positive bias.
- The most common rating is 4.0 stars (26.6% of all ratings), followed by 3.0 stars (19.9%).
- Low ratings (1.5 and below) account for only 5.9% of all ratings.
- The distribution is left-skewed, which is typical for voluntary rating systems where users rate movies they chose to watch.
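The summary statistics reported earlier can be recovered from the count table alone. The following Python sketch is illustrative only: the helper name `weighted_median` is our own, and it uses a simplified median that returns the first rating at which the cumulative count reaches half the total.

```python
# Rating counts copied from the distribution table above.
counts = {
    0.5: 1_370, 1.0: 2_811, 1.5: 1_791, 2.0: 7_551, 2.5: 5_550,
    3.0: 20_047, 3.5: 13_136, 4.0: 26_818, 4.5: 8_551, 5.0: 13_211,
}

total = sum(counts.values())
mean_rating = sum(rating * n for rating, n in counts.items()) / total


def weighted_median(counts):
    """Return the rating at which the cumulative count first reaches half the total."""
    half = sum(counts.values()) / 2
    running = 0
    for rating in sorted(counts):
        running += counts[rating]
        if running >= half:
            return rating


# Share of ratings at 1.5 stars or below.
low_share = sum(n for rating, n in counts.items() if rating <= 1.5) / total

print(f"total ratings: {total:,}")
print(f"mean rating:   {mean_rating:.2f}")
print(f"median rating: {weighted_median(counts)}")
print(f"share <= 1.5:  {low_share:.1%}")
```

A mean that sits close to the median while the mode is higher (4.0) is consistent with the left-skewed shape described above.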
Understanding how users engage with the platform is critical for building a recommendation system.
| User Activity Metric | Value |
|---|---|
| Mean ratings per user | 165.3 |
| Median ratings per user | 70.5 |
| Standard deviation | 269.5 |
| Most active user | 2,698 ratings |
| Least active user | 20 ratings |
| Users with 50+ ratings | 385 (63.1%) |
Key Observations:
- The distribution is heavily right-skewed: a small number of highly active users contribute disproportionately to the dataset.
- The median (70.5) is less than half the mean (165.3), confirming the presence of “super-users.”
- All users have rated at least 20 movies, which is a minimum threshold applied by MovieLens to ensure data quality.
Just as user activity varies, movie popularity also follows a highly skewed distribution.
| Movie Popularity Metric | Value |
|---|---|
| Mean ratings per movie | 10.4 |
| Median ratings per movie | 3.0 |
| Most rated movie | 329 ratings |
| Movies with only 1 rating | 3,446 (35.4%) |
| Movies with 50+ ratings | 450 (4.6%) |
Key Observations:
- The long-tail effect is prominent: a small number of popular movies receive most ratings, while the majority have very few.
- 35.4% of movies have only a single rating, which poses a significant challenge for collaborative filtering approaches.
- Only 4.6% of movies have 50 or more ratings, meaning our algorithm must handle sparse data effectively.
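The per-user and per-movie metrics in the two tables above all come from counting rows of `ratings.csv` by `userId` or `movieId`. Here is a minimal Python sketch on a tiny, entirely hypothetical set of (userId, movieId) pairs; the real analysis would read the full file instead.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical (userId, movieId) pairs standing in for rows of ratings.csv.
pairs = [
    (1, 10), (1, 20), (1, 30),
    (2, 10), (2, 20),
    (3, 10), (3, 40),
    (4, 10), (4, 50),
]

ratings_per_user = Counter(user_id for user_id, _ in pairs)
ratings_per_movie = Counter(movie_id for _, movie_id in pairs)

# Activity metrics: a mean above the median points to a right-skewed distribution.
user_mean = mean(ratings_per_user.values())
user_median = median(ratings_per_user.values())

# Long-tail metrics: movies with a single rating, and the share of all
# ratings captured by the k most-rated movies.
single_share = sum(1 for n in ratings_per_movie.values() if n == 1) / len(ratings_per_movie)
k = 2
top_k_share = sum(n for _, n in ratings_per_movie.most_common(k)) / len(pairs)

print(f"mean / median ratings per user: {user_mean} / {user_median}")
print(f"movies with one rating: {single_share:.0%}")
print(f"share of ratings in top {k} movies: {top_k_share:.0%}")
```

The top-k share is not one of the reported metrics; it is included only as one possible way to quantify the long-tail effect.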
| Rank | Movie Title | Number of Ratings | Average Rating |
|---|---|---|---|
| 1 | Forrest Gump (1994) | 329 | 4.16 |
| 2 | The Shawshank Redemption (1994) | 317 | 4.43 |
| 3 | Pulp Fiction (1994) | 307 | 4.20 |
| 4 | The Silence of the Lambs (1991) | 279 | 4.16 |
| 5 | The Matrix (1999) | 278 | 4.19 |
| 6 | Star Wars: Episode IV - A New Hope (1977) | 251 | 4.23 |
| 7 | Jurassic Park (1993) | 238 | 3.75 |
| 8 | Braveheart (1995) | 237 | 4.03 |
| 9 | Terminator 2: Judgment Day (1991) | 224 | 3.97 |
| 10 | Schindler’s List (1993) | 220 | 4.22 |
Key Observations:
- Classic films from the 1990s dominate the most-rated list (9 of the top 10), likely reflecting the demographics of MovieLens users.
- All top 10 movies have average ratings of 3.75 or higher, suggesting that popular movies are also well-rated.
- The most-rated movie (Forrest Gump) has only 329 ratings from 610 users, confirming the dataset’s sparsity.
The dataset contains movies spanning nearly a century, with a strong concentration in recent decades.
Key Observations:
- The dataset has a strong bias toward movies from the 1980s through the 2010s.
- The peak is around the early 2000s, likely reflecting the MovieLens user base’s viewing preferences.
- This temporal bias is important to consider - the recommendation system may perform better for newer movies.
Movies in the dataset are tagged with 19 distinct genres (plus 34 movies with no genre listed). A single movie can belong to multiple genres.
| Rank | Genre | Movie Count | Total Ratings | Avg Rating |
|---|---|---|---|---|
| 1 | Drama | 4,361 | 41,928 | 3.66 |
| 2 | Comedy | 3,756 | 39,053 | 3.38 |
| 3 | Thriller | 1,894 | 26,452 | 3.49 |
| 4 | Action | 1,828 | 30,635 | 3.45 |
| 5 | Romance | 1,596 | 18,124 | 3.51 |
| 6 | Adventure | 1,263 | 24,161 | 3.51 |
| 7 | Crime | 1,199 | 16,681 | 3.66 |
| 8 | Sci-Fi | 980 | 17,243 | 3.46 |
| 9 | Horror | 978 | 7,291 | 3.26 |
| 10 | Fantasy | 779 | 11,834 | 3.49 |
| 11 | Children | 664 | 9,208 | 3.41 |
| 12 | Animation | 611 | 6,988 | 3.63 |
| 13 | Mystery | 573 | 7,674 | 3.63 |
| 14 | Documentary | 440 | 1,219 | 3.80 |
| 15 | War | 382 | 4,859 | 3.81 |
| 16 | Musical | 334 | 4,138 | 3.56 |
| 17 | Western | 167 | 1,930 | 3.58 |
| 18 | IMAX | 158 | 4,145 | 3.62 |
| 19 | Film-Noir | 87 | 870 | 3.92 |
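Because a movie can carry several genres, the per-genre movie counts overlap and their shares can sum to more than 100%. In `movies.csv` the `genres` field is pipe-delimited, and movies without a genre carry the literal value `(no genres listed)`. The Python sketch below illustrates the counting on hypothetical rows (the titles are made up):

```python
from collections import Counter

# Hypothetical rows in the movies.csv layout: (movieId, title, genres).
movies = [
    (1, "Movie A", "Comedy|Drama"),
    (2, "Movie B", "Drama"),
    (3, "Movie C", "Comedy|Romance"),
    (4, "Movie D", "(no genres listed)"),
]

genre_counts = Counter()
for _, _, genres in movies:
    if genres == "(no genres listed)":
        continue  # counted separately in the report, not as a genre
    genre_counts.update(genres.split("|"))

# Share of the catalog carrying each genre; these shares overlap.
shares = {genre: n / len(movies) for genre, n in genre_counts.items()}

for genre, n in genre_counts.most_common():
    print(f"{genre:<8} {n} movies ({shares[genre]:.0%})")
print(f"sum of shares: {sum(shares.values()):.0%}")
```

Here the shares add up to 125% even though one movie has no genre at all, which is why per-genre percentages should be read individually rather than summed.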
When examining average ratings by genre, clear patterns emerge:
| Rank | Genre | Avg Rating | vs. Overall (3.50) |
|---|---|---|---|
| 1 | Film-Noir | 3.92 | +0.42 |
| 2 | War | 3.81 | +0.31 |
| 3 | Documentary | 3.80 | +0.30 |
| 4 | Drama | 3.66 | +0.16 |
| 5 | Crime | 3.66 | +0.16 |
| 6 | Animation | 3.63 | +0.13 |
| 7 | Mystery | 3.63 | +0.13 |
| 8 | IMAX | 3.62 | +0.12 |
| 9 | Western | 3.58 | +0.08 |
| 10 | Musical | 3.56 | +0.06 |
| 11 | Adventure | 3.51 | +0.01 |
| 12 | Romance | 3.51 | ~0.00 |
| 13 | Thriller | 3.49 | -0.01 |
| 14 | Fantasy | 3.49 | -0.01 |
| 15 | Sci-Fi | 3.46 | -0.04 |
| 16 | Action | 3.45 | -0.05 |
| 17 | Children | 3.41 | -0.09 |
| 18 | Comedy | 3.38 | -0.12 |
| 19 | Horror | 3.26 | -0.24 |
Key Observations:
- Film-Noir receives the highest average rating (3.92), but has the fewest movies (87), suggesting a niche but appreciative audience.
- Horror receives the lowest average rating (3.26), despite having nearly 1,000 movies - this may reflect the genre’s polarizing nature.
- Documentary and War films also rate highly, suggesting that viewers who choose these genres have specific positive expectations.
- Mainstream genres like Comedy and Action have average ratings below the overall mean, likely due to their larger and more diverse viewership.
The boxplot below shows how ratings are distributed within each of the top genres:
The dataset spans ratings from 1995 to 2018, with varying levels of activity across years.
Key Observations:
- Rating activity peaked around the year 2000 and again in the mid-2010s.
- Average ratings fluctuate between approximately 3.3 and 3.9 across years, with no clear long-term trend.
- Early years (1995-1998) show high variability due to smaller user bases.
Analyzing when users are most active reveals interesting behavioral patterns:
Key Observations:
- Evening hours (6 PM - 10 PM) consistently show the highest rating activity across all days.
- Weekday evenings (Monday-Friday, 6-10 PM) show the strongest concentration of ratings.
- Late-night activity (midnight - 2 AM) is notable, particularly on weekends.
- Sunday evenings also represent a peak activity period.
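Hour-of-day and day-of-week patterns depend on how the `timestamp` column is decoded. MovieLens timestamps are seconds since the Unix epoch in UTC, and the files do not record each user's local time zone, so "evening" here means evening in whatever zone the conversion assumes. Here is a minimal Python sketch of the conversion, assuming UTC and using hypothetical timestamps:

```python
from collections import Counter
from datetime import datetime, timezone

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def weekday_and_hour(ts):
    """Convert a Unix timestamp in seconds to (weekday name, hour) in UTC."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return DAYS[dt.weekday()], dt.hour


# Hypothetical timestamps in the style of the ratings.csv timestamp column.
timestamps = [0, 72_000, 964_982_703]

# Count ratings per (weekday, hour) cell, the basis for an activity heatmap.
activity = Counter(weekday_and_hour(ts) for ts in timestamps)

for (day, hour), n in sorted(activity.items()):
    print(f"{day:<9} {hour:02d}:00  {n} rating(s)")
```

If the analysis used a different time zone, the peaks described above would shift by the corresponding number of hours.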
Data Sparsity (98.3%): The rating matrix is extremely sparse. This is the fundamental challenge for collaborative filtering - we need to make predictions for user-movie pairs with no historical interaction.
Cold Start Problem: 35.4% of movies have only a single rating, and many users have rated relatively few movies. The algorithm must handle new users and new movies gracefully.
Rating Bias: The positive bias in ratings (mean = 3.50, only 5.9% of ratings below 2.0) means the algorithm must account for users’ tendency to rate movies they expect to enjoy.
Long-Tail Distribution: Both user activity and movie popularity are heavy-tailed. A small subset of users and movies accounts for a disproportionate share of the ratings.
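Genre Imbalance: Drama (4,361 movies, about 45% of the catalog) and Comedy (3,756, about 39%) are by far the largest genres, while niche genres like Film-Noir (87) and Western (167) have limited data. Because a movie can belong to several genres, these shares overlap and cannot simply be added together.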
Genre Quality Hierarchy: There is a clear hierarchy of average ratings by genre. Niche genres (Film-Noir, Documentary, War) have the highest average ratings, ahead of large mainstream genres such as Comedy and Action. This is consistent with a self-selection effect - users who seek out niche genres may be more likely to enjoy them - although the data here cannot confirm the cause.
The 1990s Effect: The most-rated movies are disproportionately from the 1990s, suggesting either a demographic bias in the user base or a genuine “golden age” perception.
User Heterogeneity: The gap between the most active user (2,698 ratings) and the median user (70.5 ratings) is enormous. This suggests that different recommendation strategies may be needed for different user segments.
Temporal Stability: Despite spanning over two decades, the average rating has remained relatively stable around 3.5, suggesting consistent rating behavior over time.
Based on the exploratory analysis, here is our plan for building the movie rating prediction algorithm:
Given the challenges identified (sparsity, cold start, long-tail), we will implement a hybrid approach combining multiple techniques:
| Component | Technique | Purpose |
|---|---|---|
| Collaborative Filtering | Matrix Factorization (SVD/ALS) | Capture latent user preferences and movie features from the rating matrix |
| Content-Based Filtering | Genre features + TF-IDF on tags | Handle cold-start for new movies with limited ratings |
| Regularization | L2 regularization + bias terms | Prevent overfitting on sparse data; account for user and movie rating biases |
| Ensemble | Weighted average of CF and CB predictions | Combine strengths of both approaches |
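To make the "L2 regularization + bias terms" row concrete, here is a minimal Python sketch of a regularized bias baseline: each prediction is the global mean plus a movie bias and a user bias, and each bias is shrunk toward zero when it rests on few ratings. The data, the value of `lam`, and all names are hypothetical; the project's actual model may be built differently.

```python
from collections import defaultdict

# Hypothetical (user, movie, rating) triples.
ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 4.0),
    ("u2", "m1", 4.0), ("u2", "m2", 2.0),
    ("u3", "m1", 3.0),
]
lam = 5.0  # hypothetical regularization strength; would be tuned by cross-validation

mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating

# Movie bias: sum of residuals from the global mean, divided by (lam + n).
# Adding lam to the denominator shrinks biases backed by few ratings.
movie_sum, movie_n = defaultdict(float), defaultdict(int)
for _, movie, r in ratings:
    movie_sum[movie] += r - mu
    movie_n[movie] += 1
b_movie = {m: movie_sum[m] / (lam + movie_n[m]) for m in movie_sum}

# User bias: same idea, computed on what is left after removing the movie bias.
user_sum, user_n = defaultdict(float), defaultdict(int)
for user, movie, r in ratings:
    user_sum[user] += r - mu - b_movie[movie]
    user_n[user] += 1
b_user = {u: user_sum[u] / (lam + user_n[u]) for u in user_sum}


def predict(user, movie):
    """Global mean plus biases; unseen users or movies contribute a bias of 0."""
    raw = mu + b_movie.get(movie, 0.0) + b_user.get(user, 0.0)
    return min(5.0, max(0.5, raw))  # clamp to the 0.5-5.0 rating scale


print(round(predict("u1", "m1"), 3))
print(round(predict("new_user", "new_movie"), 3))  # falls back to the global mean
```

A baseline of this kind also gives a natural reference point: the matrix factorization component is only worth its complexity if it beats this on held-out data.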
| Metric | Description | Target |
|---|---|---|
| RMSE | Root Mean Squared Error | < 0.90 |
| MAE | Mean Absolute Error | < 0.70 |
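Both metrics are simple functions of the prediction errors. A minimal Python sketch with hypothetical ratings and predictions:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: average size of the error in stars."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out ratings and model predictions.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(f"RMSE: {rmse(actual, predicted):.3f}")
print(f"MAE:  {mae(actual, predicted):.3f}")
```

RMSE is never smaller than MAE on the same errors, which is why its target is the looser of the two.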
We will use 5-fold cross-validation on the training set for hyperparameter tuning, then evaluate the final model on the held-out test set.
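As an illustration of the splitting step only, the Python sketch below partitions row indices of the training set into five folds so that each row is used for validation exactly once. The function name, seed, and row count are hypothetical, and the held-out test set is assumed to have been removed beforehand.

```python
import random


def k_fold_indices(n_rows, k=5, seed=42):
    """Yield (train_idx, valid_idx) pairs that cover range(n_rows) in k folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)         # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]    # k near-equal, disjoint folds
    for i in range(k):
        valid = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid


# Hypothetical training set of 23 rows.
for fold_no, (train, valid) in enumerate(k_fold_indices(23), start=1):
    print(f"fold {fold_no}: {len(train)} train rows, {len(valid)} validation rows")
```

For each candidate hyperparameter setting, the model would be fit on `train`, scored on `valid`, and the five scores averaged.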
Matrix Factorization handles sparsity well by learning latent representations, and has been shown to be highly effective on the MovieLens dataset.
Content-Based Filtering provides a fallback for the cold-start problem where collaborative filtering struggles.
The hybrid approach is robust and widely used in industry (Netflix, Amazon), making it suitable for production deployment in the Shiny app.
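To show what "learning latent representations" means in practice, here is a minimal Python sketch of matrix factorization trained by stochastic gradient descent with an L2 penalty. It is a toy illustration, not the project's implementation: the data, hyperparameters, and function names are all hypothetical, and an SVD or ALS solver from an established library would be the practical choice.

```python
import math
import random


def train_mf(ratings, n_factors=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Fit r_ui ~ mu + p_u . q_i by SGD with an L2 penalty on the factors."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Small random starting values for the latent factors.
    P = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(a * b for a, b in zip(P[u], Q[i])))
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, P, Q


def predict(model, user, item):
    mu, P, Q = model
    if user not in P or item not in Q:
        return mu  # unseen user or movie: fall back to the global mean
    return mu + sum(a * b for a, b in zip(P[user], Q[item]))


def rmse(model, ratings):
    return math.sqrt(sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings) / len(ratings))


# Hypothetical ratings with two taste groups; ("u4", "m2") is unobserved.
ratings = [
    ("u1", "m1", 5.0), ("u1", "m2", 4.0), ("u1", "m3", 1.0),
    ("u2", "m1", 4.0), ("u2", "m2", 5.0), ("u2", "m3", 2.0),
    ("u3", "m1", 1.0), ("u3", "m2", 2.0), ("u3", "m3", 5.0),
    ("u4", "m1", 2.0), ("u4", "m3", 4.0),
]

untrained = train_mf(ratings, epochs=0)
trained = train_mf(ratings)
print(f"training RMSE before: {rmse(untrained, ratings):.3f}")
print(f"training RMSE after:  {rmse(trained, ratings):.3f}")
print(f"prediction for the unobserved pair (u4, m2): {predict(trained, 'u4', 'm2'):.2f}")
```

The fallback to the global mean in `predict` is exactly where the content-based component would step in for movies or users the factor model has not seen.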
The Shiny app will provide an interactive interface to demonstrate the recommendation system:
| Feature | Description |
|---|---|
| Movie Search | Users can search for movies by title and see predicted ratings |
| Personalized Recommendations | Given a user ID, the app displays top-N recommended movies |
| Genre Filter | Users can filter recommendations by preferred genres |
| Rating History | View a user’s past ratings and predicted ratings side-by-side |
| Model Explanation | Visualize how different factors contribute to a prediction |
    Shiny App
    |-- Input Panel
    |   |-- User ID selector
    |   |-- Genre filter (multi-select)
    |   '-- Number of recommendations slider
    |-- Output Panel
    |   |-- Recommended movies table (with predicted ratings)
    |   |-- Genre breakdown of recommendations
    |   '-- User's rating history visualization
    '-- Model Backend (R code)
        |-- Trained matrix factorization model
        |-- Movie metadata (genres, titles)
        '-- Prediction function
- The model will be pre-trained and loaded into the Shiny app as an RData file for fast predictions.
- Movie metadata will be stored as a lookup table.
- The app will handle edge cases (new users with no history, invalid user IDs) gracefully.
- Predictions will be computed on-demand using matrix multiplication, which is fast even for the full catalog.
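The on-demand prediction step amounts to taking the dot product of one user's factor vector with every movie's factor vector, then keeping the highest scores. The Python sketch below illustrates that logic with hypothetical pre-trained factors; the real backend is planned in R, and every name and value here is an assumption.

```python
import heapq

# Hypothetical pre-trained model pieces, as they might be loaded at app start-up.
global_mean = 3.5
user_factors = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
movie_factors = {"m1": [1.0, 0.2], "m2": [0.1, 1.0], "m3": [0.5, 0.5]}
already_rated = {"u1": {"m1"}}


def recommend(user_id, n=2):
    """Return the top-n (movie, predicted rating) pairs for a known user."""
    if user_id not in user_factors:
        return []  # unknown or new user: the app would show a fallback instead
    p = user_factors[user_id]
    seen = already_rated.get(user_id, set())
    scored = []
    for movie, q in movie_factors.items():
        if movie in seen:
            continue  # do not recommend movies the user has already rated
        raw = global_mean + sum(a * b for a, b in zip(p, q))
        scored.append((movie, min(5.0, max(0.5, raw))))  # clamp to the rating scale
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


print(recommend("u1"))
print(recommend("unknown_user"))
```

Returning an empty list for unknown IDs keeps the edge-case handling in one place, so the UI can decide whether to show popular movies or a message.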
| Phase | Task | Timeline |
|---|---|---|
| Phase 1 | Finalize data preprocessing and train-test split | Week 1 |
| Phase 2 | Implement matrix factorization model | Week 1-2 |
| Phase 3 | Implement content-based component | Week 2 |
| Phase 4 | Build hybrid ensemble and tune hyperparameters | Week 2-3 |
| Phase 5 | Evaluate final model and document results | Week 3 |
| Phase 6 | Build Shiny app UI and backend | Week 3-4 |
| Phase 7 | Test, debug, and deploy app | Week 4 |
This exploratory analysis has confirmed that the MovieLens dataset is well-suited for building a recommendation system, while also highlighting the key challenges that must be addressed. The extreme sparsity of the data (98.3%) makes this a non-trivial prediction problem, but the clear patterns in genre preferences, user behavior, and temporal activity provide a strong foundation for our hybrid approach.
Our planned hybrid recommendation system - combining matrix factorization for collaborative filtering with genre-based content filtering - is designed to handle both the sparsity challenge and the cold-start problem. The accompanying Shiny app will make these predictions accessible and interactive, providing a complete end-to-end data science product.
Key Takeaways for Stakeholders: