MovieLens Data Science Capstone

Project: MovieLens Data Science Capstone
Dataset: MovieLens Latest Small (100,836 ratings)
Date: June 2026
Prepared by: SurajKumar Shetty

Executive Summary

This report presents the initial exploratory analysis of the MovieLens dataset, a widely used benchmark for building movie recommendation systems. The goal of this project is to develop a movie rating prediction algorithm and deploy it as an interactive Shiny application.

Key Findings:

  • The dataset contains 100,836 ratings from 610 users on 9,724 movies, with a sparsity of 98.3% - typical for recommendation system datasets.

  • Ratings skew toward the high end of the scale, with an average rating of 3.50/5.0 and a most common rating of 4.0 stars.

  • Drama and Comedy are the most prevalent genres, while Film-Noir, War, and Documentary films receive the highest average ratings.

  • User activity varies significantly - the most active user contributed 2,698 ratings, while the median is only 70.5 ratings per user.

  • 35.4% of movies have only a single rating, presenting a “cold start” challenge for the prediction algorithm.

1. Dataset Overview

The MovieLens dataset is provided by GroupLens Research at the University of Minnesota. For this project, we use the “latest-small” dataset, which is appropriate for education and development purposes. The dataset consists of four files:

File | Description | Rows | Columns
ratings.csv | User-movie ratings | 100,836 | userId, movieId, rating, timestamp
movies.csv | Movie metadata | 9,724 | movieId, title, genres
tags.csv | User-generated tags | 3,683 | userId, movieId, tag, timestamp
links.csv | External IDs (IMDb, TMDb) | 9,724 | movieId, imdbId, tmdbId

Key Dataset Statistics

Metric | Value
Number of Users | 610
Number of Movies | 9,724
Number of Ratings | 100,836
Number of Tags | 3,683
Rating Scale | 0.5 - 5.0 (half-star increments)
Average Rating | 3.50
Median Rating | 3.5
Matrix Sparsity | 98.30%

Note on Sparsity: The sparsity of 98.3% means that users have rated only about 1.7% of all possible user-movie combinations. This is a common characteristic of recommendation datasets and is the core challenge our prediction algorithm must address.
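As a quick check of the arithmetic behind the sparsity figure, the sketch below recomputes it from the counts in the statistics table. It is written in Python purely for illustration (the project itself is planned in R), and the function name matrix_sparsity is hypothetical.

```python
def matrix_sparsity(n_users: int, n_movies: int, n_ratings: int) -> float:
    """Return the fraction of user-movie cells that hold no rating."""
    possible_pairs = n_users * n_movies
    return 1.0 - n_ratings / possible_pairs


# Counts taken from the Key Dataset Statistics table.
sparsity = matrix_sparsity(n_users=610, n_movies=9724, n_ratings=100836)
print(f"Sparsity: {sparsity:.2%}")
print(f"Density:  {1.0 - sparsity:.2%}")
```

With 610 x 9,724 = 5,931,640 possible pairs and 100,836 observed ratings, the density comes to roughly 1.7%, which matches the note above.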

2. Rating Distribution Analysis

2.1 Overall Rating Distribution

The distribution of ratings across the dataset reveals a clear positive bias - users tend to rate movies they liked more often than movies they disliked.

Rating | Count | Percentage
0.5 | 1,370 | 1.36%
1.0 | 2,811 | 2.79%
1.5 | 1,791 | 1.78%
2.0 | 7,551 | 7.49%
2.5 | 5,550 | 5.50%
3.0 | 20,047 | 19.88%
3.5 | 13,136 | 13.03%
4.0 | 26,818 | 26.60%
4.5 | 8,551 | 8.48%
5.0 | 13,211 | 13.10%

Key Observations:

  • 48.2% of all ratings are 4.0 or higher, indicating a strong positive bias.

  • The most common rating is 4.0 stars (26.6% of all ratings), followed by 3.0 stars (19.9%).

  • Low ratings (1.5 and below) account for only 5.9% of all ratings.

  • The distribution is left-skewed, which is typical for voluntary rating systems where users rate movies they chose to watch.
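The summary figures for the rating distribution can be recomputed directly from the count table. The following Python sketch is illustrative only (the project itself is planned in R); the helper names share_between and weighted_median are hypothetical, and the counts are copied from the table above.

```python
# Rating counts copied from the distribution table.
rating_counts = {
    0.5: 1370, 1.0: 2811, 1.5: 1791, 2.0: 7551, 2.5: 5550,
    3.0: 20047, 3.5: 13136, 4.0: 26818, 4.5: 8551, 5.0: 13211,
}
total = sum(rating_counts.values())


def share_between(counts, low, high):
    """Share of ratings whose value lies in the closed interval [low, high]."""
    selected = sum(c for r, c in counts.items() if low <= r <= high)
    return selected / sum(counts.values())


def weighted_median(counts):
    """Lower median of the ratings, computed from cumulative counts."""
    half = sum(counts.values()) / 2
    running = 0
    for rating in sorted(counts):
        running += counts[rating]
        if running >= half:
            return rating


mean_rating = sum(r * c for r, c in rating_counts.items()) / total
high_share = share_between(rating_counts, 4.0, 5.0)
low_share = share_between(rating_counts, 0.5, 1.5)
median_rating = weighted_median(rating_counts)

print(f"total={total}, mean={mean_rating:.2f}, median={median_rating}")
print(f"4.0 or higher: {high_share:.1%}, 1.5 or lower: {low_share:.1%}")
```

Working from the counts rather than the rounded percentages avoids accumulating rounding error when shares are added together.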

Figure: Rating Distribution

3. User Activity Analysis

3.1 User Rating Patterns

Understanding how users engage with the platform is critical for building a recommendation system.

User Activity Metric | Value
Mean ratings per user | 165.3
Median ratings per user | 70.5
Standard deviation | 269.5
Most active user | 2,698 ratings
Least active user | 20 ratings
Users with 50+ ratings | 385 (63.1%)

Key Observations:

  • The distribution is heavily right-skewed: a small number of highly active users contribute disproportionately to the dataset.

  • The median (70.5) is less than half the mean (165.3), confirming the presence of “super-users.”

  • All users have rated at least 20 movies, which is a minimum threshold applied by MovieLens to ensure data quality.
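To make the mean-versus-median comparison concrete, here is a minimal Python sketch (illustrative only; the project itself is planned in R) that summarizes ratings per user. The function name user_activity_summary and the toy user IDs are hypothetical, not drawn from the dataset.

```python
from collections import Counter
from statistics import mean, median


def user_activity_summary(user_ids):
    """Summarize how many ratings each user contributed.

    `user_ids` holds one entry per rating (the userId column of ratings.csv).
    """
    per_user = sorted(Counter(user_ids).values())
    return {
        "mean": mean(per_user),
        "median": median(per_user),
        "min": per_user[0],
        "max": per_user[-1],
    }


# Toy data: three light users and one "super-user".
toy_user_ids = [1] * 20 + [2] * 25 + [3] * 30 + [4] * 400
summary = user_activity_summary(toy_user_ids)
print(summary)
```

In the toy data a single heavy user pulls the mean far above the median, which is the same signature of right skew seen in the table above.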

Figure: User Activity Distribution

4. Movie Analysis

4.1 Movie Popularity Distribution

Just as user activity varies, movie popularity also follows a highly skewed distribution.

Movie Popularity Metric | Value
Mean ratings per movie | 10.4
Median ratings per movie | 3.0
Most rated movie | 329 ratings
Movies with only 1 rating | 3,446 (35.4%)
Movies with 50+ ratings | 450 (4.6%)

Key Observations:

  • The long-tail effect is prominent: a small number of popular movies receive most ratings, while the majority have very few.

  • 35.4% of movies have only a single rating, which poses a significant challenge for collaborative filtering approaches.

  • Only 4.6% of movies have 50 or more ratings, meaning our algorithm must handle sparse data effectively.
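One way to quantify the long tail is to measure how much of the rating volume the most popular titles absorb. The Python sketch below is illustrative only (the project itself is planned in R); popularity_profile, its parameters, and the toy movie IDs are hypothetical.

```python
from collections import Counter


def popularity_profile(movie_ids, top_k=1, heavy_threshold=50):
    """Describe how ratings concentrate on a few movies.

    `movie_ids` holds one entry per rating (the movieId column of ratings.csv).
    """
    counts = Counter(movie_ids)
    n_movies = len(counts)
    n_ratings = sum(counts.values())
    top_total = sum(count for _, count in counts.most_common(top_k))
    return {
        "single_share": sum(1 for c in counts.values() if c == 1) / n_movies,
        "heavy_share": sum(1 for c in counts.values() if c >= heavy_threshold) / n_movies,
        "top_k_rating_share": top_total / n_ratings,
    }


# Toy data: one blockbuster, one modest title, two single-rating titles.
toy_movie_ids = [10] * 60 + [20] * 3 + [30] + [40]
profile = popularity_profile(toy_movie_ids)
print(profile)
```

The top_k_rating_share value complements the per-movie counts: it shows directly how much of the training signal comes from a handful of titles.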

Figure: Movie Popularity Distribution

4.2 Top 10 Most Rated Movies

Rank | Movie Title | Number of Ratings | Average Rating
1 | Forrest Gump (1994) | 329 | 4.16
2 | The Shawshank Redemption (1994) | 317 | 4.43
3 | Pulp Fiction (1994) | 307 | 4.20
4 | The Silence of the Lambs (1991) | 279 | 4.16
5 | The Matrix (1999) | 278 | 4.19
6 | Star Wars: Episode IV - A New Hope (1977) | 251 | 4.23
7 | Jurassic Park (1993) | 238 | 3.75
8 | Braveheart (1995) | 237 | 4.03
9 | Terminator 2: Judgment Day (1991) | 224 | 3.97
10 | Schindler’s List (1993) | 220 | 4.22

Key Observations:

  • Films from the 1990s dominate the most-rated list (9 of the top 10), likely reflecting the demographics of MovieLens users.

  • All top 10 movies have average ratings of 3.75 or higher, suggesting that popular movies are also well-rated.

  • Even the most-rated movie (Forrest Gump) has only 329 ratings from 610 users (about 54%), confirming the dataset’s sparsity.

Figure: Top 15 Most Rated Movies

4.3 Movies by Release Year

The dataset contains movies spanning nearly a century, with a strong concentration in recent decades.

Figure: Movies by Release Year

Key Observations:

  • The dataset has a strong bias toward movies from the 1980s through the 2010s.

  • The peak is around the early 2000s, likely reflecting the MovieLens user base’s viewing preferences.

  • This temporal bias is important to consider - the recommendation system may perform better for newer movies.

5. Genre Analysis

5.1 Genre Distribution

Movies in the dataset are tagged with 19 distinct genres (plus 34 movies with no genre listed). A single movie can belong to multiple genres.

Rank | Genre | Movie Count | Total Ratings | Avg Rating
1 | Drama | 4,361 | 41,928 | 3.66
2 | Comedy | 3,756 | 39,053 | 3.38
3 | Thriller | 1,894 | 26,452 | 3.49
4 | Action | 1,828 | 30,635 | 3.45
5 | Romance | 1,596 | 18,124 | 3.51
6 | Adventure | 1,263 | 24,161 | 3.51
7 | Crime | 1,199 | 16,681 | 3.66
8 | Sci-Fi | 980 | 17,243 | 3.46
9 | Horror | 978 | 7,291 | 3.26
10 | Fantasy | 779 | 11,834 | 3.49
11 | Children | 664 | 9,208 | 3.41
12 | Animation | 611 | 6,988 | 3.63
13 | Mystery | 573 | 7,674 | 3.63
14 | Documentary | 440 | 1,219 | 3.80
15 | War | 382 | 4,859 | 3.81
16 | Musical | 334 | 4,138 | 3.56
17 | Western | 167 | 1,930 | 3.58
18 | IMAX | 158 | 4,145 | 3.62
19 | Film-Noir | 87 | 870 | 3.92

Figure: Genre Distribution

5.2 Average Rating by Genre

When examining average ratings by genre, clear patterns emerge:

Rank | Genre | Avg Rating | vs. Overall (3.50)
1 | Film-Noir | 3.92 | +0.42
2 | War | 3.81 | +0.31
3 | Documentary | 3.80 | +0.30
4 | Drama | 3.66 | +0.16
5 | Crime | 3.66 | +0.16
6 | Animation | 3.63 | +0.13
7 | Mystery | 3.63 | +0.13
8 | IMAX | 3.62 | +0.12
9 | Western | 3.58 | +0.08
10 | Musical | 3.56 | +0.06
11 | Adventure | 3.51 | +0.01
12 | Romance | 3.51 | ~0.00
13 | Thriller | 3.49 | -0.01
14 | Fantasy | 3.49 | -0.01
15 | Sci-Fi | 3.46 | -0.04
16 | Action | 3.45 | -0.05
17 | Children | 3.41 | -0.09
18 | Comedy | 3.38 | -0.12
19 | Horror | 3.26 | -0.24

Figure: Average Rating by Genre

Key Observations:

  • Film-Noir receives the highest average rating (3.92), but has the fewest movies (87), suggesting a niche but appreciative audience.

  • Horror receives the lowest average rating (3.26), despite having nearly 1,000 movies - this may reflect the genre’s polarizing nature.

  • Documentary and War films also rate highly, suggesting that viewers who choose these genres have specific positive expectations.

  • Mainstream genres like Comedy and Action have average ratings below the overall mean, likely due to their larger and more diverse viewership.
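Because a movie can carry several genres, each rating has to be counted once per genre when averaging. The Python sketch below is illustrative only (the project itself is planned in R) and assumes the genres field of movies.csv is a pipe-separated string such as "Comedy|Drama"; the function name and toy data are hypothetical.

```python
from collections import defaultdict


def average_rating_by_genre(ratings, movie_genres):
    """Average rating per genre, counting a rating once for each genre of its movie.

    ratings:      iterable of (movie_id, rating) pairs
    movie_genres: dict mapping movie_id to a pipe-separated genre string
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for movie_id, rating in ratings:
        for genre in movie_genres[movie_id].split("|"):
            totals[genre] += rating
            counts[genre] += 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


# Toy data, not taken from the dataset.
toy_genres = {1: "Comedy|Drama", 2: "Horror", 3: "Drama"}
toy_ratings = [(1, 4.0), (1, 3.0), (2, 2.5), (3, 5.0)]
genre_means = average_rating_by_genre(toy_ratings, toy_genres)
print(genre_means)
```

A consequence of this counting scheme is that per-genre rating totals sum to more than the number of ratings in the dataset, so genre counts should not be added together.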

5.3 Rating Distribution by Genre

The boxplot below shows how ratings are distributed within each of the top genres:

Figure: Rating Distribution by Genre

6. Temporal Patterns

6.1 Ratings Over Time

The dataset spans ratings from 1995 to 2018, with varying levels of activity across years.

Figure: Ratings Over Time

Key Observations:

  • Rating activity peaked around the year 2000 and again in the mid-2010s.

  • Average ratings fluctuate between approximately 3.3 and 3.9 across years, with no clear long-term trend.

  • Early years (1995-1998) show high variability due to smaller user bases.

6.2 Rating Activity Patterns

Analyzing when users are most active reveals interesting behavioral patterns:

Figure: Rating Activity Heatmap

Key Observations:

  • Evening hours (6 PM - 10 PM) consistently show the highest rating activity across all days.

  • Weekday evenings (Monday-Friday, 6-10 PM) show the strongest concentration of ratings.

  • Late-night activity (midnight - 2 AM) is notable, particularly on weekends.

  • Sunday evenings also represent a peak activity period.
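A heatmap of this kind comes from binning each rating timestamp by weekday and hour. The Python sketch below is illustrative only (the project itself is planned in R). It assumes the timestamp column holds Unix seconds and bins them in UTC, so the hours are not necessarily each user's local time; the function name and sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def activity_counts(timestamps):
    """Count ratings per (weekday, hour) cell, reading timestamps as UTC seconds."""
    cells = Counter()
    for ts in timestamps:
        moment = datetime.fromtimestamp(ts, tz=timezone.utc)
        cells[(DAY_NAMES[moment.weekday()], moment.hour)] += 1
    return cells


# Hypothetical timestamps built from known UTC datetimes.
sample = [
    int(datetime(2018, 9, 24, 19, 30, tzinfo=timezone.utc).timestamp()),
    int(datetime(2018, 9, 24, 19, 45, tzinfo=timezone.utc).timestamp()),
    int(datetime(2018, 9, 29, 0, 10, tzinfo=timezone.utc).timestamp()),
]
cells = activity_counts(sample)
print(cells.most_common())
```

The resulting counter maps directly onto a 7 x 24 grid; the choice of time zone shifts every cell, so it is worth stating alongside any heatmap built this way.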

7. Key Challenges and Interesting Findings

7.1 Major Challenges Identified

  1. Data Sparsity (98.3%): The rating matrix is extremely sparse. This is the fundamental challenge for collaborative filtering - we need to make predictions for user-movie pairs with no historical interaction.

  2. Cold Start Problem: 35.4% of movies have only a single rating, and many users have rated relatively few movies. The algorithm must handle new users and new movies gracefully.

  3. Rating Bias: The skew toward high ratings (mean = 3.50, only 5.9% of ratings below 2.0) means the algorithm must account for users’ tendency to rate movies they expect to enjoy.

  4. Long-Tail Distribution: Both user activity and movie popularity are heavily right-skewed with long tails. A small subset of users and movies dominates the dataset.

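  5. Genre Imbalance: Drama (4,361 movies) and Comedy (3,756 movies) are by far the most common genres, while niche genres like Film-Noir (87) and Western (167) have limited data. Because a movie can carry several genres, these counts overlap and cannot simply be added together.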

7.2 Interesting Findings

  1. Genre Quality Hierarchy: There is a clear hierarchy of average ratings by genre. Niche genres (Film-Noir, Documentary, War) consistently outperform mainstream genres. This suggests that self-selection effects are strong - users who choose niche genres have higher satisfaction.

  2. The 1990s Effect: The most-rated movies are disproportionately from the 1990s, suggesting either a demographic bias in the user base or a genuine “golden age” perception.

  3. User Heterogeneity: The gap between the most active user (2,698 ratings) and the median user (70.5 ratings) is enormous. This suggests that different recommendation strategies may be needed for different user segments.

  4. Temporal Stability: Despite spanning over two decades, the average rating has remained relatively stable around 3.5, suggesting consistent rating behavior over time.

8. Plan for Prediction Algorithm

Based on the exploratory analysis, here is our plan for building the movie rating prediction algorithm:

8.1 Approach: Hybrid Recommendation System

Given the challenges identified (sparsity, cold start, long-tail), we will implement a hybrid approach combining multiple techniques:

Component | Technique | Purpose
Collaborative Filtering | Matrix Factorization (SVD/ALS) | Capture latent user preferences and movie features from the rating matrix
Content-Based Filtering | Genre features + TF-IDF on tags | Handle cold-start for new movies with limited ratings
Regularization | L2 regularization + bias terms | Prevent overfitting on sparse data; account for user and movie rating biases
Ensemble | Weighted average of CF and CB predictions | Combine strengths of both approaches

8.2 Algorithm Steps

  1. Data Preprocessing:
    • Split data into training (80%) and test (20%) sets using stratified sampling to preserve rating distribution
    • Extract genre features as multi-hot encoded vectors
    • Center ratings by subtracting the global mean plus per-user and per-movie bias terms (bias correction)
  2. Collaborative Filtering (Matrix Factorization):
    • Decompose the user-item rating matrix into latent factor matrices
    • Learn user factors (preferences) and item factors (characteristics) simultaneously
    • Use stochastic gradient descent for optimization
    • Hyperparameters to tune: number of latent factors (k), learning rate, regularization strength
  3. Content-Based Component:
    • Use genre information to compute movie-movie similarity
    • For movies with few ratings, predict based on similar movies the user has rated
    • Incorporate tag data (TF-IDF) to enhance similarity computation
  4. Hybrid Ensemble:
    • Weight the CF and CB predictions based on available data
    • Higher weight on CF for users/movies with many ratings
    • Higher weight on CB for new users/movies (cold start)
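The collaborative filtering step can be sketched as a biased matrix factorization trained with stochastic gradient descent. The Python code below is a minimal illustration under stated assumptions, not the project's implementation (which is planned in R): the model form mu + b_u + b_i + p_u . q_i, the function names, the hyperparameter defaults, and the toy ratings are all hypothetical choices.

```python
import math
import random


def train_mf(ratings, n_factors=2, lr=0.01, reg=0.05, n_epochs=200, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i with SGD and L2 regularization."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    data = list(ratings)
    for _ in range(n_epochs):
        rng.shuffle(data)
        for u, i, r in data:
            dot = sum(a * b for a, b in zip(p[u], q[i]))
            err = r - (mu + bu[u] + bi[i] + dot)
            # Gradient steps on the regularized squared error.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return {"mu": mu, "bu": bu, "bi": bi, "p": p, "q": q}


def predict(model, u, i):
    """Predict a rating; unseen users or movies fall back to the bias terms."""
    score = model["mu"] + model["bu"].get(u, 0.0) + model["bi"].get(i, 0.0)
    if u in model["p"] and i in model["q"]:
        score += sum(a * b for a, b in zip(model["p"][u], model["q"][i]))
    return min(5.0, max(0.5, score))  # clip to the rating scale


def train_rmse(model, ratings):
    """Root mean squared error of the model on the given ratings."""
    sq = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(sq / len(ratings))


# Toy (user, movie, rating) triples, not taken from the dataset.
toy = [(1, 10, 5.0), (1, 20, 4.5), (1, 30, 2.0), (2, 10, 4.0),
       (2, 20, 3.5), (2, 30, 1.5), (3, 20, 3.0), (3, 30, 1.0)]
model = train_mf(toy)
print(f"training RMSE: {train_rmse(model, toy):.3f}")
print(f"unseen user, movie 10: {predict(model, 999, 10):.2f}")
```

The fallback in predict is one simple way to keep predictions defined for users or movies absent from training; the content-based component described above would replace that fallback in the full hybrid.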

8.3 Evaluation Strategy

Metric | Description | Target
RMSE | Root Mean Squared Error | < 0.90
MAE | Mean Absolute Error | < 0.70

We will use 5-fold cross-validation on the training set for hyperparameter tuning, then evaluate the final model on the held-out test set.
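For reference, the two metrics and the fold construction can be written out as below. This is a Python sketch for illustration only (the project itself is planned in R); the function names and the shuffling seed are hypothetical choices.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]


actual = [4.0, 3.0, 5.0]
predicted = [3.5, 3.0, 4.0]
print(f"RMSE={rmse(actual, predicted):.4f}, MAE={mae(actual, predicted):.4f}")

splits = list(k_fold_indices(10, k=5))
print(len(splits), "folds; first validation fold:", sorted(splits[0][1]))
```

RMSE penalizes large errors more heavily than MAE, which is why the RMSE target is the looser of the two thresholds.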

8.4 Why This Approach?

  • Matrix Factorization handles sparsity well by learning latent representations, and has been shown to be highly effective on the MovieLens dataset.

  • Content-Based Filtering provides a fallback for the cold-start problem where collaborative filtering struggles.

  • The hybrid approach is robust and widely used in industry (Netflix, Amazon), making it suitable for production deployment in the Shiny app.
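The content-based fallback and the blending rule can be sketched as follows. This Python example is illustrative only (the project itself is planned in R). The five-genre vocabulary, the cosine similarity measure, and the n / (n + damping) weighting are assumptions chosen for the sketch, not decisions stated in the plan.

```python
import math

# Illustrative subset of the genre vocabulary.
GENRES = ["Action", "Comedy", "Drama", "Horror", "Romance"]


def multi_hot(genre_string, vocabulary=GENRES):
    """Encode a pipe-separated genre string as a 0/1 vector."""
    present = set(genre_string.split("|"))
    return [1.0 if genre in present else 0.0 for genre in vocabulary]


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_prediction(cf_pred, cb_pred, n_ratings, damping=10.0):
    """Blend CF and CB predictions, trusting CF more as ratings accumulate."""
    w_cf = n_ratings / (n_ratings + damping)
    return w_cf * cf_pred + (1.0 - w_cf) * cb_pred


similarity = cosine(multi_hot("Comedy|Drama"), multi_hot("Drama"))
print(f"genre similarity: {similarity:.3f}")
print(f"cold-start blend: {hybrid_prediction(4.0, 3.0, n_ratings=0):.2f}")
print(f"popular-movie blend: {hybrid_prediction(4.0, 3.0, n_ratings=300):.2f}")
```

With this weighting a movie with no ratings relies entirely on the content-based prediction, and the damping value controls how quickly collaborative filtering takes over; it would be tuned alongside the other hyperparameters.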

9. Plan for Shiny Application

The Shiny app will provide an interactive interface to demonstrate the recommendation system:

9.1 App Features

Feature | Description
Movie Search | Users can search for movies by title and see predicted ratings
Personalized Recommendations | Given a user ID, the app displays top-N recommended movies
Genre Filter | Users can filter recommendations by preferred genres
Rating History | View a user’s past ratings and predicted ratings side-by-side
Model Explanation | Visualize how different factors contribute to a prediction

9.2 App Architecture

Shiny App
|-- Input Panel
|   |-- User ID selector
|   |-- Genre filter (multi-select)
|   '-- Number of recommendations slider
|-- Output Panel
|   |-- Recommended movies table (with predicted ratings)
|   |-- Genre breakdown of recommendations
|   '-- User's rating history visualization
'-- Model Backend (R code)
    |-- Trained matrix factorization model
    |-- Movie metadata (genres, titles)
    '-- Prediction function

9.3 User Experience Flow

  1. User enters a User ID (or selects a random user)
  2. The app retrieves the user’s rating history from the dataset
  3. The trained model generates personalized predictions for all unseen movies
  4. Results are displayed as a sortable table with movie poster, title, genres, predicted rating, and confidence score
  5. Users can filter by genre or minimum predicted rating

9.4 Technical Considerations

  • The model will be pre-trained and loaded into the Shiny app as an RData file for fast predictions

  • Movie metadata will be stored as a lookup table

  • The app will handle edge cases (new users with no history, invalid user IDs) gracefully

  • Predictions will be computed on-demand using matrix multiplication, which is fast even for the full catalog
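On-demand prediction reduces to a dot product between one user's factor vector and every movie's factor vector, followed by a sort. The Python sketch below is illustrative only (the app backend is planned in R); the function name, the factor values, and the bias-only fallback for unknown users are hypothetical.

```python
def top_n_recommendations(user_factors, item_factors, rated_ids, n=5,
                          global_mean=3.5, item_bias=None):
    """Rank unseen movies by predicted rating.

    user_factors: list of floats, or None for a user with no history
    item_factors: dict mapping movie_id to a list of floats
    rated_ids:    set of movie_ids the user has already rated
    """
    item_bias = item_bias or {}
    scored = []
    for movie_id, factors in item_factors.items():
        if movie_id in rated_ids:
            continue  # only recommend movies the user has not rated
        score = global_mean + item_bias.get(movie_id, 0.0)
        if user_factors is not None:
            score += sum(p * q for p, q in zip(user_factors, factors))
        scored.append((movie_id, min(5.0, max(0.5, score))))
    # Highest predicted rating first; ties broken by movie_id for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]


# Toy factors, not learned from the dataset.
item_factors = {1: [0.8, 0.1], 2: [-0.2, 0.6], 3: [0.4, -0.4]}
recs = top_n_recommendations([1.0, -0.5], item_factors, rated_ids={1}, n=2)
cold = top_n_recommendations(None, item_factors, rated_ids=set(), n=1,
                             item_bias={2: 0.4})
print(recs)
print(cold)
```

Passing None for user_factors covers the new-user edge case by ranking on the global mean plus movie bias alone, which gives the app something sensible to show before any history exists.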

10. Next Steps and Timeline

Phase | Task | Timeline
Phase 1 | Finalize data preprocessing and train-test split | Week 1
Phase 2 | Implement matrix factorization model | Week 1-2
Phase 3 | Implement content-based component | Week 2
Phase 4 | Build hybrid ensemble and tune hyperparameters | Week 2-3
Phase 5 | Evaluate final model and document results | Week 3
Phase 6 | Build Shiny app UI and backend | Week 3-4
Phase 7 | Test, debug, and deploy app | Week 4

11. Conclusion

This exploratory analysis has confirmed that the MovieLens dataset is well-suited for building a recommendation system, while also highlighting the key challenges that must be addressed. The extreme sparsity of the data (98.3%) makes this a non-trivial prediction problem, but the clear patterns in genre preferences, user behavior, and temporal activity provide a strong foundation for our hybrid approach.

Our planned hybrid recommendation system - combining matrix factorization for collaborative filtering with genre-based content filtering - is designed to handle both the sparsity challenge and the cold-start problem. The accompanying Shiny app will make these predictions accessible and interactive, providing a complete end-to-end data science product.

Key Takeaways for Stakeholders:

  1. The data is loaded and validated - 100,836 ratings across 9,724 movies are ready for modeling.
  2. Users rate movies positively on average (3.5/5.0), with Drama and Comedy being the most common genres.
  3. The main challenge is data sparsity - our algorithm must predict ratings for unseen movie-user pairs.
  4. Our hybrid approach (collaborative + content-based filtering) is designed to handle this challenge.
  5. The final deliverable will be a working Shiny app that provides personalized movie recommendations.