MovieLens Data Science Capstone

Project: MovieLens Data Science Capstone
Dataset: MovieLens Latest Small (100,836 ratings)
Date: June 2026
Prepared by: SurajKumar Shetty

Executive Summary

This report presents the initial exploratory analysis of the MovieLens dataset, a widely used benchmark for building movie recommendation systems. The goal of this project is to develop a movie rating prediction algorithm and deploy it as an interactive Shiny application.

Key Findings:

The dataset contains 100,836 ratings from 610 users on 9,724 movies, with a sparsity of 98.3% - typical for recommendation system datasets.
Ratings lean toward the high end of the scale (the distribution is left-skewed), with an average rating of 3.50/5.0 and the most common rating being 4.0 stars.
Drama and Comedy are the most prevalent genres, while Film-Noir, War, and Documentary films receive the highest average ratings.
User activity varies significantly - the most active user contributed 2,698 ratings, while the median is only 70.5 ratings per user.
35.4% of movies have only a single rating, presenting a “cold start” challenge for the prediction algorithm.

1. Dataset Overview

The MovieLens dataset is provided by GroupLens Research at the University of Minnesota. For this project, we use the “latest-small” dataset, which is appropriate for education and development purposes. The dataset consists of four files:

File	Description	Rows	Columns
`ratings.csv`	User-movie ratings	100,836	userId, movieId, rating, timestamp
`movies.csv`	Movie metadata	9,742	movieId, title, genres
`tags.csv`	User-generated tags	3,683	userId, movieId, tag, timestamp
`links.csv`	External IDs (IMDb, TMDb)	9,742	movieId, imdbId, tmdbId

Key Dataset Statistics

Metric	Value
Number of Users	610
Number of Movies (with at least one rating)	9,724
Number of Ratings	100,836
Number of Tags	3,683
Rating Scale	0.5 - 5.0 (half-star increments)
Average Rating	3.50
Median Rating	3.5
Matrix Sparsity	98.30%

Note on Sparsity: The sparsity of 98.3% means that users have rated only about 1.7% of all possible user-movie combinations. This is a common characteristic of recommendation datasets and is the core challenge our prediction algorithm must address.
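The sparsity figure can be reproduced directly from the counts in the statistics table. The snippet below is a minimal sketch in Python, used here purely for illustration (the project itself targets R and Shiny); the variable names are our own.

```python
# Counts taken from the Key Dataset Statistics table.
n_users = 610
n_movies = 9_724
n_ratings = 100_836

possible_pairs = n_users * n_movies   # every user-movie combination
density = n_ratings / possible_pairs  # share of pairs that have a rating
sparsity = 1.0 - density              # share of pairs with no rating

print(f"possible pairs: {possible_pairs:,}")  # 5,931,640
print(f"density:  {density:.2%}")             # 1.70%
print(f"sparsity: {sparsity:.2%}")            # 98.30%
```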

2. Rating Distribution Analysis

2.1 Overall Rating Distribution

The distribution of ratings across the dataset reveals a clear positive bias - users tend to rate movies they liked more often than movies they disliked.

Rating	Count	Percentage
0.5	1,370	1.36%
1.0	2,811	2.79%
1.5	1,791	1.78%
2.0	7,551	7.49%
2.5	5,550	5.50%
3.0	20,047	19.88%
3.5	13,136	13.03%
4.0	26,818	26.60%
4.5	8,551	8.48%
5.0	13,211	13.10%

Key Observations:
- 48.2% of all ratings are 4.0 or higher, indicating a strong positive bias.
- The most common rating is 4.0 stars (26.6% of all ratings), followed by 3.0 stars (19.9%).
- Low ratings (1.5 and below) account for only 5.9% of all ratings.
- The distribution is left-skewed, which is typical for voluntary rating systems where users rate movies they chose to watch.
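The summary figures for the rating distribution follow from the count table alone. As a quick check, the sketch below recomputes them in Python (chosen only for illustration; the helper name `share` is hypothetical).

```python
# Rating counts copied from the distribution table.
rating_counts = {
    0.5: 1370, 1.0: 2811, 1.5: 1791, 2.0: 7551, 2.5: 5550,
    3.0: 20047, 3.5: 13136, 4.0: 26818, 4.5: 8551, 5.0: 13211,
}
total = sum(rating_counts.values())

def share(predicate):
    """Fraction of all ratings whose value satisfies the predicate."""
    return sum(c for r, c in rating_counts.items() if predicate(r)) / total

high_share = share(lambda r: r >= 4.0)   # ratings of 4.0 or higher
low_share = share(lambda r: r <= 1.5)    # ratings of 1.5 or lower
mean_rating = sum(r * c for r, c in rating_counts.items()) / total
mode_rating = max(rating_counts, key=rating_counts.get)

print(f"total ratings: {total:,}")            # 100,836
print(f"share >= 4.0:  {high_share:.1%}")     # 48.2%
print(f"share <= 1.5:  {low_share:.1%}")      # 5.9%
print(f"mean rating:   {mean_rating:.2f}")    # 3.50
print(f"most common:   {mode_rating}")        # 4.0
```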

Figure: Rating Distribution

3. User Activity Analysis

3.1 User Rating Patterns

Understanding how users engage with the platform is critical for building a recommendation system.

User Activity Metric	Value
Mean ratings per user	165.3
Median ratings per user	70.5
Standard deviation	269.5
Most active user	2,698 ratings
Least active user	20 ratings
Users with 50+ ratings	385 (63.1%)

Key Observations:
- The distribution is heavily right-skewed: a small number of highly active users contribute disproportionately to the dataset.
- The median (70.5) is less than half the mean (165.3), confirming the presence of “super-users.”
- All users have rated at least 20 movies, which is a minimum threshold applied by MovieLens to ensure data quality.
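Per-user activity metrics of this kind reduce to counting rows per `userId` and summarizing the counts. The following is a small Python sketch on made-up rows (the real input would be `ratings.csv`; Python is used only for illustration and the toy data are hypothetical).

```python
from collections import Counter
from statistics import mean, median

# Hypothetical (userId, movieId, rating) rows standing in for ratings.csv.
ratings = [
    (1, 10, 4.0), (1, 11, 3.5), (1, 12, 5.0), (1, 13, 2.0),
    (2, 10, 3.0),
    (3, 10, 4.5), (3, 12, 4.0),
]

# Number of ratings contributed by each user.
ratings_per_user = Counter(user_id for user_id, _, _ in ratings)
counts = sorted(ratings_per_user.values())

mean_count = mean(counts)
median_count = median(counts)
# Share of users with at least `threshold` ratings (2 is arbitrary here).
threshold = 2
active_share = sum(c >= threshold for c in counts) / len(counts)

print(counts, mean_count, median_count, active_share)
```

A mean well above the median, as in the table above, is the usual signature of a right-skewed activity distribution.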

Figure: User Activity Distribution

4. Movie Analysis

4.1 Movie Popularity Distribution

Just as user activity varies, movie popularity also follows a highly skewed distribution.

Movie Popularity Metric	Value
Mean ratings per movie	10.4
Median ratings per movie	3.0
Most rated movie	329 ratings
Movies with only 1 rating	3,446 (35.4%)
Movies with 50+ ratings	450 (4.6%)

Key Observations:
- The long-tail effect is prominent: a small number of popular movies receive most ratings, while the majority have very few.
- 35.4% of movies have only a single rating, which poses a significant challenge for collaborative filtering approaches.
- Only 4.6% of movies have 50 or more ratings, meaning our algorithm must handle sparse data effectively.

Figure: Movie Popularity Distribution

4.2 Top 10 Most Rated Movies

Rank	Movie Title	Number of Ratings	Average Rating
1	Forrest Gump (1994)	329	4.16
2	The Shawshank Redemption (1994)	317	4.43
3	Pulp Fiction (1994)	307	4.20
4	The Silence of the Lambs (1991)	279	4.16
5	The Matrix (1999)	278	4.19
6	Star Wars: Episode IV - A New Hope (1977)	251	4.23
7	Jurassic Park (1993)	238	3.75
8	Braveheart (1995)	237	4.03
9	Terminator 2: Judgment Day (1991)	224	3.97
10	Schindler’s List (1993)	220	4.22

Key Observations:
- Classic films from the 1990s dominate the most-rated list (9 of the 10), likely reflecting the demographics of MovieLens users.
- All top 10 movies have average ratings of 3.75 or higher, suggesting that popular movies are also well-rated.
- The most-rated movie (Forrest Gump) has only 329 ratings from 610 users (about 54%), confirming the dataset’s sparsity.

Figure: Top 15 Most Rated Movies

4.3 Movies by Release Year

The dataset contains movies spanning nearly a century, with a strong concentration in recent decades.

Figure: Movies by Release Year

Key Observations:
- The dataset has a strong bias toward movies from the 1980s through the 2010s.
- The peak is around the early 2000s, likely reflecting the MovieLens user base’s viewing preferences.
- This temporal bias is important to consider: the recommendation system may perform better for newer movies.

5. Genre Analysis

5.1 Genre Distribution

Movies in the dataset are tagged with 19 distinct genres (plus 34 movies with no genre listed). A single movie can belong to multiple genres.

Rank	Genre	Movie Count	Total Ratings	Avg Rating
1	Drama	4,361	41,928	3.66
2	Comedy	3,756	39,053	3.38
3	Thriller	1,894	26,452	3.49
4	Action	1,828	30,635	3.45
5	Romance	1,596	18,124	3.51
6	Adventure	1,263	24,161	3.51
7	Crime	1,199	16,681	3.66
8	Sci-Fi	980	17,243	3.46
9	Horror	978	7,291	3.26
10	Fantasy	779	11,834	3.49
11	Children	664	9,208	3.41
12	Animation	611	6,988	3.63
13	Mystery	573	7,674	3.63
14	Documentary	440	1,219	3.80
15	War	382	4,859	3.81
16	Musical	334	4,138	3.56
17	Western	167	1,930	3.58
18	IMAX	158	4,145	3.62
19	Film-Noir	87	870	3.92
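Because a movie can carry several genres, per-genre movie counts sum to more than the number of movies. The sketch below shows one way to produce such counts in Python (for illustration only), assuming the pipe-delimited `genres` field and the `(no genres listed)` placeholder used in `movies.csv`; the fourth row is a hypothetical example.

```python
from collections import Counter

NO_GENRE = "(no genres listed)"

# (movieId, title, genres) rows in the style of movies.csv.
movies = [
    (1, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
    (2, "Jumanji (1995)", "Adventure|Children|Fantasy"),
    (3, "Grumpier Old Men (1995)", "Comedy|Romance"),
    (4, "Hypothetical Untagged Film (2000)", NO_GENRE),
]

def split_genres(genres_field):
    """Return the list of genres for one movie, or [] if none are listed."""
    if genres_field == NO_GENRE:
        return []
    return genres_field.split("|")

# Each movie contributes one count to every genre it is tagged with.
genre_counts = Counter(
    genre for _, _, field in movies for genre in split_genres(field)
)
untagged = sum(1 for _, _, field in movies if not split_genres(field))

print(genre_counts.most_common())
print("movies without a genre:", untagged)
```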

Figure: Genre Distribution

5.2 Average Rating by Genre

When examining average ratings by genre, clear patterns emerge:

Rank	Genre	Avg Rating	vs. Overall (3.50)
1	Film-Noir	3.92	+0.42
2	War	3.81	+0.31
3	Documentary	3.80	+0.30
4	Drama	3.66	+0.16
5	Crime	3.66	+0.16
6	Animation	3.63	+0.13
7	Mystery	3.63	+0.13
8	IMAX	3.62	+0.12
9	Western	3.58	+0.08
10	Musical	3.56	+0.06
11	Adventure	3.51	+0.01
12	Romance	3.51	+0.01
13	Thriller	3.49	-0.01
14	Fantasy	3.49	-0.01
15	Sci-Fi	3.46	-0.04
16	Action	3.45	-0.05
17	Children	3.41	-0.09
18	Comedy	3.38	-0.12
19	Horror	3.26	-0.24

Figure: Average Rating by Genre

Key Observations:
- Film-Noir receives the highest average rating (3.92), but has the fewest movies (87), suggesting a niche but appreciative audience.
- Horror receives the lowest average rating (3.26), despite having nearly 1,000 movies; this may reflect the genre’s polarizing nature.
- Documentary and War films also rate highly, suggesting that viewers who choose these genres have specific positive expectations.
- Mainstream genres like Comedy and Action have average ratings below the overall mean, likely due to their larger and more diverse viewership.

5.3 Rating Distribution by Genre

The boxplot below shows how ratings are distributed within each of the top genres:

Figure: Rating Distribution by Genre

6. Temporal Patterns

6.1 Ratings Over Time

The dataset spans ratings from 1996 to 2018, with varying levels of activity across years.

Figure: Ratings Over Time

Key Observations:
- Rating activity peaked around the year 2000 and again in the mid-2010s.
- Average ratings fluctuate between approximately 3.3 and 3.9 across years, with no clear long-term trend.
- Early years (1996-1998) show high variability due to smaller user bases.

6.2 Rating Activity Patterns

Analyzing when users are most active reveals interesting behavioral patterns:

Figure: Rating Activity Heatmap

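Key Observations:
- Evening hours (6 PM to 10 PM) consistently show the highest rating activity across all days.
- Weekday evenings (Monday-Friday, 6-10 PM) show the strongest concentration of ratings.
- Late-night activity (midnight to 2 AM) is notable, particularly on weekends.
- Sunday evenings also represent a peak activity period.
- Caveat: `ratings.csv` stores timestamps as seconds since the Unix epoch in UTC and the dataset does not include user time zones, so hour-of-day patterns are reliable only to the extent that the heatmap’s clock matches users’ local time.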
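A weekday-by-hour heatmap of this kind is built by converting each Unix timestamp into a (weekday, hour) bucket and counting. The Python sketch below (illustration only) does this in UTC; the sample timestamps are arbitrary and the function name `weekday_hour` is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def weekday_hour(unix_seconds):
    """Map a Unix timestamp to (weekday name, hour of day) in UTC."""
    dt = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return DAY_NAMES[dt.weekday()], dt.hour

# Arbitrary example timestamps in the same format as ratings.csv.
timestamps = [964982703, 964981247, 1000000000]

# Each (weekday, hour) cell of the heatmap is a simple count.
heatmap = Counter(weekday_hour(ts) for ts in timestamps)

for (day, hour), n in sorted(heatmap.items()):
    print(f"{day} {hour:02d}:00 UTC -> {n} rating(s)")
```

Because the conversion is in UTC, the resulting hours describe server time rather than each user's local evening.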

7. Key Challenges and Interesting Findings

7.1 Major Challenges Identified

Data Sparsity (98.3%): The rating matrix is extremely sparse. This is the fundamental challenge for collaborative filtering - we need to make predictions for user-movie pairs with no historical interaction.
Cold Start Problem: 35.4% of movies have only a single rating, and many users have rated relatively few movies. The algorithm must handle new users and new movies gracefully.
Rating Bias: Ratings are concentrated at the high end of the scale (mean = 3.50, only 5.9% of ratings are 1.5 or lower), so the algorithm must account for users’ tendency to rate movies they expect to enjoy.
Long-Tail Distribution: Both user activity and movie popularity are heavily long-tailed. A small subset of users and movies accounts for a disproportionate share of the ratings.
Genre Imbalance: Drama and Comedy are tagged on roughly 45% and 39% of movies respectively (the shares overlap because a movie can carry both tags), while niche genres like Film-Noir and Western have limited data.

7.2 Interesting Findings

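Genre Quality Hierarchy: Average ratings differ clearly by genre. The three highest-rated genres (Film-Noir, War, Documentary) are all niche, while high-volume genres such as Comedy and Action fall below the overall mean; Drama, the largest genre, is an exception at 3.66. This is consistent with a self-selection effect, in which users who seek out niche genres tend to rate them highly, although the analysis here does not test that explanation directly.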
The 1990s Effect: The most-rated movies are disproportionately from the 1990s, suggesting either a demographic bias in the user base or a genuine “golden age” perception.
User Heterogeneity: The gap between the most active user (2,698 ratings) and the median user (70.5 ratings) is enormous. This suggests that different recommendation strategies may be needed for different user segments.
Temporal Stability: Despite spanning over two decades, the yearly average rating shows no clear long-term trend and stays within roughly 3.3 to 3.9, suggesting broadly consistent rating behavior over time.

8. Plan for Prediction Algorithm

Based on the exploratory analysis, here is our plan for building the movie rating prediction algorithm:

8.1 Approach: Hybrid Recommendation System

Given the challenges identified (sparsity, cold start, long-tail), we will implement a hybrid approach combining multiple techniques:

Component	Technique	Purpose
Collaborative Filtering	Matrix Factorization (SVD/ALS)	Capture latent user preferences and movie features from the rating matrix
Content-Based Filtering	Genre features + TF-IDF on tags	Handle cold-start for new movies with limited ratings
Regularization	L2 regularization + bias terms	Prevent overfitting on sparse data; account for user and movie rating biases
Ensemble	Weighted average of CF and CB predictions	Combine strengths of both approaches
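To make the "L2 regularization + bias terms" row more concrete, here is a minimal Python sketch (illustration only; the project's implementation is planned in R) of a regularized baseline predictor of the form global mean + user bias + movie bias. The function names, the damping constant `reg`, and the toy ratings are all assumptions, not the final design.

```python
from collections import defaultdict

def fit_baseline(ratings, reg=5.0):
    """Estimate global mean, movie biases and user biases with L2-style damping.

    ratings: list of (user, movie, rating). `reg` shrinks biases that are
    supported by only a few ratings toward zero.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    item_sum, item_n = defaultdict(float), defaultdict(int)
    for _, movie, r in ratings:
        item_sum[movie] += r - mu
        item_n[movie] += 1
    b_item = {m: item_sum[m] / (reg + item_n[m]) for m in item_sum}

    user_sum, user_n = defaultdict(float), defaultdict(int)
    for user, movie, r in ratings:
        user_sum[user] += r - mu - b_item[movie]
        user_n[user] += 1
    b_user = {u: user_sum[u] / (reg + user_n[u]) for u in user_sum}
    return mu, b_user, b_item

def predict_baseline(mu, b_user, b_item, user, movie):
    """Unknown users or movies fall back to a bias of zero."""
    return mu + b_user.get(user, 0.0) + b_item.get(movie, 0.0)

# Hypothetical toy ratings.
toy = [("u1", "m1", 5.0), ("u2", "m1", 4.0), ("u1", "m2", 3.0)]
mu, b_user, b_item = fit_baseline(toy)
print(predict_baseline(mu, b_user, b_item, "u1", "m1"))
print(predict_baseline(mu, b_user, b_item, "new_user", "new_movie"))  # equals mu
```

With damping, a movie rated only once moves the prediction much less than its single raw rating would suggest, which matters given how many movies in this dataset have one rating.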

8.2 Algorithm Steps

Data Preprocessing:
- Split data into training (80%) and test (20%) sets using stratified sampling to preserve rating distribution
- Extract genre features as multi-hot encoded vectors
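- Normalize ratings by subtracting the global mean plus per-user and per-movie bias terms estimated on the training set (subtracting both raw means would remove the global mean twice)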
Collaborative Filtering (Matrix Factorization):
- Decompose the user-item rating matrix into latent factor matrices
- Learn user factors (preferences) and item factors (characteristics) simultaneously
- Use stochastic gradient descent for optimization
- Hyperparameters to tune: number of latent factors (k), learning rate, regularization strength
Content-Based Component:
- Use genre information to compute movie-movie similarity
- For movies with few ratings, predict based on similar movies the user has rated
- Incorporate tag data (TF-IDF) to enhance similarity computation
Hybrid Ensemble:
- Weight the CF and CB predictions based on available data
- Higher weight on CF for users/movies with many ratings
- Higher weight on CB for new users/movies (cold start)
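The collaborative filtering step above can be sketched as a small matrix factorization trained with stochastic gradient descent. This is a toy Python illustration under stated assumptions (the project itself is planned in R): the function names, hyperparameter defaults, and the 4x4 toy rating set are hypothetical, and no train/test split is performed here.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_mf(ratings, n_factors=2, lr=0.05, reg=0.05, n_epochs=100, seed=0):
    """Learn user factors P and movie factors Q by SGD on squared error + L2."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    movies = sorted({m for _, m, _ in ratings})
    P = {u: [rng.gauss(0.0, 0.1) for _ in range(n_factors)] for u in users}
    Q = {m: [rng.gauss(0.0, 0.1) for _ in range(n_factors)] for m in movies}
    order = list(ratings)
    for _ in range(n_epochs):
        rng.shuffle(order)  # visit ratings in a new random order each epoch
        for u, m, r in order:
            err = r - (mu + dot(P[u], Q[m]))
            for f in range(n_factors):
                p_uf, q_mf = P[u][f], Q[m][f]
                P[u][f] += lr * (err * q_mf - reg * p_uf)
                Q[m][f] += lr * (err * p_uf - reg * q_mf)
    return mu, P, Q

def predict(mu, P, Q, user, movie):
    """Fall back to the global mean for unseen users or movies."""
    if user not in P or movie not in Q:
        return mu
    return mu + dot(P[user], Q[movie])

# Toy data: u1/u2 like m1/m2 and dislike m3/m4; u3/u4 have the opposite taste.
toy_ratings = [
    (u, m, 5.0 if (u in {"u1", "u2"}) == (m in {"m1", "m2"}) else 1.0)
    for u in ("u1", "u2", "u3", "u4")
    for m in ("m1", "m2", "m3", "m4")
]
mu, P, Q = train_mf(toy_ratings)

def fit_rmse(predict_fn):
    return math.sqrt(
        sum((r - predict_fn(u, m)) ** 2 for u, m, r in toy_ratings) / len(toy_ratings)
    )

rmse_mean = fit_rmse(lambda u, m: mu)                      # always predict the mean
rmse_model = fit_rmse(lambda u, m: predict(mu, P, Q, u, m))  # factorization fit
print(f"mean-only RMSE: {rmse_mean:.3f}, factorization RMSE: {rmse_model:.3f}")
```

The comparison is on the training data only, so it shows that the factors capture the taste structure, not how well the model generalizes.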

8.3 Evaluation Strategy

Metric	Description	Target
RMSE	Root Mean Squared Error	< 0.90
MAE	Mean Absolute Error	< 0.70

We will use 5-fold cross-validation on the training set for hyperparameter tuning, then evaluate the final model on the held-out test set.
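For reference, the two metrics and a simple 5-fold index split can be written as follows. This is a Python sketch for illustration only; the function names and the shuffling scheme are our assumptions rather than the project's final evaluation code.

```python
import math
import random

def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def mae(actual, predicted):
    """Mean absolute error: average size of the miss in rating points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        validation = folds[f]
        train = [idx for g in range(k) if g != f for idx in folds[g]]
        yield train, validation

actual = [4.0, 3.0, 5.0]
predicted = [3.5, 3.0, 4.0]
print(f"RMSE = {rmse(actual, predicted):.4f}")  # 0.6455
print(f"MAE  = {mae(actual, predicted):.4f}")   # 0.5000

for train_idx, val_idx in k_fold_indices(10):
    print(len(train_idx), len(val_idx))
```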

8.4 Why This Approach?

Matrix Factorization handles sparsity well by learning latent representations, and has been shown to be highly effective on the MovieLens dataset.
Content-Based Filtering provides a fallback for the cold-start problem where collaborative filtering struggles.
The hybrid approach is robust and widely used in industry (Netflix, Amazon), making it suitable for production deployment in the Shiny app.
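One possible shape for the content-based fallback and the blending step is sketched below in Python (illustration only). Genres are treated as binary sets, similarity is cosine similarity on those sets, and the blend weight `n / (n + shrink)` is a hypothetical choice that we would still need to tune; none of the names come from an existing library.

```python
import math

def genre_cosine(genres_a, genres_b):
    """Cosine similarity between two multi-hot genre vectors given as sets."""
    if not genres_a or not genres_b:
        return 0.0
    return len(genres_a & genres_b) / math.sqrt(len(genres_a) * len(genres_b))

def content_based_prediction(target_genres, user_history, fallback):
    """Similarity-weighted average of the user's past ratings.

    user_history: list of (genre_set, rating) for movies the user has rated.
    Returns `fallback` when no rated movie shares a genre with the target.
    """
    weighted_sum = weight_total = 0.0
    for genres, rating in user_history:
        sim = genre_cosine(target_genres, genres)
        weighted_sum += sim * rating
        weight_total += sim
    return weighted_sum / weight_total if weight_total > 0 else fallback

def hybrid_prediction(cf_pred, cb_pred, n_movie_ratings, shrink=10.0):
    """Lean on collaborative filtering as a movie accumulates ratings."""
    w_cf = n_movie_ratings / (n_movie_ratings + shrink)
    return w_cf * cf_pred + (1.0 - w_cf) * cb_pred

# Hypothetical user history and target movie.
history = [({"Comedy", "Romance"}, 4.0), ({"Horror"}, 2.0)]
cb = content_based_prediction({"Comedy"}, history, fallback=3.5)
print(cb)                                               # driven by the Comedy|Romance rating
print(hybrid_prediction(4.0, 3.0, n_movie_ratings=0))   # 3.0, pure content-based
print(hybrid_prediction(4.0, 3.0, n_movie_ratings=10))  # 3.5, equal blend
```

For a movie with a single rating, this weighting leaves the prediction dominated by genre similarity, which is the cold-start behavior the plan calls for.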

9. Plan for Shiny Application

The Shiny app will provide an interactive interface to demonstrate the recommendation system:

9.1 App Features

Feature	Description
Movie Search	Users can search for movies by title and see predicted ratings
Personalized Recommendations	Given a user ID, the app displays top-N recommended movies
Genre Filter	Users can filter recommendations by preferred genres
Rating History	View a user’s past ratings and predicted ratings side-by-side
Model Explanation	Visualize how different factors contribute to a prediction

9.2 App Architecture

Shiny App
|-- Input Panel
|   |-- User ID selector
|   |-- Genre filter (multi-select)
|   '-- Number of recommendations slider
|-- Output Panel
|   |-- Recommended movies table (with predicted ratings)
|   |-- Genre breakdown of recommendations
|   '-- User's rating history visualization
'-- Model Backend (R code)
    |-- Trained matrix factorization model
    |-- Movie metadata (genres, titles)
    '-- Prediction function

9.3 User Experience Flow

User enters a User ID (or selects a random user)
The app retrieves the user’s rating history from the dataset
The trained model generates personalized predictions for all unseen movies
Results are displayed as a sortable table with movie poster, title, genres, predicted rating, and confidence score
Users can filter by genre or minimum predicted rating

9.4 Technical Considerations

The model will be pre-trained and loaded into the Shiny app as an RData file for fast predictions
Movie metadata will be stored as a lookup table
The app will handle edge cases (new users with no history, invalid user IDs) gracefully
Predictions will be computed on-demand using matrix multiplication, which is fast even for the full catalog
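The on-demand prediction step amounts to one dot product per candidate movie followed by a top-N selection. The Python sketch below is an illustration only (the app backend is planned in R); the function name, the factor values, and the choice to return an empty list for unknown users are assumptions.

```python
import heapq

def recommend_top_n(user_id, user_factors, item_factors, global_mean, seen, n=3):
    """Return up to n (movie_id, predicted_rating) pairs for unseen movies."""
    p_u = user_factors.get(user_id)
    if p_u is None:
        return []  # unknown or invalid user ID: nothing personalized to show
    already_rated = seen.get(user_id, set())
    scored = []
    for movie_id, q_i in item_factors.items():
        if movie_id in already_rated:
            continue
        raw = global_mean + sum(a * b for a, b in zip(p_u, q_i))
        clipped = min(5.0, max(0.5, raw))  # keep within the 0.5-5.0 scale
        scored.append((clipped, movie_id))
    return [(movie_id, score) for score, movie_id in heapq.nlargest(n, scored)]

# Hypothetical pre-trained factors for one user and four movies.
user_factors = {1: [1.0, 0.0]}
item_factors = {
    "A": [0.5, 0.0],
    "B": [-1.0, 0.2],
    "C": [2.0, 0.0],
    "D": [0.1, 0.9],
}
seen = {1: {"A"}}

recs = recommend_top_n(1, user_factors, item_factors, 3.5, seen, n=2)
print(recs)
print(recommend_top_n(999, user_factors, item_factors, 3.5, seen))  # []
```

In the app, a cold-start fallback such as globally popular movies could replace the empty list for unknown users.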

10. Next Steps and Timeline

Phase	Task	Timeline
Phase 1	Finalize data preprocessing and train-test split	Week 1
Phase 2	Implement matrix factorization model	Week 1-2
Phase 3	Implement content-based component	Week 2
Phase 4	Build hybrid ensemble and tune hyperparameters	Week 2-3
Phase 5	Evaluate final model and document results	Week 3
Phase 6	Build Shiny app UI and backend	Week 3-4
Phase 7	Test, debug, and deploy app	Week 4

11. Conclusion

This exploratory analysis has confirmed that the MovieLens dataset is well-suited for building a recommendation system, while also highlighting the key challenges that must be addressed. The extreme sparsity of the data (98.3%) makes this a non-trivial prediction problem, but the clear patterns in genre preferences, user behavior, and temporal activity provide a strong foundation for our hybrid approach.

Our planned hybrid recommendation system - combining matrix factorization for collaborative filtering with genre-based content filtering - is designed to handle both the sparsity challenge and the cold-start problem. The accompanying Shiny app will make these predictions accessible and interactive, providing a complete end-to-end data science product.

Key Takeaways for Stakeholders:

The data is loaded and validated - 100,836 ratings across 9,724 movies are ready for modeling.
Users rate movies positively on average (3.5/5.0), with Drama and Comedy being the most common genres.
The main challenge is data sparsity - our algorithm must predict ratings for unseen movie-user pairs.
Our hybrid approach (collaborative + content-based filtering) is designed to handle this challenge.
The final deliverable will be a working Shiny app that provides personalized movie recommendations.