Project 2: Decision Trees vs Random Forest: Predicting IMDB Ratings

Author

Team: I Love Lucy

Published

November 18, 2024

github   |   web presentation

Project Overview

In this project, we set out to use Random Forest on the IMDB non-commercial dataset to predict movie ratings. While the dataset lacks financial performance metrics like box office revenue or streaming views, it provides key features such as ratings, genres, and cast information, which became the foundation of our analysis.

This project provided an ideal test case for exploring decision tree limitations, as movie ratings involve both categorical (genres) and numerical (runtime) features, similar to many real world applications.

One of the biggest challenges we encountered was the sheer number of unique actors. With millions of actors in the dataset, one-hot encoding was not practical. To address this, we developed a system to measure actor experience through an average-based metric and a Likert-scale score for each movie’s cast, giving us a way to quantify experience without overwhelming the model.

Despite the challenges, our efforts paid off. From the initial runs, we achieved promising predictive results, showing that thoughtful feature engineering can unlock valuable insights even in large, complex datasets.

Data Preparation

Our workflow to prepare the data consisted of three stages:

  1. Download the Raw Datasets
    We downloaded multiple IMDB datasets, such as ratings, basics, and principals, and saved them locally for processing.

  2. Merge Datasets by Movie Title
    Using each movie’s unique identifier (tconst), we merged datasets to create a single, consolidated DataFrame, which we persisted for efficiency.

  3. Add Calculated Columns
    After merging, we added several calculated columns (detailed below) to enrich the data with features like actor experience and genre dummy variables for better predictive power.

Calculated Columns Overview

Below is a summary of the calculated columns added and their purpose:

  • num_actors:
    Total number of actors in each movie. Helps capture the cast size.

  • actor_names:
    A string of all actor names for each movie, separated by commas. Useful for analyzing trends or patterns based on cast members.

  • experienced_actor_count:
    Counts the number of experienced actors (those with more than 10 prior roles) in a movie. Measures the potential impact of experience on quality or reception.

  • experienced_actors_likert:
    A Likert-scale score (1–5) based on the average experience of the cast. Quantifies cast experience for easier analysis of its effect on ratings.

  • rating_bin:
    Binned version of the average rating (e.g., 1–10). Simplifies predictions by grouping continuous ratings into categories, which aligns with classification models like Random Forest.

  • Genre Dummy Variables:
    Each genre is expanded into a binary (0/1) column. Provides genre-specific features to analyze how genres influence ratings.

Data Overview

data_dir = "622data_nogit/imdb"
from lussi.imdb import *
from lussi.glimpse import *
df = load_imdb(data_dir)
glimpse(df)
Rows: 1047620
Columns: 37

Column preview:
--------------------------------------------------------------------------------
tconst               <category> tt0000001, tt0000002, tt0000003, tt0000004, tt0000005
primaryTitle         <string> Carmencita, Le clown et ses chiens, Poor Pierrot, Un bon bock, Blacksmith Scene
runtimeMinutes       <Int64> 1, 5, 5, 12, 1
numVotes             <int32> 2097, 283, 2106, 183, 2842
rating_bin           <int64> 5, 5, 6, 5, 6
num_actors           <float64> 4.0, 2.0, 5.0, 2.0, 3.0
actor_names          <string> Carmencita, William K.L. Dickson, William K.L. Dic..., Émile Reynaud, Gaston Paulin, Émile Reynaud, Julien Pappé, Émile Reynaud, Gaston..., Émile Reynaud, Gaston Paulin, Charles Kayser, John Ott, Thomas A. Edison
Action               <int64> 0, 0, 0, 0, 0
Adult                <int64> 0, 0, 0, 0, 0
Adventure            <int64> 0, 0, 0, 0, 0
Animation            <int64> 0, 1, 1, 1, 0
Biography            <int64> 0, 0, 0, 0, 0
Comedy               <int64> 0, 0, 1, 0, 1
Crime                <int64> 0, 0, 0, 0, 0
Documentary          <int64> 1, 0, 0, 0, 0
Drama                <int64> 0, 0, 0, 0, 0
Family               <int64> 0, 0, 0, 0, 0
Fantasy              <int64> 0, 0, 0, 0, 0
Film-Noir            <int64> 0, 0, 0, 0, 0
Game-Show            <int64> 0, 0, 0, 0, 0
History              <int64> 0, 0, 0, 0, 0
Horror               <int64> 0, 0, 0, 0, 0
Music                <int64> 0, 0, 0, 0, 0
Musical              <int64> 0, 0, 0, 0, 0
Mystery              <int64> 0, 0, 0, 0, 0
News                 <int64> 0, 0, 0, 0, 0
Reality-TV           <int64> 0, 0, 0, 0, 0
Romance              <int64> 0, 0, 1, 0, 0
Sci-Fi               <int64> 0, 0, 0, 0, 0
Short                <int64> 1, 1, 0, 1, 1
Sport                <int64> 0, 0, 0, 0, 0
Talk-Show            <int64> 0, 0, 0, 0, 0
Thriller             <int64> 0, 0, 0, 0, 0
War                  <int64> 0, 0, 0, 0, 0
Western              <int64> 0, 0, 0, 0, 0
experienced_actor_count <int64> 3, 1, 2, 1, 1
experienced_actors_likert <int64> 5, 4, 3, 4, 4

Model Comparisons

Decision Tree Models

First, we’ll establish baseline performance using decision trees of varying complexity.

Basic Decision Tree Model

# Train basic decision tree without vote counts
df_cleaned = df.drop(columns=['tconst', 'primaryTitle', 'actor_names', 'numVotes'])
basic_tree, basic_metrics, basic_preds = train_and_evaluate_basic_tree(
    df_cleaned,
    cache_path='basic_tree_no_votes.joblib',
    rerun=False,
    show_output=False
)
generate_summary_report(*basic_preds, "Basic Decision Tree")

======================================================================
Results for: Basic Decision Tree
======================================================================

Quick Summary:
- Gets the exact rating 38% of the time
- Within one rating point 81% of the time
- Makes big mistakes only 4.9% of the time

Technical Details:
- R² Score: 0.204
- RMSE: 1.222

What This Means:
If you're trying to decide if a movie will be good (7-8), great (8-9),
or excellent (9-10), this model will get it right or be off by just
one rating point 81% of the time.
================================================================================

This basic model used only simple features like runtime and genres, providing a baseline for comparison.

Complex Decision Tree Model

# Train complex decision tree without vote counts
complex_tree, complex_metrics, complex_preds = train_and_evaluate_complex_tree(
    df_cleaned,
    cache_path='complex_tree_no_votes.joblib',
    rerun=False,
    show_output=False
)
generate_summary_report(*complex_preds, "Complex Decision Tree")

======================================================================
Results for: Complex Decision Tree
======================================================================

Quick Summary:
- Gets the exact rating 38% of the time
- Within one rating point 82% of the time
- Makes big mistakes only 4.5% of the time

Technical Details:
- R² Score: 0.226
- RMSE: 1.205

What This Means:
If you're trying to decide if a movie will be good (7-8), great (8-9),
or excellent (9-10), this model will get it right or be off by just
one rating point 82% of the time.
================================================================================

The complex model added actor experience features, improving the R² score from 0.204 to 0.226 and reducing major mistakes from 4.9% to 4.5%. This shows that while adding complexity helps somewhat, a single tree still has limitations.

Random Forest Models

We’ll examine Random Forest performance under two scenarios: including vote counts (which could introduce bias) and excluding them (more realistic for new movies).

With Vote Counts

While using vote counts doesn’t reflect real-world prediction scenarios (since we wouldn’t have this data for new movies), we include it to test the model’s potential performance with this information.

# Train Random Forest including vote counts
df_with_votes = df.drop(columns=['tconst', 'primaryTitle', 'actor_names'])
rf_votes, importance_votes, metrics_votes, predictions_votes = train_and_evaluate_rf(
    df_with_votes,
    cache_path='rf_with_votes.joblib',
    rerun=False,
    show_output=False
)
generate_summary_report(*predictions_votes, "Random Forest (with votes)")
# display(plot_importance(importance_votes))

======================================================================
Results for: Random Forest (with votes)
======================================================================

Quick Summary:
- Gets the exact rating 42% of the time
- Within one rating point 84% of the time
- Makes big mistakes only 4.0% of the time

Technical Details:
- R² Score: 0.320
- RMSE: 1.130

What This Means:
If you're trying to decide if a movie will be good (7-8), great (8-9),
or excellent (9-10), this model will get it right or be off by just
one rating point 84% of the time.
================================================================================

Without Vote Counts

This scenario better reflects real-world use where we predict ratings for new movies without vote information.

# Train Random Forest excluding vote counts
rf_no_votes, importance_no_votes, metrics_no_votes, predictions_no_votes = train_and_evaluate_rf(
    df_cleaned,
    cache_path='rf_no_votes.joblib',
    rerun=False,
    show_output=False
)
generate_summary_report(*predictions_no_votes, "Random Forest (without votes)")
# display(plot_importance(importance_no_votes))

======================================================================
Results for: Random Forest (without votes)
======================================================================

Quick Summary:
- Gets the exact rating 39% of the time
- Within one rating point 82% of the time
- Makes big mistakes only 5.0% of the time

Technical Details:
- R² Score: 0.220
- RMSE: 1.210

What This Means:
If you're trying to decide if a movie will be good (7-8), great (8-9),
or excellent (9-10), this model will get it right or be off by just
one rating point 82% of the time.
================================================================================

The Random Forest maintained the accuracy improvements while being more robust, demonstrating how ensemble methods can overcome single tree limitations.

Summary of Model Performance

Model Type Exact Rating Within 1 Point Major Mistakes R² Score
Basic Decision Tree 38% 81% 4.9% 0.204
Complex Decision Tree 38% 82% 4.5% 0.226
Random Forest (basic) 39% 82% 5.0% 0.220
Random Forest (votes) 42% 84% 4.0% 0.320

The dramatic improvement in the Random Forest model when including vote counts (R² jumping from 0.220 to 0.320) reveals an important insight: number of votes is a strong predictor of rating because popular movies tend to be higher quality. This makes intuitive sense - a movie seen by thousands of people is likely to be reasonably good, while poor movies rarely attract large audiences. While this is useful to know, it’s not helpful for our goal of predicting ratings for new movies that don’t have votes yet. This finding reinforces our main lesson that the quality of input features matters more than model complexity.

Comparing Our Approaches

Now that we’ve seen the detailed performance of each model, let’s examine how our different approaches addressed common decision tree challenges. Interestingly, the results showed that all models achieved similar base accuracy (around 81-82% within one rating point), with only modest improvements from added complexity.

  1. Simple Decision Tree (81% accurate, R² 0.204)
  • Used basic features like movie length and genres
  • Provided surprisingly good baseline performance
  • Showed that simple approaches can be effective
  1. Complex Decision Tree (82% accurate, R² 0.226)
  • Added actor experience scores and more features
  • Improved R² score slightly but not dramatically
  • Demonstrated that more complexity doesn’t always mean better results
  1. Random Forest Models
  • Basic version (82% accurate, R² 0.220): Performed similarly to complex tree
  • With votes (84% accurate, R² 0.320): Showed meaningful improvement
  • Highlighted that feature selection matters more than model complexity

The blog warned about trees becoming unwieldy with too many options, but our experience showed a different challenge: getting meaningful improvements from added complexity. The real gains came not from model choice but from thoughtful feature engineering.

Learning From Past Decision Tree Problems

The blog warned about several problems, but our experience revealed some nuances:

Problem 1: Trees That Memorize Instead of Learn - Blog Warning: Models that can’t generalize - Our Finding: Even simple trees generalized reasonably well - Lesson: Good feature engineering may matter more than model complexity

Problem 2: Too Many Options - Blog Warning: Trees getting overwhelmed with choices - Our Solution: Converted complex data (thousands of actors) into simple scores - Result: Even simple models performed well with cleaned, structured data

Problem 3: Hard to Maintain - Blog Warning: Systems that break easily - Our Approach: Created robust, reusable features - Result: A stable system that works for new movies

Conclusion

Our work with decision trees taught us some surprising lessons:

  • Simple approaches can work surprisingly well
  • Feature engineering matters more than model complexity
  • Random Forests help most when you have strong signals (like vote counts)

While our approach works, we see opportunities for improvement:

  • Finding stronger predictive features
  • Better handling of rare or unusual cases
  • More sophisticated actor experience metrics

Most importantly, we learned that success with decision trees isn’t just about avoiding their limitations - it’s about understanding what actually drives predictions. Our experience suggests that thoughtful data preparation matters more than model complexity for real-world applications.