Project 2: IMDB Ratings using Random Forest

Author

Team: I Love Lucy

Published

November 2, 2024

Project Overview

In this project, we set out to use Random Forest on the IMDB non-commercial dataset to predict movie ratings. The dataset lacks financial performance metrics such as box office revenue or streaming views, but it provides ratings, genres, and cast information, which became the foundation of our analysis.

One of the biggest challenges we encountered was the sheer number of unique actors. With millions of actors in the dataset, one-hot encoding was not practical. To address this, we developed a system to measure actor experience through an average-based metric and a Likert-scale score for each movie’s cast, giving us a way to quantify experience without overwhelming the model.
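
As a rough illustration of the idea, the sketch below collapses a per-actor table into per-movie experience features. The toy data, the `prior_roles` column, and the Likert bin edges are assumptions made for this example; only the "more than 10 prior roles" threshold comes from the column descriptions in this report.

```python
import pandas as pd

# Toy cast table: one row per (movie, actor). All values are made up.
cast = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt1", "tt2", "tt2"],
    "actor": ["A", "B", "C", "D", "E"],
    "prior_roles": [25, 3, 12, 0, 2],
})

# An actor counts as experienced with more than 10 prior roles.
EXPERIENCED_MIN_ROLES = 10
# Hypothetical cut points for the 1-5 score; the project's real edges are not stated.
LIKERT_EDGES = [-1, 2, 5, 10, 20, float("inf")]

per_movie = cast.groupby("tconst").agg(
    num_actors=("actor", "size"),
    experienced_actor_count=(
        "prior_roles", lambda s: int((s > EXPERIENCED_MIN_ROLES).sum())
    ),
    mean_prior_roles=("prior_roles", "mean"),
)

# Map the cast's average experience onto a 1-5 scale.
per_movie["experienced_actors_likert"] = pd.cut(
    per_movie["mean_prior_roles"], bins=LIKERT_EDGES, labels=[1, 2, 3, 4, 5]
).astype(int)

print(per_movie)
```

Each movie ends up with a handful of numeric columns regardless of how many distinct actors exist, which is what keeps the feature space small compared with one-hot encoding.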

Despite these challenges, the initial runs were promising: without using vote counts, the model predicted a movie's rating within one bin about 82% of the time (about 84% with vote counts). This suggests that thoughtful feature engineering can extract useful signal even from large, complex datasets.

Data Preparation

Our workflow to prepare the data consisted of three stages:

  1. Download the Raw Datasets
    We downloaded multiple IMDB datasets, such as ratings, basics, and principals, and saved them locally for processing.

  2. Merge Datasets by Movie Identifier
    Using each movie’s unique identifier (tconst), we merged datasets to create a single, consolidated DataFrame, which we persisted for efficiency.

  3. Add Calculated Columns
    After merging, we added several calculated columns (detailed below) to enrich the data with features like actor experience and genre dummy variables for better predictive power.
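
The merge stage can be sketched with small in-memory tables standing in for the downloaded files. This is a minimal example, not the project's loader: the `averageRating` values and the `nconst` identifiers are placeholders, and the choice of an inner join on ratings followed by a left join on cast counts is an assumption.

```python
import pandas as pd

# Stand-ins for the basics, ratings, and principals tables.
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Carmencita", "Le clown et ses chiens", "Poor Pierrot"],
    "runtimeMinutes": [1, 5, 5],
})
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "averageRating": [5.0, 6.0],  # placeholder values
    "numVotes": [2097, 283],
})
principals = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001", "tt0000002"],
    "nconst": ["nm1", "nm2", "nm3"],  # placeholder person IDs
    "category": ["actor", "actor", "director"],
})

# Count acting credits per title before joining, so the merge stays one row per movie.
cast_counts = (
    principals[principals["category"].isin(["actor", "actress"])]
    .groupby("tconst")
    .size()
    .rename("num_actors")
    .reset_index()
)

merged = (
    basics
    .merge(ratings, on="tconst", how="inner", validate="one_to_one")  # keep rated titles only
    .merge(cast_counts, on="tconst", how="left")  # titles without acting credits get NaN
)

print(merged)
# merged.to_parquet("merged.parquet")  # optional persistence; needs pyarrow or fastparquet
```

Aggregating the many-rows-per-title table first and passing `validate="one_to_one"` guards against accidentally duplicating movies during the join.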

Calculated Columns Overview

Below is a summary of the calculated columns added and their purpose:

  • num_actors:
    Total number of actors in each movie. Helps capture the cast size.

  • actor_names:
    A string of all actor names for each movie, separated by commas. Useful for analyzing trends or patterns based on cast members.

  • experienced_actor_count:
    Counts the number of experienced actors (those with more than 10 prior roles) in a movie. Measures the potential impact of experience on quality or reception.

  • experienced_actors_likert:
    A Likert-scale score (1–5) based on the average experience of the cast. Quantifies cast experience for easier analysis of its effect on ratings.

  • rating_bin:
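    The average rating grouped into integer bins (e.g., 1–10). Grouping continuous ratings into a small set of levels simplifies the prediction target and makes bin-level accuracy easy to report. Random Forest can treat these bins either as classes or as a numeric target; the models below report R2 and RMSE alongside bin-level accuracy.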

  • Genre Dummy Variables:
    Each genre is expanded into a binary (0/1) column. Provides genre-specific features to analyze how genres influence ratings.
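
Two of the columns above, rating_bin and the genre dummies, can be sketched in a few lines of pandas. The binning rule used here (rounding down to the whole number) is an assumption, since the report does not state the exact rule, and the ratings are made-up values. The comma-separated `genres` string matches how IMDB stores genres.

```python
import numpy as np
import pandas as pd

# Toy rows; averageRating values are illustrative only.
movies = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averageRating": [5.7, 6.4, 8.2],
    "genres": ["Documentary,Short", "Animation,Comedy,Romance", "Drama"],
})

# Assumed rule: floor the continuous rating to an integer bin.
movies["rating_bin"] = np.floor(movies["averageRating"]).astype(int)

# Expand the comma-separated genre string into one 0/1 column per genre.
genre_dummies = movies["genres"].str.get_dummies(sep=",")

features = pd.concat([movies.drop(columns="genres"), genre_dummies], axis=1)
print(features)
```

Because a title can carry several genres at once, `str.get_dummies` is a better fit here than a plain one-hot encoding of the whole string.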

Data Overview

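from lussi.imdb import *
from lussi.glimpse import *

# Directory holding the locally saved IMDB files
data_dir = "622data_nogit/imdb"
# Load the prepared DataFrame and preview its columns
df = load_imdb(data_dir)
glimpse(df)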
Rows: 1047620
Columns: 37

Column preview:
--------------------------------------------------------------------------------
tconst               <category> tt0000001, tt0000002, tt0000003, tt0000004, tt0000005
primaryTitle         <string> Carmencita, Le clown et ses chiens, Poor Pierrot, Un bon bock, Blacksmith Scene
runtimeMinutes       <Int64> 1, 5, 5, 12, 1
numVotes             <int32> 2097, 283, 2106, 183, 2842
rating_bin           <int64> 5, 5, 6, 5, 6
num_actors           <float64> 4.0, 2.0, 5.0, 2.0, 3.0
actor_names          <string> Carmencita, William K.L. Dickson, William K.L. Dic..., Émile Reynaud, Gaston Paulin, Émile Reynaud, Julien Pappé, Émile Reynaud, Gaston..., Émile Reynaud, Gaston Paulin, Charles Kayser, John Ott, Thomas A. Edison
Action               <int64> 0, 0, 0, 0, 0
Adult                <int64> 0, 0, 0, 0, 0
Adventure            <int64> 0, 0, 0, 0, 0
Animation            <int64> 0, 1, 1, 1, 0
Biography            <int64> 0, 0, 0, 0, 0
Comedy               <int64> 0, 0, 1, 0, 1
Crime                <int64> 0, 0, 0, 0, 0
Documentary          <int64> 1, 0, 0, 0, 0
Drama                <int64> 0, 0, 0, 0, 0
Family               <int64> 0, 0, 0, 0, 0
Fantasy              <int64> 0, 0, 0, 0, 0
Film-Noir            <int64> 0, 0, 0, 0, 0
Game-Show            <int64> 0, 0, 0, 0, 0
History              <int64> 0, 0, 0, 0, 0
Horror               <int64> 0, 0, 0, 0, 0
Music                <int64> 0, 0, 0, 0, 0
Musical              <int64> 0, 0, 0, 0, 0
Mystery              <int64> 0, 0, 0, 0, 0
News                 <int64> 0, 0, 0, 0, 0
Reality-TV           <int64> 0, 0, 0, 0, 0
Romance              <int64> 0, 0, 1, 0, 0
Sci-Fi               <int64> 0, 0, 0, 0, 0
Short                <int64> 1, 1, 0, 1, 1
Sport                <int64> 0, 0, 0, 0, 0
Talk-Show            <int64> 0, 0, 0, 0, 0
Thriller             <int64> 0, 0, 0, 0, 0
War                  <int64> 0, 0, 0, 0, 0
Western              <int64> 0, 0, 0, 0, 0
experienced_actor_count <int64> 3, 1, 2, 1, 1
experienced_actors_likert <int64> 5, 4, 3, 4, 4

Random Forest Models

Assumes number of votes

Vote counts do not exist for a title before its release, so using numVotes as a predictor is unrealistic; we included it only to test the model. Because this feature reflects audience reception after release, it leaks post-release information into the model and can bias the predictions.

# Baseline: numVotes is kept as a feature, which assumes votes already exist. Unlikely for a new title.
# Drop the identifier and free-text columns that the model cannot use directly.
df_dropped1 = df.drop(columns=['tconst', 'primaryTitle', 'actor_names'])
model1, importance1, metrics1, predictions1 = train_and_evaluate_rf(df_dropped1)
generate_summary_report(*predictions1)
display(plot_importance(importance1))
Model Performance:
R2 Score: 0.320
Root Mean Squared Error: 1.130

Total model train and execution time: 0:05:16.389973 

**Accuracy Measures:**
- 42.25% exact matches (got the rating bin exactly right)
- 84.10% within 1 bin (either exact or just one bin off)

**Distribution of Errors:**
- 42.3% perfect predictions (0 bins off)
- 41.8% off by just 1 bin (22.8% low + 19.0% high)
- Only 15.9% off by more than 1 bin
- Major mistakes (off by 3 or more bins): 4.0%

**Putting It in Perspective:**
If you're trying to predict if a movie is 'good' (7–8), 'great' (8–9), or 'excellent' (9–10), you'll be within the right range 84.10% of the time. Major errors are pretty rare, happening less than 4.0% of the time.
Feature Importance Plot
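
The helper train_and_evaluate_rf is not shown in this report. As a point of reference only, a comparable workflow could look like the scikit-learn sketch below. The function name `fit_rating_forest`, the regressor choice, the 80/20 split, the hyperparameters, and the synthetic data are all assumptions for illustration, not the project's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_rating_forest(df, target="rating_bin", seed=0):
    """Hypothetical stand-in: fit a forest and return model, importances, metrics, predictions."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    metrics = {
        "r2": r2_score(y_test, y_pred),
        "rmse": float(np.sqrt(mean_squared_error(y_test, y_pred))),
    }
    importance = pd.Series(
        model.feature_importances_, index=X.columns
    ).sort_values(ascending=False)
    return model, importance, metrics, (y_test.to_numpy(), y_pred)


# Synthetic data with a fabricated relationship, only to exercise the function.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "runtimeMinutes": rng.integers(1, 180, size=n),
    "Drama": rng.integers(0, 2, size=n),
    "experienced_actors_likert": rng.integers(1, 6, size=n),
})
toy["rating_bin"] = 4 + toy["experienced_actors_likert"] // 2 + toy["Drama"]

model, importance, metrics, (y_test, y_pred) = fit_rating_forest(toy)
print(metrics)
print(importance)
```

The feature importances returned here are the impurity-based values scikit-learn exposes on a fitted forest; they sum to 1 and are the kind of quantity a feature importance plot typically displays.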

Assumes no votes

This scenario is more realistic: a new title arrives with no votes, and we predict its rating from the remaining features.

# Realistic case: a new title with no votes at all, so numVotes is dropped as well.
df_dropped2 = df.drop(columns=['tconst', 'primaryTitle', 'actor_names', 'numVotes'])
model2, importance2, metrics2, predictions2 = train_and_evaluate_rf(df_dropped2)
generate_summary_report(*predictions2)
display(plot_importance(importance2))
Model Performance:
R2 Score: 0.220
Root Mean Squared Error: 1.210

Total model train and execution time: 0:03:57.261132 

**Accuracy Measures:**
- 39.10% exact matches (got the rating bin exactly right)
- 81.72% within 1 bin (either exact or just one bin off)

**Distribution of Errors:**
- 39.1% perfect predictions (0 bins off)
- 42.6% off by just 1 bin (23.1% low + 19.5% high)
- Only 18.3% off by more than 1 bin
- Major mistakes (off by 3 or more bins): 5.0%

**Putting It in Perspective:**
If you're trying to predict if a movie is 'good' (7–8), 'great' (8–9), or 'excellent' (9–10), you'll be within the right range 81.72% of the time. Major errors are pretty rare, happening less than 5.0% of the time.
Feature Importance Plot
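
The bin-level figures in the two reports (exact matches, within one bin, low versus high misses) follow from simple arithmetic on predicted and actual bins. The sketch below shows one way to compute them; the function name `bin_error_summary`, the rounding of continuous predictions to the nearest bin, and the sample values are assumptions, not the code behind generate_summary_report.

```python
import numpy as np


def bin_error_summary(y_true, y_pred):
    """Share of predictions that land 0, 1, or 3+ bins away from the true bin."""
    y_true = np.asarray(y_true)
    # Assumed step: snap continuous predictions to the nearest integer bin.
    y_hat = np.rint(np.asarray(y_pred)).astype(int)
    diff = y_hat - y_true  # negative means the prediction was too low
    abs_diff = np.abs(diff)
    return {
        "exact": float(np.mean(abs_diff == 0)),
        "within_1": float(np.mean(abs_diff <= 1)),
        "one_low": float(np.mean(diff == -1)),
        "one_high": float(np.mean(diff == 1)),
        "off_3_plus": float(np.mean(abs_diff >= 3)),
    }


# Made-up values for illustration.
y_true = [5, 5, 6, 5, 6, 7, 8, 3]
y_pred = [5.2, 5.9, 6.1, 4.3, 9.2, 7.4, 6.8, 3.0]
summary = bin_error_summary(y_true, y_pred)
print(summary)
```

By construction, the within-one-bin share equals the exact share plus the one-low and one-high shares, which is the same relationship visible in the reports above (for example, 39.1% + 23.1% + 19.5% = 81.7%).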

Conclusion

Our results show that we can predict a movie's rating within one bin about 82% of the time (84% when vote counts are included) using features such as actor experience, genre, and runtime. Including vote counts improved performance (R2 of 0.320 versus 0.220, and 42.25% versus 39.10% exact matches), but it relies on information that only exists after release and therefore biases the model. Without votes, the within-one-bin accuracy dropped by only about two percentage points, which indicates that the engineered features carry meaningful signal on their own. This project suggests that Random Forest is a viable approach for predicting IMDB ratings, even when financial data is not available.