Project 3: IMDB Ratings, comparing Random Forest to Support Vector Machines

Author

Team: I Love Lucy

Published

November 12, 2024

github   |   project 2 |   project 3

Executive Summary

This project compares two machine learning approaches—Random Forest and Support Vector Machines (SVM)—for predicting IMDB movie ratings. Through multiple iterations and optimizations, Random Forest consistently outperformed SVM, achieving 84% accuracy within one rating bin compared to SVM’s 58.99%. This better performance can be attributed to Random Forest’s natural ability to handle our dataset’s complex characteristics: high dimensionality (37 features), mixed data types (binary and continuous), and non-linear relationships between features. While we successfully improved SVM’s performance through feature scaling and kernel optimization, Random Forest proved to be inherently better suited for this prediction task.

Project Overview

Building on the work from Project 2 where we used Random Forest to predict movie ratings, this project explores the application of Support Vector Machines (SVM) for the same goal of predicting movie ratings using the IMDB non-commercial dataset. Our objective is to compare the performance of SVM and Random Forest, examining their strengths in prediction accuracy and their effectiveness for this classification task.

The IMDB dataset provides essential features, including ratings, genres, and cast details, which we used previously to build a predictive model. Although it lacks financial metrics such as box office revenue or streaming views, this data offers valuable insights into potential predictors of movie ratings. Like before, a primary challenge lies in handling the vast number of unique actors, which we addressed in Project 2 by creating numeric fields to represent cast characteristics. In this project, we’ll re-utilize those engineered features—such as the Likert score for cast experience and an average experience metric—to maintain consistency and comparability across the models.

Project Workflow

Our analysis consisted of these key stages:

  1. Data Overview and Preparation
    • Review dataset characteristics and calculated features
    • Understand data dimensionality and feature types
  2. Algorithm Suitability Analysis
    • Examine characteristics of Random Forest vs SVM
    • Analyze appropriateness for regression vs classification tasks
  3. Model Implementation and Comparison
    • Review Random Forest results from Project 2
    • Implement basic SVM model
    • Compare initial results
  4. SVM Optimization
    • Apply feature scaling and RBF kernel
    • Implement stratified sampling
    • Compare optimized results
  5. Analysis and Conclusions
    • Compare all model performances
    • Evaluate algorithm suitability
    • Make final recommendations

Data Overview

data_dir = "622data_nogit/imdb"
from lussi.imdb import *
from lussi.glimpse import *
df = load_imdb(data_dir)
glimpse(df)
Rows: 1047620
Columns: 37

Column preview:
--------------------------------------------------------------------------------
tconst               <category> tt0000001, tt0000002, tt0000003, tt0000004, tt0000005
primaryTitle         <string> Carmencita, Le clown et ses chiens, Poor Pierrot, Un bon bock, Blacksmith Scene
runtimeMinutes       <Int64> 1, 5, 5, 12, 1
numVotes             <int32> 2097, 283, 2106, 183, 2842
rating_bin           <int64> 5, 5, 6, 5, 6
num_actors           <float64> 4.0, 2.0, 5.0, 2.0, 3.0
actor_names          <string> Carmencita, William K.L. Dickson, William K.L. Dic..., Émile Reynaud, Gaston Paulin, Émile Reynaud, Julien Pappé, Émile Reynaud, Gaston..., Émile Reynaud, Gaston Paulin, Charles Kayser, John Ott, Thomas A. Edison
Action               <int64> 0, 0, 0, 0, 0
Adult                <int64> 0, 0, 0, 0, 0
Adventure            <int64> 0, 0, 0, 0, 0
Animation            <int64> 0, 1, 1, 1, 0
Biography            <int64> 0, 0, 0, 0, 0
Comedy               <int64> 0, 0, 1, 0, 1
Crime                <int64> 0, 0, 0, 0, 0
Documentary          <int64> 1, 0, 0, 0, 0
Drama                <int64> 0, 0, 0, 0, 0
Family               <int64> 0, 0, 0, 0, 0
Fantasy              <int64> 0, 0, 0, 0, 0
Film-Noir            <int64> 0, 0, 0, 0, 0
Game-Show            <int64> 0, 0, 0, 0, 0
History              <int64> 0, 0, 0, 0, 0
Horror               <int64> 0, 0, 0, 0, 0
Music                <int64> 0, 0, 0, 0, 0
Musical              <int64> 0, 0, 0, 0, 0
Mystery              <int64> 0, 0, 0, 0, 0
News                 <int64> 0, 0, 0, 0, 0
Reality-TV           <int64> 0, 0, 0, 0, 0
Romance              <int64> 0, 0, 1, 0, 0
Sci-Fi               <int64> 0, 0, 0, 0, 0
Short                <int64> 1, 1, 0, 1, 1
Sport                <int64> 0, 0, 0, 0, 0
Talk-Show            <int64> 0, 0, 0, 0, 0
Thriller             <int64> 0, 0, 0, 0, 0
War                  <int64> 0, 0, 0, 0, 0
Western              <int64> 0, 0, 0, 0, 0
experienced_actor_count <int64> 3, 1, 2, 1, 1
experienced_actors_likert <int64> 5, 4, 3, 4, 4

Reminder: Calculated Columns Overview

Below is a summary of the calculated columns added and their purpose:

  • num_actors:
    Total number of actors in each movie. Helps capture the cast size.

  • actor_names:
    A string of all actor names for each movie, separated by commas. Useful for analyzing trends or patterns based on cast members.

  • experienced_actor_count:
    Counts the number of experienced actors (those with more than 10 prior roles) in a movie. Measures the potential impact of experience on quality or reception.

  • experienced_actors_likert:
    A Likert-scale score (1–5) based on the average experience of the cast. Quantifies cast experience for easier analysis of its effect on ratings.

  • rating_bin:
    Binned version of the average rating (e.g., 1–10). Simplifies predictions by grouping continuous ratings into categories, which aligns with classification models like Random Forest.

  • Genre Dummy Variables:
    Each genre is expanded into a binary (0/1) column. Provides genre-specific features to analyze how genres influence ratings.

Suitability for Regression vs. Classification Tasks

In evaluating SVM and Random Forest, it’s helpful to consider the strengths of each algorithm for regression and classification scenarios. Both algorithms are versatile but have differing strengths based on the nature of the task.

Random Forest: Random Forest is generally more flexible and robust in handling complex, non-linear relationships within the data, making it well-suited for regression tasks. Its ability to capture interactions and non-linear relationships between features without extensive pre-processing makes it an effective model for predicting continuous outcomes. In this project, Random Forest effectively managed the high-dimensional IMDB dataset and produced relatively accurate predictions for movie ratings, which is a regression task. Random Forest’s ensemble nature also helps reduce overfitting by averaging multiple decision trees, making it particularly robust for larger datasets like ours.

Support Vector Machines (SVM): SVM is often more effective in binary or multi-class classification problems, especially when data points are linearly separable or nearly so. SVM’s use of the kernel trick enables it to create non-linear decision boundaries, but the computational cost can be high with large datasets and numerous features, as we observed. For regression tasks, SVM with an RBF or linear kernel can work, but it tends to be less flexible than Random Forest in capturing non-linear relationships. In this analysis, SVM struggled with the IMDB dataset’s high dimensionality and mixed data types, leading to longer training times and lower accuracy compared to Random Forest.

In summary, while Random Forest’s adaptability makes it a strong choice for regression tasks with complex data, SVM is generally better suited for classification problems where the relationships between features are clearer and the data is more structured. This distinction aligns with our results, where Random Forest outperformed SVM for predicting IMDB movie ratings.

Project 2 Review, Random Forest

This is straight from project 2’s code base.

#This assumes a number of votes already. Unlikely. 
df_dropped1 = df.drop(columns=['tconst', 'primaryTitle', 'actor_names'])
model1, importance1, metrics1, predictions1 = train_and_evaluate_rf(df_dropped1, show_output=False)
generate_summary_report(*predictions1, report_name="Project 2 RF")
Loading cached model from rf_model.joblib...

================================================================================
Results for: Project 2 RF
================================================================================

Quick Summary:
- Gets the exact rating 42% of the time
- Within one rating point 84% of the time
- Makes big mistakes only 4.0% of the time

Technical Details:
- R² Score: 0.320
- RMSE: 1.130

What This Means:
If you're trying to decide if a movie will be good (7-8), great (8-9),
or excellent (9-10), this model will get it right or be off by just
one rating point 84% of the time.
================================================================================

IMDB using SVM, First Run

For this first run, we’ll use almost exactly the same code base and data as we did with the Random Forest run. We will only swap the algorithm.

# Drop unnecessary columns
df_dropped2 = df.drop(columns=['tconst', 'primaryTitle', 'actor_names']).dropna(subset=['num_actors'])
df_sampled = df_dropped2.sample(frac=0.05, random_state=42)  # Adjust frac as needed
model2, metrics2, predictions2 = train_and_evaluate_svm(df_sampled, kernel='linear', show_output=False)
generate_summary_report(*predictions2, report_name="SVM (first run / not optimized)")
Loading cached SVM model from svm_model.joblib...

================================================================================
Results for: SVM (first run / not optimized)
================================================================================

Quick Summary:
- Gets the exact rating 27% of the time
- Within one rating point 64% of the time
- Makes big mistakes only 19.5% of the time

Technical Details:
- R² Score: -3186.443
- RMSE: 78.555

What This Means:
If you're trying to decide if a movie will be good (7-8), great (8-9),
or excellent (9-10), this model will get it right or be off by just
one rating point 64% of the time.
================================================================================

Comparing RF with SVM First Run

Project 2 Random Forest Performance:

  • 42.25% exact matches
  • 84.10% within 1 bin
  • Only 4% major errors (off by 3+ bins)
  • R² Score: 0.320
  • RMSE: 1.130

SVM First Run Performance:

  • 27.16% exact matches
  • 64.38% within 1 bin
  • 19.5% major errors
  • R² Score: -3186.443 (extremely poor)
  • RMSE: 78.555 (extremely high in the context of our dataset)

Comparison -> Random Forest wins on all fronts:

  • Better exact matches (42% vs 27%)
  • Better near matches (84% vs 64%)
  • Far fewer major errors (4% vs 19.5%)
  • Much better R² score (0.32 vs -3186)
  • Much better RMSE (1.13 vs 78.56)
  • SVM’s extremely negative R² score suggests it’s performing worse than a horizontal line would.

Why Random Forest Performs Better:

  • Better handles the mixed data types
  • Can capture non-linear relationships
  • More robust to outliers
  • Handles the categorical nature of genre features better
  • Better with high-dimensional data (we have 37 features)

Though we made the case for Random Forest after we tested both models, this preference might have been anticipated. In their work on feature selection, Guyon and Elisseeff emphasize that effective feature selection is crucial for models dealing with high-dimensional data, where irrelevant features can disrupt model performance. Their findings suggest that SVM may struggle with datasets like ours, which contain a mix of categorical and continuous features. In contrast, Random Forest has a natural advantage, as it inherently ranks and selects the most relevant features during its tree-building process.

What we could have done differently or better:

  • We used a linear kernel which might be too simple for this data. We could have used the RBF kernel
  • We sampled only 5% of the data to train
  • SVMs typically need more feature scaling/normalization. We didn’t do any of that.

Rerunning the model with dedicated SVM adjustments

In optimizing the SVM model for the second run, we implemented several best practices to improve performance on our dataset. First, we applied stratified sampling to ensure balanced respresentation of each rating bin, reducing potential model bias by preserving class proportions. Below, we can see that there is a disproportionate distribution of each rating bin. Stratified sampling will address this imbalance. See this demonstrated multiple ways in in Applied Predictive Modeling by Max Kuhn and Kjell Johnson.

histogram_likert(df=df_dropped2)

We also scaled all features using StandardScaler, which is critical for SVM since unscaled features can distort the model’s decision boundary. Learn more about the StandardScaler in Hands-On Machine Learning by Aurélien Géron.

Finally, we implemented RandomizedSearchCV to find the best parameters for our SVM model. This approach explores a subset of the hyperparameter space efficiently, striking a balance between computational cost and performance. RandomizedSearch CV found that the best hyperparameters were as follows:

  • Kernel: RBF
  • Epsilon: 0.5
  • C: 100

These improvements collectively resulted in a more robust and effective SVM model, addressing dataset imbalances, ensuring proper feature scaling, and optimizing hyperparameters.

df_dropped3 = df.drop(columns=['tconst', 'primaryTitle', 'actor_names']).dropna(subset=['num_actors'])
model3, metrics3, predictions3 = train_and_evaluate_svm_optimized(df_dropped3, show_output=False)
generate_summary_report(*predictions3,report_name="SVM (optimized)")
Loading cached optimized SVM model from svm_model_opt.joblib...

================================================================================
Results for: SVM (optimized)
================================================================================

Quick Summary:
- Gets the exact rating 23% of the time
- Within one rating point 59% of the time
- Makes big mistakes only 21.4% of the time

Technical Details:
- R² Score: 0.294
- RMSE: 2.110

What This Means:
If you're trying to decide if a movie will be good (7-8), great (8-9),
or excellent (9-10), this model will get it right or be off by just
one rating point 59% of the time.
================================================================================

Comparing SVM Optimized run with SVM First Run:

  • Uses the RBF (Radial Basis Function) kernel instead of linear - better for non-linear relationships
  • Uses stratified sampling to ensure balanced representation of rating bins
  • Added feature scaling (StandardScaler) - crucial for SVM performance
  • Reduced sample size to 50,000 rows total
  • Better R² score (0.294 vs -3186.443)
  • Much better RMSE (2.110 vs 78.555)
  • Slightly lower accuracy (58.99% vs 64% within 1 bin), possibly due to smaller but more balanced sample
  • Similar exact match rate (22.7% vs 27%)
  • Slightly higher major error rate (21.4% vs 20%)

Despite these improvements, Random Forest still significantly outperforms both SVM approaches for this particular problem.

Conclusions

This data is more easily predicted using Random Forest. The dataset contains 37 features, making it highly dimensional, and includes a mix of binary features (like genre indicators) and continuous features (like runtime and number of votes). There are likely non-linear relationships between features (such as how actor experience influences ratings differently at different budget levels) and complex interactions between features (like how certain genres might amplify or diminish the impact of other features).

Random Forest works better because it naturally handles these complexities. It doesn’t require feature scaling since each decision tree makes binary splits, handles mixed data types seamlessly, and automatically captures feature interactions through its tree structure. Its ensemble nature means it can effectively model non-linear relationships, and its random feature sampling at each split makes it particularly well-suited for high-dimensional data. These characteristics make Random Forest a natural fit for this kind of complex, heterogeneous dataset, as demonstrated by its superior performance (84% accuracy within one rating bin compared to SVM’s 58.99%).