DATA 698 Final Presentation

Josh Iden

Measuring “Star Power”: Predicting Movie Box Office Revenue Based on Directors’ and Leading Actors’ Recent Success

Film Industry Statistics

  • Contributes 42.5B USD to the US economy (as of 2022)
  • 2.5 million jobs (Zane, 2023)
  • 36% of movies turn a profit (Lash, Zhao, 2016)
  • Avg. tenure of executives:
    • 1940s: 20 years
    • 1980s: 4 years (Ravid, 1999)

Literature Review

  • Linear Regression using Rentals as response
  • Economic Theory vs. Communication Theory
  • Static factors vs. Dynamic factors
  • Neural Nets, Random Forest
  • International Films

“Star Power”

  • Binary variables or Rankings
  • IMDb popularity
  • Cumulative lifetime earnings
  • Classification modeling
  • Full cast

Hypothesis and Methodology

  • Tree-based models
  • Continuous and Categorical
  • Ensemble models
    • Bagged Trees, Random Forest, XGBoost, Cubist
    • High accuracy and generalizability
  • Hypothesis: Star Power is an important predictor of revenue
  • Variable Importance Measures
    • Top 5 Variables
  • Mean Absolute Percentage Error (MAPE)
    • MAPE below 5%
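
A minimal Python sketch of the MAPE metric named above, using its standard definition (mean of |actual - predicted| / |actual|, expressed in percent). The function name `mape` and the sample revenue figures are illustrative assumptions, not values from the study.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 0:
            # MAPE is undefined when an actual value is zero
            raise ValueError("actual values must be non-zero")
        total += abs((a - p) / a)
    return 100.0 * total / len(actual)


# Hypothetical revenues and forecasts (made-up numbers, in millions USD)
revenue = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]
score = mape(revenue, forecast)
```

Because each error is divided by the actual value, a fixed dollar miss counts for more on a low-grossing film than on a blockbuster.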

Data Acquisition

  • Written and cleaned in Python
  • Web Scraping: IMDb, Box Office Mojo
  • API: OMDb

Data Acquisition

  • Box Office Mojo
    • Unique IMDb IDs for the 200 top-earning movies per year
  • IMDb
    • Production budgets
    • Leading actor and director unique IDs
  • OMDb
    • Must be queried by unique IMDb ID
    • Title, Year, Rating, Release Date, Runtime, Genre, Director, Writer, Actors, Plot, Language, Country, Awards, Poster, MetaRating, Domestic Box Office Revenue
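
As a rough illustration of querying OMDb by IMDb ID, the sketch below only builds the request URL and makes no network call. It assumes the public OMDb endpoint with its `i` (IMDb ID) and `apikey` query parameters. The helper name `build_omdb_url`, the sample ID, and the placeholder key are hypothetical, and this is not the project's actual acquisition script.

```python
from urllib.parse import urlencode

# Assumed public OMDb endpoint
OMDB_ENDPOINT = "https://www.omdbapi.com/"


def build_omdb_url(imdb_id, api_key):
    """Build an OMDb lookup URL for one title, keyed by its IMDb ID."""
    if not imdb_id.startswith("tt"):
        # IMDb title IDs look like "tt0111161"
        raise ValueError("expected an IMDb title ID starting with 'tt'")
    query = urlencode({"i": imdb_id, "apikey": api_key})
    return f"{OMDB_ENDPOINT}?{query}"


# Placeholder key: substitute a real OMDb API key before requesting the URL
url = build_omdb_url("tt0111161", "YOUR_API_KEY")
```

Looping this helper over the IDs scraped from Box Office Mojo would yield one request per movie.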

Combined Dataset

Feature Engineering

  • R
  • Subset data to 2012-2022 so that a full 10 years of data is available, beginning in 2012
  • Combine low frequency values
    • Country: United States vs. Foreign
    • Language: English vs. Foreign
    • Genres
    • Rating
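
The project did this step in R. The Python sketch below shows the same idea of combining low-frequency values. The function names, the `min_count` threshold, and the sample values are assumptions made for illustration.

```python
from collections import Counter


def lump_rare(values, keep, other="Foreign"):
    """Keep the listed categories and collapse everything else into `other`."""
    return [v if v in keep else other for v in values]


def lump_by_frequency(values, min_count, other="Other"):
    """Collapse categories seen fewer than `min_count` times into `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Hypothetical sample values
countries = ["United States", "France", "United States", "Japan"]
country_binary = lump_rare(countries, keep={"United States"})

genres = ["Action", "Action", "Drama", "Drama", "Western"]
genre_lumped = lump_by_frequency(genres, min_count=2)
```

Collapsing rare levels keeps tree-based models from splitting on categories that appear in only a handful of rows.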

Exploratory Data Analysis

Summary Statistics

Model Prep

  • Handling missing values
  • Transforming the response
  • Feature scaling

Handling Missing Values

  • Approaches
    • Drop observations
    • 5-nearest-neighbors imputation
    • Zero imputation and categorical budget variable (Y/N)
    • Transform budget into categorical by IQR
  • Best results: zero imputation + categorical variable
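
A small Python sketch of the best-performing approach listed above: missing budgets are set to zero and a companion Y/N variable records whether a budget was reported. Representing a missing value as `None`, the function name, and the sample budgets are assumptions.

```python
def impute_budget(budgets):
    """Zero-impute missing budgets and flag which rows had a real value."""
    imputed = [0.0 if b is None else float(b) for b in budgets]
    has_budget = ["N" if b is None else "Y" for b in budgets]
    return imputed, has_budget


# Hypothetical budgets in USD; None marks a missing value
budgets = [150_000_000, None, 20_000_000]
budget_filled, budget_reported = impute_budget(budgets)
```

The indicator lets a model separate "budget unknown" from a genuinely small budget, which zero imputation alone would blur together.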

Transforming the Response

  • Highly right-skewed
  • Box-Cox transformation
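
For reference, the standard one-parameter Box-Cox transform is (y^λ - 1) / λ for λ ≠ 0 and log(y) for λ = 0, defined for positive y. The Python sketch below implements it with its inverse, which is needed to put predictions back on the revenue scale. The λ value used here is an arbitrary assumption, since the slides do not state the fitted λ.

```python
import math


def box_cox(y, lam):
    """One-parameter Box-Cox transform of a single positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def inverse_box_cox(z, lam):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


# Hypothetical revenue and an assumed lambda
revenue = 1_000_000.0
lam = 0.5
transformed = box_cox(revenue, lam)
recovered = inverse_box_cox(transformed, lam)
```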

Feature Scaling

  • Rescales each feature to a standard deviation of 1
  • Can improve model performance
  • No centering performed
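
Scaling without centering can be sketched as dividing each feature by its standard deviation and leaving the mean where it falls. The example assumes the sample standard deviation; the function name and the runtime values are illustrative.

```python
import statistics


def scale_no_center(values):
    """Divide by the sample standard deviation; the mean is not subtracted."""
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("cannot scale a constant feature")
    return [v / sd for v in values]


# Hypothetical runtimes in minutes
runtime = [90.0, 100.0, 110.0, 120.0, 130.0]
runtime_scaled = scale_no_center(runtime)
```

After scaling, the feature has unit standard deviation but a non-zero mean, so non-negative inputs stay non-negative.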

Modeling

  • caret package in R
  • Model Versions
  • Hyperparameter Tuning
  • Variable Importance
  • Findings

Model Versions

  • Shuffle & 80/20 Split
  • 10-fold Cross-Validation
  • Five “Focus” Models
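
The models were fit with caret in R. The standalone Python sketch below only illustrates the resampling scheme: shuffle, hold out 20% as a test set, then split the training rows into 10 cross-validation folds. The seed, the function names, and the fold assignment by stride are assumptions, not the settings used in the project.

```python
import random


def shuffle_split(rows, train_frac=0.8, seed=42):
    """Shuffle a copy of the rows, then cut it into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n, k=10):
    """Yield (train_idx, validation_idx) index pairs for k-fold CV."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(n) if i not in held]
        yield train_idx, held_out


# Hypothetical dataset of 100 row identifiers
rows = list(range(100))
train, test = shuffle_split(rows)
folds = list(k_fold_indices(len(train)))
```

Each training row is held out exactly once across the 10 folds, and the 20% test set is never used during tuning.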

Hyperparameter Tuning

Variable Importance

  • Bagged Trees & Random Forest:
    • Avg. Decrease in MSE when a predictor is excluded.
  • XGBoost:
    • Number of times a variable is used to split the data.
  • Cubist:
    • Linear combination of the percentage of times each variable is used.

Findings

  • Results by MAPE

Findings

  • Variable Importance - Best Model

Conclusion