DATA 698 Final Presentation

Josh Iden

Measuring “Star Power”: Predicting Movie Box Office Revenue Based on Directors’ and Leading Actors’ Recent Success

Film Industry Statistics

42.5B USD contributed to the US economy (as of 2022)
2.5 million jobs (Zane, 2023)
36% of movies turn a profit (Lash & Zhao, 2016)
Avg. tenure of executives:
- 1940s: 20 years
- 1980s: 4 years (Ravid, 1999)

Literature Review

Linear Regression using Rentals as response
Economic Theory vs. Communication Theory
Static factors vs. Dynamic factors
Neural Nets, Random Forest
International Films

“Star Power”

Binary variables or Rankings
IMDb popularity
Cumulative lifetime earnings
Classification modeling
Full cast

Hypothesis and Methodology

Tree-based models
Continuous and Categorical
Ensemble models
- Bagged Trees, Random Forest, XGBoost, Cubist
- High accuracy and generalizability

Hypothesis: Star Power is an important predictor of revenue
Variable Importance Measures
- Top 5 Variables
Mean Absolute Percentage Error (MAPE)
- MAPE below 5%
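
For reference, MAPE averages the absolute prediction error as a share of the actual value. The snippet below is a minimal sketch of the standard definition, not the code used in the project; the revenue figures are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical revenues in millions of USD (illustration only).
actual = [100.0, 200.0, 50.0]
predicted = [110.0, 190.0, 50.0]
score = mape(actual, predicted)  # errors of 10%, 5%, and 0%
```

Because each error is divided by the actual value, a miss on a low-revenue film counts as much as the same percentage miss on a blockbuster.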

Data Acquisition

Written and cleaned in Python
Web Scraping: IMDb, Box Office Mojo
API: OMDb

Data Acquisition

Box Office Mojo
- Unique IMDb IDs for the top 200 highest-earning movies per year
IMDb
- Production budgets
- Leading actor and director unique IDs
OMDb
- Must be queried by unique IMDb IDs
- Title, Year, Rating, Release Date, Runtime, Genre, Director, Writer, Actors, Plot, Language, Country, Awards, Poster, MetaRating, Domestic Box Office Revenue
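
A lookup by IMDb ID against OMDb might look like the sketch below. It assumes the public OMDb endpoint with its `i` (IMDb ID) and `apikey` query parameters; the function names, the example ID, and the `YOUR_KEY` placeholder are illustrative and are not taken from the project code.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OMDB_URL = "https://www.omdbapi.com/"


def build_omdb_url(imdb_id, api_key):
    """Build the request URL for one movie, keyed by its IMDb ID."""
    return OMDB_URL + "?" + urlencode({"i": imdb_id, "apikey": api_key})


def fetch_movie(imdb_id, api_key):
    """Fetch one OMDb record as a dict (hypothetical helper)."""
    with urlopen(build_omdb_url(imdb_id, api_key), timeout=10) as resp:
        record = json.load(resp)
    # OMDb reports failures in the JSON body via a "Response" flag.
    if record.get("Response") == "False":
        raise LookupError(record.get("Error", "unknown OMDb error"))
    return record


# Example (requires a valid key and network access):
# movie = fetch_movie("tt0111161", "YOUR_KEY")
```

Looping this over the IDs scraped from Box Office Mojo would yield one record per movie, which can then be joined to the budget data on the same ID.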

Combined Dataset

Feature Engineering

R
Subset data to 2012-2022 to have a full 10 years of data beginning in 2012
Combine low frequency values
- Country: United States vs. Foreign
- Language: English vs. Foreign
- Genres
- Rating
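
Combining low-frequency values can be sketched as follows. The project did this step in R; the Python version below only illustrates the idea. The `min_count` threshold and the sample values are assumptions, since the slides do not state the cutoff used.

```python
from collections import Counter


def binarize(values, keep, other="Foreign"):
    """Collapse a column to two levels: `keep` vs. everything else."""
    return [v if v == keep else other for v in values]


def lump_rare(values, min_count, other="Other"):
    """Replace levels seen fewer than `min_count` times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Hypothetical sample values (illustration only).
countries = ["United States", "France", "United States", "Japan"]
country_binary = binarize(countries, keep="United States")

genres = ["Action", "Action", "Drama", "Drama", "Western"]
genre_lumped = lump_rare(genres, min_count=2)
```

Fewer levels means fewer sparse splits for the tree-based models to consider.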

Exploratory Data Analysis

Summary Statistics

Model Prep

Handling missing values
Transforming the response
Feature scaling

Handling Missing Values

Approaches
- Drop observations
- 5-nearest neighbors imputation
- Zero imputation and categorical budget variable (Y/N)
- Transform budget into categorical by IQR
Best results: zero imputation + categorical variable
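
The best-performing approach pairs each imputed zero with a flag, so a model can tell a missing budget apart from a real value. A minimal sketch, assuming missing budgets are represented as `None`; the function name and sample figures are hypothetical.

```python
def impute_budget(budgets):
    """Zero-fill missing budgets and return a Y/N 'budget known' flag."""
    has_budget = ["Y" if b is not None else "N" for b in budgets]
    filled = [b if b is not None else 0.0 for b in budgets]
    return filled, has_budget


# Hypothetical budgets in USD; None marks a missing value.
budgets = [150e6, None, 20e6]
filled, has_budget = impute_budget(budgets)
```

A tree can split on the flag first, so the zero stands for "unknown" and is not read as a very cheap production.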

Transforming the Response

Highly right-skewed
Box-Cox transformation
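
The Box-Cox transform maps a positive value y to (y^lambda - 1) / lambda, or to log(y) when lambda is 0. The sketch below shows the transform and its inverse. The slides do not report the lambda used, so the value 0.5 here is purely illustrative; in practice lambda is usually estimated from the data.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a single positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def inverse_box_cox(z, lam):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


# Illustrative lambda only; not the value fitted in the project.
z = box_cox(9.0, 0.5)
y_back = inverse_box_cox(z, 0.5)
```

Predictions made on the transformed scale need the inverse applied before computing MAPE in dollar terms.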

Feature Scaling

Scales each feature to a standard deviation of 1
Improves model performance
No centering performed
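
Scaling without centering divides each feature by its standard deviation and leaves the mean unshifted. A minimal sketch, assuming the sample standard deviation is used; the runtime values are hypothetical.

```python
import statistics


def scale_no_center(values):
    """Divide by the sample standard deviation; do not subtract the mean."""
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("cannot scale a constant feature")
    return [v / sd for v in values]


# Hypothetical runtimes in minutes (illustration only).
runtimes = [90.0, 100.0, 110.0, 120.0]
scaled = scale_no_center(runtimes)
```

After scaling, the standard deviation is 1 while the values stay positive, since no mean is subtracted.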

Modeling

Caret Package in R
Model Versions
Hyperparameter Tuning
Variable Importance
Findings

Model Versions

Shuffle & 80/20 Split
10-fold Cross-Validation
Five “Focus” Models
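
The resampling scheme can be sketched with row indices alone. The project used the caret package in R for this; the Python below only illustrates the shuffle, the 80/20 split, and the 10 folds. The row count and seed are assumptions.

```python
import random


def train_test_split(n_rows, test_frac=0.2, seed=42):
    """Shuffle row indices, then hold out `test_frac` of them for testing."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_test = round(n_rows * test_frac)
    return idx[n_test:], idx[:n_test]


def k_folds(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]


# Hypothetical row count (illustration only).
train_idx, test_idx = train_test_split(2200)
splits = list(k_folds(train_idx, k=10))
```

Cross-validation runs only on the 80% training portion, so the held-out 20% stays untouched until final evaluation.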

Hyperparameter Tuning

Variable Importance

Bagged Trees & Random Forest:
- Avg. Decrease in MSE when a predictor is excluded.
XGBoost:
- Number of times a variable is used to split the data.
Cubist:
- Linear combination of the percentage of times each variable is used.

Findings

Results by MAPE

Findings

Variable Importance - Best Model

Conclusion