DATA 698 Final Presentation
Josh Iden
Measuring “Star Power”: Predicting Movie Box Office Revenue Based on
Directors’ and Leading Actors’ Recent Success
Film Industry Statistics
- 42.5B USD to the US Economy (as of 2022)
- 2.5 million jobs (Zane, 2023)
- 36% of movies make a profit (Lash & Zhao, 2016)
- Avg. tenure of executives:
- 1940s: 20 years
- 1980s: 4 years (Ravid, 1999)
Literature Review
- Linear Regression using Rentals as response
- Economic Theory vs. Communication Theory
- Static factors vs. Dynamic factors
- Neural Nets, Random Forest
- International Films
“Star Power”
- Binary variables or Rankings
- IMDb popularity
- Cumulative lifetime earnings
- Classification modeling
- Full cast
Hypothesis and Methodology
- Tree-based models
- Continuous and Categorical
- Ensemble models
- Bagged Trees, Random Forest, XGBoost, Cubist
- High accuracy and generalizability
- Hypothesis: Star Power is an important predictor of revenue
- Variable Importance Measures
- Mean Absolute Percentage Error (MAPE)
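As a rough illustration of the error metric named above, MAPE averages the absolute error expressed as a fraction of the actual value. The sketch below is in Python; the function name `mape` and the sample revenues are hypothetical and are not taken from the project.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes every actual value is non-zero.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical revenues in millions of USD (illustrative only).
actual_revenue = [100.0, 50.0, 200.0]
predicted_revenue = [110.0, 45.0, 180.0]
error_pct = mape(actual_revenue, predicted_revenue)
```

Because each error is divided by the actual value, a 10M USD miss on a small release counts for more than the same miss on a blockbuster.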
Data Acquisition
- Written and cleaned in Python
- Web Scraping: IMDb, Box Office Mojo
- API: OMDb
Data Acquisition
- Box Office Mojo
- Unique IMDb IDs for the 200 top-earning movies per year
- IMDb
- Production budgets
- Leading actor and Director Unique IDs
- OMDb
- Must be queried by unique IMDb ID
- Title, Year, Rating, Release Date, Runtime, Genre, Director, Writer,
Actors, Plot, Language, Country, Awards, Poster, MetaRating, Domestic
Box Office Revenue
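A minimal sketch of the OMDb lookup step, assuming the API's `i` (IMDb ID) and `apikey` query parameters and the `imdbID`, `Title`, `Year`, and `BoxOffice` response fields. The helper names, the placeholder ID `tt0000000`, and the sample response body are hypothetical. No network call is made here.

```python
import json
from urllib.parse import urlencode

OMDB_URL = "http://www.omdbapi.com/"


def build_omdb_url(imdb_id, api_key):
    """Build the request URL for one movie, looked up by IMDb ID."""
    return OMDB_URL + "?" + urlencode({"i": imdb_id, "apikey": api_key})


def parse_box_office(raw):
    """Convert a string such as "$1,234,567" to an int; None if unavailable."""
    if not raw or raw == "N/A":
        return None
    return int(raw.replace("$", "").replace(",", ""))


def parse_movie(payload):
    """Keep a few fields from a decoded OMDb JSON response (assumed names)."""
    return {
        "imdb_id": payload.get("imdbID"),
        "title": payload.get("Title"),
        "year": payload.get("Year"),
        "box_office": parse_box_office(payload.get("BoxOffice")),
    }


# Made-up response body standing in for a live API call.
sample_body = (
    '{"imdbID": "tt0000000", "Title": "Example Film", '
    '"Year": "2015", "BoxOffice": "$1,234,567"}'
)
movie = parse_movie(json.loads(sample_body))
url = build_omdb_url("tt0000000", "YOUR_API_KEY")
```

In practice the URL would be fetched once per IMDb ID collected from Box Office Mojo, and the parsed rows joined to the scraped budget data.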
Combined Dataset
![]()
Feature Engineering
![]()
Feature Engineering
- R
- Subset data to 2012-2022 to have a full 10 years of data beginning in 2012
- Combine low-frequency values
- Country: United States vs. Foreign
- Language: English vs. Foreign
- Genres
- Rating
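The grouping step above can be sketched as follows. This is an illustration in Python rather than the R used for this stage. The helper names, the `min_count` threshold, the "Other" label, and the sample rows are all assumptions.

```python
from collections import Counter


def lump_rare(values, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def binarize(values, keep, other_label="Foreign"):
    """Collapse a category to `keep` vs. everything else."""
    return [v if v == keep else other_label for v in values]


# Hypothetical rows (illustrative only).
countries = ["United States", "United Kingdom", "United States", "France"]
genres = ["Action", "Action", "Drama", "Drama", "Western"]

country_group = binarize(countries, keep="United States")
genre_group = lump_rare(genres, min_count=2)
```

Collapsing sparse levels this way keeps tree-based models from splitting on categories that appear in only a handful of films.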
Exploratory Data Analysis
![]()
Exploratory Data Analysis
![]()
Exploratory Data Analysis
![]()
Exploratory Data Analysis
![]()
Summary Statistics
![]()
Model Prep
- Handling missing values
- Transforming the response
- Feature scaling
Handling Missing Values
- Approaches
- Drop observations
- 5-nearest neighbors imputation
- Zero imputation and categorical budget variable (Y/N)
- Transform budget into a categorical variable by IQR
- Best results: zero imputation + categorical variable
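A minimal sketch of the best-performing approach, assuming the Y/N variable records whether a budget was observed. The function name and sample budgets are hypothetical.

```python
def impute_budget(budgets):
    """Zero-impute missing budgets and flag which values were observed.

    Returns (filled, has_budget), where has_budget is "Y" or "N".
    """
    filled = [0.0 if b is None else float(b) for b in budgets]
    has_budget = ["N" if b is None else "Y" for b in budgets]
    return filled, has_budget


# Hypothetical budgets in USD; None marks a missing value.
raw_budgets = [150_000_000, None, 20_000_000]
budget_filled, budget_known = impute_budget(raw_budgets)
```

The indicator lets a tree split on "budget missing" first, so the imputed zeros are not mistaken for genuinely tiny budgets.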
Feature Scaling
- Scales each predictor to a standard deviation of 1
- Improves model performance
- No centering performed
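Scaling without centering can be sketched as dividing each predictor by its sample standard deviation while leaving the mean where it is. The Python below is an illustration only. The function name and sample runtimes are hypothetical, and the project itself may have done this through caret's preprocessing in R.

```python
import statistics


def scale_no_center(values):
    """Divide by the sample standard deviation without subtracting the mean."""
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("cannot scale a constant column")
    return [v / sd for v in values]


# Hypothetical runtimes in minutes (illustrative only).
runtimes = [90.0, 100.0, 110.0, 120.0]
scaled = scale_no_center(runtimes)
```

After scaling, the column has a standard deviation of 1 but a non-zero mean, so zero-imputed values remain at exactly zero.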
Modeling
- Caret Package in R
- Model Versions
- Hyperparameter Tuning
- Variable Importance
- Findings
Model Versions
- Shuffle & 80/20 Split
- 10-fold Cross-Validation
- Five “Focus” Models
![]()
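The shuffle, 80/20 split, and 10-fold cross-validation listed above can be sketched with index lists. This is a plain-Python illustration, not the caret workflow used in the project. The function names, seed, and row count are assumptions.

```python
import random


def shuffle_split(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_folds(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Hypothetical dataset of 100 rows.
train_idx, test_idx = shuffle_split(n_rows=100)
cv_splits = list(k_folds(train_idx, k=10))
```

The held-out 20% is never used in the folds, so it stays available for a final estimate of error.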
Hyperparameter Tuning
![]()
Variable Importance
- Bagged Trees & Random Forest:
- Avg. increase in MSE when a predictor is excluded.
- XGBoost:
- Number of times a variable is used to split the data.
- Cubist:
- Linear combination of the percentage of times each variable is
used.
Variable Importance
![]()
Findings
![]()
Findings
- Variable Importance - Best Model
![]()