Research Question
Using a dataset displaying information about the most popular movies in the last 30 years, we hope to better understand how a movie’s genre and budget affect its US box office gross revenue. Box office gross revenue is defined as the total earnings collected at US movie theaters as determined by ticket sales. There were movies across 17 different genres included in the dataset, but we narrowed down the genres to only include drama, animation, comedy, action, and horror. These genres were chosen based on personal interest.
Information about how genre and budget affect revenue is valuable to people in the movie industry when they decide what kind of movie to make. This information is also of interest to the general public, as it may explain why movie companies choose to make more of one type of movie than another. Revenue may also reflect audiences’ preferences for a certain genre or level of budget and production value.
Data Information
Context
Popular movies vary greatly in their resources used during production. Some movies are similar to Toy Story, which took 5 years and $30 million to animate and produce. Others can be like Tangerine, which was shot on three iPhones with a budget of $100,000.
Most would assume that the higher the budget of a movie, the more it will gross in theaters. We will investigate whether the data support this assumption. We will also take genre into account, as it may have an additional impact on how much a movie makes.
Sources
The dataset used came from Kaggle and is the “Movie Industry: Three decades of movies” dataset¹. The creator of this dataset scraped the data automatically using a Python script and IMDb’s advanced search tool². The 220 most popular films from each year from 1986 to 2016 were recorded. The criteria for “most popular” are unknown, as the IMDb advanced search tool does not reveal exactly what is meant by this term.
Limitations
The gross revenue represented in the dataset only includes the revenue made in US theater sales³. So, our model will not account for any money made outside of the theater through mediums such as public TV, video rentals, Netflix, or Hulu. This is an important limitation to be aware of, as the popularity of movie-viewing sources such as Netflix has risen dramatically in the 30-year span represented in the data. Some of the most popular movies in recent years were released exclusively on websites. So, this model is representative of only one way movies can make money: movie theater ticket sales.
Another limitation of the gross revenue measure is that IMDb only collects the “Gross USA” value. This may exclude significant international success. For example, Transformers: Age of Extinction grossed only $241.2 million domestically (US and Canada) and received poor reviews, but internationally, the film grossed $763.8 million⁴.
Additionally, findings from the 220 most popular movies per year may not generalize to all movies. For example, less popular movies probably tend to have lower budgets and revenues, and might show a different interaction with genre. We cannot know with certainty that any patterns found in this data will apply to less popular movies. However, our working guess is that any patterns found between budget, genre, and revenue will also hold for less popular movies.
Finally, 1,750 out of the 5,409 movies in the genres we are exploring have a budget of 0. We removed these movies with 0 budget from our visualization and analysis because movies with a 0 budget indicate that the budget was not known by IMDb for that movie. There likely exists a systematic reason why some movies did not have a budget associated with them. So, this presents an unknown bias in our data.
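The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the first two rows come from the report's sample table, the third row is a hypothetical record with an unknown budget, and the names `movies`, `GENRES`, and `keep_known_budgets` are assumptions made for this example.

```python
# Illustrative rows: the first two come from the report's sample table;
# the third is a HYPOTHETICAL record whose budget is recorded as 0 (unknown).
movies = [
    {"name": "Top Gun", "genre": "Action", "budget": 15000000, "gross": 179800601},
    {"name": "Platoon", "genre": "Drama", "budget": 6000000, "gross": 138530565},
    {"name": "Hypothetical Film", "genre": "Comedy", "budget": 0, "gross": 1000000},
]

# The five genres kept for the analysis.
GENRES = {"Drama", "Animation", "Comedy", "Action", "Horror"}


def keep_known_budgets(rows):
    """Keep rows in the five studied genres whose budget is recorded (> 0)."""
    return [r for r in rows if r["genre"] in GENRES and r["budget"] > 0]


kept = keep_known_budgets(movies)
print(f"kept {len(kept)} of {len(movies)} rows")
```

Applied to the full data, this kind of filter would take the 5,409 movies in the five genres down to 5,409 - 1,750 = 3,659 movies.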
Hypothesis
We predict that budget and revenue are positively correlated across all genres. We also think that action and animation films will generally have the highest gross revenues and budgets.
| name | gross | budget | genre | log10_budget | log10_gross |
|---|---|---|---|---|---|
| Ferris Bueller’s Day Off | 70136369 | 6000000 | Comedy | 6.778151 | 7.845943 |
| Top Gun | 179800601 | 15000000 | Action | 7.176091 | 8.254791 |
| Aliens | 85160248 | 18500000 | Action | 7.267172 | 7.930237 |
| Platoon | 138530565 | 6000000 | Drama | 6.778151 | 8.141546 |
| Blue Velvet | 8551228 | 6000000 | Drama | 6.778151 | 6.932029 |
The variables available are the name of the movie, the gross revenue, the budget, the genre, and the log10 transformations of the budget and gross revenue.
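As a quick check of the log10 columns, the transformation can be reproduced from the raw dollar values. The sketch below uses rows from the sample table; the helper name `add_log10` is an assumption for illustration and is not taken from the original analysis.

```python
import math

# (name, gross, budget) copied from the sample rows in the table above.
sample = [
    ("Ferris Bueller's Day Off", 70136369, 6000000),
    ("Top Gun", 179800601, 15000000),
    ("Aliens", 85160248, 18500000),
]


def add_log10(gross, budget):
    """Return (log10_budget, log10_gross); both inputs must be positive."""
    if gross <= 0 or budget <= 0:
        raise ValueError("log10 is undefined for zero or negative dollar amounts")
    return math.log10(budget), math.log10(gross)


transformed = {name: add_log10(gross, budget) for name, gross, budget in sample}
for name, (log10_budget, log10_gross) in transformed.items():
    print(f"{name}: log10_budget={log10_budget:.6f}, log10_gross={log10_gross:.6f}")
```

The guard against non-positive values is one reason movies with a recorded budget of 0 cannot be kept in a log10 analysis.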
Summary Statistics and Observations
Number of movies, means and standard deviations of numerical values
| num_movies | mean_revenue | sd_revenue | mean_budget | sd_budget |
|---|---|---|---|---|
| 3659 | 47604601 | 67845023 | 37268166 | 41238844 |
There are 3,659 movies being analyzed, and the mean revenue exceeds the mean budget by about 10 million dollars. Both variables have a large spread: each standard deviation is larger than the corresponding mean, which suggests that both distributions are right-skewed.
Count of number of movies in each genre
| genre | count |
|---|---|
| Action | 1099 |
| Animation | 229 |
| Comedy | 1310 |
| Drama | 793 |
| Horror | 228 |
There are about 1,300 comedies, 1,100 action movies, 800 dramas, and about 230 each of animation and horror movies.
Correlation between budget and revenue
| correlation |
|---|
| 0.6779195 |
The correlation of about 0.68 indicates a moderately strong positive linear relationship between budget and revenue.
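For reference, the correlation coefficient can be computed as below. This is a sketch under assumptions: the report does not say whether the correlation was computed on raw or log10 values, and the five rows used here are only the sample rows shown earlier, so the printed number will not match the full-data value of 0.678. The function name `pearson_r` is chosen for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Budgets and gross revenues of the five sample rows (not the full dataset).
budgets = [6000000, 15000000, 18500000, 6000000, 6000000]
grosses = [70136369, 179800601, 85160248, 138530565, 8551228]

r = pearson_r(budgets, grosses)
print(f"sample correlation: {r:.3f}")
```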
Visualizations and Observations
Histogram of gross revenues: Most of the movies seem to be in the range of 10^6 (1 million) to 10^8 (100 million) dollars in revenue.
Box plot of revenues by genre: Animation has the highest median revenue, and drama has the lowest. All genres’ median revenues are between 10^7 (10 million) and 10^8 (100 million) dollars. All genres except for animation have a very large spread and many outliers. Animation has a smaller, higher spread of revenues, with only 3 outliers.
Scatter plot comparing budget, genre, and revenue: Across all genres, the relationship between budget and revenue is positive. The line of best fit for action movies has the steepest slope, although all slopes are fairly similar. Due to the large cluster of points in the upper right corner of the plot, we can tell that most of these popular movies have both high budgets and high revenues. There is a notable cluster of red points in the upper right corner, meaning that action movies seem to have both comparatively high budgets and revenues.
Components
Our multiple regression model uses log10 budget (numerical) and genre (categorical) as explanatory variables, and log10 gross revenue as the outcome variable. It is a parallel slopes model: each genre has its own intercept, but all genres share a single slope for log10 budget.
Fitting parallel slopes model and regression table
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 0.585 | 0.169 | 3.457 | 0.001 | 0.253 | 0.917 |
| genreAnimation | 0.216 | 0.052 | 4.191 | 0.000 | 0.115 | 0.318 |
| genreComedy | 0.067 | 0.030 | 2.205 | 0.028 | 0.007 | 0.127 |
| genreDrama | -0.151 | 0.034 | -4.396 | 0.000 | -0.219 | -0.084 |
| genreHorror | 0.168 | 0.054 | 3.142 | 0.002 | 0.063 | 0.273 |
| log10_budget | 0.903 | 0.022 | 40.732 | 0.000 | 0.860 | 0.947 |
Parallel slopes model visualization
Interpretations
We used the parallel slopes model for multiple regression to investigate the relationship between movie genre, budget, and revenue.
log10_budget
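For all genres, the slope of the line modeling log10 budget vs. log10 revenue is 0.903. This means that for every one-unit increase in log10 budget (that is, a tenfold increase in budget), there is an associated increase of, on average, 0.903 in log10 gross revenue, holding genre fixed. In dollar terms, a tenfold increase in budget is associated with revenue that is about 10^0.903 ≈ 8 times higher.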
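Because the model is linear on the log10 scale, log10(revenue) = a + 0.903 × log10(budget) is equivalent to revenue = 10^a × budget^0.903, so scaling the budget by a factor k scales predicted revenue by k^0.903. A small sketch of that arithmetic follows; the function name `revenue_multiplier` is an assumption for illustration.

```python
SLOPE = 0.903  # log10_budget estimate from the regression table


def revenue_multiplier(budget_ratio, slope=SLOPE):
    """Factor by which predicted revenue changes when budget is scaled by budget_ratio."""
    if budget_ratio <= 0:
        raise ValueError("budget_ratio must be positive")
    return budget_ratio ** slope


tenfold = revenue_multiplier(10)  # effect of a 10x larger budget
doubled = revenue_multiplier(2)   # effect of a 2x larger budget
print(f"10x budget -> {tenfold:.2f}x revenue; 2x budget -> {doubled:.2f}x revenue")
```

Since the slope is below 1, predicted revenue grows slightly less than proportionally with budget.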
Intercepts
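The first 5 rows of the estimate column of the regression table determine the intercepts of the regression lines for each of the genres analyzed. The first row is the intercept for the baseline group, action movies. It indicates that an action movie with a budget of 0 log10 dollars (that is, $1) would have a predicted revenue of 0.585 log10 dollars (about $3.85). Because real budgets are far larger than $1, this intercept is an extrapolation and mainly serves to position the line.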
The following 4 rows indicate the offsets of the intercepts for the remaining genres. These offsets can be interpreted as the average difference in log10 revenue that each genre makes relative to the baseline of action movies.
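So, animation’s intercept is (0.585 + 0.216) = 0.801. The intercepts, ranked from highest to lowest, and their equivalents in non-log10 dollars (10 raised to the intercept, which is the predicted revenue at a budget of $1), are:
| genre | intercept (log10 dollars) | intercept (dollars) |
|---|---|---|
| Animation | 0.801 | 6.32 |
| Horror | 0.753 | 5.66 |
| Comedy | 0.652 | 4.49 |
| Action | 0.585 | 3.85 |
| Drama | 0.434 | 2.72 |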
Animation had the highest intercept and drama had the lowest intercept. This means that, controlling for budget, animated movies make the most revenue, followed by horror, comedy, action, and drama.
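The intercept arithmetic and the resulting ranking can be reproduced from the regression table. The sketch below only reuses the published estimates; the names `intercept` and `predict_gross`, and the $30 million example budget, are assumptions for illustration rather than part of the original analysis.

```python
import math

BASELINE_INTERCEPT = 0.585  # intercept row (action movies, the baseline)
SLOPE = 0.903               # log10_budget row
OFFSETS = {                 # genre rows, relative to the action baseline
    "Action": 0.0,
    "Animation": 0.216,
    "Comedy": 0.067,
    "Drama": -0.151,
    "Horror": 0.168,
}


def intercept(genre):
    """Intercept of a genre's regression line on the log10 scale."""
    return BASELINE_INTERCEPT + OFFSETS[genre]


def predict_gross(genre, budget):
    """Predicted gross revenue in dollars for a given genre and dollar budget."""
    log10_gross = intercept(genre) + SLOPE * math.log10(budget)
    return 10 ** log10_gross


# Rank genres from highest to lowest intercept.
ranked = sorted(OFFSETS, key=intercept, reverse=True)
for genre in ranked:
    print(f"{genre}: intercept={intercept(genre):.3f}, "
          f"predicted gross at $30M budget={predict_gross(genre, 30_000_000):,.0f}")
```

Because the slopes are parallel, the ratio between two genres' predicted revenues is the same at every budget: for example, animation is predicted to earn 10^0.216 times what an action movie earns at the same budget.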
Comparison of multiple regression and EDA results
Both the multiple regression analysis and the EDA box plot confirm that animated movies tend to make the most revenue. However, the boxplot was not controlling for budget, while our multiple regression model is. This shows that whether or not budget is taken into consideration, animated movies tend to make the most revenue. While in the EDA scatterplot action movies seemed to have one of the highest revenues, when controlling for budget in the parallel slopes model, action movies actually had one of the lowest revenues.
The overall positive slope of the relationship between budget and revenue was confirmed in both the EDA scatterplot and in the parallel slopes model.
Limitations of parallel slopes analysis
For the final draft, we may try an interaction model, because the parallel slopes model restricts all genres to the same slope. As a result, some information about the true relationship between budget and revenue within each genre is lost.
Taking budget into account, films over the last 30 years rank by genre from most to least revenue as follows: animation, horror, comedy, action, then drama. That is, for a given budget, animation films tended to make the most money and drama films the least. Across all genres, higher movie budgets are associated with higher revenue.
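To make the restriction concrete, the sketch below builds the predictor row for one movie under each model. It is a hypothetical illustration of the two model structures, not fitted output: the function name `design_row` and the column ordering are assumptions, and no interaction coefficients are estimated here.

```python
GENRES = ["Action", "Animation", "Comedy", "Drama", "Horror"]  # Action is the baseline


def design_row(genre, log10_budget, interaction=False):
    """Predictor values for one movie.

    Parallel slopes: intercept, 4 genre dummies, log10_budget (6 columns).
    Interaction: the same 6 columns plus dummy * log10_budget for each
    non-baseline genre (10 columns), so each genre can have its own slope.
    """
    if genre not in GENRES:
        raise ValueError(f"unknown genre: {genre}")
    dummies = [1.0 if genre == g else 0.0 for g in GENRES[1:]]
    row = [1.0] + dummies + [log10_budget]
    if interaction:
        row += [d * log10_budget for d in dummies]
    return row


parallel = design_row("Drama", 7.0)
interacted = design_row("Drama", 7.0, interaction=True)
print(len(parallel), parallel)
print(len(interacted), interacted)
```

The parallel slopes row has six entries, matching the six terms in the regression table; the interaction row adds four slope-offset columns, one per non-baseline genre.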
Note: This section is to be skipped for the initial submission and completed for the resubmission.