Charlie Bucket (Peter Ostrum) in Willy Wonka & the Chocolate Factory (1971) after finding the Golden Ticket
Research Question
Using a dataset displaying information about the most popular movies in the last 30 years, we hope to better understand how a movie’s genre and budget affect its US box office gross revenue. Box office gross revenue is defined as the total earnings collected at US movie theaters as determined by ticket sales. There were movies across 17 different genres included in the dataset, but we narrowed down the genres to only include drama, animation, comedy, action, and horror. These genres were chosen based on personal interest.
Information about genre and budget impact on revenue is valuable to people in the movie industry when making decisions about what kind of movie they want to make. This information is also of interest to the general public, as it may explain why movie companies choose to make more of one type of movie over another. Revenue may also be a reflection of audience’s preferences for a certain genre or budget/production value.
Data Information
Context
Popular movies vary greatly in their resources used during production. Some movies are similar to Toy Story, which took 5 years and $30 million to animate and produce. Others can be like Tangerine, which was shot on three iPhones with a budget of $100,000.
Intuitively, it is expected that the higher the budget of the movie, the more it will gross in theaters. We will investigate if this statement is statistically accurate. We will also take into account genre, as it may have an additional impact on how much a movie makes.
Sources
The dataset used came from Kaggle and is the “Movie Industry: Three decades of movies” dataset [1]. The creator of this dataset scraped the data automatically using a Python script and IMDb’s advanced search tool [2]. The 220 most popular films from each year from 1986 to 2016 were recorded. The criteria for “most popular” are unknown, as the IMDb advanced search tool does not reveal what exactly is meant by this term.
Limitations
The gross revenue represented in the dataset only includes the revenue made in US theater sales [3]. So, our model will not account for any money made outside of the theater through mediums such as public TV, video rentals, Netflix, or Hulu. This is an important limitation to be aware of, as the popularity of movie streaming platforms such as Netflix has risen dramatically in the 30-year span represented in the data. Some of the most popular movies in recent years were released exclusively on streaming platforms. So, this model is representative of only one way movies can make money: movie theater ticket sales.
Another limitation of the gross revenue measure is that IMDb only collects the amount grossed in the USA. This may exclude significant success of films internationally. For example, Transformers: Age of Extinction only grossed $241.2 million domestically (US and Canada) and received poor reviews, but internationally, the film grossed $763.8 million [4].
Additionally, our analysis of the 220 most popular movies from each year may not generalize to all movies. For example, less popular movies probably tend to have lower budgets and revenues, and might show a different interaction with genre. We cannot know with certainty that any patterns found with this data will apply to less popular movies. However, our best guess is that any patterns found between budget, genre, and revenue will roughly hold for less popular movies.
Finally, 1,750 out of the 5,409 movies in the genres we are exploring have a budget of 0. We removed these movies with 0 budget from our visualization and analysis because movies with a 0 budget indicate that the budget was not known by IMDb for that movie. There likely exists a systematic reason why some movies did not have a budget associated with them. So, this presents an unknown bias in our data.
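The filtering described above can be sketched in a few lines. This is only an illustration under our own assumptions: the field names and the two "Hypothetical Film" rows are invented to exercise the filter, and the original analysis may have been carried out with different tooling.

```python
# Genres kept for the analysis, as listed in the Research Question.
KEPT_GENRES = {"Action", "Animation", "Comedy", "Drama", "Horror"}

movies = [
    {"name": "Top Gun", "genre": "Action", "budget": 15_000_000, "gross": 179_800_601},
    {"name": "Platoon", "genre": "Drama", "budget": 6_000_000, "gross": 138_530_565},
    # Hypothetical rows, invented only to show what gets dropped:
    {"name": "Hypothetical Film A", "genre": "Comedy", "budget": 0, "gross": 1_000_000},
    {"name": "Hypothetical Film B", "genre": "Western", "budget": 5_000_000, "gross": 2_000_000},
]

def keep(movie):
    """Keep a movie only if its genre is analyzed and its budget is known (non-zero)."""
    return movie["genre"] in KEPT_GENRES and movie["budget"] > 0

analysis_set = [m for m in movies if keep(m)]
print([m["name"] for m in analysis_set])

# Bookkeeping from the text: 5,409 movies in the five genres minus 1,750 with a 0 budget.
remaining = 5409 - 1750
print(remaining)
```

The remaining count agrees with the number of movies reported in the summary statistics.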
Hypothesis
We predict that budget and revenue are positively correlated across all genres. We also think that action and animation films will generally have the highest gross revenues and budgets.
| name | gross | budget | genre | log10_budget | log10_gross |
|---|---|---|---|---|---|
| Ferris Bueller’s Day Off | 70136369 | 6000000 | Comedy | 6.778151 | 7.845943 |
| Top Gun | 179800601 | 15000000 | Action | 7.176091 | 8.254791 |
| Aliens | 85160248 | 18500000 | Action | 7.267172 | 7.930237 |
| Platoon | 138530565 | 6000000 | Drama | 6.778151 | 8.141546 |
| Blue Velvet | 8551228 | 6000000 | Drama | 6.778151 | 6.932029 |
The variables available are the name of the movie, the gross revenue, the budget, the genre, and the log10 transformations of the budget and gross revenue.
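As a quick check on the two transformed columns, the sketch below recomputes them for a few rows of the sample table. The function name `add_log10_columns` is our own, chosen for illustration.

```python
import math

# Rows copied from the sample table: (name, gross, budget) in US dollars.
sample_rows = [
    ("Ferris Bueller's Day Off", 70136369, 6000000),
    ("Top Gun", 179800601, 15000000),
    ("Aliens", 85160248, 18500000),
]

def add_log10_columns(gross, budget):
    """Return (log10_budget, log10_gross) for one movie."""
    # log10 is only defined for positive values, which is one reason
    # movies with a recorded budget of 0 cannot be placed on this scale.
    return math.log10(budget), math.log10(gross)

for name, gross, budget in sample_rows:
    log10_budget, log10_gross = add_log10_columns(gross, budget)
    print(f"{name}: log10_budget={log10_budget:.6f}, log10_gross={log10_gross:.6f}")
```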
Summary Statistics and Observations
Number of movies, means, and standard deviations of numerical values:
| num_movies | mean_revenue | sd_revenue | mean_budget | sd_budget |
|---|---|---|---|---|
| 3659 | $47,604,601 | $67,845,023 | $37,268,166 | $41,238,844 |
There are over 3,000 movies being analyzed, and the mean revenue is more than the mean budget by about 10 million dollars. Both variables have a large spread.
Count of movies in each genre:
| genre | count |
|---|---|
| Action | 1099 |
| Animation | 229 |
| Comedy | 1310 |
| Drama | 793 |
| Horror | 228 |
There are about 1,300 comedies, 1,100 action movies, and 800 dramas, but only about 230 each of animation and horror movies.
Correlation between budget and revenue:
| correlation |
|---|
| 0.678 |
A correlation of 0.678 indicates a moderately strong positive linear relationship between budget and revenue.
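For readers who want to see how a correlation coefficient like this is computed, here is a minimal sketch of the Pearson formula. The helper `pearson_r` is our own, and it is applied only to the five sample rows shown earlier, so its output will not reproduce the 0.678 obtained from the full dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Budget and gross (US dollars) for the five movies in the sample table.
budgets = [6000000, 15000000, 18500000, 6000000, 6000000]
grosses = [70136369, 179800601, 85160248, 138530565, 8551228]

r_sample = pearson_r(budgets, grosses)
print(f"correlation on the five sample rows: {r_sample:.3f}")
```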
Visualizations and Observations
Histogram of gross revenues:
Most of the movies seem to be in the range of \(10^6\) (1 million) to \(10^8\) (100 million) dollars in revenue.
Box plot of revenues by genre: Animation has the highest median revenue, and drama has the lowest. All genres’ median revenues are between \(10^7\) (10 million) and \(10^8\) (100 million) dollars. All genres except for animation have a very large spread and many outliers. Animation has a smaller, higher spread of revenues, with only 3 outliers.
Scatter plot comparing budget, genre, and revenue: Across all genres, the relationship between budget and revenue is positive. The line of best fit for action movies has the steepest slope, although all slopes are fairly similar. Due to the large cluster of points in the upper right corner of the plot, we can tell that most of these popular movies have both high budgets and high revenues. There is a notable cluster of red points in the upper right corner, meaning that action movies seem to have both comparatively high budgets and revenues.
Components
Our multiple regression model uses log10 budget (numerical) and genre (categorical) as explanatory variables, and log10 gross revenue as the outcome variable. It is an interaction model.
Fitting interaction model and regression table:
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | -0.443 | 0.334 | -1.327 | 0.185 | -1.098 | 0.212 |
| genreAnimation | 0.903 | 0.971 | 0.930 | 0.352 | -1.001 | 2.806 |
| genreComedy | 1.109 | 0.423 | 2.623 | 0.009 | 0.280 | 1.939 |
| genreDrama | 1.536 | 0.453 | 3.391 | 0.001 | 0.648 | 2.425 |
| genreHorror | 2.225 | 0.669 | 3.327 | 0.001 | 0.914 | 3.535 |
| log10_budget | 1.039 | 0.044 | 23.603 | 0.000 | 0.953 | 1.125 |
| genreAnimation:log10_budget | -0.092 | 0.125 | -0.731 | 0.465 | -0.338 | 0.154 |
| genreComedy:log10_budget | -0.138 | 0.057 | -2.418 | 0.016 | -0.249 | -0.026 |
| genreDrama:log10_budget | -0.228 | 0.061 | -3.718 | 0.000 | -0.349 | -0.108 |
| genreHorror:log10_budget | -0.284 | 0.094 | -3.014 | 0.003 | -0.469 | -0.099 |
The visualization of this interaction model is the scatterplot displayed in the EDA (Budget vs. Revenue by genre).
Interpretations
We used the interaction model for multiple regression to investigate the relationship between movie genre, budget, and revenue.
Intercepts
The first 5 rows of the estimate column on the regression table indicate the intercepts of the regression lines for each of the genres analyzed. The first row is the baseline group, action movies. This indicates that an action movie with a budget of 0 log10 dollars would have a revenue of -0.443 log10 dollars.
The following 4 rows indicate the offsets of the intercepts for the remaining genres. These offsets can be interpreted as the average difference in log10 revenue that each genre makes relative to the baseline of action movies.
So, animation’s intercept is (-0.443 + 0.903) = 0.46. The intercepts, ranked from highest to lowest, are: horror (1.782), drama (1.093), comedy (0.666), animation (0.460), and action (-0.443).
Horror had the highest intercept and action had the lowest intercept. However, the intercept has no practical interpretation here. It corresponds to a log10 budget of 0, which is a budget of $1, far below any realistic movie budget, so interpreting it would be extrapolation. A true $0 budget cannot even be represented on the log10 scale, because the logarithm of 0 is undefined.
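The offset arithmetic for the intercepts can be reproduced directly from the regression table. The sketch below uses the rounded estimates as printed, so results are accurate only to about three decimals; the variable names are our own.

```python
# Estimates copied from the regression table (action is the baseline genre).
BASELINE_INTERCEPT = -0.443
INTERCEPT_OFFSETS = {
    "Animation": 0.903,
    "Comedy": 1.109,
    "Drama": 1.536,
    "Horror": 2.225,
}

# Each genre's intercept is the baseline intercept plus that genre's offset.
intercepts = {"Action": BASELINE_INTERCEPT}
for genre, offset in INTERCEPT_OFFSETS.items():
    intercepts[genre] = round(BASELINE_INTERCEPT + offset, 3)

ranked = sorted(intercepts, key=intercepts.get, reverse=True)
for genre in ranked:
    print(f"{genre}: {intercepts[genre]:.3f}")
```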
Slopes
The last 5 rows of the estimate column on the regression table indicate the slopes of the regression lines for each of the genres analyzed. The row log10_budget is the baseline group, action movies. For action movies, taking into account all other variables, for every increase in 1 log10 dollar in budget, there’s an associated increase of, on average, 1.04 log10 dollars in gross revenue.
The last 4 rows indicate the offsets of the slopes for the remaining genres. These offsets can be interpreted as the average difference in slope that each genre has relative to the baseline of action movies.
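Using the rounded estimates in the regression table, the slopes from highest to lowest are approximately: action (1.039), animation (0.947), comedy (0.901), drama (0.811), and horror (0.755).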
The same interpretation as with the action movie slope can be applied to the other genres. For example, for animation movies, taking into account all other variables, for every increase in 1 log10 dollar in budget, there’s an associated increase of, on average, 0.948 log10 dollars in gross revenue.
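The per-genre slopes follow from the same baseline-plus-offset arithmetic. This sketch again relies on the rounded table estimates, so a value may differ in the third decimal from one computed with unrounded estimates. It also prints what a one-unit increase in log10 budget (a tenfold larger budget) implies for predicted revenue.

```python
# Estimates copied from the regression table (action is the baseline genre).
BASELINE_SLOPE = 1.039
SLOPE_OFFSETS = {
    "Animation": -0.092,
    "Comedy": -0.138,
    "Drama": -0.228,
    "Horror": -0.284,
}

# Each genre's slope is the baseline slope plus that genre's interaction offset.
slopes = {"Action": BASELINE_SLOPE}
for genre, offset in SLOPE_OFFSETS.items():
    slopes[genre] = round(BASELINE_SLOPE + offset, 3)

ranked = sorted(slopes, key=slopes.get, reverse=True)
for genre in ranked:
    # One extra log10 dollar of budget means a tenfold budget, which
    # multiplies predicted revenue by 10 ** slope.
    factor = 10 ** slopes[genre]
    print(f"{genre}: slope {slopes[genre]:.3f}, tenfold budget -> x{factor:.2f} revenue")
```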
Comparison of multiple regression and EDA results
In the box plot of revenues by genre in the EDA, it seemed like animation would have the largest revenue, which we expected to translate into the highest slope and intercept. However, the interaction model revealed that action movies had the highest slope and horror movies had the highest intercept. It makes sense that action would have the highest slope because the action box plot had the highest maximum values of all the genres, which would drive up the slope. The box plot for comedy has a median close to the middle of all the medians among the genres. The multiple regression model confirms comedy as a middling money-maker in relation to budget.
The scatter plot from the EDA is identical to the interaction model, so it’s helpful to have the slope and intercept values from the multiple regression model. We correctly estimated that action movies have the steepest slope. The EDA scatter plot does also look like horror would have the highest intercept.
The overall positive slope of the relationship between budget and revenue was confirmed in the interaction model.
Limitations of interaction model analysis
The interaction model gives each genre its own slope and its own intercept. This creates a complexity that can be difficult to interpret. It becomes hard to definitively say which genre makes the most money for the starting budget. For example, action movies have the highest slope, but horror has the highest intercept, so which genre is predicted to earn more depends on the budget being considered.
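One way to make this trade-off concrete is to find the budget at which each genre's fitted line crosses the action line. The sketch below is an illustration based on the rounded point estimates in the regression table; it ignores the uncertainty in those estimates, and the function name is our own.

```python
# Offsets relative to the action baseline, copied from the regression table.
INTERCEPT_OFFSETS = {"Animation": 0.903, "Comedy": 1.109, "Drama": 1.536, "Horror": 2.225}
SLOPE_OFFSETS = {"Animation": -0.092, "Comedy": -0.138, "Drama": -0.228, "Horror": -0.284}

def crossover_log10_budget(genre):
    """log10 budget where the genre's line meets the action line.

    The two lines differ by (intercept offset) + (slope offset) * x,
    which equals zero at x = -intercept_offset / slope_offset.
    """
    return -INTERCEPT_OFFSETS[genre] / SLOPE_OFFSETS[genre]

for genre in INTERCEPT_OFFSETS:
    x = crossover_log10_budget(genre)
    # Below this budget the genre's predicted revenue is above action's;
    # above it, action's predicted revenue is higher.
    print(f"{genre}: crosses the action line at log10 budget {x:.2f} (about ${10 ** x:,.0f})")
```

For horror, the crossover falls at a log10 budget of roughly 7.83, so the fitted action line only overtakes the fitted horror line at budgets in the tens of millions of dollars.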
According to our model, a given proportional increase in a movie’s budget is associated with the largest proportional increase in revenue for action movies. Because both variables are on the log10 scale, the slopes describe percentage changes rather than dollar-for-dollar changes. So, action movies give you the most “bang for your buck” in terms of revenue growth (out of the analyzed genres). Action movies’ high payoff is followed by animation, comedy, drama, and horror, which has the lowest revenue increase for a given proportional increase in budget.
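To see what a slope means on the dollar scale, the fitted line for a single genre can be rewritten without logarithms. Here \(b_0\) and \(b_1\) are our own shorthand for that genre's intercept and slope; they are not names from the regression table.

\[\begin{aligned} \log_{10}(\text{gross}) &= b_0 + b_1 \log_{10}(\text{budget}) \\ \text{gross} &= 10^{b_0} \cdot \text{budget}^{b_1} \end{aligned}\]

Multiplying the budget by a factor \(k\) therefore multiplies the predicted revenue by \(k^{b_1}\). With the action slope \(b_1 = 1.039\), a 1% larger budget (\(k = 1.01\)) goes with roughly a 1.04% larger predicted revenue. With a horror slope of about 0.755, the same 1% goes with roughly a 0.75% increase.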
Confidence Intervals
We previously determined that the intercepts of our interaction model have no practical meaning, so we will not interpret their confidence intervals.
Action Slope (baseline)
The confidence interval of the action line’s slope (log10_budget) is [0.953, 1.125]. So, we are 95% confident that the true slope of the action line falls within these values. Because 0 does not fall within the 95% confidence interval, we have strong evidence that the slope is not 0. So, we can say with confidence that there is a positive relationship between the budget of an action movie and its revenue.
Animation Slope Offset
The confidence interval of the offset of the slope for animated movies (genreAnimation:log10_budget) is [-0.338, 0.154]. So, we are 95% confident that the true offset of the slope for animated movies falls within these values. This also tells us that we cannot be sure that the slope offset is not 0, as 0 falls within the 95% confidence interval. So, there is no statistically significant difference between the slope of the action line and the slope of the animation line.
All other slope offsets
The three remaining slope offsets (genreComedy:log10_budget, genreDrama:log10_budget, and genreHorror:log10_budget) have the same confidence interval interpretations. The confidence intervals of these slope offsets all contain only negative values. So, we are 95% confident that the true offsets of these slopes are negative. Because 0 does not fall within any of these confidence intervals, we have strong evidence that the slope offsets are not 0. So, there is a statistically significant difference between the slope of the action line and the slopes of the comedy, drama, and horror lines.
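The interval logic above can be checked against the regression table. The sketch below tests whether each reported interval contains 0, and also rebuilds an approximate interval as estimate ± 1.96 standard errors. The 1.96 multiplier is our assumption (a normal approximation), so the rebuilt endpoints match the table only up to rounding.

```python
Z_95 = 1.96  # assumed normal-approximation multiplier for a 95% interval

# (term, estimate, std_error, lower_ci, upper_ci) copied from the regression table.
slope_rows = [
    ("log10_budget", 1.039, 0.044, 0.953, 1.125),
    ("genreAnimation:log10_budget", -0.092, 0.125, -0.338, 0.154),
    ("genreComedy:log10_budget", -0.138, 0.057, -0.249, -0.026),
    ("genreDrama:log10_budget", -0.228, 0.061, -0.349, -0.108),
    ("genreHorror:log10_budget", -0.284, 0.094, -0.469, -0.099),
]

def approx_ci(estimate, std_error):
    """Approximate 95% interval: estimate plus or minus 1.96 standard errors."""
    margin = Z_95 * std_error
    return estimate - margin, estimate + margin

def contains_zero(lower, upper):
    return lower <= 0 <= upper

for term, est, se, lower, upper in slope_rows:
    lo, hi = approx_ci(est, se)
    verdict = "contains 0" if contains_zero(lower, upper) else "excludes 0"
    print(f"{term}: table [{lower}, {upper}], approx [{lo:.3f}, {hi:.3f}], {verdict}")
```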
P-Values
We previously determined that the intercepts of our interaction model have no practical meaning, so we will not interpret their p-values. We will set \(\alpha\) = 0.05.
Action Slope (baseline)
The p-value for the slope of the action movies line is reported as 0 (that is, less than 0.001 after rounding), where \[\begin{aligned} H_0: slope_{action} = 0 \\ \mbox{vs }H_A:slope_{action} \neq 0 \end{aligned}\] This means that there is essentially no chance of getting a slope of 1.039 or more extreme in a world where there is no relationship between budget and revenue for action movies. Since the p-value is less than \(\alpha\), we can reject the null hypothesis and conclude that there is a statistically significant positive relationship between budget and revenue for action movies.
Animation Slope Offset
The p-value for the offset of the slope for animated movies is 0.465, where \[\begin{aligned} H_0: slope_{action} = slope_{animation} \\ \mbox{vs }H_A:slope_{action} \neq slope_{animation} \end{aligned}\] The p-value of 0.465 means that there is a 46.5% chance of getting a slope offset of -0.092 or more extreme in a world where there is no difference between the slope for action movies and the slope for animated movies. Since 0.465 > \(\alpha\), we fail to reject the null hypothesis, and conclude that there is not a statistically significant difference between the slope for action movies and animated movies.
All other slope offsets
The p-values of the slope offsets for comedy, drama, and horror are all less than \(\alpha\), where: \[\begin{aligned} H_0: slope_{action} = slope_{genre} \\ \mbox{vs }H_A:slope_{action} \neq slope_{genre} \end{aligned}\] This means that there is a <5% chance of getting these observed slope offsets or more extreme in a world where there is no difference between the slope for action movies and the slope for comedy/drama/horror movies. So, we reject the null hypothesis and conclude that there is a statistically significant difference between the slope for action movies and comedy, drama, and horror movies.
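The reported p-values can be approximately recovered from the test statistics in the regression table. The sketch below uses a standard normal approximation, which is our assumption: the regression software most likely used a t distribution, but with thousands of movies the two are nearly identical. The function name is our own.

```python
import math

ALPHA = 0.05

# (term, statistic, p_value) copied from the regression table.
slope_tests = [
    ("log10_budget", 23.603, 0.000),
    ("genreAnimation:log10_budget", -0.731, 0.465),
    ("genreComedy:log10_budget", -2.418, 0.016),
    ("genreDrama:log10_budget", -3.718, 0.000),
    ("genreHorror:log10_budget", -3.014, 0.003),
]

def two_sided_p_from_z(statistic):
    """P(|Z| >= |statistic|) for a standard normal Z."""
    return math.erfc(abs(statistic) / math.sqrt(2))

for term, statistic, table_p in slope_tests:
    p = two_sided_p_from_z(statistic)
    decision = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"{term}: approx p = {p:.3g} (table: {table_p}), {decision}")
```

Note that the p-value for the action slope is extremely small rather than exactly 0; the table simply rounds it to three decimals.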
Model Selection
We decided to use the interaction model rather than the parallel slopes model. This is because the slope offsets shown in the interaction model are statistically significant according to their p values and confidence intervals, with the exception of animated movies. This means that there is a meaningful difference in the relationship between budget and revenue across different genres. This information would be lost with the parallel slopes model, which forces all genres to have the same slope. We felt that this additional information about difference in slopes was worth the added complexity of the interaction model.
Residual Analysis
This scatter plot reveals a slight heteroskedastic pattern to the residuals of our model for movie revenue. It appears that our model tends to have larger residuals for movies with lower budgets, and smaller residuals for movies with higher budgets. However, as can be observed with the green density lines, the majority of residuals do not exhibit a large degree of heteroskedasticity.
This histogram is centered at 0, meaning that the average residual is about 0. It also shows a slight left skew, meaning that there are more positive residuals than negative. This means that our model tends to slightly underestimate the revenue.
Since the residuals for our model are on average about 0 with a fairly normal and even distribution, we can use the generated p values and confidence intervals to appropriately assess the model’s outputs. But, the residuals’ slight skew and heteroskedasticity may impact these values as well.
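To illustrate what a residual is here, the sketch below computes observed minus predicted log10 revenue for the five movies in the sample table, using the rounded estimates from the regression table. The helper names are our own. These five well-known films are not representative of the full dataset, so their residuals say nothing about the overall residual distribution.

```python
# Rounded estimates from the regression table; action is the baseline genre.
BASE_INTERCEPT, BASE_SLOPE = -0.443, 1.039
INTERCEPT_OFFSETS = {"Action": 0.0, "Animation": 0.903, "Comedy": 1.109, "Drama": 1.536, "Horror": 2.225}
SLOPE_OFFSETS = {"Action": 0.0, "Animation": -0.092, "Comedy": -0.138, "Drama": -0.228, "Horror": -0.284}

def predict_log10_gross(genre, log10_budget):
    """Fitted value from the interaction model for one genre and budget."""
    intercept = BASE_INTERCEPT + INTERCEPT_OFFSETS[genre]
    slope = BASE_SLOPE + SLOPE_OFFSETS[genre]
    return intercept + slope * log10_budget

# (name, genre, log10_budget, log10_gross) copied from the sample table.
sample = [
    ("Ferris Bueller's Day Off", "Comedy", 6.778151, 7.845943),
    ("Top Gun", "Action", 7.176091, 8.254791),
    ("Aliens", "Action", 7.267172, 7.930237),
    ("Platoon", "Drama", 6.778151, 8.141546),
    ("Blue Velvet", "Drama", 6.778151, 6.932029),
]

residuals = {}
for name, genre, log10_budget, log10_gross in sample:
    # A positive residual means the model underestimated this movie's revenue.
    residuals[name] = log10_gross - predict_log10_gross(genre, log10_budget)
    print(f"{name}: residual = {residuals[name]:+.3f} log10 dollars")
```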
Our analysis showed that across the five chosen genres (action, animation, comedy, drama, and horror), all have a positive relationship between the money put into production (budget) and the money returned in the box office (box office gross revenue). However, as we suspected, the relationship between money invested and money made is not the same for every genre. According to the interaction model we built, revenue rises fastest with budget for action movies. Horror movies make the least money for the money invested when the budget is on the relative upper end. The other middling genres fall close together. The differences in slope between each genre and action movies were statistically significant, except for animation movies.
Long story short:
Of course, this model is flawed and not built to accommodate several important factors that would be needed to fully address the question about the relationship between movie budget, revenue, and genre. This model only accounted for the most popular movies, so we can’t be certain how patterns might shift for less successful movies. In addition, as previously discussed, the US box office is often not the only source of revenue for films.
Thus, future work could try to complete the picture of revenue by adding in sources like Internet streaming platforms, video rentals, TV, etc. Of course, it would also be interesting to analyze more genres and see if anything really different pops out. Of the five we looked at, none were extremely different from each other, but that doesn’t mean the possibility doesn’t exist.
Both of the Power %>% Girls are fans of the talented and versatile actor Danny DeVito. Given this personal interest, we decided to ask the question: is there such a thing as a “DeVito Effect”? In other words, do movies which DeVito directed, starred in, or wrote do better in terms of IMDb user rating and/or revenue compared to other movies of that genre?
https://www.youtube.com/watch?v=J7tPXiKn-Qc
https://twitter.com/TylerNaegleLOL/status/1007019099438764033
We decided to construct some visualizations to answer this question.
According to these visuals, the DeVito Effect is not what we expected. For rating, DeVito films were lower than the overall movie set’s average rating by genre. For revenue, many DeVito films also fall short, being lower than the movie set’s average gross revenue by genre. The difference between DeVito revenue and overall revenue is less extreme than with the scores. However, this doesn’t apply for comedy films. In both histograms DeVito films had median rating/revenue that nearly matched the total set averages. There were a few films that had much higher ratings/revenue than the genre average.
The take-home point? Danny DeVito does well in comedy, but not so much in other genres.
Caveat: This particular set only included 12 movies with DeVito. Of these there was only one movie in each genre besides comedy. Also, this set was about the last 30 years, so some of DeVito’s most famous work wasn’t included. Nevertheless, we thought it would be a fun analysis to do.
Images
https://www.kryptinc.com/blog/wp-content/uploads/2017/02/goldenticket-preferred-products.jpg
Video
“Danny DeVito, in and out of character” posted by CBS Sunday Morning https://www.youtube.com/watch?v=J7tPXiKn-Qc
Tweet
@TylerNaegleLOL https://twitter.com/TylerNaegleLOL/status/1007019099438764033
Wikipedia
Toy Story https://en.wikipedia.org/wiki/Toy_Story
[1] Movie Industry. https://www.kaggle.com/danielgrijalvas/movies. Accessed 12 Dec. 2018.
[2] “IMDb: Advanced Title Search.” IMDb, http://www.imdb.com/search/title. Accessed 13 Dec. 2018.
[3] “FilmProfit(r) Glossary of Film Terms.” FilmDependent - a Little IndieLogic, 1 July 2007, http://filmdependent.com/newsletter/about/filmprofitr-glossary-of-film-terms/.
[4] “How the Movie Box Office Works.” HowStuffWorks, 13 Feb. 2015, https://entertainment.howstuffworks.com/movie-box-office.htm.