Introduction

I’m sure movie executives have long debated one topic in particular: What makes a movie great? Is it the actors? The story? The art form? No.

It’s profit, of course!

For this project, we’ll try to figure out what makes a movie great (at making money).

Data

The data comes from The Movie Database (TMDB, https://www.themoviedb.org/) and was conveniently uploaded to Kaggle (https://www.kaggle.com/tmdb/tmdb-movie-metadata/discussion). Each observation is a movie and contains information such as release date, budget, revenue, popularity, run-time, and an average rating. We will examine the relationship between revenue and each of the other variables.

This is an observational study, as we are looking at data that already exists and are not applying any treatments. The scope of inference should include all English-speaking, American movies that have a TMDB profile. This likely won’t include movies that were not released in theaters or otherwise widely distributed. As far as I know, this should be a fairly random sample of movies, but I cannot speak to whatever biases went into the selection.

That takes care of basic cleaning, but there is still one other major problem with the data: price inflation. The data-set now contains movies going back to 1916, and the value of the dollar has dropped considerably since then. For example, the data-set lists a revenue of about $400 million for ‘Gone With the Wind’; this would equate to billions in today’s money.

Luckily, there’s the ‘priceR’ package, which will easily inflate/deflate currencies to any given date. Below, the budgets and revenues are translated into their 2019 equivalent. A rate of 3.71% a year was used, which is approximately the average inflation rate from 1939 to 2019.

## Retrieving countries data
## Generating URL to request all 304 results
## Retrieving inflation data for US 
## Generating URL to request all 60 results
## Retrieving countries data
## Generating URL to request all 304 results
## Retrieving inflation data for US 
## Generating URL to request all 60 results
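
The exact ‘priceR’ call is not shown in this report, so the following is only a minimal sketch of what compounding at the stated fixed rate of 3.71% a year looks like in base R. The helper `inflate_fixed_rate()` and the `toy` data frame are hypothetical names with made-up values; they are not the code or data used above.

```r
# Hypothetical helper (not from priceR): compound a dollar amount forward
# from `from_year` to `to_year` at a fixed annual inflation rate.
inflate_fixed_rate <- function(amount, from_year, to_year = 2019, rate = 0.0371) {
  amount * (1 + rate)^(to_year - from_year)
}

# Made-up rows standing in for the real movie table.
toy <- data.frame(
  title   = c("Old film", "Recent film"),
  year    = c(1939, 2019),
  revenue = c(1e6, 1e6)
)

# Adjusted column, named after the variable that appears in the model output.
toy$revenue.adjusted <- inflate_fixed_rate(toy$revenue, toy$year)
print(toy)
```

Under this assumption a dollar from 1939 is scaled by (1.0371)^80, while a 2019 dollar is left unchanged.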

Exploratory data analysis

Now that we have clean data, let’s explore it.

Below are summary statistics for the revenue data.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 3.592e+03 2.996e+07 8.856e+07 1.874e+08 2.180e+08 7.273e+09

Below are summary statistics for the budget data.

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      9092  18417464  39273168  57243882  79635785 431893742
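
Both tables above have the six-number layout that R’s `summary()` prints for a numeric vector. As a small sketch on made-up numbers (the vector `revenue_toy` is hypothetical, not the real revenue column), the same call also makes it easy to check for skew: in both tables above the mean sits well above the median, which suggests a long right tail.

```r
# Made-up values standing in for the adjusted revenue column.
revenue_toy <- c(3.6e3, 3.0e7, 8.9e7, 2.2e8, 7.3e9)

# summary() returns Min., 1st Qu., Median, Mean, 3rd Qu., Max.
rev_summary <- summary(revenue_toy)
print(rev_summary)

# A mean far above the median is a quick sign of right skew.
skewed_right <- rev_summary[["Mean"]] > rev_summary[["Median"]]
print(skewed_right)
```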

There are more variables that may be at play, however. They are all plotted below using a pairs plot. Some of the strongest correlations are between revenue and both popularity and budget, between vote average and both budget and run-time, and between budget and popularity.

It should be noted that run-time may have more of a parabolic relationship to other variables. However, most variables appear to have fairly linear relationships with revenue; those will be the main variables by which we will attempt to model movie revenue.
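
For readers who want to reproduce this kind of check, here is a minimal sketch using base R’s `pairs()` and `cor()`. The data frame `toy` is simulated, with relationships I made up for illustration, so the numbers it prints say nothing about the real movies.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for three of the variables (relationships are made up).
budget     <- runif(n, 1e6, 2e8)
popularity <- budget / 1e6 + rnorm(n, sd = 30)
revenue    <- 2 * budget + 1e6 * popularity + rnorm(n, sd = 1e8)
toy <- data.frame(revenue, budget, popularity)

# Scatterplot matrix; only drawn in an interactive session.
if (interactive()) pairs(toy)

# Pearson correlation matrix, the numeric counterpart of the pairs plot.
cors <- cor(toy)
print(round(cors, 2))
```

Pearson correlation only measures linear association, so a curved relationship such as the one suspected for run-time can look weak in `cor()` while still being visible in the plot.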

Inference

If the goal is to predict revenue, we can attempt to use the variables from before to create a model. Most variables appear to have slight linear associations with revenue. The exception may be run-time, which may have more of a parabolic relationship (it’s widely believed that a length of around 90-120 minutes is optimal).
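
The model fitted in this section treats run-time as a plain linear term. As an aside, a parabolic relationship could be expressed by adding a squared term with `I()`. The sketch below uses simulated data in which revenue peaks near 110 minutes; that peak, the data frame `toy`, and the object names are all assumptions for illustration.

```r
set.seed(42)

# Simulated data: revenue peaks near a run-time of 110 minutes (made up).
runtime <- runif(300, 60, 200)
revenue <- 2e8 - 3e4 * (runtime - 110)^2 + rnorm(300, sd = 2e7)
toy <- data.frame(runtime, revenue)

# Linear-only fit versus a fit with an added quadratic term.
fit_linear <- lm(revenue ~ runtime, data = toy)
fit_quad   <- lm(revenue ~ runtime + I(runtime^2), data = toy)

# F-test comparing the two nested models.
print(anova(fit_linear, fit_quad))

# Vertex of the fitted parabola: -b1 / (2 * b2).
b <- coef(fit_quad)
peak_runtime <- unname(-b["runtime"] / (2 * b["I(runtime^2)"]))
print(peak_runtime)
```

A negative coefficient on the squared term means the fitted curve opens downward, so the vertex is the run-time at which predicted revenue is highest.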

## 
## Call:
## lm(formula = revenue.adjusted ~ budget.adjusted + popularity + 
##     release_date + runtime + vote_average, data = movies)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.596e+09 -8.569e+07 -1.390e+07  4.535e+07  6.567e+09 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.135e+08  4.106e+07  -2.765  0.00572 ** 
## budget.adjusted  2.169e+00  9.294e-02  23.342  < 2e-16 ***
## popularity       2.665e+06  1.407e+05  18.938  < 2e-16 ***
## release_date    -1.413e+04  9.671e+02 -14.613  < 2e-16 ***
## runtime          5.107e+05  2.483e+05   2.057  0.03975 *  
## vote_average     3.300e+07  5.960e+06   5.538 3.33e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 244800000 on 3080 degrees of freedom
## Multiple R-squared:  0.3926, Adjusted R-squared:  0.3916 
## F-statistic: 398.1 on 5 and 3080 DF,  p-value: < 2.2e-16

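The model above uses all five predictors and does better than I thought it might, accounting for about 39% of the variation in revenue. The p-value of the overall F-test is below 2.2e-16, so the model is highly significant. Whether the residuals are normally distributed is less clear: the residual summary above runs from about -1.6e9 to about 6.6e9 with a median near -1.4e7, which points to a long right tail rather than a symmetric shape.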
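
One way to look more closely at the residuals is a Q-Q plot together with a Shapiro-Wilk test. The sketch below runs on simulated data with deliberately right-skewed noise, which is my assumption and not the movie data, so it only shows the mechanics of the check.

```r
set.seed(7)
n <- 500

# Simulated predictor and right-skewed noise (shifted exponential, made up).
budget  <- runif(n, 1e6, 2e8)
noise   <- (rexp(n) - 1) * 1e8
revenue <- 2 * budget + noise

fit <- lm(revenue ~ budget)
res <- residuals(fit)

# Quantiles of the residuals: a max much larger than |min| hints at skew.
print(summary(res))

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(res)
print(sw)

# Q-Q plot against the normal distribution; only drawn interactively.
if (interactive()) {
  qqnorm(res)
  qqline(res)
}
```

Note that `shapiro.test()` accepts between 3 and 5000 observations, and with thousands of rows even mild departures from normality will be flagged, so the plot is usually the more informative of the two.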

Conclusions

I think it would be safe to say that if one wants to make a profitable movie, then it may be helpful to have a larger budget, appeal to the masses (popularity), and get a higher rating (make a good movie?). Oh, also release it as far back in time as possible.

The model I’ve made accounts for about 39% of the variation in movie revenue and is highly significant. While some of the variables may not be entirely independent of each other, I think this model can provide a good indication of what might drive revenue. That said, I would be very interested to see the real models film producers actually use to decide which movies to pursue.

References

The Movie Database: https://www.themoviedb.org/