Movie Budget Vs. Worldwide Gross Revenues

Data is extracted from the website The Numbers. Using this data I want to find if there is any correlation between movie budget and revenues it generates after the movie release. I used top 1000 movies based on budget to build a regression model.

Data collection

library(rvest)
library(plyr)
library(dplyr)
library(knitr)

for(i in seq(from=1, to=1001, by=100)){
  url <- paste0("http://www.the-numbers.com/movie/budgets/all/", i)
  htmlData <- html(url)
  rawMovieData <- htmlData %>%
                    html_nodes("table") %>%
                    .[[1]] %>%
                    html_table(fill = TRUE)

  rawMovieData <- na.omit(rawMovieData)
  if (i == 1){
    MovieData <- rawMovieData
  }
  else{
    MovieData <- rbind(MovieData, rawMovieData)
  }

}

MovieData$Budget <- gsub(",","",MovieData$`Production Budget`, fixed=TRUE)
MovieData$Budget <- gsub("$","",MovieData$Budget, fixed=TRUE)
MovieData$Budget <- round(as.numeric(MovieData$Budget)/1000000,2)

MovieData$Domestic <- gsub(",","",MovieData$`Domestic Gross`, fixed=TRUE)
MovieData$Domestic <- gsub("$","",MovieData$Domestic, fixed=TRUE)
MovieData$Domestic <- round(as.numeric(MovieData$Domestic)/1000000,2)

MovieData$Worldwide <- gsub(",","",MovieData$`Worldwide Gross`, fixed=TRUE)
MovieData$Worldwide <- gsub("$","",MovieData$Worldwide, fixed=TRUE)
MovieData$Worldwide <- round(as.numeric(MovieData$Worldwide)/1000000,2)


colnames(MovieData)[0] <- "MovieNum"
Top1000 <- MovieData %>% select (Movie, Budget, Domestic, Worldwide)

Top1000 <- head(arrange(Top1000,desc(Budget)), n = 1000)

Top1000 %>% 
  select (Movie, Budget, Domestic, Worldwide) %>% 
  filter(rank(desc(Budget))<=20) %>% 
  kable(digits = 2, format='pandoc', caption = "Top 1000 Movie Budget Vs. Gross Revenues Worldwide in Millions")
Top 1000 Movie Budget Vs. Gross Revenues Worldwide in Millions
Movie Budget Domestic Worldwide
Avatar 425.0 760.51 2783.92
Pirates of the Caribbean: On Stranger Tides 410.6 241.06 1045.66
Avengers: Age of Ultron 330.6 459.01 1408.22
Star Wars Ep. VII: The Force Awakens 306.0 936.66 2058.66
Pirates of the Caribbean: At World’s End 300.0 309.42 963.42
Spectre 300.0 200.07 879.62
The Dark Knight Rises 275.0 448.14 1084.44
The Lone Ranger 275.0 89.30 260.00
John Carter 275.0 73.06 282.78
Tangled 260.0 200.82 586.58
Spider-Man 3 258.0 336.53 894.86
Captain America: Civil War 250.0 408.08 1153.30
Batman v Superman: Dawn of Justice 250.0 330.36 868.16
The Hobbit: An Unexpected Journey 250.0 303.00 1017.00
Harry Potter and the Half-Blood Prince 250.0 301.96 935.08
The Hobbit: The Desolation of Smaug 250.0 258.37 960.37
The Hobbit: The Battle of the Five Armies 250.0 255.12 955.12
The Fate of the Furious 250.0 225.76 1237.44
Superman Returns 232.0 200.12 374.09

Plot data

Top1000.lm <- lm(Top1000$Worldwide ~ Top1000$Budget)
hist(Top1000.lm$residuals)

Histogram of residuals shows data has single peak and it is right skewed.

plot(Top1000$Worldwide ~ Top1000$Budget, xlab='Movie Budget', ylab='Worldwide Gross', 
     main='Top 1000 Movie Budget Vs. Worldwide Gross Revenues in Millions')
abline(Top1000.lm)

The plot shows some correlation between movie budget and gross revenues worldwide.

Top1000.lm
## 
## Call:
## lm(formula = Top1000$Worldwide ~ Top1000$Budget)
## 
## Coefficients:
##    (Intercept)  Top1000$Budget  
##        -90.616           3.805

Generic regression equation is \(y = a_0 + a_1x_1\) where \(x_1\) denotes movie budget, \(a_0\) is y-intercept and \(a_1\) is slope of the equation.

In this case, the y-intercept is = -90.62 and the slope = 0:5863. Thus, the final regression model is: \(y = -90.62 + 3.81x_1\), this can be read as \(Worldwide~ Gross~ Revenue~ = -90.62 + 3.81 * Movie~ Budget\)

Model Quality

summary(Top1000.lm)
## 
## Call:
## lm(formula = Top1000$Worldwide ~ Top1000$Budget)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -695.83 -117.14  -38.36   76.43 1537.19 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -90.6161    15.6765   -5.78 9.96e-09 ***
## Top1000$Budget   3.8052     0.1409   27.01  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 218.4 on 998 degrees of freedom
## Multiple R-squared:  0.4224, Adjusted R-squared:  0.4218 
## F-statistic: 729.7 on 1 and 998 DF,  p-value: < 2.2e-16

Standard error for budget is 0.141 and is less than coefficient \(\frac {3.805}{0.141} = 26.9858156\). As coefficient value is high it suggests that there is little variability in the slope estimate. \(Pr(>|t|)\), values are very low suggesting that budget is not relevant in this model.

plot(fitted(Top1000.lm),residuals(Top1000.lm))

Plot shows there data is clustered, as we move to right it is not spread evenly suggesting data is skewed.

qqnorm(Top1000.lm$residuals)
qqline(Top1000.lm$residuals)  # adds diagonal line to the normal prob plot

If the residuals were normally distributed, we would expect the points to hug the straight line. Two ends diverge significantly from fitted line suggesting residuals are not normally distributed. This indicates the budget alone cannot be used to estimate worldwide gross revenues.

Conclusion