Data is extracted from the website The Numbers. Using this data I want to find if there is any correlation between movie budget and revenues it generates after the movie release. I used top 1000 movies based on budget to build a regression model.
library(rvest)
library(plyr)
library(dplyr)
library(knitr)
for(i in seq(from=1, to=1001, by=100)){
url <- paste0("http://www.the-numbers.com/movie/budgets/all/", i)
htmlData <- html(url)
rawMovieData <- htmlData %>%
html_nodes("table") %>%
.[[1]] %>%
html_table(fill = TRUE)
rawMovieData <- na.omit(rawMovieData)
if (i == 1){
MovieData <- rawMovieData
}
else{
MovieData <- rbind(MovieData, rawMovieData)
}
}
MovieData$Budget <- gsub(",","",MovieData$`Production Budget`, fixed=TRUE)
MovieData$Budget <- gsub("$","",MovieData$Budget, fixed=TRUE)
MovieData$Budget <- round(as.numeric(MovieData$Budget)/1000000,2)
MovieData$Domestic <- gsub(",","",MovieData$`Domestic Gross`, fixed=TRUE)
MovieData$Domestic <- gsub("$","",MovieData$Domestic, fixed=TRUE)
MovieData$Domestic <- round(as.numeric(MovieData$Domestic)/1000000,2)
MovieData$Worldwide <- gsub(",","",MovieData$`Worldwide Gross`, fixed=TRUE)
MovieData$Worldwide <- gsub("$","",MovieData$Worldwide, fixed=TRUE)
MovieData$Worldwide <- round(as.numeric(MovieData$Worldwide)/1000000,2)
colnames(MovieData)[0] <- "MovieNum"
Top1000 <- MovieData %>% select (Movie, Budget, Domestic, Worldwide)
Top1000 <- head(arrange(Top1000,desc(Budget)), n = 1000)
Top1000 %>%
select (Movie, Budget, Domestic, Worldwide) %>%
filter(rank(desc(Budget))<=20) %>%
kable(digits = 2, format='pandoc', caption = "Top 1000 Movie Budget Vs. Gross Revenues Worldwide in Millions")
| Movie | Budget | Domestic | Worldwide |
|---|---|---|---|
| Avatar | 425.0 | 760.51 | 2783.92 |
| Pirates of the Caribbean: On Stranger Tides | 410.6 | 241.06 | 1045.66 |
| Avengers: Age of Ultron | 330.6 | 459.01 | 1408.22 |
| Star Wars Ep. VII: The Force Awakens | 306.0 | 936.66 | 2058.66 |
| Pirates of the Caribbean: At World’s End | 300.0 | 309.42 | 963.42 |
| Spectre | 300.0 | 200.07 | 879.62 |
| The Dark Knight Rises | 275.0 | 448.14 | 1084.44 |
| The Lone Ranger | 275.0 | 89.30 | 260.00 |
| John Carter | 275.0 | 73.06 | 282.78 |
| Tangled | 260.0 | 200.82 | 586.58 |
| Spider-Man 3 | 258.0 | 336.53 | 894.86 |
| Captain America: Civil War | 250.0 | 408.08 | 1153.30 |
| Batman v Superman: Dawn of Justice | 250.0 | 330.36 | 868.16 |
| The Hobbit: An Unexpected Journey | 250.0 | 303.00 | 1017.00 |
| Harry Potter and the Half-Blood Prince | 250.0 | 301.96 | 935.08 |
| The Hobbit: The Desolation of Smaug | 250.0 | 258.37 | 960.37 |
| The Hobbit: The Battle of the Five Armies | 250.0 | 255.12 | 955.12 |
| The Fate of the Furious | 250.0 | 225.76 | 1237.44 |
| Superman Returns | 232.0 | 200.12 | 374.09 |
Top1000.lm <- lm(Top1000$Worldwide ~ Top1000$Budget)
hist(Top1000.lm$residuals)
Histogram of residuals shows data has single peak and it is right skewed.
plot(Top1000$Worldwide ~ Top1000$Budget, xlab='Movie Budget', ylab='Worldwide Gross',
main='Top 1000 Movie Budget Vs. Worldwide Gross Revenues in Millions')
abline(Top1000.lm)
The plot shows some correlation between movie budget and gross revenues worldwide.
Top1000.lm
##
## Call:
## lm(formula = Top1000$Worldwide ~ Top1000$Budget)
##
## Coefficients:
## (Intercept) Top1000$Budget
## -90.616 3.805
Generic regression equation is \(y = a_0 + a_1x_1\) where \(x_1\) denotes movie budget, \(a_0\) is y-intercept and \(a_1\) is slope of the equation.
In this case, the y-intercept is = -90.62 and the slope = 0:5863. Thus, the final regression model is: \(y = -90.62 + 3.81x_1\), this can be read as \(Worldwide~ Gross~ Revenue~ = -90.62 + 3.81 * Movie~ Budget\)
summary(Top1000.lm)
##
## Call:
## lm(formula = Top1000$Worldwide ~ Top1000$Budget)
##
## Residuals:
## Min 1Q Median 3Q Max
## -695.83 -117.14 -38.36 76.43 1537.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -90.6161 15.6765 -5.78 9.96e-09 ***
## Top1000$Budget 3.8052 0.1409 27.01 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 218.4 on 998 degrees of freedom
## Multiple R-squared: 0.4224, Adjusted R-squared: 0.4218
## F-statistic: 729.7 on 1 and 998 DF, p-value: < 2.2e-16
Standard error for budget is 0.141 and is less than coefficient \(\frac {3.805}{0.141} = 26.9858156\). As coefficient value is high it suggests that there is little variability in the slope estimate. \(Pr(>|t|)\), values are very low suggesting that budget is not relevant in this model.
plot(fitted(Top1000.lm),residuals(Top1000.lm))
Plot shows there data is clustered, as we move to right it is not spread evenly suggesting data is skewed.
qqnorm(Top1000.lm$residuals)
qqline(Top1000.lm$residuals) # adds diagonal line to the normal prob plot
If the residuals were normally distributed, we would expect the points to hug the straight line. Two ends diverge significantly from fitted line suggesting residuals are not normally distributed. This indicates the budget alone cannot be used to estimate worldwide gross revenues.