Akula-DATA605-Week11-Discussion

Movie Budget Vs. Worldwide Gross Revenues

Data is extracted from the website The Numbers. Using this data I want to find if there is any correlation between movie budget and revenues it generates after the movie release. I used top 1000 movies based on budget to build a regression model.

Data collection

library(rvest)
library(plyr)
library(dplyr)
library(knitr)

for(i in seq(from=1, to=1001, by=100)){
  url <- paste0("http://www.the-numbers.com/movie/budgets/all/", i)
  htmlData <- html(url)
  rawMovieData <- htmlData %>%
                    html_nodes("table") %>%
                    .[[1]] %>%
                    html_table(fill = TRUE)

  rawMovieData <- na.omit(rawMovieData)
  if (i == 1){
    MovieData <- rawMovieData
  }
  else{
    MovieData <- rbind(MovieData, rawMovieData)
  }

}

MovieData$Budget <- gsub(",","",MovieData$`Production Budget`, fixed=TRUE)
MovieData$Budget <- gsub("$","",MovieData$Budget, fixed=TRUE)
MovieData$Budget <- round(as.numeric(MovieData$Budget)/1000000,2)

MovieData$Domestic <- gsub(",","",MovieData$`Domestic Gross`, fixed=TRUE)
MovieData$Domestic <- gsub("$","",MovieData$Domestic, fixed=TRUE)
MovieData$Domestic <- round(as.numeric(MovieData$Domestic)/1000000,2)

MovieData$Worldwide <- gsub(",","",MovieData$`Worldwide Gross`, fixed=TRUE)
MovieData$Worldwide <- gsub("$","",MovieData$Worldwide, fixed=TRUE)
MovieData$Worldwide <- round(as.numeric(MovieData$Worldwide)/1000000,2)


colnames(MovieData)[0] <- "MovieNum"
Top1000 <- MovieData %>% select (Movie, Budget, Domestic, Worldwide)

Top1000 <- head(arrange(Top1000,desc(Budget)), n = 1000)

Top1000 %>% 
  select (Movie, Budget, Domestic, Worldwide) %>% 
  filter(rank(desc(Budget))<=20) %>% 
  kable(digits = 2, format='pandoc', caption = "Top 1000 Movie Budget Vs. Gross Revenues Worldwide in Millions")

Top 1000 Movie Budget Vs. Gross Revenues Worldwide in Millions
Movie	Budget	Domestic	Worldwide
Avatar	425.0	760.51	2783.92
Pirates of the Caribbean: On Stranger Tides	410.6	241.06	1045.66
Avengers: Age of Ultron	330.6	459.01	1408.22
Star Wars Ep. VII: The Force Awakens	306.0	936.66	2058.66
Pirates of the Caribbean: At World’s End	300.0	309.42	963.42
Spectre	300.0	200.07	879.62
The Dark Knight Rises	275.0	448.14	1084.44
The Lone Ranger	275.0	89.30	260.00
John Carter	275.0	73.06	282.78
Tangled	260.0	200.82	586.58
Spider-Man 3	258.0	336.53	894.86
Captain America: Civil War	250.0	408.08	1153.30
Batman v Superman: Dawn of Justice	250.0	330.36	868.16
The Hobbit: An Unexpected Journey	250.0	303.00	1017.00
Harry Potter and the Half-Blood Prince	250.0	301.96	935.08
The Hobbit: The Desolation of Smaug	250.0	258.37	960.37
The Hobbit: The Battle of the Five Armies	250.0	255.12	955.12
The Fate of the Furious	250.0	225.76	1237.44
Superman Returns	232.0	200.12	374.09

Plot data

Top1000.lm <- lm(Top1000$Worldwide ~ Top1000$Budget)
hist(Top1000.lm$residuals)

Histogram of residuals shows data has single peak and it is right skewed.

plot(Top1000$Worldwide ~ Top1000$Budget, xlab='Movie Budget', ylab='Worldwide Gross', 
     main='Top 1000 Movie Budget Vs. Worldwide Gross Revenues in Millions')
abline(Top1000.lm)

The plot shows some correlation between movie budget and gross revenues worldwide.

Top1000.lm

## 
## Call:
## lm(formula = Top1000$Worldwide ~ Top1000$Budget)
## 
## Coefficients:
##    (Intercept)  Top1000$Budget  
##        -90.616           3.805

Generic regression equation is \(y = a_0 + a_1x_1\) where \(x_1\) denotes movie budget, \(a_0\) is y-intercept and \(a_1\) is slope of the equation.

In this case, the y-intercept is = -90.62 and the slope = 0:5863. Thus, the final regression model is: \(y = -90.62 + 3.81x_1\), this can be read as \(Worldwide~ Gross~ Revenue~ = -90.62 + 3.81 * Movie~ Budget\)

Model Quality

summary(Top1000.lm)

## 
## Call:
## lm(formula = Top1000$Worldwide ~ Top1000$Budget)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -695.83 -117.14  -38.36   76.43 1537.19 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -90.6161    15.6765   -5.78 9.96e-09 ***
## Top1000$Budget   3.8052     0.1409   27.01  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 218.4 on 998 degrees of freedom
## Multiple R-squared:  0.4224, Adjusted R-squared:  0.4218 
## F-statistic: 729.7 on 1 and 998 DF,  p-value: < 2.2e-16

Standard error for budget is 0.141 and is less than coefficient \(\frac {3.805}{0.141} = 26.9858156\). As coefficient value is high it suggests that there is little variability in the slope estimate. \(Pr(>|t|)\), values are very low suggesting that budget is not relevant in this model.

plot(fitted(Top1000.lm),residuals(Top1000.lm))

Plot shows there data is clustered, as we move to right it is not spread evenly suggesting data is skewed.

qqnorm(Top1000.lm$residuals)
qqline(Top1000.lm$residuals)  # adds diagonal line to the normal prob plot

If the residuals were normally distributed, we would expect the points to hug the straight line. Two ends diverge significantly from fitted line suggesting residuals are not normally distributed. This indicates the budget alone cannot be used to estimate worldwide gross revenues.

Conclusion

\(R^2\) value of 0.422 suggests only 42% of the data variation.
Budget alone cannot be used to estimate worldwide gross revenues.