Background: This research paper examines quantifiable, easily obtainable measures related to a movie’s success, namely critic reviews, production budget, domestic box office performance, and net profit in US dollars (USD), using data from well-known websites. The goal of this analysis is to test the relationships among these four variables using simple linear regression and, if possible, to develop a multiple linear regression model that uses critic reviews to predict a movie’s expected box office performance.
Methods: This research draws upon Google Trends data to select the top ten trending movies for the past three years. The critic reviews (IMDb, Metacritic, Rotten Tomatoes, and Roger Ebert) came from their respective websites, while domestic box office performance and production budgets were sourced from The Numbers. Initial analysis was done via simple linear regression, culminating in a final multiple regression model created by stepwise regression using critic review scores from seven different publications.
Results: A t-test for correlation showed a significant association between box office performance and Rotten Tomatoes ratings (p-value = 0.045). The associations between budget and IMDb ratings (p-value = 0.018), budget and Metacritic ratings (p-value = 0.046), and budget and Roger Ebert ratings (p-value = 0.018) were all significant. The final multiple regression model, which uses critic reviews and budget to predict box office performance, was statistically significant (p-value = 0.00022) and had a moderate degree of predictive power (adjusted R2 = 0.47).
Conclusions: Critic reviews exhibit a statistically significant, but weak influence on the success of a movie. A multiple linear regression model using budget data and critic reviews was only able to explain some of the variability in domestic box office performance. Future research using a larger sample size and taking into account extraneous variables is needed to conclusively quantify the magnitude of critic reviews’ effect.
Movies are an integral part of modern culture, generating billions of dollars each year and delivering rich, intricately crafted stories to a worldwide audience. The film industry not only serves as a deeply expressive artistic medium but also has a significant impact on the global economy. Global box office revenue is expected to increase from approximately 38 billion USD in 2016 to about 50 billion USD in 2020 (Statista, 2016). Moreover, the US is among the countries that produce the most movies per year: it is ranked the third-largest film producer in the world, after China and India (Statista, 2016).
This study analyzes how different factors affect a movie’s box office success for a selection of the top ten trending movies in the US in each of the past three years. Many such factors are either impossible to quantify or difficult to obtain reliably, such as the profitability of a movie’s release period, the popularity of its content, the strength of its marketing strategy, and its advertising budget (Mullich, 2014). This study therefore focuses on factors that can be obtained consistently for all the movies of interest: critic reviews from a variety of highly respected websites (IMDb, Metacritic, Rotten Tomatoes, and Roger Ebert), production budget, domestic box office performance, and overall net profit, with the three financial variables measured in USD.
The hypothesis of this study was that movies which are more favorably reviewed will have better domestic box office performance, that is, higher box office sales. Following this hypothesis, a multiple linear regression model was developed that factors in numerous critic review sources to predict a movie’s expected box office performance. It was also hypothesized that some of the review websites selected for the study would be better predictors of a movie’s success than others: review websites that consider the general populace’s sentiments, such as IMDb and Rotten Tomatoes, were expected to be better predictors of net profit than review websites that focus solely on professional reviews, such as Metacritic and Roger Ebert. The primary goal of this study is to test the relationships among the four variables (critic reviews, budget, domestic box office performance, and net profit) using simple linear regression and to develop a multiple linear regression model that uses these variables to predict a movie’s expected box office performance.
Data was selected using the most popular information aggregator, Google. Google Trends data provided the top ten trending movies each year. Analyzing the top ten trending movies from 2013, 2014, and 2015 yields a total sample size of thirty movies. For the top Google-searched movies of these three years, the relationships among the four main variables (critic reviews, domestic box office performance, budget, and net profit) were tested using linear regression. The critic reviews (IMDb, Metacritic, Rotten Tomatoes, and Roger Ebert) came from their respective websites, while domestic box office performance and production budget were sourced from The Numbers, a movie industry data and research company. To obtain net profit, production budget was subtracted from domestic box office performance. Overall, these sources of data were found to be reliable and consistent, providing usable data for the project. Although advertising budgets were considered as a possible factor, they were not available in any free, reliable database, so they would have had to be looked up individually for each movie. This would have introduced inconsistencies into the data collection, since different sources use different methods of estimating advertising budgets.
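The net profit calculation can be illustrated with a short R sketch. The three rows below are copied from Appendix A; the data frame name `movies` is only an assumption made for this illustration and is not the object used in the original analysis.

```r
# Three rows copied from Appendix A (budget and domestic box office in USD).
movies <- data.frame(
  Movie = c("Jurassic World", "Jobs", "Pacific Rim"),
  Budget = c(215000000, 18000000, 190000000),
  DomesticBoxOffice = c(652198010, 16131410, 101802906),
  stringsAsFactors = FALSE
)

# Net profit as tabulated in Appendix A: domestic box office minus budget.
movies$NetProfit <- movies$DomesticBoxOffice - movies$Budget

print(movies)
```

A negative value, as for Jobs and Pacific Rim, means the domestic box office did not cover the production budget.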
Initial analysis of the data found that Metacritic was not very predictive of domestic box office performance. Additionally, Rotten Tomatoes and IMDb are continuous review websites: that is, users can continually review the movies long after they are released and unavailable in theaters. An accurate model used to predict how critic reviews are linked to box office performance cannot factor in these continuous reviews. Thus, in an attempt to alleviate these issues, individual review sources that were included in Metacritic were chosen to develop a new model based on critic reviews. Although Metacritic was not as predictive as Rotten Tomatoes and IMDb, the reviews aggregated by it are not continuous, providing a good source of critic reviews. These reviews included by Metacritic were published before and around when each movie was released, which would theoretically influence consumer behavior.
A correlation matrix was used to determine the most significant predictors of domestic box office performance (Appendix E). Out of the variables tested in this study, movie budget and net profit had the strongest significant correlations with domestic box office performance at the 5% significance level; among the review sources, only Rotten Tomatoes reached significance (p-value = 0.045). Movie budget was chosen as one of the variables for the multiple linear regression model because budget can be used in practice to predict box office earnings before a movie is released. Individual critic reviews were also included in the model. Using Metacritic, which contains a database of critic review scores rated out of 100, scores from 22 different publications were recorded for each of the 30 movies. Of these 22 publications, 7 were chosen for their completeness, and critic scores from these publications were used to create a linear model in R that predicts domestic box office earnings from critic scores. The model was pared down by backward elimination, removing predictors based on the highest p-value given by the summary statistics of the linear model. The final linear model was determined by checking the p-values and the adjusted R-squared value after each removal (Appendix F).
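The paring-down procedure described above can be sketched in R as a small backward-elimination loop. This is a minimal sketch and not the script used in the study: the function name `backward_eliminate`, the stopping threshold `alpha`, and the simulated data frame `sim` (with made-up columns `budget`, `review_a`, `review_b`, and `noise`) are all assumptions introduced for illustration.

```r
# Backward elimination by largest p-value (illustrative sketch only).
backward_eliminate <- function(data, response, predictors, alpha = 0.05) {
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    # Coefficient table without the intercept row.
    coefs <- summary(fit)$coefficients[-1, , drop = FALSE]
    pvals <- setNames(coefs[, "Pr(>|t|)"], rownames(coefs))
    worst <- which.max(pvals)
    # Stop when every remaining predictor is significant,
    # or when only one predictor is left.
    if (pvals[worst] <= alpha || length(predictors) == 1) {
      return(fit)
    }
    predictors <- setdiff(predictors, names(pvals)[worst])
  }
}

# Simulated (hypothetical) data: only `budget` truly drives the response.
set.seed(1)
n <- 30
sim <- data.frame(
  budget = runif(n, 5, 250),
  review_a = runif(n, 20, 100),
  review_b = runif(n, 20, 100),
  noise = rnorm(n)
)
sim$box_office <- 50 + 1.0 * sim$budget + rnorm(n, sd = 20)

final_fit <- backward_eliminate(
  sim, "box_office", c("budget", "review_a", "review_b", "noise")
)
print(summary(final_fit))
```

Checking the adjusted R-squared of each intermediate fit, as the study did, guards against removing a predictor whose loss noticeably weakens the model.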
Thirty movies were included in the final analysis, and overall characteristics of these movies can be found in the appendix (A, B). Both the distribution of box office performance and the distribution of net profit were skewed to the right. The average box office performance was found to be $231,647,612, while the average net profit was $118,364,278. The distribution of budget appears to be bimodal and centered around $113,283,333. For the review sources used, both the distributions of IMDb ratings and Rotten Tomatoes ratings are left skewed, while both the distributions of Metacritic ratings and Roger Ebert ratings appear to be relatively normal. The average IMDb rating was 7.03, the average Metacritic rating was 63.5, the average Rotten Tomatoes rating was 67.5, and the average Roger Ebert rating was 2.82. Histograms showing all these distributions are included in the appendix (C,D).
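As a quick check on the right skew of box office performance, the mean and median can be compared directly. The sketch below copies the DomesticBoxOffice column from Appendix A; the variable names are assumptions made for this example.

```r
# DomesticBoxOffice column copied from Appendix A, in table order (USD).
box_office <- c(
  652198010, 350126372, 161058685, 166167230, 351032910, 183785415,
  356461711, 459005868, 336045770, 153636354, 400738009, 188017894,
  150947895, 167767189, 125095601, 200676069, 191719337, 222527828,
  84273813, 241407328, 291045518, 408992272, 202359711, 16131410,
  137400141, 144840419, 368065385, 64473115, 101802906, 71628180
)

mean_box_office <- mean(box_office)
median_box_office <- median(box_office)

# In a right-skewed distribution the mean is pulled above the median.
cat("Mean:  ", format(mean_box_office, big.mark = ","), "\n")
cat("Median:", format(median_box_office, big.mark = ","), "\n")
```

A mean well above the median is consistent with a few very high earners stretching the right tail.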
As the study is mainly focused on the predictors of box office performance, one of the first analyses conducted was to determine the relationship between box office performance and the four main critic review sites included. Although the resulting R2 values for all four linear models indicated an overall weak relationship between box office performance and critic reviews, the p-value for Rotten Tomatoes was significant (p-value = 0.045, R2 = 0.1359).
Figure 1. Domestic Box Office Performance and Critic Review Scatterplots with a Linear Regression Line
Review.Website | R2.value | p.value |
---|---|---|
IMDb | 0.080 | 0.129 |
Metacritic | 0.120 | 0.061 |
Rotten Tomatoes | 0.136 | 0.045 |
Roger Ebert | 0.086 | 0.115 |
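A simple regression of this kind can be sketched in R using the values tabulated in Appendices A and B. The vector names below are assumptions for this example. In a simple linear regression the R-squared equals the squared Pearson correlation, and the p-value for the slope equals the p-value of the t-test for correlation, which is why the two are used interchangeably in this paper.

```r
# Columns copied from Appendix A (box office, USD) and Appendix B
# (Rotten Tomatoes score), in the same movie order.
box_office <- c(
  652198010, 350126372, 161058685, 166167230, 351032910, 183785415,
  356461711, 459005868, 336045770, 153636354, 400738009, 188017894,
  150947895, 167767189, 125095601, 200676069, 191719337, 222527828,
  84273813, 241407328, 291045518, 408992272, 202359711, 16131410,
  137400141, 144840419, 368065385, 64473115, 101802906, 71628180
)
rotten_tomatoes <- c(
  71, 72, 87, 25, 79, 66, 98, 75, 56, 97, 89, 71, 41, 88, 75,
  74, 84, 89, 29, 50, 55, 79, 67, 27, 86, 48, 73, 37, 71, 65
)

# Simple linear regression of box office on the review score.
fit <- lm(box_office ~ rotten_tomatoes)
r_squared <- summary(fit)$r.squared
slope_p <- summary(fit)$coefficients["rotten_tomatoes", "Pr(>|t|)"]

# Equivalent t-test for the Pearson correlation.
cor_test <- cor.test(rotten_tomatoes, box_office)

cat("R-squared:", round(r_squared, 4), "\n")
cat("Slope p-value:", round(slope_p, 4), "\n")
cat("Correlation test p-value:", round(cor_test$p.value, 4), "\n")
```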
In the interest of exploring confounding variables, the association between season of release and box office performance was considered next. Overall, there was no evidence of association between these two variables. An attempt to correlate the two variables resulted in an R2 value of 0.001 and a p-value of 0.880, both of which suggest that there is no relationship between the two variables.
The relationship between budget and critic reviews was also analyzed. Only the association between budget and Rotten Tomatoes ratings was not significant (p-value = 0.102); for the other three review sources, the associations were significant (p-value < 0.05). With this information, simple linear regressions were fitted relating movie production budget to the rating from each of these three review websites:
\[\begin{eqnarray} \text{Budget} = 35525592 \times \text{IMDb} - 136698416\\ \text{Budget} = 2181946 \times \text{Metacritic} - 25270268\\ \text{Budget} = 41058252 \times \text{RogerEbert} - 2364078 \end{eqnarray}\]
Figure 2. Budget and Critic Review Scatterplots with a Linear Regression Line
Review.Website | R2.value | p.value |
---|---|---|
IMDb | 0.183 | 0.018 |
Metacritic | 0.135 | 0.046 |
Rotten Tomatoes | 0.093 | 0.102 |
Roger Ebert | 0.184 | 0.018 |
In addition to the significance of the p-values, a percent error test using data from the top trending movies of 2012 found that the fit of the models was sound. The error for IMDb ranged between 1.25% and 40.87%, for Metacritic between 1.20% and 41.96%, and for Roger Ebert between 2.27% and 28.27% (Appendix H).
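A percent error check of this kind can be sketched as follows. The coefficients are those of the IMDb equation reported above; the percent error formula (absolute difference divided by the actual value) is an assumption, since the text does not spell out the definition used, and the hold-out movie is hypothetical rather than one of the 2012 movies.

```r
# Fitted line reported in the text: Budget = 35525592 * IMDb - 136698416
predict_budget_from_imdb <- function(imdb) {
  35525592 * imdb - 136698416
}

# Assumed definition: absolute error relative to the actual value, in percent.
percent_error <- function(actual, predicted) {
  abs(actual - predicted) / abs(actual) * 100
}

# Hypothetical hold-out movie (not taken from the study's 2012 test data).
holdout <- data.frame(IMDb = 7.0, Budget = 100e6)
holdout$PredictedBudget <- predict_budget_from_imdb(holdout$IMDb)
holdout$PercentError <- percent_error(holdout$Budget, holdout$PredictedBudget)

print(holdout)
```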
The final multiple linear model uses three predictors of domestic box office earnings: critic scores from two publications, the Los Angeles Times and the New York Times, and movie budget. The prediction equation after stepwise removal of predictors is:
\[\begin{eqnarray} \text{Predicted box office earnings} = 95{,}490{,}000 + 3{,}538{,}000 \times \text{LATimes} + 1.012 \times \text{Budget} - 3{,}608{,}000 \times \text{NYT} \end{eqnarray}\]
Using an ANOVA F-test with 3 and 26 degrees of freedom, the p-value for the linear model is about 0.00022, so there is evidence that the linear model is useful for predicting domestic box office earnings. Based on the p-values of the t-tests for the individual terms (LATimes p = 0.0093, Budget p = 0.00035, NYT p = 0.010), all three variables are significant predictors in this model, with movie budget being the most significant predictor of domestic box office earnings. According to the adjusted R-squared value, the two publications and movie budget in this model account for only 47% of the variability in domestic box office earnings. This moderate value suggests that other variables may contribute to, or more accurately explain, the variability in domestic box office earnings. The histogram of the residuals was slightly left skewed, and a plot of the residuals against fitted values from the model showed roughly constant variability (Appendix G), so a linear regression model is appropriate for prediction.
##
## Call:
## lm(formula = DomesticBoxOffice ~ NYT + Budget + LATimes, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -251894038 -63759447 5739032 67293294 199651768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.549e+07 7.482e+07 1.276 0.213127
## NYT -3.608e+06 1.298e+06 -2.779 0.009982 **
## Budget 1.012e+00 2.461e-01 4.113 0.000348 ***
## LATimes 3.538e+06 1.259e+06 2.811 0.009264 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102100000 on 26 degrees of freedom
## Multiple R-squared: 0.521, Adjusted R-squared: 0.4657
## F-statistic: 9.425 on 3 and 26 DF, p-value: 0.000218
Figure 3. Summary Statistics for Final Multiple Linear Regression Model
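To show how the fitted coefficients translate into a prediction, the sketch below hard-codes the rounded estimates from the prediction equation. The function name `predict_box_office` and the example inputs (a Los Angeles Times score of 70, a New York Times score of 60, and a 100 million USD budget) are hypothetical and chosen only for illustration.

```r
# Rounded coefficients from the final model's prediction equation.
predict_box_office <- function(la_times, nyt, budget) {
  95490000 + 3538000 * la_times + 1.012 * budget - 3608000 * nyt
}

# Hypothetical movie: LA Times score 70, NYT score 60, budget 100 million USD.
example_prediction <- predict_box_office(la_times = 70, nyt = 60, budget = 100e6)
cat("Predicted domestic box office (USD):",
    format(example_prediction, big.mark = ",", scientific = FALSE), "\n")
```

With the fitted model object available, `predict()` on that object with a `newdata` data frame would give the same kind of estimate without rounding the coefficients.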
Although a multiple regression model predicting domestic box office performance from critic reviews and budget was successfully generated, causality between these variables cannot be inferred, given that this is an observational study in which none of the variables were actively controlled but simply recorded. Furthermore, other confounding factors influence the relationship between these variables, such as word-of-mouth information, marketing budget, the presence of a major distributor, and the genre of the movie. Even though the p-values of the coefficients in the model are low, causality cannot be established. However, a linear association between domestic box office performance and both budget and critic review scores can be concluded: the coefficients for budget and the Los Angeles Times score are positive, while the coefficient for the New York Times score is negative. Additionally, the linear regression model explains only about 47% of the variability in box office earnings. In short, a causal relationship among these variables cannot be inferred, but prediction is possible.
The first evidence of the limited predictive power of individual critic reviews can be observed in the preliminary research findings. The R-squared value for a linear model fitting Roger Ebert, an individual publication, with domestic box office earnings was 0.086 with a corresponding p-value of 0.115, indicating that there is inconclusive evidence that the model is an effective predictor of domestic box office earnings (Figure 1). Conceptually, this is consistent with pre-existing literature which suggests that critics have negligible influence on widely-released movies or popular genres such as action or comedy movies (Reinstein, 2005). Since all 30 of the movies sampled in this study were chosen using Google top trends, movies in this dataset were likely to be popular releases and, therefore, critic reviews would be expected to explain only a small amount of the variation in domestic box office earnings.
Besides leaving out other potential predictors, there are various reasons why a linear model using critic reviews may be limited in explaining the variability in domestic box office earnings. Critic reviews are highly subjective and are meant to evaluate movies as works of art. Critics may not accurately reflect how a movie is received by the general movie-going public. For example, Fifty Shades of Grey scored only 46 on Metacritic (Appendix B), yet it earned about $166 million at the domestic box office, more than four times its $40 million production budget (Appendix A). In a world of growing social media and declining print publications, it is also important to consider the evolving roles of professional critics and peer critics.
There were also flaws in the data that were beyond the control of this study. Out of the 22 publications for which critic review scores were collected, only 7 publications had scores for all 30 movies. This presented a significant challenge in using stepwise regression to create a linear model. Because some publications were missing several review scores, and the movies with missing scores varied from publication to publication, using all 22 publications in the linear model would have dropped 18 movies from the multiple regression analysis. This deletion due to missing values would have reduced the movie sample size by over half and compromised the informativeness of the stepwise regression. To circumvent this problem, only the 7 publications that had review scores for all 30 movies were used in stepwise regression.
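The effect of listwise deletion can be illustrated with a toy example. The data frame below is entirely hypothetical (five made-up movies and three made-up publications, `PubX`, `PubY`, and `PubZ`); it only shows how `lm()`, which by default drops any row with a missing value, would shrink the sample, and how restricting the model to fully populated publications avoids that.

```r
# Hypothetical review scores; NA marks a movie a publication did not review.
scores <- data.frame(
  Movie = c("A", "B", "C", "D", "E"),
  PubX = c(80, 65, 70, 55, 90),
  PubY = c(75, NA, 60, 50, 85),
  PubZ = c(NA, 70, NA, 45, 88),
  stringsAsFactors = FALSE
)
pubs <- setdiff(names(scores), "Movie")

# Movies that would survive if every publication entered the model.
n_complete_all <- sum(complete.cases(scores[, pubs]))

# Publications with a score for every movie.
complete_pubs <- pubs[colSums(is.na(scores[, pubs])) == 0]

# Movies that survive when only those publications are used.
n_complete_subset <- sum(complete.cases(scores[, complete_pubs, drop = FALSE]))

cat("Rows usable with all publications:", n_complete_all, "\n")
cat("Fully populated publications:", complete_pubs, "\n")
cat("Rows usable with that subset:", n_complete_subset, "\n")
```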
To create a multiple linear regression model that explains a larger percentage of the variability in box office earnings, further research should investigate other factors that could affect the popularity of a movie. Data on word-of-mouth interaction, marketing budgets, production values, distributors, and genre would have to be collected and combined with budget data to create a new linear regression model, which would subsequently be pared down using stepwise regression. Further research would also need a larger sample of movies in order to mitigate the effect of deletion due to missing observations.
IMDb. IMDb.com, Inc., n.d. Web. 05 Dec. 2016.
Irvine, Victoria. “Topic: Film Industry.” www.statista.com. N.p., 2016. Web. 01 Dec. 2016.
“Movie Box Office Records.” The Numbers. Nash Information Services, LLC, n.d. Web. 05 Dec. 2016.
“Movie Budgets.” The Numbers. Nash Information Services, LLC, n.d. Web. 05 Dec. 2016.
“Movie Reviews.” Metacritic. CBS Interactive, Inc., n.d. Web. 05 Dec. 2016.
Mullich, David. “Movies: What Determines the Success of a Movie (by Box Office Revenue)?” Quora. N.p., n.d. Web. 01 Dec. 2016.
“Reviews.” Roger Ebert. Ebert Digital, LLC, n.d. Web. 5 Dec. 2016.
Reinstein, David, et al. “The Influence of Expert Reviews on Consumer Demand for Experience Goods: A Case Study of Movie Critics.” The Journal of Industrial Economics. Vol. 53, No. 1 (Mar., 2005), pp 29. Web. 05 Dec. 2016. http://www.jstor.org/stable/3569753
Rotten Tomatoes. Fandango, n.d. Web. 5 Dec. 2016.
RStudio Team (2015). RStudio: Integrated Development for R. RStudio, Inc., Boston, MA URL http://www.rstudio.com/.
Appendix A: Financial Summary of Movies Analyzed
Movie | YearReleased | Budget | DomesticBoxOffice | NetProfit |
---|---|---|---|---|
Jurassic World | 2015 | 2.15e+08 | 652198010 | 437198010 |
American Sniper | 2014 | 5.80e+07 | 350126372 | 292126372 |
Straight Outta Compton | 2015 | 2.80e+07 | 161058685 | 133058685 |
50 Shades of Grey | 2015 | 4.00e+07 | 166167230 | 126167230 |
Furious 7 | 2015 | 1.90e+08 | 351032910 | 161032910 |
Pitch Perfect 2 | 2015 | 2.90e+07 | 183785415 | 154785415 |
Inside Out | 2015 | 1.75e+08 | 356461711 | 181461711 |
Avengers Age of Ultron | 2015 | 2.50e+08 | 459005868 | 209005868 |
Minions | 2015 | 7.40e+07 | 336045770 | 262045770 |
Mad Max: Fury Road | 2015 | 1.50e+08 | 153636354 | 3636354 |
Frozen | 2013 | 1.50e+08 | 400738009 | 250738009 |
Interstellar | 2014 | 1.65e+08 | 188017894 | 23017894 |
Divergent | 2014 | 8.50e+07 | 150947895 | 65947895 |
Gone Girl | 2014 | 6.10e+07 | 167767189 | 106767189 |
Lone Survivor | 2013 | 4.00e+07 | 125095601 | 85095601 |
Godzilla | 2014 | 1.60e+08 | 200676069 | 40676069 |
22 Jump Street | 2014 | 5.00e+07 | 191719337 | 141719337 |
Big Hero 6 | 2014 | 1.65e+08 | 222527828 | 57527828 |
Annabelle | 2014 | 6.50e+06 | 84273813 | 77773813 |
Maleficent | 2014 | 1.80e+08 | 241407328 | 61407328 |
Man of Steel | 2013 | 2.25e+08 | 291045518 | 66045518 |
Iron Man 3 | 2013 | 2.00e+08 | 408992272 | 208992272 |
World War Z | 2013 | 1.90e+08 | 202359711 | 12359711 |
Jobs | 2013 | 1.80e+07 | 16131410 | -1868590 |
The Conjuring | 2013 | 2.00e+07 | 137400141 | 117400141 |
The Great Gatsby | 2013 | 1.90e+08 | 144840419 | -45159581 |
Despicable Me 2 | 2013 | 7.60e+07 | 368065385 | 292065385 |
The Purge | 2013 | 3.00e+06 | 64473115 | 61473115 |
Pacific Rim | 2013 | 1.90e+08 | 101802906 | -88197094 |
Mama | 2013 | 1.50e+07 | 71628180 | 56628180 |
Appendix B: Review Score Summary of Movies Analyzed
Movie | IMDb | Metacritic | RottenTomatoes | RogerEbert |
---|---|---|---|---|
Jurassic World | 7.0 | 59 | 71 | 3.0 |
American Sniper | 7.3 | 72 | 72 | 3.5 |
Straight Outta Compton | 7.9 | 72 | 87 | 4.0 |
50 Shades of Grey | 4.1 | 46 | 25 | 2.0 |
Furious 7 | 7.2 | 67 | 79 | 3.5 |
Pitch Perfect 2 | 6.5 | 63 | 66 | 2.5 |
Inside Out | 8.2 | 94 | 98 | 4.0 |
Avengers Age of Ultron | 7.5 | 66 | 75 | 3.0 |
Minions | 6.4 | 56 | 56 | 3.0 |
Mad Max: Fury Road | 8.1 | 90 | 97 | 4.0 |
Frozen | 7.6 | 74 | 89 | 2.5 |
Interstellar | 8.6 | 74 | 71 | 3.5 |
Divergent | 6.7 | 48 | 41 | 2.5 |
Gone Girl | 8.1 | 79 | 88 | 3.5 |
Lone Survivor | 7.6 | 60 | 75 | 2.0 |
Godzilla | 6.5 | 62 | 74 | 3.5 |
22 Jump Street | 7.1 | 71 | 84 | 3.0 |
Big Hero 6 | 7.9 | 74 | 89 | 3.0 |
Annabelle | 5.4 | 37 | 29 | 1.0 |
Maleficent | 7.0 | 56 | 50 | 3.0 |
Man of Steel | 7.1 | 55 | 55 | 3.0 |
Iron Man 3 | 7.2 | 79 | 79 | 2.5 |
World War Z | 7.0 | 63 | 67 | 2.0 |
Jobs | 5.9 | 44 | 27 | 2.0 |
The Conjuring | 7.5 | 65 | 86 | 1.0 |
The Great Gatsby | 7.3 | 55 | 48 | 2.5 |
Despicable Me 2 | 7.5 | 62 | 73 | 3.0 |
The Purge | 5.7 | 41 | 37 | 1.5 |
Pacific Rim | 7.0 | 64 | 71 | 4.0 |
Mama | 6.2 | 57 | 65 | 3.0 |
Appendix C: Movie Capital Distribution Histograms
Appendix D: Critic Reviews Distribution Histograms
Appendix E: Correlation Matrix of Movie Data
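library(Hmisc)  # provides rcorr()
# Columns 4 to 10 of dat: the four review scores, budget, box office, net profit
my_data <- dat[, c(4,5,6,7,8,9,10)]
res2 <- rcorr(as.matrix(my_data))
res2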
## IMDb Metacritic RottenTomatoes RogerEbert Budget
## IMDb 1.00 0.79 0.81 0.54 0.42
## Metacritic 0.79 1.00 0.91 0.65 0.37
## RottenTomatoes 0.81 0.91 1.00 0.56 0.30
## RogerEbert 0.54 0.65 0.56 1.00 0.43
## Budget 0.42 0.37 0.30 0.43 1.00
## DomesticBoxOffice 0.29 0.35 0.37 0.29 0.59
## NetProfit 0.07 0.17 0.24 0.06 0.03
## DomesticBoxOffice NetProfit
## IMDb 0.29 0.07
## Metacritic 0.35 0.17
## RottenTomatoes 0.37 0.24
## RogerEbert 0.29 0.06
## Budget 0.59 0.03
## DomesticBoxOffice 1.00 0.83
## NetProfit 0.83 1.00
##
## n= 30
##
##
## P
## IMDb Metacritic RottenTomatoes RogerEbert Budget
## IMDb 0.0000 0.0000 0.0019 0.0210
## Metacritic 0.0000 0.0000 0.0000 0.0455
## RottenTomatoes 0.0000 0.0000 0.0013 0.1017
## RogerEbert 0.0019 0.0000 0.0013 0.0180
## Budget 0.0210 0.0455 0.1017 0.0180
## DomesticBoxOffice 0.1168 0.0611 0.0450 0.1152 0.0006
## NetProfit 0.7198 0.3674 0.1961 0.7402 0.8828
## DomesticBoxOffice NetProfit
## IMDb 0.1168 0.7198
## Metacritic 0.0611 0.3674
## RottenTomatoes 0.0450 0.1961
## RogerEbert 0.1152 0.7402
## Budget 0.0006 0.8828
## DomesticBoxOffice 0.0000
## NetProfit 0.0000
Appendix F: Stepwise Regression using Critic Reviews from Individual Publications
fit2.all=lm(DomesticBoxOffice ~ NYT + NYDaily + LATimes + EW + Variety + AVClub + HollywoodReporter + Budget, dat)
summary(fit2.all)
##
## Call:
## lm(formula = DomesticBoxOffice ~ NYT + NYDaily + LATimes + EW +
## Variety + AVClub + HollywoodReporter + Budget, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -176616718 -60622143 -1640936 66778243 218710420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.416e+08 1.266e+08 1.119 0.27575
## NYT -2.945e+06 1.824e+06 -1.615 0.12131
## NYDaily 1.254e+06 1.231e+06 1.018 0.32013
## LATimes 3.199e+06 1.598e+06 2.001 0.05843 .
## EW -1.116e+05 1.344e+06 -0.083 0.93463
## Variety 5.492e+05 1.913e+06 0.287 0.77687
## AVClub -9.412e+05 2.136e+06 -0.441 0.66397
## HollywoodReporter -1.486e+06 2.145e+06 -0.693 0.49618
## Budget 1.000e+00 2.985e-01 3.351 0.00303 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 108100000 on 21 degrees of freedom
## Multiple R-squared: 0.5659, Adjusted R-squared: 0.4006
## F-statistic: 3.422 on 8 and 21 DF, p-value: 0.01125
fit3.all=lm(DomesticBoxOffice ~ NYT + NYDaily + LATimes + Variety + AVClub + HollywoodReporter + Budget, dat)
summary(fit3.all)
##
## Call:
## lm(formula = DomesticBoxOffice ~ NYT + NYDaily + LATimes + Variety +
## AVClub + HollywoodReporter + Budget, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -176595841 -59048340 -988146 67387295 218097111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.402e+08 1.226e+08 1.144 0.26492
## NYT -2.909e+06 1.730e+06 -1.681 0.10694
## NYDaily 1.253e+06 1.203e+06 1.042 0.30886
## LATimes 3.209e+06 1.557e+06 2.061 0.05127 .
## Variety 5.207e+05 1.839e+06 0.283 0.77973
## AVClub -9.836e+05 2.027e+06 -0.485 0.63224
## HollywoodReporter -1.546e+06 1.971e+06 -0.785 0.44108
## Budget 9.949e-01 2.847e-01 3.494 0.00205 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 105700000 on 22 degrees of freedom
## Multiple R-squared: 0.5658, Adjusted R-squared: 0.4276
## F-statistic: 4.095 on 7 and 22 DF, p-value: 0.005093
fit4.all=lm(DomesticBoxOffice ~ NYT + NYDaily + LATimes + HollywoodReporter + Budget, dat)
summary(fit4.all)
##
## Call:
## lm(formula = DomesticBoxOffice ~ NYT + NYDaily + LATimes + HollywoodReporter +
## Budget, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -188301647 -54364604 -535060 66038182 210624154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.173e+08 9.801e+07 1.196 0.243219
## NYT -2.946e+06 1.425e+06 -2.067 0.049659 *
## NYDaily 1.395e+06 9.838e+05 1.418 0.169031
## LATimes 3.116e+06 1.337e+06 2.330 0.028530 *
## HollywoodReporter -1.605e+06 1.900e+06 -0.845 0.406716
## Budget 9.635e-01 2.501e-01 3.852 0.000765 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.02e+08 on 24 degrees of freedom
## Multiple R-squared: 0.5585, Adjusted R-squared: 0.4665
## F-statistic: 6.071 on 5 and 24 DF, p-value: 0.0009086
fit5.all=lm(DomesticBoxOffice ~ NYT + NYDaily + LATimes + Variety + Budget, dat)
summary(fit5.all)
##
## Call:
## lm(formula = DomesticBoxOffice ~ NYT + NYDaily + LATimes + Variety +
## Budget, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -189844790 -52169909 1794011 57619712 208421298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.933e+07 8.250e+07 0.719 0.47902
## NYT -3.679e+06 1.478e+06 -2.489 0.02015 *
## NYDaily 7.578e+05 1.053e+06 0.720 0.47863
## LATimes 2.858e+06 1.422e+06 2.010 0.05581 .
## Variety 6.735e+05 1.772e+06 0.380 0.70727
## Budget 1.003e+00 2.772e-01 3.618 0.00137 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 103200000 on 24 degrees of freedom
## Multiple R-squared: 0.5481, Adjusted R-squared: 0.4539
## F-statistic: 5.821 on 5 and 24 DF, p-value: 0.001173
fit6.all=lm(DomesticBoxOffice ~ NYT + NYDaily + LATimes + Budget, dat)
summary(fit6.all)
##
## Call:
## lm(formula = DomesticBoxOffice ~ NYT + NYDaily + LATimes + Budget,
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -200698709 -49829545 -1034243 58739955 209759673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.776e+07 7.810e+07 0.868 0.393868
## NYT -3.428e+06 1.299e+06 -2.639 0.014114 *
## NYDaily 9.853e+05 8.509e+05 1.158 0.257825
## LATimes 3.029e+06 1.326e+06 2.285 0.031073 *
## Budget 9.599e-01 2.486e-01 3.861 0.000708 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 101400000 on 25 degrees of freedom
## Multiple R-squared: 0.5454, Adjusted R-squared: 0.4726
## F-statistic: 7.497 on 4 and 25 DF, p-value: 0.0004111
fit7.all=lm(DomesticBoxOffice ~ NYT + LATimes + Budget, dat)
summary(fit7.all)
##
## Call:
## lm(formula = DomesticBoxOffice ~ NYT + LATimes + Budget, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -251894038 -63759447 5739032 67293294 199651768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.549e+07 7.482e+07 1.276 0.213127
## NYT -3.608e+06 1.298e+06 -2.779 0.009982 **
## LATimes 3.538e+06 1.259e+06 2.811 0.009264 **
## Budget 1.012e+00 2.461e-01 4.113 0.000348 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 102100000 on 26 degrees of freedom
## Multiple R-squared: 0.521, Adjusted R-squared: 0.4657
## F-statistic: 9.425 on 3 and 26 DF, p-value: 0.000218
fit8.all=lm(DomesticBoxOffice ~ LATimes + Budget, dat)
summary(fit8.all)
##
## Call:
## lm(formula = DomesticBoxOffice ~ LATimes + Budget, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -228370822 -59643928 -18224362 33582796 323486046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.983e+07 7.934e+07 0.376 0.70989
## LATimes 1.287e+06 1.077e+06 1.195 0.24240
## Budget 9.711e-01 2.745e-01 3.537 0.00148 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 114100000 on 27 degrees of freedom
## Multiple R-squared: 0.3786, Adjusted R-squared: 0.3326
## F-statistic: 8.226 on 2 and 27 DF, p-value: 0.001623
Appendix G: Histogram of the Residuals and Plot of the Residuals against Fitted Values
# Diagnostics for the final model (fit7.all): residuals vs. fitted values
plot(fitted(fit7.all), residuals(fit7.all),
ylab="Residuals", xlab="Predicted Domestic Box Office Earnings")
abline(0, 0)
# Histogram of residuals with a normal curve of matching standard deviation
hist(residuals(fit7.all), xlab="Residuals", main="Histogram of Residuals", freq=FALSE)
curve(dnorm(x,0,sd(residuals(fit7.all))), add = TRUE)
plot(imdb, meta, main = "IMDb vs. Metacritic Ratings", xlab = "IMDb Ratings (out of 10)", ylab = "Metacritic Ratings (out of 100)")
abline(lm(meta~imdb))
summary(lm(meta~imdb))
##
## Call:
## lm(formula = meta ~ imdb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.6933 -6.9988 -0.1933 4.4587 17.3460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16.065 11.686 -1.375 0.18
## imdb 11.307 1.647 6.866 1.84e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.269 on 28 degrees of freedom
## (968 observations deleted due to missingness)
## Multiple R-squared: 0.6274, Adjusted R-squared: 0.6141
## F-statistic: 47.14 on 1 and 28 DF, p-value: 1.842e-07
plot(boxoffice, profit, main = "Profit vs. Box Office Performance", xlab = "Box Office Performance (in USD)", ylab = "Profit (in USD)")
abline(lm(profit~boxoffice))
summary(lm(boxoffice~profit))
##
## Call:
## lm(formula = boxoffice ~ profit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -109166210 -73234931 3431639 73867885 134936815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.110e+08 2.142e+07 5.179 1.70e-05 ***
## profit 1.020e+00 1.320e-01 7.727 2.04e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 80310000 on 28 degrees of freedom
## (968 observations deleted due to missingness)
## Multiple R-squared: 0.6807, Adjusted R-squared: 0.6693
## F-statistic: 59.7 on 1 and 28 DF, p-value: 2.04e-08
testdata <- read.csv("testdata.csv", header=TRUE)
imdb.lm = lm(Domestic.Box.Office~IMDb, dat2)
summary(imdb.lm)
##
## Call:
## lm(formula = Domestic.Box.Office ~ IMDb, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -165723965 -110000483 -35192161 98670916 422156600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -76596994 192084732 -0.399 0.693
## IMDb 43805486 27068958 1.618 0.117
##
## Residual standard error: 135900000 on 28 degrees of freedom
## (968 observations deleted due to missingness)
## Multiple R-squared: 0.08553, Adjusted R-squared: 0.05287
## F-statistic: 2.619 on 1 and 28 DF, p-value: 0.1168
imdb.residuals = resid(imdb.lm)
fitted = predict(imdb.lm)
meta.lm = lm(Domestic.Box.Office~Metacritic, dat2)
summary(meta.lm)
##
## Call:
## lm(formula = Domestic.Box.Office ~ Metacritic, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -174217431 -91756654 -36759838 89597412 436887296
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1115838 120617603 0.009 0.9927
## Metacritic 3630422 1860395 1.951 0.0611 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 133300000 on 28 degrees of freedom
## (968 observations deleted due to missingness)
## Multiple R-squared: 0.1197, Adjusted R-squared: 0.08828
## F-statistic: 3.808 on 1 and 28 DF, p-value: 0.06108
meta.residuals = resid(meta.lm)
fitted = predict(meta.lm)
tomatoes.lm = lm(Domestic.Box.Office~Rotten.Tomatoes, dat2)
summary(tomatoes.lm)
##
## Call:
## lm(formula = Domestic.Box.Office ~ Rotten.Tomatoes, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -153826245 -108105921 -41055498 90630946 411679078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62255596 84253539 0.739 0.466
## Rotten.Tomatoes 2510751 1196545 2.098 0.045 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 132100000 on 28 degrees of freedom
## (968 observations deleted due to missingness)
## Multiple R-squared: 0.1359, Adjusted R-squared: 0.105
## F-statistic: 4.403 on 1 and 28 DF, p-value: 0.04502
tomatoes.residuals = resid(tomatoes.lm)
fitted = predict(tomatoes.lm)
roger.lm = lm(Domestic.Box.Office~Roger.Ebert.com, dat2)
summary(roger.lm)
##
## Call:
## lm(formula = Domestic.Box.Office ~ Roger.Ebert.com, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -188668802 -75965305 -28502003 79879938 411436806
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91629692 89625199 1.022 0.315
## Roger.Ebert.com 49710504 30576757 1.626 0.115
##
## Residual standard error: 135900000 on 28 degrees of freedom
## (968 observations deleted due to missingness)
## Multiple R-squared: 0.08625, Adjusted R-squared: 0.05362
## F-statistic: 2.643 on 1 and 28 DF, p-value: 0.1152
roger.residuals = resid(roger.lm)
fitted = predict(roger.lm)
# Residuals vs. fitted values. Each panel takes the fitted values from its own model,
# because the shared `fitted` variable was overwritten and only holds the roger.lm values.
par2 <- par(mfrow=c(2, 2))
plot(predict(imdb.lm), imdb.residuals, main="IMDb", xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)
plot(predict(meta.lm), meta.residuals, main="Metacritic", xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)
plot(predict(tomatoes.lm), tomatoes.residuals, main="Rotten Tomatoes", xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)
plot(predict(roger.lm), roger.residuals, main="Roger Ebert", xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)
par(par2)
par3 <- par(mfrow=c(2, 2))
hist(imdb.residuals, main = "IMDb", xlab = "Residual", ylab = "Frequency")
hist(meta.residuals, main = "Metacritic", xlab = "Residual", ylab = "Frequency")
hist(tomatoes.residuals, main = "Rotten Tomatoes", xlab = "Residual", ylab = "Frequency")
hist(roger.residuals, main = "Roger Ebert", xlab = "Residual", ylab = "Frequency")
par(par3)
imdbbox.predict <- predict(imdb.lm, newdata = testdata)
imdbbox.predict
## 1 2 3 4 5 6 7
## 243183056 230041410 278227445 190616473 295749640 230041410 238802508
## 8 9 10
## 221280313 164333181 124908243
# Signed percent error of each prediction relative to the actual domestic box office
(imdbbox.predict-testdata$Domestic.Box.Office)/testdata$Domestic.Box.Office*100
## 1 2 3 4 5 6 7
## -40.39787 81.88387 -55.36073 67.61681 -34.00495 5.20231 72.48576
## 8 9 10
## 77.00438 -43.78403 131.73927
metabox.predict <- predict(meta.lm, newdata = testdata)
metabox.predict
## 1 2 3 4 5 6 7
## 247984509 237093244 251614930 262506195 284288725 226201979 251614930
## 8 9 10
## 157223968 189897763 146332703
(metabox.predict-testdata$Domestic.Box.Office)/testdata$Domestic.Box.Office*100
## 1 2 3 4 5 6
## -39.221076 87.459448 -59.630485 130.832368 -36.562392 3.446465
## 7 8 9 10
## 81.740101 25.765059 -35.038764 171.487558
tomatoesbox.predict <- predict(tomatoes.lm, newdata = testdata)
tomatoesbox.predict
## 1 2 3 4 5 6 7
## 273158698 245540435 293244708 263115693 280690952 230475928 275669449
## 8 9 10
## 135067381 185282406 122513625
(tomatoesbox.predict-testdata$Domestic.Box.Office)/testdata$Domestic.Box.Office*100
## 1 2 3 4 5 6
## -33.051093 94.138280 -52.951335 131.368324 -37.365217 5.401023
## 7 8 9 10
## 99.114550 8.041779 -36.617610 127.296594
rogerbox.predict <- predict(roger.lm, newdata = testdata)
rogerbox.predict
## 1 2 3 4 5 6 7
## 240761204 290471708 240761204 240761204 240761204 265616456 240761204
## 8 9 10
## 215905952 215905952 NA
(rogerbox.predict-testdata$Domestic.Box.Office)/testdata$Domestic.Box.Office*100
## 1 2 3 4 5 6 7
## -40.99145 129.66351 -61.37187 111.71111 -46.27534 21.47145 73.90051
## 8 9 10
## 72.70538 -26.14174 NA
Appendix H: Plot and histogram of the residuals for predicting critic reviews from budget data and prediction values with percent error.
# Predicting critic reviews based on budget
imdb.budget = lm(IMDb~Budget, data = final)
summary(imdb.budget)
##
## Call:
## lm(formula = IMDb ~ Budget, data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5735 -0.4357 -0.2054 0.6378 1.3224
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+00 2.782e-01 23.274 <2e-16 ***
## Budget 4.956e-09 2.026e-09 2.446 0.021 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8613 on 28 degrees of freedom
## (968 observations deleted due to missingness)
## Multiple R-squared: 0.1761, Adjusted R-squared: 0.1466
## F-statistic: 5.983 on 1 and 28 DF, p-value: 0.02099
imdbbudget.residuals = resid(imdb.budget)
fitted = predict(imdb.budget)
plot(fitted, imdbbudget.residuals, xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)
hist(imdbbudget.residuals, main = "Histogram of Residuals", xlab = "Residual", ylab = "Frequency")
imdb.predict <- predict(imdb.budget, newdata = testdata)
imdb.predict
## 1 2 3 4 5 6 7 8
## 6.871724 7.094731 7.590301 6.509958 7.838086 6.723053 6.683408 6.623939
## 9 10
## 7.150235 6.500047
(imdb.predict-testdata$IMDb)/testdata$IMDb*100
## 1 2 3 4 5 6 7
## -5.866788 1.353299 -6.292581 6.720629 -7.787224 -3.956379 -7.174891
## 8 9 10
## -2.589126 30.004269 41.305369
meta.budget = lm(Metacritic~Budget, data = final)
summary(meta.budget)
##
## Call:
## lm(formula = Metacritic ~ Budget, data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8764 -11.4311 -0.8311 7.9899 26.6718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.647e+01 4.069e+00 13.880 4.47e-14 ***
## Budget 6.203e-08 2.963e-08 2.094 0.0455 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.6 on 28 degrees of freedom
## (968 observations deleted due to missingness)
## Multiple R-squared: 0.1353, Adjusted R-squared: 0.1045
## F-statistic: 4.383 on 1 and 28 DF, p-value: 0.04548
metabudget.residuals = resid(meta.budget)
fitted = predict(meta.budget)
plot(fitted, metabudget.residuals, xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)
hist(metabudget.residuals, main = "Histogram of Residuals", xlab = "Residual", ylab = "Frequency")
meta.predict <- predict(meta.budget, newdata = testdata)
meta.predict
## 1 2 3 4 5 6 7 8
## 61.43548 64.22677 70.42964 56.90739 73.53107 59.57462 59.07839 58.33405
## 9 10
## 64.92149 56.78333
(meta.predict-testdata$Metacritic)/testdata$Metacritic*100
## 1 2 3 4 5 6
## -9.653707 -1.189586 2.071937 -20.961964 -5.729397 -3.911905
## 7 8 9 10
## -14.379146 35.660571 24.849020 41.958321
tomatoes.budget = lm(Rotten.Tomatoes~Budget, data = final)
summary(tomatoes.budget)
##
## Call:
## lm(formula = Rotten.Tomatoes ~ Budget, data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.669 -18.068 3.756 16.414 26.629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.851e+01 6.420e+00 9.113 7.18e-10 ***
## Budget 7.911e-08 4.675e-08 1.692 0.102
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.88 on 28 degrees of freedom
## (968 observations deleted due to missingness)
## Multiple R-squared: 0.09276, Adjusted R-squared: 0.06036
## F-statistic: 2.863 on 1 and 28 DF, p-value: 0.1017
tomatoesbudget.residuals = resid(tomatoes.budget)
fitted = predict(tomatoes.budget)
plot(fitted, tomatoesbudget.residuals, xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)
hist(tomatoesbudget.residuals, main = "Histogram of Residuals", xlab = "Residual", ylab = "Frequency")
tomatoes.predict <- predict(tomatoes.budget, newdata = testdata)
tomatoes.predict
## 1 2 3 4 5 6 7 8
## 64.83374 68.39353 76.30418 59.05897 80.25950 62.46054 61.82769 60.87841
## 9 10
## 69.27952 58.90075
(tomatoes.predict-testdata$Rotten.Tomatoes)/testdata$Rotten.Tomatoes*100
## 1 2 3 4 5 6
## -22.816977 -6.310231 -17.060674 -26.176294 -7.747696 -6.775307
## 7 8 9 10
## -27.261538 109.925567 41.386783 145.419801
roger.budget = lm(Roger.Ebert.com~Budget, data = final)
summary(roger.budget)
##
## Call:
## lm(formula = Roger.Ebert.com ~ Budget, data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39827 -0.48632 -0.08226 0.47214 1.56585
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.309e+00 2.450e-01 9.423 3.51e-10 ***
## Budget 4.485e-09 1.784e-09 2.514 0.018 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7584 on 28 degrees of freedom
## (968 observations deleted due to missingness)
## Multiple R-squared: 0.1842, Adjusted R-squared: 0.155
## F-statistic: 6.32 on 1 and 28 DF, p-value: 0.01796
rogerbudget.residuals = resid(roger.budget)
fitted = predict(roger.budget)
plot(fitted, rogerbudget.residuals, xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)
hist(rogerbudget.residuals, main = "Histogram of Residuals", xlab = "Residual", ylab = "Frequency")
roger.predict <- predict(roger.budget, newdata = testdata)
roger.predict
## 1 2 3 4 5 6 7 8
## 2.667385 2.869218 3.317737 2.339966 3.541997 2.532829 2.496947 2.443125
## 9 10
## 2.919452 2.330995
(roger.predict-testdata$Roger.Ebert.com)/testdata$Roger.Ebert.com*100
## 1 2 3 4 5 6
## -11.087180 -28.269546 10.591238 -22.001142 18.066555 -27.633460
## 7 8 9 10
## -16.768421 -2.274996 16.778091 NA
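Every percent-error line in these appendices applies the same formula, (predicted - actual) / actual × 100. The sketch below wraps that formula in a small helper so the calculation is written only once. The function name `percent_error` and the toy vectors are illustrative assumptions and are not part of the original analysis.

```r
# Hypothetical helper (not in the original analysis): signed percent error
# of predictions relative to the observed values.
percent_error <- function(predicted, actual) {
  (predicted - actual) / actual * 100
}

# Toy values for illustration only; these are not taken from the study data.
predicted <- c(110, 90, NA)
actual    <- c(100, 100, 50)

# A missing prediction propagates as NA, as in the Roger Ebert output above.
errors <- percent_error(predicted, actual)
errors
```

Under the same assumptions, a call such as `percent_error(roger.predict, testdata$Roger.Ebert.com)` would stand in for the inline expression used above.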