Abstract

Background: This research paper looks at quantifiable and easily obtainable factors that affect a movie’s success such as critic reviews, domestic box office performance, overall budget, and profit in US Dollars (USD) using data from notable websites. The goal of this analysis is to test the relationship between these four variables: critic reviews, budget, domestic box office performance, and net profit, using simple linear regression and to possibly develop a multiple linear regression model that uses critic reviews to predict a movie’s expected box office performance.

Methods: This research draws upon Google Trends data to select the top ten trending movies for the past three years. The critic reviews (IMDb, Metacritic, Rotten Tomatoes, and Roger Ebert) came from their respective websites, while domestic box office performance and production budgets were sourced from The Numbers. Initial analysis was done via simple linear regression, culminating in a final multiple regression model created by stepwise regression using critic review scores from seven different publications.

Results: A t-test for correlation showed a significant association between box office performance and Rotten Tomatoes ratings (p-value = 0.045). The associations between budget and IMDb ratings (p-value = 0.018), budget and Metacritic ratings (p-value = 0.046), and budget and Roger Ebert ratings (p-value = 0.018) were all significant. The final multiple regression model, which uses critic reviews and budget to predict box office performance, was statistically significant (p-value = 0.00022) and had a moderate degree of predictive power (adjusted R2 = 0.47).

Conclusions: Critic reviews exhibit a statistically significant, but weak influence on the success of a movie. A multiple linear regression model using budget data and critic reviews was only able to explain some of the variability in domestic box office performance. Future research using a larger sample size and taking into account extraneous variables is needed to conclusively quantify the magnitude of critic reviews’ effect.

Introduction

Movies are an integral part of modern culture, generating billions of dollars each year and delivering rich, intricately crafted stories to a worldwide audience. The film industry not only serves as a deeply expressive artistic medium but also has a significant impact on the global economy. Global box office revenue is expected to increase from approximately 38 billion USD in 2016 to about 50 billion USD in 2020 (Statista, 2016). Moreover, the US is among the countries that produce the most movies per year, ranking as the third largest film producer in the world behind China and India (Statista, 2016).

This study analyzes how different factors affect a movie’s box office success for a selection of the top ten trending movies in the US in each of the past three years. Many of these factors are either impossible to quantify or difficult to obtain reliably, such as the profitability of a movie’s release period, the popularity of its content, the strength of its marketing strategy, and its advertising budget (Mullich, 2014). This study therefore focuses on factors that can be obtained consistently for all the movies of interest: critic reviews from a variety of highly respected websites (IMDb, Metacritic, Rotten Tomatoes, and Roger Ebert), movie budget, domestic box office performance, and overall net profit, with all monetary values measured in USD.

The hypothesis of this study was that movies which are more favorably reviewed by consumers will have better domestic box office performance, and hence higher box office sales. Following this hypothesis, a multiple linear regression model was developed that factors in numerous critic review sources to predict a movie’s expected box office performance. It was also hypothesized that some of the review websites selected for the study would be better predictors of a movie’s success than others: review websites that reflect the general populace’s sentiments, such as IMDb and Rotten Tomatoes, were expected to be better predictors of net profit than review websites that focus solely on professional reviews, such as Metacritic and Roger Ebert. The primary goal of this study is to test the relationship between the four variables (critic reviews, budget, domestic box office performance, and net profit) using simple linear regression, and to develop a multiple linear regression model that uses these variables to predict a movie’s expected box office performance.

Methods

Data was selected using Google, the most popular information aggregator. Google Trends data provided the top ten trending movies for each year. Analyzing the top ten trending movies from 2013, 2014, and 2015 yields a total sample size of thirty movies. For these movies, the relationships between the four main variables (critic reviews, domestic box office performance, budget, and net profit) were tested using linear regression. The critic reviews that were used (IMDb, Metacritic, Rotten Tomatoes, and Roger Ebert) came from their respective websites, while domestic box office performance and production budget were sourced from The Numbers, a movie industry data and research company. To obtain net profit, production budget was subtracted from domestic box office performance. Overall, these sources of data were found to be reliable and consistent, providing usable data for the project. Although advertising budgets were considered as a potential factor, they were not available in any free, reliable database, so they would have had to be looked up individually for each movie. This would have introduced inconsistencies into the data collection, since different sources use different methods of estimating advertising budgets.
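The net profit calculation can be illustrated with a short R sketch. The three rows below are copied from Appendix A; the data frame name `movies` is chosen here for illustration and is not taken from the original analysis.

```r
# Budget and domestic box office values (USD) copied from Appendix A.
movies <- data.frame(
  Movie = c("Jurassic World", "Mad Max: Fury Road", "Pacific Rim"),
  Budget = c(215000000, 150000000, 190000000),
  DomesticBoxOffice = c(652198010, 153636354, 101802906)
)

# Net profit = domestic box office performance minus production budget.
movies$NetProfit <- movies$DomesticBoxOffice - movies$Budget
print(movies)
```

A negative value, as for Pacific Rim, means the domestic box office did not cover the production budget.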

Initial analysis of the data found that Metacritic was not very predictive of domestic box office performance. Additionally, Rotten Tomatoes and IMDb are continuous review websites: that is, users can continually review the movies long after they are released and unavailable in theaters. An accurate model used to predict how critic reviews are linked to box office performance cannot factor in these continuous reviews. Thus, in an attempt to alleviate these issues, individual review sources that were included in Metacritic were chosen to develop a new model based on critic reviews. Although Metacritic was not as predictive as Rotten Tomatoes and IMDb, the reviews aggregated by it are not continuous, providing a good source of critic reviews. These reviews included by Metacritic were published before and around when each movie was released, which would theoretically influence consumer behavior.

A correlation matrix was used to determine the most significant predictors of domestic box office performance (Appendix E). Of the variables tested in this study, movie budget and net profit had the strongest correlations with domestic box office performance (p-value < 0.001 for both); among the review sources, only Rotten Tomatoes ratings reached significance at the 5% level, and only marginally (p-value = 0.045). Movie budget was chosen as one of the variables for the multiple linear regression model because budget can be used practically to predict box office earnings before a movie is released. Individual critic reviews were also included in the model to predict box office earnings. Using Metacritic, which contains a database of critic review scores rated out of 100, scores from 22 different publications were recorded for each of the 30 movies. Of these 22 publications, 7 were chosen for their completeness, and critic scores from these publications, together with budget, were used to create a linear model in R to predict domestic box office earnings. The model was pared down by removing one variable at a time, based on the highest p-value given by the summary statistics of the linear model. The final linear model was determined by checking the p-value and adjusted R-squared value after each stepwise removal of a variable from the model (Appendix F).
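The paring-down procedure described above is a form of backward elimination. The following is a minimal sketch of that idea rather than the code used in this study: the helper `backward_eliminate`, the 0.05 cutoff, and the simulated data frame `sim` are all assumptions introduced for illustration, and the sketch uses p-values alone, whereas the study also checked adjusted R-squared at each step.

```r
# Backward elimination sketch: repeatedly drop the predictor with the
# largest p-value until all remaining predictors have p-values <= alpha.
backward_eliminate <- function(data, response, predictors, alpha = 0.05) {
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    pvals <- summary(fit)$coefficients[predictors, "Pr(>|t|)"]
    names(pvals) <- predictors  # keep names when only one predictor remains
    worst <- which.max(pvals)
    # Stop when the weakest predictor is significant or only one is left.
    if (pvals[worst] <= alpha || length(predictors) == 1) break
    predictors <- predictors[-worst]
  }
  list(fit = fit, predictors = predictors)
}

# Simulated (not real) data: y depends on `signal` only; the other two
# columns are pure noise.
set.seed(1)
n <- 30
sim <- data.frame(signal = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
sim$y <- 3 * sim$signal + rnorm(n)

result <- backward_eliminate(sim, "y", c("signal", "noise1", "noise2"))
print(result$predictors)
```

Applied to the study's data, the equivalent call would start from the seven publication scores plus budget, as in the full model of Appendix F.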

Results

Thirty movies were included in the final analysis, and overall characteristics of these movies can be found in the appendix (A, B). Both the distribution of box office performance and the distribution of net profit were skewed to the right. The average box office performance was found to be $231,647,612, while the average net profit was $118,364,278. The distribution of budget appears to be bimodal and centered around $113,283,333. For the review sources used, both the distributions of IMDb ratings and Rotten Tomatoes ratings are left skewed, while both the distributions of Metacritic ratings and Roger Ebert ratings appear to be relatively normal. The average IMDb rating was 7.03, the average Metacritic rating was 63.5, the average Rotten Tomatoes rating was 67.5, and the average Roger Ebert rating was 2.82. Histograms showing all these distributions are included in the appendix (C,D).

As the study is mainly focused on the predictors of box office performance, one of the first analyses conducted was to determine the relationship between box office performance and the four main critic review sites included. Although the resulting R2 values for all four linear models indicated an overall weak relationship between box office performance and critic reviews, the p-value for Rotten Tomatoes was significant (p-value = 0.045, R2 = 0.1359).

Figure 1. Domestic Box Office Performance and Critic Review Scatterplots with a Linear Regression Line

Review website    R2      p-value
IMDb              0.080   0.129
Metacritic        0.120   0.061
Rotten Tomatoes   0.136   0.045
Roger Ebert       0.086   0.115
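For a simple linear regression, the p-value of the t-test for correlation can be recovered from R2 and the sample size alone, since t = sqrt(R2 (n - 2) / (1 - R2)) with n - 2 degrees of freedom. The sketch below applies this to the Rotten Tomatoes result reported above (R2 = 0.1359, n = 30); the function name `cor_p_value` is introduced here for illustration only.

```r
# Two-sided p-value of the t-test for a Pearson correlation,
# computed from the reported R^2 and the sample size.
cor_p_value <- function(r_squared, n) {
  t_stat <- sqrt(r_squared * (n - 2) / (1 - r_squared))
  2 * pt(-t_stat, df = n - 2)
}

p_rt <- cor_p_value(0.1359, 30)
print(round(p_rt, 3))
```

The result agrees with the reported Rotten Tomatoes p-value of 0.045 up to rounding.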

In the interest of exploring confounding variables, the association between season of release and box office performance was considered next. Overall, there was no evidence of association between these two variables. An attempt to correlate the two variables resulted in an R2 value of 0.001 and a p-value of 0.880, both of which suggest that there is no relationship between the two variables.

The relationship between budget and critic reviews was also analyzed. Only the association between budget and Rotten Tomatoes ratings was not significant (p-value = 0.102); for the other three review sources, the associations were significant (p-value < 0.05). With this information, simple linear regressions were fitted relating movie production budget to the rating from each of these three review websites:

\[\begin{aligned} \text{Budget} &= 35525592 \times \text{IMDb} - 136698416 \\ \text{Budget} &= 2181946 \times \text{Metacritic} - 25270268 \\ \text{Budget} &= 41058252 \times \text{RogerEbert} - 2364078 \end{aligned}\]

Figure 2. Budget and Critic Review Scatterplots with a Linear Regression Line

Review website    R2      p-value
IMDb              0.183   0.018
Metacritic        0.135   0.046
Rotten Tomatoes   0.093   0.102
Roger Ebert       0.184   0.018

In addition to the significance tests, a percent error check using data from the top trending movies of 2012 was used to assess the fit of the models. The error for IMDb ranged between 1.25% and 40.87%, for Metacritic between 1.20% and 41.96%, and for Roger Ebert between 2.27% and 28.27% (Appendix H), so the models were considered a reasonable, though imprecise, fit.
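As an illustration of how such a percent error is computed, the sketch below applies the IMDb budget equation to one movie. The 2012 test data are not reproduced in this section, so the example uses an in-sample movie from Appendices A and B (Godzilla) instead. The definition of percent error used here, |predicted - actual| / |actual| x 100, is an assumption, as are the function names.

```r
# Coefficients taken from the fitted IMDb budget equation reported above.
predict_budget_imdb <- function(imdb) 35525592 * imdb - 136698416

# Assumed definition of percent error.
percent_error <- function(predicted, actual) {
  abs(predicted - actual) / abs(actual) * 100
}

# Godzilla: IMDb rating 6.5 (Appendix B), budget 160,000,000 USD (Appendix A).
predicted <- predict_budget_imdb(6.5)
err <- percent_error(predicted, 160000000)
print(c(predicted = predicted, percent_error = round(err, 2)))
```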

The final multiple linear model uses critic scores from two publications, the Los Angeles Times and the New York Times, together with movie budget to predict domestic box office earnings. The prediction equation after stepwise removal of predictors is:

\[ \text{Predicted box office earnings} = 95{,}490{,}000 + 3{,}538{,}000 \times \text{LATimes} + 1.012 \times \text{Budget} - 3{,}608{,}000 \times \text{NYT} \]
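To show how the prediction equation is used, the sketch below evaluates it for a single hypothetical movie. The input values (a Los Angeles Times score of 70, a New York Times score of 60, and a budget of 100 million USD) are invented for illustration and do not correspond to any movie in the dataset; the function name is likewise an assumption.

```r
# Rounded coefficients from the final model's prediction equation.
predict_box_office <- function(la_times, nyt, budget) {
  95490000 + 3538000 * la_times + 1.012 * budget - 3608000 * nyt
}

# Hypothetical inputs, not taken from the dataset.
example <- predict_box_office(la_times = 70, nyt = 60, budget = 100e6)
print(example)
```

With these inputs the equation gives predicted domestic box office earnings of about 227.9 million USD.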

Using an ANOVA F-test with 3 and 26 degrees of freedom, the p-value for the linear model is about 0.00022, so there is evidence that the linear model is useful for predicting domestic box office earnings. Based on the p-values of the t-tests for the individual terms (LATimes p = 0.0093, Budget p = 0.00035, NYT p = 0.010), all three variables are effective predictors in this model, with movie budget being the most significant predictor of domestic box office earnings. According to the adjusted R-squared value, the two publications and movie budget in this model account for only 47% of the variability in domestic box office earnings. This moderate value suggests that other variables may contribute to, or more accurately explain, the variability in domestic box office earnings. The histogram of the residuals was slightly left skewed, and a plot of the residuals against the fitted values showed roughly constant variability (Appendix G), so a linear regression model is appropriate for prediction.

## 
## Call:
## lm(formula = DomesticBoxOffice ~ NYT + Budget + LATimes, data = dat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -251894038  -63759447    5739032   67293294  199651768 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.549e+07  7.482e+07   1.276 0.213127    
## NYT         -3.608e+06  1.298e+06  -2.779 0.009982 ** 
## Budget       1.012e+00  2.461e-01   4.113 0.000348 ***
## LATimes      3.538e+06  1.259e+06   2.811 0.009264 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102100000 on 26 degrees of freedom
## Multiple R-squared:  0.521,  Adjusted R-squared:  0.4657 
## F-statistic: 9.425 on 3 and 26 DF,  p-value: 0.000218

Figure 3. Summary Statistics for Final Multiple Linear Regression Model
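The adjusted R-squared, F-statistic, and overall p-value in the summary above follow directly from the multiple R-squared (0.521), the sample size (n = 30), and the number of predictors (k = 3). The short check below recomputes them from those rounded inputs, so small differences in the last digit are expected; the variable names are chosen here for illustration.

```r
r2 <- 0.521  # multiple R-squared from the model summary
n  <- 30     # number of movies
k  <- 3      # predictors: NYT, LATimes, Budget

# Adjusted R-squared penalizes the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Overall F-statistic and its p-value on k and n - k - 1 degrees of freedom.
f_stat  <- (r2 / k) / ((1 - r2) / (n - k - 1))
p_value <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)

print(c(adj_r2 = adj_r2, f_stat = f_stat, p_value = p_value))
```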

Discussion and Conclusion

Although a multiple regression model predicting domestic box office performance from critic reviews and budget was successfully generated, causality between these variables cannot be inferred, because this is an observational study in which the variables were simply recorded rather than actively controlled. Furthermore, other confounding factors may influence the relationship, such as word-of-mouth information, marketing budget, the presence of a major distributor, and the genre of the movie. Low p-values for the coefficients in the model do not change this. What can be concluded is a linear association between domestic box office performance and both critic review scores and budget: in the final model, the coefficients for budget and Los Angeles Times scores are positive, while the coefficient for New York Times scores is negative. Additionally, the linear regression model explains only about 47% of the variability in box office earnings. In short, a causal relationship between these variables of interest cannot be inferred, but prediction is possible.

The first evidence of the limited predictive power of individual critic reviews can be observed in the preliminary research findings. The R-squared value for a linear model fitting Roger Ebert, an individual publication, with domestic box office earnings was 0.086 with a corresponding p-value of 0.115, indicating that there is inconclusive evidence that the model is an effective predictor of domestic box office earnings (Figure 1). Conceptually, this is consistent with pre-existing literature which suggests that critics have negligible influence on widely-released movies or popular genres such as action or comedy movies (Reinstein, 2005). Since all 30 of the movies sampled in this study were chosen using Google top trends, movies in this dataset were likely to be popular releases and, therefore, critic reviews would be expected to explain only a small amount of the variation in domestic box office earnings.

Besides the omission of other potential predictors, there are various reasons why a linear model using critic reviews may be limited in explaining the variability in domestic box office earnings. Critic reviews are highly subjective and are meant to evaluate movies as art, so critics may not accurately reflect how a movie is received by the general movie-going public. For example, Fifty Shades of Grey, which scored 46 on Metacritic (Appendix B), earned a higher net profit than the median for the 30 movies (Appendix A). In a world of growing social media and declining print publications, it is also important to consider the evolving roles of professional critics and peer critics.

There were also flaws in the data that were beyond the control of this study. Out of the 22 publications for which critic review scores were collected, only 7 publications had scores for all 30 movies. This presented a significant challenge in using stepwise regression to create a linear model. Because some publications were missing several review scores, and the movies with missing scores varied from publication to publication, using all 22 publications in the linear model would have dropped 18 movies from the multiple regression analysis. This deletion due to missing values would have reduced the sample size by more than half and compromised the informativeness of the stepwise regression. To circumvent this problem, only the 7 publications that had review scores for all 30 movies were used in the stepwise regression.
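The effect of missing scores on the usable sample can be seen in a small R sketch. By default, `lm` drops every row that has a missing value in any variable of the model, so the usable sample is the set of complete cases. The scores and publication names below are hypothetical and serve only to illustrate the mechanism.

```r
# Hypothetical scores for five movies from three publications;
# NA marks a movie that a publication did not review.
scores <- data.frame(
  PubA = c(70, 55, NA, 80, 62),
  PubB = c(65, NA, 48, 75, 60),
  PubC = c(72, 58, 50, 78, 64)
)

# Rows usable if all three publications are predictors.
complete_all <- sum(complete.cases(scores))

# Rows usable if only the fully populated publication is a predictor.
complete_c <- sum(complete.cases(scores["PubC"]))

print(c(all_publications = complete_all, complete_publication_only = complete_c))
```

Here two of the five movies are lost when all publications are used, even though each publication is missing only one score, because the gaps fall on different movies.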

To create a multiple linear regression model that explains a larger percentage of the variability in box office earnings, further research should investigate other factors that could affect the popularity of a movie. Data on word-of-mouth interaction, marketing budgets, production values, distributors, and genre would have to be collected and combined with budget data to fit a new linear regression model, which would then be pared down using stepwise regression. Further research would also need a larger sample of movies in order to mitigate the effect of deletion due to missing observations.

References

IMDb. IMDb.com, Inc., n.d. Web. 05 Dec. 2016.

Irvine, Victoria. “Topic: Film Industry.” www.statista.com. N.p., 2016. Web. 01 Dec. 2016.

“Movie Box Office Records.” The Numbers. Nash Information Services, LLC, n.d. Web. 05 Dec. 2016.

“Movie Budgets.” The Numbers. Nash Information Services, LLC, n.d. Web. 05 Dec. 2016.

“Movie Reviews.” Metacritic. CBS Interactive, Inc., n.d. Web. 05 Dec. 2016.

Mullich, David. “Movies: What Determines the Success of a Movie (by Box Office Revenue)?” Quora. N.p., n.d. Web. 01 Dec. 2016.

“Reviews.” Roger Ebert. Ebert Digital, LLC, n.d. Web. 5 Dec. 2016.

Reinstein, David, et al. “The Influence of Expert Reviews on Consumer Demand for Experience Goods: A Case Study of Movie Critics” The Journal of Industrial Economics. Vol. 53, No. 1 (Mar., 2005), pp 29. Web. 05 Dec. 2016 http://www.jstor.org/stable/3569753

Rotten Tomatoes. Fandango, n.d. Web. 5 Dec. 2016.

RStudio Team (2015). RStudio: Integrated Development for R. RStudio, Inc., Boston, MA URL http://www.rstudio.com/.

Appendix

Appendix A: Financial Summary of Movies Analyzed

Movie                   YearReleased  Budget    DomesticBoxOffice  NetProfit
Jurassic World          2015          2.15e+08  652198010          437198010
American Sniper         2014          5.80e+07  350126372          292126372
Straight Outta Compton  2015          2.80e+07  161058685          133058685
50 Shades of Grey       2015          4.00e+07  166167230          126167230
Furious 7               2015          1.90e+08  351032910          161032910
Pitch Perfect 2         2015          2.90e+07  183785415          154785415
Inside Out              2015          1.75e+08  356461711          181461711
Avengers Age of Ultron  2015          2.50e+08  459005868          209005868
Minions                 2015          7.40e+07  336045770          262045770
Mad Max: Fury Road      2015          1.50e+08  153636354          3636354
Frozen                  2013          1.50e+08  400738009          250738009
Interstellar            2014          1.65e+08  188017894          23017894
Divergent               2014          8.50e+07  150947895          65947895
Gone Girl               2014          6.10e+07  167767189          106767189
Lone Survivor           2013          4.00e+07  125095601          85095601
Godzilla                2014          1.60e+08  200676069          40676069
22 Jump Street          2014          5.00e+07  191719337          141719337
Big Hero 6              2014          1.65e+08  222527828          57527828
Annabelle               2014          6.50e+06  84273813           77773813
Maleficent              2014          1.80e+08  241407328          61407328
Man of Steel            2013          2.25e+08  291045518          66045518
Iron Man 3              2013          2.00e+08  408992272          208992272
World War Z             2013          1.90e+08  202359711          12359711
Jobs                    2013          1.80e+07  16131410           -1868590
The Conjuring           2013          2.00e+07  137400141          117400141
The Great Gatsby        2013          1.90e+08  144840419          -45159581
Despicable Me 2         2013          7.60e+07  368065385          292065385
The Purge               2013          3.00e+06  64473115           61473115
Pacific Rim             2013          1.90e+08  101802906          -88197094
Mama                    2013          1.50e+07  71628180           56628180

Appendix B: Review Score Summary of Movies Analyzed

Movie                   IMDb  Metacritic  RottenTomatoes  RogerEbert
Jurassic World          7.0   59          71              3.0
American Sniper         7.3   72          72              3.5
Straight Outta Compton  7.9   72          87              4.0
50 Shades of Grey       4.1   46          25              2.0
Furious 7               7.2   67          79              3.5
Pitch Perfect 2         6.5   63          66              2.5
Inside Out              8.2   94          98              4.0
Avengers Age of Ultron  7.5   66          75              3.0
Minions                 6.4   56          56              3.0
Mad Max: Fury Road      8.1   90          97              4.0
Frozen                  7.6   74          89              2.5
Interstellar            8.6   74          71              3.5
Divergent               6.7   48          41              2.5
Gone Girl               8.1   79          88              3.5
Lone Survivor           7.6   60          75              2.0
Godzilla                6.5   62          74              3.5
22 Jump Street          7.1   71          84              3.0
Big Hero 6              7.9   74          89              3.0
Annabelle               5.4   37          29              1.0
Maleficent              7.0   56          50              3.0
Man of Steel            7.1   55          55              3.0
Iron Man 3              7.2   79          79              2.5
World War Z             7.0   63          67              2.0
Jobs                    5.9   44          27              2.0
The Conjuring           7.5   65          86              1.0
The Great Gatsby        7.3   55          48              2.5
Despicable Me 2         7.5   62          73              3.0
The Purge               5.7   41          37              1.5
Pacific Rim             7.0   64          71              4.0
Mama                    6.2   57          65              3.0

Appendix C: Movie Capital Distribution Histograms

Appendix D: Critic Reviews Distribution Histograms

Appendix E: Correlation Matrix of Movie Data

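library(Hmisc)  # provides rcorr()
# Columns 4-10 of dat: the four review scores, Budget, DomesticBoxOffice, NetProfit
my_data <- dat[, 4:10]
res2 <- rcorr(as.matrix(my_data))  # correlation matrix with pairwise p-values
res2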
##                   IMDb Metacritic RottenTomatoes RogerEbert Budget
## IMDb              1.00       0.79           0.81       0.54   0.42
## Metacritic        0.79       1.00           0.91       0.65   0.37
## RottenTomatoes    0.81       0.91           1.00       0.56   0.30
## RogerEbert        0.54       0.65           0.56       1.00   0.43
## Budget            0.42       0.37           0.30       0.43   1.00
## DomesticBoxOffice 0.29       0.35           0.37       0.29   0.59
## NetProfit         0.07       0.17           0.24       0.06   0.03
##                   DomesticBoxOffice NetProfit
## IMDb                           0.29      0.07
## Metacritic                     0.35      0.17
## RottenTomatoes                 0.37      0.24
## RogerEbert                     0.29      0.06
## Budget                         0.59      0.03
## DomesticBoxOffice              1.00      0.83
## NetProfit                      0.83      1.00
## 
## n= 30 
## 
## 
## P
##                   IMDb   Metacritic RottenTomatoes RogerEbert Budget
## IMDb                     0.0000     0.0000         0.0019     0.0210
## Metacritic        0.0000            0.0000         0.0000     0.0455
## RottenTomatoes    0.0000 0.0000                    0.0013     0.1017
## RogerEbert        0.0019 0.0000     0.0013                    0.0180
## Budget            0.0210 0.0455     0.1017         0.0180           
## DomesticBoxOffice 0.1168 0.0611     0.0450         0.1152     0.0006
## NetProfit         0.7198 0.3674     0.1961         0.7402     0.8828
##                   DomesticBoxOffice NetProfit
## IMDb              0.1168            0.7198   
## Metacritic        0.0611            0.3674   
## RottenTomatoes    0.0450            0.1961   
## RogerEbert        0.1152            0.7402   
## Budget            0.0006            0.8828   
## DomesticBoxOffice                   0.0000   
## NetProfit         0.0000

Appendix F: Stepwise Regression using Critic Reviews from Individual Publications

fit2.all=lm(DomesticBoxOffice ~ NYT + NYDaily + LATimes + EW + Variety + AVClub + HollywoodReporter + Budget, dat)
summary(fit2.all)
## 
## Call:
## lm(formula = DomesticBoxOffice ~ NYT + NYDaily + LATimes + EW + 
##     Variety + AVClub + HollywoodReporter + Budget, data = dat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -176616718  -60622143   -1640936   66778243  218710420 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)        1.416e+08  1.266e+08   1.119  0.27575   
## NYT               -2.945e+06  1.824e+06  -1.615  0.12131   
## NYDaily            1.254e+06  1.231e+06   1.018  0.32013   
## LATimes            3.199e+06  1.598e+06   2.001  0.05843 . 
## EW                -1.116e+05  1.344e+06  -0.083  0.93463   
## Variety            5.492e+05  1.913e+06   0.287  0.77687   
## AVClub            -9.412e+05  2.136e+06  -0.441  0.66397   
## HollywoodReporter -1.486e+06  2.145e+06  -0.693  0.49618   
## Budget             1.000e+00  2.985e-01   3.351  0.00303 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 108100000 on 21 degrees of freedom
## Multiple R-squared:  0.5659, Adjusted R-squared:  0.4006 
## F-statistic: 3.422 on 8 and 21 DF,  p-value: 0.01125
fit3.all=lm(DomesticBoxOffice ~ NYT + NYDaily + LATimes + Variety + AVClub + HollywoodReporter + Budget, dat)
summary(fit3.all)
## 
## Call:
## lm(formula = DomesticBoxOffice ~ NYT + NYDaily + LATimes + Variety + 
##     AVClub + HollywoodReporter + Budget, data = dat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -176595841  -59048340    -988146   67387295  218097111 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)        1.402e+08  1.226e+08   1.144  0.26492   
## NYT               -2.909e+06  1.730e+06  -1.681  0.10694   
## NYDaily            1.253e+06  1.203e+06   1.042  0.30886   
## LATimes            3.209e+06  1.557e+06   2.061  0.05127 . 
## Variety            5.207e+05  1.839e+06   0.283  0.77973   
## AVClub            -9.836e+05  2.027e+06  -0.485  0.63224   
## HollywoodReporter -1.546e+06  1.971e+06  -0.785  0.44108   
## Budget             9.949e-01  2.847e-01   3.494  0.00205 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 105700000 on 22 degrees of freedom
## Multiple R-squared:  0.5658, Adjusted R-squared:  0.4276 
## F-statistic: 4.095 on 7 and 22 DF,  p-value: 0.005093
fit4.all=lm(DomesticBoxOffice ~ NYT + NYDaily + LATimes + HollywoodReporter + Budget, dat)
summary(fit4.all)
## 
## Call:
## lm(formula = DomesticBoxOffice ~ NYT + NYDaily + LATimes + HollywoodReporter + 
##     Budget, data = dat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -188301647  -54364604    -535060   66038182  210624154 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.173e+08  9.801e+07   1.196 0.243219    
## NYT               -2.946e+06  1.425e+06  -2.067 0.049659 *  
## NYDaily            1.395e+06  9.838e+05   1.418 0.169031    
## LATimes            3.116e+06  1.337e+06   2.330 0.028530 *  
## HollywoodReporter -1.605e+06  1.900e+06  -0.845 0.406716    
## Budget             9.635e-01  2.501e-01   3.852 0.000765 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.02e+08 on 24 degrees of freedom
## Multiple R-squared:  0.5585, Adjusted R-squared:  0.4665 
## F-statistic: 6.071 on 5 and 24 DF,  p-value: 0.0009086
fit5.all=lm(DomesticBoxOffice ~ NYT + NYDaily + LATimes + Variety + Budget, dat)
summary(fit5.all)
## 
## Call:
## lm(formula = DomesticBoxOffice ~ NYT + NYDaily + LATimes + Variety + 
##     Budget, data = dat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -189844790  -52169909    1794011   57619712  208421298 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  5.933e+07  8.250e+07   0.719  0.47902   
## NYT         -3.679e+06  1.478e+06  -2.489  0.02015 * 
## NYDaily      7.578e+05  1.053e+06   0.720  0.47863   
## LATimes      2.858e+06  1.422e+06   2.010  0.05581 . 
## Variety      6.735e+05  1.772e+06   0.380  0.70727   
## Budget       1.003e+00  2.772e-01   3.618  0.00137 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 103200000 on 24 degrees of freedom
## Multiple R-squared:  0.5481, Adjusted R-squared:  0.4539 
## F-statistic: 5.821 on 5 and 24 DF,  p-value: 0.001173
fit6.all=lm(DomesticBoxOffice ~ NYT + NYDaily + LATimes + Budget, dat)
summary(fit6.all)
## 
## Call:
## lm(formula = DomesticBoxOffice ~ NYT + NYDaily + LATimes + Budget, 
##     data = dat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -200698709  -49829545   -1034243   58739955  209759673 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.776e+07  7.810e+07   0.868 0.393868    
## NYT         -3.428e+06  1.299e+06  -2.639 0.014114 *  
## NYDaily      9.853e+05  8.509e+05   1.158 0.257825    
## LATimes      3.029e+06  1.326e+06   2.285 0.031073 *  
## Budget       9.599e-01  2.486e-01   3.861 0.000708 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101400000 on 25 degrees of freedom
## Multiple R-squared:  0.5454, Adjusted R-squared:  0.4726 
## F-statistic: 7.497 on 4 and 25 DF,  p-value: 0.0004111
fit7.all=lm(DomesticBoxOffice ~ NYT + LATimes + Budget, dat)
summary(fit7.all)
## 
## Call:
## lm(formula = DomesticBoxOffice ~ NYT + LATimes + Budget, data = dat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -251894038  -63759447    5739032   67293294  199651768 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.549e+07  7.482e+07   1.276 0.213127    
## NYT         -3.608e+06  1.298e+06  -2.779 0.009982 ** 
## LATimes      3.538e+06  1.259e+06   2.811 0.009264 ** 
## Budget       1.012e+00  2.461e-01   4.113 0.000348 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 102100000 on 26 degrees of freedom
## Multiple R-squared:  0.521,  Adjusted R-squared:  0.4657 
## F-statistic: 9.425 on 3 and 26 DF,  p-value: 0.000218
fit8.all=lm(DomesticBoxOffice ~ LATimes + Budget, dat)
summary(fit8.all)
## 
## Call:
## lm(formula = DomesticBoxOffice ~ LATimes + Budget, data = dat)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -228370822  -59643928  -18224362   33582796  323486046 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 2.983e+07  7.934e+07   0.376  0.70989   
## LATimes     1.287e+06  1.077e+06   1.195  0.24240   
## Budget      9.711e-01  2.745e-01   3.537  0.00148 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 114100000 on 27 degrees of freedom
## Multiple R-squared:  0.3786, Adjusted R-squared:  0.3326 
## F-statistic: 8.226 on 2 and 27 DF,  p-value: 0.001623

Appendix G: Histogram of the Residuals and Plot of the Residuals against Fitted Values

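# Residuals against fitted values for the final model (fit7.all)
plot(fitted(fit7.all), residuals(fit7.all),
     ylab="Residuals", xlab="Predicted Domestic Box Office Earnings")
abline(0, 0)

# Histogram of the residuals with a normal curve of matching standard deviation
hist(residuals(fit7.all), xlab="Residuals", main="Histogram of Residuals", freq=FALSE)
curve(dnorm(x, 0, sd(residuals(fit7.all))), add = TRUE)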

plot(imdb, meta, main = "IMDb vs. Metacritic Ratings", xlab = "IMDb Ratings (out of 10)", ylab = "Metacritic Ratings (out of 100)")
abline(lm(meta~imdb))

summary(lm(meta~imdb))
## 
## Call:
## lm(formula = meta ~ imdb)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.6933  -6.9988  -0.1933   4.4587  17.3460 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -16.065     11.686  -1.375     0.18    
## imdb          11.307      1.647   6.866 1.84e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.269 on 28 degrees of freedom
##   (968 observations deleted due to missingness)
## Multiple R-squared:  0.6274, Adjusted R-squared:  0.6141 
## F-statistic: 47.14 on 1 and 28 DF,  p-value: 1.842e-07
plot(boxoffice, profit, main = "Profit vs. Box Office Performance", xlab = "Box Office Performance (in USD)", ylab = "Profit (in USD)")
abline(lm(profit~boxoffice))

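# This summary regresses box office on profit, the reverse of the line drawn above;
# R-squared and the slope p-value are identical in either direction.
summary(lm(boxoffice~profit))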
## 
## Call:
## lm(formula = boxoffice ~ profit)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -109166210  -73234931    3431639   73867885  134936815 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.110e+08  2.142e+07   5.179 1.70e-05 ***
## profit      1.020e+00  1.320e-01   7.727 2.04e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80310000 on 28 degrees of freedom
##   (968 observations deleted due to missingness)
## Multiple R-squared:  0.6807, Adjusted R-squared:  0.6693 
## F-statistic:  59.7 on 1 and 28 DF,  p-value: 2.04e-08
testdata <- read.csv("testdata.csv", header=TRUE)


imdb.lm = lm(Domestic.Box.Office~IMDb, dat2)
summary(imdb.lm)
## 
## Call:
## lm(formula = Domestic.Box.Office ~ IMDb, data = dat2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -165723965 -110000483  -35192161   98670916  422156600 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -76596994  192084732  -0.399    0.693
## IMDb         43805486   27068958   1.618    0.117
## 
## Residual standard error: 135900000 on 28 degrees of freedom
##   (968 observations deleted due to missingness)
## Multiple R-squared:  0.08553,    Adjusted R-squared:  0.05287 
## F-statistic: 2.619 on 1 and 28 DF,  p-value: 0.1168
imdb.residuals = resid(imdb.lm)
fitted = predict(imdb.lm)


meta.lm = lm(Domestic.Box.Office~Metacritic, dat2)
summary(meta.lm)
## 
## Call:
## lm(formula = Domestic.Box.Office ~ Metacritic, data = dat2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -174217431  -91756654  -36759838   89597412  436887296 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1115838  120617603   0.009   0.9927  
## Metacritic    3630422    1860395   1.951   0.0611 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 133300000 on 28 degrees of freedom
##   (968 observations deleted due to missingness)
## Multiple R-squared:  0.1197, Adjusted R-squared:  0.08828 
## F-statistic: 3.808 on 1 and 28 DF,  p-value: 0.06108
meta.residuals = resid(meta.lm)
fitted = predict(meta.lm)


tomatoes.lm = lm(Domestic.Box.Office~Rotten.Tomatoes, dat2)
summary(tomatoes.lm)
## 
## Call:
## lm(formula = Domestic.Box.Office ~ Rotten.Tomatoes, data = dat2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -153826245 -108105921  -41055498   90630946  411679078 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     62255596   84253539   0.739    0.466  
## Rotten.Tomatoes  2510751    1196545   2.098    0.045 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 132100000 on 28 degrees of freedom
##   (968 observations deleted due to missingness)
## Multiple R-squared:  0.1359, Adjusted R-squared:  0.105 
## F-statistic: 4.403 on 1 and 28 DF,  p-value: 0.04502
tomatoes.residuals = resid(tomatoes.lm)
fitted = predict(tomatoes.lm)

roger.lm = lm(Domestic.Box.Office~Roger.Ebert.com, dat2)
summary(roger.lm)
## 
## Call:
## lm(formula = Domestic.Box.Office ~ Roger.Ebert.com, data = dat2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -188668802  -75965305  -28502003   79879938  411436806 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     91629692   89625199   1.022    0.315
## Roger.Ebert.com 49710504   30576757   1.626    0.115
## 
## Residual standard error: 135900000 on 28 degrees of freedom
##   (968 observations deleted due to missingness)
## Multiple R-squared:  0.08625,    Adjusted R-squared:  0.05362 
## F-statistic: 2.643 on 1 and 28 DF,  p-value: 0.1152
roger.residuals = resid(roger.lm)
fitted = predict(roger.lm)
par2 <- par(mfrow=c(2, 2))
# Plot each model's residuals against that model's own fitted values
plot(fitted(imdb.lm), imdb.residuals, main="IMDb", xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)
plot(fitted(meta.lm), meta.residuals, main="Metacritic", xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)
plot(fitted(tomatoes.lm), tomatoes.residuals, main="Rotten Tomatoes", xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)
plot(fitted(roger.lm), roger.residuals, main="Roger Ebert", xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)

par(par2)
par3 <- par(mfrow=c(2, 2))
hist(imdb.residuals, main = "IMDb", xlab = "Residual", ylab = "Frequency")
hist(meta.residuals, main = "Metacritic", xlab = "Residual", ylab = "Frequency")
hist(tomatoes.residuals, main = "Rotten Tomatoes", xlab = "Residual", ylab = "Frequency")
hist(roger.residuals, main = "Roger Ebert", xlab = "Residual", ylab = "Frequency")

par(par3)
imdbbox.predict <- predict(imdb.lm, newdata = testdata)
imdbbox.predict
##         1         2         3         4         5         6         7 
## 243183056 230041410 278227445 190616473 295749640 230041410 238802508 
##         8         9        10 
## 221280313 164333181 124908243
(imdbbox.predict-testdata$Domestic.Box.Office)/testdata$Domestic.Box.Office*100
##         1         2         3         4         5         6         7 
## -40.39787  81.88387 -55.36073  67.61681 -34.00495   5.20231  72.48576 
##         8         9        10 
##  77.00438 -43.78403 131.73927
metabox.predict <- predict(meta.lm, newdata = testdata)
metabox.predict
##         1         2         3         4         5         6         7 
## 247984509 237093244 251614930 262506195 284288725 226201979 251614930 
##         8         9        10 
## 157223968 189897763 146332703
(metabox.predict-testdata$Domestic.Box.Office)/testdata$Domestic.Box.Office*100
##          1          2          3          4          5          6 
## -39.221076  87.459448 -59.630485 130.832368 -36.562392   3.446465 
##          7          8          9         10 
##  81.740101  25.765059 -35.038764 171.487558
tomatoesbox.predict <- predict(tomatoes.lm, newdata = testdata)
tomatoesbox.predict
##         1         2         3         4         5         6         7 
## 273158698 245540435 293244708 263115693 280690952 230475928 275669449 
##         8         9        10 
## 135067381 185282406 122513625
(tomatoesbox.predict-testdata$Domestic.Box.Office)/testdata$Domestic.Box.Office*100
##          1          2          3          4          5          6 
## -33.051093  94.138280 -52.951335 131.368324 -37.365217   5.401023 
##          7          8          9         10 
##  99.114550   8.041779 -36.617610 127.296594
rogerbox.predict <- predict(roger.lm, newdata = testdata)
rogerbox.predict
##         1         2         3         4         5         6         7 
## 240761204 290471708 240761204 240761204 240761204 265616456 240761204 
##         8         9        10 
## 215905952 215905952        NA
(rogerbox.predict-testdata$Domestic.Box.Office)/testdata$Domestic.Box.Office*100
##         1         2         3         4         5         6         7 
## -40.99145 129.66351 -61.37187 111.71111 -46.27534  21.47145  73.90051 
##         8         9        10 
##  72.70538 -26.14174        NA
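
The percent-error expression repeated above can be collected into a helper, and the per-movie errors can then be condensed into a single mean absolute percent error per model. This is a sketch under stated assumptions: `pct_error` and `mape` are hypothetical names, and the numbers in the example are invented for illustration and are not taken from `testdata`.

```r
# Hypothetical helper: signed percent error of each prediction
pct_error <- function(predicted, actual) {
  (predicted - actual) / actual * 100
}

# Mean absolute percent error, skipping movies with a missing prediction
mape <- function(predicted, actual) {
  mean(abs(pct_error(predicted, actual)), na.rm = TRUE)
}

# Illustrative values only (assumption: not the study's test set)
actual    <- c(100, 200, 400, 50)
predicted <- c(110, 150, 400, NA)

errs    <- pct_error(predicted, actual)
overall <- mape(predicted, actual)
errs
overall
```

With the objects in this appendix, a call such as `mape(imdbbox.predict, testdata$Domestic.Box.Office)` would summarize the ten percent errors of one model as one figure, which makes the four critic-score models easier to compare.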

Appendix H: Plots and Histograms of the Residuals for Predicting Critic Reviews from Budget, with Predicted Values and Percent Error

# Predicting critic reviews from budget

imdb.budget = lm(IMDb~Budget, data = final)
summary(imdb.budget)
## 
## Call:
## lm(formula = IMDb ~ Budget, data = final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5735 -0.4357 -0.2054  0.6378  1.3224 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+00  2.782e-01  23.274   <2e-16 ***
## Budget      4.956e-09  2.026e-09   2.446    0.021 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8613 on 28 degrees of freedom
##   (968 observations deleted due to missingness)
## Multiple R-squared:  0.1761, Adjusted R-squared:  0.1466 
## F-statistic: 5.983 on 1 and 28 DF,  p-value: 0.02099
imdbbudget.residuals = resid(imdb.budget)
fitted = predict(imdb.budget)
plot(fitted, imdbbudget.residuals, xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)

hist(imdbbudget.residuals, main = "Histogram of Residuals", xlab = "Residual", ylab = "Frequency")

imdb.predict <- predict(imdb.budget, newdata = testdata)
imdb.predict
##        1        2        3        4        5        6        7        8 
## 6.871724 7.094731 7.590301 6.509958 7.838086 6.723053 6.683408 6.623939 
##        9       10 
## 7.150235 6.500047
(imdb.predict-testdata$IMDb)/testdata$IMDb*100
##         1         2         3         4         5         6         7 
## -5.866788  1.353299 -6.292581  6.720629 -7.787224 -3.956379 -7.174891 
##         8         9        10 
## -2.589126 30.004269 41.305369
meta.budget = lm(Metacritic~Budget, data = final)
summary(meta.budget)
## 
## Call:
## lm(formula = Metacritic ~ Budget, data = final)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8764 -11.4311  -0.8311   7.9899  26.6718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.647e+01  4.069e+00  13.880 4.47e-14 ***
## Budget      6.203e-08  2.963e-08   2.094   0.0455 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.6 on 28 degrees of freedom
##   (968 observations deleted due to missingness)
## Multiple R-squared:  0.1353, Adjusted R-squared:  0.1045 
## F-statistic: 4.383 on 1 and 28 DF,  p-value: 0.04548
metabudget.residuals = resid(meta.budget)
fitted = predict(meta.budget)
plot(fitted, metabudget.residuals, xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)

hist(metabudget.residuals, main = "Histogram of Residuals", xlab = "Residual", ylab = "Frequency")

meta.predict <- predict(meta.budget, newdata = testdata)
meta.predict
##        1        2        3        4        5        6        7        8 
## 61.43548 64.22677 70.42964 56.90739 73.53107 59.57462 59.07839 58.33405 
##        9       10 
## 64.92149 56.78333
(meta.predict-testdata$Metacritic)/testdata$Metacritic*100
##          1          2          3          4          5          6 
##  -9.653707  -1.189586   2.071937 -20.961964  -5.729397  -3.911905 
##          7          8          9         10 
## -14.379146  35.660571  24.849020  41.958321
tomatoes.budget = lm(Rotten.Tomatoes~Budget, data = final)
summary(tomatoes.budget)
## 
## Call:
## lm(formula = Rotten.Tomatoes ~ Budget, data = final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.669 -18.068   3.756  16.414  26.629 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.851e+01  6.420e+00   9.113 7.18e-10 ***
## Budget      7.911e-08  4.675e-08   1.692    0.102    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.88 on 28 degrees of freedom
##   (968 observations deleted due to missingness)
## Multiple R-squared:  0.09276,    Adjusted R-squared:  0.06036 
## F-statistic: 2.863 on 1 and 28 DF,  p-value: 0.1017
tomatoesbudget.residuals = resid(tomatoes.budget)
fitted = predict(tomatoes.budget)
plot(fitted, tomatoesbudget.residuals, xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)

hist(tomatoesbudget.residuals, main = "Histogram of Residuals", xlab = "Residual", ylab = "Frequency")

tomatoes.predict <- predict(tomatoes.budget, newdata = testdata)
tomatoes.predict
##        1        2        3        4        5        6        7        8 
## 64.83374 68.39353 76.30418 59.05897 80.25950 62.46054 61.82769 60.87841 
##        9       10 
## 69.27952 58.90075
(tomatoes.predict-testdata$Rotten.Tomatoes)/testdata$Rotten.Tomatoes*100
##          1          2          3          4          5          6 
## -22.816977  -6.310231 -17.060674 -26.176294  -7.747696  -6.775307 
##          7          8          9         10 
## -27.261538 109.925567  41.386783 145.419801
roger.budget = lm(Roger.Ebert.com~Budget, data = final)
summary(roger.budget)
## 
## Call:
## lm(formula = Roger.Ebert.com ~ Budget, data = final)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39827 -0.48632 -0.08226  0.47214  1.56585 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.309e+00  2.450e-01   9.423 3.51e-10 ***
## Budget      4.485e-09  1.784e-09   2.514    0.018 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7584 on 28 degrees of freedom
##   (968 observations deleted due to missingness)
## Multiple R-squared:  0.1842, Adjusted R-squared:  0.155 
## F-statistic:  6.32 on 1 and 28 DF,  p-value: 0.01796
rogerbudget.residuals = resid(roger.budget)
fitted = predict(roger.budget)
plot(fitted, rogerbudget.residuals, xlab = "Fitted Value", ylab = "Residual")
abline(0, 0)

hist(rogerbudget.residuals, main = "Histogram of Residuals", xlab = "Residual", ylab = "Frequency")

roger.predict <- predict(roger.budget, newdata = testdata)
roger.predict
##        1        2        3        4        5        6        7        8 
## 2.667385 2.869218 3.317737 2.339966 3.541997 2.532829 2.496947 2.443125 
##        9       10 
## 2.919452 2.330995
(roger.predict-testdata$Roger.Ebert.com)/testdata$Roger.Ebert.com*100
##          1          2          3          4          5          6 
## -11.087180 -28.269546  10.591238 -22.001142  18.066555 -27.633460 
##          7          8          9         10 
## -16.768421  -2.274996  16.778091         NA
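
The four budget regressions in this appendix share one pattern, so they could be fitted and summarized in a single pass. The sketch below is illustrative only: it borrows the column names `Budget`, `IMDb`, and `Metacritic` from the output above, but the values are simulated, and `fit_on_budget` is a hypothetical helper that does not appear in the original script.

```r
# Hypothetical helper: regress one critic score on Budget
fit_on_budget <- function(response, data) {
  lm(reformulate("Budget", response = response), data = data)
}

# Simulated stand-in data (assumption: not the movie data set)
set.seed(42)
sim <- data.frame(Budget = runif(30, 1e7, 2.5e8))
sim$IMDb       <- 6.5 + 5e-9 * sim$Budget + rnorm(30, sd = 0.8)
sim$Metacritic <- 56  + 6e-8 * sim$Budget + rnorm(30, sd = 12)

# Fit one model per response, keeping the response names as list names
responses <- c("IMDb", "Metacritic")
models <- lapply(setNames(responses, responses), fit_on_budget, data = sim)

# One row per model: Budget slope, its p-value, and R-squared
results <- t(sapply(models, function(m) {
  s <- summary(m)
  c(slope     = unname(coef(m)["Budget"]),
    p.value   = s$coefficients["Budget", "Pr(>|t|)"],
    r.squared = s$r.squared)
}))
results
```

Collecting the slope, p-value, and R-squared in one table makes it easier to see which critic scores are associated with budget without scanning four separate summaries.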