To understand the relationship between a film's budget and its gross, we begin by exploring the data visually. Budget and gross are both numeric variables: budget is the amount of money spent on making the film, and gross is the amount of money made in ticket sales, both in US dollars. We first investigate the budget variable through two graphs, followed by the gross variable. Finally, scatterplots show the relationship between the two numeric variables.
Figures 1.1 and 1.2 visualize the movie budget data. Figure 1.1 is a frequency polygon showing that the data is positively skewed, with the majority of movies under a budget of 100,000,000 US dollars. Figure 1.2 is a histogram that shows the same positive skew and further suggests that most movies have a budget around 50,000,000 US dollars.

Figures 1.3 and 1.4 visualize the movie gross data. Figure 1.3 is a frequency polygon showing that the data is also positively skewed, with the majority of ticket sales under 200,000,000 US dollars. Figure 1.4 is a histogram that similarly shows the positive skew and further suggests that the large majority of movies have a gross under 100,000,000 US dollars.

Figure 1.5 is a scatterplot with movie budget on the x-axis and movie gross on the y-axis, so that budget is the independent variable and gross is the dependent variable. This graph suggests that there may be a very weak positive relationship between budget and gross, but in general the data appears to form a random cloud. Figure 1.6 is a closer look at the majority of the data, with budget limited to 100,000,000 US dollars. Even in this smaller frame, the data resembles a random cloud more than a straight line, and therefore does not strongly support the use of a least-squares linear regression (LSLR) model. A different kind of fitted model may therefore be more appropriate for this data.
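The visual impression of positive skew can also be checked numerically. The following is a minimal sketch, assuming the budgets are available as a plain list of numbers; the values shown are hypothetical and are not taken from the movie dataset. It computes the moment-based sample skewness, which is positive when the right tail is longer than the left.

```python
from statistics import mean

def sample_skewness(values):
    """Moment-based skewness: third central moment over variance to the power 3/2."""
    n = len(values)
    centre = mean(values)
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical budgets in millions of US dollars (illustration only).
budgets = [5, 10, 15, 20, 25, 40, 150]
skew = sample_skewness(budgets)
print(f"skewness = {skew:.2f}")  # a positive value indicates a long right tail
```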
Model 1 is modeled as: \[Gross = \beta_0 + \beta_1Budget + error\]
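As an illustration of how the coefficients of such a model are estimated, the sketch below fits a least-squares line using the closed-form formulas for the slope and intercept. The function name `fit_lslr` and the small dataset are hypothetical; they are not the report's data or code.

```python
from statistics import mean

def fit_lslr(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical budgets and grosses in millions of US dollars (illustration only).
budget = [10, 20, 30, 40, 50]
gross = [12, 25, 31, 48, 54]
b0, b1 = fit_lslr(budget, gross)
print(f"predicted gross = {b0:.2f} + {b1:.2f} * budget")
```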
In order to use an LSLR model, we must check that the data meets certain conditions: shape, zero mean, and constant variance. The shape condition requires that the data forms a linear pattern. The zero mean condition requires the residuals to be centered on 0 across the range of fitted values, which we may check using a residual plot. The constant variance condition can also be checked using a residual plot, and requires that the standard deviation of the residuals be approximately constant. In other words, the vertical spread of the points around the regression line should be roughly the same along the whole line rather than widening or narrowing. To use this model, we must first check whether these three conditions are met.
Based on Figure 2.1, the data does not appear to meet the shape condition for LSLR, as it does not follow a linear pattern. Based on Figure 2.2, the residual plot, the data additionally does not appear to meet the zero mean or constant variance conditions. The residuals are not centered on zero, and they fan out rather than maintaining a relatively constant spread. With regard to outliers, Figure 2.3 demonstrates that this data contains many of them. Figure 2.3 is a studentized residual plot: any data points above 2 or below -2 on the y-axis are considered outliers, and any data points above 3 or below -3 are considered extreme outliers. The data points highlighted in red in Figure 2.3 are the extreme outliers, and most lie above the 3 line. The majority of these appear to be general outliers, in which the y coordinate is unusual. Figure 2.4 highlights the same outliers in a scatterplot of the data in order to determine whether they influence the slope of the line. These outliers do not appear to be highly influential, and therefore are worth noting but do not necessarily need to be removed from the dataset. However, as the data does not meet the conditions for an LSLR line, I would not suggest the use of an LSLR model for this data.
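The outlier cutoffs described above can be sketched in code. This is a minimal illustration, assuming internally studentized residuals (each residual divided by its estimated standard deviation, using the leverage of the point); the report's figures may use a different variant. The function names and the data are hypothetical.

```python
from math import sqrt
from statistics import mean

def residual_diagnostics(x, y):
    """Return raw residuals and internally studentized residuals for an LSLR fit."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    raw = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s = sqrt(sum(e ** 2 for e in raw) / (n - 2))  # residual standard error
    studentized = []
    for xi, e in zip(x, raw):
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx  # hat value of this point
        studentized.append(e / (s * sqrt(1 - leverage)))
    return raw, studentized

def classify(r):
    """Apply the cutoffs from the text: beyond 2 is an outlier, beyond 3 is extreme."""
    if abs(r) > 3:
        return "extreme outlier"
    if abs(r) > 2:
        return "outlier"
    return "typical"

# Hypothetical data with one unusually large y-value at x = 7 (illustration only).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1, 25.0, 16.1, 17.9, 20.2]
raw, studentized = residual_diagnostics(x, y)
for xi, r in zip(x, studentized):
    print(xi, round(r, 2), classify(r))
```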
Model 2 is modeled as: \[\log(Gross) = \beta_0 + \beta_1\log(Budget) + error\] In order to use this model, we must check the same three conditions as for Model 1, in the same way.
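To see why taking logs of both variables can straighten a curved relationship, consider the sketch below. It assumes natural logarithms and uses hypothetical data generated from an exact power law, gross = 3 × budget^0.9. Fitting a line to the logged values recovers the exponent as the slope and the multiplier as the exponentiated intercept. The function name `fit_line` is illustrative only.

```python
from math import exp, log
from statistics import mean

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical power-law data: gross = 3 * budget ** 0.9 (illustration only).
budgets = [1e6, 5e6, 2e7, 8e7, 2e8]
grosses = [3 * b ** 0.9 for b in budgets]

# Fit the line on the log scale, as Model 2 does.
b0, b1 = fit_line([log(b) for b in budgets], [log(g) for g in grosses])
print(f"slope = {b1:.3f}, exp(intercept) = {exp(b0):.3f}")
```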
Based on Figure 3.1, the data appears to meet the shape condition for the LSLR model. The data follows a moderate to strong positive linear pattern, which is necessary for an LSLR model. It follows the linear shape most closely towards the higher end of the x-axis, while it fans out at the lower end. Based on the residual plot, Figure 3.2, the data appears to meet the zero mean condition, as the residuals are centered around zero. The constant variance condition is also met, and most residuals are within five standard deviations from the mean on either side. With regard to outliers, Figure 3.3 demonstrates that this data contains several, though fewer than Model 1. Figure 3.3 is a studentized residual plot: any data points above 2 or below -2 on the y-axis are considered outliers, and any data points above 3 or below -3 are considered extreme outliers, which are highlighted in red. These outliers are mainly general outliers, meaning that they have an unusual y-value. Figure 3.4 is a scatterplot that highlights the same outliers as Figure 3.3 to test whether they influence the slope. Figure 3.4 demonstrates that the slope of the line would likely not change if these outliers were removed, and therefore they are not influential. I would therefore take note of these outliers, but it is not necessary to remove them from the dataset.
I would recommend using Model 2 for this data, as it fits the data best and meets all the necessary conditions, which Model 1 does not. Model 2 thus creates the LSLR line below.
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 0.9242195 | 0.3944432 | 2.343099 | 0.0192168 |
| log(budget) | 0.9425048 | 0.0230841 | 40.829220 | 0.0000000 |
\[\widehat{\log(Gross)} = 0.92422 + 0.94250\log(Budget)\] For every 1% increase in a movie’s budget, we predict an average increase of about 0.9425% in the movie’s gross. The R-squared value for this model is 0.4415, meaning that 44.15% of the variation in log ticket sales can be accounted for by the movie’s log budget.
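The percentage interpretation of the slope can be verified with a short calculation. The sketch below assumes that `log` denotes the natural logarithm and uses the coefficients from the table above; the function names are hypothetical. On the original scale, the fitted line corresponds to predicted gross = exp(intercept) × budget^slope, so a 1% increase in budget multiplies the predicted gross by 1.01 raised to the slope.

```python
from math import exp, log

B0 = 0.9242195  # intercept from the coefficient table
B1 = 0.9425048  # slope on log(budget) from the coefficient table

def predicted_gross(budget):
    """Back-transform the fitted log-log line to the dollar scale."""
    return exp(B0 + B1 * log(budget))

def pct_change_in_gross(pct_budget_increase):
    """Predicted percent change in gross for a given percent change in budget."""
    return ((1 + pct_budget_increase / 100) ** B1 - 1) * 100

print(f"{pct_change_in_gross(1):.4f}% change in gross per 1% change in budget")

# Doubling the budget multiplies the predicted gross by 2 ** B1, whatever the starting budget.
ratio = predicted_gross(2e7) / predicted_gross(1e7)
print(f"doubling budget multiplies predicted gross by {ratio:.3f}")
```

The exact figure is slightly below the slope itself because the slope is a first-order approximation to the percentage change; for a 1% increase the two agree to about three decimal places.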
Based on the selected Model 2, we may use the LSLR line to determine if there is evidence to support the hypothesis that higher budgets are associated with higher ticket sales. However, in order to perform inference tests, we must check three more conditions for an LSLR model: independence, normality, and randomness. The normality condition can be checked using a Q-Q plot, which plots the ordered standardized residuals against the values that would be expected from a normal distribution. If the residuals are normal, the points will follow a straight line.
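The coordinates of a Q-Q plot can be computed as in the following sketch. It assumes the plotting positions (i − 0.5)/n, which is one common convention among several, and the residual values are hypothetical. Each point pairs a theoretical standard normal quantile with the corresponding ordered, standardized residual.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Return (theoretical quantile, standardized residual) pairs for a normal Q-Q plot."""
    n = len(residuals)
    centre, spread = mean(residuals), stdev(residuals)
    standard_normal = NormalDist()
    points = []
    for i, r in enumerate(sorted(residuals), start=1):
        p = (i - 0.5) / n  # plotting position (an assumed convention)
        points.append((standard_normal.inv_cdf(p), (r - centre) / spread))
    return points

# Hypothetical residuals (illustration only).
points = qq_points([-1.2, 0.3, -0.4, 1.1, 0.2])
for theoretical, observed in points:
    print(f"{theoretical:6.2f} {observed:6.2f}")
```

If the residuals are approximately normal, the printed pairs lie close to the line where the two columns are equal.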
The Q-Q plot demonstrates that the residuals are approximately normal, as the majority of the points follow the line. Therefore, the condition of normality is met. The independence condition assumes that the residuals are independent of each other, and the random condition assumes that Y is a random variable. We assume that the errors are independent of each other and that the movie gross is a random variable. Therefore, the independence and random conditions are met. As Model 2 meets all six conditions for inference, the first three being shown in Section 3 of this report and the last three being shown above, we may continue with our hypothesis test. The test proceeds in the steps below.
Step 1: The null hypothesis is that there is no relationship between log budget and log ticket sales gross. The alternative hypothesis is that there is a relationship between log budget and log ticket sales gross.
Step 2: \[\widehat{\beta_1} = 0.94250\] \[SE_{\widehat{\beta_1}} = 0.02308\]
Step 3: To test this hypothesis, we use a t-test, where \[t = \frac{\widehat{\beta_1} - \beta_1}{SE_{\widehat{\beta_1}}}\] and the value of the slope under the null hypothesis is 0. Using the unrounded estimates from the coefficient table, \[t = \frac{0.9425048 - 0}{0.0230841} = 40.83\] Therefore, our t-statistic is 40.83.
Step 4: If the null hypothesis is true and there is no relationship between log budget and log gross, then the sampling distribution of our test statistic, t, is a t-distribution with 2109 degrees of freedom.
Step 5: If the null hypothesis is true, the likelihood of getting a test statistic as extreme as 40.83 is approximately zero. The data therefore provide convincing evidence of a positive linear relationship between log budget and log ticket sales gross.
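The test statistic and p-value from Steps 3 to 5 can be reproduced as in the sketch below. It uses the slope and standard error from the coefficient table. Because Python's standard library has no t-distribution, the two-sided p-value is computed from the standard normal distribution, which is an assumption that is reasonable here only because 2109 degrees of freedom make the two distributions nearly identical.

```python
from math import erfc, sqrt

slope_hat = 0.9425048  # estimated slope from the coefficient table
se_slope = 0.0230841   # standard error of the slope from the coefficient table
null_slope = 0.0       # slope under the null hypothesis

t_stat = (slope_hat - null_slope) / se_slope

# Two-sided tail probability under the normal approximation:
# P(|Z| >= |t|) = erfc(|t| / sqrt(2)).
p_value = erfc(abs(t_stat) / sqrt(2))

print(f"t = {t_stat:.2f}")
print(f"two-sided p-value = {p_value:.3g}")
```

The p-value is far too small to be represented as a floating-point number, which is consistent with the "approximately zero" described in Step 5.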
In order to account for sampling variability, we may build a confidence interval for the slope of the LSLR line. As the hypothesis test used an alpha level of 0.05, which allows for a 5% chance of Type I error, we may use a 95% confidence interval. Confidence intervals for the slope of an LSLR line are constructed through the equation
\[\widehat{\beta_1} \pm (t^*)(SE_{\widehat{\beta_1}})\]
The critical value t\* for a two-sided 95% confidence interval with df = 2109 is t\* = 1.9611.
0.94250 ± (1.9611 × 0.02308) = 0.94250 ± 0.04526.
We are therefore 95% confident that, for each 1% increase in a movie’s budget, we expect the average movie’s gross to increase between 0.8972% and 0.9878%.
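The interval arithmetic can be checked with the following sketch. A two-sided 95% interval uses the 0.975 quantile of the t-distribution. Since Python's standard library does not provide that quantile, the hypothetical helper `t_quantile` approximates it from the normal quantile with a standard series expansion in 1/df, which is accurate for large degrees of freedom such as the 2109 used here. The slope and standard error are taken from the coefficient table.

```python
from statistics import NormalDist

def t_quantile(p, df):
    """Approximate t quantile via a series expansion around the normal quantile."""
    z = NormalDist().inv_cdf(p)
    g1 = (z ** 3 + z) / 4
    g2 = (5 * z ** 5 + 16 * z ** 3 + 3 * z) / 96
    return z + g1 / df + g2 / df ** 2

slope_hat = 0.9425048  # estimated slope from the coefficient table
se_slope = 0.0230841   # standard error of the slope from the coefficient table
df = 2109

t_star = t_quantile(0.975, df)  # two-sided 95% interval leaves 2.5% in each tail
margin = t_star * se_slope
lower, upper = slope_hat - margin, slope_hat + margin

print(f"t* = {t_star:.4f}")
print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
```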
In order to further explore the data, we constructed another model using a categorical variable. The categorical variable of interest is genre, which is separated into 11 categories. These categories are displayed in the bar chart in Figure 6.1. The genres are action, adventure, animation, biography, comedy, crime, documentary, drama, fantasy, horror, and mystery. From Figure 6.1, we can see that the majority of the data is in the action, comedy, or drama genres, and the fewest movies are in the animation, documentary, fantasy, and mystery genres.
Model 3 thus uses genre to predict gross ticket sales, and is written as \[Gross = \beta_0 + \beta_1Adventure + \beta_2Animation + \beta_3Biography + \beta_4Comedy + \beta_5Crime + \beta_6Documentary + \beta_7Drama + \beta_8Fantasy + \beta_9Horror + \beta_{10}Mystery + error\]
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 94698438 | 3081856 | 30.7277248 | 0.0000000 |
| genresAdventure | 9116850 | 5715838 | 1.5950153 | 0.1108593 |
| genresAnimation | 10538288 | 13720814 | 0.7680512 | 0.4425431 |
| genresBiography | -45561121 | 7782863 | -5.8540310 | 0.0000000 |
| genresComedy | -52479467 | 4199600 | -12.4963021 | 0.0000000 |
| genresCrime | -58364739 | 6907004 | -8.4500804 | 0.0000000 |
| genresDocumentary | -68015013 | 18525350 | -3.6714563 | 0.0002472 |
| genresDrama | -61135077 | 4857721 | -12.5851368 | 0.0000000 |
| genresFantasy | -42625086 | 20654570 | -2.0637120 | 0.0391676 |
| genresHorror | -57622031 | 7749648 | -7.4354384 | 0.0000000 |
| genresMystery | -24789815 | 18525350 | -1.3381564 | 0.1809903 |
Figure 6.2 is a histogram that displays the distribution of gross within each genre. It highlights that gross is positively skewed within most genres, particularly the drama and mystery genres. Figure 6.3 shows boxplots of gross by genre, demonstrating that many of the gross outliers come from the action genre. Outliers are also seen in the adventure, comedy, and drama genres. The genre with the highest median gross appears to be animation, while the lowest median gross appears to be documentary.

When creating the LSLR line for a model including a categorical variable, one level of that variable becomes the baseline. The baseline, or reference level, of this model is the action genre, so it is described by the intercept of the model. Therefore, we predict the average gross of an action movie to be $94698438. Relative to action movies, we predict the average gross to be $9116850 higher for adventure movies and $10538288 higher for animation movies. We predict the average gross to be lower than action movies for the remaining genres: $45561121 lower for biography, $52479467 lower for comedy, $58364739 lower for crime, $68015013 lower for documentary, $61135077 lower for drama, $42625086 lower for fantasy, $57622031 lower for horror, and $24789815 lower for mystery.
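The baseline coding described above can be made concrete with a short sketch. The coefficient values are copied from the Model 3 table; the names `INTERCEPT`, `OFFSETS`, and `predicted_average_gross` are hypothetical. The predicted average gross for a genre is the intercept (the action baseline) plus that genre's coefficient.

```python
INTERCEPT = 94698438  # predicted average gross for the baseline genre, action

# Estimated difference from action for each non-baseline genre (Model 3 table).
OFFSETS = {
    "adventure": 9116850,
    "animation": 10538288,
    "biography": -45561121,
    "comedy": -52479467,
    "crime": -58364739,
    "documentary": -68015013,
    "drama": -61135077,
    "fantasy": -42625086,
    "horror": -57622031,
    "mystery": -24789815,
}

def predicted_average_gross(genre):
    """Predicted average gross in US dollars for one of the 11 genres in the model."""
    if genre == "action":
        return INTERCEPT
    return INTERCEPT + OFFSETS[genre]  # raises KeyError for genres not in the model

for genre in ["action", *OFFSETS]:
    print(f"{genre:12s} {predicted_average_gross(genre):>12,d}")
```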
The adjusted R-squared of Model 3 is 0.1353, meaning that genre explains 13.53% of the variance in a movie’s gross ticket sales. We use adjusted R-squared rather than R-squared because the model has multiple predictors (the ten genre indicator variables), and adjusted R-squared accounts for the number of predictors in the model.
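For reference, the standard adjustment to R-squared can be sketched as follows. The example applies it to Model 2, whose R-squared of 0.4415 and 2109 residual degrees of freedom are reported in this analysis. The sample size n = 2111 is an assumption inferred from those degrees of freedom (n − 2 for a model with one predictor). The function name is hypothetical.

```python
def adjusted_r_squared(r_squared, n, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n - 1) / (n - n_predictors - 1)

# Model 2: one predictor, log(budget); n is inferred from df = n - 2 = 2109.
model2_adjusted = adjusted_r_squared(0.4415, n=2111, n_predictors=1)
print(f"adjusted R-squared = {model2_adjusted:.4f}")
```

With a single predictor and a large sample the adjustment is very small; it matters more for a model such as Model 3, which estimates ten genre coefficients.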
Based on the three models explored in this analysis report, I would recommend using Model 2 to model this dataset. Model 1 did not meet the necessary conditions for the use of an LSLR line, while Model 2 did. In comparing Model 2 and Model 3, we may compare their adjusted R-squared values to determine which explains more variance. We use adjusted R-squared to compare them because Model 3 has multiple predictors. The adjusted R-squared of Model 2 is 0.4412 and the adjusted R-squared of Model 3 is 0.1353. Model 2 therefore has the higher adjusted R-squared, explaining 44.12% of the variance in log movie gross. The scatterplot in Figure 3.1 suggests a moderate to strong relationship between log budget and log gross. The p-value of Model 2’s hypothesis test demonstrates that the relationship found between log budget and log movie gross is statistically significant: if there were truly no relationship, a test statistic this extreme would have practically a 0% chance of occurring. From the confidence interval, each 1% increase in a movie’s budget is associated with an average increase in gross of roughly 0.9% to just under 1%. Practically, this is significant, as it suggests that around 44% of the variation in a movie’s gross can be explained by budget alone. For a single variable, that is a large amount of variance to be explained, and this information may be used when considering what the budget of a certain movie should be. When considering the results of this analysis, it is important to note that the dataset contained multiple outliers that may be worth investigating. These results are also limited to English-language films made in the USA.