Regression Discussion 4
Data Loading
Data set found on Kaggle: https://www.kaggle.com/datasets/delfinaoliva/movies
df <- read.csv("movies_data.csv")
Estimating Equation
\[ \text{Box Office} = \beta_0 + \beta_1 \cdot \text{Budget} \]
We will run a simple linear regression to test whether a movie's budget is related to its box office outcome (money made on ticket sales).
This data set has information on 3974 movies in 16 columns. We will focus on the budget and the box office outcome, both in dollars, to estimate how box office earnings change with budget.
Running Simple Linear Regression
my_reg <- lm(Box.Office ~ Budget, data = df)
summary(my_reg)
Call:
lm(formula = Box.Office ~ Budget, data = df)

Residuals:
       Min         1Q     Median         3Q        Max
-1.093e+09 -4.834e+07 -9.867e+06  1.804e+07  2.219e+09

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.215e+06  2.667e+06  -0.456    0.649
Budget       2.978e+00  4.725e-02  63.017   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 127200000 on 3972 degrees of freedom
Multiple R-squared:  0.4999,  Adjusted R-squared:  0.4998
F-statistic:  3971 on 1 and 3972 DF,  p-value: < 2.2e-16
The intercept (Beta0) is -1.215e+06, or -$1,215,000: when the budget is $0, box office earnings are estimated to be -$1,215,000. In a practical sense we would not interpret this, since a $0 budget with negative earnings is not a realistic scenario. The slope parameter gives better insight. The slope is 2.978, which can be interpreted as: when the budget for a movie increases by one dollar, predicted box office earnings increase by about $3. Because the intercept is small relative to typical budgets, this also means predicted earnings are roughly three times the budget. Given the significance codes, the slope is statistically significant at alpha = 0.001 (p < 2e-16), while the intercept is not (p = 0.649). Since a $1 increase in budget is associated with a roughly $3 increase in box office earnings, I would say this is economically meaningful, both because of the magnitude of the estimate and because of its significance.
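To make the slope interpretation concrete, the short sketch below plugs the rounded coefficients from the summary output into the fitted equation. The $50 million budget is an arbitrary illustrative value, and the 1.96 multiplier is a normal approximation that should be reasonable with 3972 degrees of freedom; calling `confint(my_reg)` on the fitted model would give the exact interval.

```r
# Rounded coefficients copied from the summary(my_reg) output
b0 <- -1.215e6      # intercept
b1 <- 2.978         # slope on Budget
se_b1 <- 4.725e-2   # standard error of the slope

# Predicted box office for a hypothetical $50 million budget
budget_example <- 50e6
predicted_box_office <- b0 + b1 * budget_example

# Approximate 95% confidence interval for the slope (normal approximation)
slope_ci <- b1 + c(-1, 1) * 1.96 * se_b1

predicted_box_office   # 147685000
round(slope_ci, 3)     # 2.885 3.071
```

The interval is narrow and sits well away from zero, which matches the very small p-value reported for the slope.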
Residual Plots
plot(my_reg)
The residuals vs fitted plot lets us check linearity and homoscedasticity in the regression model. It plots the fitted (predicted) values on the x-axis and the residual for each prediction on the y-axis. A pattern in the scatter suggests a trend in the data that is not linear, which would mean the assumption is not met. I would say the assumptions of linearity and homoscedasticity are met given this plot, although there is a slight curved trend in the data.
The Q-Q residuals plot evaluates the normality assumption for the residuals. To interpret it, check whether the plotted values fall on or near the straight dotted line or stray far from it. Since the plotted values here stray from the line a bit at the end of the plot, there could be issues with the normality assumption.
The scale-location plot again checks the homoscedasticity of the residuals, to ensure they have constant variance. It plots the square root of the absolute standardized residuals against the fitted (predicted) values. Once again, a trend here would suggest non-constant variance in the residuals (heteroscedasticity). The goal is for the plotted values to lie along a horizontal band. Judging by the linear trend of the red line in the plot, there could be ways to improve the model, such as a log transformation.
Lastly, the residuals vs leverage plot shows any points that are influential and could cause issues with the model and its predictions. It plots the standardized residuals against each observation's leverage, with Cook's distance contours marking influential points. Since a few points fall outside the Cook's distance lines, there could be room for improvement by checking their influence or removing them.
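The four diagnostics described above can be viewed together, and the influential points listed by row, with something like the sketch below. It runs on simulated data rather than the Kaggle file, so the data frame `sim`, its coefficients, and the noise pattern are all made up for illustration. The 4/n cutoff for Cook's distance is a common rule of thumb, not a formal test.

```r
# Simulated stand-in for the movies data (hypothetical values, not the Kaggle file)
set.seed(1)
n <- 200
sim <- data.frame(Budget = runif(n, 1e6, 2e8))
# Noise grows with Budget on purpose, to mimic non-constant variance
sim$Box.Office <- 3 * sim$Budget + rnorm(n, sd = 0.5 * sim$Budget)

sim_reg <- lm(Box.Office ~ Budget, data = sim)

# Draw the four diagnostic plots in one 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(sim_reg)
par(old_par)

# Cook's distance for every observation; 4/n is a rule-of-thumb cutoff
cooks <- cooks.distance(sim_reg)
flagged <- which(cooks > 4 / n)
head(sim[flagged, ])
```

On the real data, replacing `sim_reg` with `my_reg` and `n` with `nrow(df)` would list the movies that sit furthest outside the Cook's distance lines.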
Chart Suggestions
The charts show that some improvements could definitely be made to the model to meet the assumptions better. The Q-Q plot shows the normality assumption is not met as well as it could be; one way to improve this would be transforming the variables, for example by taking logs. Since there is a slight linear trend in the scale-location plot instead of a horizontal one, the homoscedasticity assumption is not met well. Transformations will help with this, but so will adding variables and interaction terms to better capture the variability. The residuals vs leverage plot shows a few high-leverage points; it would be best to check their influence on the model and whether they can be removed.
I will test taking log(Budget) to see if that improves the model or any of the residual plots.
Testing Transformation
my_reg2 <- lm(Box.Office ~ log(Budget), data = df)
summary(my_reg2)
Call:
lm(formula = Box.Office ~ log(Budget), data = df)

Residuals:
       Min         1Q     Median         3Q        Max
-248273670  -87781803  -39958627   32948509 2660040131

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -887427049   28840723  -30.77   <2e-16 ***
log(Budget)   59656323    1720760   34.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 157600000 on 3972 degrees of freedom
Multiple R-squared:  0.2323,  Adjusted R-squared:  0.2321
F-statistic:  1202 on 1 and 3972 DF,  p-value: < 2.2e-16
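The slope interpretation changes: when log(Budget) increases by one unit (that is, when the budget is multiplied by e, about 2.72), predicted box office earnings increase by $59,656,323. The R-squared also dropped from 0.4999 to 0.2323, so this model explains less of the variation in box office earnings than the original.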
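Because a one-unit change in log(Budget) is hard to picture, it can help to translate the slope into the effect of a percentage change in budget. The sketch below does that arithmetic with the slope copied from the output above; the helper name `effect_of_multiplying` is made up for this example.

```r
# Slope on log(Budget), copied from the summary(my_reg2) output
b1_log <- 59656323

# Change in predicted box office when Budget is multiplied by `multiplier`
effect_of_multiplying <- function(multiplier) b1_log * log(multiplier)

effect_of_multiplying(1.01)     # 1% larger budget: roughly $0.59 million
effect_of_multiplying(2)        # doubled budget: roughly $41 million
effect_of_multiplying(exp(1))   # one full unit of log(Budget): the slope itself
```

In this model the predicted dollar gain depends only on the proportional change in budget, so doubling a $1 million budget and doubling a $100 million budget give the same predicted increase.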
Next, we check whether this helped the assumptions in our model.
plot(my_reg2)
With the transformed variable in the model, the residuals vs fitted plot shows more of a fan shape, suggesting heteroscedasticity, and the Q-Q plot deviates more from the normal line (it was better without the transformation). The scale-location plot has an elbow sort of shape but could be said to be more horizontal than before, and the leverage points fall outside the Cook's distance lines.
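Since logging only the budget did not help, one possible next step is to log both variables. This was not run in the analysis above; the sketch below only shows the mechanics on simulated data, and the data frame `sim`, its coefficients, and the noise level are all assumptions made for illustration. A log-log fit like this is often tried when the spread of the outcome grows with the predictor, but whether it helps for the movies data would need to be checked against the real residual plots.

```r
# Simulated stand-in for the movies data (hypothetical values, not the Kaggle file)
set.seed(2)
n <- 200
sim <- data.frame(Budget = exp(runif(n, log(1e6), log(2e8))))
# Multiplicative noise, so the spread of Box.Office grows with Budget
sim$Box.Office <- 3 * sim$Budget * exp(rnorm(n, sd = 0.8))

# Logs are only defined for positive values, so drop anything else first
sim_pos <- subset(sim, Budget > 0 & Box.Office > 0)

loglog_reg <- lm(log(Box.Office) ~ log(Budget), data = sim_pos)
summary(loglog_reg)

# Same four diagnostic plots, for comparison with the earlier models
old_par <- par(mfrow = c(2, 2))
plot(loglog_reg)
par(old_par)
```

In a log-log model the slope reads as an elasticity: the approximate percentage change in box office earnings for a 1% change in budget.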