For this homework, we will investigate the Movie.csv dataset located on D2L. The dataset contains information for movies since 2010. Be sure to download the file to your computer and import it using “Import Dataset”, then “From Text File…”.
Answer all questions completely. Explain in complete sentences; do not just submit code. See the Day2PracticeSolution.Rmd file for an example of what is expected. You may use any R functions to help answer the questions.
1) View the dataset and think about possible questions you could answer using a first-order regression model. What information could you predict for a movie? What variables might be useful for predicting? Explain.
I could predict the box office numbers for a movie based on year, budget, and season. Economic conditions in a given year may affect the number of customers available and ticket prices. Movies with higher budgets are generally marketed well and attract a lot of interest. If the movie is not a first season, then the previous movie must have been successful and generated interest in the movie being considered.
2) Consider Budget as \((X)\) and Box.Office as \((Y)\). Obtain the estimated regression function.
X <- Movie$Budget
Y <- Movie$Box.Office
mod1 <- lm(Y ~ Budget, data = Movie)
summary(mod1)
##
## Call:
## lm(formula = Y ~ Budget, data = Movie)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -218015017  -31206199  -11272350   21906706  420865827
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.726e+07  4.142e+06   4.166 3.55e-05 ***
## Budget      9.957e-01  4.608e-02  21.610  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65750000 on 598 degrees of freedom
## Multiple R-squared: 0.4385, Adjusted R-squared: 0.4376
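## F-statistic: 467 on 1 and 598 DF, p-value: < 2.2e-16
# The estimated regression function is Yhat = 1.726e+07 + 0.9957 X, where X is the budget and Yhat is the estimated mean box office, both in dollars.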
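As a rough check on where the two coefficients come from, the sketch below computes the least squares estimates from the closed-form formulas and compares them with lm(). It runs on simulated data, since Movie.csv is not available here; the data frame `sim`, its values, and the names `b0_hand` and `b1_hand` are assumptions made only for illustration.

```r
# Simulated stand-in for the Movie data (hypothetical values, not the real dataset)
set.seed(1)
sim <- data.frame(Budget = runif(50, min = 1e6, max = 2e8))
sim$Box.Office <- 2e7 + 1.0 * sim$Budget + rnorm(50, sd = 5e7)

# Least squares estimates from the closed-form formulas
x <- sim$Budget
y <- sim$Box.Office
b1_hand <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hand <- mean(y) - b1_hand * mean(x)

# The same estimates from lm()
fit <- lm(Box.Office ~ Budget, data = sim)
coef(fit)
c(b0_hand, b1_hand)
```

The two printed vectors should agree up to floating-point error, which shows that lm() is reporting these same formulas.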
3) Plot the estimated regression function and the data. How well does the estimated regression function fit the data?
library(ggplot2)
ggplot(data = Movie, aes(x = Budget, y = Box.Office)) +
  geom_point() +
  # Overlay the fitted line from mod1; constant slope and intercept go outside aes()
  geom_abline(slope = coef(mod1)[2], intercept = coef(mod1)[1]) +
  labs(title = "Budget vs Box Office", x = "Budget for Movie",
       y = "Box Office Revenue")
# The fitted line has a clearly positive slope, but with R-squared = 0.4385, budget explains only about 44 percent of the variation in box office, so the fit is moderate at best.
4) Interpret \(b_0\) in your estimated regression function. Does \(b_0\) provide any relevant information here? Explain.
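# When the budget is $0, the estimated mean box office is the intercept, b_0 = 1.726e+07, or about $17.26 million. The intercept does not provide any relevant information here, because a movie with a budget of $0 is not realistic and X = 0 is likely outside the range of the observed budgets.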
5) Interpret \(b_1\) in your estimated regression function, be sure to include units.
# The slope b_1 is 9.957e-01, or about $1.00 of box office per dollar of budget. For each additional dollar added to the budget, the estimated mean box office increases by about $1.00.
6) Calculate and interpret a 90 percent confidence interval for \(b_0\).
predict(mod1, newdata = data.frame(Budget = 0), interval = "confidence", level = 0.9)
## fit lwr upr
## 1 17258427 10434248 24082606
# We are 90 percent confident that the intercept is between $10,434,248 and $24,082,606.
7) Calculate and interpret a 90 percent confidence interval for \(b_1\).
confint(mod1, 'Budget', level=0.9)
## 5 % 95 %
## Budget 0.9197873 1.071596
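# We are 90 percent confident that the slope is between $0.92 and $1.07; that is, each additional dollar of budget is associated with an increase of $0.92 to $1.07 in mean box office.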
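The interval reported by confint() can be rebuilt as estimate ± t* × standard error, which may help when interpreting it. The following is a minimal sketch on simulated data; `sim`, `fit`, and `ci_hand` are hypothetical names and the values are not from the Movie dataset.

```r
# Simulated stand-in for the Movie data (hypothetical values)
set.seed(2)
sim <- data.frame(Budget = runif(40, min = 1e6, max = 2e8))
sim$Box.Office <- 1e7 + 0.9 * sim$Budget + rnorm(40, sd = 4e7)
fit <- lm(Box.Office ~ Budget, data = sim)

# Slope estimate and its standard error from the coefficient table
est <- summary(fit)$coefficients["Budget", "Estimate"]
se  <- summary(fit)$coefficients["Budget", "Std. Error"]

# For a 90 percent interval, the critical value uses 1 - 0.10 / 2 = 0.95
tstar <- qt(1 - 0.10 / 2, df = df.residual(fit))
ci_hand <- c(est - tstar * se, est + tstar * se)

ci_hand
confint(fit, "Budget", level = 0.90)
```

The column labels "5 %" and "95 %" in the confint() output correspond to the two tail probabilities used for `tstar`.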
8) A movie company believes that for each additional dollar they put towards the budget, they should get back at least one additional dollar in box office sales. Conduct a hypothesis test to check this claim. Be sure to state the hypotheses, test statistic, p value, and conclusion.
# Hypotheses: H0: beta_1 = 1 vs. Ha: beta_1 > 1, where beta_1 is the slope for Budget.
# The test statistic is t = (b_1 - 1) / s(b_1) = -0.0933 with n - 2 = 598 degrees of freedom,
# and the upper-tail p-value is p = 0.5372.
b0 <- mod1$coefficients[1]
b1 <- mod1$coefficients[2]
n <- nrow(Movie)
MSE <- sum(mod1$residuals^2) / (n - 2)  # mean squared error
s_beta1 <- sqrt(MSE / sum((Movie$Budget - mean(Movie$Budget))^2))  # standard error of b_1
teststat <- (b1 - 1) / s_beta1
p <- pt(teststat, n - 2, lower.tail = FALSE)  # upper-tail probability, matching Ha: beta_1 > 1
# Since the p-value is much larger than 0.05, we fail to reject H0. Therefore, the data do not provide sufficient evidence that each additional dollar of budget returns more than one dollar in box office sales.
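The standard error computed by hand above can be cross-checked against the coefficient table from summary(), which tests only beta_1 = 0 by default. The sketch below, on simulated data with hypothetical names (`sim`, `fit`, `p_upper`, `p_lower`), builds the statistic for a null value of 1 and shows the p-value for each one-sided alternative, since the direction depends on how the company's claim is framed.

```r
# Simulated stand-in for the Movie data (hypothetical values)
set.seed(3)
sim <- data.frame(Budget = runif(60, min = 1e6, max = 2e8))
sim$Box.Office <- 1.5e7 + 1.0 * sim$Budget + rnorm(60, sd = 5e7)
fit <- lm(Box.Office ~ Budget, data = sim)

coefs <- summary(fit)$coefficients
b1    <- coefs["Budget", "Estimate"]
se_b1 <- coefs["Budget", "Std. Error"]

# t statistic for the null value beta_1 = 1
t_stat <- (b1 - 1) / se_b1
p_upper <- pt(t_stat, df = df.residual(fit), lower.tail = FALSE)  # Ha: beta_1 > 1
p_lower <- pt(t_stat, df = df.residual(fit), lower.tail = TRUE)   # Ha: beta_1 < 1
c(t = t_stat, p_upper = p_upper, p_lower = p_lower)
```

The two p-values always sum to 1, so a large upper-tail p-value such as the one reported above goes with a lower-tail p-value near 0.5 when the statistic is close to zero.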
9) Obtain a point estimate of the mean box office sales when \(X=1.5 \times 10^8\).
predict(mod1, newdata = data.frame(Budget = 1.5*10^8))
## 1
## 166612210
# The point estimate of the mean box office is $166,612,210.
10) Avengers Endgame had a budget of $356 million. Obtain a 95 percent confidence interval for the mean box office total for movies with a budget of $356 million.
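# interval = "confidence" gives the interval for the mean response at this budget;
# interval = "prediction" would instead give the wider interval for a single movie.
predict(mod1, newdata = data.frame(Budget = 356000000), interval = "confidence", level = 0.95)
# We are 95 percent confident that the mean box office total for movies with a budget of $356 million lies between the lwr and upr values returned by this call, which are centered at the point estimate of $371,724,739. This interval is narrower than a prediction interval for a single movie because it reflects only the uncertainty in the estimated mean.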
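The difference between the two interval types in predict() is easy to see side by side. This is a small sketch on simulated data; `sim`, `fit`, `new_movie`, and the budget of 1.5e8 are hypothetical choices, not values from the Movie dataset.

```r
# Simulated stand-in for the Movie data (hypothetical values)
set.seed(4)
sim <- data.frame(Budget = runif(80, min = 1e6, max = 2e8))
sim$Box.Office <- 2e7 + 1.0 * sim$Budget + rnorm(80, sd = 5e7)
fit <- lm(Box.Office ~ Budget, data = sim)

new_movie <- data.frame(Budget = 1.5e8)

# Interval for the MEAN box office of all movies with this budget
conf_int <- predict(fit, newdata = new_movie, interval = "confidence", level = 0.95)
# Interval for the box office of ONE new movie with this budget
pred_int <- predict(fit, newdata = new_movie, interval = "prediction", level = 0.95)

widths <- c(confidence = unname(conf_int[, "upr"] - conf_int[, "lwr"]),
            prediction = unname(pred_int[, "upr"] - pred_int[, "lwr"]))
widths
```

Both intervals share the same center (the `fit` column); the prediction interval is wider because it adds the variance of an individual observation to the variance of the estimated mean.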
11) Now construct a 95 percent prediction interval for the box office total for Avengers Endgame. The movie made $2.24 billion in the box office, does this number fall in the prediction interval? If not, why do you think the interval missed? How large is the interval?
yhat <- predict(mod1, newdata = data.frame(Budget = 356000000))
tstar <- qt((1 + .95) / 2, nrow(Movie) - 2)
# Standard error of prediction at Budget = 356000000, the same budget used for yhat
s <- sqrt(MSE * (1 + 1 / nrow(Movie) + (356000000 - mean(Movie$Budget))^2 / sum((Movie$Budget - mean(Movie$Budget))^2)))
lower <- yhat - tstar * s
upper <- yhat + tstar * s
c(lower, upper)
## 1 1
## 239890361 503559116
upper - lower
## 1
## 263668755
# The 95 percent prediction interval for Avengers Endgame is from $239,890,361 to $503,559,116.
# The movie's actual box office of $2.24 billion does not fall in the prediction interval. Budget alone explains only about 44 percent of the variation in box office (R-squared = 0.4385), so factors that are not in the model can push an individual movie far from the fitted line. The interval is $263,668,755 wide.
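The hand-built prediction interval can be checked against predict() with interval = "prediction". In the sketch below, the new budget is stored once in `x_new` so that the point estimate and the standard error use the same X value. The data are simulated, and `sim`, `fit`, `x_new`, and `pi_hand` are hypothetical names.

```r
# Simulated stand-in for the Movie data (hypothetical values)
set.seed(5)
sim <- data.frame(Budget = runif(80, min = 1e6, max = 2e8))
sim$Box.Office <- 2e7 + 1.0 * sim$Budget + rnorm(80, sd = 5e7)
fit <- lm(Box.Office ~ Budget, data = sim)

x_new <- 1.5e8                      # new budget, defined once and reused below
n     <- nrow(sim)
mse   <- sum(residuals(fit)^2) / (n - 2)
sxx   <- sum((sim$Budget - mean(sim$Budget))^2)

yhat   <- unname(predict(fit, newdata = data.frame(Budget = x_new)))
s_pred <- sqrt(mse * (1 + 1 / n + (x_new - mean(sim$Budget))^2 / sxx))
tstar  <- qt(0.975, df = n - 2)
pi_hand <- c(lwr = yhat - tstar * s_pred, upr = yhat + tstar * s_pred)

pi_hand
predict(fit, newdata = data.frame(Budget = x_new), interval = "prediction", level = 0.95)
```

The `lwr` and `upr` values from the two approaches should match up to floating-point error.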