Submit your .rmd file on D2L by 12:30pm on May 13th

For this homework, we will investigate the Movie.csv dataset located on D2L. The dataset contains information for movies since 2010. Be sure to download the file to your computer and import using “Import Dataset”, then “From Text File…”

Answer all questions completely. Explain in complete sentences, do not just submit code. See the Day3PracticeSolution.Rmd file for an example of what is expected. You may use any R functions to help answer the questions.

We will be considering the same model you found in the last homework. Note for simplicity, Box.Office and Budget have been converted to millions of dollars for the model.

1) Fit the linear model by gradient descent. State the values used for \(\alpha\) and the threshold, in addition to the number of iterations needed to converge. Be sure to calculate the standard error for each coefficient. Hint: You may need to make the step size \(\alpha\) smaller

## [1] 17.25843
## [1] 0.9956919
## [1] 251168

The value used for \(\alpha\) is 0.0001, the threshold is \(10^{-12}\), and 251,168 iterations were needed to converge. The fitted coefficients are \(b_0 = 17.258\) and \(b_1 = 0.9957\). The standard error for \(b_0\) is 4.14 and the standard error for \(b_1\) is 0.046.
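The fitting procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated (Movie.csv is not read), and the function name `gradient_descent_lm` and its arguments are hypothetical. The stopping rule used here (change in the cost below the threshold) is one common choice and may differ from the rule used for the answer above.

```r
# Batch gradient descent for simple linear regression (illustrative sketch).
# ASSUMPTION: simulated data stand in for Budget and Box.Office (in millions).
set.seed(1)
n <- 100
x <- runif(n, 1, 100)
y <- 17 + 1 * x + rnorm(n, sd = 20)

gradient_descent_lm <- function(x, y, alpha = 1e-4, threshold = 1e-12,
                                max_iter = 2e6) {
  X <- cbind(1, x)                       # design matrix with intercept column
  n <- length(y)
  beta <- c(0, 0)                        # starting values
  res <- y - X %*% beta                  # current residuals
  cost <- sum(res^2) / (2 * n)           # cost = SSE / (2n)
  converged <- FALSE
  for (iter in seq_len(max_iter)) {
    grad <- -crossprod(X, res) / n       # gradient of the cost
    beta <- beta - alpha * as.vector(grad)
    res <- y - X %*% beta
    new_cost <- sum(res^2) / (2 * n)
    if (abs(cost - new_cost) < threshold) {
      converged <- TRUE
      break
    }
    cost <- new_cost
  }
  # Standard errors from MSE * (X'X)^{-1}, as in ordinary least squares
  mse <- sum(res^2) / (n - 2)
  se <- sqrt(diag(mse * solve(crossprod(X))))
  list(beta = beta, se = se, iterations = iter, converged = converged)
}

fit <- gradient_descent_lm(x, y)
fit$beta
fit$se
fit$iterations
```

Because the predictor is not centered or scaled, the intercept converges slowly, which is why a small \(\alpha\) can require hundreds of thousands of iterations. Comparing `fit$beta` with `coef(lm(y ~ x))` is a quick sanity check.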

2) Set up the ANOVA table. Be sure to include a row for Total in your table.

| SOV | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Regression | 2019016.411 | 1 | 2019016.411 | 466.998 | < 0.001 |
| Error | 2585390.236 | 598 | 4323.395 | | |
| Total | 4604406.646 | 599 | | | |

The ANOVA table is established above with a row for Total.
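As a check on the arithmetic, the table entries can be rebuilt from the two sums of squares and the sample size. The sketch below hard-codes the values reported above rather than refitting the model; the variable names are illustrative.

```r
# Sums of squares and sample size taken from the ANOVA table above
ssr <- 2019016.411
sse <- 2585390.236
n   <- 600

df_reg <- 1
df_err <- n - 2
msr <- ssr / df_reg                      # mean square for regression
mse <- sse / df_err                      # mean square error
f_star  <- msr / mse
p_value <- pf(f_star, df_reg, df_err, lower.tail = FALSE)

anova_table <- data.frame(
  SOV = c("Regression", "Error", "Total"),
  SS  = c(ssr, sse, ssr + sse),
  df  = c(df_reg, df_err, n - 1),
  MS  = c(msr, mse, NA),
  F   = c(f_star, NA, NA),
  p   = c(p_value, NA, NA)
)
print(anova_table)
```

The p-value is positive but far too small to show at three decimals, which is why it rounds to 0 in a printed table.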

3) Conduct a \(t\) test to decide whether or not there is a linear association between the movie’s budget \((X)\) and box office \((Y)\). Use a level of significance of \(.1\). State the alternatives, decision rule, and conclusion. What is the \(P\)-value of the test?

## [1] "t* =  21.6101342237578"
## [1] "t cutoff is  1.64740571193447"
## [1] "p-value =  0"

\(H_0:\beta _1 = 0, H_a:\beta _1 \neq 0.\)

Decision Rule: If \(|t^*| \leq 1.647\), conclude \(H_0\), else conclude \(H_a\).

Since \(|t^*| = 21.61 > 1.647\), we reject \(H_0\) and conclude that \(\beta _1\) is significantly different from zero. The \(P\)-value prints as 0 only because of rounding; it is positive but far below \(0.001\). That is, there is a linear association between the movie’s budget and box office.
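The cutoff and the two-sided \(P\)-value can be reproduced from the reported test statistic. This sketch assumes \(n = 600\) (598 error degrees of freedom, as in the ANOVA table) and uses the printed value of \(t^*\) directly.

```r
t_star <- 21.6101342237578     # test statistic reported above
n      <- 600
alpha  <- 0.10

t_cut   <- qt(1 - alpha / 2, df = n - 2)                        # two-sided cutoff
p_value <- 2 * pt(abs(t_star), df = n - 2, lower.tail = FALSE)  # two-sided P-value
reject  <- abs(t_star) > t_cut

t_cut
p_value
reject
```

Using `lower.tail = FALSE` avoids the loss of precision in `1 - pt(...)`, which can return exactly 0 for a statistic this large.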

4) Conduct an \(F\) test to decide whether or not there is a linear association between the movie’s budget and box office; control the \(\alpha\) risk at \(.1\). State the alternatives, decision rule, and conclusion.

## [1] "F* =  466.997901168827"
## [1] "F cutoff is  2.71394557971433"
## [1] "p-value =  5.59602981406871e-77"

\(H_0:\beta _1 = 0, H_a:\beta _1 \neq 0.\)

Decision Rule: If \(F^* \leq 2.7139\), conclude \(H_0\), else conclude \(H_a\).

Since \(F^* > 2.7139\), we reject \(H_0\) and conclude that \(\beta _1\) is significantly different from zero (\(p=5.5960 \times 10^{-77}\)). That is, there is a linear association between a movie’s budget and box office.
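In simple linear regression the \(F\) test and the two-sided \(t\) test are equivalent: \(F^* = (t^*)^2\), and the \(F\) cutoff is the square of the \(t\) cutoff. A short check using the printed statistics (with 598 error degrees of freedom assumed from the ANOVA table):

```r
f_star <- 466.997901168827     # F statistic reported above
df_err <- 598

f_cut <- qf(0.90, df1 = 1, df2 = df_err)   # cutoff for alpha = 0.10
t_cut <- qt(0.95, df = df_err)             # two-sided t cutoff for alpha = 0.10

sqrt(f_star)      # matches |t*| from the t test
t_cut^2           # matches f_cut
f_star > f_cut    # TRUE means reject H0
```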

5) Calculate \(R^2\) and \(r\).

## [1] "r^2 =  0.438496546047457"
## [1] "r =  0.66219071727672"

\(R^2 = 0.4385 = 43.85\%\) and \(r = 0.662191\).
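Both quantities follow from the ANOVA sums of squares: \(R^2 = SSR/SSTO\), and \(r\) takes the sign of the slope. A minimal sketch using the values reported earlier:

```r
ssr  <- 2019016.411    # regression sum of squares from the ANOVA table
ssto <- 4604406.646    # total sum of squares from the ANOVA table
b1   <- 0.9956919      # estimated slope from part 1

r_squared <- ssr / ssto
r <- sign(b1) * sqrt(r_squared)   # r has the same sign as the slope

r_squared
r
```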

6) Prepare a residual plot. Discuss the features of the plot and your conclusion.

The residuals fan out as the fitted values increase: the spread is narrow for small fitted values and wide for large ones. This pattern in the residuals is evidence that the error variance is not constant.

7) Prepare a normal probability plot. Discuss the features of the plot and your conclusion.

The points fall farther from the reference line as the theoretical quantiles move away from 0. This is evidence that the normality assumption is violated.
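Both diagnostic plots can be produced with base R graphics. The sketch below uses simulated data whose error spread grows with the predictor (an assumption made only to mimic the pattern described above); the answers above may have been produced with ggplot2 instead.

```r
# ASSUMPTION: simulated data, not Movie.csv
set.seed(2)
x <- runif(200, 1, 100)
y <- 17 + x + rnorm(200, sd = 0.5 * x)   # error spread grows with x
mod <- lm(y ~ x)

# Residual plot: look for a funnel shape or curvature
plot(fitted(mod), resid(mod),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted values")
abline(h = 0, lty = 2)

# Normal probability plot: points should follow the reference line
qqnorm(resid(mod), main = "Normal probability plot of residuals")
qqline(resid(mod))
```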

8) Prepare a time plot of the residuals to ascertain whether the error terms are correlated over time. What is your conclusion? Note that Movie$Date is a Factor; you can change the type to Date with Movie$Date <- as.Date(Movie$Date).

There does not appear to be any change in variance over time.

9) Conduct the Breusch-Pagan test to determine whether or not the error variance varies with the level of \(Budget\). Use \(\alpha =.05\). State the alternatives, decision rule, and conclusion.

##  studentized Breusch-Pagan test
## 
## data:  mod1
## BP = 76.726, df = 1, p-value < 2.2e-16

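\(H_0:\) the error variance is constant (it does not depend on \(Budget\)), \(H_a:\) the error variance varies with \(Budget\).

Decision Rule: If \(\chi ^2_{BP} \leq \chi ^2(0.95; 1) = 3.84\), conclude \(H_0\), else conclude \(H_a\).

Given a \(\chi ^2_{BP}\) test statistic of \(76.726\) \((p < 2.2 \times 10^{-16})\), we reject the null hypothesis at the \(\alpha =.05\) level and conclude that there is evidence of nonconstant variance.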
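The studentized (Koenker) form of the statistic, which is what `bptest` from lmtest reports by default, can be computed by hand as \(n\) times the \(R^2\) from regressing the squared residuals on the predictor. The sketch below uses simulated data and base R only, so the numbers will not match the output above.

```r
# ASSUMPTION: simulated heteroscedastic data, not Movie.csv
set.seed(3)
n <- 300
x <- runif(n, 1, 100)
y <- 17 + x + rnorm(n, sd = 0.5 * x)
mod <- lm(y ~ x)

e2  <- resid(mod)^2                 # squared residuals
aux <- lm(e2 ~ x)                   # auxiliary regression on the predictor

bp_stat <- n * summary(aux)$r.squared          # studentized BP statistic
bp_cut  <- qchisq(0.95, df = 1)                # cutoff for alpha = 0.05
p_value <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

bp_stat
bp_cut
p_value
```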

10) Use the Box-Cox procedure to find an appropriate power transformation of \(Y\). Evaluate \(SSE\) for \(\lambda = .20\) to \(\lambda=.30\) using \(0.01\) increments. What transformation of \(Y\) is suggested?

## [1] 0.2626263

The minimum SSE occurs at \(\lambda = 0.2626 \approx 0.25\). That is, the Box-Cox procedure suggests the fourth-root transformation \(Y'=\sqrt[4]{Y}\).
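The SSE search over a grid of \(\lambda\) values can also be written out directly with the standardized transformation \(W = K_1(Y^{\lambda} - 1)\), where \(K_2\) is the geometric mean of \(Y\) and \(K_1 = 1/(\lambda K_2^{\lambda - 1})\). This is a sketch on simulated data; the helper `boxcox_sse` is hypothetical and is not the MASS routine used for the output above.

```r
# ASSUMPTION: simulated positive response, not Movie.csv
set.seed(4)
n <- 300
x <- runif(n, 1, 200)
y <- (2.3 + 0.008 * x + rnorm(n, sd = 0.4))^4

# SSE of the regression of the standardized Box-Cox response on x
boxcox_sse <- function(lambda, x, y) {
  k2 <- exp(mean(log(y)))                  # geometric mean of Y
  if (abs(lambda) < 1e-8) {
    w <- k2 * log(y)                       # limiting case lambda = 0
  } else {
    k1 <- 1 / (lambda * k2^(lambda - 1))
    w <- k1 * (y^lambda - 1)
  }
  sum(resid(lm(w ~ x))^2)
}

lambdas <- seq(0.20, 0.30, by = 0.01)
sse <- sapply(lambdas, boxcox_sse, x = x, y = y)
best_lambda <- lambdas[which.min(sse)]

data.frame(lambda = lambdas, SSE = sse)
best_lambda
```

Standardizing keeps the SSE values comparable across \(\lambda\), so the smallest SSE identifies the suggested power.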

11) Use the transformation found in part (10) and obtain the estimated linear regression function for the transformed data.

## 
## Call:
## lm(formula = yprime ~ Xbudget)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.51943 -0.32687  0.00667  0.33650  1.61074 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.2550823  0.0332313   67.86   <2e-16 ***
## Xbudget     0.0078567  0.0003696   21.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5275 on 598 degrees of freedom
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4294 
## F-statistic: 451.8 on 1 and 598 DF,  p-value: < 2.2e-16

The fitted regression line is \(\hat{Y}'=2.255+0.0079X\).

12) Express the estimated regression function in the original units.

The fitted regression line is \(\hat{Y}=(2.255+0.0079X)^4\).
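Back-transforming amounts to raising the fitted value on the fourth-root scale to the fourth power. A small sketch using the coefficients from the summary output in part (11); the function name `predict_box_office` is illustrative.

```r
b0 <- 2.2550823    # intercept from the summary output in part (11)
b1 <- 0.0078567    # slope from the summary output in part (11)

# Predicted box office (millions of dollars) for a budget in millions
predict_box_office <- function(budget) {
  y_prime <- b0 + b1 * budget   # fitted value on the fourth-root scale
  y_prime^4                     # back-transform to original units
}

predict_box_office(c(50, 100, 200))
```

Note that the slope is about 0.0079, so the back-transformed curve grows slowly with budget.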

13) Now download Movie2.csv from D2L. This file contains the same data but in a different format. Now fit a linear regression for \(Box.Office\) using \(Budget\), \(Theatres\) and \(Sequel\) as explanatory variables. Be sure to state the linear regression function.

## 
## Call:
## glm(formula = Box.Office ~ Budget + Theatres + Sequel, data = Movie2)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -153377997   -34808432    -6841733    21547004   402223581  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.116e+08  1.341e+07  -8.324 6.14e-16 ***
## Budget       5.492e-01  5.629e-02   9.756  < 2e-16 ***
## Theatres     4.989e+04  5.063e+03   9.852  < 2e-16 ***
## Sequel       3.030e+07  6.497e+06   4.663 3.87e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 3.465997e+15)
## 
##     Null deviance: 4.4384e+18  on 582  degrees of freedom
## Residual deviance: 2.0068e+18  on 579  degrees of freedom
## AIC: 22521
## 
## Number of Fisher Scoring iterations: 2

The fitted regression function is \(\hat{Y}=-1.116 \times 10^8+5.492 \times 10^{-1}X_1 + 4.989 \times 10^4X_2 + 3.030 \times 10^7X_3\), where \(X_1\) represents Budget, \(X_2\) represents Theatres, and \(X_3\) represents Sequel.

14) Obtain a residual plot and a normal probability plot. Be sure to make a comment about them.

Residual plot: the residuals fan out as the fitted values increase, which is evidence of nonconstant error variance. Normal probability plot: the points fall farther from the reference line as the theoretical quantiles increase above 0, so the upper tail departs from normality. This is evidence that the normality assumption is violated.

15) Again, use the Box-Cox procedure and standardization to find an appropriate power transformation of \(Y\). Evaluate \(SSE\) for \(\lambda = .10\) to \(\lambda=.20\) using \(0.01\) increments. What transformation of \(Y\) is suggested?

## [1] 0.1414141

Note that the minimum SSE comes with \(\lambda = 0.1414 \approx 1/7\). That is, the Box-Cox procedure suggests \(Y'=\sqrt[7]{Y}\).

16) Use the transformation found in part (15) and obtain the estimated linear regression function for the transformed data.

## 
## Call:
## lm(formula = yprime ~ Budget + Theatres + Sequel, data = Movie2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6293 -0.7229 -0.0837  0.6691  3.3990 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.938e+00  2.459e-01  28.214  < 2e-16 ***
## Budget      6.397e-09  1.032e-09   6.196  1.1e-09 ***
## Theatres    1.771e-03  9.286e-05  19.073  < 2e-16 ***
## Sequel      2.399e-01  1.192e-01   2.013   0.0446 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.08 on 579 degrees of freedom
## Multiple R-squared:  0.6532, Adjusted R-squared:  0.6514 
## F-statistic: 363.5 on 3 and 579 DF,  p-value: < 2.2e-16

The fitted regression function is \(\hat{Y}'=6.938+6.397 \times 10^{-9}X_1+1.771 \times 10^{-3}X_2+2.399 \times 10^{-1}X_3\), where \(X_1\) represents Budget, \(X_2\) represents Theatres, and \(X_3\) represents Sequel.

17) Express the estimated regression function in the original units.

The fitted regression function is \(\hat{Y}=(6.938+6.397 \times 10^{-9}X_1+1.771 \times 10^{-3}X_2+2.399 \times 10^{-1}X_3)^7\).

18) Obtain a residual plot and a normal probability plot. Be sure to make a comment about them.

There appears to be a random scatter of points about the line. There is no evidence of a pattern within the residuals. The points seem to follow the line nicely. We have no evidence of the normality assumption being violated.

19) Conduct an F test to determine if there is a regression relation between the response and the explanatory variables. Control the \(\alpha\) risk at \(.1\). State the alternatives, decision rule, and conclusion.

## [1] "F* =  363.5"
## [1] "F cutoff is  2.09"
## [1] "p-value < 2.2e-16"

\(H_0:\beta _1 = \beta _2 = \beta_3 = 0, H_a:\) not all \(\beta _k\) \((k = 1, 2, 3)\) equal zero.

Decision Rule: With three predictors the test has 3 and 579 degrees of freedom. If \(F^* \leq F(0.90; 3, 579) \approx 2.09\), conclude \(H_0\), else conclude \(H_a\).

Since \(F^* = 363.5 > 2.09\) (the F-statistic in the summary from part (16)), we reject \(H_0\) and conclude that at least one of \(\beta_1\), \(\beta_2\), \(\beta_3\) is significantly different from zero (\(p < 2.2 \times 10^{-16}\)). That is, there is a regression relation between the response and the explanatory variables.
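For a model with three predictors, the overall \(F\) test has 3 numerator and \(n - 4 = 579\) denominator degrees of freedom. The statistic can be recovered from the \(R^2\) in the part (16) summary as a check. This sketch assumes \(n = 583\) (from the 582 null degrees of freedom in part (13)) and uses the rounded \(R^2\), so it agrees with the summary's F-statistic only to about one decimal.

```r
r2 <- 0.6532    # Multiple R-squared from the summary in part (16)
n  <- 583       # number of observations used in the fit
p  <- 4         # number of regression coefficients, including the intercept

f_star  <- (r2 / (p - 1)) / ((1 - r2) / (n - p))   # MSR / MSE written via R^2
f_cut   <- qf(0.90, df1 = p - 1, df2 = n - p)      # cutoff for alpha = 0.10
p_value <- pf(f_star, p - 1, n - p, lower.tail = FALSE)

f_star
f_cut
f_star > f_cut   # TRUE means reject H0
```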

20) Let’s try this again! Construct a 95 percent prediction interval for the box office sales for Avengers Endgame (use your model from part (16)). The budget was $356 million, the movie was released in 4,600 theatres and Sequel=1. The movie made $2.24 billion in the box office, does this number fall in the prediction interval?

##        fit      lwr      upr
## 1 17.60312 15.43375 19.77249

On the transformed scale, the 95 percent prediction interval for Avengers Endgame is \(15.43 \leq Y' \leq 19.77\). Raising both limits to the seventh power gives an interval in the original units of $208,593,239 to $1,181,488,645. The actual box office of $2.24 billion lies above the upper limit, so it does not fall in the prediction interval.
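The back-transformation is just the seventh power of each limit. A short check using the printed interval (the inputs are rounded, so the dollar amounts match the figures above only to within rounding):

```r
# Prediction interval on the seventh-root scale, as printed above
pi_transformed <- c(fit = 17.60312, lwr = 15.43375, upr = 19.77249)

pi_dollars <- pi_transformed^7    # back-transform to dollars
actual     <- 2.24e9              # reported box office for Avengers Endgame

inside <- actual >= pi_dollars[["lwr"]] && actual <= pi_dollars[["upr"]]

format(pi_dollars, big.mark = ",", scientific = FALSE)
inside
```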

Complete your project proposal. See D2L for details.