This R Markdown file includes the questions, the empty code chunk sections for your code, and the text blocks for your responses. Answer the questions below by completing this R Markdown file. You may make slight adjustments to get the file to knit/convert but otherwise keep the formatting the same. Once you’ve finished answering the questions, submit your responses in a single knitted file as HTML only.
There are 9 questions in total, each worth between 3.5 and 9 points. Partial credit may be given if your code is correct but your conclusion is incorrect, or vice versa.
Next Steps:
Save this .Rmd file in your R working directory, the same directory into which you will download the “auto-mpg.csv” data file. Having both files in the same directory makes it easier to read the “auto-mpg.csv” file.
Read the question and create the R code necessary within the code chunk section immediately below each question. Knitting this file will generate the output and insert it into the section below the code chunk.
Type your answer to the questions in the text block provided immediately after the response prompt.
Once you’ve finished answering all questions, knit this file and submit the knitted file as HTML on Canvas.
These are the exam questions. Each question has already been copied from Canvas and inserted into an individual text block below, so you do not need to copy/paste the questions from the online Canvas exam.
# Example code chunk area. Enter your code below the comment
Mock Response to Example Question: This is the section where you type your written answers to the question. Depending on the question asked, your typed response may be a number, a list of variables, a few sentences, or a combination of these elements.
Ready? Let’s begin. We wish you the best of luck!
Recommended Packages
# Loading relevant libraries
library(car)
library(MASS)
library(aod)
For this exam, you will be building a model to predict the fuel efficiency of a car (number of miles per gallon).
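The “auto-mpg.csv” data set consists of the following variables: mpg (the response), cylinders, displacement, horsepower, weight, acceleration, modelyear, and origin.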
Read the data and answer the questions below. Assume a significance threshold of 0.05 for hypothesis tests unless stated otherwise.
# Read the data set
autompg_full = read.csv('auto-mpg.csv',header=TRUE, sep = ",")
# Split the data into training and testing data
sample_size = floor(0.8*nrow(autompg_full))
set.seed(10)
idx = sample(seq_len(nrow(autompg_full)), size = sample_size)
autompg = autompg_full[idx, ]
autompg_test = autompg_full[-idx, ]
Note: For all of the following questions, use autompg as your data unless stated otherwise.
Create a histogram of the response variable mpg. Based on this plot, would it be appropriate to model this response variable using a Poisson regression? Explain.
hist(autompg$mpg)
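Response to Question 1: No. Poisson regression is designed for count data (non-negative integers), whereas mpg is a continuous, positive measurement. A histogram that does not look normal is not, by itself, a reason to use Poisson regression.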
model1 = lm(mpg~.,data=autompg)
summary(model1)
##
## Call:
## lm(formula = mpg ~ ., data = autompg)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.2801 -2.1088 -0.1009  1.8845 12.9594
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -1.841e+01  5.179e+00  -3.555 0.000437 ***
## cylinders    -4.332e-01  3.720e-01  -1.165 0.245107
## displacement  2.130e-02  8.435e-03   2.526 0.012045 *
## horsepower   -1.361e-02  1.499e-02  -0.908 0.364547
## weight       -6.926e-03  7.655e-04  -9.047  < 2e-16 ***
## acceleration  1.630e-01  1.099e-01   1.484 0.138903
## modelyear     7.569e-01  5.780e-02  13.094  < 2e-16 ***
## origin        1.327e+00  3.118e-01   4.257 2.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.439 on 310 degrees of freedom
## Multiple R-squared: 0.8142, Adjusted R-squared: 0.81
## F-statistic: 194.1 on 7 and 310 DF, p-value: < 2.2e-16
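B) Is the overall regression significant at the \(\alpha=0.01\) level? Explain.
Response to Question 2B: Yes. The overall regression is significant: the F-statistic is 194.1 on 7 and 310 degrees of freedom, with a p-value below 2.2e-16. This is far smaller than \(\alpha=0.01\), so we reject the null hypothesis that all slope coefficients are zero.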
#Code to calculate Cook's distance and plot it
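One way this chunk could be filled in is sketched below. It runs on the built-in mtcars data as a stand-in for autompg (an assumption, so that the sketch works without the exam file), and the 4/n and 1 cutoffs are common rules of thumb rather than values given in the exam.

```r
# Stand-in model: mtcars replaces autompg so the sketch runs on its own
fit <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)

# Cook's distance for every observation
cook <- cooks.distance(fit)

# Plot the distances as vertical spikes
plot(cook, type = "h", lwd = 2, ylab = "Cook's distance",
     main = "Cook's distance (stand-in model)")

# Rule-of-thumb cutoff (an assumption, not from the exam): 4/n
n <- nrow(mtcars)
abline(h = 4 / n, lty = 2, col = "red")

# Observations above each rule-of-thumb cutoff
flagged_4n <- names(cook)[cook > 4 / n]
flagged_1  <- names(cook)[cook > 1]
print(flagged_4n)
print(flagged_1)
```

For the exam data, the same calls would presumably be applied to model1, for example cooks.distance(model1).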
Response to Question 2C:
Using model1, create and interpret the following plots to determine whether the multiple linear regression (MLR) assumptions hold. State which assumption(s) can be assessed by each plot and comment on whether there are any apparent departures from the assumptions and how you came to these conclusions.
• Plot of the standardized residuals, \(r_i\), versus the fitted values, \(\hat{y_i}\)
• Q-Q plot of the standardized residuals, \(r_i\)
# Standardized residuals of model1
res = rstandard(model1)
# Standardized residuals versus fitted values
plot(fitted(model1), res, xlab= "Fitted values", ylab= "Standardized residuals", pch= 19)
abline(h = 0)
#Code to plot the qqplot
qqnorm(res, ylab="Std residuals")
qqline(res, col="blue", lwd=2)
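Response to Question 3: The residual plot shows a positive trend instead of a random scatter around zero, which indicates that the assumption of zero-mean errors (linearity) is violated. In addition, based on the Q-Q plot, the residuals do not follow a normal distribution, so the normality assumption does not appear to hold.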
# Code to log transform and create model
model2 = lm(log(mpg)~log(cylinders)+log(displacement)+log(horsepower)+log(weight)+log(acceleration)+log(modelyear)+log(origin),data=autompg)
summary(model2)
##
## Call:
## lm(formula = log(mpg) ~ log(cylinders) + log(displacement) +
## log(horsepower) + log(weight) + log(acceleration) + log(modelyear) +
## log(origin), data = autompg)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.39109 -0.06878  0.00174  0.06583  0.37946
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)       -0.47984    0.73426  -0.653    0.514
## log(cylinders)    -0.05833    0.06985  -0.835    0.404
## log(displacement)  0.01270    0.06467   0.196    0.845
## log(horsepower)   -0.27353    0.06197  -4.414 1.40e-05 ***
## log(weight)       -0.62338    0.09386  -6.642 1.39e-10 ***
## log(acceleration) -0.12563    0.06469  -1.942    0.053 .
## log(modelyear)     2.34575    0.14884  15.760  < 2e-16 ***
## log(origin)        0.03703    0.02108   1.756    0.080 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1167 on 310 degrees of freedom
## Multiple R-squared: 0.887, Adjusted R-squared: 0.8844
## F-statistic: 347.5 on 7 and 310 DF, p-value: < 2.2e-16
#Code to calculate VIF
# VIF threshold
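A minimal sketch of the VIF calculation follows, using only base R and the built-in mtcars data as a stand-in for autompg (an assumption). It computes each VIF from its definition, \(VIF_j = 1/(1-R_j^2)\), and compares it against max(10, 1/(1 - R²)), a rule of thumb that is assumed here rather than taken from the exam.

```r
# Stand-in data: mtcars replaces autompg so the sketch runs on its own
fit <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all of the other predictors
predictors <- c("cyl", "disp", "hp", "wt")
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))

# Rule-of-thumb threshold (an assumption): max(10, 1 / (1 - R^2 of the model))
threshold <- max(10, 1 / (1 - summary(fit)$r.squared))
print(threshold)

# Predictors whose VIF exceeds the threshold
print(names(vif_manual)[vif_manual > threshold])
```

With the car package loaded, vif(model2) should give the same quantities for a model that has only single-column numeric predictors.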
Response to Question 4B:
Response to Question 4C:
model3 = glm(mpg~.,data=autompg,family = poisson)
summary(model3)
##
## Call:
## glm(formula = mpg ~ ., family = poisson, data = autompg)
##
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max
## -1.72128 -0.33322 -0.02383  0.29990  1.91492
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.555e+00  3.097e-01   5.021 5.15e-07 ***
## cylinders    -9.888e-03  2.403e-02  -0.412    0.681
## displacement  5.326e-04  5.640e-04   0.944    0.345
## horsepower   -1.336e-03  1.010e-03  -1.322    0.186
## weight       -2.908e-04  5.013e-05  -5.800 6.64e-09 ***
## acceleration  4.953e-03  6.516e-03   0.760    0.447
## modelyear     3.129e-02  3.392e-03   9.226  < 2e-16 ***
## origin        3.145e-02  1.723e-02   1.825    0.068 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 827.98 on 317 degrees of freedom
## Residual deviance: 109.49 on 310 degrees of freedom
## AIC: Inf
##
## Number of Fisher Scoring iterations: 4
summary(model2)
##
## Call:
## lm(formula = log(mpg) ~ log(cylinders) + log(displacement) +
## log(horsepower) + log(weight) + log(acceleration) + log(modelyear) +
## log(origin), data = autompg)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.39109 -0.06878  0.00174  0.06583  0.37946
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)       -0.47984    0.73426  -0.653    0.514
## log(cylinders)    -0.05833    0.06985  -0.835    0.404
## log(displacement)  0.01270    0.06467   0.196    0.845
## log(horsepower)   -0.27353    0.06197  -4.414 1.40e-05 ***
## log(weight)       -0.62338    0.09386  -6.642 1.39e-10 ***
## log(acceleration) -0.12563    0.06469  -1.942    0.053 .
## log(modelyear)     2.34575    0.14884  15.760  < 2e-16 ***
## log(origin)        0.03703    0.02108   1.756    0.080 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1167 on 310 degrees of freedom
## Multiple R-squared: 0.887, Adjusted R-squared: 0.8844
## F-statistic: 347.5 on 7 and 310 DF, p-value: < 2.2e-16
summary(model3)
##
## Call:
## glm(formula = mpg ~ ., family = poisson, data = autompg)
##
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max
## -1.72128 -0.33322 -0.02383  0.29990  1.91492
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.555e+00  3.097e-01   5.021 5.15e-07 ***
## cylinders    -9.888e-03  2.403e-02  -0.412    0.681
## displacement  5.326e-04  5.640e-04   0.944    0.345
## horsepower   -1.336e-03  1.010e-03  -1.322    0.186
## weight       -2.908e-04  5.013e-05  -5.800 6.64e-09 ***
## acceleration  4.953e-03  6.516e-03   0.760    0.447
## modelyear     3.129e-02  3.392e-03   9.226  < 2e-16 ***
## origin        3.145e-02  1.723e-02   1.825    0.068 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 827.98 on 317 degrees of freedom
## Residual deviance: 109.49 on 310 degrees of freedom
## AIC: Inf
##
## Number of Fisher Scoring iterations: 4
Response to Question 5B: Yes. For instance, horsepower changes from significant in model2 (p-value 1.40e-05 for log(horsepower)) to not significant in model3 (p-value 0.186) at the 0.05 level.
# Code to extract coefficient
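As a sketch of what this step might look like, the example below fits a Poisson model to simulated counts (hypothetical data, not the exam file), pulls out one coefficient with coef(), and exponentiates it. The variable names x and y are placeholders.

```r
set.seed(1)
# Simulated count data (hypothetical): y depends on x through a log link
x <- runif(200, 0, 2)
y <- rpois(200, lambda = exp(0.5 + 0.3 * x))
fit <- glm(y ~ x, family = poisson)

# Estimated coefficient on the log scale
b <- coef(fit)["x"]
print(b)

# Multiplicative change in the expected response per one-unit increase in x
rate_ratio <- exp(b)
print(rate_ratio)
```

Applied to model3, coef(model3)["weight"] would return the printed estimate -2.908e-04, and exp(-2.908e-04) is about 0.9997, that is, roughly a 0.03% decrease in the expected response per additional unit of weight, holding the other predictors fixed.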
Response to Question 6A
# Code to compute CI
# Code to exponentiate the above CI's endpoints
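The two steps above could be sketched as follows, again on simulated counts rather than the exam data. The 95% level and the choice of a Wald interval via confint.default() are assumptions, since the required level and coefficient are not stated here.

```r
set.seed(2)
# Simulated count data (hypothetical)
x <- runif(200, 0, 2)
y <- rpois(200, lambda = exp(0.5 + 0.3 * x))
fit <- glm(y ~ x, family = poisson)

# Wald interval on the log scale: estimate +/- z * standard error
ci_log <- confint.default(fit, parm = "x", level = 0.95)
print(ci_log)

# Exponentiate the endpoints to get an interval for the rate ratio
ci_ratio <- exp(ci_log)
print(ci_ratio)
```

If the exponentiated interval excludes 1, the corresponding coefficient differs significantly from 0 at the matching significance level.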
Response to Question 6B
# Code to estimate dispersion parameter
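Two common estimates of the dispersion parameter are sketched below on simulated Poisson counts (hypothetical data): the residual deviance divided by its degrees of freedom, and the Pearson chi-square statistic divided by the same degrees of freedom. Which of the two the exam expects is not stated, so both are shown.

```r
set.seed(3)
# Simulated count data (hypothetical)
x <- runif(300, 0, 2)
y <- rpois(300, lambda = exp(1 + 0.4 * x))
fit <- glm(y ~ x, family = poisson)

# Deviance-based estimate: residual deviance / residual degrees of freedom
phi_dev <- deviance(fit) / df.residual(fit)

# Pearson-based estimate: sum of squared Pearson residuals / residual df
phi_pearson <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

print(c(deviance = phi_dev, pearson = phi_pearson))
```

Using the values printed for model3, the deviance-based version would be 109.49 / 310, roughly 0.353.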
Response to Question 7A
# Code to perform GOF test
with(model3, cbind(res.deviance = deviance, df = df.residual,
                   p = pchisq(deviance, df.residual, lower.tail=FALSE)))
##      res.deviance  df p
## [1,]     109.4876 310 1
Response to Question 7B: The p-value of the deviance goodness-of-fit test is approximately 1, which is greater than 0.05. We therefore fail to reject the null hypothesis that the model fits the data well.
Is at least one of the variables displacement, cylinders and horsepower significant at the \(\alpha = 0.05\) level, given that all of the other variables in model3 are included? Perform a test for a subset of coefficients to answer this question. Explain your reasoning, including a statement of the null and alternative hypotheses.
# Code to perform the test
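One possible approach is a deviance (likelihood-ratio) test that compares the full model against a reduced model without the variables in question. The sketch below uses simulated counts with placeholder predictors x1, x2 and x3 (hypothetical names), where x2 and x3 play the role of the subset being tested.

```r
set.seed(4)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
# Only x1 truly affects the response in this simulation
y <- rpois(n, lambda = exp(1 + 0.5 * x1))

full    <- glm(y ~ x1 + x2 + x3, family = poisson)
reduced <- glm(y ~ x1, family = poisson)

# H0: the coefficients of x2 and x3 are both 0
# Ha: at least one of them is nonzero
test_stat <- deviance(reduced) - deviance(full)
df_diff   <- df.residual(reduced) - df.residual(full)
p_value   <- pchisq(test_stat, df = df_diff, lower.tail = FALSE)
print(c(statistic = test_stat, df = df_diff, p = p_value))

# The same test via anova()
print(anova(reduced, full, test = "Chisq"))
```

For the exam, the reduced model would presumably be model3 refitted without displacement, cylinders and horsepower, giving a test with 3 degrees of freedom. A p-value below 0.05 would lead to rejecting the null hypothesis that all three coefficients are zero.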
Response to Question 8
# Code to estimate using model2 (predictions are on the log(mpg) scale)
pred.test2= predict(model2,newdata=autompg_test)
head(pred.test2)
##        1        2        3        5        9       11
## 2.706045 2.615095 2.690498 2.712205 2.438422 2.647972
# Code to estimate using model3 (type="response" returns predictions on the mpg scale)
pred.test3= predict(model3,newdata=autompg_test,type="response")
head(pred.test3)
##        1        2        3        5        9       11
## 15.30727 14.11246 15.21521 15.19356 11.05172 14.70706
# Code to calculate PM for model2
sum((pred.test2-log(autompg_test$mpg))^2)/sum((log(autompg_test$mpg)-mean(log(autompg_test$mpg)))^2)
## [1] 0.104396
# Code to calculate PM for model3
sum((pred.test3-autompg_test$mpg)^2)/sum((autompg_test$mpg-mean(autompg_test$mpg))^2)
## [1] 0.1071959
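The two PM calculations above share one formula, which can be wrapped in a small helper. The sketch below uses a hypothetical function name, precision_measure, and made-up numbers (not exam data) to show how a log-scale model's PM could also be computed on the original scale by back-transforming its predictions with exp().

```r
# Prediction measure: squared prediction error relative to total variation
# (precision_measure is a hypothetical helper name)
precision_measure <- function(pred, obs) {
  sum((pred - obs)^2) / sum((obs - mean(obs))^2)
}

# Hypothetical observed values and log-scale predictions (not exam data)
obs      <- c(18, 25, 31, 15, 22)
pred_log <- log(c(17, 26, 29, 16, 23))

# PM on the log scale, as computed for model2 above
pm_log <- precision_measure(pred_log, log(obs))

# PM on the original scale after back-transforming with exp()
pm_orig <- precision_measure(exp(pred_log), obs)

print(c(log_scale = pm_log, original_scale = pm_orig))
```

Computing both models' PM on the mpg scale in this way would put the comparison on a common footing.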
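Response to Question 9C: Model 2 looks slightly better because its PM (0.1044) is lower than that of model 3 (0.1072). The difference is small, and model 2's PM is computed on the log(mpg) scale while model 3's is computed on the mpg scale, so the two values are not measured on the same scale.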