Instructions

This R Markdown file includes the questions, the empty code chunk sections for your code, and the text blocks for your responses. Answer the questions below by completing this R Markdown file. You may make slight adjustments to get the file to knit/convert but otherwise keep the formatting the same. Once you’ve finished answering the questions, submit your responses in a single knitted file as HTML only.

There are 9 questions in total, each worth between 3.5 and 9 points. Partial credit may be given if your code is correct but your conclusion is incorrect, or vice versa.

Next Steps:

  1. Save this .Rmd file in your R working directory, the same directory into which you will download the “auto-mpg.csv” data file. Having both files in the same directory makes it easier to read the “auto-mpg.csv” file.

  2. Read the question and create the R code necessary within the code chunk section immediately below each question. Knitting this file will generate the output and insert it into the section below the code chunk.

  3. Type your answer to the questions in the text block provided immediately after the response prompt.

  4. Once you’ve finished answering all questions, knit this file and submit the knitted file as HTML on Canvas.

Mock Example Question

This will be the exam question - each question is already copied from Canvas and inserted into individual text blocks below, you do not need to copy/paste the questions from the online Canvas exam.

# Example code chunk area. Enter your code below the comment

Mock Response to Example Question: This is the section where you type your written answers to the question. Depending on the question asked, your typed response may be a number, a list of variables, a few sentences, or a combination of these elements.

Ready? Let’s begin. We wish you the best of luck!

Recommended Packages

# Loading relevant libraries
library(car)
library(MASS)
library(aod)

Auto MPG Data Analysis

For this exam, you will be building a model to predict the fuel efficiency of a car (number of miles per gallon).

The “auto-mpg.csv” data set consists of the following variables:

Read the data and answer the questions below. Assume a significance threshold of 0.05 for hypothesis tests unless stated otherwise.

# Read the data set
autompg_full = read.csv('auto-mpg.csv',header=TRUE, sep = ",")

# Split the data into training and testing data
sample_size = floor(0.8*nrow(autompg_full))
set.seed(10)
idx = sample(seq_len(nrow(autompg_full)), size = sample_size)
autompg = autompg_full[idx, ]
autompg_test = autompg_full[-idx, ]

Note: For all of the following questions, use autompg as your data unless stated otherwise.

Question 1 - Exploratory Data Analysis

Create a histogram of the response variable mpg. Based on this plot, would it be appropriate to model this response variable using a Poisson regression? Explain.

hist(autompg$mpg)

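Response to Question 1: The histogram shows that mpg is not normally distributed, which might suggest a non-normal model. However, Poisson regression is designed for count data (non-negative integers), whereas mpg is a positive continuous measurement, so a Poisson model is not strictly appropriate here and would be at best an approximation.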

Question 2 - Multiple Linear Regression & Model fit

A) Create a linear regression model, called model1, with mpg as the response variable and all other variables as the predictors. Include an intercept. Display the summary table of the model. Note: Remember to use autompg as your data set.
model1 = lm(mpg~.,data=autompg)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ ., data = autompg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2801 -2.1088 -0.1009  1.8845 12.9594 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.841e+01  5.179e+00  -3.555 0.000437 ***
## cylinders    -4.332e-01  3.720e-01  -1.165 0.245107    
## displacement  2.130e-02  8.435e-03   2.526 0.012045 *  
## horsepower   -1.361e-02  1.499e-02  -0.908 0.364547    
## weight       -6.926e-03  7.655e-04  -9.047  < 2e-16 ***
## acceleration  1.630e-01  1.099e-01   1.484 0.138903    
## modelyear     7.569e-01  5.780e-02  13.094  < 2e-16 ***
## origin        1.327e+00  3.118e-01   4.257 2.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.439 on 310 degrees of freedom
## Multiple R-squared:  0.8142, Adjusted R-squared:   0.81 
## F-statistic: 194.1 on 7 and 310 DF,  p-value: < 2.2e-16

B) Is the overall regression significant at the \(\alpha=0.01\) level? Explain.

Response to Question 2B: Yes. The overall regression is significant at the \(\alpha=0.01\) level because the p-value of the F-test (< 2.2e-16) is far below 0.01, so we reject the null hypothesis that all slope coefficients are zero.

C) Using model1, calculate and plot the Cook’s distance for each point. Based on this plot, evaluate whether there are any concerning outliers. Explain your reasoning using a threshold of 1. Note: Do not remove any data points.
#Code to calculate Cook's distance and plot it
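A minimal sketch of one way to compute and plot Cook's distances with base R. So that it runs on its own, it fits a stand-in model on the built-in mtcars data (an assumption for illustration only, as are the names `fit`, `cook` and `flagged`); in this exam the same calls would be applied to model1.

```r
# Stand-in model on built-in data (assumption); substitute model1 in the exam
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Cook's distance for every observation in the fitted model
cook <- cooks.distance(fit)

# Needle plot of the distances with the threshold of 1 marked
plot(cook, type = "h", lwd = 2,
     xlab = "Observation index", ylab = "Cook's distance",
     ylim = c(0, max(1, max(cook))))
abline(h = 1, lty = 2, col = "red")

# Indices of observations whose distance exceeds the threshold (possibly none)
flagged <- which(cook > 1)
flagged
```

An empty `flagged` vector would mean that no observation exceeds the threshold of 1 under this rule.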

Response to Question 2C:

Question 3 - Model assumptions

Using model1, create and interpret the following plots to determine whether the multiple linear regression (MLR) assumptions hold. State which assumption(s) can be assessed by each plot and comment on whether there are any apparent departures from the assumptions and how you came to these conclusions.

• Plot of the standardized residuals, \(r_i\), versus the fitted values, \(\hat{y_i}\)

• Q-Q plot of the standardized residuals, \(r_i\)

# Standardized residuals versus fitted values
res = rstandard(model1)
plot(fitted(model1), res, xlab= "Fitted values", ylab= "Standardized residuals", pch= 19)
abline(h = 0)

# Code to plot the qqplot
qqnorm(res, ylab="Standardized residuals")
qqline(res, col="blue", lwd=2)

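Response to Question 3: The residual plot assesses the linearity (zero-mean error) and constant-variance assumptions, and the Q-Q plot assesses normality. The residuals show an increasing pattern rather than scattering randomly around 0, which violates the assumption that the errors have mean zero. In addition, based on the Q-Q plot, the residuals do not follow a normal distribution.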

Question 4 - Transformation of response variable - attempt to improve the fit

A) Perform a log transformation of the response variable and create a new linear regression model, model2, with the transformed mpg as the response variable and all other variables as the predictors. Include an intercept. Display the summary table.
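# Code to log transform and create model
# Note: this formula log-transforms every predictor as well as mpg (a log-log model);
# the question itself only asks for the response to be transformed, i.e. log(mpg) ~ .
model2 = lm(log(mpg)~log(cylinders)+log(displacement)+log(horsepower)+log(weight)+log(acceleration)+log(modelyear)+log(origin),data=autompg)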
summary(model2)
## 
## Call:
## lm(formula = log(mpg) ~ log(cylinders) + log(displacement) + 
##     log(horsepower) + log(weight) + log(acceleration) + log(modelyear) + 
##     log(origin), data = autompg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39109 -0.06878  0.00174  0.06583  0.37946 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.47984    0.73426  -0.653    0.514    
## log(cylinders)    -0.05833    0.06985  -0.835    0.404    
## log(displacement)  0.01270    0.06467   0.196    0.845    
## log(horsepower)   -0.27353    0.06197  -4.414 1.40e-05 ***
## log(weight)       -0.62338    0.09386  -6.642 1.39e-10 ***
## log(acceleration) -0.12563    0.06469  -1.942    0.053 .  
## log(modelyear)     2.34575    0.14884  15.760  < 2e-16 ***
## log(origin)        0.03703    0.02108   1.756    0.080 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1167 on 310 degrees of freedom
## Multiple R-squared:  0.887,  Adjusted R-squared:  0.8844 
## F-statistic: 347.5 on 7 and 310 DF,  p-value: < 2.2e-16
B) Calculate the variance inflation factor (VIF) for each predicting variable. What is the value of the VIF threshold \(\max(10, \frac{1}{1-R^2_{model2}})\) for this model?
#Code to calculate VIF



# VIF threshold
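A hedged sketch of the VIF calculation and the threshold rule. With the car package loaded above, `vif(model2)` would be the usual call; the version below uses only base R (the diagonal of the inverse correlation matrix of the predictors, which should agree with car for single-column numeric predictors) so that it runs on its own. The mtcars model and the helper name `vif_manual` are assumptions for illustration. For reference, the model2 summary reports \(R^2 = 0.887\), so \(\frac{1}{1-R^2} \approx 8.85\), which is below 10.

```r
# Hypothetical helper: VIFs from the inverse correlation matrix of the predictors
vif_manual <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  v <- diag(solve(cor(X)))                    # VIF_j = 1 / (1 - R_j^2)
  names(v) <- colnames(X)
  v
}

# Stand-in model on built-in data (assumption); substitute model2 in the exam
fit <- lm(log(mpg) ~ cyl + disp + hp + wt, data = mtcars)
vifs <- vif_manual(fit)

# Threshold rule from the question: max(10, 1 / (1 - R^2))
r2 <- summary(fit)$r.squared
threshold <- max(10, 1 / (1 - r2))

vifs
threshold
vifs > threshold  # TRUE marks predictors whose VIF exceeds the threshold
```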

Response to Question 4B:

C) Do any of the VIFs exceed the threshold? Based on these results, does multicollinearity seem to be a concern in this model?

Response to Question 4C:

Question 5 - Poisson Regression

A) Fit a Poisson regression model, called model3, with \(mpg\) as the response variable and all other variables as the predictors. Include an intercept. Display the summary table of the model.
model3 = glm(mpg~.,data=autompg,family = poisson)
summary(model3)
## 
## Call:
## glm(formula = mpg ~ ., family = poisson, data = autompg)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.72128  -0.33322  -0.02383   0.29990   1.91492  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.555e+00  3.097e-01   5.021 5.15e-07 ***
## cylinders    -9.888e-03  2.403e-02  -0.412    0.681    
## displacement  5.326e-04  5.640e-04   0.944    0.345    
## horsepower   -1.336e-03  1.010e-03  -1.322    0.186    
## weight       -2.908e-04  5.013e-05  -5.800 6.64e-09 ***
## acceleration  4.953e-03  6.516e-03   0.760    0.447    
## modelyear     3.129e-02  3.392e-03   9.226  < 2e-16 ***
## origin        3.145e-02  1.723e-02   1.825    0.068 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 827.98  on 317  degrees of freedom
## Residual deviance: 109.49  on 310  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 4
B) Examine the summary tables for model2 and model3. Is there any significant change in the direction and/or statistical significance of the regression coefficients? If so, list one change. Use the \(\alpha=0.01\) significance level.
summary(model2)
## 
## Call:
## lm(formula = log(mpg) ~ log(cylinders) + log(displacement) + 
##     log(horsepower) + log(weight) + log(acceleration) + log(modelyear) + 
##     log(origin), data = autompg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39109 -0.06878  0.00174  0.06583  0.37946 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.47984    0.73426  -0.653    0.514    
## log(cylinders)    -0.05833    0.06985  -0.835    0.404    
## log(displacement)  0.01270    0.06467   0.196    0.845    
## log(horsepower)   -0.27353    0.06197  -4.414 1.40e-05 ***
## log(weight)       -0.62338    0.09386  -6.642 1.39e-10 ***
## log(acceleration) -0.12563    0.06469  -1.942    0.053 .  
## log(modelyear)     2.34575    0.14884  15.760  < 2e-16 ***
## log(origin)        0.03703    0.02108   1.756    0.080 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1167 on 310 degrees of freedom
## Multiple R-squared:  0.887,  Adjusted R-squared:  0.8844 
## F-statistic: 347.5 on 7 and 310 DF,  p-value: < 2.2e-16
summary(model3)
## 
## Call:
## glm(formula = mpg ~ ., family = poisson, data = autompg)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.72128  -0.33322  -0.02383   0.29990   1.91492  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.555e+00  3.097e-01   5.021 5.15e-07 ***
## cylinders    -9.888e-03  2.403e-02  -0.412    0.681    
## displacement  5.326e-04  5.640e-04   0.944    0.345    
## horsepower   -1.336e-03  1.010e-03  -1.322    0.186    
## weight       -2.908e-04  5.013e-05  -5.800 6.64e-09 ***
## acceleration  4.953e-03  6.516e-03   0.760    0.447    
## modelyear     3.129e-02  3.392e-03   9.226  < 2e-16 ***
## origin        3.145e-02  1.723e-02   1.825    0.068 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 827.98  on 317  degrees of freedom
## Residual deviance: 109.49  on 310  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 4

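Response to Question 5B: Yes. For instance, horsepower changes from statistically significant in model2 (log(horsepower), p = 1.40e-05) to not significant in model3 (p = 0.186) at the \(\alpha=0.01\) level. Its estimated coefficient is negative in both models, so the direction does not change.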

Question 6 - Coefficient interpretation

A) Interpret the estimated coefficient for the predictor variable horsepower.
# Code to extract coefficient
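A minimal sketch of how a Poisson coefficient can be extracted and read. In a Poisson model with a log link, \(e^{\beta}\) is the multiplicative change in the expected response for a one-unit increase in the predictor, holding the other predictors fixed. The question does not name the model, so using model3 is an assumption; with the horsepower estimate printed in its summary (-1.336e-03), \(e^{\beta} \approx 0.9987\), roughly a 0.13% decrease in expected mpg per additional unit of horsepower. The stand-in model on mtcars below exists only so the code runs on its own.

```r
# Stand-in Poisson model on built-in count data (assumption); use model3 in the exam
fit <- glm(carb ~ hp + wt, family = poisson, data = mtcars)

# Coefficient on the log scale
beta_hp <- coef(fit)["hp"]

# Multiplicative effect on the expected response per one-unit increase in hp
rate_ratio <- exp(beta_hp)

# The same effect expressed as a percentage change
pct_change <- 100 * (rate_ratio - 1)

c(coefficient = unname(beta_hp),
  rate_ratio  = unname(rate_ratio),
  pct_change  = unname(pct_change))
```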

Response to Question 6A

B) Provide a 95% confidence interval for the cylinders coefficient in model3. Exponentiate the endpoints of the confidence interval you just created and interpret it in the context of the problem.
# Code to compute CI



# Code to exponentiate the above CI's endpoints
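One possible approach is a Wald interval, estimate ± \(z_{0.975}\) × standard error, followed by exponentiating both endpoints. Using the rounded values printed in the model3 summary for cylinders (estimate -9.888e-03, standard error 2.403e-02), this would give approximately (-0.057, 0.037) on the log scale and about (0.945, 1.038) after exponentiation, an interval that contains 1. A profile-likelihood interval from `confint()` may differ slightly. The sketch below uses a stand-in model on mtcars (an assumption) so that it runs on its own.

```r
# Stand-in Poisson model on built-in count data (assumption); use model3 in the exam
fit <- glm(carb ~ hp + wt, family = poisson, data = mtcars)

# Estimate and standard error for the coefficient of interest
coef_table <- coef(summary(fit))
est <- coef_table["hp", "Estimate"]
se  <- coef_table["hp", "Std. Error"]

# 95% Wald confidence interval on the log scale
ci <- est + c(-1, 1) * qnorm(0.975) * se
ci

# Exponentiated endpoints: interval for the multiplicative effect on the mean
exp(ci)
```

If the exponentiated interval contains 1, the data are consistent with no multiplicative effect of that predictor at the 5% level.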

Response to Question 6B

Question 7 - Goodness of Fit

A) Estimate the dispersion parameter for model3. Is this model overdispersed? Explain.
# Code to estimate dispersion parameter
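A hedged sketch of two common dispersion estimates: residual deviance divided by residual degrees of freedom, and the Pearson chi-square statistic divided by the same degrees of freedom. With the values printed in the model3 summary, the deviance-based estimate would be about 109.49 / 310 ≈ 0.35; values well above 1 are usually read as overdispersion. The stand-in model on mtcars is an assumption so that the code runs on its own.

```r
# Stand-in Poisson model on built-in count data (assumption); use model3 in the exam
fit <- glm(carb ~ hp + wt, family = poisson, data = mtcars)

# Deviance-based estimate: residual deviance / residual degrees of freedom
disp_deviance <- deviance(fit) / df.residual(fit)

# Pearson-based estimate: sum of squared Pearson residuals / residual df
disp_pearson <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

c(deviance = disp_deviance, pearson = disp_pearson)
```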

Response to Question 7A

B) Perform a goodness-of-fit hypothesis test for model3 using the deviance residuals and \(\alpha = 0.05\). State the hypotheses of this test. What do you conclude? Explain.
# Code to perform GOF test
with(model3, cbind(res.deviance= deviance, df = df.residual,
p = pchisq(deviance, df.residual, lower.tail=FALSE)))
##      res.deviance  df p
## [1,]     109.4876 310 1

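Response to Question 7B: The null hypothesis is that the Poisson model fits the data well; the alternative is that it does not. The residual deviance is 109.49 on 310 degrees of freedom, giving a p-value of approximately 1, which is greater than 0.05. We therefore fail to reject the null hypothesis, so there is no evidence of lack of fit.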

Question 8 - Subsets of coefficients

Is at least one of the variables displacement, cylinders and horsepower significant at the \(\alpha = 0.05\) level, given that all of the other variables in model3 are included? Perform a test for a subset of coefficients to answer this question. Explain your reasoning, including a statement of the null and alternative hypotheses.

# Code to perform the test
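A minimal sketch of a test for a subset of coefficients using the difference in deviance between a reduced and a full Poisson model. The null hypothesis is that all coefficients in the subset are zero; the alternative is that at least one is nonzero. Under the null, the deviance difference is approximately chi-square with degrees of freedom equal to the number of dropped coefficients. The mtcars models and variable names below are stand-ins (assumptions); in the exam the full model would be model3 and the reduced model would omit displacement, cylinders and horsepower. The aod package loaded above also offers `wald.test()` as a Wald-type alternative.

```r
# Stand-in full and reduced Poisson models on built-in data (assumption)
full    <- glm(carb ~ wt + hp + disp + cyl, family = poisson, data = mtcars)
reduced <- glm(carb ~ wt, family = poisson, data = mtcars)  # drops hp, disp, cyl

# Test statistic: drop in deviance when the three predictors are added
test_stat <- deviance(reduced) - deviance(full)

# Degrees of freedom: number of coefficients being tested
df_diff <- df.residual(reduced) - df.residual(full)

# Upper-tail chi-square p-value
p_value <- pchisq(test_stat, df = df_diff, lower.tail = FALSE)

c(statistic = test_stat, df = df_diff, p_value = p_value)
```

A p-value below 0.05 would lead to rejecting the null hypothesis, meaning at least one of the tested coefficients appears to be nonzero given the other predictors.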

Response to Question 8

Question 9 - Prediction

A) Estimate mpg using the autompg_test dataset using both model2 and model3. Use head() to show the first few predictions.
# Code to estimate using model2
# model2 is an lm fit, so the generic predict() is used; predictions are on the log(mpg) scale
pred.test2= predict(model2, newdata=autompg_test)
head(pred.test2)
##        1        2        3        5        9       11 
## 2.706045 2.615095 2.690498 2.712205 2.438422 2.647972
# Code to estimate using model3
pred.test3= predict(model3, newdata=autompg_test, type="response")
head(pred.test3)
##        1        2        3        5        9       11 
## 15.30727 14.11246 15.21521 15.19356 11.05172 14.70706
B) Calculate the precision error (PM) for each model.
# Code to calculate PM for model2
# Note: pred.test2 is on the log(mpg) scale, so this PM is computed on the log scale
sum((pred.test2-log(autompg_test$mpg))^2)/sum((log(autompg_test$mpg)-mean(log(autompg_test$mpg)))^2)
## [1] 0.104396
# Code to calculate PM for model3
sum((pred.test3-autompg_test$mpg)^2)/sum((autompg_test$mpg-mean(autompg_test$mpg))^2)
## [1] 0.1071959
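The two PM values above are computed on different scales (log(mpg) for model2, mpg for model3). A hedged sketch of how both could be put on the mpg scale: back-transform the log-scale predictions with `exp()` before computing \(PM = \frac{\sum (\hat{y}_i - y_i)^2}{\sum (y_i - \bar{y})^2}\). The helper `pm`, the split, and the mtcars stand-in model are assumptions for illustration, not the exam's required method.

```r
# Hypothetical helper: precision measure = SSE of predictions / total SS of observations
pm <- function(pred, obs) {
  sum((pred - obs)^2) / sum((obs - mean(obs))^2)
}

# Stand-in train/test split on built-in data (assumption)
set.seed(1)
idx   <- sample(seq_len(nrow(mtcars)), size = 24)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Stand-in model with a log-transformed response (plays the role of model2)
fit_log <- lm(log(mpg) ~ wt + hp, data = train)

# Back-transform predictions from the log scale to the mpg scale
pred_mpg <- exp(predict(fit_log, newdata = test))

# PM on the original mpg scale, comparable with a model fitted to mpg directly
pm_mpg_scale <- pm(pred_mpg, test$mpg)
pm_mpg_scale
```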
C) Compare the results from part B. Which model is better? Explain.

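Response to Question 9C: Model 2 appears slightly better because its PM is lower (0.1044 versus 0.1072 for model 3). Note that the PM for model 2 was computed on the log(mpg) scale and the PM for model 3 on the mpg scale, so the comparison is only approximate.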