ISYE6414 - Midterm Exam 2 - Open Book Section (R)

Auto MPG Data Analysis

For this exam, you will be building a model to predict the fuel efficiency of a car (number of miles per gallon).

The “auto-mpg.csv” data set consists of the following variables:

mpg: number of miles per gallon
cylinders: number of cylinders in the engine
displacement: displacement of the engine
horsepower: horsepower output by the car
weight: weight of the car
acceleration: acceleration of the car
modelyear: year in which car was released
origin: origin of the car

Read the data and answer the questions below. Assume a significance threshold of 0.05 for hypothesis tests unless stated otherwise.

# Read the data set
autompg_full = read.csv('auto-mpg.csv',header=TRUE, sep = ",")

# Split the data into training and testing data
sample_size = floor(0.8*nrow(autompg_full))
set.seed(10)
idx = sample(seq_len(nrow(autompg_full)), size = sample_size)
autompg = autompg_full[idx, ]
autompg_test = autompg_full[-idx, ]

Note: For all of the following questions, use autompg as your data unless stated otherwise.

Question 1 - Exploratory Data Analysis

Create a histogram of the response variable mpg. Based on this plot, would it be appropriate to model this response variable using a Poisson regression? Explain.

hist(autompg$mpg)

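Response to Question 1: No. Poisson regression is intended for count data (non-negative integers), whereas mpg is a continuous measurement. A response that does not look normally distributed is not, by itself, a reason to use a Poisson model.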
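A quick way to check whether a response is count-like is to test whether every value is a non-negative whole number. The sketch below uses the built-in mtcars data as a stand-in (an assumption; it is not the exam's auto-mpg.csv), and the name is_count_like is purely illustrative.

```r
# Stand-in data: the built-in mtcars data set, NOT the exam's auto-mpg.csv.
y <- mtcars$mpg

# A Poisson response should consist of non-negative whole numbers.
is_count_like <- all(y >= 0) && all(y == round(y))
is_count_like

# Share of observations that are not whole numbers.
mean(y != round(y))
```

With the exam data, the same check would be applied to autompg$mpg. A non-integer response is also consistent with the "AIC: Inf" line that appears in the Poisson summary later in this document.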

Question 2 - Multiple Linear Regression & Model fit

Create a linear regression model, called model1, with mpg as the response variable and all other variables as the predictors. Include an intercept. Display the summary table of the model. Note: Remember to use autompg as your data set.

model1 = lm(mpg~.,data=autompg)
summary(model1)

## 
## Call:
## lm(formula = mpg ~ ., data = autompg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2801 -2.1088 -0.1009  1.8845 12.9594 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.841e+01  5.179e+00  -3.555 0.000437 ***
## cylinders    -4.332e-01  3.720e-01  -1.165 0.245107    
## displacement  2.130e-02  8.435e-03   2.526 0.012045 *  
## horsepower   -1.361e-02  1.499e-02  -0.908 0.364547    
## weight       -6.926e-03  7.655e-04  -9.047  < 2e-16 ***
## acceleration  1.630e-01  1.099e-01   1.484 0.138903    
## modelyear     7.569e-01  5.780e-02  13.094  < 2e-16 ***
## origin        1.327e+00  3.118e-01   4.257 2.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.439 on 310 degrees of freedom
## Multiple R-squared:  0.8142, Adjusted R-squared:   0.81 
## F-statistic: 194.1 on 7 and 310 DF,  p-value: < 2.2e-16

B) Is the overall regression significant at the \(\alpha=0.01\) level? Explain.

Response to Question 2B: Yes, the overall regression is significant at the \(\alpha=0.01\) level. The p-value of the F-statistic (< 2.2e-16) is far below 0.01, so we reject the null hypothesis that all slope coefficients are zero.
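As a rough check, the p-value of the overall F-test can be recomputed from the F-statistic and degrees of freedom printed in the model1 summary above. The variable names below are illustrative only.

```r
# Values copied from the model1 summary above.
f_stat <- 194.1
df_num <- 7
df_den <- 310

# Upper-tail probability of the F distribution.
p_value <- pf(f_stat, df_num, df_den, lower.tail = FALSE)
p_value < 0.01
```

With the fitted object available, the same three numbers can be read from summary(model1)$fstatistic.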

Using model1, calculate and plot the Cook’s distance for each point. Based on this plot, evaluate whether there are any concerning outliers. Explain your reasoning using a threshold of 1. Note: Do not remove any data points

#Code to calculate Cook's distance and plot it
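A minimal sketch of the Cook's distance check, assuming the built-in mtcars data and an arbitrary set of predictors as a stand-in for autompg and model1. It does not reproduce the exam's results; with the exam's objects the analogous call would be cooks.distance(model1).

```r
# Stand-in model on built-in data (NOT the exam's model1).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Cook's distance for every observation.
cook <- cooks.distance(fit)

# Needle plot with the threshold of 1 marked.
plot(cook, type = "h", lwd = 2,
     ylim = c(0, max(1, max(cook))),
     xlab = "Observation index", ylab = "Cook's distance")
abline(h = 1, lty = 2, col = "red")

# Observations exceeding the threshold (possibly none).
flagged <- which(cook > 1)
flagged
```

Points with a Cook's distance above 1 would be the ones to discuss as potentially influential; if none exceed it, no observation is flagged under this rule.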

Response to Question 2C:

Question 3 - Model assumptions

Using model1, create and interpret the following plots to determine whether the multiple linear regression (MLR) assumptions hold. State which assumption(s) can be assessed by each plot and comment on whether there are any apparent departures from the assumptions and how you came to these conclusions.

• Plot of the standardized residuals, \(r_i\), versus the fitted values, \(\hat{y}_i\)

• Q-Q plot of the standardized residuals, \(r_i\)

# Standardized residuals versus fitted values
res = rstandard(model1)
plot(fitted(model1), res, xlab= "Fitted values", ylab= "Standardized residuals", pch= 19)
abline(h = 0)

#Code to plot the qqplot
qqnorm(res, ylab="Std residuals")
qqline(res, col="blue", lwd=2)

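Response to Question 3: The residuals show a positive trend instead of scattering randomly around zero, which suggests a violation of the linearity assumption (the errors should have mean zero); this plot is also the one used to assess constant variance. In addition, the Q-Q plot shows that the residuals do not follow a normal distribution, so the normality assumption does not appear to hold.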

Question 4 - Transformation of response variable - attempt to improve the fit

Perform a log transformation of the response variable and create a new linear regression model model2, with the transformed mpg as the response variable and all other variables as the predictors. Include an intercept. Display the summary table.

# Log-transform the response and fit model2 (note: this call also log-transforms every predictor)
model2 = lm(log(mpg)~log(cylinders)+log(displacement)+log(horsepower)+log(weight)+log(acceleration)+log(modelyear)+log(origin),data=autompg)
summary(model2)

## 
## Call:
## lm(formula = log(mpg) ~ log(cylinders) + log(displacement) + 
##     log(horsepower) + log(weight) + log(acceleration) + log(modelyear) + 
##     log(origin), data = autompg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39109 -0.06878  0.00174  0.06583  0.37946 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.47984    0.73426  -0.653    0.514    
## log(cylinders)    -0.05833    0.06985  -0.835    0.404    
## log(displacement)  0.01270    0.06467   0.196    0.845    
## log(horsepower)   -0.27353    0.06197  -4.414 1.40e-05 ***
## log(weight)       -0.62338    0.09386  -6.642 1.39e-10 ***
## log(acceleration) -0.12563    0.06469  -1.942    0.053 .  
## log(modelyear)     2.34575    0.14884  15.760  < 2e-16 ***
## log(origin)        0.03703    0.02108   1.756    0.080 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1167 on 310 degrees of freedom
## Multiple R-squared:  0.887,  Adjusted R-squared:  0.8844 
## F-statistic: 347.5 on 7 and 310 DF,  p-value: < 2.2e-16

Calculate the variance inflation factor (VIF) for each predicting variable. What is the value of the VIF threshold \(\max(10, \frac{1}{1-R^2_{model2}})\) for this model?
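The threshold can be worked out from the Multiple R-squared printed in the model2 summary above. The VIFs themselves need the data, so the second half of this sketch computes them from their definition on the built-in mtcars data as a stand-in (an assumption; it does not reproduce the exam's VIFs). The predictor set and variable names are illustrative.

```r
# Threshold, using the Multiple R-squared reported in the model2 summary above.
r2_model2 <- 0.887
vif_threshold <- max(10, 1 / (1 - r2_model2))
vif_threshold  # 1 / (1 - 0.887) is about 8.85, so the maximum is 10

# VIF from its definition on stand-in data (built-in mtcars, NOT auto-mpg.csv):
# regress each predictor on the others and compute 1 / (1 - R^2_j).
predictors <- c("wt", "hp", "disp", "qsec")
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  fit_j <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(fit_j)$r.squared)
})
vif_manual

# Which predictors exceed the threshold?
vif_manual > vif_threshold
```

With the exam's objects, the vif() function from the car package applied to model2 is the usual shortcut, and each resulting value would be compared against the threshold of 10.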

#Code to calculate VIF



# VIF threshold

Response to Question 4B:

Do any of the VIFs exceed the threshold? Based on these results, does multicollinearity seem to be a concern in this model?

Response to Question 4C:

Question 5 - Poisson Regression

Fit a Poisson regression model, called model3, with \(mpg\) as the response variable and all other variables as the predictors. Include an intercept. Display the summary table of the model.

model3 = glm(mpg~.,data=autompg,family = poisson)
summary(model3)

## 
## Call:
## glm(formula = mpg ~ ., family = poisson, data = autompg)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.72128  -0.33322  -0.02383   0.29990   1.91492  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.555e+00  3.097e-01   5.021 5.15e-07 ***
## cylinders    -9.888e-03  2.403e-02  -0.412    0.681    
## displacement  5.326e-04  5.640e-04   0.944    0.345    
## horsepower   -1.336e-03  1.010e-03  -1.322    0.186    
## weight       -2.908e-04  5.013e-05  -5.800 6.64e-09 ***
## acceleration  4.953e-03  6.516e-03   0.760    0.447    
## modelyear     3.129e-02  3.392e-03   9.226  < 2e-16 ***
## origin        3.145e-02  1.723e-02   1.825    0.068 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 827.98  on 317  degrees of freedom
## Residual deviance: 109.49  on 310  degrees of freedom
## AIC: Inf
## 
## Number of Fisher Scoring iterations: 4

Examine the summary tables for model2 and model3. Is there any significant change in the direction and/or statistical significance of the regression coefficients? If so, list one change. Use \(\alpha=0.01\) significance level.

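Response to Question 5B: Yes. For instance, horsepower changes from statistically significant in model2 (p = 1.40e-05 for log(horsepower)) to not significant in model3 (p = 0.186) at the \(\alpha=0.01\) level. The direction of its estimated effect (negative) is the same in both models.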

Question 6 - Coefficient interpretation

Interpret the estimated coefficients for the predictor variable horsepower
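The question does not say which model is meant, so the sketch below converts the horsepower estimates printed in both summary tables into percentage changes. It only does arithmetic on the reported coefficients; the variable names are illustrative.

```r
# Estimates copied from the summary tables above.
b_hp_model2 <- -0.27353    # coefficient of log(horsepower) in model2
b_hp_model3 <- -1.336e-03  # coefficient of horsepower in model3

# model2 (log response, log predictor): percent change in mpg
# associated with a 1% increase in horsepower.
pct_change_model2 <- (1.01^b_hp_model2 - 1) * 100

# model3 (Poisson, log link): multiplicative change in expected mpg
# for one additional unit of horsepower, and the same as a percentage.
rate_ratio_model3 <- exp(b_hp_model3)
pct_change_model3 <- (rate_ratio_model3 - 1) * 100

round(c(model2 = pct_change_model2, model3 = pct_change_model3), 3)
```

Both figures are read as holding all other predictors fixed. In model2 a 1% increase in horsepower is associated with roughly a 0.27% decrease in mpg; in model3 one extra unit of horsepower multiplies the expected mpg by about 0.9987. With the fitted objects, the estimates could be extracted with coef(model2) and coef(model3) instead of being typed in.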

# Code to extract coefficient

Response to Question 6A

Provide a 95% confidence interval for cylinders coefficient in model3. Exponentiate the endpoints of the confidence interval you just created and interpret it in the context of the problem.
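A minimal sketch of a Wald-type 95% interval, built from the cylinders estimate and standard error printed in the model3 summary above. This is one possible construction and an assumption on my part: confint(model3) may instead return profile-likelihood limits, which can differ slightly, while confint.default(model3) gives the Wald form.

```r
# Estimate and standard error copied from the model3 summary above.
est <- -9.888e-03
se  <- 2.403e-02

# Wald interval on the log scale: estimate +/- z * SE.
z  <- qnorm(0.975)
ci <- c(lower = est - z * se, upper = est + z * se)
ci

# Exponentiated endpoints: multiplicative change in expected mpg
# per additional cylinder, holding the other predictors fixed.
exp(ci)
```

The interval on the log scale contains 0, so the exponentiated interval contains 1: the data are compatible with one extra cylinder either slightly lowering or slightly raising expected mpg. This agrees with the cylinders p-value of 0.681 in the summary table.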

# Code to compute CI



# Code to exponentiate the above CI's endpoints

Response to Question 6B

Question 7 - Goodness of Fit

Estimate the dispersion parameter for model3. Is this model overdispersed? Explain
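One common estimate of the dispersion parameter is the residual deviance divided by its degrees of freedom. The sketch below applies it to the values printed in the model3 summary above. The cutoff of 2 for flagging overdispersion is a rule of thumb assumed here, not something stated in this document.

```r
# Values copied from the model3 summary above.
resid_deviance <- 109.49
resid_df       <- 310

# Deviance-based dispersion estimate.
phi_hat <- resid_deviance / resid_df
phi_hat

# Rule-of-thumb check (assumed cutoff of 2).
phi_hat > 2
```

With the fitted object, the same quantity is deviance(model3) / df.residual(model3). An estimate well below 1, as here, points away from overdispersion.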

# Code to estimate dispersion parameter

Response to Question 7A

Perform goodness of fit hypothesis test for model3 using the deviance residuals and \(\alpha = 0.05\). State the hypothesis of this test. What do you conclude? Explain.

# Code to perform GOF test
with(model3, cbind(res.deviance= deviance, df = df.residual,
p = pchisq(deviance, df.residual, lower.tail=FALSE)))

##      res.deviance  df p
## [1,]     109.4876 310 1

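Response to Question 7B: The null hypothesis is that the Poisson model fits the data well; the alternative is that it does not. The deviance test gives a p-value of approximately 1, which is greater than 0.05, so we fail to reject the null hypothesis. There is no evidence of lack of fit.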

Question 8 - Subsets of coefficients

Is at least one of the variables displacement, cylinders and horsepower significant at the \(\alpha = 0.05\) level, given that all of the other variables in model3 are included? Perform a test for a subset of coefficients to answer this question. Explain your reasoning, including a statement of the null and alternative hypotheses.
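A test for a subset of coefficients in a Poisson model compares the full model with a reduced model that drops the variables in question, using the difference in deviances. The null hypothesis is that the coefficients of displacement, cylinders and horsepower are all zero; the alternative is that at least one is nonzero. The sketch below runs the procedure on simulated data (an assumption: the data frame sim and its values are made up and only borrow the variable names, so the result says nothing about the exam data).

```r
# Simulated stand-in data; only weight truly affects the response.
set.seed(42)
n <- 300
sim <- data.frame(
  weight       = rnorm(n),
  displacement = rnorm(n),
  cylinders    = rnorm(n),
  horsepower   = rnorm(n)
)
sim$y <- rpois(n, lambda = exp(1 + 0.4 * sim$weight))

# Full model and reduced model without the three tested variables.
full    <- glm(y ~ weight + displacement + cylinders + horsepower,
               family = poisson, data = sim)
reduced <- glm(y ~ weight, family = poisson, data = sim)

# Deviance difference is approximately chi-squared under the null,
# with df equal to the number of dropped coefficients.
dev_diff <- deviance(reduced) - deviance(full)
df_diff  <- df.residual(reduced) - df.residual(full)
p_value  <- pchisq(dev_diff, df_diff, lower.tail = FALSE)
c(deviance_difference = dev_diff, df = df_diff, p_value = p_value)

# The same test via anova().
anova(reduced, full, test = "Chisq")
```

For the exam, the reduced model would be model3 refitted without displacement, cylinders and horsepower, and the null hypothesis is rejected when the p-value is below 0.05.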

# Code to perform the test

Response to Question 8

Question 9 - Prediction

Estimate mpg using the autompg_test dataset using both model2 and model3. Use head() to show the first few predictions.

# Code to estimate using model2 (predictions are on the log(mpg) scale; exp() converts them back to mpg)
pred.test2= predict(model2,autompg_test)
head(pred.test2)

##        1        2        3        5        9       11 
## 2.706045 2.615095 2.690498 2.712205 2.438422 2.647972

# Code to estimate using model3
pred.test3= predict(model3,autompg_test,type="response")
head(pred.test3)

##        1        2        3        5        9       11 
## 15.30727 14.11246 15.21521 15.19356 11.05172 14.70706

Calculate the precision error (PM) for each model
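PM is the sum of squared prediction errors divided by the sum of squared deviations of the observed values from their mean, so smaller is better. The helper below is a sketch of that formula; the function name precision_measure and the toy vector obs are hypothetical and not part of the exam.

```r
# Hypothetical helper implementing the PM formula.
precision_measure <- function(pred, obs) {
  sum((pred - obs)^2) / sum((obs - mean(obs))^2)
}

# Toy observed values, made up for illustration.
obs <- c(18, 25, 31, 22, 15)

# Perfect predictions give PM = 0.
precision_measure(obs, obs)

# Predicting the mean for every observation gives PM = 1.
precision_measure(rep(mean(obs), length(obs)), obs)
```

A PM near 0 means the predictions explain most of the variation in the test responses, while a PM near 1 means they do no better than predicting the mean.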

# Code to calculate PM for model2
sum((pred.test2-log(autompg_test$mpg))^2)/sum((log(autompg_test$mpg)-mean(log(autompg_test$mpg)))^2)

## [1] 0.104396

# Code to calculate PM for model3
sum((pred.test3-autompg_test$mpg)^2)/sum((autompg_test$mpg-mean(autompg_test$mpg))^2)

## [1] 0.1071959

Compare the results from part B. Which model is better? Explain.

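Response to Question 9C: model2 looks slightly better because it has the lower PM (0.1044 versus 0.1072), although the two values are close and the PM for model2 is computed on the log scale.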

ISYE6414 - Midterm Exam 2 - Open Book Section (R) - Part 2

Instructions

Mock Example Question