Peer Grader Guidance

Please review the student expectations for peer review grading and peer review comments. Overall, we ask that you score with accuracy. When grading your peers, you will not only learn how to improve your future homework submissions but you will also gain deeper understanding of the concepts in the assignments. When assigning scores, consider the responses to the questions given your understanding of the problem and using the solutions as a guide. Moreover, please give partial credit for a concerted effort, but also be thorough. Add comments to your review, particularly when deducting points, to explain why the student missed the points. Ensure your comments are specific to questions and the student responses in the assignment.

Background

The provided dataset is a subset of the public data from the 2022 EPA Automotive Trends Report. It will be used to study the effects of various vehicle characteristics on CO2 emissions.

Data Description

The dataset consists of a dataframe with 2060 observations with the following 7 variables:

Model.Year: year the vehicle model was produced (quantitative)
Type: vehicle type (qualitative)
MPG: miles per gallon of fuel (quantitative)
Weight: vehicle weight in lbs (quantitative)
Horsepower: vehicle horsepower in HP (quantitative)
Acceleration: acceleration time (from 0 to 60 mph) in seconds (quantitative)
CO2: carbon dioxide emissions in g/mi (response variable)

Instructions on reading the data

To read the data in R, save the file in your working directory (change the working directory first if the file is stored elsewhere) and read it with the R function read.csv().

# reading the dataset
data <- read.csv("vehicle_CO2_emis.csv")
head(data,3)

##   Model.Year Type Weight Horsepower Acceleration      MPG      CO2
## 1       1995  SUV   3500         94      13.2003 20.99489 423.2935
## 2       1996  SUV   3500        130      11.1597 20.29168 437.9628
## 3       1997  SUV   3500        130      11.7893 18.92057 469.7003

Question 1: Exploratory Data Analysis [15 points]

3 pts Compute the correlation coefficient for each quantitative predicting variable (Model.Year, MPG, Weight, Horsepower, Acceleration) against the response (CO2). Describe the strength and direction of the top 3 predicting variables that have the strongest linear relationships with the response.

cor(data$Model.Year, data$CO2)

## [1] -0.4794969

cor(data$MPG, data$CO2)

## [1] -0.9598317

cor(data$Weight, data$CO2)

## [1] 0.5163332

cor(data$Horsepower, data$CO2)

## [1] 0.007126864

cor(data$Acceleration, data$CO2)

## [1] 0.3117955

Q1a ANSWER: The three predicting variables with the strongest linear relationships with the response are, in order of strength, MPG (-0.9598317), Weight (0.5163332), and Model.Year (-0.4794969). MPG has a very strong negative correlation with CO2, Weight has a moderate positive correlation, and Model.Year has a moderate negative correlation. That is, higher values of MPG or Model.Year are associated with lower CO2, while higher Weight is associated with higher CO2.
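As an illustration, the ranking can be checked by sorting the five reported correlations by absolute value. The vector below simply re-enters the values printed above; the names cors, ranked, and top3 are hypothetical helpers and not part of the original solution.

```r
# Correlations with CO2 as reported above (copied by hand, not recomputed)
cors <- c(Model.Year = -0.4794969, MPG = -0.9598317, Weight = 0.5163332,
          Horsepower = 0.007126864, Acceleration = 0.3117955)

# Rank by strength (absolute value); the sign is kept to show direction
ranked <- cors[order(abs(cors), decreasing = TRUE)]
top3 <- head(ranked, 3)
print(top3)
```

With the full dataset loaded, a call such as sapply(data[, c("Model.Year", "MPG", "Weight", "Horsepower", "Acceleration")], cor, y = data$CO2) should produce the same vector in one step.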

3 pts Create a boxplot of the qualitative predicting variable (Type) versus the response (CO2). Explain the relationship between the two variables.

# convert Type to a factor and keep a standalone copy for the later model formulas
data$Type <- as.factor(data$Type)
Type <- data$Type
boxplot(data$CO2 ~ Type, main = "CO2 vs Type", xlab = "Type", ylab = "CO2")

Q1b ANSWER: The median CO2 values for SUV, Truck, and Van appear very close, whereas the median CO2 for Sedan appears lower. The CO2 range for SUV seems to be the widest among the types. Thus, there appears to be a relationship between Type and CO2, with Sedans generally producing lower CO2 emissions.

6 pts Create scatterplots of the response (CO2) against each quantitative predicting variable (Model.Year, MPG, Weight, Horsepower, Acceleration). Describe the general trend of each plot.

# each scatterplot is overlaid with the simple linear regression line
plot(data$Model.Year, data$CO2, main = "CO2 vs Model.Year", xlab = "Model.Year", ylab = "CO2")
abline(lm(CO2 ~ Model.Year, data = data), col = "blue")

plot(data$MPG, data$CO2, main = "CO2 vs MPG", xlab = "MPG", ylab = "CO2")
abline(lm(CO2 ~ MPG, data = data), col = "blue")

plot(data$Weight, data$CO2, main = "CO2 vs Weight", xlab = "Weight", ylab = "CO2")
abline(lm(CO2 ~ Weight, data = data), col = "blue")

plot(data$Horsepower, data$CO2, main = "CO2 vs Horsepower", xlab = "Horsepower", ylab = "CO2")
abline(lm(CO2 ~ Horsepower, data = data), col = "blue")

plot(data$Acceleration, data$CO2, main = "CO2 vs Acceleration", xlab = "Acceleration", ylab = "CO2")
abline(lm(CO2 ~ Acceleration, data = data), col = "blue")

Q1c ANSWER: CO2 vs Model.Year and CO2 vs MPG both show a negative (inverse) trend, with the CO2 vs MPG relationship much stronger than CO2 vs Model.Year. CO2 vs Weight and CO2 vs Acceleration both show a positive trend, with CO2 vs Weight stronger than CO2 vs Acceleration. The fitted line for CO2 vs Horsepower is nearly flat, indicating little to no marginal linear relationship.

3 pts Based on this exploratory analysis, is it reasonable to fit a multiple linear regression model for the relationship between CO2 and the predicting variables? Explain how you determined the answer.

Q1d ANSWER: Yes, it is reasonable to fit a multiple linear regression model for the relationship between CO2 and the predicting variables. Several predictors (notably MPG, Weight, and Model.Year) show a clear linear association with CO2 in the correlations and scatterplots, and CO2 also differs by Type. Some variables seem to have more explanatory power than others, so further analysis is needed to determine their significance.

Question 2: Model Fitting and Interpretation [26 points]

3 pts Fit a multiple linear regression model called model1 using CO2 as the response and the top 3 predicting variables with the strongest relationship with CO2, from Question 1a.

model1 = lm(data$CO2 ~ data$Model.Year + data$MPG + data$Weight, data = data)
summary(model1)

## 
## Call:
## lm(formula = data$CO2 ~ data$Model.Year + data$MPG + data$Weight, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.891 -12.833  -5.383   5.787 262.204 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.764e+03  1.617e+02   23.27   <2e-16 ***
## data$Model.Year -1.547e+00  8.582e-02  -18.03   <2e-16 ***
## data$MPG        -1.587e+01  2.653e-01  -59.83   <2e-16 ***
## data$Weight      2.747e-02  1.658e-03   16.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.55 on 2056 degrees of freedom
## Multiple R-squared:  0.9322, Adjusted R-squared:  0.9321 
## F-statistic:  9429 on 3 and 2056 DF,  p-value: < 2.2e-16

4 pts What is the estimated coefficient for the intercept? Interpret this coefficient in the context of the dataset.

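Q2b ANSWER: The estimated coefficient for the intercept is 3764 (3.764e+03). It is the expected CO2 emission, in g/mi, for a vehicle with Model.Year, MPG, and Weight all equal to 0. Such values lie far outside the range of the data, so the intercept is an extrapolation and should not be read as a realistic baseline level of emissions.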

4 pts Assuming a marginal relationship between Type and CO2, perform an ANOVA F-test on the mean CO2 emission among the different vehicle types. Using an \(\alpha\)-level of 0.05, is Type useful in predicting CO2? Explain how you determined the answer.

TypeAnova = aov(data$CO2 ~ Type)
summary(TypeAnova)

##               Df   Sum Sq Mean Sq F value Pr(>F)    
## Type           3  3802499 1267500     200 <2e-16 ***
## Residuals   2056 13032369    6339                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model.tables(TypeAnova, type = "means")

## Tables of means
## Grand mean
##          
## 433.3347 
## 
##  Type 
##     Sedan   SUV Truck   Van
##     364.1 451.8 462.8 471.9
## rep 562.0 722.0 473.0 303.0

Q2c ANSWER: Since the p-value of the F-test is very small (<2e-16) and below 0.05, we reject the null hypothesis that the mean CO2 emissions are equal across all vehicle types. At least one Type has a mean that differs from another, so Type is useful in predicting CO2.
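The F value in the table can be retraced from the sums of squares and degrees of freedom. This is a minimal sketch that re-enters the numbers printed above; the variable names are hypothetical.

```r
# Values copied from the ANOVA table above
ss_type  <- 3802499;  df_type  <- 3
ss_resid <- 13032369; df_resid <- 2056

# Mean squares are sums of squares divided by their degrees of freedom
ms_type  <- ss_type / df_type
ms_resid <- ss_resid / df_resid

# F statistic and its upper-tail p-value under the null of equal means
f_stat  <- ms_type / ms_resid
p_value <- pf(f_stat, df_type, df_resid, lower.tail = FALSE)
reject_null <- p_value < 0.05
```

The computed f_stat should agree with the F value of 200 shown in the table up to rounding.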

3 pts Fit a multiple linear regression model called model2 using CO2 as the response and all predicting variables. Using \(\alpha = 0.05\), which of the estimated coefficients that were statistically significant in model1 are also statistically significant in model2?

model2 = lm(data$CO2 ~ data$Model.Year + data$MPG + data$Weight + data$Horsepower + data$Acceleration + Type, data = data)
summary(model2)

## 
## Call:
## lm(formula = data$CO2 ~ data$Model.Year + data$MPG + data$Weight + 
##     data$Horsepower + data$Acceleration + Type, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.674 -11.644  -3.872   6.250 248.771 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.738e+03  2.437e+02   7.129 1.39e-12 ***
## data$Model.Year   -5.280e-01  1.245e-01  -4.239 2.34e-05 ***
## data$MPG          -1.684e+01  2.777e-01 -60.649  < 2e-16 ***
## data$Weight        4.586e-02  2.014e-03  22.774  < 2e-16 ***
## data$Horsepower   -3.169e-01  2.511e-02 -12.623  < 2e-16 ***
## data$Acceleration  3.522e-01  4.744e-01   0.742    0.458    
## TypeSUV           -9.438e+00  1.676e+00  -5.632 2.03e-08 ***
## TypeTruck         -1.891e+01  1.853e+00 -10.203  < 2e-16 ***
## TypeVan           -2.440e+01  2.153e+00 -11.333  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.88 on 2051 degrees of freedom
## Multiple R-squared:  0.9417, Adjusted R-squared:  0.9414 
## F-statistic:  4139 on 8 and 2051 DF,  p-value: < 2.2e-16

Q2d ANSWER: Using \(\alpha = 0.05\), all coefficients that were statistically significant in model1 are also statistically significant in model2. That is, Model.Year, MPG, and Weight (as well as the intercept) remain statistically significant in both models, as their p-values are less than 0.05.

4 pts Interpret the estimated coefficient for TypeVan in the context of the dataset. Make sure TypeSedan is the baseline level for Type. Mention any assumptions you make about other predictors clearly when stating the interpretation.

Q2e ANSWER: The estimated coefficient for TypeVan is -24.4, with Sedan as the baseline level of Type. This means that, holding Model.Year, MPG, Weight, Horsepower, and Acceleration fixed, a Van is estimated to emit 24.4 g/mi less CO2 on average than a Sedan.
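A dummy-variable coefficient is a difference relative to the baseline level, not a predicted value. The sketch below first sets the baseline explicitly with relevel(), since the default level order depends on how the labels sort, and then shows that two otherwise identical vehicles differ by exactly the TypeVan coefficient. The coefficients are the rounded values from the summary above, the vehicle characteristics are borrowed from Question 4, and the names type, b, x_sedan, and x_van are hypothetical.

```r
# Make the baseline explicit instead of relying on the default level ordering
type <- factor(c("SUV", "Sedan", "Truck", "Van"))
type <- relevel(type, ref = "Sedan")
levels(type)[1]

# Rounded model2 coefficients copied from the summary above
b <- c(intercept = 1738, Model.Year = -0.528, MPG = -16.84, Weight = 0.04586,
       Horsepower = -0.3169, Acceleration = 0.3522, TypeVan = -24.40)

# Same vehicle twice; only the TypeVan dummy (last entry) differs
x_sedan <- c(1, 2020, 32, 3400, 203, 8, 0)
x_van   <- c(1, 2020, 32, 3400, 203, 8, 1)

# Difference in predicted CO2 between the Van and the Sedan
diff_van_sedan <- sum(b * x_van) - sum(b * x_sedan)
```

The intercept cancels in the difference, which is why it plays no role in interpreting TypeVan.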

4 pts How does your interpretation of TypeVan above compare to the relationship between CO2 vs Type analyzed using the boxplot in Q1? Explain the reason for the similarities/differences.

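Q2f ANSWER: The two analyses answer different questions. The boxplot shows the marginal relationship between Type and CO2, ignoring the other predictors, and there Vans have higher CO2 than Sedans (group means of 471.9 vs 364.1 in Q2c). The coefficient of -24.4 in model2 is a conditional effect: compared with a Sedan that has the same Model.Year, MPG, Weight, Horsepower, and Acceleration, a Van is estimated to emit 24.4 g/mi less. The sign differs because Type is related to the other predictors (for example, Weight and MPG likely differ by vehicle type), so much of the marginal difference between types is accounted for by those variables once they are in the model.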

4 pts Is the overall regression (model2) significant at an \(\alpha\)-level of 0.05? Explain how you determined the answer.

Q2g ANSWER: The F-statistic of model2 is 4139 on 8 and 2051 degrees of freedom, with a p-value < 2.2e-16. Since this p-value is less than 0.05, we reject the null hypothesis that all slope coefficients are 0, so the overall regression of model2 is significant.

Question 3: Model Comparison, Outliers, and Multicollinearity [16 points]

4 pts Conduct a partial \(F\)-test comparing model1 and model2. What can you conclude from the results using an \(\alpha\)-level of 0.05?

anova(model1, model2)

## Analysis of Variance Table
## 
## Model 1: data$CO2 ~ data$Model.Year + data$MPG + data$Weight
## Model 2: data$CO2 ~ data$Model.Year + data$MPG + data$Weight + data$Horsepower + 
##     data$Acceleration + Type
##   Res.Df     RSS Df Sum of Sq     F    Pr(>F)    
## 1   2056 1140671                                 
## 2   2051  981999  5    158672 66.28 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

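Q3a ANSWER: The partial F-test has a p-value of < 2.2e-16, which is less than the \(\alpha\)-level of 0.05. We reject the null hypothesis that the regression coefficients for Horsepower, Acceleration, and Type are all 0. At least one of the added predictors has explanatory power given that Model.Year, MPG, and Weight are already in the model.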
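The partial F statistic can be retraced from the two residual sums of squares in the table. This is a minimal sketch using the printed values; the variable names are hypothetical.

```r
# Residual sums of squares and residual degrees of freedom from the table above
rss1 <- 1140671; df1 <- 2056   # model1 (reduced)
rss2 <- 981999;  df2 <- 2051   # model2 (full)

# Drop in RSS per added coefficient, relative to the full model's error variance
f_partial <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)

# Upper-tail p-value on (df1 - df2) and df2 degrees of freedom
p_partial <- pf(f_partial, df1 - df2, df2, lower.tail = FALSE)
```

The 5 numerator degrees of freedom correspond to Horsepower, Acceleration, and the three Type dummy variables.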

4 pts Using \(R^2\) and adjusted \(R^2\), compare model1 and model2.

summary(model1)$r.squared

## [1] 0.9322435

summary(model2)$r.squared

## [1] 0.9416687

summary(model1)$adj.r.squared

## [1] 0.9321447

summary(model2)$adj.r.squared

## [1] 0.9414412

Q3b ANSWER: model2 has a higher R-squared than model1 (0.9417 vs 0.9322). In addition, despite being penalized for having more predictors, model2 has a higher adjusted R-squared than model1 (0.9414 vs 0.9321). Thus, model2 explains slightly more of the variability in CO2 emissions than model1.
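The penalty mentioned above comes from the adjusted R-squared formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of estimated slope coefficients. The sketch below applies it to the reported R-squared values; adj_r2 is a hypothetical helper, and small differences in the last digit can arise because the inputs are already rounded.

```r
# Adjusted R-squared from R-squared, sample size n, and number of slopes p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 2060                              # observations in the dataset
adj1 <- adj_r2(0.9322435, n, p = 3)    # model1: 3 slope coefficients
adj2 <- adj_r2(0.9416687, n, p = 8)    # model2: 5 quantitative + 3 Type dummies
c(model1 = adj1, model2 = adj2)
```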

4 pts Create a plot for the Cook’s Distances (use model2). Using a threshold of 1, are there any outliers? If yes, which data points?

cook = cooks.distance(model2)

plot(cook,type="h", lwd=4, col="red", ylab = "Cook's Distance", main="Cook's Distance")
abline(h=1,col="blue")

Q3c ANSWER: Using a threshold of 1, there aren’t any outliers.
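Reading the plot can be backed up with a direct check of which observations exceed the threshold. The sketch below uses R's built-in mtcars data as a stand-in (it is not the assignment dataset), and the names fit, cook_demo, and flagged are hypothetical.

```r
# Stand-in model on the built-in mtcars data (NOT the assignment dataset)
fit <- lm(mpg ~ wt + hp, data = mtcars)
cook_demo <- cooks.distance(fit)

# Positions of observations whose Cook's distance exceeds the threshold of 1
flagged <- which(cook_demo > 1)
length(flagged)    # 0 means no observation crosses the threshold
max(cook_demo)     # largest Cook's distance, for context
```

For the assignment itself, which(cook > 1) applied to the cook vector computed above would list any flagged data points.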

4 pts Calculate the VIF of each predictor (use model2). Using a threshold of max(10, 1/(1-\(R^2\))) what conclusions can you make regarding multicollinearity?

library(car)

## Loading required package: carData

vif(model2)

##                        GVIF Df GVIF^(1/(2*Df))
## data$Model.Year   10.152350  1        3.186275
## data$MPG           6.166968  1        2.483338
## data$Weight        7.733145  1        2.780853
## data$Horsepower    9.917250  1        3.149167
## data$Acceleration  7.365774  1        2.713996
## Type               2.557766  3        1.169437

vif_threshold <- max(10, 1/(1-summary(model2)$r.squared))
vif_threshold

## [1] 17.14346

Q3d ANSWER: The threshold is max(10, 17.14) = 17.14. Since every predictor has a GVIF below 17.14 (the largest is Model.Year at 10.15), there is no indication of problematic multicollinearity under this threshold.
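The comparison against the threshold can also be done programmatically. This sketch re-enters the reported R-squared and GVIF values; the names gvif, over_threshold, and over_ten are hypothetical.

```r
# Threshold from the reported model2 R-squared
r2_model2 <- 0.9416687
vif_threshold <- max(10, 1 / (1 - r2_model2))

# GVIF values copied from the vif(model2) output above
gvif <- c(Model.Year = 10.152350, MPG = 6.166968, Weight = 7.733145,
          Horsepower = 9.917250, Acceleration = 7.365774, Type = 2.557766)

# Predictors exceeding the assignment's threshold, and the conventional cutoff of 10
over_threshold <- names(gvif)[gvif > vif_threshold]
over_ten <- names(gvif)[gvif > 10]
```

Under the assignment's threshold nothing is flagged, although Model.Year sits just above the conventional cutoff of 10.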

Question 4: Prediction [3 points]

3 pts Using model1 and model2, predict the CO2 emissions for a vehicle with the following characteristics: Model.Year=2020, Type=“Sedan”, MPG=32, Weight=3400, Horsepower=203, Acceleration=8

new <- data.frame(Model.Year=c(2020), Type=c("Sedan"), MPG=c(32), Weight = c(3400), Horsepower = c(203), Acceleration = c(8))

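# hand calculation from the rounded coefficients in summary(model1) and summary(model2);
# Sedan is the baseline level of Type, so no Type term is added
model1_predict = 3764 - 1.547*2020 - 15.87*32 + 0.02747*3400
model2_predict = 1738 - 0.528*2020 - 16.84*32 + 0.04586*3400 - 0.3169*203 + 0.3522*8
model1_predict

## [1] 224.618

model2_predict

## [1] 226.9709

Q4 ANSWER: The predicted CO2 emissions are approximately 224.62 g/mi for model1 and 226.97 g/mi for model2. These values are computed from rounded coefficients, so they may differ slightly from predictions made with the full-precision fitted models.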
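A hand calculation from rounded coefficients can be avoided with predict(). One caveat: when a model formula is written with data$ prefixes, as in the models above, predict() may not pick up the values supplied in newdata, so the formula should use bare column names. The sketch below demonstrates the pattern on simulated data (not the assignment dataset); toy, fit, and new_obs are hypothetical names.

```r
# Simulated stand-in data: y depends linearly on x plus noise
set.seed(1)
toy <- data.frame(x = 1:20)
toy$y <- 3 + 2 * toy$x + rnorm(20, sd = 0.5)

# Bare column names in the formula let predict() match columns in newdata
fit <- lm(y ~ x, data = toy)
new_obs <- data.frame(x = 25)
pred <- predict(fit, newdata = new_obs)

# Same prediction by hand, using the full-precision coefficients
manual <- sum(coef(fit) * c(1, 25))
```

Applied to the assignment, refitting with formulas such as CO2 ~ Model.Year + MPG + Weight and data = data would allow predict(model1, newdata = new) to use the new data frame defined above.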