Please review the student expectations for peer review grading and peer review comments. Overall, we ask that you score with accuracy. When grading your peers, you will not only learn how to improve your future homework submissions but you will also gain deeper understanding of the concepts in the assignments. When assigning scores, consider the responses to the questions given your understanding of the problem and using the solutions as a guide. Moreover, please give partial credit for a concerted effort, but also be thorough. Add comments to your review, particularly when deducting points, to explain why the student missed the points. Ensure your comments are specific to questions and the student responses in the assignment.
The provided dataset is a subset of the public data from the 2022 EPA Automotive Trends Report. It will be used to study the effects of various vehicle characteristics on CO2 emissions.
The dataset is a data frame with 2060 observations of the following 7 variables: Model.Year, Type, Weight, Horsepower, Acceleration, MPG, and CO2.
To read the data in R, save the file in your working directory (change the directory first if it differs from the R working directory) and read the data using the R function read.csv().
# reading the dataset
data <- read.csv("vehicle_CO2_emis.csv")
head(data,3)
## Model.Year Type Weight Horsepower Acceleration MPG CO2
## 1 1995 SUV 3500 94 13.2003 20.99489 423.2935
## 2 1996 SUV 3500 130 11.1597 20.29168 437.9628
## 3 1997 SUV 3500 130 11.7893 18.92057 469.7003
cor(data$Model.Year, data$CO2)
## [1] -0.4794969
cor(data$MPG, data$CO2)
## [1] -0.9598317
cor(data$Weight, data$CO2)
## [1] 0.5163332
cor(data$Horsepower, data$CO2)
## [1] 0.007126864
cor(data$Acceleration, data$CO2)
## [1] 0.3117955
Q1a ANSWER: The three predicting variables with the strongest linear relationship with the response are MPG (-0.9598317), Weight (0.5163332), and Model.Year (-0.4794969). MPG and Model.Year are negatively correlated with CO2: higher values of either variable tend to go with lower CO2. Weight is positively correlated with CO2: heavier vehicles tend to have higher CO2.
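The ranking can be checked by sorting the reported correlations by absolute value. This is a minimal sketch with the coefficients copied by hand from the cor() output above; the names cors, ranked, and top3 are illustrative and not part of the original analysis.

```r
# Correlations with CO2, copied from the cor() output above
cors <- c(Model.Year = -0.4794969, MPG = -0.9598317, Weight = 0.5163332,
          Horsepower = 0.007126864, Acceleration = 0.3117955)

# Rank predictors by strength of linear association (sign ignored)
ranked <- cors[order(abs(cors), decreasing = TRUE)]
top3 <- names(ranked)[1:3]
print(top3)
```

Sorting on abs() matters here because two of the three strongest relationships are negative.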
# convert Type to a factor and keep a copy for use in formulas
data$Type <- as.factor(data$Type)
Type <- data$Type
boxplot(data$CO2 ~ Type, main = "CO2 vs Type",xlab = "Type", ylab = "CO2")
Q1b ANSWER: The median CO2 values for SUV, Truck, and Van appear close to one another, whereas the median for Sedan appears lower. The SUV type appears to have the widest range of CO2 values. This suggests a relationship between Type and CO2, with sedans generally producing lower CO2.
plot(data$Model.Year, data$CO2, main = "CO2 vs Model.Year", xlab = "Model.Year", ylab = "CO2")
abline(lm(CO2 ~ Model.Year, data = data), col = "blue")
plot(data$MPG, data$CO2, main = "CO2 vs MPG", xlab = "MPG", ylab = "CO2")
abline(lm(CO2 ~ MPG, data = data), col = "blue")
plot(data$Weight, data$CO2, main = "CO2 vs Weight", xlab = "Weight", ylab = "CO2")
abline(lm(CO2 ~ Weight, data = data), col = "blue")
plot(data$Horsepower, data$CO2, main = "CO2 vs Horsepower", xlab = "Horsepower", ylab = "CO2")
abline(lm(CO2 ~ Horsepower, data = data), col = "blue")
plot(data$Acceleration, data$CO2, main = "CO2 vs Acceleration", xlab = "Acceleration", ylab = "CO2")
abline(lm(CO2 ~ Acceleration, data = data), col = "blue")
Q1c ANSWER: The general trends of CO2 vs Model.Year and CO2 vs MPG are negative (inverse relationships), with CO2 vs MPG showing the stronger negative trend. The general trends of CO2 vs Weight and CO2 vs Acceleration are positive (direct relationships), with CO2 vs Weight showing the stronger positive trend. The fitted line for CO2 vs Horsepower is nearly flat, indicating little or no linear relationship.
Q1d ANSWER: Yes, it is reasonable to fit a multiple linear regression model for the relationship between CO2 and the predicting variables, since several predicting variables appear to be linearly related to CO2. Some variables seem to have more explanatory power than others, but further analysis is needed to determine their significance.
model1 = lm(data$CO2 ~ data$Model.Year + data$MPG + data$Weight, data = data)
summary(model1)
##
## Call:
## lm(formula = data$CO2 ~ data$Model.Year + data$MPG + data$Weight,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.891 -12.833 -5.383 5.787 262.204
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.764e+03 1.617e+02 23.27 <2e-16 ***
## data$Model.Year -1.547e+00 8.582e-02 -18.03 <2e-16 ***
## data$MPG -1.587e+01 2.653e-01 -59.83 <2e-16 ***
## data$Weight 2.747e-02 1.658e-03 16.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.55 on 2056 degrees of freedom
## Multiple R-squared: 0.9322, Adjusted R-squared: 0.9321
## F-statistic: 9429 on 3 and 2056 DF, p-value: < 2.2e-16
Q2b ANSWER: The estimated coefficient for the intercept is 3764. This is the expected CO2 when Model.Year, MPG, and Weight are all 0. Because a model year of 0 and a weight of 0 are far outside the observed data, this value is an extrapolation and should not be read as a realistic baseline level of CO2 emissions.
TypeAnova = aov(data$CO2 ~ Type)
summary(TypeAnova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Type 3 3802499 1267500 200 <2e-16 ***
## Residuals 2056 13032369 6339
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.tables(TypeAnova, type = "means")
## Tables of means
## Grand mean
##
## 433.3347
##
## Type
## Sedan SUV Truck Van
## 364.1 451.8 462.8 471.9
## rep 562.0 722.0 473.0 303.0
Q2c ANSWER: Since the p-value of the F-test is very small (<2e-16), we reject the null hypothesis that the mean CO2 is the same for all vehicle types. This means that the mean CO2 of at least one Type differs from that of another.
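The F value in the ANOVA table can be reproduced from the sums of squares and degrees of freedom. The sketch below copies those four numbers by hand from the table above; the variable names are illustrative.

```r
# Values copied from the summary(TypeAnova) table above
ss_type  <- 3802499;  df_type  <- 3     # between-group (Type)
ss_resid <- 13032369; df_resid <- 2056  # within-group (Residuals)

# Mean squares are sums of squares divided by their degrees of freedom
ms_type  <- ss_type / df_type
ms_resid <- ss_resid / df_resid

# F statistic and its upper-tail p-value under the null of equal means
f_stat  <- ms_type / ms_resid
p_value <- pf(f_stat, df_type, df_resid, lower.tail = FALSE)
print(c(F = f_stat, p = p_value))
```

The result should agree with the printed F value of 200 up to rounding in the table.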
model2 = lm(data$CO2 ~ data$Model.Year + data$MPG + data$Weight + data$Horsepower + data$Acceleration + Type, data = data)
summary(model2)
##
## Call:
## lm(formula = data$CO2 ~ data$Model.Year + data$MPG + data$Weight +
## data$Horsepower + data$Acceleration + Type, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.674 -11.644 -3.872 6.250 248.771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.738e+03 2.437e+02 7.129 1.39e-12 ***
## data$Model.Year -5.280e-01 1.245e-01 -4.239 2.34e-05 ***
## data$MPG -1.684e+01 2.777e-01 -60.649 < 2e-16 ***
## data$Weight 4.586e-02 2.014e-03 22.774 < 2e-16 ***
## data$Horsepower -3.169e-01 2.511e-02 -12.623 < 2e-16 ***
## data$Acceleration 3.522e-01 4.744e-01 0.742 0.458
## TypeSUV -9.438e+00 1.676e+00 -5.632 2.03e-08 ***
## TypeTruck -1.891e+01 1.853e+00 -10.203 < 2e-16 ***
## TypeVan -2.440e+01 2.153e+00 -11.333 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.88 on 2051 degrees of freedom
## Multiple R-squared: 0.9417, Adjusted R-squared: 0.9414
## F-statistic: 4139 on 8 and 2051 DF, p-value: < 2.2e-16
Q2d ANSWER: Using \(\alpha = 0.05\), all coefficients that are statistically significant in model1 are also statistically significant in model2. That is, Model.Year, MPG, and Weight remain statistically significant in both models, as their p-values are less than 0.05.
Q2e ANSWER: The estimated coefficient for TypeVan is -24.4. Sedan is the baseline level of Type, so, holding all other predictors fixed, the estimated mean CO2 of a Van is 24.4 units lower than that of a Sedan. The intercept (1738) does not enter this comparison.
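One way to read a dummy-variable coefficient is to compare two fitted values that differ only in Type. The sketch below uses the rounded coefficients from summary(model2); the helper fitted_co2 and the vehicle characteristics passed to it are hypothetical and chosen only for illustration.

```r
# Coefficients copied (rounded) from summary(model2) above
b <- c(Intercept = 1738, Model.Year = -0.528, MPG = -16.84, Weight = 0.04586,
       Horsepower = -0.3169, Acceleration = 0.3522,
       TypeSUV = -9.438, TypeTruck = -18.91, TypeVan = -24.40)

# Hypothetical helper: fitted CO2 for one vehicle; Sedan is the baseline level
fitted_co2 <- function(year, mpg, weight, hp, accel, type) {
  type_effect <- if (type == "Sedan") 0 else b[[paste0("Type", type)]]
  b[["Intercept"]] + b[["Model.Year"]] * year + b[["MPG"]] * mpg +
    b[["Weight"]] * weight + b[["Horsepower"]] * hp +
    b[["Acceleration"]] * accel + type_effect
}

# Same illustrative characteristics, differing only in Type
sedan <- fitted_co2(2020, 25, 4000, 250, 8, "Sedan")
van   <- fitted_co2(2020, 25, 4000, 250, 8, "Van")
print(van - sedan)  # difference equals the TypeVan coefficient
```

Whatever values are chosen for the other predictors, they cancel in the difference, which is why the coefficient is read as a comparison against the baseline type.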
Q2f ANSWER: The boxplot and the linear model answer different questions. The boxplot shows the marginal distribution of CO2 for each Type without accounting for any other variable; there, Van emissions are centered around 450 and sit above those of Sedan. The model coefficient for TypeVan (-24.4) is the estimated difference in mean CO2 between a Van and a Sedan that share the same Model.Year, MPG, Weight, Horsepower, and Acceleration. The two can differ in sign because the model adjusts for the other predictors, which likely differ across vehicle types.
Q2g ANSWER: The F-statistic of model2 is 4139 on 8 and 2051 degrees of freedom with p-value <2.2e-16. Since this p-value is less than 0.05, the overall regression of model2 is significant.
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: data$CO2 ~ data$Model.Year + data$MPG + data$Weight
## Model 2: data$CO2 ~ data$Model.Year + data$MPG + data$Weight + data$Horsepower +
## data$Acceleration + Type
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2056 1140671
## 2 2051 981999 5 158672 66.28 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Q3a ANSWER: The partial F-test has a p-value of < 2.2e-16, which is less than the alpha value of 0.05. We reject the null hypothesis that the regression coefficients for Horsepower, Acceleration, and Type are all 0. At least one of the added predictors improves the model beyond model1.
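The partial F statistic can be reproduced from the residual sums of squares and residual degrees of freedom in the anova() table. This is a sketch with values copied by hand from that table; the variable names are illustrative.

```r
# Values copied from the anova(model1, model2) table above
rss_reduced <- 1140671; df_reduced <- 2056  # model1
rss_full    <- 981999;  df_full    <- 2051  # model2

# Number of coefficients added in the full model
q <- df_reduced - df_full

# Partial F: drop in RSS per added coefficient, relative to the full-model MSE
f_partial <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
p_partial <- pf(f_partial, q, df_full, lower.tail = FALSE)
print(c(F = f_partial, p = p_partial))
```

Here q is 5 because Type contributes three dummy coefficients in addition to Horsepower and Acceleration.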
summary(model1)$r.squared
## [1] 0.9322435
summary(model2)$r.squared
## [1] 0.9416687
summary(model1)$adj.r.squared
## [1] 0.9321447
summary(model2)$adj.r.squared
## [1] 0.9414412
Q3b ANSWER: Model2's R-squared (0.9417) is higher than model1's (0.9322). In addition, model2, despite being penalized for having more variables, has a higher adjusted R-squared (0.9414) than model1 (0.9321). Thus, model2 explains more of the variability in CO2 emissions than model1.
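The penalty for extra predictors can be made explicit with the usual adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below applies it to the R-squared values printed above, with n = 2060 observations and p = 3 and 8 estimated slopes; the function name adj_r2 is illustrative.

```r
# Adjusted R-squared from R-squared, sample size n, and number of slopes p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 2060
adj1 <- adj_r2(0.9322435, n, p = 3)  # model1: 3 slopes
adj2 <- adj_r2(0.9416687, n, p = 8)  # model2: 8 slopes (incl. 3 Type dummies)
print(c(model1 = adj1, model2 = adj2))
```

With n this large, the penalty is small, so the adjusted values sit very close to the unadjusted ones.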
cook = cooks.distance(model2)
plot(cook,type="h", lwd=4, col="red", ylab = "Cook's Distance", main="Cook's Distance")
abline(h=1,col="blue")
Q3c ANSWER: Using a threshold of 1, no observation has a Cook's distance above the threshold, so there are no influential outliers by this criterion.
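A threshold check like this can also be done numerically rather than by reading the plot. The sketch below uses R's built-in mtcars data as a stand-in, not the vehicle dataset, so its numbers say nothing about model2; the same two lines would apply to the cook vector computed above.

```r
# Stand-in model on the built-in mtcars data (not the vehicle dataset)
fit <- lm(mpg ~ wt + hp, data = mtcars)
cook_demo <- cooks.distance(fit)

# Indices of observations whose Cook's distance exceeds the threshold of 1
influential <- which(cook_demo > 1)
print(length(influential))
print(max(cook_demo))
```

Reporting the maximum alongside the count shows how far the largest value is from the threshold.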
library(car)
## Loading required package: carData
vif(model2)
## GVIF Df GVIF^(1/(2*Df))
## data$Model.Year 10.152350 1 3.186275
## data$MPG 6.166968 1 2.483338
## data$Weight 7.733145 1 2.780853
## data$Horsepower 9.917250 1 3.149167
## data$Acceleration 7.365774 1 2.713996
## Type 2.557766 3 1.169437
vif_threshold <- max(10, 1/(1-summary(model2)$r.squared))
vif_threshold
## [1] 17.14346
Q3d ANSWER: The threshold is 17.14. Since all variables have a GVIF of less than 17.14, there is no indication of multicollinearity.
3 pts Using model1 and model2, predict the CO2 emissions for a vehicle with the following characteristics: Model.Year=2020, Type=“Sedan”, MPG=32, Weight=3400, Horsepower=203, Acceleration=8
new <- data.frame(Model.Year=c(2020), Type=c("Sedan"), MPG=c(32), Weight = c(3400), Horsepower = c(203), Acceleration = c(8))
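model1_predict = 3764 - 1.547*2020 - 15.87*32 + 0.02747*3400
# MPG coefficient is -16.84 (-1.684e+01 in summary(model2)); Sedan is the baseline Type, so no Type term is added
model2_predict = 1738 - 0.528*2020 - 16.84*32 + 0.04586*3400 - 0.3169*203 + 0.3522*8
model1_predict
## [1] 224.618
model2_predict
## [1] 226.9709
Q4 ANSWER: The predicted CO2 emissions for model1 and model2 are as follows: Model1: 224.618 Model2: 226.9709. Both values are computed from the rounded coefficients in the summary output, so they may differ slightly from predictions based on the unrounded coefficients.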
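Hand calculation from rounded coefficients can be avoided with predict(), but predict() only matches columns in newdata when the formula uses bare column names. With a formula written as data$CO2 ~ data$MPG, it looks up the original vectors instead. The sketch below shows the pattern on synthetic stand-in data; the data frame toy and its columns are hypothetical and unrelated to the vehicle dataset.

```r
# Synthetic stand-in data (not the vehicle dataset); column names are illustrative
set.seed(1)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 3 + 2 * toy$x1 - toy$x2 + rnorm(50, sd = 0.1)

# Bare column names in the formula let predict() match columns in newdata
fit <- lm(y ~ x1 + x2, data = toy)
new_obs <- data.frame(x1 = 0.5, x2 = -1)
pred <- predict(fit, newdata = new_obs)

# The same value computed by hand from the unrounded coefficients
manual <- sum(coef(fit) * c(1, new_obs$x1, new_obs$x2))
print(c(predict = unname(pred), manual = manual))
```

Applied to this assignment, that would mean refitting the models as lm(CO2 ~ Model.Year + MPG + Weight, data = data) and passing the new data frame to predict().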