Please review the student expectations for peer review grading and peer review comments. Overall, we ask that you score with accuracy. When grading your peers, you will not only learn how to improve your future homework submissions but you will also gain deeper understanding of the concepts in the assignments. When assigning scores, consider the responses to the questions given your understanding of the problem and using the solutions as a guide. Moreover, please give partial credit for a concerted effort, but also be thorough. Add comments to your review, particularly when deducting points, to explain why the student missed the points. Ensure your comments are specific to questions and the student responses in the assignment.
The provided dataset is a subset of the public data from the 2022 EPA Automotive Trends Report. It will be used to study the effects of various vehicle characteristics on CO2 emissions.
The dataset consists of a dataframe with 2060 observations with the following 7 variables:
To read the data in R, save the file in your working directory (change the directory first if it differs from the R working directory) and read the data using the R function read.csv().
# reading the dataset
setwd("~/Regression Analysis/Homework")
getwd() # the parentheses are required; a bare getwd prints the function definition instead of the path
data <- read.csv("vehicle_CO2_emis.csv")
head(data,3)
## Model.Year Type Weight Horsepower Acceleration MPG CO2
## 1 1995 SUV 3500 94 13.2003 20.99489 423.2935
## 2 1996 SUV 3500 130 11.1597 20.29168 437.9628
## 3 1997 SUV 3500 130 11.7893 18.92057 469.7003
#libraries
library(ggplot2)
library(car)
## Loading required package: carData
rm(list = ls())
#Reading the data
data = read.csv("vehicle_CO2_emis.csv", header = TRUE)
head(data)
## Model.Year Type Weight Horsepower Acceleration MPG CO2
## 1 1995 SUV 3500 94 13.2003 20.99489 423.2935
## 2 1996 SUV 3500 130 11.1597 20.29168 437.9628
## 3 1997 SUV 3500 130 11.7893 18.92057 469.7003
## 4 1998 SUV 3500 130 12.1865 19.74470 450.0955
## 5 1999 SUV 3500 130 12.2307 19.66936 451.8196
## 6 2000 SUV 3500 130 11.7105 18.74666 474.0577
str(data)
## 'data.frame': 2060 obs. of 7 variables:
## $ Model.Year : int 1995 1996 1997 1998 1999 2000 2001 2002 2005 2006 ...
## $ Type : chr "SUV" "SUV" "SUV" "SUV" ...
## $ Weight : num 3500 3500 3500 3500 3500 ...
## $ Horsepower : num 94 130 130 130 130 ...
## $ Acceleration: num 13.2 11.2 11.8 12.2 12.2 ...
## $ MPG : num 21 20.3 18.9 19.7 19.7 ...
## $ CO2 : num 423 438 470 450 452 ...
model = lm(CO2~
Model.Year+MPG+Weight+Horsepower+Acceleration, data = data)
summary(model)
##
## Call:
## lm(formula = CO2 ~ Model.Year + MPG + Weight + Horsepower + Acceleration,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.312 -12.006 -5.127 6.064 261.216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.770e+03 2.254e+02 12.289 < 2e-16 ***
## Model.Year -1.041e+00 1.155e-01 -9.010 < 2e-16 ***
## MPG -1.599e+01 2.639e-01 -60.606 < 2e-16 ***
## Weight 4.000e-02 2.028e-03 19.723 < 2e-16 ***
## Horsepower -2.803e-01 2.546e-02 -11.008 < 2e-16 ***
## Acceleration -1.483e+00 4.685e-01 -3.165 0.00157 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.75 on 2054 degrees of freedom
## Multiple R-squared: 0.9369, Adjusted R-squared: 0.9367
## F-statistic: 6097 on 5 and 2054 DF, p-value: < 2.2e-16
Q1a ANSWER: Judged by absolute t value, the three most influential coefficients are MPG (|t| = 60.6), Weight (|t| = 19.7), and Horsepower (|t| = 11.0), in that order. MPG has a large negative coefficient: holding the other predictors constant, each additional mile per gallon lowers the predicted CO2 emissions by 15.99 g/mi. Weight has a positive coefficient: each additional pound raises the predicted CO2 emissions by 0.04 g/mi. Horsepower has a negative coefficient: each additional horsepower lowers the predicted CO2 emissions by 0.28 g/mi.
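As a rough check on which coefficients stand out, the t values in the summary can be recomputed by dividing each estimate by its standard error. The sketch below copies the rounded numbers from the output above into two hypothetical vectors (est and se), so the results agree with the printed t values only up to rounding.

```r
# Estimates and standard errors copied (rounded) from summary(model)
est <- c(Model.Year = -1.041, MPG = -15.99, Weight = 0.04000,
         Horsepower = -0.2803, Acceleration = -1.483)
se  <- c(Model.Year = 0.1155, MPG = 0.2639, Weight = 0.002028,
         Horsepower = 0.02546, Acceleration = 0.4685)

# t value = estimate / standard error
t_val <- est / se

# Rank the predictors by absolute t value, largest first
ranked <- sort(abs(t_val), decreasing = TRUE)
print(round(ranked, 2))
```

Raw coefficients are on different scales (pounds, horsepower, miles per gallon), which is why this sketch ranks by t value rather than by the size of the estimate.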
boxplot(data$CO2~data$Type, xlab = "Vehicle Type", ylab ="Carbon Dioxide Emissions (g/mi)")
Q1b ANSWER: Across the four vehicle types, typical CO2 emissions are lowest for Sedans, followed by SUVs, then Trucks, and highest for Vans. All four categories show a wide spread, with SUVs having the widest. Every category also has many outliers on the high end.
#Scatterplot code
#Model Year
plot( data$Model.Year, data$CO2, main ="Scatterplot of CO2 Emissions vs Model Year of the Vehicle",
ylab="Carbon Dioxide Emissions (g/mi)", xlab =" Model Year")
#MPG
plot(data$MPG, data$CO2, main ="Scatterplot of CO2 Emissions vs Vehicle MPG",
ylab="Carbon Dioxide Emissions (g/mi)", xlab =" Miles per gallon")
#Weight
plot(data$Weight, data$CO2, main ="Scatterplot of CO2 Emissions vs Vehicle Weight",
ylab="Carbon Dioxide Emissions (g/mi)", xlab =" Weight (lbs)")
#horsepower
plot(data$Horsepower, data$CO2, main ="Scatterplot of CO2 Emissions vs Vehicle Horse Power",
ylab="Carbon Dioxide Emissions (g/mi)", xlab =" Horsepower")
#acceleration
plot(data$Acceleration,data$CO2, main ="Scatterplot of CO2 Emissions vs Vehicle Acceleration",
ylab="Carbon Dioxide Emissions (g/mi)", xlab ="Acceleration (0 to 60 mph time, in seconds)")
Q1c ANSWER:
• Model Year: There is a weak negative correlation between the model year of the vehicle and its CO2 emissions.
• MPG: There is a strong negative correlation between MPG and CO2 emissions. The relationship could be linear, but it looks closer to a decaying curve of the form e^-x.
• Weight: There is a positive correlation between the weight of the vehicle and its CO2 emissions.
• Horsepower: There is a very weak positive correlation between the horsepower of the vehicle and its CO2 emissions. The points are scattered and look almost random, although there is a slight downward turn.
• Acceleration: There is a strong positive correlation between the acceleration time of the vehicle and its CO2 emissions, meaning that as one increases so does the other.
Q1d ANSWER: It is reasonable to fit a multiple linear regression of CO2 emissions on these predicting variables, although some of them have a weaker linear relationship with the response than others. The R2 of 0.9369 and the adjusted R2 of 0.9367 are both high, meaning that about 94% of the variability in CO2 emissions is explained by the predictors in the model.
model1 = lm(CO2~
Horsepower+MPG+Weight, data = data)
summary(model1)
##
## Call:
## lm(formula = CO2 ~ Horsepower + MPG + Weight, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.867 -12.500 -5.782 6.329 267.617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 750.581176 7.560325 99.28 <2e-16 ***
## Horsepower -0.320859 0.016036 -20.01 <2e-16 ***
## MPG -17.820898 0.174183 -102.31 <2e-16 ***
## Weight 0.031548 0.001707 18.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.19 on 2056 degrees of freedom
## Multiple R-squared: 0.9343, Adjusted R-squared: 0.9342
## F-statistic: 9749 on 3 and 2056 DF, p-value: < 2.2e-16
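Q2b ANSWER: The intercept is 750.58. It is the estimated mean CO2 emissions, about 750 g/mi, for a vehicle whose Horsepower, MPG, and Weight are all zero. No real vehicle has those values, so the intercept is an extrapolation that anchors the fitted model rather than a physically meaningful emission level.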
anova_model = aov(CO2~ Type, data= data)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Type 3 3802499 1267500 200 <2e-16 ***
## Residuals 2056 13032369 6339
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Q2c ANSWER: The output of this one-way ANOVA gives an F-value of 200 on 3 and 2056 degrees of freedom and a p-value below 2e-16, approximately 0. At the alpha threshold of 0.05 the result is significant, so the null hypothesis that the mean CO2 emissions are equal across vehicle types is rejected: at least one vehicle type has a different mean.
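The F-value in the ANOVA table can be reproduced by hand from the sums of squares and degrees of freedom printed above. The following is a minimal sketch; the variable names are chosen here for illustration, and the inputs are the rounded values from the table.

```r
# Sums of squares and degrees of freedom copied from summary(anova_model)
ss_type  <- 3802499;  df_type  <- 3
ss_resid <- 13032369; df_resid <- 2056

# Mean square = sum of squares / degrees of freedom
ms_type  <- ss_type / df_type
ms_resid <- ss_resid / df_resid

# F statistic and its upper-tail p-value under the F(3, 2056) distribution
f_stat  <- ms_type / ms_resid
p_value <- pf(f_stat, df_type, df_resid, lower.tail = FALSE)

print(round(f_stat, 2))
print(p_value)
```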
model2 = lm(CO2~ ., data = data)
summary(model2)
##
## Call:
## lm(formula = CO2 ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.674 -11.644 -3.872 6.250 248.771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.738e+03 2.437e+02 7.129 1.39e-12 ***
## Model.Year -5.280e-01 1.245e-01 -4.239 2.34e-05 ***
## TypeSUV -9.438e+00 1.676e+00 -5.632 2.03e-08 ***
## TypeTruck -1.891e+01 1.853e+00 -10.203 < 2e-16 ***
## TypeVan -2.440e+01 2.153e+00 -11.333 < 2e-16 ***
## Weight 4.586e-02 2.014e-03 22.774 < 2e-16 ***
## Horsepower -3.169e-01 2.511e-02 -12.623 < 2e-16 ***
## Acceleration 3.522e-01 4.744e-01 0.742 0.458
## MPG -1.684e+01 2.777e-01 -60.649 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.88 on 2051 degrees of freedom
## Multiple R-squared: 0.9417, Adjusted R-squared: 0.9414
## F-statistic: 4139 on 8 and 2051 DF, p-value: < 2.2e-16
Q2d ANSWER: The coefficients that are significant at the 0.05 level in both models are MPG, Weight, Model.Year, and Horsepower.
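Q2e ANSWER: Sedan is the baseline level of Type, so the TypeVan coefficient of -24.40 means that a Van is estimated to emit 24.40 g/mi less CO2 than a Sedan with the same Model.Year, Weight, Horsepower, Acceleration, and MPG. The coefficients of the other types are not added to it; each one is a separate comparison against Sedans.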
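The comparison against Sedans comes from how R codes a categorical predictor: under the default treatment contrasts, the first factor level gets no indicator column and acts as the reference. The sketch below uses a hypothetical four-row factor (not the homework data) and sets the level order explicitly, since the default alphabetical order of "Sedan" and "SUV" can depend on the locale.

```r
# Hypothetical four-row example; the level order is set explicitly so that
# "Sedan" is the reference level regardless of the locale's sort order
type <- factor(c("Sedan", "SUV", "Truck", "Van"),
               levels = c("Sedan", "SUV", "Truck", "Van"))

# Treatment (dummy) coding: one indicator column per non-reference level
design <- model.matrix(~ type)
print(design)

# relevel() picks a different baseline, here "Van"
type_van_ref <- relevel(type, ref = "Van")
print(levels(type_van_ref))
```

The Sedan row has zeros in every indicator column, which is why each Type coefficient in the summary reads as a difference from Sedans.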
Q2f ANSWER: According to the model, holding everything else constant, a Van emits 24.40 g/mi less CO2 than a comparable Sedan, yet the boxplot shows that Vans have the highest CO2 emissions on average. The two results differ because the boxplot compares the vehicle types without accounting for any other factor, while the Type coefficient is adjusted for every other predictor in the model. Vans tend to be heavier and to have lower MPG, and both characteristics are associated with higher CO2 emissions. So it makes sense for Vans to emit more overall because of those other characteristics, and not because the vehicle has been categorized as a Van.
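The gap between an unadjusted group comparison and an adjusted regression coefficient can be illustrated with simulated data. Everything in the sketch below is made up for illustration (the variable names, the weights, and the effect sizes); it is not the EPA dataset and only shows that the two comparisons can have opposite signs.

```r
set.seed(1)  # simulated data, not the EPA dataset
n <- 500
van <- rep(c(0, 1), each = n)              # 0 = sedan-like, 1 = van-like
weight <- rnorm(2 * n, mean = 3000 + 2000 * van, sd = 300)

# Emissions rise with weight; being a van lowers them by 20 at a fixed weight
co2 <- 100 + 0.1 * weight - 20 * van + rnorm(2 * n, sd = 5)

# Unadjusted comparison (what a boxplot shows): vans are higher
raw_gap <- mean(co2[van == 1]) - mean(co2[van == 0])

# Adjusted comparison (what the regression coefficient shows): vans are lower
adj_gap <- unname(coef(lm(co2 ~ weight + van))["van"])

print(c(raw_gap = raw_gap, adj_gap = adj_gap))
```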
Q2g ANSWER: The overall F-test at the bottom of the regression output has an F-statistic of 4139 on 8 and 2051 degrees of freedom and a p-value below 2.2e-16, which can be treated as zero. This is far below the alpha level of 0.05, so the null hypothesis that all of the regression coefficients are zero is rejected: at least one predictor has a significant relationship with CO2 emissions.
partial_f = anova(model1, model2)
print(partial_f)
## Analysis of Variance Table
##
## Model 1: CO2 ~ Horsepower + MPG + Weight
## Model 2: CO2 ~ Model.Year + Type + Weight + Horsepower + Acceleration +
## MPG
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2056 1105746
## 2 2051 981999 5 123747 51.692 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Q3a ANSWER: The partial F-statistic is 51.692 on 5 and 2051 degrees of freedom, and its p-value is below 2.2e-16, far below the alpha value of 0.05. The null hypothesis that the coefficients added in model2 (Model.Year, Type, and Acceleration) are all zero can therefore be rejected, meaning that model2 fits significantly better than model1.
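The partial F-statistic reported by anova() can be recomputed from the two residual sums of squares in the table. This is a minimal sketch with illustrative variable names; the inputs are the rounded RSS values printed above, so the result matches only up to rounding.

```r
# Residual sums of squares and residual degrees of freedom from the anova table
rss_reduced <- 1105746; df_reduced <- 2056   # model1
rss_full    <- 981999;  df_full    <- 2051   # model2

# Number of coefficients added in the full model
q <- df_reduced - df_full

# Partial F = (drop in RSS per added coefficient) / (full-model mean squared error)
f_partial <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
p_value <- pf(f_partial, q, df_full, lower.tail = FALSE)

print(round(f_partial, 3))
print(p_value)
```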
summary(model1)
##
## Call:
## lm(formula = CO2 ~ Horsepower + MPG + Weight, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.867 -12.500 -5.782 6.329 267.617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 750.581176 7.560325 99.28 <2e-16 ***
## Horsepower -0.320859 0.016036 -20.01 <2e-16 ***
## MPG -17.820898 0.174183 -102.31 <2e-16 ***
## Weight 0.031548 0.001707 18.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.19 on 2056 degrees of freedom
## Multiple R-squared: 0.9343, Adjusted R-squared: 0.9342
## F-statistic: 9749 on 3 and 2056 DF, p-value: < 2.2e-16
summary(model2)
##
## Call:
## lm(formula = CO2 ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.674 -11.644 -3.872 6.250 248.771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.738e+03 2.437e+02 7.129 1.39e-12 ***
## Model.Year -5.280e-01 1.245e-01 -4.239 2.34e-05 ***
## TypeSUV -9.438e+00 1.676e+00 -5.632 2.03e-08 ***
## TypeTruck -1.891e+01 1.853e+00 -10.203 < 2e-16 ***
## TypeVan -2.440e+01 2.153e+00 -11.333 < 2e-16 ***
## Weight 4.586e-02 2.014e-03 22.774 < 2e-16 ***
## Horsepower -3.169e-01 2.511e-02 -12.623 < 2e-16 ***
## Acceleration 3.522e-01 4.744e-01 0.742 0.458
## MPG -1.684e+01 2.777e-01 -60.649 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.88 on 2051 degrees of freedom
## Multiple R-squared: 0.9417, Adjusted R-squared: 0.9414
## F-statistic: 4139 on 8 and 2051 DF, p-value: < 2.2e-16
Q3b ANSWER: The R2 and adjusted R2 are 0.9343 and 0.9342 for model1, and 0.9417 and 0.9414 for model2. Because the models have different numbers of predictors, the adjusted R2 is the more appropriate value for comparison. Model2 is slightly better, with an adjusted R2 that is 0.0072 higher than model1's. Both values are high: model1 explains about 93% and model2 about 94% of the variability in CO2 emissions.
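Both statistics can be reconstructed from quantities already printed in this document. The sketch below assumes the total sum of squares equals the Type and Residuals sums of squares from the one-way ANOVA table added together, and takes the residual sums of squares from the partial F-test table; the variable names are illustrative.

```r
n   <- 2060                                   # number of observations
tss <- 3802499 + 13032369                     # total SS from the one-way ANOVA table
rss <- c(model1 = 1105746, model2 = 981999)   # residual SS from anova(model1, model2)
p   <- c(model1 = 3, model2 = 8)              # coefficients excluding the intercept

# R2 is the share of total variability explained by the model
r2 <- 1 - rss / tss

# Adjusted R2 penalizes the number of estimated coefficients
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(rbind(r2, adj_r2), 4))
```

The penalty factor (n - 1) / (n - p - 1) is close to 1 here because n is large relative to p, which is why R2 and adjusted R2 are nearly identical for both models.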
cook = cooks.distance(model2)
plot(cook, type="h", lwd = 3, ylab= "Cook's Distance")
Q3c ANSWER: No observation has a Cook's distance above 1, so by this rule of thumb there are no influential outliers.
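To show what a flagged observation looks like, the sketch below fits a line to simulated data with one planted point that has both an extreme x value and a response that contradicts the trend. The data are made up, and the 4/n cutoff is an additional common rule of thumb that is not part of this assignment.

```r
set.seed(42)  # simulated data, not the EPA dataset
x <- c(rnorm(50), 20)            # 50 ordinary points plus one far-out x value
y <- 2 * x + rnorm(51)
y[51] <- -40                     # planted point that contradicts the trend

fit  <- lm(y ~ x)
cook <- cooks.distance(fit)

# Two common rules of thumb: D > 1 and D > 4/n
flag_one <- which(cook > 1)
flag_4n  <- which(cook > 4 / length(cook))

print(flag_one)
print(flag_4n)
```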
vif(model2)
## GVIF Df GVIF^(1/(2*Df))
## Model.Year 10.152350 1 3.186275
## Type 2.557766 3 1.169437
## Weight 7.733145 1 2.780853
## Horsepower 9.917250 1 3.149167
## Acceleration 7.365774 1 2.713996
## MPG 6.166968 1 2.483338
Q3d ANSWER: The GVIF for Model.Year is slightly above 10 and the GVIF for Horsepower is just below it. These values indicate that each of these predictors is strongly linearly related to the other predictors in the model. Weight and Acceleration are also on the high side, at above 7. Together this suggests substantial multicollinearity among the predicting variables.
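The last column of the vif() output, and the R2 implied by each one-degree-of-freedom value, can be recovered from the GVIF column alone. This sketch copies the printed values into illustrative vectors and relies on the identity VIF = 1 / (1 - R2_j), where R2_j comes from regressing predictor j on the other predictors.

```r
# GVIF values and degrees of freedom copied from vif(model2)
gvif <- c(Model.Year = 10.152350, Type = 2.557766, Weight = 7.733145,
          Horsepower = 9.917250, Acceleration = 7.365774, MPG = 6.166968)
df <- c(Model.Year = 1, Type = 3, Weight = 1,
        Horsepower = 1, Acceleration = 1, MPG = 1)

# Third column of the output: GVIF^(1/(2*Df)), comparable across terms
adj_gvif <- gvif^(1 / (2 * df))

# For one-df terms, the implied R2 of that predictor on all the others
r2_j <- 1 - 1 / gvif[df == 1]

print(round(adj_gvif, 6))
print(round(r2_j, 3))
```

A GVIF near 10 for Model.Year corresponds to an implied R2 of about 0.90, meaning roughly 90% of its variability is explained by the other predictors.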
3 pts Using model1 and model2, predict the CO2 emissions for a vehicle with the following characteristics: Model.Year=2020, Type=“Sedan”, MPG=32, Weight=3400, Horsepower=203, Acceleration=8
parm <- data.frame(
Model.Year = 2020,
Type = "Sedan",
MPG = 32,
Weight = 3400,
Horsepower = 203,
Acceleration = 8
)
predictions2 = predict(model2, parm)
predictions1 = predict(model1, parm)
predictions2
## 1
## 226.5434
predictions1
## 1
## 222.4425
Q4 ANSWER: For the given vehicle, the predicted CO2 emissions are 222.4 g/mi using model1 and 226.5 g/mi using model2.
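The model1 prediction can be checked by hand, since predict() multiplies each coefficient by the matching predictor value and adds the intercept. The sketch below uses the model1 coefficients as printed in its summary (rounded to six decimals), so the result agrees with predict() only up to rounding; the vector names are illustrative.

```r
# model1 coefficients copied from summary(model1)
coef1 <- c(Intercept = 750.581176, Horsepower = -0.320859,
           MPG = -17.820898, Weight = 0.031548)

# Predictor values for the vehicle in the question (1 multiplies the intercept)
x_new <- c(Intercept = 1, Horsepower = 203, MPG = 32, Weight = 3400)

# Linear predictor: sum of coefficient * value
pred1 <- sum(coef1 * x_new)
print(round(pred1, 2))
```

Model.Year, Type, and Acceleration are not used here because model1 does not include them.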