Peer Grader Guidance

Please review the student expectations for peer review grading and peer review comments. Overall, we ask that you score with accuracy. When grading your peers, you will not only learn how to improve your future homework submissions but you will also gain deeper understanding of the concepts in the assignments. When assigning scores, consider the responses to the questions given your understanding of the problem and using the solutions as a guide. Moreover, please give partial credit for a concerted effort, but also be thorough. Add comments to your review, particularly when deducting points, to explain why the student missed the points. Ensure your comments are specific to questions and the student responses in the assignment.

Background

The provided dataset is a subset of the public data from the 2022 EPA Automotive Trends Report. It will be used to study the effects of various vehicle characteristics on CO2 emissions.

Data Description

The dataset consists of a dataframe with 2060 observations with the following 7 variables:

  1. Model.Year: year the vehicle model was produced (quantitative)
  2. Type: vehicle type (qualitative)
  3. MPG: miles per gallon of fuel (quantitative)
  4. Weight: vehicle weight in lbs (quantitative)
  5. Horsepower: vehicle horsepower in HP (quantitative)
  6. Acceleration: acceleration time (from 0 to 60 mph) in seconds (quantitative)
  7. CO2: carbon dioxide emissions in g/mi (response variable)

Instructions on reading the data

To read the data in R, save the file in your working directory (change the directory first if it differs from the R working directory) and read the data using the R function read.csv().

# reading the dataset
setwd("~/Regression Analysis/Homework")
getwd()  # confirm the current working directory
data <- read.csv("vehicle_CO2_emis.csv")
head(data,3)
##   Model.Year Type Weight Horsepower Acceleration      MPG      CO2
## 1       1995  SUV   3500         94      13.2003 20.99489 423.2935
## 2       1996  SUV   3500        130      11.1597 20.29168 437.9628
## 3       1997  SUV   3500        130      11.7893 18.92057 469.7003

Question 1: Exploratory Data Analysis [15 points]

  1. 3 pts Compute the correlation coefficient for each quantitative predicting variable (Model.Year, MPG, Weight, Horsepower, Acceleration) against the response (CO2). Describe the strength and direction of the top 3 predicting variables that have the strongest linear relationships with the response.
setwd("~/Regression Analysis/Homework")
getwd()  # confirm the current working directory
#libraries
library(ggplot2)
library(car)
## Loading required package: carData
rm(list = ls())
#Reading the data
data = read.csv("vehicle_CO2_emis.csv", header = TRUE)
head(data)
##   Model.Year Type Weight Horsepower Acceleration      MPG      CO2
## 1       1995  SUV   3500         94      13.2003 20.99489 423.2935
## 2       1996  SUV   3500        130      11.1597 20.29168 437.9628
## 3       1997  SUV   3500        130      11.7893 18.92057 469.7003
## 4       1998  SUV   3500        130      12.1865 19.74470 450.0955
## 5       1999  SUV   3500        130      12.2307 19.66936 451.8196
## 6       2000  SUV   3500        130      11.7105 18.74666 474.0577
str(data)
## 'data.frame':    2060 obs. of  7 variables:
##  $ Model.Year  : int  1995 1996 1997 1998 1999 2000 2001 2002 2005 2006 ...
##  $ Type        : chr  "SUV" "SUV" "SUV" "SUV" ...
##  $ Weight      : num  3500 3500 3500 3500 3500 ...
##  $ Horsepower  : num  94 130 130 130 130 ...
##  $ Acceleration: num  13.2 11.2 11.8 12.2 12.2 ...
##  $ MPG         : num  21 20.3 18.9 19.7 19.7 ...
##  $ CO2         : num  423 438 470 450 452 ...
model = lm(CO2~ 
             Model.Year+MPG+Weight+Horsepower+Acceleration, data = data)
summary(model)
## 
## Call:
## lm(formula = CO2 ~ Model.Year + MPG + Weight + Horsepower + Acceleration, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.312 -12.006  -5.127   6.064 261.216 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.770e+03  2.254e+02  12.289  < 2e-16 ***
## Model.Year   -1.041e+00  1.155e-01  -9.010  < 2e-16 ***
## MPG          -1.599e+01  2.639e-01 -60.606  < 2e-16 ***
## Weight        4.000e-02  2.028e-03  19.723  < 2e-16 ***
## Horsepower   -2.803e-01  2.546e-02 -11.008  < 2e-16 ***
## Acceleration -1.483e+00  4.685e-01  -3.165  0.00157 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.75 on 2054 degrees of freedom
## Multiple R-squared:  0.9369, Adjusted R-squared:  0.9367 
## F-statistic:  6097 on 5 and 2054 DF,  p-value: < 2.2e-16

Q1a ANSWER: Judged by the fitted regression coefficients, the top 3 predicting variables are MPG, Weight, and Horsepower; by t-value magnitude, MPG is the strongest and Horsepower the weakest of the three. MPG has a large negative coefficient: each one-unit increase in MPG is associated with a 15.99 g/mi decrease in CO2 emissions, holding the other predictors constant. Weight has a positive coefficient: each additional pound of vehicle weight is associated with a 0.04 g/mi increase in CO2 emissions. Horsepower has a negative coefficient: each additional 1 HP is associated with a 0.28 g/mi decrease in CO2 emissions. These are regression coefficients from the full model rather than pairwise correlation coefficients.
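Question 1a asks for correlation coefficients, which can be computed directly with cor(). The sketch below shows one way to rank the quantitative predictors by the absolute value of their correlation with the response. The `toy` data frame is simulated stand-in data (not the EPA dataset), and the names `quant_vars` and `ranked` are illustrative assumptions.

```r
# Simulated stand-in data: made-up values, NOT the EPA dataset
set.seed(1)
n <- 200
toy <- data.frame(
  Model.Year   = sample(1975:2021, n, replace = TRUE),
  MPG          = runif(n, 12, 40),
  Weight       = runif(n, 2500, 6000),
  Horsepower   = runif(n, 80, 350),
  Acceleration = runif(n, 6, 16)
)
# In this toy example CO2 depends only on MPG, plus noise
toy$CO2 <- 700 - 15 * toy$MPG + rnorm(n, sd = 20)

# Pearson correlation of each quantitative predictor with the response
quant_vars <- c("Model.Year", "MPG", "Weight", "Horsepower", "Acceleration")
cors <- sapply(toy[quant_vars], cor, y = toy$CO2)

# Rank by strength (absolute value); the sign gives the direction
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 3))
```

With the homework data, replacing `toy` with `data` would give the correlations the question asks about; the first three entries of `ranked` are the strongest linear relationships.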

  1. 3 pts Create a boxplot of the qualitative predicting variable (Type) versus the response (CO2). Explain the relationship between the two variables.
boxplot(data$CO2~data$Type, xlab = "Vehicle Type", ylab ="Carbon Dioxide Emissions (g/mi)")

Q1b ANSWER: Across the four vehicle types, median CO2 emissions are lowest for Sedans, followed by SUVs, then Trucks, and highest for Vans. There is considerable variability within all four categories, with SUVs showing the most. Every category also has many outliers on the high end.

  1. 6 pts Create scatterplots of the response (CO2) against each quantitative predicting variable (Model.Year, MPG, Weight, Horsepower, Acceleration). Describe the general trend of each plot.
#Scatterplot code
#Model Year
plot( data$Model.Year, data$CO2, main ="Scatterplot of CO2 Emissions vs Model Year of the Vehicle",
     ylab="Carbon Dioxide Emissions (g/mi)", xlab =" Model Year")

#MPG
plot(data$MPG, data$CO2, main ="Scatterplot of CO2 Emissions vs Vehicle MPG",
     ylab="Carbon Dioxide Emissions (g/mi)", xlab =" Miles per gallon")

#Weight
plot(data$Weight, data$CO2, main ="Scatterplot of CO2 Emissions vs Vehicle Weight",
     ylab="Carbon Dioxide Emissions (g/mi)", xlab =" Weight (lbs)")

#horsepower
plot(data$Horsepower, data$CO2, main ="Scatterplot of CO2 Emissions vs Vehicle Horse Power",
     ylab="Carbon Dioxide Emissions (g/mi)", xlab =" Horsepower")

#acceleration
plot(data$Acceleration,data$CO2, main ="Scatterplot of CO2 Emissions vs Vehicle Acceleration",
     ylab="Carbon Dioxide Emissions (g/mi)", xlab =" Acceleration (From 0 to 60 miles in seconds)")

Q1c ANSWER:
  • Model Year: There is a weak negative correlation between the model year of the vehicle and CO2 emissions.
  • MPG: There is a strong negative correlation between MPG and CO2 emissions. The relationship could be linear, but it appears to follow a curve resembling e^(-x).
  • Weight: There is a positive correlation between the weight of the vehicle and CO2 emissions.
  • Horsepower: There is a very weak positive correlation between the horsepower of a vehicle and CO2 emissions. The pattern is scattered and looks almost random, though the points turn slightly downward.
  • Acceleration: There is a strong positive correlation between the acceleration time of the vehicle and CO2 emissions, meaning that as one increases so does the other.

  1. 3 pts Based on this exploratory analysis, is it reasonable to fit a multiple linear regression model for the relationship between CO2 and the predicting variables? Explain how you determined the answer.

Q1d ANSWER: It is reasonable to fit a multiple linear regression model for CO2 emissions on these predicting variables, although some of the variables appear to have a weaker linear relationship with the response. In addition, the \(R^2\) and adjusted \(R^2\) of the fitted model are both high (about 0.937), meaning that roughly 94% of the variability in CO2 is explained by the predictors in the model.

Question 2: Model Fitting and Interpretation [26 points]

  1. 3 pts Fit a multiple linear regression model called model1 using CO2 as the response and the top 3 predicting variables with the strongest relationship with CO2, from Question 1a.
model1 = lm(CO2~ 
             Horsepower+MPG+Weight, data = data)
summary(model1)
## 
## Call:
## lm(formula = CO2 ~ Horsepower + MPG + Weight, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.867 -12.500  -5.782   6.329 267.617 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 750.581176   7.560325   99.28   <2e-16 ***
## Horsepower   -0.320859   0.016036  -20.01   <2e-16 ***
## MPG         -17.820898   0.174183 -102.31   <2e-16 ***
## Weight        0.031548   0.001707   18.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.19 on 2056 degrees of freedom
## Multiple R-squared:  0.9343, Adjusted R-squared:  0.9342 
## F-statistic:  9749 on 3 and 2056 DF,  p-value: < 2.2e-16
  1. 4 pts What is the estimated coefficient for the intercept? Interpret this coefficient in the context of the dataset.

Q2b ANSWER: The estimated intercept is 750.58. It is the estimated mean CO2 emissions (in g/mi) for a vehicle with Horsepower, MPG, and Weight all equal to zero. No such vehicle exists in the dataset, so the intercept has no practical interpretation on its own.

  1. 4 pts Assuming a marginal relationship between Type and CO2, perform an ANOVA F-test on the mean CO2 emission among the different vehicle types. Using an \(\alpha\)-level of 0.05, is Type useful in predicting CO2? Explain how you determined the answer.
anova_model = aov(CO2~ Type, data= data)
summary(anova_model)
##               Df   Sum Sq Mean Sq F value Pr(>F)    
## Type           3  3802499 1267500     200 <2e-16 ***
## Residuals   2056 13032369    6339                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Q2c ANSWER: The output gives an F-value of 200 with a p-value approximately equal to 0 (< 2e-16). Since the p-value is below the alpha threshold of 0.05, the null hypothesis that mean CO2 emissions are equal across all vehicle types is rejected. At least one vehicle type has a different mean, so Type is useful in predicting CO2.
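As a check on the ANOVA table, the F-value can be reproduced by hand from the sums of squares and degrees of freedom it reports. This is a minimal sketch; the variable names are illustrative, and the numbers are copied from the aov() output above.

```r
# Sums of squares and degrees of freedom copied from the aov() table
ss_type <- 3802499
df_type <- 3
ss_res  <- 13032369
df_res  <- 2056

# F = (between-group mean square) / (within-group mean square)
f_stat <- (ss_type / df_type) / (ss_res / df_res)

# Upper-tail probability under the F(df_type, df_res) distribution
p_value <- pf(f_stat, df_type, df_res, lower.tail = FALSE)

print(round(f_stat, 1))
print(p_value < 0.05)
```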

  1. 3 pts Fit a multiple linear regression model called model2 using CO2 as the response and all predicting variables. Using \(\alpha = 0.05\), which of the estimated coefficients that were statistically significant in model1 are also statistically significant in model2?
model2 = lm(CO2~ ., data = data)
summary(model2)
## 
## Call:
## lm(formula = CO2 ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.674 -11.644  -3.872   6.250 248.771 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.738e+03  2.437e+02   7.129 1.39e-12 ***
## Model.Year   -5.280e-01  1.245e-01  -4.239 2.34e-05 ***
## TypeSUV      -9.438e+00  1.676e+00  -5.632 2.03e-08 ***
## TypeTruck    -1.891e+01  1.853e+00 -10.203  < 2e-16 ***
## TypeVan      -2.440e+01  2.153e+00 -11.333  < 2e-16 ***
## Weight        4.586e-02  2.014e-03  22.774  < 2e-16 ***
## Horsepower   -3.169e-01  2.511e-02 -12.623  < 2e-16 ***
## Acceleration  3.522e-01  4.744e-01   0.742    0.458    
## MPG          -1.684e+01  2.777e-01 -60.649  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.88 on 2051 degrees of freedom
## Multiple R-squared:  0.9417, Adjusted R-squared:  0.9414 
## F-statistic:  4139 on 8 and 2051 DF,  p-value: < 2.2e-16

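Q2d ANSWER: All three predictors in model1 (Horsepower, MPG, and Weight) are statistically significant at the 0.05 level in both models, as is the intercept. Model.Year is also significant in model2, but it is not part of model1.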

  1. 4 pts Interpret the estimated coefficient for TypeVan in the context of the dataset. Make sure TypeSedan is the baseline level for Type. Mention any assumptions you make about other predictors clearly when stating the interpretation.

Q2e ANSWER: With Sedan as the baseline level, the estimated coefficient for TypeVan is -24.40. Holding Model.Year, Weight, Horsepower, Acceleration, and MPG constant, the estimated mean CO2 emissions of a Van are 24.40 g/mi lower than those of a Sedan.
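The question asks to make sure Sedan is the baseline level. One way to guarantee this, rather than relying on the default level ordering, is relevel(). The sketch below uses a short made-up character vector; `type` and `type_f` are illustrative names.

```r
# Made-up vehicle types, standing in for data$Type
type <- c("SUV", "Sedan", "Truck", "Van", "Sedan", "SUV")

# Convert to a factor and set Sedan as the reference (baseline) level
type_f <- relevel(factor(type), ref = "Sedan")

# The first level is the baseline that lm() absorbs into the intercept
print(levels(type_f))
```

With the homework data, the equivalent step would be `data$Type <- relevel(factor(data$Type), ref = "Sedan")` before fitting model2. The coefficient names in the model2 output (TypeSUV, TypeTruck, TypeVan) indicate that Sedan was already the baseline there.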

  1. 4 pts How does your interpretation of TypeVan above compare to the relationship between CO2 vs Type analyzed using the boxplot in Q1? Explain the reason for the similarities/differences.

Q2f ANSWER: In the model, holding everything else constant, a Van is estimated to emit 24.40 g/mi less CO2 than a Sedan. The boxplot, however, shows that Vans have the highest CO2 emissions overall. The two differ because the boxplot shows the marginal relationship between Type and CO2 and does not control for the other predictors, whereas the coefficient is conditional on them. Vans tend to be heavier and to have lower MPG, and both characteristics are associated with higher CO2 emissions. So the higher emissions of Vans are explained by these other factors rather than by the vehicle being categorized as a Van.

  1. 4 pts Is the overall regression (model2) significant at an \(\alpha\)-level of 0.05? Explain how you determined the answer.

Q2g ANSWER: The regression output reports an overall F-statistic of 4139 on 8 and 2051 degrees of freedom, with a p-value below 2.2e-16, which is effectively zero. Since this p-value is far below the alpha level of 0.05, we reject the null hypothesis that all regression coefficients are zero, so the overall regression is significant.

Question 3: Model Comparison, Outliers, and Multicollinearity [16 points]

  1. 4 pts Conduct a partial \(F\)-test comparing model1 and model2. What can you conclude from the results using an \(\alpha\)-level of 0.05?
partial_f = anova(model1, model2)
print(partial_f)
## Analysis of Variance Table
## 
## Model 1: CO2 ~ Horsepower + MPG + Weight
## Model 2: CO2 ~ Model.Year + Type + Weight + Horsepower + Acceleration + 
##     MPG
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   2056 1105746                                  
## 2   2051  981999  5    123747 51.692 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Q3a ANSWER: The partial F-statistic is 51.692 with a p-value below 2.2e-16, which is much lower than our alpha value of 0.05. We therefore reject the null hypothesis that the coefficients of the predictors added in model2 (Model.Year, Type, and Acceleration) are all zero. At least one of the added predictors significantly improves the fit, so model2 explains significantly more variability than model1.
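The partial F-statistic in the table can be reproduced from the two residual sums of squares. This is a minimal sketch using the numbers printed by anova(model1, model2); the variable names are illustrative.

```r
# Residual sums of squares and residual degrees of freedom from the anova() table
rss_reduced <- 1105746   # model1
df_reduced  <- 2056
rss_full    <- 981999    # model2
df_full     <- 2051

# Number of extra coefficients in the full model
q <- df_reduced - df_full

# Partial F = (drop in RSS per added coefficient) / (full-model mean squared error)
f_partial <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
p_partial <- pf(f_partial, q, df_full, lower.tail = FALSE)

print(round(f_partial, 3))
print(p_partial < 0.05)
```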

  1. 4 pts Using \(R^2\) and adjusted \(R^2\), compare model1 and model2.
summary(model1)
## 
## Call:
## lm(formula = CO2 ~ Horsepower + MPG + Weight, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.867 -12.500  -5.782   6.329 267.617 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 750.581176   7.560325   99.28   <2e-16 ***
## Horsepower   -0.320859   0.016036  -20.01   <2e-16 ***
## MPG         -17.820898   0.174183 -102.31   <2e-16 ***
## Weight        0.031548   0.001707   18.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.19 on 2056 degrees of freedom
## Multiple R-squared:  0.9343, Adjusted R-squared:  0.9342 
## F-statistic:  9749 on 3 and 2056 DF,  p-value: < 2.2e-16
summary(model2)
## 
## Call:
## lm(formula = CO2 ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.674 -11.644  -3.872   6.250 248.771 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.738e+03  2.437e+02   7.129 1.39e-12 ***
## Model.Year   -5.280e-01  1.245e-01  -4.239 2.34e-05 ***
## TypeSUV      -9.438e+00  1.676e+00  -5.632 2.03e-08 ***
## TypeTruck    -1.891e+01  1.853e+00 -10.203  < 2e-16 ***
## TypeVan      -2.440e+01  2.153e+00 -11.333  < 2e-16 ***
## Weight        4.586e-02  2.014e-03  22.774  < 2e-16 ***
## Horsepower   -3.169e-01  2.511e-02 -12.623  < 2e-16 ***
## Acceleration  3.522e-01  4.744e-01   0.742    0.458    
## MPG          -1.684e+01  2.777e-01 -60.649  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.88 on 2051 degrees of freedom
## Multiple R-squared:  0.9417, Adjusted R-squared:  0.9414 
## F-statistic:  4139 on 8 and 2051 DF,  p-value: < 2.2e-16

Q3b ANSWER: The \(R^2\) and adjusted \(R^2\) are 0.9343 and 0.9342 for model1, and 0.9417 and 0.9414 for model2. Because the models have a different number of predictors, the adjusted \(R^2\) is the more appropriate value for comparison. Model2 is slightly better, with an adjusted \(R^2\) that is 0.0072 higher than that of model1. Both values are high, however: model2 and model1 explain about 94% and 93% of the variability in CO2 emissions, respectively.
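Both statistics can be recomputed from sums of squares already printed in this report. The sketch below takes the residual sums of squares from the partial F-test table and, as an assumption worth stating, obtains the total sum of squares of CO2 by adding the Type and Residuals rows of the one-way ANOVA table in Question 2. The variable names are illustrative.

```r
n   <- 2060                      # number of observations
tss <- 3802499 + 13032369        # total SS of CO2 (Type SS + Residuals SS from aov)

rss    <- c(model1 = 1105746, model2 = 981999)  # residual SS from anova()
df_res <- c(model1 = 2056,    model2 = 2051)    # residual degrees of freedom

# R^2 = 1 - RSS / TSS
r2 <- 1 - rss / tss

# Adjusted R^2 penalizes extra coefficients through the degrees of freedom
adj_r2 <- 1 - (rss / df_res) / (tss / (n - 1))

print(round(rbind(r2, adj_r2), 4))
```

The adjusted version rises only if the drop in residual sum of squares outweighs the degrees of freedom spent, which is why it is preferred for comparing models of different sizes.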

  1. 4 pts Create a plot for the Cook’s Distances (use model2). Using a threshold of 1, are there any outliers? If yes, which data points?
cook = cooks.distance(model2)
plot(cook, type="h", lwd =  3, ylab= "Cook's Distance")

Q3c ANSWER: No observation has a Cook's distance above 1, indicating that there are no outliers by this threshold.
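Reading the threshold off a plot can be error-prone, so it may help to check it numerically with which(). The sketch below uses a small simulated example with one deliberately planted influential point; the data are made up and are not the EPA dataset.

```r
# Simulated example: 20 points on a line plus one planted influential point
set.seed(42)
x <- c(1:20, 60)
y <- c(2 * (1:20) + rnorm(20), 0)   # the last point lies far from the trend

fit  <- lm(y ~ x)
cook <- cooks.distance(fit)

# Indices of observations whose Cook's distance exceeds the threshold of 1
flagged <- which(cook > 1)
print(flagged)
```

For the homework, the analogous check would be `which(cooks.distance(model2) > 1)`; an empty result corresponds to no points above the threshold.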

  1. 4 pts Calculate the VIF of each predictor (use model2). Using a threshold of max(10, 1/(1-\(R^2\))) what conclusions can you make regarding multicollinearity?
vif(model2)
##                   GVIF Df GVIF^(1/(2*Df))
## Model.Year   10.152350  1        3.186275
## Type          2.557766  3        1.169437
## Weight        7.733145  1        2.780853
## Horsepower    9.917250  1        3.149167
## Acceleration  7.365774  1        2.713996
## MPG           6.166968  1        2.483338

Q3d ANSWER: With \(R^2\) = 0.9417 for model2, the threshold is max(10, 1/(1-0.9417)) = max(10, 17.15) = 17.15. The largest values are Model.Year (10.15) and Horsepower (9.92), followed by Weight (7.73) and Acceleration (7.37); all are below 17.15. These values suggest some correlation among the predicting variables, but none exceeds the threshold, so multicollinearity is not flagged as problematic under this criterion.
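The threshold in the question depends on the model's \(R^2\), so it is worth computing explicitly. A minimal sketch, using the \(R^2\) from summary(model2) and the GVIF column from the vif() output above; the variable names are illustrative.

```r
# R^2 of model2, from summary(model2)
r2_model2 <- 0.9417

# Threshold from the question: max(10, 1 / (1 - R^2))
threshold <- max(10, 1 / (1 - r2_model2))

# GVIF values copied from the vif(model2) output
gvif <- c(Model.Year = 10.152350, Type = 2.557766, Weight = 7.733145,
          Horsepower = 9.917250, Acceleration = 7.365774, MPG = 6.166968)

# Predictors whose GVIF exceeds the threshold (empty if none)
exceeds <- gvif[gvif > threshold]

print(round(threshold, 2))
print(names(exceeds))
```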

Question 4: Prediction [3 points]

3 pts Using model1 and model2, predict the CO2 emissions for a vehicle with the following characteristics: Model.Year=2020, Type=“Sedan”, MPG=32, Weight=3400, Horsepower=203, Acceleration=8

parm <- data.frame(
  Model.Year = 2020, 
  Type = "Sedan", 
  MPG = 32, 
  Weight = 3400, 
  Horsepower = 203, 
  Acceleration = 8
)
predictions2 = predict(model2, parm)
predictions1 = predict(model1, parm)
predictions2
##        1 
## 226.5434
predictions1
##        1 
## 222.4425

Q4 ANSWER: For the given data frame, the predicted CO2 emissions are 222.4 g/mi using model1 and 226.5 g/mi using model2.
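The model1 prediction can be checked by hand from its estimated coefficients. The sketch below plugs the new vehicle's values into the fitted equation; because the coefficients are copied from the rounded summary(model1) output, the result should agree with predict() only to about two decimal places.

```r
# Coefficients copied from summary(model1)
coefs <- c(Intercept = 750.581176, Horsepower = -0.320859,
           MPG = -17.820898, Weight = 0.031548)

# New vehicle: 1 for the intercept, then Horsepower, MPG, Weight
x_new <- c(1, 203, 32, 3400)

# Prediction = intercept + sum of coefficient * value
pred1 <- sum(coefs * x_new)
print(round(pred1, 2))
```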