Background

The provided dataset is a subset of the public data from the 2022 EPA Automotive Trends Report. It will be used to study the effects of various vehicle characteristics on CO2 emissions.

Data Description

The dataset consists of a dataframe with 2060 observations with the following 7 variables:

Model.Year: year the vehicle model was produced (quantitative)
Type: vehicle type (qualitative)
MPG: miles per gallon of fuel (quantitative)
Weight: vehicle weight in lbs (quantitative)
Horsepower: vehicle horsepower in HP (quantitative)
Acceleration: acceleration time (from 0 to 60 mph) in seconds (quantitative)
CO2: carbon dioxide emissions in g/mi (response variable)

Instructions on reading the data

To read the data in R, save the file in your working directory (make sure you have changed the directory if different from the R working directory) and read the data using the R function read.csv()

# reading the dataset
data <- read.csv("vehicle_CO2_emis.csv")
head(data,3)

##   Model.Year Type Weight Horsepower Acceleration      MPG      CO2
## 1       1995  SUV   3500         94      13.2003 20.99489 423.2935
## 2       1996  SUV   3500        130      11.1597 20.29168 437.9628
## 3       1997  SUV   3500        130      11.7893 18.92057 469.7003

Question 1: Exploratory Data Analysis [15 points]

3 pts Compute the correlation coefficient for each quantitative predicting variable (Model.Year, MPG, Weight, Horsepower, Acceleration) against the response (CO2). Describe the strength and direction of the top 3 predicting variables that have the strongest linear relationships with the response.

cor(data$Model.Year, data$CO2)

## [1] -0.4794969

cor(data$MPG, data$CO2)

## [1] -0.9598317

cor(data$Weight, data$CO2)

## [1] 0.5163332

cor(data$Horsepower, data$CO2)

## [1] 0.007126864

cor(data$Acceleration, data$CO2)

## [1] 0.3117955

#cor(data[-2])
#cor(data[-c(2)])

Q1a ANSWER: My top 3 predictors are MPG= -0.96, weight= 0.52, model.year= -0.48 MPG has a very strong negative correlation with CO2. Weight has a moderateLy strong positive correlation with CO2. Model.year has a weaker negative correlation with CO2.

3 pts Create a boxplot of the qualitative predicting variable (Type) versus the response (CO2). Explain the relationship between the two variables.

#convert categorical data to factor & identify baseline
data$Type <- as.factor(data$Type) 
levels(data$Type)

## [1] "Sedan" "SUV"   "Truck" "Van"

plot(data$Type, data$CO2)

#boxplot(CO2 ~ Type, data = data)

Q1b ANSWER: The boxplot shows CO2 emissions are different between vehcile types of sedan vs other types. We can conclude a relationship exist between different type and CO2. Sedan has the lowest mean, followed by SUV, Truck, & van.

6 pts Create scatterplots of the response (CO2) against each quantitative predicting variable (Model.Year, MPG, Weight, Horsepower, Acceleration). Describe the general trend of each plot.

par(mfrow=c(2,3))
plot(data$Model.Year, data$CO2)
abline(lm(CO2 ~ Model.Year, data = data), col = "red")

plot(data$MPG, data$CO2)
abline(lm(CO2 ~ MPG, data = data), col = "red")

plot(data$Weight, data$CO2)
abline(lm(CO2 ~ Weight, data = data), col = "red")

plot(data$Horsepower, data$CO2)
abline(lm(CO2 ~ Horsepower, data = data), col = "red")

plot(data$Acceleration, data$CO2)
abline(lm(CO2 ~ Acceleration, data = data), col = "red")

Q1c ANSWER: Model year - mod/weak negative trend. Variance in years before 1985 MPG - strong negative correlation, however there is some curving especially at the ends, which suggest the relationship may not be linear. Weight - strong positive correlation Horsepower - VERY weak (almost non-existent) positive relationship with a lot of clustering and variance Acceleration - weak positive upward direction with a lot of variance

3 pts Based on this exploratory analysis, is it reasonable to fit a multiple linear regression model for the relationship between CO2 and the predicting variables? Explain how you determined the answer.

Q1d ANSWER: Yes, a MLR model is reasonable because there are several variables that have strong relationships with C02 so we can fit a MLR model. There are some predictors with weak correlations that could possibly be included in the MLR after variable transformation.

Question 2: Model Fitting and Interpretation [26 points]

3 pts Fit a multiple linear regression model called model1 using CO2 as the response and the top 3 predicting variables with the strongest relationship with CO2, from Question 1a.

model1 <- lm( CO2 ~ Model.Year + MPG + Weight, data = data)
summary(model1)

## 
## Call:
## lm(formula = CO2 ~ Model.Year + MPG + Weight, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.891 -12.833  -5.383   5.787 262.204 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.764e+03  1.617e+02   23.27   <2e-16 ***
## Model.Year  -1.547e+00  8.582e-02  -18.03   <2e-16 ***
## MPG         -1.587e+01  2.653e-01  -59.83   <2e-16 ***
## Weight       2.747e-02  1.658e-03   16.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.55 on 2056 degrees of freedom
## Multiple R-squared:  0.9322, Adjusted R-squared:  0.9321 
## F-statistic:  9429 on 3 and 2056 DF,  p-value: < 2.2e-16

4 pts What is the estimated coefficient for the intercept? Interpret this coefficient in the context of the dataset.

Q2b ANSWER: Estimated intercept = 3.764e4. This means CO2 emissions is 3764 is when all of the predictors are equal to zero. When weight, age and model year is zero, CO2 emissions will be 3764. This scenario may not be ideal in the real world.

4 pts Assuming a marginal relationship between Type and CO2, perform an ANOVA F-test on the mean CO2 emission among the different vehicle types. Using an \(\alpha\)-level of 0.05, is Type useful in predicting CO2? Explain how you determined the answer.

anovamodel <- aov(CO2 ~ Type, data = data)
summary(anovamodel)

##               Df   Sum Sq Mean Sq F value Pr(>F)    
## Type           3  3802499 1267500     200 <2e-16 ***
## Residuals   2056 13032369    6339                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Q2c ANSWER: Type is a significant predictor for CO2. It has a p-value of 2.2e-16 which is less than the alpha, so we can reject the null. Type is a significant predictor.

3 pts Fit a multiple linear regression model called model2 using CO2 as the response and all predicting variables. Using \(\alpha = 0.05\), which of the estimated coefficients that were statistically significant in model1 are also statistically significant in model2?

model2 <- lm(CO2 ~ ., data = data)
summary(model2)

## 
## Call:
## lm(formula = CO2 ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.674 -11.644  -3.872   6.250 248.771 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.738e+03  2.437e+02   7.129 1.39e-12 ***
## Model.Year   -5.280e-01  1.245e-01  -4.239 2.34e-05 ***
## TypeSUV      -9.438e+00  1.676e+00  -5.632 2.03e-08 ***
## TypeTruck    -1.891e+01  1.853e+00 -10.203  < 2e-16 ***
## TypeVan      -2.440e+01  2.153e+00 -11.333  < 2e-16 ***
## Weight        4.586e-02  2.014e-03  22.774  < 2e-16 ***
## Horsepower   -3.169e-01  2.511e-02 -12.623  < 2e-16 ***
## Acceleration  3.522e-01  4.744e-01   0.742    0.458    
## MPG          -1.684e+01  2.777e-01 -60.649  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.88 on 2051 degrees of freedom
## Multiple R-squared:  0.9417, Adjusted R-squared:  0.9414 
## F-statistic:  4139 on 8 and 2051 DF,  p-value: < 2.2e-16

levels(data$Type)

## [1] "Sedan" "SUV"   "Truck" "Van"

Q2d ANSWER: All coefficients in previous model ae significant in this model.

4 pts Interpret the estimated coefficient for TypeVan in the context of the dataset. Make sure TypeSedan is the baseline level for Type. Mention any assumptions you make about other predictors clearly when stating the interpretation.

Q2e ANSWER: Estimated TypeVan coefficient = -2.440e+01. This means on average, CO2 emissions for TypeVan is -24.4 less than TypeSedan, holding all other variables constant.

4 pts How does your interpretation of TypeVan above compare to the relationship between CO2 vs Type analyzed using the boxplot in Q1? Explain the reason for the similarities/differences.

Q2f ANSWER: It contradicts the boxplot, which showed sedan had the lowest median. The boxplot shows the marginal relationship between CO2 and Type, without considering the other factors; whereas the regression model considers the other variables in the model. (conditional relationship)

4 pts Is the overall regression (model2) significant at an \(\alpha\)-level of 0.05? Explain how you determined the answer.

Q2g ANSWER: Yes. Since the p-value for the overall F-Test is very close to zero (< 2.2e-16) and less than α = 0.05, we reject the null hypothesis that all estimated coefficients except the intercept are zero. This means that at least one of the slope coefficients is nonzero, and the overall model is statistically useful in predicting CO2.

Question 3: Model Comparison, Outliers, and Multicollinearity [16 points]

4 pts Conduct a partial \(F\)-test comparing model1 and model2. What can you conclude from the results using an \(\alpha\)-level of 0.05?

anova(model1, model2)

## Analysis of Variance Table
## 
## Model 1: CO2 ~ Model.Year + MPG + Weight
## Model 2: CO2 ~ Model.Year + Type + Weight + Horsepower + Acceleration + 
##     MPG
##   Res.Df     RSS Df Sum of Sq     F    Pr(>F)    
## 1   2056 1140671                                 
## 2   2051  981999  5    158672 66.28 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Q3a ANSWER: The p-value of the f-statistic is 2.2e-16, which is less than the alpha so we can reject the null. At least one of the additional predictors adds explanatory power.

4 pts Using \(R^2\) and adjusted \(R^2\), compare model1 and model2.

summary(model1)$r.squared

## [1] 0.9322435

summary(model2)$r.squared

## [1] 0.9416687

cat("model 1 adj r^2:", summary(model1)$adj.r.squared)

## model 1 adj r^2: 0.9321447

cat("model 2 adj r^2:", summary(model2)$adj.r.squared)

## model 2 adj r^2: 0.9414412

Q3b ANSWER: We compare the two models using adjusted R squared. Model 2 explains 94% of the variance in CO2, which is about 1% more than model 1. Model2 is tgeh better model and the additional predictors in model 2 adds more explanatory powe to the model.

4 pts Create a plot for the Cook’s Distances (use model2). Using a threshold of 1, are there any outliers? If yes, which data points?

cd <- cooks.distance(model2)
plot(cd, type = "h")
abline(h=1, col = "red")

Q3c ANSWER: There are no significant outliers using a threshold of 1.

4 pts Calculate the VIF of each predictor (use model2). Using a threshold of max(10, 1/(1-\(R^2\))) what conclusions can you make regarding multicollinearity?

library(car)

## Loading required package: carData

vif(model2)

##                   GVIF Df GVIF^(1/(2*Df))
## Model.Year   10.152350  1        3.186275
## Type          2.557766  3        1.169437
## Weight        7.733145  1        2.780853
## Horsepower    9.917250  1        3.149167
## Acceleration  7.365774  1        2.713996
## MPG           6.166968  1        2.483338

r2 <- summary(model2)$r.squared
threshold <- max(10, 1/(1-r2))

Q3d ANSWER: There are no predictors that surpass the threshold of 17.14, indicating there is no multicolinearity among variables in model 2.

Question 4: Prediction [3 points]

3 pts Using model1 and model2, predict the CO2 emissions for a vehicle with the following characteristics: Model.Year=2020, Type=“Sedan”, MPG=32, Weight=3400, Horsepower=203, Acceleration=8

newdata <- data.frame(Model.Year=2020,
                      Type= factor("Sedan", levels = c("Sedan", "SUV", "Truck", "Van")),
                      MPG=32,
                      Weight=3400,
                      Horsepower=203,
                      Acceleration=8)
levels(newdata$Type)

## [1] "Sedan" "SUV"   "Truck" "Van"

cat("Model 1 prediction:", predict(model1, newdata))

## Model 1 prediction: 223.5195

cat("Model 1 prediction:", predict(model2, newdata))

## Model 1 prediction: 226.5434

Q4 ANSWER: model1 prediction: 224 g/mi model2 prediction: 227 g/m

#transformation

library(MASS)
boxcox(model1)

inv_transf <- 1/data$CO2      
inv_sqrt_ <- 1/sqrt(data$CO2) 

sqrt_transf <- sqrt(data$CO2) 
log_transf <- log(data$CO2)   
#optimal lambda is close to -1, so we perform inverse