The provided dataset is a subset of the public data from the 2022 EPA Automotive Trends Report. It will be used to study the effects of various vehicle characteristics on CO2 emissions.
The dataset consists of a dataframe with 2060 observations with the following 7 variables:
To read the data in R, save the file in your working
directory (make sure you have changed the directory if different from
the R working directory) and read the data using the R
function read.csv()
# reading the dataset
data <- read.csv("vehicle_CO2_emis.csv")
head(data,3)
## Model.Year Type Weight Horsepower Acceleration MPG CO2
## 1 1995 SUV 3500 94 13.2003 20.99489 423.2935
## 2 1996 SUV 3500 130 11.1597 20.29168 437.9628
## 3 1997 SUV 3500 130 11.7893 18.92057 469.7003
cor(data$Model.Year, data$CO2)
## [1] -0.4794969
cor(data$MPG, data$CO2)
## [1] -0.9598317
cor(data$Weight, data$CO2)
## [1] 0.5163332
cor(data$Horsepower, data$CO2)
## [1] 0.007126864
cor(data$Acceleration, data$CO2)
## [1] 0.3117955
#cor(data[-2])
#cor(data[-c(2)])
Q1a ANSWER: My top 3 predictors are MPG= -0.96, weight= 0.52, model.year= -0.48 MPG has a very strong negative correlation with CO2. Weight has a moderateLy strong positive correlation with CO2. Model.year has a weaker negative correlation with CO2.
#convert categorical data to factor & identify baseline
data$Type <- as.factor(data$Type)
levels(data$Type)
## [1] "Sedan" "SUV" "Truck" "Van"
plot(data$Type, data$CO2)
#boxplot(CO2 ~ Type, data = data)
Q1b ANSWER: The boxplot shows CO2 emissions are different between vehcile types of sedan vs other types. We can conclude a relationship exist between different type and CO2. Sedan has the lowest mean, followed by SUV, Truck, & van.
par(mfrow=c(2,3))
plot(data$Model.Year, data$CO2)
abline(lm(CO2 ~ Model.Year, data = data), col = "red")
plot(data$MPG, data$CO2)
abline(lm(CO2 ~ MPG, data = data), col = "red")
plot(data$Weight, data$CO2)
abline(lm(CO2 ~ Weight, data = data), col = "red")
plot(data$Horsepower, data$CO2)
abline(lm(CO2 ~ Horsepower, data = data), col = "red")
plot(data$Acceleration, data$CO2)
abline(lm(CO2 ~ Acceleration, data = data), col = "red")
Q1c ANSWER: Model year - mod/weak negative trend. Variance in years before 1985 MPG - strong negative correlation, however there is some curving especially at the ends, which suggest the relationship may not be linear. Weight - strong positive correlation Horsepower - VERY weak (almost non-existent) positive relationship with a lot of clustering and variance Acceleration - weak positive upward direction with a lot of variance
Q1d ANSWER: Yes, a MLR model is reasonable because there are several variables that have strong relationships with C02 so we can fit a MLR model. There are some predictors with weak correlations that could possibly be included in the MLR after variable transformation.
model1 <- lm( CO2 ~ Model.Year + MPG + Weight, data = data)
summary(model1)
##
## Call:
## lm(formula = CO2 ~ Model.Year + MPG + Weight, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.891 -12.833 -5.383 5.787 262.204
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.764e+03 1.617e+02 23.27 <2e-16 ***
## Model.Year -1.547e+00 8.582e-02 -18.03 <2e-16 ***
## MPG -1.587e+01 2.653e-01 -59.83 <2e-16 ***
## Weight 2.747e-02 1.658e-03 16.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.55 on 2056 degrees of freedom
## Multiple R-squared: 0.9322, Adjusted R-squared: 0.9321
## F-statistic: 9429 on 3 and 2056 DF, p-value: < 2.2e-16
Q2b ANSWER: Estimated intercept = 3.764e4. This means CO2 emissions is 3764 is when all of the predictors are equal to zero. When weight, age and model year is zero, CO2 emissions will be 3764. This scenario may not be ideal in the real world.
anovamodel <- aov(CO2 ~ Type, data = data)
summary(anovamodel)
## Df Sum Sq Mean Sq F value Pr(>F)
## Type 3 3802499 1267500 200 <2e-16 ***
## Residuals 2056 13032369 6339
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Q2c ANSWER: Type is a significant predictor for CO2. It has a p-value of 2.2e-16 which is less than the alpha, so we can reject the null. Type is a significant predictor.
model2 <- lm(CO2 ~ ., data = data)
summary(model2)
##
## Call:
## lm(formula = CO2 ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.674 -11.644 -3.872 6.250 248.771
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.738e+03 2.437e+02 7.129 1.39e-12 ***
## Model.Year -5.280e-01 1.245e-01 -4.239 2.34e-05 ***
## TypeSUV -9.438e+00 1.676e+00 -5.632 2.03e-08 ***
## TypeTruck -1.891e+01 1.853e+00 -10.203 < 2e-16 ***
## TypeVan -2.440e+01 2.153e+00 -11.333 < 2e-16 ***
## Weight 4.586e-02 2.014e-03 22.774 < 2e-16 ***
## Horsepower -3.169e-01 2.511e-02 -12.623 < 2e-16 ***
## Acceleration 3.522e-01 4.744e-01 0.742 0.458
## MPG -1.684e+01 2.777e-01 -60.649 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.88 on 2051 degrees of freedom
## Multiple R-squared: 0.9417, Adjusted R-squared: 0.9414
## F-statistic: 4139 on 8 and 2051 DF, p-value: < 2.2e-16
levels(data$Type)
## [1] "Sedan" "SUV" "Truck" "Van"
Q2d ANSWER: All coefficients in previous model ae significant in this model.
Q2e ANSWER: Estimated TypeVan coefficient = -2.440e+01. This means on average, CO2 emissions for TypeVan is -24.4 less than TypeSedan, holding all other variables constant.
Q2f ANSWER: It contradicts the boxplot, which showed sedan had the lowest median. The boxplot shows the marginal relationship between CO2 and Type, without considering the other factors; whereas the regression model considers the other variables in the model. (conditional relationship)
Q2g ANSWER: Yes. Since the p-value for the overall F-Test is very close to zero (< 2.2e-16) and less than α = 0.05, we reject the null hypothesis that all estimated coefficients except the intercept are zero. This means that at least one of the slope coefficients is nonzero, and the overall model is statistically useful in predicting CO2.
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: CO2 ~ Model.Year + MPG + Weight
## Model 2: CO2 ~ Model.Year + Type + Weight + Horsepower + Acceleration +
## MPG
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2056 1140671
## 2 2051 981999 5 158672 66.28 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Q3a ANSWER: The p-value of the f-statistic is 2.2e-16, which is less than the alpha so we can reject the null. At least one of the additional predictors adds explanatory power.
summary(model1)$r.squared
## [1] 0.9322435
summary(model2)$r.squared
## [1] 0.9416687
cat("model 1 adj r^2:", summary(model1)$adj.r.squared)
## model 1 adj r^2: 0.9321447
cat("model 2 adj r^2:", summary(model2)$adj.r.squared)
## model 2 adj r^2: 0.9414412
Q3b ANSWER: We compare the two models using adjusted R squared. Model 2 explains 94% of the variance in CO2, which is about 1% more than model 1. Model2 is tgeh better model and the additional predictors in model 2 adds more explanatory powe to the model.
cd <- cooks.distance(model2)
plot(cd, type = "h")
abline(h=1, col = "red")
Q3c ANSWER: There are no significant outliers using a threshold of 1.
library(car)
## Loading required package: carData
vif(model2)
## GVIF Df GVIF^(1/(2*Df))
## Model.Year 10.152350 1 3.186275
## Type 2.557766 3 1.169437
## Weight 7.733145 1 2.780853
## Horsepower 9.917250 1 3.149167
## Acceleration 7.365774 1 2.713996
## MPG 6.166968 1 2.483338
r2 <- summary(model2)$r.squared
threshold <- max(10, 1/(1-r2))
Q3d ANSWER: There are no predictors that surpass the threshold of 17.14, indicating there is no multicolinearity among variables in model 2.
3 pts Using model1 and model2, predict the CO2 emissions for a vehicle with the following characteristics: Model.Year=2020, Type=“Sedan”, MPG=32, Weight=3400, Horsepower=203, Acceleration=8
newdata <- data.frame(Model.Year=2020,
Type= factor("Sedan", levels = c("Sedan", "SUV", "Truck", "Van")),
MPG=32,
Weight=3400,
Horsepower=203,
Acceleration=8)
levels(newdata$Type)
## [1] "Sedan" "SUV" "Truck" "Van"
cat("Model 1 prediction:", predict(model1, newdata))
## Model 1 prediction: 223.5195
cat("Model 1 prediction:", predict(model2, newdata))
## Model 1 prediction: 226.5434
Q4 ANSWER: model1 prediction: 224 g/mi model2 prediction: 227 g/m
#transformation
library(MASS)
boxcox(model1)
inv_transf <- 1/data$CO2
inv_sqrt_ <- 1/sqrt(data$CO2)
sqrt_transf <- sqrt(data$CO2)
log_transf <- log(data$CO2)
#optimal lambda is close to -1, so we perform inverse