How do different vehicle factors relate to their CO2 emissions?
I will be using the Environmental Protection Agency’s Automotive
Trends Dataset to answer the research question. This dataset includes
information on various automobiles delivered for sale in the U.S. since
1975, including CO2 emissions and fuel economy alongside factors such as
weight and engine displacement.
Source: https://www.epa.gov/automotive-trends/explore-automotive-trends-data
Key Variables:
Real-World CO2 (g/mi)
Model Year
Weight (lbs)
Horsepower (HP)
Displacement
Mean Cylinders
I predict these variables will significantly affect vehicle CO2
emissions.
This dataset includes 5500 observations and 60 variables.
library(tidyverse)
library(car)
cars <- read_csv("epacars.csv")
After cleaning the data, I will select the key variables, filter out summary rows (“All”), and keep only CO2-emitting internal combustion vehicles. I will then generate a series of diagnostic plots to check the assumptions of my multiple linear regression model.
head(cars)
## # A tibble: 6 × 60
## Manufacturer `Model Year` `Regulatory Class` `Vehicle Type` `Production (000)`
## <chr> <chr> <chr> <chr> <chr>
## 1 All 1991 Car Car SUV 224
## 2 All 1992 Car Car SUV 243
## 3 All 1993 Car Car SUV 473
## 4 All 1994 Car Car SUV 332
## 5 All 1995 Car Car SUV 220
## 6 All 1996 Car Car SUV 287
## # ℹ 55 more variables: `Production Share` <chr>, `2-Cycle MPG` <chr>,
## # `Real-World MPG` <chr>, `Real-World MPG_City` <chr>,
## # `Real-World MPG_Hwy` <chr>, `Real-World CO2 (g/mi)` <chr>,
## # `Real-World CO2_City (g/mi)` <chr>, `Real-World CO2_Hwy (g/mi)` <chr>,
## # `Weight (lbs)` <chr>, `Footprint (sq. ft.)` <chr>,
## # `Engine Displacement` <chr>, `Horsepower (HP)` <chr>,
## # `Acceleration (0-60 time in seconds)` <chr>, …
#str(cars)
#Clean variable names
names(cars) <- gsub("\\(|\\)", "", names(cars)) #remove parentheses
names(cars) <- gsub("g/mi", "", names(cars), fixed = TRUE) #remove the g/mi unit only
names(cars) <- gsub("-", "_", names(cars)) #sub hyphens w/ underscores
names(cars) <- gsub(" ", "_", names(cars)) #sub spaces w/ underscores
names(cars) <- gsub("_$", "", names(cars)) #remove trailing underscore
names(cars) <- tolower(names(cars)) #variable names lowercase
#Clean data before converting classes
cars[cars == "-"] <- NA #mark missing values as NA
cars$model_year[cars$model_year == "Prelim. 2024"] <- "2024" #relabel preliminary 2024 rows
#Convert variables to proper classes (after cleaning, so text labels are not coerced to NA)
cars$engine_displacement <- as.numeric(cars$engine_displacement)
cars$real_world_co2 <- as.numeric(cars$real_world_co2)
cars$weight_lbs <- as.numeric(cars$weight_lbs)
cars$horsepower_hp <- as.numeric(cars$horsepower_hp)
cars$cylinders_in_gasoline_ice_vehicles <- as.numeric(cars$cylinders_in_gasoline_ice_vehicles)
cars$model_year <- as.numeric(cars$model_year)
cars2 <- cars |>
  select(manufacturer,
         model_year,
         vehicle_type,
         real_world_co2,
         weight_lbs,
         horsepower_hp,
         engine_displacement,
         cylinders_in_gasoline_ice_vehicles) |>
  filter(manufacturer != "All", #drop summary rows
         !vehicle_type %in% c("All", "All Car", "All Truck"),
         cylinders_in_gasoline_ice_vehicles > 1) |> #keep internal combustion vehicles
  arrange(desc(real_world_co2))
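The cleaning steps can be tried on a few toy values before running them on the full file. The sketch below is a minimal illustration rather than the report's pipeline: `clean_name()` is a hypothetical helper, and the sample column names and year values are made up to resemble the raw file. It also shows why a text label such as "Prelim. 2024" has to be relabeled before numeric conversion.

```r
# Hypothetical helper mirroring the name-cleaning steps (illustration only)
clean_name <- function(x) {
  x <- gsub("\\(|\\)", "", x)             # remove parentheses
  x <- gsub("g/mi", "", x, fixed = TRUE)  # remove the g/mi unit
  x <- gsub("-", "_", x, fixed = TRUE)    # hyphens -> underscores
  x <- gsub(" ", "_", x, fixed = TRUE)    # spaces -> underscores
  x <- gsub("_$", "", x)                  # drop a trailing underscore
  tolower(x)
}

raw_names <- c("Real-World CO2 (g/mi)", "Weight (lbs)", "Model Year")
cleaned_names <- clean_name(raw_names)
print(cleaned_names)

# Made-up year values: coercing first turns the preliminary label into NA
year_raw <- c("2022", "2023", "Prelim. 2024")
coerced_first <- suppressWarnings(as.numeric(year_raw))

# Relabeling first keeps the preliminary 2024 value
year_fixed <- year_raw
year_fixed[year_fixed == "Prelim. 2024"] <- "2024"
relabeled_first <- as.numeric(year_fixed)

print(coerced_first)
print(relabeled_first)
```

Rows whose year becomes NA are silently dropped by `lm()`, so the order of these two steps changes which observations reach the model.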
I will be using a multiple linear regression model because I am using multiple variables to predict a continuous outcome variable (CO2).
#Multiple linear regression model
cars_model <- lm(real_world_co2 ~ model_year + engine_displacement + vehicle_type, data = cars2)
summary(cars_model)
##
## Call:
## lm(formula = real_world_co2 ~ model_year + engine_displacement +
## vehicle_type, data = cars2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -174.21 -27.86 -3.52 24.29 347.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7214.10973 150.92974 47.798 < 2e-16 ***
## model_year -3.48508 0.07492 -46.515 < 2e-16 ***
## engine_displacement 1.12764 0.01958 57.594 < 2e-16 ***
## vehicle_typeMinivan/Van 7.18631 3.51685 2.043 0.04114 *
## vehicle_typePickup -10.31128 3.64817 -2.826 0.00475 **
## vehicle_typeSedan/Wagon -45.91424 3.07815 -14.916 < 2e-16 ***
## vehicle_typeTruck SUV 24.17041 3.28870 7.350 2.88e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.93 on 2000 degrees of freedom
## (47 observations deleted due to missingness)
## Multiple R-squared: 0.824, Adjusted R-squared: 0.8235
## F-statistic: 1561 on 6 and 2000 DF, p-value: < 2.2e-16
Coefficients
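Model Year: ~-3.49, meaning each additional model year is associated with about 3.49 g/mi less CO2, holding engine displacement and vehicle type constant.
Engine displacement: ~1.13, meaning each one-unit increase in displacement is associated with about 1.13 g/mi more CO2, holding the other predictors constant.
Vehicle Type (vs Car SUV, the reference level):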
Minivan/van: +7.19 (g/mi)
Pickup: -10.31 (g/mi)
Sedan/wagon: -45.91 (g/mi)
Truck SUV: +24.17 (g/mi)
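To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below copies the estimates from the model summary and plugs in a hypothetical vehicle; the model year (2020) and displacement value (150) are inputs chosen for illustration and are not rows from the dataset.

```r
# Coefficients copied from the summary(cars_model) output
b <- c(intercept = 7214.10973,
       model_year = -3.48508,
       engine_displacement = 1.12764,
       sedan_wagon = -45.91424)

# Hypothetical vehicle (illustrative inputs, not taken from the data)
year <- 2020
displacement <- 150

# Sedan/Wagon: add its offset relative to the Car SUV reference level
pred_sedan <- b[["intercept"]] + b[["model_year"]] * year +
  b[["engine_displacement"]] * displacement + b[["sedan_wagon"]]

# Same inputs for a Car SUV (reference level, so no offset)
pred_car_suv <- pred_sedan - b[["sedan_wagon"]]

print(round(c(sedan_wagon = pred_sedan, car_suv = pred_car_suv), 1))
```

With these inputs the equation gives roughly 297.5 g/mi for the sedan/wagon and 343.4 g/mi for the Car SUV; the 45.9 g/mi gap is exactly the Sedan/Wagon coefficient.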
I will check assumptions for linearity, independence, homoscedasticity, normality, and multicollinearity using diagnostic plots and correlation matrices.
Diagnostic plots
par(mfrow=c(2,2)); plot(cars_model); par(mfrow=c(1,1))
Linearity:
#Check linearity through component + residual plots
crPlots(cars_model)
The component + residual plots show roughly linear trends for each
numeric predictor, with a positive relationship between CO2 and engine
displacement and a negative relationship between CO2 and model year. In
addition, the residuals vs fitted plot reveals a mostly even
distribution of residuals. Overall, linearity is reasonable.
Independence of observations:
#Check independence through residuals vs. order plot
plot(resid(cars_model), type="b",
main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)
The residuals vs order plot shows residuals centered around zero, with
slight deviations at index ≈ 0 and index ≈ 4000 that are not enough to
invalidate the model. Overall, the plot gives no strong evidence against independence.
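A visual check can be complemented by a numeric one. As a sketch that is not part of the original analysis, the Durbin-Watson statistic can be computed by hand from the residuals; values near 2 suggest little first-order autocorrelation. The simulated data and the names `toy`, `toy_model`, and `dw_stat` are assumptions for illustration; with the real model, `resid(cars_model)` would take the place of the toy residuals. The statistic depends on row order, and `cars2` is sorted by CO2 rather than by time, so the result should be read with that in mind.

```r
set.seed(1)

# Simulated data with independent errors (illustration only)
toy <- data.frame(x = 1:200)
toy$y <- 3 + 0.5 * toy$x + rnorm(200)
toy_model <- lm(y ~ x, data = toy)

# Durbin-Watson statistic: squared successive differences over squared residuals
e <- resid(toy_model)
dw_stat <- sum(diff(e)^2) / sum(e^2)
print(dw_stat)
```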
Homoscedasticity
In the residuals vs fitted plot, the points are spread evenly around
zero across the range of fitted values, which indicates homoscedasticity.
In addition, the scale-location plot’s line is mostly horizontal with an
even spread of points, which supports the same conclusion.
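The visual impression can be backed by a formal test. The sketch below, which is not part of the original analysis, computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand on simulated data whose error spread grows with the predictor. All names and data here are assumptions for illustration; for the real model the auxiliary regression would use the same predictors as `cars_model`, with the degrees of freedom set to the number of slopes. The `car` package also offers `ncvTest()` as a related score test.

```r
set.seed(2)

# Simulated data where the error spread increases with x (illustration only)
n <- 300
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
toy_model <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor
aux <- lm(resid(toy_model)^2 ~ x)

# Test statistic n * R^2, compared to a chi-squared with 1 df (one predictor)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value points to non-constant variance; a large one is consistent with the homoscedasticity seen in the plots.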
Normality of residuals
The residuals in the Q-Q plot deviate slightly from the reference line
at the left and right tails, which indicates some non-normality, but this
is not severe enough to invalidate the model.
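A Shapiro-Wilk test is one way to put a number on the Q-Q plot. The sketch below uses the built-in `mtcars` data as a stand-in, since it is not the EPA dataset and the model shown is not the report's model; with the real data, `resid(cars_model)` would be passed instead. `shapiro.test()` accepts between 3 and 5000 values.

```r
# Stand-in model on built-in data (illustration only, not the EPA data)
toy_model <- lm(mpg ~ wt + hp, data = mtcars)

# Shapiro-Wilk test on the residuals; small p-values suggest non-normality
sw <- shapiro.test(resid(toy_model))
print(sw$statistic)
print(sw$p.value)
```

With a couple of thousand observations, even small tail deviations can produce a small p-value, so the test is best read alongside the Q-Q plot rather than in place of it.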
Multicollinearity
cor(cars2[, c("model_year", "engine_displacement")], use = "complete.obs")
## model_year engine_displacement
## model_year 1.0000000 -0.1385737
## engine_displacement -0.1385737 1.0000000
There is little correlation between model year and engine displacement (r ≈ -0.14), which suggests very little multicollinearity between the numeric predictors.
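Pairwise correlations only compare two predictors at a time; a variance inflation factor (VIF) checks each predictor against all of the others. The sketch below computes VIFs by hand as 1 / (1 - R^2), using the built-in `mtcars` data as a stand-in (the columns and the `vif_manual` name are assumptions for illustration). For a model that includes a factor such as vehicle type, `car::vif(cars_model)` reports generalized VIFs instead.

```r
# Stand-in predictors from built-in data (illustration only)
predictors <- mtcars[, c("wt", "hp", "disp")]

# For each predictor, regress it on the others and convert R^2 to a VIF
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif_manual, 2))
```

A VIF of 1 means a predictor is uncorrelated with the rest; common rules of thumb treat values above roughly 5 to 10 as a sign of problematic multicollinearity.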
Key Findings:
Newer cars emit less CO2, while large engines with higher
displacement emit more CO2. Sedans and wagons emit the least, while
truck SUVs emit the most.
Implications:
My results answer the question of how various vehicle factors
(vehicle type, year, and displacement) contribute to their CO2
emissions. This information is useful for those who wish to purchase an
internal combustion car whilst being more environmentally conscious,
taking into account variables that may increase their contribution to
climate change. Manufacturers may also take these variables into account
when designing their cars to meet emission regulations.
Model Fit and Limitations:
The model’s R^2 is 0.824, meaning 82.4% of the variation in CO2 is
explained by the model, which indicates a strong fit. It is, however,
limited by a small number of predictors. As such, the model may be missing
variables that would explain more of the variance in CO2. This
was largely a result of multicollinearity among the candidate
variables. For example, there was high correlation between
horsepower and model year (0.78), horsepower and weight (0.8), and
displacement and mean cylinders (0.99). These variables were either
omitted or replaced with others.
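The fit statistics in the model summary are tied together by simple formulas, which makes them easy to sanity-check. The sketch below starts from the rounded R^2 (0.824), the residual degrees of freedom (2000), and the six estimated slopes reported in the output, then recovers the adjusted R^2 and the F-statistic; the variable names are illustrative.

```r
r2 <- 0.824        # Multiple R-squared from the model summary
df_resid <- 2000   # residual degrees of freedom
k <- 6             # estimated slopes (excluding the intercept)
n <- df_resid + k + 1   # observations used in the fit

# Adjusted R^2 penalizes R^2 for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic expressed through R^2
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

print(round(adj_r2, 4))
print(round(f_stat))
```

Both values agree with the reported output (0.8235 and 1561), and the small gap between R^2 and adjusted R^2 reflects how few predictors the model uses relative to its 2007 observations.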
Future Research:
In the future, I would like to explore more variables to increase the accuracy of my model. For example, I would explore the emissions of turbocharged vs naturally aspirated engines or those of diesel cars. I would also look into why certain vehicle types produce more emissions than others, since I expected pickups to emit the most, but they do not. Since learning that newer cars emit less, likely due to environmental regulations, I would like to see how manufacturers are adjusting their cars to comply with these laws. In addition, I would explore emissions from automobiles in different parts of the world to see how differing regulations contribute to national carbon footprints.
Environmental Protection Agency - https://www.epa.gov/automotive-trends/explore-automotive-trends-data