How do different vehicle factors relate to CO2 emissions?

Introduction

I use the Environmental Protection Agency’s Automotive Trends dataset to answer this research question. The dataset covers automobiles delivered for sale in the U.S. since 1975 and includes CO2 emissions and fuel economy alongside factors such as weight and engine displacement.

Source: https://www.epa.gov/automotive-trends/explore-automotive-trends-data

Key Variables:
Real-World CO2 (g/mi)
Model Year
Weight (lbs)
Horsepower (HP)
Displacement
Mean Cylinders

Real-World CO2 (g/mi) is the outcome variable. I predict that the remaining variables will be significantly associated with vehicle CO2 emissions.

This dataset includes 5500 observations and 60 variables.

Load libraries and dataset

library(tidyverse)
library(car)
cars <- read_csv("epacars.csv")

Data Analysis

After cleaning the data, I will select the key variables, filter out summary rows (“All”), and keep only CO2-emitting internal combustion vehicles. I will then fit a multiple linear regression model and generate a series of diagnostic plots to check its assumptions.

Inspect data

head(cars)
## # A tibble: 6 × 60
##   Manufacturer `Model Year` `Regulatory Class` `Vehicle Type` `Production (000)`
##   <chr>        <chr>        <chr>              <chr>          <chr>             
## 1 All          1991         Car                Car SUV        224               
## 2 All          1992         Car                Car SUV        243               
## 3 All          1993         Car                Car SUV        473               
## 4 All          1994         Car                Car SUV        332               
## 5 All          1995         Car                Car SUV        220               
## 6 All          1996         Car                Car SUV        287               
## # ℹ 55 more variables: `Production Share` <chr>, `2-Cycle MPG` <chr>,
## #   `Real-World MPG` <chr>, `Real-World MPG_City` <chr>,
## #   `Real-World MPG_Hwy` <chr>, `Real-World CO2 (g/mi)` <chr>,
## #   `Real-World CO2_City (g/mi)` <chr>, `Real-World CO2_Hwy (g/mi)` <chr>,
## #   `Weight (lbs)` <chr>, `Footprint (sq. ft.)` <chr>,
## #   `Engine Displacement` <chr>, `Horsepower (HP)` <chr>,
## #   `Acceleration (0-60 time in seconds)` <chr>, …
#str(cars)

Cleaning

#Clean variable names
names(cars) <- gsub("\\(|\\)", "", names(cars)) #remove parentheses 
names(cars) <- gsub("g/mi", "", names(cars), fixed = TRUE) #remove the unit g/mi only (not every "mi")
names(cars) <- gsub("-", "_", names(cars)) #sub hyphens w/ underscores
names(cars) <- gsub(" ", "_", names(cars))   #sub spaces w/ underscores
names(cars) <- gsub("_$", "", names(cars)) #remove trailing underscore
names(cars) <- tolower(names(cars))      #variable names lowercase
#Clean data (done before type conversion, while the text markers are still present)
cars[cars == "-"] <- NA   #mark missing values as NA
cars$model_year[cars$model_year == "Prelim. 2024"] <- "2024"  #relabel preliminary 2024 rows
#Convert variables to proper classes
cars$engine_displacement <- as.numeric(cars$engine_displacement)
cars$real_world_co2 <- as.numeric(cars$real_world_co2)
cars$weight_lbs <- as.numeric(cars$weight_lbs)
cars$horsepower_hp <- as.numeric(cars$horsepower_hp)
cars$cylinders_in_gasoline_ice_vehicles <- as.numeric(cars$cylinders_in_gasoline_ice_vehicles)
cars$model_year <- as.numeric(cars$model_year)
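
The renaming steps above can be checked on a handful of raw column names before they are applied to the full dataset. The sketch below is illustrative only: `clean_names` is a hypothetical helper that is not part of the original analysis, and it assumes the unit suffix to strip is the literal text "g/mi".

```r
# Hypothetical helper (not in the original analysis) that applies the
# renaming steps to any character vector of column names.
clean_names <- function(x) {
  x <- gsub("\\(|\\)", "", x)              # remove parentheses
  x <- gsub("g/mi", "", x, fixed = TRUE)   # remove the literal unit "g/mi"
  x <- gsub("-", "_", x)                   # hyphens to underscores
  x <- gsub(" ", "_", x)                   # spaces to underscores
  x <- gsub("_$", "", x)                   # drop a trailing underscore
  tolower(x)                               # lowercase everything
}

# A few raw names taken from the head(cars) output
raw_names <- c("Model Year", "Real-World CO2 (g/mi)", "Weight (lbs)",
               "Horsepower (HP)", "Engine Displacement")
cleaned <- clean_names(raw_names)
print(cleaned)
```

Running the helper on a small vector like this makes it easy to confirm that the names used later, such as `real_world_co2` and `weight_lbs`, come out as expected.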

Filtering for useful information

cars2 <- cars |>
  select(c(manufacturer, 
           model_year, 
           vehicle_type,
           real_world_co2, 
           weight_lbs, 
           horsepower_hp, 
           engine_displacement, 
           cylinders_in_gasoline_ice_vehicles)) |>
  #refer to columns directly inside filter() rather than through cars$
  filter(manufacturer != "All",
         !vehicle_type %in% c("All", "All Car", "All Truck"),
         cylinders_in_gasoline_ice_vehicles > 1) |>
  arrange(desc(real_world_co2))

Regression Analysis

I use a multiple linear regression model because several predictors are used to explain a continuous outcome variable (CO2).

#Multiple linear regression model
cars_model <- lm(real_world_co2 ~ model_year + engine_displacement + vehicle_type, data = cars2)

summary(cars_model)
## 
## Call:
## lm(formula = real_world_co2 ~ model_year + engine_displacement + 
##     vehicle_type, data = cars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -174.21  -27.86   -3.52   24.29  347.76 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             7214.10973  150.92974  47.798  < 2e-16 ***
## model_year                -3.48508    0.07492 -46.515  < 2e-16 ***
## engine_displacement        1.12764    0.01958  57.594  < 2e-16 ***
## vehicle_typeMinivan/Van    7.18631    3.51685   2.043  0.04114 *  
## vehicle_typePickup       -10.31128    3.64817  -2.826  0.00475 ** 
## vehicle_typeSedan/Wagon  -45.91424    3.07815 -14.916  < 2e-16 ***
## vehicle_typeTruck SUV     24.17041    3.28870   7.350 2.88e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.93 on 2000 degrees of freedom
##   (47 observations deleted due to missingness)
## Multiple R-squared:  0.824,  Adjusted R-squared:  0.8235 
## F-statistic:  1561 on 6 and 2000 DF,  p-value: < 2.2e-16

Coefficients

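Model Year: ~-3.49, meaning each additional model year is associated with about 3.49 g/mi less CO2, holding the other predictors constant.
Engine displacement: ~1.13, meaning each one-unit increase in displacement is associated with about 1.13 g/mi more CO2, holding the other predictors constant.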

Vehicle Type (vs Car SUV):

Minivan/van: +7.19 (g/mi)
Pickup: -10.31 (g/mi)
Sedan/wagon: -45.91 (g/mi)
Truck SUV: +24.17 (g/mi)
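
As a worked example of how these coefficients combine, the sketch below rebuilds a prediction by hand from the estimates in the summary output. The function `predict_co2` and the example vehicle (model year 2020, displacement of 150 in the dataset's units) are hypothetical and only serve to illustrate the arithmetic.

```r
# Estimates copied from the summary(cars_model) output
coefs <- c(intercept = 7214.10973,
           model_year = -3.48508,
           engine_displacement = 1.12764)

# Offsets relative to the baseline vehicle type (Car SUV)
type_offsets <- c("Car SUV" = 0,
                  "Minivan/Van" = 7.18631,
                  "Pickup" = -10.31128,
                  "Sedan/Wagon" = -45.91424,
                  "Truck SUV" = 24.17041)

# Hypothetical helper: predicted CO2 (g/mi) from the fitted equation
predict_co2 <- function(year, displacement, type) {
  unname(coefs["intercept"] +
           coefs["model_year"] * year +
           coefs["engine_displacement"] * displacement +
           type_offsets[type])
}

# Hypothetical example vehicle, compared across two vehicle types
sedan_co2 <- predict_co2(2020, 150, "Sedan/Wagon")
truck_suv_co2 <- predict_co2(2020, 150, "Truck SUV")
print(c(sedan = sedan_co2, truck_suv = truck_suv_co2))
```

For this hypothetical vehicle the sketch gives roughly 297 g/mi as a Sedan/Wagon, and switching the type to Truck SUV raises the prediction by about 70 g/mi (24.17 + 45.91).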

Model Assumptions and Diagnostics

I will check assumptions for linearity, independence, homoscedasticity, normality, and multicollinearity using diagnostic plots and correlation matrices.

Diagnostic plots

par(mfrow=c(2,2)); plot(cars_model); par(mfrow=c(1,1))

Linearity:

#Check linearity through component + residual plots
crPlots(cars_model)

The component + residual plots show clear linear trends for each numeric predictor, with a positive relationship between CO2 and engine displacement and a negative relationship between CO2 and model year. In addition, the residuals vs fitted plot shows a mostly even distribution of residuals around zero. Overall, linearity is reasonable.

Independence of observations:

#Check independence through residuals vs. order plot
plot(resid(cars_model), type="b",
     main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)

The residuals vs order plot shows residuals centered around zero, with slight deviations near index ≈ 0 and index ≈ 4000 that are not large enough to invalidate the model. Overall, the plot does not suggest a violation of independence.
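
A numeric complement to the visual check is the Durbin-Watson statistic, where values near 2 suggest little lag-1 autocorrelation. This was not part of the original analysis. The sketch below uses a hypothetical helper, `durbin_watson`, on simulated residuals. With the fitted model, the call would be `durbin_watson(resid(cars_model))`, bearing in mind that the result depends on row order and that `cars2` is sorted by CO2.

```r
# Hypothetical helper: Durbin-Watson statistic for a residual vector
durbin_watson <- function(e) sum(diff(e)^2) / sum(e^2)

set.seed(1)

# Simulated independent residuals (statistic should be close to 2)
e_indep <- rnorm(2000)

# Simulated AR(1) residuals with coefficient 0.8 (statistic well below 2);
# stats::filter is named explicitly because dplyr masks filter()
e_auto <- as.numeric(stats::filter(rnorm(2000), 0.8, method = "recursive"))

dw_indep <- durbin_watson(e_indep)
dw_auto <- durbin_watson(e_auto)
print(c(independent = dw_indep, autocorrelated = dw_auto))
```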

Homoscedasticity

In the residuals vs fitted plot, the points are spread evenly around zero across the range of fitted values, which indicates homoscedasticity. The scale-location plot’s line is also mostly horizontal with an even spread of points, which supports the same conclusion.

Normality of residuals

The points in the Q-Q plot deviate slightly from the reference line at the left and right tails, which indicates mild non-normality in the residuals, but the departure is not severe enough to invalidate the model.

Multicollinearity

cor(cars2[, c("model_year", "engine_displacement")], use = "complete.obs")
##                     model_year engine_displacement
## model_year           1.0000000          -0.1385737
## engine_displacement -0.1385737           1.0000000

The correlation between model year and engine displacement is weak (r ≈ -0.14), so there is very little multicollinearity between the numeric predictors in the model.
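
For two predictors, the correlation can be converted into a variance inflation factor with VIF = 1 / (1 - r^2). The sketch below applies that formula to the correlation reported above. It ignores vehicle type, which is an assumption made to keep the arithmetic simple. For the full model, `vif(cars_model)` from the car package would give generalized VIFs that also account for the factor.

```r
# Correlation between model_year and engine_displacement, from the matrix above
r <- -0.1385737

# Two-predictor variance inflation factor (vehicle_type is ignored here)
vif_two_predictors <- 1 / (1 - r^2)
print(round(vif_two_predictors, 4))
```

A value this close to 1 means the variance of each coefficient is barely inflated by the other numeric predictor.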

Conclusion and Future Directions

Key Findings:

Newer cars emit less CO2, while engines with larger displacement emit more. After adjusting for model year and displacement, sedans and wagons emit the least, while truck SUVs emit the most.

Implications:

My results answer the question of how various vehicle factors (vehicle type, year, and displacement) contribute to their CO2 emissions. This information is useful for those who wish to purchase an internal combustion car whilst being more environmentally conscious, taking into account variables that may increase their contribution to climate change. Manufacturers may also take these variables into account when designing their cars to meet emission regulations.

Model Fit and Limitations:

The model’s R^2 is 0.824, meaning 82.4% of the variation in CO2 is explained by the model, which is a strong fit. It is, however, limited by a small number of predictors, so it may be missing significant variables that would better explain variance in CO2. This limitation is largely a result of multicollinearity among the other candidate variables. For example, there was high correlation between horsepower and model year (0.78), horsepower and weight (0.8), and displacement and mean cylinders (0.99). These variables were either omitted or replaced with others.
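
The fit statistics in the summary output can be cross-checked from R^2 and the degrees of freedom. The sketch below is a rough check that uses the rounded R^2 of 0.824, so the results agree with the reported values only up to rounding. The variable names are hypothetical.

```r
# Values read from the summary(cars_model) output
r2 <- 0.824        # multiple R-squared (rounded)
df_resid <- 2000   # residual degrees of freedom
p <- 6             # number of estimated slopes (excluding the intercept)
n <- df_resid + p + 1   # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic on p and df_resid degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(c(n = n, adj_r2 = round(adj_r2, 4), f_stat = round(f_stat)))
```

Both values land close to the reported adjusted R^2 of 0.8235 and F-statistic of 1561, which suggests the output is internally consistent.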

Future Research:

In the future, I would like to explore more variables to increase the accuracy of my model. For example, I would compare the emissions of turbocharged and naturally aspirated engines, or those of diesel cars. I would also look into why certain vehicle types produce more emissions than others, since I predicted pickups would be highest and they are not. Since newer cars emit less, likely due to environmental regulations, I would like to see how manufacturers are adjusting their cars to comply with these laws. In addition, I would explore emissions from automobiles in different parts of the world to see how differing regulations contribute to national carbon footprints.

References

Environmental Protection Agency - https://www.epa.gov/automotive-trends/explore-automotive-trends-data