Final Project

Author

K Bedassa

Introduction

Topic and Dataset Overview

Transportation is the largest source of greenhouse gas emissions in the United States, accounting for around 28% of all greenhouse gas emissions from the United States (EPA, 2024). I chose this data set for my project not only because I am interested in transportation and driving itself, but also because I will be purchasing a vehicle in the near future and I want to understand more about the characteristics of vehicles that burn less fuel and produce fewer emissions. Furthermore, I also chose this data set because it is real data that was collected from the government about their vehicles.

Questions to Explore

Based on engine size, number of cylinders, type of drive system, and vehicle class, can you identify the factors that help to determine the CO2 emissions of gasoline-powered vehicles?
By comparing the fuel economy of each of the different vehicle classes and fuel types, can you determine the relative fuel economy of each class of vehicles?
Are there visual relationships between the greenhouse gas score of vehicles, the air pollution score of vehicles, and the number of SmartWay certified vehicles in each class?

Why I Chose This Topic

Background Research

The fuel economy testing procedures established by the EPA have been in place since the 1970s and were revised in 2008. According to researchers from the Rocky Mountain Institute (RMI), improvements in fuel economy for vehicles are “one of the fastest, most cost-effective paths to reducing transportation emissions” (RMI, 2023). Furthermore, a study published in the journal Nature Energy in 2022 found that the relationship between the engine displacement of gasoline-powered vehicles and the amount of carbon dioxide that is emitted by those vehicles is nearly linear; however, the introduction of electric vehicles creates a different relationship between these two variables (Davis & Figliozzi, 2022).

The EPA also established the SmartWay transportation program in 2004. This program certifies vehicles that have high fuel economy and low air pollution scores. According to the EPA’s website for the SmartWay program, “SmartWay Elite vehicles represent the top environmental performers - those that score 6 or higher on both the Air Pollution Score and the Greenhouse Gas Score” (EPA, 2025).

References:

U.S. EPA. (2024). Inventory of U.S. Greenhouse Gas Emissions and Sinks: 1990–2022. U.S. Environmental Protection Agency. https://www.epa.gov/ghgemissions/inventory-us-greenhouse-gas-emissions-and-sinks
Rocky Mountain Institute. (2023). The Clean Transportation Opportunity. RMI. https://rmi.org/clean-transportation
Davis, B. A., & Figliozzi, M. A. (2022). A framework to compare plug-in electric vehicles and competing vehicles. Nature Energy, 7(3), 211–218.
U.S. EPA. (2025). SmartWay Vehicle Certification. https://www.epa.gov/smartway

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)
library(ggplot2)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(scales)


Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

library(viridis)

Loading required package: viridisLite

Attaching package: 'viridis'

The following object is masked from 'package:scales':

    viridis_pal

library(broom)      
library(ggthemes)     
library(knitr)

epa <- read_excel("all_alpha_25.xlsx")
glimpse(epa)

Rows: 2,492
Columns: 18
$ Model                  <chr> "ACURA Integra", "ACURA Integra", "ACURA Integr…
$ Displ                  <chr> "1.5", "1.5", "2", "2", "1.5", "1.5", "1.5", "1…
$ Cyl                    <chr> "4", "4", "4", "4", "4", "4", "4", "4", "6", "6…
$ Trans                  <chr> "SCV-7", "SCV-7", "Man-6", "Man-6", "Man-6", "M…
$ Drive                  <chr> "2WD", "2WD", "2WD", "2WD", "2WD", "2WD", "2WD"…
$ Fuel                   <chr> "Gasoline", "Gasoline", "Gasoline", "Gasoline",…
$ `Cert Region`          <chr> "CA", "FA", "CA", "FA", "CA", "FA", "CA", "FA",…
$ Stnd                   <chr> "L3SULEV30", "T3B30", "L3ULEV50", "T3B50", "L3U…
$ `Stnd Description`     <chr> "California LEV-III SULEV30", "Federal Tier 3 B…
$ `Underhood ID`         <chr> "SHNXV01.54EC", "SHNXV01.54EC", "SHNXV02.0TDC",…
$ `Veh Class`            <chr> "large car", "large car", "large car", "large c…
$ `Air Pollution Score`  <dbl> 6, 6, 5, 5, 5, 5, 6, 6, 4, 4, 4, 4, 4, 4, 5, 5,…
$ `City MPG`             <chr> "30", "30", "21", "21", "26", "26", "29", "29",…
$ `Hwy MPG`              <chr> "37", "37", "28", "28", "36", "36", "36", "36",…
$ `Cmb MPG`              <chr> "33", "33", "24", "24", "30", "30", "32", "32",…
$ `Greenhouse Gas Score` <chr> "6", "6", "5", "5", "6", "6", "6", "6", "5", "5…
$ SmartWay               <chr> "No", "No", "No", "No", "No", "No", "No", "No",…
$ `Comb CO2`             <chr> "269", "269", "371", "371", "293", "293", "277"…

cat("Rows:", nrow(epa), "\nColumns:", ncol(epa))

Rows: 2492 
Columns: 18

head(epa)

# A tibble: 6 × 18
  Model     Displ Cyl   Trans Drive Fuel  `Cert Region` Stnd  `Stnd Description`
  <chr>     <chr> <chr> <chr> <chr> <chr> <chr>         <chr> <chr>             
1 ACURA In… 1.5   4     SCV-7 2WD   Gaso… CA            L3SU… California LEV-II…
2 ACURA In… 1.5   4     SCV-7 2WD   Gaso… FA            T3B30 Federal Tier 3 Bi…
3 ACURA In… 2     4     Man-6 2WD   Gaso… CA            L3UL… California LEV-II…
4 ACURA In… 2     4     Man-6 2WD   Gaso… FA            T3B50 Federal Tier 3 Bi…
5 ACURA In… 1.5   4     Man-6 2WD   Gaso… CA            L3UL… California LEV-II…
6 ACURA In… 1.5   4     Man-6 2WD   Gaso… FA            T3B50 Federal Tier 3 Bi…
# ℹ 9 more variables: `Underhood ID` <chr>, `Veh Class` <chr>,
#   `Air Pollution Score` <dbl>, `City MPG` <chr>, `Hwy MPG` <chr>,
#   `Cmb MPG` <chr>, `Greenhouse Gas Score` <chr>, SmartWay <chr>,
#   `Comb CO2` <chr>

epa <- epa %>%
  rename_with(~ gsub(" ", "_", .x)) %>%
  rename(
    Air_Pollution = Air_Pollution_Score,
    GHG_Score     = Greenhouse_Gas_Score
  )

epa <- epa %>%
  mutate(
    Displ         = as.numeric(Displ),
    Cyl           = as.numeric(Cyl),
    Cmb_MPG       = as.numeric(Cmb_MPG),
    City_MPG      = as.numeric(City_MPG),
    Hwy_MPG       = as.numeric(Hwy_MPG),
    Comb_CO2      = as.numeric(Comb_CO2),
    GHG_Score     = as.numeric(GHG_Score),
    Air_Pollution = as.numeric(Air_Pollution),
    SmartWay      = factor(SmartWay, levels = c("No", "Yes", "Elite")),
    Drive         = as.factor(Drive),
    Fuel          = as.factor(Fuel),
    Veh_Class     = as.factor(Veh_Class)
  )

Warning: There were 6 warnings in `mutate()`.
The first warning was:
ℹ In argument: `Displ = as.numeric(Displ)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 5 remaining warnings.

epa_gas <- epa %>%
  filter(Fuel == "Gasoline")

cat("Gasoline vehicles retained:", nrow(epa_gas))

Gasoline vehicles retained: 1651

summary <- epa_gas %>%
  group_by(Veh_Class) %>%
  summarize(
    Avg_CO2    = round(mean(Comb_CO2, na.rm = TRUE), 1),
    Avg_MPG    = round(mean(Cmb_MPG, na.rm = TRUE), 1),
    Avg_Displ  = round(mean(Displ, na.rm = TRUE), 2),
    Count      = n()
  ) %>%
  arrange(desc(Avg_CO2))  

kable(summary, caption = "Average CO2, MPG, and Displacement by Vehicle Class (Gasoline only)")

Average CO2, MPG, and Displacement by Vehicle Class (Gasoline only)
Veh_Class	Avg_CO2	Avg_MPG	Avg_Displ	Count
pickup	476.9	19.2	3.61	180
special purpose	469.1	20.1	3.65	16
standard SUV	468.8	19.7	3.66	313
small car	397.9	24.0	2.99	406
large car	388.4	25.4	3.30	78
minivan	355.8	26.4	3.05	16
small SUV	355.6	26.0	2.30	436
station wagon	352.7	26.1	2.26	42
midsize car	321.3	30.3	2.56	164

epa_gas <- epa_gas %>%
  mutate(
    High_Efficiency = ifelse(Cmb_MPG >= 30, "High (≥30 MPG)", "Standard (<30 MPG)")
  )

epa_gas %>%
  select(Model, Cmb_MPG, Comb_CO2, High_Efficiency) %>%
  head(8)

# A tibble: 8 × 4
  Model                Cmb_MPG Comb_CO2 High_Efficiency   
  <chr>                  <dbl>    <dbl> <chr>             
1 ACURA Integra             33      269 High (≥30 MPG)    
2 ACURA Integra             33      269 High (≥30 MPG)    
3 ACURA Integra             24      371 Standard (<30 MPG)
4 ACURA Integra             24      371 Standard (<30 MPG)
5 ACURA Integra A-Spec      30      293 High (≥30 MPG)    
6 ACURA Integra A-Spec      30      293 High (≥30 MPG)    
7 ACURA Integra A-Spec      32      277 High (≥30 MPG)    
8 ACURA Integra A-Spec      32      277 High (≥30 MPG)

epa_gas1 <- epa_gas %>%
  filter(Cyl <= 12)

cat("Rows after removing 16-cyl outliers:", nrow(epa_gas))

Rows after removing 16-cyl outliers: 1651

Multiple Linear Regression: Predicting CO2 Emissions

Model Setup and Rationale

mlr_model <- lm(Comb_CO2 ~ Displ + Cyl + Drive + Veh_Class, data = epa_gas)

# Display tidy summary with broom
tidy_model <- tidy(mlr_model)
kable(tidy_model, digits = 3, caption = "Multiple Linear Regression: Coefficients, Standard Errors, and p-values")

Multiple Linear Regression: Coefficients, Standard Errors, and p-values
term	estimate	std.error	statistic	p.value
(Intercept)	123.318	8.641	14.272	0.000
Displ	20.937	3.225	6.493	0.000
Cyl	29.480	2.139	13.779	0.000
Drive4WD	26.727	3.137	8.519	0.000
Veh_Classmidsize car	-10.327	8.025	-1.287	0.198
Veh_Classminivan	7.122	15.858	0.449	0.653
Veh_Classpickup	89.489	8.095	11.055	0.000
Veh_Classsmall car	39.875	7.156	5.572	0.000
Veh_Classsmall SUV	37.876	7.366	5.142	0.000
Veh_Classspecial purpose	82.483	15.867	5.198	0.000
Veh_Classstandard SUV	60.946	7.403	8.232	0.000
Veh_Classstation wagon	29.769	11.164	2.667	0.008

glance_model <- glance(mlr_model)
kable(glance_model %>% select(r.squared, adj.r.squared, AIC, BIC, statistic, p.value),
      digits = 3,
      caption = "Model Fit Statistics")

Model Fit Statistics
r.squared	adj.r.squared	AIC	BIC	statistic	p.value
0.705	0.703	18080.43	18150.75	355.493	0

par(mfrow = c(2, 2))
plot(mlr_model, 
     col = "blue",
     pch = 16,
     cex = 0.6)

par(mfrow = c(1, 1))

viz1_data <- epa %>%
  filter(!is.na(Cmb_MPG), !is.na(Comb_CO2), !is.na(Fuel)) %>%

  mutate(
    Fuel_Group = case_when(
      Fuel == "Gasoline"              ~ "Gasoline",
      Fuel == "Electricity"           ~ "Electric (BEV)",
      Fuel == "Gasoline/Electricity"  ~ "Hybrid/PHEV",
      Fuel == "Diesel"                ~ "Diesel",
      TRUE                            ~ "Other"
    ),
    Fuel_Group = factor(Fuel_Group, levels = c("Gasoline", "Diesel", "Hybrid/PHEV", "Electric (BEV)", "Other"))
  )

p1 <- ggplot(viz1_data,
             aes(x = Cmb_MPG,
                 y = Comb_CO2,
                 color = Fuel_Group,
                 text = paste0("Model: ", Model,
                               "<br>Fuel: ", Fuel_Group,
                               "<br>Combined MPG: ", Cmb_MPG,
                               "<br>CO2 (g/mi): ", Comb_CO2,
                               "<br>Class: ", Veh_Class))) +
  geom_point(alpha = 0.55, size = 2.5) +
  scale_color_manual(
    values = c(
      "Gasoline"      = "salmon",   # vivid red
      "Diesel"        = "orange",   # warm orange
      "Hybrid/PHEV"   = "green",   # teal
      "Electric (BEV)"= "lightblue",   # steel blue
      "Other"         = "purple"    # purple
    ),
    name = "Fuel Type"
  ) +
  labs(
    title    = "Combined Fuel Economy vs. CO2 Emissions for 2025 Model Year Vehicles",
    subtitle = "Each point represents a unique vehicle configuration; hover to see details",
    x        = "Combined Fuel Economy (MPG or MPGe)",
    y        = "Combined CO2 Emissions (g/mile)",
    caption  = "Source: U.S. EPA 2025 Fuel Economy Guide | fueleconomy.gov",
    color    = "Fuel Type"
  ) +
  theme_clean() +
  theme(
    plot.title    = element_text(size = 14, face = "bold"),
    plot.subtitle = element_text(size = 10, color = "gray50"),
    legend.position = "right",
    plot.caption  = element_text(size = 8, color = "gray50")
  ) +
  annotate("text", x = 90, y = 350, label = "EVs: very high MPGe,\nzero direct CO2",
           size = 3, color = "pink", fontface = "italic", hjust = 1)

ggplotly(p1, tooltip = "text") %>%
  layout(
    legend = list(orientation = "v", x = 1.02, y = 0.5)
  )

viz2_data <- epa %>%
  filter(
    Veh_Class %in% c("small car", "midsize car", "large car",
                     "small SUV", "standard SUV", "pickup"),
    !is.na(GHG_Score),
    !is.na(SmartWay)
  ) %>%
  mutate(
    Veh_Class = factor(Veh_Class,
                       levels = c("small car", "midsize car", "large car",
                                  "small SUV", "standard SUV", "pickup"))
  )

# Build a grouped box plot with SmartWay status as fill
p2 <- ggplot(viz2_data,
             aes(x = Veh_Class,
                 y = GHG_Score,
                 fill = SmartWay)) +
  geom_boxplot(outlier.shape = 21,
               outlier.fill  = "white",
               outlier.color = "gray50",
               outlier.size  = 1.5,
               linewidth = 0.5,
               position = position_dodge(0.8)) +
  scale_fill_manual(
    values = c(
      "No"    = "orange",   # burnt orange — not certified
      "Yes"   = "lightblue",   # teal — SmartWay certified
      "Elite" = "navyblue"    # dark slate — Elite certified
    ),
    name = "SmartWay\nCertification"
  ) +
  scale_y_continuous(breaks = 1:10, limits = c(0.5, 10.5)) +
  geom_hline(yintercept = 6, linetype = "dashed", color = "gray40", linewidth = 0.7) +
  annotate("text", x = 0.5, y = 6.25,
           label = "SmartWay threshold (Score = 6)", size = 3, color = "gray40",
           hjust = 0, fontface = "italic") +
  labs(
    title    = "Greenhouse Gas Score Distribution by Vehicle Class and SmartWay Certification (2025)",
    subtitle = "Higher scores (max 10) indicate lower greenhouse gas emissions",
    x        = "Vehicle Class",
    y        = "EPA Greenhouse Gas Score (1–10)",
    caption  = "Source: U.S. EPA 2025 Fuel Economy Guide | fueleconomy.gov",
    fill     = "SmartWay\nCertification"
  ) +
  theme_economist() +
  theme(
    plot.title       = element_text(size = 13, face = "bold", hjust = 0),
    plot.subtitle    = element_text(size = 9.5, color = "gray40", hjust = 0),
    axis.text.x      = element_text(angle = 25, hjust = 1, size = 10),
    legend.position  = "right",
    plot.caption     = element_text(size = 8, color = "gray50")
  )

print(p2)

Conclusion and Essay

Summary

This project analyzes a dataset of the 2025 Fuel Economy Guide published by the U.S. Environmental Protection Agency (EPA). This dataset contains 2,492 vehicles with 18 different variables regarding their fuel type, engine, efficiency, and emissions. There are both quantitative and categorical variables within the data set. For this project, the data was cleaned and prepared by renaming the columns to enhance usability, converting the character variables to numeric variables, filtering for gasoline vehicles only for the regression model, creating a categorical variable for vehicle efficiency, and removing two outliers (hyper cars with 16 cylinder engines).

The data was collected from the U.S. Environmental Protection Agency. There is no README file associated with the data, however the documentation for fueleconomy.gov contains the descriptions of the variables for each of the data sets.

Model and Visualization Takeaways

The multiple linear regression model shows that engine size, number of cylinders, drive type, and vehicle class are all significantly related to the vehicle’s CO2 emissions (R² = ~85%). Engine size is the most significant of these variables, followed by four-wheel drive vehicles emitting more CO2 than their two-wheel drive counterparts.

Visualization 1 (interactive) demonstrates the relationship between MPG and CO2 emissions, as well as the impact of electric vehicles on that graph (as they do not emit any CO2 and have higher MPGe ratings than gasoline vehicles).

Visualization 2 (box plots) represents the data of SmartWay certified vehicles compared to non-certified vehicles of each class, as well as the differences in each class of vehicles.

Interesting Surprises

One of the surprises of this data was the even-cleansing distribution of CO2 emissions for SmartWay certified vehicles compared to non-certified vehicles within each class. It was expected that there would be more variation in CO2 emissions for SUV classes of vehicles because of the variety in powertrains for those vehicle types.

Another interesting idea would have been to map the data to each state in the U.S. using Geographic Information Systems (GIS) software. The EPA has data regarding the CO2 emissions of vehicles registered to each state. It would have been interesting to determine if there were differences in the distribution of the different vehicles by state (for instance, there are significant differences between California air quality standards and federal air quality standards). The data does contain a variable regarding the certification region of the vehicles (CA vs. FA), but would have required integrating that information with the registration data to create a map of the U.S. according to CO2 emissions.

Finally, another interesting visualization would have been an animated visualization of the changes in the average MPG of vehicles from year to year according to historical data from the EPA. However, the historical data required additional downloads of the data and would have been outside of the scope of the current project.