Final Project

Introduction

For my final project, I used a vehicle fuel economy dataset collected by the United States Environmental Protection Agency (EPA). I chose this topic because fuel economy is connected to everyday life: it affects how much money people spend on gas, and it also relates to air pollution and carbon dioxide emissions. Because so many people drive every day, I wanted to study how different vehicle characteristics relate to fuel efficiency. According to the EPA report, the agency has collected data on new light-duty vehicles sold in the United States since 1975, based on EPA testing and on data that manufacturers submit using official EPA test procedures. After cleaning, the dataset I used covers model years 1984 to 2019.

The variables I used in this project include both categorical and quantitative variables.

Categorical variables:
fuelType1: the main type of fuel used by the vehicle
VClass: the vehicle class, such as compact car or SUV
trany: the transmission type

Quantitative variables:
comb08: the vehicle's combined miles per gallon (MPG)
displ: engine displacement in liters
cylinders: the number of engine cylinders
co2TailpipeGpm: tailpipe carbon dioxide emissions in grams per mile
fuelCost08: estimated yearly fuel cost

I wanted to explore how engine size and other vehicle characteristics relate to fuel economy, and whether some vehicle classes or fuel types perform better than others. To clean the data, I selected only the variables needed for my project, removed rows with missing values in displ, cylinders, and trany, and kept only vehicles with positive tailpipe CO2 emissions. I then limited the data to the four most common fuel types and the seven most common vehicle classes so the regression and visualizations would be easier to read. The EPA report also explains that its fuel economy data are estimated real-world values, based on expanded testing that better reflects actual driving conditions.

## Load the libraries and dataset
library(tidyverse)


library(plotly)


library(ggfortify)


vehicles <- read_csv("vehicle_fuel_econ_USEPA.csv")

Rows: 40704 Columns: 83
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (22): drive, eng_dscr, fuelType, fuelType1, make, model, mpgData, trany,...
dbl (59): barrels08, barrelsA08, charge120, charge240, city08, city08U, city...
lgl  (2): phevBlended, tCharger

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(vehicles)

# A tibble: 6 × 83
  barrels08 barrelsA08 charge120 charge240 city08 city08U cityA08 cityA08U
      <dbl>      <dbl>     <dbl>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
1      15.7          0         0         0     18       0       0        0
2      15.0          0         0         0     20       0       0        0
3      22.0          0         0         0     13       0       0        0
4      22.0          0         0         0     13       0       0        0
5      19.4          0         0         0     15       0       0        0
6      18.3          0         0         0     16       0       0        0
# ℹ 75 more variables: cityCD <dbl>, cityE <dbl>, cityUF <dbl>, co2 <dbl>,
#   co2A <dbl>, co2TailpipeAGpm <dbl>, co2TailpipeGpm <dbl>, comb08 <dbl>,
#   comb08U <dbl>, combA08 <dbl>, combA08U <dbl>, combE <dbl>,
#   combinedCD <dbl>, combinedUF <dbl>, cylinders <dbl>, displ <dbl>,
#   drive <chr>, engId <dbl>, eng_dscr <chr>, feScore <dbl>, fuelCost08 <dbl>,
#   fuelCostA08 <dbl>, fuelType <chr>, fuelType1 <chr>, ghgScore <dbl>,
#   ghgScoreA <dbl>, highway08 <dbl>, highway08U <dbl>, highwayA08 <dbl>, …

I selected the main variables

vehicles_clean <- vehicles |>
  select(make, model, year, fuelType1, VClass, drive, trany, comb08, co2TailpipeGpm, displ, cylinders, fuelCost08)

head(vehicles_clean)

# A tibble: 6 × 12
  make      model  year fuelType1 VClass drive trany comb08 co2TailpipeGpm displ
  <chr>     <chr> <dbl> <chr>     <chr>  <chr> <chr>  <dbl>          <dbl> <dbl>
1 Alfa Rom… Spid…  1984 Regular … Two S… <NA>  Manu…     21           423.   2  
2 Bertone   X1/9   1984 Regular … Two S… <NA>  Manu…     22           404.   1.5
3 Chevrolet Corv…  1984 Regular … Two S… <NA>  Auto…     15           592.   5.7
4 Chevrolet Corv…  1984 Regular … Two S… <NA>  Manu…     15           592.   5.7
5 Nissan    300ZX  1984 Regular … Two S… <NA>  Auto…     17           523.   3  
6 Nissan    300ZX  1984 Regular … Two S… <NA>  Auto…     18           494.   3  
# ℹ 2 more variables: cylinders <dbl>, fuelCost08 <dbl>

I cleaned the dataset

vehicles_clean2 <- vehicles_clean |>
  filter(!is.na(displ)) |>
  filter(!is.na(cylinders)) |>
  filter(!is.na(trany)) |>
  filter(co2TailpipeGpm > 0)  # To keep vehicles with positive tailpipe CO2 emissions

head(vehicles_clean2)

# A tibble: 6 × 12
  make      model  year fuelType1 VClass drive trany comb08 co2TailpipeGpm displ
  <chr>     <chr> <dbl> <chr>     <chr>  <chr> <chr>  <dbl>          <dbl> <dbl>
1 Alfa Rom… Spid…  1984 Regular … Two S… <NA>  Manu…     21           423.   2  
2 Bertone   X1/9   1984 Regular … Two S… <NA>  Manu…     22           404.   1.5
3 Chevrolet Corv…  1984 Regular … Two S… <NA>  Auto…     15           592.   5.7
4 Chevrolet Corv…  1984 Regular … Two S… <NA>  Manu…     15           592.   5.7
5 Nissan    300ZX  1984 Regular … Two S… <NA>  Auto…     17           523.   3  
6 Nissan    300ZX  1984 Regular … Two S… <NA>  Auto…     18           494.   3  
# ℹ 2 more variables: cylinders <dbl>, fuelCost08 <dbl>

summary(vehicles_clean2)

     make              model                year       fuelType1        
 Length:40522       Length:40522       Min.   :1984   Length:40522      
 Class :character   Class :character   1st Qu.:1991   Class :character  
 Mode  :character   Mode  :character   Median :2002   Mode  :character  
                                       Mean   :2001                     
                                       3rd Qu.:2011                     
                                       Max.   :2019                     
    VClass             drive              trany               comb08     
 Length:40522       Length:40522       Length:40522       Min.   : 7.00  
 Class :character   Class :character   Class :character   1st Qu.:17.00  
 Mode  :character   Mode  :character   Mode  :character   Median :20.00  
                                                          Mean   :20.19  
                                                          3rd Qu.:23.00  
                                                          Max.   :58.00  
 co2TailpipeGpm       displ         cylinders        fuelCost08  
 Min.   :  29.0   Min.   :0.600   Min.   : 2.000   Min.   : 700  
 1st Qu.: 386.4   1st Qu.:2.200   1st Qu.: 4.000   1st Qu.:1850  
 Median : 446.0   Median :3.000   Median : 6.000   Median :2250  
 Mean   : 469.5   Mean   :3.299   Mean   : 5.721   Mean   :2284  
 3rd Qu.: 522.9   3rd Qu.:4.300   3rd Qu.: 6.000   3rd Qu.:2700  
 Max.   :1269.6   Max.   :8.400   Max.   :16.000   Max.   :7150

I explored the main categorical variables

vehicles_clean2 |>
  count(fuelType1, sort = TRUE)

# A tibble: 5 × 2
  fuelType1             n
  <chr>             <int>
1 Regular Gasoline  27658
2 Premium Gasoline  11540
3 Diesel             1158
4 Midgrade Gasoline   106
5 Natural Gas          60

vehicles_clean2 |>
  count(VClass, sort = TRUE)

# A tibble: 34 × 2
   VClass                          n
   <chr>                       <int>
 1 Compact Cars                 5796
 2 Subcompact Cars              5079
 3 Midsize Cars                 4786
 4 Standard Pickup Trucks       2354
 5 Large Cars                   2090
 6 Sport Utility Vehicle - 4WD  2090
 7 Two Seaters                  2025
 8 Sport Utility Vehicle - 2WD  1619
 9 Small Station Wagons         1582
10 Special Purpose Vehicles     1455
# ℹ 24 more rows

I kept the most common fuel types and vehicle classes

# keep the 4 most common fuel types so the plots stay readable
top_fuel <- vehicles_clean2 |>
  count(fuelType1, sort = TRUE) |>
  slice_head(n = 4)

# keep the 7 most common vehicle classes so the heatmap is not too crowded
top_class <- vehicles_clean2 |>
  count(VClass, sort = TRUE) |>
  slice_head(n = 7)

# filter the dataset to only those common categories
vehicles_final <- vehicles_clean2 |>
  filter(fuelType1 %in% top_fuel$fuelType1) |>
  filter(VClass %in% top_class$VClass)

head(vehicles_final)

# A tibble: 6 × 12
  make      model  year fuelType1 VClass drive trany comb08 co2TailpipeGpm displ
  <chr>     <chr> <dbl> <chr>     <chr>  <chr> <chr>  <dbl>          <dbl> <dbl>
1 Alfa Rom… Spid…  1984 Regular … Two S… <NA>  Manu…     21           423.   2  
2 Bertone   X1/9   1984 Regular … Two S… <NA>  Manu…     22           404.   1.5
3 Chevrolet Corv…  1984 Regular … Two S… <NA>  Auto…     15           592.   5.7
4 Chevrolet Corv…  1984 Regular … Two S… <NA>  Manu…     15           592.   5.7
5 Nissan    300ZX  1984 Regular … Two S… <NA>  Auto…     17           523.   3  
6 Nissan    300ZX  1984 Regular … Two S… <NA>  Auto…     18           494.   3  
# ℹ 2 more variables: cylinders <dbl>, fuelCost08 <dbl>

In those steps, I kept only the most common fuel types and vehicle classes. I did this so the final visualizations would be easier to read and less crowded. This filtering helped me focus on the groups that appeared most often in the dataset.

I fit a multiple linear regression model

model2 <- lm(comb08 ~ displ + cylinders, data = vehicles_final)

summary(model2)


Call:
lm(formula = comb08 ~ displ + cylinders, data = vehicles_final)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.764  -2.375  -0.536   1.634  32.414 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 31.36608    0.08136  385.53   <2e-16 ***
displ       -2.37264    0.04090  -58.02   <2e-16 ***
cylinders   -0.49589    0.03011  -16.47   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.759 on 24188 degrees of freedom
Multiple R-squared:  0.5446,    Adjusted R-squared:  0.5446 
F-statistic: 1.446e+04 on 2 and 24188 DF,  p-value: < 2.2e-16

autoplot(model2, 1:4, nrow = 2, ncol = 2)


Before choosing my final model, I also tested a model that included co2TailpipeGpm as a predictor along with displ and cylinders. However, I removed co2TailpipeGpm from the final model because tailpipe CO2 depends directly on how much fuel a vehicle burns, so it is closer to another measure of fuel economy than to a vehicle characteristic. My final model uses only displ and cylinders to predict comb08, which keeps the focus on engine design and makes the model simpler and easier to interpret.
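One way to see why co2TailpipeGpm behaves differently from the other predictors: in the gasoline rows printed earlier, the CO2 values match what you get by dividing roughly 8,887 grams of CO2 per gallon by comb08 (for example, 8887 / 21 is about 423). The sketch below checks that arithmetic and then compares the two model forms on a small simulated data frame. The 8887 constant, the simulated values, and the object names (sim, model_with_co2, model_without_co2) are assumptions for illustration only; the original comparison on the EPA data is not reproduced here.

```r
# Assumed constant: grams of CO2 released per gallon of gasoline burned.
grams_co2_per_gallon <- 8887

# comb08 and co2TailpipeGpm copied from the first rows printed in the report.
shown_mpg <- c(21, 22, 15, 17, 18)
shown_co2 <- c(423, 404, 592, 523, 494)  # as displayed, rounded to whole grams
implied_co2 <- grams_co2_per_gallon / shown_mpg
print(round(implied_co2))

# Simulated stand-in data: every value below is made up, not EPA data.
set.seed(1)
n <- 300
displ <- round(runif(n, 1.0, 6.5), 1)
cylinders <- ifelse(displ < 2.5, 4, ifelse(displ < 4.5, 6, 8))
comb08 <- pmax(31 - 2.4 * displ - 0.5 * cylinders + rnorm(n, sd = 3), 8)
sim <- data.frame(displ, cylinders, comb08,
                  co2TailpipeGpm = grams_co2_per_gallon / comb08)

# Same two model forms described in the text, fit to the simulated data.
model_without_co2 <- lm(comb08 ~ displ + cylinders, data = sim)
model_with_co2 <- lm(comb08 ~ displ + cylinders + co2TailpipeGpm, data = sim)

adj_without <- summary(model_without_co2)$adj.r.squared
adj_with <- summary(model_with_co2)$adj.r.squared
print(c(without_co2 = adj_without, with_co2 = adj_with))
```

In the simulated data, the model that includes co2TailpipeGpm has the higher adjusted R-squared, but that is because the predictor is computed from the outcome itself, not because it describes the vehicle.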

The final multiple linear regression model has the equation: predicted comb08 = 31.3661 - 2.3726(displ) - 0.4959(cylinders). The intercept of 31.3661 is the predicted combined MPG when engine displacement and number of cylinders are both 0, which is not meaningful for a real vehicle. The coefficient for displ shows that for each additional liter of engine displacement, the predicted combined MPG decreases by about 2.373, holding the number of cylinders constant. The coefficient for cylinders shows that for each additional cylinder, the predicted combined MPG decreases by about 0.496, holding engine displacement constant.
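To make the equation concrete, here is a minimal sketch that plugs two engines into it. The coefficients are copied from the summary(model2) output; the helper name predict_mpg is hypothetical, and the two example engines are chosen only for illustration.

```r
# Coefficients copied from the summary(model2) output in this report.
b_intercept <- 31.36608
b_displ <- -2.37264
b_cylinders <- -0.49589

# Hypothetical helper: predicted combined MPG from the fitted equation.
predict_mpg <- function(displ, cylinders) {
  b_intercept + b_displ * displ + b_cylinders * cylinders
}

small_engine <- predict_mpg(displ = 2.0, cylinders = 4)  # about 24.6 MPG
large_engine <- predict_mpg(displ = 5.7, cylinders = 8)  # about 13.9 MPG
print(round(c(small_engine = small_engine, large_engine = large_engine), 1))
```

The 5.7-liter, 8-cylinder case matches the Chevrolet Corvette rows printed earlier, which have an actual comb08 of 15, so the model underpredicts those vehicles by about 1 MPG. With the fitted model object available, predict(model2, newdata = data.frame(displ = 2.0, cylinders = 4)) should give the same number as the first call.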

The model was statistically significant overall, with a p-value less than 2.2e-16, and the adjusted R-squared value was 0.5446. This means the model explains about 54% of the variation in combined MPG, so engine displacement and number of cylinders are useful predictors of fuel economy, although close to half of the variation is left unexplained. The residual standard error of 3.759 shows that the model's predictions are typically off by about 3.8 MPG.
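The multiple and adjusted R-squared values are identical to four decimals because the sample is very large compared with the number of predictors. The sketch below reproduces that arithmetic, and the F-statistic, from the numbers printed in the model summary. It starts from the rounded R-squared of 0.5446, so the results only agree with the printed output to about four significant digits.

```r
# Values read from the summary(model2) output.
r_squared <- 0.5446      # Multiple R-squared (rounded as printed)
resid_df <- 24188        # residual degrees of freedom
n_predictors <- 2        # displ and cylinders

# Number of rows used in the fit: residual df + predictors + intercept.
n_obs <- resid_df + n_predictors + 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / resid_df

# Overall F-statistic for testing both slopes at once.
f_stat <- (r_squared / n_predictors) / ((1 - r_squared) / resid_df)

print(n_obs)
print(round(adj_r_squared, 4))
print(signif(f_stat, 4))
```

With 24,191 vehicles and only two predictors, the penalty factor (n_obs - 1) / resid_df is almost exactly 1, which is why the adjustment does not change the value at this precision.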

I created a scatterplot of engine displacement and combined MPG

ggplot(vehicles_final,
             aes(x = displ, y = comb08, color = fuelType1)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Engine Displacement and Combined MPG by Fuel Type",
    subtitle = "Filtered EPA vehicle fuel economy data",
    x = "Engine Displacement (Liters)",
    y = "Combined MPG",
    color = "Fuel Type",
    caption = "Source: U.S. Environmental Protection Agency"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(base_size = 12)

`geom_smooth()` using formula = 'y ~ x'

This scatterplot shows the relationship between engine displacement and combined MPG for the main fuel types in the EPA vehicle dataset. The overall pattern shows a negative relationship, which means that vehicles with larger engines usually have lower fuel economy. The fitted regression lines also show this downward trend for each fuel type, although the strength of the pattern is a little different across groups. Most of the vehicles in this filtered dataset use regular gasoline or premium gasoline, so those categories appear more often in the graph. This visualization supports the regression results because it shows that engine size is an important factor related to fuel economy.

I created a heatmap of average MPG by vehicle class and fuel type

heat_data <- vehicles_final |>
  group_by(VClass, fuelType1) |>
  summarise(avg_mpg = mean(comb08), .groups = "drop")

ggplot(heat_data, aes(x = fuelType1, y = VClass, fill = avg_mpg)) +
  geom_tile(color = "white") +
  labs(
    title = "Average MPG by Vehicle Class and Fuel Type",
    subtitle = "Filtered EPA vehicle fuel economy data",
    x = "Fuel Type",
    y = "Vehicle Class",
    fill = "Average MPG",
    caption = "Source: U.S. Environmental Protection Agency"
  ) +
  scale_fill_distiller(palette = "YlGnBu", direction = 1) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

This heatmap shows the average combined MPG for different combinations of vehicle class and fuel type. Darker colors represent higher average fuel economy, while lighter colors represent lower average MPG. One clear pattern is that compact and subcompact cars tend to have higher average MPG than larger vehicle classes such as pickup trucks and sport utility vehicles. The heatmap also shows that fuel type matters, since some fuel types have better average fuel economy within the same vehicle class. This visualization is useful because it summarizes many group comparisons at the same time in one figure.
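Each tile in a heatmap like this is an average, and some combinations of class and fuel type may contain only a few vehicles (midgrade gasoline, for example, appears just 106 times in the cleaned data before the class filter). A minimal sketch of how the number of vehicles behind each average could be checked is shown below. It uses base R and a made-up toy data frame rather than the EPA file, and the names toy and heat_check are hypothetical.

```r
# Toy data: made-up values that only mimic the columns used for the heatmap.
toy <- data.frame(
  VClass = c("Compact Cars", "Compact Cars", "Compact Cars",
             "Large Cars", "Large Cars", "Large Cars"),
  fuelType1 = c("Regular Gasoline", "Regular Gasoline", "Diesel",
                "Regular Gasoline", "Premium Gasoline", "Premium Gasoline"),
  comb08 = c(28, 30, 35, 20, 18, 19)
)

# Average MPG per (class, fuel type) cell, as in the heatmap.
cell_mean <- aggregate(comb08 ~ VClass + fuelType1, data = toy, FUN = mean)
names(cell_mean)[3] <- "avg_mpg"

# Number of vehicles behind each average.
cell_n <- aggregate(comb08 ~ VClass + fuelType1, data = toy, FUN = length)
names(cell_n)[3] <- "n"

# One row per cell with both the average and its sample size.
heat_check <- merge(cell_mean, cell_n, by = c("VClass", "fuelType1"))
print(heat_check)
```

In the tidyverse pipeline used for the heatmap, the same idea would amount to adding n = n() inside summarise(), so that tiles built on very few vehicles can be read with more caution.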

I created an interactive bubble chart

vehicles_bubble <- vehicles_final |>
  filter(year >= 2010) |>
  filter(displ <= 6.5) |>
  filter(fuelCost08 <= 5000)


bubble_plot <- ggplot(
  vehicles_bubble,
  aes(
    x = displ,
    y = comb08,
    color = fuelType1,
    size = fuelCost08,
    text = paste(
      "Make:", make,
      "<br>Model:", model,
      "<br>Year:", year,
      "<br>Engine Displacement:", displ,
      "<br>Combined MPG:", comb08,
      "<br>Fuel Cost:", fuelCost08,
      "<br>Fuel Type:", fuelType1
    )
  )
) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Engine Displacement, Fuel Economy, and Fuel Cost by Fuel Type",
    subtitle = "Filtered EPA vehicle fuel economy data",
    x = "Engine Displacement (Liters)",
    y = "Combined MPG",
    color = "Fuel Type",
    size = "Annual Fuel Cost",
    caption = "Source: U.S. Environmental Protection Agency"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(base_size = 12)

ggplotly(bubble_plot, tooltip = "text")

This interactive bubble chart shows the relationship between engine displacement and combined MPG while also adding annual fuel cost to the same visualization. The size of each bubble represents annual fuel cost, so larger bubbles show vehicles that cost more to fuel each year. The plot still shows the same general pattern as the scatterplot, where vehicles with larger engine displacement tend to have lower fuel economy. Making the chart interactive helps the viewer explore the data more closely by hovering over each point to see details such as the make, model, year, and fuel type. This makes the visualization more useful because it combines several variables in one graph and lets the viewer inspect individual vehicles.

Conclusion

For this project, I used EPA vehicle fuel economy data to study how vehicle characteristics relate to fuel economy. My multiple linear regression showed that engine displacement and number of cylinders both have negative relationships with combined MPG, which means that vehicles with larger engines and more cylinders usually have lower fuel economy. The scatterplot supported this pattern by showing a clear downward trend between engine displacement and combined MPG, while the heatmap made it easier to compare average MPG across vehicle classes and fuel types. The interactive bubble chart added another layer by showing annual fuel cost and allowing the viewer to explore individual vehicles more closely. One thing that stood out to me is that smaller vehicle classes usually had better fuel economy than larger classes such as SUVs and pickup trucks. If I had more time, I would want to explore changes over time in more detail or compare manufacturers more directly.

Sources

https://www.epa.gov/automotive-trends/download-automotive-trends-report#Full%20Report
https://www.w3schools.com/
https://www.youtube.com/watch?v=Dh7P5ExsYCg&list=PLtL57Fdbwb_C6RS0JtBojTNOMVlgpeJkS&index=2