Final Project

## Load the libraries and dataset
library(tidyverse)
library(plotly)
library(ggfortify)

vehicles <- read_csv("vehicle_fuel_econ_USEPA.csv")

head(vehicles)

# A tibble: 6 × 83
  barrels08 barrelsA08 charge120 charge240 city08 city08U cityA08 cityA08U
      <dbl>      <dbl>     <dbl>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
1      15.7          0         0         0     18       0       0        0
2      15.0          0         0         0     20       0       0        0
3      22.0          0         0         0     13       0       0        0
4      22.0          0         0         0     13       0       0        0
5      19.4          0         0         0     15       0       0        0
6      18.3          0         0         0     16       0       0        0
# ℹ 75 more variables: cityCD <dbl>, cityE <dbl>, cityUF <dbl>, co2 <dbl>,
#   co2A <dbl>, co2TailpipeAGpm <dbl>, co2TailpipeGpm <dbl>, comb08 <dbl>,
#   comb08U <dbl>, combA08 <dbl>, combA08U <dbl>, combE <dbl>,
#   combinedCD <dbl>, combinedUF <dbl>, cylinders <dbl>, displ <dbl>,
#   drive <chr>, engId <dbl>, eng_dscr <chr>, feScore <dbl>, fuelCost08 <dbl>,
#   fuelCostA08 <dbl>, fuelType <chr>, fuelType1 <chr>, ghgScore <dbl>,
#   ghgScoreA <dbl>, highway08 <dbl>, highway08U <dbl>, highwayA08 <dbl>, …

# Keep only the columns used in the analysis
vehicles_clean <- vehicles |>
  select(
    make, model, year, fuelType1, VClass, drive, trany,
    comb08, co2TailpipeGpm, displ, cylinders, fuelCost08
  )

# Drop rows with missing engine or transmission data,
# and rows without a positive tailpipe CO2 value
vehicles_clean2 <- vehicles_clean |>
  filter(!is.na(displ), !is.na(cylinders), !is.na(trany), co2TailpipeGpm > 0)
vehicles_clean2 |>
  count(fuelType1, sort = TRUE)

# A tibble: 5 × 2
  fuelType1             n
  <chr>             <int>
1 Regular Gasoline  27658
2 Premium Gasoline  11540
3 Diesel             1158
4 Midgrade Gasoline   106
5 Natural Gas          60

vehicles_clean2 |>
  count(VClass, sort = TRUE)

# A tibble: 34 × 2
   VClass                          n
   <chr>                       <int>
 1 Compact Cars                 5796
 2 Subcompact Cars              5079
 3 Midsize Cars                 4786
 4 Standard Pickup Trucks       2354
 5 Large Cars                   2090
 6 Sport Utility Vehicle - 4WD  2090
 7 Two Seaters                  2025
 8 Sport Utility Vehicle - 2WD  1619
 9 Small Station Wagons         1582
10 Special Purpose Vehicles     1455
# ℹ 24 more rows

top_fuel <- vehicles_clean2 |>
  count(fuelType1, sort = TRUE) |>
  slice_head(n = 4)
top_class <- vehicles_clean2 |>
  count(VClass, sort = TRUE) |>
  slice_head(n = 7)
vehicles_final <- vehicles_clean2 |>
  filter(fuelType1 %in% top_fuel$fuelType1) |>
  filter(VClass %in% top_class$VClass)
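
The steps above keep only the most common fuel types and vehicle classes: count each category, take the top rows, and filter on membership. Here is a minimal sketch of the same pattern, using R's built-in `mtcars` data as a stand-in (it is not the EPA file, and `top_cyl` and `cars_top` are illustrative names):

```r
library(dplyr)

# Count each category, most common first, and keep the top two
top_cyl <- mtcars |>
  count(cyl, sort = TRUE) |>
  slice_head(n = 2)

# Keep only the rows whose category is among the top two
cars_top <- mtcars |>
  filter(cyl %in% top_cyl$cyl)

nrow(mtcars)    # rows before filtering
nrow(cars_top)  # rows after filtering
```

Because `slice_head()` takes rows by position, a tie at the cutoff is broken by row order. `slice_max(n, n = 2)` would keep tied categories instead.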

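I created an interactive bubble chart of engine displacement against combined MPG, with bubble size showing annual fuel cost and color showing fuel type. The chart uses vehicles from model year 2010 onward with engines of at most 6.5 liters and annual fuel costs of at most $5,000.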

vehicles_bubble <- vehicles_final |>
  filter(year >= 2010) |>
  filter(displ <= 6.5) |>
  filter(fuelCost08 <= 5000)


bubble_plot <- ggplot(
  vehicles_bubble,
  aes(
    x = displ,
    y = comb08,
    color = fuelType1,
    size = fuelCost08,
    text = paste(
      "Make:", make,
      "<br>Model:", model,
      "<br>Year:", year,
      "<br>Engine Displacement:", displ,
      "<br>Combined MPG:", comb08,
      "<br>Fuel Cost:", fuelCost08,
      "<br>Fuel Type:", fuelType1
    )
  )
) +
  geom_point(alpha = 0.5) +
  labs(
    title = "Engine Displacement, Fuel Economy, and Fuel Cost by Fuel Type",
    subtitle = "Filtered EPA vehicle fuel economy data",
    x = "Engine Displacement (Liters)",
    y = "Combined MPG",
    color = "Fuel Type",
    size = "Annual Fuel Cost",
    caption = "Source: U.S. Environmental Protection Agency"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(base_size = 12)

ggplotly(bubble_plot, tooltip = "text")

This interactive bubble chart shows the relationship between engine displacement and combined MPG and adds annual fuel cost to the same visualization. The size of each bubble represents annual fuel cost, so larger bubbles mark vehicles that cost more to fuel each year. The plot shows the same general pattern as the scatterplot: vehicles with larger engine displacement tend to have lower fuel economy. Hovering over a point reveals that vehicle's make, model, year, engine displacement, combined MPG, fuel cost, and fuel type, so the viewer can inspect individual vehicles. This makes the visualization more useful because it combines several variables in one graph.
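
The downward pattern described above can be summarized with a rank correlation. This is a minimal sketch, assuming a hypothetical helper named `displacement_mpg_cor` and demonstrated on R's built-in `mtcars` data rather than the EPA file:

```r
# Spearman rank correlation between two numeric columns of a data frame.
# Hypothetical helper for illustration; rows with missing values are ignored.
displacement_mpg_cor <- function(data, x, y) {
  cor(data[[x]], data[[y]], method = "spearman", use = "complete.obs")
}

# Stand-in data: displacement (disp) and miles per gallon (mpg) in mtcars
rho <- displacement_mpg_cor(mtcars, "disp", "mpg")
rho < 0  # TRUE when larger engines go with lower MPG

# With the report's data (assumes vehicles_bubble exists in the session):
# displacement_mpg_cor(vehicles_bubble, "displ", "comb08")
```

A rank correlation is used here because it only assumes the relationship is monotonic, not that it follows a straight line.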

## Conclusion

For this project, I used EPA vehicle fuel economy data to study how vehicle characteristics relate to fuel economy. My multiple linear regression showed that engine displacement and number of cylinders both have negative relationships with combined MPG, which means that vehicles with larger engines and more cylinders tend to have lower fuel economy. The scatterplot supported this pattern by showing a clear downward trend between engine displacement and combined MPG, while the heatmap made it easier to compare average MPG across vehicle classes and fuel types. The interactive bubble chart added another layer by showing annual fuel cost and allowing the viewer to explore individual vehicles more closely. One thing that stood out to me is that smaller vehicle classes usually had better fuel economy than larger classes such as SUVs and pickup trucks. With more time, I would explore changes over time in more detail or compare manufacturers more directly.
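
Below is a minimal sketch of how the coefficient signs from a multiple linear regression can be checked, shown on R's built-in `mtcars` data as a stand-in. The commented formula for the EPA data is an assumption about the model's form, not necessarily the exact model used in this project.

```r
# Fit MPG on engine displacement and cylinder count (stand-in data)
fit <- lm(mpg ~ disp + cyl, data = mtcars)

# Extract the slope estimates, dropping the intercept
slopes <- coef(fit)[c("disp", "cyl")]
slopes < 0  # a negative slope means MPG falls as the predictor rises

# Assumed equivalent for the EPA data (model form is an assumption):
# fit_epa <- lm(comb08 ~ displ + cylinders, data = vehicles_final)
# summary(fit_epa)
```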

## Sources

https://www.epa.gov/automotive-trends/download-automotive-trends-report#Full%20Report
https://www.w3schools.com/
https://www.youtube.com/watch?v=Dh7P5ExsYCg&list=PLtL57Fdbwb_C6RS0JtBojTNOMVlgpeJkS&index=2