#Introduction
The “Vehicle Fuel Economy” dataset was created by the United States Environmental Protection Agency (U.S. EPA) and is accessible through Data.gov. It contains fuel economy information for passenger vehicles sold in the United States across multiple years.
NOTE: Fuel economy is a rating of how far a vehicle can travel on a specific amount of fuel, usually measured in miles per gallon (MPG).
The dataset includes quantitative variables such as miles per gallon (MPG) in city driving, highway driving, and combined driving, as well as estimated annual fuel cost. It also contains categorical variables including fuel type, drivetrain, vehicle class, and manufacturer, along with vehicle specifications such as model name and model year. In addition, the dataset includes environmental indicators such as greenhouse gas scores.
I did not find a dedicated README file with the dataset that describes the data collection process. However, according to information published by the U.S. EPA, vehicle fuel economy values are generated through laboratory testing performed by vehicle manufacturers and verified by the EPA. These tests are designed to estimate fuel economy under controlled conditions rather than to measure real-world driving behavior (U.S. Environmental Protection Agency, 2025).
I chose this dataset because, as a new driver, fuel economy has become directly connected to my everyday decisions. Fuel efficiency affects how much I spend on gas, how often I need to refuel, and what types of vehicles are efficient to own or drive regularly. This dataset lets me look at data about something that now affects me directly.
#Loading the libraries
library(tidyverse)
library(plotly)
library(broom)
library(ggfortify)
library(viridis)
#Loading the dataset
vehicles_raw <- read_csv("vehicles.csv")
Rows: 40704 Columns: 83
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (22): drive, eng_dscr, fuelType, fuelType1, make, model, mpgData, trany,...
dbl (59): barrels08, barrelsA08, charge120, charge240, city08, city08U, city...
lgl (2): phevBlended, tCharger
#Cleaning the data
I reduced the dataset to only the variables needed to describe fuel economy and vehicle characteristics, which makes the data less cluttered and easier to understand. I restricted the analysis to vehicles from the year 2000 and later so that the data reflect more modern vehicle technology and the results are more relevant to current drivers. I converted categorical variables into factors so that they are handled correctly in visualizations and regression models. This matters for the multiple linear regression, which treats categorical variables differently from numerical ones. Lastly, I removed observations with missing values.
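The cleaning steps described above could look roughly like the sketch below. It runs on a small made-up data frame rather than the real file, so that it is self-contained. The names `vehicles_raw_demo`, `vehicles_demo`, and `extra_col` are hypothetical, and the column names `year`, `displ`, and `cylinders` are assumptions about the EPA file; `make`, `model`, `fuelType`, `drive`, and `comb08` are the names that appear in the loaded data.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for vehicles_raw; every value here is made up for illustration.
vehicles_raw_demo <- data.frame(
  year      = c(1995, 2005, 2012, 2020, 2021),
  make      = c("A", "B", "C", "D", "E"),
  model     = c("m1", "m2", "m3", "m4", "m5"),
  displ     = c(3.0, 2.4, NA, 2.0, 5.0),      # assumed name for engine displacement
  cylinders = c(6, 4, 4, 4, 8),               # assumed name for cylinder count
  fuelType  = c("Regular", "Regular", "Premium", "Diesel", "Premium"),
  drive     = c("Rear-Wheel Drive", "Front-Wheel Drive", "Front-Wheel Drive",
                "All-Wheel Drive", "Rear-Wheel Drive"),
  comb08    = c(20, 26, 28, 33, 17),
  extra_col = 1:5,                            # stands in for the unused columns
  stringsAsFactors = FALSE
)

vehicles_demo <- vehicles_raw_demo %>%
  # 1. Keep only the variables needed for the analysis
  select(year, make, model, displ, cylinders, fuelType, drive, comb08) %>%
  # 2. Restrict to vehicles from 2000 and later
  filter(year >= 2000) %>%
  # 3. Convert categorical variables to factors
  mutate(across(c(make, fuelType, drive), as.factor)) %>%
  # 4. Drop rows with any missing value
  drop_na()

str(vehicles_demo)
```

In this toy example the 1995 row is removed by the year filter and the row with a missing displacement is removed by `drop_na()`, leaving three rows.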
A multiple linear regression model was used to analyze how vehicle features relate to combined fuel economy (MPG). The response variable was combined MPG, and the predictors were engine displacement, number of cylinders, fuel type, drivetrain, and model year. The model estimates the association of each predictor with combined MPG while holding the others constant.
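A model of this form can be sketched with base R's `lm()`. The example below uses simulated data so that it runs on its own, so its coefficients say nothing about the real vehicles. The object names `vehicles_demo` and `model1_demo` are hypothetical, and the column names `displ`, `cylinders`, and `year` are assumptions about the EPA file.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the cleaned data; all values are made up.
vehicles_demo <- data.frame(
  displ     = runif(n, 1.0, 6.0),
  cylinders = sample(c(4, 6, 8), n, replace = TRUE),
  fuelType  = factor(sample(c("Regular", "Premium", "Diesel"), n, replace = TRUE)),
  drive     = factor(sample(c("Front-Wheel Drive", "Rear-Wheel Drive",
                              "All-Wheel Drive"), n, replace = TRUE)),
  year      = sample(2000:2024, n, replace = TRUE)
)

# Simulated response: MPG falls with displacement and rises slightly over time.
vehicles_demo$comb08 <- 40 - 3 * vehicles_demo$displ +
  0.1 * (vehicles_demo$year - 2000) + rnorm(n, sd = 2)

# Combined MPG as the response, with the five predictors named in the text
model1_demo <- lm(comb08 ~ displ + cylinders + fuelType + drive + year,
                  data = vehicles_demo)

# Coefficient table: estimate, standard error, t value, p value
print(round(summary(model1_demo)$coefficients, 3))
```

Because `fuelType` and `drive` are factors, `lm()` reports one coefficient per non-reference level, and each is interpreted relative to the reference level of that factor.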
Regression Diagnostics
autoplot(model1, which = 1:4)
#Visualization 1: Average combined MPG by fuel type and drivetrain.
# Mean combined MPG for each fuel type and drivetrain combination
mpg_summary <- vehicles %>%
  group_by(fuelType, drive) %>%
  summarize(mean_mpg = mean(comb08), .groups = "drop")

# Grouped bar chart: one bar per drivetrain within each fuel type
ggplot(mpg_summary, aes(x = fuelType, y = mean_mpg, fill = drive)) +
  geom_col(position = "dodge") +
  labs(
    title = "Average Combined MPG by Fuel Type and Drive (2000+)",
    x = "Fuel Type",
    y = "Average combined MPG",
    fill = "Drive",
    caption = "Data source: U.S. Environmental Protection Agency (EPA), Vehicle Fuel Economy dataset"
  ) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))
This visualization summarizes average combined fuel economy (MPG) by fuel type and drivetrain. The data were grouped by fuel type and drive type, and the mean combined MPG was calculated for each group. The bar chart shows how average fuel efficiency varies across fuel types and drivetrain configurations.
#Visualization 2: Engine displacement vs combined MPG (interactive)
This interactive scatter plot shows the relationship between engine displacement and combined fuel economy (MPG), with points colored by the fuel type the vehicle uses. The trend is that vehicles with larger displacement tend to have lower fuel economy. Hovering over a point in the plot shows the details for that vehicle.
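A plot like the one described above could be built roughly as follows. This is a sketch on made-up data, not necessarily the code behind the original figure. The data frame `vehicles_demo`, its values, and the column name `displ` are assumptions; `make`, `model`, `fuelType`, and `comb08` follow the column names in the loaded file.

```r
library(ggplot2)
library(plotly)

# Toy data for illustration only; these are not real EPA values.
vehicles_demo <- data.frame(
  make     = c("A", "B", "C", "D", "E", "F"),
  model    = c("m1", "m2", "m3", "m4", "m5", "m6"),
  displ    = c(1.6, 2.0, 2.5, 3.5, 5.0, 6.2),
  comb08   = c(34, 30, 27, 22, 18, 15),
  fuelType = c("Regular", "Regular", "Diesel", "Premium", "Premium", "Premium")
)

# Static scatter plot; the `text` aesthetic is only used for the hover label.
p <- ggplot(
  vehicles_demo,
  aes(
    x = displ, y = comb08, color = fuelType,
    text = paste0(make, " ", model,
                  "<br>Displacement: ", displ, " L",
                  "<br>Combined MPG: ", comb08)
  )
) +
  geom_point(alpha = 0.7) +
  scale_color_viridis_d() +
  labs(
    title = "Engine Displacement vs Combined MPG",
    x = "Engine displacement (L)",
    y = "Combined MPG",
    color = "Fuel type"
  ) +
  theme_minimal()

# Convert to an interactive plotly widget and show only the custom hover text.
p_interactive <- ggplotly(p, tooltip = "text")
p_interactive
```

Passing `tooltip = "text"` keeps the hover box limited to the label built in `aes()`, instead of listing every mapped aesthetic.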
Background Research
Fuel economy is a rating of how far a vehicle can travel on a specific amount of fuel. The less fuel the vehicle uses, the higher the fuel economy. In the United States, the standard measure of fuel economy is miles per gallon (mpg). This measure refers to how many miles a vehicle can travel using one gallon of fuel. The U.S. Environmental Protection Agency (EPA) reports a vehicle’s fuel economy rating in three driving cycles: city, highway, and combined. Battery-electric vehicles (BEV) and plug-in hybrid vehicles (PHEV) are EPA-rated by a miles-per-gallon equivalent (MPGe).
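To make the relationship between the three ratings concrete, the sketch below computes a combined figure from a city and a highway value. It assumes the commonly described EPA weighting of 55% city and 45% highway driving, combined as a weighted harmonic mean. That weighting is not stated in the sources cited here, so treat it as an assumption. The function name `combined_mpg` is hypothetical.

```r
# Combined MPG as a weighted harmonic mean of city and highway MPG.
# Assumption: 55% city / 45% highway weighting.
combined_mpg <- function(city, highway, w_city = 0.55) {
  1 / (w_city / city + (1 - w_city) / highway)
}

# Example: 25 MPG city and 35 MPG highway give roughly 28.7 MPG combined,
# which is below the simple average of 30.
combined_mpg(25, 35)
```

The harmonic mean is used because MPG is distance per unit of fuel: averaging fuel used per mile and then inverting gives the correct overall rate, while a plain average of the two MPG values would overstate it.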
U.S. Environmental Protection Agency. (2025, November 14). Technical capabilities of the National Vehicle and Fuel Emissions Laboratory (NVFEL). https://www.epa.gov/vehicle-and-fuel-emissions-testing/technical-capabilities-national-vehicle-and-fuel-emissions
Kelley Blue Book. (n.d.). What is fuel economy? https://www.kbb.com/what-is/fuel-economy/
Conclusion
This project explored how vehicle characteristics are related to fuel economy using data from the U.S. EPA. The visualizations and regression analysis showed clear patterns. Specifically, vehicles with larger engine displacement and more cylinders tend to have lower combined MPG, making them less fuel efficient. Fuel type and drivetrain also contribute to differences in fuel efficiency: vehicles using traditional gasoline appear to be less fuel efficient than hybrids, and vehicles with all-wheel drive (AWD) tend to have lower fuel efficiency than those with rear-wheel drive (RWD) or front-wheel drive (FWD), most likely because of the extra weight of AWD systems. The interactive scatter plot showed differences in MPG across vehicle types, even among cars with similar engine displacement. One limitation of this analysis is that EPA fuel economy values are based on standardized laboratory testing and therefore do not capture real-world driving factors such as weather, terrain, and driver behavior. If more time or data were available, including real-world driving data could provide additional insight into how fuel efficiency affects everyday drivers.