For my final project, I used a vehicle fuel economy dataset collected by the United States Environmental Protection Agency (EPA). I chose this topic because fuel economy is connected to everyday life. It affects how much money people spend on gas, and it also relates to air pollution and carbon dioxide emissions. I thought this topic was meaningful because cars are used every day, so it is interesting to study how different vehicle characteristics relate to fuel efficiency. According to the EPA report, the agency has collected data on new light-duty vehicles sold in the United States since 1975, using EPA testing and data submitted by manufacturers through official EPA test procedures.
The variables I used in this project include both categorical and quantitative variables.
Categorical variables:
- fuelType1: the main type of fuel used by the vehicle
- VClass: the vehicle class, such as compact car or SUV
- trany: the transmission type
Quantitative variables:
- comb08: the vehicle's combined miles per gallon (MPG)
- displ: engine displacement in liters
- cylinders: the number of engine cylinders
- co2TailpipeGpm: tailpipe carbon dioxide emissions in grams per mile
- fuelCost08: estimated yearly fuel cost
I wanted to explore how engine size and other vehicle characteristics relate to fuel economy, and whether some vehicle classes or fuel types perform better than others. To clean the data, I selected only the variables needed for my project, removed rows with missing values in the main variables, and filtered out some unusual values so the regression and visualizations would be easier to read. The EPA report also explains that the fuel economy data in the report are based on estimated real-world values that use expanded testing to better reflect actual driving conditions.
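The cleaning steps described above can be sketched on a small made-up data frame. This is only an illustration: the toy values, the `extra_col` column, and the exact list of columns passed to `select()` are assumptions, not the code that produced the results in this report.

```r
library(dplyr)

# Toy stand-in for the raw EPA file (all values are made up for illustration)
raw <- data.frame(
  make           = c("A", "B", "C", "D", "E"),
  comb08         = c(21, 30, 15, 22, 40),
  displ          = c(2.0, NA, 5.7, 1.5, 1.4),
  cylinders      = c(4, 4, 8, 4, 4),
  trany          = c("Manual 5-spd", "Automatic 4-spd", "Automatic 3-spd", NA, "Automatic 4-spd"),
  co2TailpipeGpm = c(423, 296, 592, 404, 0),
  extra_col      = 1:5  # hypothetical column that is not needed for the project
)

cleaned <- raw |>
  # keep only the variables needed for the analysis
  select(make, comb08, displ, cylinders, trany, co2TailpipeGpm) |>
  # drop rows with missing values in the main variables
  filter(!is.na(displ), !is.na(cylinders), !is.na(trany)) |>
  # keep vehicles with positive tailpipe CO2 emissions
  filter(co2TailpipeGpm > 0)

cleaned
```

In this toy example only rows A and C survive: B is missing displ, D is missing trany, and E has zero tailpipe CO2.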
## Load the libraries and dataset
library(tidyverse)
library(plotly)
library(ggfortify)
vehicles <- read_csv("vehicle_fuel_econ_USEPA.csv")
head(vehicles)
# A tibble: 6 × 12
make model year fuelType1 VClass drive trany comb08 co2TailpipeGpm displ
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Alfa Rom… Spid… 1984 Regular … Two S… <NA> Manu… 21 423. 2
2 Bertone X1/9 1984 Regular … Two S… <NA> Manu… 22 404. 1.5
3 Chevrolet Corv… 1984 Regular … Two S… <NA> Auto… 15 592. 5.7
4 Chevrolet Corv… 1984 Regular … Two S… <NA> Manu… 15 592. 5.7
5 Nissan 300ZX 1984 Regular … Two S… <NA> Auto… 17 523. 3
6 Nissan 300ZX 1984 Regular … Two S… <NA> Auto… 18 494. 3
# ℹ 2 more variables: cylinders <dbl>, fuelCost08 <dbl>
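# I cleaned the dataset
# Note: vehicles_clean is not defined in the code shown here; it is assumed to
# come from the variable-selection step described in the introduction.
vehicles_clean2 <- vehicles_clean |>
  filter(!is.na(displ)) |>
  filter(!is.na(cylinders)) |>
  filter(!is.na(trany)) |>
  filter(co2TailpipeGpm > 0) # keep vehicles with positive tailpipe CO2 emissions
head(vehicles_clean2)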
# A tibble: 6 × 12
make model year fuelType1 VClass drive trany comb08 co2TailpipeGpm displ
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Alfa Rom… Spid… 1984 Regular … Two S… <NA> Manu… 21 423. 2
2 Bertone X1/9 1984 Regular … Two S… <NA> Manu… 22 404. 1.5
3 Chevrolet Corv… 1984 Regular … Two S… <NA> Auto… 15 592. 5.7
4 Chevrolet Corv… 1984 Regular … Two S… <NA> Manu… 15 592. 5.7
5 Nissan 300ZX 1984 Regular … Two S… <NA> Auto… 17 523. 3
6 Nissan 300ZX 1984 Regular … Two S… <NA> Auto… 18 494. 3
# ℹ 2 more variables: cylinders <dbl>, fuelCost08 <dbl>
summary(vehicles_clean2)
make model year fuelType1
Length:40522 Length:40522 Min. :1984 Length:40522
Class :character Class :character 1st Qu.:1991 Class :character
Mode :character Mode :character Median :2002 Mode :character
Mean :2001
3rd Qu.:2011
Max. :2019
VClass drive trany comb08
Length:40522 Length:40522 Length:40522 Min. : 7.00
Class :character Class :character Class :character 1st Qu.:17.00
Mode :character Mode :character Mode :character Median :20.00
Mean :20.19
3rd Qu.:23.00
Max. :58.00
co2TailpipeGpm displ cylinders fuelCost08
Min. : 29.0 Min. :0.600 Min. : 2.000 Min. : 700
1st Qu.: 386.4 1st Qu.:2.200 1st Qu.: 4.000 1st Qu.:1850
Median : 446.0 Median :3.000 Median : 6.000 Median :2250
Mean : 469.5 Mean :3.299 Mean : 5.721 Mean :2284
3rd Qu.: 522.9 3rd Qu.:4.300 3rd Qu.: 6.000 3rd Qu.:2700
Max. :1269.6 Max. :8.400 Max. :16.000 Max. :7150
# I explored the main categorical variables
vehicles_clean2 |>
  count(fuelType1, sort = TRUE)
# A tibble: 5 × 2
fuelType1 n
<chr> <int>
1 Regular Gasoline 27658
2 Premium Gasoline 11540
3 Diesel 1158
4 Midgrade Gasoline 106
5 Natural Gas 60
vehicles_clean2 |>
  count(VClass, sort = TRUE)
# A tibble: 34 × 2
VClass n
<chr> <int>
1 Compact Cars 5796
2 Subcompact Cars 5079
3 Midsize Cars 4786
4 Standard Pickup Trucks 2354
5 Large Cars 2090
6 Sport Utility Vehicle - 4WD 2090
7 Two Seaters 2025
8 Sport Utility Vehicle - 2WD 1619
9 Small Station Wagons 1582
10 Special Purpose Vehicles 1455
# ℹ 24 more rows
# I kept the most common fuel types and vehicle classes
# keep the 4 most common fuel types so the plots stay readable
top_fuel <- vehicles_clean2 |>
  count(fuelType1, sort = TRUE) |>
  slice_head(n = 4)
# keep the 7 most common vehicle classes so the heatmap is not too crowded
top_class <- vehicles_clean2 |>
  count(VClass, sort = TRUE) |>
  slice_head(n = 7)
# filter the dataset to only those common categories
vehicles_final <- vehicles_clean2 |>
  filter(fuelType1 %in% top_fuel$fuelType1) |>
  filter(VClass %in% top_class$VClass)
head(vehicles_final)
# A tibble: 6 × 12
make model year fuelType1 VClass drive trany comb08 co2TailpipeGpm displ
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Alfa Rom… Spid… 1984 Regular … Two S… <NA> Manu… 21 423. 2
2 Bertone X1/9 1984 Regular … Two S… <NA> Manu… 22 404. 1.5
3 Chevrolet Corv… 1984 Regular … Two S… <NA> Auto… 15 592. 5.7
4 Chevrolet Corv… 1984 Regular … Two S… <NA> Manu… 15 592. 5.7
5 Nissan 300ZX 1984 Regular … Two S… <NA> Auto… 17 523. 3
6 Nissan 300ZX 1984 Regular … Two S… <NA> Auto… 18 494. 3
# ℹ 2 more variables: cylinders <dbl>, fuelCost08 <dbl>
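In those steps, I kept only the four most common fuel types and the seven most common vehicle classes. This left 24,191 of the 40,522 cleaned rows, which matches the 24,188 residual degrees of freedom plus three estimated coefficients in the regression output. I did this so the final visualizations would be easier to read and less crowded, and so I could focus on the groups that appear most often in the dataset.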
# multiple linear regression model
model2 <- lm(comb08 ~ displ + cylinders, data = vehicles_final)
summary(model2)
Call:
lm(formula = comb08 ~ displ + cylinders, data = vehicles_final)
Residuals:
Min 1Q Median 3Q Max
-10.764 -2.375 -0.536 1.634 32.414
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.36608 0.08136 385.53 <2e-16 ***
displ -2.37264 0.04090 -58.02 <2e-16 ***
cylinders -0.49589 0.03011 -16.47 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.759 on 24188 degrees of freedom
Multiple R-squared: 0.5446, Adjusted R-squared: 0.5446
F-statistic: 1.446e+04 on 2 and 24188 DF, p-value: < 2.2e-16
autoplot(model2, 1:4, nrow =2, ncol =2)
Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Before choosing my final model, I also tested a model that included co2TailpipeGpm as a predictor along with displ and cylinders. However, I removed co2TailpipeGpm from the final model because I wanted the regression to focus more directly on vehicle characteristics. My final model uses displ and cylinders to predict comb08, which makes the model simpler and easier to interpret.
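A possible additional reason to leave co2TailpipeGpm out, offered here as a side check rather than something from the original analysis: for gasoline vehicles, tailpipe CO2 per mile is almost a direct conversion of MPG. The sketch below assumes the EPA's commonly cited figure of roughly 8,887 grams of CO2 per gallon of gasoline (an assumption not stated in this report) and compares it with the six rows printed by head(vehicles) above.

```r
# Assumed conversion factor: grams of CO2 per gallon of gasoline (not from this report)
grams_co2_per_gallon <- 8887

# comb08 and co2TailpipeGpm as printed (rounded) in the head(vehicles) output
mpg          <- c(21, 22, 15, 15, 17, 18)
co2_reported <- c(423, 404, 592, 592, 523, 494)

# CO2 per mile implied by MPG alone
co2_implied <- grams_co2_per_gallon / mpg

# Side-by-side comparison of reported and implied values
data.frame(mpg, co2_reported, co2_implied = round(co2_implied, 1))
```

If the two columns agree this closely, using co2TailpipeGpm to predict comb08 would mostly restate the response variable rather than explain it. This was only checked against the printed rows, not the full dataset or the non-gasoline fuel types.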
The final multiple linear regression model has the equation: comb08 = 31.3661 - 2.3726(displ) - 0.4959(cylinders). The intercept of 31.3661 is the predicted combined MPG when both engine displacement and the number of cylinders equal 0, which is not meaningful for a real vehicle. The coefficient for displ shows that each 1-liter increase in engine displacement lowers the predicted combined MPG by about 2.373, holding the number of cylinders constant. The coefficient for cylinders shows that each additional cylinder lowers the predicted combined MPG by about 0.496, holding engine displacement constant.
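To make the equation concrete, here is a small sketch that plugs two example engines into the fitted coefficients from the summary(model2) output. The helper name `predict_mpg` and the two example engines are hypothetical choices for illustration; with the fitted model object available, `predict(model2, newdata = ...)` would do the same job.

```r
# Coefficients copied from the summary(model2) output above
predict_mpg <- function(displ, cylinders) {
  31.36608 - 2.37264 * displ - 0.49589 * cylinders
}

# Hypothetical example engines
small_engine <- predict_mpg(displ = 2.0, cylinders = 4)  # about 24.6 MPG
large_engine <- predict_mpg(displ = 5.7, cylinders = 8)  # about 13.9 MPG

round(c(small = small_engine, large = large_engine), 2)
```

The gap of roughly 10.8 MPG between the two predictions comes from both terms: 3.7 extra liters of displacement and 4 extra cylinders.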
The model was statistically significant overall, with a p-value less than 2.2e-16, and the adjusted R-squared value was 0.5446. This means the model explains about 54.46% of the variation in combined MPG, which suggests that engine displacement and number of cylinders are important predictors of fuel economy.
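As a consistency check, the adjusted R-squared and the F-statistic can be recomputed from the reported R-squared and degrees of freedom using the standard formulas. The variable names below are my own, and because the inputs are rounded values from the printed output, the results only match to about four significant digits.

```r
r2 <- 0.5446          # multiple R-squared from the output
p  <- 2               # number of predictors (displ, cylinders)
df_resid <- 24188     # residual degrees of freedom from the output
n  <- df_resid + p + 1  # number of rows used in the model

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic for testing that all slopes are zero
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

c(n = n, adj_r2 = round(adj_r2, 4), f_stat = round(f_stat))
```

With more than 24,000 rows and only two predictors, the penalty is tiny, which is why the adjusted R-squared equals the multiple R-squared to four decimal places.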
# I created a scatterplot of engine displacement and combined MPG
ggplot(vehicles_final, aes(x = displ, y = comb08, color = fuelType1)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Engine Displacement and Combined MPG by Fuel Type",
    subtitle = "Filtered EPA vehicle fuel economy data",
    x = "Engine Displacement (Liters)",
    y = "Combined MPG",
    color = "Fuel Type",
    caption = "Source: U.S. Environmental Protection Agency"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(base_size = 12)
`geom_smooth()` using formula = 'y ~ x'
This scatterplot shows the relationship between engine displacement and combined MPG for the main fuel types in the EPA vehicle dataset. The overall pattern shows a negative relationship, which means that vehicles with larger engines usually have lower fuel economy. The fitted regression lines also show this downward trend for each fuel type, although the strength of the pattern is a little different across groups. Most of the vehicles in this filtered dataset use regular gasoline or premium gasoline, so those categories appear more often in the graph. This visualization supports the regression results because it shows that engine size is an important factor related to fuel economy.
# I created a heatmap of average MPG by vehicle class and fuel type
heat_data <- vehicles_final |>
  group_by(VClass, fuelType1) |>
  summarise(avg_mpg = mean(comb08), .groups = "drop")
ggplot(heat_data, aes(x = fuelType1, y = VClass, fill = avg_mpg)) +
  geom_tile(color = "white") +
  labs(
    title = "Average MPG by Vehicle Class and Fuel Type",
    subtitle = "Filtered EPA vehicle fuel economy data",
    x = "Fuel Type",
    y = "Vehicle Class",
    fill = "Average MPG",
    caption = "Source: U.S. Environmental Protection Agency"
  ) +
  scale_fill_distiller(palette = "YlGnBu", direction = 1) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))
This heatmap shows the average combined MPG for different combinations of vehicle class and fuel type. Darker colors represent higher average fuel economy, while lighter colors represent lower average MPG. One clear pattern is that compact and subcompact cars tend to have higher average MPG than larger vehicle classes such as pickup trucks and sport utility vehicles. The heatmap also shows that fuel type matters, since some fuel types have better average fuel economy within the same vehicle class. This visualization is useful because it summarizes many group comparisons at the same time in one figure.
This interactive bubble chart shows the relationship between engine displacement and combined MPG while also adding annual fuel cost to the same visualization. The size of each bubble represents annual fuel cost, so larger bubbles show vehicles that cost more to fuel each year. The plot shows the same general pattern as the scatterplot, where vehicles with larger engine displacement tend to have lower fuel economy. Making the chart interactive lets the viewer hover over each point to see details such as the make, model, year, and fuel type. This makes the visualization more useful because it combines several variables in one graph and lets the viewer inspect individual vehicles.
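The code for the bubble chart is not shown in this report, so the following is only a sketch of how a chart like the one described could be built with plotly. The toy data frame, its make and model labels, and its fuel cost values are made up, and the styling choices are assumptions rather than the original settings.

```r
library(plotly)

# Toy stand-in for vehicles_final (all values are made up for illustration)
toy <- data.frame(
  make       = c("Make A", "Make B", "Make C", "Make D"),
  model      = c("Model 1", "Model 2", "Model 3", "Model 4"),
  year       = c(1984, 1995, 2005, 2015),
  fuelType1  = c("Regular Gasoline", "Premium Gasoline", "Regular Gasoline", "Diesel"),
  displ      = c(2.0, 5.7, 3.0, 2.0),
  comb08     = c(21, 15, 18, 30),
  fuelCost08 = c(2100, 3200, 2500, 1800)
)

bubble <- plot_ly(
  data = toy,
  x = ~displ,
  y = ~comb08,
  type = "scatter",
  mode = "markers",
  size = ~fuelCost08,        # bubble size encodes annual fuel cost
  color = ~fuelType1,        # bubble color encodes fuel type
  text = ~paste(make, model, year, fuelType1, sep = "<br>"),
  hoverinfo = "text",        # show only the custom hover text
  marker = list(opacity = 0.6)
) |>
  layout(
    title = "Engine Displacement, Combined MPG, and Annual Fuel Cost",
    xaxis = list(title = "Engine Displacement (Liters)"),
    yaxis = list(title = "Combined MPG")
  )

bubble
```

Replacing `toy` with `vehicles_final` would plot the real data, although with tens of thousands of points it may be worth sampling rows first to keep the widget responsive.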
## Conclusion
For this project, I used EPA vehicle fuel economy data to study how vehicle characteristics relate to fuel economy. My multiple linear regression showed that engine displacement and number of cylinders both have negative relationships with combined MPG, which means that vehicles with larger engines and more cylinders usually have lower fuel economy. The scatterplot supported this pattern by showing a clear downward trend between engine displacement and combined MPG, while the heatmap made it easier to compare average MPG across vehicle classes and fuel types. The interactive bubble chart added another layer by showing annual fuel cost and allowing the viewer to explore individual vehicles more closely. One thing that stood out to me is that smaller vehicle classes usually had better fuel economy than larger classes such as SUVs and pickup trucks. If I had more time, I would want to explore changes over time in more detail or compare manufacturers more directly.