Uncovering the Story Behind MPG: An Exploratory Analysis

Eagles: Emily Johnson, Michael Davis, Sophia Martinez

Introduction

  • Knowing the factors that affect Miles Per Gallon (MPG) is important because it helps consumers lower fuel expenses, supports automakers in improving efficiency, and contributes to reducing environmental impacts such as carbon emissions.

  • It also allows drivers to make smarter vehicle purchasing and maintenance decisions.

  • Studying MPG helps identify factors that improve vehicle efficiency.

Project Goal

The goal of this project is to explore the factors that could impact MPG.

Data

We obtained the mpg dataset from R (it ships with ggplot2, which is loaded as part of the tidyverse). It contains 234 records and 11 variables describing the fuel economy and characteristics of various vehicles. The following code lists the variables (columns) in mpg. The hwy and cty columns hold the MPG measurements.

library(tidyverse)
library(knitr)

data.frame(Variable_Names = names(mpg)) %>%
  knitr::kable(
    caption = "Variable Names in mpg Dataset"
  )
Variable Names in mpg Dataset
Variable_Names
manufacturer
model
displ
year
cyl
trans
drv
cty
hwy
fl
class

There are two MPG-related variables in this dataset: highway MPG (hwy) and city MPG (cty). To provide a more comprehensive measure of fuel efficiency, we create a new variable by combining these two values.

mpg <- mpg %>%
  mutate(mpg = (cty + hwy) / 2)
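
As a quick check of the combined measure, the sketch below recomputes the average on a fresh copy of the data and confirms that it matches a row-wise mean of cty and hwy. The name mpg_avg is our own placeholder, used so that the original mpg tibble is not overwritten. The simple unweighted mean is the choice made in this report, not an official combined rating.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# mpg_avg is a hypothetical name; the report itself overwrites mpg
mpg_avg <- ggplot2::mpg %>%
  mutate(mpg = (cty + hwy) / 2)

# The new column should equal the row-wise mean of the two source columns
check <- all.equal(
  mpg_avg$mpg,
  unname(rowMeans(mpg_avg[, c("cty", "hwy")]))
)

# Inspect a few rows side by side
mpg_avg %>%
  select(manufacturer, model, cty, hwy, mpg) %>%
  head(3)
```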

You can interact with the data using the search box, such as limiting results to Audi or a chosen year.

library(DT)
datatable(mpg)
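
The same filtering can be done in code rather than through the search box. The sketch below shows one way to do it with dplyr; the manufacturer and year values are only illustrative choices, and audi_2008 is a placeholder name.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Keep only Audi vehicles from the 2008 model year (illustrative values).
# Manufacturer names are stored in lowercase in this dataset.
audi_2008 <- ggplot2::mpg %>%
  filter(manufacturer == "audi", year == 2008)

audi_2008 %>%
  select(manufacturer, model, year, cty, hwy)
```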

Analysis

This project analyzes the data across four levels:

  • MPG analysis (the target variables)

  • Relationships between MPG and numeric variables

  • A deeper exploration of MPG with additional variables to highlight drivetrain types

  • Outlier analysis by drivetrain type

Target Variable Analysis

To better understand the MPG variables, we summarize the dataset by reporting the total number of records and variables, along with the average, minimum, maximum, and median values for overall, highway, and city MPG. We also examine their distributions. The results show that the average overall, highway, and city MPG are approximately 20, 23, and 17, respectively, indicating that highway MPG is higher than city MPG on average.

library(tidyverse)
library(knitr)

Row1 <- mpg %>%
  summarise(
    Records = n(),
    Variables = ncol(.),
    Avg = mean(mpg),
    Min = min(mpg),
    Max = max(mpg),
    Med = median(mpg)
  ) 

Row2 <- mpg %>%
  summarise(
    Records = n(),
    Variables = ncol(.),
    Avg = mean(hwy),
    Min = min(hwy),
    Max = max(hwy),
    Med = median(hwy)
  )

Row3 <- mpg %>%
  summarise(
    Records = n(),
    Variables = ncol(.),
    Avg = mean(cty),
    Min = min(cty),
    Max = max(cty),
    Med = median(cty)
  )

bind_data_rows <- data.frame(rbind(Row1,Row2,Row3))

rownames(bind_data_rows) <- c("Overall MPG", "Highway MPG","City MPG")

bind_data_rows%>%
  knitr::kable(
    caption = "Summary of Overall MPG Dataset",
    digits = 2
  )
Summary of Overall MPG Dataset
             Records  Variables    Avg   Min   Max   Med
Overall MPG      234         12  20.15  10.5  39.5  20.5
Highway MPG      234         12  23.44  12.0  44.0  24.0
City MPG         234         12  16.86   9.0  35.0  17.0
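
The three summarise() calls above repeat the same statistics for each measure. As an alternative sketch, the data can be reshaped to long form so that one grouped summary produces all three rows. The names mpg_long, summary_tbl, and overall are placeholders introduced here, not part of the original analysis.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the mpg dataset

# Reshape the three MPG measures into one long column
mpg_long <- ggplot2::mpg %>%
  mutate(overall = (cty + hwy) / 2) %>%
  select(overall, hwy, cty) %>%
  pivot_longer(everything(), names_to = "measure", values_to = "value")

# One summary row per measure
summary_tbl <- mpg_long %>%
  group_by(measure) %>%
  summarise(
    Avg = mean(value),
    Min = min(value),
    Max = max(value),
    Med = median(value)
  )

summary_tbl
```

This version drops the Records and Variables columns, which are the same for every row.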
library(plotly)
library(ggplot2)

plot_ly(
  data = mpg,
  x = ~mpg,
  type = "histogram",
  nbinsx = 10
)%>%
  layout(
    xaxis = list(title = "Overall MPG"),
    yaxis = list(title = "Frequency")
  )
plot_ly(
  data = mpg,
  x = ~hwy,
  type = "histogram",
  nbinsx = 10
)%>%
  layout(
    xaxis = list(title = "Highway MPG"),
    yaxis = list(title = "Frequency")
  )
plot_ly(
  data = mpg,
  x = ~cty,
  type = "histogram",
  nbinsx = 10
)%>%
  layout(
    xaxis = list(title = "City MPG"),
    yaxis = list(title = "Frequency")
  )

Two-Dimension Analysis

This analysis investigates the relationship between MPG, engine displacement, and the number of cylinders. In all cases, a clear negative relationship is observed, indicating that vehicles with larger engines and more cylinders tend to have lower fuel efficiency.

library(patchwork)

p1 <- ggplot(mpg, aes(x = displ, y = mpg)) +
  geom_point(color = "steelblue", size = 2) +
  geom_smooth(color = "red", se = FALSE) +
  labs(
    title = "Average MPG vs Engine Displacement",
    x = "Engine Displacement",
    y = "Average MPG"
  ) +
  theme_minimal()

p2 <- ggplot(mpg, aes(x = cyl, y = mpg)) +
  geom_point(color = "darkgreen", size = 2) +
  geom_smooth(color = "red", se = FALSE) +
  labs(
    title = "Average MPG vs Cylinders",
    x = "Cylinders",
    y = "Average MPG"
  ) +
  theme_minimal()

p3 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue", size = 2) +
  geom_smooth(color = "red", se = FALSE) +
  labs(
    title = "Highway MPG vs Engine Displacement",
    x = "Engine Displacement",
    y = "Highway MPG"
  ) +
  theme_minimal()

p4 <- ggplot(mpg, aes(x = cyl, y = hwy)) +
  geom_point(color = "darkgreen", size = 2) +
  geom_smooth(color = "red", se = FALSE) +
  labs(
    title = "Highway MPG vs Cylinders",
    x = "Cylinders",
    y = "Highway MPG"
  ) +
  theme_minimal()

p5 <- ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point(color = "steelblue", size = 2) +
  geom_smooth(color = "red", se = FALSE) +
  labs(
    title = "City MPG vs Engine Displacement",
    x = "Engine Displacement",
    y = "City MPG"
  ) +
  theme_minimal()

p6 <- ggplot(mpg, aes(x = cyl, y = cty)) +
  geom_point(color = "darkgreen", size = 2) +
  geom_smooth(color = "red", se = FALSE) +
  labs(
    title = "City MPG vs Cylinders",
    x = "Cylinders",
    y = "City MPG"
  ) +
  theme_minimal()

(p1 + p2)/(p3 + p4)/(p5 + p6)
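
To put a number on the negative relationships seen in the scatterplots, one option is to compute Pearson correlations between each MPG measure and the two engine variables. This is a sketch under the assumption that a linear summary is adequate; the smoothed curves suggest some curvature, so the coefficients only describe the overall direction. The names mpg_avg and cor_mat are placeholders.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Recreate the combined measure on a separate copy of the data
mpg_avg <- ggplot2::mpg %>%
  mutate(mpg = (cty + hwy) / 2)

# Rows: MPG measures; columns: engine variables
cor_mat <- cor(
  mpg_avg[, c("mpg", "hwy", "cty")],
  mpg_avg[, c("displ", "cyl")]
)

round(cor_mat, 2)
```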

Three-Dimension Analysis

This analysis shows how the relationship between MPG and engine displacement varies across drivetrain types. The results show that:

  • Front-wheel drive and 4-wheel drive: MPG first decreases as engine displacement increases, then follows a more stable pattern at higher displacement levels.

  • Rear-wheel drive: MPG first decreases as engine displacement increases, then increases at higher displacement levels.

# Largest-displacement vehicle in each drivetrain group,
# with a readable drivetrain label
label_info <- mpg |>
  group_by(drv) |>
  arrange(desc(displ)) |>
  slice_head(n = 1) |>
  mutate(
    drive_type = case_when(
      drv == "f" ~ "front-wheel drive",
      drv == "r" ~ "rear-wheel drive",
      drv == "4" ~ "4-wheel drive"
    )
  ) |>
  select(displ, hwy, drv, drive_type)

p1 <- mpg |>
  ggplot(aes(x = displ, y = hwy, color = drv)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Highway MPG vs Engine Displacement",
    x = "Engine Displacement",
    y = "Highway MPG",
    color = "Type of Drivetrain"
  )+
  theme(legend.position = "none")

p2 <- mpg |>
  ggplot(aes(x = displ, y = cty, color = drv)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  labs(
    title = "City MPG vs Engine Displacement",
    x = "Engine Displacement",
    y = "City MPG",
    color = "Type of Drivetrain"
  )+
  theme(legend.position = "none")

p3 <- mpg |>
  ggplot(aes(x = displ, y = mpg, color = drv)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  labs(
    title = "MPG vs Engine Displacement",
    x = "Engine Displacement",
    y = "MPG",
    color = "Type of Drivetrain"
  ) +
  scale_color_discrete(
    labels = c(
      "4" = "4-Wheel Drive",
      "f" = "Front-Wheel Drive",
      "r" = "Rear-Wheel Drive"
    )
  ) +
  theme(legend.position = "bottom")


(p1+p2)/p3
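
The patterns described above come from reading smoothed curves. A rough numeric cross-check is to bin displacement and compare average MPG per drivetrain within each bin. The sketch below does this; the break points (1, 3, 5, 7) are an arbitrary assumption chosen only to cover the observed displacement range, and band_means is a placeholder name.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

band_means <- ggplot2::mpg %>%
  # Bands are (1,3], (3,5], (5,7]; the cut points are illustrative only
  mutate(displ_band = cut(displ, breaks = c(1, 3, 5, 7))) %>%
  group_by(drv, displ_band) %>%
  summarise(
    n = n(),               # vehicles in this drivetrain/band cell
    mean_hwy = mean(hwy),
    mean_cty = mean(cty),
    .groups = "drop"
  )

band_means
```

Cells with only a few vehicles should be read with caution, since a single model can dominate the average.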

Outlier Analysis

This analysis examines the drivetrain categories of outliers based on MPG and engine displacement. The first step in identifying outliers is determining threshold values. The boxplots suggest approximate cutoffs of 40 for highway MPG, 30 for city MPG, and 35 for overall MPG. However, no clear outliers are observed for engine displacement.

library(patchwork)

p1 <- ggplot(mpg, aes(x = mpg)) +
  geom_boxplot(fill = "steelblue", color = "black") +
  labs(
    title = "Boxplot of Overall MPG",
    x = ""
  )

p2 <- ggplot(mpg, aes(x = hwy)) +
  geom_boxplot(fill = "steelblue", color = "black") +
  labs(
    title = "Boxplot of Highway MPG",
    x = ""
  ) +
  scale_x_continuous(limits = c(0, 60))

p3 <- ggplot(mpg, aes(x = cty)) +
  geom_boxplot(fill = "steelblue", color = "black") +
  labs(
    title = "Boxplot of City MPG",
    x = ""
  ) +
  scale_x_continuous(limits = c(0, 60))

p4 <- ggplot(mpg, aes(x = displ)) +
  geom_boxplot(fill = "steelblue", color = "black") +
  labs(
    title = "Boxplot of Engine Displacement",
    x = ""
  )

p1 / p2 / p3 / p4
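
Boxplots conventionally flag points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below computes that upper fence for each variable so the cutoffs read from the plots can be compared against it. The helper upper_fence and the names mpg_avg, fences, and n_above are our own; the cutoffs quoted in the text are approximate visual readings and need not equal these values.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

mpg_avg <- ggplot2::mpg %>%
  mutate(mpg = (cty + hwy) / 2)

# Upper fence of the 1.5 * IQR rule: Q3 + 1.5 * (Q3 - Q1)
upper_fence <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  q[2] + 1.5 * (q[2] - q[1])
}

vars <- c("mpg", "hwy", "cty", "displ")
fences <- sapply(mpg_avg[vars], upper_fence)

# Number of vehicles above each fence
n_above <- sapply(vars, function(v) sum(mpg_avg[[v]] > fences[[v]]))

fences
n_above
```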

The final analysis highlights the outliers for each MPG measure. The plots show that these outliers are concentrated in the front-wheel drive category, suggesting that the most fuel-efficient vehicles in this dataset tend to be front-wheel drive.

library(ggrepel)
potential_outliers <- mpg |>
  filter(mpg >= 35)

ggplot(mpg, aes(x = displ, y = mpg)) +
  geom_point(color = "black") +
    geom_point(
    data = potential_outliers,
    aes(color = drv),
    size = 3
  ) +
  scale_y_continuous(limits = c(0, 50))+
  geom_text_repel(
    data = potential_outliers,
    aes(label = model, color = drv),
    show.legend = FALSE
  ) +
  labs(
    title = "Overall MPG Outliers by Drivetrain",
    x = "Engine Displacement",
    y = "Overall MPG",
    color = "Type of Drivetrain"
  ) +
  scale_color_discrete(
    labels = c(
      "4" = "4-Wheel Drive",
      "f" = "Front-Wheel Drive",
      "r" = "Rear-Wheel Drive"
    )
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

library(ggrepel)
potential_outliers <- mpg |>
  filter(hwy >= 40)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "black") +
    geom_point(
    data = potential_outliers,
    aes(color = drv),
    size = 3
  ) +
  scale_y_continuous(limits = c(0, 50))+
  geom_text_repel(
    data = potential_outliers,
    aes(label = model, color = drv),
    show.legend = FALSE
  ) +
  labs(
    title = "Highway MPG Outliers by Drivetrain",
    x = "Engine Displacement",
    y = "Highway MPG",
    color = "Type of Drivetrain"
  ) +
  scale_color_discrete(
    labels = c(
      "4" = "4-Wheel Drive",
      "f" = "Front-Wheel Drive",
      "r" = "Rear-Wheel Drive"
    )
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

library(ggrepel)
potential_outliers <- mpg |>
  filter(cty >= 30)

ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point(color = "black") +
    geom_point(
    data = potential_outliers,
    aes(color = drv),
    size = 3
  ) +
  scale_y_continuous(limits = c(0, 50))+
  geom_text_repel(
    data = potential_outliers,
    aes(label = model, color = drv),
    show.legend = FALSE
  ) +
  labs(
    title = "City MPG Outliers by Drivetrain",
    x = "Engine Displacement",
    y = "City MPG",
    color = "Type of Drivetrain"
  ) +
  scale_color_discrete(
    labels = c(
      "4" = "4-Wheel Drive",
      "f" = "Front-Wheel Drive",
      "r" = "Rear-Wheel Drive"
    )
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
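
The claim that outliers cluster in one drivetrain category can also be checked with a simple count. The sketch below applies the three thresholds used in the plots above (35 overall, 40 highway, 30 city) and tallies the flagged vehicles by drivetrain; mpg_avg, flagged, and outlier_counts are placeholder names.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

mpg_avg <- ggplot2::mpg %>%
  mutate(mpg = (cty + hwy) / 2)

# Vehicles exceeding at least one of the thresholds used in the plots
flagged <- mpg_avg %>%
  filter(mpg >= 35 | hwy >= 40 | cty >= 30)

# Tally flagged vehicles per drivetrain type
outlier_counts <- flagged %>%
  count(drv, name = "n_outliers")

flagged %>% select(manufacturer, model, drv, cty, hwy, mpg)
outlier_counts
```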

Conclusion

  1. Highway MPG (about 23 on average) is higher than overall MPG (about 20) and city MPG (about 17).
  2. MPG has a negative relationship with engine displacement.
  3. MPG decreases as the number of cylinders increases.
  4. Larger engines are associated with lower fuel efficiency.
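  5. For front-wheel and 4-wheel drive vehicles, the MPG–displacement relationship becomes more stable at higher displacement, while for rear-wheel drive vehicles it turns positive.
  6. MPG patterns differ by drivetrain type.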
  7. Front-wheel drive vehicles account for most high-MPG outliers.

Contact Information

Thanks for visiting our page!

  • emily.johnson@kennesaw.edu

  • michael.davis@kennesaw.edu

  • sophia.martinez@kennesaw.edu