US Vehicle Fuel Economy Data

I love my current car. It serves the purpose. But when the time comes, I will need a new car. I personally like Mercedes-Benz’s look, specifically AMG A35. However, I want to know if this will be the best in all areas, not just for the face. I wish to see fuel consumption and co2 emissions as well. At the end of this exploration, we will see if the Mercedes-Benz AMG A35 still be my no.1 choice or will I be changing my mind to other models.

The dataset I use is called “us-vehicle-fuel-economy-data-1984-2017”. This dataset contains United State vehicle fuel economy data starting from 1984. Data from Fuel Economy.GOV. Dataset provides model-specific fuel consumption ratings and estimated carbon dioxide emissions for retail sale vehicles. The size of this dataset is 13,853 KB.

The dataset itself is from: https://datasource.kapsarc.org

Source copyrights and descriptions are from: http://www.fueleconomy.gov

What questions am I considering?

  1. What types/brands/models spend the least/most fuel?
  2. What types/brands/models of cars save fuel on both city and highway?
  3. What types/brands/models of cars are good in the city or highway?
  4. Would Mercedes-Benz AMG A35 be one of the best cars?

Importing libraries

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(hrbrthemes)
## Warning: package 'hrbrthemes' was built under R version 4.1.3
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(dplyr)
library(viridis)
## Warning: package 'viridis' was built under R version 4.1.3
## Loading required package: viridisLite
library(epiDisplay)
## Warning: package 'epiDisplay' was built under R version 4.1.3
## Loading required package: foreign
## Loading required package: survival
## Warning: package 'survival' was built under R version 4.1.3
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: nnet
## 
## Attaching package: 'epiDisplay'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(MASS)
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(forcats)
library(lemon)
## Warning: package 'lemon' was built under R version 4.1.3
## 
## Attaching package: 'lemon'
## The following object is masked from 'package:purrr':
## 
##     %||%
## The following objects are masked from 'package:ggplot2':
## 
##     CoordCartesian, element_render
library(gplots)
## Warning: package 'gplots' was built under R version 4.1.3
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
#install.packages("lemon")
knit_print.data.frame <- lemon_print
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.1.3
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
#install.packages("gplots")

Loading Data

df <- read.csv("us-vehicle-fuel-economy-data-1984-2017.csv")

Data Exploration

as_tibble(df)
## # A tibble: 44,859 x 83
##     Year Manufacturer  Model     barrels08 barrelsA08 charge120 charge240 city08
##    <int> <chr>         <chr>         <dbl>      <int>     <int>     <int>  <int>
##  1  2016 Porsche       911 Turb~     14.9           0         0         0     17
##  2  2017 BMW           M6 Gran ~     18.6           0         0         0     14
##  3  2017 Hyundai       Elantra ~     11.0           0         0         0     24
##  4  2017 Lincoln       MKC AWD       13.5           0         0         0     19
##  5  2017 Mercedes-Benz AMG GLS63     21.3           0         0         0     13
##  6  2016 GMC           Sierra 1~     14.9           0         0         0     18
##  7  2016 Chevrolet     Silverad~     16.5           0         0         0     16
##  8  2017 Hyundai       Elantra        8.50          0         0         0     32
##  9  2017 BMW           550i xDr~     16.5           0         0         0     15
## 10  2017 Genesis       G80 RWD       13.5           0         0         0     18
## # ... with 44,849 more rows, and 75 more variables: city08U <int>,
## #   cityA08 <int>, cityA08U <int>, cityCD <int>, cityE <int>, cityUF <int>,
## #   co2 <int>, co2A <int>, co2TailpipeAGpm <int>, co2TailpipeGpm <dbl>,
## #   comb08 <int>, comb08U <int>, combA08 <int>, combA08U <int>, combE <int>,
## #   combinedCD <int>, combinedUF <int>, cylinders <int>, displ <dbl>,
## #   drive <chr>, engId <int>, eng_dscr <chr>, feScore <int>, fuelCost08 <int>,
## #   fuelCostA08 <int>, fuelType <chr>, fuelType1 <chr>, ghgScore <int>, ...
dim(df)
## [1] 44859    83
sum(is.na(df))
## [1] 36690
#str(df)

Data Descriptions

This dataset contains 44859 observations, 83 variables, 36690 NAs.

After a glimpse of the data, it is excellent that the dataset contains both quantitative and categorical values. However, the biggest challenge was too many variables. It was time-consuming to study the dataset and choose what variable to be used. In addition to that, there were 36690 NA values.

We will be only using only 13 variables.

year- int- model year
Manufacturer - chr -manufacturer (division)
model - chr - model name (carline)
VClass- chr - EPA vehicle size class: Minicompact Cars, Compact Cars, Midsize Cars, Small
displ - dbl- engine displacement in liters
cylinders- int - engine cylinders
drive -chr - drive axle type
fuelCost08 - int- annual fuel cost for fuelType1 ($) (7)
city08 -int - city MPG for fuelType1
highway08- int - highway MPG for fuelType1
trany- chr - transmission: Automatic, Manual
fuelType - chr - fuel type with fuelType1 and fuelType2 (if applicable)
co2TailpipeGpm - double - tailpipe CO2 in grams/mile fuelType1

*EPA - U.S. Environmental Protection Agency

Cleaning

Selecting variables.

new_df <- df %>%
  dplyr::select(Year, Manufacturer, Model,VClass, displ, cylinders,drive, fuelCost08,city08, highway08, trany,fuelType, co2TailpipeGpm)
#new_df

Rename variables.

colnames(new_df)[which(names(new_df) == "VClass")] <- "Class"
colnames(new_df)[which(names(new_df) == "displ")] <- "EngineSize"
colnames(new_df)[which(names(new_df) == "cylinders")] <- "Cylinders"
colnames(new_df)[which(names(new_df) == "drive")] <- "Drive"
colnames(new_df)[which(names(new_df) == "fuelCost08")] <- "AnnualCost"
colnames(new_df)[which(names(new_df) == "city08")] <- "CityConcumptions"
colnames(new_df)[which(names(new_df) == "highway08")] <- "HwyConcumptions"
colnames(new_df)[which(names(new_df) == "trany")] <- "Transmission"
colnames(new_df)[which(names(new_df) == "fuelType")] <- "FuelType"
colnames(new_df)[which(names(new_df) == "co2TailpipeGpm")] <- "CO2Emissions"
#new_df
#summary(new_df)
sum(is.na(new_df))
## [1] 748

Remove NAs

new_df_nona <- na.omit(new_df)
summary(new_df_nona)
##       Year      Manufacturer          Model              Class          
##  Min.   :1984   Length:44484       Length:44484       Length:44484      
##  1st Qu.:1992   Class :character   Class :character   Class :character  
##  Median :2004   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2003                                                           
##  3rd Qu.:2014                                                           
##  Max.   :2023                                                           
##    EngineSize      Cylinders        Drive             AnnualCost   
##  Min.   :0.600   Min.   : 2.00   Length:44484       Min.   : 1050  
##  1st Qu.:2.200   1st Qu.: 4.00   Class :character   1st Qu.: 2800  
##  Median :3.000   Median : 6.00   Mode  :character   Median : 3300  
##  Mean   :3.283   Mean   : 5.71                      Mean   : 3382  
##  3rd Qu.:4.200   3rd Qu.: 6.00                      3rd Qu.: 3800  
##  Max.   :8.400   Max.   :16.00                      Max.   :10350  
##  CityConcumptions HwyConcumptions Transmission         FuelType        
##  Min.   : 6.00    Min.   : 9.00   Length:44484       Length:44484      
##  1st Qu.:15.00    1st Qu.:20.00   Class :character   Class :character  
##  Median :17.00    Median :24.00   Mode  :character   Mode  :character  
##  Mean   :18.18    Mean   :24.39                                        
##  3rd Qu.:21.00    3rd Qu.:28.00                                        
##  Max.   :58.00    Max.   :61.00                                        
##   CO2Emissions   
##  Min.   :  22.0  
##  1st Qu.: 386.4  
##  Median : 444.4  
##  Mean   : 463.5  
##  3rd Qu.: 522.8  
##  Max.   :1269.6
sum(is.na(new_df_nona))
## [1] 0

There are too many observations, we will be only observed 2022 - 2023 model years. In other word, we will be only looking for brand new cars :)

#install.packages("kableExtra")
library(kableExtra)
#tb <- tab1(new_df_nona$Manufacturer, sort.group = "decreasing", cum.percent = TRUE)

df_2022 <- new_df_nona %>% 
      filter(Year >= 2022 ) 
#df_2022

#df_2022 %>%
 # kbl() %>%
  #kable_paper("hover", full_width = F)

#group_car <- df_2022 %>% 
 #     group_by(Manufacturer) %>%
  #    summarize(count = n()) %>%
   #   mutate(Percent = (round(count / sum(count), 5)) *100) %>%
    #  arrange(desc(Percent))

This will leave us with 1,305 observations.

Before we begin, we need to find information for that one car Mercedez-Benz AMG A35

amg_a35 <- df_2022 %>% 
      filter(Model == "AMG A35 4matic") 

amg_a35 %>%
  kbl() %>%
  kable_paper("hover", full_width = F)
Year Manufacturer Model Class EngineSize Cylinders Drive AnnualCost CityConcumptions HwyConcumptions Transmission FuelType CO2Emissions
2022 Mercedes-Benz AMG A35 4matic Subcompact Cars 2 4 All-Wheel Drive 2900 22 30 Automatic (AM7) Premium 356

Visualizing and analyzing

We will compare:
CO2 Emissions vs. Transmission
CO2 Emissions vs. Vehicle Class
CO2 Emissions vs. Fuel Types

options(scipen=10000)
p <- df_2022 %>%
  ggplot( aes(x=Transmission, y=CO2Emissions, fill=Transmission)) +
  scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
  theme_classic()+ 
    geom_bar(stat = "identity", width=0.5) +
    coord_flip() +
    xlab("Transmissions") +
  theme(legend.position="none")+
  ggtitle("CO2 Emission VS. Transmission")
 p

#ggplotly(p, tooltip=c("x", "y")) 

It seems that Automatic (S8) has the higest CO2 Emission, follow by Automatic 8-spd, Automatic 10-spd, Automatic (S10), Automatic 9-spd.

Transmission typs that has the least CO2 Emission are Automatic (A1), Automatic (AM-S9), and Automatic 7-spd

The AMG A35 4matic is a Automatic(AM7). It does not belongs to either the lowest or the highest CO2 Emission,

options(scipen=10000)
p <- df_2022 %>%
  ggplot( aes(x=Class, y=CO2Emissions, fill=Class)) +
  scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
  theme_classic()+ 
    geom_bar(stat = "identity", width=0.5) +
    coord_flip() +
    xlab("Vehicle Class") +
  theme(legend.position="none")+
  ggtitle("CO2 Emission VS. Vehicle Class")
 p

#ggplotly(p, tooltip=c("x", "y")) 

Small Standard sport utility 4WD and Standard sport utility 4WD have about the highest emissions of CO2, followed by Midsize cars, Subcompact cars, and Standard Pickup Trucks 4WD.

Vans, Passenger Type, Minivan-4WD, and Minivan - 2WD are giving the least CO2.

The AMG A35 4matic is a Subcompact Car.

van <- df_2022 %>%
     filter(Class == "Vans, Passenger Type" | Class == "Minivan - 4WD" | Class == "Minivan - 2WD") 
van <- van %>% arrange(CO2Emissions)
head(van, 5)%>%
  kbl() %>%
  kable_paper("hover", full_width = F)
Year Manufacturer Model Class EngineSize Cylinders Drive AnnualCost CityConcumptions HwyConcumptions Transmission FuelType CO2Emissions
2022 Chrysler Pacifica Hybrid Minivan - 2WD 3.6 6 Front-Wheel Drive 2050 29 30 Automatic (variable gear ratios) Regular Gas and Electricity 119
2022 Toyota Sienna 2WD Minivan - 2WD 2.5 4 Front-Wheel Drive 1700 36 36 Automatic (AV-S6) Regular 246
2022 Toyota Sienna AWD Minivan - 4WD 2.5 4 All-Wheel Drive 1750 35 36 Automatic (AV-S6) Regular 253
2022 Honda Odyssey Minivan - 2WD 3.5 6 Front-Wheel Drive 2800 19 28 Automatic (S10) Regular 394
2023 Honda Odyssey Minivan - 2WD 3.5 6 Front-Wheel Drive 2800 19 28 Automatic (S10) Regular 394
options(scipen=10000)
p <- df_2022 %>%
  ggplot( aes(x=FuelType, y=CO2Emissions, fill=FuelType)) +
  scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
  theme_classic()+ 
    geom_bar(stat = "identity", width=0.5) +
    coord_flip() +
    xlab("Fuel Types") +
  theme(legend.position="none")+
  ggtitle("CO2 Emission VS. Fuel Types")
 p

Statistical Analysis

From the previous analysis, it leads to the question if Vehicle class and Fuel type are independent?
Conduct a Chi-Square test with = 0.05.
H0: Vehicle class and Fuel type are independent.
Hα: Vehicle class and Fuel type are dependent.

tbl = table(df_2022$FuelType, df_2022$Class) 
#tbl

result <- chisq.test(tbl); result
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 797.61, df = 120, p-value < 0.00000000000000022

P-value: 0.00000000000000022 < 0.05
Conclusion: Reject the Null Hypothesis H0: Fuel type and Vehicle class are independent.
There is enough evidence to show that Fuel type and Vehicle class are dependent.

p.box <- df_2022 %>%
  ggplot(aes(x=FuelType, y=AnnualCost, fill=FuelType)) +
    geom_boxplot()+
  scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
  theme_classic() +  
  theme(axis.text.x = element_text(size = 7, angle = 25, hjust = 1))+
  theme(legend.position="none")+
  ggtitle("Annual Fuel Cost vs. Fuel Types")
ggplotly(p.box)

It appears that the average annual fuel cost differs significantly between the fuel types. Is it true that it is cheaper on average to for regular vs. Manhattan? Let’s find out. Conduct a t-test of a single mean at the 95% confidence level (alpha = 0.05).

topsammary <-df_2022 %>%      # Summary by group using dplyr
  group_by(FuelType) %>% 
  summarize(min = min(AnnualCost),
            mean = mean(AnnualCost),
            median = median(AnnualCost),
            max = max(AnnualCost),
            sd=sd(AnnualCost),
             count = n())
topsammary
## # A tibble: 7 x 7
##   FuelType                      min  mean median   max    sd count
##   <chr>                       <int> <dbl>  <dbl> <int> <dbl> <int>
## 1 Diesel                       2950 3334.   3350  4050  245.    29
## 2 Gasoline or E85              2350 3382.   3400  4050  518.    17
## 3 Midgrade                     3400 3840    4000  4000  216.    15
## 4 Premium                      1850 3553.   3450  8050  796.   658
## 5 Premium and Electricity      2000 3007.   2800  4250  609.    29
## 6 Regular                      1050 2579.   2550  4350  679.   543
## 7 Regular Gas and Electricity  1150 1914.   1825  3050  578.    14

H0:μr=μp
Hα:μr<μp

premium <- subset(df_2022, FuelType == "Premium")
regular <- subset(df_2022, FuelType == "Regular")


res <- t.test(regular$AnnualCost, premium$AnnualCost, alternative = "less", conf.level = 0.95)
res
## 
##  Welch Two Sample t-test
## 
## data:  regular$AnnualCost and premium$AnnualCost
## t = -22.9, df = 1197.7, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -904.2596
## sample estimates:
## mean of x mean of y 
##  2578.821  3553.116

P-value: 0.00000000000000022 < 0.05 Conclusion: Reject the Null Hypothesis H0:μr=μp There is enough evidence to show that the mean annual fuel cost for Regular fuel is less than Premium fuel.

w <- df_2022 %>% select( Model, AnnualCost, FuelType)%>% 
     filter(FuelType == "Regular" | FuelType == "Premium")


# Compute t-test
res <- t.test(AnnualCost ~ FuelType, data = w)
res
## 
##  Welch Two Sample t-test
## 
## data:  AnnualCost by FuelType
## t = 22.9, df = 1197.7, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means between group Premium and group Regular is not equal to 0
## 95 percent confidence interval:
##   890.823 1057.765
## sample estimates:
## mean in group Premium mean in group Regular 
##              3553.116              2578.821