Final Project

US Vehicle Fuel Economy Data

I love my current car. It serves the purpose. But when the time comes, I will get a new car. I personally like Mercedes-Benz’s design, specifically AMG. However, I want to know what other options are out there.

The dataset I use is called “us-vehicle-fuel-economy-data-1984-2017”. This dataset contains United State vehicle fuel economy data starting from 1984. It provides model-specific fuel consumption ratings and estimated carbon dioxide emissions for retail sale vehicles. The size of this dataset is 13,853 KB.

The dataset itself is from: https://datasource.kapsarc.org

Source copyrights and descriptions are from: http://www.fueleconomy.gov

I will be investigating:

What Manufacturer have the most/least average annual fuel cost?
What types of vehicles have the least CO2 emissions?
Are there any relationships between the type of car and the fuel?

Importing libraries

library(tidyverse)
library(ggplot2)
library(hrbrthemes)
library(dplyr)
library(viridis)
library(epiDisplay)
library(MASS)
library(plotly)
library(forcats)
library(lemon)
library(gplots)
#install.packages("lemon")
knit_print.data.frame <- lemon_print
library(kableExtra)
#install.packages("gplots")

Loading Data

df <- read.csv("us-vehicle-fuel-economy-data-1984-2017.csv")

Data Exploration

as_tibble(df)

## # A tibble: 44,859 x 83
##     Year Manufacturer  Model     barrels08 barrelsA08 charge120 charge240 city08
##    <int> <chr>         <chr>         <dbl>      <int>     <int>     <int>  <int>
##  1  2016 Porsche       911 Turb~     14.9           0         0         0     17
##  2  2017 BMW           M6 Gran ~     18.6           0         0         0     14
##  3  2017 Hyundai       Elantra ~     11.0           0         0         0     24
##  4  2017 Lincoln       MKC AWD       13.5           0         0         0     19
##  5  2017 Mercedes-Benz AMG GLS63     21.3           0         0         0     13
##  6  2016 GMC           Sierra 1~     14.9           0         0         0     18
##  7  2016 Chevrolet     Silverad~     16.5           0         0         0     16
##  8  2017 Hyundai       Elantra        8.50          0         0         0     32
##  9  2017 BMW           550i xDr~     16.5           0         0         0     15
## 10  2017 Genesis       G80 RWD       13.5           0         0         0     18
## # ... with 44,849 more rows, and 75 more variables: city08U <int>,
## #   cityA08 <int>, cityA08U <int>, cityCD <int>, cityE <int>, cityUF <int>,
## #   co2 <int>, co2A <int>, co2TailpipeAGpm <int>, co2TailpipeGpm <dbl>,
## #   comb08 <int>, comb08U <int>, combA08 <int>, combA08U <int>, combE <int>,
## #   combinedCD <int>, combinedUF <int>, cylinders <int>, displ <dbl>,
## #   drive <chr>, engId <int>, eng_dscr <chr>, feScore <int>, fuelCost08 <int>,
## #   fuelCostA08 <int>, fuelType <chr>, fuelType1 <chr>, ghgScore <int>, ...

dim(df)

## [1] 44859    83

#str(df)

Data Descriptions

This dataset contains 44859 observations, 83 variables, 36690 NAs.

After a glimpse of the data, it is excellent that the dataset contains both quantitative and categorical values. However, my concern is that the dataset has too many variables. It was time-consuming to study the dataset and choose what variable to be used. In addition to that, there were 36690 NA values.

We will be only using only 13 variables.

year- int- model year
Manufacturer - chr -manufacturer (division)
model - chr - model name (carline)
VClass- chr - EPA vehicle size class: Minicompact Cars, Compact Cars, Midsize Cars, Small
displ - dbl- engine displacement in liters
cylinders- int - engine cylinders
drive -chr - drive axle type
fuelCost08 - int- annual fuel cost for fuelType1 ($) (7)
city08 -int - city MPG for fuelType1
highway08- int - highway MPG for fuelType1
trany- chr - transmission: Automatic, Manual
fuelType - chr - fuel type with fuelType1 and fuelType2 (if applicable)
co2TailpipeGpm - double - tailpipe CO2 in grams/mile fuelType1

*EPA - U.S. Environmental Protection Agency

Cleaning

Selecting variables.

new_df <- df %>%
  dplyr::select(Year, Manufacturer, Model,VClass, displ, cylinders,drive, fuelCost08,city08, highway08, trany,fuelType, co2TailpipeGpm)
#new_df

Rename variables.

colnames(new_df)[which(names(new_df) == "VClass")] <- "Class"
colnames(new_df)[which(names(new_df) == "displ")] <- "EngineSize"
colnames(new_df)[which(names(new_df) == "cylinders")] <- "Cylinders"
colnames(new_df)[which(names(new_df) == "drive")] <- "Drive"
colnames(new_df)[which(names(new_df) == "fuelCost08")] <- "Cost"
colnames(new_df)[which(names(new_df) == "city08")] <- "City"
colnames(new_df)[which(names(new_df) == "highway08")] <- "Hwy"
colnames(new_df)[which(names(new_df) == "trany")] <- "Transmission"
colnames(new_df)[which(names(new_df) == "fuelType")] <- "Fuel"
colnames(new_df)[which(names(new_df) == "co2TailpipeGpm")] <- "CO2"
#new_df

#nrow(new_df)
(sum(is.na(new_df))/nrow(new_df))*100

## [1] 1.667447

Remove NAs

new_df_nona <- na.omit(new_df)
summary(new_df_nona)

##       Year      Manufacturer          Model              Class          
##  Min.   :1984   Length:44484       Length:44484       Length:44484      
##  1st Qu.:1992   Class :character   Class :character   Class :character  
##  Median :2004   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2003                                                           
##  3rd Qu.:2014                                                           
##  Max.   :2023                                                           
##    EngineSize      Cylinders        Drive                Cost      
##  Min.   :0.600   Min.   : 2.00   Length:44484       Min.   : 1050  
##  1st Qu.:2.200   1st Qu.: 4.00   Class :character   1st Qu.: 2800  
##  Median :3.000   Median : 6.00   Mode  :character   Median : 3300  
##  Mean   :3.283   Mean   : 5.71                      Mean   : 3382  
##  3rd Qu.:4.200   3rd Qu.: 6.00                      3rd Qu.: 3800  
##  Max.   :8.400   Max.   :16.00                      Max.   :10350  
##       City            Hwy        Transmission           Fuel          
##  Min.   : 6.00   Min.   : 9.00   Length:44484       Length:44484      
##  1st Qu.:15.00   1st Qu.:20.00   Class :character   Class :character  
##  Median :17.00   Median :24.00   Mode  :character   Mode  :character  
##  Mean   :18.18   Mean   :24.39                                        
##  3rd Qu.:21.00   3rd Qu.:28.00                                        
##  Max.   :58.00   Max.   :61.00                                        
##       CO2        
##  Min.   :  22.0  
##  1st Qu.: 386.4  
##  Median : 444.4  
##  Mean   : 463.5  
##  3rd Qu.: 522.8  
##  Max.   :1269.6

sum(is.na(new_df_nona))

## [1] 0

There are too many observations. Therefore, we will be only looking at the 2022 - 2023 model years. In other words, we will be only looking for brand new cars :)

df_2022 <- new_df_nona %>% 
      filter(Year >= 2022 ) 
#sum(is.na(new_df_nona))

This will leave us with 1,305 observations.

Average Annual fuel cost by Brand

ave_cost <-df_2022 %>%      # Summary by group using dplyr
  group_by(Manufacturer) %>% 
  summarize(Cost = mean(Cost))%>%
  arrange(Cost)

ave_cost_Most <- tail(ave_cost,15)
p2 <- ave_cost_Most %>%
  ggplot( aes(x=Manufacturer, y=Cost, fill=Manufacturer))+
  scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
  theme_classic()+ 
    geom_bar(stat = "identity", width=0.5) +
    xlab("Manufacturer")+
    ylab("Cost $USD") +
  theme(legend.position="none")+
  ggtitle("Top 15 Annual Fuel Cost in USD")+  
  theme(axis.text.x = element_text(size = 7, angle = 25, hjust = 1))

ggplotly(p2, tooltip=c("x", "y"))

It is no doubt that luxury cars spend a lot of money on fuel, but what kind of cars spend the least money annually?

ave_cost_least <- head(ave_cost,15)

p <- ave_cost_least %>%
  ggplot( aes(x=Manufacturer, y=Cost, fill=Manufacturer))+
  scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
  theme_classic()+ 
    geom_bar(stat = "identity", width=0.5) +
    xlab("Manufacturer") +
    ylab("Cost $USD") +
  theme(legend.position="none")+
  ggtitle("Least 15 Annual Fuel Cost")+  
  theme(axis.text.x = element_text(size = 7, angle = 25, hjust = 1))

ggplotly(p, tooltip=c("x", "y"))

We will be only pick the least 10 cars from the chart above to look for more details

df_2022_10 <- df_2022 %>% 
      dplyr::filter(Manufacturer == "Hyundai"| Manufacturer == "Honda"|Manufacturer == "Toyota"|Manufacturer == "Volkswagen"|Manufacturer == "Kia"|Manufacturer == "Buick"|Manufacturer == "Fiat"|Manufacturer == "Mazda"|Manufacturer == "Mitsubishi"|Manufacturer == "Subaru" )
#df_2022_10


p.box <- df_2022_10 %>%
  ggplot(aes(x=Manufacturer, y=Cost, fill=Manufacturer)) +
    geom_boxplot()+
  scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
  theme_classic() + 
    xlab("Manufacturer") +
    ylab("Cost $USD") +
  theme(axis.text.x = element_text(size = 7, angle = 25, hjust = 1))+
  theme(legend.position="none")+
  ggtitle("Annual Cost by Manufacturer")
ggplotly(p.box)

Statistical Analysis 1

I wonder if it is true that it is Honda’s cost on average cheaper than Subaru’s? Let’s find out.

Conduct a t-test of a single mean at the 95% confidence level (alpha = 0.05).

H0:μh=μs
Hα:μh<μs

Perform Two Sample Mean Test

w <- df_2022_10 %>% select( Manufacturer, Cost)%>% 
     filter(Manufacturer == "Subaru" | Manufacturer == "Honda")

# Compute t-test
res <- t.test(Cost ~ Manufacturer, data = w)
res

## 
##  Welch Two Sample t-test
## 
## data:  Cost by Manufacturer
## t = -3.088, df = 53.484, p-value = 0.003192
## alternative hypothesis: true difference in means between group Honda and group Subaru is not equal to 0
## 95 percent confidence interval:
##  -657.5592 -139.7742
## sample estimates:
##  mean in group Honda mean in group Subaru 
##             2083.333             2482.000

P-value: 0.003192 < 0.05
Conclusion: Reject the Null Hypothesis H0:μs=μh
There is enough evidence to show that the average cost of Subaru is greater than the the average cost of Honda

We know that two different brand do not have the the average cost. We would want to know more if City vs. Highway Consumptions.

p.dot <- ggplot(df_2022_10, aes(City, Hwy))+
  geom_point(aes(color = Manufacturer))+ 
    ggtitle("City vs. Highway Consumptions")+
  scale_color_viridis(discrete = TRUE, option = "A")+
  scale_fill_viridis(discrete = TRUE) +
  theme_light()+
  theme(legend.position="none")

ggplotly(p.dot)

There is a correlation between City vs.Highway consumption.

Class_ave <-df_2022_10 %>%      # Summary by group using dplyr
  group_by(Class) %>% 
  summarize(CO2 = median(CO2))%>%
  arrange(CO2)

options(scipen=10000)
p <- Class_ave %>%
  ggplot( aes(x=Class, y=CO2, fill=Class)) +
  scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
  theme_classic()+ 
    geom_bar(stat = "identity", width=0.5) +
    coord_flip() +
    xlab("Vehicle Classes")+
    ylab("Co2 Emissions") +
  theme(legend.position="none")+
  ggtitle("Average CO2 Emissions VS. Type of Cars")
 #p

ggplotly(p, tooltip=c("x", "y"))

Small Station Wagons, Compact Cars, Midsize Cars, Small Sport Utility Vehicle 2WD, Minivan - 4WD have the least CO2 emission.

Statistical Analysis 2

From the previous analysis, it leads to the question if Vehicle class and Fuel type are independent?
Conduct a Chi-Square test with = 0.05.
H0: Vehicle class and Fuel type are independent.
Hα: Vehicle class and Fuel type are dependent.

#contingency table of the two variables
tbl = table(df_2022$Fuel, df_2022$Class) 

result <- chisq.test(tbl); result

## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 797.61, df = 120, p-value < 0.00000000000000022

P-value: 0.00000000000000022 < 0.05
Conclusion: Reject the Null Hypothesis H0: Fuel type and Vehicle class are independent.
There is enough evidence to show that Fuel type and Vehicle class are dependent.

Conclusion

Now, I will be able to suggest if anyone needs help finding the right car. First, I have learned today that Vehicle class and Fuel type are related.

Second, if some one want to save on money annually, I will recommend them to look for these 10 Manufacturer:
Hyundai
Honda
Toyota
Kia
Buick
Fiat
Mazda
Mitsubishi
Subaru
Volkswagen

Third, I learned that if we want to find the cars with the least CO2 emissions, I will look for:

Small Station Wagons
Compact Cars
Midsize Cars
Small Sport Utility Vehicle 2WD
Minivan - 4WD