I love my current car. It serves the purpose. But when the time comes, I will get a new car. I personally like Mercedes-Benz’s design, specifically AMG. However, I want to know what other options are out there.
The dataset I use is called “us-vehicle-fuel-economy-data-1984-2017”. This dataset contains United State vehicle fuel economy data starting from 1984. It provides model-specific fuel consumption ratings and estimated carbon dioxide emissions for retail sale vehicles. The size of this dataset is 13,853 KB.
The dataset itself is from: https://datasource.kapsarc.org
Source copyrights and descriptions are from: http://www.fueleconomy.gov
library(tidyverse)
library(ggplot2)
library(hrbrthemes)
library(dplyr)
library(viridis)
library(epiDisplay)
library(MASS)
library(plotly)
library(forcats)
library(lemon)
library(gplots)
#install.packages("lemon")
knit_print.data.frame <- lemon_print
library(kableExtra)
#install.packages("gplots")
df <- read.csv("us-vehicle-fuel-economy-data-1984-2017.csv")
as_tibble(df)
## # A tibble: 44,859 x 83
## Year Manufacturer Model barrels08 barrelsA08 charge120 charge240 city08
## <int> <chr> <chr> <dbl> <int> <int> <int> <int>
## 1 2016 Porsche 911 Turb~ 14.9 0 0 0 17
## 2 2017 BMW M6 Gran ~ 18.6 0 0 0 14
## 3 2017 Hyundai Elantra ~ 11.0 0 0 0 24
## 4 2017 Lincoln MKC AWD 13.5 0 0 0 19
## 5 2017 Mercedes-Benz AMG GLS63 21.3 0 0 0 13
## 6 2016 GMC Sierra 1~ 14.9 0 0 0 18
## 7 2016 Chevrolet Silverad~ 16.5 0 0 0 16
## 8 2017 Hyundai Elantra 8.50 0 0 0 32
## 9 2017 BMW 550i xDr~ 16.5 0 0 0 15
## 10 2017 Genesis G80 RWD 13.5 0 0 0 18
## # ... with 44,849 more rows, and 75 more variables: city08U <int>,
## # cityA08 <int>, cityA08U <int>, cityCD <int>, cityE <int>, cityUF <int>,
## # co2 <int>, co2A <int>, co2TailpipeAGpm <int>, co2TailpipeGpm <dbl>,
## # comb08 <int>, comb08U <int>, combA08 <int>, combA08U <int>, combE <int>,
## # combinedCD <int>, combinedUF <int>, cylinders <int>, displ <dbl>,
## # drive <chr>, engId <int>, eng_dscr <chr>, feScore <int>, fuelCost08 <int>,
## # fuelCostA08 <int>, fuelType <chr>, fuelType1 <chr>, ghgScore <int>, ...
dim(df)
## [1] 44859 83
#str(df)
After a glimpse of the data, it is excellent that the dataset contains both quantitative and categorical values. However, my concern is that the dataset has too many variables. It was time-consuming to study the dataset and choose what variable to be used. In addition to that, there were 36690 NA values.
year- int- model year
Manufacturer - chr -manufacturer (division)
model - chr - model name (carline)
VClass- chr - EPA vehicle size class: Minicompact Cars, Compact Cars, Midsize Cars, Small
displ - dbl- engine displacement in liters
cylinders- int - engine cylinders
drive -chr - drive axle type
fuelCost08 - int- annual fuel cost for fuelType1 ($) (7)
city08 -int - city MPG for fuelType1
highway08- int - highway MPG for fuelType1
trany- chr - transmission: Automatic, Manual
fuelType - chr - fuel type with fuelType1 and fuelType2 (if applicable)
co2TailpipeGpm - double - tailpipe CO2 in grams/mile fuelType1
*EPA - U.S. Environmental Protection Agency
Selecting variables.
new_df <- df %>%
dplyr::select(Year, Manufacturer, Model,VClass, displ, cylinders,drive, fuelCost08,city08, highway08, trany,fuelType, co2TailpipeGpm)
#new_df
Rename variables.
colnames(new_df)[which(names(new_df) == "VClass")] <- "Class"
colnames(new_df)[which(names(new_df) == "displ")] <- "EngineSize"
colnames(new_df)[which(names(new_df) == "cylinders")] <- "Cylinders"
colnames(new_df)[which(names(new_df) == "drive")] <- "Drive"
colnames(new_df)[which(names(new_df) == "fuelCost08")] <- "Cost"
colnames(new_df)[which(names(new_df) == "city08")] <- "City"
colnames(new_df)[which(names(new_df) == "highway08")] <- "Hwy"
colnames(new_df)[which(names(new_df) == "trany")] <- "Transmission"
colnames(new_df)[which(names(new_df) == "fuelType")] <- "Fuel"
colnames(new_df)[which(names(new_df) == "co2TailpipeGpm")] <- "CO2"
#new_df
#nrow(new_df)
(sum(is.na(new_df))/nrow(new_df))*100
## [1] 1.667447
Remove NAs
new_df_nona <- na.omit(new_df)
summary(new_df_nona)
## Year Manufacturer Model Class
## Min. :1984 Length:44484 Length:44484 Length:44484
## 1st Qu.:1992 Class :character Class :character Class :character
## Median :2004 Mode :character Mode :character Mode :character
## Mean :2003
## 3rd Qu.:2014
## Max. :2023
## EngineSize Cylinders Drive Cost
## Min. :0.600 Min. : 2.00 Length:44484 Min. : 1050
## 1st Qu.:2.200 1st Qu.: 4.00 Class :character 1st Qu.: 2800
## Median :3.000 Median : 6.00 Mode :character Median : 3300
## Mean :3.283 Mean : 5.71 Mean : 3382
## 3rd Qu.:4.200 3rd Qu.: 6.00 3rd Qu.: 3800
## Max. :8.400 Max. :16.00 Max. :10350
## City Hwy Transmission Fuel
## Min. : 6.00 Min. : 9.00 Length:44484 Length:44484
## 1st Qu.:15.00 1st Qu.:20.00 Class :character Class :character
## Median :17.00 Median :24.00 Mode :character Mode :character
## Mean :18.18 Mean :24.39
## 3rd Qu.:21.00 3rd Qu.:28.00
## Max. :58.00 Max. :61.00
## CO2
## Min. : 22.0
## 1st Qu.: 386.4
## Median : 444.4
## Mean : 463.5
## 3rd Qu.: 522.8
## Max. :1269.6
sum(is.na(new_df_nona))
## [1] 0
There are too many observations. Therefore, we will be only looking at the 2022 - 2023 model years. In other words, we will be only looking for brand new cars :)
df_2022 <- new_df_nona %>%
filter(Year >= 2022 )
#sum(is.na(new_df_nona))
This will leave us with 1,305 observations.
ave_cost <-df_2022 %>% # Summary by group using dplyr
group_by(Manufacturer) %>%
summarize(Cost = mean(Cost))%>%
arrange(Cost)
ave_cost_Most <- tail(ave_cost,15)
p2 <- ave_cost_Most %>%
ggplot( aes(x=Manufacturer, y=Cost, fill=Manufacturer))+
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_classic()+
geom_bar(stat = "identity", width=0.5) +
xlab("Manufacturer")+
ylab("Cost $USD") +
theme(legend.position="none")+
ggtitle("Top 15 Annual Fuel Cost in USD")+
theme(axis.text.x = element_text(size = 7, angle = 25, hjust = 1))
ggplotly(p2, tooltip=c("x", "y"))
It is no doubt that luxury cars spend a lot of money on fuel, but what kind of cars spend the least money annually?
ave_cost_least <- head(ave_cost,15)
p <- ave_cost_least %>%
ggplot( aes(x=Manufacturer, y=Cost, fill=Manufacturer))+
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_classic()+
geom_bar(stat = "identity", width=0.5) +
xlab("Manufacturer") +
ylab("Cost $USD") +
theme(legend.position="none")+
ggtitle("Least 15 Annual Fuel Cost")+
theme(axis.text.x = element_text(size = 7, angle = 25, hjust = 1))
ggplotly(p, tooltip=c("x", "y"))
We will be only pick the least 10 cars from the chart above to look for more details
df_2022_10 <- df_2022 %>%
dplyr::filter(Manufacturer == "Hyundai"| Manufacturer == "Honda"|Manufacturer == "Toyota"|Manufacturer == "Volkswagen"|Manufacturer == "Kia"|Manufacturer == "Buick"|Manufacturer == "Fiat"|Manufacturer == "Mazda"|Manufacturer == "Mitsubishi"|Manufacturer == "Subaru" )
#df_2022_10
p.box <- df_2022_10 %>%
ggplot(aes(x=Manufacturer, y=Cost, fill=Manufacturer)) +
geom_boxplot()+
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_classic() +
xlab("Manufacturer") +
ylab("Cost $USD") +
theme(axis.text.x = element_text(size = 7, angle = 25, hjust = 1))+
theme(legend.position="none")+
ggtitle("Annual Cost by Manufacturer")
ggplotly(p.box)
I wonder if it is true that it is Honda’s cost on average cheaper than Subaru’s? Let’s find out.
Conduct a t-test of a single mean at the 95% confidence level (alpha = 0.05).
H0:μh=μs
Hα:μh<μs
w <- df_2022_10 %>% select( Manufacturer, Cost)%>%
filter(Manufacturer == "Subaru" | Manufacturer == "Honda")
# Compute t-test
res <- t.test(Cost ~ Manufacturer, data = w)
res
##
## Welch Two Sample t-test
##
## data: Cost by Manufacturer
## t = -3.088, df = 53.484, p-value = 0.003192
## alternative hypothesis: true difference in means between group Honda and group Subaru is not equal to 0
## 95 percent confidence interval:
## -657.5592 -139.7742
## sample estimates:
## mean in group Honda mean in group Subaru
## 2083.333 2482.000
P-value: 0.003192 < 0.05
Conclusion: Reject the Null Hypothesis H0:μs=μh
There is enough evidence to show that the average cost of Subaru is greater than the the average cost of Honda
We know that two different brand do not have the the average cost. We would want to know more if City vs. Highway Consumptions.
p.dot <- ggplot(df_2022_10, aes(City, Hwy))+
geom_point(aes(color = Manufacturer))+
ggtitle("City vs. Highway Consumptions")+
scale_color_viridis(discrete = TRUE, option = "A")+
scale_fill_viridis(discrete = TRUE) +
theme_light()+
theme(legend.position="none")
ggplotly(p.dot)
There is a correlation between City vs.Highway consumption.
Class_ave <-df_2022_10 %>% # Summary by group using dplyr
group_by(Class) %>%
summarize(CO2 = median(CO2))%>%
arrange(CO2)
options(scipen=10000)
p <- Class_ave %>%
ggplot( aes(x=Class, y=CO2, fill=Class)) +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_classic()+
geom_bar(stat = "identity", width=0.5) +
coord_flip() +
xlab("Vehicle Classes")+
ylab("Co2 Emissions") +
theme(legend.position="none")+
ggtitle("Average CO2 Emissions VS. Type of Cars")
#p
ggplotly(p, tooltip=c("x", "y"))
Small Station Wagons, Compact Cars, Midsize Cars, Small Sport Utility Vehicle 2WD, Minivan - 4WD have the least CO2 emission.
From the previous analysis, it leads to the question if Vehicle class and Fuel type are independent?
Conduct a Chi-Square test with = 0.05.
H0: Vehicle class and Fuel type are independent.
Hα: Vehicle class and Fuel type are dependent.
#contingency table of the two variables
tbl = table(df_2022$Fuel, df_2022$Class)
result <- chisq.test(tbl); result
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 797.61, df = 120, p-value < 0.00000000000000022
P-value: 0.00000000000000022 < 0.05
Conclusion: Reject the Null Hypothesis H0: Fuel type and Vehicle class are independent.
There is enough evidence to show that Fuel type and Vehicle class are dependent.
Now, I will be able to suggest if anyone needs help finding the right car. First, I have learned today that Vehicle class and Fuel type are related.
Second, if some one want to save on money annually, I will recommend them to look for these 10 Manufacturer:
Hyundai
Honda
Toyota
Kia
Buick
Fiat
Mazda
Mitsubishi
Subaru
Volkswagen
Third, I learned that if we want to find the cars with the least CO2 emissions, I will look for: