I love my current car. It serves the purpose. But when the time comes, I will need a new car. I personally like Mercedes-Benz’s look, specifically AMG A35. However, I want to know if this will be the best in all areas, not just for the face. I wish to see fuel consumption and co2 emissions as well. At the end of this exploration, we will see if the Mercedes-Benz AMG A35 still be my no.1 choice or will I be changing my mind to other models.
The dataset I use is called “us-vehicle-fuel-economy-data-1984-2017”. This dataset contains United State vehicle fuel economy data starting from 1984. Data from Fuel Economy.GOV. Dataset provides model-specific fuel consumption ratings and estimated carbon dioxide emissions for retail sale vehicles. The size of this dataset is 13,853 KB.
The dataset itself is from: https://datasource.kapsarc.org
Source copyrights and descriptions are from: http://www.fueleconomy.gov
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(hrbrthemes)
## Warning: package 'hrbrthemes' was built under R version 4.1.3
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(dplyr)
library(viridis)
## Warning: package 'viridis' was built under R version 4.1.3
## Loading required package: viridisLite
library(epiDisplay)
## Warning: package 'epiDisplay' was built under R version 4.1.3
## Loading required package: foreign
## Loading required package: survival
## Warning: package 'survival' was built under R version 4.1.3
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: nnet
##
## Attaching package: 'epiDisplay'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(MASS)
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(forcats)
library(lemon)
## Warning: package 'lemon' was built under R version 4.1.3
##
## Attaching package: 'lemon'
## The following object is masked from 'package:purrr':
##
## %||%
## The following objects are masked from 'package:ggplot2':
##
## CoordCartesian, element_render
library(gplots)
## Warning: package 'gplots' was built under R version 4.1.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
#install.packages("lemon")
knit_print.data.frame <- lemon_print
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.1.3
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
#install.packages("gplots")
df <- read.csv("us-vehicle-fuel-economy-data-1984-2017.csv")
as_tibble(df)
## # A tibble: 44,859 x 83
## Year Manufacturer Model barrels08 barrelsA08 charge120 charge240 city08
## <int> <chr> <chr> <dbl> <int> <int> <int> <int>
## 1 2016 Porsche 911 Turb~ 14.9 0 0 0 17
## 2 2017 BMW M6 Gran ~ 18.6 0 0 0 14
## 3 2017 Hyundai Elantra ~ 11.0 0 0 0 24
## 4 2017 Lincoln MKC AWD 13.5 0 0 0 19
## 5 2017 Mercedes-Benz AMG GLS63 21.3 0 0 0 13
## 6 2016 GMC Sierra 1~ 14.9 0 0 0 18
## 7 2016 Chevrolet Silverad~ 16.5 0 0 0 16
## 8 2017 Hyundai Elantra 8.50 0 0 0 32
## 9 2017 BMW 550i xDr~ 16.5 0 0 0 15
## 10 2017 Genesis G80 RWD 13.5 0 0 0 18
## # ... with 44,849 more rows, and 75 more variables: city08U <int>,
## # cityA08 <int>, cityA08U <int>, cityCD <int>, cityE <int>, cityUF <int>,
## # co2 <int>, co2A <int>, co2TailpipeAGpm <int>, co2TailpipeGpm <dbl>,
## # comb08 <int>, comb08U <int>, combA08 <int>, combA08U <int>, combE <int>,
## # combinedCD <int>, combinedUF <int>, cylinders <int>, displ <dbl>,
## # drive <chr>, engId <int>, eng_dscr <chr>, feScore <int>, fuelCost08 <int>,
## # fuelCostA08 <int>, fuelType <chr>, fuelType1 <chr>, ghgScore <int>, ...
dim(df)
## [1] 44859 83
sum(is.na(df))
## [1] 36690
#str(df)
After a glimpse of the data, it is excellent that the dataset contains both quantitative and categorical values. However, the biggest challenge was too many variables. It was time-consuming to study the dataset and choose what variable to be used. In addition to that, there were 36690 NA values.
year- int- model year
Manufacturer - chr -manufacturer (division)
model - chr - model name (carline)
VClass- chr - EPA vehicle size class: Minicompact Cars, Compact Cars, Midsize Cars, Small
displ - dbl- engine displacement in liters
cylinders- int - engine cylinders
drive -chr - drive axle type
fuelCost08 - int- annual fuel cost for fuelType1 ($) (7)
city08 -int - city MPG for fuelType1
highway08- int - highway MPG for fuelType1
trany- chr - transmission: Automatic, Manual
fuelType - chr - fuel type with fuelType1 and fuelType2 (if applicable)
co2TailpipeGpm - double - tailpipe CO2 in grams/mile fuelType1
*EPA - U.S. Environmental Protection Agency
Selecting variables.
new_df <- df %>%
dplyr::select(Year, Manufacturer, Model,VClass, displ, cylinders,drive, fuelCost08,city08, highway08, trany,fuelType, co2TailpipeGpm)
#new_df
Rename variables.
colnames(new_df)[which(names(new_df) == "VClass")] <- "Class"
colnames(new_df)[which(names(new_df) == "displ")] <- "EngineSize"
colnames(new_df)[which(names(new_df) == "cylinders")] <- "Cylinders"
colnames(new_df)[which(names(new_df) == "drive")] <- "Drive"
colnames(new_df)[which(names(new_df) == "fuelCost08")] <- "AnnualCost"
colnames(new_df)[which(names(new_df) == "city08")] <- "CityConcumptions"
colnames(new_df)[which(names(new_df) == "highway08")] <- "HwyConcumptions"
colnames(new_df)[which(names(new_df) == "trany")] <- "Transmission"
colnames(new_df)[which(names(new_df) == "fuelType")] <- "FuelType"
colnames(new_df)[which(names(new_df) == "co2TailpipeGpm")] <- "CO2Emissions"
#new_df
#summary(new_df)
sum(is.na(new_df))
## [1] 748
Remove NAs
new_df_nona <- na.omit(new_df)
summary(new_df_nona)
## Year Manufacturer Model Class
## Min. :1984 Length:44484 Length:44484 Length:44484
## 1st Qu.:1992 Class :character Class :character Class :character
## Median :2004 Mode :character Mode :character Mode :character
## Mean :2003
## 3rd Qu.:2014
## Max. :2023
## EngineSize Cylinders Drive AnnualCost
## Min. :0.600 Min. : 2.00 Length:44484 Min. : 1050
## 1st Qu.:2.200 1st Qu.: 4.00 Class :character 1st Qu.: 2800
## Median :3.000 Median : 6.00 Mode :character Median : 3300
## Mean :3.283 Mean : 5.71 Mean : 3382
## 3rd Qu.:4.200 3rd Qu.: 6.00 3rd Qu.: 3800
## Max. :8.400 Max. :16.00 Max. :10350
## CityConcumptions HwyConcumptions Transmission FuelType
## Min. : 6.00 Min. : 9.00 Length:44484 Length:44484
## 1st Qu.:15.00 1st Qu.:20.00 Class :character Class :character
## Median :17.00 Median :24.00 Mode :character Mode :character
## Mean :18.18 Mean :24.39
## 3rd Qu.:21.00 3rd Qu.:28.00
## Max. :58.00 Max. :61.00
## CO2Emissions
## Min. : 22.0
## 1st Qu.: 386.4
## Median : 444.4
## Mean : 463.5
## 3rd Qu.: 522.8
## Max. :1269.6
sum(is.na(new_df_nona))
## [1] 0
There are too many observations, we will be only observed 2022 - 2023 model years. In other word, we will be only looking for brand new cars :)
#install.packages("kableExtra")
library(kableExtra)
#tb <- tab1(new_df_nona$Manufacturer, sort.group = "decreasing", cum.percent = TRUE)
df_2022 <- new_df_nona %>%
filter(Year >= 2022 )
#df_2022
#df_2022 %>%
# kbl() %>%
#kable_paper("hover", full_width = F)
#group_car <- df_2022 %>%
# group_by(Manufacturer) %>%
# summarize(count = n()) %>%
# mutate(Percent = (round(count / sum(count), 5)) *100) %>%
# arrange(desc(Percent))
This will leave us with 1,305 observations.
Before we begin, we need to find information for that one car Mercedez-Benz AMG A35
amg_a35 <- df_2022 %>%
filter(Model == "AMG A35 4matic")
amg_a35 %>%
kbl() %>%
kable_paper("hover", full_width = F)
| Year | Manufacturer | Model | Class | EngineSize | Cylinders | Drive | AnnualCost | CityConcumptions | HwyConcumptions | Transmission | FuelType | CO2Emissions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022 | Mercedes-Benz | AMG A35 4matic | Subcompact Cars | 2 | 4 | All-Wheel Drive | 2900 | 22 | 30 | Automatic (AM7) | Premium | 356 |
We will compare:
CO2 Emissions vs. Transmission
CO2 Emissions vs. Vehicle Class
CO2 Emissions vs. Fuel Types
options(scipen=10000)
p <- df_2022 %>%
ggplot( aes(x=Transmission, y=CO2Emissions, fill=Transmission)) +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_classic()+
geom_bar(stat = "identity", width=0.5) +
coord_flip() +
xlab("Transmissions") +
theme(legend.position="none")+
ggtitle("CO2 Emission VS. Transmission")
p
#ggplotly(p, tooltip=c("x", "y"))
It seems that Automatic (S8) has the higest CO2 Emission, follow by Automatic 8-spd, Automatic 10-spd, Automatic (S10), Automatic 9-spd.
Transmission typs that has the least CO2 Emission are Automatic (A1), Automatic (AM-S9), and Automatic 7-spd
The AMG A35 4matic is a Automatic(AM7). It does not belongs to either the lowest or the highest CO2 Emission,
options(scipen=10000)
p <- df_2022 %>%
ggplot( aes(x=Class, y=CO2Emissions, fill=Class)) +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_classic()+
geom_bar(stat = "identity", width=0.5) +
coord_flip() +
xlab("Vehicle Class") +
theme(legend.position="none")+
ggtitle("CO2 Emission VS. Vehicle Class")
p
#ggplotly(p, tooltip=c("x", "y"))
Small Standard sport utility 4WD and Standard sport utility 4WD have about the highest emissions of CO2, followed by Midsize cars, Subcompact cars, and Standard Pickup Trucks 4WD.
Vans, Passenger Type, Minivan-4WD, and Minivan - 2WD are giving the least CO2.
The AMG A35 4matic is a Subcompact Car.
van <- df_2022 %>%
filter(Class == "Vans, Passenger Type" | Class == "Minivan - 4WD" | Class == "Minivan - 2WD")
van <- van %>% arrange(CO2Emissions)
head(van, 5)%>%
kbl() %>%
kable_paper("hover", full_width = F)
| Year | Manufacturer | Model | Class | EngineSize | Cylinders | Drive | AnnualCost | CityConcumptions | HwyConcumptions | Transmission | FuelType | CO2Emissions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022 | Chrysler | Pacifica Hybrid | Minivan - 2WD | 3.6 | 6 | Front-Wheel Drive | 2050 | 29 | 30 | Automatic (variable gear ratios) | Regular Gas and Electricity | 119 |
| 2022 | Toyota | Sienna 2WD | Minivan - 2WD | 2.5 | 4 | Front-Wheel Drive | 1700 | 36 | 36 | Automatic (AV-S6) | Regular | 246 |
| 2022 | Toyota | Sienna AWD | Minivan - 4WD | 2.5 | 4 | All-Wheel Drive | 1750 | 35 | 36 | Automatic (AV-S6) | Regular | 253 |
| 2022 | Honda | Odyssey | Minivan - 2WD | 3.5 | 6 | Front-Wheel Drive | 2800 | 19 | 28 | Automatic (S10) | Regular | 394 |
| 2023 | Honda | Odyssey | Minivan - 2WD | 3.5 | 6 | Front-Wheel Drive | 2800 | 19 | 28 | Automatic (S10) | Regular | 394 |
options(scipen=10000)
p <- df_2022 %>%
ggplot( aes(x=FuelType, y=CO2Emissions, fill=FuelType)) +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_classic()+
geom_bar(stat = "identity", width=0.5) +
coord_flip() +
xlab("Fuel Types") +
theme(legend.position="none")+
ggtitle("CO2 Emission VS. Fuel Types")
p
From the previous analysis, it leads to the question if Vehicle class and Fuel type are independent?
Conduct a Chi-Square test with = 0.05.
H0: Vehicle class and Fuel type are independent.
Hα: Vehicle class and Fuel type are dependent.
tbl = table(df_2022$FuelType, df_2022$Class)
#tbl
result <- chisq.test(tbl); result
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 797.61, df = 120, p-value < 0.00000000000000022
P-value: 0.00000000000000022 < 0.05
Conclusion: Reject the Null Hypothesis H0: Fuel type and Vehicle class are independent.
There is enough evidence to show that Fuel type and Vehicle class are dependent.
p.box <- df_2022 %>%
ggplot(aes(x=FuelType, y=AnnualCost, fill=FuelType)) +
geom_boxplot()+
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_classic() +
theme(axis.text.x = element_text(size = 7, angle = 25, hjust = 1))+
theme(legend.position="none")+
ggtitle("Annual Fuel Cost vs. Fuel Types")
ggplotly(p.box)
It appears that the average annual fuel cost differs significantly between the fuel types. Is it true that it is cheaper on average to for regular vs. Manhattan? Let’s find out. Conduct a t-test of a single mean at the 95% confidence level (alpha = 0.05).
topsammary <-df_2022 %>% # Summary by group using dplyr
group_by(FuelType) %>%
summarize(min = min(AnnualCost),
mean = mean(AnnualCost),
median = median(AnnualCost),
max = max(AnnualCost),
sd=sd(AnnualCost),
count = n())
topsammary
## # A tibble: 7 x 7
## FuelType min mean median max sd count
## <chr> <int> <dbl> <dbl> <int> <dbl> <int>
## 1 Diesel 2950 3334. 3350 4050 245. 29
## 2 Gasoline or E85 2350 3382. 3400 4050 518. 17
## 3 Midgrade 3400 3840 4000 4000 216. 15
## 4 Premium 1850 3553. 3450 8050 796. 658
## 5 Premium and Electricity 2000 3007. 2800 4250 609. 29
## 6 Regular 1050 2579. 2550 4350 679. 543
## 7 Regular Gas and Electricity 1150 1914. 1825 3050 578. 14
H0:μr=μp
Hα:μr<μp
premium <- subset(df_2022, FuelType == "Premium")
regular <- subset(df_2022, FuelType == "Regular")
res <- t.test(regular$AnnualCost, premium$AnnualCost, alternative = "less", conf.level = 0.95)
res
##
## Welch Two Sample t-test
##
## data: regular$AnnualCost and premium$AnnualCost
## t = -22.9, df = 1197.7, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -904.2596
## sample estimates:
## mean of x mean of y
## 2578.821 3553.116
P-value: 0.00000000000000022 < 0.05 Conclusion: Reject the Null Hypothesis H0:μr=μp There is enough evidence to show that the mean annual fuel cost for Regular fuel is less than Premium fuel.
w <- df_2022 %>% select( Model, AnnualCost, FuelType)%>%
filter(FuelType == "Regular" | FuelType == "Premium")
# Compute t-test
res <- t.test(AnnualCost ~ FuelType, data = w)
res
##
## Welch Two Sample t-test
##
## data: AnnualCost by FuelType
## t = 22.9, df = 1197.7, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means between group Premium and group Regular is not equal to 0
## 95 percent confidence interval:
## 890.823 1057.765
## sample estimates:
## mean in group Premium mean in group Regular
## 3553.116 2578.821