This report uses the fuel economy data for all cars sold in the US from 1984 to 2015, published by the Environmental Protection Agency (EPA). The project investigates how fuel economy has changed over time and how it relates to other vehicle characteristics.
https://cran.r-project.org/web/packages/fueleconomy/fueleconomy.pdf
The main questions are how fuel economy varies by year, by number of cylinders, and by fuel type.
library(fueleconomy)
## Warning: package 'fueleconomy' was built under R version 3.3.3
library(DT)
## Warning: package 'DT' was built under R version 3.3.3
fuel <- fueleconomy::vehicles
head(fuel)
## id make model year class
## 1 27550 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 2 28426 AM General DJ Po Vehicle 2WD 1984 Special Purpose Vehicle 2WD
## 3 27549 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 4 28425 AM General FJ8c Post Office 1984 Special Purpose Vehicle 2WD
## 5 1032 AM General Post Office DJ5 2WD 1985 Special Purpose Vehicle 2WD
## 6 1033 AM General Post Office DJ8 2WD 1985 Special Purpose Vehicle 2WD
## trans drive cyl displ fuel hwy cty
## 1 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 2 Automatic 3-spd 2-Wheel Drive 4 2.5 Regular 17 18
## 3 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 4 Automatic 3-spd 2-Wheel Drive 6 4.2 Regular 13 13
## 5 Automatic 3-spd Rear-Wheel Drive 4 2.5 Regular 17 16
## 6 Automatic 3-spd Rear-Wheel Drive 6 4.2 Regular 13 13
The fuel economy data were collected by the EPA and are observational.
str(fuel)
## Classes 'tbl_df', 'tbl' and 'data.frame': 33442 obs. of 12 variables:
## $ id : int 27550 28426 27549 28425 1032 1033 3347 13309 13310 13311 ...
## $ make : chr "AM General" "AM General" "AM General" "AM General" ...
## $ model: chr "DJ Po Vehicle 2WD" "DJ Po Vehicle 2WD" "FJ8c Post Office" "FJ8c Post Office" ...
## $ year : int 1984 1984 1984 1984 1985 1985 1987 1997 1997 1997 ...
## $ class: chr "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" "Special Purpose Vehicle 2WD" ...
## $ trans: chr "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" "Automatic 3-spd" ...
## $ drive: chr "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" "2-Wheel Drive" ...
## $ cyl : int 4 4 6 6 4 6 6 4 4 6 ...
## $ displ: num 2.5 2.5 4.2 4.2 2.5 4.2 3.8 2.2 2.2 3 ...
## $ fuel : chr "Regular" "Regular" "Regular" "Regular" ...
## $ hwy : int 17 17 13 13 17 13 21 26 28 26 ...
## $ cty : int 18 18 13 13 16 13 14 20 22 18 ...
id: Unique EPA identifier
make: Manufacturer
model: Model name
year: Model year
class: EPA vehicle size class, http://www.fueleconomy.gov/feg/ws/wsData.shtml#VClass
trans: Transmission
drive: Drive train
cyl: Number of cylinders
displ: Engine displacement, in litres
fuel: Fuel type
hwy: Highway fuel economy, in mpg
cty: City fuel economy, in mpg
The dataset contains 33,442 cases from 1984 to 2015. Each case records the manufacturer, model name, year, EPA vehicle size class, transmission, drive train, number of cylinders, engine displacement, fuel type, highway fuel economy and city fuel economy.
This is an observational study. To understand the relationships between the explanatory variables and the response variable, I study the dataset by year, number of cylinders and fuel type in several ways: exploratory data analysis with both summary statistics and data visualization, inference, and machine learning.
Identify the population of interest, and whether the findings from this analysis can be generalized to that population, or, if not, a subsection of that population. Explain why or why not. Also discuss any potential sources of bias that might prevent generalizability.
The findings from this analysis can be generalized to those 33,442 cases, because the sample size is large enough and the fuel economy values are assumed to be estimated independently, excluding minor factors such as improper driving.
The cause (e.g. number of cylinders) and the effect (e.g. fuel economy) are related, and no plausible alternative explanation for the observed covariation is apparent. Although this is an observational study, the connection between cause and effect can still be investigated.
A single response variable for fuel economy is needed, so the combined fuel economy is calculated for the analysis. The weighting is described at the following link:
https://www.fueleconomy.gov/feg/label/learn-more-gasoline-label.shtml#fuel-economy
Combined fuel economy is a weighted average of the city and highway MPG values, calculated by weighting the city value by 55% and the highway value by 45%.
fuel$Combined_mpg <- fuel$cty*0.55 + fuel$hwy*0.45
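As a quick check of the weighting, the calculation can be done by hand for one vehicle. This is only a sketch: the two ratings are copied from the first row printed by head(fuel) above, and the variable names are chosen here for illustration.

```r
# City and highway ratings from the first row of head(fuel)
cty <- 18
hwy <- 17

# Weighted average: 55% city, 45% highway
combined <- 0.55 * cty + 0.45 * hwy
combined   # 9.9 + 7.65
```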
summary(fuel)
## id make model year
## Min. : 1 Length:33442 Length:33442 Min. :1984
## 1st Qu.: 8361 Class :character Class :character 1st Qu.:1991
## Median :16724 Mode :character Mode :character Median :1999
## Mean :17038 Mean :1999
## 3rd Qu.:25265 3rd Qu.:2008
## Max. :34932 Max. :2015
##
## class trans drive cyl
## Length:33442 Length:33442 Length:33442 Min. : 2.000
## Class :character Class :character Class :character 1st Qu.: 4.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 5.772
## 3rd Qu.: 6.000
## Max. :16.000
## NA's :58
## displ fuel hwy cty
## Min. :0.000 Length:33442 Min. : 9.00 Min. : 6.00
## 1st Qu.:2.300 Class :character 1st Qu.: 19.00 1st Qu.: 15.00
## Median :3.000 Mode :character Median : 23.00 Median : 17.00
## Mean :3.353 Mean : 23.55 Mean : 17.49
## 3rd Qu.:4.300 3rd Qu.: 27.00 3rd Qu.: 20.00
## Max. :8.400 Max. :109.00 Max. :138.00
## NA's :57
## Combined_mpg
## Min. : 7.80
## 1st Qu.: 16.70
## Median : 19.70
## Mean : 20.22
## 3rd Qu.: 22.70
## Max. :123.15
##
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
fuel_year <- fuel %>%
group_by(year) %>%
summarise(n = n(), mean = mean(Combined_mpg), median = median(Combined_mpg), sd = sd(Combined_mpg))
fuel_year
## # A tibble: 32 × 5
## year n mean median sd
## <int> <int> <dbl> <dbl> <dbl>
## 1 1984 784 17.15944 17.25 4.182516
## 2 1985 1701 20.20212 19.60 5.320747
## 3 1986 1210 19.93054 19.60 5.253975
## 4 1987 1247 19.62097 19.35 5.135072
## 5 1988 1130 19.74969 19.25 5.041844
## 6 1989 1153 19.53877 19.15 5.175750
## 7 1990 1078 19.42032 19.05 4.955587
## 8 1991 1132 19.28101 18.70 4.916046
## 9 1992 1121 19.34095 19.05 4.894614
## 10 1993 1093 19.60018 19.05 4.869317
## # ... with 22 more rows
hist(fuel$Combined_mpg)
fuel <- fuel[!fuel$Combined_mpg %in% boxplot.stats(fuel$Combined_mpg)$out, ]
hist(fuel$Combined_mpg)
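The outlier filter above relies on boxplot.stats, which by default flags values lying more than 1.5 times the hinge spread beyond the hinges. A minimal sketch on a made-up vector (the numbers are invented for illustration and are not taken from the fuel data) shows the same pattern:

```r
# Toy vector: seven typical values and one extreme value (invented data)
mpg <- c(10, 12, 13, 14, 15, 16, 18, 60)

stats <- boxplot.stats(mpg)   # default coef = 1.5
stats$stats                   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out                     # values lying beyond the whiskers

# Same filtering pattern as applied to fuel$Combined_mpg
mpg_trimmed <- mpg[!mpg %in% stats$out]
mpg_trimmed
```

Note that fuel_year was computed before this filtering step, so its yearly means (for example 20.20 for 1985) differ slightly from the means printed later by by() (19.77 for 1985), which use the trimmed data.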
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
plot(fuel_year$year, fuel_year$mean, type = "b", lty = 6, col = "blue", bg = "green", xlab = "year", ylab = "Mean of fuel")
# abline() is called separately: base graphics calls are not chained with +
abline(lm(mean ~ year, data = fuel_year))
library(ggplot2)
ggplot(fuel, aes(y = Combined_mpg, x = as.factor(year), fill = as.factor(year))) +
  geom_boxplot(alpha = 0.5) +
  scale_x_discrete("", labels = NULL, breaks = NULL) +
  scale_y_continuous("", limits = c(5, 35)) +
  guides(fill = guide_legend(title = "Year")) +
  theme(legend.justification = c(0, 0.75),
        legend.position = c(0, 0.75),
        legend.background = element_rect(fill = NA),
        legend.title = element_text(face = "bold"),
        legend.title.align = 0.5) +
  stat_summary(fun.y = mean, geom = "point", shape = 4, size = 2) +
  coord_flip() +
  ggtitle("Distribution of Fuel Economy by Year\n")
ggplot(fuel, aes(x = Combined_mpg, fill = as.factor(year))) +
  geom_density(alpha = 0.5) +
  scale_y_continuous("", labels = NULL, breaks = NULL) +
  scale_x_continuous("Fuel Economy (Combined_mpg)", limits = c(5, 35)) +
  guides(fill = guide_legend(title = "Year")) +
  theme(legend.justification = c(0, 0.75),
        legend.position = c(0, 0.75),
        legend.background = element_rect(fill = NA),
        legend.title = element_text(face = "bold"),
        legend.title.align = 0.5)
unique(fuel$cyl)
## [1] 4 6 5 8 12 10 16 NA 3 2
library(dplyr)
fuel_cyl <- fuel %>% filter(!is.na(cyl))
unique(fuel_cyl$cyl)
## [1] 4 6 5 8 12 10 16 3 2
grp_cyl <- fuel_cyl %>%
group_by(cyl) %>%
summarise(n = n(), mean = mean(Combined_mpg), median = median(Combined_mpg), sd = sd(Combined_mpg))
grp_cyl
## # A tibble: 9 × 5
## cyl n mean median sd
## <int> <int> <dbl> <dbl> <dbl>
## 1 2 45 18.37000 18.60 0.5289784
## 2 3 29 29.09310 29.80 1.5587726
## 3 4 11764 23.51209 23.25 3.3083728
## 4 5 718 20.85578 20.60 2.7252662
## 5 6 11884 18.90852 19.05 2.6248608
## 6 8 7550 15.48340 15.25 2.6219897
## 7 10 138 14.46920 14.60 1.8040640
## 8 12 478 13.36056 13.70 1.7624723
## 9 16 7 10.95714 11.15 0.2405351
plot(grp_cyl$cyl, grp_cyl$mean, type = "b", lty = 6, col = "blue", bg = "green", xlab = "number of cylinder", ylab = "Mean of fuel")
# abline() is called separately: base graphics calls are not chained with +
abline(lm(mean ~ cyl, data = grp_cyl))
ggplot(fuel_cyl, aes(y = Combined_mpg, x = as.factor(cyl), fill = as.factor(cyl))) +
  geom_boxplot(alpha = 0.5) +
  scale_x_discrete("Cylinders") +
  scale_y_continuous("Fuel Economy (mpg)\n") +
  theme(legend.position = 'none') +
  stat_summary(fun.y = mean, geom = "point", shape = 4, size = 2) +
  ggtitle("Fuel Economy by Number of Engine Cylinders\n")
unique(fuel$fuel)
## [1] "Regular" "Premium"
## [3] "Premium or E85" "Diesel"
## [5] "Gasoline or E85" "Gasoline or natural gas"
## [7] "CNG" "Electricity"
## [9] "Midgrade" "Premium Gas or Electricity"
## [11] "Gasoline or propane" "Premium and Electricity"
Comparing the mixed fuel types (those containing “or” / “and”) is not meaningful, so they are removed.
library(dplyr)
# Match " or " / " and " as whole words so single-fuel names are never caught by accident
fuel_clean <- fuel %>% filter(!grepl(" or | and ", fuel))
unique(fuel_clean$fuel)
## [1] "Regular" "Premium" "Diesel" "CNG" "Electricity"
## [6] "Midgrade"
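The filtering step can be tried on the fuel type names alone. This is a sketch under one assumption: a pattern without surrounding spaces, such as "or|and", matches those letters anywhere in a string, so a name like "Standard" (hypothetical, not in the data) would be dropped too. Anchoring the words with spaces avoids that while giving the same six types here.

```r
# The twelve fuel types printed by unique(fuel$fuel)
fuel_types <- c("Regular", "Premium", "Premium or E85", "Diesel",
                "Gasoline or E85", "Gasoline or natural gas", "CNG",
                "Electricity", "Midgrade", "Premium Gas or Electricity",
                "Gasoline or propane", "Premium and Electricity")

# TRUE for mixed fuel types: " or " / " and " with a space on both sides
is_mixed <- grepl(" or | and ", fuel_types)
single_fuels <- fuel_types[!is_mixed]
single_fuels

# A hypothetical name showing why the spaces matter
grepl("or|and", "Standard")       # matches the letters "and" inside the word
grepl(" or | and ", "Standard")   # no match
```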
fuel_type <- fuel_clean %>%
group_by(fuel) %>%
summarise(n = n(), mean = mean(Combined_mpg), median = median(Combined_mpg), sd = sd(Combined_mpg))
fuel_type
## # A tibble: 6 × 5
## fuel n mean median sd
## <chr> <int> <dbl> <dbl> <dbl>
## 1 CNG 55 17.81364 14.25 7.047579
## 2 Diesel 699 20.40572 19.15 4.156685
## 3 Electricity 1 28.00000 28.00 NaN
## 4 Midgrade 43 18.02907 18.05 1.645384
## 5 Premium 8575 19.51809 19.60 3.785541
## 6 Regular 22091 19.89339 19.70 4.516686
boxplot(fuel$Combined_mpg ~ fuel$year)
by(fuel$Combined_mpg, fuel$year, mean)
## fuel$year: 1984
## [1] 17.15944
## --------------------------------------------------------
## fuel$year: 1985
## [1] 19.77042
## --------------------------------------------------------
## fuel$year: 1986
## [1] 19.45336
## --------------------------------------------------------
## fuel$year: 1987
## [1] 19.27115
## --------------------------------------------------------
## fuel$year: 1988
## [1] 19.40289
## --------------------------------------------------------
## fuel$year: 1989
## [1] 19.09172
## --------------------------------------------------------
## fuel$year: 1990
## [1] 19.05786
## --------------------------------------------------------
## fuel$year: 1991
## [1] 18.96407
## --------------------------------------------------------
## fuel$year: 1992
## [1] 18.98776
## --------------------------------------------------------
## fuel$year: 1993
## [1] 19.28198
## --------------------------------------------------------
## fuel$year: 1994
## [1] 19.25253
## --------------------------------------------------------
## fuel$year: 1995
## [1] 19.1124
## --------------------------------------------------------
## fuel$year: 1996
## [1] 19.88937
## --------------------------------------------------------
## fuel$year: 1997
## [1] 19.78054
## --------------------------------------------------------
## fuel$year: 1998
## [1] 19.71139
## --------------------------------------------------------
## fuel$year: 1999
## [1] 19.60817
## --------------------------------------------------------
## fuel$year: 2000
## [1] 19.52597
## --------------------------------------------------------
## fuel$year: 2001
## [1] 19.45185
## --------------------------------------------------------
## fuel$year: 2002
## [1] 19.29201
## --------------------------------------------------------
## fuel$year: 2003
## [1] 19.10971
## --------------------------------------------------------
## fuel$year: 2004
## [1] 19.2832
## --------------------------------------------------------
## fuel$year: 2005
## [1] 19.4293
## --------------------------------------------------------
## fuel$year: 2006
## [1] 19.32516
## --------------------------------------------------------
## fuel$year: 2007
## [1] 19.42826
## --------------------------------------------------------
## fuel$year: 2008
## [1] 19.61594
## --------------------------------------------------------
## fuel$year: 2009
## [1] 20.14529
## --------------------------------------------------------
## fuel$year: 2010
## [1] 20.8785
## --------------------------------------------------------
## fuel$year: 2011
## [1] 20.97287
## --------------------------------------------------------
## fuel$year: 2012
## [1] 21.32059
## --------------------------------------------------------
## fuel$year: 2013
## [1] 22.0118
## --------------------------------------------------------
## fuel$year: 2014
## [1] 22.31938
## --------------------------------------------------------
## fuel$year: 2015
## [1] 23.45281
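The difference in combined fuel economy between years is statistically significant, as the test below shows.
Before using an ANOVA test to compare the means across multiple groups, the following conditions are checked: independence of the observations within and between groups, approximately normal distributions within each group, and roughly equal variability across groups.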
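A rough way to look at the variability and normality conditions in base R is sketched below. It uses the PlantGrowth dataset that ships with R as a stand-in for the fuel data, so the variable names weight and group belong to that dataset, not to this analysis. The ratio of the largest to the smallest group standard deviation is only an informal guide.

```r
# PlantGrowth (built into R) stands in for the fuel data in this sketch
fit <- lm(weight ~ group, data = PlantGrowth)

# Roughly equal variability: compare the group standard deviations
group_sd <- tapply(PlantGrowth$weight, PlantGrowth$group, sd)
group_sd
sd_ratio <- max(group_sd) / min(group_sd)
sd_ratio   # a large ratio would suggest unequal spread

# Approximate normality: QQ plot of the model residuals
qqnorm(residuals(fit))
qqline(residuals(fit))
```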
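To perform the ANOVA test, a linear model is fitted with lm. With a single categorical predictor, the F-statistic reported by summary is the one-way ANOVA F-test; the same table can be produced with the anova function.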
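The link between the linear model and the ANOVA table can be seen on a small example. The sketch below again uses the built-in PlantGrowth dataset rather than the fuel data (an assumption made to keep it self-contained) and compares the F-statistic from summary with the one in the table returned by anova.

```r
# One-way layout: weight explained by a single categorical predictor
fit <- lm(weight ~ group, data = PlantGrowth)

# F-statistic stored in the model summary
f_from_summary <- summary(fit)$fstatistic[["value"]]

# F-statistic from the ANOVA table (first row is the group term)
aov_table <- anova(fit)
f_from_anova <- aov_table[["F value"]][1]

aov_table
all.equal(f_from_summary, f_from_anova)
```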
test_year <- lm(Combined_mpg ~ as.factor(year), data = fuel)
summary(test_year)
##
## Call:
## lm(formula = Combined_mpg ~ as.factor(year), data = fuel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8704 -3.1579 -0.1019 2.6471 14.5406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.1594 0.1504 114.089 <2e-16 ***
## as.factor(year)1985 2.6110 0.1826 14.300 <2e-16 ***
## as.factor(year)1986 2.2939 0.1942 11.814 <2e-16 ***
## as.factor(year)1987 2.1117 0.1928 10.955 <2e-16 ***
## as.factor(year)1988 2.2435 0.1966 11.411 <2e-16 ***
## as.factor(year)1989 1.9323 0.1960 9.859 <2e-16 ***
## as.factor(year)1990 1.8984 0.1985 9.562 <2e-16 ***
## as.factor(year)1991 1.8046 0.1964 9.189 <2e-16 ***
## as.factor(year)1992 1.8283 0.1969 9.287 <2e-16 ***
## as.factor(year)1993 2.1225 0.1978 10.729 <2e-16 ***
## as.factor(year)1994 2.0931 0.2024 10.342 <2e-16 ***
## as.factor(year)1995 1.9530 0.2029 9.625 <2e-16 ***
## as.factor(year)1996 2.7299 0.2142 12.743 <2e-16 ***
## as.factor(year)1997 2.6211 0.2149 12.198 <2e-16 ***
## as.factor(year)1998 2.5520 0.2117 12.054 <2e-16 ***
## as.factor(year)1999 2.4487 0.2096 11.682 <2e-16 ***
## as.factor(year)2000 2.3665 0.2101 11.264 <2e-16 ***
## as.factor(year)2001 2.2924 0.2062 11.116 <2e-16 ***
## as.factor(year)2002 2.1326 0.2029 10.512 <2e-16 ***
## as.factor(year)2003 1.9503 0.1998 9.761 <2e-16 ***
## as.factor(year)2004 2.1238 0.1967 10.798 <2e-16 ***
## as.factor(year)2005 2.2699 0.1952 11.627 <2e-16 ***
## as.factor(year)2006 2.1657 0.1971 10.988 <2e-16 ***
## as.factor(year)2007 2.2688 0.1961 11.570 <2e-16 ***
## as.factor(year)2008 2.4565 0.1943 12.645 <2e-16 ***
## as.factor(year)2009 2.9859 0.1944 15.356 <2e-16 ***
## as.factor(year)2010 3.7191 0.1974 18.844 <2e-16 ***
## as.factor(year)2011 3.8134 0.1972 19.333 <2e-16 ***
## as.factor(year)2012 4.1612 0.1975 21.072 <2e-16 ***
## as.factor(year)2013 4.8524 0.1974 24.581 <2e-16 ***
## as.factor(year)2014 5.1599 0.1967 26.234 <2e-16 ***
## as.factor(year)2015 6.2934 0.3363 18.713 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.211 on 32585 degrees of freedom
## Multiple R-squared: 0.05216, Adjusted R-squared: 0.05126
## F-statistic: 57.84 on 31 and 32585 DF, p-value: < 2.2e-16
library(StMoSim)
## Warning: package 'StMoSim' was built under R version 3.3.3
## Loading required package: RcppParallel
## Loading required package: Rcpp
## Warning: package 'Rcpp' was built under R version 3.3.3
##
## Attaching package: 'Rcpp'
## The following object is masked from 'package:RcppParallel':
##
## LdFlags
qqnormSim(test_year$residuals)
qqline(test_year$residuals)
The p-value is almost equal to 0, so the null hypothesis is rejected: there is statistically significant evidence that the mean fuel economy of vehicles differs between model years from 1984 to 2015.
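The p-value can also be recomputed from the F-statistic and its degrees of freedom with pf. The sketch below uses the rounded values printed in the summary output above; the variable names are chosen here for illustration. The summary shows "< 2.2e-16" because R's printing does not display p-values below machine precision.

```r
# F-statistic and degrees of freedom as printed by summary(test_year)
f_value <- 57.84
df_num  <- 31
df_den  <- 32585

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_num, df_den, lower.tail = FALSE)
p_value < 2.2e-16
```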
To perform the ANOVA test for fuel type, a linear model is again fitted with lm, and the F-statistic reported by summary serves as the one-way ANOVA F-test.
test_fuel <- lm(Combined_mpg ~ fuel, data = fuel_clean)
summary(test_fuel)
##
## Call:
## lm(formula = Combined_mpg ~ fuel, data = fuel_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9934 -3.0934 -0.0934 2.7066 12.0819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.8136 0.5831 30.547 < 2e-16 ***
## fuelDiesel 2.5921 0.6057 4.280 1.88e-05 ***
## fuelElectricity 10.1864 4.3639 2.334 0.019589 *
## fuelMidgrade 0.2154 0.8804 0.245 0.806681
## fuelPremium 1.7045 0.5850 2.914 0.003576 **
## fuelRegular 2.0798 0.5839 3.562 0.000369 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.325 on 31458 degrees of freedom
## Multiple R-squared: 0.002626, Adjusted R-squared: 0.002468
## F-statistic: 16.57 on 5 and 31458 DF, p-value: 2.249e-16
qqnormSim(test_fuel$residuals)
qqline(test_fuel$residuals)
The p-value is almost equal to 0, so the null hypothesis is rejected: there is statistically significant evidence that the mean fuel economy of vehicles differs between fuel types.
To perform the ANOVA test for the number of cylinders, a linear model is again fitted with lm, and the F-statistic reported by summary serves as the one-way ANOVA F-test.
test_cyl <- lm(Combined_mpg ~ as.factor(cyl), data = fuel_cyl)
summary(test_cyl)
##
## Call:
## lm(formula = Combined_mpg ~ as.factor(cyl), data = fuel_cyl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5121 -2.1085 -0.2085 2.0879 12.4915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3700 0.4289 42.834 < 2e-16 ***
## as.factor(cyl)3 10.7231 0.6851 15.652 < 2e-16 ***
## as.factor(cyl)4 5.1421 0.4297 11.967 < 2e-16 ***
## as.factor(cyl)5 2.4858 0.4421 5.623 1.90e-08 ***
## as.factor(cyl)6 0.5385 0.4297 1.253 0.21
## as.factor(cyl)8 -2.8866 0.4301 -6.711 1.97e-11 ***
## as.factor(cyl)10 -3.9008 0.4939 -7.898 2.91e-15 ***
## as.factor(cyl)12 -5.0094 0.4486 -11.167 < 2e-16 ***
## as.factor(cyl)16 -7.4129 1.1689 -6.342 2.30e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.877 on 32604 degrees of freedom
## Multiple R-squared: 0.5573, Adjusted R-squared: 0.5572
## F-statistic: 5131 on 8 and 32604 DF, p-value: < 2.2e-16
qqnormSim(test_cyl$residuals)
qqline(test_cyl$residuals)
The p-value is almost equal to 0, so the null hypothesis is rejected: there is statistically significant evidence that the mean fuel economy of vehicles differs between engines with different numbers of cylinders.