Part 1 - Introduction:

How does the number of cylinders in a car affect its highway fuel economy?

This question matters because the vast majority of people in the United States own and drive cars. Highways are where most of the fuel gets used, because speed limits on highways are higher than those on city roads. Knowing how the engine type relates to fuel economy helps people decide whether a car is sustainable.

Part 2 - Data:

The data come from vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and by vehicle manufacturers with oversight by the EPA. The cases are the individual vehicles tested. The two variables of interest are the number of cylinders and the highway fuel economy; both are numerical and discrete.

This is an observational study: there are no control groups, and no cases are assigned a particular treatment. The population of interest is all of the vehicles in the United States. The findings can be generalized to that population only to the extent that the sample is representative of the vehicles sold in the United States. Potential sources of bias that might limit generalizability are an over-representation of particular models or of particular cylinder counts in the sample.

Because the study is observational, a causal relationship between the number of cylinders and the highway fuel economy cannot be established from these data alone. A causal link is nonetheless plausible, since the number of engine cylinders affects how much gasoline a car uses per mile, and it is possible that cars with the fewest cylinders use the least gasoline per mile.

Part 3 - Exploratory data analysis:

nrow(fueleconomy::vehicles)
## [1] 33442

The total number of vehicles tested is 33,442.

# Keep only vehicles with a recorded cylinder count, and only the two variables of interest
fueleconomy2 = subset(fueleconomy::vehicles, !is.na(cyl), select = c(cyl, hwy))

However, some vehicles have no recorded number of cylinders, so those cases have to be excluded.

summary(fueleconomy2)
##       cyl              hwy       
##  Min.   : 2.000   Min.   : 9.00  
##  1st Qu.: 4.000   1st Qu.:19.00  
##  Median : 6.000   Median :23.00  
##  Mean   : 5.772   Mean   :23.46  
##  3rd Qu.: 6.000   3rd Qu.:27.00  
##  Max.   :16.000   Max.   :61.00
nrow(fueleconomy2)
## [1] 33384

After removing those cases, we get 33,384 test cases that can be used in this study.
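As a quick check on the filtering step, the number of dropped cases can be counted directly. The sketch below uses a small made-up data frame (`toy_vehicles` is a hypothetical stand-in, not the real data) to show the pattern; on the real data the same calls would presumably be applied to `fueleconomy::vehicles`.

```r
# Hypothetical toy data standing in for fueleconomy::vehicles
toy_vehicles <- data.frame(
  cyl = c(4, 6, NA, 8, NA, 4),
  hwy = c(30, 24, 95, 18, 102, 28)
)

# Count the rows with no recorded cylinder count
n_missing <- sum(is.na(toy_vehicles$cyl))

# Keep only rows with a cylinder count, and only the two variables of interest
toy_clean <- subset(toy_vehicles, !is.na(cyl), select = c(cyl, hwy))

# The number of rows removed should equal the number of missing cylinder counts
stopifnot(nrow(toy_vehicles) - nrow(toy_clean) == n_missing)
n_missing
```

In the report itself, the two row counts give 33,442 - 33,384 = 58 vehicles dropped.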

hist(fueleconomy2$cyl, main = "Histogram of the Number of Engine Cylinders", 
     xlab = "Number of Cylinders")

The histogram of the number of engine cylinders implies that the distribution of the number of engine cylinders is skewed right.

hist(fueleconomy2$hwy, main = "Histogram of the Highway Fuel Economy", 
     xlab = "Highway Fuel Economy (mpg)")

The histogram of the highway fuel economy suggests that the distribution is approximately normal, although the summary above shows a long right tail reaching 61 mpg.
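The shape of the highway fuel economy distribution can be sanity-checked from the five-number summary printed earlier. This is only a rough sketch using the values copied from that summary (the variable names are illustrative), not a formal normality test.

```r
# Summary statistics for hwy, copied from the summary() output in this report
hwy_stats <- c(min = 9, q1 = 19, median = 23, mean = 23.46, q3 = 27, max = 61)

# Spread of the middle half on either side of the median
lower_box <- hwy_stats[["median"]] - hwy_stats[["q1"]]
upper_box <- hwy_stats[["q3"]] - hwy_stats[["median"]]

# Length of each tail measured from the median
lower_tail <- hwy_stats[["median"]] - hwy_stats[["min"]]
upper_tail <- hwy_stats[["max"]] - hwy_stats[["median"]]

c(lower_box = lower_box, upper_box = upper_box,
  lower_tail = lower_tail, upper_tail = upper_tail)
```

The middle half is symmetric around the median (4 mpg on each side), while the upper tail (38 mpg) is much longer than the lower tail (14 mpg), which is consistent with a roughly bell-shaped center and a few very efficient outliers.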

boxplot(fueleconomy2$hwy ~ fueleconomy2$cyl, main = "Boxplots of Highway Fuel Economy", 
        ylab = "Highway Fuel Economy (mpg)", xlab = "Number of Cylinders")

The boxplots suggest that, in general, as the number of cylinders increases, the highway fuel economy decreases, so the vehicle consumes more gasoline per mile traveled. The 2-cylinder group is an exception, with a lower median than the 3-, 4-, and 5-cylinder groups.

by(fueleconomy2$hwy, fueleconomy2$cyl, summary)
## fueleconomy2$cyl: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      20      22      22      22      23      23 
## -------------------------------------------------------- 
## fueleconomy2$cyl: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28.00   35.25   39.00   40.48   45.00   61.00 
## -------------------------------------------------------- 
## fueleconomy2$cyl: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   24.00   27.00   27.73   31.00   51.00 
## -------------------------------------------------------- 
## fueleconomy2$cyl: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   22.00   24.00   24.64   27.00   34.00 
## -------------------------------------------------------- 
## fueleconomy2$cyl: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   20.00   22.00   22.32   25.00   38.00 
## -------------------------------------------------------- 
## fueleconomy2$cyl: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   16.00   18.00   18.27   21.00   30.00 
## -------------------------------------------------------- 
## fueleconomy2$cyl: 10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   17.00   19.00   18.12   20.00   22.00 
## -------------------------------------------------------- 
## fueleconomy2$cyl: 12
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   15.00   17.00   16.63   18.00   22.00 
## -------------------------------------------------------- 
## fueleconomy2$cyl: 16
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   14.00   15.00   14.57   15.00   15.00

The summaries suggest that the distribution within each cylinder group is roughly symmetric, and plausibly close to normal, because the mean and median are close to each other in every group.
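The gap between mean and median can be tabulated per group to make this comparison explicit. The sketch below re-enters the means and medians from the summaries above by hand (the vector names are illustrative); on the real data the same table could be built from `fueleconomy2` directly.

```r
# Group means and medians copied from the by(..., summary) output above
cyl <- c(2, 3, 4, 5, 6, 8, 10, 12, 16)
med <- c(22, 39, 27, 24, 22, 18, 19, 17, 15)
avg <- c(22, 40.48, 27.73, 24.64, 22.32, 18.27, 18.12, 16.63, 14.57)

# Positive gap: mean above median (hint of right skew); negative: left skew
gap <- avg - med
mean_vs_median <- data.frame(cyl, median = med, mean = avg, gap)
mean_vs_median

# Largest absolute gap across groups, in mpg
largest_gap <- max(abs(gap))
largest_gap
```

The largest gap is about 1.48 mpg, in the 3-cylinder group. The gaps are positive for 3 to 8 cylinders and negative for 10 to 16 cylinders, so any skew is mild and changes direction across groups.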

Part 4 - Inference:

by(fueleconomy2$hwy, fueleconomy2$cyl, length)
## fueleconomy2$cyl: 2
## [1] 45
## -------------------------------------------------------- 
## fueleconomy2$cyl: 3
## [1] 182
## -------------------------------------------------------- 
## fueleconomy2$cyl: 4
## [1] 12381
## -------------------------------------------------------- 
## fueleconomy2$cyl: 5
## [1] 718
## -------------------------------------------------------- 
## fueleconomy2$cyl: 6
## [1] 11885
## -------------------------------------------------------- 
## fueleconomy2$cyl: 8
## [1] 7550
## -------------------------------------------------------- 
## fueleconomy2$cyl: 10
## [1] 138
## -------------------------------------------------------- 
## fueleconomy2$cyl: 12
## [1] 478
## -------------------------------------------------------- 
## fueleconomy2$cyl: 16
## [1] 7

The conditions for theoretical inference are not fully satisfied. Although the overall sample size is much greater than 30 and is less than 10% of the population size, the 16-cylinder group contains only 7 vehicles, which is fewer than 30. Comparisons involving that group should therefore be treated with caution.
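The group-size condition can be checked mechanically. The sketch below re-enters the counts printed above (the threshold of 30 is the rule of thumb used in this report, and the variable names are illustrative).

```r
# Group sizes copied from the by(..., length) output above
group_sizes <- c("2" = 45, "3" = 182, "4" = 12381, "5" = 718, "6" = 11885,
                 "8" = 7550, "10" = 138, "12" = 478, "16" = 7)

# The counts should add up to the number of usable cases
total_cases <- sum(group_sizes)

# Groups that fall short of the n >= 30 rule of thumb
small_groups <- group_sizes[group_sizes < 30]

total_cases
small_groups
```

Only the 16-cylinder group falls below 30, and the counts sum to the 33,384 usable cases.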

by(fueleconomy2$hwy, fueleconomy2$cyl, var)
## fueleconomy2$cyl: 2
## [1] 0.6818182
## -------------------------------------------------------- 
## fueleconomy2$cyl: 3
## [1] 46.56029
## -------------------------------------------------------- 
## fueleconomy2$cyl: 4
## [1] 24.37226
## -------------------------------------------------------- 
## fueleconomy2$cyl: 5
## [1] 11.89581
## -------------------------------------------------------- 
## fueleconomy2$cyl: 6
## [1] 13.15848
## -------------------------------------------------------- 
## fueleconomy2$cyl: 8
## [1] 12.01783
## -------------------------------------------------------- 
## fueleconomy2$cyl: 10
## [1] 5.694489
## -------------------------------------------------------- 
## fueleconomy2$cyl: 12
## [1] 4.983641
## -------------------------------------------------------- 
## fueleconomy2$cyl: 16
## [1] 0.2857143

The group variances are far from equal, ranging from about 0.29 to about 46.56. Therefore, we cannot use the analysis of variance method. Instead, we carry out multiple comparisons using pairwise t-tests with the Bonferroni correction. With 9 groups there are 9 × 8 / 2 = 36 pairwise comparisons, so the per-comparison significance level is 0.05/36, or about 0.00139. Equivalently, the "bonferroni" option of pairwise.t.test multiplies each p-value by 36 (capped at 1), and the adjusted p-values are then compared with 0.05.
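The arithmetic behind the correction can be verified in a few lines. This is a sketch using the variances printed above; the raw p-value of 0.001 is a made-up number used only to illustrate how `p.adjust` scales a p-value.

```r
# Group variances copied from the by(..., var) output above
group_var <- c("2" = 0.6818182, "3" = 46.56029, "4" = 24.37226,
               "5" = 11.89581, "6" = 13.15848, "8" = 12.01783,
               "10" = 5.694489, "12" = 4.983641, "16" = 0.2857143)

# Ratio of largest to smallest variance (roughly 163, far from equal)
var_ratio <- max(group_var) / min(group_var)

# Number of distinct pairs among 9 groups, and the per-comparison level
n_comparisons <- choose(length(group_var), 2)
alpha_per_test <- 0.05 / n_comparisons

# Bonferroni adjustment of a hypothetical raw p-value: multiplied by 36
adjusted_example <- p.adjust(0.001, method = "bonferroni", n = n_comparisons)

c(var_ratio = var_ratio, n_comparisons = n_comparisons,
  alpha_per_test = alpha_per_test, adjusted_example = adjusted_example)
```

Comparing a raw p-value with 0.05/36 and comparing its Bonferroni-adjusted value with 0.05 lead to the same decision, so only one of the two corrections should be applied.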

We have to split the data into separate groups first.

number_of_cylinders = factor(fueleconomy2$cyl, labels = c(2,3,4,5,6,8,10,12,16))
highway_fuel_economy = fueleconomy2$hwy
pairwise.t.test(highway_fuel_economy, number_of_cylinders, p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  highway_fuel_economy and number_of_cylinders 
## 
##    2       3       4       5       6       8       10      12     
## 3  < 2e-16 -       -       -       -       -       -       -      
## 4  < 2e-16 < 2e-16 -       -       -       -       -       -      
## 5  0.00114 < 2e-16 < 2e-16 -       -       -       -       -      
## 6  1.00000 < 2e-16 < 2e-16 < 2e-16 -       -       -       -      
## 8  5.7e-08 < 2e-16 < 2e-16 < 2e-16 < 2e-16 -       -       -      
## 10 1.5e-06 < 2e-16 < 2e-16 < 2e-16 < 2e-16 1.00000 -       -      
## 12 2.8e-15 < 2e-16 < 2e-16 < 2e-16 < 2e-16 1.4e-15 0.00719 -      
## 16 0.00034 < 2e-16 1.3e-15 4.9e-09 2.5e-05 0.63976 0.96168 1.00000
## 
## P value adjustment method: bonferroni
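The output header shows that these tests use a pooled standard deviation, which assumes a common variance across groups. Since the group variances differ widely, one possible alternative is to set `pool.sd = FALSE`, which runs a separate Welch t-test for each pair. The sketch below illustrates the call on synthetic data (`toy_hwy` and `toy_cyl` are hypothetical, not the report's data); it is not what was run above, and on the real data the resulting table would differ from the one shown.

```r
set.seed(1)  # reproducible synthetic data

# Hypothetical toy groups with clearly unequal spreads (not the report's data)
toy_hwy <- c(rnorm(30, mean = 20, sd = 1),
             rnorm(30, mean = 22, sd = 5),
             rnorm(30, mean = 40, sd = 2))
toy_cyl <- factor(rep(c("8", "6", "3"), each = 30))

# pool.sd = FALSE runs a separate Welch t-test for every pair of groups,
# so no common variance is assumed; p-values are still Bonferroni-adjusted
welch <- pairwise.t.test(toy_hwy, toy_cyl,
                         p.adjust.method = "bonferroni",
                         pool.sd = FALSE)

# Lower-triangular matrix of adjusted p-values, one entry per pair
welch$p.value
```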

Part 5 - Conclusion:

The mean highway fuel economy differs across cylinder counts. Vehicles that have 3 engine cylinders seem to be the most sustainable in terms of highway fuel economy: they have the highest mean highway fuel economy, 40.48 miles per gallon. In general, as the number of cylinders increases, the mean highway fuel economy decreases, making the vehicle less sustainable.

The p-values reported above are already Bonferroni-adjusted, so they are compared with the overall significance level of 0.05. For most of the comparisons, the adjusted p-value is far below 0.05. However, for 2 versus 6 cylinders, 8 versus 10 cylinders, 8 versus 16 cylinders, 10 versus 16 cylinders, and 12 versus 16 cylinders, the adjusted p-value is greater than 0.05. For those pairs there is not sufficient evidence to reject the null hypothesis that the two group means are equal; for every other pair, the difference in mean highway fuel economy is statistically significant.
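The pairs without a significant difference can be read off the table programmatically. The sketch below re-enters the adjusted p-values from the output above, assuming they are compared with an overall level of 0.05 because they are already Bonferroni-adjusted. Entries printed as "< 2e-16" are entered as the placeholder 2e-16, which is an upper bound rather than the exact value.

```r
groups <- c("2", "3", "4", "5", "6", "8", "10", "12", "16")
tiny <- 2e-16  # placeholder for entries printed as "< 2e-16"

# Lower-triangular rows of the table, one vector per row (3, 4, ..., 16)
rows <- list(
  c(tiny),
  c(tiny, tiny),
  c(0.00114, tiny, tiny),
  c(1, tiny, tiny, tiny),
  c(5.7e-08, tiny, tiny, tiny, tiny),
  c(1.5e-06, tiny, tiny, tiny, tiny, 1),
  c(2.8e-15, tiny, tiny, tiny, tiny, 1.4e-15, 0.00719),
  c(0.00034, tiny, 1.3e-15, 4.9e-09, 2.5e-05, 0.63976, 0.96168, 1)
)

# Rebuild the matrix in the same layout as the pairwise.t.test output
p_adj <- matrix(NA_real_, nrow = 8, ncol = 8,
                dimnames = list(groups[-1], groups[-9]))
for (i in seq_along(rows)) {
  p_adj[i, seq_along(rows[[i]])] <- rows[[i]]
}

# Pairs whose adjusted p-value is not below 0.05 (NA cells are ignored)
idx <- which(p_adj >= 0.05, arr.ind = TRUE)
not_significant <- paste(colnames(p_adj)[idx[, "col"]], "vs",
                         rownames(p_adj)[idx[, "row"]])
not_significant
```

This yields five pairs: 2 vs 6, 8 vs 10, 8 vs 16, 10 vs 16, and 12 vs 16. Three of the five involve the 16-cylinder group, which has only 7 vehicles.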

References:

http://blog.rstudio.org/2014/07/23/new-data-packages/