Introduction

In recent years, regulators around the world have introduced strict emissions controls on motor vehicles to tackle pollution and climate change (Henry, 2012). The amount of fuel a motor vehicle uses is directly linked to its emissions because burning fuel releases carbon dioxide (\(CO_2\)) into the atmosphere.
A turbocharger is a small turbine-driven compressor that is fitted to an internal combustion engine. The turbocharger harvests energy from the engine's exhaust gases to pressurise the air being fed into the combustion chamber. Increasing the amount of air in the combustion chamber increases the efficiency of combustion and results in more power from a given engine displacement.
Historically, turbochargers were found on high performance models because of the increase in power. As emissions controls have become stricter, manufacturers are fitting turbochargers to the majority of new vehicles in order to comply with emissions regulations and maintain performance as engine sizes reduce.
Fuel efficiency is typically measured either in units of fuel consumed per unit of distance travelled (e.g. litres per 100 kilometres) or in distance travelled per unit of fuel (e.g. miles per gallon).
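The dataset used later in this presentation reports fuel economy in miles per gallon (MPG), so a higher value means a more efficient vehicle. A minimal sketch of the conversion to litres per 100 km, assuming US gallons; the helper name `mpg_to_l_per_100km` is hypothetical and not part of the original analysis:

```r
# Convert US miles per gallon to litres per 100 km (hypothetical helper).
# 1 US gallon = 3.785411784 L and 1 mile = 1.609344 km, so
# L/100 km = 100 * 3.785411784 / (1.609344 * mpg), roughly 235.215 / mpg.
mpg_to_l_per_100km <- function(mpg) {
  stopifnot(is.numeric(mpg), all(mpg > 0, na.rm = TRUE))
  100 * 3.785411784 / (1.609344 * mpg)
}

# Example: two illustrative MPG values.
mpg_to_l_per_100km(c(24, 30))
```

Because the two measures are reciprocals, a difference in mean MPG does not translate linearly into a difference in mean litres per 100 km.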
The internal combustion motor vehicle fleet is diverse and consists of vehicles that are designed and marketed to fulfil many different consumer preferences and tastes. Some vehicles are focused on high performance and driver engagement whilst others are focused on efficiency and practicality. These competing preferences and tastes result in various configurations of engine and other key vehicle components which may also have an impact on emissions and fuel efficiency.

Problem Statement

Does the installation of turbochargers to the most common engine configurations result in improved fuel economy and hence lower emissions?
This presentation will use a combination of descriptive statistics, visualisations and a two-sample t-test to answer this question.
There are two sub-questions that need to be investigated in order to answer it accurately:
- What are the most common engine configurations (by cylinder count)?
- What is the difference in average fuel economy by induction type?
A two-sample t-test has been chosen to determine whether there is a statistically significant difference in fuel efficiency between vehicles with and without turbochargers. It is an appropriate test given the data source contains the population of turbocharged and non-turbocharged (naturally aspirated) vehicles on sale in the United States of America.

Turbo Engine

Non-Turbo Engine

Data

The dataset used to investigate this problem has been sourced from the US Department of Energy, Office of Energy Efficiency & Renewable Energy, and the United States Environmental Protection Agency (EPA). It can be downloaded here.
The file is located under the Download the Vehicle Data section. The unzipped CSV has been used.
The dataset contains estimates of a vehicle's city, highway and combined fuel economy and its emissions. A number of other variables are also present to assist with the classification of vehicle types, given the diverse range of vehicles.
Data is provided by the manufacturers to the EPA, which in turn tests 15%-20% of those vehicles in its own testing laboratories. Detailed information on how the tests are conducted can be found here.
The dataset contains 83 variables and 44,187 observations. A full description of each variable can be found here.

The table below shows a peek at the dataset we will use for this analysis, which contains 10,503 observations of 7 variables of interest.

Fuel Economy Data Subset
combined_MPG	cylinders	displacement	make	year	transmission	turbocharged
15.1896	10	5.2	Audi	2012	Automatic (AM6)	NO
14.1229	10	5.2	Audi	2012	Manual 6-spd	NO
15.1896	10	5.2	Audi	2012	Automatic (AM6)	NO
14.1229	10	5.2	Audi	2012	Manual 6-spd	NO
12.7126	12	6.5	Lamborghini	2012	Automatic (S7)	NO
12.5821	10	4.8	Lexus	2012	Automatic (S6)	NO

Data Cont

The subset of the dataset includes the following key variables:
- atvType (Categorical): Describes whether the vehicle has an alternative or advanced fuel technology type. A blank value indicates a traditional gasoline powered vehicle, which is the category of vehicle we are interested in. The dataset has been filtered to include only these vehicles, so every Miles Per Gallon (MPG) observation relates to a pure gasoline powered vehicle.
- comb08U (Numerical): The un-rounded combined MPG for the city (55%) and highway (45%) test cycles. The variable has been renamed as combined_MPG for easier interpretation.
- cylinders (Ordinal): The number of engine cylinders.
- displ (Numerical): The engine displacement in litres. This variable has been renamed as displacement for easier interpretation.
- make (Nominal): The manufacturer of the vehicle.
- year (Interval): The model year of the vehicle. Data has been filtered for model years from 2012 onwards, as this was the year harmonised emissions standards were adopted in the United States by the EPA and the National Highway Traffic Safety Administration (DieselNet, 2021).
- tCharger (Nominal): T if the vehicle is turbocharged and blank if not. When the dataset was imported, R read this variable as logical. Type conversion and transformation have been applied, with TRUE values converted to “YES” and blank values to “NO”, to allow for comparison between turbocharged and non-turbocharged vehicles. The variable has been renamed to turbocharged for easier interpretation.
- sCharger (Nominal): S if the vehicle is supercharged and blank if not. When the dataset was imported, R read this variable as a character. It has been transformed with S converted to “Yes” and blank to “No”, then filtered to keep only “No” so that all supercharged vehicles are excluded. The variable was then removed from the dataset as we are not interested in this induction method.
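The filtering and renaming steps in the list above can be sketched as follows. This is a base R illustration on a made-up toy data frame, not the original pre-processing code (which is not shown in this presentation). It assumes that blank atvType and sCharger values are read as empty strings, that blank tCharger values are read as NA, and that the year filter keeps model year 2012 itself, as the sample table suggests. The transmission column is left out for brevity.

```r
# Toy stand-in for the raw EPA file. Column names follow the variable list
# above; the values are invented for illustration only.
raw <- data.frame(
  atvType   = c("", "Hybrid", "", "", ""),
  comb08U   = c(24.1, 48.0, 15.2, 30.5, 22.3),
  cylinders = c(4, 4, 10, 4, 6),
  displ     = c(2.0, 1.8, 5.2, 1.5, 3.0),
  make      = c("Audi", "Toyota", "Audi", "Honda", "Ford"),
  year      = c(2015, 2016, 2012, 2010, 2018),
  tCharger  = c(TRUE, NA, NA, TRUE, NA),
  sCharger  = c("", "", "", "", "S"),
  stringsAsFactors = FALSE
)

# Keep gasoline-only, model year 2012 onwards, non-supercharged vehicles.
keep <- raw$atvType == "" & raw$year >= 2012 & raw$sCharger == ""
veh <- raw[keep, c("comb08U", "cylinders", "displ", "make", "year", "tCharger")]

# Rename the variables for easier interpretation.
names(veh)[names(veh) == "comb08U"]  <- "combined_MPG"
names(veh)[names(veh) == "displ"]    <- "displacement"
names(veh)[names(veh) == "tCharger"] <- "turbocharged"

# TRUE becomes "YES"; NA (blank in the source file) becomes "NO".
veh$turbocharged <- ifelse(veh$turbocharged %in% TRUE, "YES", "NO")
veh$cylinders <- factor(veh$cylinders)

veh
```

Using `%in% TRUE` rather than `== TRUE` avoids propagating NA into the new variable.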

Descriptive Statistics and Visualisation

We want to understand which cylinder configurations are most common in the dataset. The bar chart below shows that 4 and 6 cylinder engines are the most common, together representing 76.9% of the total observations, followed by 8 cylinder engines with 17.43%. We are focused on the most common cylinder configurations, so we will filter our data for 4 and 6 cylinders only.
Full R code outputs for the tables and charts in this section can be found in Appendix 1.

Descriptive Statistics and Visualisation cont.

Summary Statistics by Induction Type
turbocharged	MIN	Q1	Median	Q3	Max	Mean	SD	N	Missing
NO	15.00	20.84	23.00	27.31	36.62	24.08	4.41	2081	0
YES	13.61	21.40	24.01	26.49	38.58	24.07	3.66	1962	0

The summary statistics table has been visualised in the boxplot. Although they are not visible in the summary statistics, the boxplot shows a number of outliers in the turbocharged observations. There are no missing values.
We will identify and filter the outliers. Inspection of the boxplot data finds 16 outliers. Filtering the outliers is not strictly necessary for our t-test, which is reasonably robust to outliers at these sample sizes, but the outliers could affect our summary statistics and other statistical investigations which could be completed later, such as linear regression.
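As a rough cross-check, the usual boxplot rule flags values more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the rounded quartiles from the summary table above. It is an approximation: R's `boxplot()` works from hinges on the raw data, which can differ slightly from the tabulated quartiles, and the column names used here are illustrative.

```r
# Min, quartiles and max as reported in the summary table above (2 d.p.).
quartiles <- data.frame(
  turbocharged = c("NO", "YES"),
  min = c(15.00, 13.61),
  q1  = c(20.84, 21.40),
  q3  = c(27.31, 26.49),
  max = c(36.62, 38.58)
)

# Fences at 1.5 * IQR below Q1 and above Q3.
iqr <- quartiles$q3 - quartiles$q1
quartiles$lower_fence <- quartiles$q1 - 1.5 * iqr
quartiles$upper_fence <- quartiles$q3 + 1.5 * iqr

# A group has outliers if its minimum or maximum falls outside the fences.
quartiles$has_outliers <- quartiles$min < quartiles$lower_fence |
  quartiles$max > quartiles$upper_fence

quartiles
```

On these figures only the turbocharged group has values outside its fences, which is consistent with the boxplot.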

Descriptive Statistics and Visualisation cont

Summary Statistics by Induction Type outliers removed
turbocharged	MIN	Q1	Median	Q3	Max	Mean	SD	N	Missing
NO	15.00	20.84	23	27.31	36.62	24.08	4.41	2081	0
YES	14.13	21.38	24	26.44	33.92	24.01	3.54	1946	0

With outliers removed, inspection of the summary statistics and boxplots reveals that turbocharged vehicles appear to have a higher median fuel economy than non-turbocharged vehicles, with 24 MPG and 23 MPG respectively.
Turbocharged vehicles appear to have lower variability in MPG, with a smaller range (19.79 vs 21.62), interquartile range (5.06 vs 6.47) and standard deviation (3.54 vs 4.41) than non-turbocharged vehicles.
Both induction types have distributions that skew to the right, although non-turbocharged vehicles have a stronger skew. We can also observe this in the summary statistics table, where the mean for non-turbocharged vehicles is higher than the median.

Hypothesis Testing

A two-sample, or independent samples, t-test will be completed to test whether there is a statistically significant difference in fuel efficiency between turbocharged and non-turbocharged vehicles. We will confirm the result via both the p-value and confidence interval approaches.
Our hypotheses are:
\[H_{0}: \mu_{1} - \mu_{2} = 0\]
\[H_{A}: \mu_{1} - \mu_{2} \neq 0\]
where \(\mu_{1}\) is the mean MPG of turbocharged vehicles and \(\mu_{2}\) is the mean MPG of non-turbocharged vehicles.
In order to complete a valid two sample t-test the following assumptions are required to hold true (Baglin, 2020):
- The turbocharged and non-turbocharged MPG populations are independent. This is true as the fuel efficiency of a turbocharged engine has no impact on the fuel efficiency of a non-turbocharged engine.
- The samples are drawn from a population with a normal distribution. QQ plots will be constructed to check whether normality is present.
- The variance of each population is equal. A Levene’s test will be completed to check this assumption.
Full R code outputs for the data pre-processing needed to run these tests can be found in Appendix 1.

Normality

We have large samples of turbocharged and non-turbocharged vehicles, 1946 and 2081 respectively. Therefore the Central Limit Theorem applies and the sampling distribution of each sample mean will be approximately normal, even though the populations themselves do not show a normal distribution (Baglin, 2021).
For completeness, QQ plots of the turbocharged and non-turbocharged observations have been completed below. We can see the distributions are skewed to the right, as a number of points deviate from the normal reference line.
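The Central Limit Theorem argument can be illustrated with a small simulation. This is a sketch only: the gamma distribution, its parameters and the seed are arbitrary choices that give a right-skewed variable, and they are not fitted to the MPG data.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Sample skewness (third standardised moment).
skewness <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

n <- 2000  # roughly the per-group sample size in this analysis

# One right-skewed sample, and the means of 1000 such samples.
one_sample   <- rgamma(n, shape = 2, scale = 12)
sample_means <- replicate(1000, mean(rgamma(n, shape = 2, scale = 12)))

# Skewness of the raw values versus skewness of the sample means.
c(raw = skewness(one_sample), means = skewness(sample_means))
```

The raw values stay clearly skewed while the sample means are close to symmetric, which is the property the t-test relies on here.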

turbompg$combined_MPG %>% qqPlot(dist="norm", main = "Turbocharged qqPlot", ylab = "MPG")

## [1] 1692 1855

nampg$combined_MPG %>% qqPlot(dist="norm", main = "Non-turbocharged qqPlot", ylab = "MPG")

## [1]  111 1567

Variance Homogeneity

The Levene’s test has the following hypotheses:
\[H_{0}: \sigma^2_{1} = \sigma^2_{2}\]
\[H_{A}: \sigma^2_{1} \neq \sigma^2_{2}\]

lev <- leveneTest(combined_MPG ~ turbocharged, data = vehfinaloutlierremoved)

lev %>% kable(caption = "Levene Test Summary") %>%
  kable_classic(full_width = T) %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"))

Levene Test Summary
	Df	F value	Pr(>F)
Induction Type Group	1	79.5	7.23e-19
	4025	NA	NA

The Levene’s test returns a p-value < .001, which is below the 5% significance level. This indicates a statistically significant difference between the two variances, so the assumption of equal variance is violated.
The two-sample t-test will therefore be run as a Welch test, which does not assume equal variance.
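To make the mechanics of the test concrete, the sketch below writes out Levene's test with median centring in base R: take each observation's absolute deviation from its group median, then run a one-way ANOVA on those deviations. The helper name `levene_median` and the simulated data are hypothetical. The analysis itself uses `leveneTest()` from the car package, which, as far as I am aware, centres on the median by default.

```r
# Levene's test with median centring, written out in base R (hypothetical helper).
levene_median <- function(y, g) {
  g <- factor(g)
  z <- abs(y - ave(y, g, FUN = median))  # absolute deviation from each group's median
  anova(lm(z ~ g))                       # one-way ANOVA on those deviations
}

# Simulated data with deliberately different spreads in the two groups.
set.seed(42)
toy <- data.frame(
  mpg = c(rnorm(200, mean = 24, sd = 2), rnorm(200, mean = 24, sd = 5)),
  turbocharged = rep(c("YES", "NO"), each = 200)
)

res <- levene_median(toy$mpg, toy$turbocharged)
res
```

A small p-value in the first row of the ANOVA table indicates that the typical size of the deviations, and hence the spread, differs between the groups.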
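For reference, the Welch statistic and its approximate degrees of freedom can be written as follows, where \(\bar{x}_{i}\), \(s_{i}\) and \(n_{i}\) denote the sample mean, sample standard deviation and sample size of group \(i\) (notation introduced here for illustration):
\[t = \frac{\bar{x}_{1} - \bar{x}_{2}}{\sqrt{\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}}}\]
\[\nu \approx \frac{\left(\frac{s_{1}^{2}}{n_{1}} + \frac{s_{2}^{2}}{n_{2}}\right)^{2}}{\frac{\left(s_{1}^{2}/n_{1}\right)^{2}}{n_{1} - 1} + \frac{\left(s_{2}^{2}/n_{2}\right)^{2}}{n_{2} - 1}}\]
Unlike the pooled-variance t-test, each group contributes its own variance estimate, so the statistic remains usable when \(\sigma^2_{1} \neq \sigma^2_{2}\).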

Hypothesis Testing Cont.

t.test(vehfinaloutlierremoved$combined_MPG ~ vehfinaloutlierremoved$turbocharged,
       data = vehfinaloutlierremoved,
       var.equal = FALSE,
       alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  vehfinaloutlierremoved$combined_MPG by vehfinaloutlierremoved$turbocharged
## t = 0.55292, df = 3936.5, p-value = 0.5803
## alternative hypothesis: true difference in means between group NO and group YES is not equal to 0
## 95 percent confidence interval:
##  -0.1767867  0.3156700
## sample estimates:
##  mean in group NO mean in group YES 
##          24.07947          24.01003

A Welch two-sample t-test was used to test for a significant difference between the mean combined MPG of turbocharged (24.01) and non-turbocharged (24.08) vehicles, because the Levene’s test indicated that homogeneity of variance was not present.
Inspection of the QQ plots showed that both populations exhibited non-normal distributions. Because each group has a large sample size, the t-test can still be applied under the Central Limit Theorem.
The test, which assumes unequal variance, did not find a statistically significant difference between the mean MPG of turbocharged and non-turbocharged vehicles with 4 or 6 cylinders, t(df = 3936.5) = 0.55, p = 0.58, 95% CI for the difference in means [-0.18, 0.32].
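The reported result can be sanity-checked from the group summaries alone. The sketch below recomputes the Welch statistic, degrees of freedom, p-value and confidence interval from the group means in the t-test output and the rounded standard deviations and counts in the summary table. Because the standard deviations are rounded to two decimal places, the figures should land close to, but not exactly on, the `t.test()` output. The variable names are illustrative.

```r
# Group summaries as reported above (SDs are rounded to 2 d.p.).
m_no  <- 24.07947; s_no  <- 4.41; n_no  <- 2081  # non-turbocharged
m_yes <- 24.01003; s_yes <- 3.54; n_yes <- 1946  # turbocharged

# Per-group variance of the mean and the combined standard error.
v_no  <- s_no^2 / n_no
v_yes <- s_yes^2 / n_yes
se    <- sqrt(v_no + v_yes)

# Same direction as the t.test output: group NO minus group YES.
diff   <- m_no - m_yes
t_stat <- diff / se

# Welch-Satterthwaite degrees of freedom.
df <- (v_no + v_yes)^2 / (v_no^2 / (n_no - 1) + v_yes^2 / (n_yes - 1))

# Two-sided p-value and 95% confidence interval for the difference in means.
p_value <- 2 * pt(-abs(t_stat), df)
ci      <- diff + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df, p = p_value, lower = ci[1], upper = ci[2]), 3)
```

Note that the output reports the difference as non-turbocharged minus turbocharged, the reverse of \(\mu_{1} - \mu_{2}\) as defined in the hypotheses. For a two-sided test this changes only the sign of t and of the interval, not the conclusion.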

Discussion

The statistical analysis completed in this presentation fails to find sufficient statistical evidence that fitting turbochargers to the most common engine cylinder configurations of 4 and 6 improves fuel economy.
We found that 4 and 6 cylinder engines are the most common configurations among vehicles available for sale, based on the dataset. Turbocharged vehicles had a higher median MPG than non-turbocharged vehicles and also displayed less variability in efficiency. Whilst outliers were present and have been removed, they did not materially impact the summary statistics calculated.
The distribution of MPG for both turbocharged and non-turbocharged vehicles did not display a normal shape, with evidence of a clear right skew for non-turbocharged vehicles even after the removal of outliers from the dataset.
Whilst this investigation fails to find statistical evidence that fitting turbochargers to the most common cylinder configuration results in improved fuel efficiency, further investigation is warranted.
Many factors affect the fuel efficiency of a vehicle, including vehicle type, driving style, the technology of the turbocharger and engine management systems.
Further investigation of the following may provide additional insight:
- Including all cylinder counts in the data or completing the analysis on a specific cylinder configuration. Completing the analysis on all or each individual cylinder configuration may yield different results. For example, fitting turbochargers to smaller engines with lower cylinder counts may be more effective than fitting them to larger engines with higher cylinder counts.
- Analysing the results in two specific time periods to understand if technology improvement over time impacts the results.
- Completing the analysis based on the separate city and highway variables to understand the impact of driving style in a more complete manner rather than the assumption that all vehicles are driven 55% city and 45% highway.
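The last suggestion can be sketched by recombining the city and highway figures under a different driving mix. The example assumes the combined figure is a weighted harmonic mean of the two cycles, which is my understanding of how the EPA derives its combined label value. The helper name `combined_mpg` is hypothetical, and the inputs stand in for the dataset's city and highway MPG columns.

```r
# Combined fuel economy as a weighted harmonic mean of city and highway MPG.
# w_city is the assumed share of city driving (0.55 under the standard split).
combined_mpg <- function(city, highway, w_city = 0.55) {
  stopifnot(w_city >= 0, w_city <= 1)
  1 / (w_city / city + (1 - w_city) / highway)
}

combined_mpg(city = 20, highway = 30)                # standard 55/45 split
combined_mpg(city = 20, highway = 30, w_city = 0.8)  # a mostly-city driver
```

Re-running the t-test on values recombined this way would show how sensitive the conclusion is to the assumed driving mix.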

Appendix 1 - Full R Code Outputs

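## Descriptive Statistics R Codes

## Packages used by the code in this presentation
library(dplyr)      # %>%, group_by(), summarise(), mutate(), filter()
library(ggplot2)    # bar chart and boxplots
library(knitr)      # kable()
library(kableExtra) # kbl(), kable_classic(), kable_styling()
library(car)        # qqPlot(), leveneTest()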

## Bar chart of cylinder count data
barlabels <- vehfinal %>% group_by(cylinders) %>% summarise(count = n()) %>% mutate(per = round(count / sum(count) * 100,2))

## Bar Chart with % total labels.
barlabels %>% ggplot(aes(cylinders, fill = cylinders))+
  geom_bar(aes(x=cylinders, y=count), stat="identity")+
  ggtitle("Cylinder Count (with total proportion)")+
  theme_bw()+
  theme(legend.position = c(.75,.8),legend.justification = c(.7,1),plot.title = element_text(hjust = 0.5))+
  scale_fill_brewer(palette = "Blues")+
  geom_text(aes(label = paste0(per, "%"), y = count), vjust = -.25)

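## filter for 4 and 6 cylinders only
## %in% tests membership; `==` against a length-2 vector is recycled row by row
## and silently drops matching rows, so row counts can differ from a run that used `==`.
vehfinalfilt <- vehfinal %>% filter(cylinders %in% c("4", "6"))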

## Summary Stats for 4 and 6 cylinder vehicles grouped by induction type.
vehfinalfilt %>% group_by(turbocharged) %>% summarise(MIN = round(min(combined_MPG, na.rm = TRUE),2),
                                        Q1 = round(quantile(combined_MPG, probs = 0.25, na.rm = TRUE),2),
                                        Median = round(median(combined_MPG, na.rm = TRUE),2),
                                        Q3 = round(quantile(combined_MPG, probs = 0.75, na.rm = TRUE),2),
                                        Max = round(max(combined_MPG, na.rm = TRUE),2),
                                        Mean = round(mean(combined_MPG, na.rm = TRUE),2),
                                        SD = round(sd(combined_MPG, na.rm = TRUE),2),
                                        N = n(),
                                        Missing = sum(is.na(combined_MPG))) %>%
  kbl(caption = "Summary Statistics by Induction Type") %>% 
  kable_classic(full_width = T) %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"))

## boxplot of data with outliers.
vehfinalfilt %>%
  ggplot(aes(x = combined_MPG, y = turbocharged, fill = turbocharged)) +
  geom_boxplot() +
  ggtitle("Boxplot of Combined MPG by Induction Type") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Blues")

Appendix 1 - Full R Code Outputs cont.

## outlier filtering
boxplotoutliers <- boxplot(vehfinalfilt$combined_MPG ~ vehfinalfilt$turbocharged, plot=FALSE)$out

vehfinaloutlierremoved <- vehfinalfilt %>% filter(!(combined_MPG %in% boxplotoutliers))

## Summary statistics with outliers removed
vehfinaloutlierremoved %>% group_by(turbocharged) %>% summarise(MIN = round(min(combined_MPG, na.rm = TRUE),2),
                                        Q1 = round(quantile(combined_MPG, probs = 0.25, na.rm = TRUE),2),
                                        Median = round(median(combined_MPG, na.rm = TRUE),2),
                                        Q3 = round(quantile(combined_MPG, probs = 0.75, na.rm = TRUE),2),
                                        Max = round(max(combined_MPG, na.rm = TRUE),2),
                                        Mean = round(mean(combined_MPG, na.rm = TRUE),2),
                                        SD = round(sd(combined_MPG, na.rm = TRUE),2),
                                        N = n(),
                                        Missing = sum(is.na(combined_MPG))) %>%
  kable(caption = "Summary Statistics by Induction Type outliers removed") %>% 
  kable_classic(full_width = T) %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"))

## boxplot with outliers removed.
vehfinaloutlierremoved %>%
  ggplot(aes(x = combined_MPG, y = turbocharged, fill = turbocharged)) +
  geom_boxplot() +
  ggtitle("Boxplot of Combined MPG by Induction Type outliers removed") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette = "Blues")

## Hypothesis Testing data cleaning
## Normality data filter
turbompg <- vehfinaloutlierremoved %>% filter(turbocharged == "YES") 
nampg <- vehfinaloutlierremoved %>% filter(turbocharged == "NO")

## Levene test data formatting
lev$`Pr(>F)` <- format(lev$`Pr(>F)`, digits = 3)
lev$`F value` <- format(lev$`F value`, digits = 3)
attributes(lev)$row.names <- c("Induction Type Group", "")

References

Baglin, J 2021, ‘Module 5 Sampling: Randomly Representative’, Course Notes MATH1324, RMIT University, viewed 12 October 2021, https://astral-theory-157510.appspot.com/secured/MATH1324_Module_05.html#Central_Limit_Theorem
Baglin, J 2020, ‘Module 7 Testing the Null: Data on Trial’, Course Notes MATH1324, RMIT University, viewed 12 October 2021, https://astral-theory-157510.appspot.com/secured/MATH1324_Module_07.html
DieselNet 2021, Emissions Standards, DieselNet.com, viewed 10 October 2021, https://dieselnet.com/standards/us/index.php
Henry, J 2012, Turbocharging to Save Gas, Instead of to Go Fast, Forbes, viewed 12 October 2021, https://www.forbes.com/sites/jimhenry/2012/02/28/turbocharging-to-save-gas-instead-of-to-go-fast/?sh=6ce805083a6b.

Fuel Efficiency and Induction Type

Is Turbocharging effective?
