Synopsis

This report quantitatively analyzes the impact of transmission type on fuel efficiency. It uses the mtcars data provided in R's datasets package and develops both an exploratory and an inferential analysis to address the research question. Stepwise model selection through minimization of the AIC yields a model based on three explanatory variables. Transmission type has a statistically significant impact on fuel efficiency at the 5% level, with manual transmission having the higher fuel efficiency. Diagnostic plots and analysis show that the results are robust against outliers. However, the lack of theoretical reasoning behind the selection of explanatory variables and the high degree of correlation between potential independent variables present a strong limitation. The results should be taken with caution, and a more detailed study is necessary. For reproducibility, all the accompanying code for the analysis is provided in the appendix.

Introduction

This report investigates the difference in fuel efficiency between manual and automatic transmission automobiles. This is primarily accomplished through quantitative analysis of the mtcars data set, which lists 32 automobiles characterized by different attributes. Essentially, the report tests the null hypothesis that there is no difference in fuel efficiency, as measured by miles/gallon (mpg), between manual and automatic transmission. Transmission is coded as a dummy variable with automatic as the base level (am = 0) and manual transmission as the comparison group (am = 1). Because variation in fuel efficiency is confounded with multiple variables, the analysis controls for the impact of other variables to isolate the effect of transmission on the target variable. In doing so, the multivariate regression analysis quantifies the mpg difference between the two transmission types if, indeed, there is a statistically significant difference.

Descriptive Analysis

Before proceeding with the regression analysis, some descriptive and exploratory analysis is required. The analyzed data set contains 13 cars with manual transmission and 19 with automatic transmission. A six-number summary of the target variable, miles/gallon, is provided below; the code is provided in appendix_1.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90
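
The group sizes and the summary above can be checked with a few lines of base R. This is a minimal sketch; the object names `counts` and `group_means` are illustrative and do not come from the report's own code.

```r
# mtcars ships with R's datasets package.
data(mtcars)

# Number of cars per transmission type (am = 0 automatic, am = 1 manual).
counts <- table(mtcars$am)
print(counts)

# Six-number summary of the target variable.
print(summary(mtcars$mpg))

# Unadjusted mean mpg per transmission type (no control variables yet).
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
print(group_means)
```

The unadjusted means ignore confounders such as weight, which is why the comparison is revisited with a regression model later in the report.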

A visualization of the relationship between the target variable and the explanatory variables under consideration is essential for shedding light on the models that follow. Two sets of graphs are provided in appendix_2. The first set depicts the mean of the target variable, mpg, by the variable of interest, am, and facets this relation by each of the factor control variables under consideration (cyl, vs and gear). The second set shows the relationship between mpg and each of the continuous control variables (disp, hp, drat, wt, qsec), faceted by transmission type.

From these sets of graphs it can plausibly be argued that manual transmission is usually associated with higher mpg, though the effect size depends on the confounding factors under consideration. There is, of course, a lot of uncertainty about any conclusion drawn from these exploratory graphs, and formal hypothesis testing is required. This is the task of the next section.

Inferential Analysis

Quantitative analysis normally requires theoretical knowledge to guide the choice of explanatory variables. However, given that the author is not well versed in the research and literature on fuel efficiency and vehicle characteristics, a more pragmatic approach is taken: stepwise model selection based on the Akaike Information Criterion (AIC). The procedure, which is provided in appendix_3, starts from the full model, which includes all explanatory variables in the mtcars data set, and adds or drops variables step by step with the objective of minimizing the AIC. This means that multiple models are fitted and compared with respect to the AIC, albeit internally. The final model is reported below.

##                 Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)     9.617781  6.9595930  1.381946 1.779152e-01
## as.factor(am)1  2.935837  1.4109045  2.080819 4.671551e-02
## wt             -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec            1.225886  0.2886696  4.246676 2.161737e-04
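
The selection procedure described above can be sketched as follows. The exact call used in the report is not shown, so the full-model formula and the arguments passed to `step()` are assumptions; only the name `step.model` (used in appendix_4) and the final formula are taken from the report itself.

```r
data(mtcars)

# Assumed full model: every other column as a predictor, with am as a factor.
full.model <- lm(mpg ~ as.factor(am) + cyl + disp + hp + drat + wt + qsec +
                   vs + gear + carb, data = mtcars)

# Stepwise search in both directions, minimizing AIC (penalty k = 2 by default).
step.model <- step(full.model, direction = "both", trace = 0)

# Selected formula and its coefficient table.
print(formula(step.model))
print(summary(step.model)$coefficients)
```

Setting `trace = 1` prints each intermediate model together with its AIC, which makes the internal comparisons visible.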

Results and Discussion

The final model shows that transmission is an important predictor of fuel efficiency. With a p-value of 0.0467155, which is less than \(\alpha\) = 0.05, the coefficient is statistically significant at the 5% significance level. Holding weight and 1/4 mile time constant, automobiles with a manual transmission achieve on average 2.9358372 miles/gallon more than automatic transmission automobiles. An approximate 95% confidence interval for this increase (the estimate plus or minus 1.96 standard errors) is (0.1704643, 5.70121). The model explains 85.0% of the variation in miles/gallon (adjusted R-squared: 83.4%).
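
The interval quoted above coincides with the estimate plus or minus 1.96 standard errors, that is, a normal approximation. The sketch below reproduces it and, for comparison, computes the t-based interval from `confint()`, which uses the 28 residual degrees of freedom. The object names `fit`, `ci_normal` and `ci_t` are illustrative.

```r
data(mtcars)
fit <- lm(mpg ~ as.factor(am) + wt + qsec, data = mtcars)

coefs <- summary(fit)$coefficients
est <- coefs["as.factor(am)1", "Estimate"]
se  <- coefs["as.factor(am)1", "Std. Error"]

# Normal approximation: estimate +/- 1.96 standard errors.
ci_normal <- c(lower = est - 1.96 * se, upper = est + 1.96 * se)

# Interval based on the t distribution with the residual degrees of freedom.
ci_t <- confint(fit, "as.factor(am)1", level = 0.95)

print(ci_normal)
print(ci_t)

# Multiple and adjusted R-squared of the final model.
print(summary(fit)$r.squared)
print(summary(fit)$adj.r.squared)
```

With a sample of only 32 cars the t-based interval is the more conservative choice, since it is wider than the normal approximation.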

As a next step, it is important to perform model diagnostics to ensure that the assumptions of regression modeling are not violated. Diagnostic plots and the accompanying code are provided in appendix_5.

The diagnostic plots are broadly acceptable. There appears to be some non-linear association between the residuals and the fitted values, which is not ideal, but it is not visually strong. The Q-Q plot shows that the residuals are approximately normal; however, there are a few outliers, such as the Chrysler Imperial and the Toyota Corolla. The scale-location and residual-leverage plots do not suggest strong outliers. From the diagnostic plots it is clear that the Chrysler Imperial observation needs a closer look.

As a next step, the dfbetas of the model are computed and reported in appendix_4. The maximum dfbetas for the explanatory variables in the final model are 0.5626418, 1.0938422 and 0.4968861. The maxima for the ‘am’ and ‘wt’ explanatory variables are due to the Chrysler Imperial, and the maximum for the ‘qsec’ variable is due to the Fiat 128. As a final step, a ‘leave-one-out’ regression is performed: each observation is left out in turn and the final model is re-estimated. The coefficient for each explanatory variable is then normalized by subtracting the corresponding coefficient from the full model. Similarly, the standard error of each coefficient estimate is standardized by dividing the leave-one-out standard error by the standard error from the full model. The result is visualized and reported in the appendix. It shows that, with the exception of the intercept, the differences between the ‘leave-one-out’ coefficients and the full-model coefficients are not significantly different from 0. Hence, it can be argued that the coefficient estimates are robust to outliers.
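
The leave-one-out procedure described above can be sketched as follows. The report's own code for this step is not shown, so the loop and the names `coef_diff` and `se_ratio` are illustrative assumptions.

```r
data(mtcars)
final_formula <- mpg ~ as.factor(am) + wt + qsec
fit <- lm(final_formula, data = mtcars)
full_coef <- coef(fit)
full_se <- summary(fit)$coefficients[, "Std. Error"]

n <- nrow(mtcars)
coef_diff <- matrix(NA_real_, nrow = n, ncol = length(full_coef),
                    dimnames = list(rownames(mtcars), names(full_coef)))
se_ratio <- coef_diff

for (i in seq_len(n)) {
  # Refit the final model without observation i.
  fit_i <- lm(final_formula, data = mtcars[-i, ])
  # Leave-one-out coefficient minus full-model coefficient.
  coef_diff[i, ] <- coef(fit_i) - full_coef
  # Leave-one-out standard error relative to the full-model standard error.
  se_ratio[i, ] <- summary(fit_i)$coefficients[, "Std. Error"] / full_se
}

print(round(head(coef_diff), 3))
print(round(head(se_ratio), 3))
```

Differences close to 0 and ratios close to 1 for every car indicate that no single observation drives the estimates.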

In conclusion, it has been shown that a large share of the variation in fuel efficiency can be explained by transmission, weight and 1/4 mile time. On average, manual transmission is associated with an increase of 2.935837 miles/gallon, holding weight and 1/4 mile time constant. However, since the sample size is small and the explanatory variables are highly correlated, the result should be treated with caution. The analysis is also limited in its theoretical discussion of the subject matter. A more complete analysis would incorporate that theoretical discussion and aim to increase the sample size.

Appendix

The code necessary to reproduce the analysis in the report is provided below.

Appendix_1

summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

Appendix_2

Different bar graphs of mpg by am faceted with different factor control variables in the data set

Different scatter plots of mpg with continuous control variables faceted with am
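
The plotting code for these figures is not included in the report. A rough base-R stand-in is sketched below; the layout, the helper object `cars`, and the use of colour instead of facets in the scatter plots are all assumptions made for illustration.

```r
data(mtcars)

# Label the transmission types for readable legends (illustrative helper object).
cars <- transform(mtcars,
                  am = factor(am, levels = c(0, 1),
                              labels = c("automatic", "manual")))

# Set 1: mean mpg by transmission within each level of a factor control variable.
factor_vars <- c("cyl", "vs", "gear")
old_par <- par(mfrow = c(1, 3))
for (v in factor_vars) {
  means <- tapply(cars$mpg, list(cars$am, cars[[v]]), mean)
  barplot(means, beside = TRUE, legend.text = rownames(means),
          xlab = v, ylab = "mean mpg")
}
par(old_par)

# Set 2: mpg against each continuous control variable, coloured by transmission.
cont_vars <- c("disp", "hp", "drat", "wt", "qsec")
old_par <- par(mfrow = c(2, 3))
for (v in cont_vars) {
  plot(cars[[v]], cars$mpg, col = as.integer(cars$am), pch = 19,
       xlab = v, ylab = "mpg")
}
par(old_par)
```

Combinations that do not occur in the data, such as automatic cars with five gears, produce no bar in the first set of plots.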

Appendix_3

## 
## Call:
## lm(formula = mpg ~ as.factor(am) + wt + qsec, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.6178     6.9596   1.382 0.177915    
## as.factor(am)1   2.9358     1.4109   2.081 0.046716 *  
## wt              -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec             1.2259     0.2887   4.247 0.000216 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11

Appendix_4

head(data.frame(dfbetas(step.model)))
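
The per-variable maxima quoted in the text could be extracted along these lines. The final model is refit directly here so that the block runs on its own. Whether the report took the maximum of the absolute or the signed values is not stated, so the use of `abs()` is an assumption, as are the names `max_dfbetas` and `most_influential`.

```r
data(mtcars)

# Final model from the report, refit directly under the report's object name.
step.model <- lm(mpg ~ as.factor(am) + wt + qsec, data = mtcars)

# Standardized change in each coefficient when a single car is dropped.
db <- dfbetas(step.model)

# Largest absolute dfbetas per explanatory variable (intercept column removed).
slopes <- abs(db[, -1, drop = FALSE])
max_dfbetas <- apply(slopes, 2, max)

# Car responsible for each maximum.
most_influential <- rownames(slopes)[apply(slopes, 2, which.max)]

print(data.frame(max_abs_dfbetas = max_dfbetas, car = most_influential))
```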

Appendix_5
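
The code for the diagnostic plots is not included in the report. A minimal sketch using the default `plot()` method for `lm` objects is given below; the 2 x 2 layout and the object names `fit` and `largest_resid` are assumptions.

```r
data(mtcars)
fit <- lm(mpg ~ as.factor(am) + wt + qsec, data = mtcars)

# Residuals vs fitted, normal Q-Q, scale-location and residuals vs leverage.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# Cars with the largest absolute standardized residuals, for a closer look.
largest_resid <- head(sort(abs(rstandard(fit)), decreasing = TRUE), 3)
print(largest_resid)
```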