Executive Summary

A regression model was developed to investigate whether the type of transmission (automatic or manual) affects the fuel consumption of 32 automobiles from 1973-74 and, if so, by how much.

The model estimates that a manual transmission improves fuel economy by 2.9 miles per (US) gallon relative to an automatic one, for cars with the same engine power, weight and aerodynamics.

Exploratory analysis

This report develops a regression model with data extracted from the 1974 Motor Trend US magazine to answer the following question:

Does the type of transmission (automatic or manual) affect fuel consumption? If so, by how much?

The data contains 32 observations on 11 (numeric) variables, which are:

Therefore, the key variables for the model are:

As the dataset is small, all its variables and the pairwise correlations among them can be shown in a single plot. The plot distinguishes the type of transmission in every panel.
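A plot of this kind could be produced with a sketch along the following lines. It assumes that the report's data is R's built-in `mtcars` dataset (the source only names the magazine); the colours and the name `transmission` are illustrative choices.

```r
# Assumption: the report's data is R's built-in mtcars dataset
# (1974 Motor Trend, 32 cars, 11 numeric variables).
data(mtcars)

# In mtcars, am = 0 is automatic and am = 1 is manual.
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))

# Scatterplot matrix of all variables, coloured by transmission type.
# The colours are an arbitrary illustrative choice.
palette_am <- c(automatic = "steelblue", manual = "tomato")
pairs(mtcars,
      col = unname(palette_am[as.character(transmission)]),
      pch = 19, cex = 0.6,
      main = "Pairwise relationships by transmission type")

# Correlation of every variable with mpg, strongest first
# (mpg itself comes first with a correlation of 1).
cor_mpg <- cor(mtcars)["mpg", ]
cor_mpg <- cor_mpg[order(-abs(cor_mpg))]
print(round(cor_mpg, 2))
```

The numeric correlations are printed as a complement to the plot, since the panels alone make it hard to rank the variables by strength of association with mpg.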

Conclusions of the exploratory analysis:

Model selection

Fundamental analysis of the problem

The main contributors to an automobile's fuel consumption were identified from a Wikipedia article on fuel economy in automobiles, and the variables of the dataset were allocated to them as follows:

Fuel consumption contributor    Dataset variables
Engine                          cyl, disp, hp, vs, carb
Drivetrain                      am, gear, drat
Rolling                         wt
Aerodynamic                     qsec
Accessories                     none
Braking                         none
Standby                         none

Some remarks on these associations of the dataset variables:

  • Some variables can be easily associated with a contributor, while others cannot. The latter would be the confounding variables; each has been assigned to the contributor to which it is assumed to be most closely related.
  • No variable can be directly associated with some contributors (these contributors are the known unknowns of the model).
  • This fundamental analysis does not intend to imply that all variables contribute directly to mpg.
  • The model should include at least one variable from each contributor, as these variables are assumed to explain the fuel consumption from different perspectives.
  • Therefore, the main variables considered for the model, one from each contributor, are: weight (wt), engine power (hp) and acceleration rate/aerodynamics as captured by qsec.

Model selection methods: Variance Inflation Factors and Nested models.

To identify all the variables of the model, the first approach is to investigate the Variance Inflation Factors (VIF) of a model with all the variables.

##           GVIF       Df GVIF^(1/(2*Df))
## cyl  11.319053 1.414214        1.834225
## disp  7.769536 1.000000        2.787389
## hp    5.312210 1.000000        2.304823
## drat  2.609533 1.000000        1.615405
## wt    4.881683 1.000000        2.209453
## qsec  3.284842 1.000000        1.812413
## vs    2.843970 1.000000        1.686407
## am    3.151269 1.000000        1.775181
## gear  7.131081 1.414214        1.634138
## carb 22.432384 2.236068        1.364858

The table above reports generalized variance inflation factors (GVIF) rather than ordinary VIFs. This is because some of the variables are categorical and enter the fitted Linear Model (LM) with more than one degree of freedom. The conclusions drawn from the table are:

  • The following variables are dropped from the model because they considerably inflate the standard errors of its coefficients: disp, gear, carb and cyl.
  • Their variance inflation factors (GVIF column) are larger than those of the key variables identified in the fundamental analysis: wt, hp and qsec.
  • Note that wt and hp also have high inflation factors. This is believed to be due to their confounding nature, as these variables do not belong exclusively to the fuel consumption contributor with which they were associated in the fundamental analysis.
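The VIF table could be reproduced with a sketch like the one below. It rests on several assumptions not stated in the report: that the data frame `df` is `mtcars` with cyl, vs, am, gear and carb converted to factors, and that the square root was applied to the whole table (the non-integer Df values 1.414214 and 2.236068 equal the square roots of 2 and 5). The helper `gvif` is a hypothetical base R implementation of the generalized VIF of Fox and Monette, written so that no extra package is needed.

```r
# Assumption: df is mtcars with the categorical columns as factors.
df <- within(mtcars, {
  cyl  <- factor(cyl)
  vs   <- factor(vs)
  am   <- factor(am)
  gear <- factor(gear)
  carb <- factor(carb)
})

# Model with all the variables.
fit_all <- lm(mpg ~ ., data = df)

# Hypothetical helper: generalized VIF per model term, computed from the
# correlation matrix of the coefficient estimates (intercept excluded).
gvif <- function(fit) {
  term_of_coef <- attr(model.matrix(fit), "assign")[-1]
  R <- cov2cor(vcov(fit)[-1, -1])
  term_labels <- attr(terms(fit), "term.labels")
  out <- t(sapply(seq_along(term_labels), function(k) {
    idx <- which(term_of_coef == k)
    g <- det(R[idx, idx, drop = FALSE]) *
      det(R[-idx, -idx, drop = FALSE]) / det(R)
    c(GVIF = g, Df = length(idx))
  }))
  rownames(out) <- term_labels
  cbind(out, "GVIF^(1/(2*Df))" = out[, "GVIF"]^(1 / (2 * out[, "Df"])))
}

g <- gvif(fit_all)

# Assumption: the report printed the square root of the whole table.
print(sqrt(g))
```

Under these assumptions, the values that are comparable across terms with different degrees of freedom are those in the last column of `g`, before taking the square root of the table.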
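
The second approach compares nested models, adding one variable at a time, by means of an analysis of variance:

## Analysis of Variance Table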
## 
## Model 1: mpg ~ am + wt
## Model 2: mpg ~ am + wt + hp
## Model 3: mpg ~ am + wt + hp + qsec
## Model 4: mpg ~ am + wt + hp + qsec + drat
## Model 5: mpg ~ am + wt + hp + qsec + drat + vs
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     29 278.32                                   
## 2     28 180.29  1    98.029 15.4562 0.0005907 ***
## 3     27 160.07  1    20.225  3.1888 0.0862810 .  
## 4     26 158.64  1     1.428  0.2251 0.6392777    
## 5     25 158.56  1     0.080  0.0126 0.9115341    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
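The nested-model comparison above can be sketched as follows, assuming that `df` is R's built-in `mtcars` dataset with am converted to a factor. The object names `m1` to `m5` and `nested` are illustrative.

```r
# Assumption: df is mtcars with am as a factor (0 = automatic, 1 = manual).
df <- transform(mtcars, am = factor(am))

# Each model adds one variable to the previous one.
m1 <- lm(mpg ~ am + wt, data = df)
m2 <- update(m1, . ~ . + hp)
m3 <- update(m2, . ~ . + qsec)
m4 <- update(m3, . ~ . + drat)
m5 <- update(m4, . ~ . + vs)

# Sequential F tests: each row tests the variable added in that model.
nested <- anova(m1, m2, m3, m4, m5)
print(nested)
```

Each row of the table tests whether the variable added at that step reduces the residual sum of squares by more than would be expected by chance.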

The main conclusions are:

  • The best model to answer the question of the report is Model 3.
  • The addition of qsec is not statistically significant at the 5% level (p = 0.086), but it is significant at the 10% level and it is consistent with the conclusion of the fundamental analysis.
  • This choice is also supported by the adjusted R2 statistic, which is highest for Model 3, as shown below:
##          models       adj
## 1        Model1 0.7357889
## 2        Model2 0.8227357
## 3        Model3 0.8367919
## 4        Model4 0.8320265
## 5        Model5 0.8253956
## 6 All variables 0.7790215
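The adjusted R2 comparison of the five nested models could be computed with a sketch like this one, again assuming that `df` is `mtcars` with am as a factor. The "All variables" row is left out because it depends on how the remaining categorical columns were coded, which the report does not show.

```r
# Assumption: df is mtcars with am as a factor.
df <- transform(mtcars, am = factor(am))

# The five nested models of the analysis of variance table.
formulas <- list(
  Model1 = mpg ~ am + wt,
  Model2 = mpg ~ am + wt + hp,
  Model3 = mpg ~ am + wt + hp + qsec,
  Model4 = mpg ~ am + wt + hp + qsec + drat,
  Model5 = mpg ~ am + wt + hp + qsec + drat + vs
)

# Adjusted R-squared of each fitted model.
adj <- sapply(formulas, function(f) summary(lm(f, data = df))$adj.r.squared)
print(data.frame(models = names(adj), adj = unname(adj)))

# Model with the highest adjusted R-squared.
best <- names(which.max(adj))
print(best)
```

Unlike the plain R2, the adjusted statistic penalizes each added variable, which is why it starts to decrease once drat and vs are included.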

Analysis

Finally, the coefficients of the regression model are:

## 
## Call:
## lm(formula = mpg ~ am + wt + hp + qsec, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4975 -1.5902 -0.1122  1.1795  4.5404 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 17.44019    9.31887   1.871  0.07215 . 
## am1          2.92550    1.39715   2.094  0.04579 * 
## wt          -3.23810    0.88990  -3.639  0.00114 **
## hp          -0.01765    0.01415  -1.247  0.22309   
## qsec         0.81060    0.43887   1.847  0.07573 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.435 on 27 degrees of freedom
## Multiple R-squared:  0.8579, Adjusted R-squared:  0.8368 
## F-statistic: 40.74 on 4 and 27 DF,  p-value: 4.589e-11
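Written as an equation, the coefficient table reads as follows. The coefficients are rounded from the output above; the 95% confidence interval for the transmission effect is a worked example that is not part of the original output, and it uses the approximate quantile of the t distribution with 27 degrees of freedom, 2.052.

```latex
\[
\widehat{\mathrm{mpg}} = 17.44 + 2.93\,\mathrm{am} - 3.24\,\mathrm{wt}
                         - 0.018\,\mathrm{hp} + 0.81\,\mathrm{qsec}
\]

\[
2.9255 \pm t_{0.975,\,27} \times 1.39715
  \approx 2.9255 \pm 2.052 \times 1.39715
  \approx 2.93 \pm 2.87
  \quad\Rightarrow\quad [0.06,\ 5.79]
\]
```

Here am equals 1 for a manual transmission and 0 for an automatic one. The interval excludes zero only narrowly, which is consistent with the p-value of 0.046 reported for am1.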

The residual plots of the model are:

The residuals do not reveal any observation that requires further investigation.
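The standard diagnostic plots for this model can be drawn with a sketch such as the following, assuming that `df` is `mtcars` with am as a factor. Listing the three largest absolute residuals is an illustrative addition, not part of the original analysis.

```r
# Assumption: df is mtcars with am as a factor.
df <- transform(mtcars, am = factor(am))

# Selected model (Model 3).
fit <- lm(mpg ~ am + wt + hp + qsec, data = df)

# Four default diagnostic plots of an lm fit: residuals vs fitted,
# normal Q-Q, scale-location and residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Cars with the largest absolute residuals, as a numeric complement.
res <- residuals(fit)
print(round(head(sort(abs(res), decreasing = TRUE), 3), 2))
```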

Conclusion

The following conclusions may be drawn from the previous regression model: