Regression Models Course Project

Executive Summary

A regression model was developed to investigate whether the type of transmission (automatic or manual) affects the fuel consumption of 32 automobiles from 1973-74 and, if so, by how much.

For cars with the same engine power, weight and aerodynamics, the regression model predicts that a manual transmission improves fuel economy by 2.9 miles per (US) gallon over an automatic one.

Exploratory analysis

This report develops a regression model with the data extracted from the 1974 Motor Trend US magazine to answer the following question:

Does the type of transmission (automatic or manual) affect fuel consumption and, if so, by how much?

The data contains 32 observations on 11 (numeric) variables, which are:

mpg Miles/(US) gallon
cyl Number of cylinders
disp Displacement (cu.in.)
hp Gross horsepower
drat Rear axle ratio
wt Weight (1000 lbs)
qsec 1/4 mile time
vs Engine (0 = V-shaped, 1 = straight)
am Transmission (0 = automatic, 1 = manual)
gear Number of forward gears
carb Number of carburetors

Therefore, the key variables for the model are:

Miles per gallon, mpg, the outcome of the model.
The transmission type, am, the variable to be studied.
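A quick, unadjusted look at these two key variables can be sketched as follows. The sketch assumes the data are R's built-in `mtcars` dataset, which matches the description above; the name `group_means` is introduced here for illustration only.

```r
# mtcars ships with base R (package "datasets")
data(mtcars)

# Size of the dataset: 32 observations, 11 variables
dim(mtcars)

# Unadjusted mean mpg per transmission type (am: 0 = automatic, 1 = manual)
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
names(group_means) <- c("automatic", "manual")
print(group_means)

# Raw difference manual - automatic, before adjusting for any other variable
print(diff(group_means))
```

This raw difference ignores every other variable, which is why the rest of the report adjusts for weight, power and qsec before quoting an effect.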

As the dataset is small, all of its variables and the correlations among them can be shown in a single plot. The plot distinguishes the type of transmission in every panel.

Conclusions of the exploratory analysis:

The following variables are categorical: gear, carb, am, vs, cyl.
The last row of the plot shows, graphically, the relationship between each variable and mpg.
The last column of the plot shows the correlation of each variable with the outcome variable. All variables appear to be correlated with mpg, and in some cases the correlation increases when the data are split by am.
Therefore, from the exploratory analysis, all the variables of the dataset should be candidates for the regression model.
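The original plot is not reproduced here, and the plotting function used for it is not known. As a rough stand-in, the sketch below uses base R to list the correlations with mpg, draw a scatterplot matrix coloured by transmission, and convert the categorical variables to factors. It assumes the built-in `mtcars` data; the name `df` follows the `data = df` argument that appears in the model output later in the report, while `mpg_cor` and `factor_cols` are illustrative names.

```r
data(mtcars)

# Pearson correlation of each variable with mpg, using the numeric coding
mpg_cor <- sort(cor(mtcars)[, "mpg"])
print(round(mpg_cor, 2))

# Scatterplot matrix with points coloured by transmission type
# (blue = automatic, red = manual; the colours are an arbitrary choice)
pairs(mtcars, col = ifelse(mtcars$am == 1, "red", "blue"), pch = 19)

# Treat the categorical variables as factors for the modelling steps
df <- mtcars
factor_cols <- c("gear", "carb", "am", "vs", "cyl")
df[factor_cols] <- lapply(df[factor_cols], factor)
str(df)
```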

Model selection

Fundamental analysis of the problem

The main contributors to the fuel consumption of an automobile have been identified from the Wikipedia article on fuel economy in automobiles, and the variables of the dataset have been allocated to them as follows:

Fuel consumption contributor	dataset variable
Engine	cyl, disp, hp, vs, carb
Drivetrain	am, gear, drat
Rolling	wt
Aerodynamic	qsec
Accessories	none
Braking	none
Standby	none

Some remarks to these associations of the dataset variables:

Some variables can be easily associated with a contributor, while others cannot. The latter would be the confounding variables, and they have been assigned to the contributor to which they are assumed to be most closely related.
No variable can be directly associated with some of the contributors (these contributors constitute the known unknowns of the model).
This fundamental analysis does not intend to imply that all variables contribute directly to mpg.
The model should account for at least one variable from each contributor, as these variables are assumed to explain the fuel consumption from different perspectives.
Therefore, the main variables considered for the model, one from each contributor, are weight (wt), engine power (hp) and acceleration/aerodynamics as captured by qsec.

Model selection methods: Variance Inflation Factors and Nested models.

To identify all the variables of the model, the first approach is to examine the Variance Inflation Factors (VIF) of a model fitted with all the variables.

##           GVIF       Df GVIF^(1/(2*Df))
## cyl  11.319053 1.414214        1.834225
## disp  7.769536 1.000000        2.787389
## hp    5.312210 1.000000        2.304823
## drat  2.609533 1.000000        1.615405
## wt    4.881683 1.000000        2.209453
## qsec  3.284842 1.000000        1.812413
## vs    2.843970 1.000000        1.686407
## am    3.151269 1.000000        1.775181
## gear  7.131081 1.414214        1.634138
## carb 22.432384 2.236068        1.364858

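The GVIF column in the above table is the generalized variance inflation factor, which is reported instead of the ordinary VIF when some terms have more than one degree of freedom, as the categorical variables cyl, gear and carb do. It does not mean that a Generalized Linear Model (GLM) was fitted: the model is an ordinary Linear Model (LM). The Df column shows 1.414214 and 2.236068, that is √2 and √5, which suggests that the table was printed after taking the square root of every entry; this would not change the ranking of the variables. The conclusions from the VIF table are: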

The following variables are dropped from the model because they considerably inflate the standard errors of its coefficients: disp, gear, carb and cyl.
This is because their variance inflation factors are larger than those of the key variables identified in the fundamental analysis, namely wt, hp and qsec.
Note that wt and hp also have high inflation factors. This is believed to be due to their confounding nature, as these variables do not belong exclusively to the fuel consumption contributor to which they were assigned in the fundamental analysis.
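For reference, the generalized VIF of each model term can be computed in base R directly from its definition, GVIF = det(R11) · det(R22) / det(R), where R is the correlation matrix of the coefficient estimates without the intercept. The sketch below is a minimal illustration under the assumption that the full model is `mpg ~ .` on the built-in `mtcars` data with the categorical variables as factors; the helper `gvif` is a hypothetical name and is not the function used in the report.

```r
data(mtcars)
df <- mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
df[factor_cols] <- lapply(df[factor_cols], factor)

# Model with all the variables
fit_all <- lm(mpg ~ ., data = df)

# Generalized VIF per model term (hypothetical helper)
gvif <- function(fit) {
  # Term index of every coefficient column, intercept dropped
  term_of_col <- attr(model.matrix(fit), "assign")[-1]
  # Correlation matrix of the coefficient estimates, intercept dropped
  R <- cov2cor(vcov(fit)[-1, -1])
  term_names <- attr(terms(fit), "term.labels")
  out <- t(sapply(seq_along(term_names), function(k) {
    idx <- which(term_of_col == k)
    g <- det(R[idx, idx, drop = FALSE]) *
      det(R[-idx, -idx, drop = FALSE]) / det(R)
    c(GVIF = g, Df = length(idx), "GVIF^(1/(2*Df))" = g^(1 / (2 * length(idx))))
  }))
  rownames(out) <- term_names
  out
}

g <- gvif(fit_all)
print(g)

# Square roots, i.e. the inflation of the standard errors
print(sqrt(g[, "GVIF"]))
```

For a term with one degree of freedom the GVIF reduces to the ordinary VIF, 1 / (1 - R²), where R² comes from regressing that variable on all the other predictors.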

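The remaining candidates are then compared as a sequence of nested models, adding one variable at a time:

## Analysis of Variance Table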
## 
## Model 1: mpg ~ am + wt
## Model 2: mpg ~ am + wt + hp
## Model 3: mpg ~ am + wt + hp + qsec
## Model 4: mpg ~ am + wt + hp + qsec + drat
## Model 5: mpg ~ am + wt + hp + qsec + drat + vs
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     29 278.32                                   
## 2     28 180.29  1    98.029 15.4562 0.0005907 ***
## 3     27 160.07  1    20.225  3.1888 0.0862810 .  
## 4     26 158.64  1     1.428  0.2251 0.6392777    
## 5     25 158.56  1     0.080  0.0126 0.9115341    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
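As a reading aid, the F values in the table can be recovered by hand. When several nested linear models are compared in one call, R's `anova` divides each added sum of squares by the residual mean square of the largest model (Model 5 here). The symbols below are introduced only for this check:

```latex
\[
F_k = \frac{(\mathrm{RSS}_{k-1} - \mathrm{RSS}_k)\,/\,(\mathrm{df}_{k-1} - \mathrm{df}_k)}
           {\mathrm{RSS}_5 / \mathrm{df}_5},
\qquad
\frac{\mathrm{RSS}_5}{\mathrm{df}_5} = \frac{158.56}{25} \approx 6.342
\]
\[
F_2 = \frac{98.029 / 1}{6.342} \approx 15.46,
\qquad
F_3 = \frac{20.225 / 1}{6.342} \approx 3.19
\]
```

These agree with the F column above, and each p-value is the upper tail of an F distribution with 1 and 25 degrees of freedom.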

The main conclusions are:

The best model to answer the question of the report is Model 3 (mpg ~ am + wt + hp + qsec).
The addition of qsec is not statistically significant at the 5% level (p = 0.086); however, it is close to that limit and it is consistent with the conclusion of the fundamental analysis.
Additionally, this choice is supported by the adjusted R² statistic, which is highest for Model 3, as shown below:

##          models       adj
## 1        Model1 0.7357889
## 2        Model2 0.8227357
## 3        Model3 0.8367919
## 4        Model4 0.8320265
## 5        Model5 0.8253956
## 6 All variables 0.7790215
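The nested comparison and the adjusted R² values can be reproduced along the following lines. This is a sketch that assumes the built-in `mtcars` data with the categorical variables converted to factors; the object names (`m1` to `m5`, `m_all`, `nested`, `adj`) are illustrative.

```r
data(mtcars)
df <- mtcars
factor_cols <- c("cyl", "vs", "am", "gear", "carb")
df[factor_cols] <- lapply(df[factor_cols], factor)

# Nested models, each adding one variable to the previous one
m1 <- lm(mpg ~ am + wt, data = df)
m2 <- update(m1, . ~ . + hp)
m3 <- update(m2, . ~ . + qsec)
m4 <- update(m3, . ~ . + drat)
m5 <- update(m4, . ~ . + vs)

# Model with every variable, for comparison
m_all <- lm(mpg ~ ., data = df)

# Sequential F tests between consecutive models
nested <- anova(m1, m2, m3, m4, m5)
print(nested)

# Adjusted R-squared of every candidate
models <- list(Model1 = m1, Model2 = m2, Model3 = m3,
               Model4 = m4, Model5 = m5, "All variables" = m_all)
adj <- sapply(models, function(m) summary(m)$adj.r.squared)
print(round(adj, 4))
```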

Analysis

Finally, the coefficients of the regression model are:

## 
## Call:
## lm(formula = mpg ~ am + wt + hp + qsec, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4975 -1.5902 -0.1122  1.1795  4.5404 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 17.44019    9.31887   1.871  0.07215 . 
## am1          2.92550    1.39715   2.094  0.04579 * 
## wt          -3.23810    0.88990  -3.639  0.00114 **
## hp          -0.01765    0.01415  -1.247  0.22309   
## qsec         0.81060    0.43887   1.847  0.07573 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.435 on 27 degrees of freedom
## Multiple R-squared:  0.8579, Adjusted R-squared:  0.8368 
## F-statistic: 40.74 on 4 and 27 DF,  p-value: 4.589e-11
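Written out with rounded coefficients, the fitted model and an approximate 95% confidence interval for the transmission effect are shown below. The interval is not part of the original output; it assumes the usual t-based interval with 27 residual degrees of freedom, for which the 0.975 quantile is about 2.052.

```latex
\[
\widehat{\mathrm{mpg}} = 17.440 + 2.926\,\mathrm{am} - 3.238\,\mathrm{wt}
                        - 0.018\,\mathrm{hp} + 0.811\,\mathrm{qsec},
\qquad \mathrm{am} \in \{0, 1\}
\]
\[
\hat\beta_{\mathrm{am}} \pm t_{0.975,\,27}\,\mathrm{SE}(\hat\beta_{\mathrm{am}})
= 2.9255 \pm 2.052 \times 1.39715
\approx 2.93 \pm 2.87
\;\Rightarrow\; (0.06,\ 5.79)
\]
```

The interval excludes zero only narrowly, which is consistent with the p-value of 0.046 for am1.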

The residual plots are:

The residuals do not indicate any specific problem with any of the observations that would require further investigation.
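The standard diagnostic plots for this model can be produced with base R as sketched below, again assuming the built-in `mtcars` data with am as a factor. The leverage and Cook's distance summaries are an illustrative addition, not part of the original analysis.

```r
data(mtcars)
df <- mtcars
df$am <- factor(df$am)

# Selected model (Model 3)
fit <- lm(mpg ~ am + wt + hp + qsec, data = df)

# Four standard diagnostic plots on one page: residuals vs fitted,
# normal Q-Q, scale-location, residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Numeric companions to the plots: the observations with the
# largest leverage and the largest Cook's distance
lev <- hatvalues(fit)
cook <- cooks.distance(fit)
print(head(sort(lev, decreasing = TRUE), 3))
print(head(sort(cook, decreasing = TRUE), 3))
```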

Conclusion

The following conclusions may be drawn from the previous regression model:

The difference in fuel consumption between automatic and manual transmissions is statistically significant at the 5% level (p = 0.046) for the car models evaluated.
An increase of 2.9 mpg is expected for manual cars over automatic ones, given the same weight, engine power and aerodynamics.
This result is aligned with a 2011 SAE article: “Manual transmission can be up to 94% efficient whereas older automatic transmissions may be as low as 70% efficient.”
However, this regression model may overestimate the effect of the transmission type on fuel consumption. This is based on the expected energy loss of the drivetrain given in the Wikipedia article on fuel economy in automobiles.
The conclusion can be considered valid for the dataset and may be extended to cars of the early 1970s.
This conclusion will definitely not hold for newer automatic transmissions.