Executive summary

The goal of this report is a study of the effect of transmission (automatic or manual) on car fuel consumptionn. The mtcars dataset is used to answer questions like:

Conclusion: From the best-fit model, one can conclude that:

The analyzed sample contained only 32 observations, which may have additional impact on statistical inference.

Data Preparation and Exploratory Analysis and Regression Models

The mtcars is a data frame with 32 observations gathered for 11 variables:
mpg - miles/(US) gallon,
cyl - number of cylinders,
disp - displacement (cu.in.),
hp - gross horsepower,
drat - rear axle ratio, wt - weight (1000 lbs),
qsec - 1/4 mile time,
vs - V/S,
am - transmission (0 = automatic, 1 = manual),
gear - number of forward gears,
carb - number of carburetors.

Exploratory panel plot for these variables is presented in Appendix.
Some variables were converted to factors for further analysis.

mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)
mtcars$am <- factor(mtcars$am,labels=c('_automatic','_manual'))
mtcars$carb <- factor(mtcars$carb)
mtcars$vs <- factor(mtcars$vs)

T-test for difference in MPG in automatic and manual transmission cars.

From this test we clearly see that there is a significant (98% CL) difference in fuel consumption between automatic and manual cars, with better mileage for manual cars.

t.test(mpg ~ am, data=mtcars)
## 
##  Welch Two Sample t-test
## 
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group _automatic    mean in group _manual 
##                 17.14737                 24.39231

We can compare models with different variable input. From the exploratory panel plot it looks like there are correlated variables.

Coefficients in all-variable model have p-values far from significant. The adjusted (for the numbers of predictors) R\(^{2}\) = 0.78 implying the model has fairly good explanatory power of the regression model.

On the other hand, the coefficients for transmission-only model are significantly different, but its R\(^{2}\) = 0.34 is fairly low.

fit_allvars <- lm(mpg ~ ., data = mtcars)
fit_am <- lm(mpg ~ am, data = mtcars)
summary(fit_allvars)$coef
##                Estimate  Std. Error     t value   Pr(>|t|)
## (Intercept) 23.87913244 20.06582026  1.19004018 0.25252548
## cyl6        -2.64869528  3.04089041 -0.87102622 0.39746642
## cyl8        -0.33616298  7.15953951 -0.04695316 0.96317000
## disp         0.03554632  0.03189920  1.11433290 0.28267339
## hp          -0.07050683  0.03942556 -1.78835344 0.09393155
## drat         1.18283018  2.48348458  0.47627845 0.64073922
## wt          -4.52977584  2.53874584 -1.78425732 0.09461859
## qsec         0.36784482  0.93539569  0.39325050 0.69966720
## vs1          1.93085054  2.87125777  0.67247551 0.51150791
## am_manual    1.21211570  3.21354514  0.37718957 0.71131573
## gear4        1.11435494  3.79951726  0.29328856 0.77332027
## gear5        2.52839599  3.73635801  0.67670068 0.50889747
## carb2       -0.97935432  2.31797446 -0.42250436 0.67865093
## carb3        2.99963875  4.29354611  0.69863900 0.49546781
## carb4        1.09142288  4.44961992  0.24528452 0.80956031
## carb6        4.47756921  6.38406242  0.70136677 0.49381268
## carb8        7.25041126  8.36056638  0.86721532 0.39948495
summary(fit_am)$coeff
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am_manual    7.244939   1.764422  4.106127 2.850207e-04

To find the best model to fit the data (with best variable selection), a step function is used. For details on this function, type ?step in R.

The step function calculates Akaike’s ‘An Information Criterion’ (AIC) for one or several fitted model objects for which a log-likelihood value can be obtained, according to the formula \(-2*log-likelihood + k*npar\), where npar represents the number of parameters in the fitted model, and k=2 for the usual AIC, or *k=log(n)$ (n being the number of observations) for the so-called BIC or SBC (Schwarz’s Bayesian criterion).

When comparing models fitted by maximum likelihood to the same data, the smaller the AIC or BIC, the better the fit.

The theory of AIC requires that the log-likelihood has been maximized: whereas AIC can be computed for models not fitted by maximum likelihood, their AIC values should not be compared.

(Above information taken from https://stat.ethz.ch/R-manual/R-devel/library/stats/html/AIC.html)

#trace=FALSE supresses elaborate output
aic <- step(fit_allvars) ## model with the best fit
## Start:  AIC=76.4
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  5   13.5989 134.00 69.828
## - gear  2    3.9729 124.38 73.442
## - am    1    1.1420 121.55 74.705
## - qsec  1    1.2413 121.64 74.732
## - drat  1    1.8208 122.22 74.884
## - cyl   2   10.9314 131.33 75.184
## - vs    1    3.6299 124.03 75.354
## <none>              120.40 76.403
## - disp  1    9.9672 130.37 76.948
## - wt    1   25.5541 145.96 80.562
## - hp    1   25.6715 146.07 80.588
## 
## Step:  AIC=69.83
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  2    5.0215 139.02 67.005
## - disp  1    0.9934 135.00 68.064
## - drat  1    1.1854 135.19 68.110
## - vs    1    3.6763 137.68 68.694
## - cyl   2   12.5642 146.57 68.696
## - qsec  1    5.2634 139.26 69.061
## <none>              134.00 69.828
## - am    1   11.9255 145.93 70.556
## - wt    1   19.7963 153.80 72.237
## - hp    1   22.7935 156.79 72.855
## 
## Step:  AIC=67
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1    0.9672 139.99 65.227
## - cyl   2   10.4247 149.45 65.319
## - disp  1    1.5483 140.57 65.359
## - vs    1    2.1829 141.21 65.503
## - qsec  1    3.6324 142.66 65.830
## <none>              139.02 67.005
## - am    1   16.5665 155.59 68.608
## - hp    1   18.1768 157.20 68.937
## - wt    1   31.1896 170.21 71.482
## 
## Step:  AIC=65.23
## mpg ~ cyl + disp + hp + wt + qsec + vs + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1    1.2474 141.24 63.511
## - vs    1    2.3403 142.33 63.757
## - cyl   2   12.3267 152.32 63.927
## - qsec  1    3.1000 143.09 63.928
## <none>              139.99 65.227
## - hp    1   17.7382 157.73 67.044
## - am    1   19.4660 159.46 67.393
## - wt    1   30.7151 170.71 69.574
## 
## Step:  AIC=63.51
## mpg ~ cyl + hp + wt + qsec + vs + am
## 
##        Df Sum of Sq    RSS    AIC
## - qsec  1     2.442 143.68 62.059
## - vs    1     2.744 143.98 62.126
## - cyl   2    18.580 159.82 63.466
## <none>              141.24 63.511
## - hp    1    18.184 159.42 65.386
## - am    1    18.885 160.12 65.527
## - wt    1    39.645 180.88 69.428
## 
## Step:  AIC=62.06
## mpg ~ cyl + hp + wt + vs + am
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1     7.346 151.03 61.655
## <none>              143.68 62.059
## - cyl   2    25.284 168.96 63.246
## - am    1    16.443 160.12 63.527
## - hp    1    36.344 180.02 67.275
## - wt    1    41.088 184.77 68.108
## 
## Step:  AIC=61.65
## mpg ~ cyl + hp + wt + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              151.03 61.655
## - am    1     9.752 160.78 61.657
## - cyl   2    29.265 180.29 63.323
## - hp    1    31.943 182.97 65.794
## - wt    1    46.173 197.20 68.191
summary(aic)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9387 -1.2560 -0.4013  1.1253  5.0513 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.70832    2.60489  12.940 7.73e-13 ***
## cyl6        -3.03134    1.40728  -2.154  0.04068 *  
## cyl8        -2.16368    2.28425  -0.947  0.35225    
## hp          -0.03211    0.01369  -2.345  0.02693 *  
## wt          -2.49683    0.88559  -2.819  0.00908 ** 
## am_manual    1.80921    1.39630   1.296  0.20646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8401 
## F-statistic: 33.57 on 5 and 26 DF,  p-value: 1.506e-10

The step function returns formula of mpg ~ cyl + hp + wt + am. Let’s call it best-fit model.
The best-fit model coefficients p-values suggests significance of the model with cylinders (cyl), horse power (hp) and weight (wt) significant at 95% level. However, the transmission (am) is not significant. The R\(^{2}\) = 0.84 (adjusted value) shows that the model explains well the variance.

By comparing the best-fit and the transmission-only model one can see that the added variables contribute significantly to the model.

anova(fit_am,aic)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     26 151.03  4    569.87 24.527 1.688e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Appendix

par(mfrow = c(2, 2))
plot(aic)

boxplot(mpg ~ am, data=mtcars, col=(c("red","steelblue")), 
        xlab="Transmission", ylab="MPG", main="MPG vs. Transmission")

boxplot(mpg ~ cyl, data=mtcars, outpch = 19, col=(c("steelblue", "green", "yellow")), 
        ylab="MPG", xlab="number of cylinders", main="Mileage By Cylinder")