Is it worth buying an automatic car?

1. Introduction

In this report, we studied data on a set of 32 cars from 1973-1974 and analyzed some characteristics that could favour the choice of a car with a manual gear over the same one with an automatic gear. The data is the mtcars dataset distributed with R (see ?mtcars for its documentation). We found a statistically significant difference (at the 99% level) between these two types of cars in gas consumption. Besides gear type, additional factors such as weight and acceleration force also influence automobile efficiency (miles per gallon). Finally, we built a regression model on the dataset with efficiency as the outcome, so we could quantify the contribution of each factor to car efficiency.

2. Exploratory Analysis

A first exploratory analysis evidenced a superior efficiency of automatic cars over manual ones (chapter 2 appendix figure), and this difference is also evident in the boxplot comparison. Manual cars had a mean efficiency of 17.15 \( \pm \) 3.83 miles per gallon, while automatic models had a mean of 24.39 \( \pm \) 6.17. Assuming that the data comes from a normal distribution, the standardized sample means follow a Student's t distribution. To confirm the difference in efficiency between the two car types, we performed a two-sample t-test of the null hypothesis that both groups have the same mean efficiency. The test rejected the null hypothesis with 99% confidence (p-value < 0.01).
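As the appendix output shows, this is R's default Welch two-sample t-test, whose statistic does not assume equal variances in the two groups:

\[ t = \frac{\bar{x}_{man} - \bar{x}_{aut}}{\sqrt{s^2_{man}/n_{man} + s^2_{aut}/n_{aut}}} \]

Plugging in the sample means, standard deviations and group sizes above (19 manual and 13 automatic cars) reproduces the \( t = -3.767 \) reported in the appendix output.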

3. Regression Model

3.1 Model Preview

Analyzing the gear type factor alone, we could conclude that an automatic car is better than a manual one. However, more factors could influence car efficiency and act as confounding variables in the results obtained. To get a preview of which variables could have more influence on the outcome of our regression model, we first built a model with all the variables included and selected the ones with a significant contribution, i.e., coefficients with large absolute t-values and, consequently, small p-values.

3.2. Model Selection

Looking at the all-variables model, we obtained a high \( R^2 \) statistic (\( R^2 = 0.9625 \)), meaning that this model explains 96.25% of the data variance [1], which could suggest keeping it. On the other hand, by selecting a model with many linearly dependent variables, we also increase the standard errors, hence the variances, of the coefficients, a phenomenon known as variance inflation [2]. To balance a good fit against a small variance inflation, we used a stepwise model selection over the variables that best explained the outcome in the all-variables model, i.e., weight, acceleration (time to reach ¼ mile) and number of carburetors, without including too many linearly dependent variables. As our main focus was the impact of gear type on car efficiency, we included the interactions of gear type with each of the previously selected variables as the candidates for this new model.
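As a minimal illustration of variance inflation (our addition, not part of the original report), variance inflation factors can be computed with vif() from the car package on a plain additive model:

library(car)                    # provides vif(); illustrative addition, not in the original analysis
vif(lm(mpg ~ ., data = mtcars)) # one VIF per predictor; values well above 10 signal strong collinearity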
The selected model coefficients and their confidence intervals can be seen in the chapter 3.2 appendix figures. The variables selected were the interactions of gear type with weight and with acceleration, and all of them are statistically significant, as their confidence intervals do not include 0.
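For reference, each coefficient's \( 1-\alpha \) confidence interval has the usual form

\[ \hat{\beta}_j \pm t_{1-\alpha/2,\,27} \, \widehat{SE}(\hat{\beta}_j), \]

where 27 is the residual degrees of freedom of the selected model; an interval that excludes 0 is equivalent to rejecting \( H_0: \beta_j = 0 \) at level \( \alpha \).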

3.3. Model Validation

To validate the obtained regression model, we performed a residual analysis. If the proposed model is a good one, the residuals must be:
1. Normally distributed with zero mean and constant variance
2. Independent, with no autocorrelation.
Looking at the standardized residuals and their normal Q-Q plot (chapter 3.3 appendix figure), we have a visual clue that the residuals are normally distributed: there is no evidence of non-constant variance, and the points in the Q-Q plot all lie close to the theoretical normal line. Furthermore, to obtain a quantitative evaluation, we performed two statistical tests, the Shapiro-Wilk and the Kolmogorov-Smirnov tests, which test whether the residual distribution really is normal against the alternative hypothesis that it is not. Two tests were performed because of a long-standing discussion on what is a good indicator that data is normally distributed [3]. At best, these tests fail to reject the null hypothesis that the data is normal, which is not the same as accepting that hypothesis. As there is no consensus on how to determine data normality, we assumed that more passing statistical tests reinforce the idea that the data is normal. As can be seen in the chapter 3.3 appendix figure, both tests failed to reject the normality assumption. Finally, to test whether the residuals are autocorrelated, we used the Durbin-Watson test [4] from the lmtest package, which is specific for testing residual autocorrelation. As can be seen in the chapter 3.3 appendix figure, the test strongly failed to reject the null hypothesis that the residuals are independent (p-value = 0.935).
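For reference, the standardized residuals examined here (computed with rstandard() in the appendix) are the raw residuals scaled by their estimated standard deviations,

\[ r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}, \]

where \( e_i \) is the \( i \)-th residual, \( \hat{\sigma} \) the residual standard error and \( h_{ii} \) the \( i \)-th leverage value; under a correct model the \( r_i \) behave approximately like standard normal variables, which is exactly what the normality tests above assess.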

4. Model Interpretation

This was the final model obtained from the mtcars dataset:

summary(reduced.model)
## 
## Call:
## lm(formula = mpg ~ factor(am):wt + factor(am):qsec, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.936 -1.402 -0.155  1.269  3.886 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        13.969      5.776    2.42   0.0226 *  
## factor(am)0:wt     -3.176      0.636   -4.99  3.1e-05 ***
## factor(am)1:wt     -6.099      0.969   -6.30  9.7e-07 ***
## factor(am)0:qsec    0.834      0.260    3.20   0.0035 ** 
## factor(am)1:qsec    1.446      0.269    5.37  1.1e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.1 on 27 degrees of freedom
## Multiple R-squared:  0.895,  Adjusted R-squared:  0.879 
## F-statistic: 57.3 on 4 and 27 DF,  p-value: 8.42e-13

From the model above, we can see that the main variables are the interactions between gear type and weight and between gear type and time to reach ¼ mile (acceleration). While weight has a negative effect on gas efficiency, an increase in the time to reach ¼ mile improves efficiency.
Analyzing the weight while keeping the acceleration constant, each unit increase, which equals 1000 lbs (~453.6 kg), decreases the efficiency by 3.176 miles per gallon for manual cars and by 6.1 for automatic cars! Regarding the time to reach ¼ mile, each additional second the car takes to run ¼ mile adds 0.834 miles per gallon for manual cars and 1.446 for automatic cars. Both measures and interactions make sense physically. The heavier a car, the more force must be used to move it and, consequently, the more gas is used. The time to reach ¼ mile can be viewed as an acceleration property: shorter times mean more acceleration and vice versa, so a longer time would indicate a weaker and, consequently, more economical engine. Of course this depends on other variables, such as the velocity sustained over the period, but this report analyzed each parameter individually, assuming all other variables constant.
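To make these numbers concrete, the following sketch (our addition, not part of the original analysis; values computed from the coefficient table above) predicts the efficiency of a hypothetical 3000 lb car that runs ¼ mile in 18 seconds, under each gear type:

newcars <- data.frame(am = c(0, 1), wt = 3, qsec = 18) # hypothetical manual (0) and automatic (1) car
predict(reduced.model, newdata = newcars)
# manual (am = 0):    13.969 - 3.176*3 + 0.834*18 = 19.45 mpg
# automatic (am = 1): 13.969 - 6.099*3 + 1.446*18 = 21.70 mpg

At this weight the automatic car is still predicted to be more economical, but equating the two expressions shows the advantage disappears at roughly 3,800 lbs, which anticipates the conclusion below.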
All in all, we cannot simply assume that automatic cars are better than manual ones. Automatic cars suffer a greater loss of efficiency as car weight increases, so for a heavier or bigger car we would have to check whether the automatic version is still more economical. Regarding car acceleration (time to reach ¼ mile), if we were to choose a weaker car, one that covers ¼ mile in more time, we would opt for an automatic car instead of a manual one, because with weaker engines automatic cars seem to be more efficient than manual cars with the same ¼-mile time.

5. Appendix Figures

Chapter 1. Introduction Codes and Figures

library(MASS) # load packages; mtcars itself ships with R's datasets package
data(mtcars)

Chapter 2. Exploratory Analysis Codes and Figures

par(mfrow = c(1,2))                  # 2 plots in one line
cAut = mtcars[mtcars$am == 1, ]$mpg; # automatic cars dataset
cMan = mtcars[mtcars$am == 0, ]$mpg  # manual cars dataset
data.frame(mean.cMan = mean(cMan), 
           mean.cAut = mean(cAut),   # mean and sd of each model
           sd.cMan = sd(cMan), 
           sd.cAut = sd(cAut))
##   mean.cMan mean.cAut sd.cMan sd.cAut
## 1     17.15     24.39   3.834   6.167
plot(cAut, col=3, pch=19,            # first plot (scatterplot)
     ylim=c(min(cMan), max(cAut)),              # enlarge y axis so all points appear
     main="Miles per Gallon Comparison",        # title
     xlab="Car Index", ylab="Miles per Gallon") # x and y axis labels
points(cMan, col=2, pch=19)                     # add manual cars to the same plot

boxplot(mpg~am,data=mtcars,main="Averages",varwidth=TRUE, col=c(2,3),
        names=c("Manual","Automatic"))      # 2nd plot (boxplot)

Figure: miles per gallon scatterplot and boxplot comparing manual and automatic cars.

t.test(cMan, cAut, alternative="two.sided") # t-test for efficiency difference
## 
##  Welch Two Sample t-test
## 
## data:  cMan and cAut
## t = -3.767, df = 18.33, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.28  -3.21
## sample estimates:
## mean of x mean of y 
##     17.15     24.39

Chapter 3.1 Model Preview Codes and Figures

summary(lm(mpg ~ factor(am)*.,data=mtcars)) # summary of all-variables regression
## 
## Call:
## lm(formula = mpg ~ factor(am) * ., data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.035 -0.760  0.109  0.548  2.696 
## 
## Coefficients: (2 not defined because of singularities)
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)         8.6435    22.3728    0.39    0.706  
## factor(am)1      -146.5509    66.3235   -2.21    0.047 *
## cyl                -0.5339     1.1726   -0.46    0.657  
## disp               -0.0203     0.0181   -1.12    0.286  
## hp                  0.0622     0.0479    1.30    0.218  
## drat                0.5916     3.1326    0.19    0.853  
## wt                  1.9541     2.3207    0.84    0.416  
## qsec               -0.8843     0.7888   -1.12    0.284  
## vs                  0.7389     2.6125    0.28    0.782  
## am                      NA         NA      NA       NA  
## gear                8.6542     4.0517    2.14    0.054 .
## carb               -4.8105     1.9765   -2.43    0.032 *
## factor(am)1:cyl    -0.7474     4.2614   -0.18    0.864  
## factor(am)1:disp    0.2002     0.1596    1.25    0.234  
## factor(am)1:hp     -0.2227     0.1381   -1.61    0.133  
## factor(am)1:drat   -5.5414     5.8474   -0.95    0.362  
## factor(am)1:wt    -12.4960     5.0728   -2.46    0.030 *
## factor(am)1:qsec    8.9793     3.2147    2.79    0.016 *
## factor(am)1:vs      0.2042     5.2854    0.04    0.970  
## factor(am)1:am          NA         NA      NA       NA  
## factor(am)1:gear    3.6743     7.2513    0.51    0.622  
## factor(am)1:carb    9.4991     4.1683    2.28    0.042 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.88 on 12 degrees of freedom
## Multiple R-squared:  0.962,  Adjusted R-squared:  0.903 
## F-statistic: 16.2 on 19 and 12 DF,  p-value: 8.25e-06

Chapter 3.2 Model Selection Codes and Figures

fit.min <- lm(mpg ~ 1, data=mtcars)  # minimum possible regression model
reduced.model <- step(fit.min, direction="forward", 
                      scope=(~ factor(am):(wt + qsec + carb)))
## Start:  AIC=115.9
## mpg ~ 1
## 
##                   Df Sum of Sq  RSS   AIC
## + factor(am):wt    2       856  270  74.2
## + factor(am):qsec  2       798  328  80.5
## + factor(am):carb  2       694  432  89.3
## <none>                         1126 115.9
## 
## Step:  AIC=74.22
## mpg ~ factor(am):wt
## 
##                   Df Sum of Sq RSS  AIC
## + factor(am):qsec  2     151.1 119 51.9
## + factor(am):carb  2      41.2 229 72.9
## <none>                         270 74.2
## 
## Step:  AIC=51.95
## mpg ~ factor(am):wt + factor(am):qsec
## 
##                   Df Sum of Sq RSS  AIC
## <none>                         119 51.9
## + factor(am):carb  2      8.22 110 53.7
par(mfrow=c(1,1))           # one plot per line
par(mar=c(5.1,7.1,4.1,2.1)) # increase left margin
library(arm)                # provides coefplot() (assumed source package)
coefplot(reduced.model)     # plot coefficient values and confidence intervals

Figure: coefficient estimates and confidence intervals for the selected model.
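For reference, the AIC values reported in the step trace above follow the linear-model form of the criterion used by R's step(),

\[ AIC = n \ln(RSS/n) + 2k, \]

where \( n = 32 \) is the sample size, \( RSS \) the residual sum of squares and \( k \) the number of estimated coefficients; lower is better, which is why the search stops once adding factor(am):carb would raise the AIC from 51.9 to 53.7.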

Chapter 3.3 Model Validation Codes and Figures

fitres <- rstandard(reduced.model) # get standardized residuals
par(mfrow=c(1,2))                  # two plots per line
par(mar=c(5.1,4.1,4.1,2.1))        # get back standard plot margin
plot(rstandard(reduced.model),     # 1st plot standard residuals
     main="Standardized Residuals",
     ylab="Obs. - Est.")
abline(h=0, col="blue")            # add line in y = 0
qqnorm(fitres)                     # 2nd plot normal qqplot
qqline(fitres,col=2)               # normal expected quantiles distributions

Figure: standardized residuals plot and normal Q-Q plot.

# residual normal test 
shapiro.test(fitres)
## 
##  Shapiro-Wilk normality test
## 
## data:  fitres
## W = 0.9639, p-value = 0.3497
ks.test(fitres, "pnorm",mean(fitres),sd(fitres))
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  fitres
## D = 0.1245, p-value = 0.6584
## alternative hypothesis: two-sided
library(lmtest) # provides dwtest()
# Durbin-Watson test for residual autocorrelation
dwtest(reduced.model, alternative="two.sided")
## 
##  Durbin-Watson test
## 
## data:  reduced.model
## DW = 2.164, p-value = 0.935
## alternative hypothesis: true autocorrelation is not 0