Summary

The purpose of this exercise is to answer the question: is an automatic or manual transmission better for MPG? To answer it, exploratory data analysis and regression models are applied to the mtcars dataset from the datasets package in R. The data were extracted from the 1974 Motor Trend US magazine and comprise fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Exploratory Data Analysis

In this section, we explore the data and determine which variables are relevant for building the regression model, following the parsimony principle: keep the model as simple as possible.
Let's start by loading the required packages and displaying the first 3 rows of the mtcars data set.
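The loading chunk is not echoed in the report; a minimal sketch that would reproduce the output below, assuming ggplot2 is the only extra package needed (it is used for the chart in the Appendix), is:

library(ggplot2)  # used for the chart in Figure 2
data(mtcars)      # mtcars ships with base R's datasets package
head(mtcars, 3)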

##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

This data set consists of 11 variables and 32 observations. We are interested in exploring which variables explain the mpg variable for manual and automatic cars (am = 1 for manual cars, am = 0 otherwise).

As seen in the correlation matrix (see Figure 1 in the Appendix), this is a highly correlated data set, and many variables can be explained or predicted by others.
The variable am should be included in the model by default, and hp, wt, cyl and disp also look like good candidates for the multivariate model, as they are highly correlated with mpg.
However, it makes sense not to include disp and cyl, since they are largely explained by hp and wt (cor(hp, cyl) = 0.832, cor(wt, disp) = 0.888), as checked below.
We want to avoid adding variables that are highly dependent on others, since that collinearity would lead to a worse model.
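As a quick check of the correlations quoted above, a minimal sketch using base R's cor():

cor(mtcars$hp, mtcars$cyl)    # about 0.83
cor(mtcars$wt, mtcars$disp)   # about 0.89
cor(mtcars$mpg, mtcars[, c("am", "hp", "wt", "cyl", "disp")])  # correlations with mpg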

Simple Linear Regression

Let's start by fitting a simple linear regression; later we will compare this model with a multivariate regression model.
Figure 2 in the Appendix shows the mpg vs. am chart.
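The fitting chunk is not echoed; a sketch that would reproduce the coefficient table below, using the object name fit1 that is referenced later by anova() and in the Appendix:

fit1 <- lm(mpg ~ factor(am), data = mtcars)
summary(fit1)$coef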

##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## factor(am)1  7.244939   1.764422  4.106127 2.850207e-04

The interpretation of these coefficients is:
- The intercept = 17.15 is the average mpg for automatic cars.
- The slope = 7.24 is the difference in average mpg between manual and automatic cars (a quick check against the group means follows this list).
- The p-value = 0.029% is the probability of obtaining an estimate at least this extreme if the null hypothesis H0: Beta1 = 0 (no association between mpg and am) were true. Since this value is smaller than the 0.05 significance level, we reject H0.
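As a quick check of this interpretation (a sketch using base R), the per-group means of mpg should match the intercept and the intercept plus the slope:

with(mtcars, tapply(mpg, am, mean))  # automatic (0) = 17.15, manual (1) = 17.15 + 7.24 = 24.39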

This model predicts, on average, a 7.24 mpg increase for manual cars relative to automatic cars. Looking at the R-squared, this model only explains 35.98% of the variance.
Figure 3 in the Appendix shows a t-test comparing manual car mpg vs. automatic car mpg.
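The R-squared quoted above can be extracted directly (a minimal sketch):

summary(fit1)$r.squared  # about 0.36 for the simple model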

Multivariate Linear Regression

In this section I will compare the simple and multivariate regression models, using the anova function and setting the significance level at 5%.

fit3 <- lm(mpg ~ factor(am) + hp + wt, data = mtcars)
anova(fit1, fit3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(am)
## Model 2: mpg ~ factor(am) + hp + wt
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     30 720.90                                  
## 2     28 180.29  2    540.61 41.979 3.745e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing both models, we get a p-value of 3.745e-09, so we reject the null hypothesis that the added regressors (hp and wt) do not improve the fit; the multivariate model is significantly better.

summary(fit3)$coef
##                Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 34.00287512 2.642659337 12.866916 2.824030e-13
## factor(am)1  2.08371013 1.376420152  1.513862 1.412682e-01
## hp          -0.03747873 0.009605422 -3.901830 5.464023e-04
## wt          -2.87857541 0.904970538 -3.180850 3.574031e-03

The multivariate model predicts a 2.08 mpg increase for manual cars relative to automatic cars. In addition, the R-squared for this model is 83.99%, considerably better than for the simple linear model.
The uncertainty introduced in this model shows up in the p-value for the am coefficient, which is 14.13%. Although this value is greater than the established significance level of 5%, we still conclude that manual cars are better than automatic cars for mpg, given the evidence seen previously (the sign of the coefficient and the t-test in Figure 3).
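The R-squared and the uncertainty around the am coefficient can be inspected directly; a sketch (confint() returns the 95% confidence interval):

summary(fit3)$r.squared        # about 0.84 for the multivariate model
confint(fit3, "factor(am)1")   # 95% confidence interval for the manual vs. automatic difference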

Diagnostics

Residuals do not present any anomaly, as shown in Figure 4.
For the diagnostics, we compute the leverage of each observation (hat values) and how the am coefficient would change if a particular observation were excluded from the model (dfbetas). The next table contains hat values and am dfbetas for all observations in the data set, ordered by hat value in descending order; only the 6 observations with the highest leverage are displayed.

diagnostic <- data.frame(am1_dfbetas = round(dfbetas(fit3)[, 2], 3),
                         hatvalues = round(hatvalues(fit3), 3))
diagnostic <- diagnostic[order(-diagnostic$hatvalues),]
head(diagnostic)
##                     am1_dfbetas hatvalues
## Maserati Bora             0.156     0.412
## Lincoln Continental       0.009     0.273
## Cadillac Fleetwood       -0.091     0.235
## Chrysler Imperial         0.540     0.230
## Ford Pantera L           -0.086     0.223
## Duster 360                0.032     0.192

We could say that leverage is “under control” (highest value = 0.412), and that the biggest change in the am coefficient would come from removing Chrysler Imperial from the data set (am1_dfbetas = 0.54).

Conclusions

From these results we can conclude that manual cars are better for mpg than automatic cars, under the uncertainty conditions shown previously. The difference can be quantified by the am coefficient of the multivariate regression model, as explained above.

Appendix

Figure 1. Correlation table for mtcars
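The chunk that generates this table is not echoed; a sketch that would reproduce it:

round(cor(mtcars), 2)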

##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00

Figure 2. Simple Linear Regression Model

g <- ggplot(mtcars, aes(x = am, y = mpg)) + geom_point()
g <- g + geom_abline(intercept = summary(fit1)$coef[1, 1], slope = summary(fit1)$coef[2, 1])
g

Figure 3. Manual car mpg vs. automatic car mpg t-test

aut <- subset(mtcars, am == 0)
man <- subset(mtcars, am == 1)
t.test(man$mpg, aut$mpg)
## 
##  Welch Two Sample t-test
## 
## data:  man$mpg and aut$mpg
## t = 3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.209684 11.280194
## sample estimates:
## mean of x mean of y 
##  24.39231  17.14737

Figure 4. Residuals analysis

par(mfrow = c(2,2))
plot(fit3)