Executive summary

This project investigates the relationship between miles per gallon (MPG) and the other variables stored in the mtcars dataset. More specifically, we try to answer two questions:

* Is there a transmission type (manual or automatic) that performs better in terms of MPG?
* If so, how large is the difference between the two in terms of MPG?

The Analysis section of this document focuses on inference with a simple linear regression model and a multiple regression model. Both models support the conclusion that, for the cars in this study, manual transmissions have on average a significantly higher MPG than automatic transmissions.

This conclusion holds whether we consider the relationship between MPG and transmission type alone or transmission type together with two other predictors: weight (wt) and quarter-mile time (qsec).

In the simple model, the mean MPG difference is 7.245 MPG: the average MPG for cars with automatic transmissions is 17.147 MPG, and the average MPG for cars with manual transmissions is 24.392 MPG. In the multiple regression model, the MPG difference is 2.9358 MPG with weight and qsec held constant.

Exploratory analysis and visualizations are located in the Appendix to this document.

Exploratory analysis

Let’s have a quick look at the data first:

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Helper variables and data processing (the categorical variables are converted to factors):

n <- length(mtcars$mpg)
alpha <- 0.05
mtcars$cyl  <- factor(mtcars$cyl)
mtcars$vs   <- factor(mtcars$vs)
mtcars$gear <- factor(mtcars$gear)
mtcars$carb <- factor(mtcars$carb)
mtcars$am   <- factor(mtcars$am,labels=c("Automatic","Manual"))

Let’s do a pairs plot to have an overview of the correlations between the different variables in the dataset:

pairs(mtcars,panel = panel.smooth)

From the plot, we can see that the variables cyl, disp, hp, drat, wt, vs and am are strongly correlated with mpg.
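The visual impression from the pairs plot can be backed by numbers. The sketch below computes the Pearson correlation of each variable with mpg; it assumes a fresh, all-numeric copy of the dataset, stored under the illustrative name `cars` (not a name used elsewhere in this report), because the factor conversion above would otherwise prevent cor() from running.

```r
# Fresh copy so every column is still numeric
# ('cars' and 'mpg.cor' are illustrative names for this sketch only).
cars <- datasets::mtcars

# Pearson correlation of every other variable with mpg
mpg.cor <- cor(cars)["mpg", colnames(cars) != "mpg"]

# Sort by absolute strength, strongest first
mpg.cor <- mpg.cor[order(abs(mpg.cor), decreasing = TRUE)]
print(round(mpg.cor, 2))
```

Treating 0/1 and count variables such as am, vs or cyl as numeric here is only a rough screening device, not a substitute for the regression analysis.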

To see whether the transmission type has some effect on mpg, we can use a box plot:

boxplot(mpg ~ am, data = mtcars, col = (c("blue","red")), ylab = "MPG", xlab = "Transmission Type")

It clearly shows that cars with automatic transmissions tend to have a lower MPG than cars with manual transmissions. This is quantified in the next section.
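A minimal sketch of the simple model behind this comparison is given here, assuming a local copy of the data under the illustrative name `cars`. With a two-level factor as the only predictor, the intercept is the automatic group mean and the amManual coefficient is the difference between the two group means.

```r
# Local copy with am recoded as in the data processing step
# ('cars' and 'simple.model' are illustrative names).
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Simple linear regression of mpg on transmission type
simple.model <- lm(mpg ~ am, data = cars)
print(coef(simple.model))  # (Intercept) = automatic mean, amManual = mean difference

# Group means for comparison with the coefficients
group.means <- tapply(cars$mpg, cars$am, mean)
print(group.means)

# Welch two-sample t-test on the same comparison
print(t.test(mpg ~ am, data = cars)$p.value)
```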

Regression model analysis

In this section, we try to find the best linear regression model by using different predictors and model selection methods. We then run diagnostics on the residuals. Our starting model is the multiple regression model obtained by using all the variables in the dataset:

starting.model <- lm(mpg ~ . -1 , data = mtcars)
summary(starting.model)$coef
##             Estimate  Std. Error    t value   Pr(>|t|)
## cyl4     23.87913244 20.06582026  1.1900402 0.25252548
## cyl6     21.23043717 18.33416483  1.1579713 0.26498157
## cyl8     23.54296946 18.22249667  1.2919728 0.21591810
## disp      0.03554632  0.03189920  1.1143329 0.28267339
## hp       -0.07050683  0.03942556 -1.7883534 0.09393155
## drat      1.18283018  2.48348458  0.4762784 0.64073922
## wt       -4.52977584  2.53874584 -1.7842573 0.09461859
## qsec      0.36784482  0.93539569  0.3932505 0.69966720
## vs1       1.93085054  2.87125777  0.6724755 0.51150791
## amManual  1.21211570  3.21354514  0.3771896 0.71131573
## gear4     1.11435494  3.79951726  0.2932886 0.77332027
## gear5     2.52839599  3.73635801  0.6767007 0.50889747
## carb2    -0.97935432  2.31797446 -0.4225044 0.67865093
## carb3     2.99963875  4.29354611  0.6986390 0.49546781
## carb4     1.09142288  4.44961992  0.2452845 0.80956031
## carb6     4.47756921  6.38406242  0.7013668 0.49381268
## carb8     7.25041126  8.36056638  0.8672153 0.39948495

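With all the variables included at once, no single predictor stands out: every coefficient has a p-value greater than 0.05, most likely because many of the predictors are correlated with each other. Let’s find the best model with the step() function:

# direction = "both" is inferred from the "+" rows in the trace below
best.model <- step(starting.model, direction = "both")
## Start:  AIC=76.4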
## mpg ~ (cyl + disp + hp + drat + wt + qsec + vs + am + gear + 
##     carb) - 1
## 
##        Df Sum of Sq    RSS    AIC
## - carb  5   13.5989 134.00 69.828
## - gear  2    3.9729 124.38 73.442
## - cyl   3   16.3057 136.71 74.468
## - am    1    1.1420 121.55 74.705
## - qsec  1    1.2413 121.64 74.732
## - drat  1    1.8208 122.22 74.884
## - vs    1    3.6299 124.03 75.354
## <none>              120.40 76.403
## - disp  1    9.9672 130.37 76.948
## - wt    1   25.5541 145.96 80.562
## - hp    1   25.6715 146.07 80.588
## 
## Step:  AIC=69.83
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear - 1
## 
##        Df Sum of Sq    RSS    AIC
## - gear  2    5.0215 139.02 67.005
## - cyl   3   16.9198 150.92 67.633
## - disp  1    0.9934 135.00 68.064
## - drat  1    1.1854 135.19 68.110
## - vs    1    3.6763 137.68 68.694
## - qsec  1    5.2634 139.26 69.061
## <none>              134.00 69.828
## - am    1   11.9255 145.93 70.556
## - wt    1   19.7963 153.80 72.237
## - hp    1   22.7935 156.79 72.855
## + carb  5   13.5989 120.40 76.403
## 
## Step:  AIC=67
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am - 1
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   3   16.6694 155.69 64.629
## - drat  1    0.9672 139.99 65.227
## - disp  1    1.5483 140.57 65.359
## - vs    1    2.1829 141.21 65.503
## - qsec  1    3.6324 142.66 65.830
## <none>              139.02 67.005
## - am    1   16.5665 155.59 68.608
## - hp    1   18.1768 157.20 68.937
## + gear  2    5.0215 134.00 69.828
## - wt    1   31.1896 170.21 71.482
## + carb  5   14.6475 124.38 73.442
## 
## Step:  AIC=65.32
## mpg ~ disp + hp + drat + wt + qsec + vs + am - 1
## 
##        Df Sum of Sq    RSS    AIC
## - vs    2     6.363 155.81 62.653
## - drat  1     2.869 152.32 63.927
## - disp  1     9.111 158.56 65.212
## <none>              149.45 65.319
## - qsec  1    12.573 162.02 65.904
## - hp    1    13.929 163.38 66.170
## + cyl   2    10.425 139.02 67.005
## - am    1    20.457 169.91 67.424
## + gear  2     2.882 146.57 68.696
## + carb  5    15.299 134.15 71.863
## - wt    1    60.936 210.38 74.262
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am - 1
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## + vs    1     0.645 149.45 65.319
## + cyl   2     8.887 141.21 65.503
## - am    2    34.915 185.01 66.149
## - qsec  1    25.574 175.67 66.491
## + gear  2     2.779 147.31 66.859
## + carb  5    11.674 138.42 70.866
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am - 1
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## + drat  1     3.345 150.09 63.457
## + cyl   2    11.107 142.33 63.757
## + vs    1     1.121 152.32 63.927
## - qsec  1    26.470 179.91 65.255
## + gear  2     3.038 150.40 65.522
## + carb  5     3.965 149.47 71.324
## - wt    1    69.043 222.48 72.051
## - am    2   101.005 254.44 74.347
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am - 1
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## + cyl   2    16.085 143.98 62.126
## + disp  1     6.629 153.44 62.162
## + drat  1     1.428 158.64 63.229
## - qsec  1    20.225 180.29 63.323
## + vs    1     0.249 159.82 63.466
## + gear  2     1.764 158.30 65.161
## + carb  5     6.393 153.67 70.211
## - wt    1    78.494 238.56 72.284
## - am    2    97.447 257.51 72.731
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am - 1
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## + hp    1     9.219 160.07 61.515
## + disp  1     3.276 166.01 62.682
## + drat  1     1.400 167.89 63.042
## + vs    1     0.000 169.29 63.307
## + cyl   2     9.862 159.42 63.387
## + gear  2     0.185 169.10 65.272
## + carb  5    10.999 158.29 69.158
## - am    2   121.452 290.74 74.614
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790
summary(best.model)
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am - 1, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## amAutomatic   9.6178     6.9596   1.382 0.177915    
## amManual     12.5536     6.0573   2.072 0.047543 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.9879, Adjusted R-squared:  0.9862 
## F-statistic: 573.7 on 4 and 28 DF,  p-value: < 2.2e-16
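The AIC values that step() uses to rank candidate models can be reproduced by hand. For a linear model, extractAIC() computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below checks this for the full starting model, refitted on a local copy of the data under the illustrative names `cars` and `full.fit`.

```r
# Local copy with the same factor conversions as the data processing step
# ('cars', 'full.fit' and 'aic.by.hand' are illustrative names).
cars <- datasets::mtcars
for (v in c("cyl", "vs", "gear", "carb")) cars[[v]] <- factor(cars[[v]])
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

full.fit <- lm(mpg ~ . - 1, data = cars)

n.obs <- nrow(cars)
rss   <- deviance(full.fit)             # residual sum of squares
edf   <- n.obs - df.residual(full.fit)  # number of estimated coefficients

aic.by.hand <- n.obs * log(rss / n.obs) + 2 * edf
print(c(by.hand = aic.by.hand, extractAIC = extractAIC(full.fit)[2]))
```

Both numbers should agree with the AIC reported on the "Start:" line of the trace. Note that this AIC differs from the value returned by AIC() by an additive constant, which does not affect the ranking of models.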

The function selected wt, qsec and am as the best predictors, so the selected model is:

best.model <- lm(mpg ~ wt + qsec + am -1 , data=mtcars)
summary(best.model)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## amAutomatic  9.617781  6.9595930  1.381946 1.779152e-01
## amManual    12.553618  6.0573391  2.072464 4.754335e-02

Excluding the intercept gives one coefficient per transmission level. These are group-specific intercepts, that is, the expected mpg at wt = 0 and qsec = 0, which lies far outside the range of the data, so they are not average MPG values in themselves. What matters is their difference: holding the weight (wt) and qsec variables constant, cars with manual transmission are expected to achieve 12.55 - 9.62 = 2.94 mpg more than cars with automatic transmission. To check whether this difference is statistically significant, let’s reintroduce the intercept (which then represents the Automatic level), so that the amManual coefficient estimates the difference directly:
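The two parameterisations describe the same fit. As a small sketch, assuming a local copy of the data under the illustrative name `cars`, the code below checks that the gap between the two transmission coefficients of the no-intercept model equals the amManual coefficient of the model with intercept, and shows the fitted mpg of each transmission type at the mean wt and qsec.

```r
# Local copy with am recoded as in the data processing step
# (all object names in this sketch are illustrative).
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

no.int   <- lm(mpg ~ wt + qsec + am - 1, data = cars)
with.int <- lm(mpg ~ wt + qsec + am, data = cars)

# Gap between the two group intercepts of the no-intercept model
gap <- unname(coef(no.int)["amManual"] - coef(no.int)["amAutomatic"])
print(all.equal(gap, unname(coef(with.int)["amManual"])))

# Fitted mpg for each transmission type at the mean weight and qsec
at.means <- data.frame(
  wt   = mean(cars$wt),
  qsec = mean(cars$qsec),
  am   = factor(c("Automatic", "Manual"), levels = levels(cars$am))
)
fitted.at.means <- predict(with.int, newdata = at.means)
print(fitted.at.means)
```

Because the model has no interaction terms, the difference between the two fitted values is the same at any fixed wt and qsec, not only at their means.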

best.model.intercept <- lm(mpg ~ wt + qsec + am , data=mtcars)
summary(best.model.intercept)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04
## amManual     2.935837  1.4109045  2.080819 4.671551e-02
pe <- coef(summary(best.model.intercept))["amManual", "Estimate"]
se <- coef(summary(best.model.intercept))["amManual","Std. Error"]
tstat <- qt(1 - alpha/2, df.residual(best.model.intercept))  # 28 residual df: n - 4 estimated coefficients
round(pe + c(-1, 1) * (se * tstat), 3)
## [1] 0.046 5.826

The 95% CI does not include zero, so we can say that there is a statistically significant difference between automatic and manual transmissions at the 5% level. In particular, with weight and qsec held constant, manual transmission performs better in terms of mpg.
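As a built-in cross-check on the interval, confint() computes the same kind of t-based interval using the residual degrees of freedom of the model, that is, the number of observations minus the number of estimated coefficients. The sketch below refits the model on a local copy of the data; `cars`, `fit` and `ci` are illustrative names.

```r
# Local copy with am recoded as in the data processing step
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ wt + qsec + am, data = cars)

# Residual degrees of freedom: 32 observations minus 4 coefficients
print(df.residual(fit))

# 95% confidence interval for the manual-vs-automatic difference
ci <- confint(fit, "amManual", level = 0.95)
print(ci)
```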

Diagnostics: residual analysis

par(mfrow=c(2, 2))
plot(best.model.intercept)

From the top-left plot (Residuals vs Fitted) we can see that the residuals are scattered around zero without an obvious pattern, which gives no sign of dependence among them. The Q-Q plot (top right) suggests that the residuals are approximately normally distributed. The Scale-Location plot (bottom left) indicates that the variance is roughly constant.
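The visual checks can be complemented with a few numerical ones. The following is a minimal sketch, not part of the original analysis: it runs a Shapiro-Wilk normality test on the residuals and lists the observations with the highest leverage and the largest Cook's distance, which relate to the bottom-right plot (Residuals vs Leverage). The object names are illustrative.

```r
# Local copy with am recoded as in the data processing step
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit <- lm(mpg ~ wt + qsec + am, data = cars)
res <- residuals(fit)

# Shapiro-Wilk test: a large p-value gives no evidence against normality
print(shapiro.test(res)$p.value)

# Leverage (hat values); they sum to the number of estimated coefficients
lev <- hatvalues(fit)
print(head(sort(lev, decreasing = TRUE), 3))

# Cook's distance: influence of each car on the fitted coefficients
cook <- cooks.distance(fit)
print(head(sort(cook, decreasing = TRUE), 3))
```

With only 32 observations these tests have limited power, so they should be read together with the plots rather than in place of them.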