Executive Summary

This analysis explores the question whether the type of transmission (automatic vs manual) influences the mileage of cars. Using the dataset mtcars multivariable regression analysis was carried out to examine the correlation between \(mpg\) (miles per gallon) and various factors in the dataset. To find an optimally fitted model, likelihood ratios were used. This analysis indicates that manual cars are significantly (\(p-value\) = 0.027) better in terms of MPG. At 95% confidence level, manual transmissions give 3.47 \(\pm\) 3.05 higher mileage than the automatic transmissions.

Exploratory Data Analysis

The details of the data used in this analysis can be seen in the appendix.

Since we are interested in the question whether one type of transmission is better than the other in terms of mileage (mpg), let’s first look at the average mpg values by transmission.

mt <- mtcars %>% mutate(am = factor(am))
c(automatic=mean((mt %>% filter(am==0))$mpg), manual=mean((mt %>% filter(am==1))$mpg))
## automatic    manual 
##  17.14737  24.39231

In the current dataset, the manual cars have higher mpg values than the automatic cars. Several factors can affect car mileage, for example: car weight, engine size, the type of engine. To get some idea about what kind of relations exist between various columns in this dataset, I created some exploratory plots. Some of these are included in the appendix. It became apparent that in this dataset, the manual cars had much lower weights and engine sizes, both of which are likely to influence \(mpg\). Thus, it is likely that the increased mileage we see in the manual cars is a reflection of the car weights and engine size, instead of the influence of transmission.

Regression analysis of the data

Multivariable regression analysis was used to dissect influence of various predictors on \(mpg\). To begin with, regression analysis was performed using all potential predictors. Summary of this regression can be seen in the appendix.

library(car)
fit <- lm(mpg ~ ., mt)

Based on the summary, it is apparent that this is not a very useful model. We do not see a significant correlation between any of the predictors with \(mpg\). This could be due increased variance from over-fitting the model. To find a better-fitted model, several nested models were created and their likelihood ratios tested. Here, I am showing 5 nested models that I finally selected for further analysis.

library(lmtest)

fit1 <- lm(mpg ~ am, mt)
fit2 <- lm(mpg ~ am + wt + qsec, mt)
fit3 <- lm(mpg ~ am + wt + qsec + disp + hp, mt)
fit4 <- lm(mpg ~ am + wt + qsec + disp + hp + drat + vs, mt)
fit5 <- lm(mpg ~ ., mt)

l <- logLik(fit1); l2 <- logLik(fit2); l3 <- logLik(fit3); l4 <- logLik(fit4); l5 <- logLik(fit5)

# test statistic
t12 <- -2 * (as.numeric(l) - as.numeric(l2)); t23 <- -2 * (as.numeric(l2) - as.numeric(l3))
t34 <- -2 * (as.numeric(l3) - as.numeric(l4)); t45 <- -2 * (as.numeric(l4) - as.numeric(l5))

# p-values for hypothesis that the first model is better that the second in the pair analysed
p12 <- pchisq(t12, df=2, lower.tail=F); p23 <- pchisq(t23, df=2, lower.tail=F)
p34 <- pchisq(t34, df=2, lower.tail=F); p45 <- pchisq(t45, df=5, lower.tail=F)

c(fit1vs2 = p12, fit2vs3 = p23, fit3vs4 = p34 , fit4vs5 = p45)
##      fit1vs2      fit2vs3      fit3vs4      fit4vs5 
## 8.550001e-11 2.074839e-01 6.560137e-01 9.947311e-01

The above code creates five nested models, calculates their pair-wise likelihood ratios and outputs \(p-values\) for the hypothesis that the first model is better that the second in the pair being analysed. Based on these values, model3 (\(fit3\)) seems to be a good fit. Let’s look at the regression coefficients for this model.

coef(summary(fit3))
##                Estimate Std. Error   t value    Pr(>|t|)
## (Intercept) 14.36190396 9.74079485  1.474408 0.152378367
## am1          3.47045340 1.48578009  2.335779 0.027487809
## wt          -4.08433206 1.19409972 -3.420428 0.002075008
## qsec         1.00689683 0.47543287  2.117853 0.043907652
## disp         0.01123765 0.01060333  1.059823 0.298972150
## hp          -0.02117055 0.01450469 -1.459565 0.156387279
confint(fit3)['am1',]
##     2.5 %    97.5 % 
## 0.4163887 6.5245181

 
Based on the regression coefficients of model3 (\(fit3\)) predictors above, manual transmission seems to have a significant (\(p.value = 0.027\)) positive impact on \(mpg\). At 95% confidence level, manual transmissions give 0.41 to 6.52 (3.47 \(\pm\) 3.05) higher mileage than automatic transmissions.

Residuals

To further corroborate usefulness of model \(fit3\), let’s analyse the residuals of \(fit3\).

mean(resid(fit3))
## [1] 1.734723e-17

As can be seen, the mean of residuals of model \(fit3\) is very close to zero indicating they are randomly distributed. To further ensure that there is no residual pattern in the distribution of residuals, which would mean that the model is under-fitted, residuals of the model were plotted (Please see Appendix). These plots show normal distribution of the residuals around mean 0, indicating a good model fit.

Summary

Multivariable regression analysis was performed to examine the influence of transmission type on mileage using the dataset mtcars. Nested models were created and their likelihood ratios tested to find an optimally-fitted model. The selected model indicates that manual cars give significantly higher mileage, thus are better in terms of MPG. At 95% confidence level, manual transmissions give 3.47 \(\pm\) 3.05 higher mileage than the automatic transmissions.

The model indicates that apart from transmission, weight, and qsec(1/4 mile time) have statistically significant influences on \(mpg\).

Appendix

The data

data(mtcars); library(dplyr)
head(mtcars, 3); dim(mtcars)
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## [1] 32 11

 
The mileage is in column \(mpg\) while column \(am\) indicates the type of transmission. 0 for aiutomatic and 1 for manual transmission.
 

Exploratory Data Analysis

par(mfrow=c(1,3), mar=c(5,5,5,0), oma=c(0, 0, 2, 0))
plot(mpg ~ am, data=mt, xlab='transmission', ylab='mpg', xaxt='n')
axis(1, at=1:2, labels=c('automatic', 'manual'))
plot(mpg ~ wt, data=mt, xlab='weight, tons', ylab='mpg', col=am, pch=19) 
legend('topright', legend=c('automatic', 'manual'), col=c(1,2), pch=19)
plot(wt ~ am, data=mt, xlab='transmission', ylab='weight, tons', xaxt='n')
axis(1, at=1:2, labels=c('automatic', 'manual'))
mtext('Relationship of mpg, weight and transmission of cars in mtcars dataset',
      side = 3, line = -2, outer = TRUE)

par(mfrow=c(1,3), mar=c(5,5,5,0), oma=c(0, 0, 2, 0))
plot(wt ~ am, data=mt, xlab='transmission', ylab='weight, tons', xaxt='n')
axis(1, at=1:2, labels=c('automatic', 'manual'))
plot(disp ~ am, data=mt, xlab='transmission', ylab='engine, cc', xaxt='n')
axis(1, at=1:2, labels=c('automatic', 'manual'))
plot(hp ~ am, data=mt, xlab='transmission', ylab='engine hp', xaxt='n')
axis(1, at=1:2, labels=c('automatic', 'manual'))
mtext('distribution of weight, qsec and horsepower of automatic and manual cars', 
      side = 3, line = -2, outer = TRUE)

 

The Models

Model \(fit\)

fit <- lm(mpg ~ ., mt)
summary(fit)$coef
##                Estimate  Std. Error    t value   Pr(>|t|)
## (Intercept) 12.30337416 18.71788443  0.6573058 0.51812440
## cyl         -0.11144048  1.04502336 -0.1066392 0.91608738
## disp         0.01333524  0.01785750  0.7467585 0.46348865
## hp          -0.02148212  0.02176858 -0.9868407 0.33495531
## drat         0.78711097  1.63537307  0.4813036 0.63527790
## wt          -3.71530393  1.89441430 -1.9611887 0.06325215
## qsec         0.82104075  0.73084480  1.1234133 0.27394127
## vs           0.31776281  2.10450861  0.1509915 0.88142347
## am1          2.52022689  2.05665055  1.2254035 0.23398971
## gear         0.65541302  1.49325996  0.4389142 0.66520643
## carb        -0.19941925  0.82875250 -0.2406258 0.81217871

Model \(fit1\)

fit1 <- lm(mpg ~ am, mt)
summary(fit1)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
## am1          7.244939   1.764422  4.106127 2.850207e-04

Model \(fit2\)

fit2 <- lm(mpg ~ am + wt + qsec, mt)
summary(fit2)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  9.617781  6.9595930  1.381946 1.779152e-01
## am1          2.935837  1.4109045  2.080819 4.671551e-02
## wt          -3.916504  0.7112016 -5.506882 6.952711e-06
## qsec         1.225886  0.2886696  4.246676 2.161737e-04

Model \(fit4\)

fit4 <- lm(mpg ~ am + wt + qsec + disp + hp + drat + vs, mt)
summary(fit4)$coef
##                Estimate  Std. Error    t value    Pr(>|t|)
## (Intercept) 12.49804962 12.48038774  1.0014152 0.326616519
## am1          3.02401646  1.66840310  1.8125215 0.082435144
## wt          -3.94973602  1.26261038 -3.1282303 0.004567014
## qsec         0.87148931  0.61331436  1.4209504 0.168194583
## disp         0.01373930  0.01135852  1.2096028 0.238211942
## hp          -0.02282191  0.01525893 -1.4956426 0.147781592
## drat         0.95532814  1.40737217  0.6788028 0.503756614
## vs           0.59016578  1.83303033  0.3219618 0.750269228

 

Residuals

Plot of residuals of model \(fit3\) showing their normal distribution.

par(mfrow=c(1,2), mar=c(5,5,5,0), oma=c(0, 0, 3, 0))
plot(resid(fit3), col=mt$am, pch=19);
legend('topleft', legend=c('automatic', 'manual'), col=c(1,2), pch=19, cex=.8)
plot(density(resid(fit3)),main='density(x = resid(fit3))', main.font=1)