This analysis explores the question whether the type of transmission (automatic vs manual) influences the mileage of cars. Using the dataset mtcars multivariable regression analysis was carried out to examine the correlation between \(mpg\) (miles per gallon) and various factors in the dataset. To find an optimally fitted model, likelihood ratios were used. This analysis indicates that manual cars are significantly (\(p-value\) = 0.027) better in terms of MPG. At 95% confidence level, manual transmissions give 3.47 \(\pm\) 3.05 higher mileage than the automatic transmissions.
The details of the data used in this analysis can be seen in the appendix.
Since we are interested in the question whether one type of transmission is better than the other in terms of mileage (mpg), let’s first look at the average mpg values by transmission.
mt <- mtcars %>% mutate(am = factor(am))
c(automatic=mean((mt %>% filter(am==0))$mpg), manual=mean((mt %>% filter(am==1))$mpg))
## automatic manual
## 17.14737 24.39231
In the current dataset, the manual cars have higher mpg values than the automatic cars. Several factors can affect car mileage, for example: car weight, engine size, the type of engine. To get some idea about what kind of relations exist between various columns in this dataset, I created some exploratory plots. Some of these are included in the appendix. It became apparent that in this dataset, the manual cars had much lower weights and engine sizes, both of which are likely to influence \(mpg\). Thus, it is likely that the increased mileage we see in the manual cars is a reflection of the car weights and engine size, instead of the influence of transmission.
Multivariable regression analysis was used to dissect influence of various predictors on \(mpg\). To begin with, regression analysis was performed using all potential predictors. Summary of this regression can be seen in the appendix.
library(car)
fit <- lm(mpg ~ ., mt)
Based on the summary, it is apparent that this is not a very useful model. We do not see a significant correlation between any of the predictors with \(mpg\). This could be due increased variance from over-fitting the model. To find a better-fitted model, several nested models were created and their likelihood ratios tested. Here, I am showing 5 nested models that I finally selected for further analysis.
library(lmtest)
fit1 <- lm(mpg ~ am, mt)
fit2 <- lm(mpg ~ am + wt + qsec, mt)
fit3 <- lm(mpg ~ am + wt + qsec + disp + hp, mt)
fit4 <- lm(mpg ~ am + wt + qsec + disp + hp + drat + vs, mt)
fit5 <- lm(mpg ~ ., mt)
l <- logLik(fit1); l2 <- logLik(fit2); l3 <- logLik(fit3); l4 <- logLik(fit4); l5 <- logLik(fit5)
# test statistic
t12 <- -2 * (as.numeric(l) - as.numeric(l2)); t23 <- -2 * (as.numeric(l2) - as.numeric(l3))
t34 <- -2 * (as.numeric(l3) - as.numeric(l4)); t45 <- -2 * (as.numeric(l4) - as.numeric(l5))
# p-values for hypothesis that the first model is better that the second in the pair analysed
p12 <- pchisq(t12, df=2, lower.tail=F); p23 <- pchisq(t23, df=2, lower.tail=F)
p34 <- pchisq(t34, df=2, lower.tail=F); p45 <- pchisq(t45, df=5, lower.tail=F)
c(fit1vs2 = p12, fit2vs3 = p23, fit3vs4 = p34 , fit4vs5 = p45)
## fit1vs2 fit2vs3 fit3vs4 fit4vs5
## 8.550001e-11 2.074839e-01 6.560137e-01 9.947311e-01
The above code creates five nested models, calculates their pair-wise likelihood ratios and outputs \(p-values\) for the hypothesis that the first model is better that the second in the pair being analysed. Based on these values, model3 (\(fit3\)) seems to be a good fit. Let’s look at the regression coefficients for this model.
coef(summary(fit3))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.36190396 9.74079485 1.474408 0.152378367
## am1 3.47045340 1.48578009 2.335779 0.027487809
## wt -4.08433206 1.19409972 -3.420428 0.002075008
## qsec 1.00689683 0.47543287 2.117853 0.043907652
## disp 0.01123765 0.01060333 1.059823 0.298972150
## hp -0.02117055 0.01450469 -1.459565 0.156387279
confint(fit3)['am1',]
## 2.5 % 97.5 %
## 0.4163887 6.5245181
Based on the regression coefficients of model3 (\(fit3\)) predictors above, manual transmission seems to have a significant (\(p.value = 0.027\)) positive impact on \(mpg\). At 95% confidence level, manual transmissions give 0.41 to 6.52 (3.47 \(\pm\) 3.05) higher mileage than automatic transmissions.
To further corroborate usefulness of model \(fit3\), let’s analyse the residuals of \(fit3\).
mean(resid(fit3))
## [1] 1.734723e-17
As can be seen, the mean of residuals of model \(fit3\) is very close to zero indicating they are randomly distributed. To further ensure that there is no residual pattern in the distribution of residuals, which would mean that the model is under-fitted, residuals of the model were plotted (Please see Appendix). These plots show normal distribution of the residuals around mean 0, indicating a good model fit.
Multivariable regression analysis was performed to examine the influence of transmission type on mileage using the dataset mtcars. Nested models were created and their likelihood ratios tested to find an optimally-fitted model. The selected model indicates that manual cars give significantly higher mileage, thus are better in terms of MPG. At 95% confidence level, manual transmissions give 3.47 \(\pm\) 3.05 higher mileage than the automatic transmissions.
The model indicates that apart from transmission, weight, and qsec(1/4 mile time) have statistically significant influences on \(mpg\).
data(mtcars); library(dplyr)
head(mtcars, 3); dim(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## [1] 32 11
The mileage is in column \(mpg\) while column \(am\) indicates the type of transmission. 0 for aiutomatic and 1 for manual transmission.
par(mfrow=c(1,3), mar=c(5,5,5,0), oma=c(0, 0, 2, 0))
plot(mpg ~ am, data=mt, xlab='transmission', ylab='mpg', xaxt='n')
axis(1, at=1:2, labels=c('automatic', 'manual'))
plot(mpg ~ wt, data=mt, xlab='weight, tons', ylab='mpg', col=am, pch=19)
legend('topright', legend=c('automatic', 'manual'), col=c(1,2), pch=19)
plot(wt ~ am, data=mt, xlab='transmission', ylab='weight, tons', xaxt='n')
axis(1, at=1:2, labels=c('automatic', 'manual'))
mtext('Relationship of mpg, weight and transmission of cars in mtcars dataset',
side = 3, line = -2, outer = TRUE)
par(mfrow=c(1,3), mar=c(5,5,5,0), oma=c(0, 0, 2, 0))
plot(wt ~ am, data=mt, xlab='transmission', ylab='weight, tons', xaxt='n')
axis(1, at=1:2, labels=c('automatic', 'manual'))
plot(disp ~ am, data=mt, xlab='transmission', ylab='engine, cc', xaxt='n')
axis(1, at=1:2, labels=c('automatic', 'manual'))
plot(hp ~ am, data=mt, xlab='transmission', ylab='engine hp', xaxt='n')
axis(1, at=1:2, labels=c('automatic', 'manual'))
mtext('distribution of weight, qsec and horsepower of automatic and manual cars',
side = 3, line = -2, outer = TRUE)
fit <- lm(mpg ~ ., mt)
summary(fit)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337416 18.71788443 0.6573058 0.51812440
## cyl -0.11144048 1.04502336 -0.1066392 0.91608738
## disp 0.01333524 0.01785750 0.7467585 0.46348865
## hp -0.02148212 0.02176858 -0.9868407 0.33495531
## drat 0.78711097 1.63537307 0.4813036 0.63527790
## wt -3.71530393 1.89441430 -1.9611887 0.06325215
## qsec 0.82104075 0.73084480 1.1234133 0.27394127
## vs 0.31776281 2.10450861 0.1509915 0.88142347
## am1 2.52022689 2.05665055 1.2254035 0.23398971
## gear 0.65541302 1.49325996 0.4389142 0.66520643
## carb -0.19941925 0.82875250 -0.2406258 0.81217871
fit1 <- lm(mpg ~ am, mt)
summary(fit1)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
## am1 7.244939 1.764422 4.106127 2.850207e-04
fit2 <- lm(mpg ~ am + wt + qsec, mt)
summary(fit2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.617781 6.9595930 1.381946 1.779152e-01
## am1 2.935837 1.4109045 2.080819 4.671551e-02
## wt -3.916504 0.7112016 -5.506882 6.952711e-06
## qsec 1.225886 0.2886696 4.246676 2.161737e-04
fit4 <- lm(mpg ~ am + wt + qsec + disp + hp + drat + vs, mt)
summary(fit4)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.49804962 12.48038774 1.0014152 0.326616519
## am1 3.02401646 1.66840310 1.8125215 0.082435144
## wt -3.94973602 1.26261038 -3.1282303 0.004567014
## qsec 0.87148931 0.61331436 1.4209504 0.168194583
## disp 0.01373930 0.01135852 1.2096028 0.238211942
## hp -0.02282191 0.01525893 -1.4956426 0.147781592
## drat 0.95532814 1.40737217 0.6788028 0.503756614
## vs 0.59016578 1.83303033 0.3219618 0.750269228
Plot of residuals of model \(fit3\) showing their normal distribution.
par(mfrow=c(1,2), mar=c(5,5,5,0), oma=c(0, 0, 3, 0))
plot(resid(fit3), col=mt$am, pch=19);
legend('topleft', legend=c('automatic', 'manual'), col=c(1,2), pch=19, cex=.8)
plot(density(resid(fit3)),main='density(x = resid(fit3))', main.font=1)