Executive summary

In this report we aim to find out what type of transmission (automatic or manual) is better for mpg (miles per gallon) and quantify the differences in mpg between them using the mtcars data set. As a result of different models comparison, assesment of p-values and residuals for them it leads to the conclusion that type of transmission doesn’t influence mpg, but appears to be correlated with other characteristics.

Exploratory Analysis

library(datasets)
data(mtcars)
head(mtcars,4)

The exploratory plots clearly show that heavier cars with bigger engines tend to be automatics whereas lighter cars with smaller engines tend to be manual. The better (higher) values of mpg are mostly observed for the vehicles with manual transmission. The question is: “Does the type of transmission influence mpg or this dependency is a result of influence of other characteristics?” Let’s check out other dependencies using ggpairs (GGally package). There is negative correlation between hp, wt and mpg, e.g. the more car weights and the more poverful the engine, the more fuel it consumes per mile.

P-values analysis

We need to know what predictors have the highest impact on mpg, so let’s fit multiple nested models (Appendix A3) and compare them with anova (table available in Appendix A4). Inclusion of wt and hp impacts mpg the most. The following table shows p-values \(P(>|t|)\) for the coefficients of different models. The am is significant when it’s the only predictor, but as soon as we add wt and then hp it immediately becomes insignificant. Also if we add more variables, for example cyl, the significance of wt and hp drops. The best combination so far is mpg~wt+hp.

Residuals Analysis

Let’s have a look at the residuals - variation left unexplained - for different models. Again, the model with the only one predictor - mpg~am - poorly fits because the residuals show clear pattern - they are concentrated exactly at the mean values of mpg for auto and manual transmission: 17.15 and 24.39. For other models residuals have less random pattern, but it looks like non linear variation left unexplained.
The graph could represent several ways in which the model is not explaining all that is possible: 1) a missing variable; 2) a missing higher-order term of a variable in the model to explain the curvature; 3) a missing interaction between terms already in the model. Since we already tried to fit more variables, this option is not the case. Let’s check models: 1) with added interaction between wt and hp: \(mpg\)~\(wt*hp\); 2) with higher orders - squares of both wt and hp: \(mpg\)~\(hp+I(hp^2)+wt+I(wt^2)\)

According to the results of anova comparison for both models, the model with predictors interaction is better (Appendix A8). Also R-squared and adjusted R-squared values are little bit better for the model with predictors interaction (Appendix A10). Residual plots also look better for these models than for mpg~wt+hp.

Coefficients explaining

Refer to summary(fitInteract) in Appendix A10 that shows the \(mpg\)~\(hp*wt\) model. The regression slope estimate for wt is -8.21 and for hp is -0.12. We interpret these coefficients as follows: our model estimates an expected 8.21 decrease in mpg for every 1% increase in car weight holding the remaining variables constant; model estimates an expected 0.12 decrease in mpg for every 1% increase in hp holding the remaining variables constant. The two-sided t-test for H0:wt=0 versus Ha: wt<>0 is significant since 5.20e-07 is less than typical benchmarks (0.05). The same for hp and for their interaction. All coefficients in the model are significant.

Conclusion


Appendix

A1. Create first chunk of exploratory plots

par(mfrow=c(1,6),mar=c(4,4,2,0.5))
boxplot(mpg~am,data=mtcars, col="grey", xlab="0-auto, 1-manual", ylab="mpg")
plot(x=mtcars$am, y=mtcars$mpg, col=mtcars$cyl, xlab="0-auto, 1-manual", ylab="mpg")
legend("top",lwd=1,col=sort(unique(mtcars$cyl)),title="Cylinders",legend=sort(unique(mtcars$cyl)))
plot(x=mtcars$am, y=mtcars$mpg, col=mtcars$gear, xlab="0-auto, 1-manual", ylab="mpg")
legend("top",lwd=1,col=sort(unique(mtcars$gear)),title="Gears",legend=sort(unique(mtcars$gear)))
plot(x=mtcars$wt, y=mtcars$mpg, col=mtcars$am+5, xlab="wt", ylab="mpg")
legend("topright",lwd=1,col=sort(unique(mtcars$am))+5,title="Transmission",legend=c("auto","manual"))
plot(x=mtcars$hp, y=mtcars$mpg, col=mtcars$am+5, xlab="hp", ylab="mpg")
legend("topright",lwd=1,col=sort(unique(mtcars$am))+5,title="Transmission",legend=c("auto","manual"))
plot(x=mtcars$carb, y=mtcars$mpg, col=mtcars$am+5, xlab="carb", ylab="mpg")
legend("topright",lwd=1,col=sort(unique(mtcars$am))+5,title="Transmission",legend=c("auto","manual"))

A2. Checking the dependencies between the variables

library(ggplot2)
library(GGally)
g1<-ggpairs(mtcars, lower=list(continuous="smooth"))
g1

A3. Fitted linear models

fit1<-lm(mpg~am, data=mtcars);                      fit2<-lm(mpg~am+wt, data=mtcars)
fit3<-lm(mpg~am+wt+hp, data=mtcars);                fit4<-lm(mpg~am+wt+hp+cyl,data=mtcars)
fit6<-lm(mpg~am+wt+hp+cyl+carb+gear, data=mtcars);  fit_all<-lm(mpg~., data=mtcars)

A4. Anova comparison for fitted models

anova(fit1, fit2, fit3, fit4, fit6, fit_all)

A5. P-values for selected models

fit7<-lm(mpg~hp+wt, data=mtcars)
prt<-data.frame(Intercept=c(summary(fit1)$coef[1,4],summary(fit2)$coef[1,4],
          summary(fit3)$coef[1,4],summary(fit4)$coef[1,4],summary(fit7)$coef[1,4]),
   am=c(summary(fit1)$coef[2,4],summary(fit2)$coef[2,4],
          summary(fit3)$coef[2,4],summary(fit4)$coef[2,4],NA),
   wt=c(NA,summary(fit2)$coef[3,4],summary(fit3)$coef[3,4],
        summary(fit4)$coef[3,4],summary(fit7)$coef[3,4]),
   hp=c(NA,NA,summary(fit3)$coef[4,4],summary(fit4)$coef[4,4],summary(fit7)$coef[2,4]),
   cyl=c(NA,NA,NA,round(summary(fit4)$coef[5,4],6),NA),
   row.names = c("~am", "~am+wt","~am+wt+hp", "~am+wt+hp+cyl", "~wt+hp")); prt

A6. Mean values of mpg for am=1 and am=0

mn<-c(mean(mtcars[mtcars[,"am"]==0,"mpg"]),mean(mtcars[mtcars[,"am"]==1,"mpg"]))

A7. Creating residual plots

par(mfrow=c(1,5)); plot(fit1, which=1, main="mpg~am") 
plot(fit2, which=1, main="mpg~am+wt")
plot(fit3, which=1, main="mpg~am+wt+hp") 
plot(fit4, which=1, main="mpg~am+wt+hp+cyl")
plot(fit7, which=1, main="mpg~wt+hp")

A8. Models with predictors’ interaction and higher-order predictors

fitInteract<-lm(mpg~hp*wt, data=mtcars)
fitQuadratic<-lm(mpg~hp+I(hp^2)+wt+I(wt^2), data=mtcars)
##fitInt2<-lm(mpg~am+hp*wt, data=mtcars);
anova(fit7,fitQuadratic)
anova(fit7,fitInteract)##; anova(fitInteract, fitInt2)

A9. Residual plots for the models with predictors’ interaction and higher-order predictors

par(mfrow=c(1,6))
plot(fit7, which=1, main="mpg~wt+hp")
plot(fitQuadratic, which=1, main="Quadratic")
plot(fitInteract, main="Interact")

A10. Summaries for the models with predictors’ interaction and higher-order predictors

summary(fit7)$coef
#                Estimate Std. Error   t value     Pr(>|t|)
# (Intercept) 37.22727012 1.59878754 23.284689 2.565459e-20
# hp          -0.03177295 0.00902971 -3.518712 1.451229e-03
# wt          -3.87783074 0.63273349 -6.128695 1.119647e-06
summary(fitInteract)
# 
# Call:
# lm(formula = mpg ~ hp * wt, data = mtcars)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -3.0632 -1.6491 -0.7362  1.4211  4.5513 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept) 49.80842    3.60516  13.816 5.01e-14 ***
# hp          -0.12010    0.02470  -4.863 4.04e-05 ***
# wt          -8.21662    1.26971  -6.471 5.20e-07 ***
# hp:wt        0.02785    0.00742   3.753 0.000811 ***
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 2.153 on 28 degrees of freedom
# Multiple R-squared:  0.8848,  Adjusted R-squared:  0.8724 
# F-statistic: 71.66 on 3 and 28 DF,  p-value: 2.981e-13
summary(fitQuadratic)
# 
# Call:
# lm(formula = mpg ~ hp + I(hp^2) + wt + I(wt^2), data = mtcars)
# 
# Residuals:
#     Min      1Q  Median      3Q     Max 
# -2.8849 -1.8165 -0.3922  1.3499  4.5807 
# 
# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  4.945e+01  3.521e+00  14.044 6.27e-14 ***
# hp          -9.428e-02  3.193e-02  -2.952 0.006456 ** 
# I(hp^2)      1.743e-04  8.073e-05   2.159 0.039879 *  
# wt          -9.220e+00  2.270e+00  -4.062 0.000375 ***
# I(wt^2)      8.500e-01  3.005e-01   2.829 0.008700 ** 
# ---
# Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 
# Residual standard error: 2.135 on 27 degrees of freedom
# Multiple R-squared:  0.8907,  Adjusted R-squared:  0.8745 
# F-statistic: 55.02 on 4 and 27 DF,  p-value: 1.363e-12
summary(fit1)$coef
#              Estimate Std. Error   t value     Pr(>|t|)
# (Intercept) 17.147368   1.124603 15.247492 1.133983e-15
# am           7.244939   1.764422  4.106127 2.850207e-04