In this report we aim to find out what type of transmission (automatic or manual) is better for mpg (miles per gallon) and quantify the differences in mpg between them using the mtcars data set. As a result of different models comparison, assesment of p-values and residuals for them it leads to the conclusion that type of transmission doesn’t influence mpg, but appears to be correlated with other characteristics.
library(datasets)
data(mtcars)
head(mtcars,4)
The exploratory plots clearly show that heavier cars with bigger engines tend to be automatics whereas lighter cars with smaller engines tend to be manual. The better (higher) values of mpg are mostly observed for the vehicles with manual transmission. The question is: “Does the type of transmission influence mpg or this dependency is a result of influence of other characteristics?” Let’s check out other dependencies using ggpairs (GGally package). There is negative correlation between hp, wt and mpg, e.g. the more car weights and the more poverful the engine, the more fuel it consumes per mile.
Let’s have a look at the residuals - variation left unexplained - for different models. Again, the model with the only one predictor - mpg~am - poorly fits because the residuals show clear pattern - they are concentrated exactly at the mean values of mpg for auto and manual transmission: 17.15 and 24.39. For other models residuals have less random pattern, but it looks like non linear variation left unexplained.
The graph could represent several ways in which the model is not explaining all that is possible: 1) a missing variable; 2) a missing higher-order term of a variable in the model to explain the curvature; 3) a missing interaction between terms already in the model. Since we already tried to fit more variables, this option is not the case. Let’s check models: 1) with added interaction between wt and hp: \(mpg\)~\(wt*hp\); 2) with higher orders - squares of both wt and hp: \(mpg\)~\(hp+I(hp^2)+wt+I(wt^2)\)
According to the results of anova comparison for both models, the model with predictors interaction is better (Appendix A8). Also R-squared and adjusted R-squared values are little bit better for the model with predictors interaction (Appendix A10). Residual plots also look better for these models than for mpg~wt+hp.
Refer to summary(fitInteract) in Appendix A10 that shows the \(mpg\)~\(hp*wt\) model. The regression slope estimate for wt is -8.21 and for hp is -0.12. We interpret these coefficients as follows: our model estimates an expected 8.21 decrease in mpg for every 1% increase in car weight holding the remaining variables constant; model estimates an expected 0.12 decrease in mpg for every 1% increase in hp holding the remaining variables constant. The two-sided t-test for H0:wt=0 versus Ha: wt<>0 is significant since 5.20e-07 is less than typical benchmarks (0.05). The same for hp and for their interaction. All coefficients in the model are significant.
A1. Create first chunk of exploratory plots
par(mfrow=c(1,6),mar=c(4,4,2,0.5))
boxplot(mpg~am,data=mtcars, col="grey", xlab="0-auto, 1-manual", ylab="mpg")
plot(x=mtcars$am, y=mtcars$mpg, col=mtcars$cyl, xlab="0-auto, 1-manual", ylab="mpg")
legend("top",lwd=1,col=sort(unique(mtcars$cyl)),title="Cylinders",legend=sort(unique(mtcars$cyl)))
plot(x=mtcars$am, y=mtcars$mpg, col=mtcars$gear, xlab="0-auto, 1-manual", ylab="mpg")
legend("top",lwd=1,col=sort(unique(mtcars$gear)),title="Gears",legend=sort(unique(mtcars$gear)))
plot(x=mtcars$wt, y=mtcars$mpg, col=mtcars$am+5, xlab="wt", ylab="mpg")
legend("topright",lwd=1,col=sort(unique(mtcars$am))+5,title="Transmission",legend=c("auto","manual"))
plot(x=mtcars$hp, y=mtcars$mpg, col=mtcars$am+5, xlab="hp", ylab="mpg")
legend("topright",lwd=1,col=sort(unique(mtcars$am))+5,title="Transmission",legend=c("auto","manual"))
plot(x=mtcars$carb, y=mtcars$mpg, col=mtcars$am+5, xlab="carb", ylab="mpg")
legend("topright",lwd=1,col=sort(unique(mtcars$am))+5,title="Transmission",legend=c("auto","manual"))
A2. Checking the dependencies between the variables
library(ggplot2)
library(GGally)
g1<-ggpairs(mtcars, lower=list(continuous="smooth"))
g1
A3. Fitted linear models
fit1<-lm(mpg~am, data=mtcars); fit2<-lm(mpg~am+wt, data=mtcars)
fit3<-lm(mpg~am+wt+hp, data=mtcars); fit4<-lm(mpg~am+wt+hp+cyl,data=mtcars)
fit6<-lm(mpg~am+wt+hp+cyl+carb+gear, data=mtcars); fit_all<-lm(mpg~., data=mtcars)
A4. Anova comparison for fitted models
anova(fit1, fit2, fit3, fit4, fit6, fit_all)
A5. P-values for selected models
fit7<-lm(mpg~hp+wt, data=mtcars)
prt<-data.frame(Intercept=c(summary(fit1)$coef[1,4],summary(fit2)$coef[1,4],
summary(fit3)$coef[1,4],summary(fit4)$coef[1,4],summary(fit7)$coef[1,4]),
am=c(summary(fit1)$coef[2,4],summary(fit2)$coef[2,4],
summary(fit3)$coef[2,4],summary(fit4)$coef[2,4],NA),
wt=c(NA,summary(fit2)$coef[3,4],summary(fit3)$coef[3,4],
summary(fit4)$coef[3,4],summary(fit7)$coef[3,4]),
hp=c(NA,NA,summary(fit3)$coef[4,4],summary(fit4)$coef[4,4],summary(fit7)$coef[2,4]),
cyl=c(NA,NA,NA,round(summary(fit4)$coef[5,4],6),NA),
row.names = c("~am", "~am+wt","~am+wt+hp", "~am+wt+hp+cyl", "~wt+hp")); prt
A6. Mean values of mpg for am=1 and am=0
mn<-c(mean(mtcars[mtcars[,"am"]==0,"mpg"]),mean(mtcars[mtcars[,"am"]==1,"mpg"]))
A7. Creating residual plots
par(mfrow=c(1,5)); plot(fit1, which=1, main="mpg~am")
plot(fit2, which=1, main="mpg~am+wt")
plot(fit3, which=1, main="mpg~am+wt+hp")
plot(fit4, which=1, main="mpg~am+wt+hp+cyl")
plot(fit7, which=1, main="mpg~wt+hp")
A8. Models with predictors’ interaction and higher-order predictors
fitInteract<-lm(mpg~hp*wt, data=mtcars)
fitQuadratic<-lm(mpg~hp+I(hp^2)+wt+I(wt^2), data=mtcars)
##fitInt2<-lm(mpg~am+hp*wt, data=mtcars);
anova(fit7,fitQuadratic)
anova(fit7,fitInteract)##; anova(fitInteract, fitInt2)
A9. Residual plots for the models with predictors’ interaction and higher-order predictors
par(mfrow=c(1,6))
plot(fit7, which=1, main="mpg~wt+hp")
plot(fitQuadratic, which=1, main="Quadratic")
plot(fitInteract, main="Interact")
A10. Summaries for the models with predictors’ interaction and higher-order predictors
summary(fit7)$coef
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 37.22727012 1.59878754 23.284689 2.565459e-20
# hp -0.03177295 0.00902971 -3.518712 1.451229e-03
# wt -3.87783074 0.63273349 -6.128695 1.119647e-06
summary(fitInteract)
#
# Call:
# lm(formula = mpg ~ hp * wt, data = mtcars)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3.0632 -1.6491 -0.7362 1.4211 4.5513
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 49.80842 3.60516 13.816 5.01e-14 ***
# hp -0.12010 0.02470 -4.863 4.04e-05 ***
# wt -8.21662 1.26971 -6.471 5.20e-07 ***
# hp:wt 0.02785 0.00742 3.753 0.000811 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 2.153 on 28 degrees of freedom
# Multiple R-squared: 0.8848, Adjusted R-squared: 0.8724
# F-statistic: 71.66 on 3 and 28 DF, p-value: 2.981e-13
summary(fitQuadratic)
#
# Call:
# lm(formula = mpg ~ hp + I(hp^2) + wt + I(wt^2), data = mtcars)
#
# Residuals:
# Min 1Q Median 3Q Max
# -2.8849 -1.8165 -0.3922 1.3499 4.5807
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 4.945e+01 3.521e+00 14.044 6.27e-14 ***
# hp -9.428e-02 3.193e-02 -2.952 0.006456 **
# I(hp^2) 1.743e-04 8.073e-05 2.159 0.039879 *
# wt -9.220e+00 2.270e+00 -4.062 0.000375 ***
# I(wt^2) 8.500e-01 3.005e-01 2.829 0.008700 **
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 2.135 on 27 degrees of freedom
# Multiple R-squared: 0.8907, Adjusted R-squared: 0.8745
# F-statistic: 55.02 on 4 and 27 DF, p-value: 1.363e-12
summary(fit1)$coef
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 17.147368 1.124603 15.247492 1.133983e-15
# am 7.244939 1.764422 4.106127 2.850207e-04