Motor Trend, a magazine about the automobile industry is looking at a data set of a collection of cars. The magazine is interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). It is particularly interested in the following two questions:
Is an automatic or manual transmission better for MPG
Quantify the MPG difference between automatic and manual transmissions
An initial analysis, taking only transmission and mpg into account, seems to indicate that a manual transmission yields a higher mpg yield than an automatic transmission does. However, taking other variables into account shows that any positive effect of manual over automatic on mpg is reduced, to an extent that we are not confident that the positive effect is not just down to chance for the data given.
# Read in the data and explore its structure
data(mtcars)
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
dim(mtcars)
## [1] 32 11
The variable we will test against “am” is currently numeric, though it takes only two values (0 and 1). It is convenient to convert it to be a factor.
mtcars$am<-as.factor(mtcars$am)
levels(mtcars$am)<-c("automatic", "manual")
We first assess the effect of transmission on mpg using a boxplot. This indicates that mpg is higher for manual transmissions. However this does not take any of the other variables into account, so it is worthwhile exploring the situation further with some linear regression models.
boxplot(mpg~am, mtcars, col=c("blue", "green"),
xlab="Transmission",
ylab="Miles per Gallon",
main="Miles per Gallon by Transmission type")
We first apply a simple regerssion model to the data to add numeric detail to the boxplot.
mdl_basic<-lm(mpg~am, data=mtcars)
summary(mdl_basic)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## ammanual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
According to this simple model, on average, manual transmission cars have 7.245 miles per gallon more as compared to automatic transmission cars. However, on this model, only 34% of the variation in the data is explainable on there being a linear dependence of mpg on transmission type (using the adjusted R^2 value). This indicates that a more in-depth study of the effects of the other variables on mpg is needed.
We start by including all the variabels in the model
mdl_all<-lm(mpg~., data=mtcars)
summary(mdl_all)$adj.r.squared
## [1] 0.8066423
Thus model has an adjusted R^2 value of 81%, but it may contain unnecessary variables. We shall use a step-wise method to arrive at a smaller set of variables. We hope to achieve an adjusted R^2 value close to 80%.
mdl_min<-lm(mpg~1,data=mtcars)
mdl_fwd=step(mdl_min, direction="forward", scope=(~am+cyl+disp+hp+drat+wt+qsec+vs+gear+carb),data=mtcars)
## Start: AIC=115.94
## mpg ~ 1
##
## Df Sum of Sq RSS AIC
## + wt 1 847.73 278.32 73.217
## + cyl 1 817.71 308.33 76.494
## + disp 1 808.89 317.16 77.397
## + hp 1 678.37 447.67 88.427
## + drat 1 522.48 603.57 97.988
## + vs 1 496.53 629.52 99.335
## + am 1 405.15 720.90 103.672
## + carb 1 341.78 784.27 106.369
## + gear 1 259.75 866.30 109.552
## + qsec 1 197.39 928.66 111.776
## <none> 1126.05 115.943
##
## Step: AIC=73.22
## mpg ~ wt
##
## Df Sum of Sq RSS AIC
## + cyl 1 87.150 191.17 63.198
## + hp 1 83.274 195.05 63.840
## + qsec 1 82.858 195.46 63.908
## + vs 1 54.228 224.09 68.283
## + carb 1 44.602 233.72 69.628
## + disp 1 31.639 246.68 71.356
## <none> 278.32 73.217
## + drat 1 9.081 269.24 74.156
## + gear 1 1.137 277.19 75.086
## + am 1 0.002 278.32 75.217
##
## Step: AIC=63.2
## mpg ~ wt + cyl
##
## Df Sum of Sq RSS AIC
## + hp 1 14.5514 176.62 62.665
## + carb 1 13.7724 177.40 62.805
## <none> 191.17 63.198
## + qsec 1 10.5674 180.60 63.378
## + gear 1 3.0281 188.14 64.687
## + disp 1 2.6796 188.49 64.746
## + vs 1 0.7059 190.47 65.080
## + am 1 0.1249 191.05 65.177
## + drat 1 0.0010 191.17 65.198
##
## Step: AIC=62.66
## mpg ~ wt + cyl + hp
##
## Df Sum of Sq RSS AIC
## <none> 176.62 62.665
## + am 1 6.6228 170.00 63.442
## + disp 1 6.1762 170.44 63.526
## + carb 1 2.5187 174.10 64.205
## + drat 1 2.2453 174.38 64.255
## + qsec 1 1.4010 175.22 64.410
## + gear 1 0.8558 175.76 64.509
## + vs 1 0.0599 176.56 64.654
On the basis of this analysis, we choose to regress mpg on am, wt, cyl and hd. We bear in mind that cyl has only three levels for this analysis (4, 6 and 8).
mdl_reduced<-lm(mpg~am+wt+factor(cyl)+hp, data=mtcars)
summary(mdl_reduced)
##
## Call:
## lm(formula = mpg ~ am + wt + factor(cyl) + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## ammanual 1.80921 1.39630 1.296 0.20646
## wt -2.49683 0.88559 -2.819 0.00908 **
## factor(cyl)6 -3.03134 1.40728 -2.154 0.04068 *
## factor(cyl)8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
This model has an adjusted R^2 value of 84%, so that it is a good candidate for consideration, as it explains a high proportion of the variability on the basis of an underying trend, and it is reasonably parsimonious. We now note that while manual again contributes to a higher mpg than automatic, its effect is considerably reduced, when we take the other variables into account.
On the basis of the summary data, we look at the model with the variable hp removed (as it only contributes a decrease of 0.03).
mdl_reduced_2<-lm(mpg~am+wt+factor(cyl),data=mtcars)
summary(mdl_reduced_2)
##
## Call:
## lm(formula = mpg ~ am + wt + factor(cyl), data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4898 -1.3116 -0.5039 1.4162 5.7758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.7536 2.8135 11.997 2.5e-12 ***
## ammanual 0.1501 1.3002 0.115 0.90895
## wt -3.1496 0.9080 -3.469 0.00177 **
## factor(cyl)6 -4.2573 1.4112 -3.017 0.00551 **
## factor(cyl)8 -6.0791 1.6837 -3.611 0.00123 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.603 on 27 degrees of freedom
## Multiple R-squared: 0.8375, Adjusted R-squared: 0.8134
## F-statistic: 34.79 on 4 and 27 DF, p-value: 2.73e-10
In this more parsimonious model, we still have an adjusted R^2 measure of 80%, but in this case the positive effect of manual transmission on miles per gallon is very small (only 0.15).
anova(mdl_all, mdl_reduced, mdl_reduced_2)
## Analysis of Variance Table
##
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ am + wt + factor(cyl) + hp
## Model 3: mpg ~ am + wt + factor(cyl)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 21 147.49
## 2 26 151.03 -5 -3.531 0.1006 0.99093
## 3 27 182.97 -1 -31.943 4.5480 0.04492 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Inspect PRESS values for the three models
sum(abs(resid(mdl_reduced)/(1-hatvalues(mdl_reduced))))
## [1] 67.53938
sum(abs(resid(mdl_reduced_2)/(1-hatvalues(mdl_reduced_2))))
## [1] 70.546
sum(abs(resid(mdl_all)/(1-hatvalues(mdl_all))))
## [1] 87.80029
We see that of the three models examined, the model using am, wt, cyl and hp seems best. Although this indicates a positive contribution of 1.81 mpg for manual transmission, because of the other results, we are less sure of its overall positive contribution.
We first look at the distribution of the variable “mpg”.
library(ggplot2)
library(reshape2)
longcar<-melt(mtcars, measure.var="mpg")
g<-ggplot(longcar, aes(x=value))+geom_histogram(aes(y=..density..,fill=variable), binwidth=1, color="black")+geom_density(size=2)
g<-g+labs(title="Density plot of mpg")+labs(x="mpg")+labs(y="Frequency/Density")
g
We now summarize the model “mdl_reduced”" graphically
# We shall plot the fit for mdl_reduced
par(mfrow=c(2,2))
plot(mdl_reduced)