Motor Trend /2015/07 by Alexa Kiss
Miles per gallon (MPG), or its metric counterpart kilometres per litre, represents the distance travelled per unit volume of fuel used. The higher the value, the more economical a vehicle is (the farther it can travel on a given volume of fuel). In this short report, we analyze the “mtcars” R dataset, with particular interest in the relationship between MPG and a set of other variables.
The two main questions of this study are: 1. Is automatic or manual transmission better for MPG?
Results: Cars with manual transmission travel more miles per gallon of fuel, so on the face of it manual transmission is better. In the model that uses transmission mode as the only predictor, the difference between manual and automatic transmission is 7.24 MPG. However, this model explains only about 36% of the total variance. In the best model, which also takes into account the weight, horsepower and number of cylinders of the cars, the difference shrinks to 1.8 MPG and is no longer statistically significant. In conclusion, the data do not settle whether manual or automatic transmission is better: the answer depends on which model we look at.
The scatterplot matrix of the dataset can be found in the Appendix; it reveals correlations between variables and shows that some of these variables should be converted to factors.
library(datasets)
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$gear <- as.factor(mtcars$gear)
mtcars$carb <- as.factor(mtcars$carb)
mtcars$am <- factor(mtcars$am,labels=c('Automatic','Manual'))
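As a quick sanity check on the recoding above, the sketch below tabulates the relabelled transmission factor against the raw 0/1 coding. It works on a local copy named `mt` (a name chosen here for illustration) so that the report's `mtcars` object is left untouched.

```r
# Local copy; `mt` is an illustrative name, not part of the original analysis
mt <- datasets::mtcars

# factor() assigns labels to the sorted unique values, so 0 -> Automatic, 1 -> Manual
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

# Number of cars per transmission type
am_counts <- table(mt$am)
print(am_counts)

# Cross-check the labels against the raw 0/1 coding
print(table(raw = datasets::mtcars$am, recoded = mt$am))
```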
A short history: The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). (source: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html)
model1 <- lm(mpg ~ am, data = mtcars)
# To save space, only the standard errors and the p-values of the coefficients are extracted; the estimates are in the full summary in the Appendix
summary(model1)$coefficients[,2]
## (Intercept) amManual
## 1.124603 1.764422
summary(model1)$coefficients[,4]
## (Intercept) amManual
## 1.133983e-15 2.850207e-04
confint(model1, 'amManual')
## 2.5 % 97.5 %
## amManual 3.64151 10.84837
The model shows that the type of transmission indeed has an effect on fuel consumption. The intercept represents the mean MPG of the reference group (“0” = automatic transmission), while the second term shows that the “1” group (manual transmission) has a mean that is 7.24 miles per gallon higher (95% CI: 3.64 to 10.85; the coefficient estimates are listed in the full summary in the Appendix), and the difference is significant at the p<0.001 level. Thus, judged by this model alone, manual transmission is better in terms of fuel economy: a change from automatic to manual transmission is associated with an average gain of 7.24 miles per gallon.
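With a single two-level factor as the predictor, the intercept and slope are simply the reference-group mean and the difference between the group means. The sketch below checks this on a local copy of the data; the names `mt` and `fit1` are illustrative and not part of the original analysis.

```r
# Local copy; `mt` and `fit1` are illustrative names
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

# Mean MPG within each transmission group
group_means <- tapply(mt$mpg, mt$am, mean)

# Single-predictor model with the same formula as model1
fit1 <- lm(mpg ~ am, data = mt)

intercept <- unname(coef(fit1)["(Intercept)"])
slope     <- unname(coef(fit1)["amManual"])
mean_auto <- unname(group_means["Automatic"])
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])

# Both comparisons should be TRUE up to floating-point tolerance
print(all.equal(intercept, mean_auto))
print(all.equal(slope, mean_diff))
```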
The plot confirms that the mean miles per gallon is higher for manual transmission. The first model (model1) has already shown that the difference in fuel economy between the two transmission modes is statistically significant.
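The code behind the figure is not included in this report. A comparison of this kind could be drawn as in the sketch below, which assumes a boxplot with the group means overlaid; the plot type, labels and the name `mt` are assumptions made here.

```r
# Local copy; `mt` is an illustrative name
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))

# boxplot() draws the figure and invisibly returns the summary statistics
bp <- boxplot(mpg ~ am, data = mt,
              xlab = "Transmission", ylab = "Miles per gallon",
              main = "MPG by transmission type")

# Overlay the group means, since the text compares means rather than medians
group_means <- tapply(mt$mpg, mt$am, mean)
points(seq_along(group_means), group_means, pch = 18, cex = 1.5)
```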
However, the full summary of model1 shows an R-squared value of approximately 0.36, meaning that only 36% of the variance in MPG is explained by the mode of transmission. MPG is likely correlated with other variables as well (see the scatterplot matrix). To better understand changes in MPG, we need to incorporate other variables via stepwise model selection.
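To make the 36% figure concrete, R-squared can be recomputed by hand as one minus the ratio of residual to total variation. This is a minimal sketch on a local copy of the data; `mt`, `fit1`, `rss` and `tss` are illustrative names.

```r
# Local copy; all object names here are illustrative
mt <- datasets::mtcars
mt$am <- factor(mt$am, labels = c("Automatic", "Manual"))
fit1 <- lm(mpg ~ am, data = mt)

rss <- sum(residuals(fit1)^2)           # variation left unexplained by the model
tss <- sum((mt$mpg - mean(mt$mpg))^2)   # total variation in MPG
r2_manual <- 1 - rss / tss

# Value reported by summary(), for comparison
r2_summary <- summary(fit1)$r.squared
print(c(manual = r2_manual, summary = r2_summary))
```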
full_model <- lm(mpg ~ ., data = mtcars)
model2 <- step(full_model, direction = "both")
confint(model2, 'amManual')
## 2.5 % 97.5 %
## amManual -1.060934 4.679356
The adjusted R-squared value of this model is 0.84 (see Appendix), showing that including the variables “cyl”, “hp” and “wt” increases the explained variance of MPG to 84%. The ANOVA comparison of the original and the best model indicates that this improvement is significant (Appendix). Importantly, after adjusting for these other variables, transmission mode has an average effect of only 1.8 MPG (95% CI: -1.06 to 4.68): cars with manual transmission get 1.8 miles per gallon more, but this difference is not statistically significant (p = 0.21).
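One plausible reading of the shrinking coefficient is that transmission type is entangled with the adjustment variables, for example if manual cars in this sample tend to be lighter. The sketch below compares group averages and the transmission coefficient with and without adjustment. It refits the selected formula directly instead of rerunning `step()`, and `mt`, `fit1` and `fit2` are illustrative names.

```r
# Local copy; object names are illustrative
mt <- datasets::mtcars
mt$cyl <- as.factor(mt$cyl)
mt$am  <- factor(mt$am, labels = c("Automatic", "Manual"))

# Average weight (in 1000 lbs) and horsepower per transmission group
group_avgs <- aggregate(cbind(wt, hp) ~ am, data = mt, FUN = mean)
print(group_avgs)

# Transmission coefficient without and with the adjustment variables
fit1 <- lm(mpg ~ am, data = mt)
fit2 <- lm(mpg ~ cyl + hp + wt + am, data = mt)  # formula reported for model2
am_effect <- c(unadjusted = unname(coef(fit1)["amManual"]),
               adjusted   = unname(coef(fit2)["amManual"]))
print(round(am_effect, 2))

# Adjusted R-squared of the larger model
adj_r2 <- summary(fit2)$adj.r.squared
print(round(adj_r2, 2))
```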
Based on the scatterplots, MPG seems to be correlated with most of the other variables (which are in turn correlated with each other), indicating that a linear regression with one predictor may not be sufficient and that variable selection is needed.
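A scatterplot matrix like the one described can be produced with `pairs()`, and the visual impression can be backed up with the correlations of each variable with MPG. This sketch uses the unconverted numeric data under the illustrative name `mt`; the plot title is an assumption.

```r
# Numeric columns, before any factor conversion; `mt` is an illustrative name
mt <- datasets::mtcars

# Scatterplot matrix of all variables
pairs(mt, main = "mtcars scatterplot matrix")

# Pearson correlation of every other variable with mpg, strongest first
mpg_cor <- cor(mt)["mpg", colnames(mt) != "mpg"]
print(round(mpg_cor[order(-abs(mpg_cor))], 2))
```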
Summary and model diagnostics
summary(model1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
summary(model2)
##
## Call:
## lm(formula = mpg ~ cyl + hp + wt + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9387 -1.2560 -0.4013 1.1253 5.0513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.70832 2.60489 12.940 7.73e-13 ***
## cyl6 -3.03134 1.40728 -2.154 0.04068 *
## cyl8 -2.16368 2.28425 -0.947 0.35225
## hp -0.03211 0.01369 -2.345 0.02693 *
## wt -2.49683 0.88559 -2.819 0.00908 **
## amManual 1.80921 1.39630 1.296 0.20646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 26 degrees of freedom
## Multiple R-squared: 0.8659, Adjusted R-squared: 0.8401
## F-statistic: 33.57 on 5 and 26 DF, p-value: 1.506e-10
anova(model1,model2)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ cyl + hp + wt + am
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 26 151.03 4 569.87 24.527 1.688e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
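The F statistic in the table compares the drop in residual sum of squares per added parameter with the residual variance of the larger model. As a hedged illustration, it can be recomputed by hand from the two fits; the object names below are illustrative, and the models are refitted on a local copy of the data.

```r
# Local copy; object names are illustrative
mt <- datasets::mtcars
mt$cyl <- as.factor(mt$cyl)
mt$am  <- factor(mt$am, labels = c("Automatic", "Manual"))
fit1 <- lm(mpg ~ am, data = mt)
fit2 <- lm(mpg ~ cyl + hp + wt + am, data = mt)

rss1 <- deviance(fit1)   # residual sum of squares, smaller model
rss2 <- deviance(fit2)   # residual sum of squares, larger model
df_extra <- df.residual(fit1) - df.residual(fit2)   # number of added parameters

# F = (reduction in RSS per added parameter) / (residual variance of larger model)
f_manual <- ((rss1 - rss2) / df_extra) / (rss2 / df.residual(fit2))
p_manual <- pf(f_manual, df_extra, df.residual(fit2), lower.tail = FALSE)

# Value computed by anova(), for comparison
f_table <- anova(fit1, fit2)$F[2]
print(c(manual = f_manual, anova = f_table, p = p_manual))
```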
The diagnostic plots look good: the residuals are randomly distributed, and most of them lie on the normality line in the QQ-plot. There are some outliers, but based on Cook's distance, none of them is an influential point.
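The diagnostic figures themselves are not reproduced in this text. They could be regenerated along the lines of the sketch below, which assumes the standard four-panel `plot()` output for a fitted `lm` object and lists the largest Cook's distances and hat values. A common rule of thumb treats a Cook's distance above 1 as worth a closer look, though this is a convention and not a formal test. Object names are illustrative.

```r
# Local copy; object names are illustrative
mt <- datasets::mtcars
mt$cyl <- as.factor(mt$cyl)
mt$am  <- factor(mt$am, labels = c("Automatic", "Manual"))
fit2 <- lm(mpg ~ cyl + hp + wt + am, data = mt)

# Four standard diagnostic panels on one page, then restore the layout
old_par <- par(mfrow = c(2, 2))
plot(fit2)
par(old_par)

# Cook's distance measures influence; show the three largest values
cd <- cooks.distance(fit2)
print(round(head(sort(cd, decreasing = TRUE), 3), 3))

# Hat values measure leverage; show the three largest values
lev <- hatvalues(fit2)
print(round(head(sort(lev, decreasing = TRUE), 3), 3))
```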