This analysis is part of the Coursera Regression Models class. The course project addresses the following questions:
You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:
The mean mpg for cars with an automatic transmission is 17.15, whereas for cars with a manual transmission it is 24.39. This suggests that cars with a manual transmission achieve a higher mpg. A simple linear regression confirms that this difference is statistically significant, so the null hypothesis of no difference is rejected. However, multivariate regression shows that other factors such as horsepower, cylinders, displacement and weight influence mpg, too. Further investigation might make sense, since the further analysis implies that horsepower and weight alone might explain variations in mpg better than the transmission type.
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars$am <- factor(mtcars$am, labels=c("Automatic","Manual"))
For the following analysis, we transform the variable am (0 = automatic, 1 = manual) into a factor variable.
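To illustrate what the factor conversion changes, the following minimal sketch shows how R encodes the factor for regression. It works on a private copy of the data; the names `cars` and `design` are chosen for this sketch only and do not appear in the analysis.

```r
# Work on a fresh copy of the packaged data so nothing else is modified
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Number of cars per transmission type
table(cars$am)

# Design matrix that lm() would build for mpg ~ am:
# "Automatic" (the first level) is the reference, "Manual" gets a 0/1 dummy column
design <- model.matrix(~ am, data = cars)
head(design)
```

Because the first factor level serves as the reference, a regression on am reports a single coefficient named amManual, which measures the difference relative to automatic cars.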
mpgmean <- aggregate(mtcars$mpg, by=list(mtcars$am), FUN=mean)
colnames(mpgmean) <- c("am", "mpg")
mpgmean
## am mpg
## 1 Automatic 17.14737
## 2 Manual 24.39231
mpgmean$mpg[2] - mpgmean$mpg[1]
## [1] 7.244939
As the boxplot suggests and the calculation confirms, the mean mpg for automatic cars is 17.15, whereas for manual cars it is 24.39. This amounts to a difference of 7.24 mpg. It seems that manual cars have a higher mpg than automatic cars. The following analysis tests this relationship with a linear regression.
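The boxplot mentioned above could be produced along the following lines. This is a sketch rather than the original plotting code: the axis labels, the title and the names `cars`, `group_means` and `mean_diff` are assumptions made for illustration.

```r
# Fresh copy of the data with am as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Boxplot of mpg by transmission type (labels are illustrative)
boxplot(mpg ~ am, data = cars,
        xlab = "Transmission type",
        ylab = "Miles per gallon (mpg)",
        main = "mpg by transmission type")

# Group means and their difference (manual minus automatic)
group_means <- tapply(cars$mpg, cars$am, mean)
mean_diff <- unname(group_means["Manual"] - group_means["Automatic"])
round(mean_diff, 2)
```

Note that a boxplot draws medians, not means, so the plot supports the direction of the difference while the means come from the calculation.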
fit <- lm(mpg ~ am, mtcars)
summary(fit)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## amManual 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
With a p-value of 0.000285, the null hypothesis of no difference in mean mpg between the transmission types is rejected, so the data support the view that manual cars have a higher mpg. However, with an R-squared of 0.3598, only approximately 36 % of the variance in mpg is explained by the transmission type.
This calls for further analysis using the other variables in the dataset. The selection of variables follows a further investigation of the variables influencing mpg (see appendix: Further exploratory analysis).
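The simple regression on am can be read as a comparison of group means: the intercept is the mean for automatic cars, and the amManual coefficient is the difference to manual cars. The sketch below checks this and adds a 95 % confidence interval for the difference. The names `cars`, `fit_am`, `automatic_mean` and `manual_mean` are introduced here for illustration only.

```r
# Fresh copy of the data with am as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Same model as the simple regression in the text
fit_am <- lm(mpg ~ am, data = cars)
b <- coef(fit_am)

# Intercept = mean of the reference group (automatic);
# intercept + slope = mean of the manual group
automatic_mean <- unname(b["(Intercept)"])
manual_mean <- unname(b["(Intercept)"] + b["amManual"])
c(Automatic = automatic_mean, Manual = manual_mean)

# 95 % confidence interval for the manual-minus-automatic difference
ci_diff <- confint(fit_am, "amManual", level = 0.95)
ci_diff
```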
fit2 <- lm(mpg ~ am + hp, mtcars)
fit3 <- lm(mpg ~ am + hp + cyl, mtcars)
fit4 <- lm(mpg ~ am + hp + cyl + disp, mtcars)
fit5 <- lm(mpg ~ am + hp + cyl + disp + wt, mtcars)
anova(fit, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + hp
## Model 3: mpg ~ am + hp + cyl
## Model 4: mpg ~ am + hp + cyl + disp
## Model 5: mpg ~ am + hp + cyl + disp + wt
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 29 245.44 1 475.46 75.7841 3.499e-09 ***
## 3 28 220.55 1 24.89 3.9667 0.057011 .
## 4 27 216.37 1 4.19 0.6672 0.421464
## 5 26 163.12 1 53.25 8.4872 0.007257 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values of the ANOVA show that adding horsepower (model 2) and weight (model 5) significantly improves the explanation of the variation in mpg, whereas cylinders (model 3) and displacement (model 4) do not at the 5 % level. Therefore we build our final regression model containing the transmission type, horsepower and weight.
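To make the F statistics in the ANOVA table less opaque, the following sketch recomputes the value for adding horsepower by hand. When several nested models are passed to anova(), R divides each drop in residual sum of squares by the residual mean square of the largest model. The names `cars`, `m1`, `m2`, `m5`, `rss`, `f_hp` and `p_hp` are assumptions for this sketch.

```r
# Fresh copy of the data with am as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

m1 <- lm(mpg ~ am, data = cars)
m2 <- lm(mpg ~ am + hp, data = cars)
m5 <- lm(mpg ~ am + hp + cyl + disp + wt, data = cars)

# Residual sum of squares of a fitted model
rss <- function(model) sum(residuals(model)^2)

# Denominator: residual mean square of the largest model
scale <- rss(m5) / df.residual(m5)

# Numerator: drop in RSS per added degree of freedom when hp is added
f_hp <- ((rss(m1) - rss(m2)) / (df.residual(m1) - df.residual(m2))) / scale
p_hp <- pf(f_hp, df1 = 1, df2 = df.residual(m5), lower.tail = FALSE)

c(F = f_hp, p = p_hp)
```

The same arithmetic with the tabulated values is 475.46 / (163.12 / 26), which gives roughly 75.78.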
fit_final <- lm(mpg ~ am + hp + wt, mtcars)
summary(fit_final)
##
## Call:
## lm(formula = mpg ~ am + hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4221 -1.7924 -0.3788 1.2249 5.5317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.002875 2.642659 12.867 2.82e-13 ***
## amManual 2.083710 1.376420 1.514 0.141268
## hp -0.037479 0.009605 -3.902 0.000546 ***
## wt -2.878575 0.904971 -3.181 0.003574 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared: 0.8399, Adjusted R-squared: 0.8227
## F-statistic: 48.96 on 3 and 28 DF, p-value: 2.908e-11
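With an R-squared of 0.8399, this model explains approximately 84 % of the variation in mpg. Holding horsepower and weight constant, manual cars are estimated to achieve 2.08 mpg more than automatic cars, but this difference is no longer statistically significant (p = 0.141).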
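One way to read the transmission coefficient of this model is to compare predictions for two cars that differ only in transmission type. The sketch below does this for two hypothetical cars with average horsepower and weight and also reports a confidence interval for the am effect. The names `cars`, `fit_adj`, `new_cars`, `pred` and `ci_am` are introduced for illustration.

```r
# Fresh copy of the data with am as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Same specification as the final model in the text
fit_adj <- lm(mpg ~ am + hp + wt, data = cars)

# Two hypothetical cars, identical in hp and wt, differing only in transmission
new_cars <- data.frame(
  am = factor(c("Automatic", "Manual"), levels = levels(cars$am)),
  hp = mean(cars$hp),
  wt = mean(cars$wt)
)
pred <- predict(fit_adj, newdata = new_cars)

# The gap between the two predictions equals the amManual coefficient
adjusted_diff <- unname(diff(pred))
adjusted_diff

# 95 % confidence interval for the adjusted transmission effect
ci_am <- confint(fit_adj, "amManual")
ci_am
```

An interval that contains zero is consistent with the non-significant p-value of the amManual coefficient in the summary output.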
Further exploratory analysis shows that variables such as hp, cyl, disp and wt seem to influence mpg, too. Therefore, further regression analysis using these variables is done in the main part.
AIC(fit) # mpg ~ am
## [1] 196.4844
AIC(fit2) # mpg ~ am + hp
## [1] 164.0061
AIC(fit3) # mpg ~ am + hp + cyl
## [1] 162.5849
AIC(fit4) # mpg ~ am + hp + cyl + disp
## [1] 163.9718
AIC(fit5) # mpg ~ am + hp + cyl + disp + wt
## [1] 156.932
AIC(fit_final) # mpg ~ am + hp + wt
## [1] 156.1348
Model selection should follow the principle of parsimony; however, the model should still explain as much variation in the outcome variable as possible. The Akaike Information Criterion (AIC) helps us find a good balance between these two principles. As shown above, our fit_final model yields the lowest AIC (156.1348) and therefore seems to be the best model to choose.
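The AIC comparison can also be collected in one step, and the criterion itself can be recomputed from the log-likelihood as -2 log L + 2k, where k counts the estimated parameters including the error variance. The sketch below is one possible way to do this; the names `cars`, `models`, `aic_values` and `aic_manual` are assumptions for illustration.

```r
# Fresh copy of the data with am as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

# Candidate models, named by their right-hand side
models <- list(
  "am"                        = lm(mpg ~ am, data = cars),
  "am + hp"                   = lm(mpg ~ am + hp, data = cars),
  "am + hp + cyl"             = lm(mpg ~ am + hp + cyl, data = cars),
  "am + hp + cyl + disp"      = lm(mpg ~ am + hp + cyl + disp, data = cars),
  "am + hp + cyl + disp + wt" = lm(mpg ~ am + hp + cyl + disp + wt, data = cars),
  "am + hp + wt"              = lm(mpg ~ am + hp + wt, data = cars)
)

# AIC per model, sorted from best (lowest) to worst
aic_values <- sapply(models, AIC)
sort(aic_values)

# AIC by hand for one model: -2 * log-likelihood + 2 * number of parameters
ll <- logLik(models[["am + hp + wt"]])
aic_manual <- -2 * as.numeric(ll) + 2 * attr(ll, "df")
aic_manual
```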
fit_without_am <- lm(mpg ~ hp + wt, mtcars)
summary(fit_without_am)
##
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.941 -1.600 -0.182 1.050 5.854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
## hp -0.03177 0.00903 -3.519 0.00145 **
## wt -3.87783 0.63273 -6.129 1.12e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
## F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
AIC(fit_without_am)
## [1] 156.6523
Explaining variations in mpg without using am as a variable seems to be a reasonable idea, too. Our linear regression model using hp and wt yields an R-squared of 0.8268 and an AIC of 156.6523, which is comparable to that of our final linear model in the main analysis.
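Whether am adds anything beyond hp and wt can also be checked with a nested-model F-test. The sketch below compares the two models directly; for a single added term, the F-test p-value coincides with the t-test p-value of that term in the larger model. The names `cars`, `reduced`, `full`, `cmp`, `p_f` and `p_t` are chosen for this sketch.

```r
# Fresh copy of the data with am as a labelled factor
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

reduced <- lm(mpg ~ hp + wt, data = cars)       # model without am
full    <- lm(mpg ~ am + hp + wt, data = cars)  # model with am

# Nested-model F-test: does adding am reduce the residual variance significantly?
cmp <- anova(reduced, full)
cmp

# p-value of the F-test and of the amManual t-test in the full model
p_f <- cmp[["Pr(>F)"]][2]
p_t <- summary(full)$coefficients["amManual", "Pr(>|t|)"]
c(F_test = p_f, t_test = p_t)
```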