This analysis is part of the Coursera Regression Models class. The course project addresses the following questions:

You work for Motor Trend, a magazine about the automobile industry. Looking at a data set of a collection of cars, they are interested in exploring the relationship between a set of variables and miles per gallon (MPG) (outcome). They are particularly interested in the following two questions:

  • Is an automatic or manual transmission better for MPG?
  • Quantify the MPG difference between automatic and manual transmissions.

Management Summary

The mean mpg for cars with an automatic transmission is 17.15, whereas for cars with a manual transmission it is 24.39. This suggests that cars with a manual transmission achieve better mpg, and a simple linear regression supports this hypothesis. However, multivariate regression shows that other factors such as horsepower, number of cylinders, displacement and weight influence mpg as well. Further investigation might make sense, since the analysis below implies that factors such as horsepower and weight alone might explain the variation in mpg better than the transmission type.

Exploratory data analysis

Looking at the dataset

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Preparing the dataset

For the following analysis we transform the variable am into a factor variable.

mtcars$am <- factor(mtcars$am, labels=c("Automatic","Manual"))
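
A quick check of the recoding (not part of the original report; mtcars contains 19 automatic and 13 manual cars):

table(mtcars$am)   # expected counts: Automatic 19, Manual 13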

Comparing means of mpg for automatic and manual transmission

mpgmean <- aggregate(mtcars$mpg, by=list(mtcars$am), FUN=mean)
colnames(mpgmean) <- c("am", "mpg")
mpgmean
##          am      mpg
## 1 Automatic 17.14737
## 2    Manual 24.39231
mpgmean$mpg[2] - mpgmean$mpg[1]
## [1] 7.244939

As the boxplot and the calculation above show, the mean mpg for cars with an automatic transmission is 17.15, whereas for manual cars it is 24.39, a difference of 7.25. Manual cars therefore appear to achieve higher mpg than automatic cars. The following analysis examines this relationship with a linear regression.
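
The boxplot referred to above is not reproduced here; a minimal sketch to recreate it from the prepared data:

boxplot(mpg ~ am, data = mtcars,
        xlab = "Transmission", ylab = "Miles per gallon (mpg)",
        main = "mpg by transmission type")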

Linear Regression

fit <- lm(mpg ~ am, mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3923 -3.0923 -0.2974  3.2439  9.5077 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared:  0.3598, Adjusted R-squared:  0.3385 
## F-statistic: 16.86 on 1 and 30 DF,  p-value: 0.000285

With a p-value of 0.000285 the coefficient for manual transmission is statistically significant, so the data support the hypothesis that manual cars achieve higher mpg. However, with an R-squared of 0.3598, only approximately 36 % of the variance in mpg is explained by the transmission type.

This suggests a multivariate analysis over the whole dataset. The selection of variables follows from the investigation of variables influencing mpg (see appendix: Further exploratory analysis).

fit2 <- lm(mpg ~ am + hp, mtcars)
fit3 <- lm(mpg ~ am + hp + cyl, mtcars)
fit4 <- lm(mpg ~ am + hp + cyl + disp, mtcars)
fit5 <- lm(mpg ~ am + hp + cyl + disp + wt, mtcars)

anova(fit, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
## 
## Model 1: mpg ~ am
## Model 2: mpg ~ am + hp
## Model 3: mpg ~ am + hp + cyl
## Model 4: mpg ~ am + hp + cyl + disp
## Model 5: mpg ~ am + hp + cyl + disp + wt
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     30 720.90                                   
## 2     29 245.44  1    475.46 75.7841 3.499e-09 ***
## 3     28 220.55  1     24.89  3.9667  0.057011 .  
## 4     27 216.37  1      4.19  0.6672  0.421464    
## 5     26 163.12  1     53.25  8.4872  0.007257 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analysing the p-values of the ANOVA shows that it seems necessary to include horsepower (model 2) and weight (model 5) to better explain the variation in mpg. Therefore we build our final regression model containing the transmission type, horsepower and weight.

fit_final <- lm(mpg ~ am + hp + wt, mtcars)
summary(fit_final)
## 
## Call:
## lm(formula = mpg ~ am + hp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4221 -1.7924 -0.3788  1.2249  5.5317 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.002875   2.642659  12.867 2.82e-13 ***
## amManual     2.083710   1.376420   1.514 0.141268    
## hp          -0.037479   0.009605  -3.902 0.000546 ***
## wt          -2.878575   0.904971  -3.181 0.003574 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.538 on 28 degrees of freedom
## Multiple R-squared:  0.8399, Adjusted R-squared:  0.8227 
## F-statistic: 48.96 on 3 and 28 DF,  p-value: 2.908e-11

With an R-squared of 0.8399, this model explains approximately 84 % of the variation in mpg. Note, however, that the coefficient for the manual transmission (2.08 mpg) is no longer statistically significant (p = 0.141) once horsepower and weight are included.
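
To quantify the uncertainty of these estimates, confidence intervals for the coefficients can be computed (a small addition, not part of the original report):

# 95 % confidence intervals for the final model's coefficients;
# the interval for amManual includes zero, consistent with its p-value of 0.141
confint(fit_final, level = 0.95)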

Appendix

Quality of the final linear regression model mpg ~ am + hp + wt
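
The observations below refer to the standard diagnostic plots of the final model, which are not reproduced here. A minimal sketch to generate them:

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(fit_final)        # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage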

  • Residuals vs. Fitted: The residuals show a non-linear pattern that is not captured by our model. This might cause problems and should be investigated in further analyses.
  • Normal Q-Q: No problems concerning the normality assumption can be observed over most of the range of data points; however, the points in the upper right corner might be concerning.
  • Scale-Location: The residuals seem to spread randomly, no pattern can be observed.
  • Residuals vs. Leverage: There are no extreme data points outside Cook’s distance of 0.5 or 1.

Further exploratory analysis

Further exploratory analysis shows that variables such as hp, cyl, disp and wt seem to influence mpg as well. Therefore further regression analysis using these variables is done in the main part; a sketch of this exploration is shown below.
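
The exploratory plots themselves are not included here; a minimal sketch of how this exploration could look (the choice of candidate variables follows the text above):

# Pairwise scatterplots of mpg against the candidate predictors
pairs(mtcars[, c("mpg", "hp", "cyl", "disp", "wt")],
      main = "mpg and candidate predictors")

# Correlations of mpg with the candidate predictors
cor(mtcars[, c("mpg", "hp", "cyl", "disp", "wt")])["mpg", ]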

AIC analysis

AIC(fit) # mpg ~ am
## [1] 196.4844
AIC(fit2) # mpg ~ am + hp
## [1] 164.0061
AIC(fit3) # mpg ~ am + hp + cyl
## [1] 162.5849
AIC(fit4) # mpg ~ am + hp + cyl + disp
## [1] 163.9718
AIC(fit5) # mpg ~ am + hp + cyl + disp + wt
## [1] 156.932
AIC(fit_final)# mpg ~ am + hp + wt
## [1] 156.1348

Model selection should follow the principle of parsimony; however, the model should still explain as much variation as possible in the outcome variable. The Akaike Information Criterion (AIC) helps us find a good balance between these two principles. As shown above, our fit_final model yields the lowest AIC with 156.1348 and therefore seems to be the best model to choose.
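
As an aside, the AIC values above can also be collected in a single call (a compact alternative, not part of the original report):

# AIC for all candidate models at once
sapply(list(am = fit, am_hp = fit2, am_hp_cyl = fit3,
            am_hp_cyl_disp = fit4, am_hp_cyl_disp_wt = fit5,
            am_hp_wt = fit_final), AIC)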

Explanation without am

fit_without_am <- lm(mpg ~ hp + wt, mtcars)
summary(fit_without_am)
## 
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
AIC(fit_without_am)
## [1] 156.6523

Explaining the variation in mpg without using am as a variable seems to be a reasonable idea, too. Our linear regression model using hp and wt yields an R-squared of 0.8268 and an AIC of 156.6523, both comparable to our final linear model in the main analysis.
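
The two models can also be compared formally with a nested ANOVA (a sketch, not part of the original analysis); since am is the only added term, its p-value equals that of the amManual coefficient in fit_final (0.141):

# Does adding the transmission type significantly improve the fit?
anova(fit_without_am, fit_final)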