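In the following, the 1974 dataset mtcars is analyzed to determine whether an automatic or a manual transmission is better for fuel efficiency (mpg) and to quantify the mpg difference between the two.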
The findings suggest that there is a statistically significant difference between manual and automatic transmissions: when only the transmission is taken into account and no other factors, manual cars are more efficient by an average of 7.2 mpg.
Finding a more accurate model for the efficiency (mpg) requires other factors such as the weight or acceleration (1/4 mile time, qsec) of the car. Many of the parameters in the dataset are correlated with each other, so only the most significant were selected for the predictive model. After accounting for weight and acceleration in the regression model, the transmission still has a significant impact on gas mileage, which can be quantified as an increase of 2.9 mpg for a manual vs. an automatic car.
Hence it can be concluded that even after looking at secondary parameters, cars with an automatic transmission were less efficient in 1974 (see below for remark on this topic).
The models and results were checked against the regression assumptions (e.g. homoscedasticity), and the complete analysis is presented below. It should be noted that the wide confidence intervals are partly due to the low number of data points (amongst other reasons).
First we will need to find out the basics about the dataset.
library(ggplot2)
# No setwd() needed: mtcars ships with base R (datasets package)
# Load data
data(mtcars)
head(mtcars,3)
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Since the transmission type (automatic or manual) is a factor variable (0/1), it should be defined as such. A pairs plot is produced for a first look at the data. This helps to identify which parameters mpg is correlated with. (The plot was suppressed here and can be found in the appendix.)
# Make am factor
mtcars$am <- factor(mtcars$am, levels = c(0,1), labels = c("Automatic", "Manual"))
# Plot all against all
pairs(mtcars, panel = panel.smooth)
It appears that many of the variables are correlated with mpg, but also strongly correlated with each other. This is causally/logically explainable: for example, engines with a large number of cylinders have to have a larger displacement, but also deliver higher performance (horsepower), which in turn results in greater torque and acceleration (qsec).
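The visual impression from the pairs plot can be backed by numbers. The following is a minimal sketch, assuming a fresh copy of mtcars in which am is still numeric (0/1); the names `cars` and `cor_mpg` are illustrative and not part of the original analysis.

```r
# Fresh copy of the data; all columns are numeric here (am is still 0/1)
cars <- datasets::mtcars

# Pearson correlation of every other variable with mpg (mpg is column 1),
# sorted by absolute strength
cor_mpg <- cor(cars)["mpg", -1]
cor_mpg <- cor_mpg[order(abs(cor_mpg), decreasing = TRUE)]
print(round(cor_mpg, 2))

# Correlation among the predictors themselves, e.g. cylinders vs. displacement
print(round(cor(cars$cyl, cars$disp), 2))
```

Large absolute values in both outputs would be consistent with the collinearity described above.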
It also appears that heavier cars (usually SUVs, high-end sedans, etc.) are more expensive and thus more likely to be equipped with an automatic rather than a manual transmission. Lastly, automatic transmissions were historically heavier than their manual counterparts (note that this data was released in 1974; more modern (>2000) automatic transmissions are actually lighter and more efficient than manual transmissions).
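The association between weight and transmission type in this sample can be checked directly. This is a small sketch on a fresh copy of the data (the names `cars` and `wt_by_am` are illustrative); it only shows the association in these 32 cars and says nothing about the reasons behind it.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Mean weight (wt is given in 1000 lbs) per transmission type
wt_by_am <- tapply(cars$wt, cars$am, mean)
print(round(wt_by_am, 2))
```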
First, a basic model only considering mpg and transmission type was tested:
m1 <- lm(mpg ~ am, data = mtcars)
summary(m1)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.3923 -3.0923 -0.2974  3.2439  9.5077
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   17.147      1.125  15.247 1.13e-15 ***
## amManual       7.245      1.764   4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
This result can be interpreted as follows: an automatic transmission reduces the efficiency by an average of 7.2 mpg compared to a manual one. In addition, the transmission type alone explains about 34% of the variance of mpg (adjusted R-squared).
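With a single two-level factor as regressor, the coefficients are just group means in disguise. The sketch below illustrates this on a fresh copy of the data; `cars`, `fit_simple` and `group_means` are names chosen here for illustration.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit_simple  <- lm(mpg ~ am, data = cars)
group_means <- tapply(cars$mpg, cars$am, mean)

# Intercept = mean mpg of the reference level (Automatic)
print(c(intercept      = unname(coef(fit_simple)["(Intercept)"]),
        mean_automatic = unname(group_means["Automatic"])))

# Slope = difference between the Manual and Automatic group means
print(c(slope           = unname(coef(fit_simple)["amManual"]),
        mean_difference = unname(group_means["Manual"] - group_means["Automatic"])))
```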
Next, a full model including all variables in the dataset was tested:
m2 <- lm(mpg ~ ., data = mtcars)
summary(m2)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4506 -1.6044 -0.1196  1.2193  4.6271
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337   18.71788   0.657   0.5181
## cyl         -0.11144    1.04502  -0.107   0.9161
## disp         0.01334    0.01786   0.747   0.4635
## hp          -0.02148    0.02177  -0.987   0.3350
## drat         0.78711    1.63537   0.481   0.6353
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739
## vs           0.31776    2.10451   0.151   0.8814
## amManual     2.52023    2.05665   1.225   0.2340
## gear         0.65541    1.49326   0.439   0.6652
## carb        -0.19942    0.82875  -0.241   0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
This model explains 81% of the variance (adjusted R-squared) and is significant overall (F-statistic), but none of the individual variables is significant at the 5% level (t-values) because the predictors are strongly correlated with each other.
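One common way to quantify such collinearity is the variance inflation factor (VIF). The original analysis does not compute it; the following is a base-R sketch (no extra packages) that assumes am is kept numeric (0/1), with `predictors` and `vif` as illustrative names.

```r
cars <- datasets::mtcars                     # am kept numeric (0/1) for this check
predictors <- cars[, names(cars) != "mpg"]

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on
# all other predictors. This equals the j-th diagonal element of the inverse
# correlation matrix of the predictors.
vif <- setNames(diag(solve(cor(predictors))), names(predictors))
print(round(sort(vif, decreasing = TRUE), 1))
```

Values far above 1 mean that the standard error of the corresponding coefficient is inflated by its overlap with the other regressors, which matches the pattern of a significant F-test combined with insignificant t-tests.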
In order to find the best model, we can use stepwise regression to find the model with the lowest AIC (Akaike Information Criterion):
m3 <- step(m2, trace = FALSE)
summary(m3)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4811 -1.5555 -0.7257  1.4110  4.6610
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   9.6178     6.9596   1.382 0.177915
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## qsec          1.2259     0.2887   4.247 0.000216 ***
## amManual      2.9358     1.4109   2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
The best-fitting model for mpg appears to use weight, 1/4 mile time and transmission type as predictors.
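The AIC values of the three models considered so far can be compared side by side. This is a sketch on a fresh copy of the data; the model names `fit_simple`, `fit_full` and `fit_reduced` are chosen here for illustration and correspond to the transmission-only, full and stepwise-selected models.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

fit_simple  <- lm(mpg ~ am, data = cars)              # transmission only
fit_full    <- lm(mpg ~ ., data = cars)               # all predictors
fit_reduced <- lm(mpg ~ wt + qsec + am, data = cars)  # model selected by step()

# Lower AIC indicates a better trade-off between goodness of fit and complexity
aic_table <- AIC(fit_simple, fit_full, fit_reduced)
print(aic_table)
```

Note that `step()` reports AIC values from `extractAIC()`, which differ from `AIC()` by an additive constant for a given dataset, so the ranking of the models is the same.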
A further model using interaction terms was tested in the appendix.
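The coefficients can be interpreted as follows (each with the other two regressors held constant):
* wt: every additional 1000 lbs of weight lowers the efficiency by about 3.9 mpg
* qsec: every additional second of 1/4 mile time raises the efficiency by about 1.2 mpg
* am: a manual transmission raises the efficiency by about 2.9 mpg compared to an automatic one
Based on the t-values found for the regressors (see above), we can conclude that all three (wt, qsec and am) are significant at the 5% level, whereas the intercept is not.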
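The role of the transmission coefficient can be illustrated with a prediction. The sketch below refits the selected model on a fresh copy of the data and predicts mpg for a hypothetical car (3000 lbs, 18 s quarter-mile time; these values are made up for illustration) with either transmission.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit_reduced <- lm(mpg ~ wt + qsec + am, data = cars)

# Hypothetical car, once per transmission type (illustrative values only)
new_cars <- data.frame(
  wt   = c(3, 3),
  qsec = c(18, 18),
  am   = factor(c("Automatic", "Manual"), levels = levels(cars$am))
)
pred <- predict(fit_reduced, newdata = new_cars)
names(pred) <- as.character(new_cars$am)
print(round(pred, 2))

# The gap between the two predictions equals the amManual coefficient
print(unname(pred["Manual"] - pred["Automatic"]))
```

Because the model has no interaction terms, this gap is the same for any choice of weight and quarter-mile time.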
Running an ANOVA that compares this model with the first, simple model shows that the additional regressors (wt and qsec) improve the fit significantly:
anova(m1,m3)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)
## 1     30 720.90
## 2     28 169.29  2    551.61 45.618 1.55e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
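The F statistic in the table can be reproduced by hand from the residual sums of squares of the two nested models. This is a sketch on a fresh copy of the data, with illustrative variable names.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit_simple  <- lm(mpg ~ am, data = cars)
fit_reduced <- lm(mpg ~ wt + qsec + am, data = cars)

rss_simple  <- sum(resid(fit_simple)^2)
rss_reduced <- sum(resid(fit_reduced)^2)
df_extra    <- df.residual(fit_simple) - df.residual(fit_reduced)  # 2 added terms

# F = (drop in RSS per added term) / (residual variance of the larger model)
f_stat  <- ((rss_simple - rss_reduced) / df_extra) /
  (rss_reduced / df.residual(fit_reduced))
p_value <- pf(f_stat, df_extra, df.residual(fit_reduced), lower.tail = FALSE)
print(c(F = f_stat, p = p_value))
```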
confint(m3)
##                   2.5 %    97.5 %
## (Intercept) -4.63829946 23.873860
## wt          -5.37333423 -2.459673
## qsec         0.63457320  1.817199
## amManual     0.04573031  5.825944
The 95% confidence intervals are rather wide (especially for the intercept) due to the limited number of observations and shortcomings of the model. The number of datapoints should be increased to obtain more accurate results for the general population.
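The intervals follow the usual form estimate ± t-quantile × standard error, which also shows why few observations widen them: fewer residual degrees of freedom give a larger t-quantile and larger standard errors. A sketch that recomputes them on a fresh copy of the data (names are illustrative):

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit_reduced <- lm(mpg ~ wt + qsec + am, data = cars)

coef_table <- coef(summary(fit_reduced))
est <- coef_table[, "Estimate"]
se  <- coef_table[, "Std. Error"]

# Two-sided 95% interval: 0.975 quantile of t with the residual degrees of freedom
t_crit    <- qt(0.975, df = df.residual(fit_reduced))
ci_manual <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci_manual, 3))
```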
The following shows that the residuals
* are evenly scattered across the fitted mpg values (no visible trend in the residual plots) and
* follow approximately a normal distribution (Q-Q plot; the Shapiro-Wilk test gives p = 0.08, so normality is not rejected at the 5% level, see below),
which supports the assumption that the model is a good fit and that no trends were omitted.
par(mfrow = c(2,2))
plot(m3)
#Shapiro test for Normality
shapiro.test(m3$residuals)
##
## Shapiro-Wilk normality test
##
## data: m3$residuals
## W = 0.9411, p-value = 0.08043
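The visual check for homoscedasticity could be complemented by a numeric one. The original analysis does not include such a test; the sketch below hand-computes a studentized Breusch-Pagan (Koenker) statistic in base R as one possible option, with illustrative names throughout.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit_reduced <- lm(mpg ~ wt + qsec + am, data = cars)

# Auxiliary regression: squared residuals on the same regressors
cars$sq_resid <- resid(fit_reduced)^2
aux <- lm(sq_resid ~ wt + qsec + am, data = cars)

# Koenker's statistic: n * R^2 of the auxiliary regression, compared with a
# chi-squared distribution whose df equals the number of regressors (3)
lm_stat <- nrow(cars) * summary(aux)$r.squared
bp_p    <- pchisq(lm_stat, df = 3, lower.tail = FALSE)
print(c(statistic = lm_stat, p_value = bp_p))
```

A small p-value would suggest that the residual variance changes with the regressors; a large one gives no evidence against constant variance.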
pairs(mtcars, panel = panel.smooth)
We should not forget that weight and transmission and probably also 1/4 mile time are dependent on each other. Hence, let’s optimize our model further by including the interaction terms and run another stepwise regression over this:
m4 <- step(lm(mpg ~ am * qsec * wt, data = mtcars), trace = FALSE)
summary(m4)
##
## Call:
## lm(formula = mpg ~ am + qsec + wt + am:wt + qsec:wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.6264 -1.4660 -0.3559  1.1520  3.9559
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20.1094    23.5809  -0.853 0.401568
## amManual     14.0026     3.3918   4.128 0.000334 ***
## qsec          2.6831     1.3002   2.064 0.049171 *
## wt            6.6931     7.4051   0.904 0.374379
## amManual:wt  -4.1411     1.1815  -3.505 0.001675 **
## qsec:wt      -0.5401     0.4137  -1.306 0.203141
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.057 on 26 degrees of freedom
## Multiple R-squared: 0.9023, Adjusted R-squared: 0.8835
## F-statistic: 48 on 5 and 26 DF, p-value: 2.606e-12
The model found now explains 88% of the variance. However, with the interaction terms the interpretation of the coefficients becomes complicated.
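One way to read the interaction model is that the transmission effect is no longer a single number but depends on weight: it is the amManual coefficient plus the amManual:wt coefficient times wt. The sketch below refits the selected formula on a fresh copy of the data and evaluates this at a few illustrative weights.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))
fit_inter <- lm(mpg ~ am + qsec + wt + am:wt + qsec:wt, data = cars)
b <- coef(fit_inter)

# Estimated mpg difference (Manual minus Automatic) at chosen weights (1000 lbs)
weights   <- c(2, 3, 4)
am_effect <- unname(b["amManual"]) + unname(b["amManual:wt"]) * weights
names(am_effect) <- paste0("wt=", weights)
print(round(am_effect, 2))

# Weight at which the estimated manual advantage vanishes
crossover <- unname(-b["amManual"] / b["amManual:wt"])
print(round(crossover, 2))
```

Under this model the estimated advantage of a manual transmission shrinks with increasing weight and changes sign beyond the crossover weight, which should be read with caution given the small sample.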
Below are the t-test showing that there is a significant difference between the mpg of automatic and manual cars, and a comparative plot (violin and box plots).
t.test(mpg ~ am, data = mtcars)
##
## Welch Two Sample t-test
##
## data: mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic    mean in group Manual
##                17.14737                24.39231
# am is the transmission type (Automatic/Manual)
ggplot(mtcars, aes(am, mpg)) +
  geom_violin(fill = "lightskyblue1") +
  geom_boxplot(width = .25, fill = "salmon2") +
  xlab("Automatic/Manual") +
  ggtitle("Violin/Boxplot of Automatic vs. Manual") +
  geom_point()
Based on the t-test (and visually on the plot), we reject the null hypothesis in favor of the alternative that the transmission type has a significant influence on mpg. The mean difference in mpg between automatic and manual transmissions is (necessarily the same as in the simple regression model):
abs(mean(mtcars$mpg[mtcars$am == "Automatic"]) - mean(mtcars$mpg[mtcars$am == "Manual"]))
## [1] 7.244939
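The Welch statistic reported by `t.test()` can be recomputed from the group means and variances. This is a sketch on a fresh copy of the data in which am is still coded 0 (automatic) and 1 (manual); the variable names are illustrative.

```r
cars   <- datasets::mtcars
auto   <- cars$mpg[cars$am == 0]
manual <- cars$mpg[cars$am == 1]

# Per-group variance of the mean
v_auto   <- var(auto) / length(auto)
v_manual <- var(manual) / length(manual)

# Welch t statistic: difference in means over its (unpooled) standard error
t_stat <- (mean(auto) - mean(manual)) / sqrt(v_auto + v_manual)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_auto + v_manual)^2 /
  (v_auto^2 / (length(auto) - 1) + v_manual^2 / (length(manual) - 1))

p_value <- 2 * pt(-abs(t_stat), df = df_welch)
print(c(t = t_stat, df = df_welch, p = p_value))
```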
sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.3 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_2.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 digest_0.6.9 plyr_1.8.3 grid_3.2.3
## [5] gtable_0.1.2 formatR_1.2.1 magrittr_1.5 evaluate_0.8
## [9] scales_0.3.0 stringi_1.0-1 rmarkdown_0.9.2 labeling_0.3
## [13] tools_3.2.3 stringr_1.0.0 munsell_0.4.3 yaml_2.1.13
## [17] colorspace_1.2-6 htmltools_0.3 knitr_1.12.3