by Sandy Sng
28 May 2018
library(datasets)
data(mtcars)
?mtcars
str(mtcars)
Look at the correlation between “mpg” and all other variables, paying particular attention to “am”.
cor(mtcars$mpg,mtcars[,-1])
##            cyl       disp         hp      drat         wt     qsec
## [1,] -0.852162 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.418684
##             vs        am      gear       carb
## [1,] 0.6640389 0.5998324 0.4802848 -0.5509251
We see that the correlation between “mpg” and “am” is positive (+0.5998). Under ?mtcars, “am” is documented as Transmission (0 = automatic, 1 = manual), so high “mpg” goes with high “am”, i.e. with manual transmission.
This suggests that cars with manual transmission tend to have higher mpg.
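As a minimal sketch of the same check, the correlations can be sorted by strength and the mean mpg compared between the two transmission types. The names `cars`, `r_sorted` and `mpg_by_am` are illustrative only; the code works on a local copy of the data so that the `mtcars` object used in the analysis is not changed.

```r
# Work on a local copy so the analysis data frame is left untouched.
cars <- datasets::mtcars

# Correlation of mpg with every other column, sorted by absolute strength.
r <- cor(cars$mpg, cars[, -1])[1, ]
r_sorted <- r[order(abs(r), decreasing = TRUE)]
print(round(r_sorted, 3))

# Mean mpg per transmission type (0 = automatic, 1 = manual).
mpg_by_am <- tapply(cars$mpg, cars$am, mean)
print(mpg_by_am)
```

A positive correlation with a 0/1 variable simply means the group coded 1 has the higher mean, which is what the group means show directly.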
Convert “am Transmission (0 = automatic, 1 = manual)” to a factor with labelled levels, then test this hypothesis with a two-sample t-test at the 95% confidence level.
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <-c("Automatic", "Manual")
t.test(mtcars$mpg~mtcars$am,conf.level=0.95)
##
## Welch Two Sample t-test
##
## data: mtcars$mpg by mtcars$am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean in group Automatic mean in group Manual
## 17.14737 24.39231
Since the p-value (0.001374) is below 0.05, we reject the null hypothesis. There is a statistically significant difference in mean mpg between automatic (17.15) and manual (24.39) transmissions.
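To make the test less of a black box, here is a rough sketch that recomputes the Welch t statistic by hand and then pulls the same numbers out of the object returned by `t.test()`. The variable names (`cars`, `auto`, `manual`, `t_by_hand`, `tt`) are illustrative, and the data are taken from a fresh copy of `mtcars` in which `am` is still coded 0/1.

```r
cars <- datasets::mtcars
auto   <- cars$mpg[cars$am == 0]
manual <- cars$mpg[cars$am == 1]

# Welch t statistic: difference in means over the unpooled standard error.
se <- sqrt(var(auto) / length(auto) + var(manual) / length(manual))
t_by_hand <- (mean(auto) - mean(manual)) / se

# Same test via t.test(); the result is a list whose pieces can be reused.
tt <- t.test(auto, manual, conf.level = 0.95)
print(c(t_by_hand = t_by_hand, t_from_test = unname(tt$statistic)))
print(tt$p.value)
print(tt$conf.int)
```

The sign of t is negative only because the automatic group is listed first; the interval describes automatic minus manual.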
Since we rejected the null hypothesis, we can fit a multivariate regression to see whether more of the variance can be explained.
Check which predictors are highly correlated with one another, causing variance inflation.
library(car) # to use vif function
## Loading required package: carData
fit <- lm(mpg ~ . , data = mtcars)
vif(fit)
##       cyl      disp        hp      drat        wt      qsec        vs
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873
##        am      gear      carb
##  4.648487  5.357452  7.908747
Note that the variance is inflated by a huge factor for the variables “cyl” (15.4), “disp” (21.6), and “wt” (15.2). We will subsequently include all three of these variables in the multivariate regression model m3.
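For intuition about what `vif()` reports, the following sketch reproduces the value for “wt” in base R: regress that predictor on all the other predictors and take 1 / (1 - R²). The names `cars`, `aux`, `r2_wt` and `vif_wt` are illustrative, and the code uses a fresh copy of `mtcars` rather than the data frame modified above.

```r
cars <- datasets::mtcars

# Regress wt on all other predictors (everything except the response mpg).
aux <- lm(wt ~ . - mpg, data = cars)
r2_wt <- summary(aux)$r.squared

# VIF = 1 / (1 - R^2) of that auxiliary regression.
vif_wt <- 1 / (1 - r2_wt)
print(vif_wt)
```

A large value means the other predictors already explain most of the variation in “wt”, so its coefficient in the full model is estimated imprecisely.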
Try several multivariate regression models to find a better fit than the simple model m1 (mpg ~ am). For m1, Residual Standard Error (RSE) = 4.902 and Multiple R-squared (R2) = 0.3598: m1 predicts mpg with an average error of about 4.9 mpg and explains only 36% of the variance. This is not the best model.
A better alternative model will have a lower RSE and higher multiple R-squared value.
m1 <- lm(mpg~am, data = mtcars) # RSE 4.902, multipleR2 0.3598
m2 <- lm(mpg~am + wt + cyl + hp, data = mtcars) # RSE 2.509, multipleR2 0.849
m3 <- lm(mpg~am + wt + cyl + disp, data = mtcars) # RSE 2.642, multipleR2 0.8327
m4 <- lm(mpg~am + wt + cyl + disp + hp + carb, data = mtcars) # RSE 2.541, multipleR2 0.8566
anova(m1, m2, m3, m4) # test whether adding more variables is necessary
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ am + wt + cyl + hp
## Model 3: mpg ~ am + wt + cyl + disp
## Model 4: mpg ~ am + wt + cyl + disp + hp + carb
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)
## 1     30 720.90
## 2     27 170.00  3    550.90 28.4370 3.191e-08 ***
## 3     27 188.43  0    -18.43
## 4     25 161.44  2     26.99  2.0896    0.1448
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Of the models tried, m2 and m4 are the better options because they have relatively lower RSE and higher multiple R2 values. From the Analysis of Variance Table (anova table), we see that m2 is a statistically significant improvement over our simple linear model m1 (p = 3.191e-08), and that it has more residual degrees of freedom than m4 (27 versus 25). The final step to m4 is not significant (p = 0.1448). As such, it is not necessary to include 2 more variables (disp and carb) to move from m2 to m4.
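Because `anova()` with several models compares each model with the one listed just before it, the last row of the table sets m4 against m3 rather than against m2. A hedged sketch of the direct comparison is shown here; `fit_small` and `fit_large` are illustrative names for refits of m2 and m4 on a local copy of the data.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

fit_small <- lm(mpg ~ am + wt + cyl + hp, data = cars)                # same formula as m2
fit_large <- lm(mpg ~ am + wt + cyl + disp + hp + carb, data = cars)  # same formula as m4

# fit_small is nested in fit_large, so this F-test asks whether
# disp and carb add explanatory power beyond am, wt, cyl and hp.
cmp <- anova(fit_small, fit_large)
print(cmp)
```

The same caution applies to the m2 and m3 row of the table: those two models have the same number of terms and are not nested, which is why that row has 0 degrees of freedom and no F statistic.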
m2 is the better alternative model: it has an RSE (average error) of 2.509 mpg and a multiple R-squared value of 0.849 (84.9%). Its average error is lower than that of the simple linear model m1 (RSE = 4.902), and it explains more variance than m1 (which explains only 36%).
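The RSE and R-squared figures quoted for the four models can be collected in one table instead of being read off four separate `summary()` outputs. This is only a sketch: `cars`, `models` and `fit_stats` are illustrative names, and the adjusted R-squared column is an extra that the analysis above does not use, included because plain R-squared never decreases when variables are added.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))

models <- list(
  m1 = lm(mpg ~ am, data = cars),
  m2 = lm(mpg ~ am + wt + cyl + hp, data = cars),
  m3 = lm(mpg ~ am + wt + cyl + disp, data = cars),
  m4 = lm(mpg ~ am + wt + cyl + disp + hp + carb, data = cars)
)

# One row per model: residual standard error, R-squared, adjusted R-squared.
fit_stats <- t(sapply(models, function(m) {
  s <- summary(m)
  c(RSE = s$sigma, R2 = s$r.squared, adjR2 = s$adj.r.squared)
}))
print(round(fit_stats, 4))
```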
Given the above analysis, to answer the question of “Is an automatic or manual transmission better for MPG”, we have to consider three more variables: Weight, Number of cylinders, and Gross horsepower, instead of just the Transmission (auto/manual).
To check the validity of the model (m2), we will have to check the 4 assumptions required to use a linear model as an explainer/predictor:
par(mfrow = c(2,2))
plot(m2)
Residuals vs Leverage graph: shows no influential outliers. No points have extreme Cook’s distances, so no observation appears to exert too much influence or leverage.
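The reading of the Residuals vs Leverage plot can be backed up numerically with `cooks.distance()`. The sketch below refits the m2 formula on a local copy of the data (as `fit_m2`, an illustrative name) and lists the cars with the largest values. The two cutoffs, 1 and 4/n, are common rules of thumb and are assumptions here, not part of the original analysis.

```r
cars <- datasets::mtcars
cars$am <- factor(cars$am, labels = c("Automatic", "Manual"))
fit_m2 <- lm(mpg ~ am + wt + cyl + hp, data = cars)  # same formula as m2

# Cook's distance for each car, largest first.
cd <- sort(cooks.distance(fit_m2), decreasing = TRUE)
print(round(head(cd, 5), 3))

# Rule-of-thumb cutoffs (assumptions, not part of the original analysis).
print(names(cd)[cd > 1])               # commonly used "large" threshold
print(names(cd)[cd > 4 / nrow(cars)])  # stricter screening threshold
```

Cars flagged only by the stricter cutoff are worth a look but are not necessarily a problem for the model.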