data(mtcars)
head(mtcars, 10)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Our response variable is mpg and predictors can be cyl, disp, hp, drat, wt (chosen from high correlations with mpg).
Looking at the correlation matrix, we can see that: 1. cyl and disp and hp are highly correlated (0.9 and 0.83 respectively) 2. cyl and drat and wt have high correlation too (-0.7, 0.78)
We do not want our model to have multicollinearity, even though, let us test it with those predictors to see how it will affect the model.
model1 = lm(mpg ~ cyl + disp + hp + drat + wt, data=mtcars)
summary(model1)
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7014 -1.6850 -0.4226 1.1681 5.7263
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.00836 7.57144 4.756 6.4e-05 ***
## cyl -1.10749 0.71588 -1.547 0.13394
## disp 0.01236 0.01190 1.039 0.30845
## hp -0.02402 0.01328 -1.809 0.08208 .
## drat 0.95221 1.39085 0.685 0.49964
## wt -3.67329 1.05900 -3.469 0.00184 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.538 on 26 degrees of freedom
## Multiple R-squared: 0.8513, Adjusted R-squared: 0.8227
## F-statistic: 29.77 on 5 and 26 DF, p-value: 5.618e-10
We have a number of statistics in the summary.
Firstly, the regression can be written as:
mpg = 36.00 -1.107 * cyl + 0.012 * disp - 0.024 * hp + 0.952 * drat - 3.673 * wt
The significance of cyl, disp and drat > 0.05, thus they may be removed from the model.
Coefficient of determination- The R-squared value is 0.8513, which says that 83% of the variability in the output is captured by our model. Adjusted R squared- The adj R-squared value is 0.8227. This is generally lower than R-squared as it is adjusted for the number of predictors in the model.
We do not want multicollinearity- a way to determine if there is multicollinearity is to calculate the Variance Inflation Factor (VIF).
VIF = 1/(1-Ri^2)
VIF > 5 indicates that there is a chance of multicollinearity. VIF > 10 signals high multi-collinearity.
library(regclass)
#A case where (some of) the VIFs are large
M <- lm(mpg~ cyl + disp + hp + drat + wt,data=mtcars)
VIF(M)
## cyl disp hp drat wt
## 7.869010 10.463957 3.990380 2.662298 5.168795
Here, we can see that disp has high VIF, thus it should be removed from the model. Testing VIF after removing disp, we get:
library(regclass)
#A case where (some of) the VIFs are large
M <- lm(mpg~ cyl + hp + drat + wt,data=mtcars)
VIF(M)
## cyl hp drat wt
## 6.173560 3.784670 2.639229 3.076225
Which thus says that multicollinearity has been removed.
model2 = lm(mpg ~ cyl + hp + drat + wt, data=mtcars)
summary(model2)
##
## Call:
## lm(formula = mpg ~ cyl + hp + drat + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6171 -1.5663 -0.6058 1.2612 5.8161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.49588 7.44101 4.636 8.1e-05 ***
## cyl -0.76229 0.63502 -1.200 0.24040
## hp -0.02089 0.01295 -1.613 0.11845
## drat 0.81771 1.38684 0.590 0.56034
## wt -2.97331 0.81818 -3.634 0.00116 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.541 on 27 degrees of freedom
## Multiple R-squared: 0.8451, Adjusted R-squared: 0.8222
## F-statistic: 36.84 on 4 and 27 DF, p-value: 1.438e-10
Now, the regression equation is:
mpg = 34.49588 -0.76229 * cyl - 0.02089 * hp + 0.81771 * drat - 2.97331 * wt
Corresponding plots:
plot(model2)
res1 = resid(model2)
plot(mtcars$mpg, res1)
Three things to look for in residual plots:
1. Normality- Check the distribution- whether the residuals are normally distributed or not. You can check the QQ plot- the points need to be close to the normality line. Or perform a Shapiro- Wilk test (best for N < 30), or Kolmogorov Smirnov test if N > 30.
2. Independence- Durbin Watson test, or see by inspection that there is no trend in the residual plot.
3. Constant Variance- Check by inspection that there is constant variance observed.
shapiro.test(res1)
##
## Shapiro-Wilk normality test
##
## data: res1
## W = 0.92655, p-value = 0.03143
hist(res1,
main="mpg",
xlab="mpg",
border="light blue",
col="blue",
las=1,
breaks=5)
durbinWatsonTest(model1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.04266962 1.850141 0.448
## Alternative hypothesis: rho != 0
Here, since the p-value < 0.05 (keeping 0.05 as the level of siginificance), we can say that residuals are not normally distributed.
This test is used to find out the influential points in the data. Such influential points can be removed from the data as we do not want our model to be influenced by 1 or two extreme values.
ols_plot_cooksd_bar(model2)
There are 2 influential points in the data which must be removed, which are Chrysler Imperial and Toyota Corolla.
Removing these two now:
mtcars <- mtcars[-c(17, 20), ]
model3 <- lm(mpg ~ cyl + hp + drat + wt, data=mtcars)
summary(model3)
##
## Call:
## lm(formula = mpg ~ cyl + hp + drat + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2970 -1.4431 -0.3317 1.3976 5.9577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.04450 6.41544 5.774 5.11e-06 ***
## cyl -0.64251 0.54366 -1.182 0.248401
## hp -0.02018 0.01101 -1.833 0.078705 .
## drat 0.24413 1.18790 0.206 0.838837
## wt -3.49843 0.77374 -4.521 0.000129 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.149 on 25 degrees of freedom
## Multiple R-squared: 0.8723, Adjusted R-squared: 0.8518
## F-statistic: 42.69 on 4 and 25 DF, p-value: 8.013e-11
mpg = 37.04450 -0.64251 * cyl - 0.02018 * hp + 0.24413 * drat - 3.49843 * wt
After removing the most influential points, the new R squared value has become 0.872 and Adjusted R squared value as 0.85, which is a better value.
Also, checking the residuals we get: 1. Normality- Shapiro wilk test gave alpha significance value 0.15 which is > 0.05 2. Independence- Durbin Watson test gave 1.72, thus indicating that the residuals are independent 3. Constant variance- Inspecting the residual plot, we can say that the residuals have constant variance.
## Checking residuals
res3 <- resid(model3)
shapiro.test(res3)
##
## Shapiro-Wilk normality test
##
## data: res3
## W = 0.94908, p-value = 0.1597
durbinWatsonTest(model3)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1061411 1.720848 0.328
## Alternative hypothesis: rho != 0
res3 <- resid(model3)
plot(mtcars$mpg, res3)
Now removing cyl and drat, we get our final resulting model as:
model4 <- lm(mpg ~ hp + wt, data=mtcars)
summary(model4)
##
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5021 -1.2231 -0.1892 1.2445 6.0970
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.390419 1.478587 25.288 < 2e-16 ***
## hp -0.029325 0.007518 -3.901 0.000575 ***
## wt -4.160000 0.567052 -7.336 6.84e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.152 on 27 degrees of freedom
## Multiple R-squared: 0.8617, Adjusted R-squared: 0.8515
## F-statistic: 84.13 on 2 and 27 DF, p-value: 2.514e-12
And our final equation is:
mpg = 37.390419 -0.029325 * hp - 4.160000 * wt