Regression Analysis in R with Residual Tests

Choosing the variables and creating a regression model

Our response variable is mpg and predictors can be cyl, disp, hp, drat, wt (chosen from high correlations with mpg).

Looking at the correlation matrix, we can see that: 1. cyl and disp and hp are highly correlated (0.9 and 0.83 respectively) 2. cyl and drat and wt have high correlation too (-0.7, 0.78)

We do not want our model to have multicollinearity, even though, let us test it with those predictors to see how it will affect the model.

Model 1

model1 = lm(mpg ~ cyl + disp + hp + drat + wt, data=mtcars)
summary(model1)

## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7014 -1.6850 -0.4226  1.1681  5.7263 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.00836    7.57144   4.756  6.4e-05 ***
## cyl         -1.10749    0.71588  -1.547  0.13394    
## disp         0.01236    0.01190   1.039  0.30845    
## hp          -0.02402    0.01328  -1.809  0.08208 .  
## drat         0.95221    1.39085   0.685  0.49964    
## wt          -3.67329    1.05900  -3.469  0.00184 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.538 on 26 degrees of freedom
## Multiple R-squared:  0.8513, Adjusted R-squared:  0.8227 
## F-statistic: 29.77 on 5 and 26 DF,  p-value: 5.618e-10

We have a number of statistics in the summary.

Firstly, the regression can be written as:
mpg = 36.00 -1.107 * cyl + 0.012 * disp - 0.024 * hp + 0.952 * drat - 3.673 * wt
The significance of cyl, disp and drat > 0.05, thus they may be removed from the model.

Coefficient of determination- The R-squared value is 0.8513, which says that 83% of the variability in the output is captured by our model. Adjusted R squared- The adj R-squared value is 0.8227. This is generally lower than R-squared as it is adjusted for the number of predictors in the model.

Multicollinearity check using VIF

We do not want multicollinearity- a way to determine if there is multicollinearity is to calculate the Variance Inflation Factor (VIF).
VIF = 1/(1-Ri^2)
VIF > 5 indicates that there is a chance of multicollinearity. VIF > 10 signals high multi-collinearity.

library(regclass)
#A case where (some of) the VIFs are large
  M <- lm(mpg~ cyl + disp + hp + drat + wt,data=mtcars)
  VIF(M)

##       cyl      disp        hp      drat        wt 
##  7.869010 10.463957  3.990380  2.662298  5.168795

Here, we can see that disp has high VIF, thus it should be removed from the model. Testing VIF after removing disp, we get:

library(regclass)
#A case where (some of) the VIFs are large
  M <- lm(mpg~ cyl + hp + drat + wt,data=mtcars)
  VIF(M)

##      cyl       hp     drat       wt 
## 6.173560 3.784670 2.639229 3.076225

Which thus says that multicollinearity has been removed.

Model 2

model2 = lm(mpg ~ cyl + hp + drat + wt, data=mtcars)
summary(model2)

## 
## Call:
## lm(formula = mpg ~ cyl + hp + drat + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6171 -1.5663 -0.6058  1.2612  5.8161 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.49588    7.44101   4.636  8.1e-05 ***
## cyl         -0.76229    0.63502  -1.200  0.24040    
## hp          -0.02089    0.01295  -1.613  0.11845    
## drat         0.81771    1.38684   0.590  0.56034    
## wt          -2.97331    0.81818  -3.634  0.00116 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.541 on 27 degrees of freedom
## Multiple R-squared:  0.8451, Adjusted R-squared:  0.8222 
## F-statistic: 36.84 on 4 and 27 DF,  p-value: 1.438e-10

Now, the regression equation is:
mpg = 34.49588 -0.76229 * cyl - 0.02089 * hp + 0.81771 * drat - 2.97331 * wt

Corresponding plots:

  plot(model2)

Getting the residual graph

res1 = resid(model2)

plot(mtcars$mpg, res1)

Tests for residuals

Three things to look for in residual plots:

1. Normality- Check the distribution- whether the residuals are normally distributed or not. You can check the QQ plot- the points need to be close to the normality line. Or perform a Shapiro- Wilk test (best for N < 30), or Kolmogorov Smirnov test if N > 30.
2. Independence- Durbin Watson test, or see by inspection that there is no trend in the residual plot.
3. Constant Variance- Check by inspection that there is constant variance observed.

shapiro.test(res1)

## 
##  Shapiro-Wilk normality test
## 
## data:  res1
## W = 0.92655, p-value = 0.03143

hist(res1, 
     main="mpg", 
     xlab="mpg", 
     border="light blue", 
     col="blue", 
     las=1, 
     breaks=5)

durbinWatsonTest(model1)

##  lag Autocorrelation D-W Statistic p-value
##    1      0.04266962      1.850141   0.448
##  Alternative hypothesis: rho != 0

Here, since the p-value < 0.05 (keeping 0.05 as the level of siginificance), we can say that residuals are not normally distributed.

Cook’s Distance

This test is used to find out the influential points in the data. Such influential points can be removed from the data as we do not want our model to be influenced by 1 or two extreme values.

ols_plot_cooksd_bar(model2)

There are 2 influential points in the data which must be removed, which are Chrysler Imperial and Toyota Corolla.

Removing these two now:

  mtcars <- mtcars[-c(17, 20), ]

Model 3

model3 <-  lm(mpg ~ cyl + hp + drat + wt, data=mtcars)
summary(model3)

## 
## Call:
## lm(formula = mpg ~ cyl + hp + drat + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2970 -1.4431 -0.3317  1.3976  5.9577 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.04450    6.41544   5.774 5.11e-06 ***
## cyl         -0.64251    0.54366  -1.182 0.248401    
## hp          -0.02018    0.01101  -1.833 0.078705 .  
## drat         0.24413    1.18790   0.206 0.838837    
## wt          -3.49843    0.77374  -4.521 0.000129 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.149 on 25 degrees of freedom
## Multiple R-squared:  0.8723, Adjusted R-squared:  0.8518 
## F-statistic: 42.69 on 4 and 25 DF,  p-value: 8.013e-11

mpg = 37.04450 -0.64251 * cyl - 0.02018 * hp + 0.24413 * drat - 3.49843 * wt

After removing the most influential points, the new R squared value has become 0.872 and Adjusted R squared value as 0.85, which is a better value.

Also, checking the residuals we get: 1. Normality- Shapiro wilk test gave alpha significance value 0.15 which is > 0.05 2. Independence- Durbin Watson test gave 1.72, thus indicating that the residuals are independent 3. Constant variance- Inspecting the residual plot, we can say that the residuals have constant variance.

## Checking residuals
res3 <- resid(model3)
shapiro.test(res3)

## 
##  Shapiro-Wilk normality test
## 
## data:  res3
## W = 0.94908, p-value = 0.1597

durbinWatsonTest(model3)

##  lag Autocorrelation D-W Statistic p-value
##    1       0.1061411      1.720848   0.328
##  Alternative hypothesis: rho != 0

res3 <- resid(model3)

plot(mtcars$mpg, res3)

Model 4

Now removing cyl and drat, we get our final resulting model as:

model4 <-  lm(mpg ~ hp + wt, data=mtcars)
summary(model4)

## 
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5021 -1.2231 -0.1892  1.2445  6.0970 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.390419   1.478587  25.288  < 2e-16 ***
## hp          -0.029325   0.007518  -3.901 0.000575 ***
## wt          -4.160000   0.567052  -7.336 6.84e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.152 on 27 degrees of freedom
## Multiple R-squared:  0.8617, Adjusted R-squared:  0.8515 
## F-statistic: 84.13 on 2 and 27 DF,  p-value: 2.514e-12

And our final equation is:
mpg = 37.390419 -0.029325 * hp - 4.160000 * wt