Importing the dataset

data(mtcars)
head(mtcars, 10)
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Looking at the data

##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

Choosing the variables and creating a regression model

Our response variable is mpg and predictors can be cyl, disp, hp, drat, wt (chosen from high correlations with mpg).

Looking at the correlation matrix, we can see that: 1. cyl and disp and hp are highly correlated (0.9 and 0.83 respectively) 2. cyl and drat and wt have high correlation too (-0.7, 0.78)

We do not want our model to have multicollinearity, even though, let us test it with those predictors to see how it will affect the model.

Model 1

model1 = lm(mpg ~ cyl + disp + hp + drat + wt, data=mtcars)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7014 -1.6850 -0.4226  1.1681  5.7263 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.00836    7.57144   4.756  6.4e-05 ***
## cyl         -1.10749    0.71588  -1.547  0.13394    
## disp         0.01236    0.01190   1.039  0.30845    
## hp          -0.02402    0.01328  -1.809  0.08208 .  
## drat         0.95221    1.39085   0.685  0.49964    
## wt          -3.67329    1.05900  -3.469  0.00184 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.538 on 26 degrees of freedom
## Multiple R-squared:  0.8513, Adjusted R-squared:  0.8227 
## F-statistic: 29.77 on 5 and 26 DF,  p-value: 5.618e-10

We have a number of statistics in the summary.

Firstly, the regression can be written as:
mpg = 36.00 -1.107 * cyl + 0.012 * disp - 0.024 * hp + 0.952 * drat - 3.673 * wt
The significance of cyl, disp and drat > 0.05, thus they may be removed from the model.

Coefficient of determination- The R-squared value is 0.8513, which says that 83% of the variability in the output is captured by our model. Adjusted R squared- The adj R-squared value is 0.8227. This is generally lower than R-squared as it is adjusted for the number of predictors in the model.

Multicollinearity check using VIF

We do not want multicollinearity- a way to determine if there is multicollinearity is to calculate the Variance Inflation Factor (VIF).
VIF = 1/(1-Ri^2)
VIF > 5 indicates that there is a chance of multicollinearity. VIF > 10 signals high multi-collinearity.

library(regclass)
#A case where (some of) the VIFs are large
  M <- lm(mpg~ cyl + disp + hp + drat + wt,data=mtcars)
  VIF(M)
##       cyl      disp        hp      drat        wt 
##  7.869010 10.463957  3.990380  2.662298  5.168795

Here, we can see that disp has high VIF, thus it should be removed from the model. Testing VIF after removing disp, we get:

library(regclass)
#A case where (some of) the VIFs are large
  M <- lm(mpg~ cyl + hp + drat + wt,data=mtcars)
  VIF(M)
##      cyl       hp     drat       wt 
## 6.173560 3.784670 2.639229 3.076225

Which thus says that multicollinearity has been removed.

Model 2

model2 = lm(mpg ~ cyl + hp + drat + wt, data=mtcars)
summary(model2)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + drat + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6171 -1.5663 -0.6058  1.2612  5.8161 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.49588    7.44101   4.636  8.1e-05 ***
## cyl         -0.76229    0.63502  -1.200  0.24040    
## hp          -0.02089    0.01295  -1.613  0.11845    
## drat         0.81771    1.38684   0.590  0.56034    
## wt          -2.97331    0.81818  -3.634  0.00116 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.541 on 27 degrees of freedom
## Multiple R-squared:  0.8451, Adjusted R-squared:  0.8222 
## F-statistic: 36.84 on 4 and 27 DF,  p-value: 1.438e-10

Now, the regression equation is:
mpg = 34.49588 -0.76229 * cyl - 0.02089 * hp + 0.81771 * drat - 2.97331 * wt

Corresponding plots:

  plot(model2)

Getting the residual graph

res1 = resid(model2)

plot(mtcars$mpg, res1)

Tests for residuals


Three things to look for in residual plots:

1. Normality- Check the distribution- whether the residuals are normally distributed or not. You can check the QQ plot- the points need to be close to the normality line. Or perform a Shapiro- Wilk test (best for N < 30), or Kolmogorov Smirnov test if N > 30.
2. Independence- Durbin Watson test, or see by inspection that there is no trend in the residual plot.
3. Constant Variance- Check by inspection that there is constant variance observed.

shapiro.test(res1)
## 
##  Shapiro-Wilk normality test
## 
## data:  res1
## W = 0.92655, p-value = 0.03143
hist(res1, 
     main="mpg", 
     xlab="mpg", 
     border="light blue", 
     col="blue", 
     las=1, 
     breaks=5)

durbinWatsonTest(model1)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.04266962      1.850141   0.448
##  Alternative hypothesis: rho != 0

Here, since the p-value < 0.05 (keeping 0.05 as the level of siginificance), we can say that residuals are not normally distributed.

Cook’s Distance

This test is used to find out the influential points in the data. Such influential points can be removed from the data as we do not want our model to be influenced by 1 or two extreme values.

ols_plot_cooksd_bar(model2)

There are 2 influential points in the data which must be removed, which are Chrysler Imperial and Toyota Corolla.

Removing these two now:

  mtcars <- mtcars[-c(17, 20), ] 

Model 3

model3 <-  lm(mpg ~ cyl + hp + drat + wt, data=mtcars)
summary(model3)
## 
## Call:
## lm(formula = mpg ~ cyl + hp + drat + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2970 -1.4431 -0.3317  1.3976  5.9577 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.04450    6.41544   5.774 5.11e-06 ***
## cyl         -0.64251    0.54366  -1.182 0.248401    
## hp          -0.02018    0.01101  -1.833 0.078705 .  
## drat         0.24413    1.18790   0.206 0.838837    
## wt          -3.49843    0.77374  -4.521 0.000129 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.149 on 25 degrees of freedom
## Multiple R-squared:  0.8723, Adjusted R-squared:  0.8518 
## F-statistic: 42.69 on 4 and 25 DF,  p-value: 8.013e-11


mpg = 37.04450 -0.64251 * cyl - 0.02018 * hp + 0.24413 * drat - 3.49843 * wt

After removing the most influential points, the new R squared value has become 0.872 and Adjusted R squared value as 0.85, which is a better value.

Also, checking the residuals we get: 1. Normality- Shapiro wilk test gave alpha significance value 0.15 which is > 0.05 2. Independence- Durbin Watson test gave 1.72, thus indicating that the residuals are independent 3. Constant variance- Inspecting the residual plot, we can say that the residuals have constant variance.

## Checking residuals
res3 <- resid(model3)
shapiro.test(res3)
## 
##  Shapiro-Wilk normality test
## 
## data:  res3
## W = 0.94908, p-value = 0.1597
durbinWatsonTest(model3)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1061411      1.720848   0.328
##  Alternative hypothesis: rho != 0
res3 <- resid(model3)

plot(mtcars$mpg, res3)

Model 4

Now removing cyl and drat, we get our final resulting model as:

model4 <-  lm(mpg ~ hp + wt, data=mtcars)
summary(model4)
## 
## Call:
## lm(formula = mpg ~ hp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5021 -1.2231 -0.1892  1.2445  6.0970 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.390419   1.478587  25.288  < 2e-16 ***
## hp          -0.029325   0.007518  -3.901 0.000575 ***
## wt          -4.160000   0.567052  -7.336 6.84e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.152 on 27 degrees of freedom
## Multiple R-squared:  0.8617, Adjusted R-squared:  0.8515 
## F-statistic: 84.13 on 2 and 27 DF,  p-value: 2.514e-12

And our final equation is:
mpg = 37.390419 -0.029325 * hp - 4.160000 * wt