Linear Regression Assumptions and Diagnostics in R

We will use the Airlines data set (“BOMDELBOM”)

Building a Regression Model

# building a regression model
model <- lm(Price ~ AdvanceBookingDays
             + Capacity
             + Airline
             + Departure
             + IsWeekend
             + IsDiwali
             + FlyingMinutes
             + SeatWidth
             + SeatPitch,
            data = airline.df)
summary(model)

## 
## Call:
## lm(formula = Price ~ AdvanceBookingDays + Capacity + Airline + 
##     Departure + IsWeekend + IsDiwali + FlyingMinutes + SeatWidth + 
##     SeatPitch, data = airline.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3096.1 -1278.7  -546.0   544.3 11676.3 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -6361.724  10775.621  -0.590  0.55539    
## AdvanceBookingDays   -91.200     12.727  -7.166 6.29e-12 ***
## Capacity             -11.003      7.314  -1.504  0.13354    
## AirlineIndiGo      -2198.615    721.687  -3.046  0.00253 ** 
## AirlineJet          -983.995    458.378  -2.147  0.03264 *  
## AirlineSpice Jet   -1701.197    724.904  -2.347  0.01960 *  
## DeparturePM         -769.824    277.705  -2.772  0.00593 ** 
## IsWeekendYes        -856.834    398.316  -2.151  0.03228 *  
## IsDiwaliYes         4366.719    581.342   7.511 7.11e-13 ***
## FlyingMinutes         -9.482     28.178  -0.337  0.73672    
## SeatWidth           1185.321    550.696   2.152  0.03218 *  
## SeatPitch            -99.785    279.146  -0.357  0.72100    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2128 on 293 degrees of freedom
## Multiple R-squared:  0.2351, Adjusted R-squared:  0.2064 
## F-statistic: 8.188 on 11 and 293 DF,  p-value: 1.679e-12

Linearity of the Data

The linearity assumption can be checked by inspecting the Residuals vs Fitted plot (1st plot)

# Linearity of the data
plot(model, 1)

Ideally, the residual plot will show no fitted pattern. That is, the red line should be approximately horizontal at zero. The presence of a pattern may indicate a problem with some aspect of the linear model.

In our example, there is no pattern in the residual plot. This suggests that we can assume linear relationship between the predictors and the outcome variables.

Normality of Residuals

QQ Plot of Residuals

The QQ plot of residuals can be used to visually check the normality assumption. The normal probability plot of residuals should approximately follow a straight line.

In our example, all the points fall approximately along this reference line, so we can assume normality.

# Normality of residuals
plot(model, 2)

In our example, all the points do not fall approximately along this reference line, so we can assume non-normality.

Perform a Shapiro-Wilk Normality Test

library(MASS)
# distribution of studentized residuals
sresid <- studres(model) 
shapiro.test(sresid)

## 
##  Shapiro-Wilk normality test
## 
## data:  sresid
## W = 0.81738, p-value < 2.2e-16

From the p-value = 2.2e-16 < 0.5, it is clear that the residuals are not normally distributed.

Outliers and high levarage points

Outliers An outlier is a point that has an extreme outcome variable value. The presence of outliers may affect the interpretation of the model, because it increases the RSE.

Outliers can be identified by examining the standardized residual (or studentized residual), which is the residual divided by its estimated standard error. Standardized residuals can be interpreted as the number of standard errors away from the regression line. Observations whose standardized residuals are greater than 3 in absolute value are possible outliers

High leverage points:

A data point has high leverage, if it has extreme predictor x values. This can be detected by examining the leverage statistic or the hat-value. A value of this statistic above 2(p + 1)/n indicates an observation with high leverage (P. Bruce and Bruce 2017); where, p is the number of predictors and n is the number of observations.

# High leverage points
plot(model, 5)

The plot above highlights the top 3 most extreme points (#183, #182 and #49), with a standardized residuals higher 3. However, there is outliers that exceed 3 standard deviations, what is not good.

Additionally, there is high leverage point in the data. That is, some data points, have a leverage statistic higher than 2(p + 1)/n = 16/305 = 0.05.

Influential values

An influential value is a value, which inclusion or exclusion can alter the results of the regression analysis. Such a value is associated with a large residual.

Not all outliers (or extreme data points) are influential in linear regression analysis. Statisticians have developed a metric called Cook's distance to determine the influence of a value. This metric defines influence as a combination of leverage and residual size.

A rule of thumb is that an observation has high influence if Cook's distance exceeds 4/(n - p - 1)(P. Bruce and Bruce 2017), where n is the number of observations and p the number of predictor variables.

where n is the number of observations and p the number of predictor variables.

The following plots illustrate the Cook's distance and the leverage of our model:

#Cook's distance
plot(model, 4)

#Residuals vs Leverage
plot(model, 5)

By default, the top 3 most extreme values are labelled on the Cook's distance plot. If you want to label the top 5 extreme values, specify the option id.n as follow:

plot(model, 4, id.n = 5)

In our example, the data does not present any influential points. Cook's distance lines (a red dashed line) are not shown on the Residuals vs Leverage plot because all points are well inside of the Cook's distance lines.

Testing the Homoscedasticity Assumption

Scale-location plot

This assumption can be checked by examining the Scale-location plot, also known as the spread-location plot.

plot(model, 3)

This plot shows if residuals are spread equally along the ranges of predictors. it is good if you see a horizontal line with equally spread points. In our example, this is the case. we can assume Homogeneity of variance

ncvTest() For Homoscedasticity

library(car)

## Loading required package: carData

# non-constant error variance test
ncvTest(model)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 64.06275, Df = 1, p = 1.2052e-15

p < .05, suggesting that our data is not homoscedastic.

Breusch-Pagan Test For Homoscedasticity

library(lmtest)
bptest(model)

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 26.711, df = 11, p-value = 0.00508

Testing the Independence (Autocorrelation) Assumption

Durbin Watson Test for Autocorrelation

The Durbin Watson examines whether the errors are autocorrelated with themselves. The null states that they are not autocorrelated (what we want). This test could be especially useful when you conduct a multiple (times series) regression. For example, this test could tell you whether the residuals at time point 1 are correlated with the residuals at time point 2 (they shouldn’t be). In other words, this test is useful to verify that we haven’t violated the independence assumption.

# durbin watson test
durbinWatsonTest(model)

##  lag Autocorrelation D-W Statistic p-value
##    1       0.3407773      1.312805       0
##  Alternative hypothesis: rho != 0

p < 0.05, so the errors are autocorrelated. We have violated the independence assumption.

Testing the Multicollinearity Assumption

Correlation Matrix

# creating subset of continuous independent variables from the dataframe
expVar <- airline.df[c("AdvanceBookingDays", "FlyingMinutes", "SeatPitch","Capacity" ,"SeatWidth")]
# correlation matrix
corMatrix <- round(cor(expVar), 2)
corMatrix

##                    AdvanceBookingDays FlyingMinutes SeatPitch Capacity
## AdvanceBookingDays               1.00          0.01     -0.01    -0.01
## FlyingMinutes                    0.01          1.00     -0.03    -0.32
## SeatPitch                       -0.01         -0.03      1.00     0.51
## Capacity                        -0.01         -0.32      0.51     1.00
## SeatWidth                        0.05         -0.18      0.32     0.45
##                    SeatWidth
## AdvanceBookingDays      0.05
## FlyingMinutes          -0.18
## SeatPitch               0.32
## Capacity                0.45
## SeatWidth               1.00

Highly Correlated Variables

# highly correlated variables
library(caret)
findCorrelation(corMatrix, cutoff = 0.7, names = TRUE)

## character(0)

Variance Inflation Factor for Multicollinearity

# detecting multicollinearity
library(car)
vif(model)

##                         GVIF Df GVIF^(1/(2*Df))
## AdvanceBookingDays  5.407194  1        2.325337
## Capacity            3.768723  1        1.941320
## Airline            17.430096  3        1.610213
## Departure           1.283870  1        1.133080
## IsWeekend           1.243858  1        1.115284
## IsDiwali            5.449963  1        2.334516
## FlyingMinutes       1.180434  1        1.086478
## SeatWidth           4.876452  1        2.208269
## SeatPitch           4.545479  1        2.132013

Farrar - Glauber Test for Multicollinearity

The mctest package in R provides the Farrar-Glauber test and other relevant tests for multicollinearity. There are two functions viz. omcdiag() and imcdiag() under mctest package in R which will provide the overall and individual diagnostic checking for multicollinearity respectively.

# collinearity using mctest
library(mctest)
omcdiag(as.matrix(expVar), Price)

## 
## Call:
## omcdiag(x = as.matrix(expVar), y = Price)
## 
## 
## Overall Multicollinearity Diagnostics
## 
##                        MC Results detection
## Determinant |X'X|:         0.5055         0
## Farrar Chi-Square:       205.3252         1
## Red Indicator:             0.2654         0
## Sum of Lambda Inverse:     6.5490         0
## Theil's Method:            0.9863         1
## Condition Number:        158.1401         1
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test

imcdiag(expVar, Price)

## 
## Call:
## imcdiag(x = expVar, y = Price)
## 
## 
## All Individual Multicollinearity Diagnostics Result
## 
##                       VIF    TOL      Wi      Fi Leamer   CVIF Klein
## AdvanceBookingDays 1.0049 0.9951  0.3656  0.4890 0.9976 0.9972     0
## FlyingMinutes      1.1472 0.8717 11.0428 14.7728 0.9336 1.1385     1
## SeatPitch          1.4018 0.7134 30.1318 40.3096 0.8446 1.3910     1
## Capacity           1.7081 0.5854 53.1072 71.0457 0.7651 1.6950     1
## SeatWidth          1.2870 0.7770 21.5256 28.7965 0.8815 1.2772     1
## 
## 1 --> COLLINEARITY is detected by the test 
## 0 --> COLLINEARITY is not detected by the test
## 
## AdvanceBookingDays , FlyingMinutes , Capacity , SeatWidth , coefficient(s) are non-significant may be due to multicollinearity
## 
## R-square of y on all x: 0.0178 
## 
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

Aryansh Gupta

IIM Lucknow