We will use the Airlines data set (“BOMDELBOM”)
# building a regression model
model <- lm(Price ~ AdvanceBookingDays
+ Capacity
+ Airline
+ Departure
+ IsWeekend
+ IsDiwali
+ FlyingMinutes
+ SeatWidth
+ SeatPitch,
data = airline.df)
summary(model)
##
## Call:
## lm(formula = Price ~ AdvanceBookingDays + Capacity + Airline +
## Departure + IsWeekend + IsDiwali + FlyingMinutes + SeatWidth +
## SeatPitch, data = airline.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3096.1 -1278.7 -546.0 544.3 11676.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6361.724 10775.621 -0.590 0.55539
## AdvanceBookingDays -91.200 12.727 -7.166 6.29e-12 ***
## Capacity -11.003 7.314 -1.504 0.13354
## AirlineIndiGo -2198.615 721.687 -3.046 0.00253 **
## AirlineJet -983.995 458.378 -2.147 0.03264 *
## AirlineSpice Jet -1701.197 724.904 -2.347 0.01960 *
## DeparturePM -769.824 277.705 -2.772 0.00593 **
## IsWeekendYes -856.834 398.316 -2.151 0.03228 *
## IsDiwaliYes 4366.719 581.342 7.511 7.11e-13 ***
## FlyingMinutes -9.482 28.178 -0.337 0.73672
## SeatWidth 1185.321 550.696 2.152 0.03218 *
## SeatPitch -99.785 279.146 -0.357 0.72100
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2128 on 293 degrees of freedom
## Multiple R-squared: 0.2351, Adjusted R-squared: 0.2064
## F-statistic: 8.188 on 11 and 293 DF, p-value: 1.679e-12
The linearity assumption can be checked by inspecting the Residuals vs Fitted plot (1st plot)
Ideally, the residual plot will show no fitted pattern. That is, the red line should be approximately horizontal at zero. The presence of a pattern may indicate a problem with some aspect of the linear model.
In our example, there is no pattern in the residual plot. This suggests that we can assume linear relationship between the predictors and the outcome variables.
The QQ plot of residuals can be used to visually check the normality assumption. The normal probability plot of residuals should approximately follow a straight line.
In our example, all the points fall approximately along this reference line, so we can assume normality.
In our example, all the points do not fall approximately along this reference line, so we can assume non-normality.
##
## Shapiro-Wilk normality test
##
## data: sresid
## W = 0.81738, p-value < 2.2e-16
From the p-value = 2.2e-16 < 0.5, it is clear that the residuals are not normally distributed.
Outliers and high levarage points
Outliers An outlier is a point that has an extreme outcome variable value. The presence of outliers may affect the interpretation of the model, because it increases the RSE
.
Outliers can be identified by examining the standardized residual (or studentized residual), which is the residual divided by its estimated standard error. Standardized residuals can be interpreted as the number of standard errors away from the regression line. Observations whose standardized residuals are greater than 3 in absolute value are possible outliers
High leverage points:
A data point has high leverage, if it has extreme predictor x values. This can be detected by examining the leverage statistic or the hat-value. A value of this statistic above 2(p + 1)/n
indicates an observation with high leverage (P. Bruce and Bruce 2017); where, p
is the number of predictors and n
is the number of observations.
The plot above highlights the top 3 most extreme points (#183, #182 and #49), with a standardized residuals higher 3. However, there is outliers that exceed 3 standard deviations, what is not good.
Additionally, there is high leverage point in the data. That is, some data points, have a leverage statistic higher than 2(p + 1)/n = 16/305 = 0.05.
Influential values
An influential value is a value, which inclusion or exclusion can alter the results of the regression analysis. Such a value is associated with a large residual.
Not all outliers (or extreme data points) are influential in linear regression analysis. Statisticians have developed a metric called Cook's distance
to determine the influence of a value. This metric defines influence as a combination of leverage and residual size.
A rule of thumb is that an observation has high influence if Cook's distance
exceeds 4/(n - p - 1)
(P. Bruce and Bruce 2017), where n is the number of observations and p the number of predictor variables.
where n
is the number of observations and p
the number of predictor variables.
The following plots illustrate the Cook's distance
and the leverage of our model:
By default, the top 3 most extreme values are labelled on the Cook's distance
plot. If you want to label the top 5 extreme values, specify the option id.n as follow:
plot(model, 4, id.n = 5)
In our example, the data does not present any influential points. Cook's distance
lines (a red dashed line) are not shown on the Residuals vs Leverage plot because all points are well inside of the Cook's distance
lines.
This assumption can be checked by examining the Scale-location plot, also known as the spread-location plot.
This plot shows if residuals are spread equally along the ranges of predictors. it is good if you see a horizontal line with equally spread points. In our example, this is the case. we can assume Homogeneity of variance
## Loading required package: carData
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 64.06275, Df = 1, p = 1.2052e-15
p < .05, suggesting that our data is not homoscedastic.
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 26.711, df = 11, p-value = 0.00508
The Durbin Watson examines whether the errors are autocorrelated with themselves. The null states that they are not autocorrelated (what we want). This test could be especially useful when you conduct a multiple (times series) regression. For example, this test could tell you whether the residuals at time point 1 are correlated with the residuals at time point 2 (they shouldn’t be). In other words, this test is useful to verify that we haven’t violated the independence assumption.
## lag Autocorrelation D-W Statistic p-value
## 1 0.3407773 1.312805 0
## Alternative hypothesis: rho != 0
p < 0.05, so the errors are autocorrelated. We have violated the independence assumption.
# creating subset of continuous independent variables from the dataframe
expVar <- airline.df[c("AdvanceBookingDays", "FlyingMinutes", "SeatPitch","Capacity" ,"SeatWidth")]
# correlation matrix
corMatrix <- round(cor(expVar), 2)
corMatrix
## AdvanceBookingDays FlyingMinutes SeatPitch Capacity
## AdvanceBookingDays 1.00 0.01 -0.01 -0.01
## FlyingMinutes 0.01 1.00 -0.03 -0.32
## SeatPitch -0.01 -0.03 1.00 0.51
## Capacity -0.01 -0.32 0.51 1.00
## SeatWidth 0.05 -0.18 0.32 0.45
## SeatWidth
## AdvanceBookingDays 0.05
## FlyingMinutes -0.18
## SeatPitch 0.32
## Capacity 0.45
## SeatWidth 1.00
Highly Correlated Variables
## character(0)
## GVIF Df GVIF^(1/(2*Df))
## AdvanceBookingDays 5.407194 1 2.325337
## Capacity 3.768723 1 1.941320
## Airline 17.430096 3 1.610213
## Departure 1.283870 1 1.133080
## IsWeekend 1.243858 1 1.115284
## IsDiwali 5.449963 1 2.334516
## FlyingMinutes 1.180434 1 1.086478
## SeatWidth 4.876452 1 2.208269
## SeatPitch 4.545479 1 2.132013
The mctest
package in R provides the Farrar-Glauber test and other relevant tests for multicollinearity. There are two functions viz. omcdiag()
and imcdiag()
under mctest
package in R which will provide the overall and individual diagnostic checking for multicollinearity respectively.
##
## Call:
## omcdiag(x = as.matrix(expVar), y = Price)
##
##
## Overall Multicollinearity Diagnostics
##
## MC Results detection
## Determinant |X'X|: 0.5055 0
## Farrar Chi-Square: 205.3252 1
## Red Indicator: 0.2654 0
## Sum of Lambda Inverse: 6.5490 0
## Theil's Method: 0.9863 1
## Condition Number: 158.1401 1
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## Call:
## imcdiag(x = expVar, y = Price)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi Fi Leamer CVIF Klein
## AdvanceBookingDays 1.0049 0.9951 0.3656 0.4890 0.9976 0.9972 0
## FlyingMinutes 1.1472 0.8717 11.0428 14.7728 0.9336 1.1385 1
## SeatPitch 1.4018 0.7134 30.1318 40.3096 0.8446 1.3910 1
## Capacity 1.7081 0.5854 53.1072 71.0457 0.7651 1.6950 1
## SeatWidth 1.2870 0.7770 21.5256 28.7965 0.8815 1.2772 1
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## AdvanceBookingDays , FlyingMinutes , Capacity , SeatWidth , coefficient(s) are non-significant may be due to multicollinearity
##
## R-square of y on all x: 0.0178
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================