Sameer Mathur
Multicollinearity.
Regression Diagnostics
---
Reading data into a dataframe
# reading data into R
airline.df <- read.csv("BOMDELBOM.csv")
# attaching data columns of the dataframe
attach(airline.df)
Number of rows and columns
# dimension of the dataframe
dim(airline.df)
[1] 305 23
# descriptive statistics
library(psych)
describe(airline.df)[, c(1:5, 8:9)]
                    vars   n    mean      sd  median     min      max
FlightNumber*          1 305   31.86   18.35   32.00    1.00    63.00
Airline*               2 305    2.60    0.88    3.00    1.00     4.00
DepartureCityCode*     3 305    1.57    0.50    2.00    1.00     2.00
ArrivalCityCode*       4 305    1.43    0.50    1.00    1.00     2.00
DepartureTime          5 305 1249.54  579.86 1035.00  225.00  2320.00
ArrivalTime            6 305 1329.31  613.52 1215.00   20.00  2345.00
Departure*             7 305    1.45    0.50    1.00    1.00     2.00
FlyingMinutes          8 305  136.03    4.71  135.00  125.00   145.00
Aircraft*              9 305    1.54    0.50    2.00    1.00     2.00
PlaneModel*           10 305    3.82    2.71    3.00    1.00     9.00
Capacity              11 305  176.36   32.39  180.00  138.00   303.00
SeatPitch             12 305   30.26    0.93   30.00   29.00    33.00
SeatWidth             13 305   17.41    0.49   17.00   17.00    18.00
DataCollectionDate*   14 305    4.36    1.98    5.00    1.00     7.00
DateDeparture*        15 305    8.14    6.69    7.00    1.00    20.00
IsWeekend*            16 305    1.13    0.34    1.00    1.00     2.00
Price                 17 305 5394.54 2388.29 4681.00 2607.00 18015.00
AdvancedBookingDays   18 305   28.90   22.30   30.00    2.00    61.00
IsDiwali*             19 305    1.40    0.49    1.00    1.00     2.00
DayBeforeDiwali*      20 305    1.19    0.40    1.00    1.00     2.00
DayAfterDiwali*       21 305    1.20    0.40    1.00    1.00     2.00
MarketShare           22 305   21.18   11.04   15.40   13.20    39.60
LoadFactor            23 305   85.13    4.32   83.32   78.73    94.06
# linear OLS model
fitOLSModel <- lm(Price ~
                    AdvancedBookingDays
                  + Airline
                  + Departure
                  + IsWeekend
                  + IsDiwali
                  + DepartureCityCode
                  + FlyingMinutes
                  + SeatPitch
                  + SeatWidth,
                  data = airline.df)
# summary of linear OLS model
summary(fitOLSModel)
Call:
lm(formula = Price ~ AdvancedBookingDays + Airline + Departure +
    IsWeekend + IsDiwali + DepartureCityCode + FlyingMinutes +
    SeatPitch + SeatWidth, data = airline.df)
Residuals:
    Min      1Q  Median      3Q     Max
-2671.2 -1266.2  -456.4   517.4 11953.9
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          -4292.94    8897.87  -0.482   0.6298
AdvancedBookingDays    -87.70      12.47  -7.033 1.43e-11 ***
AirlineIndiGo         -577.17     778.64  -0.741   0.4591
AirlineJet            -120.75     436.69  -0.277   0.7823
AirlineSpice Jet     -1118.38     697.85  -1.603   0.1101
DeparturePM           -589.79     275.23  -2.143   0.0329 *
IsWeekendYes          -345.92     408.06  -0.848   0.3973
IsDiwaliYes           4346.80     568.14   7.651 2.90e-13 ***
DepartureCityCodeDEL -1413.46     351.54  -4.021 7.38e-05 ***
FlyingMinutes           38.97      29.27   1.331   0.1841
SeatPitch             -279.19     226.64  -1.232   0.2190
SeatWidth              868.58     507.54   1.711   0.0881 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2079 on 293 degrees of freedom
Multiple R-squared: 0.2695, Adjusted R-squared: 0.2421
F-statistic: 9.828 on 11 and 293 DF, p-value: 3.604e-15
Some major assumptions of linear regression are:
Linearity of the data
Normality of residuals
Homogeneity of residual variance (homoscedasticity)
Multicollinearity
In multiple regression, two or more predictor variables might be correlated with each other. This situation is referred to as collinearity.
Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated.
Unfortunately, when it exists, it can wreak havoc on our analysis and thereby limit the research conclusions we can draw. In particular:
The estimated regression coefficient of any one variable depends on which other predictors are included in the model.
The precision of the estimated regression coefficients decreases as more predictors are added to the model.
Hypothesis tests for \( \beta_k = 0 \) may yield different conclusions depending on which predictors are in the model.
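The loss of precision described above can be illustrated with a small simulation. This is a minimal sketch on simulated data (not the airline data); the helper `simulate_se()`, the sample size and the correlation values are assumptions chosen only for illustration.

```r
# Simulated illustration (not the airline data): the same model is fitted
# twice, once with uncorrelated predictors and once with highly correlated
# ones, to compare the standard error of the coefficient on x1.
set.seed(42)
n <- 305

# hypothetical helper: returns the standard error of the x1 coefficient
simulate_se <- function(rho) {
  x1 <- rnorm(n)
  # x2 is constructed to have population correlation rho with x1
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
  y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)
  fit <- lm(y ~ x1 + x2)
  summary(fit)$coefficients["x1", "Std. Error"]
}

se_uncorrelated <- simulate_se(rho = 0)
se_correlated   <- simulate_se(rho = 0.95)

# ratio of the two standard errors; theory suggests a value near
# sqrt(1 / (1 - 0.95^2)), i.e. roughly 3
se_ratio <- se_correlated / se_uncorrelated
print(c(uncorrelated = se_uncorrelated,
        correlated   = se_correlated,
        ratio        = se_ratio))
```

The true coefficients are the same in both fits; only the correlation between the predictors changes, so the larger standard error comes from collinearity.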
The following are some approaches to detect multicollinearity among the predictor variables.
1. APPROACH 1: Correlation among the continuous predictor variables
2. APPROACH 2: Variance Inflation Factors (VIF)
3. APPROACH 3: Farrar - Glauber Test
Correlation
expVar <- airline.df[c("AdvancedBookingDays", "FlyingMinutes", "SeatPitch", "SeatWidth")]
corMat <- round(cor(expVar), 2)
corMat
                    AdvancedBookingDays FlyingMinutes SeatPitch SeatWidth
AdvancedBookingDays                1.00          0.01     -0.01      0.05
FlyingMinutes                      0.01          1.00     -0.03     -0.18
SeatPitch                         -0.01         -0.03      1.00      0.32
SeatWidth                          0.05         -0.18      0.32      1.00
# highly correlated variables
library(caret)
findCorrelation(corMat, cutoff = 0.75, names = TRUE)
character(0)
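Based on the correlation matrix for the BOM-DEL-BOM data:
None of the continuous x-variables are highly correlated with each other; the largest absolute correlation is 0.32, between SeatPitch and SeatWidth.
Here, we have used a cut-off of 0.75 on the absolute correlation.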
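The cut-off rule can also be applied without caret. The sketch below uses base R only; `high_cor_pairs()` is a hypothetical helper written for this illustration, and the matrix is typed in from the rounded correlation output above rather than recomputed from the data.

```r
# Hypothetical helper: list predictor pairs whose absolute correlation
# exceeds a cutoff (base R only)
high_cor_pairs <- function(cor_mat, cutoff = 0.75) {
  # look only at the upper triangle so each pair is reported once
  idx <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
  data.frame(var1 = rownames(cor_mat)[idx[, "row"]],
             var2 = colnames(cor_mat)[idx[, "col"]],
             r    = cor_mat[idx],
             stringsAsFactors = FALSE)
}

# correlation matrix copied from the printed output (rounded to 2 decimals)
vars <- c("AdvancedBookingDays", "FlyingMinutes", "SeatPitch", "SeatWidth")
corMatPrinted <- matrix(c( 1.00,  0.01, -0.01,  0.05,
                           0.01,  1.00, -0.03, -0.18,
                          -0.01, -0.03,  1.00,  0.32,
                           0.05, -0.18,  0.32,  1.00),
                        nrow = 4, byrow = TRUE,
                        dimnames = list(vars, vars))

pairs_075 <- high_cor_pairs(corMatPrinted, cutoff = 0.75)  # no pair qualifies
pairs_030 <- high_cor_pairs(corMatPrinted, cutoff = 0.30)  # SeatPitch and SeatWidth
pairs_030
```

Unlike findCorrelation(), which suggests variables to drop, this helper reports the offending pairs together with their correlations.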
Variance Inflation Factor (VIF)
For a given predictor x, multicollinearity can be assessed by computing a score called the Variance Inflation Factor (or VIF).
VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity.
The extent to which a predictor \( x_j \) is correlated with the other predictor variables in a linear regression can be quantified as the R-squared statistic \( R_j^2 \) of the regression in which \( x_j \) is predicted by all the other predictor variables. The variance inflation factor for that variable is then computed as \[ VIF_j = \frac{1}{1 - R_j^2} \]
For any predictor variable, the square root of the VIF indicates the degree to which the confidence interval for that variable's regression parameter is expanded relative to a model with uncorrelated predictors.
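If the model includes both continuous and categorical predictors, generalized variance-inflation factors (GVIF; Fox and Monette, 1992) are calculated and adjusted for the degrees of freedom (Df) of each term as \( GVIF^{1/(2 \cdot Df)} \).
Conceptually, they are the same as VIF: for a term with one degree of freedom, GVIF equals VIF and \( GVIF^{1/(2 \cdot Df)} \) equals \( \sqrt{VIF} \).
As a general rule, \( \sqrt{VIF} > 2 \) (equivalently, VIF > 4) indicates a multicollinearity problem.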
As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity (James et al. 2014).
Sometimes a high VIF is of less concern. For example, you can get a high VIF by including interactions between \( x_1 \) and \( x_2 \), or powers of \( x_1 \) such as \( x_1^{2} \).
Note: High VIFs among the dummy variables of a categorical predictor can generally be ignored. High VIFs of continuous variables are more problematic.
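The definition of VIF can be checked by running the auxiliary regressions by hand. This is a minimal sketch on simulated predictors (not the airline data); the variable names and the way x3 is constructed are assumptions made for illustration.

```r
# Simulated predictors (not the airline data); x3 is built from x1 and x2,
# so it is correlated with both
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.8 * x1 + 0.5 * x2 + rnorm(n, sd = 0.5)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j
# on all the other predictors
vif_manual <- sapply(names(X), function(v) {
  aux <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(aux)$r.squared)
})

vif_manual        # variance inflation factors
sqrt(vif_manual)  # factor by which each standard error is inflated
```

For purely numeric predictors, these values should agree with what a VIF function reports for a model containing the same predictors, since the VIF does not depend on the response.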
VIF values can be measured using the vif() function in the car package.
# variance inflation factor (vif)
library(car)
vif(fitOLSModel)
                         GVIF Df GVIF^(1/(2*Df))
AdvancedBookingDays  5.434809  1        2.331268
Airline             10.478006  3        1.479267
Departure            1.320436  1        1.149102
IsWeekend            1.366933  1        1.169159
IsDiwali             5.450405  1        2.334610
DepartureCityCode    2.132262  1        1.460227
FlyingMinutes        1.333785  1        1.154896
SeatPitch            3.137380  1        1.771265
SeatWidth            4.337152  1        2.082583
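The third column of this output can be reproduced from the first two. The sketch below copies the GVIF and Df values from the output above (nothing is recomputed from the data) and applies the \( GVIF^{1/(2 \cdot Df)} \) adjustment; the object names are chosen for illustration.

```r
# GVIF and Df values copied from the vif() output above
gvif <- c(AdvancedBookingDays = 5.434809, Airline = 10.478006,
          Departure = 1.320436, IsWeekend = 1.366933,
          IsDiwali = 5.450405, DepartureCityCode = 2.132262,
          FlyingMinutes = 1.333785, SeatPitch = 3.137380,
          SeatWidth = 4.337152)
termDf <- c(1, 3, 1, 1, 1, 1, 1, 1, 1)

# df-adjusted GVIF; for a one-df term this is simply sqrt(GVIF)
gvifAdjusted <- gvif^(1 / (2 * termDf))
round(gvifAdjusted, 3)

# terms that exceed the sqrt(VIF) > 2 rule of thumb
flagged <- names(gvifAdjusted)[gvifAdjusted > 2]
flagged
```

The adjusted value puts multi-df terms such as Airline on the same scale as one-df terms, which is why it is the column to compare against the threshold of 2.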
# df-adjusted GVIF compared with the sqrt(VIF) > 2 rule
vif(fitOLSModel)[, "GVIF^(1/(2*Df))", drop = FALSE] > 2
                    GVIF^(1/(2*Df))
AdvancedBookingDays            TRUE
Airline                       FALSE
Departure                     FALSE
IsWeekend                     FALSE
IsDiwali                       TRUE
DepartureCityCode             FALSE
FlyingMinutes                 FALSE
SeatPitch                     FALSE
SeatWidth                      TRUE
The output shows three predictors (AdvancedBookingDays, IsDiwali and SeatWidth) whose \( \sqrt{VIF} \) is more than 2. AdvancedBookingDays and IsDiwali also have a GVIF just above 5. Airline has the largest GVIF (10.48), but its df-adjusted value (1.48) is below 2.
Hence, by these rules of thumb, the model shows moderate multicollinearity, mainly involving AdvancedBookingDays and IsDiwali.
Statistical Test (Farrar - Glauber)
The mctest package in R provides the Farrar-Glauber test and other relevant tests for multicollinearity.
It has two functions, omcdiag() and imcdiag(), which provide overall and individual diagnostic checks for multicollinearity, respectively.
The omcdiag() function detects the existence of multicollinearity among the x-variables as a whole, using several diagnostic measures.
It also displays the value of each diagnostic measure, along with a decision on whether that measure detects multicollinearity.
Note: the calls here pass the x-variables and the response separately; newer releases of mctest appear to expect a fitted lm object instead, so check ?omcdiag for the installed version.
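The idea behind the overall Farrar-Glauber chi-square can be sketched in base R. This follows the textbook form of the statistic and runs on simulated data; `farrar_chisq()` is a hypothetical helper, and the implementation inside mctest may differ in detail (for example in how the intercept or the sample size enters), so its numbers need not match this sketch exactly.

```r
# Textbook Farrar-Glauber chi-square for overall multicollinearity:
#   chi2 = -(n - 1 - (2p + 5) / 6) * log(det(R)),  df = p (p - 1) / 2
# where R is the correlation matrix of the p predictors.
# H0: the predictors are orthogonal (det(R) = 1).
farrar_chisq <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  detR   <- det(cor(X))
  stat   <- -(n - 1 - (2 * p + 5) / 6) * log(detR)
  chisqDf <- p * (p - 1) / 2
  c(determinant = detR,
    chisq       = stat,
    df          = chisqDf,
    p.value     = pchisq(stat, chisqDf, lower.tail = FALSE))
}

# simulated example (not the airline data): x3 is strongly related to x1
set.seed(7)
x1 <- rnorm(300)
x2 <- rnorm(300)
x3 <- x1 + rnorm(300, sd = 0.3)
fg <- farrar_chisq(cbind(x1, x2, x3))
fg
```

A determinant close to 1 gives a small statistic (no evidence of collinearity), while a determinant close to 0 gives a large one.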
# overall multicollinearity diagnostic measures
library(mctest)
omcdiag(as.matrix(expVar), Price)
Call:
omcdiag(x = as.matrix(expVar), y = Price)
Overall Multicollinearity Diagnostics
                       MC Results detection
Determinant |X'X|:         0.8635         0
Farrar Chi-Square:        44.1393         1
Red Indicator:             0.1527         0
Sum of Lambda Inverse:     4.3122         0
Theil's Method:            0.2347         0
Condition Number:        138.8156         1
1 --> COLLINEARITY is detected by the test
0 --> COLLINEARITY is not detected by the test
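Of the six overall measures, only the Farrar chi-square and the condition number flag collinearity among the four continuous x-variables; the other four do not.
The imcdiag() function detects the existence of multicollinearity due to individual x-variables.
Its diagnostics include VIF, TOL, Klein's rule, the Farrar and Glauber F-test, the F and \( R^2 \) relation, Leamer's method and CVIF.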
# individual multicollinearity diagnostic measures
imcdiag(expVar, Price, vif = 5, method = "VIF")
Call:
imcdiag(x = expVar, y = Price, method = "VIF", vif = 5)
VIF Multicollinearity Diagnostics
                       VIF detection
AdvancedBookingDays 1.0044         0
FlyingMinutes       1.0356         0
SeatPitch           1.1157         0
SeatWidth           1.1565         0
NOTE: VIF Method Failed to detect multicollinearity
0 --> COLLINEARITY is not detected by the test
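As a cross-check on these individual diagnostics, the VIF of each continuous x-variable is the corresponding diagonal element of the inverse correlation matrix, and the tolerance (TOL) is its reciprocal. The sketch below uses the rounded correlation matrix printed earlier, typed in by hand, so the results should only agree approximately with the imcdiag() output; the object names are chosen for illustration.

```r
# correlation matrix copied from the printed output (rounded to 2 decimals)
vars <- c("AdvancedBookingDays", "FlyingMinutes", "SeatPitch", "SeatWidth")
corMatPrinted <- matrix(c( 1.00,  0.01, -0.01,  0.05,
                           0.01,  1.00, -0.03, -0.18,
                          -0.01, -0.03,  1.00,  0.32,
                           0.05, -0.18,  0.32,  1.00),
                        nrow = 4, byrow = TRUE,
                        dimnames = list(vars, vars))

# VIF = diagonal of the inverse correlation matrix; TOL = 1 / VIF
vifFromCor <- diag(solve(corMatPrinted))
tolFromCor <- 1 / vifFromCor
round(cbind(VIF = vifFromCor, TOL = tolFromCor), 3)

# determinant of the correlation matrix, the first overall diagnostic
detFromCor <- det(corMatPrinted)
round(detFromCor, 3)
```

These VIFs use only the four continuous x-variables, which is why they are much smaller than the GVIF values from vif(fitOLSModel), where the categorical predictors are also in the model.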