Effect of Seasonality and Advance Booking on New Delhi - Mumbai Air Ticket Prices

Sameer Mathur

Multicollinearity

Regression Diagnostics

---

Reading and Describing Data

Reading data

Reading data into a dataframe

# reading data into R
airline.df <- read.csv("BOMDELBOM.csv")
# attaching data columns of the dataframe
attach(airline.df)

Number of rows and columns

# dimension of the dataframe
dim(airline.df)
[1] 305  23

Descriptive Statistics

# descriptive statistics
library(psych)
describe(airline.df)[, c(1:5, 8:9)]
                    vars   n    mean      sd  median     min      max
FlightNumber*          1 305   31.86   18.35   32.00    1.00    63.00
Airline*               2 305    2.60    0.88    3.00    1.00     4.00
DepartureCityCode*     3 305    1.57    0.50    2.00    1.00     2.00
ArrivalCityCode*       4 305    1.43    0.50    1.00    1.00     2.00
DepartureTime          5 305 1249.54  579.86 1035.00  225.00  2320.00
ArrivalTime            6 305 1329.31  613.52 1215.00   20.00  2345.00
Departure*             7 305    1.45    0.50    1.00    1.00     2.00
FlyingMinutes          8 305  136.03    4.71  135.00  125.00   145.00
Aircraft*              9 305    1.54    0.50    2.00    1.00     2.00
PlaneModel*           10 305    3.82    2.71    3.00    1.00     9.00
Capacity              11 305  176.36   32.39  180.00  138.00   303.00
SeatPitch             12 305   30.26    0.93   30.00   29.00    33.00
SeatWidth             13 305   17.41    0.49   17.00   17.00    18.00
DataCollectionDate*   14 305    4.36    1.98    5.00    1.00     7.00
DateDeparture*        15 305    8.14    6.69    7.00    1.00    20.00
IsWeekend*            16 305    1.13    0.34    1.00    1.00     2.00
Price                 17 305 5394.54 2388.29 4681.00 2607.00 18015.00
AdvancedBookingDays   18 305   28.90   22.30   30.00    2.00    61.00
IsDiwali*             19 305    1.40    0.49    1.00    1.00     2.00
DayBeforeDiwali*      20 305    1.19    0.40    1.00    1.00     2.00
DayAfterDiwali*       21 305    1.20    0.40    1.00    1.00     2.00
MarketShare           22 305   21.18   11.04   15.40   13.20    39.60
LoadFactor            23 305   85.13    4.32   83.32   78.73    94.06

Regression Analysis

Linear OLS Regression Model

# linear OLS model
fitOLSModel <- lm(Price ~
                    AdvancedBookingDays
                  + Airline
                  + Departure
                  + IsWeekend
                  + IsDiwali
                  + DepartureCityCode
                  + FlyingMinutes
                  + SeatPitch
                  + SeatWidth,
                  data = airline.df)
# summary of linear OLS model
summary(fitOLSModel)

Call:
lm(formula = Price ~ AdvancedBookingDays + Airline + Departure + 
    IsWeekend + IsDiwali + DepartureCityCode + FlyingMinutes + 
    SeatPitch + SeatWidth, data = airline.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2671.2 -1266.2  -456.4   517.4 11953.9 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -4292.94    8897.87  -0.482   0.6298    
AdvancedBookingDays    -87.70      12.47  -7.033 1.43e-11 ***
AirlineIndiGo         -577.17     778.64  -0.741   0.4591    
AirlineJet            -120.75     436.69  -0.277   0.7823    
AirlineSpice Jet     -1118.38     697.85  -1.603   0.1101    
DeparturePM           -589.79     275.23  -2.143   0.0329 *  
IsWeekendYes          -345.92     408.06  -0.848   0.3973    
IsDiwaliYes           4346.80     568.14   7.651 2.90e-13 ***
DepartureCityCodeDEL -1413.46     351.54  -4.021 7.38e-05 ***
FlyingMinutes           38.97      29.27   1.331   0.1841    
SeatPitch             -279.19     226.64  -1.232   0.2190    
SeatWidth              868.58     507.54   1.711   0.0881 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2079 on 293 degrees of freedom
Multiple R-squared:  0.2695,    Adjusted R-squared:  0.2421 
F-statistic: 9.828 on 11 and 293 DF,  p-value: 3.604e-15

Regression Diagnostics

Some major assumptions of linear regression are:

  1. Linearity of the data

  2. Normality of residuals

  3. Homogeneity of residuals variance

  4. Multicollinearity

Assumption 4: Multicollinearity

In multiple regression, two or more predictor variables might be correlated with each other. This situation is referred to as collinearity.

Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated.

Unfortunately, when it exists, it can wreak havoc on our analysis and thereby limit the research conclusions we can draw.

Problem 1

The estimated regression coefficient of any one variable depends on which other predictors are included in the model

Problem 2

The precision of the estimated regression coefficients decreases as more predictors are added to the model

Problem 3

Hypothesis tests for \( \beta_k = 0 \) may yield different conclusions depending on which predictors are in the model
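To see Problems 1 and 2 concretely, here is a small simulated sketch (hypothetical data, not the BOM-DEL-BOM data): when a near-duplicate predictor is added, the estimated coefficient of the original predictor shifts and its standard error inflates.

# simulated sketch of collinearity (hypothetical data, not from this study)
set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1)           # x2 is almost a copy of x1
y  <- 2 * x1 + rnorm(200)
coef(lm(y ~ x1))                          # coefficient of x1 is close to 2
coef(lm(y ~ x1 + x2))                     # coefficient of x1 shifts once x2 is added
summary(lm(y ~ x1 + x2))$coefficients     # note the much larger standard errors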

Detecting Multicollinearity

The following are some approaches to detecting multicollinearity among the predictor variables.

1. APPROACH 1: Correlation among the continuous predictor variables

2. APPROACH 2: Variance Inflation Factors (VIF)

3. APPROACH 3: Farrar-Glauber Test

APPROACH 1

Correlation

High Correlation

Correlation Matrix

# correlation matrix of the continuous predictor variables
expVar <- airline.df[c("AdvancedBookingDays", "FlyingMinutes", "SeatPitch", "SeatWidth")]
corMat <- round(cor(expVar), 2)
corMat
                    AdvancedBookingDays FlyingMinutes SeatPitch SeatWidth
AdvancedBookingDays                1.00          0.01     -0.01      0.05
FlyingMinutes                      0.01          1.00     -0.03     -0.18
SeatPitch                         -0.01         -0.03      1.00      0.32
SeatWidth                          0.05         -0.18      0.32      1.00

Highly Correlated Variables

# highly correlated variables
library(caret)
findCorrelation(corMat, cutoff = 0.75, names = TRUE)
character(0)

Interpretation based on Correlation Matrix

Based on the correlation matrix for the BOM-DEL-BOM data:

  • None of the continuous x-variables are highly correlated with each other.

  • Here, we have assumed a cut-off of 0.75 (75%).

APPROACH 2

Variance Inflation Factor (VIF)

Variance Inflation Factor

For a given predictor x, multicollinearity can be assessed by computing a score called the Variance Inflation Factor (or VIF).

VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity.

The extent to which a predictor is correlated with the other predictor variables can be quantified as the \( R^2 \) of an auxiliary regression in which the predictor of interest is regressed on all the other predictors. The variance inflation factor for that variable is then computed as \[ VIF = \frac{1}{1-R^2} \].

For any predictor variable, the square root of the VIF indicates the degree to which the confidence interval for that variable's regression parameter is expanded relative to a model with uncorrelated predictors.
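As a minimal sketch of this computation on the BOM-DEL-BOM data, the VIF for AdvancedBookingDays can be obtained by regressing it on the other continuous predictors and applying the formula above (auxFit and vifManual are illustrative names, not part of the original analysis).

# sketch: VIF for AdvancedBookingDays computed by hand
auxFit <- lm(AdvancedBookingDays ~ FlyingMinutes + SeatPitch + SeatWidth,
             data = airline.df)
r2 <- summary(auxFit)$r.squared           # R-squared of the auxiliary regression
vifManual <- 1 / (1 - r2)                 # VIF = 1 / (1 - R^2)
vifManual
sqrt(vifManual)                           # expansion factor for the confidence interval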

A Technical Point

If the data includes both continuous and categorical variables, generalized variance-inflation factors (Fox and Monette, 1992) are calculated and weighted by the Degrees of Freedom.

Conceptually, they are the same as VIF.

Interpreting the Variance Inflation Factor (VIF)

  1. As a general rule, \( \sqrt{vif} > 2 \) indicates a multicollinearity problem.

  2. As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity (James et al. 2014).

Sometimes a high VIF is of less concern. For example, you can get a high VIF by including interactions between \( x_1 \) and \( x_2 \); or powers of \( x_1 \) such as \( x_1^{2} \).

Note: High VIFs for dummy variables representing categorical predictors can generally be ignored; high VIFs for continuous variables are problematic.

Using R

VIF values can be computed using the vif() function in the car package.

# variance inflation factor (vif)
library(car)
vif(fitOLSModel)
                         GVIF Df GVIF^(1/(2*Df))
AdvancedBookingDays  5.434809  1        2.331268
Airline             10.478006  3        1.479267
Departure            1.320436  1        1.149102
IsWeekend            1.366933  1        1.169159
IsDiwali             5.450405  1        2.334610
DepartureCityCode    2.132262  1        1.460227
FlyingMinutes        1.333785  1        1.154896
SeatPitch            3.137380  1        1.771265
SeatWidth            4.337152  1        2.082583
# square root of the vif
sqrt(vif(fitOLSModel)) > 2
                     GVIF    Df GVIF^(1/(2*Df))
AdvancedBookingDays  TRUE FALSE           FALSE
Airline              TRUE FALSE           FALSE
Departure           FALSE FALSE           FALSE
IsWeekend           FALSE FALSE           FALSE
IsDiwali             TRUE FALSE           FALSE
DepartureCityCode   FALSE FALSE           FALSE
FlyingMinutes       FALSE FALSE           FALSE
SeatPitch           FALSE FALSE           FALSE
SeatWidth            TRUE FALSE           FALSE

Focusing on the degrees-of-freedom-adjusted column, GVIF^(1/(2*Df)), the output suggests that none of the predictor variables have \( \sqrt{vif} \) greater than 2.

Hence, we do not have a multicollinearity problem.

APPROACH 3

Statistical Test (Farrar-Glauber)

Farrar-Glauber Test

The mctest package in R provides the Farrar-Glauber test and other relevant tests for multicollinearity.

The package provides two functions, omcdiag() and imcdiag(), which perform overall and individual diagnostic checking for multicollinearity, respectively.

Overall Multicollinearity Diagnostics Measures (omcdiag)

This function detects the existence of multicollinearity using several diagnostic measures:

  • Determinant of correlation matrix
  • Farrar Chi-square test
  • Red Indicator
  • Sum of lambda (\( \lambda \)) inverse values
  • Theil's Indicator
  • Condition Number

The function also reports the value of each diagnostic measure, along with a decision on whether that measure detects multicollinearity.

# overall multicollinearity diagnostic measures
# (Price is visible here because of attach(airline.df) above)
library(mctest)
omcdiag(as.matrix(expVar), Price)

Call:
omcdiag(x = as.matrix(expVar), y = Price)


Overall Multicollinearity Diagnostics

                       MC Results detection
Determinant |X'X|:         0.8635         0
Farrar Chi-Square:        44.1393         1
Red Indicator:             0.1527         0
Sum of Lambda Inverse:     4.3122         0
Theil's Method:            0.2347         0
Condition Number:        138.8156         1

1 --> COLLINEARITY is detected by the test 
0 --> COLLINEARITY is not detected by the test

Individual Multicollinearity Diagnostic Measures (imcdiag)

The imcdiag function detects the existence of multicollinearity due to individual x-variables.

Its diagnostic measures include VIF, TOL, Klein's rule, the Farrar-Glauber F-test, the F and \( R^2 \) relation, Leamer's method, and CVIF.

# individual multicollinearity diagnostic measures
# (vif = 5 sets the detection threshold for the VIF criterion)
imcdiag(expVar, Price, vif = 5, method = "VIF")

Call:
imcdiag(x = expVar, y = Price, method = "VIF", vif = 5)


 VIF Multicollinearity Diagnostics

                       VIF detection
AdvancedBookingDays 1.0044         0
FlyingMinutes       1.0356         0
SeatPitch           1.1157         0
SeatWidth           1.1565         0

NOTE:  VIF Method Failed to detect multicollinearity


0 --> COLLINEARITY is not detected by the test
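
As a follow-up sketch (output not shown, and the exact interface varies across mctest versions), imcdiag() can also be called without restricting it to the VIF criterion, in which case it reports all of the individual measures listed above for each predictor.

# sketch: all individual diagnostic measures (output not shown;
# newer mctest versions expect a fitted lm object instead of x and y)
imcdiag(expVar, Price)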