Sameer Mathur
Multicollinearity Demonstration in BOM-DEL-BOM data
Regression Diagnostics
---
Reading data into a dataframe
# reading data into R
airline.df <- read.csv(paste("BOMDELBOM.csv"))
# attaching data columns of the dataframe
attach(airline.df)
Number of rows and columns
# dimension of the dataframe
dim(airline.df)
[1] 305 23
# descriptive statistics
library(psych)
describe(airline.df)[, c(1:5, 8:9)]
vars n mean sd median min max
FlightNumber* 1 305 31.86 18.35 32.00 1.00 63.00
Airline* 2 305 2.60 0.88 3.00 1.00 4.00
DepartureCityCode* 3 305 1.57 0.50 2.00 1.00 2.00
ArrivalCityCode* 4 305 1.43 0.50 1.00 1.00 2.00
DepartureTime 5 305 1249.54 579.86 1035.00 225.00 2320.00
ArrivalTime 6 305 1329.31 613.52 1215.00 20.00 2345.00
Departure* 7 305 1.45 0.50 1.00 1.00 2.00
FlyingMinutes 8 305 136.03 4.71 135.00 125.00 145.00
Aircraft* 9 305 1.54 0.50 2.00 1.00 2.00
PlaneModel* 10 305 3.82 2.71 3.00 1.00 9.00
Capacity 11 305 176.36 32.39 180.00 138.00 303.00
SeatPitch 12 305 30.26 0.93 30.00 29.00 33.00
SeatWidth 13 305 17.41 0.49 17.00 17.00 18.00
DataCollectionDate* 14 305 4.36 1.98 5.00 1.00 7.00
DateDeparture* 15 305 8.14 6.69 7.00 1.00 20.00
IsWeekend* 16 305 1.13 0.34 1.00 1.00 2.00
Price 17 305 5394.54 2388.29 4681.00 2607.00 18015.00
AdvancedBookingDays 18 305 28.90 22.30 30.00 2.00 61.00
IsDiwali* 19 305 1.40 0.49 1.00 1.00 2.00
DayBeforeDiwali* 20 305 1.19 0.40 1.00 1.00 2.00
DayAfterDiwali* 21 305 1.20 0.40 1.00 1.00 2.00
MarketShare 22 305 21.18 11.04 15.40 13.20 39.60
LoadFactor 23 305 85.13 4.32 83.32 78.73 94.06
# first few rows of the dataframe
head(airline.df)[, c(1,2,22)]
FlightNumber Airline MarketShare
1 9W 313 Jet 15.4
2 9W 339 Jet 15.4
3 SG 161 Spice Jet 13.2
4 6E 171 IndiGo 39.6
5 SG 160 Spice Jet 13.2
6 9W 762 Jet 15.4
# 'MarketShare' divided by 'Airline'
table(Airline, MarketShare)
MarketShare
Airline 13.2 13.3 15.4 39.6
Air India 0 41 0 0
IndiGo 0 0 0 80
Jet 0 0 144 0
Spice Jet 40 0 0 0
Regression of Price with Airline
Note: Airline is a factor variable
# Model 1
fitOLSModel1 <- lm(Price ~
AdvancedBookingDays
+ Airline
+ Departure
+ IsWeekend
+ IsDiwali
+ DepartureCityCode
+ FlyingMinutes
+ SeatPitch
+ SeatWidth,
data = airline.df)
summary(fitOLSModel1)
Call:
lm(formula = Price ~ AdvancedBookingDays + Airline + Departure +
IsWeekend + IsDiwali + DepartureCityCode + FlyingMinutes +
SeatPitch + SeatWidth, data = airline.df)
Residuals:
Min 1Q Median 3Q Max
-2671.2 -1266.2 -456.4 517.4 11953.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4292.94 8897.87 -0.482 0.6298
AdvancedBookingDays -87.70 12.47 -7.033 1.43e-11 ***
AirlineIndiGo -577.17 778.64 -0.741 0.4591
AirlineJet -120.75 436.69 -0.277 0.7823
AirlineSpice Jet -1118.38 697.85 -1.603 0.1101
DeparturePM -589.79 275.23 -2.143 0.0329 *
IsWeekendYes -345.92 408.06 -0.848 0.3973
IsDiwaliYes 4346.80 568.14 7.651 2.90e-13 ***
DepartureCityCodeDEL -1413.46 351.54 -4.021 7.38e-05 ***
FlyingMinutes 38.97 29.27 1.331 0.1841
SeatPitch -279.19 226.64 -1.232 0.2190
SeatWidth 868.58 507.54 1.711 0.0881 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2079 on 293 degrees of freedom
Multiple R-squared: 0.2695, Adjusted R-squared: 0.2421
F-statistic: 9.828 on 11 and 293 DF, p-value: 3.604e-15
Regression of Price with MarketShare
Note: MarketShare is a continuous variable
Note: Airline is excluded from this regression model.
# Model 2
fitOLSModel2 <- lm(Price ~
AdvancedBookingDays
# + Airline
+ Departure
+ IsWeekend
+ IsDiwali
+ DepartureCityCode
+ FlyingMinutes
+ SeatPitch
+ SeatWidth
+ MarketShare,
data = airline.df)
summary(fitOLSModel2)
Call:
lm(formula = Price ~ AdvancedBookingDays + Departure + IsWeekend +
IsDiwali + DepartureCityCode + FlyingMinutes + SeatPitch +
SeatWidth + MarketShare, data = airline.df)
Residuals:
Min 1Q Median 3Q Max
-2992.3 -1285.8 -413.1 533.2 12094.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9608.675 8195.592 -1.172 0.2420
AdvancedBookingDays -86.321 12.481 -6.916 2.88e-11 ***
DeparturePM -427.476 262.803 -1.627 0.1049
IsWeekendYes -153.076 392.867 -0.390 0.6971
IsDiwaliYes 4338.015 569.799 7.613 3.64e-13 ***
DepartureCityCodeDEL -1584.553 327.227 -4.842 2.07e-06 ***
FlyingMinutes 54.078 28.288 1.912 0.0569 .
SeatPitch -16.405 174.302 -0.094 0.9251
SeatWidth 569.942 484.705 1.176 0.2406
MarketShare 5.523 20.799 0.266 0.7908
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2085 on 295 degrees of freedom
Multiple R-squared: 0.2602, Adjusted R-squared: 0.2376
F-statistic: 11.53 on 9 and 295 DF, p-value: 1.685e-15
Regression of Price with Airline and its MarketShare
Note: Airline is a factor variable
Note: MarketShare is a continuous variable
# Model 12
fitOLSModel12 <- lm(Price ~
AdvancedBookingDays
+ Airline
+ Departure
+ IsWeekend
+ IsDiwali
+ DepartureCityCode
+ FlyingMinutes
+ SeatPitch
+ SeatWidth
+ MarketShare,
data = airline.df)
summary(fitOLSModel12)
Call:
lm(formula = Price ~ AdvancedBookingDays + Airline + Departure +
IsWeekend + IsDiwali + DepartureCityCode + FlyingMinutes +
SeatPitch + SeatWidth + MarketShare, data = airline.df)
Residuals:
Min 1Q Median 3Q Max
-2671.2 -1266.2 -456.4 517.4 11953.9
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4292.94 8897.87 -0.482 0.6298
AdvancedBookingDays -87.70 12.47 -7.033 1.43e-11 ***
AirlineIndiGo -577.17 778.64 -0.741 0.4591
AirlineJet -120.75 436.69 -0.277 0.7823
AirlineSpice Jet -1118.38 697.85 -1.603 0.1101
DeparturePM -589.79 275.23 -2.143 0.0329 *
IsWeekendYes -345.92 408.06 -0.848 0.3973
IsDiwaliYes 4346.80 568.14 7.651 2.90e-13 ***
DepartureCityCodeDEL -1413.46 351.54 -4.021 7.38e-05 ***
FlyingMinutes 38.97 29.27 1.331 0.1841
SeatPitch -279.19 226.64 -1.232 0.2190
SeatWidth 868.58 507.54 1.711 0.0881 .
MarketShare NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2079 on 293 degrees of freedom
Multiple R-squared: 0.2695, Adjusted R-squared: 0.2421
F-statistic: 9.828 on 11 and 293 DF, p-value: 3.604e-15
Regression Model 12 shows evidence of Multicollinearity
Since MarketShare and Airline are 100% correlated, we should not include both variables in our Regression.
Regression of Price with Airline and its MarketShare
Note: Airline is a factor variable
Note: MarketShare is also a factor variable
# Model 12B
# converting 'MarketShare' into a factor variable
airline.df$MarketShare <- as.factor(airline.df$MarketShare)
fitOLSModel12B <- lm(Price ~
AdvancedBookingDays
+ Airline
+ Departure
+ IsWeekend
+ IsDiwali
+ DepartureCityCode
+ FlyingMinutes
+ SeatPitch
+ SeatWidth
+ MarketShare,
data = airline.df)
summary(fitOLSModel12B)
Call:
lm(formula = Price ~ AdvancedBookingDays + Airline + Departure +
IsWeekend + IsDiwali + DepartureCityCode + FlyingMinutes +
SeatPitch + SeatWidth + MarketShare, data = airline.df)
Residuals:
Min 1Q Median 3Q Max
-2671.2 -1266.2 -456.4 517.4 11953.9
Coefficients: (3 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4292.94 8897.87 -0.482 0.6298
AdvancedBookingDays -87.70 12.47 -7.033 1.43e-11 ***
AirlineIndiGo -577.17 778.64 -0.741 0.4591
AirlineJet -120.75 436.69 -0.277 0.7823
AirlineSpice Jet -1118.38 697.85 -1.603 0.1101
DeparturePM -589.79 275.23 -2.143 0.0329 *
IsWeekendYes -345.92 408.06 -0.848 0.3973
IsDiwaliYes 4346.80 568.14 7.651 2.90e-13 ***
DepartureCityCodeDEL -1413.46 351.54 -4.021 7.38e-05 ***
FlyingMinutes 38.97 29.27 1.331 0.1841
SeatPitch -279.19 226.64 -1.232 0.2190
SeatWidth 868.58 507.54 1.711 0.0881 .
MarketShare13.3 NA NA NA NA
MarketShare15.4 NA NA NA NA
MarketShare39.6 NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2079 on 293 degrees of freedom
Multiple R-squared: 0.2695, Adjusted R-squared: 0.2421
F-statistic: 9.828 on 11 and 293 DF, p-value: 3.604e-15
It does not matter whether MarketShare is modeled as Continuous or Factor.
Regression Model 12B also shows evidence of Multicollinearity
Since MarketShare and Airline are 100% correlated, we should not include both variables in our regression.