Sameer Mathur
READING AND PREPARING DATA
Number of Rows and Columns
# reading data into R
airline.df <- read.csv(paste("AirlinePricingData.csv"))
# rows and columns of the data frame
dim(airline.df)
[1] 305 24
Column Names
# data column names
colnames(airline.df)
[1] "FlightNumber" "Airline" "DepartureCityCode"
[4] "ArrivalCityCode" "DepartureTime" "ArrivalTime"
[7] "Departure" "FlyingMinutes" "Aircraft"
[10] "PlaneModel" "Capacity" "SeatPitch"
[13] "SeatWidth" "DataCollectionDate" "DateDeparture"
[16] "Price" "AdvancedBookingDays" "IsDiwali"
[19] "DayBeforeDiwali" "DayAfterDiwali" "MetroDeparture"
[22] "MetroArrival" "MarketShare" "LoadFactor"
# data types of the data columns
str(airline.df)
'data.frame': 305 obs. of 24 variables:
$ FlightNumber : Factor w/ 63 levels "6E 129","6E 155",..: 25 25 25 25 25 32 32 32 32 32 ...
$ Airline : Factor w/ 4 levels "Air India","IndiGo",..: 3 3 3 3 3 3 3 3 3 3 ...
$ DepartureCityCode : Factor w/ 2 levels "BOM","DEL": 2 2 2 2 2 1 1 1 1 1 ...
$ ArrivalCityCode : Factor w/ 2 levels "BOM","DEL": 1 1 1 1 1 2 2 2 2 2 ...
$ DepartureTime : int 225 225 225 225 225 300 300 300 300 300 ...
$ ArrivalTime : int 435 435 435 435 435 505 505 505 505 505 ...
$ Departure : Factor w/ 2 levels "AM","PM": 1 1 1 1 1 1 1 1 1 1 ...
$ FlyingMinutes : int 130 130 130 130 130 125 125 125 125 125 ...
$ Aircraft : Factor w/ 2 levels "Airbus","Boeing": 2 2 2 2 2 2 2 2 2 2 ...
$ PlaneModel : Factor w/ 9 levels "738","739","77W",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Capacity : int 156 156 156 156 156 156 156 156 156 156 ...
$ SeatPitch : int 30 30 30 30 30 30 30 30 30 30 ...
$ SeatWidth : num 17 17 17 17 17 17 17 17 17 17 ...
$ DataCollectionDate : Factor w/ 7 levels "Sep 10 2018",..: 2 2 2 2 2 4 4 4 4 4 ...
$ DateDeparture : Factor w/ 20 levels "Nov 6 2018","Nov 8 2018",..: 12 16 2 1 4 14 18 6 1 2 ...
$ Price : int 3453 4251 4681 4051 3553 4628 4628 3604 11587 5678 ...
$ AdvancedBookingDays: int 2 7 56 54 30 2 7 30 52 54 ...
$ IsDiwali : int 0 0 1 1 0 0 0 0 1 1 ...
$ DayBeforeDiwali : int 0 0 0 1 0 0 0 0 1 0 ...
$ DayAfterDiwali : int 0 0 1 0 0 0 0 0 0 1 ...
$ MetroDeparture : int 1 1 1 1 1 1 1 1 1 1 ...
$ MetroArrival : int 1 1 1 1 1 1 1 1 1 1 ...
$ MarketShare : num 15.4 15.4 15.4 15.4 15.4 15.4 15.4 15.4 15.4 15.4 ...
$ LoadFactor : num 83.3 83.3 83.3 83.3 83.3 ...
# convert 'IsDiwali' into factor
airline.df$IsDiwali <- as.factor(airline.df$IsDiwali)
# convert 'DayBeforeDiwali' into factor
airline.df$DayBeforeDiwali <- as.factor(airline.df$DayBeforeDiwali)
# convert 'DayAfterDiwali' into factor
airline.df$DayAfterDiwali <- as.factor(airline.df$DayAfterDiwali)
# convert 'MetroDeparture' into factor
airline.df$MetroDeparture <- as.factor(airline.df$MetroDeparture)
# convert 'MetroArrival' into factor
airline.df$MetroArrival <- as.factor(airline.df$MetroArrival)
# verify
str(airline.df)
'data.frame': 305 obs. of 24 variables:
$ FlightNumber : Factor w/ 63 levels "6E 129","6E 155",..: 25 25 25 25 25 32 32 32 32 32 ...
$ Airline : Factor w/ 4 levels "Air India","IndiGo",..: 3 3 3 3 3 3 3 3 3 3 ...
$ DepartureCityCode : Factor w/ 2 levels "BOM","DEL": 2 2 2 2 2 1 1 1 1 1 ...
$ ArrivalCityCode : Factor w/ 2 levels "BOM","DEL": 1 1 1 1 1 2 2 2 2 2 ...
$ DepartureTime : int 225 225 225 225 225 300 300 300 300 300 ...
$ ArrivalTime : int 435 435 435 435 435 505 505 505 505 505 ...
$ Departure : Factor w/ 2 levels "AM","PM": 1 1 1 1 1 1 1 1 1 1 ...
$ FlyingMinutes : int 130 130 130 130 130 125 125 125 125 125 ...
$ Aircraft : Factor w/ 2 levels "Airbus","Boeing": 2 2 2 2 2 2 2 2 2 2 ...
$ PlaneModel : Factor w/ 9 levels "738","739","77W",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Capacity : int 156 156 156 156 156 156 156 156 156 156 ...
$ SeatPitch : int 30 30 30 30 30 30 30 30 30 30 ...
$ SeatWidth : num 17 17 17 17 17 17 17 17 17 17 ...
$ DataCollectionDate : Factor w/ 7 levels "Sep 10 2018",..: 2 2 2 2 2 4 4 4 4 4 ...
$ DateDeparture : Factor w/ 20 levels "Nov 6 2018","Nov 8 2018",..: 12 16 2 1 4 14 18 6 1 2 ...
$ Price : int 3453 4251 4681 4051 3553 4628 4628 3604 11587 5678 ...
$ AdvancedBookingDays: int 2 7 56 54 30 2 7 30 52 54 ...
$ IsDiwali : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 2 2 ...
$ DayBeforeDiwali : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 2 1 ...
$ DayAfterDiwali : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 2 ...
$ MetroDeparture : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
$ MetroArrival : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
$ MarketShare : num 15.4 15.4 15.4 15.4 15.4 15.4 15.4 15.4 15.4 15.4 ...
$ LoadFactor : num 83.3 83.3 83.3 83.3 83.3 ...
DATA SUMMARY
# attaching data columns
attach(airline.df)
# descriptive statistics
library(psych)
describe(airline.df)[c(7, 10:12, 15:17, 22, 23), c(2:5, 8:9)]
n mean sd median min max
Departure* 305 1.45 0.50 1.0 1.0 2.0
PlaneModel* 305 3.82 2.71 3.0 1.0 9.0
Capacity 305 176.36 32.39 180.0 138.0 303.0
SeatPitch 305 30.26 0.93 30.0 29.0 33.0
DateDeparture* 305 8.14 6.69 7.0 1.0 20.0
Price 305 5394.54 2388.29 4681.0 2607.0 18015.0
AdvancedBookingDays 305 28.90 22.30 30.0 2.0 61.0
MetroArrival* 305 1.00 0.00 1.0 1.0 1.0
MarketShare 305 21.18 11.04 15.4 13.2 39.6
# flights by airline in Diwali
table(Airline, IsDiwali)
IsDiwali
Airline 0 1
Air India 25 16
IndiGo 48 32
Jet 87 57
Spice Jet 24 16
# average ticket price by airline
agg0 <- aggregate(list(Price, AdvancedBookingDays), by = list(Airline), mean)
colnames(agg0) <- c("Airline", "AverageTicketPrice", "AdvancedBookingDays")
agg0
Airline AverageTicketPrice AdvancedBookingDays
1 Air India 6335.000 26.78049
2 IndiGo 4879.525 31.40000
3 Jet 5496.146 28.54167
4 Spice Jet 5094.850 27.40000
# average ticket price by airline
agg1 <- aggregate(Price, by = list(Airline), mean)
colnames(agg1) <- c("Airline", "AverageTicketPrice")
#agg1
# sd ticket price by airline
agg2 <- aggregate(Price, by = list(Airline), sd)
colnames(agg2) <- c("Airline", "SD")
#agg2
# average ticket price and standard deviation
agg <- merge(agg1, agg2, by = c("Airline"))
agg
Airline AverageTicketPrice SD
1 Air India 6335.000 2693.637
2 IndiGo 4879.525 2956.455
3 Jet 5496.146 1989.659
4 Spice Jet 5094.850 1787.422
REGRESSION ANALYSIS
# linear OLS model
fitLinOLSModel <- lm(Price ~
IsDiwali
+ Airline
+ AdvancedBookingDays
+ Departure,
data = airline.df)
# summary of linear OLS model
summary(fitLinOLSModel)
Call:
lm(formula = Price ~ IsDiwali + Airline + AdvancedBookingDays +
Departure, data = airline.df)
Residuals:
Min 1Q Median 3Q Max
-3094.3 -1241.2 -582.9 632.9 11926.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7335.27 403.98 18.157 < 2e-16 ***
IsDiwali1 4510.43 581.50 7.757 1.40e-13 ***
AirlineIndiGo -1067.19 415.66 -2.567 0.01073 *
AirlineJet -743.56 380.52 -1.954 0.05163 .
AirlineSpice Jet -1581.70 497.42 -3.180 0.00163 **
AdvancedBookingDays -89.89 12.82 -7.012 1.57e-11 ***
DeparturePM -658.29 264.66 -2.487 0.01342 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2145 on 298 degrees of freedom
Multiple R-squared: 0.2091, Adjusted R-squared: 0.1932
F-statistic: 13.13 on 6 and 298 DF, p-value: 3.426e-13
# log-linear OLS model
fitLogLinOLSModel <- lm(log(Price) ~
IsDiwali
+ Airline
+ AdvancedBookingDays
+ Departure,
data = airline.df)
# summary of log-linear OLS model
summary(fitLogLinOLSModel)
Call:
lm(formula = log(Price) ~ IsDiwali + Airline + AdvancedBookingDays +
Departure, data = airline.df)
Residuals:
Min 1Q Median 3Q Max
-0.61740 -0.17271 -0.06197 0.15253 1.23127
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.835518 0.057652 153.257 < 2e-16 ***
IsDiwali1 0.770497 0.082984 9.285 < 2e-16 ***
AirlineIndiGo -0.237650 0.059319 -4.006 7.80e-05 ***
AirlineJet -0.117291 0.054303 -2.160 0.031576 *
AirlineSpice Jet -0.249983 0.070987 -3.522 0.000496 ***
AdvancedBookingDays -0.015091 0.001829 -8.250 5.17e-15 ***
DeparturePM -0.074840 0.037770 -1.981 0.048457 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3061 on 298 degrees of freedom
Multiple R-squared: 0.2858, Adjusted R-squared: 0.2714
F-statistic: 19.87 on 6 and 298 DF, p-value: < 2.2e-16
INTERACTION MODEL
# linear OLS intearction model
fitLinOLSIntModel <- lm(Price ~
IsDiwali
+ Airline * AdvancedBookingDays
+ Departure,
data = airline.df)
# summary of linear OLS interaction model
summary(fitLinOLSIntModel)
Call:
lm(formula = Price ~ IsDiwali + Airline * AdvancedBookingDays +
Departure, data = airline.df)
Residuals:
Min 1Q Median 3Q Max
-3589.5 -1162.2 -422.9 828.5 11145.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5943.73 573.72 10.360 < 2e-16
IsDiwali1 4263.05 570.13 7.477 8.71e-13
AirlineIndiGo 1153.02 662.55 1.740 0.0829
AirlineJet 392.06 613.35 0.639 0.5232
AirlineSpice Jet -305.80 794.36 -0.385 0.7005
AdvancedBookingDays -34.86 20.27 -1.720 0.0865
DeparturePM -631.40 257.85 -2.449 0.0149
AirlineIndiGo:AdvancedBookingDays -78.75 18.87 -4.172 3.97e-05
AirlineJet:AdvancedBookingDays -43.08 18.03 -2.389 0.0175
AirlineSpice Jet:AdvancedBookingDays -47.20 23.12 -2.041 0.0421
(Intercept) ***
IsDiwali1 ***
AirlineIndiGo .
AirlineJet
AirlineSpice Jet
AdvancedBookingDays .
DeparturePM *
AirlineIndiGo:AdvancedBookingDays ***
AirlineJet:AdvancedBookingDays *
AirlineSpice Jet:AdvancedBookingDays *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2089 on 295 degrees of freedom
Multiple R-squared: 0.2574, Adjusted R-squared: 0.2348
F-statistic: 11.36 on 9 and 295 DF, p-value: 2.812e-15
# log-linear OLS interaction model
fitLogLinOLSIntModel <- lm(log(Price) ~
IsDiwali
+ Airline * AdvancedBookingDays
+ Departure,
data = airline.df)
# summary of log-linear OLS interaction model
summary(fitLogLinOLSIntModel)
Call:
lm(formula = log(Price) ~ IsDiwali + Airline * AdvancedBookingDays +
Departure, data = airline.df)
Residuals:
Min 1Q Median 3Q Max
-0.6551 -0.1874 -0.0555 0.1891 1.1508
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.677048 0.083001 104.542 < 2e-16
IsDiwali1 0.744040 0.082481 9.021 < 2e-16
AirlineIndiGo 0.006169 0.095852 0.064 0.94872
AirlineJet 0.018354 0.088734 0.207 0.83628
AirlineSpice Jet -0.104455 0.114920 -0.909 0.36413
AdvancedBookingDays -0.008847 0.002932 -3.017 0.00277
DeparturePM -0.071880 0.037303 -1.927 0.05495
AirlineIndiGo:AdvancedBookingDays -0.008678 0.002731 -3.178 0.00164
AirlineJet:AdvancedBookingDays -0.005127 0.002609 -1.965 0.05035
AirlineSpice Jet:AdvancedBookingDays -0.005385 0.003345 -1.610 0.10853
(Intercept) ***
IsDiwali1 ***
AirlineIndiGo
AirlineJet
AirlineSpice Jet
AdvancedBookingDays **
DeparturePM .
AirlineIndiGo:AdvancedBookingDays **
AirlineJet:AdvancedBookingDays .
AirlineSpice Jet:AdvancedBookingDays
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3022 on 295 degrees of freedom
Multiple R-squared: 0.3108, Adjusted R-squared: 0.2898
F-statistic: 14.78 on 9 and 295 DF, p-value: < 2.2e-16
REGRESSION DIAGNOSTICS
# Goldfeld-Quandt Test
library(lmtest)
gqtest(fitLinOLSIntModel, order.by = ~ Price, data = airline.df)
Goldfeld-Quandt test
data: fitLinOLSIntModel
GQ = 19.14, df1 = 143, df2 = 142, p-value < 2.2e-16
alternative hypothesis: variance increases from segment 1 to 2
library(lmtest)
# BP-test for log OLS model
bptest(fitLinOLSIntModel)
studentized Breusch-Pagan test
data: fitLinOLSIntModel
BP = 55.248, df = 9, p-value = 1.092e-08
# distribution of studentized residuals
library(MASS)
sresid <- studres(fitLinOLSIntModel)
hist(sresid, freq=FALSE,
main="Distribution of Studentized Residuals")
xfit<-seq(min(sresid),max(sresid),length=40)
yfit<-dnorm(xfit)
lines(xfit, yfit)
# residual plots of log OLS model
par(mfrow=c(2,2))
plot(fitLinOLSIntModel)
FGLS REGRESSION MODEL
# Step 1:Residuals of linear OLS Model
LogOLSModelRes <- resid(fitLinOLSIntModel)
# Step 2: Taking square of the residuals of linear OLS Model
LogOLSModelResSq <- LogOLSModelRes^2
# Step 3: Taking natural log of the squared residuals of linear OLS Model
lnOLSResSq <- log(LogOLSModelResSq)
# Step 4: Running auxiliary OLS Model
auxOLSModel <- lm(lnOLSResSq ~
IsDiwali
+ Airline * AdvancedBookingDays
+ Departure,
data = airline.df)
# Step 5: Get fitted value of auxiliary OLS Model i.e. 'auxOLSModel'
fittedValue <- fitted(auxOLSModel)
# Step 6: Compute exponential values of fiited value for auxialiary OLS Model
expValue <- exp(fittedValue)
# Step 7: Fit Log-linear FGLS Model
fitLinFGLSIntModel <- lm(Price ~
IsDiwali
+ Airline * AdvancedBookingDays
+ Departure,
weights = 1/expValue, data = airline.df)
# summary of linear FGLS model
summary(fitLinFGLSIntModel)
Call:
lm(formula = Price ~ IsDiwali + Airline * AdvancedBookingDays +
Departure, data = airline.df, weights = 1/expValue)
Weighted Residuals:
Min 1Q Median 3Q Max
-5.8414 -0.9924 -0.3923 0.9229 8.2168
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5511.96 243.15 22.669 <2e-16
IsDiwali1 3180.36 322.25 9.869 <2e-16
AirlineIndiGo -625.09 411.69 -1.518 0.1300
AirlineJet 383.80 334.18 1.148 0.2517
AirlineSpice Jet -136.29 595.45 -0.229 0.8191
AdvancedBookingDays -21.97 11.37 -1.933 0.0542
DeparturePM -86.80 117.91 -0.736 0.4622
AirlineIndiGo:AdvancedBookingDays -38.13 15.06 -2.532 0.0119
AirlineJet:AdvancedBookingDays -35.54 14.17 -2.509 0.0127
AirlineSpice Jet:AdvancedBookingDays -21.34 21.58 -0.989 0.3235
(Intercept) ***
IsDiwali1 ***
AirlineIndiGo
AirlineJet
AirlineSpice Jet
AdvancedBookingDays .
DeparturePM
AirlineIndiGo:AdvancedBookingDays *
AirlineJet:AdvancedBookingDays *
AirlineSpice Jet:AdvancedBookingDays
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.683 on 295 degrees of freedom
Multiple R-squared: 0.4453, Adjusted R-squared: 0.4284
F-statistic: 26.31 on 9 and 295 DF, p-value: < 2.2e-16
# Step 1:Residuals of linear OLS Model
LogOLSModelRes <- resid(fitLogLinOLSIntModel)
# Step 2: Taking square of the residuals of linear OLS Model
LogOLSModelResSq <- LogOLSModelRes^2
# Step 3: Taking natural log of the squared residuals of linear OLS Model
lnOLSResSq <- log(LogOLSModelResSq)
# Step 4: Running auxiliary OLS Model
auxOLSModel <- lm(lnOLSResSq ~
IsDiwali
+ Airline * AdvancedBookingDays
+ Departure,
data = airline.df)
# Step 5: Get fitted value of auxiliary OLS Model i.e. 'auxOLSModel'
fittedValue <- fitted(auxOLSModel)
# Step 6: Compute exponential values of fiited value for auxialiary OLS Model
expValue <- exp(fittedValue)
# Step 7: Fit Log-linear FGLS Model
fitLogLinFGLSIntModel <- lm(log(Price) ~
IsDiwali
+ Airline * AdvancedBookingDays
+ Departure,
weights = 1/expValue, data = airline.df)
# summary of linear FGLS model
summary(fitLogLinFGLSIntModel)
Call:
lm(formula = Price ~ IsDiwali + Airline * AdvancedBookingDays +
Departure, data = airline.df, weights = 1/expValue)
Weighted Residuals:
Min 1Q Median 3Q Max
-5.8414 -0.9924 -0.3923 0.9229 8.2168
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5511.96 243.15 22.669 <2e-16
IsDiwali1 3180.36 322.25 9.869 <2e-16
AirlineIndiGo -625.09 411.69 -1.518 0.1300
AirlineJet 383.80 334.18 1.148 0.2517
AirlineSpice Jet -136.29 595.45 -0.229 0.8191
AdvancedBookingDays -21.97 11.37 -1.933 0.0542
DeparturePM -86.80 117.91 -0.736 0.4622
AirlineIndiGo:AdvancedBookingDays -38.13 15.06 -2.532 0.0119
AirlineJet:AdvancedBookingDays -35.54 14.17 -2.509 0.0127
AirlineSpice Jet:AdvancedBookingDays -21.34 21.58 -0.989 0.3235
(Intercept) ***
IsDiwali1 ***
AirlineIndiGo
AirlineJet
AirlineSpice Jet
AdvancedBookingDays .
DeparturePM
AirlineIndiGo:AdvancedBookingDays *
AirlineJet:AdvancedBookingDays *
AirlineSpice Jet:AdvancedBookingDays
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.683 on 295 degrees of freedom
Multiple R-squared: 0.4453, Adjusted R-squared: 0.4284
F-statistic: 26.31 on 9 and 295 DF, p-value: < 2.2e-16
MULTICOLLINEARITY TEST
library(car)
vif(fitLinOLSIntModel)
GVIF Df GVIF^(1/(2*Df))
IsDiwali 5.436007 1 2.331525
Airline 22.001184 3 1.673944
AdvancedBookingDays 14.223236 1 3.771371
Departure 1.147835 1 1.071371
Airline:AdvancedBookingDays 85.026222 3 2.096970
expVar <- airline.df[c("Price", "AdvancedBookingDays")]
#cor(expVar)
library(Hmisc)
rcorr(as.matrix(expVar))
Price AdvancedBookingDays
Price 1.00 -0.01
AdvancedBookingDays -0.01 1.00
n= 305
P
Price AdvancedBookingDays
Price 0.8732
AdvancedBookingDays 0.8732
The mctest package in R provides the Farrar-Glauber test and other relevant tests for multicollinearity.
There are two functions viz. omcdiag() and imcdiag() under mctest package in R which will provide the overall and individual diagnostic checking for multicollinearity respectively.
library(mctest)
omcdiag(as.matrix(expVar), Price)
Call:
omcdiag(x = as.matrix(expVar), y = Price)
Overall Multicollinearity Diagnostics
MC Results detection
Determinant |X'X|: 0.9999 0
Farrar Chi-Square: 0.0252 0
Red Indicator: 0.0092 0
Sum of Lambda Inverse: 2.0002 0
Theil's Method: -0.9998 0
Condition Number: 5.8290 0
1 --> COLLINEARITY is detected by the test
0 --> COLLINEARITY is not detected by the test
imcdiag(expVar, Price)
Call:
imcdiag(x = expVar, y = Price)
All Individual Multicollinearity Diagnostics Result
VIF TOL Wi Fi Leamer CVIF Klein
Price 1.0001 0.9999 0.0255 Inf 1 0 0
AdvancedBookingDays 1.0001 0.9999 0.0255 Inf 1 0 0
1 --> COLLINEARITY is detected by the test
0 --> COLLINEARITY is not detected by the test
AdvancedBookingDays , coefficient(s) are non-significant may be due to multicollinearity
R-square of y on all x: 1