Effect of Seasonality and Advance Booking on BOM-DEL-BOM Air Ticket Prices

Sameer Mathur

Outliers and High Leverage Points

Regression Diagnostics

---

Reading and Describing Data

Reading data

Reading data into a dataframe

# reading data into R
airline.df <- read.csv("BOMDELBOM.csv")
# attaching data columns of the dataframe
attach(airline.df)

Number of rows and columns

# dimension of the dataframe
dim(airline.df)
[1] 305  23

Descriptive Statistics

# descriptive statistics
library(psych)
describe(airline.df)[, c(1:5, 8:9)]
                    vars   n    mean      sd  median     min      max
FlightNumber*          1 305   31.86   18.35   32.00    1.00    63.00
Airline*               2 305    2.60    0.88    3.00    1.00     4.00
DepartureCityCode*     3 305    1.57    0.50    2.00    1.00     2.00
ArrivalCityCode*       4 305    1.43    0.50    1.00    1.00     2.00
DepartureTime          5 305 1249.54  579.86 1035.00  225.00  2320.00
ArrivalTime            6 305 1329.31  613.52 1215.00   20.00  2345.00
Departure*             7 305    1.45    0.50    1.00    1.00     2.00
FlyingMinutes          8 305  136.03    4.71  135.00  125.00   145.00
Aircraft*              9 305    1.54    0.50    2.00    1.00     2.00
PlaneModel*           10 305    3.82    2.71    3.00    1.00     9.00
Capacity              11 305  176.36   32.39  180.00  138.00   303.00
SeatPitch             12 305   30.26    0.93   30.00   29.00    33.00
SeatWidth             13 305   17.41    0.49   17.00   17.00    18.00
DataCollectionDate*   14 305    4.36    1.98    5.00    1.00     7.00
DateDeparture*        15 305    8.14    6.69    7.00    1.00    20.00
IsWeekend*            16 305    1.13    0.34    1.00    1.00     2.00
Price                 17 305 5394.54 2388.29 4681.00 2607.00 18015.00
AdvancedBookingDays   18 305   28.90   22.30   30.00    2.00    61.00
IsDiwali*             19 305    1.40    0.49    1.00    1.00     2.00
DayBeforeDiwali*      20 305    1.19    0.40    1.00    1.00     2.00
DayAfterDiwali*       21 305    1.20    0.40    1.00    1.00     2.00
MarketShare           22 305   21.18   11.04   15.40   13.20    39.60
LoadFactor            23 305   85.13    4.32   83.32   78.73    94.06

Regression Analysis

Linear OLS Regression Model

# linear OLS model
fitOLSModel <- lm(Price ~ 
                    AdvancedBookingDays 
                  + Airline 
                  + Departure 
                  + IsWeekend 
                  + IsDiwali 
                  + DepartureCityCode 
                  + FlyingMinutes 
                  + SeatPitch 
                  + SeatWidth, 
                  data = airline.df)
# summary of linear OLS model
summary(fitOLSModel)

Call:
lm(formula = Price ~ AdvancedBookingDays + Airline + Departure + 
    IsWeekend + IsDiwali + DepartureCityCode + FlyingMinutes + 
    SeatPitch + SeatWidth, data = airline.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2671.2 -1266.2  -456.4   517.4 11953.9 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -4292.94    8897.87  -0.482   0.6298    
AdvancedBookingDays    -87.70      12.47  -7.033 1.43e-11 ***
AirlineIndiGo         -577.17     778.64  -0.741   0.4591    
AirlineJet            -120.75     436.69  -0.277   0.7823    
AirlineSpice Jet     -1118.38     697.85  -1.603   0.1101    
DeparturePM           -589.79     275.23  -2.143   0.0329 *  
IsWeekendYes          -345.92     408.06  -0.848   0.3973    
IsDiwaliYes           4346.80     568.14   7.651 2.90e-13 ***
DepartureCityCodeDEL -1413.46     351.54  -4.021 7.38e-05 ***
FlyingMinutes           38.97      29.27   1.331   0.1841    
SeatPitch             -279.19     226.64  -1.232   0.2190    
SeatWidth              868.58     507.54   1.711   0.0881 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2079 on 293 degrees of freedom
Multiple R-squared:  0.2695,    Adjusted R-squared:  0.2421 
F-statistic: 9.828 on 11 and 293 DF,  p-value: 3.604e-15

Boxplot of Variable Price

[Figure: boxplot of Price]

Histogram of Variable Price

[Figure: histogram of Price]

The distribution of Price is right-skewed, with a long tail of high prices. This suggests the presence of outliers in our dataset.

Normality Plot

[Figure: normal Q-Q plot of residuals]

The residuals are not normally distributed. The plot suggests that the data contain both low and high outliers.
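The slides show the normality plot but not the call that draws it. A minimal sketch of one way to produce such a plot, assuming simulated right-skewed data in place of BOMDELBOM.csv; the names sim.df and fitSim are hypothetical and not part of the original analysis.

```r
# Simulated stand-in for the airline data (hypothetical; not BOMDELBOM.csv)
set.seed(1)
n <- 200
sim.df <- data.frame(x = runif(n, 2, 61))
# log-normal noise gives a right-skewed response, loosely like ticket prices
sim.df$y <- 6000 - 40 * sim.df$x + 1500 * rlnorm(n, sdlog = 0.8)

fitSim <- lm(y ~ x, data = sim.df)

# normal Q-Q plot of the standardized residuals
res <- rstandard(fitSim)
qqnorm(res, main = "Normal Q-Q plot of standardized residuals")
qqline(res)  # reference line through the first and third quartiles
```

Points that bend away from the reference line in the upper tail are the pattern a right-skewed response tends to produce. Calling plot(fitSim, 2) draws a comparable plot directly from the model object.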

Quartiles of Variable Price

# quantiles
quantile(Price)
   0%   25%   50%   75%  100% 
 2607  4051  4681  5725 18015 

where

  • 0% is minimum
  • 50% is median
  • 100% is maximum
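The quartiles also explain which points the boxplot draws as outliers. By default, R's boxplot marks points lying more than 1.5 times the interquartile range beyond the quartiles. A small sketch using the quartiles printed above; note that boxplot works from hinges, which can differ slightly from the quantile() values, so the fences below are approximate.

```r
# Quartiles of Price as printed by quantile(Price) above
q1 <- 4051
q3 <- 5725

iqr <- q3 - q1                  # interquartile range
lowerFence <- q1 - 1.5 * iqr    # points below this are drawn as outliers
upperFence <- q3 + 1.5 * iqr    # points above this are drawn as outliers

c(IQR = iqr, lower = lowerFence, upper = upperFence)
#  IQR lower upper
# 1674  1540  8236
```

The minimum price (2607) lies above the lower fence, so only high prices are flagged. The maximum price (18015) is far above the upper fence.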

Outliers and High Leverage Points

Outliers

An outlier is a data point whose response y does not follow the general trend of the rest of the data.

Outliers may affect the interpretation of the model, because they increase the residual standard error (RSE).

Identifying Outliers

Outliers can be identified by examining the standardized residual (or studentized residual), which is the residual divided by its estimated standard error.

Standardized residuals can be interpreted as the number of standard errors away from the regression line.

Identifying Outliers

Observations whose standardized residuals are greater than 3 in absolute value are outliers (James et al. 2014).
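The rule above can be sketched in a few lines of R. This is an illustration on simulated data with one planted outlier, not the airline data; all object names are hypothetical.

```r
# Simulated data with one planted outlier (hypothetical example)
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[10] <- y[10] + 15            # shift observation 10 far from the trend
fit <- lm(y ~ x)

# standardized residual by hand: e_i / (s * sqrt(1 - h_ii))
e <- resid(fit)                # raw residuals
h <- hatvalues(fit)            # leverage of each observation
s <- sigma(fit)                # residual standard error
stdRes <- e / (s * sqrt(1 - h))

# the built-in function should agree with the manual calculation
all.equal(stdRes, rstandard(fit))

# observations more than 3 standard errors from the regression line
outlierRows <- which(abs(stdRes) > 3)
outlierRows
```

rstandard() gives the standardized (internally studentized) residuals. rstudent() gives the externally studentized version, in which each residual is scaled by a standard error estimated without that observation.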

High Leverage Points

A data point has high leverage if it has extreme predictor x values.

With a single predictor, an extreme x value is simply one that is particularly high or low.

With multiple predictors, extreme x values may be particularly high or low for one or more predictors, or may be unusual combinations of predictor values.

For example, with two predictors that are positively correlated, an unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor.

An observation has high leverage if its leverage statistic exceeds \[ \frac{2(p + 1)}{n} \] (P. Bruce and Bruce 2017), where

  • p is the number of predictors and
  • n is the number of observations.
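The leverage rule can be sketched with hatvalues(). The example below uses simulated data with one planted extreme predictor value; the dataframe and its columns are hypothetical. It assumes p counts the estimated slope coefficients, so a factor with four levels contributes three.

```r
# Simulated data with a numeric and a factor predictor (hypothetical example)
set.seed(7)
n <- 120
sim.df <- data.frame(
  days    = runif(n, 2, 61),
  airline = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
)
sim.df$price <- 6000 - 50 * sim.df$days + rnorm(n, sd = 800)
sim.df$days[5] <- 400          # plant an extreme predictor value in row 5

fit <- lm(price ~ days + airline, data = sim.df)

h <- hatvalues(fit)            # leverage statistic of each observation
p <- length(coef(fit)) - 1     # slope coefficients: 1 for days + 3 for airline
cutoff <- 2 * (p + 1) / n      # high leverage boundary

sum(h)                         # leverages sum to p + 1, so their mean is (p + 1)/n
highLev <- which(h > cutoff)   # observations with high leverage
highLev
```

Because the leverages average (p + 1)/n, the boundary flags observations with more than twice the average leverage.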

Detecting Outliers and High Leverage Points

Outliers and high leverage points can be identified by inspecting the Residuals vs Leverage plot.

Residuals versus Leverage Plot

# residuals vs. leverage
plot(fitOLSModel, 5)

plot of chunk unnamed-chunk-12

The plot labels the three most extreme points (#49, #182 and #183), each with a standardized residual above 2.

Additionally, there are high leverage points in the data. That is, some data points have a leverage statistic above the boundary 2(p + 1)/n.

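# number of predictor terms (p) = 11
#   (the 9 predictor variables give 11 coefficients, because the
#    four-level factor Airline contributes 3 dummy variables)
# number of observations (n) = 305

# calculating boundary of high leverage points
2*(11+1) / 305
[1] 0.07868852

In our case, the boundary for high leverage points is about 0.079.

In the Residuals vs Leverage plot, points with leverage above 0.079 are high leverage points.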

Outliers Test

The function outlierTest from the car package reports the observations with the largest studentized residuals for a given model, together with Bonferroni-adjusted p-values.

# outliers test
library(car)
outlierTest(fitOLSModel, 
            n.max = 15, order = FALSE, digits = 3)
    rstudent unadjusted p-value Bonferonni p
182 5.049417         7.8156e-07   2.3838e-04
183 6.199350         1.9294e-09   5.8845e-07
184 5.049417         7.8156e-07   2.3838e-04
185 5.049417         7.8156e-07   2.3838e-04

Rows 182, 183, 184 and 185 are the most extreme.
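The columns of this table can be reproduced approximately in base R. The sketch below reflects my understanding of what outlierTest reports, not its actual source code, and runs on simulated data with two planted outliers; all names are hypothetical. Capping the adjusted p-value at 1 is a choice made here for readability.

```r
# Simulated data with two planted outliers (hypothetical example)
set.seed(3)
n <- 80
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
y[c(20, 21)] <- y[c(20, 21)] + 12
fit <- lm(y ~ x)

rstud <- rstudent(fit)               # externally studentized residuals
dfT <- df.residual(fit) - 1          # one df fewer: each case is left out in turn
pUnadj <- 2 * pt(abs(rstud), df = dfT, lower.tail = FALSE)  # two-sided p-value
pBonf <- pmin(1, pUnadj * n)         # Bonferroni: multiply by the number of tests

flagged <- which(pBonf < 0.05)
data.frame(rstudent   = rstud[flagged],
           unadjusted = pUnadj[flagged],
           bonferroni = pBonf[flagged])
```

The output above is consistent with this reading: for row 182, the unadjusted p-value 7.8156e-07 multiplied by n = 305 gives about 2.3838e-04, the reported Bonferroni p.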

Handling Outliers

There are multiple approaches for handling outliers.

One way is to simply remove the outliers suggested by the outlierTest.

Removing Outliers

We remove the extreme outliers flagged by outlierTest from the dataframe airline.df.

The subset below drops rows 182, 183, 184 and 185.

# subset of dataframe after removing extreme outliers
airlineRemOut.df <- airline.df[c(-182, -183, -184, -185), ]
# number of rows and columns
dim(airlineRemOut.df)
[1] 301  23
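A toy example of the same negative-index subsetting, using a small made-up dataframe (toy.df is hypothetical). It also shows that the remaining rows keep their original row names.

```r
# Toy dataframe with two obviously extreme prices (made-up values)
toy.df <- data.frame(id = 1:6, price = c(10, 12, 11, 90, 13, 95))

dropRows <- c(4, 6)                    # rows to remove
toyRemOut.df <- toy.df[-dropRows, ]    # negative indices drop rows

dim(toyRemOut.df)
# [1] 4 2
rownames(toyRemOut.df)
# [1] "1" "2" "3" "5"
```

Because row names are preserved, point labels in later diagnostic plots still refer to the row numbers of the original dataframe.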

Now we will fit the linear regression model again, using the dataframe airlineRemOut.df.

Dataframe Before and After Removing Outliers


Before

# number of rows and columns
dim(airline.df)
[1] 305  23


After

# number of rows and columns
dim(airlineRemOut.df)
[1] 301  23

Boxplot Before and After Removing Outliers


Before

[Figure: boxplot of Price before removing outliers]


After

[Figure: boxplot of Price after removing outliers]

Linear OLS Regression Model Before and After Removing Outliers


Before

# linear OLS model on the full dataframe
fitOLSModel <- lm(Price ~ 
                    AdvancedBookingDays 
                  + Airline 
                  + Departure 
                  + IsWeekend 
                  + IsDiwali 
                  + DepartureCityCode 
                  + FlyingMinutes 
                  + SeatPitch 
                  + SeatWidth, 
                  data = airline.df)
# summary of linear OLS model
summary(fitOLSModel)


After

# linear OLS model after removing outliers
fitOLSRemOutModel <- lm(Price ~ 
                    AdvancedBookingDays 
                  + Airline 
                  + Departure 
                  + IsWeekend 
                  + IsDiwali 
                  + DepartureCityCode 
                  + FlyingMinutes 
                  + SeatPitch 
                  + SeatWidth, 
                  data = airlineRemOut.df)
# summary of linear OLS model after removing outliers
summary(fitOLSRemOutModel)

Call:
lm(formula = Price ~ AdvancedBookingDays + Airline + Departure + 
    IsWeekend + IsDiwali + DepartureCityCode + FlyingMinutes + 
    SeatPitch + SeatWidth, data = airline.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2671.2 -1266.2  -456.4   517.4 11953.9 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -4292.94    8897.87  -0.482   0.6298    
AdvancedBookingDays    -87.70      12.47  -7.033 1.43e-11 ***
AirlineIndiGo         -577.17     778.64  -0.741   0.4591    
AirlineJet            -120.75     436.69  -0.277   0.7823    
AirlineSpice Jet     -1118.38     697.85  -1.603   0.1101    
DeparturePM           -589.79     275.23  -2.143   0.0329 *  
IsWeekendYes          -345.92     408.06  -0.848   0.3973    
IsDiwaliYes           4346.80     568.14   7.651 2.90e-13 ***
DepartureCityCodeDEL -1413.46     351.54  -4.021 7.38e-05 ***
FlyingMinutes           38.97      29.27   1.331   0.1841    
SeatPitch             -279.19     226.64  -1.232   0.2190    
SeatWidth              868.58     507.54   1.711   0.0881 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2079 on 293 degrees of freedom
Multiple R-squared:  0.2695,    Adjusted R-squared:  0.2421 
F-statistic: 9.828 on 11 and 293 DF,  p-value: 3.604e-15

Call:
lm(formula = Price ~ AdvancedBookingDays + Airline + Departure + 
    IsWeekend + IsDiwali + DepartureCityCode + FlyingMinutes + 
    SeatPitch + SeatWidth, data = airlineRemOut.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2536.6 -1061.6  -263.8   486.4  6662.0 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -3267.244   6977.753  -0.468   0.6400    
AdvancedBookingDays    -63.475      9.937  -6.388 6.72e-10 ***
AirlineIndiGo         -817.455    610.832  -1.338   0.1819    
AirlineJet            -103.673    342.436  -0.303   0.7623    
AirlineSpice Jet      -648.966    548.301  -1.184   0.2375    
DeparturePM            -43.895    219.496  -0.200   0.8416    
IsWeekendYes           164.065    322.158   0.509   0.6110    
IsDiwaliYes           3738.528    447.738   8.350 2.87e-15 ***
DepartureCityCodeDEL -1705.666    276.493  -6.169 2.31e-09 ***
FlyingMinutes           48.069     22.963   2.093   0.0372 *  
SeatPitch             -229.666    177.758  -1.292   0.1974    
SeatWidth              608.514    398.449   1.527   0.1278    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1630 on 289 degrees of freedom
Multiple R-squared:  0.376, Adjusted R-squared:  0.3522 
F-statistic: 15.83 on 11 and 289 DF,  p-value: < 2.2e-16
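The two summaries can be compared more compactly by collecting the key fit statistics in one table. In the output above, the residual standard error falls from 2079 to 1630 and the multiple R-squared rises from 0.2695 to 0.376 once the four rows are removed. A sketch of such a comparison on simulated data with planted outliers; the dataframe and model names are hypothetical.

```r
# Simulated data with three planted outliers (hypothetical example)
set.seed(11)
n <- 150
sim.df <- data.frame(days = runif(n, 2, 61))
sim.df$price <- 6000 - 50 * sim.df$days + rnorm(n, sd = 800)
sim.df$price[1:3] <- sim.df$price[1:3] + 9000

fitAll    <- lm(price ~ days, data = sim.df)            # before
fitRemOut <- lm(price ~ days, data = sim.df[-(1:3), ])  # after

# key fit statistics, one row per model
comparison <- data.frame(
  model = c("all rows", "outliers removed"),
  n     = c(nobs(fitAll), nobs(fitRemOut)),
  RSE   = c(sigma(fitAll), sigma(fitRemOut)),
  R2    = c(summary(fitAll)$r.squared, summary(fitRemOut)$r.squared)
)
comparison
```

The same table can be built from fitOLSModel and fitOLSRemOutModel once both have been fitted.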

Normality Plot

Before

[Figure: normal Q-Q plot of residuals before removing outliers]

After

[Figure: normal Q-Q plot of residuals after removing outliers]