Sameer Mathur
Outliers and High Leverage Points
Regression Diagnostics
---
Reading data into a dataframe
# reading data into R
airline.df <- read.csv("BOMDELBOM.csv")
# attaching data columns of the dataframe
attach(airline.df)
Number of rows and columns
# dimension of the dataframe
dim(airline.df)
[1] 305 23
# descriptive statistics
library(psych)
describe(airline.df)[, c(1:5, 8:9)]
                    vars   n    mean      sd  median     min      max
FlightNumber*          1 305   31.86   18.35   32.00    1.00    63.00
Airline*               2 305    2.60    0.88    3.00    1.00     4.00
DepartureCityCode*     3 305    1.57    0.50    2.00    1.00     2.00
ArrivalCityCode*       4 305    1.43    0.50    1.00    1.00     2.00
DepartureTime          5 305 1249.54  579.86 1035.00  225.00  2320.00
ArrivalTime            6 305 1329.31  613.52 1215.00   20.00  2345.00
Departure*             7 305    1.45    0.50    1.00    1.00     2.00
FlyingMinutes          8 305  136.03    4.71  135.00  125.00   145.00
Aircraft*              9 305    1.54    0.50    2.00    1.00     2.00
PlaneModel*           10 305    3.82    2.71    3.00    1.00     9.00
Capacity              11 305  176.36   32.39  180.00  138.00   303.00
SeatPitch             12 305   30.26    0.93   30.00   29.00    33.00
SeatWidth             13 305   17.41    0.49   17.00   17.00    18.00
DataCollectionDate*   14 305    4.36    1.98    5.00    1.00     7.00
DateDeparture*        15 305    8.14    6.69    7.00    1.00    20.00
IsWeekend*            16 305    1.13    0.34    1.00    1.00     2.00
Price                 17 305 5394.54 2388.29 4681.00 2607.00 18015.00
AdvancedBookingDays   18 305   28.90   22.30   30.00    2.00    61.00
IsDiwali*             19 305    1.40    0.49    1.00    1.00     2.00
DayBeforeDiwali*      20 305    1.19    0.40    1.00    1.00     2.00
DayAfterDiwali*       21 305    1.20    0.40    1.00    1.00     2.00
MarketShare           22 305   21.18   11.04   15.40   13.20    39.60
LoadFactor            23 305   85.13    4.32   83.32   78.73    94.06
# linear OLS model
fitOLSModel <- lm(Price ~
                    AdvancedBookingDays
                  + Airline
                  + Departure
                  + IsWeekend
                  + IsDiwali
                  + DepartureCityCode
                  + FlyingMinutes
                  + SeatPitch
                  + SeatWidth,
                  data = airline.df)
# summary of linear OLS model
summary(fitOLSModel)
Call:
lm(formula = Price ~ AdvancedBookingDays + Airline + Departure +
    IsWeekend + IsDiwali + DepartureCityCode + FlyingMinutes +
    SeatPitch + SeatWidth, data = airline.df)

Residuals:
    Min      1Q  Median      3Q     Max
-2671.2 -1266.2  -456.4   517.4 11953.9

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          -4292.94    8897.87  -0.482   0.6298
AdvancedBookingDays    -87.70      12.47  -7.033 1.43e-11 ***
AirlineIndiGo         -577.17     778.64  -0.741   0.4591
AirlineJet            -120.75     436.69  -0.277   0.7823
AirlineSpice Jet     -1118.38     697.85  -1.603   0.1101
DeparturePM           -589.79     275.23  -2.143   0.0329 *
IsWeekendYes          -345.92     408.06  -0.848   0.3973
IsDiwaliYes           4346.80     568.14   7.651 2.90e-13 ***
DepartureCityCodeDEL -1413.46     351.54  -4.021 7.38e-05 ***
FlyingMinutes           38.97      29.27   1.331   0.1841
SeatPitch             -279.19     226.64  -1.232   0.2190
SeatWidth              868.58     507.54   1.711   0.0881 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2079 on 293 degrees of freedom
Multiple R-squared:  0.2695, Adjusted R-squared:  0.2421
F-statistic: 9.828 on 11 and 293 DF,  p-value: 3.604e-15
The data are right-skewed: the mean price (5394.54) is well above the median (4681), which suggests outliers at the high end of the dataset.
The residuals are not normally distributed either. The plot of the residuals suggests that the dataframe contains both low and high outliers.
# quantiles
quantile(Price)
0% 25% 50% 75% 100%
2607 4051 4681 5725 18015
where 0% is the minimum, 50% is the median, and 100% is the maximum.
An outlier is a data point whose response y does not follow the general trend of the rest of the data.
The presence of outliers may affect the interpretation of the model, because outliers increase the residual standard error (RSE).
Outliers can be identified by examining the standardized residual (or studentized residual), which is the residual divided by its estimated standard error.
Standardized residuals can be interpreted as the number of standard errors away from the regression line.
Observations whose standardized residuals are greater than 3 in absolute value are possible outliers (James et al. 2014).
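As a minimal sketch of this rule, the snippet below fits a regression to simulated data (not the airline data), plants one aberrant response, and flags observations whose standardized residual exceeds 3 in absolute value. The simulated dataframe `sim`, the planted outlier in row 10 and the names `stdRes`, `studRes` and `flagged` are assumptions made for illustration; `rstandard()` and `rstudent()` are base R functions.

```r
# simulated data: a linear trend plus noise (illustrative only)
set.seed(1)
n <- 100
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 2 + 3 * sim$x + rnorm(n, sd = 1)

# plant one response that breaks the general trend
sim$y[10] <- sim$y[10] + 15

fit <- lm(y ~ x, data = sim)

# internally studentized (standardized) and externally studentized residuals
stdRes <- rstandard(fit)
studRes <- rstudent(fit)

# rows whose standardized residual exceeds 3 in absolute value
flagged <- which(abs(stdRes) > 3)
flagged
```

The same two calls would apply to `fitOLSModel`; the cutoff of 3 is the rule of thumb quoted above, not a formal test.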
A data point has high leverage if it has extreme predictor x values.
With a single predictor, an extreme x value is simply one that is particularly high or low.
With multiple predictors, extreme x values may be particularly high or low for one or more predictors, or may be unusual combinations of predictor values.
For example, with two predictors that are positively correlated, an unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor.
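The two-predictor case can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are simulated, and the appended point (a moderately high `x1` paired with a moderately low `x2`) is hypothetical. `hatvalues()` is the base R function that returns the leverage of each observation.

```r
# two positively correlated predictors (simulated, illustrative only)
set.seed(2)
n <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)

# append one point: high x1 paired with low x2
# (neither value is far out on its own, but the combination is unusual)
x1 <- c(x1, 1.5)
x2 <- c(x2, -1.5)
y <- 1 + x1 + x2 + rnorm(n + 1)

fit <- lm(y ~ x1 + x2)

# leverage (hat values); the appended observation is row 51
lev <- hatvalues(fit)
which.max(lev)
```

The appended row has the largest leverage because it lies far from the diagonal band along which the other points cluster, even though each coordinate taken separately is unremarkable.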
A leverage statistic above \[ \frac{2(p + 1)}{n} \] indicates an observation with high leverage (P. Bruce and Bruce 2017), where p is the number of predictors and n is the number of observations.
Outliers and high leverage points can be identified by inspecting the Residuals vs Leverage plot.
# residuals vs. leverage
plot(fitOLSModel, 5)
The plot highlights the three most extreme points (#49, #182 and #183), with standardized residuals above 2.
Additionally, there are high leverage points in the data. That is, some data points have a leverage statistic above \(2(p + 1)/n\).
# number of predictors (p) = 9
# number of observations (n) = 305
# calculating boundary of high leverage points
2*(9+1) / 305
[1] 0.06557377
In our case, the boundary for high leverage points is about 0.0656.
In the Residuals vs Leverage plot, points whose leverage is above 0.0656 are high leverage points.
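The leverage values themselves can be compared with the boundary directly. The sketch below uses simulated data with one numeric and one four-level factor predictor; the dataframe `sim` and its columns are hypothetical. It takes p + 1 as the number of estimated coefficients, which is one common convention and an assumption here: a factor contributes one coefficient per non-reference level, so this count can exceed the number of terms in the formula. Applied to `fitOLSModel`, whose summary lists 11 slope coefficients plus the intercept, that convention would give 2 × 12 / 305, about 0.079, rather than 0.0656.

```r
# simulated data with one numeric and one 4-level factor predictor (illustrative only)
set.seed(3)
n <- 120
sim <- data.frame(
  days = sample(2:61, n, replace = TRUE),
  carrier = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE))
)
sim$price <- 5000 - 40 * sim$days + rnorm(n, sd = 500)

fit <- lm(price ~ days + carrier, data = sim)

# leverage of each observation
lev <- hatvalues(fit)

# boundary 2(p + 1)/n, with p + 1 taken as the number of estimated coefficients
# (intercept + 1 slope + 3 dummy columns = 5 here)
nCoef <- length(coef(fit))
cutoff <- 2 * nCoef / nrow(sim)

# observations above the boundary
highLev <- which(lev > cutoff)
cutoff
```

The leverage values sum to the number of estimated coefficients, so the boundary is simply twice the average leverage.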
The function outlierTest from the car package reports the most extreme observations based on a given model.
# outliers test
library(car)
outlierTest(fitOLSModel,
            n.max = 15, order = FALSE, digits = 3)
    rstudent unadjusted p-value Bonferonni p
182 5.049417         7.8156e-07   2.3838e-04
183 6.199350         1.9294e-09   5.8845e-07
184 5.049417         7.8156e-07   2.3838e-04
185 5.049417         7.8156e-07   2.3838e-04
Rows 182, 183, 184 and 185 are the most extreme.
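The relation between the two p-value columns can be checked by hand. In the sketch below, the studentized residuals are copied from the output above; the unadjusted p-value is taken as a two-sided tail probability of a t distribution, and the Bonferroni value is that p-value multiplied by the number of observations. The choice of degrees of freedom (residual degrees of freedom minus one, that is 292) is an assumption about how the test is computed, not something stated in the output.

```r
# studentized residuals reported above for rows 182 and 183
rstud <- c("182" = 5.049417, "183" = 6.199350)

n <- 305        # observations
dfResid <- 293  # residual degrees of freedom from summary(fitOLSModel)

# unadjusted two-sided p-values; df = dfResid - 1 is an assumption
pUnadj <- 2 * pt(-abs(rstud), df = dfResid - 1)

# Bonferroni adjustment: multiply by the number of observations, cap at 1
pBonf <- pmin(1, pUnadj * n)

cbind(rstudent = rstud, pUnadj, pBonf)
```

Using the printed values directly, 7.8156e-07 × 305 is about 2.3838e-04, which agrees with the Bonferroni column for row 182.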
There are multiple approaches for handling outliers.
One way is to simply remove the outliers suggested by outlierTest.
We remove from the dataframe airline.df the extreme outliers, suggested by outlierTest, that have high leverage.
We subset the dataframe by removing rows 182, 183, 184 and 185.
# subset of dataframe after removing extreme outliers
airlineRemOut.df <- airline.df[c(-182, -183, -184, -185), ]
# number of rows and columns
dim(airlineRemOut.df)
[1] 301 23
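The row removal above relies on negative indexing. A minimal sketch on a toy dataframe (hypothetical names `toy`, `dropRows` and `toyRemOut`, not the airline data) shows the same pattern with the positions collected in one vector.

```r
# toy dataframe with 10 rows (illustrative only)
toy <- data.frame(id = 1:10, value = c(5, 6, 5, 7, 90, 6, 5, 95, 6, 5))

# positions to drop, collected in one vector
dropRows <- c(5, 8)

# negative indexing removes those rows and keeps all columns
toyRemOut <- toy[-dropRows, ]

dim(toy)
dim(toyRemOut)

# original row names are kept, so rows can still be traced back
rownames(toyRemOut)
```

One caveat with this pattern: if `dropRows` is empty, `toy[-dropRows, ]` returns zero rows rather than the full dataframe, so it is worth checking `length(dropRows) > 0` before subsetting.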
Now we run the linear regression model using the dataframe airlineRemOut.df.
Before
# number of rows and columns
dim(airline.df)
[1] 305 23
After
# number of rows and columns
dim(airlineRemOut.df)
[1] 301 23
Before
# linear OLS model on the full data
fitOLSModel <- lm(Price ~
                    AdvancedBookingDays
                  + Airline
                  + Departure
                  + IsWeekend
                  + IsDiwali
                  + DepartureCityCode
                  + FlyingMinutes
                  + SeatPitch
                  + SeatWidth,
                  data = airline.df)
# summary of linear OLS model
summary(fitOLSModel)
After
# linear OLS model after removing outliers
fitOLSRemOutModel <- lm(Price ~
                          AdvancedBookingDays
                        + Airline
                        + Departure
                        + IsWeekend
                        + IsDiwali
                        + DepartureCityCode
                        + FlyingMinutes
                        + SeatPitch
                        + SeatWidth,
                        data = airlineRemOut.df)
# summary of linear OLS model after removing outliers
summary(fitOLSRemOutModel)
Before
Call:
lm(formula = Price ~ AdvancedBookingDays + Airline + Departure +
    IsWeekend + IsDiwali + DepartureCityCode + FlyingMinutes +
    SeatPitch + SeatWidth, data = airline.df)

Residuals:
    Min      1Q  Median      3Q     Max
-2671.2 -1266.2  -456.4   517.4 11953.9

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          -4292.94    8897.87  -0.482   0.6298
AdvancedBookingDays    -87.70      12.47  -7.033 1.43e-11 ***
AirlineIndiGo         -577.17     778.64  -0.741   0.4591
AirlineJet            -120.75     436.69  -0.277   0.7823
AirlineSpice Jet     -1118.38     697.85  -1.603   0.1101
DeparturePM           -589.79     275.23  -2.143   0.0329 *
IsWeekendYes          -345.92     408.06  -0.848   0.3973
IsDiwaliYes           4346.80     568.14   7.651 2.90e-13 ***
DepartureCityCodeDEL -1413.46     351.54  -4.021 7.38e-05 ***
FlyingMinutes           38.97      29.27   1.331   0.1841
SeatPitch             -279.19     226.64  -1.232   0.2190
SeatWidth              868.58     507.54   1.711   0.0881 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2079 on 293 degrees of freedom
Multiple R-squared:  0.2695, Adjusted R-squared:  0.2421
F-statistic: 9.828 on 11 and 293 DF,  p-value: 3.604e-15
After
Call:
lm(formula = Price ~ AdvancedBookingDays + Airline + Departure +
    IsWeekend + IsDiwali + DepartureCityCode + FlyingMinutes +
    SeatPitch + SeatWidth, data = airlineRemOut.df)

Residuals:
    Min      1Q  Median      3Q     Max
-2536.6 -1061.6  -263.8   486.4  6662.0

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          -3267.244   6977.753  -0.468   0.6400
AdvancedBookingDays    -63.475      9.937  -6.388 6.72e-10 ***
AirlineIndiGo         -817.455    610.832  -1.338   0.1819
AirlineJet            -103.673    342.436  -0.303   0.7623
AirlineSpice Jet      -648.966    548.301  -1.184   0.2375
DeparturePM            -43.895    219.496  -0.200   0.8416
IsWeekendYes           164.065    322.158   0.509   0.6110
IsDiwaliYes           3738.528    447.738   8.350 2.87e-15 ***
DepartureCityCodeDEL -1705.666    276.493  -6.169 2.31e-09 ***
FlyingMinutes           48.069     22.963   2.093   0.0372 *
SeatPitch             -229.666    177.758  -1.292   0.1974
SeatWidth              608.514    398.449   1.527   0.1278
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1630 on 289 degrees of freedom
Multiple R-squared:  0.376, Adjusted R-squared:  0.3522
F-statistic: 15.83 on 11 and 289 DF,  p-value: < 2.2e-16
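The effect of removing the four rows can be summarized from the two sets of fit statistics. The sketch below copies the residual standard error, R-squared and adjusted R-squared from the summaries above into two vectors (the names `before`, `after`, `change` and `relChange` are illustrative) and computes the absolute and relative change.

```r
# values copied from the two summaries above
before <- c(rse = 2079, r2 = 0.2695, adjR2 = 0.2421)
after  <- c(rse = 1630, r2 = 0.376,  adjR2 = 0.3522)

# absolute and relative (percent) change after removing the four rows
change <- after - before
relChange <- round(100 * change / before, 1)

rbind(before, after, change, relChange)
```

With the fitted model objects at hand, the same quantities could be read off with `sigma()` and `summary()$adj.r.squared` instead of being copied by hand. The residual standard error drops by 449 while the adjusted R-squared rises by about 0.11, so the fit to the remaining 301 rows is tighter; whether dropping the rows is justified is a separate question that the statistics alone do not answer.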