Introduction

This code through explores an airline data set that records several identifiers for each flight. We will use data pipelines and ggplot to explore and visualize patterns within each identifier, and then test whether each one has a statistically significant relationship with delay.

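library(dplyr); library(ggplot2)      # pipes, grouping, and plotting
library(pander); library(stargazer)   # table formatting
airline <- read.csv("Airlines.csv")
head(airline, n=12) %>% pander()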
id Airline Flight AirportFrom AirportTo DayOfWeek Time Length Delay
1 CO 269 SFO IAH 3 15 205 1
2 US 1558 PHX CLT 3 15 222 1
3 AA 2400 LAX DFW 3 20 165 1
4 AA 2466 SFO DFW 3 20 195 1
5 AS 108 ANC SEA 3 30 202 0
6 CO 1094 LAX IAH 3 30 181 1
7 DL 1768 LAX MSP 3 30 220 0
8 DL 2722 PHX DTW 3 30 228 0
9 DL 2606 SFO MSP 3 35 216 1
10 AA 2538 LAS ORD 3 40 200 1
11 CO 223 ANC SEA 3 49 201 1
12 DL 1646 PHX ATL 3 50 212 1


Content Overview

Specifically, we’ll run and report significance tests using glm and stargazer for each piece of flight information with regard to delay. Does one airline have significantly more delays than others? Are there more delays on certain days of the week? Are there more delays at one airport than at another?


Learning Objectives

In this code through, you’ll learn how to plot various categories with different features using ggplot, fit logistic regression models with glm, and report their statistical significance using stargazer.



Plotting, Testing, and Results

Here, we’ll show how to create tables and plots to visualize the different airline identifiers. We will also summarize the data to compare delayed and on-time flights. Then we will fit a regression model for each identifier and test its statistical significance.


Commercial Airline

Here we will explore similarities and differences with respect to individual airlines. Below is a list of airline codes with the corresponding airline. Let’s first look at which airline has the most flights.

WN - Southwest
DL - Delta
OO - SkyWest
AA - American
MQ - Envoy
US - US Airways
XE - JSX
EV - ExpressJet
UA - United
CO - Continental
FL - AirTran
9E - Endeavor
B6 - JetBlue
YV - Mesa
OH - Jetstream
AS - Alaska Airlines
F9 - Frontier
HA - Hawaiian Airlines

airline %>%
  group_by(Airline) %>%
  summarise(n=n()) %>%
  arrange(desc(n)) %>%
  pander()
Airline n
WN 94097
DL 60940
OO 50254
AA 45656
MQ 36605
US 34500
XE 31126
EV 27983
UA 27619
CO 21118
FL 20827
9E 20686
B6 18112
YV 13725
OH 12630
AS 11471
F9 6456
HA 5578


Here we can group by Airline and compute the share of delayed flights within each group to see which airline has the greatest proportion of delays.

airline %>%
  group_by(Airline) %>%
  summarise(n=n(), Prop_Delayed = round(sum(Delay==1)/n, 3)) %>%
  arrange(desc(Prop_Delayed)) %>%
  pander()
Airline n Prop_Delayed
WN 94097 0.698
CO 21118 0.566
B6 18112 0.467
OO 50254 0.453
DL 60940 0.45
F9 6456 0.449
EV 27983 0.402
9E 20686 0.398
AA 45656 0.388
XE 31126 0.379
MQ 36605 0.348
AS 11471 0.339
US 34500 0.336
UA 27619 0.324
HA 5578 0.32
FL 20827 0.301
OH 12630 0.277
YV 13725 0.243


Now let’s plot this table using ggplot.

plotairline <- airline %>%
  group_by(Airline) %>%
  summarise(n=n(), Prop_Delayed = round(sum(Delay==1)/n, 3)) %>%
  arrange(desc(Prop_Delayed))


ggplot(data=plotairline, aes(x=reorder(Airline, Prop_Delayed), y=Prop_Delayed)) + 
  geom_bar(stat = "identity", fill=alpha("firebrick4", 0.75)) + 
  labs(x="Airline", y="Proportion Delayed") + 
  geom_hline(yintercept = mean(plotairline$Prop_Delayed), linetype= "dashed")

#mean(plotairline$Prop_Delayed)

As shown from the table and chart above, Southwest has the greatest number of flights as well as the greatest proportion of delayed flights. Delta is second in number of flights but fifth in proportion of delayed flights. Also, JetBlue is listed in the bottom six for number of flights but the top three for proportion of delayed.
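
One detail worth checking in the chart: the dashed line is the unweighted mean of the 18 per-airline proportions, which is not the same as the overall share of delayed flights, because a large carrier such as Southwest counts no more than a small one. A minimal sketch of the difference, using the n and Prop_Delayed values copied from the table above (the vector names are illustrative and not part of the original code):

```r
# Values copied from the Prop_Delayed table above, in the same row order
n <- c(WN = 94097, CO = 21118, B6 = 18112, OO = 50254, DL = 60940, F9 = 6456,
       EV = 27983, "9E" = 20686, AA = 45656, XE = 31126, MQ = 36605, AS = 11471,
       US = 34500, UA = 27619, HA = 5578, FL = 20827, OH = 12630, YV = 13725)
prop <- c(0.698, 0.566, 0.467, 0.453, 0.450, 0.449, 0.402, 0.398, 0.388,
          0.379, 0.348, 0.339, 0.336, 0.324, 0.320, 0.301, 0.277, 0.243)

unweighted_mean <- mean(prop)              # what the dashed line shows
overall_rate    <- weighted.mean(prop, n)  # share of all flights that were delayed

round(c(unweighted = unweighted_mean, overall = overall_rate), 3)
```

The unweighted mean comes out near 0.40, while the flight-weighted rate is closer to 0.45, so an airline can sit above the dashed line and still be below the overall delay rate.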

Lastly, let’s look at the logistic regression and test for significant impact.

reg <- glm(airline$Delay ~ airline$Airline, family = binomial)

stargazer(reg,
          type="html",
          omit.stat = "all",
          digits=2)
Dependent variable:
Delay
AirlineAA -0.04**
(0.02)
AirlineAS -0.25***
(0.02)
AirlineB6 0.28***
(0.02)
AirlineCO 0.68***
(0.02)
AirlineDL 0.22***
(0.02)
AirlineEV 0.02
(0.02)
AirlineF9 0.21***
(0.03)
AirlineFL -0.43***
(0.02)
AirlineHA -0.34***
(0.03)
AirlineMQ -0.21***
(0.02)
AirlineOH -0.54***
(0.02)
AirlineOO 0.23***
(0.02)
AirlineUA -0.32***
(0.02)
AirlineUS -0.27***
(0.02)
AirlineWN 1.25***
(0.02)
AirlineXE -0.08***
(0.02)
AirlineYV -0.72***
(0.02)
Constant -0.42***
(0.01)
Note: *p<0.1; **p<0.05; ***p<0.01


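This table shows that nearly every airline differs significantly from the baseline airline, 9E (the one airline without its own row, which is absorbed into the constant); only EV does not. The main airlines of note (due to large positive coefficients) are:

WN - Southwest
CO - Continental
B6 - JetBlue

This suggests that flying with one of these airlines raises the odds of a delay substantially. Because the coefficients are log-odds relative to 9E, the odds are about 32% higher for JetBlue (exp(0.28) ≈ 1.32), roughly double for Continental (exp(0.68) ≈ 1.97), and more than triple for Southwest (exp(1.25) ≈ 3.49).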
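
To read the regression table in more familiar units, the log-odds coefficients can be converted into odds ratios and predicted probabilities. The sketch below assumes only the rounded coefficients printed in the table above; the variable names are illustrative.

```r
# Coefficients copied from the stargazer table above (log-odds scale)
intercept <- -0.42                       # baseline airline, 9E
coefs <- c(WN = 1.25, CO = 0.68, B6 = 0.28)

odds_ratio <- exp(coefs)                 # odds of delay relative to 9E
p_baseline <- plogis(intercept)          # predicted delay probability for 9E
p_airline  <- plogis(intercept + coefs)  # predicted delay probability per airline

round(rbind(odds_ratio, p_airline, change = p_airline - p_baseline), 3)
```

The predicted probabilities (roughly 0.70, 0.56, and 0.47) agree with the Prop_Delayed table up to the rounding of the coefficients, which is a useful sanity check on the model.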


Days of the Week

Here we will explore similarities and differences with respect to the day of the week and delay. In this data set, 1 represents Monday and 7 represents Sunday.

airline %>%
  group_by(DayOfWeek) %>%
  summarise(n=n()) %>%
  pander()
DayOfWeek n
1 72769
2 71340
3 89746
4 91445
5 85248
6 58956
7 69879


Let’s look at the proportion of delayed flights by Day of Week.

airline %>%
  group_by(DayOfWeek) %>%
  summarise(n=n(), Prop_Delayed = round(sum(Delay ==1)/n, 3)) %>%
  pander()
DayOfWeek n Prop_Delayed
1 72769 0.468
2 71340 0.447
3 89746 0.471
4 91445 0.451
5 85248 0.417
6 58956 0.401
7 69879 0.454


Now let’s plot the table using ggplot.

plotday <- airline %>%
  group_by(DayOfWeek) %>%
  summarise(n=n(), Prop_Delayed = round(sum(Delay ==1)/n, 3))

ggplot(data = plotday, aes(x = DayOfWeek, y = Prop_Delayed))+
  geom_line(col=alpha("firebrick4", 0.75)) + 
  geom_point(col=alpha("firebrick4", 0.75)) +
  labs(x="Day of Week", y = "Proportion of Flights Delayed")


As seen in the chart above, it appears that there is a greater likelihood that your flight will be delayed if you fly on Monday or Wednesday.

Finally, let’s look at the logistic regression and test for significant impact.

reg <- glm(airline$Delay ~ as.factor(airline$DayOfWeek)-1, family = binomial)

stargazer(reg,
          type="html",
          omit.stat = "all",
          digits=2)
Dependent variable:
Delay
DayOfWeek)1 -0.13***
(0.01)
DayOfWeek)2 -0.21***
(0.01)
DayOfWeek)3 -0.12***
(0.01)
DayOfWeek)4 -0.20***
(0.01)
DayOfWeek)5 -0.34***
(0.01)
DayOfWeek)6 -0.40***
(0.01)
DayOfWeek)7 -0.19***
(0.01)
Note: *p<0.1; **p<0.05; ***p<0.01


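Because the model drops the intercept (the -1 in the formula), each coefficient is simply the log-odds of delay on that day, so the table restates the proportions above: Monday (-0.13) and Wednesday (-0.12) have the least negative coefficients and therefore the highest delay rates, while Saturday (-0.40) has the lowest. Note that the stars test each day against even odds (a 50% chance of delay), not against the other days, so they do not by themselves show that the days differ from one another.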
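
The effect of removing the intercept is easier to see on a small example. The sketch below uses made-up toy data (not the airline data) with three "days" and known delay rates; all names in it are hypothetical.

```r
# Toy data, invented for illustration only: 10 flights per day
toy <- data.frame(
  day   = factor(rep(c("1", "2", "3"), each = 10)),
  delay = c(rep(c(1, 0), c(6, 4)),   # day 1: 6 of 10 delayed
            rep(c(1, 0), c(4, 6)),   # day 2: 4 of 10 delayed
            rep(c(1, 0), c(5, 5)))   # day 3: 5 of 10 delayed
)

no_int   <- glm(delay ~ day - 1, data = toy, family = binomial)
with_int <- glm(delay ~ day,     data = toy, family = binomial)

p_by_day     <- plogis(coef(no_int))   # back-transforms to each day's delay rate
diff_vs_day1 <- coef(with_int)[-1]     # log-odds differences relative to day 1

round(p_by_day, 3)
round(diff_vs_day1, 3)
```

Without the intercept, the coefficients recover each day's own delay rate; with it, they measure how each day differs from the first, which is the comparison to use when asking whether one day is worse than another.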


Airport of Departure

Here we will explore similarities and differences with respect to the Airport of Departure and delay.

airline %>%
  group_by(AirportFrom) %>%
  summarise(n=n()) %>%
  arrange(desc(n)) %>%
  top_n(18, n) %>%
  pander()

AirportFrom n
ATL 34449
ORD 24822
DFW 22154
DEN 19843
LAX 16657
IAH 15821
PHX 15557
DTW 13136
LAS 11918
SFO 11786
CLT 11133
MCO 10596
SLC 10473
MSP 10166
EWR 9673
JFK 9496
BOS 9439
BWI 8565

Let’s look at the proportion of delayed flights based on outgoing airport. Note: there are over 200 different airports listed in this data. We will only be viewing the top 18.

airline %>%
  group_by(AirportFrom) %>%
  summarise(n=n(), Prop_Delayed = round(sum(Delay==1)/n, 3)) %>%
  arrange(desc(Prop_Delayed)) %>%
  top_n(18, Prop_Delayed) %>%
  pander()

AirportFrom n Prop_Delayed
MDW 7103 0.735
DAL 3838 0.716
OAK 3783 0.713
HOU 4420 0.667
OTH 93 0.634
FLO 18 0.611
SMF 3504 0.606
GUM 10 0.6
SJC 3357 0.587
LWS 57 0.579
BWI 8565 0.573
STL 5031 0.568
RNO 1708 0.566
LAS 11918 0.56
ADK 9 0.556
OTZ 90 0.556
ISP 631 0.552
BET 84 0.548


plotairport <- airline %>%
  group_by(AirportFrom) %>%
  summarise(n=n(), Prop_Delayed = round(sum(Delay==1)/n, 3)) %>%
  arrange(desc(Prop_Delayed)) %>%
  top_n(18, Prop_Delayed)

ggplot(data=plotairport, aes(x=reorder(AirportFrom, Prop_Delayed), y=Prop_Delayed)) + 
  geom_bar(stat = "identity", fill=alpha("firebrick4", 0.75)) + 
  labs(x="Departure Airport", y="Proportion Delayed") + 
  geom_hline(yintercept = mean(plotairport$Prop_Delayed), linetype = "dashed", size=1.0)


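With over 200 airports listed in this data, a regression with one coefficient per airport would be unwieldy to read, so we do not run one here. The dashed line in the plot is the mean proportion among these 18 airports (about 0.61), not across all airports. Five airports sit clearly above it (FLO, at 0.611 on only 18 flights, is just marginally above):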

  1. MDW
  2. DAL
  3. OAK
  4. HOU
  5. OTH

Without a regression, it is hard to predict the exact impact these airports have on flight delay.
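
Several airports in the top-18 table have very few flights (for example ADK with 9 and GUM with 10), so their proportions are noisy. One possible refinement is to require a minimum number of flights before ranking. The sketch below applies that idea to the rows copied from the table above; the cutoff of 1000 flights is an assumption chosen only for illustration.

```r
# Rows copied from the top-18 table above
top18 <- data.frame(
  AirportFrom = c("MDW", "DAL", "OAK", "HOU", "OTH", "FLO", "SMF", "GUM", "SJC",
                  "LWS", "BWI", "STL", "RNO", "LAS", "ADK", "OTZ", "ISP", "BET"),
  n = c(7103, 3838, 3783, 4420, 93, 18, 3504, 10, 3357,
        57, 8565, 5031, 1708, 11918, 9, 90, 631, 84),
  Prop_Delayed = c(0.735, 0.716, 0.713, 0.667, 0.634, 0.611, 0.606, 0.600, 0.587,
                   0.579, 0.573, 0.568, 0.566, 0.560, 0.556, 0.556, 0.552, 0.548)
)

min_flights <- 1000   # assumed cutoff, not from the original analysis
busy <- subset(top18, n >= min_flights)

# Busy airports only, highest delay proportion first
busy[order(-busy$Prop_Delayed), ]
```

In the full pipeline, the equivalent step would be a filter(n >= 1000) placed after summarise() and before arrange(), so that the ranking is computed over all busy airports rather than only the 18 shown here.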


Length of Flight

Finally, here we will explore similarities and differences with respect to the Length of flight and its impact on delay.

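# Assumed definitions (not shown in the original): flight lengths split by delay status
x1 <- airline$Length[airline$Delay == 0]   # on-time flights
x2 <- airline$Length[airline$Delay == 1]   # delayed flights
new_data <- data.frame(x = c(0, 1), y = c(mean(x1), mean(x2)))

# The binomial geom_smooth() layer is omitted: it needs a 0/1 response, not mean flight time
ggplot(data=new_data, aes(x=x, y=y)) +
  geom_point(col=alpha("firebrick4", 0.75), size=3)+
  geom_line(col=alpha("firebrick4",0.75))+
  labs(x="Delayed (0=No, 1=Yes)", y="Mean Flight Time") +
  lims(x=c(-0.5,1.5), y=c(125, 140))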

t.test(x1, x2)
Welch Two Sample t-test

data:  x1 and x2
t = -29.621, df = 504681, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.090060 -5.334146
sample estimates:
mean of x mean of y
 129.6576  135.3697

reg <- glm(airline$Delay ~ airline$Length, family = binomial)

stargazer(reg,
          type="html",
          omit.stat = "all",
          digits=2)
Dependent variable:
Delay
Length 0.001***
(0.0000)
Constant -0.37***
(0.01)
Note: *p<0.1; **p<0.05; ***p<0.01


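These tests and tables show that the length of a flight has a statistically significant but small association with delay: delayed flights average about 5.7 minutes longer (135.4 vs. 129.7), and the longer the flight, the greater the chance of delay. The coefficient of about 0.001 is in log-odds per minute, so a 10-minute increase in flight time multiplies the odds of delay by roughly exp(0.01) ≈ 1.01 (about 1% higher odds), which amounts to well under one percentage point of probability.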
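
The size of the length effect can be checked directly from the printed coefficients. This sketch assumes Length is measured in minutes, as the text above does, and uses the rounded values from the table; the 120 and 130 minute flight lengths are arbitrary illustration points.

```r
# Coefficients copied from the stargazer table above (log-odds scale)
b0 <- -0.37    # constant
b1 <- 0.001    # change in log-odds per minute of Length (rounded in the table)

odds_ratio_10 <- exp(b1 * 10)           # effect of 10 extra minutes on the odds
p <- plogis(b0 + b1 * c(120, 130))      # predicted probability at 120 and 130 minutes
change_in_p <- diff(p)                  # change in probability for those 10 minutes

round(c(odds_ratio = odds_ratio_10, p_120 = p[1], p_130 = p[2], change = change_in_p), 4)
```

Because the table rounds the coefficient to one significant digit, these numbers are only approximate, but they show the order of magnitude: ten extra minutes move the predicted probability by a fraction of a percentage point.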