This code through explores an airline data set that records several identifiers for each flight, such as the airline, the airports, the day of the week, and the flight length. We will explore and visualize patterns within these identifiers using data pipelines and ggplot, and then test whether each one is statistically significant with respect to delay.
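library(dplyr); library(ggplot2)     # pipes, grouping, and plotting
library(pander); library(stargazer)  # formatted tables and regression output
airline <- read.csv("Airlines.csv")
head(airline, n=12) %>% pander()

| id | Airline | Flight | AirportFrom | AirportTo | DayOfWeek | Time | Length | Delay |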
|---|---|---|---|---|---|---|---|---|
| 1 | CO | 269 | SFO | IAH | 3 | 15 | 205 | 1 |
| 2 | US | 1558 | PHX | CLT | 3 | 15 | 222 | 1 |
| 3 | AA | 2400 | LAX | DFW | 3 | 20 | 165 | 1 |
| 4 | AA | 2466 | SFO | DFW | 3 | 20 | 195 | 1 |
| 5 | AS | 108 | ANC | SEA | 3 | 30 | 202 | 0 |
| 6 | CO | 1094 | LAX | IAH | 3 | 30 | 181 | 1 |
| 7 | DL | 1768 | LAX | MSP | 3 | 30 | 220 | 0 |
| 8 | DL | 2722 | PHX | DTW | 3 | 30 | 228 | 0 |
| 9 | DL | 2606 | SFO | MSP | 3 | 35 | 216 | 1 |
| 10 | AA | 2538 | LAS | ORD | 3 | 40 | 200 | 1 |
| 11 | CO | 223 | ANC | SEA | 3 | 49 | 201 | 1 |
| 12 | DL | 1646 | PHX | ATL | 3 | 50 | 212 | 1 |
Specifically, we’ll examine and demonstrate significance tests using glm and stargazer for the relevant flight information with regard to delay. Does one airline have significantly more delays than others? Are there more delays on certain days of the week? Are there more delays at one airport than at another?
In this code through, you’ll learn how to plot various categories with different features using ggplot, fit logistic regression models with glm, and present their significance tests using stargazer.
Here, we’ll show how to create tables and plots to visualize the different airline identifiers. We will also summarize the data to compare delayed and on-time flights. Then we will fit a regression model for each identifier and test its statistical significance.
Here we will explore similarities and differences with respect to individual airlines. Below is a list of airline codes with the corresponding airline. Let’s first look at which airline has the most flights.
WN - Southwest
DL - Delta
OO - SkyWest
AA - American
MQ - Envoy
US - US Airways
XE - JSX
EV - ExpressJet
UA - United
CO - Continental
FL - AirTran
9E - Endeavor
B6 - JetBlue
YV - Mesa
OH - Jetstream
AS - Alaska Airlines
F9 - Frontier
HA - Hawaiian Airlines
airline %>%
group_by(Airline) %>%
summarise(n=n()) %>%
arrange(desc(n)) %>%
pander()

| Airline | n |
|---|---|
| WN | 94097 |
| DL | 60940 |
| OO | 50254 |
| AA | 45656 |
| MQ | 36605 |
| US | 34500 |
| XE | 31126 |
| EV | 27983 |
| UA | 27619 |
| CO | 21118 |
| FL | 20827 |
| 9E | 20686 |
| B6 | 18112 |
| YV | 13725 |
| OH | 12630 |
| AS | 11471 |
| F9 | 6456 |
| HA | 5578 |
Here we can group by Airline, compute the share of flights with Delay equal to 1, and see which airline has the greatest proportion of delayed flights.
airline %>%
group_by(Airline) %>%
summarise(n=n(), Prop_Delayed = round(sum(Delay==1)/n, 3)) %>%
arrange(desc(Prop_Delayed)) %>%
pander()

| Airline | n | Prop_Delayed |
|---|---|---|
| WN | 94097 | 0.698 |
| CO | 21118 | 0.566 |
| B6 | 18112 | 0.467 |
| OO | 50254 | 0.453 |
| DL | 60940 | 0.45 |
| F9 | 6456 | 0.449 |
| EV | 27983 | 0.402 |
| 9E | 20686 | 0.398 |
| AA | 45656 | 0.388 |
| XE | 31126 | 0.379 |
| MQ | 36605 | 0.348 |
| AS | 11471 | 0.339 |
| US | 34500 | 0.336 |
| UA | 27619 | 0.324 |
| HA | 5578 | 0.32 |
| FL | 20827 | 0.301 |
| OH | 12630 | 0.277 |
| YV | 13725 | 0.243 |
Now let’s plot this table using ggplot.
plotairline <- airline %>%
group_by(Airline) %>%
summarise(n=n(), Prop_Delayed = round(sum(Delay==1)/n, 3)) %>%
arrange(desc(Prop_Delayed))
ggplot(data=plotairline, aes(x=reorder(Airline, Prop_Delayed), y=Prop_Delayed)) +
geom_bar(stat = "identity", fill=alpha("firebrick4", 0.75)) +
labs(x="Airline", y="Proportion Delayed") +
geom_hline(yintercept = mean(plotairline$Prop_Delayed), linetype= "dashed")  # dashed line: mean proportion across airlines

As shown in the table and chart above, Southwest has the greatest number of flights as well as the greatest proportion of delayed flights. Delta is second in number of flights but fifth in proportion of delayed flights. Also, JetBlue is in the bottom six for number of flights but the top three for proportion delayed.
Lastly, let’s look at the logistic regression and test for significant impact.
reg <- glm(airline$Delay ~ airline$Airline, family = binomial)
stargazer(reg,
type="html",
omit.stat = "all",
digits=2)

Dependent variable: Delay

| Term | Coefficient | Std. error |
|---|---|---|
| AirlineAA | -0.04** | 0.02 |
| AirlineAS | -0.25*** | 0.02 |
| AirlineB6 | 0.28*** | 0.02 |
| AirlineCO | 0.68*** | 0.02 |
| AirlineDL | 0.22*** | 0.02 |
| AirlineEV | 0.02 | 0.02 |
| AirlineF9 | 0.21*** | 0.03 |
| AirlineFL | -0.43*** | 0.02 |
| AirlineHA | -0.34*** | 0.03 |
| AirlineMQ | -0.21*** | 0.02 |
| AirlineOH | -0.54*** | 0.02 |
| AirlineOO | 0.23*** | 0.02 |
| AirlineUA | -0.32*** | 0.02 |
| AirlineUS | -0.27*** | 0.02 |
| AirlineWN | 1.25*** | 0.02 |
| AirlineXE | -0.08*** | 0.02 |
| AirlineYV | -0.72*** | 0.02 |
| Constant | -0.42*** | 0.01 |

Note: *p<0.1; **p<0.05; ***p<0.01
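This table shows that nearly every airline differs significantly from the reference airline, 9E (Endeavor), which is the level omitted from the table. Only EV shows no significant difference. The main airlines of note (due to large positive coefficients) are:
WN - Southwest
CO - Continental
B6 - JetBlue
The coefficients are on the log-odds scale, so this suggests that flying with one of these airlines raises the odds of a delay substantially relative to 9E: roughly 1.3 times for JetBlue, 2 times for Continental, and 3.5 times for Southwest.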
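To make the size of these effects concrete, here is a minimal sketch that converts the log-odds coefficients into odds ratios and predicted probabilities. The numbers are copied by hand from the regression table above (rounded to two digits), and the variable names are only illustrative.

```r
# Coefficients copied from the regression table (log-odds scale, rounded).
intercept <- -0.42                        # reference airline (omitted level): 9E
coefs <- c(B6 = 0.28, CO = 0.68, WN = 1.25)

odds_ratio <- exp(coefs)                  # odds of delay relative to 9E
prob_ref   <- plogis(intercept)           # predicted delay probability for 9E
prob       <- plogis(intercept + coefs)   # predicted delay probability per airline

round(data.frame(odds_ratio, prob, diff_vs_ref = prob - prob_ref), 3)
```

The predicted probabilities should land close to the Prop_Delayed values in the earlier table (0.467, 0.566, and 0.698), with small differences due to the rounded coefficients.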
Here we will explore similarities and differences with respect to the day of the week and delay. In this data set, 1 represents Monday and 7 represents Sunday.
airline %>%
group_by(DayOfWeek) %>%
summarise(n=n()) %>%
pander()

| DayOfWeek | n |
|---|---|
| 1 | 72769 |
| 2 | 71340 |
| 3 | 89746 |
| 4 | 91445 |
| 5 | 85248 |
| 6 | 58956 |
| 7 | 69879 |
Let’s look at the proportion of delayed flights by Day of Week.
airline %>%
group_by(DayOfWeek) %>%
summarise(n=n(), Prop_Delayed = round(sum(Delay ==1)/n, 3)) %>%
pander()

| DayOfWeek | n | Prop_Delayed |
|---|---|---|
| 1 | 72769 | 0.468 |
| 2 | 71340 | 0.447 |
| 3 | 89746 | 0.471 |
| 4 | 91445 | 0.451 |
| 5 | 85248 | 0.417 |
| 6 | 58956 | 0.401 |
| 7 | 69879 | 0.454 |
Now let’s plot the table using ggplot.
plotday <- airline %>%
group_by(DayOfWeek) %>%
summarise(n=n(), Prop_Delayed = round(sum(Delay ==1)/n, 3))
ggplot(data = plotday, aes(x = DayOfWeek, y = Prop_Delayed))+  # refer to columns by name inside aes()
geom_line(col=alpha("firebrick4", 0.75)) +
geom_point(col=alpha("firebrick4", 0.75)) +
labs(x="Day of Week", y = "Proportion of Flights Delayed")
As seen in the chart above, it appears that there is a greater likelihood that your flight will be delayed if you fly on Monday or Wednesday.
Finally, let’s look at the logistic regression and test for significant impact.
reg <- glm(airline$Delay ~ as.factor(airline$DayOfWeek)-1, family = binomial)
stargazer(reg,
type="html",
omit.stat = "all",
digits=2)

Dependent variable: Delay

| Term | Coefficient | Std. error |
|---|---|---|
| DayOfWeek 1 | -0.13*** | 0.01 |
| DayOfWeek 2 | -0.21*** | 0.01 |
| DayOfWeek 3 | -0.12*** | 0.01 |
| DayOfWeek 4 | -0.20*** | 0.01 |
| DayOfWeek 5 | -0.34*** | 0.01 |
| DayOfWeek 6 | -0.40*** | 0.01 |
| DayOfWeek 7 | -0.19*** | 0.01 |

Note: *p<0.1; **p<0.05; ***p<0.01
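From the table above, we can see that the model agrees with our suspicion. Because the intercept was removed (the -1 in the formula), each coefficient is the log-odds of delay on that day. Monday and Wednesday have the least negative coefficients, which means they have the highest estimated probability of delay. Note that the significance stars here compare each day with log-odds of 0 (a 50% delay rate), not with the other days.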
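As a quick check on how to read a model without an intercept, here is a small sketch that converts each day's coefficient back into a probability. The values are typed in from the table above, and the day labels assume the coding described earlier (1 = Monday through 7 = Sunday).

```r
# Per-day log-odds copied from the no-intercept regression table (rounded).
day_coefs <- c(Mon = -0.13, Tue = -0.21, Wed = -0.12, Thu = -0.20,
               Fri = -0.34, Sat = -0.40, Sun = -0.19)

# plogis() is the inverse logit: it maps log-odds to probabilities.
day_prob <- plogis(day_coefs)

round(day_prob, 3)
names(which.max(day_prob))  # day with the highest estimated delay probability
```

If the reading is right, these probabilities should roughly reproduce the Prop_Delayed column from the day-of-week table, up to rounding in the coefficients.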
Here we will explore similarities and differences with respect to the Airport of Departure and delay.
airline %>%
group_by(AirportFrom) %>%
summarise(n=n()) %>%
arrange(desc(n)) %>%
top_n(18, n) %>%
pander()

| AirportFrom | n |
|---|---|
| ATL | 34449 |
| ORD | 24822 |
| DFW | 22154 |
| DEN | 19843 |
| LAX | 16657 |
| IAH | 15821 |
| PHX | 15557 |
| DTW | 13136 |
| LAS | 11918 |
| SFO | 11786 |
| CLT | 11133 |
| MCO | 10596 |
| SLC | 10473 |
| MSP | 10166 |
| EWR | 9673 |
| JFK | 9496 |
| BOS | 9439 |
| BWI | 8565 |
…
Let’s look at the proportion of delayed flights based on outgoing airport. Note: there are over 200 different airports listed in this data. We will only be viewing the top 18.
airline %>%
group_by(AirportFrom) %>%
summarise(n=n(), Prop_Delayed = round(sum(Delay==1)/n, 3)) %>%
arrange(desc(Prop_Delayed)) %>%
top_n(18, Prop_Delayed) %>%
pander()

| AirportFrom | n | Prop_Delayed |
|---|---|---|
| MDW | 7103 | 0.735 |
| DAL | 3838 | 0.716 |
| OAK | 3783 | 0.713 |
| HOU | 4420 | 0.667 |
| OTH | 93 | 0.634 |
| FLO | 18 | 0.611 |
| SMF | 3504 | 0.606 |
| GUM | 10 | 0.6 |
| SJC | 3357 | 0.587 |
| LWS | 57 | 0.579 |
| BWI | 8565 | 0.573 |
| STL | 5031 | 0.568 |
| RNO | 1708 | 0.566 |
| LAS | 11918 | 0.56 |
| ADK | 9 | 0.556 |
| OTZ | 90 | 0.556 |
| ISP | 631 | 0.552 |
| BET | 84 | 0.548 |
plotairport <- airline %>%
group_by(AirportFrom) %>%
summarise(n=n(), Prop_Delayed = round(sum(Delay==1)/n, 3)) %>%
arrange(desc(Prop_Delayed)) %>%
top_n(18, Prop_Delayed)
ggplot(data=plotairport, aes(x=reorder(AirportFrom, Prop_Delayed), y=Prop_Delayed)) +
geom_bar(stat = "identity", fill=alpha("firebrick4", 0.75)) +
labs(x="Departure Airport", y="Proportion Delayed") +
geom_hline(yintercept = mean(plotairport$Prop_Delayed), linetype = "dashed", size=1.0)
With over 200 airports listed in this data, a regression with one coefficient per airport would be impractical to present here. Based on the plot above, we can see that there are five airports that have above average delays:
Without a regression, it is hard to quantify the exact impact these airports have on flight delay.
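One thing we can still do without a regression is look at how uncertain each airport's proportion is, since several airports in the table have very few flights. The sketch below uses base R's binom.test for three airports from the table. The delayed counts are not in the original data summary: they are reconstructed here as round(n * Prop_Delayed), so treat them as approximate.

```r
# Counts reconstructed from the rounded table values (an assumption):
# delayed = round(n * Prop_Delayed).
airports <- data.frame(
  AirportFrom = c("MDW", "GUM", "ADK"),
  n           = c(7103, 10, 9),
  delayed     = c(5221, 6, 5)
)

# Exact (Clopper-Pearson) 95% confidence interval for each proportion.
ci <- t(mapply(function(x, n) binom.test(x, n)$conf.int,
               airports$delayed, airports$n))
airports$lower <- ci[, 1]
airports$upper <- ci[, 2]
airports$width <- airports$upper - airports$lower

airports
```

The interval for a busy airport like MDW is narrow, while the intervals for GUM and ADK are very wide, which is a reason to read the rankings of small airports with caution.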
Finally, here we will explore similarities and differences with respect to the Length of flight and its impact on delay.
# Assumed setup (not shown in the original): flight lengths split by delay status
x1 <- airline$Length[airline$Delay == 0]  # on-time flights
x2 <- airline$Length[airline$Delay == 1]  # delayed flights
new_data <- data.frame(x = c(0, 1), y = c(mean(x1), mean(x2)))
ggplot(data=new_data, aes(x=x, y=y)) +
geom_point(col=alpha("firebrick4", 0.75), size=3)+
geom_line(col=alpha("firebrick4",0.75))+
labs(x="Delayed (0=No, 1=Yes)", y="Mean Flight Time") +
lims(x=c(-0.5,1.5), y=c(125, 140))
t.test(x1, x2)

Welch Two Sample t-test
data: x1 and x2
t = -29.621, df = 504681, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -6.090060 -5.334146
sample estimates:
mean of x: 129.6576
mean of y: 135.3697
reg <- glm(airline$Delay ~ airline$Length, family = binomial)
stargazer(reg,
type="html",
omit.stat = "all",
digits=2)

Dependent variable: Delay

| Term | Coefficient | Std. error |
|---|---|---|
| Length | 0.001*** | 0.0000 |
| Constant | -0.37*** | 0.01 |

Note: *p<0.1; **p<0.05; ***p<0.01
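These tests and tables show that the length of a flight does have a statistically significant association with whether or not the flight is delayed. The longer the flight, the greater the chance of delay. The effect is small, however. The coefficient of about 0.001 is the change in log-odds per minute, so every increase of 10 minutes in flight time raises the log-odds of delay by about 0.01. That is roughly a 1% increase in the odds, or about a quarter of a percentage point in probability for a flight of typical length.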
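Here is a short sketch of that arithmetic, using the rounded coefficients from the table above. The flight lengths of 130 and 140 minutes are just illustrative values near the group means reported by the t-test.

```r
# Rounded coefficients copied from the Length regression table.
intercept <- -0.37
slope     <- 0.001   # change in log-odds per minute of flight length

# Odds ratio for a 10-minute increase in flight length.
odds_ratio_10min <- exp(slope * 10)

# Predicted delay probability at two illustrative flight lengths.
len <- c(130, 140)
p <- plogis(intercept + slope * len)

odds_ratio_10min
round(p, 4)
diff(p)   # change in probability for the extra 10 minutes
```

Because the published slope is rounded to one significant digit, these figures are only approximate, but they show the order of magnitude of the effect.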
Learn more about ggplot, glm, and the Airlines data set with the following:
Resource I https://ggplot2.tidyverse.org/
Resource II https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/glm
Resource III https://www.kaggle.com/datasets/jimschacko/airlines-dataset-to-predict-a-delay?resource=download