# Load packages
library(tidyverse)
# Load data
d <- read_csv("flight_delay_clean.csv")
# Inspect data
glimpse(d)
## Rows: 360
## Columns: 13
## $ airline <chr> "RegionEx", "RegionEx", "RegionEx", "RegionEx"…
## $ departure_date <date> 2008-09-01, 2008-09-01, 2008-09-01, 2008-09-0…
## $ origin <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW…
## $ destination <chr> "MSY", "MSY", "MSY", "MSY", "MSY", "MSY", "MSY…
## $ route_code <chr> "DFW/MSY", "DFW/MSY", "DFW/MSY", "DFW/MSY", "D…
## $ scheduled_departure <time> 09:10:00, 13:10:00, 18:10:00, 09:10:00, 13:10…
## $ scheduled_arrival <time> 10:40:00, 14:40:00, 19:40:00, 10:40:00, 14:40…
## $ actual_arrival <time> 11:00:00, 15:00:00, 19:58:00, 10:50:00, 14:51…
## $ scheduled_flight_length <dbl> 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90…
## $ actual_flight_length <dbl> 110, 110, 108, 100, 101, 100, 99, 99, 99, 100,…
## $ delay <dbl> 20, 20, 18, 10, 11, 10, 9, 9, 9, 10, 10, 9, 13…
## $ delay_indicator <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_of_week <dbl> 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7…
summary(d)
## airline departure_date origin destination
## Length:360 Min. :2008-09-01 Length:360 Length:360
## Class :character 1st Qu.:2008-09-08 Class :character Class :character
## Mode :character Median :2008-09-15 Mode :character Mode :character
## Mean :2008-09-15
## 3rd Qu.:2008-09-23
## Max. :2008-09-30
## route_code scheduled_departure scheduled_arrival actual_arrival
## Length:360 Length:360 Length:360 Length:360
## Class :character Class1:hms Class1:hms Class1:hms
## Mode :character Class2:difftime Class2:difftime Class2:difftime
## Mode :numeric Mode :numeric Mode :numeric
##
##
## scheduled_flight_length actual_flight_length delay delay_indicator
## Min. : 70.00 Min. : 59.00 Min. :-13.00 Min. :0.0000
## 1st Qu.: 75.00 1st Qu.: 86.00 1st Qu.: 5.00 1st Qu.:0.0000
## Median : 90.00 Median : 97.00 Median : 11.00 Median :0.0000
## Mean : 85.83 Mean : 99.91 Mean : 14.07 Mean :0.2611
## 3rd Qu.: 90.00 3rd Qu.:108.25 3rd Qu.: 15.00 3rd Qu.:1.0000
## Max. :100.00 Max. :243.00 Max. :153.00 Max. :1.0000
## day_of_week
## Min. :1.0
## 1st Qu.:2.0
## Median :4.0
## Mean :3.9
## 3rd Qu.:6.0
## Max. :7.0
The dataset has been cleaned for you and is ready for analysis.
Lightly comment your code and use pipes for readability.
Most of the questions require some writing. Please place your text immediately after the “>” in each question, which will make it into a block quote.
Comment briefly on each of the questions, as directed. Only the the final question requires a lengthier response.
Compute the mean, median, 90th percentile, and standard deviation of arrival delay minutes for RegionEx flights. Do the same for MDA flights. Which measure of central tendency would be most appropriate for comparing airline performance?
d %>%
group_by(airline) %>%
summarize(mean = mean(delay),
median = median(delay),
perc_90 = quantile(delay, c(.9)),
sd = sd(delay))
## # A tibble: 2 × 5
## airline mean median perc_90 sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 MDA 10.9 13 16.1 6.34
## 2 RegionEx 15.7 9 21 27.7
Median should be used for comparing MDA and RegionEx. RegionEx has a standard deviation of 27.65, telling us that the data is very spread out. In addition, RegionEx’s 90th percentile is very far apart from the median, while it is very close to the mean, indicating that the data distribution of RegionEx is right-skewed. Since the mean is sensitive to the spread or the extreme values of the data, median is a more appropriate central tendency in comparing the performances of the two airlines.
Inspect the distribution of RegionEx’s arrival delays by constructing a histogram of the number of arrival delay minutes of RegionEx’s flights. Do the same for MDA’s flights. How do these two distributions compare? Hint: use facet_wrap().
ggplot(data = d, aes(x = delay)) +
geom_histogram() +
facet_wrap(~airline) +
labs(title = "arrival delay minitues ~ airline")
The medians of the two airlines are close to each other, however, MDA data mainly clusters around -10 to 20 minutes, in contrast, the arrival delay at RegionEx has a wide spread from -10 to 150 minutes. In other words, RegionEx has values that are more dispersed. It also appears that MDA has a normal distribution. Yet, the tail on the right side in RegionEx’s histogram confirms that RegionEx data is right-skewed.
So far we have considered airline performance in terms of average delay in minutes. However, the performance metrics, as noted in the case description, also include the percentage of delayed flights. Let’s verify that MDA’s COO is correct: does RegionEx have a higher percentage of delayed flights?
Here is code to answer that question:
# Create a summary table of percent delayed by airline.
d %>%
group_by(airline) %>%
summarize(n = n(),
percent_delay = (mean(delay_indicator) * 100) %>% round(1))
## # A tibble: 2 × 3
## airline n percent_delay
## <chr> <int> <dbl>
## 1 MDA 120 25.8
## 2 RegionEx 240 26.2
Note that because delay_indicator
is numeric (a binary
0/1 variable) calculating the mean of the vector returns the proportion
of 1s, which, multiplied by 100, is equivalent to the
percentage of delayed flights.
Write your own code to create a table summarizing the percentage of delayed flights by airline and route.
d %>%
group_by(airline,
route_code) %>%
summarize(n = n(),
percent_delayed = (mean(delay_indicator) * 100) %>% round(1)) %>%
arrange(desc(route_code))
## # A tibble: 8 × 4
## # Groups: airline [2]
## airline route_code n percent_delayed
## <chr> <chr> <int> <dbl>
## 1 MDA PNS/MSY 30 26.7
## 2 RegionEx PNS/MSY 30 26.7
## 3 MDA MSY/PNS 30 20
## 4 RegionEx MSY/PNS 30 20
## 5 MDA MSY/DFW 30 30
## 6 RegionEx MSY/DFW 90 28.9
## 7 MDA DFW/MSY 30 26.7
## 8 RegionEx DFW/MSY 90 25.6
These tables contain conflicting information. How should you answer the question of whether RegionEx has a higher percentage of delayed flights? Is the the COO correct? And, if not, why not?
The COO is not correct. By grouping the data by airlines and routes, the table above shows that in the flights connecting PNS and MSY, MDA and RegionEx both have the same percentage of delayed flights. However, we can observe a difference in the percentage between the airlines in the flights connecting MSY and DFW. In the MYS/DFW route, MDA and RegionEx have 30% and 28.9% of delayed flights respectively. And in the DFW/MYS route, MDA has a 26.7% of delayed flights, while RegionEx only has a 25.6%. Therefore, MDA has higher percentages of delayed flights than RegionEx. In the first table, RegionEx has a higher percentage because RegionEx has double the amount of flights compared to MDA, i.e., 240 and 120.
Compare the scheduled flight durations for the two airlines on each of their four routes. Also compare the actual flight durations for the two airlines. What do you notice? If the two airlines had the same scheduled duration, what impact would this have on their delay records?
d %>%
group_by(airline,
route_code) %>%
summarize(n = n(),
median_scheduled = median(scheduled_flight_length),
median_actual = median(actual_flight_length)) %>%
arrange(desc(route_code))
## # A tibble: 8 × 5
## # Groups: airline [2]
## airline route_code n median_scheduled median_actual
## <chr> <chr> <int> <dbl> <dbl>
## 1 MDA PNS/MSY 30 75 79.5
## 2 RegionEx PNS/MSY 30 70 77.5
## 3 MDA MSY/PNS 30 75 86
## 4 RegionEx MSY/PNS 30 70 76.5
## 5 MDA MSY/DFW 30 100 113
## 6 RegionEx MSY/DFW 90 90 99
## 7 MDA DFW/MSY 30 100 114.
## 8 RegionEx DFW/MSY 90 90 100
The two airlines do not have the same scheduled flight duration. MDA’s scheduled flight durations connecting DFW and MSY is 10 minutes longer than RegionEx, and 5 minutes longer than RegionEx for scheduled flights connecting PNS and MSY. A similar observervation is seen in the actual flight durations. In all routes, MDA has longer flights than RegionEx, especially in the route MSY/DFW and DFW/MSY (113 vs 99 and 113.5 vs 100). If the two airlines had the same scheduled duration, RegionEx would actually have a lower percentage of delayed flights than MDA.
Does the data support the claim that the on‐time performance of RegionEx is worse than that of MDA? Write a paragraph in which you argue a position. In your answer, please incorporate quantitative evidence from the earlier questions.
The data does not support the claim that the on-time performance of RegionEx is worse than that of MDA. First, the FAA claimed that RegionEx ranked worse than MDA in the percentage of delayed flights for September. The table in Q3 shows that MDA has a higher percentage of delayed flights than RegionEx. RegionEx shares the same percentage of delayed flights with MDA in two routes and has a lower rate in the other two routes (i.e., 8.9% vs. 30% and 25.6% vs. 26.7%). Secondly, the FAA also claimed that RegionEx’s average flight delay in minutes is significantly greater than that of MDA. Q4’s table shows that RegionEx has a shorter actual flight duration than MDA in all routes, where the extremes can be observed in the MSY/DFW and DFW/MSY routes, 99 vs. 113 and 100 vs. 113.5. Therefore, in reality, RegionEx has a better on-time performance than that of MDA.