Load packages and data

# Load packages
library(tidyverse) 

# Load data
d <- read_csv("flight_delay_clean.csv")

# Inspect data
glimpse(d)
## Rows: 360
## Columns: 13
## $ airline                 <chr> "RegionEx", "RegionEx", "RegionEx", "RegionEx"…
## $ departure_date          <date> 2008-09-01, 2008-09-01, 2008-09-01, 2008-09-0…
## $ origin                  <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW…
## $ destination             <chr> "MSY", "MSY", "MSY", "MSY", "MSY", "MSY", "MSY…
## $ route_code              <chr> "DFW/MSY", "DFW/MSY", "DFW/MSY", "DFW/MSY", "D…
## $ scheduled_departure     <time> 09:10:00, 13:10:00, 18:10:00, 09:10:00, 13:10…
## $ scheduled_arrival       <time> 10:40:00, 14:40:00, 19:40:00, 10:40:00, 14:40…
## $ actual_arrival          <time> 11:00:00, 15:00:00, 19:58:00, 10:50:00, 14:51…
## $ scheduled_flight_length <dbl> 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90…
## $ actual_flight_length    <dbl> 110, 110, 108, 100, 101, 100, 99, 99, 99, 100,…
## $ delay                   <dbl> 20, 20, 18, 10, 11, 10, 9, 9, 9, 10, 10, 9, 13…
## $ delay_indicator         <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ day_of_week             <dbl> 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7…
summary(d)
##    airline          departure_date          origin          destination       
##  Length:360         Min.   :2008-09-01   Length:360         Length:360        
##  Class :character   1st Qu.:2008-09-08   Class :character   Class :character  
##  Mode  :character   Median :2008-09-15   Mode  :character   Mode  :character  
##                     Mean   :2008-09-15                                        
##                     3rd Qu.:2008-09-23                                        
##                     Max.   :2008-09-30                                        
##   route_code        scheduled_departure scheduled_arrival actual_arrival   
##  Length:360         Length:360          Length:360        Length:360       
##  Class :character   Class1:hms          Class1:hms        Class1:hms       
##  Mode  :character   Class2:difftime     Class2:difftime   Class2:difftime  
##                     Mode  :numeric      Mode  :numeric    Mode  :numeric   
##                                                                            
##                                                                            
##  scheduled_flight_length actual_flight_length     delay        delay_indicator 
##  Min.   : 70.00          Min.   : 59.00       Min.   :-13.00   Min.   :0.0000  
##  1st Qu.: 75.00          1st Qu.: 86.00       1st Qu.:  5.00   1st Qu.:0.0000  
##  Median : 90.00          Median : 97.00       Median : 11.00   Median :0.0000  
##  Mean   : 85.83          Mean   : 99.91       Mean   : 14.07   Mean   :0.2611  
##  3rd Qu.: 90.00          3rd Qu.:108.25       3rd Qu.: 15.00   3rd Qu.:1.0000  
##  Max.   :100.00          Max.   :243.00       Max.   :153.00   Max.   :1.0000  
##   day_of_week 
##  Min.   :1.0  
##  1st Qu.:2.0  
##  Median :4.0  
##  Mean   :3.9  
##  3rd Qu.:6.0  
##  Max.   :7.0

The dataset has been cleaned for you and is ready for analysis.

Questions

Lightly comment your code and use pipes for readability.

Most of the questions require some writing. Please place your text immediately after the “>” in each question, which will make it into a block quote.

Comment briefly on each of the questions, as directed. Only the the final question requires a lengthier response.

Q1

Compute the mean, median, 90th percentile, and standard deviation of arrival delay minutes for RegionEx flights. Do the same for MDA flights. Which measure of central tendency would be most appropriate for comparing airline performance?

d %>%
  group_by(airline) %>%
  summarize(mean = mean(delay),
            median = median(delay),
            perc_90 = quantile(delay, c(.9)),
            sd = sd(delay))
## # A tibble: 2 × 5
##   airline   mean median perc_90    sd
##   <chr>    <dbl>  <dbl>   <dbl> <dbl>
## 1 MDA       10.9     13    16.1  6.34
## 2 RegionEx  15.7      9    21   27.7

Median should be used for comparing MDA and RegionEx. RegionEx has a standard deviation of 27.65, telling us that the data is very spread out. In addition, RegionEx’s 90th percentile is very far apart from the median, while it is very close to the mean, indicating that the data distribution of RegionEx is right-skewed. Since the mean is sensitive to the spread or the extreme values of the data, median is a more appropriate central tendency in comparing the performances of the two airlines.

Q2

Inspect the distribution of RegionEx’s arrival delays by constructing a histogram of the number of arrival delay minutes of RegionEx’s flights. Do the same for MDA’s flights. How do these two distributions compare? Hint: use facet_wrap().

ggplot(data = d, aes(x = delay)) + 
  geom_histogram() + 
  facet_wrap(~airline) +
  labs(title = "arrival delay minitues ~ airline")

The medians of the two airlines are close to each other, however, MDA data mainly clusters around -10 to 20 minutes, in contrast, the arrival delay at RegionEx has a wide spread from -10 to 150 minutes. In other words, RegionEx has values that are more dispersed. It also appears that MDA has a normal distribution. Yet, the tail on the right side in RegionEx’s histogram confirms that RegionEx data is right-skewed.

Q3

So far we have considered airline performance in terms of average delay in minutes. However, the performance metrics, as noted in the case description, also include the percentage of delayed flights. Let’s verify that MDA’s COO is correct: does RegionEx have a higher percentage of delayed flights?

Here is code to answer that question:

# Create a summary table of percent delayed by airline.
d %>% 
  group_by(airline) %>% 
  summarize(n = n(),
            percent_delay = (mean(delay_indicator) * 100) %>% round(1)) 
## # A tibble: 2 × 3
##   airline      n percent_delay
##   <chr>    <int>         <dbl>
## 1 MDA        120          25.8
## 2 RegionEx   240          26.2

Note that because delay_indicator is numeric (a binary 0/1 variable) calculating the mean of the vector returns the proportion of 1s, which, multiplied by 100, is equivalent to the percentage of delayed flights.

Write your own code to create a table summarizing the percentage of delayed flights by airline and route.

d %>%
  group_by(airline,
           route_code) %>%
  summarize(n = n(),
    percent_delayed = (mean(delay_indicator) * 100) %>% round(1)) %>%
  arrange(desc(route_code))
## # A tibble: 8 × 4
## # Groups:   airline [2]
##   airline  route_code     n percent_delayed
##   <chr>    <chr>      <int>           <dbl>
## 1 MDA      PNS/MSY       30            26.7
## 2 RegionEx PNS/MSY       30            26.7
## 3 MDA      MSY/PNS       30            20  
## 4 RegionEx MSY/PNS       30            20  
## 5 MDA      MSY/DFW       30            30  
## 6 RegionEx MSY/DFW       90            28.9
## 7 MDA      DFW/MSY       30            26.7
## 8 RegionEx DFW/MSY       90            25.6

These tables contain conflicting information. How should you answer the question of whether RegionEx has a higher percentage of delayed flights? Is the the COO correct? And, if not, why not?

The COO is not correct. By grouping the data by airlines and routes, the table above shows that in the flights connecting PNS and MSY, MDA and RegionEx both have the same percentage of delayed flights. However, we can observe a difference in the percentage between the airlines in the flights connecting MSY and DFW. In the MYS/DFW route, MDA and RegionEx have 30% and 28.9% of delayed flights respectively. And in the DFW/MYS route, MDA has a 26.7% of delayed flights, while RegionEx only has a 25.6%. Therefore, MDA has higher percentages of delayed flights than RegionEx. In the first table, RegionEx has a higher percentage because RegionEx has double the amount of flights compared to MDA, i.e., 240 and 120.

Q4

Compare the scheduled flight durations for the two airlines on each of their four routes. Also compare the actual flight durations for the two airlines. What do you notice? If the two airlines had the same scheduled duration, what impact would this have on their delay records?

d %>%
  group_by(airline,
           route_code) %>%
  summarize(n = n(),
            median_scheduled = median(scheduled_flight_length),
            median_actual = median(actual_flight_length)) %>%
  arrange(desc(route_code))
## # A tibble: 8 × 5
## # Groups:   airline [2]
##   airline  route_code     n median_scheduled median_actual
##   <chr>    <chr>      <int>            <dbl>         <dbl>
## 1 MDA      PNS/MSY       30               75          79.5
## 2 RegionEx PNS/MSY       30               70          77.5
## 3 MDA      MSY/PNS       30               75          86  
## 4 RegionEx MSY/PNS       30               70          76.5
## 5 MDA      MSY/DFW       30              100         113  
## 6 RegionEx MSY/DFW       90               90          99  
## 7 MDA      DFW/MSY       30              100         114. 
## 8 RegionEx DFW/MSY       90               90         100

The two airlines do not have the same scheduled flight duration. MDA’s scheduled flight durations connecting DFW and MSY is 10 minutes longer than RegionEx, and 5 minutes longer than RegionEx for scheduled flights connecting PNS and MSY. A similar observervation is seen in the actual flight durations. In all routes, MDA has longer flights than RegionEx, especially in the route MSY/DFW and DFW/MSY (113 vs 99 and 113.5 vs 100). If the two airlines had the same scheduled duration, RegionEx would actually have a lower percentage of delayed flights than MDA.

Q5

Does the data support the claim that the on‐time performance of RegionEx is worse than that of MDA? Write a paragraph in which you argue a position. In your answer, please incorporate quantitative evidence from the earlier questions.

The data does not support the claim that the on-time performance of RegionEx is worse than that of MDA. First, the FAA claimed that RegionEx ranked worse than MDA in the percentage of delayed flights for September. The table in Q3 shows that MDA has a higher percentage of delayed flights than RegionEx. RegionEx shares the same percentage of delayed flights with MDA in two routes and has a lower rate in the other two routes (i.e., 8.9% vs. 30% and 25.6% vs. 26.7%). Secondly, the FAA also claimed that RegionEx’s average flight delay in minutes is significantly greater than that of MDA. Q4’s table shows that RegionEx has a shorter actual flight duration than MDA in all routes, where the extremes can be observed in the MSY/DFW and DFW/MSY routes, 99 vs. 113 and 100 vs. 113.5. Therefore, in reality, RegionEx has a better on-time performance than that of MDA.