Question1

Compute the mean, median, 90th percentile, and standard deviation of arrival delay minutes for RegionEx flights. Do the same for MDA flights.

# Load the 'tidyverse' library
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Read the CSV file 'flight_delay_clean.csv' into a data frame called 'flight'
flight <- read.csv("flight_delay_clean.csv")


# Summarize arrival delay (in minutes) for each airline
flight |>
  group_by(airline) |>
  summarize(mean_adelay = mean(delay),                     # mean delay per airline
            median_adelay = median(delay),                 # median delay per airline
            sd_adelay = sd(delay),                         # standard deviation of delay per airline
            perc_90_adelay = quantile(delay, probs = .9))  # 90th percentile of delay per airline
## # A tibble: 2 × 5
##   airline  mean_adelay median_adelay sd_adelay perc_90_adelay
##   <chr>          <dbl>         <dbl>     <dbl>          <dbl>
## 1 MDA             10.9            13      6.34           16.1
## 2 RegionEx        15.7             9     27.7            21

Contractual obligations aside, which measure of central tendency would be most appropriate for comparing airline performance?

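We should use the median, because the delay data is skewed and the mean is affected by extreme values (outliers). For RegionEx, the mean (15.7 minutes) is well above the median (9 minutes), which is what a few very late flights would produce.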
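To illustrate why the mean is the more fragile measure, here is a minimal sketch in base R. The delay values are made up for illustration and are not taken from flight_delay_clean.csv; the point is only that one extreme value moves the mean much more than the median.

```r
# Hypothetical delay minutes, invented for illustration only
delays <- c(5, 8, 9, 10, 12)
delays_with_outlier <- c(delays, 150)  # add one very late flight

# The mean shifts substantially once the outlier is included
mean_before <- mean(delays)
mean_after <- mean(delays_with_outlier)

# The median barely moves
median_before <- median(delays)
median_after <- median(delays_with_outlier)

cat("mean:  ", mean_before, "->", mean_after, "\n")
cat("median:", median_before, "->", median_after, "\n")
```

In this toy example the median changes by half a minute while the mean more than triples, which mirrors the gap between RegionEx's mean and median in the table above.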

Question2

Inspect the distribution of RegionEx’s arrival delays by constructing a histogram of the number of arrival delay minutes of RegionEx’s flights. Do the same for MDA’s flights. Hint: use facet_wrap().

# Create a histogram of flight delays, faceted by airline.
flight %>%
  ggplot(aes(x = delay)) +
  geom_histogram() +
  facet_wrap(airline ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

How do these two distributions compare?

RegionEx’s delay distribution is right-skewed and far more spread out than MDA’s (a standard deviation of 27.7 minutes versus 6.34). This is mainly because RegionEx had several flights that arrived significantly later than most of its other flights.

Question3

So far you have considered airline performance in terms of minutes delayed. However, the performance metrics, as noted in the case description, also include the percentage of delayed flights. Let’s verify that MDA’s COO is correct:

Does RegionEx have a higher percentage of delayed flights?

In the letter, MDA’s COO stated that RegionEx ranked worse than MDA in terms of the percentage of delayed flights for September. However, this claim overstates the case: the difference between the two airlines is minimal. The delay percentage for MDA is 25.8%, while for RegionEx it is 26.2%, a gap of 0.4 percentage points. RegionEx’s performance on this metric is almost the same as MDA’s, contrary to what the letter implied.

Here is code to answer that question:

# Create a summary table of percent delayed by airline.
flight%>% 
  group_by(airline) %>% 
  summarize(n = n(),
            percent_delay = (mean(delay_indicator) * 100) %>% round(1)) 
## # A tibble: 2 × 3
##   airline      n percent_delay
##   <chr>    <int>         <dbl>
## 1 MDA        120          25.8
## 2 RegionEx   240          26.2

Write your own code to create a table summarizing the percentage of delayed flights by airline and route.

flight %>% 
  group_by(airline, route_code) %>%  #Group by airline and route
  summarize(
    total_flights = n(),  # Count total number of flights
    percent_delay = round(mean(delay_indicator) * 100, 1)  # Calculate and round the percentage of delayed flights
  )
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
## # A tibble: 8 × 4
## # Groups:   airline [2]
##   airline  route_code total_flights percent_delay
##   <chr>    <chr>              <int>         <dbl>
## 1 MDA      DFW/MSY               30          26.7
## 2 MDA      MSY/DFW               30          30  
## 3 MDA      MSY/PNS               30          20  
## 4 MDA      PNS/MSY               30          26.7
## 5 RegionEx DFW/MSY               90          25.6
## 6 RegionEx MSY/DFW               90          28.9
## 7 RegionEx MSY/PNS               30          20  
## 8 RegionEx PNS/MSY               30          26.7

Notice that these tables (percent delayed by airline vs. percent delayed by airline and route) contain conflicting information. How should you answer the question of whether RegionEx has a higher percentage of delayed flights? Is the COO correct? And, if not, why not?

The route-level breakdown shows what the overall percentage hides. On the MSY/PNS and PNS/MSY routes the two airlines have identical delay percentages (20% and 26.7%). On the MSY/DFW route RegionEx’s delay percentage is lower than MDA’s (28.9% vs. 30%), and the same holds on the DFW/MSY route (25.6% vs. 26.7%). RegionEx therefore does not have a higher delay percentage than MDA on any route.

The COO’s conclusion is therefore not correct. The overall delay percentages (26.2% for RegionEx and 25.8% for MDA) differ only because of route mix: RegionEx flies 90 flights on each DFW route, where delay percentages are higher on average, against 30 on each PNS route, while MDA flies 30 flights on every route. Weighting RegionEx’s total toward the higher-delay routes raises its overall percentage even though it matches or beats MDA route by route. RegionEx’s performance on this metric is not worse than MDA’s.
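The route-mix effect can be checked with a little arithmetic. The sketch below uses base R so that it runs without the data file. It takes the flight counts and percentages from the airline-by-route table; the delayed-flight counts are not in the original output and are reconstructed from the rounded percentages, so treat them as an assumption. It then recomputes each airline's overall percentage and asks what RegionEx's percentage would be if it flew MDA's route mix.

```r
# Route-level counts from the airline-by-route table above.
# Delayed-flight counts are NOT in the original output; they are
# reconstructed here as round(percent_delay / 100 * total_flights).
routes <- data.frame(
  airline = rep(c("MDA", "RegionEx"), each = 4),
  route_code = rep(c("DFW/MSY", "MSY/DFW", "MSY/PNS", "PNS/MSY"), times = 2),
  total_flights = c(30, 30, 30, 30, 90, 90, 30, 30),
  percent_delay = c(26.7, 30, 20, 26.7, 25.6, 28.9, 20, 26.7)
)
routes$delayed <- round(routes$percent_delay / 100 * routes$total_flights)
routes$rate <- routes$delayed / routes$total_flights

# Overall percentage = delayed flights / all flights, per airline
totals <- aggregate(cbind(delayed, total_flights) ~ airline, data = routes, FUN = sum)
overall <- setNames(100 * totals$delayed / totals$total_flights, totals$airline)

# Standardize: apply RegionEx's route-level rates to MDA's route mix
mda <- routes[routes$airline == "MDA", ]
rex <- routes[routes$airline == "RegionEx", ]
rex_on_mda_mix <- 100 * sum(rex$rate * mda$total_flights) / sum(mda$total_flights)

print(round(overall, 2))
print(round(rex_on_mda_mix, 2))
```

Under these reconstructed counts, the overall rates come out near 25.8% for MDA and 26.2% for RegionEx, in line with the summary table, while RegionEx's rate on MDA's route mix comes out near 25.3%, below MDA's.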

Question4

Compare the scheduled flight durations for the two airlines on each of their four routes. Also compare the actual flight durations for the two airlines. What do you notice? If the two airlines had the same scheduled duration, what impact would this have on their delay records?

flight %>%
  group_by(airline, route_code) %>% # Group the data by airline and route code
  summarize(scheduled_flight_duration= mean(scheduled_flight_length),# Calculate the average scheduled flight duration for each group
           actual_flight_duration= round(mean(actual_flight_length),2))# Calculate the average actual flight duration and round it to 2 decimal.
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
## # A tibble: 8 × 4
## # Groups:   airline [2]
##   airline  route_code scheduled_flight_duration actual_flight_duration
##   <chr>    <chr>                          <dbl>                  <dbl>
## 1 MDA      DFW/MSY                          100                  113. 
## 2 MDA      MSY/DFW                          100                  114. 
## 3 MDA      MSY/PNS                           75                   85.2
## 4 MDA      PNS/MSY                           75                   81.4
## 5 RegionEx DFW/MSY                           90                  106. 
## 6 RegionEx MSY/DFW                           90                  108. 
## 7 RegionEx MSY/PNS                           70                   81  
## 8 RegionEx PNS/MSY                           70                   80.4

Based on the results, the scheduled flight durations for MDA and RegionEx differ across the same routes. For example, the DFW/MSY route is scheduled for 100 minutes by MDA but only 90 minutes by RegionEx. Similarly, MDA schedules 75 minutes for the PNS/MSY route, while RegionEx schedules 70 minutes.

When looking at the actual flight durations, MDA’s average for the DFW/MSY route is 113.47 minutes, compared to RegionEx’s 106.42 minutes. On the MSY/PNS route, MDA’s average actual duration is 85.20 minutes, while RegionEx’s is 81.00 minutes. MDA’s average actual time is longer on all four routes, so MDA’s flights are not faster than RegionEx’s. Its longer scheduled durations therefore look like buffer time that improves its on-time performance.

If both airlines used the same scheduled durations, the comparison would shift in RegionEx’s favor. Measured against RegionEx’s shorter schedules, MDA’s on-time performance would likely drop, since its longer schedules currently absorb part of its delays. Measured against MDA’s longer schedules, RegionEx’s delay record would likely improve, since fewer of its flights would arrive after the scheduled time.
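As a rough check on this argument, the sketch below compares average overrun (average actual duration minus scheduled duration) per route, first against each airline's own schedule and then with RegionEx measured against MDA's schedule. It uses the route averages as printed in the table above, which are rounded by the tibble printout, and base R so that it runs without the data file. Overrun is only a proxy introduced here for illustration; it is not the `delay` column in the dataset, and a proper answer would recompute flight-level delays.

```r
# Route averages as printed in the table above (rounded by tibble printing),
# in the order DFW/MSY, MSY/DFW, MSY/PNS, PNS/MSY
route_code <- c("DFW/MSY", "MSY/DFW", "MSY/PNS", "PNS/MSY")
mda_sched <- c(100, 100, 75, 75)
mda_actual <- c(113, 114, 85.2, 81.4)
rex_sched <- c(90, 90, 70, 70)
rex_actual <- c(106, 108, 81, 80.4)

# Average overrun in minutes: actual duration minus scheduled duration
comparison <- data.frame(
  route_code,
  mda_overrun = mda_actual - mda_sched,
  rex_overrun_own_sched = rex_actual - rex_sched,
  rex_overrun_mda_sched = rex_actual - mda_sched  # hypothetical common schedule
)
print(comparison)
```

Against its own schedule RegionEx overruns by more than MDA on every route, but against MDA's schedule it overruns by less on every route, which is consistent with the schedule difference driving much of the gap.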

Question5

Does the data support the claim that the on‐time performance of RegionEx is worse than that of MDA? Write a paragraph in which you argue a position. In your answer, please incorporate quantitative evidence from the earlier questions.

The data does not clearly support the claim that RegionEx’s on-time performance is worse than MDA’s. The overall delay rates show RegionEx slightly higher (26.2% vs. 25.8%), but this overlooks differences in route mix. In the route-level breakdown, RegionEx’s delay percentage is equal to or lower than MDA’s on every route. MDA also schedules longer durations: 100 minutes on DFW/MSY and MSY/DFW compared to RegionEx’s 90 minutes, and 75 minutes on MSY/PNS and PNS/MSY compared to RegionEx’s 70 minutes. Despite these shorter schedules, RegionEx’s actual flight durations are shorter than MDA’s, for example 106.42 minutes vs. 113.47 minutes on DFW/MSY. RegionEx’s mean delay is higher (15.7 vs. 10.9 minutes), but that figure is pulled up by a few extreme delays; its median delay is lower (9 vs. 13 minutes). This suggests that MDA’s longer scheduled times may make its on-time performance seem better, rather than indicating a real difference in operational efficiency.