Lightly comment your code and use pipes for readability.

Comment briefly on each of the questions, as directed. Only the final question requires a lengthier response.

data <- read.csv("flight_delay_clean.csv")
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Q1

# Filter for RegionEx and MDA flights
regionex_flights <- subset(data, airline == "RegionEx")
mda_flights <- subset(data, airline == "MDA")

#For RegionEx:
# Mean
mean_regionex <- mean(regionex_flights$delay, na.rm = TRUE)

# Median
median_regionex <- median(regionex_flights$delay, na.rm = TRUE)

# 90th Percentile
percentile_90_regionex <- quantile(regionex_flights$delay, 0.90, na.rm = TRUE)

# Standard Deviation
sd_regionex <- sd(regionex_flights$delay, na.rm = TRUE)

#For MDA:
# Mean
mean_mda <- mean(mda_flights$delay, na.rm = TRUE)

# Median
median_mda <- median(mda_flights$delay, na.rm = TRUE)

# 90th Percentile
percentile_90_mda <- quantile(mda_flights$delay, 0.90, na.rm = TRUE)

# Standard Deviation
sd_mda <- sd(mda_flights$delay, na.rm = TRUE)


#Answer: I believe the median gives a more consistent comparison of the two airlines' arrival delays. The median delay is 9 minutes for RegionEx and 13 minutes for MDA, so the typical RegionEx flight arrived less late than the typical MDA flight, even though MDA's mean delay is lower (10.9 minutes versus 15.66 minutes). RegionEx's mean is well above its median, which indicates that a few extreme delays are pulling the mean up. This disparity suggests that although some RegionEx flights experience substantial delays, the typical passenger experience, as indicated by the median, is much better.
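As a sketch of how the four statistics could be computed in one piped step, the block below applies `group_by()` and `summarize()` to a small made-up data frame (`toy`, with hypothetical delay values, not the case data). The toy values are chosen so that a single extreme delay pulls one airline's mean above its median, the same pattern described for RegionEx.

```r
library(dplyr)

# Hypothetical toy data, NOT the case data: six delays (minutes) per airline
toy <- data.frame(
  airline = rep(c("MDA", "RegionEx"), each = 6),
  delay   = c(8, 10, 12, 13, 14, 15,   # tightly clustered delays
              2, 4, 6, 8, 10, 90)      # mostly short, one extreme delay
)

# One row per airline with mean, median, 90th percentile, and standard deviation
delay_summary <- toy %>%
  group_by(airline) %>%
  summarize(
    mean_delay   = mean(delay, na.rm = TRUE),
    median_delay = median(delay, na.rm = TRUE),
    p90_delay    = quantile(delay, 0.90, na.rm = TRUE, names = FALSE),
    sd_delay     = sd(delay, na.rm = TRUE)
  )

print(delay_summary)
```

In the toy data, RegionEx has the higher mean (20 versus 12) but the lower median (7 versus 12.5). Replacing `toy` with `data` should give the statistics for the case data, assuming the `airline` and `delay` columns used above.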

Q2

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
# Combine the data for both airlines
combined_data <- rbind(
  data.frame(airline = "RegionEx", delay = regionex_flights$delay),
  data.frame(airline = "MDA", delay = mda_flights$delay)
)
# Histogram
ggplot(combined_data, aes(x = delay)) +
  geom_histogram(binwidth = 5, fill = "lightblue", color = "black") +
  facet_wrap(~ airline) +
  labs(title = "Histogram of Arrival Delays", x = "Delay in minutes", y = "Number of flights") +
  theme_minimal()

#Answer: Peaks: Both airlines have their highest number of flights in the 0 to 10 minute delay range. Distribution: MDA's flight counts fall off as delay times increase, while RegionEx's delays are spread more widely, extending up to 70 minutes. Long delays: MDA shows few flights with long delays of more than 10 minutes, whereas RegionEx shows one or two flights with delays of up to 70 minutes.

#These patterns suggest that RegionEx does not consistently meet the delay standard that MDA offers its customers.
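To check a visual reading of the histograms against numbers, the bin counts can be tabulated directly. This is a minimal sketch on hypothetical values (the `toy` data frame is made up). The 5-minute bins match the histogram's `binwidth`, although ggplot2 may place its bin edges differently, and real data with early arrivals (negative delays) would need wider `breaks`.

```r
# Hypothetical delays in minutes, NOT the case data
toy <- data.frame(
  airline = rep(c("MDA", "RegionEx"), each = 5),
  delay   = c(3, 7, 9, 6, 12,
              1, 4, 8, 35, 68)
)

# 5-minute bins from 0 to 70; right = FALSE makes each bin closed on the left
breaks  <- seq(0, 70, by = 5)
toy$bin <- cut(toy$delay, breaks = breaks, right = FALSE, include.lowest = TRUE)

# Flights per bin, one row per airline
bin_counts <- table(toy$airline, toy$bin)
print(bin_counts)

# Share of each airline's flights delayed by more than 10 minutes
share_over_10 <- tapply(toy$delay > 10, toy$airline, mean)
print(share_over_10)
```

Applied to `combined_data`, the same two summaries would show how many flights fall in each bin and what fraction of each airline's flights exceed a chosen delay threshold.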

Q3

# Create Summary Table of Percentage of Delayed Flights by Airline
data %>% 
  group_by(airline) %>% 
  summarize(n = n(),  # Total number of flights
            percent_delay = (mean(delay_indicator) * 100) %>% round(1))  # Percent of delayed flights
## # A tibble: 2 × 3
##   airline      n percent_delay
##   <chr>    <int>         <dbl>
## 1 MDA        120          25.8
## 2 RegionEx   240          26.2
# Create Summary Table by Airline and Route
data %>% 
  group_by(airline, route_code) %>% 
  summarize(n = n(),
            percent_delay = (mean(delay_indicator) * 100) %>% round(1)) 
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
## # A tibble: 8 × 4
## # Groups:   airline [2]
##   airline  route_code     n percent_delay
##   <chr>    <chr>      <int>         <dbl>
## 1 MDA      DFW/MSY       30          26.7
## 2 MDA      MSY/DFW       30          30  
## 3 MDA      MSY/PNS       30          20  
## 4 MDA      PNS/MSY       30          26.7
## 5 RegionEx DFW/MSY       90          25.6
## 6 RegionEx MSY/DFW       90          28.9
## 7 RegionEx MSY/PNS       30          20  
## 8 RegionEx PNS/MSY       30          26.7
#Answer: The two tables give seemingly conflicting information. Overall, RegionEx has a slightly higher percentage of delayed flights than MDA (26.2% versus 25.8%). Broken down by route, however, RegionEx has a lower percentage than MDA on DFW/MSY (25.6% versus 26.7%) and MSY/DFW (28.9% versus 30%), and the same percentage on MSY/PNS (20%) and PNS/MSY (26.7%).

#For these reasons, the COO's claim that RegionEx has a higher percentage of delayed flights is valid only for the overall figure. It does not hold on any individual route, where RegionEx performs as well as or better than MDA. The overall gap arises because RegionEx flies three times as many flights (90 versus 30) on the DFW routes, which on average have higher delay rates than the PNS routes.
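The apparent conflict between the two tables comes down to route mix, which a short calculation can illustrate. The sketch below copies the flight counts and route-level percentages from the second table and recomputes each airline's overall rate as a flight-weighted average. Because the route percentages are already rounded, the results agree with the first table only to within about 0.1 percentage point.

```r
# Counts and delay percentages copied from the airline-by-route table above
routes <- data.frame(
  airline       = rep(c("MDA", "RegionEx"), each = 4),
  route_code    = rep(c("DFW/MSY", "MSY/DFW", "MSY/PNS", "PNS/MSY"), times = 2),
  n             = c(30, 30, 30, 30, 90, 90, 30, 30),
  percent_delay = c(26.7, 30, 20, 26.7, 25.6, 28.9, 20, 26.7)
)

by_airline <- split(routes, routes$airline)

# Overall rate = average of the route rates, weighted by number of flights
overall <- sapply(by_airline, function(d) weighted.mean(d$percent_delay, d$n))
print(overall)

# Share of each airline's flights that are on the two DFW routes
dfw_share <- sapply(by_airline, function(d) {
  sum(d$n[grepl("DFW", d$route_code)]) / sum(d$n)
})
print(dfw_share)
```

RegionEx puts 75% of its flights on the DFW routes, compared with 50% for MDA, so the higher-delay routes carry more weight in its overall average even though its rate on each route is no worse.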

Q4

# Group by airline and route, then summarize scheduled and actual flight durations
flight_duration_summary <- data %>%
  group_by(airline, route_code) %>%
  summarize(
    mean_scheduled_duration = mean(scheduled_flight_length, na.rm = TRUE),
    mean_actual_duration = mean(actual_flight_length, na.rm = TRUE)
  )
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
flight_duration_summary
## # A tibble: 8 × 4
## # Groups:   airline [2]
##   airline  route_code mean_scheduled_duration mean_actual_duration
##   <chr>    <chr>                        <dbl>                <dbl>
## 1 MDA      DFW/MSY                        100                113. 
## 2 MDA      MSY/DFW                        100                114. 
## 3 MDA      MSY/PNS                         75                 85.2
## 4 MDA      PNS/MSY                         75                 81.4
## 5 RegionEx DFW/MSY                         90                106. 
## 6 RegionEx MSY/DFW                         90                108. 
## 7 RegionEx MSY/PNS                         70                 81  
## 8 RegionEx PNS/MSY                         70                 80.4
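#Answer: Interestingly, the two airlines schedule different flight times for the same routes. MDA schedules 100 minutes on the DFW routes and 75 minutes on the PNS routes, compared with 90 and 70 minutes for RegionEx, and MDA's mean actual flight times are also longer on every route. Because MDA's scheduled times are longer, part of what would count as a delay under RegionEx's schedule is absorbed by MDA's schedule, which can mask MDA's delays.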
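To make the scheduling effect concrete, the sketch below rebuilds the duration table from the printed output and compares the two airlines route by route. The values are the rounded means shown above, so the differences are approximate, and the column names `scheduled`, `actual`, and `overrun` are introduced here only for illustration.

```r
# Rounded mean durations (minutes) copied from the printed summary above
durations <- data.frame(
  airline    = rep(c("MDA", "RegionEx"), each = 4),
  route_code = rep(c("DFW/MSY", "MSY/DFW", "MSY/PNS", "PNS/MSY"), times = 2),
  scheduled  = c(100, 100, 75, 75, 90, 90, 70, 70),
  actual     = c(113, 114, 85.2, 81.4, 106, 108, 81, 80.4)
)

# Minutes by which the mean actual duration exceeds the scheduled duration
durations$overrun <- durations$actual - durations$scheduled

mda <- durations[durations$airline == "MDA", ]
rex <- durations[durations$airline == "RegionEx", ]

# Route-by-route gaps (rows are in the same route order for both airlines)
comparison <- data.frame(
  route_code    = mda$route_code,
  scheduled_gap = mda$scheduled - rex$scheduled,  # extra minutes MDA schedules
  actual_gap    = mda$actual - rex$actual,        # extra minutes MDA actually takes
  overrun_gap   = rex$overrun - mda$overrun       # extra overrun for RegionEx
)
print(comparison)
```

On these figures RegionEx's flights are shorter than MDA's on every route, yet RegionEx overruns its tighter schedule by more, which is consistent with the schedule padding described in the answer.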

Q5

#Answer: The claim that RegionEx's performance is worse than MDA's is only partially supported by the quantitative evidence. RegionEx does have a higher overall percentage of delayed flights (26.2% versus 25.8% for MDA) and a higher mean delay (about 15.7 minutes versus 10.9 minutes).

#However, a more even picture emerges from the route-level data. On the DFW/MSY and MSY/DFW routes, RegionEx has a lower percentage of delayed flights than MDA, and on the MSY/PNS and PNS/MSY routes the two airlines have equal percentages. This suggests that RegionEx's worse overall figure reflects its heavier concentration of flights on routes where delays are more frequent, since on every individual route it does as well as or better than MDA.

#So, while RegionEx appears to have worse delays overall, the route-level data show it performing as well as or better than MDA. I would recommend basing decisions about changes to the service contract with RegionEx on the route-level data rather than on the overall figures.

rmarkdown::render("Flight Delay Case.Rmd")