Lightly comment your code and use pipes for readability.

Comment briefly on each of the questions, as directed. Only the final question requires a lengthier response.

Q1

Compute the mean, median, 90th percentile, and standard deviation of arrival delay minutes for RegionEx flights. Do the same for MDA flights.

setwd("/Users/txharris/Desktop/IS 6489")

# Import the CSV file and assign it to the variable "flight_delay"
flight_delay <- read.csv("flight_delay_clean.csv")

# Convert relevant columns to numeric
flight_delay$delay <- as.numeric(flight_delay$delay)
flight_delay$actual_flight_length <- as.numeric(flight_delay$actual_flight_length)

# Check the structure of the imported dataset
str(flight_delay)
## 'data.frame':    360 obs. of  13 variables:
##  $ airline                : chr  "RegionEx" "RegionEx" "RegionEx" "RegionEx" ...
##  $ departure_date         : chr  "2008-09-01" "2008-09-01" "2008-09-01" "2008-09-02" ...
##  $ origin                 : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ destination            : chr  "MSY" "MSY" "MSY" "MSY" ...
##  $ route_code             : chr  "DFW/MSY" "DFW/MSY" "DFW/MSY" "DFW/MSY" ...
##  $ scheduled_departure    : chr  "09:10:00" "13:10:00" "18:10:00" "09:10:00" ...
##  $ scheduled_arrival      : chr  "10:40:00" "14:40:00" "19:40:00" "10:40:00" ...
##  $ actual_arrival         : chr  "11:00:00" "15:00:00" "19:58:00" "10:50:00" ...
##  $ scheduled_flight_length: int  90 90 90 90 90 90 90 90 90 90 ...
##  $ actual_flight_length   : num  110 110 108 100 101 100 99 99 99 100 ...
##  $ delay                  : num  20 20 18 10 11 10 9 9 9 10 ...
##  $ delay_indicator        : int  1 1 1 0 0 0 0 0 0 0 ...
##  $ day_of_week            : int  2 2 2 3 3 3 4 4 4 5 ...
# Compute statistics for RegionEx flights
regionex_stats <- flight_delay[flight_delay$airline == "RegionEx", ]

# Calculate mean, median, 90th percentile, and standard deviation
regionex_mean <- mean(regionex_stats$delay, na.rm = TRUE)
regionex_median <- median(regionex_stats$delay, na.rm = TRUE)
regionex_percentile90 <- quantile(regionex_stats$delay, probs = 0.9, na.rm = TRUE)
regionex_std <- sd(regionex_stats$delay, na.rm = TRUE)

# Compute statistics for MDA flights
mda_stats <- flight_delay[flight_delay$airline == "MDA", ]

# Calculate mean, median, 90th percentile, and standard deviation
mda_mean <- mean(mda_stats$delay, na.rm = TRUE)
mda_median <- median(mda_stats$delay, na.rm = TRUE)
mda_percentile90 <- quantile(mda_stats$delay, probs = 0.9, na.rm = TRUE)
mda_std <- sd(mda_stats$delay, na.rm = TRUE)

# Print the computed statistics
cat("RegionEx Flight Statistics:\n")
## RegionEx Flight Statistics:
cat("Mean Arrival Delay:", regionex_mean, "\n")
## Mean Arrival Delay: 15.6625
cat("Median Arrival Delay:", regionex_median, "\n")
## Median Arrival Delay: 9
cat("90th Percentile Arrival Delay:", regionex_percentile90, "\n")
## 90th Percentile Arrival Delay: 21
cat("Standard Deviation of Arrival Delay:", regionex_std, "\n\n")
## Standard Deviation of Arrival Delay: 27.65036
cat("MDA Flight Statistics:\n")
## MDA Flight Statistics:
cat("Mean Arrival Delay:", mda_mean, "\n")
## Mean Arrival Delay: 10.9
cat("Median Arrival Delay:", mda_median, "\n")
## Median Arrival Delay: 13
cat("90th Percentile Arrival Delay:", mda_percentile90, "\n")
## 90th Percentile Arrival Delay: 16.1
cat("Standard Deviation of Arrival Delay:", mda_std, "\n")
## Standard Deviation of Arrival Delay: 6.338359
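
The same four statistics could also be computed for both airlines in one grouped pipe, which is closer to the "use pipes" instruction at the top. The sketch below is only an illustration: `toy_delays` and its values are made up, and only the column names `airline` and `delay` mirror the real data set.

```r
library(dplyr)

# Hypothetical data; values are NOT taken from flight_delay_clean.csv
toy_delays <- data.frame(
  airline = rep(c("MDA", "RegionEx"), each = 5),
  delay = c(10, 12, 13, 14, 16, 5, 8, 9, 11, 60)
)

# One row per airline with mean, median, 90th percentile, and standard deviation
toy_stats <- toy_delays %>%
  group_by(airline) %>%
  summarize(
    mean_delay = mean(delay, na.rm = TRUE),
    median_delay = median(delay, na.rm = TRUE),
    p90_delay = quantile(delay, probs = 0.9, na.rm = TRUE, names = FALSE),
    sd_delay = sd(delay, na.rm = TRUE)
  )

print(toy_stats)
```

Replacing `toy_delays` with `flight_delay` should give the same numbers as the `cat()` output above, assuming the `delay` column has already been converted to numeric.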

Contractual obligations aside, which measure of central tendency would be most appropriate for comparing airline performance?

## Contractual obligations aside, I believe the median arrival delay is the most appropriate measure of central tendency for comparing airline performance in this setting. The median is far less affected by extreme outliers than the mean, so it gives a clearer picture of a typical flight. RegionEx illustrates this: its mean delay (15.66 minutes) sits well above its median (9 minutes), and its large standard deviation (27.65) suggests that a small number of very long delays are pulling the mean upward.
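
A small illustration of why the median resists outliers better than the mean. The delay values here are hypothetical and chosen only to make the effect visible; they do not come from the flight data.

```r
# Hypothetical delays in minutes (not taken from flight_delay_clean.csv)
typical <- c(8, 9, 9, 10, 11)
with_outlier <- c(typical, 150)  # same flights plus one extreme delay

mean(typical)         # 9.4
median(typical)       # 9
mean(with_outlier)    # 32.83333
median(with_outlier)  # 9.5
```

A single 150-minute delay more than triples the mean but moves the median by only half a minute.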

Q2

Inspect the distribution of RegionEx’s arrival delays by constructing a histogram of the number of arrival delay minutes of RegionEx’s flights. Do the same for MDA’s flights. Hint: use facet_wrap().

# Assuming you have the "ggplot2" library installed
library(ggplot2)

# Create a histogram of arrival delay minutes for RegionEx and MDA flights
ggplot(flight_delay, aes(x = delay)) +
  geom_histogram(binwidth = 5, fill = "blue", alpha = 0.7) +  # Adjust binwidth as needed
  facet_wrap(~ airline, ncol = 2) +  # Separate histograms for each airline
  labs(x = "Arrival Delay Minutes", y = "Frequency") +  # Labels for axes
  ggtitle("Distribution of Arrival Delays for RegionEx and MDA Flights")  # Title for the plot

How do these two distributions compare?

## Comparing the two distributions, RegionEx clearly has more extreme outliers, along with a high frequency of short delays (below roughly 20 minutes). RegionEx also has more flights that make up time and arrive before their scheduled arrival time. MDA has fewer delayed flights in absolute terms, but its typical delay is longer than RegionEx's (comparing median arrival delay: 13 versus 9 minutes). In short, MDA's delays are less numerous but tend to be longer, whereas RegionEx has frequent short delays, more early arrivals, and a few extreme outliers.

Q3

So far you have considered airline performance in terms of minutes delayed. However, the performance metrics, as noted in the case description, also include the percentage of delayed flights. Let’s verify that MDA’s COO is correct: does RegionEx have a higher percentage of delayed flights?

  ## Yes, RegionEx has a slightly higher percentage of delayed flights (26.2% versus 25.8% for MDA). However, RegionEx also has twice as many flights as MDA in the data set (240 versus 120). With such a significantly greater N, I would expect it to have a higher percentage of delayed flights.

Note that because delay_indicator is numeric (a binary 0/1 variable) calculating the mean of the vector returns the proportion of 1s, which, multiplied by 100, is equivalent to the percentage of delayed flights.
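
A minimal sketch of that point, using a made-up indicator vector rather than the real `delay_indicator` column:

```r
# Hypothetical indicator vector: 1 = delayed, 0 = on time
delayed <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

prop_delayed <- mean(delayed)      # share of 1s: 3 out of 10 flights
pct_delayed <- prop_delayed * 100  # convert the proportion to a percentage

pct_delayed  # prints 30

# Same result from counting the 1s explicitly
sum(delayed == 1) / length(delayed) * 100
```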

Write your own code to create a table summarizing the percentage of delayed flights by airline and route.

Notice that these tables (percent delayed by airline vs. percent delayed by airline and route) contain conflicting information. How should you answer the question of whether RegionEx has a higher percentage of delayed flights? Is the COO correct? And, if not, why not?

## It is all relative. In the airline-by-route table, MDA's percentage of delayed flights is equal to or higher than RegionEx's on every route, but those percentages are based on fewer total flights. When the routes are combined, as in the first table, MDA has the lower percentage of delayed flights. I believe the COO is correct in his claim, but I would again argue that these two airlines are difficult to compare head to head. MDA handles so many fewer flights than RegionEx that comparing their delay percentages directly does not seem meaningful, since one can infer that when N doubles, the percentage of delayed flights will increase as well.
# Load the dplyr package
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Create a summary table of percent delayed by airline.
flight_delay %>% 
  group_by(airline) %>% 
  summarize(n = n(),
            percent_delay = (mean(delay_indicator) * 100) %>% round(1)) 
## # A tibble: 2 × 3
##   airline      n percent_delay
##   <chr>    <int>         <dbl>
## 1 MDA        120          25.8
## 2 RegionEx   240          26.2
# Create a summary table of percentage of delayed flights by airline and route
delay_summary <- flight_delay %>%
  group_by(airline, route_code) %>%
  summarize(
    total_flights = n(),
    percent_delayed = (mean(delay_indicator) * 100) %>% round(2))
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
# Print the summary table
print(delay_summary)
## # A tibble: 8 × 4
## # Groups:   airline [2]
##   airline  route_code total_flights percent_delayed
##   <chr>    <chr>              <int>           <dbl>
## 1 MDA      DFW/MSY               30            26.7
## 2 MDA      MSY/DFW               30            30  
## 3 MDA      MSY/PNS               30            20  
## 4 MDA      PNS/MSY               30            26.7
## 5 RegionEx DFW/MSY               90            25.6
## 6 RegionEx MSY/DFW               90            28.9
## 7 RegionEx MSY/PNS               30            20  
## 8 RegionEx PNS/MSY               30            26.7
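
One way to see how the two tables relate is to rebuild each airline's overall percentage as a flight-weighted average of its route percentages. The sketch below copies the per-route values from the table above into a data frame (the name `routes` is just for illustration). Because the route percentages are already rounded, the results only approximate the airline-level figures.

```r
# Per-route values copied from the airline-by-route table above
routes <- data.frame(
  airline = rep(c("MDA", "RegionEx"), each = 4),
  route_code = rep(c("DFW/MSY", "MSY/DFW", "MSY/PNS", "PNS/MSY"), times = 2),
  total_flights = c(30, 30, 30, 30, 90, 90, 30, 30),
  percent_delayed = c(26.7, 30, 20, 26.7, 25.6, 28.9, 20, 26.7)
)

# Weight each route's percentage by the number of flights on that route
overall <- sapply(split(routes, routes$airline), function(d) {
  weighted.mean(d$percent_delayed, d$total_flights)
})

overall  # both values land within 0.1 of the 25.8 and 26.2 reported by airline
```

RegionEx flies 180 of its 240 flights on the two DFW routes, which have the highest delay percentages for both airlines, while MDA spreads its 120 flights evenly across all four routes. That difference in route mix is what lets RegionEx match or beat MDA on every route and still end up with the higher overall percentage.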

Q4

Compare the scheduled flight durations for the two airlines on each of their four routes. Also compare the actual flight durations for the two airlines. What do you notice? If the two airlines had the same scheduled duration, what impact would this have on their delay records?

## The primary thing I notice is the difference in scheduled flight times between the two airlines. MDA schedules longer flight times than RegionEx on every route: 10 minutes longer on the DFW/MSY routes (100 versus 90 minutes) and 5 minutes longer on the MSY/PNS routes (75 versus 70 minutes). The other significant point is that RegionEx, despite scheduling shorter flights and being "delayed" more often, still has shorter actual flight durations on average than MDA on every route. This shows that the original analysis does not capture the full picture of the differences between the two airlines. If MDA used RegionEx's scheduled flight durations, its percentage of delayed flights would be much higher and its delays longer. Conversely, if RegionEx used MDA's scheduled durations, it would have a much smaller percentage of delayed flights, and its delay distribution would shift closer to 0.
# Load the dplyr package if it's not already loaded
library(dplyr)

# Create a summary table for scheduled flight durations by airline and route
scheduled_summary <- flight_delay %>%
  group_by(airline, route_code) %>%
  summarize(
    mean_scheduled_duration = mean(scheduled_flight_length),
    median_scheduled_duration = median(scheduled_flight_length)
  )
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
# Print the summary table for scheduled flight durations
print(scheduled_summary)
## # A tibble: 8 × 4
## # Groups:   airline [2]
##   airline  route_code mean_scheduled_duration median_scheduled_duration
##   <chr>    <chr>                        <dbl>                     <dbl>
## 1 MDA      DFW/MSY                        100                       100
## 2 MDA      MSY/DFW                        100                       100
## 3 MDA      MSY/PNS                         75                        75
## 4 MDA      PNS/MSY                         75                        75
## 5 RegionEx DFW/MSY                         90                        90
## 6 RegionEx MSY/DFW                         90                        90
## 7 RegionEx MSY/PNS                         70                        70
## 8 RegionEx PNS/MSY                         70                        70
# Create a summary table for both scheduled and actual flight durations by airline and route
duration_summary <- flight_delay %>%
  group_by(airline, route_code) %>%
  summarize(
    mean_scheduled_duration = mean(scheduled_flight_length),
    mean_actual_duration = mean(actual_flight_length),
    median_scheduled_duration = median(scheduled_flight_length),
    median_actual_duration = median(actual_flight_length)
  )
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
# Print the summary table for both scheduled and actual flight durations
print(duration_summary)
## # A tibble: 8 × 6
## # Groups:   airline [2]
##   airline  route_code mean_scheduled_duration mean_actual_duration
##   <chr>    <chr>                        <dbl>                <dbl>
## 1 MDA      DFW/MSY                        100                113. 
## 2 MDA      MSY/DFW                        100                114. 
## 3 MDA      MSY/PNS                         75                 85.2
## 4 MDA      PNS/MSY                         75                 81.4
## 5 RegionEx DFW/MSY                         90                106. 
## 6 RegionEx MSY/DFW                         90                108. 
## 7 RegionEx MSY/PNS                         70                 81  
## 8 RegionEx PNS/MSY                         70                 80.4
## # ℹ 2 more variables: median_scheduled_duration <dbl>,
## #   median_actual_duration <dbl>
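
The "same scheduled duration" thought experiment could be checked directly by recomputing delays against a common schedule. The sketch below is only an outline under stated assumptions: `toy` is a made-up data frame whose column names mirror `flight_delay`, the common schedule is taken to be the longest scheduled duration on each route, and `late_cutoff` is an assumed 15-minute threshold, since this document does not state how `delay_indicator` was defined.

```r
library(dplyr)

# Hypothetical rows; column names mirror flight_delay, values are made up
toy <- data.frame(
  airline = c("MDA", "MDA", "RegionEx", "RegionEx"),
  route_code = "DFW/MSY",
  scheduled_flight_length = c(100, 100, 90, 90),
  actual_flight_length = c(112, 118, 106, 109)
)

late_cutoff <- 15  # ASSUMED threshold in minutes; not stated in this document

comparison <- toy %>%
  group_by(route_code) %>%
  # Give every airline the longest scheduled duration on the route
  mutate(common_schedule = max(scheduled_flight_length)) %>%
  ungroup() %>%
  mutate(
    original_delay = actual_flight_length - scheduled_flight_length,
    adjusted_delay = actual_flight_length - common_schedule
  ) %>%
  group_by(airline) %>%
  summarize(
    percent_delayed_original = mean(original_delay >= late_cutoff) * 100,
    percent_delayed_adjusted = mean(adjusted_delay >= late_cutoff) * 100
  )

print(comparison)
```

In this toy example RegionEx goes from 100% delayed under its own schedule to 0% under the common one, while MDA stays at 50%. Running the same pipe on `flight_delay` would show whether the real data behave similarly.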

Q5

Does the data support the claim that the on‐time performance of RegionEx is worse than that of MDA? Write a paragraph in which you argue a position. In your answer, please incorporate quantitative evidence from the earlier questions.

## No, the data does not support the claim that the on-time performance of RegionEx is worse than that of MDA. As noted in my earlier responses, RegionEx does have a slightly higher overall percentage of delayed flights (26.2% versus 25.8%). Grouped by route, however, RegionEx's percentage of delayed flights is lower than or equal to MDA's on every individual route. The data also show that RegionEx records more delays largely because it schedules shorter flight durations than MDA to begin with (90 versus 100 minutes on the DFW/MSY routes and 70 versus 75 minutes on the MSY/PNS routes). This shorter scheduling works against RegionEx statistically when individuals take these statistics at face value. Digging deeper, on every route in the data set RegionEx has a shorter mean actual flight time than MDA; it is counted as delayed more often because its overzealous scheduling assumes flights can be completed faster than the times actually reported. RegionEx has a higher percentage of delayed flights, but on every route its average flight time is shorter while it operates twice as many flights, and thus I would argue it does not have worse on-time performance than MDA.