Lightly comment your code and use pipes for readability.
Comment briefly on each of the questions, as directed. Only the final question requires a lengthier response.
setwd("/Users/txharris/Desktop/IS 6489")
# Import the CSV file and assign it to the variable "flight_delay"
flight_delay <- read.csv("flight_delay_clean.csv")
# Convert relevant columns to numeric
flight_delay$delay <- as.numeric(flight_delay$delay)
flight_delay$actual_flight_length <- as.numeric(flight_delay$actual_flight_length)
# Check the structure of the imported dataset
str(flight_delay)
## 'data.frame': 360 obs. of 13 variables:
## $ airline : chr "RegionEx" "RegionEx" "RegionEx" "RegionEx" ...
## $ departure_date : chr "2008-09-01" "2008-09-01" "2008-09-01" "2008-09-02" ...
## $ origin : chr "DFW" "DFW" "DFW" "DFW" ...
## $ destination : chr "MSY" "MSY" "MSY" "MSY" ...
## $ route_code : chr "DFW/MSY" "DFW/MSY" "DFW/MSY" "DFW/MSY" ...
## $ scheduled_departure : chr "09:10:00" "13:10:00" "18:10:00" "09:10:00" ...
## $ scheduled_arrival : chr "10:40:00" "14:40:00" "19:40:00" "10:40:00" ...
## $ actual_arrival : chr "11:00:00" "15:00:00" "19:58:00" "10:50:00" ...
## $ scheduled_flight_length: int 90 90 90 90 90 90 90 90 90 90 ...
## $ actual_flight_length : num 110 110 108 100 101 100 99 99 99 100 ...
## $ delay : num 20 20 18 10 11 10 9 9 9 10 ...
## $ delay_indicator : int 1 1 1 0 0 0 0 0 0 0 ...
## $ day_of_week : int 2 2 2 3 3 3 4 4 4 5 ...
# Compute statistics for RegionEx flights
regionex_stats <- flight_delay[flight_delay$airline == "RegionEx", ]
# Calculate mean, median, 90th percentile, and standard deviation
regionex_mean <- mean(regionex_stats$delay, na.rm = TRUE)
regionex_median <- median(regionex_stats$delay, na.rm = TRUE)
regionex_percentile90 <- quantile(regionex_stats$delay, probs = 0.9, na.rm = TRUE)
regionex_std <- sd(regionex_stats$delay, na.rm = TRUE)
# Compute statistics for MDA flights
mda_stats <- flight_delay[flight_delay$airline == "MDA", ]
# Calculate mean, median, 90th percentile, and standard deviation
mda_mean <- mean(mda_stats$delay, na.rm = TRUE)
mda_median <- median(mda_stats$delay, na.rm = TRUE)
mda_percentile90 <- quantile(mda_stats$delay, probs = 0.9, na.rm = TRUE)
mda_std <- sd(mda_stats$delay, na.rm = TRUE)
# Print the computed statistics
cat("RegionEx Flight Statistics:\n")
## RegionEx Flight Statistics:
cat("Mean Arrival Delay:", regionex_mean, "\n")
## Mean Arrival Delay: 15.6625
cat("Median Arrival Delay:", regionex_median, "\n")
## Median Arrival Delay: 9
cat("90th Percentile Arrival Delay:", regionex_percentile90, "\n")
## 90th Percentile Arrival Delay: 21
cat("Standard Deviation of Arrival Delay:", regionex_std, "\n\n")
## Standard Deviation of Arrival Delay: 27.65036
cat("MDA Flight Statistics:\n")
## MDA Flight Statistics:
cat("Mean Arrival Delay:", mda_mean, "\n")
## Mean Arrival Delay: 10.9
cat("Median Arrival Delay:", mda_median, "\n")
## Median Arrival Delay: 13
cat("90th Percentile Arrival Delay:", mda_percentile90, "\n")
## 90th Percentile Arrival Delay: 16.1
cat("Standard Deviation of Arrival Delay:", mda_std, "\n")
## Standard Deviation of Arrival Delay: 6.338359
## Contractual obligations aside, I believe the most appropriate measure of central tendency for comparing airline performance in this setting is the median arrival delay. The median is far less affected by extreme outliers than the mean, and RegionEx's delays are much more spread out than MDA's: its standard deviation is about 27.7 minutes versus 6.3, and its mean (15.7) sits well above its median (9). Because it limits the influence of those outliers, the median gives a clearer picture of the typical flight for each airline.
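## A minimal sketch of why the median resists outliers, using made-up delay values (not taken from flight_delay_clean.csv). The vector names are hypothetical.

```r
# Hypothetical delays in minutes; these are illustrative, not real flight data
typical_delays <- c(5, 8, 9, 10, 12)
with_outlier   <- c(typical_delays, 150)  # add one extreme delay

mean(typical_delays)    # 8.8
median(typical_delays)  # 9

mean(with_outlier)      # about 32.3: pulled far upward by one flight
median(with_outlier)    # 9.5: barely moves
```

## One extreme flight nearly quadruples the mean, while the median shifts by half a minute.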
# Assuming you have the "ggplot2" library installed
library(ggplot2)
# Create a histogram of arrival delay minutes for RegionEx and MDA flights
ggplot(flight_delay, aes(x = delay)) +
  geom_histogram(binwidth = 5, fill = "blue", alpha = 0.7) + # Adjust binwidth as needed
  facet_wrap(~ airline, ncol = 2) + # Separate histograms for each airline
  labs(x = "Arrival Delay Minutes", y = "Frequency") + # Labels for axes
  ggtitle("Distribution of Arrival Delays for RegionEx and MDA Flights") # Title for the plot
## Looking at the distributions in these two plots, RegionEx clearly has more extreme outliers and a high frequency of shorter delays, below roughly 20 minutes. It also has more instances of making up time and arriving before the scheduled arrival time. MDA has fewer delays overall (it also has half as many flights), but its typical delay is longer than RegionEx's when comparing medians (13 versus 9 minutes). In short, MDA's delays are less frequent but longer when they occur, whereas RegionEx has frequent short delays, more early arrivals, and a few extreme outliers.
## Yes, RegionEx has a higher percentage of delayed flights (26.2% versus 25.8%), but the gap is small. RegionEx also has twice as many flights in the data set as MDA (240 versus 120). A larger N does not by itself raise a percentage, so the difference has to be read alongside how those extra flights are spread across routes.
## It is all relative. In the route-level summary table, MDA's percentage of delayed flights is equal to or higher than RegionEx's on every route, yet in the first table MDA's overall percentage is lower. The reversal comes from the mix of flights: RegionEx flies 90 flights on each DFW route, where delays are on average more common than on the PNS routes, while MDA flies 30 flights on every route. I believe the COO is correct in his assumption, but I would again argue that these two airlines are difficult to compare head-to-head. Because they fly such different volumes on different routes, their pooled delay percentages are not directly comparable.
# Load the dplyr package
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Create a summary table of percent delayed by airline.
flight_delay %>%
  group_by(airline) %>%
  summarize(n = n(),
            percent_delay = (mean(delay_indicator) * 100) %>% round(1))
## # A tibble: 2 × 3
##   airline      n percent_delay
##   <chr>    <int>         <dbl>
## 1 MDA        120          25.8
## 2 RegionEx   240          26.2
# Create a summary table of percentage of delayed flights by airline and route
delay_summary <- flight_delay %>%
  group_by(airline, route_code) %>%
  summarize(
    total_flights = n(),
    percent_delayed = (mean(delay_indicator) * 100) %>% round(2))
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
# Print the summary table
print(delay_summary)
## # A tibble: 8 × 4
## # Groups:   airline [2]
##   airline  route_code total_flights percent_delayed
##   <chr>    <chr>              <int>           <dbl>
## 1 MDA      DFW/MSY               30            26.7
## 2 MDA      MSY/DFW               30            30
## 3 MDA      MSY/PNS               30            20
## 4 MDA      PNS/MSY               30            26.7
## 5 RegionEx DFW/MSY               90            25.6
## 6 RegionEx MSY/DFW               90            28.9
## 7 RegionEx MSY/PNS               30            20
## 8 RegionEx PNS/MSY               30            26.7
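## A rough sketch of how the route-level percentages above combine into the airline-level percentages. The delayed-flight counts are back-calculated from the table (percent times flights, rounded), so they are an assumption rather than values read from the raw data. The object names are hypothetical.

```r
# Counts per route; `delayed` is back-calculated from the printed percentages
routes <- data.frame(
  airline       = rep(c("MDA", "RegionEx"), each = 4),
  route_code    = rep(c("DFW/MSY", "MSY/DFW", "MSY/PNS", "PNS/MSY"), times = 2),
  total_flights = c(30, 30, 30, 30, 90, 90, 30, 30),
  delayed       = c(8, 9, 6, 8, 23, 26, 6, 8)
)

# Pooled percentage per airline: total delayed flights / total flights
pooled <- aggregate(cbind(delayed, total_flights) ~ airline, data = routes, FUN = sum)
pooled$percent_delayed <- 100 * pooled$delayed / pooled$total_flights
pooled

# Share of each airline's flights that are on the DFW routes
dfw <- routes[grepl("DFW", routes$route_code), ]
dfw_share <- tapply(dfw$total_flights, dfw$airline, sum) /
  tapply(routes$total_flights, routes$airline, sum)
dfw_share  # MDA 0.50, RegionEx 0.75
```

## Under these counts the pooled figures come out near 25.8% for MDA and 26.2% for RegionEx, even though RegionEx is never higher on an individual route, because three quarters of its flights are on the DFW routes.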
## The primary thing I notice is the difference in scheduled flight times between the two airlines. MDA schedules its flights longer than RegionEx on every route: by 10 minutes on the DFW routes (100 versus 90) and by 5 minutes on the PNS routes (75 versus 70). The other significant point is that RegionEx schedules shorter flights and is "delayed" more often, yet its actual flight durations are still shorter on average than MDA's. This shows that the original analysis does not capture the full picture of the differences between the two airlines. If MDA used RegionEx's scheduled flight durations, its percentage of delayed flights would be much higher and its delays larger in magnitude. Conversely, if RegionEx used MDA's schedules, a much smaller percentage of its flights would count as delayed and its delay distribution would shift closer to 0.
# Load the dplyr package if it's not already loaded
library(dplyr)
# Create a summary table for scheduled flight durations by airline and route
scheduled_summary <- flight_delay %>%
  group_by(airline, route_code) %>%
  summarize(
    mean_scheduled_duration = mean(scheduled_flight_length),
    median_scheduled_duration = median(scheduled_flight_length)
  )
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
# Print the summary table for scheduled flight durations
print(scheduled_summary)
## # A tibble: 8 × 4
## # Groups:   airline [2]
##   airline  route_code mean_scheduled_duration median_scheduled_duration
##   <chr>    <chr>                        <dbl>                     <dbl>
## 1 MDA      DFW/MSY                        100                       100
## 2 MDA      MSY/DFW                        100                       100
## 3 MDA      MSY/PNS                         75                        75
## 4 MDA      PNS/MSY                         75                        75
## 5 RegionEx DFW/MSY                         90                        90
## 6 RegionEx MSY/DFW                         90                        90
## 7 RegionEx MSY/PNS                         70                        70
## 8 RegionEx PNS/MSY                         70                        70
# Create a summary table for both scheduled and actual flight durations by airline and route
duration_summary <- flight_delay %>%
  group_by(airline, route_code) %>%
  summarize(
    mean_scheduled_duration = mean(scheduled_flight_length),
    mean_actual_duration = mean(actual_flight_length),
    median_scheduled_duration = median(scheduled_flight_length),
    median_actual_duration = median(actual_flight_length)
  )
## `summarise()` has grouped output by 'airline'. You can override using the
## `.groups` argument.
# Print the summary table for both scheduled and actual flight durations
print(duration_summary)
## # A tibble: 8 × 6
## # Groups:   airline [2]
##   airline  route_code mean_scheduled_duration mean_actual_duration
##   <chr>    <chr>                        <dbl>                <dbl>
## 1 MDA      DFW/MSY                        100                113.
## 2 MDA      MSY/DFW                        100                114.
## 3 MDA      MSY/PNS                         75                 85.2
## 4 MDA      PNS/MSY                         75                 81.4
## 5 RegionEx DFW/MSY                         90                106.
## 6 RegionEx MSY/DFW                         90                108.
## 7 RegionEx MSY/PNS                         70                 81
## 8 RegionEx PNS/MSY                         70                 80.4
## # ℹ 2 more variables: median_scheduled_duration <dbl>,
## #   median_actual_duration <dbl>
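## A hedged sketch of how the scheduled duration alone can change the share of flights counted as delayed. The helper percent_delayed(), the 15-minute cutoff, and the flight times are all assumptions for illustration; the data set does not state how delay_indicator is defined.

```r
# Hypothetical helper: percent of flights arriving more than `threshold`
# minutes later than a given scheduled length (15-minute cutoff is assumed)
percent_delayed <- function(actual, scheduled, threshold = 15) {
  100 * mean(actual - scheduled > threshold)
}

# Made-up actual flight lengths in minutes for a single route
actual_minutes <- c(98, 102, 104, 107, 109, 112, 118, 125)

percent_delayed(actual_minutes, scheduled = 100)  # 25: judged against a 100-minute schedule
percent_delayed(actual_minutes, scheduled = 90)   # 62.5: same flights, 90-minute schedule
```

## The same eight flights look far worse against the shorter schedule, which is the effect the duration tables point to.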
## No, the data does not support the claim that RegionEx's on-time performance is worse than MDA's. As noted in earlier responses, RegionEx does have a slightly higher overall percentage of delayed flights (26.2% versus 25.8%). Grouped by route, however, RegionEx's delay percentage is lower than MDA's on both DFW routes and identical on both PNS routes, so it is never higher on any individual route. The data also suggests why RegionEx records more delays in the first place: it schedules shorter flight durations than MDA. That shorter schedule works against RegionEx statistically when individuals take these statistics at face value. Digging deeper, on every route in the data set RegionEx's mean actual flight time is shorter than MDA's, by roughly 1 to 7 minutes, even though its overzealous scheduling team assumes flights can be made faster than the actual times that have been reported. RegionEx has a higher overall percentage of delayed flights, but on every route its average flight time is shorter while it operates twice as many flights, so I would argue that its on-time performance is not worse than MDA's.