Assignment 7

When choosing an airline, prospective flyers want to know many things, but two of the most important questions are “Will my flight be on time?” and “Is this airline reliable?” To answer these questions, I searched the provided database for the airlines with the worst (highest) cancellation rate and the largest disparity between their estimated and actual flight times.

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

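# Load the flights data (January 2016 domestic flights, per the file name).
flights_data <- read.csv("domestic_flights_jan_2016.csv")

# Percentage of flights cancelled per carrier (assumes Cancelled is coded 0/1).
cancellation_rates_by_carrier <- flights_data %>%
  group_by(Carrier) %>%
  summarise(CancellationRate = mean(Cancelled, na.rm = TRUE) * 100) %>%
  arrange(desc(CancellationRate))

# Carrier with the highest cancellation rate.
highest_risk_carrier <- cancellation_rates_by_carrier %>%
  slice_max(CancellationRate, n = 1) %>%
  pull(Carrier)
print(highest_risk_carrier)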

[1] "B6"

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.1     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Full table of cancellation rates, sorted from highest to lowest.
print(cancellation_rates_by_carrier)

# A tibble: 12 × 2
   Carrier CancellationRate
   <chr>              <dbl>
 1 B6                3.90  
 2 AA                3.60  
 3 EV                3.40  
 4 UA                3.36  
 5 VX                2.95  
 6 NK                2.79  
 7 WN                2.53  
 8 OO                2.07  
 9 DL                1.40  
10 F9                1.08  
11 AS                0.979 
12 HA                0.0637

cancellation_rates_by_carrier <- cancellation_rates_by_carrier %>%
  mutate(Color = ifelse(Carrier == highest_risk_carrier, "highest_risk", "other"))
ggplot(cancellation_rates_by_carrier, aes(x = reorder(Carrier, -CancellationRate), y = CancellationRate, fill = Color)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("highest_risk" = "red", "other" = "steelblue")) +
  labs(title = "Cancellation Rates by Carrier",
       x = "Carrier",
       y = "Cancellation Rate (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotates carrier names for readability

This graph shows that carrier “B6” had the highest cancellation rate of any carrier at 3.90%, which is just under 1 in 25 flights. This is well above the unweighted average of about 2.3% across the 12 carriers, which anyone considering airline B6 should note. I would like to commend airline HA, however, for its extremely low cancellation rate (about 0.06%) relative to the other airlines. Of course, there are factors outside an airline’s control that may affect its ability to avoid cancellations, such as climate (an airline operating in Alaska probably has a higher cancellation rate than one operating in Europe), and these should be taken into account.
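
As a rough check on the comparison above, the rates printed in the table can be summarized in a few lines of base R. This is a minimal sketch: the values are copied from the printed tibble (so they carry its rounding), and the names rates, mean_rate, ratio_to_mean, and one_in_n are introduced here for illustration only.

```r
# Carrier cancellation rates (percent), copied from the printed table.
rates <- c(B6 = 3.90, AA = 3.60, EV = 3.40, UA = 3.36, VX = 2.95, NK = 2.79,
           WN = 2.53, OO = 2.07, DL = 1.40, F9 = 1.08, AS = 0.979, HA = 0.0637)

# Unweighted mean across carriers: every airline counts equally,
# regardless of how many flights it operated.
mean_rate <- mean(rates)

# How far the worst carrier sits above that mean, as a ratio.
ratio_to_mean <- rates[["B6"]] / mean_rate

# "1 in N" reading of a percentage: N = 100 / rate.
one_in_n <- 100 / rates[["B6"]]

print(round(c(mean_rate = mean_rate,
              ratio_to_mean = ratio_to_mean,
              one_in_n = one_in_n), 2))
```

Note that this is an unweighted mean across carriers. The overall share of cancelled flights would weight each carrier by its number of flights and, assuming Cancelled is coded 0/1, could be computed as mean(flights_data$Cancelled, na.rm = TRUE) * 100.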

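# Mean absolute gap, in minutes, between actual and scheduled (CRS) elapsed
# time per carrier. abs() treats early and late flights alike.
elapsed_time_deviation <- flights_data %>%
  mutate(ElapsedTimeDeviation = abs(ActualElapsedTime - CRSElapsedTime)) %>%
  group_by(Carrier) %>%
  summarise(MeanDeviation = mean(ElapsedTimeDeviation, na.rm = TRUE)) %>%
  arrange(desc(MeanDeviation))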

highest_deviation_carrier <- elapsed_time_deviation %>%
  slice_max(MeanDeviation, n = 1) %>%
  pull(Carrier)

elapsed_time_deviation <- elapsed_time_deviation %>%
  mutate(Color = ifelse(Carrier == highest_deviation_carrier, "greatest_deviation", "other"))
ggplot(elapsed_time_deviation, aes(x = reorder(Carrier, -MeanDeviation), y = MeanDeviation, fill = Color)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("greatest_deviation" = "red", "other" = "steelblue")) +
  labs(title = "Mean Elapsed Time Deviation by Carrier",
       x = "Carrier",
       y = "Mean Deviation (Minutes)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The graph above shows that UA was the carrier with the largest disparity between projected and actual flight lengths. A mean disparity of 15 minutes is not very extreme, so all of the airlines have an acceptable amount of time disparity. Once again, airline HA has the lowest time disparity of any of the available airlines, which suggests that it is either a very well-run organization or that it operates in an exceptionally calm environment.
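
Because the deviation is computed with abs(), flights that run shorter than scheduled count the same as flights that run longer. The sketch below illustrates the difference between a signed and an absolute mean deviation on hypothetical toy data. The carriers X1 and X2 and all of the numbers are made up for illustration, and base R is used only to keep the example free of package dependencies.

```r
# Hypothetical toy data: two made-up carriers, elapsed times in minutes.
toy_flights <- data.frame(
  Carrier           = c("X1", "X1", "X2", "X2"),
  CRSElapsedTime    = c(100, 100, 100, 100),
  ActualElapsedTime = c(90, 110, 105, 105)
)

# Signed deviation: negative means the flight was shorter than scheduled.
toy_flights$Deviation <- toy_flights$ActualElapsedTime - toy_flights$CRSElapsedTime

# Signed mean per carrier: early and late flights cancel each other out.
mean_signed <- tapply(toy_flights$Deviation, toy_flights$Carrier, mean)

# Absolute mean per carrier: the quantity summarised in the analysis above.
mean_absolute <- tapply(abs(toy_flights$Deviation), toy_flights$Carrier, mean)

print(rbind(mean_signed, mean_absolute))
```

In this toy case, X1 has the larger absolute deviation even though its signed deviation averages out to zero, so the absolute measure reflects how unpredictable flight length is rather than whether flights tend to run long.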

1 Grading Rubric

Item (percent overall)	100% - flawless	67% - minor issues	33% - moderate issues	0% - major issues or not attempted
Question 1 query. (22%)	Relevant question that is fully answered in the query or queries.
Question 1 visualization or table. (15%)	Visually pleasant and relevant to the question.
Question 2 query. (22%)	Relevant question that is fully answered in the query or queries.
Question 2 visualization or table. (15%)	Visually pleasant and relevant to the question.
Data was subsetted separately from the assignment. (10%)	You included the description of your subsetted data in your narrative.	You subsetted the data but didn’t include the description in the narrative.	NA	You didn’t subset the data.
Messages and/or errors suppressed from rendered document and all code is shown. (8%)
Submitted properly to Brightspace (8%)		NA	NA	You must submit according to instructions to receive any credit for this portion.