The data is read in from the csv that was created and can be viewed in the github repository this project resides in. Some filtering is done to remove empty rows in the csv and fill the airline name in.
raw_data <- read_csv("flights.csv")
marshal <- raw_data |>
filter(!if_all(names(raw_data), ~ is.na(.))) |>
fill(...1) |>
rename(airline = ...1, status = ...2)
marshal
## # A tibble: 4 × 7
## airline status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
In order to make analysis easier, the destinations are collapsed into a single column named “destination” and the number of flights in each cell is mapped to a column named “flights”. The “status” column is then broken out into two columns, “on_time” and “delayed” with the value of the “flights” column mapped to the corresponding status. This data format makes calculating the rate of delayed flights simple for each airline and destination. Delayed percentage was chosen as it will be easier to see differences between airlines and destinations when plotting the data due to most flights being on time.
pivoted <- marshal |>
pivot_longer(
cols = `Los Angeles`:Seattle,
names_to = "destination",
values_to = "flights"
) |>
pivot_wider(
names_from = status,
values_from = flights
) |>
rename(on_time = `on time`) |>
mutate(total_flights = on_time + delayed, delayed_percentage = delayed / total_flights)
pivoted
## # A tibble: 10 × 6
## airline destination on_time delayed total_flights delayed_percentage
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA Los Angeles 497 62 559 0.111
## 2 ALASKA Phoenix 221 12 233 0.0515
## 3 ALASKA San Diego 212 20 232 0.0862
## 4 ALASKA San Francisco 503 102 605 0.169
## 5 ALASKA Seattle 1841 305 2146 0.142
## 6 AM WEST Los Angeles 694 117 811 0.144
## 7 AM WEST Phoenix 4840 415 5255 0.0790
## 8 AM WEST San Diego 383 65 448 0.145
## 9 AM WEST San Francisco 320 129 449 0.287
## 10 AM WEST Seattle 201 61 262 0.233
Aggregating the on time and delayed flights by airline allows us to get an idea of how the airlines compare overall. The difference in overall on time rates between the two airlines is relatively small at just 2%, but worth noting is that Am West has almost double the total number of flights and a lower overall delayed percentage, initially indicating strong performance by the airline.
agg_airline <- pivoted |>
group_by(airline) |>
summarise(total_on_time = sum(on_time), total_delayed = sum(delayed)) |>
mutate(total_flights = total_on_time + total_delayed, delayed_percentage = total_delayed / (total_delayed + total_on_time))
agg_airline
## # A tibble: 2 × 5
## airline total_on_time total_delayed total_flights delayed_percentage
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA 3274 501 3775 0.133
## 2 AM WEST 6438 787 7225 0.109
By plotting the rate of delayed flights by destination the data starts to become more interesting. It can be seen that Alaska has a consistently higher rate of on time flights for each individual destination. The difference is most notable for San Francisco and Seattle, with both having around a 10% difference in the rate of on time flights. This appears to run counter to our initial analysis that Am West was the better performing airline. This indicates that Alaska is actually the better performing airline and our initial analysis was skewed by heavy traffic at certain airports, namely Phoenix and Seattle.
pivoted |>
ggplot(aes(x = destination, y = delayed_percentage, fill = airline)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = round(delayed_percentage,2)), vjust = -0.2, position = position_dodge(width = .9)) +
labs(x = "City", y = "Delayed Rate", fill = "Airline", title = "Comparing Airline Delay Rates by Destination")