I will begin by recreating the airline delay chart and exporting it to a .CSV file. I will then be able to use tidyr and dplyr to transform the data from wide to long by creating airline, status (on time/ delayed), city and flight count columns, and compare delays/ arrival rates for airlines and cities.
Code Base
# load librarieslibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
I created the airline delay .CSV file using Excel, and will begin by reading it.
# read csvairline_df <-read.csv("airline.csv")
Populate missing cells by adding column names.
# add airline and status column namescolnames(airline_df) <-c("airline","status","Los_Angeles","Phoenix","San_Diego","San_Francisco","Seattle")
Remove blank row.
# make sure all blank cells are NAairline_df[airline_df ==""] <-NA# remove row 3airline_df <- airline_df[-3, ]
Populate the rest of the NA cells by filling the airline column.
airline_df <- airline_df %>%fill(airline)
Convert the data from wide to long.
# make sure all values are numeric, and remove commasairline_df<- airline_df %>%mutate(across(Los_Angeles:Seattle,~as.numeric(gsub(",", "", .))) )# use pivot_longer to change each city column to a singular cities column, and a flight count column airline_df <- airline_df %>%pivot_longer(cols = Los_Angeles:Seattle,names_to ="city",values_to ="flight_count" )
`summarise()` has grouped output by 'airline'. You can override using the
`.groups` argument.
arrivals_by_city
# A tibble: 10 × 5
airline city arrival_count total_count arrival_percentage
<chr> <chr> <dbl> <dbl> <dbl>
1 ALASKA Phoenix 221 233 94.8
2 AM WEST Phoenix 4840 5255 92.1
3 ALASKA San_Diego 212 232 91.4
4 ALASKA Los_Angeles 497 559 88.9
5 ALASKA Seattle 1841 2146 85.8
6 AM WEST Los_Angeles 694 811 85.6
7 AM WEST San_Diego 383 448 85.5
8 ALASKA San_Francisco 503 605 83.1
9 AM WEST Seattle 201 262 76.7
10 AM WEST San_Francisco 320 449 71.3
This table shows that, for both airlines, the city with the highest rate of on time flights was Phoenix, and the city with the lowest rate of on time flights was San Francisco. It also shows that Alaska has a higher rate of on time flights than AM West in every city.
Conclusion
When comparing the airlines AM West and Alaska, a discrepancy rises in that Alaska has a higher rate of delayed flights overall, but when comparing on time flight rates by city, Alaska also has a higher rate of on time flights in every city. This can be explained by Simpson’s Paradox, which is when groups of data show different trends when reversed. Since AM West operates a higher number of flights in each city with lower overall delays, its overall delay rate appears lower, even though Alaska performs better in every individual city. To verify this work, one could hand calculate the counts and percentages to compare them to the table outputs.