Approach

My plan for this assignment 5A is to first recreate the airline delay table in the same format as the source (Numbersense, Kaiser Fung, McGraw Hill, 2013), including the empty cells used in the original layout. Then I will import the data into R and clean it into a tidy format for analysis. The main step will be reshaping the data from wide to long format so that the city names are stored in one column and the flight counts are stored in another. This will make it easier to compare the two airlines by status and destination. I will then calculate percentages instead of only using raw counts, since the goal is to compare performance by rates. I plan to compare the airlines in two ways. First, overall across all cities combined, and then separately for each of the five cities. Then I will explain any difference between the overall result and the city-by-city results. The main challenge is that the source table is not already tidy. Some cells are blank for formatting, and the airline labels are grouped instead of repeated on each row. I will need to handle those carefully so each observation is assigned the correct airline, status, city, and count. Another issue is that the overall comparison may not match the city-level comparison, so I will make sure to explain that clearly in the final submission.

airline_wide <- tibble(
  airline = c("ALASKA", NA, "AM WEST", NA),
  status = c("on time", "delayed", "on time", "delayed"),
  `Los Angeles` = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  `San Diego` = c(212, 20, 383, 65),
  `San Francisco` = c(503, 102, 320, 129),
  Seattle = c(1841, 305, 201, 61)
)

airline_wide
# Saving the file as CSV
write_csv(airline_wide, "airline_delays_wide.csv")
# Read the CSV back into R
airline_data <- read_csv("airline_delays_wide.csv", show_col_types = FALSE)

airline_data
# Filling in missing airline names
# Converting from wide to long format
airline_long <- airline_data %>%
  fill(airline, .direction = "down") %>%
  pivot_longer(
    cols = -c(airline, status),
    names_to = "city",
    values_to = "count"
  )

airline_long
overall_summary <- airline_long %>%
  group_by(airline, status) %>%
  summarise(count = sum(count), .groups = "drop") %>%
  group_by(airline) %>%
  mutate(
    total = sum(count),
    pct = count / total
  ) %>%
  ungroup()

overall_summary
overall_delays <- overall_summary %>%
  filter(status == "delayed") %>%
  select(airline, count, total, pct)

overall_delays
ggplot(overall_delays, aes(x = airline, y = pct)) +
  geom_col() +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Overall Delay Rate by Airline",
    x = "Airline",
    y = "Delay Rate"
  )

overall_text <- overall_delays %>%
  arrange(pct) %>%
  mutate(label = paste0(airline, " = ", percent(pct, accuracy = 0.1)))

cat(
  "Overall,",
  overall_text$label[1], "and",
  overall_text$label[2], ".",
  "The airline with the lower overall delay rate appears better when all cities are combined."
)
## Overall, AM WEST = 10.9% and ALASKA = 13.3% . The airline with the lower overall delay rate appears better when all cities are combined.
city_summary <- airline_long %>%
  group_by(airline, city, status) %>%
  summarise(count = sum(count), .groups = "drop") %>%
  group_by(airline, city) %>%
  mutate(
    total = sum(count),
    pct = count / total
  ) %>%
  ungroup()

city_summary
city_delays <- city_summary %>%
  filter(status == "delayed") %>%
  select(airline, city, count, total, pct)

city_delays
ggplot(city_delays, aes(x = city, y = pct, fill = airline)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Delay Rate by Airline and City",
    x = "City",
    y = "Delay Rate"
  ) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

city_table <- city_delays %>%
  select(airline, city, pct) %>%
  mutate(pct = percent(pct, accuracy = 0.1)) %>%
  pivot_wider(names_from = airline, values_from = pct)

city_table
city_winners <- city_delays %>%
  select(airline, city, pct) %>%
  pivot_wider(names_from = airline, values_from = pct) %>%
  mutate(
    better_airline = case_when(
      ALASKA < `AM WEST` ~ "ALASKA",
      ALASKA > `AM WEST` ~ "AM WEST",
      TRUE ~ "Tie"
    )
  )

city_winners
cat(
  "By city, the better airline is shown above.",
  "This comparison should be based on percentages, not raw counts, because the number of flights differs across cities."
)
## By city, the better airline is shown above. This comparison should be based on percentages, not raw counts, because the number of flights differs across cities.

Conclusion

In conclusion, after recreating the original table, preserving the missing cells, and transforming the data into tidy long format, the percentage comparisons showed that AM WEST had the lower overall delay rate (10.9%), while ALASKA had the lower delay rate in every individual city. This difference shows that overall results can be misleading when the groups being combined are not distributed the same way across categories. In this case, the discrepancy is caused by the airlines having different numbers of flights by city, which makes this a clear example of Simpson’s paradox.