Approach 5A

Approach

My plan for this assignment 5A is to first recreate the airline delay table in the same format as the source (Numbersense, Kaiser Fung, McGraw Hill, 2013), including the empty cells used in the original layout. Then I will import the data into R and clean it into a tidy format for analysis. The main step will be reshaping the data from wide to long format so that the city names are stored in one column and the flight counts are stored in another. This will make it easier to compare the two airlines by status and destination. I will then calculate percentages instead of only using raw counts, since the goal is to compare performance by rates. I plan to compare the airlines in two ways. First, overall across all cities combined, and then separately for each of the five cities. Then I will explain any difference between the overall result and the city-by-city results. The main challenge is that the source table is not already tidy. Some cells are blank for formatting, and the airline labels are grouped instead of repeated on each row. I will need to handle those carefully so each observation is assigned the correct airline, status, city, and count. Another issue is that the overall comparison may not match the city-level comparison, so I will make sure to explain that clearly in the final submission.

airline_wide <- tibble(
  airline = c("ALASKA", NA, "AM WEST", NA),
  status = c("on time", "delayed", "on time", "delayed"),
  `Los Angeles` = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  `San Diego` = c(212, 20, 383, 65),
  `San Francisco` = c(503, 102, 320, 129),
  Seattle = c(1841, 305, 201, 61)
)

airline_wide

# Saving the file as CSV
write_csv(airline_wide, "airline_delays_wide.csv")

# Read the CSV back into R
airline_data <- read_csv("airline_delays_wide.csv", show_col_types = FALSE)

airline_data

# Filling in missing airline names
# Converting from wide to long format
airline_long <- airline_data %>%
  fill(airline, .direction = "down") %>%
  pivot_longer(
    cols = -c(airline, status),
    names_to = "city",
    values_to = "count"
  )

airline_long

overall_summary <- airline_long %>%
  group_by(airline, status) %>%
  summarise(count = sum(count), .groups = "drop") %>%
  group_by(airline) %>%
  mutate(
    total = sum(count),
    pct = count / total
  ) %>%
  ungroup()

overall_summary

overall_delays <- overall_summary %>%
  filter(status == "delayed") %>%
  select(airline, count, total, pct)

overall_delays

ggplot(overall_delays, aes(x = airline, y = pct)) +
  geom_col() +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Overall Delay Rate by Airline",
    x = "Airline",
    y = "Delay Rate"
  )

overall_text <- overall_delays %>%
  arrange(pct) %>%
  mutate(label = paste0(airline, " = ", percent(pct, accuracy = 0.1)))

cat(
  "Overall,",
  overall_text$label[1], "and",
  overall_text$label[2], ".",
  "The airline with the lower overall delay rate appears better when all cities are combined."
)

## Overall, AM WEST = 10.9% and ALASKA = 13.3% . The airline with the lower overall delay rate appears better when all cities are combined.

city_summary <- airline_long %>%
  group_by(airline, city, status) %>%
  summarise(count = sum(count), .groups = "drop") %>%
  group_by(airline, city) %>%
  mutate(
    total = sum(count),
    pct = count / total
  ) %>%
  ungroup()

city_summary

city_delays <- city_summary %>%
  filter(status == "delayed") %>%
  select(airline, city, count, total, pct)

city_delays

ggplot(city_delays, aes(x = city, y = pct, fill = airline)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = percent_format()) +
  labs(
    title = "Delay Rate by Airline and City",
    x = "City",
    y = "Delay Rate"
  ) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

city_table <- city_delays %>%
  select(airline, city, pct) %>%
  mutate(pct = percent(pct, accuracy = 0.1)) %>%
  pivot_wider(names_from = airline, values_from = pct)

city_table

city_winners <- city_delays %>%
  select(airline, city, pct) %>%
  pivot_wider(names_from = airline, values_from = pct) %>%
  mutate(
    better_airline = case_when(
      ALASKA < `AM WEST` ~ "ALASKA",
      ALASKA > `AM WEST` ~ "AM WEST",
      TRUE ~ "Tie"
    )
  )

city_winners

cat(
  "By city, the better airline is shown above.",
  "This comparison should be based on percentages, not raw counts, because the number of flights differs across cities."
)

## By city, the better airline is shown above. This comparison should be based on percentages, not raw counts, because the number of flights differs across cities.

Approach 5A

Ha Kyoung Kim

2026-02-27

Approach

Conclusion