Approach

In this assignment, I will recreate the airline delay table shown in the prompt as a dataset that can be imported into R, then tidy and transform it to support comparisons between two airlines (Alaska and AM West). The goal is not just to count delays, but to compare delay percentages overall and across five cities, and then explain why the “overall” comparison may disagree with the city-by-city comparison.

1) Recreate the table as a structured dataset (wide format first).
I will begin by entering the values from the chart into a CSV file that mirrors the table’s layout. I plan to keep the data “wide” initially (with one row per airline + status and separate columns for each city). This matches the assignment’s intention to practice tidying. I will also preserve any empty cells in the original layout to reflect “missingness” as it appears in the source table.

2) Load the CSV into R and confirm it matches the source.
After importing the CSV into R, I will validate that the counts (on-time and delayed) match the values in the chart for each airline and city. This helps prevent subtle data-entry mistakes before I do any transformations.

3) Populate missing data in a reproducible way.
If there are empty cells in the recreated file, I will write a documented cleanup step that fills or flags them appropriately (depending on whether they represent true missing values or simply formatting gaps). The main idea is to demonstrate a clear, repeatable method for handling missing data rather than editing it manually.

4) Tidy the data from wide → long format.
Next, I will transform the dataset into a tidy/long format with columns such as:

This structure will make it easier to compute totals and percentages using grouped summaries.

5) Compute percentages and compare airline performance in two ways.
I will perform two types of comparisons, focusing on percent delayed (or equivalently, percent on-time):

I will present these results using a table and/or a chart, and include brief narrative interpretation for each.

6) Explain the discrepancy between overall vs city-level results.
Finally, I will explain why the overall comparison can differ from the city-by-city comparison. My expectation is that the airlines may operate different numbers of flights in each city, meaning the overall rate is a weighted average that can be heavily influenced by a city where one airline has much higher volume. This can create a situation where one airline appears better overall even if it performs worse in several individual cities - or as we learned in class: the Simpson’s paradox.

Some data challenges I anticipate are:

By recreating the table, tidying the dataset, and comparing percentages overall and by city, I expect to produce a clear summary of which airline appears to perform better under each view with a clear explanation of why those views can disagree.

Codebase

#After creating the CSV, I will load it in R 

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
wide <- read_csv("arrival_delays_wide - Sheet1.csv", show_col_types = FALSE)
wide
## # A tibble: 4 × 7
##   Airline Status  `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <chr>   <chr>           <dbl>   <dbl>       <dbl>           <dbl>   <dbl>
## 1 ALASKA  on time           497     221         212             503    1841
## 2 <NA>    delayed            62      12          20             102     305
## 3 AMWEST  on time           694    4840         383             320     201
## 4 <NA>    delayed           117     415          65             129      61
#Fill in missing data (airline name)
wide_filled <- wide %>%
  fill(Airline)

#Transform the data into long format

long <- wide_filled %>%
  pivot_longer(
    cols = c(`Los Angeles`, Phoenix, `San Diego`, `San Francisco`, Seattle),
    names_to = "City",
    values_to = "Flights"
  )
long
## # A tibble: 20 × 4
##    Airline Status  City          Flights
##    <chr>   <chr>   <chr>           <dbl>
##  1 ALASKA  on time Los Angeles       497
##  2 ALASKA  on time Phoenix           221
##  3 ALASKA  on time San Diego         212
##  4 ALASKA  on time San Francisco     503
##  5 ALASKA  on time Seattle          1841
##  6 ALASKA  delayed Los Angeles        62
##  7 ALASKA  delayed Phoenix            12
##  8 ALASKA  delayed San Diego          20
##  9 ALASKA  delayed San Francisco     102
## 10 ALASKA  delayed Seattle           305
## 11 AMWEST  on time Los Angeles       694
## 12 AMWEST  on time Phoenix          4840
## 13 AMWEST  on time San Diego         383
## 14 AMWEST  on time San Francisco     320
## 15 AMWEST  on time Seattle           201
## 16 AMWEST  delayed Los Angeles       117
## 17 AMWEST  delayed Phoenix           415
## 18 AMWEST  delayed San Diego          65
## 19 AMWEST  delayed San Francisco     129
## 20 AMWEST  delayed Seattle            61
#Create totals + percent delayed overall by airline:
overall <- long %>%
  group_by(Airline, Status) %>%
  summarise(Flights = sum(Flights), .groups = "drop") %>%
  group_by(Airline) %>%
  mutate(
    Total = sum(Flights),
    Pct = Flights / Total
  ) %>%
  ungroup()

overall
## # A tibble: 4 × 5
##   Airline Status  Flights Total   Pct
##   <chr>   <chr>     <dbl> <dbl> <dbl>
## 1 ALASKA  delayed     501  3775 0.133
## 2 ALASKA  on time    3274  3775 0.867
## 3 AMWEST  delayed     787  7225 0.109
## 4 AMWEST  on time    6438  7225 0.891
#Create a percent specific delayed
overall_delayed <- overall %>%
  filter(Status == "delayed") %>%
  select(Airline, Flights, Total, Pct)

overall_delayed
## # A tibble: 2 × 4
##   Airline Flights Total   Pct
##   <chr>     <dbl> <dbl> <dbl>
## 1 ALASKA      501  3775 0.133
## 2 AMWEST      787  7225 0.109

If we plot the overall delayed vs on-time percentage, we will see that Alaska has a higher percentage of delays than AM West.

overall %>%
  ggplot(aes(x = Airline, y = Pct, fill = Status)) +
  geom_col(position = "stack") +
  scale_y_continuous(labels = scales::percent)

city_perf <- long %>%
  group_by(Airline, City, Status) %>%
  summarise(Flights = sum(Flights), .groups = "drop") %>%
  group_by(Airline, City) %>%
  mutate(
    Total = sum(Flights),
    Pct = Flights / Total
  ) %>%
  ungroup()

city_delayed <- city_perf %>%
  filter(Status == "delayed") %>%
  select(Airline, City, Flights, Total, Pct)

city_delayed
## # A tibble: 10 × 5
##    Airline City          Flights Total    Pct
##    <chr>   <chr>           <dbl> <dbl>  <dbl>
##  1 ALASKA  Los Angeles        62   559 0.111 
##  2 ALASKA  Phoenix            12   233 0.0515
##  3 ALASKA  San Diego          20   232 0.0862
##  4 ALASKA  San Francisco     102   605 0.169 
##  5 ALASKA  Seattle           305  2146 0.142 
##  6 AMWEST  Los Angeles       117   811 0.144 
##  7 AMWEST  Phoenix           415  5255 0.0790
##  8 AMWEST  San Diego          65   448 0.145 
##  9 AMWEST  San Francisco     129   449 0.287 
## 10 AMWEST  Seattle            61   262 0.233
city_delayed %>%
  ggplot(aes(x = City, y = Pct, fill = Airline)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  coord_flip()

If we plot the city by city percentage of delayed percentage, we will see, contradictory to our earlier finding, that AMWEST has a higher percentage of delays than Alaska. AMWEST has a huge number of Phoenix flights compared to other cities, so Phoenix heavily influences AMWEST’s overall rate. That kind of weighting creates a Simpson’s paradox situation: one airline looks better overall even if it’s worse in several individual cities (or vice versa), depending on where most of its flights are concentrated.