Preface

I experimented with the kableExtra package to display the output of code blocks more nicely than the default, which limits the number of columns shown. If not already installed on your computer, you can run the command below.

install.packages("kableExtra")

Introduction

This assignment aims to explain how analyses of individual groups of data versus aggregated data can identify trends in opposite directions, a phenomenon in data analysis known as “Simpson’s paradox.”1

Data

The original data table in the assignment instructions looks like this:

I created a corresponding CSV file from these data and saved it to my GitHub repository. For convenience, I substituted the city names with standard airport abbreviations (LAX = Los Angeles, PHX = Phoenix, SAN = San Diego, SFO = San Francisco, SEA = Seattle).

Airline,Arrival_status,LAX,PHX,SAN,SFO,SEA
Alaska,on time,497,221,212,503,1841
Alaska,delayed,62,12,20,102,305
AmWest,on time,694,4840,383,320,201
AmWest,delayed,117,415,65,129,61

I read this file into a data frame (tibble).

(arrival_data_raw <- read_csv('https://media.githubusercontent.com/media/alexandersimon1/Data607/main/Assignment5/arrival_delays.csv', show_col_types = FALSE))
## # A tibble: 4 × 7
##   Airline Arrival_status   LAX   PHX   SAN   SFO   SEA
##   <chr>   <chr>          <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alaska  on time          497   221   212   503  1841
## 2 Alaska  delayed           62    12    20   102   305
## 3 AmWest  on time          694  4840   383   320   201
## 4 AmWest  delayed          117   415    65   129    61

Data transformations

The raw data has both long and wide characteristics. It is “long” in the sense that each airline and arrival status has a separate record (row). But it is “wide” in the sense that each airport has a separate column. To tidy the data, I reshaped the data into two different wide formats. (I did this to make subsequent analyses easier.)

Wide format #1. I consolidated the rows for each airline into a single row. I also reordered the columns so that all the “on time” columns and all the “delayed” columns were next to each other.

arrivals_by_airline <- arrival_data_raw %>% 
    pivot_wider(
      id_cols = Airline,
      names_from = Arrival_status,
      values_from = c(LAX, PHX, SAN, SFO, SEA)
    ) %>%
    select(Airline, ends_with("on time"), ends_with("delayed"))

kbl_display(arrivals_by_airline, "100%")
Airline LAX_on time PHX_on time SAN_on time SFO_on time SEA_on time LAX_delayed PHX_delayed SAN_delayed SFO_delayed SEA_delayed
Alaska 497 221 212 503 1841 62 12 20 102 305
AmWest 694 4840 383 320 201 117 415 65 129 61


Wide format #2. The second wide format is a transposition of the first (ie, so the airports are rows and the airlines are columns). To do this, I first reshaped the data into an intermediate long format.

intermediate_long_df <- arrivals_by_airline %>%
    pivot_longer(
      cols = -1, # Pivot all columns except the first
      names_to = c("Airport", "Arrival_status"),
      names_sep = "_",
      values_to = "Arrivals"
    )

kbl_display(intermediate_long_df, "50%", "300px")
Airline Airport Arrival_status Arrivals
Alaska LAX on time 497
Alaska PHX on time 221
Alaska SAN on time 212
Alaska SFO on time 503
Alaska SEA on time 1841
Alaska LAX delayed 62
Alaska PHX delayed 12
Alaska SAN delayed 20
Alaska SFO delayed 102
Alaska SEA delayed 305
AmWest LAX on time 694
AmWest PHX on time 4840
AmWest SAN on time 383
AmWest SFO on time 320
AmWest SEA on time 201
AmWest LAX delayed 117
AmWest PHX delayed 415
AmWest SAN delayed 65
AmWest SFO delayed 129
AmWest SEA delayed 61


Then I pivoted this into the second wide format.

arrivals_by_airport <- intermediate_long_df %>%
    pivot_wider(
      names_from = c(Airline, Arrival_status),
      names_sep = "_",
      values_from = Arrivals
    )

kbl_display(arrivals_by_airport, "75%")
Airport Alaska_on time Alaska_delayed AmWest_on time AmWest_delayed
LAX 497 62 694 117
PHX 221 12 4840 415
SAN 212 20 383 65
SFO 503 102 320 129
SEA 1841 305 201 61


Analysis

I based my analysis on the comparisons that were described in the Numbersense book that was referenced in the data table.

1. Comparison of arrival delays across all airports (aggregated data)

This analysis uses wide format #1. I calculated the overall proportion of Alaska and AmWest arrivals that are delayed across all 5 airports.

aggregate_delays <- arrivals_by_airline %>%
    rowwise() %>%
    mutate(
      Total_delays = sum(c_across(LAX_delayed:SEA_delayed)),                          
      Total_arrivals = sum(c_across(`LAX_on time`:SEA_delayed)),
      Proportion_delays = round(Total_delays / Total_arrivals, digits = 3)
      ) %>%
    select(Airline, Total_delays, Total_arrivals, Proportion_delays)

kbl_display(aggregate_delays, "60%")
Airline Total_delays Total_arrivals Proportion_delays
Alaska 501 3775 0.133
AmWest 787 7225 0.109


The barplot shows that a higher proportion of Alaska arrivals are delayed than AmWest arrivals across all 5 airports.

ggplot(aggregate_delays, aes(x = Airline, y = Proportion_delays, fill = Airline)) +
  geom_bar(stat = "identity") + 
  ylim(0, 0.15) +
  xlab("Airline") +
    theme(axis.title.x = element_text(face = "bold")) +
  ylab("Proportion of arrivals delayed") +
    theme(axis.title.y = element_text(face = "bold"))


2. Comparison of arrival delays at individual airports

This analysis uses wide format #2. I calculated the proportion of Alaska and AmWest arrivals that are delayed at each airport.

arrivals_by_airport <- arrivals_by_airport %>%
  mutate(
    Alaska_total_arrivals = `Alaska_on time` + Alaska_delayed,
    Alaska_proportion_delay = round(Alaska_delayed / Alaska_total_arrivals, digits = 3),
    AmWest_total_arrivals = `AmWest_on time` + AmWest_delayed,
    AmWest_proportion_delay = round(AmWest_delayed / AmWest_total_arrivals, digits = 3),
  )

individual_airport_delays <- arrivals_by_airport %>%
  select(Airport, ends_with("_proportion_delay"))

kbl_display(individual_airport_delays, "60%")
Airport Alaska_proportion_delay AmWest_proportion_delay
LAX 0.111 0.144
PHX 0.052 0.079
SAN 0.086 0.145
SFO 0.169 0.287
SEA 0.142 0.233


To compare the results graphically, I wanted to create a side-by-side barplot, so I reshaped the data. The barplot shows that a higher proportion of AmWest arrivals are delayed than Alaska arrivals at each airport.

reshape_df_for_sbs_barplot <- function(df) {
  reshaped_df <- df %>%
    pivot_longer(
      cols = -1, 
      names_to = c("Airline", "Arrival_status"),
      names_sep = "_proportion_",
      values_to = "Proportion"
    )
  return(reshaped_df)
}

ggplot(reshape_df_for_sbs_barplot(individual_airport_delays), aes(x = Airport, y = Proportion, fill = Airline)) +
  geom_col(position = 'dodge') +
  xlab("Airport") +
    theme(axis.title.x = element_text(face = "bold")) +
  ylab("Proportion of arrivals delayed") +
    theme(axis.title.y = element_text(face = "bold"))

Why did the first analysis show that AmWest has a lower proportion of arrival delays, but the second analysis of the same data shows that Alaska has a lower proportion? As explained in the Numbersense book, aggregated data may be confounded by “hidden” variables2 that influence individual data groups. For example, the 5 airports may differ by an unknown attribute that influences the proportion of Alaska and AmWest arrivals at each airport.

The next two analyses examine this possibility.


3. Comparison of arrivals at each airport

The proportion of Alaska and AmWest arrivals (on time + delayed) at each of the 5 airports is not balanced (ie, not a 1:1 ratio).

arrivals_by_airport <- arrivals_by_airport %>%
  mutate(
    Total_arrivals = Alaska_total_arrivals + AmWest_total_arrivals,
    Alaska_proportion_arrivals = round(Alaska_total_arrivals / Total_arrivals, digits = 3),    
    AmWest_proportion_arrivals = round(AmWest_total_arrivals / Total_arrivals, digits = 3),
  )

individual_airport_arrivals <- arrivals_by_airport %>%
  select(Airport, ends_with("_proportion_arrivals"))

kbl_display(individual_airport_arrivals, "60%")
Airport Alaska_proportion_arrivals AmWest_proportion_arrivals
LAX 0.408 0.592
PHX 0.042 0.958
SAN 0.341 0.659
SFO 0.574 0.426
SEA 0.891 0.109


The imbalance is most pronounced at PHX and SEA. At PHX, almost all arrivals are AmWest flights. In contrast, at SEA, most arrivals are Alaska flights.

ggplot(reshape_df_for_sbs_barplot(individual_airport_arrivals), aes(x = Airport, y = Proportion, fill = Airline)) +
  geom_col(position = 'fill') +
  xlab("Airport") +
    theme(axis.title.x = element_text(face = "bold")) +
  ylab("Proportion of arrivals") +
    theme(axis.title.y = element_text(face = "bold"))


4. Comparison of arrival delays at each airport

Furthermore, the overall proportion of delayed arrivals (ie, for both airlines combined) is highest at SFO and lowest at PHX. SEA has the second highest proportion of delayed arrivals.

arrivals_by_airport <- arrivals_by_airport %>%
  mutate(
    Total_delays = Alaska_delayed + AmWest_delayed,
    Overall_proportion_delayed = round(Total_delays / Total_arrivals, digits = 3),    
  ) %>%
  arrange(desc(Overall_proportion_delayed))

individual_airport_delays <- arrivals_by_airport %>%
  select(Airport, Overall_proportion_delayed)

kbl_display(individual_airport_delays, "40%")
Airport Overall_proportion_delayed
SFO 0.219
SEA 0.152
LAX 0.131
SAN 0.125
PHX 0.078


ggplot(individual_airport_delays, aes(x = Airport, y = Overall_proportion_delayed, fill = Airport)) +
  geom_col() + 
  theme(legend.position = "none") +
  ylim(0, 0.3) +
  xlab("Airport") +
    theme(axis.title.x = element_text(face = "bold")) +
  ylab("Proportion of arrivals delayed") +
    theme(axis.title.y = element_text(face = "bold"))


Explanation of Simpson’s paradox

Analyses #3 and #4 show that not all airports are equal with respect to arrival delay rates due to an unknown airport attribute. In the aggregate analysis, AmWest’s arrival performance was boosted by the high proportion of its arrivals at PHX (analysis #3), which was the best-performing airport (analysis #4). In addition, Alaska’s arrival performance was hurt by the high proportion of its arrivals at SEA, which was the second-worst performing airport. As a result, the aggregate comparison drowned out Alaska’s low proportion of delayed arrivals at airports with a higher proportion of AmWest flights, making it appear that AmWest has better on-time performance.


Conclusions

These analyses demonstrate that the aggregate analysis leads to a misleading conclusion due to Simpson’s paradox. Specifically, the aggregate analysis shows that AmWest has a lower overall proportion of arrival delays than Alaska, because a confounding airport variable affects the proportion of Alaska and AmWest arrivals at each airport. The individual airport analysis is not affected by this and more accurately shows that Alaska has a lower proportion of arrival delays than AmWest at each of the 5 airports.


  1. Numbersense, Kaiser Fung, McGraw Hill, 2013. Available at CUNY library.↩︎

  2. More accurately, “hidden” variables are unmeasured variables (ie, not in the dataset).↩︎