1 Overview

A CSV file containing needed data was created and uploaded to GitHub. The data includes on time and delayed arrival counts for two airlines across five destinations. This r packages tidyr and dplyr were used to tidy and transform data for analysis. Data analysis was performed to assess and compare the arrival delays for the two airlines present in the data. The analysis provides an example of Simpson’s Paradox, a phenomenon in which a trend appears in several groups of data but disappears or reverses when the groups are combined.


2 Import Data

Read in csv file as data frame and fill blank strings as NA values

df <- suppressWarnings(read.csv("https://raw.githubusercontent.com/okazevedo90/DATA607_GitHub/refs/heads/main/DATA607_Assignment4Data.csv", header = T, na.strings = c("")))

3 Tidy and Transform Data

3.1 Fill Missing Values

Forward fill NA airline values

df <- df |>
  fill(Airline)

3.2 Pivot Data

Pivot data into tidy form via lengthening the data to transform the 5 destination arrival count columns into two columns, Destination and Delay_Count. Pivoting the data in this way makes each observation a row and each column a variable; thus, making the data tidy.

long_df <- df |>
  pivot_longer(
    cols = c("Los_Angeles", "Phoenix", "San_Diego", "San_Franciso", "Seattle"),
    names_to = "Destination",
    values_to = "Delay_Count"
  ) |> 
  transform(Destination = str_replace(Destination, "_", " "))

4 Data Analysis

4.1 Airline Arrival Counts

Total Airline Arrival Counts

long_df |>
  group_by(Airline, Delay_Type) |>
  summarise(Delay_Count_Sum = sum(Delay_Count), .groups = "keep") |>
  ggplot(aes(x = Delay_Type, y = Delay_Count_Sum, fill = Airline)) +
  geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
  scale_fill_manual(values = c("tomato1", 'steelblue')) + 
  labs(
    title = "Total Airline Arrivals",
    x = "Arrival Type",
    y = "Arrival Count",
    fill = "Airline"
  ) +
  theme_classic() + 
  theme(
    axis.text.x = element_text(size = 11),
    plot.title = element_text(size = 14, face = "bold"))


Total Airline Arrival Counts by Destination

long_df |>
  group_by(Destination, Delay_Type) |>
  ggplot(aes(x = Destination, y = Delay_Count, fill = Airline)) +
  geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
  facet_wrap(~Delay_Type) +
  labs(
    title = "Total Airline Arrivals by Destination",
    x = "Destination",
    y = "Arrival Count",
    fill = "Airline"
  ) +
  scale_fill_manual(values = c("tomato1", 'steelblue')) + 
  theme_classic() + 
  theme(
    axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
    plot.title = element_text(size = 14, face = "bold"))
Total Airline Arrival Counts by Destination

Figure 1: Total Airline Arrival Counts by Destination


4.2 Airline Arrivals Distribution

Airline Arrival Distribution per Total Airline Arrivals
Example Interpretation: About 13% of all Alaska airline arrivals are delayed

airline_prcts =  long_df |>
    group_by(Airline) |>
    mutate(Airline_Total_Count = sum(Delay_Count)) |>
    group_by(Airline, Delay_Type) |>
    mutate(Airline_Delay_Count = sum(Delay_Count)) |>
    mutate(Airline_Delay_Percent = round(Airline_Delay_Count / Airline_Total_Count, 4) * 100) |>
    distinct(Airline, Delay_Type, Airline_Delay_Count, Airline_Total_Count, Airline_Delay_Percent)
  
airline_prcts |>
  ggplot(aes(x = Delay_Type, y = Airline_Delay_Percent, fill = Airline)) +
    geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
    scale_fill_manual(values = c("tomato1", 'steelblue')) + 
    labs(
      title = "Airline Arrival Distribution per Total Airline Arrivals",
        x = "Arrival Type",
        y = "Arrival Count",
        fill = "Airline"
    ) +
    theme_classic() + 
    theme(
    axis.text.x = element_text(size = 11),
    plot.title = element_text(size = 14, face = "bold"))

Table 1: Delayed Airline Arrival Distribution Counts per Airline
Airline Delay_Type Airline_Delay_Count Airline_Total_Count Airline_Delay_Percent
Alaska Delayed 501 3775 13.27
AM West Delayed 787 7225 10.89


Airline Arrivals Distribution per Total Airline Arrivals & Destination
Example Interpretation: About 29% of AM West Seattle arrivals are delayed

airline_dest_prcts = long_df |>
  group_by(Airline, Destination) |>
  mutate(Airline_Dest_Total_Count = sum(Delay_Count)) |>
  group_by(Airline, Destination, Delay_Type) |>
  mutate(Airline_Dest_Delay_Percent = round(Delay_Count / Airline_Dest_Total_Count, 4) * 100) |>
  distinct(Airline, Destination, Delay_Type, Delay_Count, Airline_Dest_Total_Count, Airline_Dest_Delay_Percent)

 airline_dest_prcts |>
  ggplot(aes(x = Destination, y = Airline_Dest_Delay_Percent, fill = Airline)) +
    geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
    scale_fill_manual(values = c("tomato1", 'steelblue')) + 
    labs(
      title = "Airline Arrival Distribution per Total Airline Arrivals & Destination",
      x = "Destination",
      y = "Percent of Airline Arrivals",
      fill = "Airline"
    ) +
    facet_wrap(~Delay_Type) +
    theme_classic() + 
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
    plot.title = element_text(size = 14, face = "bold"))

Table 2: Delayed Airline Arrival Distribution Counts per Airline & Destination
Airline Destination Delay_Type Delay_Count Airline_Dest_Total_Count Airline_Dest_Delay_Percent
Alaska Los Angeles Delayed 62 559 11.09
Alaska Phoenix Delayed 12 233 5.15
Alaska San Diego Delayed 20 232 8.62
Alaska San Franciso Delayed 102 605 16.86
Alaska Seattle Delayed 305 2146 14.21
AM West Los Angeles Delayed 117 811 14.43
AM West Phoenix Delayed 415 5255 7.9
AM West San Diego Delayed 65 448 14.51
AM West San Franciso Delayed 129 449 28.73
AM West Seattle Delayed 61 262 23.28


Airline Arrivals per Total Arrival Type
Example Interpretation: AM West accounts for about 61% of all delays and 66% of on time arrivals for all AM West and Alaska arrivals

delay_prcts = long_df |>
  group_by(Delay_Type) |>
  mutate(Delay_Type_Total_Count = sum(Delay_Count)) |>
  group_by(Airline, Delay_Type) |>
  mutate(Airline_Delay_Count = sum(Delay_Count)) |>
  mutate(Delay_Type_Percent = round(Airline_Delay_Count / Delay_Type_Total_Count, 4) * 100) |>
  distinct(Airline, Delay_Type, Airline_Delay_Count, Delay_Type_Total_Count, Delay_Type_Percent)

delay_prcts |>
  ggplot(aes(x = Delay_Type, y = Delay_Type_Percent, fill = Airline)) +
    geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
    scale_fill_manual(values = c("tomato1", 'steelblue')) + 
    labs(
      title = "Airline Arrival Distribution per Total Arrival Types",
      x = "Arrival Type",
      y = "Percent of Arrival Types",
      fill = "Airline"
    ) +
    theme_classic() + 
    theme(
        axis.text.x = element_text(size = 11),
        plot.title = element_text(size = 14, face = "bold"))
Airline Arrivals per Total Arrival Type

Figure 2: Airline Arrivals per Total Arrival Type

Table 3: Delayed Airline Arrival Distribution Counts per Arrival Type
Airline Delay_Type Airline_Delay_Count Delay_Type_Total_Count Delay_Type_Percent
Alaska Delayed 501 1288 38.9
AM West Delayed 787 1288 61.1


Airline Arrivals per Total Arrival Types & Destination
Example Interpretation: AM West accounts for about 97% of all delayed Phoenix arrivals for both airlines

delay_des_prcts = long_df |>
  group_by(Destination, Delay_Type) |>
  mutate(Total_Delay_Type_Counts = sum(Delay_Count)) |>
  group_by(Airline, Destination, Delay_Type) |>
  mutate(Total_Des_Delay_Counts = sum(Delay_Count)) |> 
  mutate(Delay_Des_Percent = round(Total_Des_Delay_Counts / Total_Delay_Type_Counts, 4) * 100) |>
  distinct(Airline, Destination, Delay_Type, Total_Des_Delay_Counts, Total_Delay_Type_Counts, Delay_Des_Percent) 

delay_des_prcts |>
  ggplot(aes(x = Destination, y = Delay_Des_Percent, fill = Airline)) +
    geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
    scale_fill_manual(values = c("tomato1", 'steelblue')) + 
    labs(
      title = "Airline Arrival Distribution per Total Arrival Types & Destination",
      x = "Destination",
      y = "Percent of Arrivals",
      fill = "Arrival Delay Type"
    ) +
    facet_wrap(~Delay_Type) +
    theme_classic() + 
  theme(
        axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
        plot.title = element_text(size = 14, face = "bold"))

Table 4: Delayed Airline Arrival Distribution Counts per Arrival Type
Airline Destination Delay_Type Total_Des_Delay_Counts Total_Delay_Type_Counts Delay_Des_Percent
Alaska Los Angeles Delayed 62 179 34.64
Alaska Phoenix Delayed 12 427 2.81
Alaska San Diego Delayed 20 85 23.53
Alaska San Franciso Delayed 102 231 44.16
Alaska Seattle Delayed 305 366 83.33
AM West Los Angeles Delayed 117 179 65.36
AM West Phoenix Delayed 415 427 97.19
AM West San Diego Delayed 65 85 76.47
AM West San Franciso Delayed 129 231 55.84
AM West Seattle Delayed 61 366 16.67


Airline Arrivals Percent Difference per Total Arrival Types & Destination (differences from above plot)
Example Interpretation: Alaska airline has a Seattle on time arrival rate about 80% greater than the AM West Seattle on time arrival rate.

delay_des_prcts |>
  ungroup() |>
  group_by(Destination, Delay_Type) |>
  mutate(Des_Delay_Prct_Diff = abs(diff(Delay_Des_Percent))) |>
  filter(Delay_Des_Percent == max(Delay_Des_Percent)) |>
  distinct(Airline, Destination, Delay_Type, Delay_Des_Percent, Des_Delay_Prct_Diff) |>
  ggplot(aes(x = Destination, y = Des_Delay_Prct_Diff, fill = Airline)) +
    geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
    scale_fill_manual(values = c("tomato1", 'steelblue')) + 
    facet_wrap(~Delay_Type) +
   labs(
      title = "Airline Arrivals Percent Different per Total Arrival Types & Destination",
      x = "Destination",
      y = "Arrival Percent Difference",
      fill = "Airline"
    ) +
    theme_classic() + 
    theme(
        axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
        plot.title = element_text(size = 14, face = "bold"))
Airline Arrivals Percent Difference per Total Arrival Types & Destination

Figure 3: Airline Arrivals Percent Difference per Total Arrival Types & Destination


5 Conclusions

This analysis illustrates an example of Simpson’s Paradox or a statistical phenomenon that occurs when a trend in data reverses when the data is broken down into subgroups.

When comparing Alaska and AM West airline arrival delays per each respective airline, the results display that Alaska has more delayed arrivals, with 13% of Alaska arrivals being delayed (see Table 1). However, when taking into account arrival destination, the delayed arrival percentages reveal that AM West has a higher rate of delays for every arrival destination.

The paradoxical conclusion is that AM West has more arrival delays for each arrival destination, yet Alaska appears to have overall more delays when considering all destinations at the same time.

This paradox happens because two effects occur together:

  1. The sizes of the groups, which are combined when destination is ignored, are very different. The totals are dominated by AM West Phoenix arrivals (accounts for 73% of AM West arrivals) and Alaska Seattle arrivals (accounts for 57% of Alaska arrivals). See Figure 1.
    • In addition, AM West has a much larger sample size accounting for about 61% of all delays and 66% of on time arrivals across both airlines. See Figure 2.
  2. The arrival destination, has a large effect on arrival delay rates regardless of the airline. Each airline experiences low and high delay rates for the same destination, i.e. San Francisco has the highest delay rate for both Alaska and AM West airlines. See Table 2.

Analyzing arrivals by percent of delays per destination provides further insight. AM West has a higher delay rate for 4/5 destinations as well as higher on time rates for 3/5 destinations. Conversely, Alaska has a higher delay rate for only 1/5 destinations and higher on time rates for 2/5 destinations. See Figure 3. In addition, the airline with the highest delay rate per destination also has the highest on time rate for each respective destination except for San Francisco. San Francisco is the only destination in which one airline (AM West) has a higher percentage of delays and the other airline (Alaska) has the higher percentage of on time arrivals. See Table 4.

To continue this analysis I would recommend weighting the Alaska airline delay counts and reassess if Simpson’s paradox is still present given the adjusted group sizes.