A CSV file containing needed data was created and uploaded to GitHub. The data includes on time and delayed arrival counts for two airlines across five destinations. This r packages tidyr and dplyr were used to tidy and transform data for analysis. Data analysis was performed to assess and compare the arrival delays for the two airlines present in the data. The analysis provides an example of Simpson’s Paradox, a phenomenon in which a trend appears in several groups of data but disappears or reverses when the groups are combined.
Read in csv file as data frame and fill blank strings as NA values
df <- suppressWarnings(read.csv("https://raw.githubusercontent.com/okazevedo90/DATA607_GitHub/refs/heads/main/DATA607_Assignment4Data.csv", header = T, na.strings = c("")))
Forward fill NA airline values
df <- df |>
fill(Airline)
Pivot data into tidy form via lengthening the data to transform the 5 destination arrival count columns into two columns, Destination and Delay_Count. Pivoting the data in this way makes each observation a row and each column a variable; thus, making the data tidy.
long_df <- df |>
pivot_longer(
cols = c("Los_Angeles", "Phoenix", "San_Diego", "San_Franciso", "Seattle"),
names_to = "Destination",
values_to = "Delay_Count"
) |>
transform(Destination = str_replace(Destination, "_", " "))
Total Airline Arrival Counts
long_df |>
group_by(Airline, Delay_Type) |>
summarise(Delay_Count_Sum = sum(Delay_Count), .groups = "keep") |>
ggplot(aes(x = Delay_Type, y = Delay_Count_Sum, fill = Airline)) +
geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
scale_fill_manual(values = c("tomato1", 'steelblue')) +
labs(
title = "Total Airline Arrivals",
x = "Arrival Type",
y = "Arrival Count",
fill = "Airline"
) +
theme_classic() +
theme(
axis.text.x = element_text(size = 11),
plot.title = element_text(size = 14, face = "bold"))
Total Airline Arrival Counts by Destination
long_df |>
group_by(Destination, Delay_Type) |>
ggplot(aes(x = Destination, y = Delay_Count, fill = Airline)) +
geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
facet_wrap(~Delay_Type) +
labs(
title = "Total Airline Arrivals by Destination",
x = "Destination",
y = "Arrival Count",
fill = "Airline"
) +
scale_fill_manual(values = c("tomato1", 'steelblue')) +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
plot.title = element_text(size = 14, face = "bold"))
Figure 1: Total Airline Arrival Counts by Destination
Airline Arrival Distribution per Total Airline Arrivals
Example Interpretation: About 13% of all Alaska airline arrivals are delayed
airline_prcts = long_df |>
group_by(Airline) |>
mutate(Airline_Total_Count = sum(Delay_Count)) |>
group_by(Airline, Delay_Type) |>
mutate(Airline_Delay_Count = sum(Delay_Count)) |>
mutate(Airline_Delay_Percent = round(Airline_Delay_Count / Airline_Total_Count, 4) * 100) |>
distinct(Airline, Delay_Type, Airline_Delay_Count, Airline_Total_Count, Airline_Delay_Percent)
airline_prcts |>
ggplot(aes(x = Delay_Type, y = Airline_Delay_Percent, fill = Airline)) +
geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
scale_fill_manual(values = c("tomato1", 'steelblue')) +
labs(
title = "Airline Arrival Distribution per Total Airline Arrivals",
x = "Arrival Type",
y = "Arrival Count",
fill = "Airline"
) +
theme_classic() +
theme(
axis.text.x = element_text(size = 11),
plot.title = element_text(size = 14, face = "bold"))
| Airline | Delay_Type | Airline_Delay_Count | Airline_Total_Count | Airline_Delay_Percent |
|---|---|---|---|---|
| Alaska | Delayed | 501 | 3775 | 13.27 |
| AM West | Delayed | 787 | 7225 | 10.89 |
Airline Arrivals Distribution per Total Airline Arrivals & Destination
Example Interpretation: About 29% of AM West Seattle arrivals are delayed
airline_dest_prcts = long_df |>
group_by(Airline, Destination) |>
mutate(Airline_Dest_Total_Count = sum(Delay_Count)) |>
group_by(Airline, Destination, Delay_Type) |>
mutate(Airline_Dest_Delay_Percent = round(Delay_Count / Airline_Dest_Total_Count, 4) * 100) |>
distinct(Airline, Destination, Delay_Type, Delay_Count, Airline_Dest_Total_Count, Airline_Dest_Delay_Percent)
airline_dest_prcts |>
ggplot(aes(x = Destination, y = Airline_Dest_Delay_Percent, fill = Airline)) +
geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
scale_fill_manual(values = c("tomato1", 'steelblue')) +
labs(
title = "Airline Arrival Distribution per Total Airline Arrivals & Destination",
x = "Destination",
y = "Percent of Airline Arrivals",
fill = "Airline"
) +
facet_wrap(~Delay_Type) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
plot.title = element_text(size = 14, face = "bold"))
| Airline | Destination | Delay_Type | Delay_Count | Airline_Dest_Total_Count | Airline_Dest_Delay_Percent |
|---|---|---|---|---|---|
| Alaska | Los Angeles | Delayed | 62 | 559 | 11.09 |
| Alaska | Phoenix | Delayed | 12 | 233 | 5.15 |
| Alaska | San Diego | Delayed | 20 | 232 | 8.62 |
| Alaska | San Franciso | Delayed | 102 | 605 | 16.86 |
| Alaska | Seattle | Delayed | 305 | 2146 | 14.21 |
| AM West | Los Angeles | Delayed | 117 | 811 | 14.43 |
| AM West | Phoenix | Delayed | 415 | 5255 | 7.9 |
| AM West | San Diego | Delayed | 65 | 448 | 14.51 |
| AM West | San Franciso | Delayed | 129 | 449 | 28.73 |
| AM West | Seattle | Delayed | 61 | 262 | 23.28 |
Airline Arrivals per Total Arrival Type
Example Interpretation: AM West accounts for about 61% of all delays and 66% of on time arrivals for all AM West and Alaska arrivals
delay_prcts = long_df |>
group_by(Delay_Type) |>
mutate(Delay_Type_Total_Count = sum(Delay_Count)) |>
group_by(Airline, Delay_Type) |>
mutate(Airline_Delay_Count = sum(Delay_Count)) |>
mutate(Delay_Type_Percent = round(Airline_Delay_Count / Delay_Type_Total_Count, 4) * 100) |>
distinct(Airline, Delay_Type, Airline_Delay_Count, Delay_Type_Total_Count, Delay_Type_Percent)
delay_prcts |>
ggplot(aes(x = Delay_Type, y = Delay_Type_Percent, fill = Airline)) +
geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
scale_fill_manual(values = c("tomato1", 'steelblue')) +
labs(
title = "Airline Arrival Distribution per Total Arrival Types",
x = "Arrival Type",
y = "Percent of Arrival Types",
fill = "Airline"
) +
theme_classic() +
theme(
axis.text.x = element_text(size = 11),
plot.title = element_text(size = 14, face = "bold"))
Figure 2: Airline Arrivals per Total Arrival Type
| Airline | Delay_Type | Airline_Delay_Count | Delay_Type_Total_Count | Delay_Type_Percent |
|---|---|---|---|---|
| Alaska | Delayed | 501 | 1288 | 38.9 |
| AM West | Delayed | 787 | 1288 | 61.1 |
Airline Arrivals per Total Arrival Types & Destination
Example Interpretation: AM West accounts for about 97% of all delayed Phoenix arrivals for both airlines
delay_des_prcts = long_df |>
group_by(Destination, Delay_Type) |>
mutate(Total_Delay_Type_Counts = sum(Delay_Count)) |>
group_by(Airline, Destination, Delay_Type) |>
mutate(Total_Des_Delay_Counts = sum(Delay_Count)) |>
mutate(Delay_Des_Percent = round(Total_Des_Delay_Counts / Total_Delay_Type_Counts, 4) * 100) |>
distinct(Airline, Destination, Delay_Type, Total_Des_Delay_Counts, Total_Delay_Type_Counts, Delay_Des_Percent)
delay_des_prcts |>
ggplot(aes(x = Destination, y = Delay_Des_Percent, fill = Airline)) +
geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
scale_fill_manual(values = c("tomato1", 'steelblue')) +
labs(
title = "Airline Arrival Distribution per Total Arrival Types & Destination",
x = "Destination",
y = "Percent of Arrivals",
fill = "Arrival Delay Type"
) +
facet_wrap(~Delay_Type) +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
plot.title = element_text(size = 14, face = "bold"))
| Airline | Destination | Delay_Type | Total_Des_Delay_Counts | Total_Delay_Type_Counts | Delay_Des_Percent |
|---|---|---|---|---|---|
| Alaska | Los Angeles | Delayed | 62 | 179 | 34.64 |
| Alaska | Phoenix | Delayed | 12 | 427 | 2.81 |
| Alaska | San Diego | Delayed | 20 | 85 | 23.53 |
| Alaska | San Franciso | Delayed | 102 | 231 | 44.16 |
| Alaska | Seattle | Delayed | 305 | 366 | 83.33 |
| AM West | Los Angeles | Delayed | 117 | 179 | 65.36 |
| AM West | Phoenix | Delayed | 415 | 427 | 97.19 |
| AM West | San Diego | Delayed | 65 | 85 | 76.47 |
| AM West | San Franciso | Delayed | 129 | 231 | 55.84 |
| AM West | Seattle | Delayed | 61 | 366 | 16.67 |
Airline Arrivals Percent Difference per Total Arrival Types & Destination (differences from above plot)
Example Interpretation: Alaska airline has a Seattle on time arrival rate about 80% greater than the AM West Seattle on time arrival rate.
delay_des_prcts |>
ungroup() |>
group_by(Destination, Delay_Type) |>
mutate(Des_Delay_Prct_Diff = abs(diff(Delay_Des_Percent))) |>
filter(Delay_Des_Percent == max(Delay_Des_Percent)) |>
distinct(Airline, Destination, Delay_Type, Delay_Des_Percent, Des_Delay_Prct_Diff) |>
ggplot(aes(x = Destination, y = Des_Delay_Prct_Diff, fill = Airline)) +
geom_col(position = position_dodge2(width = 0.3, preserve = "single")) +
scale_fill_manual(values = c("tomato1", 'steelblue')) +
facet_wrap(~Delay_Type) +
labs(
title = "Airline Arrivals Percent Different per Total Arrival Types & Destination",
x = "Destination",
y = "Arrival Percent Difference",
fill = "Airline"
) +
theme_classic() +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust=1),
plot.title = element_text(size = 14, face = "bold"))
Figure 3: Airline Arrivals Percent Difference per Total Arrival Types & Destination
This analysis illustrates an example of Simpson’s Paradox or a statistical phenomenon that occurs when a trend in data reverses when the data is broken down into subgroups.
When comparing Alaska and AM West airline arrival delays per each respective airline, the results display that Alaska has more delayed arrivals, with 13% of Alaska arrivals being delayed (see Table 1). However, when taking into account arrival destination, the delayed arrival percentages reveal that AM West has a higher rate of delays for every arrival destination.
The paradoxical conclusion is that AM West has more arrival delays for each arrival destination, yet Alaska appears to have overall more delays when considering all destinations at the same time.
This paradox happens because two effects occur together:
Analyzing arrivals by percent of delays per destination provides further insight. AM West has a higher delay rate for 4/5 destinations as well as higher on time rates for 3/5 destinations. Conversely, Alaska has a higher delay rate for only 1/5 destinations and higher on time rates for 2/5 destinations. See Figure 3. In addition, the airline with the highest delay rate per destination also has the highest on time rate for each respective destination except for San Francisco. San Francisco is the only destination in which one airline (AM West) has a higher percentage of delays and the other airline (Alaska) has the higher percentage of on time arrivals. See Table 4.
To continue this analysis I would recommend weighting the Alaska airline delay counts and reassess if Simpson’s paradox is still present given the adjusted group sizes.