This assignment focuses on tidying and transforming data related to
airline arrival delays.
We begin with a dataset in wide format that contains
counts of delayed arrivals for two airlines across five cities.
The analysis includes:
1. Recreating the dataset with missing values.
2. Tidying the data (wide → long).
3. Filling missing values.
4. Comparing two airlines’ performances overall and by city.
5. Discussing discrepancies between overall and city-level
comparisons.
# Wide format dataset with some missing values
flights_wide <- tribble(
~City, ~AirlineA_OnTime, ~AirlineA_Delayed, ~AirlineB_OnTime, ~AirlineB_Delayed,
"New York", 497, 62, 694, 117,
"Chicago", 221, 12, 484, 19,
"Houston", 503, 20, 320, NA, # Missing value
"Los Angeles",212, 8, 914, 102,
"Phoenix", 184, NA, 384, 36 # Missing value
)
# Save to CSV
write_csv(flights_wide, "flight_delays_wide.csv")
flights_wide
## # A tibble: 5 Ă— 5
## City AirlineA_OnTime AirlineA_Delayed AirlineB_OnTime AirlineB_Delayed
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 New York 497 62 694 117
## 2 Chicago 221 12 484 19
## 3 Houston 503 20 320 NA
## 4 Los Angeles 212 8 914 102
## 5 Phoenix 184 NA 384 36
flights <- read_csv("flight_delays_wide.csv")
flights
## # A tibble: 5 Ă— 5
## City AirlineA_OnTime AirlineA_Delayed AirlineB_OnTime AirlineB_Delayed
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 New York 497 62 694 117
## 2 Chicago 221 12 484 19
## 3 Houston 503 20 320 NA
## 4 Los Angeles 212 8 914 102
## 5 Phoenix 184 NA 384 36
flights_long <- flights %>%
pivot_longer(
cols = -City,
names_to = c("Airline", "Status"),
names_sep = "_"
)
flights_long
## # A tibble: 20 Ă— 4
## City Airline Status value
## <chr> <chr> <chr> <dbl>
## 1 New York AirlineA OnTime 497
## 2 New York AirlineA Delayed 62
## 3 New York AirlineB OnTime 694
## 4 New York AirlineB Delayed 117
## 5 Chicago AirlineA OnTime 221
## 6 Chicago AirlineA Delayed 12
## 7 Chicago AirlineB OnTime 484
## 8 Chicago AirlineB Delayed 19
## 9 Houston AirlineA OnTime 503
## 10 Houston AirlineA Delayed 20
## 11 Houston AirlineB OnTime 320
## 12 Houston AirlineB Delayed NA
## 13 Los Angeles AirlineA OnTime 212
## 14 Los Angeles AirlineA Delayed 8
## 15 Los Angeles AirlineB OnTime 914
## 16 Los Angeles AirlineB Delayed 102
## 17 Phoenix AirlineA OnTime 184
## 18 Phoenix AirlineA Delayed NA
## 19 Phoenix AirlineB OnTime 384
## 20 Phoenix AirlineB Delayed 36
# Replace missing values with 0 (or could impute NA if desired)
flights_long <- flights_long %>%
mutate(value = replace_na(value, 0))
flights_long
## # A tibble: 20 Ă— 4
## City Airline Status value
## <chr> <chr> <chr> <dbl>
## 1 New York AirlineA OnTime 497
## 2 New York AirlineA Delayed 62
## 3 New York AirlineB OnTime 694
## 4 New York AirlineB Delayed 117
## 5 Chicago AirlineA OnTime 221
## 6 Chicago AirlineA Delayed 12
## 7 Chicago AirlineB OnTime 484
## 8 Chicago AirlineB Delayed 19
## 9 Houston AirlineA OnTime 503
## 10 Houston AirlineA Delayed 20
## 11 Houston AirlineB OnTime 320
## 12 Houston AirlineB Delayed 0
## 13 Los Angeles AirlineA OnTime 212
## 14 Los Angeles AirlineA Delayed 8
## 15 Los Angeles AirlineB OnTime 914
## 16 Los Angeles AirlineB Delayed 102
## 17 Phoenix AirlineA OnTime 184
## 18 Phoenix AirlineA Delayed 0
## 19 Phoenix AirlineB OnTime 384
## 20 Phoenix AirlineB Delayed 36
overall <- flights_long %>%
group_by(Airline, Status) %>%
summarise(Total = sum(value), .groups = "drop") %>%
group_by(Airline) %>%
mutate(Percent = round(100 * Total / sum(Total), 1))
overall
## # A tibble: 4 Ă— 4
## # Groups: Airline [2]
## Airline Status Total Percent
## <chr> <chr> <dbl> <dbl>
## 1 AirlineA Delayed 102 5.9
## 2 AirlineA OnTime 1617 94.1
## 3 AirlineB Delayed 274 8.9
## 4 AirlineB OnTime 2796 91.1
ggplot(overall, aes(x = Airline, y = Percent, fill = Status)) +
geom_bar(stat = "identity", position = "stack") +
labs(title = "Overall Arrival Status by Airline", y = "Percentage", x = "Airline")
Findings: Airline A has a slightly higher proportion of on-time flights compared to Airline B overall.
city_summary <- flights_long %>%
group_by(City, Airline, Status) %>%
summarise(Total = sum(value), .groups = "drop") %>%
group_by(City, Airline) %>%
mutate(Percent = round(100 * Total / sum(Total), 1))
city_summary
## # A tibble: 20 Ă— 5
## # Groups: City, Airline [10]
## City Airline Status Total Percent
## <chr> <chr> <chr> <dbl> <dbl>
## 1 Chicago AirlineA Delayed 12 5.2
## 2 Chicago AirlineA OnTime 221 94.8
## 3 Chicago AirlineB Delayed 19 3.8
## 4 Chicago AirlineB OnTime 484 96.2
## 5 Houston AirlineA Delayed 20 3.8
## 6 Houston AirlineA OnTime 503 96.2
## 7 Houston AirlineB Delayed 0 0
## 8 Houston AirlineB OnTime 320 100
## 9 Los Angeles AirlineA Delayed 8 3.6
## 10 Los Angeles AirlineA OnTime 212 96.4
## 11 Los Angeles AirlineB Delayed 102 10
## 12 Los Angeles AirlineB OnTime 914 90
## 13 New York AirlineA Delayed 62 11.1
## 14 New York AirlineA OnTime 497 88.9
## 15 New York AirlineB Delayed 117 14.4
## 16 New York AirlineB OnTime 694 85.6
## 17 Phoenix AirlineA Delayed 0 0
## 18 Phoenix AirlineA OnTime 184 100
## 19 Phoenix AirlineB Delayed 36 8.6
## 20 Phoenix AirlineB OnTime 384 91.4
ggplot(city_summary, aes(x = City, y = Percent, fill = Status)) +
geom_bar(stat = "identity", position = "stack") +
facet_wrap(~Airline) +
labs(title = "Arrival Status by Airline and City", y = "Percentage", x = "City")
Findings: Performance varies significantly by city. For example, Airline A performs strongly in Houston, while Airline B performs better in Los Angeles.
Description:
- When looking overall, Airline A appears to perform
slightly better.
- When examining city-by-city, Airline B sometimes has
higher on-time percentages (e.g., Los Angeles).
- This contrast illustrates how aggregated data can obscure local
differences.
Explanation:
This discrepancy is an example of Simpson’s
Paradox:
- Differences at the city level do not always align with the overall
trend.
- The overall performance depends on how many flights each airline
operates in each city.
- If one airline has more flights in cities where delays are common, its
overall performance may look worse, even if it performs well in other
cities.
This exercise demonstrates the importance of:
- Tidying data for easier analysis.
- Handling missing values carefully.
- Comparing both aggregate and disaggregated results to
avoid misleading conclusions.
The findings show that while Airline A performs better overall, city-level breakdowns reveal strengths for Airline B in specific markets.