Introduction

This assignment focuses on tidying and transforming data related to airline arrival delays.
We begin with a dataset in wide format that contains counts of delayed arrivals for two airlines across five cities.

The analysis includes:
1. Recreating the dataset with missing values.
2. Tidying the data (wide → long).
3. Filling missing values.
4. Comparing two airlines’ performances overall and by city.
5. Discussing discrepancies between overall and city-level comparisons.


Step 1: Create and Save the Wide Dataset

# Wide format dataset with some missing values
flights_wide <- tribble(
  ~City, ~AirlineA_OnTime, ~AirlineA_Delayed, ~AirlineB_OnTime, ~AirlineB_Delayed,
  "New York",   497, 62,   694, 117,
  "Chicago",    221, 12,   484, 19,
  "Houston",    503, 20,   320, NA,   # Missing value
  "Los Angeles",212, 8,    914, 102,
  "Phoenix",    184, NA,   384, 36    # Missing value
)

# Save to CSV
write_csv(flights_wide, "flight_delays_wide.csv")
flights_wide
## # A tibble: 5 Ă— 5
##   City        AirlineA_OnTime AirlineA_Delayed AirlineB_OnTime AirlineB_Delayed
##   <chr>                 <dbl>            <dbl>           <dbl>            <dbl>
## 1 New York                497               62             694              117
## 2 Chicago                 221               12             484               19
## 3 Houston                 503               20             320               NA
## 4 Los Angeles             212                8             914              102
## 5 Phoenix                 184               NA             384               36

Step 2: Read the Dataset Into R

flights <- read_csv("flight_delays_wide.csv")
flights
## # A tibble: 5 Ă— 5
##   City        AirlineA_OnTime AirlineA_Delayed AirlineB_OnTime AirlineB_Delayed
##   <chr>                 <dbl>            <dbl>           <dbl>            <dbl>
## 1 New York                497               62             694              117
## 2 Chicago                 221               12             484               19
## 3 Houston                 503               20             320               NA
## 4 Los Angeles             212                8             914              102
## 5 Phoenix                 184               NA             384               36

Step 3: Tidy the Data (Wide → Long)

flights_long <- flights %>%
  pivot_longer(
    cols = -City,
    names_to = c("Airline", "Status"),
    names_sep = "_"
  )

flights_long
## # A tibble: 20 Ă— 4
##    City        Airline  Status  value
##    <chr>       <chr>    <chr>   <dbl>
##  1 New York    AirlineA OnTime    497
##  2 New York    AirlineA Delayed    62
##  3 New York    AirlineB OnTime    694
##  4 New York    AirlineB Delayed   117
##  5 Chicago     AirlineA OnTime    221
##  6 Chicago     AirlineA Delayed    12
##  7 Chicago     AirlineB OnTime    484
##  8 Chicago     AirlineB Delayed    19
##  9 Houston     AirlineA OnTime    503
## 10 Houston     AirlineA Delayed    20
## 11 Houston     AirlineB OnTime    320
## 12 Houston     AirlineB Delayed    NA
## 13 Los Angeles AirlineA OnTime    212
## 14 Los Angeles AirlineA Delayed     8
## 15 Los Angeles AirlineB OnTime    914
## 16 Los Angeles AirlineB Delayed   102
## 17 Phoenix     AirlineA OnTime    184
## 18 Phoenix     AirlineA Delayed    NA
## 19 Phoenix     AirlineB OnTime    384
## 20 Phoenix     AirlineB Delayed    36

Step 4: Handle Missing Data

# Replace missing values with 0 (or could impute NA if desired)
flights_long <- flights_long %>%
  mutate(value = replace_na(value, 0))

flights_long
## # A tibble: 20 Ă— 4
##    City        Airline  Status  value
##    <chr>       <chr>    <chr>   <dbl>
##  1 New York    AirlineA OnTime    497
##  2 New York    AirlineA Delayed    62
##  3 New York    AirlineB OnTime    694
##  4 New York    AirlineB Delayed   117
##  5 Chicago     AirlineA OnTime    221
##  6 Chicago     AirlineA Delayed    12
##  7 Chicago     AirlineB OnTime    484
##  8 Chicago     AirlineB Delayed    19
##  9 Houston     AirlineA OnTime    503
## 10 Houston     AirlineA Delayed    20
## 11 Houston     AirlineB OnTime    320
## 12 Houston     AirlineB Delayed     0
## 13 Los Angeles AirlineA OnTime    212
## 14 Los Angeles AirlineA Delayed     8
## 15 Los Angeles AirlineB OnTime    914
## 16 Los Angeles AirlineB Delayed   102
## 17 Phoenix     AirlineA OnTime    184
## 18 Phoenix     AirlineA Delayed     0
## 19 Phoenix     AirlineB OnTime    384
## 20 Phoenix     AirlineB Delayed    36

Step 5: Percentage Comparison – Overall

overall <- flights_long %>%
  group_by(Airline, Status) %>%
  summarise(Total = sum(value), .groups = "drop") %>%
  group_by(Airline) %>%
  mutate(Percent = round(100 * Total / sum(Total), 1))

overall
## # A tibble: 4 Ă— 4
## # Groups:   Airline [2]
##   Airline  Status  Total Percent
##   <chr>    <chr>   <dbl>   <dbl>
## 1 AirlineA Delayed   102     5.9
## 2 AirlineA OnTime   1617    94.1
## 3 AirlineB Delayed   274     8.9
## 4 AirlineB OnTime   2796    91.1
ggplot(overall, aes(x = Airline, y = Percent, fill = Status)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Overall Arrival Status by Airline", y = "Percentage", x = "Airline")

Findings: Airline A has a slightly higher proportion of on-time flights compared to Airline B overall.


Step 6: Percentage Comparison – By City

city_summary <- flights_long %>%
  group_by(City, Airline, Status) %>%
  summarise(Total = sum(value), .groups = "drop") %>%
  group_by(City, Airline) %>%
  mutate(Percent = round(100 * Total / sum(Total), 1))

city_summary
## # A tibble: 20 Ă— 5
## # Groups:   City, Airline [10]
##    City        Airline  Status  Total Percent
##    <chr>       <chr>    <chr>   <dbl>   <dbl>
##  1 Chicago     AirlineA Delayed    12     5.2
##  2 Chicago     AirlineA OnTime    221    94.8
##  3 Chicago     AirlineB Delayed    19     3.8
##  4 Chicago     AirlineB OnTime    484    96.2
##  5 Houston     AirlineA Delayed    20     3.8
##  6 Houston     AirlineA OnTime    503    96.2
##  7 Houston     AirlineB Delayed     0     0  
##  8 Houston     AirlineB OnTime    320   100  
##  9 Los Angeles AirlineA Delayed     8     3.6
## 10 Los Angeles AirlineA OnTime    212    96.4
## 11 Los Angeles AirlineB Delayed   102    10  
## 12 Los Angeles AirlineB OnTime    914    90  
## 13 New York    AirlineA Delayed    62    11.1
## 14 New York    AirlineA OnTime    497    88.9
## 15 New York    AirlineB Delayed   117    14.4
## 16 New York    AirlineB OnTime    694    85.6
## 17 Phoenix     AirlineA Delayed     0     0  
## 18 Phoenix     AirlineA OnTime    184   100  
## 19 Phoenix     AirlineB Delayed    36     8.6
## 20 Phoenix     AirlineB OnTime    384    91.4
ggplot(city_summary, aes(x = City, y = Percent, fill = Status)) +
  geom_bar(stat = "identity", position = "stack") +
  facet_wrap(~Airline) +
  labs(title = "Arrival Status by Airline and City", y = "Percentage", x = "City")

Findings: Performance varies significantly by city. For example, Airline A performs strongly in Houston, while Airline B performs better in Los Angeles.


Step 7: Discrepancy Analysis

Description:
- When looking overall, Airline A appears to perform slightly better.
- When examining city-by-city, Airline B sometimes has higher on-time percentages (e.g., Los Angeles).
- This contrast illustrates how aggregated data can obscure local differences.

Explanation:
This discrepancy is an example of Simpson’s Paradox:
- Differences at the city level do not always align with the overall trend.
- The overall performance depends on how many flights each airline operates in each city.
- If one airline has more flights in cities where delays are common, its overall performance may look worse, even if it performs well in other cities.


Conclusion

This exercise demonstrates the importance of:
- Tidying data for easier analysis.
- Handling missing values carefully.
- Comparing both aggregate and disaggregated results to avoid misleading conclusions.

The findings show that while Airline A performs better overall, city-level breakdowns reveal strengths for Airline B in specific markets.