Tidying and Transforming Data

1. Introduction

In this assignment, I worked with flight delay data for two airlines across five cities. The goal was to tidy the dataset, transform it into long format, and compare on-time vs delayed flights both overall and by city.

2. Load Libraries

I started by loading the tidyverse package, which includes the tools for data manipulation and visualization.

library(tidyverse)

3. Create the Dataset

The assignment provided data in a table. Since I only had the raw numbers, I created the dataset manually in R.

delays <- data.frame(
  Airline = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Status = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  San.Diego = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle = c(1841, 305, 201, 61)
)

delays

##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

4. Check for Missing Data

I verified whether there were missing values in the dataset. Since there were none, no imputation was needed.

sum(is.na(delays))

## [1] 0

5. Transform Data

To make analysis easier, I reshaped the dataset into long format using pivot_longer().

delays_long <- delays %>%
  pivot_longer(cols = -c(Airline, Status),
               names_to = "City",
               values_to = "Count")

delays_long

## # A tibble: 20 × 4
##    Airline Status  City          Count
##    <chr>   <chr>   <chr>         <dbl>
##  1 ALASKA  on time Los.Angeles     497
##  2 ALASKA  on time Phoenix         221
##  3 ALASKA  on time San.Diego       212
##  4 ALASKA  on time San.Francisco   503
##  5 ALASKA  on time Seattle        1841
##  6 ALASKA  delayed Los.Angeles      62
##  7 ALASKA  delayed Phoenix          12
##  8 ALASKA  delayed San.Diego        20
##  9 ALASKA  delayed San.Francisco   102
## 10 ALASKA  delayed Seattle         305
## 11 AM WEST on time Los.Angeles     694
## 12 AM WEST on time Phoenix        4840
## 13 AM WEST on time San.Diego       383
## 14 AM WEST on time San.Francisco   320
## 15 AM WEST on time Seattle         201
## 16 AM WEST delayed Los.Angeles     117
## 17 AM WEST delayed Phoenix         415
## 18 AM WEST delayed San.Diego        65
## 19 AM WEST delayed San.Francisco   129
## 20 AM WEST delayed Seattle          61

6. Overall Percentages

Next, I calculated the overall percentages of on-time vs delayed flights for each airline.

overall <- delays_long %>%
group_by(Airline, Status) %>%
summarise(Total = sum(Count), .groups = "drop") %>%
group_by(Airline) %>%
mutate(Percent = round(100 * Total / sum(Total), 1))

overall

## # A tibble: 4 × 4
## # Groups:   Airline [2]
##   Airline Status  Total Percent
##   <chr>   <chr>   <dbl>   <dbl>
## 1 ALASKA  delayed   501    13.3
## 2 ALASKA  on time  3274    86.7
## 3 AM WEST delayed   787    10.9
## 4 AM WEST on time  6438    89.1

Observation: The overall percentages indicate that Alaska experiences a higher proportion of delays (13.3%) compared to AM West (10.9%). Based on these totals alone, Alaska appears to perform worse overall.

7. Percentages by City

I then compared percentages of delays across the five cities.

by_city <- delays_long %>%
group_by(Airline, City, Status) %>%
summarise(Total = sum(Count), .groups = "drop") %>%
group_by(Airline, City) %>%
mutate(Percent = round(100 * Total / sum(Total), 1))

by_city

## # A tibble: 20 × 5
## # Groups:   Airline, City [10]
##    Airline City          Status  Total Percent
##    <chr>   <chr>         <chr>   <dbl>   <dbl>
##  1 ALASKA  Los.Angeles   delayed    62    11.1
##  2 ALASKA  Los.Angeles   on time   497    88.9
##  3 ALASKA  Phoenix       delayed    12     5.2
##  4 ALASKA  Phoenix       on time   221    94.8
##  5 ALASKA  San.Diego     delayed    20     8.6
##  6 ALASKA  San.Diego     on time   212    91.4
##  7 ALASKA  San.Francisco delayed   102    16.9
##  8 ALASKA  San.Francisco on time   503    83.1
##  9 ALASKA  Seattle       delayed   305    14.2
## 10 ALASKA  Seattle       on time  1841    85.8
## 11 AM WEST Los.Angeles   delayed   117    14.4
## 12 AM WEST Los.Angeles   on time   694    85.6
## 13 AM WEST Phoenix       delayed   415     7.9
## 14 AM WEST Phoenix       on time  4840    92.1
## 15 AM WEST San.Diego     delayed    65    14.5
## 16 AM WEST San.Diego     on time   383    85.5
## 17 AM WEST San.Francisco delayed   129    28.7
## 18 AM WEST San.Francisco on time   320    71.3
## 19 AM WEST Seattle       delayed    61    23.3
## 20 AM WEST Seattle       on time   201    76.7

Observation: When the data is broken down by city, AM West has a higher delay percentage in every single city.

8. Analysis of Discrepancy

The contrast between the city-by-city results and the overall totals reveals a clear example of Simpson’s paradox.

City-by-city: AM West performs worse in every city.

Overall: Alaska appears worse. This happens because AM West has an extremely large number of flights in Phoenix, and Phoenix has a relatively low delay percentage. Because Phoenix dominates the total number of flights, it pulls AM West’s overall delay percentage down, even though AM West is worse in each individual city.

9. Conclusion

Through this analysis, I tidied the dataset, reshaped it into long format, and compared delay rates across both airlines and cities. The key insight is that AM West performs worse in every city, yet Alaska appears worse overall due to Simpson’s paradox. This demonstrates the importance of examining both aggregated and disaggregated data to avoid misleading conclusions.