Tidying and Transforming Data

1. Introduction

In this assignment, I worked with flight delay data for two airlines across five cities.
The goal was to tidy the dataset, transform it into long format, and compare on-time vs delayed flights both overall and by city.

2. Load Libraries

I started by loading the tidyverse package, which includes the tools for data manipulation and visualization.

library(tidyverse)

3. Create the Dataset

The assignment provided data in a table. Since I only had the raw numbers, I created the dataset manually in R.

delays <- data.frame(
  Airline = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Status = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  San.Diego = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle = c(1841, 305, 201, 61)
)

delays

##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

4. Check for Missing Data

I verified whether there were missing values in the dataset. Since there were none, no imputation was needed.

sum(is.na(delays))

## [1] 0

5. Transform Data

To make analysis easier, I reshaped the dataset into long format using pivot_longer().

delays_long <- delays %>%
  pivot_longer(cols = -c(Airline, Status),
               names_to = "City",
               values_to = "Count")

delays_long

## # A tibble: 20 × 4
##    Airline Status  City          Count
##    <chr>   <chr>   <chr>         <dbl>
##  1 ALASKA  on time Los.Angeles     497
##  2 ALASKA  on time Phoenix         221
##  3 ALASKA  on time San.Diego       212
##  4 ALASKA  on time San.Francisco   503
##  5 ALASKA  on time Seattle        1841
##  6 ALASKA  delayed Los.Angeles      62
##  7 ALASKA  delayed Phoenix          12
##  8 ALASKA  delayed San.Diego        20
##  9 ALASKA  delayed San.Francisco   102
## 10 ALASKA  delayed Seattle         305
## 11 AM WEST on time Los.Angeles     694
## 12 AM WEST on time Phoenix        4840
## 13 AM WEST on time San.Diego       383
## 14 AM WEST on time San.Francisco   320
## 15 AM WEST on time Seattle         201
## 16 AM WEST delayed Los.Angeles     117
## 17 AM WEST delayed Phoenix         415
## 18 AM WEST delayed San.Diego        65
## 19 AM WEST delayed San.Francisco   129
## 20 AM WEST delayed Seattle          61

6. Overall Percentages

Next, I calculated the overall percentages of on-time vs delayed flights for each airline.

overall <- delays_long %>%
  group_by(Airline, Status) %>%
  summarise(Total = sum(Count), .groups = "drop") %>%
  mutate(Percent = round(100 * Total / sum(Total), 1))

overall

## # A tibble: 4 × 4
##   Airline Status  Total Percent
##   <chr>   <chr>   <dbl>   <dbl>
## 1 ALASKA  delayed   501     4.6
## 2 ALASKA  on time  3274    29.8
## 3 AM WEST delayed   787     7.2
## 4 AM WEST on time  6438    58.5

Explanation: Overall, Alaska Airlines had fewer delays in proportion compared to AM West. The imbalance is influenced by AM West’s large number of flights out of Phoenix.

7. Percentages by City

I then compared percentages of delays across the five cities.

by_city <- delays_long %>%
  group_by(Airline, City, Status) %>%
  summarise(Total = sum(Count), .groups = "drop") %>%
  group_by(City) %>%
  mutate(Percent = round(100 * Total / sum(Total), 1))

by_city

## # A tibble: 20 × 5
## # Groups:   City [5]
##    Airline City          Status  Total Percent
##    <chr>   <chr>         <chr>   <dbl>   <dbl>
##  1 ALASKA  Los.Angeles   delayed    62     4.5
##  2 ALASKA  Los.Angeles   on time   497    36.3
##  3 ALASKA  Phoenix       delayed    12     0.2
##  4 ALASKA  Phoenix       on time   221     4  
##  5 ALASKA  San.Diego     delayed    20     2.9
##  6 ALASKA  San.Diego     on time   212    31.2
##  7 ALASKA  San.Francisco delayed   102     9.7
##  8 ALASKA  San.Francisco on time   503    47.7
##  9 ALASKA  Seattle       delayed   305    12.7
## 10 ALASKA  Seattle       on time  1841    76.5
## 11 AM WEST Los.Angeles   delayed   117     8.5
## 12 AM WEST Los.Angeles   on time   694    50.7
## 13 AM WEST Phoenix       delayed   415     7.6
## 14 AM WEST Phoenix       on time  4840    88.2
## 15 AM WEST San.Diego     delayed    65     9.6
## 16 AM WEST San.Diego     on time   383    56.3
## 17 AM WEST San.Francisco delayed   129    12.2
## 18 AM WEST San.Francisco on time   320    30.4
## 19 AM WEST Seattle       delayed    61     2.5
## 20 AM WEST Seattle       on time   201     8.3

Explanation: When broken down by city, Alaska performed better in Phoenix, but worse in Seattle. AM West looked worse overall because of its very high delay counts in Phoenix.

8. Analysis of Discrepancy

Comparing the overall results to the city-by-city breakdown reveals a discrepancy.

City-by-city: Alaska sometimes performs worse (Seattle).

Overall: AM West looks much worse. This happens because Phoenix has so many AM West flights that its delay rate dominates the overall average. This is an example of Simpson’s paradox, where aggregated data can give a misleading impression.

9. Conclusion

Through this analysis, I tidied the dataset, transformed it into a long format, and compared delay rates across airlines and cities. The key insight is that while AM West appears worse overall, Alaska struggles in certain cities, showing the importance of looking at both aggregated and disaggregated data.