Introduction

The objective is to be able to transform a wide format data structure into a long format where you have ‘tidy’ the data to perform the analysis easier. The data is flight arrival counts from two airlines in five different cities.

Required Libraries

library(tidyverse)
library(pollster)

Import Data

Import wide format of CSV data

df <- read_csv('wide_airline_data.csv')
knitr::kable(df)
…1 …2 Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61

Rename missing column names with ‘carrier’ and ‘status’

df <- df %>% 
  rename(carrier = 1,
         status = 2)
knitr::kable(df)
carrier status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61

Drop Rows

Drop rows that contain N/A in every column

# remove rows where ALL columns 'N/A'
df <- df %>% 
  filter(if_any(everything(), ~ !is.na(.)))

knitr::kable(df)
carrier status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61

Forward fill missing airline carrier names

df <- df %>% 
  fill(carrier)

knitr::kable(df)
carrier status Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

Transform columns with city names into long format and name new columns city and flight_count

df <- df %>% 
  gather("city", "flight_count", -c("carrier", "status"))

knitr::kable(df)
carrier status city flight_count
ALASKA on time Los Angeles 497
ALASKA delayed Los Angeles 62
AM WEST on time Los Angeles 694
AM WEST delayed Los Angeles 117
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12
AM WEST on time Phoenix 4840
AM WEST delayed Phoenix 415
ALASKA on time San Diego 212
ALASKA delayed San Diego 20
AM WEST on time San Diego 383
AM WEST delayed San Diego 65
ALASKA on time San Francisco 503
ALASKA delayed San Francisco 102
AM WEST on time San Francisco 320
AM WEST delayed San Francisco 129
ALASKA on time Seattle 1841
ALASKA delayed Seattle 305
AM WEST on time Seattle 201
AM WEST delayed Seattle 61

Conclusion

City with highest arrival delays

From the following chart, Phoenix has the most significant total late arrivals among the five cities at 427. However, AM West accounts for most of these delays, with 415, compared to Alaska, which only has 12. The second highest city is Seattle, where Alaska Airlines owns most of the delays with 305 compared to 61.

compare_airlines <- df |> 
  filter(status == 'delayed')

compare_airlines |> ggplot(aes(x = flight_count, y = city, fill = carrier)) +
  geom_bar(stat = 'identity') +
  geom_text(aes(label = flight_count), 
            position = position_stack(vjust = 0.5), 
            size = 3,
            fontface = 'bold') +
  scale_y_discrete(limits=rev) +
  labs(title = 'Arrival Flight Delays')

knitr::kable(compare_airlines)
carrier status city flight_count
ALASKA delayed Los Angeles 62
AM WEST delayed Los Angeles 117
ALASKA delayed Phoenix 12
AM WEST delayed Phoenix 415
ALASKA delayed San Diego 20
AM WEST delayed San Diego 65
ALASKA delayed San Francisco 102
AM WEST delayed San Francisco 129
ALASKA delayed Seattle 305
AM WEST delayed Seattle 61

Comparing airline delays between Seattle and Phoenix

Looking further into how these arrival airline delays are proportional to their total footprint flying into Phoenix and Seattle, we can see that it is not a fair comparison. While AM West led the way with 415 delayed flights into Phoenix, it only accounted for almost 8 % of more than 5,000 total flights. We can see the same for Alaska, where 305 delays accounted for roughly 14 % of more than 2,000 flights into Seattle. The chart shows how much closer their delays are to each other.

compare_sea_pho <- df |> 
  filter(city %in% c('Phoenix', 'Seattle')) |> 
  group_by(carrier, city) |> 
  crosstab_3way(x = carrier, y = status, z = city, weight = flight_count, format = 'long')

compare_sea_pho |> 
  ggplot(aes(x = pct, y = carrier, fill = status)) +
  geom_bar(stat = 'identity') +
  facet_wrap(vars(city), ncol = 1) +
  theme(strip.background = element_blank(), 
        strip.placement = "outside",
        strip.text.x = element_text(face = 'bold', size = 10),
        legend.position = "top", 
        legend.title = element_blank()) +
  geom_text(aes(label = paste0(round(pct, 0), '%')), 
              position = position_stack(vjust = 0.5), 
              size = 3, 
              fontface = 'bold',
              colour = 'white') +
  scale_fill_manual(values=c("#C85200", "#1170AA")) +
  xlab("Total Flights %") +
  ylab(element_blank()) +
  labs(title = 'Flight Arrival Status')

carrier status city flight_count n pct
ALASKA delayed Phoenix 12 233 5.15
AM WEST delayed Phoenix 415 5255 7.90
ALASKA delayed Seattle 305 2146 14.21
AM WEST delayed Seattle 61 262 23.28

Recommendations

We can see a higher overall arrival delay in Seattle over Phoenix. A possible lurking variable to discover would be how weather plays a factor since Seattle is known for rainy weather compared to Phoenix. Also, does an airline’s hub location affect whether they are at the correct airports at the right time?