The objective is to be able to transform a wide format data structure into a long format where you have ‘tidy’ the data to perform the analysis easier. The data is flight arrival counts from two airlines in five different cities.
library(tidyverse)
library(pollster)
Import wide format of CSV data
df <- read_csv('wide_airline_data.csv')
knitr::kable(df)
…1 | …2 | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
NA | delayed | 62 | 12 | 20 | 102 | 305 |
NA | NA | NA | NA | NA | NA | NA |
AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
NA | delayed | 117 | 415 | 65 | 129 | 61 |
Rename missing column names with ‘carrier’ and ‘status’
df <- df %>%
rename(carrier = 1,
status = 2)
knitr::kable(df)
carrier | status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
NA | delayed | 62 | 12 | 20 | 102 | 305 |
NA | NA | NA | NA | NA | NA | NA |
AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
NA | delayed | 117 | 415 | 65 | 129 | 61 |
Drop rows that contain N/A in every column
# remove rows where ALL columns 'N/A'
df <- df %>%
filter(if_any(everything(), ~ !is.na(.)))
knitr::kable(df)
carrier | status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
NA | delayed | 62 | 12 | 20 | 102 | 305 |
AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
NA | delayed | 117 | 415 | 65 | 129 | 61 |
Forward fill missing airline carrier names
df <- df %>%
fill(carrier)
knitr::kable(df)
carrier | status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
Transform columns with city names into long format and name new columns city and flight_count
df <- df %>%
gather("city", "flight_count", -c("carrier", "status"))
knitr::kable(df)
carrier | status | city | flight_count |
---|---|---|---|
ALASKA | on time | Los Angeles | 497 |
ALASKA | delayed | Los Angeles | 62 |
AM WEST | on time | Los Angeles | 694 |
AM WEST | delayed | Los Angeles | 117 |
ALASKA | on time | Phoenix | 221 |
ALASKA | delayed | Phoenix | 12 |
AM WEST | on time | Phoenix | 4840 |
AM WEST | delayed | Phoenix | 415 |
ALASKA | on time | San Diego | 212 |
ALASKA | delayed | San Diego | 20 |
AM WEST | on time | San Diego | 383 |
AM WEST | delayed | San Diego | 65 |
ALASKA | on time | San Francisco | 503 |
ALASKA | delayed | San Francisco | 102 |
AM WEST | on time | San Francisco | 320 |
AM WEST | delayed | San Francisco | 129 |
ALASKA | on time | Seattle | 1841 |
ALASKA | delayed | Seattle | 305 |
AM WEST | on time | Seattle | 201 |
AM WEST | delayed | Seattle | 61 |
From the following chart, Phoenix has the most significant total late arrivals among the five cities at 427. However, AM West accounts for most of these delays, with 415, compared to Alaska, which only has 12. The second highest city is Seattle, where Alaska Airlines owns most of the delays with 305 compared to 61.
compare_airlines <- df |>
filter(status == 'delayed')
compare_airlines |> ggplot(aes(x = flight_count, y = city, fill = carrier)) +
geom_bar(stat = 'identity') +
geom_text(aes(label = flight_count),
position = position_stack(vjust = 0.5),
size = 3,
fontface = 'bold') +
scale_y_discrete(limits=rev) +
labs(title = 'Arrival Flight Delays')
knitr::kable(compare_airlines)
carrier | status | city | flight_count |
---|---|---|---|
ALASKA | delayed | Los Angeles | 62 |
AM WEST | delayed | Los Angeles | 117 |
ALASKA | delayed | Phoenix | 12 |
AM WEST | delayed | Phoenix | 415 |
ALASKA | delayed | San Diego | 20 |
AM WEST | delayed | San Diego | 65 |
ALASKA | delayed | San Francisco | 102 |
AM WEST | delayed | San Francisco | 129 |
ALASKA | delayed | Seattle | 305 |
AM WEST | delayed | Seattle | 61 |
Looking further into how these arrival airline delays are proportional to their total footprint flying into Phoenix and Seattle, we can see that it is not a fair comparison. While AM West led the way with 415 delayed flights into Phoenix, it only accounted for almost 8 % of more than 5,000 total flights. We can see the same for Alaska, where 305 delays accounted for roughly 14 % of more than 2,000 flights into Seattle. The chart shows how much closer their delays are to each other.
compare_sea_pho <- df |>
filter(city %in% c('Phoenix', 'Seattle')) |>
group_by(carrier, city) |>
crosstab_3way(x = carrier, y = status, z = city, weight = flight_count, format = 'long')
compare_sea_pho |>
ggplot(aes(x = pct, y = carrier, fill = status)) +
geom_bar(stat = 'identity') +
facet_wrap(vars(city), ncol = 1) +
theme(strip.background = element_blank(),
strip.placement = "outside",
strip.text.x = element_text(face = 'bold', size = 10),
legend.position = "top",
legend.title = element_blank()) +
geom_text(aes(label = paste0(round(pct, 0), '%')),
position = position_stack(vjust = 0.5),
size = 3,
fontface = 'bold',
colour = 'white') +
scale_fill_manual(values=c("#C85200", "#1170AA")) +
xlab("Total Flights %") +
ylab(element_blank()) +
labs(title = 'Flight Arrival Status')
carrier | status | city | flight_count | n | pct |
---|---|---|---|---|---|
ALASKA | delayed | Phoenix | 12 | 233 | 5.15 |
AM WEST | delayed | Phoenix | 415 | 5255 | 7.90 |
ALASKA | delayed | Seattle | 305 | 2146 | 14.21 |
AM WEST | delayed | Seattle | 61 | 262 | 23.28 |
We can see a higher overall arrival delay in Seattle over Phoenix. A possible lurking variable to discover would be how weather plays a factor since Seattle is known for rainy weather compared to Phoenix. Also, does an airline’s hub location affect whether they are at the correct airports at the right time?