I will create a CSV file comparing the arrival times and delays of two superb airlines, Southwest and Spirit, using a data structure that is wide (similar to Southwest’s seats).
Then, I will upload the data set to R in a reproducible way without removing NAs or tidying the data on upload. After that, I will populate the missing data with NA (or whatever is relevant for this data set).
I will analyze and compare the arrival delays for the two airlines. This will involve:
Creating a tibble (formatting columns as numeric as needed)
Pivoting the data longer
Comparing arrivals and delays for the two airlines, summarizing data using both counts and percentages
Including at least one data visualization
Commenting extensively, detailing my journey while cleaning this data
Getting rid of the white space and fixing the labels
#get rid of blank rowsflights_blanks_removed <-subset(flight_set, LAX !="") |>rename(airline = DESTINATION, status = X)#as tibble flights_airlines_pop <-as_tibble(flights_blanks_removed)#I tried to fill here, then found out the fill function only recognizes NA values#turning blanks to NAs#some blanks were automatically recognized as NAs when I loaded the table, these wer enotflights_airlines_pop <- flights_airlines_pop |>mutate(airline =na_if(airline, ""))#populate airline names downwardflights_airlines_pop <- flights_airlines_pop |>fill(airline)
Pivoting the data into a format more suitable to analysis
#here, I tell R to keep the airline and status columns #then I tell it that the column names become destinations and the values become a new variable called num_flights#I also told it to drop NA values - no need to see cities these airlines don't fly toflights_long <- flights_airlines_pop |>pivot_longer(cols =!airline:status, names_to ="destination",values_to ="num_flights",values_drop_na =TRUE )#Let's use the str_to_title function I found for project 1 to make the airline names not in all caps so we don't feel like the data frame is screaming at us#and the str_to_lower function to make the statuses all lowercaseflights_long <- flights_long |>mutate(airline =str_to_title(airline)) |>mutate(status =str_to_lower(status))
Analysis
#how many total destinations are there? Let's use the count and distinct functions:flights_long |>count(destination)
#there are 16 unique destinations#Let's compare the total number of delayed SW flights to Spiritflight_totals <- flights_long |>group_by(airline, status) |>summarize(total_flights =sum(num_flights))
`summarise()` has grouped output by 'airline'. You can override using the
`.groups` argument.
#add totals and percentagespct_overall <- flight_totals |>group_by(airline) |>mutate(all_flights =sum(total_flights)) |>ungroup() |>mutate(pct_flights = (round(total_flights / all_flights *100, 2)))#let's visualizeggplot(data = pct_overall, aes(x = airline, y = pct_flights, fill = status)) +geom_col() +geom_text(aes(label =paste0(round(pct_flights), "%")), position =position_stack(vjust =0.5)) +scale_y_continuous(labels =function(x) paste0(x, "%")) +labs(title ="Percentage of on-time and delayed flights by airline")
ggplot(data = pct_overall, aes(x = airline, y = total_flights)) +geom_col(aes(fill = status)) +scale_fill_brewer(palette ="Set3") +geom_text(aes(label =paste0(round(total_flights))), position =position_stack(vjust =0.8)) +labs(title ="Number of on-time and delayed flights by airline")
#total flights for each airlineflight_totals_overall <- flights_long |>group_by(airline) |>summarize(total_flights =sum(num_flights)) |>arrange(desc(total_flights))library(gt)
Warning: package 'gt' was built under R version 4.5.2
pct_overall |>select(airline, status, pct_flights) |>gt() |>cols_label(airline ="Airline",status ="Status",pct_flights ="Percentage (%)" ) |>tab_header(title ="Flight Status by Airline")
Flight Status by Airline
Airline
Status
Percentage (%)
Southwest
delayed
28.87
Southwest
on-time
71.13
Spirit
delayed
36.94
Spirit
on-time
63.06
#Comparing percentage of delays and arrival rates pct_overall |>filter(status =="delayed") |>ggplot(aes(x = airline, y = pct_flights, fill = airline)) +geom_col() +geom_text(aes(label =paste0(round(pct_flights), "%")), position =position_stack(vjust =0.5)) +scale_y_continuous(labels =function(x) paste0(x, "%")) +labs(title ="Percentage of delayed flights by airline")
pct_overall |>filter(status =="on-time") |>ggplot(aes(x = airline, y = pct_flights, fill = airline)) +geom_col() +geom_text(aes(label =paste0(round(pct_flights), "%")), position =position_stack(vjust =0.5)) +scale_y_continuous(labels =function(x) paste0(x, "%")) +labs(title ="Percentage of on-time flights by airline")
Comparing delay percentages
Overall, a larger number of Southwest flights were delayed, but a larger percentage of Spirit flights were delayed. The number of Spirit flights is smaller overall.
#% and avg by destination flight_by_dest <- flights_long |>group_by(airline, status, destination) |>summarize(total_flights =sum(num_flights)) |>arrange(destination, airline)
`summarise()` has grouped output by 'airline', 'status'. You can override using
the `.groups` argument.
pct_by_dest <- flight_by_dest |>group_by(airline, destination) |>mutate(all_flights_to_dest =sum(total_flights)) |>ungroup() |>mutate(pct_flights = (round(total_flights / all_flights_to_dest *100, 2)))#Let's look at arrival rates to the five Florida airports - FTL, MCO, MIA, PBI, and ORD#NB: I later remembered that ORD was O'hare, not Orlando, but I'm keeping it in here anyway so there's a comparison of five destinationsfl_only <- pct_by_dest |>filter(destination %in%c("FTL", "MCO", "MIA", "ORD", "PBI"))#visualizingggplot(fl_only, aes(x =interaction(airline, destination), y = pct_flights, fill = status)) +geom_col(position ="dodge") +# Add percentage labelsgeom_text(aes(label =paste0(pct_flights, "%")), position =position_dodge(width =0.9), vjust =-0.5, size =2.5) +# Floridian colors (Teal and Orange)scale_fill_manual(values =c("ontime"="#20B2AA", "delayed"="#FF8C00")) +labs(title ="Florida Flight Status by Airline/Destination",x ="Airline & Destination",y ="Percentage",fill ="Status" ) +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))
#facet wrapggplot(fl_only, aes(x = airline, y = pct_flights, fill = status)) +geom_col(position ="dodge") +geom_text(aes(label =paste0(pct_flights, "%")), position =position_dodge(width =0.9), vjust =-0.5, size =2) +facet_wrap(~destination) +labs(title ="Flight Status by Airline and Florida Destination",x ="Airline",y ="Percentage (%)",fill ="Status" )
#facet wrap for totals ggplot(fl_only, aes(x = airline, y = total_flights, fill = status)) +geom_col(position ="dodge") +geom_text(aes(label =paste0(total_flights)), position =position_dodge(width =0.9), vjust =-0.5, size =2) +facet_wrap(~destination) +labs(title ="Flight Status by Airline and Florida Destination",x ="Airline",y ="Number of Flights",fill ="Status" )
#table by individual destinationfl_only |>select(airline, status, destination, total_flights, pct_flights) |>gt() |>cols_label(airline ="Airline", destination ="Destination", total_flights ="Total Flights",status ="Status",pct_flights ="Percentage (%)" ) |>tab_header(title ="Florida Flight Status by Airline")
Florida Flight Status by Airline
Airline
Status
Destination
Total Flights
Percentage (%)
Spirit
delayed
FTL
270
21.60
Spirit
on-time
FTL
980
78.40
Southwest
delayed
MCO
141
9.26
Southwest
on-time
MCO
1381
90.74
Spirit
delayed
MCO
1013
35.47
Spirit
on-time
MCO
1843
64.53
Southwest
delayed
MIA
912
41.95
Southwest
on-time
MIA
1262
58.05
Spirit
delayed
MIA
92
33.82
Spirit
on-time
MIA
180
66.18
Southwest
delayed
ORD
586
29.75
Southwest
on-time
ORD
1384
70.25
Spirit
delayed
ORD
607
24.38
Spirit
on-time
ORD
1883
75.62
Southwest
delayed
PBI
75
21.13
Southwest
on-time
PBI
280
78.87
ggplot(data = fl_grp_pct, aes(x = airline, y = pct_flights, fill = status)) +geom_col(position ="dodge") +geom_text(aes(label =paste0(round(pct_flights), "%")), position =position_dodge(width =0.9), vjust =-0.5, size =3) +scale_y_continuous(labels =function(x) paste0(x, "%")) +labs(title ="Percentage of on-time and delayed flights by airline")
Explaining discrepancies - Simpson’s paradox
As we can see from these graphs and tables, Southwest has a higher delayed percentage for 3/4 destinations it flies to in FL (MIA, ORD, and PBI, which Spirit does not fly to), but the percentages are much closer in the aggregate graph (graph that compares total flights by airline weights all flights equally), with Southwest pulling slightly ahead in overall percentage on-time flights.
This is because calculating the percentages for each destination idividually makes every destination look like it has equal weight, which is not true when calculating total percentages. The trend we see in in multiple, separate groups of data changes when we combine these groups.
Spirit and Southwest fly different numbers of planes to each destination. For example, Spirit flies 272 flights to Miami, whereas Southwest flies over 2174 – almost ten times that amount. This can mean very different percentages when totals are compared. Spirit also looks like it has a very low delayed flight rate (21.6%) at Ft Lauderdale airport, but it only flies a small number of planes (1250) compared to the almost 7,000 flights to Florida.
Google Gemini. (2026). Gemini 3 Flash [Large language model].
https://gemini.google.com. Accessed Feb 13, 2026.