Assignment 5A

Author

Samantha Barbaro

Approach

I will create a CSV file comparing the arrival times and delays of two superb airlines, Southwest and Spirit, using a data structure that is wide (similar to Southwest’s seats).

Then, I will upload the data set to R in a reproducible way without removing NAs or tidying the data on upload. After that, I will populate the missing data with NA (or whatever is relevant for this data set).

I will analyze and compare the arrival delays for the two airlines. This will involve:

  • Creating a tibble (formatting columns as numeric as needed)
  • Pivoting the data longer
  • Comparing arrivals and delays for the two airlines, summarizing data using both counts and percentages
  • Including at least one data visualization
  • Commenting extensively, detailing my journey while cleaning this data

Here is the data set: https://raw.githubusercontent.com/samanthabarbaro/data607/fa10101ed55521a4083f04bdebeca13fbda875ca/flightdelays.csv

#loading the data and separating by comma
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
flight_set <- read.table("https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/flightdelays.csv", sep = ",", header = TRUE)

Getting rid of the white space and fixing the labels

#get rid of blank rows
flights_blanks_removed <- subset(flight_set, LAX != "") |> 
  rename(airline = DESTINATION, status = X)

#as tibble 
flights_airlines_pop <- as_tibble(flights_blanks_removed)

#I tried to fill here, then found out the fill function only recognizes NA values
#turning blanks to NAs
#some blanks were automatically recognized as NAs when I loaded the table, these wer enot

flights_airlines_pop <- flights_airlines_pop |> 
  mutate(airline = na_if(airline, ""))

#populate airline names downward

flights_airlines_pop <- flights_airlines_pop |> fill(airline)

Pivoting the data into a format more suitable to analysis

#here, I tell R to keep the airline and status columns 
#then I tell it that the column names become destinations and the values become a new variable called num_flights

#I also told it to drop NA values - no need to see cities these airlines don't fly to

flights_long <- flights_airlines_pop |> 
  pivot_longer(
    cols = !airline:status, 
    names_to = "destination",
    values_to = "num_flights",
    values_drop_na = TRUE
  )

#Let's use the str_to_title function I found for project 1 to make the airline names not in all caps so we don't feel like the data frame is screaming at us
#and the str_to_lower function to make the statuses all lowercase

flights_long <- flights_long |> 
  mutate(airline = str_to_title(airline)) |> 
  mutate(status = str_to_lower(status))

Analysis

#how many total destinations are there? Let's use the count and distinct functions:

flights_long |> count(destination)
# A tibble: 16 × 2
   destination     n
   <chr>       <int>
 1 ACY             4
 2 ATL             4
 3 BOS             4
 4 CLT             4
 5 DFW             4
 6 EWR             2
 7 FTL             2
 8 JFK             4
 9 LAX             4
10 LGA             4
11 MCO             4
12 MIA             4
13 ORD             4
14 PBI             2
15 SEA             4
16 SFO             4
flights_long |> distinct(destination)
# A tibble: 16 × 1
   destination
   <chr>      
 1 LAX        
 2 LGA        
 3 JFK        
 4 MCO        
 5 ORD        
 6 SFO        
 7 BOS        
 8 ATL        
 9 DFW        
10 CLT        
11 SEA        
12 MIA        
13 PBI        
14 ACY        
15 EWR        
16 FTL        
#there are 16 unique destinations

#Let's compare the total number of delayed SW flights to Spirit
flight_totals <- flights_long |>
  group_by(airline, status) |>
  summarize(total_flights = sum(num_flights))
`summarise()` has grouped output by 'airline'. You can override using the
`.groups` argument.
#add totals and percentages
pct_overall <- flight_totals |> 
  group_by(airline) |> 
  mutate(all_flights = sum(total_flights)) |>
  ungroup() |>
  mutate(pct_flights = (round(total_flights / all_flights * 100, 2)))

#let's visualize
ggplot(data = pct_overall, aes(x = airline, y = pct_flights, fill = status)) + 
    geom_col() + 
    geom_text(aes(label = paste0(round(pct_flights), "%")), 
              position = position_stack(vjust = 0.5)) + 
    scale_y_continuous(labels = function(x) paste0(x, "%")) +
    labs(title = "Percentage of on-time and delayed flights by airline")

ggplot(data = pct_overall, aes(x = airline, y = total_flights)) + 
    geom_col(aes(fill = status)) + scale_fill_brewer(palette = "Set3") + 
    geom_text(aes(label = paste0(round(total_flights))), 
              position = position_stack(vjust = 0.8)) + 
    labs(title = "Number of on-time and delayed flights by airline") 

#total flights for each airline
flight_totals_overall <- flights_long |>
  group_by(airline) |>
  summarize(total_flights = sum(num_flights)) |> 
  arrange(desc(total_flights))



library(gt)
Warning: package 'gt' was built under R version 4.5.2
pct_overall |>
  select(airline, status, pct_flights) |>
  gt() |>
  cols_label(
    airline = "Airline",
    status = "Status",
    pct_flights = "Percentage (%)"
  ) |>
  tab_header(title = "Flight Status by Airline")
Flight Status by Airline
Airline Status Percentage (%)
Southwest delayed 28.87
Southwest on-time 71.13
Spirit delayed 36.94
Spirit on-time 63.06
#Comparing percentage of delays and arrival rates  

pct_overall |> filter(status == "delayed") |> ggplot(aes(x = airline, y = pct_flights, fill = airline)) + 
    geom_col() + 
    geom_text(aes(label = paste0(round(pct_flights), "%")), 
              position = position_stack(vjust = 0.5)) + 
    scale_y_continuous(labels = function(x) paste0(x, "%")) +
    labs(title = "Percentage of delayed flights by airline")

pct_overall |> filter(status == "on-time") |> ggplot(aes(x = airline, y = pct_flights, fill = airline)) + 
    geom_col() + 
    geom_text(aes(label = paste0(round(pct_flights), "%")), 
              position = position_stack(vjust = 0.5)) + 
    scale_y_continuous(labels = function(x) paste0(x, "%")) +
    labs(title = "Percentage of on-time flights by airline")

Comparing delay percentages

Overall, a larger number of Southwest flights were delayed, but a larger percentage of Spirit flights were delayed. The number of Spirit flights is smaller overall.

#% and avg by destination 
flight_by_dest <- flights_long |>
  group_by(airline, status, destination) |>
  summarize(total_flights = sum(num_flights)) |>
  arrange(destination, airline)
`summarise()` has grouped output by 'airline', 'status'. You can override using
the `.groups` argument.
pct_by_dest <- flight_by_dest |>
  group_by(airline,  destination) |>
  mutate(all_flights_to_dest = sum(total_flights)) |>
  ungroup() |>
  mutate(pct_flights = (round(total_flights / all_flights_to_dest * 100, 2)))

#Let's look at arrival rates to the five Florida airports - FTL, MCO, MIA, PBI, and ORD
#NB: I later remembered that ORD was O'hare, not Orlando, but I'm keeping it in here anyway so there's a comparison of five destinations

fl_only <- pct_by_dest |> filter(destination %in% c("FTL", "MCO", "MIA", "ORD", "PBI"))

#visualizing

ggplot(fl_only, aes(x = interaction(airline, destination), y = pct_flights, fill = status)) +
  geom_col(position = "dodge") +
  # Add percentage labels
  geom_text(aes(label = paste0(pct_flights, "%")), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 2.5) +
  # Floridian colors (Teal and Orange)
  scale_fill_manual(values = c("ontime" = "#20B2AA", "delayed" = "#FF8C00")) +
  labs(
    title = "Florida Flight Status by Airline/Destination",
    x = "Airline & Destination",
    y = "Percentage",
    fill = "Status"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#facet wrap
ggplot(fl_only, aes(x = airline, y = pct_flights, fill = status)) +
    geom_col(position = "dodge") +
    geom_text(aes(label = paste0(pct_flights, "%")), 
              position = position_dodge(width = 0.9), 
              vjust = -0.5, 
              size = 2) +
    facet_wrap(~destination) +
    labs(
        title = "Flight Status by Airline and Florida Destination",
        x = "Airline",
        y = "Percentage (%)",
        fill = "Status"
    )

#facet wrap for totals 
ggplot(fl_only, aes(x = airline, y = total_flights, fill = status)) +
    geom_col(position = "dodge") +
    geom_text(aes(label = paste0(total_flights)), 
              position = position_dodge(width = 0.9), 
              vjust = -0.5, 
              size = 2) +
    facet_wrap(~destination) +
    labs(
        title = "Flight Status by Airline and Florida Destination",
        x = "Airline",
        y = "Number of Flights",
        fill = "Status"
    )

#grouping Florida flights together 
fl_grouped <- fl_only |> group_by(airline, status) |> summarize(total_flights = sum(total_flights)) 
`summarise()` has grouped output by 'airline'. You can override using the
`.groups` argument.
fl_grp_pct <- fl_grouped |>
  mutate(all_flights = sum(total_flights)) |>
  ungroup() |>
  mutate(pct_flights = (round(total_flights / all_flights * 100, 2)))


#Table by state
fl_grp_pct |>
    select(airline, status, total_flights, pct_flights) |>
    gt() |>
    cols_label(
        airline = "Airline", total_flights = "Total Flights",
        status = "Status",
        pct_flights = "Percentage (%)"
    ) |>
    tab_header(title = "Florida Flight Status by Airline")
Florida Flight Status by Airline
Airline Status Total Flights Percentage (%)
Southwest delayed 1714 28.47
Southwest on-time 4307 71.53
Spirit delayed 1982 28.86
Spirit on-time 4886 71.14
#table by individual destination
fl_only |>
    select(airline, status, destination, total_flights, pct_flights) |>
    gt() |>
    cols_label(
        airline = "Airline", destination = "Destination", total_flights = "Total Flights",
        status = "Status",
        pct_flights = "Percentage (%)"
    ) |>
    tab_header(title = "Florida Flight Status by Airline")
Florida Flight Status by Airline
Airline Status Destination Total Flights Percentage (%)
Spirit delayed FTL 270 21.60
Spirit on-time FTL 980 78.40
Southwest delayed MCO 141 9.26
Southwest on-time MCO 1381 90.74
Spirit delayed MCO 1013 35.47
Spirit on-time MCO 1843 64.53
Southwest delayed MIA 912 41.95
Southwest on-time MIA 1262 58.05
Spirit delayed MIA 92 33.82
Spirit on-time MIA 180 66.18
Southwest delayed ORD 586 29.75
Southwest on-time ORD 1384 70.25
Spirit delayed ORD 607 24.38
Spirit on-time ORD 1883 75.62
Southwest delayed PBI 75 21.13
Southwest on-time PBI 280 78.87
ggplot(data = fl_grp_pct, aes(x = airline, y = pct_flights, fill = status)) + 
    geom_col(position = "dodge") + 
    geom_text(aes(label = paste0(round(pct_flights), "%")), 
              position = position_dodge(width = 0.9), 
              vjust = -0.5, 
              size = 3) + 
    scale_y_continuous(labels = function(x) paste0(x, "%")) +
    labs(title = "Percentage of on-time and delayed flights by airline")

Explaining discrepancies - Simpson’s paradox

As we can see from these graphs and tables, Southwest has a higher delayed percentage for 3/4 destinations it flies to in FL (MIA, ORD, and PBI, which Spirit does not fly to), but the percentages are much closer in the aggregate graph (graph that compares total flights by airline weights all flights equally), with Southwest pulling slightly ahead in overall percentage on-time flights.

This is because calculating the percentages for each destination idividually makes every destination look like it has equal weight, which is not true when calculating total percentages. The trend we see in in multiple, separate groups of data changes when we combine these groups.

Spirit and Southwest fly different numbers of planes to each destination. For example, Spirit flies 272 flights to Miami, whereas Southwest flies over 2174 – almost ten times that amount. This can mean very different percentages when totals are compared. Spirit also looks like it has a very low delayed flight rate (21.6%) at Ft Lauderdale airport, but it only flies a small number of planes (1250) compared to the almost 7,000 flights to Florida.

Google Gemini. (2026). Gemini 3 Flash [Large language model].
https://gemini.google.com. Accessed Feb 13, 2026.