Assignment 5A

Author

Samantha Barbaro

Approach

I will create a CSV file comparing the arrival times and delays of two superb airlines, Southwest and Spirit, using a data structure that is wide (similar to Southwest’s seats).

Then, I will upload the data set to R in a reproducible way without removing NAs or tidying the data on upload. After that, I will populate the missing data with NA (or whatever is relevant for this data set).

I will analyze and compare the arrival delays for the two airlines. This will involve:

Creating a tibble (formatting columns as numeric as needed)
Pivoting the data longer
Comparing arrivals and delays for the two airlines, summarizing data using both counts and percentages
Including at least one data visualization
Commenting extensively, detailing my journey while cleaning this data

Here is the data set: https://raw.githubusercontent.com/samanthabarbaro/data607/fa10101ed55521a4083f04bdebeca13fbda875ca/flightdelays.csv

#loading the data and separating by comma
library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.5.2

Warning: package 'ggplot2' was built under R version 4.5.2

Warning: package 'tibble' was built under R version 4.5.2

Warning: package 'readr' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

flight_set <- read.table("https://raw.githubusercontent.com/samanthabarbaro/data607/refs/heads/main/flightdelays.csv", sep = ",", header = TRUE)

Getting rid of the white space and fixing the labels

#get rid of blank rows
flights_blanks_removed <- subset(flight_set, LAX != "") |> 
  rename(airline = DESTINATION, status = X)

#as tibble 
flights_airlines_pop <- as_tibble(flights_blanks_removed)

#I tried to fill here, then found out the fill function only recognizes NA values
#turning blanks to NAs
#some blanks were automatically recognized as NAs when I loaded the table, these wer enot

flights_airlines_pop <- flights_airlines_pop |> 
  mutate(airline = na_if(airline, ""))

#populate airline names downward

flights_airlines_pop <- flights_airlines_pop |> fill(airline)

Pivoting the data into a format more suitable to analysis

#here, I tell R to keep the airline and status columns 
#then I tell it that the column names become destinations and the values become a new variable called num_flights

#I also told it to drop NA values - no need to see cities these airlines don't fly to

flights_long <- flights_airlines_pop |> 
  pivot_longer(
    cols = !airline:status, 
    names_to = "destination",
    values_to = "num_flights",
    values_drop_na = TRUE
  )

#Let's use the str_to_title function I found for project 1 to make the airline names not in all caps so we don't feel like the data frame is screaming at us
#and the str_to_lower function to make the statuses all lowercase

flights_long <- flights_long |> 
  mutate(airline = str_to_title(airline)) |> 
  mutate(status = str_to_lower(status))

Analysis

#how many total destinations are there? Let's use the count and distinct functions:

flights_long |> count(destination)

# A tibble: 16 × 2
   destination     n
   <chr>       <int>
 1 ACY             4
 2 ATL             4
 3 BOS             4
 4 CLT             4
 5 DFW             4
 6 EWR             2
 7 FTL             2
 8 JFK             4
 9 LAX             4
10 LGA             4
11 MCO             4
12 MIA             4
13 ORD             4
14 PBI             2
15 SEA             4
16 SFO             4

flights_long |> distinct(destination)

# A tibble: 16 × 1
   destination
   <chr>      
 1 LAX        
 2 LGA        
 3 JFK        
 4 MCO        
 5 ORD        
 6 SFO        
 7 BOS        
 8 ATL        
 9 DFW        
10 CLT        
11 SEA        
12 MIA        
13 PBI        
14 ACY        
15 EWR        
16 FTL

#there are 16 unique destinations

#Let's compare the total number of delayed SW flights to Spirit
flight_totals <- flights_long |>
  group_by(airline, status) |>
  summarize(total_flights = sum(num_flights))

`summarise()` has grouped output by 'airline'. You can override using the
`.groups` argument.

#add totals and percentages
pct_overall <- flight_totals |> 
  group_by(airline) |> 
  mutate(all_flights = sum(total_flights)) |>
  ungroup() |>
  mutate(pct_flights = (round(total_flights / all_flights * 100, 2)))

#let's visualize
ggplot(data = pct_overall, aes(x = airline, y = pct_flights, fill = status)) + 
    geom_col() + 
    geom_text(aes(label = paste0(round(pct_flights), "%")), 
              position = position_stack(vjust = 0.5)) + 
    scale_y_continuous(labels = function(x) paste0(x, "%")) +
    labs(title = "Percentage of on-time and delayed flights by airline")

ggplot(data = pct_overall, aes(x = airline, y = total_flights)) + 
    geom_col(aes(fill = status)) + scale_fill_brewer(palette = "Set3") + 
    geom_text(aes(label = paste0(round(total_flights))), 
              position = position_stack(vjust = 0.8)) + 
    labs(title = "Number of on-time and delayed flights by airline")

#total flights for each airline
flight_totals_overall <- flights_long |>
  group_by(airline) |>
  summarize(total_flights = sum(num_flights)) |> 
  arrange(desc(total_flights))



library(gt)

Warning: package 'gt' was built under R version 4.5.2

pct_overall |>
  select(airline, status, pct_flights) |>
  gt() |>
  cols_label(
    airline = "Airline",
    status = "Status",
    pct_flights = "Percentage (%)"
  ) |>
  tab_header(title = "Flight Status by Airline")

Airline	Status	Percentage (%)
Flight Status by Airline
Southwest	delayed	28.87
Southwest	on-time	71.13
Spirit	delayed	36.94
Spirit	on-time	63.06

#Comparing percentage of delays and arrival rates  

pct_overall |> filter(status == "delayed") |> ggplot(aes(x = airline, y = pct_flights, fill = airline)) + 
    geom_col() + 
    geom_text(aes(label = paste0(round(pct_flights), "%")), 
              position = position_stack(vjust = 0.5)) + 
    scale_y_continuous(labels = function(x) paste0(x, "%")) +
    labs(title = "Percentage of delayed flights by airline")

pct_overall |> filter(status == "on-time") |> ggplot(aes(x = airline, y = pct_flights, fill = airline)) + 
    geom_col() + 
    geom_text(aes(label = paste0(round(pct_flights), "%")), 
              position = position_stack(vjust = 0.5)) + 
    scale_y_continuous(labels = function(x) paste0(x, "%")) +
    labs(title = "Percentage of on-time flights by airline")

Comparing delay percentages

Overall, a larger number of Southwest flights were delayed, but a larger percentage of Spirit flights were delayed. The number of Spirit flights is smaller overall.

#% and avg by destination 
flight_by_dest <- flights_long |>
  group_by(airline, status, destination) |>
  summarize(total_flights = sum(num_flights)) |>
  arrange(destination, airline)

`summarise()` has grouped output by 'airline', 'status'. You can override using
the `.groups` argument.

pct_by_dest <- flight_by_dest |>
  group_by(airline,  destination) |>
  mutate(all_flights_to_dest = sum(total_flights)) |>
  ungroup() |>
  mutate(pct_flights = (round(total_flights / all_flights_to_dest * 100, 2)))

#Let's look at arrival rates to the five Florida airports - FTL, MCO, MIA, PBI, and ORD
#NB: I later remembered that ORD was O'hare, not Orlando, but I'm keeping it in here anyway so there's a comparison of five destinations

fl_only <- pct_by_dest |> filter(destination %in% c("FTL", "MCO", "MIA", "ORD", "PBI"))

#visualizing

ggplot(fl_only, aes(x = interaction(airline, destination), y = pct_flights, fill = status)) +
  geom_col(position = "dodge") +
  # Add percentage labels
  geom_text(aes(label = paste0(pct_flights, "%")), 
            position = position_dodge(width = 0.9), 
            vjust = -0.5, size = 2.5) +
  # Floridian colors (Teal and Orange)
  scale_fill_manual(values = c("ontime" = "#20B2AA", "delayed" = "#FF8C00")) +
  labs(
    title = "Florida Flight Status by Airline/Destination",
    x = "Airline & Destination",
    y = "Percentage",
    fill = "Status"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#facet wrap
ggplot(fl_only, aes(x = airline, y = pct_flights, fill = status)) +
    geom_col(position = "dodge") +
    geom_text(aes(label = paste0(pct_flights, "%")), 
              position = position_dodge(width = 0.9), 
              vjust = -0.5, 
              size = 2) +
    facet_wrap(~destination) +
    labs(
        title = "Flight Status by Airline and Florida Destination",
        x = "Airline",
        y = "Percentage (%)",
        fill = "Status"
    )

#facet wrap for totals 
ggplot(fl_only, aes(x = airline, y = total_flights, fill = status)) +
    geom_col(position = "dodge") +
    geom_text(aes(label = paste0(total_flights)), 
              position = position_dodge(width = 0.9), 
              vjust = -0.5, 
              size = 2) +
    facet_wrap(~destination) +
    labs(
        title = "Flight Status by Airline and Florida Destination",
        x = "Airline",
        y = "Number of Flights",
        fill = "Status"
    )

#grouping Florida flights together 
fl_grouped <- fl_only |> group_by(airline, status) |> summarize(total_flights = sum(total_flights))

`summarise()` has grouped output by 'airline'. You can override using the
`.groups` argument.

fl_grp_pct <- fl_grouped |>
  mutate(all_flights = sum(total_flights)) |>
  ungroup() |>
  mutate(pct_flights = (round(total_flights / all_flights * 100, 2)))


#Table by state
fl_grp_pct |>
    select(airline, status, total_flights, pct_flights) |>
    gt() |>
    cols_label(
        airline = "Airline", total_flights = "Total Flights",
        status = "Status",
        pct_flights = "Percentage (%)"
    ) |>
    tab_header(title = "Florida Flight Status by Airline")

Airline	Status	Total Flights	Percentage (%)
Florida Flight Status by Airline
Southwest	delayed	1714	28.47
Southwest	on-time	4307	71.53
Spirit	delayed	1982	28.86
Spirit	on-time	4886	71.14

#table by individual destination
fl_only |>
    select(airline, status, destination, total_flights, pct_flights) |>
    gt() |>
    cols_label(
        airline = "Airline", destination = "Destination", total_flights = "Total Flights",
        status = "Status",
        pct_flights = "Percentage (%)"
    ) |>
    tab_header(title = "Florida Flight Status by Airline")

Airline	Status	Destination	Total Flights	Percentage (%)
Florida Flight Status by Airline
Spirit	delayed	FTL	270	21.60
Spirit	on-time	FTL	980	78.40
Southwest	delayed	MCO	141	9.26
Southwest	on-time	MCO	1381	90.74
Spirit	delayed	MCO	1013	35.47
Spirit	on-time	MCO	1843	64.53
Southwest	delayed	MIA	912	41.95
Southwest	on-time	MIA	1262	58.05
Spirit	delayed	MIA	92	33.82
Spirit	on-time	MIA	180	66.18
Southwest	delayed	ORD	586	29.75
Southwest	on-time	ORD	1384	70.25
Spirit	delayed	ORD	607	24.38
Spirit	on-time	ORD	1883	75.62
Southwest	delayed	PBI	75	21.13
Southwest	on-time	PBI	280	78.87

ggplot(data = fl_grp_pct, aes(x = airline, y = pct_flights, fill = status)) + 
    geom_col(position = "dodge") + 
    geom_text(aes(label = paste0(round(pct_flights), "%")), 
              position = position_dodge(width = 0.9), 
              vjust = -0.5, 
              size = 3) + 
    scale_y_continuous(labels = function(x) paste0(x, "%")) +
    labs(title = "Percentage of on-time and delayed flights by airline")

Explaining discrepancies - Simpson’s paradox

As we can see from these graphs and tables, Southwest has a higher delayed percentage for 3/4 destinations it flies to in FL (MIA, ORD, and PBI, which Spirit does not fly to), but the percentages are much closer in the aggregate graph (graph that compares total flights by airline weights all flights equally), with Southwest pulling slightly ahead in overall percentage on-time flights.

This is because calculating the percentages for each destination idividually makes every destination look like it has equal weight, which is not true when calculating total percentages. The trend we see in in multiple, separate groups of data changes when we combine these groups.

Spirit and Southwest fly different numbers of planes to each destination. For example, Spirit flies 272 flights to Miami, whereas Southwest flies over 2174 – almost ten times that amount. This can mean very different percentages when totals are compared. Spirit also looks like it has a very low delayed flight rate (21.6%) at Ft Lauderdale airport, but it only flies a small number of planes (1250) compared to the almost 7,000 flights to Florida.

Google Gemini. (2026). Gemini 3 Flash [Large language model].
https://gemini.google.com. Accessed Feb 13, 2026.