Week 5 Data Dive

Introduction

In this data dive, I explore the Airline Delay Dataset from 2021–2023. The goal is not just to analyze airline delays, but to understand how important it is to read the dataset documentation carefully.

This dataset comes from the U.S. Bureau of Transportation Statistics (BTS). Each row represents one airline at one airport in one specific month and year. The dataset includes:

Total number of flights
Number of delayed flights
Total delay minutes
Delay minutes by cause (weather, carrier, security, etc.)

# Load libraries
library(tidyverse)
library(dplyr)
library(ggplot2)

Airline_Delay_Post_COVID_2021_2023 <- read.csv("Airline_Delay_Post_COVID_2021_2023.csv")

df <- Airline_Delay_Post_COVID_2021_2023

1. Columns That Were Confusing Without Documentation

a. arr_del15

When I first encountered the column name arr_del15, its meaning was not immediately clear. The “15” could have referred to 15 delayed flights, an average delay of 15 minutes, or total delay time. After consulting the documentation, I learned that arr_del15 represents the number of arrival flights delayed more than 15 minutes, which is the official definition of a delayed flight in the United States. They likely used 15 minutes as the threshold because it is a standard definition for a flight delay in the U.S., allowing consistent comparisons across airlines. This shows why reviewing documentation is important before analysis.

b. nas_ct

The column nas_ct was also initially unclear because I did not recognize what “NAS” represented. It stands for the National Aviation System, which includes delays caused by air traffic control, airport operations, and other system-wide factors. This separation likely exists so analysts can distinguish system-wide delays from airline-specific delays, making it easier to identify patterns and responsibility. Without understanding this, I might have incorrectly attributed these delays to airlines rather than the broader aviation infrastructure.

c. carrier_ct v/s carrier_delay

At first glance, carrier_ct and carrier_delay appeared similar. However, they measure two different aspects of delay. carrier_ct represents the number of flights delayed due to carrier-related issues, while carrier_delay represents the total minutes of delay caused by those issues. The dataset likely separates these measures to allow analysis of both frequency and severity of delays. Treating them as equivalent would have led to flawed conclusions about airline performance.

2. Something Still Unclear After Reading Documentation

Even after reading the documentation, one thing remains unclear. If a flight is delayed for more than one reason (e.g., both weather and carrier issues), it is not clear how it is counted. I do not know whether the flight is included in both delay categories, whether the delay minutes are split between causes, or whether cancelled flights are included. Because the documentation does not explain this, I need to be cautious when interpreting total delays and comparing causes. This uncertainty could affect conclusions about which delay types are most significant.

3. Visualization

a. Percentage of Flights Delayed

Airline_Delay_Post_COVID_2021_2023$delay_pct <- 
  (Airline_Delay_Post_COVID_2021_2023$arr_del15 / 
   Airline_Delay_Post_COVID_2021_2023$arr_flights) * 100
ggplot(Airline_Delay_Post_COVID_2021_2023, aes(x = carrier, y = delay_pct)) +
  geom_boxplot(fill = "skyblue", color = "darkblue") +
  labs(
    title = "Percentage of Flights Delayed by Carrier",
    x = "Carrier",
    y = "Delay Percentage (%)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  ylim(0, 100)

Calculating the percentage of flights delayed (arr_del15 / arr_flights) accounts for differences in flight volume across carriers. If I had misinterpreted arr_del15 as total delay minutes, this comparison would have been misleading.

b. Delay Count vs. Delay Minutes

Here, I compare the number of delayed flights (carrier_ct) with the total delay minutes (carrier_delay):

ggplot(Airline_Delay_Post_COVID_2021_2023, aes(x = carrier_ct, y = carrier_delay)) +
  geom_point(alpha = 0.5, color = "darkred") +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    title = "Total Carrier Delay Minutes vs Number of Carrier Delays",
    x = "Number of Carrier Delayed Flights",
    y = "Total Carrier Delay Minutes"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma)

Some airlines have relatively few delayed flights but very high total delay minutes. This indicates that although fewer flights were affected, the delays were severe. Focusing only on the count of delayed flights could underestimate the true impact of delays.

4. Missing Rows in Categorical Data

Two categorical columns we are considering :

1. carrier
2. airport

a. Explicit Missing Values

sum(is.na(Airline_Delay_Post_COVID_2021_2023$carrier))

## [1] 0

sum(is.na(Airline_Delay_Post_COVID_2021_2023$airport))

## [1] 0

No missing rows were found in either carrier or airport.

b. Implicitly Missing Values

Some airlines or airports may not appear in certain years (2021–2023). This is called implicitly missing data because the category is missing entirely rather than marked as NA. This could create bias in the analysis.For example, some small carriers may not appear in 2022, which would affect year-over-year comparisons.

c. Empty Groups

Some delay categories, such as security_ct, may have many zeros.

ggplot(Airline_Delay_Post_COVID_2021_2023, aes(y = security_delay)) +
  geom_boxplot(fill = "lightblue", color = "darkblue") +
  labs(
    title = "Boxplot of Security Delay Minutes",
    x = "All Flights",
    y = "Security Delay Minutes"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma)

Some delay categories, such as security_ct, have many zeros. This indicates that these delays are rare and affects how we interpret delay causes.

5. Outliers in Continuous Data

I examined arr_delay (total arrival delay minutes)

ggplot(Airline_Delay_Post_COVID_2021_2023, aes(y = arr_delay)) +
  geom_boxplot(fill = "lightgreen", color = "darkgreen") +
  labs(
    title = "Boxplot of Total Arrival Delay Minutes",
    x = "Arrival Delays Across All Flights",
    y = "Total Arrival Delay Minutes"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = scales::comma)

Arr_delay (total arrival delay minutes) mostly contains normal delays, but some flights experience extremely long delays. These unusually long delays are called outliers. Outliers can distort averages if not handled carefully, making some airlines look worse than they actually are.

One standard way to identify outliers is using the formula: Q3 + 1.5 × IQR, where Q3 is the 75th percentile of the data and IQR is the range of the middle 50% of values. Any delay above this threshold is considered an outlier. This gives a consistent method to flag extreme events rather than relying on guesswork.

6. Risks Identified

Several risks emerged from this analysis that could lead to misinterpretation:
1. Confusing the number of delayed flights with total delay minutes.
2. Misunderstanding what columns like arr_del15 or nas_ct mean.
3. Missing data or filtered airlines affecting results.
4. Outliers skewing averages.
5. Using raw counts instead of percentages, which can unfairly compare carriers.

These risks show why it’s important to read documentation carefully and choose appropriate metrics before drawing conclusions.

Conclusion

This data dive demonstrates the importance of understanding dataset documentation. Several column names, such as arr_del15 and nas_ct, were unclear at first, and reading the documentation improved interpretation. Some aspects remain ambiguous, including how overlapping delay causes are handled. Missing data and outliers can also influence results. Effective data analysis requires careful interpretation, attention to detail, and thoughtful metric selection, and not just running code.