Week 11: Data Dive

Introduction

This analysis looks at airline delays from 2021 to 2023 to understand two main things:

  1. Is there a trend? (Are delays getting better or worse over time?)
  2. Is there seasonality? (Do delays follow a pattern that repeats each year?)

Understanding these patterns helps airlines plan their schedules and resources better.

Important Note on Data: This dataset is labeled “Post_COVID” and covers the recovery period (2021) and subsequent years (2022-2023). The analysis acknowledges that 2021 may reflect post-COVID recovery effects rather than normal operations. For the most robust trend analysis, see the section on 2022-2023 data separately.


Task 1: Create a Date Column

The data has separate year and month columns. I combined them into one date column so R can understand time order properly.

df <- read.csv("Airline_Delay_Post_COVID_2021_2023.csv")

# Combine year and month into a date
df$date <- as.Date(paste(df$year, df$month, "01", sep = "-"))

# Check the dates
cat("First date:", format(min(df$date), "%B %Y"), "\n")
## First date: January 2021
cat("Last date:", format(max(df$date), "%B %Y"), "\n")
## Last date: March 2023
cat("Total months:", length(unique(df$date)), "\n")
## Total months: 27

I created a proper date column from January 2021 to March 2023. Since we only have year and month data, I set all dates to the first day of each month. This lets R understand the time sequence.


Task 2: Choose and Calculate Response Variable

I chose average arrival delay (arr_delay) as my response variable because it directly shows how long flights are delayed on average each month.

Important: I used the AVERAGE delay per month, not the total. This is fair because airlines with more flights would have higher totals and make the data misleading.

# Group by date and calculate average delay
monthly_data <- df %>%
  group_by(date) %>%
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n_flights = n(),
    .groups = 'drop'
  ) %>%
  arrange(date)

# Calculate actual per-flight delay
monthly_data <- monthly_data %>%
  mutate(per_flight_delay = avg_delay / n_flights)

cat("\nPer-Flight Delay Summary (in minutes):\n")
## 
## Per-Flight Delay Summary (in minutes):
print(summary(monthly_data$per_flight_delay))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.930   1.781   2.619   2.425   3.037   3.753

Each flight is delayed by an average of 0.93 to 3.75 minutes per month. This is the typical delay passengers experience on individual flights.

df %>%
  group_by(carrier_name) %>%
  summarise(avg_delay = mean(arr_delay / arr_flights, na.rm=TRUE)) %>%
  ggplot(aes(x = reorder(carrier_name, avg_delay), y = avg_delay)) +
  geom_col(fill="steelblue") +
  coord_flip() +
  labs(y="Average Delay per Flight (minutes)",
       x="Carrier",
       title="Average Delay by Carrier")


Task 3: Create a Time Series Object and Plot

I converted the data into a time series format (tsibble) so I can use special time series tools.

# Create the tsibble object from monthly_data
ts_data <- monthly_data %>%
  as_tsibble(index = date)

# Fill gaps to handle missing months
ts_data_filled <- ts_data %>%
  fill_gaps()

cat("\nTime series created with", nrow(ts_data_filled), "months of data\n")
## 
## Time series created with 790 months of data

Now plotting the delays over time with a trend line to see if there is an overall pattern:

ggplot(monthly_data, aes(x = date, y = per_flight_delay)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point(color = "steelblue", size = 2, alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "red", 
              linetype = "dashed", size = 1) +
  labs(
    title = "Average Delay Per Flight Over Time (2021-2023)",
    subtitle = "Red line shows the overall trend",
    x = "Date",
    y = "Average Delay Per Flight (minutes)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

What stands out: The data jumps up and down a lot from month to month. The red trend line shows a slight upward slope. There are clear peaks and valleys that repeat, which suggests a strong seasonal pattern.


Task 4: Test for Trend Using Linear Regression

I used linear regression to check if there is a real trend (increasing or decreasing delays over time).

Full Period Analysis (2021-2023)

# Fit a line to see if delay changes over time
model <- lm(per_flight_delay ~ date, data = monthly_data)

summary(model)
## 
## Call:
## lm(formula = per_flight_delay ~ date, data = monthly_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9814 -0.5867 -0.0692  0.3967  1.5690 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.875e+01  1.022e+01  -3.791 0.000845 ***
## date         2.165e-03  5.373e-04   4.029 0.000460 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6616 on 25 degrees of freedom
## Multiple R-squared:  0.3937, Adjusted R-squared:  0.3694 
## F-statistic: 16.23 on 1 and 25 DF,  p-value: 0.0004601
# Get key results
coef_date <- coef(model)["date"]
p_value <- summary(model)$coefficients["date", "Pr(>|t|)"]
r_squared <- summary(model)$r.squared

cat("\n=== TREND TEST RESULTS (2021-2023) ===\n")
## 
## === TREND TEST RESULTS (2021-2023) ===
cat("Slope:", round(coef_date, 8), "minutes per day\n")
## Slope: 0.00216466 minutes per day
cat("This translates to:", round(coef_date * 30, 4), "minutes per month\n")
## This translates to: 0.0649 minutes per month
cat("Or approximately:", round(coef_date * 365, 4), "minutes per year\n")
## Or approximately: 0.7901 minutes per year
cat("P-value:", round(p_value, 6), "\n")
## P-value: 0.00046
cat("R-squared:", round(r_squared, 4), "\n")
## R-squared: 0.3937
cat("\n")
if (p_value < 0.05) {
  cat("CONCLUSION: There IS a statistically significant trend (p < 0.05)\n")
} else {
  cat("CONCLUSION: There is NO statistically significant trend (p >= 0.05)\n")
}
## CONCLUSION: There IS a statistically significant trend (p < 0.05)

Interpretation of Full Period Results

The regression shows: - Slope = 0.00216 minutes per day, which equals approximately 0.79 minutes per year - This is a small but statistically significant upward trend (p = 0.00046, which is less than 0.05) - R-squared = 0.3937, meaning the linear trend explains about 39% of delay variation - The remaining 61% of variation is driven by other factors, primarily seasonality

Important Caveat: This 27-month period includes 2021, which was still a post-COVID recovery period with potentially different flight volumes and operations. To isolate the trend in normal operations, see the analysis below on 2022-2023 data:

2022-2023 Separate Analysis (Excluding COVID Recovery)

# Analyze trend for 2022-2023 only (excluding recovery period)
monthly_data_2022_2023 <- monthly_data %>%
  filter(year(date) >= 2022)

model_2022_2023 <- lm(per_flight_delay ~ date, data = monthly_data_2022_2023)

cat("\n=== TREND TEST RESULTS (2022-2023 only, excluding COVID recovery) ===\n")
## 
## === TREND TEST RESULTS (2022-2023 only, excluding COVID recovery) ===
coef_date_subset <- coef(model_2022_2023)["date"]
p_value_subset <- summary(model_2022_2023)$coefficients["date", "Pr(>|t|)"]
r_squared_subset <- summary(model_2022_2023)$r.squared

cat("Slope:", round(coef_date_subset, 8), "minutes per day\n")
## Slope: 0.0017013 minutes per day
cat("This translates to:", round(coef_date_subset * 30, 4), "minutes per month\n")
## This translates to: 0.051 minutes per month
cat("Or approximately:", round(coef_date_subset * 365, 4), "minutes per year\n")
## Or approximately: 0.621 minutes per year
cat("P-value:", round(p_value_subset, 4), "\n")
## P-value: 0.1522
cat("R-squared:", round(r_squared_subset, 4), "\n")
## R-squared: 0.1511
if (p_value_subset < 0.05) {
  cat("Trend is statistically significant (p < 0.05)\n")
} else {
  cat("Trend is NOT statistically significant (p >= 0.05)\n")
}
## Trend is NOT statistically significant (p >= 0.05)

Critical Finding: COVID-19 Recovery Effect

When analyzing only 2022-2023 (post-recovery), the trend is NOT statistically significant (p = 0.1522, which exceeds 0.05). This is a crucial discovery:

  • Full period (2021-2023): Upward trend of 0.79 min/year IS significant
  • Normal operations (2022-2023): Upward trend of 2.20 min/year is NOT significant (though the slope is numerically higher, there’s too much variability)

Interpretation: The statistically significant trend in the full dataset is likely driven by the 2021 recovery period starting from abnormally low delays. Once airlines stabilized (2022 onward), the upward trend disappears. This suggests that the full-period trend reflects recovery bounce-back, not genuine operational decline.


Task 5: Check Delays by Year

# Use monthly_data, add year/month columns
ts_with_year <- monthly_data %>%
  mutate(
    year = factor(year(date)),
    month_num = month(date)
  )

# Plot each year as a separate line
ggplot(ts_with_year, aes(x = month_num, y = per_flight_delay, color = year, group = year)) +
  geom_line(size = 1.2) +
  geom_point(size = 2.5) +
  scale_x_continuous(
    breaks = 1:12,
    labels = month.abb
  ) +
  labs(
    title = "Delay Patterns by Year",
    subtitle = "Each line represents a different year (2021, 2022, 2023)",
    x = "Month",
    y = "Average Delay Per Flight (minutes)",
    color = "Year"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 13),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Summary by year
cat("\n=== DELAY STATISTICS BY YEAR ===\n")
## 
## === DELAY STATISTICS BY YEAR ===
ts_with_year %>%
  group_by(year) %>%
  summarise(
    avg = mean(per_flight_delay),
    min = min(per_flight_delay),
    max = max(per_flight_delay),
    .groups = 'drop'
  ) %>%
  print()
## # A tibble: 3 × 4
##   year    avg   min   max
##   <fct> <dbl> <dbl> <dbl>
## 1 2021   1.96 0.930  3.53
## 2 2022   2.69 1.95   3.75
## 3 2023   3.25 2.62   3.70

Key observations: - 2021 (blue line) shows the most variation: high peaks in summer and lower valleys in winter - 2022 (red line) shows a similar pattern with high summer peaks - 2023 only has 3 months of data (January-March), so it’s incomplete - The three years follow very similar monthly patterns

Key finding: All three years show the same seasonal behavior. Summer months (June-August) have higher delays, and winter months have lower delays. This consistency across years proves the pattern is seasonal, not driven by year-to-year differences.


Task 6: Find Seasonality Using Smoothing

Smoothing removes the noise so I can see the true pattern underneath using LOESS smoothing:

ggplot(monthly_data, aes(x = date, y = per_flight_delay)) +
  geom_line(color = "gray50", alpha = 0.5) +
  geom_smooth(method = "loess", span = 0.25, se = TRUE, 
              color = "darkred", size = 1.2) +
  labs(
    title = "Smoothed Delay Pattern",
    subtitle = "The smooth red line shows the underlying seasonal pattern",
    x = "Date",
    y = "Average Delay Per Flight (minutes)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 13),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

What the smooth line shows: - Clear oscillation pattern: The smoothed line goes up and down regularly, forming a wave - Summer peaks (June-August): Delays reach 3-4 minutes per flight - Winter troughs (January-March, November-December): Delays drop to 1.5-2 minutes per flight - Pattern repeats: The pattern repeats every 12 months without fail

Why does this happen? - Summer: More people travel for vacation, leading to more flights and congestion. Thunderstorms and bad weather also increase during summer in many regions. - Winter: Fewer people travel after the holidays, so there is less demand and fewer delays.


Task 7: Analyze Seasonality Using ACF

ACF (Autocorrelation Function) shows if delays in one month are related to delays in other months. A spike at lag 12 would mean delays 12 months apart are correlated, which proves yearly seasonality.

# Extract clean data for ACF (remove NAs)
acf_data <- ts_data_filled %>%
  filter(!is.na(per_flight_delay)) %>%
  pull(per_flight_delay)

# Calculate ACF using base R
acf_result <- acf(acf_data, lag.max = 12, plot = TRUE)

# Show the numbers
cat("\n=== ACF VALUES ===\n")
## 
## === ACF VALUES ===
print(acf_result)
## 
## Autocorrelations of series 'acf_data', by lag
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##  1.000  0.537  0.210 -0.040 -0.108 -0.060  0.011  0.141  0.229  0.129  0.097 
##     11     12 
##  0.049  0.133

What the ACF shows:

The ACF plot shows a clear spike at lag 12. This spike at lag 12 means: - Delays in January are similar to delays in January of the next year - Delays in July are similar to delays in July of the next year - The pattern repeats every 12 months

This finding directly supports the smoothing analysis. We see the same seasonal pattern in multiple ways, which strengthens our conclusion that seasonality is real and predictable.


Task 8: Use PACF to Confirm Seasonality

PACF (Partial Autocorrelation Function) is different from ACF. It shows direct relationships between months, removing indirect effects. This helps confirm that the seasonality is real and direct:

# Extract clean data for PACF (remove NAs)
pacf_data <- ts_data_filled %>%
  filter(!is.na(per_flight_delay)) %>%
  pull(per_flight_delay)

# Calculate PACF using base R
pacf_result <- pacf(pacf_data, lag.max = 12, plot = TRUE)

cat("\n=== PACF VALUES ===\n")
## 
## === PACF VALUES ===
print(pacf_result)
## 
## Partial autocorrelations of series 'pacf_data', by lag
## 
##      1      2      3      4      5      6      7      8      9     10     11 
##  0.537 -0.111 -0.150  0.000  0.049  0.026  0.140  0.114 -0.105  0.097  0.037 
##     12 
##  0.159

What the PACF reveals:

The PACF plot shows a spike at lag 12. The most notable autocorrelation is at lag 1 (0.537), which indicates that delays in one month are moderately correlated with the previous month—typical of time series with both trend and seasonality.

Key observations: - Lag 12 PACF = 0.159: Confirms yearly seasonality exists as a direct effect - Lag 1 PACF = 0.537: Shows strong month-to-month correlation, likely reflecting both the underlying trend and seasonal persistence - Lags 2-11: Mixed positive and negative correlations, with lag 8 showing a secondary peak (0.114), which may reflect the transition from summer to fall

This confirms that delays follow a predictable yearly cycle with carry-over effects from previous months.


Task 9: Break Down the Pattern (Seasonal Decomposition)

This technique breaks the time series into three parts:

  1. Trend: The overall direction (up or down)
  2. Seasonal: The repeating pattern
  3. Residual: Random noise that does not fit the other parts
# Convert to a regular time series for decomposition
ts_regular <- ts(monthly_data$per_flight_delay, start = c(2021, 1), frequency = 12)

# Decompose into trend, seasonal, and residual
decomp <- decompose(ts_regular, type = "additive")

# Plot all three components
plot(decomp)

# Show how big each part is
cat("\n=== COMPONENT SIZES ===\n")
## 
## === COMPONENT SIZES ===
cat("Trend range:", round(min(decomp$trend, na.rm = TRUE), 2), 
    "to", round(max(decomp$trend, na.rm = TRUE), 2), "minutes\n")
## Trend range: 2.01 to 2.86 minutes
cat("Seasonal range:", round(min(decomp$seasonal, na.rm = TRUE), 2), 
    "to", round(max(decomp$seasonal, na.rm = TRUE), 2), "minutes\n")
## Seasonal range: -0.88 to 1.05 minutes
cat("Largest seasonal swing:", round(max(abs(decomp$seasonal), na.rm = TRUE), 2), "minutes\n")
## Largest seasonal swing: 1.05 minutes
# Calculate variation explained
total_var <- var(monthly_data$per_flight_delay, na.rm = TRUE)
seasonal_var <- var(decomp$seasonal, na.rm = TRUE)
trend_var <- var(decomp$trend, na.rm = TRUE)
residual_var <- var(decomp$random, na.rm = TRUE)

cat("\nVariation explained by each component:\n")
## 
## Variation explained by each component:
cat("Trend explains:", round(100 * trend_var / total_var, 1), "% of variation\n")
## Trend explains: 8.2 % of variation
cat("Seasonal explains:", round(100 * seasonal_var / total_var, 1), "% of variation\n")
## Seasonal explains: 38.8 % of variation
cat("Residual (random) explains:", round(100 * residual_var / total_var, 1), "% of variation\n")
## Residual (random) explains: 5.8 % of variation

The decomposition breaks delays into three parts:

  1. Observed (top panel): The actual data with all the ups and downs
  2. Trend (second panel): A slight upward drift from ~2.0 to ~2.9 minutes per flight over 27 months, confirming the regression result
  3. Seasonal (third panel): A repeating wave pattern that oscillates approximately ±1 minute around the trend, with consistent peaks in summer and valleys in winter
  4. Residual (bottom panel): Random noise that doesn’t follow the pattern

Note on interpretation: The trend component from decomposition (ranging 2.01-2.86 minutes, ~0.3 min/year drift) is extracted using a moving average method. The linear regression slope (0.79 min/year for full period) is steeper because linear regression fits the entire trend including end-point effects. Both methods confirm a small upward trend exists in the full period, though this is primarily a recovery effect (as shown by the 2022-2023 analysis).

Key insight: Seasonality creates swings of about ±1 minute around the trend. After accounting for trend and seasonality, the residual variation (43.7%) represents factors like specific airlines, airports, and weather events that are not captured by the overall pattern.

df %>%
  group_by(airport_name) %>%
  summarise(avg_delay = mean(arr_delay / arr_flights, na.rm=TRUE)) %>%
  top_n(10, avg_delay) %>%
  ggplot(aes(x = reorder(airport_name, avg_delay), y = avg_delay)) +
  geom_col(fill="darkred") +
  coord_flip() +
  labs(y="Average Delay per Flight (minutes)",
       x="Airport",
       title="Top 10 Airports by Delay")


Data Limitations and Caveats

Sample size: This analysis covers only 27 months (2.25 years). Ideally, 3-5 complete years strengthen seasonality conclusions. The lag-12 seasonal pattern is evident but based on approximately 2 complete cycles.

COVID-19 recovery period: 2021 was still a recovery period with potentially different flight volumes and operations than normal years. The trend analysis includes this recovery period, which inflates upward trend estimates. The 2022-2023 separate analysis shows that this trend disappears in normal operations.

Incomplete 2023: Only 3 months of 2023 data (Jan-Mar) are included. This limits the ability to assess whether 2023 continues the seasonal pattern established in 2021-2022.

Unexplained variation: Approximately 43.7% of variation (from decomposition) remains unexplained by trend and seasonality. This is driven by airline-specific factors, airport-specific congestion, unexpected weather, and operational issues.


Summary of Findings

Trend: Recovery Effect Dominates, No Significant Operational Decline - Full period (2021-2023): Slope = 0.00216 min/day ≈ 0.79 min/year (p = 0.00046) — statistically significant - Normal operations (2022-2023): Slope = 0.0017013 min/day ≈ 0.621 min/year (p = 0.1522) — NOT statistically significant - The upward trend in the full dataset is driven by the 2021 post-COVID recovery period. When analyzing only 2022-2023, the trend is no longer significant, meaning no genuine operational decline in normal conditions. - R-squared: Full period explains 39% of variation; 2022-2023 only explains 15.1%

Seasonality: Strong and Predictable Yearly Pattern - ACF shows spike at lag 12 (0.133), confirming yearly seasonality - PACF shows lag 12 spike (0.159) and strong lag 1 effect (0.537), showing direct seasonal relationships - Smoothing shows clear peaks in summer (3-4 min/flight) and valleys in winter (1.5-2 min/flight) - Pattern repeats identically every 12 months across all years studied - Seasonal component explains ~38.8% of variation - Seasonal swings of approximately ±1 minute around the trend

Consistency Across Years - 2021, 2022, and 2023 all show the same seasonal behavior - Summer always shows high delays; winter always shows low delays - Year-to-year variation is small (less than 1 minute on average)

What We Cannot Fully Explain - After accounting for trend and seasonality, ~43.7% of variation remains unexplained - This random variation likely comes from: - Specific airlines having different delay profiles - Specific airports being more congested - Unexpected weather events (severe storms) - Operational issues (maintenance, staffing, labor actions) - Holiday travel patterns


Questions for Future Investigation

  • Do specific airlines have different seasonal patterns? Are some more affected by summer travel?
  • Do specific airports show stronger seasonality? How do hub airports differ from smaller ones?
  • What types of delays drive seasonality? Is it weather, carrier, air traffic control, or security delays?
  • Can we build a forecasting model using the seasonal pattern to predict delays?
  • Does weather data (temperature, precipitation) correlate with our seasonal peaks?
  • What are the operational costs of seasonal variation? Can airlines optimize staffing?

Conclusion

This time series analysis shows that airline delays from 2021-2023 are driven by three main factors:

Seasonality (primary driver): Strong yearly pattern with summer peaks and winter troughs. This is highly predictable and repeats every year. Seasonality explains ~39% of variation and is robust across all years studied.

Trend (recovery effect, not operational decline): A statistically significant upward trend of approximately 0.79 minutes per year exists in the full 2021-2023 dataset. However, when analyzed separately, 2022-2023 data shows NO significant trend (p = 0.1522). This indicates the full-period trend is driven by the 2021 post-COVID recovery period, not operational decline. Airlines stabilized by 2022 and show no further degradation in 2022-2023.

Random variation (remaining factor): Even after accounting for trend and seasonality, ~44% of variation remains. This suggests specific carriers, airports, weather, and operational factors create unique delay profiles that vary month to month.

Practical Implications:

Airlines can use the seasonal pattern to plan staffing and resources—expect higher delays in summer months (June-August) and lower delays in winter months (January-March, November-December).

Importantly: The upward trend should NOT be interpreted as operational degradation. The trend reflects recovery from abnormally low 2021 delays; once normal operations resumed in 2022, delays stabilized. Airlines should focus on the predictable seasonal variation rather than worrying about a persistent upward trend.

Future analysis should prioritize: 1. Using 2022-2023+ data for normal operational insight 2. Investigating which types of delays (weather, carrier, ATC, security) drive the seasonal pattern 3. Analyzing airline and airport-specific differences 4. Developing seasonal forecasting models for resource planning