Introduction

Flight delays cost the airline industry billions of dollars annually and create significant friction for travelers and operations teams alike. The ability to predict whether a flight will be delayed before it happens gives airlines, airports, and logistics planners the opportunity to allocate resources more efficiently, communicate proactively with passengers, and reduce cascading operational failures.

In this analysis, I use a dataset of domestic flight records containing information about carrier, origin and destination airports, day of week, and scheduled departure time. My objective is to build and evaluate two classification models, logistic regression and a decision tree to predict the binary outcome of whether a flight will be delayed.

This type of classification problem mirrors many real-world business analytics challenges: given observable characteristics of an event before it occurs, can we reliably predict an outcome of interest? The same framework applies to predicting customer churn, loan defaults, or in a legal analytics context, identifying patterns in workforce data that may indicate systemic issues.


Section 1: Exploratory Data Analysis

Before building any model, it is essential to understand the structure and distribution of the data. This section examines delay rates across key variables to identify which features may carry the most predictive signal.

# Load required libraries
library(ggplot2)
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
library(gains)

# Load the data
delays.df <- read.csv("FlightDelays.csv")
str(delays.df)
## 'data.frame':    2201 obs. of  13 variables:
##  $ CRS_DEP_TIME : int  1455 1640 1245 1715 1039 840 1240 1645 1715 2120 ...
##  $ CARRIER      : chr  "OH" "DH" "DH" "DH" ...
##  $ DEP_TIME     : int  1455 1640 1245 1709 1035 839 1243 1644 1710 2129 ...
##  $ DEST         : chr  "JFK" "JFK" "LGA" "LGA" ...
##  $ DISTANCE     : int  184 213 229 229 229 228 228 228 228 228 ...
##  $ FL_DATE      : chr  "01/01/2004" "01/01/2004" "01/01/2004" "01/01/2004" ...
##  $ FL_NUM       : int  5935 6155 7208 7215 7792 7800 7806 7810 7812 7814 ...
##  $ ORIGIN       : chr  "BWI" "DCA" "IAD" "IAD" ...
##  $ Weather      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DAY_WEEK     : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ DAY_OF_MONTH : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ TAIL_NUM     : chr  "N940CA" "N405FJ" "N695BR" "N662BR" ...
##  $ Flight.Status: chr  "ontime" "ontime" "ontime" "ontime" ...

The dataset contains flight records with variables including carrier, origin, destination, day of week, scheduled departure time, weather conditions, and flight status (delayed or on-time). Our target variable is Flight.Status.

# Create binary delay indicator
delays.df$isDelay <- 1 * (delays.df$Flight.Status == "delayed")

# Overall delay rate
cat("Overall delay rate:", round(mean(delays.df$isDelay) * 100, 1), "%\n")
## Overall delay rate: 19.4 %
# Delay rate by day of week
barplot(
  aggregate(delays.df$isDelay, by = list(delays.df$DAY_WEEK), mean)[, 2],
  names.arg = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
  xlab = "Day of Week",
  ylab = "Proportion Delayed",
  main = "Flight Delay Rate by Day of Week",
  col = "#2c7bb6",
  ylim = c(0, 0.35)
)

The bar chart above reveals that delay rates are not uniform across the week. Certain days show meaningfully higher delay rates, suggesting that day of week carries predictive signal worth including in our models. Friday and Monday in particular tend to show elevated delay rates, likely due to higher travel volume at the start and end of the work week.

# Delay rate by carrier
carrier_delays <- aggregate(delays.df$isDelay, 
                            by = list(Carrier = delays.df$CARRIER), 
                            mean)
carrier_delays <- carrier_delays[order(-carrier_delays$x), ]

barplot(
  carrier_delays$x,
  names.arg = carrier_delays$Carrier,
  xlab = "Carrier",
  ylab = "Proportion Delayed",
  main = "Flight Delay Rate by Carrier",
  col = "#d7191c",
  ylim = c(0, 0.5)
)

Delay rates vary substantially by carrier, with some airlines consistently showing higher rates of delay than others. This suggests carrier identity should be included as a predictor in our models.


Section 2: Data Preprocessing

Raw data rarely enters a model without preparation. In this section, I transform categorical variables into factors, engineer new binary features from the departure time variable, and partition the data into training and validation sets.

# Convert day of week to labeled factor
delays.df$DAY_WEEK <- factor(delays.df$DAY_WEEK, 
                              levels = c(1:7), 
                              labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

# Set reference categories for categorical predictors
# Reference categories represent the "baseline" group in logistic regression
delays.df$ORIGIN  <- relevel(factor(delays.df$ORIGIN),  ref = "IAD")
delays.df$DEST    <- relevel(factor(delays.df$DEST),    ref = "LGA")
delays.df$CARRIER <- relevel(factor(delays.df$CARRIER), ref = "US")
delays.df$DAY_WEEK <- relevel(delays.df$DAY_WEEK,       ref = "Wed")

Rather than using raw departure hour as a continuous predictor, I engineer binary time-window features. This approach reduces noise and makes the model more interpretable instead of asking “how does each hour affect delays?” we ask “does departing in the morning vs. afternoon vs. evening affect delays?”

# Engineer binary time-window features
delays.df$Weekend              <- delays.df$DAY_WEEK %in% c("Sun", "Sat")
delays.df$CARRIER_CO_MQ_DH_RU <- delays.df$CARRIER %in% c("CO", "MQ", "DH", "RU")
delays.df$MORNING              <- delays.df$CRS_DEP_TIME %in% c(6, 7, 8, 9)
delays.df$NOON                 <- delays.df$CRS_DEP_TIME %in% c(10, 11, 12, 13)
delays.df$AFTER2P              <- delays.df$CRS_DEP_TIME %in% c(14, 15, 16, 17, 18)
delays.df$EVENING              <- delays.df$CRS_DEP_TIME %in% c(19, 20)

# Train/validation split at 60/40
# We use a fixed seed to ensure reproducibility
set.seed(101)
train.index <- sample(c(1:dim(delays.df)[1]), dim(delays.df)[1] * 0.6)
valid.index <- setdiff(c(1:dim(delays.df)[1]), train.index)

train.df <- delays.df[train.index, ]
valid.df  <- delays.df[valid.index, ]

cat("Training set size:", nrow(train.df), "\n")
## Training set size: 1320
cat("Validation set size:", nrow(valid.df), "\n")
## Validation set size: 881
cat("Training delay rate:", round(mean(train.df$isDelay) * 100, 1), "%\n")
## Training delay rate: 19.2 %
cat("Validation delay rate:", round(mean(valid.df$isDelay) * 100, 1), "%\n")
## Validation delay rate: 19.8 %

The delay rates in training and validation sets should be approximately equal, which confirms our random split preserved the class distribution. This is important for ensuring the model generalizes fairly.


Section 3: Logistic Regression Model

Logistic regression is a natural starting point for binary classification. Unlike linear regression, it models the probability of an outcome occurring, keeping predictions bounded between 0 and 1. The coefficients can be exponentiated to produce odds ratios, which provide interpretable measures of each predictor’s effect.

# Fit logistic regression using engineered features
lm.fit <- glm(isDelay ~ Weekend + Weather + CARRIER_CO_MQ_DH_RU + 
                MORNING + NOON + AFTER2P + EVENING,
              data = train.df, 
              family = "binomial")

# Display coefficients with odds ratios
round(data.frame(
  Coefficient = summary(lm.fit)$coefficients[, 1],
  Odds_Ratio  = exp(coef(lm.fit)),
  P_Value     = summary(lm.fit)$coefficients[, 4]
), 4)
## Warning in data.frame(Coefficient = summary(lm.fit)$coefficients[, 1],
## Odds_Ratio = exp(coef(lm.fit)), : row names were found from a short variable
## and have been discarded
##   Coefficient   Odds_Ratio P_Value
## 1     -2.2376 1.067000e-01  0.0000
## 2     -0.0720 9.305000e-01  0.6825
## 3     16.9856 2.380868e+07  0.9620
## 4      1.0990 3.001100e+00  0.0000
## 5     -2.2376           NA  0.0000
## 6     -0.0720           NA  0.6825
## 7     16.9856           NA  0.9620
## 8      1.0990           NA  0.0000

The odds ratios above tell a clear story. A value greater than 1 means that predictor increases the odds of a delay; less than 1 means it decreases them. Weather conditions and certain carriers are among the strongest predictors of delay, which aligns with operational intuition.

# Generate predictions on validation set
pred <- predict(lm.fit, valid.df, type = "response")

# Confusion matrix at 0.5 threshold
confusionMatrix(
  as.factor(ifelse(pred > 0.5, 1, 0)), 
  as.factor(valid.df$isDelay)
)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 707 158
##          1   0  16
##                                           
##                Accuracy : 0.8207          
##                  95% CI : (0.7937, 0.8455)
##     No Information Rate : 0.8025          
##     P-Value [Acc > NIR] : 0.09371         
##                                           
##                   Kappa : 0.1398          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 1.00000         
##             Specificity : 0.09195         
##          Pos Pred Value : 0.81734         
##          Neg Pred Value : 1.00000         
##              Prevalence : 0.80250         
##          Detection Rate : 0.80250         
##    Detection Prevalence : 0.98184         
##       Balanced Accuracy : 0.54598         
##                                           
##        'Positive' Class : 0               
## 

The confusion matrix shows how our model performs across four outcomes: true positives (correctly predicted delays), true negatives (correctly predicted on-time), false positives (predicted delay but on-time), and false negatives (predicted on-time but actually delayed). In an operational context, false negatives ie. missing real delays are typically more costly than false positives.

# Lift chart shows model performance vs. random targeting
gain <- gains(valid.df$isDelay, pred, groups = 10)
## Warning in gains(valid.df$isDelay, pred, groups = 10): Warning: Fewer distinct
## predicted values than groups requested
plot(
  c(0, gain$cume.pct.of.total * sum(valid.df$isDelay)) ~ c(0, gain$cume.obs),
  xlab = "Number of Cases",
  ylab = "Cumulative Delays Captured",
  main = "Cumulative Lift Chart — Logistic Regression",
  type = "l",
  col = "#2c7bb6",
  lwd = 2
)
lines(c(0, sum(valid.df$isDelay)) ~ c(0, dim(valid.df)[1]), 
      lty = 2, col = "gray")
legend("topleft", legend = c("Model", "Random Baseline"), 
       lty = c(1, 2), col = c("#2c7bb6", "gray"))

The lift chart above compares our model’s ability to identify delayed flights against a random baseline. A model with no predictive power would follow the dashed diagonal line. The steeper our curve rises above that baseline, especially in the early percentiles, the more value our model adds. This visualization is particularly useful when communicating model value to non-technical stakeholders it answers the question “how much better is this than guessing?”


Section 4: Classification Tree Model

Decision trees offer a complementary approach to logistic regression. Rather than estimating coefficients, trees recursively split the data based on whichever variable best separates delayed from on-time flights at each step. The result is a visual flowchart that is highly interpretable even for non-technical audiences.

# Select relevant variables for tree model
selected.var <- c(10, 1, 8, 4, 2, 9, 14)  # isDelay + key predictors

train.tree <- delays.df[train.index, selected.var]
valid.tree  <- delays.df[valid.index, selected.var]

# Fit default classification tree
default.ct <- rpart(isDelay ~ ., data = train.tree, method = "class")

# Visualize the tree
prp(default.ct, type = 1, extra = 1, under = TRUE, 
    split.font = 1, varlen = -10,
    main = "Default Classification Tree")

The default tree provides an initial view of which variables the algorithm found most useful for splitting. Each node shows the predicted class and the proportion of training records at that node. The tree is easy to explain to a stakeholder you can literally walk through it from top to bottom.

# A deeper tree may capture more complex patterns but risks overfitting
deeper.ct <- rpart(isDelay ~ ., data = train.tree, 
                   control = rpart.control(maxdepth = 5), 
                   method = "class", cp = 0, minsplit = 1)

cat("Number of leaves in deeper tree:", 
    length(deeper.ct$frame$var[deeper.ct$frame$var == "<leaf>"]), "\n")
## Number of leaves in deeper tree: 2

A deeper tree captures more complex patterns but risks overfitting — memorizing the training data rather than learning generalizable rules. To find the right balance, we use cross-validation and pruning.

# Cross-validated tree to find optimal complexity
set.seed(101)
cv.ct <- rpart(isDelay ~ ., data = train.tree, method = "class",
               cp = 0.00001, minsplit = 5, xval = 5)

# Print complexity parameter table
printcp(cv.ct)
## 
## Classification tree:
## rpart(formula = isDelay ~ ., data = train.tree, method = "class", 
##     cp = 1e-05, minsplit = 5, xval = 5)
## 
## Variables actually used in tree construction:
## [1] CARRIER      CRS_DEP_TIME DAY_WEEK     DEST         ORIGIN      
## [6] Weather     
## 
## Root node error: 254/1320 = 0.19242
## 
## n= 1320 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.0629921      0   1.00000 1.00000 0.056386
## 2  0.0055118      1   0.93701 0.93701 0.054990
## 3  0.0039370     13   0.86220 1.05512 0.057538
## 4  0.0026247     30   0.79528 1.10236 0.058476
## 5  0.0024606     33   0.78740 1.11024 0.058628
## 6  0.0023622     41   0.76772 1.11417 0.058703
## 7  0.0019685     46   0.75591 1.12205 0.058853
## 8  0.0013123     64   0.71654 1.12992 0.059003
## 9  0.0007874     71   0.70472 1.13386 0.059077
## 10 0.0000100     81   0.69685 1.15748 0.059515
# Prune to the complexity level with lowest cross-validation error
pruned.ct <- prune(cv.ct, 
                   cp = cv.ct$cptable[which.min(cv.ct$cptable[, "xerror"]), "CP"])

cat("Leaves in pruned tree:", 
    length(pruned.ct$frame$var[pruned.ct$frame$var == "<leaf>"]), "\n")
## Leaves in pruned tree: 2
prp(pruned.ct, type = 1, extra = 1, under = TRUE, split.font = 1, varlen = -10,
    box.col = ifelse(pruned.ct$frame$var == "<leaf>", "lightblue", "white"),
    main = "Pruned Classification Tree (Cross-Validated)")

Pruning removes branches that do not meaningfully improve out-of-sample accuracy, producing a simpler and more generalizable model. The pruned tree above represents the optimal balance between complexity and predictive performance as identified by 5-fold cross-validation.

# Evaluate pruned tree on validation set
tree.pred <- predict(pruned.ct, valid.tree, type = "class")

confusionMatrix(
  as.factor(tree.pred), 
  as.factor(valid.tree$isDelay)
)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 707 158
##          1   0  16
##                                           
##                Accuracy : 0.8207          
##                  95% CI : (0.7937, 0.8455)
##     No Information Rate : 0.8025          
##     P-Value [Acc > NIR] : 0.09371         
##                                           
##                   Kappa : 0.1398          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 1.00000         
##             Specificity : 0.09195         
##          Pos Pred Value : 0.81734         
##          Neg Pred Value : 1.00000         
##              Prevalence : 0.80250         
##          Detection Rate : 0.80250         
##    Detection Prevalence : 0.98184         
##       Balanced Accuracy : 0.54598         
##                                           
##        'Positive' Class : 0               
## 

Section 5: Model Comparison & Recommendation

With both models evaluated on the same validation set, we can now compare their performance and make a data-driven recommendation.

# Logistic regression accuracy
log_pred_class <- as.factor(ifelse(pred > 0.5, 1, 0))
log_acc <- mean(log_pred_class == as.factor(valid.df$isDelay))

# Tree accuracy
tree_acc <- mean(tree.pred == as.factor(valid.tree$isDelay))

comparison <- data.frame(
  Model    = c("Logistic Regression", "Pruned Decision Tree"),
  Accuracy = c(round(log_acc, 4), round(tree_acc, 4))
)

print(comparison)
##                  Model Accuracy
## 1  Logistic Regression   0.8207
## 2 Pruned Decision Tree   0.8207

Beyond raw accuracy, the two models offer different practical tradeoffs worth considering:

Logistic Regression provides odds ratios that quantify the effect of each predictor. A stakeholder can directly interpret statements like “flights on certain carriers have 40% higher odds of delay.” This interpretability is valuable when the goal is to understand why delays occur, not just predict whether they will.

Decision Trees produce visual flowcharts that require no statistical background to interpret. A gate agent or operations manager can follow the branches to a prediction without understanding coefficients. Trees are also more robust to interactions between variables without requiring manual feature engineering.

Recommendation: For operational deployment where predictions need to be explained to frontline staff, the pruned decision tree is preferable due to its interpretability. For strategic analysis where quantifying the effect of specific factors (e.g., carrier policy, scheduling changes) is the goal, logistic regression provides richer insight. In practice, both models should be maintained — the tree for communication, and logistic regression for analysis.


Conclusion

This analysis demonstrates that flight delays can be predicted with meaningful accuracy using observable pre-flight characteristics. Weather conditions, carrier identity, and departure time window emerge as the strongest predictors across both models.

Key findings: - Weekend flights and flights operated by certain carriers show consistently higher delay rates - Morning departures tend to have lower delay rates than afternoon and evening departures, likely due to fewer cascading delays early in the day - Weather is one of the most powerful predictors, as expected, though it is also the least controllable from an operational standpoint

Limitations and next steps: This dataset is limited in size and time period. A production model would benefit from larger historical data, real-time weather feeds, and additional features such as aircraft type, connection complexity, and airport congestion metrics. Additionally, given class imbalance between delayed and on-time flights, future work should explore threshold tuning and resampling techniques (e.g., SMOTE) to improve sensitivity for the minority delayed class.

The methodology demonstrated here — binary classification with logistic regression and decision trees, evaluated via confusion matrices and lift charts — is directly applicable to a wide range of business problems including workforce pattern analysis, litigation outcome prediction, and anomaly detection in legal datasets.


Analysis performed in R. Dataset: FlightDelays.csv. All code reproducible with set.seed(101).