Introduction

Today’s globalized world depends on efficient and seamless transportation systems, and aviation is among the most efficient available. Unfortunately, flight delays remain common regardless of passenger background or location. The International Air Transport Association (IATA) reports that flight delays result in losses of approximately $30 billion annually. Passengers affected by delays face financial burdens such as missed connections, additional accommodation costs, and lost productivity.

In hopes of mitigating the detrimental effects of airline delays, we aim to determine how different features contribute to flight delays. We would also like to evaluate the accuracy of various regression and classification models in estimating airline departure delays.

Specifically, this project aims to answer these questions:

  1. How well does the machine learning model perform in predicting departure delay status using classification and regression approaches?

  2. How do various factors contribute to the departure delay of airline flights?

Objectives

For this study, we aim to address the following issues:

  1. To evaluate the accuracy of a classification model in predicting whether a flight is likely to experience departure delay.

  2. To assess the effectiveness of a regression model in providing precise estimates of departure delay.

  3. To determine the correlations between different features that are related to flight departure delay.

Importing Libraries

First and foremost, we import all the relevant libraries that we will be using throughout this project.

Data Collection

The data is collected from: https://www.kaggle.com/datasets/usdot/flight-delays?select=flights.csv. This dataset originates from the U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics in 2015.

The overall structure and dimensions of our dataset are displayed below.

# read data
raw_data <- read.csv("flights.csv")

data <- raw_data

# view the structure and dimensions of our data
str(data)
## 'data.frame':    5819079 obs. of  31 variables:
##  $ YEAR               : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ MONTH              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DAY                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DAY_OF_WEEK        : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ AIRLINE            : chr  "AS" "AA" "US" "AA" ...
##  $ FLIGHT_NUMBER      : int  98 2336 840 258 135 806 612 2013 1112 1173 ...
##  $ TAIL_NUMBER        : chr  "N407AS" "N3KUAA" "N171US" "N3HYAA" ...
##  $ ORIGIN_AIRPORT     : chr  "ANC" "LAX" "SFO" "LAX" ...
##  $ DESTINATION_AIRPORT: chr  "SEA" "PBI" "CLT" "MIA" ...
##  $ SCHEDULED_DEPARTURE: int  5 10 20 20 25 25 25 30 30 30 ...
##  $ DEPARTURE_TIME     : int  2354 2 18 15 24 20 19 44 19 33 ...
##  $ DEPARTURE_DELAY    : int  -11 -8 -2 -5 -1 -5 -6 14 -11 3 ...
##  $ TAXI_OUT           : int  21 12 16 15 11 18 11 13 17 12 ...
##  $ WHEELS_OFF         : int  15 14 34 30 35 38 30 57 36 45 ...
##  $ SCHEDULED_TIME     : int  205 280 286 285 235 217 181 273 195 221 ...
##  $ ELAPSED_TIME       : int  194 279 293 281 215 230 170 249 193 203 ...
##  $ AIR_TIME           : int  169 263 266 258 199 206 154 228 173 186 ...
##  $ DISTANCE           : int  1448 2330 2296 2342 1448 1589 1299 2125 1464 1747 ...
##  $ WHEELS_ON          : int  404 737 800 748 254 604 504 745 529 651 ...
##  $ TAXI_IN            : int  4 4 11 8 5 6 5 8 3 5 ...
##  $ SCHEDULED_ARRIVAL  : int  430 750 806 805 320 602 526 803 545 711 ...
##  $ ARRIVAL_TIME       : int  408 741 811 756 259 610 509 753 532 656 ...
##  $ ARRIVAL_DELAY      : int  -22 -9 5 -9 -21 8 -17 -10 -13 -15 ...
##  $ DIVERTED           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CANCELLED          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CANCELLATION_REASON: chr  "" "" "" "" ...
##  $ AIR_SYSTEM_DELAY   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SECURITY_DELAY     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ AIRLINE_DELAY      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LATE_AIRCRAFT_DELAY: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WEATHER_DELAY      : int  NA NA NA NA NA NA NA NA NA NA ...
dim(data)
## [1] 5819079      31
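
Reading a 5.8-million-row CSV with read.csv can be slow. As an optional alternative that is not used in the rest of this report (assuming the data.table package is installed), fread reads the same file considerably faster:

# optional: a much faster reader for large CSVs (requires the data.table package)
library(data.table)
raw_data <- fread("flights.csv", data.table = FALSE)  # returns a plain data.frame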

Data Preprocessing

Data Cleaning

First, we check the number and the percentage of missing values for each variable.

# check the number of missing values or empty strings for each variable
missing_values_count <- colSums(is.na(data)| data == "") %>% sort(decreasing = TRUE)
print(missing_values_count)
## CANCELLATION_REASON    AIR_SYSTEM_DELAY      SECURITY_DELAY       AIRLINE_DELAY 
##             5729195             4755640             4755640             4755640 
## LATE_AIRCRAFT_DELAY       WEATHER_DELAY        ELAPSED_TIME            AIR_TIME 
##             4755640             4755640              105071              105071 
##       ARRIVAL_DELAY           WHEELS_ON             TAXI_IN        ARRIVAL_TIME 
##              105071               92513               92513               92513 
##            TAXI_OUT          WHEELS_OFF      DEPARTURE_TIME     DEPARTURE_DELAY 
##               89047               89047               86153               86153 
##         TAIL_NUMBER      SCHEDULED_TIME                YEAR               MONTH 
##               14721                   6                   0                   0 
##                 DAY         DAY_OF_WEEK             AIRLINE       FLIGHT_NUMBER 
##                   0                   0                   0                   0 
##      ORIGIN_AIRPORT DESTINATION_AIRPORT SCHEDULED_DEPARTURE            DISTANCE 
##                   0                   0                   0                   0 
##   SCHEDULED_ARRIVAL            DIVERTED           CANCELLED 
##                   0                   0                   0
# check the percentage of missing values for each variable
missingPerc <- (colSums(is.na(data)| data == "")/dim(data)[1]) %>% sort(decreasing = TRUE)
missing_perc_percentage <- sprintf("%.2f%%", missingPerc * 100)
print(cbind(Column = names(missingPerc), Missing_Percentage = missing_perc_percentage))
##       Column                Missing_Percentage
##  [1,] "CANCELLATION_REASON" "98.46%"          
##  [2,] "AIR_SYSTEM_DELAY"    "81.72%"          
##  [3,] "SECURITY_DELAY"      "81.72%"          
##  [4,] "AIRLINE_DELAY"       "81.72%"          
##  [5,] "LATE_AIRCRAFT_DELAY" "81.72%"          
##  [6,] "WEATHER_DELAY"       "81.72%"          
##  [7,] "ELAPSED_TIME"        "1.81%"           
##  [8,] "AIR_TIME"            "1.81%"           
##  [9,] "ARRIVAL_DELAY"       "1.81%"           
## [10,] "WHEELS_ON"           "1.59%"           
## [11,] "TAXI_IN"             "1.59%"           
## [12,] "ARRIVAL_TIME"        "1.59%"           
## [13,] "TAXI_OUT"            "1.53%"           
## [14,] "WHEELS_OFF"          "1.53%"           
## [15,] "DEPARTURE_TIME"      "1.48%"           
## [16,] "DEPARTURE_DELAY"     "1.48%"           
## [17,] "TAIL_NUMBER"         "0.25%"           
## [18,] "SCHEDULED_TIME"      "0.00%"           
## [19,] "YEAR"                "0.00%"           
## [20,] "MONTH"               "0.00%"           
## [21,] "DAY"                 "0.00%"           
## [22,] "DAY_OF_WEEK"         "0.00%"           
## [23,] "AIRLINE"             "0.00%"           
## [24,] "FLIGHT_NUMBER"       "0.00%"           
## [25,] "ORIGIN_AIRPORT"      "0.00%"           
## [26,] "DESTINATION_AIRPORT" "0.00%"           
## [27,] "SCHEDULED_DEPARTURE" "0.00%"           
## [28,] "DISTANCE"            "0.00%"           
## [29,] "SCHEDULED_ARRIVAL"   "0.00%"           
## [30,] "DIVERTED"            "0.00%"           
## [31,] "CANCELLED"           "0.00%"

We drop all variables with a very large share of missing values (>80%).

# dropping variables with a large share of missing values (>80%)
toDrop <- c("AIR_SYSTEM_DELAY", "SECURITY_DELAY", "AIRLINE_DELAY", "LATE_AIRCRAFT_DELAY", "WEATHER_DELAY", "CANCELLATION_REASON")
data <- data[,-which(names(data) %in% toDrop)]

We also drop irrelevant variables identified during the exploratory data analysis.

meaninglessVars <- c("YEAR", "DIVERTED", "CANCELLED", "TAIL_NUMBER","ELAPSED_TIME", "AIR_TIME","WHEELS_ON", "SCHEDULED_TIME", "TAXI_IN", "TAXI_OUT", "SCHEDULED_ARRIVAL", "ARRIVAL_TIME")
data <- data[,-which(names(data) %in% meaninglessVars)]

head(data)
##   MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT
## 1     1   1           4      AS            98            ANC
## 2     1   1           4      AA          2336            LAX
## 3     1   1           4      US           840            SFO
## 4     1   1           4      AA           258            LAX
## 5     1   1           4      AS           135            SEA
## 6     1   1           4      DL           806            SFO
##   DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY
## 1                 SEA                   5           2354             -11
## 2                 PBI                  10              2              -8
## 3                 CLT                  20             18              -2
## 4                 MIA                  20             15              -5
## 5                 ANC                  25             24              -1
## 6                 MSP                  25             20              -5
##   WHEELS_OFF DISTANCE ARRIVAL_DELAY
## 1         15     1448           -22
## 2         14     2330            -9
## 3         34     2296             5
## 4         30     2342            -9
## 5         35     1448           -21
## 6         38     1589             8

Feature Engineering

Next, we create a binary variable indicating whether a flight is domestic or international, and a variable that categorizes ‘DAY’ into halves of the month. Finally, we drop redundant variables to avoid information duplication and remove observations with missing values.

# first we define the airports by their abbreviation
americanAirports <- c("ATL", "ORD", "DFW", "DEN", "LAX", "SFO", "PHX", "IAH", "LAS", "MSP", "MCO", "SEA", "DTW", "BOS", "EWR", "CLT", "LGA", "SLC", "JFK", "BWI", "MDW", "DCA", "FLL", "SAN", "MIA", "PHL", "TPA", "DAL", "HOU", "BNA", "PDX", "STL", "HNL", "OAK", "AUS", "MSY", "MCI", "SJC", "SMF", "SNA", "CLE", "IAD", "RDU", "MKE", "SAT", "RSW", "IND", "SJU", "CMH", "PIT", "PBI", "OGG", "CVG", "ABQ", "BUR", "BDL", "JAX", "ONT", "BUF", "OMA", "OKC", "ANC", "RIC", "TUS", "MEM", "TUL", "RNO", "BHM", "ELP", "CHS", "BOI", "KOA", "PVD", "GRR", "LIH", "LIT", "SDF", "GEG", "ORF", "XNA", "MSN", "PSP", "LGB")

# binary variable indicating whether the origin airport is in the USA
data$USA_ORIG <- ifelse(data$ORIGIN_AIRPORT %in% americanAirports, 1, 0)

# binary variable indicating whether the destination airport is in the USA
data$USA_DEST <- ifelse(data$DESTINATION_AIRPORT %in% americanAirports, 1, 0)

# binary variable: domestic flight = 1, international flight = 0
data$US_FLIGHT <- ifelse(data$USA_ORIG == 1 & data$USA_DEST == 1, 1, 0)

# two-level variable equal to 1 if the flight departed between the 1st and 15th of the month, else 2
data$MONTH_HALF <- ifelse(data$DAY %in% 1:15, 1, 2)

# then we will drop the irrelevant variables to avoid information duplication
toDrop2 <- c("USA_ORIG", "USA_DEST", "DESTINATION_AIRPORT", "DAY")
data <- data[,-which(names(data) %in% toDrop2)]

# drop all the observations that had missing values.
data <- na.omit(data)
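
As an optional sanity check on the engineered variables (not required for the pipeline), we can tabulate the new flags:

# distribution of the engineered flags
table(data$US_FLIGHT)
table(data$MONTH_HALF)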

Format Correction

Here we convert the categorical variables AIRLINE and ORIGIN_AIRPORT to factors and encode each level as an integer.

data$ORIGIN_AIRPORT <- as.integer(factor(data$ORIGIN_AIRPORT))
data$AIRLINE <- as.integer(factor(data$AIRLINE))

head(data,5)
##   MONTH DAY_OF_WEEK AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT SCHEDULED_DEPARTURE
## 1     1           4       2            98            324                   5
## 2     1           4       1          2336            483                  10
## 3     1           4      12           840            585                  20
## 4     1           4       1           258            483                  20
## 5     1           4       2           135            584                  25
##   DEPARTURE_TIME DEPARTURE_DELAY WHEELS_OFF DISTANCE ARRIVAL_DELAY US_FLIGHT
## 1           2354             -11         15     1448           -22         1
## 2              2              -8         14     2330            -9         1
## 3             18              -2         34     2296             5         1
## 4             15              -5         30     2342            -9         1
## 5             24              -1         35     1448           -21         1
##   MONTH_HALF
## 1          1
## 2          1
## 3          1
## 4          1
## 5          1
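
One caveat of integer encoding is that the original labels disappear from the modelling table. A minimal sketch for keeping the code-to-label mapping around for later interpretation, assuming every airline code in raw_data survives the cleaning steps above:

# lookup table mapping the integer codes back to the original airline codes
airline_levels <- levels(factor(raw_data$AIRLINE))
airline_lookup <- data.frame(code = seq_along(airline_levels), AIRLINE = airline_levels)
head(airline_lookup)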

Dependent Variable

Since our objective is to predict flight delay, we use ‘DEPARTURE_DELAY’ as our dependent variable and keep only observations with a departure delay of zero minutes or more. Below we summarise the variable before filtering and then plot the filtered values.

summary(data$DEPARTURE_DELAY)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  -82.000   -5.000   -2.000    9.295    7.000 1988.000
data <- data[data$DEPARTURE_DELAY >= 0,]

ggplot(data, aes(DEPARTURE_DELAY)) + geom_histogram(aes(y=..count..), fill="blue", alpha = 1.0,color="white", bins = 60)+ labs(x = "Departure delay (mins)", y = "Frequency")

From the graph above, the distribution of our dependent variable is right-skewed and highly asymmetrical. We therefore use the natural logarithm of the departure delay as our dependent variable, which makes the distribution more symmetrical and closer to normal. We also create a column that classifies the departure delay: flights delayed by roughly 12 minutes or more are treated as delayed (coded 0 in ‘classDelay’, with shorter delays coded 1).
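
Because we shift the delay by one minute before taking logs (see the code below), the cutoff of 2.5 on the log scale corresponds to a raw delay of exp(2.5) - 1; a quick check:

# raw-scale cutoff implied by log(DEPARTURE_DELAY + 1) >= 2.5
exp(2.5) - 1  # approximately 11.18 minutes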

Finally, we remove the ‘DEPARTURE_DELAY’ column, leaving ‘log_DEPARTURE_DELAY’ as the dependent variable for the regression problem and ‘classDelay’ as the dependent variable for the classification problem.

# shift by one minute so zero-minute delays map to log(1) = 0
data$DEPARTURE_DELAY <- data$DEPARTURE_DELAY + 1
data$log_DEPARTURE_DELAY <- log(data$DEPARTURE_DELAY)

ggplot(data, aes(log_DEPARTURE_DELAY)) + geom_histogram(aes(y=..count..), fill="blue", alpha = 1.0, color="white", bins = 60) +labs(x = "Departure delay (ln mins)", y = "Frequency")

data <- data[,-which(names(data)=='DEPARTURE_DELAY')]
# classDelay: 0 = delayed (log delay >= 2.5, i.e. roughly 12+ minutes), 1 = otherwise
data <- data %>% mutate(classDelay = ifelse(log_DEPARTURE_DELAY >= 2.5, 0, 1))
dim(data)
## [1] 2443491      14

Data Sampling

The cleaned dataset consists of 2,443,491 observations and 14 columns. Due to time and computing constraints, we draw a random sample of 15,000 observations from this population.

set.seed(333)
sample_size <- 15000
sampledData <- data[sample(nrow(data), sample_size, replace = FALSE), ]
dim(sampledData)
## [1] 15000    14

Exploratory Data Analysis

Summary Statistics

Here we present the summary statistics of each variable in the data.

summary(sampledData)
##      MONTH         DAY_OF_WEEK       AIRLINE      FLIGHT_NUMBER 
##  Min.   : 1.000   Min.   :1.000   Min.   : 1.00   Min.   :   1  
##  1st Qu.: 3.000   1st Qu.:2.000   1st Qu.: 4.00   1st Qu.: 712  
##  Median : 6.000   Median :4.000   Median : 9.00   Median :1552  
##  Mean   : 6.416   Mean   :3.926   Mean   : 8.22   Mean   :2023  
##  3rd Qu.: 9.000   3rd Qu.:6.000   3rd Qu.:14.00   3rd Qu.:2852  
##  Max.   :12.000   Max.   :7.000   Max.   :14.00   Max.   :7438  
##  ORIGIN_AIRPORT  SCHEDULED_DEPARTURE DEPARTURE_TIME   WHEELS_OFF  
##  Min.   :  1.0   Min.   :   3        Min.   :   1   Min.   :   1  
##  1st Qu.:385.0   1st Qu.:1055        1st Qu.:1111   1st Qu.:1123  
##  Median :473.0   Median :1454        Median :1514   Median :1526  
##  Mean   :444.2   Mean   :1434        Mean   :1465   Mean   :1483  
##  3rd Qu.:535.0   3rd Qu.:1815        3rd Qu.:1843   3rd Qu.:1857  
##  Max.   :628.0   Max.   :2359        Max.   :2400   Max.   :2400  
##     DISTANCE      ARRIVAL_DELAY       US_FLIGHT        MONTH_HALF   
##  Min.   :  31.0   Min.   : -58.00   Min.   :0.0000   Min.   :1.000  
##  1st Qu.: 408.0   1st Qu.:  -4.00   1st Qu.:1.0000   1st Qu.:1.000  
##  Median : 733.0   Median :   8.00   Median :1.0000   Median :2.000  
##  Mean   : 891.9   Mean   :  23.29   Mean   :0.7771   Mean   :1.517  
##  3rd Qu.:1172.0   3rd Qu.:  30.00   3rd Qu.:1.0000   3rd Qu.:2.000  
##  Max.   :4983.0   Max.   :1323.00   Max.   :1.0000   Max.   :2.000  
##  log_DEPARTURE_DELAY   classDelay   
##  Min.   :0.000       Min.   :0.000  
##  1st Qu.:1.386       1st Qu.:0.000  
##  Median :2.485       Median :1.000  
##  Mean   :2.387       Mean   :0.518  
##  3rd Qu.:3.497       3rd Qu.:1.000  
##  Max.   :7.197       Max.   :1.000

Data Visualization

Histograms

The histogram of DAY_OF_WEEK shows that weekends have fewer flight delays than weekdays; Monday, Thursday, and Friday record the largest numbers of delayed flights.

ggplot(sampledData, aes(x = DAY_OF_WEEK, fill = factor(classDelay))) +
  geom_bar(position = "dodge", stat = "count") +
  labs(title = "Histogram of DAY_OF_WEEK",
       x = "DAY_OF_WEEK",
       y = "Count",
       fill = "Delay")

From the histogram of MONTH below, we can see that the number of flights decreases after August, and the number of flight delays falls with it.

ggplot(sampledData, aes(x = MONTH, fill = factor(classDelay))) +
  geom_bar(position = "dodge", stat = "count") +
  labs(title = "Histogram of MONTH",
       x = "MONTH",
       y = "Count",
       fill = "Delay")

From the histogram of AIRLINE, airline ‘WN’ has the highest number of delayed flights, although this may simply reflect that it operates the most flights in the sample.

ggplot(sampledData, aes(x = AIRLINE, fill = factor(classDelay))) +
  geom_bar(position = "dodge", stat = "count") +
  labs(title = "Histogram of AIRLINE",
       x = "AIRLINE",
       y = "Count",
       fill = "Delay")

From the histogram of US_FLIGHT, domestic flights (labelled 1) account for a far greater number of delays than international flights (labelled 0).

ggplot(sampledData, aes(x = factor(US_FLIGHT), fill = factor(classDelay))) +
  geom_bar(position = "dodge", stat = "count") +
  labs(title = "Histogram of US_FLIGHT",
       x = "US_FLIGHT",
       y = "Count",
       fill = "Delay") + scale_x_discrete(labels = c("0", "1"))

Correlation Matrix

The correlation matrix below displays the pairwise relationships between the variables. The closer a value (colour) is to 1 or -1, the stronger the correlation between the corresponding pair of variables.

# Create a correlation matrix
cor_matrix <- cor(sampledData)
corrplot(cor_matrix, tl.col = "black")
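
To read the matrix more directly, one option (a convenience sketch using the cor_matrix computed above) is to rank the features by their absolute correlation with the regression target:

# rank features by absolute correlation with the log-delay target
sort(abs(cor_matrix[, "log_DEPARTURE_DELAY"]), decreasing = TRUE)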

Classification

Modelling

Before modelling, we split our data into training and testing sets using an 80/20 split. Since this part performs classification, we drop the ‘log_DEPARTURE_DELAY’ column.

classiData <- subset(sampledData, select = -c(log_DEPARTURE_DELAY))
classiData$classDelay <- as.factor(classiData$classDelay)

set.seed(333)
splitIndex <- createDataPartition(classiData$classDelay, p = 0.8, list = FALSE)
trainingSet <- classiData[splitIndex,]
testingSet <- classiData[-splitIndex,]

cat("Dimension of training:" ,dim(trainingSet), '\n')
## Dimension of training: 12000 13
cat("Dimension of testing:" ,dim(testingSet))
## Dimension of testing: 3000 13

Here, we check the class balance to ensure our model is not biased towards either class. Both classes have proportions close to 0.50.

print("Distribution of classes: ")
## [1] "Distribution of classes: "
table(trainingSet$classDelay)
## 
##    0    1 
## 5784 6216
print("Distribution of classes (prop): ")
## [1] "Distribution of classes (prop): "
prop.table(table(trainingSet$classDelay))
## 
##     0     1 
## 0.482 0.518

Modelling with the Naive Bayes classification model.

model1 <- NaiveBayes(classDelay~.,data = trainingSet)
xTest <- testingSet[,1:12]
yTest <- testingSet[,13]

predictions <- predict(model1, xTest)
cm <- confusionMatrix(predictions$class,yTest)
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  892  106
##          1  554 1448
##                                           
##                Accuracy : 0.78            
##                  95% CI : (0.7647, 0.7947)
##     No Information Rate : 0.518           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5546          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6169          
##             Specificity : 0.9318          
##          Pos Pred Value : 0.8938          
##          Neg Pred Value : 0.7233          
##              Prevalence : 0.4820          
##          Detection Rate : 0.2973          
##    Detection Prevalence : 0.3327          
##       Balanced Accuracy : 0.7743          
##                                           
##        'Positive' Class : 0               
## 

Modelling with the Random Forest classification model.

model2 <- randomForest(classDelay~., data = trainingSet)
predictions <- predict(model2, xTest)
cm <- confusionMatrix(predictions, yTest)
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1173  202
##          1  273 1352
##                                           
##                Accuracy : 0.8417          
##                  95% CI : (0.8281, 0.8546)
##     No Information Rate : 0.518           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6824          
##                                           
##  Mcnemar's Test P-Value : 0.001319        
##                                           
##             Sensitivity : 0.8112          
##             Specificity : 0.8700          
##          Pos Pred Value : 0.8531          
##          Neg Pred Value : 0.8320          
##              Prevalence : 0.4820          
##          Detection Rate : 0.3910          
##    Detection Prevalence : 0.4583          
##       Balanced Accuracy : 0.8406          
##                                           
##        'Positive' Class : 0               
## 

Modelling with the k-Nearest Neighbors (kNN) classification model.

model3 <- train(classDelay~., data = trainingSet, method = 'knn')
predictions <- predict(model3, xTest)
cm <- confusionMatrix(predictions, yTest)
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  766  452
##          1  680 1102
##                                          
##                Accuracy : 0.6227         
##                  95% CI : (0.605, 0.6401)
##     No Information Rate : 0.518          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.2402         
##                                          
##  Mcnemar's Test P-Value : 1.511e-11      
##                                          
##             Sensitivity : 0.5297         
##             Specificity : 0.7091         
##          Pos Pred Value : 0.6289         
##          Neg Pred Value : 0.6184         
##              Prevalence : 0.4820         
##          Detection Rate : 0.2553         
##    Detection Prevalence : 0.4060         
##       Balanced Accuracy : 0.6194         
##                                          
##        'Positive' Class : 0              
## 

Evaluation

In the classification modelling session, we used three classification models, Naive Bayes, Random Forest, and k-Nearest Neighbors, to predict flight delays.

Overall, the Random Forest classifier performed best, achieving 84.2% accuracy. Its Kappa statistic, which measures the agreement between the model’s predictions and the actual outcomes, is 0.682, suggesting a substantial level of agreement. The model also shows good sensitivity and specificity, indicating its ability to correctly identify both positive and negative instances.

The Naive Bayes classifier has good overall accuracy at 78%, yet its sensitivity is quite low at only 61.7%, suggesting the model is better at identifying negative instances than positive ones. Its Kappa statistic of about 0.55 indicates a moderate level of agreement. Finally, the kNN classifier performed worst of the three, achieving 62.3% accuracy with 53.0% sensitivity and 70.9% specificity; its Kappa statistic of about 0.24 indicates only a low level of agreement. One likely factor is that kNN is distance-based and the predictors here are unscaled; centring and scaling them (for example via caret’s preProcess argument) would be a natural first refinement.
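
For a side-by-side view, the key statistics can be collected into one table. A minimal sketch, assuming the three confusion matrices above were saved as cm_nb, cm_rf, and cm_knn rather than each overwriting cm:

# compare the three classifiers on the held-out set
models <- list(NaiveBayes = cm_nb, RandomForest = cm_rf, kNN = cm_knn)
comparison <- t(sapply(models, function(m) c(m$overall["Accuracy"],
                                             m$overall["Kappa"],
                                             m$byClass["Sensitivity"],
                                             m$byClass["Specificity"])))
round(comparison, 3)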

Regression

Modelling

We define a new dataset (regData) for our regression models. The ‘classDelay’ column used as the target variable for the classification models is dropped in this section, as it does not apply to the regression models.

# defining the variable
regData <- sampledData[, !names(sampledData) %in% c("classDelay")]
head(regData)
##         MONTH DAY_OF_WEEK AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT
## 2022412     5           6      10          6411            535
## 153312      1           6       4          2521            400
## 319644      1           3       8          3606            400
## 3365753     7           3       1           683            594
## 5163910    11           4       4          2825            483
## 1634846     4           3       3           457            346
##         SCHEDULED_DEPARTURE DEPARTURE_TIME WHEELS_OFF DISTANCE ARRIVAL_DELAY
## 2022412                 925            934       1004     1041             7
## 153312                 1950           1950       2001      405           -29
## 319644                 2114           2134       2142      235             0
## 3365753                1230           1233       1243      647             0
## 5163910                1130           1132       1146      590            -5
## 1634846                1025           1025       1037      413            -8
##         US_FLIGHT MONTH_HALF log_DEPARTURE_DELAY
## 2022412         1          1            2.302585
## 153312          1          1            0.000000
## 319644          1          2            3.044522
## 3365753         1          2            1.386294
## 5163910         1          2            1.098612
## 1634846         1          1            0.000000
str(regData)
## 'data.frame':    15000 obs. of  13 variables:
##  $ MONTH              : int  5 1 1 7 11 4 10 9 12 7 ...
##  $ DAY_OF_WEEK        : int  6 6 3 3 4 3 7 4 3 6 ...
##  $ AIRLINE            : int  10 4 8 1 4 3 2 1 11 1 ...
##  $ FLIGHT_NUMBER      : int  6411 2521 3606 683 2825 457 26 1333 983 47 ...
##  $ ORIGIN_AIRPORT     : int  535 400 400 594 483 346 265 545 510 535 ...
##  $ SCHEDULED_DEPARTURE: int  925 1950 2114 1230 1130 1025 1810 1745 1505 1210 ...
##  $ DEPARTURE_TIME     : int  934 1950 2134 1233 1132 1025 1811 1758 1526 1215 ...
##  $ WHEELS_OFF         : int  1004 2001 2142 1243 1146 1037 1829 1823 1548 1230 ...
##  $ DISTANCE           : int  1041 405 235 647 590 413 1721 678 964 1739 ...
##  $ ARRIVAL_DELAY      : int  7 -29 0 0 -5 -8 -9 3 13 -15 ...
##  $ US_FLIGHT          : num  1 1 1 1 1 1 0 1 1 1 ...
##  $ MONTH_HALF         : num  1 1 2 2 2 1 2 2 2 2 ...
##  $ log_DEPARTURE_DELAY: num  2.3 0 3.04 1.39 1.1 ...

The regression dataset (regData) is split into a 70% training sample and a 30% test sample.

# divide the data into train and test set
set.seed(123)  # Set seed for reproducibility
index <- createDataPartition(regData$log_DEPARTURE_DELAY, p = 0.7, list = FALSE)

regTrain <- regData[index,]
regTest <- regData[-index,]

summary(regTrain$log_DEPARTURE_DELAY)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.386   2.485   2.391   3.497   7.137
summary(regTest$log_DEPARTURE_DELAY)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.386   2.485   2.379   3.497   7.197

As shown above, the two subsamples from the 70/30 split have identical values for the 1st quartile, median, and 3rd quartile. The statistics that differ are the mean (2.391 for the training sample versus 2.379 for the test sample) and the maximum (7.137 versus 7.197).

We now proceed to fit the models. First, we define the cross-validation scheme and the performance metrics used to evaluate each model. Next, we fit our three regression techniques: ordinary least squares (the estimator used in linear regression), ridge regression, and lasso regression.

# selecting cross validation

ctrl_cv <- trainControl(method = "cv", number = 5)
ctrl_nocv <- trainControl(method = "none")

# defining a function that calculates regression metrics

regMetrics <- function(real, predicted) {
  
  # Mean Square Error
  MSE <- mean((real - predicted)^2)  

  # Mean Absolute Error
  MAE <- mean(abs(real - predicted))

  # Total Sum of Squares
  TSS <- sum((real - mean(real))^2)
  # Residual Sum of Squares
  RSS <- sum((predicted - real)^2)
  # R2
  R2 <- 1 - RSS/TSS
  
  result <- data.frame(MSE, MAE, R2)
  
  return(result)
}
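
A quick sanity check of the metric function on a toy vector (values chosen arbitrarily for illustration):

# toy example: near-perfect predictions should give small errors and R2 near 1
regMetrics(real = c(1, 2, 3, 4), predicted = c(1.1, 1.9, 3.2, 3.8))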

OLS (Ordinary Least Squares):

OLS minimizes the sum of squared differences between the observed and predicted values and includes no regularization term. This technique is suitable when there is no multicollinearity and the number of predictors is moderate. However, OLS may overfit when the number of predictors is large relative to the number of observations or when multicollinearity is present.
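
For reference, with $y_i$ the log departure delay and $x_i$ the vector of predictors, the OLS estimator solves

$$\hat{\beta}_{OLS} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2$$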

# ORDINARY LEAST SQUARE (OLS)

linear_nocv <- train(log_DEPARTURE_DELAY ~ .,
                          data = regTrain,
                          method = "lm",
                          trControl = ctrl_nocv)
linear_nocv_is <- regMetrics(regTrain$log_DEPARTURE_DELAY, predict(linear_nocv, regTrain))
linear_nocv_oss <- regMetrics(regTest$log_DEPARTURE_DELAY, predict(linear_nocv, regTest))

Ridge Regression:

Ridge Regression is suitable when multicollinearity is present and large coefficient values should be avoided. It shrinks the coefficients, which prevents overfitting and handles multicollinearity well. Ridge Regression minimizes the sum of squared differences plus a penalty term: the squared magnitude of the coefficients multiplied by a regularization parameter lambda (in glmnet, lambda controls the penalty strength while alpha selects the penalty type, with alpha = 0 giving ridge).
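
In the same notation as above, the ridge objective adds an L2 penalty whose strength is controlled by $\lambda$:

$$\hat{\beta}_{ridge} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$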

# RIDGE REGRESSION

lambdas <- exp(log(10)*seq(-2, 9, length.out = 200))

tgrid_nocv <- expand.grid(alpha = 0, lambda=0.05)
tgrid <- expand.grid(alpha = 0, lambda=lambdas)

ridge_nocv <- train(log_DEPARTURE_DELAY ~ .,
                         data = regTrain,
                         method = "glmnet",
                         tuneGrid = tgrid_nocv,
                         trControl = ctrl_nocv)

ridge_nocv_is <- regMetrics(regTrain$log_DEPARTURE_DELAY, predict(ridge_nocv, regTrain))
ridge_nocv_oss <- regMetrics(regTest$log_DEPARTURE_DELAY, predict(ridge_nocv, regTest))


ridge_cv <- train(log_DEPARTURE_DELAY ~ .,
                       data = regTrain,
                       method = "glmnet",
                       tuneGrid = tgrid,
                       trControl = ctrl_cv)

ridge_cv_is <- regMetrics(regTrain$log_DEPARTURE_DELAY, predict(ridge_cv, regTrain))
ridge_cv_oss <- regMetrics(regTest$log_DEPARTURE_DELAY, predict(ridge_cv, regTest))

Lasso Regression:

Lasso Regression is useful when feature selection is desired or when a dataset has a large number of features. It tends to set some coefficients exactly to zero, effectively performing feature selection and producing sparse models. The method is similar to Ridge Regression, but the penalty term added to the sum of squared differences is the absolute value of the coefficients multiplied by the regularization parameter lambda (alpha = 1 in glmnet).
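
The lasso objective swaps the squared penalty for an L1 penalty, which is what allows individual coefficients to shrink exactly to zero:

$$\hat{\beta}_{lasso} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$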

# LASSO REGRESSION

tgrid_l <- expand.grid(alpha = 1, lambda=lambdas)
tgrid_nocv_l <- expand.grid(alpha = 1, lambda=0.05)

lasso_nocv <- train(log_DEPARTURE_DELAY ~ .,
                         data = regTrain,
                         method = "glmnet",
                         tuneGrid = tgrid_nocv_l,
                         trControl = ctrl_nocv)

lasso_nocv_is <- regMetrics(regTrain$log_DEPARTURE_DELAY, predict(lasso_nocv, regTrain))
lasso_nocv_oss <- regMetrics(regTest$log_DEPARTURE_DELAY, predict(lasso_nocv, regTest))


lasso_cv <- train(log_DEPARTURE_DELAY ~ .,
                        data = regTrain,
                        method = "glmnet",
                        tuneGrid = tgrid_l,
                        trControl = ctrl_cv)
 
lasso_cv_is <- regMetrics(regTrain$log_DEPARTURE_DELAY, predict(lasso_cv, regTrain))
lasso_cv_oss <- regMetrics(regTest$log_DEPARTURE_DELAY, predict(lasso_cv, regTest))

Evaluation

In this section, we evaluate all three models, with and without cross-validation, on both in-sample and out-of-sample results, using the MSE, MAE, and R^2 metrics defined previously in the metric function.

# Regression in sample results

results_is <- rbind(linear_nocv_is, ridge_nocv_is, ridge_cv_is, lasso_nocv_is, lasso_cv_is)
row.names(results_is) <- c('OLS','Ridge (no cv)','Ridge (cv)','Lasso (no cv)', 'Lasso (cv)')
results_is
##                    MSE       MAE        R2
## OLS           1.160255 0.8698282 0.4672441
## Ridge (no cv) 1.164530 0.8798238 0.4652812
## Ridge (cv)    1.164530 0.8798238 0.4652812
## Lasso (no cv) 1.169900 0.8817319 0.4628151
## Lasso (cv)    1.162760 0.8724272 0.4660937
# Regression out of sample results

results_oss <- rbind(linear_nocv_oss, ridge_nocv_oss, ridge_cv_oss, lasso_nocv_oss, lasso_cv_oss)
row.names(results_oss) <- c('OLS','Ridge (no cv)','Ridge (cv)','Lasso (no cv)', 'Lasso (cv)')
results_oss
##                    MSE       MAE        R2
## OLS           1.227703 0.8824324 0.4386612
## Ridge (no cv) 1.224157 0.8911080 0.4402827
## Ridge (cv)    1.224157 0.8911080 0.4402827
## Lasso (no cv) 1.229428 0.8929090 0.4378726
## Lasso (cv)    1.226979 0.8844129 0.4389924

As the tables above show, the in-sample results were best for the OLS method; neither regularized model outperformed the standard regression. Its R-squared is 46.7% and its MSE is 1.16. Ridge Regression with and without cross-validation produced identical results on every metric, suggesting the cross-validated lambda landed very close to the fixed value of 0.05. Lasso Regression, however, performed slightly better with cross-validation than without, with an R-squared of 46.6% versus 46.3%, in line with its lower MSE and MAE under cross-validation.

The out-of-sample results show a similar pattern across the models, with every model performing worse than in sample. Out of sample, Ridge Regression slightly edges out OLS on MSE (1.224 versus 1.228) and R-squared (44.0% versus 43.9%), while OLS retains the lowest MAE (0.882); the differences between the models are small.

Conclusion

This project has run through the full pipeline for predicting flight departure delay: data collection, pre-processing, exploratory data analysis, modelling, and evaluation. At the EDA stage, we examined the correlations between the features to identify potential predictors. Twelve candidate predictive variables were fitted to the models, with a separate target variable for the classification and regression tasks.

We successfully ran two types of models. The first is classification, with three training algorithms: Naive Bayes, Random Forest, and kNN. The second is regression, comprising OLS, Ridge Regression, and Lasso Regression. For classification, the best-performing model is Random Forest, with an accuracy of 84.2% and a Kappa statistic of 0.682, suggesting a substantial level of agreement. For regression, OLS produced the best in-sample results (R-squared of 46.7% and MSE of 1.16), with Ridge Regression marginally ahead out of sample. There is still room for improvement: future work could address the remaining misclassifications and investigate additional influencing factors.