Today’s globalization necessitates efficient and seamless commuting systems. One of the most efficient systems available to us today is aviation. Unfortunately, flight delays are not uncommon for most of us, regardless of background or location. The International Air Transport Association (IATA) reports that flight delays has resulted in lost of approximately $30 billion annually. Passengers affected by delays face financial burdens such as missed connections, additional accommodation costs, and loss of productivity.
In hopes of dampening the detrimental effects of airline delays, we aim to determine how the different features contribute in the delaying of flights. In addition, we would also like to evaluate the accuracy of various regression and classification models in estimating airline departure delays.
Specifically, this project aims to answer these questions;
How well does the machine learning model perform in predicting departure delay status using classification and regression approaches?
How do various factors contribute to the departure delay of airline flights?
For this study we aim to address the issues as follow;
To evaluate the accuracy of a classification model in predicting whether a flight is likely to experience departure delay.
To assess the effectiveness of a regression model in providing precise estimates of departure delay.
To determine the correlations between different features that are related to flight departure delay.
First and foremost, we will import all the relevant libraries that will be using throughout this project.
The data is collected from: https://www.kaggle.com/datasets/usdot/flight-delays?select=flights.csv. This dataset originates from the U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics in 2015.
Below display the overall structure and the dimension of our dataset.
# read data
raw_data <- read.csv("flights.csv")
data <- raw_data
# view the structure and dimension our data
str(data)
## 'data.frame': 5819079 obs. of 31 variables:
## $ YEAR : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ MONTH : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DAY : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DAY_OF_WEEK : int 4 4 4 4 4 4 4 4 4 4 ...
## $ AIRLINE : chr "AS" "AA" "US" "AA" ...
## $ FLIGHT_NUMBER : int 98 2336 840 258 135 806 612 2013 1112 1173 ...
## $ TAIL_NUMBER : chr "N407AS" "N3KUAA" "N171US" "N3HYAA" ...
## $ ORIGIN_AIRPORT : chr "ANC" "LAX" "SFO" "LAX" ...
## $ DESTINATION_AIRPORT: chr "SEA" "PBI" "CLT" "MIA" ...
## $ SCHEDULED_DEPARTURE: int 5 10 20 20 25 25 25 30 30 30 ...
## $ DEPARTURE_TIME : int 2354 2 18 15 24 20 19 44 19 33 ...
## $ DEPARTURE_DELAY : int -11 -8 -2 -5 -1 -5 -6 14 -11 3 ...
## $ TAXI_OUT : int 21 12 16 15 11 18 11 13 17 12 ...
## $ WHEELS_OFF : int 15 14 34 30 35 38 30 57 36 45 ...
## $ SCHEDULED_TIME : int 205 280 286 285 235 217 181 273 195 221 ...
## $ ELAPSED_TIME : int 194 279 293 281 215 230 170 249 193 203 ...
## $ AIR_TIME : int 169 263 266 258 199 206 154 228 173 186 ...
## $ DISTANCE : int 1448 2330 2296 2342 1448 1589 1299 2125 1464 1747 ...
## $ WHEELS_ON : int 404 737 800 748 254 604 504 745 529 651 ...
## $ TAXI_IN : int 4 4 11 8 5 6 5 8 3 5 ...
## $ SCHEDULED_ARRIVAL : int 430 750 806 805 320 602 526 803 545 711 ...
## $ ARRIVAL_TIME : int 408 741 811 756 259 610 509 753 532 656 ...
## $ ARRIVAL_DELAY : int -22 -9 5 -9 -21 8 -17 -10 -13 -15 ...
## $ DIVERTED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CANCELLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CANCELLATION_REASON: chr "" "" "" "" ...
## $ AIR_SYSTEM_DELAY : int NA NA NA NA NA NA NA NA NA NA ...
## $ SECURITY_DELAY : int NA NA NA NA NA NA NA NA NA NA ...
## $ AIRLINE_DELAY : int NA NA NA NA NA NA NA NA NA NA ...
## $ LATE_AIRCRAFT_DELAY: int NA NA NA NA NA NA NA NA NA NA ...
## $ WEATHER_DELAY : int NA NA NA NA NA NA NA NA NA NA ...
dim(data)
## [1] 5819079 31
First we check the number of missing values and the percentage of missing values of each variable.
# check the number missing value or empty string for each variable
missing_values_count <- colSums(is.na(data)| data == "") %>% sort(decreasing = TRUE)
print(missing_values_count)
## CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## 5729195 4755640 4755640 4755640
## LATE_AIRCRAFT_DELAY WEATHER_DELAY ELAPSED_TIME AIR_TIME
## 4755640 4755640 105071 105071
## ARRIVAL_DELAY WHEELS_ON TAXI_IN ARRIVAL_TIME
## 105071 92513 92513 92513
## TAXI_OUT WHEELS_OFF DEPARTURE_TIME DEPARTURE_DELAY
## 89047 89047 86153 86153
## TAIL_NUMBER SCHEDULED_TIME YEAR MONTH
## 14721 6 0 0
## DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER
## 0 0 0 0
## ORIGIN_AIRPORT DESTINATION_AIRPORT SCHEDULED_DEPARTURE DISTANCE
## 0 0 0 0
## SCHEDULED_ARRIVAL DIVERTED CANCELLED
## 0 0 0
# check the percentage of missing value for each variable
missingPerc <- (colSums(is.na(data)| data == "")/dim(data)[1]) %>% sort(decreasing = TRUE)
missing_perc_percentage <- sprintf("%.2f%%", missingPerc * 100)
print(cbind(Column = names(missingPerc), Missing_Percentage = missing_perc_percentage))
## Column Missing_Percentage
## [1,] "CANCELLATION_REASON" "98.46%"
## [2,] "AIR_SYSTEM_DELAY" "81.72%"
## [3,] "SECURITY_DELAY" "81.72%"
## [4,] "AIRLINE_DELAY" "81.72%"
## [5,] "LATE_AIRCRAFT_DELAY" "81.72%"
## [6,] "WEATHER_DELAY" "81.72%"
## [7,] "ELAPSED_TIME" "1.81%"
## [8,] "AIR_TIME" "1.81%"
## [9,] "ARRIVAL_DELAY" "1.81%"
## [10,] "WHEELS_ON" "1.59%"
## [11,] "TAXI_IN" "1.59%"
## [12,] "ARRIVAL_TIME" "1.59%"
## [13,] "TAXI_OUT" "1.53%"
## [14,] "WHEELS_OFF" "1.53%"
## [15,] "DEPARTURE_TIME" "1.48%"
## [16,] "DEPARTURE_DELAY" "1.48%"
## [17,] "TAIL_NUMBER" "0.25%"
## [18,] "SCHEDULED_TIME" "0.00%"
## [19,] "YEAR" "0.00%"
## [20,] "MONTH" "0.00%"
## [21,] "DAY" "0.00%"
## [22,] "DAY_OF_WEEK" "0.00%"
## [23,] "AIRLINE" "0.00%"
## [24,] "FLIGHT_NUMBER" "0.00%"
## [25,] "ORIGIN_AIRPORT" "0.00%"
## [26,] "DESTINATION_AIRPORT" "0.00%"
## [27,] "SCHEDULED_DEPARTURE" "0.00%"
## [28,] "DISTANCE" "0.00%"
## [29,] "SCHEDULED_ARRIVAL" "0.00%"
## [30,] "DIVERTED" "0.00%"
## [31,] "CANCELLED" "0.00%"
We are dropping all the variables with large number of missing values (> 80%).
# dropping variables with a large share of missing values (>80%)
toDrop <- c("AIR_SYSTEM_DELAY", "SECURITY_DELAY", "AIRLINE_DELAY", "LATE_AIRCRAFT_DELAY", "WEATHER_DELAY", "CANCELLATION_REASON")
data <- data[,-which(names(data) %in% toDrop)]
We are also dropping irrelevant variables that are determined based on the exploratory data analysis.
meaninglessVars <- c("YEAR", "DIVERTED", "CANCELLED", "TAIL_NUMBER","ELAPSED_TIME", "AIR_TIME","WHEELS_ON", "SCHEDULED_TIME", "TAXI_IN", "TAXI_OUT", "SCHEDULED_ARRIVAL", "ARRIVAL_TIME")
data <- data[,-which(names(data) %in% meaninglessVars)]
head(data)
## MONTH DAY DAY_OF_WEEK AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT
## 1 1 1 4 AS 98 ANC
## 2 1 1 4 AA 2336 LAX
## 3 1 1 4 US 840 SFO
## 4 1 1 4 AA 258 LAX
## 5 1 1 4 AS 135 SEA
## 6 1 1 4 DL 806 SFO
## DESTINATION_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY
## 1 SEA 5 2354 -11
## 2 PBI 10 2 -8
## 3 CLT 20 18 -2
## 4 MIA 20 15 -5
## 5 ANC 25 24 -1
## 6 MSP 25 20 -5
## WHEELS_OFF DISTANCE ARRIVAL_DELAY
## 1 15 1448 -22
## 2 14 2330 -9
## 3 34 2296 5
## 4 30 2342 -9
## 5 35 1448 -21
## 6 38 1589 8
Next, we will creates a binary variable which tells us whether if the flight is domestic or international. In addition, we will creates a variable to categorize the variable ‘DAY’. Finally, we will be dropping insignificant variables to avoid information duplication and remove observations with missing values.
# first we define the airports by their abbreviation
americanAirports <- c("ATL", "ORD", "DFW", "DEN", "LAX", "SFO", "PHX", "IAH", "LAS", "MSP", "MCO", "SEA", "DTW", "BOS", "EWR", "CLT", "LGA", "SLC", "JFK", "BWI", "MDW", "DCA", "FLL", "SAN", "MIA", "PHL", "TPA", "DAL", "HOU", "BNA", "PDX", "STL", "HNL", "OAK", "AUS", "MSY", "MCI", "SJC", "SMF", "SNA", "CLE", "IAD", "RDU", "MKE", "SAT", "RSW", "IND", "SJU", "CMH", "PIT", "PBI", "OGG", "CVG", "ABQ", "BUR", "BDL", "JAX", "ONT", "BUF", "OMA", "OKC", "ANC", "RIC", "TUS", "MEM", "TUL", "RNO", "BHM", "ELP", "CHS", "BOI", "KOA", "PVD", "GRR", "LIH", "LIT", "SDF", "GEG", "ORF", "XNA", "MSN", "PSP", "LGB")
# binary variable which indicates origin of the airport
data$USA_ORIG <- ifelse(data$ORIGIN_AIRPORT %in% americanAirports, 1, 0)
# binary variable which indicates destination of the airport
data$USA_DEST <- ifelse(data$DESTINATION_AIRPORT %in% americanAirports, 1, 0)
# binary variable where domestic flight = 1, else international flight = 0
data$US_FLIGHT <- ifelse(data$USA_ORIG == 1 & data$USA_DEST == 1, 1, 0)
# binary variable that was equal to 1 if the flight flied between the 1st to 15th of a month, else 2.
data$MONTH_HALF <- ifelse(data$DAY %in% 1:15, 1, 2)
# then we will drop the irrelevant variables to avoid information duplication
toDrop2 <- c("USA_ORIG", "USA_DEST", "DESTINATION_AIRPORT", "DAY")
data <- data[,-which(names(data) %in% toDrop2)]
# drop all the observations that had missing values.
data <- na.omit(data)
Here we will be converting the categorical variables such as AIRLINE and ORIGIN_AIRPORT to factor and assign an integer value to it.
data$ORIGIN_AIRPORT <- as.integer(factor(data$ORIGIN_AIRPORT))
data$AIRLINE <- as.integer(factor(data$AIRLINE))
head(data,5)
## MONTH DAY_OF_WEEK AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT SCHEDULED_DEPARTURE
## 1 1 4 2 98 324 5
## 2 1 4 1 2336 483 10
## 3 1 4 12 840 585 20
## 4 1 4 1 258 483 20
## 5 1 4 2 135 584 25
## DEPARTURE_TIME DEPARTURE_DELAY WHEELS_OFF DISTANCE ARRIVAL_DELAY US_FLIGHT
## 1 2354 -11 15 1448 -22 1
## 2 2 -8 14 2330 -9 1
## 3 18 -2 34 2296 5 1
## 4 15 -5 30 2342 -9 1
## 5 24 -1 35 1448 -21 1
## MONTH_HALF
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
Since our objective is predicting the flight delay, therefore, we will be using ‘DEPARTURE_DELAY’ as our dependent variable. We will be selecting the observations with departure delay equals to or larger than zero. Below present the summary and plotting of our dependent variable.
summary(data$DEPARTURE_DELAY)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -82.000 -5.000 -2.000 9.295 7.000 1988.000
data <- data[data$DEPARTURE_DELAY >= 0,]
ggplot(data, aes(DEPARTURE_DELAY)) + geom_histogram(aes(y=..count..), fill="blue", alpha = 1.0,color="white", bins = 60)+ labs(x = "Departure delay (mins)", y = "Frequency")
From the graph above, the distribution of our dependent variable is right-skewed and highly asymmetrical. Therefore, we will be using the natural logarithm of the departure delay as our dependent variable which can help make the distribution more symmetrical and close to normal distribution. We will also create a column to classify the departure delay. The condition is that if the delay minutes are more than around 12 minutes, we will classify it as delayed.
Finally, we will then remove the ‘DEPARTURE_DELAY’ column. Therefore ‘log_DEPARTURE_DELAY’ will be the dependent variable for our regression problem and ‘classDelay’ will be the dependent variable for classification problem.
data$DEPARTURE_DELAY <- data$DEPARTURE_DELAY + 1
data$log_DEPARTURE_DELAY <- log(data$DEPARTURE_DELAY)
ggplot(data, aes(log_DEPARTURE_DELAY)) + geom_histogram(aes(y=..count..), fill="blue", alpha = 1.0, color="white", bins = 60) +labs(x = "Departure delay (ln mins)", y = "Frequency")
data <- data[,-which(names(data)=='DEPARTURE_DELAY')]
data <- data %>% mutate(classDelay = ifelse(log_DEPARTURE_DELAY >= 2.5, 0, 1))
dim(data)
## [1] 2443491 14
The cleaned data set consists of 2443491 observations and 14 columns. Due to the limitation of time and computer resources, we will perform random sampling method by taking 15000 samples from the data population.
set.seed(333)
sample_size <- 15000
sampledData <- data[sample(nrow(data), sample_size, replace = FALSE), ]
dim(sampledData)
## [1] 15000 14
Here we present the summary statistics of each variables in the data.
summary(sampledData)
## MONTH DAY_OF_WEEK AIRLINE FLIGHT_NUMBER
## Min. : 1.000 Min. :1.000 Min. : 1.00 Min. : 1
## 1st Qu.: 3.000 1st Qu.:2.000 1st Qu.: 4.00 1st Qu.: 712
## Median : 6.000 Median :4.000 Median : 9.00 Median :1552
## Mean : 6.416 Mean :3.926 Mean : 8.22 Mean :2023
## 3rd Qu.: 9.000 3rd Qu.:6.000 3rd Qu.:14.00 3rd Qu.:2852
## Max. :12.000 Max. :7.000 Max. :14.00 Max. :7438
## ORIGIN_AIRPORT SCHEDULED_DEPARTURE DEPARTURE_TIME WHEELS_OFF
## Min. : 1.0 Min. : 3 Min. : 1 Min. : 1
## 1st Qu.:385.0 1st Qu.:1055 1st Qu.:1111 1st Qu.:1123
## Median :473.0 Median :1454 Median :1514 Median :1526
## Mean :444.2 Mean :1434 Mean :1465 Mean :1483
## 3rd Qu.:535.0 3rd Qu.:1815 3rd Qu.:1843 3rd Qu.:1857
## Max. :628.0 Max. :2359 Max. :2400 Max. :2400
## DISTANCE ARRIVAL_DELAY US_FLIGHT MONTH_HALF
## Min. : 31.0 Min. : -58.00 Min. :0.0000 Min. :1.000
## 1st Qu.: 408.0 1st Qu.: -4.00 1st Qu.:1.0000 1st Qu.:1.000
## Median : 733.0 Median : 8.00 Median :1.0000 Median :2.000
## Mean : 891.9 Mean : 23.29 Mean :0.7771 Mean :1.517
## 3rd Qu.:1172.0 3rd Qu.: 30.00 3rd Qu.:1.0000 3rd Qu.:2.000
## Max. :4983.0 Max. :1323.00 Max. :1.0000 Max. :2.000
## log_DEPARTURE_DELAY classDelay
## Min. :0.000 Min. :0.000
## 1st Qu.:1.386 1st Qu.:0.000
## Median :2.485 Median :1.000
## Mean :2.387 Mean :0.518
## 3rd Qu.:3.497 3rd Qu.:1.000
## Max. :7.197 Max. :1.000
The histogram of DAY_OF_WEEK illustrated that weekend has lower flight delay as compared to the weekdays. Monday, Thursday and Friday have the most number of flight delayed cases.
ggplot(sampledData, aes(x = DAY_OF_WEEK, fill = factor(classDelay))) +
geom_bar(position = "dodge", stat = "count") +
labs(title = "Histogram of DAY_OF_WEEK",
x = "DAY_OF_WEEK",
y = "Count",
fill = "Delay")
From the histogram of MONTH below, we can clearly view that the number of flights decreased after August, and it also causes the number of flight delays decreases.
ggplot(sampledData, aes(x = MONTH, fill = factor(classDelay))) +
geom_bar(position = "dodge", stat = "count") +
labs(title = "Histogram of MONTH",
x = "MONTH",
y = "Count",
fill = "Delay")
From the histogram of AIRLINE, we can clearly see that airline ‘WN’ has the highest number of flights delays case, however it might because it has the highest frequency of the flight.
ggplot(sampledData, aes(x = AIRLINE, fill = factor(classDelay))) +
geom_bar(position = "dodge", stat = "count") +
labs(title = "Histogram of AIRLINE",
x = "AIRLINE",
y = "Count",
fill = "Delay")
From the histogram of US_FLIGHT, the domestic flight (labelled 1) has greater number of flight delay compared to international flight.
ggplot(sampledData, aes(x = factor(US_FLIGHT), fill = factor(classDelay))) +
geom_bar(position = "dodge", stat = "count") +
labs(title = "Histogram of US_FLIGHT",
x = "US_FLIGHT",
y = "Count",
fill = "Delay") + scale_x_discrete(labels = c("0", "1"))
The correlation matrix here display the relationship in between the variables. The closer the value (color) to 1 or -1, the higher the correlation in between the variables.
# Create a correlation matrix
cor_matrix <- cor(sampledData)
corrplot(cor_matrix, tl.col = "black")
Before the modelling session, we will split our data into training and testing set at 0.8 distribution. In addition, since this part is performing for classification, therefore we will be dropping the ‘log_DEPARTURE_DELAY’ column.
classiData <- subset(sampledData, select = -c(log_DEPARTURE_DELAY))
classiData$classDelay <- as.factor(classiData$classDelay)
set.seed(333)
splitIndex <- createDataPartition(classiData$classDelay, p = 0.8, list = FALSE)
trainingSet <- classiData[splitIndex,]
testingSet <- classiData[-splitIndex,]
cat("Dimension of training:" ,dim(trainingSet), '\n')
## Dimension of training: 12000 13
cat("Dimension of testing:" ,dim(testingSet))
## Dimension of testing: 3000 13
Here, we are checking the data balancing in order to ensure our model later does not bias towards any class. We can see that both classes have the distribution proportion close to 0.50.
print("Distribution of classes: ")
## [1] "Distribution of classes: "
table(trainingSet$classDelay)
##
## 0 1
## 5784 6216
print("Distribution of classes (prop): ")
## [1] "Distribution of classes (prop): "
prop.table(table(trainingSet$classDelay))
##
## 0 1
## 0.482 0.518
Modelling with Naive Bayes classification model.
model1 <- NaiveBayes(classDelay~.,data = trainingSet)
xTest <- testingSet[,1:12]
yTest <- testingSet[,13]
predictions <- predict(model1, xTest)
cm <- confusionMatrix(predictions$class,yTest)
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 892 106
## 1 554 1448
##
## Accuracy : 0.78
## 95% CI : (0.7647, 0.7947)
## No Information Rate : 0.518
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5546
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6169
## Specificity : 0.9318
## Pos Pred Value : 0.8938
## Neg Pred Value : 0.7233
## Prevalence : 0.4820
## Detection Rate : 0.2973
## Detection Prevalence : 0.3327
## Balanced Accuracy : 0.7743
##
## 'Positive' Class : 0
##
Modelling with Random Forest classification model.
model2 <- randomForest(classDelay~., data = trainingSet)
predictions <- predict(model2, xTest)
cm <- confusionMatrix(predictions, yTest)
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1173 202
## 1 273 1352
##
## Accuracy : 0.8417
## 95% CI : (0.8281, 0.8546)
## No Information Rate : 0.518
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6824
##
## Mcnemar's Test P-Value : 0.001319
##
## Sensitivity : 0.8112
## Specificity : 0.8700
## Pos Pred Value : 0.8531
## Neg Pred Value : 0.8320
## Prevalence : 0.4820
## Detection Rate : 0.3910
## Detection Prevalence : 0.4583
## Balanced Accuracy : 0.8406
##
## 'Positive' Class : 0
##
Modelling with k-Nearest Neighbors (kNN) classification model.
model3 <- train(classDelay~., data = trainingSet, method = 'knn')
predictions <- predict(model3, xTest)
cm <- confusionMatrix(predictions, yTest)
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 766 452
## 1 680 1102
##
## Accuracy : 0.6227
## 95% CI : (0.605, 0.6401)
## No Information Rate : 0.518
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2402
##
## Mcnemar's Test P-Value : 1.511e-11
##
## Sensitivity : 0.5297
## Specificity : 0.7091
## Pos Pred Value : 0.6289
## Neg Pred Value : 0.6184
## Prevalence : 0.4820
## Detection Rate : 0.2553
## Detection Prevalence : 0.4060
## Balanced Accuracy : 0.6194
##
## 'Positive' Class : 0
##
In the classification modelling session, we utilised three different classification models which are Naive Bayes, Random Forest and k-Nearest Neighbors model to perform prediction on the flights delay.
Generally, Random Forest classifier has the best performance as it achieved 84.2% of accuracy.The Kappa statistic, which measures the agreement between the model’s predictions and the actual outcomes is 68.2 which suggests a substantial level of agreement. The model shows a good sensitivity and specificity which indicating its ability to correctly identify both positive and negative instances.
Whereas, Naive Bayes classifier overall has good accuracy at 78% yet the sensitivity is quite low with only 61.7%. This suggests that the model is better at correctly identifying negative instances than positive ones. Its Kappa statistic at around 55% suggests a moderate level of agreement. Finally, the kNN classfier has the worst performance as compared to the previous models. It achieved 62.3% of accuracy, 53% and 70% of sensitivity and specificity. A value of Kappa statistic around 20% suggests a low level of agreement.Further optimization and refinement may be beneficial in order to enhance the overall performance.
We will define a new dataset (regData) to fit in our regression models. The column “classDelay” that is used as target variable for the classification model will be drop in this section as it does not imply for the regression model.
# defining the variable
regData <- sampledData[, !names(sampledData) %in% c("classDelay")]
head(regData)
## MONTH DAY_OF_WEEK AIRLINE FLIGHT_NUMBER ORIGIN_AIRPORT
## 2022412 5 6 10 6411 535
## 153312 1 6 4 2521 400
## 319644 1 3 8 3606 400
## 3365753 7 3 1 683 594
## 5163910 11 4 4 2825 483
## 1634846 4 3 3 457 346
## SCHEDULED_DEPARTURE DEPARTURE_TIME WHEELS_OFF DISTANCE ARRIVAL_DELAY
## 2022412 925 934 1004 1041 7
## 153312 1950 1950 2001 405 -29
## 319644 2114 2134 2142 235 0
## 3365753 1230 1233 1243 647 0
## 5163910 1130 1132 1146 590 -5
## 1634846 1025 1025 1037 413 -8
## US_FLIGHT MONTH_HALF log_DEPARTURE_DELAY
## 2022412 1 1 2.302585
## 153312 1 1 0.000000
## 319644 1 2 3.044522
## 3365753 1 2 1.386294
## 5163910 1 2 1.098612
## 1634846 1 1 0.000000
str(regData)
## 'data.frame': 15000 obs. of 13 variables:
## $ MONTH : int 5 1 1 7 11 4 10 9 12 7 ...
## $ DAY_OF_WEEK : int 6 6 3 3 4 3 7 4 3 6 ...
## $ AIRLINE : int 10 4 8 1 4 3 2 1 11 1 ...
## $ FLIGHT_NUMBER : int 6411 2521 3606 683 2825 457 26 1333 983 47 ...
## $ ORIGIN_AIRPORT : int 535 400 400 594 483 346 265 545 510 535 ...
## $ SCHEDULED_DEPARTURE: int 925 1950 2114 1230 1130 1025 1810 1745 1505 1210 ...
## $ DEPARTURE_TIME : int 934 1950 2134 1233 1132 1025 1811 1758 1526 1215 ...
## $ WHEELS_OFF : int 1004 2001 2142 1243 1146 1037 1829 1823 1548 1230 ...
## $ DISTANCE : int 1041 405 235 647 590 413 1721 678 964 1739 ...
## $ ARRIVAL_DELAY : int 7 -29 0 0 -5 -8 -9 3 13 -15 ...
## $ US_FLIGHT : num 1 1 1 1 1 1 0 1 1 1 ...
## $ MONTH_HALF : num 1 1 2 2 2 1 2 2 2 2 ...
## $ log_DEPARTURE_DELAY: num 2.3 0 3.04 1.39 1.1 ...
The data for regression model (regData) is split into 70% train sample and 30% test sample.
# divide the data into train and test set
set.seed(123) # Set seed for reproducibility
index <- createDataPartition(regData$log_DEPARTURE_DELAY, p = 0.7, list = FALSE)
regTrain <- regData[index,]
regTest <- regData[-index,]
summary(regTrain$log_DEPARTURE_DELAY)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.386 2.485 2.391 3.497 7.137
summary(regTest$log_DEPARTURE_DELAY)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.386 2.485 2.379 3.497 7.197
As it is presented, the distribution among the two sub samples with the partition of 70% train set and 30% test set has the similar values for the 1st Quartile, Median, and 3rd Quartile. The statistic that differs is the Mean value (2.391 for train sample and 2.379 for test sample) and maximum value (7.137 for train sample and 7.197 for test sample).
We will proceed to fit in the model. First we define the cross validation and the performance metrics to evaluate the results for each model. Next, we define the function for our training models which are ordinary least squares estimation used in linear regression, ridge regression, and lasso regression.
# selecting cross validation
ctrl_cv <- trainControl(method = "cv", number = 5)
ctrl_nocv <- trainControl(method = "none")
# defining a function that calculates regression metrics
regMetrics <- function(real, predicted) {
# Mean Square Error
MSE <- mean((real - predicted)^2)
# Mean Absolute Error
MAE <- mean(abs(real - predicted))
# Total Sum of Squares
TSS <- sum((real - mean(real))^2)
# Explained Sum of Squares
RSS <- sum((predicted - real)^2)
# R2
R2 <- 1 - RSS/TSS
result <- data.frame(MSE, MAE, R2)
return(result)
}
OLS (Ordinary Least Squares):
The OLS minimize the sum of squared differences between the observed and predicted values. It does not include any regularization terms. This technique is suitable when there is no multicollinearity, and the number of predictors is moderate. However, OLD may lead to overfitting if the number of predictors is large relative to the number of observations or if there is multicollinearity.
# ORDINARY LEAST SQUARE (OLS)
linear_nocv <- train(log_DEPARTURE_DELAY ~ .,
data = regTrain,
method = "lm",
trControl = ctrl_nocv)
linear_nocv_is <- regMetrics(regTrain$log_DEPARTURE_DELAY, predict(linear_nocv, regTrain))
linear_nocv_oss <- regMetrics(regTest$log_DEPARTURE_DELAY, predict(linear_nocv, regTest))
Ridge Regression:
Ridge Regression is suitable when there is multicollinearity and to avoid large coefficient values. It tends to shrink the coefficients, preventing overfitting and dealing well with multicollinearity. Ridge Regression works by minimizing the sum of squared differences plus a penalty term, which is the square of the magnitude of the coefficients multiplied by a regularization parameter (alpha or lambda).
# RIDGE REGRESSION
lambdas <- exp(log(10)*seq(-2, 9, length.out = 200))
tgrid_nocv <- expand.grid(alpha = 0, lambda=0.05)
tgrid <- expand.grid(alpha = 0, lambda=lambdas)
ridge_nocv <- train(log_DEPARTURE_DELAY ~ .,
data = regTrain,
method = "glmnet",
tuneGrid = tgrid_nocv,
trControl = ctrl_nocv)
ridge_nocv_is <- regMetrics(regTrain$log_DEPARTURE_DELAY, predict(ridge_nocv, regTrain))
ridge_nocv_oss <- regMetrics(regTest$log_DEPARTURE_DELAY, predict(ridge_nocv, regTest))
ridge_cv <- train(log_DEPARTURE_DELAY ~ .,
data = regTrain,
method = "glmnet",
tuneGrid = tgrid,
trControl = ctrl_cv)
ridge_cv_is <- regMetrics(regTrain$log_DEPARTURE_DELAY, predict(ridge_cv, regTrain))
ridge_cv_oss <- regMetrics(regTest$log_DEPARTURE_DELAY, predict(ridge_cv, regTest))
Lasso Regression:
Lasso Regression is useful when feature selection is desired or when dealing with datasets with a large number of features. This technique tends to set some coefficients exactly to zero, effectively performing feature selection and producing sparse models. This method is almost similar to Ridge Regression, it minimize the sum of squared differences plus a penalty term, which is the absolute value of the coefficients multiplied by a regularization parameter (alpha or lambda).
# LASSO REGRESSION
tgrid_l <- expand.grid(alpha = 1, lambda=lambdas)
tgrid_nocv_l <- expand.grid(alpha = 1, lambda=0.05)
lasso_nocv <- train(log_DEPARTURE_DELAY ~ .,
data = regTrain,
method = "glmnet",
tuneGrid = tgrid_nocv_l,
trControl = ctrl_nocv)
lasso_nocv_is <- regMetrics(regTrain$log_DEPARTURE_DELAY, predict(lasso_nocv, regTrain))
lasso_nocv_oss <- regMetrics(regTest$log_DEPARTURE_DELAY, predict(lasso_nocv, regTest))
lasso_cv <- train(log_DEPARTURE_DELAY ~ .,
data = regTrain,
method = "glmnet",
tuneGrid = tgrid_l,
trControl = ctrl_cv)
lasso_cv_is <- regMetrics(regTrain$log_DEPARTURE_DELAY, predict(lasso_cv, regTrain))
lasso_cv_oss <- regMetrics(regTest$log_DEPARTURE_DELAY, predict(lasso_cv, regTest))
In this section, we will evaluate all of the three models with and without the cross-validation factor for in sample and also out of sample results. We will be performing the regression metrics of MSE, MAE, and R^2 that have previously define in the metric function.
# Regression in sample results
results_is <- rbind(linear_nocv_is, ridge_nocv_is, ridge_cv_is, lasso_nocv_is, lasso_cv_is)
row.names(results_is) <- c('OLS','Ridge (no cv)','Ridge (cv)','Lasso (no cv)', 'Lasso (cv)')
results_is
## MSE MAE R2
## OLS 1.160255 0.8698282 0.4672441
## Ridge (no cv) 1.164530 0.8798238 0.4652812
## Ridge (cv) 1.164530 0.8798238 0.4652812
## Lasso (no cv) 1.169900 0.8817319 0.4628151
## Lasso (cv) 1.162760 0.8724272 0.4660937
# Regression out of sample results
results_oss <- rbind(linear_nocv_oss, ridge_nocv_oss, ridge_cv_oss, lasso_nocv_oss, lasso_cv_oss)
row.names(results_oss) <- c('OLS','Ridge (no cv)','Ridge (cv)','Lasso (no cv)', 'Lasso (cv)')
results_oss
## MSE MAE R2
## OLS 1.227703 0.8824324 0.4386612
## Ridge (no cv) 1.224157 0.8911080 0.4402827
## Ridge (cv) 1.224157 0.8911080 0.4402827
## Lasso (no cv) 1.229428 0.8929090 0.4378726
## Lasso (cv) 1.226979 0.8844129 0.4389924
As it is visible in the tables above, the in-sample results were the best in the case in OLS method and none of the machine learning model outperformed the standard regression. The R-squared in this case is equal to 46.7% and the MSE is equal to 1.16. Apparently, Ridge Regression with cross validation and without cross validation provided similar results for each metrics. Lasso Regression, however, have a slightly higher accuracy with cross validation compare to no cross validation. with the value of R-squared equal to 46.6% with CV while without CV is 46.3%. Align with the MAE & MSE in which Lasso with CV have lower rate of error compare to it without using CV.
The out-of-sample results provided similar behavior result for each model. All of the results has lower performance compare to the in-sample results. Overall, the OLS was the best model that suit for this dataset with the R-squared value equal to 43.9%. The MSE and MAE value are 1.23 and 0.88 respectively.
This project have run through several processes on predicting flight departure delay. From data collection, pre-processing, exploratory data analysis, modeling, and evaluation. At the EDA stage, we have identify the correlation between each features to determine the potential predictive variables. There are 12 potential predictive variables to fit in the models with 2 different target variables for each classification and regression model.
We have successfully run 2 types of models. First is classification model with 3 different training algorithms which are Naive Bayes, Random Forest, and kNN. Second type of model is regression, which comprises of 3 techniques of OLS, Ridge Regression, and Lasso Regression. For classification, the best performance model is Random Forest with accuracy of 84.2% and Kappa statistic of 68.2% which suggests a substantial level of agreement. For regression, OLS outperformed the standard regression with R-squared of 46.7% and MSE of 1.16 for the in-sample results. Apparently, there are still some room for improvement that we can optimize and explore to address misclassifications and investigate additional influencing factors for this study.