Flight Delay and Cancellation Case Study (ML)

library(readr)
library(dplyr) # data wrangling
library(tidyr) # data wrangling
library(ggplot2) # plotting
library(rpart) # DT
library(randomForestSRC) # RF
library(tidyverse)
library(tree)
library(Metrics)
library(caret)
library(car)
library(kernlab)
library(MASS)
library(performance)
library(ROCR)
library(lmtest)
library(randomForest)
library(corrplot)
library(lubridate)
library(ggcorrplot) # corr plot
library(reshape2) # heat map
library(GGally) # Pais ggplot2
library(kableExtra) # Airlines table
library(e1071) # Tune SVM
library(ranger)
library(broom) # Tidy function

1-2) Introduction / The Problem

In the past several years, air travel has become both more hectic and expensive, with tighter security regulations and longer waits, yet it is still the fastest and safest travel method there is. The number of flights performed globally by the airline industry (both commercial and freight) increased steadily since the early 2000s and reached 38.9 million in 2019. However, due to the coronavirus pandemic, the number of flights dropped to 16.9 million in 2020. (Mazareanu, 2021) Putting the COVID crisis aside, airlines have always had the need to use data analytics to help them analyze and develop models that can predict better both the customer and the outside events behavior.

Airlines, like any other company from other industries are focused on finding the best way to be more cost efficient, and the easiest way to achieve this is by avoiding unnecessary costs. Delayed aircrafts are estimated to have cost the airlines several billion dollars in additional expense. Delays also drive the need for extra gates and ground personnel and impose costs on airline customers (including shippers) in the form of lost productivity, wages and goodwill. (Airlines for America, 2020) There is no wonder why Data Analytics play a very important role in helping airlines predicting any event that might cut their profit.

But what makes this analysis so difficult? Well, delayed flights are not the only problem for airlines, there are also cancelled flights, and there are many reasons for cancellation. Also, let’s go back to the delay flights, because the impact of a delay flight depends on the duration of this delay, a couple-of-minutes delay is way different than a many-hours delay. All these variables and their multiple effects will be analyzed throughout this report.

3) Motivation

Even when the data set is from 2015, and we won’t be seeing any effects of the coronavirus pandemic, this world-changing event showed us the importance of having models to analyze historic data, but also models that allow us to make decisions in real-time. Knowing the effects and the causes for both the passengers and the airline for delayed and cancelled flights can help avoiding costs on both ends: passenger and airlines.

This data set presents a wide variety of problems and offers different ways to approach them, for sure the main question we would ultimately like to answer is “How to avoid delayed and cancelled flights?”, and since there is no one easy solution, it gives space for us to analyze this data set in many ways:

4) Data

The data set was found at Kaggle (source: https://www.kaggle.com/usdot/flight-delays), the “2015 flight delays and cancellations” data set originally comes from the U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics, all flights are domestic, which means they only flights within the U.S. (including Alaska and Puerto Rico). The data set includes 322 airports and 14 airlines. The data set consists of 5.8 million observations (5,819,079) and 31 variables (with information from day, month, airline, airport, scheduled departure, actual departure, cancellation code, delayed causes and others), from those 31 variables, 1 variable will be our main focus (response variable): “Arrival_Delay”. By the end of the Case Study, we will add 2 more variables = “ARRIVAL_PART_DAY” and “CANCELLED_PART_DAY”, these 2 variables will be used to analyze the Delayed problems (from the Arrival and Scheduled time perspective) with a categorical analysis.

Data Limitations

Due to the number of observations in this data set I might be unable to run more complex models such as SVM (or SV Regression), Decision Trees or Random Forest, therefore, I decided to subset the original data set and just use the information for the month of February, as it has less days that the other 11 months. The process of subsetting the data set can be found in the following link: https://rpubs.com/arachaparro/FlightDelay_2015_US

setwd("C:/Users/arach/OneDrive/4. UTSA/Spring 2022/S22 - DA6813 - Data Analytics Applications/00. Cases/00_Flight_Delay/")
flights_feb <- read.csv(file = "flights_feb.csv", header = TRUE)
flights_feb$X <- NULL

Data Preprocessing

str(flights_feb)

## 'data.frame':    429191 obs. of  31 variables:
##  $ YEAR               : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
##  $ MONTH              : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ DAY                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DAY_OF_WEEK        : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ AIRLINE            : chr  "AA" "AS" "AA" "DL" ...
##  $ FLIGHT_NUMBER      : int  2400 98 258 806 612 1112 1434 1832 1674 1450 ...
##  $ TAIL_NUMBER        : chr  "N3JKAA" "N794AS" "N3FEAA" "N962DN" ...
##  $ ORIGIN_AIRPORT     : chr  "LAX" "ANC" "LAX" "SFO" ...
##  $ DESTINATION_AIRPORT: chr  "DFW" "SEA" "MIA" "MSP" ...
##  $ SCHEDULED_DEPARTURE: int  5 5 20 20 25 30 30 35 45 45 ...
##  $ DEPARTURE_TIME     : int  2 34 10 12 16 24 25 31 40 36 ...
##  $ DEPARTURE_DELAY    : int  -3 29 -10 -8 -9 -6 -5 -4 -5 -9 ...
##  $ TAXI_OUT           : int  15 10 68 14 11 11 15 17 16 15 ...
##  $ WHEELS_OFF         : int  17 44 118 26 27 35 40 48 56 51 ...
##  $ SCHEDULED_TIME     : int  169 204 284 220 181 189 218 269 264 183 ...
##  $ ELAPSED_TIME       : int  170 207 339 216 179 191 209 277 257 175 ...
##  $ AIR_TIME           : int  150 190 258 192 164 175 192 249 236 155 ...
##  $ DISTANCE           : int  1235 1448 2342 1589 1299 1464 1535 2125 2174 1299 ...
##  $ WHEELS_ON          : int  447 454 836 538 511 530 552 757 752 526 ...
##  $ TAXI_IN            : int  5 7 13 10 4 5 2 11 5 5 ...
##  $ SCHEDULED_ARRIVAL  : int  454 429 804 600 526 539 608 804 809 548 ...
##  $ ARRIVAL_TIME       : int  452 501 849 548 515 535 554 808 757 531 ...
##  $ ARRIVAL_DELAY      : int  -2 32 45 -12 -11 -4 -14 4 -12 -17 ...
##  $ DIVERTED           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CANCELLED          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CANCELLATION_REASON: chr  NA NA NA NA ...
##  $ AIR_SYSTEM_DELAY   : int  NA 3 45 NA NA NA NA NA NA NA ...
##  $ SECURITY_DELAY     : int  NA 0 0 NA NA NA NA NA NA NA ...
##  $ AIRLINE_DELAY      : int  NA 29 0 NA NA NA NA NA NA NA ...
##  $ LATE_AIRCRAFT_DELAY: int  NA 0 0 NA NA NA NA NA NA NA ...
##  $ WEATHER_DELAY      : int  NA 0 0 NA NA NA NA NA NA NA ...

One of the variables that will be used as a response variable is “Cancelled”, to verify if the data set is balanced with used the function table ( ). The result table shows us that indeed the data set is highly imbalanced, with 98.5% of the flights not cancelled and 1.5% of them cancelled throughout the year. However, when the subset was created with only the flights of February, the cancelled flights went up to 4.78%

tbl_feb = table(flights_feb$CANCELLED)
cbind(tbl_feb, round(prop.table(tbl_feb)*100,2))

##   tbl_feb      
## 0  408674 95.22
## 1   20517  4.78

Cleaning the data

Some of the data, such as WHEELS_ON, SCHEDULED_ARRIVAL and ARRIVAL_TIME, are numeric but should be in HHMM format, however, since we will analyze them with the format they have, due to the complexity it will add to the model for interpretability down the road.

The only data parsing we will be doing will be with the columns in numeric and character format that should be factor (categorical) variables.

str(flights_feb[c(4:9,26)])

## 'data.frame':    429191 obs. of  7 variables:
##  $ DAY_OF_WEEK        : Ord.factor w/ 7 levels "Mon"<"Tue"<"Wed"<..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ AIRLINE            : Factor w/ 14 levels "AA","AS","B6",..: 1 2 1 4 9 1 4 12 1 4 ...
##  $ FLIGHT_NUMBER      : Factor w/ 6272 levels "1","2","3","4",..: 2394 96 256 803 610 1107 1428 1826 1668 1444 ...
##  $ TAIL_NUMBER        : Factor w/ 4361 levels "N001AA","N002AA",..: 1386 3358 1315 4264 2371 1413 181 2099 1349 141 ...
##  $ ORIGIN_AIRPORT     : Factor w/ 315 levels "ABE","ABI","ABQ",..: 172 16 172 273 170 273 172 172 170 170 ...
##  $ DESTINATION_AIRPORT: Factor w/ 315 levels "ABE","ABI","ABQ",..: 85 272 199 212 212 85 212 65 199 212 ...
##  $ CANCELLATION_REASON: Factor w/ 4 levels "A","B","C","D": NA NA NA NA NA NA NA NA NA NA ...

Missing data

Since our dataset has: 1) Flights on time, 2) Flights delayed, and 3) Flights Cancelled, there are missing values for columns such as “ARRIVAL_TIME” for flights that were Cancelled for instance. For the variables such as “DEPARTURE_TIME” and “ARRIVAL TIME”, the rows will be deleted at the train/test split (since a value equal to 0, would be equivalent to an Arrival time at 12:00:00 am). For the other variables, such as TAIL_NUMBER or the delay codes: SECURITY_DELAY, they will be substituded with a 0.

summary(flights_feb)

##       YEAR          MONTH        DAY       DAY_OF_WEEK    AIRLINE      
##  Min.   :2015   Min.   :2   Min.   : 1.0   Mon:62323   WN     : 90172  
##  1st Qu.:2015   1st Qu.:2   1st Qu.: 8.0   Tue:63374   DL     : 60884  
##  Median :2015   Median :2   Median :15.0   Wed:65156   EV     : 45138  
##  Mean   :2015   Mean   :2   Mean   :14.6   Thu:65429   OO     : 43989  
##  3rd Qu.:2015   3rd Qu.:2   3rd Qu.:22.0   Fri:50649   AA     : 39835  
##  Max.   :2015   Max.   :2   Max.   :28.0   Sat:57717   UA     : 36235  
##                                            Sun:64543   (Other):112938  
##  FLIGHT_NUMBER     TAIL_NUMBER     ORIGIN_AIRPORT   DESTINATION_AIRPORT
##  469    :   366   N486HA :   354   ATL    : 27366   ATL    : 27366     
##  345    :   334   N489HA :   348   ORD    : 21812   ORD    : 21814     
##  61     :   321   N477HA :   347   DFW    : 20839   DFW    : 20839     
##  745    :   312   N488HA :   336   LAX    : 15762   LAX    : 15764     
##  711    :   307   N476HA :   323   DEN    : 15642   DEN    : 15638     
##  1159   :   302   (Other):423889   IAH    : 12231   IAH    : 12228     
##  (Other):427249   NA's   :  3594   (Other):315539   (Other):315542     
##  SCHEDULED_DEPARTURE DEPARTURE_TIME  DEPARTURE_DELAY      TAXI_OUT     
##  Min.   :   5        Min.   :   1    Min.   : -61.00   Min.   :  1.00  
##  1st Qu.: 925        1st Qu.: 932    1st Qu.:  -4.00   1st Qu.: 11.00  
##  Median :1320        Median :1330    Median :  -1.00   Median : 14.00  
##  Mean   :1324        Mean   :1336    Mean   :  11.88   Mean   : 16.92  
##  3rd Qu.:1720        3rd Qu.:1731    3rd Qu.:  11.00   3rd Qu.: 19.00  
##  Max.   :2359        Max.   :2400    Max.   :1587.00   Max.   :225.00  
##                      NA's   :20059   NA's   :20059     NA's   :20412   
##    WHEELS_OFF    SCHEDULED_TIME  ELAPSED_TIME      AIR_TIME    
##  Min.   :   1    Min.   : 20    Min.   : 15.0   Min.   :  7.0  
##  1st Qu.: 947    1st Qu.: 85    1st Qu.: 83.0   1st Qu.: 60.0  
##  Median :1345    Median :122    Median :119.0   Median : 94.0  
##  Mean   :1360    Mean   :140    Mean   :137.1   Mean   :112.6  
##  3rd Qu.:1745    3rd Qu.:173    3rd Qu.:170.0   3rd Qu.:144.0  
##  Max.   :2400    Max.   :679    Max.   :766.0   Max.   :684.0  
##  NA's   :20412   NA's   :2      NA's   :21528   NA's   :21528  
##     DISTANCE        WHEELS_ON        TAXI_IN        SCHEDULED_ARRIVAL
##  Min.   :  31.0   Min.   :   1    Min.   :  1.000   Min.   :   1     
##  1st Qu.: 366.0   1st Qu.:1114    1st Qu.:  4.000   1st Qu.:1123     
##  Median : 641.0   Median :1519    Median :  6.000   Median :1525     
##  Mean   : 800.8   Mean   :1490    Mean   :  7.583   Mean   :1508     
##  3rd Qu.:1045.0   3rd Qu.:1912    3rd Qu.:  9.000   3rd Qu.:1915     
##  Max.   :4983.0   Max.   :2400    Max.   :202.000   Max.   :2359     
##                   NA's   :20822   NA's   :20822                      
##   ARRIVAL_TIME   ARRIVAL_DELAY         DIVERTED        CANCELLED 
##  Min.   :   1    Min.   : -79.000   Min.   :0.000000   0:408674  
##  1st Qu.:1119    1st Qu.: -12.000   1st Qu.:0.000000   1: 20517  
##  Median :1524    Median :  -2.000   Median :0.000000             
##  Mean   :1496    Mean   :   8.321   Mean   :0.002356             
##  3rd Qu.:1918    3rd Qu.:  13.000   3rd Qu.:0.000000             
##  Max.   :2400    Max.   :1627.000   Max.   :1.000000             
##  NA's   :20822   NA's   :21528                                   
##  CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY   AIRLINE_DELAY   
##  A   :  2815         Min.   :  0.0    Min.   :  0      Min.   :   0    
##  B   : 15447         1st Qu.:  0.0    1st Qu.:  0      1st Qu.:   0    
##  C   :  2254         Median :  4.0    Median :  0      Median :   2    
##  D   :     1         Mean   : 14.2    Mean   :  0      Mean   :  18    
##  NA's:408674         3rd Qu.: 19.0    3rd Qu.:  0      3rd Qu.:  18    
##                      Max.   :748.0    Max.   :168      Max.   :1587    
##                      NA's   :334012   NA's   :334012   NA's   :334012  
##  LATE_AIRCRAFT_DELAY WEATHER_DELAY   
##  Min.   :   0.0      Min.   :   0.0  
##  1st Qu.:   0.0      1st Qu.:   0.0  
##  Median :   3.0      Median :   0.0  
##  Mean   :  22.7      Mean   :   4.3  
##  3rd Qu.:  29.0      3rd Qu.:   0.0  
##  Max.   :1313.0      Max.   :1152.0  
##  NA's   :334012      NA's   :334012

flights_feb = flights_feb %>% 
  mutate( TAIL_NUMBER = ifelse(is.na(TAIL_NUMBER), 0, TAIL_NUMBER), 
          AIR_SYSTEM_DELAY = ifelse(is.na(AIR_SYSTEM_DELAY), 0, AIR_SYSTEM_DELAY), 
          SECURITY_DELAY = ifelse(is.na(SECURITY_DELAY), 0, SECURITY_DELAY),
          AIRLINE_DELAY = ifelse(is.na(AIRLINE_DELAY), 0, AIRLINE_DELAY), 
          LATE_AIRCRAFT_DELAY = ifelse(is.na(LATE_AIRCRAFT_DELAY), 0, LATE_AIRCRAFT_DELAY),
          WEATHER_DELAY = ifelse(is.na(WEATHER_DELAY), 0, WEATHER_DELAY),
          CANCELLATION_REASON = ifelse(is.na(CANCELLATION_REASON), 0, CANCELLATION_REASON),
          TAXI_IN = ifelse(is.na(TAXI_IN), 0, TAXI_IN),
          AIR_TIME = ifelse(is.na(AIR_TIME), 0, AIR_TIME),
          ELAPSED_TIME = ifelse(is.na(ELAPSED_TIME), 0, ELAPSED_TIME),
          SCHEDULED_TIME = ifelse(is.na(SCHEDULED_TIME), 0, SCHEDULED_TIME),
          WHEELS_OFF = ifelse(is.na(WHEELS_OFF), 0, WHEELS_OFF),
          TAXI_OUT = ifelse(is.na(TAXI_OUT), 0, TAXI_OUT)
    )

Comments: From the data above we can conclude the following:

We have 21,528 Cancelled flights (We can see it from AIR_TIME and ELAPSED_TIME NAs), however both WHEELS_OFF (20,412 NAs) and WHEELS_ON (20,822 NAs - Same as TAXI_IN, they are linked) are less than 21,528, and we would assume since a flight was Cancelled that the flight never took off (WHEELS_OFF) and therefore, never landed (WHEELS_ON). With these flights we will assume that: they actually took off, but due to reasons we do not know, they had to land immediately after.
We will consider a “delayed flight” using ARRIVAL_DELAY as a continuous variable.

# Delayed flights 
sum(!is.na(flights_feb$ARRIVAL_DELAY))

## [1] 407663

# Delayed flights of 1 minute or more
sum(na.omit(flights_feb$ARRIVAL_DELAY)!=0)

## [1] 398579

We have 407,663 delayed flights, from those 398,579 are 1 minute or more.

5) Methodology

Exploratory Analysis

Overall

We will start by analyzing if our variables have any correlation.

# ggcorrplot(cor(flights_feb[,-c(1:1,5,7:9,26)]))

cormat = round(cor(flights_feb[,-c(1,4:9,25:26)]),2)

## Warning in cor(flights_feb[, -c(1, 4:9, 25:26)]): the standard deviation is zero

melted_cormat = melt(cormat)

ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

For the heat map we had to exclude any categorical variables, and so far with the continuous variables we can not see any strong correlation other from “SCHEDULED_ARRIVAL” AND “SCHEDULED_DEPARTURE”. (which is obvious)

To analyze the categorical variables, we will use some box plots and bar plots. To analyze the relationship between the categorical variables as predictors and the response variables.

Delayed Arrival time

Delayed arrival time per airline

We will start by analyzing how the delayed time behaves throughout the airlines, and try to see if any of them stand out. First we will do it visually, where we can see the box plot and the outliers. For display we will use the abbreviation of the names, in the table bellow we will find the complete name of the airlines.

ggplot(flights_feb, aes(x=AIRLINE, y=ARRIVAL_DELAY, fill=AIRLINE))+
  geom_boxplot()+
  labs(title="Delayed Arrival time (minutes) per airline", x="Airlines", y="Delay (mins)")

At first glance we can see that American Airlines (AA) and Hawaiian Airlines (HA) have the flights with the longest delayed times. The second thing we can see is that pretty much all of them have a very low “Average Arrival Delay”, however JetBlue Airways (B6), Frontier Airlines (F9) and American Eagle Airlines (MAQ) show to be the airlines with an average arrival delay higher than the rest of them.

airlines <- read.csv(file = "airlines.csv", header = TRUE, fileEncoding = 'UTF-8-BOM')

airlines %>% 
  arrange(IATA_CODE) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))

IATA_CODE	AIRLINE
AA	American Airlines Inc.
AS	Alaska Airlines Inc.
B6	JetBlue Airways
DL	Delta Air Lines Inc.
EV	Atlantic Southeast Airlines
F9	Frontier Airlines Inc.
HA	Hawaiian Airlines Inc.
MQ	American Eagle Airlines Inc.
NK	Spirit Air Lines
OO	Skywest Airlines Inc.
UA	United Air Lines Inc.
US	US Airways Inc.
VX	Virgin America
WN	Southwest Airlines Co.

Numbers wise we can verify that the JetBlue Airways (B6), Frontier Airlines (F9) and American Eagle Airlines (MQ) are the airlines with the highest mean values, but we also can see that Alaska Airlines (AS) has a negative mean, which translates in most their flights arriving even earlier than expected.

Airlines is something we would like to look into.

flights_feb %>% 
  group_by(AIRLINE) %>% 
  summarize(min = min(na.omit(ARRIVAL_DELAY)),
            median = median(na.omit(ARRIVAL_DELAY)),
            mean = round(mean(na.omit(ARRIVAL_DELAY)),2),
            max = max(na.omit(ARRIVAL_DELAY))) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))

AIRLINE	min	median	mean	max
AA	-67	-3	7.53	1627
AS	-72	-6	-0.78	729
B6	-62	3	18.66	952
DL	-79	-5	5.61	1177
EV	-52	-1	10.42	637
F9	-44	6	27.42	839
HA	-52	1	6.03	1467
MQ	-55	5	21.30	1152
NK	-54	2	16.47	523
OO	-54	-2	9.59	953
UA	-65	-2	7.26	863
US	-59	-2	7.11	515
VX	-63	-4	7.78	572
WN	-73	-4	3.50	508

Delayed arrival time per Day of the Week

ggplot(flights_feb, aes(x=DAY_OF_WEEK, y=ARRIVAL_DELAY, fill=DAY_OF_WEEK))+
  geom_boxplot()+
  labs(title="Delayed Arrival time (minutes) by day of the week", x="Day of the Week", y="Delay (mins)") +
  guides(fill=guide_legend(title="Day of the Week"))

flights_feb %>% 
  group_by(DAY_OF_WEEK) %>% 
  summarize(min = min(na.omit(ARRIVAL_DELAY)),
            median = median(na.omit(ARRIVAL_DELAY)),
            mean = round(mean(na.omit(ARRIVAL_DELAY)),2),
            max = max(na.omit(ARRIVAL_DELAY))) %>% 
  kable(col.names = c("Day of the Week", "Min", "Median", "Mean", "Max")) %>% 
  kable_styling(bootstrap_options = c("striped","hover"))

Day of the Week	Min	Median	Mean	Max
Mon	-62	-3	8.56	1351
Tue	-64	-4	5.35	1371
Wed	-72	0	10.54	1366
Thu	-79	-3	6.19	1295
Fri	-73	-5	5.05	1627
Sat	-70	-3	9.51	1460
Sun	-77	0	12.76	1467

From the statistics above we can see that flights are delayed more minutes in average on Sunday, then Wednesday; and that if you want your flight to be arriving on time, you have a better chance if you take it on Friday or Tuesday.

Delayed arrival time depending on the times of the day

We would also know what is the schedule where is most likely to have your flight delayed.

flights_feb = flights_feb %>% 
  mutate(ARRIVAL_PART_DAY = case_when(
    ARRIVAL_TIME >= 600 & ARRIVAL_TIME < 1200 ~ "Morning",
    ARRIVAL_TIME >= 1200 & ARRIVAL_TIME < 1700 ~ "Afternoon",
    ARRIVAL_TIME >= 1700 & ARRIVAL_TIME < 2100 ~ "Evening",
    ARRIVAL_TIME <= 2100 & ARRIVAL_TIME < 600 ~ "Night" ))

flights_feb$ARRIVAL_PART_DAY = as.factor(flights_feb$ARRIVAL_PART_DAY)

ARRIVAL_MEANS  = flights_feb %>% 
  drop_na(ARRIVAL_DELAY) %>% 
  group_by(ARRIVAL_PART_DAY) %>% 
  summarize(mean_arrival_delay = mean(ARRIVAL_DELAY)) %>% 
  arrange(factor(ARRIVAL_PART_DAY, levels = c('Morning','Afternoon','Evening','Night')))

ARRIVAL_MEANS %>% 
  drop_na(ARRIVAL_PART_DAY) %>% 
  mutate(ARRIVAL_PART_DAY = fct_relevel(ARRIVAL_PART_DAY, 'Morning','Afternoon','Evening','Night')) %>% 
  ggplot() +
  geom_col(aes(x=ARRIVAL_PART_DAY, y=mean_arrival_delay, fill =ARRIVAL_PART_DAY)) + 
  labs(title="Avg Delay (mins) related to arrival", x="Scheduled Arrival (Part of the day)", y="Avg. Delay (mins)") +
  guides(fill=guide_legend(title="Part of Day (Arrival)"))

We can see that as we go later on the day, the flight is most likely to get delayed. Another interesting thing we can see is that in the morning the “Avg. Delay” (which wouldn’t be a delay) is actually negative, that means that the airports get the flights leaving earlier. (I would assume to counteract the possibility of a delay for the flights later that day).

Cancelled Fligths

The “CANCELLED” flights variable is a binomial variable with 2 categories 1 (yes) and 0 (no), that correspond to whether the flight was cancelled or not. We will take a look on the distribution, by airline and then by day of the week.

Cancelled Flights by airline

ggplot(flights_feb, aes(x = AIRLINE, fill = CANCELLED)) +
  geom_bar(position = position_dodge()) + 
  labs(title="Cancelled Flights by airline", x="Airline", y="Count of flights") +
  geom_text(stat = "count", aes(label = ..count..), size = 3)

Cancelled Flights by day of the week

ggplot(flights_feb, aes(x = DAY_OF_WEEK, fill = CANCELLED)) +
  geom_bar(position = position_dodge()) + 
  labs(title="Cancelled Flights by Day of the week", x="Day of the Week", y="Count of flights") +
  geom_text(stat = "count", aes(label = ..count..))

Here we can see that flights are most likely to be cancelled on Sunday and Saturday. And less likely on Wednesday.

Cancelled flights depending on the times of the day

flights_feb = flights_feb %>% 
  drop_na(SCHEDULED_DEPARTURE) %>% 
  mutate(CANCELLED_PART_DAY = case_when(
    SCHEDULED_DEPARTURE >= 600 & SCHEDULED_DEPARTURE < 1200 ~ "Morning",
    SCHEDULED_DEPARTURE >= 1200 & SCHEDULED_DEPARTURE < 1700 ~ "Afternoon",
    SCHEDULED_DEPARTURE >= 1700 & SCHEDULED_DEPARTURE < 2100 ~ "Evening",
    SCHEDULED_DEPARTURE <= 2100 & SCHEDULED_DEPARTURE < 600 ~ "Night" ))

flights_feb$CANCELLED_PART_DAY = as.factor(flights_feb$CANCELLED_PART_DAY)

flights_feb %>% 
  drop_na(CANCELLED_PART_DAY) %>% 
  mutate(CANCELLED_PART_DAY = fct_relevel(CANCELLED_PART_DAY, 'Morning','Afternoon','Evening','Night')) %>% 
  ggplot(aes(x = CANCELLED_PART_DAY, fill = CANCELLED)) + 
      geom_bar(position = position_dodge()) + 
      labs(title="Cancelled Flights by Day of the week", x="Day of the Week", y="Count of flights") + 
      geom_text(stat = "count", aes(label = ..count..))

6) Models

We will now analyze our data set in terms of: a) Delayed Flights and b) Cancelled flights. Delayed Flights is a continuous response variable (ARRIVAL_DELAY), therefore will be analyzed with: Linear Regression, SVR (Support Vector Regression), Decision Tree and Random Forest. Cancelled Flights is a binomial variable, therefore will be analyzed with: Logistic Regression, SVM, Decision Tree and Random Forest.

In the next section (Findings), I will present a summary of the MSE for the regression models and Accuracy (or confusion Matrix) for the categorical models.

First I will subset my training and test data sets. For this analysis, I will only use 4 out of the 14 airlines, due to storage limitation (my first try by running the linear regression with 7 out of 31 variables was unable to run, since it was a 34.2 Gb model).

We will use the following airlines: American Airlines (AA) and Delta Airlines (DL). I picked them since they and other 2 airlines have the highest delays (outliers), but also those 2 have the lowest avg. delay times. We will also focus on only 1 airport to be able to run all our models.

temp = flights_feb[flights_feb$AIRLINE == 'AA' | flights_feb$AIRLINE == 'DL', ]

temp %>% 
  count(ORIGIN_AIRPORT) %>% 
  mutate(rank = min_rank(-n)) %>% 
  arrange(desc(n)) %>% 
  rename(OriginAirport = ORIGIN_AIRPORT, times = n) %>% 
  filter(times > 2500)

##    OriginAirport times rank
## 1            ATL 17801    1
## 2            DFW 11614    2
## 3            MIA  4597    3
## 4            MSP  4277    4
## 5            ORD  4112    5
## 6            LAX  4015    6
## 7            DTW  3889    7
## 8            LGA  3247    8
## 9            JFK  3177    9
## 10           SLC  2711   10

We will pick the top 2 airports (ATL = Atlanta, DWF = Dallas-Forth Worth), to reduce even more the number of rows. But also to keep the representability of both Delta and American Airlines, since Atlanta and Dallas-FW are the “house” of each of those airlines.

To reduce even more our data set, so that we can run the SVM and Random Forest algorithms, we will choose only 14 out of 28 days.

temporary = temp[temp$ORIGIN_AIRPORT == 'ATL' | temp$ORIGIN_AIRPORT == 'DFW', ]

temporary = temporary %>% 
  filter(DAY <= 14)
  
# rm(flights_feb)
sum(is.na(temporary$ARRIVAL_DELAY))

## [1] 182

temporary = temporary %>% 
  drop_na(ARRIVAL_DELAY) %>% 
  drop_na(ARRIVAL_PART_DAY) %>% 
  drop_na(CANCELLED_PART_DAY)


sum(is.na(temporary$ARRIVAL_DELAY))

## [1] 0

sum(is.na(temporary$ARRIVAL_PART_DAY))

## [1] 0

Remove unused levels

temporary = temporary %>% 
  filter(ORIGIN_AIRPORT == 'ATL' | ORIGIN_AIRPORT == 'DFW') %>% 
  droplevels()

summary(temporary$ORIGIN_AIRPORT)

##  ATL  DFW 
## 6884 4441

temporary %>% 
  ggplot() +
  geom_bar(aes(x=ORIGIN_AIRPORT, fill = AIRLINE))

set.seed(123)


idx.train = sample(1:nrow(temporary), size = 0.6 * nrow(temporary))

train.df = temporary[idx.train,] 
test.df = temporary[-idx.train,]

Delayed Flights

Linear Regression

Due to memory/data processing constraint, I won’t show the multiple linear regression models ran, but at all the models and different combinations, DAY_OF_WEEK stayed significant, along with ORIGIN_AIRPORT.

Only 1 Destination Airport was significant: SGF (Springfield, MO), and for FLIGHT_NUMBER couple of flights were significant: 280 (pos coeff), 315 (neg coeff), 382 (pos coeff), 886 (pos coeff), 926 (pos coeff). I will later use this information.

I won’t include the variables such as “WHEELS_OFF”, “WHEELS_ON”, since those variables are more of “descriptors” than “predictors” of the model.

Creating a small data set

I will create a data set specifically to tune my model, the reason behind this, is that because of the size of the actual training data set, I was unable tu perform this procedure, so I needed an even smaller data set.

set.seed(123)


idx.train.svm = sample(1:nrow(train.df), size = 0.6 * nrow(train.df))

train.df.svm = train.df[idx.train.svm,] 
test.df.svm = train.df[-idx.train.svm,]

train.df.svm = na.omit(train.df.svm)
test.df.svm = na.omit(test.df.svm)

Fitting the Model with Train Model

We removed ELAPSED_TIME due to the correlation with DISTANCE

lm.fit = lm(ARRIVAL_DELAY ~ DAY_OF_WEEK+AIRLINE+
               ORIGIN_AIRPORT+DEPARTURE_TIME+DISTANCE+
               AIR_SYSTEM_DELAY+SECURITY_DELAY+
               AIRLINE_DELAY+LATE_AIRCRAFT_DELAY+
               WEATHER_DELAY, 
            data = train.df)

summary(lm.fit)

## 
## Call:
## lm(formula = ARRIVAL_DELAY ~ DAY_OF_WEEK + AIRLINE + ORIGIN_AIRPORT + 
##     DEPARTURE_TIME + DISTANCE + AIR_SYSTEM_DELAY + SECURITY_DELAY + 
##     AIRLINE_DELAY + LATE_AIRCRAFT_DELAY + WEATHER_DELAY, data = train.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.847  -5.611   0.224   5.492  33.755 
## 
## Coefficients: (1 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -6.7903724  0.7929299  -8.564  < 2e-16 ***
## DAY_OF_WEEK.L        1.2061243  0.2850332   4.232 2.35e-05 ***
## DAY_OF_WEEK.Q        1.1159728  0.2833956   3.938 8.30e-05 ***
## DAY_OF_WEEK.C        2.0990930  0.2899629   7.239 5.01e-13 ***
## DAY_OF_WEEK^4        2.2024309  0.2855764   7.712 1.41e-14 ***
## DAY_OF_WEEK^5        0.1993601  0.2950048   0.676 0.499200    
## DAY_OF_WEEK^6       -0.1565786  0.2877759  -0.544 0.586391    
## AIRLINEDL           -0.5106828  0.6600668  -0.774 0.439145    
## ORIGIN_AIRPORTDFW    1.5333521  0.6600956   2.323 0.020213 *  
## DEPARTURE_TIME       0.0010850  0.0003002   3.614 0.000303 ***
## DISTANCE            -0.0026612  0.0002218 -11.996  < 2e-16 ***
## AIR_SYSTEM_DELAY     1.0846307  0.0101414 106.951  < 2e-16 ***
## SECURITY_DELAY              NA         NA      NA       NA    
## AIRLINE_DELAY        1.0772619  0.0080341 134.086  < 2e-16 ***
## LATE_AIRCRAFT_DELAY  1.0745771  0.0157362  68.287  < 2e-16 ***
## WEATHER_DELAY        1.0937481  0.0527056  20.752  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.898 on 6780 degrees of freedom
## Multiple R-squared:  0.8525, Adjusted R-squared:  0.8522 
## F-statistic:  2799 on 14 and 6780 DF,  p-value: < 2.2e-16

There are some variables that show “perfect separation”, such as the Delay Causes (WEATHER_DELAY, LATE_AIRCRAFT_DELAY), I will perform the VIF test to analyze if it is needed to remove them or not.

# vif(lm.fit)

lm.null = lm(ARRIVAL_DELAY ~ 1, data = train.df.svm)
lm.full = lm(ARRIVAL_DELAY ~ DAY_OF_WEEK+AIRLINE+
               ORIGIN_AIRPORT+DEPARTURE_TIME+DISTANCE+
               AIR_SYSTEM_DELAY+SECURITY_DELAY+
               AIRLINE_DELAY+LATE_AIRCRAFT_DELAY+
               WEATHER_DELAY, data = train.df.svm)

step.AIC = step(lm.null, scope = list(upper=lm.full),
     direction ="both",test ="Chisq", trace = F)


summary(step.AIC)

## 
## Call:
## lm(formula = ARRIVAL_DELAY ~ AIRLINE_DELAY + AIR_SYSTEM_DELAY + 
##     LATE_AIRCRAFT_DELAY + WEATHER_DELAY + DAY_OF_WEEK + DISTANCE + 
##     ORIGIN_AIRPORT + DEPARTURE_TIME, data = train.df.svm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.583  -5.489   0.278   5.646  27.905 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -7.4303756  0.5881669 -12.633  < 2e-16 ***
## AIRLINE_DELAY        1.0748550  0.0097046 110.757  < 2e-16 ***
## AIR_SYSTEM_DELAY     1.0718980  0.0115360  92.917  < 2e-16 ***
## LATE_AIRCRAFT_DELAY  1.0875702  0.0220169  49.397  < 2e-16 ***
## WEATHER_DELAY        1.0768011  0.0619198  17.390  < 2e-16 ***
## DAY_OF_WEEK.L        1.5777268  0.3703899   4.260 2.09e-05 ***
## DAY_OF_WEEK.Q        1.3159512  0.3653922   3.601  0.00032 ***
## DAY_OF_WEEK.C        1.9777472  0.3758248   5.262 1.50e-07 ***
## DAY_OF_WEEK^4        2.3520740  0.3698091   6.360 2.24e-10 ***
## DAY_OF_WEEK^5       -0.0472345  0.3827212  -0.123  0.90178    
## DAY_OF_WEEK^6       -0.3378758  0.3690306  -0.916  0.35994    
## DISTANCE            -0.0027983  0.0002855  -9.802  < 2e-16 ***
## ORIGIN_AIRPORTDFW    2.2306786  0.2974040   7.501 7.76e-14 ***
## DEPARTURE_TIME       0.0011027  0.0003869   2.850  0.00439 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.912 on 4063 degrees of freedom
## Multiple R-squared:  0.8645, Adjusted R-squared:  0.8641 
## F-statistic:  1994 on 13 and 4063 DF,  p-value: < 2.2e-16

According to VIF, the model looks good, so we will go ahead and perform the predictions and calculate the MSE.

Linear Regression: Predictions and Probabilities on the test data set

lm.preds = predict(step.AIC, newdata = test.df.svm)
# lm.rmse = rmse(test.df$ARRIVAL_DELAY, lm.preds)
lm.rmse = sqrt(mean((test.df.svm$ARRIVAL_DELAY - lm.preds)^2)) 
lm.rmse

## [1] 8.890251

SVR

Tune - find best gamma and cost

tuned = tune.svm(ARRIVAL_DELAY ~ AIRLINE_DELAY + AIR_SYSTEM_DELAY + 
    LATE_AIRCRAFT_DELAY + WEATHER_DELAY + DAY_OF_WEEK + DISTANCE + 
    ORIGIN_AIRPORT + DEPARTURE_TIME, 
            data = train.df.svm, gamma = seq(0.1, 1, by = 0.2), 
            cost = seq(0.1, 1, by = 0.2), scale = TRUE)

Please note: I couldn’t run the gamma and cost sequence by 0.1, since it was a very large tune, it brought me back every single time “empty tables”. Therefore, the best Gamma and Cost is 0.6.

svm.fit.lm = svm(ARRIVAL_DELAY ~ AIRLINE_DELAY + AIR_SYSTEM_DELAY + 
    LATE_AIRCRAFT_DELAY + WEATHER_DELAY + DAY_OF_WEEK + DISTANCE + 
    ELAPSED_TIME + ORIGIN_AIRPORT + AIRLINE + DEPARTURE_TIME, 
            data = train.df.svm, gamma = tuned$best.parameters$gamma, 
            cost = tuned$best.parameters$cost, scale = TRUE)

SVR: Predictions and Probabilities on the test data set

svm.preds.lm = predict(svm.fit.lm, test.df.svm)
svm.rmse.lm = sqrt(mean((test.df.svm$ARRIVAL_DELAY - svm.preds.lm)^2))
svm.rmse.lm

## [1] 9.866235

Comments: Even when it might not be the best model, due to memory constraints in my computer RAM, we can see that there was INDEED an improvement on the RMSE.

Decision Tree

dt.model.lm = tree(ARRIVAL_DELAY ~ AIRLINE_DELAY + AIR_SYSTEM_DELAY + 
    LATE_AIRCRAFT_DELAY + WEATHER_DELAY + DAY_OF_WEEK + DISTANCE + 
    ORIGIN_AIRPORT + DEPARTURE_TIME, 
            data = train.df)

dt.preds = predict(dt.model.lm, test.df)
dt.rmse = sqrt(mean((test.df$ARRIVAL_DELAY - dt.preds)^2))
dt.rmse

## [1] 11.11843

Random Forest

For Random Forest, it was a challenge, due to the memory usage. I will use “ranger” which is the package for Random Forest for data sets with high dimensionality.

Finding the best parameters

grid <- expand.grid(
  mtry       = 2:6,
  min.node.size  = seq(0.1,1,0.1),
  sample_size = seq(0.1,1,0.1),
  OOB = 0
)

for(i in 1:nrow(grid)) {
  
  # train model
  model <- ranger(ARRIVAL_DELAY ~ AIRLINE_DELAY + AIR_SYSTEM_DELAY + 
    LATE_AIRCRAFT_DELAY + WEATHER_DELAY + DAY_OF_WEEK + DISTANCE + 
    ORIGIN_AIRPORT + DEPARTURE_TIME, 
    data            = train.df, 
    num.trees       = 200,
    mtry            = grid$mtry[i],
    min.node.size   = grid$min.node.size[i],
    sample.fraction = grid$sample_size[i],
    seed            = 123,
    write.forest = T)
  
    grid$OOB[i] <- sqrt(model$prediction.error)
}

To find the lowest OOB error I will arrange the resulting table in a descending order.

grid %>% 
  dplyr::arrange(OOB) %>% 
  head(10)

##    mtry min.node.size sample_size      OOB
## 1     6           0.1         0.9 10.41600
## 2     6           0.2         0.9 10.41600
## 3     6           0.3         0.9 10.41600
## 4     6           0.4         0.9 10.41600
## 5     6           0.5         0.9 10.41600
## 6     6           0.6         0.9 10.41600
## 7     6           0.7         0.9 10.41600
## 8     6           0.8         0.9 10.41600
## 9     6           0.9         0.9 10.41600
## 10    6           1.0         0.9 10.45969

We conclude that the best parameters are mtry = 4, min.node.size = 0.1, sample_size = 0.9.

Variable Importance

set.seed(123)

OOB <- vector(mode = "numeric", length = 200)

for(i in seq_along(OOB)) {

  optimal_ranger <- ranger(
    formula         = ARRIVAL_DELAY ~ AIRLINE_DELAY + AIR_SYSTEM_DELAY + 
    LATE_AIRCRAFT_DELAY + WEATHER_DELAY + DAY_OF_WEEK + DISTANCE + 
    ORIGIN_AIRPORT + DEPARTURE_TIME, 
    data            = train.df.svm, 
    num.trees       = 200,
    mtry            = 6,
    min.node.size   = 0.1,
    sample.fraction = 0.7,
    importance      = 'impurity'
  )
  
  OOB[i] <- sqrt(optimal_ranger$prediction.error)
}

hist(OOB, breaks = 20)

optimal_ranger$variable.importance %>% 
  tidy() %>%
  dplyr::arrange(desc(x)) %>%
  dplyr::top_n(25) %>%
  ggplot(aes(reorder(names, x), x)) +
  geom_col() +
  coord_flip() +
  ggtitle("Top 25 important variables") + 
  labs(x="Variables", y="Variable Importance")

## Selecting by x

Predictions with the parameters

rf.model <- randomForest(
    formula         = ARRIVAL_DELAY ~ AIRLINE_DELAY + AIR_SYSTEM_DELAY + 
    LATE_AIRCRAFT_DELAY + WEATHER_DELAY + DAY_OF_WEEK + DISTANCE + 
    ORIGIN_AIRPORT + DEPARTURE_TIME, 
    data = train.df.svm, 
    ntree = 200,
    mtry = 6,
    nodesize = 0.1,
    importance      = TRUE)

rf.preds = predict(rf.model, test.df.svm)
# rf.rmse = rmse(test.df.svm$ARRIVAL_DELAY, optimal_ranger)
rf.rmse = sqrt(mean((test.df.svm$ARRIVAL_DELAY - rf.preds)^2))
rf.rmse

## [1] 8.937098