library(readr)
library(dplyr) # data wrangling
library(tidyr) # data wrangling
library(ggplot2) # plotting
library(rpart) # DT
library(randomForestSRC) # RF
library(tidyverse)
library(tree)
library(Metrics)
library(caret)
library(car)
library(kernlab)
library(MASS)
library(performance)
library(ROCR)
library(lmtest)
library(randomForest)
library(corrplot)
library(lubridate)
library(ggcorrplot) # corr plot
library(reshape2) # heat map
library(GGally) # Pais ggplot2
library(kableExtra) # Airlines table
library(e1071) # Tune SVM
library(ranger)
library(broom) # Tidy functionIn the past several years, air travel has become both more hectic and expensive, with tighter security regulations and longer waits, yet it is still the fastest and safest travel method there is. The number of flights performed globally by the airline industry (both commercial and freight) increased steadily since the early 2000s and reached 38.9 million in 2019. However, due to the coronavirus pandemic, the number of flights dropped to 16.9 million in 2020. (Mazareanu, 2021) Putting the COVID crisis aside, airlines have always had the need to use data analytics to help them analyze and develop models that can predict better both the customer and the outside events behavior.
Airlines, like any other company from other industries are focused on finding the best way to be more cost efficient, and the easiest way to achieve this is by avoiding unnecessary costs. Delayed aircrafts are estimated to have cost the airlines several billion dollars in additional expense. Delays also drive the need for extra gates and ground personnel and impose costs on airline customers (including shippers) in the form of lost productivity, wages and goodwill. (Airlines for America, 2020) There is no wonder why Data Analytics play a very important role in helping airlines predicting any event that might cut their profit.
But what makes this analysis so difficult? Well, delayed flights are not the only problem for airlines, there are also cancelled flights, and there are many reasons for cancellation. Also, let’s go back to the delay flights, because the impact of a delay flight depends on the duration of this delay, a couple-of-minutes delay is way different than a many-hours delay. All these variables and their multiple effects will be analyzed throughout this report.
Even when the data set is from 2015, and we won’t be seeing any effects of the coronavirus pandemic, this world-changing event showed us the importance of having models to analyze historic data, but also models that allow us to make decisions in real-time. Knowing the effects and the causes for both the passengers and the airline for delayed and cancelled flights can help avoiding costs on both ends: passenger and airlines.
This data set presents a wide variety of problems and offers different ways to approach them, for sure the main question we would ultimately like to answer is “How to avoid delayed and cancelled flights?”, and since there is no one easy solution, it gives space for us to analyze this data set in many ways:
The data set was found at Kaggle (source: https://www.kaggle.com/usdot/flight-delays), the “2015 flight delays and cancellations” data set originally comes from the U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics, all flights are domestic, which means they only flights within the U.S. (including Alaska and Puerto Rico). The data set includes 322 airports and 14 airlines. The data set consists of 5.8 million observations (5,819,079) and 31 variables (with information from day, month, airline, airport, scheduled departure, actual departure, cancellation code, delayed causes and others), from those 31 variables, 1 variable will be our main focus (response variable): “Arrival_Delay”. By the end of the Case Study, we will add 2 more variables = “ARRIVAL_PART_DAY” and “CANCELLED_PART_DAY”, these 2 variables will be used to analyze the Delayed problems (from the Arrival and Scheduled time perspective) with a categorical analysis.
Due to the number of observations in this data set I might be unable to run more complex models such as SVM (or SV Regression), Decision Trees or Random Forest, therefore, I decided to subset the original data set and just use the information for the month of February, as it has less days that the other 11 months. The process of subsetting the data set can be found in the following link: https://rpubs.com/arachaparro/FlightDelay_2015_US
setwd("C:/Users/arach/OneDrive/4. UTSA/Spring 2022/S22 - DA6813 - Data Analytics Applications/00. Cases/00_Flight_Delay/")
flights_feb <- read.csv(file = "flights_feb.csv", header = TRUE)
flights_feb$X <- NULLstr(flights_feb)## 'data.frame': 429191 obs. of 31 variables:
## $ YEAR : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ MONTH : int 2 2 2 2 2 2 2 2 2 2 ...
## $ DAY : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DAY_OF_WEEK : int 7 7 7 7 7 7 7 7 7 7 ...
## $ AIRLINE : chr "AA" "AS" "AA" "DL" ...
## $ FLIGHT_NUMBER : int 2400 98 258 806 612 1112 1434 1832 1674 1450 ...
## $ TAIL_NUMBER : chr "N3JKAA" "N794AS" "N3FEAA" "N962DN" ...
## $ ORIGIN_AIRPORT : chr "LAX" "ANC" "LAX" "SFO" ...
## $ DESTINATION_AIRPORT: chr "DFW" "SEA" "MIA" "MSP" ...
## $ SCHEDULED_DEPARTURE: int 5 5 20 20 25 30 30 35 45 45 ...
## $ DEPARTURE_TIME : int 2 34 10 12 16 24 25 31 40 36 ...
## $ DEPARTURE_DELAY : int -3 29 -10 -8 -9 -6 -5 -4 -5 -9 ...
## $ TAXI_OUT : int 15 10 68 14 11 11 15 17 16 15 ...
## $ WHEELS_OFF : int 17 44 118 26 27 35 40 48 56 51 ...
## $ SCHEDULED_TIME : int 169 204 284 220 181 189 218 269 264 183 ...
## $ ELAPSED_TIME : int 170 207 339 216 179 191 209 277 257 175 ...
## $ AIR_TIME : int 150 190 258 192 164 175 192 249 236 155 ...
## $ DISTANCE : int 1235 1448 2342 1589 1299 1464 1535 2125 2174 1299 ...
## $ WHEELS_ON : int 447 454 836 538 511 530 552 757 752 526 ...
## $ TAXI_IN : int 5 7 13 10 4 5 2 11 5 5 ...
## $ SCHEDULED_ARRIVAL : int 454 429 804 600 526 539 608 804 809 548 ...
## $ ARRIVAL_TIME : int 452 501 849 548 515 535 554 808 757 531 ...
## $ ARRIVAL_DELAY : int -2 32 45 -12 -11 -4 -14 4 -12 -17 ...
## $ DIVERTED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CANCELLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CANCELLATION_REASON: chr NA NA NA NA ...
## $ AIR_SYSTEM_DELAY : int NA 3 45 NA NA NA NA NA NA NA ...
## $ SECURITY_DELAY : int NA 0 0 NA NA NA NA NA NA NA ...
## $ AIRLINE_DELAY : int NA 29 0 NA NA NA NA NA NA NA ...
## $ LATE_AIRCRAFT_DELAY: int NA 0 0 NA NA NA NA NA NA NA ...
## $ WEATHER_DELAY : int NA 0 0 NA NA NA NA NA NA NA ...
One of the variables that will be used as a response variable is “Cancelled”, to verify if the data set is balanced with used the function table ( ). The result table shows us that indeed the data set is highly imbalanced, with 98.5% of the flights not cancelled and 1.5% of them cancelled throughout the year. However, when the subset was created with only the flights of February, the cancelled flights went up to 4.78%
tbl_feb = table(flights_feb$CANCELLED)
cbind(tbl_feb, round(prop.table(tbl_feb)*100,2))## tbl_feb
## 0 408674 95.22
## 1 20517 4.78
Some of the data, such as WHEELS_ON, SCHEDULED_ARRIVAL and ARRIVAL_TIME, are numeric but should be in HHMM format, however, since we will analyze them with the format they have, due to the complexity it will add to the model for interpretability down the road.
The only data parsing we will be doing will be with the columns in numeric and character format that should be factor (categorical) variables.
str(flights_feb[c(4:9,26)])## 'data.frame': 429191 obs. of 7 variables:
## $ DAY_OF_WEEK : Ord.factor w/ 7 levels "Mon"<"Tue"<"Wed"<..: 6 6 6 6 6 6 6 6 6 6 ...
## $ AIRLINE : Factor w/ 14 levels "AA","AS","B6",..: 1 2 1 4 9 1 4 12 1 4 ...
## $ FLIGHT_NUMBER : Factor w/ 6272 levels "1","2","3","4",..: 2394 96 256 803 610 1107 1428 1826 1668 1444 ...
## $ TAIL_NUMBER : Factor w/ 4361 levels "N001AA","N002AA",..: 1386 3358 1315 4264 2371 1413 181 2099 1349 141 ...
## $ ORIGIN_AIRPORT : Factor w/ 315 levels "ABE","ABI","ABQ",..: 172 16 172 273 170 273 172 172 170 170 ...
## $ DESTINATION_AIRPORT: Factor w/ 315 levels "ABE","ABI","ABQ",..: 85 272 199 212 212 85 212 65 199 212 ...
## $ CANCELLATION_REASON: Factor w/ 4 levels "A","B","C","D": NA NA NA NA NA NA NA NA NA NA ...
Since our dataset has: 1) Flights on time, 2) Flights delayed, and 3) Flights Cancelled, there are missing values for columns such as “ARRIVAL_TIME” for flights that were Cancelled for instance. For the variables such as “DEPARTURE_TIME” and “ARRIVAL TIME”, the rows will be deleted at the train/test split (since a value equal to 0, would be equivalent to an Arrival time at 12:00:00 am). For the other variables, such as TAIL_NUMBER or the delay codes: SECURITY_DELAY, they will be substituded with a 0.
summary(flights_feb)## YEAR MONTH DAY DAY_OF_WEEK AIRLINE
## Min. :2015 Min. :2 Min. : 1.0 Mon:62323 WN : 90172
## 1st Qu.:2015 1st Qu.:2 1st Qu.: 8.0 Tue:63374 DL : 60884
## Median :2015 Median :2 Median :15.0 Wed:65156 EV : 45138
## Mean :2015 Mean :2 Mean :14.6 Thu:65429 OO : 43989
## 3rd Qu.:2015 3rd Qu.:2 3rd Qu.:22.0 Fri:50649 AA : 39835
## Max. :2015 Max. :2 Max. :28.0 Sat:57717 UA : 36235
## Sun:64543 (Other):112938
## FLIGHT_NUMBER TAIL_NUMBER ORIGIN_AIRPORT DESTINATION_AIRPORT
## 469 : 366 N486HA : 354 ATL : 27366 ATL : 27366
## 345 : 334 N489HA : 348 ORD : 21812 ORD : 21814
## 61 : 321 N477HA : 347 DFW : 20839 DFW : 20839
## 745 : 312 N488HA : 336 LAX : 15762 LAX : 15764
## 711 : 307 N476HA : 323 DEN : 15642 DEN : 15638
## 1159 : 302 (Other):423889 IAH : 12231 IAH : 12228
## (Other):427249 NA's : 3594 (Other):315539 (Other):315542
## SCHEDULED_DEPARTURE DEPARTURE_TIME DEPARTURE_DELAY TAXI_OUT
## Min. : 5 Min. : 1 Min. : -61.00 Min. : 1.00
## 1st Qu.: 925 1st Qu.: 932 1st Qu.: -4.00 1st Qu.: 11.00
## Median :1320 Median :1330 Median : -1.00 Median : 14.00
## Mean :1324 Mean :1336 Mean : 11.88 Mean : 16.92
## 3rd Qu.:1720 3rd Qu.:1731 3rd Qu.: 11.00 3rd Qu.: 19.00
## Max. :2359 Max. :2400 Max. :1587.00 Max. :225.00
## NA's :20059 NA's :20059 NA's :20412
## WHEELS_OFF SCHEDULED_TIME ELAPSED_TIME AIR_TIME
## Min. : 1 Min. : 20 Min. : 15.0 Min. : 7.0
## 1st Qu.: 947 1st Qu.: 85 1st Qu.: 83.0 1st Qu.: 60.0
## Median :1345 Median :122 Median :119.0 Median : 94.0
## Mean :1360 Mean :140 Mean :137.1 Mean :112.6
## 3rd Qu.:1745 3rd Qu.:173 3rd Qu.:170.0 3rd Qu.:144.0
## Max. :2400 Max. :679 Max. :766.0 Max. :684.0
## NA's :20412 NA's :2 NA's :21528 NA's :21528
## DISTANCE WHEELS_ON TAXI_IN SCHEDULED_ARRIVAL
## Min. : 31.0 Min. : 1 Min. : 1.000 Min. : 1
## 1st Qu.: 366.0 1st Qu.:1114 1st Qu.: 4.000 1st Qu.:1123
## Median : 641.0 Median :1519 Median : 6.000 Median :1525
## Mean : 800.8 Mean :1490 Mean : 7.583 Mean :1508
## 3rd Qu.:1045.0 3rd Qu.:1912 3rd Qu.: 9.000 3rd Qu.:1915
## Max. :4983.0 Max. :2400 Max. :202.000 Max. :2359
## NA's :20822 NA's :20822
## ARRIVAL_TIME ARRIVAL_DELAY DIVERTED CANCELLED
## Min. : 1 Min. : -79.000 Min. :0.000000 0:408674
## 1st Qu.:1119 1st Qu.: -12.000 1st Qu.:0.000000 1: 20517
## Median :1524 Median : -2.000 Median :0.000000
## Mean :1496 Mean : 8.321 Mean :0.002356
## 3rd Qu.:1918 3rd Qu.: 13.000 3rd Qu.:0.000000
## Max. :2400 Max. :1627.000 Max. :1.000000
## NA's :20822 NA's :21528
## CANCELLATION_REASON AIR_SYSTEM_DELAY SECURITY_DELAY AIRLINE_DELAY
## A : 2815 Min. : 0.0 Min. : 0 Min. : 0
## B : 15447 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0
## C : 2254 Median : 4.0 Median : 0 Median : 2
## D : 1 Mean : 14.2 Mean : 0 Mean : 18
## NA's:408674 3rd Qu.: 19.0 3rd Qu.: 0 3rd Qu.: 18
## Max. :748.0 Max. :168 Max. :1587
## NA's :334012 NA's :334012 NA's :334012
## LATE_AIRCRAFT_DELAY WEATHER_DELAY
## Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 3.0 Median : 0.0
## Mean : 22.7 Mean : 4.3
## 3rd Qu.: 29.0 3rd Qu.: 0.0
## Max. :1313.0 Max. :1152.0
## NA's :334012 NA's :334012
flights_feb = flights_feb %>%
mutate( TAIL_NUMBER = ifelse(is.na(TAIL_NUMBER), 0, TAIL_NUMBER),
AIR_SYSTEM_DELAY = ifelse(is.na(AIR_SYSTEM_DELAY), 0, AIR_SYSTEM_DELAY),
SECURITY_DELAY = ifelse(is.na(SECURITY_DELAY), 0, SECURITY_DELAY),
AIRLINE_DELAY = ifelse(is.na(AIRLINE_DELAY), 0, AIRLINE_DELAY),
LATE_AIRCRAFT_DELAY = ifelse(is.na(LATE_AIRCRAFT_DELAY), 0, LATE_AIRCRAFT_DELAY),
WEATHER_DELAY = ifelse(is.na(WEATHER_DELAY), 0, WEATHER_DELAY),
CANCELLATION_REASON = ifelse(is.na(CANCELLATION_REASON), 0, CANCELLATION_REASON),
TAXI_IN = ifelse(is.na(TAXI_IN), 0, TAXI_IN),
AIR_TIME = ifelse(is.na(AIR_TIME), 0, AIR_TIME),
ELAPSED_TIME = ifelse(is.na(ELAPSED_TIME), 0, ELAPSED_TIME),
SCHEDULED_TIME = ifelse(is.na(SCHEDULED_TIME), 0, SCHEDULED_TIME),
WHEELS_OFF = ifelse(is.na(WHEELS_OFF), 0, WHEELS_OFF),
TAXI_OUT = ifelse(is.na(TAXI_OUT), 0, TAXI_OUT)
)Comments: From the data above we can conclude the following:
We have 21,528 Cancelled flights (We can see it from AIR_TIME and ELAPSED_TIME NAs), however both WHEELS_OFF (20,412 NAs) and WHEELS_ON (20,822 NAs - Same as TAXI_IN, they are linked) are less than 21,528, and we would assume since a flight was Cancelled that the flight never took off (WHEELS_OFF) and therefore, never landed (WHEELS_ON). With these flights we will assume that: they actually took off, but due to reasons we do not know, they had to land immediately after.
We will consider a “delayed flight” using ARRIVAL_DELAY as a continuous variable.
# Delayed flights
sum(!is.na(flights_feb$ARRIVAL_DELAY))## [1] 407663
# Delayed flights of 1 minute or more
sum(na.omit(flights_feb$ARRIVAL_DELAY)!=0)## [1] 398579
We have 407,663 delayed flights, from those 398,579 are 1 minute or more.
We will start by analyzing if our variables have any correlation.
# ggcorrplot(cor(flights_feb[,-c(1:1,5,7:9,26)]))
cormat = round(cor(flights_feb[,-c(1,4:9,25:26)]),2)## Warning in cor(flights_feb[, -c(1, 4:9, 25:26)]): the standard deviation is zero
melted_cormat = melt(cormat)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))For the heat map we had to exclude any categorical variables, and so far with the continuous variables we can not see any strong correlation other from “SCHEDULED_ARRIVAL” AND “SCHEDULED_DEPARTURE”. (which is obvious)
To analyze the categorical variables, we will use some box plots and bar plots. To analyze the relationship between the categorical variables as predictors and the response variables.
Delayed arrival time per airline
We will start by analyzing how the delayed time behaves throughout the airlines, and try to see if any of them stand out. First we will do it visually, where we can see the box plot and the outliers. For display we will use the abbreviation of the names, in the table bellow we will find the complete name of the airlines.
ggplot(flights_feb, aes(x=AIRLINE, y=ARRIVAL_DELAY, fill=AIRLINE))+
geom_boxplot()+
labs(title="Delayed Arrival time (minutes) per airline", x="Airlines", y="Delay (mins)") At first glance we can see that American Airlines (AA) and Hawaiian Airlines (HA) have the flights with the longest delayed times. The second thing we can see is that pretty much all of them have a very low “Average Arrival Delay”, however JetBlue Airways (B6), Frontier Airlines (F9) and American Eagle Airlines (MAQ) show to be the airlines with an average arrival delay higher than the rest of them.
airlines <- read.csv(file = "airlines.csv", header = TRUE, fileEncoding = 'UTF-8-BOM')
airlines %>%
arrange(IATA_CODE) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))| IATA_CODE | AIRLINE |
|---|---|
| AA | American Airlines Inc. |
| AS | Alaska Airlines Inc. |
| B6 | JetBlue Airways |
| DL | Delta Air Lines Inc. |
| EV | Atlantic Southeast Airlines |
| F9 | Frontier Airlines Inc. |
| HA | Hawaiian Airlines Inc. |
| MQ | American Eagle Airlines Inc. |
| NK | Spirit Air Lines |
| OO | Skywest Airlines Inc. |
| UA | United Air Lines Inc. |
| US | US Airways Inc. |
| VX | Virgin America |
| WN | Southwest Airlines Co. |
Numbers wise we can verify that the JetBlue Airways (B6), Frontier Airlines (F9) and American Eagle Airlines (MQ) are the airlines with the highest mean values, but we also can see that Alaska Airlines (AS) has a negative mean, which translates in most their flights arriving even earlier than expected.
Airlines is something we would like to look into.
flights_feb %>%
group_by(AIRLINE) %>%
summarize(min = min(na.omit(ARRIVAL_DELAY)),
median = median(na.omit(ARRIVAL_DELAY)),
mean = round(mean(na.omit(ARRIVAL_DELAY)),2),
max = max(na.omit(ARRIVAL_DELAY))) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))| AIRLINE | min | median | mean | max |
|---|---|---|---|---|
| AA | -67 | -3 | 7.53 | 1627 |
| AS | -72 | -6 | -0.78 | 729 |
| B6 | -62 | 3 | 18.66 | 952 |
| DL | -79 | -5 | 5.61 | 1177 |
| EV | -52 | -1 | 10.42 | 637 |
| F9 | -44 | 6 | 27.42 | 839 |
| HA | -52 | 1 | 6.03 | 1467 |
| MQ | -55 | 5 | 21.30 | 1152 |
| NK | -54 | 2 | 16.47 | 523 |
| OO | -54 | -2 | 9.59 | 953 |
| UA | -65 | -2 | 7.26 | 863 |
| US | -59 | -2 | 7.11 | 515 |
| VX | -63 | -4 | 7.78 | 572 |
| WN | -73 | -4 | 3.50 | 508 |
Delayed arrival time per Day of the Week
ggplot(flights_feb, aes(x=DAY_OF_WEEK, y=ARRIVAL_DELAY, fill=DAY_OF_WEEK))+
geom_boxplot()+
labs(title="Delayed Arrival time (minutes) by day of the week", x="Day of the Week", y="Delay (mins)") +
guides(fill=guide_legend(title="Day of the Week"))flights_feb %>%
group_by(DAY_OF_WEEK) %>%
summarize(min = min(na.omit(ARRIVAL_DELAY)),
median = median(na.omit(ARRIVAL_DELAY)),
mean = round(mean(na.omit(ARRIVAL_DELAY)),2),
max = max(na.omit(ARRIVAL_DELAY))) %>%
kable(col.names = c("Day of the Week", "Min", "Median", "Mean", "Max")) %>%
kable_styling(bootstrap_options = c("striped","hover")) | Day of the Week | Min | Median | Mean | Max |
|---|---|---|---|---|
| Mon | -62 | -3 | 8.56 | 1351 |
| Tue | -64 | -4 | 5.35 | 1371 |
| Wed | -72 | 0 | 10.54 | 1366 |
| Thu | -79 | -3 | 6.19 | 1295 |
| Fri | -73 | -5 | 5.05 | 1627 |
| Sat | -70 | -3 | 9.51 | 1460 |
| Sun | -77 | 0 | 12.76 | 1467 |
From the statistics above we can see that flights are delayed more minutes in average on Sunday, then Wednesday; and that if you want your flight to be arriving on time, you have a better chance if you take it on Friday or Tuesday.
Delayed arrival time depending on the times of the day
We would also know what is the schedule where is most likely to have your flight delayed.
flights_feb = flights_feb %>%
mutate(ARRIVAL_PART_DAY = case_when(
ARRIVAL_TIME >= 600 & ARRIVAL_TIME < 1200 ~ "Morning",
ARRIVAL_TIME >= 1200 & ARRIVAL_TIME < 1700 ~ "Afternoon",
ARRIVAL_TIME >= 1700 & ARRIVAL_TIME < 2100 ~ "Evening",
ARRIVAL_TIME <= 2100 & ARRIVAL_TIME < 600 ~ "Night" ))
flights_feb$ARRIVAL_PART_DAY = as.factor(flights_feb$ARRIVAL_PART_DAY)ARRIVAL_MEANS = flights_feb %>%
drop_na(ARRIVAL_DELAY) %>%
group_by(ARRIVAL_PART_DAY) %>%
summarize(mean_arrival_delay = mean(ARRIVAL_DELAY)) %>%
arrange(factor(ARRIVAL_PART_DAY, levels = c('Morning','Afternoon','Evening','Night')))ARRIVAL_MEANS %>%
drop_na(ARRIVAL_PART_DAY) %>%
mutate(ARRIVAL_PART_DAY = fct_relevel(ARRIVAL_PART_DAY, 'Morning','Afternoon','Evening','Night')) %>%
ggplot() +
geom_col(aes(x=ARRIVAL_PART_DAY, y=mean_arrival_delay, fill =ARRIVAL_PART_DAY)) +
labs(title="Avg Delay (mins) related to arrival", x="Scheduled Arrival (Part of the day)", y="Avg. Delay (mins)") +
guides(fill=guide_legend(title="Part of Day (Arrival)"))We can see that as we go later on the day, the flight is most likely to get delayed. Another interesting thing we can see is that in the morning the “Avg. Delay” (which wouldn’t be a delay) is actually negative, that means that the airports get the flights leaving earlier. (I would assume to counteract the possibility of a delay for the flights later that day).
The “CANCELLED” flights variable is a binomial variable with 2 categories 1 (yes) and 0 (no), that correspond to whether the flight was cancelled or not. We will take a look on the distribution, by airline and then by day of the week.
Cancelled Flights by airline
ggplot(flights_feb, aes(x = AIRLINE, fill = CANCELLED)) +
geom_bar(position = position_dodge()) +
labs(title="Cancelled Flights by airline", x="Airline", y="Count of flights") +
geom_text(stat = "count", aes(label = ..count..), size = 3)Cancelled Flights by day of the week
ggplot(flights_feb, aes(x = DAY_OF_WEEK, fill = CANCELLED)) +
geom_bar(position = position_dodge()) +
labs(title="Cancelled Flights by Day of the week", x="Day of the Week", y="Count of flights") +
geom_text(stat = "count", aes(label = ..count..))Here we can see that flights are most likely to be cancelled on Sunday and Saturday. And less likely on Wednesday.
Cancelled flights depending on the times of the day
flights_feb = flights_feb %>%
drop_na(SCHEDULED_DEPARTURE) %>%
mutate(CANCELLED_PART_DAY = case_when(
SCHEDULED_DEPARTURE >= 600 & SCHEDULED_DEPARTURE < 1200 ~ "Morning",
SCHEDULED_DEPARTURE >= 1200 & SCHEDULED_DEPARTURE < 1700 ~ "Afternoon",
SCHEDULED_DEPARTURE >= 1700 & SCHEDULED_DEPARTURE < 2100 ~ "Evening",
SCHEDULED_DEPARTURE <= 2100 & SCHEDULED_DEPARTURE < 600 ~ "Night" ))
flights_feb$CANCELLED_PART_DAY = as.factor(flights_feb$CANCELLED_PART_DAY)flights_feb %>%
drop_na(CANCELLED_PART_DAY) %>%
mutate(CANCELLED_PART_DAY = fct_relevel(CANCELLED_PART_DAY, 'Morning','Afternoon','Evening','Night')) %>%
ggplot(aes(x = CANCELLED_PART_DAY, fill = CANCELLED)) +
geom_bar(position = position_dodge()) +
labs(title="Cancelled Flights by Day of the week", x="Day of the Week", y="Count of flights") +
geom_text(stat = "count", aes(label = ..count..))We will now analyze our data set in terms of: a) Delayed Flights and b) Cancelled flights. Delayed Flights is a continuous response variable (ARRIVAL_DELAY), therefore will be analyzed with: Linear Regression, SVR (Support Vector Regression), Decision Tree and Random Forest. Cancelled Flights is a binomial variable, therefore will be analyzed with: Logistic Regression, SVM, Decision Tree and Random Forest.
In the next section (Findings), I will present a summary of the MSE for the regression models and Accuracy (or confusion Matrix) for the categorical models.
First I will subset my training and test data sets. For this analysis, I will only use 4 out of the 14 airlines, due to storage limitation (my first try by running the linear regression with 7 out of 31 variables was unable to run, since it was a 34.2 Gb model).
We will use the following airlines: American Airlines (AA) and Delta Airlines (DL). I picked them since they and other 2 airlines have the highest delays (outliers), but also those 2 have the lowest avg. delay times. We will also focus on only 1 airport to be able to run all our models.
temp = flights_feb[flights_feb$AIRLINE == 'AA' | flights_feb$AIRLINE == 'DL', ] temp %>%
count(ORIGIN_AIRPORT) %>%
mutate(rank = min_rank(-n)) %>%
arrange(desc(n)) %>%
rename(OriginAirport = ORIGIN_AIRPORT, times = n) %>%
filter(times > 2500)## OriginAirport times rank
## 1 ATL 17801 1
## 2 DFW 11614 2
## 3 MIA 4597 3
## 4 MSP 4277 4
## 5 ORD 4112 5
## 6 LAX 4015 6
## 7 DTW 3889 7
## 8 LGA 3247 8
## 9 JFK 3177 9
## 10 SLC 2711 10
We will pick the top 2 airports (ATL = Atlanta, DWF = Dallas-Forth Worth), to reduce even more the number of rows. But also to keep the representability of both Delta and American Airlines, since Atlanta and Dallas-FW are the “house” of each of those airlines.
To reduce even more our data set, so that we can run the SVM and Random Forest algorithms, we will choose only 14 out of 28 days.
temporary = temp[temp$ORIGIN_AIRPORT == 'ATL' | temp$ORIGIN_AIRPORT == 'DFW', ]
temporary = temporary %>%
filter(DAY <= 14)
# rm(flights_feb)
sum(is.na(temporary$ARRIVAL_DELAY))## [1] 182
temporary = temporary %>%
drop_na(ARRIVAL_DELAY) %>%
drop_na(ARRIVAL_PART_DAY) %>%
drop_na(CANCELLED_PART_DAY)
sum(is.na(temporary$ARRIVAL_DELAY))## [1] 0
sum(is.na(temporary$ARRIVAL_PART_DAY))## [1] 0
Remove unused levels
temporary = temporary %>%
filter(ORIGIN_AIRPORT == 'ATL' | ORIGIN_AIRPORT == 'DFW') %>%
droplevels()summary(temporary$ORIGIN_AIRPORT)## ATL DFW
## 6884 4441
temporary %>%
ggplot() +
geom_bar(aes(x=ORIGIN_AIRPORT, fill = AIRLINE))set.seed(123)
idx.train = sample(1:nrow(temporary), size = 0.6 * nrow(temporary))
train.df = temporary[idx.train,]
test.df = temporary[-idx.train,]Due to memory/data processing constraint, I won’t show the multiple linear regression models ran, but at all the models and different combinations, DAY_OF_WEEK stayed significant, along with ORIGIN_AIRPORT.
Only 1 Destination Airport was significant: SGF (Springfield, MO), and for FLIGHT_NUMBER couple of flights were significant: 280 (pos coeff), 315 (neg coeff), 382 (pos coeff), 886 (pos coeff), 926 (pos coeff). I will later use this information.
I won’t include the variables such as “WHEELS_OFF”, “WHEELS_ON”, since those variables are more of “descriptors” than “predictors” of the model.
Creating a small data set
I will create a data set specifically to tune my model, the reason behind this, is that because of the size of the actual training data set, I was unable tu perform this procedure, so I needed an even smaller data set.
set.seed(123)
idx.train.svm = sample(1:nrow(train.df), size = 0.6 * nrow(train.df))
train.df.svm = train.df[idx.train.svm,]
test.df.svm = train.df[-idx.train.svm,]
train.df.svm = na.omit(train.df.svm)
test.df.svm = na.omit(test.df.svm)Fitting the Model with Train Model
We removed ELAPSED_TIME due to the correlation with DISTANCE
lm.fit = lm(ARRIVAL_DELAY ~ DAY_OF_WEEK+AIRLINE+
ORIGIN_AIRPORT+DEPARTURE_TIME+DISTANCE+
AIR_SYSTEM_DELAY+SECURITY_DELAY+
AIRLINE_DELAY+LATE_AIRCRAFT_DELAY+
WEATHER_DELAY,
data = train.df)
summary(lm.fit)##
## Call:
## lm(formula = ARRIVAL_DELAY ~ DAY_OF_WEEK + AIRLINE + ORIGIN_AIRPORT +
## DEPARTURE_TIME + DISTANCE + AIR_SYSTEM_DELAY + SECURITY_DELAY +
## AIRLINE_DELAY + LATE_AIRCRAFT_DELAY + WEATHER_DELAY, data = train.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.847 -5.611 0.224 5.492 33.755
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.7903724 0.7929299 -8.564 < 2e-16 ***
## DAY_OF_WEEK.L 1.2061243 0.2850332 4.232 2.35e-05 ***
## DAY_OF_WEEK.Q 1.1159728 0.2833956 3.938 8.30e-05 ***
## DAY_OF_WEEK.C 2.0990930 0.2899629 7.239 5.01e-13 ***
## DAY_OF_WEEK^4 2.2024309 0.2855764 7.712 1.41e-14 ***
## DAY_OF_WEEK^5 0.1993601 0.2950048 0.676 0.499200
## DAY_OF_WEEK^6 -0.1565786 0.2877759 -0.544 0.586391
## AIRLINEDL -0.5106828 0.6600668 -0.774 0.439145
## ORIGIN_AIRPORTDFW 1.5333521 0.6600956 2.323 0.020213 *
## DEPARTURE_TIME 0.0010850 0.0003002 3.614 0.000303 ***
## DISTANCE -0.0026612 0.0002218 -11.996 < 2e-16 ***
## AIR_SYSTEM_DELAY 1.0846307 0.0101414 106.951 < 2e-16 ***
## SECURITY_DELAY NA NA NA NA
## AIRLINE_DELAY 1.0772619 0.0080341 134.086 < 2e-16 ***
## LATE_AIRCRAFT_DELAY 1.0745771 0.0157362 68.287 < 2e-16 ***
## WEATHER_DELAY 1.0937481 0.0527056 20.752 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.898 on 6780 degrees of freedom
## Multiple R-squared: 0.8525, Adjusted R-squared: 0.8522
## F-statistic: 2799 on 14 and 6780 DF, p-value: < 2.2e-16
There are some variables that show “perfect separation”, such as the Delay Causes (WEATHER_DELAY, LATE_AIRCRAFT_DELAY), I will perform the VIF test to analyze if it is needed to remove them or not.
# vif(lm.fit)lm.null = lm(ARRIVAL_DELAY ~ 1, data = train.df.svm)
lm.full = lm(ARRIVAL_DELAY ~ DAY_OF_WEEK+AIRLINE+
ORIGIN_AIRPORT+DEPARTURE_TIME+DISTANCE+
AIR_SYSTEM_DELAY+SECURITY_DELAY+
AIRLINE_DELAY+LATE_AIRCRAFT_DELAY+
WEATHER_DELAY, data = train.df.svm)step.AIC = step(lm.null, scope = list(upper=lm.full),
direction ="both",test ="Chisq", trace = F)
summary(step.AIC)##
## Call:
## lm(formula = ARRIVAL_DELAY ~ AIRLINE_DELAY + AIR_SYSTEM_DELAY +
## LATE_AIRCRAFT_DELAY + WEATHER_DELAY + DAY_OF_WEEK + DISTANCE +
## ORIGIN_AIRPORT + DEPARTURE_TIME, data = train.df.svm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.583 -5.489 0.278 5.646 27.905
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.4303756 0.5881669 -12.633 < 2e-16 ***
## AIRLINE_DELAY 1.0748550 0.0097046 110.757 < 2e-16 ***
## AIR_SYSTEM_DELAY 1.0718980 0.0115360 92.917 < 2e-16 ***
## LATE_AIRCRAFT_DELAY 1.0875702 0.0220169 49.397 < 2e-16 ***
## WEATHER_DELAY 1.0768011 0.0619198 17.390 < 2e-16 ***
## DAY_OF_WEEK.L 1.5777268 0.3703899 4.260 2.09e-05 ***
## DAY_OF_WEEK.Q 1.3159512 0.3653922 3.601 0.00032 ***
## DAY_OF_WEEK.C 1.9777472 0.3758248 5.262 1.50e-07 ***
## DAY_OF_WEEK^4 2.3520740 0.3698091 6.360 2.24e-10 ***
## DAY_OF_WEEK^5 -0.0472345 0.3827212 -0.123 0.90178
## DAY_OF_WEEK^6 -0.3378758 0.3690306 -0.916 0.35994
## DISTANCE -0.0027983 0.0002855 -9.802 < 2e-16 ***
## ORIGIN_AIRPORTDFW 2.2306786 0.2974040 7.501 7.76e-14 ***
## DEPARTURE_TIME 0.0011027 0.0003869 2.850 0.00439 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.912 on 4063 degrees of freedom
## Multiple R-squared: 0.8645, Adjusted R-squared: 0.8641
## F-statistic: 1994 on 13 and 4063 DF, p-value: < 2.2e-16
According to VIF, the model looks good, so we will go ahead and perform the predictions and calculate the MSE.
Linear Regression: Predictions and Probabilities on the test data set
lm.preds = predict(step.AIC, newdata = test.df.svm)
# lm.rmse = rmse(test.df$ARRIVAL_DELAY, lm.preds)
lm.rmse = sqrt(mean((test.df.svm$ARRIVAL_DELAY - lm.preds)^2))
lm.rmse## [1] 8.890251
Tune - find best gamma and cost
tuned = tune.svm(ARRIVAL_DELAY ~ AIRLINE_DELAY + AIR_SYSTEM_DELAY +
LATE_AIRCRAFT_DELAY + WEATHER_DELAY + DAY_OF_WEEK + DISTANCE +
ORIGIN_AIRPORT + DEPARTURE_TIME,
data = train.df.svm, gamma = seq(0.1, 1, by = 0.2),
cost = seq(0.1, 1, by = 0.2), scale = TRUE)Please note: I couldn’t run the gamma and cost sequence by 0.1, since it was a very large tune, it brought me back every single time “empty tables”. Therefore, the best Gamma and Cost is 0.6.
svm.fit.lm = svm(ARRIVAL_DELAY ~ AIRLINE_DELAY + AIR_SYSTEM_DELAY +
LATE_AIRCRAFT_DELAY + WEATHER_DELAY + DAY_OF_WEEK + DISTANCE +
ELAPSED_TIME + ORIGIN_AIRPORT + AIRLINE + DEPARTURE_TIME,
data = train.df.svm, gamma = tuned$best.parameters$gamma,
cost = tuned$best.parameters$cost, scale = TRUE)SVR: Predictions and Probabilities on the test data set
svm.preds.lm = predict(svm.fit.lm, test.df.svm)
svm.rmse.lm = sqrt(mean((test.df.svm$ARRIVAL_DELAY - svm.preds.lm)^2))
svm.rmse.lm## [1] 9.866235
Comments: Even when it might not be the best model, due to memory constraints in my computer RAM, we can see that there was INDEED an improvement on the RMSE.
dt.model.lm = tree(ARRIVAL_DELAY ~ AIRLINE_DELAY + AIR_SYSTEM_DELAY +
LATE_AIRCRAFT_DELAY + WEATHER_DELAY + DAY_OF_WEEK + DISTANCE +
ORIGIN_AIRPORT + DEPARTURE_TIME,
data = train.df)
dt.preds = predict(dt.model.lm, test.df)
dt.rmse = sqrt(mean((test.df$ARRIVAL_DELAY - dt.preds)^2))
dt.rmse## [1] 11.11843
For Random Forest, it was a challenge, due to the memory usage. I will use “ranger” which is the package for Random Forest for data sets with high dimensionality.
Finding the best parameters
grid <- expand.grid(
mtry = 2:6,
min.node.size = seq(0.1,1,0.1),
sample_size = seq(0.1,1,0.1),
OOB = 0
)for(i in 1:nrow(grid)) {
# train model
model <- ranger(ARRIVAL_DELAY ~ AIRLINE_DELAY + AIR_SYSTEM_DELAY +
LATE_AIRCRAFT_DELAY + WEATHER_DELAY + DAY_OF_WEEK + DISTANCE +
ORIGIN_AIRPORT + DEPARTURE_TIME,
data = train.df,
num.trees = 200,
mtry = grid$mtry[i],
min.node.size = grid$min.node.size[i],
sample.fraction = grid$sample_size[i],
seed = 123,
write.forest = T)
grid$OOB[i] <- sqrt(model$prediction.error)
}To find the lowest OOB error I will arrange the resulting table in a descending order.
grid %>%
dplyr::arrange(OOB) %>%
head(10)## mtry min.node.size sample_size OOB
## 1 6 0.1 0.9 10.41600
## 2 6 0.2 0.9 10.41600
## 3 6 0.3 0.9 10.41600
## 4 6 0.4 0.9 10.41600
## 5 6 0.5 0.9 10.41600
## 6 6 0.6 0.9 10.41600
## 7 6 0.7 0.9 10.41600
## 8 6 0.8 0.9 10.41600
## 9 6 0.9 0.9 10.41600
## 10 6 1.0 0.9 10.45969
We conclude that the best parameters are mtry = 4, min.node.size = 0.1, sample_size = 0.9.
Variable Importance
set.seed(123)
OOB <- vector(mode = "numeric", length = 200)
for(i in seq_along(OOB)) {
optimal_ranger <- ranger(
formula = ARRIVAL_DELAY ~ AIRLINE_DELAY + AIR_SYSTEM_DELAY +
LATE_AIRCRAFT_DELAY + WEATHER_DELAY + DAY_OF_WEEK + DISTANCE +
ORIGIN_AIRPORT + DEPARTURE_TIME,
data = train.df.svm,
num.trees = 200,
mtry = 6,
min.node.size = 0.1,
sample.fraction = 0.7,
importance = 'impurity'
)
OOB[i] <- sqrt(optimal_ranger$prediction.error)
}
hist(OOB, breaks = 20)optimal_ranger$variable.importance %>%
tidy() %>%
dplyr::arrange(desc(x)) %>%
dplyr::top_n(25) %>%
ggplot(aes(reorder(names, x), x)) +
geom_col() +
coord_flip() +
ggtitle("Top 25 important variables") +
labs(x="Variables", y="Variable Importance")## Selecting by x
Predictions with the parameters
rf.model <- randomForest(
formula = ARRIVAL_DELAY ~ AIRLINE_DELAY + AIR_SYSTEM_DELAY +
LATE_AIRCRAFT_DELAY + WEATHER_DELAY + DAY_OF_WEEK + DISTANCE +
ORIGIN_AIRPORT + DEPARTURE_TIME,
data = train.df.svm,
ntree = 200,
mtry = 6,
nodesize = 0.1,
importance = TRUE)rf.preds = predict(rf.model, test.df.svm)
# rf.rmse = rmse(test.df.svm$ARRIVAL_DELAY, optimal_ranger)
rf.rmse = sqrt(mean((test.df.svm$ARRIVAL_DELAY - rf.preds)^2))
rf.rmse## [1] 8.937098