Air travel has become a pivotal feature of the ever globalized world, driving global trade, connecting individuals and countries alike, and driving global economic growth. According to the Air Transport Action Group in 2023, a total of 35.3 million flights carried a total of 4.4 billion passengers and 61.4 million tonnes of cargo between 4,072 global airports. This resulted in a $4.1 trillion total global economic contribution in 2023 alone. While only 1% of all global trade is conducted through air travel, air shipping accounts for 33% of all trade value, indicating that the most expensive goods are shipped via air travel. In 2023 alone, a total of 1,138 airlines were operating a total of 29,039 planes across 67,300 global routes, connecting 21,000 unique pairs of cities. The vast size of this transit network presents unique challenges when working to analyze flight data with the goal of reducing flight delays.
Due to the massive impact of air travel on our global communities and GDPs, there has been significant efforts implemented to streamline air travel and reduce fight delays to ensure smooth predictable travel. Mitigation and reduction of overall air travel delays will allow for increased air travel driven trade output and overall boosts to the global economy. In order to properly mitigate air travel delays, confounding factors that may impact flight delays were captured by Deepti Gupta in Applied Analytics through Case Studies Using SAS and R for analysis and modeling.
In order to aid in reducing overall flight delays, the factors with the greatest impact on overall flight delay times must be identified. Once identified, this information can then be leveraged by airports and airlines to improve staffing, streamline pre-flight processes, and ultimately mitigate delays due to factors within the airlines’ or airports’ control. In addition to preventative measures to mitigate delays, understanding the greatest impacts on flight delays globally will allow for creation of prediction models to live predict upcoming flight delays for real time mitigation before the delays in question even happen. The subsequent analysis and prediction model creation of the global flight delay will serve to address two points of concern:
Prior to any data analysis and visualization, the airline delay data was cleaned and modified in a process called exploratory data analysis. First, to allow for multiple analysis and prediction approaches, two numeric variables were binned to create categorical values (baggage loading time & cleaning time). After binning to create categorical variables, the data was examined to identify and subsequently remove impossible outlier values.
It is near impossible to ensure standardized data collection formats globally, resulting in incomplete data sets with missing values. In order to properly analyze flight data to successfully reduce flight delay times, considerations must be taken to predict said missing values, essentially filling in the gaps. In the world of data science, filling in these data gaps is known as imputation. For the purposes of this analysis and prediction model builds, a Multiple Imputation by Chained Equations (MICE) approach was implemented to impute all missing observations.
Once imputed, the data was then analysed for any potential skew and subsequently normalized utilizing a min-max equation. This normalization was implemented to further prepare the data for use in predictive modeling by reducing any skew and ensuring normal distributions to ensure the assumptions of regression were met. The normalized data was then analyzed for redundancies and the high impact variables were identified for retention through RFE feature selection. Once fully pruned, the PCA component data was then subjected to k-Nearest Neighbor cluster based feature extraction. This k-Means cluster ID creation allows for consolidation of the selected PCA components into a single binary variable reducing the overall number of variables for analysis without loss of data.
The normalized data with added PCA components and k-Means cluster IDs assigned was then utilized to develop four linear regression prediction models for evaluation and ultimate selection. These linear regression prediction models were analyzed using 80:20 cross validation to ensure the best fit model was identified through mean Mean Square Error (MSE) comparison. The ultimate selection of a best fit linear regression prediction model serves as an analytic tool to allow airlines to best determine where upstream delay reduction approaches should be implemented (i.e: ensuring adequate support crews are available, identifying ideal baggage loading / plane cleaning / plane fueling time windows, etc.).
A second prediction model was developed omitting all PCA components and k-Means cluster IDs to develop a low computational burden rapid prediction model for use in live prediction of delays based on current airport conditions. Multiple models were developed using the previously split 80:20 train:test data and compared for efficacy and reliability utilizing Akaike Information Criteria (AIC) values, ROC Receiver Operating Characteristic (ROC) curve comparisons, and calculated Area Under the Curve (AUC) values as calculated from the ROC curves.
delay <- read.csv("https://nlepera.github.io/sta552/HW01/data/Flight_delay-data.csv")
Flight delay data as collected from Applied Analytics through Case Studies Using SAS and R, Deepti Gupta by APress, ISBN - 978-1-4842-3525-6 was downloaded from https://pengdsci.github.io/datasets/.
A total of 3593 flights were analyzed for 11 variables of interest (1 categorical & 11 numeric). The flight variables as recorded include:
var.details <- data.frame(
name = c("Carrier", "Airport Distance", "Number of Flights", "Weather", "Support Crews Available", "Baggage Loading Time", "Late Incoming Plane Arrival", "Cleaning Time", "Fueling Time", "Security Time", "Overall Flight Delay"),
type = c("Categorical", "Numeric", "Numeric", "Numeric", "Numeric", "Numeric [Binned]", "Numeric", "Numeric [Binned]", "Numeric", "Numeric", "Numeric"),
details = c("Airline (Carrier)", "Distance (miles) between departure and arrival airport", "Total number of flights at arrival airport", "A ranking of delays due to weather condition (0: Mild to 10: Extreme)", "Total number of support crew available", "Total time for baggage loading (minutes)", "Delay time for plane arrival prior to flight (minutes)", "Time required to clean plane after arrival prior to passenger loading (minutes)", "Time required to fuel aircraft (minutes)", "Time required for security checks (minutes)", "Total flight delay time (minutes)")
)
kable(var.details, col.names = c("Variable Name", "Variable Type", "Details"), caption = "Table 1:
<br>Global flight data details
<br>note: numeric variables (Weather & Cleaning) were binned to allow for categorical styled handling") %>%
kable_styling()
Variable Name | Variable Type | Details |
---|---|---|
Carrier | Categorical | Airline (Carrier) |
Airport Distance | Numeric | Distance (miles) between departure and arrival airport |
Number of Flights | Numeric | Total number of flights at arrival airport |
Weather | Numeric | A ranking of delays due to weather condition (0: Mild to 10: Extreme) |
Support Crews Available | Numeric | Total number of support crew available |
Baggage Loading Time | Numeric [Binned] | Total time for baggage loading (minutes) |
Late Incoming Plane Arrival | Numeric | Delay time for plane arrival prior to flight (minutes) |
Cleaning Time | Numeric [Binned] | Time required to clean plane after arrival prior to passenger loading (minutes) |
Fueling Time | Numeric | Time required to fuel aircraft (minutes) |
Security Time | Numeric | Time required for security checks (minutes) |
Overall Flight Delay | Numeric | Total flight delay time (minutes) |
A cursory analysis of all data points identified four (4)
observations with impossible outliers in the cleaning time variable.
These values are known to be impossible as a negative duration for
cleaning value is impossible. Rather than dropping all data associated
with these erroneous observations, these negative values were replaced
with NA
to allow for the remaining observations’ data to be
utilized. These impossible values that were replaced with NA values will
subsequently be imputed further along in the analysis process.
impossible <- delay %>%
filter(Cleaning_o < 0) %>%
dplyr::select(c(Carrier, Cleaning_o, Arr_Delay))
delay.clean <- delay %>%
mutate(Cleaning_o = replace(Cleaning_o, Cleaning_o < 0, "NA")) #removed outlier values without removing full observations
delay.clean$Cleaning_o <- as.numeric(delay.clean$Cleaning_o)
kable(impossible, col.names= c(var.details[1,1], var.details[8,1], var.details[11,1]) , caption = "Table 2:
<br>Summary of impossible cleaning times
<br>These times are considered impossible as negative time values cannot exist") %>%
kable_styling()
Carrier | Cleaning Time | Overall Flight Delay |
---|---|---|
B6 | -1 | 70 |
EV | -4 | 82 |
WN | -1 | 75 |
EV | -1 | 54 |
B6 | -1 | 88 |
singlevar <- summary(delay.clean$Cleaning_o)
singlevar <- as.data.frame.AsIs(singlevar)
kable(singlevar, col.names = var.details[8,1], caption = "Table 3:
<br>Summary statistics of cleaning time observations after impossible values replaced with NA") %>%
kable_styling()
Cleaning Time | |
---|---|
Min. | 0.00 |
1st Qu. | 8.00 |
Median | 10.00 |
Mean | 10.04 |
3rd Qu. | 12.00 |
Max. | 23.00 |
NA’s | 5 |
As outlined in the table above in section 1.2 Overall Approach and 2.1 Data Details, two numerical variables were binned to create pseudo categorical variables. By binning these numeric variables into pseudo categorical variables, additional categorical analysis may be conducted and additional prediction models may be utilized for determining and predicting overall flight delays.
The baggage loading time variable was binned in to three (3) duration groups:
Additionally, the cleaning time variable was binned in to four (4) duration groups:
bin_clean <- c(0, 6, 12, 18, 23)
labels_clean <- c("Immediate", "Quick", "Moderate", "Slow")
bin_baggage <- c(14, 16, 18, 19)
labels_baggage <- c("Fast", "Moderate", "Slow")
delay.clean <- delay.clean %>%
mutate(Cleaning_o = cut(Cleaning_o, breaks = bin_clean, labels = labels_clean, right = TRUE, include.lowest = TRUE))
delay.clean <- delay.clean %>%
mutate(Baggage_loading_time = cut(Baggage_loading_time, breaks = bin_baggage, labels = labels_baggage, right = TRUE, include.lowest = TRUE))
kable(summary(delay.clean[, c(6,8)]), col.names = c(var.details[6,1], var.details[8,1]), caption = "Table 4:
<br>Summary of binned global flight data variables") %>%
kable_styling()
Baggage Loading Time | Cleaning Time | |
---|---|---|
Fast : 749 | Immediate: 517 | |
Moderate:2833 | Quick :2261 | |
Slow : 11 | Moderate : 780 | |
NA | Slow : 30 | |
NA | NA’s : 5 |
While the global flight delay data set as prepared by Gupti does not
contain missing values, real world collection of global flight data will
inevitably result in missing values and poor standardization across
countries and regions. In order to ensure the data analysis and
prediction models are prepared to handle these real world gaps in data,
missing values were forced for three variables. The weather, support
crew, and late initial plane arrival time variables were forced to have
missing observations at random using the sample()
procedure. Table 5
below outlines the total number of
missing values that were forced into the data set as a training method
to prepare for the real world data ambiguity.
weather.missing <- sample(1:3593, 5, replace = FALSE)
late.missing <- sample(1:3593, 6, replace = FALSE)
support.missing <- sample(1:3593, 4, replace = FALSE)
delay.clean$Weather[weather.missing] <- NA
delay.clean$Late_Arrival_o[late.missing] <- NA
delay.clean$Support_Crew_Available[support.missing] <- NA
kable(sapply(delay.clean[,c(4,5,7,8)], function (x) sum(is.na(x))), caption = "Table 5:
<br>Count of forced missing observations per variable", col.names = c("Variables", "Count of Missing Observations")) %>%
kable_styling()
Variables | Count of Missing Observations |
---|---|
Weather | 5 |
Support_Crew_Available | 4 |
Late_Arrival_o | 6 |
Cleaning_o | 5 |
In the previous section missing values were forced in this synthetic data set to imitate the realities of data collection and standardization across this globally diverse array of airlines and airports. This knockout procedure was completed to allow for the subsequent data imputation process that will be required for real world data that contains natural missing values. When handling real world global flight delay data, all missing values must be imputed to ensure no gaps remain in the data, without completely dropping observations with missing data. This prevents overall data loss by removing the need to omit full observations based on a single missing value.
Prior to completing any analysis, the missing values in the airline
delay data must be imputed to ensure there are no gaps in the data. . A
summary of the missing values across each variable are included in the
below Figure 1
, Table 6
, and
Table 7
. As demonstrated by the p value in Little’s MCAR
Test (p = 0.6382091) the result is not statistically significant
indicating that the missing values are at random and not cross
correlated between variables. This highlights that the missing values
between variables are not related to one another (ex: a missing value
for Weather does not imply that there will also be a missing value in
Cleaning Time).
missing.array <- plot_pattern(delay.clean, rotate = TRUE, caption = TRUE) +
theme(axis.title.x.top = element_blank())
grid.arrange(arrangeGrob(missing.array, ncol = 1, nrow = 1),
top = textGrob("Missing Values
", gp = gpar(fontsize = 20, fontface = "bold")),
bottom = textGrob("
Figure 1
The distribution pattern of missing values in flight delay data"))
kable(misty_test$result$little, col.names = c("# of Obs.", "# Missing Values", "# of Patterns in Missing Vals", "Statistic", "DF", "p Value"), caption = "Table 6:
<br>Little's MCAR Test") %>%
kable_styling()
# of Obs. | # Missing Values | # of Patterns in Missing Vals | Statistic | DF | p Value |
---|---|---|---|---|---|
3593 | 20 | 5 | 25.17261 | 40 | 0.9675297 |
kable(sapply(delay.clean[c(4,5,7,8)], function (x) sum(is.na(x))), col.names = c("Variable", "Count of Missing Values"), caption = "Table 7:
<br>Summary count of missing values in each variable of the airline delay data. Variables with no missing values omitted.") %>%
kable_styling()
Variable | Count of Missing Values |
---|---|
Weather | 5 |
Support_Crew_Available | 4 |
Late_Arrival_o | 6 |
Cleaning_o | 5 |
While the data set utilized for this analysis had a small number of missing values, manually knocked out of the data set for the purposes of this analysis and data manipulation, real world flight data contains multitudes of gaps. In order to accurately close these missing gaps it is wiser to run an imputation model multiple times to then calculate the predicted missing values based on the multiple imputation runs, rather than relying on a single model to accurately complete all predictions, flawlessly, each time. By utilizing this approach, known as a multiple imputation model, analysts can better close the gap in airline data while accurately reflecting the model’s uncertainty based on the current quantity and pattern of missing observations. Through a multiple imputation model the variance between each imputation run can be compared to capture a measure of the model’s uncertainty.
A mice imputation model was created to allow for increased
flexibility in the arrival delay predictions and to allow for better
preservation of relationships between variables that may not have been
observed in the initial exploratory data analysis steps. This process
allowed for further improved predictive power when applying these models
to real world data with a larger variety of missing data patterns than
observed above in Figure 1
.
No imputation method was dictated to the MICE package to allow the
package to preliminary evaluate each variable with missing data and
determine the appropriate imputation approach for the individual
variable. This allows for increased flexibility in imputing the data,
and prevents accidentally building a model that only works with the
current set of data. By allowing for this variable specific flexibility,
a re-determination of the best fit imputation model is completed each
time the imputation model is run. Therefore the imputation model can
employ best fit approach based on the currently available data, each
time the imputation is run regardless of the pattern and quantity of
missing observations present. This allows for real world data tuning as
new flight delay data is obtained globally. Distributions of the imputed
values in each of the 5 imputation runs utilized in the MICE imputation
model can be seen below in Figure 2
.
m.delay <- delay.clean[-1]
m.delay$Cleaning_o <- as.numeric(m.delay$Cleaning_o)
set.seed(122357)
mice1 <- mice(m.delay, m = 5, maxit = 10, seed = 123, print = FALSE)
mice.delay <- complete(mice1)
mice.delay$Cleaning_o <- as.factor(mice.delay$Cleaning_o)
levels(mice.delay$Cleaning_o) <- c("Immediate", "Quick", "Moderate", "Slow")
mice.plot <- plot_trace(mice1)
grid.arrange(arrangeGrob(mice.plot, ncol = 1, nrow =1),
top = textGrob("MICE Model Imputation
", gp = gpar(fontsize = 18, fontface = "bold")),
bottom = textGrob("Figure 2:
A visual representation of the mean and standard deviation of each imputation run
utilizing the MICE imputation package. A total of 5 imputation runs were completed
with a maximum of 10 iterations per imputation.", gp = gpar(fontsize = 10)))
kable(mice1$method, col.names = c("Variable Name", "Imputation Method"), caption = "Table 8:
<br>Imputation methods utilized for each variable.
<br>Blank values indicate no missing observations were identified for the variable and thus no imputation was required.") %>%
kable_styling()
Variable Name | Imputation Method |
---|---|
Airport_Distance | |
Number_of_flights | |
Weather | pmm |
Support_Crew_Available | pmm |
Baggage_loading_time | |
Late_Arrival_o | pmm |
Cleaning_o | pmm |
Fueling_o | |
Security_o | |
Arr_Delay |
Before passing the MICE imputation package through the complete()
package to automatically combine the imputation runs into a single best
fit imputed data set, the imputation statistics were calculated using
Ruben’s rules first to help determine the efficacy and validity of this
imputation. This allowed for sample means and variance of each
imputation run as well as the variance between each imputation
run to be combined into a working parameters to describe the
combinations of the multiple imputation model. As demonstrated below in
Table 9
B.bar and Standard Error values demonstrate near
negligible variance between each imputation runs and minimal error for
the Weather, Late Incoming Plane Arrival, and Cleaning time variables.
While the Support Crews Available variance between imputation runs has
returned a la value, the standard of error of 7 (7 crew members)
supports this variance may not be inconceivable.
\[ \begin{align*} Mean \ of \ Quanitity \ Of \ Interest's \ Means-> \ \overline{Q} &= \sum _{i=1} ^{m} \frac{Q_i}{m} \\ Average \ Standard \ of \ Error \ of \ Quant. \ of \ Interest-> \ \overline{U} &= \sum _{i=1} ^{m} \frac{U_i}{m} \\ Variance \ of \ Mean \ of \ Quant. \ of \ Interest-> \ B &= \frac{1}{m-1}\sum _{i=1} ^{m} (Q_i - \overline{Q})^2 \\ Estimated \ Total \ Variance \ of \ \overline{Q}-> \ T &= \overline{U} + \frac{m + 1}{m}B \end{align*} \]
n.mice <- length(mice1$imp$Weather)
qweather.m <- c(mean(mice1$imp$Weather[,1]), mean(mice1$imp$Weather[,2]), mean(mice1$imp$Weather[,3]), mean(mice1$imp$Weather[,4]),
mean(mice1$imp$Weather[,5]))
uweather.m <- c(std.error(mice1$imp$Weather[,1]), std.error(mice1$imp$Weather[,2]), std.error(mice1$imp$Weather[,3]),
std.error(mice1$imp$Weather[,4]), std.error(mice1$imp$Weather[,5]))
q.bar.weather.m <- mean(qweather.m)
u.bar.weather.m <- mean(uweather.m)
b.weather.m <- var(qweather.m)
t.weather.m <- (q.bar.weather.m + ((n.mice + 1)/n.mice)*b.weather.m)
se.weather.m <- std.error(qweather.m)
qsupcrew.m <- c(mean(mice1$imp$Support_Crew_Available[,1]), mean(mice1$imp$Support_Crew_Available[,2]), mean(mice1$imp$Support_Crew_Available[,3]),
mean(mice1$imp$Support_Crew_Available[,4]), mean(mice1$imp$Support_Crew_Available[,5]))
usupcrew.m <- c(std.error(mice1$imp$Support_Crew_Available[,1]), std.error(mice1$imp$Support_Crew_Available[,2]),
std.error(mice1$imp$Support_Crew_Available[,3]), std.error(mice1$imp$Support_Crew_Available[,4]),
std.error(mice1$imp$Support_Crew_Available[,5]))
q.bar.supcrew.m <- mean(qsupcrew.m)
u.bar.supcrew.m <- mean(usupcrew.m)
b.supcrew.m <- var(qsupcrew.m)
t.supcrew.m <- (q.bar.supcrew.m + ((n.mice + 1)/n.mice)*b.supcrew.m)
se.supcrew.m <- std.error(qsupcrew.m)
qlateincom.m <- c(mean(mice1$imp$Late_Arrival_o[,1]), mean(mice1$imp$Late_Arrival_o[,2]), mean(mice1$imp$Late_Arrival_o[,3]),
mean(mice1$imp$Late_Arrival_o[,4]), mean(mice1$imp$Late_Arrival_o[,5]))
ulateincom.m <- c(std.error(mice1$imp$Late_Arrival_o[,1]), std.error(mice1$imp$Late_Arrival_o[,2]), std.error(mice1$imp$Late_Arrival_o[,3]),
std.error(mice1$imp$Late_Arrival_o[,4]), std.error(mice1$imp$Late_Arrival_o[,5]))
q.bar.lateincom.m <- mean(qlateincom.m)
u.bar.lateincom.m <- mean(ulateincom.m)
b.lateincom.m <- var(qlateincom.m)
t.lateincom.m <- (q.bar.lateincom.m + ((n.mice + 1)/n.mice)*b.lateincom.m)
se.lateincom.m <- std.error(qlateincom.m)
qclean.m <- c(mean(mice1$imp$Cleaning_o[,1]), mean(mice1$imp$Cleaning_o[,2]), mean(mice1$imp$Cleaning_o[,3]), mean(mice1$imp$Cleaning_o[,4]),
mean(mice1$imp$Cleaning_o[,5]))
uclean.m <- c(std.error(mice1$imp$Cleaning_o[,1]), std.error(mice1$imp$Cleaning_o[,2]), std.error(mice1$imp$Cleaning_o[,3]),
std.error(mice1$imp$Cleaning_o[,4]), std.error(mice1$imp$Cleaning_o[,5]))
q.bar.clean.m <- mean(qclean.m)
u.bar.clean.m <- mean(uclean.m)
b.clean.m <- var(qclean.m)
t.clean.m <- (q.bar.clean.m + ((n.mice + 1)/n.mice)*b.clean.m)
se.clean.m <- std.error(qclean.m)
mice.merge <- data.frame(
Statistic = c("Q.bar", "U.bar", "B.var", "T", "Std.Err"),
mice.weather = c(q.bar.weather.m, u.bar.weather.m, b.weather.m, t.weather.m, se.weather.m),
mice.sup.crew = c(q.bar.supcrew.m, u.bar.supcrew.m, b.supcrew.m, t.supcrew.m, se.supcrew.m),
mice.late.incoming = c(q.bar.lateincom.m, u.bar.lateincom.m, b.lateincom.m, t.lateincom.m, se.lateincom.m),
mice.clean = c(q.bar.clean.m, u.bar.clean.m, b.clean.m, t.clean.m, se.clean.m)
)
kable(mice.merge, col.names = c("Statistic", var.details[4,1], var.details[5,1], var.details[7,1], var.details[8,1]) , caption = "Table 9:
<br>
Summary statistics using Ruben's Rules for MICE Imputation for missing values in global flight delay data <br> number of imputations (m) = 5 <br> maximum iterations (maxit) = 10") %>%
kable_styling()
Statistic | Weather | Support Crews Available | Late Incoming Plane Arrival | Cleaning Time |
---|---|---|---|---|
Q.bar | 5.4400000 | 86.150000 | 18.6333333 | 2.0400000 |
U.bar | 0.2359592 | 18.894378 | 0.2410825 | 0.2696663 |
B.var | 0.0280000 | 382.737500 | 0.0750000 | 0.0480000 |
T | 5.4736000 | 545.435000 | 18.7233333 | 2.0976000 |
Std.Err | 0.0748331 | 8.749143 | 0.1224745 | 0.0979796 |
As a supporting analysis to the Ruben’s Rules totals above, the
summary statistics for the initial variables prior to imputation and the
summary statistics after imputation were reviewed
(Table 10.a
& Table 10.b
below). The
imputed Weather and Late Incoming Plane Arrival variables displayed no
change to the mean or median post imputation, indicating a high success
rate for the MICE ppm imputation approach. Additionally, the Support
Crews Available variable demonstrated no change in mean and only a 0.03
change in the median. As this variable is measuring the number of people
available in the support crews this 0.03 change in median can be
completely ignored as there cannot feasibly be 0.03 of a person. Finally
the imputation of Cleaning Time variable demonstrated the greatest
frequency increase for the mode value, further corroborating the success
of the MICE ppm imputation for use with a categorical variable.
kable(summary(delay.clean[, c(4, 5, 7, 8)]), col.names = c(var.details[4,1], var.details[5,1], var.details[7,1], var.details[8,1]), caption = "Table 10.a:
<br>Summary statistics of variables with missing values before MICE imputation") %>%
kable_styling()
Weather | Support Crews Available | Late Incoming Plane Arrival | Cleaning Time | |
---|---|---|---|---|
Min. :5.000 | Min. : 0 | Min. :15.00 | Immediate: 517 | |
1st Qu.:5.000 | 1st Qu.: 56 | 1st Qu.:18.00 | Quick :2261 | |
Median :5.000 | Median : 83 | Median :19.00 | Moderate : 780 | |
Mean :5.354 | Mean : 85 | Mean :18.74 | Slow : 30 | |
3rd Qu.:6.000 | 3rd Qu.:112 | 3rd Qu.:19.00 | NA’s : 5 | |
Max. :6.000 | Max. :222 | Max. :22.00 | NA | |
NA’s :5 | NA’s :4 | NA’s :6 | NA |
kable(summary(mice.delay[, c(3, 4, 6, 7)]), col.names = c(var.details[4,1], var.details[5,1], var.details[7,1], var.details[8,1]), caption = "Table 10.b:
<br>Summary statistics of variables with missing values after MICE imputation") %>%
kable_styling()
Weather | Support Crews Available | Late Incoming Plane Arrival | Cleaning Time | |
---|---|---|---|---|
Min. :5.000 | Min. : 0.00 | Min. :15.00 | Immediate: 518 | |
1st Qu.:5.000 | 1st Qu.: 56.00 | 1st Qu.:18.00 | Quick :2265 | |
Median :5.000 | Median : 83.00 | Median :19.00 | Moderate : 780 | |
Mean :5.354 | Mean : 85.03 | Mean :18.74 | Slow : 30 | |
3rd Qu.:6.000 | 3rd Qu.:112.00 | 3rd Qu.:19.00 | NA | |
Max. :6.000 | Max. :222.00 | Max. :22.00 | NA |
In order to determine if any of the variable data is skewed, the continuous variable distributions were plotted with an overlay indicating the standard normal distribution as calculated with the respective variable’s calculated mean and standard deviation. This check was completed to ensure the data as available will cooperate with the prediction models utilized later in the analysis. As a majority of prediction models require the assumption of a normal distribution, the calculated normal distribution curve for each respective variable was overlaid as a comparison reference. In order to properly leverage the prediction models to optimize airport planning to reduce flight delays, it is imperative to ensure the data fits within the assumptions of the prediction model utilized.
dist_dens <- ggplot(mice.delay, aes(x = Airport_Distance)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(mice.delay$Airport_Distance), sd = sd(mice.delay$Airport_Distance)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(mice.delay$Airport_Distance)), color = "black") +
geom_vline(aes(xintercept = median(mice.delay$Airport_Distance)), color = "black", linetype = "dashed") +
labs(title = "Distance Between Departure-Arrival Airports",
x = "Distance (miles)",
y = "Density")
numflight_dens <- ggplot(mice.delay, aes(x = Number_of_flights)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(mice.delay$Number_of_flights), sd = sd(mice.delay$Number_of_flights)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(mice.delay$Number_of_flights)), color = "black") +
geom_vline(aes(xintercept = median(mice.delay$Number_of_flights)), color = "black", linetype = "dashed") +
labs(title = "Total Flights per Airport *",
x = "Flights per Airport",
y = NULL)
supcrew_dens <- ggplot(mice.delay, aes(x = Support_Crew_Available)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(mice.delay$Support_Crew_Available), sd = sd(mice.delay$Support_Crew_Available, na.rm = TRUE)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(mice.delay$Support_Crew_Available)), color = "black") +
geom_vline(aes(xintercept = median(mice.delay$Support_Crew_Available)), color = "black", linetype = "dashed") +
labs(title = "Total Support Crew Members Available *",
x = "Support Crew Available",
y = NULL)
fueling_dens <- ggplot(mice.delay, aes(x = Fueling_o)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(mice.delay$Fueling_o), sd = sd(mice.delay$Fueling_o)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(mice.delay$Fueling_o)), color = "black") +
geom_vline(aes(xintercept = median(mice.delay$Fueling_o)), color = "black", linetype = "dashed") +
labs(title = "Plane Fueling Times",
x = "Fueling Time (minutes)",
y = NULL)
security_dens <- ggplot(mice.delay, aes(x = Security_o)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(mice.delay$Security_o), sd = sd(mice.delay$Security_o)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(mice.delay$Security_o)), color = "black") +
geom_vline(aes(xintercept = median(mice.delay$Security_o)), color = "black", linetype = "dashed") +
labs(title = "Total Security Times",
x = "Time (minutes)",
y = "Density")
arrival_dens <- ggplot(mice.delay, aes(x = Arr_Delay)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(mice.delay$Arr_Delay), sd = sd(mice.delay$Arr_Delay)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(mice.delay$Arr_Delay)), color = "black") +
geom_vline(aes(xintercept = median(mice.delay$Arr_Delay)), color = "black", linetype = "dashed") +
labs(title = "Total Flight Delays",
x = "Duration (minutes)",
y = "Density")
lateincom_dens <- ggplot(mice.delay, aes(x = Late_Arrival_o)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(mice.delay$Late_Arrival_o), sd = sd(mice.delay$Late_Arrival_o)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(mice.delay$Late_Arrival_o)), color = "black") +
geom_vline(aes(xintercept = median(mice.delay$Late_Arrival_o)), color = "black", linetype = "dashed") +
labs(title = "Late Incomig Plane Delays *",
x = "Duration (minutes)",
y = "Density")
weather_dens <- ggplot(mice.delay, aes(x = Weather)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(mice.delay$Weather), sd = sd(mice.delay$Weather)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(mice.delay$Weather)), color = "black") +
geom_vline(aes(xintercept = median(mice.delay$Weather)), color = "black", linetype = "dashed") +
labs(title = "Weather Conditions",
x = "Ranking of Weather Severity",
y = "Density")
baggage_bar <- ggplot(mice.delay, aes(x = Baggage_loading_time)) +
geom_bar(fill = "blue", alpha = 0.5) +
labs(title = "Baggage Loading Times *",
x = "Loading Duration (slowest to fastest from left to right)",
y = "Frequency")
cleaning_bar <- ggplot(mice.delay, aes(x = Cleaning_o)) +
geom_bar(fill = "blue", alpha = 0.5) +
labs(title = "Plane Cleaning Times *",
x = "Cleaning Duration (slowest to fastest from left to right)",
y = NULL)
grid.arrange(
arrangeGrob(arrival_dens, numflight_dens, dist_dens, fueling_dens, security_dens, supcrew_dens, lateincom_dens, weather_dens, baggage_bar, cleaning_bar, ncol = 2,nrow = 5),
top = textGrob("Distribution of Flight Variables
", gp = gpar(fontsize = 20, fontface = "bold")),
bottom = textGrob("
Figure 3
Graphs marked with one asterisk (*) indicate skewed data
Black solid line: mean of variable data | Black dashed line: median of variable data"))
As demonstrated in the above Figure 3
, the distance
between airports and total security times hold close to the normal
distribution curves. Meanwhile the total delay times for incoming planes
from previous flights, fueling times, and weather severity illustrate
non-normal distributions. At a cursory review these variables appear to
hold to multimodal and Bernoulli distributions. Meanwhile the number of
flights and total support crews available demonstrated skewed normal
distributions
In order to confirm the visually identified data skews the means and medians of the following variables were compared:
Upon comparison of the mean and median of the numeric variables it was identified that the late incoming planes variable was not skewed as the median and mean were identical. Frequency counts for the categorical variables also identified that the mode of these variables corroborated a normal distribution. This left the number of flights and support crews available as the two truly skewed variables requiring transformation.
kable(summary(round(mice.delay[c(2,4,6)]), digits = 1), caption = "Table 11:
<br>Summary statistics for variables with skewed distributions. Rounded to whole numbers as the fractions of flights and fractions of support crew members cannot exist naturally.", col.names = c(var.details[3,1], var.details[5,1], var.details[7,1])) %>%
kable_styling()
Number of Flights | Support Crews Available | Late Incoming Plane Arrival | |
---|---|---|---|
Min. :29475 | Min. : 0 | Min. :15 | |
1st Qu.:41634 | 1st Qu.: 56 | 1st Qu.:18 | |
Median :43424 | Median : 83 | Median :19 | |
Mean :43311 | Mean : 85 | Mean :19 | |
3rd Qu.:45140 | 3rd Qu.:112 | 3rd Qu.:19 | |
Max. :53461 | Max. :222 | Max. :22 |
kable(summary(mice.delay[c(5, 7)]), caption = "Table 12:
<br>Frequency count to capture the mode of categorical variables with skewed distributions.", col.names = c(var.details[6,1], var.details[8,1])) %>%
kable_styling()
Baggage Loading Time | Cleaning Time | |
---|---|---|
Fast : 749 | Immediate: 518 | |
Moderate:2833 | Quick :2265 | |
Slow : 11 | Moderate : 780 | |
NA | Slow : 30 |
In order to normalize the variables for future predictive model use, min-max normalization was employed to reduce all variables to a range of [0,1]. The categorical variables were already encoded as factor variables, allowing for easy label encoding to allow for cooperation with the min-max standardization. The following equation was applied to all numeric variables (both originally numeric and forced numeric from factor categorical variables).
\[x_{standardized} = \frac{x - min(x)}{max(x) - min(x)}\]
mice.delay$cleaning <- as.numeric(mice.delay$Cleaning_o)
mice.delay$baggage <- as.numeric(mice.delay$Baggage_loading_time)
delay.std <- data.frame(
airport_dist = (mice.delay$Airport_Distance - min(mice.delay$Airport_Distance)) / (max(mice.delay$Airport_Distance) - min(mice.delay$Airport_Distance)),
num_flights = (mice.delay$Number_of_flights - min(mice.delay$Number_of_flights)) / (max(mice.delay$Number_of_flights) - min(mice.delay$Number_of_flights)),
weather = (mice.delay$Weather - min(mice.delay$Weather)) / (max(mice.delay$Weather) - min(mice.delay$Weather)),
sup_crew = (mice.delay$Support_Crew_Available - min(mice.delay$Support_Crew_Available)) / (max(mice.delay$Support_Crew_Available) - min(mice.delay$Support_Crew_Available)),
late_incoming = (mice.delay$Late_Arrival_o - min(mice.delay$Late_Arrival_o)) / (max(mice.delay$Late_Arrival_o) - min(mice.delay$Late_Arrival_o)),
fueling = (mice.delay$Fueling_o - min(mice.delay$Fueling_o)) / (max(mice.delay$Fueling_o) - min(mice.delay$Fueling_o)),
security = (mice.delay$Security_o - min(mice.delay$Security_o)) / (max(mice.delay$Security_o) - min(mice.delay$Security_o)),
arr_delay = (mice.delay$Arr_Delay - min(mice.delay$Arr_Delay)) / (max(mice.delay$Arr_Delay) - min(mice.delay$Arr_Delay)),
cleaning = (mice.delay$cleaning - min(mice.delay$cleaning)) / (max(mice.delay$cleaning) - min(mice.delay$cleaning)),
baggage = (mice.delay$baggage - min(mice.delay$baggage)) / (max(mice.delay$baggage) - min(mice.delay$baggage))
)
kable(summary(delay.std), col.names = c(var.details[2,1], var.details[3,1], var.details[4,1], var.details[5,1], var.details[6,1], var.details[7,1], var.details[8,1], var.details[9,1], var.details[10,1], var.details[11,1]), caption = "Table 13:
<br>Summary statistics of normalized flight data") %>%
kable_styling() %>%
scroll_box(width = "100%")
Airport Distance | Number of Flights | Weather | Support Crews Available | Baggage Loading Time | Late Incoming Plane Arrival | Cleaning Time | Fueling Time | Security Time | Overall Flight Delay | |
---|---|---|---|---|---|---|---|---|---|---|
Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | |
1st Qu.:0.4472 | 1st Qu.:0.5069 | 1st Qu.:0.0000 | 1st Qu.:0.2523 | 1st Qu.:0.4286 | 1st Qu.:0.4348 | 1st Qu.:0.3800 | 1st Qu.:0.2722 | 1st Qu.:0.3333 | 1st Qu.:0.5000 | |
Median :0.5447 | Median :0.5815 | Median :0.0000 | Median :0.3739 | Median :0.5714 | Median :0.5217 | Median :0.4800 | Median :0.3889 | Median :0.3333 | Median :0.5000 | |
Mean :0.5395 | Mean :0.5768 | Mean :0.3537 | Mean :0.3830 | Mean :0.5341 | Mean :0.5222 | Mean :0.4817 | Mean :0.3878 | Mean :0.3632 | Mean :0.3973 | |
3rd Qu.:0.6341 | 3rd Qu.:0.6531 | 3rd Qu.:1.0000 | 3rd Qu.:0.5045 | 3rd Qu.:0.5714 | 3rd Qu.:0.6087 | 3rd Qu.:0.5800 | 3rd Qu.:0.5000 | 3rd Qu.:0.3333 | 3rd Qu.:0.5000 | |
Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 |
Min-max normalization was chosen to allow for future use with feature magnitude sensitive algorithms, such as k Nearest Neighbors cluster analysis.
In order to simplify the data for analysis and predictive modeling,
Recursive Feature Elimination (RFE) was employed utilizing the
caret package. This allows for removal of redundancy in the data set
through a random forest model using 5 iterations of cross validation to
determine which variables are most essential for creating prediction
models for flight delays. Reducing redundancy not only reduces the
computational burden of predictive models while accounting for any
potential feature interactions between variables. In order to prevent
over fitting using RFE, an 80:20 training:test set was created using
delay.caret
to identify the most important variables for
prediction of overall flight delays (arr_delay).
A visualization of the importance of each variable in potential
prediction of overall flight delays is included below in
Figure 4
. As demonstrated below there is a steep drop off
in importance for total cleaning time (cleaning), fueling times
(fueling), & security times (security); therefore these variables
will be omitted from future analysis to reduce computational burden and
prevent redundancies.
x.caret <- delay.std[,-8]
y.caret <- delay.std[,8]
set.seed(12345)
inTrain <- createDataPartition(y.caret, p = 0.80, list = FALSE)[,1]
x.caret.train <- x.caret[inTrain, ]
x.caret.test <- x.caret[-inTrain, ]
y.caret.train <- y.caret[inTrain]
y.caret.test <- y.caret[-inTrain]
control <- rfeControl(functions = rfFuncs, method = "cv", repeats = 5, number = 10)
delay.caret <- rfe(x = x.caret.train, y.caret.train , sizes = c(1:5), rfeControl = control)
delay.caret.var <- data.frame(feature = row.names(varImp(delay.caret))[1:9],
importance = varImp(delay.caret)[1:9, 1])
accuracy.caret <- postResample(predict(delay.caret, x.caret.test), y.caret.test)
ggplot(delay.caret.var, aes(x = reorder(feature, -importance), y = importance, fill = feature)) +
geom_bar(stat="identity") +
labs(x = "Features", y = "Variable Importance", title = "Variable Importance for All Features", caption = "Figure 4:
Importance of each variable in global flight data for predicting flight delays") +
geom_text(aes(label = round(importance, 2)), vjust = 0, color = "black", size = 4) +
theme_bw() +
theme(legend.position = "none",
plot.caption = element_text(hjust = 0.5))
In order to properly utilize the 6 most important variables for
prediction of overall flight delays as outlined above in
Figure 4
a two step feature creation process was utilized.
Initially a principal component analysis was conducted to combine these
6 variables into principal components that capture a combination of
these 6 variables. Utilizing the PCA approach allows for a more
streamlined approach for creating prediction models for overall flight
delay times. Once transformed into principal components, a k-means
cluster analysis was then conducted to add for an additional feature to
utilize when creating flight delay prediction models.
Utilizing the most important variables as identified in
Figure 4
(number of flights per airport, late incoming
plane, baggage loading time, distance between airports, support crews
available, and weather) a principal component analysis was conducted to
further reduce redundancies. An initial scree plot
(Figure 5
) was utilized to determine the “elbow” cutoff of
PC1 & PC2 where the variance steeply cuts off. Upon calculation of
the total percentage of variance, it was found that in order to create a
model that accounts for >80% of the variance in the total flight
delay data, the first 4 PCs must be utilized.
delay.std.km <- delay.std[,-c(6, 7,9)]
pca.delay <- prcomp(delay.std.km[,-6], scale = TRUE)
pca.delay.1 <- cbind(delay.std.km, pca.delay$x)
pca.var <- apply(pca.delay$x, 2, var)
eigen.pca <- summary(pca.delay)
plot(pca.delay, type ="l", ylim=c(0,3), xlim=c(1,6.5), main = NULL)
text((1:10)+0.15, pca.var+0.2, as.character(round(pca.var, 3)), cex = 0.8)
title(main = "Variance of Individual PCs", xlab ="Principal Components", sub = "Figure 5: Scree Plot to determine number of PCs to use")
points(x = 2, y = (pca.delay$sdev[2])^2, pch = 19, col = "purple", cex = 1.5)
text(x = 2, y = 0.7, "58.656%
of Var.", col = "purple", cex = 0.8)
points(x = 4, y = (pca.delay$sdev[4])^2, pch = 19, col = "darkred", cex = 1.5)
text(x = 4, y = 0.5, "84.613%
of Var.", col = "darkred", cex = 0.8)
kable(pca.delay$rotation[,c(1:5)], caption = "Table 14:
<br>Principal Component Coefficients (rotation matrix)") %>%
kable_styling()
PC1 | PC2 | PC3 | PC4 | PC5 | |
---|---|---|---|---|---|
airport_dist | 0.3706110 | 0.0899273 | -0.4517494 | 0.7958560 | -0.0373642 |
num_flights | 0.5153757 | 0.0539246 | -0.1114829 | -0.1796823 | -0.0161651 |
weather | 0.2584309 | -0.8926859 | 0.3298843 | 0.1642819 | 0.0014252 |
sup_crew | -0.2986655 | -0.4333072 | -0.8063486 | -0.2590790 | 0.0564256 |
late_incoming | 0.4674321 | 0.0237125 | -0.1323775 | -0.3910015 | -0.6664500 |
baggage | 0.4717573 | 0.0616424 | -0.0833575 | -0.2955247 | 0.7422945 |
kable(eigen.pca$importance, caption = "Table 15:
<br>Variance captured by each principal component, as calculated by using eigenvalues") %>%
kable_styling()
PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | |
---|---|---|---|---|---|---|
Standard deviation | 1.620099 | 0.9463168 | 0.9171852 | 0.8463776 | 0.7307255 | 0.6230744 |
Proportion of Variance | 0.437450 | 0.1492500 | 0.1402000 | 0.1193900 | 0.0889900 | 0.0647000 |
Cumulative Proportion | 0.437450 | 0.5867100 | 0.7269100 | 0.8463000 | 0.9353000 | 1.0000000 |
Utilizing the 5 created principal components, k-Means clustering was
completed to create clusters to aid in total flight delay prediction. In
order to determine the best fit number of clusters, a wss
and silhouette
plots were created as an initial analysis of
the PCA data. As demonstrated below in Figure 7
was
determined that the best fit number of clusters for this transformed PCA
data was 2 clusters. Utilizing that pre-determined number of clusters,
the PCA data was then clustered into 2 groups, using a total of 25
run-throughs of the clustering process to best determine the center
point of the clusters and reduce overall cluster overlap.
kmeans.wss <- fviz_nbclust(pca.delay.1[-c(1:7, 12:13)], FUN = hcut, method = "wss")
kmeans.sil <- fviz_nbclust(pca.delay.1[-c(1:7, 12:13)], FUN = hcut, method = "silhouette")
grid.arrange(
arrangeGrob(kmeans.wss, kmeans.sil, ncol = 2,nrow = 1),
top = textGrob("Optimal Cluster Determination", gp = gpar(fontsize = 20, fontface = "bold")),
bottom = textGrob("
Figure 6:
Determination of optimal number of clusters for PCA overall flight data"))
set.seed(12345)
pca.delay.1$cluster <- kmeans(pca.delay.1[-c(1:7, 12:13)], centers = 2, nstart = 25)$cluster
pca.delay.1$centers <- kmeans(pca.delay.1[-c(1:7, 12:13)], centers = 2, nstart = 25)$center[1]
eigen <- round(get_eigenvalue(pca.delay), 1)
var.eigen <- eigen$variance.percent
cluster_centers <- aggregate(pca.delay.1, by=list(cluster = pca.delay.1$cluster), FUN = mean)
pca.delay.1$cluster <- as.factor(pca.delay.1$cluster)
ggscatter(pca.delay.1, x = "PC1", y = "PC2",
color = "cluster",
palette = "npg",
ellipse = TRUE,
ellipse.type = "convex",
legend = "right",
xlab = paste("PC1 (",var.eigen[1],"%)"),
ylab = paste("PC2 (", var.eigen[2],"%)")) +
geom_point(x = cluster_centers$PC1[1], y = cluster_centers$PC2[1], shape = 18, size = 5, color = "black") +
geom_point(x = cluster_centers$PC1[2], y = cluster_centers$PC2[2], shape = 18, size = 5, color = "black") +
labs(title = "Cluster Analysis", caption = "
Figure 7:
Clustering of PC1 & PC2 using k-Means Cluster analysis. Black diamonds represent cluster means.") +
theme_bw() +
theme(plot.caption = element_text(hjust = 0.5))
In order to determine the best linear regression prediction model for use in determining staff resourcing decisions to reduce global delay times. The following models were developed and analyzed to determine the best fit model to conduct these tests and determinations:
Final.Delay
data set)
Figure 4
final.delay <- pca.delay.1[-c(12:13, 15)]
final.delay$cluster <- as.numeric(final.delay$cluster)
lm.delay.full <- lm(arr_delay ~ ., data = final.delay)
lm.delay.step <- step(lm.delay.full, direction = "both", trace = 0)
With all four potential regression models created, 5-fold cross validation was conducted to calculate the Mean Square Error (MSE) of each validation run, followed by the mean MSE of all validation runs per prediction model. The data was split 80:20 into train:test data subsets respectively, utilizing the training subsets for the cross validation process. Through comparing the calculated training data mean MSE values, the best fit prediction model was selected for further testing utilizing the test data. As MSE represents the mean squared distance between the model’s predicted value and the actual associated data value, MSE can be utilized to compare the prediction accuracy of linear regression models and identify the best model for selection (lower MSE = more accurate model).
As identified below in Figure 8
and
Table 16
, model 2 returned the lowest mean MSE and lowest
individual MSE values for each cross-validation fold; therefore model 2
(PCA model) was selected as the primary prediction model for live
overall arrival delay times. As discussed earlier in section 3.3.1
Principal Component Analysis, each principal component contains a finely
tuned transformation of all important variables in the global flight
delay data. This holistic approach supports the returned results of
model 2 returning the lowest MSE as the PCA component of this model was
preliminary tuned for this purpose. Additionally the k-Means cluster ID
was created utilizing only the created PCA components, further ensuring
all aspects of the normalized data is represented in the regression
model with minimal noise.
set.seed(123)
#shuffling the data
n <- dim(final.delay)[1] #set sample size
obs.ID <- 1:n #create observation ID to allow for random splitting
n.train <- round(0.8*n) #create sample size for training data. round to keep whole number since no partial observations
shuffle.id <- sample(obs.ID, n, replace = FALSE) #randomize created observation ID, does not allow for dupes in list.
shuffle.delay <- final.delay[shuffle.id, ] #orders rows of dataframe to be in order of shuffle.id created random obs.id list
#splitting the data
train.data <- shuffle.delay[1:n.train, ] #splitting shuffled data from 1st obs to end of training data (80%)
test.data <- shuffle.delay[(n.train + 1):n, ] #splitting shuffled data from 1st obs after end of training data to end of remaining data (test)
n.fold <- round(n.train/5)-1 #number of observations in each of the 5 folds of validation
#validation loop - 5 fold
mse.lin.1 <- rep(0,5)
mse.lin.2 <- rep(0,5)
mse.lin.3 <- rep(0,5)
mse.lin.4 <- rep(0,5)
for (i in 1:5){
valid.id = ((i-1) * n.fold + 1):(1 * n.fold) #breaks down folds into 5 groups
lin.train = train.data[-valid.id, ]
lin.valid = train.data[valid.id, ]
#building models to run off of folded data
m1 = lm(arr_delay ~ . , data = lin.train)
m2 = lm(arr_delay ~ PC1 + cluster, data = lin.train)
m3 = lm(arr_delay ~ airport_dist + num_flights + weather + sup_crew + late_incoming + baggage + cluster, data = lin.train)
m4 = lm(arr_delay ~ late_incoming + num_flights + baggage + cluster, data = lin.train)
#predicting flight delay using models
predm1 = predict(m1, newdata = lin.train)
predm2 = predict(m2, newdata = lin.train)
predm3 = predict(m3, newdata = lin.train)
predm4 = predict(m4, newdata = lin.train)
#calculate mean square error between predicted values and values from validation data
mse.lin.1[i] = mean((predm1 - lin.valid$arr_delay)^2)
mse.lin.2[i] = mean((predm2 - lin.valid$arr_delay)^2)
mse.lin.3[i] = mean((predm3 - lin.valid$arr_delay)^2)
mse.lin.4[i] = mean((predm4 - lin.valid$arr_delay)^2)
}
l.mse.1 = mean(mse.lin.1)
l.mse.2 = mean(mse.lin.2)
l.mse.3 = mean(mse.lin.3)
l.mse.4 = mean(mse.lin.4)
mse.mod1 <- data.frame (
x = 1:5,
y = mse.lin.1
)
mse.mod2 <- data.frame (
x = 1:5,
y = mse.lin.2
)
mse.mod3 <- data.frame (
x = 1:5,
y = mse.lin.3
)
mse.mod4 <- data.frame (
x = 1:5,
y = mse.lin.4
)
mse.mod1.plot <- ggplot (mse.mod1, aes(x = x, y = y)) +
geom_point(aes(x = mse.mod1$x, mse.mod1$y, color = "Model 1"))+
geom_line(aes(x = mse.mod1$x, mse.mod1$y, color = "Model 1")) +
geom_point (aes (x = mse.mod2$x, y = mse.mod2$y, color = "Model 2")) +
geom_line (aes (x = mse.mod2$x, y = mse.mod2$y, color = "Model 2")) +
geom_point(aes(x = mse.mod3$x, y = mse.mod3$y, color = "Model 3"))+
geom_line (aes(x = mse.mod3$x, y = mse.mod3$y, color = "Model 3"), linetype = "longdash") +
geom_point(aes(x = mse.mod4$x, y = mse.mod4$y, color = "Model 4"))+
geom_line (aes(x = mse.mod4$x, y = mse.mod4$y, color = "Model 4")) +
labs (x = "Validation Fold", y = "Calculated MSE") +
scale_color_manual (values = c("Model 1" = "black", "Model 2" = "blue", "Model 3" = "deeppink", "Model 4" = "orange4" ))+
theme_bw()
l.mse <- data.frame(
model1 = l.mse.1,
model2 = l.mse.2,
model3 = l.mse.3,
model4 = l.mse.4
)
grid.arrange(
arrangeGrob(mse.mod1.plot, ncol = 1,nrow = 1),
top = textGrob("Prediction Model Cross Validation", gp = gpar(fontsize = 20, fontface = "bold")),
bottom = textGrob("
Figure 8:
Comparrison of model Mean Square Error values across 5-fold cross validation"))
kable(l.mse, col.names = c("Model 1 (Full)", "Model 2 (PCA)", "Model 3 (Step-wise)", "Model 4 (RFE)"), caption = "Table 16:
<br>Mean MSE values for each model calculated across 5 cross-validation runs") %>%
kable_styling()
Model 1 (Full) | Model 2 (PCA) | Model 3 (Step-wise) | Model 4 (RFE) |
---|---|---|---|
0.0451025 | 0.044092 | 0.0451025 | 0.0444112 |
Once selected as the best fit linear regression prediction model, the
PCA model (Model 2
) was re-evaluated utilizing the test
data rather than the training data to ensure accuracy and ensure the
model was not over-fitted. As a part of the development of each PCA, the
original variables were linearly combined to create these principal
components, effectively setting each PC to be a unique combination of 6
normalized variables (Airport Distance, Number of Flights, Weather,
Support Crews Available, Late Incoming Plane Arrival, & Baggage
Loading Time).
As identified in Table 17
below, each dependent variable
returned a statistically significant p value of well below 0.05 when
Model 2
was run utilizing the test data as well as the.
This has ensured that all variables present in the regression model are
significant to the prediction model and are not contributing noise or
redundancy to the model.
lm.pca.train <- lm(arr_delay ~ PC1 + cluster, data = train.data)
lm.pca.test <- test.data
lm.pca.test$arr_delay_pred <- predict(lm.pca.train, newdata = lm.pca.test)
lm.pca <- final.delay
lm.pca$arr_delay_pred <- predict(lm.pca.train, newdata = lm.pca)
rsquares.lin <- data.frame(
Data = c("Training Data", "Test Data", "Full Delay Data"),
R.squared = c(summary(lm.pca.train)$r.squared ,(cor(lm.pca.test$arr_delay, lm.pca.test$arr_delay_pred))**2,(cor(lm.pca$arr_delay, lm.pca$arr_delay_pred))**2)
)
shrinkage.lin <- rsquares.lin
shrinkage.lin$Shrinkage <- c("", rsquares.lin[1,2] - rsquares.lin[2,2], rsquares.lin[1,2] - rsquares.lin[3,2])
shrinkage.lin$rely <- NULL
for (i in 1:length(shrinkage.lin)){
if(shrinkage.lin[i,3] == ""){
n <- ""}
else if(shrinkage.lin[i,3] < 0.1){
n <- "Reliable Model"}
else if(shrinkage.lin[i,3] > 0.9){
n <- "Very Unreliable"}
else {n <- "Potentially Reliable"}
shrinkage.lin$rely[i] <- n
}
kable(summary(lm.pca.train)$coefficients, col.names = c("Estimate", "Std. Error", "t value", "Pr(> I t I)"), caption = "Table 17:
<br>Regression coefficients for linear regression prediction model 2 (PCA) run on <b>test data</b>") %>%
kable_styling()
Estimate | Std. Error | t value | Pr(> I t I) | |
---|---|---|---|---|
(Intercept) | 0.3660901 | 0.0074979 | 48.825947 | 0.0000000 |
PC1 | 0.0903510 | 0.0015904 | 56.810698 | 0.0000000 |
cluster | 0.0166348 | 0.0057361 | 2.900037 | 0.0037595 |
Utilizing the above outlined final test data and full data regression
coefficients from using the Model 2
prediction model, the
Model 2
equations can be defined as:
\[ \begin{align*} Arrival \ Delay &\sim PC1 \ + \ Cluster \ + \ Intercept\\ \\ Arrival \ Delay \ _{train} &\sim 0.0902174*PC1 \ + \ 0.0163031*Cluster \ + \ 0.3665648 \end{align*} \]
As captured below in Table 18
the calculated \(R^2\) values for Model 2
as
fit to the training data, test data, and full model delay data indicate
Model 2
is a good fit for Overall Flight Delay prediction.
The shrinkage was calculated for both the test data and the full delay
data to measure the reliability of the prediction model. Utilizing the
standard accepted cutoff of \(Shrinkage <
0.1\), the linear regression model, Model 2
(above
equations) was found to be reliable as a prediction model for overall
flight delays.
kable(shrinkage.lin, col.names = c("Data Utilized", "R$^2$ Value", "Calculated Shrinkage", "Reliability"), caption = "Table 18:
<br>Assessment of model fit and reliability") %>%
kable_styling()
Data Utilized | R\(^2\) Value | Calculated Shrinkage | Reliability |
---|---|---|---|
Training Data | 0.7526599 | ||
Test Data | 0.7717050 | -0.0190451144868257 | Reliable Model |
Full Delay Data | 0.7564583 | -0.00379841841295325 | Reliable Model |
In reviewing the residual plots, QQ plots, and Cook’s distance as obtained from model 2 run on the test data the following information was obtained:
These findings indicate that the linear regression prediction model
(Figure 9
) is a good fit for the Overall Flight Delay data.
As this model is a good fit and meets all assumptions of linear
regression, this model has proven useful for live prediction of overall
flight delay & determination of focus in airline budgets and staff
to preemptively mitigate overall flight delays.
lm.pca.test.plots <- lm(arr_delay ~ arr_delay_pred, data = lm.pca.test)
par(mfrow = c(2,2),oma = c(4, 0, 0, 0))
plot(lm.pca.test.plots)
mtext("Figure 9:
Residual plots for linear regression prediction model 2 run on test data", side= 1, line = 1, outer=TRUE)
Live determining predicted delay times is imperative for airport
planning and resourcing to reduce a downstream cascade of delayed
flights throughout the entire flight network. In order to efficiently
pivot to ensure a cascade does not occur, it is imperative airlines and
airports can accurately and quickly feed data into prediction models to
obtain overall delay estimates. As the linear regression model created
is computationally heavy, a second logistic regression model was created
to predict an airline’s k-Means cluster ID to reduce overall time and
computational burden. This will allow for reduced computational input to
calculate the overall flight delay utilizing the linear regression
equation (Model 2
) as outlined above in section 4. Linear
Regression Model for Arrival Delay Prediction. The correlation between
Overall Flight Delay and k-Means Cluster ID was checked and a strong
negative linear relationship was identified, highlighting the importance
in rapidly determine this k-Means Cluster ID with minimal computational
burden to allow for live Overall Flight Delay prediction.
corr.delay.clust <- data.frame(
Correlation = cor(final.delay$arr_delay, final.delay$cluster)
)
kable(corr.delay.clust, caption = "Table 19:
<br>Correlation between Overall Flight Delay and k-Means Cluster ID") %>%
kable_styling()
Correlation |
---|
-0.6852907 |
In order to determine the best logistic regression prediction model for use in k-Means cluster ID prediction to improve live Overall Flight Delay prediction times. The following models were developed and analyzed to determine the best fit model to conduct these tests and determinations:
Final.Delay
data set)
train.data$cluster <- as.numeric(train.data$cluster)
test.data$cluster <- as.numeric(test.data$cluster)
train.data$cluster2 <- NULL
for(i in 1:length(train.data)){
if(train.data$cluster[i] == 1){
n <- 0}
else {n <- 1}
train.data$cluster2[i] <- n
}
test.data$cluster2 <- NULL
for(i in 1:length(train.data)){
if(test.data$cluster[i] == 1){
n <- 0}
else {n <- 1}
test.data$cluster2[i] <- n
}
final.delay$cluster2 <- NULL
for(i in 1:length(final.delay)){
if(final.delay$cluster[i] == 1){
n <- 0}
else {n <- 1}
final.delay$cluster2[i] <- n
}
fm.train <- train.data[-12]
qm.train <- train.data[-c(8:12)]
full.model.logit <- glm(cluster2 ~., data = fm.train, family = "binomial")
quick.model.logit <- glm(cluster2~., data = qm.train, family = "binomial")
logit.step <- stepAIC(full.model.logit, direction = "both", trace = FALSE)
logit.qstep <- stepAIC(quick.model.logit, direction = "both", trace = FALSE)
step.logit <- glm(cluster2 ~ late_incoming, data = train.data, family = "binomial")
ext.logit <- glm(cluster2 ~ weather*late_incoming*num_flights, data = train.data, family = "binomial")
staff.logit <- glm(cluster2 ~ sup_crew*baggage, data = train.data, family = "binomial")
With all four potential regression models created, 5-fold cross
validation was conducted and the Akaike Information Criterion (AIC)
values for each prediction model were obtained. The data was split 80:20
into train:test data subsets respectively, utilizing the training
subsets for the cross validation process. Through comparing the
calculated training data AIC values as represented below in
Table 20
, the best fit prediction model was identified as
Model 3
and subsequently selected for further testing
utilizing the test data.
aic.comp <- data.frame(
Model = c("Full Model (Model 1)", "Stepwise Model (Model 2)", "External Cond. Model (Model 3)", "Staff Model (Model 4)"),
AIC = c(full.model.logit$aic, step.logit$aic, ext.logit$aic, staff.logit$aic)
)
kable(aic.comp, caption = "Table 20:
<br>Comparrison of calculated AIC values for proposed logistic regression prediction models for cluster ID") %>%
kable_styling()
Model | AIC |
---|---|
Full Model (Model 1) | 108.13419 |
Stepwise Model (Model 2) | 98.77241 |
External Cond. Model (Model 3) | 101.69402 |
Staff Model (Model 4) | 102.39538 |
Finally the ROC (Receiver Operating Characteristic) curves for each
prediction model were compared for a second visual representation of the
comparison of true positive and false positive rates as well as the
calculated area under the curve (AUC) measures below in
Figure 10
. As seen with the calculated AUC for
Model 3
the External Cond. Model demonstrates excellent
prediction performance by the model. The best fit sensitivity and
specificity values for use with the model and indicated in
Figure 10
with a yellow triangle and in
Table 21
. These cutoffs are then utilized for weighting of
the false positive and true positive rates when utilizing the
probabilities as output in the binary logistic regression model to
predict the cluster ID without full PCA and k-Means cluster ID
creation.
#model predictions
ROC.step <- roc(step.logit$y, step.logit$fitted.values)
ROC.ext <- roc(ext.logit$y, ext.logit$fitted.values)
ROC.full <- roc(full.model.logit$y, full.model.logit$fitted.values)
ROC.staff <- roc(staff.logit$y, staff.logit$fitted.values)
roc.colors <- c("blue", "purple", "black", "darkgreen")
cp <- coords(ROC.ext, "best", ret = c("threshold", "sensitivity", "specificity"))
par(oma = c(6,2,2,2), cex.main = 2)
plot.roc(ROC.staff, col = "blue", ylab = "True Postive Rate (Sensitivity))", xlab = "False Positive Rate (1 - Specificity)", lwd = 2)
plot.roc(ROC.step, col = "purple", add = TRUE, lwd = 2)
plot.roc(ROC.full, add = TRUE, lwd = 2)
plot.roc(ROC.ext, col = "darkgreen", add = TRUE, lwd = 2)
legend("bottomright", legend =
c(paste(
"Staff Model AUC =", round(ROC.staff$auc,4)),
paste(
"Stepwise Model AUC =", round(ROC.step$auc,4)),
paste(
"Full Model AUC =", round(ROC.full$auc, 4)),
paste(
"External Cond. Model AUC =", round(ROC.ext$auc, 4))), col = roc.colors, lwd = 1)
title(main="ROC Comparrison", outer = T)
title(sub = paste("Figure 10:
Comparrison of ROC and AUC values for each logistic regression prediction model
Yellow triangle represents ideal cutoff point of best fit model.
Threshold =", round(cp$threshold,4), " Sensitivity = ", round(cp$sensitivity,4), " Specificity =", round(cp$specificity, 4)), outer = T, line=4)
points(x = cp$specificity, y = cp$sensitivity, pch = 24, bg = "yellow", cex = 2)
kable(cp, caption = "Table 21:
<br>Threshold and cutoff true positive and false positive rates", col.names = c("Threshold", "True Positive (Sensitivity)", "False Positive (1-Specificity)")) %>%
kable_styling()
Threshold | True Positive (Sensitivity) | False Positive (1-Specificity) |
---|---|---|
0.9964902 | 0.8350192 | 0.8571429 |
In this analysis of global flight delay data, two prediction models were created for separate utilization purposes, both tuned to assist airlines and airports in reduction of overall flight delay data. The PCA based linear regression model proves to create a detailed tool for airlines to proactively plan staffing considerations, overall flight scheduling, and departmental budgets by predicting the impact of overall changes in the terms of average Overall Flight Delay times. As this model is of a high computational burden, a second model was created to predict the k-Means cluster ID utilized in the linear regression model for a rapid approach to live flight delay predictions. By first using the low computational burden logistic regression prediction model the predicted k-Means cluster ID can be fed into the linear regression prediction model while simultaneously dropping the PCA process to 1 variable and omitting the k-Means cluster ID creation to live predict flight delays. This will allow airports to predict downstream impacts of delays based on current airport conditions (weather, staff availability, average incoming flight delays, and number of flights at the airport that day). Through stacking these two robust prediction models airlines and airports should be more than equipped to predict delays live/before they happen and preemptively plan staffing and budgeting considerations for high travel days (ex: Christmas Eve) to reduce bottlenecks and downstream delays.