As discussed in the previous analysis of flight delays, [I’m Going to Miss My Flight!], air travel is a pillar of our globalized world, upholding trade, movement of persons, and facilitating economic growth. As outlined by the Air Transport Action Group, in 2023 alone, a total of 1,138 airlines were operating a total of 29,039 planes across 67,300 global routes, connecting 21,000 unique pairs of cities. The vast size of this transit network presents unique challenges when working to analyze flight data with the goal of reducing flight delays.
It is near impossible to ensure standardized data collection formats globally, resulting in incomplete data sets with missing values. In order to properly analyze flight data to successfully reduce flight delay times, considerations must be taken to predict said missing values, essentially filling in the gaps. In the world of data science, filling in these data gaps is known as imputation. The following imputations were analyzed to determine the best fit approach:
Once the best imputation method was selected, the data was then analysed for any potential skew, and subsequently normalized to reduce skew & prepare the data for predictive modeling using min-max normalization. The normalized data was then analyzed for redundancies and only the high impact variables retained through RFE feature selection. Once fully pruned the data was then subjected to k-Nearest Neighbor cluster based feature extraction. This final extracted clustered data is then ready for use in various predictive models to identify and predict flight delays before they happen based on previously collected global flight data.
The Flight delay data as collected from Applied Analytics through
Case Studies Using SAS and R, Deepti Gupta by APress, ISBN -
978-1-4842-3525-6 as utilized in the previous analysis of flight
delays, [I’m
Going to Miss My Flight!], was utilized for this analysis. In order
to properly assess imputation models, the data set
[delay.clean]
was extracted from the previous analysis
after outliers were removed, the data was binned, and missing values
were forced. To review the processes utilized to create this data set,
more information can be found at the above link.
The flight variables as captured in delay.clean
include:
Variable Name | Variable Type | Details |
---|---|---|
Carrier | Categorical | Airline (Carrier) |
Airport Distance | Numeric | Distance (miles) between departure and arrival airport |
Number of Flights | Numeric | Total number of flights at arrival airport |
Weather | Numeric | A ranking of delays due to weather condition (0: Mild to 10: Extreme) |
Support Crew | Numeric | Total number of support crew available |
Baggage Loading Time | Numeric [Binned] | Total time for baggage loading (minutes) |
Late Plane Arrival | Numeric | Delay time for plane arrival prior to flight (minutes) |
Cleaning | Numeric [Binned] | Time required to clean plane after arrival prior to passenger loading (minutes) |
Fueling | Numeric | Time required to fuel aircraft (minutes) |
Secutiry | Numeric | Time required for security checks (minutes) |
Arrival Delay | Numeric | Total flight delay time (minutes) |
Table 1 Note: numeric variables (Weather & Cleaning) were binned to allow for categorical styled handling
delay.clean <- read.csv("https://nlepera.github.io/sta552/HW02/data/delay_clean.csv")
delay.clean$Cleaning_o <- factor(delay.clean$Cleaning_o, levels = (c("Immediate", "Quick", "Moderate", "Slow", "NA")))
delay.clean$Baggage_loading_time <- factor(delay.clean$Baggage_loading_time, levels = c("Fast", "Moderate", "Slow"))
Prior to completing any analysis, the missing values in the airline delay data must be imputed to ensure no gaps in the data. In the past analysis missing values were forced in this synthetic data set to imitate the realities of data collection and standardization across this globally diverse array of airlines and airports. A summary of the missing values across each variable are included in the below Figure 1, Table 2, and Table 3. As demonstrated by the p value in Little’s MCAR Test (p = 0.6126116) the result is not statistically significant indicating that the missing values are at random and not cross correlated between variables. This highlights that the missing values between variables are not related to one another (ex: a missing value for Weather does not imply that there will also be a missing value in Cleaning Time).
plot_pattern(delay.clean, rotate = TRUE, caption = TRUE) +
theme(axis.title.x.top = element_blank())
Figure 1
Missing value pattern
visualization
kable(misty_test$result$little, col.names = c("# of Obs.", "# Missing Values", "# of Patterns in Missing Vals", "Statistic", "DF", "p Value"), caption = "Table 2:
<br>
Little's MCAR Test") %>%
kable_styling()
# of Obs. | # Missing Values | # of Patterns in Missing Vals | Statistic | DF | p Value |
---|---|---|---|---|---|
3593 | 20 | 5 | 36.85526 | 40 | 0.6126116 |
kable(sapply(delay.clean[c(4,5,7,8)], function (x) sum(is.na(x))), col.names = c("Variable", "Count of Missing Values"), caption = 'Table 3
<br>
Summary count of missing values in each variable of the airline delay data. Variables with no missing values omitted.') %>%
kable_styling()
Variable | Count of Missing Values |
---|---|
Weather | 5 |
Support_Crew_Available | 4 |
Late_Arrival_o | 6 |
Cleaning_o | 5 |
The kNN
function as found in the VIM
package was utilized to impute the missing data. In order to utilize
this approach the best fit k value was determined using a WSS and
Silhouette graph (Figure 1). This calculation for k allows for
determination of the number of nearby points that should be utilized
when creating the new prediction value to fill in the blank caused by
the missing value. As determined in figure 1 below, the best fit k value
was determined to be 2.
With this determined k value the missing data was imputed to remove all missing variables. While this data set does not include missing categorical variable observations at this point in time, the VIM kNN imputation process utilized will successfully impute any missing categorical values should they arise in any future data collection.
delay.clust <- delay.clean[, -(c(1, 6, 8))] #creates numeric only data subset to be used multiple times
clust1 <- fviz_nbclust(delay.clust, FUN = hcut, method = "wss")
clust2 <- fviz_nbclust(delay.clust, FUN = hcut, method = "silhouette")
grid.arrange(
arrangeGrob(clust1, clust2, ncol = 2,nrow = 1),
top = textGrob("Optimal k Value Determination
", gp = gpar(fontsize = 20, fontface = "bold")),
bottom = textGrob("Figure 2
Optimal k value determination using WSS and Silhouette methods"))
delay.kNN <- impute_knn(delay.clean, Late_Arrival_o + Weather + Support_Crew_Available + Cleaning_o ~ .)
delay.kNN <- dplyr::select(delay.kNN, !ends_with("_imp"))
kable(sapply(delay.kNN[c(4,5,7,8)], function (x) sum(is.na(x))), col.names = c("Variable", "Count of Missing Values"), caption = 'Table 4
<br>
Summary count of missing values in each variable of the airline delay data after kNN imputation. Variables with no missing values in the original airline delay data omitted.') %>%
kable_styling()
Variable | Count of Missing Values |
---|---|
Weather | 0 |
Support_Crew_Available | 0 |
Late_Arrival_o | 0 |
Cleaning_o | 0 |
In order to develop alternative imputation methods for potential
missing observations across the data set, three simple linear regression
models were built. A unique regression model was built to help fill in
the gaps and predict missing values the numeric variables
Late Incoming Plane Arrival
, Weather
and
Support Crews Available
. Prior to completing these
regression model builds, the data was subset to include only numeric
variables:
The correlation coefficients of all numeric variables were compared to select those with the highest correlation for the proposed regression prediction models. This allows for inclusion of only the variables that correlate with the response variable.
kable(cor(delay.clust[], delay.clust[c(5, 3, 4)], use = "complete.obs"), col.names = c("Variable Names", "Late Incoming Planes", "Weather", "Support Crews Available"), caption = 'Table 5:
<br>
Calculated correlation coefficients between late incoming planes and other variables') %>%
kable_styling()
Variable Names | Late Incoming Planes | Weather | Support Crews Available |
---|---|---|---|
Airport_Distance | 0.3189880 | 0.1492136 | -0.1706441 |
Number_of_flights | 0.5706051 | 0.2476546 | -0.3019891 |
Weather | 0.2207013 | 1.0000000 | -0.1089384 |
Support_Crew_Available | -0.2440386 | -0.1089384 | 1.0000000 |
Late_Arrival_o | 1.0000000 | 0.2207013 | -0.2440386 |
Fueling_o | -0.0209795 | 0.0112277 | 0.0379385 |
Security_o | 0.0806220 | 0.0349761 | -0.0155536 |
Arr_Delay | 0.6670593 | 0.3272111 | -0.3622332 |
As the late incoming planes variable included the largest number of missing values of the three numeric variables with missing values, late incoming planes was chosen as the response variable for the first regression model. The three variables with the largest correlation coefficients as demonstrated in the above Table 4 were selected for use in the regression model. The proposed linear regression model for late incoming plane imputation is as follows:
\[ Late \ Incoming \ Plane \ Arrival \sim Overall \ Flight \ Delay + Number \ of \ Flights + Airport \ Distance \]
delay.linear <- delay.clust #create dataset for linear imputation to not screw up clean data
late.arr.lr <- lm(Late_Arrival_o ~ Arr_Delay + Number_of_flights + Airport_Distance, data = delay.linear) #linear regression model for late incoming plane arrival
late.arr.imp <- which(is.na(delay.linear$Late_Arrival_o))
delay.linear$Late_Arrival_o[late.arr.imp] <- predict(late.arr.lr, newdata = delay.linear[late.arr.imp,])
kable(summary(late.arr.lr)$r.squared, caption = "Table 6.a:
<br>
R.squared value of Late Incoming Plane Arrival linear regression prediction model", col.names = "R.squared Value") %>%
kable_styling()
R.squared Value |
---|
0.4462921 |
An additional simple linear regression equation was crafted to impute any missing weather values. The three variables with highest correlation to the weather variable were selected by again utilizing the correlation values as listed above in Table 4. The proposed linear regression model for weather imputation is as follows:
\[ Weather \sim Overall \ Flight \ Delay + Number \ of \ flights + Late \ Incoming \ Plane \ Arrival\]
weather.lr <- lm(Weather ~ Arr_Delay + Number_of_flights + Late_Arrival_o, data = delay.linear) #linear regression model for late incoming plane arrival
weather.imp <- which(is.na(delay.linear$Weather))
delay.linear$Weather[weather.imp] <- predict(weather.lr, newdata = delay.linear[weather.imp,])
kable(summary(weather.lr)$r.squared, caption = "Table 6.b:
<br>
R.squared value of Weather linear regression imputation model", col.names = "R.squared Value") %>%
kable_styling()
R.squared Value |
---|
0.1081162 |
Finally, a third simple linear regression equation was constructed to impute the missing values for the support crews available variable. Again the variables as listed in Table 4 with the greatest correlation coefficients were selected for inclusion in the imputation model. The proposed linear regression model for support crews available imputation is as follows:
\[ Support \ Crews \ Available \sim Overall \ Flight \ Delay + Number \ of \ flights + Late \ Incoming \ Plane \ Arrival\]
sup.crew.lr <- lm(Support_Crew_Available ~ Late_Arrival_o + Number_of_flights + Arr_Delay, data = delay.linear) #linear regression model for late incoming plane arrival
sup.crew.imp <- which(is.na(delay.linear$Support_Crew_Available))
delay.linear$Support_Crew_Available[sup.crew.imp] <- predict(late.arr.lr, newdata = delay.linear[sup.crew.imp,])
kable(summary(sup.crew.lr)$r.squared, caption = "Table 6.c:
<br>
R.squared value of Support Crews Available linear regression imputation model", col.names = "R.squared Value") %>%
kable_styling()
R.squared Value |
---|
0.1310952 |
As demonstrated by the \(R^2\)
values in the above sections, all three linear regression models prove
to be a rather poor fit for imputing missing values. The linear
regression imputation model for Late Incoming Plane Arrival
imputation demonstrated the best fit of the three, with an \(R^2\) value of 0.4462921. The closer
a regression model’s \(R^2\) value is
to 1, the closer fit there is between the model (imputation model in
this case) and the actual data. The linear regression imputation models
created for Weather
and
Support Crews Available
were both less than 0.14 indicating
very poor fits. Ideally these imputation models should not be
used for practical imputation approaches when handling missing values in
global airline data analysis.
While the data set utilized for this analysis had a small number of missing values, manually knocked out of the data set for the purposes of this analysis and data manipulation, real world flight data contains multitudes of gaps. In order to accurately close these missing gaps it is wiser to run an imputation model multiple times to then calculate the predicted missing values based on the multiple imputation runs, rather than relying on a single model to accurately complete all predictions, flawlessly, each time. By utilizing this approach, known as a multiple imputation model, analysts can better close the gap in airline data while accurately reflecting the model’s uncertainty based on the current quantity and pattern of missing observations. Through a multiple imputation model the variance between each imputation run can be compared to capture a measure of the model’s uncertainty.
A bootstrap imputation model was created using the
Amelia
package to impute missing values across all
variables with missing values (Weather
,
Support Crews Available
,
Late Incoming Plane Arrival
, & Cleaning
).
For ease of utilization and reduction in overall system memory usage, 5
imputation runs were completed for each variable. While a larger number
of imputation runs would result in more accurate imputation models, the
number of runs was limited to 5 for simplicity of analysis and reduction
of overall computing power required.
par(mfrow = c(2,2))
compare.density(amelia_imp, var = 3, legend = TRUE)
compare.density(amelia_imp, var = 4, legend = TRUE)
compare.density(amelia_imp, var = 6, legend = TRUE)
compare.density(amelia_imp, var = 7, legend = TRUE)
Figure 3
Distribution of average imputed
values utilizing Amelia bootstrap imputation.
In order to merge all 5 imputation runs into a single workable data frame for use and analysis, Ruben’s rule was utilized. This allowed for sample means and variance of each imputation run as well as the variance between each imputation run to be combined into a working parameters to describe the combinations of the multiple imputation model. By working with these combined parameters it allows for a generalized comparison point to compare efficacy of the various imputation models, allowing for an analyst to determine the best model for the current need.
\[ \begin{align*} Mean \ of \ Quanitity \ Of \ Interest's \ Means-> \ \overline{Q} &= \sum _{i=1} ^{m} \frac{Q_i}{m} \\ Average \ Standard \ of \ Error \ of \ Quant. \ of \ Interest-> \ \overline{U} &= \sum _{i=1} ^{m} \frac{U_i}{m} \\ Variance \ of \ Mean \ of \ Quant. \ of \ Interest-> \ B &= \frac{1}{m-1}\sum _{i=1} ^{m} (Q_i - \overline{Q})^2 \\ Estimated \ Total \ Variance \ of \ \overline{Q}-> \ T &= \overline{U} + \frac{m + 1}{m}B \end{align*} \]
n.amelia <- length(amelia_imp$imputations)
qweather <- c(mean(amelia_imp$imputations$imp1$Weather), mean(amelia_imp$imputations$imp2$Weather), mean(amelia_imp$imputations$imp3$Weather),
mean(amelia_imp$imputations$imp4$Weather), mean(amelia_imp$imputations$imp5$Weather))
uweather <- c(std.error(amelia_imp$imputations$imp1$Weather), std.error(amelia_imp$imputations$imp2$Weather),
std.error(amelia_imp$imputations$imp3$Weather), std.error(amelia_imp$imputations$imp4$Weather),
std.error(amelia_imp$imputations$imp5$Weather))
q.bar.weather <- mean(qweather)
u.bar.weather <- mean(uweather)
b.weather <- var(qweather)
t.weather <- (q.bar.weather + ((n.amelia + 1)/n.amelia)*b.weather)
se.weather <- std.error(qweather)
qsupcrew <- c(mean(amelia_imp$imputations$imp1$Support_Crew_Available), mean(amelia_imp$imputations$imp2$Support_Crew_Available),
mean(amelia_imp$imputations$imp3$Support_Crew_Available), mean(amelia_imp$imputations$imp4$Support_Crew_Available),
mean(amelia_imp$imputations$imp5$Support_Crew_Available))
usupcrew <- c(std.error(amelia_imp$imputations$imp1$Support_Crew_Available), std.error(amelia_imp$imputations$imp2$Support_Crew_Available),
std.error(amelia_imp$imputations$imp3$Support_Crew_Available), std.error(amelia_imp$imputations$imp4$Support_Crew_Available),
std.error(amelia_imp$imputations$imp5$Support_Crew_Available))
q.bar.supcrew <- mean(qsupcrew)
u.bar.supcrew <- mean(usupcrew)
b.supcrew <- var(qsupcrew)
t.supcrew <- (q.bar.supcrew + ((n.amelia + 1)/n.amelia)*b.supcrew)
se.supcrew <- std.error(qsupcrew)
qlateincom <- c(mean(amelia_imp$imputations$imp1$Late_Arrival_o), mean(amelia_imp$imputations$imp2$Late_Arrival_o),
mean(amelia_imp$imputations$imp3$Late_Arrival_o), mean(amelia_imp$imputations$imp4$Late_Arrival_o),
mean(amelia_imp$imputations$imp5$Late_Arrival_o))
ulateincom <- c(std.error(amelia_imp$imputations$imp1$Late_Arrival_o), std.error(amelia_imp$imputations$imp2$Late_Arrival_o),
std.error(amelia_imp$imputations$imp3$Late_Arrival_o), std.error(amelia_imp$imputations$imp4$Late_Arrival_o),
std.error(amelia_imp$imputations$imp5$Late_Arrival_o))
q.bar.lateincom <- mean(qlateincom)
u.bar.lateincom <- mean(ulateincom)
b.lateincom <- var(qlateincom)
t.lateincom <- (q.bar.lateincom + ((n.amelia + 1)/n.amelia)*b.lateincom)
se.lateincom <- std.error(qlateincom)
qclean <- c(mean(amelia_imp$imputations$imp1$Cleaning_o), mean(amelia_imp$imputations$imp2$Cleaning_o),
mean(amelia_imp$imputations$imp3$Cleaning_o), mean(amelia_imp$imputations$imp4$Cleaning_o),
mean(amelia_imp$imputations$imp5$Cleaning_o))
uclean <- c(std.error(amelia_imp$imputations$imp1$Cleaning_o), std.error(amelia_imp$imputations$imp2$Cleaning_o),
std.error(amelia_imp$imputations$imp3$Cleaning_o), std.error(amelia_imp$imputations$imp4$Cleaning_o),
std.error(amelia_imp$imputations$imp5$Cleaning_o))
q.bar.clean <- mean(qclean)
u.bar.clean <- mean(uclean)
b.clean <- var(qclean)
t.clean <- (q.bar.clean + ((n.amelia + 1)/n.amelia)*b.clean)
se.clean <- std.error(qclean)
am.imp.stat <- data.frame(
Statistic = c("Q.bar", "U.bar", "B.var", "T", "Std.Err"),
am.weather = c(q.bar.weather, u.bar.weather, b.weather, t.weather, se.weather),
am.sup.crew = c(q.bar.supcrew, u.bar.supcrew, b.supcrew, t.supcrew, se.supcrew),
am.late.incoming = c(q.bar.lateincom, u.bar.lateincom, b.lateincom, t.lateincom, se.lateincom),
am.clean = c(q.bar.clean, u.bar.clean, b.clean, t.clean, se.clean)
)
kable(am.imp.stat, caption = "Table 7:
<br>
Summary Statistics of Amelia Imputation Runs for Weather, Support Crews Available, Late Incoming Planes, & Cleaning Time <br> (n = 5)") %>%
kable_styling()
Statistic | am.weather | am.sup.crew | am.late.incoming | am.clean |
---|---|---|---|---|
Q.bar | 5.3535059 | 84.9753369 | 18.7399812 | 2.0899498 |
U.bar | 0.0079766 | 0.6916249 | 0.0132382 | 0.0103787 |
B.var | 0.0000000 | 0.0006319 | 0.0000002 | 0.0000002 |
T | 5.3535060 | 84.9760952 | 18.7399814 | 2.0899500 |
Std.Err | 0.0000815 | 0.0112421 | 0.0001797 | 0.0001859 |
A mice imputation model was then created as a comparison to the above
created Amelia bootstrap multiple imputation model to allow for
increased flexibility in the arrival delay predictions and to allow for
better preservation of relationships between variables that may not have
been observed in the initial exploratory data analysis steps. This
process allowed for further improved predictive power when applying
these models to real world data with a larger variety of missing data
patterns than observed above in Figure 1
.
No imputation method was dictated to the MICE package to allow the package to preliminary evaluate each variable with missing data and determine the appropriate imputation approach for the individual variable. This allows for increased flexibility in imputing the data, and prevents accidentally building a model that only works with the current set of data. By allowing for this variable specific flexibility, a re-determination of the best fit imputation model is completed each time the imputation model is run. Therefore the imputation model can employ best fit approach based on the currently available data, each time the imputation is run regardless of the pattern and quantity of missing observations present. This allows for real world data tuning as new flight delay data is obtained globally. Distributions of the imputed values in each of the 5 imputation runs utilized in the MICE imputation model can be seen below in Figure 3.
m.delay <- delay.clean[-1]
m.delay$Cleaning_o <- as.numeric(m.delay$Cleaning_o)
m.delay$Baggage_loading_time <- as.numeric(m.delay$Baggage_loading_time)
mice1 <- NULL
mice1 <- mice(m.delay, m = 5, maxit = 10, seed = 123, print = FALSE)
plot(mice1, main = "MICE Model Imputation", sub = "Figure 4:
A visual representation of the mean and standard deviation of
each imputation run utilizing the MICE imputation package.
A total of 5 imputation runs were completed with
a maximum of 10 iterations per imputation")
kable(mice1$method, col.names = c("Variable Name", "Imputation Method"), caption = "Table 8:
<br>Imputation methods utilized for each variable.
<br>Blank values indicate no missing observations were identified for the variable and thus no imputation was required.") %>%
kable_styling()
Variable Name | Imputation Method |
---|---|
Airport_Distance | |
Number_of_flights | |
Weather | pmm |
Support_Crew_Available | pmm |
Baggage_loading_time | |
Late_Arrival_o | pmm |
Cleaning_o | pmm |
Fueling_o | |
Security_o | |
Arr_Delay |
While the MICE imputation package automatically combines the imputation runs into a single best fit imputed data set, the imputation statistics for each run were calculated using the previously outlined Ruben’s rules for ease of comparison with the previously obtained Amelia bootstrap imputation model. Once these summary statistics were calculated, they were utilized to compare both imputation approaches to determine the best fit imputation model. As demonstrated below in Table 9, comparing the standard error values for each variable imputed in both models, the Amelia bootstrap imputation approach proves to be a better fit for imputing missing data.
n.mice <- length(mice1$imp$Weather)
qweather.m <- c(mean(mice1$imp$Weather[,1]), mean(mice1$imp$Weather[,2]), mean(mice1$imp$Weather[,3]), mean(mice1$imp$Weather[,4]),
mean(mice1$imp$Weather[,5]))
uweather.m <- c(std.error(mice1$imp$Weather[,1]), std.error(mice1$imp$Weather[,2]), std.error(mice1$imp$Weather[,3]),
std.error(mice1$imp$Weather[,4]), std.error(mice1$imp$Weather[,5]))
q.bar.weather.m <- mean(qweather.m)
u.bar.weather.m <- mean(uweather.m)
b.weather.m <- var(qweather.m)
t.weather.m <- (q.bar.weather.m + ((n.mice + 1)/n.mice)*b.weather.m)
se.weather.m <- std.error(qweather.m)
qsupcrew.m <- c(mean(mice1$imp$Support_Crew_Available[,1]), mean(mice1$imp$Support_Crew_Available[,2]), mean(mice1$imp$Support_Crew_Available[,3]),
mean(mice1$imp$Support_Crew_Available[,4]), mean(mice1$imp$Support_Crew_Available[,5]))
usupcrew.m <- c(std.error(mice1$imp$Support_Crew_Available[,1]), std.error(mice1$imp$Support_Crew_Available[,2]),
std.error(mice1$imp$Support_Crew_Available[,3]), std.error(mice1$imp$Support_Crew_Available[,4]),
std.error(mice1$imp$Support_Crew_Available[,5]))
q.bar.supcrew.m <- mean(qsupcrew.m)
u.bar.supcrew.m <- mean(usupcrew.m)
b.supcrew.m <- var(qsupcrew.m)
t.supcrew.m <- (q.bar.supcrew.m + ((n.mice + 1)/n.mice)*b.supcrew.m)
se.supcrew.m <- std.error(qsupcrew.m)
qlateincom.m <- c(mean(mice1$imp$Late_Arrival_o[,1]), mean(mice1$imp$Late_Arrival_o[,2]), mean(mice1$imp$Late_Arrival_o[,3]),
mean(mice1$imp$Late_Arrival_o[,4]), mean(mice1$imp$Late_Arrival_o[,5]))
ulateincom.m <- c(std.error(mice1$imp$Late_Arrival_o[,1]), std.error(mice1$imp$Late_Arrival_o[,2]), std.error(mice1$imp$Late_Arrival_o[,3]),
std.error(mice1$imp$Late_Arrival_o[,4]), std.error(mice1$imp$Late_Arrival_o[,5]))
q.bar.lateincom.m <- mean(qlateincom.m)
u.bar.lateincom.m <- mean(ulateincom.m)
b.lateincom.m <- var(qlateincom.m)
t.lateincom.m <- (q.bar.lateincom.m + ((n.mice + 1)/n.mice)*b.lateincom.m)
se.lateincom.m <- std.error(qlateincom.m)
qclean.m <- c(mean(mice1$imp$Cleaning_o[,1]), mean(mice1$imp$Cleaning_o[,2]), mean(mice1$imp$Cleaning_o[,3]), mean(mice1$imp$Cleaning_o[,4]),
mean(mice1$imp$Cleaning_o[,5]))
uclean.m <- c(std.error(mice1$imp$Cleaning_o[,1]), std.error(mice1$imp$Cleaning_o[,2]), std.error(mice1$imp$Cleaning_o[,3]),
std.error(mice1$imp$Cleaning_o[,4]), std.error(mice1$imp$Cleaning_o[,5]))
q.bar.clean.m <- mean(qclean.m)
u.bar.clean.m <- mean(uclean.m)
b.clean.m <- var(qclean.m)
t.clean.m <- (q.bar.clean.m + ((n.mice + 1)/n.mice)*b.clean.m)
se.clean.m <- std.error(qclean.m)
mice.merge <- am.imp.stat
mice.merge$mice.weather <- c(q.bar.weather.m, u.bar.weather.m, b.weather.m, t.weather.m, se.weather.m)
mice.merge$mice.sup.crew <- c(q.bar.supcrew.m, u.bar.supcrew.m, b.supcrew.m, t.supcrew.m, se.supcrew.m)
mice.merge$mice.late.incoming<- c(q.bar.lateincom.m, u.bar.lateincom.m, b.lateincom.m, t.lateincom.m, se.lateincom.m)
mice.merge$mice.clean<- c(q.bar.clean.m, u.bar.clean.m, b.clean.m, t.clean.m, se.clean.m)
mice.merge <- dplyr::select(mice.merge, Statistic, am.weather, mice.weather, am.sup.crew, mice.sup.crew, am.late.incoming, mice.late.incoming, am.clean, mice.clean)
kable(mice.merge, caption = "Table 9:
<br>Comparrison of Amelia bootstrap imputation model to MICE imputation model
<br>Prefix [am.] indicates amelia model statistics while prefix [mice.] indicates MICE cart model statistics") %>%
kable_styling() %>%
scroll_box(width = "100%")
Statistic | am.weather | mice.weather | am.sup.crew | mice.sup.crew | am.late.incoming | mice.late.incoming | am.clean | mice.clean |
---|---|---|---|---|---|---|---|---|
Q.bar | 5.3535059 | 5.4000000 | 84.9753369 | 98.650000 | 18.7399812 | 18.6333333 | 2.0899498 | 2.0800000 |
U.bar | 0.0079766 | 0.2269694 | 0.6916249 | 21.472021 | 0.0132382 | 0.3032956 | 0.0103787 | 0.2554809 |
B.var | 0.0000000 | 0.0400000 | 0.0006319 | 369.675000 | 0.0000002 | 0.0055556 | 0.0000002 | 0.0520000 |
T | 5.3535060 | 5.4480000 | 84.9760952 | 542.260000 | 18.7399814 | 18.6400000 | 2.0899500 | 2.1424000 |
Std.Err | 0.0000815 | 0.0894427 | 0.0112421 | 8.598546 | 0.0001797 | 0.0333333 | 0.0001859 | 0.1019804 |
As demonstrated above in Table 9
, the Amelia bootstrap
imputation model proved to be the best fit approach for imputing missing
flight data, therefore this imputation model was utilized to create a
linear regression model for analysis of the other utilized imputation
methods. Utilizing the analysis output as found above in the B.var row
of Table 7
it was determined that the variance between each
imputation run was negligible, allowing a single imputation run to be
utilized for proposed model creation. The best fit linear regression
model was created utilizing the step() function. This process was
utilized to allow for an automated selection of the best fit linear
regression model rather than manually assuming the best model by
comparing correlation values as completed above in the
Linear Regression Imputation
section.
This regression model was built in order to utilize the newly imputed data to create a regression model allowing for prediction of potential overall flight arrival delays based on the available global flight data.
am.reg.full <- lm(Arr_Delay ~., data = amelia_imp$imputations$imp1)
kable(step(am.reg.full, direction = "both", trace = 0)$coef, col.names = c("Variable Names", "Coefficients"), caption = "Table 10:
<br>Linear Regression coefficients calculated utilzing the STEP imputation process") %>%
kable_styling()
Variable Names | Coefficients |
---|---|
(Intercept) | -451.7150165 |
Airport_Distance | 0.2123756 |
Number_of_flights | 0.0052688 |
Weather | 5.4674566 |
Support_Crew_Available | -0.0564633 |
Baggage_loading_time | 11.9203541 |
Late_Arrival_o | 8.1911827 |
\[ \begin{align*} Arrival \ Delay &\sim Airport \ Distance + \ Number \ of \ flights + Weather \\ & \ \ \ + Support \ Crew \ Available + Baggage \ Loading \ Time + Late \ Incoming \ Plane \ Arrival \\ Arrival \ Delay &\sim 2.118e^{-1}x \ + \ 5.269e^{-3}x \ + \ 5.498x \ - \ 5.700e^{-2}x \ + 1.191e^{1}x \ + \ 8.180x \ - \ 4.514e^{2} \end{align*} \]
The above created linear regression model was run on all 4 created imputation data sets to further analyze which imputation model works best for a linear regression model designed to predict total flight arrival delays based on the available flight data. To note, as the previously created linear regression imputation model was only applicable for numerical data, the Baggage Loading Time variable was removed from the regression equation only for the linear regression imputation model. This restriction of input data already implies that the linear regression imputation model is likely not the best fit approach for imputing data to then use for flight delay predictions.
am.reg <- lm(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = amelia_imp$imputations$imp1)
kNN.reg <- lm(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = delay.kNN)
lin.reg <- lm(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + Support_Crew_Available + Late_Arrival_o, data = delay.linear)
mice.reg <- lm(Arr_Delay ~ Airport_Distance + Number_of_flights + Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o, data = m.delay)
reg.compare <- data.frame(
model = c("am.reg", "kNN.reg", "lin.reg", "mice.reg"),
r.squared = c(summary(am.reg)$r.squared, summary(kNN.reg)$r.squared, summary(lin.reg)$r.squared, summary(mice.reg)$r.squared)
)
kable(reg.compare, col.names = c("Imputation Model", 'R$^2$ Value'), caption = "Table 11:
<br>Comparrison of utilizing various imputation models for linear regression model prediction of flight delays") %>%
kable_styling()
Imputation Model | R\(^2\) Value |
---|---|
am.reg | 0.7888451 |
kNN.reg | 0.8095868 |
lin.reg | 0.7709490 |
mice.reg | 0.7889214 |
As demonstrated above, the k Nearest Neighbor
imputation
model resulted in the highest R\(^2\)
value indicating that the k Nearest Neighbor
(delay.kNN)
imputation model worked best for predicting overall flight delays when
utilizing the step-wise created linear regression model. This was
followed up by the MICE imputation model and the Amelia bootstrap
imputation model in a near tie for second place. This allows for
selection of the best imputation approach for imputing any missing
global flight data prior to predicting total flight delays. Due to the
high R\(^2\) value the
k Nearest Neighbor
imputation approach will be utilized
going forward for analysis.
In the last analysis of this data set the total support crew members available, total arrival delays for planes needed for outgoing flights, baggage loading times, and fueling times were analyzed for their impacts on the overall flight delays. These variables were re-assessed alongside the remaining variables to determine if these three variables previously selected were truly the best predictor variables for flight delays. The following variables not investigated in the first analysis were reviewed alongside the previously mentioned variables:
In order to determine if any of the variable data is skewed, the continuous variable distributions were plotted with an overlay indicating the standard normal distribution as calculated with the respective variable’s calculated mean and standard deviation. This check was completed to ensure the data as available will cooperate with the prediction models utilized later in the analysis. As a majority of prediction models require the assumption of a normal distribution, the calculated normal distribution curve for each respective variable was overlaid as a comparison reference. In order to properly leverage the prediction models to optimize airport planning to reduce flight delays, it is imperative to ensure the data fits within the assumptions of the prediction model utilized.
kNN.analysis <- delay.kNN
kNN.analysis$Weather <- as.numeric(kNN.analysis$Weather)
kNN.analysis$Support_Crew_Available <- as.numeric(kNN.analysis$Support_Crew_Available)
kNN.analysis$Late_Arrival_o <- as.numeric(kNN.analysis$Late_Arrival_o)
dist_dens <- ggplot(kNN.analysis, aes(x = Airport_Distance)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(kNN.analysis$Airport_Distance), sd = sd(kNN.analysis$Airport_Distance)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(kNN.analysis$Airport_Distance)), color = "black") +
geom_vline(aes(xintercept = median(kNN.analysis$Airport_Distance)), color = "black", linetype = "dashed") +
labs(title = "Distance Between Departure-Arrival Airports",
x = "Distance (miles)",
y = "Density")
numflight_dens <- ggplot(kNN.analysis, aes(x = Number_of_flights)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(kNN.analysis$Number_of_flights), sd = sd(kNN.analysis$Number_of_flights)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(kNN.analysis$Number_of_flights)), color = "black") +
geom_vline(aes(xintercept = median(kNN.analysis$Number_of_flights)), color = "black", linetype = "dashed") +
labs(title = "Total Flights per Airport *",
x = "Flights per Airport",
y = NULL)
supcrew_dens <- ggplot(kNN.analysis, aes(x = Support_Crew_Available)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(kNN.analysis$Support_Crew_Available), sd = sd(kNN.analysis$Support_Crew_Available, na.rm = TRUE)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(kNN.analysis$Support_Crew_Available)), color = "black") +
geom_vline(aes(xintercept = median(kNN.analysis$Support_Crew_Available)), color = "black", linetype = "dashed") +
labs(title = "Total Support Crew Members Available *",
x = "Support Crew Available",
y = NULL)
fueling_dens <- ggplot(kNN.analysis, aes(x = Fueling_o)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(kNN.analysis$Fueling_o), sd = sd(kNN.analysis$Fueling_o)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(kNN.analysis$Fueling_o)), color = "black") +
geom_vline(aes(xintercept = median(kNN.analysis$Fueling_o)), color = "black", linetype = "dashed") +
labs(title = "Plane Fueling Times",
x = "Fueling Time (minutes)",
y = NULL)
security_dens <- ggplot(kNN.analysis, aes(x = Security_o)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(kNN.analysis$Security_o), sd = sd(kNN.analysis$Security_o)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(kNN.analysis$Security_o)), color = "black") +
geom_vline(aes(xintercept = median(kNN.analysis$Security_o)), color = "black", linetype = "dashed") +
labs(title = "Total Security Times",
x = "Time (minutes)",
y = "Density")
arrival_dens <- ggplot(kNN.analysis, aes(x = Arr_Delay)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(kNN.analysis$Arr_Delay), sd = sd(kNN.analysis$Arr_Delay)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(kNN.analysis$Arr_Delay)), color = "black") +
geom_vline(aes(xintercept = median(kNN.analysis$Arr_Delay)), color = "black", linetype = "dashed") +
labs(title = "Total Flight Delays",
x = "Duration (minutes)",
y = "Density")
lateincom_dens <- ggplot(kNN.analysis, aes(x = Late_Arrival_o)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(kNN.analysis$Late_Arrival_o), sd = sd(kNN.analysis$Late_Arrival_o)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(kNN.analysis$Late_Arrival_o)), color = "black") +
geom_vline(aes(xintercept = median(kNN.analysis$Late_Arrival_o)), color = "black", linetype = "dashed") +
labs(title = "Late Incomig Plane Delays *",
x = "Duration (minutes)",
y = "Density")
weather_dens <- ggplot(kNN.analysis, aes(x = Weather)) +
geom_density(fill = "blue", alpha = 0.5) +
stat_function(fun = dnorm, args = list(mean = mean(kNN.analysis$Weather), sd = sd(kNN.analysis$Weather)), color = "darkred", linewidth = 1, linetype = "dashed") +
geom_vline(aes(xintercept = mean(kNN.analysis$Weather)), color = "black") +
geom_vline(aes(xintercept = median(kNN.analysis$Weather)), color = "black", linetype = "dashed") +
labs(title = "Weather Conditions",
x = "Ranking of Weather Severity",
y = "Density")
kNN.analysis$Baggage_loading_time <- factor(kNN.analysis$Baggage_loading_time, levels = c("Slow", "Moderate", "Fast"))
kNN.analysis$Cleaning_o <- factor(kNN.analysis$Cleaning_o, levels = c("Slow", "Moderate", "Quick", "Immediate"))
baggage_bar <- ggplot(kNN.analysis, aes(x = Baggage_loading_time)) +
geom_bar(fill = "blue", alpha = 0.5) +
labs(title = "Baggage Loading Times *",
x = "Loading Duration (slowest to fastest from left to right)",
y = "Frequency")
cleaning_bar <- ggplot(kNN.analysis, aes(x = Cleaning_o)) +
geom_bar(fill = "blue", alpha = 0.5) +
labs(title = "Plane Cleaning Times *",
x = "Cleaning Duration (slowest to fastest from left to right)",
y = NULL)
grid.arrange(
arrangeGrob(arrival_dens, numflight_dens, dist_dens, fueling_dens, security_dens, supcrew_dens, lateincom_dens, weather_dens, baggage_bar, cleaning_bar, ncol = 2,nrow = 5),
top = textGrob("Distribution of Flight Variables
", gp = gpar(fontsize = 20, fontface = "bold")),
bottom = textGrob("Figure 5
Graphs marked with one asterisk (*) indicate skewed data
Black solid line: mean of variable data | Black dashed line: median of variable data"))
As demonstrated in the above Figure 4
, the distance
between airports and total security times hold close to the normal
distribution curves. Meanwhile the total delay times for incoming planes
from previous flights, fueling times, and weather severity illustrate
non-normal distributions. At a cursory review these variables appear to
hold to multimodal and Bernoulli distributions. Meanwhile the number of
flights and total support crews available demonstrated skewed normal
distributions.
In order to confirm the visually identified data skews the means and medians of the following variables were compared:
Upon comparison of the mean and median of the numeric variables it was identified that the late incoming planes variable was not skewed as the median and mean were identical. Frequency counts for the categorical variables also identified that the mode of these variables corroborated a normal distribution. This left the number of flights and support crews available as the two truly skewed variables requiring transformation.
kable(summary(round(kNN.analysis[c(3,5,7)]), digits = 1), caption = "Table 12:
<br>Summary statistics for variables with skewed distributions. Rounded to whole numbers as the fractions of flights and fractions of support crew members cannot exist naturally.", col.names = c("Total Flights", "Support Crews Available", "Late Incoming Planes")) %>%
kable_styling()
Total Flights | Support Crews Available | Late Incoming Planes | |
---|---|---|---|
Min. :29475 | Min. : 0 | Min. :15 | |
1st Qu.:41634 | 1st Qu.: 56 | 1st Qu.:18 | |
Median :43424 | Median : 83 | Median :19 | |
Mean :43311 | Mean : 85 | Mean :19 | |
3rd Qu.:45140 | 3rd Qu.:112 | 3rd Qu.:19 | |
Max. :53461 | Max. :222 | Max. :22 |
kable(summary(kNN.analysis[c(6, 8)]), caption = "Table 13:
<br>Frequency count to capture the mode of categorical variables with skewed distributions.", col.names = c("Baggage Loading Times", "Plane Cleaning Duration")) %>%
kable_styling()
Baggage Loading Times | Plane Cleaning Duration | |
---|---|---|
Slow : 11 | Slow : 30 | |
Moderate:2833 | Moderate : 782 | |
Fast : 749 | Quick :2264 | |
NA | Immediate: 517 |
In order to normalize the variables for future predictive model use, min-max normalization was employed to reduce all variables to a range of [0,1]. The categorical variables were already encoded as factor variables, allowing for easy label encoding to allow for cooperation with the min-max standardization. The following equation was applied to all numeric variables (both originally numeric and forced numeric from factor categorical variables).
\[x_{standardized} = \frac{x - min(x)}{max(x) - min(x)}\]
kNN.analysis$cleaning <- as.numeric(kNN.analysis$Cleaning_o)
kNN.analysis$baggage <- as.numeric(kNN.analysis$Baggage_loading_time)
delay.std <- data.frame(
airport_dist = (kNN.analysis$Airport_Distance - min(kNN.analysis$Airport_Distance)) / (max(kNN.analysis$Airport_Distance) - min(kNN.analysis$Airport_Distance)),
num_flights = (kNN.analysis$Number_of_flights - min(kNN.analysis$Number_of_flights)) / (max(kNN.analysis$Number_of_flights) - min(kNN.analysis$Number_of_flights)),
weather = (kNN.analysis$Weather - min(kNN.analysis$Weather)) / (max(kNN.analysis$Weather) - min(kNN.analysis$Weather)),
sup_crew = (kNN.analysis$Support_Crew_Available - min(kNN.analysis$Support_Crew_Available)) / (max(kNN.analysis$Support_Crew_Available) - min(kNN.analysis$Support_Crew_Available)),
late_incoming = (kNN.analysis$Late_Arrival_o - min(kNN.analysis$Late_Arrival_o)) / (max(kNN.analysis$Late_Arrival_o) - min(kNN.analysis$Late_Arrival_o)),
fueling = (kNN.analysis$Fueling_o - min(kNN.analysis$Fueling_o)) / (max(kNN.analysis$Fueling_o) - min(kNN.analysis$Fueling_o)),
security = (kNN.analysis$Security_o - min(kNN.analysis$Security_o)) / (max(kNN.analysis$Security_o) - min(kNN.analysis$Security_o)),
arr_delay = (kNN.analysis$Arr_Delay - min(kNN.analysis$Arr_Delay)) / (max(kNN.analysis$Arr_Delay) - min(kNN.analysis$Arr_Delay)),
cleaning = (kNN.analysis$cleaning - min(kNN.analysis$cleaning)) / (max(kNN.analysis$cleaning) - min(kNN.analysis$cleaning)),
baggage = (kNN.analysis$baggage - min(kNN.analysis$baggage)) / (max(kNN.analysis$baggage) - min(kNN.analysis$baggage))
)
kable(summary(delay.std), caption = "Table 14:
<br>Summary statistics of normalized flight data") %>%
kable_styling() %>%
scroll_box(width = "100%")
airport_dist | num_flights | weather | sup_crew | late_incoming | fueling | security | arr_delay | cleaning | baggage | |
---|---|---|---|---|---|---|---|---|---|---|
Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | |
1st Qu.:0.4472 | 1st Qu.:0.5069 | 1st Qu.:0.0000 | 1st Qu.:0.2523 | 1st Qu.:0.4286 | 1st Qu.:0.4348 | 1st Qu.:0.3800 | 1st Qu.:0.2722 | 1st Qu.:0.6667 | 1st Qu.:0.5000 | |
Median :0.5447 | Median :0.5815 | Median :0.0000 | Median :0.3739 | Median :0.5714 | Median :0.5217 | Median :0.4800 | Median :0.3889 | Median :0.6667 | Median :0.5000 | |
Mean :0.5395 | Mean :0.5768 | Mean :0.3532 | Mean :0.3828 | Mean :0.5342 | Mean :0.5222 | Mean :0.4817 | Mean :0.3878 | Mean :0.6365 | Mean :0.6027 | |
3rd Qu.:0.6341 | 3rd Qu.:0.6531 | 3rd Qu.:1.0000 | 3rd Qu.:0.5045 | 3rd Qu.:0.5714 | 3rd Qu.:0.6087 | 3rd Qu.:0.5800 | 3rd Qu.:0.5000 | 3rd Qu.:0.6667 | 3rd Qu.:0.5000 | |
Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 |
Min-max normalization was chosen to allow for future use with feature magnitude sensitive algorithms, such as k Nearest Neighbors cluster analysis.
In order to simplify the data for analysis and predictive modeling,
Recursive Feature Elimination (RFE) was employed utilizing the
caret package. This allows for removal of redundancy in the data set
through a random forest model using 10 iterations of cross validation to
determine which variables are most essential for creating prediction
models for flight delays. Reducing redundancy not only reduces the
computational burden of predictive models while accounting for any
potential feature interactions between variables. In order to prevent
over fitting using RFE, an 80:20 training:test set was created using
delay.caret
to identify the most important variables for
prediction of overall flight delays (arr_delay).
A visualization of the importance of each variable in potential
prediction of overall flight delays is included below in
Figure 6
. As demonstrated below there is a steep drop off
in importance for total cleaning time (cleaning), fueling times
(fueling), & security times (security); therefore these variables
will be omitted from future analysis to reduce computational burden and
prevent redundancies.
x.caret <- delay.std[,-8]
y.caret <- delay.std[,8]
set.seed(12345)
inTrain <- createDataPartition(y.caret, p = 0.80, list = FALSE)[,1]
x.caret.train <- x.caret[inTrain, ]
x.caret.test <- x.caret[-inTrain, ]
y.caret.train <- y.caret[inTrain]
y.caret.test <- y.caret[-inTrain]
control <- rfeControl(functions = rfFuncs, method = "cv", repeats = 10, number = 15)
delay.caret <- rfe(x = x.caret.train, y.caret.train , sizes = c(1:5), rfeControl = control)
delay.caret.var <- data.frame(feature = row.names(varImp(delay.caret))[1:9],
importance = varImp(delay.caret)[1:9, 1])
accuracy.caret <- postResample(predict(delay.caret, x.caret.test), y.caret.test)
ggplot(delay.caret.var, aes(x = reorder(feature, -importance), y = importance, fill = feature)) +
geom_bar(stat="identity") +
labs(x = "Features", y = "Variable Importance", title = "Variable Importance for All Features", caption = "Figure 6:
Importance of each variable in global flight data for predicting flight delays") +
geom_text(aes(label = round(importance, 2)), vjust = 0, color = "black", size = 4) +
theme_bw() +
theme(legend.position = "none",
plot.caption = element_text(hjust = 0.5))
In order to properly utilize the 6 most important variables for
prediction of overall flight delays as outlined above in
Figure 6
a two step feature creation process was utilized.
Initially a principal component analysis was conducted to combine these
6 variables into principal components that capture a combination of
these 6 variables. Utilizing the PCA approach allows for a more
streamlined approach for creating prediction models for overall flight
delay times. Once transformed into principal components, a k-means
cluster analysis was then conducted to add for an additional feature to
utilize when creating flight delay prediction models.
Utilizing the most important variables as identified in figure 6 (number of flights per airport, late incoming plane, baggage loading time, distance between airports, support crews available, and weather) a principal component analysis was conducted to further reduce redundancies. An initial scree plot was utilized to determine the “elbow” cutoff of PC1 & PC2 where the variance steeply cuts off. Upon calculation of the total percentage of variance, it was found that in order to create a model that accounts for >90% of the variance in the total flight delay data, the first 5 PCs must be utilized.
delay.std.km <- delay.std[-c(6, 7:9)]
pca.delay <- prcomp(delay.std.km, scale = TRUE)
pca.delay.1 <- cbind(delay.std.km, pca.delay$x)
pca.var <- apply(pca.delay$x, 2, var)
plot(pca.delay, type ="l", ylim=c(0,4), xlim=c(1,6.5), main = NULL)
text((1:10)+0.15, pca.var+0.3, as.character(round(pca.var, 3)), cex = 0.7)
title(main = "Variance of Individual PCs", xlab ="Principal Components", sub = "Figure 7: Scree Plot to determine number of PCs to use")
points(x = 2, y = (pca.delay$sdev[2])^2, pch = 19, col = "purple", cex = 1.5)
text(x = 2, y = 0.5, "61.938%
of Var.", col = "purple", cex = 0.6)
points(x = 5, y = (pca.delay$sdev[5])^2, pch = 19, col = "darkred", cex = 1.5)
text(x = 5, y = 0.2, "91.953%
of Var.", col = "darkred", cex = 0.6)
kable(pca.delay$rotation[,c(1:5)], caption = "Table 15:
<br>Principal Component Coefficients (rotation matrix)") %>%
kable_styling()
PC1 | PC2 | PC3 | PC4 | PC5 | |
---|---|---|---|---|---|
airport_dist | 0.3707838 | 0.0943230 | -0.4560362 | 0.7932004 | -0.0346455 |
num_flights | 0.5154782 | 0.0538654 | -0.1096925 | -0.1820407 | -0.0161357 |
weather | 0.2581120 | -0.8924380 | 0.3276163 | 0.1706329 | 0.0040928 |
sup_crew | -0.2988248 | -0.4332730 | -0.8057221 | -0.2618513 | 0.0513166 |
late_incoming | 0.4677295 | 0.0212573 | -0.1274563 | -0.3872228 | -0.6697073 |
baggage | -0.4712883 | -0.0598225 | 0.0850466 | 0.3001186 | -0.7398523 |
Utilizing the 5 created principal components, k-Means clustering was
completed to create clusters to aid in total flight delay prediction. In
order to determine the best fit number of clusters, a wss
and silhouette
plots were created as an initial analysis of
the PCA data. As demonstrated below in Figure 7
was
determined that the best fit number of clusters for this transformed PCA
data was 2 clusters. Utilizing that pre-determined number of clusters,
the PCA data was then clustered into 2 groups, using a total of 25 run
throughs of the clustering process to best determine the center point of
the clusters and reduce overall cluster overlap.
pca.delay.2 <- pca.delay.1 %>%
dplyr::select("PC1", "PC2", "PC3", "PC4", "PC5")
kmeans.wss <- fviz_nbclust(pca.delay.2, FUN = hcut, method = "wss")
kmeans.sil <- fviz_nbclust(pca.delay.2, FUN = hcut, method = "silhouette")
grid.arrange(
arrangeGrob(kmeans.wss, kmeans.sil, ncol = 2,nrow = 1),
top = textGrob("Optimal Cluster Determination", gp = gpar(fontsize = 20, fontface = "bold")),
bottom = textGrob("Figure 7
Determination of optimal number of clusters for PCA overall flight data"))
set.seed(12345)
delay.kmean <- kmeans(pca.delay.2, centers = 2, nstart = 25)
pca.delay.2$cluster <- as.factor(delay.kmean$cluster)
pca.delay.final <- pca.delay.2
pca.delay.final$arr_delay <- delay.std$arr_delay
eigen <- round(get_eigenvalue(pca.delay), 1)
var.eigen <- eigen$variance.percent
cluster_centers <- aggregate(pca.delay.2, by=list(cluster = pca.delay.2$cluster), FUN = mean) %>%
dplyr::select(-7)
ggscatter(pca.delay.2, x = "PC1", y = "PC2",
color = "cluster",
palette = "npg",
ellipse = TRUE,
ellipse.type = "convex",
legend = "right",
xlab = paste("PC1 (",var.eigen[1],"%)"),
ylab = paste("PC2 (", var.eigen[2],"%)")) +
geom_point(data = cluster_centers, shape = 18, size = 4, color = "black") +
labs(title = "Cluster Analysis", caption = "Figure 8
Clustering of PC1 & PC2 using k-Means Cluster analysis. Black diamonds represent cluster means.") +
theme_bw() +
theme(plot.caption = element_text(hjust = 0.5))
In a further analysis of global airline data, it has been determined that in order to best create real time prediction models for overall flight delay, the raw global flight data must undergo a litany of preparation steps. First the missing data must be imputed utilizing the best fit approach for the current pattern of missing observations. Once the best fit imputation process was determined, the imputed data must undergo feature transformation to normalize any skew to prevent any unnecessary weighting or skew in prediction models. Normalization is imperative for global flight data as the variables contain widely different units and scales that would prevent a clean analysis without normalization. For example, without normalization number of flights at the departing airport and fueling times would not cooperate in a prediction model due to the massive difference in minimums and maximums of these variables.
After normalization the data then must be analyzed to identify and remove redundancy to prevent over-fitting of prediction models and reduce overall computational burden during the model creation process. As the amount of global flight data is robust, this step is essential in creating both quick and accurate prediction models should they be intended for real time flight delay prediction. Lastly, prior to model creation, the data must undergo additional feature creation such as principal component analysis (PCA) and k-Means clustering to combine the numerous variables into a reduced data set without loss of actual data. The creation of an additional categorical variable through k-Means clustering utilizing the aggregate PCA data by providing the future prediction models with meaningful pre-analysis data to improve prediction accuracy while further preventing over-fitting.
flight delay rights: https://www.battleface.com/blog/your-global-guide-to-flight-delay-rights/
Hotdeck imputation: https://pmc.ncbi.nlm.nih.gov/articles/PMC3130338/