1. Overview

In the world of air travel, 2022 was a year of air travel defined by cancellations. While 2023 has seen a decrease in cancellations, FlightAware reports an increase in delay time from last year.

Per federal data, most flight delays in recent years have been caused by issues within the control of airlines.

Discovering and understanding which specific issues cause delays could inform airlines’ priorities, and could mitigate delay time moving forward.

Of course, with something as complicated as air travel, it is not unreasonable to expect small delays. Instead of splitting hairs over delays in the minutes, we will attempt to see which variables are most closely tied to a delay of 1 hour or more. While it is clear that ‘Air Carrier Delay’ is a sizable chunk of delay times, that is very vague. We will use a dataset of flight records to create a predictive logistical model, with a goal of seeing which variables have an effect on delays of an hour or more, and how strong that effect is.

We will create different models with multiple statistical methods with a goal of determining which is best for prediction on whether see which variables, if any, are shown to consistently impact whether or not a flight experiences a long delay.

1.a Description of Data

The flight delay data has 3593 observations of 11 variables. They are:

Carrier: The airline

Airport_Distance: Distance between two airports

Number_of_Flights: Total number of flights in the airport

Weather: Weather condition, measured on a scale from 0 (mild) to 10 (extreme)

Support_Crew_Available: Number of support crew

Baggage_loading_time: Time in minutes spent loading baggage

Late_Arrival_o: Time in minutes the plane arrived late

Cleaning_o: Time in minutes spent cleaning the aircraft

Fueling_o: Time in minutes spent fueling the aircraft

Security_o: Time in minutes spent in security checking

Arr_Delay: Flight delay in minutes. This is the dependent variable of the dataset.

For logistic regression, an additional variable, Hour_Delay, will be created to see if a plane is 1 hour late or more.

A copy of this publicly available data is stored at https://pengdsci.github.io/datasets/FlightDelay/Flight_delay-data.csv.

flight<-read.csv("https://pengdsci.github.io/datasets/FlightDelay/Flight_delay-data.csv")

flight.hour<-flight

flight.hour$Hour_Delay = 0 

flight.hour$Hour_Delay<-ifelse(flight.hour$Arr_Delay > 59, 1, 0)

flight.hour<-flight.hour[-c(11)]

1.b Dataset Overview

First, as the majority of the variables are numerical, a summary stats overview is warranted.

summary(flight)
##    Carrier          Airport_Distance Number_of_flights    Weather     
##  Length:3593        Min.   :376.0    Min.   :29475     Min.   :5.000  
##  Class :character   1st Qu.:431.0    1st Qu.:41634     1st Qu.:5.000  
##  Mode  :character   Median :443.0    Median :43424     Median :5.000  
##                     Mean   :442.4    Mean   :43311     Mean   :5.353  
##                     3rd Qu.:454.0    3rd Qu.:45140     3rd Qu.:6.000  
##                     Max.   :499.0    Max.   :53461     Max.   :6.000  
##  Support_Crew_Available Baggage_loading_time Late_Arrival_o    Cleaning_o   
##  Min.   :  0            Min.   :14.00        Min.   :15.00   Min.   :-4.00  
##  1st Qu.: 56            1st Qu.:17.00        1st Qu.:18.00   1st Qu.: 8.00  
##  Median : 83            Median :17.00        Median :19.00   Median :10.00  
##  Mean   : 85            Mean   :16.98        Mean   :18.74   Mean   :10.02  
##  3rd Qu.:112            3rd Qu.:17.00        3rd Qu.:19.00   3rd Qu.:12.00  
##  Max.   :222            Max.   :19.00        Max.   :22.00   Max.   :23.00  
##    Fueling_o       Security_o      Arr_Delay    
##  Min.   :13.00   Min.   :13.00   Min.   :  0.0  
##  1st Qu.:23.00   1st Qu.:32.00   1st Qu.: 49.0  
##  Median :25.00   Median :37.00   Median : 70.0  
##  Mean   :25.01   Mean   :37.09   Mean   : 69.8  
##  3rd Qu.:27.00   3rd Qu.:42.00   3rd Qu.: 90.0  
##  Max.   :36.00   Max.   :63.00   Max.   :180.0

Amazingly, there are no missing values. Of note is the extreme variance within the Support_Crew_Available and Number_of_Flights variables and minimal variance within the Weather variable

We will also take a preliminary look at the Carrier variable with a frequency plot:

carrier.table<-table(flight$Carrier)
barplot(carrier.table, main='Frequency of Each Airline Carrier', xlab = 'Carrier', ylab = 'Number of Observations', col = 'mediumorchid')
mtext('Figure 1: Frequency plot of Carrier Variable' , side=1, line=4, at=9, cex=0.5)

This shows very extreme differences within the representation of each carrier, but there are enough observations from some (B6, UA, etc) that we believe the Carrier variable could prove to be significant during analysis. As such, it will be left in.

1.c Looking for Outliers

While the summary of variables like Airport Distance do not appear to cause alarm, the high variance of some stats suggests the possibility of outliers. While outliers are expected, it may be worth identifying and individually reviewing outliers to see if there could have been an inputting error.

testsup<-rosnerTest(flight$Support_Crew_Available, k = 5)
testnum<-rosnerTest(flight$Number_of_flights, k = 5)
testdist<-rosnerTest(flight$Airport_Distance, k = 5)
testbag<-rosnerTest(flight$Baggage_loading_time, k = 5)

The test of outliers for number of flights returns 3 possible outliers - observations 1652, 3163, and 2729. The delay for each observation is 0. While these observations do not appear to be errors, they make a strong suggestion that number of flights could be a good predictor for delay time. Similarly, the baggage time also shows 2729 and 3163 as low outliers.

1.d Investigation for Collinearity

As there are multiple variables related to airplane prep (fueling, cleaning, baggage loading, and security), it may be worth checking for possible collinearity between two predictor variables. If one can be identified as redundant and removed, it will save us time and computational power.

kable(pcor(flight[,2:10], method = "pearson")$estimate)
Airport_Distance Number_of_flights Weather Support_Crew_Available Baggage_loading_time Late_Arrival_o Cleaning_o Fueling_o Security_o
Airport_Distance 1.0000000 0.1627354 0.0234158 -0.0251957 0.1574454 0.0647102 0.0019674 -0.0279204 0.0070935
Number_of_flights 0.1627354 1.0000000 0.0599088 -0.1074942 0.4518685 0.2962003 -0.0064310 -0.0414294 0.0191193
Weather 0.0234158 0.0599088 1.0000000 -0.0159672 0.1160998 0.0592876 0.0106998 0.0194969 0.0098866
Support_Crew_Available -0.0251957 -0.1074942 -0.0159672 1.0000000 -0.1109424 -0.0538252 -0.0337149 0.0300466 0.0138438
Baggage_loading_time 0.1574454 0.4518685 0.1160998 -0.1109424 1.0000000 0.2350702 -0.0068233 0.0366900 0.0184492
Late_Arrival_o 0.0647102 0.2962003 0.0592876 -0.0538252 0.2350702 1.0000000 -0.0134745 -0.0029305 0.0392253
Cleaning_o 0.0019674 -0.0064310 0.0106998 -0.0337149 -0.0068233 -0.0134745 1.0000000 -0.0063664 -0.0095304
Fueling_o -0.0279204 -0.0414294 0.0194969 0.0300466 0.0366900 -0.0029305 -0.0063664 1.0000000 0.0185953
Security_o 0.0070935 0.0191193 0.0098866 0.0138438 0.0184492 0.0392253 -0.0095304 0.0185953 1.0000000

As no variable appears to have a high correlation with another, there is no definitive preliminary evidence of multicollinearity, so all variables could potentially be included into the model.

1.e Discretizing Variables

Since there are two instances of flights with outliers of both baggage and number of flights, we will see if replacing those variables with bins of values has an impact on the model. we could not find any precedent for this online, so we will use the quartile ranges of each to split each variable into ‘low’, ‘middle’ and ‘high’.

model2<-flight.hour

model2$Number_of_flights <- ifelse(model2$Number_of_flights < 41634, 'Low',
               ifelse(model2$Number_of_flights >= 45140, 'High', 'Middle'))

model2$Baggage_loading_time <- ifelse(model2$Baggage_loading_time < 17, 'Low',
               ifelse(model2$Number_of_flights >= 18, 'High', 'Middle'))
fit<-glm(Hour_Delay~., family = binomial, data=flight.hour)


res<-resid(fit)

fit2<-glm(Hour_Delay~., family = binomial, data=model2)


res2<-resid(fit2)
plot(flight.hour$Hour_Delay, res, ylab="Residuals", xlab="Hour Delay - 0=No, 1=Yes", main="Residuals of Intial Model", col="blue")
        mtext('Figure 2: Residual Plot of Model' , side=1, line=4, cex=0.5)
abline(0,0)

plot(model2$Hour_Delay, res2, ylab="Residuals", xlab="Hour Delay - 0=No, 1=Yes", main="Residuals of Discretized Model", col="orange")
mtext('Figure 3: Residual Plot of Discretized Model' , side=1, line=4, cex=0.5)
abline(0,0)

While the residual plots clearly look different, it is not entirely clear which model has less error.

errmat<-matrix(c(deviance(fit), deviance(fit2)),ncol=2) 
colnames(errmat)<-c('Original Model', 'Discretized Model')
rownames(errmat)<-c('Error')

error.sum<-as.table(errmat)

kable(error.sum)
Original Model Discretized Model
Error 1969.465 2193.526

While not apparent, the discretized model appears to have brought our residuals up. We will not consider a discretized model as a possibility moving forward.

1.f Pairwise Associations

Many variables in our model are numeric, we will look at pairwise associations of all variables, looking mostly at the correlation with the variable of interest, Arr_Delay.

ggpairs(flight.hour,                  
        columns = 2:11)         

This output shows the strength of some correlation between variables (specifically baggage loading, late arrival, and number of flights) and the variable of interest. These variables make a strong case for their inclusion in our predictive model.

2. Building a Model

As stated in our introduction, building a logistic model will allow us to see which variables have a significant effect on the presence of a long delay and how strong that effect is.

2.a Logistic Regression

As we are attempting to predict the probability of a delay of one hour or more, our response variable can only take values between 0 and 1. The predictive model most appropriate for this will be a logistic regression, following a form of:

\(P(Y) = (e^\delta)/1+e^\delta\)

Where \(\delta\) is equal to:

\(\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{k}x_{k}\)

And where each \(\beta\) represents the coefficient associated with each variable (\(x_{k}\)).

2.b Development of Candidate Models

2.b.1 Initial Model

The initial model is simply using every variable given in the initial dataframe. While very possibly accurate, this model is quite large and lacks transformations that possibly could assist with predictive modeling.

initial.model<-glm(Hour_Delay~. , family = binomial, data=flight.hour, na.action = na.exclude)

plot(initial.model$fitted.values, initial.model$residuals, main = 'Model Residuals', ylab="Residuals", xlab="Predicted Probability of Hour+ Delay")
mtext('Figure 4: Residual Plot of Model' , side=1, line=4, cex=0.5)
abline(0,0)

deviance(initial.model)
## [1] 1969.465

This model will be good to have as a baseline comparison for other candidate models. Our initial model residuals appear to follow a cubic line, so the inclusion of a second or third order term may help.

2.b.2 Including a Second-order variable

This can help mitigate errors with residuals, as seen above. Only numerical variables will be transformed, as mathematical functions will not work well with categorical variables such as the name of the airline!

quad.df<-flight.hour

quad.df$flights2<-quad.df$Number_of_flights**2
quad.df$bag2<-quad.df$Baggage_loading_time**2
quad.df$dist2<-quad.df$Airport_Distance**2
quad.df$weather2<-quad.df$Weather**2
quad.df$security2<-quad.df$Security_o**2
quad.df$support2<-quad.df$Support_Crew_Available**2
quad.df$late2<-quad.df$Late_Arrival_o**2
quad.df$clean2<-quad.df$Cleaning_o**2
quad.df$fuel2<-quad.df$Fueling_o**2

quad.model<-glm(Hour_Delay~. , family = binomial , data=quad.df)

plot(quad.model$fitted.values, quad.model$residuals, main = 'Model Residuals', ylab="Residuals", xlab="Predicted Probability of Hour+ Delay")
mtext('Figure 5: Residual Plot of Model with Quadratic Terms' , side=1, line=4, cex=0.5)
abline(0,0)

deviance(quad.model)
## [1] 1957.575

And, because of the extremely obvious cubic shape to the residual plot of our initial model, we will try some cubed variables.

cube.df<-quad.df

cube.df$flights3<-cube.df$Number_of_flights**3
cube.df$bag3<-cube.df$Baggage_loading_time**3
cube.df$dist3<-cube.df$Airport_Distance**3
cube.df$weather3<-cube.df$Weather**3
cube.df$security3<-cube.df$Security_o**3
cube.df$support3<-cube.df$Support_Crew_Available**3
cube.df$late3<-cube.df$Late_Arrival_o**3
cube.df$clean3<-cube.df$Cleaning_o**3
cube.df$fuel3<-cube.df$Fueling_o**3

cube.model<-glm(Hour_Delay~. , family = binomial , data=cube.df)

plot(cube.model$fitted.values, cube.model$residuals, main = 'Model Residuals', ylab="Residuals", xlab="Predicted Probability of Hour+ Delay")
mtext('Figure 6: Residual Plot of Model with Quadratic and Cubic Terms' , side=1, line=4, cex=0.5)
abline(0,0)

deviance(cube.model)
## [1] 1946.978

The cubic and quadratic models both, somehow, reduce error slightly despite having a single residual that is over 300 off.

2.c Selection of Optimal Model

To identify the optimal model, we will use the ‘step’ function in R to eliminate redundant variables in each of our model sets.

First, we will apply various selection methods to our initial dataframe.

model.reg.both<-step(initial.model, direction = "both" , trace = 1)
## Start:  AIC=2015.46
## Hour_Delay ~ Carrier + Airport_Distance + Number_of_flights + 
##     Weather + Support_Crew_Available + Baggage_loading_time + 
##     Late_Arrival_o + Cleaning_o + Fueling_o + Security_o
## 
##                          Df Deviance    AIC
## - Carrier                13   1976.9 1996.9
## - Fueling_o               1   1969.6 2013.6
## - Cleaning_o              1   1969.8 2013.8
## <none>                        1969.5 2015.5
## - Security_o              1   1972.5 2016.5
## - Weather                 1   1997.4 2041.4
## - Support_Crew_Available  1   2002.1 2046.1
## - Airport_Distance        1   2020.8 2064.8
## - Late_Arrival_o          1   2075.4 2119.4
## - Baggage_loading_time    1   2252.3 2296.3
## - Number_of_flights       1   2426.7 2470.7
## 
## Step:  AIC=1996.9
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Baggage_loading_time + Late_Arrival_o + 
##     Cleaning_o + Fueling_o + Security_o
## 
##                          Df Deviance    AIC
## - Fueling_o               1   1977.0 1995.0
## - Cleaning_o              1   1977.2 1995.2
## <none>                        1976.9 1996.9
## - Security_o              1   1979.9 1997.9
## + Carrier                13   1969.5 2015.5
## - Weather                 1   2005.1 2023.1
## - Support_Crew_Available  1   2009.9 2027.9
## - Airport_Distance        1   2027.1 2045.1
## - Late_Arrival_o          1   2084.7 2102.7
## - Baggage_loading_time    1   2260.9 2278.9
## - Number_of_flights       1   2432.6 2450.6
## 
## Step:  AIC=1994.97
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Baggage_loading_time + Late_Arrival_o + 
##     Cleaning_o + Security_o
## 
##                          Df Deviance    AIC
## - Cleaning_o              1   1977.3 1993.3
## <none>                        1977.0 1995.0
## - Security_o              1   1980.0 1996.0
## + Fueling_o               1   1976.9 1996.9
## + Carrier                13   1969.6 2013.6
## - Weather                 1   2005.1 2021.1
## - Support_Crew_Available  1   2010.1 2026.1
## - Airport_Distance        1   2027.4 2043.4
## - Late_Arrival_o          1   2084.7 2100.7
## - Baggage_loading_time    1   2261.1 2277.1
## - Number_of_flights       1   2434.0 2450.0
## 
## Step:  AIC=1993.28
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Baggage_loading_time + Late_Arrival_o + 
##     Security_o
## 
##                          Df Deviance    AIC
## <none>                        1977.3 1993.3
## - Security_o              1   1980.3 1994.3
## + Cleaning_o              1   1977.0 1995.0
## + Fueling_o               1   1977.2 1995.2
## + Carrier                13   1969.9 2011.9
## - Weather                 1   2005.4 2019.4
## - Support_Crew_Available  1   2010.5 2024.5
## - Airport_Distance        1   2027.8 2041.8
## - Late_Arrival_o          1   2084.7 2098.7
## - Baggage_loading_time    1   2262.2 2276.2
## - Number_of_flights       1   2434.7 2448.7

Both selection processes narrowed down the model to Security_o, Airport_Distance, Number_of_Flights, Weather, Support_Crew_Available, Baggage_Loading_Time, and Late_arrival_o. This suggests that in a candidate model with no transformations or additional terms, those variables should be included.

Next, our quadratic and cubic models

model.quad.both<-step(quad.model, direction = "both" , trace = 1)
## Start:  AIC=2019.58
## Hour_Delay ~ Carrier + Airport_Distance + Number_of_flights + 
##     Weather + Support_Crew_Available + Baggage_loading_time + 
##     Late_Arrival_o + Cleaning_o + Fueling_o + Security_o + flights2 + 
##     bag2 + dist2 + weather2 + security2 + support2 + late2 + 
##     clean2 + fuel2
## 
## 
## Step:  AIC=2019.58
## Hour_Delay ~ Carrier + Airport_Distance + Number_of_flights + 
##     Weather + Support_Crew_Available + Baggage_loading_time + 
##     Late_Arrival_o + Cleaning_o + Fueling_o + Security_o + flights2 + 
##     bag2 + dist2 + security2 + support2 + late2 + clean2 + fuel2
## 
##                          Df Deviance    AIC
## - Carrier                13   1964.9 2000.9
## - support2                1   1957.6 2017.6
## - Cleaning_o              1   1957.6 2017.6
## - Fueling_o               1   1957.6 2017.6
## - late2                   1   1957.6 2017.6
## - fuel2                   1   1957.6 2017.6
## - clean2                  1   1957.6 2017.6
## - Late_Arrival_o          1   1957.8 2017.8
## - Airport_Distance        1   1958.5 2018.5
## - security2               1   1958.5 2018.5
## - Number_of_flights       1   1958.7 2018.7
## - dist2                   1   1958.9 2018.9
## - Security_o              1   1959.1 2019.1
## <none>                        1957.6 2019.6
## - flights2                1   1959.9 2019.9
## - Support_Crew_Available  1   1960.2 2020.2
## - Baggage_loading_time    1   1962.9 2022.9
## - bag2                    1   1964.1 2024.1
## - Weather                 1   1985.2 2045.2
## 
## Step:  AIC=2000.88
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Baggage_loading_time + Late_Arrival_o + 
##     Cleaning_o + Fueling_o + Security_o + flights2 + bag2 + dist2 + 
##     security2 + support2 + late2 + clean2 + fuel2
## 
##                          Df Deviance    AIC
## - support2                1   1964.9 1998.9
## - Cleaning_o              1   1964.9 1998.9
## - late2                   1   1964.9 1998.9
## - clean2                  1   1964.9 1998.9
## - Fueling_o               1   1965.0 1999.0
## - fuel2                   1   1965.0 1999.0
## - Late_Arrival_o          1   1965.0 1999.0
## - Number_of_flights       1   1965.9 1999.9
## - Airport_Distance        1   1966.0 2000.0
## - security2               1   1966.0 2000.0
## - dist2                   1   1966.3 2000.3
## - Security_o              1   1966.5 2000.5
## <none>                        1964.9 2000.9
## - flights2                1   1967.1 2001.1
## - Support_Crew_Available  1   1967.5 2001.5
## - Baggage_loading_time    1   1970.3 2004.3
## - bag2                    1   1971.5 2005.5
## + Carrier                13   1957.6 2019.6
## - Weather                 1   1992.8 2026.8
## 
## Step:  AIC=1998.88
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Baggage_loading_time + Late_Arrival_o + 
##     Cleaning_o + Fueling_o + Security_o + flights2 + bag2 + dist2 + 
##     security2 + late2 + clean2 + fuel2
## 
##                          Df Deviance    AIC
## - Cleaning_o              1   1964.9 1996.9
## - late2                   1   1964.9 1996.9
## - clean2                  1   1964.9 1996.9
## - Fueling_o               1   1965.0 1997.0
## - fuel2                   1   1965.0 1997.0
## - Late_Arrival_o          1   1965.0 1997.0
## - Number_of_flights       1   1965.9 1997.9
## - Airport_Distance        1   1966.0 1998.0
## - security2               1   1966.0 1998.0
## - dist2                   1   1966.3 1998.3
## - Security_o              1   1966.5 1998.5
## <none>                        1964.9 1998.9
## - flights2                1   1967.1 1999.1
## + support2                1   1964.9 2000.9
## - Baggage_loading_time    1   1970.3 2002.3
## - bag2                    1   1971.5 2003.5
## + Carrier                13   1957.6 2017.6
## - Weather                 1   1992.8 2024.8
## - Support_Crew_Available  1   1997.3 2029.3
## 
## Step:  AIC=1996.89
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Baggage_loading_time + Late_Arrival_o + 
##     Fueling_o + Security_o + flights2 + bag2 + dist2 + security2 + 
##     late2 + clean2 + fuel2
## 
##                          Df Deviance    AIC
## - late2                   1   1964.9 1994.9
## - Fueling_o               1   1965.0 1995.0
## - fuel2                   1   1965.0 1995.0
## - Late_Arrival_o          1   1965.0 1995.0
## - clean2                  1   1965.2 1995.2
## - Number_of_flights       1   1965.9 1995.9
## - Airport_Distance        1   1966.0 1996.0
## - security2               1   1966.0 1996.0
## - dist2                   1   1966.3 1996.3
## - Security_o              1   1966.5 1996.5
## <none>                        1964.9 1996.9
## - flights2                1   1967.1 1997.1
## + Cleaning_o              1   1964.9 1998.9
## + support2                1   1964.9 1998.9
## - Baggage_loading_time    1   1970.3 2000.3
## - bag2                    1   1971.5 2001.5
## + Carrier                13   1957.6 2015.6
## - Weather                 1   1992.8 2022.8
## - Support_Crew_Available  1   1997.4 2027.4
## 
## Step:  AIC=1994.91
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Baggage_loading_time + Late_Arrival_o + 
##     Fueling_o + Security_o + flights2 + bag2 + dist2 + security2 + 
##     clean2 + fuel2
## 
##                          Df Deviance    AIC
## - Fueling_o               1   1965.0 1993.0
## - fuel2                   1   1965.0 1993.0
## - clean2                  1   1965.2 1993.2
## - Number_of_flights       1   1965.9 1993.9
## - security2               1   1966.0 1994.0
## - Airport_Distance        1   1966.0 1994.0
## - dist2                   1   1966.3 1994.3
## - Security_o              1   1966.5 1994.5
## <none>                        1964.9 1994.9
## - flights2                1   1967.1 1995.1
## + late2                   1   1964.9 1996.9
## + Cleaning_o              1   1964.9 1996.9
## + support2                1   1964.9 1996.9
## - Baggage_loading_time    1   1970.3 1998.3
## - bag2                    1   1971.5 1999.5
## + Carrier                13   1957.6 2013.6
## - Weather                 1   1992.8 2020.8
## - Support_Crew_Available  1   1997.4 2025.4
## - Late_Arrival_o          1   2071.6 2099.6
## 
## Step:  AIC=1992.99
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Baggage_loading_time + Late_Arrival_o + 
##     Security_o + flights2 + bag2 + dist2 + security2 + clean2 + 
##     fuel2
## 
##                          Df Deviance    AIC
## - fuel2                   1   1965.1 1991.1
## - clean2                  1   1965.3 1991.3
## - Number_of_flights       1   1966.0 1992.0
## - security2               1   1966.1 1992.1
## - Airport_Distance        1   1966.1 1992.1
## - dist2                   1   1966.5 1992.5
## - Security_o              1   1966.6 1992.6
## <none>                        1965.0 1993.0
## - flights2                1   1967.2 1993.2
## + Fueling_o               1   1964.9 1994.9
## + late2                   1   1965.0 1995.0
## + Cleaning_o              1   1965.0 1995.0
## + support2                1   1965.0 1995.0
## - Baggage_loading_time    1   1970.4 1996.4
## - bag2                    1   1971.6 1997.6
## + Carrier                13   1957.7 2011.7
## - Weather                 1   1992.9 2018.9
## - Support_Crew_Available  1   1997.5 2023.5
## - Late_Arrival_o          1   2071.6 2097.6
## 
## Step:  AIC=1991.06
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Baggage_loading_time + Late_Arrival_o + 
##     Security_o + flights2 + bag2 + dist2 + security2 + clean2
## 
##                          Df Deviance    AIC
## - clean2                  1   1965.4 1989.4
## - Number_of_flights       1   1966.1 1990.1
## - security2               1   1966.2 1990.2
## - Airport_Distance        1   1966.2 1990.2
## - dist2                   1   1966.5 1990.5
## - Security_o              1   1966.7 1990.7
## <none>                        1965.1 1991.1
## - flights2                1   1967.3 1991.3
## + fuel2                   1   1965.0 1993.0
## + Fueling_o               1   1965.0 1993.0
## + late2                   1   1965.0 1993.0
## + Cleaning_o              1   1965.1 1993.1
## + support2                1   1965.1 1993.1
## - Baggage_loading_time    1   1970.4 1994.4
## - bag2                    1   1971.6 1995.6
## + Carrier                13   1957.8 2009.8
## - Weather                 1   1992.9 2016.9
## - Support_Crew_Available  1   1997.6 2021.6
## - Late_Arrival_o          1   2071.6 2095.6
## 
## Step:  AIC=1989.39
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Baggage_loading_time + Late_Arrival_o + 
##     Security_o + flights2 + bag2 + dist2 + security2
## 
##                          Df Deviance    AIC
## - Number_of_flights       1   1966.4 1988.4
## - security2               1   1966.5 1988.5
## - Airport_Distance        1   1966.5 1988.5
## - dist2                   1   1966.9 1988.9
## - Security_o              1   1967.0 1989.0
## <none>                        1965.4 1989.4
## - flights2                1   1967.6 1989.6
## + clean2                  1   1965.1 1991.1
## + Cleaning_o              1   1965.1 1991.1
## + fuel2                   1   1965.3 1991.3
## + Fueling_o               1   1965.3 1991.3
## + late2                   1   1965.4 1991.4
## + support2                1   1965.4 1991.4
## - Baggage_loading_time    1   1970.7 1992.7
## - bag2                    1   1971.9 1993.9
## + Carrier                13   1958.2 2008.2
## - Weather                 1   1993.3 2015.3
## - Support_Crew_Available  1   1998.1 2020.1
## - Late_Arrival_o          1   2071.7 2093.7
## 
## Step:  AIC=1988.43
## Hour_Delay ~ Airport_Distance + Weather + Support_Crew_Available + 
##     Baggage_loading_time + Late_Arrival_o + Security_o + flights2 + 
##     bag2 + dist2 + security2
## 
##                          Df Deviance    AIC
## - security2               1   1967.5 1987.5
## - Airport_Distance        1   1967.6 1987.6
## - dist2                   1   1968.0 1988.0
## - Security_o              1   1968.0 1988.0
## <none>                        1966.4 1988.4
## + Number_of_flights       1   1965.4 1989.4
## + clean2                  1   1966.1 1990.1
## + Cleaning_o              1   1966.1 1990.1
## + fuel2                   1   1966.4 1990.4
## + Fueling_o               1   1966.4 1990.4
## + late2                   1   1966.4 1990.4
## + support2                1   1966.4 1990.4
## - Baggage_loading_time    1   1972.0 1992.0
## - bag2                    1   1973.2 1993.2
## + Carrier                13   1959.3 2007.3
## - Weather                 1   1994.5 2014.5
## - Support_Crew_Available  1   1999.3 2019.3
## - Late_Arrival_o          1   2072.3 2092.3
## - flights2                1   2425.0 2445.0
## 
## Step:  AIC=1987.46
## Hour_Delay ~ Airport_Distance + Weather + Support_Crew_Available + 
##     Baggage_loading_time + Late_Arrival_o + Security_o + flights2 + 
##     bag2 + dist2
## 
##                          Df Deviance    AIC
## - Airport_Distance        1   1968.6 1986.6
## - dist2                   1   1969.0 1987.0
## <none>                        1967.5 1987.5
## + security2               1   1966.4 1988.4
## + Number_of_flights       1   1966.5 1988.5
## - Security_o              1   1970.6 1988.6
## + clean2                  1   1967.1 1989.1
## + Cleaning_o              1   1967.2 1989.2
## + fuel2                   1   1967.4 1989.4
## + Fueling_o               1   1967.4 1989.4
## + late2                   1   1967.4 1989.4
## + support2                1   1967.5 1989.5
## - Baggage_loading_time    1   1973.0 1991.0
## - bag2                    1   1974.2 1992.2
## + Carrier                13   1960.2 2006.2
## - Weather                 1   1995.6 2013.6
## - Support_Crew_Available  1   2000.6 2018.6
## - Late_Arrival_o          1   2073.6 2091.6
## - flights2                1   2425.3 2443.3
## 
## Step:  AIC=1986.63
## Hour_Delay ~ Weather + Support_Crew_Available + Baggage_loading_time + 
##     Late_Arrival_o + Security_o + flights2 + bag2 + dist2
## 
##                          Df Deviance    AIC
## <none>                        1968.6 1986.6
## + Airport_Distance        1   1967.5 1987.5
## + security2               1   1967.6 1987.6
## + Number_of_flights       1   1967.7 1987.7
## - Security_o              1   1971.7 1987.7
## + clean2                  1   1968.2 1988.2
## + Cleaning_o              1   1968.3 1988.3
## + fuel2                   1   1968.5 1988.5
## + Fueling_o               1   1968.5 1988.5
## + late2                   1   1968.6 1988.6
## + support2                1   1968.6 1988.6
## - Baggage_loading_time    1   1974.4 1990.4
## - bag2                    1   1975.6 1991.6
## + Carrier                13   1961.3 2005.3
## - Weather                 1   1996.7 2012.7
## - Support_Crew_Available  1   2001.8 2017.8
## - dist2                   1   2019.7 2035.7
## - Late_Arrival_o          1   2075.1 2091.1
## - flights2                1   2426.7 2442.7
model.cube.both<-step(cube.model, direction = "both" , trace = 1)
## Start:  AIC=2024.98
## Hour_Delay ~ Carrier + Airport_Distance + Number_of_flights + 
##     Weather + Support_Crew_Available + Baggage_loading_time + 
##     Late_Arrival_o + Cleaning_o + Fueling_o + Security_o + flights2 + 
##     bag2 + dist2 + weather2 + security2 + support2 + late2 + 
##     clean2 + fuel2 + flights3 + bag3 + dist3 + weather3 + security3 + 
##     support3 + late3 + clean3 + fuel3
## 
## 
## Step:  AIC=2024.98
## Hour_Delay ~ Carrier + Airport_Distance + Number_of_flights + 
##     Weather + Support_Crew_Available + Baggage_loading_time + 
##     Late_Arrival_o + Cleaning_o + Fueling_o + Security_o + flights2 + 
##     bag2 + dist2 + weather2 + security2 + support2 + late2 + 
##     clean2 + fuel2 + flights3 + bag3 + dist3 + security3 + support3 + 
##     late3 + clean3 + fuel3
## 
## 
## Step:  AIC=2024.98
## Hour_Delay ~ Carrier + Airport_Distance + Number_of_flights + 
##     Weather + Support_Crew_Available + Baggage_loading_time + 
##     Late_Arrival_o + Cleaning_o + Fueling_o + Security_o + flights2 + 
##     bag2 + dist2 + security2 + support2 + late2 + clean2 + fuel2 + 
##     flights3 + bag3 + dist3 + security3 + support3 + late3 + 
##     clean3 + fuel3
## 
##                          Df Deviance    AIC
## - Carrier                13   1954.6 2006.6
## - dist3                   1   1947.0 2023.0
## - dist2                   1   1947.0 2023.0
## - Airport_Distance        1   1947.0 2023.0
## - Baggage_loading_time    1   1947.1 2023.1
## - bag2                    1   1947.1 2023.1
## - bag3                    1   1947.1 2023.1
## - support2                1   1947.5 2023.5
## - support3                1   1947.5 2023.5
## - Number_of_flights       1   1947.5 2023.5
## - flights2                1   1947.6 2023.6
## - flights3                1   1947.6 2023.6
## - Fueling_o               1   1948.0 2024.0
## - Cleaning_o              1   1948.1 2024.1
## - fuel2                   1   1948.1 2024.1
## - fuel3                   1   1948.1 2024.1
## - late3                   1   1948.1 2024.1
## - late2                   1   1948.1 2024.1
## - Late_Arrival_o          1   1948.2 2024.2
## - clean2                  1   1948.2 2024.2
## - clean3                  1   1948.4 2024.4
## - Support_Crew_Available  1   1948.8 2024.8
## <none>                        1947.0 2025.0
## - Security_o              1   1951.2 2027.2
## - security2               1   1952.2 2028.2
## - security3               1   1952.7 2028.7
## - Weather                 1   1976.6 2052.6
## 
## Step:  AIC=2006.62
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Baggage_loading_time + Late_Arrival_o + 
##     Cleaning_o + Fueling_o + Security_o + flights2 + bag2 + dist2 + 
##     security2 + support2 + late2 + clean2 + fuel2 + flights3 + 
##     bag3 + dist3 + security3 + support3 + late3 + clean3 + fuel3
## 
##                          Df Deviance    AIC
## - dist3                   1   1954.6 2004.6
## - dist2                   1   1954.6 2004.6
## - Airport_Distance        1   1954.6 2004.6
## - Baggage_loading_time    1   1954.7 2004.7
## - bag2                    1   1954.7 2004.7
## - bag3                    1   1954.7 2004.7
## - support2                1   1955.2 2005.2
## - support3                1   1955.2 2005.2
## - Number_of_flights       1   1955.2 2005.2
## - flights2                1   1955.2 2005.2
## - flights3                1   1955.3 2005.3
## - Fueling_o               1   1955.5 2005.5
## - fuel2                   1   1955.5 2005.5
## - fuel3                   1   1955.5 2005.5
## - Cleaning_o              1   1955.7 2005.7
## - late3                   1   1955.8 2005.8
## - late2                   1   1955.8 2005.8
## - Late_Arrival_o          1   1955.8 2005.8
## - clean2                  1   1955.8 2005.8
## - clean3                  1   1956.0 2006.0
## - Support_Crew_Available  1   1956.5 2006.5
## <none>                        1954.6 2006.6
## - Security_o              1   1958.7 2008.7
## - security2               1   1959.6 2009.6
## - security3               1   1960.2 2010.2
## + Carrier                13   1947.0 2025.0
## - Weather                 1   1984.6 2034.6
## 
## Step:  AIC=2004.63
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Baggage_loading_time + Late_Arrival_o + 
##     Cleaning_o + Fueling_o + Security_o + flights2 + bag2 + dist2 + 
##     security2 + support2 + late2 + clean2 + fuel2 + flights3 + 
##     bag3 + security3 + support3 + late3 + clean3 + fuel3
## 
##                          Df Deviance    AIC
## - Baggage_loading_time    1   1954.7 2002.7
## - bag2                    1   1954.7 2002.7
## - bag3                    1   1954.8 2002.8
## - support2                1   1955.2 2003.2
## - support3                1   1955.2 2003.2
## - Number_of_flights       1   1955.2 2003.2
## - flights2                1   1955.3 2003.3
## - flights3                1   1955.3 2003.3
## - Fueling_o               1   1955.5 2003.5
## - fuel2                   1   1955.5 2003.5
## - fuel3                   1   1955.5 2003.5
## - Airport_Distance        1   1955.7 2003.7
## - Cleaning_o              1   1955.7 2003.7
## - late3                   1   1955.8 2003.8
## - late2                   1   1955.8 2003.8
## - Late_Arrival_o          1   1955.8 2003.8
## - clean2                  1   1955.9 2003.9
## - clean3                  1   1956.0 2004.0
## - dist2                   1   1956.0 2004.0
## - Support_Crew_Available  1   1956.5 2004.5
## <none>                        1954.6 2004.6
## + dist3                   1   1954.6 2006.6
## - Security_o              1   1958.7 2006.7
## - security2               1   1959.6 2007.6
## - security3               1   1960.2 2008.2
## + Carrier                13   1947.0 2023.0
## - Weather                 1   1984.6 2032.6
## 
## Step:  AIC=2002.73
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Late_Arrival_o + Cleaning_o + Fueling_o + 
##     Security_o + flights2 + bag2 + dist2 + security2 + support2 + 
##     late2 + clean2 + fuel2 + flights3 + bag3 + security3 + support3 + 
##     late3 + clean3 + fuel3
## 
##                          Df Deviance    AIC
## - support2                1   1955.3 2001.3
## - support3                1   1955.3 2001.3
## - Number_of_flights       1   1955.4 2001.4
## - flights2                1   1955.4 2001.4
## - flights3                1   1955.4 2001.4
## - Fueling_o               1   1955.6 2001.6
## - fuel2                   1   1955.6 2001.6
## - fuel3                   1   1955.7 2001.7
## - Airport_Distance        1   1955.8 2001.8
## - Cleaning_o              1   1955.8 2001.8
## - late3                   1   1955.9 2001.9
## - late2                   1   1955.9 2001.9
## - Late_Arrival_o          1   1955.9 2001.9
## - clean2                  1   1956.0 2002.0
## - clean3                  1   1956.1 2002.1
## - dist2                   1   1956.1 2002.1
## - Support_Crew_Available  1   1956.6 2002.6
## <none>                        1954.7 2002.7
## + Baggage_loading_time    1   1954.6 2004.6
## + dist3                   1   1954.7 2004.7
## - Security_o              1   1958.8 2004.8
## - bag2                    1   1958.8 2004.8
## - security2               1   1959.7 2005.7
## - bag3                    1   1959.8 2005.8
## - security3               1   1960.3 2006.3
## + Carrier                13   1947.1 2021.1
## - Weather                 1   1984.7 2030.7
## 
## Step:  AIC=2001.3
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Late_Arrival_o + Cleaning_o + Fueling_o + 
##     Security_o + flights2 + bag2 + dist2 + security2 + late2 + 
##     clean2 + fuel2 + flights3 + bag3 + security3 + support3 + 
##     late3 + clean3 + fuel3
## 
##                          Df Deviance    AIC
## - support3                1   1955.3 1999.3
## - Number_of_flights       1   1956.0 2000.0
## - flights2                1   1956.0 2000.0
## - flights3                1   1956.0 2000.0
## - Fueling_o               1   1956.1 2000.1
## - fuel2                   1   1956.1 2000.1
## - fuel3                   1   1956.2 2000.2
## - Airport_Distance        1   1956.4 2000.4
## - late3                   1   1956.4 2000.4
## - late2                   1   1956.4 2000.4
## - Cleaning_o              1   1956.4 2000.4
## - Late_Arrival_o          1   1956.4 2000.4
## - clean2                  1   1956.6 2000.6
## - clean3                  1   1956.7 2000.7
## - dist2                   1   1956.7 2000.7
## <none>                        1955.3 2001.3
## + support2                1   1954.7 2002.7
## + Baggage_loading_time    1   1955.2 2003.2
## + dist3                   1   1955.3 2003.3
## - Security_o              1   1959.3 2003.3
## - bag2                    1   1959.4 2003.4
## - security2               1   1960.3 2004.3
## - bag3                    1   1960.4 2004.4
## - security3               1   1960.9 2004.9
## - Support_Crew_Available  1   1961.6 2005.6
## + Carrier                13   1947.6 2019.6
## - Weather                 1   1984.9 2028.9
## 
## Step:  AIC=1999.31
## Hour_Delay ~ Airport_Distance + Number_of_flights + Weather + 
##     Support_Crew_Available + Late_Arrival_o + Cleaning_o + Fueling_o + 
##     Security_o + flights2 + bag2 + dist2 + security2 + late2 + 
##     clean2 + fuel2 + flights3 + bag3 + security3 + late3 + clean3 + 
##     fuel3
## 
##                          Df Deviance    AIC
## - Number_of_flights       1   1956.0 1998.0
## - flights2                1   1956.0 1998.0
## - flights3                1   1956.0 1998.0
## - Fueling_o               1   1956.1 1998.1
## - fuel2                   1   1956.2 1998.2
## - fuel3                   1   1956.2 1998.2
## - Airport_Distance        1   1956.4 1998.4
## - late3                   1   1956.4 1998.4
## - late2                   1   1956.4 1998.4
## - Cleaning_o              1   1956.4 1998.4
## - Late_Arrival_o          1   1956.5 1998.5
## - clean2                  1   1956.6 1998.6
## - clean3                  1   1956.7 1998.7
## - dist2                   1   1956.8 1998.8
## <none>                        1955.3 1999.3
## + Baggage_loading_time    1   1955.2 2001.2
## + dist3                   1   1955.3 2001.3
## + support3                1   1955.3 2001.3
## + support2                1   1955.3 2001.3
## - Security_o              1   1959.3 2001.3
## - bag2                    1   1959.4 2001.4
## - security2               1   1960.3 2002.3
## - bag3                    1   1960.4 2002.4
## - security3               1   1960.9 2002.9
## + Carrier                13   1947.7 2017.7
## - Weather                 1   1984.9 2026.9
## - Support_Crew_Available  1   1987.7 2029.7
## 
## Step:  AIC=1997.99
## Hour_Delay ~ Airport_Distance + Weather + Support_Crew_Available + 
##     Late_Arrival_o + Cleaning_o + Fueling_o + Security_o + flights2 + 
##     bag2 + dist2 + security2 + late2 + clean2 + fuel2 + flights3 + 
##     bag3 + security3 + late3 + clean3 + fuel3
## 
##                          Df Deviance    AIC
## - flights2                1   1956.2 1996.2
## - Fueling_o               1   1956.8 1996.8
## - fuel2                   1   1956.8 1996.8
## - flights3                1   1956.8 1996.8
## - fuel3                   1   1956.9 1996.9
## - late3                   1   1957.1 1997.1
## - late2                   1   1957.1 1997.1
## - Airport_Distance        1   1957.1 1997.1
## - Late_Arrival_o          1   1957.1 1997.1
## - Cleaning_o              1   1957.1 1997.1
## - clean2                  1   1957.3 1997.3
## - clean3                  1   1957.4 1997.4
## - dist2                   1   1957.5 1997.5
## <none>                        1956.0 1998.0
## + Number_of_flights       1   1955.3 1999.3
## + Baggage_loading_time    1   1955.9 1999.9
## + dist3                   1   1956.0 2000.0
## + support3                1   1956.0 2000.0
## + support2                1   1956.0 2000.0
## - Security_o              1   1960.0 2000.0
## - bag2                    1   1960.1 2000.1
## - security2               1   1961.0 2001.0
## - bag3                    1   1961.2 2001.2
## - security3               1   1961.5 2001.5
## + Carrier                13   1948.3 2016.3
## - Weather                 1   1985.5 2025.5
## - Support_Crew_Available  1   1988.3 2028.3
## 
## Step:  AIC=1996.2
## Hour_Delay ~ Airport_Distance + Weather + Support_Crew_Available + 
##     Late_Arrival_o + Cleaning_o + Fueling_o + Security_o + bag2 + 
##     dist2 + security2 + late2 + clean2 + fuel2 + flights3 + bag3 + 
##     security3 + late3 + clean3 + fuel3
## 
##                          Df Deviance    AIC
## - Fueling_o               1   1957.0 1995.0
## - fuel2                   1   1957.0 1995.0
## - fuel3                   1   1957.1 1995.1
## - late3                   1   1957.3 1995.3
## - Airport_Distance        1   1957.3 1995.3
## - late2                   1   1957.3 1995.3
## - Late_Arrival_o          1   1957.3 1995.3
## - Cleaning_o              1   1957.4 1995.4
## - clean2                  1   1957.5 1995.5
## - clean3                  1   1957.6 1995.6
## - dist2                   1   1957.7 1995.7
## <none>                        1956.2 1996.2
## + flights2                1   1956.0 1998.0
## + Number_of_flights       1   1956.0 1998.0
## + Baggage_loading_time    1   1956.1 1998.1
## + dist3                   1   1956.2 1998.2
## + support3                1   1956.2 1998.2
## + support2                1   1956.2 1998.2
## - Security_o              1   1960.3 1998.3
## - bag2                    1   1960.4 1998.4
## - security2               1   1961.3 1999.3
## - bag3                    1   1961.5 1999.5
## - security3               1   1961.8 1999.8
## + Carrier                13   1948.5 2014.5
## - Weather                 1   1985.8 2023.8
## - Support_Crew_Available  1   1988.6 2026.6
## - flights3                1   2413.1 2451.1
## 
## Step:  AIC=1995
## Hour_Delay ~ Airport_Distance + Weather + Support_Crew_Available + 
##     Late_Arrival_o + Cleaning_o + Security_o + bag2 + dist2 + 
##     security2 + late2 + clean2 + fuel2 + flights3 + bag3 + security3 + 
##     late3 + clean3 + fuel3
## 
##                          Df Deviance    AIC
## - fuel2                   1   1957.1 1993.1
## - fuel3                   1   1957.1 1993.1
## - Airport_Distance        1   1958.0 1994.0
## - late3                   1   1958.1 1994.1
## - late2                   1   1958.1 1994.1
## - Late_Arrival_o          1   1958.1 1994.1
## - Cleaning_o              1   1958.1 1994.1
## - clean2                  1   1958.3 1994.3
## - dist2                   1   1958.4 1994.4
## - clean3                  1   1958.4 1994.4
## <none>                        1957.0 1995.0
## + Fueling_o               1   1956.2 1996.2
## + flights2                1   1956.8 1996.8
## + Number_of_flights       1   1956.8 1996.8
## + Baggage_loading_time    1   1956.9 1996.9
## + support3                1   1957.0 1997.0
## + dist3                   1   1957.0 1997.0
## + support2                1   1957.0 1997.0
## - Security_o              1   1961.2 1997.2
## - bag2                    1   1961.2 1997.2
## - security2               1   1962.2 1998.2
## - bag3                    1   1962.3 1998.3
## - security3               1   1962.8 1998.8
## + Carrier                13   1949.5 2013.5
## - Weather                 1   1986.5 2022.5
## - Support_Crew_Available  1   1989.6 2025.6
## - flights3                1   2413.7 2449.7
## 
## Step:  AIC=1993.12
## Hour_Delay ~ Airport_Distance + Weather + Support_Crew_Available + 
##     Late_Arrival_o + Cleaning_o + Security_o + bag2 + dist2 + 
##     security2 + late2 + clean2 + flights3 + bag3 + security3 + 
##     late3 + clean3 + fuel3
## 
##                          Df Deviance    AIC
## - fuel3                   1   1957.2 1991.2
## - Airport_Distance        1   1958.2 1992.2
## - late3                   1   1958.2 1992.2
## - late2                   1   1958.2 1992.2
## - Cleaning_o              1   1958.3 1992.3
## - Late_Arrival_o          1   1958.3 1992.3
## - clean2                  1   1958.4 1992.4
## - dist2                   1   1958.5 1992.5
## - clean3                  1   1958.5 1992.5
## <none>                        1957.1 1993.1
## + flights2                1   1956.9 1994.9
## + Number_of_flights       1   1956.9 1994.9
## + fuel2                   1   1957.0 1995.0
## + Baggage_loading_time    1   1957.0 1995.0
## + Fueling_o               1   1957.0 1995.0
## + support3                1   1957.1 1995.1
## + dist3                   1   1957.1 1995.1
## + support2                1   1957.1 1995.1
## - bag2                    1   1961.3 1995.3
## - Security_o              1   1961.3 1995.3
## - security2               1   1962.3 1996.3
## - bag3                    1   1962.4 1996.4
## - security3               1   1962.9 1996.9
## + Carrier                13   1949.6 2011.6
## - Weather                 1   1986.6 2020.6
## - Support_Crew_Available  1   1989.8 2023.8
## - flights3                1   2413.7 2447.7
## 
## Step:  AIC=1991.2
## Hour_Delay ~ Airport_Distance + Weather + Support_Crew_Available + 
##     Late_Arrival_o + Cleaning_o + Security_o + bag2 + dist2 + 
##     security2 + late2 + clean2 + flights3 + bag3 + security3 + 
##     late3 + clean3
## 
##                          Df Deviance    AIC
## - Airport_Distance        1   1958.3 1990.3
## - late3                   1   1958.3 1990.3
## - late2                   1   1958.3 1990.3
## - Cleaning_o              1   1958.3 1990.3
## - Late_Arrival_o          1   1958.3 1990.3
## - clean2                  1   1958.5 1990.5
## - clean3                  1   1958.6 1990.6
## - dist2                   1   1958.6 1990.6
## <none>                        1957.2 1991.2
## + flights2                1   1957.0 1993.0
## + Number_of_flights       1   1957.0 1993.0
## + Baggage_loading_time    1   1957.1 1993.1
## + fuel3                   1   1957.1 1993.1
## + fuel2                   1   1957.1 1993.1
## + Fueling_o               1   1957.1 1993.1
## + support3                1   1957.2 1993.2
## + dist3                   1   1957.2 1993.2
## + support2                1   1957.2 1993.2
## - bag2                    1   1961.4 1993.4
## - Security_o              1   1961.5 1993.5
## - security2               1   1962.4 1994.4
## - bag3                    1   1962.5 1994.5
## - security3               1   1963.0 1995.0
## + Carrier                13   1949.7 2009.7
## - Weather                 1   1986.6 2018.6
## - Support_Crew_Available  1   1990.0 2022.0
## - flights3                1   2415.3 2447.3
## 
## Step:  AIC=1990.27
## Hour_Delay ~ Weather + Support_Crew_Available + Late_Arrival_o + 
##     Cleaning_o + Security_o + bag2 + dist2 + security2 + late2 + 
##     clean2 + flights3 + bag3 + security3 + late3 + clean3
## 
##                          Df Deviance    AIC
## - late3                   1   1959.3 1989.3
## - late2                   1   1959.3 1989.3
## - Late_Arrival_o          1   1959.4 1989.4
## - Cleaning_o              1   1959.4 1989.4
## - clean2                  1   1959.6 1989.6
## - clean3                  1   1959.8 1989.8
## <none>                        1958.3 1990.3
## + Airport_Distance        1   1957.2 1991.2
## + dist3                   1   1957.2 1991.2
## + flights2                1   1958.0 1992.0
## + Number_of_flights       1   1958.1 1992.1
## + Baggage_loading_time    1   1958.2 1992.2
## + fuel3                   1   1958.2 1992.2
## + fuel2                   1   1958.2 1992.2
## + Fueling_o               1   1958.2 1992.2
## + support3                1   1958.2 1992.2
## + support2                1   1958.3 1992.3
## - Security_o              1   1962.5 1992.5
## - bag2                    1   1962.6 1992.6
## - security2               1   1963.5 1993.5
## - bag3                    1   1963.7 1993.7
## - security3               1   1964.1 1994.1
## + Carrier                13   1950.7 2008.7
## - Weather                 1   1987.7 2017.7
## - Support_Crew_Available  1   1991.1 2021.1
## - dist2                   1   2008.8 2038.8
## - flights3                1   2416.6 2446.6
## 
## Step:  AIC=1989.33
## Hour_Delay ~ Weather + Support_Crew_Available + Late_Arrival_o + 
##     Cleaning_o + Security_o + bag2 + dist2 + security2 + late2 + 
##     clean2 + flights3 + bag3 + security3 + clean3
## 
##                          Df Deviance    AIC
## - late2                   1   1959.3 1987.3
## - Late_Arrival_o          1   1959.5 1987.5
## - Cleaning_o              1   1960.5 1988.5
## - clean2                  1   1960.7 1988.7
## - clean3                  1   1960.8 1988.8
## <none>                        1959.3 1989.3
## + late3                   1   1958.3 1990.3
## + Airport_Distance        1   1958.3 1990.3
## + dist3                   1   1958.3 1990.3
## + flights2                1   1959.1 1991.1
## + Number_of_flights       1   1959.1 1991.1
## + Baggage_loading_time    1   1959.2 1991.2
## + fuel3                   1   1959.2 1991.2
## + fuel2                   1   1959.2 1991.2
## + Fueling_o               1   1959.3 1991.3
## + support3                1   1959.3 1991.3
## + support2                1   1959.3 1991.3
## - Security_o              1   1963.5 1991.5
## - bag2                    1   1963.7 1991.7
## - security2               1   1964.5 1992.5
## - bag3                    1   1964.8 1992.8
## - security3               1   1965.1 1993.1
## + Carrier                13   1951.7 2007.7
## - Weather                 1   1988.5 2016.5
## - Support_Crew_Available  1   1992.0 2020.0
## - dist2                   1   2009.5 2037.5
## - flights3                1   2419.3 2447.3
## 
## Step:  AIC=1987.35
## Hour_Delay ~ Weather + Support_Crew_Available + Late_Arrival_o + 
##     Cleaning_o + Security_o + bag2 + dist2 + security2 + clean2 + 
##     flights3 + bag3 + security3 + clean3
## 
##                          Df Deviance    AIC
## - Cleaning_o              1   1960.5 1986.5
## - clean2                  1   1960.7 1986.7
## - clean3                  1   1960.8 1986.8
## <none>                        1959.3 1987.3
## + Airport_Distance        1   1958.3 1988.3
## + dist3                   1   1958.3 1988.3
## + flights2                1   1959.1 1989.1
## + Number_of_flights       1   1959.2 1989.2
## + Baggage_loading_time    1   1959.2 1989.2
## + fuel3                   1   1959.2 1989.2
## + fuel2                   1   1959.3 1989.3
## + Fueling_o               1   1959.3 1989.3
## + support3                1   1959.3 1989.3
## + late2                   1   1959.3 1989.3
## + late3                   1   1959.3 1989.3
## + support2                1   1959.3 1989.3
## - Security_o              1   1963.5 1989.5
## - bag2                    1   1963.7 1989.7
## - security2               1   1964.5 1990.5
## - bag3                    1   1964.8 1990.8
## - security3               1   1965.1 1991.1
## + Carrier                13   1951.8 2005.8
## - Weather                 1   1988.6 2014.6
## - Support_Crew_Available  1   1992.0 2018.0
## - dist2                   1   2009.5 2035.5
## - Late_Arrival_o          1   2065.0 2091.0
## - flights3                1   2419.7 2445.7
## 
## Step:  AIC=1986.52
## Hour_Delay ~ Weather + Support_Crew_Available + Late_Arrival_o + 
##     Security_o + bag2 + dist2 + security2 + clean2 + flights3 + 
##     bag3 + security3 + clean3
## 
##                          Df Deviance    AIC
## - clean2                  1   1960.7 1984.7
## - clean3                  1   1960.8 1984.8
## <none>                        1960.5 1986.5
## + Cleaning_o              1   1959.3 1987.3
## + Airport_Distance        1   1959.5 1987.5
## + dist3                   1   1959.5 1987.5
## + flights2                1   1960.3 1988.3
## + Number_of_flights       1   1960.3 1988.3
## + Baggage_loading_time    1   1960.4 1988.4
## + fuel3                   1   1960.4 1988.4
## + fuel2                   1   1960.5 1988.5
## + Fueling_o               1   1960.5 1988.5
## + support3                1   1960.5 1988.5
## - Security_o              1   1964.5 1988.5
## + late2                   1   1960.5 1988.5
## + late3                   1   1960.5 1988.5
## + support2                1   1960.5 1988.5
## - bag2                    1   1964.9 1988.9
## - security2               1   1965.5 1989.5
## - bag3                    1   1966.0 1990.0
## - security3               1   1966.0 1990.0
## + Carrier                13   1952.9 2004.9
## - Weather                 1   1989.4 2013.4
## - Support_Crew_Available  1   1993.5 2017.5
## - dist2                   1   2011.0 2035.0
## - Late_Arrival_o          1   2065.9 2089.9
## - flights3                1   2419.9 2443.9
## 
## Step:  AIC=1984.7
## Hour_Delay ~ Weather + Support_Crew_Available + Late_Arrival_o + 
##     Security_o + bag2 + dist2 + security2 + flights3 + bag3 + 
##     security3 + clean3
## 
##                          Df Deviance    AIC
## - clean3                  1   1961.2 1983.2
## <none>                        1960.7 1984.7
## + Airport_Distance        1   1959.6 1985.6
## + dist3                   1   1959.6 1985.6
## + flights2                1   1960.5 1986.5
## + Number_of_flights       1   1960.5 1986.5
## + clean2                  1   1960.5 1986.5
## + Baggage_loading_time    1   1960.6 1986.6
## + fuel3                   1   1960.6 1986.6
## + fuel2                   1   1960.6 1986.6
## + Fueling_o               1   1960.7 1986.7
## + support3                1   1960.7 1986.7
## - Security_o              1   1964.7 1986.7
## + Cleaning_o              1   1960.7 1986.7
## + late2                   1   1960.7 1986.7
## + late3                   1   1960.7 1986.7
## + support2                1   1960.7 1986.7
## - bag2                    1   1965.2 1987.2
## - security2               1   1965.6 1987.6
## - security3               1   1966.3 1988.3
## - bag3                    1   1966.3 1988.3
## + Carrier                13   1953.1 2003.1
## - Weather                 1   1989.6 2011.6
## - Support_Crew_Available  1   1993.6 2015.6
## - dist2                   1   2011.1 2033.1
## - Late_Arrival_o          1   2066.4 2088.4
## - flights3                1   2420.9 2442.9
## 
## Step:  AIC=1983.18
## Hour_Delay ~ Weather + Support_Crew_Available + Late_Arrival_o + 
##     Security_o + bag2 + dist2 + security2 + flights3 + bag3 + 
##     security3
## 
##                          Df Deviance    AIC
## <none>                        1961.2 1983.2
## + Airport_Distance        1   1960.0 1984.0
## + dist3                   1   1960.1 1984.1
## + clean3                  1   1960.7 1984.7
## + clean2                  1   1960.8 1984.8
## + Cleaning_o              1   1960.8 1984.8
## + flights2                1   1960.9 1984.9
## + Number_of_flights       1   1961.0 1985.0
## + Baggage_loading_time    1   1961.1 1985.1
## + fuel3                   1   1961.1 1985.1
## + fuel2                   1   1961.1 1985.1
## + Fueling_o               1   1961.1 1985.1
## + support3                1   1961.1 1985.1
## + late2                   1   1961.2 1985.2
## + late3                   1   1961.2 1985.2
## + support2                1   1961.2 1985.2
## - Security_o              1   1965.2 1985.2
## - bag2                    1   1965.6 1985.6
## - security2               1   1966.2 1986.2
## - bag3                    1   1966.7 1986.7
## - security3               1   1966.8 1986.8
## + Carrier                13   1953.7 2001.7
## - Weather                 1   1990.1 2010.1
## - Support_Crew_Available  1   1994.1 2014.1
## - dist2                   1   2011.6 2031.6
## - Late_Arrival_o          1   2066.6 2086.6
## - flights3                1   2422.7 2442.7

The two selection processes settled on largely the same variables as our first-order model, but also included higher order terms for some of those same variables. We do not want a final model to contain the same variable at multiple levels, as it could give undue weight to those variables. However, this may indicate a strong connection between the reiterated terms and the variable of interest, we will create a reduced model with those.

2.d Candidate Models

Based on the previous results, we have 2 Candidate models, as well as our initial model:

can.reg<-glm(Hour_Delay ~ Security_o + Airport_Distance + Number_of_flights + Support_Crew_Available + Weather + Baggage_loading_time + Late_Arrival_o , family = binomial, data=flight.hour)

can.min<-glm(Hour_Delay ~ Airport_Distance + Number_of_flights + Baggage_loading_time + Security_o , family = binomial , data=flight.hour)

We will look at the significance of each coefficient, of each model, as an initial comparison:

full.table=summary(initial.model)$coef
kable(full.table, caption = "Significance of each coefficient in the full model")
Significance of each coefficient in the full model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -102.9257623 4.2859982 -24.0144202 0.0000000
CarrierAA 0.0628418 0.3963437 0.1585538 0.8740205
CarrierAS -0.2516016 1.4132306 -0.1780330 0.8586971
CarrierB6 -0.1677252 0.3783899 -0.4432602 0.6575776
CarrierDL -0.3374653 0.3865832 -0.8729433 0.3826940
CarrierEV -0.2168769 0.3862913 -0.5614335 0.5745020
CarrierF9 -1.2389646 1.3356440 -0.9276159 0.3536068
CarrierFL -0.8452742 0.6406167 -1.3194695 0.1870122
CarrierHA -3.8793661 4.6055613 -0.8423221 0.3996077
CarrierMQ -0.3142455 0.4121902 -0.7623799 0.4458333
CarrierUA -0.2084461 0.3752656 -0.5554627 0.5785782
CarrierUS -0.1524217 0.4372685 -0.3485770 0.7274069
CarrierVX -0.4453947 0.5916292 -0.7528275 0.4515536
CarrierWN -0.2453927 0.4772950 -0.5141321 0.6071596
Airport_Distance 0.0272436 0.0038857 7.0111599 0.0000000
Number_of_flights 0.0006873 0.0000378 18.2063162 0.0000000
Weather 0.6592219 0.1265406 5.2095670 0.0000002
Support_Crew_Available -0.0081644 0.0014465 -5.6444012 0.0000000
Baggage_loading_time 2.4382013 0.1787911 13.6371526 0.0000000
Late_Arrival_o 0.9455250 0.0954982 9.9009740 0.0000000
Cleaning_o 0.0095363 0.0163363 0.5837477 0.5593900
Fueling_o -0.0051694 0.0161900 -0.3192934 0.7495040
Security_o 0.0140206 0.0080607 1.7393731 0.0819692
reg.table=summary(can.reg)$coef
kable(reg.table, caption = "Significance of each coefficient in the reduced model")
Significance of each coefficient in the reduced model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -102.6831868 4.2206844 -24.328563 0.0000000
Security_o 0.0138565 0.0080020 1.731626 0.0833401
Airport_Distance 0.0268672 0.0038634 6.954268 0.0000000
Number_of_flights 0.0006841 0.0000375 18.238906 0.0000000
Support_Crew_Available -0.0082071 0.0014400 -5.699314 0.0000000
Weather 0.6593858 0.1260339 5.231814 0.0000002
Baggage_loading_time 2.4304099 0.1778089 13.668666 0.0000000
Late_Arrival_o 0.9436507 0.0945881 9.976420 0.0000000
min.table=summary(can.min)$coef
kable(min.table, caption = "Significance of each coefficient in the minimal model")
Significance of each coefficient in the minimal model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -91.1238012 3.7056368 -24.590592 0.0000000
Airport_Distance 0.0290380 0.0036613 7.931082 0.0000000
Number_of_flights 0.0007462 0.0000359 20.793681 0.0000000
Baggage_loading_time 2.7358787 0.1717724 15.927352 0.0000000
Security_o 0.0151718 0.0076559 1.981719 0.0475107

As expected, the full model shows many variables with insignificant p values. The minimal model, while exclusively significant values, does not have as many strong predictors as the mid-sized model. The mid-sized model has Security time not achieving statistical significance, but is fairly close.

As expected, almost all variables show a positive relationship with probability of longer delays. This should not come as a surprise, as most variables are number of minutes spent at some stage before a flight takes off (security, baggage loading time, etc). The number of support staff is negatively correlated, which also makes sense, as more hands available should lead to faster performance of those (and other, possibly unrecorded) tasks.

2.e Prediction

To test the similarity of the candidates, we will create a sample dataset by choosing values for each variable. We will then use R to predict the Delay for each sample flight in both the initial and candidate model. We will then compare the two models’ results.

samp.data=data.frame(Carrier=c("UA", "AA", "DL", "EV", "VX"), Airport_Distance=c(350, 500, 445, 430, 450), Number_of_flights=c(30123, 25001, 43333, 55757, 45454), Weather=c(5, 5, 2, 6, 9), Support_Crew_Available=c(17, 300, 85, 92, 71), Baggage_loading_time=c(14.5, 20, 16, 12, 17), Late_Arrival_o=c(10, 20, 15, 19, 18), Cleaning_o=c(25, 6, 8, 10, 10), Fueling_o=c(19.5, 22, 25, 27, 24), Security_o=c(40, 38, 18, 40, 26) )

pred.ini=predict(initial.model, newdata=samp.data , type = "response")
cut.off.prob = 0.5
pred.response.ini = ifelse(pred.ini > cut.off.prob, 1, 0)

pred.can=predict(can.reg, newdata=samp.data , type = "response")
cut.off.prob = 0.5
pred.response.can = ifelse(pred.can > cut.off.prob, 1, 0)

combined.table=cbind(samp.data, Initial_Model=pred.response.ini , Candidate_Model=pred.response.can)

kable(combined.table, caption = "Predicted Delay Time of Each Model Based on Sample Data")
Predicted Delay Time of Each Model Based on Sample Data
Carrier Airport_Distance Number_of_flights Weather Support_Crew_Available Baggage_loading_time Late_Arrival_o Cleaning_o Fueling_o Security_o Initial_Model Candidate_Model
UA 350 30123 5 17 14.5 10 25 19.5 40 0 0
AA 500 25001 5 300 20.0 20 6 22.0 38 0 0
DL 445 43333 2 85 16.0 15 8 25.0 18 0 0
EV 430 55757 6 92 12.0 19 10 27.0 40 0 0
VX 450 45454 9 71 17.0 18 10 24.0 26 1 1

As demonstrated in the table, the models have similar predictions despite very extreme x values. It is our belief that the candidate model is ideal, as it generates comparable predictions at a fraction of the memory. This advantage will only continue to grow as larger datasets are used with this same model.

2.f Model Building Conclusions

In this section we utilized different strategies to optimize our predictive model and arrived at a reduced model formula. We were able to test and rule out different transformations of variables, compare possible models, and select a ‘champion’ model to be compared against other model building strategies in subsequent sections.

can.df<-subset(flight.hour , select=c('Hour_Delay' , 'Security_o' , 'Airport_Distance' , 'Number_of_flights' , 'Support_Crew_Available' , 'Weather' , 'Baggage_loading_time' , 'Late_Arrival_o'))

A copy of this data is now publicly available at: https://raw.githubusercontent.com/AlexDragonetti/flightcandidate/main/can.csv

3. Training, Cross Validation, and Testing the Model

To train our model, we will only a portion of the data to determine an optimal cut-off probability and then test it on the remaining portion.

3.a Separating and Training the Model

First, we need to create separate dataframes for testing and training purposes.

nn = dim(can.df)[1]
train.id = sample(1:nn, round(nn*0.7), replace = FALSE) 
training = can.df[train.id,]
testing = can.df[-train.id,]

Next, we ‘train’ the model by finding the most accurate cutoff probability. We do this by splitting our training data into 5 ‘folds’. Next, we use different combinations of those folds to test 20 different cutoff points. Finally, we see which cutoff point had the highest combined probability in each combination. This is called cross-validation.

n0 = dim(training)[1]/5
cut.0ff.prob = seq(0,1, length = 22)[-c(1,22)]

pred.accuracy = matrix(0,ncol=20, nrow=5, byrow = T)  

for (i in 1:5){
  valid.id = ((i-1)*n0 + 1):(i*n0)
  valid.data = training[valid.id,]
  train.data = training[-valid.id,]
  train.model = glm(Hour_Delay ~ Security_o + Number_of_flights + Airport_Distance + Support_Crew_Available + Weather + Baggage_loading_time + Late_Arrival_o , family = binomial(link = logit), data = can.df)
  
  newdata = data.frame(Security_o = valid.data$Security_o , Number_of_flights=valid.data$Number_of_flights , Airport_Distance=valid.data$Airport_Distance , Support_Crew_Available=valid.data$Support_Crew_Available , Weather=valid.data$Weather , Baggage_loading_time=valid.data$Baggage_loading_time , Late_Arrival_o=valid.data$Late_Arrival_o )
  
  pred.prob = predict.glm(train.model, newdata, type = "response")

  
  for(j in 1:20){
    pred.status = rep(0,length(pred.prob))
    valid.data$pred.status = as.numeric(pred.prob >cut.0ff.prob[j])
    a11 = sum(valid.data$pred.status == valid.data$Hour_Delay)
    pred.accuracy[i,j] = a11/length(pred.prob)
  }
}


avg.accuracy = apply(pred.accuracy, 2, mean)
max.id = which(avg.accuracy ==max(avg.accuracy ))


tick.label = as.character(round(cut.0ff.prob,2))
plot(1:20, avg.accuracy, type = "b",
     xlim=c(1,20), 
     ylim=c(0.5,1), 
     axes = FALSE,
     xlab = "Cut-off Probability",
     ylab = "Accuracy",
     main = "5-fold CV performance"
     )
mtext('Figure 7: CV performance' , side=1, line=4, cex=0.5)
axis(1, at=1:20, label = tick.label, las = 2)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "purple2")
text(max.id, avg.accuracy[max.id]+0.03, as.character(round(avg.accuracy[max.id],4)), col = "purple2", cex = 0.8)

The above shows the model’s optimal cutoff probability and accuracy.

3.b Testing our Cut-off Probability

We now will test this cutoff probability against the separate testing dataframe. This probability should be different from our previous probabilities, as the testing data has not been used in its calculation. Note that, because the training/testing data split is different every time the code is run (including to make this document), it is impossible to know what the optimal cutoff point would be without setting a seed (which then isn’t random). Because of this, we will simply test the cutoff probability of .52 for demonstration purposes.

test.model = glm(Hour_Delay ~ Security_o + Airport_Distance + Number_of_flights + Support_Crew_Available + Weather + Baggage_loading_time + Late_Arrival_o , family = binomial (link=logit), data = training)

newdata = data.frame(Hour_Delay= testing$Hour_Delay , Security_o = testing$Security_o , Number_of_flights=testing$Number_of_flights ,                 Airport_Distance=testing$Airport_Distance ,                     Support_Crew_Available=testing$Support_Crew_Available , Weather=testing$Weather , Baggage_loading_time=testing$Baggage_loading_time , Late_Arrival_o=testing$Late_Arrival_o)

pred.prob.test = predict.glm(test.model, newdata, type = "response")
testing$test.status = as.numeric(pred.prob.test > 0.52)
a11 = sum(testing$test.status == testing$Hour_Delay)
test.accuracy = a11/length(pred.prob.test)
kable(as.data.frame(test.accuracy), align='c')
test.accuracy
0.8729128

Above is our model’s accuracy, when applied to the testing data subset.

3.c Specificity and Sensitivity of our Model

Specificity and Sensitivity are two important pieces of information about our predictive model. The Sensitivity is the percentage of positive predictions that are correct: “How many of our predicted yeses are correct?”

The Specificity answers the opposite. It is the percentage of correct negative guesses: “How many of our predicted nos are correct?”

We will use the newdata dataframe generated in the previous section to split our guesses into four groups:

p0.a0 - the number of correct nos p0.a1 - the number of incorrect nos p1.a0 - the number of incorrect yeses p1.a1 - the number of correct yeses

pred.prob.test = predict.glm(test.model, newdata, type = "response")
testing$test.status = as.numeric(pred.prob.test > 0.52)
p0.a0 = sum(testing$test.status ==0 & testing$Hour_Delay==0)
p0.a1 = sum(testing$test.status ==0 & testing$Hour_Delay ==1)
p1.a0 = sum(testing$test.status ==1 & testing$Hour_Delay==0)
p1.a1 = sum(testing$test.status ==1 & testing$Hour_Delay ==1)

sensitivity = p1.a1 / (p1.a1 + p0.a1)
specificity = p0.a0 / (p0.a0 + p1.a0)

precision = p1.a1 / (p1.a1 + p1.a0)
recall = sensitivity
F1 = 2*precision*recall/(precision + recall)
metric.list = cbind(sensitivity = sensitivity, 
                    specificity = specificity, 
                    precision = precision,
                    recall = recall,
                    F1 = F1)
kable(as.data.frame(metric.list), align='c', caption = "Local performance metrics")
Local performance metrics
sensitivity specificity precision recall F1
0.9246988 0.7898551 0.8758916 0.9246988 0.8996337

From this, we can see that our sensitivity is about 91%, while our specificity is only 81%. Our model is better at predicting the presence of an hour delay than the lack of an hour delay. If we wanted to adjust the cutoff point to increase the specificity, we could use an ROC curve to establish an ideal point while factoring in both sensitivity and specificity.

3.d Generating and ROC (Reciever Operating Characteristic) Curve

An ROC curve can help us make decisions about a cutoff point. While our cutoff point is optimal for success, we may prefer a cutoff point with a higher sensitivity (correct designation of positive, which is in this case, the presence of an hour+ delay) or specificity (correct prediction of a negative (delay under an hour). An ROC curve shows the concurrent sensitivity and specificity, allowing one to pick and choose what an ideal balance would be. It also allows us to measure the Area Under the Curve (AUC), which allows us to measure the overall performance of a model. A theoretical perfect AUC is 1, while a theoretical random guess model’s AUC is .5 (imagine your model was flipping a coin, and if heads, you say ‘yes, the plane will be an hour late’. That is the .5 random guess model).

cut.off.seq = seq(0, 1, length = 20)
sensitivity.vec = NULL
specificity.vec = NULL
for (i in 1:20){
  testing$test.status = as.numeric(pred.prob.test > cut.off.seq[i])

p0.a0 = sum(testing$test.status ==0 & testing$Hour_Delay==0)
p0.a1 = sum(testing$test.status ==0 & testing$Hour_Delay ==1)
p1.a0 = sum(testing$test.status ==1 & testing$Hour_Delay==0)
p1.a1 = sum(testing$test.status ==1 & testing$Hour_Delay ==1)

sensitivity.vec[i] = p1.a1 / (p1.a1 + p0.a1)
specificity.vec[i] = p0.a0 / (p0.a0 + p1.a0)
}


one.minus.spec = c(1,1 - specificity.vec)
sens.vec = c(1,sensitivity.vec)

par(pty = "s")  
plot(one.minus.spec, sens.vec, type = "l", xlim = c(0,1), ylim = c(0,1) ,
     xlab ="1 - specificity",
     ylab = "sensitivity",
     main = "ROC curve of Logistic Regression Model",
     lwd = 2,
     col = "blue", )
mtext('Figure 8: Model ROC Curve' , side=1, line=4, cex=0.5)
segments(0,0,1,1, col = "red", lty = 2, lwd = 2)
AUC = round(sum(sens.vec*(one.minus.spec[-101]-one.minus.spec[-1])),4)
text(0.8, 0.3, paste("AUC = ", AUC), col = "blue")

The above graph was generated with a series of cutoff thresholds and could be used to determine the ideal cutoff point, if specificity or sensitivity need to be accommodated.

3.e Testing and Training Conclusion

In this section we established a baseline understanding of ROC/AUC and optimal cutoff points, which we will use to compare our champion model against future model building strategies.

4. Neural Network Predictions

Another possible means to an ideal cutoff could could be creating a Neural Network (NN) model. Creating an NN model, code-wise, will be very similar to the cross validation done in the previous section.

4.a Normalizing Variables

For good results using the neuralnet function, we will normalize our numerical variables. Our scaling formula for each numeric value is as follows:

\(\hat{x}_{ij} = \frac{x_{ij} - min value_{i}(x_{ij})}{range(x_{ij})}\)

Meaning that the scale X values will be arrived at by starting with the original x, subtracting the lowest value of that variable, then diving the difference by the difference between the highest and lowest x value. We will only do this for numerical variables, not dummy variables, as it will help mitigate variance. As dummy variables are only taking on values of 0 or 1, it would not change much, but it will be import for variables like airport distance (which has a range in the hundreds ).

flight.h2<-flight.hour

flight.h2$Airport_Distance = (flight.h2$Airport_Distance-min(flight.h2$Airport_Distance))/(max(flight.h2$Airport_Distance)-min(flight.h2$Airport_Distance))
flight.h2$Number_of_flights = (flight.h2$Number_of_flights-min(flight.h2$Number_of_flights))/(max(flight.h2$Number_of_flights)-min(flight.h2$Number_of_flights))
flight.h2$Weather = (flight.h2$Weather-min(flight.h2$Weather))/(max(flight.h2$Weather)-min(flight.h2$Weather))
flight.h2$Support_Crew_Available = (flight.h2$Support_Crew_Available-min(flight.h2$Support_Crew_Available))/(max(flight.h2$Support_Crew_Available)-min(flight.h2$Support_Crew_Available))
flight.h2$Baggage_loading_time = (flight.h2$Baggage_loading_time-min(flight.h2$Baggage_loading_time))/(max(flight.h2$Baggage_loading_time)-min(flight.h2$Baggage_loading_time))
flight.h2$Late_Arrival_o = (flight.h2$Late_Arrival_o-min(flight.h2$Late_Arrival_o))/(max(flight.h2$Late_Arrival_o)-min(flight.h2$Late_Arrival_o))
flight.h2$Cleaning_o = (flight.h2$Cleaning_o-min(flight.h2$Cleaning_o))/(max(flight.h2$Cleaning_o)-min(flight.h2$Cleaning_o))
flight.h2$Fueling_o = (flight.h2$Fueling_o-min(flight.h2$Fueling_o))/(max(flight.h2$Fueling_o)-min(flight.h2$Fueling_o))
flight.h2$Security_o = (flight.h2$Security_o-min(flight.h2$Security_o))/(max(flight.h2$Security_o)-min(flight.h2$Security_o))

4.b Preparing Dataset for NN

Now, we will organize our standardized, complete, flight.hour dataframe in such a way that it can be read as a model formula. For conveneince, we will make a new dataset and adhere to CamelCase naming conventions, where words in the phrase are separated but capital letters, allowing us to avoid spaces.

flightmtx<-model.matrix(~. , data=flight.h2)
colnames(flightmtx)
##  [1] "(Intercept)"            "CarrierAA"              "CarrierAS"             
##  [4] "CarrierB6"              "CarrierDL"              "CarrierEV"             
##  [7] "CarrierF9"              "CarrierFL"              "CarrierHA"             
## [10] "CarrierMQ"              "CarrierUA"              "CarrierUS"             
## [13] "CarrierVX"              "CarrierWN"              "Airport_Distance"      
## [16] "Number_of_flights"      "Weather"                "Support_Crew_Available"
## [19] "Baggage_loading_time"   "Late_Arrival_o"         "Cleaning_o"            
## [22] "Fueling_o"              "Security_o"             "Hour_Delay"

Many variable names are workable. We will keep all of the carrier names as is and rename the others.

colnames(flightmtx)[15]<-"AirportDistance"
colnames(flightmtx)[16]<-"NumberOfFlights"
colnames(flightmtx)[17]<-"Weather"
colnames(flightmtx)[18]<-"SupportCrewAvailable"
colnames(flightmtx)[19]<-"BaggageLoadingTime"
colnames(flightmtx)[20]<-"LateArrival"
colnames(flightmtx)[21]<-"Cleaning"
colnames(flightmtx)[22]<-"Fueling"
colnames(flightmtx)[23]<-"Security"
colnames(flightmtx)[24]<-"HourDelay"

Now that our variables are renamed and scaled, we will write the model as an object called modelFormula which we can call upon to put into code at any time. Note that we are dropping the intercept, as our NN model will generate its own.

columnNames = colnames(flightmtx)
columnList = paste(columnNames[-c(1,length(columnNames))], collapse = "+")
columnList = paste(c(columnNames[length(columnNames)],"~",columnList), collapse="")
modelFormula = formula(columnList)

4.c Splitting Data into Training and Testing

Just like the previous section, we will split our dataset into training and testing data for the purpose of cross-validation. See the previous section for a more detailed explanation.

n = dim(flightmtx)[1]
testID = sample(1:n, round(n*0.7), replace = FALSE)
testDat = flightmtx[testID,]
trainDat = flightmtx[-testID,]
train.Dat=as.data.frame(trainDat[,-1])

4.d Building the NN Model

Our NN model will use the modelFormula object that we coded earlier. Aside from the use of the modelFormula object, the neuralnet command begins similarly to our glm code in the previous section. It differs by adding commands afterwards, such as the hidden layers and the use of the rprop+ algorithim, telling the NN to use Backpropagation. Backpropagation is a means to improve predictive accuracy by calculating the gradient of a loss function (which we would call the Mean Square Error in classical statistics) of weights within the NN and using them to update other weights.

NetworkModel = neuralnet(HourDelay ~ .,
data = train.Dat,hidden = 1, rep = 1, threshold = 0.01, learningrate = 0.1, linear.output = TRUE, algorithm = "rprop+")

kable(NetworkModel$result.matrix)
error 43.6146297
reached.threshold 0.0095573
steps 8722.0000000
Intercept.to.1layhid1 -27.5409068
CarrierAA.to.1layhid1 1.5738414
CarrierAS.to.1layhid1 0.8628386
CarrierB6.to.1layhid1 0.9942199
CarrierDL.to.1layhid1 1.0810445
CarrierEV.to.1layhid1 0.9387980
CarrierF9.to.1layhid1 -1.5733892
CarrierFL.to.1layhid1 0.7337419
CarrierHA.to.1layhid1 -3.8215925
CarrierMQ.to.1layhid1 1.2465042
CarrierUA.to.1layhid1 1.0271042
CarrierUS.to.1layhid1 2.2616540
CarrierVX.to.1layhid1 2.6885478
CarrierWN.to.1layhid1 1.0077988
AirportDistance.to.1layhid1 4.2000234
NumberOfFlights.to.1layhid1 19.5688212
Weather.to.1layhid1 1.0349227
SupportCrewAvailable.to.1layhid1 0.4459910
BaggageLoadingTime.to.1layhid1 13.4553695
LateArrival.to.1layhid1 10.4934908
Cleaning.to.1layhid1 0.0337969
Fueling.to.1layhid1 -1.0036715
Security.to.1layhid1 0.7144986
Intercept.to.HourDelay 0.0138318
1layhid1.to.HourDelay 0.9894038

Above is a table of each variable’s coefficient, below is a graphical model of them and how they are used to determine the predicted outcome.

plot(NetworkModel, rep="best") 

‘Figure 9: Network Model Visualization’

And below is our logistic regression model, derived by the NN.

logiModel = glm(factor(Hour_Delay) ~., family = binomial, data = flight.hour)
pander(summary(logiModel)$coefficients)
  Estimate Std. Error z value Pr(>|z|)
(Intercept) -102.9 4.286 -24.01 1.966e-127
CarrierAA 0.06284 0.3963 0.1586 0.874
CarrierAS -0.2516 1.413 -0.178 0.8587
CarrierB6 -0.1677 0.3784 -0.4433 0.6576
CarrierDL -0.3375 0.3866 -0.8729 0.3827
CarrierEV -0.2169 0.3863 -0.5614 0.5745
CarrierF9 -1.239 1.336 -0.9276 0.3536
CarrierFL -0.8453 0.6406 -1.319 0.187
CarrierHA -3.879 4.606 -0.8423 0.3996
CarrierMQ -0.3142 0.4122 -0.7624 0.4458
CarrierUA -0.2084 0.3753 -0.5555 0.5786
CarrierUS -0.1524 0.4373 -0.3486 0.7274
CarrierVX -0.4454 0.5916 -0.7528 0.4516
CarrierWN -0.2454 0.4773 -0.5141 0.6072
Airport_Distance 0.02724 0.003886 7.011 2.364e-12
Number_of_flights 0.0006873 3.775e-05 18.21 4.599e-74
Weather 0.6592 0.1265 5.21 1.893e-07
Support_Crew_Available -0.008164 0.001446 -5.644 1.658e-08
Baggage_loading_time 2.438 0.1788 13.64 2.408e-42
Late_Arrival_o 0.9455 0.0955 9.901 4.122e-23
Cleaning_o 0.009536 0.01634 0.5837 0.5594
Fueling_o -0.005169 0.01619 -0.3193 0.7495
Security_o 0.01402 0.008061 1.739 0.08197

Finally, we will have our NN model find an optimal cutoff probability, using the same cross-validation technique that we used in our previous section. We expect that it will be slightly different, partially because it is a different technique, but also because our code generates a completely different random split for training and testing.

n0 = dim(train.Dat)[1]/5
cut.0ff.prob = seq(0,1, length = 20)

pred.accuracy = matrix(0,ncol=20, nrow=5, byrow = T)  

for (i in 1:5){
  valid.id = ((i-1)*n0 + 1):(i*n0)
  valid.data = train.Dat[valid.id,]
  train.data = train.Dat[-valid.id,]
  train.model = neuralnet(modelFormula, train.data, hidden = 1, rep = 1, threshold = 0.1,    learningrate = 0.1, algorithm = "rprop+")
  
  pred.nn.val = predict(NetworkModel, newdata=valid.data, type="prob")

for(j in 1:20){
    pred.status = rep(0,length(pred.nn.val))
    valid.data$pred.status = as.numeric(pred.nn.val >cut.0ff.prob[j])
    a11 = sum(valid.data$pred.status == valid.data$HourDelay)
    pred.accuracy[i,j] = a11/length(pred.nn.val)
  }
}


avg.accuracy = apply(pred.accuracy, 2, mean)
max.id = which(avg.accuracy ==max(avg.accuracy ))


tick.label = as.character(round(cut.0ff.prob,2))
plot(1:20, avg.accuracy, type = "b",
     xlim=c(1,20), 
     ylim=c(0.5,1), 
     axes = FALSE,
     xlab = "Cut-off Probability",
     ylab = "Accuracy",
     main = "5-fold CV performance"
     )
axis(1, at=1:20, label = tick.label, las = 2)
mtext('Figure 10: Model ROC Curve' , side=1, line=4, cex=0.7)
axis(2)
segments(max.id, 0.5, max.id, avg.accuracy[max.id], col = "purple")
text(max.id, avg.accuracy[max.id]+0.03, as.character(round(avg.accuracy[max.id],4)), col = "purple", cex = 0.8)

Above is the optimal Cut-off Probability, as suggested by our NN model. The ROC curve below will show the same information as in the previous section.

nn.results = predict(NetworkModel, train.Dat)
cut0 = seq(0,1, length = 20)
SenSpe = matrix(0, ncol = length(cut0), nrow = 2, byrow = FALSE)
for (i in 1:length(cut0)){
    a = sum(train.Dat[,"HourDelay"] == 1 & (nn.results > cut0[i]))
    d = sum(train.Dat[,"HourDelay"] == 0 & (nn.results < cut0[i]))
    b = sum(train.Dat[,"HourDelay"] == 0 & (nn.results > cut0[i]))    
    c = sum(train.Dat[,"HourDelay"] == 1 & (nn.results < cut0[i]))   
    sen = a/(a + c)
    spe = d/(b + d)
    SenSpe[,i] = c(sen, spe)
}

plot(1-SenSpe[2,], SenSpe[1,], type ="l", xlim=c(0,1), ylim=c(0,1),
     xlab = "1 - specificity", ylab = "Sensitivity", lty = 1,
     main = "ROC Curve", col = "blue")
abline(0,1, lty = 2, col = "red")
mtext('Figure 11: Model ROC Curve' , side=1, line=4, cex=0.7)
xx = 1-SenSpe[2,]
yy = SenSpe[1,]
width = xx[-length(xx)] - xx[-1]
height = yy[-1]

  prediction = nn.results
  category = train.Dat[,"HourDelay"] == 1
  ROCobj <- roc(category, prediction)
  AUC = auc(ROCobj)[1]

AUC =mean(sum(width*height), sum(width*yy[-length(yy)]))
text(0.8, 0.3, paste("AUC = ", round(AUC,4)), col = "blue4", cex = 0.9)
legend("bottomright", c("ROC of the model", "Random (Coin Flip) Guess"), lty=c(1,2),
       col = c("blue4", "pink2"), bty = "n", cex = 0.8)

Similar to the previous section, the ROC curve helps determine an ideal cutoff probability, based on the desired sensitivity and specificity of our model. While favoring a more sensitive or specific model could be important due to external factors, there is no apparent need for either in the realm of flight delay predictions.

4.e Neural Network Conclusion

The above NN model building shows how machine learning can be utilized to create a predictive model and gave an ROC/AUC and cutoff point to compare to our previous ‘champion’.

5. Decision Tree Analysis

A decision tree is a means of making prediction based on answers to series of ‘questions’ to generate ‘rules’, which are defined by the available data. Whether or not a condition is met is defined at a node. Beginning with a root node, representing the entire sample, the data branches into different nodes. These. If a node splits into sub-nodes, it is aptly called a decision node. If a node does not split, it is called a terminal node or - to fit the theme - a leaf. A simpler decision tree can be seen below:

Here, the root node splits the sample based on income bracket. From there, it shows conditions to either continue to further decision nodes or end at a leaf, which shows that the model predicts a ‘yes’ or a ‘no’. Our own decision tree will follow the same principles, but will likely be more complex (as it contains more variables).

A decision tree may allow us to simplify our model and see if any variables are only relevant if certain conditions are met,like how in the above example, Credit Rating (CR) is not used to make a prediction for the ‘high’ income bracket.

5.a Splitting the Dataframe for training and validation

First, we will need an unweighted datafame.

flightmtx2<-model.matrix(~. , data=flight.h2)

colnames(flightmtx2)[15]<-"AirportDistance"
colnames(flightmtx2)[16]<-"NumberOfFlights"
colnames(flightmtx2)[17]<-"Weather"
colnames(flightmtx2)[18]<-"SupportCrewAvailable"
colnames(flightmtx2)[19]<-"BaggageLoadingTime"
colnames(flightmtx2)[20]<-"LateArrival"
colnames(flightmtx2)[21]<-"Cleaning"
colnames(flightmtx2)[22]<-"Fueling"
colnames(flightmtx2)[23]<-"Security"
colnames(flightmtx2)[24]<-"HourDelay"

Next, we will need to separate it into training and testing sets. This is the same code from previous sections and accomplishes the same split.

nn = dim(flightmtx2)[1]
train.id = sample(1:nn, round(nn*0.7), replace = FALSE) 
training = as.data.frame(flightmtx2[train.id,])
testing = as.data.frame(flightmtx2[-train.id,])

5.b Different options for Decision Tree

When designing our decision tree, we have two options that we will explore. They are: 1) whether or not to weight false positives and false negatives differently and 2) How to determine if and when to split.

5.b.1 Assigning weight to different errors

No predictive model is perfect. We will have false positives and false negatives in any model, but the rpart() function allows us to treat them differently. As a reminder, the definitions are:

False Positive - Our model predicts that there will be a delay of 1 hour or more but in reality, there is not False Negative - Our model predicts that there will not be a delay of 1 hour or more but in reality, there is

We believe that exploring false positives having more weight to them is much more relevant than a false negative. Put simply, if you incorrectly predict a short delay and act accordingly (get to the airport with enough time to go through security, check-in, etc), the end result is waiting in an airport terminal (something to be expected anyway). On the other hand, if you incorrectly predict a long delay and act accordingly (take your time in arriving to the airport, etc), you will likely miss your flight.

Keeping this in mind, our trees will explore weighing false positives and negatives the same, but also giving false positives ten times the weight of false negatives.

5.b.2 Gini vs Entropy Splitting

Our Decision Trees will consider splitting into different nodes based on one of two possible criteria: Gini or Information Gain (referred to as Entropy, which will be explained shortly).

Not to be confused with the summary measure of income inequality, the Gini index measures the variation of subgroups split by a feature variable. In the example above, the Gini measure (D) of the Student variable would vary numerically based on how many answered ‘Yes’ vs how many answered ‘No’. If all of them answered one or the other, D would be equal to 0.

The Information Gain of a variable measures the percentage of each child node class derived from splitting a decision node. It is calculated by subtracting the Entropy of the child node from Entropy of the parent node. Highest Information Gain (difference between parent and child node Entropy) is usually taken first.

5.c Making the Trees

As we will be making four trees (weighted and unweighted, Gini and Entropy), it will be faster (and more space efficient) to write an R function rather than copy the same code four times. This is done using the rplot() package.

tree.build = function(in.data, fp, fn, purity){
   tree = rpart(HourDelay ~ .,                
                data = in.data, 
                na.action  = na.rpart,       
                                             
                method = "class",            
                model  = FALSE,
                x = FALSE,
                y = TRUE,
            parms = list(
                         loss = matrix(c(0, fp, fn, 0), ncol = 2, byrow = TRUE),   
                         split = purity),
            control = rpart.control(
                        minsplit = 10,
                        minbucket= 10,  
                        cp  = 0.01, 
                       xval = 10     )
                        )
             
  }

We can now use this function to create decision trees with different splitting and weighing methods.

gini.unweighted=tree.build(in.data=training, fp=1, fn =1, purity = "gini")
gini.weighted=tree.build(in.data=training, fp=10, fn =1, purity = "gini")
info.unweighted=tree.build(in.data=training, fp=1, fn =1, purity = "information")
info.weighted=tree.build(in.data=training, fp=10, fn =1, purity = "information")

Now, we will use the rpart.plot() package to visualize the Decision Trees

par(mfrow=c(1,1))
rpart.plot(gini.unweighted, main = "Unweighted DT with Gini Index")
mtext('Figure 12: Decision Tree Model' , side=1, line=4, cex=0.5)

rpart.plot(gini.weighted, main = "Weighted DT with Gini Index")
mtext('Figure 13: Decision Tree Model' , side=1, line=4, cex=0.5)

rpart.plot(info.unweighted, main = "Unweighted DT with Entropy")
mtext('Figure 14: Decision Tree Model' , side=1, line=4, cex=0.5)

rpart.plot(info.weighted, main = "Weighted DT with Entropy")
mtext('Figure 15: Decision Tree Model' , side=1, line=4, cex=0.5)

5.d ROC Curve for different Decision Tree models

As the ROC curve requires both sensitivity and specificity, we will write a function to determine them.

rocstuff<-function(in.data, fp, fn, purity){
  cutoff = seq(0,1, length=20)
  model=tree.build(in.data, fp, fn, purity)
  pred=predict(model, in.data, type="prob")
  ssmtx=matrix(0, ncol=length(cutoff), nrow=2, byrow=FALSE)
  for (i in 1:length(cutoff)){
    pred.out = ifelse(pred[,2] >= cutoff[i], 1, 0)
    TP = sum(pred.out ==1 & in.data$HourDelay == 1)
    TN = sum(pred.out ==0 & in.data$HourDelay == 0)
    FP = sum(pred.out ==1 & in.data$HourDelay == 0)
    FN = sum(pred.out ==0 & in.data$HourDelay == 1)
    ssmtx[1,i] = TP/(TP + FN)
    ssmtx[2,i] = TN/(TN+FP)
    
  }
  prediction = pred[,2]
  category = in.data$HourDelay == 1
  ROCobj<-roc(category,prediction)
  AUC<-auc(ROCobj)
  list(ssmtx= ssmtx, AUC = round(AUC,3))
}

Now, we can use this new function to define and plot the ROC and AUC for each of our models.

guROC=rocstuff(in.data=training, fp=1, fn=1, purity="gini")
gwROC=rocstuff(in.data=training, fp=10, fn=1, purity="gini")
iuROC=rocstuff(in.data=training, fp=1, fn=1, purity="information")
iwROC=rocstuff(in.data=training, fp=10, fn=1, purity="information")

par(pty="s")
plot((1-guROC$ssmtx[2,]), guROC$ssmtx[1,], type = "l", xlim=c(0,1), ylim=c(0,1), xlab = "False Positive Rate", ylab = "True Positive Rate", lwd=2.2, col="blue", main="ROC Curves of Decision Trees")
mtext('Figure 16: ROC Curves of Tree Models' , side=1, line=4, cex=0.5)
abline(0,1, lty=2)
lines((1-gwROC$ssmtx[2,]), gwROC$ssmtx[1,], col="deeppink", lwd=2.2)
lines((1-iuROC$ssmtx[2,]), iuROC$ssmtx[1,], col="mediumorchid", lwd=2.2)
lines((1-iwROC$ssmtx[2,]), iwROC$ssmtx[1,], col="forestgreen", lwd=2.2)

legend("bottomright", c(paste("Gini Unweighted, AUC=",guROC$AUC),
paste("Gini Weighted, AUC=",gwROC$AUC), 
paste("Info Unweighted, AUC=",iuROC$AUC), 
paste("Info Weighted, AUC=",iwROC$AUC)), 
col=c("blue", "deeppink", "mediumorchid", "forestgreen"),
lty=rep(1,6), cex=.75, bty="n")

Above is the ROC curve for each model considered. We will now look at the optimal cutoff point for each decision tree model. As, again, we are doing this for four different models, it will be faster if we write a function.

Optm.cutoff = function(in.data, fp, fn, purity){
  n0 = dim(in.data)[1]/5
  cutoff = seq(0,1, length = 20)              
  
  accuracy.mtx = matrix(0, ncol=20, nrow=5)    
  
  for (k in 1:5){
     valid.id = ((k-1)*n0 + 1):(k*n0)
     valid.dat = in.data[valid.id,]
     train.dat = in.data[-valid.id,] 
     
     tree.model = tree.build(in.data, fp, fn, purity)
     
     pred = predict(tree.model, newdata = valid.dat, type = "prob")[,2]
     
     for (i in 1:20){
        
        pc.1 = ifelse(pred > cutoff[i], 1, 0)
        
        a1 = mean(pc.1 == valid.dat$HourDelay)
        accuracy.mtx[k,i] = a1
       }
      }
   avg.acc = apply(accuracy.mtx, 2, mean)
   
   n = length(avg.acc)
   idx = which(avg.acc == max(avg.acc))
   tick.label = as.character(round(cutoff,2))
   
   plot(1:n, avg.acc, xlab="Cutoff score", ylab="Average Accuracy", 
        ylim=c(min(avg.acc), 1), 
        axes = FALSE,
        main=paste("5-fold CV Optimal Cutoff \n ",purity,"(fp, fn) = (", fp, ",", fn,")" , collapse = ""),
        cex.main = 0.9,
        col.main = "mediumorchid")
        axis(1, at=1:20, label = tick.label, las = 2)
        axis(2)
        points(idx, avg.acc[idx], pch=19, col = "deeppink")
        segments(idx , min(avg.acc), idx , avg.acc[idx ], col = "deeppink")
       text(idx, avg.acc[idx]+0.03, as.character(round(avg.acc[idx],4)), col = "deeppink", cex = 0.7) 
   }

par(mfrow=c(2,2))

Optm.cutoff(in.data=training, fp=1, fn=1, purity = "gini")
mtext('Figure 17: Optimal Cutoff Model' , side=1, line=4, cex=0.5)

Optm.cutoff(in.data=training, fp=10, fn=1, purity = "gini")
mtext('Figure 18: Optimal Cutoff Model' , side=1, line=4, cex=0.5)
Optm.cutoff(in.data=training, fp=1, fn=1, purity = "information")
mtext('Figure 19: Optimal Cutoff Model' , side=1, line=4, cex=0.5)
Optm.cutoff(in.data=training, fp=10, fn=1, purity = "information")
mtext('Figure 20: Optimal Cutoff Model' , side=1, line=4, cex=0.5)

Above is the optimal cutoff point for each decision tree model. They do not seem to be radically different, but it would depend on user preference for weighting false positives and purity measure.

5.e Decision Tree Conclusion

While the decision trees use far fewer variables than any other method outlined so far - especially the Entropy + Weighted model, which managed an AUC of .914 only using 2 variables on one occasion - they are able to demonstrate comparable accuracy other models, as evidenced by the cut-off scores. They provide the advantage of versatility and simplicity, which would be a boon if this model was adopted to a larger dataset. For the size, the number of variables is certainly appropriate, but in a sample size 100 times larger, being able to cut down a model to just two predictor variables would significantly reduce the computational power needed.

6. Principal Component Analysis

Our dataframe contains many variables that could be related. For example, the number of support staff available could very easily impact the amount of time spent in baggage loading. Principal Component Analysis (PCA) can be very useful in the event of a dataset with multicolinearity.

6.a Preparing the Data for PCA

To prepare our data for PCA, we will remove categorical variable and, in order to mitigate the effect of outliers, we will scale it. We cannot use log scaling, as there are variables with a value of 0. Log-scaling changes their value to -Inf, which only returns

reduced.df<-flight.hour[,-1]
reduced.df<-reduced.df[,-10]

pca.df<-prcomp(reduced.df, center = TRUE, scale = TRUE)

6.b PCA analysis and Output

Next, we can see each variable’s factor from the PCA. For an example of how this grid can be read, consider PC1 below. PC1 can be defined as:

\(PC_1 = (x_{Airport_Distance}*PC1_{Airport_Distance}) + (x_{Number_of_Flights}*PC1_{Number_of_Flights}) + ... + (x_{k}*PC1_{k})\)

kable(round(pca.df$rotation, 2))
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
Airport_Distance 0.36 -0.04 -0.06 0.00 0.09 -0.47 -0.78 -0.17 -0.01
Number_of_flights 0.51 -0.01 -0.03 0.02 0.06 -0.10 0.22 0.38 0.73
Weather 0.25 0.12 0.18 0.14 -0.86 0.31 -0.18 -0.05 0.03
Support_Crew_Available -0.28 0.24 -0.06 -0.02 -0.41 -0.78 0.27 0.09 0.00
Baggage_loading_time 0.50 0.03 0.01 0.05 0.04 -0.08 0.18 0.50 -0.67
Late_Arrival_o 0.46 0.04 -0.02 0.01 0.05 -0.11 0.45 -0.75 -0.08
Cleaning_o -0.01 -0.47 0.82 -0.26 -0.01 -0.17 0.06 0.00 0.00
Fueling_o -0.03 0.65 0.52 0.47 0.28 -0.01 -0.05 -0.01 0.04
Security_o 0.08 0.53 0.07 -0.83 0.03 0.11 -0.07 0.01 0.00

And now the importance - ‘proportion of variance’ - of each component.

kable(summary(pca.df)$importance, caption='Amount of Variance Explained by Each Component')  
Amount of Variance Explained by Each Component
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
Standard deviation 1.666442 1.018077 0.999428 0.9869021 0.9401294 0.9139692 0.8387211 0.6825275 0.5702442
Proportion of Variance 0.308560 0.115160 0.110980 0.1082200 0.0982000 0.0928200 0.0781600 0.0517600 0.0361300
Cumulative Proportion 0.308560 0.423720 0.534710 0.6429300 0.7411300 0.8339500 0.9121100 0.9638700 1.0000000

This shows the amount of variance explained by each PC. As the first explains the most, by far, we will attempt to include it in a model.

final.pca.df=flight.hour
final.pca.df$PC=pca.df$x[,1]

6.c Testing The PC model

We will run the same analysis we have done on the first candidate model to see if it has statistical significance.

can.pca<-glm(Hour_Delay~ Number_of_flights + Airport_Distance + Weather + Support_Crew_Available + Baggage_loading_time + Late_Arrival_o + Security_o + PC , family=binomial, data = final.pca.df)

pca.table=summary(can.pca)$coef

kable(pca.table, caption = "Significance of each coefficient in the PCA model")
Significance of each coefficient in the PCA model
Estimate Std. Error z value Pr(>|z|)
(Intercept) -92.2725113 78.2832582 -1.1787004 0.2385175
Number_of_flights 0.0006401 0.0003322 1.9267461 0.0540113
Airport_Distance 0.0217655 0.0385026 0.5653001 0.5718697
Weather 0.5319155 0.9654126 0.5509723 0.5816527
Support_Crew_Available -0.0065485 0.0125371 -0.5223327 0.6014387
Baggage_loading_time 2.2532262 1.3421825 1.6787779 0.0931953
Late_Arrival_o 0.8048893 1.0462123 0.7693365 0.4416936
Security_o 0.0111714 0.0216915 0.5150101 0.6065460
PC 0.2416191 1.8143756 0.1331693 0.8940595

The output does not indicate that the PC variable is significant. We will not consider a PC variable in final analysis.

6.d PCA Conclusion

Despite not coming away with a new piece for prospective models, our PCA has shown that there is not significant multicollinearity between different variables. It was certainly worth investigation, as one would expect an airport experiencing a staff shortage to experience longer security times, baggage loading times, etc.

7. Bootstrap Aggreggation Decision Trees

Bootstrap Aggregation (Bagging, for short) is a method that may allow for improvement over our basic decision trees. This method first bootstraps the data, trains models individually (called parallel training), then aggregates the outputs to ‘vote’ on the classification, given a set of values.

7.a Splitting our Data for Testing and Training

This is the same testing/training code from earlier in the report.

nn = dim(flight.hour)[1]
train.id = sample(1:nn, round(nn*0.7), replace = FALSE) 
training = flight.hour[train.id,]
testing = flight.hour[-train.id,]

7.b Bagging our Data

The code is very similar to the Decision Tree modeling code. We will again weigh false positives more heavily than false negatives, for reasons outlined in section 5.

We will then plot the ROC curve and AUC to measure the efficacy of this model.

fh.bag.train<-bagging(as.factor(Hour_Delay)~. , data=training,
       nbagg=100,
       coob=TRUE,
       parms=list(loss=matrix(c(0,1,10,0),
                              ncol=20, byrow=TRUE), split="gini"), control=rpart.control(minsplit = 10, cp=0.02))

pred=predict(fh.bag.train, training, type="prob")

cutoff.prob=seq(0,1,length=20)
ss.mtx=matrix(0, ncol=length(cutoff.prob), nrow= 3, byrow = FALSE)
for (i in 1:length(cutoff.prob)){
  pred.out= ifelse(pred[,"1"]>=cutoff.prob[i], 1,0)
TP = sum(pred.out ==1 & training$Hour_Delay == 1)
  TN = sum(pred.out ==0 & training$Hour_Delay == 0)
  FP = sum(pred.out ==1 & training$Hour_Delay == 0)
  FN = sum(pred.out ==0 & training$Hour_Delay == 1)
ss.mtx[1,i] = TP/(TP+FN)
ss.mtx[2,i] = TN/(TN+FP)
ss.mtx[3,i] = (TP+TN)/(TP+FN + TN + FP)
}

prediction=pred[,1]
category= training$Hour_Delay==1
ROCobj<-roc(category, prediction)
AUC = auc (ROCobj)
AUC = round(as.vector(AUC[1]), 3)

n=length(ss.mtx[3,])
idx=which(ss.mtx[3,] == max(ss.mtx[3,]))
tick.label = as.character(round(cutoff.prob,2))

 par(mfrow = c(1,2))
  plot(1-ss.mtx[2,], ss.mtx[1,], xlim=c(0,1), ylim = c(0,1),
       xlab = "1 - specificity", 
       ylab = "Sensitivity", main = "ROC (Training Data)", type="l",cex.main = 0.8)
    mtext('Figure 21: ROC Curve of Model' , side=1, line=4, cex=0.5)
  legend("bottomright",c("fn = 1", "fp = 10", "cp = 0.02", paste("AUC =", AUC)), bty="n", cex = 0.8)       
  
  plot(1:length(cutoff.prob), ss.mtx[3,], xlab="cut-off probability",
       ylab = "Accuracy", ylim=c(min(ss.mtx[3,]),1),
               axes = FALSE,
        main="Accuracy of Different Cutoff Points",
         mtext('Figure 22: Optimal Cutoff of Model' , side=1, line=4, cex=0.5),

        cex.main = 0.9,
        col.main = "mediumorchid")
        axis(1, at=1:20, label = tick.label, las = 2)
        axis(2)
        points(idx, ss.mtx[3,][idx], pch=19, col = "deeppink")
        segments(idx , min(ss.mtx[3,]), idx , ss.mtx[3,][idx ], col = "deeppink")
       text(idx, ss.mtx[3,][idx]+0.03, as.character(round(ss.mtx[3,][idx],4)), col = "deeppink", cex = 0.8)

The above output shows the optimal cutoff probability and ROC curve for our Bootstrap Aggregation model. It should be compared against the weighted gini model in the earlier Decision Tree section.

7.c Bagging Conclusions

Bagging presents a means of refining the previously discussed Decision Tree model. Through utilization of Bootstrap resampling, we can improve our Decision Trees, as seen in the increase from section 5’s unweighted gini model.

8. Conclusion and Discussion

8.a Conclusions

First, we would like to make clear: no one involved in this study is particularly familiar with research related to flights, travel, etc., so any recommendations should be run by an expert before being acted upon.

From the exploratory phase, it seemed clear that, in this dataset, Number of Flights, Arrival Delay, and Baggage Loading Time have a strong relationship with a flight delay, while Carrier does not. Our reduced candidate model makes a case that Security time, Airport Distance, Support Crew, Arrival Delay, and Weather all have a significant impact on the presence of a long delay. While this is all actionable information, these variables can be split into three categories:

Projection - Understanding the impact that Number of Flights, Airport Distance, and to an extent, available staff and weather, can help create a more realistic departure time, available to a passenger before they even reach the airport. Obviously, while they can be forecast with weather reports and staffing information, weather and available staff may change, so they should not be as heavily relied on as the number of flights and airport distance.

Live Updates - While this information does not do a passenger as much good, a live update of their flight situation could leave them less in the dark. If the recording of like security, baggage loading time, and the late arrival of a plane could be automated, they could be factored into a model of one’s choosing to provide a live update of a flight’s projected departure.

Systemic Improvements - Having established above which variables impact the presence of a long delay more, an airport or carrier may want to focus on improving those above others. For example, bagging loading time was considered a significant variable by most models. Average baggage loading time was about the same across all carriers. With this in mind, if infrastructure improvement is being budgeted for with the explicit purpose of reducing delays, one could prioritize that over a variable that did not have as strong of an impact, like fueling.

Each modeling technique outlined in an earlier section has its own merits, but depending on the intended purpose, a simpler model that does not lose too much information is ideal (or even computationally necessary). For this, we would like to highlight the Decision Tree models. During some testing sessions, trees were able to reach an AUC of above .9 while only using 3-4 variables. The ability to strengthen these trees with different techniques presents a variety of possibilities (especially as machine learning techniques related to decision trees are developed).

8.b Recommendations for Future Study

Should this be replicated, there are a few things that we would recommend: First, analyzing the impact that the airports involved may have. It is impossible to say based on this data alone, but there is plenty of potential analysis to be done with the airport a plane is arriving from, departing from, and heading towards. While this information will create more computational demand, we believe that it is worth investigating.

Second, further exploration into Decision Tree techniques. Due to our aim to broadly analyze the data, we cannot fit more in this report, but there are many - often more advanced - tree algorithms that could prove useful, such as boosting, gradient boosting, or the recently developed XG boosting.

Finally, we chose an hour delay as the response variable somewhat arbitrarily. A cutoff point needed to be established to use some of these statistical techniques, but one could use these techniques with different delay times. Our research shows that a flight that is late by 15 minutes or more is considered ‘delayed’, but treating that as the cutoff point would have meant there were only 92 ‘on time’ flights out of a total of nearly 3600. One may want to consider expert opinions for a third party analysis, or, for internal analysis (done by a single carrier or airport), use carrier or airport specific data to find an ideal cutoff point.