Description of Data

The flight delay data has 3593 observations of 11 variables. They are:

Carrier: The airline

Airport_Distance: Distance between two airports

Number_of_flights: Total number of flights at the airport

Weather: Weather condition, measured on a scale from 0 (mild) to 10 (extreme)

Support_Crew_Available: Number of support crew

Baggage_loading_time: Time in minutes spent loading baggage

Late_Arrival_o: Time in minutes the plane arrived late

Cleaning_o: Time in minutes spent cleaning the aircraft

Fueling_o: Time in minutes spent fueling the aircraft

Security_o: Time in minutes spent in security checking

Arr_Delay: Flight delay in minutes. This is the dependent variable of the dataset.

For logistic regression, an additional binary variable, Hour_Delay, will be created to indicate whether a plane is 1 hour late or more.

A copy of this publicly available data is stored at https://pengdsci.github.io/datasets/FlightDelay/Flight_delay-data.csv.

flight<-read.csv("https://pengdsci.github.io/datasets/FlightDelay/Flight_delay-data.csv")

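# Work on a copy so the original data stays intact
flight.hour<-flight

# Hour_Delay is 1 when the arrival delay exceeds 59 minutes, otherwise 0
flight.hour$Hour_Delay<-ifelse(flight.hour$Arr_Delay > 59, 1, 0)

# Drop column 11 (Arr_Delay) so the model cannot use it to predict Hour_Delay
flight.hour<-flight.hour[-c(11)]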
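
As a quick sanity check on the recoding above, here is a minimal sketch on a made-up data frame. The values in `toy` are invented for illustration and are not taken from the flight data. It shows that the `> 59` threshold codes a delay of exactly 60 minutes as 1, and that the original delay column can also be dropped by name rather than by position.

```r
# Toy data: invented delays, not from the flight dataset
toy <- data.frame(Arr_Delay = c(0, 45, 59, 60, 90, 180))

# Same rule as above: 1 when the delay exceeds 59 minutes
toy$Hour_Delay <- ifelse(toy$Arr_Delay > 59, 1, 0)

# Count of flights in each class
class_counts <- table(toy$Hour_Delay)
print(class_counts)

# Dropping the column by name is an alternative to the positional index
toy_model <- toy[, setdiff(names(toy), "Arr_Delay"), drop = FALSE]
```

Dropping by name would keep working if the column order of the CSV ever changed, which a positional index such as `-c(11)` does not guarantee.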

Dataset Overview

First, as the majority of the variables are numerical, an overview of summary statistics is warranted.

summary(flight)

##    Carrier          Airport_Distance Number_of_flights    Weather     
##  Length:3593        Min.   :376.0    Min.   :29475     Min.   :5.000  
##  Class :character   1st Qu.:431.0    1st Qu.:41634     1st Qu.:5.000  
##  Mode  :character   Median :443.0    Median :43424     Median :5.000  
##                     Mean   :442.4    Mean   :43311     Mean   :5.353  
##                     3rd Qu.:454.0    3rd Qu.:45140     3rd Qu.:6.000  
##                     Max.   :499.0    Max.   :53461     Max.   :6.000  
##  Support_Crew_Available Baggage_loading_time Late_Arrival_o    Cleaning_o   
##  Min.   :  0            Min.   :14.00        Min.   :15.00   Min.   :-4.00  
##  1st Qu.: 56            1st Qu.:17.00        1st Qu.:18.00   1st Qu.: 8.00  
##  Median : 83            Median :17.00        Median :19.00   Median :10.00  
##  Mean   : 85            Mean   :16.98        Mean   :18.74   Mean   :10.02  
##  3rd Qu.:112            3rd Qu.:17.00        3rd Qu.:19.00   3rd Qu.:12.00  
##  Max.   :222            Max.   :19.00        Max.   :22.00   Max.   :23.00  
##    Fueling_o       Security_o      Arr_Delay    
##  Min.   :13.00   Min.   :13.00   Min.   :  0.0  
##  1st Qu.:23.00   1st Qu.:32.00   1st Qu.: 49.0  
##  Median :25.00   Median :37.00   Median : 70.0  
##  Mean   :25.01   Mean   :37.09   Mean   : 69.8  
##  3rd Qu.:27.00   3rd Qu.:42.00   3rd Qu.: 90.0  
##  Max.   :36.00   Max.   :63.00   Max.   :180.0

There are no missing values. Of note is the wide spread of Support_Crew_Available (0 to 222) and Number_of_flights (29475 to 53461), and the narrow spread of Weather, which only takes values between 5 and 6 despite its 0 to 10 scale.
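
The two observations above (missing values and relative spread) can also be checked directly rather than read off the summary. The sketch below uses an invented data frame, `toy`, with one missing value planted on purpose so that the check has something to find. The coefficient of variation is one possible scale-free measure of spread and is an assumption here, not something the summary reports.

```r
# Toy data with invented values; one NA is planted on purpose
toy <- data.frame(
  Support_Crew_Available = c(0, 56, 83, 112, 222),
  Weather                = c(5, 5, 5, 6, NA)
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))

# Coefficient of variation (sd / mean) as a scale-free measure of spread
cv <- sapply(toy, function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE))

print(na_counts)
print(round(cv, 2))
```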

Looking for Outliers

While the summaries of variables like Airport_Distance do not cause alarm, the wide spread of some variables suggests the possibility of outliers. Outliers are expected, but it may be worth identifying and individually reviewing them to see whether there could have been a data-entry error.

library(EnvStats)  # provides rosnerTest()

# Rosner's test for up to k = 5 outliers in each variable
testsup<-rosnerTest(flight$Support_Crew_Available, k = 5)
testnum<-rosnerTest(flight$Number_of_flights, k = 5)
testdist<-rosnerTest(flight$Airport_Distance, k = 5)
testbag<-rosnerTest(flight$Baggage_loading_time, k = 5)

The outlier test for number of flights returns 3 possible outliers: observations 1652, 3163, and 2729. The delay for each of these observations is 0. While these observations do not appear to be errors, they suggest that number of flights could be a good predictor of delay time. Similarly, the test for baggage loading time flags observations 2729 and 3163 as low outliers.
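
To make the logic of the outlier test more concrete, the following is a minimal base-R sketch of the generalized ESD procedure that Rosner's test is based on. It is an illustration only, not the `rosnerTest` implementation. The function name `rosner_sketch`, its output columns, and the simulated data are all assumptions made for this example.

```r
# Sketch of the generalized ESD (Rosner) procedure for up to k outliers
rosner_sketch <- function(x, k = 5, alpha = 0.05) {
  n      <- length(x)
  keep   <- seq_along(x)   # observation numbers not yet removed
  obs    <- integer(k)     # observation number of the i-th candidate
  value  <- numeric(k)     # its value
  R      <- numeric(k)     # test statistic at step i
  lambda <- numeric(k)     # critical value at step i

  for (i in seq_len(k)) {
    # Standardized distance of each remaining point from the current mean
    z <- abs(x[keep] - mean(x[keep])) / sd(x[keep])
    j <- which.max(z)      # most extreme remaining point

    # Critical value from the t distribution
    p     <- 1 - alpha / (2 * (n - i + 1))
    tcrit <- qt(p, df = n - i - 1)
    lambda[i] <- (n - i) * tcrit /
      sqrt((n - i - 1 + tcrit^2) * (n - i + 1))

    obs[i]   <- keep[j]
    value[i] <- x[keep[j]]
    R[i]     <- z[j]
    keep     <- keep[-j]   # remove the candidate before the next step
  }

  # Number of outliers = largest i whose statistic exceeds its critical value
  n_out <- max(c(0, which(R > lambda)))
  data.frame(i = seq_len(k), Obs.Num = obs, Value = value,
             R = R, lambda = lambda, Outlier = seq_len(k) <= n_out)
}

# Simulated data with two planted outliers (observations 101 and 102)
set.seed(1)
x <- c(rnorm(100, mean = 50, sd = 2), 0, 120)
res <- rosner_sketch(x, k = 5)
print(res)
```

Because candidates are removed one at a time, a very large outlier cannot hide a second one, which is the reason for testing up to k points rather than only the single most extreme value.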

Investigation for Collinearity

As there are multiple variables related to airplane prep (fueling, cleaning, baggage loading, and security), it may be worth checking for possible collinearity between two predictor variables. If one can be identified as redundant and removed, it will save us time and computational power.

library(ppcor)  # provides pcor()

# Partial correlations among the numeric columns (2 through 11)
pcor(flight[,2:11], method = "pearson")

As no variable appears to have a high correlation with another, there is no preliminary evidence of multicollinearity, so all variables could potentially be included in the model.
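
A complementary check, which is not the method used above, is the variance inflation factor (VIF). Each predictor is regressed on the others, and the VIF is 1 / (1 - R²) of that regression. The sketch below uses simulated predictors named `x1`, `x2`, and `x3`, all invented for illustration, with `x3` deliberately built to be nearly collinear with `x1`.

```r
set.seed(42)
n <- 200

# Invented predictors: x3 is built to be nearly a copy of x1
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + rnorm(n, sd = 0.1)
X  <- data.frame(x1, x2, x3)

# Pairwise Pearson correlations
print(round(cor(X), 2))

# VIF for each predictor: regress it on the others, then 1 / (1 - R^2)
vif <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif, 1))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. In this simulated example `x1` and `x3` should be flagged while `x2` should not.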

Discretizing Variables

Since two flights are outliers in both baggage loading time and number of flights, we will see whether replacing those variables with binned values has an impact on the model. We could not find any precedent for this online, so we will use the quartiles of each variable to split it into ‘low’, ‘middle’ and ‘high’.
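
As an illustration of this binning rule, the sketch below applies it to a handful of invented flight counts using `cut()`. The cut points are the first and third quartiles of Number_of_flights from the summary output, and the vector `n_flights` is made up so that values fall on both sides of each boundary.

```r
# Invented flight counts chosen to sit on and around the two cut points
n_flights <- c(29475, 41633, 41634, 43424, 45139, 45140, 53461)

# right = FALSE makes each interval closed on the left: [lower, upper)
flight_bin <- cut(n_flights,
                  breaks = c(-Inf, 41634, 45140, Inf),
                  labels = c("Low", "Middle", "High"),
                  right  = FALSE)

print(data.frame(n_flights, flight_bin))
```

Unlike character labels, `cut()` returns a factor whose level order is explicit, so "Low" becomes the reference level in a later regression instead of whichever label sorts first alphabetically.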

model2<-flight.hour

model2$Number_of_flights <- ifelse(model2$Number_of_flights < 41634, 'Low',
               ifelse(model2$Number_of_flights >= 45140, 'High', 'Middle'))

# Bin baggage loading time on its own values (quartiles are all 17 minutes)
model2$Baggage_loading_time <- ifelse(model2$Baggage_loading_time < 17, 'Low',
               ifelse(model2$Baggage_loading_time >= 18, 'High', 'Middle'))

fit<-glm(Hour_Delay~., family = binomial, data=flight.hour)


res<-resid(fit)

fit2<-glm(Hour_Delay~., family = binomial, data=model2)


res2<-resid(fit2)

plot(flight.hour$Hour_Delay, res, ylab="Residuals", xlab="Hour Delay - 0=No, 1=Yes", main="Residuals of Initial Model", col="blue")
abline(0,0)

plot(model2$Hour_Delay, res2, ylab="Residuals", xlab="Hour Delay - 0=No, 1=Yes", main="Residuals of Discretized Model", col="orange")
abline(0,0)

While the residual plots clearly look different, it is not entirely clear which model has less error.

library(knitr)  # provides kable()

# Residual deviance of each fitted model, side by side
errmat<-matrix(c(deviance(fit), deviance(fit2)),ncol=2) 
colnames(errmat)<-c('Original Model', 'Discretized Model')
rownames(errmat)<-c('Error')

error.sum<-as.table(errmat)

kable(error.sum)

        Original Model   Discretized Model
Error         1969.465            2193.526

Although the difference is not apparent from the residual plots, the discretized model has the higher residual deviance (2193.526 versus 1969.465). We will not consider a discretized model as a possibility moving forward.
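
Because binning changes the number of estimated coefficients, a penalized criterion such as AIC can be reported next to the deviance. The sketch below does this on simulated data. The variables `x`, `y`, and `x_bin`, and both fitted models, are invented for illustration and say nothing about the flight data itself.

```r
set.seed(7)
n <- 500
x <- rnorm(n)                                     # invented numeric predictor
y <- rbinom(n, size = 1, prob = plogis(1.5 * x))  # invented binary response

# Bin x at its quartiles into Low / Middle / High
q     <- quantile(x, c(0.25, 0.75))
x_bin <- cut(x, breaks = c(-Inf, q, Inf), labels = c("Low", "Middle", "High"))

fit_num <- glm(y ~ x,     family = binomial)  # 2 coefficients
fit_bin <- glm(y ~ x_bin, family = binomial)  # 3 coefficients

comparison <- data.frame(
  model    = c("numeric", "binned"),
  deviance = c(deviance(fit_num), deviance(fit_bin)),
  AIC      = c(AIC(fit_num), AIC(fit_bin))
)
print(comparison)
```

For a 0/1 response the AIC equals the residual deviance plus twice the number of coefficients, so the two criteria differ only by the complexity penalty.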

Pairwise Associations

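Because many variables in our model are numeric, we will look at the pairwise associations of all variables, focusing mostly on the correlation with the variable of interest. Since Arr_Delay was dropped from flight.hour, column 11 here is Hour_Delay, the binary indicator derived from Arr_Delay.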

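library(GGally)  # provides ggpairs()

# Scatterplot matrix of the numeric predictors and the response (columns 2 to 11)
ggpairs(flight.hour,                  
        columns = 2:11)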

This output shows that some variables (specifically baggage loading, late arrival, and number of flights) are noticeably correlated with the variable of interest. These variables make a strong case for their inclusion in our predictive model.
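
The correlations with the response can also be pulled out as plain numbers. The sketch below does this for a simulated data frame, `toy`, whose columns `bag_time` and `noise` are invented. With a 0/1 response, the Pearson correlation computed by `cor()` is the point-biserial correlation.

```r
set.seed(3)
n <- 300

# Invented data: the delay indicator depends on bag_time but not on noise
toy <- data.frame(bag_time = rnorm(n, mean = 17, sd = 1),
                  noise    = rnorm(n))
toy$Hour_Delay <- rbinom(n, size = 1,
                         prob = plogis(2 * (toy$bag_time - 17)))

# Pearson correlation of each predictor with the 0/1 response
r <- sapply(toy[c("bag_time", "noise")], cor, y = toy$Hour_Delay)

# Predictors ranked by absolute correlation with the response
print(round(sort(abs(r), decreasing = TRUE), 2))
```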

EDA: Exploratory Data Analysis

Alex Dragonetti

July 13, 2023

Overview