Flight delays is a serious problem, which costs airlines, passengers, and U.S. economy. A 2007 study by National Center of Excellence for Aviation Operations Research (NEXTOR) estimates that air transportation delays costs total of $32.9 billion (Ball, 2010). The same report mentions over 25 percent of flights delayed (15+ minutes) and cancelled. 43.6% of all flight delays is caused by weather-related conditions (BTS, 2019). A better understanding of how weather affects flights can help to develop a prediction model and to mitigate the uncertainty of flight delays and flight cancellations.
This research is motivated by the following questions. Which factors have the most influence on flight on-time performance? Among those, which ones are measurable and have available data? Does weather data improve prediction of flight delays?
The data used in this paper come from the nycflight13 R package. In the package, I use 3 datasets: flights, weather, and airlines. The flights dataset contains data for all flights that departed New York City airports in 2013, including on-time performance. The weather dataset contains hourly meteorological data for LaGuadia Airport, John F. Kennedy International Airport, and Newark International Airport. The airlines dataset maps airline names to their carrier codes in the flights dataset.
To answer my research questions, I joined flights with weather on the origin airport and time at the hour. After removing missing values and irrelevant columns, the dimension for the data table is 291,220 rows and 15 columns. Since my goal is to predict whether a given flight is delayed/cancelled or on-time at take-off, I categorized my response variable, delay, as 1 when a flight is either dep_delay of more than 15 minutes or cancelled (dep_delay is NA) and 0 otherwise. Potential predictors are the flight information (month, day, day_of_week, carrier, origin, distance, hour) and associated weather features (temp, dewp, humid, wind_speed, precip, pressure, visib). Among of all predictors, day_of_week, carrier, and origin are categorical variables; while the rest are quantitative variables.
To understand the relationship about predictors and response variable, I perform exploratory analysis, and here are the summary findings. There is a correlation of 0.91 between arrival delays and departure delays, which means a flight that is delayed at take-off is more likely to arrive late than scheduled. However, the correlation is not exactly 1 because a flight could pick up more speed on air to avoid late arriving as much as possible or other factors (such as weather, busy traffic) at the destination might affect the scheduled arrival. Since departure delay is associated to cause arrival delay, my model will predict the probability of flight delayed at take-off. Before building the model, I did an exploratory analysis of the proportion of delayed and cancelled flights in my dataset. There are approximately 24% delayed and cancelled flights (81,169 out of 291,220 flights) in my dataset. Among those delayed flights, the months (July, June, and December) contributed the highest proportion compared to other months in the data (figure 1). Thus, I used the date of the flights (month, day, day of the week) as predictors to account for the variability of flight delays. Besides that, the number of delayed flights are highest between 3 - 7 pm. The explanation for that could be from an accumulation of earlier flight delays.
Distribution of flight delays and cancellation group by month.
Number of flights categorized by different airlines.
I also looked into the distribution of categorized flights for all airlines (figure 2). My statistics show Mesa Airline with the highest proportion of flight delays (35.94% of its total flights), but the graph shows the number of flights that Mesa Airline operated is very small compared to other airlines. I looked up and found that Mesa Airline is a regional airline. Moving down to the second highest rank in the list, ExpressJet Airline (34.71% of its total flights) is confirmed to have the highest count of flight delays and cancellations. ExpressJet operates scheduled United Express flights to destinations in the U.S. (East, Midwest, and South regions), Canada and Mexico. Thus, the delays could be caused by bad winter storms in winter month and storms in summer months contributing to 24% of total flight delays.
Figure 3 below shows the correlation matrix of the quantitative predictors in my dataset. There is a strong correlation between temperature and dewpoint (0.893). A consequence of highly correlated predictor variables is that the coefficients in the regression model are of the opposite sign than expected. From the full model output (figure 4), I suspect that temp and dewp have opposite signs even they have a positive correlation. I decide to remove dewp from the model because I want to see the effect of temperature on the probability of flight delays. From figure 1, I suspect that either low temperature or high temperature will increase the probability of flight delays, suggesting to add a quadratic term in temp to my model. Not only the quadratic term is significant in the summary output (figure 5), but the prediction rate results also increase comparing to the one, not including the quadratic term.
Correlation matrix.
Collinearity affecting coefficient estimates.
Because the response variable is a binary categorical variable, I applied Logistic Regression model to predict the likelihood of a given flight is delayed. For model validation, I used cross-validation technique to split 70:30 ratio of my dataset to training and validation sets. The fitted model in probability form is given by:
\[\hat{p}(\boldsymbol{x}) = \frac{e^{\hat\beta_0 + \hat\beta_1 X_1+...+\hat\beta_{34}X_{34}}}{1+e^{\hat\beta_0 + \hat\beta_1 X_1+...+\hat\beta_{34}X_{34}}}\] where x = \((X_1, X_2,...,X_{34})\) are 34 predictors in my model.
We wish to test:
\(H_0:\) logistic regression model is appropriate
\(H_A:\) logistic model is inappropriate so a saturated model is needed
Since p-value is 1, we fail to reject null hypothesis. Thus, the deviance goodness-of-fit test finds that the logistic regression model is an adequate fit overall for the training data.
Regression summary output for final model.
The results from the summary output in figure 5 show all weather features are significant to the likelihood of flight delays since all the p-values are below 0.001. The signs of \(\beta_i\) have meaningful interpretation. For example, coefficient of wind_speed is greater than 0, then increasing wind speed will be associated with increasing the probability of flight delays. Another interpreting example on temperature’s coefficients, \(\hat{\beta}_{temp}, \hat{\beta}_{temp^2}\) have a quadratic effect (an u-shaped form), while holding other predictors constant. Graphing this equation, I see that y is at a minimum when X is 55. We can interpret as when the temperature is around 55*F degree, a given flight will be associated with the lowest chance of being delayed or cancelled; while extreme temperature (low or high) will have more chance of being delayed.
The cross-validation results have a high accuracy rate (78.74%) and sensitivity (97.58%), but low specificity (12.87%) because my response variable is imbalanced. Thus, I evaluate my model using AUC (Area under the ROC curve) metric with the result of 0.7144.
While fixing other sources of variability in my model, weather features have a significant impact on the likelihood of flight delays and cancellations. Even though the chi-square goodness-of-fit test confirms my model overall is appropriated, perform model diagnostics on residuals of multiple logistic regression for more than 3 predictors is limited. Also, there might exist a serial correlation, that is, results from the current time period are correlated with results from earlier time periods (Sheather, p.305). For future work, I want to analyze and model the autocorrelated errors for my losgistic regression model if it is possible. Besides that, I also want to find different machine learning algorithms such as XGBoost, Random Forest, or Neural Network to improve the prediction rates comparing to the model in this project.
Ball, Michael, et al. “A Comprehensive Assessment of the Costs and Impacts of Flight Delay in the United States.” (2010, October). Retrieved May 12, 2019, from https://isr.umd.edu/NEXTOR/pubs/TDI_Report_Final_10_18_10_V3.pdf
Route Maps. https://www.expressjet.com/about/route-maps/
Sheather, S. J. (n.d.). Logistic Regression. In A Modern Approach to Regression with R.
Understanding the Reporting of Causes of Flight Delays and Cancellations. (2019, March 29). Retrieved May 12, 2019, from https://www.bts.gov/topics/airlines-and-airports/understanding-reporting-causes-flight-delays-and-cancellations
library(nycflights13)
head(flights)
dim(flights)
[1] 336776 19
head(weather)
summary(weather)
origin year month day hour temp
Length:26115 Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 0.00 Min. : 10.94
Class :character 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 6.00 1st Qu.: 39.92
Mode :character Median :2013 Median : 7.000 Median :16.00 Median :11.00 Median : 55.40
Mean :2013 Mean : 6.504 Mean :15.68 Mean :11.49 Mean : 55.26
3rd Qu.:2013 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:17.00 3rd Qu.: 69.98
Max. :2013 Max. :12.000 Max. :31.00 Max. :23.00 Max. :100.04
NA's :1
dewp humid wind_dir wind_speed wind_gust precip
Min. :-9.94 Min. : 12.74 Min. : 0.0 Min. : 0.000 Min. :16.11 Min. :0.000000
1st Qu.:26.06 1st Qu.: 47.05 1st Qu.:120.0 1st Qu.: 6.905 1st Qu.:20.71 1st Qu.:0.000000
Median :42.08 Median : 61.79 Median :220.0 Median : 10.357 Median :24.17 Median :0.000000
Mean :41.44 Mean : 62.53 Mean :199.8 Mean : 10.518 Mean :25.49 Mean :0.004469
3rd Qu.:57.92 3rd Qu.: 78.79 3rd Qu.:290.0 3rd Qu.: 13.809 3rd Qu.:28.77 3rd Qu.:0.000000
Max. :78.08 Max. :100.00 Max. :360.0 Max. :1048.361 Max. :66.75 Max. :1.210000
NA's :1 NA's :1 NA's :460 NA's :4 NA's :20778
pressure visib time_hour
Min. : 983.8 Min. : 0.000 Min. :2013-01-01 01:00:00
1st Qu.:1012.9 1st Qu.:10.000 1st Qu.:2013-04-01 21:30:00
Median :1017.6 Median :10.000 Median :2013-07-01 14:00:00
Mean :1017.9 Mean : 9.255 Mean :2013-07-01 18:26:37
3rd Qu.:1023.0 3rd Qu.:10.000 3rd Qu.:2013-09-30 13:00:00
Max. :1042.1 Max. :10.000 Max. :2013-12-30 18:00:00
NA's :2729
library(tidyverse)
library(lubridate)
flight_sum <- flights %>%
#group flight cancellation and flight delay into one level
mutate(delay = ifelse(dep_delay >= 15 | is.na(dep_delay) == TRUE, 1, 0),
day_of_week = wday(time_hour, label=TRUE, abbr = FALSE),
carrier = factor(carrier),
origin = factor(origin)) %>%
#select relevant variables and save to a new data table
select(delay, year, month, day, day_of_week, carrier, origin, distance, hour, time_hour)
head(flight_sum)
#Correlation between departure delay and arrival delay, excluding cancelled flights
cor(flights[c("dep_delay", "arr_delay")],use = "pairwise.complete.obs")
dep_delay arr_delay
dep_delay 1.0000000 0.9148028
arr_delay 0.9148028 1.0000000
pairs(flights[c("dep_delay", "arr_delay")])
#Number of flight delays and cancellations in this dataset
table(flight_sum$delay)
0 1
255607 81169
#Proportion of flight delays and cancellations in this dataset
round(table(flight_sum$delay)/nrow(flight_sum),2)
0 1
0.76 0.24
#distribution of flight delays and cancellation group by month
flight_sum %>% group_by(month, delay) %>% summarize(n_delays = n()) %>%
ggplot(aes(x= month, y = n_delays, group = delay, col = factor(delay))) +
geom_col() +
scale_x_discrete(limits=1:12) +
ylab("Flight Count")
#closer look into distribution of flight delays and cancellations by months in 2013
flight_sum %>% filter(delay == 1) %>% group_by(month) %>% summarize(n_delays = n()) %>%
ggplot(aes(x= month, y = n_delays)) +
geom_point() +
geom_line(col = "blue") +
scale_x_discrete(limits=1:12)
#distribution of flight delays and cancellation group by days of month
flight_sum %>% filter(delay == 1) %>% group_by(day) %>% summarize(n_delays = n()) %>%
ggplot(aes(x= day, y = n_delays)) +
geom_point() +
geom_line(col = "blue") +
scale_x_discrete(limits = 1:31)
#distribution of flight delays and cancellations by the day of the week
flight_sum %>% filter(delay == 1) %>% group_by(day_of_week) %>% summarize(n_delays = n()) %>%
ggplot(aes(x= day_of_week, y = n_delays)) +
geom_col()
#flight delay group by hour in a day
flight_sum %>% filter(delay == 1) %>% group_by(hour) %>% summarize(n_delays = n()) %>%
ggplot(aes(x= hour, y = n_delays)) +
geom_point() +
geom_line(col = "blue")
#join tables to get airline names
flight_airline <- left_join(flights, airlines, by= "carrier")
#Plot flight delays at take-off (not including cancellations - NAs) group by airlines
ggplot(flight_airline, aes(x= name, y = dep_delay, col = name)) +
geom_jitter(alpha = 0.5, size = 0.3) +
coord_flip() +
theme(legend.position = "none") +
xlab("Airline Names") +
ylab("Departure Delay (in minutes)")
#Proportion of flight delays by airlines
flight_sum %>% group_by(carrier) %>% summarize(prop.delay = mean(delay==1)) %>% arrange(desc(prop.delay)) %>% left_join(airlines, by = "carrier")
Column `carrier` joining factor and character vector, coercing into character vector
#Number of flights by different airlines
flight_airline %>% mutate(delay_group = case_when(dep_delay <15 ~ "on-time", dep_delay >=15 ~ "delayed", is.na(dep_delay) == TRUE ~ "cancelled")) %>%
ggplot(aes(x = name, fill = delay_group)) +
geom_bar(stat = "count", position = "dodge") +
coord_flip() +
theme(legend.position = "top") +
scale_fill_manual(values = c("on-time" = "deepskyblue4", "delayed" = "yellow", "cancelled" = "red")) +
xlab("Airline Names") +
ylab("Flight Count")
#Joining flights and weather tables for model building
flights_weather <- left_join(x = flight_sum, y = weather, by = c("origin","time_hour", "year", "month", "day", "hour"))
Column `origin` joining factor and character vector, coercing into character vector
#first few rows of the joined table
head(flights_weather)
#Change origin data type to factor
flights_weather$origin <- factor(flights_weather$origin)
#dimension of the joined table
dim(flights_weather)
[1] 336776 19
summary(flights_weather)
delay year month day day_of_week carrier origin
Min. :0.000 Min. :2013 Min. : 1.000 Min. : 1.00 Sunday :46357 UA :58665 EWR:120835
1st Qu.:0.000 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 Monday :50690 B6 :54635 JFK:111279
Median :0.000 Median :2013 Median : 7.000 Median :16.00 Tuesday :50422 EV :54173 LGA:104662
Mean :0.241 Mean :2013 Mean : 6.549 Mean :15.71 Wednesday:50060 DL :48110
3rd Qu.:0.000 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 Thursday :50219 AA :32729
Max. :1.000 Max. :2013 Max. :12.000 Max. :31.00 Friday :50308 MQ :26397
Saturday :38720 (Other):62067
distance hour time_hour temp dewp humid
Min. : 17 Min. : 1.00 Min. :2013-01-01 05:00:00 Min. : 10.94 Min. :-9.94 Min. : 12.74
1st Qu.: 502 1st Qu.: 9.00 1st Qu.:2013-04-04 13:00:00 1st Qu.: 42.08 1st Qu.:26.06 1st Qu.: 43.99
Median : 872 Median :13.00 Median :2013-07-03 10:00:00 Median : 57.20 Median :42.80 Median : 57.73
Mean :1040 Mean :13.18 Mean :2013-07-03 05:22:54 Mean : 57.00 Mean :41.63 Mean : 59.56
3rd Qu.:1389 3rd Qu.:17.00 3rd Qu.:2013-10-01 07:00:00 3rd Qu.: 71.96 3rd Qu.:57.92 3rd Qu.: 75.33
Max. :4983 Max. :23.00 Max. :2013-12-31 23:00:00 Max. :100.04 Max. :78.08 Max. :100.00
NA's :1573 NA's :1573 NA's :1573
wind_dir wind_speed wind_gust precip pressure visib
Min. : 0.0 Min. : 0.000 Min. :16.11 Min. :0.0000 Min. : 983.8 Min. : 0.000
1st Qu.:130.0 1st Qu.: 6.905 1st Qu.:20.71 1st Qu.:0.0000 1st Qu.:1012.7 1st Qu.:10.000
Median :220.0 Median :10.357 Median :24.17 Median :0.0000 Median :1017.5 Median :10.000
Mean :201.5 Mean :11.114 Mean :25.25 Mean :0.0046 Mean :1017.8 Mean : 9.256
3rd Qu.:290.0 3rd Qu.:14.960 3rd Qu.:28.77 3rd Qu.:0.0000 3rd Qu.:1022.8 3rd Qu.:10.000
Max. :360.0 Max. :42.579 Max. :66.75 Max. :1.2100 Max. :1042.1 Max. :10.000
NA's :9796 NA's :1634 NA's :256391 NA's :1556 NA's :38788 NA's :1556
missmap(flights_weather)
the condition has length > 1 and only the first element will be usedUnknown or uninitialised column: 'arguments'.Unknown or uninitialised column: 'arguments'.Unknown or uninitialised column: 'imputations'.
#Removing some columns
flights_weather$wind_gust <- NULL
flights_weather$year <- NULL
flights_weather$time_hour <- NULL
flights_weather$wind_dir <- NULL
#Dimension of the table after removing columns
dim(flights_weather)
[1] 336776 15
#Removing NA values
flights_weather_rm <- na.omit(flights_weather)
#Dimension of data table
dim(flights_weather_rm)
[1] 297924 15
# exploring relationships among features: correlation matrix
round(cor(flights_weather_rm[,which(sapply(flights_weather_rm,is.numeric))]), 3)
delay month day distance hour temp dewp humid wind_speed precip pressure visib
delay 1.000 -0.024 0.006 -0.043 0.226 0.056 0.080 0.079 0.053 0.058 -0.126 -0.067
month -0.024 1.000 0.008 0.022 0.001 0.264 0.268 0.075 -0.133 0.000 0.090 0.039
day 0.006 0.008 1.000 0.003 -0.010 0.013 -0.005 -0.037 -0.012 -0.003 0.009 0.030
distance -0.043 0.022 0.003 1.000 -0.019 0.008 0.022 0.036 0.014 0.000 0.011 -0.005
hour 0.226 0.001 -0.010 -0.019 1.000 0.098 0.008 -0.162 0.120 0.010 -0.078 0.058
temp 0.056 0.264 0.013 0.008 0.098 1.000 0.892 0.060 -0.150 -0.026 -0.246 0.057
dewp 0.080 0.268 -0.005 0.022 0.008 0.892 1.000 0.496 -0.245 0.051 -0.278 -0.132
humid 0.079 0.075 -0.037 0.036 -0.162 0.060 0.496 1.000 -0.261 0.199 -0.160 -0.481
wind_speed 0.053 -0.133 -0.012 0.014 0.120 -0.150 -0.245 -0.261 1.000 0.005 -0.215 0.105
precip 0.058 0.000 -0.003 0.000 0.010 -0.026 0.051 0.199 0.005 1.000 -0.088 -0.358
pressure -0.126 0.090 0.009 0.011 -0.078 -0.246 -0.278 -0.160 -0.215 -0.088 1.000 0.108
visib -0.067 0.039 0.030 -0.005 0.058 0.057 -0.132 -0.481 0.105 -0.358 0.108 1.000
Visualization of predictors and response variable relationship:
flights_weather_rm %>% mutate(delay = factor(delay)) %>%
ggplot(aes(x = delay, y=temp, col = delay)) +
geom_boxplot()
#Scatter plot between temperature and dewpoint - strong positive correlation
flights_weather_rm %>% mutate(delay = factor(delay)) %>%
ggplot(aes(x = temp, y=dewp, col = delay)) +
geom_point(alpha=0.5, size = 0.3)
#scatter plot between temperature and wind speed
flights_weather_rm %>% mutate(delay = factor(delay)) %>%
ggplot(aes(x = temp, y=wind_speed, col = delay)) +
geom_point(alpha=0.5, size = 0.5)
flights_weather_rm %>% mutate(delay = factor(delay)) %>%
ggplot(aes(y = temp, x=month, col = delay)) +
geom_jitter() +
scale_x_discrete(limits = 1:12)
flights_weather_rm %>% mutate(delay = factor(delay)) %>%
ggplot(aes(y = wind_speed, x=month, col = delay)) +
geom_jitter() +
scale_x_discrete(limits = 1:12)
flights_weather_rm %>% mutate(delay = factor(delay)) %>%
ggplot(aes(x = delay, y=precip, col = delay)) +
geom_boxplot()
flights_weather_rm %>% mutate(delay = factor(delay)) %>%
ggplot(aes(x = delay, y=humid, col = delay)) +
geom_boxplot()
flights_weather_rm %>% mutate(delay = factor(delay)) %>%
ggplot(aes(x = delay, y=wind_speed, col = delay)) +
geom_boxplot()
flights_weather_rm %>% mutate(delay = factor(delay)) %>%
ggplot(aes(x = delay, y=visib, col = delay)) +
geom_boxplot()
flights_weather_rm %>% mutate(delay = factor(delay)) %>%
ggplot(aes(x = delay, y=pressure, col = delay)) +
geom_boxplot()
dim(train)
[1] 208547 15
#Use all predictors in the model
mod.full <- glm(delay ~ ., data = train, family = binomial)
summary(mod.full)
Call:
glm(formula = delay ~ ., family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1357 -0.7313 -0.5367 -0.3333 2.7282
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.911e+01 9.011e-01 21.211 < 2e-16 ***
month -2.266e-02 1.799e-03 -12.594 < 2e-16 ***
day 3.454e-03 6.341e-04 5.446 5.15e-08 ***
day_of_week.L 2.903e-02 1.545e-02 1.880 0.060175 .
day_of_week.Q -1.745e-01 1.549e-02 -11.264 < 2e-16 ***
day_of_week.C -1.914e-01 1.494e-02 -12.813 < 2e-16 ***
day_of_week^4 -2.252e-01 1.445e-02 -15.586 < 2e-16 ***
day_of_week^5 1.201e-01 1.434e-02 8.376 < 2e-16 ***
day_of_week^6 4.142e-02 1.447e-02 2.863 0.004199 **
carrierAA -5.097e-01 3.145e-02 -16.208 < 2e-16 ***
carrierAS -1.024e+00 1.508e-01 -6.789 1.13e-11 ***
carrierB6 -2.412e-01 2.633e-02 -9.160 < 2e-16 ***
carrierDL -6.379e-01 2.915e-02 -21.882 < 2e-16 ***
carrierEV 2.557e-01 2.903e-02 8.809 < 2e-16 ***
carrierF9 3.575e-02 1.180e-01 0.303 0.761848
carrierFL 4.110e-02 5.839e-02 0.704 0.481475
carrierHA -1.060e+00 2.924e-01 -3.626 0.000288 ***
carrierMQ -1.904e-01 3.051e-02 -6.241 4.34e-10 ***
carrierOO 1.911e-01 4.945e-01 0.386 0.699180
carrierUA -3.211e-01 3.150e-02 -10.195 < 2e-16 ***
carrierUS -7.356e-01 3.620e-02 -20.322 < 2e-16 ***
carrierVX -4.073e-01 5.669e-02 -7.185 6.74e-13 ***
carrierWN 1.245e-01 3.796e-02 3.281 0.001035 **
carrierYV -3.700e-03 1.199e-01 -0.031 0.975392
originJFK -2.586e-01 1.948e-02 -13.278 < 2e-16 ***
originLGA -9.342e-02 1.713e-02 -5.452 4.98e-08 ***
distance 1.241e-05 9.944e-06 1.248 0.212206
hour 1.267e-01 1.272e-03 99.643 < 2e-16 ***
temp 2.658e-02 3.365e-03 7.899 2.81e-15 ***
dewp -2.287e-02 3.615e-03 -6.326 2.51e-10 ***
humid 2.534e-02 1.846e-03 13.729 < 2e-16 ***
wind_speed 2.871e-02 1.136e-03 25.279 < 2e-16 ***
precip 1.882e+00 4.043e-01 4.655 3.23e-06 ***
pressure -2.339e-02 8.503e-04 -27.508 < 2e-16 ***
visib -2.961e-02 4.595e-03 -6.443 1.17e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 221757 on 208546 degrees of freedom
Residual deviance: 200783 on 208512 degrees of freedom
AIC: 200853
Number of Fisher Scoring iterations: 4
#Remove collinearity by dropping *dewp* variable from the model
mod.red <- glm(delay ~ month + day + day_of_week + carrier + origin +
distance + hour + temp + humid + wind_speed +
precip + pressure + visib, family = binomial, data = train)
summary(mod.red)
Call:
glm(formula = delay ~ month + day + day_of_week + carrier + origin +
distance + hour + temp + humid + wind_speed + precip + pressure +
visib, family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1455 -0.7321 -0.5359 -0.3341 2.7213
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.040e+01 8.770e-01 23.261 < 2e-16 ***
month -2.310e-02 1.797e-03 -12.858 < 2e-16 ***
day 3.460e-03 6.340e-04 5.458 4.81e-08 ***
day_of_week.L 2.863e-02 1.545e-02 1.853 0.063847 .
day_of_week.Q -1.729e-01 1.549e-02 -11.160 < 2e-16 ***
day_of_week.C -1.887e-01 1.493e-02 -12.641 < 2e-16 ***
day_of_week^4 -2.256e-01 1.445e-02 -15.613 < 2e-16 ***
day_of_week^5 1.195e-01 1.434e-02 8.332 < 2e-16 ***
day_of_week^6 4.604e-02 1.445e-02 3.187 0.001436 **
carrierAA -5.091e-01 3.144e-02 -16.194 < 2e-16 ***
carrierAS -1.022e+00 1.508e-01 -6.776 1.24e-11 ***
carrierB6 -2.400e-01 2.632e-02 -9.118 < 2e-16 ***
carrierDL -6.373e-01 2.914e-02 -21.870 < 2e-16 ***
carrierEV 2.557e-01 2.902e-02 8.810 < 2e-16 ***
carrierF9 3.869e-02 1.180e-01 0.328 0.742973
carrierFL 4.283e-02 5.840e-02 0.733 0.463314
carrierHA -1.069e+00 2.924e-01 -3.658 0.000254 ***
carrierMQ -1.906e-01 3.050e-02 -6.250 4.09e-10 ***
carrierOO 1.728e-01 4.947e-01 0.349 0.726838
carrierUA -3.206e-01 3.149e-02 -10.181 < 2e-16 ***
carrierUS -7.347e-01 3.619e-02 -20.300 < 2e-16 ***
carrierVX -4.070e-01 5.666e-02 -7.183 6.83e-13 ***
carrierWN 1.256e-01 3.796e-02 3.310 0.000934 ***
carrierYV -1.705e-03 1.200e-01 -0.014 0.988664
originJFK -2.567e-01 1.947e-02 -13.185 < 2e-16 ***
originLGA -9.867e-02 1.712e-02 -5.765 8.15e-09 ***
distance 1.226e-05 9.940e-06 1.233 0.217519
hour 1.269e-01 1.271e-03 99.848 < 2e-16 ***
temp 5.399e-03 3.420e-04 15.788 < 2e-16 ***
humid 1.392e-02 3.765e-04 36.981 < 2e-16 ***
wind_speed 2.850e-02 1.135e-03 25.111 < 2e-16 ***
precip 1.998e+00 4.048e-01 4.936 7.99e-07 ***
pressure -2.363e-02 8.489e-04 -27.835 < 2e-16 ***
visib -4.166e-02 4.177e-03 -9.973 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 221757 on 208546 degrees of freedom
Residual deviance: 200823 on 208513 degrees of freedom
AIC: 200891
Number of Fisher Scoring iterations: 4
#Adding quadratic term
mod.red2 <- glm(delay ~ month + day + day_of_week + carrier + origin + distance + hour + temp + I(temp^2) + humid + wind_speed + precip + pressure + visib, family = binomial, data = train)
summary(mod.red2)
Call:
glm(formula = delay ~ month + day + day_of_week + carrier + origin +
distance + hour + temp + I(temp^2) + humid + wind_speed +
precip + pressure + visib, family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1143 -0.7273 -0.5281 -0.3284 2.7509
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.344e+01 8.903e-01 26.327 < 2e-16 ***
month -1.483e-02 1.805e-03 -8.215 < 2e-16 ***
day 3.349e-03 6.364e-04 5.263 1.42e-07 ***
day_of_week.L 2.452e-02 1.552e-02 1.580 0.114020
day_of_week.Q -1.529e-01 1.556e-02 -9.827 < 2e-16 ***
day_of_week.C -1.942e-01 1.499e-02 -12.955 < 2e-16 ***
day_of_week^4 -2.347e-01 1.450e-02 -16.188 < 2e-16 ***
day_of_week^5 1.266e-01 1.440e-02 8.792 < 2e-16 ***
day_of_week^6 4.656e-02 1.449e-02 3.214 0.001311 **
carrierAA -5.184e-01 3.153e-02 -16.443 < 2e-16 ***
carrierAS -1.034e+00 1.514e-01 -6.830 8.52e-12 ***
carrierB6 -2.488e-01 2.639e-02 -9.431 < 2e-16 ***
carrierDL -6.419e-01 2.922e-02 -21.966 < 2e-16 ***
carrierEV 2.614e-01 2.913e-02 8.974 < 2e-16 ***
carrierF9 9.279e-03 1.184e-01 0.078 0.937513
carrierFL 3.865e-02 5.866e-02 0.659 0.509995
carrierHA -1.093e+00 2.925e-01 -3.738 0.000186 ***
carrierMQ -1.937e-01 3.060e-02 -6.329 2.47e-10 ***
carrierOO 1.713e-01 4.987e-01 0.344 0.731213
carrierUA -3.259e-01 3.160e-02 -10.312 < 2e-16 ***
carrierUS -7.414e-01 3.630e-02 -20.423 < 2e-16 ***
carrierVX -4.099e-01 5.683e-02 -7.213 5.47e-13 ***
carrierWN 1.261e-01 3.811e-02 3.310 0.000934 ***
carrierYV -2.884e-02 1.206e-01 -0.239 0.810937
originJFK -2.213e-01 1.957e-02 -11.305 < 2e-16 ***
originLGA -7.786e-02 1.721e-02 -4.524 6.06e-06 ***
distance 1.331e-05 9.972e-06 1.334 0.182141
hour 1.286e-01 1.274e-03 100.924 < 2e-16 ***
temp -6.644e-02 2.002e-03 -33.190 < 2e-16 ***
I(temp^2) 6.155e-04 1.694e-05 36.327 < 2e-16 ***
humid 1.593e-02 3.869e-04 41.164 < 2e-16 ***
wind_speed 2.532e-02 1.143e-03 22.145 < 2e-16 ***
precip 2.266e+00 4.042e-01 5.605 2.09e-08 ***
pressure -2.498e-02 8.580e-04 -29.107 < 2e-16 ***
visib -3.663e-02 4.190e-03 -8.742 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 221757 on 208546 degrees of freedom
Residual deviance: 199535 on 208512 degrees of freedom
AIC: 199605
Number of Fisher Scoring iterations: 4
#Try stepwise backward variable selection approach- no improvement
mod.step <- step(mod.red2, trace = F)
summary(mod.step)
Call:
glm(formula = delay ~ month + day + day_of_week + carrier + origin +
hour + temp + I(temp^2) + humid + wind_speed + precip + pressure +
visib, family = binomial, data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1129 -0.7273 -0.5282 -0.3284 2.7493
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.344e+01 8.903e-01 26.327 < 2e-16 ***
month -1.475e-02 1.804e-03 -8.179 2.87e-16 ***
day 3.351e-03 6.364e-04 5.266 1.40e-07 ***
day_of_week.L 2.448e-02 1.552e-02 1.578 0.114637
day_of_week.Q -1.529e-01 1.556e-02 -9.827 < 2e-16 ***
day_of_week.C -1.943e-01 1.499e-02 -12.961 < 2e-16 ***
day_of_week^4 -2.347e-01 1.450e-02 -16.186 < 2e-16 ***
day_of_week^5 1.267e-01 1.440e-02 8.794 < 2e-16 ***
day_of_week^6 4.657e-02 1.449e-02 3.214 0.001308 **
carrierAA -5.052e-01 2.994e-02 -16.875 < 2e-16 ***
carrierAS -1.006e+00 1.500e-01 -6.710 1.95e-11 ***
carrierB6 -2.414e-01 2.579e-02 -9.361 < 2e-16 ***
carrierDL -6.303e-01 2.790e-02 -22.593 < 2e-16 ***
carrierEV 2.649e-01 2.902e-02 9.128 < 2e-16 ***
carrierF9 2.867e-02 1.175e-01 0.244 0.807161
carrierFL 4.531e-02 5.845e-02 0.775 0.438238
carrierHA -1.035e+00 2.893e-01 -3.579 0.000345 ***
carrierMQ -1.901e-01 3.048e-02 -6.237 4.46e-10 ***
carrierOO 1.749e-01 4.987e-01 0.351 0.725781
carrierUA -3.101e-01 2.929e-02 -10.586 < 2e-16 ***
carrierUS -7.377e-01 3.620e-02 -20.378 < 2e-16 ***
carrierVX -3.836e-01 5.330e-02 -7.197 6.14e-13 ***
carrierWN 1.361e-01 3.737e-02 3.642 0.000270 ***
carrierYV -2.606e-02 1.205e-01 -0.216 0.828811
originJFK -2.175e-01 1.937e-02 -11.229 < 2e-16 ***
originLGA -8.005e-02 1.713e-02 -4.672 2.98e-06 ***
hour 1.286e-01 1.274e-03 100.921 < 2e-16 ***
temp -6.643e-02 2.002e-03 -33.183 < 2e-16 ***
I(temp^2) 6.154e-04 1.694e-05 36.323 < 2e-16 ***
humid 1.592e-02 3.869e-04 41.162 < 2e-16 ***
wind_speed 2.533e-02 1.143e-03 22.156 < 2e-16 ***
precip 2.266e+00 4.043e-01 5.605 2.09e-08 ***
pressure -2.497e-02 8.580e-04 -29.104 < 2e-16 ***
visib -3.661e-02 4.190e-03 -8.739 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 221757 on 208546 degrees of freedom
Residual deviance: 199537 on 208513 degrees of freedom
AIC: 199605
Number of Fisher Scoring iterations: 4
length(coef(mod.step))
[1] 34
#Deviance goodness-of-fit test between final model and null model
pchisq(mod.step$deviance,mod.step$df.residual,lower=FALSE)
[1] 1
#prediction result for reduced model (removed collinearity) - include quadratic term
pred.step <- predict(mod.step, newdata = test, type = "response")
head(pred.step)
1 2 3 4 5 6
0.09250440 0.09332772 0.09594710 0.10049122 0.06710848 0.08892616
#Probability distribution of prediction
hist(pred.step)
#Comparing Classify results
#Reduced model
pred.class <- factor(ifelse(pred.step >=0.5, 1, 0))
head(pred.class)
1 2 3 4 5 6
0 0 0 0 0 0
Levels: 0 1
table(pred.class, test$delay)
pred.class 0 1
0 67820 17317
1 1682 2558
#Confusion Matrix results
library(caret)
confusionMatrix(pred.class, factor(test$delay))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 67820 17317
1 1682 2558
Accuracy : 0.7874
95% CI : (0.7847, 0.7901)
No Information Rate : 0.7776
P-Value [Acc > NIR] : 7.243e-13
Kappa : 0.1453
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9758
Specificity : 0.1287
Pos Pred Value : 0.7966
Neg Pred Value : 0.6033
Prevalence : 0.7776
Detection Rate : 0.7588
Detection Prevalence : 0.9526
Balanced Accuracy : 0.5523
'Positive' Class : 0
library(pROC)
#Area under the ROC curve
roc_step <- roc(test$delay, pred.step)
roc_step$auc
Area under the curve: 0.7144
Random Forest - AUC: 0.74
library(randomForest)
mod.rf <- randomForest(delay ~ ., data = train, prob=TRUE)
plot(mod.rf)
importance(mod.rf)
pred.rf <- predict(mod.rf, newdata = test, type = "prob")
head(pred.rf)
roc_rf <- roc(test$delay, pred.rf[,1])
roc_rf$auc