Intro

Thousands of flights operate within the continental United States every day. Many of them are delayed, and these delays cost the economy substantial sums. If the delays follow a pattern, understanding it would help airline carriers, airports, city, state, and federal government entities, and business and individual travelers either address the causes or avoid travel when delays are most likely. Some questions of interest are:
1. What is the biggest contributor to flight delays?
2. Which holiday month (July, November, or December) is the worst for travel?
3. What is the most efficient airport?
4. Are airports generally getting more efficient over time?

Data & Packages

The data was collected from the Bureau of Transportation Statistics and loaded into R from a CSV file hosted on GitHub:
https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp?pn=1

#Load Packages
library(ggplot2)
library(kableExtra)
library(scales)
library(dplyr)
library(tidyr)

#Load Raw Data
flights_raw <- 
  read.csv("https://raw.githubusercontent.com/dhairavc/DATA606/master/flights_delays.csv")

head(flights_raw)

Data Preparation & Transformation

#Update column names
flights_raw <- flights_raw[, 1:21]
names(flights_raw) <- c("year","month","carrier","carrier_name","airport","airport_name",
                        "arr_flights","arr_del15","carrier_ct","weather_ct","nas_ct",
                        "security_ct","late_aircraft_ct","arr_cancelled","arr_diverted",
                        "arr_delay","carrier_delay","weather_delay","nas_delay",
                        "security_delay","late_aircraft_delay")


#Calculate percentage of flights delayed per observation
flights_raw <- flights_raw %>% mutate(del_pct = arr_del15/arr_flights) 

#Add a categorical variable called month name
month <- 1:12
month_nm <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", 
              "Aug", "Sep", "Oct", "Nov","Dec")
months <- data.frame(month, month_nm)
flights_raw <- left_join(flights_raw, months, by = "month")

#Data descriptions
Field <- c("year",  "month",    "carrier",  "carrier_name", "airport",  "airport_name", 
           "arr_flights",   "arr_del15",    "carrier_ct",   "weather_ct",   "nas_ct",   
           "security_ct",   "late_aircraft_ct", "arr_cancelled",    "arr_diverted", 
           "arr_delay", "carrier_delay",    "weather_delay",    "nas_delay",    
           "security_delay",    "late_aircraft_delay")

Description <- c("Year (yyyy)", "Month (mm)",   "Airline carrier abbreviation", 
                 "Airline carrier name",    "Airport Code", "Airport Name", 
                 "Total number of arriving flights in the observation", 
                 "Total number of delayed flights in the observation",  
                 "Number of flights delayed due to air carrier (subset of arr_del15)",  
                 "Number of flights delayed due to weather (subset of arr_del15)",  
                 "Number of flights delayed due to National Aviation System (subset of arr_del15)", 
                 "Number of flights delayed due to airport security (subset of arr_del15)", 
                 "Number of flights delayed due to a previous flight using the same aircraft being late",   
                 "Number of cancelled flights", 
                 "Number of flights diverted",  
                 "Arrival delay in minutes",    
                 "Carrier delay in minutes (subset of arr_delay)",  
                 "Weather delayed in minutes (subset of arr_delay)",    
                 "National Aviation System in minutes (subset of arr_delay)",   
                 "Security delay in minutes (subset of arr_delay)", 
                 "Aircraft delay in minutes (subset of arr_delay)")

VariableType <- c("Qualitative",    "Qualitative",  "Qualitative",  "Qualitative",  
                  "Qualitative",    "Qualitative",  "Quantitative", "Quantitative", 
                  "Quantitative",   "Quantitative", "Quantitative", "Quantitative", 
                  "Quantitative",   "Quantitative", "Quantitative", "Quantitative", 
                  "Quantitative",   "Quantitative", "Quantitative", "Quantitative", 
                  "Quantitative") 

VariableMeasure <- c("Independent", "Independent",  "Independent",  "Independent",  
                     "Independent", "Independent",  "Independent",  "Response", 
                     "Explanatory", "Explanatory",  "Explanatory",  "Explanatory",  
                     "Explanatory", "Independent",  "Independent",  "Response", 
                     "Explanatory", "Explanatory",  "Explanatory",  "Explanatory",  
                     "Explanatory")

FieldDefinitions <- data.frame(Field, VariableType, VariableMeasure, Description)
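
The delay percentage derived in this section is a simple ratio per observation. A minimal sketch on a made-up data frame (`toy` and its values are illustrative assumptions, not rows from the BTS file) shows the calculation and how missing counts propagate:

```r
# Toy data: made-up rows that reuse the report's column names
toy <- data.frame(
  airport     = c("AAA", "BBB", "CCC"),
  arr_flights = c(200, 50, NA),
  arr_del15   = c(40, 5, NA)
)

# Share of arriving flights that were delayed, per observation
toy$del_pct <- toy$arr_del15 / toy$arr_flights
toy$del_pct

# Rows with missing counts stay NA, so summaries need na.rm = TRUE
mean(toy$del_pct, na.rm = TRUE)
```

Any later step that averages or samples `del_pct` has to handle these `NA` values explicitly.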

About the Study

This is an observational study. The creators of the data set observed flight arrivals and recorded the total number of flights along with several data points on the flights that arrived late. The population of interest is all flights arriving within the continental United States. From the subset of flight arrival records that we get from DOT, we would like to make inferences about that entire population of flights.

Dependent Variable

The dependent variables are arr_del15 and arr_delay

FieldDefinitions %>% filter(VariableMeasure == "Response") %>% kable() %>% kable_styling()
Field      VariableType  VariableMeasure  Description
arr_del15  Quantitative  Response         Total number of delayed flights in the observation
arr_delay  Quantitative  Response         Arrival delay in minutes

Independent Variable

FieldDefinitions %>% filter(VariableMeasure == "Independent") %>% kable() %>% 
  kable_styling()
Field          VariableType  VariableMeasure  Description
year           Qualitative   Independent      Year (yyyy)
month          Qualitative   Independent      Month (mm)
carrier        Qualitative   Independent      Airline carrier abbreviation
carrier_name   Qualitative   Independent      Airline carrier name
airport        Qualitative   Independent      Airport Code
airport_name   Qualitative   Independent      Airport Name
arr_flights    Quantitative  Independent      Total number of arriving flights in the observation
arr_cancelled  Quantitative  Independent      Number of cancelled flights
arr_diverted   Quantitative  Independent      Number of flights diverted

Exploratory Analysis

Below is some exploratory analysis that gives a general picture of the data.

Cases and Summary Statistics

There are 68,153 observations

#Observations count
nrow(flights_raw)
## [1] 68153
#summary statistics for all quantitative variables 
flights_raw %>% select(arr_flights, arr_del15, carrier_ct, weather_ct, nas_ct, security_ct, 
                       late_aircraft_ct, arr_cancelled, arr_diverted, arr_delay, 
                       carrier_delay, weather_delay, nas_delay, security_delay, 
                       late_aircraft_delay) %>% summary()
##   arr_flights      arr_del15        carrier_ct        weather_ct     
##  Min.   :    1   Min.   :   0.0   Min.   :   0.00   Min.   :  0.000  
##  1st Qu.:  126   1st Qu.:  27.0   1st Qu.:   8.01   1st Qu.:  0.000  
##  Median :  333   Median :  71.0   Median :  20.63   Median :  1.500  
##  Mean   : 1002   Mean   : 198.1   Mean   :  48.13   Mean   :  6.386  
##  3rd Qu.:  861   3rd Qu.: 180.0   3rd Qu.:  49.65   3rd Qu.:  5.630  
##  Max.   :21977   Max.   :6377.0   Max.   :1792.07   Max.   :641.540  
##  NA's   :36      NA's   :40       NA's   :36        NA's   :36       
##      nas_ct         security_ct      late_aircraft_ct  arr_cancelled    
##  Min.   :  -0.01   Min.   : 0.0000   Min.   :   0.00   Min.   :   0.00  
##  1st Qu.:   8.60   1st Qu.: 0.0000   1st Qu.:   4.91   1st Qu.:   0.00  
##  Median :  25.20   Median : 0.0000   Median :  16.67   Median :   3.00  
##  Mean   :  77.24   Mean   : 0.4423   Mean   :  65.91   Mean   :  17.35  
##  3rd Qu.:  68.55   3rd Qu.: 0.1300   3rd Qu.:  53.65   3rd Qu.:  12.00  
##  Max.   :4091.27   Max.   :80.5600   Max.   :1885.47   Max.   :1389.00  
##  NA's   :36        NA's   :36        NA's   :36        NA's   :36       
##   arr_diverted       arr_delay      carrier_delay    weather_delay    
##  Min.   :  0.000   Min.   :     0   Min.   :     0   Min.   :    0.0  
##  1st Qu.:  0.000   1st Qu.:  1320   1st Qu.:   397   1st Qu.:    0.0  
##  Median :  0.000   Median :  3745   Median :  1111   Median :   98.0  
##  Mean   :  2.375   Mean   : 11674   Mean   :  3105   Mean   :  541.6  
##  3rd Qu.:  2.000   3rd Qu.: 10218   3rd Qu.:  2874   3rd Qu.:  442.0  
##  Max.   :256.000   Max.   :433687   Max.   :196944   Max.   :57707.0  
##  NA's   :36        NA's   :36       NA's   :36       NA's   :36       
##    nas_delay      security_delay    late_aircraft_delay
##  Min.   :   -19   Min.   :   0.00   Min.   :     0     
##  1st Qu.:   323   1st Qu.:   0.00   1st Qu.:   276     
##  Median :  1010   Median :   0.00   Median :  1057     
##  Mean   :  3729   Mean   :  17.72   Mean   :  4280     
##  3rd Qu.:  2996   3rd Qu.:   5.00   3rd Qu.:  3565     
##  Max.   :238440   Max.   :3194.00   Max.   :148181     
##  NA's   :36       NA's   :36        NA's   :36

Arrival vs. Delays

Excluding 2003 and 2019, the number of delays appears to have peaked in 2007, and in general roughly 15%-20% of all flights are delayed.

#Barchart of arriving and delayed flights
#year is numeric, so the x axis uses a continuous scale with one break per year
flights_raw %>% select(year, arr_flights, arr_del15) %>% drop_na() %>% group_by(year) %>% 
  summarise_all(sum) %>% gather(key = "Type", "NumCount", 2:3) %>% 
  ggplot( aes(x=year, y=NumCount, fill=Type)) + geom_col(position = 'dodge') + 
  scale_y_continuous(labels = comma) + scale_x_continuous(breaks=2003:2019) 
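
The yearly share of delayed flights can also be computed directly rather than read off the chart. A minimal sketch, using a made-up data frame `toy` with the report's column names and base R's `aggregate()` standing in for the dplyr pipeline:

```r
# Toy data: two observations per year with made-up counts
toy <- data.frame(
  year        = c(2007, 2007, 2008, 2008),
  arr_flights = c(1000, 500, 800, 200),
  arr_del15   = c(250, 50, 120, 30)
)

# Sum flights and delayed flights within each year
by_year <- aggregate(cbind(arr_flights, arr_del15) ~ year, data = toy, FUN = sum)

# Share of flights delayed in each year
by_year$del_share <- by_year$arr_del15 / by_year$arr_flights
by_year
```

Summing the counts first and dividing afterwards weights each observation by its flight volume, which differs from averaging the per-observation percentages.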

Volume per Airport

Looking at the number of flights per airport, Atlanta International handles the most flights, with Orlando and Dallas Fort Worth second and third.

#Barchart of volume per airport
flights_raw %>% select(airport, arr_flights) %>% drop_na() %>% group_by(airport) %>% 
  dplyr::summarise(TotalFlights = sum(arr_flights)) %>% 
  ggplot( aes(x=reorder(airport, -TotalFlights), y=TotalFlights, fill=TotalFlights))+
  geom_col() + coord_flip() + scale_y_continuous(labels = comma) + xlab("Airport") +
  scale_fill_continuous(labels = comma) + theme(legend.position="bottom")

Delays by Month

Delays are similar across the months, with a slight increase in the summer months of Jun, Jul, and Aug. This could, however, simply reflect the larger number of flights in those months.
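
One way to separate a genuine seasonal effect from a volume effect is to divide delayed flights by total flights within each month. A minimal sketch on made-up counts (`toy` is an illustrative assumption), using base R's built-in `month.abb` vector to keep the months in calendar order:

```r
# Toy data: made-up monthly observations with the report's column names
toy <- data.frame(
  month       = c(1, 7, 7, 12),
  arr_flights = c(100, 300, 100, 200),
  arr_del15   = c(20, 90, 30, 50)
)

# month.abb is base R's "Jan", ..., "Dec"; using it as factor levels
# keeps calendar order instead of alphabetical order
toy$month_nm <- factor(month.abb[toy$month], levels = month.abb)

# Delayed share per month: total delayed flights over total flights
delayed   <- tapply(toy$arr_del15, toy$month_nm, sum)
flights   <- tapply(toy$arr_flights, toy$month_nm, sum)
del_share <- delayed / flights

# Months with no observations in the toy data come out as NA
del_share[c("Jan", "Jul", "Dec")]
```

Storing the month as a factor with explicit levels also removes the need to pass the month order to each plot separately.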

#Flight Delays by Month
#Sum the delayed-flight counts; n() would only count the rows per month
by_month <- flights_raw %>% select(month, arr_del15) %>% drop_na() %>% group_by(month) %>% 
  dplyr::summarize(delayed = sum(arr_del15)) 

by_month$month <- recode(by_month$month, `1`="Jan", `2`="Feb", `3`="Mar", `4`="Apr",
                         `5`="May", `6`="Jun", `7`="Jul", `8`="Aug", `9`="Sep", 
                         `10`="Oct", `11`="Nov", `12`="Dec")

by_month %>% ggplot( aes(x = month, y=delayed)) + geom_bar(stat="identity") + 
  scale_x_discrete(limits=c("Jan", "Feb","Mar", "Apr", "May", "Jun", "Jul", 
                            "Aug", "Sep", "Oct", "Nov", "Dec"))

Delays by Airline per Airport

Not all airline carriers operate at all airports. The heatmap below shows that ATA Airlines has a large percentage of its flights delayed at the JFK and ORD airports.

#Most Delayed Carrier per Airport by Percentage
flights_raw %>% select(carrier_name, airport, arr_del15, arr_flights) %>% drop_na() %>% 
  group_by(airport, carrier_name) %>% dplyr::summarize_all(sum) %>% 
  mutate(del_pct = arr_del15/arr_flights) %>% 
  ggplot(aes(x=airport, y=carrier_name, fill=del_pct)) + geom_tile() + 
  theme(axis.text.x=element_text(angle=90, hjust=1)) + 
  scale_fill_gradient(low = "pink", high = "red") + 
  theme(legend.position="bottom")

Reasons for Delays

Over the years, weather and security delays have stayed fairly stable, while the other causes have fluctuated, peaking in 2007.

#Delayed Reasons by Year
del_by_year <- flights_raw %>% select(year, carrier_ct, weather_ct, nas_ct, security_ct, 
                                      late_aircraft_ct) %>% drop_na() %>% 
  group_by(year) %>% dplyr::summarise_all(sum) %>% 
  gather(key = del_reason, value = del_count, carrier_ct:late_aircraft_ct) 

#Columns are mapped by name inside aes() rather than with del_by_year$...
ggplot(del_by_year, aes(x=year, y=del_count, group=del_reason, color=del_reason)) + 
  geom_line(size=1) + geom_point() + xlab("Year") + ylab("Count") +
  scale_y_continuous(labels = comma) + 
  theme(legend.position="bottom") + 
  scale_x_continuous(breaks=2003:2019)
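
To compare the delay causes on one scale, the counts can be converted to shares of all delayed flights. A minimal sketch with made-up totals (the numbers in `cause_counts` are illustrative assumptions, not values from the dataset):

```r
# Made-up delay counts per cause, named after the report's *_ct columns
cause_counts <- c(carrier_ct = 50, weather_ct = 5, nas_ct = 80,
                  security_ct = 1, late_aircraft_ct = 64)

# Share of all delayed flights attributed to each cause
cause_share <- cause_counts / sum(cause_counts)
round(sort(cause_share, decreasing = TRUE), 3)

# Name of the largest contributor in this toy example
names(which.max(cause_share))
```

Applied to the column sums of the real `*_ct` fields, the same two lines would rank the causes for any year or for the whole period.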

Number of Delays and Time Spent

There is a strong positive correlation between the number of delayed flights and the total delay minutes, which indicates that delay time grows roughly linearly with the number of delays.

#Delayed Time Spent
flights_raw %>% ggplot( aes(x=arr_del15, y=arr_delay)) + geom_point(size=1) + 
  scale_y_continuous(labels = comma) + geom_smooth(method = lm) + 
  ylab("Arrival Delay Minutes") + xlab("Delayed Arrival Count")
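
The strength of this relationship can be quantified with a correlation coefficient and the slope of a simple linear fit. A minimal sketch on made-up values (`toy` is an illustrative assumption; on the real data the same calls would take `flights_raw` with missing rows removed):

```r
# Toy data: made-up delayed-flight counts and total delay minutes
toy <- data.frame(
  arr_del15 = c(10, 20, 30, 40, 50),
  arr_delay = c(520, 1010, 1490, 2030, 2480)
)

# Pearson correlation between number of delays and delay minutes
r <- cor(toy$arr_del15, toy$arr_delay)

# Slope of the fitted line: extra delay minutes per additional delayed flight
fit   <- lm(arr_delay ~ arr_del15, data = toy)
slope <- coef(fit)[["arr_del15"]]

c(correlation = r, slope = slope)
```

The slope reads as the average number of delay minutes added by each additional delayed flight.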

Number of Flights and Number of Delays

Observations become sparse once the number of arriving flights exceeds 10,000. This is likely because few airports handle that many flights.

#Arriving Flights vs. Delayed Flights
flights_raw %>% ggplot( aes(x=arr_flights, y=arr_del15)) + geom_point(size=1) + 
  scale_y_continuous(labels = comma) + geom_smooth(method = lm) + 
  ylab("Delayed Arrival Count") + xlab("Arriving Flights Count")

Delayed Airports

The plot below shows every observed delay percentage per airport. Note that SFO, LGA, and EWR have higher delay percentages.

#Delayed Airports
#xlab() labels the airport aesthetic and ylab() the percentage, even after coord_flip()
ggplot(flights_raw, aes(x=airport, y=del_pct)) + 
  geom_jitter(aes(colour = airport), size=1) + coord_flip() + 
  theme(legend.position='none') + xlab("Airports") + 
  ylab("Percentage Delayed")

Delayed Airline Carriers

The plot below shows every observed delay percentage per airline carrier. Note that the larger carriers, such as United, American Airlines, American Eagle, and Virgin, do not frequently show high delay percentages.

#Delayed Airlines
#xlab() labels the carrier aesthetic and ylab() the percentage, even after coord_flip()
ggplot(flights_raw, aes(x=carrier_name, y=del_pct)) + 
  geom_jitter(aes(colour = carrier_name), size=1) + coord_flip() + 
  theme(legend.position='none') + xlab("Airline Carrier") + 
  ylab("Percentage Delayed")

Boxplots by Month, Per Year

Below is a set of boxplots of delay percentages by month, faceted by year. Year over year, June, July, August, and December tend to show higher delay percentages.

ggplot(data = flights_raw, mapping = aes(x=month_nm, y=del_pct)) + 
  geom_boxplot() + scale_x_discrete(limits=c("Jan", "Feb","Mar", "Apr", "May", "Jun", 
                                             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) + 
  facet_wrap(~year) + theme(axis.text.x=element_text(angle=90, hjust=1)) + xlab("Month") +
  ylab("Delayed Percentage")

Inference

Since we are concerned with airport delays, we will try to make inferences about delays across the United States, covering both major and minor airports. Specifically, we will estimate the mean delay percentage of the population with a 95% confidence interval.

Check conditions

To run an inference test, we first check the conditions that make inference appropriate:
1. Samples are random
2. The sampling distribution is normal
3. Each observation is independent

The histogram and QQ plot below show that the delay percentage across all 68k observations is not normally distributed.

#Histogram of all delayed percentages
flights_raw %>% ggplot( aes(x=del_pct)) + geom_histogram()

#QQ Plot of all delayed percentages
flights_raw %>% ggplot( aes(sample=del_pct)) + geom_qq() + geom_qq_line()

Inference and Confidence Interval

To conduct the inference, a random sample of 500 observations is drawn 2,000 times. A 95% confidence interval for the mean delay percentage is then computed from the sample mean and standard deviation.

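# Percentage of all flights that are going to be delayed (Inference)
n1 <- 2000       #number of repeated samples
n_obs <- 500     #observations per sample
del_pct_obs <- na.omit(flights_raw$del_pct)   #drop observations with a missing delay percentage
all_del_pct <- rep(NA, n1)
all_del_pct_sd <- rep(NA, n1)

for(i in 1:n1){
  temp <- sample(del_pct_obs, n_obs)
  all_del_pct[i] <- mean(temp)
  all_del_pct_sd[i] <- sd(temp)
}

#Standard error uses the size of each sample (n_obs), not the number of samples
lower1 <- all_del_pct - 1.96 * all_del_pct_sd/sqrt(n_obs)
upper1 <- all_del_pct + 1.96 * all_del_pct_sd/sqrt(n_obs)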

#Sample distribution
data.frame(all_del_pct) %>% ggplot( aes(x=all_del_pct))+geom_histogram()

#QQ Plot of the sample distribution
data.frame(all_del_pct) %>% ggplot( aes(sample=all_del_pct)) + geom_qq() + geom_qq_line()

#95% confidence interval of the delayed percentage mean
c(lower1[2], upper1[2])
## [1] 0.2142634 0.2234856
#Conclusion of Inference
#We can say with 95% confidence that the mean share of delayed flights is between
lower1[2]
## [1] 0.2142634
#and
upper1[2]
## [1] 0.2234856
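
For reference, a confidence interval for a mean from a single sample can be cross-checked against base R's `t.test()`. The sketch below is a minimal illustration under stated assumptions: the data are simulated with `rbeta()` as a stand-in for one sample of 500 delay proportions, and the standard error divides by the square root of that sample's size.

```r
set.seed(1)

# Simulated proportions standing in for one sample of 500 observations
x <- rbeta(500, 2, 8)
n <- length(x)

# Standard error of the mean for a single sample of size n
se <- sd(x) / sqrt(n)

# Normal-approximation 95% interval
ci_normal <- mean(x) + c(-1, 1) * 1.96 * se

# The same interval from t.test(), which uses the t distribution
ci_t <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

rbind(ci_normal, ci_t)
```

With 500 observations the two intervals are nearly identical, since the t distribution with 499 degrees of freedom is very close to the normal.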

Linear Regression

We conduct a multiple linear regression on the factors that affect the delay percentage. The model uses airport, month, and carrier as independent variables to see how each of them impacts delays.

The statistical analysis is guided by the following hypotheses:

\[H_0: \text{Airports, months, or airline carriers do not impact airline delays } (R^2 = 0)\]
\[H_A: \text{Airports, months, or airline carriers do impact airline delays } (R^2 \neq 0)\]

We first check the conditions for regression:
- linearity
- residuals normally distributed
- constant variability of residuals

Conclusion:
From the plots below we can see that:
- Linearity: the relationship looks linear
- Near Normal Residuals: the residuals are roughly normal, slightly skewed but acceptable
- Constant Variability: not ideal, with skew at the upper end, but not severe

#Build model
airport_m <- lm(del_pct ~  airport + month_nm + carrier_name,  data = flights_raw)

#Linearity: Looks to be a linear relationship
data.frame(fitted = fitted(airport_m), resid = resid(airport_m)) %>% 
  ggplot(aes(x=fitted, y=resid)) + geom_point(size = 1) + geom_smooth(method = lm)

#Near Normal Residuals: Data is somewhat normal, slightly skewed but ok
data.frame(resid = resid(airport_m)) %>% ggplot( aes(x=resid)) + geom_histogram()

#Constant Variability: departs from the reference line, skewed towards the end
data.frame(resid = resid(airport_m)) %>% ggplot( aes(sample=resid)) + geom_qq() + 
  geom_qq_line()

Multiple Regression Model

Given the model, the adjusted R-squared is 0.1976, indicating that the variables explain approx. 20% of the variability in the delay percentage. Among the airports, EWR, LGA, and SFO have the largest positive coefficients, meaning they add the most to the delay percentage. Among the months, June, July, and December have the largest positive coefficients relative to the reference month, April.

An R-squared above 0 is not sufficient on its own to reject H0. The overall F-test p-value (< 2.2e-16) and the ANOVA p-values are near 0, which indicates that the predictors are statistically significant.

Based on this, we can reject the null hypothesis.
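
As a worked reading of the coefficients, the fitted delay share for a given airport and month is the intercept plus the matching coefficients. The sketch below copies estimates from the model summary in this report and assumes the reference carrier (the level not listed among the `carrier_name` coefficients), so no carrier term is added:

```r
# Estimates copied from the model summary in this report
intercept   <- 0.221068
airport_ewr <- 0.073754
airport_dca <- -0.036817
month_dec   <- 0.057879
month_sep   <- -0.036387

# Fitted delay share for the reference carrier at each airport and month
ewr_dec <- intercept + airport_ewr + month_dec
dca_sep <- intercept + airport_dca + month_sep

round(c(EWR_Dec = ewr_dec, DCA_Sep = dca_sep), 4)
```

For any other carrier, its `carrier_name` coefficient would be added to the same sum.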

#Model Summary
summary(airport_m)
## 
## Call:
## lm(formula = del_pct ~ airport + month_nm + carrier_name, data = flights_raw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34027 -0.05931 -0.01040  0.04789  0.84528 
## 
## Coefficients:
##                                           Estimate Std. Error t value
## (Intercept)                               0.221068   0.002814  78.552
## airportBOS                               -0.013066   0.002654  -4.924
## airportBWI                               -0.034821   0.002716 -12.819
## airportCLT                               -0.013660   0.002741  -4.984
## airportDCA                               -0.036817   0.002644 -13.927
## airportDEN                               -0.023708   0.002648  -8.954
## airportDFW                               -0.025156   0.002651  -9.489
## airportDTW                               -0.018993   0.002628  -7.228
## airportEWR                                0.073754   0.002700  27.316
## airportFLL                               -0.013994   0.002806  -4.987
## airportHNL                               -0.030156   0.003164  -9.531
## airportIAD                               -0.033208   0.002712 -12.245
## airportIAH                               -0.019236   0.002749  -6.997
## airportJFK                                0.007875   0.002902   2.714
## airportLAS                               -0.026058   0.002642  -9.861
## airportLAX                               -0.013950   0.002635  -5.294
## airportLGA                                0.044562   0.002653  16.794
## airportMCO                               -0.031043   0.002706 -11.472
## airportMDW                               -0.025725   0.003267  -7.873
## airportMIA                               -0.016947   0.002932  -5.780
## airportMSP                               -0.018419   0.002672  -6.894
## airportORD                                0.016601   0.002676   6.204
## airportPDX                               -0.016488   0.002764  -5.966
## airportPHL                                0.011024   0.002667   4.134
## airportPHX                               -0.027546   0.002651 -10.390
## airportSAN                               -0.031008   0.002661 -11.653
## airportSEA                               -0.006453   0.002716  -2.376
## airportSFO                                0.043260   0.002697  16.040
## airportSLC                               -0.032712   0.002766 -11.828
## airportTPA                               -0.029609   0.002801 -10.572
## month_nmAug                               0.021582   0.001743  12.385
## month_nmDec                               0.057879   0.001769  32.716
## month_nmFeb                               0.022629   0.001770  12.784
## month_nmJan                               0.024848   0.001768  14.052
## month_nmJul                               0.044984   0.001744  25.788
## month_nmJun                               0.052442   0.001741  30.124
## month_nmMar                               0.014971   0.001770   8.457
## month_nmMay                               0.007226   0.001770   4.083
## month_nmNov                              -0.020484   0.001769 -11.576
## month_nmOct                              -0.018810   0.001773 -10.610
## month_nmSep                              -0.036387   0.001770 -20.557
## carrier_nameAlaska Airlines Inc.         -0.056025   0.002337 -23.969
## carrier_nameAllegiant Air                -0.001402   0.008119  -0.173
## carrier_nameAloha Airlines Inc.          -0.061631   0.011457  -5.379
## carrier_nameAmerica West Airlines Inc.    0.020355   0.003744   5.437
## carrier_nameAmerican Airlines Inc.        0.007247   0.002171   3.339
## carrier_nameAmerican Eagle Airlines Inc.  0.005211   0.002624   1.986
## carrier_nameATA Airlines d/b/a ATA       -0.004476   0.004096  -1.093
## carrier_nameAtlantic Coast Airlines      -0.031279   0.007261  -4.308
## carrier_nameAtlantic Southeast Airlines   0.027203   0.003038   8.954
## carrier_nameComair Inc.                   0.045447   0.002901  15.666
## carrier_nameContinental Air Lines Inc.    0.007978   0.002483   3.214
## carrier_nameDelta Air Lines Inc.         -0.034254   0.002161 -15.849
## carrier_nameEndeavor Air Inc.            -0.035255   0.004734  -7.448
## carrier_nameEnvoy Air                    -0.002814   0.004396  -0.640
## carrier_nameExpressJet Airlines Inc.      0.001083   0.002354   0.460
## carrier_nameExpressJet Airlines LLC       0.048784   0.013351   3.654
## carrier_nameFrontier Airlines Inc.        0.032340   0.002334  13.858
## carrier_nameHawaiian Airlines Inc.       -0.020816   0.003008  -6.920
## carrier_nameIndependence Air              0.039067   0.007090   5.510
## carrier_nameJetBlue Airways               0.020843   0.002332   8.938
## carrier_nameMesa Airlines Inc.           -0.008356   0.002739  -3.051
## carrier_nameNorthwest Airlines Inc.       0.052496   0.002618  20.049
## carrier_namePinnacle Airlines Inc.       -0.014728   0.003718  -3.961
## carrier_namePSA Airlines Inc.            -0.024509   0.006941  -3.531
## carrier_nameRepublic Airline             -0.051728   0.005245  -9.862
## carrier_nameSkyWest Airlines Inc.        -0.025936   0.002371 -10.939
## carrier_nameSouthwest Airlines Co.       -0.015924   0.002322  -6.859
## carrier_nameSpirit Air Lines              0.004326   0.003293   1.314
## carrier_nameUnited Air Lines Inc.        -0.009556   0.002179  -4.386
## carrier_nameUS Airways Inc.              -0.007511   0.002311  -3.250
## carrier_nameVirgin America               -0.046620   0.003258 -14.309
##                                          Pr(>|t|)    
## (Intercept)                               < 2e-16 ***
## airportBOS                               8.50e-07 ***
## airportBWI                                < 2e-16 ***
## airportCLT                               6.24e-07 ***
## airportDCA                                < 2e-16 ***
## airportDEN                                < 2e-16 ***
## airportDFW                                < 2e-16 ***
## airportDTW                               4.94e-13 ***
## airportEWR                                < 2e-16 ***
## airportFLL                               6.15e-07 ***
## airportHNL                                < 2e-16 ***
## airportIAD                                < 2e-16 ***
## airportIAH                               2.64e-12 ***
## airportJFK                               0.006658 ** 
## airportLAS                                < 2e-16 ***
## airportLAX                               1.20e-07 ***
## airportLGA                                < 2e-16 ***
## airportMCO                                < 2e-16 ***
## airportMDW                               3.50e-15 ***
## airportMIA                               7.51e-09 ***
## airportMSP                               5.48e-12 ***
## airportORD                               5.54e-10 ***
## airportPDX                               2.45e-09 ***
## airportPHL                               3.57e-05 ***
## airportPHX                                < 2e-16 ***
## airportSAN                                < 2e-16 ***
## airportSEA                               0.017525 *  
## airportSFO                                < 2e-16 ***
## airportSLC                                < 2e-16 ***
## airportTPA                                < 2e-16 ***
## month_nmAug                               < 2e-16 ***
## month_nmDec                               < 2e-16 ***
## month_nmFeb                               < 2e-16 ***
## month_nmJan                               < 2e-16 ***
## month_nmJul                               < 2e-16 ***
## month_nmJun                               < 2e-16 ***
## month_nmMar                               < 2e-16 ***
## month_nmMay                              4.44e-05 ***
## month_nmNov                               < 2e-16 ***
## month_nmOct                               < 2e-16 ***
## month_nmSep                               < 2e-16 ***
## carrier_nameAlaska Airlines Inc.          < 2e-16 ***
## carrier_nameAllegiant Air                0.862861    
## carrier_nameAloha Airlines Inc.          7.51e-08 ***
## carrier_nameAmerica West Airlines Inc.   5.44e-08 ***
## carrier_nameAmerican Airlines Inc.       0.000843 ***
## carrier_nameAmerican Eagle Airlines Inc. 0.047032 *  
## carrier_nameATA Airlines d/b/a ATA       0.274457    
## carrier_nameAtlantic Coast Airlines      1.65e-05 ***
## carrier_nameAtlantic Southeast Airlines   < 2e-16 ***
## carrier_nameComair Inc.                   < 2e-16 ***
## carrier_nameContinental Air Lines Inc.   0.001311 ** 
## carrier_nameDelta Air Lines Inc.          < 2e-16 ***
## carrier_nameEndeavor Air Inc.            9.61e-14 ***
## carrier_nameEnvoy Air                    0.522093    
## carrier_nameExpressJet Airlines Inc.     0.645567    
## carrier_nameExpressJet Airlines LLC      0.000258 ***
## carrier_nameFrontier Airlines Inc.        < 2e-16 ***
## carrier_nameHawaiian Airlines Inc.       4.56e-12 ***
## carrier_nameIndependence Air             3.60e-08 ***
## carrier_nameJetBlue Airways               < 2e-16 ***
## carrier_nameMesa Airlines Inc.           0.002283 ** 
## carrier_nameNorthwest Airlines Inc.       < 2e-16 ***
## carrier_namePinnacle Airlines Inc.       7.47e-05 ***
## carrier_namePSA Airlines Inc.            0.000414 ***
## carrier_nameRepublic Airline              < 2e-16 ***
## carrier_nameSkyWest Airlines Inc.         < 2e-16 ***
## carrier_nameSouthwest Airlines Co.       7.01e-12 ***
## carrier_nameSpirit Air Lines             0.188931    
## carrier_nameUnited Air Lines Inc.        1.16e-05 ***
## carrier_nameUS Airways Inc.              0.001156 ** 
## carrier_nameVirgin America                < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09345 on 68041 degrees of freedom
##   (40 observations deleted due to missingness)
## Multiple R-squared:  0.1985, Adjusted R-squared:  0.1976 
## F-statistic: 237.3 on 71 and 68041 DF,  p-value: < 2.2e-16
#Anova to check for variance
anova(airport_m)
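
The F-statistic in the summary output can be reproduced from the reported R-squared and degrees of freedom, which ties the hypothesis about R-squared to the overall significance test. A minimal sketch using the rounded values printed above (so the result matches only to rounding):

```r
# Values taken from the model summary output
r2       <- 0.1985   # multiple R-squared
k        <- 71       # numerator degrees of freedom (model terms)
df_resid <- 68041    # residual degrees of freedom

# Overall F-statistic: explained variance per term over
# unexplained variance per residual degree of freedom
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Upper-tail probability under the F distribution
p_value <- pf(f_stat, k, df_resid, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

Even a modest R-squared yields a very large F-statistic here because the residual degrees of freedom are so numerous.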

Conclusion

Based on the dataset, we revisit the original questions:

  1. What is the biggest contributor to flight delays?
    Flying into EWR and traveling in the month of December have the largest and most significant positive effects on airline delays

  2. Which holiday month (July, November, or December) is the worst for travel?
    The months of June, July, and December are the worst for travel, with December contributing to delays the most

  3. What is the most efficient airport?
    DCA (Ronald Reagan Washington National Airport) contributes least to delays

  4. Are airports generally getting more efficient over time?
    Airports are remaining relatively the same in terms of efficiency

Given this dataset, we estimate with 95% confidence that, on average, approx. 22% of incoming flights are delayed, and the most significant factors in delays are the month of December and EWR airport. That said, this model has some flaws that the reader should be aware of:
- More than 80% of the variability remains unexplained
- Other factors that affect delays, such as departure delays, are not included in this dataset

Including these other factors may yield a better model.