Thousands of flights operate within the continental United States every day. Many of them are delayed, and these delays impose substantial costs on the economy. If the delays follow a pattern, understanding it would be of interest to airline carriers, airports, city, state, and federal government entities, businesses, and individual consumers, who could either take steps to address the causes or, in the case of individual travelers, avoid flying when delays are most likely. Some questions of interest are:
1. What is the biggest contributor to flight delays?
2. Which holiday month (July, November, or December) is the worst for travel?
3. What is the most efficient airport?
4. Are airports generally getting more efficient over time?
The data was collected from the Bureau of Transportation Statistics, saved as a CSV file on GitHub, and loaded into R from there:
https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp?pn=1
#Load Packages
library(ggplot2)
library(kableExtra)
library(scales)
library(dplyr)
library(tidyr)
#Load Raw Data
flights_raw <-
read.csv("https://raw.githubusercontent.com/dhairavc/DATA606/master/flights_delays.csv")
head(flights_raw)
#Update column names
flights_raw <- flights_raw[, 1:21]
names(flights_raw) <- c("year","month","carrier","carrier_name","airport","airport_name",
"arr_flights","arr_del15","carrier_ct","weather_ct","nas_ct",
"security_ct","late_aircraft_ct","arr_cancelled","arr_diverted",
"arr_delay","carrier_delay","weather_delay","nas_delay",
"security_delay","late_aircraft_delay")
#Calculate percentage of flights delayed per observation
flights_raw <- flights_raw %>% mutate(del_pct = arr_del15/arr_flights)
#Add a categorical variable called month name
month <- 1:12
month_nm <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul",
"Aug", "Sep", "Oct", "Nov","Dec")
months <- data.frame(month, month_nm)
flights_raw <- left_join(flights_raw, months, by = "month")
#Data descriptions
Field <- c("year", "month", "carrier", "carrier_name", "airport", "airport_name",
"arr_flights", "arr_del15", "carrier_ct", "weather_ct", "nas_ct",
"security_ct", "late_aircraft_ct", "arr_cancelled", "arr_diverted",
"arr_delay", "carrier_delay", "weather_delay", "nas_delay",
"security_delay", "late_aircraft_delay")
Description <- c("Year (yyyy)", "Month (mm)", "Airline carrier abbreviation",
"Airline carrier name", "Airport Code", "Airport Name",
"Total number of arriving flights in the observation",
"Total number of delayed flights in the observation",
"Number of flights delayed due to air carrier (subset of arr_del15)",
"Number of flights delayed due to weather (subset of arr_del15)",
"Number of flights delayed due to National Aviation System (subset of arr_del15)",
"Number of flights delayed due to airport security (subset of arr_del15)",
"Number of flights delayed due to a previous flight using the same aircraft being late",
"Number of cancelled flights",
"Number of flights diverted",
"Arrival delay in minutes",
"Carrier delay in minutes (subset of arr_delay)",
"Weather delayed in minutes (subset of arr_delay)",
"National Aviation System in minutes (subset of arr_delay)",
"Security delay in minutes (subset of arr_delay)",
"Aircraft delay in minutes (subset of arr_delay)")
VariableType <- c("Qualitative", "Qualitative", "Qualitative", "Qualitative",
"Qualitative", "Qualitative", "Quantitative", "Quantitative",
"Quantitative", "Quantitative", "Quantitative", "Quantitative",
"Quantitative", "Quantitative", "Quantitative", "Quantitative",
"Quantitative", "Quantitative", "Quantitative", "Quantitative",
"Quantitative")
VariableMeasure <- c("Independent", "Independent", "Independent", "Independent",
"Independent", "Independent", "Independent", "Response",
"Explanatory", "Explanatory", "Explanatory", "Explanatory",
"Explanatory", "Independent", "Independent", "Response",
"Explanatory", "Explanatory", "Explanatory", "Explanatory",
"Explanatory")
FieldDefinitions <- data.frame(Field, VariableType, VariableMeasure, Description)
This is an observational study. The creators of the data set observed flight arrivals and recorded the total number of flights along with several data points on the flights that arrived late. The population of interest is all flights in the continental United States, with a focus on those that arrive late. From the subset of flight arrival records that we get from the DOT, we would like to make inferences about the entire population of flights.
The dependent variables are arr_del15 and arr_delay:
FieldDefinitions %>% filter(VariableMeasure == "Response") %>% kable() %>% kable_styling()
| Field | VariableType | VariableMeasure | Description |
|---|---|---|---|
| arr_del15 | Quantitative | Response | Total number of delayed flights in the observation |
| arr_delay | Quantitative | Response | Arrival delay in minutes |
FieldDefinitions %>% filter(VariableMeasure == "Independent") %>% kable() %>%
kable_styling()
| Field | VariableType | VariableMeasure | Description |
|---|---|---|---|
| year | Qualitative | Independent | Year (yyyy) |
| month | Qualitative | Independent | Month (mm) |
| carrier | Qualitative | Independent | Airline carrier abbreviation |
| carrier_name | Qualitative | Independent | Airline carrier name |
| airport | Qualitative | Independent | Airport Code |
| airport_name | Qualitative | Independent | Airport Name |
| arr_flights | Quantitative | Independent | Total number of arriving flights in the observation |
| arr_cancelled | Quantitative | Independent | Number of cancelled flights |
| arr_diverted | Quantitative | Independent | Number of flights diverted |
Below is some exploratory analysis that offers some general observations.
There are 68,153 observations.
#Observations count
nrow(flights_raw)
## [1] 68153
#summary statistics for all quantitative variables
flights_raw %>% select(arr_flights, arr_del15, carrier_ct, weather_ct, nas_ct, security_ct,
late_aircraft_ct, arr_cancelled, arr_diverted, arr_delay,
carrier_delay, weather_delay, nas_delay, security_delay,
late_aircraft_delay) %>% summary()
## arr_flights arr_del15 carrier_ct weather_ct
## Min. : 1 Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.: 126 1st Qu.: 27.0 1st Qu.: 8.01 1st Qu.: 0.000
## Median : 333 Median : 71.0 Median : 20.63 Median : 1.500
## Mean : 1002 Mean : 198.1 Mean : 48.13 Mean : 6.386
## 3rd Qu.: 861 3rd Qu.: 180.0 3rd Qu.: 49.65 3rd Qu.: 5.630
## Max. :21977 Max. :6377.0 Max. :1792.07 Max. :641.540
## NA's :36 NA's :40 NA's :36 NA's :36
## nas_ct security_ct late_aircraft_ct arr_cancelled
## Min. : -0.01 Min. : 0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 8.60 1st Qu.: 0.0000 1st Qu.: 4.91 1st Qu.: 0.00
## Median : 25.20 Median : 0.0000 Median : 16.67 Median : 3.00
## Mean : 77.24 Mean : 0.4423 Mean : 65.91 Mean : 17.35
## 3rd Qu.: 68.55 3rd Qu.: 0.1300 3rd Qu.: 53.65 3rd Qu.: 12.00
## Max. :4091.27 Max. :80.5600 Max. :1885.47 Max. :1389.00
## NA's :36 NA's :36 NA's :36 NA's :36
## arr_diverted arr_delay carrier_delay weather_delay
## Min. : 0.000 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.: 1320 1st Qu.: 397 1st Qu.: 0.0
## Median : 0.000 Median : 3745 Median : 1111 Median : 98.0
## Mean : 2.375 Mean : 11674 Mean : 3105 Mean : 541.6
## 3rd Qu.: 2.000 3rd Qu.: 10218 3rd Qu.: 2874 3rd Qu.: 442.0
## Max. :256.000 Max. :433687 Max. :196944 Max. :57707.0
## NA's :36 NA's :36 NA's :36 NA's :36
## nas_delay security_delay late_aircraft_delay
## Min. : -19 Min. : 0.00 Min. : 0
## 1st Qu.: 323 1st Qu.: 0.00 1st Qu.: 276
## Median : 1010 Median : 0.00 Median : 1057
## Mean : 3729 Mean : 17.72 Mean : 4280
## 3rd Qu.: 2996 3rd Qu.: 5.00 3rd Qu.: 3565
## Max. :238440 Max. :3194.00 Max. :148181
## NA's :36 NA's :36 NA's :36
Excluding 2003 and 2019, the number of delays appears to have peaked in 2007, and in general roughly 15%-20% of all flights are delayed.
#Barchart of arriving and delayed flights
flights_raw %>% select(year, arr_flights, arr_del15) %>% drop_na() %>% group_by(year) %>%
summarise_all(sum) %>% gather(key = "Type", "NumCount", 2:3) %>%
ggplot( aes(x=year, y=NumCount, fill=Type)) + geom_col(position = 'dodge') +
scale_y_continuous(labels = comma) + scale_x_discrete(limits=c(2003:2019))
Looking at the number of flights per airport, Atlanta International handles the most flights, with Orlando and Dallas/Fort Worth in 2nd and 3rd place.
#Barchart of volume per airport
flights_raw %>% select(airport, arr_flights) %>% drop_na() %>% group_by(airport) %>%
dplyr::summarise(TotalFlights = sum(arr_flights)) %>%
ggplot( aes(x=reorder(airport, -TotalFlights), y=TotalFlights, fill=TotalFlights))+
geom_col() + coord_flip() + scale_y_continuous(labels = comma) + xlab("Airport") +
scale_fill_continuous(labels = comma) + theme(legend.position="bottom")
Delays across the months seem almost the same, with a slight increase in the summer months of Jun, Jul, and Aug. This could, however, be because there are more flights in these months.
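The caveat that busier months naturally accumulate more delays can be checked by dividing delayed arrivals by total arrivals for each month. Below is a minimal sketch in base R on made-up rows shaped like flights_raw; the `toy` data frame and its values are assumptions for illustration only, not figures from the BTS data.

```r
# Hypothetical toy rows with the same column names as flights_raw (values are made up)
toy <- data.frame(
  month       = c(1, 1, 6, 6, 6, 12),
  arr_flights = c(100, 300, 400, 200, 400, 250),
  arr_del15   = c(20, 60, 100, 50, 100, 75)
)

# Sum arrivals and delayed arrivals per month, then take the ratio
totals <- aggregate(cbind(arr_flights, arr_del15) ~ month, data = toy, FUN = sum)
totals$del_rate <- totals$arr_del15 / totals$arr_flights
totals
```

Comparing `del_rate` across months, rather than raw delay counts, removes the effect of traffic volume. The same pattern would apply to grouping by year.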
#Flight Delays by Month (sum of delayed arrivals; n() would only count rows)
by_month <- flights_raw %>% select(month, arr_del15) %>% drop_na() %>% group_by(month) %>%
dplyr::summarize(delayed = sum(arr_del15))
by_month$month <- recode(by_month$month, `1`="Jan", `2`="Feb", `3`="Mar", `4`="Apr",
`5`="May", `6`="Jun", `7`="Jul", `8`="Aug", `9`="Sep",
`10`="Oct", `11`="Nov", `12`="Dec")
by_month %>% ggplot( aes(x = month, y=delayed)) + geom_bar(stat="identity") +
scale_x_discrete(limits=c("Jan", "Feb","Mar", "Apr", "May", "Jun", "Jul",
"Aug", "Sep", "Oct", "Nov", "Dec"))
Not all airline carriers operate at all airports. From the chart below we can see that ATA Airlines usually has a large percentage of its flights delayed at JFK and ORD airports.
#Most Delayed Carrier per Airport by Percentage
flights_raw %>% select(carrier_name, airport, arr_del15, arr_flights) %>% drop_na() %>%
group_by(airport, carrier_name) %>% dplyr::summarize_all(sum) %>%
mutate(del_pct = arr_del15/arr_flights) %>%
ggplot(aes(x=airport, y=carrier_name, fill=del_pct)) + geom_tile() +
theme(axis.text.x=element_text(angle=90, hjust=1)) +
scale_fill_gradient(low = "pink", high = "red") +
theme(legend.position="bottom")
Over the years, weather and security delays appear to have stayed fairly stable, while the other reasons have fluctuated, peaking in 2007.
#Delayed Reasons by Year
del_by_year <- flights_raw %>% select(year, carrier_ct, weather_ct, nas_ct, security_ct,
late_aircraft_ct) %>% drop_na() %>%
group_by(year) %>% dplyr::summarise_all(sum) %>% gather(key = del_reason,
value = del_count,
carrier_ct:late_aircraft_ct)
#Refer to columns by name inside aes() instead of del_by_year$...
ggplot(del_by_year, aes(x=year, y=del_count,
group=del_reason,
color = del_reason)) +
geom_line(size=1) + geom_point() + xlab("Year") + ylab("Count") +
scale_y_continuous(labels = comma) +
theme(legend.position="bottom") +
scale_x_discrete(limits=c(2003:2019))
There is a strong positive correlation between the total delay time in minutes and the number of delayed flights. This indicates that delay time increases roughly linearly with the number of delays.
#Delayed Time Spent
flights_raw %>% ggplot( aes(x=arr_del15, y=arr_delay)) + geom_point(size=1) +
scale_y_continuous(labels = comma) + geom_smooth(method = lm) +
ylab("Arrival Delay Minutes") + xlab("Delayed Arrival Count")
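The strength of the linear relationship in the scatterplot above can be put into numbers with a correlation coefficient and a fitted slope. The sketch below uses made-up values, not the flight data; `toy_del15` and `toy_delay` are hypothetical stand-ins for arr_del15 and arr_delay.

```r
# Hypothetical stand-ins for arr_del15 (delayed flights) and arr_delay (minutes)
toy_del15 <- c(10, 20, 30, 40, 50)
toy_delay <- c(600, 1150, 1900, 2300, 3100)

# Pearson correlation: values near 1 indicate a strong positive linear relationship
r <- cor(toy_del15, toy_delay)

# The slope of the least-squares line estimates delay minutes per additional delayed flight
fit <- lm(toy_delay ~ toy_del15)
slope <- coef(fit)[["toy_del15"]]

c(correlation = r, minutes_per_delayed_flight = slope)
```

On the real data the same two calls would need the rows with missing values removed first, for example with `use = "complete.obs"` in `cor()`.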
Once the number of arriving flights in an observation passes 10,000, the data points become sparse. This could be because few airports handle such a large number of flights.
#Delayed Flights vs. Arriving Flights
flights_raw %>% ggplot( aes(x=arr_flights, y=arr_del15)) + geom_point(size=1) +
scale_y_continuous(labels = comma) + geom_smooth(method = lm) +
ylab("Delayed Arrival Count") + xlab("Arriving Flights Count")
The plot below shows every delay percentage observation per airport. Note that SFO, LGA, and EWR have higher delay percentages.
#Delayed Airports
#xlab() labels the airport aesthetic, which coord_flip() draws on the vertical axis
ggplot(flights_raw, aes(x=airport, y=del_pct)) +
geom_jitter(aes(colour = airport), size=1) + coord_flip() +
theme(legend.position='none') + xlab("Airports") +
ylab("Percentage Delayed")
The plot below shows every delay percentage observation per airline carrier. Note that the larger carriers, such as United, American Airlines, American Eagle, and Virgin, do not frequently have high delay percentages.
#Delayed Airlines
#xlab() labels the carrier aesthetic, which coord_flip() draws on the vertical axis
ggplot(flights_raw, aes(x=carrier_name, y=del_pct)) +
geom_jitter(aes(colour = carrier_name), size=1) + coord_flip() +
theme(legend.position='none') + xlab("Airline Carrier") +
ylab("Percentage Delayed")
Below is a set of boxplots of delay percentages by month, with one panel per year. The trend year over year is that June, July, August, and December experience higher delay percentages.
ggplot(data = flights_raw, mapping = aes(x=month_nm, y=del_pct)) +
geom_boxplot() + scale_x_discrete(limits=c("Jan", "Feb","Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
facet_wrap(~year) + theme(axis.text.x=element_text(angle=90, hjust=1)) + xlab("Month") +
ylab("Delayed Percentage")
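The month plots above keep calendar order by passing the month names to `scale_x_discrete(limits = ...)`, because character month names otherwise sort alphabetically. One alternative, sketched below on a few made-up month numbers, is to store the names as a factor whose levels follow `month.abb` (a constant built into R). The vector `month_num` is a hypothetical example, not a column from the data.

```r
# Hypothetical month numbers in arbitrary order
month_num <- c(12, 1, 7, 4, 11)

# Plain character labels sort alphabetically
month_chr <- month.abb[month_num]
sort(month_chr)

# A factor with levels in calendar order sorts (and plots) chronologically
month_fct <- factor(month_chr, levels = month.abb)
as.character(sort(month_fct))

# The first level is also the reference level R would use in lm()
levels(month_fct)[1]
```

With a factor like this, January rather than the alphabetically first month would become the reference level in a regression.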
Since we are concerned with airport delays, we will try to draw inferences about delays across all major and minor airports in the United States. Specifically, we will estimate the mean delay percentage across the population with a 95% confidence interval.
To run an inference test, we have to check the conditions that are appropriate for inference: 1) the samples are random, 2) the sampling distribution is normal, and 3) each observation is independent.
- From the histogram and Q-Q plot below we can see that the delayed percentage across all 68k observations is not normally distributed
#Histogram of all delayed percentages
flights_raw %>% ggplot( aes(x=del_pct)) + geom_histogram()
#QQ Plot of all delayed percentages
flights_raw %>% ggplot( aes(sample=del_pct)) + geom_qq() + geom_qq_line()
To conduct the inference test, a sample of 500 observations is drawn 2,000 times. A 95% confidence interval is then constructed around the sample mean.
# Percentage of all flights that are going to be delayed (Inference)
n1 <- 2000 #number of repeated samples
sample_size <- 500 #observations per sample
#Drop missing values first so that no sample mean turns into NA
del_pct_obs <- flights_raw$del_pct[!is.na(flights_raw$del_pct)]
all_del_pct <- rep(NA, n1)
all_del_pct_sd <- rep(NA, n1)
for(i in 1:n1){
temp <- sample(del_pct_obs, sample_size)
all_del_pct[i] <- mean(temp)
all_del_pct_sd[i] <- sd(temp)
}
#The standard error uses the size of each sample (500), not the number of samples
lower1 <- all_del_pct - 1.96 * all_del_pct_sd/sqrt(sample_size)
upper1 <- all_del_pct + 1.96 * all_del_pct_sd/sqrt(sample_size)
#Sample distribution
data.frame(all_del_pct) %>% ggplot( aes(x=all_del_pct))+geom_histogram()
#QQ Plot of the sample distribution
data.frame(all_del_pct) %>% ggplot( aes(sample=all_del_pct)) + geom_qq() + geom_qq_line()
#95% confidence interval of the delayed percentage mean
c(lower1[2], upper1[2])
## [1] 0.2096523 0.2280967
#Conclusion of Inference
#We can say with 95% confidence that the mean proportion of delayed flights is between
lower1[2]
## [1] 0.2096523
#and
upper1[2]
## [1] 0.2280967
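For reference, a 95% confidence interval for a mean from a single sample of size n is the sample mean plus or minus 1.96 times s/sqrt(n), where s is the sample standard deviation. The sketch below applies that formula to simulated proportions rather than the flight data and checks how often the interval captures the population mean. The helper `ci_mean`, the simulated vector `del_pct_sim`, and the Beta(2, 7) distribution are all assumptions chosen for illustration.

```r
set.seed(606)

# Hypothetical stand-in for del_pct: right-skewed proportions with mean 2/9
del_pct_sim <- rbeta(68000, shape1 = 2, shape2 = 7)
pop_mean <- mean(del_pct_sim)

# 95% confidence interval for the mean of one sample
ci_mean <- function(x, z = 1.96) {
  x <- x[!is.na(x)]                 # ignore missing values
  se <- sd(x) / sqrt(length(x))     # standard error uses the sample size
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# One interval from one sample of 500
ci_mean(sample(del_pct_sim, 500))

# Share of 2000 repeated intervals that contain the population mean
covered <- replicate(2000, {
  ci <- ci_mean(sample(del_pct_sim, 500))
  ci[["lower"]] <= pop_mean && pop_mean <= ci[["upper"]]
})
mean(covered)
```

If the interval is built correctly, the share in the last line should land close to 0.95 even though the underlying values are skewed, which is the Central Limit Theorem at work.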
We conduct a multiple linear regression on factors that may affect the delay percentage. We build a regression model with the airport, month, and carrier as independent variables to see how each of them affects delays.
The guiding question for our statistical analysis is the following hypothesis:
\[H_0: \text{Airports, months, or airline carriers do not impact airline delays } (R^2 = 0)\]
\[H_A: \text{Airports, months, or airline carriers do impact airline delays } (R^2 \neq 0)\]
We first check the conditions for regression:
- linearity
- residuals normally distributed
- constant variability of residuals
Conclusion:
From the plots below we can see that:
- Linearity: there looks to be a linear relationship
- Near normal residuals: the residuals are somewhat normal; slightly skewed, but acceptable
- Constant variability: the Q-Q plot is not perfectly normal and skews at the upper end, but it is not too bad
#Build model
airport_m <- lm(del_pct ~ airport + month_nm + carrier_name, data = flights_raw)
#Collect fitted values and residuals once so aes() can refer to columns by name
model_diag <- data.frame(fitted = fitted(airport_m), resid = resid(airport_m))
#Linearity: Looks to be a linear relationship
ggplot(model_diag, aes(x=fitted, y=resid)) +
geom_point(size = 1) + geom_smooth(method = lm)
#Near Normal Residuals: Data is somewhat normal, slightly skewed but ok
model_diag %>% ggplot( aes(x=resid)) + geom_histogram()
#Constant Variability: Q-Q plot of residuals is not quite normal, skewed towards the end
model_diag %>% ggplot( aes(sample=resid)) + geom_qq() +
geom_qq_line()
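A Q-Q plot speaks mainly to the normality of the residuals. Constant variability is more commonly judged by how the spread of the residuals changes across the fitted values. The sketch below illustrates one rough way to do that on simulated data, not on the flight model; `resid_sd_by_bin` is a hypothetical helper written for this example.

```r
set.seed(42)
n <- 2000
x <- runif(n)

# Two simulated responses: constant noise vs. noise that grows with x
y_const  <- 1 + 2 * x + rnorm(n, sd = 0.5)
y_fanned <- 1 + 2 * x + rnorm(n, sd = 0.1 + x)

# Standard deviation of the residuals within quantile bins of the fitted values
resid_sd_by_bin <- function(fit, n_bins = 4) {
  fv <- fitted(fit)
  breaks <- quantile(fv, probs = seq(0, 1, length.out = n_bins + 1))
  bins <- cut(fv, breaks = breaks, include.lowest = TRUE)
  unname(tapply(resid(fit), bins, sd))
}

sd_const  <- resid_sd_by_bin(lm(y_const ~ x))
sd_fanned <- resid_sd_by_bin(lm(y_fanned ~ x))

# Roughly equal values suggest constant variability; a steady rise suggests fanning
rbind(constant = sd_const, fanned = sd_fanned)
```

With categorical predictors, as in the model above, fitted values repeat heavily, so the quantile breaks may need adjusting before this helper can be applied directly.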
Given the model, the adjusted R-squared is 0.1976, indicating that the variables explain approx. 20% of the variability. Looking at the airport variable, note that EWR, LGA, and SFO have the largest positive coefficients and so contribute the most to a higher delay percentage. Looking at the month variable, June, July, and December contribute the most to a higher delay percentage.
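Because every predictor is categorical, each coefficient is an offset from the reference levels, that is, the airport, month, and carrier that do not appear in the coefficient table. As a rough worked example, the fitted delay proportion for the reference carrier is the intercept plus the relevant airport and month offsets. The estimates below are copied from the model summary; the variable names are illustrative.

```r
# Estimates copied from the model summary output
intercept   <- 0.221068
airport_ewr <- 0.073754
airport_dca <- -0.036817
month_dec   <- 0.057879
month_sep   <- -0.036387

# Fitted delay proportion for the reference carrier
pred_ewr_dec <- intercept + airport_ewr + month_dec   # EWR in December
pred_dca_sep <- intercept + airport_dca + month_sep   # DCA in September

round(c(EWR_Dec = pred_ewr_dec, DCA_Sep = pred_dca_sep), 3)
```

Under these estimates the model's fitted value is about 35% of arrivals delayed for EWR in December versus about 15% for DCA in September, before any carrier offset is added.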
A nonzero R-squared alone is not enough to reject H0. However, the overall F-test (F = 237.3 on 71 and 68041 DF, p-value < 2.2e-16) and the near-zero p-values of the ANOVA both indicate that the predictors are significant.
Based on this, we can reject the null hypothesis.
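The overall F-statistic that tests whether R-squared is zero can be recovered from the figures in the model summary as F = (R^2 / k) / ((1 - R^2) / df), where k is the number of estimated slopes and df is the residual degrees of freedom. The sketch below simply redoes that arithmetic with the reported values; the variable names are illustrative.

```r
# Values reported in the model summary
r_squared <- 0.1985   # multiple R-squared
k         <- 71       # numerator degrees of freedom (number of slopes)
df_resid  <- 68041    # residual degrees of freedom

# Overall F-statistic and its upper-tail p-value
f_stat  <- (r_squared / k) / ((1 - r_squared) / df_resid)
p_value <- pf(f_stat, k, df_resid, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

The result agrees with the reported F-statistic of 237.3 up to rounding of R-squared, and the p-value is far below any conventional significance level.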
#Model Summary
summary(airport_m)
##
## Call:
## lm(formula = del_pct ~ airport + month_nm + carrier_name, data = flights_raw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34027 -0.05931 -0.01040 0.04789 0.84528
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.221068 0.002814 78.552
## airportBOS -0.013066 0.002654 -4.924
## airportBWI -0.034821 0.002716 -12.819
## airportCLT -0.013660 0.002741 -4.984
## airportDCA -0.036817 0.002644 -13.927
## airportDEN -0.023708 0.002648 -8.954
## airportDFW -0.025156 0.002651 -9.489
## airportDTW -0.018993 0.002628 -7.228
## airportEWR 0.073754 0.002700 27.316
## airportFLL -0.013994 0.002806 -4.987
## airportHNL -0.030156 0.003164 -9.531
## airportIAD -0.033208 0.002712 -12.245
## airportIAH -0.019236 0.002749 -6.997
## airportJFK 0.007875 0.002902 2.714
## airportLAS -0.026058 0.002642 -9.861
## airportLAX -0.013950 0.002635 -5.294
## airportLGA 0.044562 0.002653 16.794
## airportMCO -0.031043 0.002706 -11.472
## airportMDW -0.025725 0.003267 -7.873
## airportMIA -0.016947 0.002932 -5.780
## airportMSP -0.018419 0.002672 -6.894
## airportORD 0.016601 0.002676 6.204
## airportPDX -0.016488 0.002764 -5.966
## airportPHL 0.011024 0.002667 4.134
## airportPHX -0.027546 0.002651 -10.390
## airportSAN -0.031008 0.002661 -11.653
## airportSEA -0.006453 0.002716 -2.376
## airportSFO 0.043260 0.002697 16.040
## airportSLC -0.032712 0.002766 -11.828
## airportTPA -0.029609 0.002801 -10.572
## month_nmAug 0.021582 0.001743 12.385
## month_nmDec 0.057879 0.001769 32.716
## month_nmFeb 0.022629 0.001770 12.784
## month_nmJan 0.024848 0.001768 14.052
## month_nmJul 0.044984 0.001744 25.788
## month_nmJun 0.052442 0.001741 30.124
## month_nmMar 0.014971 0.001770 8.457
## month_nmMay 0.007226 0.001770 4.083
## month_nmNov -0.020484 0.001769 -11.576
## month_nmOct -0.018810 0.001773 -10.610
## month_nmSep -0.036387 0.001770 -20.557
## carrier_nameAlaska Airlines Inc. -0.056025 0.002337 -23.969
## carrier_nameAllegiant Air -0.001402 0.008119 -0.173
## carrier_nameAloha Airlines Inc. -0.061631 0.011457 -5.379
## carrier_nameAmerica West Airlines Inc. 0.020355 0.003744 5.437
## carrier_nameAmerican Airlines Inc. 0.007247 0.002171 3.339
## carrier_nameAmerican Eagle Airlines Inc. 0.005211 0.002624 1.986
## carrier_nameATA Airlines d/b/a ATA -0.004476 0.004096 -1.093
## carrier_nameAtlantic Coast Airlines -0.031279 0.007261 -4.308
## carrier_nameAtlantic Southeast Airlines 0.027203 0.003038 8.954
## carrier_nameComair Inc. 0.045447 0.002901 15.666
## carrier_nameContinental Air Lines Inc. 0.007978 0.002483 3.214
## carrier_nameDelta Air Lines Inc. -0.034254 0.002161 -15.849
## carrier_nameEndeavor Air Inc. -0.035255 0.004734 -7.448
## carrier_nameEnvoy Air -0.002814 0.004396 -0.640
## carrier_nameExpressJet Airlines Inc. 0.001083 0.002354 0.460
## carrier_nameExpressJet Airlines LLC 0.048784 0.013351 3.654
## carrier_nameFrontier Airlines Inc. 0.032340 0.002334 13.858
## carrier_nameHawaiian Airlines Inc. -0.020816 0.003008 -6.920
## carrier_nameIndependence Air 0.039067 0.007090 5.510
## carrier_nameJetBlue Airways 0.020843 0.002332 8.938
## carrier_nameMesa Airlines Inc. -0.008356 0.002739 -3.051
## carrier_nameNorthwest Airlines Inc. 0.052496 0.002618 20.049
## carrier_namePinnacle Airlines Inc. -0.014728 0.003718 -3.961
## carrier_namePSA Airlines Inc. -0.024509 0.006941 -3.531
## carrier_nameRepublic Airline -0.051728 0.005245 -9.862
## carrier_nameSkyWest Airlines Inc. -0.025936 0.002371 -10.939
## carrier_nameSouthwest Airlines Co. -0.015924 0.002322 -6.859
## carrier_nameSpirit Air Lines 0.004326 0.003293 1.314
## carrier_nameUnited Air Lines Inc. -0.009556 0.002179 -4.386
## carrier_nameUS Airways Inc. -0.007511 0.002311 -3.250
## carrier_nameVirgin America -0.046620 0.003258 -14.309
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## airportBOS 8.50e-07 ***
## airportBWI < 2e-16 ***
## airportCLT 6.24e-07 ***
## airportDCA < 2e-16 ***
## airportDEN < 2e-16 ***
## airportDFW < 2e-16 ***
## airportDTW 4.94e-13 ***
## airportEWR < 2e-16 ***
## airportFLL 6.15e-07 ***
## airportHNL < 2e-16 ***
## airportIAD < 2e-16 ***
## airportIAH 2.64e-12 ***
## airportJFK 0.006658 **
## airportLAS < 2e-16 ***
## airportLAX 1.20e-07 ***
## airportLGA < 2e-16 ***
## airportMCO < 2e-16 ***
## airportMDW 3.50e-15 ***
## airportMIA 7.51e-09 ***
## airportMSP 5.48e-12 ***
## airportORD 5.54e-10 ***
## airportPDX 2.45e-09 ***
## airportPHL 3.57e-05 ***
## airportPHX < 2e-16 ***
## airportSAN < 2e-16 ***
## airportSEA 0.017525 *
## airportSFO < 2e-16 ***
## airportSLC < 2e-16 ***
## airportTPA < 2e-16 ***
## month_nmAug < 2e-16 ***
## month_nmDec < 2e-16 ***
## month_nmFeb < 2e-16 ***
## month_nmJan < 2e-16 ***
## month_nmJul < 2e-16 ***
## month_nmJun < 2e-16 ***
## month_nmMar < 2e-16 ***
## month_nmMay 4.44e-05 ***
## month_nmNov < 2e-16 ***
## month_nmOct < 2e-16 ***
## month_nmSep < 2e-16 ***
## carrier_nameAlaska Airlines Inc. < 2e-16 ***
## carrier_nameAllegiant Air 0.862861
## carrier_nameAloha Airlines Inc. 7.51e-08 ***
## carrier_nameAmerica West Airlines Inc. 5.44e-08 ***
## carrier_nameAmerican Airlines Inc. 0.000843 ***
## carrier_nameAmerican Eagle Airlines Inc. 0.047032 *
## carrier_nameATA Airlines d/b/a ATA 0.274457
## carrier_nameAtlantic Coast Airlines 1.65e-05 ***
## carrier_nameAtlantic Southeast Airlines < 2e-16 ***
## carrier_nameComair Inc. < 2e-16 ***
## carrier_nameContinental Air Lines Inc. 0.001311 **
## carrier_nameDelta Air Lines Inc. < 2e-16 ***
## carrier_nameEndeavor Air Inc. 9.61e-14 ***
## carrier_nameEnvoy Air 0.522093
## carrier_nameExpressJet Airlines Inc. 0.645567
## carrier_nameExpressJet Airlines LLC 0.000258 ***
## carrier_nameFrontier Airlines Inc. < 2e-16 ***
## carrier_nameHawaiian Airlines Inc. 4.56e-12 ***
## carrier_nameIndependence Air 3.60e-08 ***
## carrier_nameJetBlue Airways < 2e-16 ***
## carrier_nameMesa Airlines Inc. 0.002283 **
## carrier_nameNorthwest Airlines Inc. < 2e-16 ***
## carrier_namePinnacle Airlines Inc. 7.47e-05 ***
## carrier_namePSA Airlines Inc. 0.000414 ***
## carrier_nameRepublic Airline < 2e-16 ***
## carrier_nameSkyWest Airlines Inc. < 2e-16 ***
## carrier_nameSouthwest Airlines Co. 7.01e-12 ***
## carrier_nameSpirit Air Lines 0.188931
## carrier_nameUnited Air Lines Inc. 1.16e-05 ***
## carrier_nameUS Airways Inc. 0.001156 **
## carrier_nameVirgin America < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09345 on 68041 degrees of freedom
## (40 observations deleted due to missingness)
## Multiple R-squared: 0.1985, Adjusted R-squared: 0.1976
## F-statistic: 237.3 on 71 and 68041 DF, p-value: < 2.2e-16
#Anova to check for variance
anova(airport_m)
Based on the dataset, we revisit the original questions:
What is the biggest contributor to flight delays?
Flying into EWR and traveling in the month of December have the largest positive coefficients in the model, so they contribute the most to airline delays.
Which holiday month (July, November, or December) is the worst for travel?
The months of June, July, and December are the worst for travel, with December contributing to delays the most.
What is the most efficient airport?
DCA (Ronald Reagan Washington National Airport) contributes the least to delays.
Are airports generally getting more efficient over time?
Airports are remaining relatively the same in terms of efficiency.
Given this dataset, we estimate with 95% confidence that approx. 22% of incoming flights are delayed on average, and the most significant factors in delays are the month of December and EWR airport. With that said, this model has some flaws that the reader should be aware of:
- More than 80% of the variability remains unexplained
- There are other factors in delays, such as departure delays, which are not included in this dataset
Including these other factors may help in building a better model.