FlightDate - Flight Date (yyyymmdd)
Carrier - Carrier code for the airline
TailNum - Tail Number
FlightNum - Flight Number
Origin - The airport code
OriginCityName - Origin city name
OriginState - Origin State
Dest - Destination airport code
DestCityName - Destination city name
DestState - Destination state
CRSDepTime - Scheduled departure time (local time: hhmm)
DepTime - Actual departure time (local time: hhmm)
WheelsOff - Wheels off time (local time: hhmm)
WheelsOn - Wheels on time (local time: hhmm)
CRSArrTime - Scheduled arrival time (local time: hhmm)
ArrTime - Actual arrival time (local time: hhmm)
Cancelled - Cancelled flight indicator (1 = yes)
Diverted - Diverted flight indicator (1 = yes)
CRSElapsedTime - Scheduled elapsed time of flight, in minutes
ActualElapsedTime - Actual elapsed time of flight, in minutes
Distance - Distance between airports, in miles
For this assignment we will be analyzing data on all the U.S. domestic flights reported for on-time performance in January 2016. I am particularly interested in determining if certain origin airports experienced more delayed departures than others. I will be looking for which origin airports experienced delayed departures of 30 minutes or more and how many flights per origin airport experienced this, across carriers, and the percentage of flights that fell into this category for each origin airport/location. Due to the large size of the dataset (originally 445,827 observations), I have decided to perform my analysis on a subset of the data set, I will focus on the Top 10 Busiest Airports in the U.S.. Out of these top 10 busiest airports, I will determine which is the “worst” based on whichever has the largest percentage of delayed departures falling into the 30 minutes or more category and compare that airport to all 10 airports in the subset. Importing just the flights for the Top 10 Busiest Airports in the U.S. changed the dataset to 157,155 observations. However, after removing the flights that were cancelled we are left with 154,169 flights which is the denominator I will be using in the analysis portion of this assignment.
First I will look at the Top 10 Busiest Airports in the U.S. and convert the necessary variables such that characters and integers representing dates and times are changed to the correct format. Then I will perform the same calculations that we performed in our Unit 6 Lecture Notes to create new variables of interest. After I have tidied the data and created the necessary variables in their appropriate format I will look at the departure delay times, based on origin.
## 'data.frame': 157155 obs. of 21 variables:
## $ FlightDate : chr "1/1/2016" "1/2/2016" "1/3/2016" "1/4/2016" ...
## $ Carrier : chr "AA" "AA" "AA" "AA" ...
## $ TailNum : chr "N029AA" "N005AA" "N014AA" "N018AA" ...
## $ FlightNum : int 139 139 139 139 139 139 139 139 139 139 ...
## $ Origin : chr "ATL" "ATL" "ATL" "ATL" ...
## $ OriginCityName : chr "Atlanta, GA" "Atlanta, GA" "Atlanta, GA" "Atlanta, GA" ...
## $ OriginState : chr "GA" "GA" "GA" "GA" ...
## $ Dest : chr "MIA" "MIA" "MIA" "MIA" ...
## $ DestCityName : chr "Miami, FL" "Miami, FL" "Miami, FL" "Miami, FL" ...
## $ DestState : chr "FL" "FL" "FL" "FL" ...
## $ CRSDepTime : int 810 810 810 810 810 810 810 810 810 810 ...
## $ DepTime : int 808 812 807 825 805 806 806 818 823 1033 ...
## $ WheelsOff : int 820 827 819 838 819 828 822 834 843 1152 ...
## $ WheelsOn : int 944 952 1000 1004 952 959 949 1005 1017 1318 ...
## $ CRSArrTime : int 1016 1016 1016 1016 1010 1010 1010 1010 1010 1010 ...
## $ ArrTime : int 1001 1002 1006 1010 1033 1035 1010 1019 1027 1347 ...
## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CRSElapsedTime : int 126 126 126 126 120 120 120 120 120 120 ...
## $ ActualElapsedTime: int 113 110 119 105 148 149 124 121 124 194 ...
## $ Distance : int 594 594 594 594 594 594 594 594 594 594 ...
n_distinct(Top10_Busiest$Origin)
## [1] 10
Before looking at the departure delay times by origin airport for the top 10 busiest airports, I will need to check the subset dataset for missing data.
kable(str(Top10_Busiest %>% filter(!complete.cases(.))))
## 'data.frame': 3315 obs. of 21 variables:
## $ FlightDate : chr "1/3/2016" "1/23/2016" "1/23/2016" "1/23/2016" ...
## $ Carrier : chr "AA" "AA" "AA" "AA" ...
## $ TailNum : chr "N029AA" "N003AA" "N4YUAA" "N488AA" ...
## $ FlightNum : int 1127 1127 1574 1635 488 821 1808 2062 2085 451 ...
## $ Origin : chr "ATL" "ATL" "ATL" "ATL" ...
## $ OriginCityName : chr "Atlanta, GA" "Atlanta, GA" "Atlanta, GA" "Atlanta, GA" ...
## $ OriginState : chr "GA" "GA" "GA" "GA" ...
## $ Dest : chr "MIA" "MIA" "ORD" "DFW" ...
## $ DestCityName : chr "Miami, FL" "Miami, FL" "Chicago, IL" "Dallas/Fort Worth, TX" ...
## $ DestState : chr "FL" "FL" "IL" "TX" ...
## $ CRSDepTime : int 645 645 744 608 1145 848 1040 545 630 1728 ...
## $ DepTime : int 640 NA NA NA NA NA NA NA NA NA ...
## $ WheelsOff : int 659 NA NA NA NA NA NA NA NA NA ...
## $ WheelsOn : int 1028 NA NA NA NA NA NA NA NA NA ...
## $ CRSArrTime : int 837 837 856 736 1358 1009 1200 651 826 1845 ...
## $ ArrTime : int 1037 NA NA NA NA NA NA NA NA NA ...
## $ Cancelled : int 0 1 1 1 1 1 1 1 1 1 ...
## $ Diverted : int 1 0 0 0 0 0 0 0 0 0 ...
## $ CRSElapsedTime : int 112 112 132 148 253 81 80 66 116 77 ...
## $ ActualElapsedTime: int NA NA NA NA NA NA NA NA NA NA ...
## $ Distance : int 594 594 606 731 1587 226 226 226 666 226 ...
kable(str(Top10_Busiest %>% filter(Cancelled == 1| Diverted == 1)))
## 'data.frame': 3315 obs. of 21 variables:
## $ FlightDate : chr "1/3/2016" "1/23/2016" "1/23/2016" "1/23/2016" ...
## $ Carrier : chr "AA" "AA" "AA" "AA" ...
## $ TailNum : chr "N029AA" "N003AA" "N4YUAA" "N488AA" ...
## $ FlightNum : int 1127 1127 1574 1635 488 821 1808 2062 2085 451 ...
## $ Origin : chr "ATL" "ATL" "ATL" "ATL" ...
## $ OriginCityName : chr "Atlanta, GA" "Atlanta, GA" "Atlanta, GA" "Atlanta, GA" ...
## $ OriginState : chr "GA" "GA" "GA" "GA" ...
## $ Dest : chr "MIA" "MIA" "ORD" "DFW" ...
## $ DestCityName : chr "Miami, FL" "Miami, FL" "Chicago, IL" "Dallas/Fort Worth, TX" ...
## $ DestState : chr "FL" "FL" "IL" "TX" ...
## $ CRSDepTime : int 645 645 744 608 1145 848 1040 545 630 1728 ...
## $ DepTime : int 640 NA NA NA NA NA NA NA NA NA ...
## $ WheelsOff : int 659 NA NA NA NA NA NA NA NA NA ...
## $ WheelsOn : int 1028 NA NA NA NA NA NA NA NA NA ...
## $ CRSArrTime : int 837 837 856 736 1358 1009 1200 651 826 1845 ...
## $ ArrTime : int 1037 NA NA NA NA NA NA NA NA NA ...
## $ Cancelled : int 0 1 1 1 1 1 1 1 1 1 ...
## $ Diverted : int 1 0 0 0 0 0 0 0 0 0 ...
## $ CRSElapsedTime : int 112 112 132 148 253 81 80 66 116 77 ...
## $ ActualElapsedTime: int NA NA NA NA NA NA NA NA NA NA ...
## $ Distance : int 594 594 606 731 1587 226 226 226 666 226 ...
We can see that 3,315 observations are missing data, we can also see that these observations are missing data likely because they are flights that were either cancelled or diverted. We will filter these out during our calculations.
Next we will convert the time variables from characters and integers to date and time objects. The variables that need to be changed include, FlightDate, CRSDepTime, CRSArrTime, DepTime, WheelsOff, WheelsOn, and ARRTime.
#Converting FlightDate to date format:
Top10_Busiest$FlightDate <- as.Date(Top10_Busiest$FlightDate, format = "%m/%d/%Y")
kable(head(Top10_Busiest %>% select(1:5)))
| FlightDate | Carrier | TailNum | FlightNum | Origin |
|---|---|---|---|---|
| 2016-01-01 | AA | N029AA | 139 | ATL |
| 2016-01-02 | AA | N005AA | 139 | ATL |
| 2016-01-03 | AA | N014AA | 139 | ATL |
| 2016-01-04 | AA | N018AA | 139 | ATL |
| 2016-01-05 | AA | N030AA | 139 | ATL |
| 2016-01-12 | AA | N007AA | 139 | ATL |
#Converting CRSDepTime to date-time format:
#Note: Cancelled flights are not filtered out for scheduled departure (CRSDepTime) because this was not a missing variable for the cancelled flights.
Top10_Busiest <- Top10_Busiest %>% mutate(new_CRSDepTime = paste(FlightDate, sprintf("%04d", CRSDepTime)))
Top10_Busiest$new_CRSDepTime <- as.POSIXct(Top10_Busiest$new_CRSDepTime, format = "%Y-%m-%d %H%M")
kable(head(Top10_Busiest %>% select(new_CRSDepTime, CRSDepTime)))
| new_CRSDepTime | CRSDepTime |
|---|---|
| 2016-01-01 08:10:00 | 810 |
| 2016-01-02 08:10:00 | 810 |
| 2016-01-03 08:10:00 | 810 |
| 2016-01-04 08:10:00 | 810 |
| 2016-01-05 08:10:00 | 810 |
| 2016-01-12 08:10:00 | 810 |
#Converting CRSArrTime, DepTime, WheelsOff, WheelsOn, ArrTime to date-time format:
#Note: Cancelled flights are not filtered out for arrival time (ArrTime) because this was not a missing variable for the cancelled flights.
Top10_Busiest <- Top10_Busiest %>% mutate(new_CRSArrTime = paste(FlightDate, sprintf("%04d", CRSArrTime)))
Top10_Busiest$new_CRSArrTime <- as.POSIXct(Top10_Busiest$new_CRSArrTime, format = "%Y -%m - %d %H%M")
Top10_Busiest <- Top10_Busiest %>% filter(Cancelled ==0) %>% mutate(new_DepTime = paste(FlightDate, sprintf("%04d", DepTime)),
new_WheelsOff=paste(FlightDate, sprintf("%04d", WheelsOff)),
new_WheelsOn = paste(FlightDate, sprintf("%04d", WheelsOn)),
new_ArrTime = paste(FlightDate, sprintf("%04d", ArrTime)))
Top10_Busiest$new_DepTime <- as.POSIXct(Top10_Busiest$new_DepTime, format="%Y-%m-%d %H%M")
Top10_Busiest$new_WheelsOff <- as.POSIXct(Top10_Busiest$new_WheelsOff, format = "%Y-%m-%d %H%M")
Top10_Busiest$new_WheelsOn <- as.POSIXct(Top10_Busiest$new_WheelsOn, format = "%Y-%m-%d %H%M")
Top10_Busiest$new_ArrTime <- as.POSIXct(Top10_Busiest$new_ArrTime, format = "%Y-%m-%d %H%M")
kable(head(Top10_Busiest %>% select(FlightDate, new_CRSDepTime, new_CRSArrTime, new_DepTime, new_WheelsOff, new_WheelsOn, new_ArrTime)))
| FlightDate | new_CRSDepTime | new_CRSArrTime | new_DepTime | new_WheelsOff | new_WheelsOn | new_ArrTime |
|---|---|---|---|---|---|---|
| 2016-01-01 | 2016-01-01 08:10:00 | 2016-01-01 10:16:00 | 2016-01-01 08:08:00 | 2016-01-01 08:20:00 | 2016-01-01 09:44:00 | 2016-01-01 10:01:00 |
| 2016-01-02 | 2016-01-02 08:10:00 | 2016-01-02 10:16:00 | 2016-01-02 08:12:00 | 2016-01-02 08:27:00 | 2016-01-02 09:52:00 | 2016-01-02 10:02:00 |
| 2016-01-03 | 2016-01-03 08:10:00 | 2016-01-03 10:16:00 | 2016-01-03 08:07:00 | 2016-01-03 08:19:00 | 2016-01-03 10:00:00 | 2016-01-03 10:06:00 |
| 2016-01-04 | 2016-01-04 08:10:00 | 2016-01-04 10:16:00 | 2016-01-04 08:25:00 | 2016-01-04 08:38:00 | 2016-01-04 10:04:00 | 2016-01-04 10:10:00 |
| 2016-01-05 | 2016-01-05 08:10:00 | 2016-01-05 10:10:00 | 2016-01-05 08:05:00 | 2016-01-05 08:19:00 | 2016-01-05 09:52:00 | 2016-01-05 10:33:00 |
| 2016-01-12 | 2016-01-12 08:10:00 | 2016-01-12 10:10:00 | 2016-01-12 08:06:00 | 2016-01-12 08:28:00 | 2016-01-12 09:59:00 | 2016-01-12 10:35:00 |
I will be calculating the following:
DepDelay = new_DepTime - new_CRSDepTime
DepDelayMinutes = ifelse(DepDelay < 0, 0, DepDelay)
DepDel30 = ifelse(DepDelay >= 30, 1, 0)
TaxiOut = new_WheelsOff - new_DepTime
TaxiIn = new_ArrTime - new_WheelsOn
ArrDelay = new_ArrTime - new_CRSArrTime
ArrDelayMinutes = ifelse(ArrDelay < 0, 0, ArrDelay)
ArrDel15 = ifelse(ArrDelay >= 15, 1, 0)
FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime
AirTime = ActualElapsedTime - TaxiOut - TaxiIn
AirSpeed = Distance / (Airtime / 60)
#Using difftime() to subtract two date/time objects, in this case the actual departure time minue the scheduled desparture time and provide results in minutes:
Top10_Busiest <- Top10_Busiest %>% filter(Cancelled ==0) %>% mutate(DepDelay = as.integer(difftime(new_DepTime, new_CRSDepTime, units ="mins")))
pander(head(Top10_Busiest %>% select(CRSDepTime, DepTime, DepDelay)))
| CRSDepTime | DepTime | DepDelay |
|---|---|---|
| 810 | 808 | -2 |
| 810 | 812 | 2 |
| 810 | 807 | -3 |
| 810 | 825 | 15 |
| 810 | 805 | -5 |
| 810 | 806 | -4 |
#Using ifelse() to show the departure delays that were greater than or equal to 30 minutes (note: this is different than the Lecture Notes which calculated 15 minutes)
Top10_Busiest <- Top10_Busiest %>% filter(Cancelled ==0) %>% mutate(DepDelayMinutes = ifelse(DepDelay <0, 0, DepDelay),
DepDel30 = ifelse(DepDelay >=30, 1, 0))
pander(head(Top10_Busiest %>% select(DepDelay, DepDelayMinutes, DepDel30)))
| DepDelay | DepDelayMinutes | DepDel30 |
|---|---|---|
| -2 | 0 | 0 |
| 2 | 2 | 0 |
| -3 | 0 | 0 |
| 15 | 15 | 0 |
| -5 | 0 | 0 |
| -4 | 0 | 0 |
#Calculating the remaining variables from the "Variables Extended List"
Top10_Busiest <- Top10_Busiest %>% filter(Cancelled ==0) %>% mutate(TaxiOut = as.integer(difftime(new_WheelsOff, new_DepTime, units = "mins")),
TaxiIn = as.integer(difftime(new_ArrTime, new_WheelsOn, units = "mins")),
ArrDelay = as.integer(difftime(new_ArrTime, new_CRSArrTime, units = "mins")), ArrDelayMinutes = ifelse(ArrDelay < 0, 0, ArrDelay),
ArrDel15 = ifelse(ArrDelay >=15, 1, 0),
FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime)
Top10_Busiest <- Top10_Busiest %>% filter(Cancelled ==0) %>% mutate(AirTime = ActualElapsedTime - TaxiIn - TaxiOut)
Top10_Busiest <- Top10_Busiest %>% filter(Cancelled == 0) %>% mutate(AirSpeed = Distance / (AirTime / 60))
kable(head(Top10_Busiest) %>% select(DepDelay, DepDelayMinutes, DepDel30, TaxiOut, TaxiIn, ArrDelay, ArrDelayMinutes, ArrDel15, FlightTimeBuffer, AirTime, AirSpeed), digits=0)
| DepDelay | DepDelayMinutes | DepDel30 | TaxiOut | TaxiIn | ArrDelay | ArrDelayMinutes | ArrDel15 | FlightTimeBuffer | AirTime | AirSpeed |
|---|---|---|---|---|---|---|---|---|---|---|
| -2 | 0 | 0 | 12 | 17 | -15 | 0 | 0 | 13 | 84 | 424 |
| 2 | 2 | 0 | 15 | 10 | -14 | 0 | 0 | 16 | 85 | 419 |
| -3 | 0 | 0 | 12 | 6 | -10 | 0 | 0 | 7 | 101 | 353 |
| 15 | 15 | 0 | 13 | 6 | -6 | 0 | 0 | 21 | 86 | 414 |
| -5 | 0 | 0 | 14 | 41 | 23 | 23 | 1 | -28 | 93 | 383 |
| -4 | 0 | 0 | 22 | 36 | 25 | 25 | 1 | -29 | 91 | 392 |
Top10_Busiest <- Top10_Busiest %>%
mutate(NextDayArrival = ifelse(new_DepTime > new_ArrTime, 1, 0)) %>%
mutate(ArrDate = ifelse(NextDayArrival == 1, FlightDate + 1, FlightDate))
Top10_Busiest$ArrDate <- as.Date(Top10_Busiest$ArrDate, Origin = "1970-01-01")
Top10_Busiest2 <- Top10_Busiest %>% select(FlightDate, Origin, DepTime, DepDelay, ArrTime, NextDayArrival)
(pander(head(Top10_Busiest2)))
| FlightDate | Origin | DepTime | DepDelay | ArrTime | NextDayArrival |
|---|---|---|---|---|---|
| 2016-01-01 | ATL | 808 | -2 | 1001 | 0 |
| 2016-01-02 | ATL | 812 | 2 | 1002 | 0 |
| 2016-01-03 | ATL | 807 | -3 | 1006 | 0 |
| 2016-01-04 | ATL | 825 | 15 | 1010 | 0 |
| 2016-01-05 | ATL | 805 | -5 | 1033 | 0 |
| 2016-01-12 | ATL | 806 | -4 | 1035 | 0 |
Looking at Figures 8 and 9, we can see that 8,629 flights experienced next day arrivals. Out of those flights experiencing next day arrivals, LAX (Los Angeles International Airport) experienced the most with 1,850 flights having next day arrivals.
pander(Top10_Busiest %>% group_by(Origin) %>% filter(NextDayArrival == 1) %>% summarise(max(DepDelayMinutes), mean(DepDelayMinutes),n()))
| Origin | max(DepDelayMinutes) | mean(DepDelayMinutes) | n() |
|---|---|---|---|
| ATL | 861 | 17.28 | 1459 |
| DEN | 1438 | 37.48 | 610 |
| DFW | 277 | 31.94 | 411 |
| IAH | 280 | 38.98 | 277 |
| JFK | 334 | 38.21 | 637 |
| LAS | 909 | 23.57 | 1131 |
| LAX | 611 | 19.4 | 1850 |
| ORD | 1437 | 71.13 | 486 |
| PHX | 307 | 19.43 | 645 |
| SFO | 1435 | 31.94 | 1123 |
Top10_Busiest %>% filter(NextDayArrival == 1) %>% summarise(n())
## n()
## 1 8629
All data shown is for the top 10 busiest airports in the U.S.for January 2016.
First, I wanted to check how many flights had any departure delays and how many flights had departure delays greater than or equal to 30 minutes in our subset.
Top10_Busiest %>% filter(DepDelay>1) %>% summarise(count=n())
## count
## 1 51735
Top10_Busiest %>% filter(DepDel30 ==1) %>% summarise(count = n())
## count
## 1 16235
DepDelay_30 <- Top10_Busiest %>% count(DepDelay>=30)
DepDelay_30
We can see that 51,735 flights of the 154,169 total flights had departure delays that exceeded 1 minute, which is about 33.5% of the flights. Additionally, we can see that 16,235 flights had departure delays greater than or equal to 30 minutes, which is about 10.5% of all the flights for the top 10 busiest airports in the U.S.
per_delayed <- (51735/154169)
per_delayed
## [1] 0.3355733
perc_delayed30 <- (16235/154169)
perc_delayed30
## [1] 0.1053065
FlightsBYOrigin <- Top10_Busiest %>% group_by(Origin) %>% count(n()) %>% arrange(desc(n))
pander(FlightsBYOrigin)
| Origin | n() | n |
|---|---|---|
| ATL | 29456 | 29456 |
| ORD | 18251 | 18251 |
| DEN | 17317 | 17317 |
| DFW | 16341 | 16341 |
| LAX | 16128 | 16128 |
| PHX | 12864 | 12864 |
| SFO | 12788 | 12788 |
| LAS | 11984 | 11984 |
| IAH | 11495 | 11495 |
| JFK | 7545 | 7545 |
Dep_Delay_Origin <- Top10_Busiest %>% group_by(Origin) %>% filter(DepDelay >0) %>% summarise(mean(DepDelay),n())
pander(Dep_Delay_Origin)
| Origin | mean(DepDelay) | n() |
|---|---|---|
| ATL | 24.2 | 10266 |
| DEN | 26.08 | 5877 |
| DFW | 27.28 | 5155 |
| IAH | 27.3 | 3674 |
| JFK | 32.98 | 3039 |
| LAS | 30.25 | 4457 |
| LAX | 29.7 | 6309 |
| ORD | 33.22 | 7898 |
| PHX | 23.77 | 4472 |
| SFO | 40.31 | 5130 |
When looking at the origin airports with departure delays, it is clear that ATL (Hartsfield-Jackson Atlanta International Airport, GA) had the most depature delays with 10,266 flights experiencing some type of delay in January 2016. While SFO (San Francisco International, CA) experienced the highest average departure delay time at 40 minutes.
Dep_Delay_Origin <- Top10_Busiest %>% group_by(Origin) %>% filter(DepDelay >=30) %>% summarise(mean(DepDelay),n())
pander(Dep_Delay_Origin)
| Origin | mean(DepDelay) | n() |
|---|---|---|
| ATL | 71.09 | 2431 |
| DEN | 77.43 | 1449 |
| DFW | 71.47 | 1464 |
| IAH | 77.02 | 977 |
| JFK | 78.7 | 1006 |
| LAS | 81.22 | 1288 |
| LAX | 75.67 | 1877 |
| ORD | 76.89 | 2701 |
| PHX | 70.81 | 1039 |
| SFO | 87.34 | 2003 |
When looking at the departure delays that were greater than or equal to 30 minutes, ORD (O’Hare International Airport, IL) experienced the highest number of flights with departure delays greater than or equal to 30 minutes, at 2,701 flights. While once again, SFO experienced the highest average departure time over 30 minutes, at 87 minutes.
#Percentage of Flights with Departure Delays by Origin
FDD <- Top10_Busiest %>% group_by(Origin) %>% mutate(delayed = ifelse(DepDelay >0,1,0)) %>% summarize(p_delayed = sum(delayed) / n())
FDD %>% ggvis(~Origin, ~p_delayed) %>% layer_bars(fill :="steelblue") %>% add_axis("x", title = "Origin Airport") %>%
add_axis("y", title = "Percentage of Flights with Departure Delays")
#Percentage of Flights with Departure Delays Over 30 Minutes by Origin
FDD_30 <- Top10_Busiest %>% group_by(Origin) %>% mutate(delayed30 = ifelse(DepDelay >=30, 1, 0)) %>% summarise(p_delayed = sum(delayed30)/n())
FDD_30 %>% ggvis(~Origin, ~p_delayed) %>% layer_bars(fill:="steelblue") %>% add_axis("x", title = "Origin Airport") %>%
add_axis("y", title = "Percentage of Flights with Departure Delays")
From looking at Figure 15 and 16, we can see that ORD (O’Hare International Airport, IL) had the hightest percentage of flights with departure delays of any type. While SFO (San Francisco International, CA) has the highest percentage of flights experiencing departure delays greater than or equal to 30 minutes. This means we can consider SFO to be the “worst” of the top 10 busiest airports in the U.S. because it experiences the highest percentage of departure delays greater than or equal to 30 minutes. However, this percentage is still under 16% which does not seem to be a very alarming rate. Nevertheless, for this assignement it is now deemed the “worst.”
It would be interesting to also look at what day of the week SFO experienced the highest departure delays in January 2016. There were 12,788 flights originating from SFO in January 2016. Additionally, 5,130 of these flights experienced some type of depature delay.
SFO_Data <- Top10_Busiest %>% filter(Origin == "SFO")
summarise(SFO_Data, n())
## n()
## 1 12788
SFO_Data %>% filter(DepDelay >0) %>% summarise(n())
## n()
## 1 5130
SFO_Delays <- SFO_Data %>% filter(DepDelay >0)
SFO_Data %>% group_by(day = wday(FlightDate, label = TRUE)) %>% mutate(delayed = ifelse(DepDelay >0, 1, 0)) %>% ggvis(~day, ~delayed) %>% layer_bars(fill:="steelblue") %>% add_axis("x", title = "Day of the Week") %>%
add_axis("y", title = "Number of Flights with Departure Delays")
Mean_SFO_Data <- SFO_Delays %>% group_by(day = wday(FlightDate, label = TRUE)) %>% summarise(max(DepDelay), mean(DepDelay), n())
pander(Mean_SFO_Data)
| day | max(DepDelay) | mean(DepDelay) | n() |
|---|---|---|---|
| Sun | 575 | 33.33 | 829 |
| Mon | 1435 | 41.09 | 760 |
| Tues | 1432 | 49.63 | 724 |
| Wed | 463 | 48.37 | 715 |
| Thurs | 402 | 27.13 | 567 |
| Fri | 1435 | 42.31 | 970 |
| Sat | 1435 | 37.16 | 565 |
For SFO, Friday experienced the highest number of flights with departure delays at 970 flights and an average departure delay of almost 43 minutes. While Tuesday averaged the highest departure delay time with an average of almost 50 minutes.
In conclusion, of the top 10 busiest airports in the U.S., SFO is considered the “worst” when it comes to departure delay times. Additionally, departing from SFO on Friday or Tuesday may be problematic if you are in a rush, since these are the days experiencing the largest number of departure delays and longest average departure delay time, respectively.
While you wouldn’t think SFO would be directly impacted by winter weather conditions, this airport may be indirectly impacted by flights coming from origins with winter conditions (this is something to look at with the entire dataset, possibly segmented by different region: South, West, MidWest, NorthEast). One thing I took from this analysis was, if you’re planning to travel out of the top 10 busiest airports in the U.S. (during January), a good book to pass the time would be helpful, since there is about a 33% chance you will experience a departure delay. To be on the safe side, a good book may be appropriate when traveling in general.
This datasets had a number of interesting relationships that could be explored. One of which could be regions and their implications for different departure times. Another would be to segment based on if an airport offers red-eye flights and how this impacts departure times.