Original Variables List

FlightDate - Flight Date (yyyymmdd)
Carrier - Carrier code for the airline
TailNum - Tail Number
FlightNum - Flight Number
Origin - The airport code
OriginCityName - Origin city name
OriginState - Origin State
Dest - Destination airport code
DestCityName - Destination city name
DestState - Destination state
CRSDepTime - Scheduled departure time (local time: hhmm)
DepTime - Actual departure time (local time: hhmm)
WheelsOff - Wheels off time (local time: hhmm)
WheelsOn - Wheels on time (local time: hhmm)
CRSArrTime - Scheduled arrival time (local time: hhmm)
ArrTime - Actual arrival time (local time: hhmm)
Cancelled - Cancelled flight indicator (1 = yes)
Diverted - Diverted flight indicator (1 = yes)
CRSElapsedTime - Scheduled elapsed time of flight, in minutes
ActualElapsedTime - Actual elapsed time of flight, in minutes
Distance - Distance between airports, in miles

Introduction

For this assignment we will be analyzing data on all the U.S. domestic flights reported for on-time performance in January 2016. I am particularly interested in determining if certain origin airports experienced more delayed departures than others. I will be looking for which origin airports experienced delayed departures of 30 minutes or more and how many flights per origin airport experienced this, across carriers, and the percentage of flights that fell into this category for each origin airport/location. Due to the large size of the dataset (originally 445,827 observations), I have decided to perform my analysis on a subset of the data set, I will focus on the Top 10 Busiest Airports in the U.S.. Out of these top 10 busiest airports, I will determine which is the “worst” based on whichever has the largest percentage of delayed departures falling into the 30 minutes or more category and compare that airport to all 10 airports in the subset. Importing just the flights for the Top 10 Busiest Airports in the U.S. changed the dataset to 157,155 observations. However, after removing the flights that were cancelled we are left with 154,169 flights which is the denominator I will be using in the analysis portion of this assignment.

Methodology

First I will look at the Top 10 Busiest Airports in the U.S. and convert the necessary variables such that characters and integers representing dates and times are changed to the correct format. Then I will perform the same calculations that we performed in our Unit 6 Lecture Notes to create new variables of interest. After I have tidied the data and created the necessary variables in their appropriate format I will look at the departure delay times, based on origin.

## 'data.frame':    157155 obs. of  21 variables:
##  $ FlightDate       : chr  "1/1/2016" "1/2/2016" "1/3/2016" "1/4/2016" ...
##  $ Carrier          : chr  "AA" "AA" "AA" "AA" ...
##  $ TailNum          : chr  "N029AA" "N005AA" "N014AA" "N018AA" ...
##  $ FlightNum        : int  139 139 139 139 139 139 139 139 139 139 ...
##  $ Origin           : chr  "ATL" "ATL" "ATL" "ATL" ...
##  $ OriginCityName   : chr  "Atlanta, GA" "Atlanta, GA" "Atlanta, GA" "Atlanta, GA" ...
##  $ OriginState      : chr  "GA" "GA" "GA" "GA" ...
##  $ Dest             : chr  "MIA" "MIA" "MIA" "MIA" ...
##  $ DestCityName     : chr  "Miami, FL" "Miami, FL" "Miami, FL" "Miami, FL" ...
##  $ DestState        : chr  "FL" "FL" "FL" "FL" ...
##  $ CRSDepTime       : int  810 810 810 810 810 810 810 810 810 810 ...
##  $ DepTime          : int  808 812 807 825 805 806 806 818 823 1033 ...
##  $ WheelsOff        : int  820 827 819 838 819 828 822 834 843 1152 ...
##  $ WheelsOn         : int  944 952 1000 1004 952 959 949 1005 1017 1318 ...
##  $ CRSArrTime       : int  1016 1016 1016 1016 1010 1010 1010 1010 1010 1010 ...
##  $ ArrTime          : int  1001 1002 1006 1010 1033 1035 1010 1019 1027 1347 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CRSElapsedTime   : int  126 126 126 126 120 120 120 120 120 120 ...
##  $ ActualElapsedTime: int  113 110 119 105 148 149 124 121 124 194 ...
##  $ Distance         : int  594 594 594 594 594 594 594 594 594 594 ...
n_distinct(Top10_Busiest$Origin)
## [1] 10

Before looking at the departure delay times by origin airport for the top 10 busiest airports, I will need to check the subset dataset for missing data.

kable(str(Top10_Busiest %>% filter(!complete.cases(.))))
## 'data.frame':    3315 obs. of  21 variables:
##  $ FlightDate       : chr  "1/3/2016" "1/23/2016" "1/23/2016" "1/23/2016" ...
##  $ Carrier          : chr  "AA" "AA" "AA" "AA" ...
##  $ TailNum          : chr  "N029AA" "N003AA" "N4YUAA" "N488AA" ...
##  $ FlightNum        : int  1127 1127 1574 1635 488 821 1808 2062 2085 451 ...
##  $ Origin           : chr  "ATL" "ATL" "ATL" "ATL" ...
##  $ OriginCityName   : chr  "Atlanta, GA" "Atlanta, GA" "Atlanta, GA" "Atlanta, GA" ...
##  $ OriginState      : chr  "GA" "GA" "GA" "GA" ...
##  $ Dest             : chr  "MIA" "MIA" "ORD" "DFW" ...
##  $ DestCityName     : chr  "Miami, FL" "Miami, FL" "Chicago, IL" "Dallas/Fort Worth, TX" ...
##  $ DestState        : chr  "FL" "FL" "IL" "TX" ...
##  $ CRSDepTime       : int  645 645 744 608 1145 848 1040 545 630 1728 ...
##  $ DepTime          : int  640 NA NA NA NA NA NA NA NA NA ...
##  $ WheelsOff        : int  659 NA NA NA NA NA NA NA NA NA ...
##  $ WheelsOn         : int  1028 NA NA NA NA NA NA NA NA NA ...
##  $ CRSArrTime       : int  837 837 856 736 1358 1009 1200 651 826 1845 ...
##  $ ArrTime          : int  1037 NA NA NA NA NA NA NA NA NA ...
##  $ Cancelled        : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ Diverted         : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ CRSElapsedTime   : int  112 112 132 148 253 81 80 66 116 77 ...
##  $ ActualElapsedTime: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Distance         : int  594 594 606 731 1587 226 226 226 666 226 ...
kable(str(Top10_Busiest %>% filter(Cancelled == 1| Diverted == 1)))
## 'data.frame':    3315 obs. of  21 variables:
##  $ FlightDate       : chr  "1/3/2016" "1/23/2016" "1/23/2016" "1/23/2016" ...
##  $ Carrier          : chr  "AA" "AA" "AA" "AA" ...
##  $ TailNum          : chr  "N029AA" "N003AA" "N4YUAA" "N488AA" ...
##  $ FlightNum        : int  1127 1127 1574 1635 488 821 1808 2062 2085 451 ...
##  $ Origin           : chr  "ATL" "ATL" "ATL" "ATL" ...
##  $ OriginCityName   : chr  "Atlanta, GA" "Atlanta, GA" "Atlanta, GA" "Atlanta, GA" ...
##  $ OriginState      : chr  "GA" "GA" "GA" "GA" ...
##  $ Dest             : chr  "MIA" "MIA" "ORD" "DFW" ...
##  $ DestCityName     : chr  "Miami, FL" "Miami, FL" "Chicago, IL" "Dallas/Fort Worth, TX" ...
##  $ DestState        : chr  "FL" "FL" "IL" "TX" ...
##  $ CRSDepTime       : int  645 645 744 608 1145 848 1040 545 630 1728 ...
##  $ DepTime          : int  640 NA NA NA NA NA NA NA NA NA ...
##  $ WheelsOff        : int  659 NA NA NA NA NA NA NA NA NA ...
##  $ WheelsOn         : int  1028 NA NA NA NA NA NA NA NA NA ...
##  $ CRSArrTime       : int  837 837 856 736 1358 1009 1200 651 826 1845 ...
##  $ ArrTime          : int  1037 NA NA NA NA NA NA NA NA NA ...
##  $ Cancelled        : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ Diverted         : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ CRSElapsedTime   : int  112 112 132 148 253 81 80 66 116 77 ...
##  $ ActualElapsedTime: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Distance         : int  594 594 606 731 1587 226 226 226 666 226 ...

We can see that 3,315 observations are missing data, we can also see that these observations are missing data likely because they are flights that were either cancelled or diverted. We will filter these out during our calculations.

Next we will convert the time variables from characters and integers to date and time objects. The variables that need to be changed include, FlightDate, CRSDepTime, CRSArrTime, DepTime, WheelsOff, WheelsOn, and ARRTime.

Figure 1: Converting FlightDate to Date Format

#Converting FlightDate to date format:
Top10_Busiest$FlightDate <- as.Date(Top10_Busiest$FlightDate, format = "%m/%d/%Y")
kable(head(Top10_Busiest %>% select(1:5)))
FlightDate Carrier TailNum FlightNum Origin
2016-01-01 AA N029AA 139 ATL
2016-01-02 AA N005AA 139 ATL
2016-01-03 AA N014AA 139 ATL
2016-01-04 AA N018AA 139 ATL
2016-01-05 AA N030AA 139 ATL
2016-01-12 AA N007AA 139 ATL

Figure 2: Converting CRSDepTime to Date-Time Format

#Converting CRSDepTime to date-time format:
#Note: Cancelled flights are not filtered out for scheduled departure (CRSDepTime) because this was not a missing variable for the cancelled flights.

Top10_Busiest <- Top10_Busiest %>% mutate(new_CRSDepTime = paste(FlightDate, sprintf("%04d", CRSDepTime))) 
Top10_Busiest$new_CRSDepTime <- as.POSIXct(Top10_Busiest$new_CRSDepTime, format = "%Y-%m-%d %H%M")
kable(head(Top10_Busiest %>% select(new_CRSDepTime, CRSDepTime)))
new_CRSDepTime CRSDepTime
2016-01-01 08:10:00 810
2016-01-02 08:10:00 810
2016-01-03 08:10:00 810
2016-01-04 08:10:00 810
2016-01-05 08:10:00 810
2016-01-12 08:10:00 810

Figure 3: Converting Variables from Lecture 6 Notes

#Converting CRSArrTime, DepTime, WheelsOff, WheelsOn, ArrTime to date-time format:
#Note: Cancelled flights are not filtered out for arrival time (ArrTime) because this was not a missing variable for the cancelled flights.
Top10_Busiest <- Top10_Busiest %>% mutate(new_CRSArrTime = paste(FlightDate, sprintf("%04d", CRSArrTime)))
Top10_Busiest$new_CRSArrTime <- as.POSIXct(Top10_Busiest$new_CRSArrTime, format = "%Y -%m - %d %H%M")
Top10_Busiest <- Top10_Busiest %>% filter(Cancelled ==0) %>% mutate(new_DepTime = paste(FlightDate, sprintf("%04d", DepTime)),
                                                        new_WheelsOff=paste(FlightDate, sprintf("%04d", WheelsOff)),
                                                        new_WheelsOn = paste(FlightDate, sprintf("%04d", WheelsOn)), 
                                                        new_ArrTime = paste(FlightDate, sprintf("%04d", ArrTime)))
Top10_Busiest$new_DepTime <- as.POSIXct(Top10_Busiest$new_DepTime, format="%Y-%m-%d %H%M")
Top10_Busiest$new_WheelsOff <- as.POSIXct(Top10_Busiest$new_WheelsOff, format = "%Y-%m-%d %H%M")
Top10_Busiest$new_WheelsOn <- as.POSIXct(Top10_Busiest$new_WheelsOn, format = "%Y-%m-%d %H%M")
Top10_Busiest$new_ArrTime <- as.POSIXct(Top10_Busiest$new_ArrTime, format = "%Y-%m-%d %H%M")
kable(head(Top10_Busiest %>% select(FlightDate, new_CRSDepTime, new_CRSArrTime, new_DepTime, new_WheelsOff, new_WheelsOn, new_ArrTime)))
FlightDate new_CRSDepTime new_CRSArrTime new_DepTime new_WheelsOff new_WheelsOn new_ArrTime
2016-01-01 2016-01-01 08:10:00 2016-01-01 10:16:00 2016-01-01 08:08:00 2016-01-01 08:20:00 2016-01-01 09:44:00 2016-01-01 10:01:00
2016-01-02 2016-01-02 08:10:00 2016-01-02 10:16:00 2016-01-02 08:12:00 2016-01-02 08:27:00 2016-01-02 09:52:00 2016-01-02 10:02:00
2016-01-03 2016-01-03 08:10:00 2016-01-03 10:16:00 2016-01-03 08:07:00 2016-01-03 08:19:00 2016-01-03 10:00:00 2016-01-03 10:06:00
2016-01-04 2016-01-04 08:10:00 2016-01-04 10:16:00 2016-01-04 08:25:00 2016-01-04 08:38:00 2016-01-04 10:04:00 2016-01-04 10:10:00
2016-01-05 2016-01-05 08:10:00 2016-01-05 10:10:00 2016-01-05 08:05:00 2016-01-05 08:19:00 2016-01-05 09:52:00 2016-01-05 10:33:00
2016-01-12 2016-01-12 08:10:00 2016-01-12 10:10:00 2016-01-12 08:06:00 2016-01-12 08:28:00 2016-01-12 09:59:00 2016-01-12 10:35:00

Next I will include the remaining calculations from the Unit 6 Lecture Notes:

Variables Extended List

I will be calculating the following:
DepDelay = new_DepTime - new_CRSDepTime
DepDelayMinutes = ifelse(DepDelay < 0, 0, DepDelay)
DepDel30 = ifelse(DepDelay >= 30, 1, 0)
TaxiOut = new_WheelsOff - new_DepTime
TaxiIn = new_ArrTime - new_WheelsOn
ArrDelay = new_ArrTime - new_CRSArrTime
ArrDelayMinutes = ifelse(ArrDelay < 0, 0, ArrDelay)
ArrDel15 = ifelse(ArrDelay >= 15, 1, 0)
FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime
AirTime = ActualElapsedTime - TaxiOut - TaxiIn
AirSpeed = Distance / (Airtime / 60)

Figure 4: Using DiffTime() to Work with Date/Time Objects

#Using difftime() to subtract two date/time objects, in this case the actual departure time minue the scheduled desparture time and provide results in minutes:

Top10_Busiest <- Top10_Busiest %>% filter(Cancelled ==0) %>% mutate(DepDelay = as.integer(difftime(new_DepTime, new_CRSDepTime, units ="mins")))
pander(head(Top10_Busiest %>% select(CRSDepTime, DepTime, DepDelay)))
CRSDepTime DepTime DepDelay
810 808 -2
810 812 2
810 807 -3
810 825 15
810 805 -5
810 806 -4

Figure 5: Using ifelse() to Show Departure Delays >=30 Minutes

#Using ifelse() to show the departure delays that were greater than or equal to 30 minutes (note: this is different than the Lecture Notes which calculated 15 minutes) 


Top10_Busiest <- Top10_Busiest %>% filter(Cancelled ==0) %>% mutate(DepDelayMinutes = ifelse(DepDelay <0, 0, DepDelay), 
      DepDel30 = ifelse(DepDelay >=30, 1, 0))
pander(head(Top10_Busiest %>% select(DepDelay, DepDelayMinutes, DepDel30)))
DepDelay DepDelayMinutes DepDel30
-2 0 0
2 2 0
-3 0 0
15 15 0
-5 0 0
-4 0 0

Figure 6: Calculating the Remaining Variables from Variables Extended List

#Calculating the remaining variables from the "Variables Extended List"
Top10_Busiest <- Top10_Busiest %>% filter(Cancelled ==0) %>% mutate(TaxiOut = as.integer(difftime(new_WheelsOff, new_DepTime, units = "mins")), 
                                   TaxiIn = as.integer(difftime(new_ArrTime, new_WheelsOn, units = "mins")),
                                  ArrDelay = as.integer(difftime(new_ArrTime, new_CRSArrTime, units = "mins")),                                        ArrDelayMinutes = ifelse(ArrDelay < 0, 0, ArrDelay),  
                                  ArrDel15 = ifelse(ArrDelay >=15, 1, 0), 
                                  FlightTimeBuffer = CRSElapsedTime - ActualElapsedTime)
Top10_Busiest <- Top10_Busiest %>% filter(Cancelled ==0) %>% mutate(AirTime = ActualElapsedTime - TaxiIn - TaxiOut)
Top10_Busiest <- Top10_Busiest %>% filter(Cancelled == 0) %>% mutate(AirSpeed = Distance / (AirTime / 60))

kable(head(Top10_Busiest) %>% select(DepDelay, DepDelayMinutes, DepDel30, TaxiOut, TaxiIn, ArrDelay, ArrDelayMinutes, ArrDel15, FlightTimeBuffer, AirTime, AirSpeed), digits=0)
DepDelay DepDelayMinutes DepDel30 TaxiOut TaxiIn ArrDelay ArrDelayMinutes ArrDel15 FlightTimeBuffer AirTime AirSpeed
-2 0 0 12 17 -15 0 0 13 84 424
2 2 0 15 10 -14 0 0 16 85 419
-3 0 0 12 6 -10 0 0 7 101 353
15 15 0 13 6 -6 0 0 21 86 414
-5 0 0 14 41 23 23 1 -28 93 383
-4 0 0 22 36 25 25 1 -29 91 392

Figure 7: Departure Delays and Identifying Next Day Arrivals

Top10_Busiest <- Top10_Busiest %>% 
  mutate(NextDayArrival = ifelse(new_DepTime > new_ArrTime, 1, 0)) %>% 
  mutate(ArrDate = ifelse(NextDayArrival == 1, FlightDate + 1, FlightDate))
Top10_Busiest$ArrDate <- as.Date(Top10_Busiest$ArrDate, Origin = "1970-01-01") 
Top10_Busiest2 <- Top10_Busiest %>%  select(FlightDate, Origin, DepTime, DepDelay, ArrTime, NextDayArrival)
(pander(head(Top10_Busiest2)))
FlightDate Origin DepTime DepDelay ArrTime NextDayArrival
2016-01-01 ATL 808 -2 1001 0
2016-01-02 ATL 812 2 1002 0
2016-01-03 ATL 807 -3 1006 0
2016-01-04 ATL 825 15 1010 0
2016-01-05 ATL 805 -5 1033 0
2016-01-12 ATL 806 -4 1035 0

Looking at Figures 8 and 9, we can see that 8,629 flights experienced next day arrivals. Out of those flights experiencing next day arrivals, LAX (Los Angeles International Airport) experienced the most with 1,850 flights having next day arrivals.

Figure 8: Maximum and Mean Departure Delays for Flights Experiecing Next Day Arrivals by Origin

pander(Top10_Busiest %>% group_by(Origin) %>% filter(NextDayArrival == 1) %>% summarise(max(DepDelayMinutes), mean(DepDelayMinutes),n())) 
Origin max(DepDelayMinutes) mean(DepDelayMinutes) n()
ATL 861 17.28 1459
DEN 1438 37.48 610
DFW 277 31.94 411
IAH 280 38.98 277
JFK 334 38.21 637
LAS 909 23.57 1131
LAX 611 19.4 1850
ORD 1437 71.13 486
PHX 307 19.43 645
SFO 1435 31.94 1123

Figure 9: Number of Flights Experiencing Next Day Arrival for Top 10 Busiest Airtports in January 2016.

Top10_Busiest %>% filter(NextDayArrival == 1) %>% summarise(n())
##    n()
## 1 8629

Top 10 Busiest Airports in the U.S. Analysis

All data shown is for the top 10 busiest airports in the U.S.for January 2016.

First, I wanted to check how many flights had any departure delays and how many flights had departure delays greater than or equal to 30 minutes in our subset.

Figure 10: Number of Flights Experiencing Departure Delays

Top10_Busiest %>% filter(DepDelay>1) %>% summarise(count=n())
##   count
## 1 51735

Figure 11: Number of Flights Experiencing Departure Delays >= 30 Minutes

Top10_Busiest %>% filter(DepDel30 ==1) %>% summarise(count = n())
##   count
## 1 16235
DepDelay_30 <- Top10_Busiest %>% count(DepDelay>=30)
DepDelay_30

We can see that 51,735 flights of the 154,169 total flights had departure delays that exceeded 1 minute, which is about 33.5% of the flights. Additionally, we can see that 16,235 flights had departure delays greater than or equal to 30 minutes, which is about 10.5% of all the flights for the top 10 busiest airports in the U.S.

per_delayed <- (51735/154169)
per_delayed
## [1] 0.3355733
perc_delayed30 <- (16235/154169)
perc_delayed30
## [1] 0.1053065

Figure 12: Number of Flights Per Origin Location

FlightsBYOrigin <- Top10_Busiest %>% group_by(Origin) %>% count(n()) %>% arrange(desc(n))
pander(FlightsBYOrigin)
Origin n() n
ATL 29456 29456
ORD 18251 18251
DEN 17317 17317
DFW 16341 16341
LAX 16128 16128
PHX 12864 12864
SFO 12788 12788
LAS 11984 11984
IAH 11495 11495
JFK 7545 7545

Figure 13: Number of Flights Experiencing Departure Delays by Origin

Dep_Delay_Origin <- Top10_Busiest %>% group_by(Origin) %>% filter(DepDelay >0) %>%  summarise(mean(DepDelay),n())
pander(Dep_Delay_Origin)
Origin mean(DepDelay) n()
ATL 24.2 10266
DEN 26.08 5877
DFW 27.28 5155
IAH 27.3 3674
JFK 32.98 3039
LAS 30.25 4457
LAX 29.7 6309
ORD 33.22 7898
PHX 23.77 4472
SFO 40.31 5130

When looking at the origin airports with departure delays, it is clear that ATL (Hartsfield-Jackson Atlanta International Airport, GA) had the most depature delays with 10,266 flights experiencing some type of delay in January 2016. While SFO (San Francisco International, CA) experienced the highest average departure delay time at 40 minutes.

Figure 14: Number of Flights Experiencing Departure Delays >= 30 Minutes by Origin

Dep_Delay_Origin <- Top10_Busiest %>% group_by(Origin) %>% filter(DepDelay >=30) %>% summarise(mean(DepDelay),n())
pander(Dep_Delay_Origin)
Origin mean(DepDelay) n()
ATL 71.09 2431
DEN 77.43 1449
DFW 71.47 1464
IAH 77.02 977
JFK 78.7 1006
LAS 81.22 1288
LAX 75.67 1877
ORD 76.89 2701
PHX 70.81 1039
SFO 87.34 2003

When looking at the departure delays that were greater than or equal to 30 minutes, ORD (O’Hare International Airport, IL) experienced the highest number of flights with departure delays greater than or equal to 30 minutes, at 2,701 flights. While once again, SFO experienced the highest average departure time over 30 minutes, at 87 minutes.

Figure 15: Percentage of Flights with Departure Delays by Origin

#Percentage of Flights with Departure Delays by Origin

FDD <- Top10_Busiest %>% group_by(Origin) %>% mutate(delayed = ifelse(DepDelay >0,1,0)) %>% summarize(p_delayed = sum(delayed) / n())

FDD %>% ggvis(~Origin, ~p_delayed) %>% layer_bars(fill :="steelblue") %>% add_axis("x", title = "Origin Airport") %>% 
add_axis("y", title = "Percentage of Flights with Departure Delays")

Figure 16: Percentage of Flights with Departure Delays Over 30 Minutes by Origin

#Percentage of Flights with Departure Delays Over 30 Minutes by Origin


FDD_30 <- Top10_Busiest %>% group_by(Origin) %>% mutate(delayed30 = ifelse(DepDelay >=30, 1, 0)) %>% summarise(p_delayed = sum(delayed30)/n())

FDD_30 %>% ggvis(~Origin, ~p_delayed) %>% layer_bars(fill:="steelblue") %>% add_axis("x", title = "Origin Airport") %>% 
add_axis("y", title = "Percentage of Flights with Departure Delays")

From looking at Figure 15 and 16, we can see that ORD (O’Hare International Airport, IL) had the hightest percentage of flights with departure delays of any type. While SFO (San Francisco International, CA) has the highest percentage of flights experiencing departure delays greater than or equal to 30 minutes. This means we can consider SFO to be the “worst” of the top 10 busiest airports in the U.S. because it experiences the highest percentage of departure delays greater than or equal to 30 minutes. However, this percentage is still under 16% which does not seem to be a very alarming rate. Nevertheless, for this assignement it is now deemed the “worst.”

Analysis for the “Worst” Airport in the U.S. - SFO

It would be interesting to also look at what day of the week SFO experienced the highest departure delays in January 2016. There were 12,788 flights originating from SFO in January 2016. Additionally, 5,130 of these flights experienced some type of depature delay.

SFO_Data <- Top10_Busiest %>% filter(Origin == "SFO")
summarise(SFO_Data, n())
##     n()
## 1 12788
SFO_Data %>% filter(DepDelay >0) %>% summarise(n())
##    n()
## 1 5130
SFO_Delays <- SFO_Data %>%  filter(DepDelay >0)

Figure 17: Number of Flights with Departure Delays for each Day of the Week (for SFO)

SFO_Data %>% group_by(day = wday(FlightDate, label = TRUE)) %>% mutate(delayed = ifelse(DepDelay >0, 1, 0)) %>% ggvis(~day, ~delayed) %>% layer_bars(fill:="steelblue") %>% add_axis("x", title = "Day of the Week") %>% 
add_axis("y", title = "Number of Flights with Departure Delays")
Mean_SFO_Data <- SFO_Delays %>% group_by(day = wday(FlightDate, label = TRUE)) %>% summarise(max(DepDelay), mean(DepDelay), n())
pander(Mean_SFO_Data)
day max(DepDelay) mean(DepDelay) n()
Sun 575 33.33 829
Mon 1435 41.09 760
Tues 1432 49.63 724
Wed 463 48.37 715
Thurs 402 27.13 567
Fri 1435 42.31 970
Sat 1435 37.16 565

For SFO, Friday experienced the highest number of flights with departure delays at 970 flights and an average departure delay of almost 43 minutes. While Tuesday averaged the highest departure delay time with an average of almost 50 minutes.

Conclusion

In conclusion, of the top 10 busiest airports in the U.S., SFO is considered the “worst” when it comes to departure delay times. Additionally, departing from SFO on Friday or Tuesday may be problematic if you are in a rush, since these are the days experiencing the largest number of departure delays and longest average departure delay time, respectively.

While you wouldn’t think SFO would be directly impacted by winter weather conditions, this airport may be indirectly impacted by flights coming from origins with winter conditions (this is something to look at with the entire dataset, possibly segmented by different region: South, West, MidWest, NorthEast). One thing I took from this analysis was, if you’re planning to travel out of the top 10 busiest airports in the U.S. (during January), a good book to pass the time would be helpful, since there is about a 33% chance you will experience a departure delay. To be on the safe side, a good book may be appropriate when traveling in general.

Additional Analysis That Would Be Interesting:

This datasets had a number of interesting relationships that could be explored. One of which could be regions and their implications for different departure times. Another would be to segment based on if an airport offers red-eye flights and how this impacts departure times.