We are interested in the relationship between flight traffic and the frequency and length of flight delays. To this end, we looked at the relationship between the average number of scheduled flights for hour intervals at US airports and the average frequency and length of delays during those intervals. While we did observe a temporal pattern of delay frequency, we did not observe any strong correlation between air traffic and the length or frequency of delays. This is somewhat surprising since we expected that as air-traffic increase so would the proportion of delays. This analysis was done using data on all US domestic flights during the month of January 2016.
Load packages:
library(dplyr)
library(lubridate)
library(ggvis)
Load data:
URL <- 'http://media.usm.maine.edu/~suleiman/mba676/domestic_flights_jan_2016.csv'
jan.flights <- read.csv(url(URL), stringsAsFactors = FALSE)
rm(URL)
Take a look at the data and convert to tibble:
dim(jan.flights)
## [1] 445827 21
head(jan.flights)
## FlightDate Carrier TailNum FlightNum Origin OriginCityName
## 1 1/6/2016 AA N4YBAA 43 DFW Dallas/Fort Worth, TX
## 2 1/7/2016 AA N434AA 43 DFW Dallas/Fort Worth, TX
## 3 1/8/2016 AA N541AA 43 DFW Dallas/Fort Worth, TX
## 4 1/9/2016 AA N489AA 43 DFW Dallas/Fort Worth, TX
## 5 1/10/2016 AA N439AA 43 DFW Dallas/Fort Worth, TX
## 6 1/11/2016 AA N468AA 43 DFW Dallas/Fort Worth, TX
## OriginState Dest DestCityName DestState CRSDepTime DepTime WheelsOff
## 1 TX DTW Detroit, MI MI 1100 1057 1112
## 2 TX DTW Detroit, MI MI 1100 1056 1110
## 3 TX DTW Detroit, MI MI 1100 1055 1116
## 4 TX DTW Detroit, MI MI 1100 1102 1115
## 5 TX DTW Detroit, MI MI 1100 1240 1300
## 6 TX DTW Detroit, MI MI 1100 1107 1118
## WheelsOn CRSArrTime ArrTime Cancelled Diverted CRSElapsedTime
## 1 1424 1438 1432 0 0 158
## 2 1416 1438 1426 0 0 158
## 3 1431 1438 1445 0 0 158
## 4 1424 1438 1433 0 0 158
## 5 1617 1438 1631 0 0 158
## 6 1426 1438 1435 0 0 158
## ActualElapsedTime Distance
## 1 155 986
## 2 150 986
## 3 170 986
## 4 151 986
## 5 171 986
## 6 148 986
tail(jan.flights)
## FlightDate Carrier TailNum FlightNum Origin
## 445822 1/3/2016 F9 N934FR 576 SLC
## 445823 1/3/2016 F9 N934FR 579 DEN
## 445824 1/3/2016 F9 N934FR 720 DEN
## 445825 1/3/2016 F9 N935FR 444 DEN
## 445826 1/3/2016 F9 N935FR 445 FSD
## 445827 1/3/2016 F9 N935FR 693 CID
## OriginCityName OriginState Dest DestCityName
## 445822 Salt Lake City, UT UT DEN Denver, CO
## 445823 Denver, CO CO SLC Salt Lake City, UT
## 445824 Denver, CO CO DCA Washington, DC
## 445825 Denver, CO CO FSD Sioux Falls, SD
## 445826 Sioux Falls, SD SD DEN Denver, CO
## 445827 Cedar Rapids/Iowa City, IA IA DEN Denver, CO
## DestState CRSDepTime DepTime WheelsOff WheelsOn CRSArrTime ArrTime
## 445822 CO 1449 1450 1500 1600 1609 1616
## 445823 UT 1245 1240 1249 1353 1409 1404
## 445824 VA 1649 1728 1745 2233 2159 2241
## 445825 SD 1020 1038 1050 1255 1250 1301
## 445826 CO 1335 1358 1408 1417 1426 1429
## 445827 CO 1909 1900 1920 1953 2029 2000
## Cancelled Diverted CRSElapsedTime ActualElapsedTime Distance
## 445822 0 0 80 86 391
## 445823 0 0 84 84 391
## 445824 0 0 190 193 1476
## 445825 0 0 90 83 483
## 445826 0 0 111 91 483
## 445827 0 0 140 120 692
str(jan.flights)
## 'data.frame': 445827 obs. of 21 variables:
## $ FlightDate : chr "1/6/2016" "1/7/2016" "1/8/2016" "1/9/2016" ...
## $ Carrier : chr "AA" "AA" "AA" "AA" ...
## $ TailNum : chr "N4YBAA" "N434AA" "N541AA" "N489AA" ...
## $ FlightNum : int 43 43 43 43 43 43 43 43 43 43 ...
## $ Origin : chr "DFW" "DFW" "DFW" "DFW" ...
## $ OriginCityName : chr "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" ...
## $ OriginState : chr "TX" "TX" "TX" "TX" ...
## $ Dest : chr "DTW" "DTW" "DTW" "DTW" ...
## $ DestCityName : chr "Detroit, MI" "Detroit, MI" "Detroit, MI" "Detroit, MI" ...
## $ DestState : chr "MI" "MI" "MI" "MI" ...
## $ CRSDepTime : int 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 ...
## $ DepTime : int 1057 1056 1055 1102 1240 1107 1059 1055 1058 1056 ...
## $ WheelsOff : int 1112 1110 1116 1115 1300 1118 1113 1107 1110 1110 ...
## $ WheelsOn : int 1424 1416 1431 1424 1617 1426 1429 1419 1420 1423 ...
## $ CRSArrTime : int 1438 1438 1438 1438 1438 1438 1438 1438 1438 1438 ...
## $ ArrTime : int 1432 1426 1445 1433 1631 1435 1438 1431 1428 1434 ...
## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CRSElapsedTime : int 158 158 158 158 158 158 158 158 158 158 ...
## $ ActualElapsedTime: int 155 150 170 151 171 148 159 156 150 158 ...
## $ Distance : int 986 986 986 986 986 986 986 986 986 986 ...
summary(jan.flights)
## FlightDate Carrier TailNum FlightNum
## Length:445827 Length:445827 Length:445827 Min. : 1
## Class :character Class :character Class :character 1st Qu.: 702
## Mode :character Mode :character Mode :character Median :1594
## Mean :2079
## 3rd Qu.:2763
## Max. :7438
##
## Origin OriginCityName OriginState
## Length:445827 Length:445827 Length:445827
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Dest DestCityName DestState CRSDepTime
## Length:445827 Length:445827 Length:445827 Min. : 1
## Class :character Class :character Class :character 1st Qu.: 920
## Mode :character Mode :character Mode :character Median :1325
## Mean :1330
## 3rd Qu.:1730
## Max. :2359
##
## DepTime WheelsOff WheelsOn CRSArrTime
## Min. : 1 Min. : 1 Min. : 1 Min. : 1
## 1st Qu.: 924 1st Qu.: 939 1st Qu.:1104 1st Qu.:1118
## Median :1331 Median :1344 Median :1519 Median :1527
## Mean :1334 Mean :1357 Mean :1483 Mean :1503
## 3rd Qu.:1737 3rd Qu.:1750 3rd Qu.:1914 3rd Qu.:1920
## Max. :2400 Max. :2400 Max. :2400 Max. :2359
## NA's :11473 NA's :11600 NA's :11907
## ArrTime Cancelled Diverted CRSElapsedTime
## Min. : 1 Min. :0.00000 Min. :0.000000 Min. : 21.0
## 1st Qu.:1108 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.: 90.0
## Median :1522 Median :0.00000 Median :0.000000 Median :128.0
## Mean :1488 Mean :0.02616 Mean :0.001938 Mean :146.5
## 3rd Qu.:1919 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:180.0
## Max. :2400 Max. :1.00000 Max. :1.000000 Max. :705.0
## NA's :11907
## ActualElapsedTime Distance
## Min. : 15.0 Min. : 31.0
## 1st Qu.: 85.0 1st Qu.: 391.0
## Median :122.0 Median : 679.0
## Mean :140.1 Mean : 844.2
## 3rd Qu.:173.0 3rd Qu.:1086.0
## Max. :721.0 Max. :4983.0
## NA's :12529
flights <- tbl_df(jan.flights)
str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 445827 obs. of 21 variables:
## $ FlightDate : chr "1/6/2016" "1/7/2016" "1/8/2016" "1/9/2016" ...
## $ Carrier : chr "AA" "AA" "AA" "AA" ...
## $ TailNum : chr "N4YBAA" "N434AA" "N541AA" "N489AA" ...
## $ FlightNum : int 43 43 43 43 43 43 43 43 43 43 ...
## $ Origin : chr "DFW" "DFW" "DFW" "DFW" ...
## $ OriginCityName : chr "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" ...
## $ OriginState : chr "TX" "TX" "TX" "TX" ...
## $ Dest : chr "DTW" "DTW" "DTW" "DTW" ...
## $ DestCityName : chr "Detroit, MI" "Detroit, MI" "Detroit, MI" "Detroit, MI" ...
## $ DestState : chr "MI" "MI" "MI" "MI" ...
## $ CRSDepTime : int 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 ...
## $ DepTime : int 1057 1056 1055 1102 1240 1107 1059 1055 1058 1056 ...
## $ WheelsOff : int 1112 1110 1116 1115 1300 1118 1113 1107 1110 1110 ...
## $ WheelsOn : int 1424 1416 1431 1424 1617 1426 1429 1419 1420 1423 ...
## $ CRSArrTime : int 1438 1438 1438 1438 1438 1438 1438 1438 1438 1438 ...
## $ ArrTime : int 1432 1426 1445 1433 1631 1435 1438 1431 1428 1434 ...
## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CRSElapsedTime : int 158 158 158 158 158 158 158 158 158 158 ...
## $ ActualElapsedTime: int 155 150 170 151 171 148 159 156 150 158 ...
## $ Distance : int 986 986 986 986 986 986 986 986 986 986 ...
Now we’ll take a look at at the missing values:
dim(flights %>% filter(complete.cases(.)))
## [1] 433298 21
dim(flights %>% filter(!complete.cases(.)))
## [1] 12529 21
head(flights %>% filter(!complete.cases(.)))
## # A tibble: 6 Ă— 21
## FlightDate Carrier TailNum FlightNum Origin OriginCityName OriginState
## <chr> <chr> <chr> <int> <chr> <chr> <chr>
## 1 1/16/2016 AA N3CXAA 44 SEA Seattle, WA WA
## 2 1/23/2016 AA 44 SEA Seattle, WA WA
## 3 1/24/2016 AA N3DGAA 44 SEA Seattle, WA WA
## 4 1/25/2016 AA N3MNAA 44 SEA Seattle, WA WA
## 5 1/15/2016 AA N3JSAA 45 JFK New York, NY NY
## 6 1/23/2016 AA 45 JFK New York, NY NY
## # ... with 14 more variables: Dest <chr>, DestCityName <chr>,
## # DestState <chr>, CRSDepTime <int>, DepTime <int>, WheelsOff <int>,
## # WheelsOn <int>, CRSArrTime <int>, ArrTime <int>, Cancelled <int>,
## # Diverted <int>, CRSElapsedTime <int>, ActualElapsedTime <int>,
## # Distance <int>
tail(flights %>% filter(!complete.cases(.)))
## # A tibble: 6 Ă— 21
## FlightDate Carrier TailNum FlightNum Origin OriginCityName OriginState
## <chr> <chr> <chr> <int> <chr> <chr> <chr>
## 1 1/25/2016 EV N13975 6136 EWR Newark, NJ NJ
## 2 1/24/2016 EV N14920 6136 EWR Newark, NJ NJ
## 3 1/26/2016 EV N14998 6033 EWR Newark, NJ NJ
## 4 1/25/2016 EV N14570 6033 EWR Newark, NJ NJ
## 5 1/23/2016 EV N15574 6033 EWR Newark, NJ NJ
## 6 1/12/2016 EV N13979 6033 EWR Newark, NJ NJ
## # ... with 14 more variables: Dest <chr>, DestCityName <chr>,
## # DestState <chr>, CRSDepTime <int>, DepTime <int>, WheelsOff <int>,
## # WheelsOn <int>, CRSArrTime <int>, ArrTime <int>, Cancelled <int>,
## # Diverted <int>, CRSElapsedTime <int>, ActualElapsedTime <int>,
## # Distance <int>
str(flights %>% filter(!complete.cases(.)))
## Classes 'tbl_df', 'tbl' and 'data.frame': 12529 obs. of 21 variables:
## $ FlightDate : chr "1/16/2016" "1/23/2016" "1/24/2016" "1/25/2016" ...
## $ Carrier : chr "AA" "AA" "AA" "AA" ...
## $ TailNum : chr "N3CXAA" "" "N3DGAA" "N3MNAA" ...
## $ FlightNum : int 44 44 44 44 45 45 45 45 50 52 ...
## $ Origin : chr "SEA" "SEA" "SEA" "SEA" ...
## $ OriginCityName : chr "Seattle, WA" "Seattle, WA" "Seattle, WA" "Seattle, WA" ...
## $ OriginState : chr "WA" "WA" "WA" "WA" ...
## $ Dest : chr "JFK" "JFK" "JFK" "JFK" ...
## $ DestCityName : chr "New York, NY" "New York, NY" "New York, NY" "New York, NY" ...
## $ DestState : chr "NY" "NY" "NY" "NY" ...
## $ CRSDepTime : int 640 640 645 645 1830 1830 1830 1820 1410 1315 ...
## $ DepTime : int NA NA NA NA NA NA NA NA NA NA ...
## $ WheelsOff : int NA NA NA NA NA NA NA NA NA NA ...
## $ WheelsOn : int NA NA NA NA NA NA NA NA NA NA ...
## $ CRSArrTime : int 1501 1501 1506 1506 2152 2152 2152 2142 1604 2110 ...
## $ ArrTime : int NA NA NA NA NA NA NA NA NA NA ...
## $ Cancelled : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CRSElapsedTime : int 321 321 321 321 382 382 382 382 114 295 ...
## $ ActualElapsedTime: int NA NA NA NA NA NA NA NA NA NA ...
## $ Distance : int 2422 2422 2422 2422 2422 2422 2422 2422 550 2311 ...
str(flights %>% filter(complete.cases(.)))
## Classes 'tbl_df', 'tbl' and 'data.frame': 433298 obs. of 21 variables:
## $ FlightDate : chr "1/6/2016" "1/7/2016" "1/8/2016" "1/9/2016" ...
## $ Carrier : chr "AA" "AA" "AA" "AA" ...
## $ TailNum : chr "N4YBAA" "N434AA" "N541AA" "N489AA" ...
## $ FlightNum : int 43 43 43 43 43 43 43 43 43 43 ...
## $ Origin : chr "DFW" "DFW" "DFW" "DFW" ...
## $ OriginCityName : chr "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" ...
## $ OriginState : chr "TX" "TX" "TX" "TX" ...
## $ Dest : chr "DTW" "DTW" "DTW" "DTW" ...
## $ DestCityName : chr "Detroit, MI" "Detroit, MI" "Detroit, MI" "Detroit, MI" ...
## $ DestState : chr "MI" "MI" "MI" "MI" ...
## $ CRSDepTime : int 1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 ...
## $ DepTime : int 1057 1056 1055 1102 1240 1107 1059 1055 1058 1056 ...
## $ WheelsOff : int 1112 1110 1116 1115 1300 1118 1113 1107 1110 1110 ...
## $ WheelsOn : int 1424 1416 1431 1424 1617 1426 1429 1419 1420 1423 ...
## $ CRSArrTime : int 1438 1438 1438 1438 1438 1438 1438 1438 1438 1438 ...
## $ ArrTime : int 1432 1426 1445 1433 1631 1435 1438 1431 1428 1434 ...
## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CRSElapsedTime : int 158 158 158 158 158 158 158 158 158 158 ...
## $ ActualElapsedTime: int 155 150 170 151 171 148 159 156 150 158 ...
## $ Distance : int 986 986 986 986 986 986 986 986 986 986 ...
It looks like nearly all the flights with missing values are cancelled flights. Let’s take a look at flights with missing values that were not cancelled:
head(flights %>% filter(!complete.cases(.), Cancelled == 0))
## # A tibble: 6 Ă— 21
## FlightDate Carrier TailNum FlightNum Origin OriginCityName
## <chr> <chr> <chr> <int> <chr> <chr>
## 1 1/15/2016 AA N3ALAA 56 DEN Denver, CO
## 2 1/15/2016 AA N3GUAA 208 SFO San Francisco, CA
## 3 1/10/2016 AA N3BVAA 210 LAS Las Vegas, NV
## 4 1/15/2016 AA N3HFAA 217 LAS Las Vegas, NV
## 5 1/10/2016 AA N796AA 34 LAX Los Angeles, CA
## 6 1/22/2016 AA N480AA 248 DFW Dallas/Fort Worth, TX
## # ... with 15 more variables: OriginState <chr>, Dest <chr>,
## # DestCityName <chr>, DestState <chr>, CRSDepTime <int>, DepTime <int>,
## # WheelsOff <int>, WheelsOn <int>, CRSArrTime <int>, ArrTime <int>,
## # Cancelled <int>, Diverted <int>, CRSElapsedTime <int>,
## # ActualElapsedTime <int>, Distance <int>
tail(flights %>% filter(!complete.cases(.), Cancelled == 0))
## # A tibble: 6 Ă— 21
## FlightDate Carrier TailNum FlightNum Origin OriginCityName
## <chr> <chr> <chr> <int> <chr> <chr>
## 1 1/21/2016 EV N11150 5921 ORD Chicago, IL
## 2 1/10/2016 EV N14558 5961 SCE State College, PA
## 3 1/8/2016 EV N25134 6158 JAX Jacksonville, FL
## 4 1/8/2016 EV N11127 6182 SDF Louisville, KY
## 5 1/7/2016 EV N11194 6089 ORD Chicago, IL
## 6 1/14/2016 EV N14105 6104 IAH Houston, TX
## # ... with 15 more variables: OriginState <chr>, Dest <chr>,
## # DestCityName <chr>, DestState <chr>, CRSDepTime <int>, DepTime <int>,
## # WheelsOff <int>, WheelsOn <int>, CRSArrTime <int>, ArrTime <int>,
## # Cancelled <int>, Diverted <int>, CRSElapsedTime <int>,
## # ActualElapsedTime <int>, Distance <int>
summary(flights %>% filter(!complete.cases(.), Cancelled == 0))
## FlightDate Carrier TailNum FlightNum
## Length:864 Length:864 Length:864 Min. : 4.0
## Class :character Class :character Class :character 1st Qu.: 798.5
## Mode :character Mode :character Mode :character Median :1790.5
## Mean :2534.7
## 3rd Qu.:4587.5
## Max. :7437.0
##
## Origin OriginCityName OriginState
## Length:864 Length:864 Length:864
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Dest DestCityName DestState CRSDepTime
## Length:864 Length:864 Length:864 Min. : 15
## Class :character Class :character Class :character 1st Qu.: 829
## Mode :character Mode :character Mode :character Median :1155
## Mean :1249
## 3rd Qu.:1626
## Max. :2359
##
## DepTime WheelsOff WheelsOn CRSArrTime
## Min. : 2 Min. : 8 Min. : 2 Min. : 13
## 1st Qu.: 825 1st Qu.: 839 1st Qu.:1202 1st Qu.:1028
## Median :1210 Median :1228 Median :1528 Median :1410
## Mean :1254 Mean :1278 Mean :1511 Mean :1424
## 3rd Qu.:1641 3rd Qu.:1654 3rd Qu.:1918 3rd Qu.:1816
## Max. :2356 Max. :2355 Max. :2358 Max. :2356
## NA's :242
## ArrTime Cancelled Diverted CRSElapsedTime ActualElapsedTime
## Min. : 2 Min. :0 Min. :1 Min. : 30.0 Min. : NA
## 1st Qu.:1210 1st Qu.:0 1st Qu.:1 1st Qu.: 94.0 1st Qu.: NA
## Median :1533 Median :0 Median :1 Median :139.0 Median : NA
## Mean :1517 Mean :0 Mean :1 Mean :159.4 Mean :NaN
## 3rd Qu.:1924 3rd Qu.:0 3rd Qu.:1 3rd Qu.:195.0 3rd Qu.: NA
## Max. :2357 Max. :0 Max. :1 Max. :527.0 Max. : NA
## NA's :242 NA's :864
## Distance
## Min. : 82.0
## 1st Qu.: 416.8
## Median : 767.0
## Mean : 947.6
## 3rd Qu.:1173.0
## Max. :3784.0
##
So all the missing value flights were either cancelled or diverted.
We need to discern how dates-times are dealt with. Are all dates-times given in terms of one time zone? Are date-times given in terms of the origin airport? Are date-times given in terms of the airport of event (e.g. departure times given in terms of origin airport time zone and arrival times given in terms of destination airport time zone). To answer these questions let’s look at American Airlines (“AA”) flights between JFK (east coast) and LAX (west coast) on the second of January. LAX is three hours behind New York. If departure times are given in terms of origin airport and arrivals in terms of destination airports than the flight speed of the plane should appear impossibly fast.
EastWest.Flights <- flights %>% filter(Origin == 'JFK',
Dest == 'LAX',
Carrier == 'AA',
FlightDate == '1/2/2016')
dim(EastWest.Flights)
## [1] 12 21
slice(EastWest.Flights, 1)
## # A tibble: 1 Ă— 21
## FlightDate Carrier TailNum FlightNum Origin OriginCityName OriginState
## <chr> <chr> <chr> <int> <chr> <chr> <chr>
## 1 1/2/2016 AA N797AA 1 JFK New York, NY NY
## # ... with 14 more variables: Dest <chr>, DestCityName <chr>,
## # DestState <chr>, CRSDepTime <int>, DepTime <int>, WheelsOff <int>,
## # WheelsOn <int>, CRSArrTime <int>, ArrTime <int>, Cancelled <int>,
## # Diverted <int>, CRSElapsedTime <int>, ActualElapsedTime <int>,
## # Distance <int>
EastWest.Flights <- EastWest.Flights %>%
mutate(WheelsOff = paste(sprintf("%04d", WheelsOff)),
WheelsOn = paste(sprintf("%04d", WheelsOn)),
WheelsOff = as.POSIXct(WheelsOff, format="%H%M"),
WheelsOn = as.POSIXct(WheelsOn, format="%H%M"),
FlightTime = WheelsOn - WheelsOff
)
EastWest.Flights2 <- EastWest.Flights %>%
transmute(FlightTime = as.numeric(FlightTime),
FlightSpeed1 = Distance / FlightTime,
FlightSpeed2 = Distance / (FlightTime + 3))
EastWest.Flights2
## # A tibble: 12 Ă— 3
## FlightTime FlightSpeed1 FlightSpeed2
## <dbl> <dbl> <dbl>
## 1 2.733333 905.4878 431.6860
## 2 2.550000 970.5882 445.9459
## 3 2.683333 922.3602 435.4839
## 4 2.566667 964.2857 444.6108
## 5 2.533333 976.9737 447.2892
## 6 2.650000 933.9623 438.0531
## 7 2.516667 983.4437 448.6405
## 8 2.416667 1024.1379 456.9231
## 9 2.433333 1017.1233 455.5215
## 10 2.583333 958.0645 443.2836
## 11 2.666667 928.1250 436.7647
## 12 2.483333 996.6443 451.3678
rm(EastWest.Flights, EastWest.Flights2)
FlightSpeed1 in the above table reports the calculated flight speed if both WheelsOff and WheelsOn were reported in New York time, while FlightSpeed2 reports the flight speed if WheelsOff was reported in New York time and WheelsOn was reported in California time. The speeds reported in FlightSpeed1 seem unreasonably fast, whereas those reported in FlightSpeed2 seem much more credible. Thus, times are reported according to the timezone in which they occur. So, arrivals and landing times are reported in the destination airport’s time zone, and departure and lift-off times are reported in the origin airport’s time zone. We will keep this in mind if we do any calculations with flight speed (as done above).
The first thing we want to do is convert the dates and times to proper date and time objects in R. While doing this we also need to keep an eye out for flights that occur late at night, since successive events (departure and landing) may occur in different days. Also, from here on out we are going to ignore cancelled and diverted flights since they don’t have information on delays.
# First convert flight dates to Date objects
flights2 <- flights %>%
filter(complete.cases(.)) %>%
mutate(FlightDate = as.Date(FlightDate, format = '%m/%d/%Y'),
newCRSDepTime = paste(FlightDate, sprintf('%04d', CRSDepTime)),
newCRSDepTime = as.POSIXct(newCRSDepTime, format = '%Y-%m-%d %H%M')
)
# Now, before converting event times to date-time objects, let's build a
# a function that will properly deal with successive events that cross
# the day boundary
time.calc <- function(Time.Interest, Time.prev, Day, form = '%Y-%m-%d %H%M'){
require(dplyr) # Require dplr package so that if it weren't already loaded
# I would still have access to the pipe operator
if(Time.Interest < Time.prev){
if(Time.prev - Time.Interest > 300){ # Only increase day count by 1 if
Day <- Day + 1 # time difference is great. This will
} # prevent us from misdating planes that
} # departed slightly ahead of schedule.
# In this case, we will assume planes
# don't depart more than 3 hours early
Time <- paste(Day, sprintf('%04d', Time.Interest)) %>%
as.POSIXct(format = form)
}
# Now determine date-times for events.
flights2 <- flights2 %>%
rowwise() %>%
mutate(newWheelsOff = time.calc(WheelsOff, CRSDepTime, FlightDate),
newWheelsOn = time.calc(WheelsOn, CRSDepTime, FlightDate),
newArrTime = time.calc(ArrTime, CRSDepTime, FlightDate),
newDepTime = time.calc(DepTime, CRSDepTime, FlightDate),
newCRSArrTime = time.calc(CRSArrTime, CRSDepTime, FlightDate)) %>%
ungroup()
flights.complete <- flights2 %>%
mutate(CRSDepTime = newCRSDepTime,
DepTime = newDepTime,
WheelsOff = newWheelsOff,
WheelsOn = newWheelsOn,
CRSArrTime = newCRSArrTime,
ArrTime = newArrTime) %>%
select(-starts_with('new'))
Now we’re going to start taking a look at flight delays. For each flight, we’ll calculate both a departure delay, “delay.dep”, and an arrival delay, “delay.arr”. Both of these variables will be in terms of minutes delayed. I’m interested in how delays vary with time of day, so I’m also going to create a new variable “flight.hour” which gives the hour in which the flight departed (an integer between 0 and 23).
flights.complete <- flights.complete %>%
mutate(delay.dep = as.numeric(difftime(DepTime, CRSDepTime, units = 'mins')),
delay.arr = as.numeric(difftime(ArrTime, CRSArrTime, units = 'mins')),
delay.dep = ifelse(delay.dep < 0, 0, delay.dep),
delay.arr = ifelse(delay.arr < 0, 0, delay.arr),
delayed = ifelse((delay.dep > 0)|(delay.arr > 0), 1, 0),
flight.hour = hour(CRSDepTime),
flight.day = mday(CRSDepTime)
)
prop.delay <- mean(flights.complete$delayed)
From the data set we can see that out of 433298 completed flights (not cancelled or diverted), 0.4414121 were delayed to some degree. Now let’s take a look at when these delays occur and how they might be related with overall scheduling of flights.
First let’s take a look how the frequency of delays and air traffic varies by hour:
Delay.By.Hour <- flights.complete %>%
group_by(flight.hour, Origin, flight.day) %>%
summarise(prop.delay = mean(delayed),
flights = n())
Delay.By.Hour %>%
summarise(prop.delay = mean(prop.delay)) %>%
summarise(prop.delay = mean(prop.delay)) %>%
ggvis(~flight.hour, ~prop.delay) %>%
layer_bars() %>%
add_axis("x", title = "Scheduled Hour of Departure")%>%
add_axis("y", title = "Average Proportion of Flights Delayed")
Delay.By.Hour %>%
summarise(flights = mean(flights)) %>%
summarise(flights = mean(flights)) %>%
ggvis(~flight.hour, ~flights) %>%
layer_bars() %>%
add_axis("x", title = "Scheduled Hour of Departure")%>%
add_axis("y", title = "Average Number of Scheduled Flights")
From the above charts it appears that the proportion of delayed flights seems to gradually increase from 5 am through 11 pm, and then drop off slightly before peaking again in the 2 am hour. Interstingly, the 2 am peak delay period (time with largest proportion of delays) seems also to be around the time with the fewest scheduled flights (as seen from the second bar chart).
Now let’s take a look at the distribution of delay times:
flights.complete %>%
filter(delayed == 1) %>%
ggvis(~delay.arr) %>%
layer_histograms() %>%
add_axis("x", title = "Length of Delay (minutes)")
The histogram above shows how skewed the delay times are. Most delays are less than 200 minutes long, but some delays can be much, much longer than that. The maximum observed arrival delay was 1125.
Below, we take at look at the average arrival days for each hour for each origin airport, averaged across the whole month. This helps reduce the variation in our set so we can get a different perspective on the distribution of delay times.
Delay.By.Hour <- flights.complete %>%
filter(delayed == 1) %>%
group_by(flight.hour, Origin) %>%
summarise(delay.arr = mean(delay.arr),
delay.dep = mean(delay.dep))
Delay.By.Hour %>%
as.data.frame() %>%
ggvis(~delay.arr) %>%
layer_histograms(center = 15, width = 5) %>%
add_axis("x", title = "Length of Delay (minutes)")
summary(Delay.By.Hour$delay.arr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 15.00 22.67 28.55 33.53 393.00
From the above histogram it appears that for any given hour at a given airport, the average delay times for a delayed flight is generally between 10 and 30 minutes.
First let’s take a look at the relationship between traffic and frequency of delays:
Delay.By.Hour <- flights.complete %>%
group_by(flight.hour, Origin, flight.day) %>%
summarise(prop.delay = mean(delayed),
flights = n())
# Scatterplot showing the between proportion of delayed flights v.s. the
# number of scheduled flights. Each dot represents the average for one
# hour of the day at a particular airport for the whole month of January
Delay.By.Hour %>%
summarise(prop.delay = mean(prop.delay),
flights = mean(flights)) %>%
ungroup() %>%
ggvis(~flights, ~prop.delay) %>%
layer_points(fill = ~flight.hour) %>%
add_axis("x", title = "Number of Scheduled Flights")%>%
add_axis("y", title = "Proportion of Flights Delayed") %>%
layer_smooths(stroke := 'red')
The graph above doesn’t seem to indicate any strong relationship between airport traffic and flight delays. The smoothing curve indicates that there is a peak during at mid-traffic, but the effect is small.
Now let’s look at the relationship between traffic and length of delays:
Delay.By.Hour <- flights.complete %>%
group_by(flight.hour, Origin, flight.day) %>%
summarise(delay = mean(delay.arr),
flights = n())
Delay.By.Hour %>%
summarise(delay = mean(delay),
flights = mean(flights)) %>%
ungroup() %>%
ggvis(~flights, ~delay) %>%
layer_points(size = 0.5) %>%
add_axis("x", title = "Total Number of Flights in Scheduled Hour")%>%
add_axis("y", title = "Average Delay") %>%
layer_smooths(stroke := 'red')
As with frequency of delays, there doesn’t appear to be any obvious relationship between airport traffic and length of delays.
Although one of the peaks times of delay occurance (2 am) corresponds to a minimum in airport traffic, no other relationship between airport traffic and the behavior of delays was observed.