Overview

We are interested in the relationship between flight traffic and the frequency and length of flight delays. To this end, we looked at the relationship between the average number of scheduled flights for hour intervals at US airports and the average frequency and length of delays during those intervals. While we did observe a temporal pattern of delay frequency, we did not observe any strong correlation between air traffic and the length or frequency of delays. This is somewhat surprising since we expected that as air-traffic increase so would the proportion of delays. This analysis was done using data on all US domestic flights during the month of January 2016.


Data Loading and Manipulation

Becoming Familiar with the Data

Load packages:

library(dplyr)
library(lubridate)
library(ggvis)


Load data:

URL <- 'http://media.usm.maine.edu/~suleiman/mba676/domestic_flights_jan_2016.csv'

jan.flights <- read.csv(url(URL), stringsAsFactors = FALSE)
rm(URL)


Take a look at the data and convert to tibble:

dim(jan.flights)
## [1] 445827     21
head(jan.flights)
##   FlightDate Carrier TailNum FlightNum Origin        OriginCityName
## 1   1/6/2016      AA  N4YBAA        43    DFW Dallas/Fort Worth, TX
## 2   1/7/2016      AA  N434AA        43    DFW Dallas/Fort Worth, TX
## 3   1/8/2016      AA  N541AA        43    DFW Dallas/Fort Worth, TX
## 4   1/9/2016      AA  N489AA        43    DFW Dallas/Fort Worth, TX
## 5  1/10/2016      AA  N439AA        43    DFW Dallas/Fort Worth, TX
## 6  1/11/2016      AA  N468AA        43    DFW Dallas/Fort Worth, TX
##   OriginState Dest DestCityName DestState CRSDepTime DepTime WheelsOff
## 1          TX  DTW  Detroit, MI        MI       1100    1057      1112
## 2          TX  DTW  Detroit, MI        MI       1100    1056      1110
## 3          TX  DTW  Detroit, MI        MI       1100    1055      1116
## 4          TX  DTW  Detroit, MI        MI       1100    1102      1115
## 5          TX  DTW  Detroit, MI        MI       1100    1240      1300
## 6          TX  DTW  Detroit, MI        MI       1100    1107      1118
##   WheelsOn CRSArrTime ArrTime Cancelled Diverted CRSElapsedTime
## 1     1424       1438    1432         0        0            158
## 2     1416       1438    1426         0        0            158
## 3     1431       1438    1445         0        0            158
## 4     1424       1438    1433         0        0            158
## 5     1617       1438    1631         0        0            158
## 6     1426       1438    1435         0        0            158
##   ActualElapsedTime Distance
## 1               155      986
## 2               150      986
## 3               170      986
## 4               151      986
## 5               171      986
## 6               148      986
tail(jan.flights)
##        FlightDate Carrier TailNum FlightNum Origin
## 445822   1/3/2016      F9  N934FR       576    SLC
## 445823   1/3/2016      F9  N934FR       579    DEN
## 445824   1/3/2016      F9  N934FR       720    DEN
## 445825   1/3/2016      F9  N935FR       444    DEN
## 445826   1/3/2016      F9  N935FR       445    FSD
## 445827   1/3/2016      F9  N935FR       693    CID
##                    OriginCityName OriginState Dest       DestCityName
## 445822         Salt Lake City, UT          UT  DEN         Denver, CO
## 445823                 Denver, CO          CO  SLC Salt Lake City, UT
## 445824                 Denver, CO          CO  DCA     Washington, DC
## 445825                 Denver, CO          CO  FSD    Sioux Falls, SD
## 445826            Sioux Falls, SD          SD  DEN         Denver, CO
## 445827 Cedar Rapids/Iowa City, IA          IA  DEN         Denver, CO
##        DestState CRSDepTime DepTime WheelsOff WheelsOn CRSArrTime ArrTime
## 445822        CO       1449    1450      1500     1600       1609    1616
## 445823        UT       1245    1240      1249     1353       1409    1404
## 445824        VA       1649    1728      1745     2233       2159    2241
## 445825        SD       1020    1038      1050     1255       1250    1301
## 445826        CO       1335    1358      1408     1417       1426    1429
## 445827        CO       1909    1900      1920     1953       2029    2000
##        Cancelled Diverted CRSElapsedTime ActualElapsedTime Distance
## 445822         0        0             80                86      391
## 445823         0        0             84                84      391
## 445824         0        0            190               193     1476
## 445825         0        0             90                83      483
## 445826         0        0            111                91      483
## 445827         0        0            140               120      692
str(jan.flights)
## 'data.frame':    445827 obs. of  21 variables:
##  $ FlightDate       : chr  "1/6/2016" "1/7/2016" "1/8/2016" "1/9/2016" ...
##  $ Carrier          : chr  "AA" "AA" "AA" "AA" ...
##  $ TailNum          : chr  "N4YBAA" "N434AA" "N541AA" "N489AA" ...
##  $ FlightNum        : int  43 43 43 43 43 43 43 43 43 43 ...
##  $ Origin           : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ OriginCityName   : chr  "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" ...
##  $ OriginState      : chr  "TX" "TX" "TX" "TX" ...
##  $ Dest             : chr  "DTW" "DTW" "DTW" "DTW" ...
##  $ DestCityName     : chr  "Detroit, MI" "Detroit, MI" "Detroit, MI" "Detroit, MI" ...
##  $ DestState        : chr  "MI" "MI" "MI" "MI" ...
##  $ CRSDepTime       : int  1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 ...
##  $ DepTime          : int  1057 1056 1055 1102 1240 1107 1059 1055 1058 1056 ...
##  $ WheelsOff        : int  1112 1110 1116 1115 1300 1118 1113 1107 1110 1110 ...
##  $ WheelsOn         : int  1424 1416 1431 1424 1617 1426 1429 1419 1420 1423 ...
##  $ CRSArrTime       : int  1438 1438 1438 1438 1438 1438 1438 1438 1438 1438 ...
##  $ ArrTime          : int  1432 1426 1445 1433 1631 1435 1438 1431 1428 1434 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CRSElapsedTime   : int  158 158 158 158 158 158 158 158 158 158 ...
##  $ ActualElapsedTime: int  155 150 170 151 171 148 159 156 150 158 ...
##  $ Distance         : int  986 986 986 986 986 986 986 986 986 986 ...
summary(jan.flights)
##   FlightDate          Carrier            TailNum            FlightNum   
##  Length:445827      Length:445827      Length:445827      Min.   :   1  
##  Class :character   Class :character   Class :character   1st Qu.: 702  
##  Mode  :character   Mode  :character   Mode  :character   Median :1594  
##                                                           Mean   :2079  
##                                                           3rd Qu.:2763  
##                                                           Max.   :7438  
##                                                                         
##     Origin          OriginCityName     OriginState       
##  Length:445827      Length:445827      Length:445827     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      Dest           DestCityName        DestState           CRSDepTime  
##  Length:445827      Length:445827      Length:445827      Min.   :   1  
##  Class :character   Class :character   Class :character   1st Qu.: 920  
##  Mode  :character   Mode  :character   Mode  :character   Median :1325  
##                                                           Mean   :1330  
##                                                           3rd Qu.:1730  
##                                                           Max.   :2359  
##                                                                         
##     DepTime        WheelsOff        WheelsOn       CRSArrTime  
##  Min.   :   1    Min.   :   1    Min.   :   1    Min.   :   1  
##  1st Qu.: 924    1st Qu.: 939    1st Qu.:1104    1st Qu.:1118  
##  Median :1331    Median :1344    Median :1519    Median :1527  
##  Mean   :1334    Mean   :1357    Mean   :1483    Mean   :1503  
##  3rd Qu.:1737    3rd Qu.:1750    3rd Qu.:1914    3rd Qu.:1920  
##  Max.   :2400    Max.   :2400    Max.   :2400    Max.   :2359  
##  NA's   :11473   NA's   :11600   NA's   :11907                 
##     ArrTime        Cancelled          Diverted        CRSElapsedTime 
##  Min.   :   1    Min.   :0.00000   Min.   :0.000000   Min.   : 21.0  
##  1st Qu.:1108    1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.: 90.0  
##  Median :1522    Median :0.00000   Median :0.000000   Median :128.0  
##  Mean   :1488    Mean   :0.02616   Mean   :0.001938   Mean   :146.5  
##  3rd Qu.:1919    3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:180.0  
##  Max.   :2400    Max.   :1.00000   Max.   :1.000000   Max.   :705.0  
##  NA's   :11907                                                       
##  ActualElapsedTime    Distance     
##  Min.   : 15.0     Min.   :  31.0  
##  1st Qu.: 85.0     1st Qu.: 391.0  
##  Median :122.0     Median : 679.0  
##  Mean   :140.1     Mean   : 844.2  
##  3rd Qu.:173.0     3rd Qu.:1086.0  
##  Max.   :721.0     Max.   :4983.0  
##  NA's   :12529
flights <- tbl_df(jan.flights)

str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame':    445827 obs. of  21 variables:
##  $ FlightDate       : chr  "1/6/2016" "1/7/2016" "1/8/2016" "1/9/2016" ...
##  $ Carrier          : chr  "AA" "AA" "AA" "AA" ...
##  $ TailNum          : chr  "N4YBAA" "N434AA" "N541AA" "N489AA" ...
##  $ FlightNum        : int  43 43 43 43 43 43 43 43 43 43 ...
##  $ Origin           : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ OriginCityName   : chr  "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" ...
##  $ OriginState      : chr  "TX" "TX" "TX" "TX" ...
##  $ Dest             : chr  "DTW" "DTW" "DTW" "DTW" ...
##  $ DestCityName     : chr  "Detroit, MI" "Detroit, MI" "Detroit, MI" "Detroit, MI" ...
##  $ DestState        : chr  "MI" "MI" "MI" "MI" ...
##  $ CRSDepTime       : int  1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 ...
##  $ DepTime          : int  1057 1056 1055 1102 1240 1107 1059 1055 1058 1056 ...
##  $ WheelsOff        : int  1112 1110 1116 1115 1300 1118 1113 1107 1110 1110 ...
##  $ WheelsOn         : int  1424 1416 1431 1424 1617 1426 1429 1419 1420 1423 ...
##  $ CRSArrTime       : int  1438 1438 1438 1438 1438 1438 1438 1438 1438 1438 ...
##  $ ArrTime          : int  1432 1426 1445 1433 1631 1435 1438 1431 1428 1434 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CRSElapsedTime   : int  158 158 158 158 158 158 158 158 158 158 ...
##  $ ActualElapsedTime: int  155 150 170 151 171 148 159 156 150 158 ...
##  $ Distance         : int  986 986 986 986 986 986 986 986 986 986 ...


Now we’ll take a look at at the missing values:

dim(flights %>% filter(complete.cases(.)))
## [1] 433298     21
dim(flights %>% filter(!complete.cases(.)))
## [1] 12529    21
head(flights %>% filter(!complete.cases(.)))
## # A tibble: 6 Ă— 21
##   FlightDate Carrier TailNum FlightNum Origin OriginCityName OriginState
##        <chr>   <chr>   <chr>     <int>  <chr>          <chr>       <chr>
## 1  1/16/2016      AA  N3CXAA        44    SEA    Seattle, WA          WA
## 2  1/23/2016      AA                44    SEA    Seattle, WA          WA
## 3  1/24/2016      AA  N3DGAA        44    SEA    Seattle, WA          WA
## 4  1/25/2016      AA  N3MNAA        44    SEA    Seattle, WA          WA
## 5  1/15/2016      AA  N3JSAA        45    JFK   New York, NY          NY
## 6  1/23/2016      AA                45    JFK   New York, NY          NY
## # ... with 14 more variables: Dest <chr>, DestCityName <chr>,
## #   DestState <chr>, CRSDepTime <int>, DepTime <int>, WheelsOff <int>,
## #   WheelsOn <int>, CRSArrTime <int>, ArrTime <int>, Cancelled <int>,
## #   Diverted <int>, CRSElapsedTime <int>, ActualElapsedTime <int>,
## #   Distance <int>
tail(flights %>% filter(!complete.cases(.)))
## # A tibble: 6 Ă— 21
##   FlightDate Carrier TailNum FlightNum Origin OriginCityName OriginState
##        <chr>   <chr>   <chr>     <int>  <chr>          <chr>       <chr>
## 1  1/25/2016      EV  N13975      6136    EWR     Newark, NJ          NJ
## 2  1/24/2016      EV  N14920      6136    EWR     Newark, NJ          NJ
## 3  1/26/2016      EV  N14998      6033    EWR     Newark, NJ          NJ
## 4  1/25/2016      EV  N14570      6033    EWR     Newark, NJ          NJ
## 5  1/23/2016      EV  N15574      6033    EWR     Newark, NJ          NJ
## 6  1/12/2016      EV  N13979      6033    EWR     Newark, NJ          NJ
## # ... with 14 more variables: Dest <chr>, DestCityName <chr>,
## #   DestState <chr>, CRSDepTime <int>, DepTime <int>, WheelsOff <int>,
## #   WheelsOn <int>, CRSArrTime <int>, ArrTime <int>, Cancelled <int>,
## #   Diverted <int>, CRSElapsedTime <int>, ActualElapsedTime <int>,
## #   Distance <int>
str(flights %>% filter(!complete.cases(.)))
## Classes 'tbl_df', 'tbl' and 'data.frame':    12529 obs. of  21 variables:
##  $ FlightDate       : chr  "1/16/2016" "1/23/2016" "1/24/2016" "1/25/2016" ...
##  $ Carrier          : chr  "AA" "AA" "AA" "AA" ...
##  $ TailNum          : chr  "N3CXAA" "" "N3DGAA" "N3MNAA" ...
##  $ FlightNum        : int  44 44 44 44 45 45 45 45 50 52 ...
##  $ Origin           : chr  "SEA" "SEA" "SEA" "SEA" ...
##  $ OriginCityName   : chr  "Seattle, WA" "Seattle, WA" "Seattle, WA" "Seattle, WA" ...
##  $ OriginState      : chr  "WA" "WA" "WA" "WA" ...
##  $ Dest             : chr  "JFK" "JFK" "JFK" "JFK" ...
##  $ DestCityName     : chr  "New York, NY" "New York, NY" "New York, NY" "New York, NY" ...
##  $ DestState        : chr  "NY" "NY" "NY" "NY" ...
##  $ CRSDepTime       : int  640 640 645 645 1830 1830 1830 1820 1410 1315 ...
##  $ DepTime          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WheelsOff        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ WheelsOn         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ CRSArrTime       : int  1501 1501 1506 1506 2152 2152 2152 2142 1604 2110 ...
##  $ ArrTime          : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Cancelled        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CRSElapsedTime   : int  321 321 321 321 382 382 382 382 114 295 ...
##  $ ActualElapsedTime: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Distance         : int  2422 2422 2422 2422 2422 2422 2422 2422 550 2311 ...
str(flights %>% filter(complete.cases(.)))
## Classes 'tbl_df', 'tbl' and 'data.frame':    433298 obs. of  21 variables:
##  $ FlightDate       : chr  "1/6/2016" "1/7/2016" "1/8/2016" "1/9/2016" ...
##  $ Carrier          : chr  "AA" "AA" "AA" "AA" ...
##  $ TailNum          : chr  "N4YBAA" "N434AA" "N541AA" "N489AA" ...
##  $ FlightNum        : int  43 43 43 43 43 43 43 43 43 43 ...
##  $ Origin           : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ OriginCityName   : chr  "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" "Dallas/Fort Worth, TX" ...
##  $ OriginState      : chr  "TX" "TX" "TX" "TX" ...
##  $ Dest             : chr  "DTW" "DTW" "DTW" "DTW" ...
##  $ DestCityName     : chr  "Detroit, MI" "Detroit, MI" "Detroit, MI" "Detroit, MI" ...
##  $ DestState        : chr  "MI" "MI" "MI" "MI" ...
##  $ CRSDepTime       : int  1100 1100 1100 1100 1100 1100 1100 1100 1100 1100 ...
##  $ DepTime          : int  1057 1056 1055 1102 1240 1107 1059 1055 1058 1056 ...
##  $ WheelsOff        : int  1112 1110 1116 1115 1300 1118 1113 1107 1110 1110 ...
##  $ WheelsOn         : int  1424 1416 1431 1424 1617 1426 1429 1419 1420 1423 ...
##  $ CRSArrTime       : int  1438 1438 1438 1438 1438 1438 1438 1438 1438 1438 ...
##  $ ArrTime          : int  1432 1426 1445 1433 1631 1435 1438 1431 1428 1434 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CRSElapsedTime   : int  158 158 158 158 158 158 158 158 158 158 ...
##  $ ActualElapsedTime: int  155 150 170 151 171 148 159 156 150 158 ...
##  $ Distance         : int  986 986 986 986 986 986 986 986 986 986 ...



It looks like nearly all the flights with missing values are cancelled flights. Let’s take a look at flights with missing values that were not cancelled:

head(flights %>% filter(!complete.cases(.), Cancelled == 0))
## # A tibble: 6 Ă— 21
##   FlightDate Carrier TailNum FlightNum Origin        OriginCityName
##        <chr>   <chr>   <chr>     <int>  <chr>                 <chr>
## 1  1/15/2016      AA  N3ALAA        56    DEN            Denver, CO
## 2  1/15/2016      AA  N3GUAA       208    SFO     San Francisco, CA
## 3  1/10/2016      AA  N3BVAA       210    LAS         Las Vegas, NV
## 4  1/15/2016      AA  N3HFAA       217    LAS         Las Vegas, NV
## 5  1/10/2016      AA  N796AA        34    LAX       Los Angeles, CA
## 6  1/22/2016      AA  N480AA       248    DFW Dallas/Fort Worth, TX
## # ... with 15 more variables: OriginState <chr>, Dest <chr>,
## #   DestCityName <chr>, DestState <chr>, CRSDepTime <int>, DepTime <int>,
## #   WheelsOff <int>, WheelsOn <int>, CRSArrTime <int>, ArrTime <int>,
## #   Cancelled <int>, Diverted <int>, CRSElapsedTime <int>,
## #   ActualElapsedTime <int>, Distance <int>
tail(flights %>% filter(!complete.cases(.), Cancelled == 0))
## # A tibble: 6 Ă— 21
##   FlightDate Carrier TailNum FlightNum Origin    OriginCityName
##        <chr>   <chr>   <chr>     <int>  <chr>             <chr>
## 1  1/21/2016      EV  N11150      5921    ORD       Chicago, IL
## 2  1/10/2016      EV  N14558      5961    SCE State College, PA
## 3   1/8/2016      EV  N25134      6158    JAX  Jacksonville, FL
## 4   1/8/2016      EV  N11127      6182    SDF    Louisville, KY
## 5   1/7/2016      EV  N11194      6089    ORD       Chicago, IL
## 6  1/14/2016      EV  N14105      6104    IAH       Houston, TX
## # ... with 15 more variables: OriginState <chr>, Dest <chr>,
## #   DestCityName <chr>, DestState <chr>, CRSDepTime <int>, DepTime <int>,
## #   WheelsOff <int>, WheelsOn <int>, CRSArrTime <int>, ArrTime <int>,
## #   Cancelled <int>, Diverted <int>, CRSElapsedTime <int>,
## #   ActualElapsedTime <int>, Distance <int>
summary(flights %>% filter(!complete.cases(.), Cancelled == 0))
##   FlightDate          Carrier            TailNum            FlightNum     
##  Length:864         Length:864         Length:864         Min.   :   4.0  
##  Class :character   Class :character   Class :character   1st Qu.: 798.5  
##  Mode  :character   Mode  :character   Mode  :character   Median :1790.5  
##                                                           Mean   :2534.7  
##                                                           3rd Qu.:4587.5  
##                                                           Max.   :7437.0  
##                                                                           
##     Origin          OriginCityName     OriginState       
##  Length:864         Length:864         Length:864        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      Dest           DestCityName        DestState           CRSDepTime  
##  Length:864         Length:864         Length:864         Min.   :  15  
##  Class :character   Class :character   Class :character   1st Qu.: 829  
##  Mode  :character   Mode  :character   Mode  :character   Median :1155  
##                                                           Mean   :1249  
##                                                           3rd Qu.:1626  
##                                                           Max.   :2359  
##                                                                         
##     DepTime       WheelsOff       WheelsOn      CRSArrTime  
##  Min.   :   2   Min.   :   8   Min.   :   2   Min.   :  13  
##  1st Qu.: 825   1st Qu.: 839   1st Qu.:1202   1st Qu.:1028  
##  Median :1210   Median :1228   Median :1528   Median :1410  
##  Mean   :1254   Mean   :1278   Mean   :1511   Mean   :1424  
##  3rd Qu.:1641   3rd Qu.:1654   3rd Qu.:1918   3rd Qu.:1816  
##  Max.   :2356   Max.   :2355   Max.   :2358   Max.   :2356  
##                                NA's   :242                  
##     ArrTime       Cancelled    Diverted CRSElapsedTime  ActualElapsedTime
##  Min.   :   2   Min.   :0   Min.   :1   Min.   : 30.0   Min.   : NA      
##  1st Qu.:1210   1st Qu.:0   1st Qu.:1   1st Qu.: 94.0   1st Qu.: NA      
##  Median :1533   Median :0   Median :1   Median :139.0   Median : NA      
##  Mean   :1517   Mean   :0   Mean   :1   Mean   :159.4   Mean   :NaN      
##  3rd Qu.:1924   3rd Qu.:0   3rd Qu.:1   3rd Qu.:195.0   3rd Qu.: NA      
##  Max.   :2357   Max.   :0   Max.   :1   Max.   :527.0   Max.   : NA      
##  NA's   :242                                            NA's   :864      
##     Distance     
##  Min.   :  82.0  
##  1st Qu.: 416.8  
##  Median : 767.0  
##  Mean   : 947.6  
##  3rd Qu.:1173.0  
##  Max.   :3784.0  
## 

So all the missing value flights were either cancelled or diverted.


We need to discern how dates-times are dealt with. Are all dates-times given in terms of one time zone? Are date-times given in terms of the origin airport? Are date-times given in terms of the airport of event (e.g. departure times given in terms of origin airport time zone and arrival times given in terms of destination airport time zone). To answer these questions let’s look at American Airlines (“AA”) flights between JFK (east coast) and LAX (west coast) on the second of January. LAX is three hours behind New York. If departure times are given in terms of origin airport and arrivals in terms of destination airports than the flight speed of the plane should appear impossibly fast.

EastWest.Flights <- flights %>% filter(Origin == 'JFK', 
                                       Dest == 'LAX',
                                       Carrier == 'AA',
                                       FlightDate == '1/2/2016')
dim(EastWest.Flights)
## [1] 12 21
slice(EastWest.Flights, 1)
## # A tibble: 1 Ă— 21
##   FlightDate Carrier TailNum FlightNum Origin OriginCityName OriginState
##        <chr>   <chr>   <chr>     <int>  <chr>          <chr>       <chr>
## 1   1/2/2016      AA  N797AA         1    JFK   New York, NY          NY
## # ... with 14 more variables: Dest <chr>, DestCityName <chr>,
## #   DestState <chr>, CRSDepTime <int>, DepTime <int>, WheelsOff <int>,
## #   WheelsOn <int>, CRSArrTime <int>, ArrTime <int>, Cancelled <int>,
## #   Diverted <int>, CRSElapsedTime <int>, ActualElapsedTime <int>,
## #   Distance <int>
EastWest.Flights <- EastWest.Flights %>%
     mutate(WheelsOff = paste(sprintf("%04d", WheelsOff)),
            WheelsOn = paste(sprintf("%04d", WheelsOn)),
            WheelsOff = as.POSIXct(WheelsOff, format="%H%M"),
            WheelsOn = as.POSIXct(WheelsOn, format="%H%M"),
            FlightTime = WheelsOn - WheelsOff
            )


EastWest.Flights2 <- EastWest.Flights %>%
     transmute(FlightTime = as.numeric(FlightTime),
               FlightSpeed1 = Distance / FlightTime,
               FlightSpeed2 = Distance / (FlightTime + 3))

EastWest.Flights2
## # A tibble: 12 Ă— 3
##    FlightTime FlightSpeed1 FlightSpeed2
##         <dbl>        <dbl>        <dbl>
## 1    2.733333     905.4878     431.6860
## 2    2.550000     970.5882     445.9459
## 3    2.683333     922.3602     435.4839
## 4    2.566667     964.2857     444.6108
## 5    2.533333     976.9737     447.2892
## 6    2.650000     933.9623     438.0531
## 7    2.516667     983.4437     448.6405
## 8    2.416667    1024.1379     456.9231
## 9    2.433333    1017.1233     455.5215
## 10   2.583333     958.0645     443.2836
## 11   2.666667     928.1250     436.7647
## 12   2.483333     996.6443     451.3678
rm(EastWest.Flights, EastWest.Flights2)


FlightSpeed1 in the above table reports the calculated flight speed if both WheelsOff and WheelsOn were reported in New York time, while FlightSpeed2 reports the flight speed if WheelsOff was reported in New York time and WheelsOn was reported in California time. The speeds reported in FlightSpeed1 seem unreasonably fast, whereas those reported in FlightSpeed2 seem much more credible. Thus, times are reported according to the timezone in which they occur. So, arrivals and landing times are reported in the destination airport’s time zone, and departure and lift-off times are reported in the origin airport’s time zone. We will keep this in mind if we do any calculations with flight speed (as done above).


Manipulating the Data

Manipulating the Data

The first thing we want to do is convert the dates and times to proper date and time objects in R. While doing this we also need to keep an eye out for flights that occur late at night, since successive events (departure and landing) may occur in different days. Also, from here on out we are going to ignore cancelled and diverted flights since they don’t have information on delays.

# First convert flight dates to Date objects
flights2 <- flights %>%
     filter(complete.cases(.)) %>%
     mutate(FlightDate = as.Date(FlightDate, format = '%m/%d/%Y'),
            newCRSDepTime = paste(FlightDate, sprintf('%04d', CRSDepTime)),
            newCRSDepTime = as.POSIXct(newCRSDepTime, format = '%Y-%m-%d %H%M')
            )

# Now, before converting event times to date-time objects, let's build a
# a function that will properly deal with successive events that cross
# the day boundary

time.calc <- function(Time.Interest, Time.prev, Day, form = '%Y-%m-%d %H%M'){
     require(dplyr) # Require dplr package so that if it weren't already loaded
                    # I would still have access to the pipe operator
     
     if(Time.Interest < Time.prev){
          if(Time.prev - Time.Interest > 300){ # Only increase day count by 1 if
               Day <- Day + 1                  # time difference is great. This will
          }                                    # prevent us from misdating planes that
     }                                         # departed slightly ahead of schedule.
                                               # In this case, we will assume planes
                                               # don't depart more than 3 hours early
     Time <- paste(Day, sprintf('%04d', Time.Interest)) %>%
          as.POSIXct(format = form)
}

# Now determine date-times for events.

flights2 <- flights2 %>%
     rowwise() %>%
     mutate(newWheelsOff = time.calc(WheelsOff, CRSDepTime, FlightDate),
            newWheelsOn = time.calc(WheelsOn, CRSDepTime, FlightDate),
            newArrTime = time.calc(ArrTime, CRSDepTime, FlightDate),
            newDepTime = time.calc(DepTime, CRSDepTime, FlightDate),
            newCRSArrTime = time.calc(CRSArrTime, CRSDepTime, FlightDate)) %>%
     ungroup()

flights.complete <- flights2 %>%
     mutate(CRSDepTime = newCRSDepTime,
            DepTime = newDepTime,
            WheelsOff = newWheelsOff,
            WheelsOn = newWheelsOn,
            CRSArrTime = newCRSArrTime,
            ArrTime = newArrTime) %>%
     select(-starts_with('new'))


Now we’re going to start taking a look at flight delays. For each flight, we’ll calculate both a departure delay, “delay.dep”, and an arrival delay, “delay.arr”. Both of these variables will be in terms of minutes delayed. I’m interested in how delays vary with time of day, so I’m also going to create a new variable “flight.hour” which gives the hour in which the flight departed (an integer between 0 and 23).

flights.complete <- flights.complete %>%
     mutate(delay.dep = as.numeric(difftime(DepTime, CRSDepTime, units = 'mins')),
            delay.arr = as.numeric(difftime(ArrTime, CRSArrTime, units = 'mins')),
            delay.dep = ifelse(delay.dep < 0, 0, delay.dep),
            delay.arr = ifelse(delay.arr < 0, 0, delay.arr),
            delayed = ifelse((delay.dep > 0)|(delay.arr > 0), 1, 0),
            flight.hour = hour(CRSDepTime),
            flight.day = mday(CRSDepTime)
            )

prop.delay <- mean(flights.complete$delayed)

From the data set we can see that out of 433298 completed flights (not cancelled or diverted), 0.4414121 were delayed to some degree. Now let’s take a look at when these delays occur and how they might be related with overall scheduling of flights.

Visualizing the Data and Relationships

Distribution of Delays and Flight Traffic

First let’s take a look how the frequency of delays and air traffic varies by hour:

Delay.By.Hour <- flights.complete %>%
     group_by(flight.hour, Origin, flight.day) %>%
     summarise(prop.delay = mean(delayed),
               flights = n())

Delay.By.Hour %>%
     summarise(prop.delay = mean(prop.delay)) %>%
     summarise(prop.delay = mean(prop.delay)) %>%
     ggvis(~flight.hour, ~prop.delay) %>%
     layer_bars() %>%
     add_axis("x", title = "Scheduled Hour of Departure")%>%
     add_axis("y", title = "Average Proportion of Flights Delayed")
Delay.By.Hour %>%
     summarise(flights = mean(flights)) %>%
     summarise(flights = mean(flights)) %>%
     ggvis(~flight.hour, ~flights) %>%
     layer_bars() %>%
     add_axis("x", title = "Scheduled Hour of Departure")%>%
     add_axis("y", title = "Average Number of Scheduled Flights")

From the above charts it appears that the proportion of delayed flights seems to gradually increase from 5 am through 11 pm, and then drop off slightly before peaking again in the 2 am hour. Interstingly, the 2 am peak delay period (time with largest proportion of delays) seems also to be around the time with the fewest scheduled flights (as seen from the second bar chart).

Now let’s take a look at the distribution of delay times:

flights.complete %>%
     filter(delayed == 1) %>%
     ggvis(~delay.arr) %>%
     layer_histograms() %>%
     add_axis("x", title = "Length of Delay (minutes)")

The histogram above shows how skewed the delay times are. Most delays are less than 200 minutes long, but some delays can be much, much longer than that. The maximum observed arrival delay was 1125.

Below, we take at look at the average arrival days for each hour for each origin airport, averaged across the whole month. This helps reduce the variation in our set so we can get a different perspective on the distribution of delay times.

Delay.By.Hour <- flights.complete %>%
     filter(delayed == 1) %>%
     group_by(flight.hour, Origin) %>%
     summarise(delay.arr = mean(delay.arr),
               delay.dep = mean(delay.dep))

Delay.By.Hour %>%
     as.data.frame() %>%
     ggvis(~delay.arr) %>%
     layer_histograms(center = 15, width = 5) %>%
     add_axis("x", title = "Length of Delay (minutes)")
summary(Delay.By.Hour$delay.arr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   15.00   22.67   28.55   33.53  393.00

From the above histogram it appears that for any given hour at a given airport, the average delay times for a delayed flight is generally between 10 and 30 minutes.


Relationship between Flight Traffic and Delays

First let’s take a look at the relationship between traffic and frequency of delays:

Delay.By.Hour <- flights.complete %>%
     group_by(flight.hour, Origin, flight.day) %>%
     summarise(prop.delay = mean(delayed),
               flights = n())

# Scatterplot showing the between proportion of delayed flights v.s. the
# number of scheduled flights. Each dot represents the average for one
# hour of the day at a particular airport for the whole month of January
Delay.By.Hour %>%
     summarise(prop.delay = mean(prop.delay), 
               flights = mean(flights)) %>%
     ungroup() %>%
     ggvis(~flights, ~prop.delay) %>%
     layer_points(fill = ~flight.hour) %>%
     add_axis("x", title = "Number of Scheduled Flights")%>%
     add_axis("y", title = "Proportion of Flights Delayed") %>%
     layer_smooths(stroke := 'red')

The graph above doesn’t seem to indicate any strong relationship between airport traffic and flight delays. The smoothing curve indicates that there is a peak during at mid-traffic, but the effect is small.

Now let’s look at the relationship between traffic and length of delays:

Delay.By.Hour <- flights.complete %>%
     group_by(flight.hour, Origin, flight.day) %>%
     summarise(delay = mean(delay.arr),
               flights = n())

Delay.By.Hour %>%
     summarise(delay = mean(delay),
               flights = mean(flights)) %>%
     ungroup() %>%
     ggvis(~flights, ~delay) %>%
     layer_points(size = 0.5) %>%
     add_axis("x", title = "Total Number of Flights in Scheduled Hour")%>%
     add_axis("y", title = "Average Delay") %>%
     layer_smooths(stroke := 'red')

As with frequency of delays, there doesn’t appear to be any obvious relationship between airport traffic and length of delays.


Conclusion

Although one of the peaks times of delay occurance (2 am) corresponds to a minimum in airport traffic, no other relationship between airport traffic and the behavior of delays was observed.