flights - explorations

PURPOSE: we explore the flights dataset available in nycflights13 using dplyr and lubridate.

Quick overview of the dataset

flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Not all flights actually occurred. We assume that if both dep_time and arr_time are NA, the flight was cancelled. These flights are not of interest we use filter() to remove them and overwrite flights in our workspace.

flights <- flights %>%
 filter(!is.na(dep_time), !is.na(arr_time))

## Warning: package 'bindrcpp' was built under R version 3.4.4

flights

## # A tibble: 328,063 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 328,053 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Sometimes flights leave early!

My last flight was to Montreal. I thought it would be ok to arrive just in time because I had a purchased seat reservation and had confirmed the flight. So when I arrived 5 minutes before takeover, I was surprised to find the staff were “in a fap”. Apparently there had been numerous annoucements calling for passenger McLeod to go to the departure gate immediately. After quickly going through security, I boarded almost full plane and noticed the amused look on many faces!

What proportion leave early?

An early departure is indicated by dep_delay < 0. We could use filter to select all flights early flights and use nrow() to determine the proportion.

nrow(filter(flights, dep_delay<0))/nrow(flights)

## [1] 0.5591731

Amazing about 56% leave early!

Does it depend on destination

Let’s find out how many destinations there are and their relative frequency - that is we need to sort in descending order of frequency.

We see there are 104 destinations. The top 3 destinations are ALT, BOS and ORD with over 9500 flights in 2013.

DEST <- flights %>%
 filter(dep_delay<0) %>%
 select(dest) %>%
 group_by(dest) %>%
 summarize(count=n()) %>%
 arrange(desc(count))
DEST

## # A tibble: 104 x 2
##    dest  count
##    <chr> <int>
##  1 ATL    9871
##  2 BOS    9687
##  3 ORD    9556
##  4 CLT    8826
##  5 LAX    8478
##  6 MCO    7746
##  7 MIA    6874
##  8 SFO    6566
##  9 FLL    6244
## 10 DCA    5841
## # ... with 94 more rows

We see there was only 1 flight in 2013 to ANC and LEX.

arrange(DEST, count)

## # A tibble: 104 x 2
##    dest  count
##    <chr> <int>
##  1 ANC       1
##  2 LEX       1
##  3 MTJ       3
##  4 JAC       4
##  5 SBN       4
##  6 HDN       5
##  7 EYW       8
##  8 PSP      13
##  9 BZN      15
## 10 CHO      27
## # ... with 94 more rows

There are 104 destinations. A summary of the number of flights to each is provided by the Tukey 5 number summary available in base R, stats::fivenum(). Note the use of dot notation with braces, {. }.

flights %>%
 filter(dep_delay<0) %>%
 select(dest) %>%
 group_by(dest) %>%
 summarize(count=n()) %>%
 arrange(desc(count)) %>%
 {fivenum(.$count)}

## [1]    1.0  173.0  784.5 2131.5 9871.0

Histogram of proportion of early departures

For each destination we divide the number of early departures by the total number of flights. Note that we construct two tibbles and merge them use the dplyr::inner_join(). This is one of the important functions discussed in Ch 10 *Relational Data with dplyr**.

allByDest <- flights %>%
 select(dest) %>%
 group_by(dest) %>%
 summarize(Total=n()) %>%
 arrange(desc(Total))

earlyByDest <- flights %>%
 filter(dep_delay<0) %>%
 select(dest) %>%
 group_by(dest) %>%
 summarize(Early=n()) %>%
 arrange(desc(Early))

inner_join(allByDest, earlyByDest) %>%
 mutate(probEarly=Early/Total) %>%
 ggplot(aes(x=probEarly)) +
   geom_histogram(bins=10)

## Joining, by = "dest"

There is one destination with 100% probability of early departure! Let’s drill down and find out which.

inner_join(allByDest, earlyByDest) %>%
 mutate(probEarly=Early/Total) %>%
 arrange(desc(probEarly))

## Joining, by = "dest"

## # A tibble: 104 x 4
##    dest  Total Early probEarly
##    <chr> <int> <int>     <dbl>
##  1 LEX       1     1     1    
##  2 PSP      18    13     0.722
##  3 XNA    1005   721     0.717
##  4 SRQ    1202   792     0.659
##  5 AVL     263   172     0.654
##  6 MVY     211   137     0.649
##  7 ACK     265   172     0.649
##  8 DFW    8453  5453     0.645
##  9 CLT   13684  8826     0.645
## 10 BOS   15028  9687     0.645
## # ... with 94 more rows

Well no surprise! There was only one flight to LEX and it left early.

Exploring delayed flights

Distribution of delay times

We see that there is a group of flights with delays of 4 minutes or less. Also that not many flights have delays more than $2^7$ minutes = 2.1 hours.

flights %>%
 filter(dep_delay>0) %>%
 ggplot(aes(x=log(dep_delay, base=2))) +
   geom_histogram(binwidth=0.25) +
   scale_x_continuous(breaks=1:10) +
   ggtitle("Departure delays in log_2(minutes)")

Does probability of delay for more than 4 minutes depend the destination

We construct two tibbles one for delayed flights, which a delay of more than 4 minutes, and the other for all flights. Then use inner_join() and compute the proportion of delayed flights.

nD <- flights %>%
 filter(dep_delay>4) %>%
 group_by(dest) %>%
 summarize(numDelayed=n()) %>%
 arrange(desc(numDelayed))
#
nF <- flights %>%
 group_by(dest) %>%
 summarize(numFlights=n()) %>%
 arrange(desc(numFlights))
#
nFD <- inner_join(nF, nD) %>%
 mutate(probDelay=numDelayed/numFlights) %>%
 arrange(desc(probDelay))

## Joining, by = "dest"

The Tukey five-number summary 8, 358, 1439, 4096.5, 16873 indicates that the distribution is right-skewed.

From the table below, the most popular flights seem to be delayed the most frequently.

arrange(nFD, desc(numFlights))

## # A tibble: 103 x 4
##    dest  numFlights numDelayed probDelay
##    <chr>      <int>      <int>     <dbl>
##  1 ATL        16873       4878     0.289
##  2 ORD        16607       5153     0.310
##  3 LAX        16058       4714     0.294
##  4 BOS        15028       3897     0.259
##  5 MCO        13979       4283     0.306
##  6 CLT        13684       3499     0.256
##  7 SFO        13212       4310     0.326
##  8 FLL        11930       4060     0.340
##  9 MIA        11625       3064     0.264
## 10 DCA         9125       2551     0.280
## # ... with 93 more rows

Time series plot of delays in the most frequent flights

We construct a daily time series showing the number of minutes of delay for all destinations with more than 14000 flights in 2013 for which the departure delay exceeded 4 minutes. We use lubridate::make_date() to construct the Date and then use ggplot().

arrange(nFD, desc(numFlights)) %>%
 filter(numFlights > 14000) %>%
 select(dest) %>%
 inner_join(., filter(flights, dep_delay>4)) %>%
 group_by(month, day) %>%
 summarize(mdelay=median(dep_delay, na.rm=TRUE), numdelay=n()) %>%
 mutate(date=make_date(year=2013, month=month, day=day)) %>%
 ggplot(aes(x=date, y=mdelay)) +
   geom_line() + 
   geom_smooth(se=FALSE) +
   ggtitle("Median non-negligible delay for most frequent flights") +
   labs(y="delay in minutes")

## Joining, by = "dest"

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Number of departures to all destinations per day

To plot using ggplot(), we want to replace dep_time which is in seconds with an R Date. In lubridate we can use make_datetime(),

args(make_datetime)

## function (year = 1970L, month = 1L, day = 1L, hour = 0L, min = 0L, 
##     sec = 0, tz = "UTC") 
## NULL

We see make_datetime() requires args hour and minute but the tibble flights$dep_time is the time in seconds from midnight. Using the modulo functions %% and %/% we can convert this to the required formats.

flights_dt <- flights %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_time = make_datetime(year, month, day, 
        hour = dep_time%/%100, min = dep_time%%100)
  ) %>% 
  select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt

## # A tibble: 328,063 x 9
##    origin dest  dep_delay arr_delay dep_time            sched_dep_time
##    <chr>  <chr>     <dbl>     <dbl> <dttm>                       <int>
##  1 EWR    IAH           2        11 2013-01-01 05:17:00            515
##  2 LGA    IAH           4        20 2013-01-01 05:33:00            529
##  3 JFK    MIA           2        33 2013-01-01 05:42:00            540
##  4 JFK    BQN          -1       -18 2013-01-01 05:44:00            545
##  5 LGA    ATL          -6       -25 2013-01-01 05:54:00            600
##  6 EWR    ORD          -4        12 2013-01-01 05:54:00            558
##  7 EWR    FLL          -5        19 2013-01-01 05:55:00            600
##  8 LGA    IAD          -3       -14 2013-01-01 05:57:00            600
##  9 JFK    MCO          -3        -8 2013-01-01 05:57:00            600
## 10 LGA    ORD          -2         8 2013-01-01 05:58:00            600
## # ... with 328,053 more rows, and 3 more variables: arr_time <int>,
## #   sched_arr_time <int>, air_time <dbl>

We can use geom_freqpoly() to visualise the distribution of departure times across the 2013 year:

flights_dt %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day

Or within a single day, for example, in the U.S. Thanksgiving holiday was on Thursday November 28. Let’s see the pattern of flights on the day before.

flights_dt %>% 
  filter(dep_time < make_datetime(2013, 11, 28, 0, 0),
         dep_time >= make_datetime(2013, 11, 27, 0, 0)
  ) %>%
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 3*600) + # 600 s = 10 minutes
  ggtitle("NYC Airport. Total departures every 30 minutes.",
          subtitle="November 27, 2013")

Flights to Hawaii

We see there are 336,766 flights but this includes cancelled flights. The major international airport in Hawii is in Honolulu with code HNL. Eliminating all cancelled flights and selecting only flights to Hawaii.

hflights <- flights %>%
 filter(!is.na(dep_time), !is.na(arr_time), dest=="HNL") %>%
 select(-dest)
hflights

## # A tibble: 705 x 18
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      857            900        -3     1516
##  2  2013     1     1     1344           1344         0     2005
##  3  2013     1     2      909            900         9     1525
##  4  2013     1     2     1344           1344         0     1940
##  5  2013     1     3      914            900        14     1504
##  6  2013     1     3     1418           1341        37     2006
##  7  2013     1     4      900            900         0     1516
##  8  2013     1     4     1343           1341         2     1932
##  9  2013     1     5      858            900        -2     1519
## 10  2013     1     5     1329           1335        -6     1850
## # ... with 695 more rows, and 11 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

So there are 705 all together.

Which airlines fly there from NYC?

hflights %>%
 { unique(.$carrier) }

## [1] "HA" "UA"

Remark: First attempt, left out the braces - using google I found help here

https://stackoverflow.com/questions/42385010/using-the-pipe-and-dot-notation

For brevity the variable carrier uses an abbreviated code to denote the airline. The tibble airlines gives the name, so we see the two airlines are Hawaiian Airlines Inc., United Air Lines Inc.

What is the frequency of flights by weekday and by month

The package lubridate is included in the recommended tidyverse but it needs to attached. This package provides very simplified and convenient functions for working with dates. The flights tibble contains the variable time_hour which is is recognized in date-time format, , that is supported with lubridate.

Let’s look at barcharts for flights by day-of-week. We find there are about 100 flights each day – a little less on Tuesdays and Thursdays.

hflights %>%
 mutate(dayOfWeek=wday(hflights$time_hour, label=TRUE)) %>%
 ggplot(aes(dayOfWeek)) +
  geom_bar()

Similarly the barchart for months. There are about 60 flights per month so about 2 every day assuming flights are evenly distributed over the month.

hflights %>%
 mutate(monthName = month.abb[month]) %>%
 ggplot(aes(monthName)) +
  geom_bar()

Are there any days with no flights to Hawaii?

We can solve this problem by converting the year, month and day into Date format using lubridate::make_date(). Then we can construct the set of all Dates in 2013 and see if all flight days are included. We find that they are!

start <- make_date(year=2013, month=1, day=1)
days2013 <- start+days(0:364)
hdays <- with(hflights, make_date(year=2013, month=month, day=day))
all(days2013 %in% hdays)

## [1] TRUE

Number of flights per day to Hawaii

Using base::table() we need to use the braces with .-notation.

hflights %>%
 mutate(hdays = make_date(year=2013, month=month, day=day)) %>%
 group_by(hdays) %>%
 summarize(n=n()) %>%
 {table(.$n)}

## 
##   1   2 
##  25 340

But using stats::table() we don’t need to use the braces.

hflights %>%
 mutate(hdays = make_date(year=2013, month=month, day=day)) %>%
 group_by(hdays) %>%
 summarize(n=n()) %>%
 xtabs(~n, .)

## n
##   1   2 
##  25 340

Exploring the variation in flying time to Hawaii

hflights %>%
 mutate(hdays = make_date(year=2013, month=month, day=day)) %>%
 group_by(hdays) %>%
 summarize(avFlyTime = mean(air_time, na.rm=TRUE)/60) %>%
 filter(!is.nan(avFlyTime)) %>%  #there is one of these!
 ggplot(aes(x=hdays, y=avFlyTime)) +
  geom_line() +
  geom_smooth(method="loess", color=alpha("blue", 0.5), size=2, 
              se=FALSE, method.args=list(degree=2)) +
  ggtitle("Average Flying Time from NYC to Honolulu") +
  labs(x="date", y="time in hours")

hflights %>%
 mutate(hdays = make_date(year=2013, month=month, day=day)) %>%
 group_by(hdays) %>%
 summarize(avFlyTime = mean(air_time, na.rm=TRUE)/60) %>%
 filter(!is.nan(avFlyTime)) %>%  #there is one of these!
 ggplot(aes(x=avFlyTime, y=..density..)) +
  geom_histogram(binwidth=0.2, fill="lightblue") +
  geom_density(size=2, bw = "SJ-dpi") +
  ggtitle("Average Flying Time from NYC to Honolulu") +
  labs(x="time in hours")