Here we look at data on flights from New York City airports in 2013. There are 336,776 flights with 19 variables, or around 6.4 million data points. That’s a lot more than we are interested in, so we need to do some cleaning and structuring. We want to focus on dates and times.

The dates and times for arrivals and departures are not in a standard form, so we need to make them easier to work with. Once we have completed that task, we can begin to look at the data with visualization.

These examples are based on Hadley Wickham’s book, R for Data Science. The code can be found on GitHub.

We start by loading the packages we will work with.

library(tidyverse)
library(lubridate)
library(nycflights13)

If we look at our flight data, in the flights dataframe, we can see right away there is a problem with the dates and times. We have edited out a number of columns we don’t need by using the select() command.

flights %>% select(year, month, day, hour, minute)

## # A tibble: 336,776 x 5
##     year month   day  hour minute
##    <int> <int> <int> <dbl>  <dbl>
##  1  2013     1     1     5     15
##  2  2013     1     1     5     29
##  3  2013     1     1     5     40
##  4  2013     1     1     5     45
##  5  2013     1     1     6      0
##  6  2013     1     1     5     58
##  7  2013     1     1     6      0
##  8  2013     1     1     6      0
##  9  2013     1     1     6      0
## 10  2013     1     1     6      0
## # ... with 336,766 more rows

The dates and times are broken up into separate columns for year, month, day, hour, and minute. We want to put the dates and times into a more conventional form, so we will use functions from the lubridate package, which is designed for this purpose. We add the new column we need using the mutate() function: first a date built with make_date(), then a full date-time built with make_datetime().

flights %>% select(year, month, day, hour, minute) %>%
        mutate(departure = make_date(year, month, day))

## # A tibble: 336,776 x 6
##     year month   day  hour minute  departure
##    <int> <int> <int> <dbl>  <dbl>     <date>
##  1  2013     1     1     5     15 2013-01-01
##  2  2013     1     1     5     29 2013-01-01
##  3  2013     1     1     5     40 2013-01-01
##  4  2013     1     1     5     45 2013-01-01
##  5  2013     1     1     6      0 2013-01-01
##  6  2013     1     1     5     58 2013-01-01
##  7  2013     1     1     6      0 2013-01-01
##  8  2013     1     1     6      0 2013-01-01
##  9  2013     1     1     6      0 2013-01-01
## 10  2013     1     1     6      0 2013-01-01
## # ... with 336,766 more rows

flights %>% select(year, month, day, hour, minute) %>%
        mutate(departure = make_datetime(year, month, day, hour, minute))

## # A tibble: 336,776 x 6
##     year month   day  hour minute           departure
##    <int> <int> <int> <dbl>  <dbl>              <dttm>
##  1  2013     1     1     5     15 2013-01-01 05:15:00
##  2  2013     1     1     5     29 2013-01-01 05:29:00
##  3  2013     1     1     5     40 2013-01-01 05:40:00
##  4  2013     1     1     5     45 2013-01-01 05:45:00
##  5  2013     1     1     6      0 2013-01-01 06:00:00
##  6  2013     1     1     5     58 2013-01-01 05:58:00
##  7  2013     1     1     6      0 2013-01-01 06:00:00
##  8  2013     1     1     6      0 2013-01-01 06:00:00
##  9  2013     1     1     6      0 2013-01-01 06:00:00
## 10  2013     1     1     6      0 2013-01-01 06:00:00
## # ... with 336,766 more rows
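
As a quick check on what these two functions return, here is a minimal sketch using a single hand-typed date rather than the flights data. The variable names d and dt are ours, chosen only for illustration.

```r
library(lubridate)

# make_date() builds a Date; make_datetime() builds a POSIXct date-time.
d  <- make_date(2013, 1, 1)
dt <- make_datetime(2013, 1, 1, 5, 15)

class(d)    # "Date"
class(dt)   # "POSIXct" "POSIXt"
tz(dt)      # "UTC", the default time zone used by make_datetime()
format(dt)  # "2013-01-01 05:15:00"
```

The difference matters later on: a Date counts in days, while a date-time counts in seconds.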

In order to make our analysis a bit less tedious, we are going to create a custom function that we will call later.

make_datetime_100 <- function(year, month, day, time) {
        make_datetime(year, month, day, time %/% 100, time %% 100)
}
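
To see why the arithmetic inside the function works, it may help to try it on one value. The flights data store times as numbers such as 517, meaning 5:17. The sketch below repeats the function definition only so that it runs on its own; the value 517 is taken from the first row of the data.

```r
library(lubridate)

# Repeated from the text so this example is self-contained.
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

517 %/% 100   # 5: integer division by 100 gives the hour
517 %% 100    # 17: the remainder gives the minute

first_dep <- make_datetime_100(2013, 1, 1, 517)
format(first_dep)   # "2013-01-01 05:17:00"
```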

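This function lets us convert the four time columns in our flights data, which store times as numbers such as 517 for 5:17, into proper date-times. We will use these columns for our analysis in a new dataframe called flights_dt. We also drop flights with a missing departure or arrival time.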

flights_dt <- flights %>% filter(!is.na(dep_time), !is.na(arr_time)) %>%
        mutate(
                dep_time = make_datetime_100(year, month, day, dep_time),
                arr_time = make_datetime_100(year, month, day, arr_time),
                sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
                sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
                ) %>% 
        select(origin, dest, ends_with("delay"), ends_with("time"))

We can see the results below. Times and dates are now in a familiar format.

flights_dt

## # A tibble: 328,063 x 9
##    origin  dest dep_delay arr_delay            dep_time
##     <chr> <chr>     <dbl>     <dbl>              <dttm>
##  1    EWR   IAH         2        11 2013-01-01 05:17:00
##  2    LGA   IAH         4        20 2013-01-01 05:33:00
##  3    JFK   MIA         2        33 2013-01-01 05:42:00
##  4    JFK   BQN        -1       -18 2013-01-01 05:44:00
##  5    LGA   ATL        -6       -25 2013-01-01 05:54:00
##  6    EWR   ORD        -4        12 2013-01-01 05:54:00
##  7    EWR   FLL        -5        19 2013-01-01 05:55:00
##  8    LGA   IAD        -3       -14 2013-01-01 05:57:00
##  9    JFK   MCO        -3        -8 2013-01-01 05:57:00
## 10    LGA   ORD        -2         8 2013-01-01 05:58:00
## # ... with 328,053 more rows, and 4 more variables: sched_dep_time <dttm>,
## #   arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>

With this format in place, we can begin visualizing the data. We can start with the “30,000 foot” view of flights for the full year by day.

flights_dt %>% ggplot(aes(dep_time)) + geom_freqpoly(binwidth = 86400)

Why are we using a binwidth of 86,400? It seems like an arbitrary number, until you remember that our date-times are measured in seconds. If we want one bin per day, we need to express a day in seconds. 24 hours times 60 minutes times 60 seconds gives us 86,400 seconds per day, so each bin in our chart is one day wide.

We can now zoom in on the flights for a single day, in this case January 1, by using the filter() function. Note our new value for binwidth. Why 600? Remember, our unit is seconds, so 600 seconds makes each bin ten minutes wide.
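
The arithmetic can be confirmed directly. This is a small sketch with two hand-built date-times, not part of the flights data; the variable names are ours.

```r
library(lubridate)

# One day expressed in seconds.
seconds_per_day <- 24 * 60 * 60
seconds_per_day   # 86400

# Date-times are stored as seconds, so two consecutive midnights
# differ by exactly that number.
midnight_jan1 <- make_datetime(2013, 1, 1)
midnight_jan2 <- make_datetime(2013, 1, 2)
gap <- as.numeric(midnight_jan2) - as.numeric(midnight_jan1)
gap               # 86400
```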

flights_dt %>% filter(dep_time < ymd(20130102)) %>%
        ggplot(aes(dep_time)) +
        geom_freqpoly(binwidth = 600)

We can use the wday() function to figure out how many flights there are throughout the year for each day of the week. We use mutate() to create a new column of data, then plot.

flights_dt %>%
        mutate(wday = wday(dep_time, label = TRUE)) %>%
        ggplot(aes(x = wday)) +
        geom_bar()
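
For a feel of what wday() returns, here is a minimal sketch on a single date, assuming lubridate's default settings, where the week starts on Sunday. The text of the label depends on your locale.

```r
library(lubridate)

jan1 <- ymd("2013-01-01")   # a Tuesday

# With the default week start (Sunday = 1), Tuesday is day 3.
wday(jan1)

# label = TRUE returns an ordered factor of abbreviated day names,
# which is why the bars in the chart appear in weekday order.
jan1_label <- wday(jan1, label = TRUE)
is.ordered(jan1_label)      # TRUE
```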

It might also be interesting to look at departure times throughout the year by the minute of the hour. In particular, are there times during the hour that might see minimal delays? If we create a new column of data using the minutes of actual departure times and average the arrival delay for each minute, we can get the following visual.

flights_dt %>%
        mutate(minute = minute(dep_time)) %>%
        group_by(minute) %>%
        summarize(avg_delay = mean(arr_delay, na.rm = TRUE),
                  n = n()) %>%
        ggplot(aes(minute, avg_delay)) + geom_line()
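
The same pipeline can be traced by hand on a tiny made-up table. The three rows below are invented for illustration and are not taken from the flights data.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: two departures at 17 past the hour, one at 30 past.
toy <- tibble(
  dep_time  = make_datetime(2013, 1, 1, c(5, 6, 7), c(17, 17, 30)),
  arr_delay = c(10, 20, -5)
)

toy_summary <- toy %>%
  mutate(minute = minute(dep_time)) %>%   # pull out the minute of the hour
  group_by(minute) %>%
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE),
            n = n())

toy_summary
# minute 17: avg_delay 15, n 2
# minute 30: avg_delay -5, n 1
```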

But actual times aren’t scheduled times. That is, planes don’t often depart or arrive when they are supposed to. How do things look if we use the scheduled departure time, not the actual time? Well, a bit like this:

sched_dep <- flights_dt %>%
        mutate(minute = minute(sched_dep_time)) %>%
        group_by(minute) %>%
        summarize(avg_delay = mean(arr_delay, na.rm = TRUE),
                  n = n())

ggplot(sched_dep, aes(minute, avg_delay)) + geom_line()

And when are most departures scheduled? Well, it looks like on the hour and on the half hour, with other flights scheduled at five-minute intervals.

ggplot(sched_dep, aes(minute, n)) + geom_line()

We can use the time-rounding functions, in this case floor_date(), to help us graph the number of flights each week for the year.

flights_dt %>%
        count(week = floor_date(dep_time, "week")) %>%
        ggplot(aes(week, n)) + geom_line()
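
To see what floor_date() does to a single value, here is a minimal sketch assuming lubridate's default week start of Sunday. The date-time is made up for illustration.

```r
library(lubridate)

thursday <- ymd_hms("2013-01-03 14:30:00")

# Rounding down to the week gives midnight on the preceding Sunday.
week_start <- floor_date(thursday, "week")
as_date(week_start)           # "2012-12-30"

# The same function works for other units.
floor_date(thursday, "day")   # midnight on 2013-01-03
floor_date(thursday, "hour")  # 2013-01-03 14:00:00
```

Every departure in the same week is floored to the same value, which is what lets count() tally flights per week.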

Sometimes, data has a problem with chronological order. In our case, we have flights arriving before they depart!

flights_dt %>%
        filter(arr_time < dep_time)

## # A tibble: 10,633 x 9
##    origin  dest dep_delay arr_delay            dep_time
##     <chr> <chr>     <dbl>     <dbl>              <dttm>
##  1    EWR   BQN         9        -4 2013-01-01 19:29:00
##  2    JFK   DFW        59        NA 2013-01-01 19:39:00
##  3    EWR   TPA        -2         9 2013-01-01 20:58:00
##  4    EWR   SJU        -6       -12 2013-01-01 21:02:00
##  5    EWR   SFO        11       -14 2013-01-01 21:08:00
##  6    LGA   FLL       -10        -2 2013-01-01 21:20:00
##  7    EWR   MCO        41        43 2013-01-01 21:21:00
##  8    JFK   LAX        -7       -24 2013-01-01 21:28:00
##  9    EWR   FLL        49        28 2013-01-01 21:34:00
## 10    EWR   FLL        -9       -14 2013-01-01 21:36:00
## # ... with 10,623 more rows, and 4 more variables: sched_dep_time <dttm>,
## #   arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>

The explanation is simple: these are flights that are in the air at midnight, so they arrive the day after they depart. Because we built the departure and arrival times from the same date, our data aren’t showing that, so we need a fix. We create a column that flags the overnight flights, then use it to adjust the arrival times. The flag is a logical expression yielding a result of TRUE or FALSE. Multiplying it by 1 turns it into 1 or 0, so days(overnight * 1) adds one day to the overnight flights and nothing to the others.

flights_dt <- flights_dt %>%
        mutate(
        overnight = arr_time < dep_time,
        arr_time = arr_time + days(overnight * 1),
        sched_arr_time = sched_arr_time + days(overnight * 1)
)

flights_dt

## # A tibble: 328,063 x 10
##    origin  dest dep_delay arr_delay            dep_time
##     <chr> <chr>     <dbl>     <dbl>              <dttm>
##  1    EWR   IAH         2        11 2013-01-01 05:17:00
##  2    LGA   IAH         4        20 2013-01-01 05:33:00
##  3    JFK   MIA         2        33 2013-01-01 05:42:00
##  4    JFK   BQN        -1       -18 2013-01-01 05:44:00
##  5    LGA   ATL        -6       -25 2013-01-01 05:54:00
##  6    EWR   ORD        -4        12 2013-01-01 05:54:00
##  7    EWR   FLL        -5        19 2013-01-01 05:55:00
##  8    LGA   IAD        -3       -14 2013-01-01 05:57:00
##  9    JFK   MCO        -3        -8 2013-01-01 05:57:00
## 10    LGA   ORD        -2         8 2013-01-01 05:58:00
## # ... with 328,053 more rows, and 5 more variables: sched_dep_time <dttm>,
## #   arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>,
## #   overnight <lgl>
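
The effect of the fix on one flight can be sketched as follows. The departure and arrival times here are invented for illustration: a flight leaving at 23:30 and landing at 00:45, both initially stamped with the same date.

```r
library(lubridate)

dep <- make_datetime(2013, 1, 1, 23, 30)
arr <- make_datetime(2013, 1, 1, 0, 45)   # wrong date: really January 2

overnight <- arr < dep                    # TRUE
overnight * 1                             # 1

# Adding days(1) moves the arrival to the next calendar day;
# for a non-overnight flight this would add days(0) and change nothing.
arr_fixed <- arr + days(overnight * 1)
format(arr_fixed)                         # "2013-01-02 00:45:00"
```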

We check to see if we have done it correctly. We have, since no overnight flights now arrive before they depart.

flights_dt %>% filter(overnight, arr_time < dep_time)

## # A tibble: 0 x 10
## # ... with 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>,
## #   arr_delay <dbl>, dep_time <dttm>, sched_dep_time <dttm>,
## #   arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>,
## #   overnight <lgl>

Working with Time Functions in lubridate

Steven Slezak

27 Dec 2017

End of Time