0. Introduction


For obvious reasons, dates and times are a very common and important type of data. For example, in flights we have information about the scheduled departure time, actual departure time, scheduled arrival time, and actual arrival time. We also have a time_hour column to record the scheduled date and hour (but with minutes ignored) in a date-time format.

glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

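At first glance, dates and times seem simple. You use them all the time in your regular life, and they don’t seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get. To warm up, try these three seemingly simple questions:

  1. Does every year have 365 days?
  2. Does every day have 24 hours?
  3. Does every minute have 60 seconds?

The answer is “no” for all three questions because of leap years, daylight saving time, and leap seconds, respectively.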

The physics behind measuring time is indeed complicated. Here we will focus on establishing a solid grounding of practical skills that will help us with common data analysis challenges.


1. Load Libraries

This chapter will focus on the lubridate package, which makes it easier to work with dates and times in R. lubridate is not part of core tidyverse because you only need it when you’re working with dates/times. We will also need nycflights13 for practice data.

library(tidyverse)
library(nycflights13)
library(lubridate)


2. Creating date/times

There are three types of date/time data that refer to an instant in time: a date, a time within a day, and a date-time (a date plus a time).

Here we are only going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.
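
As a brief illustration of the hms package mentioned above, here is a minimal sketch that stores a time of day. The time value is an assumed example. The functions are called with the hms:: prefix because lubridate also exports a function named hms().

```r
# Parse a time of day from a string (assumed example value)
t1 <- hms::as_hms("20:11:59")

# Build the same time from its components
t2 <- hms::hms(seconds = 59, minutes = 11, hours = 20)

t1 == t2         # both represent the same time of day
as.numeric(t1)   # number of seconds since midnight
```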

You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.


Get current date and time

To get the current date or date-time you can use today() or now():

today()
## [1] "2023-04-05"
now()
## [1] "2023-04-05 13:58:47 EDT"

Note that today() and now() may give different results with different time zones.

today(tzone = "PRC")
## [1] "2023-04-06"
now(tzone = "UTC")
## [1] "2023-04-05 17:58:47 UTC"

You can find out what R thinks your current time zone is with Sys.timezone():

Sys.timezone()
## [1] "America/New_York"

To see the complete list of all time zone names, use OlsonNames():

length(OlsonNames())
## [1] 597
head(OlsonNames())
## [1] "Africa/Abidjan"     "Africa/Accra"       "Africa/Addis_Ababa"
## [4] "Africa/Algiers"     "Africa/Asmara"      "Africa/Asmera"

Time zone names include the continent (or region) so that cities sharing the same name in different parts of the world can be told apart.
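
Because a time zone is identified by its full name, it can be worth checking a name against OlsonNames() before using it. The following is a small sketch using base R only; the names tested are assumed examples.

```r
# A full region/city name is a valid time zone name
"America/New_York" %in% OlsonNames()

# A bare city name is not
"New_York" %in% OlsonNames()

# List the first few names under one region prefix
head(grep("^Australia/", OlsonNames(), value = TRUE))
```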


Create date/time from strings

Date/time data often comes as strings. lubridate offers convenient functions that automatically work out the format once you specify the order of the components. To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date into the form yyyy-mm-dd. For example:

ymd("2017-01-31")
## [1] "2017-01-31"
mdy("January 31st, 2017")
## [1] "2017-01-31"
dmy("31-Jan-2017")
## [1] "2017-01-31"

These functions also take unquoted numbers. This is the most concise way to create a single date/time object, as you might need when filtering date/time data.

ymd(20170131)
## [1] "2017-01-31"

ymd() and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:

ymd_hms("2017-01-31 20:11:59")
## [1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
## [1] "2017-01-31 08:01:00 UTC"

By default, ymd() and similar functions create dates, which do not carry a time zone. But we can force the creation of a date-time from a date by supplying a time zone:

ymd(20170131, tz = "UTC")
## [1] "2017-01-31 UTC"
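
To make the difference between the two kinds of result concrete, here is a short sketch that checks the class of what each call returns and reads back the time zone. The America/New_York value is an assumed example.

```r
library(lubridate)

d     <- ymd(20170131)                # no tz supplied: a Date
dt    <- ymd(20170131, tz = "UTC")    # tz supplied: a date-time (POSIXct)
dt_ny <- ymd_hms("2017-01-31 20:11:59", tz = "America/New_York")

is.Date(d)      # TRUE
is.POSIXct(dt)  # TRUE
tz(dt_ny)       # the time zone attached to the date-time
```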


Lab Exercise:

  1. What happens if you parse a string that contains invalid dates?

ymd(c("2010-10-10", "bananas"))

  2. Use the appropriate lubridate function to parse each of the following dates:
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014


Create date/time from individual components

Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:

flights %>% 
  select(year, month, day, hour, minute)
## # A tibble: 336,776 × 5
##     year month   day  hour minute
##    <int> <int> <int> <dbl>  <dbl>
##  1  2013     1     1     5     15
##  2  2013     1     1     5     29
##  3  2013     1     1     5     40
##  4  2013     1     1     5     45
##  5  2013     1     1     6      0
##  6  2013     1     1     5     58
##  7  2013     1     1     6      0
##  8  2013     1     1     6      0
##  9  2013     1     1     6      0
## 10  2013     1     1     6      0
## # … with 336,766 more rows

To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times. make_date() takes up to three arguments: year, month, and day. Any argument you omit falls back to its default, and the defaults correspond to January 1st, 1970, the so-called “Unix Epoch”.

date1 <- make_date(2023, 4, 5)
class(date1)
## [1] "Date"
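
The effect of the defaults can be seen by leaving arguments out. This is a minimal sketch; the year and month are assumed example values.

```r
library(lubridate)

make_date()          # all defaults: 1970-01-01
make_date(2023)      # month and day default to 1: 2023-01-01
make_date(2023, 4)   # day defaults to 1: 2023-04-01
```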

make_datetime() takes up to seven arguments: year, month, day, hour, min, sec, and tz (time zone). The default value is 1970-01-01 00:00:00 UTC.

flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure_date = make_date(year, month, day)) %>%
  mutate(departure_scheduled = make_datetime(year, month, day, hour, minute, tz = Sys.timezone())) 
## # A tibble: 336,776 × 7
##     year month   day  hour minute departure_date departure_scheduled
##    <int> <int> <int> <dbl>  <dbl> <date>         <dttm>             
##  1  2013     1     1     5     15 2013-01-01     2013-01-01 05:15:00
##  2  2013     1     1     5     29 2013-01-01     2013-01-01 05:29:00
##  3  2013     1     1     5     40 2013-01-01     2013-01-01 05:40:00
##  4  2013     1     1     5     45 2013-01-01     2013-01-01 05:45:00
##  5  2013     1     1     6      0 2013-01-01     2013-01-01 06:00:00
##  6  2013     1     1     5     58 2013-01-01     2013-01-01 05:58:00
##  7  2013     1     1     6      0 2013-01-01     2013-01-01 06:00:00
##  8  2013     1     1     6      0 2013-01-01     2013-01-01 06:00:00
##  9  2013     1     1     6      0 2013-01-01     2013-01-01 06:00:00
## 10  2013     1     1     6      0 2013-01-01     2013-01-01 06:00:00
## # … with 336,766 more rows

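Note that the default time zone for make_datetime() is UTC. The departure times in flights are recorded in local time, which is America/New_York because all of these flights departed from New York City, so we pass tz explicitly (here Sys.timezone(), which is America/New_York on the machine that produced this output).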

We can also get the hours and minutes from dep_time or arr_time, which are stored as numbers such as 517 (meaning 05:17), using integer division and modulus arithmetic.

make_datetime_100 <- function(year, month, day, time, tz = "UTC") {
  make_datetime(year, month, day, time %/% 100, time %% 100, 0, tz)
}

flights %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  )  
## # A tibble: 328,063 × 19
##     year month   day dep_time            sched_dep_time      dep_delay
##    <int> <int> <int> <dttm>              <dttm>                  <dbl>
##  1  2013     1     1 2013-01-01 05:17:00 2013-01-01 05:15:00         2
##  2  2013     1     1 2013-01-01 05:33:00 2013-01-01 05:29:00         4
##  3  2013     1     1 2013-01-01 05:42:00 2013-01-01 05:40:00         2
##  4  2013     1     1 2013-01-01 05:44:00 2013-01-01 05:45:00        -1
##  5  2013     1     1 2013-01-01 05:54:00 2013-01-01 06:00:00        -6
##  6  2013     1     1 2013-01-01 05:54:00 2013-01-01 05:58:00        -4
##  7  2013     1     1 2013-01-01 05:55:00 2013-01-01 06:00:00        -5
##  8  2013     1     1 2013-01-01 05:57:00 2013-01-01 06:00:00        -3
##  9  2013     1     1 2013-01-01 05:57:00 2013-01-01 06:00:00        -3
## 10  2013     1     1 2013-01-01 05:58:00 2013-01-01 06:00:00        -2
## # … with 328,053 more rows, and 13 more variables: arr_time <dttm>,
## #   sched_arr_time <dttm>, arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Here we define a function make_datetime_100(year, month, day, time) to create a date-time from a time stored in HHMM or HMM format.
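
The arithmetic inside make_datetime_100() can be checked on a few values. This sketch uses base R only, and the sample times are assumed examples.

```r
time <- c(517, 1004, 2359)   # assumed sample values in HMM / HHMM form

hour   <- time %/% 100       # integer division keeps the hour part
minute <- time %% 100        # the remainder is the minute part

hour     # 5 10 23
minute   # 17  4 59
```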

However, there is a problem with the operations above: they treat every time as UTC, while the departure and arrival times are actually recorded in the local time zone of each airport. So we have to look up the time zone of each destination airport, which is stored in the airports data set.

airports
## # A tibble: 1,458 × 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # … with 1,448 more rows

Therefore, we need to add the time zones of the origin and destination airports before we create the scheduled and actual departure/arrival times in date-time format.

airports1 <- airports %>%
  select(faa, tzone)

flights1 <- flights %>%
  left_join(airports1, by = c("dest" = "faa")) %>%
  rename("dest_tzone" = "tzone") %>%
  glimpse()
## Rows: 336,776
## Columns: 20
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
## $ dest_tzone     <chr> "America/Chicago", "America/Chicago", "America/New_York…

Here we first keep only the airport codes and their time zones from the airports data set, left join them to flights by matching dest to faa, and rename the new column to dest_tzone. We then do the same for origin to create origin_tzone:

flights1 %>%
  left_join(airports1, by = c("origin" = "faa")) %>%
  rename("origin_tzone" = "tzone") -> flights1

glimpse(flights1)
## Rows: 336,776
## Columns: 21
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
## $ dest_tzone     <chr> "America/Chicago", "America/Chicago", "America/New_York…
## $ origin_tzone   <chr> "America/New_York", "America/New_York", "America/New_Yo…

As expected, all origin time zones should be America/New_York.

Now we are ready to create date-times:

flights_dt <- flights1 %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time, tz = origin_tzone),
    arr_time = make_datetime_100(year, month, day, arr_time, tz = dest_tzone),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time, tz = origin_tzone),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time, tz = dest_tzone)
  ) %>% 
  select(origin, dest, ends_with("delay"), ends_with("time"), air_time, origin_tzone, dest_tzone)

flights_dt
## # A tibble: 328,063 × 11
##    origin dest  dep_delay arr_delay dep_time            sched_dep_time     
##    <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
##  1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
##  2 LGA    IAH           4        20 2013-01-01 05:33:00 2013-01-01 05:29:00
##  3 JFK    MIA           2        33 2013-01-01 05:42:00 2013-01-01 05:40:00
##  4 JFK    BQN          -1       -18 2013-01-01 05:44:00 2013-01-01 05:45:00
##  5 LGA    ATL          -6       -25 2013-01-01 05:54:00 2013-01-01 06:00:00
##  6 EWR    ORD          -4        12 2013-01-01 05:54:00 2013-01-01 05:58:00
##  7 EWR    FLL          -5        19 2013-01-01 05:55:00 2013-01-01 06:00:00
##  8 LGA    IAD          -3       -14 2013-01-01 05:57:00 2013-01-01 06:00:00
##  9 JFK    MCO          -3        -8 2013-01-01 05:57:00 2013-01-01 06:00:00
## 10 LGA    ORD          -2         8 2013-01-01 05:58:00 2013-01-01 06:00:00
## # … with 328,053 more rows, and 5 more variables: arr_time <dttm>,
## #   sched_arr_time <dttm>, air_time <dbl>, origin_tzone <chr>, dest_tzone <chr>

We will work with this data set, flights_dt, in the rest of the chapter.
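
One caveat is worth keeping in mind: a date-time vector in R carries a single time zone attribute for the whole vector, so a tz argument supplied as a column may not be applied row by row. A possible alternative, sketched here under that assumption, is lubridate's force_tzs(), which interprets each clock time in its own zone and returns all results in one common zone. The times and zones below are assumed toy values, not taken from flights.

```r
library(lubridate)

# Two arrivals recorded as local clock times; parsing gives UTC placeholders
local_clock <- ymd_hms(c("2013-01-01 08:30:00", "2013-01-01 09:23:00"))
zones <- c("America/Chicago", "America/New_York")

# Interpret each clock time in its own zone, expressed in a common zone (UTC)
arr_utc <- force_tzs(local_clock, tzones = zones, tzone_out = "UTC")
arr_utc
```

With every instant expressed in the same zone, subtracting two such vectors gives the true elapsed time regardless of where each flight landed.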


Time difference, - and duration

We can compute the difference between two date-times using the subtraction operator, -.

td1 <- ymd_hms("2023-04-05 02:30:00") - ymd_hms("2023-04-05 01:20:00")
class(td1)
## [1] "difftime"

In R, when we subtract two dates or date-times we get a difftime object, which records a time span in seconds, minutes, hours, days, or weeks. Because the unit varies, this is not very convenient to work with, so we can instead use the duration class offered by lubridate, which always measures a time span in seconds.

as.duration(td1)
## [1] "4200s (~1.17 hours)"

Later we will need this as a numeric number of minutes. To convert it, we can do:

as.numeric(as.duration(td1))/60
## [1] 70
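
To see why the varying unit of a difftime can be inconvenient, the following base R sketch subtracts two pairs of date-times and inspects the unit that R picks. It also shows an alternative way to ask for minutes directly, using the units argument of as.numeric(). The timestamps are assumed examples.

```r
t1 <- as.POSIXct("2023-04-05 01:20:00", tz = "UTC")
t2 <- as.POSIXct("2023-04-05 02:30:00", tz = "UTC")
t3 <- as.POSIXct("2023-04-07 02:30:00", tz = "UTC")

units(t2 - t1)   # "hours": the span is over an hour but under a day
units(t3 - t1)   # "days": the span is over a day

# Ask for a specific unit regardless of what R picked
as.numeric(t2 - t1, units = "mins")   # 70
```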

Now we can compute the time difference between actual departure time and arrival time in flights_dt.

flights_dt %>%
  mutate(flight_time = as.numeric(as.duration(arr_time - dep_time))/60) -> flights_dt

glimpse(flights_dt)
## Rows: 328,063
## Columns: 12
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ dep_time       <dttm> 2013-01-01 05:17:00, 2013-01-01 05:33:00, 2013-01-01 0…
## $ sched_dep_time <dttm> 2013-01-01 05:15:00, 2013-01-01 05:29:00, 2013-01-01 0…
## $ arr_time       <dttm> 2013-01-01 08:30:00, 2013-01-01 08:50:00, 2013-01-01 0…
## $ sched_arr_time <dttm> 2013-01-01 08:19:00, 2013-01-01 08:30:00, 2013-01-01 0…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ origin_tzone   <chr> "America/New_York", "America/New_York", "America/New_Yo…
## $ dest_tzone     <chr> "America/Chicago", "America/Chicago", "America/New_York…
## $ flight_time    <dbl> 253, 257, 281, 320, 198, 166, 258, 132, 221, 175, 231, …

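In the rows shown, flight_time is longer than air_time, which is expected because flights need preparation time before taking off and after landing. The values deserve a check, though: row 3 (JFK to MIA, both in America/New_York) departs at 05:42 and arrives at 09:23, a gap of 221 minutes, yet flight_time shows 281. The extra 60 minutes suggests that the arrival times were not each interpreted in their own destination time zone.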


Comparison operator <, >

We can also compare two dates or two date-times. Here “greater than” means “later than”. For example:

ymd(20230405) > ymd(20230404)
## [1] TRUE

This is TRUE because Apr 5, 2023 is later than Apr 4, 2023.

ymd(20230405) < ymd_hms("2023-04-05 02:00:00")
## [1] TRUE

This is also TRUE because, when a date is compared with a date-time, the date is treated as midnight (00:00:00) at the start of that day.
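
If you prefer not to rely on the implicit conversion, you can make it explicit with as_datetime(). This is a minimal sketch of the same comparison.

```r
library(lubridate)

d <- ymd(20230405)

as_datetime(d)                                      # midnight, in UTC
as_datetime(d) == ymd_hms("2023-04-05 00:00:00")    # TRUE
as_datetime(d) <  ymd_hms("2023-04-05 02:00:00")    # TRUE
```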

With this, we can easily select all flights before or after a particular date. We can also arrange our data by time.

flights_dt %>%
  filter(dep_time > ymd("2013-06-01", tz = Sys.timezone())) %>%
  arrange(dep_time)
## # A tibble: 194,177 × 12
##    origin dest  dep_delay arr_delay dep_time            sched_dep_time     
##    <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
##  1 JFK    PSE           3        -9 2013-06-01 00:02:00 2013-06-01 23:59:00
##  2 EWR    CLT          -9       -16 2013-06-01 04:51:00 2013-06-01 05:00:00
##  3 EWR    IAH          -9       -45 2013-06-01 05:06:00 2013-06-01 05:15:00
##  4 LGA    IAH         -11       -29 2013-06-01 05:34:00 2013-06-01 05:45:00
##  5 JFK    BQN          -7         3 2013-06-01 05:38:00 2013-06-01 05:45:00
##  6 JFK    MIA          -1        -8 2013-06-01 05:39:00 2013-06-01 05:40:00
##  7 EWR    RSW         -14       -20 2013-06-01 05:46:00 2013-06-01 06:00:00
##  8 LGA    DFW          -9       -22 2013-06-01 05:51:00 2013-06-01 06:00:00
##  9 LGA    PHL          -8        -8 2013-06-01 05:52:00 2013-06-01 06:00:00
## 10 JFK    IAD          -7       -11 2013-06-01 05:53:00 2013-06-01 06:00:00
## # … with 194,167 more rows, and 6 more variables: arr_time <dttm>,
## #   sched_arr_time <dttm>, air_time <dbl>, origin_tzone <chr>,
## #   dest_tzone <chr>, flight_time <dbl>

Lab Exercise: See what happens if you leave out arrange() in the example above.


Getting components

We can pull out individual parts of the date with the accessor functions year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second().

datetime <- ymd_hms("2013-01-01 12:34:56")

year(datetime)
## [1] 2013
month(datetime)
## [1] 1
mday(datetime)
## [1] 1
yday(datetime)
## [1] 1
wday(datetime)
## [1] 3

Note that the first day of the week is Sunday by default, so the third day of the week is Tuesday.
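
If you would rather count from Monday, wday() accepts a week_start argument. A short sketch, using the same date as above:

```r
library(lubridate)

x <- ymd("2013-01-01")     # a Tuesday

wday(x)                    # 3: counting from Sunday
wday(x, week_start = 1)    # 2: counting from Monday
```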

For month() and wday() you can set label = TRUE to return the abbreviated name of the month or day of the week, which is an ordered factor. Set abbr = FALSE to return the full name.

month(datetime, label = TRUE)
## [1] Jan
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
wday(datetime, label = TRUE, abbr = FALSE)
## [1] Tuesday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
flights_dt %>%
  mutate(weekday = wday(dep_time, label = TRUE)) -> flights_dt

glimpse(flights_dt)
## Rows: 328,063
## Columns: 13
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ dep_time       <dttm> 2013-01-01 05:17:00, 2013-01-01 05:33:00, 2013-01-01 0…
## $ sched_dep_time <dttm> 2013-01-01 05:15:00, 2013-01-01 05:29:00, 2013-01-01 0…
## $ arr_time       <dttm> 2013-01-01 08:30:00, 2013-01-01 08:50:00, 2013-01-01 0…
## $ sched_arr_time <dttm> 2013-01-01 08:19:00, 2013-01-01 08:30:00, 2013-01-01 0…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ origin_tzone   <chr> "America/New_York", "America/New_York", "America/New_Yo…
## $ dest_tzone     <chr> "America/Chicago", "America/Chicago", "America/New_York…
## $ flight_time    <dbl> 253, 257, 281, 320, 198, 166, 258, 132, 221, 175, 231, …
## $ weekday        <ord> Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, …
levels(flights_dt$weekday)
## [1] "Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"

Now we can easily analyze data with respect to weekday behavior.

flights_dt %>%
  ggplot() + geom_bar(aes(weekday))

For example, we can study delay or air time for different weekdays.

flights_dt %>%
  filter(arr_delay >= 0) %>% # Filter out canceled flights and ahead-of-time flights
  group_by(weekday) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE), mean_air_time = mean(air_time, na.rm = TRUE)) %>%
  ggplot() + stat_summary(aes(x = weekday, y = mean_arr_delay), geom = "bar")

flights_dt %>%
  group_by(weekday) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE), mean_air_time = mean(air_time, na.rm = TRUE)) %>%
  ggplot() + stat_summary(aes(x = weekday, y = mean_air_time), geom = "bar")

So we see that weekdays have some effect on average delay but little effect on average air time.

We may also study which scheduled departure hour had the worst delays or the longest air time.

flights_dt %>%
  filter(!is.na(arr_delay)) %>% # Filter out canceled flights
  mutate(hour_time = hour(sched_dep_time)) %>%
  group_by(hour_time) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE), mean_air_time = mean(air_time, na.rm = TRUE)) %>%
  ggplot() + stat_summary(aes(x = hour_time, y = mean_arr_delay), geom = "bar")

So generally the earlier the scheduled departure time in the day, the less delay.

flights_dt %>%
  filter(!is.na(arr_delay)) %>% # Filter out canceled flights
  mutate(hour_time = hour(sched_dep_time)) %>%
  group_by(hour_time) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE), mean_air_time = mean(air_time, na.rm = TRUE)) %>%
  ggplot() + stat_summary(aes(x = hour_time, y = mean_air_time), geom = "bar")

This graph shows a very interesting pattern - flights departing between 10pm and 11pm are usually short ones while those departing after 11pm are usually long ones (overnight flights).


Rounding

An alternative approach to plotting individual components is to round the date to a nearby unit of time, with floor_date(), round_date(), and ceiling_date(). Each function takes a vector of dates to adjust and then the name of the unit to round down to (floor), round up to (ceiling), or round to.

# floor, round or ceiling of a number
floor(3.2)
## [1] 3
round(3.2)
## [1] 3
ceiling(3.2)
## [1] 4

With that in mind, it’s easier to understand rounding dates or date-times:

dt2 <- ymd_hms("2023-04-05 03:12:34pm", tz = Sys.timezone())

floor_date(dt2, "year")   # floor to the first day of the same year
## [1] "2023-01-01 EST"
floor_date(dt2, "month")  # floor to the first day of the same month
## [1] "2023-04-01 EDT"
floor_date(dt2, "week")   # floor to the first day of the same week (Sunday)
## [1] "2023-04-02 EDT"
floor_date(dt2, "day")    # floor to the same day
## [1] "2023-04-05 EDT"
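
The same idea applies to smaller units. Here is a short sketch comparing the three functions at the hour level, plus flooring to a multiple of a unit. The timestamp is an assumed example parsed as UTC.

```r
library(lubridate)

dt <- ymd_hms("2023-04-05 15:12:34")

floor_date(dt, "hour")         # 15:00:00, the start of the current hour
round_date(dt, "hour")         # 15:00:00, the nearest hour
ceiling_date(dt, "hour")       # 16:00:00, the start of the next hour
floor_date(dt, "10 minutes")   # 15:10:00, the start of the 10-minute block
```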


Lab Exercise: Run round_date(dt2, "week"). What do you get? Can you explain the result?


This, for example, allows us to plot the number of flights per week:

flights_dt %>% 
  count(week = floor_date(dep_time, "week")) %>% 
  ggplot(aes(week, n)) +
    geom_line()

Note that a date or date-time can be used on a continuous axis in a plot. For a date-time the unit is one second (for a date it is one day), so binwidth is given in seconds here. For example, to visualise the number of flights per day across the year, we can do:

flights_dt %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day

The earlier bar chart by weekday suggests that the regular sharp drops in the number of flights fall on Saturdays.


3. Time Spans


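It’s important to be able to work with time spans. Here we introduce three classes that represent them: durations, which measure an exact number of seconds; periods, which measure human units such as days and months; and intervals, which have a starting point and an ending point.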


Duration

As we mentioned above, durations always use seconds to measure a time span. There are a series of constructor functions:

dseconds(15)
## [1] "15s"
dminutes(10)
## [1] "600s (~10 minutes)"
dhours(c(12, 24))
## [1] "43200s (~12 hours)" "86400s (~1 days)"
ddays(0:5)
## [1] "0s"                "86400s (~1 days)"  "172800s (~2 days)"
## [4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"
dweeks(3)
## [1] "1814400s (~3 weeks)"
dyears(1)
## [1] "31557600s (~1 years)"

You can add, subtract and multiply durations:

2 * dyears(1)
## [1] "63115200s (~2 years)"
dyears(1) + dweeks(12) + dhours(15)
## [1] "38869200s (~1.23 years)"

You can add and subtract durations to and from a date-time:

dt3 <- ymd_h("2023-04-05 3pm", tz = Sys.timezone())
dt3 + dhours(1)
## [1] "2023-04-05 16:00:00 EDT"
dt3 - ddays(1)
## [1] "2023-04-04 15:00:00 EDT"

A problem with durations is that they represent an exact number of seconds, assuming 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, 7 days in a week, and 365.25 days in a year (the average, which is why dyears(1) prints as 31557600s). This may give unexpected results when leap years, daylight saving time, or leap seconds are involved.

dt3 - dyears(1)
## [1] "2022-04-05 09:00:00 EDT"

So we see that the result is a little odd when we subtract a duration of one year from the date-time: we land six hours earlier in the day than expected.
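
The six-hour shift can be traced back to the length of a duration year. The sketch below assumes a lubridate version in which dyears(1) prints as 31557600s, as in the output above, and fixes the time zone to America/New_York so that the result does not depend on the machine.

```r
library(lubridate)

# A duration year expressed in days
as.numeric(dyears(1)) / 86400   # 365.25

# The quarter day is the six hours that go missing
x <- ymd_h("2023-04-05 3pm", tz = "America/New_York")
x - dyears(1)                   # 09:00 on 2022-04-05, not 15:00
```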


Periods

To solve this problem, lubridate provides periods. Periods are time spans that don’t have a fixed length in seconds; instead they work with “human” times, like days and months. That allows them to work in a more intuitive way:

dt3 - years(1)
## [1] "2022-04-05 15:00:00 EDT"

So we see that this result is consistent with “human” understanding about the meaning of “one year ago”.
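
Daylight saving time shows the same contrast between durations and periods. In this sketch the date is an assumed example: in America/New_York, clocks moved forward one hour on 2016-03-13.

```r
library(lubridate)

one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

one_pm + ddays(1)   # duration: exactly 86400 seconds later, 14:00 on the clock
one_pm + days(1)    # period: same clock time on the next day, 13:00
```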

Like durations, periods can be created with a number of friendly constructor functions.

seconds(15)
## [1] "15S"
minutes(10)
## [1] "10M 0S"
hours(c(12, 24))
## [1] "12H 0M 0S" "24H 0M 0S"
days(7)
## [1] "7d 0H 0M 0S"
months(1:6)
## [1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
## [5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
weeks(3)
## [1] "21d 0H 0M 0S"
years(1)
## [1] "1y 0m 0d 0H 0M 0S"

Let’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination before they departed from New York City.

flights_dt %>% 
  filter(arr_time < dep_time) 
## # A tibble: 10,633 × 13
##    origin dest  dep_delay arr_delay dep_time            sched_dep_time     
##    <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
##  1 EWR    BQN           9        -4 2013-01-01 19:29:00 2013-01-01 19:20:00
##  2 JFK    DFW          59        NA 2013-01-01 19:39:00 2013-01-01 18:40:00
##  3 EWR    TPA          -2         9 2013-01-01 20:58:00 2013-01-01 21:00:00
##  4 EWR    SJU          -6       -12 2013-01-01 21:02:00 2013-01-01 21:08:00
##  5 EWR    SFO          11       -14 2013-01-01 21:08:00 2013-01-01 20:57:00
##  6 LGA    FLL         -10        -2 2013-01-01 21:20:00 2013-01-01 21:30:00
##  7 EWR    MCO          41        43 2013-01-01 21:21:00 2013-01-01 20:40:00
##  8 JFK    LAX          -7       -24 2013-01-01 21:28:00 2013-01-01 21:35:00
##  9 EWR    FLL          49        28 2013-01-01 21:34:00 2013-01-01 20:45:00
## 10 EWR    FLL          -9       -14 2013-01-01 21:36:00 2013-01-01 21:45:00
## # … with 10,623 more rows, and 7 more variables: arr_time <dttm>,
## #   sched_arr_time <dttm>, air_time <dbl>, origin_tzone <chr>,
## #   dest_tzone <chr>, flight_time <dbl>, weekday <ord>

These are overnight flights. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding days(1) to the arrival time of each overnight flight.

flights_dt <- flights_dt %>% 
  mutate(
    overnight = arr_time < dep_time,
    arr_time = arr_time + days(overnight * 1),
    sched_arr_time = sched_arr_time + days(overnight * 1)
  )

Now all of our flights obey the laws of physics.

flights_dt %>% 
  filter(overnight, arr_time < dep_time) 
## # A tibble: 0 × 14
## # … with 14 variables: origin <chr>, dest <chr>, dep_delay <dbl>,
## #   arr_delay <dbl>, dep_time <dttm>, sched_dep_time <dttm>, arr_time <dttm>,
## #   sched_arr_time <dttm>, air_time <dbl>, origin_tzone <chr>,
## #   dest_tzone <chr>, flight_time <dbl>, weekday <ord>, overnight <lgl>
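
The overnight fix can be followed on a toy example. The two departure and arrival times below are assumed values, not rows from flights_dt.

```r
library(lubridate)

dep <- ymd_hms(c("2013-01-01 21:02:00", "2013-01-01 09:00:00"))
arr <- ymd_hms(c("2013-01-01 00:45:00", "2013-01-01 11:30:00"))

overnight <- arr < dep                   # TRUE FALSE
arr_fixed <- arr + days(overnight * 1)   # add one day only where overnight is TRUE

arr_fixed
```

Multiplying the logical vector by 1 turns TRUE/FALSE into 1/0, so only the overnight arrival is shifted to the following day.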


Intervals

Periods also have their problems. For example, it’s clear what dyears(1) / ddays(365.25) should return: one, because durations are always represented by a number of seconds, and a duration of a year is defined as 365.25 days’ worth of seconds (the 31557600s shown earlier).

What should years(1) / days(1) return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There’s not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate:

years(1) / days(1)
## [1] 365.25

If you want a more accurate measurement, you’ll have to use an interval. An interval is a duration with a starting point and an ending point: that makes it precise so you can determine exactly how long it is:

next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
## [1] 366

Here today() %--% next_year is an interval; the %--% operator creates one from two dates or date-times.
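
Because the result above depends on the day the code is run, a version with fixed dates may be easier to check. This sketch uses assumed example years, one ordinary and one leap year.

```r
library(lubridate)

# 2023 is not a leap year
(ymd("2023-01-01") %--% ymd("2024-01-01")) / ddays(1)   # 365

# 2024 is a leap year
(ymd("2024-01-01") %--% ymd("2025-01-01")) / ddays(1)   # 366
```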


Lab Exercises:

  1. Find out how many days have passed since you were born.
  2. Find out (approximately) how many seconds have passed since you were born.