Lecture 15 - Date and Time

Introduction

Dates and times are a very common and important type of data. For example, the flights data set records the scheduled departure time, actual departure time, scheduled arrival time, and actual arrival time of each flight. It also has a time_hour column that records the scheduled date and hour (with minutes ignored) in a date-time format.

glimpse(flights)

## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

At first glance, dates and times seem simple. You use them all the time in your regular life, and they don’t seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get. To warm up, try these three seemingly simple questions:

Does every year have 365 days?
Does every day have 24 hours?
Does every minute have 60 seconds?

Answer

The answer is “no” for all three questions because

A leap year has 366 days.
Some days have 23 or 25 hours due to daylight saving time (DST). For example, in New York State we have summer time and winter time.
Some minutes have 61 seconds: every now and then a leap second is added because the Earth’s rotation is gradually slowing down.
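The first two answers can be checked directly in R. The following is a minimal sketch: day_hours() is a helper name made up for this illustration, and the two dates are the 2013 daylight saving transitions in New York.

```r
library(lubridate)

# Leap years: 2013 is not one, 2016 is.
leap_year(c(2013, 2016))

# Length of a calendar day in hours, measured from local midnight to the
# next local midnight (hypothetical helper, for illustration only).
day_hours <- function(date, tz = "America/New_York") {
  start <- ymd(date, tz = tz)
  end <- start + days(1)
  as.numeric(difftime(end, start, units = "hours"))
}

day_hours("2013-03-10")  # clocks moved forward on this day
day_hours("2013-11-03")  # clocks moved back on this day
```

The two calls to day_hours() should give 23 and 25 respectively.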

The physics behind measuring time is indeed complicated. Here we will focus on establishing a solid grounding of practical skills that will help us with common data analysis challenges.

Load Libraries

This chapter will focus on the lubridate package, which makes it easier to work with dates and times in R. Since tidyverse 2.0.0, lubridate is attached along with the core tidyverse; with older versions you have to load it yourself, so the explicit library(lubridate) call below is harmless either way. We will also need nycflights13 for practice data.

library(tidyverse)
library(nycflights13)
library(lubridate)

Creating date/times

There are three types of date/time data that refer to an instant in time:

A date. Tibbles print this as <date>.
A time within a day. Tibbles print this as <time>.
A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Elsewhere in R these are called POSIXct. The name is not very descriptive, but it is worth knowing that it stands for a date-time.

Here we are only going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.

You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.
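As a small illustration of the two types, the sketch below builds one value of each with base R constructors and then drops the time of day with lubridate's as_date(). The variable names d and dt are just for this example.

```r
library(lubridate)

d  <- as.Date("2017-01-31")                           # a date
dt <- as.POSIXct("2017-01-31 20:11:59", tz = "UTC")   # a date-time

class(d)   # "Date"
class(dt)  # "POSIXct" "POSIXt"

# If the time of day is not needed, keep only the date part
as_date(dt)
```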

Get current date and time

To get the current date or date-time you can use today() or now():

today()

## [1] "2025-03-28"

now()

## [1] "2025-03-28 08:08:32 EDT"

Note that today() and now() may give different results with different time zones.

today(tzone = "PRC")

## [1] "2025-03-28"

now(tzone = "UTC")

## [1] "2025-03-28 12:08:32 UTC"

You can find out what R thinks your current time zone is with Sys.timezone():

Sys.timezone()

## [1] "America/New_York"

To see the complete list of all time zone names, use OlsonNames():

length(OlsonNames())

## [1] 597

head(OlsonNames())

## [1] "Africa/Abidjan"     "Africa/Accra"       "Africa/Addis_Ababa"
## [4] "Africa/Algiers"     "Africa/Asmara"      "Africa/Asmera"

Most time zone names have the form region/city, such as America/New_York. Including the region avoids ambiguity between cities that share a name.
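A short sketch of how these names are typically used: checking that a name is valid, and showing one instant on the clocks of two zones with with_tz(). The timestamp is an arbitrary example value.

```r
library(lubridate)

# Check that a name is a valid time zone before using it
"America/New_York" %in% OlsonNames()

# The same instant shown on clocks in two different zones
x <- ymd_hms("2025-03-28 12:08:32", tz = "UTC")
with_tz(x, "America/New_York")
with_tz(x, "Asia/Shanghai")
```

with_tz() changes only how the instant is displayed; the underlying point in time stays the same.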

Create date/time from strings

Date/time data often comes as strings. lubridate offers convenient functions that automatically work out the format once you specify the order of the component. To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the lubridate function that will parse your date into the form of yyyy-mm-dd. For example:

ymd("2017-01-31")

## [1] "2017-01-31"

mdy("January 31st, 2017")

## [1] "2017-01-31"

dmy("31-Jan-2017")

## [1] "2017-01-31"

These functions also take unquoted numbers. This is the most concise way to create a single date/time object, as you might need when filtering date/time data.

ymd(20170131)

## [1] "2017-01-31"

ymd() and friends create dates. To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function:

ymd_hms("2017-01-31 20:11:59")

## [1] "2017-01-31 20:11:59 UTC"

mdy_hm("01/31/2017 08:01")

## [1] "2017-01-31 08:01:00 UTC"

By default, ymd() and similar functions return a date, which has no time zone. But we can also force the creation of a date-time from a date by supplying a time zone:

ymd(20170131, tz = "UTC")

## [1] "2017-01-31 UTC"
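The date-time parsers accept the same tz argument. As a sketch, reading one clock time in two different zones gives two different instants (the names utc and ny are only for this example):

```r
library(lubridate)

# The same clock time, read in two zones
utc <- ymd_hms("2017-01-31 20:11:59")                           # UTC by default
ny  <- ymd_hms("2017-01-31 20:11:59", tz = "America/New_York")

tz(utc)
tz(ny)

# New York is behind UTC, so the New York reading is the later instant
difftime(ny, utc, units = "hours")
```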

Lab Exercise

What happens if you parse a string that contains invalid dates?

ymd(c("2010-10-10", "bananas"))

Use the appropriate lubridate function to parse each of the following dates:

d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014

Create date/time from individual components

Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:

flights %>% 
  select(year, month, day, hour, minute)

## # A tibble: 336,776 × 5
##     year month   day  hour minute
##    <int> <int> <int> <dbl>  <dbl>
##  1  2013     1     1     5     15
##  2  2013     1     1     5     29
##  3  2013     1     1     5     40
##  4  2013     1     1     5     45
##  5  2013     1     1     6      0
##  6  2013     1     1     5     58
##  7  2013     1     1     6      0
##  8  2013     1     1     6      0
##  9  2013     1     1     6      0
## 10  2013     1     1     6      0
## # ℹ 336,766 more rows

To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times. make_date() takes up to three arguments: year, month, and day. Any argument you leave out falls back to its default, and together the defaults give January 1st, 1970, the so-called “Unix Epoch”.

date1 <- make_date(2023, 4, 5)
class(date1)

## [1] "Date"
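A quick sketch of the defaults and of vectorised use:

```r
library(lubridate)

make_date()               # all defaults: 1970-01-01
make_date(2023)           # month and day default to 1
make_date(2023, 1:3, 15)  # vectorised: the 15th of January, February and March
```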

The `make_datetime()` function

make_datetime() takes up to seven arguments: year, month, day, hour, min, sec, and tz (time zone). The default value is 1970-01-01 00:00:00 UTC.

flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure_date = make_date(year, month, day)) %>%
  mutate(departure_scheduled = make_datetime(year, month, day, hour, minute, tz = "America/New_York"))

## # A tibble: 336,776 × 7
##     year month   day  hour minute departure_date departure_scheduled
##    <int> <int> <int> <dbl>  <dbl> <date>         <dttm>             
##  1  2013     1     1     5     15 2013-01-01     2013-01-01 05:15:00
##  2  2013     1     1     5     29 2013-01-01     2013-01-01 05:29:00
##  3  2013     1     1     5     40 2013-01-01     2013-01-01 05:40:00
##  4  2013     1     1     5     45 2013-01-01     2013-01-01 05:45:00
##  5  2013     1     1     6      0 2013-01-01     2013-01-01 06:00:00
##  6  2013     1     1     5     58 2013-01-01     2013-01-01 05:58:00
##  7  2013     1     1     6      0 2013-01-01     2013-01-01 06:00:00
##  8  2013     1     1     6      0 2013-01-01     2013-01-01 06:00:00
##  9  2013     1     1     6      0 2013-01-01     2013-01-01 06:00:00
## 10  2013     1     1     6      0 2013-01-01     2013-01-01 06:00:00
## # ℹ 336,766 more rows

Note that the default time zone for make_datetime() is UTC. The departure times are recorded in local time, which is America/New_York because all the flights departed from New York City, so we pass the time zone explicitly.
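To see what the tz argument changes, the sketch below builds a winter and a summer date-time in New York and prints the zone abbreviation in effect on each date. The names jan and jul are only for this example, and the abbreviations printed by format() can depend on the platform.

```r
library(lubridate)

tz(make_datetime(2013, 1, 1, 5, 15))  # default zone when tz is omitted

jan <- make_datetime(2013, 1, 1, 5, 15, tz = "America/New_York")
jul <- make_datetime(2013, 7, 1, 5, 15, tz = "America/New_York")

format(jan, "%Z")  # abbreviation in winter
format(jul, "%Z")  # abbreviation in summer
```

A region name such as America/New_York follows daylight saving time automatically, which a fixed abbreviation such as EST does not.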

Write a function

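We can also get the hours and minutes from dep_time or arr_time, which are stored as numbers such as 517 (meaning 05:17), using integer division (%/%) and the remainder operator (%%).

# Build a date-time from a clock time stored as a number in HHMM or HMM form.
# The default zone follows daylight saving time, unlike the fixed offset "EST".
make_datetime_100 <- function(year, month, day, time, tz = "America/New_York") {
  make_datetime(year, month, day, time %/% 100, time %% 100, 0, tz)
}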

flights %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  )

## # A tibble: 328,063 × 19
##     year month   day dep_time            sched_dep_time      dep_delay
##    <int> <int> <int> <dttm>              <dttm>                  <dbl>
##  1  2013     1     1 2013-01-01 05:17:00 2013-01-01 05:15:00         2
##  2  2013     1     1 2013-01-01 05:33:00 2013-01-01 05:29:00         4
##  3  2013     1     1 2013-01-01 05:42:00 2013-01-01 05:40:00         2
##  4  2013     1     1 2013-01-01 05:44:00 2013-01-01 05:45:00        -1
##  5  2013     1     1 2013-01-01 05:54:00 2013-01-01 06:00:00        -6
##  6  2013     1     1 2013-01-01 05:54:00 2013-01-01 05:58:00        -4
##  7  2013     1     1 2013-01-01 05:55:00 2013-01-01 06:00:00        -5
##  8  2013     1     1 2013-01-01 05:57:00 2013-01-01 06:00:00        -3
##  9  2013     1     1 2013-01-01 05:57:00 2013-01-01 06:00:00        -3
## 10  2013     1     1 2013-01-01 05:58:00 2013-01-01 06:00:00        -2
## # ℹ 328,053 more rows
## # ℹ 13 more variables: arr_time <dttm>, sched_arr_time <dttm>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Here we define a function make_datetime_100(year, month, day, time) that creates a date-time from a time stored in HHMM or HMM format.
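The arithmetic inside the function can be checked on a few values (the vector time is made up for this example):

```r
time <- c(517, 1004, 2359)

time %/% 100  # hour part:   5 10 23
time %% 100   # minute part: 17  4 59
```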

Change the timezone

However, there is a problem with the operations above: arrival times are recorded in the local time zone of the destination airport. We therefore need the time zone of each destination airport, which is stored in the airports data set.

airports

## # A tibble: 1,458 × 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # ℹ 1,448 more rows

Therefore, we need to add the time zones of the origin and destination airports before we create the scheduled and actual departure/arrival times in date-time format.

airports1 <- airports %>%
  select(faa, tzone)

flights1 <- flights %>%
  left_join(airports1, by = c("dest" = "faa")) %>%
  rename("dest_tzone" = "tzone") %>%
  glimpse()

## Rows: 336,776
## Columns: 20
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
## $ dest_tzone     <chr> "America/Chicago", "America/Chicago", "America/New_York…

Here we first keep only the airport codes and their time zones from the airports data set, then left-join them to flights by matching dest to faa, and finally rename the new column to dest_tzone, since we will also create origin_tzone:

flights1 %>%
  left_join(airports1, by = c("origin" = "faa")) %>%
  rename("origin_tzone" = "tzone") -> flights1

glimpse(flights1)

## Rows: 336,776
## Columns: 21
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
## $ dest_tzone     <chr> "America/Chicago", "America/Chicago", "America/New_York…
## $ origin_tzone   <chr> "America/New_York", "America/New_York", "America/New_Yo…

As expected, the origin time zones are all America/New_York.

Now we are ready to create date-times:

flights_dt <- flights1 %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time, tz = origin_tzone),
    arr_time = make_datetime_100(year, month, day, arr_time, tz = dest_tzone),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time, tz = origin_tzone),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time, tz = dest_tzone)
  ) %>% 
  select(origin, dest, ends_with("delay"), ends_with("time"), air_time, origin_tzone, dest_tzone)

flights_dt

## # A tibble: 328,063 × 11
##    origin dest  dep_delay arr_delay dep_time            sched_dep_time     
##    <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
##  1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
##  2 LGA    IAH           4        20 2013-01-01 05:33:00 2013-01-01 05:29:00
##  3 JFK    MIA           2        33 2013-01-01 05:42:00 2013-01-01 05:40:00
##  4 JFK    BQN          -1       -18 2013-01-01 05:44:00 2013-01-01 05:45:00
##  5 LGA    ATL          -6       -25 2013-01-01 05:54:00 2013-01-01 06:00:00
##  6 EWR    ORD          -4        12 2013-01-01 05:54:00 2013-01-01 05:58:00
##  7 EWR    FLL          -5        19 2013-01-01 05:55:00 2013-01-01 06:00:00
##  8 LGA    IAD          -3       -14 2013-01-01 05:57:00 2013-01-01 06:00:00
##  9 JFK    MCO          -3        -8 2013-01-01 05:57:00 2013-01-01 06:00:00
## 10 LGA    ORD          -2         8 2013-01-01 05:58:00 2013-01-01 06:00:00
## # ℹ 328,053 more rows
## # ℹ 5 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>,
## #   origin_tzone <chr>, dest_tzone <chr>

We will work with this data set, flights_dt, in the rest of the chapter.
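One caveat is worth keeping in mind: a date-time vector in R carries a single time zone for the whole vector, so a column of mixed zone names passed as tz may not be applied row by row. If per-row zones matter, lubridate's force_tzs() is one way to handle them. The sketch below uses two made-up flights (clock times and zones chosen for illustration) and converts each local clock time to a common zone before subtracting.

```r
library(lubridate)

# Local clock times, first parsed as if they were UTC
dep_local <- ymd_hm(c("2013-01-01 05:17", "2013-01-01 05:42"))
arr_local <- ymd_hm(c("2013-01-01 08:30", "2013-01-01 09:23"))

dep_zones <- c("America/New_York", "America/New_York")
arr_zones <- c("America/Chicago", "America/New_York")

# force_tzs() reads each clock time in its own zone and returns the
# matching instants in one common output zone (UTC by default)
dep_utc <- force_tzs(dep_local, tzones = dep_zones)
arr_utc <- force_tzs(arr_local, tzones = arr_zones)

# Elapsed minutes between departure and arrival
elapsed <- as.numeric(difftime(arr_utc, dep_utc, units = "mins"))
elapsed
```

The first flight crosses one time zone, so its elapsed time is an hour longer than the difference of the clock times; the second stays in the same zone.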

Time difference, `-` and duration

We can compute the difference between two date-times using the subtraction operator -.

td1 <- ymd_hms("2023-04-05 02:30:00") - ymd_hms("2023-04-05 01:20:00")
class(td1)

## [1] "difftime"

In R, subtracting two dates or date-times gives a difftime object, which records a time span in seconds, minutes, hours, days, or weeks. Because the unit varies, a difftime can be awkward to work with, so we may use the duration class offered by lubridate instead, which always measures a time span in exact seconds.

as.duration(td1)

## [1] "4200s (~1.17 hours)"

Later we will need this as a number of minutes, which we can get with:

as.numeric(as.duration(td1))/60

## [1] 70
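The varying unit of a difftime can be seen directly. In this sketch the same span is read as a bare number, first in whatever unit R picked and then explicitly in minutes through the units argument of as.numeric(); td is the same difference as td1, recreated so the example is self-contained.

```r
library(lubridate)

td <- ymd_hms("2023-04-05 02:30:00") - ymd_hms("2023-04-05 01:20:00")

units(td)                       # the unit R picked automatically
as.numeric(td)                  # a bare number in that unit
as.numeric(td, units = "mins")  # the same span, explicitly in minutes
```

Asking for the unit explicitly avoids surprises when some differences come back in minutes and others in hours.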

Now we can compute the time difference between actual departure time and arrival time in flights_dt.

flights_dt %>%
  mutate(flight_time = as.numeric(as.duration(arr_time - dep_time))/60) -> flights_dt

glimpse(flights_dt)

## Rows: 328,063
## Columns: 12
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ dep_time       <dttm> 2013-01-01 05:17:00, 2013-01-01 05:33:00, 2013-01-01 0…
## $ sched_dep_time <dttm> 2013-01-01 05:15:00, 2013-01-01 05:29:00, 2013-01-01 0…
## $ arr_time       <dttm> 2013-01-01 08:30:00, 2013-01-01 08:50:00, 2013-01-01 0…
## $ sched_arr_time <dttm> 2013-01-01 08:19:00, 2013-01-01 08:30:00, 2013-01-01 0…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ origin_tzone   <chr> "America/New_York", "America/New_York", "America/New_Yo…
## $ dest_tzone     <chr> "America/Chicago", "America/Chicago", "America/New_York…
## $ flight_time    <dbl> 253, 257, 281, 320, 198, 166, 258, 132, 221, 175, 231, …

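We see that the flight time is typically longer than the air time, since it also includes the time spent on the ground before takeoff and after landing. A gap of two hours, as in the third row (281 versus 160 minutes for JFK to MIA), is too large to be ground time, though, and suggests checking whether the destination time zone was applied correctly.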

Comparison operator `<`, `>`

We can also compare two dates or two date-times. “greater than” here refers to “later than”. For example

ymd(20230405) > ymd(20230404)

## [1] TRUE

This is TRUE because Apr 5, 2023 is later than Apr 4, 2023.

ymd(20230405) < ymd_hms("2023-04-05 02:00:00")

## [1] TRUE

This is also TRUE because, when a date is compared with a date-time, the date is treated as midnight at the start of that day.

With this, we can easily get all flights before or after a particular date. We can also arrange our data by time.

flights_dt %>%
  filter(dep_time > ymd("2013-06-01", tz = "America/New_York")) %>%
  arrange(dep_time)

## # A tibble: 194,177 × 12
##    origin dest  dep_delay arr_delay dep_time            sched_dep_time     
##    <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
##  1 JFK    PSE           3        -9 2013-06-01 00:02:00 2013-06-01 23:59:00
##  2 EWR    CLT          -9       -16 2013-06-01 04:51:00 2013-06-01 05:00:00
##  3 EWR    IAH          -9       -45 2013-06-01 05:06:00 2013-06-01 05:15:00
##  4 LGA    IAH         -11       -29 2013-06-01 05:34:00 2013-06-01 05:45:00
##  5 JFK    BQN          -7         3 2013-06-01 05:38:00 2013-06-01 05:45:00
##  6 JFK    MIA          -1        -8 2013-06-01 05:39:00 2013-06-01 05:40:00
##  7 EWR    RSW         -14       -20 2013-06-01 05:46:00 2013-06-01 06:00:00
##  8 LGA    DFW          -9       -22 2013-06-01 05:51:00 2013-06-01 06:00:00
##  9 LGA    PHL          -8        -8 2013-06-01 05:52:00 2013-06-01 06:00:00
## 10 JFK    IAD          -7       -11 2013-06-01 05:53:00 2013-06-01 06:00:00
## # ℹ 194,167 more rows
## # ℹ 6 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>,
## #   origin_tzone <chr>, dest_tzone <chr>, flight_time <dbl>

Lab Exercise

See what happens if you leave out arrange() in the example above.

Getting components of date/date-time

We can pull out individual parts of the date with the accessor functions year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second().

datetime <- ymd_hms("2013-01-01 12:34:56")

year(datetime)

## [1] 2013

month(datetime)

## [1] 1

mday(datetime)

## [1] 1

yday(datetime)

## [1] 1

wday(datetime)

## [1] 3

Note that the week starts on Sunday by default, so the third day of the week is Tuesday.
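If you prefer weeks that start on Monday, wday() has a week_start argument. A small sketch:

```r
library(lubridate)

datetime <- ymd_hms("2013-01-01 12:34:56")  # a Tuesday

wday(datetime)                  # week starts on Sunday by default
wday(datetime, week_start = 1)  # week starts on Monday
```

With Monday as the first day, Tuesday becomes day 2 instead of day 3.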

Get labels for month and weekdays

For month() and wday() you can set label = TRUE to return the abbreviated name of the month or day of the week, which is returned as an ordered factor. Set abbr = FALSE to return the full name.

month(datetime, label = TRUE)

## [1] Jan
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

wday(datetime, label = TRUE, abbr = FALSE)

## [1] Tuesday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

flights_dt %>%
  mutate(weekday = wday(dep_time, label = TRUE)) -> flights_dt

glimpse(flights_dt)

## Rows: 328,063
## Columns: 13
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ dep_time       <dttm> 2013-01-01 05:17:00, 2013-01-01 05:33:00, 2013-01-01 0…
## $ sched_dep_time <dttm> 2013-01-01 05:15:00, 2013-01-01 05:29:00, 2013-01-01 0…
## $ arr_time       <dttm> 2013-01-01 08:30:00, 2013-01-01 08:50:00, 2013-01-01 0…
## $ sched_arr_time <dttm> 2013-01-01 08:19:00, 2013-01-01 08:30:00, 2013-01-01 0…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ origin_tzone   <chr> "America/New_York", "America/New_York", "America/New_Yo…
## $ dest_tzone     <chr> "America/Chicago", "America/Chicago", "America/New_York…
## $ flight_time    <dbl> 253, 257, 281, 320, 198, 166, 258, 132, 221, 175, 231, …
## $ weekday        <ord> Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, …

levels(flights_dt$weekday)

## [1] "Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"

Example of weekday analysis

Now we can easily analyze data with respect to weekday behavior.

flights_dt %>%
  ggplot() + geom_bar(aes(weekday))

For example, we can study delay or air time for different weekdays.

# Mean arrival delay by weekday, among flights that arrived on time or late
flights_dt %>%
  filter(arr_delay >= 0) %>% # drops early arrivals and rows with a missing arr_delay
  group_by(weekday) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  ggplot() + geom_col(aes(x = weekday, y = mean_arr_delay))

# Mean air time by weekday, over all flights
flights_dt %>%
  group_by(weekday) %>%
  summarise(mean_air_time = mean(air_time, na.rm = TRUE)) %>%
  ggplot() + geom_col(aes(x = weekday, y = mean_air_time))

So we see that weekdays have some effect on average delay but little effect on average air time.

Lab Exercise

Find out which departure hour had

the worst arrival delay
the longest/shortest air time.

Rounding of date and time

An alternative approach to plotting individual components is to round the date to a nearby unit of time with floor_date(), round_date(), and ceiling_date(). Each function takes a vector of dates to adjust and the name of the unit to round down to (floor), round up to (ceiling), or round to the nearest of (round).

# floor, round or ceiling of a number
floor(3.2)

## [1] 3

round(3.2)

## [1] 3

ceiling(3.2)

## [1] 4

With that in mind, rounding dates and date-times works the same way:

dt2 <- ymd_hms("2023-04-05 03:12:34pm", tz = "America/New_York")

floor_date(dt2, "year")   # floor to the first day of the same year

## [1] "2023-01-01 EST"

floor_date(dt2, "month")  # floor to the first day of the same month

## [1] "2023-04-01 EDT"

floor_date(dt2, "week")   # floor to the first day of the same week (Sunday)

## [1] "2023-04-02 EDT"

floor_date(dt2, "day")    # floor to the same day

## [1] "2023-04-05 EDT"
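The other two functions, and units with a multiplier, work the same way. The sketch below uses a UTC timestamp (the name dt is only for this example) and rounds to the hour and to a quarter of an hour.

```r
library(lubridate)

dt <- ymd_hms("2023-04-05 15:12:34")

floor_date(dt, "hour")        # 15:00:00
round_date(dt, "hour")        # 15:00:00, since 12 minutes is less than half an hour
ceiling_date(dt, "hour")      # 16:00:00
round_date(dt, "15 minutes")  # 15:15:00, the nearest quarter of an hour
```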

Lab Exercise

Run round_date(dt2, "week"). What do you get? Can you explain the result?

Example of using rounding

This, for example, allows us to plot the number of flights per week:

flights_dt %>% 
  count(week = floor_date(dep_time, "week")) %>% 
  ggplot(aes(week, n)) +
    geom_line()

Note that a date or date-time can be used as a continuous axis in a plot. For date-times the underlying unit is seconds, so a bin width is given in seconds. For example, to visualise the number of flights per day across the year, we can do:

flights_dt %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day

As the weekday bar chart suggested earlier, the regular sharp drops in the number of flights fall on Saturdays.

Time Spans

Next we look at time spans, that is, lengths of time rather than instants. Here we introduce three classes that represent them:

durations, which represent an exact number of seconds.
periods, which represent human units like weeks and months.
intervals, which represent a starting and ending point.

Duration

As we mentioned above, durations always use seconds to measure a time span. There are a series of constructor functions:

dseconds(15)

## [1] "15s"

dminutes(10)

## [1] "600s (~10 minutes)"

dhours(c(12, 24))

## [1] "43200s (~12 hours)" "86400s (~1 days)"

ddays(0:5)

## [1] "0s"                "86400s (~1 days)"  "172800s (~2 days)"
## [4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"

dweeks(3)

## [1] "1814400s (~3 weeks)"

dyears(1)

## [1] "31557600s (~1 years)"

You can add, subtract and multiply durations:

2 * dyears(1)

## [1] "63115200s (~2 years)"

dyears(1) + dweeks(12) + dhours(15)

## [1] "38869200s (~1.23 years)"

You can add and subtract durations to and from a date-time:

dt3 <- ymd_h("2023-04-05 3pm", tz = "America/New_York")
dt3 + dhours(1)

## [1] "2023-04-05 16:00:00 EDT"

dt3 - ddays(1)

## [1] "2023-04-04 15:00:00 EDT"

A problem with durations is that they represent an exact number of seconds, assuming 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, 7 days in a week, and 365.25 days in a year. This may give unexpected results when leap years, daylight saving time, or leap seconds are involved.

dt3 - dyears(1)

## [1] "2022-04-05 09:00:00 EDT"

So the result is a little odd when we subtract a duration of one year from dt3: a duration of one year is 365.25 days, so we land six hours earlier than the same clock time one year before.

Periods

To solve this problem, lubridate provides periods. Periods are time spans that don’t have a fixed length in seconds; instead they work with “human” times, like days and months. That allows them to work in a more intuitive way:

dt3 - years(1)

## [1] "2022-04-05 15:00:00 EDT"

So we see that this result is consistent with “human” understanding about the meaning of “one year ago”.
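The same contrast shows up across a daylight saving change. As a sketch, take the afternoon before clocks moved forward in New York in 2013 and add one day, once as a duration and once as a period (x is an example value):

```r
library(lubridate)

x <- ymd_hms("2013-03-09 13:00:00", tz = "America/New_York")

x + ddays(1)  # exactly 86400 seconds later: 14:00 on March 10
x + days(1)   # the same clock time on the next day: 13:00 on March 10
```

The duration lands an hour later on the clock because March 10 had only 23 hours, while the period keeps the clock time.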

Like durations, periods can be created with a number of friendly constructor functions.

seconds(15)

## [1] "15S"

minutes(10)

## [1] "10M 0S"

hours(c(12, 24))

## [1] "12H 0M 0S" "24H 0M 0S"

days(7)

## [1] "7d 0H 0M 0S"

months(1:6)

## [1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
## [5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"

weeks(3)

## [1] "21d 0H 0M 0S"

years(1)

## [1] "1y 0m 0d 0H 0M 0S"

Let’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination before they departed from New York City.

flights_dt %>% 
  filter(arr_time < dep_time) %>%
  select(dep_time, arr_time, flight_time)

## # A tibble: 10,633 × 3
##    dep_time            arr_time            flight_time
##    <dttm>              <dttm>                    <dbl>
##  1 2013-01-01 19:29:00 2013-01-01 00:03:00       -1106
##  2 2013-01-01 19:39:00 2013-01-01 00:29:00       -1090
##  3 2013-01-01 20:58:00 2013-01-01 00:08:00       -1190
##  4 2013-01-01 21:02:00 2013-01-01 01:46:00       -1096
##  5 2013-01-01 21:08:00 2013-01-01 00:25:00       -1183
##  6 2013-01-01 21:20:00 2013-01-01 00:16:00       -1204
##  7 2013-01-01 21:21:00 2013-01-01 00:06:00       -1215
##  8 2013-01-01 21:28:00 2013-01-01 00:26:00       -1202
##  9 2013-01-01 21:34:00 2013-01-01 00:20:00       -1214
## 10 2013-01-01 21:36:00 2013-01-01 00:25:00       -1211
## # ℹ 10,623 more rows

These are overnight flights that crossed midnight, so the arrival falls on the day after departure. We used the same date information for both the departure and the arrival times, but these flights arrived on the following day. We can fix this by adding days(1) to the arrival time of each overnight flight.

flights_dt <- flights_dt %>% 
  mutate(
    day_change = arr_time < dep_time,
    arr_time = arr_time + days(day_change * 1),
    sched_arr_time = sched_arr_time + days(day_change * 1), # assumes the scheduled arrival also fell on the next day; may be wrong in rare cases
    flight_time = as.numeric(as.duration(arr_time - dep_time))/60
  )

Now all of our flights obey the laws of physics.

flights_dt %>% 
  filter(flight_time < 0)

## # A tibble: 0 × 14
## # ℹ 14 variables: origin <chr>, dest <chr>, dep_delay <dbl>, arr_delay <dbl>,
## #   dep_time <dttm>, sched_dep_time <dttm>, arr_time <dttm>,
## #   sched_arr_time <dttm>, air_time <dbl>, origin_tzone <chr>,
## #   dest_tzone <chr>, flight_time <dbl>, weekday <ord>, day_change <lgl>

Intervals

Periods also have their problems. First consider durations: dyears(1) / ddays(1) returns a fixed 365.25, because durations are always represented by a number of seconds, and a duration of a year is defined as 365.25 days’ worth of seconds.

What should years(1) / days(1) return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There’s not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate, which is the same as dyears(1) / ddays(1).

dyears(1)/ddays(1)

## [1] 365.25

years(1) / days(1)

## [1] 365.25

If you want a more accurate measurement, you’ll have to use an interval. An interval is a duration with a starting point and an ending point: that makes it precise so you can determine exactly how long it is:

next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)

## [1] 365

Here today() %--% next_year is an interval, created with the %--% operator between two dates or date-times.

class(today() %--% next_year)

## [1] "Interval"
## attr(,"package")
## [1] "lubridate"
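Because an interval has fixed end points, it gives the exact answer to the 2015 versus 2016 question. A minimal sketch, with y2015 and y2016 as example names:

```r
library(lubridate)

y2015 <- ymd("2015-01-01") %--% ymd("2016-01-01")
y2016 <- ymd("2016-01-01") %--% ymd("2017-01-01")

y2015 / ddays(1)  # 365
y2016 / ddays(1)  # 366, a leap year

# Test whether a date falls inside an interval
ymd("2016-02-29") %within% y2016
```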

Lab Exercises (Don’t post it!)

Find out how many days have passed since you were born.
Find out (approximately) how many seconds have passed since you were born.

Miao Yu

2025-03-28
