Introduction

When we work with time series data, we often encounter dates stored as text, such as “2020-08-08”. In many cases we need to extract the year, month, and day, or even the hour, minute, and second, so that comparison and filtering become easy. If we implemented this ourselves, we would probably write a string-extraction function for each time unit. However, date formats vary so much that such a function will always miss some cases. Fortunately, the lubridate package already provides these functions. They are simple, convenient, and fast, and are introduced below.

Load the packages

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.2
## -- Attaching packages --------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 4.0.2
## Warning: package 'tidyr' was built under R version 4.0.2
## Warning: package 'dplyr' was built under R version 4.0.2
## -- Conflicts ------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.2
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Return time value

First of all, the lubridate accessor functions are convenient because, whatever separator divides the year, month, and day, they find the correct component and return it as a number, for example:

year("2020-08-16")
## [1] 2020
year("2020/08/16")
## [1] 2020
month("2020/08/16")
## [1] 8
day("2020/08/16")
## [1] 16

We can see that the corresponding value can be extracted directly with the year(), month(), and day() functions. The same family includes hour(), minute(), second(), and so on:

hour("2020-08-16 18:20:11")
## [1] 18
minute("2020-08-16 18:20:11")
## [1] 20
second("2020-08-16 18:20:11")
## [1] 11
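As a minimal sketch of the comparison and filtering mentioned in the introduction, the accessors can be used to subset a vector of dates. The sample values and variable names here are hypothetical and only serve as an illustration.

```r
library(lubridate)

# Hypothetical sample data: three dates stored as text with mixed separators
raw <- c("2020-08-16", "2021-01-03", "2020/12/25")
dates <- ymd(raw)

# Keep only the dates that fall in 2020
dates_2020 <- dates[year(dates) == 2020]
dates_2020

# Once parsed, dates can be compared directly
latest <- max(dates)
latest
```

Parsing first and filtering afterwards avoids any string manipulation on the raw text.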

These functions can also be used to set or modify a component by assigning a new value to it. Two of them, wday() and month(), have a label option that lets you choose between the numeric value and the name. For example, wday() can return 7 or Sat. Note that by default the week starts on Sunday, which is numbered 1.

a_date <- ymd_hms('2018/06/29/12/55/50')

a_date
## [1] "2018-06-29 12:55:50 UTC"
wday(a_date)
## [1] 6
wday(a_date, label = TRUE)
## [1] Fri
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
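The setting and modifying behaviour described above might look like the following sketch. The variable names are illustrative, and the week_start argument is shown only as one way to change the default numbering of weekdays.

```r
library(lubridate)

d <- ymd_hms("2018-06-29 12:55:50")

# Assigning to an accessor modifies that component in place
year(d) <- 2020
hour(d) <- 8
d

# update() changes several components at once and returns a new value
d2 <- update(d, month = 1, mday = 15)
d2

# week_start = 1 numbers the days from Monday instead of Sunday,
# so a Friday becomes 5 rather than 6
monday_based <- wday(ymd("2018-06-29"), week_start = 1)
monday_based
```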

lubridate also provides functions that parse dates whose year, month, and day appear in different orders:

ymd("20200816")
## [1] "2020-08-16"
mdy("08-16-2020")
## [1] "2020-08-16"
dmy("16/08/2020")
## [1] "2020-08-16"

ymd, mdy, and dmy respectively represent three common arrangements of year, month and day. In this way, we can convert different date data into standard date data.

On top of this, if the data also contains a time of day, append hours (h), minutes (m), and seconds (s) to the function name. If the time belongs to a specific time zone, specify it with the tz option.

syd_date <- ymd_hms("2020-08-16-19-21-32", tz = "Australia/ACT")

syd_date
## [1] "2020-08-16 19:21:32 AEST"
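Once a time zone is attached, it can be useful to see the same moment in another zone. The sketch below uses with_tz() and force_tz(), which are not covered in the text above, and assumes the zone name "Australia/Sydney" for illustration.

```r
library(lubridate)

# Assumed zone name for illustration; any valid Olson name works
local_time <- ymd_hms("2020-08-16 19:21:32", tz = "Australia/Sydney")

# with_tz() keeps the same instant and changes how it is displayed
utc_time <- with_tz(local_time, "UTC")
utc_time

# force_tz() keeps the clock reading and therefore changes the instant
relabelled <- force_tz(local_time, "UTC")
relabelled
```

In short, with_tz() answers "what time was it elsewhere?", while force_tz() corrects a time zone that was recorded wrongly.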

lubridate is very flexible. It can “intelligently” work out the input format and return a standard date, even when the elements of the input use inconsistent separators or are loosely formatted.

inc_date <- c(20201122, "2015-06-01", "2011 05 03", "2018-7, 5")

ymd(inc_date)
## [1] "2020-11-22" "2015-06-01" "2011-05-03" "2018-07-05"
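This flexibility has limits. The following sketch shows what happens when the input cannot be parsed, and how the truncated argument lets ymd_hms() accept values whose trailing time components are missing. The sample strings are hypothetical.

```r
library(lubridate)

# A string with no recognisable date yields NA;
# quiet = TRUE suppresses the parsing warning
bad <- ymd("not a date", quiet = TRUE)
bad

# truncated = 3 lets ymd_hms() accept input in which up to
# three trailing components (seconds, minutes, hours) are missing
partial <- ymd_hms(
  c("2020-08-16 18:20:11", "2020-08-16 18:20", "2020-08-16"),
  truncated = 3
)
partial
```

Checking for NA after parsing is a simple way to find the rows that need manual attention.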

Create date/times from individual components

Sometimes the individual components of a date-time are spread across several columns. This is the case in the flights data:

library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.0.2
flights %>% 
  select(year, month, day, hour, minute)
## # A tibble: 336,776 x 5
##     year month   day  hour minute
##    <int> <int> <int> <dbl>  <dbl>
##  1  2013     1     1     5     15
##  2  2013     1     1     5     29
##  3  2013     1     1     5     40
##  4  2013     1     1     5     45
##  5  2013     1     1     6      0
##  6  2013     1     1     5     58
##  7  2013     1     1     6      0
##  8  2013     1     1     6      0
##  9  2013     1     1     6      0
## 10  2013     1     1     6      0
## # ... with 336,766 more rows

To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:

flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_datetime(year, month, day, hour, minute))
## # A tibble: 336,776 x 6
##     year month   day  hour minute departure          
##    <int> <int> <int> <dbl>  <dbl> <dttm>             
##  1  2013     1     1     5     15 2013-01-01 05:15:00
##  2  2013     1     1     5     29 2013-01-01 05:29:00
##  3  2013     1     1     5     40 2013-01-01 05:40:00
##  4  2013     1     1     5     45 2013-01-01 05:45:00
##  5  2013     1     1     6      0 2013-01-01 06:00:00
##  6  2013     1     1     5     58 2013-01-01 05:58:00
##  7  2013     1     1     6      0 2013-01-01 06:00:00
##  8  2013     1     1     6      0 2013-01-01 06:00:00
##  9  2013     1     1     6      0 2013-01-01 06:00:00
## 10  2013     1     1     6      0 2013-01-01 06:00:00
## # ... with 336,766 more rows
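For comparison, here is a small self-contained sketch of make_date() and make_datetime() on plain vectors. The component values are hypothetical and stand in for separate columns.

```r
library(lubridate)

# Hypothetical components, as they might appear in separate columns
yr <- c(2013, 2013, 2014)
mo <- c(1, 12, 2)
dy <- c(1, 31, 28)

# make_date() returns Date values
dates_only <- make_date(yr, mo, dy)
dates_only

# make_datetime() returns POSIXct values (UTC by default)
with_time <- make_datetime(yr, mo, dy, hour = 5, min = 15)
with_time
```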

Let’s do the same thing for each of the four time columns in flights. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components.

make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

flights_dt <- flights %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>% 
  select(origin, dest, ends_with("delay"), ends_with("time"))

flights_dt
## # A tibble: 328,063 x 9
##    origin dest  dep_delay arr_delay dep_time            sched_dep_time     
##    <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
##  1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
##  2 LGA    IAH           4        20 2013-01-01 05:33:00 2013-01-01 05:29:00
##  3 JFK    MIA           2        33 2013-01-01 05:42:00 2013-01-01 05:40:00
##  4 JFK    BQN          -1       -18 2013-01-01 05:44:00 2013-01-01 05:45:00
##  5 LGA    ATL          -6       -25 2013-01-01 05:54:00 2013-01-01 06:00:00
##  6 EWR    ORD          -4        12 2013-01-01 05:54:00 2013-01-01 05:58:00
##  7 EWR    FLL          -5        19 2013-01-01 05:55:00 2013-01-01 06:00:00
##  8 LGA    IAD          -3       -14 2013-01-01 05:57:00 2013-01-01 06:00:00
##  9 JFK    MCO          -3        -8 2013-01-01 05:57:00 2013-01-01 06:00:00
## 10 LGA    ORD          -2         8 2013-01-01 05:58:00 2013-01-01 06:00:00
## # ... with 328,053 more rows, and 3 more variables: arr_time <dttm>,
## #   sched_arr_time <dttm>, air_time <dbl>
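The modulus arithmetic inside make_datetime_100() can be checked on a few values. This sketch assumes times encoded as integers in the form HHMM, as in the flights columns; the sample values are made up.

```r
# Clock times stored as integers: 517 means 5:17, 1342 means 13:42
time <- c(517, 1342, 5)

# Integer division by 100 drops the last two digits, leaving the hour
hrs <- time %/% 100
hrs

# The remainder after dividing by 100 is the minute
mins <- time %% 100
mins
```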

Parsing dates and times

In the examples above, although the date and time data is messy, each set of values still follows a single order, such as year, month, day. But what if the values in one vector follow different orders?

Here we need a new function, parse_date_time(), which converts date and time strings in a variety of formats into date-time data. Its key argument is orders, which lists the component orders to try, such as year-month-day or month-day-year.

test_date <- c('20131113','120315','12/17/1996','09-01-01','2015 12 23','2009-1, 5','Created on 2013 4 6')

parse_date_time(test_date, orders = c('ymd', 'mdy', 'dmy'))
## [1] "2013-11-13 UTC" "2015-12-03 UTC" "1996-12-17 UTC" "2009-01-01 UTC"
## [5] "2015-12-23 UTC" "2009-01-05 UTC" "2013-04-06 UTC"
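The orders argument matters most when a string is ambiguous. The following sketch, using hypothetical input, shows the same text parsed under two different orders, plus an order that includes a time of day.

```r
library(lubridate)

# "01/02/2020" could be 2 January or 1 February
ambiguous <- "01/02/2020"

as_mdy <- parse_date_time(ambiguous, orders = "mdy")
as_mdy

as_dmy <- parse_date_time(ambiguous, orders = "dmy")
as_dmy

# Orders can also include time components such as H and M
with_time <- parse_date_time("16/08/2020 18:20", orders = "dmy HM")
with_time
```

When several orders are supplied, lubridate has to choose among them, so it is safer to pass only the orders that can really occur in the data.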

Arithmetic with date times

Time spans

An interval is a specific time span, because it is tied to specific points in time. lubridate also provides two general time span classes: durations and periods. The functions that create periods are named after plural time units, such as minutes(). The functions that create durations have the same names prefixed with 'd', such as dminutes().

minutes(1) # periods
## [1] "1M 0S"
dminutes(1) # durations (prefixed with 'd')
## [1] "60s (~1 minutes)"
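To make the three kinds of time span more concrete, here is a minimal sketch. The variable names are illustrative, and interval() is used with dates chosen only for the example.

```r
library(lubridate)

# A duration is a fixed number of seconds
one_minute <- as.numeric(dminutes(1))
one_minute
ninety_min <- as.numeric(dhours(1) + dminutes(30))
ninety_min

# A period stores each clock unit separately
p <- hours(1) + minutes(30)
hour(p)
minute(p)

# An interval is anchored to a start and an end point
span <- interval(ymd("2016-01-01"), ymd("2017-01-01"))

# Dividing by a duration counts how many fit in the interval
days_in_span <- span / ddays(1)
days_in_span
```

Because 2016 is a leap year, the interval contains 366 days rather than 365.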

Why do we need two different classes? Because the timeline is not as reliable as the number line. A duration always represents an exact number of seconds, so its results are physically exact. A duration year has a fixed length regardless of the calendar (365.25 days in the lubridate version used here, as the output below shows). A period follows clock and calendar units, so its results track the irregularities of the timeline, which is very useful when modelling clock times. For example, across a leap year a duration gives a rigid result, while a period lands on the expected calendar date:

leap_year(2016)
## [1] TRUE
ymd(20160101)+years(1)
## [1] "2017-01-01"
ymd(20160101)+dyears(1)
## [1] "2016-12-31 06:00:00 UTC"
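Daylight saving time gives another illustration of the same difference. This sketch assumes the "America/New_York" time zone, where clocks moved forward on 13 March 2016, so that calendar day lasted only 23 hours.

```r
library(lubridate)

# 1 pm on the day before the clocks change
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

# Period: same clock time on the next calendar day
next_day_period <- one_pm + days(1)
next_day_period

# Duration: exactly 86400 seconds later, which reads as 2 pm
next_day_duration <- one_pm + ddays(1)
next_day_duration
```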

We can use periods or durations for basic date arithmetic. For example, to get the same clock time on six consecutive weeks, starting from syd_date:

syd_date_1 <- syd_date + weeks(0:5)

syd_date_1
## [1] "2020-08-16 19:21:32 AEST" "2020-08-23 19:21:32 AEST"
## [3] "2020-08-30 19:21:32 AEST" "2020-09-06 19:21:32 AEST"
## [5] "2020-09-13 19:21:32 AEST" "2020-09-20 19:21:32 AEST"
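Month arithmetic needs a little more care, because months differ in length. The sketch below uses hypothetical dates and the %m+% operator, which rolls back to the last valid day of the month instead of returning NA. It also shows that subtracting two dates gives the gap between them.

```r
library(lubridate)

jan31 <- ymd("2020-01-31")

# 31 February does not exist, so plain addition gives NA
plain <- jan31 + months(1)
plain

# %m+% rolls back to the last day of the target month
rolled <- jan31 %m+% months(1)
rolled

# Subtracting two dates gives a difftime
gap <- ymd("2020-09-20") - ymd("2020-08-16")
gap_days <- as.numeric(gap, units = "days")
gap_days
```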