When working with time series data, we often encounter dates and times stored as text, such as “2020-08-08”. In many cases we need to extract the year, month, and day, or even the hour, minute, and second, so that comparison and filtering become easy. We could write our own string-extraction functions to pull out each unit, but date formats vary so much that hand-written parsers tend to break on unexpected input. Fortunately, the lubridate package already provides these functions; they are simple, convenient, and fast, and are introduced below.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.2
## -- Attaching packages --------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.1 v dplyr 1.0.0
## v tidyr 1.1.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 4.0.2
## Warning: package 'tidyr' was built under R version 4.0.2
## Warning: package 'dplyr' was built under R version 4.0.2
## -- Conflicts ------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.2
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
First of all, lubridate's accessor functions are convenient because, whether the year, month, and day are separated by hyphens or by slashes, they find the correct component and return it as a number. For example:
year("2020-08-16")
## [1] 2020
year("2020/08/16")
## [1] 2020
month("2020/08/16")
## [1] 8
day("2020/08/16")
## [1] 16
As shown, the year(), month(), and day() functions extract the corresponding value directly. Similar functions include hour(), minute(), and second():
hour("2020-08-16 18:20:11")
## [1] 18
minute("2020-08-16 18:20:11")
## [1] 20
second("2020-08-16 18:20:11")
## [1] 11
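Because the accessors return plain numbers, they can be used directly in comparisons and filters. A minimal sketch, using a made-up vector of dates (the names dates, august, and august_2020 are assumptions for illustration, not part of the original example):

```r
library(lubridate)

# Hypothetical example dates, stored as Date objects
dates <- as.Date(c("2020-08-16", "2020-09-01", "2021-08-30"))

# Keep only the dates that fall in August, in any year
august <- dates[month(dates) == 8]

# Keep only the dates that fall in August 2020
august_2020 <- dates[year(dates) == 2020 & month(dates) == 8]

august
august_2020
```

The same comparisons can be placed inside dplyr's filter() when the dates live in a data frame column.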
These functions can also be used to set and modify components. Two of them, wday() and month(), have a label option that lets you choose between the numeric value and the name (e.g. wday() can return 7 or Sat). Note: by default the week starts on Sunday, so Sunday is 1 and Saturday is 7.
a_date <- ymd_hms('2018/06/29/12/55/50')
a_date
## [1] "2018-06-29 12:55:50 UTC"
wday(a_date)
## [1] 6
wday(a_date, label = TRUE)
## [1] Fri
## Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
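The setting side of these accessors can be sketched as follows. This is only an illustration: the variable names changed and updated are assumptions, and update() is one of several ways lubridate offers to change more than one component at a time.

```r
library(lubridate)

a_date <- ymd_hms("2018-06-29 12:55:50")

# Assigning to an accessor modifies that component of the object
changed <- a_date
year(changed) <- 2020
hour(changed) <- hour(changed) + 1
changed

# update() sets several components at once and returns a new object
updated <- update(a_date, year = 2020, month = 2, mday = 2)
updated

# week_start = 1 counts from Monday, so this Friday becomes 5 instead of 6
wday(a_date, week_start = 1)
```

Note that a_date itself is left untouched in both cases; only the copies carry the new values.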
lubridate also provides functions for parsing dates whose year, month, and day appear in different orders:
ymd("20200816")
## [1] "2020-08-16"
mdy("08-16-2020")
## [1] "2020-08-16"
dmy("16/08/2020")
## [1] "2020-08-16"
ymd(), mdy(), and dmy() correspond to three common arrangements of year, month, and day. With them, we can convert differently formatted dates into standard Date objects.
In addition, if the data include a time of day, append hours (h), minutes (m), and seconds (s) to the function name, as in ymd_hms(); if the time belongs to a specific time zone, specify it with the tz option.
syd_date <- ymd_hms("2020-08-16-19-21-32", tz = "Australia/ACT")
syd_date
## [1] "2020-08-16 19:21:32 AEST"
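Once a date-time carries a time zone, it is worth knowing the difference between displaying it in another zone and relabelling it. A small sketch, assuming the zone name "Australia/Sydney" (chosen here for illustration in place of the alias used above) and the hypothetical names utc_view and relabelled:

```r
library(lubridate)

syd_date <- ymd_hms("2020-08-16 19:21:32", tz = "Australia/Sydney")

# with_tz(): the same instant, shown on a UTC clock
# (Sydney is 10 hours ahead of UTC in August)
utc_view <- with_tz(syd_date, tzone = "UTC")
utc_view

# force_tz(): the same clock reading, reinterpreted as UTC,
# which makes it a different instant
relabelled <- force_tz(syd_date, tzone = "UTC")
relabelled

# The two results differ by the 10-hour offset
difftime(relabelled, utc_view, units = "hours")
```

with_tz() is usually what is wanted for reporting; force_tz() is for repairing data that was read in with the wrong zone.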
lubridate is very flexible: it can “intelligently” work out the input format and return a standard date, even when the elements of the input use inconsistent separators or are loosely formatted.
inc_date <- c(20201122, "2015-06-01", "2011 05 03", "2018-7, 5")
ymd(inc_date)
## [1] "2020-11-22" "2015-06-01" "2011-05-03" "2018-07-05"
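A related kind of incompleteness is input where the trailing components are missing altogether. The parsers take a truncated argument for this case; the sketch below follows the pattern used in the lubridate documentation, with a made-up input vector x:

```r
library(lubridate)

# Hypothetical input: some elements lack seconds, minutes, or the whole time
x <- c("2011-12-31 12:59:59", "2010-01-01 12:11", "2010-01-01 12", "2010-01-01")

# truncated = 3 allows up to three trailing components (H, M, S) to be absent;
# the missing parts are filled with zeros
parsed <- ymd_hms(x, truncated = 3)
parsed
```

Without truncated, the shorter elements would generally fail to parse and come back as NA with a warning.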
Sometimes the individual components of a date-time are spread across several columns. This is the case in the flights data:
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.0.2
flights %>%
  select(year, month, day, hour, minute)
## # A tibble: 336,776 x 5
## year month day hour minute
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 5 15
## 2 2013 1 1 5 29
## 3 2013 1 1 5 40
## 4 2013 1 1 5 45
## 5 2013 1 1 6 0
## 6 2013 1 1 5 58
## 7 2013 1 1 6 0
## 8 2013 1 1 6 0
## 9 2013 1 1 6 0
## 10 2013 1 1 6 0
## # ... with 336,766 more rows
To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:
flights %>%
  select(year, month, day, hour, minute) %>%
  mutate(departure = make_datetime(year, month, day, hour, minute))
## # A tibble: 336,776 x 6
## year month day hour minute departure
## <int> <int> <int> <dbl> <dbl> <dttm>
## 1 2013 1 1 5 15 2013-01-01 05:15:00
## 2 2013 1 1 5 29 2013-01-01 05:29:00
## 3 2013 1 1 5 40 2013-01-01 05:40:00
## 4 2013 1 1 5 45 2013-01-01 05:45:00
## 5 2013 1 1 6 0 2013-01-01 06:00:00
## 6 2013 1 1 5 58 2013-01-01 05:58:00
## 7 2013 1 1 6 0 2013-01-01 06:00:00
## 8 2013 1 1 6 0 2013-01-01 06:00:00
## 9 2013 1 1 6 0 2013-01-01 06:00:00
## 10 2013 1 1 6 0 2013-01-01 06:00:00
## # ... with 336,766 more rows
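The same two functions can be tried without the flights data. A minimal sketch on a small, made-up data frame (parts and its values are assumptions for illustration):

```r
library(lubridate)

# Hypothetical component columns held in a plain data frame
parts <- data.frame(
  year   = c(2013L, 2013L),
  month  = c(1L, 12L),
  day    = c(1L, 31L),
  hour   = c(5L, 23L),
  minute = c(15L, 59L)
)

# make_date() builds a Date; make_datetime() builds a date-time (UTC by default)
parts$date      <- make_date(parts$year, parts$month, parts$day)
parts$departure <- make_datetime(parts$year, parts$month, parts$day,
                                 parts$hour, parts$minute)
parts
```

Both functions are vectorised, so each row is combined independently.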
Let’s do the same thing for each of the four time columns in flights. The times are represented in a slightly odd format, so we use modulus arithmetic to pull out the hour and minute components.
# Times are stored as integers such as 517 (meaning 05:17), so split them
# into hours (time %/% 100) and minutes (time %% 100)
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>%
  # Drop flights with no recorded departure or arrival time
  filter(!is.na(dep_time), !is.na(arr_time)) %>%
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>%
  select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt
## # A tibble: 328,063 x 9
## origin dest dep_delay arr_delay dep_time sched_dep_time
## <chr> <chr> <dbl> <dbl> <dttm> <dttm>
## 1 EWR IAH 2 11 2013-01-01 05:17:00 2013-01-01 05:15:00
## 2 LGA IAH 4 20 2013-01-01 05:33:00 2013-01-01 05:29:00
## 3 JFK MIA 2 33 2013-01-01 05:42:00 2013-01-01 05:40:00
## 4 JFK BQN -1 -18 2013-01-01 05:44:00 2013-01-01 05:45:00
## 5 LGA ATL -6 -25 2013-01-01 05:54:00 2013-01-01 06:00:00
## 6 EWR ORD -4 12 2013-01-01 05:54:00 2013-01-01 05:58:00
## 7 EWR FLL -5 19 2013-01-01 05:55:00 2013-01-01 06:00:00
## 8 LGA IAD -3 -14 2013-01-01 05:57:00 2013-01-01 06:00:00
## 9 JFK MCO -3 -8 2013-01-01 05:57:00 2013-01-01 06:00:00
## 10 LGA ORD -2 8 2013-01-01 05:58:00 2013-01-01 06:00:00
## # ... with 328,053 more rows, and 3 more variables: arr_time <dttm>,
## # sched_arr_time <dttm>, air_time <dbl>
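The modulus arithmetic inside make_datetime_100() can be checked by hand on a couple of values. In this sketch the vector time and the names hh and mm are made up for illustration:

```r
library(lubridate)

# 517 encodes 05:17 and 1545 encodes 15:45
time <- c(517, 1545)

hh <- time %/% 100  # integer division drops the last two digits
mm <- time %% 100   # the remainder keeps only the last two digits

hh
mm

# Combine with a fixed date, as make_datetime_100() does for each flight
make_datetime(2013, 1, 1, hh, mm)
```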
In the example above, the date and time data are messy but still follow year, month, day order. What if different elements are written in different orders?
For this we need another function, parse_date_time(), which converts character strings in a variety of formats into date-times. Its key parameter is orders, which lists the component orders to try, such as year-month-day or month-day-year.
test_date <- c('20131113','120315','12/17/1996','09-01-01','2015 12 23','2009-1, 5','Created on 2013 4 6')
parse_date_time(test_date, orders = c('ymd', 'mdy', 'dmy'))
## [1] "2013-11-13 UTC" "2015-12-03 UTC" "1996-12-17 UTC" "2009-01-01 UTC"
## [5] "2015-12-23 UTC" "2009-01-05 UTC" "2013-04-06 UTC"
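The orders argument can also describe a time of day, which helps when some elements carry a time and others do not. A small sketch along the lines of the lubridate documentation, with made-up inputs (x, parsed, and bad are hypothetical names):

```r
library(lubridate)

# Hypothetical mixed input: one plain date, one date with hours and minutes
x <- c("2020-08-16", "2020-08-16 18:20")

# Each element is matched against the orders that fit it
parsed <- parse_date_time(x, orders = c("ymd", "ymd HM"))
parsed

# Input that matches none of the orders becomes NA;
# quiet = TRUE suppresses the parsing warning
bad <- parse_date_time("not a date", orders = "ymd", quiet = TRUE)
bad
```

Checking for NA after parsing is a cheap way to find the rows that need a closer look.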
An interval is a time span tied to specific start and end points in time. lubridate also provides two general time-span classes: durations and periods. The functions that create periods are named after time units in the plural. The functions that create durations have the same names, prefixed with “d”.
minutes(1) # periods
## [1] "1M 0S"
dminutes(1) # durations (prefixed with 'd')
## [1] "60s (~1 minutes)"
Why do we need two different classes? Because the timeline is not as regular as the number line. Durations measure an exact number of seconds, so their arithmetic is always precise: in recent versions of lubridate a duration year is always 365.25 days (31,557,600 seconds), which is why adding dyears(1) below lands at 06:00 on 2016-12-31. Periods track clock and calendar units instead, so they give more intuitive results when the timeline is irregular, which is very useful when modelling clock times. For example, across a leap year the result from a duration looks too rigid, while the result from a period matches the calendar:
leap_year(2016)
## [1] TRUE
ymd(20160101)+years(1)
## [1] "2017-01-01"
ymd(20160101)+dyears(1)
## [1] "2016-12-31 06:00:00 UTC"
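Daylight saving time is another case where the two classes diverge. The sketch below assumes the "America/New_York" time zone, where clocks moved forward one hour on 2016-03-13; the variable names are made up for illustration:

```r
library(lubridate)

# 1 pm on the day before the clocks change
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

# Period: the same clock time on the next calendar day
next_day_period <- one_pm + days(1)
next_day_period

# Duration: exactly 86,400 seconds later, which reads 14:00 on the shifted clock
next_day_duration <- one_pm + ddays(1)
next_day_duration
```

Which answer is right depends on the question: "same time tomorrow" calls for a period, "24 hours from now" for a duration.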
We can use periods or durations for basic date arithmetic. For example, to get the same clock time in six consecutive weeks, starting with the original date:
syd_date_1 <- syd_date + weeks(0:5)
syd_date_1
## [1] "2020-08-16 19:21:32 AEST" "2020-08-23 19:21:32 AEST"
## [3] "2020-08-30 19:21:32 AEST" "2020-09-06 19:21:32 AEST"
## [5] "2020-09-13 19:21:32 AEST" "2020-09-20 19:21:32 AEST"
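Two further pieces of date arithmetic are worth a quick sketch: measuring an interval, and adding months near the end of a month. The dates and variable names here are made up for illustration:

```r
library(lubridate)

start <- ymd("2020-01-31")
end   <- ymd("2020-03-01")

# An interval is anchored to its start and end points
span <- interval(start, end)

# Dividing by a duration gives the elapsed time in that unit
n_days <- span / ddays(1)
n_days

# Adding a one-month period to 31 January would name 31 February,
# which does not exist, so the result is NA
naive <- start + months(1)
naive

# %m+% rolls back to the last valid day of the target month instead
rolled <- start %m+% months(1)
rolled
```

The NA from plain period addition is deliberate: lubridate declines to guess, and %m+% makes the rollback an explicit choice.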