Date-time data can be frustrating to work with in R, or in any programming language. Many characteristics of date-time data must be taken into account, including formats, time zones, and calendar irregularities such as leap years and daylight saving time (DST). Base R commands for date-times are relatively unintuitive, and their results can vary with the class of object being used. The lubridate package is the tidyverse's answer to the problems data scientists commonly encounter when working with dates and times. Its functions allow simple, straightforward arithmetic on date-time and time-span objects.
The difficulties of working with dates and times:
Dates and times must reconcile two physical phenomena (the rotation of the Earth and its orbit around the Sun) with a whole raft of geopolitical phenomena, including months, time zones, and DST.
The three object types dealing with dates and times:
Three ways you are likely to create a date/time:
From a string
lubridate automatically works out the format once you specify the order of the components (year, month, day, etc.). It can also handle unquoted numbers.
From individual date-time components
From an existing date/time object
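The three routes above can be sketched as follows. This is a minimal illustration that assumes the lubridate package is installed; the input values and variable names are made up for the example.

```r
library(lubridate)

# 1. From a string: name the order of the components and let lubridate parse it
from_string <- ymd("2017-01-31")
from_number <- ymd(20170131)   # unquoted numbers work as well

# 2. From individual date-time components
from_parts <- make_datetime(year = 2013, month = 1, day = 1, hour = 5, min = 15)

# 3. From an existing date/time object
as_dt <- as_datetime(from_string)   # Date -> date-time (midnight, UTC)
as_d  <- as_date(from_parts)        # date-time -> Date

print(from_string)
print(from_parts)
```

The helper name encodes the component order, so `mdy()` or `dmy()` would be used for inputs written in month-day-year or day-month-year order.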
Many basic functions, including am (returns a logical indicating whether a date-time falls in the a.m. or p.m.), is.instant (checks whether an object is a date-time) and guess_format (guesses the format of a character or numeric vector)
R could now work with complicated features of time such as leap years, daylight saving time, different time zones, military time, and a wide variety of date-time formats.
Initial versions would overwrite many functions and basic operations from base R (such as “+”, “-”, “start()”, “end()”), which made it difficult to use lubridate in highly complicated programs
lubridate has since been gradually modified to become more flexible with different kinds of data structures and programs.
Comparability between Duration, Period, and difftime objects
Updates since Version 1.7.2 have simply been bug fixes or patches rather than the addition of new functions
Part of the tidyverse ecosystem of packages
Relies on Google’s CCTZ library (built into the package) for date-time updates and time-zone manipulation
Base R has built-in functions that can work with and manipulate dates
lubridate
Chron Package
timelineS Package
timelineS does
ggplot2 in conjunction in order to do so
timelineS does NOT provide any functions dealing with time objects or geopolitical complications (DST, time zones, etc.)
In Base R:
library(dplyr) # provides %>% and mutate()

data <- data.frame(
  date = c("2019/05/03",
           "2018/02/8",
           "2016-02/29"), # notice the data entry error
  measure = c(25, 22, 17)
)
data %>%
  mutate(
    date = as.Date(date)
  )
| date | measure |
|---|---|
| 2019-05-03 | 25 |
| 2018-02-08 | 22 |
| NA | 17 |
# An extra step that base R needs: make the separators consistent
data$date <- gsub(
  x = data$date,
  pattern = "-",
  replacement = "/"
)
data %>%
  mutate(
    date = as.Date(date)
  )
| date | measure |
|---|---|
| 2019-05-03 | 25 |
| 2018-02-08 | 22 |
| 2016-02-29 | 17 |
As the chunks above show, base R needs an additional step that replaces the invalid date format with a valid one before it recognizes the date. Otherwise, as.Date() returns NA for that date.
Also notice that the plot above is not continuous in time; it is discrete in terms of the date. Base R treats the dates as categories rather than as points along a continuous timeline.
With lubridate:
# install.packages('lubridate')
library(lubridate)
library(ggplot2) # needed for the plot below
lubridata <- data %>%
  mutate(
    date = ymd(date)
  )
lubridata %>% # What we want
  ggplot() +
  geom_point(aes(date, measure))
With lubridate, there is no need for the additional replacement step. ymd() recognizes dates in this format immediately. The plot is also now continuous in time along the x-axis.
Notice also that the lubridate code chunk is shorter than its base R counterpart.
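To see the difference on the raw strings directly, the sketch below parses the same three values with both approaches. It assumes lubridate is installed; the variable names are illustrative only.

```r
library(lubridate)

# Same values as the example data, including the inconsistent separator
raw <- c("2019/05/03", "2018/02/8", "2016-02/29")

# Base R chooses one format from the first element and applies it to all,
# so the third value cannot be parsed and becomes NA
base_parsed <- as.Date(raw)

# lubridate only needs the component order and tolerates mixed separators
lubr_parsed <- ymd(raw)

print(base_parsed)
print(lubr_parsed)
```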
In Base R:
data$month = format(as.Date(data$date), "%m")
data$year = format(as.Date(data$date), "%Y")
data$day = format(as.Date(data$date), "%d")
## character components
With lubridate:
data %>% # ensures numeric components
  mutate(month = month(date),
         year = year(date),
         day = day(date))
| date | measure | month | year | day |
|---|---|---|---|---|
| 2019/05/03 | 25 | 5 | 2019 | 3 |
| 2018/02/8 | 22 | 2 | 2018 | 8 |
| 2016/02/29 | 17 | 2 | 2016 | 29 |
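The practical difference between the two approaches is the type of the result. A small sketch, assuming lubridate is installed and using an illustrative date:

```r
library(lubridate)

d <- ymd("2018-02-08")

base_month <- format(d, "%m")   # character string, zero-padded: "02"
lubr_month <- month(d)          # number: 2

class(base_month)
class(lubr_month)

# Numeric components can be used in arithmetic without conversion
next_year <- year(d) + 1
print(next_year)
```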
In Base R:
## [1] "2023-05-02" "2022-02-07" "2020-02-28"
With lubridate:
| date | measure | month | year | day |
|---|---|---|---|---|
| 2023-05-03 | 25 | 05 | 2019 | 03 |
| 2022-02-08 | 22 | 02 | 2018 | 08 |
| 2020-02-29 | 17 | 02 | 2016 | 29 |
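The code that produced the two results above is not shown. One plausible reconstruction, offered as an assumption rather than the original code, is adding four years to each date: base R approximates this as 4 * 365 days, while lubridate adds a period of four calendar years.

```r
library(lubridate)

dates <- ymd(c("2019-05-03", "2018-02-08", "2016-02-29"))

# Base R style: four years approximated as 4 * 365 days, which ignores leap days
base_shift <- as.Date(dates) + 4 * 365

# lubridate: add a period of four calendar years
lubr_shift <- dates + years(4)

print(base_shift)   # "2023-05-02" "2022-02-07" "2020-02-28"
print(lubr_shift)   # "2023-05-03" "2022-02-08" "2020-02-29"
```

The base R result drifts by one day because each four-year span contains a leap day that the fixed 365-day year does not account for.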
In Base R:
## Time difference of 448.9583 days
## Error in match.arg(units): 'arg' should be one of "auto", "secs", "mins", "hours", "days", "weeks"
With lubridate:
## Error in data$date[1] - data$date[2]: non-numeric argument to binary operator
## [1] 1.230137
As can be seen above, once the dates have been parsed with lubridate's ymd() function, the difference between them can be expressed in years, a unit that base R's difftime() rejects with an error.
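The calls behind these outputs are not shown, so the following is only a sketch of one way to obtain a difference in years. It assumes lubridate is installed and uses `interval()` with `time_length()`, which may differ from the functions the original code used.

```r
library(lubridate)

start <- ymd("2018-02-08")
end   <- ymd("2019-05-03")

# Base R: subtracting Date objects gives a difftime measured in days
base_diff <- end - start
# difftime(end, start, units = "years") would fail: "years" is not an allowed unit

# lubridate: build an interval, then measure its length in years
span_years <- time_length(interval(start, end), unit = "years")

print(base_diff)
print(span_years)   # roughly 1.23
```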
# Adding a time zone label to the date
# UTC (Coordinated Universal Time) is the standard reference time zone
mdy("January 31st, 2017", tz = "UTC")
## [1] "2017-01-31 UTC"
# Including the hours, minutes, and seconds produces a date-time object
mdy_hms("January 31st, 2020 11, 02, 01")
## [1] "2020-01-31 11:02:01 UTC"
## [1] "1296000s (~2.14 weeks)"
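The output above corresponds to a duration of 15 days (1,296,000 seconds), although the call that produced it is not shown. As a hedged sketch, assuming lubridate is installed, a duration counts exact seconds while a period follows the calendar, which matters in a leap year:

```r
library(lubridate)

fifteen_days <- ddays(15)               # a Duration: an exact number of seconds
secs <- as.numeric(fifteen_days)        # 1296000

start <- ymd("2016-01-01")              # 2016 is a leap year (366 days)

plus_duration <- start + ddays(365)     # exactly 365 * 86400 seconds later
plus_period   <- start + years(1)       # the same calendar date one year later

print(secs)
print(plus_duration)   # "2016-12-31"
print(plus_period)     # "2017-01-01"
```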
# The tz argument sets the time zone of the date-time object. Given a location name such as "America/New_York", the result is printed with the matching time zone abbreviation (EDT)
ymd_hms("2015-06-01 12:00:00", tz = "America/New_York")
## [1] "2015-06-01 12:00:00 EDT"
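Once a date-time carries a time zone, there are two different ways to move it to another zone. The sketch below, which assumes lubridate is installed and uses illustrative variable names, contrasts `with_tz()` (same instant, different clock time) with `force_tz()` (same clock time, different instant).

```r
library(lubridate)

noon_ny <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York")

# Same moment in time, shown on a UTC clock (EDT is four hours behind UTC)
same_instant_utc <- with_tz(noon_ny, tzone = "UTC")

# Same clock reading, relabelled as UTC, which is a different moment in time
same_clock_utc <- force_tz(noon_ny, tzone = "UTC")

print(same_instant_utc)   # "2015-06-01 16:00:00 UTC"
print(same_clock_utc)     # "2015-06-01 12:00:00 UTC"
```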
x <- ymd("2019-03-18") # assumed definition; the original assignment of x is not shown
x
## [1] "2019-03-18"
# floor_date() rounds the date down to the nearest boundary of the given unit.
# Here the unit is "month", so the day is changed to the first of the month.
floor_date(x, unit = "month")
## [1] "2019-03-01"
## [1] Monday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
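`floor_date()` has two siblings, and the weekday output above most likely comes from `wday()`, though that call is not shown. A minimal sketch, assuming lubridate is installed and that x is the date 2019-03-18:

```r
library(lubridate)

x <- ymd("2019-03-18")

down    <- floor_date(x, unit = "month")     # "2019-03-01"
up      <- ceiling_date(x, unit = "month")   # "2019-04-01"
nearest <- round_date(x, unit = "month")     # whichever boundary is closer

# Day of the week as a number; week_start = 7 makes Sunday day 1
day_number <- wday(x, week_start = 7)        # 2, a Monday

print(down)
print(up)
print(nearest)
print(day_number)
```

Calling `wday(x, label = TRUE, abbr = FALSE)` returns the day name as an ordered factor instead; the label text depends on the session locale.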
Handles date and time data efficiently while accounting for the difficulties described earlier
A huge advancement over base R for dealing with dates and times
Makes sorting and comparing dates and times within data structures very easy
Functions are well named, so it is easy to identify what each one does
Being part of the tidyverse keeps it consistent with other packages and allows combined usage for complex goals.
Dependency on the CCTZ package
R does not come with predefined time zone names, so it depends on the user’s operating system for time zone names
Does not take into account holidays (Labor Day, Thanksgiving, etc.). Does not recognize that Thanksgiving is the fourth Thursday of November
Could be faster, as always
Last version was released in April 2018
Some locales that observe DST on paper have populations that may not actually practice it
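Although lubridate has no holiday functions, a rule-based holiday can be computed from its building blocks. The helper below is hypothetical (the name `thanksgiving` is not part of lubridate) and is a sketch assuming the US rule of the fourth Thursday of November.

```r
library(lubridate)

# Hypothetical helper: date of US Thanksgiving for a given year
thanksgiving <- function(year) {
  nov_first <- make_date(year, 11, 1)
  # With week_start = 1, Monday is 1 and Thursday is 4
  offset <- (4 - wday(nov_first, week_start = 1)) %% 7
  first_thursday <- nov_first + days(offset)
  first_thursday + weeks(3)   # the fourth Thursday
}

print(thanksgiving(2018))   # "2018-11-22"
print(thanksgiving(2019))   # "2019-11-28"
```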
A universal method that automatically parses the day, month, year, and time of an input in any compatible format
A calendar data structure, or methods for scheduling, reminders, etc., that can be loaded into the user’s console
A function that takes EXIF data from photos and parses their date and time
The ability to work with calendars used in other parts of the world (Chinese calendar, Mayan calendar, etc.)
Accounting for holidays in data sets on jobs, activity, etc.
A function that returns the Carbon-12 to Carbon-14 isotope ratio or tree-ring number for a past date
Conversion to time durations on other planets or regions of the universe for GIS or aerospace data sets