Dates are a special form of numeric that are stored as numeric but display in formats we’re familiar with.
Like Microsoft (January 1st,1900) and SAS (January 1st, 1960), R anchors to a specific date (January 1st, 1970).
Dates in R are stored as the number of days before or after 1/1/1970 and date/times are stored as the number of seconds before or after 1/1/1970.
Default display format in R is yyyy-mm-dd (‘2020-09-17’).
Dates using Base R
# Default format is yyyy-mm-dd - if you use that format, you don't have to specify# You can assign a date to an object by using a date conversion function and passing a character date string# Using base SAS function as.Date()dob <-as.Date("2020-09-03")dob
[1] "2020-09-03"
# Derive the number of days since 1/1/1970 by converting the date object to numericas.numeric(dob)
[1] 18508
#Confirming a different wayas.numeric(dob) -as.numeric(as.Date("1970-01-01"))
[1] 18508
# if you wanted to add 90 days to a date (or a column of dates):as.Date("2020-09-04") +90
[1] "2020-12-03"
# Converting strings to date or date/time# If the date is stored in a different format, you just have to include a format parse argumentchr_dates <-c("03/05/1985", "09/21/1972")dates <-as.Date(chr_dates, "%m/%d/%Y")dates
[1] "1985-03-05" "1972-09-21"
# Another frequently used function is getting the current dateSys.Date()
[1] "2024-05-11"
Dates Using Lubridate
For working with dates in this course, we’ll generally either use base R functions or the lubridate package functions from the tidyverse.
# Using lubridate package functions (part of the tidyverse)# Lubridate makes things a little easier in working with datesdob <-as_date("2020-09-03")# Using the mdy function easily converts the US date convention (or many other date conventions) to R datemdy("02/29/2020")
[1] "2020-02-29"
# if you wanted to add 90 days to a date (or a column of dates) it's basically the same:as_date("2020-09-04") +90
[1] "2020-12-03"
# But if you need to add three months, lubridate makes it much easieras_date("2020-09-04") +months(1)
[1] "2020-10-04"
# that works fine, but becomes a little problematic if the added month has fewer daysas_date("2020-03-31") +months(1)
[1] NA
# the %m+% operator adds a month but reverts back to the last day of the month if the expression yields a non-existent dateas_date("2020-03-31") %m+%months(1)
[1] "2020-04-30"
# On the off chance you need to convert from numeric:lubridate::as_date(18239)
[1] "2019-12-09"
# You can also get current time (just have to be careful about time zone) usingwith_tz(Sys.time(), tzone ="America/Los_Angeles")
[1] "2024-05-11 10:08:07 PDT"
# You can also use the today function and specify the time zonetoday(tzone ="Australia/Sydney")
[1] "2024-05-12"
Wrangling with Dates
It makes sense to keep dates as numeric behind the scene, since you’ll often want to be able to compare dates, identify number of days, weeks, months, and years between dates, and see if something happened early or later.
# Difference in daysSys.Date() -as.Date("2020-09-03")
Time difference of 1346 days
# which date is earlier, later etcSys.Date() <as.Date("2020-09-03")
[1] FALSE
# another method of determining days between two dates, but also the option for other measuresas.numeric(difftime(Sys.Date(), as.Date("2020-09-03"), units ="days"))
[1] 1346
as.numeric(difftime(Sys.Date(), as.Date("2020-09-03"), units ="hours"))
[1] 32304
# If you need to extract a date part out of a date object:test <-as_date("1999-04-02")day(test)
[1] 2
month(test)
[1] 4
year(test)
[1] 1999
# day of the weekwday(test, label =TRUE, abbr =TRUE)
[1] Fri
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
# day of the year (julian day)yday(test)
[1] 92
# key time measure for public health built into lubridate!epiweek(test)
[1] 13
You can also sort and join (next week) a dataset by dates more efficiently because it has numeric properties. As well as querying the most recent even in a dataset, grouped by people.
When you import dataset, most packages will make an attempt to guess that a column contains a date and automatically convert. But for the sake of learning how to manually manipulate, we’ll use base R import
cdcr_nodt <-read.csv("https://raw.githubusercontent.com/datadesk/california-coronavirus-data/master/cdcr-prison-totals.csv", stringsAsFactors = F)# note: readr::read_csv will guess the column data types, so you wouldn't need to convert if you used read_csv.str(cdcr_nodt)
'data.frame': 41370 obs. of 13 variables:
$ date : chr "2020-03-10" "2020-03-11" "2020-03-12" "2020-03-13" ...
$ code : chr "ASP" "ASP" "ASP" "ASP" ...
$ name : chr "Avenal State Prison" "Avenal State Prison" "Avenal State Prison" "Avenal State Prison" ...
$ city : chr "Avenal" "Avenal" "Avenal" "Avenal" ...
$ county : chr "Kings" "Kings" "Kings" "Kings" ...
$ fips : int 31 31 31 31 31 31 31 31 31 31 ...
$ zipcode : int 93204 93204 93204 93204 93204 93204 93204 93204 93204 93204 ...
$ x : num -120 -120 -120 -120 -120 ...
$ y : num 36 36 36 36 36 ...
$ confirmed_cases : int 0 0 0 0 0 0 0 0 0 0 ...
$ new_confirmed_cases: int NA 0 0 0 0 0 0 0 0 0 ...
$ deaths : int 0 0 0 0 0 0 0 0 0 0 ...
$ new_deaths : int NA 0 0 0 0 0 0 0 0 0 ...
# In a dataset with dates as character, convert to date cdcr_nodt$date_mod <-ymd(cdcr_nodt$date) #as_date would also work# Getting the most recent date and earliest datemax(cdcr_nodt$date_mod)