Working with dates in R

Background

Dates are a special form of numeric that are stored as numeric but display in formats we’re familiar with.

Like Microsoft (January 1st,1900) and SAS (January 1st, 1960), R anchors to a specific date (January 1st, 1970).

Dates in R are stored as the number of days before or after 1/1/1970 and date/times are stored as the number of seconds before or after 1/1/1970.

Default display format in R is yyyy-mm-dd (‘2020-09-17’).

Dates using Base R

# Default format is yyyy-mm-dd - if you use that format, you don't have to specify
# You can assign a date to an object by using a date conversion function and passing a character date string
# Using base SAS function as.Date()

dob <- as.Date("2020-09-03")
dob
[1] "2020-09-03"
# Derive the number of days since 1/1/1970 by converting the date object to numeric
as.numeric(dob)
[1] 18508
#Confirming a different way
as.numeric(dob) - as.numeric(as.Date("1970-01-01")) 
[1] 18508
# if you wanted to add 90 days to a date (or a column of dates):
as.Date("2020-09-04") + 90
[1] "2020-12-03"
# Converting strings to date or date/time

# If the date is stored in a different format, you just have to include a format parse argument

chr_dates <- c("03/05/1985", "09/21/1972")
dates <- as.Date(chr_dates, "%m/%d/%Y")
dates
[1] "1985-03-05" "1972-09-21"
# Another frequently used function is getting the current date
Sys.Date()
[1] "2024-05-11"

Dates Using Lubridate

For working with dates in this course, we’ll generally either use base R functions or the lubridate package functions from the tidyverse.

lubridate reference page

lubridate cheat sheet

# Using lubridate package functions (part of the tidyverse)

# Lubridate makes things a little easier in working with dates

dob <- as_date("2020-09-03")

# Using the mdy function easily converts the US date convention (or many other date conventions) to R date
mdy("02/29/2020")
[1] "2020-02-29"
# if you wanted to add 90 days to a date (or a column of dates) it's basically the same:
as_date("2020-09-04") + 90
[1] "2020-12-03"
# But if you need to add three months, lubridate makes it much easier
as_date("2020-09-04") + months(1)
[1] "2020-10-04"
# that works fine, but becomes a little problematic if the added month has fewer days
as_date("2020-03-31") + months(1)
[1] NA
# the %m+% operator adds a month but reverts back to the last day of the month if the expression yields a non-existent date
as_date("2020-03-31") %m+% months(1)
[1] "2020-04-30"
# On the off chance you need to convert from numeric:

lubridate::as_date(18239)
[1] "2019-12-09"
# You can also get current time (just have to be careful about time zone) using

with_tz(Sys.time(), tzone = "America/Los_Angeles")
[1] "2024-05-11 10:08:07 PDT"
# You can also use the today function and specify the time zone

today(tzone = "Australia/Sydney")
[1] "2024-05-12"

Wrangling with Dates

It makes sense to keep dates as numeric behind the scene, since you’ll often want to be able to compare dates, identify number of days, weeks, months, and years between dates, and see if something happened early or later.

# Difference in days
Sys.Date() - as.Date("2020-09-03")
Time difference of 1346 days
# which date is earlier, later etc
Sys.Date() < as.Date("2020-09-03")
[1] FALSE
# another method of determining days between two dates, but also the option for other measures
as.numeric(difftime(Sys.Date(), as.Date("2020-09-03"), units = "days"))
[1] 1346
as.numeric(difftime(Sys.Date(), as.Date("2020-09-03"), units = "hours"))
[1] 32304
# If you need to extract a date part out of a date object:
test <- as_date("1999-04-02")
day(test)
[1] 2
month(test)
[1] 4
year(test)
[1] 1999
# day of the week
wday(test, label = TRUE, abbr = TRUE)
[1] Fri
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
# day of the year (julian day)
yday(test)
[1] 92
# key time measure for public health built into lubridate!
epiweek(test)
[1] 13

You can also sort and join (next week) a dataset by dates more efficiently because it has numeric properties. As well as querying the most recent even in a dataset, grouped by people.

When you import dataset, most packages will make an attempt to guess that a column contains a date and automatically convert. But for the sake of learning how to manually manipulate, we’ll use base R import

cdcr_nodt <- read.csv("https://raw.githubusercontent.com/datadesk/california-coronavirus-data/master/cdcr-prison-totals.csv", stringsAsFactors = F)

# note: readr::read_csv will guess the column data types, so you wouldn't need to convert if you used read_csv.
str(cdcr_nodt)
'data.frame':   41370 obs. of  13 variables:
 $ date               : chr  "2020-03-10" "2020-03-11" "2020-03-12" "2020-03-13" ...
 $ code               : chr  "ASP" "ASP" "ASP" "ASP" ...
 $ name               : chr  "Avenal State Prison" "Avenal State Prison" "Avenal State Prison" "Avenal State Prison" ...
 $ city               : chr  "Avenal" "Avenal" "Avenal" "Avenal" ...
 $ county             : chr  "Kings" "Kings" "Kings" "Kings" ...
 $ fips               : int  31 31 31 31 31 31 31 31 31 31 ...
 $ zipcode            : int  93204 93204 93204 93204 93204 93204 93204 93204 93204 93204 ...
 $ x                  : num  -120 -120 -120 -120 -120 ...
 $ y                  : num  36 36 36 36 36 ...
 $ confirmed_cases    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ new_confirmed_cases: int  NA 0 0 0 0 0 0 0 0 0 ...
 $ deaths             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ new_deaths         : int  NA 0 0 0 0 0 0 0 0 0 ...
# In a dataset with dates as character, convert to date 

cdcr_nodt$date_mod <- ymd(cdcr_nodt$date) #as_date would also work

# Getting the most recent date and earliest date
max(cdcr_nodt$date_mod)
[1] "2023-06-04"
min(cdcr_nodt$date_mod)
[1] "2020-03-10"
# and epi week

cdcr_nodt$epiwk <- epiweek(cdcr_nodt$date_mod)
cdcr_nodt$year <- year(cdcr_nodt$date_mod)

cdcr_epiwk <- cdcr_nodt |>
  filter(year == max(year) - 1) |>
  group_by(year, epiwk) |>
  summarize(cases = sum(confirmed_cases)) |>
  ungroup() |>
  as.data.frame()
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
str(cdcr_epiwk)
'data.frame':   52 obs. of  3 variables:
 $ year : num  2022 2022 2022 2022 2022 ...
 $ epiwk: num  1 2 3 4 5 6 7 8 9 10 ...
 $ cases: int  375088 398210 427825 458349 479708 492936 500660 505432 508598 510824 ...
bind_rows(slice_head(cdcr_epiwk, n=5), 
          slice_tail(cdcr_epiwk, n=5)) |> 
  gt(auto_align = TRUE) |>
  tab_header(
    title = "CDCR Cases by Epi Week",
    subtitle = "First and Last 5 weeks displayed"
  ) 
CDCR Cases by Epi Week
First and Last 5 weeks displayed
year epiwk cases
2022 1 375088
2022 2 398210
2022 3 427825
2022 4 458349
2022 5 479708
2022 48 616450
2022 49 619492
2022 50 623162
2022 51 626577
2022 52 682266