Working with Dates (More Advanced Topic)

Author

Rachel Saidi

Working with Dates in R

Adapted from Mark Niemann-Ross’s LinkedIn tutorial. All datasets for this tutorial may be found at the class datasets link: http://bit.ly/data110datasets

Date information can be challenging because it is recorded in so many different ways from one dataset to the next. The goal is to ensure that R recognises the format you are using as an actual date.

Load the libraries and set the working directory

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)

Getting started - some basics

Read in the dates_example and year_month_day_sample datasets.

# change this path to the folder where you saved the class datasets
setwd("C:/Users/rsaidi/Dropbox/Rachel/MontColl/Datasets/Datasets")
dates <- read_csv("dates_example.csv")
Rows: 12 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Date
dbl (1): Value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
yearMonthDay <- read_csv("year_month_day_sample.csv")
Rows: 12 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): YearMonth
dbl (1): recording

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
rr_YMD <- read_csv("year_month_day_sample.csv")
Rows: 12 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): YearMonth
dbl (1): recording

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View “dates” dataset

head(dates)
# A tibble: 6 × 2
  Date    Value
  <chr>   <dbl>
1 2020-01    25
2 2020-02    35
3 2020-03    18
4 2020-04    27
5 2020-05    22
6 2020-06    36

Find the class for the Date variable

class(dates$Date)
[1] "character"

Notice that the variable “Date” is a character string. It needs to be converted so that R recognizes it as a date.

Use lubridate to make “Date” a recognized date

dates$fixdate <- lubridate::ym(dates$Date)

Check that the class of the new variable “fixdate” is correct

head(dates)
# A tibble: 6 × 3
  Date    Value fixdate   
  <chr>   <dbl> <date>    
1 2020-01    25 2020-01-01
2 2020-02    35 2020-02-01
3 2020-03    18 2020-03-01
4 2020-04    27 2020-04-01
5 2020-05    22 2020-05-01
6 2020-06    36 2020-06-01

Now you can see that the dates are recognized and recorded properly
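As a rough sketch of what ym() is doing, the same conversion can be written in base R by appending a day to each year-month string. The vector x below is a hypothetical stand-in for dates$Date, not the class dataset.

```r
library(lubridate)

# Hypothetical year-month strings in the same shape as dates$Date
x <- c("2020-01", "2020-02", "2020-03")

# lubridate: ym() fills in the first day of each month
fixed_lubridate <- ym(x)

# base R: add the day explicitly, then convert
fixed_base <- as.Date(paste0(x, "-01"))

class(fixed_lubridate)             # "Date"
all(fixed_lubridate == fixed_base) # TRUE
```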

R has several ways to store Dates and Times

  • Date - stores dates only, not times; internally a Date is the number of days since 1970-01-01 (the epoch)
  • POSIXct - supports dates AND times, including a time zone (POSIX stands for Portable Operating System Interface); values are stored as a numeric vector of seconds since the epoch, which tends to be easier to work with than POSIXlt
  • POSIXlt - similar to POSIXct, but values are stored as a list of components (year, month, day, hour, and so on); to see all the values, use unclass()

As long as you are not working with times, use the Date class.
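The difference between the three classes is easiest to see by looking at what is stored underneath. This is a small illustration with made-up values; the time zone is fixed to UTC so the numbers do not depend on your computer's settings.

```r
d  <- as.Date("1970-01-10")
ct <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
lt <- as.POSIXlt("2020-01-07 08:30:00", tz = "UTC")

as.numeric(d)    # 9: days since 1970-01-01
as.numeric(ct)   # 86400: seconds since 1970-01-01 00:00:00 UTC

# POSIXlt is a list of components
lt$mday          # 7: day of the month
lt$mon + 1       # 1: months are stored as 0-11
lt$year + 1900   # 2020: years are stored as years since 1900
```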

Unexpected formats for dates make things tricky

  • January 7, 2020
  • 1/7/20
  • Tuesday, Jan 7, 2020
  • Or something else
# Here is an example of a string that doesn't work - remove the hashtag to run it
# as.Date("January 7, 2020")  # this code returns an error we will fix in the next chunk

Notice the error: Error in charToDate(x) : character string is not in a standard unambiguous format

# Include the format
as.Date("January 7, 2020", format = "%B %d, %Y")
[1] "2020-01-07"
# %B indicates a full month name, %d is the day, and %Y is the 4-digit year (%y gives the 2-digit year)

Now you can see that R understands this date

# to get a complete list of time formats  - remove the hashtag to run it
# ?strptime
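The other formats listed earlier can be handled the same way. The sketch below assumes that "1/7/20" means month/day/year and that R is running in an English locale, since month and weekday names are matched against the current locale.

```r
# Slash-separated date with a 2-digit year (assumed to be month/day/year)
d1 <- as.Date("1/7/20", format = "%m/%d/%y")

# Weekday name, abbreviated month name, day, 4-digit year
d2 <- as.Date("Tuesday, Jan 7, 2020", format = "%A, %b %d, %Y")

d1   # "2020-01-07"
d2   # "2020-01-07"

# format() goes the other way: from a Date back to a character string
format(d1, "%d %B %Y")   # "07 January 2020"
```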

Convert a specific date

# On Friday, March 13, 2020, most of the United States went into lockdown due to the COVID-19 pandemic
pandemic <- "Friday, March 13, 2020"
# convert to Date class

as.Date(pandemic, format = "%A, %B %d, %Y")
[1] "2020-03-13"

Date/Time conversions are not always clear or easy, but as long as you have the tools, it will make much more sense

Comparing and Manipulating Dates

# Compare the epoch to today's date
epoch <- as.Date("1970-01-01")  # store the epoch as a Date, not a character string
today <- Sys.Date()
# Try logical statements:
today > epoch
[1] TRUE
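Because Date values are numbers underneath, they can also be subtracted and shifted. A small example with two fixed dates (chosen only for illustration):

```r
start <- as.Date("2020-01-01")
end   <- as.Date("2020-01-31")

end - start                          # Time difference of 30 days
difftime(end, start, units = "weeks") # the same gap expressed in weeks
start + 7                            # "2020-01-08": one week later
end > start                          # TRUE
```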

Create a sequence of dates

DaysofJan <- seq.Date(from = as.Date("2020/01/01"), 
                      to = as.Date("2020/01/31"),
                      by = "day")
DaysofJan
 [1] "2020-01-01" "2020-01-02" "2020-01-03" "2020-01-04" "2020-01-05"
 [6] "2020-01-06" "2020-01-07" "2020-01-08" "2020-01-09" "2020-01-10"
[11] "2020-01-11" "2020-01-12" "2020-01-13" "2020-01-14" "2020-01-15"
[16] "2020-01-16" "2020-01-17" "2020-01-18" "2020-01-19" "2020-01-20"
[21] "2020-01-21" "2020-01-22" "2020-01-23" "2020-01-24" "2020-01-25"
[26] "2020-01-26" "2020-01-27" "2020-01-28" "2020-01-29" "2020-01-30"
[31] "2020-01-31"
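The by argument also accepts other steps such as "week", "month", or a multiple like "2 weeks". The object names below are made up for this sketch.

```r
# First day of every month in 2020: give a start, a step, and a length
FirstOfMonth <- seq(as.Date("2020-01-01"), by = "month", length.out = 12)

# Every other week between two dates
EveryOtherWeek <- seq(from = as.Date("2020-01-01"),
                      to = as.Date("2020-03-01"),
                      by = "2 weeks")

FirstOfMonth[12]   # "2020-12-01"
EveryOtherWeek     # "2020-01-01" "2020-01-15" "2020-01-29" "2020-02-12" "2020-02-26"
```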

Create a sequence of dates/times

TimesToday <- seq.POSIXt(from = as.POSIXct("2021/05/26 00:00:00"),
                          to = as.POSIXct("2021/05/27 00:00:00"),
                          by = "hour")
TimesToday
 [1] "2021-05-26 00:00:00 EDT" "2021-05-26 01:00:00 EDT"
 [3] "2021-05-26 02:00:00 EDT" "2021-05-26 03:00:00 EDT"
 [5] "2021-05-26 04:00:00 EDT" "2021-05-26 05:00:00 EDT"
 [7] "2021-05-26 06:00:00 EDT" "2021-05-26 07:00:00 EDT"
 [9] "2021-05-26 08:00:00 EDT" "2021-05-26 09:00:00 EDT"
[11] "2021-05-26 10:00:00 EDT" "2021-05-26 11:00:00 EDT"
[13] "2021-05-26 12:00:00 EDT" "2021-05-26 13:00:00 EDT"
[15] "2021-05-26 14:00:00 EDT" "2021-05-26 15:00:00 EDT"
[17] "2021-05-26 16:00:00 EDT" "2021-05-26 17:00:00 EDT"
[19] "2021-05-26 18:00:00 EDT" "2021-05-26 19:00:00 EDT"
[21] "2021-05-26 20:00:00 EDT" "2021-05-26 21:00:00 EDT"
[23] "2021-05-26 22:00:00 EDT" "2021-05-26 23:00:00 EDT"
[25] "2021-05-27 00:00:00 EDT"

Today’s time

# Today's time
Sys.time()
[1] "2023-09-08 15:46:30 EDT"

Round to the hour

round(Sys.time(), "hour")
[1] "2023-09-08 16:00:00 EDT"

Round to the month

round(Sys.time(), "mon")
[1] "2023-09-01 EDT"

Truncate the time to the hour

trunc(Sys.time(), "hour")
[1] "2023-09-08 15:00:00 EDT"
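Sys.time() changes every time it is run, so the results above will differ on your machine. Here is the same idea with a fixed timestamp (an arbitrary value in UTC), which makes the difference between rounding and truncating easier to check.

```r
x <- as.POSIXct("2023-09-08 15:46:30", tz = "UTC")

# round() goes to the nearest hour; trunc() always goes down
format(round(x, "hours"), "%Y-%m-%d %H:%M")  # "2023-09-08 16:00"
format(trunc(x, "hours"), "%Y-%m-%d %H:%M")  # "2023-09-08 15:00"
format(trunc(x, "days"),  "%Y-%m-%d %H:%M")  # "2023-09-08 00:00"
```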

Using the package “lubridate” to easily read and convert dates

First, the base R conversion, which needs an explicit format string. Compare it with the lubridate version in the chunk after it:

as.Date("June 3, 2020", format = "%B %d, %Y")
[1] "2020-06-03"

Now use lubridate

library(lubridate)  # don't forget to refer to the lubridate cheat sheet
mdy("June 3, 2020")
[1] "2020-06-03"

System Date/Time information

Sys.Date() # to get today's date
[1] "2023-09-08"
today()  # the same function in lubridate
[1] "2023-09-08"
Sys.time()
[1] "2023-09-08 15:46:31 EDT"
now()  # lubridate's version of Sys.time()
[1] "2023-09-08 15:46:31 EDT"

Use lubridate to convert year month day hours minutes seconds

ymd("2014-07-13 16:00:00 -0300")  
Warning: All formats failed to parse. No formats found.
[1] NA
# this will not work because of the hours:minutes:seconds
ymd_hms("2014-07-13 16:00:00 -0300")
[1] "2014-07-13 19:00:00 UTC"
mdy_hm("July 13, 2014 4:00 pm")
[1] "2014-07-13 16:00:00 UTC"
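lubridate converts the -0300 offset to UTC, which is why 16:00 is shown as 19:00. The sketch below shows one way to view the same instant on a local clock with with_tz(), and how to attach a zone when parsing with the tz argument. The zone names "America/Sao_Paulo" and "America/New_York" are assumptions chosen for illustration; the original string only gives an offset.

```r
library(lubridate)

x <- ymd_hms("2014-07-13 16:00:00 -0300")   # stored as 19:00 UTC

# Same instant, displayed in a zone that was 3 hours behind UTC on that date
local <- with_tz(x, tzone = "America/Sao_Paulo")
format(local, "%H:%M")   # "16:00"
x == local               # TRUE: only the display changed

# For a string with no offset, tz tells the parser which zone it belongs to
ny <- ymd_hms("2014-07-13 16:00:00", tz = "America/New_York")
format(with_tz(ny, "UTC"), "%H:%M")   # "20:00"
```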

Use lubridate to read in Pablo Picasso’s birthday

# see how easy!
PabloPicassoBday <- mdy_hm("October 25, 1881, 11:15 PM")

# extract the year (a number) and add 3 to it
year(PabloPicassoBday) + 3
[1] 1884
# You can also extract the month or day or day of the week

month(PabloPicassoBday)
[1] 10
day(PabloPicassoBday)
[1] 25
wday(PabloPicassoBday)  # this returns 3, which is Tuesday when the week starts on Sunday (day 1)
[1] 3
wday(PabloPicassoBday, label = TRUE, abbr = FALSE)  # This returns Tuesday
[1] Tuesday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
am(PabloPicassoBday)  # Was he born in the morning?
[1] FALSE
leap_year(PabloPicassoBday) # Was he born in a leap year?
[1] FALSE
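Note that year(PabloPicassoBday) + 3 returns a plain number, not a date. To shift the date-time itself, one option is to add a lubridate period such as years(3). The name ThreeYearsLater is made up for this sketch.

```r
library(lubridate)

PabloPicassoBday <- mdy_hm("October 25, 1881, 11:15 PM")

# Adding a period keeps the result as a date-time
ThreeYearsLater <- PabloPicassoBday + years(3)

year(ThreeYearsLater)                       # 1884
format(ThreeYearsLater, "%Y-%m-%d %H:%M")   # "1884-10-25 23:15"
```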

Rounding and Truncating dates and times with lubridate

# from base R
Sys.time()
[1] "2023-09-08 15:46:31 EDT"
round(Sys.time(), "hour")
[1] "2023-09-08 16:00:00 EDT"
round(Sys.time(), "year")
[1] "2024-01-01 EST"
trunc(Sys.time(), "year")
[1] "2023-01-01 EST"

Lubridate provides similar functions with names that are easier to understand

now() # equivalent to Sys.time()
[1] "2023-09-08 15:46:31 EDT"
floor_date(now(), unit = "month") # round down
[1] "2023-09-01 EDT"
floor_date(now(), unit = "year")
[1] "2023-01-01 EST"
round_date(now(), unit = "hour") # round to nearest unit
[1] "2023-09-08 16:00:00 EDT"
ceiling_date(now(), unit = "minutes") # round up
[1] "2023-09-08 15:47:00 EDT"
# last day of previous month
rollback(now(), roll_to_first = FALSE, preserve_hms = FALSE)
[1] "2023-08-31 EDT"
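Since now() changes on every run, here is a sketch of the same functions on a fixed timestamp (an arbitrary value in UTC). It also shows one possible way to get the last day of the current month by rounding up and stepping back one day.

```r
library(lubridate)

x <- ymd_hms("2023-09-08 15:46:31", tz = "UTC")

format(floor_date(x, unit = "month"), "%Y-%m-%d")      # "2023-09-01"
format(ceiling_date(x, unit = "month"), "%Y-%m-%d")    # "2023-10-01"
format(round_date(x, unit = "hour"), "%H:%M")          # "16:00"

# last day of the current month: first day of next month minus one day
LastDay <- as_date(ceiling_date(x, unit = "month")) - days(1)
LastDay                                                # "2023-09-30"
```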

Time Series Data

Time series data are observations recorded in chronological order, and they may show patterns over time

Time Series Classes in R

  • ts - time series
  • zoo - Zeileis’s ordered observations (indexed, totally ordered observations, such as irregular time series; stock market data is one example)

Time series analysis is a complex topic that involves heavy statistics and matrix manipulations

Class TS for time series

# data is a matrix
# creating sample data. Four sine waves
mydata <- matrix(c(sin(seq(0,10,0.1)), 
                 sin(seq(1,11,0.1)),
                 sin(seq(2,12,0.1)),
                 sin(seq(3,13,0.1))),
                 ncol = 4)
# each column is one series of observations; each row is one time step (spacing is set by frequency or deltat)
head(mydata)
           [,1]      [,2]      [,3]        [,4]
[1,] 0.00000000 0.8414710 0.9092974  0.14112001
[2,] 0.09983342 0.8912074 0.8632094  0.04158066
[3,] 0.19866933 0.9320391 0.8084964 -0.05837414
[4,] 0.29552021 0.9635582 0.7457052 -0.15774569
[5,] 0.38941834 0.9854497 0.6754632 -0.25554110
[6,] 0.47942554 0.9974950 0.5984721 -0.35078323
howOftenObserved <- 12 #monthly
# annual = 1, quarter = 4, month = 12, week = 52
startSeries <- ts(mydata, 
   start = 2019,
   frequency = howOftenObserved)
head(startSeries)
       Series 1  Series 2  Series 3    Series 4
[1,] 0.00000000 0.8414710 0.9092974  0.14112001
[2,] 0.09983342 0.8912074 0.8632094  0.04158066
[3,] 0.19866933 0.9320391 0.8084964 -0.05837414
[4,] 0.29552021 0.9635582 0.7457052 -0.15774569
[5,] 0.38941834 0.9854497 0.6754632 -0.25554110
[6,] 0.47942554 0.9974950 0.5984721 -0.35078323
# instead of frequency, delta between observations
# use frequency OR delta
deltaObserve <-  1/12

ts1 <- ts(mydata, 
   start = 2019,
   deltat = deltaObserve)
head(ts1)
       Series 1  Series 2  Series 3    Series 4
[1,] 0.00000000 0.8414710 0.9092974  0.14112001
[2,] 0.09983342 0.8912074 0.8632094  0.04158066
[3,] 0.19866933 0.9320391 0.8084964 -0.05837414
[4,] 0.29552021 0.9635582 0.7457052 -0.15774569
[5,] 0.38941834 0.9854497 0.6754632 -0.25554110
[6,] 0.47942554 0.9974950 0.5984721 -0.35078323
# name the series
pumpNames <- c("East", "West", "North", "South")
ts2 <- ts(mydata,
   end = 2019,
   frequency = howOftenObserved,
   names = pumpNames)
head(ts2)
           East      West     North       South
[1,] 0.00000000 0.8414710 0.9092974  0.14112001
[2,] 0.09983342 0.8912074 0.8632094  0.04158066
[3,] 0.19866933 0.9320391 0.8084964 -0.05837414
[4,] 0.29552021 0.9635582 0.7457052 -0.15774569
[5,] 0.38941834 0.9854497 0.6754632 -0.25554110
[6,] 0.47942554 0.9974950 0.5984721 -0.35078323
# specify where the series ends (instead of where it starts) and name the series
endSeries <- ts(mydata,
                end = 2019,
                frequency = howOftenObserved,
                names = pumpNames)
# Now the column names are the pump names: East, West, North, and South
head(endSeries)
           East      West     North       South
[1,] 0.00000000 0.8414710 0.9092974  0.14112001
[2,] 0.09983342 0.8912074 0.8632094  0.04158066
[3,] 0.19866933 0.9320391 0.8084964 -0.05837414
[4,] 0.29552021 0.9635582 0.7457052 -0.15774569
[5,] 0.38941834 0.9854497 0.6754632 -0.25554110
[6,] 0.47942554 0.9974950 0.5984721 -0.35078323
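Because head() only prints the values, it does not show what start, end, and frequency actually did. A small sketch with a made-up monthly series (the numbers 1 to 24) shows how to inspect the time attributes and pull out part of a series with window().

```r
# 24 monthly observations starting in January 2019
x <- ts(1:24, start = c(2019, 1), frequency = 12)

start(x)       # 2019 1
end(x)         # 2020 12
frequency(x)   # 12

# keep only June through August 2019
summer <- window(x, start = c(2019, 6), end = c(2019, 8))
summer         # values 6, 7, 8
```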

Use the “zoo” package

“zoo” is the name of an R package for time series analysis; the name stands for “Z’s ordered observations”

Read in weather data from a text file hosted on a website

library(zoo)

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
weatherData <- read_table("https://raw.githubusercontent.com/lyndadotcom/LPO_weatherdata/master/Environmental_Data_Deep_Moor_2015.txt")

── Column specification ────────────────────────────────────────────────────────
cols(
  date = col_character(),
  time = col_time(format = ""),
  Air_Temp = col_double(),
  Barometric_Press = col_double(),
  Dew_Point = col_double(),
  Relative_Humidity = col_double(),
  Wind_Dir = col_double(),
  Wind_Gust = col_double(),
  Wind_Speed = col_double()
)
head(weatherData) # take a look at the data. Column 1 & 2 are date & time
# A tibble: 6 × 9
  date     time   Air_Temp Barometric_Press Dew_Point Relative_Humidity Wind_Dir
  <chr>    <time>    <dbl>            <dbl>     <dbl>             <dbl>    <dbl>
1 2015_01… 02'43"     19.5             30.6      14.8              81.6     160.
2 2015_01… 02'52"     19.5             30.6      14.8              81.6     160.
3 2015_01… 07'43"     19.5             30.6      14.7              81.2     156.
4 2015_01… 07'52"     19.5             30.6      14.7              81.2     156.
5 2015_01… 12'43"     19.7             30.6      15.2              82.4     167.
6 2015_01… 12'52"     19.7             30.6      15.2              82.4     167.
# ℹ 2 more variables: Wind_Gust <dbl>, Wind_Speed <dbl>
zooweather_data <- as.matrix(weatherData[ , -(1:2)]) # -(1:2) removes the date and time columns, leaving 7 numeric columns
head(zooweather_data)
     Air_Temp Barometric_Press Dew_Point Relative_Humidity Wind_Dir Wind_Gust
[1,]    19.50            30.62     14.78              81.6   159.78        14
[2,]    19.50            30.62     14.78              81.6   159.78        14
[3,]    19.50            30.61     14.66              81.2   155.63        11
[4,]    19.50            30.61     14.66              81.2   155.63        11
[5,]    19.68            30.61     15.20              82.4   166.59        12
[6,]    19.68            30.61     15.20              82.4   166.59        12
     Wind_Speed
[1,]        9.2
[2,]        9.2
[3,]        8.6
[4,]        8.6
[5,]        9.4
[6,]        9.4
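A zoo object pairs the values with an ordering index through zoo(x, order.by = ...). The sketch below uses a few made-up temperature readings on irregularly spaced dates rather than the weather file, so obs_dates and air_temp are hypothetical.

```r
library(zoo)

# Made-up readings on irregularly spaced dates
obs_dates <- as.Date(c("2015-01-01", "2015-01-02", "2015-01-05", "2015-01-09"))
air_temp  <- c(19.5, 19.7, 21.0, 18.2)

z <- zoo(air_temp, order.by = obs_dates)

index(z)      # the dates
coredata(z)   # the values without the index

# subset by date range
z_sub <- window(z, start = as.Date("2015-01-02"), end = as.Date("2015-01-05"))
coredata(z_sub)   # 19.7 21.0
```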

Load a library “tibbletime”

‘tibbletime’ is an extension that allows for the creation of time aware tibbles.

library(tibbletime)

Attaching package: 'tibbletime'
The following object is masked from 'package:stats':

    filter

Create a time series plot using ggplot

p1 <- read_table("https://raw.githubusercontent.com/lyndadotcom/LPO_weatherdata/master/Environmental_Data_Deep_Moor_2015.txt") |> 
   unite(datetime, date, time) |>           # combine the date and time columns into one
   mutate(datetime = ymd_hms(datetime)) |>  # parse the combined string as a date-time
   tbl_time(index = datetime) |>  # tibbletime is built to live in the tidyverse
   ggplot(aes(x = datetime, y = Air_Temp)) +
   geom_line(color = "#00AFBB", linewidth = .25) + 
   ggtitle("Time Series Plot of Air Temperatures in 2015")

── Column specification ────────────────────────────────────────────────────────
cols(
  date = col_character(),
  time = col_time(format = ""),
  Air_Temp = col_double(),
  Barometric_Press = col_double(),
  Dew_Point = col_double(),
  Relative_Humidity = col_double(),
  Wind_Dir = col_double(),
  Wind_Gust = col_double(),
  Wind_Speed = col_double()
)
p1
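Once the x variable is a real date, ggplot2 can also format the axis labels with the same % codes used earlier. This is a sketch on made-up data (the toy data frame and p2 are hypothetical, not the weather file), using scale_x_date() for a Date column.

```r
library(ggplot2)

# Made-up daily temperatures for the first 90 days of 2015
toy <- data.frame(
  day  = seq(as.Date("2015-01-01"), by = "day", length.out = 90),
  temp = 20 + 10 * sin(seq(0, 3, length.out = 90))
)

p2 <- ggplot(toy, aes(x = day, y = temp)) +
  geom_line(color = "#00AFBB", linewidth = 0.25) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +  # one tick per month
  ggtitle("Toy Time Series with Monthly Axis Labels")

p2
```

For a date-time (POSIXct) column such as datetime in the plot above, the matching function is scale_x_datetime().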