Jeff Walker
2013-08-06
Dates and times can be one of the most frustrating aspects of data analysis, and is probably too often handled incorrectly. Fortunately, the lubridate package makes dealing with datetimes easier in R. But we still need to be careful when dealing with timezones as the conversion from UTC to local time zones is sometimes not intuitive.
Datetimes are best stored as a continuous numeric value (integer or float) representing the amount of time (usually seconds) since some datum (aka epoch). In R, datetimes are best stored using the POSIXct class, which represents datetimes as the number of seconds since the start of 1970 (aka the Unix epoch).
It is important to recognize that datetimes can be represented by two key formats:
Any given datetime is a unique value according to clock time, but can have different calendar time representations depending on the timezone.
lubridate provides a series of shortcut parsing functions for convert timestamps from strings to POSIXct objects.
library(lubridate)
ymd(c("2000-01-01", "2000-01-02", "2000-01-03"))
## [1] "2000-01-01 UTC" "2000-01-02 UTC" "2000-01-03 UTC"
ymd_hm(c("2000-01-01 00:00", "2000-01-02 01:00", "2000-01-03 02:30"))
## [1] "2000-01-01 00:00:00 UTC" "2000-01-02 01:00:00 UTC"
## [3] "2000-01-03 02:30:00 UTC"
By default, lubridate assumes timestamps are in UTC. We can specify a timezone using the tz argument to the parsing function.
ymd_hm(c("2000-01-01 00:00", "2000-01-02 01:00", "2000-01-03 02:30"), tz = "America/New_York")
## [1] "2000-01-01 00:00:00 EST" "2000-01-02 01:00:00 EST"
## [3] "2000-01-03 02:30:00 EST"
Note that lubridate has automatically accounted for daylight savings time (EST vs EDT).
ymd_hm(c("2000-01-01 06:00", "2000-07-01 06:00"), tz = "America/New_York")
## [1] "2000-01-01 06:00:00 EST" "2000-07-01 06:00:00 EDT"
The POSIXct class stores dates and times as the signed number of seconds since the beginning of 1970 (i.e. the Unix epoch).
utc <- ymd_hm(c("1970-01-01 00:00", "1971-01-01 00:00", "1972-01-01 00:00"))
as.numeric(utc) # number of seconds since 1970-01-01 00:00
## [1] 0 31536000 63072000
diff(as.numeric(utc))/86400 # number of days between each consecutive datetime
## [1] 365 365
After a datetime has been parsed, its timezone can be changed in two ways:
1. Conversion from one timezone to another using with_tz() changes the calendar time representation, but not the clock time
2. Forcing a specific timezone using force_tz() changes the clock time, but not the calendar time representation
To convert from one timezone to another, use with_tz():
utc <- ymd_hm(c("2000-01-01 06:00", "2000-07-01 06:00"))
et <- with_tz(utc, tz = "America/New_York")
The calendar representations are now different:
print(utc)
## [1] "2000-01-01 06:00:00 UTC" "2000-07-01 06:00:00 UTC"
print(et)
## [1] "2000-01-01 01:00:00 EST" "2000-07-01 02:00:00 EDT"
The clock time representations are still the same:
rbind(utc = utc, et = et, diff = utc - et)
## [,1] [,2]
## utc 946706400 962431200
## et 946706400 962431200
## diff 0 0
The conversion to eastern timezone affects not only how the underlying numeric values are printed as calendar datetimes, but also the extraction of the datetime components (e.g. year(), month(), day(), yday(), hour()), which are based on the calendar representation:
hour(utc)
## [1] 6 6
hour(et)
## [1] 1 2
If a timestamp string was parsed using the incorrect timezone, then the resulting POSIXct values can be forced to use a different timezone.
utc <- ymd_hm(c("2000-01-01 06:00", "2000-07-01 06:00"))
et <- force_tz(utc, tz = "America/New_York")
print(utc)
## [1] "2000-01-01 06:00:00 UTC" "2000-07-01 06:00:00 UTC"
print(et)
## [1] "2000-01-01 06:00:00 EST" "2000-07-01 06:00:00 EDT"
The POSIXct values now look identical, but are in fact different datetimes as indicated by their numeric values:
rbind(utc = utc, et = et, diff_sec = as.numeric(utc) - as.numeric(et), diff_hr = (as.numeric(utc) -
as.numeric(et))/3600)
## [,1] [,2]
## utc 946706400 962431200
## et 946724400 962445600
## diff_sec -18000 -14400
## diff_hr -5 -4
In summary, conversion of timezones using with_tz() will not change the underlying numeric value but will change the interpretation of this value in calendar format. Forcing a timezone using force_tz() will change the underlying numeric value but will not change the interpretation in calendar format.
Mis-matching timezones when merging data frames based on the timestamp can be a problem.
df.utc <- data.frame(DATETIME = ymd_hm(c("2000-01-01 06:00", "2000-07-01 07:00")))
df.utc$HR_UTC <- hour(df.utc$DATETIME)
print(df.utc)
## DATETIME HR_UTC
## 1 2000-01-01 06:00:00 6
## 2 2000-07-01 07:00:00 7
df.et <- data.frame(DATETIME = ymd_hm(c("2000-01-01 06:00", "2000-07-01 07:00"),
tz = "America/New_York"))
df.et$HR_ET <- hour(df.et$DATETIME)
print(df.et)
## DATETIME HR_ET
## 1 2000-01-01 06:00:00 6
## 2 2000-07-01 07:00:00 7
The two data frames appear to have the same datetimes, but when we try to merge them, the df.et datetimes are converted to UTC:
df.merge <- merge(df.utc, df.et, all = TRUE)
print(df.merge)
## DATETIME HR_UTC HR_ET
## 1 2000-01-01 06:00:00 6 NA
## 2 2000-01-01 11:00:00 NA 6
## 3 2000-07-01 07:00:00 7 NA
## 4 2000-07-01 11:00:00 NA 7
print(tz(df.merge$DATETIME))
## [1] "UTC"
This is also a problem when joining a continuous timeseries (e.g. rainfall/streamflow) with a non-continuous timeseries (e.g. water quality).
df.utc <- data.frame(DATETIME = ymd_hm("2000-01-01 00:00") + dhours(0:23))
df.utc$HR_UTC <- hour(df.utc$DATETIME)
df.et <- data.frame(DATETIME = ymd_hm(c("2000-01-01 06:00", "2000-01-01 18:00"),
tz = "America/New_York"))
df.et$HR_ET <- hour(df.et$DATETIME)
print(df.et)
## DATETIME HR_ET
## 1 2000-01-01 06:00:00 6
## 2 2000-01-01 18:00:00 18
df.merge <- merge(df.utc, df.et, all = TRUE)
print(df.merge)
## DATETIME HR_UTC HR_ET
## 1 2000-01-01 00:00:00 0 NA
## 2 2000-01-01 01:00:00 1 NA
## 3 2000-01-01 02:00:00 2 NA
## 4 2000-01-01 03:00:00 3 NA
## 5 2000-01-01 04:00:00 4 NA
## 6 2000-01-01 05:00:00 5 NA
## 7 2000-01-01 06:00:00 6 NA
## 8 2000-01-01 07:00:00 7 NA
## 9 2000-01-01 08:00:00 8 NA
## 10 2000-01-01 09:00:00 9 NA
## 11 2000-01-01 10:00:00 10 NA
## 12 2000-01-01 11:00:00 11 6
## 13 2000-01-01 12:00:00 12 NA
## 14 2000-01-01 13:00:00 13 NA
## 15 2000-01-01 14:00:00 14 NA
## 16 2000-01-01 15:00:00 15 NA
## 17 2000-01-01 16:00:00 16 NA
## 18 2000-01-01 17:00:00 17 NA
## 19 2000-01-01 18:00:00 18 NA
## 20 2000-01-01 19:00:00 19 NA
## 21 2000-01-01 20:00:00 20 NA
## 22 2000-01-01 21:00:00 21 NA
## 23 2000-01-01 22:00:00 22 NA
## 24 2000-01-01 23:00:00 23 18
We get the correct date if the datetimes in df.et are actually the same UTC, but converted to EST/EDT, which changes their calendar representation.
df.utc <- data.frame(DATETIME = ymd_hm("2000-01-01 00:00") + dhours(0:23))
df.utc$HR_UTC <- hour(df.utc$DATETIME)
df.et <- data.frame(DATETIME = ymd_hm(c("2000-01-01 06:00", "2000-01-01 18:00")))
df.et$DATETIME <- with_tz(df.et$DATETIME, "America/New_York")
df.et$HR_ET <- hour(df.et$DATETIME)
print(df.et)
## DATETIME HR_ET
## 1 2000-01-01 01:00:00 1
## 2 2000-01-01 13:00:00 13
df.merge <- merge(df.utc, df.et, all = TRUE)
print(df.merge)
## DATETIME HR_UTC HR_ET
## 1 2000-01-01 00:00:00 0 NA
## 2 2000-01-01 01:00:00 1 NA
## 3 2000-01-01 02:00:00 2 NA
## 4 2000-01-01 03:00:00 3 NA
## 5 2000-01-01 04:00:00 4 NA
## 6 2000-01-01 05:00:00 5 NA
## 7 2000-01-01 06:00:00 6 1
## 8 2000-01-01 07:00:00 7 NA
## 9 2000-01-01 08:00:00 8 NA
## 10 2000-01-01 09:00:00 9 NA
## 11 2000-01-01 10:00:00 10 NA
## 12 2000-01-01 11:00:00 11 NA
## 13 2000-01-01 12:00:00 12 NA
## 14 2000-01-01 13:00:00 13 NA
## 15 2000-01-01 14:00:00 14 NA
## 16 2000-01-01 15:00:00 15 NA
## 17 2000-01-01 16:00:00 16 NA
## 18 2000-01-01 17:00:00 17 NA
## 19 2000-01-01 18:00:00 18 13
## 20 2000-01-01 19:00:00 19 NA
## 21 2000-01-01 20:00:00 20 NA
## 22 2000-01-01 21:00:00 21 NA
## 23 2000-01-01 22:00:00 22 NA
## 24 2000-01-01 23:00:00 23 NA
Alternatively, if the original timestamp strings in df.et are the local representation, then we can assign the timezone in the parsing function.
df.utc <- data.frame(DATETIME = ymd_hm("2000-01-01 00:00") + dhours(0:23))
df.utc$HR_UTC <- hour(df.utc$DATETIME)
df.et <- data.frame(DATETIME = ymd_hm(c("2000-01-01 01:00", "2000-01-01 13:00"),
tz = "America/New_York"))
df.et$HR_ET <- hour(df.et$DATETIME)
print(df.et)
## DATETIME HR_ET
## 1 2000-01-01 01:00:00 1
## 2 2000-01-01 13:00:00 13
df.merge <- merge(df.utc, df.et, all = TRUE)
print(df.merge)
## DATETIME HR_UTC HR_ET
## 1 2000-01-01 00:00:00 0 NA
## 2 2000-01-01 01:00:00 1 NA
## 3 2000-01-01 02:00:00 2 NA
## 4 2000-01-01 03:00:00 3 NA
## 5 2000-01-01 04:00:00 4 NA
## 6 2000-01-01 05:00:00 5 NA
## 7 2000-01-01 06:00:00 6 1
## 8 2000-01-01 07:00:00 7 NA
## 9 2000-01-01 08:00:00 8 NA
## 10 2000-01-01 09:00:00 9 NA
## 11 2000-01-01 10:00:00 10 NA
## 12 2000-01-01 11:00:00 11 NA
## 13 2000-01-01 12:00:00 12 NA
## 14 2000-01-01 13:00:00 13 NA
## 15 2000-01-01 14:00:00 14 NA
## 16 2000-01-01 15:00:00 15 NA
## 17 2000-01-01 16:00:00 16 NA
## 18 2000-01-01 17:00:00 17 NA
## 19 2000-01-01 18:00:00 18 13
## 20 2000-01-01 19:00:00 19 NA
## 21 2000-01-01 20:00:00 20 NA
## 22 2000-01-01 21:00:00 21 NA
## 23 2000-01-01 22:00:00 22 NA
## 24 2000-01-01 23:00:00 23 NA
Or if both data frames are in local time zone.
df.x <- data.frame(DATETIME = ymd_hm("2000-01-01 00:00", tz = "America/New_York") +
dhours(0:23))
df.x$HR_X <- hour(df.x$DATETIME)
df.y <- data.frame(DATETIME = ymd_hm(c("2000-01-01 06:00", "2000-01-01 18:00"),
tz = "America/New_York"))
df.y$HR_Y <- hour(df.y$DATETIME)
print(df.y)
## DATETIME HR_Y
## 1 2000-01-01 06:00:00 6
## 2 2000-01-01 18:00:00 18
df.merge <- merge(df.x, df.y, all = TRUE)
print(df.merge)
## DATETIME HR_X HR_Y
## 1 2000-01-01 00:00:00 0 NA
## 2 2000-01-01 01:00:00 1 NA
## 3 2000-01-01 02:00:00 2 NA
## 4 2000-01-01 03:00:00 3 NA
## 5 2000-01-01 04:00:00 4 NA
## 6 2000-01-01 05:00:00 5 NA
## 7 2000-01-01 06:00:00 6 6
## 8 2000-01-01 07:00:00 7 NA
## 9 2000-01-01 08:00:00 8 NA
## 10 2000-01-01 09:00:00 9 NA
## 11 2000-01-01 10:00:00 10 NA
## 12 2000-01-01 11:00:00 11 NA
## 13 2000-01-01 12:00:00 12 NA
## 14 2000-01-01 13:00:00 13 NA
## 15 2000-01-01 14:00:00 14 NA
## 16 2000-01-01 15:00:00 15 NA
## 17 2000-01-01 16:00:00 16 NA
## 18 2000-01-01 17:00:00 17 NA
## 19 2000-01-01 18:00:00 18 18
## 20 2000-01-01 19:00:00 19 NA
## 21 2000-01-01 20:00:00 20 NA
## 22 2000-01-01 21:00:00 21 NA
## 23 2000-01-01 22:00:00 22 NA
## 24 2000-01-01 23:00:00 23 NA
Note that the DATETIME column is no shown in local time zone, not UTC.
tz(df.merge$DATETIME)
## [1] "America/New_York"
So what happens to the timestamp around daylight savings time. In 2008, DST began on March 9 at 2 am (local time). For America/New_York, this corresponds to the UTC time 2008-03-09 07:00. Notice the skip from 1 am to 3 am:
dt.utc <- ymd_hm(c("2008-03-09 00:00")) + dhours(0:12)
data.frame(UTC = dt.utc, ET = with_tz(dt.utc, "America/New_York"))
## UTC ET
## 1 2008-03-09 00:00:00 2008-03-08 19:00:00
## 2 2008-03-09 01:00:00 2008-03-08 20:00:00
## 3 2008-03-09 02:00:00 2008-03-08 21:00:00
## 4 2008-03-09 03:00:00 2008-03-08 22:00:00
## 5 2008-03-09 04:00:00 2008-03-08 23:00:00
## 6 2008-03-09 05:00:00 2008-03-09 00:00:00
## 7 2008-03-09 06:00:00 2008-03-09 01:00:00
## 8 2008-03-09 07:00:00 2008-03-09 03:00:00
## 9 2008-03-09 08:00:00 2008-03-09 04:00:00
## 10 2008-03-09 09:00:00 2008-03-09 05:00:00
## 11 2008-03-09 10:00:00 2008-03-09 06:00:00
## 12 2008-03-09 11:00:00 2008-03-09 07:00:00
## 13 2008-03-09 12:00:00 2008-03-09 08:00:00
When DST ends on Nov 2 at 2 am, then we loose an hour and 1 am is repeated:
dt.utc <- ymd_hm(c("2008-11-02 00:00")) + dhours(0:12)
data.frame(UTC = dt.utc, ET = with_tz(dt.utc, "America/New_York"))
## UTC ET
## 1 2008-11-02 00:00:00 2008-11-01 20:00:00
## 2 2008-11-02 01:00:00 2008-11-01 21:00:00
## 3 2008-11-02 02:00:00 2008-11-01 22:00:00
## 4 2008-11-02 03:00:00 2008-11-01 23:00:00
## 5 2008-11-02 04:00:00 2008-11-02 00:00:00
## 6 2008-11-02 05:00:00 2008-11-02 01:00:00
## 7 2008-11-02 06:00:00 2008-11-02 01:00:00
## 8 2008-11-02 07:00:00 2008-11-02 02:00:00
## 9 2008-11-02 08:00:00 2008-11-02 03:00:00
## 10 2008-11-02 09:00:00 2008-11-02 04:00:00
## 11 2008-11-02 10:00:00 2008-11-02 05:00:00
## 12 2008-11-02 11:00:00 2008-11-02 06:00:00
## 13 2008-11-02 12:00:00 2008-11-02 07:00:00
So what happens if you get a sub-hourly time series with the timestamps in the local timezone?
with_tz(ymd_hm(c("2008-11-02 01:30", "2008-11-02 02:00", "2008-11-02 02:30"),
tz = "America/New_York"), "UTC")
## [1] "2008-11-02 05:30:00 UTC" "2008-11-02 07:00:00 UTC"
## [3] "2008-11-02 07:30:00 UTC"
Then one of the hours gets skipped. And when DST ends, there is no unambiguous way to interpret the time during the repeated hour, and lubridate gives us an error:
with_tz(ymd_hm(c("2008-03-09 01:30", "2008-03-09 02:00", "2008-03-09 02:30",
"2008-03-09 03:00"), tz = "America/New_York"), "UTC")
## Warning: 2 failed to parse.
## [1] "2008-03-09 06:30:00 UTC" NA
## [3] NA "2008-03-09 07:00:00 UTC"
Here is an example of streamflow data from USGS at the beginning and end of daylight savings time. The dataset includes a column containing the corresponding timezone (EDT/EST) for each timestamp. Note how the dataset repeats some values during the end of DST with alternating EDT/EST.
usgs <- read.table("usgs.txt", sep = "", col.names = c("SOURCE", "STATION",
"DATE", "TIME", "TIMEZONE", "VALUE", "FLAG"))
usgs$TIMESTAMP <- paste(usgs$DATE, usgs$TIME, sep = " ")
usgs <- usgs[order(usgs$TIMESTAMP), ]
usgs <- subset(usgs, select = c("TIMESTAMP", "TIMEZONE", "VALUE"))
print(usgs)
## TIMESTAMP TIMEZONE VALUE
## 18 2007-11-04 00:00 EDT 68
## 19 2007-11-04 00:15 EDT 68
## 20 2007-11-04 00:30 EDT 67
## 21 2007-11-04 00:45 EDT 67
## 22 2007-11-04 01:00 EDT 65
## 23 2007-11-04 01:00 EST 63
## 24 2007-11-04 01:15 EDT 65
## 25 2007-11-04 01:15 EST 63
## 26 2007-11-04 01:30 EDT 64
## 27 2007-11-04 01:30 EST 63
## 28 2007-11-04 01:45 EDT 64
## 29 2007-11-04 01:45 EST 63
## 30 2007-11-04 02:00 EST 61
## 31 2007-11-04 02:15 EST 61
## 32 2007-11-04 02:30 EST 61
## 33 2007-11-04 02:45 EST 61
## 34 2007-11-04 03:00 EST 60
## 35 2007-11-04 03:15 EST 60
## 36 2007-11-04 03:30 EST 60
## 37 2007-11-04 03:45 EST 60
## 1 2008-03-09 00:00 EST 194
## 2 2008-03-09 00:15 EST 196
## 3 2008-03-09 00:30 EST 196
## 4 2008-03-09 00:45 EST 199
## 5 2008-03-09 01:00 EST 201
## 6 2008-03-09 01:15 EST 201
## 7 2008-03-09 01:30 EST 203
## 8 2008-03-09 01:45 EST 205
## 9 2008-03-09 03:00 EDT 205
## 10 2008-03-09 03:15 EDT 205
## 11 2008-03-09 03:30 EDT 207
## 12 2008-03-09 03:45 EDT 210
## 13 2008-03-09 04:00 EDT 210
## 14 2008-03-09 04:15 EDT 212
## 15 2008-03-09 04:30 EDT 214
## 16 2008-03-09 04:45 EDT 214
## 17 2008-03-09 05:00 EDT 214
We could incorrectly parse the timestamp without consideration for the time zone, which would result in the datetime values in UTC. As a result, the time series would have duplicated timestamps when daylight savings time begins, and a gap when it ends.
usgs$DATETIME_UTC <- ymd_hm(usgs$TIMESTAMP)
print(usgs)
## TIMESTAMP TIMEZONE VALUE DATETIME_UTC
## 18 2007-11-04 00:00 EDT 68 2007-11-04 00:00:00
## 19 2007-11-04 00:15 EDT 68 2007-11-04 00:15:00
## 20 2007-11-04 00:30 EDT 67 2007-11-04 00:30:00
## 21 2007-11-04 00:45 EDT 67 2007-11-04 00:45:00
## 22 2007-11-04 01:00 EDT 65 2007-11-04 01:00:00
## 23 2007-11-04 01:00 EST 63 2007-11-04 01:00:00
## 24 2007-11-04 01:15 EDT 65 2007-11-04 01:15:00
## 25 2007-11-04 01:15 EST 63 2007-11-04 01:15:00
## 26 2007-11-04 01:30 EDT 64 2007-11-04 01:30:00
## 27 2007-11-04 01:30 EST 63 2007-11-04 01:30:00
## 28 2007-11-04 01:45 EDT 64 2007-11-04 01:45:00
## 29 2007-11-04 01:45 EST 63 2007-11-04 01:45:00
## 30 2007-11-04 02:00 EST 61 2007-11-04 02:00:00
## 31 2007-11-04 02:15 EST 61 2007-11-04 02:15:00
## 32 2007-11-04 02:30 EST 61 2007-11-04 02:30:00
## 33 2007-11-04 02:45 EST 61 2007-11-04 02:45:00
## 34 2007-11-04 03:00 EST 60 2007-11-04 03:00:00
## 35 2007-11-04 03:15 EST 60 2007-11-04 03:15:00
## 36 2007-11-04 03:30 EST 60 2007-11-04 03:30:00
## 37 2007-11-04 03:45 EST 60 2007-11-04 03:45:00
## 1 2008-03-09 00:00 EST 194 2008-03-09 00:00:00
## 2 2008-03-09 00:15 EST 196 2008-03-09 00:15:00
## 3 2008-03-09 00:30 EST 196 2008-03-09 00:30:00
## 4 2008-03-09 00:45 EST 199 2008-03-09 00:45:00
## 5 2008-03-09 01:00 EST 201 2008-03-09 01:00:00
## 6 2008-03-09 01:15 EST 201 2008-03-09 01:15:00
## 7 2008-03-09 01:30 EST 203 2008-03-09 01:30:00
## 8 2008-03-09 01:45 EST 205 2008-03-09 01:45:00
## 9 2008-03-09 03:00 EDT 205 2008-03-09 03:00:00
## 10 2008-03-09 03:15 EDT 205 2008-03-09 03:15:00
## 11 2008-03-09 03:30 EDT 207 2008-03-09 03:30:00
## 12 2008-03-09 03:45 EDT 210 2008-03-09 03:45:00
## 13 2008-03-09 04:00 EDT 210 2008-03-09 04:00:00
## 14 2008-03-09 04:15 EDT 212 2008-03-09 04:15:00
## 15 2008-03-09 04:30 EDT 214 2008-03-09 04:30:00
## 16 2008-03-09 04:45 EDT 214 2008-03-09 04:45:00
## 17 2008-03-09 05:00 EDT 214 2008-03-09 05:00:00
Unfortunately, we cannot simply assign the TIMEZONE column to the tz argument of the parsing function, as this needs to be a single string:
usgs$DATETIME_UTC <- ymd_hm(usgs$TIMESTAMP, tz = usgs$TIMEZONE)
## Error: invalid 'tz' value
Furthermore, EDT and EST are imprecise timezone designations (from a global perspective) so there is no automatic way to integrate this information.
We could manually assign the local time zone, and then convert to UTC. This results in a continuous timeseries when DST ends but we sill have duplicated values at the end of DST.
usgs$DATETIME_LOCAL <- ymd_hm(usgs$TIMESTAMP, tz = "America/New_York")
usgs$DATETIME_UTC <- with_tz(usgs$DATETIME_LOCAL, "UTC")
print(usgs)
## TIMESTAMP TIMEZONE VALUE DATETIME_UTC DATETIME_LOCAL
## 18 2007-11-04 00:00 EDT 68 2007-11-04 04:00:00 2007-11-04 00:00:00
## 19 2007-11-04 00:15 EDT 68 2007-11-04 04:15:00 2007-11-04 00:15:00
## 20 2007-11-04 00:30 EDT 67 2007-11-04 04:30:00 2007-11-04 00:30:00
## 21 2007-11-04 00:45 EDT 67 2007-11-04 04:45:00 2007-11-04 00:45:00
## 22 2007-11-04 01:00 EDT 65 2007-11-04 05:00:00 2007-11-04 01:00:00
## 23 2007-11-04 01:00 EST 63 2007-11-04 05:00:00 2007-11-04 01:00:00
## 24 2007-11-04 01:15 EDT 65 2007-11-04 06:15:00 2007-11-04 01:15:00
## 25 2007-11-04 01:15 EST 63 2007-11-04 06:15:00 2007-11-04 01:15:00
## 26 2007-11-04 01:30 EDT 64 2007-11-04 06:30:00 2007-11-04 01:30:00
## 27 2007-11-04 01:30 EST 63 2007-11-04 06:30:00 2007-11-04 01:30:00
## 28 2007-11-04 01:45 EDT 64 2007-11-04 06:45:00 2007-11-04 01:45:00
## 29 2007-11-04 01:45 EST 63 2007-11-04 06:45:00 2007-11-04 01:45:00
## 30 2007-11-04 02:00 EST 61 2007-11-04 07:00:00 2007-11-04 02:00:00
## 31 2007-11-04 02:15 EST 61 2007-11-04 07:15:00 2007-11-04 02:15:00
## 32 2007-11-04 02:30 EST 61 2007-11-04 07:30:00 2007-11-04 02:30:00
## 33 2007-11-04 02:45 EST 61 2007-11-04 07:45:00 2007-11-04 02:45:00
## 34 2007-11-04 03:00 EST 60 2007-11-04 08:00:00 2007-11-04 03:00:00
## 35 2007-11-04 03:15 EST 60 2007-11-04 08:15:00 2007-11-04 03:15:00
## 36 2007-11-04 03:30 EST 60 2007-11-04 08:30:00 2007-11-04 03:30:00
## 37 2007-11-04 03:45 EST 60 2007-11-04 08:45:00 2007-11-04 03:45:00
## 1 2008-03-09 00:00 EST 194 2008-03-09 05:00:00 2008-03-09 00:00:00
## 2 2008-03-09 00:15 EST 196 2008-03-09 05:15:00 2008-03-09 00:15:00
## 3 2008-03-09 00:30 EST 196 2008-03-09 05:30:00 2008-03-09 00:30:00
## 4 2008-03-09 00:45 EST 199 2008-03-09 05:45:00 2008-03-09 00:45:00
## 5 2008-03-09 01:00 EST 201 2008-03-09 06:00:00 2008-03-09 01:00:00
## 6 2008-03-09 01:15 EST 201 2008-03-09 06:15:00 2008-03-09 01:15:00
## 7 2008-03-09 01:30 EST 203 2008-03-09 06:30:00 2008-03-09 01:30:00
## 8 2008-03-09 01:45 EST 205 2008-03-09 06:45:00 2008-03-09 01:45:00
## 9 2008-03-09 03:00 EDT 205 2008-03-09 07:00:00 2008-03-09 03:00:00
## 10 2008-03-09 03:15 EDT 205 2008-03-09 07:15:00 2008-03-09 03:15:00
## 11 2008-03-09 03:30 EDT 207 2008-03-09 07:30:00 2008-03-09 03:30:00
## 12 2008-03-09 03:45 EDT 210 2008-03-09 07:45:00 2008-03-09 03:45:00
## 13 2008-03-09 04:00 EDT 210 2008-03-09 08:00:00 2008-03-09 04:00:00
## 14 2008-03-09 04:15 EDT 212 2008-03-09 08:15:00 2008-03-09 04:15:00
## 15 2008-03-09 04:30 EDT 214 2008-03-09 08:30:00 2008-03-09 04:30:00
## 16 2008-03-09 04:45 EDT 214 2008-03-09 08:45:00 2008-03-09 04:45:00
## 17 2008-03-09 05:00 EDT 214 2008-03-09 09:00:00 2008-03-09 05:00:00
The only correct way to deal with this format is to manually assign hourly offsets based on the value of TIMEZONE.
usgs <- merge(usgs, data.frame(TIMEZONE = c("EDT", "EST"), OFFSET = c(-4, -5)))
usgs$DATETIME_UTC <- ymd_hm(usgs$TIMESTAMP, tz = "UTC") - dhours(usgs$OFFSET)
usgs$DATETIME_LOCAL <- with_tz(usgs$DATETIME_UTC, "America/New_York")
usgs <- usgs[order(usgs$DATETIME_UTC), ]
print(usgs[, -6])
## TIMEZONE TIMESTAMP VALUE DATETIME_UTC DATETIME_LOCAL
## 1 EDT 2007-11-04 00:00 68 2007-11-04 04:00:00 2007-11-04 00:00:00
## 2 EDT 2007-11-04 00:15 68 2007-11-04 04:15:00 2007-11-04 00:15:00
## 3 EDT 2007-11-04 00:30 67 2007-11-04 04:30:00 2007-11-04 00:30:00
## 4 EDT 2007-11-04 00:45 67 2007-11-04 04:45:00 2007-11-04 00:45:00
## 5 EDT 2007-11-04 01:00 65 2007-11-04 05:00:00 2007-11-04 01:00:00
## 7 EDT 2007-11-04 01:15 65 2007-11-04 05:15:00 2007-11-04 01:15:00
## 9 EDT 2007-11-04 01:30 64 2007-11-04 05:30:00 2007-11-04 01:30:00
## 11 EDT 2007-11-04 01:45 64 2007-11-04 05:45:00 2007-11-04 01:45:00
## 20 EST 2007-11-04 01:00 63 2007-11-04 06:00:00 2007-11-04 01:00:00
## 21 EST 2007-11-04 01:15 63 2007-11-04 06:15:00 2007-11-04 01:15:00
## 23 EST 2007-11-04 01:30 63 2007-11-04 06:30:00 2007-11-04 01:30:00
## 24 EST 2007-11-04 01:45 63 2007-11-04 06:45:00 2007-11-04 01:45:00
## 19 EST 2007-11-04 02:00 61 2007-11-04 07:00:00 2007-11-04 02:00:00
## 22 EST 2007-11-04 02:15 61 2007-11-04 07:15:00 2007-11-04 02:15:00
## 18 EST 2007-11-04 02:30 61 2007-11-04 07:30:00 2007-11-04 02:30:00
## 29 EST 2007-11-04 02:45 61 2007-11-04 07:45:00 2007-11-04 02:45:00
## 30 EST 2007-11-04 03:00 60 2007-11-04 08:00:00 2007-11-04 03:00:00
## 31 EST 2007-11-04 03:15 60 2007-11-04 08:15:00 2007-11-04 03:15:00
## 32 EST 2007-11-04 03:30 60 2007-11-04 08:30:00 2007-11-04 03:30:00
## 33 EST 2007-11-04 03:45 60 2007-11-04 08:45:00 2007-11-04 03:45:00
## 34 EST 2008-03-09 00:00 194 2008-03-09 05:00:00 2008-03-09 00:00:00
## 35 EST 2008-03-09 00:15 196 2008-03-09 05:15:00 2008-03-09 00:15:00
## 36 EST 2008-03-09 00:30 196 2008-03-09 05:30:00 2008-03-09 00:30:00
## 37 EST 2008-03-09 00:45 199 2008-03-09 05:45:00 2008-03-09 00:45:00
## 25 EST 2008-03-09 01:00 201 2008-03-09 06:00:00 2008-03-09 01:00:00
## 26 EST 2008-03-09 01:15 201 2008-03-09 06:15:00 2008-03-09 01:15:00
## 27 EST 2008-03-09 01:30 203 2008-03-09 06:30:00 2008-03-09 01:30:00
## 28 EST 2008-03-09 01:45 205 2008-03-09 06:45:00 2008-03-09 01:45:00
## 12 EDT 2008-03-09 03:00 205 2008-03-09 07:00:00 2008-03-09 03:00:00
## 13 EDT 2008-03-09 03:15 205 2008-03-09 07:15:00 2008-03-09 03:15:00
## 14 EDT 2008-03-09 03:30 207 2008-03-09 07:30:00 2008-03-09 03:30:00
## 6 EDT 2008-03-09 03:45 210 2008-03-09 07:45:00 2008-03-09 03:45:00
## 15 EDT 2008-03-09 04:00 210 2008-03-09 08:00:00 2008-03-09 04:00:00
## 8 EDT 2008-03-09 04:15 212 2008-03-09 08:15:00 2008-03-09 04:15:00
## 16 EDT 2008-03-09 04:30 214 2008-03-09 08:30:00 2008-03-09 04:30:00
## 10 EDT 2008-03-09 04:45 214 2008-03-09 08:45:00 2008-03-09 04:45:00
## 17 EDT 2008-03-09 05:00 214 2008-03-09 09:00:00 2008-03-09 05:00:00
Now we see that the timeseries is continuous around the DST transitions, and that the local datetime (POSIXct) matches the timestamp (character).
When handling sub-daily time series, it is critical to pay close attention to the timezone information. All datetimes should be represented using the POSIXct class with the underlying clock time (numeric value) in UTC. For display purposes and extracting the calendar representation components (hour, day), the datetimes should be assigned the correct time zone.