Dates and times are a very common and important type of data. For example, in flights we have information about the scheduled departure time, actual departure time, scheduled arrival time, and actual arrival time. We also have a time_hour column that records the scheduled date and hour (with the minutes dropped) in a date-time format.
glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
At first glance, dates and times seem simple. You use them all the time in your regular life, and they don’t seem to cause much confusion. However, the more you learn about dates and times, the more complicated they seem to get. To warm up, try these three seemingly simple questions:
Does every year have 365 days?
Does every day have 24 hours?
Does every minute have 60 seconds?
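The answer is “no” for all three questions: leap years have 366 days, days on which daylight saving time starts or ends have 23 or 25 hours, and a minute occasionally has 61 seconds when a leap second is added.
The physics behind measuring time is indeed complicated. Here we will focus on establishing a solid grounding of practical skills that will help us with common data analysis challenges.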
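As a small illustration of the first two questions, the sketch below counts the days in an ordinary year and a leap year, and the hours in a day on which clocks move forward. The dates are chosen here for illustration: 2016 is a leap year, and 10 March 2013 is the day daylight saving time began in New York.

```r
library(lubridate)

# 2015 is an ordinary year, 2016 is a leap year
is_leap <- leap_year(c(2015, 2016))
is_leap

# Days in each year, measured between consecutive New Year's Days
days_2015 <- as.numeric(ymd("2016-01-01") - ymd("2015-01-01"))
days_2016 <- as.numeric(ymd("2017-01-01") - ymd("2016-01-01"))
c(days_2015, days_2016)

# Length of 10 March 2013 in New York, when clocks moved forward by one hour
day_start <- ymd_hms("2013-03-10 00:00:00", tz = "America/New_York")
day_end <- ymd_hms("2013-03-11 00:00:00", tz = "America/New_York")
hours_in_day <- as.numeric(difftime(day_end, day_start, units = "hours"))
hours_in_day
```

The two years come out at 365 and 366 days, and the day in March at 23 hours.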
This chapter will focus on the lubridate
package, which
makes it easier to work with dates and times in R.
lubridate has been part of the core tidyverse since tidyverse 2.0.0, so library(tidyverse) attaches it. With older versions you need to load it explicitly, and loading it again is harmless either way. We will also need nycflights13 for practice data.
library(tidyverse)
library(nycflights13)
library(lubridate)
There are three types of date/time data that refer to an instant in time:
A date. Tibbles print this as <date>.
A time within a day. Tibbles print this as <time>.
A date-time, which is a date plus a time. Tibbles print this as <dttm>. Elsewhere in R these are called POSIXct. This name is not very descriptive, but we should know that it represents a date-time.
Here we are only going to focus on dates and date-times, as R doesn’t have a native class for storing times. If you need one, you can use the hms package.
You should always use the simplest possible data type that works for your needs. That means if you can use a date instead of a date-time, you should. Date-times are substantially more complicated because of the need to handle time zones, which we’ll come back to at the end of the chapter.
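A minimal sketch of the difference between the two types, using example values chosen here for illustration. It checks the class of a date and a date-time, then uses as_date() to drop the time of day when it is not needed.

```r
library(lubridate)

d <- ymd("2017-01-31")                # a date
dt <- ymd_hms("2017-01-31 20:11:59")  # a date-time

class(d)
class(dt)

# Dropping the time of day turns a date-time into the simpler date type
d_from_dt <- as_date(dt)
class(d_from_dt)
```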
To get the current date or date-time you can use today()
or now()
:
today()
## [1] "2025-03-28"
now()
## [1] "2025-03-28 08:08:32 EDT"
Note that today()
and now()
may give
different results with different time zones.
today(tzone = "PRC")
## [1] "2025-03-28"
now(tzone = "UTC")
## [1] "2025-03-28 12:08:32 UTC"
You can find out what R thinks your current time zone is with
Sys.timezone()
:
Sys.timezone()
## [1] "America/New_York"
To see the complete list of all time zone names, use
OlsonNames()
:
length(OlsonNames())
## [1] 597
head(OlsonNames())
## [1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa"
## [4] "Africa/Algiers" "Africa/Asmara" "Africa/Asmera"
Time zone names usually take the form continent/city. Including the continent avoids ambiguity between cities that share a name.
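Because a time zone is passed around as a plain string, it can help to check a name against OlsonNames() before using it. The sketch below uses only base R; the names searched for are examples.

```r
# Check whether a name is a valid time zone before using it
ny_is_valid <- "America/New_York" %in% OlsonNames()
ny_is_valid

# List the time zones whose names mention a particular city
chicago_zones <- grep("Chicago", OlsonNames(), value = TRUE)
chicago_zones
```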
Date/time data often comes as strings. lubridate
offers
convenient functions that automatically work out the format once you
specify the order of the component. To use them, identify the order in
which year, month, and day appear in your dates, then arrange “y”, “m”,
and “d” in the same order. That gives you the name of the
lubridate
function that will parse your date into the form
of yyyy-mm-dd
. For example:
ymd("2017-01-31")
## [1] "2017-01-31"
mdy("January 31st, 2017")
## [1] "2017-01-31"
dmy("31-Jan-2017")
## [1] "2017-01-31"
These functions also take unquoted numbers. This is the most concise way to create a single date/time object, as you might need when filtering date/time data.
ymd(20170131)
## [1] "2017-01-31"
ymd()
and friends create dates. To create a date-time,
add an underscore and one or more of “h”, “m”, and “s” to the name of
the parsing function:
ymd_hms("2017-01-31 20:11:59")
## [1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
## [1] "2017-01-31 08:01:00 UTC"
By default ymd() and other similar functions create dates, which do not carry a time zone. But we can also force the creation of a date-time from a date by supplying a time zone:
ymd(20170131, tz = "UTC")
## [1] "2017-01-31 UTC"
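The same naming rule extends to other component orders, and the date-time parsers accept a tz argument as well. In the sketch below (example strings chosen for illustration), the same clock time is parsed once as UTC and once as New York local time, which yields two different instants.

```r
library(lubridate)

# Day-month-year order with hours, minutes and seconds (UTC by default)
dt_utc <- dmy_hms("31/01/2017 20:11:59")

# The same clock time, interpreted as New York local time
dt_ny <- ymd_hms("2017-01-31 20:11:59", tz = "America/New_York")

dt_utc
dt_ny

# New York is 5 hours behind UTC in January, so the instants differ by 5 hours
gap_hours <- as.numeric(difftime(dt_ny, dt_utc, units = "hours"))
gap_hours
```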
What happens if you parse a string that contains invalid dates?
ymd(c("2010-10-10", "bananas"))
Use the appropriate lubridate function to parse each of the following dates:
d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
Instead of a single string, sometimes you’ll have the individual
components of the date-time spread across multiple columns. This is what
we have in the flights
data:
flights %>%
select(year, month, day, hour, minute)
## # A tibble: 336,776 × 5
## year month day hour minute
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 5 15
## 2 2013 1 1 5 29
## 3 2013 1 1 5 40
## 4 2013 1 1 5 45
## 5 2013 1 1 6 0
## 6 2013 1 1 5 58
## 7 2013 1 1 6 0
## 8 2013 1 1 6 0
## 9 2013 1 1 6 0
## 10 2013 1 1 6 0
## # ℹ 336,766 more rows
To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times. make_date() takes up to three arguments: year, month, and day. Be aware that any argument you omit falls back to its default, and the defaults together give January 1st, 1970, the so-called “Unix Epoch”.
date1 <- make_date(2023, 4, 5)
class(date1)
## [1] "Date"
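A short sketch of how the defaults fill in missing components; the year and month used are arbitrary examples.

```r
library(lubridate)

epoch <- make_date()             # all defaults
new_year <- make_date(2023)      # month and day default to 1
april_1st <- make_date(2023, 4)  # day defaults to 1

epoch
new_year
april_1st
```

These calls return 1970-01-01, 2023-01-01 and 2023-04-01 respectively.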
make_datetime() function
make_datetime() takes up to seven arguments: year, month, day, hour, min, sec, and tz (time zone). The default value is 1970-01-01 00:00:00 UTC.
flights %>%
select(year, month, day, hour, minute) %>%
mutate(departure_date = make_date(year, month, day)) %>%
# Departures are recorded in New York local time, so name that zone explicitly
mutate(departure_scheduled = make_datetime(year, month, day, hour, minute, tz = "America/New_York"))
## # A tibble: 336,776 × 7
## year month day hour minute departure_date departure_scheduled
## <int> <int> <int> <dbl> <dbl> <date> <dttm>
## 1 2013 1 1 5 15 2013-01-01 2013-01-01 05:15:00
## 2 2013 1 1 5 29 2013-01-01 2013-01-01 05:29:00
## 3 2013 1 1 5 40 2013-01-01 2013-01-01 05:40:00
## 4 2013 1 1 5 45 2013-01-01 2013-01-01 05:45:00
## 5 2013 1 1 6 0 2013-01-01 2013-01-01 06:00:00
## 6 2013 1 1 5 58 2013-01-01 2013-01-01 05:58:00
## 7 2013 1 1 6 0 2013-01-01 2013-01-01 06:00:00
## 8 2013 1 1 6 0 2013-01-01 2013-01-01 06:00:00
## 9 2013 1 1 6 0 2013-01-01 2013-01-01 06:00:00
## 10 2013 1 1 6 0 2013-01-01 2013-01-01 06:00:00
## # ℹ 336,766 more rows
Note that the default time zone for make_datetime() is UTC. The departure times are recorded in local time, which is America/New_York since all of these flights departed from New York City.
We can also get the hours and minutes from dep_time or arr_time, which are stored as integers such as 517 (meaning 5:17), using integer division (%/%) and the modulo operator (%%).
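# Build a date-time from a clock time stored as an HHMM or HMM integer (e.g. 517 means 05:17).
# "America/New_York" follows daylight saving time, unlike the fixed-offset "EST".
make_datetime_100 <- function(year, month, day, time, tz = "America/New_York") {
  make_datetime(year, month, day, time %/% 100, time %% 100, 0, tz)
}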
flights %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate(
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
)
## # A tibble: 328,063 × 19
## year month day dep_time sched_dep_time dep_delay
## <int> <int> <int> <dttm> <dttm> <dbl>
## 1 2013 1 1 2013-01-01 05:17:00 2013-01-01 05:15:00 2
## 2 2013 1 1 2013-01-01 05:33:00 2013-01-01 05:29:00 4
## 3 2013 1 1 2013-01-01 05:42:00 2013-01-01 05:40:00 2
## 4 2013 1 1 2013-01-01 05:44:00 2013-01-01 05:45:00 -1
## 5 2013 1 1 2013-01-01 05:54:00 2013-01-01 06:00:00 -6
## 6 2013 1 1 2013-01-01 05:54:00 2013-01-01 05:58:00 -4
## 7 2013 1 1 2013-01-01 05:55:00 2013-01-01 06:00:00 -5
## 8 2013 1 1 2013-01-01 05:57:00 2013-01-01 06:00:00 -3
## 9 2013 1 1 2013-01-01 05:57:00 2013-01-01 06:00:00 -3
## 10 2013 1 1 2013-01-01 05:58:00 2013-01-01 06:00:00 -2
## # ℹ 328,053 more rows
## # ℹ 13 more variables: arr_time <dttm>, sched_arr_time <dttm>, arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Here we define a function make_datetime_100(year, month, day, time) that creates a date-time from a clock time stored in HHMM or HMM format.
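The arithmetic inside make_datetime_100() can be checked on a few values. This is a small sketch in base R; the three clock times are examples, and hh and mm are names introduced here.

```r
times <- c(517, 1004, 2359)  # clock times stored as HMM or HHMM integers

hh <- times %/% 100          # integer division keeps the leading digits
mm <- times %% 100           # the remainder keeps the last two digits

hh
mm
```

So 517 splits into hour 5 and minute 17, 1004 into 10 and 4, and 2359 into 23 and 59.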
However, there is a problem with the operations above. The arrival times are recorded in the local time zone of the destination airport, not in New York time. So we have to look up the time zone of each destination airport, which is stored in the airports data set.
airports
## # A tibble: 1,458 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
## # ℹ 1,448 more rows
Therefore, we need to add the time zones of the origin and destination airports before we create the scheduled and actual departure/arrival times in date-time format.
airports1 <- airports %>%
select(faa, tzone)
flights1 <- flights %>%
left_join(airports1, by = c("dest" = "faa")) %>%
rename("dest_tzone" = "tzone") %>%
glimpse()
## Rows: 336,776
## Columns: 20
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
## $ dest_tzone <chr> "America/Chicago", "America/Chicago", "America/New_York…
Here we first keep only the airport codes and the corresponding time zones from the airports data set. We then left join them to flights, matching on dest, and rename the new column to dest_tzone, since we will also create origin_tzone:
flights1 %>%
left_join(airports1, by = c("origin" = "faa")) %>%
rename("origin_tzone" = "tzone") -> flights1
glimpse(flights1)
## Rows: 336,776
## Columns: 21
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
## $ dest_tzone <chr> "America/Chicago", "America/Chicago", "America/New_York…
## $ origin_tzone <chr> "America/New_York", "America/New_York", "America/New_Yo…
As expected, all origin time zones are America/New_York, since every flight in the data departed from New York City.
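The double join can be easier to follow on a toy example. The sketch below uses two tiny stand-in tables with made-up rows (toy_flights and toy_airports are hypothetical names, not part of nycflights13) and applies the same join-then-rename pattern twice.

```r
library(dplyr)

# Toy stand-ins for flights and airports (rows chosen for illustration)
toy_flights <- tibble(origin = c("EWR", "JFK"), dest = c("IAH", "MIA"))
toy_airports <- tibble(
  faa = c("EWR", "JFK", "IAH", "MIA"),
  tzone = c("America/New_York", "America/New_York",
            "America/Chicago", "America/New_York")
)

toy_joined <- toy_flights %>%
  left_join(toy_airports, by = c("dest" = "faa")) %>%    # time zone of destination
  rename(dest_tzone = tzone) %>%
  left_join(toy_airports, by = c("origin" = "faa")) %>%  # time zone of origin
  rename(origin_tzone = tzone)

toy_joined
```

Renaming after the first join matters: without it, the second join would produce two columns that both want the name tzone.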
Now we are ready to create date-times:
flights_dt <- flights1 %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate(
dep_time = make_datetime_100(year, month, day, dep_time, tz = origin_tzone),
arr_time = make_datetime_100(year, month, day, arr_time, tz = dest_tzone),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time, tz = origin_tzone),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time, tz = dest_tzone)
) %>%
select(origin, dest, ends_with("delay"), ends_with("time"), air_time, origin_tzone, dest_tzone)
flights_dt
## # A tibble: 328,063 × 11
## origin dest dep_delay arr_delay dep_time sched_dep_time
## <chr> <chr> <dbl> <dbl> <dttm> <dttm>
## 1 EWR IAH 2 11 2013-01-01 05:17:00 2013-01-01 05:15:00
## 2 LGA IAH 4 20 2013-01-01 05:33:00 2013-01-01 05:29:00
## 3 JFK MIA 2 33 2013-01-01 05:42:00 2013-01-01 05:40:00
## 4 JFK BQN -1 -18 2013-01-01 05:44:00 2013-01-01 05:45:00
## 5 LGA ATL -6 -25 2013-01-01 05:54:00 2013-01-01 06:00:00
## 6 EWR ORD -4 12 2013-01-01 05:54:00 2013-01-01 05:58:00
## 7 EWR FLL -5 19 2013-01-01 05:55:00 2013-01-01 06:00:00
## 8 LGA IAD -3 -14 2013-01-01 05:57:00 2013-01-01 06:00:00
## 9 JFK MCO -3 -8 2013-01-01 05:57:00 2013-01-01 06:00:00
## 10 LGA ORD -2 8 2013-01-01 05:58:00 2013-01-01 06:00:00
## # ℹ 328,053 more rows
## # ℹ 5 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>,
## # origin_tzone <chr>, dest_tzone <chr>
We will work with the flights_dt data set in the rest of this chapter.
- and duration
We can compute the difference between two date-times using the subtraction operator -.
td1 <- ymd_hms("2023-04-05 02:30:00") - ymd_hms("2023-04-05 01:20:00")
class(td1)
## [1] "difftime"
In R, when we subtract two dates or date-time objects we get a difftime object, which records a time span in seconds, minutes, hours, days, or weeks. Because the unit varies, a difftime can be awkward to work with, so lubridate offers an alternative, the duration, which always measures a time span in seconds.
as.duration(td1)
## [1] "4200s (~1.17 hours)"
To convert this into a number of minutes, we can do
as.numeric(as.duration(td1))/60
## [1] 70
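As an alternative sketch, base R can report a difftime in a chosen unit directly through the units argument of as.numeric(). Both routes below give the same number of minutes for the example span used above.

```r
library(lubridate)

td <- ymd_hms("2023-04-05 02:30:00") - ymd_hms("2023-04-05 01:20:00")

# Ask for the span directly in minutes
mins_direct <- as.numeric(td, units = "mins")

# Same result via a duration, which is always stored in seconds
mins_via_duration <- as.numeric(as.duration(td)) / 60

c(mins_direct, mins_via_duration)
```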
Now we can compute the time difference between actual departure time
and arrival time in flights_dt
.
flights_dt %>%
mutate(flight_time = as.numeric(as.duration(arr_time - dep_time))/60) -> flights_dt
glimpse(flights_dt)
## Rows: 328,063
## Columns: 12
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ dep_time <dttm> 2013-01-01 05:17:00, 2013-01-01 05:33:00, 2013-01-01 0…
## $ sched_dep_time <dttm> 2013-01-01 05:15:00, 2013-01-01 05:29:00, 2013-01-01 0…
## $ arr_time <dttm> 2013-01-01 08:30:00, 2013-01-01 08:50:00, 2013-01-01 0…
## $ sched_arr_time <dttm> 2013-01-01 08:19:00, 2013-01-01 08:30:00, 2013-01-01 0…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ origin_tzone <chr> "America/New_York", "America/New_York", "America/New_Yo…
## $ dest_tzone <chr> "America/Chicago", "America/Chicago", "America/New_York…
## $ flight_time <dbl> 253, 257, 281, 320, 198, 166, 258, 132, 221, 175, 231, …
We see that the flight time is typically longer than the air time, since a flight also spends time on the ground before taking off and after landing.
<, >
We can also compare two dates or two date-times. Here “greater than” means “later than”. For example:
ymd(20230405) > ymd(20230404)
## [1] TRUE
This is TRUE
because Apr 5, 2023 is later than Apr 4,
2023.
ymd(20230405) < ymd_hms("2023-04-05 02:00:00")
## [1] TRUE
This is also TRUE because, when comparing a date and a date-time, the date is treated as midnight (00:00:00) at the start of that day.
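To make the comparison explicit, one option is to convert the date to a date-time first with as_datetime(), which places it at midnight UTC. This is a sketch of that approach using the same two values.

```r
library(lubridate)

d <- ymd(20230405)
dt <- ymd_hms("2023-04-05 02:00:00")

# Convert the date to a date-time: midnight UTC at the start of the day
d_as_dt <- as_datetime(d)
d_as_dt

# Now both sides are date-times
is_earlier <- d_as_dt < dt
is_earlier
```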
With this, we can easily get all flights before or after a particular date. We can also arrange our data by time.
flights_dt %>%
filter(dep_time > ymd("2013-06-01", tz = Sys.timezone())) %>%
arrange(dep_time)
## # A tibble: 194,177 × 12
## origin dest dep_delay arr_delay dep_time sched_dep_time
## <chr> <chr> <dbl> <dbl> <dttm> <dttm>
## 1 JFK PSE 3 -9 2013-06-01 00:02:00 2013-06-01 23:59:00
## 2 EWR CLT -9 -16 2013-06-01 04:51:00 2013-06-01 05:00:00
## 3 EWR IAH -9 -45 2013-06-01 05:06:00 2013-06-01 05:15:00
## 4 LGA IAH -11 -29 2013-06-01 05:34:00 2013-06-01 05:45:00
## 5 JFK BQN -7 3 2013-06-01 05:38:00 2013-06-01 05:45:00
## 6 JFK MIA -1 -8 2013-06-01 05:39:00 2013-06-01 05:40:00
## 7 EWR RSW -14 -20 2013-06-01 05:46:00 2013-06-01 06:00:00
## 8 LGA DFW -9 -22 2013-06-01 05:51:00 2013-06-01 06:00:00
## 9 LGA PHL -8 -8 2013-06-01 05:52:00 2013-06-01 06:00:00
## 10 JFK IAD -7 -11 2013-06-01 05:53:00 2013-06-01 06:00:00
## # ℹ 194,167 more rows
## # ℹ 6 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>,
## # origin_tzone <chr>, dest_tzone <chr>, flight_time <dbl>
Try it yourself: what happens if you leave out arrange() in the example above?
We can pull out individual parts of the date with the accessor
functions year()
, month()
, mday()
(day of the month), yday()
(day of the year),
wday()
(day of the week), hour()
,
minute()
, and second()
.
datetime <- ymd_hms("2013-01-01 12:34:56")
year(datetime)
## [1] 2013
month(datetime)
## [1] 1
mday(datetime)
## [1] 1
yday(datetime)
## [1] 1
wday(datetime)
## [1] 3
Note that by default wday() counts Sunday as the first day of the week, so 3 corresponds to Tuesday.
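If you prefer weeks that start on Monday, wday() takes a week_start argument. The sketch below compares the default numbering with a Monday start for the same Tuesday.

```r
library(lubridate)

datetime <- ymd_hms("2013-01-01 12:34:56")  # a Tuesday

sunday_first <- wday(datetime, week_start = 7)  # Sunday is day 1
monday_first <- wday(datetime, week_start = 1)  # Monday is day 1

c(sunday_first, monday_first)
```

With Sunday first the result is 3, and with Monday first it is 2.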
For month() and wday() you can set label = TRUE to return the abbreviated name of the month or day of the week as an ordered factor. Set abbr = FALSE to return the full name.
month(datetime, label = TRUE)
## [1] Jan
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
wday(datetime, label = TRUE, abbr = FALSE)
## [1] Tuesday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
flights_dt %>%
mutate(weekday = wday(dep_time, label = TRUE)) -> flights_dt
glimpse(flights_dt)
## Rows: 328,063
## Columns: 13
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ dep_time <dttm> 2013-01-01 05:17:00, 2013-01-01 05:33:00, 2013-01-01 0…
## $ sched_dep_time <dttm> 2013-01-01 05:15:00, 2013-01-01 05:29:00, 2013-01-01 0…
## $ arr_time <dttm> 2013-01-01 08:30:00, 2013-01-01 08:50:00, 2013-01-01 0…
## $ sched_arr_time <dttm> 2013-01-01 08:19:00, 2013-01-01 08:30:00, 2013-01-01 0…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ origin_tzone <chr> "America/New_York", "America/New_York", "America/New_Yo…
## $ dest_tzone <chr> "America/Chicago", "America/Chicago", "America/New_York…
## $ flight_time <dbl> 253, 257, 281, 320, 198, 166, 258, 132, 221, 175, 231, …
## $ weekday <ord> Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, Tue, …
levels(flights_dt$weekday)
## [1] "Sun" "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"
Now we can easily analyze data with respect to weekday behavior.
flights_dt %>%
ggplot() + geom_bar(aes(weekday))
For example, we can study delay or air time for different weekdays.
flights_dt %>%
filter(arr_delay >= 0) %>% # Keep flights that arrived on time or late; drops early arrivals and rows with a missing arr_delay
group_by(weekday) %>%
summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE), mean_air_time = mean(air_time, na.rm = TRUE)) %>%
ggplot() + geom_col(aes(x = weekday, y = mean_arr_delay)) # one bar per weekday
flights_dt %>%
group_by(weekday) %>%
summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE), mean_air_time = mean(air_time, na.rm = TRUE)) %>%
ggplot() + geom_col(aes(x = weekday, y = mean_air_time)) # one bar per weekday
So we see that weekdays have some effect on average delay but little effect on average air time.
Study at which departure hour flights had:
the worst arrival delay
the longest/shortest air time
An alternative approach to plotting individual components is to round the date to a nearby unit of time, with floor_date(), round_date(), and ceiling_date(). Each function takes a vector of dates to adjust and then the name of the unit to round down to (floor), round up to (ceiling), or round to the nearest (round).
# floor, round or ceiling of a number
floor(3.2)
## [1] 3
round(3.2)
## [1] 3
ceiling(3.2)
## [1] 4
With that in mind, it is easier to understand rounding for dates and date-times:
dt2 <- ymd_hms("2023-04-05 03:12:34pm", tz = Sys.timezone())
floor_date(dt2, "year") # floor to the first day of the same year
## [1] "2023-01-01 EST"
floor_date(dt2, "month") # floor to the first day of the same month
## [1] "2023-04-01 EDT"
floor_date(dt2, "week") # floor to the first day of the same week (Sunday)
## [1] "2023-04-02 EDT"
floor_date(dt2, "day") # floor to the same day
## [1] "2023-04-05 EDT"
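The three rounding functions can be compared side by side on a smaller unit. The sketch below rounds one example date-time to the hour; the time zone is fixed to UTC here so the result does not depend on your system settings.

```r
library(lubridate)

x <- ymd_hms("2023-04-05 15:12:34", tz = "UTC")

x_floor <- floor_date(x, "hour")      # start of the current hour
x_round <- round_date(x, "hour")      # nearest hour
x_ceiling <- ceiling_date(x, "hour")  # start of the next hour

x_floor
x_round
x_ceiling
```

Since 15:12:34 is closer to 15:00 than to 16:00, floor and round agree, while ceiling moves up to 16:00.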
Run round_date(dt2, "week"). What do you get? Can you explain the result?
This, for example, allows us to plot the number of flights per week:
flights_dt %>%
count(week = floor_date(dep_time, "week")) %>%
ggplot(aes(week, n)) +
geom_line()
Note that a date-time can be used as a continuous axis in a plot. For date-times the underlying unit is seconds, so a bin width of 86400 corresponds to one day. For example, to visualise the number of flights per day across the year, we can do:
flights_dt %>%
ggplot(aes(dep_time)) +
geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day
As the earlier plot by weekday suggests, the regular sharp drops in the number of flights should fall on Saturdays.
We often need to work with time spans. Here we introduce three classes that represent them:
durations, which represent an exact number of seconds.
periods, which represent human units like weeks and months.
intervals, which represent a starting and ending point.
As we mentioned above, durations always use seconds to measure a time span. There are a series of constructor functions:
dseconds(15)
## [1] "15s"
dminutes(10)
## [1] "600s (~10 minutes)"
dhours(c(12, 24))
## [1] "43200s (~12 hours)" "86400s (~1 days)"
ddays(0:5)
## [1] "0s" "86400s (~1 days)" "172800s (~2 days)"
## [4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"
dweeks(3)
## [1] "1814400s (~3 weeks)"
dyears(1)
## [1] "31557600s (~1 years)"
You can add, subtract and multiply durations:
2 * dyears(1)
## [1] "63115200s (~2 years)"
dyears(1) + dweeks(12) + dhours(15)
## [1] "38869200s (~1.23 years)"
You can add and subtract durations to and from a date-time:
dt3 <- ymd_h("2023-04-05 3pm", tz = Sys.timezone())
dt3 + dhours(1)
## [1] "2023-04-05 16:00:00 EDT"
dt3 - ddays(1)
## [1] "2023-04-04 15:00:00 EDT"
A problem with durations is that they represent an exact number of seconds, assuming 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, 7 days in a week, and 365.25 days in a year. This may give unexpected results when leap years, daylight saving time, or leap seconds are involved.
dt3 - dyears(1)
## [1] "2022-04-05 09:00:00 EDT"
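So the result looks a little odd when we subtract a duration of one year: the clock time shifts from 15:00 to 09:00, because dyears(1) is defined as 365.25 days, which is six hours longer than the 365 days between the two dates.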
To solve this problem, lubridate provides periods. Periods are time spans that don’t have a fixed length in seconds. Instead, they work with “human” times, like days and months. That allows them to work in a more intuitive way:
dt3 - years(1)
## [1] "2022-04-05 15:00:00 EDT"
So we see that this result is consistent with “human” understanding about the meaning of “one year ago”.
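Daylight saving time shows the same contrast on a smaller scale. In this sketch, the starting point is 1 pm on the day before clocks moved forward in New York in 2016; adding a duration of one day and adding a period of one day then give different clock times.

```r
library(lubridate)

one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

plus_duration <- one_pm + ddays(1)  # exactly 86400 seconds later
plus_period <- one_pm + days(1)     # same clock time on the next day

plus_duration
plus_period
```

Because 13 March 2016 had only 23 hours in New York, the duration lands on 14:00 while the period stays at 13:00.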
Like durations, periods can be created with a number of friendly constructor functions.
seconds(15)
## [1] "15S"
minutes(10)
## [1] "10M 0S"
hours(c(12, 24))
## [1] "12H 0M 0S" "24H 0M 0S"
days(7)
## [1] "7d 0H 0M 0S"
months(1:6)
## [1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
## [5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
weeks(3)
## [1] "21d 0H 0M 0S"
years(1)
## [1] "1y 0m 0d 0H 0M 0S"
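Periods can be combined and multiplied, and they can be added to dates. One thing to watch for is month arithmetic near the end of a month. The sketch below, with an example date chosen for illustration, shows that adding a month to 31 January gives NA, and that lubridate's %m+% operator rolls back to the last valid day instead.

```r
library(lubridate)

# Periods can be combined and multiplied
p <- 10 * (months(6) + days(1))
p

# 31 January plus one month would be the non-existent 31 February
jan31 <- ymd("2023-01-31")
naive <- jan31 + months(1)

# %m+% rolls back to the last day of the month instead of returning NA
rolled <- jan31 %m+% months(1)

naive
rolled
```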
Let’s use periods to fix an oddity related to our flight dates. Some planes appear to have arrived at their destination before they departed from New York City.
flights_dt %>%
filter(arr_time < dep_time) %>%
select(dep_time, arr_time, flight_time)
## # A tibble: 10,633 × 3
## dep_time arr_time flight_time
## <dttm> <dttm> <dbl>
## 1 2013-01-01 19:29:00 2013-01-01 00:03:00 -1106
## 2 2013-01-01 19:39:00 2013-01-01 00:29:00 -1090
## 3 2013-01-01 20:58:00 2013-01-01 00:08:00 -1190
## 4 2013-01-01 21:02:00 2013-01-01 01:46:00 -1096
## 5 2013-01-01 21:08:00 2013-01-01 00:25:00 -1183
## 6 2013-01-01 21:20:00 2013-01-01 00:16:00 -1204
## 7 2013-01-01 21:21:00 2013-01-01 00:06:00 -1215
## 8 2013-01-01 21:28:00 2013-01-01 00:26:00 -1202
## 9 2013-01-01 21:34:00 2013-01-01 00:20:00 -1214
## 10 2013-01-01 21:36:00 2013-01-01 00:25:00 -1211
## # ℹ 10,623 more rows
These are overnight flights: they were in the air at midnight, so they arrived on the day after they departed. We used the same date information for both the departure and the arrival times, which puts the arrival one day too early. We can fix this by adding days(1) to the arrival time of each overnight flight.
flights_dt <- flights_dt %>%
mutate(
day_change = arr_time < dep_time,
arr_time = arr_time + days(day_change * 1),
sched_arr_time = sched_arr_time + days(day_change * 1), # May be wrong in rare cases, since the scheduled and actual arrival can fall on different days.
flight_time = as.numeric(as.duration(arr_time - dep_time))/60
)
Now all of our flights obey the laws of physics.
flights_dt %>%
filter(flight_time < 0)
## # A tibble: 0 × 14
## # ℹ 14 variables: origin <chr>, dest <chr>, dep_delay <dbl>, arr_delay <dbl>,
## # dep_time <dttm>, sched_dep_time <dttm>, arr_time <dttm>,
## # sched_arr_time <dttm>, air_time <dbl>, origin_tzone <chr>,
## # dest_tzone <chr>, flight_time <dbl>, weekday <ord>, day_change <lgl>
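The fix can be traced on two hand-made flights. This is a sketch with made-up departure and arrival times in UTC, not rows taken from flights_dt: the first flight crosses midnight, the second does not.

```r
library(lubridate)

dep <- ymd_hm(c("2013-01-01 19:29", "2013-01-01 05:17"))
arr <- ymd_hm(c("2013-01-01 00:03", "2013-01-01 08:30"))

# TRUE only for the flight whose arrival appears to precede its departure
overnight <- arr < dep

# Add one day to overnight arrivals and zero days to the rest
arr_fixed <- arr + days(as.integer(overnight))

flight_mins <- as.numeric(difftime(arr_fixed, dep, units = "mins"))
flight_mins
```

After the fix both flight times are positive: 274 minutes for the overnight flight and 193 minutes for the other.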
Periods also have their problems. For durations, dyears(1) / ddays(1) returns a fixed number, 365.25, because a duration is always represented by a number of seconds, and a duration of one year is defined as 365.25 days’ worth of seconds.
What should years(1) / days(1) return? Well, if the year was 2015 it should return 365, but if it was 2016, it should return 366! There’s not quite enough information for lubridate to give a single clear answer. What it does instead is give an estimate, which is the same as dyears(1) / ddays(1).
dyears(1)/ddays(1)
## [1] 365.25
years(1) / days(1)
## [1] 365.25
If you want a more accurate measurement, you’ll have to use an interval. An interval is a time span with a specific starting point and ending point, so you can determine exactly how long it is:
next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
## [1] 365
Here today() %--% next_year is an interval, created with the %--% operator, which connects two dates or date-times.
class(today() %--% next_year)
## [1] "Interval"
## attr(,"package")
## [1] "lubridate"
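Because the result above depends on the day you run it, a sketch with fixed dates may be easier to check. The two intervals below cover the calendar years 2015 and 2016, and dividing each by ddays(1) gives its exact length in days.

```r
library(lubridate)

year_2015 <- ymd("2015-01-01") %--% ymd("2016-01-01")
year_2016 <- ymd("2016-01-01") %--% ymd("2017-01-01")

days_2015 <- year_2015 / ddays(1)
days_2016 <- year_2016 / ddays(1)

c(days_2015, days_2016)
```

The ordinary year comes out at 365 days and the leap year at 366, which is the distinction that years(1) / days(1) cannot make on its own.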