Dates and Times with Lubridate

Prerequisites

library(tidyverse)
library(lubridate)
library(nycflights13)

Creating date/times

There are three types of date/time data that refer to an instant in time:
  • A date. Tibbles print this as <date>.
  • A time within a day. Tibbles print this as <time>.
  • A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>.

    R doesn’t have a native class for storing times. If you need one, you can use the hms package.

    To get the current date or date-time, you can use today() or now().

    library(lubridate)
    
    Attaching package: ‘lubridate’
    
    The following object is masked from ‘package:base’:
    
        date
    today()
    [1] "2017-04-20"
    now()
    [1] "2017-04-20 14:21:40 MYT"

    There are three ways to create a date/time:
    1. From a string
    2. From individual date-time components
    3. From an existing date/time object

    To use the lubridate helper functions, identify the order in which year, month, and day appear in your date strings, then arrange “y”, “m”, and “d” in the same order. That gives you the name of the function that will parse your date.

    ymd("2017-01-31")
    [1] "2017-01-31"
    mdy("January 31st, 2017")
    [1] "2017-01-31"
    mdy("Jan 31, 2017")
    [1] "2017-01-31"
    dmy("31-Jan-2017")
    [1] "2017-01-31"
    dmy("31-January-2017")
    [1] "2017-01-31"

    These functions also take unquoted numbers. This is the most concise way to create a single date/time object.

    ymd(20170101)
    [1] "2017-01-01"

    To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function.

    ymd_hm("2017-01-01 12:30")
    [1] "2017-01-01 12:30:00 UTC"

    You can also force the creation of a date-time from a date by supplying a time zone.

    ymd(20170101, tz = "UTC")
    [1] "2017-01-01 UTC"
    ymd_hm("2017-01-01 11:30", tz = "UTC")
    [1] "2017-01-01 11:30:00 UTC"
    mdy("Mar 7, 2018", tz = "UTC")
    [1] "2018-03-07 UTC"
    ymd_hm(201701011130, tz = "UTC")
    [1] "2017-01-01 11:30:00 UTC"

    From Individual Components

    Sometimes the individual components of the date-time are spread across multiple columns, as in the flights data.

    library(nycflights13)
    library(tidyverse)
    flights %>%
      select(year, month, day, hour, minute, arr_time)

    To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times.

    flights %>%
      select(year, month, day, hour, minute) %>%
      mutate(departure = make_datetime(year, month, day, hour, minute))
    flights %>%
      head(10)

    The times in flights are stored in a slightly odd format: 517 means 5 hours and 17 minutes. We use modulus arithmetic to pull out the hour and minute components: integer division by 100 (%/%) gives the hour, and the remainder (%%) gives the minutes.
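
The arithmetic can be checked on a few values before it is wrapped in a helper. This is a minimal sketch in base R; the sample values in `times` are made up for illustration and are not taken from the flights data.

```r
# Times stored as HMM or HHMM integers: 517 means 05:17, 2359 means 23:59
times <- c(517, 2359, 5)

hours   <- times %/% 100  # integer division keeps the hour part
minutes <- times %% 100   # the remainder keeps the minute part

data.frame(times, hours, minutes)
```

Both operators are vectorised, which is why the same expressions work on a whole column inside mutate().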

    make_datetime_100 <-  function(year, month, day, time){
      make_datetime(year, month, day, time %/% 100, time %% 100)
    }
    flights_dt <-  flights %>%
      filter(!is.na(dep_time), !is.na(arr_time)) %>%
      mutate(
        dep_time = make_datetime_100(year, month, day, dep_time),
        arr_time = make_datetime_100(year, month, day, arr_time),
        sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
        sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
      ) 
    flights_dt %>%
      select(origin, dest, ends_with("delay"), ends_with("time"))
    # Visualize the distribution of departure times across the year
    flights_dt %>%
      ggplot(aes(dep_time)) +
      geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day

    Or within a single day:

    flights_dt %>%
      filter(dep_time < ymd(20130102)) %>%
      ggplot(aes(dep_time)) +
      geom_freqpoly(binwidth = 600) # 600 seconds = 10 minutes

    Note: when you use date-times in a numeric context, 1 means 1 second, so a binwidth of 86400 means one day. For dates, 1 means 1 day.
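
A small sketch of what "numeric context" means here: converting a date and a date-time for the same day to plain numbers. The variable names `d` and `dt` are only for illustration.

```r
library(lubridate)

d  <- ymd("1970-01-02")               # a date
dt <- ymd_hms("1970-01-02 00:00:00")  # a date-time (UTC by default)

as.numeric(d)   # days since 1970-01-01
as.numeric(dt)  # seconds since 1970-01-01 00:00:00 UTC
```

One day after the epoch is 1 for the date but 86400 for the date-time, which is why the bin widths above are given in seconds.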

    From Other Types

    You may want to switch between a date-time and a date. That is the job of as_datetime() and as_date().

    as_datetime(today())
    [1] "2017-04-20 UTC"
    as_date(now())
    [1] "2017-04-20"

    Sometimes date/times arrive as numeric offsets from the Unix epoch (1970-01-01). If the offset is in seconds, use as_datetime(); if it is in days, use as_date().

    as_datetime(1492672742)
    [1] "2017-04-20 07:19:02 UTC"
    as_date(365*40 +2)
    [1] "2009-12-24"

    Exercise

    What happens if you parse a string that contains invalid dates?

    ymd(c("2010-10-10","bananas"))
     1 failed to parse.
    [1] "2010-10-10" NA          

    It produces an NA and a warning message.

    What does the tzone argument to today() do? Why is it important?

    today(tzone = "Asia/Manila")
    [1] "2017-04-20"
    The tzone argument specifies the time zone in which to find the current date. It defaults to the system time zone set on your computer. This matters because, at any given instant, the date can differ between time zones.

    Use the appropriate lubridate function to parse each of the following dates:

    d1 <- "January  1, 2010"
    d2 <- "2015-mar-07"
    d3 <- "06-Jun_2017"
    d4 <-  c("August 19 (2015)","July 1 (2015)")
    d5 <- "12/30/14" # Dec 30, 2014
    mdy(d1)
    [1] "2010-01-01"
    ymd(d2)
    [1] "2015-03-07"
    dmy(d3)
    [1] "2017-06-06"
    mdy(d4)
    [1] "2015-08-19" "2015-07-01"
    mdy(d5)
    [1] "2014-12-30"

    Date-Time Components

    This section focuses on the accessor functions that let you get and set individual components.

    Getting components

    You can pull out individual parts of the date with the functions year(), month(), mday() [day of the month], yday() [day of the year], wday() [day of the week], hour(), minute(), and second().

    datetime <- ymd_hms("2016-07-08 12:34:56")
    year(datetime)
    [1] 2016
    month(datetime)
    [1] 7
    mday(datetime)
    [1] 8
    yday(datetime)
    [1] 190
    wday(datetime)
    [1] 6
    hour(datetime)
    [1] 12
    minute(datetime)
    [1] 34
    second(datetime)
    [1] 56

    For month() and wday(), you can set label = TRUE to return the abbreviated name of the month or day of the week. Set abbr = FALSE to return the full name.

    month(datetime, label = TRUE, abbr = FALSE)
    [1] July
    12 Levels: January < February < March < April < May < June < July < August < September < ... < December
    wday(datetime, label = TRUE, abbr = TRUE)
    [1] Fri
    Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
    flights_dt %>%
      mutate(wday = wday(dep_time, label = TRUE)) %>%
      ggplot(aes(x = wday)) + 
      geom_bar()

    Let’s look at the average delay by minute within the hour. It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour.

    flights_dt %>%
      mutate(minute = minute(dep_time)) %>%
      group_by(minute) %>%
      summarize(
        avg_delay = mean(arr_delay, na.rm = TRUE), 
        n = n()) %>%
      ggplot(aes(minute, avg_delay)) +
      geom_line()

    Interestingly, the scheduled departure times don’t show a similar pattern.

    sched_dep <-  flights_dt %>%
      mutate(minute = minute(sched_dep_time)) %>%
      group_by(minute) %>%
      summarize(avg_delay = mean(arr_delay, na.rm = TRUE),
                 n = n())
    ggplot(sched_dep, aes(minute, avg_delay)) +
             geom_line()

    ggplot(sched_dep, aes(minute, n)) +
      geom_line()

    Rounding

    An alternative approach to plotting individual components is to round the date to a nearby unit of time, with floor_date(), round_date(), and ceiling_date(). Each function takes a vector of dates to adjust and then the name of the unit to round down (floor), round up (ceiling), or round to.

    flights_dt %>% 
      count(week = floor_date(dep_time, "week")) %>%
      ggplot(aes(week, n)) +
      geom_line()

    Computing the difference between a rounded and unrounded date can be particularly useful.
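
As a minimal sketch of that idea, the three rounding functions can be compared on a single date-time, and the gap between the original and the floored value gives the time elapsed within the unit. The variable names are illustrative only.

```r
library(lubridate)

dt <- ymd_hms("2016-07-08 12:34:56")

floor_date(dt, "hour")    # rounds down to 12:00:00
ceiling_date(dt, "hour")  # rounds up to 13:00:00
round_date(dt, "hour")    # rounds to the nearest hour, here 13:00:00

# Seconds elapsed since the start of the hour
secs_into_hour <- as.numeric(difftime(dt, floor_date(dt, "hour"), units = "secs"))
secs_into_hour
```

Asking difftime() for explicit units avoids relying on the unit R would otherwise pick automatically.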

    Setting Components

    datetime <- ymd_hms("2016-07-08 12:34:56")
    year(datetime) <-  2020
    month(datetime) <- 01
    hour(datetime) <-  hour(datetime) + 2
    datetime
    [1] "2020-01-08 14:34:56 UTC"

    You can also create a new date-time with update(). This allows you to set multiple values at once.

    update(datetime, year=2020, month=4, mday = 2, hour = 2)
    [1] "2020-04-02 02:34:56 UTC"

    If values are too big, they will roll over to the next valid date.

    ymd("2015-02-01") %>%
      update(mday = 30)
    [1] "2015-03-02"
    ymd("2015-02-01") %>%
      update(hour = 400)
    [1] "2015-02-17 16:00:00 UTC"

    Here, update is used to show the distribution of flights across the course of the day for every day of the year:

    flights_dt %>%
      mutate(dep_hour = update(dep_time, yday = 20)) %>%
      ggplot(aes(dep_hour)) +
      geom_freqpoly(binwidth = 300)

    Setting larger components of a date to a constant is a powerful technique that allows you to drill down into the smaller components.
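
A minimal sketch of the same technique on two made-up date-times: once the day of the year is set to a constant, only the time of day still distinguishes the values. The vector `x` and the choice of yday = 1 are for illustration only.

```r
library(lubridate)

# Two departures on different days of 2013 (illustrative values)
x <- ymd_hms(c("2013-03-05 08:15:00", "2013-11-20 17:40:00"))

# Move both onto the same day; the time of day is left untouched
same_day <- update(x, yday = 1)
same_day
```

After this step the values can share one time-of-day axis, which is what makes the frequency polygon above readable.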

    Exercises

    How does the distribution of flight times within a day change over the course of the year?

    flights_dt %>%
      mutate(time = hour(dep_time) * 100 + minute(dep_time),
             mon = as.factor(month(dep_time))) %>%
      ggplot(aes(x = time, group = mon, color = mon)) +
      geom_freqpoly(binwidth = 100)

    Compare dep_time, sched_dep_time and dep_delay. Are they consistent? Explain your findings. If they are consistent, then dep_time = sched_dep_time + dep_delay.

    flights_dt %>%
      mutate(dep_time_ = sched_dep_time + dep_delay * 60) %>%
      filter(dep_time_ != dep_time) %>%
      select(dep_time_, dep_time, sched_dep_time, dep_delay)

    There are discrepancies, and they look like mistakes in the dates. These are flights whose actual departure falls on the day after the scheduled departure. We forgot to account for this when creating the date-times: the code would have had to check whether the departure time is earlier than the scheduled departure time. Alternatively, simply adding the delay to the scheduled departure time is more robust, because it automatically accounts for crossing into the next day.
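
A minimal sketch of the overnight case, using made-up values: a flight scheduled for 23:50 that leaves 20 minutes late. The names `sched`, `delay_min`, `naive`, and `actual` are hypothetical.

```r
library(lubridate)

sched     <- ymd_hm("2013-01-01 23:50")  # scheduled departure
delay_min <- 20                          # departure delay in minutes

# Building the departure from the scheduled date plus the clock time 00:10
# stamps it with the wrong day
naive <- make_datetime(2013, 1, 1, 0, 10)

# Adding the delay to the scheduled time crosses midnight correctly
actual <- sched + dminutes(delay_min)

actual
difftime(actual, naive, units = "days")
```

The two results differ by exactly one day, which is the kind of discrepancy the filter above picks up.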

    Compare air_time with the duration between the departure and arrival. Explain your findings.

    flights_dt %>%
      # Request minutes explicitly so the difference is comparable with air_time
      mutate(flight_duration = as.numeric(arr_time - dep_time, units = "mins"),
             air_time_mins = air_time,
             diff = flight_duration - air_time_mins) %>%
      select(origin, dest, flight_duration, air_time_mins, diff)

    How does the average delay time change over the course of a day? Should you use dep_time or sched_dep_time? Why?
    Use sched_dep_time because that is the relevant metric for someone scheduling a flight. Also, using dep_time will always bias delays to later in the day since delays will push flights later.

    flights_dt %>%
      mutate(sched_dep_hour = hour(sched_dep_time)) %>%
      group_by(sched_dep_hour) %>%
      summarise(dep_delay = mean(dep_delay)) %>%
      ggplot(aes(y = dep_delay, x = sched_dep_hour)) +
      geom_point() +
      geom_smooth()

    On what day of the week should you leave if you want to minimise the chance of a delay?
    Saturday has the lowest combined delays.

    flights_dt %>%
      mutate(dow = wday(sched_dep_time, label = TRUE)) %>%
      group_by(dow) %>%
      summarise(dep_delay = mean(dep_delay),
                arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
      arrange(desc(dep_delay+arr_delay))

    What makes the distribution of diamonds$carat and flights$sched_dep_time similar?

    ggplot(diamonds, aes(x = carat)) + 
      geom_density()

    In both carat and sched_dep_time, abnormally large numbers of values fall at nice “human” numbers. In sched_dep_time they fall at 00 and 30 minutes. In carat they fall at values such as 0, 1/3, 1/2, and 2/3.

    ggplot(diamonds, aes(x = carat %% 1 * 100)) +
      geom_histogram(binwidth = 1)

    In scheduled departure times it is 00 and 30 minutes, and minutes ending in 0 and 5.

    ggplot(flights_dt, aes(x = minute(sched_dep_time))) +
      geom_histogram(binwidth = 1)

    Confirm my hypothesis that the early departures of flights in minutes 20-30 and 50-60 are caused by scheduled flights that leave early. Hint: create a binary variable that tells you whether or not a flight was delayed.

    flights_dt %>%
      mutate(early = dep_delay < 0,
             minute = minute(sched_dep_time)) %>%
      group_by(minute) %>%
      summarise(early = mean(early)) %>%
      ggplot(aes(x = minute, y = early)) +
      geom_point()

    At the minute level, there doesn’t appear to be anything. But if grouped in 10 minute intervals, there is a higher proportion of early flights during those minutes.

    flights_dt %>%
      mutate(early = dep_delay < 0,
             minute = minute(sched_dep_time) %% 10) %>%
      group_by(minute) %>%
      summarise(early = mean(early)) %>%
      ggplot(aes(x = minute, y = early)) +
      geom_point()

    Time Spans

    This section covers arithmetic with dates, including subtraction, addition, and division, using three important classes:
      • durations, which represent an exact number of seconds
      • periods, which represent human units like weeks and months
      • intervals, which represent a starting and ending point

    Durations

    In R, when you subtract two dates, you get a difftime object.

    h_age <-  today() - ymd(19630509)
    h_age
    Time difference of 19705 days
    as.duration(h_age)
    [1] "1702512000s (~53.95 years)"

    Durations come with a bunch of convenient constructors: dseconds(), dminutes(), dhours(), ddays(), dweeks(), and dyears().

    dseconds(15)
    [1] "15s"
    dminutes(10)
    [1] "600s (~10 minutes)"
    ddays(0:5)
    [1] "0s"                "86400s (~1 days)"  "172800s (~2 days)" "259200s (~3 days)" "345600s (~4 days)"
    [6] "432000s (~5 days)"
    dweeks(3)
    [1] "1814400s (~3 weeks)"
    dyears(1)
    [1] "31536000s (~52.14 weeks)"

    Duration is always recorded in seconds. Larger units are created by converting minutes, hours, days, weeks and years to seconds at the standard rate. You can add and multiply durations.

    2 * dyears(1)
    [1] "63072000s (~2 years)"
    dyears(1) + dweeks(12) + dhours(15)
    [1] "38847600s (~1.23 years)"
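
A short sketch of the "always in seconds" point, limited to units whose conversion rate is fixed (minutes, hours, days). The name `total` is illustrative.

```r
library(lubridate)

# A duration is stored as a number of seconds
as.numeric(dminutes(10))  # 600

# Different constructors agree once converted to seconds
ddays(1) == dhours(24)

# Arithmetic on durations is arithmetic on seconds
total <- dhours(1) + dminutes(30)
as.numeric(total)         # 5400
```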

    You can add and subtract durations to and from days.

    tomorrow <-  today() + ddays(1)
    lst_year <-  today() - dyears(1)
    tomorrow
    [1] "2017-04-21"
    lst_year
    [1] "2016-04-20"

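    Because durations represent an exact number of seconds, adding one across a daylight saving time (DST) change can give an unexpected clock time.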

    Periods

    To solve the DST problem, lubridate provides periods. Periods are time spans but don’t have a fixed length in seconds. Instead they work with “human” times, like days and months.

    one_pm <-  ymd_hms("2016-03-12 13:00:00", tz="America/New_York")
    one_pm
    [1] "2016-03-12 13:00:00 EST"
    one_pm + ddays(1)
    [1] "2016-03-13 14:00:00 EDT"
    one_pm + days(1)
    [1] "2016-03-13 13:00:00 EDT"

    Like durations, Periods can be created with a number of constructor functions: seconds(), minutes(), hours(), days(), months(), weeks(), years()

    seconds(15)
    [1] "15S"
    minutes(10)
    [1] "10M 0S"
    hours(c(12,24))
    [1] "12H 0M 0S" "24H 0M 0S"
    days(7)
    [1] "7d 0H 0M 0S"
    months(1:6)
    [1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S" "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
    weeks(3)
    [1] "21d 0H 0M 0S"
    years(1)
    [1] "1y 0m 0d 0H 0M 0S"

    You can add and multiply periods

    10 * (months(6) + days(1))
    [1] "60m 10d 0H 0M 0S"
    days(50) + hours(25) + minutes(2)
    [1] "50d 25H 2M 0S"

    You can add them to dates. Compared to durations, periods are more likely to do what you expect.

    ymd('2016-01-01') + dyears(1)
    [1] "2016-12-31"
    ymd('2016-01-01') + years(1)
    [1] "2017-01-01"

    Periods can be useful to fix flights that appear to arrive before they departed.

    flights_dt %>%
      filter(arr_time < dep_time)

    We can fix this by adding days(1) to the arrival time of each overnight flight

    flights_dt <-  flights_dt %>%
      mutate( 
        overnight = arr_time < dep_time,
        arr_time = arr_time + days(overnight * 1),
        sched_arr_time = sched_arr_time + days(overnight * 1))
    flights_dt

    Intervals

    years(1) / days(1)
    estimate only: convert to intervals for accuracy
    [1] 365.25

    This is only an estimate because the number of days in a year depends on the year. If you want a more accurate measurement, you’ll have to use an interval. An interval is a duration with a starting point: that makes it precise, so you can determine exactly how long it is.

    next_year <-  today() + years(1)
    (today() %--% next_year) / ddays(1)
    [1] 365
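
To make the dependence on the starting point concrete, here is a minimal sketch with fixed dates instead of today(), so the result does not change from run to run. The names `y2015` and `y2016` are illustrative.

```r
library(lubridate)

# One-year intervals starting in a common year and in a leap year
y2015 <- ymd("2015-01-01") %--% ymd("2016-01-01")
y2016 <- ymd("2016-01-01") %--% ymd("2017-01-01")

y2015 / ddays(1)  # 365 days
y2016 / ddays(1)  # 366 days, because 2016 contains 29 February
```

Both intervals span "one year", yet they differ in length by a day, which is the information a bare period cannot carry.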

    To find out how many periods fall into an interval, you need to use integer division.

    (today() %--% next_year) %/% days(1)
    Note: method with signature ‘Timespan#Timespan’ chosen for function ‘%/%’,
     target signature ‘Interval#Period’.
     "Interval#ANY", "ANY#Period" would also be valid
    [1] 365

    These are the permitted arithmetic operations between the different data types.

    Exercises

    Why is there months() but no dmonths()?
    There is no unambiguous value of a month in seconds, because months have different numbers of days (28 to 31).

    Explain days(overnight * 1) to someone who has just started learning R. How does it work?

    overnight is equal to TRUE (1) or FALSE (0). So if it is an overnight flight, this becomes 1 day, and if not, then overnight = 0, and no days are added to the date.
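
The explanation can be traced step by step on two made-up flights, one overnight and one not. The vectors `arr` and `dep` are hypothetical values chosen for illustration.

```r
library(lubridate)

arr <- ymd_hm(c("2013-01-01 00:30", "2013-01-01 14:00"))  # recorded arrivals
dep <- ymd_hm(c("2013-01-01 23:00", "2013-01-01 12:00"))  # departures

overnight <- arr < dep  # TRUE for the first flight, FALSE for the second
overnight * 1           # multiplying by 1 turns TRUE/FALSE into 1/0

# days(1) is added to the overnight flight, days(0) to the other
fixed <- arr + days(overnight * 1)
fixed
```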

    Create a vector of dates giving the first day of every month in 2015.

    ymd("2015-01-01") + months(0:11)
     [1] "2015-01-01" "2015-02-01" "2015-03-01" "2015-04-01" "2015-05-01" "2015-06-01" "2015-07-01" "2015-08-01"
     [9] "2015-09-01" "2015-10-01" "2015-11-01" "2015-12-01"

    Create a vector of dates giving the first day of every month in the current year.

    floor_date(today(), unit = "year") + months(0:11)
     [1] "2017-01-01" "2017-02-01" "2017-03-01" "2017-04-01" "2017-05-01" "2017-06-01" "2017-07-01" "2017-08-01"
     [9] "2017-09-01" "2017-10-01" "2017-11-01" "2017-12-01"

    Write a function that given your birthday (as a date), returns how old you are in years.

    age <-  function(bday) {
      (bday %--% today()) %/% years(1)
    }
    age(ymd('1963-05-09'))
    [1] 53
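
Because age() depends on today(), its result changes over time. A possible variant, sketched here with a hypothetical function name `age_on` and a hypothetical `on` argument, takes the reference date as a parameter so the behaviour around a birthday can be checked with fixed dates.

```r
library(lubridate)

# Age in whole years on a given reference date (defaults to today)
age_on <- function(bday, on = today()) {
  (bday %--% on) %/% years(1)
}

bday <- ymd("1963-05-09")
age_on(bday, on = ymd("2017-04-20"))  # before the 2017 birthday
age_on(bday, on = ymd("2017-05-09"))  # on the 2017 birthday
```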

    Time Zones

    R uses international standard IANA time zones. These use a consistent naming scheme with “/” typically in the form “<continent>/<city>”

    You can find out what R thinks your current time zone is with Sys.timezone().

    Sys.timezone()
    [1] "Asia/Kuala_Lumpur"
    Sys.time()
    [1] "2017-04-20 17:37:32 MYT"

    You can see the complete list of all time zone names with OlsonNames().

    length(OlsonNames())
    [1] 589
    head(OlsonNames())
    [1] "Africa/Abidjan"     "Africa/Accra"       "Africa/Addis_Ababa" "Africa/Algiers"     "Africa/Asmara"     
    [6] "Africa/Asmera"     

    To change the underlying instant in time, use force_tz(). Use this when you have an instant that has been labeled with the incorrect time zone, and you need to fix it.

    x1 <-  ymd_hms("2015-06-01 12:00:00", tz = "America/New_York")
    x2 <-  ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen")
    x3 <-  ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland")
    x4 <- c(x1,x2,x3)
    x4b <-  force_tz(x4, tzone = "Australia/Lord_Howe")
    x4b
    [1] "2015-06-02 LHST" "2015-06-02 LHST" "2015-06-02 LHST"
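
A smaller sketch of what "changing the underlying instant" means, using a single value that is assumed to have been wrongly labeled as UTC. It relies on New York being four hours behind UTC in June; the names `x` and `y` are illustrative.

```r
library(lubridate)

# A clock time of 12:00 that was labeled UTC by mistake
x <- ymd_hms("2015-06-01 12:00:00", tz = "UTC")

# Relabel it as New York time: the clock time stays, the instant moves
y <- force_tz(x, tzone = "America/New_York")

hour(y)                          # still 12
difftime(y, x, units = "hours")  # the instant is now 4 hours later
```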

    Exercise answers from: https://jrnold.github.io/e4qf/dates-and-times.html

    