Introduction to dates

Dates

27th Feb 2013

ISO 8601 YYYY-MM-DD

Specifying dates

As you saw in the video, R doesn’t know something is a date unless you tell it. If you have a character string that represents a date in the ISO 8601 standard you can turn it into a Date using the as.Date() function. Just pass the character string (or a vector of character strings) as the first argument.

In this exercise you’ll convert a character string representation of a date to a Date object.

# The date R 3.0.0 was released
x <- "2013-04-03"

# Examine structure of x
str(x)
##  chr "2013-04-03"
# Use as.Date() to interpret x as a date
x_date <- as.Date(x)

# Examine structure of x_date
str(x_date)
##  Date[1:1], format: "2013-04-03"
# Store April 10 2014 as a Date
april_10_2014 <- as.Date("2014-04-10")

Fantastic work! What if your string isn’t in ISO 8601 format? Don’t worry, you’ll learn how to parse all sorts of formats in Chapter 2.
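As a small preview, here is a minimal sketch of the two points above: as.Date() accepts a whole vector of ISO 8601 strings, and a non-ISO string can still be converted if you spell out its layout with the format argument. The variable names and the US-style string are made up for illustration; only the two dates come from the exercise.

```r
# Two ISO 8601 strings taken from the exercise above
release_strings <- c("2013-04-03", "2014-04-10")

# as.Date() is vectorised: one Date per input string
release_dates <- as.Date(release_strings)
class(release_dates)   # "Date"
length(release_dates)  # 2

# A hypothetical US-style string (month/day/year) is not ISO 8601,
# so its layout is described explicitly with `format`
us_style <- as.Date("04/03/2013", format = "%m/%d/%Y")

# Both routes describe the same calendar day
us_style == release_dates[1]  # TRUE
```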

Automatic import

Sometimes you’ll need to input a couple of dates by hand using as.Date(), but it’s much more common to have a column of dates in a data file.

Some functions that read in data will automatically recognize and parse dates in a variety of formats. In particular the import functions, like read_csv(), in the readr package will recognize dates in a few common formats.

There is also the anytime() function in the anytime package whose sole goal is to automatically parse strings as dates regardless of the format.

Try them both out in this exercise.

# Load the readr package
library(readr)

# Use read_csv() to import rversions.csv
releases <- read_csv("_data/rversions.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   major = col_double(),
##   minor = col_double(),
##   patch = col_double(),
##   date = col_date(format = ""),
##   datetime = col_datetime(format = ""),
##   time = col_time(format = ""),
##   type = col_character()
## )
# Examine the structure of the date column
str(releases$date)
##  Date[1:105], format: "1997-12-04" "1997-12-21" "1998-01-10" "1998-03-14" "1998-05-02" ...
# Load the anytime package
library(anytime)

# Various ways of writing Sep 10 2009
sep_10_2009 <- c("September 10 2009", "2009-09-10", "10 Sep 2009", "09-10-2009")

# Use anytime() to parse sep_10_2009
anytime(sep_10_2009)
## [1] "2009-09-10 EDT" "2009-09-10 EDT" "2009-09-10 EDT" "2009-09-10 EDT"

Nice, you’re already importing dates into R! Sometimes these functions won’t work, especially if dates are ambiguous (e.g., is 2004-10-4 Oct 4th or April 10th?), but you’ll learn how to handle these cases in Chapter 2.
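To see why an ambiguous string is a problem, here is a rough sketch that parses the string from the note above under two different assumptions about field order. It uses base R's as.Date() with an explicit format rather than the tools from Chapter 2, and the variable names are illustrative.

```r
# The ambiguous string mentioned above
ambiguous <- "2004-10-4"

# Assumption 1: year-month-day
year_month_day <- as.Date(ambiguous, format = "%Y-%m-%d")

# Assumption 2: year-day-month
year_day_month <- as.Date(ambiguous, format = "%Y-%d-%m")

format(year_month_day)  # "2004-10-04", i.e. Oct 4th
format(year_day_month)  # "2004-04-10", i.e. April 10th

# Same text, two different dates
year_month_day == year_day_month  # FALSE
```

Neither result is wrong from R's point of view; only knowledge of how the data was recorded can tell you which format applies.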

Why use dates?

Plotting

If you plot a Date on the axis of a plot, you expect the dates to be in calendar order, and that’s exactly what happens with plot() or ggplot().

In this exercise you’ll make some plots with the R version releases data from the previous exercises using ggplot2. There are two big differences when a Date is on an axis:

  1. If you specify limits they must be Date objects.

  2. To control the behavior of the scale you use the scale_x_date() function.

Have a go in this exercise where you explore how often R releases occur.

library(ggplot2)

# Set the x axis to the date column
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major)))

# Limit the axis to between 2010-01-01 and 2014-01-01
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major))) +
  xlim(as.Date("2010-01-01"), as.Date("2014-01-01"))
## Warning: Removed 87 row(s) containing missing values (geom_path).

# Specify breaks every ten years and labels with "%Y"
ggplot(releases, aes(x = date, y = type)) +
  geom_line(aes(group = 1, color = factor(major))) +
  scale_x_date(date_breaks = "10 years", date_labels = "%Y")

Super! You’ll use ggplot2 quite a lot in Chapter 2. We’ll provide the code you need, but if you want to learn more about ggplot2, take the Data Visualization with ggplot2 course.

Arithmetic and logical operators

Since Date objects are internally represented as the number of days since 1970-01-01, you can do basic math and comparisons with dates. You can compare dates with the usual logical operators (<, ==, >, etc.), find extremes with min() and max(), and even subtract two dates to find the time between them.
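A minimal sketch of that internal representation, using the two dates from the first exercise (the name epoch is just a label chosen here):

```r
epoch         <- as.Date("1970-01-01")
x_date        <- as.Date("2013-04-03")
april_10_2014 <- as.Date("2014-04-10")

# Under the hood a Date is a count of days since 1970-01-01
as.numeric(epoch)      # 0
as.numeric(epoch + 1)  # 1

# Comparisons and extremes work on that count
x_date < april_10_2014      # TRUE
max(x_date, april_10_2014)  # "2014-04-10"

# Subtracting two dates gives a time difference
april_10_2014 - x_date      # Time difference of 372 days
```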

In this exercise you’ll see how these operations work by exploring the last R release. You’ll see Sys.Date() in the code; it simply returns today’s date.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Find the largest date
last_release_date <- max(releases$date)

# Filter row for last release
last_release <- filter(releases, date == last_release_date)

# Print last_release
last_release
## # A tibble: 1 x 7
##   major minor patch date       datetime            time     type 
##   <dbl> <dbl> <dbl> <date>     <dttm>              <time>   <chr>
## 1     3     4     1 2017-06-30 2017-06-30 07:04:11 07:04:11 patch
# How long since last release?
Sys.Date() - last_release_date
## Time difference of 1263 days

Great job! Did you notice that the time since last release was reported in days? You’ll learn a ton more about controlling the units of time differences and doing calculations with dates in Chapter 3.
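If you are curious before you get there, here is a short sketch of one way base R lets you read a time difference in other units. It uses the two dates from the first exercise rather than today's date, so the result does not change from day to day.

```r
# Subtracting Dates returns a difftime measured in days
elapsed <- as.Date("2014-04-10") - as.Date("2013-04-03")
units(elapsed)  # "days"

# difftime() lets you ask for the units up front
difftime(as.Date("2014-04-10"), as.Date("2013-04-03"), units = "weeks")

# Or convert an existing difftime to a plain number in chosen units
as.numeric(elapsed, units = "weeks")  # 372 / 7 weeks
```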

What about times?

ISO 8601

HH:MM:SS

Datetimes behave nicely too

Once stored as a POSIXct object, datetimes can be compared, subtracted, and plotted, just like dates.

Getting datetimes into R

Just like dates without times, if you want R to recognize a string as a datetime you need to convert it, although now you use as.POSIXct(). as.POSIXct() expects strings to be in the format YYYY-MM-DD HH:MM:SS.

The only tricky thing is that times will be interpreted in local time based on your machine’s setup. You can check your time zone with Sys.timezone(). If you want the time to be interpreted in a different time zone, set the tz argument of as.POSIXct(). You’ll learn more about time zones in Chapter 4.
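To make the effect of tz concrete, here is a small sketch that reads the same clock time in two time zones. It sets tz explicitly in both calls so the result should not depend on your machine; the variable names are illustrative.

```r
# The same wall-clock time, interpreted in two different time zones
utc_time <- as.POSIXct("2010-10-01 12:12:00", tz = "UTC")
la_time  <- as.POSIXct("2010-10-01 12:12:00", tz = "America/Los_Angeles")

# They are different instants: Los Angeles was 7 hours behind UTC that day
difftime(la_time, utc_time, units = "hours")  # Time difference of 7 hours

# Displaying the Los Angeles time in UTC shows the shift
format(la_time, tz = "UTC", usetz = TRUE)  # "2010-10-01 19:12:00 UTC"
```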

In this exercise you’ll input a couple of datetimes by hand and then see that read_csv() also handles datetimes automatically in a lot of cases.

# Use as.POSIXct() to enter the datetime
as.POSIXct("2010-10-01 12:12:00")
## [1] "2010-10-01 12:12:00 EDT"
# Use as.POSIXct again but set the timezone to `"America/Los_Angeles"`
as.POSIXct("2010-10-01 12:12:00", tz = "America/Los_Angeles")
## [1] "2010-10-01 12:12:00 PDT"
# Use readr to import rversions.csv
releases <- read_csv("_data/rversions.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   major = col_double(),
##   minor = col_double(),
##   patch = col_double(),
##   date = col_date(format = ""),
##   datetime = col_datetime(format = ""),
##   time = col_time(format = ""),
##   type = col_character()
## )
# Examine structure of datetime column
str(releases$datetime)
##  POSIXct[1:105], format: "1997-12-04 08:47:58" "1997-12-21 13:09:22" "1998-01-10 00:31:55" ...

Nice work! Did you take a look at the release times? I wonder how quickly people download new versions…

Datetimes behave nicely too

Just like Date objects, you can plot and do math with POSIXct objects.

As an example, in this exercise you’ll see how quickly people download new versions of R, by examining the download logs from the RStudio CRAN mirror.

R 3.2.0 was released at “2015-04-16 07:13:33”, so cran-logs_2015-04-17.csv contains a random sample of downloads on the 16th, 17th and 18th.

# Import "cran-logs_2015-04-17.csv" with read_csv()
logs <- read_csv("_data/cran-logs_2015-04-17.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   datetime = col_datetime(format = ""),
##   r_version = col_character(),
##   country = col_character()
## )
# Print logs
logs
## # A tibble: 100,000 x 3
##    datetime            r_version country
##    <dttm>              <chr>     <chr>  
##  1 2015-04-16 22:40:19 3.1.3     CO     
##  2 2015-04-16 09:11:04 3.1.3     GB     
##  3 2015-04-16 17:12:37 3.1.3     DE     
##  4 2015-04-18 12:34:43 3.2.0     GB     
##  5 2015-04-16 04:49:18 3.1.3     PE     
##  6 2015-04-16 06:40:44 3.1.3     TW     
##  7 2015-04-16 00:21:36 3.1.3     US     
##  8 2015-04-16 10:27:23 3.1.3     US     
##  9 2015-04-16 01:59:43 3.1.3     SG     
## 10 2015-04-18 15:41:32 3.2.0     CA     
## # ... with 99,990 more rows
# Store the release time as a POSIXct object
release_time <- as.POSIXct("2015-04-16 07:13:33", tz = "UTC")

# When is the first download of 3.2.0?
logs %>% 
  filter(datetime > release_time,
    r_version == "3.2.0")
## # A tibble: 35,826 x 3
##    datetime            r_version country
##    <dttm>              <chr>     <chr>  
##  1 2015-04-18 12:34:43 3.2.0     GB     
##  2 2015-04-18 15:41:32 3.2.0     CA     
##  3 2015-04-18 14:58:41 3.2.0     IE     
##  4 2015-04-18 16:44:45 3.2.0     US     
##  5 2015-04-18 04:34:35 3.2.0     US     
##  6 2015-04-18 22:29:45 3.2.0     CH     
##  7 2015-04-17 16:21:06 3.2.0     US     
##  8 2015-04-18 20:34:57 3.2.0     AT     
##  9 2015-04-17 18:23:19 3.2.0     US     
## 10 2015-04-18 03:00:31 3.2.0     US     
## # ... with 35,816 more rows
# Examine histograms of downloads by version
ggplot(logs, aes(x = datetime)) +
  geom_histogram() +
  geom_vline(aes(xintercept = as.numeric(release_time))) +
  facet_wrap(~ r_version, ncol = 1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Cool plot! Did you see how it takes about two days for downloads of the new version (3.2.0) to overtake downloads of the old version (3.1.3)?
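The comparison and subtraction behind this exercise can also be tried on a handful of values in base R. The sketch below assumes the logged times are in UTC and uses just three of the 3.2.0 rows printed above, so the "earliest" download here is the earliest of those three, not of the whole log.

```r
release_time <- as.POSIXct("2015-04-16 07:13:33", tz = "UTC")

# Three download times copied from the printed sample (assumed to be UTC)
downloads <- as.POSIXct(
  c("2015-04-18 12:34:43", "2015-04-17 16:21:06", "2015-04-17 18:23:19"),
  tz = "UTC"
)

# Comparison is element-wise, just like with Dates
all(downloads > release_time)  # TRUE

# min() finds the earliest datetime in the vector
first_download <- min(downloads)
first_download  # "2015-04-17 16:21:06 UTC"

# Subtraction gives the wait between release and that download
wait <- difftime(first_download, release_time, units = "hours")
wait  # roughly 33.1 hours
```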

Why lubridate?

lubridate