Introduction to R. Session 03.

Working with dates

# create some dates
dates <- c("03.05.1992 16:53:15", "08.02.1991 18:04:12", "28.08.1998 14:34:23", 
    "09.12.2002 09:12:01", "12.09.2005 08:32:55")

# create some values
values <- c(25, 38, 45, 53, 66)

# put the dates and values in a data frame
df <- data.frame(dates, values)

# check the properties of the data frame
summary(df)
##                  dates       values    
##  03.05.1992 16:53:15:1   Min.   :25.0  
##  08.02.1991 18:04:12:1   1st Qu.:38.0  
##  09.12.2002 09:12:01:1   Median :45.0  
##  12.09.2005 08:32:55:1   Mean   :45.4  
##  28.08.1998 14:34:23:1   3rd Qu.:53.0  
##                          Max.   :66.0

str(df)
## 'data.frame':    5 obs. of  2 variables:
##  $ dates : Factor w/ 5 levels "03.05.1992 16:53:15",..: 1 2 5 3 4
##  $ values: num  25 38 45 53 66

Oh no! str(df) shows that the dates have been stored as a factor, not as dates!

Dates nearly always have to be specified manually.
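
Whether the column becomes a factor depends on the R version: before R 4.0.0, data.frame() converted character vectors to factors by default, while newer versions keep them as character. A minimal sketch that makes the behaviour explicit either way (the names dates_chr and df_chr are only used for this illustration):

```r
dates_chr <- c("03.05.1992 16:53:15", "08.02.1991 18:04:12")

# stringsAsFactors = FALSE keeps the text as plain character strings
df_chr <- data.frame(dates = dates_chr, values = c(25, 38),
                     stringsAsFactors = FALSE)

class(df_chr$dates)  # "character"
```

In both cases the column is still text and has to be converted to a date/time class by hand.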

In this example, each value holds a date and a time together, so it is not a pure calendar date. The most obvious function, as.Date(), is therefore NOT suitable: it handles calendar dates only and would discard the time of day.
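
A small sketch of what happens to the time of day when such a value is pushed through as.Date() (the variable names are only for illustration):

```r
x <- "03.05.1992 16:53:15"

# as.Date() reads only the calendar part; the trailing time is ignored
d <- as.Date(x, format = "%d.%m.%Y")

class(d)                        # "Date"
format(d, "%Y-%m-%d %H:%M:%S")  # "1992-05-03 00:00:00"
```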

R follows the POSIX standard (“Portable Operating System Interface”) established by the IEEE. Date and time are given in POSIX time, defined as “seconds since 1.1.1970 00:00:00”. In R, POSIX time is implemented via the “POSIXlt” and “POSIXct” classes.

From ?DateTimeClasses: “Class "POSIXct" represents the (signed) number of seconds since the beginning of 1970 (in the UTC timezone) as a numeric vector. Class "POSIXlt" is a named list of vectors representing [date elements]” and “"POSIXct" is more convenient for including in data frames, and "POSIXlt" is closer to human-readable forms.”
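
To make the difference between the two classes concrete, here is a minimal sketch using one day after the start of POSIX time (variable names are illustrative):

```r
ct <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")

# POSIXct is a number of seconds under the hood
as.numeric(ct)   # 86400

# POSIXlt stores the individual elements in a list
lt <- as.POSIXlt(ct)
lt$year + 1900   # 1970 (years are counted from 1900)
lt$mon + 1       # 1    (months are counted from 0)
lt$mday          # 2
```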

How do we convert the factor in the data frame to a POSIX date?

We create a new column in the data frame to store the converted date.

The conversion options and an overview of the names of the date/time elements are given in ?strptime. For example, %m is the month as a number, %j is the day of the year, and %d is the day of the month as a number.

In our example, we give the format of our date elements and indicate the punctuation between them:

# check the first element of the dates column as an example:
df$dates[1]
## [1] 03.05.1992 16:53:15
## 5 Levels: 03.05.1992 16:53:15 08.02.1991 18:04:12 ... 28.08.1998 14:34:23

# then we specifically define the format of the date elements and provide
# the time zone
df$Date <- as.POSIXct(df$dates, format = "%d.%m.%Y %H:%M:%S", tz = "Australia/Melbourne")

# check the data frame again
head(df)
##                 dates values                Date
## 1 03.05.1992 16:53:15     25 1992-05-03 16:53:15
## 2 08.02.1991 18:04:12     38 1991-02-08 18:04:12
## 3 28.08.1998 14:34:23     45 1998-08-28 14:34:23
## 4 09.12.2002 09:12:01     53 2002-12-09 09:12:01
## 5 12.09.2005 08:32:55     66 2005-09-12 08:32:55
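
One thing to watch: a format string that does not match the data does not raise an error, it quietly produces NA. A short sketch with one matching and one non-matching value (the vector x is made up for this illustration):

```r
x <- c("03.05.1992 16:53:15", "1992-05-03 16:53:15")

parsed <- as.POSIXct(x, format = "%d.%m.%Y %H:%M:%S", tz = "UTC")

# the second value does not follow the day.month.year pattern
is.na(parsed)  # FALSE  TRUE
```

Checking the result with is.na() or anyNA() right after a conversion catches such cases early.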

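# the time zone can be given as a name (as above) or as an abbreviation (as
# here); see ?Sys.timezone and ?timezones for details. Abbreviations such as
# 'EST' are ambiguous and need not mean Australian Eastern Standard Time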
df$Date <- as.POSIXct(df$dates, format = "%d.%m.%Y %H:%M:%S", tz = "EST")

# check the data frame again
head(df)
##                 dates values                Date
## 1 03.05.1992 16:53:15     25 1992-05-03 16:53:15
## 2 08.02.1991 18:04:12     38 1991-02-08 18:04:12
## 3 28.08.1998 14:34:23     45 1998-08-28 14:34:23
## 4 09.12.2002 09:12:01     53 2002-12-09 09:12:01
## 5 12.09.2005 08:32:55     66 2005-09-12 08:32:55

# check the details of df
str(df)
## 'data.frame':    5 obs. of  3 variables:
##  $ dates : Factor w/ 5 levels "03.05.1992 16:53:15",..: 1 2 5 3 4
##  $ values: num  25 38 45 53 66
##  $ Date  : POSIXct, format: "1992-05-03 16:53:15" "1991-02-08 18:04:12" ...
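
The printed times look identical for both time zone settings, but they refer to different moments. According to ?timezones, "EST" denotes a zone at UTC-5 without daylight saving time, not Australian Eastern Standard Time. A minimal sketch, assuming a standard time zone database in which both names are available:

```r
x <- "1992-05-03 16:53:15"

mel <- as.POSIXct(x, tz = "Australia/Melbourne")  # UTC+10 in May 1992
est <- as.POSIXct(x, tz = "EST")                  # UTC-5

# same clock reading, different instants
gap <- as.numeric(difftime(est, mel, units = "hours"))
gap  # 15
```

For Australian data, a location name such as "Australia/Melbourne" is therefore the safer choice.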

# convert the date to another format for fun
format(df$Date, "%m")  # get the month as a two-digit number (returned as character)
## [1] "05" "02" "08" "12" "09"
format(df$Date, "%C")  # get the century (i.e. either 19 or 20 for the example data)
## [1] "19" "19" "19" "20" "20"
format(df$Date, "%l")  # hour based on a 12 hour clock. Careful with this one!!
## [1] " 4" " 6" " 2" " 9" " 8"
format(df$Date, "%p")  # am/pm information
## [1] "PM" "PM" "PM" "AM" "AM"
format(df$Date, "%k")  # hour based on a 24 hour clock.
## [1] "16" "18" "14" " 9" " 8"
format(df$Date, "%Z")  # get time zone information
## [1] "EST" "EST" "EST" "EST" "EST"
format(df$Date, "%A")  # weekday name
## [1] "Sunday" "Friday" "Friday" "Monday" "Monday"
format(df$Date, "%b")  # abbreviated month name (based on your computer's locale setting)
## [1] "May" "Feb" "Aug" "Dec" "Sep"
format(df$Date, "%j")  # Day of year
## [1] "124" "039" "240" "343" "255"
# remember that you can always assign these to a new column in the dataframe
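
As a sketch of that last remark: format() always returns character strings, so wrap the result in as.integer() when a number is needed. The data frame df_demo is made up for this illustration:

```r
df_demo <- data.frame(
  Date = as.POSIXct(c("1992-05-03 16:53:15", "2002-12-09 09:12:01"), tz = "UTC")
)

# store individual date elements as numeric columns
df_demo$month <- as.integer(format(df_demo$Date, "%m"))
df_demo$year  <- as.integer(format(df_demo$Date, "%Y"))

df_demo$month  # 5 12
df_demo$year   # 1992 2002
```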

Another example for importing dates

dates2 <- c("August 09, 2010 - 01:05:18 AM", "July 18, 2011 - 3:12:36 PM")

values2 <- c(9, 11)

df2 <- data.frame(dates2, values2)

head(df2)
##                          dates2 values2
## 1 August 09, 2010 - 01:05:18 AM       9
## 2    July 18, 2011 - 3:12:36 PM      11

As a safety precaution, it is a good idea to wrap a factor like this in as.character() before converting it. This helps especially in cases where a date or time component might have been interpreted as numeric during import.

df2$Dates <- as.POSIXct(as.character(df2$dates2), format = "%B %d, %Y - %I:%M:%S %p", 
    tz = "EST")

head(df2)
##                          dates2 values2               Dates
## 1 August 09, 2010 - 01:05:18 AM       9 2010-08-09 01:05:18
## 2    July 18, 2011 - 3:12:36 PM      11 2011-07-18 15:12:36
str(df2)
## 'data.frame':    2 obs. of  3 variables:
##  $ dates2 : Factor w/ 2 levels "August 09, 2010 - 01:05:18 AM",..: 1 2
##  $ values2: num  9 11
##  $ Dates  : POSIXct, format: "2010-08-09 01:05:18" "2011-07-18 15:12:36"
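
Why as.character() matters for factors: a factor stores integer codes, and a direct numeric conversion returns those codes instead of the labels. A minimal sketch with a made-up factor of years:

```r
f <- factor(c("2010", "2011", "2010"))

as.numeric(f)                # 1 2 1  (the internal level codes)
as.numeric(as.character(f))  # 2010 2011 2010  (the actual labels)
```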

Be aware of the importance of am/pm in some date/time formats. If the format uses %H and leaves out %p, the PM marker is silently ignored:

# check this command
df2$Dates2 <- as.POSIXct(as.character(df2$dates2), format = "%B %d, %Y - %H:%M:%S", 
    tz = "EST")

# check the data again
head(df2)
##                          dates2 values2               Dates
## 1 August 09, 2010 - 01:05:18 AM       9 2010-08-09 01:05:18
## 2    July 18, 2011 - 3:12:36 PM      11 2011-07-18 15:12:36
##                Dates2
## 1 2010-08-09 01:05:18
## 2 2011-07-18 03:12:36

# get rid of the wrong column, i.e. declare it 'undefined' via assigning
# 'NULL' to it.
df2$Dates2 <- NULL

# check again
head(df2)
##                          dates2 values2               Dates
## 1 August 09, 2010 - 01:05:18 AM       9 2010-08-09 01:05:18
## 2    July 18, 2011 - 3:12:36 PM      11 2011-07-18 15:12:36

# now that we are 100% sure the date was imported correctly, the original
# column is obsolete and we get rid of it as well
df2$dates2 <- NULL

head(df2)
##   values2               Dates
## 1       9 2010-08-09 01:05:18
## 2      11 2011-07-18 15:12:36

Once the import / date conversion procedure has been confirmed, it is possible to convert the date in place without using an extra column (assigning the result to the column itself). But be careful!

# e.g. this fails here because the column dates2 was already deleted above
df2$dates2 <- as.POSIXct(as.character(df2$dates2))
## Error: replacement has 0 rows, data has 2
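
A guarded version of the in-place conversion could look like the sketch below. It checks that the column exists and overwrites it only when every value was parsed. The data frame df_demo and its column "when" are hypothetical:

```r
df_demo <- data.frame(when = c("2010-08-09 01:05:18", "2011-07-18 15:12:36"),
                      stringsAsFactors = FALSE)

if ("when" %in% names(df_demo)) {
  converted <- as.POSIXct(as.character(df_demo$when),
                          format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  # overwrite the original column only if nothing turned into NA
  if (!anyNA(converted)) df_demo$when <- converted
}

class(df_demo$when)  # "POSIXct" "POSIXt"
```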

Selections with dates

When providing a date manually, convert it to POSIXct on the fly:

df[df$Date > as.POSIXct("2000-01-01"), ]
##                 dates values                Date
## 4 09.12.2002 09:12:01     53 2002-12-09 09:12:01
## 5 12.09.2005 08:32:55     66 2005-09-12 08:32:55
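
The same idea extends to a date range by combining two comparisons. A minimal sketch with a made-up data frame (df_demo, from and to are illustrative names), using an explicit time zone so the limits do not depend on the local setting:

```r
df_demo <- data.frame(
  Date = as.POSIXct(c("1991-02-08 18:04:12", "1998-08-28 14:34:23",
                      "2005-09-12 08:32:55"), tz = "UTC"),
  values = c(38, 45, 66)
)

from <- as.POSIXct("1995-01-01", tz = "UTC")
to   <- as.POSIXct("2000-01-01", tz = "UTC")

# keep rows with from <= Date < to
sel <- df_demo[df_demo$Date >= from & df_demo$Date < to, ]
sel$values  # 45
```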

Calculations with dates

Duration:

df$Date[1] - df$Date[3]  # returns an object of class 'difftime'
## Time difference of -2308 days
as.numeric(df$Date[1] - df$Date[3])
## [1] -2308

# be careful with a difftime object: it has a sign! Test whether the difference
# is less than 1000 days
df$Date[1] - df$Date[3] < 1000
## [1] TRUE
df$Date[3] - df$Date[1] < 1000
## [1] FALSE
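
A further point of care: subtracting two date-times picks the unit automatically (seconds, minutes, hours or days), so the bare number can mean different things. A short sketch that fixes the unit with difftime() (a and b are illustrative values):

```r
a <- as.POSIXct("1992-05-03 16:53:15", tz = "UTC")
b <- as.POSIXct("1992-05-03 18:53:15", tz = "UTC")

b - a  # Time difference of 2 hours

# state the unit explicitly before converting to a number
secs <- as.numeric(difftime(b, a, units = "secs"))
secs   # 7200
days <- as.numeric(difftime(b, a, units = "days"))
days   # 0.08333333
```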

# addition: remember, this works in seconds!
df$Date[1] + 5  # this adds five seconds
## [1] "1992-05-03 16:53:20 EST"

# calculations with dates can be tricky in base R because everything is based
# on seconds. Calculations are possible once everything is broken down to
# seconds, e.g. add one day to a date:
sec.in.day <- 60 * 60 * 24
df$Date[1] + sec.in.day
## [1] "1992-05-04 16:53:15 EST"

# two days
df$Date[1] + 2 * sec.in.day
## [1] "1992-05-05 16:53:15 EST"

# subtract three days
df$Date[1] - 3 * sec.in.day
## [1] "1992-04-30 16:53:15 EST"

Another option is to extract the date elements separately, as shown in the format() examples above, and manipulate them individually.
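
One way to do this in base R is through the POSIXlt class, whose elements can be changed directly. This is only a sketch of the idea, not the approach used in the session; out-of-range values such as "31 May plus one day" are normalised when converting back to POSIXct:

```r
lt <- as.POSIXlt("1992-05-31 16:53:15", tz = "UTC")

# add one day to the day-of-month element
lt$mday <- lt$mday + 1

next_day <- as.POSIXct(lt)
format(next_day, "%Y-%m-%d %H:%M:%S")  # "1992-06-01 16:53:15"
```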

Using seq() with a step given in days is another option in base R (examples from H. Wickham, lubridate):

# add a day to a date
seq(df$Date[1], length = 2, by = "day")[2]
## [1] "1992-05-04 16:53:15 EST"

# Subtract 5 days from a day
seq(df$Date[1], length = 2, by = "-5 day")[2]
## [1] "1992-04-28 16:53:15 EST"
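
seq() accepts other units as well, for example "month", which is awkward to express in seconds because months differ in length. A minimal sketch (start and monthly are illustrative names):

```r
start <- as.POSIXct("1992-01-15 16:53:15", tz = "UTC")

# three time points, one calendar month apart
monthly <- seq(start, by = "month", length.out = 3)

format(monthly, "%Y-%m-%d")  # "1992-01-15" "1992-02-15" "1992-03-15"
```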

In short: it is complicated, unfortunately. Of course, there is a package to help with dates.

I recommend the excellent “lubridate” package by G. Grolemund and H. Wickham for date arithmetic.
Introduction: http://www.jstatsoft.org/v40/i03/paper
Manual: http://cran.r-project.org/web/packages/lubridate/lubridate.pdf

If you have not installed it yet:

install.packages("lubridate", dep = TRUE)
# load the package
library(lubridate)

# add one day
df$Date[1] + days(1)
## [1] "1992-05-04 16:53:15 EST"

# add three minutes
df$Date[1] + minutes(3)
## [1] "1992-05-03 16:56:15 EST"

# add twelve years
df$Date[1] + years(12)
## [1] "2004-05-03 16:53:15 EST"

# Be careful regarding leap years when calculating with seconds!
as.POSIXct("2001-02-28", tz = "EST") + 365 * sec.in.day  # ok
## [1] "2002-02-28 EST"
as.POSIXct("2000-02-29", tz = "EST") + 365 * sec.in.day  # a day short
## [1] "2001-02-28 EST"
as.POSIXct("2000-02-28", tz = "EST") + 366 * sec.in.day  # ok
## [1] "2001-02-28 EST"

as.POSIXct("2000-02-28", tz = "EST") + sec.in.day
## [1] "2000-02-29 EST"
as.POSIXct("2001-02-28", tz = "EST") + sec.in.day
## [1] "2001-03-01 EST"

# better use the 'lubridate' package to avoid trouble
as.POSIXct("2000-02-28", tz = "EST") + years(1)
## [1] "2001-02-28 EST"
as.POSIXct("2000-02-29", tz = "EST") + years(1)
## [1] NA
as.POSIXct("2001-02-29", tz = "EST")  # gives an obvious error
## Error: character string is not in a standard unambiguous format
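
If an NA is not wanted for 29 February plus one year, lubridate offers the %m+% operator, which rolls back to the last existing day of the month. A short sketch, assuming the lubridate package is installed:

```r
library(lubridate)

leap_day <- as.POSIXct("2000-02-29", tz = "UTC")

plain  <- leap_day + years(1)     # NA, because 2001-02-29 does not exist
rolled <- leap_day %m+% years(1)  # rolls back to the end of February

format(rolled, "%Y-%m-%d")  # "2001-02-28"
```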

# to check for leap_years with the 'lubridate' package in case you need to
# do something manually
leap_year(2000)
## [1] TRUE
leap_year(2001)
## [1] FALSE
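
For comparison, the leap year rule itself is short enough to write in base R. The helper is_leap() below is a hypothetical name for this illustration, not part of any package:

```r
# a year is a leap year if divisible by 4, except century years,
# which must also be divisible by 400
is_leap <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | year %% 400 == 0
}

is_leap(c(1900, 2000, 2001, 2004))  # FALSE  TRUE FALSE  TRUE
```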

# With POSIXct and lubridate, leap years are handled explicitly: a date that
# does not exist yields NA or an error instead of a silently shifted result

++++++++++++++++++++++++++++++++++++ Next session: ++++++++++++++++++++++++++++++++++++

I'll give a quick introduction to the graphics package “ggplot2” (Grammar of Graphics) by Hadley Wickham.
See the reference manual at http://had.co.nz/ggplot2/ and the more detailed, updated manual at http://had.co.nz/ggplot2/docs/

Install the package from the repositories:

install.packages("ggplot2", dep = TRUE)

In the list of available CRAN repositories, choose the “Uni Melbourne” CRAN mirror: http://cran.ms.unimelb.edu.au/

After that, it will be a more open discussion regarding your needs.


Requests for topics to discuss in the next session: