Introductory statement

As data scientists, we will all regularly come accross data related to time, in particular time recorded using dates. Similarly, a date may not be formatted in a way that is usable and the date may only be recognisable as a string rather than as a time factor. In this discussion I will try to introduce some tricks that I have picked up for dealing with date related data.

What is lubridate and dplyr?

‘lubridate’ is an R package that makes it easier to work with dates compared to the built in date functions. It was created by Hadley Wickham and Garret Grolemund.

‘dplyr’ is a data manipulation package by Hadley Wickham which provides data manipulation “grammar” to easily modify or clean data.

Using the example data sets, hflights (from the hflights package) and lakers (from the lubridate package), a number of ways to manipulate dates can be illustrated. Install these packages try:

#install.packages("lubridate")
#install.packages("hflights", lib="/Library/Frameworks/R.framework/Versions/3.3/Resources/library")
#install.packages("dplyr")
#library(lubridate)
#library(hflights)
#library(dplyr)

A quick look at the two example data sets shows that hflights has no singular dedicated date column and instead the year month and day of the month are contained in three separate columns. The lakers dataset has a string containing the digits for year month and day without clear separation.

str(hflights)
## 'data.frame':    227496 obs. of  21 variables:
##  $ Year             : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
##  $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
##  $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
##  $ UniqueCarrier    : chr  "AA" "AA" "AA" "AA" ...
##  $ FlightNum        : int  428 428 428 428 428 428 428 428 428 428 ...
##  $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
##  $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
##  $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
##  $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
##  $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
##  $ Origin           : chr  "IAH" "IAH" "IAH" "IAH" ...
##  $ Dest             : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ Distance         : int  224 224 224 224 224 224 224 224 224 224 ...
##  $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
##  $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CancellationCode : chr  "" "" "" "" ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
str(lakers$date)
##  int [1:34624] 20081028 20081028 20081028 20081028 20081028 20081028 20081028 20081028 20081028 20081028 ...

Adding Day, Month and Year from different columns together

When using dplyr package you can add separated date related columns together to make a date format that is more useful. I will demonstrate this with the hflights dataset using the dplyr mutate function:

mutate(dataset, Date = paste(Year, Month, Day, sep = “-”))

Dated_hflights <- mutate(hflights,Date = paste(Year, Month, DayofMonth, sep = "-"))
str(Dated_hflights$Date)
##  chr [1:227496] "2011-1-1" "2011-1-2" "2011-1-3" "2011-1-4" ...

Note that the Dates_hflights should now have a new column called “Date” which has the date formatted into the Year-Month-Day string format. There is no specific requirement to place the date into this order however the order of the date should be noted so that the appropriate lubrdate function can be implemented.

Using lubridate to parse the date string

You may choose the function required by the way the function’s name orders the elements: year (‘y’), month (‘m’), day (‘d’) with optional hour (‘h’), minute (‘m’) and second (‘s’). For example the string could be parsed as: dmy (30-12-2017), myd (12-30-2017), ymd (2017-12-30), ydm (2017-30-12), dym (30-2017-12), mdy (12-30-2017), or ymd_hms (2017-12-30 12:30:29).

Lubridate’s selected function will then parse the date string into R as POSIXct date-time objects.

Using: ymd(lakers$date) will output:

ymd(lakers$date[1:10])
##  [1] "2008-10-28" "2008-10-28" "2008-10-28" "2008-10-28" "2008-10-28"
##  [6] "2008-10-28" "2008-10-28" "2008-10-28" "2008-10-28" "2008-10-28"

And again for our previous example hflights, using ymd(Dated_Hflights$Date) will output:

ymd(Dated_hflights$Date[1:10])
##  [1] "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" "2011-01-05"
##  [6] "2011-01-06" "2011-01-07" "2011-01-08" "2011-01-09" "2011-01-10"

I hope you found this useful and let me know if you have feed back. For more on using lubridate or dplyr check out the following links:

dplyr package intro

https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Hflights dataset information

https://cran.r-project.org/web/packages/hflights/index.html

lubridate vignette

https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html