As data scientists, we will all regularly come accross data related to time, in particular time recorded using dates. Similarly, a date may not be formatted in a way that is usable and the date may only be recognisable as a string rather than as a time factor. In this discussion I will try to introduce some tricks that I have picked up for dealing with date related data.
‘lubridate’ is an R package that makes it easier to work with dates compared to the built in date functions. It was created by Hadley Wickham and Garret Grolemund.
‘dplyr’ is a data manipulation package by Hadley Wickham which provides data manipulation “grammar” to easily modify or clean data.
Using the example data sets, hflights (from the hflights package) and lakers (from the lubridate package), a number of ways to manipulate dates can be illustrated. Install these packages try:
#install.packages("lubridate")
#install.packages("hflights", lib="/Library/Frameworks/R.framework/Versions/3.3/Resources/library")
#install.packages("dplyr")
#library(lubridate)
#library(hflights)
#library(dplyr)
A quick look at the two example data sets shows that hflights has no singular dedicated date column and instead the year month and day of the month are contained in three separate columns. The lakers dataset has a string containing the digits for year month and day without clear separation.
str(hflights)
## 'data.frame': 227496 obs. of 21 variables:
## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DayofMonth : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DayOfWeek : int 6 7 1 2 3 4 5 6 7 1 ...
## $ DepTime : int 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
## $ ArrTime : int 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
## $ UniqueCarrier : chr "AA" "AA" "AA" "AA" ...
## $ FlightNum : int 428 428 428 428 428 428 428 428 428 428 ...
## $ TailNum : chr "N576AA" "N557AA" "N541AA" "N403AA" ...
## $ ActualElapsedTime: int 60 60 70 70 62 64 70 59 71 70 ...
## $ AirTime : int 40 45 48 39 44 45 43 40 41 45 ...
## $ ArrDelay : int -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
## $ DepDelay : int 0 1 -8 3 5 -1 -1 -5 43 43 ...
## $ Origin : chr "IAH" "IAH" "IAH" "IAH" ...
## $ Dest : chr "DFW" "DFW" "DFW" "DFW" ...
## $ Distance : int 224 224 224 224 224 224 224 224 224 224 ...
## $ TaxiIn : int 7 6 5 9 9 6 12 7 8 6 ...
## $ TaxiOut : int 13 9 17 22 9 13 15 12 22 19 ...
## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CancellationCode : chr "" "" "" "" ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
str(lakers$date)
## int [1:34624] 20081028 20081028 20081028 20081028 20081028 20081028 20081028 20081028 20081028 20081028 ...
When using dplyr package you can add separated date related columns together to make a date format that is more useful. I will demonstrate this with the hflights dataset using the dplyr mutate function:
mutate(dataset, Date = paste(Year, Month, Day, sep = “-”))
Dated_hflights <- mutate(hflights,Date = paste(Year, Month, DayofMonth, sep = "-"))
str(Dated_hflights$Date)
## chr [1:227496] "2011-1-1" "2011-1-2" "2011-1-3" "2011-1-4" ...
Note that the Dates_hflights should now have a new column called “Date” which has the date formatted into the Year-Month-Day string format. There is no specific requirement to place the date into this order however the order of the date should be noted so that the appropriate lubrdate function can be implemented.
You may choose the function required by the way the function’s name orders the elements: year (‘y’), month (‘m’), day (‘d’) with optional hour (‘h’), minute (‘m’) and second (‘s’). For example the string could be parsed as: dmy (30-12-2017), myd (12-30-2017), ymd (2017-12-30), ydm (2017-30-12), dym (30-2017-12), mdy (12-30-2017), or ymd_hms (2017-12-30 12:30:29).
Lubridate’s selected function will then parse the date string into R as POSIXct date-time objects.
Using: ymd(lakers$date) will output:
ymd(lakers$date[1:10])
## [1] "2008-10-28" "2008-10-28" "2008-10-28" "2008-10-28" "2008-10-28"
## [6] "2008-10-28" "2008-10-28" "2008-10-28" "2008-10-28" "2008-10-28"
And again for our previous example hflights, using ymd(Dated_Hflights$Date) will output:
ymd(Dated_hflights$Date[1:10])
## [1] "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" "2011-01-05"
## [6] "2011-01-06" "2011-01-07" "2011-01-08" "2011-01-09" "2011-01-10"
I hope you found this useful and let me know if you have feed back. For more on using lubridate or dplyr check out the following links:
dplyr package intro
https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
Hflights dataset information
https://cran.r-project.org/web/packages/hflights/index.html
lubridate vignette
https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html