Latest Versions & Updates: This markdown document was built using the following versions of R and RStudio:
The following reviews some important concepts for working with text data, as seen in Intro to R: Text Data:
Classes: Most variables are of a particular class, e.g. “numeric”, “integer”, “character”, “logical”, etc.
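- Check a variable’s class with class()
- Check every variable’s class with str() on a data frame
- Test a class, returning TRUE or FALSE, with is.*() functions, e.g. is.logical()
- Coerce to a class with as.*() functions, e.g. as.character()
Character & Factor Classes: The most common types of qualitative variables are of class character and factor.
- Character values are wrapped in quotes ("")
- Coerce with as.character(), confirm with is.character()
- Coerce with as.factor(), confirm with is.factor()
- Create factors with factor()
- Set levels with the levels = argument or levels() function
Fundamentals of Text: There are several easy-to-use Base R functions for “character” data.
- Combine strings with paste() and specify delimiting characters with argument sep =
- Format values with format() and formatC()
- Package scales has additional formatting options, e.g. dollar()
- Print text with writeLines()
- Print without quotes with noquote()
Package “stringr” Functions: The “stringr” package has an intuitive, unified framework common to the Tidyverse.
- Join strings with str_c()
- Count characters with str_length()
- Extract substrings with str_sub()
- Return TRUE and FALSE values for pattern matches with str_detect()
- Keep matching strings with str_subset()
- Count matches with str_count()
- Split strings with str_split()
- Replace matches with str_replace() and str_replace_all()
- Trim whitespace with str_trim()
- Pad strings with str_pad()
Overview: We can easily create strings of text to represent dates and times.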
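POSIXlt stores datetimes as a list of named components: seconds, minutes, hours, day of the month, month, year, and so on. Note that mon counts from 0 (January) and year counts from 1900, so December 2018 appears below as 11 and 118.
Here, we’ll get the current date and time with function Sys.time():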
Sys.time()
## [1] "2018-12-20 16:19:49 EST"
my_date_lt <- as.POSIXlt(Sys.time())
unclass(my_date_lt)
## $sec
## [1] 49.21545
##
## $min
## [1] 19
##
## $hour
## [1] 16
##
## $mday
## [1] 20
##
## $mon
## [1] 11
##
## $year
## [1] 118
##
## $wday
## [1] 4
##
## $yday
## [1] 353
##
## $isdst
## [1] 0
##
## $zone
## [1] "EST"
##
## $gmtoff
## [1] -18000
##
## attr(,"tzone")
## [1] "" "EST" "EDT"
POSIXct stores a datetime as a single number: seconds since January 1, 1970 (UTC).
Sys.time()
## [1] "2018-12-20 16:19:49 EST"
my_date_ct <- as.POSIXct(Sys.time())
unclass(my_date_ct)
## [1] 1545340789
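To make the contrast concrete, here is a minimal sketch that uses a fixed datetime instead of Sys.time(), so the values are the same every time it runs. The object names and the choice of tz = "UTC" are assumptions for illustration only.

```r
# A fixed instant, so results do not depend on when or where this runs
x_ct <- as.POSIXct("2018-12-20 16:19:49", tz = "UTC")
x_lt <- as.POSIXlt(x_ct)

# POSIXct: one number, seconds since 1970-01-01 00:00:00 UTC
secs <- as.numeric(x_ct)
print(secs)

# POSIXlt: a list of components; year counts from 1900, mon runs 0-11
year_full <- x_lt$year + 1900
month_num <- x_lt$mon + 1
print(c(year_full, month_num))
```

Because “POSIXct” is a single number, it is usually the handier class for columns in a data frame, while “POSIXlt” is handy when we want to pull out individual pieces.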
“POSIXlt” values may be formatted in a variety of ways.
Coercing to “POSIXlt”: We can coerce variables to “POSIXlt” class using function as.POSIXlt(), e.g.:
my_date <- "2018-12-21"
print(my_date)
## [1] "2018-12-21"
class(my_date)
## [1] "character"
my_date <- as.POSIXlt("2018-12-21")
print(my_date)
## [1] "2018-12-21 EST"
class(my_date)
## [1] "POSIXlt" "POSIXt"
Standardized Formats: Of these, the preferred, standard format is “YYYY-MM-DD”, e.g. “2018-12-21”. Why?
my_dates <- c("October 01, 2018", "December 01, 2019", "November 01, 2018", "September 01, 2019")
print(my_dates)
## [1] "October 01, 2018" "December 01, 2019" "November 01, 2018"
## [4] "September 01, 2019"
sort(my_dates)
## [1] "December 01, 2019" "November 01, 2018" "October 01, 2018"
## [4] "September 01, 2019"
Note that without converting to class “POSIXlt”, these dates are arranged naively in alphabetical order.
Another Example: Here, we only use numbers to identify dates. Observe:
my_dates <- c("10/01/18", "12/01/19", "11/01/18", "09/01/19")
print(my_dates)
## [1] "10/01/18" "12/01/19" "11/01/18" "09/01/19"
sort(my_dates)
## [1] "09/01/19" "10/01/18" "11/01/18" "12/01/19"
These are also sorted naively, character by character, and therefore incorrectly (note that 2019 comes both before and after 2018).
Standard Formatting in Action: Here we can see that the standard format sorts correctly even as plain text:
my_dates <- c("2018-10-01", "2019-12-01", "2018-11-01", "2019-09-01")
print(my_dates)
## [1] "2018-10-01" "2019-12-01" "2018-11-01" "2019-09-01"
sort(my_dates)
## [1] "2018-10-01" "2018-11-01" "2019-09-01" "2019-12-01"
Even if arranged naively and not coerced to “POSIXlt” class, these dates will invariably sort correctly.
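If the dates arrive in a non-standard layout, converting them to a date class before sorting fixes the order. The following is a rough sketch in Base R, reusing the two-digit-year dates from above; the object names parsed and sorted are made up for illustration, and the format codes are explained further below.

```r
# Two-digit-year dates from the earlier example, stored as plain text
my_dates <- c("10/01/18", "12/01/19", "11/01/18", "09/01/19")

# Tell R how to read them: month/day/two-digit year
parsed <- as.Date(my_dates, format = "%m/%d/%y")

# Sorting Date values is chronological, not alphabetical
sorted <- sort(parsed)
print(sorted)

# Convert back to the original text layout if needed
format(sorted, "%m/%d/%y")
```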
Formatting in Base R: It’s not a bad idea to know how this is done, but it’s a pain in the rear.
- Package lubridate makes this easier
- %Y is a 4-digit year
- %y is a 2-digit year
- %m is a 2-digit month
- %d is a 2-digit day of the month
- %A is the weekday, e.g. “Wednesday”
- %B is the month, e.g. “February”
- %b is the abbreviated month, e.g. “Feb”
- %H is hours on the 24-hour clock as a decimal number (00-23)
- %I is hours on the 12-hour (AM/PM) clock as a decimal number (01-12)
- %M is minutes as a decimal number
- %S is seconds as a decimal number
- %T is shorthand for standard format: %H:%M:%S
- %p is the AM/PM indicator
The standard format for timestamps is “HH:MM:SS”, i.e. hours, minutes, and seconds.
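The same codes work in both directions: for reading text in, and for printing a datetime out with format(). Here is a small sketch of the printing side, assuming a made-up datetime x in the UTC time zone.

```r
# A fixed evening datetime (hypothetical, for illustration)
x <- as.POSIXct("2018-12-21 20:30:00", tz = "UTC")

date_iso   <- format(x, "%Y-%m-%d")  # 4-digit year, month, day
date_short <- format(x, "%m/%d/%y")  # month, day, 2-digit year
time_24    <- format(x, "%H:%M:%S")  # 24-hour clock
time_T     <- format(x, "%T")        # shorthand for %H:%M:%S
hour_12    <- format(x, "%I")        # hour on the 12-hour clock

print(c(date_iso, date_short, time_24, time_T, hour_12))

# Names and AM/PM text depend on the language settings (locale) of your session
format(x, "%A, %B %d, %Y")
format(x, "%I:%M %p")
```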
Package lubridate allows us to easily parse just about any date with simple, intuitive functions:
Formatting datetimes, as mentioned, is a huge pain. Here, we’ll try to format “12/21/18 08:30:00 AM”:
my_datetime <- "12/21/18 08:30:00 AM"
print(my_datetime)
## [1] "12/21/18 08:30:00 AM"
as.POSIXlt(my_datetime)
## Error in as.POSIXlt.character(my_datetime): character string is not in a standard unambiguous format
Here, we get an error, since the input data isn’t in “standard unambiguous format”.
Formatting in Base R: We have to spell out the format of our datetime manually with the format = argument.
my_datetime <- "12/21/18 08:30:00 AM"
print(my_datetime)
## [1] "12/21/18 08:30:00 AM"
as.POSIXlt(x = my_datetime,
           format = "%m/%d/%y %I:%M:%S %p")  # %I is the 12-hour clock, which pairs with %p
## [1] "2018-12-21 08:30:00 EST"
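The AM/PM indicator only matters once we hit an afternoon time. R’s documentation for strptime() describes %p as used together with %I (the 12-hour clock) rather than %H. Below is a minimal sketch with a made-up PM timestamp; it assumes an English-language locale, since the AM/PM text is locale-dependent, and sets tz = "UTC" only to keep the result reproducible.

```r
# Same layout as before, but in the evening (hypothetical value)
pm_string <- "12/21/18 08:30:00 PM"

# %I reads the 12-hour clock and %p reads the AM/PM indicator
parsed <- as.POSIXlt(pm_string, format = "%m/%d/%y %I:%M:%S %p", tz = "UTC")
print(parsed)

# The hour is stored on the 24-hour clock
parsed$hour

# Text that does not match the format returns NA rather than an error
bad <- as.POSIXlt("21-12-2018", format = "%m/%d/%y %I:%M:%S %p", tz = "UTC")
is.na(bad)
```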
Package lubridate is able to detect dates and times in “unambiguous” format with very little specification.
First, let’s install and load lubridate:
if(!require(lubridate)){install.packages("lubridate")}
library(lubridate)
Formatting the Lubridate Way: We’ll take the same datetime and format it with function mdy_hms().
my_datetime <- "12/21/18 08:30:00 AM"
print(my_datetime)
## [1] "12/21/18 08:30:00 AM"
mdy_hms(my_datetime)
## [1] "2018-12-21 08:30:00 UTC"
Way, way, way easier! By using mdy_hms(), we told R that the format approximates “MM-DD-YYYY HH:MM:SS”.
Other Formats: We can do this with a variety of datetime formats. Behold:
my_datetime <- "12/21/2018 08:30:00 AM"
mdy_hms(my_datetime)
## [1] "2018-12-21 08:30:00 UTC"
my_datetime <- "December 21, 2018 8:30:00"
mdy_hms(my_datetime)
## [1] "2018-12-21 08:30:00 UTC"
my_datetime <- "12/21/18 8:30:00"
mdy_hms(my_datetime)
## [1] "2018-12-21 08:30:00 UTC"
my_datetime <- "Dec. 21, 2018 8:30:00"
mdy_hms(my_datetime)
## [1] "2018-12-21 08:30:00 UTC"
my_datetime <- "Dec 21 18 8:30:00"
mdy_hms(my_datetime)
## [1] "2018-12-21 08:30:00 UTC"
my_datetime <- "122118 083000"
mdy_hms(my_datetime)
## [1] "2018-12-21 08:30:00 UTC"
We can get increasingly lazy with our dates, and lubridate still parses every one of these.
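Two details are worth a quick sketch: the outputs above are in “UTC” because that is the default of the tz = argument, and text that is not a date at all comes back as NA. The values below are made up for illustration, and the time zone name "America/New_York" is an assumption about where you might be working.

```r
library(lubridate)

# tz = sets the time zone the clock time is read in (the default is "UTC")
utc_time <- mdy_hms("12/21/18 08:30:00")
ny_time  <- mdy_hms("12/21/18 08:30:00", tz = "America/New_York")

# Same clock time, different instants: New York is 5 hours behind UTC in December
hours_apart <- as.numeric(difftime(ny_time, utc_time, units = "hours"))
print(hours_apart)

# Input that cannot be read as month-day-year becomes NA;
# quiet = TRUE suppresses the "failed to parse" warning
mixed <- mdy_hms(c("12/21/18 08:30:00", "not a date"), quiet = TRUE)
is.na(mixed)
```

Checking for NA after parsing is a cheap way to catch rows that did not match the expected order.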
Similar Functions: Other lubridate functions for formatting non-standard, unambiguous datetimes include:
- mdy_hm()
- mdy_h()
- mdy()
- dmy_hms()
- dmy_hm()
- dmy_h()
- dmy()
- ydm_hms()
- ydm_hm()
- ydm_h()
- ydm()
- ymd_hms()
- ymd_hm()
- ymd_h()
- ymd()
Just remember:
- y is “year”
- m is “month”
- d is “day”
- h is “hour”
- m is “minute”
- s is “second”
And you’re good to go!
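The order of the letters matters, because the same text can mean two different days. A short sketch, using a made-up string:

```r
library(lubridate)

# Is this January 2 or February 1? The text alone cannot tell us.
ambiguous <- "01/02/2018"

as_month_first <- mdy(ambiguous)  # month, then day
as_day_first   <- dmy(ambiguous)  # day, then month

print(as_month_first)
print(as_day_first)
```

Pick the function that matches how the data were recorded, not the one that happens to run without a warning.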
Package lubridate is replete with date-, time-, and datetime-related functions. Here are a few cool ones:
Dates: date() and as_date() are less intimidating and more human-readable ways to create class “Date” variables.
my_date <- "2018-12-21"
class(my_date)
## [1] "character"
my_date <- as_date(my_date)
class(my_date)
## [1] "Date"
Extracting Units of Time: You can extract a unit of time from a datetime object, including:
- year()
- month()
- week()
- day()
- weekdays()
- hour()
- minute()
- second()
- am()
- pm()
Observe:
my_datetime <- mdy_hms("December 21, 2018 08:30:00 AM")
year(my_datetime)
## [1] 2018
month(my_datetime)
## [1] 12
week(my_datetime)
## [1] 51
day(my_datetime)
## [1] 21
weekdays(my_datetime)
## [1] "Friday"
hour(my_datetime)
## [1] 8
minute(my_datetime)
## [1] 30
second(my_datetime)
## [1] 0
am(my_datetime)
## [1] TRUE
pm(my_datetime)
## [1] FALSE
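These functions are vectorized, so they work on a whole column at once. Here is a rough sketch with three made-up timestamps (the object name stamps is hypothetical):

```r
library(lubridate)

# Three hypothetical timestamps
stamps <- ymd_hms(c("2018-04-03 09:15:00",
                    "2018-04-19 17:40:00",
                    "2018-05-11 23:05:00"))

month(stamps)  # one month number per timestamp
hour(stamps)   # one hour per timestamp
pm(stamps)     # TRUE for afternoon and evening times

# wday() returns the day of the week as a number;
# which day counts as 1 depends on lubridate's week start setting
wday(stamps)

# Extracted units are easy to tabulate
table(month(stamps))
```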
Timestamping: Print current date and time with now(), similar to Sys.time() but easier:
now()
## [1] "2018-12-20 16:19:49 EST"
Determine Differences in Times: Find time and date differences using simple arithmetic:
session_start <- mdy_hms("December 21, 2018 08:30:00 AM")
current_time <- now()
session_start - current_time
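Since now() changes every time the code runs, here is a reproducible sketch with two fixed, made-up times instead. It also shows how to choose the units of the result with difftime() rather than letting R pick them.

```r
library(lubridate)

# Hypothetical start and end of a session
session_start <- ymd_hms("2018-12-21 08:30:00")
session_end   <- ymd_hms("2018-12-21 10:15:00")

# Simple subtraction; R picks the units automatically
elapsed <- session_end - session_start
print(elapsed)

# difftime() lets us pick the units ourselves
elapsed_mins <- difftime(session_end, session_start, units = "mins")
print(elapsed_mins)

# as.numeric() drops the units label and leaves a plain number
mins_number  <- as.numeric(elapsed_mins)
hours_number <- as.numeric(elapsed, units = "hours")
print(c(mins_number, hours_number))
```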
Determine Difference in Dates: Again, since dates and times are numbers under the hood, it’s arithmetic!
birthday <- mdy("August 10, 1989")
age <- today() - birthday
print(age)
## Time difference of 10724 days
class(age)
## [1] "difftime"
Storing this creates a new variable class: “difftime”.
- See also function difftime()
Round Dates: You can make dates uniform using round_date(), floor_date(), and ceiling_date().
crimes <- data.frame(crime = c("Larceny", "Larceny", "Arson"),
                     date = c("2018-04-03", "2018-04-19", "2018-05-11"))
crimes$date <- ymd(crimes$date)
print(crimes)
## crime date
## 1 Larceny 2018-04-03
## 2 Larceny 2018-04-19
## 3 Arson 2018-05-11
crimes$date <- floor_date(x = crimes$date, unit = "month")
print(crimes)
## crime date
## 1 Larceny 2018-04-01
## 2 Larceny 2018-04-01
## 3 Arson 2018-05-01
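The three rounding functions differ only in direction. The sketch below compares them on one of the dates above and then counts events per month with table(); the object names are made up for illustration.

```r
library(lubridate)

d <- ymd("2018-04-19")

down    <- floor_date(d, unit = "month")    # back to the start of the month
up      <- ceiling_date(d, unit = "month")  # forward to the start of the next month
nearest <- round_date(d, unit = "month")    # whichever of the two is closer

print(c(down, up, nearest))

# Once dates are uniform, counting by month is a one-liner
dates  <- ymd(c("2018-04-03", "2018-04-19", "2018-05-11"))
counts <- table(floor_date(dates, unit = "month"))
print(counts)
```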
Instructions: Run the following code to read in housing code violation data from Syracuse Open Data.
- The data are stored in object violations
if(!require(readr)){install.packages("readr")}; library(readr)
url <- "https://opendata.arcgis.com/datasets/fb7233117df1443081541f220327f178_0.csv"
types <- "--c---T---c---c---------"
violations <- read_csv(file = url, col_types = types)
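The types string is readr’s compact column specification: one character per column, where - skips the column, c reads it as character, and T reads it as a datetime. Here is a tiny sketch on invented data (not the Syracuse dataset) to show the idea; it assumes readr 2.0 or later, where I() marks a string as literal data.

```r
library(readr)

# Three made-up columns: an id we skip, a street name, and a timestamp
csv_text <- "id,street,opened\n1,Main St,2018-04-03T09:15:00Z\n2,Oak Ave,2018-05-11T23:05:00Z\n"

# "-cT": skip column 1, read column 2 as character, column 3 as datetime
toy <- read_csv(file = I(csv_text), col_types = "-cT")

names(toy)         # the skipped column is gone
class(toy$opened)  # a datetime class, not character
```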
Challenge 1: Determine the class of variable violation_date. Convert it to class “Date”.
Challenge 2: When was the earliest violation in violations? The latest?
- Use min() and max()
Challenge 3: What is the duration between the earliest and latest violation in violations?
Challenge 4: How long ago was the most recent violation as of today?
- Use now(), max(), and basic arithmetic
Challenge 5: Which month had the most reported violations according to the data?
- Use floor_date() to make instances uniform
- Use table(), only inputting the dataset and variable name with $ notation