Latest Versions & Updates: This markdown document was built using the following versions of R and RStudio:

  • R v. 3.5.1
  • RStudio v. 1.1.456
  • Document v. 1.0
  • Last Updated: 2018-12-20

1 Review: Text Data

The following reviews some important concepts for working with text data, as seen in Intro to R: Text Data:

Classes: Most variables are of a particular class, e.g. “numeric”, “integer”, “character”, “logical”, etc.

  • Determine the class of a single variable using function class()
  • Determine multiple variable classes using function str() on a data frame
  • Confirm class via TRUE or FALSE with is.*() functions, e.g. is.logical()
  • Coerce variables to a different class with as.*() functions, e.g. as.character()
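
The four tools above can be sketched as follows. This is a minimal illustration; the vector `ages` and data frame `people` are made-up examples, not data from the original lesson.

```r
# Hypothetical example data: numbers that were read in as text
ages <- c("34", "27", "51")

class(ages)                    # class of a single variable
is.numeric(ages)               # confirm a class with an is.*() function
ages_num <- as.numeric(ages)   # coerce with an as.*() function
class(ages_num)

# str() reports the class of every column in a data frame
people <- data.frame(name = c("Ana", "Ben", "Cy"),
                     age = ages_num,
                     stringsAsFactors = FALSE)
str(people)
```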

Character & Factor Classes: The most common types of qualitative variables are of class character and factor.

  • Character data are comprised of sequences of characters wrapped in quotes ("")
  • Factors are defined explicitly as categorical, nominal, or discrete variables (all synonymous)
  • Coerce variables to class “character” using as.character(), confirm with is.character()
  • Coerce variables to class “factor” using as.factor(), confirm with is.factor()
  • Create new factor variables using function factor()
  • Factors are critical in visualizing and modeling data, and their categories are known as levels
  • Factors may be ordinal (ordered), made possible with the levels = and ordered = arguments of factor()
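
As a small sketch of an ordered factor, assuming a made-up vector of shirt sizes (the names `sizes` and `sizes_fct` are illustrative only):

```r
# Hypothetical example data: sizes stored as plain text
sizes <- c("small", "large", "medium", "small")

# levels = sets the order of the categories; ordered = TRUE makes it ordinal
sizes_fct <- factor(sizes,
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)

levels(sizes_fct)      # categories, in the order we specified
is.ordered(sizes_fct)  # TRUE
sizes_fct < "large"    # comparisons respect the level order
```

Without levels =, R would order the categories alphabetically ("large", "medium", "small"), which is rarely what an ordinal variable needs.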

Fundamentals of Text: There are several easy-to-use Base R functions for “character” data.

  • Combine strings of text with function paste() and specify delimiting characters with argument sep =
  • Coerce and format numeric data with functions format() and formatC()
    • Package scales has additional formatting options, e.g. dollar()
  • Print “character” data without quotes or position identifiers using function writeLines()
    • Print the same unquoted data with position identifiers using function noquote()
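
A brief sketch of these Base R functions side by side. The values are invented for illustration, and only Base R is used here (package scales is left out).

```r
# Combine strings with a custom delimiter
label <- paste("Total", "cost", sep = ": ")

# Format numbers as text
padded  <- formatC(5, width = 3, flag = "0")            # zero-padded to width 3
rounded <- formatC(3.14159, digits = 2, format = "f")   # fixed to 2 decimal places

print(label)           # quoted, with position identifier [1]
writeLines(label)      # no quotes, no position identifier
print(noquote(label))  # no quotes, but keeps position identifier [1]
```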

Package “stringr” Functions: The “stringr” package has an intuitive, unified framework common to the Tidyverse.

  • Many critical “stringr” functions have a Base R counterpart but are easier to use
  • Paste strings of text together using str_c()
  • Determine the number of characters in a string using str_length()
  • Extract substrings using str_sub()
  • Return TRUE and FALSE values for pattern matches with str_detect()
  • Return the original values that contain a pattern match using str_subset()
  • Return the frequency of pattern matches using str_count()
  • Split strings into substrings with a specified delimiter using str_split()
  • Find and replace specified patterns using str_replace() and str_replace_all()
  • Trim whitespace on either or both sides of values using str_trim()
  • “Pad” values with specified characters, e.g. leading zeroes, using str_pad()
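
The functions listed above might be exercised like this. It is a quick sketch on a made-up vector (`fruits` is an illustrative name), assuming package stringr is already installed.

```r
library(stringr)

# Hypothetical example data
fruits <- c("apple", "banana", "cherry")

str_c("I like ", fruits)            # paste strings together
str_length(fruits)                  # 5 6 6
str_sub(fruits, start = 1, end = 3) # "app" "ban" "che"
str_detect(fruits, "an")            # FALSE TRUE FALSE
str_subset(fruits, "an")            # "banana"
str_count(fruits, "a")              # 1 3 0
str_replace_all(fruits, "a", "A")   # "Apple" "bAnAnA" "cherry"
str_trim("  hello  ")               # "hello"
str_pad("7", width = 3, side = "left", pad = "0")  # "007"
```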


2 Dates & Times in R

Overview: We can easily create strings of text to represent dates and times.

  • The character string “12/21/2018” may be human-readable, but it is not machine-readable
  • One must coerce the class of these strings to be recognizable by R as dates or times
  • Class “POSIX” (Portable Operating System Interface for Unix) is specifically for dates and datetimes
  • Under the hood, class “POSIX” variables are stored as numbers, but format to be human-readable
    • These numbers are either individual date and time components (“POSIXlt”) or total seconds since Thursday, January 1, 1970 (“POSIXct”)
    • If a positive number, the date/datetime is after January 1, 1970
    • If a negative number, the date/datetime is before January 1, 1970
    • Seemingly arbitrary, this date actually derives from “Unix Time”, a.k.a. “The Epoch”
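
To make the "numbers under the hood" idea concrete, here is a small sketch in Base R. The time zone is fixed to UTC (an assumption made so the numbers are the same on every machine).

```r
# The Epoch itself is stored as zero seconds
epoch <- as.POSIXct("1970-01-01 00:00:00", tz = "UTC")
as.numeric(epoch)          # 0

# One day after the Epoch: a positive number of seconds
day_after <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
as.numeric(day_after)      # 86400

# One day before the Epoch: a negative number of seconds
day_before <- as.POSIXct("1969-12-31 00:00:00", tz = "UTC")
as.numeric(day_before)     # -86400
```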


2.1 Class “POSIXlt” & “POSIXct”

POSIXlt stores datetimes as a list of named components: seconds, minutes, hours, day of the month, month (counted from 0), year (counted from 1900), and so on.

Here, we’ll get the current date and time using function Sys.time():

Sys.time()
## [1] "2018-12-20 16:19:49 EST"
my_date_lt <- as.POSIXlt(Sys.time())
unclass(my_date_lt)
## $sec
## [1] 49.21545
## 
## $min
## [1] 19
## 
## $hour
## [1] 16
## 
## $mday
## [1] 20
## 
## $mon
## [1] 11
## 
## $year
## [1] 118
## 
## $wday
## [1] 4
## 
## $yday
## [1] 353
## 
## $isdst
## [1] 0
## 
## $zone
## [1] "EST"
## 
## $gmtoff
## [1] -18000
## 
## attr(,"tzone")
## [1] ""    "EST" "EDT"


POSIXct only stores datetimes in seconds relative to January 1, 1970.

Sys.time()
## [1] "2018-12-20 16:19:49 EST"
my_date_ct <- as.POSIXct(Sys.time())
unclass(my_date_ct)
## [1] 1545340789
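
Because Sys.time() changes on every run, a fixed timestamp can make the two classes easier to compare. The sketch below assumes a UTC time zone and an illustrative variable name (`fixed`).

```r
# A fixed datetime, stored as a list of components ("POSIXlt")
fixed <- as.POSIXlt("2018-12-20 16:19:49", tz = "UTC")

fixed$year + 1900   # years are counted from 1900, so add 1900 -> 2018
fixed$mon + 1       # months are counted from 0, so add 1 -> 12
fixed$mday          # day of the month -> 20

# The same datetime as a single count of seconds ("POSIXct")
as.numeric(as.POSIXct(fixed))   # 1545322789
```

Class “POSIXct” is usually the more convenient choice for columns in a data frame, since each value is a single number; “POSIXlt” is handy when you want to pull out individual components.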


2.1.1 Formatting “POSIXlt” & “POSIXct”

“POSIXlt” values may be formatted in a variety of ways, e.g.:

  • “December 21, 2018”
  • “21 December 2018”
  • “2018-12-21”
  • “12/21/18”
  • “12-21-2018”


Coercing to “POSIXlt”: We can coerce variables to “POSIXlt” class using function as.POSIXlt(), e.g.:

my_date <- "2018-12-21"
print(my_date)
## [1] "2018-12-21"
class(my_date)
## [1] "character"
my_date <- as.POSIXlt("2018-12-21")
print(my_date)
## [1] "2018-12-21 EST"
class(my_date)
## [1] "POSIXlt" "POSIXt"


Standardized Formats: Of these ways, the preferable and standard formatting is “YYYY-MM-DD”, e.g. “2018-12-21”. Why?

my_dates <- c("October 01, 2018", "December 01, 2019", "November 01, 2018", "September 01, 2019")
print(my_dates)
## [1] "October 01, 2018"   "December 01, 2019"  "November 01, 2018" 
## [4] "September 01, 2019"
sort(my_dates)
## [1] "December 01, 2019"  "November 01, 2018"  "October 01, 2018"  
## [4] "September 01, 2019"

Note that without converting to class “POSIXlt”, these dates are arranged naively in alphabetical order.


Another Example: Here, we only use numbers to identify dates. Observe:

my_dates <- c("10/01/18", "12/01/19", "11/01/18", "09/01/19")
print(my_dates)
## [1] "10/01/18" "12/01/19" "11/01/18" "09/01/19"
sort(my_dates)
## [1] "09/01/19" "10/01/18" "11/01/18" "12/01/19"

These are also arranged naively, and incorrectly: the strings are compared character by character, so the month alone decides the order (note that 2019 comes both before and after 2018).
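
For contrast, here is a sketch of what happens once the same strings are parsed into real dates. It uses Base R function as.Date() with an explicit format = so that R knows the strings are month/day/2-digit year.

```r
my_dates <- c("10/01/18", "12/01/19", "11/01/18", "09/01/19")

# Tell R how the text is laid out, then sort the parsed dates
parsed <- as.Date(my_dates, format = "%m/%d/%y")
sort(parsed)
## [1] "2018-10-01" "2018-11-01" "2019-09-01" "2019-12-01"
```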


Standard Formatting in Action: Here we can see how the standard format is impossible to disarrange:

my_dates <- c("2018-10-01", "2019-12-01", "2018-11-01", "2019-09-01")
print(my_dates)
## [1] "2018-10-01" "2019-12-01" "2018-11-01" "2019-09-01"
sort(my_dates)
## [1] "2018-10-01" "2018-11-01" "2019-09-01" "2019-12-01"

Even if arranged naively and not coerced to “POSIXlt” class, these dates will invariably sort correctly.

  • This is because the units of measurement have a hierarchy of unit magnitude
  • Hence why much of the rest of the world formats dates as, e.g. “21 December 2018”
    • Despite the opposite direction, these follow a hierarchy of unit magnitude
    • It’s like the metric system in that it actually makes sense


Formatting in Base R: It’s not a bad idea to know how this is done, but it’s a pain in the rear.

  • Therefore, we’ll learn how to format dates using the Tidyverse package lubridate
  • In Base R, you have to manually identify the elements of non-standard dates to properly parse them:
    • %Y is a 4-digit year
    • %y is a 2-digit year
    • %m is a 2-digit month
    • %d is a 2-digit day of the month
    • %A is the weekday, e.g. “Wednesday”
    • %B is the month, e.g. “February”
    • %b is the abbreviated month, e.g. “Feb”
  • What’s more, in Base R, you have to manually identify elements of non-standard timestamps:
    • %H is hours as a decimal number
    • %I is hours in AM/PM format as a decimal number
    • %M is minutes as a decimal number
    • %S is seconds as a decimal number
    • %T is shorthand for standard format: %H:%M:%S
    • %p is the AM/PM indicator

The standard format for timestamps is “HH:MM:SS”, i.e. hours, minutes, and seconds.

  • The same logic applies: Decreasing order of units of magnitude
  • Therefore, datetime objects are standardized as “YYYY-MM-DD HH:MM:SS”
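
The same codes work in both directions: to parse text into a datetime and to print a datetime back out as text. A minimal sketch, assuming a made-up day-first timestamp and a UTC time zone:

```r
# Parse: describe the layout of the incoming text
stamp <- as.POSIXct("21/12/2018 14:05:09",
                    format = "%d/%m/%Y %H:%M:%S",
                    tz = "UTC")

# Print: describe the layout you want back
format(stamp, "%Y-%m-%d %T")   # "2018-12-21 14:05:09"
format(stamp, "%y%m%d")        # "181221"
format(stamp, "%I:%M")         # "02:05"
```

Codes that print names, such as %A and %B, depend on your system's language settings, so they are left out of this sketch.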


2.2 Package “lubridate”

Package lubridate allows us to easily parse just about any date with simple, intuitive functions:


2.2.1 Ways of Old

Formatting datetimes, as mentioned, is a huge pain. Here, we’ll try to format “12/21/18 08:30:00 AM”:

my_datetime <- "12/21/18 08:30:00 AM"
print(my_datetime)
## [1] "12/21/18 08:30:00 AM"
as.POSIXlt(my_datetime)
## Error in as.POSIXlt.character(my_datetime): character string is not in a standard unambiguous format

Here, we get an error, since the input data isn’t in “standard unambiguous format”.


Formatting in Base R: We have to manually specify the format of our datetimes, since they aren’t in “standard unambiguous format”.

  • In other words, we have to tell R exactly how the date and time are formatted
  • Note that the AM/PM indicator %p must be paired with the 12-hour %I, not the 24-hour %H

my_datetime <- "12/21/18 08:30:00 AM"
print(my_datetime)
## [1] "12/21/18 08:30:00 AM"
as.POSIXlt(x = my_datetime, 
           format = "%m/%d/%y %I:%M:%S %p")
## [1] "2018-12-21 08:30:00 EST"
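
The choice of hour code matters most for afternoon times. As a hedged sketch (the string `evening` is a made-up example, UTC is assumed, and the AM/PM text is recognized according to your system's language settings, here assumed to be English):

```r
evening <- "12/21/18 08:30:00 PM"

# %I reads the hour on a 12-hour clock, and %p shifts it to the afternoon
parsed_pm <- as.POSIXct(evening,
                        format = "%m/%d/%y %I:%M:%S %p",
                        tz = "UTC")
format(parsed_pm, "%Y-%m-%d %H:%M:%S")   # "2018-12-21 20:30:00"
```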


2.2.2 Making Things Easier

Package lubridate is able to detect dates and times in “unambiguous” format with very little specification.

First, let’s install and load lubridate:

if(!require(lubridate)){install.packages("lubridate")}
library(lubridate)


Formatting the Lubridate Way: We’ll take the same datetime and format it with function mdy_hms().

my_datetime <- "12/21/18 08:30:00 AM"
print(my_datetime)
## [1] "12/21/18 08:30:00 AM"
mdy_hms(my_datetime)
## [1] "2018-12-21 08:30:00 UTC"

Way, way, way easier! By using mdy_hms(), we told R that the format approximates “MM-DD-YYYY HH:MM:SS”.
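
One difference from the Base R result is the time zone: the lubridate parsers return UTC unless told otherwise. A small sketch using argument tz = (the zone "America/New_York" is an assumption chosen to match the EST output seen earlier):

```r
library(lubridate)

utc_time <- mdy_hms("12/21/18 08:30:00 AM")
ny_time  <- mdy_hms("12/21/18 08:30:00 AM", tz = "America/New_York")

tz(utc_time)         # "UTC"
tz(ny_time)          # "America/New_York"

# Same clock time, different instants: 08:30 in New York is 13:30 UTC
ny_time - utc_time   # Time difference of 5 hours
```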


Other Formats: We can do this with a variety of datetime formats. Behold:

my_datetime <- "12/21/2018 08:30:00 AM"
mdy_hms(my_datetime)
## [1] "2018-12-21 08:30:00 UTC"
my_datetime <- "December 21, 2018 8:30:00"
mdy_hms(my_datetime)
## [1] "2018-12-21 08:30:00 UTC"
my_datetime <-  "12/21/18 8:30:00"
mdy_hms(my_datetime)
## [1] "2018-12-21 08:30:00 UTC"
my_datetime <- "Dec. 21, 2018 8:30:00"
mdy_hms(my_datetime)
## [1] "2018-12-21 08:30:00 UTC"
my_datetime <- "Dec 21 18 8:30:00"
mdy_hms(my_datetime)
## [1] "2018-12-21 08:30:00 UTC"
my_datetime <- "122118 083000"
mdy_hms(my_datetime)
## [1] "2018-12-21 08:30:00 UTC"

We can get increasingly lazy with our dates, and lubridate still gets the job done.


Similar Functions: Other lubridate functions for formatting non-standard, unambiguous datetimes include:

  • mdy_hm()
  • mdy_h()
  • mdy()
  • dmy_hms()
  • dmy_hm()
  • dmy_h()
  • dmy()
  • ydm_hms()
  • ydm_hm()
  • ydm_h()
  • ydm()
  • ymd_hms()
  • ymd_hm()
  • ymd_h()
  • ymd()

Just remember (m means “month” in the date half of the function name and “minute” in the time half):

  • y is “year”
  • m is “month”
  • d is “day”
  • h is “hour”
  • m is “minute”
  • s is “second”

And you’re good to go!
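
Choosing the right function still matters, because lubridate trusts the order you give it. A short sketch with a made-up string that is valid under two different orders, plus one that cannot be parsed at all:

```r
library(lubridate)

ambiguous <- "01/02/2018"

mdy(ambiguous)   # read as month-day-year: "2018-01-02"
dmy(ambiguous)   # read as day-month-year: "2018-02-01"

# Text that matches no date returns NA (with a warning, silenced here)
suppressWarnings(mdy("not a date"))
```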


2.2.3 Other Cool Functions

Package lubridate is replete with date-, time-, and datetime-related functions. Here are a few cool ones:


Dates: date() and as_date() are less intimidating and more human-readable ways to create date variables.

  • Enter a new class, class “Date”, which stores only the calendar date (as days since January 1, 1970) rather than a full “POSIXlt” or “POSIXct” datetime

my_date <- "2018-12-21"
class(my_date)
## [1] "character"
my_date <- as_date(my_date)
class(my_date)
## [1] "Date"
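
A quick sketch of what class “Date” holds under the hood, and of how as_date() drops the time of day from a datetime (the variable `stamp` is an illustrative name):

```r
library(lubridate)

# Dates are stored as whole days since January 1, 1970
as.numeric(as_date("1970-01-02"))   # 1
as.numeric(as_date("2018-12-21"))   # 17886

# Converting a datetime to a date discards the time of day
stamp <- ymd_hms("2018-12-21 08:30:00")
as_date(stamp)                      # "2018-12-21"
class(as_date(stamp))               # "Date"
```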


Extracting Units of Time: You can extract a unit of time from a datetime object, including:

  • year()
  • month()
  • week()
  • day()
  • weekdays() (a Base R function; the lubridate counterpart is wday())
  • hour()
  • minute()
  • second()
  • am()
  • pm()

Observe:

my_datetime <- mdy_hms("December 21, 2018 08:30:00 AM")
year(my_datetime)
## [1] 2018
month(my_datetime)
## [1] 12
week(my_datetime)
## [1] 51
day(my_datetime)
## [1] 21
weekdays(my_datetime)
## [1] "Friday"
hour(my_datetime)
## [1] 8
minute(my_datetime)
## [1] 30
second(my_datetime)
## [1] 0
am(my_datetime)
## [1] TRUE
pm(my_datetime)
## [1] FALSE
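
A few more extractors from lubridate follow the same pattern. This sketch assumes the package's default settings, under which wday() counts Sunday as day 1.

```r
library(lubridate)

my_datetime <- mdy_hms("December 21, 2018 08:30:00 AM")

wday(my_datetime)      # day of the week as a number: 6 (Friday)
yday(my_datetime)      # day of the year: 355
quarter(my_datetime)   # quarter of the year: 4
```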


Timestamping: Print current date and time with now(), similar to Sys.time() but easier:

now()
## [1] "2018-12-20 16:19:49 EST"


Determine Differences in Times: Find time and date differences using simple arithmetic:

session_start <- mdy_hms("December 21, 2018 08:30:00 AM")
current_time <- now()

session_start - current_time
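
Since now() changes on every run, the result above will differ each time. For a reproducible sketch, here are two fixed timestamps instead (`session_end` is a hypothetical value added for illustration):

```r
library(lubridate)

session_start <- ymd_hms("2018-12-21 08:30:00")
session_end   <- ymd_hms("2018-12-21 17:00:00")   # hypothetical end time

elapsed <- session_end - session_start
print(elapsed)                        # Time difference of 8.5 hours

# Pull out a plain number in the units of your choice
as.numeric(elapsed, units = "mins")   # 510
```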


Determine Difference in Dates: Again, since dates and times are numbers under the hood, it’s arithmetic!

birthday <- mdy("August 10, 1989")
age <- today() - birthday

print(age)
## Time difference of 10724 days
class(age)
## [1] "difftime"

Storing this creates a new variable class: “difftime”.

  • Class “difftime” objects may be converted to different units with the units = argument of Base R function difftime() or by assigning to units()
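
A minimal sketch of both approaches in Base R, using a fixed "as of" date (an assumption, so that the numbers do not change from day to day):

```r
birthday <- as.Date("1989-08-10")
as_of    <- as.Date("2018-12-20")   # fixed stand-in for today()

age <- as_of - birthday
print(age)                          # Time difference of 10724 days

# Option 1: change the units of an existing difftime
age_weeks <- age
units(age_weeks) <- "weeks"
print(age_weeks)                    # Time difference of 1532 weeks

# Option 2: request the units when computing the difference
difftime(as_of, birthday, units = "weeks")

# Approximate age in years (difftime has no "years" unit)
as.numeric(age, units = "days") / 365.25
```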


Round Dates: You can make dates uniform using round_date(), floor_date(), and ceiling_date().

  • You must specify the unit of time to which each function rounds
  • Particularly useful for, e.g., grouping number of crimes by month

crimes <- data.frame(crime = c("Larceny", "Larceny", "Arson"),
                     date = c("2018-04-03", "2018-04-19", "2018-05-11"))
crimes$date <- ymd(crimes$date)
print(crimes)
##     crime       date
## 1 Larceny 2018-04-03
## 2 Larceny 2018-04-19
## 3   Arson 2018-05-11
crimes$date <- floor_date(x = crimes$date, unit = "month")
print(crimes)
##     crime       date
## 1 Larceny 2018-04-01
## 2 Larceny 2018-04-01
## 3   Arson 2018-05-01
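
To compare the three rounding functions, and to show how rounded dates can then be counted, here is a short sketch on the same three dates (the vector name `dates` is illustrative):

```r
library(lubridate)

dates <- ymd(c("2018-04-03", "2018-04-19", "2018-05-11"))

floor_date(dates, unit = "month")     # always rounds down to the 1st of the month
ceiling_date(dates, unit = "month")   # always rounds up to the 1st of the next month
round_date(dates, unit = "month")     # rounds to whichever boundary is nearer

# Once dates are uniform, table() counts the records per month
by_month <- table(floor_date(dates, unit = "month"))
print(by_month)
```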


3 Applied Practice

Instructions: Run the following code to read in housing code violation data from Syracuse Open Data.

  • If you’d like, check out the documentation here
  • The dataframe will be made available in your local environment: violations

if(!require(readr)){install.packages("readr")}; library(readr)

url <- "https://opendata.arcgis.com/datasets/fb7233117df1443081541f220327f178_0.csv"
types <- "--c---T---c---c---------"
violations <- read_csv(file = url, col_types = types)


Challenge 1: Determine the class of variable violation_date. Convert it to class “Date”.


Challenge 2: When was the earliest violation in violations? The latest?

  • Hint: Consider using functions min() and max()


Challenge 3: What is the duration between the earliest and latest violation in violations?


Challenge 4: How long ago was the most recent violation as of today?

  • Hint: Consider using functions now(), max(), and basic arithmetic


Challenge 5: Which month had the most reported violations according to the data?

  • Hint #1: Consider using floor_date() to make instances uniform
  • Hint #2: Consider using table(), only inputting the dataset and variable name with $ notation