Editing text, working with dates

This is an R Markdown document created with examples of how to edit text and work with dates using R. Based on Coursera’s Getting and Cleaning Data.


Editing text variables

You often have to manipulate text. You often feel that need when you load data into R and look at the names of the variables. Take for example the following database (about speeding tickets):

if(!file.exists("./data")){dir.create("./data")}
fileUrl = "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/cameras.csv", method="curl")
cameraData=read.csv("./data/cameras.csv")
head(cameraData)
##                          address direction      street  crossStreet
## 1       S CATON AVE & BENSON AVE       N/B   Caton Ave   Benson Ave
## 2       S CATON AVE & BENSON AVE       S/B   Caton Ave   Benson Ave
## 3 WILKENS AVE & PINE HEIGHTS AVE       E/B Wilkens Ave Pine Heights
## 4        THE ALAMEDA & E 33RD ST       S/B The Alameda      33rd St
## 5        E 33RD ST & THE ALAMEDA       E/B      E 33rd  The Alameda
## 6        ERDMAN AVE & N MACON ST       E/B      Erdman     Macon St
##                 intersection                      Location.1
## 1     Caton Ave & Benson Ave (39.2693779962, -76.6688185297)
## 2     Caton Ave & Benson Ave (39.2693157898, -76.6689698176)
## 3 Wilkens Ave & Pine Heights  (39.2720252302, -76.676960806)
## 4     The Alameda  & 33rd St (39.3285013141, -76.5953545714)
## 5      E 33rd  & The Alameda (39.3283410623, -76.5953594625)
## 6         Erdman  & Macon St (39.3068045671, -76.5593167803)

If we want to extract the names of the data set, it is quite easy:

names(cameraData)
## [1] "address"      "direction"    "street"       "crossStreet" 
## [5] "intersection" "Location.1"

Now there are several things we might want to do, for instance:

  • convert all letters to lower case letters: use function tolower
  • convert all letters to upper case letters: use function toupper
  • when variable names have dots in them, like “Location.1”, we may want to get rid of the dot and what comes after that. For that, we can start by splitting all names of the variables into “what comes before the dot” and “what comes after the dot”:
splitNames = strsplit(names(cameraData),"\\.")
splitNames[[6]]
## [1] "Location" "1"

To get only the part of the names that come before any dots we can use the sapply function as follows

firstElement = function(x){x[1]}
sapply(splitNames,firstElement)
## [1] "address"      "direction"    "street"       "crossStreet" 
## [5] "intersection" "Location"

To illustrate more ways we can handle text, consider the following example:

example=data.frame(var_1=c(1, 2, 3), var_2=c("a", "b", "c"), aux_var_3=c("x", "y", "z"))
example
##   var_1 var_2 aux_var_3
## 1     1     a         x
## 2     2     b         y
## 3     3     c         z

We may want to

  • remove the underscores out of all of the variables:
sub("_", "", names(example))
## [1] "var1"     "var2"     "auxvar_3"

This only substitutes the first instance when the underscore appears. If we want to replace all the underscores:

gsub("_", "", names(example))
## [1] "var1"    "var2"    "auxvar3"
  • find which variables contain the name “aux”, which can be done using grep or grepl:
grep("aux", names(example))
## [1] 3
grep("aux", names(example), value=TRUE)
## [1] "aux_var_3"
grepl("aux", names(example))
## [1] FALSE FALSE  TRUE
  • get only the first three letters of every variable (=getting a substring):
substringing = function(x){substr(x, 1, 3)}
sapply(names(example),substringing)
##     var_1     var_2 aux_var_3 
##     "var"     "var"     "aux"
  • paste two strings together:
paste("a", "b", sep="")
## [1] "ab"

Regular expressions

We need a way to express - whitespace word boundaries - sets of literals - the beginning and end of a line - alternatives (“war” or “peace”)

Let’s take as an example the following list:

example_list=list("i think this is cool!", "i think number 3 is not that ugly...", "no, i think not", "cool? i don't think so", "I am taller than 1m74!", "My weight is 65kg.", "The post code is X89y.")
example_list
## [[1]]
## [1] "i think this is cool!"
## 
## [[2]]
## [1] "i think number 3 is not that ugly..."
## 
## [[3]]
## [1] "no, i think not"
## 
## [[4]]
## [1] "cool? i don't think so"
## 
## [[5]]
## [1] "I am taller than 1m74!"
## 
## [[6]]
## [1] "My weight is 65kg."
## 
## [[7]]
## [1] "The post code is X89y."

Let’s see how to:

  • select all sentences containing “i think”:
grep("i think", example_list)
## [1] 1 2 3
  • select all sentences begginning “i think”:
grep("^i think", example_list)
## [1] 1 2
  • select all sentences ending with “cool”:
grep("cool$", example_list)
## integer(0)
  • select all sentences starting with lower case or upper case “i”
grep("^[iI]", example_list)
## [1] 1 2 5
  • select all sentences that contain a sequence of a number followed by the letter “m” followed by two numbers:
grep("[0-9][m][0-9][0-9]", example_list)
## [1] 5
  • select any sentence containing a post code (any letter, any two numbers, any letter):
grep("[a-zA-Z][0-9][0-9][a-zA-Z]", example_list)
## [1] 7
  • select any sentence that does not end with a question mark:
grep("[^!]$", example_list)
## [1] 2 3 4 6 7

Let’s look at another example to see what we can do:

example_list=list("I am taller than 1m74!", "My height is 1.74m", "1-74=-73, duh", "The revolution was in 1974", "I'm a good guy!", "He's a bad guy!")
example_list
## [[1]]
## [1] "I am taller than 1m74!"
## 
## [[2]]
## [1] "My height is 1.74m"
## 
## [[3]]
## [1] "1-74=-73, duh"
## 
## [[4]]
## [1] "The revolution was in 1974"
## 
## [[5]]
## [1] "I'm a good guy!"
## 
## [[6]]
## [1] "He's a bad guy!"

Let’s try to:

  • select all sentences containing a 1 followed by any character, followed by 74:
grep("1.74", example_list)
## [1] 1 2 3 4
  • and now all sentences containing exactly “1.74” (you have to use \. to represent a dot. Using only . means “any character”)
grep("1\\.74", example_list)
## [1] 2
  • select all sentences either starting with “I” or containing the word bad
grep("^I|bad", example_list)
## [1] 1 5 6
  • select all sentences either starting with “I am” or “I’m”
grep("^(I am|I'm)", example_list)
## [1] 1 5

And now a final example:

example_list=list("It's a bear (not sure if it's grizzly though)", "My number is 987654321. Call me", "This sentence has 5 words", "This sentence has more than 5 words", "I sometimes repeat repeat myself")
example_list
## [[1]]
## [1] "It's a bear (not sure if it's grizzly though)"
## 
## [[2]]
## [1] "My number is 987654321. Call me"
## 
## [[3]]
## [1] "This sentence has 5 words"
## 
## [[4]]
## [1] "This sentence has more than 5 words"
## 
## [[5]]
## [1] "I sometimes repeat repeat myself"

So now let’s:

  • select all sentences that contain something within parenthesis (. means any character and * means repeated any number of times):
grep("\\(.*\\)", example_list)
## [1] 1
  • select all sentences that contain more than 5 numbers in a row ({5, } means repeated at least 5 times… {1, 5} would mean repeated between 1 and 5 times):
grep("[0-9]{5,}", example_list)
## [1] 2
  • select all sentences that contain the word “This” followed by at least 5 words and then followed by “words”. By the way, “+” stands for “at least one”
grep("This( +[^ ]+){5,} words", example_list)
## [1] 4
  • select all sentences that repeat the same word once (consecutively):
grep(" +([a-zA-Z]+) +\\1", example_list)
## [1] 5

Dates

Dates sometimes appear in the data we want to treat and it is important to know how to handle that. First of all, if we want to know the date it is at the moment, we can simply use the Sys.Date function. It returns an object of class date:

Sys.Date()
## [1] "2014-07-17"

Dates can have different formats:

%d = day as number (0-31), %a = abbreviated weekday,%A = unabbreviated weekday, %m = month (00-12), %b = abbreviated month, %B = unabbrevidated month, %y = 2 digit year, %Y = four digit year

So say I want to rewrite the date now in the form dd/mm/yy, then i can do the following:

format(Sys.Date(), "%d/%m/%y")
## [1] "17/07/14"

It’s better to treat date as objects of class date. So if we have just characters, we can coerce them to dates:

x = c("1jan1980", "2jan1980"); 
z = as.Date(x, "%d%b%Y")
z
## [1] "1980-01-01" "1980-01-02"

The cool thing is that now R can compute how many days there are between two dates:

z[2] - z[1]
## Time difference of 1 days
as.numeric(z[2]-z[1])
## [1] 1

We can find out what weekday and month each date corresponds to. The function julian tells us the number of days that have occured between an origin (of the system) and the dates we input:

weekdays(z)
## [1] "Tuesday"   "Wednesday"
months(z)
## [1] "January" "January"
julian(z)
## [1] 3652 3653
## attr(,"origin")
## [1] "1970-01-01"

A nice package is lubridate. It also allows the conversion of numbers to dates:

library(lubridate); 
ymd("20140108")
## [1] "2014-01-08 UTC"
mdy("08/04/2013")
## [1] "2013-08-04 UTC"
dmy("03-04-2013")
## [1] "2013-04-03 UTC"
ymd_hms("2011-08-03 10:15:03")
## [1] "2011-08-03 10:15:03 UTC"
ymd_hms("2011-08-03 10:15:03",tz="UTC")
## [1] "2011-08-03 10:15:03 UTC"