This is an R Markdown document created with examples of how to edit text and work with dates using R. Based on Coursera’s Getting and Cleaning Data.
You often have to manipulate text. You often feel that need when you load data into R and look at the names of the variables. Take for example the following database (about speeding tickets):
if(!file.exists("./data")){dir.create("./data")}
fileUrl = "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl,destfile="./data/cameras.csv", method="curl")
cameraData=read.csv("./data/cameras.csv")
head(cameraData)
## address direction street crossStreet
## 1 S CATON AVE & BENSON AVE N/B Caton Ave Benson Ave
## 2 S CATON AVE & BENSON AVE S/B Caton Ave Benson Ave
## 3 WILKENS AVE & PINE HEIGHTS AVE E/B Wilkens Ave Pine Heights
## 4 THE ALAMEDA & E 33RD ST S/B The Alameda 33rd St
## 5 E 33RD ST & THE ALAMEDA E/B E 33rd The Alameda
## 6 ERDMAN AVE & N MACON ST E/B Erdman Macon St
## intersection Location.1
## 1 Caton Ave & Benson Ave (39.2693779962, -76.6688185297)
## 2 Caton Ave & Benson Ave (39.2693157898, -76.6689698176)
## 3 Wilkens Ave & Pine Heights (39.2720252302, -76.676960806)
## 4 The Alameda & 33rd St (39.3285013141, -76.5953545714)
## 5 E 33rd & The Alameda (39.3283410623, -76.5953594625)
## 6 Erdman & Macon St (39.3068045671, -76.5593167803)
If we want to extract the names of the data set, it is quite easy:
names(cameraData)
## [1] "address" "direction" "street" "crossStreet"
## [5] "intersection" "Location.1"
Now there are several things we might want to do, for instance:
splitNames = strsplit(names(cameraData),"\\.")
splitNames[[6]]
## [1] "Location" "1"
To get only the part of the names that come before any dots we can use the sapply function as follows
firstElement = function(x){x[1]}
sapply(splitNames,firstElement)
## [1] "address" "direction" "street" "crossStreet"
## [5] "intersection" "Location"
To illustrate more ways we can handle text, consider the following example:
example=data.frame(var_1=c(1, 2, 3), var_2=c("a", "b", "c"), aux_var_3=c("x", "y", "z"))
example
## var_1 var_2 aux_var_3
## 1 1 a x
## 2 2 b y
## 3 3 c z
We may want to
sub("_", "", names(example))
## [1] "var1" "var2" "auxvar_3"
This only substitutes the first instance when the underscore appears. If we want to replace all the underscores:
gsub("_", "", names(example))
## [1] "var1" "var2" "auxvar3"
grep("aux", names(example))
## [1] 3
grep("aux", names(example), value=TRUE)
## [1] "aux_var_3"
grepl("aux", names(example))
## [1] FALSE FALSE TRUE
substringing = function(x){substr(x, 1, 3)}
sapply(names(example),substringing)
## var_1 var_2 aux_var_3
## "var" "var" "aux"
paste("a", "b", sep="")
## [1] "ab"
We need a way to express - whitespace word boundaries - sets of literals - the beginning and end of a line - alternatives (“war” or “peace”)
Let’s take as an example the following list:
example_list=list("i think this is cool!", "i think number 3 is not that ugly...", "no, i think not", "cool? i don't think so", "I am taller than 1m74!", "My weight is 65kg.", "The post code is X89y.")
example_list
## [[1]]
## [1] "i think this is cool!"
##
## [[2]]
## [1] "i think number 3 is not that ugly..."
##
## [[3]]
## [1] "no, i think not"
##
## [[4]]
## [1] "cool? i don't think so"
##
## [[5]]
## [1] "I am taller than 1m74!"
##
## [[6]]
## [1] "My weight is 65kg."
##
## [[7]]
## [1] "The post code is X89y."
Let’s see how to:
grep("i think", example_list)
## [1] 1 2 3
grep("^i think", example_list)
## [1] 1 2
grep("cool$", example_list)
## integer(0)
grep("^[iI]", example_list)
## [1] 1 2 5
grep("[0-9][m][0-9][0-9]", example_list)
## [1] 5
grep("[a-zA-Z][0-9][0-9][a-zA-Z]", example_list)
## [1] 7
grep("[^!]$", example_list)
## [1] 2 3 4 6 7
Let’s look at another example to see what we can do:
example_list=list("I am taller than 1m74!", "My height is 1.74m", "1-74=-73, duh", "The revolution was in 1974", "I'm a good guy!", "He's a bad guy!")
example_list
## [[1]]
## [1] "I am taller than 1m74!"
##
## [[2]]
## [1] "My height is 1.74m"
##
## [[3]]
## [1] "1-74=-73, duh"
##
## [[4]]
## [1] "The revolution was in 1974"
##
## [[5]]
## [1] "I'm a good guy!"
##
## [[6]]
## [1] "He's a bad guy!"
Let’s try to:
grep("1.74", example_list)
## [1] 1 2 3 4
grep("1\\.74", example_list)
## [1] 2
grep("^I|bad", example_list)
## [1] 1 5 6
grep("^(I am|I'm)", example_list)
## [1] 1 5
And now a final example:
example_list=list("It's a bear (not sure if it's grizzly though)", "My number is 987654321. Call me", "This sentence has 5 words", "This sentence has more than 5 words", "I sometimes repeat repeat myself")
example_list
## [[1]]
## [1] "It's a bear (not sure if it's grizzly though)"
##
## [[2]]
## [1] "My number is 987654321. Call me"
##
## [[3]]
## [1] "This sentence has 5 words"
##
## [[4]]
## [1] "This sentence has more than 5 words"
##
## [[5]]
## [1] "I sometimes repeat repeat myself"
So now let’s:
grep("\\(.*\\)", example_list)
## [1] 1
grep("[0-9]{5,}", example_list)
## [1] 2
grep("This( +[^ ]+){5,} words", example_list)
## [1] 4
grep(" +([a-zA-Z]+) +\\1", example_list)
## [1] 5
Dates sometimes appear in the data we want to treat and it is important to know how to handle that. First of all, if we want to know the date it is at the moment, we can simply use the Sys.Date function. It returns an object of class date:
Sys.Date()
## [1] "2014-07-17"
Dates can have different formats:
%d = day as number (0-31), %a = abbreviated weekday,%A = unabbreviated weekday, %m = month (00-12), %b = abbreviated month, %B = unabbrevidated month, %y = 2 digit year, %Y = four digit year
So say I want to rewrite the date now in the form dd/mm/yy, then i can do the following:
format(Sys.Date(), "%d/%m/%y")
## [1] "17/07/14"
It’s better to treat date as objects of class date. So if we have just characters, we can coerce them to dates:
x = c("1jan1980", "2jan1980");
z = as.Date(x, "%d%b%Y")
z
## [1] "1980-01-01" "1980-01-02"
The cool thing is that now R can compute how many days there are between two dates:
z[2] - z[1]
## Time difference of 1 days
as.numeric(z[2]-z[1])
## [1] 1
We can find out what weekday and month each date corresponds to. The function julian tells us the number of days that have occured between an origin (of the system) and the dates we input:
weekdays(z)
## [1] "Tuesday" "Wednesday"
months(z)
## [1] "January" "January"
julian(z)
## [1] 3652 3653
## attr(,"origin")
## [1] "1970-01-01"
A nice package is lubridate. It also allows the conversion of numbers to dates:
library(lubridate);
ymd("20140108")
## [1] "2014-01-08 UTC"
mdy("08/04/2013")
## [1] "2013-08-04 UTC"
dmy("03-04-2013")
## [1] "2013-04-03 UTC"
ymd_hms("2011-08-03 10:15:03")
## [1] "2011-08-03 10:15:03 UTC"
ymd_hms("2011-08-03 10:15:03",tz="UTC")
## [1] "2011-08-03 10:15:03 UTC"