if(!file.exists("data")) {
dir.create("data")
}
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
if(!file.exists("./data/cameras.csv")) {
download.file(fileUrl, destfile="./data/cameras.csv", method="curl")
}
cameraData <- read.csv("./data/cameras.csv")
names(cameraData)
## [1] "address" "direction" "street" "crossStreet"
## [5] "intersection" "Location.1"
tolower(names(cameraData))
## [1] "address" "direction" "street" "crossstreet"
## [5] "intersection" "location.1"
My note: Converting “crossStreet” is a bad idea. What is truly error-prone is having three “s” characters in a row, overlapping two words. I do not agree that “crossstreet” is in any way better than “crossStreet”.
Note that the split parameter is a regular expression, a topic not yet covered. We need to add a backslash in front of it so that the regex engine knows we are specifying the period character; “.” by itself in regex means “any character” (it's a wildcard). And we specify two backslashes because the expression is inside double-quotes, and in that context, a single backslash is used to denote an escaped character such as a tab (“\t”).
splitNames = strsplit(names(cameraData), split = "\\.")
splitNames[[5]]
## [1] "intersection"
splitNames[[6]] # 'Location.1' changed into two values, 'Location' and '1'
## [1] "Location" "1"
The following creates a list with three elements, the first two of which are named.
mylist <- list(letters = c("A", "b", "c"), numbers = 1:3, matrix(1:25, ncol = 5))
head(mylist)
## $letters
## [1] "A" "b" "c"
##
## $numbers
## [1] 1 2 3
##
## [[3]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 6 11 16 21
## [2,] 2 7 12 17 22
## [3,] 3 8 13 18 23
## [4,] 4 9 14 19 24
## [5,] 5 10 15 20 25
This shows three ways of selecting the first element of the list:
mylist[1] # returns a list
## $letters
## [1] "A" "b" "c"
class(mylist[1])
## [1] "list"
mylist$letters # returns a vector
## [1] "A" "b" "c"
class(mylist$letters)
## [1] "character"
mylist[[1]] # returns a vector
## [1] "A" "b" "c"
class(mylist[[1]])
## [1] "character"
splitNames[[6]][1] #splitNames[[6]] contains two values, 'Location' and '1'; this returns the first
## [1] "Location"
firstElement <- function(x) {
x[1]
} # returns first element of x
sapply(splitNames, firstElement) # passes each element in splitNames into firstElement
## [1] "address" "direction" "street" "crossStreet"
## [5] "intersection" "Location"
Same as used in week 3.
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0026895
if (!file.exists("./data")) {
dir.create("./data")
}
fileUrl1 = "https://dl.dropboxusercontent.com/u/7710864/data/reviews-apr29.csv"
fileUrl2 = "https://dl.dropboxusercontent.com/u/7710864/data/solutions-apr29.csv"
if (!file.exists("./data/reviews.csv")) {
download.file(fileUrl1, destfile = "./data/reviews.csv", method = "curl")
}
if (!file.exists("./data/solutions.csv")) {
download.file(fileUrl2, destfile = "./data/solutions.csv", method = "curl")
}
reviews = read.csv("./data/reviews.csv")
solutions = read.csv("./data/solutions.csv")
head(reviews, 2)
## id solution_id reviewer_id start stop time_left accept
## 1 1 3 27 1304095698 1304095758 1754 1
## 2 2 4 22 1304095188 1304095206 2306 1
head(solutions, 2)
## id problem_id subject_id start stop time_left answer
## 1 1 156 29 1304095119 1304095169 2343 B
## 2 2 269 25 1304095119 1304095183 2329 C
Removes all underscores:
names(reviews)
## [1] "id" "solution_id" "reviewer_id" "start" "stop"
## [6] "time_left" "accept"
sub(pattern = "_", replacement = "", x = names(reviews))
## [1] "id" "solutionid" "reviewerid" "start" "stop"
## [6] "timeleft" "accept"
testName <- "this_is_a_test"
sub("_", "", testName)
## [1] "thisis_a_test"
gsub("_", "", testName)
## [1] "thisisatest"
grep(pattern = "Alameda", x = cameraData$intersection) # Find all intersections that contain 'Alameda'
## [1] 4 5 36
table(grepl(pattern = "Alameda", x = cameraData$intersection)) # build a table of intersections that do, and do not, contain 'Alameda'
##
## FALSE TRUE
## 77 3
cameraData2 <- cameraData[!grepl(pattern = "Alameda", x = cameraData$intersection),
] # subset of intersections that do not contain 'Alameda'
My note: regular expressions are an art. grep is one tool for working with regular expressions, except that grep is in fact many tools, or rather flavors of a regex tool. For example, the Perl language has its own variant of regex. Many bash shells allow emulation of Perl regex; but they're not all exactly the same. The grep and grepl functions both have perl=TRUE options, but how closely do they match the Perl language's regexes? I don't know yet. At any rate, if you're new to regex, it will take time and practice to master them (much like R).
grep(pattern = "Alameda", x = cameraData$intersection, value = TRUE) # value=TRUE means return the values, rather than indices to the original data set
## [1] "The Alameda & 33rd St" "E 33rd & The Alameda"
## [3] "Harford \n & The Alameda"
grep("JeffStreet", cameraData$intersection) # if no matches, returns intger(0)
## integer(0)
length(grep("JeffStreet", cameraData$intersection)) # returns 0
## [1] 0
Note: nchar, substr, paste, paste0 are all in base R.
library(stringr)
nchar("Jeffrey Leek")
## [1] 12
substr("Jeffrey Leek", 1, 7)
## [1] "Jeffrey"
paste("Jeffrey", "Leek") # sep defaults to ' '
## [1] "Jeffrey Leek"
paste0("Jeffrey", "Leek") # there is no sep param, just concatenates
## [1] "JeffreyLeek"
str_trim("Jeff ") # trims trailing whitespace
## [1] "Jeff"
str_trim(" Jeff ", side = "both") # or on both sides (can be left, right, both)
## [1] "Jeff"
My note: Some of the above is absurd. It could mean that Rtc_Sep_Stages_Sign_X_12_3 becomes rctsepstagessignx123, and if you think that's an improvement, well you're just wrong! Pffffft.
Regular expressions were introduced in previous video, this section will explore regex more fully.
Simplest pattern consists of only literals. The literal “nuclear” would match the following lines:
Ooh, I just learned that to keep myself alive after a nuclear blast! All I have to do is milk some rats then drink the milk. Awesome. :}
Laozi says nuclear weapons are mas macho.
Chaos in a country that has nuclear weapons--not good.
my nephew is trying to teach me nuclear physics, or possibly just trying to show how smart he is so I'll be proud of him (which I am)
lol if you ever say "nuclear"" people immediately think DEATH by radiation LOL
We need a way to express
Some metacharacters represent the start of a line
(Note: The ^ metacharacter means “start of line”; it is the only metacharacter I've ever seen that represents the start of a line.)
^i think
which will match
i think we all rule for participating
i think i have been outed
i think this will be quite fun actually
It will not match:
that's not what i think
because “i think” is not at the start of the line
$ represents the end of a line
morning$
This matches any line that ends with “morning” (along with a line end sequence such as linefeed, or carriage return-linefeed)
We can list a set of characters we will accept at a given point in the match.
(NOTE: What this means is that any one of the set will count as a match)
The following (which defines four consecutive character classes) will match “bush” in any combination of upper and lower case. So “Bush”, “bUsh”, “bush”, “BUSH” would all match.
[Bb][Uu][Ss][Hh]
You can combine metacharacters. The following matches either “i am” or “I am”, so long as it is at the start of a line:
^[Ii] am
So this would match “i am”, “I am”, and “i amped”, at the start of a line.
Similarly, you can specify a range of letters a-z or A-Z; notice that order doesn't matter.
NOTE: Order most certainly does matter. If you specify [z-a] it is very different from [a-z]! I think what is meant here is that if you combine more than one range of letters in a single class, then the order of the ranges doesn't matter, e.g., [a-zA-Z] is the same as [A-Za-z].
So this:
^[0-9][a-zA-Z]
will match any single numeric digit, followed by any single letter, uppre or lower case, at the start of a line.
When used at the beginning of a character class, the ^ is also a metacharacter and indicates characters that should not be matched.
For example this:
[^?.]$
will match any lines that do not end in either a question mark or a period.
Also note that ^ has this functionality only at the the beginning of a character class. This:
[a-z^]
matches any lower case letter or the ^ symbol.
I veer away from many of the examples in the video and provide my own, in order to simplify some things.
“.” is used to refer to any character. So:
9.11
matches a “9” followed by any character, followed by “11”.
NOTE: This isn't entirely true. If 9 were followed by a linefeed, for example, this would not (by default) match. Generally speaking, regular expressions work on only one line at a time. But apart from line-ending characters, the above would match any string that had a 9, then any character, then 11. So it would match “9.11”, “9211”, “9:11”, “9a11” and so on.
The “|” character is used to create “or” conditions in order to combine subexpressions:
flood|fire
this matches either “flood” or “fire”.
We can include any number of alternatives:
flood|earthquake|hurricane|coldfire
The alternatives can also be more complex expressions:
^[Gg]ood|[Bb]ad
The above matches either:
So the | takes precedence over the , kind of like a * taking precedence over a + in math.
If you wanted to match either “Good” or “good” or “Bad” or “bad” at the start of a line, group both expressions in parentheses:
^([Gg]ood|[Bb]ad)
The question mark indicates that the indicated expression is optional.
The following matches “George Bush”, “george bush”, “George W. Bush”, “george w. bush”. So it will match with or without a “ W.” between “George” and “Bush”.
[Gg]eorge( [Ww]\.)? [Bb]ush
Note above the “.” which indicates that we want to match the period character. I.e., it does not serve its normal purpose as a wildcard matching any character. This can be required for other characters; for instance, if you want to match “\( ", then specify \ \). This is generally true for any metacharacter, but note what I said above: regular expressions are an art, and they take time to master. Confusions about escaping can crop up, especially since different tools can require different kinds of escaping. Even within R, the functions that use regular expressions allow you to specify that you are using Perl-style expressions, which can require different kinds of escaping. In a nutshell, be prepared for some confusion if you are new to regular expressions.
The * metacharacter means "any number of the preceding character/expression”, and + means “one or more of the preceding character/expression”.
So this:
(.*)
will match any line that has both a left and right parens, whether there is any text between them or not.
NOTE: The video is wrong here. You must escape the parentheses, because they themselves are metacharacters. So the expression would be:
\(.*\)
This:
[0-9]+
will match any string that has at least one numeric digit, and this:
[0-9]+ (.*)[0-9]+
will match any line that has one or more digits, followed by a space, followed by zero or more of any character, followed by one or more digits.
Note that the parentheses used here are correct; the examples of matched lines in the video do not have parentheses in them. So in the above, the parentheses are metacharacters, not literals. If you are new to regular expressions, this would probably have you scratching your head.
These metacharacters are used to define the minimum and maximum number of matches of an expression. For example this:
[Bb]ush( +[^ ]+ +){1,5} debate
looks for the following, in order:
So in a sense, it looks for “Bush”, then one to five words, then “debate” (note that this is not an entirely accurate desription, one of those “words” might be an ellipse, “…”, and this hints at the difficulties you can face with regular experssions. Did I mention is't an art?)
If you use {3} in the above, it means match exactly three times. If you use {2,7} it means match between 2 and 7 times. If you use {3,} it means match at least three times.
NOTE: Again, it's an art. Working with backreferences starts out simple, but grows more complex as your needs become more sophisticated. Two quick things I'll mention without elaborating on:
So this:
([a-zA-Z]+) \1
will match any sequence of one or more letters that is repeated and separated by a space. In effect this says, “match any sequence of letters and store it as backreference 1, then a space, then whatever is stored as backreference 1”.
The * metacharacter is “greedy” by default, so this:
^s(.*)s
will match as many characters as possible after the “s” at the beginning of a line. For example, if the string searched is “soon as pass” then it will not match “soon as”, but rather, “soon as pass”. It will match as many characters as it possibly can.
What if you want to match starting with an s, then any characters, but up to and including only the second s in the line? Do this:
^s(.*?)s
The question mark directly after the asterisk is another special metacharacter that signals that non-greedy matching should be performed. So the command now is, “look for s at the start of a line, then zero or more characters followed by an s, but stop at the second s in the line”.
One final note: Remember that whenever you use a backslash in a string expression surrouned by double quotes, you have to double the backslash, so the regular expression . would become “.”