Automated Data Collection in R Chap. 8

4. Match each pattern to a string.



The first pattern matches one or more digits followed by a literal dollar sign. Because it has no anchors, it can match anywhere in a string.

library(stringr)

s1 = "alsdkfjhldksja12345$adsfyyrutyukj"
pattern1 = "[0-9]+\\$"
str_extract(s1, pattern1)   # one or more digits followed by a literal dollar sign
## [1] "12345$"
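
As a quick check of that description, the same pattern can be run over a few extra strings. This is a minimal sketch; the `samples` vector is made up for illustration and is not part of the exercise.

```r
library(stringr)

# Hypothetical test strings (not from the exercise)
samples <- c("cost: 250$", "no digits here$", "12345 then nothing", "7$ and 89$")
pattern1 <- "[0-9]+\\$"

# TRUE only where at least one digit sits directly before a "$"
hits <- str_detect(samples, pattern1)

# str_extract() returns the first match per string, or NA when there is none
first_match <- str_extract(samples, pattern1)
print(hits)
print(first_match)
```

Only the first and last strings match: the second has a `$` with no digit in front of it, and the third has digits but no `$`.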



The second pattern uses word boundaries. With str_extract_all(), it finds every whole word made of 1 to 4 lowercase letters.

s2 ="alkjdsf kj alkjhds hjf akljdhsf p ownc ajkgfui"
pattern2 = "\\b[a-z]{1,4}\\b"
unlist(str_extract_all(s2, pattern2))  # whole words of 1 to 4 lowercase letters
## [1] "kj"   "hjf"  "p"    "ownc"
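
To see what the word boundaries contribute, it may help to run the same character class without them. The sketch below reuses the string from the exercise; the variable names `with_b` and `without_b` are only illustrative.

```r
library(stringr)

s2 <- "alkjdsf kj alkjhds hjf akljdhsf p ownc ajkgfui"

# With \\b on both sides, only complete words of 1 to 4 letters qualify
with_b <- str_extract_all(s2, "\\b[a-z]{1,4}\\b")[[1]]

# Without \\b, longer words are chopped into chunks of up to 4 letters
without_b <- str_extract_all(s2, "[a-z]{1,4}")[[1]]

print(with_b)
print(without_b)
```

Without the boundaries, a seven-letter word such as "alkjdsf" is returned as the two pieces "alkj" and "dsf", which is why the boundaries are needed to select short words only.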



The third pattern matches only when the string ends with the .txt extension. Notice that it finds no match in s3b, because there the .txt is not at the end of the string.

s3 = "mush.txt"
s3b = "mush.txt something.csv"
pattern3 = ".*?\\.txt$"
str_extract(s3, pattern3) # only a string that ends in .txt
## [1] "mush.txt"
unlist(str_extract_all(s3b, pattern3)) # no match, so this returns character(0)
## character(0)
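
The end-of-string anchor is often used to filter a vector of file names. The following is a small sketch under that assumption; the `files` vector is hypothetical, and the shorter pattern `\\.txt$` is used because only detection, not extraction of the full name, is needed here.

```r
library(stringr)

# Hypothetical file names (the first two come from the exercise strings)
files <- c("mush.txt", "mush.txt something.csv", "notes.txt", "txt_notes.md")

# TRUE only when ".txt" is the very last thing in the string
ends_in_txt <- str_detect(files, "\\.txt$")

# str_subset() keeps just the matching elements
txt_files <- str_subset(files, "\\.txt$")

print(ends_in_txt)
print(txt_files)
```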



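The fourth pattern extracts dates written as two digits, a slash, two digits, a slash, and four digits. The group aa/bb/cccc has the same shape but is skipped because it contains letters rather than digits.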

s4 = "12/12/2112 aa/bb/cccc 01/01/1984"
pattern4 = "\\d{2}/\\d{2}/\\d{4}" # date with slashes: two digits / two digits / four digits
unlist(str_extract_all(s4, pattern4))
## [1] "12/12/2112" "01/01/1984"
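
If the individual fields of each date are needed, one option is to add capture groups and use str_match_all(). This is a sketch only: the grouped pattern is a variation on pattern4, not part of the original exercise, and it makes no assumption about which field is the day and which is the month.

```r
library(stringr)

s4 <- "12/12/2112 aa/bb/cccc 01/01/1984"

# Same pattern as pattern4, with each numeric field wrapped in a capture group
grouped <- "(\\d{2})/(\\d{2})/(\\d{4})"

# One row per match: column 1 is the full match, columns 2 to 4 are the groups
parts <- str_match_all(s4, grouped)[[1]]

full_dates <- parts[, 1]
last_field <- parts[, 4]   # the four-digit field of each date

print(parts)
```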



5. Rewrite pattern 4a using different elements that perform the same task.



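This pattern uses the [[:digit:]] class in place of [0-9], the {1,} quantifier in place of +, and a literal $ inside a bracket expression in place of the escaped \\$. The {1,} quantifier is used rather than *, because * would also accept zero digits and match a bare $. It extracts the same data from the same string.

s1 = "alsdkfjhldksja12345$adsfyyrutyukj"
pattern1b = "[[:digit:]]{1,}[$]"
str_extract(s1, pattern1b)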
## [1] "12345$"
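
The choice of quantifier only matters on edge cases, so a short comparison may be useful. The string `edge` below is a hypothetical example with a dollar sign that has no digits in front of it.

```r
library(stringr)

# Hypothetical edge case: a "$" with no digits before it
edge <- "price: $"

plus_q  <- str_extract(edge, "[0-9]+\\$")            # + requires at least one digit
star_q  <- str_extract(edge, "[[:digit:]]*[$]")      # * also allows zero digits
brace_q <- str_extract(edge, "[[:digit:]]{1,}[$]")   # {1,} behaves like +

print(c(plus_q, star_q, brace_q))
```

Here the `*` version returns "$" while the other two return NA, so `{1,}` is the rewrite that behaves identically to `+`.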



6. Consider the email address chunkylover53[at]aol[dot]com



a) Transform the string to a standard mail format using regex

Here we search for [at] and [dot] in the string and split the string on that pattern. This returns a character vector with three pieces. We then use paste() to join the pieces with the new symbols @ and . between them.

e_pattern = "(\\[at\\])|(\\[dot\\])"
e_str = "chunkylover53[at]aol[dot]com"

e_split = unlist(str_split(e_str, e_pattern))
email = paste(e_split[1], "@", e_split[2], ".", e_split[3], sep="")
email
## [1] "chunkylover53@aol.com"
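
An alternative to splitting and pasting is to replace each placeholder directly. The sketch below assumes str_replace_all() with a named character vector, where the names are the patterns and the values are the replacements; `email2` is just an illustrative name.

```r
library(stringr)

e_str <- "chunkylover53[at]aol[dot]com"

# Names are regex patterns (brackets escaped), values are the replacements
email2 <- str_replace_all(e_str, c("\\[at\\]" = "@", "\\[dot\\]" = "."))

print(email2)
```

This avoids relying on the position of each piece after the split, which can matter if an address contains more than one [dot].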



b) Imagine we are trying to extract the digits in an email address. To do so we write the expression [:digit:]. Explain why this fails and return the correct expression.

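This fails because [:digit:] is a predefined class name that is only valid inside a bracket expression, so it needs to be wrapped in brackets itself: [[:digit:]]. Otherwise a POSIX-style regex engine reads it as a set of the literal characters :, d, i, g, and t.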

str_extract(email, "[[:digit:]]+")  # need to return 1 or more digits
## [1] "53"
unlist(str_extract_all(email,"[[:digit:]]"))
## [1] "5" "3"
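
The two calls differ in more than the quantifier: str_extract() stops at the first match, while str_extract_all() returns every match. A sketch with a hypothetical address that contains two separate runs of digits makes this visible.

```r
library(stringr)

# Hypothetical address with two separate groups of digits
addr <- "agent007.bond42@example.com"

first_run  <- str_extract(addr, "[[:digit:]]+")            # first run only
all_runs   <- str_extract_all(addr, "[[:digit:]]+")[[1]]   # every run of digits
all_digits <- str_extract_all(addr, "[[:digit:]]")[[1]]    # every single digit

print(first_run)
print(all_runs)
print(all_digits)
```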



c) Instead of using predefined character classes, we would like to use the predefined symbols to extract all the digits in the mail address. To do so we write the expression \D. Explain why this fails and correct the expression.

This fails because \D matches any character that is not a digit. The predefined symbol for digits is the lowercase \d, written as "\\d" inside an R string.

str_extract(email, "\\d+")         # lowercase d gets the digits
## [1] "53"
unlist(str_extract_all(email,"\\d"))
## [1] "5" "3"
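
Although \D is the wrong symbol for extracting digits, it can reach the same result indirectly by removing everything that is not a digit. This is a minimal sketch of that idea, not the answer the exercise asks for.

```r
library(stringr)

email <- "chunkylover53@aol.com"

# What \\D actually matches: the runs of non-digit characters
non_digit_runs <- str_extract_all(email, "\\D+")[[1]]

# Deleting every non-digit leaves only the digits
digits_only <- str_replace_all(email, "\\D", "")

print(non_digit_runs)
print(digits_only)
```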