Loading packages

I load the stringr package.

Problem 3

  1. Copy the introductory example.
  1. Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

Load data

We call the str_extract_all function from the stringr package. Its signature is str_extract_all(string, pattern): the first argument is the string to operate on and the second is the regular expression to look for. str_extract_all extracts every match, not just the first.
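
As a small illustration of the call described above, the sketch below runs str_extract_all on a made-up sentence. The string `x` is a hypothetical example, not part of the exercise data.

```r
library(stringr)

# Hypothetical example string (not from the original exercise)
x <- "Order 12 shipped on day 7 to room 305"

# str_extract_all() returns a list with one character vector per input string
digits <- str_extract_all(x, "[0-9]+")
digits[[1]]

# str_extract() keeps only the first match, for comparison
first_digit <- str_extract(x, "[0-9]+")
first_digit
```

Because the result is a list, `[[1]]` is needed to get at the matches for a single input string.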

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"
## [1] "Moe  Szyslak"      "Montgomery  Burns" "Timothy  Lovejoy" 
## [4] "Ned  Flanders"     "Homer  Simpson"    "Julius  Hibbert"
  1. Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
  1. Construct a logical vector indicating whether a character has a second name.
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE
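
The code chunks behind the output above are not shown, so the following is only one possible sketch of the three tasks, not necessarily the approach used here. It assumes that a title is a capitalised abbreviation of two to four letters ending in a period at the start of the string, and that a second name appears as a single capital initial followed by a period. The variable names are my own. It produces single-spaced names, whereas the output above contains two spaces.

```r
library(stringr)

name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Assumption: a title looks like "Rev." or "Dr." at the start of the string
title_pattern <- "^[A-Z][a-z]{1,3}\\."
has_title <- str_detect(name, title_pattern)

# Assumption: a second name is a lone capital initial such as "C."
initial_pattern <- "\\b[A-Z]\\.\\s+"
has_second_name <- str_detect(name, initial_pattern)

# Rearrange to first_name last_name:
# 1. drop the title, 2. drop the initial, 3. swap "Last, First"
clean <- str_replace(name, paste0(title_pattern, "\\s+"), "")
clean <- str_replace(clean, initial_pattern, "")
clean <- str_replace(clean, "^(\\w+),\\s+(\\w+)$", "\\2 \\1")

clean
has_title
has_second_name
```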

Test the code

## [1] "Awsaf Akbar"      "Md. Forhad Akbar" "Shamzida Sharmin"
## [1] "Awsaf  Akbar"      "Forhad  Akbar"     "Shamzida  Sharmin"
## [1] FALSE  TRUE FALSE
## [1] FALSE FALSE FALSE

Problem 4

  1. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

I will explain each regular expression in detail and give at least two examples for each.

  1. [0-9]+\$ The [0-9]+ part matches a run of one or more digits from 0 through 9. In an R string the pattern is written with two backslashes ("[0-9]+\\$"), which tell the regex engine to treat $ as a literal character rather than a metacharacter. Hence any run of digits immediately followed by a dollar sign is matched by this regular expression.

Store the pattern in a variable, then create two different example strings and test them.

## [1] "120$"
## [1] TRUE
## [1] "240$" "12$"  "7$"
## [1] TRUE
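
A minimal sketch of such a test follows. The example strings are my own and may differ from those behind the output above.

```r
library(stringr)

# Two backslashes in the R string give one backslash to the regex engine
pattern <- "[0-9]+\\$"

# Hypothetical test strings: the last two should not match
examples <- c("120$", "7$", "$120", "abc")
matched <- str_detect(examples, pattern)
matched  # TRUE TRUE FALSE FALSE

# Pull every match out of a longer sentence
prices <- str_extract_all("Prices: 240$, 12$ and 7$", pattern)[[1]]
prices
```

Note that "$120" fails because the dollar sign must come after the digits, not before them.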
  1. \b[a-z]{1,4}\b The \b at each end is a word boundary. Between the boundaries the expression matches one to four lower case letters. Because the closing \b marks the end of the word, not the end of the string, the expression matches any whole word of one to four lower case letters, wherever it appears in the string.

Create two different example strings and test them

## [1] "am"
## [1] TRUE
## [1] "data" "six"  "zero" "etc"
## [1] TRUE
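
The word-boundary behaviour can be checked with a short sketch like the one below. The test strings are assumptions of mine, chosen so that some words are too long or start with a capital letter.

```r
library(stringr)

pattern <- "\\b[a-z]{1,4}\\b"

# "am" matches; "Hello" has a capital; "lengthy" has more than four letters
short_word <- str_detect(c("am", "Hello", "lengthy"), pattern)
short_word  # TRUE FALSE FALSE

# Only whole words of one to four lower case letters are extracted
words <- str_extract_all("The data contain six values", pattern)[[1]]
words
```

"The" is skipped because there is no word boundary between "T" and "he", so the lower case part is not a whole word.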
  1. .*?\.txt$ The dot represents any character. The asterisk that follows means the character can be matched zero or more times. The question mark after the asterisk does not make the item optional; it makes the quantifier lazy, so it matches as few characters as possible. (The asterisk already allows zero characters.) The escaped dot, written with two backslashes in an R string, is treated literally instead of as a metacharacter, so the expression looks for “.txt”. The dollar sign requires “.txt” to be at the end of the string. This regular expression therefore matches “.txt” or any string of characters that ends in “.txt”.

Create two different example strings and test them

## [1] "5454#34_2.txt option.png.image dark.txt"
## [1] TRUE
## [1] ".txt"      "data.txt"  "1$g!1.txt"
## [1] TRUE
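
The sketch below tests the pattern on a few file names of my own invention. It also drops the $ anchor to show what the lazy quantifier changes, since with the anchor in place lazy and greedy matching give the same result.

```r
library(stringr)

pattern <- ".*?\\.txt$"

# Hypothetical file names: only the first two end in ".txt"
files <- c(".txt", "data.txt", "notes.txt.bak", "txt")
ends_in_txt <- str_detect(files, pattern)
ends_in_txt  # TRUE TRUE FALSE FALSE

# Without the anchor, the lazy version stops at the first ".txt",
# while the greedy version runs on to the last one
lazy   <- str_extract("a.txt b.txt", ".*?\\.txt")
greedy <- str_extract("a.txt b.txt", ".*\\.txt")
lazy
greedy
```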
  1. \d{2}/\d{2}/\d{4} The \d (written with two backslashes in an R string) matches a single digit, and the {n} that follows says how many digits to match. Between the three groups of digits the expression requires a forward slash. Thus this regular expression matches any date in mm/dd/yyyy or dd/mm/yyyy format, and also any string with that shape even if it is not a valid date (e.g., “34/99/0002”). It does not match dates written with a one-digit day or month, or with a year shorter than four digits.

Create two different example strings and test them

## [1] "04/12/2019" "26/03/1985" "34/99/0005"
## [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE
## [1] "09/12/2019" "09/12/2016"
## [1] TRUE
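
A short sketch with made-up date strings illustrates these cases. The anchored variant at the end is my own addition, included only to show that the unanchored pattern also matches inside longer runs of digits.

```r
library(stringr)

pattern <- "\\d{2}/\\d{2}/\\d{4}"

# Hypothetical inputs: one-digit month and dash separators should fail
dates <- c("04/12/2019", "4/12/2019", "34/99/0005", "04-12-2019")
detected <- str_detect(dates, pattern)
detected  # TRUE FALSE TRUE FALSE

# The pattern is unanchored, so it finds a match inside a longer string
loose  <- str_detect("104/12/20199", pattern)
# Adding ^ and $ (an assumption, not part of the exercise) rejects it
strict <- str_detect("104/12/20199", "^\\d{2}/\\d{2}/\\d{4}$")
loose
strict
```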
  1. <(.+?)>.+?</\1> This regular expression matches text that starts with ‘<’ followed by one or more characters. The one-or-more-characters part (dot, plus, question mark) is in parentheses, which captures it as a group; the question mark makes the match lazy, so the group stops at the first ‘>’ that lets the rest of the pattern match. After the ‘>’, the expression matches one or more characters again, and then ‘</’. Next it requires the same text that the parentheses captured earlier (this is what the \1 does), followed by ‘>’. In short, the regex uses a backreference to match text that opens with <text> and closes with the matching </text>. This can be handy for simple searches through HTML or XML.

Create two different example strings and test them

## [1] "<tag>Text</tag>"
## [1]  TRUE FALSE
## [1] "<div>hello world</div>"            "<ol><li>one</li><li>two</li></ol>"
## [1] TRUE
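
The backreference can be seen at work in the sketch below. The mismatched string "<a>Text</b>" is a hypothetical counterexample of mine, and str_match is used here only to expose the captured group.

```r
library(stringr)

pattern <- "<(.+?)>.+?</\\1>"

# The closing tag must repeat the captured opening tag name
tags <- c("<tag>Text</tag>", "<a>Text</b>")
tag_ok <- str_detect(tags, pattern)
tag_ok  # TRUE FALSE

# str_match() returns the full match in column 1 and the group in column 2
tag_name <- str_match("<div>hello world</div>", pattern)[, 2]
tag_name

# With nested tags, the outermost pair is matched as one piece
nested <- str_extract_all("<ol><li>one</li><li>two</li></ol>", pattern)[[1]]
nested
```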

Problem 9

  1. The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5 fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr

Step 2

##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" "." "Y"
## [18] "O" "U" "." "A" "R" "E" "." "A" "." "S" "U" "P" "E" "R" "N" "E" "R"
## [35] "D" "!"

Step 3

##  [1] "C" "O" "N" "G" "R" "A" "T" "U" "L" "A" "T" "I" "O" "N" "S" " " "Y"
## [18] "O" "U" " " "A" "R" "E" " " "A" " " "S" "U" "P" "E" "R" "N" "E" "R"
## [35] "D" "!"
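
The chunks that produced the output above are not shown. One way to reproduce it, offered only as a sketch, is to keep the upper case letters together with "." and "!" and then turn the periods into spaces. The sketch assumes the spaces in the printed snippet are line-wrapping artefacts, so the string is rebuilt from its four pieces with paste0; the variable names are my own.

```r
library(stringr)

# The snippet from the exercise, joined without the wrapping spaces
secret <- paste0(
  "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0Tanwo",
  "Uwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigO",
  "d6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5",
  "fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
)

# Keep upper case letters plus the literal "." and "!"
chars <- str_extract_all(secret, "[A-Z.!]")[[1]]

# Replace each period with a space, then glue the characters together
chars_spaced <- str_replace(chars, "\\.", " ")
decoded <- paste(chars_spaced, collapse = "")
decoded
```

Inside a character class the dot and the exclamation mark are literal, so neither needs escaping there.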