ADCR, Chapter 8, Exercise 3 (p.217)

Copy the introductory example. The vector name stores the extracted names.

## [1] "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5543642Dr. Julius Hibbert"
##      [,1]                  
## [1,] "Moe Szyslak"         
## [2,] "Burns, C. Montgomery"
## [3,] "Rev. Timothy Lovejoy"
## [4,] "Ned Flanders"        
## [5,] "Simpson, Homer"      
## [6,] "Dr. Julius Hibbert"
##      [,1]            
## [1,] "555-1239"      
## [2,] "(636) 555-0113"
## [3,] "555-6542"      
## [4,] "555 8904"      
## [5,] "636-555-3226"  
## [6,] "5543642"
##                   name          phone
## 1          Moe Szyslak       555-1239
## 2 Burns, C. Montgomery (636) 555-0113
## 3 Rev. Timothy Lovejoy       555-6542
## 4         Ned Flanders       555 8904
## 5       Simpson, Homer   636-555-3226
## 6   Dr. Julius Hibbert        5543642

Use the tools of this chapter to rearrange the vector so that all elements conform to the standard first_name last_name.

The description above is a bit unclear: it does not say whether titles should be stripped out.
Also, for an individual like Mr. Burns, should we use just the initial “C.” as his first name?

I am going to make the assumption that the question is asking us to take those names which are presented as:

lastname, firstname

and rearrange such names so that the comma is removed and the names are presented in the order:

firstname lastname

but in the case of any titles, initials, etc., I am not going to remove them.

(The instructions should have been clearer if something different was desired.)

ENDING NAMES:
##      [,1]                  
## [1,] "Moe Szyslak"         
## [2,] "C. Montgomery Burns" 
## [3,] "Rev. Timothy Lovejoy"
## [4,] "Ned Flanders"        
## [5,] "Homer Simpson"       
## [6,] "Dr. Julius Hibbert"
Note that names #2 and #5 have been changed vs. original input.
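
The rearrangement can be sketched in base R as follows. This is a minimal illustration rather than the code that produced the output above; the variable name `name_fixed` is my own, and the `name` vector is retyped from the printed output of the introductory example.

```r
# Names as extracted in the introductory example (retyped from the output above).
name <- c("Moe Szyslak", "Burns, C. Montgomery", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Simpson, Homer", "Dr. Julius Hibbert")

# Capture the text before and after the comma and emit the two pieces swapped.
# Names without a comma do not match the pattern, so sub() returns them unchanged.
name_fixed <- sub("^(.+), *(.+)$", "\\2 \\1", name)
print(name_fixed)
```

Because titles and initials are simply carried along inside the captured groups, this matches the assumption stated earlier: nothing is stripped, only the order changes.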

Construct a logical vector indicating whether a character has a title (i.e., Rev. and Dr.).

##       [,1]
## [1,] FALSE
## [2,] FALSE
## [3,]  TRUE
## [4,] FALSE
## [5,] FALSE
## [6,]  TRUE

Display the names of those individuals with titles preceding their names:

##      [,1]                  
## [1,] "Rev. Timothy Lovejoy"
## [2,] "Dr. Julius Hibbert"
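
One way to build the title indicator in base R is sketched below. It assumes that “Rev.” and “Dr.” are the only titles of interest, as the exercise states, and it returns a plain logical vector rather than the one-column matrix printed above; `has_title` is a name chosen for this illustration.

```r
# Rearranged names (retyped from the "ENDING NAMES" output).
name_fixed <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
                "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# TRUE when the name starts with "Rev." or "Dr." (the dot is escaped so that
# it is matched literally instead of as "any character").
has_title <- grepl("^(Rev|Dr)\\.", name_fixed)
print(has_title)

# Logical indexing then selects the titled individuals.
print(name_fixed[has_title])
```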

Construct a logical vector indicating whether a character has a second name.

## [1] 2 3 3 2 2 3
## 2 Moe Szyslak
## 3 C. Montgomery Burns
## 3 Rev. Timothy Lovejoy
## 2 Ned Flanders
## 2 Homer Simpson
## 3 Dr. Julius Hibbert

The counts above give the number of space-separated parts in each name.
If the individual has a title (i.e., “Rev.” or “Dr.”), then the name must have 4 parts to include a middle name.
If the individual does not have a title, then the name must have 3 parts to include a middle name.

##       [,1]
## [1,] FALSE
## [2,]  TRUE
## [3,] FALSE
## [4,] FALSE
## [5,] FALSE
## [6,] FALSE

Display the list of individuals who have a Middle Name:

## [1] "C. Montgomery Burns"
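
The counting rule described above can be sketched in base R as follows. This is an illustration under my own naming (`n_parts`, `has_title`, `has_second_name`), and it treats "at least three parts after discounting a title" as the condition, which agrees with the 3-part and 4-part rule for the six names here.

```r
# Rearranged names (retyped from the "ENDING NAMES" output).
name_fixed <- c("Moe Szyslak", "C. Montgomery Burns", "Rev. Timothy Lovejoy",
                "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Number of space-separated parts in each name.
n_parts <- lengths(strsplit(name_fixed, " +"))

# A leading title uses up one part, so subtract it before comparing.
has_title <- grepl("^(Rev|Dr)\\.", name_fixed)
has_second_name <- (n_parts - has_title) >= 3

print(n_parts)
print(name_fixed[has_second_name])
```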

If the instructions had clearly asked us to drop titles and/or middle names, the two logical vectors above would let us do so.

In absence of such instructions, I’ll leave the names as they are.


ADCR, Chapter 8, Exercise 4 (p.217)

Describe the types of strings that conform to the following regular expressions and construct an example that is matched by the regular expression.

1. [0-9]+\\$

This matches one or more digits, followed by a literal dollar-sign “$” character. (Note: It does not represent digits at the end of a line. In an R string the double backslash becomes a single backslash in the regular expression, which escapes the dollar-sign and gives it its literal meaning, rather than its “end-of-line” meaning.)

##  [1] "1$"      "2$"      "34$"     "56$"     "7$"      "456$"    "89$"     "234$"    "10$"    
## [10] "11$"     "123456$" "1234$"   "56789$"
##  [1] "1$"      "2$"      "34$"     "56$"     "7$"      "456$"    "89$"     "234$"    "10$"    
## [10] "11$"     "123456$" "1234$"   "56789$"
## [1] "1$: "
##      [,1]
## [1,] "1$"
## [1] "_______________________"
## [1] "2$: "
##      [,1]
## [1,] "2$"
## [1] "_______________________"
## [1] "34$: "
##      [,1] 
## [1,] "34$"
## [1] "_______________________"
## [1] "56$: "
##      [,1] 
## [1,] "56$"
## [1] "_______________________"
## [1] "xyz7$abc456$789: "
##      [,1]  
## [1,] "7$"  
## [2,] "456$"
## [1] "_______________________"
## [1] "xyz89$abc$234$567: "
##      [,1]  
## [1,] "89$" 
## [2,] "234$"
## [1] "_______________________"
## [1] "xyz10$abc$asdf$: "
##      [,1] 
## [1,] "10$"
## [1] "_______________________"
## [1] "xyz11$abc$123456$asdfg: "
##      [,1]     
## [1,] "11$"    
## [2,] "123456$"
## [1] "_______________________"
## [1] "This is a long string with letters$9876 and 1234$numbers - many 56789$numbers: "
##      [,1]    
## [1,] "1234$" 
## [2,] "56789$"
## [1] "_______________________"
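
A compact base-R sketch of this kind of check is shown below. It is not the code that produced the output above; the three test strings are a small subset chosen for illustration, and `matches` is my own variable name.

```r
pattern <- "[0-9]+\\$"

# A few test strings: a bare match, two matches inside other text,
# and a case where "$9876" must NOT match (the digits come after the "$").
examples <- c("34$", "xyz7$abc456$789", "letters$9876 and 1234$numbers")

# gregexpr() finds every match position; regmatches() pulls out the matched text.
matches <- regmatches(examples, gregexpr(pattern, examples))
print(matches)
```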

2. \\b[a-z]{1,4}\\b

This matches any “word” containing between 1 and 4 lower-case letters.
Note that a “word boundary” occurs wherever a word character meets a non-word character, so in addition to spaces, characters such as a dollar-sign (“$”) or a hyphen (“-”) also delimit words.
Note however that an underscore ("_") does not create a word boundary. Rather, an underscore is considered part of a word.

## 
## ____example2a_____
## 1 is
## 2 a
## 3 long
## 4 many
## 5 is
## 6 the
## 7 of
## 8 our
## 9 a
## 10 b
## __________________
## 
## ____example2b_____
## 1 is
## 2 a
## 3 long
## 4 many
## 5 is
## 6 the
## 7 of
## 8 our
## 9 a
## 10 b
## __________________
## 
## ____example2c_____
## 1 is
## 2 a
## 3 long
## 4 many
## __________________
Note that the underscores in example2c do not constitute word separators, so none of the words are taken from the last part of that example.
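
The boundary behaviour can be reproduced with a small base-R sketch. The test strings are hypothetical and much shorter than the examples used above, and `perl = TRUE` is my choice here to get PCRE handling of `\\b`.

```r
pattern <- "\\b[a-z]{1,4}\\b"

# "sentence" is too long to match; "$" and "-" separate words;
# underscores glue "the_of_our" into one long word, so nothing matches there.
examples <- c("this is a long sentence", "ab$cd-ef", "the_of_our")

words <- regmatches(examples, gregexpr(pattern, examples, perl = TRUE))
print(words)
```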

3. .*?\\.txt$

This matches any number of characters of any kind (including none), followed by a literal “.txt”, which must then end the string.

## [1] "This is a filename.txt" "Hello123456.txt"        "what is new.txt"
## This is a filename.txt
## Hello123456.txt
## what is new.txt
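
A short base-R check of which strings qualify is sketched below. The file names other than the two taken from the output above are hypothetical, added to show near misses; `is_match` is my own variable name.

```r
pattern <- ".*?\\.txt$"

files <- c("Hello123456.txt",  # ends in ".txt": match
           "notes.txt.bak",    # ".txt" is not at the end: no match
           "readme_txt",       # no literal dot before "txt": no match
           "what is new.txt")  # spaces are fine, "." matches any character

is_match <- grepl(pattern, files, perl = TRUE)
print(files[is_match])
```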

5. <(.+?)>.+?</\\1>

This identifies an HTML or XML-style block: an opening tag, non-empty text, and a closing tag. The backreference “\\1” requires the name in the closing tag to be identical to whatever was captured inside the opening tag.
The start tag would be something like <FOO> and the corresponding end tag would be </FOO>.

## [1] "<HTML>some content</HTML>"                "<b>text content inside block</b>"        
## [3] "<notempty>x</notempty>"                   "</maybe>perhaps this will work?<//maybe>"
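
The role of the backreference can be seen in a small base-R sketch. The second and third test strings are hypothetical counterexamples of my own; the first and fourth are taken from the output above.

```r
pattern <- "<(.+?)>.+?</\\1>"

examples <- c("<b>text content inside block</b>",          # tags agree: match
              "<b>mismatched</i>",                          # closing tag differs: no match
              "<empty></empty>",                            # nothing between the tags: no match
              "</maybe>perhaps this will work?<//maybe>")   # captured name is "/maybe": match

is_block <- grepl(pattern, examples, perl = TRUE)
print(is_block)
```

The last case works because the capture group accepts any characters, so “/maybe” is treated as the tag name and the closing tag must then be “<//maybe>”.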

ADCR, Chapter 8, Exercise 9 (p.218)

The following code hides a secret message. Crack it with R and regular expressions. Hint: Some of the characters are more revealing than others! The code snippet is also available in the materials at www.r-datacollection.com.

## [1] "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr\n"

Let’s see if any of the regex classes can help filter out something meaningful:

##  [1] "[[:digit:]]" "[[:lower:]]" "[[:upper:]]" "[[:alpha:]]" "[[:alnum:]]" "[[:punct:]]"
##  [7] "[[:graph:]]" "[[:blank:]]" "[[:space:]]" "[[:print:]]"

Let’s try each regex class on the secret, and see if any of them gives insight:

## [1] "1087792855078035307553364116224905651724639589659490545"                                                                                                                                                                                                 
## [2] "clcopowzmstcdwnkigvdicpuggvhrynjuwczihqrfpxsjdwpnanwowisdijjkpfdrcocbtyczjataootjtjnecfekrwwwojigdvrfrbzbknbhzgvizcropwgnbqofaotfbwmktszqefyndtkcfgmcgxonhkgr"                                                                                           
## [3] "CONGRATULATIONSYOUAREASUPERNERD"                                                                                                                                                                                                                         
## [4] "clcopCowzmstcdwnkigOvdicpNuggvhrynGjuwczihqrfpRxsAjdwpnTanwoUwisdijLjkpfATIdrcocbtyczjatOaootjtNjnecSfekrwYwwojigOdvrfUrbzbkAnbhzgvRizEcropwAgnbSqoUfPaotfbwEmktsRzqefynNdtkcfEgmcRgxonhDkgr"                                                            
## [5] "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfekr1w1YwwojigOd6vrfUrbz22bkAnbhzgv4R9i05zEcropwAgnbSqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDkgr"     
## [6] "....!"                                                                                                                                                                                                                                                   
## [7] "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
## [8] "\n"                                                                                                                                                                                                                                                      
## [9] "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr"
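
The trial of each class can be sketched in base R as below. This is an illustration, not the code behind the output above: it tries only four of the ten classes, and `extract_class` is a hypothetical helper of my own.

```r
secret <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr\n"

classes <- c("[[:digit:]]", "[[:lower:]]", "[[:upper:]]", "[[:punct:]]")

# Hypothetical helper: keep only the characters matched by one class
# and glue them back together into a single string.
extract_class <- function(char_class, text) {
  paste(regmatches(text, gregexpr(char_class, text))[[1]], collapse = "")
}

# One result per class, named by the class that produced it.
results <- vapply(classes, extract_class, character(1), text = secret)
print(results[["[[:upper:]]"]])
print(results[["[[:punct:]]"]])
```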

The third one shows promise – it has some discernible words, but they are run together.

## [1] "[[:upper:]]"
## [1] "CONGRATULATIONSYOUAREASUPERNERD"

I see some punctuation in the sixth item:

## [1] "[[:punct:]]"
## [1] "....!"

Perhaps we should try combining both regexes (i.e., 3 and 6)?

## [1] "[[:upper:][:punct:]]"

Here’s the result from combining the UPPERCASE and the punctuation:

## [1] "CONGRATULATIONS.YOU.ARE.A.SUPERNERD!"

OK, time to clean up the punctuation. Since the original secret string doesn’t contain any spaces, it presumably uses a dot in place of each space. So, let’s switch them back:

## [1] "CONGRATULATIONS YOU ARE A SUPERNERD!"

To make this grammatically correct, it needs a comma:

## [1] "CONGRATULATIONS, YOU ARE A SUPERNERD!"
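
The three steps above (combined class, dots to spaces, added comma) can be sketched in base R as follows. The variable names are my own, and the comma is inserted by a plain string substitution, which is just one of several ways to do it.

```r
secret <- "clcopCow1zmstc0d87wnkig7OvdicpNuggvhryn92Gjuwczi8hqrfpRxs5Aj5dwpn0TanwoUwisdij7Lj8kpf03AT5Idr3coc0bt7yczjatOaootj55t3Nj3ne6c4Sfek.r1w1YwwojigOd6vrfUrbz2.2bkAnbhzgv4R9i05zEcrop.wAgnb.SqoU65fPa1otfb7wEm24k6t3sR9zqe5fy89n6Nd5t9kc4fE905gmc4Rgxo5nhDk!gr\n"

# Step 1: keep upper-case letters and punctuation in their original order.
kept <- paste(regmatches(secret, gregexpr("[[:upper:][:punct:]]", secret))[[1]],
              collapse = "")

# Step 2: replace each literal dot with a space (fixed = TRUE disables regex
# interpretation, so "." is not treated as "any character").
spaced <- gsub(".", " ", kept, fixed = TRUE)

# Step 3: add the comma after the first word.
message_text <- sub("CONGRATULATIONS", "CONGRATULATIONS,", spaced, fixed = TRUE)
print(message_text)
```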

The secret message is: “CONGRATULATIONS, YOU ARE A SUPERNERD!”

(Please, tell me something I didn’t already know?)