library(stringr)
raw.data <-"555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
raw.data

## [1] "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"

name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))

name

## [1] "Moe Szyslak"          "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Simpson, Homer"       "Dr. Julius Hibbert"

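[ ] - Brackets enclose a character class, which matches any single character listed inside the brackets.
[:alpha:] - Alphabetic characters: a-z and A-Z.
“.,” and the space - Because these characters are inside the brackets they are matched literally, which keeps titles such as “Dr.”, names written as “Last, First”, and the spaces between words together in one match.
{2,} - The preceding item is matched 2 or more times, so a single stray space between digits is skipped.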
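To see what each part of the pattern contributes, the sketch below runs two variants on a shortened input. The variable names (demo, letters_only, full_name) are made up for this illustration and are not part of the original solution.

```r
library(stringr)

# Shortened, hypothetical input: one phone number followed by one name
demo <- "555-1239Moe Szyslak"

# Letters only: the space is not in the class, so the name splits in two
letters_only <- unlist(str_extract_all(demo, "[[:alpha:]]{2,}"))
letters_only
# [1] "Moe"     "Szyslak"

# Letters plus ".", "," and space: the full name stays in one match
full_name <- unlist(str_extract_all(demo, "[[:alpha:]., ]{2,}"))
full_name
# [1] "Moe Szyslak"
```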

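# Swap "Burns, C. Montgomery" in two steps: "Burns, Burns" -> "Montgomery C. Burns"
name <- str_replace(name, pattern = "C. Montgomery", "Burns")
name <- str_replace(name, pattern = "Burns,", "Montgomery C.")
# Swap "Simpson, Homer" the same way: "Simpson, Simpson" -> "Homer Simpson"
name <- str_replace_all(name, pattern = "Homer", "Simpson")
name <- str_replace_all(name, pattern = "Simpson,", "Homer")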
name

## [1] "Moe Szyslak"          "Montgomery C. Burns"  "Rev. Timothy Lovejoy"
## [4] "Ned Flanders"         "Homer Simpson"        "Dr. Julius Hibbert"
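The replacements above are tied to these two specific names. A more general option, sketched here under the assumption that every reversed name has the form “Last, Rest”, is to capture both parts and swap them with backreferences. The vector name_demo is a hypothetical stand-in for the extracted names. This version keeps the given names in their original order, so it returns “C. Montgomery Burns” rather than “Montgomery C. Burns”.

```r
library(stringr)

# Hypothetical stand-in for the extracted names
name_demo <- c("Moe Szyslak", "Burns, C. Montgomery", "Simpson, Homer")

# Group 1 = text before the comma (last name), group 2 = everything after it.
# Names without a comma do not match and are returned unchanged.
reordered <- str_replace(name_demo, "^([^,]+),\\s*(.+)$", "\\2 \\1")
reordered
# [1] "Moe Szyslak"         "C. Montgomery Burns" "Homer Simpson"
```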

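# A title ends in a lowercase letter followed by a period ("Rev.", "Dr.")
title <- str_detect(name, "[[:lower:]][.]")
# A second-name initial is an uppercase letter followed by a period ("C.")
middle <- str_detect(name, "[[:upper:]][.]")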

data.frame(name, title, middle)

##                   name title middle
## 1          Moe Szyslak FALSE  FALSE
## 2  Montgomery C. Burns FALSE   TRUE
## 3 Rev. Timothy Lovejoy  TRUE  FALSE
## 4         Ned Flanders FALSE  FALSE
## 5        Homer Simpson FALSE  FALSE
## 6   Dr. Julius Hibbert  TRUE  FALSE

[0-9]+\$

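[0-9] matches a single digit, and the + quantifier extends the match to one or more consecutive digits. Because $ is a metacharacter (it normally anchors a match to the end of the string), we precede it with \ so that it matches a literal $ symbol. Inside an R string the backslash itself has to be escaped, which is why the pattern is written as "[0-9]+\\$" in the code.

The expression extracts every run of digits that is immediately followed by a $ symbol. Digits that come after a $, or that are separated from it by another character, are not matched.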

raw.data1 <- "12345$ 1$23 $123 123 abc1-23$ abc$123 123$abc 1a2$3abc 123abc$"
samp1 <- unlist(str_extract_all(raw.data1, "[0-9]+\\$"))
samp1

## [1] "12345$" "1$"     "23$"    "123$"   "2$"
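The effect of escaping the $ can be checked directly. The sketch below uses a shortened, hypothetical input (prices) and compares the escaped pattern with the unescaped one, where $ acts as an end-of-string anchor.

```r
library(stringr)

# Hypothetical, shortened input
prices <- "12345$ 1$23"

# Escaped: "\\$" matches a literal dollar sign
escaped <- unlist(str_extract_all(prices, "[0-9]+\\$"))
escaped
# [1] "12345$" "1$"

# Unescaped: "$" anchors the match to the end of the string,
# so only the digits at the very end are returned
unescaped <- unlist(str_extract_all(prices, "[0-9]+$"))
unescaped
# [1] "23"
```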

\b[a-z]{1,4}\b

\b marks a word boundary, and [a-z] matches a single lowercase letter. {1,4} requires between one and four of those letters in a row.

raw.data2 <- ".this# sIs is%a t3st hello test!*&*"
samp2 <- unlist(str_extract_all(raw.data2, "\\b[a-z]{1,4}\\b"))
samp2

## [1] "this" "is"   "a"    "test"

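The expression extracts whole words, starting and ending at a word boundary, as long as the word consists only of lowercase letters and is at most four characters long. Symbols, punctuation, and spaces count as word separators, while digits and uppercase letters count as word characters, which is why “sIs”, “t3st”, and the five-letter “hello” are skipped.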
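To show what the \b anchors add, the sketch below runs the same character class without them on the same test string. The variable names test_string, with_boundaries, and without_boundaries are introduced only for this comparison.

```r
library(stringr)

# Same test string as above, under a new (hypothetical) name
test_string <- ".this# sIs is%a t3st hello test!*&*"

# With word boundaries: only whole lowercase words of 1 to 4 letters
with_boundaries <- unlist(str_extract_all(test_string, "\\b[a-z]{1,4}\\b"))
with_boundaries
# [1] "this" "is"   "a"    "test"

# Without word boundaries: fragments of longer or mixed words also match
without_boundaries <- unlist(str_extract_all(test_string, "[a-z]{1,4}"))
without_boundaries
# [1] "this" "s"    "s"    "is"   "a"    "t"    "st"   "hell" "o"    "test"
```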

.*?\.txt$

. matches any character, * allows the preceding item to be matched zero or more times, and the ? that follows makes the * lazy, so it matches as few characters as possible. The \ escapes the period so that only a literal .txt is matched. The $ anchor ensures that the .txt being extracted is at the end of the string.

raw.data3 <- ".txta$%%sdf.txt"
samp3 <- unlist(str_extract_all(raw.data3, ".*?\\.txt$"))
samp3

## [1] ".txta$%%sdf.txt"

This is a useful string manipulation technique for picking out .txt file names. If the string does not end in .txt, nothing is matched: unlist(str_extract_all()) returns an empty character vector, and str_extract would return NA. As long as the string ends in .txt, any number of characters, including none, may come before it.
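The behaviour on non-matching input is easier to see with several strings at once. The file names below are invented for illustration. The sketch uses str_extract, which returns one result per input, and then str_extract_all on a single non-matching string.

```r
library(stringr)

# Hypothetical file names
files <- c("report.txt", "notes.txt.bak", ".txt", "data.csv")

# One result per element; NA where the string does not end in .txt
extracted <- str_extract(files, ".*?\\.txt$")
extracted
# [1] "report.txt" NA           ".txt"       NA

# str_extract_all on a non-matching string gives an empty vector instead
none <- unlist(str_extract_all("data.csv", ".*?\\.txt$"))
length(none)
# [1] 0
```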

\d{2}/\d{2}/\d{4}

This expression matches two digits, a forward slash, two digits, another forward slash, and then four digits.

raw.data4 <- "12/25/2020"
samp4 <- unlist(str_extract_all(raw.data4, "\\d{2}/\\d{2}/\\d{4}"))
samp4

## [1] "12/25/2020"

This could be useful when extracting dates, although the pattern checks only the layout of the digits and not whether the date is valid.
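As a rough sketch of how the pattern behaves on longer text, the example below extracts every date-shaped substring from an invented sentence and then splits each one into its parts with capture groups. The input order_text is hypothetical, and the pattern does not say which of the first two fields is the month. Note that the impossible value 99/99/9999 also matches.

```r
library(stringr)

# Hypothetical input containing two dates and one impossible date
order_text <- "Ordered 12/25/2020, shipped 01/02/2021, code 99/99/9999"

# Every substring with the NN/NN/NNNN layout is returned
dates <- unlist(str_extract_all(order_text, "\\d{2}/\\d{2}/\\d{4}"))
dates
# [1] "12/25/2020" "01/02/2021" "99/99/9999"

# Capture groups split each match: column 1 is the full match,
# columns 2 to 4 are the three numeric fields
parts <- str_match(dates, "(\\d{2})/(\\d{2})/(\\d{4})")
parts[1, ]
# [1] "12/25/2020" "12"         "25"         "2020"
```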

<(.+?)>.+?</\1>

We are looking for an opening tag such as <b>, followed by the shortest possible sequence of characters, followed by a closing tag with the same name, such as </b>. The parentheses capture the tag name, and \1 refers back to it.

raw.data5 <- "A small sentence -2. Another tiny /sentence"
samp5 <- str_extract(raw.data5, "<(.+?)>.+?</\\1>")
samp5

## [1] NA

Initially I look at the expression inside the parentheses. The behavior of “.+” changes once it is followed by a “?”: instead of matching as many characters as possible, it matches the shortest possible sequence of characters. The < and > characters are not metacharacters here, so they match literal angle brackets. The parentheses save the text between the brackets (the tag name) as a group, which is then referenced by \1. The second “.+?” matches the shortest possible content after the opening tag, and </\1> requires a closing tag made up of <, a forward slash, the same tag name, and >. Because raw.data5 contains no angle brackets, nothing matches and str_extract returns NA.
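For comparison, here is a minimal sketch of the same pattern on input that does contain tags. The strings html_demo and mismatched_demo are invented for this example.

```r
library(stringr)

tag_pattern <- "<(.+?)>.+?</\\1>"

# Hypothetical input with two well-formed tag pairs
html_demo <- "<b>bold</b> and <i>italic</i>"
tags <- unlist(str_extract_all(html_demo, tag_pattern))
tags
# [1] "<b>bold</b>"   "<i>italic</i>"

# The closing tag must repeat the captured name, so a mismatch gives NA
mismatched_demo <- "<b>text</i>"
mismatched <- str_extract(mismatched_demo, tag_pattern)
mismatched
# [1] NA
```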