library(stringr)
raw.data <- "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
raw.data
## [1] "555-1239Moe Szyslak(636) 555-0113Burns, C. Montgomery555-6542Rev. Timothy Lovejoy555 8904Ned Flanders636-555-3226Simpson, Homer5553642Dr. Julius Hibbert"
name <- unlist(str_extract_all(raw.data, "[[:alpha:]., ]{2,}"))
name
## [1] "Moe Szyslak" "Burns, C. Montgomery" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Simpson, Homer" "Dr. Julius Hibbert"
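[] - Square brackets enclose a character class, which matches any single character listed inside them. [:alpha:] - Alphabetic characters, a-z and A-Z. “., ” - Inside the brackets the period, the comma and the space are literal characters, so they are matched as well; this keeps titles, initials and “Last, First” names together. {2,} - The preceding item is matched two or more times, which discards single stray characters such as the space inside “555 8904”.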
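To see what each part of the pattern contributes, here is a small sketch on a shortened, made-up piece of the input (the variable names `sample_text`, `letters_only` and `full_names` are only for illustration). It compares a letters-only class with the full class used above.

```r
library(stringr)

# Hypothetical shortened input: two names, each preceded by a phone number
sample_text <- "555-1239Moe Szyslak555 8904Ned Flanders"

# Letters only: every run of letters is its own match, so names split at the space
letters_only <- str_extract_all(sample_text, "[[:alpha:]]+")[[1]]
letters_only
# "Moe" "Szyslak" "Ned" "Flanders"

# Letters, period, comma and space, at least two in a row:
# the single space inside "555 8904" is too short to count as a match
full_names <- str_extract_all(sample_text, "[[:alpha:]., ]{2,}")[[1]]
full_names
# "Moe Szyslak" "Ned Flanders"
```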
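# Swap "Last, First" into "First Last" in two steps: overwrite the first-name part
# with the surname, then overwrite the leading "Surname," with the first name.
name <- str_replace(name, pattern = "C\\. Montgomery", "Burns")
name <- str_replace(name, pattern = "Burns,", "Montgomery C.")
name <- str_replace(name, pattern = "Homer", "Simpson")
name <- str_replace(name, pattern = "Simpson,", "Homer")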
name
## [1] "Moe Szyslak" "Montgomery C. Burns" "Rev. Timothy Lovejoy"
## [4] "Ned Flanders" "Homer Simpson" "Dr. Julius Hibbert"
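The replacements above are written by hand for these two names. A more general approach, sketched here as one possible alternative, is to capture the two parts around the comma and swap them with backreferences. The vector `names_raw` is a shortened copy of the extracted names, and `names_swapped` is an illustrative name. Note that this keeps the first name in its original order, so it gives "C. Montgomery Burns" rather than "Montgomery C. Burns".

```r
library(stringr)

# Shortened copy of the extracted names, two of them in "Last, First" form
names_raw <- c("Moe Szyslak", "Burns, C. Montgomery", "Simpson, Homer")

# Group 1 is everything before ", " and group 2 is everything after it;
# names without a comma do not match and are returned unchanged
names_swapped <- str_replace(names_raw, "^(.+), (.+)$", "\\2 \\1")
names_swapped
# "Moe Szyslak" "C. Montgomery Burns" "Homer Simpson"
```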
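# A lowercase letter before a period marks a title (Rev., Dr.);
# an uppercase letter before a period marks an initial (C.).
title <- str_detect(name, "[[:lower:]][.]")
middle <- str_detect(name, "[[:upper:]][.]")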
data.frame(name, title, middle)
##                   name title middle
## 1          Moe Szyslak FALSE  FALSE
## 2  Montgomery C. Burns FALSE   TRUE
## 3 Rev. Timothy Lovejoy  TRUE  FALSE
## 4         Ned Flanders FALSE  FALSE
## 5        Homer Simpson FALSE  FALSE
## 6   Dr. Julius Hibbert  TRUE  FALSE
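The two checks above rely only on the case of the letter before the period. As a hedged variation, the sketch below anchors the title check to the start of the name and requires the initial to be a single letter. Both rules are assumptions about what counts as a title or an initial, and `title_strict` and `initial_strict` are illustrative names.

```r
library(stringr)

# Names as they look after the cleaning step
name <- c("Moe Szyslak", "Montgomery C. Burns", "Rev. Timothy Lovejoy",
          "Ned Flanders", "Homer Simpson", "Dr. Julius Hibbert")

# Assumption: a title is an abbreviation of two or more letters at the very start
title_strict <- str_detect(name, "^[[:alpha:]]{2,}[.]")

# Assumption: an initial is a single uppercase letter directly before a period
initial_strict <- str_detect(name, "\\b[[:upper:]][.]")

data.frame(name, title_strict, initial_strict)
```

On these six names the result agrees with the table above; the stricter patterns would only behave differently on other inputs.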
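[0-9] matches a single digit, and the + quantifier matches the preceding item one or more times, so [0-9]+ matches a run of consecutive digits. The $ is a metacharacter (it normally anchors the end of the string), so we precede it with \ to match a literal $ symbol. Inside an R string the backslash itself must be doubled, which gives \\$.
The expression therefore extracts every run of digits that is immediately followed by a $ symbol, together with that $. Digits that come after a $, or that are separated from it by another character, are not matched.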
raw.data1 <- "12345$ 1$23 $123 123 abc1-23$ abc$123 123$abc 1a2$3abc 123abc$"
samp1 <- unlist(str_extract_all(raw.data1, "[0-9]+\\$"))
samp1
## [1] "12345$" "1$" "23$" "123$" "2$"
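One way to check the reading of this expression is to rewrite every element of it and confirm that the result does not change. The sketch below swaps [0-9] for [[:digit:]], + for {1,}, and the escaped \\$ for the character class [$]; `original` and `rewritten` are illustrative names.

```r
library(stringr)

raw.data1 <- "12345$ 1$23 $123 123 abc1-23$ abc$123 123$abc 1a2$3abc 123abc$"

# Pattern from the exercise
original <- str_extract_all(raw.data1, "[0-9]+\\$")[[1]]

# Same task with every element replaced by an equivalent:
# [[:digit:]] for [0-9], {1,} for +, and [$] for the escaped \\$
rewritten <- str_extract_all(raw.data1, "[[:digit:]]{1,}[$]")[[1]]

identical(original, rewritten)
# TRUE
```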
\b marks a word boundary, and [a-z] matches a single lowercase letter. {1,4} requires the preceding item to occur between one and four times.
raw.data2 <- ".this# sIs is%a t3st hello test!*&*"
samp2 <- unlist(str_extract_all(raw.data2, "\\b[a-z]{1,4}\\b"))
samp2
## [1] "this" "is" "a" "test"
The expression extracts whole words, starting and ending at a word boundary, as long as the word consists only of lowercase letters and is at most four characters long. Symbols and punctuation count as word separators, but digits and uppercase letters do not, which is why sIs and t3st are skipped.
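The boundary behavior is easier to see when each candidate word is tested on its own. The words in `words` below are made up for this check, and `has_match` is an illustrative name.

```r
library(stringr)

# Hypothetical test words, one per boundary case
words <- c("test", "hello", "t3st", "sIs", "a-b")

# TRUE when the word contains a run of 1-4 lowercase letters
# that has a word boundary on both sides
has_match <- str_detect(words, "\\b[a-z]{1,4}\\b")
has_match
# TRUE FALSE FALSE FALSE TRUE
```

"hello" fails because it has five letters, "t3st" and "sIs" fail because the digit and the uppercase letter are word characters and so create no boundary, and "a-b" succeeds because the hyphen separates two one-letter words.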
. matches any character, and * allows the preceding item to be matched zero or more times; the ? after it makes the quantifier lazy, so it consumes as few characters as possible. The \ escapes the following . so that it matches a literal period, which means only strings containing the pattern .txt are extracted. The $ anchor ensures that the .txt being extracted sits at the end of the string.
raw.data3 <- ".txta$%%sdf.txt"
samp3 <- unlist(str_extract_all(raw.data3, ".*?\\.txt$"))
samp3
## [1] ".txta$%%sdf.txt"
This is a useful pattern for picking out file names that end in .txt. If the .txt is not at the end of the string, nothing is matched: str_extract_all returns an empty result, and str_extract returns NA. As long as the string ends in .txt, it can have any characters, or none at all, before it.
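The difference between a match and no match can be checked on a few file names. The names in `files` are hypothetical, and `extracted` and `no_match` are illustrative variable names.

```r
library(stringr)

# Hypothetical file names: two end in .txt, one only contains it
files <- c("notes.txt", ".txt", "notes.txt.bak")

# str_extract returns one result per input and NA where nothing matches
extracted <- str_extract(files, ".*?\\.txt$")
extracted
# "notes.txt" ".txt" NA

# str_extract_all returns an empty character vector for the same non-matching input
no_match <- unlist(str_extract_all("notes.txt.bak", ".*?\\.txt$"))
length(no_match)
# 0
```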
This expression extracts two digits, a forward slash, two digits, a forward slash, and four digits.
raw.data4 <- "12/25/2020"
samp4 <- unlist(str_extract_all(raw.data4, "\\d{2}/\\d{2}/\\d{4}"))
samp4
## [1] "12/25/2020"
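This could be useful when extracting dates written as mm/dd/yyyy or dd/mm/yyyy. The pattern checks only the layout, not whether the date is valid.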
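If the individual parts of the date are needed, one option is to add capture groups and use str_match, which returns the full match followed by one column per group. This is a sketch rather than a full date parser: the grouped pattern is an addition to the exercise, `parts` and `layout_only` are illustrative names, and "99/99/9999" is a made-up input showing that the layout alone is accepted.

```r
library(stringr)

raw.data4 <- "12/25/2020"

# Same layout as the exercise pattern, with each number in its own group
parts <- str_match(raw.data4, "(\\d{2})/(\\d{2})/(\\d{4})")
parts[1, ]
# "12/25/2020" "12" "25" "2020"

# The pattern only checks the layout, so an impossible date still matches
layout_only <- str_detect("99/99/9999", "\\d{2}/\\d{2}/\\d{4}")
layout_only
# TRUE
```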
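We are looking for a pair of angle brackets enclosing the shortest possible sequence of characters (an opening tag), followed by the shortest possible sequence of any characters, followed by a closing tag: a <, a forward slash, the same sequence that appeared in the opening tag, and a >.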
raw.data5 <- "A small sentence -2. Another tiny /sentence"
samp5 <- str_extract(raw.data5, "<(.+?)>.+?</\\1>")
samp5
## [1] NA
I first look at the expression inside the parentheses. The behavior of “.+” changes once it is followed by a “?”: it becomes lazy and matches the shortest possible sequence of characters. In this pattern the < and > characters are literal angle brackets, not word-boundary markers, so <(.+?)> matches something that looks like an opening tag. The parentheses save the text between the brackets as a group, which is then recalled by the backreference \\1. The “.+?” in the middle of the pattern matches the shortest possible sequence of any characters between the two tags. The pattern ends with </\\1>, that is, a <, a forward slash, the saved text and a >, which looks like the matching closing tag. Because raw.data5 contains no angle brackets, nothing matches and the result is NA.
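To see the pattern succeed, it helps to try it on input that actually contains tag pairs. The strings `html_like` and `mismatched` below are made up for illustration and are not part of the exercise data; `first_tag`, `all_tags` and `no_pair` are illustrative names.

```r
library(stringr)

# Hypothetical input with two well-formed tag pairs
html_like <- "<b>bold text</b> and <i>italic</i>"

# First opening tag, its content, and the closing tag with the same name
first_tag <- str_extract(html_like, "<(.+?)>.+?</\\1>")
first_tag
# "<b>bold text</b>"

# Every tag pair in the string
all_tags <- str_extract_all(html_like, "<(.+?)>.+?</\\1>")[[1]]
all_tags
# "<b>bold text</b>" "<i>italic</i>"

# Hypothetical input where the closing tag does not repeat the opening name
mismatched <- "<b>bold text</i>"
no_pair <- str_extract(mismatched, "<(.+?)>.+?</\\1>")
no_pair
# NA
```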