String Basics


> #First load the requisite packages
> library(tidyverse)
> library(stringr)

You can use single or double quotes to delimit a string. Use single quotes on the outside when the string itself contains double quotation marks.

> string1 <- "How are you?"
> string2 <- 'I like to "quote" people'

To include a literal single or double quote in a string, escape it with a backslash (\). To include a literal backslash, double it (\\).

> double_quote <- "\"" 
> single_quote <- "\'"
> back_slash <- "\\"

The printed representation of a string is not the same as the string itself, because printing shows the escape characters. Use writeLines() to view the raw contents.

> x <- c("\'","\\")
> x
[1] "'"  "\\"
> writeLines(x)
'
\
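
As a small sketch of the difference between the printed form and the raw contents (the string s is a made-up example, not from the text above), the following compares print(), writeLines(), and str_length() on the same value:

```r
library(stringr)

# Hypothetical example string, typed with an escaped backslash and an escaped quote
s <- "a\\b\"c"

print(s)       # printed representation keeps the escapes: "a\\b\"c"
writeLines(s)  # raw contents: a\b"c
str_length(s)  # 5, because the string holds a, \, b, ", c
```

The escapes exist only in the source code and the printed form; the stored string is five characters long.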

A list of some special characters is provided below. For more details see ?"'" for help on Quotes.

Special Character Operation
\n newline
\r carriage return
\t tab
\b backspace
\a alert (bell)
\f form feed
\v vertical tab
\\ backslash \
\' ASCII apostrophe '
\" ASCII quotation mark "
\nnn character with given octal code (1, 2 or 3 digits)
\xnn character with given hex code (1 or 2 hex digits)
\unnnn Unicode character with given code (1–4 hex digits)
\Unnnnnnnn Unicode character with given code (1–8 hex digits)

An example of a Unicode character would be:

> #code for the micro sign µ (U+00B5)
> x <- "\u00b5"
> x
[1] "µ"

Multiple strings can be stored in a character vector.

> c("one","two","three")
[1] "one"   "two"   "three"

You can retrieve the number of characters in each string with str_length().

> str_length(c("it","is very windy today",NA))
[1]  2 19 NA

str_c() will combine strings. The sep= option controls how they are separated.

> str_c("a","b","c","d")
[1] "abcd"
> str_c("a","b","c","d", sep="..")
[1] "a..b..c..d"

Missing values are contagious: if any input is NA, the result is NA. To have a missing value appear as the literal string "NA", wrap the vector in str_replace_na().

> x <- c("abc",NA)
> str_c("|~",x,"~|")
[1] "|~abc~|" NA       
> str_c("|~",str_replace_na(x),"~|")
[1] "|~abc~|" "|~NA~|" 

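str_c() is vectorised, and shorter vectors are recycled to the length of the longest. In the stringr version used here, that works whenever the longer length is a multiple of the shorter one. Note that stringr 1.5.0 and later follow the stricter tidyverse recycling rules, where only vectors of length 1 are recycled, so the second example below raises an error there.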

> #vector length 3 is a multiple of length 1
> str_c("I\'m eating a ",c("ham","bologna","turkey")," sandwich")
[1] "I'm eating a ham sandwich"     "I'm eating a bologna sandwich"
[3] "I'm eating a turkey sandwich" 
> #vector of length 4 is a multiple of length 2
> str_c(c("I\'m eating a ","I hate "),c("ham","bologna","turkey","cheese"))
[1] "I'm eating a ham"    "I hate bologna"      "I'm eating a turkey"
[4] "I hate cheese"      

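Objects of length 0 are dropped. Below, the if statement returns NULL because new_years is FALSE and there is no else branch, so that argument simply disappears.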

> name <- "Brenda"
> time_of_day <- "morning"
> new_years <- FALSE
> 
> str_c("Good ",time_of_day," ",name,
+       if (new_years) " and HAPPY NEW YEAR",".")
[1] "Good morning Brenda."

You can combine a vector of strings into a single string with the collapse argument.

> str_c(c("a","b","c"),collapse="")
[1] "abc"
> str_c(c("a","b","c"),collapse="_")
[1] "a_b_c"
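
sep and collapse can also be used in the same call. A minimal sketch, using two made-up vectors (first and second are illustrative names only):

```r
library(stringr)

first <- c("a", "b", "c")
second <- c("x", "y", "z")

# sep joins the arguments element by element, giving a vector of length 3
pairs <- str_c(first, second, sep = "-")
pairs
# [1] "a-x" "b-y" "c-z"

# collapse then joins that vector into a single string
joined <- str_c(first, second, sep = "-", collapse = " + ")
joined
# [1] "a-x + b-y + c-z"
```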

str_sub() can be used to subset parts of a string.

> x <- c("Budweiser","Heineken","Guinness")
> #start at 1 from the beginning and end at 3
> str_sub(x,1,3)
[1] "Bud" "Hei" "Gui"
> #negative numbers start from the end
> str_sub(x,-3,-1)
[1] "ser" "ken" "ess"
> #You can also use the assignment form of str_sub() to modify strings
> str_sub(x,1,1) <- str_to_lower(str_sub(x,1,1))
> x
[1] "budweiser" "heineken"  "guinness" 
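
str_sub() does not fail when a string is shorter than the requested range; it returns as much as it can. A short sketch with a made-up vector:

```r
library(stringr)

x <- c("Budweiser", "Ale")

# Requesting characters 1 to 5 from a 3-character string returns what is there
short_ok <- str_sub(x, 1, 5)
short_ok
# [1] "Budwe" "Ale"

# A start position past the end of the string yields an empty string
past_end <- str_sub(x, 7, 9)
past_end
# [1] "ser" ""
```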

Sometimes there are easier ways to change case.

> dog <- "The quick brown dog"
> str_to_upper(dog)
[1] "THE QUICK BROWN DOG"
> str_to_lower(dog)
[1] "the quick brown dog"
> str_to_title(dog)
[1] "The Quick Brown Dog"
> str_to_sentence("the quick brown dog")
[1] "The quick brown dog"

It’s important to note that case conversion and sorting rules vary between languages, so the output can depend on the locale. You can set it explicitly with the locale argument.

> x <- c("apple","eggplant","banana")
> str_sort(x,locale="en") #English
[1] "apple"    "banana"   "eggplant"
> str_sort(x,locale="haw") #Hawaiian
[1] "apple"    "eggplant" "banana"  
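
Case conversion is locale-sensitive in the same way. As a sketch, Turkish distinguishes a dotted and a dotless i; the characters are written with Unicode escapes here so the example does not depend on file encoding:

```r
library(stringr)

# "i" and the dotless i (U+0131)
x <- c("i", "\u0131")

upper_en <- str_to_upper(x)                 # English rules: "I" "I"
upper_tr <- str_to_upper(x, locale = "tr")  # Turkish rules: U+0130 (dotted capital I) and "I"

upper_en
upper_tr
```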

Exercises

  1. In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?
  • Both are roughly equivalent to str_c().
  • paste() has a sep argument. The default is sep=" ".
  • paste0() has no sep argument; it always joins with the empty string, like str_c().
  • paste() and paste0() convert NA to the string "NA". str_c() returns NA unless you use str_replace_na().
> #Need to add sep=" " for a space
> str_c("I\'m eating a",c("ham","bologna","turkey"),"sandwich", sep=" ")
[1] "I'm eating a ham sandwich"     "I'm eating a bologna sandwich"
[3] "I'm eating a turkey sandwich" 
> #default is sep=" ".
> paste("I\'m eating a",c("ham","bologna","turkey"),"sandwich")
[1] "I'm eating a ham sandwich"     "I'm eating a bologna sandwich"
[3] "I'm eating a turkey sandwich" 
> #paste0() has no sep argument, so " " is treated as one more string to join
> paste0("I\'m eating a",c("ham","bologna","turkey"),"sandwich", sep=" ")
[1] "I'm eating ahamsandwich "     "I'm eating abolognasandwich "
[3] "I'm eating aturkeysandwich " 
> x <- c("abc",NA)
> #NA is contagious: the second result is NA
> str_c("|~",x,"~|")
[1] "|~abc~|" NA       
> #Includes NA with default space
> paste("|~",x,"~|")
[1] "|~ abc ~|" "|~ NA ~|" 
> #Includes NA with no spaces
> paste0("|~",x,"~|")
[1] "|~abc~|" "|~NA~|" 
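  2. In your own words, describe the difference between the sep and collapse arguments to str_c().
  • sep is the string inserted between the arguments being joined, element by element; the result is still a vector.
  • collapse is the string used to join the elements of the resulting vector into a single string.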
> str_c("a","b","c","d")
[1] "abcd"
> #Individual strings
> str_c("a","b","c","d", sep="..")
[1] "a..b..c..d"
> #Vector of strings
> str_c(c("a","b","c","d"), collapse="..")
[1] "a..b..c..d"
  3. Use str_length() and str_sub() to extract the middle characters from a string. What will you do if the string has an even number of characters?
  • I used %/% (integer division) and %% (modulus - remainder).
  • For even I took the middle two.
> returns_middle <- function(x) {
+ #returns middle
+   # If odd remainder = 1
+ ifelse(str_length(x)%%2==1,
+        #If odd take middle
+        str_sub(x,(str_length(x)%/%2+str_length(x)%%2),
+                (str_length(x)%/%2+str_length(x)%%2)),
+        #If even take middle 2
+        str_sub(x,(str_length(x)%/%2),
+                (str_length(x)%/%2+1))
+        )
+ 
+ }
> 
> x <- "Breguet"
> returns_middle(x)
[1] "g"
> x <- "Longines"
> returns_middle(x)
[1] "gi"
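  4. What does str_wrap() do? When might you want to use it?
  • It wraps text into lines of roughly width characters by inserting line breaks, which is useful when formatting long strings for display.
  • indent sets the indentation of the first line. exdent sets the indentation of the remaining lines.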
> str_wrap('You\'ve Got To Ask Yourself One Question: 
+          "Do I Feel Lucky?" Well, Do Ya, Punk?.', 
+          width = 20, indent = 3, exdent = 1) %>% 
+   writeLines()
   You've Got To
 Ask Yourself One
 Question: "Do I Feel
 Lucky?" Well, Do Ya,
 Punk?.
  5. What does str_trim() do? What’s the opposite of str_trim()?
  • str_trim() will remove whitespace from the left, right, or both sides.
  • str_squish() will remove whitespace from both sides and reduce repeated whitespace in the middle to a single space.
  • str_pad() is the opposite. It will add spaces on the left, right, or both. You specify the total string length, including the added spaces.
> # default remove from both
> str_trim("  String with trailing and leading white space   ")
[1] "String with trailing and leading white space"
> #remove from both sides and middle
> str_squish("  String with trailing,  middle, and leading white space   ")
[1] "String with trailing, middle, and leading white space"
> #total string size of 9.  Spaces added on both sides
> str_pad("omega", 9,"both")
[1] "  omega  "
  6. Write a function that turns (e.g.) a vector c("a","b","c") into the string "a, b, and c". Think carefully about what it should do if given a vector of length 0,1, or 2.
  • If length() < 2, just return the input unchanged.
  • If length() == 2, join the two elements with " and " and no comma.
  • If length() > 2, join all but the last element with ", " and then append ", and " plus the last element.
> vector_to_string <- function(x) {
+   n <- length(x)
+ 
+   # length 0 or 1: nothing to join
+   if (n < 2) return(x)
+ 
+   # length 2: no comma needed
+   if (n == 2) return(str_c(x[1]," and ",x[2]))
+ 
+   # length 3 or more
+   first <- str_c(x[-n],collapse=", ") # all but the last, comma separated
+   str_c(first,", and ",x[n]) # add ", and" plus the last string
+ }
> 
> x <- c("a", "b", "c")
> vector_to_string(x)
[1] "a, b, and c"
> x <- c("a", "b")
> vector_to_string(x)
[1] "a and b"
> x <- c("a")
> vector_to_string(x)
[1] "a"
> x <- c("Omega", "Breguet", "Panerai", "Rolex", "Blancpain")
> vector_to_string(x)
[1] "Omega, Breguet, Panerai, Rolex, and Blancpain"

Matching Patterns with Regular Expressions


Regular expressions (regexps) enable you to match patterns in strings. str_view() and str_view_all() provide a way to visualize the matches.

For example, to match the text “an” in a vector of strings:

> x <- c("apple","banana","pear")
> str_view(x,"an")

To match any character you can use a period.

> str_view(x,".a.")

So how would you match a literal period? You would need to escape it with a backslash. But a backslash is a special character in a string and would also need to be escaped.

Basically, you want your regular expression to match the output of writeLines(). So if you want to escape a period (\.), the string version of your regular expression needs to be written as "\\.".

> writeLines("\\.")
\.
> str_view(c("abc","a.c","bef"),"a\\.c")

To match a literal backslash, the regular expression itself must be \\, and each of those backslashes must then be escaped in the string, giving "\\\\". Remember, the raw output (writeLines()) of "\\" is a single \.

> writeLines("\\\\")
\\
> x <- c("x","a\\b","y")
> str_view(x,"\\\\")
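
If the doubled backslashes become hard to read, raw strings are one way around them. This sketch assumes R 4.0.0 or later, where the r"(...)" syntax is available; inside it backslashes are taken literally, so the text typed is exactly the regular expression. str_detect() is used instead of str_view() only so that the result prints as plain TRUE/FALSE.

```r
library(stringr)

# Raw string syntax r"(...)" (assumes R >= 4.0.0): no string-level escaping
dot_pattern <- r"(a\.c)"
identical(dot_pattern, "a\\.c")
# [1] TRUE

dot_hits <- str_detect(c("abc", "a.c", "bef"), dot_pattern)
dot_hits
# [1] FALSE  TRUE FALSE

# A literal backslash: the regular expression \\ written without doubling it again
x <- c("x", "a\\b", "y")
slash_hits <- str_detect(x, r"(\\)")
slash_hits
# [1] FALSE  TRUE FALSE
```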

Exercises

  1. Explain why each of these strings don’t match a \: "\", "\\", "\\\".
  • "\" is not a complete string: the backslash escapes the closing quote.
  • "\\" is the string \, which as a regular expression is an escape with nothing after it, so it is an error.
  • "\\\" is again incomplete: the first two backslashes produce a raw \ and the third escapes the closing quote.
  • To produce a raw \ a string needs \\. A regular expression needs to escape the raw \ as well, so the string needs \\\\.
> writeLines("a\\b") #string output
a\b
> writeLines("a\\\\b") #regular expression
a\\b
  2. How would you match the sequence "'\?
  • Quotes are not special in a regular expression, so only the backslash needs a regex escape (\\). In the R string, the double quote and both backslashes must each be escaped.
> x <- "a\"'\\c"
> writeLines(x)
a"'\c
> writeLines("\"'\\\\")
"'\\
> str_view(x,"\"'\\\\")
  3. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?
  • Pattern - period, any character, period, any character, period, any character.
  • Each backslash needs to be escaped in a string.
> x <- "eerx.t.f.jjg"
> writeLines(x)
eerx.t.f.jjg
> writeLines("\\..\\..\\..")
\..\..\..
> str_view(x,"\\..\\..\\..")

Anchors

You can set the regular expression to match from the start or end of a string.

  • ^ to match from the start
  • $ to match from the end
> x <- c("apple","banana","pear")
> str_view(x,"^a")
> str_view(x,"a$")

Use both symbols to match the string exactly.

> x <- c("apple pie","apple","apple cake")
> str_view(x,"apple")
> str_view(x,"^apple$")
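
Since str_view() output is visual, the same anchor behaviour can be checked with str_detect(), which prints plain TRUE/FALSE values. A minimal sketch on the vector above:

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

unanchored <- str_detect(x, "apple")    # found anywhere: TRUE TRUE TRUE
exact      <- str_detect(x, "^apple$")  # anchored at both ends: FALSE TRUE FALSE
ends_pie   <- str_detect(x, "pie$")     # anchored at the end: TRUE FALSE FALSE

unanchored
exact
ends_pie
```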

Anchors - Exercises

  1. How would you match the literal string "$^$"?
  • Each symbol and each backslash would need to be escaped.
> x <- "abc$^$xyz"
> writeLines("\\$\\^\\$")
\$\^\$
> str_view(x,"\\$\\^\\$")
  2. Given the corpus of common words in stringr::words, create regular expressions that find all the words that:
     1. Start with “y”.
> str_view(words,"^y", match=TRUE)
     2. End with “x”.
> str_view(words,"x$", match=TRUE)
     3. Are exactly three letters long.
> str_subset(words,"^...$")
  [1] "act" "add" "age" "ago" "air" "all" "and" "any" "arm" "art" "ask" "bad"
 [13] "bag" "bar" "bed" "bet" "big" "bit" "box" "boy" "bus" "but" "buy" "can"
 [25] "car" "cat" "cup" "cut" "dad" "day" "die" "dog" "dry" "due" "eat" "egg"
 [37] "end" "eye" "far" "few" "fit" "fly" "for" "fun" "gas" "get" "god" "guy"
 [49] "hit" "hot" "how" "job" "key" "kid" "lad" "law" "lay" "leg" "let" "lie"
 [61] "lot" "low" "man" "may" "mrs" "new" "non" "not" "now" "odd" "off" "old"
 [73] "one" "out" "own" "pay" "per" "put" "red" "rid" "run" "say" "see" "set"
 [85] "sex" "she" "sir" "sit" "six" "son" "sun" "tax" "tea" "ten" "the" "tie"
 [97] "too" "top" "try" "two" "use" "war" "way" "wee" "who" "why" "win" "yes"
[109] "yet" "you"
     4. Have seven letters or more.
> # too many to display
> str_view(words,".......", match=TRUE)

Character Classes and Alternatives

There are some other special patterns.

  • \d matches any digit
  • \s matches any whitespace
  • [abc] matches a, b, or c
  • [^abc] matches anything except a, b, or c

To create the regular expression the backslash needs to be escaped, so it would be "\\d".

You can also use alternation to pick between patterns: hat|d..r matches “hat” or any d, two characters, and r, such as “deer”.

> str_view(c("grey","gray"),"gr(e|a)y")
> str_view(c("grey","gray"),"gr[ea]y")
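
The classes and alternation listed above can be tried on a few made-up strings. This is only an illustrative sketch; x and y are hypothetical inputs:

```r
library(stringr)

x <- c("room 101", "lobby", "tab\there")

has_digit <- str_detect(x, "\\d")            # contains a digit: TRUE FALSE FALSE
no_space  <- str_replace_all(x, "\\s", "_")  # space and tab both count as whitespace
first_non <- str_extract(x, "[^a-z]")        # first character that is not a lowercase letter

has_digit
no_space
# [1] "room_101" "lobby"    "tab_here"
first_non
# [1] " "  NA   "\t"

# Alternation: "hat", or d + any two characters + r
y <- c("hat", "deer", "door", "dr")
alt <- str_detect(y, "hat|d..r")
alt
# [1]  TRUE  TRUE  TRUE FALSE
```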

Character Classes and Alternatives - Exercises

  1. Create regular expressions to find all words that:
     1. Start with a vowel.
> #too many to display
> str_view(words,"^[aeiou]", match=TRUE)
     2. Only contain consonants.
> #anchored with ^ and $. [^aeiou] is any non-vowel. + is 1 or more times
> str_view(words, "^[^aeiou]+$", match=TRUE)
     3. End with ed but not eed.
> str_view(words, "([^e]ed)$", match=TRUE) 
     4. End with ing or ize.
> str_view(words, "(ing)$|(ize)$", match=TRUE) 
  2. Empirically verify the rule “i before e except after c”.
  • Show anything but c followed by ie and cei.
> str_view(words, "[^c]ie|cei", match=TRUE) 
  • Show what ei comes after.
> str_view(words, ".ei", match=TRUE) 
  3. Is “q” always followed by a “u”?
  • There are no matches in stringr::words, so yes for this word list.
> str_view(words, "q[^u]", match=TRUE) #nothing
  4. Write a regular expression that matches a word if it’s probably written in British English, not American English.
> str_view(words, "(l|b)our", match=TRUE)
  5. Create a regular expression that will match telephone numbers as commonly written in your country.
> x <- "Jenny (800) 867-5309"
> writeLines("\\(\\d\\d\\d\\)\\s\\d\\d\\d-\\d\\d\\d\\d")
\(\d\d\d\)\s\d\d\d-\d\d\d\d
> str_view(x, "\\(\\d\\d\\d\\)\\s\\d\\d\\d-\\d\\d\\d\\d")

Repetition

It is often useful to match a pattern a specific number of times.

  • ? - 0 or 1 times
  • + - 1 or more times
  • * - 0 or more times
> x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
> str_view(x,"CC") #first match
> str_view(x,"CC+") #C followed by 1 or more C
> str_view(x,"C[LX]+") #(L or X) 1 or more times
> # u 0 or 1 times
> x <- c("color","colour")
> str_view(x,"colou?r")

Or, you can use braces {} to be more explicit.

  • {n} - exactly n times
  • {n,} - n or more times
  • {,m} - at most m times
  • {n,m} - n to m times
> x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
> str_view(x,"C{2}") # 2 times
> str_view(x,"C{2,}")# 2 or more times
> str_view(x,"C{2,3}") # 2 to 3 times

By default these quantifiers are greedy: they match the longest string possible. To make them lazy, so that they match the shortest string possible, add a ? after them.

> # 2 to 3 times, but show shortest match
> str_view(x,"C{2,3}?")
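
To see the greedy and lazy results as text rather than as a highlighted view, str_extract() can be applied to the same string. A small sketch:

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

greedy_c  <- str_extract(x, "C{2,3}")   # takes as many as allowed: "CCC"
lazy_c    <- str_extract(x, "C{2,3}?")  # stops at the minimum: "CC"
greedy_lx <- str_extract(x, "C[LX]+")   # "CLXXX"
lazy_lx   <- str_extract(x, "C[LX]+?")  # "CL"

c(greedy_c, lazy_c, greedy_lx, lazy_lx)
```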

Repetition - Exercises

  1. Describe the equivalents of ?, +, * in {m,n} form.
  • ? = {0,1}
  • + = {1,}
  • * = {0,}
  2. Describe in words what these regular expressions match (some are regular expressions and some are strings that represent regular expressions).
     1. ^.*$
> #matches any entire string that contains no newline
> x <- c("aeiouy","sometimes y")
> str_view(x,"^.*$") 
     2. "\\{.+\\}"
> #a pair of curly braces with 1 or more
> #characters between them
> x <- c("look{at what is}here","single{}")
> str_view(x,"\\{.+\\}") 
     3. \d{4}-\d{2}-\d{2}
> # 4 digits - 2 digits - 2 digits
> x <- c("888-7-6", "8888-77-66","88888-777-666")
> str_view(x,"\\d{4}-\\d{2}-\\d{2}") 
     4. "\\\\{4}"
> #matches 4 backslashes
> x <- c("\\\\\\", "\\\\\\\\","\\\\\\\\\\")
> str_view(x,"\\\\{4}") 
  3. Create regular expressions to find all words that:
     1. Start with three consonants.
> str_subset(words,"^([^aeiou]{3})") 
 [1] "Christ"    "Christmas" "dry"       "fly"       "mrs"       "scheme"   
 [7] "school"    "straight"  "strategy"  "street"    "strike"    "strong"   
[13] "structure" "system"    "three"     "through"   "throw"     "try"      
[19] "type"      "why"      
     2. Have three or more vowels in a row.
> str_view(words,"[aeiou]{3,}", match=TRUE) 
     3. Have two or more vowel-consonant pairs in a row.
> #too many to display
> str_view(words,"([aeiou][^aeiou]){2,}", match=TRUE) 

Grouping and Backreferences

You can use parentheses to create a group. To match the same text that a group matched earlier, use a backreference: \1 refers to the text matched by the first group, \2 to the second, and so on.

The following example matches any two characters as a group, followed by the same two characters again (not just any two characters).

> str_view(fruit,"(..)\\1", match = TRUE)
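
The effect of the backreference is easier to see next to a pattern without one. A minimal sketch on three made-up words:

```r
library(stringr)

x <- c("banana", "coconut", "apple")

has_repeat <- str_detect(x, "(..)\\1")   # TRUE TRUE FALSE
repeated   <- str_extract(x, "(..)\\1")  # "anan" "coco" NA

# Without the backreference, (..)(..) accepts any four characters
any_four   <- str_extract(x, "(..)(..)") # "bana" "coco" "appl"

has_repeat
repeated
any_four
```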

Grouping and Backreferences - Exercises

  1. Describe, in words, what these expressions will match.
     1. (.)\1\1
  • The same character three times in a row
> x <- c("aa","bbb","dccc")
> str_view(x,"(.)\\1\\1")
     2. "(.)(.)\\2\\1"
  • Character 1, Character 2, Character 2, Character 1
> str_view(words,"(.)(.)\\2\\1", match=TRUE)
     3. (..)\1
  • A pair of characters that is immediately repeated
> str_view(words,"(..)\\1", match=TRUE)
     4. "(.).\\1.\\1"
  • character 1, anything, character 1, anything, character 1
> str_view(words,"(.).\\1.\\1", match = TRUE)
     5. "(.)(.)(.).*\\3\\2\\1"
  • ch 1, ch2, ch3, anything or nothing, ch 3, ch2, ch1
> str_view(words,"(.)(.)(.).*\\3\\2\\1", match=TRUE)
  2. Construct regular expressions to match words that:
     1. Start and end with the same character.
> str_subset(words,"^(.).*\\1$")
 [1] "america"    "area"       "dad"        "dead"       "depend"    
 [6] "educate"    "else"       "encourage"  "engine"     "europe"    
[11] "evidence"   "example"    "excuse"     "exercise"   "expense"   
[16] "experience" "eye"        "health"     "high"       "knock"     
[21] "level"      "local"      "nation"     "non"        "rather"    
[26] "refer"      "remember"   "serious"    "stairs"     "test"      
[31] "tonight"    "transport"  "treat"      "trust"      "window"    
[36] "yesterday" 
     2. Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).
> str_subset(words,"(..).*\\1")
 [1] "appropriate" "church"      "condition"   "decide"      "environment"
 [6] "london"      "paragraph"   "particular"  "photograph"  "prepare"    
[11] "pressure"    "remember"    "represent"   "require"     "sense"      
[16] "therefore"   "understand"  "whether"    
     3. Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).
> str_subset(words,"(.).*\\1.*\\1")
 [1] "appropriate" "available"   "believe"     "between"     "business"   
 [6] "degree"      "difference"  "discuss"     "eleven"      "environment"
[11] "evidence"    "exercise"    "expense"     "experience"  "individual" 
[16] "paragraph"   "receive"     "remember"    "represent"   "telephone"  
[21] "therefore"   "tomorrow"   

Tools


In addition to just matching patterns in a string we can:

  • Find their positions
  • Extract their contents
  • Replace them with new values
  • Use them to split strings

Detect Matches

str_detect() returns a logical vector that tells you whether each element of a character vector matches a pattern.

> x <- c("apple","banana","pear")
> str_detect(x,"e")
[1]  TRUE FALSE  TRUE

Since TRUE counts as 1 and FALSE as 0 in arithmetic, we can find sums, means, etc.

> # Number of words that start with t
> sum(str_detect(words,"^t"))
[1] 65
> # Proportion of words that end in a vowel
> mean(str_detect(words,"[aeiou]$"))
[1] 0.2765306

There are usually multiple ways of solving a problem, but it’s best to keep it simple where possible.

> #all words with no vowels
> no_vowels_1 <- !str_detect(words,"[aeiou]")
> no_vowels_2 <- str_detect(words,"^[^aeiou]+$")
> identical(no_vowels_1,no_vowels_2)
[1] TRUE

str_subset() is an easier way to extract the elements that match a pattern.

> words[str_detect(words,"x$")]
[1] "box" "sex" "six" "tax"
> str_subset(words,"x$")
[1] "box" "sex" "six" "tax"

When working with a data frame use filter() with str_detect().

> df <- tibble(
+   word=words,
+   i=seq_along(word)
+ )
> 
> df %>% 
+   filter(str_detect(word,"x$"))
# A tibble: 4 x 2
  word      i
  <chr> <int>
1 box     108
2 sex     747
3 six     772
4 tax     841

str_detect() only returns a TRUE or FALSE, but sometimes you need to determine how many times a pattern matches. In this case you can use str_count().

> x <- c("apple","banana","pear")
> str_count(x,"a")
[1] 1 3 1
> # average number of vowels per word
> mean(str_count(words,"[aeiou]"))
[1] 1.991837

Used with mutate():

> df %>% 
+   mutate(
+     vowels = str_count(word,"[aeiou]"),
+     consonants = str_count(word,"[^aeiou]")
+   )
# A tibble: 980 x 4
   word         i vowels consonants
   <chr>    <int>  <int>      <int>
 1 a            1      1          0
 2 able         2      2          2
 3 about        3      3          2
 4 absolute     4      4          4
 5 accept       5      2          4
 6 account      6      3          4
 7 achieve      7      4          3
 8 across       8      2          4
 9 act          9      1          2
10 active      10      3          3
# ... with 970 more rows

Note that matches do not overlap. In the following example you could circle “aba” 3 times, but since the middle one would overlap it is not counted.

> str_count("abababa","aba")
[1] 2
> str_view_all("abababa","aba")

Detect Matches - Exercises

  1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.
     1. Find all words that start or end with x.
> x <- c("box","six","xenon","brexit")
> str_subset(x,"^x|x$")
[1] "box"   "six"   "xenon"
> x[str_detect(x,"^x")|str_detect(x,"x$")]
[1] "box"   "six"   "xenon"
     2. Find all words that start with a vowel and end with a consonant.
> #displaying first 10
> str_subset(words,"^[aeiou].*[^aeiou]$")[1:10]
 [1] "about"   "accept"  "account" "across"  "act"     "actual"  "add"    
 [8] "address" "admit"   "affect" 
> words[str_detect(words,"^[aeiou]")&
+       str_detect(words,"[^aeiou]$")][1:10]
 [1] "about"   "accept"  "account" "across"  "act"     "actual"  "add"    
 [8] "address" "admit"   "affect" 
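     3. Are there any words that contain at least one of each different vowel?
  • There are no words that contain all vowels.
  • A single regular expression needs one lookahead per vowel, because the vowels can appear in any order.
> str_subset(words,"^(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)")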
character(0)
> words[str_detect(words,"a")&str_detect(words,"e")&
+       str_detect(words,"i")&str_detect(words,"o")&
+       str_detect(words,"u")]
character(0)
  2. What word has the highest number of vowels? What word has the highest proportion of vowels?
  • 5 is the max number of vowels.
  • a has the highest proportion of vowels at 100%.
> max(str_count(words,"[aeiou]"))
[1] 5
> words[str_count(words,"[aeiou]")==max(str_count(words,"[aeiou]"))]
[1] "appropriate" "associate"   "available"   "colleague"   "encourage"  
[6] "experience"  "individual"  "television" 
> vow_prop <- str_count(words, "[aeiou]") / str_count(words,".")
> max_prop <- max(vow_prop)
> words[vow_prop== max_prop]
[1] "a"

Extract Matches

str_extract() will extract the actual text of a match. It can be demonstrated with the sentences data set.

> length(sentences)
[1] 720
> head(sentences)
[1] "The birch canoe slid on the smooth planks." 
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."     
[4] "These days a chicken leg is a rare dish."   
[5] "Rice is often served in round bowls."       
[6] "The juice of lemons makes fine punch."      

Start by creating a regular expression:

> colors <- c("red","orange","yellow","green","blue","purple")
> color_match <- str_c(colors,collapse="|")
> color_match
[1] "red|orange|yellow|green|blue|purple"

Then use the regular expression to filter and extract the matches:

> has_color <- str_subset(sentences,color_match)
> matches <- str_extract(has_color,color_match)
> head(matches)
[1] "blue" "blue" "red"  "red"  "red"  "blue"

Unfortunately str_extract() will only extract the first match in each string. We can see below that it only retrieved “blue”, “green”, and “orange”.

> more <- sentences[str_count(sentences, color_match)>1]
> str_view_all(more, color_match)
> str_extract(more,color_match)
[1] "blue"   "green"  "orange"

However, str_extract_all() will extract all matches. It returns a list, but with simplify=TRUE it will return a matrix.

> str_extract_all(more,color_match)
[[1]]
[1] "blue" "red" 

[[2]]
[1] "green" "red"  

[[3]]
[1] "orange" "red"   
> str_extract_all(more,color_match,simplify = TRUE)
     [,1]     [,2] 
[1,] "blue"   "red"
[2,] "green"  "red"
[3,] "orange" "red"

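The number of columns in the matrix equals the largest number of matches found in any one string. Rows with fewer matches are padded with empty strings.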

> x <- c("a","a,b","a,b,c")
> str_extract_all(x,"[a-z]",simplify = TRUE)
     [,1] [,2] [,3]
[1,] "a"  ""   ""  
[2,] "a"  "b"  ""  
[3,] "a"  "b"  "c" 

Extract Matches - Exercises

  1. In the text’s example the regular expression matched “flickered”, which is not a color. Modify the regex to fix the problem.

Matches flickered

Solution

  • Want it to match a space before a lowercase color.
  • Want it to match a space after an uppercase letter (sentence start).
  • Need to trim the space when extracting.
> x <- c("The green light in the brown box flickered.",
+        "Greenson likes red.",
+        "Green is a great color.")
> 
> colors <- c("red","orange","yellow","green","blue","purple","brown")
> colors2 <- str_to_title(colors)
> 
> y <- str_c("\\s",colors)
> z <- str_c(colors2,"\\s")
> aa <- c(y,z)
> color_match <- str_c(aa,collapse="|")
> color_match
[1] "\\sred|\\sorange|\\syellow|\\sgreen|\\sblue|\\spurple|\\sbrown|Red\\s|Orange\\s|Yellow\\s|Green\\s|Blue\\s|Purple\\s|Brown\\s"
> str_view_all(x,color_match)
> str_trim(str_extract_all(x,color_match) %>% unlist())
[1] "green" "brown" "red"   "Green"
  2. From the Harvard sentences data, extract:
     1. The first word from each sentence.
  • Match one or more characters that are not a space.
  • str_extract() returns only the first match, which is the first word.
> x <- head(sentences)
> str_view(x,"[^ ]+")
> str_extract(x,"[^ ]+")
[1] "The"   "Glue"  "It's"  "These" "Rice"  "The"  
     2. All words ending in ing.
  • Don’t want it to match “springs”, “dinged”, “king’s”
  • One or more non-space characters, then “ing”, then a character that is not a to z or an apostrophe
  • The match will include the trailing space or period. Trim the space and then extract only letters to exclude the period.
> x <- head(sentences,50)
> str_view(x,"[^ ]+ing[^a-z']",match=TRUE)
> ab <- str_extract_all(sentences,"[^ ]+ing[^a-z']") %>% unlist
> ab <- str_trim(ab)
> str_extract_all(ab,"[a-zA-Z]+")%>% unlist %>% unique
 [1] "spring"    "evening"   "morning"   "winding"   "living"    "king"     
 [7] "Adding"    "making"    "raging"    "playing"   "sleeping"  "ring"     
[13] "glaring"   "sinking"   "dying"     "Bring"     "lodging"   "filing"   
[19] "wearing"   "wading"    "swing"     "nothing"   "sing"      "painting" 
[25] "walking"   "bring"     "shipping"  "puzzling"  "landing"   "thing"    
[31] "waiting"   "whistling" "timing"    "changing"  "drenching" "moving"   
[37] "working"  
     3. All plurals.
  • Not a great solution. Everything that ends in s. Not apostrophe s.
> x <- head(sentences,10)
> str_view_all(x,"[A-Za-z]+s[^a-z]",match=TRUE)
> ab <- str_extract_all(sentences,"[A-Za-z]+s[^a-z]") %>% unlist()
> ab <- str_trim(ab)
> str_extract_all(ab,"[a-zA-Z]+") %>% unlist() %>% 
+   unique() %>% head(20)
 [1] "planks"    "days"      "is"        "bowls"     "lemons"    "makes"    
 [7] "was"       "hogs"      "hours"     "us"        "stockings" "helps"    
[13] "pass"      "fires"     "across"    "bonds"     "Press"     "pants"    
[19] "useless"   "gas"      
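
For the first exercise in this set (the “flickered” false match), a more compact alternative to matching spaces and trimming them is to wrap the colors in word boundaries. This is a sketch of that different technique, not the solution used above; it assumes that case-insensitive matching is acceptable for sentence-initial colors.

```r
library(stringr)

x <- c("The green light in the brown box flickered.",
       "Greenson likes red.",
       "Green is a great color.")

colors <- c("red", "orange", "yellow", "green", "blue", "purple", "brown")

# \\b marks a word boundary, so "red" inside "flickered" no longer matches
color_match <- str_c("\\b(", str_c(colors, collapse = "|"), ")\\b")

# ignore_case = TRUE lets a sentence-initial "Green" match as well
found <- unlist(str_extract_all(x, regex(color_match, ignore_case = TRUE)))
found
# [1] "green" "brown" "red"   "Green"
```

Because the boundaries are zero-width, nothing needs to be trimmed from the extracted text, and “Greenson” is still rejected.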

Grouped Matches

By creating separate groups with parentheses you can extract each component. For example, nouns tend to follow “a” or “the”. You could match “(a|the)” as one group, followed by a space, followed by a word. A word can be represented as a non-space 1 or more times “([^ ]+)”.

First, we can extract the matches from the first ten sentences that have a match.

> noun <- "(a|the) ([^ ]+)"
> has_noun <- sentences %>% 
+   str_subset(noun) %>% head(10)
> has_noun %>% 
+   str_extract(noun)
 [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked"
 [6] "the sun"    "the huge"   "the ball"   "the woman"  "a helps"   

Next, str_match() will return a matrix with the complete match along with each group. If you want all matches for each string, rather than just the first, you would use str_match_all().

> has_noun %>% 
+   str_match(noun)
      [,1]         [,2]  [,3]     
 [1,] "the smooth" "the" "smooth" 
 [2,] "the sheet"  "the" "sheet"  
 [3,] "the depth"  "the" "depth"  
 [4,] "a chicken"  "a"   "chicken"
 [5,] "the parked" "the" "parked" 
 [6,] "the sun"    "the" "sun"    
 [7,] "the huge"   "the" "huge"   
 [8,] "the ball"   "the" "ball"   
 [9,] "the woman"  "the" "woman"  
[10,] "a helps"    "a"   "helps"  
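
The difference between str_match() and str_match_all() shows up on a string with more than one match. A minimal sketch with two made-up sentences (x is hypothetical input, noun is the pattern from above):

```r
library(stringr)

noun <- "(a|the) ([^ ]+)"
x <- c("the cat saw a dog", "no article here")

# str_match(): first match per string; a string without a match gives a row of NA
first_only <- str_match(x, noun)
first_only

# str_match_all(): a list with one matrix per string, one row per match
all_matches <- str_match_all(x, noun)
all_matches[[1]]
```

The first matrix has one row per input string, while all_matches[[1]] has one row for "the cat" and one for "a dog".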

With a tibble and tidyr::extract you can add new columns for each group.

> tibble(sentence=sentences) %>% 
+   tidyr::extract(
+     sentence,c("article","noun"),"(a|the) ([^ ]+)",
+     remove=FALSE
+   )
# A tibble: 720 x 3
   sentence                                    article noun   
   <chr>                                       <chr>   <chr>  
 1 The birch canoe slid on the smooth planks.  the     smooth 
 2 Glue the sheet to the dark blue background. the     sheet  
 3 It's easy to tell the depth of a well.      the     depth  
 4 These days a chicken leg is a rare dish.    a       chicken
 5 Rice is often served in round bowls.        <NA>    <NA>   
 6 The juice of lemons makes fine punch.       <NA>    <NA>   
 7 The box was thrown beside the parked truck. the     parked 
 8 The hogs were fed chopped corn and garbage. <NA>    <NA>   
 9 Four hours of steady work faced us.         <NA>    <NA>   
10 Large size in stockings is hard to sell.    <NA>    <NA>   
# ... with 710 more rows

Grouped Matches - Exercises

  1. Find all words that come after a “number” like “one”,“two”,“three”, etc. Pull out the number and the word.
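  • A number word from one to ten, a space, then one or more characters that are neither a space nor a period.
  • The pattern is built with str_c() so that it can span two lines of code without putting a line break inside the regular expression.
  • There is no word boundary in front of the number, so it can also match the end of a longer word: “often served” yields “ten served”.
> number_after <- str_c("(one|two|three|four|five|six|seven|eight|",
+                       "nine|ten) ([^ .]+)")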
> has_num <- sentences %>% 
+   str_subset(number_after)
> 
> has_num %>% 
+   str_match(number_after) %>% head(15)
      [,1]          [,2]    [,3]     
 [1,] "ten served"  "ten"   "served" 
 [2,] "one over"    "one"   "over"   
 [3,] "seven books" "seven" "books"  
 [4,] "two met"     "two"   "met"    
 [5,] "two factors" "two"   "factors"
 [6,] "one and"     "one"   "and"    
 [7,] "three lists" "three" "lists"  
 [8,] "seven is"    "seven" "is"     
 [9,] "two when"    "two"   "when"   
[10,] "one floor"   "one"   "floor"  
[11,] "ten inches"  "ten"   "inches" 
[12,] "one with"    "one"   "with"   
[13,] "one war"     "one"   "war"    
[14,] "one button"  "one"   "button" 
[15,] "six minutes" "six"   "minutes"
  2. Find all contractions. Separate out the pieces before and after the apostrophe.
  • any letters 1 or more times, apostrophe, non-space or non-period 1 or more times.
> contract <- "([a-zA-Z]+)'([^ .]+)"
> has_contract <- sentences %>% 
+   str_subset(contract)
> 
> has_contract %>% 
+   str_match(contract) %>% head(15)
      [,1]         [,2]       [,3]
 [1,] "It's"       "It"       "s" 
 [2,] "man's"      "man"      "s" 
 [3,] "don't"      "don"      "t" 
 [4,] "store's"    "store"    "s" 
 [5,] "workmen's"  "workmen"  "s" 
 [6,] "Let's"      "Let"      "s" 
 [7,] "sun's"      "sun"      "s" 
 [8,] "child's"    "child"    "s" 
 [9,] "king's"     "king"     "s" 
[10,] "It's"       "It"       "s" 
[11,] "don't"      "don"      "t" 
[12,] "queen's"    "queen"    "s" 
[13,] "don't"      "don"      "t" 
[14,] "pirate's"   "pirate"   "s" 
[15,] "neighbor's" "neighbor" "s" 

Replacing Matches

str_replace() and str_replace_all() will replace matches with new strings.

> x <- c("apple","pear","banana")
> str_replace(x,"[aeiou]","-")
[1] "-pple"  "p-ar"   "b-nana"
> str_replace_all(x,"[aeiou]","-")
[1] "-ppl-"  "p--r"   "b-n-n-"

You can also perform multiple replacements by supplying a named vector.

> x <- c("1 house","2 cars","3 people")
> str_replace_all(x,c("1"="one","2"="two","3"="three"))
[1] "one house"    "two cars"     "three people"

Backreferences can also be used. The following example switches the second and third words.

> sentences %>% 
+   str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
+   head(5)
[1] "The canoe birch slid on the smooth planks." 
[2] "Glue sheet the to the dark blue background."
[3] "It's to easy tell the depth of a well."     
[4] "These a days chicken leg is a rare dish."   
[5] "Rice often is served in round bowls."       
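As a further illustration of backreferences, here is a minimal sketch that reorders the pieces of a date. The dates vector is made-up sample data, not part of the original examples.

```r
library(stringr)

# Hypothetical sample data: dates in year-month-day order
dates <- c("2021-03-15", "1999-12-31")

# Group 1 = year, group 2 = month, group 3 = day;
# the replacement writes them back as day/month/year
reordered <- str_replace(dates, "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
print(reordered)
# [1] "15/03/2021" "31/12/1999"
```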

Replacing Matches - Exercises

  1. Replace all forward slashes in a string with backslashes.
> x <- "a/bc///d"
> writeLines(x)
a/bc///d
> x %>% str_replace_all("/","\\\\") %>% 
+ writeLines()
a\bc\\\d
  2. Implement a simple version of str_to_lower() using str_replace_all().
> x <- "ABCDefgHIJ"
> x %>% str_replace_all("[A-Z]",tolower) %>% 
+   writeLines()
abcdefghij
  3. Switch the first and last letters in words.
> words %>% 
+   str_replace("([a-zA-Z])(.*)([a-zA-Z])","\\3\\2\\1") %>% 
+   head(25)
 [1] "a"         "ebla"      "tboua"     "ebsoluta"  "tccepa"    "tccouna"  
 [7] "echieva"   "scrosa"    "tca"       "ectiva"    "lctuaa"    "dda"      
[13] "sddresa"   "tdmia"     "edvertisa" "tffeca"    "dffora"    "rftea"    
[19] "nfternooa" "ngaia"     "tgainsa"   "ega"       "tgena"     "oga"      
[25] "egrea"    
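The str_to_lower() exercise above passes a function (tolower) as the replacement. As a rough sketch of the same idea, assuming a stringr version recent enough to accept a function there, the function receives the matched text and returns the text to substitute. The input string is made up for illustration.

```r
library(stringr)

# Hypothetical input: letters mixed with numbers
x <- "a1b22"

# Each run of digits is handed to the function, which doubles it
# and returns the result as a string
doubled <- str_replace_all(x, "\\d+", function(m) as.character(as.numeric(m) * 2))
print(doubled)
# [1] "a2b44"
```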

Splitting

It’s probably no surprise that str_split() will split strings.

> sentences %>% head(5) %>% str_split(" ")
[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
[8] "planks."

[[2]]
[1] "Glue"        "the"         "sheet"       "to"          "the"        
[6] "dark"        "blue"        "background."

[[3]]
[1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."

[[4]]
[1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
[8] "rare"    "dish."  

[[5]]
[1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

str_split() always returns a list with one element per input string. When the input is a single string, it is usually easiest to pull out the first element directly.

> "a|b|c|d" %>% str_split("\\|") %>% .[[1]]
[1] "a" "b" "c" "d"

Or you can have it return a matrix with simplify = TRUE. Shorter rows are padded with empty strings.

> sentences %>% head(5) %>% 
+   str_split(" ",simplify = TRUE)
     [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]         
[1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."    
[2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background."
[3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"          
[4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"       
[5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""           
     [,9]   
[1,] ""     
[2,] ""     
[3,] "well."
[4,] "dish."
[5,] ""     

You can also specify the maximum number of pieces.

> sentences %>% head(5) %>% 
+   str_split(" ",n=3,simplify = TRUE)
     [,1]    [,2]    [,3]                                
[1,] "The"   "birch" "canoe slid on the smooth planks."  
[2,] "Glue"  "the"   "sheet to the dark blue background."
[3,] "It's"  "easy"  "to tell the depth of a well."      
[4,] "These" "days"  "a chicken leg is a rare dish."     
[5,] "Rice"  "is"    "often served in round bowls."      
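One place where limiting the number of pieces helps is splitting key=value pairs whose values may themselves contain the separator. The sketch below uses a made-up settings vector and assumes that only the first "=" should split.

```r
library(stringr)

# Hypothetical key=value strings; the second value contains "="
settings <- c("host=example.org", "query=a=b")

# n = 2 stops after the first split, so the value stays in one piece
parts <- str_split(settings, "=", n = 2, simplify = TRUE)
print(parts)
#      [,1]    [,2]
# [1,] "host"  "example.org"
# [2,] "query" "a=b"
```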

In addition to patterns, you can also split a string by boundaries (character, line, sentence, and word).

> x <- "This is a sentence. This is another sentence"
> str_view_all(x, boundary("word"))
> str_split(x," ")[[1]]
[1] "This"      "is"        "a"         "sentence." "This"      "is"       
[7] "another"   "sentence" 
> str_split(x,boundary("word"))[[1]]
[1] "This"     "is"       "a"        "sentence" "This"     "is"       "another" 
[8] "sentence"
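The other boundary types work the same way. As a small sketch, str_count() can count the pieces each boundary type finds in the string used above; the variable names are only for illustration.

```r
library(stringr)

x <- "This is a sentence. This is another sentence"

# Count the pieces found by each boundary type
n_words     <- str_count(x, boundary("word"))
n_sentences <- str_count(x, boundary("sentence"))
n_chars     <- str_count(x, boundary("character"))

print(n_words)
# [1] 8

# For plain ASCII text the character count matches str_length()
print(n_chars == str_length(x))
# [1] TRUE
```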

Splitting - Exercises

  1. Split up a string like “apples, pears, and bananas” into individual components.
> x <- "apples, pears, and bananas"
> y <- str_split(x,", and |, ")
> y[[1]]
[1] "apples"  "pears"   "bananas"
> #Or, if you want to keep "and"
> str_split(x,boundary("word"))[[1]]
[1] "apples"  "pears"   "and"     "bananas"
  2. Why is it better to split up by boundary("word") than " "?
  • boundary("word") drops punctuation and whitespace, so only the words themselves are returned.
> fruits <- "fruits: apples, peaches, bananas, and oranges!"
> str_split(fruits, " ")[[1]]
[1] "fruits:"  "apples,"  "peaches," "bananas," "and"      "oranges!"
> str_split(fruits, boundary("word"))[[1]]
[1] "fruits"  "apples"  "peaches" "bananas" "and"     "oranges"
  3. What does splitting with an empty string ("") do?
  • An empty pattern, "", is equivalent to boundary("character"): the string is split into individual characters.
> x <- sentences %>% head(5)
> str_split(x,"")
[[1]]
 [1] "T" "h" "e" " " "b" "i" "r" "c" "h" " " "c" "a" "n" "o" "e" " " "s" "l" "i"
[20] "d" " " "o" "n" " " "t" "h" "e" " " "s" "m" "o" "o" "t" "h" " " "p" "l" "a"
[39] "n" "k" "s" "."

[[2]]
 [1] "G" "l" "u" "e" " " "t" "h" "e" " " "s" "h" "e" "e" "t" " " "t" "o" " " "t"
[20] "h" "e" " " "d" "a" "r" "k" " " "b" "l" "u" "e" " " "b" "a" "c" "k" "g" "r"
[39] "o" "u" "n" "d" "."

[[3]]
 [1] "I" "t" "'" "s" " " "e" "a" "s" "y" " " "t" "o" " " "t" "e" "l" "l" " " "t"
[20] "h" "e" " " "d" "e" "p" "t" "h" " " "o" "f" " " "a" " " "w" "e" "l" "l" "."

[[4]]
 [1] "T" "h" "e" "s" "e" " " "d" "a" "y" "s" " " "a" " " "c" "h" "i" "c" "k" "e"
[20] "n" " " "l" "e" "g" " " "i" "s" " " "a" " " "r" "a" "r" "e" " " "d" "i" "s"
[39] "h" "."

[[5]]
 [1] "R" "i" "c" "e" " " "i" "s" " " "o" "f" "t" "e" "n" " " "s" "e" "r" "v" "e"
[20] "d" " " "i" "n" " " "r" "o" "u" "n" "d" " " "b" "o" "w" "l" "s" "."

Locating Positions

Sometimes it is easier to just find the start and end of a pattern with str_locate(). You can then use str_sub() to extract the string.

> fruit <- c("apple", "banana", "pear", "pineapple")
> str_locate(fruit, "$")
     start end
[1,]     6   5
[2,]     7   6
[3,]     5   4
[4,]    10   9
> str_locate(fruit, "a")
     start end
[1,]     1   1
[2,]     2   2
[3,]     3   3
[4,]     5   5
> str_locate(fruit, c("a", "b", "p", "p"))
     start end
[1,]     1   1
[2,]     1   1
[3,]     1   1
[4,]     1   1
> str_locate_all(fruit, "a")
[[1]]
     start end
[1,]     1   1

[[2]]
     start end
[1,]     2   2
[2,]     4   4
[3,]     6   6

[[3]]
     start end
[1,]     3   3

[[4]]
     start end
[1,]     5   5
> str_locate_all(fruit, "e")
[[1]]
     start end
[1,]     5   5

[[2]]
     start end

[[3]]
     start end
[1,]     2   2

[[4]]
     start end
[1,]     4   4
[2,]     9   9
> str_locate_all(fruit, c("a", "b", "p", "p"))
[[1]]
     start end
[1,]     1   1

[[2]]
     start end
[1,]     1   1

[[3]]
     start end
[1,]     1   1

[[4]]
     start end
[1,]     1   1
[2,]     6   6
[3,]     7   7
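The locate-then-extract workflow described at the start of this section can be sketched as follows. The pattern is an arbitrary choice for illustration, and the result is compared with str_extract(), which does both steps at once.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")
pattern <- "an|ea|pp"   # arbitrary pattern chosen for this sketch

# Step 1: find where the first match starts and ends in each string
loc <- str_locate(fruit, pattern)

# Step 2: cut out those positions
pieces <- str_sub(fruit, loc[, "start"], loc[, "end"])
print(pieces)
# [1] "pp" "an" "ea" "ea"

# Same result as extracting directly
print(identical(pieces, str_extract(fruit, pattern)))
# [1] TRUE
```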

Other Types of Pattern


A pattern that’s given as a plain string is automatically wrapped in a call to regex(), so the two calls below are equivalent.

> str_view(stringr::fruit,"nana", match=TRUE)
> str_view(stringr::fruit,regex("nana"),match=TRUE)

You can call regex() directly to control the details of the match. For example, if you don’t want case to matter:

> bananas <- c("banana", "Banana", "BANANA")
> str_view(bananas,"banana")
> str_view(bananas,regex("banana",ignore_case = TRUE))
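Because str_view() renders its result in a viewer, the difference is easier to show in text with str_detect(). This is a minimal sketch using the same three spellings.

```r
library(stringr)

bananas <- c("banana", "Banana", "BANANA")

# Default matching is case sensitive
case_sensitive <- str_detect(bananas, "banana")
print(case_sensitive)
# [1]  TRUE FALSE FALSE

# ignore_case = TRUE matches every spelling
case_insensitive <- str_detect(bananas, regex("banana", ignore_case = TRUE))
print(case_insensitive)
# [1] TRUE TRUE TRUE
```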

multiline=TRUE allows ^ and $ to match the start and end of each line rather than the start and end of a complete string.

> x <- "Line 1\nLine 2\nLine 3"
> writeLines(x)
Line 1
Line 2
Line 3
> str_extract_all(x,"^Line")[[1]]
[1] "Line"
> str_extract_all(x,regex("^Line",multiline = TRUE))[[1]]
[1] "Line" "Line" "Line"

comments=TRUE allows you to use comments and white space. It will ignore spaces and everything after #. To explicitly match a space you’ll need to escape it with "\\ ".

> phone <- regex("
+   \\(?      #optional opening parentheses
+   (\\d{3})  #area code
+   [)- ]?    #optional closing parentheses, dash, or space
+   (\\d{3})  #another 3 numbers
+   [ -]?     #optional space or dash
+   (\\d{4})  #four more numbers
+   ", comments=TRUE)
> 
> str_match("888-867-5309",phone)  
     [,1]           [,2]  [,3]  [,4]  
[1,] "888-867-5309" "888" "867" "5309"

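fixed() matches the exact sequence of bytes, so regex special characters are treated literally. Because it skips the regular expression engine it is faster for simple matches, but it is less thorough: it can miss characters that have more than one byte representation, such as some accented letters. coll() handles those cases at the cost of speed.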

> microbenchmark::microbenchmark(
+   fixed = str_detect(sentences, fixed("the")),
+   regex = str_detect(sentences,"the"),
+   times=20
+ )
Unit: microseconds
  expr   min     lq    mean median     uq   max neval cld
 fixed  75.0  79.10 104.365  86.80 106.35 266.2    20  a 
 regex 207.9 212.75 261.520 226.35 298.80 447.7    20   b
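Speed aside, the practical difference shows up when the pattern contains a special character. In this small sketch (the input vector is made up), the dot is a wildcard under regex() but a literal period under fixed().

```r
library(stringr)

# Hypothetical inputs: one contains a literal period, one does not
x <- c("a.c", "abc")

# As a regex, "." matches any character, so both strings match
as_regex <- str_detect(x, "a.c")
print(as_regex)
# [1] TRUE TRUE

# With fixed(), "." only matches a literal period
as_fixed <- str_detect(x, fixed("a.c"))
print(as_fixed)
# [1]  TRUE FALSE
```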

As with str_split(), you can also use boundary() with other functions.

> x <- "This is a sentence."
> str_view_all(x, boundary("word"))
> str_extract_all(x, boundary("word"))
[[1]]
[1] "This"     "is"       "a"        "sentence"

Exercises

  1. How would you find all strings containing \ with regex() versus fixed()?
> strings <- c("ab", "1\\2", "x\\y")
> str_subset(strings, regex("\\\\")) 
[1] "1\\2" "x\\y"
> #fixed() ignores special characters so no need to escape backslash
> #it matches the exact text
> str_subset(strings, fixed("\\"))
[1] "1\\2" "x\\y"
  2. What are the five most common words in sentences?
> x <- str_extract_all(sentences, boundary("word")) %>% unlist
> x <- str_to_lower(x)
> y <- as_tibble(x) %>% rename(words=value)
> y %>% group_by(words) %>% count(sort=TRUE) %>% head(5)
# A tibble: 5 x 2
# Groups:   words [5]
  words     n
  <chr> <int>
1 the     751
2 a       202
3 of      132
4 to      123
5 and     118

Other Uses of Regular Expressions


apropos() searches all objects available from the global environment, which includes the functions of every attached package. This can be helpful when you can’t quite remember a function’s name.

> apropos("replace")
[1] "%+replace%"       "replace"          "replace_na"       "setReplaceMethod"
[5] "str_replace"      "str_replace_all"  "str_replace_na"   "theme_replace"   

dir() lists all the files in a directory. You can also search by a pattern.

> head(dir(pattern = "\\.Rmd$"))
[1] "Strings.Rmd"                "Strings_Copy.Rmd"          
[3] "StringsExercises _Copy.Rmd" "StringsExercises.Rmd"      
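To see why the pattern is anchored with $, here is a self-contained sketch that creates a few empty files in a temporary directory. The file names are invented for the example.

```r
# Create a scratch directory with three hypothetical files
tmp <- tempfile("dir_demo")
dir.create(tmp)
invisible(file.create(file.path(tmp, c("notes.Rmd", "notes.Rmd.bak", "plot.R"))))

# "\\." is a literal period and "$" anchors the match to the end of the name,
# so the backup file is excluded
rmd_files <- dir(tmp, pattern = "\\.Rmd$")
print(rmd_files)
# [1] "notes.Rmd"

# Clean up the scratch directory
unlink(tmp, recursive = TRUE)
```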

Finally, it is important to note that stringr is built on top of the stringi package, which is more comprehensive. It may have a solution that isn’t readily available with stringr.