You can create a string with single or double quotes. Single quotes are the better choice when the string itself contains double quotation marks.
To include a literal single or double quote, escape it with a backslash (\). To include a literal backslash, double it (\\).
The printed output of a string is not the same as the actual string. Use writeLines() to view the raw contents.
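As a minimal sketch (the vector x is an illustrative assumption, not necessarily the original input), compare print() and writeLines() on a single quote and a single backslash:

```r
# Hypothetical example vector: one single quote and one backslash
x <- c("'", "\\")

print(x)       # printed representation: escapes are shown
writeLines(x)  # raw contents: one character per line

nchar(x)       # each element is exactly one character long
```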
[1] "'" "\\"
'
\
A list of some special characters is provided below. For more details see ?"'" for help on Quotes.
| Special Character | Operation |
|---|---|
| \n | newline |
| \r | carriage return |
| \t | tab |
| \b | backspace |
| \a | alert (bell) |
| \f | form feed |
| \v | vertical tab |
| \\ | backslash \ |
| \' | ASCII apostrophe ' |
| \" | ASCII quotation mark " |
| \nnn | character with given octal code (1, 2 or 3 digits) |
| \xnn | character with given hex code (1 or 2 hex digits) |
| \unnnn | Unicode character with given code (1–4 hex digits) |
| \Unnnnnnnn | Unicode character with given code (1–8 hex digits) |
An example of a unicode character would be:
[1] "µ"
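A small sketch of how a character like this can be written with a Unicode escape. The choice of code point b5 (the micro sign) is an assumption based on the output shown above:

```r
# "\u00b5" is the Unicode escape for the micro sign
x <- "\u00b5"
x

nchar(x)      # a single character
utf8ToInt(x)  # its code point as an integer (181, i.e. hex b5)
```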
Multiple strings can be stored in a character vector
[1] "one" "two" "three"
You can retrieve the length of strings with str_length()
[1] 2 19 NA
str_c() will combine strings. The sep= option controls how they are separated.
[1] "abcd"
[1] "a..b..c..d"
Missing values are contagious in str_c(). If you want them to appear as the string "NA", you have to ask for it explicitly with str_replace_na().
[1] "|~abc~|" NA
[1] "|~abc~|" "|~NA~|"
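The calls below are a sketch of how output like the above can be produced. The inputs are assumptions chosen to match the printed results:

```r
library(stringr)

str_c("ab", "cd")                      # "abcd"
str_c("a", "b", "c", "d", sep = "..")  # "a..b..c..d"

x <- c("abc", NA)
str_c("|~", x, "~|")                   # "|~abc~|" NA
str_c("|~", str_replace_na(x), "~|")   # "|~abc~|" "|~NA~|"
```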
Notice how str_c() will combine vectors when the length of one is a multiple of the length of the others.
> #vector length 3 is a multiple of length 1
> str_c("I\'m eating a ",c("ham","bologna","turkey")," sandwich")
[1] "I'm eating a ham sandwich" "I'm eating a bologna sandwich"
[3] "I'm eating a turkey sandwich"
> #vector of length 4 is a multiple of length 2
> str_c(c("I\'m eating a ","I hate "),c("ham","bologna","turkey","cheese"))
[1] "I'm eating a ham" "I hate bologna" "I'm eating a turkey"
[4] "I hate cheese"
Objects of length 0 are dropped.
> name <- "Brenda"
> time_of_day <- "morning"
> new_years <- FALSE
>
> str_c("Good ",time_of_day," ",name,
+ if (new_years) " and HAPPY NEW YEAR",".")
[1] "Good morning Brenda."
You can combine a vector of strings into a single string with the collapse argument of str_c().
[1] "abc"
[1] "a_b_c"
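A short sketch of the collapse argument. The input vector is an assumption chosen to match the output above:

```r
library(stringr)

x <- c("a", "b", "c")

str_c(x, collapse = "")   # "abc"
str_c(x, collapse = "_")  # "a_b_c"
```

Without collapse, str_c(x) simply returns the three strings unchanged, because there is nothing to combine them with.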
str_sub() can be used to subset parts of a string.
> x <- c("Budweiser","Heineken","Guinness")
> #start at 1 from the beginning and end at 3
> str_sub(x,1,3)
[1] "Bud" "Hei" "Gui"
> #negative positions count backwards from the end
> str_sub(x,-3,-1)
[1] "ser" "ken" "ess"
> #You can also assign to str_sub() to modify strings
> str_sub(x,1,1) <- str_to_lower(str_sub(x,1,1))
> x
[1] "budweiser" "heineken" "guinness"
Sometimes there are easier ways to change case.
[1] "THE QUICK BROWN DOG"
[1] "the quick brown dog"
[1] "The Quick Brown Dog"
[1] "The quick brown dog"
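The four results above are consistent with the stringr case helpers. This is a sketch with an assumed input string; str_to_sentence() needs a reasonably recent stringr (1.4.0 or later, as far as I know):

```r
library(stringr)

x <- "the quick brown dog"

str_to_upper(x)                      # "THE QUICK BROWN DOG"
str_to_lower("THE QUICK BROWN DOG")  # "the quick brown dog"
str_to_title(x)                      # "The Quick Brown Dog"
str_to_sentence(x)                   # "The quick brown dog"
```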
It’s important to note that results such as sort order can differ depending on the locale.
[1] "apple" "banana" "eggplant"
[1] "apple" "eggplant" "banana"
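A sketch of locale-dependent sorting with str_sort(). The input vector and the two locales ("en" for English, "haw" for Hawaiian) are assumptions; the exact ordering for a given locale depends on the ICU data installed with stringi:

```r
library(stringr)

x <- c("apple", "eggplant", "banana")

str_sort(x, locale = "en")   # English: "apple" "banana" "eggplant"
str_sort(x, locale = "haw")  # Hawaiian collation orders the letters differently
```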
paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?
Both are equivalent to str_c().
paste() has a sep= option. Default is sep=" ".
paste0() has no sep= option.
paste() and paste0() convert NA to the string "NA"; str_c() won’t unless you ask for it.
> #Need to add sep=" " for a space
> str_c("I\'m eating a",c("ham","bologna","turkey"),"sandwich", sep=" ")
[1] "I'm eating a ham sandwich" "I'm eating a bologna sandwich"
[3] "I'm eating a turkey sandwich"
> #paste() separates with a space by default
> paste("I\'m eating a",c("ham","bologna","turkey"),"sandwich")
[1] "I'm eating a ham sandwich" "I'm eating a bologna sandwich"
[3] "I'm eating a turkey sandwich"
> #paste0() has no sep= option, so sep=" " is appended as one more string
> paste0("I\'m eating a",c("ham","bologna","turkey"),"sandwich", sep=" ")
[1] "I'm eating ahamsandwich " "I'm eating abolognasandwich "
[3] "I'm eating aturkeysandwich "
[1] "|~abc~|" NA
[1] "|~ abc ~|" "|~ NA ~|"
[1] "|~abc~|" "|~NA~|"
sep and collapse arguments to str_c().
[1] "abcd"
[1] "a..b..c..d"
[1] "a..b..c..d"
Use str_length() and str_sub() to extract the middle characters from a string. What will you do if the string has an even number of characters?
Use %/% (integer division) and %% (modulus - remainder).
> returns_middle <- function(x) {
+ #returns middle
+ # If odd remainder = 1
+ ifelse(str_length(x)%%2==1,
+ #If odd take middle
+ str_sub(x,(str_length(x)%/%2+str_length(x)%%2),
+ (str_length(x)%/%2+str_length(x)%%2)),
+ #If even take middle 2
+ str_sub(x,(str_length(x)%/%2),
+ (str_length(x)%/%2+1))
+ )
+
+ }
>
> x <- "Breguet"
> returns_middle(x)
[1] "g"
[1] "gi"
What does str_wrap() do? When might you want to use it?
indent is for the first line. exdent is for the remaining lines.
> str_wrap('You\'ve Got To Ask Yourself One Question:
+ "Do I Feel Lucky?" Well, Do Ya, Punk?.',
+ width = 20, indent = 3, exdent = 1) %>%
+ writeLines()
   You've Got To
 Ask Yourself One
 Question: "Do I Feel
 Lucky?" Well, Do Ya,
 Punk?.
What does str_trim() do? What’s the opposite of str_trim()?
str_trim() will remove whitespace from the left, right, or both sides.
str_squish() will remove whitespace from both sides and reduce repeated whitespace in the middle to a single space.
str_pad() is the opposite. It will add spaces on the left, right, or both. You specify the total string length, including the added spaces.
[1] "String with trailing and leading white space"
> #remove from both sides and middle
> str_squish(" String with trailing, middle, and leading white space ")
[1] "String with trailing, middle, and leading white space"
[1] " omega "
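A compact sketch of the three functions side by side. The input strings and the width of 9 are illustrative assumptions:

```r
library(stringr)

x <- "  omega  "

str_trim(x)                  # "omega"
str_trim(x, side = "left")   # "omega  "
str_squish("  a   b  ")      # "a b"

# Pad "omega" (5 characters) to a total width of 9, split across both sides
str_pad("omega", width = 9, side = "both")  # "  omega  "
```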
Turn the vector c("a","b","c") into the string "a, b, and c". Think carefully about what it should do if given a vector of length 0, 1, or 2.
If length() < 2, just return the string.
If length() >= 2, separate the vector and combine it again with the requested format.
> vector_to_string <- function(x) {
+
+ # length < 2: nothing to join
+ if (length(x) < 2)
+ return(x)
+
+ # length >= 2
+
+ y <- x[seq_len(length(x) - 1)] # all but the last string
+ z <- x[length(x)] # last string
+ aa <- str_c(y,collapse=", ") # join with comma and space
+ str_c(aa," and ",z) # add "and" plus the last string
+
+ }
>
> x <- c("a", "b", "c")
> vector_to_string(x)
[1] "a, b and c"
[1] "a"
[1] "Omega, Breguet, Panerai, Rolex and Blancpain"
Regular expressions (regexp) enable you to match patterns in strings. str_view and str_view_all provide a way to visualize the output.
For example, matching the text “an” in a vector of strings.
To match any character you can use a period.
So how would you match a literal period? You would need to escape it with a backslash. But a backslash is a special character in a string and would also need to be escaped.
Basically, you want your regular expression to match the output of writeLines(). So if you want to escape a period (\.), the string version of your regular expression needs to be written as "\\.".
\.
So to match a literal backslash, each backslash in the regular expression \\ must itself be escaped in the string, giving "\\\\". Remember, the raw output (writeLines()) of "\\" is a single \.
\\
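A minimal sketch of both levels of escaping. The test strings are illustrative assumptions:

```r
library(stringr)

# String "\\." holds the regular expression \. (a literal period)
dot <- "\\."
writeLines(dot)
str_detect(c("abc", "a.c"), "a\\.c")    # FALSE TRUE

# String "\\\\" holds the regular expression \\ (a literal backslash)
backslash <- "\\\\"
writeLines(backslash)
str_detect(c("a\\b", "ab"), backslash)  # TRUE FALSE
```

The string "a\\b" contains a single backslash, which is why only the first element matches.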
a\b
a\\b
"'\a"`\c
\"\`\\
What patterns will the regular expression \..\..\.. match? How would you represent it as a string?
eerx.t.f.jjg
\..\..\..
You can set the regular expression to match from the start or end of a string.
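As a sketch, ^ anchors a match to the start of the string and $ anchors it to the end. The example strings are assumptions:

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

str_subset(x, "^apple")   # all three start with "apple"
str_subset(x, "pie$")     # "apple pie"
str_subset(x, "^apple$")  # "apple" (the whole string must match)
```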
Use both ^ and $ to match the string exactly.
How would you match the literal string "$^$"?
\$\^\$
Given the corpus of common words in stringr::words, create regular expressions that find all the words that:
[1] "act" "add" "age" "ago" "air" "all" "and" "any" "arm" "art" "ask" "bad"
[13] "bag" "bar" "bed" "bet" "big" "bit" "box" "boy" "bus" "but" "buy" "can"
[25] "car" "cat" "cup" "cut" "dad" "day" "die" "dog" "dry" "due" "eat" "egg"
[37] "end" "eye" "far" "few" "fit" "fly" "for" "fun" "gas" "get" "god" "guy"
[49] "hit" "hot" "how" "job" "key" "kid" "lad" "law" "lay" "leg" "let" "lie"
[61] "lot" "low" "man" "may" "mrs" "new" "non" "not" "now" "odd" "off" "old"
[73] "one" "out" "own" "pay" "per" "put" "red" "rid" "run" "say" "see" "set"
[85] "sex" "she" "sir" "sit" "six" "son" "sun" "tax" "tea" "ten" "the" "tie"
[97] "too" "top" "try" "two" "use" "war" "way" "wee" "who" "why" "win" "yes"
[109] "yet" "you"
There are some other special patterns.
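\d matches any digit
\s matches any whitespace
[abc] matches a, b, or c
[^abc] matches anything except a, b, or c
To create a regular expression containing \d or \s, the backslash needs to be escaped in the string, so you would write "\\d" or "\\s".
You can also use alternation: hat|d..r matches “hat” or any four-character string that starts with “d” and ends with “r”, such as “deer”.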
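A small sketch of these patterns with str_detect(). The test vector is an illustrative assumption:

```r
library(stringr)

x <- c("hat", "deer", "door", "a1 b")

str_detect(x, "\\d")       # FALSE FALSE FALSE TRUE  (contains a digit)
str_detect(x, "\\s")       # FALSE FALSE FALSE TRUE  (contains whitespace)
str_detect(x, "[abc]")     # TRUE FALSE FALSE TRUE   (contains a, b, or c)
str_detect(x, "hat|d..r")  # TRUE TRUE TRUE FALSE    ("door" also fits d..r)
```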
\(\d\d\d\)\s\d\d\d-\d\d\d\d
It is often useful to match a pattern a specific number of times.
? - 0 or 1 times
+ - 1 or more times
* - 0 or more times
Or, you can use braces {} to be more explicit.
{n} - exactly n times
{n,} - n or more times
{0,m} - at most m times
{n,m} - n to m times
By default these matches are greedy: they match the longest string possible. To make them lazy, matching the shortest string possible, put a ? after them.
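A sketch of greedy versus lazy repetition, using a Roman numeral as an assumed test string:

```r
library(stringr)

x <- "MDCCCLXXXVIII"

str_extract(x, "CC?")      # "CC"
str_extract(x, "C{2}")     # "CC"
str_extract(x, "C{2,3}")   # "CCC"   (greedy: as many as allowed)
str_extract(x, "C{2,3}?")  # "CC"    (lazy: as few as allowed)
str_extract(x, "C[LX]+")   # "CLXXX" (greedy)
str_extract(x, "C[LX]+?")  # "CL"    (lazy)
```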
? = {0,1}
+ = {1,}
* = {0,}
^.*$
"\\{.+\\}"
> #brackets and in-between where the text
> #in-between is 1 or more
> x <- c("look{at what is}here","single{}")
> str_view(x,"\\{.+\\}")
\d{4}-\d{2}-\d{2}
> # 4 digits - 2 digits - 2 digits
> x <- c("888-7-6", "8888-77-66","88888-777-666")
> str_view(x,"\\d{4}-\\d{2}-\\d{2}")
"\\\\{4}"
[1] "Christ" "Christmas" "dry" "fly" "mrs" "scheme"
[7] "school" "straight" "strategy" "street" "strike" "strong"
[13] "structure" "system" "three" "through" "throw" "try"
[19] "type" "why"
You can use parentheses to create a group. To match the same text that a group matched earlier, use a backreference. For example, \1 refers to whatever the first group matched, \2 to the second, etc.
For example, the pattern (..)\1 matches any two characters as a group, followed by those same two characters again (not just any two characters).
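A minimal sketch of backreferences. The vector of test words is an assumption:

```r
library(stringr)

x <- c("banana", "coconut", "apple")

# (..)\1 : a pair of characters immediately repeated
str_subset(x, "(..)\\1")   # "banana" "coconut"
str_extract(x, "(..)\\1")  # "anan" "coco" NA

# (.)\1 : a single character immediately repeated
str_subset(x, "(.)\\1")    # "apple"
```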
(.)\1\1
"(.)(.)\\2\\1"
(..)\1
"(.).\\1.\\1"
"(.)(.)(.).*\\3\\2\\1"
[1] "america" "area" "dad" "dead" "depend"
[6] "educate" "else" "encourage" "engine" "europe"
[11] "evidence" "example" "excuse" "exercise" "expense"
[16] "experience" "eye" "health" "high" "knock"
[21] "level" "local" "nation" "non" "rather"
[26] "refer" "remember" "serious" "stairs" "test"
[31] "tonight" "transport" "treat" "trust" "window"
[36] "yesterday"
[1] "appropriate" "church" "condition" "decide" "environment"
[6] "london" "paragraph" "particular" "photograph" "prepare"
[11] "pressure" "remember" "represent" "require" "sense"
[16] "therefore" "understand" "whether"
[1] "appropriate" "available" "believe" "between" "business"
[6] "degree" "difference" "discuss" "eleven" "environment"
[11] "evidence" "exercise" "expense" "experience" "individual"
[16] "paragraph" "receive" "remember" "represent" "telephone"
[21] "therefore" "tomorrow"
In addition to just matching patterns in a string we can:
str_detect() returns a logical vector (TRUE or FALSE) indicating whether each element of a character vector matches a pattern.
[1] TRUE FALSE TRUE
Since TRUE is equal to 1, we can find sums, means, etc.
[1] 65
[1] 0.2765306
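A sketch of calls that would produce results of this shape. The small vector x and the two patterns are assumptions; words is the data set of common words that ships with stringr:

```r
library(stringr)

x <- c("apple", "banana", "pear")
str_detect(x, "e")                   # TRUE FALSE TRUE

# Logical vectors can be summed and averaged
sum(str_detect(words, "^t"))         # how many common words start with t
mean(str_detect(words, "[aeiou]$"))  # proportion that end with a vowel
```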
There are usually multiple ways of solving a problem, but it’s best to keep it simple where possible.
> #all words with no vowels
> no_vowels_1 <- !str_detect(words,"[aeiou]")
> no_vowels_2 <- str_detect(words,"^[^aeiou]+$")
> identical(no_vowels_1,no_vowels_2)
[1] TRUE
str_subset() is an easier way to extract the elements that match a pattern.
[1] "box" "sex" "six" "tax"
[1] "box" "sex" "six" "tax"
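The two identical results above are consistent with logical subsetting and str_subset() giving the same answer. As a sketch, assuming the pattern was "words ending in x":

```r
library(stringr)

# Logical subsetting with str_detect()
by_detect <- words[str_detect(words, "x$")]

# The same thing with str_subset()
by_subset <- str_subset(words, "x$")

by_detect
by_subset
identical(by_detect, by_subset)
```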
When working with a data frame use filter() with str_detect().
# A tibble: 4 x 2
word i
<chr> <int>
1 box 108
2 sex 747
3 six 772
4 tax 841
str_detect() only returns a TRUE or FALSE, but sometimes you need to determine how many times a pattern matches. In this case you can use str_count().
[1] 1 3 1
[1] 1.991837
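A sketch of str_count(), again with an assumed small vector plus the words data set from stringr:

```r
library(stringr)

x <- c("apple", "banana", "pear")
str_count(x, "a")                  # 1 3 1

# Average number of vowels per common word
mean(str_count(words, "[aeiou]"))
```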
Used with mutate():
> df %>%
+ mutate(
+ vowels = str_count(word,"[aeiou]"),
+ consonants = str_count(word,"[^aeiou]")
+ )
# A tibble: 980 x 4
word i vowels consonants
<chr> <int> <int> <int>
1 a 1 1 0
2 able 2 2 2
3 about 3 3 2
4 absolute 4 4 4
5 accept 5 2 4
6 account 6 3 4
7 achieve 7 4 3
8 across 8 2 4
9 act 9 1 2
10 active 10 3 3
# ... with 970 more rows
Note that matches do not overlap. In the following example you could circle “aba” 3 times, but since the middle one would overlap it is not counted.
[1] 2
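A minimal sketch of the non-overlapping behaviour, assuming the test string "abababa":

```r
library(stringr)

x <- "abababa"

str_count(x, "aba")               # 2, not 3
str_extract_all(x, "aba")[[1]]    # "aba" "aba"
str_locate_all(x, "aba")[[1]]     # matches at positions 1-3 and 5-7
```

After the first match ends at position 3, the search resumes at position 4, so the "aba" starting at position 3 is never considered.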
str_detect() calls.
[1] "box" "six" "xenon"
[1] "box" "six" "xenon"
[1] "about" "accept" "account" "across" "act" "actual" "add"
[8] "address" "admit" "affect"
[1] "about" "accept" "account" "across" "act" "actual" "add"
[8] "address" "admit" "affect"
character(0)
> words[str_detect(words,"a")&str_detect(words,"e")&
+ str_detect(words,"i")&str_detect(words,"o")&
+ str_detect(words,"u")]
character(0)
[1] 5
[1] "appropriate" "associate" "available" "colleague" "encourage"
[6] "experience" "individual" "television"
> vow_prop <- str_count(words, "[aeiou]") / str_count(words,".")
> max_prop <- max(vow_prop)
> words[vow_prop== max_prop]
[1] "a"
str_extract() will extract the actual text of a match. It can be demonstrated with the sentences data set.
[1] 720
[1] "The birch canoe slid on the smooth planks."
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."
[4] "These days a chicken leg is a rare dish."
[5] "Rice is often served in round bowls."
[6] "The juice of lemons makes fine punch."
Start by creating a regular expression:
> colors <- c("red","orange","yellow","green","blue","purple")
> color_match <- str_c(colors,collapse="|")
> color_match
[1] "red|orange|yellow|green|blue|purple"
Then use the regular expression to filter and extract the matches:
> has_color <- str_subset(sentences,color_match)
> matches <- str_extract(has_color,color_match)
> head(matches)
[1] "blue" "blue" "red" "red" "red" "blue"
Unfortunately str_extract() will only extract the first match. We can see below that it only retrieved “blue”, “green”, and “orange.”
[1] "blue" "green" "orange"
However, str_extract_all() will extract all matches. It returns a list, but with simplify=TRUE it will return a matrix.
[[1]]
[1] "blue" "red"
[[2]]
[1] "green" "red"
[[3]]
[1] "orange" "red"
[,1] [,2]
[1,] "blue" "red"
[2,] "green" "red"
[3,] "orange" "red"
The matrix has as many columns as the element with the most matches. Rows with fewer matches are padded with empty strings.
[,1] [,2] [,3]
[1,] "a" "" ""
[2,] "a" "b" ""
[3,] "a" "b" "c"
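A self-contained sketch of the difference between str_extract() and str_extract_all(). The pattern and the two sentences are made up for illustration:

```r
library(stringr)

colours <- "red|green|blue"
x <- c("a red and blue kite", "a green door")

str_extract(x, colours)       # first match only: "red" "green"
str_extract_all(x, colours)   # a list with every match per string

# simplify = TRUE returns a character matrix padded with ""
m <- str_extract_all(x, colours, simplify = TRUE)
m
dim(m)                        # 2 rows, 2 columns
```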
Matches flickered
Solution
> x <- c("The green light in the brown box flickered.",
+ "Greenson likes red.",
+ "Green is a great color.")
>
> colors <- c("red","orange","yellow","green","blue","purple","brown")
> colors2 <- str_to_title(colors)
>
> y <- str_c("\\s",colors)
> z <- str_c(colors2,"\\s")
> aa <- c(y,z)
> color_match <- str_c(aa,collapse="|")
> color_match
[1] "\\sred|\\sorange|\\syellow|\\sgreen|\\sblue|\\spurple|\\sbrown|Red\\s|Orange\\s|Yellow\\s|Green\\s|Blue\\s|Purple\\s|Brown\\s"
[1] "green" "brown" "red" "Green"
str_extract() for the first match.
[1] "The" "Glue" "It's" "These" "Rice" "The"
> ab <- str_extract_all(sentences,"[^ ]+ing[^a-z']") %>% unlist
> ab <- str_trim(ab)
> str_extract_all(ab,"[a-zA-Z]+") %>% unlist %>% unique
[1] "spring" "evening" "morning" "winding" "living" "king"
[7] "Adding" "making" "raging" "playing" "sleeping" "ring"
[13] "glaring" "sinking" "dying" "Bring" "lodging" "filing"
[19] "wearing" "wading" "swing" "nothing" "sing" "painting"
[25] "walking" "bring" "shipping" "puzzling" "landing" "thing"
[31] "waiting" "whistling" "timing" "changing" "drenching" "moving"
[37] "working"
> ab <- str_extract_all(sentences,"[A-Za-z]+s[^a-z]") %>% unlist()
> ab <- str_trim(ab)
> str_extract_all(ab,"[a-zA-Z]+") %>% unlist() %>%
+ unique() %>% head(20)
[1] "planks" "days" "is" "bowls" "lemons" "makes"
[7] "was" "hogs" "hours" "us" "stockings" "helps"
[13] "pass" "fires" "across" "bonds" "Press" "pants"
[19] "useless" "gas"
By creating separate groups with parentheses you can extract each component. For example, nouns tend to follow “a” or “the”. You could match “(a|the)” as one group, followed by a space, followed by a word. A word can be represented as a non-space 1 or more times “([^ ]+)”.
First, we can extract the matches from the first ten sentences that have a match.
> noun <- "(a|the) ([^ ]+)"
> has_noun <- sentences %>%
+ str_subset(noun) %>% head(10)
> has_noun %>%
+ str_extract(noun)
[1] "the smooth" "the sheet" "the depth" "a chicken" "the parked"
[6] "the sun" "the huge" "the ball" "the woman" "a helps"
Next, str_match() will return a matrix with the complete match in the first column, followed by one column for each group. It only reports the first match in each string; if you want all matches for each string, use str_match_all().
[,1] [,2] [,3]
[1,] "the smooth" "the" "smooth"
[2,] "the sheet" "the" "sheet"
[3,] "the depth" "the" "depth"
[4,] "a chicken" "a" "chicken"
[5,] "the parked" "the" "parked"
[6,] "the sun" "the" "sun"
[7,] "the huge" "the" "huge"
[8,] "the ball" "the" "ball"
[9,] "the woman" "the" "woman"
[10,] "a helps" "a" "helps"
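A self-contained sketch of str_match() and str_match_all(), reusing the same pattern on two of the sentences shown earlier:

```r
library(stringr)

noun <- "(a|the) ([^ ]+)"
x <- c("Glue the sheet to the dark blue background.",
       "Rice is often served in round bowls.")

str_extract(x, noun)      # "the sheet" NA

# One row per string: full match, then each group
m <- str_match(x, noun)
m

# All matches within the first string: "the sheet" and "the dark"
all_m <- str_match_all(x[1], noun)[[1]]
all_m
```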
With a tibble and tidyr::extract you can add new columns for each group.
> tibble(sentence=sentences) %>%
+ tidyr::extract(
+ sentence,c("article","noun"),"(a|the) ([^ ]+)",
+ remove=FALSE
+ )
# A tibble: 720 x 3
sentence article noun
<chr> <chr> <chr>
1 The birch canoe slid on the smooth planks. the smooth
2 Glue the sheet to the dark blue background. the sheet
3 It's easy to tell the depth of a well. the depth
4 These days a chicken leg is a rare dish. a chicken
5 Rice is often served in round bowls. <NA> <NA>
6 The juice of lemons makes fine punch. <NA> <NA>
7 The box was thrown beside the parked truck. the parked
8 The hogs were fed chopped corn and garbage. <NA> <NA>
9 Four hours of steady work faced us. <NA> <NA>
10 Large size in stockings is hard to sell. <NA> <NA>
# ... with 710 more rows
> number_after <- str_c("(one|two|three|four|five|six|seven|eight|",
+ "nine|ten) ([^ .]+)")
> has_num <- sentences %>%
+ str_subset(number_after)
>
> has_num %>%
+ str_match(number_after) %>% head(15)
[,1] [,2] [,3]
[1,] "ten served" "ten" "served"
[2,] "one over" "one" "over"
[3,] "seven books" "seven" "books"
[4,] "two met" "two" "met"
[5,] "two factors" "two" "factors"
[6,] "one and" "one" "and"
[7,] "three lists" "three" "lists"
[8,] "seven is" "seven" "is"
[9,] "two when" "two" "when"
[10,] "one floor" "one" "floor"
[11,] "ten inches" "ten" "inches"
[12,] "one with" "one" "with"
[13,] "one war" "one" "war"
[14,] "one button" "one" "button"
[15,] "six minutes" "six" "minutes"
> contract <- "([a-zA-Z]+)'([^ .]+)"
> has_contract <- sentences %>%
+ str_subset(contract)
>
> has_contract %>%
+ str_match(contract) %>% head(15)
[,1] [,2] [,3]
[1,] "It's" "It" "s"
[2,] "man's" "man" "s"
[3,] "don't" "don" "t"
[4,] "store's" "store" "s"
[5,] "workmen's" "workmen" "s"
[6,] "Let's" "Let" "s"
[7,] "sun's" "sun" "s"
[8,] "child's" "child" "s"
[9,] "king's" "king" "s"
[10,] "It's" "It" "s"
[11,] "don't" "don" "t"
[12,] "queen's" "queen" "s"
[13,] "don't" "don" "t"
[14,] "pirate's" "pirate" "s"
[15,] "neighbor's" "neighbor" "s"
str_replace() and str_replace_all() will replace matches with new strings.
[1] "-pple" "p-ar" "b-nana"
[1] "-ppl-" "p--r" "b-n-n-"
You can also perform multiple replacements by supplying a named vector.
[1] "one house" "two cars" "three people"
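A sketch of calls that produce output like the above. The input vectors are assumptions chosen to match the printed results:

```r
library(stringr)

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")      # first vowel only: "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")  # every vowel: "-ppl-" "p--r" "b-n-n-"

# A named vector maps each pattern (name) to its replacement (value)
y <- c("1 house", "2 cars", "3 people")
str_replace_all(y, c("1" = "one", "2" = "two", "3" = "three"))
```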
Backreferences can also be used. The following example switches the second and third word.
[1] "The canoe birch slid on the smooth planks."
[2] "Glue sheet the to the dark blue background."
[3] "It's to easy tell the depth of a well."
[4] "These a days chicken leg is a rare dish."
[5] "Rice often is served in round bowls."
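As a sketch, the swap can be written with three groups and a replacement that reorders them. The pattern is an assumption consistent with the output above:

```r
library(stringr)

x <- "The birch canoe slid on the smooth planks."

# Three space-separated words captured as groups 1, 2, 3,
# then written back as 1, 3, 2
str_replace(x, "([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2")
# "The canoe birch slid on the smooth planks."
```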
a/bc///d
a\bc\\\d
Implement a simple version of str_to_lower() using str_replace_all().
abcdefghij
[1] "a" "ebla" "tboua" "ebsoluta" "tccepa" "tccouna"
[7] "echieva" "scrosa" "tca" "ectiva" "lctuaa" "dda"
[13] "sddresa" "tdmia" "edvertisa" "tffeca" "dffora" "rftea"
[19] "nfternooa" "ngaia" "tgainsa" "ega" "tgena" "oga"
[25] "egrea"
It’s probably no surprise that str_split() will split strings.
[[1]]
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth"
[8] "planks."
[[2]]
[1] "Glue" "the" "sheet" "to" "the"
[6] "dark" "blue" "background."
[[3]]
[1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
[[4]]
[1] "These" "days" "a" "chicken" "leg" "is" "a"
[8] "rare" "dish."
[[5]]
[1] "Rice" "is" "often" "served" "in" "round" "bowls."
Since it creates a list it will be easier to extract only the first element when working with a vector of length 1.
[1] "a" "b" "c" "d"
Or you can have it return a matrix.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
[2,] "Glue" "the" "sheet" "to" "the" "dark" "blue" "background."
[3,] "It's" "easy" "to" "tell" "the" "depth" "of" "a"
[4,] "These" "days" "a" "chicken" "leg" "is" "a" "rare"
[5,] "Rice" "is" "often" "served" "in" "round" "bowls." ""
[,9]
[1,] ""
[2,] ""
[3,] "well."
[4,] "dish."
[5,] ""
You can also specify the maximum number of pieces.
[,1] [,2] [,3]
[1,] "The" "birch" "canoe slid on the smooth planks."
[2,] "Glue" "the" "sheet to the dark blue background."
[3,] "It's" "easy" "to tell the depth of a well."
[4,] "These" "days" "a chicken leg is a rare dish."
[5,] "Rice" "is" "often served in round bowls."
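A short sketch of the three forms of str_split() described above, on assumed inputs:

```r
library(stringr)

# A list is returned, so take the first element for a length-1 input
str_split("a|b|c|d", "\\|")[[1]]           # "a" "b" "c" "d"

x <- c("The birch canoe", "Glue the sheet to")

# simplify = TRUE returns a matrix, padded with "" where needed
str_split(x, " ", simplify = TRUE)

# n limits the number of pieces; the remainder stays in the last piece
str_split(x, " ", n = 2, simplify = TRUE)
```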
In addition to patterns, you can also split a string by boundaries (character, line, sentence, and word).
[1] "This" "is" "a" "sentence." "This" "is"
[7] "another" "sentence"
[1] "This" "is" "a" "sentence" "This" "is" "another"
[8] "sentence"
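A sketch comparing a split on spaces with a split on word boundaries. The input string is an assumption:

```r
library(stringr)

x <- "This is a sentence. This is another sentence."

# Splitting on a space keeps the punctuation attached to the words
str_split(x, " ")[[1]]

# Splitting on word boundaries returns only the words
str_split(x, boundary("word"))[[1]]
```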
[1] "apples" "pears" "bananas"
[1] "apples" "pears" "and" "bananas"
boundary("word") doesn’t pick up extra characters.
[1] "fruits:" "apples," "peaches," "bananas," "and" "oranges!"
[1] "fruits" "apples" "peaches" "bananas" "and" "oranges"
[[1]]
[1] "T" "h" "e" " " "b" "i" "r" "c" "h" " " "c" "a" "n" "o" "e" " " "s" "l" "i"
[20] "d" " " "o" "n" " " "t" "h" "e" " " "s" "m" "o" "o" "t" "h" " " "p" "l" "a"
[39] "n" "k" "s" "."
[[2]]
[1] "G" "l" "u" "e" " " "t" "h" "e" " " "s" "h" "e" "e" "t" " " "t" "o" " " "t"
[20] "h" "e" " " "d" "a" "r" "k" " " "b" "l" "u" "e" " " "b" "a" "c" "k" "g" "r"
[39] "o" "u" "n" "d" "."
[[3]]
[1] "I" "t" "'" "s" " " "e" "a" "s" "y" " " "t" "o" " " "t" "e" "l" "l" " " "t"
[20] "h" "e" " " "d" "e" "p" "t" "h" " " "o" "f" " " "a" " " "w" "e" "l" "l" "."
[[4]]
[1] "T" "h" "e" "s" "e" " " "d" "a" "y" "s" " " "a" " " "c" "h" "i" "c" "k" "e"
[20] "n" " " "l" "e" "g" " " "i" "s" " " "a" " " "r" "a" "r" "e" " " "d" "i" "s"
[39] "h" "."
[[5]]
[1] "R" "i" "c" "e" " " "i" "s" " " "o" "f" "t" "e" "n" " " "s" "e" "r" "v" "e"
[20] "d" " " "i" "n" " " "r" "o" "u" "n" "d" " " "b" "o" "w" "l" "s" "."
Sometimes it is easier to just find the start and end of a pattern with str_locate(). You can then use str_sub() to extract the string.
start end
[1,] 6 5
[2,] 7 6
[3,] 5 4
[4,] 10 9
start end
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 5 5
start end
[1,] 1 1
[2,] 1 1
[3,] 1 1
[4,] 1 1
[[1]]
start end
[1,] 1 1
[[2]]
start end
[1,] 2 2
[2,] 4 4
[3,] 6 6
[[3]]
start end
[1,] 3 3
[[4]]
start end
[1,] 5 5
[[1]]
start end
[1,] 5 5
[[2]]
start end
[[3]]
start end
[1,] 2 2
[[4]]
start end
[1,] 4 4
[2,] 9 9
[[1]]
start end
[1,] 1 1
[[2]]
start end
[1,] 1 1
[[3]]
start end
[1,] 1 1
[[4]]
start end
[1,] 1 1
[2,] 6 6
[3,] 7 7
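A minimal sketch of combining str_locate() with str_sub(). The vector and pattern are assumptions, not the inputs behind the output above:

```r
library(stringr)

x <- c("apple", "banana", "pear")

# One row per string; NA where the pattern does not match
loc <- str_locate(x, "an")
loc

# Use the positions to pull the match back out
str_sub(x, loc[, "start"], loc[, "end"])  # NA "an" NA

# str_locate_all() returns every match, as a list of matrices
str_locate_all("banana", "an")[[1]]       # rows 2-3 and 4-5
```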
A pattern that’s supplied as a string is automatically wrapped in a call to regex().
You can call regex() directly to control the output. For example, if you don’t want case to matter:
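A sketch of case-insensitive matching with regex(). The example vector is an assumption:

```r
library(stringr)

bananas <- c("banana", "Banana", "BANANA")

str_detect(bananas, "banana")                             # TRUE FALSE FALSE
str_detect(bananas, regex("banana", ignore_case = TRUE))  # TRUE TRUE TRUE
```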
multiline=TRUE allows ^ and $ to match the start and end of each line rather than the start and end of a complete string.
Line 1
Line 2
Line 3
[1] "Line"
[1] "Line" "Line" "Line"
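A sketch of calls that produce output like the above, assuming a three-line input string:

```r
library(stringr)

x <- "Line 1\nLine 2\nLine 3"
writeLines(x)

# Without multiline, ^ matches only at the start of the whole string
str_extract_all(x, "^Line")[[1]]                           # "Line"

# With multiline = TRUE, ^ matches at the start of every line
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]  # "Line" "Line" "Line"
```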
comments=TRUE allows you to use comments and white space. It will ignore spaces and everything after #. To explicitly match a space you’ll need to escape it with "\\ ".
> phone <- regex("
+ \\(? #optional opening parentheses
+ (\\d{3}) #area code
+ [)- ]? #optional closing parentheses, dash, or space
+ (\\d{3}) #another 3 numbers
+ [ -]? #optional space or dash
+ (\\d{4}) #four more numbers
+ ", comments=TRUE)
>
> str_match("888-867-5309",phone)
[,1] [,2] [,3] [,4]
[1,] "888-867-5309" "888" "867" "5309"
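fixed() matches the exact sequence of bytes and ignores all special regular expression characters. It does not account for the fact that the same character can sometimes be represented by different byte sequences, but it is faster for basic matches.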
> microbenchmark::microbenchmark(
+ fixed = str_detect(sentences, fixed("the")),
+ regex = str_detect(sentences,"the"),
+ times=20
+ )
Unit: microseconds
expr min lq mean median uq max neval cld
fixed 75.0 79.10 104.365 86.80 106.35 266.2 20 a
regex 207.9 212.75 261.520 226.35 298.80 447.7 20 b
As with str_split(), you can also use boundary with other functions.
[[1]]
[1] "This" "is" "a" "sentence"
How would you find all strings containing \ with regex() versus fixed()?
[1] "1\\2" "x\\y"
> #fixed() ignores special characters so no need to escape backslash
> #it matches the exact text
> str_subset(strings, fixed("\\"))
[1] "1\\2" "x\\y"
> x <- str_extract_all(sentences, boundary("word")) %>% unlist
> x <- str_to_lower(x)
> y <- as_tibble(x) %>% rename(words=value)
> y %>% group_by(words) %>% count(sort=TRUE) %>% head(5)
# A tibble: 5 x 2
# Groups: words [5]
words n
<chr> <int>
1 the 751
2 a 202
3 of 132
4 to 123
5 and 118
apropos() searches all objects available from the global environment, that is, everything on the search path. This can be helpful when you can’t quite remember the name of a function.
[1] "%+replace%" "replace" "replace_na" "setReplaceMethod"
[5] "str_replace" "str_replace_all" "str_replace_na" "theme_replace"
dir() lists all the files in a directory. You can also search by a pattern.
[1] "Strings.Rmd" "Strings_Copy.Rmd"
[3] "StringsExercises _Copy.Rmd" "StringsExercises.Rmd"
Finally, it is important to note that stringr is built on top of the stringi package, which is more comprehensive. It may have a solution that isn’t readily available with stringr.