The package stringr provides functions for both 1) basic manipulations and 2) regular expression operations. First, we are going to cover those functions that have to do with basic manipulations.
The following table shows some stringr functions for basic string operations:
Function | Description | Similar Base Functions |
---|---|---|
str_length() | number of characters | nchar() |
str_split() | split up a string into pieces | strsplit() |
str_c() | string concatenation | paste() |
str_squish() | removes any redundant white space | |
str_detect() | finds a particular pattern of characters | |
str_view_all() | show the matching result on the actual screen | |
Note that all functions in stringr start with "str_", followed by a term related to the task they perform.
str_length()
The function str_length() is the stringr counterpart of the base function nchar(). Both functions return an integer vector giving the number of characters in each string, that is, the length of each string. Do not confuse this with length(), which returns the number of elements in a vector.
library(stringr)
str_length(c("abc", "adfds ", "1234343", ".!@#"))
## [1] 3 6 7 4
nchar(c("abc", "adfds ", "1234343", ".!@#"))
## [1] 3 6 7 4
length(c("abc", "adfds ", "1234343", ".!@#"))
## [1] 4
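The two functions do not agree in every corner case. The short sketch below uses a made-up vector `x` (not part of the COVID data) to show how each treats missing values, and how counting characters differs from counting bytes for a string containing the non-breaking hyphen U+2011 that appears in the text used later.

```r
library(stringr)

# Hypothetical input: a normal string, a missing value, and an empty string
x <- c("abc", NA, "")

len_stringr <- str_length(x)  # missing values stay missing
len_base    <- nchar(x)       # same here: NA inside a character vector stays NA

# A bare logical NA is different: nchar() coerces it to the string "NA"
na_base    <- nchar(NA)       # counts the two letters of "NA"
na_stringr <- str_length(NA)  # returns a missing value

# Characters versus bytes: U+2011 is one character but three bytes in UTF-8
hyphenated <- "COVID\u201119"
n_chars <- str_length(hyphenated)
n_bytes <- nchar(hyphenated, type = "bytes")
```

If the input may contain missing values, str_length() is the safer default, because it never silently counts the letters of "NA".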
load("covid_text_word.RData")
covid_text_word[1:100] # First 100 elements in the character vector, covid_text_word
## [1] "COVID-19" "vaccine\r\nA"
## [3] "COVID<U+2011>19" "vaccine"
## [5] "is" "a"
## [7] "vaccine" "intended"
## [9] "to" "provide"
## [11] "acquired" "immunity\r\nagainst"
## [13] "severe" "acute"
## [15] "respiratory" "syndrome"
## [17] "coronavirus" "2"
## [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"
## [21] "causing" "coronavirus"
## [23] "disease" "2019"
## [25] "(COVID<U+2011>19)." "Prior"
## [27] "to" "the\r\nCOVID<U+2011>19"
## [29] "pandemic," "there"
## [31] "was" "an"
## [33] "established" "body"
## [35] "of" "knowledge"
## [37] "about\r\nthe" "structure"
## [39] "and" "function"
## [41] "of" "coronaviruses"
## [43] "causing" "diseases"
## [45] "like" "severe"
## [47] "acute\r\nrespiratory" "syndrome"
## [49] "(SARS)" "and"
## [51] "Middle" "East"
## [53] "respiratory" "syndrome\r\n(MERS),"
## [55] "which" "enabled"
## [57] "accelerated" "development"
## [59] "of" "various"
## [61] "vaccine\r\ntechnologies" "during"
## [63] "early" "2020.[1]"
## [65] "On" "10"
## [67] "January" "2020,"
## [69] "the" "SARS-CoV-2\r\ngenetic"
## [71] "sequence" "data"
## [73] "was" "shared"
## [75] "through" "GISAID,"
## [77] "and" "by"
## [79] "19" "March,"
## [81] "the\r\nglobal" "pharmaceutical"
## [83] "industry" "announced"
## [85] "a" "major"
## [87] "commitment" "to"
## [89] "address\r\nCOVID-19.[2]" ""
## [91] "" ""
## [93] "" ""
## [95] "" ""
## [97] "" ""
## [99] "" ""
str_length(covid_text_word[1:100]) # Some elements are empty strings (length 0)
## [1] 8 10 8 7 2 1 7 8 2 7 8 17 6 5 11 8 11 1 13 10 7 11 7 4 11
## [26] 5 2 13 9 5 3 2 11 4 2 9 10 9 3 8 2 13 7 8 4 6 18 8 6 3
## [51] 6 4 11 17 5 7 11 11 2 7 21 6 5 8 2 2 7 5 3 19 8 4 3 6 7
## [76] 7 3 2 2 6 11 14 8 9 1 5 10 2 21 0 0 0 0 0 0 0 0 0 0 0
str_length(covid_text_word[1:100])>0
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [37] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [73] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [85] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE
covid_text_word[1:100][str_length(covid_text_word[1:100])>0]
## [1] "COVID-19" "vaccine\r\nA"
## [3] "COVID<U+2011>19" "vaccine"
## [5] "is" "a"
## [7] "vaccine" "intended"
## [9] "to" "provide"
## [11] "acquired" "immunity\r\nagainst"
## [13] "severe" "acute"
## [15] "respiratory" "syndrome"
## [17] "coronavirus" "2"
## [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"
## [21] "causing" "coronavirus"
## [23] "disease" "2019"
## [25] "(COVID<U+2011>19)." "Prior"
## [27] "to" "the\r\nCOVID<U+2011>19"
## [29] "pandemic," "there"
## [31] "was" "an"
## [33] "established" "body"
## [35] "of" "knowledge"
## [37] "about\r\nthe" "structure"
## [39] "and" "function"
## [41] "of" "coronaviruses"
## [43] "causing" "diseases"
## [45] "like" "severe"
## [47] "acute\r\nrespiratory" "syndrome"
## [49] "(SARS)" "and"
## [51] "Middle" "East"
## [53] "respiratory" "syndrome\r\n(MERS),"
## [55] "which" "enabled"
## [57] "accelerated" "development"
## [59] "of" "various"
## [61] "vaccine\r\ntechnologies" "during"
## [63] "early" "2020.[1]"
## [65] "On" "10"
## [67] "January" "2020,"
## [69] "the" "SARS-CoV-2\r\ngenetic"
## [71] "sequence" "data"
## [73] "was" "shared"
## [75] "through" "GISAID,"
## [77] "and" "by"
## [79] "19" "March,"
## [81] "the\r\nglobal" "pharmaceutical"
## [83] "industry" "announced"
## [85] "a" "major"
## [87] "commitment" "to"
## [89] "address\r\nCOVID-19.[2]"
length(covid_text_word[1:100])
## [1] 100
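The filtering step above can also be written with the base function nzchar(), which returns TRUE for every non-empty string. The sketch below uses a small hypothetical vector `words` as a stand-in for covid_text_word, so it runs without the .RData file.

```r
library(stringr)

# Hypothetical stand-in for covid_text_word: words mixed with empty strings
words <- c("address", "", "COVID-19", "", "")

# stringr version: keep elements with at least one character
non_empty <- words[str_length(words) > 0]

# Base R version: nzchar() tests for non-empty strings directly
non_empty_base <- words[nzchar(words)]

# Both approaches keep the same elements
same_result <- identical(non_empty, non_empty_base)

# length() counts elements, so it shrinks after filtering
n_before <- length(words)
n_after  <- length(non_empty)
```

Note that length() reports 100 for covid_text_word[1:100] regardless of how many elements are empty, because it counts elements rather than characters.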
str_split()
The function str_split() is the stringr counterpart of the base function strsplit(). Both functions split each string into a variable number of pieces and return a list of character vectors, with one list element per input string.
strsplit("COVID-19 vaccine\r\nA COVID‑19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2), the\r\nvirus causing coronavirus disease 2019 (COVID‑19).", split=" ") # Split up the string vector into pieces by a blank
## [[1]]
## [1] "COVID-19" "vaccine\r\nA"
## [3] "COVID<U+2011>19" "vaccine"
## [5] "is" "a"
## [7] "vaccine" "intended"
## [9] "to" "provide"
## [11] "acquired" "immunity\r\nagainst"
## [13] "severe" "acute"
## [15] "respiratory" "syndrome"
## [17] "coronavirus" "2"
## [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"
## [21] "causing" "coronavirus"
## [23] "disease" "2019"
## [25] "(COVID<U+2011>19)."
str_split("COVID-19 vaccine\r\nA COVID‑19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2), the\r\nvirus causing coronavirus disease 2019 (COVID‑19).", pattern=" ") # Do the same
## [[1]]
## [1] "COVID-19" "vaccine\r\nA"
## [3] "COVID<U+2011>19" "vaccine"
## [5] "is" "a"
## [7] "vaccine" "intended"
## [9] "to" "provide"
## [11] "acquired" "immunity\r\nagainst"
## [13] "severe" "acute"
## [15] "respiratory" "syndrome"
## [17] "coronavirus" "2"
## [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"
## [21] "causing" "coronavirus"
## [23] "disease" "2019"
## [25] "(COVID<U+2011>19)."
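Because the result is a list with one element per input string, it helps to see str_split() applied to more than one string. The sketch below uses a hypothetical two-element vector `sentences` and also illustrates the `n` and `simplify` arguments of str_split(), which limit the number of pieces and return a character matrix instead of a list.

```r
library(stringr)

# Hypothetical input with two strings of different word counts
sentences <- c("COVID-19 vaccine", "severe acute respiratory syndrome")

# One list element per input string
pieces <- str_split(sentences, pattern = " ")
n_elements <- length(pieces)

# n = 2 stops after the first split, leaving the remainder in one piece
first_and_rest <- str_split(sentences, pattern = " ", n = 2)

# simplify = TRUE returns a matrix; shorter rows are padded with ""
piece_matrix <- str_split(sentences, pattern = " ", simplify = TRUE)
```

The matrix form is convenient when every string is expected to have the same number of fields; for free text such as the COVID paragraph, the list form is usually easier to work with.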
str_c()
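The function str_c() is the stringr counterpart of the base function paste(). Note that its default separator is the empty string, so without a sep argument it behaves like paste0() rather than paste(). The examples below use the collapse argument, for which the two functions give the same result.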
covid_sent <- str_split("COVID-19 vaccine\r\nA COVID‑19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2), the\r\nvirus causing coronavirus disease 2019 (COVID‑19).", pattern=" ") # Split up the vector of strings from the first sentence into words
class(covid_sent)
## [1] "list"
class(unlist(covid_sent))
## [1] "character"
paste(unlist(covid_sent), collapse = " ") # Concatenate the character vector to a string
## [1] "COVID-19 vaccine\r\nA COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS<U+2011>CoV<U+2011>2), the\r\nvirus causing coronavirus disease 2019 (COVID<U+2011>19)."
str_c(unlist(covid_sent), collapse = " ")
## [1] "COVID-19 vaccine\r\nA COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS<U+2011>CoV<U+2011>2), the\r\nvirus causing coronavirus disease 2019 (COVID<U+2011>19)."
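The two functions differ in their defaults, which matters when they are used without collapse. The following sketch, with made-up inputs, compares the default separators and the treatment of missing values.

```r
library(stringr)

# Default separators differ: paste() uses " ", str_c() uses ""
with_paste  <- paste("COVID", "19")
with_str_c  <- str_c("COVID", "19")
with_paste0 <- paste0("COVID", "19")

# An explicit sep makes the two functions agree
sep_str_c <- str_c("COVID", "19", sep = "-")
sep_paste <- paste("COVID", "19", sep = "-")

# Missing values: paste() turns NA into the text "NA", str_c() returns NA
na_paste <- paste("a", NA)
na_str_c <- str_c("a", NA)
```

The stricter handling of missing values in str_c() makes missing input visible in the result instead of hiding it inside a string.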
str_squish()
One of the typical tasks of string processing is parsing a text into individual words. The resulting words often carry leading or trailing blanks (whitespace), and some sentences contain long runs of whitespace, including tabs and line breaks. In this situation, we can use the str_squish() function, which removes whitespace from both ends of a string and replaces each run of whitespace inside the string with a single space. Its usage is very simple:
str_squish(string)
The input is the string to be cleaned, and any redundant whitespace is removed.
Consider the following string, which contains line breaks ("\r\n") within the text.
covid_sent_trim <- str_squish("COVID-19 vaccine\r\nA COVID‑19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2), the\r\nvirus causing coronavirus disease 2019 (COVID‑19).")
covid_sent_trim
## [1] "COVID-19 vaccine A COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity against severe acute respiratory syndrome coronavirus 2 (SARS<U+2011>CoV<U+2011>2), the virus causing coronavirus disease 2019 (COVID<U+2011>19)."
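To see what "redundant" means here, it helps to compare str_squish() with str_trim(), another stringr function that only strips whitespace at the two ends. The string `messy` below is a made-up example.

```r
library(stringr)

# Hypothetical string with leading/trailing blanks, repeated blanks, and a line break
messy <- "  COVID-19   vaccine\r\nA  "

# str_trim() removes whitespace at the ends only
trimmed <- str_trim(messy)

# str_squish() also collapses every internal run of whitespace to one space
squished <- str_squish(messy)
```

Here `trimmed` still contains the three blanks and the line break in the middle, whereas `squished` is "COVID-19 vaccine A".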
But what about punctuation marks and numbers? The functions covered so far cannot handle them:
- Punctuation marks: we cannot specify which punctuation marks appear and should be removed.
- Numbers: we cannot specify which numbers appear and should be removed.
To solve these problems, we need to take things to the next level.