library(tidyverse)
library(nycflights13)
Strings are collections of characters, used to store “text data” or any data stored in text form. Being skilled at handling strings is very important in data science. In this module, we will study
Basics in string manipulation in R
Basics in regular expressions (regexps)
You can create strings with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour.
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
If you forget to close a quote, you’ll see +, the continuation character:
> "This is a string without a closing quote
+
+
+ HELP I'M STUCK
If this happens to you, just press Esc (Escape) and try again!
To include a literal single or double quote in a string, you can use \ to “escape” it:
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
If you need a literal backslash, you have to escape it too:
backslash <- "\\"
An important thing to know about strings is that they have a literal value (what they actually are) and a representation (how you type them into a programming language). Each literal value pairs with a representation. For example, for the literal value \ , we must input "\\" in R.
Unlike in Python, the print() function in R shows the representation. We need to use the function writeLines() to show the literal value of a string.
print("\\")
## [1] "\\"
writeLines("\\")
## \
Like many other programming languages, R uses the backslash to start an escape sequence inside a string:
Representation | Literal value |
---|---|
\n | new line |
\t | tab character |
\\ | backslash \ |
\" | double quotation marks " |
\' | single quotation marks ' |
\` | backticks ` |
For the full table of escape sequences, you may check the help documentation of quotes.
help("'")
For example, if we hope to write a string with literal value of
"\"
, we need to write
my_string <- "\"\\\""
writeLines(my_string)
## "\"
Write a string whose literal value is \\\ .
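One possible answer, as a sketch, assuming the intended literal value is three backslashes (each backslash must be written as "\\" in the representation):
my_string <- "\\\\\\"
writeLines(my_string)
## \\\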
Every character has a Unicode code point, and we can print characters from their codes in R:
writeLines("\u00b5") # The greek letter "mu"
## µ
writeLines("\xe4\xbd\xa0\xe5\xa5\xbd") # The Chinese "你好"
## 你好
writeLines("\u2660") # Spade symbol of a card suit
## ♠
There are many online encoders to convert any character into UTF-8 codes.
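If you prefer to stay in R, base R can also convert between characters and their code points; a small sketch:
utf8ToInt("\u00b5")                 # integer code point of the character
## [1] 181
sprintf("\\u%04x", utf8ToInt("\u00b5")) # format it as a \u escape
## [1] "\\u00b5"
intToUtf8(0x2660)                   # and back from a code point to the character
## [1] "♠"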
Multiple strings are often stored in a character vector, which you can create with c():
string_vector <- c("One", "Two", "Three")
print(string_vector)
## [1] "One" "Two" "Three"
stringr
Base R contains many functions to work with strings but we’ll avoid
them because they can be inconsistent, which makes them hard to
remember. Instead we’ll use functions from stringr
. These
have more intuitive names, and all start with str_
. For
example, str_length() tells you the number of characters in a
string:
str_length(c("a", "R for data science", NA))
## [1] 1 18 NA
To combine two or more strings, use
str_c()
:
str_c("x","y","z")
## [1] "xyz"
Use the sep
argument to control how they’re
separated:
str_c("x","y","z", sep = "+")
## [1] "x+y+z"
str_c()
is vectorised, and it automatically recycles
shorter vectors to the same length as the longest:
str_c("b", c("a", "e", "u"), "g")
## [1] "bag" "beg" "bug"
To collapse a vector of strings into a single string, use the collapse argument:
str_c(c("x", "y", "z"), collapse = ",")
## [1] "x,y,z"
You can extract parts of a string using str_sub()
. As
well as the string, str_sub()
takes start
and
end
arguments which give the (inclusive) position of the
substring:
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
## [1] "App" "Ban" "Pea"
# negative numbers count backwards from end
str_sub(x, -3, -1)
## [1] "ple" "ana" "ear"
Note that str_sub()
won’t fail if the string is too
short: it will just return as much as possible:
str_sub("a", 1, 5)
## [1] "a"
You can also use the assignment form of str_sub() to modify part of a string:
x = "There is a typo in the word studant"
str_sub(x, -3, -3) <- "e"
x
## [1] "There is a typo in the word student"
The str_to_lower() and str_to_upper() functions convert text to lower and upper case respectively.
x <- "China"
str_to_lower(x)
## [1] "china"
str_to_upper(x)
## [1] "CHINA"
The str_sort() function sorts a vector of strings in alphabetical order, either increasing (the default) or decreasing.
str_sort(c("apple", "orange", "banana"))
## [1] "apple" "banana" "orange"
str_sort(c("apple", "orange", "banana"), decreasing = TRUE)
## [1] "orange" "banana" "apple"
stringr data
To exercise string manipulation, we will use the three data sets pre-loaded in the stringr package: words, fruit and sentences.
- words contains the 980 most commonly used English words
- fruit contains the names of 80 fruits
- sentences contains 720 English sentences used for standardised voice testing (the “Harvard sentences”)
Let’s play with them - first, find the longest word in the words data set:
word_data <- as_tibble(words) %>%
mutate(length = str_length(value)) %>%
arrange(desc(length)) %>%
print()
## # A tibble: 980 × 2
## value length
## <chr> <int>
## 1 appropriate 11
## 2 environment 11
## 3 opportunity 11
## 4 responsible 11
## 5 department 10
## 6 difference 10
## 7 experience 10
## 8 individual 10
## 9 particular 10
## 10 photograph 10
## # ℹ 970 more rows
Now, let’s say we hope to find all words with certain patterns, such as:
- “an” as part of the word
- “a”s in the word
- “e” in the word, and longer than six letters
How do we do these jobs? We need our next topic - regular expressions.
Regular expressions, in short “regexps” or “regex”, are a mini programming language that allows you to describe patterns in strings. They are very powerful for handling file names, folder names, text, or any job related to strings.
For example, earlier we tidied the who data about TB cases in different countries and years. One step there was to separate a “key” column into a few different ones:
who1 <- who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
)
who1
## # A tibble: 76,046 × 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1997 new_sp_m1524 10
## 3 Afghanistan AF AFG 1997 new_sp_m2534 6
## 4 Afghanistan AF AFG 1997 new_sp_m3544 3
## 5 Afghanistan AF AFG 1997 new_sp_m4554 5
## 6 Afghanistan AF AFG 1997 new_sp_m5564 2
## 7 Afghanistan AF AFG 1997 new_sp_m65 0
## 8 Afghanistan AF AFG 1997 new_sp_f014 5
## 9 Afghanistan AF AFG 1997 new_sp_f1524 38
## 10 Afghanistan AF AFG 1997 new_sp_f2534 36
## # ℹ 76,036 more rows
Previously we used the mutate() and separate() functions to split the key column into type, Gender and Age_Group. After learning regular expressions, we will be able to do all of this in just one line:
tidyr::extract(who1, key, c("type", "Gender", "Age_Group"), "new[_]?(.*)_(m|f)(\\d*)", remove = F) -> who1
who1
## # A tibble: 76,046 × 9
## country iso2 iso3 year key type Gender Age_Group cases
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan AF AFG 1997 new_sp_m014 sp m 014 0
## 2 Afghanistan AF AFG 1997 new_sp_m1524 sp m 1524 10
## 3 Afghanistan AF AFG 1997 new_sp_m2534 sp m 2534 6
## 4 Afghanistan AF AFG 1997 new_sp_m3544 sp m 3544 3
## 5 Afghanistan AF AFG 1997 new_sp_m4554 sp m 4554 5
## 6 Afghanistan AF AFG 1997 new_sp_m5564 sp m 5564 2
## 7 Afghanistan AF AFG 1997 new_sp_m65 sp m 65 0
## 8 Afghanistan AF AFG 1997 new_sp_f014 sp f 014 5
## 9 Afghanistan AF AFG 1997 new_sp_f1524 sp f 1524 38
## 10 Afghanistan AF AFG 1997 new_sp_f2534 sp f 2534 36
## # ℹ 76,036 more rows
The odd-looking string here uses a regular expression to identify the particular patterns needed to decode the key column. Regular expressions may take some time to get used to, but once you understand how they work you will find them very useful and fun to work with.
To learn regular expressions, we’ll use str_view() as the starting point. str_view() takes a character vector and a regular expression, and shows you how they match.
Let’s start with the simplest case: matching exact letters.
x <- c("apple", "banana", "pear")
str_view(x, "a", html = TRUE)
x <- c("apple", "banana", "pear")
str_view(x, "an", html = TRUE, match = NA)
So we see that the function highlights any part of the words that
matches the given pattern, which is exactly "a"
or
"an"
in this case.
The template for str_view() is:
str_view(string, pattern, match = TRUE, html = FALSE)
Here string is the string (or vector of strings) to inspect, and pattern is a regular expression describing what we are looking for. match controls which strings are printed (only those with a match, only those without, or all of them regardless), and html should only be TRUE when we want to render the result in a web page (such as a knitted R Markdown document).
. matches any single character (except a new line)
Now let’s study the mini language of regular expressions. First, a mere . in a regular expression represents any single character excluding a new line. For example,
str_view(x, ".a.", match = NA, html = TRUE)
matches any three characters with “a” in the middle.
Similarly, ... matches any three consecutive characters.
x <- c("a", "ab", "abc", "abcd")
str_view(x, "...", match = NA, html = TRUE)
However, . does not match the newline character \n:
x <- 'ab\ncd'
writeLines(x)
## ab
## cd
str_view(x, "...", match = NA, html = TRUE)
Before we learn more ways to describe patterns, we need to learn the escape sequences used in regular expressions. We now know that . represents any single character, but how do we then express a literal . itself? We have to use the escape sequence \. to represent a literal dot.
However, if we try this in R, we get an error message:
x <- c("2.357", "apple")
str_view(x, "\.", match = NA, html = TRUE)
Why doesn’t this work? The reason is that we use a string to represent the regular expression \. , and to write the literal characters \. as a string we need the representation "\\." , as we learned above. So the right thing to do is:
x <- c("2.357", "apple")
str_view(x, "\\.", match = NA, html = TRUE)
In summary, there are two ways to write down a regular expression:
- as its literal value, e.g. . or \.
- as its string representation in R, e.g. "." or "\\." , where we must enclose it in a pair of quotation marks
In this textbook we may use both ways to write a regular expression. But remember that in R, we have to use the second way as the input to str_view() or other functions that work on regular expressions.
The table below summarises this.
Literal value of regular expression | String representation in R | Meaning |
---|---|---|
. |
"." |
Any single character (excluding new line) |
\. |
"\\." |
A literal . |
\\ |
"\\\\" |
A literal \ |
" |
'"' or "\\\"" |
A literal " |
' |
"'" or '\\\'' |
A literal ' |
\d |
"\\d" |
any digit (0-9) |
\s |
"\\s" |
any white space |
The following regular expression matches a literal
\
:
x <- c("a\\b", "a/b")
writeLines(x)
## a\b
## a/b
str_view(x, "\\\\", match = NA, html = TRUE)
The following regular expression matches patterns like 2.3 (a digit, a literal dot, then another digit):
x <- c("30", "2.56", "1e5", "0.5943")
str_view(x, "\\d\\.\\d", match = NA, html = TRUE)
Exercises:
- How would you match the sequence "'\ ?
- What patterns will the regular expression \..\..\.. match? How would you represent it as a string?
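Possible answers, as a sketch (not the only way): to match the sequence "'\ , the regular expression is "'\\ , whose R string representation is "\"'\\\\" ; and \..\..\.. matches a literal dot followed by any character, repeated three times (e.g. ".a.b.c"), written as the string "\\..\\..\\.." :
x <- "a\"'\\b"
writeLines(x)
## a"'\b
str_view(x, "\"'\\\\", match = NA, html = TRUE)
str_view(c(".a.b.c", "abc"), "\\..\\..\\..", match = NA, html = TRUE)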
^ and $
By default, regular expressions will match any part of a string. It’s often useful to anchor the regular expression so that it matches from the start or the end of the string. You can use:
- ^ to match the start of the string
- $ to match the end of the string
For example, to match words starting or ending with the letter “a”, we can do:
x <- c("apple", "banana")
str_view(x, "^a", match = NA, html = TRUE)
str_view(x, "a$", match = NA, html = TRUE)
Our textbook provides an interesting mnemonic from Evan Misshula to
help remember this: if you begin with power (^
), you end up
with money ($
).
To force a regular expression to only match a complete string, anchor
it with both ^
and $
:
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple", match = NA, html = TRUE)
str_view(x, "^apple$", match = NA, html = TRUE)
Now let’s get all words in words that end with “a”. We can use the str_detect() function to do the filtering; it returns TRUE or FALSE for each string in the vector.
x <- c("apple", "banana")
str_detect(x, "a$")
## [1] FALSE TRUE
word_data %>%
filter(str_detect(value, "a$")) %>%
print()
## # A tibble: 6 × 2
## value length
## <chr> <int>
## 1 america 7
## 2 extra 5
## 3 area 4
## 4 idea 4
## 5 tea 3
## 6 a 1
Exercises:
- Find … in the words data set.
- Find … in the words data set, with and without using the str_length() function.
$ and ^
Since $ and ^ have special meanings in regular expressions, we have to use escape sequences to represent the literal characters $ and ^ as well. The same applies to all other special symbols of this type that we will meet.
# To match the literal $^$
x <- c("a$^$b")
writeLines(x)
## a$^$b
str_view(x, "\\$\\^\\$", match = NA, html = TRUE)
There are a number of special patterns that match more than one character. You’ve already seen . , which matches any character apart from a newline. There are four other useful tools:
- \d : matches any digit
- \s : matches any whitespace (e.g. space, tab, newline)
- [abc] : matches a, b, or c
- [^abc] : matches anything except a, b, or c
Note that ^ has a different meaning inside [] than outside. Also, to use \d and \s in string representations, we must write "\\d" and "\\s".
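A quick sketch of these classes in action (note the doubled backslashes in the string representations of \d and \s):
x <- c("abc123", "a b c", "xyz")
str_view(x, "\\d", match = NA, html = TRUE)    # a digit
str_view(x, "\\s", match = NA, html = TRUE)    # a whitespace character
str_view(x, "[abc]", match = NA, html = TRUE)  # one of a, b or c
str_view(x, "[^abc]", match = NA, html = TRUE) # anything except a, b or c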
A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex. Many people find this more readable.
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c", match = NA, html = TRUE)
The code above finds the same pattern as a\.c
, but its
string representation is more readable than "a\\.c"
.
This works for most (but not all) regex metacharacters:
$ . | ? * + ( ) [ {
. Unfortunately, a few characters have
special meaning even inside a character class and must be handled with
backslash escapes: ]
\
^
and
-
.
Question: Find all words in words that start with a vowel (“a”, “e”, “i”, “o” or “u”).
Solution:
word_data %>%
filter(str_detect(value, "^[aeiou]")) %>%
print()
## # A tibble: 175 × 2
## value length
## <chr> <int>
## 1 appropriate 11
## 2 environment 11
## 3 opportunity 11
## 4 experience 10
## 5 individual 10
## 6 understand 10
## 7 university 10
## 8 advertise 9
## 9 afternoon 9
## 10 associate 9
## # ℹ 165 more rows
Question: Find all words in words that end with ed, but not with eed.
Solution:
word_data %>%
filter(str_detect(value, "ed$")) %>% # This only finds words ending with "ed"
print() %>%
filter(str_detect(value, "[^e]ed$")) %>%
print()
## # A tibble: 9 × 2
## value length
## <chr> <int>
## 1 hundred 7
## 2 proceed 7
## 3 succeed 7
## 4 indeed 6
## 5 speed 5
## 6 feed 4
## 7 need 4
## 8 bed 3
## 9 red 3
## # A tibble: 3 × 2
## value length
## <chr> <int>
## 1 hundred 7
## 2 bed 3
## 3 red 3
One can use alternation to pick between one or more
alternative patterns. For example, abc|d..f
will match
either abc
, or deaf
. Note that the precedence
for |
is low, so that abc|xyz
matches
abc
or xyz
not abcyz
or
abxyz
. Like with mathematical expressions, if precedence
ever gets confusing, use parentheses to make it clear what you want:
str_view(c("grey", "gray"), "gr(e|a)y", match = NA, html = TRUE)
Question: Find all words in words that end with ing or ise.
word_data %>%
filter(str_detect(value, "(ing|ise)$")) %>%
print()
## # A tibble: 17 × 2
## value length
## <chr> <int>
## 1 advertise 9
## 2 otherwise 9
## 3 exercise 8
## 4 practise 8
## 5 surprise 8
## 6 evening 7
## 7 meaning 7
## 8 morning 7
## 9 realise 7
## 10 during 6
## 11 bring 5
## 12 raise 5
## 13 thing 5
## 14 king 4
## 15 ring 4
## 16 rise 4
## 17 sing 4
Find all words with the first letter being “a” or “e”, and the third letter being “r” or “s”.
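One possible answer, as a sketch: the first character must be “a” or “e”, the second can be anything, and the third must be “r” or “s”.
word_data %>%
  filter(str_detect(value, "^[ae].[rs]")) %>%
  print()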
Next, let’s see how to describe patterns that repeat a certain (exact or flexible) number of times. The following symbols specify how many times the previous character may repeat:
- ? : 0 or 1
- + : 1 or more
- * : 0 or more
- {n} : exactly n times
For example, if we hope to search for words starting with “a” and ending with “e”, we may do:
str_view(words, "^a.*e$", html = TRUE)
Here .* matches any number of characters (including none), since . matches any single character and * means “repeated zero or more times”.
As another example, if we hope to search for words that have three consecutive vowels (for example, “iou”), we can do:
str_view(words, "[aeiou]{3}", html = TRUE)
Here [aeiou] matches a single vowel, and {3} means it must repeat three times.
Exercise: describe in words what these regular expressions match:
- ^.*$
- \d{4}-\d{2}-\d{2}
- "\\\\{4}"
There are a few more repetition specifiers:
- {n,} : n times or more
- {,m} : at most m times
- {n,m} : between n and m times
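A small sketch of the bounded quantifiers:
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "C{2}", match = NA, html = TRUE)   # exactly two C's
str_view(x, "C{2,}", match = NA, html = TRUE)  # two or more C's
str_view(x, "C{2,3}", match = NA, html = TRUE) # between two and three C's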
Parentheses () in regular expressions create numbered capturing groups. This is useful when we want to refer to exactly the same text again later.
For example, how do we search for words in which the same letter out of “a”, “e” or “i” appears at least three times? We may do the following:
str_view(words, "([aei]).*\\1.*\\1", html = TRUE)
Here .*
refers to any character of any length as seen
before. ([aei])
refers to either “a” or “e” or “i”, and
"\\1"
which is \1
in value refers to
the same letter in ()
occurring again.
As another example, the following regular expression finds all fruits that have a repeated pair of letters.
str_view(fruit, "(..)\\1", html = TRUE)
Exercises:
1. Describe what these regular expressions match: (.)\1\1 and "(.)(.)\\2\\1".
2. Construct regular expressions to match words that start and end with the same character.
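A sketch of possible answers: (.)\1\1 matches the same character repeated three times in a row, "(.)(.)\\2\\1" matches a pair of characters followed by the same pair in reverse order (such as “abba”), and for exercise 2 one option is to capture the first character and require it again at the end (note this particular pattern does not match one-letter words):
str_view(words, "^(.).*\\1$", html = TRUE)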
Next, let’s learn how to apply regular expressions to real problems.
We will learn stringr
functions that help
Detect which strings match a pattern.
Find the positions of matches.
Extract the content of matches.
Replace matches with new values.
Split a string based on a match.
To determine if a character vector matches a pattern, use
str_detect()
. It returns a logical vector the same length
as the input. When there is a match, a TRUE
will be
returned; otherwise it will be FALSE
.
x <- c("apple", "banana", "pear")
str_detect(x, "e")
## [1] TRUE FALSE TRUE
We can then use the sum() and mean() functions to answer how many strings match the given pattern, and what proportion of strings in the vector match:
# How many common words start with t?
sum(str_detect(words, "^t"))
## [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
## [1] 0.2765306
Exercise: find how many words end with “e*e”, where “*” can be any single letter.
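A sketch of one possible answer: “e”, then any single character, then “e” at the end of the word.
sum(str_detect(words, "e.e$"))
str_subset(words, "e.e$")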
A common use of str_detect()
is to select the elements
that match a pattern. You can do this with logical subsetting, or the
convenient str_subset()
wrapper:
words[str_detect(words, "x$")]
## [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
## [1] "box" "sex" "six" "tax"
Typically, however, your strings will be one column of a data frame.
We will use filter
together with
str_detect()
:
df <- tibble(
word = words,
i = seq_along(word)
)
df %>%
filter(str_detect(word, "x$"))
## # A tibble: 4 × 2
## word i
## <chr> <int>
## 1 box 108
## 2 sex 747
## 3 six 772
## 4 tax 841
Here the seq_along()
function returns the sequence
number of each word in the list.
A variation on str_detect()
is str_count()
:
rather than a simple yes or no, it tells you how many matches there are
in a string:
x <- c("apple", "banana", "pear")
str_count(x, "a")
## [1] 1 3 1
# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
## [1] 1.991837
It’s natural to use str_count()
with
mutate()
:
df %>%
mutate(
vowels = str_count(word, "[aeiou]"),
consonants = str_count(word, "[^aeiou]")
)
## # A tibble: 980 × 4
## word i vowels consonants
## <chr> <int> <int> <int>
## 1 a 1 1 0
## 2 able 2 2 2
## 3 about 3 3 2
## 4 absolute 4 4 4
## 5 accept 5 2 4
## 6 account 6 3 4
## 7 achieve 7 4 3
## 8 across 8 2 4
## 9 act 9 1 2
## 10 active 10 3 3
## # ℹ 970 more rows
Now let’s look at an example with the sentences
data
set. First, let’s find the longest sentences
df1 <- tibble(
sentence = sentences,
word_number = str_count(sentence, "\\s") + 1
)
df1 %>%
arrange(desc(word_number)) %>%
print()
## # A tibble: 720 × 2
## sentence word_number
## <chr> <dbl>
## 1 It was hidden from sight by a mass of leaves and shrubs. 12
## 2 It was a bad error on the part of the new judge. 12
## 3 A ridge on a smooth surface is a bump or flaw. 11
## 4 The barrel of beer was a brew of malt and hops. 11
## 5 The crunch of feet in the snow was the only sound. 11
## 6 The vane on top of the pole revolved in the wind. 11
## 7 The bills were mailed promptly on the tenth of the month. 11
## 8 In the rear of the ground floor was a large passage. 11
## 9 The water in this well is a source of good health. 11
## 10 He wrote his name boldly at the top of the sheet. 11
## # ℹ 710 more rows
So we turn the character vector into a tibble, count the number of words in each sentence by counting the white spaces and adding one, and then arrange the rows in descending order of word count.
Next let’s find the sentences that do not have “a”, “an” or “the”:
df1 %>%
filter(!str_detect(str_to_lower(sentence), "( a )|( the )|( an )")) %>%
print()
## # A tibble: 254 × 2
## sentence word_number
## <chr> <dbl>
## 1 Rice is often served in round bowls. 7
## 2 The juice of lemons makes fine punch. 7
## 3 The hogs were fed chopped corn and garbage. 8
## 4 Four hours of steady work faced us. 7
## 5 A large size in stockings is hard to sell. 9
## 6 A rod is used to catch pink salmon. 8
## 7 Smoky fires lack flame and heat. 6
## 8 The swan dive was far short of perfect. 8
## 9 Her purse was full of useless trash. 7
## 10 Read verse out loud for pleasure. 6
## # ℹ 244 more rows
Here we use ! as a logical NOT to exclude the given patterns. Note that we must include the surrounding spaces in the parentheses so that we only capture the standalone words “a”, “an” or “the”.
Exercise: find all sentences in sentences that contain neither “r” nor “s”.
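One possible answer, as a sketch (ignoring letter case):
df1 %>%
  filter(!str_detect(str_to_lower(sentence), "[rs]")) %>%
  print()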
In data cleaning tasks, we often need to extract the actual text of a match. For example, suppose we want to know whether “q” is always followed by “u” in a word. In that case, we can use str_extract().
q_string <- str_extract(words, "q.")
head(q_string)
## [1] NA NA NA NA NA NA
This returns a lot of NA values, which come from the words that do not contain “q”. Let’s remove those NA values.
q_string <- q_string[!is.na(q_string)]
print(q_string)
## [1] "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu"
Now it becomes clear that we only have “u” after “q” in those words.
Another way to do this is to use the str_subset
function to
keep strings with the given patterns only.
q_string2 <- str_subset(words, "q.")
q_string2 <- str_extract(q_string2, "q.")
print(q_string2)
## [1] "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu"
As another example, imagine we want to find all sentences in
sentences
that contain a colour. We first create a vector
of colour names, and then turn it into a single regular expression:
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
## [1] "red|orange|yellow|green|blue|purple"
Now we can select the sentences that contain a colour in the list, and then extract the colour to figure out which one it is:
has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
## [1] "blue" "blue" "red" "red" "red" "blue"
Note that str_extract()
only extracts the first match.
We can see that most easily by first selecting all the sentences that
have more than 1 match:
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match, html = TRUE)
To get all matches, use str_extract_all(). It returns a list (we will learn about lists later) or a matrix (if using simplify = TRUE):
str_extract_all(more, colour_match)
## [[1]]
## [1] "blue" "red"
##
## [[2]]
## [1] "green" "red"
##
## [[3]]
## [1] "orange" "red"
str_extract_all(more, colour_match, simplify = TRUE)
## [,1] [,2]
## [1,] "blue" "red"
## [2,] "green" "red"
## [3,] "orange" "red"
There is a sentence in the previous example that doesn’t meet our criterion (“flickered” is not a color). Think about how to remove it.
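One way to do this, as a sketch: require the colour name to be a whole word by surrounding the alternatives with the word-boundary \b (written "\\b" in a string):
colour_match2 <- str_c("\\b(", str_c(colours, collapse = "|"), ")\\b")
more2 <- sentences[str_count(sentences, colour_match2) > 1]
str_view_all(more2, colour_match2, html = TRUE)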
Earlier we talked about the use of parentheses for clarifying
precedence and for back-references when matching. You can also use
parentheses to extract parts of a complex match using
str_match
function.
For example, let’s see how to extract year, month and day from a date
string like "2023-03-28"
date_string <- "2023-03-28"
str_match(date_string, "(\\d{4})-(\\d{2})-(\\d{2})")
## [,1] [,2] [,3] [,4]
## [1,] "2023-03-28" "2023" "03" "28"
Here str_match() returns a matrix with the first column being the complete match, and the next three columns the matches of each group in parentheses.
If your data is in a tibble, it’s often easier to use
tidyr::extract()
. It works like str_match()
but requires you to name the matches, which are then placed in new
columns.
Let’s take the flights
data set as an example. There is
a column time_hour
that contains all the date-time
information:
flights1 <- flights %>%
select(time_hour) %>%
print()
## # A tibble: 336,776 × 1
## time_hour
## <dttm>
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00
## 7 2013-01-01 06:00:00
## 8 2013-01-01 06:00:00
## 9 2013-01-01 06:00:00
## 10 2013-01-01 06:00:00
## # ℹ 336,766 more rows
To show how things work, we remove all other columns. Now let’s
create new columns named “year”, “month”, “day”, “hour”, “minute”,
“second” which are all extracted from the time_hour
string.
flights1 %>%
extract(
time_hour,
c("year", "month", "day", "hour", "minute", "second"),
"(\\d{4})-(\\d{2})-(\\d{2}) (\\d{2}):(\\d{2}):(\\d{2})",
remove = FALSE, convert = TRUE
) %>%
print()
## # A tibble: 336,776 × 7
## time_hour year month day hour minute second
## <dttm> <int> <int> <int> <int> <int> <int>
## 1 2013-01-01 05:00:00 2013 1 1 5 0 0
## 2 2013-01-01 05:00:00 2013 1 1 5 0 0
## 3 2013-01-01 05:00:00 2013 1 1 5 0 0
## 4 2013-01-01 05:00:00 2013 1 1 5 0 0
## 5 2013-01-01 06:00:00 2013 1 1 6 0 0
## 6 2013-01-01 05:00:00 2013 1 1 5 0 0
## 7 2013-01-01 06:00:00 2013 1 1 6 0 0
## 8 2013-01-01 06:00:00 2013 1 1 6 0 0
## 9 2013-01-01 06:00:00 2013 1 1 6 0 0
## 10 2013-01-01 06:00:00 2013 1 1 6 0 0
## # ℹ 336,766 more rows
Other than the data set name, tidyr::extract() takes three main arguments: the column to be matched, the names of the new columns as a character vector, and the regular expression with grouped matches. The match of each parenthesised group is placed in the corresponding new column. remove = FALSE keeps the original time_hour column, and convert = TRUE asks for the new columns to be parsed into the most appropriate data types.
Let’s take the tidied who
data set (TB case numbers) as
another example. After tidying data, we arrive at the following data
frame:
who1 <- who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
)
who1
## # A tibble: 76,046 × 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1997 new_sp_m1524 10
## 3 Afghanistan AF AFG 1997 new_sp_m2534 6
## 4 Afghanistan AF AFG 1997 new_sp_m3544 3
## 5 Afghanistan AF AFG 1997 new_sp_m4554 5
## 6 Afghanistan AF AFG 1997 new_sp_m5564 2
## 7 Afghanistan AF AFG 1997 new_sp_m65 0
## 8 Afghanistan AF AFG 1997 new_sp_f014 5
## 9 Afghanistan AF AFG 1997 new_sp_f1524 38
## 10 Afghanistan AF AFG 1997 new_sp_f2534 36
## # ℹ 76,036 more rows
As we know, the key column contains information about TB types, gender and age group. To recall the pattern:
The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
The next two letters describe the type of TB:
- rel stands for cases of relapse
- ep stands for cases of extrapulmonary TB
- sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
- sp stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).
The remaining numbers give the age group. The dataset groups cases into seven age groups:
- 014 = 0 – 14 years old
- 1524 = 15 – 24 years old
- 2534 = 25 – 34 years old
- 3544 = 35 – 44 years old
- 4554 = 45 – 54 years old
- 5564 = 55 – 64 years old
- 65 = 65 or older
The following grouped regular expression puts all the needed information into three new columns, “type”, “gender” and “age_Group”.
who1 %>%
extract(key,
c("type", "gender", "age_Group"),
"new[_]?(.*)_(m|f)(\\d*)",
remove = F) %>%
print()
## # A tibble: 76,046 × 9
## country iso2 iso3 year key type gender age_Group cases
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan AF AFG 1997 new_sp_m014 sp m 014 0
## 2 Afghanistan AF AFG 1997 new_sp_m1524 sp m 1524 10
## 3 Afghanistan AF AFG 1997 new_sp_m2534 sp m 2534 6
## 4 Afghanistan AF AFG 1997 new_sp_m3544 sp m 3544 3
## 5 Afghanistan AF AFG 1997 new_sp_m4554 sp m 4554 5
## 6 Afghanistan AF AFG 1997 new_sp_m5564 sp m 5564 2
## 7 Afghanistan AF AFG 1997 new_sp_m65 sp m 65 0
## 8 Afghanistan AF AFG 1997 new_sp_f014 sp f 014 5
## 9 Afghanistan AF AFG 1997 new_sp_f1524 sp f 1524 38
## 10 Afghanistan AF AFG 1997 new_sp_f2534 sp f 2534 36
## # ℹ 76,036 more rows
In this regular expression, "new" matches the “new” at the beginning of every string in the key column. [_]? matches either nothing or a single “_”, since for the rel type there is no _ between “new” and “rel”.
Then the first group (.*)
before the next “_” would
capture the code of TB types (either “rel”, “ep”, “sn” or “sp”). The
second group (m|f)
then captures the gender code. The
digits afterwards can be captured by the third group
(\\d*)
.
str_replace()
and str_replace_all()
allow
you to replace matches with new strings. The simplest use is to replace
a pattern with a fixed string:
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
## [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
## [1] "-ppl-" "p--r" "b-n-n-"
With str_replace_all()
you can perform multiple
replacements by supplying a named vector:
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
## [1] "one house" "two cars" "three people"
In the who case study above, one step was to replace "newrel" with "new_rel" before further analysis:
who2 <- who1 %>%
mutate(key = str_replace(key, "newrel", "new_rel")) %>%
filter(str_detect(key, "new_rel")) # Only keep rows with "new_rel" for checking
who2
## # A tibble: 2,580 × 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Afghanistan AF AFG 2013 new_rel_m014 1705
## 2 Afghanistan AF AFG 2013 new_rel_f014 1749
## 3 Albania AL ALB 2013 new_rel_m014 14
## 4 Albania AL ALB 2013 new_rel_m1524 60
## 5 Albania AL ALB 2013 new_rel_m2534 61
## 6 Albania AL ALB 2013 new_rel_m3544 32
## 7 Albania AL ALB 2013 new_rel_m4554 44
## 8 Albania AL ALB 2013 new_rel_m5564 50
## 9 Albania AL ALB 2013 new_rel_m65 67
## 10 Albania AL ALB 2013 new_rel_f014 5
## # ℹ 2,570 more rows
Switch the first and last letters for all words in words
and place the result in a new column.
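One possible answer, as a sketch, using a grouped replacement: capture the first letter, the middle, and the last letter, then write them back in reverse order (one-letter words are left unchanged because the pattern does not match them):
df %>%
  mutate(swapped = str_replace(word, "^(.)(.*)(.)$", "\\3\\2\\1")) %>%
  print()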
A particularly useful function is str_split(), which can split, for example, a sentence into words.
sentences %>%
head(5) %>%
str_split(" ")
## [[1]]
## [1] "The" "birch" "canoe" "slid" "on" "the" "smooth"
## [8] "planks."
##
## [[2]]
## [1] "Glue" "the" "sheet" "to" "the"
## [6] "dark" "blue" "background."
##
## [[3]]
## [1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
##
## [[4]]
## [1] "These" "days" "a" "chicken" "leg" "is" "a"
## [8] "rare" "dish."
##
## [[5]]
## [1] "Rice" "is" "often" "served" "in" "round" "bowls."
Because each component might contain a different number of pieces, this returns a list.
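If you prefer a matrix over a list, str_split() also has a simplify argument; shorter sentences are padded with empty strings:
sentences %>%
  head(3) %>%
  str_split(" ", simplify = TRUE)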
The splitting above is somewhat unsatisfactory, since the punctuation is included. So we can refine the pattern:
sentences1 <- sentences
str_sub(sentences1, -1, -1) <- "" # Remove the last character which is a period
sentences1 %>%
head(5) %>%
str_split("[^A-Za-z]+")
## [[1]]
## [1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks"
##
## [[2]]
## [1] "Glue" "the" "sheet" "to" "the"
## [6] "dark" "blue" "background"
##
## [[3]]
## [1] "It" "s" "easy" "to" "tell" "the" "depth" "of" "a"
## [10] "well"
##
## [[4]]
## [1] "These" "days" "a" "chicken" "leg" "is" "a"
## [8] "rare" "dish"
##
## [[5]]
## [1] "Rice" "is" "often" "served" "in" "round" "bowls"
Here "a-z"
and "A-Z"
inside []
in a regular expression represent all lower-case and upper-case letters
in English. So we are splitting the sentence by any non-letter character
of length one or more (+
represents one time or more).
Actually, there is a simpler way to do this, using the boundary() function.
sentences %>%
head(5) %>%
str_split(boundary("word"))
## [[1]]
## [1] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks"
##
## [[2]]
## [1] "Glue" "the" "sheet" "to" "the"
## [6] "dark" "blue" "background"
##
## [[3]]
## [1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well"
##
## [[4]]
## [1] "These" "days" "a" "chicken" "leg" "is" "a"
## [8] "rare" "dish"
##
## [[5]]
## [1] "Rice" "is" "often" "served" "in" "round" "bowls"
Here boundary("word") means splitting at word boundaries. We can also split by “line_break”, “character” or “sentence”.
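A quick sketch of two of the other boundary types:
str_split("Good morning. How are you?", boundary("sentence"))
str_split("Good morning. How are you?", boundary("character"))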
There are two useful functions in base R that also use regular expressions:
apropos() searches all objects available from the global environment that match the given regular expression. This is useful if you can’t quite remember the name of a function.
apropos("replace")
## [1] "%+replace%" "replace" "replace_na" "setReplaceMethod"
## [5] "str_replace" "str_replace_all" "str_replace_na" "theme_replace"
dir()
lists all the files in a directory. The pattern
argument takes a regular expression and only returns file names that
match the pattern. For example, you can find all the R Markdown files in
the current directory with:
dir(pattern = "\\.Rmd$")
## [1] "1-Introduction.Rmd"
## [2] "EDA_Class_Exercise.Rmd"
## [3] "R_Functions.Rmd"
## [4] "RMD10_Data_Tidying_1.Rmd"
## [5] "RMD11_Data_Tidying_2.Rmd"
## [6] "RMD12_Data_Import.Rmd"
## [7] "RMD13_Strings.Rmd"
## [8] "RMD2_Basics_Descriptive_Statistics.Rmd"
## [9] "RMD3_Data_Visualization_1.Rmd"
## [10] "RMD4_Data_Visualization_2.Rmd"
## [11] "RMD5_Data_Visualization_3.Rmd"
## [12] "RMD6_Data_Transformation_1.Rmd"
## [13] "RMD7_Data_Transformation_2.Rmd"
## [14] "RMD8_EDA_1.Rmd"
## [15] "RMD9_EDA_2.Rmd"
In real applications, we frequently use regular expressions to search for names of files, folders in coding projects.
Submit your answers in a single PDF or HTML file knitted from an R Markdown file. Submit your R Markdown file as well.