library(tidyverse)
library(nycflights13)
Strings are collections of characters used to store text data, or any data represented as text. Handling strings well is an important skill in data science. In this module, we will study how to create and manipulate strings and how to describe patterns in them with regular expressions.
You can create strings with either single quotes or double quotes. Unlike in some other languages, there is no difference in behaviour between the two.
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
If you forget to close a quote, you’ll see +, the continuation character:
> "This is a string without a closing quote
+
+
+ HELP I'M STUCK
If this happens to you, just press Esc (Escape) and try again!
To include a literal single or double quote in a string, you can use \ to “escape” it:
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
If you need a literal backslash, you need to double it:
backslash <- "\\"
An important thing to know about strings is that each one has a literal value (what the string actually is) and a representation (how you type it into a programming language). Every literal value is paired with a representation.
For example, for the literal value \, we must input "\\" in R.
Unlike in Python, the print() function in R shows the representation. We need to use the function writeLines() to show the literal value of a string.
print("\\")
## [1] "\\"
writeLines("\\")
## \
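One quick way to check the difference between a representation and its literal value is to count characters. The sketch below assumes the stringr package is attached; the variable names are illustrative only.

```r
library(stringr)

# The representation "\\" contains two backslashes,
# but the literal value is a single backslash.
backslash <- "\\"
n_backslash <- str_length(backslash)

# The representation "a\tb" is four characters between the quotes,
# but the literal value is three: "a", a tab, and "b".
tabbed <- "a\tb"
n_tabbed <- str_length(tabbed)

print(c(n_backslash, n_tabbed))
writeLines(tabbed)
```

str_length() always counts the characters of the literal value, which is why it is a handy sanity check when a string contains many escapes.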
Like many other programming languages, R uses a backslash to start an escape sequence inside a string:
Representation | Literal value |
---|---|
\n | new line |
\t | tab character |
\\ | backslash \ |
\" | double quotation marks " |
\' | single quotation marks ' |
\` | backticks ` |
For the full table of escape sequences, you may check the help documentation of quotes.
help("'")
For example, if we want a string whose literal value is "\", we need to write
my_string <- "\"\\\""
writeLines(my_string)
## "\"
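Every character has a Unicode code point, and we can write it in an R string either with \u followed by the code point or with \x escapes for its individual UTF-8 bytes:
writeLines("\u00b5") # The micro sign, which looks like the Greek letter "mu"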
## µ
writeLines("\xe4\xbd\xa0\xe5\xa5\xbd") # The Chinese "你好"
## 你好
writeLines("\u2660") # Spade symbol of a card suit
## ♠
There are many online encoders to convert any character into UTF-8 codes.
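Base R can also do this conversion directly. The following is a minimal sketch using utf8ToInt(), intToUtf8() and charToRaw(); enc2utf8() is called first on the assumption that the session might not already store the string as UTF-8. The variable names are illustrative.

```r
# Code point of the spade symbol, as an integer and as hexadecimal
spade <- enc2utf8("\u2660")
code_point <- utf8ToInt(spade)
hex_code <- sprintf("%04x", code_point)

# The UTF-8 bytes that encode the same character
utf8_bytes <- charToRaw(spade)

# Going the other way: from a code point back to the character
spade_again <- intToUtf8(0x2660)

print(code_point)
print(hex_code)
print(utf8_bytes)
writeLines(spade_again)
```

The hexadecimal code point is what goes after \u, while the raw bytes are what you would write with \x escapes.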
Multiple strings are often stored in a character vector, which you can create with c():
string_vector <- c("One", "Two", "Three")
print(string_vector)
## [1] "One" "Two" "Three"
stringr
Base R contains many functions to work with strings but we’ll avoid them because they can be inconsistent, which makes them hard to remember. Instead we’ll use functions from stringr. These have more intuitive names, and all start with str_. For example, str_length() tells you the number of characters in a string:
str_length(c("a", "R for data science", NA))
## [1] 1 18 NA
To combine two or more strings, use str_c():
str_c("x","y","z")
## [1] "xyz"
Use the sep argument to control how they’re separated:
str_c("x","y","z", sep = "+")
## [1] "x+y+z"
str_c() is vectorised, and it automatically recycles shorter vectors to the same length as the longest:
str_c("b", c("a", "e", "u"), "g")
## [1] "bag" "beg" "bug"
To collapse a vector of strings into a single string, use the collapse argument:
str_c(c("x", "y", "z"), collapse = ",")
## [1] "x,y,z"
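The sep and collapse arguments can be combined, and it is worth knowing how str_c() treats missing values. The sketch below uses made-up inputs to show both points.

```r
library(stringr)

# sep is applied between the arguments, element by element;
# collapse then joins the resulting vector into one string.
pairs <- str_c(c("x", "y"), c("1", "2"), sep = "-")
joined <- str_c(c("x", "y"), c("1", "2"), sep = "-", collapse = "+")

# Missing values are contagious: combining with NA gives NA.
with_na <- str_c("prefix-", c("a", NA))

print(pairs)
print(joined)
print(with_na)
```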
You can extract parts of a string using str_sub(). As well as the string, str_sub() takes start and end arguments which give the (inclusive) positions of the substring:
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
## [1] "App" "Ban" "Pea"
# negative numbers count backwards from end
str_sub(x, -3, -1)
## [1] "ple" "ana" "ear"
Note that str_sub() won’t fail if the string is too short: it will just return as much as possible:
str_sub("a", 1, 5)
## [1] "a"
You can also use the assignment form of str_sub() to modify part of a string:
x <- "There is a typo in the word studant"
str_sub(x, -3, -3) <- "e"
x
## [1] "There is a typo in the word student"
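The assignment form is also vectorised, so it can change the same position in every element at once. As a small sketch with an illustrative vector, here is one way to capitalise the first letter of each string:

```r
library(stringr)

fruits <- c("apple", "banana", "pear")

# Replace the first character of each string with its upper-case version
str_sub(fruits, 1, 1) <- str_to_upper(str_sub(fruits, 1, 1))

print(fruits)
```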
The str_to_lower() and str_to_upper() functions convert text to lower case and upper case, respectively.
x <- "china"
str_to_lower(x)
## [1] "china"
str_to_upper(x)
## [1] "CHINA"
The str_sort() function sorts a vector of strings in alphabetical order, either increasing (the default) or decreasing.
str_sort(c("apple", "orange", "banana"))
## [1] "apple" "banana" "orange"
str_sort(c("apple", "orange", "banana"), decreasing = TRUE)
## [1] "orange" "banana" "apple"
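A related function, str_order(), returns the positions that would sort the vector rather than the sorted strings themselves. This is useful when another vector has to be reordered in the same way. A minimal sketch with the same three fruits:

```r
library(stringr)

fruits <- c("apple", "orange", "banana")

# Positions of the elements in increasing alphabetical order
ord <- str_order(fruits)

# Indexing with those positions gives the same result as str_sort()
sorted <- fruits[ord]

print(ord)
print(sorted)
```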
stringr data
To practise string manipulation, we will use the three pre-loaded string data sets in the stringr package: words, fruit and sentences.
- words contains the 980 most commonly used English words
- fruit contains 80 English names of fruits
- sentences contains 720 English sentences, the “Harvard sentences”, which were used for standardised testing of voice
Let’s play with them. First, find the longest word in the words data set:
word_data <- as_tibble(words) %>%
  mutate(length = str_length(value)) %>%
  arrange(desc(length)) %>%
  print()
## # A tibble: 980 × 2
## value length
## <chr> <int>
## 1 appropriate 11
## 2 environment 11
## 3 opportunity 11
## 4 responsible 11
## 5 department 10
## 6 difference 10
## 7 experience 10
## 8 individual 10
## 9 particular 10
## 10 photograph 10
## # … with 970 more rows
Now, let’s say we hope to find all words that match patterns such as:
- an as part of the word
- a s in the word
- e and be longer than 6 letters
How do we do these jobs? We need to turn to our next topic: regular expressions.
Regular expressions, “regexps” or “regex” for short, are a mini programming language that allows you to describe patterns in strings. They are very powerful for handling file names, folder names, text, or any other job involving strings.
For example, we have tidied the who data about TB cases in different countries and years. One step there was to separate a “key” column into several columns:
who1 <- who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = "key",
    values_to = "cases",
    values_drop_na = TRUE
  )
who1
## # A tibble: 76,046 × 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1997 new_sp_m1524 10
## 3 Afghanistan AF AFG 1997 new_sp_m2534 6
## 4 Afghanistan AF AFG 1997 new_sp_m3544 3
## 5 Afghanistan AF AFG 1997 new_sp_m4554 5
## 6 Afghanistan AF AFG 1997 new_sp_m5564 2
## 7 Afghanistan AF AFG 1997 new_sp_m65 0
## 8 Afghanistan AF AFG 1997 new_sp_f014 5
## 9 Afghanistan AF AFG 1997 new_sp_f1524 38
## 10 Afghanistan AF AFG 1997 new_sp_f2534 36
## # … with 76,036 more rows
Previously we used the mutate and separate functions to split the key column into type, Gender and Age_Group. With regular expressions, we can do all of this in a single line:
who1 <- tidyr::extract(who1, key, c("type", "Gender", "Age_Group"), "new[_]?(.*)_(m|f)(\\d*)", remove = FALSE)
who1
## # A tibble: 76,046 × 9
## country iso2 iso3 year key type Gender Age_Group cases
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 Afghanistan AF AFG 1997 new_sp_m014 sp m 014 0
## 2 Afghanistan AF AFG 1997 new_sp_m1524 sp m 1524 10
## 3 Afghanistan AF AFG 1997 new_sp_m2534 sp m 2534 6
## 4 Afghanistan AF AFG 1997 new_sp_m3544 sp m 3544 3
## 5 Afghanistan AF AFG 1997 new_sp_m4554 sp m 4554 5
## 6 Afghanistan AF AFG 1997 new_sp_m5564 sp m 5564 2
## 7 Afghanistan AF AFG 1997 new_sp_m65 sp m 65 0
## 8 Afghanistan AF AFG 1997 new_sp_f014 sp f 014 5
## 9 Afghanistan AF AFG 1997 new_sp_f1524 sp f 1524 38
## 10 Afghanistan AF AFG 1997 new_sp_f2534 sp f 2534 36
## # … with 76,036 more rows
The odd-looking string here is a regular expression that identifies particular patterns in order to decode the key column. Essentially, regular expressions are like a mini programming language. It may take some time to get used to them at the beginning, but once you understand how they work you will find them very useful and fun to work with.
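To see what that pattern captures without running it on the whole data set, we can try it on a couple of key values with str_match(). This is only an exploratory sketch; the two keys are taken from the column names above, and the variable names are illustrative.

```r
library(stringr)

keys <- c("new_sp_m014", "newrel_f65")
pattern <- "new[_]?(.*)_(m|f)(\\d*)"

# str_match() returns a character matrix: the first column is the whole match,
# and the remaining columns are the pieces captured by each pair of parentheses.
pieces <- str_match(keys, pattern)

print(pieces)
```

Each pair of parentheses in the pattern corresponds to one of the new columns: the type, the gender, and the age group.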
To learn regular expressions, we’ll use str_view() as the starting point. str_view() takes a character vector and a regular expression, and shows you how they match.
Let’s start with the simplest case: matching an exact sequence of one or more letters.
x <- c("apple", "banana", "pear")
str_view(x, "a", html = TRUE)
x <- c("apple", "banana", "pear")
str_view(x, "an", html = TRUE, match = NA)
So we see that the function highlights any part of each word that matches the given pattern, which is exactly "a" or "an" in this case.
The template for using str_view is:
str_view(string, pattern, match = TRUE, html = FALSE)
Here string is the string or vector of strings to inspect, and pattern is the regular expression describing what we are looking for. match controls what is printed: TRUE shows only the strings with a match, FALSE only those without a match, and NA all strings regardless. html should only be TRUE when we want to print the result in a web page (such as a rendered markdown document).
. matches any single character (except a new line)
Now let’s study the mini language of regular expressions. First, a lone . in a regular expression represents any single character except a new line. For example,
str_view(x, ".a.", match = NA, html = TRUE)
matches any three characters with “a” in the middle.
Similarly, ... matches any three consecutive characters.
x <- c("a", "ab", "abc", "abcd")
str_view(x, "...", match = NA, html = TRUE)
However, . does not match the new line character \n, so in the following example ... finds no match:
x <- 'ab\ncd'
writeLines(x)
## ab
## cd
str_view(x, "...", match = NA, html = TRUE)
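If you do want . to match a new line, stringr's regex() helper has a dotall argument for that. A minimal sketch, using str_detect() (which returns TRUE or FALSE for each string) so the result is easy to read in the console:

```r
library(stringr)

x <- "ab\ncd"

# By default . does not match the newline between "b" and "c"
default_match <- str_detect(x, "b.c")

# With dotall = TRUE, . is allowed to match a newline as well
dotall_match <- str_detect(x, regex("b.c", dotall = TRUE))

print(c(default_match, dotall_match))
```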
Before we learn more ways to describe patterns, we need to learn the escape sequences used in regular expressions. We now know that . represents any single character, but then how do we express a literal . itself? We have to use the escape sequence \. to represent a literal dot.
However, if we try to do this in R, we get an error:
x <- c("2.357", "apple")
str_view(x, "\.", match = NA, html = TRUE)
Why doesn’t this work? The reason is that we use a string to represent the regular expression \., and "\." is not a valid string because \. is not an escape sequence that R recognises. To get the literal value \., we need the representation "\\.", as we learned above. So the right thing to do is:
x <- c("2.357", "apple")
str_view(x, "\\.", match = NA, html = TRUE)
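The difference between the escaped and unescaped dot can also be checked with str_detect(), which returns TRUE or FALSE for each string. The sketch below additionally uses stringr's fixed(), which treats the pattern as plain text; the variable names are illustrative.

```r
library(stringr)

x <- c("2.357", "apple")

# An unescaped . matches any character, so both strings match
any_char <- str_detect(x, ".")

# An escaped \. matches only a literal dot
literal_dot <- str_detect(x, "\\.")

# fixed() treats the whole pattern as plain text, so no escaping is needed
fixed_dot <- str_detect(x, fixed("."))

# The literal value of the pattern that the regex engine receives
writeLines("\\.")

print(any_char)
print(literal_dot)
print(fixed_dot)
```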
In summary, we have two ways to write a regular expression:
- as a literal value, such as . or \.
- as a string representation, such as "." or "\\.", where we must use a pair of quotation marks to enclose the string.
In our textbook, we may use both ways to write a regular expression. But remember that in R, we have to use the second way as the input to str_view or other functions that work with regular expressions.
The table below summarises this.
Literal value of regular expression | String representation used in R | Meaning |
---|---|---|
. | "." | Any single character (excluding new line) |
\. | "\\." | A literal . |
\\ | "\\\\" | A literal \ |
" | '"' or "\\\"" | A literal " |
' | "'" or '\\\'' | A literal ' |
\d | "\\d" | Any digit (0-9) |
\s | "\\s" | Any white space |
The following regular expression matches a literal \:
x <- c("a\\b", "a/b")
writeLines(x)
## a\b
## a/b
str_view(x, "\\\\", match = NA, html = TRUE)
The following regular expression matches a pattern like 2.3 (a digit, a literal dot, then another digit):
x <- c("30", "2.56", "1e5", "0.5943")
str_view(x, "\\d\\.\\d", match = NA, html = TRUE)
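To pull out the matched text rather than just view it, one option is str_extract(). A small sketch on the same vector:

```r
library(stringr)

x <- c("30", "2.56", "1e5", "0.5943")

# str_extract() returns the first match in each string, or NA if there is none
first_match <- str_extract(x, "\\d\\.\\d")

print(first_match)
```

Only the first digit-dot-digit piece is returned, so "2.56" gives "2.5" rather than the whole number.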
How would you match the sequence "'\?
What patterns will the regular expression \..\..\.. match? How would you represent it as a string?
^ and $
By default, regular expressions will match any part of a string. It’s often useful to anchor the regular expression so that it matches from the start or end of the string. You can use:
- ^ to match the start of the string.
- $ to match the end of the string.
For example, to match strings that start or end with the letter “a”, we can do
x <- c("apple", "banana")
str_view(x, "^a", match = NA, html = TRUE)
str_view(x, "a$", match = NA, html = TRUE)
Our textbook provides an interesting mnemonic from Evan Misshula to help remember this: if you begin with power (^), you end up with money ($).
To force a regular expression to match only a complete string, anchor it with both ^ and $:
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple", match = NA, html = TRUE)
str_view(x, "^apple$", match = NA, html = TRUE)
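The effect of each anchor can be summarised as logical vectors with str_detect(), which returns TRUE or FALSE for each string. The sketch below uses a slightly different, illustrative vector so that the start and end anchors give different answers.

```r
library(stringr)

x <- c("apple pie", "apple", "pineapple")

anywhere <- str_detect(x, "apple")    # "apple" anywhere in the string
at_start <- str_detect(x, "^apple")   # string starts with "apple"
at_end   <- str_detect(x, "apple$")   # string ends with "apple"
whole    <- str_detect(x, "^apple$")  # string is exactly "apple"

print(rbind(anywhere, at_start, at_end, whole))
```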
Now let’s get all words in words that end with “a”. We can use the str_detect function to do the filtering; it returns TRUE or FALSE for each string in the vector.
x <- c("apple", "banana")
str_detect(x, "a$")
## [1] FALSE TRUE
word_data %>%
  filter(str_detect(value, "a$")) %>%
  print()
## # A tibble: 6 × 2
## value length
## <chr> <int>
## 1 america 7
## 2 extra 5
## 3 area 4
## 4 idea 4
## 5 tea 3
## 6 a 1
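Because str_detect() returns a logical vector, it combines naturally with sum() and mean(), and str_subset() is a shortcut for keeping only the matching strings. A minimal sketch on a small, made-up vector:

```r
library(stringr)

x <- c("apple", "banana", "pear", "area")

ends_with_a <- str_detect(x, "a$")

# TRUE counts as 1, so sum() gives the number of matches
# and mean() gives the proportion of matches
n_matches <- sum(ends_with_a)
share <- mean(ends_with_a)

# str_subset() keeps only the strings that match the pattern
matches <- str_subset(x, "a$")

print(n_matches)
print(share)
print(matches)
```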
words data set.
words data set with and without using the str_length function.
$ and ^
Since $ and ^ have special meanings in regular expressions, we have to use escape sequences to represent a literal $ or ^ as well. The same applies to all other symbols of this type that we will meet later.
# To match the literal $^$
x <- c("a$^$b")
writeLines(x)
## a$^$b
str_view(x, "\\$\\^\\$", match = NA, html = TRUE)
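Forgetting the escape does not raise an error; the pattern simply means something else. The sketch below, with an illustrative price string, compares an unescaped $, an escaped one, and one wrapped in square brackets.

```r
library(stringr)

price <- "Total: $5"

# Unescaped, $ is the end-of-string anchor, so "$5" can never match:
# a "5" would have to come after the end of the string.
unescaped <- str_detect(price, "$5")

# Escaped, \$ matches a literal dollar sign
escaped <- str_detect(price, "\\$5")

# Inside square brackets, $ is also treated literally
in_class <- str_detect(price, "[$]5")

print(c(unescaped, escaped, in_class))
```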
There are a number of special patterns that match more than one character. You’ve already seen ., which matches any character apart from a newline. There are four other useful tools:
- \d: matches any digit.
- \s: matches any whitespace (e.g. space, tab, newline).
- [abc]: matches a, b, or c.
- [^abc]: matches anything except a, b, or c.
Note that ^ has a different meaning inside [] than outside it. Also, to use \d and \s in string representations, we must use "\\d" and "\\s".
A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex. Many people find this more readable.
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c", match = NA, html = TRUE)
The code above finds the same pattern as a\.c, but its string representation is more readable than "a\\.c".
This works for most (but not all) regex metacharacters: $ . | ? * + ( ) [ {. Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: ] \ ^ and -.
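The four tools can be compared side by side with str_detect(), which returns TRUE or FALSE for each string. This is a small sketch on an illustrative vector that extends the one used above with "a1c".

```r
library(stringr)

x <- c("abc", "a.c", "a*c", "a c", "a1c")

dot_only <- str_detect(x, "a[.]c")   # a literal dot in the middle
not_dot  <- str_detect(x, "a[^.]c")  # anything except a dot in the middle
digit    <- str_detect(x, "a\\dc")   # a digit in the middle
space    <- str_detect(x, "a\\sc")   # whitespace in the middle

print(rbind(dot_only, not_dot, digit, space))
```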
Question: Find all words in words that start with a vowel (“a”, “e”, “i”, “o” or “u”).
Solution:
word_data %>%
  filter(str_detect(value, "^[aeiou]")) %>%
  print()
## # A tibble: 175 × 2
## value length
## <chr> <int>
## 1 appropriate 11
## 2 environment 11
## 3 opportunity 11
## 4 experience 10
## 5 individual 10
## 6 understand 10
## 7 university 10
## 8 advertise 9
## 9 afternoon 9
## 10 associate 9
## # … with 165 more rows
Question: Find all words in words that end with ed, but not with eed.
Solution:
word_data %>%
  filter(str_detect(value, "ed$")) %>% # This only finds words ending with "ed"
  print() %>%
  filter(str_detect(value, "[^e]ed$")) %>%
  print()
## # A tibble: 9 × 2
## value length
## <chr> <int>
## 1 hundred 7
## 2 proceed 7
## 3 succeed 7
## 4 indeed 6
## 5 speed 5
## 6 feed 4
## 7 need 4
## 8 bed 3
## 9 red 3
## # A tibble: 3 × 2
## value length
## <chr> <int>
## 1 hundred 7
## 2 bed 3
## 3 red 3
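One subtlety of [^e]ed$ is that [^e] must match an actual character, so a string that is exactly "ed" would be missed. That may not matter for the words data set, but as a hedged sketch on a made-up vector, allowing the start of the string as an alternative covers that edge case:

```r
library(stringr)

x <- c("hundred", "need", "bed", "ed")

# Requires some character other than "e" before the final "ed"
strict <- str_detect(x, "[^e]ed$")

# Also accepts the case where "ed" is the whole string
lenient <- str_detect(x, "(^|[^e])ed$")

print(strict)
print(lenient)
```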
One can use alternation to pick between one or more alternative patterns. For example, abc|d..f will match either abc or a string such as deaf. Note that the precedence for | is low, so abc|xyz matches abc or xyz, not abcyz or abxyz. As with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
str_view(c("grey", "gray"), "gr(e|a)y", match = NA, html = TRUE)
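The low precedence of | is easy to verify with str_extract(), which returns the first match in each string or NA if there is none. The sketch below uses an illustrative vector built from the strings mentioned above.

```r
library(stringr)

x <- c("abc", "xyz", "abxyz", "abcyz")

# Without parentheses, the alternatives are the whole of "abc" and the whole of "xyz"
low_precedence <- str_extract(x, "abc|xyz")

# With parentheses, only the middle letter is a choice between "c" and "x"
grouped <- str_extract(x, "ab(c|x)yz")

print(low_precedence)
print(grouped)
```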
Question: Find all words in words that end with ing or ise.
Solution:
word_data %>%
  filter(str_detect(value, "(ing|ise)$")) %>%
  print()
## # A tibble: 17 × 2
## value length
## <chr> <int>
## 1 advertise 9
## 2 otherwise 9
## 3 exercise 8
## 4 practise 8
## 5 surprise 8
## 6 evening 7
## 7 meaning 7
## 8 morning 7
## 9 realise 7
## 10 during 6
## 11 bring 5
## 12 raise 5
## 13 thing 5
## 14 king 4
## 15 ring 4
## 16 rise 4
## 17 sing 4
Next, let’s see how to describe patterns that repeat either an exact number of times or a flexible number of times.
The following symbols define how many times the preceding element repeats:
- ?: 0 or 1
- +: 1 or more
- *: 0 or more
- {n}: exactly \(n\) times
For example, if we hope to search for words starting with “a” and ending with “e”, we may do:
str_view(words, "^a.*e$", html = TRUE)
Here .* matches any sequence of characters of any length, since . matches any character and * means 0 or more times.
As another example, to search for words that have three vowels in a row, for example “iou”, we can do
str_view(words, "[aeiou]{3}", html = TRUE)
Here [aeiou] matches a single character that is a vowel, and {3} means that it repeats three times.
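To compare the four quantifiers directly, the sketch below applies each of them to the letter “a” in a small, made-up vector. The anchors ^ and $ make sure the whole string has to fit the pattern.

```r
library(stringr)

x <- c("b", "ab", "aab", "aaab")

optional  <- str_detect(x, "^a?b$")    # zero or one "a" before "b"
one_plus  <- str_detect(x, "^a+b$")    # one or more "a"
zero_plus <- str_detect(x, "^a*b$")    # zero or more "a"
exactly_2 <- str_detect(x, "^a{2}b$")  # exactly two "a"

print(rbind(optional, one_plus, zero_plus, exactly_2))
```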
^.*$
\d{4}-\d{2}-\d{2}
"\\\\{4}"
- {n,}: \(n\) times or more
- {,m}: at most \(m\) times
- {n,m}: between \(n\) and \(m\) times
Parentheses () in regular expressions create a numbered capturing group. This is useful when we want to refer to exactly the same text later.
For example, how do we search for words that contain the same letter “a”, “e” or “i” at least three times? We may do the following:
str_view(words, "([aei]).*\\1.*\\1", html = TRUE)
Here .* matches any sequence of characters of any length, as seen before. ([aei]) matches either “a”, “e” or “i”, and "\\1", whose literal value is \1, refers to the same letter captured in () occurring again.
As another example, the following regular expression finds all fruits that have a repeated pair of letters.
str_view(fruit, "(..)\\1", html = TRUE)
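To see what the capturing group actually holds, str_match() returns the whole match together with each captured group. A minimal sketch on three illustrative fruit names:

```r
library(stringr)

x <- c("banana", "apple", "coconut")

# TRUE when some pair of letters is immediately repeated
has_repeat <- str_detect(x, "(..)\\1")

# First column: the whole match; second column: the text captured by (..)
banana_match <- str_match("banana", "(..)\\1")

print(has_repeat)
print(banana_match)
```

In "banana" the captured pair is "an", and \1 requires that same pair to follow immediately, giving the match "anan".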
2. Construct regular expressions to match words that start and end with the same character.