Load Libraries


library(tidyverse)
library(nycflights13)

Introduction


Strings are collections of characters used to store text data, or any data represented as text. Handling strings well is an essential skill in data science. In this module, we will study

  • Basics in string manipulation in R

  • Basics in regular expressions (regexps)

String basics


You can create strings with either single quotes or double quotes. Unlike in some other languages, there is no difference in behaviour.

string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'

If you forget to close a quote, you’ll see +, the continuation character:

> "This is a string without a closing quote
+ 
+ 
+ HELP I'M STUCK

If this happens to you, just press Esc (Escape) and try again!

To include a literal single or double quote in a string, you can use \ to “escape” it:

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

If you need a literal backslash, you have to double it:

backslash <- "\\"

Literal value and representation of strings


An important thing to know about strings is that each one has a literal value (what it actually is) and a representation (how you type it into a programming language). Every literal value is paired with a representation.

For example, for the literal value \, we must input "\\" in R.

Unlike print() in Python, the print() function in R shows the representation. To show the literal value of a string, use the function writeLines().

print("\\")
## [1] "\\"
writeLines("\\")
## \

Like many other programming languages, R uses a backslash to start an escape sequence inside a string:

Representation   Literal value
\n               new line
\t               tab character
\\               backslash \
\"               double quotation mark "
\'               single quotation mark '
\`               backtick `

For the full table of escape sequences, you may check the help documentation of quotes.

help("'")

For example, to write a string whose literal value is "\", we need to write

my_string <- "\"\\\""
writeLines(my_string)
## "\"
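As a quick check of the difference between representation and literal value, the sketch below counts characters with base R's nchar(). The path string is a made-up example, not taken from the text above.

```r
# A made-up Windows-style path: each "\\" typed here is ONE backslash in the value
path <- "C:\\temp\\new"

print(path)       # shows the representation, with doubled backslashes
writeLines(path)  # shows the literal value

# nchar() counts the characters of the literal value, not of what we typed
n_chars <- nchar(path)
n_chars
```

The count is smaller than the number of characters typed between the quotes, because every escape sequence stands for a single character.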

Lab Exercise


Write a string whose literal value is \\\

Use UTF-8 code


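Every character has a Unicode code point, and R can print it either from a \u escape (the code point) or from \x escapes (the raw UTF-8 bytes):

writeLines("\u00b5") # The micro sign, which looks like the Greek letter "mu"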
## µ
writeLines("\xe4\xbd\xa0\xe5\xa5\xbd") # The Chinese "你好"
## 你好
writeLines("\u2660") # Spade symbol of a card suit
## ♠

There are many online encoders to convert any character into UTF-8 codes.
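You can also look up these codes without leaving R. The following is a minimal sketch using only base R functions (utf8ToInt(), charToRaw(), intToUtf8()); the variable names are chosen here for illustration.

```r
micro <- "\u00b5"

# Unicode code point of the character, as a decimal integer (0xb5 is 181)
code_point <- utf8ToInt(micro)
code_point

# The raw UTF-8 bytes that encode the same character
utf8_bytes <- charToRaw(enc2utf8(micro))
utf8_bytes

# Going the other way: build a character from its code point
spade <- intToUtf8(0x2660)
writeLines(spade)
```

The code point is what goes after \u, while the bytes are what you would write with \x escapes.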

Multiple strings are often stored in a character vector, which you can create with c():

string_vector <- c("One", "Two", "Three")
print(string_vector)
## [1] "One"   "Two"   "Three"

String functions in stringr


Base R contains many functions to work with strings but we’ll avoid them because they can be inconsistent, which makes them hard to remember. Instead we’ll use functions from stringr. These have more intuitive names, and all start with str_. For example, str_length() tells you the number of characters in a string:

str_length(c("a", "R for data science", NA))
## [1]  1 18 NA

To combine two or more strings, use str_c():

str_c("x","y","z")
## [1] "xyz"

Use the sep argument to control how they’re separated:

str_c("x","y","z", sep = "+")
## [1] "x+y+z"

str_c() is vectorised, and it automatically recycles shorter vectors to the same length as the longest:

str_c("b", c("a", "e", "u"), "g")
## [1] "bag" "beg" "bug"

To collapse a vector of strings into a single string, use the collapse argument:

str_c(c("x", "y", "z"), collapse = ",")
## [1] "x,y,z"
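The sep and collapse arguments can be combined, and missing values deserve some care. The sketch below illustrates both points on small made-up vectors; str_replace_na() is the stringr helper that turns NA into the string "NA".

```r
library(stringr)

# sep joins the pieces element by element; collapse then joins the results
labels <- str_c(c("a", "b"), c("1", "2"), sep = "-", collapse = "; ")
labels

# A missing value makes the combined string missing too
with_na <- str_c("x", NA_character_)
with_na

# Convert NA to the string "NA" first if you want to keep it visible
no_na <- str_c("x", str_replace_na(NA_character_))
no_na
```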

Subsetting strings


You can extract parts of a string using str_sub(). As well as the string, str_sub() takes start and end arguments which give the (inclusive) position of the substring:

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
## [1] "App" "Ban" "Pea"
# negative numbers count backwards from end
str_sub(x, -3, -1)
## [1] "ple" "ana" "ear"

Note that str_sub() won’t fail if the string is too short: it will just return as much as possible:

str_sub("a", 1, 5)
## [1] "a"

You can also use the assignment form of str_sub() to modify part of a string:

x <- "There is a typo in the word studant"
str_sub(x, -3, -3) <- "e"
x
## [1] "There is a typo in the word student"

The str_to_lower() and str_to_upper() functions convert text to lower case and upper case, respectively.

x <- "China"
str_to_lower(x)
## [1] "china"
str_to_upper(x)
## [1] "CHINA"
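The assignment form of str_sub() and the case functions work well together. As a small sketch on a made-up vector, the code below capitalises the first letter of each string.

```r
library(stringr)

fruits <- c("apple", "banana", "pear")

# Replace the first character of each string with its upper-case version
str_sub(fruits, 1, 1) <- str_to_upper(str_sub(fruits, 1, 1))
fruits
```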

The str_sort() function sorts a vector of strings in alphabetical order, either increasing (the default) or decreasing.

str_sort(c("apple", "orange", "banana"))
## [1] "apple"  "banana" "orange"
str_sort(c("apple", "orange", "banana"), decreasing = TRUE)
## [1] "orange" "banana" "apple"
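Alphabetical order can be surprising when strings contain numbers. The sketch below uses the numeric argument of str_sort() on made-up file-like names to sort embedded numbers by value.

```r
library(stringr)

files <- c("a10", "a2", "a1")

# Plain alphabetical order compares character by character
alphabetic <- str_sort(files)
alphabetic

# numeric = TRUE compares runs of digits as numbers
natural <- str_sort(files, numeric = TRUE)
natural
```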

stringr data


To practise string manipulation, we will use the three string data sets that come with the stringr package: words, fruit and sentences.

  • words contains 980 commonly used English words
  • fruit contains 80 English names of fruits
  • sentences contains 720 English sentences from the “Harvard sentences”, which are used for standardised testing of voice

Let’s play with it - first find the longest word in the words data set:

word_data <- as_tibble(words) %>%
  mutate(length = str_length(value)) %>%
  arrange(desc(length)) %>%
  print()
## # A tibble: 980 × 2
##    value       length
##    <chr>        <int>
##  1 appropriate     11
##  2 environment     11
##  3 opportunity     11
##  4 responsible     11
##  5 department      10
##  6 difference      10
##  7 experience      10
##  8 individual      10
##  9 particular      10
## 10 photograph      10
## # ℹ 970 more rows

Now, let’s say we want to find all words that match patterns such as:

  • contains “an” as part of the word
  • contains at least two “a”s
  • ends with “e” and is longer than 6 letters

How can we do this? For that we need our next topic: regular expressions.

Introduction to regular expressions


Regular expressions, “regexps” or “regex” for short, are a mini programming language that allows you to describe patterns in strings. They are very powerful for handling file names, folder names, texts or any other job related to strings.

For example, we have tidied the who data about TB cases in different countries and years. One step there is to separate a “key” column into a few different ones:

who1 <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  )
who1
## # A tibble: 76,046 × 6
##    country     iso2  iso3   year key          cases
##    <chr>       <chr> <chr> <dbl> <chr>        <dbl>
##  1 Afghanistan AF    AFG    1997 new_sp_m014      0
##  2 Afghanistan AF    AFG    1997 new_sp_m1524    10
##  3 Afghanistan AF    AFG    1997 new_sp_m2534     6
##  4 Afghanistan AF    AFG    1997 new_sp_m3544     3
##  5 Afghanistan AF    AFG    1997 new_sp_m4554     5
##  6 Afghanistan AF    AFG    1997 new_sp_m5564     2
##  7 Afghanistan AF    AFG    1997 new_sp_m65       0
##  8 Afghanistan AF    AFG    1997 new_sp_f014      5
##  9 Afghanistan AF    AFG    1997 new_sp_f1524    38
## 10 Afghanistan AF    AFG    1997 new_sp_f2534    36
## # ℹ 76,036 more rows

Previously we used the mutate() and separate() functions to split the key column into type, Gender and Age_Group. With regular expressions, we can do all of this in a single call:

who1 <- tidyr::extract(who1, key, c("type", "Gender", "Age_Group"), "new[_]?(.*)_(m|f)(\\d*)", remove = FALSE)
who1
## # A tibble: 76,046 × 9
##    country     iso2  iso3   year key          type  Gender Age_Group cases
##    <chr>       <chr> <chr> <dbl> <chr>        <chr> <chr>  <chr>     <dbl>
##  1 Afghanistan AF    AFG    1997 new_sp_m014  sp    m      014           0
##  2 Afghanistan AF    AFG    1997 new_sp_m1524 sp    m      1524         10
##  3 Afghanistan AF    AFG    1997 new_sp_m2534 sp    m      2534          6
##  4 Afghanistan AF    AFG    1997 new_sp_m3544 sp    m      3544          3
##  5 Afghanistan AF    AFG    1997 new_sp_m4554 sp    m      4554          5
##  6 Afghanistan AF    AFG    1997 new_sp_m5564 sp    m      5564          2
##  7 Afghanistan AF    AFG    1997 new_sp_m65   sp    m      65            0
##  8 Afghanistan AF    AFG    1997 new_sp_f014  sp    f      014           5
##  9 Afghanistan AF    AFG    1997 new_sp_f1524 sp    f      1524         38
## 10 Afghanistan AF    AFG    1997 new_sp_f2534 sp    f      2534         36
## # ℹ 76,036 more rows

The odd-looking string here is a regular expression that identifies particular patterns in order to decode the key column. It may take some time to get used to regular expressions, but once you understand how they work you will find them very useful and fun to work with.

To learn regular expressions, we’ll use str_view() as the starting point. str_view() takes a character vector and a regular expression, and shows you how they match.

Basic matches


Let’s start with the simplest case: matching an exact sequence of one or more letters.


x <- c("apple", "banana", "pear")
str_view(x, "a", html = TRUE)
x <- c("apple", "banana", "pear")
str_view(x, "an", html = TRUE, match = NA)

So we see that the function highlights any part of the words that matches the given pattern, which is exactly "a" or "an" in this case.

The template to use str_view is:

str_view(string, pattern, match = TRUE, html = FALSE)

Here string is the string or vector of strings to inspect, and pattern is the regular expression describing what we are looking for. match controls what is printed: TRUE shows only the strings that match, FALSE shows only those that do not, and NA shows all strings. html should be TRUE only when we want to render the result in a web page (such as a knitted R Markdown document).

. matches any single character (except a new line)


Now let’s study the mini language of regular expression. First, a mere . in a regular expression represents any single character excluding a new line. For example,

str_view(x, ".a.", match = NA, html = TRUE)

matches any three consecutive characters with “a” in the middle. Similarly, ... matches any three consecutive characters.

x <- c("a", "ab", "abc", "abcd")
str_view(x, "...", match = NA, html = TRUE)

However, the new line character \n is not matched by ., so there is no match here:

x <- 'ab\ncd'
writeLines(x)
## ab
## cd
str_view(x, "...", match = NA, html = TRUE)
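If you do want . to match a new line as well, stringr's regex() helper has a dotall argument for that. The following is a small sketch of the difference, using str_detect() so the result is a plain TRUE or FALSE.

```r
library(stringr)

x <- "ab\ncd"

# By default . does not match the new line between "b" and "c"
default_dot <- str_detect(x, "b.c")
default_dot

# With dotall = TRUE, . matches any character, including "\n"
dotall_dot <- str_detect(x, regex("b.c", dotall = TRUE))
dotall_dot
```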

Use escape sequences in regular expressions


Before we learn more ways to describe patterns, we need to learn the escape sequences used in regular expressions. We know that . represents any single character, so how do we express a literal .? We have to use the escape sequence \. to represent a literal .

However, if we try this directly in R, we get an error:

x <- c("2.357", "apple")
str_view(x, "\.", match = NA, html = TRUE)

Why doesn’t this work? The reason is that we use a string to represent the regular expression \., and \. is not a valid escape sequence in an R string. For a string whose literal value is \., we need the representation "\\.", as we learned above. So the right thing to do is:

x <- c("2.357", "apple")
str_view(x, "\\.", match = NA, html = TRUE)

In summary, we have two ways to write a regular expression:

  • its literal value, such as . or \.
  • its string representation, such as ".", or "\\." where we must use a pair of quotation marks to enclose the string.

For our textbook, we may use both ways to write a regular expression. But remember in R, we have to use the second way as the input of str_view or other functions that work on regular expressions.

The table below summarises this.

Regular expression (literal value)   String representation in R   Meaning
.                                    "."                          Any single character (excluding new line)
\.                                   "\\."                        A literal .
\\                                   "\\\\"                       A literal \
"                                    '"' or "\\\""                A literal "
'                                    "'" or '\\\''                A literal '
\d                                   "\\d"                        Any digit (0-9)
\s                                   "\\s"                        Any white space
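A handy way to check which regular expression a string really contains is to print it with writeLines(). The sketch below does that for the escaped dot and compares three ways of searching; fixed() is the stringr helper that treats the pattern as plain text rather than as a regular expression.

```r
library(stringr)

pattern <- "\\."
writeLines(pattern)  # the regular expression the engine receives: \.

x <- c("2.357", "apple")

escaped   <- str_detect(x, pattern)     # a literal dot
literal   <- str_detect(x, fixed("."))  # plain-text search for a dot
unescaped <- str_detect(x, ".")         # any single character

escaped
literal
unescaped
```

Only the unescaped pattern matches "apple", since there . stands for any character.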

Examples


The following regular expression matches a literal \:

x <- c("a\\b", "a/b")
writeLines(x)
## a\b
## a/b
str_view(x, "\\\\", match = NA, html = TRUE)

The following regular expression matches a digit, a literal dot and another digit, as in 2.3:

x <- c("30", "2.56", "1e5", "0.5943")
str_view(x, "\\d\\.\\d", match = NA, html = TRUE)

Lab Exercises


  1. How would you match the sequence "'\?

  2. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

Anchors ^ and $


By default, regular expressions will match any part of a string. It’s often useful to anchor the regular expression so that it matches from the start or end of the string. You can use:

  • ^ to match the start of the string.
  • $ to match the end of the string.

For example, to match the pattern of starting or ending with letter “a”, we can do

x <- c("apple", "banana")
str_view(x, "^a", match = NA, html = TRUE)
str_view(x, "a$", match = NA, html = TRUE)

Our textbook provides an interesting mnemonic from Evan Misshula to help remember this: if you begin with power (^), you end up with money ($).

To force a regular expression to only match a complete string, anchor it with both ^ and $:

x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple", match = NA, html = TRUE)
str_view(x, "^apple$", match = NA, html = TRUE)
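The same comparison can be made with str_detect(), which returns TRUE or FALSE for each string instead of highlighting matches. This is a minimal sketch on the same vector.

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

anywhere <- str_detect(x, "apple")    # "apple" anywhere in the string
whole    <- str_detect(x, "^apple$")  # the complete string is "apple"
at_end   <- str_detect(x, "pie$")     # the string ends with "pie"

anywhere
whole
at_end
```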

Example


Now let’s get all words in words that end with “a”. We can use the str_detect() function to do the filtering; it returns TRUE or FALSE for each string in the vector.

x <- c("apple", "banana")
str_detect(x, "a$")
## [1] FALSE  TRUE
word_data %>%
  filter(str_detect(value, "a$")) %>%
  print()
## # A tibble: 6 × 2
##   value   length
##   <chr>    <int>
## 1 america      7
## 2 extra        5
## 3 area         4
## 4 idea         4
## 5 tea          3
## 6 a            1

Lab Exercises


  1. Find all words that start with “a” and end with “e” in the words data set.
  2. Find all words that are four letters long in words data set with and without using str_length function.

Literal $ and ^


Since $ and ^ have special meanings in regular expressions, we have to use escape sequences to represent a literal $ or ^. The same applies to every other symbol with a special meaning that we will meet later.

# To match the literal $^$
x <- c("a$^$b")
writeLines(x)
## a$^$b
str_view(x, "\\$\\^\\$", match = NA, html = TRUE)

Character classes and alternatives


There are a number of special patterns that match more than one character. You’ve already seen ., which matches any character apart from a newline. There are four other useful tools:

  • \d: matches any digit.
  • \s: matches any whitespace (e.g. space, tab, newline).
  • [abc]: matches a, b, or c.
  • [^abc]: matches anything except a, b, or c.

Note that ^ has a different meaning inside [] than outside it. Also, to use \d and \s in string representations, we must write "\\d" and "\\s".

A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex. Many people find this more readable.

str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c", match = NA, html = TRUE)

The code above finds the same pattern as a\.c, but its string representation is more readable than "a\\.c".

This works for most (but not all) regex metacharacters: $ . | ? * + ( ) [ {. Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: ] \ ^ and -.
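To see the character classes in action without relying on highlighted output, here is a small sketch with str_detect() on a made-up vector.

```r
library(stringr)

x <- c("a1", "ab", "a b", "a.c")

digit_after_a <- str_detect(x, "a\\d")         # "a" followed by a digit
space_after_a <- str_detect(x, "a\\s")         # "a" followed by white space
dot_after_a   <- str_detect(x, "a[.]")         # "a" followed by a literal dot
other_after_a <- str_detect(x, "a[^\\d\\s]")   # "a" followed by neither digit nor space

digit_after_a
space_after_a
dot_after_a
other_after_a
```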

Example


Question: Find all words in words that start with a vowel (“a”, “e”, “i”, “o” or “u”).

Solution:

word_data %>%
  filter(str_detect(value, "^[aeiou]")) %>%
  print()
## # A tibble: 175 × 2
##    value       length
##    <chr>        <int>
##  1 appropriate     11
##  2 environment     11
##  3 opportunity     11
##  4 experience      10
##  5 individual      10
##  6 understand      10
##  7 university      10
##  8 advertise        9
##  9 afternoon        9
## 10 associate        9
## # ℹ 165 more rows

Question: Find all words in words that end with ed, but not with eed.

Solution:

word_data %>%
  filter(str_detect(value, "ed$")) %>% # This only finds words ending with "ed"
  print() %>%
  filter(str_detect(value, "[^e]ed$")) %>%
  print()
## # A tibble: 9 × 2
##   value   length
##   <chr>    <int>
## 1 hundred      7
## 2 proceed      7
## 3 succeed      7
## 4 indeed       6
## 5 speed        5
## 6 feed         4
## 7 need         4
## 8 bed          3
## 9 red          3
## # A tibble: 3 × 2
##   value   length
##   <chr>    <int>
## 1 hundred      7
## 2 bed          3
## 3 red          3

Alternation


You can use alternation to pick between two or more alternative patterns. For example, abc|d..f will match either abc or a string such as deaf. Note that the precedence of | is low, so abc|xyz matches abc or xyz, not abcyz or abxyz. As with mathematical expressions, if precedence ever gets confusing, use parentheses to make clear what you want:

str_view(c("grey", "gray"), "gr(e|a)y", match = NA, html = TRUE)
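The effect of precedence is easy to check on a few made-up strings. In this sketch, str_extract() returns the text that was matched and str_detect() tells us whether there is a match at all.

```r
library(stringr)

x <- c("abc", "xyz", "abxyz")

# Low precedence: the alternatives are the whole "abc" and the whole "xyz"
low_precedence <- str_extract(x, "abc|xyz")
low_precedence

# Parentheses limit the alternation to a single position
grouped <- str_detect(x, "ab(c|x)yz")
grouped
```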

Example: Find all words in words that end with ing or ise


word_data %>%
  filter(str_detect(value, "(ing|ise)$")) %>% 
  print()
## # A tibble: 17 × 2
##    value     length
##    <chr>      <int>
##  1 advertise      9
##  2 otherwise      9
##  3 exercise       8
##  4 practise       8
##  5 surprise       8
##  6 evening        7
##  7 meaning        7
##  8 morning        7
##  9 realise        7
## 10 during         6
## 11 bring          5
## 12 raise          5
## 13 thing          5
## 14 king           4
## 15 ring           4
## 16 rise           4
## 17 sing           4

Lab Exercises


Find all words with the first letter being “a” or “e”, and the third letter being “r” or “s”.

Repetition


Next, let’s see how to describe patterns that repeat, either an exact number of times or a variable number of times.

The following symbols define how many times the preceding item (a character, a character class or a group) repeats:

  • ?: 0 or 1
  • +: 1 or more
  • *: 0 or more
  • {n}: exactly n times

For example, if we hope to search words starting with “a” and ending with “e”, we may do:

str_view(words, "^a.*e$", html = TRUE)

Here .* matches any sequence of characters of any length (including none), since . refers to any character and * means 0 or more times.

As another example, to search for words that contain three consecutive vowels, such as “iou”, we can do

str_view(words, "[aeiou]{3}", html = TRUE)

Here [aeiou] refers to a single character which is a vowel letter, and {3} refers to repeating three times.

Lab Exercise


  1. Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)
  • ^.*$
  • \d{4}-\d{2}-\d{2}
  • "\\\\{4}"
  2. Create a regular expression that finds all words that start with “s” and are at least 7 letters long.

A few more regular expression rules for repetition


  • {n,}: n times or more

  • {,m}: at most m times (if the regex engine rejects this shorthand, write {0,m} instead)

  • {n,m}: between n and m times
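The repetition rules can be compared side by side on a single string. The sketch below uses the Roman numeral for 1888, which contains runs of repeated letters, and str_extract() to show what each pattern captures; the variable names are chosen here for illustration.

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

optional    <- str_extract(x, "CC?")     # one C, optionally followed by another
one_or_more <- str_extract(x, "CC+")     # one C followed by one or more C
exactly_two <- str_extract(x, "C{2}")    # exactly two C
two_or_more <- str_extract(x, "C{2,}")   # two or more C
two_to_three <- str_extract(x, "C{2,3}") # between two and three C

c(optional, one_or_more, exactly_two, two_or_more, two_to_three)
```

By default these matches are greedy: each pattern takes as many repetitions as its rule allows.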

Grouping and backreferences


Parentheses () in a regular expression create a numbered capturing group. This is useful when we want to refer to exactly the same text later in the pattern.

For example, how do we search for words that contain the same letter, “a”, “e” or “i”, at least three times? We may do the following:

str_view(words, "([aei]).*\\1.*\\1", html = TRUE)

Here .* matches any sequence of characters of any length, as seen before. ([aei]) matches “a”, “e” or “i” and captures it, and "\\1", whose literal value is \1, refers to the same letter captured by () occurring again.

As another example, the following regular expression finds all fruits that have a repeated pair of letters.

str_view(fruit, "(..)\\1", html = TRUE)
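To see what the group actually captures, the sketch below applies the same pattern to a short made-up vector. str_subset() keeps the strings that match, and str_match() (covered in more detail later) returns the full match together with the captured group.

```r
library(stringr)

x <- c("banana", "coconut", "apple", "papaya")

# Keep only strings that contain a repeated pair of letters
repeated_pair <- str_subset(x, "(..)\\1")
repeated_pair

# First column: the whole match; second column: the text captured by (..)
m <- str_match("banana", "(..)\\1")
m
```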

Lab Exercises


  1. Give an example of the pattern that the following regular expressions would match:
  • (.)\1\1
  • "(.)(.)\\2\\1"

  2. Construct regular expressions to match words that start and end with the same character.

Application of regular expressions


Next, let’s learn how to apply regular expressions to real problems. We will learn stringr functions that help

  • Detect which strings match a pattern.

  • Find the positions of matches.

  • Extract the content of matches.

  • Replace matches with new values.

  • Split a string based on a match.

Detect Matches


To determine if a character vector matches a pattern, use str_detect(). It returns a logical vector the same length as the input. When there is a match, a TRUE will be returned; otherwise it will be FALSE.

x <- c("apple", "banana", "pear")
str_detect(x, "e")
## [1]  TRUE FALSE  TRUE

We can then use the sum() and mean() functions to find how many strings match the given pattern and what proportion of the strings in the vector match:

# How many common words start with t?
sum(str_detect(words, "^t"))
## [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
## [1] 0.2765306
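Because str_detect() returns a logical vector, conditions can also be negated or combined with ordinary logical operators. As a sketch on a made-up vector, the code below finds strings without any vowel in two equivalent ways.

```r
library(stringr)

x <- c("tea", "tree", "area", "idea", "sky")

n_start_t  <- sum(str_detect(x, "^t"))         # how many start with "t"
prop_vowel <- mean(str_detect(x, "[aeiou]$"))  # proportion ending with a vowel

# Two ways to find strings that contain no vowels at all
no_vowels_1 <- !str_detect(x, "[aeiou]")       # negate "contains a vowel"
no_vowels_2 <- str_detect(x, "^[^aeiou]+$")    # every character is a non-vowel

n_start_t
prop_vowel
x[no_vowels_1]
```

The negated version is often easier to read than a single, more complicated regular expression.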

Lab Exercise


Find how many words end with “e*e”, where “*” can be any letter.

Select a pattern


A common use of str_detect() is to select the elements that match a pattern. You can do this with logical subsetting, or the convenient str_subset() wrapper:

words[str_detect(words, "x$")]
## [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
## [1] "box" "sex" "six" "tax"

Typically, however, your strings will be one column of a data frame. We will use filter together with str_detect():

df <- tibble(
  word = words, 
  i = seq_along(word)
)
df %>% 
  filter(str_detect(word, "x$"))
## # A tibble: 4 × 2
##   word      i
##   <chr> <int>
## 1 box     108
## 2 sex     747
## 3 six     772
## 4 tax     841

Here the seq_along() function returns the position of each word in the vector.

A variation on str_detect() is str_count(): rather than a simple yes or no, it tells you how many matches there are in a string:

x <- c("apple", "banana", "pear")
str_count(x, "a")
## [1] 1 3 1
# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
## [1] 1.991837

It’s natural to use str_count() with mutate():

df %>% 
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )
## # A tibble: 980 × 4
##    word         i vowels consonants
##    <chr>    <int>  <int>      <int>
##  1 a            1      1          0
##  2 able         2      2          2
##  3 about        3      3          2
##  4 absolute     4      4          4
##  5 accept       5      2          4
##  6 account      6      3          4
##  7 achieve      7      4          3
##  8 across       8      2          4
##  9 act          9      1          2
## 10 active      10      3          3
## # ℹ 970 more rows
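One detail worth knowing when counting is that matches never overlap. The sketch below shows this on a made-up string.

```r
library(stringr)

# "aba" could be found starting at positions 1, 3 and 5,
# but once a match is used its characters are not reused
n_matches <- str_count("abababa", "aba")
n_matches

# Counting a two-letter pattern in several strings at once
n_an <- str_count(c("apple", "banana"), "an")
n_an
```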

Example


Now let’s look at an example with the sentences data set. First, let’s find the longest sentences:

df1 <- tibble(
  sentence = sentences,
  word_number = str_count(sentence, "\\s") + 1
)

df1 %>%
  arrange(desc(word_number)) %>%
  print()
## # A tibble: 720 × 2
##    sentence                                                  word_number
##    <chr>                                                           <dbl>
##  1 It was hidden from sight by a mass of leaves and shrubs.           12
##  2 It was a bad error on the part of the new judge.                   12
##  3 A ridge on a smooth surface is a bump or flaw.                     11
##  4 The barrel of beer was a brew of malt and hops.                    11
##  5 The crunch of feet in the snow was the only sound.                 11
##  6 The vane on top of the pole revolved in the wind.                  11
##  7 The bills were mailed promptly on the tenth of the month.          11
##  8 In the rear of the ground floor was a large passage.               11
##  9 The water in this well is a source of good health.                 11
## 10 He wrote his name boldly at the top of the sheet.                  11
## # ℹ 710 more rows

So we put the vector into a tibble, then count the number of words in each sentence by counting the white spaces and adding one. Then we arrange the rows in descending order of word count.

Next let’s find the sentences that do not have “a”, “an” or “the”:

df1 %>%
  filter(!str_detect(str_to_lower(sentence), "( a )|( the )|( an )")) %>%
  print()
## # A tibble: 254 × 2
##    sentence                                    word_number
##    <chr>                                             <dbl>
##  1 Rice is often served in round bowls.                  7
##  2 The juice of lemons makes fine punch.                 7
##  3 The hogs were fed chopped corn and garbage.           8
##  4 Four hours of steady work faced us.                   7
##  5 A large size in stockings is hard to sell.            9
##  6 A rod is used to catch pink salmon.                   8
##  7 Smoky fires lack flame and heat.                      6
##  8 The swan dive was far short of perfect.               8
##  9 Her purse was full of useless trash.                  7
## 10 Read verse out loud for pleasure.                     6
## # ℹ 244 more rows

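Here we use ! as logical NOT to exclude the given patterns. The spaces inside the parentheses make the pattern match “a”, “an” or “the” only as separate words. Because a space is required on both sides, an article at the very start of a sentence is missed, which is why sentences beginning with “The” or “A” still appear in the output above. A word-boundary pattern such as \b(a|an|the)\b would catch those as well.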
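To compare the space-based pattern with a word-boundary pattern, here is a small sketch on made-up sentences. It assumes the \b word-boundary escape and the ignore_case argument of stringr's regex() helper.

```r
library(stringr)

x <- c("The cat sat.", "Cats eat a lot.", "Birds sing loudly.", "An owl hoots.")

# Spaces on both sides: misses articles at the start of a sentence
space_based <- str_detect(str_to_lower(x), "( a )|( the )|( an )")
space_based

# Word boundaries: matches the article wherever it stands as a whole word
article <- regex("\\b(a|an|the)\\b", ignore_case = TRUE)
boundary_based <- str_detect(x, article)
boundary_based
```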

Lab Exercises


Find all sentences in sentences that contain neither “r” nor “s”.

Extract matches


In data cleaning tasks, we usually need to extract the actual text of a match. For example, we may want to know whether “q” is always followed by “u” in a word.

In that case, we can use str_extract().

q_string <- str_extract(words, "q.")
head(q_string)
## [1] NA NA NA NA NA NA

This returns a lot of NA values, which come from words that do not contain “q”. Let’s remove those NA values.

q_string <- q_string[!is.na(q_string)]
print(q_string)
##  [1] "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu"

Now it becomes clear that we only have “u” after “q” in those words. Another way to do this is to use the str_subset function to keep strings with the given patterns only.

q_string2 <- str_subset(words, "q.")
q_string2 <- str_extract(q_string2, "q.")
print(q_string2)
##  [1] "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu"

As another example, imagine we want to find all sentences in sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression:

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
## [1] "red|orange|yellow|green|blue|purple"

Now we can select the sentences that contain a colour in the list, and then extract the colour to figure out which one it is:

has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
## [1] "blue" "blue" "red"  "red"  "red"  "blue"

Note that str_extract() only extracts the first match. We can see that most easily by first selecting all the sentences that have more than 1 match:

more <- sentences[str_count(sentences, colour_match) > 1]
# In stringr 1.5.0 and later, str_view() highlights every match and str_view_all() is deprecated
str_view(more, colour_match, html = TRUE)

To get all matches, use str_extract_all(). It returns a list (we will learn about lists later) or a matrix (if using simplify = TRUE):

str_extract_all(more, colour_match)
## [[1]]
## [1] "blue" "red" 
## 
## [[2]]
## [1] "green" "red"  
## 
## [[3]]
## [1] "orange" "red"
str_extract_all(more, colour_match, simplify = TRUE)
##      [,1]     [,2] 
## [1,] "blue"   "red"
## [2,] "green"  "red"
## [3,] "orange" "red"
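The three return shapes are easiest to compare on a vector where the number of matches differs per string. This is a minimal sketch with made-up strings and a shortened colour pattern.

```r
library(stringr)

x <- c("red and blue", "green", "no colour here")
pattern <- "red|green|blue"

# First match only; NA where there is no match
first_match <- str_extract(x, pattern)
first_match

# All matches as a list; an empty character vector where there is no match
all_matches <- str_extract_all(x, pattern)
lengths(all_matches)

# All matches as a matrix, padded with "" to the same width
match_matrix <- str_extract_all(x, pattern, simplify = TRUE)
match_matrix
```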

Lab Exercise


There is a sentence in the previous example that doesn’t meet our criterion (“flickered” is not a colour). Think about how to remove it.

Grouped Matches


Earlier we talked about the use of parentheses for clarifying precedence and for backreferences when matching. You can also use parentheses to extract parts of a complex match with the str_match() function.

For example, let’s see how to extract year, month and day from a date string like "2023-03-28"

date_string <- "2023-03-28"
str_match(date_string, "(\\d{4})-(\\d{2})-(\\d{2})")
##      [,1]         [,2]   [,3] [,4]
## [1,] "2023-03-28" "2023" "03" "28"

Here str_match() returns a matrix whose first column is the complete match and whose next three columns are the groups in parentheses, in order.
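str_match() is vectorised like the other stringr functions. The sketch below runs the same pattern on a few made-up strings, names the columns (the names are an illustrative choice) and converts one group to integers.

```r
library(stringr)

dates <- c("2023-03-28", "not a date", "Due 1999-12-31!")
m <- str_match(dates, "(\\d{4})-(\\d{2})-(\\d{2})")

# Name the columns so that groups can be picked by name
colnames(m) <- c("match", "year", "month", "day")
m

# Strings without a match give a row of NA
years <- as.integer(m[, "year"])
years
```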

If your data is in a tibble, it’s often easier to use tidyr::extract(). It works like str_match() but requires you to name the matches, which are then placed in new columns.

Let’s take the flights data set as an example. There is a column time_hour that contains all the date-time information:

flights1 <- flights %>%
  select(time_hour) %>%
  print()
## # A tibble: 336,776 × 1
##    time_hour          
##    <dttm>             
##  1 2013-01-01 05:00:00
##  2 2013-01-01 05:00:00
##  3 2013-01-01 05:00:00
##  4 2013-01-01 05:00:00
##  5 2013-01-01 06:00:00
##  6 2013-01-01 05:00:00
##  7 2013-01-01 06:00:00
##  8 2013-01-01 06:00:00
##  9 2013-01-01 06:00:00
## 10 2013-01-01 06:00:00
## # ℹ 336,766 more rows

To show how things work, we remove all other columns. Now let’s create new columns named “year”, “month”, “day”, “hour”, “minute”, “second” which are all extracted from the time_hour string.

flights1 %>%
  extract(
    time_hour, 
    c("year", "month", "day", "hour", "minute", "second"),    
    "(\\d{4})-(\\d{2})-(\\d{2}) (\\d{2}):(\\d{2}):(\\d{2})",
    remove = FALSE, convert = TRUE
  ) %>%
  print()
## # A tibble: 336,776 × 7
##    time_hour            year month   day  hour minute second
##    <dttm>              <int> <int> <int> <int>  <int>  <int>
##  1 2013-01-01 05:00:00  2013     1     1     5      0      0
##  2 2013-01-01 05:00:00  2013     1     1     5      0      0
##  3 2013-01-01 05:00:00  2013     1     1     5      0      0
##  4 2013-01-01 05:00:00  2013     1     1     5      0      0
##  5 2013-01-01 06:00:00  2013     1     1     6      0      0
##  6 2013-01-01 05:00:00  2013     1     1     5      0      0
##  7 2013-01-01 06:00:00  2013     1     1     6      0      0
##  8 2013-01-01 06:00:00  2013     1     1     6      0      0
##  9 2013-01-01 06:00:00  2013     1     1     6      0      0
## 10 2013-01-01 06:00:00  2013     1     1     6      0      0
## # ℹ 336,766 more rows

Besides the data set, tidyr::extract() takes three main arguments: the name of the column to match, the names of the new columns as a character vector, and the regular expression with grouped matches. The match from each pair of parentheses is placed in the corresponding new column. remove = FALSE keeps the original column time_hour, and convert = TRUE asks for the new columns to be converted to the most appropriate data types.
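The same call can be tried on a tiny data frame, which makes the role of each argument easier to see. The column and its values below are made up for illustration.

```r
library(tidyr)

codes <- data.frame(code = c("A-12", "B-7"), stringsAsFactors = FALSE)

parsed <- tidyr::extract(
  codes,
  code,                          # column to match
  into = c("letter", "number"),  # names of the new columns
  regex = "([A-Z])-(\\d+)",      # one group per new column
  remove = FALSE,                # keep the original column
  convert = TRUE                 # turn "12" and "7" into integers
)
parsed
```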

Another case study


Let’s take the tidied who data set (TB case numbers) as another example. After tidying data, we arrive at the following data frame:

who1 <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  )
who1
## # A tibble: 76,046 × 6
##    country     iso2  iso3   year key          cases
##    <chr>       <chr> <chr> <dbl> <chr>        <dbl>
##  1 Afghanistan AF    AFG    1997 new_sp_m014      0
##  2 Afghanistan AF    AFG    1997 new_sp_m1524    10
##  3 Afghanistan AF    AFG    1997 new_sp_m2534     6
##  4 Afghanistan AF    AFG    1997 new_sp_m3544     3
##  5 Afghanistan AF    AFG    1997 new_sp_m4554     5
##  6 Afghanistan AF    AFG    1997 new_sp_m5564     2
##  7 Afghanistan AF    AFG    1997 new_sp_m65       0
##  8 Afghanistan AF    AFG    1997 new_sp_f014      5
##  9 Afghanistan AF    AFG    1997 new_sp_f1524    38
## 10 Afghanistan AF    AFG    1997 new_sp_f2534    36
## # ℹ 76,036 more rows

As we know, the key column contains information about TB types, gender and age group. To recall the pattern:

  1. The first three letters of each key denote whether it counts new or old cases of TB. In this dataset, every key counts new cases.

  2. The next two letters describe the type of TB:

  • rel stands for cases of relapse
  • ep stands for cases of extrapulmonary TB
  • sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
  • sp stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
  3. The letter following the last underscore gives the sex of TB patients. The dataset groups cases by males (m) and females (f).

  4. The remaining digits give the age group. The dataset groups cases into seven age groups:

  • 014 = 0 – 14 years old
  • 1524 = 15 – 24 years old
  • 2534 = 25 – 34 years old
  • 3544 = 35 – 44 years old
  • 4554 = 45 – 54 years old
  • 5564 = 55 – 64 years old
  • 65 = 65 or older

The following regular expression with three capture groups puts all the needed information into three new columns, “type”, “gender”, and “age_Group”.

who1 %>%
  extract(key, 
          c("type", "gender", "age_Group"), 
          "new[_]?(.*)_(m|f)(\\d*)", 
          remove = FALSE) %>%
  print()
## # A tibble: 76,046 × 9
##    country     iso2  iso3   year key          type  gender age_Group cases
##    <chr>       <chr> <chr> <dbl> <chr>        <chr> <chr>  <chr>     <dbl>
##  1 Afghanistan AF    AFG    1997 new_sp_m014  sp    m      014           0
##  2 Afghanistan AF    AFG    1997 new_sp_m1524 sp    m      1524         10
##  3 Afghanistan AF    AFG    1997 new_sp_m2534 sp    m      2534          6
##  4 Afghanistan AF    AFG    1997 new_sp_m3544 sp    m      3544          3
##  5 Afghanistan AF    AFG    1997 new_sp_m4554 sp    m      4554          5
##  6 Afghanistan AF    AFG    1997 new_sp_m5564 sp    m      5564          2
##  7 Afghanistan AF    AFG    1997 new_sp_m65   sp    m      65            0
##  8 Afghanistan AF    AFG    1997 new_sp_f014  sp    f      014           5
##  9 Afghanistan AF    AFG    1997 new_sp_f1524 sp    f      1524         38
## 10 Afghanistan AF    AFG    1997 new_sp_f2534 sp    f      2534         36
## # ℹ 76,036 more rows

In this regular expression, "new" matches the “new” at the beginning of every string in the key column. [_]? matches either nothing or a single “_”, since for the rel type there is no _ between “new” and “rel”.

Then the first group (.*) before the next “_” would capture the code of TB types (either “rel”, “ep”, “sn” or “sp”). The second group (m|f) then captures the gender code. The digits afterwards can be captured by the third group (\\d*).
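To see how the three groups behave, the short sketch below applies the same pattern to two keys with str_match(), which returns the full match followed by one column per group. The vector keys is a small illustrative sample, not the whole data set.

```r
library(stringr)

# Two sample keys: one with and one without "_" after "new"
keys <- c("new_sp_m014", "newrel_f65")

# Column 1 holds the full match; columns 2 to 4 hold the three groups
groups <- str_match(keys, "new[_]?(.*)_(m|f)(\\d*)")
groups
# Row 1 groups: "sp"  "m" "014"
# Row 2 groups: "rel" "f" "65"
```

Both keys give the same three pieces of information, which is why the optional [_]? is enough to handle the rel type.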

Replacing matches


str_replace() and str_replace_all() allow you to replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
## [1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
## [1] "-ppl-"  "p--r"   "b-n-n-"

With str_replace_all() you can perform multiple replacements by supplying a named vector:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
## [1] "one house"    "two cars"     "three people"
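The replacement string may also refer back to capture groups in the pattern with \\1, \\2, and so on. The following is a minimal sketch of this idea; the vector dates is made up for illustration.

```r
library(stringr)

# Hypothetical dates written as year-month-day
dates <- c("2013-01-15", "2013-12-31")

# Three groups capture year, month and day; the replacement reorders them
swapped <- str_replace(dates, "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
swapped
## [1] "15/01/2013" "31/12/2013"
```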

For the who data set in the case study above, an alternative is to replace "newrel" with "new_rel", so that every key follows the same format before we analyze it further.

who2 <- who1 %>% 
  mutate(key = str_replace(key, "newrel", "new_rel")) %>%
  filter(str_detect(key, "new_rel")) # Only keep rows with "new_rel" for checking
who2
## # A tibble: 2,580 × 6
##    country     iso2  iso3   year key           cases
##    <chr>       <chr> <chr> <dbl> <chr>         <dbl>
##  1 Afghanistan AF    AFG    2013 new_rel_m014   1705
##  2 Afghanistan AF    AFG    2013 new_rel_f014   1749
##  3 Albania     AL    ALB    2013 new_rel_m014     14
##  4 Albania     AL    ALB    2013 new_rel_m1524    60
##  5 Albania     AL    ALB    2013 new_rel_m2534    61
##  6 Albania     AL    ALB    2013 new_rel_m3544    32
##  7 Albania     AL    ALB    2013 new_rel_m4554    44
##  8 Albania     AL    ALB    2013 new_rel_m5564    50
##  9 Albania     AL    ALB    2013 new_rel_m65      67
## 10 Albania     AL    ALB    2013 new_rel_f014      5
## # ℹ 2,570 more rows

Lab Exercise


Switch the first and last letters for all words in words and place the result in a new column.

Splitting


A particularly useful function is str_split(). It can split, for example, a sentence into words.

sentences %>%
  head(5) %>% 
  str_split(" ")
## [[1]]
## [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
## [8] "planks."
## 
## [[2]]
## [1] "Glue"        "the"         "sheet"       "to"          "the"        
## [6] "dark"        "blue"        "background."
## 
## [[3]]
## [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."
## 
## [[4]]
## [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
## [8] "rare"    "dish."  
## 
## [[5]]
## [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

Because each component might contain a different number of pieces, this returns a list.
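If a rectangular result is more convenient than a list, str_split() also accepts simplify = TRUE, which returns a character matrix padded with empty strings. A small sketch, using a made-up vector x:

```r
library(stringr)

# Two strings with different numbers of pieces
x <- c("a b c", "d e")

# simplify = TRUE returns a matrix; missing pieces are filled with ""
pieces <- str_split(x, " ", simplify = TRUE)
pieces
dim(pieces)
## [1] 2 3
```

The second row has only two pieces, so its third entry is the empty string "".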

The splitting above is somewhat unsatisfactory, since the punctuation is included. So we can refine the pattern:

sentences1 <- sentences

str_sub(sentences1, -1, -1) <- ""  # Remove the last character which is a period

sentences1 %>%
  head(5) %>% 
  str_split("[^A-Za-z]+")
## [[1]]
## [1] "The"    "birch"  "canoe"  "slid"   "on"     "the"    "smooth" "planks"
## 
## [[2]]
## [1] "Glue"       "the"        "sheet"      "to"         "the"       
## [6] "dark"       "blue"       "background"
## 
## [[3]]
##  [1] "It"    "s"     "easy"  "to"    "tell"  "the"   "depth" "of"    "a"    
## [10] "well" 
## 
## [[4]]
## [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
## [8] "rare"    "dish"   
## 
## [[5]]
## [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls"

Here "a-z" and "A-Z" inside [] represent all lower-case and upper-case English letters, and the leading ^ negates the set. So we are splitting each sentence at any run of one or more non-letter characters (+ means one or more times). Note that the apostrophe also counts as a non-letter, so "It's" is split into "It" and "s".
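The sketch below illustrates why the final period was removed before splitting: when the string ends with a non-letter, the split leaves an empty string as the last piece. The string x is made up for illustration.

```r
library(stringr)

x <- "It's easy, isn't it?"

# Splitting with the trailing "?" still in place leaves a final ""
with_end <- str_split(x, "[^A-Za-z]+")[[1]]
with_end
## [1] "It"   "s"    "easy" "isn"  "t"    "it"   ""

# Dropping the last character first avoids the empty piece
trimmed <- str_split(str_sub(x, 1, -2), "[^A-Za-z]+")[[1]]
trimmed
## [1] "It"   "s"    "easy" "isn"  "t"    "it"
```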

Actually, there is a simpler way to do this, using the boundary() function.

sentences %>%
  head(5) %>%
  str_split(boundary("word"))
## [[1]]
## [1] "The"    "birch"  "canoe"  "slid"   "on"     "the"    "smooth" "planks"
## 
## [[2]]
## [1] "Glue"       "the"        "sheet"      "to"         "the"       
## [6] "dark"       "blue"       "background"
## 
## [[3]]
## [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well" 
## 
## [[4]]
## [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
## [8] "rare"    "dish"   
## 
## [[5]]
## [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls"

Here boundary("word") splits the string at word boundaries. We can also split by “line_break”, “character” or “sentence”.
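As a rough illustration of the other boundary types, the sketch below uses a made-up string x. boundary() works with other stringr functions as well, such as str_count().

```r
library(stringr)

x <- "The birch canoe slid. Glue the sheet!"

# Count words; punctuation and spaces are not counted
n_words <- str_count(x, boundary("word"))
n_words
## [1] 7

# Split a string into single characters
chars <- str_split("abc", boundary("character"))[[1]]
chars
## [1] "a" "b" "c"

# Split into sentences
sents <- str_split(x, boundary("sentence"))[[1]]
length(sents)
## [1] 2
```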

Other uses of regular expressions


There are two useful functions in base R that also use regular expressions:

  • apropos() searches all objects available from the global environment whose names match the given regular expression. This is useful if you can’t quite remember the name of a function.
apropos("replace")
## [1] "%+replace%"       "replace"          "replace_na"       "setReplaceMethod"
## [5] "str_replace"      "str_replace_all"  "str_replace_na"   "theme_replace"
  • dir() lists all the files in a directory. The pattern argument takes a regular expression and only returns file names that match the pattern. For example, you can find all the R Markdown files in the current directory with:
dir(pattern = "\\.Rmd$")
##  [1] "1-Introduction.Rmd"                    
##  [2] "EDA_Class_Exercise.Rmd"                
##  [3] "R_Functions.Rmd"                       
##  [4] "RMD10_Data_Tidying_1.Rmd"              
##  [5] "RMD11_Data_Tidying_2.Rmd"              
##  [6] "RMD12_Data_Import.Rmd"                 
##  [7] "RMD13_Strings.Rmd"                     
##  [8] "RMD2_Basics_Descriptive_Statistics.Rmd"
##  [9] "RMD3_Data_Visualization_1.Rmd"         
## [10] "RMD4_Data_Visualization_2.Rmd"         
## [11] "RMD5_Data_Visualization_3.Rmd"         
## [12] "RMD6_Data_Transformation_1.Rmd"        
## [13] "RMD7_Data_Transformation_2.Rmd"        
## [14] "RMD8_EDA_1.Rmd"                        
## [15] "RMD9_EDA_2.Rmd"
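The escaped dot and the $ anchor in "\\.Rmd$" both matter. The sketch below shows the difference on a made-up vector of file names, using str_subset() in place of dir() so that it does not depend on the files in your working directory.

```r
library(stringr)

# Hypothetical file names, for illustration only
files <- c("RMD13_Strings.Rmd", "RMD13_Strings.html",
           "notes_Rmd.txt", "old.Rmd.bak")

# Literal ".Rmd" at the end of the name only
strict <- str_subset(files, "\\.Rmd$")
strict
## [1] "RMD13_Strings.Rmd"

# Unescaped "." matches any character, and without "$" the match can be anywhere
loose <- str_subset(files, ".Rmd")
loose
## [1] "RMD13_Strings.Rmd" "notes_Rmd.txt"     "old.Rmd.bak"
```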

In real applications, we frequently use regular expressions to search for the names of files and folders in coding projects.

Lab Homework


  1. Finish all lab exercises.

Submit your answers in a single pdf or html file knitted from an R Markdown file. Submit your R Markdown file as well.