Load Libraries

library(tidyverse)
library(nycflights13)


0. Introduction


Strings are collections of characters used to store “text data”, or any data expressed as text. Handling strings well is an important skill in data science. In this module, we will study string basics, the string functions in stringr, and regular expressions.


1. String basics


You can create strings with either single quotes or double quotes. Unlike in some other languages, there is no difference in behaviour.

string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'

If you forget to close a quote, you’ll see +, the continuation character:

> "This is a string without a closing quote
+ 
+ 
+ HELP I'M STUCK

If this happens to you, just press Esc (Escape) and try again!

To include a literal single or double quote in a string, you can use \ to “escape” it:

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

If you need a literal backslash, you need to double it:

backslash <- "\\"


Literal value and representation of strings


An important thing to know about strings is that each one has a literal value (what the string actually is) and a representation (how you type it into a programming language). Every literal value is paired with a representation.

For example, for the literal value \, we must input "\\" in R.

Unlike in Python, the print() function in R displays the representation. To show the literal value of a string, we need to use the function writeLines().

print("\\")
## [1] "\\"
writeLines("\\")
## \

Like many other programming languages, R uses a backslash to start an escape sequence inside a string:

Representation   Literal value
\n               new line
\t               tab character
\\               backslash \
\"               double quotation mark "
\'               single quotation mark '
\`               backtick `

For the full table of escape sequences, you may check the help documentation of quotes.

help("'")

For example, if we hope to write a string with literal value of "\", we need to write

my_string <- "\"\\\""
writeLines(my_string)
## "\"
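One way to check that you typed the representation you intended is to count the characters of the literal value. The following is only a small sketch: the variable names s and n_chars are made up for illustration, and str_length() comes from stringr, which is introduced later in these notes.

```r
library(stringr)

# Representation: a, escaped backslash, b, escaped new line,
# escaped double quote, c, escaped double quote
s <- "a\\b\n\"c\""

print(s)        # displays the representation, escapes included
writeLines(s)   # displays the literal value, spread over two lines

# Each escape sequence counts as a single character of the literal value
n_chars <- str_length(s)
n_chars
```

If the count is larger than expected, a backslash was probably doubled once too often.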


Lab Exercise: Write a string whose literal value is \\\


Use UTF-8 code


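Every character has a Unicode code point, and UTF-8 encodes that code point as one or more bytes. In an R string we can write \u followed by the code point, or \x followed by each UTF-8 byte, and print the character:

writeLines("\u00b5") # The micro sign; the Greek letter mu itself is "\u03bc"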
## µ
writeLines("\xe4\xbd\xa0\xe5\xa5\xbd") # The Chinese "你好"
## 你好
writeLines("\u2660") # Spade symbol of a card suit
## ♠

There are many online encoders to convert any character into UTF-8 codes.
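R can also look these codes up itself. The sketch below uses the base functions utf8ToInt() and charToRaw(); the variable names are illustrative only, and the characters are written with \u escapes so that the example does not depend on the encoding of the source file.

```r
# Code point of the micro sign, as an integer and as a \u-style hex code
mu_code <- utf8ToInt("\u00b5")
mu_hex  <- sprintf("\\u%04x", mu_code)

# UTF-8 bytes of the first character of the Chinese greeting, "\u4f60"
ni_bytes <- charToRaw("\u4f60")

mu_code
writeLines(mu_hex)
ni_bytes
```

The three bytes returned for "\u4f60" are the first three \x values used in the example above.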

Multiple strings are often stored in a character vector, which you can create with c():

string_vector <- c("One", "Two", "Three")
print(string_vector)
## [1] "One"   "Two"   "Three"


2. String functions in stringr


Base R contains many functions to work with strings but we’ll avoid them because they can be inconsistent, which makes them hard to remember. Instead we’ll use functions from stringr. These have more intuitive names, and all start with str_. For example, str_length() tells you the number of characters in a string:

str_length(c("a", "R for data science", NA))
## [1]  1 18 NA

To combine two or more strings, use str_c():

str_c("x","y","z")
## [1] "xyz"

Use the sep argument to control how they’re separated:

str_c("x","y","z", sep = "+")
## [1] "x+y+z"

str_c() is vectorised, and it automatically recycles shorter vectors to the same length as the longest:

str_c("b", c("a", "e", "u"), "g")
## [1] "bag" "beg" "bug"

To collapse a vector of strings into a single string, use the collapse argument:

str_c(c("x", "y", "z"), collapse = ",")
## [1] "x,y,z"
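The sep and collapse arguments can also be combined. As a small sketch (the vectors letters3 and numbers3 are invented for this illustration), sep joins the pieces inside each element first, and collapse then joins the resulting elements:

```r
library(stringr)

letters3 <- c("a", "b", "c")
numbers3 <- c("1", "2", "3")

# sep joins the pieces within each element, giving a vector of three strings
pairs <- str_c(letters3, numbers3, sep = "-")

# collapse then joins those elements into one single string
combined <- str_c(letters3, numbers3, sep = "-", collapse = ", ")

pairs
combined
```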


Subsetting strings


You can extract parts of a string using str_sub(). As well as the string, str_sub() takes start and end arguments which give the (inclusive) position of the substring:

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
## [1] "App" "Ban" "Pea"
# negative numbers count backwards from end
str_sub(x, -3, -1)
## [1] "ple" "ana" "ear"

Note that str_sub() won’t fail if the string is too short: it will just return as much as possible:

str_sub("a", 1, 5)
## [1] "a"

You can also use the assignment form of str_sub() to modify part of a string:

x <- "There is a typo in the word studant"
str_sub(x, -3, -3) <- "e"
x
## [1] "There is a typo in the word student"

The str_to_lower() and str_to_upper() functions convert text to lower case and upper case respectively.

x <- "china"
str_to_lower(x)
## [1] "china"
str_to_upper(x)
## [1] "CHINA"
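These functions combine naturally. As a minimal sketch (the vector name fruits3 is made up for illustration), the assignment form of str_sub() together with str_to_upper() capitalises the first letter of every string:

```r
library(stringr)

fruits3 <- c("apple", "banana", "pear")

# Replace the first character of each string with its upper-case version
str_sub(fruits3, 1, 1) <- str_to_upper(str_sub(fruits3, 1, 1))
fruits3
```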

The str_sort() function sorts a vector of strings in alphabetical order, either increasing (the default) or decreasing.

str_sort(c("apple", "orange", "banana"))
## [1] "apple"  "banana" "orange"
str_sort(c("apple", "orange", "banana"), decreasing = TRUE)
## [1] "orange" "banana" "apple"


stringr data


To practise string manipulation, we will use the three string data sets that come pre-loaded with the stringr package: words, fruit and sentences.

  • words contains 980 of the most commonly used English words
  • fruit contains 80 English names of fruits
  • sentences contains 720 English sentences from the “Harvard sentences”, which were used for standardised testing of voice

Let’s play with them. First, find the longest words in the words data set:

word_data <- as_tibble(words) %>%
  mutate(length = str_length(value)) %>%
  arrange(desc(length)) %>%
  print()
## # A tibble: 980 × 2
##    value       length
##    <chr>        <int>
##  1 appropriate     11
##  2 environment     11
##  3 opportunity     11
##  4 responsible     11
##  5 department      10
##  6 difference      10
##  7 experience      10
##  8 individual      10
##  9 particular      10
## 10 photograph      10
## # … with 970 more rows
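The same question can be answered without building a tibble, by subsetting the words vector directly. This is only a sketch of one alternative; the names word_lengths and longest_words are illustrative.

```r
library(stringr)

# Length of every word in the stringr data set
word_lengths <- str_length(words)

# Keep only the words whose length equals the maximum length
longest_words <- words[word_lengths == max(word_lengths)]
longest_words
```

This returns only the words tied for the maximum length, which are the 11-letter words at the top of the table above.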

Now, let’s say we hope to find all words that match patterns such as:

  • contain the exact sequence “an” as part of the word
  • contain at least two “a”s
  • end with “e” and are longer than 6 letters

How do we do these jobs? For that we need our next topic: regular expressions.


3. Regular expressions


Introduction

Regular expressions, “regexps” or “regex” for short, are a mini programming language that allows you to describe patterns in strings. They are very powerful for handling file names, folder names, text, or any other job involving strings.

For example, we have tidied the who data about TB cases in different countries and years. One step there was to separate a “key” column into several columns:

who1 <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  )
who1
## # A tibble: 76,046 × 6
##    country     iso2  iso3   year key          cases
##    <chr>       <chr> <chr> <dbl> <chr>        <dbl>
##  1 Afghanistan AF    AFG    1997 new_sp_m014      0
##  2 Afghanistan AF    AFG    1997 new_sp_m1524    10
##  3 Afghanistan AF    AFG    1997 new_sp_m2534     6
##  4 Afghanistan AF    AFG    1997 new_sp_m3544     3
##  5 Afghanistan AF    AFG    1997 new_sp_m4554     5
##  6 Afghanistan AF    AFG    1997 new_sp_m5564     2
##  7 Afghanistan AF    AFG    1997 new_sp_m65       0
##  8 Afghanistan AF    AFG    1997 new_sp_f014      5
##  9 Afghanistan AF    AFG    1997 new_sp_f1524    38
## 10 Afghanistan AF    AFG    1997 new_sp_f2534    36
## # … with 76,036 more rows

Previously we used the mutate and separate functions to split the key column into type, Gender and Age_Group. With regular expressions, we can do all of this in just one line:

tidyr::extract(who1, key, c("type", "Gender", "Age_Group"), "new[_]?(.*)_(m|f)(\\d*)", remove = FALSE) -> who1
who1
## # A tibble: 76,046 × 9
##    country     iso2  iso3   year key          type  Gender Age_Group cases
##    <chr>       <chr> <chr> <dbl> <chr>        <chr> <chr>  <chr>     <dbl>
##  1 Afghanistan AF    AFG    1997 new_sp_m014  sp    m      014           0
##  2 Afghanistan AF    AFG    1997 new_sp_m1524 sp    m      1524         10
##  3 Afghanistan AF    AFG    1997 new_sp_m2534 sp    m      2534          6
##  4 Afghanistan AF    AFG    1997 new_sp_m3544 sp    m      3544          3
##  5 Afghanistan AF    AFG    1997 new_sp_m4554 sp    m      4554          5
##  6 Afghanistan AF    AFG    1997 new_sp_m5564 sp    m      5564          2
##  7 Afghanistan AF    AFG    1997 new_sp_m65   sp    m      65            0
##  8 Afghanistan AF    AFG    1997 new_sp_f014  sp    f      014           5
##  9 Afghanistan AF    AFG    1997 new_sp_f1524 sp    f      1524         38
## 10 Afghanistan AF    AFG    1997 new_sp_f2534 sp    f      2534         36
## # … with 76,036 more rows

The odd-looking string here is a regular expression that identifies particular patterns in order to decode the key column. Essentially, regular expressions are like a mini programming language. At the beginning it may take some time to get used to them, but once you understand how they work you will find them very useful and fun to work with.
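To see what each pair of parentheses in that pattern captures, it can be tried on a couple of keys in isolation. The sketch below uses str_match() from stringr, which is not otherwise covered in these notes; the vector keys and the name parts are illustrative.

```r
library(stringr)

keys <- c("new_sp_m014", "newrel_f65")

# str_match() returns a character matrix: the full match in column 1,
# then one column for each ( ) group in the pattern
parts <- str_match(keys, "new[_]?(.*)_(m|f)(\\d*)")
parts
```

Columns 2 to 4 of the result correspond to the type, Gender and Age_Group columns created by tidyr::extract().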

To learn regular expressions, we’ll use str_view() as the starting point. str_view() takes a character vector and a regular expression, and shows you how they match.


Basic matches

Let’s start with the simplest case: matching one or more exact letters.


x <- c("apple", "banana", "pear")
str_view(x, "a", html = TRUE)
x <- c("apple", "banana", "pear")
str_view(x, "an", html = TRUE, match = NA)

So we see that the function highlights any part of the words that matches the given pattern, which is exactly "a" or "an" in this case.

The template to use str_view is:

str_view(string, pattern, match = TRUE, html = FALSE)

Here string is the string or vector of strings to inspect, and pattern is the regular expression describing what we are looking for. match controls which strings are printed: TRUE shows only those with a match, FALSE only those without a match, and NA shows all of them. html should be TRUE only when we want to render the result in a web page (such as a knitted R Markdown document).


. matches any single character (except a new line)

Now let’s study the mini language of regular expression. First, a mere . in a regular expression represents any single character excluding a new line. For example,

str_view(x, ".a.", match = NA, html = TRUE)

matches any three characters with “a” in the middle. Similarly, ... matches any three consecutive characters.

x <- c("a", "ab", "abc", "abcd")
str_view(x, "...", match = NA, html = TRUE)

However, . does not match the new line character \n, so a pattern such as ... cannot match across a line break:

x <- 'ab\ncd'
writeLines(x)
## ab
## cd
str_view(x, "...", match = NA, html = TRUE)

Use escape sequences in regular expressions.

Before we learn more ways to describe patterns, we need to learn the escape sequences used in regular expressions. We know that . represents any single character, but then how do we express a literal . itself? We have to use the escape sequence \. to represent a literal .

However, if we try to write this directly in R, we get an error:

x <- c("2.357", "apple")
str_view(x, "\.", match = NA, html = TRUE)

Why doesn’t this work? The regular expression \. has to be written as an R string, and inside a string the backslash itself must be escaped. So for the literal value \., the representation is "\\.", as we learned above. The right thing to do is:

x <- c("2.357", "apple")
str_view(x, "\\.", match = NA, html = TRUE)

In summary, we have two ways to write a regular expression:

  • its literal value, such as . or \.
  • its string representation, such as ".", or "\\." where we must use a pair of quotation marks to enclose the string.

In our textbook, both ways may be used to write a regular expression. But remember that in R we have to use the second way as the input to str_view or other functions that work with regular expressions.

The table below summarises this.

Literal value of regex   String representation in R   Meaning
.                        "."                          Any single character (excluding new line)
\.                       "\\."                        A literal .
\\                       "\\\\"                       A literal \
"                        '"' or "\""                  A literal "
'                        "'" or '\''                  A literal '
\d                       "\\d"                        Any digit (0-9)
\s                       "\\s"                        Any white space
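A quick way to check a pattern string is to print its literal value with writeLines(), which shows the regular expression exactly as the regex engine will receive it. A minimal sketch, with illustrative variable names:

```r
library(stringr)

patterns <- c("\\.", "\\\\", "\\d\\.\\d")

# Literal values, i.e. the regular expressions themselves
writeLines(patterns)

# Each escaped backslash in the representation is one literal character
pattern_lengths <- str_length(patterns)
pattern_lengths
```

If writeLines() shows something different from the regular expression you had in mind, the string representation needs more (or fewer) backslashes.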


Examples

The following regular expression matches a literal \:

x <- c("a\\b", "a/b")
writeLines(x)
## a\b
## a/b
str_view(x, "\\\\", match = NA, html = TRUE)

The following regular expression matches a pattern like 2.3, that is, a digit, a literal dot, and another digit:

x <- c("30", "2.56", "1e5", "0.5943")
str_view(x, "\\d\\.\\d", match = NA, html = TRUE)


Lab Exercises:

  1. How would you match the sequence "'\?

  2. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?


Anchors ^ and $


By default, regular expressions will match any part of a string. It’s often useful to anchor the regular expression so that it matches from the start or end of the string. You can use:

  • ^ to match the start of the string.
  • $ to match the end of the string.

For example, to match the pattern of starting or ending with letter “a”, we can do

x <- c("apple", "banana")
str_view(x, "^a", match = NA, html = TRUE)
str_view(x, "a$", match = NA, html = TRUE)

Our textbook provides an interesting mnemonic from Evan Misshula to help remember this: if you begin with power (^), you end up with money ($).

To force a regular expression to only match a complete string, anchor it with both ^ and $:

x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple", match = NA, html = TRUE)
str_view(x, "^apple$", match = NA, html = TRUE)


Example

Now let’s get all words in words that end with “a”. We can use the str_detect function to do the filtering; it returns TRUE or FALSE for each string in the vector.

x <- c("apple", "banana")
str_detect(x, "a$")
## [1] FALSE  TRUE
word_data %>%
  filter(str_detect(value, "a$")) %>%
  print()
## # A tibble: 6 × 2
##   value   length
##   <chr>    <int>
## 1 america      7
## 2 extra        5
## 3 area         4
## 4 idea         4
## 5 tea          3
## 6 a            1
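str_detect() also makes it easy to compare the effect of each anchor side by side. A small sketch on the vector used earlier (the result names are illustrative):

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

starts_with_apple <- str_detect(x, "^apple")   # anchored at the start only
ends_with_pie     <- str_detect(x, "pie$")     # anchored at the end only
exactly_apple     <- str_detect(x, "^apple$")  # anchored at both ends

starts_with_apple
ends_with_pie
exactly_apple
```

Only the pattern anchored at both ends rejects "apple pie" and "apple cake" while accepting "apple".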


Lab Exercises:

  1. Find all words that start with “a” and end with “e” in the words data set.
  2. Find all words that are four letters long in the words data set, with and without using the str_length function.


Literal $ and ^


Since $ and ^ have special meanings in regular expressions, we have to use escape sequences to represent a literal $ or ^ as well. The same applies to every other special symbol we meet later.

# To match the literal $^$
x <- c("a$^$b")
writeLines(x)
## a$^$b
str_view(x, "\\$\\^\\$", match = NA, html = TRUE)


Character classes and alternatives


There are a number of special patterns that match more than one character. You’ve already seen ., which matches any character apart from a newline. There are four other useful tools:

  • \d: matches any digit.
  • \s: matches any whitespace (e.g. space, tab, newline).
  • [abc]: matches a, b, or c.
  • [^abc]: matches anything except a, b, or c.

Note that ^ has a different meaning inside [] than outside it. Also, to use \d and \s in string representations, we must write "\\d" and "\\s".

A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex. Many people find this more readable.

str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c", match = NA, html = TRUE)

The code above finds the same pattern as a\.c, but its string representation is more readable than "a\\.c".

This works for most (but not all) regex metacharacters: $ . | ? * + ( ) [ {. Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: ] \ ^ and -.
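The following sketch compares a few of these character classes on the same vector, using str_detect() (introduced in the anchors section) to report which strings match. The result names are illustrative.

```r
library(stringr)

x <- c("abc", "a.c", "a*c", "a c")

dot_class   <- str_detect(x, "a[.]c")    # [.] is a literal dot
dot_escape  <- str_detect(x, "a\\.c")    # the same pattern with a backslash
space_mid   <- str_detect(x, "a\\sc")    # whitespace between a and c
not_dotstar <- str_detect(x, "a[^.*]c")  # middle character is neither . nor *

dot_class
dot_escape
space_mid
not_dotstar
```

The first two results are identical, which confirms that [.] and \. describe the same pattern.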


Example

Question: Find all words in words that start with a vowel (“a”, “e”, “i”, “o” or “u”).

Solution:

word_data %>%
  filter(str_detect(value, "^[aeiou]")) %>%
  print()
## # A tibble: 175 × 2
##    value       length
##    <chr>        <int>
##  1 appropriate     11
##  2 environment     11
##  3 opportunity     11
##  4 experience      10
##  5 individual      10
##  6 understand      10
##  7 university      10
##  8 advertise        9
##  9 afternoon        9
## 10 associate        9
## # … with 165 more rows

Question: Find all words in words that end with ed, but not with eed.

Solution:

word_data %>%
  filter(str_detect(value, "ed$")) %>% # This only finds words ending with "ed"
  print() %>%
  filter(str_detect(value, "[^e]ed$")) %>%
  print()
## # A tibble: 9 × 2
##   value   length
##   <chr>    <int>
## 1 hundred      7
## 2 proceed      7
## 3 succeed      7
## 4 indeed       6
## 5 speed        5
## 6 feed         4
## 7 need         4
## 8 bed          3
## 9 red          3
## # A tibble: 3 × 2
##   value   length
##   <chr>    <int>
## 1 hundred      7
## 2 bed          3
## 3 red          3


Alternation

You can use alternation to pick between two or more alternative patterns. For example, abc|d..f will match either abc or a string such as deaf. Note that the precedence of | is low, so abc|xyz matches abc or xyz, not abcyz or abxyz. As with mathematical expressions, if precedence ever gets confusing, use parentheses to make clear what you want:

str_view(c("grey", "gray"), "gr(e|a)y", match = NA, html = TRUE)
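The low precedence of | matters most when anchors are involved. As a small sketch (the test vector and result names are invented for illustration), compare the same alternation with and without parentheses:

```r
library(stringr)

x <- c("abc", "xyz", "abxyz", "abcyz")

# Without parentheses the pattern reads "^abc" OR "xyz$",
# so each anchor applies to one alternative only
loose <- str_detect(x, "^abc|xyz$")

# With parentheses the whole string must be either abc or xyz
strict <- str_detect(x, "^(abc|xyz)$")

loose
strict
```

The unparenthesised pattern accepts all four strings, while the parenthesised one accepts only "abc" and "xyz".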


Example: Find all words in words that end with ing or ise.

word_data %>%
  filter(str_detect(value, "(ing|ise)$")) %>% 
  print()
## # A tibble: 17 × 2
##    value     length
##    <chr>      <int>
##  1 advertise      9
##  2 otherwise      9
##  3 exercise       8
##  4 practise       8
##  5 surprise       8
##  6 evening        7
##  7 meaning        7
##  8 morning        7
##  9 realise        7
## 10 during         6
## 11 bring          5
## 12 raise          5
## 13 thing          5
## 14 king           4
## 15 ring           4
## 16 rise           4
## 17 sing           4


Lab Exercises: Find all words with the first letter being “a” or “e”, and the third letter being “r” or “s”.


Repetition


Next, let’s see how to describe patterns that repeat, either an exact number of times or within some range.

The following symbols define how many times the preceding item (a character, a character class or a group) repeats:

  • ?: 0 or 1
  • +: 1 or more
  • *: 0 or more
  • {n}: exactly n times

For example, if we hope to search words starting with “a” and ending with “e”, we may do:

str_view(words, "^a.*e$", html = TRUE)

Here .* stands for any sequence of characters of any length, since . matches any character and * means 0 or more times.

As another example, to search for words that contain three consecutive vowels, for example “iou”, we can do

str_view(words, "[aeiou]{3}", html = TRUE)

Here [aeiou] refers to a single character which is a vowel letter, and {3} refers to repeating three times.

Lab Exercises:

  1. Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)
  • ^.*$
  • \d{4}-\d{2}-\d{2}
  • "\\\\{4}"
  2. Create regular expressions to find all words that start with “s” and are at least 7 letters long.


A few more regular expression rules for repetition

  • {n,}: n times or more
  • {0,m}: at most m times (the shorthand {,m} is not accepted by every regular expression engine)
  • {n,m}: between n and m times
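The repetition symbols can be compared on one small vector. In this sketch (test strings and result names are made up for illustration), each pattern differs only in how many times the letter u may repeat:

```r
library(stringr)

x <- c("color", "colour", "colouur")

zero_or_one  <- str_detect(x, "colou?r")      # u appears 0 or 1 times
one_or_more  <- str_detect(x, "colou+r")      # u appears 1 or more times
zero_or_more <- str_detect(x, "colou*r")      # u appears 0 or more times
exactly_two  <- str_detect(x, "colou{2}r")    # u appears exactly 2 times
two_or_more  <- str_detect(x, "colou{2,}r")   # u appears 2 or more times
one_to_two   <- str_detect(x, "colou{1,2}r")  # u appears 1 to 2 times

rbind(zero_or_one, one_or_more, zero_or_more,
      exactly_two, two_or_more, one_to_two)
```

Note that the quantifier applies only to the single item right before it, here the letter u, not to the whole word.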


Grouping and backreferences


Parentheses () in regular expressions create a numbered capturing group. This is useful when we want to refer to exactly the same text later in the pattern, using a backreference such as \1 or \2.

For example, how do we search for words that contain the same letter, one of “a”, “e” or “i”, at least three times? We may do the following:

str_view(words, "([aei]).*\\1.*\\1", html = TRUE)

Here .* stands for any sequence of characters of any length, as seen before. ([aei]) matches “a”, “e” or “i” and captures it, and "\\1", whose literal value is \1, refers to the same letter captured in () occurring again.

As another example, the following regular expression finds all fruits that have a repeated pair of letters.

str_view(fruit, "(..)\\1", html = TRUE)
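To make the role of the backreference concrete, here is a small sketch on three hand-picked words (the vector and result names are illustrative), using str_detect() to report which ones match:

```r
library(stringr)

x <- c("banana", "coconut", "apple")

# Two characters followed by the same two characters again (e.g. "anan", "coco")
repeated_pair <- str_detect(x, "(..)\\1")

# One character followed by the same character again (e.g. "pp")
doubled_letter <- str_detect(x, "(.)\\1")

repeated_pair
doubled_letter
```

The backreference \1 matches the exact text captured by the group, not just any text that fits the group's pattern.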


Lab Exercises:

  1. Give an example of the pattern that the following regular expressions would match:
  • (.)\1\1
  • "(.)(.)\\2\\1"

  2. Construct regular expressions to match words that start and end with the same character.