1 String basics

1.1 Introduction

1.1.1 Quotes

Let’s get started by entering some strings in R. In the video you saw that you use quotes to tell R to interpret something as a string. Both double quotes (") and single quotes (') work, but there are some guidelines for which to use.

First, you should prefer double quotes (") to single quotes ('). That means, whenever you are defining a string, your first intuition should be to use ".

Unfortunately, if your string has " inside it, R will interpret the double quote as “this is the end of the string”, not as “this is the character "”. This is one time you can forget the first guideline and use the single quote, ', to define the string.

Finally, there are cases where you need both ' and " inside the string. In this case, fall back to the first guideline and use " to define the string, but you’ll have to escape any double quotes inside the string using a backslash (i.e. \").

# Define line1
line1 <- "The table was a large one, but the three were all crowded together at one corner of it:"

# Define line2
line2 <- '"No room! No room!" they cried out when they saw Alice coming.'

# Define line3
line3 <- "\"There's plenty of room!\" said Alice indignantly, and she sat down in a large arm-chair at one end of the table."

1.1.2 What you see isn’t always what you have

Take a look at line2, the string you just defined, by printing it:

line2
[1] "\"No room! No room!\" they cried out when they saw Alice coming."

Even though you used single quotes so you didn’t have to escape any double quotes, when R prints it, you’ll see escaped double quotes (\")! R doesn’t care how you defined the string; it only knows what the string represents: in this case, a string with double quotes inside.

When you ask R for line2 it is actually calling print(line2) and the print() method for strings displays strings as you might enter them. If you want to see the string it represents you’ll need to use a different function: writeLines().

You can pass writeLines() a vector of strings and it will print them to the screen, each on a new line. This is a great way to check the string you entered really does represent the string you wanted.

# Putting lines in a vector
lines <- c(line1, line2, line3)

# Print lines
print(lines)
[1] "The table was a large one, but the three were all crowded together at one corner of it:"                           
[2] "\"No room! No room!\" they cried out when they saw Alice coming."                                                  
[3] "\"There's plenty of room!\" said Alice indignantly, and she sat down in a large arm-chair at one end of the table."
# Use writeLines() on lines
writeLines(lines)
The table was a large one, but the three were all crowded together at one corner of it:
"No room! No room!" they cried out when they saw Alice coming.
"There's plenty of room!" said Alice indignantly, and she sat down in a large arm-chair at one end of the table.
# Write lines with a space separator
writeLines(lines, sep = " ")
The table was a large one, but the three were all crowded together at one corner of it: "No room! No room!" they cried out when they saw Alice coming. "There's plenty of room!" said Alice indignantly, and she sat down in a large arm-chair at one end of the table. 
# Use writeLines() on the string "hello\n\U1F30D"
writeLines("hello\n\U1F30D")
hello
<U+0001F30D>

The function cat() is very similar to writeLines(), but by default separates elements with a space, and will attempt to convert non-character objects to a string.
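As a rough illustration of that difference, the sketch below sends the same small vector to writeLines() and to cat(). The vector fruits is made up for this example, and capture.output() is used only so that the printed text can be inspected as strings.

```r
# Hypothetical example vector (not part of the course data)
fruits <- c("apple", "banana", "cherry")

# writeLines() puts each element on its own line
writeLines(fruits)

# cat() separates elements with a space and adds no trailing newline,
# so a "\n" is supplied explicitly here
cat(fruits, "\n")

# cat() also accepts non-character input and converts it to text
cat(1:3, TRUE, "\n")

# Capture the printed text to compare the two functions
wl_out  <- capture.output(writeLines(fruits))
cat_out <- capture.output(cat(fruits, "\n"))
length(wl_out)   # one captured line per element
length(cat_out)  # everything on a single captured line
```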

1.1.3 Escape sequences

How did you get two lines from one string, and how did you get that little globe? The key is the \.

A sequence in a string that starts with a \ is called an escape sequence and allows us to include special characters in our strings. You saw one escape sequence in the first exercise: \" is used to denote a double quote.

In "hello\n\U1F30D" there are two escape sequences: \n gives a newline, and \U followed by up to 8 hex digits denotes a particular Unicode character.

Unicode is a standard for representing characters that might not be on your keyboard. Each available character has a Unicode code point: a number that uniquely identifies it. These code points are generally written in hex notation, that is, using base 16 and the digits 0-9 and A-F. You can find the code point for a particular character by looking it up in a code chart. If you only need four digits for the code point, an alternative escape sequence is \u.

When R comes across a \ it assumes you are starting an escape, so if you actually need a backslash in your string you’ll need the sequence \\.

# Should display: To have a \ you need \\
writeLines("To have a \\ you need \\\\")
To have a \ you need \\
# Should display: 
# This is a really 
# really really 
# long string
writeLines("This is a really\nreally really\nlong string")
This is a really
really really
long string
# Use writeLines() with 
# "\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e" - "Hello World" in Hindi!
writeLines("\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e") 
<U+0928><U+092E><U+0938><U+094D><U+0924><U+0947> <U+0926><U+0941><U+0928><U+093F><U+092F><U+093E>

1.2 Turning numbers into strings

1.2.1 Using format() with numbers

The behavior of format() can be pretty confusing, so you’ll spend most of this exercise exploring how it works.

Recall from the video, the scientific argument to format() controls whether the numbers are displayed in fixed (scientific = FALSE) or scientific (scientific = TRUE) format.

When the representation is scientific, the digits argument is the number of digits before the exponent. When the representation is fixed, digits controls the significant digits used for the smallest (in magnitude) number. Each other number will be formatted to match the number of decimal places in the smallest number. This means the number of decimal places you get in your output depends on all the values you are formatting!

For example, if the smallest number is 0.0011, and digits = 1, then 0.0011 requires 3 places after the decimal to represent it to 1 significant digit, 0.001. Every other number will be formatted to 3 places after the decimal point.

  • Format c(0.0011, 0.011, 1) with digits = 1. This is like the example described above.
  • Now, format c(1.0011, 2.011, 1) with digits = 1. Try to predict what you might get before you try it.
  • Format percent_change by choosing the digits argument so that the values are presented with one place after the decimal point.
  • Format income by choosing the digits argument so that the values are presented as whole numbers (i.e. no places after the decimal point).
  • Format p_values using a fixed representation.

# Some vectors of numbers
percent_change  <- c(4, -1.91, 3.00, -5.002)
income <-  c(72.19, 1030.18, 10291.93, 1189192.18)
p_values <- c(0.12, 0.98, 0.0000191, 0.00000000002)

# Format c(0.0011, 0.011, 1) with digits = 1
format(c(0.0011, 0.011, 1), digits = 1)
[1] "0.001" "0.011" "1.000"
# Format c(1.0011, 2.011, 1) with digits = 1
format(c(1.0011, 2.011, 1), digits = 1)
[1] "1" "2" "1"
# Format percent_change to one place after the decimal point
format(percent_change, digits = 2)
[1] " 4.0" "-1.9" " 3.0" "-5.0"
# Format income to whole numbers
format(income, digits = 2)
[1] "     72" "   1030" "  10292" "1189192"
# Format p_values in fixed format
format(p_values, scientific = FALSE)
[1] "0.12000000000" "0.98000000000" "0.00001910000" "0.00000000002"
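The exercise above only uses the fixed representation. As a small sketch of the scientific case, where digits counts the digits shown before the exponent, the following uses a single made-up value x that is not part of the exercise data.

```r
# A made-up value, used only for illustration
x <- 123456

# Scientific representation with three digits before the exponent
sci_3 <- format(x, scientific = TRUE, digits = 3)
sci_3

# Asking for more digits lengthens the part before the exponent
sci_5 <- format(x, scientific = TRUE, digits = 5)
sci_5
```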

1.2.2 Controlling other aspects of the string

Not only does format() control the way the number is represented, it also controls some of the properties of the resulting string that affect its display.

For example, by default format() will pad the start of the strings with spaces so that the decimal points line up, which is really useful if you are presenting the numbers in a vertical column. However, if you are putting the number in the middle of a sentence, you might not want these extra spaces. You can set trim = TRUE to remove them.

When numbers are long it can be helpful to “prettify” them, for example instead of 1000000000 display 1,000,000,000. In this case a , is added every 3 digits. This can be controlled by the big.interval and big.mark arguments, e.g. format(1000000000, big.mark = ",", big.interval = 3, scientific = FALSE). These arguments are actually passed on to prettyNum() so head there for any further details.

formatted_income <- format(income, digits = 2)

# Print formatted_income
formatted_income
[1] "     72" "   1030" "  10292" "1189192"
# Call writeLines() on the formatted income
writeLines(formatted_income)
     72
   1030
  10292
1189192
# Define trimmed_income
trimmed_income <- format(income, digits = 2, trim = TRUE)

# Call writeLines() on the trimmed_income
writeLines(trimmed_income)
72
1030
10292
1189192
# Define pretty_income
pretty_income <- format(income, digits = 2, big.mark = ",")

# Call writeLines() on the pretty_income
writeLines(pretty_income)
       72
    1,030
   10,292
1,189,192

1.2.3 formatC()

The function formatC() provides an alternative way to format numbers based on C style syntax.

Rather than a scientific argument, formatC() has a format argument that takes a code representing the required format. The most useful are:

  • "f" for fixed,
  • "e" for scientific, and
  • "g" for fixed unless scientific saves space.

With format = "g", the digits argument behaves much like it does in format(): it specifies the number of significant digits. With format = "e", it is the number of digits after the decimal point in the mantissa. Unlike format(), when using fixed format ("f"), digits is also the number of digits after the decimal point. This is more predictable than format(), because the number of places after the decimal is fixed regardless of the values being formatted.

formatC() also formats numbers individually, which means you always get the same output regardless of other numbers in the vector.

The flag argument allows you to provide some modifiers that, for example, force the display of the sign (flag = "+"), left align numbers (flag = "-") and pad numbers with leading zeros (flag = "0").

# From the format() exercise
x <- c(0.0011, 0.011, 1)
y <- c(1.0011, 2.011, 1)

# formatC() on x with format = "f", digits = 1
formatC(x, format = "f", digits = 1)
[1] "0.0" "0.0" "1.0"
# formatC() on y with format = "f", digits = 1
formatC(y, format = "f", digits = 1)
[1] "1.0" "2.0" "1.0"
# Format percent_change to one place after the decimal point
formatC(percent_change, format = "f", digits = 1)
[1] "4.0"  "-1.9" "3.0"  "-5.0"
# percent_change with flag = "+"
formatC(percent_change,  format = "f", digits = 1, flag = "+")
[1] "+4.0" "-1.9" "+3.0" "-5.0"
# Format p_values using format = "g" and digits = 2
formatC(p_values, format = "g", digits = 2)
[1] "0.12"    "0.98"    "1.9e-05" "2e-11"  
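The exercise only tries flag = "+". As a hedged sketch of the other two flags mentioned earlier, the code below also sets width, an additional formatC() argument not discussed in the text, because padding and alignment only become visible once the strings have a minimum width.

```r
# Same values as in the exercise
percent_change <- c(4, -1.91, 3.00, -5.002)

# Pad with leading zeros up to a total width of 6 characters
zero_padded <- formatC(percent_change, format = "f", digits = 1,
                       width = 6, flag = "0")
zero_padded

# Left-align within a width of 6 characters (padding goes on the right)
left_aligned <- formatC(percent_change, format = "f", digits = 1,
                        width = 6, flag = "-")
left_aligned
```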

1.3 Putting strings together

1.3.1 Annotation of numbers

To get a handle on using paste(), you are going to annotate some of your formatted number strings.

The key points to remember are:

  • The vectors you pass to paste() are pasted together element by element, using the sep argument to combine them.
  • If the vectors passed to paste() aren’t the same length, the shorter vectors are recycled up to the length of the longest one.
  • Only use collapse if you want a single string as output. collapse specifies the string to place between different elements.
years <- 2010:2013
pretty_percent <- formatC(percent_change,  format = "f", digits = 1, flag = "+")
# Add $ to pretty_income
paste("$", pretty_income, sep = "")
[1] "$       72" "$    1,030" "$   10,292" "$1,189,192"
 
# Add % to pretty_percent
paste(pretty_percent, "%", sep = "")
[1] "+4.0%" "-1.9%" "+3.0%" "-5.0%"
# Create vector with elements like 2010: +4.0%
year_percent <- paste(years, ": ", pretty_percent, "%", sep = "")

# Collapse all years into single string
paste(year_percent, collapse = ",")
[1] "2010: +4.0%,2011: -1.9%,2012: +3.0%,2013: -5.0%"

Specifying sep = "" is so common, there is actually another function paste0() that works like paste() but always pastes elements together without a separator between them.
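A short sketch of that equivalence, using the percent strings from the exercise written out by hand so the block runs on its own:

```r
# Values copied from the exercise output above
pretty_percent <- c("+4.0", "-1.9", "+3.0", "-5.0")

# paste() with an empty separator ...
with_sep <- paste(pretty_percent, "%", sep = "")

# ... gives the same result as paste0()
with_paste0 <- paste0(pretty_percent, "%")
identical(with_sep, with_paste0)

# paste0() still accepts collapse when a single string is wanted
one_string <- paste0(2010:2013, ": ", with_paste0, collapse = ", ")
one_string
```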

1.3.2 A very simple table

Combining format() and paste() is one way to display very simple tables. Remember, since format() looks at all the values in a vector before formatting, it uses a consistent format and will, by default, align on the decimal point. This is usually the behavior you want for a column of numbers in a table.

format() can also take character vectors as input. In this case, you can use the justify argument, specific to character input, to justify the text to the left, right, or center.

# Define the names vector
income_names <- c("Year 0", "Year 1", "Year 2", "Project Lifetime")

# Create pretty_income
pretty_income <- format(income, digits = 2, big.mark = ",")

# Create dollar_income
dollar_income <- paste("$", pretty_income, sep = "")

# Create formatted_names
formatted_names <- format(income_names, justify = "right")

# Create rows
rows <- paste(formatted_names, dollar_income, sep = "   ")

# Write rows
writeLines(rows)
          Year 0   $       72
          Year 1   $    1,030
          Year 2   $   10,292
Project Lifetime   $1,189,192

If you wanted the dollar signs right next to the numbers, you could format the incomes with trim = TRUE, paste on the $, then format again as a string with justify = "right".
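One way those three steps might look is sketched below. The variable names trimmed_income, dollar_income and aligned_income are chosen here for illustration, and income is redefined so the block runs on its own.

```r
# Income values from the earlier exercise
income <- c(72.19, 1030.18, 10291.93, 1189192.18)

# Step 1: format without leading padding
trimmed_income <- format(income, digits = 2, big.mark = ",", trim = TRUE)

# Step 2: paste the dollar sign directly onto each number
dollar_income <- paste0("$", trimmed_income)

# Step 3: format the resulting strings, right-justified to a common width
aligned_income <- format(dollar_income, justify = "right")

writeLines(aligned_income)
```

The padding now sits in front of the dollar sign rather than between the sign and the digits.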

1.3.3 Let’s order pizza!

As a final exercise in using paste() and to celebrate getting to the end of the first chapter, let’s order some pizza.

We’ve got a list of possible pizza toppings in toppings.

You are going to randomly select three toppings and then use paste() to put them together into a pizza order, which should result in a string like “I want to order a pizza with mushrooms, spinach, and pineapple.”

toppings <- c("anchovies", "artichoke", "bacon", "breakfast bacon", "Canadian bacon", 
"cheese", "chicken", "chili peppers", "feta", "garlic", "green peppers", 
"grilled onions", "ground beef", "ham", "hot sauce", "meatballs", 
"mushrooms", "olives", "onions", "pepperoni", "pineapple", "sausage", 
"spinach", "sun-dried tomato", "tomatoes")
# Randomly sample 3 toppings
my_toppings <- sample(toppings, size = 3)

# Print my_toppings
my_toppings
[1] "green peppers"  "anchovies"      "grilled onions"
# Paste "and " to last element: my_toppings_and
my_toppings_and <- paste(c("", "", "and "), my_toppings, sep = "")

# Collapse with comma space: these_toppings
these_toppings <- paste(my_toppings_and, collapse = ", ")

# Add rest of sentence: my_order
my_order <- paste("I want to order a pizza with ", these_toppings, ".", sep = "")

# Order pizza with writeLines()
writeLines(my_order)
I want to order a pizza with green peppers, anchovies, and grilled onions.

2 stringr

2.1 Introduction

2.1.1 Putting strings together with stringr

For your first stringr function, we’ll look at str_c() (the c is short for concatenate), a function that works like paste(). It takes vectors of strings as input along with sep and collapse arguments.

There are two key ways str_c() differs from paste(). First, the default separator is an empty string, sep = "", as opposed to a space, so it’s more like paste0(). This is an example of a stringr function performing a similar operation to a base function, but using a default that is more likely to be what you want. Remember, in your pizza order you had to set sep = "" multiple times.

The second way str_c() differs from paste() is in its handling of missing values. paste() turns missing values into the string “NA”, whereas str_c() propagates missing values. That means combining any strings with a missing value will result in another missing value.

library(stringr)

my_toppings <- c("cheese", NA, NA)
my_toppings_and <- paste(c("", "", "and "), my_toppings, sep = "")

# Print my_toppings_and
my_toppings_and
[1] "cheese" "NA"     "and NA"
# Use str_c() instead of paste(): my_toppings_str
my_toppings_str <- str_c(c("", "", "and "), my_toppings)

# Print my_toppings_str
my_toppings_str
[1] "cheese" NA       NA      
# paste() my_toppings_and with collapse = ", "
paste(my_toppings_and, collapse = ", ")
[1] "cheese, NA, and NA"
# str_c() my_toppings_str with collapse = ", "
str_c(my_toppings_str, collapse = ", ")
[1] NA

This behavior is nice because you learn quickly when you might have missing values, rather than discovering weird “NA”s inside your strings later. Another stringr function that is useful when you may have missing values is str_replace_na(), which replaces missing values with any string you choose.
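A minimal sketch of str_replace_na() on the toppings vector from this exercise. The replacement text "no topping" is just a placeholder chosen for illustration.

```r
library(stringr)

# Toppings vector with missing values, as in the exercise
my_toppings <- c("cheese", NA, NA)

# Replace the missing values with a string of your choice
filled_toppings <- str_replace_na(my_toppings, replacement = "no topping")
filled_toppings

# With the NAs replaced, str_c() no longer returns a missing value
toppings_line <- str_c(filled_toppings, collapse = ", ")
toppings_line
```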

2.1.2 String length

Our next stringr function is str_length(). str_length() takes a vector of strings as input and returns the number of characters in each string. For example, try finding the number of characters in Batman’s name:

str_length(c("Bruce", "Wayne"))
[1] 5 5

This is very similar to the base function nchar() but you’ll see in the exercises str_length() handles factors in an intuitive way, whereas nchar() will just return an error.

Historically, nchar() was even worse: rather than returning an error if you passed it a factor, it would return the number of characters in the numeric encoding of the factor. Thankfully this behavior has been fixed, but it was one of the original motivations behind str_length().
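To see the base R side of this for yourself, the sketch below wraps nchar() in tryCatch() so that the error, as raised by current versions of R, is captured as a message instead of stopping the script. The variable names are made up for this example.

```r
# A small factor, using the Batman example from above
name_factor <- factor(c("Bruce", "Wayne"))

# Capture the error message instead of halting
nchar_result <- tryCatch(
  nchar(name_factor),
  error = function(e) conditionMessage(e)
)
nchar_result

# Converting to character first is the usual base R workaround
name_lengths <- nchar(as.character(name_factor))
name_lengths
```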

Take your first look at babynames by asking if girls’ names are longer than boys’ names.

library(stringr)
library(babynames)
package 'babynames' was built under R version 4.0.3
library(dplyr)

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
# Extracting vectors for boys' and girls' names
babynames_2014 <- filter(babynames, year == 2014)
boy_names <- filter(babynames_2014, sex == "M")$name
girl_names <- filter(babynames_2014, sex == "F")$name

# Take a look at a few boy_names
head(boy_names)
[1] "Noah"    "Liam"    "Mason"   "Jacob"   "William" "Ethan"  
# Find the length of all boy_names
boy_length <- str_length(boy_names)

# Take a look at a few lengths
head(boy_length)
[1] 4 4 5 5 7 5
# Find the length of all girl_names
girl_length <- str_length(girl_names)

# Find the difference in mean length
mean(girl_length) - mean(boy_length)
[1] 0.3374758
# Confirm str_length() works with factors
head(str_length(factor(boy_names)))
[1] 4 4 5 5 7 5

2.1.3 Extracting substrings

The str_sub() function in stringr extracts parts of strings based on their location. As with all stringr functions, the first argument, string, is a vector of strings. The arguments start and end specify the boundaries of the piece to extract in characters.

For example, str_sub(x, 1, 4) asks for the substring starting at the first character, up to the fourth character, or in other words the first four characters. Try it with Batman’s name:

str_sub(c("Bruce", "Wayne"), 1, 4)
[1] "Bruc" "Wayn"

Both start and end can be negative integers, in which case, they count from the end of the string. For example, str_sub(x, -4, -1), asks for the substring starting at the fourth character from the end, up to the first character from the end, i.e. the last four characters. Again, try it with Batman:

str_sub(c("Bruce", "Wayne"), -4, -1)
[1] "ruce" "ayne"
# Extract first letter from boy_names
boy_first_letter <- str_sub(boy_names, 1, 1)

# Tabulate occurrences of boy_first_letter
table(boy_first_letter)
boy_first_letter
   A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T    U    V    W 
1454  651  770  998  549  185  334  403  235 1390 1291  537  914  424  207  230   56  778  806  771   43  160  174 
   X    Y    Z 
  56  252  379 
  
# Extract the last letter in boy_names, then tabulate
boy_last_letter <- str_sub(boy_names, -1, -1)
table(boy_last_letter)
boy_last_letter
   a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t    u    v    w 
 421  104   92  436 1148   66   82  583  705   57  349  945  389 4672  730   32   19 1011  826  292   81   71   34 
   x    y    z 
  86  697  119 
# Extract the first letter in girl_names, then tabulate
girl_first_letter <- str_sub(girl_names, 1, 1)
table(girl_first_letter)
girl_first_letter
   A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P    Q    R    S    T    U    V    W 
3101  699  946  810  933  209  345  469  373 1430 1694 1122 1746  752  143  303   38  831 1369  683   28  214   85 
   X    Y    Z 
  62  294  502 
# Extract the last letter in girl_names, then tabulate
girl_last_letter <- str_sub(girl_names, -1, -1)
table(girl_last_letter)
girl_last_letter
   a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t    u    v    w 
6632   20   13   81 3114    8   21 1942 1581   12   31  450  115 2608  105    3    2  291  326  208   59    6   17 
   x    y    z 
  50 1435   51 

Did you see that “A” is the most popular first letter for both boys and girls, and that “a” is the most popular last letter for girls? The most popular last letter for boys’ names, however, was “n”.

You might have seen substr(), a base R function that is similar to str_sub(). The big advantage of str_sub() is the ability to use negative indexes to count from the end of a string.
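For comparison, here is a sketch of what counting from the end looks like with base substr(), where the positions have to be computed from nchar() by hand. It is roughly the base R counterpart of str_sub(x, -4, -1) and str_sub(x, -1, -1).

```r
batman <- c("Bruce", "Wayne")

# substr() has no negative indexes, so compute the positions explicitly
n <- nchar(batman)

# Last four characters of each string
last_four <- substr(batman, n - 3, n)
last_four

# Last character of each string
last_letter <- substr(batman, n, n)
last_letter
```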

2.2 Hunting for matches

2.2.1 Detecting matches

str_detect() is used to answer the question: Does the string contain the pattern? It returns a logical vector of the same length as that of the input vector string, with TRUE for elements that contain the pattern and FALSE otherwise.

Let’s take a look at a simple example where you have a vector of strings that represent pizza orders:

pizzas <- c("cheese", "pepperoni", 
  "sausage and green peppers")

You can ask which orders contain the pattern “pepper”, with

str_detect(pizzas, 
  pattern = fixed("pepper"))
[1] FALSE  TRUE  TRUE

Notice how both pepperoni and green peppers contain the pattern of interest.

The output from str_detect() can be used to count the number of occurrences, or to subset out the strings that contain the pattern.

# Look for pattern "zz" in boy_names
contains_zz <- str_detect(boy_names, pattern = fixed("zz"))

# Examine str() of contains_zz
str(contains_zz)
 logi [1:14047] FALSE FALSE FALSE FALSE FALSE FALSE ...
# How many names contain "zz"?
sum(contains_zz)
[1] 16
# Which names contain "zz"?
boy_names[contains_zz]
 [1] "Uzziah"    "Ozzie"     "Ozzy"      "Jazz"      "Uzziel"    "Chazz"     "Izzy"      "Azzam"     "Izzac"    
[10] "Izzak"     "Fabrizzio" "Jazziel"   "Azzan"     "Izzaiah"   "Muizz"     "Yazziel"  
# Which rows in boy_df have names that contain "zz"?
boy_df <- data.frame(boy_names)
boy_df[contains_zz,]
 [1] "Uzziah"    "Ozzie"     "Ozzy"      "Jazz"      "Uzziel"    "Chazz"     "Izzy"      "Azzam"     "Izzac"    
[10] "Izzak"     "Fabrizzio" "Jazziel"   "Azzan"     "Izzaiah"   "Muizz"     "Yazziel"  

That last example shows another common use of str_detect(): subsetting a data frame to rows where the values in a column contain the pattern of interest. In this case it lets us see that these double-z names are pretty rare. For example, even the most popular, Uzziah, only accounted for 0.003% of boys born in 2014.
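Because boy_df has a single column, boy_df[contains_zz, ] drops down to a plain character vector. The sketch below uses a made-up two-column data frame (the orders data and its quantity values are purely illustrative) to show the more typical case, where whole rows come back.

```r
library(stringr)

# Made-up data frame; the quantities are illustrative only
orders <- data.frame(
  pizza = c("cheese", "pepperoni", "sausage and green peppers"),
  quantity = c(2, 1, 3),
  stringsAsFactors = FALSE
)

# Logical vector: which rows have "pepper" in the pizza column?
has_pepper <- str_detect(orders$pizza, pattern = fixed("pepper"))

# Keep only the matching rows, with all columns
pepper_orders <- orders[has_pepper, ]
pepper_orders
```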

2.2.2 Subsetting strings based on match

Since detecting strings with a pattern and then subsetting out those strings is such a common operation, stringr provides a function str_subset() that does that in one step.

For example, let’s repeat our search for “pepper” in our pizzas using str_subset():

pizzas <- c("cheese", "pepperoni", "sausage and green peppers")
str_subset(pizzas, pattern = fixed("pepper"))
[1] "pepperoni"                 "sausage and green peppers"

We get a new vector of strings, but it only contains those original strings that contained the pattern.

str_subset() can be easily confused with str_extract(). str_extract() returns a vector of the same length as that of the input vector, but with only the parts of the strings that matched the pattern.
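A quick side-by-side on the pizzas vector may help keep the two apart. This is only a sketch; the result names are chosen for this example.

```r
library(stringr)

pizzas <- c("cheese", "pepperoni", "sausage and green peppers")

# str_subset(): the whole strings that contain the pattern
subset_result <- str_subset(pizzas, pattern = fixed("pepper"))
subset_result

# str_extract(): one result per input string, holding only the
# matched part, or NA where there is no match
extract_result <- str_extract(pizzas, pattern = fixed("pepper"))
extract_result
```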

# Find boy_names that contain "zz"
str_subset(boy_names, pattern = fixed("zz"))
 [1] "Uzziah"    "Ozzie"     "Ozzy"      "Jazz"      "Uzziel"    "Chazz"     "Izzy"      "Azzam"     "Izzac"    
[10] "Izzak"     "Fabrizzio" "Jazziel"   "Azzan"     "Izzaiah"   "Muizz"     "Yazziel"  
# Find girl_names that contain "zz"
str_subset(girl_names, pattern = fixed("zz"))
 [1] "Izzabella"  "Jazzlyn"    "Jazzlynn"   "Lizzie"     "Izzy"       "Lizzy"      "Mazzy"      "Izzabelle" 
 [9] "Jazzmine"   "Jazzmyn"    "Jazzelle"   "Jazzmin"    "Izzah"      "Jazzalyn"   "Jazzmyne"   "Izzabell"  
[17] "Jazz"       "Mazzie"     "Alyzza"     "Izza"       "Izzie"      "Jazzlene"   "Lizzeth"    "Jazzalynn" 
[25] "Jazzy"      "Alizzon"    "Elizzabeth" "Jazzilyn"   "Jazzlynne"  "Jizzelle"   "Izzabel"    "Izzabellah"
[33] "Izzibella"  "Jazzabella" "Jazzabelle" "Jazzel"     "Jazzie"     "Jazzlin"    "Jazzlyne"   "Aizza"     
[41] "Brizza"     "Ezzah"      "Fizza"      "Izzybella"  "Rozzlyn"   
# Find girl_names that contain "U"
starts_U <- str_subset(girl_names, pattern = fixed("U"))
starts_U
 [1] "Unique"  "Uma"     "Unknown" "Una"     "Uriah"   "Ursula"  "Unity"   "Umaiza"  "Urvi"    "Ulyana"  "Ula"    
[12] "Udy"     "Urwa"    "Ulani"   "Umaima"  "Umme"    "Ugochi"  "Ulyssa"  "Umika"   "Uriyah"  "Ubah"    "Umaira" 
[23] "Umi"     "Ume"     "Urenna"  "Uriel"   "Urijah"  "Uyen"   
# Find girl_names that contain "U" and "z"
str_subset(starts_U, pattern = fixed("z"))
[1] "Umaiza"

2.2.3 Counting matches

Another stringr function that takes a vector of strings and a pattern is str_count(). str_count() answers the question “How many times does the pattern occur in each string?”. It always returns an integer vector of the same length as that of the input vector.

If you count the occurrences of “pepper” in your pizzas, you’ll find no occurrences in the first, and one each in the second and third:

pizzas <- c("cheese", "pepperoni", 
  "sausage and green peppers")
str_count(pizzas, pattern = fixed("pepper"))
[1] 0 1 1

Perhaps a little more interesting is to count how many “e”s occur in each order:

str_count(pizzas, pattern = fixed("e"))
[1] 3 2 5
# Count occurrences of "a" in girl_names
number_as <- str_count(girl_names, pattern = fixed("a"))

# Count occurrences of "A" in girl_names
number_As <- str_count(girl_names, pattern = fixed("A"))

# Histograms of number_as and number_As
hist(number_as)

hist(number_As)  


# Find total "a" + "A"
total_as <- number_as +   number_As

# girl_names with more than 4 a's
girl_names[total_as > 4]
[1] "Aaradhana"

2.3 Splitting strings

2.3.1 Parsing strings into variables

A common use for str_split() is to pull apart raw string data into more useful variables. In this exercise you’ll start by pulling apart a date range, something like “23.01.2017 - 29.01.2017”, into separate variables for the start of the range, “23.01.2017”, and the end of the range, “29.01.2017”.

Remember, if the simplify argument is FALSE (the default) you’ll get back a list of the same length as that of the input vector. More commonly, you’ll want to pull out the first piece (or second piece etc.) from every element, which is easier if you specify simplify = TRUE and get a matrix as output. You’ll explore both of these output types in this exercise.

# Some date data
date_ranges <- c("23.01.2017 - 29.01.2017", "30.01.2017 - 06.02.2017")

# Split dates using " - "
split_dates <- str_split(date_ranges, pattern = fixed(" - "))
split_dates
[[1]]
[1] "23.01.2017" "29.01.2017"

[[2]]
[1] "30.01.2017" "06.02.2017"
# Split dates with n and simplify specified
split_dates_n <- str_split(date_ranges, pattern = fixed(" - "), simplify = TRUE, n = 2)
split_dates_n
     [,1]         [,2]        
[1,] "23.01.2017" "29.01.2017"
[2,] "30.01.2017" "06.02.2017"
split_dates_n <- str_split(date_ranges, fixed(" - "), n = 2, simplify = TRUE)

# Subset split_dates_n into start_dates and end_dates
start_dates <- split_dates_n[,1]

# Split start_dates into day, month and year pieces
str_split(start_dates, fixed("."), simplify = TRUE)
     [,1] [,2] [,3]  
[1,] "23" "01" "2017"
[2,] "30" "01" "2017"
both_names <- c("Box, George", "Cox, David")

# Split both_names into first_names and last_names
both_names_split <- str_split(both_names, fixed(", "), n = 2, simplify = TRUE)

# Get first names
first_names <- both_names_split[, 2]

# Get last names
last_names <- both_names_split[, 1]

2.3.2 Some simple text statistics

Generally, specifying simplify = TRUE will give you output that is easier to work with, but you’ll always get n pieces (even if some are empty, "").
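As a small sketch of that padding, the made-up vector below contains one string without the separator, so its second piece comes back as an empty string.

```r
library(stringr)

# Hypothetical input: the second element has no " - " separator
ranges <- c("23.01.2017 - 29.01.2017", "30.01.2017")

# simplify = TRUE returns a character matrix with one row per input
pieces <- str_split(ranges, fixed(" - "), n = 2, simplify = TRUE)
pieces

# The missing second piece is filled with an empty string
pieces[2, 2] == ""
```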

Sometimes, you want to know how many pieces a string can be split into, or you want to do something with every piece before moving to a simpler structure. This is a situation where you don’t want to simplify and you’ll have to process the output with something like lapply().

As an example, you’ll be performing some simple text statistics on your lines from Alice’s Adventures in Wonderland from Chapter 1. Your goal will be to calculate how many words are in each line, and the average length of words in each line.

To do these calculations, you’ll need to split the lines into words. One way to break a sentence into words is to split on a single space, " ". This is a little naive because, for example, it wouldn’t pick up words separated by a newline escape sequence, as in "two\nwords", but since this situation doesn’t occur in your lines, it will do.

# Split lines into words
words <- str_split(lines, fixed(" "), simplify = FALSE)

# Number of words per line
lapply(words, length)
[[1]]
[1] 18

[[2]]
[1] 12

[[3]]
[1] 21
  
# Number of characters in each word
word_lengths <- lapply(words, str_length)
  
# Average word length per line
lapply(word_lengths, mean)
[[1]]
[1] 3.888889

[[2]]
[1] 4.25

[[3]]
[1] 4.380952

The word lengths aren’t quite right because you were including some punctuation symbols. One way to deal with that is to replace them first with str_replace().
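One possible version of that clean-up is sketched below for line2. It makes two assumptions that go beyond the text so far: it uses str_replace_all() rather than str_replace(), so that every punctuation mark is removed and not only the first, and it uses the pattern "[[:punct:]]", a regular expression class for punctuation.

```r
library(stringr)

# line2 from the first chapter
line2 <- '"No room! No room!" they cried out when they saw Alice coming.'

# Remove every punctuation character (assumption: regex class, not fixed())
clean_line <- str_replace_all(line2, "[[:punct:]]", "")
clean_line

# Split into words and recompute the statistics
clean_words <- str_split(clean_line, fixed(" "))[[1]]
length(clean_words)
mean(str_length(clean_words))
```

The word count is unchanged, but the average word length drops because quotes, exclamation marks and the final period no longer count as characters.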

2.4 Replacing matches in strings

2.4.1 Replacing to tidy strings

You’ve seen one common strategy to pull variables out of strings is to split the string based on a pattern. Sometimes, it’s easier to just replace the parts you don’t want with an empty string "". This is also a common strategy to clean strings up, for example, to remove unwanted punctuation or white space.

In this exercise you’ll pull out some numbers by replacing the part of the string that isn’t a number. You’ll also play with the format of some phone numbers. Pay close attention to the difference between str_replace() and str_replace_all().

# Some IDs
ids <- c("ID#: 192", "ID#: 118", "ID#: 001")

# Replace "ID#: " with ""
id_nums <- str_replace(ids, "ID#: ", "")

# Turn id_nums into numbers
id_ints <- as.numeric(id_nums)
# Some (fake) phone numbers
phone_numbers <- c("510-555-0123", "541-555-0167")

# Use str_replace() to replace "-" with " "
str_replace(phone_numbers, "-", " ")
[1] "510 555-0123" "541 555-0167"
# Use str_replace_all() to replace "-" with " "
str_replace_all(phone_numbers, "-", " ")
[1] "510 555 0123" "541 555 0167"
# Turn phone numbers into the format xxx.xxx.xxxx
str_replace_all(phone_numbers, "-", ".")
[1] "510.555.0123" "541.555.0167"

2.4.2 Review

You’ve covered a lot of stringr functions in this chapter:

  • str_c()
  • str_length()
  • str_sub()
  • str_detect()
  • str_subset()
  • str_count()
  • str_split()
  • str_replace()

As a review, we’ve got a few tasks for you to do with some DNA sequences. We’ve put three sequences, corresponding to three genes, from the genome of Yersinia pestis – the bacterium that causes bubonic plague – into the vector genes.

Each string represents a gene, each character a particular nucleotide: Adenine, Cytosine, Guanine or Thymine.

genes <- readRDS("dna.rds")
# Find the number of nucleotides in each sequence
str_length(genes)
[1] 441 462 993
# Find the number of A's that occur in each sequence
str_count(genes, fixed("A"))
[1] 118 117 267
# Return the sequences that contain "TTTTTT"
str_subset(genes, fixed("TTTTTT"))
[1] "TTAAGGAACGATCGTACGCATGATAGGGTTTTGCAGTGATATTAGTGTCTCGGTTGACTGGATCTCATCAATAGTCTGGATTTTGTTGATAAGTACCTGCTGCAATGCATCAATGGATTTACACATCACTTTAATAAATATGCTGTAGTGGCCAGTGGTGTAATAGGCCTCAACCACTTCTTCTAAGCTTTCCAATTTTTTCAAGGCGGAAGGGTAATCTTTGGCACTTTTCAAGATTATGCCAATAAAGCAGCAAACGTCGTAACCCAGTTGTTTTGGGTTAACGTGTACACAAGCTGCGGTAATGATCCCTGCTTGCCGCATCTTTTCTACTCTTACATGAATAGTTCCGGGGCTAACAGCGAGGTTTTTGGCTAATTCAGCATAGGGTGTGCGTGCATTTTCCATTAATGCTTTCAGGATGCTGCGATCGAGATTATCGATCTGATAAATTTCACTCAT"
# Replace all the "A"s in the sequences with a "_"
str_replace_all(genes, pattern = fixed("A"), replacement = "_")
[1] "TT_G_GT___TT__TCC__TCTTTG_CCC___TCTCTGCTGG_TCCTCTGGT_TTTC_TGTTGG_TG_CGTC__TTTCT__T_TTTC_CCC__CCGTTG_GC_CCTTGTGCG_TC__TTGTTG_TCC_GTTTT_TG_TTGC_CCGC_G___GTGTC_T_TTCTG_GCTGCCT___CC__CCGCCCC___GCGT_CTTGGG_T___TC_GGCTTTTGTTGTTCG_TCTGTTCT__T__TGGCTGC__GTT_TC_GGT_G_TCCCCGGC_CC_TG_GTGG_TGTC_CG_TT__CC_C_GGCC_TTC_GCGT__GTTCGTCC__CTCTGGGCC_TG__GT_TTTCTGT_G____CCC_GCTTCTTCT__TTT_TCCGCT___TGTTC_GC__C_T_TTC_GC_CT_CC__GCGT_CTGCC_CTT_TC__CGTT_TGTC_GCC_T"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
[2] "TT__GG__CG_TCGT_CGC_TG_T_GGGTTTTGC_GTG_T_TT_GTGTCTCGGTTG_CTGG_TCTC_TC__T_GTCTGG_TTTTGTTG_T__GT_CCTGCTGC__TGC_TC__TGG_TTT_C_C_TC_CTTT__T___T_TGCTGT_GTGGCC_GTGGTGT__T_GGCCTC__CC_CTTCTTCT__GCTTTCC__TTTTTTC__GGCGG__GGGT__TCTTTGGC_CTTTTC__G_TT_TGCC__T___GC_GC___CGTCGT__CCC_GTTGTTTTGGGTT__CGTGT_C_C__GCTGCGGT__TG_TCCCTGCTTGCCGC_TCTTTTCT_CTCTT_C_TG__T_GTTCCGGGGCT__C_GCG_GGTTTTTGGCT__TTC_GC_T_GGGTGTGCGTGC_TTTTCC_TT__TGCTTTC_GG_TGCTGCG_TCG_G_TT_TCG_TCTG_T___TTTC_CTC_T"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
[3] "_TG______C__TTT_TCC_____C__C__C___TC_GCTTCGT____TC_TTCTTTTCCCGCC__TT_G_GC__C__CTTGGCTTG_TCG__GTCC_GGCTCCT_TTTTG_GCCGTGTGGGTG_TGG__CCC__G_T__CCTTTCTGGTTCTG_G___GCGGT_C_GGT____GTT__GTC_TTGCCGG_TTC__CTTTTG__GTTGT_C_TTC_TT_GCG__GTGG___CGT____CCTT_GGGCGTTTTG_TTTTGGTGCTG_CC__GGGGTGT_T_CCC_T_TG___GC_TTGCGCCC_G_TG__G_TCGCCTG_GTGCT_TTC_TTCTGT_T_TGT_G_TC_GTGGG_TTGGG__CGGGTT_TGGGGG_CGGTG__CGT__CCTGGCTT_CCTG___TCG_CTGTT__C__G_TTT_TGC_GCG_TT___G___CTG__GCGGCG_TC_GTGCTG_GTTTGGTGTG__GCCTTTCCTGCCGG_TC_T_TTC_GTTT_TCC_C_GTG___GCCTGCGGGCC_G_TTCCCTG_TTT_G_TGCT___GGCCGTG__CGTGC__TTGCC___G_GTT_GGTGCTGTCTTCCTT_T_GGG_TTGGTGGC___TTGGC_G_TGGTC__TCCC_TG_TGTTCGTGCGCC_G_TT_TG_TG_TTGG_CCTCTCCG_GTGCGG__GGTTTCTCTGG_TT___CGGCG_C_TT_TTGTCTGG__CCC__T_TTGG__G_TGCCTTTG_G_T_TCTTCT_TGGG__TTCGTGTTG_TGCCG__GCTCTT__GCGTC_GTT_GCCCTG_CTGGCG_TG__G_CCGCTTGG__CTGG__TGGC_TC__TC_CTGTTGCGCGGTG___TGCC_C___CT_TCGGGGG_GGT_TTGGTC_GTCCCGCTT_GTG_TGTT_TTGCTGC_G___C__C_T_TTGGTC_GGTGC__TGTGGTGTTTGGGGCCCTG___TC_GCG_G___GTTG_TGGCCTGCTGT__"

2.4.2 Final challenges

By combining multiple operations together in sequence you can achieve quite complicated manipulations.

As the final exercise we want to expose you to the power of combining operations. You’ll complete two tasks:

  1. You’ll turn a vector of full names, like “Bruce Wayne”, into abbreviated names like “B. Wayne”. This requires combining str_split(), str_sub() and str_c().
  2. You’ll compare how many boy names end in “ee” compared to girl names. This requires combining str_sub() with str_detect() along with the base function table().
# Define some full names
names <- c("Diana Prince", "Clark Kent")

# Split into first and last names
names_split <- str_split(names, fixed(" "), simplify = TRUE)

# Extract the first letter in the first name
abb_first <- str_sub(names_split[,1], 1, 1)

# Combine the first letter ". " and last name
str_c(abb_first, ". ", names_split[,2])
[1] "D. Prince" "C. Kent"  
# Use all names in babynames_2014
all_names <- babynames_2014$name

# Get the last two letters of all_names
last_two_letters <- str_sub(all_names, -2, -1)
# Does the name end in "ee"?
ends_in_ee <- str_detect(last_two_letters, fixed("ee"))

# Extract rows and "sex" column
sex <- babynames_2014$sex[ends_in_ee]

# Display result as a table
table(sex)
sex
  F   M 
572  84 

3 Pattern matching with regular expressions

3.1 Regular expressions

3.1.1 Matching the start or end of the string

rebus provides START and END shortcuts to specify regular expressions that match the start and end of the string. These are also known as anchors. You can try it out just by typing

START

You’ll see the output <regex> ^. The <regex> denotes that this is a special regex object, and it has the value ^. ^ is the character used in the regular expression language to denote the start of a string.

The special operator provided by rebus, %R% allows you to compose complicated regular expressions from simple pieces. When you are reading rebus code, think of %R% as “then”. For example, you could combine START with c,

START %R% "c"

to match the pattern “the start of string then a c”, or in other words: strings that start with c. In rebus, if you want to match a specific character, or a specific sequence of characters, you simply specify them as a string, i.e. surround them with ".

library(rebus)
package ‘rebus’ was built under R version 4.0.3
Attaching package: ‘rebus’

The following object is masked from ‘package:stringr’:

    regex
# Some strings to practice with
x <- c("cat", "coat", "scotland", "tic toc")

# Print END
END
<regex> $
# Run me
str_view(x, pattern = START %R% "c")
# Match the strings that start with "co" 
str_view(x, pattern = START %R% "co")
# Match the strings that end with "at"
str_view(x, pattern = "at" %R% END)

# Match the string that is exactly "cat"
str_view(x, pattern = START %R% "cat" %R% END)

For that last example, rebus also provides the function exactly(x), which is a shortcut for START %R% x %R% END and matches only if the string is exactly x.
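
To make the equivalence concrete, here is a minimal sketch that compares the hand-written anchors with the exactly() shortcut. It reuses the practice vector x from above and switches to str_detect() so the result prints as plain text; the variable names are illustrative.

```r
library(stringr)
library(rebus)

x <- c("cat", "coat", "scotland", "tic toc")

# Anchors written out by hand
by_hand <- START %R% "cat" %R% END

# The same idea using the exactly() shortcut
shortcut <- exactly("cat")

# Both patterns should flag only the string that is exactly "cat"
detect_by_hand <- str_detect(x, by_hand)
detect_shortcut <- str_detect(x, shortcut)
detect_by_hand
detect_shortcut
```

Both calls should return TRUE for "cat" and FALSE for the other three strings.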

3.1.2 Matching any character

In a regular expression you can use a wildcard to match a single character, no matter what the character is. In rebus it is specified with ANY_CHAR. Try typing ANY_CHAR in the console. You should see that in the regular expression language this is specified by a dot, ..

For example, "c" %R% ANY_CHAR %R% "t" will look for patterns like “c_t” where the blank can be any character. Consider the strings: “cat”, “coat”, “scotland” and “tic toc”. Where would the matches to "c" %R% ANY_CHAR %R% "t" be?

Test your intuition by running:

str_view(c("cat", "coat", "scotland", "tic toc"), 
  pattern = "c" %R% ANY_CHAR %R% "t")

Notice that ANY_CHAR will match a space character (c t in tic toc). It will also match numbers or punctuation symbols, but ANY_CHAR will only ever match one character, which is why we get no match in coat.

# Match two characters, where the second is a "t"
str_view(x, pattern = ANY_CHAR %R% "t")


# Match a "t" followed by any character
str_view(x, pattern = "t" %R% ANY_CHAR)


# Match two characters
str_view(x, pattern = ANY_CHAR %R% ANY_CHAR)


# Match a string with exactly three characters
str_view(x, pattern = START %R% ANY_CHAR %R% ANY_CHAR %R% ANY_CHAR %R% END)

3.1.3 Combining with stringr functions

You can pass a regular expression as the pattern argument to any stringr function that has the pattern argument. You can use str_detect() to get a logical vector for whether there was a match, str_subset() to return just the strings with matches, and str_count() to count the number of matches in each string.

As a reminder, compare the output of those three functions with our “c_t” pattern from the previous exercise:

x <- c("cat", "coat", "scotland", "tic toc")
pattern <- "c" %R% ANY_CHAR %R% "t"
str_detect(x, pattern)
[1]  TRUE FALSE  TRUE  TRUE
str_subset(x, pattern)
[1] "cat"      "scotland" "tic toc" 
str_count(x, pattern)
[1] 1 0 1 1

It now also makes sense to add str_extract() to your repertoire. It returns just the part of the string that matched the pattern:

str_extract(x, pattern)
[1] "cat" NA    "cot" "c t"

You’ll combine your regular expression skills with stringr to ask how often a q is followed by any character in boy names.

It’s always a good idea to test your pattern, so this pattern is shown matched with four names. The first two shouldn’t have matches (can you explain why?) but the last two should.
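
A quick test along those lines might look like the sketch below. The four names are hypothetical stand-ins chosen for illustration, and str_extract() is used so the result prints as text.

```r
library(stringr)
library(rebus)

# Hypothetical test names, chosen for illustration only
test_names <- c("Quentin", "Kaliq", "Jacques", "Jacqes")

# A lower case q followed by any single character
q_then_any <- "q" %R% ANY_CHAR

# NA means the pattern did not match that name
q_matches <- str_extract(test_names, q_then_any)
q_matches
```

With these names, the first two come back as NA: "Quentin" only has an upper case Q, and in "Kaliq" nothing follows the q. The last two return "qu" and "qe".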

pattern <- "q" %R% ANY_CHAR

# Find names that have the pattern
names_with_q <- str_subset(boy_names, pattern)

# How many names were there?
length(names_with_q)
[1] 96
# Find part of name that matches pattern
part_with_q <- str_extract(boy_names, pattern)

# Get a table of counts
table(part_with_q)
part_with_q
qa qe qi qm qo qu 
 1  1  2  2  1 89 
# Did any names have the pattern more than once?
count_of_q <- str_count(boy_names, pattern)

# Get a table of counts
table(count_of_q)
count_of_q
    0     1 
13951    96 
# Which names have the pattern?
with_q <- str_detect(boy_names, pattern)

# What fraction of names have the pattern?
mean(with_q)
[1] 0.006834199

3.2 More regular expressions

3.2.1 Alternation

The rebus function or() allows us to specify a set of alternatives, which may be single characters or character strings, to be matched. Each alternative is passed as a separate argument.

For example, or("grey", "gray") allows us to detect either the British or American spelling:

x <- c("grey sky", "gray elephant")
str_view(x, pattern = or("grey", "gray"))

Since these two words only differ by one character you could equivalently specify this match with "gr" %R% or("e", "a") %R% "y", that is “a gr followed by an e or an a, then a y”.
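
As a small check of that equivalence, the sketch below runs both patterns through str_detect(). The third string, "green grass", is an added illustrative case that neither pattern should match.

```r
library(stringr)
library(rebus)

# "green grass" is an extra, illustrative non-match
x <- c("grey sky", "gray elephant", "green grass")

# Alternatives given as whole words
whole_words <- or("grey", "gray")

# Alternatives given only for the character that differs
shared_parts <- "gr" %R% or("e", "a") %R% "y"

detect_whole <- str_detect(x, whole_words)
detect_parts <- str_detect(x, shared_parts)
detect_whole
detect_parts
```

Both vectors should be TRUE, TRUE, FALSE.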

# Match Jeffrey or Geoffrey
whole_names <- or("Jeffrey", "Geoffrey")
str_view(boy_names, pattern = whole_names, match = TRUE)

# Match Jeffrey or Geoffrey, another way
common_ending <- or("Je", "Geo") %R% "ffrey"
str_view(boy_names, pattern = common_ending, match = TRUE)

# Match with alternate endings
by_parts <- or("Je", "Geo") %R% "ff" %R% or("ry", "ery", "rey", "erey")
str_view(boy_names, pattern = by_parts, match = TRUE)

# Match names that start with Cath or Kath
ckath <- START %R% or("C", "K") %R% "ath"
str_view(girl_names, pattern = ckath, match = TRUE)

3.2.2 Character classes

In regular expressions a character class is a way of specifying “match one (and only one) of the following characters”. In rebus you can specify the set of allowable characters using the function char_class().

This is another way you could specify an alternate spelling, for example, specifying “a gr followed by, either an a or e, followed by a y”:

x <- c("grey sky", "gray elephant")
str_view(x, pattern = "gr" %R% char_class("ae") %R% "y")

A negated character class matches “any single character that isn’t one of the following”, and in rebus is specified with negated_char_class().

Unlike in other places in a regular expression you don’t need to escape characters that might otherwise have a special meaning inside character classes. If you want to match . you can include . directly, e.g. char_class("."). Matching a - is a bit trickier. If you need to do it, just make sure it comes first in the character class.
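
Here is a minimal sketch of both points, using a made-up test string. It counts matches with str_count() to show that . inside a character class is a literal dot rather than a wildcard, and that a leading - is treated as a literal dash.

```r
library(stringr)
library(rebus)

# Made-up test string containing one dot and one dash
s <- "a.b-c"

# Inside a character class the dot is literal, not "any character"
dot_only <- char_class(".")

# The dash comes first so it is read as a literal dash
dash_or_dot <- char_class("-.")

n_dot <- str_count(s, dot_only)
n_dash_or_dot <- str_count(s, dash_or_dot)
n_dot
n_dash_or_dot
```

If the dot were acting as a wildcard, the first count would be 5 rather than 1.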

# Create character class containing vowels
vowels <- char_class("aeiouAEIOU")

# Print vowels
vowels
<regex> [aeiouAEIOU]
# See vowels in x with str_view()
str_view(x, vowels)

# See vowels in x with str_view_all()
str_view_all(x, vowels)

# Number of vowels in boy_names
num_vowels <- str_count(boy_names, vowels)

# Number of characters in boy_names
name_length <- str_length(boy_names)
# Calc mean number of vowels
mean(num_vowels)
[1] 2.385563
# Calc mean fraction of vowels per name
mean(num_vowels / name_length)
[1] 0.4000596

The names in boy_names are on average about 40% vowels.

3.2.3 Repetition

The rebus functions one_or_more(), zero_or_more() and optional() can be used to wrap parts of a regular expression to allow a pattern to match a variable number of times.

Take our vowels pattern from the last exercise. You can pass it to one_or_more() to create the pattern that matches “one or more vowels”. Take a look with these interjections:

x <- c("ow", "ooh", "yeeeah!", "shh")
str_view(x, pattern = one_or_more(vowels))

You’ll see we can match the single o in ow, the double o in ooh and the string of es followed by the a in yeeeah, but nothing in shh because there isn’t a single vowel.

In contrast zero_or_more() will match even if there isn’t an occurrence, try

str_view(x, pattern = zero_or_more(vowels))

Since both yeeeah and shh start without a vowel, they match “zero vowels” right at the start of the string. A regular expression engine returns the first match it finds, so it looks no further and reports that empty match at the start of the string.
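
To see the difference as text rather than in the viewer, you could extract the matches instead. This sketch reuses the interjections and the vowels pattern from above; only the result names are new.

```r
library(stringr)
library(rebus)

vowels <- char_class("aeiouAEIOU")
x <- c("ow", "ooh", "yeeeah!", "shh")

# One or more vowels: no match at all in "shh"
at_least_one <- str_extract(x, one_or_more(vowels))

# Zero or more vowels: an empty match at the start of "yeeeah!" and "shh"
maybe_none <- str_extract(x, zero_or_more(vowels))

at_least_one
maybe_none
```

Note the difference between NA (no match) in the first result and "" (an empty match) in the second.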

# Vowels from last exercise
vowels <- char_class("aeiouAEIOU")

# See names with only vowels
str_view(boy_names, 
  pattern = exactly(one_or_more(vowels)), 
  match = TRUE)

# Use `negated_char_class()` for everything but vowels
not_vowels <- negated_char_class("aeiouAEIOU")

# See names with no vowels
str_view(boy_names, 
  pattern = exactly(one_or_more(not_vowels)), 
  match = TRUE)

3.3 Shortcuts

3.3.1 Hunting for phone numbers

For your first task you are going to pull out the phone numbers from this vector of contact information:

contact <- c("Call me at 555-555-0191", "123 Main St", "(555) 555 0191", 
"Phone: 555.555.0191 Mobile: 555.555.0192")

You’ll assume the phone numbers you are looking for follow the American standard of a three digit area code, a three digit exchange and then a four digit number, but each part could be separated by spaces or various punctuation symbols.

# Create a three digit pattern
three_digits <- DGT %R% DGT %R% DGT 

# Test it
str_view_all(contact, pattern = three_digits)

# Create a separator pattern
separator <- char_class("-.() ")

# Test it
str_view_all(contact, pattern = separator)

# Use these components
three_digits <- DGT %R% DGT %R% DGT
four_digits <- three_digits %R% DGT
separator <- char_class("-.() ")

# Create phone pattern
phone_pattern <- optional(OPEN_PAREN) %R%
  three_digits %R%
  zero_or_more(separator) %R%
  three_digits %R% 
  zero_or_more(separator) %R%
  four_digits
        
# Test it           
str_view_all(contact, pattern = phone_pattern)

# Extract phone numbers
str_extract(contact, phone_pattern)
[1] "555-555-0191"   NA               "(555) 555 0191" "555.555.0191"  
# Extract ALL phone numbers
str_extract_all(contact, phone_pattern)
[[1]]
[1] "555-555-0191"

[[2]]
character(0)

[[3]]
[1] "(555) 555 0191"

[[4]]
[1] "555.555.0191" "555.555.0192"

rebus also provides the function number_range() for when you want to extract numbers in a certain range, and datetime() to specify common date patterns.

3.3.2 Extracting age and gender from accident narratives

Recall from the video, you want to parse out age and gender from accident narratives. For example, this narrative

19YOM-SHOULDER STRAIN-WAS TACKLED WHILE PLAYING FOOTBALL W/ FRIENDS 

describes a male of age 19, and this one

TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*

a female of age 33.

You are generally looking for a pattern with a number, something to indicate the units, e.g. YO or YR for years old, or MO for months old, and a character that identifies the gender.

In this exercise you’ll build up a pattern to pull out the part of the narrative that has the age and gender information. Then, in the next exercise you’ll parse out the age and gender into separate variables.

narratives <- readRDS("narratives.rds")
# Use these patterns
age <- DGT %R% optional(DGT)
unit <- optional(SPC) %R% or("YO", "YR", "MO")
gender <- optional(SPC) %R% or("M", "F")

# Extract age, unit, gender
str_extract(narratives, pattern = age %R% unit %R% gender)
 [1] "19YOM"   "31 YOF"  "82 YOM"  "33 YOF"  "10YOM"   "53 YO F" "13 MOF"  "14YR M"  "55YOM"   "5 YOM"  

You can also use dgt(1, 2) to match one or two digits.
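
As a rough comparison, the sketch below extracts an age with both the two-step pattern used above and the dgt(1, 2) shortcut. The sample strings are shortened, illustrative versions of the narratives, plus one made-up string with no digits.

```r
library(stringr)
library(rebus)

# Illustrative samples; "NO AGE" is a made-up string without digits
samples <- c("19YOM", "5 YOM", "NO AGE")

# One digit followed by an optional second digit
two_step <- DGT %R% optional(DGT)

# The shortcut: between one and two digits
shortcut <- dgt(1, 2)

age_two_step <- str_extract(samples, two_step)
age_shortcut <- str_extract(samples, shortcut)
age_two_step
age_shortcut
```

Both approaches should pull out "19" and "5", and return NA for the string without digits.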

3.3.3 Parsing age and gender into pieces

To finish up, you need to pull out the individual pieces and tidy them into usable variables.

There are a few ways you could get at one piece: you could extract out the piece you need, you could replace everything that isn’t the piece you need with "", or you could try to split into the pieces you need. You’ll try a few of these in this exercise and you’ll see yet another way in the next chapter. For the second option, stringr has a nice convenience function, str_remove(), that works like str_replace() with replacement = "".
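
The relationship between str_remove() and str_replace() can be sketched as follows. The three input strings are taken from the extracted output shown earlier, and the age and unit patterns are redefined so the snippet stands alone.

```r
library(stringr)
library(rebus)

# Three values taken from the earlier extraction output
age_gender <- c("19YOM", "53 YO F", "13 MOF")

age <- DGT %R% optional(DGT)
unit <- optional(SPC) %R% or("YO", "YR", "MO")

# Removing a match ...
removed <- str_remove(age_gender, age %R% unit)

# ... is the same as replacing it with an empty string
replaced <- str_replace(age_gender, age %R% unit, "")

removed
identical(removed, replaced)
```

Notice that the second result still carries a leading space (" F"), which is why a further clean-up step is useful.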

One benefit of building up your pattern in pieces is you already have patterns for each part that you can reuse now.

age_gender <- str_extract(narratives, pattern = age %R% unit %R% gender)
# age_gender, age, gender, unit are pre-defined
ls.str()
abb_first :  chr [1:2] "D" "C"
age :  'regex' chr "\\d[\\d]?"
age_gender :  chr [1:10] "19YOM" "31 YOF" "82 YOM" "33 YOF" "10YOM" "53 YO F" "13 MOF" "14YR M" "55YOM" "5 YOM"
all_names :  chr [1:33228] "Emma" "Olivia" "Sophia" "Isabella" "Ava" "Mia" "Emily" "Abigail" "Madison" "Charlotte" "Harper" ...
babynames_2014 : tibble [33,228 x 5] (S3: tbl_df/tbl/data.frame)
both_names :  chr [1:2] "Box, George" "Cox, David"
both_names_split :  chr [1:2, 1:2] "Box" "Cox" "George" "David"
boy_df : 'data.frame':  14047 obs. of  1 variable:
 $ boy_names: chr  "Noah" "Liam" "Mason" "Jacob" ...
boy_first_letter :  chr [1:14047] "N" "L" "M" "J" "W" "E" "M" "A" "J" "D" "E" "B" "L" "A" "J" "M" "J" "D" "L" "J" "A" "A" "S" "G" ...
boy_last_letter :  chr [1:14047] "h" "m" "n" "b" "m" "n" "l" "r" "s" "l" "h" "n" "n" "n" "n" "w" "n" "d" "s" "h" "y" "w" "l" "l" ...
boy_length :  int [1:14047] 4 4 5 5 7 5 7 9 5 6 ...
boy_names :  chr [1:14047] "Noah" "Liam" "Mason" "Jacob" "William" "Ethan" "Michael" "Alexander" "James" "Daniel" "Elijah" ...
by_parts :  'regex' chr "(?:Je|Geo)ff(?:ry|ery|rey|erey)"
ckath :  'regex' chr "^(?:C|K)ath"
common_ending :  'regex' chr "(?:Je|Geo)ffrey"
contact :  chr [1:4] "Call me at 555-555-0191" "123 Main St" "(555) 555 0191" "Phone: 555.555.0191 Mobile: 555.555.0192"
contains_zz :  logi [1:14047] FALSE FALSE FALSE FALSE FALSE FALSE ...
count_of_q :  int [1:14047] 0 0 0 0 0 0 0 0 0 0 ...
date_ranges :  chr [1:2] "23.01.2017 - 29.01.2017" "30.01.2017 - 06.02.2017"
dollar_income :  chr [1:4] "$       72" "$    1,030" "$   10,292" "$1,189,192"
ends_in_ee :  logi [1:33228] FALSE FALSE FALSE FALSE FALSE FALSE ...
first_names :  chr [1:2] "George" "David"
formatted_income :  chr [1:4] "     72" "   1030" "  10292" "1189192"
formatted_names :  chr [1:4] "          Year 0" "          Year 1" "          Year 2" "Project Lifetime"
four_digits :  'regex' chr "\\d\\d\\d\\d"
gender :  'regex' chr "[\\s]?(?:M|F)"
genes :  Named chr [1:3] "TTAGAGTAAATTAATCCAATCTTTGACCCAAATCTCTGCTGGATCCTCTGGTATTTCATGTTGGATGACGTCAATTTCTAATATTTCACCCAACCGTTGAGCACCTTGTGC"| __truncated__ ...
girl_first_letter :  chr [1:19181] "E" "O" "S" "I" "A" "M" "E" "A" "M" "C" "H" "S" "A" "E" "A" "E" "E" "C" "V" "G" "A" "Z" "N" "A" ...
girl_last_letter :  chr [1:19181] "a" "a" "a" "a" "a" "a" "y" "l" "n" "e" "r" "a" "y" "h" "a" "n" "a" "e" "a" "e" "y" "y" "e" "n" ...
girl_length :  int [1:19181] 4 6 6 8 3 3 5 7 7 9 ...
girl_names :  chr [1:19181] "Emma" "Olivia" "Sophia" "Isabella" "Ava" "Mia" "Emily" "Abigail" "Madison" "Charlotte" "Harper" ...
id_ints :  num [1:3] 192 118 1
id_nums :  chr [1:3] "192" "118" "001"
ids :  chr [1:3] "ID#: 192" "ID#: 118" "ID#: 001"
income :  num [1:4] 7.22e+01 1.03e+03 1.03e+04 1.19e+06
income_names :  chr [1:4] "Year 0" "Year 1" "Year 2" "Project Lifetime"
last_names :  chr [1:2] "Box" "Cox"
last_two_letters :  chr [1:33228] "ma" "ia" "ia" "la" "va" "ia" "ly" "il" "on" "te" "er" "ia" "ry" "th" "ia" "yn" "la" "oe" "ia" ...
line1 :  chr "The table was a large one, but the three were all crowded together at one corner of it:"
line2 :  chr "\"No room! No room!\" they cried out when they saw Alice coming."
line3 :  chr "\"There's plenty of room!\" said Alice indignantly, and she sat down in a large arm-chair at one end of the table."
lines :  chr [1:3] "The table was a large one, but the three were all crowded together at one corner of it:" ...
my_order :  chr "I want to order a pizza with green peppers, anchovies, and grilled onions."
my_toppings :  chr [1:3] "cheese" NA NA
my_toppings_and :  chr [1:3] "cheese" "NA" "and NA"
my_toppings_str :  chr [1:3] "cheese" NA NA
name_length :  int [1:14047] 4 4 5 5 7 5 7 9 5 6 ...
names :  chr [1:2] "Diana Prince" "Clark Kent"
names_split :  chr [1:2, 1:2] "Diana" "Clark" "Prince" "Kent"
names_with_q :  chr [1:96] "Joaquin" "Enrique" "Ezequiel" "Marquis" "Dominique" "Jaquan" "Marquise" "Marquez" "Marques" ...
narratives :  chr [1:10] "19YOM-SHOULDER STRAIN-WAS TACKLED WHILE PLAYING FOOTBALL W/ FRIENDS " ...
not_vowels :  'regex' chr "[^aeiouAEIOU]"
num_vowels :  int [1:14047] 2 2 2 2 3 2 3 4 2 3 ...
number_as :  int [1:19181] 1 1 1 2 1 1 0 1 1 1 ...
number_As :  int [1:19181] 0 0 0 0 1 0 0 1 0 0 ...
p_values :  num [1:4] 1.20e-01 9.80e-01 1.91e-05 2.00e-11
part_with_q :  chr [1:14047] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
pattern :  'regex' chr "q."
percent_change :  num [1:4] 4 -1.91 3 -5
phone_numbers :  chr [1:2] "510-555-0123" "541-555-0167"
phone_pattern :  'regex' chr "[\\(]?\\d\\d\\d[-.() ]*\\d\\d\\d[-.() ]*\\d\\d\\d\\d"
pizzas :  chr [1:3] "cheese" "pepperoni" "sausage and green peppers"
pretty_income :  chr [1:4] "       72" "    1,030" "   10,292" "1,189,192"
pretty_percent :  chr [1:4] "+4.0" "-1.9" "+3.0" "-5.0"
rows :  chr [1:4] "          Year 0   $       72" "          Year 1   $    1,030" "          Year 2   $   10,292" ...
separator :  'regex' chr "[-.() ]"
sex :  chr [1:656] "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" ...
split_dates : List of 2
 $ : chr [1:2] "23.01.2017" "29.01.2017"
 $ : chr [1:2] "30.01.2017" "06.02.2017"
split_dates_n :  chr [1:2, 1:2] "23.01.2017" "30.01.2017" "29.01.2017" "06.02.2017"
start_dates :  chr [1:2] "23.01.2017" "30.01.2017"
starts_U :  chr [1:28] "Unique" "Uma" "Unknown" "Una" "Uriah" "Ursula" "Unity" "Umaiza" "Urvi" "Ulyana" "Ula" "Udy" "Urwa" ...
these_toppings :  chr "green peppers, anchovies, and grilled onions"
three_digits :  'regex' chr "\\d\\d\\d"
toppings :  chr [1:25] "anchovies" "artichoke" "bacon" "breakfast bacon" "Canadian bacon" "cheese" "chicken" ...
total_as :  int [1:19181] 1 1 1 2 2 1 0 2 1 1 ...
trimmed_income :  chr [1:4] "72" "1030" "10292" "1189192"
unit :  'regex' chr "[\\s]?(?:YO|YR|MO)"
vowels :  'regex' chr "[aeiouAEIOU]"
whole_names :  'regex' chr "(?:Jeffrey|Geoffrey)"
with_q :  logi [1:14047] FALSE FALSE FALSE FALSE FALSE FALSE ...
word_lengths : List of 3
 $ : int [1:18] 3 5 3 1 5 4 3 3 5 4 ...
 $ : int [1:12] 3 5 2 6 4 5 3 4 4 3 ...
 $ : int [1:21] 8 6 2 6 4 5 12 3 3 3 ...
words : List of 3
 $ : chr [1:18] "The" "table" "was" "a" ...
 $ : chr [1:12] "\"No" "room!" "No" "room!\"" ...
 $ : chr [1:21] "\"There's" "plenty" "of" "room!\"" ...
x :  chr [1:4] "ow" "ooh" "yeeeah!" "shh"
y :  num [1:3] 1 2.01 1
year_percent :  chr [1:4] "2010: +4.0%" "2011: -1.9%" "2012: +3.0%" "2013: -5.0%"
years :  int [1:4] 2010 2011 2012 2013
# Extract age and make numeric
as.numeric(str_extract(age_gender, age))
 [1] 19 31 82 33 10 53 13 14 55  5
# Replace age and units with ""
genders <- str_remove(age_gender, age %R% unit) 

# Replace extra spaces
str_remove_all(genders, pattern = one_or_more(SPC))
 [1] "M" "F" "M" "F" "M" "F" "F" "M" "M" "M"
# Numeric ages, from previous step
ages_numeric <- as.numeric(str_extract(age_gender, age))

# Extract units 
time_units <- str_extract(age_gender, unit)

# Extract first word character
time_units_clean <- str_extract(time_units, pattern = WRD)

# Turn ages in months to years
ifelse(time_units_clean == "Y", ages_numeric, ages_numeric / 12)
 [1] 19.000000 31.000000 82.000000 33.000000 10.000000 53.000000  1.083333 14.000000 55.000000  5.000000

You can imagine this would be an important first step if you wanted to investigate the frequency of accidents by age or gender. If you want a challenge, get the neiss package and see if you can extract age and gender from all accidents.

4 More advanced matching and manipulation

4.1 Capturing

4.1.1 Capturing parts of a pattern

In rebus, to denote a part of a regular expression you want to capture, you surround it with the function capture(). For example, a simple pattern to match an email address might be,

email <- one_or_more(WRD) %R% 
  "@" %R% one_or_more(WRD) %R% 
  DOT %R% one_or_more(WRD)
str_view("(wolverine@xmen.com)", pattern = email)  

If you want to capture the part before the @, you simply wrap that part of the regular expression in capture():

email <- capture(one_or_more(WRD)) %R% 
  "@" %R% one_or_more(WRD) %R% 
  DOT %R% one_or_more(WRD)
str_view("(wolverine@xmen.com)", pattern = email)  

The part of the string that matches hasn’t changed, but if we pull out the match with str_match() we get access to the captured piece:

str_match("(wolverine@xmen.com)", pattern =  email)  
     [,1]                 [,2]       
[1,] "wolverine@xmen.com" "wolverine"
hero_contacts <- c("(wolverine@xmen.com)", "wonderwoman@justiceleague.org", "thor@avengers.com")
# Capture parts between @ and . and after .
email <- capture(one_or_more(WRD)) %R% 
  "@" %R% capture(one_or_more(WRD)) %R% 
  DOT %R% capture(one_or_more(WRD))

# Check match hasn't changed
str_view(hero_contacts, email)
# Pattern from previous step
email <- capture(one_or_more(WRD)) %R% 
  "@" %R% capture(one_or_more(WRD)) %R% 
  DOT %R% capture(one_or_more(WRD))
  
# Pull out match and captures
email_parts <- str_match(hero_contacts, email)
email_parts
     [,1]                            [,2]          [,3]            [,4] 
[1,] "wolverine@xmen.com"            "wolverine"   "xmen"          "com"
[2,] "wonderwoman@justiceleague.org" "wonderwoman" "justiceleague" "org"
[3,] "thor@avengers.com"             "thor"        "avengers"      "com"
# Save host
host <- email_parts[,3]
host
[1] "xmen"          "justiceleague" "avengers"     

Actually, detecting an email address can be really hard.

4.1.2 Pulling out parts of a phone number

You can now go back to the phone number example from the previous chapter. You developed a pattern to extract the parts of a string that looked like phone numbers, and now you have the skills to pull out the pieces of the number. Let’s see if you can put your skills together to output the first phone number in each string in a common format.

# A character class on its own is enough here; or() needs at least two alternatives
separator <- char_class("-.()| ")
# View text containing phone numbers
contact
[1] "Call me at 555-555-0191"                  "123 Main St"                             
[3] "(555) 555 0191"                           "Phone: 555.555.0191 Mobile: 555.555.0192"
# Add capture() to get digit parts
phone_pattern <- capture(three_digits) %R% zero_or_more(separator) %R% 
           capture(three_digits) %R% zero_or_more(separator) %R%
           capture(four_digits)
           
# Pull out the parts with str_match()
phone_numbers <- str_match(contact, phone_pattern)
phone_numbers
     [,1]            [,2]  [,3]  [,4]  
[1,] "555-555-0191"  "555" "555" "0191"
[2,] NA              NA    NA    NA    
[3,] "555) 555 0191" "555" "555" "0191"
[4,] "555.555.0191"  "555" "555" "0191"
# Put them back together
str_c(
  "(",
  phone_numbers[,2],
  ")",
  phone_numbers[,3],
  "-",
  phone_numbers[,4])
[1] "(555)555-0191" NA              "(555)555-0191" "(555)555-0191"

If you wanted to extract beyond the first phone number, e.g. the second phone number in the last string, you could use str_match_all(). But, like str_split(), it will return a list with one component for each input string, and you’ll need to use lapply() to handle the result.
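
A minimal sketch of that approach is shown below. It rebuilds the contact vector and the capturing phone pattern so it runs on its own. The helper format_numbers() is a hypothetical name, and base sprintf() is used for the formatting because it returns an empty result for strings with no matches.

```r
library(stringr)
library(rebus)

contact <- c("Call me at 555-555-0191", "123 Main St", "(555) 555 0191",
  "Phone: 555.555.0191 Mobile: 555.555.0192")

three_digits <- DGT %R% DGT %R% DGT
four_digits <- three_digits %R% DGT
separator <- char_class("-.() ")

phone_pattern <- capture(three_digits) %R% zero_or_more(separator) %R%
  capture(three_digits) %R% zero_or_more(separator) %R%
  capture(four_digits)

# One matrix per input string, one row per phone number found
all_matches <- str_match_all(contact, phone_pattern)

# Hypothetical helper: format every row of one match matrix
format_numbers <- function(m) sprintf("(%s)%s-%s", m[, 2], m[, 3], m[, 4])

formatted <- lapply(all_matches, format_numbers)
formatted
```

The second list element is empty because "123 Main St" contains no phone number, while the fourth holds both numbers from the last string.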

4.1.3 Extracting age and gender again

You can now also take another look at your pattern for pulling out age and gender from the injury narratives.

# narratives has been pre-defined
narratives
 [1] "19YOM-SHOULDER STRAIN-WAS TACKLED WHILE PLAYING FOOTBALL W/ FRIENDS "                      
 [2] "31 YOF FELL FROM TOILET HITITNG HEAD SUSTAINING A CHI "                                    
 [3] "ANKLE STR. 82 YOM STRAINED ANKLE GETTING OUT OF BED "                                      
 [4] "TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*"            
 [5] "10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB "                          
 [6] "53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION "                                      
 [7] "13 MOF TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION"
 [8] "14YR M PLAYING FOOTBALL; DX KNEE SPRAIN "                                                  
 [9] "55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE "                      
[10] "5 YOM ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN"                
# Add capture() to get age, unit and sex
pattern <- capture(optional(DGT) %R% DGT) %R%  
  optional(SPC) %R% capture(or("YO", "YR", "MO")) %R%
  optional(SPC) %R% capture(or("M", "F"))

# Pull out from narratives
str_match(narratives, pattern)
      [,1]      [,2] [,3] [,4]
 [1,] "19YOM"   "19" "YO" "M" 
 [2,] "31 YOF"  "31" "YO" "F" 
 [3,] "82 YOM"  "82" "YO" "M" 
 [4,] "33 YOF"  "33" "YO" "F" 
 [5,] "10YOM"   "10" "YO" "M" 
 [6,] "53 YO F" "53" "YO" "F" 
 [7,] "13 MOF"  "13" "MO" "F" 
 [8,] "14YR M"  "14" "YR" "M" 
 [9,] "55YOM"   "55" "YO" "M" 
[10,] "5 YOM"   "5"  "YO" "M" 
# Edit to capture just Y and M in units
pattern2 <- capture(optional(DGT) %R% DGT) %R%  
  optional(SPC) %R% capture(or("Y", "M")) %R% optional(or("O","R")) %R%
  optional(SPC) %R% capture(or("M", "F"))

# Check pattern
str_view(narratives, pattern2)


# Pull out pieces
str_match(narratives, pattern2)
      [,1]      [,2] [,3] [,4]
 [1,] "19YOM"   "19" "Y"  "M" 
 [2,] "31 YOF"  "31" "Y"  "F" 
 [3,] "82 YOM"  "82" "Y"  "M" 
 [4,] "33 YOF"  "33" "Y"  "F" 
 [5,] "10YOM"   "10" "Y"  "M" 
 [6,] "53 YO F" "53" "Y"  "F" 
 [7,] "13 MOF"  "13" "M"  "F" 
 [8,] "14YR M"  "14" "Y"  "M" 
 [9,] "55YOM"   "55" "Y"  "M" 
[10,] "5 YOM"   "5"  "Y"  "M" 

The combination of capture() and str_match() is powerful for extracting pieces of text.
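
As one possible next step, the captured columns can be turned into a small data frame. This is only a sketch: it uses three shortened narratives for illustration, and the column names and the age_years conversion are assumptions rather than part of the exercise.

```r
library(stringr)
library(rebus)

# Three shortened narratives, for illustration
sample_narratives <- c("19YOM-SHOULDER STRAIN",
  "13 MOF TRYING TO STAND UP",
  "14YR M PLAYING FOOTBALL")

pattern2 <- capture(optional(DGT) %R% DGT) %R%
  optional(SPC) %R% capture(or("Y", "M")) %R% optional(or("O", "R")) %R%
  optional(SPC) %R% capture(or("M", "F"))

parts <- str_match(sample_narratives, pattern2)

# Column 1 is the full match; columns 2 to 4 are the captures
patients <- data.frame(
  age = as.numeric(parts[, 2]),
  unit = parts[, 3],
  sex = parts[, 4],
  stringsAsFactors = FALSE
)

# Convert ages recorded in months to years
patients$age_years <- ifelse(patients$unit == "Y", patients$age, patients$age / 12)
patients
```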

4.2 Backreferences

4.2.1 Using backreferences in patterns

Backreferences can be useful in matching because they allow you to find repeated patterns or words. Using a backreference requires two things: you need to capture() the part of the pattern you want to reference, and then you refer to it with REF1.

Take a look at this pattern: capture(LOWER) %R% REF1. It matches and captures any lower case character, then requires that same character to follow immediately. In other words, it detects repeated characters regardless of which character is repeated. To see it in action try this:

str_view(c("hello", "sweet", "kitten"), 
  pattern = capture(LOWER) %R% REF1)

If you capture more than one thing you can refer to them with REF2, REF3 etc. up to REF9, counting the captures from the left of the pattern.

# Names with three repeated letters
repeated_three_times <- capture(LOWER) %R% REF1 %R% REF1

# Test it
str_view(boy_names, pattern = repeated_three_times, match = TRUE)

# Names with a pair of repeated letters
pair_of_repeated <- capture(LOWER %R% LOWER) %R% REF1

# Test it
str_view(boy_names, pattern = pair_of_repeated, match = TRUE)

# Names with a pair that reverses
pair_that_reverses <- capture(LOWER) %R% capture(LOWER) %R% REF2 %R% REF1

# Test it
str_view(boy_names, pattern = pair_that_reverses, match = TRUE)

# Four letter palindrome names
four_letter_palindrome <- exactly(capture(LOWER) %R% capture(LOWER) %R% REF2 %R% REF1)

# Test it
str_view(boy_names, pattern = four_letter_palindrome, match = TRUE)
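If it helps to see the plain regular expression, the palindrome pattern can be sketched without rebus. The string below is an assumed hand-written equivalent of four_letter_palindrome, and toy_names is a made-up vector rather than boy_names.

```r
library(stringr)

# Made-up names, for illustration only
toy_names <- c("anna", "otto", "hannah", "noel")

# Two captured letters, then the second capture, then the first,
# anchored so the whole string must be exactly four letters
palindrome_raw <- "^([a-z])([a-z])\\2\\1$"

is_palindrome <- str_detect(toy_names, palindrome_raw)
is_palindrome
# [1]  TRUE  TRUE FALSE FALSE
```

"hannah" is a palindrome too, but it has six letters, so the anchored four-letter pattern rejects it.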

In addition to matching repeated values, backreferences can also be used for replacement.

4.2.2 Replacing with regular expressions

Now that you’ve mastered matching with backreferences, you’ll build up to replacing with them. But first, let’s review str_replace() now that you’ve got regular expressions under your belt.

Remember that str_replace() takes three arguments: string, a vector of strings to do the replacements in; pattern, which identifies the parts of strings to replace; and replacement, the thing to use as a replacement.

replacement can be a vector the same length as string, in which case each element specifies the replacement to be used in the corresponding string.

# View text containing phone numbers
contact
[1] "Call me at 555-555-0191"                  "123 Main St"                             
[3] "(555) 555 0191"                           "Phone: 555.555.0191 Mobile: 555.555.0192"
# Replace digits with "X"
str_replace(contact, DGT, "X")
[1] "Call me at X55-555-0191"                  "X23 Main St"                             
[3] "(X55) 555 0191"                           "Phone: X55.555.0191 Mobile: 555.555.0192"
# Replace all digits with "X"
str_replace_all(contact, DGT, "X")
[1] "Call me at XXX-XXX-XXXX"                  "XXX Main St"                             
[3] "(XXX) XXX XXXX"                           "Phone: XXX.XXX.XXXX Mobile: XXX.XXX.XXXX"
# Replace all digits with different symbol
str_replace_all(contact, DGT, c("X", ".", "*", "_"))
[1] "Call me at XXX-XXX-XXXX"                  "... Main St"                             
[3] "(***) *** ****"                           "Phone: ___.___.____ Mobile: ___.___.____"

Using "" for the replacement value is a great way to cut unwanted bits from a string.
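A minimal sketch of that idea: deleting everything that is not a digit leaves just the phone number. contact_toy is a cut-down, assumed stand-in for contact, and the pattern is written as a plain regex string rather than with rebus.

```r
library(stringr)

# Two phone numbers in different formats, for illustration only
contact_toy <- c("(555) 555 0191", "555.555.0191")

# Replace every non-digit character with the empty string
digits_only <- str_replace_all(contact_toy, "[^0-9]", "")
digits_only
# [1] "5555550191" "5555550191"
```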

4.2.3 Replacing with backreferences

The replacement argument to str_replace() can also include backreferences. This works just like specifying patterns with backreferences, except the capture happens in the pattern argument, and the backreference is used in the replacement argument.

x <- c("hello", "sweet", "kitten")
str_replace(x, capture(ANY_CHAR), str_c(REF1, REF1))
[1] "hhello"  "ssweet"  "kkitten"

capture(ANY_CHAR) will match the first character no matter what it is. Then the replacement str_c(REF1, REF1) combines the captured character with itself, in effect doubling the first letter of each string.

You are going to use this to create some alternative, more lively accident narratives.

The strategy you’ll take is to match words ending in “ING” then replace them with an adverb followed by the original word.

adverbs <- readRDS("adverbs.rds")
# Build pattern to match words ending in "ING"
pattern <- one_or_more(WRD) %R% "ING"
str_view(narratives, pattern)


# Test replacement
str_replace(narratives, capture(pattern), 
  str_c("CARELESSLY", REF1, sep = " "))
 [1] "19YOM-SHOULDER STRAIN-WAS TACKLED WHILE CARELESSLY PLAYING FOOTBALL W/ FRIENDS "                      
 [2] "31 YOF FELL FROM TOILET HITITNG HEAD CARELESSLY SUSTAINING A CHI "                                    
 [3] "ANKLE STR. 82 YOM STRAINED ANKLE CARELESSLY GETTING OUT OF BED "                                      
 [4] "TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*"                       
 [5] "10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB "                                     
 [6] "53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION "                                                 
 [7] "13 MOF CARELESSLY TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION"
 [8] "14YR M CARELESSLY PLAYING FOOTBALL; DX KNEE SPRAIN "                                                  
 [9] "55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE "                                 
[10] "5 YOM CARELESSLY ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN"                
# One adverb per narrative
adverbs_10 <- sample(adverbs, 10)

# Replace "***ING" with "adverb ***ING"
str_replace(narratives, 
  capture(pattern),
  str_c(adverbs_10, REF1, sep = " "))  
 [1] "19YOM-SHOULDER STRAIN-WAS TACKLED WHILE EXCITEDLY PLAYING FOOTBALL W/ FRIENDS "                         
 [2] "31 YOF FELL FROM TOILET HITITNG HEAD FEROCIOUSLY SUSTAINING A CHI "                                     
 [3] "ANKLE STR. 82 YOM STRAINED ANKLE HIGHLY GETTING OUT OF BED "                                            
 [4] "TRIPPED OVER CAT AND LANDED ON HARDWOOD FLOOR. LACERATION ELBOW, LEFT. 33 YOF*"                         
 [5] "10YOM CUT THUMB ON METAL TRASH CAN DX AVULSION OF SKIN OF THUMB "                                       
 [6] "53 YO F TRIPPED ON CARPET AT HOME. DX HIP CONTUSION "                                                   
 [7] "13 MOF TREMENDOUSLY TRYING TO STAND UP HOLDING ONTO BED FELL AND HIT FOREHEAD ON RADIATOR DX LACERATION"
 [8] "14YR M ACCIDENTALLY PLAYING FOOTBALL; DX KNEE SPRAIN "                                                  
 [9] "55YOM RIDER OF A BICYCLE AND FELL OFF SUSTAINED A CONTUSION TO KNEE "                                   
[10] "5 YOM RIGHTEOUSLY ROLLING ON FLOOR DOING A SOMERSAULT AND SUSTAINED A CERVICAL STRA IN"                 

Replacement combined with backreferences can be really useful for reformatting text data.
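One way this could look in practice is reformatting phone numbers. The sketch below uses a plain regex string with three capture groups and refers to them in the replacement as \\1, \\2 and \\3; the phones vector is made up for illustration.

```r
library(stringr)

# Made-up phone numbers in dotted format
phones <- c("Phone: 555.555.0191", "555.555.0192")

# Capture area code, exchange and line number separately
phone_raw <- "(\\d{3})\\.(\\d{3})\\.(\\d{4})"

# Rebuild the number around the three captured pieces
reformatted <- str_replace(phones, phone_raw, "(\\1) \\2-\\3")
reformatted
# [1] "Phone: (555) 555-0191" "(555) 555-0192"
```

Text outside the match, such as "Phone: ", is left untouched.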

4.3 Unicode and pattern matching

4.3.1 Matching a specific code point or code groups

Things can get tricky because some characters can be specified in two ways. For example è, an e with a grave accent, can be specified either with the single code point \u00e8 or as the combination of \u0065 (a plain e) and the combining grave accent \u0300. They look the same:

x <- c("\u00e8", "\u0065\u0300")
writeLines(x)
è
è

But, specifying the single code point only matches that version:

str_view(x, "\u00e8")

The stringi package that stringr is built on contains functions for converting between the two forms. stri_trans_nfc() composes characters with combining accents into a single character. stri_trans_nfd() decomposes characters with accents into separate letter and accent characters. You can see how the characters differ by looking at the hexadecimal codes.

library(stringi)
as.hexmode(utf8ToInt(stri_trans_nfd("\u00e8")))
[1] "065" "300"
as.hexmode(utf8ToInt(stri_trans_nfc("\u0065\u0300")))
[1] "e8"
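A small sketch of why this matters: two strings that look identical can differ in length until they are normalised to the same form. The variable names here are illustrative only.

```r
library(stringi)

x <- c("\u00e8", "\u0065\u0300")

# Count code points: the precomposed form has one, the combining form two
n_before <- stri_length(x)

# Normalise both strings to the composed (NFC) form
x_nfc <- stri_trans_nfc(x)
n_after <- stri_length(x_nfc)

# After normalisation the two strings are the same sequence of code points
same_after <- x_nfc[1] == x_nfc[2]
```

Normalising the strings and the pattern to the same form before matching avoids the mismatch shown above.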

In Unicode, an accent is known as a diacritic Unicode Property, and you can match it using the rebus value UP_DIACRITIC.

Vietnamese makes heavy use of diacritics to denote the tones in its words. In this exercise, you’ll manipulate the diacritics in the names of Vietnamese rulers.

# Names with builtin accents

(tay_son_builtin <- c(
  "Nguy\u1ec5n Nh\u1ea1c", 
  "Nguy\u1ec5n Hu\u1ec7",
  "Nguy\u1ec5n Quang To\u1ea3n"
))
[1] "Nguyễn Nhạc"       "Nguyễn Huệ"        "Nguyễn Quang Toản"
# Convert to separate accents
tay_son_separate <- stri_trans_nfd(tay_son_builtin)

# Verify that the string prints the same
tay_son_separate
[1] "Nguyễn Nhạc"       "Nguyễn Huệ"        "Nguyễn Quang Toản"
# Match all accents
str_view_all(tay_son_separate, UP_DIACRITIC)

Xin chúc mừng! ‘It is useful to be consistent about combining or separating diacritics.’ - possibly a Vietnamese proverb.
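Once the accents are separate code points, they can also be removed. The sketch below is one possible approach, not the exercise solution: it uses the Unicode general category \\p{Mn} (nonspacing mark) in a plain regex string, which is a different property from the one behind UP_DIACRITIC but covers the combining accents in these names.

```r
library(stringi)
library(stringr)

rulers <- c("Nguy\u1ec5n Nh\u1ea1c", "Nguy\u1ec5n Hu\u1ec7")

# Decompose, then delete every nonspacing mark (general category Mn)
rulers_plain <- str_replace_all(stri_trans_nfd(rulers), "\\p{Mn}", "")
rulers_plain
# [1] "Nguyen Nhac" "Nguyen Hue"
```

Applied to the precomposed strings directly, the same replacement would change nothing, because there are no separate marks to match.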

4.3.2 Matching a single grapheme

A related problem is matching a single character. You’ve used ANY_CHAR to do this up until now, but it will only match a character represented by a single code point. Take these three names:

x <- c("Adele", "Ad\u00e8le", "Ad\u0065\u0300le")
writeLines(x)
Adele
Adèle
Adèle

They look similar, but this regular expression only matches two of them:

str_view(x, "Ad" %R% ANY_CHAR %R% "le")

because in the third name è is represented by two code points. The Unicode standard has a concept of a grapheme that represents a display character, but may be composed of many code points. To match any grapheme you can use GRAPHEME.

str_view(x, "Ad" %R% GRAPHEME %R% "le")
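The difference can also be checked without the viewer. This sketch uses plain regex strings, on the assumption that "." corresponds to ANY_CHAR and "\\X" to GRAPHEME, and compares code point counts with grapheme counts using stringi.

```r
library(stringr)
library(stringi)

x <- c("Adele", "Ad\u00e8le", "Ad\u0065\u0300le")

# "." matches exactly one code point, so the decomposed name fails
dot_match <- str_detect(x, "^Ad.le$")

# "\\X" matches one grapheme cluster, however many code points it spans
grapheme_match <- str_detect(x, "^Ad\\Xle$")

# Code points versus graphemes in each name
code_points <- stri_length(x)
graphemes <- stri_count_boundaries(x, type = "character")
```

Here dot_match is TRUE, TRUE, FALSE while grapheme_match is TRUE for all three names; the third name has six code points but five graphemes.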

Names of rulers from the Vietnamese Tây Sơn dynasty, with diacritics given as separate code points, are pre-defined as tay_son_separate.

# tay_son_separate has been pre-defined
tay_son_separate
[1] "Nguyễn Nhạc"       "Nguyễn Huệ"        "Nguyễn Quang Toản"
# View all the characters in tay_son_separate
str_view_all(tay_son_separate, ANY_CHAR)

# View all the graphemes in tay_son_separate
str_view_all(tay_son_separate, GRAPHEME)

# Combine the diacritics with their letters
tay_son_builtin <- stri_trans_nfc(tay_son_separate)
tay_son_builtin
[1] "Nguyễn Nhạc"       "Nguyễn Huệ"        "Nguyễn Quang Toản"
# View all the graphemes in tay_son_builtin
str_view_all(tay_son_builtin, GRAPHEME)

5 Case studies

5.1 Importance of being earnest

5.1.1 Getting the play into R

Your first step is to read the play into R using stri_read_lines().

You should take a look at the original text file: importance-of-being-earnest.txt

You’ll see there is some foreword and afterword text that Project Gutenberg has added. You’ll want to remove that, and then split the play into the introduction (the list of characters, scenes, etc.) and the main body.

earnest_file <- "importance-of-being-earnest.txt"
# Read play in using stri_read_lines()
earnest <- stri_read_lines(earnest_file)
# Detect start and end lines
start <- str_which(earnest, fixed("START OF THE PROJECT"))
end <- str_which(earnest, fixed("END OF THE PROJECT"))

# Get rid of gutenberg intro text
earnest_sub  <- earnest[(start + 1):(end - 1)]

# Detect first act
lines_start <- str_which(earnest_sub, fixed("FIRST ACT"))

# Set up index
intro_line_index <- 1:(lines_start - 1)

# Split play into intro and play
intro_text <- earnest_sub[intro_line_index]
play_text <- earnest_sub[-intro_line_index]
# Take a look at the first 20 lines
writeLines(play_text[1:20])
FIRST ACT


SCENE


Morning-room in Algernon's flat in Half-Moon Street.  The room is
luxuriously and artistically furnished.  The sound of a piano is heard in
the adjoining room.

[Lane is arranging afternoon tea on the table, and after the music has
ceased, Algernon enters.]

Algernon.  Did you hear what I was playing, Lane?

Lane.  I didn't think it polite to listen, sir.

Algernon.  I'm sorry for that, for your sake.  I don't play
accurately--any one can play accurately--but I play with wonderful
expression.  As far as the piano is concerned, sentiment is my forte.  I

stri_read_lines() is a high performance alternative to readLines().
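The trimming step relies on each marker matching exactly one line; if a marker matched several lines, (start + 1):(end - 1) would quietly use only the first match or fail with an error. A cautious sketch, using a made-up toy vector in place of the real file, checks that assumption first.

```r
library(stringr)

# Made-up stand-in for the lines of the text file
toy <- c(
  "Header added by the publisher",
  "*** START OF THE PROJECT ***",
  "Title page",
  "FIRST ACT",
  "Jack.  Hello.",
  "*** END OF THE PROJECT ***",
  "Footer added by the publisher"
)

start <- str_which(toy, fixed("START OF THE PROJECT"))
end <- str_which(toy, fixed("END OF THE PROJECT"))

# Stop early if either marker is missing or appears more than once
stopifnot(length(start) == 1, length(end) == 1, start < end)

toy_sub <- toy[(start + 1):(end - 1)]
toy_sub
# [1] "Title page"    "FIRST ACT"     "Jack.  Hello."
```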

5.1.2 Identifying the lines, take 1

The first thing you might notice when you look at your vector play_text is there are lots of empty lines. They don’t really affect your task so you might want to remove them. The easiest way to find empty strings is to use the stringi function stri_isempty(), which returns a logical vector you can use to subset the non-empty strings:

# Get rid of empty strings
empty <- stri_isempty(play_text)
play_lines <- play_text[!empty]
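Note that stri_isempty() is only TRUE for strings of length zero. If a file might contain lines made up of spaces alone (an assumption; nothing above says this play does), one possible variation is to trim whitespace before testing.

```r
library(stringi)

# Made-up lines: one real line, one empty, one containing only spaces
toy_lines <- c("Lane.  Yes, sir.", "", "   ")

# Only the truly empty string is flagged
empty_strict <- stri_isempty(toy_lines)

# Trimming first also flags the whitespace-only line
empty_trimmed <- stri_isempty(stri_trim_both(toy_lines))

kept <- toy_lines[!empty_trimmed]
```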

So, how are you going to find the elements that indicate a character starts their line? Consider the following lines:

play_lines[10:15]
[1] "Algernon.  I'm sorry for that, for your sake.  I don't play"             
[2] "accurately--any one can play accurately--but I play with wonderful"      
[3] "expression.  As far as the piano is concerned, sentiment is my forte.  I"
[4] "keep science for Life."                                                  
[5] "Lane.  Yes, sir."                                                        
[6] "Algernon.  And, speaking of the science of Life, have you got the"       

The first line is for Algernon, the next three strings are continuations of that line, then line 5 is for Lane and line 6 for Algernon.

How about looking for lines that start with a word followed by a .?

# Pattern for start, word then .
pattern_1 <- START %R% one_or_more(WRD) %R% DOT

# Test pattern_1
# str_view(play_lines, pattern_1, match = TRUE) 
# str_view(play_lines, pattern_1, match = FALSE)

# Pattern for start, capital, word then .
pattern_2 <- START %R% char_class("A-Z") %R% one_or_more(WRD) %R% DOT

# Test pattern_2
# str_view(play_lines, pattern_2, match = TRUE)
# str_view(play_lines, pattern_2, match = FALSE)

# Pattern from last step
pattern_2 <- START %R% ascii_upper() %R% one_or_more(WRD) %R% DOT

# Get subset of lines that match
lines <- str_subset(play_lines, pattern_2)

# Extract match from lines
who <- str_extract(lines, pattern_2)

# Let's see what we have
unique(who)
 [1] "Algernon."   "Lane."       "Jack."       "Cecily."     "Ernest."     "University." "Gwendolen."  "July."      
 [9] "Chasuble."   "Merriman."   "Sunday."     "Mr."         "London."     "Cardew."     "Opera."      "Markby."    
[17] "Oxonian."   

It looks like your pattern wasn’t 100% successful. It missed Lady Bracknell, and picked up lines starting with University., July. and a few others. Let’s try a slightly different strategy.
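To see why this happens, here is a small sketch with a plain regex string assumed to be equivalent to pattern_2. The toy lines are invented for illustration. A word character cannot be a space, so a two-word name never reaches the full stop, while any capitalised word that happens to end a sentence at the start of a line does.

```r
library(stringr)

# Invented lines, for illustration only
toy_lines <- c(
  "Jack.  Hello.",
  "Lady Bracknell.  Good afternoon.",
  "University.  It is a very old one."
)

# Start of string, a capital letter, one or more word characters, a full stop
pattern_2_raw <- "^[A-Z]\\w+\\."

matched <- str_detect(toy_lines, pattern_2_raw)
matched
# [1]  TRUE FALSE  TRUE
```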

5.1.3 Identifying the lines, take 2

The pattern “starts with a capital letter, has some other characters then a full stop” wasn’t specific enough. You ended up matching lines that started with things like University., July., London., and you missed characters like Lady Bracknell and Miss Prism.

Let’s take a different approach. You know the characters’ names from the play introduction. So, try specifically looking for lines that start with their names. You’ll find the or1() function from the rebus package helpful. It specifies alternatives but rather than each alternative being an argument like in or(), you can pass in a vector of alternatives.
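For intuition, a vector of alternatives can also be collapsed into a plain regular expression by hand. This sketch is an assumed rough equivalent of what or1() does, not its actual implementation, and it uses a shortened character list and invented lines.

```r
library(stringr)

# Shortened list of characters, for illustration only
some_characters <- c("Jack", "Lady Bracknell")

# Join the names with "|" inside a non-capturing group,
# anchored to the start and followed by a literal full stop
alternatives <- str_c(some_characters, collapse = "|")
pattern_raw <- str_c("^(?:", alternatives, ")\\.")

# Invented lines to test against
toy_lines <- c("Lady Bracknell.  Good afternoon.", "Jack.  Hello.", "July.  No.")

who_toy <- str_extract(toy_lines, pattern_raw)
who_toy
# [1] "Lady Bracknell." "Jack."           NA
```

This only works as written because the names contain no characters that are special in a regular expression.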

# Create vector of characters
characters <- c("Algernon", "Jack", "Lane", "Cecily", "Gwendolen", "Chasuble", 
  "Merriman", "Lady Bracknell", "Miss Prism")

# Match start, then character name, then .
pattern_3 <- START %R% or1(characters) %R% DOT

# View matches of pattern_3
# str_view(play_lines, pattern_3, match = TRUE)
  
# View non-matches of pattern_3
# str_view(play_lines, pattern_3, match = FALSE)

# Pull out matches
lines <- str_subset(play_lines, pattern_3)

# Extract match from lines
who <- str_extract(lines, pattern_3)

# Let's see what we have
unique(who)
[1] "Algernon."       "Lane."           "Jack."           "Cecily."         "Gwendolen."      "Lady Bracknell."
[7] "Miss Prism."     "Chasuble."       "Merriman."      
# Count lines per character
table(who)
who
      Algernon.         Cecily.       Chasuble.      Gwendolen.           Jack. Lady Bracknell.           Lane. 
            201             154              42             102             219              84              21 
      Merriman.     Miss Prism. 
             17              41 

Algernon and Jack get the most lines, each more than ten times as many as Merriman, who has the fewest. If you were looking really closely, you might have noticed the pattern didn’t pick up the line Jack and Algernon [Speaking together.], which you really should be counting as a line for both Jack and Algernon. One solution might be to look for these “Speaking together” lines, parse out the characters, and add to your counts.
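A minimal sketch of that idea follows. It assumes every shared line has the form "Name and Name [Speaking together.]", which is known only from the single line quoted above, and it would not handle two-word names such as Lady Bracknell. The second toy line is invented.

```r
library(stringr)

# The shared line quoted above, plus an invented ordinary line
toy_lines <- c("Jack and Algernon [Speaking together.]", "Lane.  Yes, sir.")

# Capture the two single-word names on either side of "and"
together_pattern <- "^(\\w+) and (\\w+) \\[Speaking together"
together <- str_match(toy_lines, together_pattern)

# Keep the captured names from lines that matched
shared <- together[, 2:3]
shared_speakers <- shared[!is.na(shared)]

# One extra line for each speaker, to be added to the main counts
extra_counts <- table(shared_speakers)
extra_counts
```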

5.2 A case study on case

5.2.1 Changing case to ease matching

A simple solution to working with strings in mixed case is to transform them into all lower or all upper case. Depending on your choice, you can then specify your pattern in the same case.

For example, while looking for “cat” finds no matches in the following strings,

x <- c("Cat", "CAT", "cAt") 
str_view(x, "cat")

transforming the string to lower case first ensures all variations match.

str_view(str_to_lower(x), "cat")

See if you can find the catcidents that also involved dogs. You’ll see a new rebus function called whole_word(). The argument to whole_word() will only match if it occurs as a word on its own; for example, whole_word("cat") will match cat in “The cat” and “cat.” but not in “caterpillar”.
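The same behaviour can be sketched with word boundaries in a plain regex string, which is assumed here to be roughly what whole_word() produces.

```r
library(stringr)

toy <- c("The cat", "cat.", "caterpillar")

# \\b marks a boundary between a word character and a non-word character
whole_cat <- str_detect(toy, "\\bcat\\b")
whole_cat
# [1]  TRUE  TRUE FALSE
```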

catcidents <- readRDS("catcidents.rds")

# catcidents has been pre-defined
head(catcidents)
[1] "79yOf Fractured fingeR tRiPPED ovER cAT ANd fell to FlOOr lAst nIGHT AT HOME*"                                                               
[2] "21 YOF REPORTS SUS LACERATION OF HER LEFT HAND WHEN SHE WAS OPENING A CAN OF CAT FOOD JUST PTA. DX HAND LACERATION%"                         
[3] "87YOF TRIPPED OVER CAT, HIT LEG ON STEP. DX LOWER LEG CONTUSION "                                                                            
[4] "bLUNT CHest trAUma, R/o RIb fX, R/O CartiLAgE InJ To RIB cAge; 32YOM walKiNG DOG, dog took OfF aFtER cAt,FelL,stRucK CHest oN STepS,hiT rIbS"
[5] "42YOF TO ER FOR BACK PAIN AFTER PUTTING DOWN SOME CAT LITTER DX: BACK PAIN, SCIATICA"                                                        
[6] "4YOf DOg jUst hAd PUpPieS, Cat TRIED 2 get PuPpIes, pT THru CaT dwn stA Irs, LoST foOTING & FELl down ~12 stePS; MInor hEaD iNJuRY"          
# Construct pattern of DOG in boundaries
whole_dog_pattern <- whole_word("DOG")

# See matches to word DOG
str_view(catcidents, whole_dog_pattern, match = TRUE)


# Transform catcidents to upper case
catcidents_upper <- str_to_upper(catcidents)

# View matches to word "DOG" again
str_view(catcidents_upper, whole_dog_pattern, match = TRUE)


# Which strings match?
has_dog <- str_detect(catcidents_upper, whole_dog_pattern)

# Pull out matching strings in original 
catcidents[has_dog]
 [1] "bLUNT CHest trAUma, R/o RIb fX, R/O CartiLAgE InJ To RIB cAge; 32YOM walKiNG DOG, dog took OfF aFtER cAt,FelL,stRucK CHest oN STepS,hiT rIbS"   
 [2] "4YOf DOg jUst hAd PUpPieS, Cat TRIED 2 get PuPpIes, pT THru CaT dwn stA Irs, LoST foOTING & FELl down ~12 stePS; MInor hEaD iNJuRY"             
 [3] "unhelmeted 14yof riding her bike with her dog when she saw a cat and sw erved c/o head/shoulder/elbow pain.dx: minor head injury,left shoulder" 
 [4] "Rt Shoulder Strain.26Yof Was Walking Dog On Leash And Dot Saw A Cat And Pulled Leash."                                                          
 [5] "67 YO F WENT TO WALK DOG, IT STARTED TO CHASE CAT JERKED LEASH PULLED H ER OFF PATIO, FELL HURT ANKLES. DX BILATERAL ANKLE FRACTURES"           
 [6] "46yof taking dog outside, dog bent her fingers back on a door. dog jerk ed when saw cat. hand holding leash caught on door jamb/ct hand"        
 [7] "PUSHING HER UTD WITH SHOTS DOG AWAY FROM THE CAT'S BOWL&BITTEN TO FINGE R>>PW/DOG BITE"                                                         
 [8] "DX R SH PN: 27YOF W/ R SH PN X 5D. STATES WAS YANK' BY HER DOG ON LEASH W DOG RAN AFTER CAT; WORSE' PN SINCE. FULL ROM BUT VERY PAINFUL TO MOVE"
 [9] "39Yof dog pulled her down the stairs while chasing a cat dx: rt ankle inj"                                                                      
[10] "44Yof Walking Dog And The Dof Took Off After A Cat And Pulled Pt Down B Y The Leash Strained Neck"                                              

5.2.2 Ignoring case when matching

Rather than transforming the input strings, another approach is to specify that the matching should be case insensitive. This is one of the options to the stringr regex() function.

Take our previous example,

x <- c("Cat", "CAT", "cAt") 
str_view(x, "cat")

To match the pattern cat in a case insensitive way, we wrap our pattern in regex() and specify the argument ignore_case = TRUE,

str_view(x, stringr::regex("cat", ignore_case = TRUE))

Notice that the matches retain their original case and any variant of cat matches.
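A quick check outside the viewer shows both points. The second call uses the inline flag (?i), an alternative way to request case insensitive matching that is not used elsewhere in these notes.

```r
library(stringr)

x <- c("Cat", "CAT", "cAt")

# Extracted matches keep the case they had in the input
extracted <- str_extract(x, regex("cat", ignore_case = TRUE))
extracted
# [1] "Cat" "CAT" "cAt"

# The inline flag (?i) switches on case insensitivity inside the pattern
inline_flag <- str_detect(x, "(?i)cat")
inline_flag
# [1] TRUE TRUE TRUE
```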

# View matches to "TRIP"
str_view(catcidents, "TRIP", match = TRUE)


# Construct case insensitive pattern
trip_pattern <- stringr::regex("TRIP", ignore_case = TRUE)

# View case insensitive matches to "TRIP"
str_view(catcidents, trip_pattern, match = TRUE)


# Get subset of matches
trip <- str_subset(catcidents, trip_pattern)

# Extract matches
str_extract(trip, trip_pattern)
 [1] "tRiP" "TRIP" "TRIP" "TRiP" "triP" "TRIP" "TRIP" "TRIP" "Trip" "trip" "TRIP" "TRIP"

5.2.3 Fixing case problems

Finally, you might want to transform strings to a common case. You’ve seen you can use str_to_upper() and str_to_lower(), but there is also str_to_title() which transforms to title case, in which every word starts with a capital letter.

This is another situation where stringi functions offer slightly more functionality than the stringr functions. The stringi function stri_trans_totitle() allows a specification of the type which, by default, is “word”, resulting in title case, but can also be “sentence” to give sentence case: only the first word in each sentence is capitalized.

library(stringi)

# Get first five catcidents
cat5 <- catcidents[1:5]

# Take a look at original
writeLines(cat5)
79yOf Fractured fingeR tRiPPED ovER cAT ANd fell to FlOOr lAst nIGHT AT HOME*
21 YOF REPORTS SUS LACERATION OF HER LEFT HAND WHEN SHE WAS OPENING A CAN OF CAT FOOD JUST PTA. DX HAND LACERATION%
87YOF TRIPPED OVER CAT, HIT LEG ON STEP. DX LOWER LEG CONTUSION 
bLUNT CHest trAUma, R/o RIb fX, R/O CartiLAgE InJ To RIB cAge; 32YOM walKiNG DOG, dog took OfF aFtER cAt,FelL,stRucK CHest oN STepS,hiT rIbS
42YOF TO ER FOR BACK PAIN AFTER PUTTING DOWN SOME CAT LITTER DX: BACK PAIN, SCIATICA
# Transform to title case
writeLines(str_to_title(cat5))
79yof Fractured Finger Tripped Over Cat And Fell To Floor Last Night At Home*
21 Yof Reports Sus Laceration Of Her Left Hand When She Was Opening A Can Of Cat Food Just Pta. Dx Hand Laceration%
87yof Tripped Over Cat, Hit Leg On Step. Dx Lower Leg Contusion 
Blunt Chest Trauma, R/O Rib Fx, R/O Cartilage Inj To Rib Cage; 32yom Walking Dog, Dog Took Off After Cat,Fell,Struck Chest On Steps,Hit Ribs
42yof To Er For Back Pain After Putting Down Some Cat Litter Dx: Back Pain, Sciatica
# Transform to title case with stringi
writeLines(stri_trans_totitle(cat5))
79yof Fractured Finger Tripped Over Cat And Fell To Floor Last Night At Home*
21 Yof Reports Sus Laceration Of Her Left Hand When She Was Opening A Can Of Cat Food Just Pta. Dx Hand Laceration%
87yof Tripped Over Cat, Hit Leg On Step. Dx Lower Leg Contusion 
Blunt Chest Trauma, R/O Rib Fx, R/O Cartilage Inj To Rib Cage; 32yom Walking Dog, Dog Took Off After Cat,Fell,Struck Chest On Steps,Hit Ribs
42yof To Er For Back Pain After Putting Down Some Cat Litter Dx: Back Pain, Sciatica
# Transform to sentence case with stringi
writeLines(stri_trans_totitle(cat5, type = "sentence"))
79yof fractured finger tripped over cat and fell to floor last night at home*
21 yof reports sus laceration of her left hand when she was opening a can of cat food just pta. Dx hand laceration%
87yof tripped over cat, hit leg on step. Dx lower leg contusion 
Blunt chest trauma, r/o rib fx, r/o cartilage inj to rib cage; 32yom walking dog, dog took off after cat,fell,struck chest on steps,hit ribs
42yof to er for back pain after putting down some cat litter dx: back pain, sciatica
