Course Description

Character strings can turn up in all stages of a data science project. You might have to clean messy string input before analysis, extract data that is embedded in text or automatically turn numeric results into a sentence to include in a report. Perhaps the strings themselves are the data of interest, and you need to detect and match patterns within them. This course will help you master these tasks by teaching you how to pull strings apart, put them back together and use stringr to detect, extract, match and split strings using regular expressions, a powerful way to express patterns.

1 String basics

You’ll start with some basics: how to enter strings in R, how to control how numbers are transformed to strings, and finally how to combine strings together to produce output that combines text and nicely formatted numbers.

1.1 Welcome!


You can escape quotes inside strings using a backslash.

1.1.2 What you see isn’t always what you have

Take a look at line2:

## [1] "\"No room! No room!\" they cried out when they saw Alice \ncoming."

Even though you used single quotes so you didn’t have to escape any double quotes, when R prints it, you’ll see escaped double quotes (\")! R doesn’t care how you defined the string; it only knows what the string represents, which in this case is a string with double quotes inside.

When you ask R for line2 it is actually calling print(line2) and the print() method for strings displays strings as you might enter them. If you want to see the string it represents you’ll need to use a different function: writeLines().
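As a minimal sketch of the difference: the name line2 follows the text, while its single-quoted definition is an assumption reconstructed from the printed output.

```r
# Single quotes outside, so the inner double quotes need no escaping
line2 <- '"No room! No room!" they cried out when they saw Alice \ncoming.'

# print() shows the string as you might type it, escapes included
print(line2)

# writeLines() shows what the string actually represents
writeLines(line2)

# Defining it with escaped double quotes gives exactly the same string
same <- identical(
  line2,
  "\"No room! No room!\" they cried out when they saw Alice \ncoming."
)
```

How the string was typed leaves no trace: both definitions produce identical values, and only the printing function decides whether you see the escapes.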

## [1] "The table was a large one, but the three were all crowded \ntogether at one corner of it:"                           
## [2] "\"No room! No room!\" they cried out when they saw Alice \ncoming."                                                  
## [3] "\"There's plenty of room!\" said Alice indignantly, \nand she sat down in a large arm-chair at one end of the table."
## The table was a large one, but the three were all crowded 
## together at one corner of it:
## "No room! No room!" they cried out when they saw Alice 
## coming.
## "There's plenty of room!" said Alice indignantly, 
## and she sat down in a large arm-chair at one end of the table.
## The table was a large one, but the three were all crowded 
## together at one corner of it: "No room! No room!" they cried out when they saw Alice 
## coming. "There's plenty of room!" said Alice indignantly, 
## and she sat down in a large arm-chair at one end of the table.
## hello
## 🌍

The function cat() is very similar to writeLines(), but by default separates elements with a space, and will attempt to convert non-character objects to a string. We won’t use it in this course, but you might see it in other people’s code.
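A small sketch of how the two functions differ by default; the vector some_lines is made up for illustration.

```r
some_lines <- c("first", "second")

# writeLines() puts each element on its own line by default
writeLines(some_lines)

# With sep = " " the elements are joined by spaces instead
writeLines(some_lines, sep = " ")
writeLines("")  # finish the line

# cat() separates with a space by default and accepts non-character input
cat(some_lines, 42, "\n")

# capture.output() is used here only to inspect what cat() printed
printed <- capture.output(cat(some_lines, 42, "\n"))
```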

1.1.3 Escape sequences

You might have been surprised at the output from the last part of the last exercise. How did you get two lines from one string, and how did you get that little globe? The key is the \.

A sequence in a string that starts with a \ is called an escape sequence and allows us to include special characters in our strings. You saw one escape sequence in the first exercise: \" is used to denote a double quote.

In "hello\n\U1F30D" there are two escape sequences: \n gives a newline, and \U followed by up to 8 hex digits denotes a particular Unicode character.

Unicode is a standard for representing characters that might not be on your keyboard. Each available character has a Unicode code point: a number that uniquely identifies it. These code points are generally written in hex notation, that is, using base 16 and the digits 0-9 and A-F. You can find the code point for a particular character by looking it up in a code chart. If you only need four digits for the code point, an alternative escape sequence is \u.

When R comes across a \ it assumes you are starting an escape, so if you actually need a backslash in your string you’ll need the sequence \\.

## To have a \ you need \\
## This is a really 
## really 
## really 
## long string
## नमस्ते दुनिया

You can read about a few other escape sequences in the help page ?Quotes.
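The escape sequences described in this section can be tried out as follows. This is only a sketch; the variable names are illustrative.

```r
# \n starts a new line; \U1F30D is the code point of a globe emoji
writeLines("hello\n\U1F30D")

# A literal backslash has to be typed as \\
writeLines("To have a \\ you need \\\\")

# \u takes up to four hex digits: U+00E9 is an e with an acute accent
e_acute <- "\u00e9"

# Each escape sequence is typed with several characters but stands for one
n_chars <- nchar(c("\\", "\n", e_acute))
```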

1.2 Turning numbers into strings

1.2.1 Using format() with numbers

The scientific argument to format() controls whether the numbers are displayed in fixed (scientific = FALSE) or scientific (scientific = TRUE) format.

  • When the representation is scientific, the digits argument is the number of digits before the exponent.
  • When the representation is fixed, digits controls the significant digits used for the smallest (in magnitude) number.
    • Each other number will be formatted to match the number of decimal places in the smallest number.
    • This means the number of decimal places you get in your output depends on all the values you are formatting!

For example, if the smallest number is 0.0011, and digits = 1, then 0.0011 requires 3 places after the decimal to represent it to 1 significant digit, 0.001. Every other number will be formatted to 3 places after the decimal point.

So, how many decimal places will you get if 1.0011 is the smallest number? You’ll find out in this exercise.
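A short sketch of this behaviour; the input vectors are assumptions chosen to be consistent with the example in the text.

```r
# The smallest number, 0.0011, needs 3 decimal places for 1 significant
# digit, so every element is shown with 3 decimal places
x <- c(0.0011, 0.011, 1)
x_formatted <- format(x, digits = 1)

# Here every value is at least 1, so 1 significant digit needs no decimals
y <- c(1.0011, 2.011, 1)
y_formatted <- format(y, digits = 1)

# scientific = TRUE switches the whole vector to scientific notation
format(x, digits = 1, scientific = TRUE)
```

Note that format() returns a character vector, and that all elements share a common width.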

## [1] "0.001" "0.011" "1.000"
## [1] "1" "2" "1"
## [1] " 4.0" "-1.9" " 3.0" "-5.0"
## [1] "     72" "   1030" "  10292" "1189192"
## [1] "0.12000000000" "0.98000000000" "0.00001910000" "0.00000000002"

1.2.2 Controlling other aspects of the string

## [1] "     72" "   1030" "  10292" "1189192"
##      72
##    1030
##   10292
## 1189192
## 72
## 1030
## 10292
## 1189192
##        72
##     1,030
##    10,292
## 1,189,192

1.2.3 formatC()

The function formatC() provides an alternative way to format numbers based on C style syntax.

Rather than a scientific argument, formatC() has a format argument that takes a code representing the required format. The most useful are:

  • "f" for fixed,
  • "e" for scientific, and
  • "g" for fixed unless scientific saves space

With format = "g", the digits argument behaves like it does in format(): it specifies the number of significant digits. With format = "f", however, digits is the number of digits after the decimal point (with format = "e" it likewise counts the digits after the decimal point of the mantissa). This is more predictable than format(), because the number of places after the decimal is fixed regardless of the values being formatted.

formatC() also formats numbers individually, which means you always get the same output regardless of other numbers in the vector.

The flag argument allows you to provide some modifiers that, for example, force the display of the sign (flag = "+"), left align numbers (flag = "-") and pad numbers with leading zeros (flag = "0").
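The following sketch exercises the format, digits and flag arguments described above. The input vectors are assumptions, picked for illustration.

```r
x <- c(0.0011, 0.011, 1)
# "f": digits counts places after the decimal point, whatever the values
fixed_out <- formatC(x, format = "f", digits = 1)

pct <- c(4, -1.9, 3, -5)
# flag = "+" forces a sign on every number
signed_out <- formatC(pct, format = "f", digits = 1, flag = "+")

# "g": fixed unless scientific is shorter; digits counts significant digits
tiny <- c(0.12, 0.98, 0.0000191, 0.00000000002)
general_out <- formatC(tiny, format = "g", digits = 2)

# flag = "0" pads with leading zeros up to the requested width
padded_out <- formatC(7, width = 3, flag = "0")
```

Because each number is formatted on its own, the results do not share a common width unless you ask for one with the width argument.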

## [1] "0.0" "0.0" "1.0"
## [1] "1.0" "2.0" "1.0"
## [1] "4.0"  "-1.9" "3.0"  "-5.0"
## [1] "+4.0" "-1.9" "+3.0" "-5.0"
## [1] "0.12"    "0.98"    "1.9e-05" "2e-11"

1.3 Putting strings together

1.3.1 Annotation of numbers

## [1] "$72"        "$1,030"     "$10,292"    "$1,189,192"
## [1] "+4.0%" "-1.9%" "+3.0%" "-5.0%"
## [1] "2010: +4.0%,2011: -1.9%,2012: +3.0%,2013: -5.0%"

Specifying sep = "" is so common, there is actually another function paste0() that works like paste() but always pastes elements together without a separator between them.
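A brief sketch of sep, collapse and paste0() working together; the values are illustrative.

```r
years <- c(2010, 2011)
changes <- c("+4.0%", "-1.9%")

# paste() inserts sep between its arguments, element by element
with_sep <- paste(years, changes, sep = ": ")

# paste0() is paste() with sep = ""
dollars <- paste0("$", c("72", "1,030"))

# collapse joins the elements of the result into a single string
one_string <- paste(with_sep, collapse = ", ")
```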

1.3.2 A very simple table

##           Year 0   $       72
##           Year 1   $    1,030
##           Year 2   $   10,292
## Project Lifetime   $1,189,192

If you wanted the dollar signs right next to the numbers, you could format the incomes with trim = TRUE, paste on the $, then format again as a string with justify = "right".
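Those three steps could look something like this. The income vector is an assumption, reconstructed from the numbers in the table above.

```r
income <- c(72, 1030, 10292, 1189192)

# 1. Format without padding so nothing sits between "$" and the digits
trimmed <- format(income, big.mark = ",", trim = TRUE)

# 2. Paste on the dollar sign
dollars <- paste0("$", trimmed)

# 3. Format again, this time as strings, right-justified to a common width
aligned <- format(dollars, justify = "right")
writeLines(aligned)
```

The second call to format() works on character input, so it only pads; it does not touch the digits.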

1.3.3 Let’s order pizza!

## [1] "meatballs" "garlic"    "cheese"
## I want to order a pizza with meatballs, garlic, and cheese.

2 Introduction to stringr

Time to meet stringr! You’ll start by learning about some stringr functions that are very similar to some base R functions, then how to detect specific patterns in strings, how to split strings apart and how to find and replace parts of strings.

2.1 Introducing stringr

2.1.1 Putting strings together with stringr


  • str_c(): the c is short for concatenate; the function works like paste().
  • It takes vectors of strings as input along with sep and collapse arguments.

There are two key ways str_c() differs from paste().

  • First, the default separator is an empty string, sep = "", as opposed to a space, so it’s more like paste0().
  • Second, str_c() handles missing values differently. paste() turns missing values into the string “NA”, whereas str_c() propagates missing values. That means combining any strings with a missing value will result in another missing value.

This behavior is nice because you learn quickly when you might have missing values, rather than discovering weird “NA”s inside your strings later. Another stringr function that is useful when you may have missing values is str_replace_na(), which replaces missing values with any string you choose.
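A minimal sketch of the difference in missing-value handling; the vector x is made up for illustration.

```r
library(stringr)

x <- c("a", NA)

# paste() turns NA into the string "NA"
pasted <- paste(x, "!", sep = "")

# str_c() propagates the missing value instead
combined <- str_c(x, "!")

# str_replace_na() swaps NA for a string of your choice before combining
filled <- str_c(str_replace_na(x, "missing"), "!")
```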

2.1.2 String length


  • str_length() takes a vector of strings as input and returns the number of characters in each string.
## [1] 5 5

This is very similar to the base function nchar() but str_length() handles factors in an intuitive way, whereas nchar() will just return an error.
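The factor case can be checked with a small made-up example. tryCatch() is used here only so that the nchar() error does not stop the script.

```r
library(stringr)

f <- factor(c("cat", "horse"))

# str_length() counts the characters of each element's label
lens <- str_length(f)

# nchar() refuses a factor; keep the error message instead of failing
nchar_result <- tryCatch(nchar(f), error = function(e) conditionMessage(e))
```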

## [1] "Noah"    "Liam"    "Mason"   "Jacob"   "William" "Ethan"
## [1] 4 4 5 5 7 5
## [1] 0.3360496
## [1] 4 4 5 5 7 5

The average length of the girls’ names in 2014 is about 1/3 of a character longer than that of the boys’ names. Just be aware this is a naive average where each name is counted once, not weighted by how many babies received the name. A better comparison might be an average weighted by the n column in babynames.
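To illustrate the difference between the naive and the weighted average, here is a sketch on a tiny data frame. The names and counts are hypothetical and are not taken from babynames.

```r
# Hypothetical counts, made up for illustration only
toy <- data.frame(
  name = c("Emma", "Olivia", "Ava"),
  n    = c(300, 100, 100),
  stringsAsFactors = FALSE
)

len <- nchar(toy$name)

# Naive average: every name counts once
naive <- mean(len)

# Weighted average: every baby counts once
weighted <- weighted.mean(len, toy$n)
```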

2.1.3 Extracting substrings


  • str_sub() extracts parts of strings based on their location.
  • Its first argument, string, is a vector of strings.
  • The arguments start and end specify the boundaries of the piece to extract in characters.

For example, str_sub(x, 1, 4) asks for the substring starting at the first character, up to the fourth character, or in other words the first four characters. Try it with Batman’s name:

## [1] "Bruc" "Wayn"

Both start and end can be negative integers, in which case, they count from the end of the string. For example, str_sub(x, -4, -1), asks for the substring starting at the fourth character from the end, up to the first character from the end, i.e. the last four characters. Again, try it with Batman:

## [1] "ruce" "ayne"
## boy_first_letter
##    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O 
## 1450  651  767  996  549  185  332  401  234 1388 1290  536  913  424  207 
##    P    Q    R    S    T    U    V    W    X    Y    Z 
##  230   56  778  804  771   43  160  174   56  252  379
## boy_last_letter
##    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o 
##  421  104   92  436 1145   66   81  582  704   57  349  942  389 4664  729 
##    p    q    r    s    t    u    v    w    x    y    z 
##   32   19 1011  825  291   81   71   34   86  696  119
## girl_first_letter
##    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O 
## 3099  698  941  808  932  209  345  468  373 1429 1689 1121 1744  752  143 
##    P    Q    R    S    T    U    V    W    X    Y    Z 
##  301   38  830 1366  681   28  214   85   62  294  500
## girl_last_letter
##    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o 
## 6624   20   13   81 3111    8   21 1936 1580   12   31  450  115 2600  104 
##    p    q    r    s    t    u    v    w    x    y    z 
##    3    2  291  326  208   59    6   17   49 1432   51
  • "A" is the most popular first letter for both boys and girls, and the most popular last letter for girls.
  • However, the most popular last letter for boys’ names was "n".
  • You might have seen substr(), a base R function that is similar to str_sub().
  • The big advantage of str_sub() is the ability to use negative indexes to count from the end of a string.
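The positive and negative indexing described in this section can be sketched as follows, along with a base R equivalent for comparison.

```r
library(stringr)

x <- c("Bruce", "Wayne")

# First four characters
first_four <- str_sub(x, 1, 4)

# Last four characters: negative positions count from the end
last_four <- str_sub(x, -4, -1)

# substr() has no negative indexes, so positions come from nchar()
last_four_base <- substr(x, nchar(x) - 3, nchar(x))
```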

2.2 Hunting for matches

stringr provides several functions that look for matches. All take a pattern argument:

  • str_detect()
  • str_subset()
  • str_count()

2.2.1 Detecting matches


  • str_detect() answers the question: Does the string contain the pattern?
  • returns a logical vector of the same length as that of the input vector string, with TRUE for elements that contain the pattern and FALSE otherwise.
## [1] 16
##  [1] "Uzziah"    "Ozzie"     "Ozzy"      "Uzziel"    "Jazz"     
##  [6] "Chazz"     "Izzy"      "Azzam"     "Izzac"     "Izzak"    
## [11] "Fabrizzio" "Jazziel"   "Azzan"     "Izzaiah"   "Muizz"    
## [16] "Yazziel"
## # A tibble: 16 x 5
##     year sex   name          n       prop
##    <dbl> <chr> <chr>     <int>      <dbl>
##  1  2014 M     Uzziah       67 0.0000328 
##  2  2014 M     Ozzie        62 0.0000304 
##  3  2014 M     Ozzy         57 0.0000279 
##  4  2014 M     Uzziel       21 0.0000103 
##  5  2014 M     Jazz         20 0.00000980
##  6  2014 M     Chazz        17 0.00000833
##  7  2014 M     Izzy         16 0.00000784
##  8  2014 M     Azzam        14 0.00000686
##  9  2014 M     Izzac        13 0.00000637
## 10  2014 M     Izzak         8 0.00000392
## 11  2014 M     Fabrizzio     7 0.00000343
## 12  2014 M     Jazziel       6 0.00000294
## 13  2014 M     Azzan         5 0.00000245
## 14  2014 M     Izzaiah       5 0.00000245
## 15  2014 M     Muizz         5 0.00000245
## 16  2014 M     Yazziel       5 0.00000245

That last example is another common use of str_detect(): subsetting a data frame to rows where the values in a column contain the pattern of interest. In this case it lets us see these double-z names are pretty rare. For example, even the most popular, Uzziah, only accounted for 0.003% of boys born in 2014.
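Both uses of str_detect() can be sketched on a small example. The pizzas vector and the orders data frame are assumptions made for illustration.

```r
library(stringr)

pizzas <- c("cheese", "pepperoni", "sausage and green peppers")

# One TRUE/FALSE per input string
has_pepper <- str_detect(pizzas, "pepper")

# The logical vector can index the strings themselves...
pizzas[has_pepper]

# ...or the rows of a data frame
orders <- data.frame(id = 1:3, pizza = pizzas, stringsAsFactors = FALSE)
pepper_orders <- orders[str_detect(orders$pizza, "pepper"), ]
```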

2.2.2 Subsetting strings based on match

Since detecting strings with a pattern and then subsetting out those strings is such a common operation, stringr provides a function str_subset() that does that in one step.

For example, let’s repeat our search for “pepper” in our pizzas using str_subset():

## [1] "pepperoni"                 "sausage and green peppers"

We get a new vector of strings, but it only contains those original strings that contained the pattern.

str_subset() can be easily confused with str_extract(). str_extract() returns a vector of the same length as that of the input vector, but with only the parts of the strings that matched the pattern.
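A side-by-side sketch makes the distinction concrete; the pizzas vector is assumed from the output shown earlier.

```r
library(stringr)

pizzas <- c("cheese", "pepperoni", "sausage and green peppers")

# str_subset(): only the strings that match, returned whole
subset_out <- str_subset(pizzas, "pepper")

# str_extract(): one result per input, holding just the matched part
# (NA where there is no match)
extract_out <- str_extract(pizzas, "pepper")
```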

##  [1] "Uzziah"    "Ozzie"     "Ozzy"      "Uzziel"    "Jazz"     
##  [6] "Chazz"     "Izzy"      "Azzam"     "Izzac"     "Izzak"    
## [11] "Fabrizzio" "Jazziel"   "Azzan"     "Izzaiah"   "Muizz"    
## [16] "Yazziel"
##  [1] "Izzabella"  "Jazzlyn"    "Jazzlynn"   "Lizzie"     "Izzy"      
##  [6] "Lizzy"      "Mazzy"      "Izzabelle"  "Jazzmine"   "Jazzmyn"   
## [11] "Jazzelle"   "Jazzmin"    "Izzah"      "Jazzalyn"   "Jazzmyne"  
## [16] "Izzabell"   "Jazz"       "Mazzie"     "Alyzza"     "Izza"      
## [21] "Izzie"      "Jazzlene"   "Lizzeth"    "Jazzalynn"  "Jazzy"     
## [26] "Alizzon"    "Elizzabeth" "Jazzilyn"   "Jazzlynne"  "Jizzelle"  
## [31] "Izzabel"    "Izzabellah" "Izzibella"  "Jazzabella" "Jazzabelle"
## [36] "Jazzel"     "Jazzie"     "Jazzlin"    "Jazzlyne"   "Aizza"     
## [41] "Brizza"     "Ezzah"      "Fizza"      "Izzybella"  "Rozzlyn"
##  [1] "Unique"  "Uma"     "Unknown" "Una"     "Uriah"   "Ursula"  "Unity"  
##  [8] "Umaiza"  "Urvi"    "Ulyana"  "Ula"     "Udy"     "Urwa"    "Ulani"  
## [15] "Umaima"  "Umme"    "Ugochi"  "Ulyssa"  "Umika"   "Uriyah"  "Ubah"   
## [22] "Umaira"  "Umi"     "Ume"     "Urenna"  "Uriel"   "Urijah"  "Uyen"
## [1] "Umaiza"

Only one girl’s name starts with “U” and contains a “z”. Have you ever met an “Umaiza”?

2.2.3 Counting matches


  • str_count() answers the question “How many times does the pattern occur in each string?”
  • always returns an integer vector of the same length as that of the input vector

If you count the occurrences of "pepper" in your pizzas, you’ll find no occurrences in the first, and one each in the second and third:

## [1] 0 1 1

Perhaps a little more interesting is to count how many "e"s occur in each order:

## [1] 3 2 5
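Both counts can be reproduced with a sketch like this one, where the pizzas vector is assumed to hold the three orders from the earlier examples.

```r
library(stringr)

pizzas <- c("cheese", "pepperoni", "sausage and green peppers")

# Occurrences of "pepper" in each order
pepper_counts <- str_count(pizzas, "pepper")

# Occurrences of the letter "e" in each order
e_counts <- str_count(pizzas, "e")
```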

## [1] "Aaradhana"

2.3 Splitting strings

2.3.1 Parsing strings into variables


  • str_split() pulls apart raw string data into more useful variables.

In this exercise you’ll pull apart a date range, something like "23.01.2017 - 29.01.2017", into separate variables for the start of the range, "23.01.2017", and the end of the range, "29.01.2017".

If the simplify argument is FALSE (the default) you’ll get back a list of the same length as that of the input vector. More commonly, you’ll want to pull out the first piece (or second piece etc.) from every element, which is easier if you specify simplify = TRUE and get a matrix as output.

## [[1]]
## [1] "23.01.2017" "29.01.2017"
## [[2]]
## [1] "30.01.2017" "06.02.2017"
##      [,1]         [,2]        
## [1,] "23.01.2017" "29.01.2017"
## [2,] "30.01.2017" "06.02.2017"
##      [,1] [,2] [,3]  
## [1,] "23" "01" "2017"
## [2,] "30" "01" "2017"

Use the simplify = TRUE argument when you want to split each string into the same number of pieces.
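The list and matrix outputs can be compared with a sketch along these lines. The variable names are illustrative; fixed() is stringr’s way to treat "." as a literal dot.

```r
library(stringr)

date_ranges <- c("23.01.2017 - 29.01.2017", "30.01.2017 - 06.02.2017")

# Default: a list with one character vector per input string
as_list <- str_split(date_ranges, " - ")

# simplify = TRUE: a character matrix with one row per input string
as_matrix <- str_split(date_ranges, " - ", simplify = TRUE)
start_dates <- as_matrix[, 1]

# Split the start dates again, this time on a literal dot
parts <- str_split(start_dates, fixed("."), simplify = TRUE)
```

With the matrix form, picking out “the first piece of every element” is just a column subset.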

2.3.2 Some simple text statistics

Generally, specifying simplify = TRUE will give you output that is easier to work with, but you’ll always get n pieces (even if some are empty, "").

Sometimes, you want to know how many pieces a string can be split into, or you want to do something with every piece before moving to a simpler structure. This is a situation where you don’t want to simplify and you’ll have to process the output with something like lapply().

As an example, you’ll be performing some simple text statistics on your lines from Alice’s Adventures in Wonderland from Chapter 1. Your goal will be to calculate how many words are in each line, and the average length of words in each line.

To do these calculations, you’ll need to split the lines into words. One way to break a sentence into words is to split on an empty space " ". This is a little naive because, for example, it wouldn’t pick up words separated by a newline escape sequence like in "two\nwords", but since this situation doesn’t occur in your lines, it will do.
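A sketch of that workflow on two stand-in lines (not the lines from the book):

```r
library(stringr)

# Stand-in lines, made up for illustration
text_lines <- c("the quick brown fox", "hello world")

# No simplify: keep a list, since each line has a different number of words
words <- str_split(text_lines, " ")

# Number of words in each line
n_words <- lapply(words, length)

# Average word length in each line
word_lengths <- lapply(words, str_length)
avg_length <- lapply(word_lengths, mean)
```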

## [[1]]
## [1] 18
## [[2]]
## [1] 12
## [[3]]
## [1] 21
## [[1]]
## [1] 3.944444
## [[2]]
## [1] 4.333333
## [[3]]
## [1] 4.428571

The word lengths aren’t quite right because you included some punctuation symbols. One way to deal with that is to replace them first with str_replace().

2.4 Replacing matches in strings

2.4.1 Replacing to tidy strings

Sometimes, it’s easier to just replace the parts you don’t want with an empty string "". This is also a common strategy to clean strings up, for example, to remove unwanted punctuation or white space.

In this exercise you’ll pull out some numbers by replacing the part of the string that isn’t a number. You’ll also play with the format of some phone numbers. Pay close attention to the difference between str_replace() and str_replace_all().
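The difference between the two functions shows up as soon as a string contains more than one match. In this sketch the phone_numbers vector is an assumption, chosen to be consistent with the output below.

```r
library(stringr)

phone_numbers <- c("510-555-0123", "541-555-0167")

# str_replace() changes only the first match in each string
first_only <- str_replace(phone_numbers, "-", " ")

# str_replace_all() changes every match
all_spaces <- str_replace_all(phone_numbers, "-", " ")
all_dots <- str_replace_all(phone_numbers, "-", ".")

# Replacing with "" removes the matches altogether
digits_only <- str_replace_all(phone_numbers, "-", "")
```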

## [1] "510 555-0123" "541 555-0167"
## [1] "510 555 0123" "541 555 0167"
## [1] "510.555.0123" "541.555.0167"

2.4.2 Review

You’ve covered a lot of stringr functions in this chapter:

  • str_c()
  • str_length()
  • str_sub()
  • str_detect()
  • str_subset()
  • str_count()
  • str_split()
  • str_replace()

As a review we’ve got a few tasks for you to do with some DNA sequences. We’ve put three sequences, corresponding to three genes, from the genome of Yersinia pestis – the bacterium that causes bubonic plague – into the vector genes.

Each string represents a gene, each character a particular nucleotide: Adenine, Cytosine, Guanine or Thymine.

We aren’t going to tell you which function to use. It’s up to you to choose the right one and specify the needed arguments. Good luck!

## [1] 441 462 993
## [1] 118 117 267
## [1] "TT_G_GT___TT__TCC__TCTTTG_CCC___TCTCTGCTGG_TCCTCTGGT_TTTC_TGTTGG_TG_CGTC__TTTCT__T_TTTC_CCC__CCGTTG_GC_CCTTGTGCG_TC__TTGTTG_TCC_GTTTT_TG_TTGC_CCGC_G___GTGTC_T_TTCTG_GCTGCCT___CC__CCGCCCC___GCGT_CTTGGG_T___TC_GGCTTTTGTTGTTCG_TCTGTTCT__T__TGGCTGC__GTT_TC_GGT_G_TCCCCGGC_CC_TG_GTGG_TGTC_CG_TT__CC_C_GGCC_TTC_GCGT__GTTCGTCC__CTCTGGGCC_TG__GT_TTTCTGT_G____CCC_GCTTCTTCT__TTT_TCCGCT___TGTTC_GC__C_T_TTC_GC_CT_CC__GCGT_CTGCC_CTT_TC__CGTT_TGTC_GCC_T"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [2] "TT__GG__CG_TCGT_CGC_TG_T_GGGTTTTGC_GTG_T_TT_GTGTCTCGGTTG_CTGG_TCTC_TC__T_GTCTGG_TTTTGTTG_T__GT_CCTGCTGC__TGC_TC__TGG_TTT_C_C_TC_CTTT__T___T_TGCTGT_GTGGCC_GTGGTGT__T_GGCCTC__CC_CTTCTTCT__GCTTTCC__TTTTTTC__GGCGG__GGGT__TCTTTGGC_CTTTTC__G_TT_TGCC__T___GC_GC___CGTCGT__CCC_GTTGTTTTGGGTT__CGTGT_C_C__GCTGCGGT__TG_TCCCTGCTTGCCGC_TCTTTTCT_CTCTT_C_TG__T_GTTCCGGGGCT__C_GCG_GGTTTTTGGCT__TTC_GC_T_GGGTGTGCGTGC_TTTTCC_TT__TGCTTTC_GG_TGCTGCG_TCG_G_TT_TCG_TCTG_T___TTTC_CTC_T"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   

2.4.3 Final challenges

As the final exercise we want to expose you to the power of combining operations. You’ll complete two tasks:

  1. You’ll turn a vector of full names, like “Bruce Wayne”, into abbreviated names like “B. Wayne”. This requires combining str_split(), str_sub() and str_c().

  2. You’ll compare how many boy names end in “ee” compared to girl names. This requires combining str_sub() with str_detect() along with the base function table().
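One possible way to combine these functions is sketched below. The input names are assumptions, and the second task is shown on a toy vector rather than on the babynames data.

```r
library(stringr)

full_names <- c("Diana Prince", "Clark Kent")

# Split into a matrix: column 1 = first name, column 2 = last name
name_parts <- str_split(full_names, " ", simplify = TRUE)

# Keep only the first letter of each first name
initials <- str_sub(name_parts[, 1], 1, 1)

# Glue the initial, ". " and the last name back together
abbreviated <- str_c(initials, ". ", name_parts[, 2])

# Task 2 on a toy vector: which names end in "ee"?
toy_names <- c("Lee", "Aubree", "Noah")
ends_in_ee <- str_detect(str_sub(toy_names, -2, -1), "ee")
table(ends_in_ee)
```

Taking the last two characters first means the pattern can only match at the end of the name.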

## [1] "D. Prince" "C. Kent"
## sex
##   F   M 
## 572  84

3 Pattern matching with regular expressions

In this chapter you’ll learn about regular expressions, a language for describing patterns in strings. By combining regular expressions with the stringr functions you’ll greatly increase your power to manipulate strings.

3.1 Regular expressions

3.1.1 Matching the start or end of the string

rebus provides START and END shortcuts to specify regular expressions that match the start and end of the string. These are also known as anchors. You can try it out just by typing START at the console.


You’ll see the output <regex> ^. The <regex> denotes this is a special regex object and it has the value ^. ^ is the character used in the regular expression language to denote the start of a string.

The special operator provided by rebus, %R%, allows you to compose complicated regular expressions from simple pieces. When you are reading rebus code, think of %R% as “then”. For example, you could combine START with c,

START %R% "c"

to match the pattern “the start of string then a c”, or in other words: strings that start with c. In rebus, if you want to match a specific character, or a specific sequence of characters, you simply specify them as a string, i.e. surround them with quotes.

## <regex> $

For that last example, rebus also provides the function exactly(x) which is a shortcut for START %R% x %R% END that matches only if the string is exactly x.
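The anchors and exactly() can be combined with str_detect() as in this sketch. It assumes the rebus and stringr packages are installed; the test strings are made up.

```r
library(stringr)
library(rebus)

x <- c("cat", "scat", "cats")

# Strings that start with "c"
starts_with_c <- str_detect(x, START %R% "c")

# Strings that end with "t"
ends_with_t <- str_detect(x, "t" %R% END)

# Strings that are exactly "cat", nothing before or after
only_cat <- str_detect(x, exactly("cat"))

# Printing a rebus object shows the regular expression it stands for
exactly("cat")
```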

3.1.2 Matching any character

In a regular expression you can use a wildcard to match a single character, no matter what the character is. In rebus it is specified with ANY_CHAR.

## <regex> .

For example, "c" %R% ANY_CHAR %R% "t" will look for patterns like "c_t", where the blank can be any character. Consider the strings:

  • “cat”,
  • “coat”,
  • “scotland” and
  • “tic toc”.

Where would the matches to "c" %R% ANY_CHAR %R% "t" be?

Test your intuition by running:

Notice that ANY_CHAR will match a space character (c t in tic toc). It will also match numbers or punctuation symbols, but ANY_CHAR will only ever match one character, which is why we get no match in coat.
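One way to check these matches is sketched below. It uses str_extract() to return the matched text, which may differ from the function used in the original exercise.

```r
library(stringr)
library(rebus)

x <- c("cat", "coat", "scotland", "tic toc")

# "c", then any single character, then "t"
pattern <- "c" %R% ANY_CHAR %R% "t"

# The matched part of each string, or NA where nothing matches
matched <- str_extract(x, pattern)
```

The NA for “coat” reflects that two characters sit between the c and the t, while ANY_CHAR stands for exactly one.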

3.1.3 Combining with stringr functions

You can pass a regular expression as the pattern argument to any stringr function that has the pattern argument.

It now also makes sense to add str_extract() to your repertoire. It returns just the part of the string that matched the pattern:

## [1] 96
## part_with_q
## qa qe qi qm qo qu 
##  1  1  2  2  1 89
## count_of_q
##     0     1 
## 13930    96
## [1] 0.006844432

3.2 More regular expressions