Load Libraries

library(tidyverse)
library(nycflights13)


0. Tools for using regular expressions


Next, let’s learn how to apply regular expressions to real problems. We will learn stringr functions that help us detect matches, extract matches, work with grouped matches, replace matches, and split strings.


Detect Matches

To determine whether each element of a character vector matches a pattern, use str_detect(). It returns a logical vector the same length as the input: TRUE where there is a match and FALSE otherwise.

x <- c("apple", "banana", "pear")
str_detect(x, "e")
## [1]  TRUE FALSE  TRUE

We can then use the sum() and mean() functions to answer how many strings match a given pattern and what proportion of strings in the vector match it:

# How many common words start with t?
sum(str_detect(words, "^t"))
## [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
## [1] 0.2765306


Lab Exercise: Find how many words end with “e*e”, where “*” can be any letter.


A common use of str_detect() is to select the elements that match a pattern. You can do this with logical subsetting, or the convenient str_subset() wrapper:

words[str_detect(words, "x$")]
## [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
## [1] "box" "sex" "six" "tax"

Typically, however, your strings will be one column of a data frame. We will use filter together with str_detect():

df <- tibble(
  word = words, 
  i = seq_along(word)
)
df %>% 
  filter(str_detect(word, "x$"))
## # A tibble: 4 × 2
##   word      i
##   <chr> <int>
## 1 box     108
## 2 sex     747
## 3 six     772
## 4 tax     841

Here the seq_along() function returns the position of each word within the vector.
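As a quick illustration with a toy vector (not part of the data above), seq_along() simply counts 1, 2, 3, … along its input:

y <- c("box", "sex", "six")
seq_along(y)
## [1] 1 2 3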

A variation on str_detect() is str_count(): rather than a simple yes or no, it tells you how many matches there are in a string:

x <- c("apple", "banana", "pear")
str_count(x, "a")
## [1] 1 3 1
# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
## [1] 1.991837

It’s natural to use str_count() with mutate():

df %>% 
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )
## # A tibble: 980 × 4
##    word         i vowels consonants
##    <chr>    <int>  <int>      <int>
##  1 a            1      1          0
##  2 able         2      2          2
##  3 about        3      3          2
##  4 absolute     4      4          4
##  5 accept       5      2          4
##  6 account      6      3          4
##  7 achieve      7      4          3
##  8 across       8      2          4
##  9 act          9      1          2
## 10 active      10      3          3
## # … with 970 more rows


Example

Now let’s look at an example with the sentences data set. First, let’s find the longest sentences:

df1 <- tibble(
  sentence = sentences,
  word_number = str_count(sentence, "\\s") + 1
)

df1 %>%
  arrange(desc(word_number)) %>%
  print()
## # A tibble: 720 × 2
##    sentence                                                  word_number
##    <chr>                                                           <dbl>
##  1 It was hidden from sight by a mass of leaves and shrubs.           12
##  2 It was a bad error on the part of the new judge.                   12
##  3 A ridge on a smooth surface is a bump or flaw.                     11
##  4 The barrel of beer was a brew of malt and hops.                    11
##  5 The crunch of feet in the snow was the only sound.                 11
##  6 The vane on top of the pole revolved in the wind.                  11
##  7 The bills were mailed promptly on the tenth of the month.          11
##  8 In the rear of the ground floor was a large passage.               11
##  9 The water in this well is a source of good health.                 11
## 10 He wrote his name boldly at the top of the sheet.                  11
## # … with 710 more rows

Here we turn the character vector into a tibble, count the number of words in each sentence by counting the white spaces and adding one, and then arrange the rows in descending order of word count.
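As a quick sanity check, applying the same counting idea to the first sentence alone (which has eight words) gives:

str_count("The birch canoe slid on the smooth planks.", "\\s") + 1
## [1] 8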

Next, let’s find the sentences that do not contain “a”, “an”, or “the”:

df1 %>%
  filter(!str_detect(str_to_lower(sentence), "( a )|( the )|( an )")) %>%
  print()
## # A tibble: 254 × 2
##    sentence                                    word_number
##    <chr>                                             <dbl>
##  1 Rice is often served in round bowls.                  7
##  2 The juice of lemons makes fine punch.                 7
##  3 The hogs were fed chopped corn and garbage.           8
##  4 Four hours of steady work faced us.                   7
##  5 A large size in stockings is hard to sell.            9
##  6 A rod is used to catch pink salmon.                   8
##  7 Smoky fires lack flame and heat.                      6
##  8 The swan dive was far short of perfect.               8
##  9 Her purse was full of useless trash.                  7
## 10 Read verse out loud for pleasure.                     6
## # … with 244 more rows

Here we use ! as logical NOT to exclude sentences that match the given patterns. Note that we include spaces inside the parentheses so that we capture the standalone words “a”, “an”, and “the” rather than those letters inside longer words. One side effect is that an article at the very start of a sentence (for example, “The juice of lemons makes fine punch.”) has no leading space and is therefore not excluded.
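The following sketch uses \\b (a word boundary) instead of literal spaces, so that the articles are excluded even at the start of a sentence; this is a variation, not the approach used above:

df1 %>%
  filter(!str_detect(str_to_lower(sentence), "\\b(a|an|the)\\b"))

Here \\b matches the boundary between a letter and a non-letter (or the start or end of the string), so “a”, “an”, and “the” are excluded wherever they appear as whole words.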


Lab Exercise: Find all sentences in sentences that contain neither “r” nor “s”.


Extract matches

In data cleaning tasks, we often need to extract the actual text of a match. For example, suppose we want to know whether “q” is always followed by “u” in a word.

In that case, we can use str_extract().

q_string <- str_extract(words, "q.")
head(q_string)
## [1] NA NA NA NA NA NA

This returns many NA values, which come from the words that do not contain “q”. Let’s remove those NA values.

q_string <- q_string[!is.na(q_string)]
print(q_string)
##  [1] "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu"

Now it is clear that “q” is only ever followed by “u” in these words. Another way to do this is to use str_subset() first, keeping only the strings that contain the pattern.

q_string2 <- str_subset(words, "q.")
q_string2 <- str_extract(q_string2, "q.")
print(q_string2)
##  [1] "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu" "qu"

As another example, imagine we want to find all sentences in sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression:

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
## [1] "red|orange|yellow|green|blue|purple"

Now we can select the sentences that contain a colour, and then extract the colour to figure out which one it is:

has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
## [1] "blue" "blue" "red"  "red"  "red"  "blue"

Note that str_extract() only extracts the first match. We can see that most easily by first selecting all the sentences that have more than 1 match:

more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match, html = TRUE)

To get all matches, use str_extract_all(). It returns a list (we will learn about lists later), or a matrix if you set simplify = TRUE:

str_extract_all(more, colour_match)
## [[1]]
## [1] "blue" "red" 
## 
## [[2]]
## [1] "green" "red"  
## 
## [[3]]
## [1] "orange" "red"
str_extract_all(more, colour_match, simplify = TRUE)
##      [,1]     [,2] 
## [1,] "blue"   "red"
## [2,] "green"  "red"
## [3,] "orange" "red"


Lab Exercise: There is a sentence in the previous example that doesn’t meet our criterion (“flickered” is not a colour). Think about how to remove it.


Grouped Matches

Earlier we talked about the use of parentheses for clarifying precedence and for back-references when matching. You can also use parentheses to extract parts of a complex match with the str_match() function.

For example, let’s see how to extract the year, month, and day from a date string like "2023-03-28":

date_string <- "2023-03-28"
str_match(date_string, "(\\d{4})-(\\d{2})-(\\d{2})")
##      [,1]         [,2]   [,3] [,4]
## [1,] "2023-03-28" "2023" "03" "28"

Here str_match() returns a matrix whose first column is the complete match; the next three columns contain the matches from each parenthesized group.
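Because the result is a matrix, individual groups can be pulled out by column. For example, a small sketch extracting just the year (the first group) from the match above:

date_match <- str_match(date_string, "(\\d{4})-(\\d{2})-(\\d{2})")
date_match[, 2]
## [1] "2023"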

If your data is in a tibble, it’s often easier to use tidyr::extract(). It works like str_match() but requires you to name the matches, which are then placed in new columns.

Let’s take the flights data set as an example. There is a column time_hour that contains all the date-time information:

flights1 <- flights %>%
  select(time_hour) %>%
  print()
## # A tibble: 336,776 × 1
##    time_hour          
##    <dttm>             
##  1 2013-01-01 05:00:00
##  2 2013-01-01 05:00:00
##  3 2013-01-01 05:00:00
##  4 2013-01-01 05:00:00
##  5 2013-01-01 06:00:00
##  6 2013-01-01 05:00:00
##  7 2013-01-01 06:00:00
##  8 2013-01-01 06:00:00
##  9 2013-01-01 06:00:00
## 10 2013-01-01 06:00:00
## # … with 336,766 more rows

To show how things work, we kept only the time_hour column and dropped all the others. Now let’s create new columns named “year”, “month”, “day”, “hour”, “minute”, and “second”, each extracted from the text of time_hour.

flights1 %>%
  extract(
    time_hour, 
    c("year", "month", "day", "hour", "minute", "second"),    
    "(\\d{4})-(\\d{2})-(\\d{2}) (\\d{2}):(\\d{2}):(\\d{2})",
    remove = FALSE, convert = TRUE
  ) %>%
  print()
## # A tibble: 336,776 × 7
##    time_hour            year month   day  hour minute second
##    <dttm>              <int> <int> <int> <int>  <int>  <int>
##  1 2013-01-01 05:00:00  2013     1     1     5      0      0
##  2 2013-01-01 05:00:00  2013     1     1     5      0      0
##  3 2013-01-01 05:00:00  2013     1     1     5      0      0
##  4 2013-01-01 05:00:00  2013     1     1     5      0      0
##  5 2013-01-01 06:00:00  2013     1     1     6      0      0
##  6 2013-01-01 05:00:00  2013     1     1     5      0      0
##  7 2013-01-01 06:00:00  2013     1     1     6      0      0
##  8 2013-01-01 06:00:00  2013     1     1     6      0      0
##  9 2013-01-01 06:00:00  2013     1     1     6      0      0
## 10 2013-01-01 06:00:00  2013     1     1     6      0      0
## # … with 336,766 more rows

Besides the data set, tidyr::extract() takes three main arguments: the column to be matched, a character vector of names for the new columns, and a regular expression with one group per new column. The match from each pair of parentheses is placed in the corresponding new column. remove = FALSE keeps the original column time_hour, and convert = TRUE converts the new columns to their most appropriate data types.
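As a minimal, self-contained sketch of the same idea (the toy tibble below is invented for illustration), extract() needs exactly one group per new column:

toy <- tibble(date = c("2023-03-28", "2024-12-01"))
toy %>%
  extract(date, c("year", "month", "day"), "(\\d{4})-(\\d{2})-(\\d{2})", convert = TRUE)
## # A tibble: 2 × 3
##    year month   day
##   <int> <int> <int>
## 1  2023     3    28
## 2  2024    12     1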


Another case study

Let’s take the tidied who data set (TB case numbers) as another example. After tidying data, we arrive at the following data frame:

who1 <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  )
who1
## # A tibble: 76,046 × 6
##    country     iso2  iso3   year key          cases
##    <chr>       <chr> <chr> <dbl> <chr>        <dbl>
##  1 Afghanistan AF    AFG    1997 new_sp_m014      0
##  2 Afghanistan AF    AFG    1997 new_sp_m1524    10
##  3 Afghanistan AF    AFG    1997 new_sp_m2534     6
##  4 Afghanistan AF    AFG    1997 new_sp_m3544     3
##  5 Afghanistan AF    AFG    1997 new_sp_m4554     5
##  6 Afghanistan AF    AFG    1997 new_sp_m5564     2
##  7 Afghanistan AF    AFG    1997 new_sp_m65       0
##  8 Afghanistan AF    AFG    1997 new_sp_f014      5
##  9 Afghanistan AF    AFG    1997 new_sp_f1524    38
## 10 Afghanistan AF    AFG    1997 new_sp_f2534    36
## # … with 76,036 more rows

As we know, the key column contains information about TB types, gender and age group. To recall the pattern:

  1. The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.

  2. The next two letters describe the type of TB:

  • rel stands for cases of relapse
  • ep stands for cases of extrapulmonary TB
  • sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
  • sp stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
  3. The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).

  4. The remaining numbers give the age group. The dataset groups cases into seven age groups:

  • 014 = 0 – 14 years old
  • 1524 = 15 – 24 years old
  • 2534 = 25 – 34 years old
  • 3544 = 35 – 44 years old
  • 4554 = 45 – 54 years old
  • 5564 = 55 – 64 years old
  • 65 = 65 or older

The following grouped regular expression would put all needed information into three new columns, “type”, “gender”, and “age_group”.

who1 %>%
  extract(key, 
          c("type", "gender", "age_Group"), 
          "new[_]?(.*)_(m|f)(\\d*)", 
          remove = F) %>%
  print()
## # A tibble: 76,046 × 9
##    country     iso2  iso3   year key          type  gender age_group cases
##    <chr>       <chr> <chr> <dbl> <chr>        <chr> <chr>  <chr>     <dbl>
##  1 Afghanistan AF    AFG    1997 new_sp_m014  sp    m      014           0
##  2 Afghanistan AF    AFG    1997 new_sp_m1524 sp    m      1524         10
##  3 Afghanistan AF    AFG    1997 new_sp_m2534 sp    m      2534          6
##  4 Afghanistan AF    AFG    1997 new_sp_m3544 sp    m      3544          3
##  5 Afghanistan AF    AFG    1997 new_sp_m4554 sp    m      4554          5
##  6 Afghanistan AF    AFG    1997 new_sp_m5564 sp    m      5564          2
##  7 Afghanistan AF    AFG    1997 new_sp_m65   sp    m      65            0
##  8 Afghanistan AF    AFG    1997 new_sp_f014  sp    f      014           5
##  9 Afghanistan AF    AFG    1997 new_sp_f1524 sp    f      1524         38
## 10 Afghanistan AF    AFG    1997 new_sp_f2534 sp    f      2534         36
## # … with 76,036 more rows

In this regular expression, "new" matches the literal “new” at the beginning of every string in the key column. [_]? matches either nothing or a single “_”, since for the rel type there is no “_” between “new” and “rel”.

The first group (.*), which runs up to the next “_”, captures the TB type code (“rel”, “ep”, “sn”, or “sp”). The second group (m|f) captures the gender code, and the third group (\\d*) captures the digits of the age group.
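To see the groups in action, it can help to test the pattern on a couple of sample keys with str_match() (a quick check, separate from the pipeline above):

str_match(c("new_sp_m014", "newrel_f65"), "new[_]?(.*)_(m|f)(\\d*)")
##      [,1]          [,2]  [,3] [,4] 
## [1,] "new_sp_m014" "sp"  "m"  "014"
## [2,] "newrel_f65"  "rel" "f"  "65" 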


Replacing matches

str_replace() and str_replace_all() allow you to replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
## [1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
## [1] "-ppl-"  "p--r"   "b-n-n-"

With str_replace_all() you can perform multiple replacements by supplying a named vector:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
## [1] "one house"    "two cars"     "three people"

In the who case study above, an alternative to the [_]? trick is to first replace "newrel" with "new_rel", so that all keys have the same format before we extract from them:

who2 <- who1 %>% 
  mutate(key = str_replace(key, "newrel", "new_rel")) %>%
  filter(str_detect(key, "new_rel")) # Only keep rows with "new_rel" for checking
who2
## # A tibble: 2,580 × 6
##    country     iso2  iso3   year key           cases
##    <chr>       <chr> <chr> <dbl> <chr>         <dbl>
##  1 Afghanistan AF    AFG    2013 new_rel_m014   1705
##  2 Afghanistan AF    AFG    2013 new_rel_f014   1749
##  3 Albania     AL    ALB    2013 new_rel_m014     14
##  4 Albania     AL    ALB    2013 new_rel_m1524    60
##  5 Albania     AL    ALB    2013 new_rel_m2534    61
##  6 Albania     AL    ALB    2013 new_rel_m3544    32
##  7 Albania     AL    ALB    2013 new_rel_m4554    44
##  8 Albania     AL    ALB    2013 new_rel_m5564    50
##  9 Albania     AL    ALB    2013 new_rel_m65      67
## 10 Albania     AL    ALB    2013 new_rel_f014      5
## # … with 2,570 more rows


Lab Exercise: Switch the first and last letters for all words in words and place the result in a new column.


Splitting

A particularly useful function is str_split(), which can split, for example, a sentence into words.

sentences %>%
  head(5) %>% 
  str_split(" ")
## [[1]]
## [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
## [8] "planks."
## 
## [[2]]
## [1] "Glue"        "the"         "sheet"       "to"          "the"        
## [6] "dark"        "blue"        "background."
## 
## [[3]]
## [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."
## 
## [[4]]
## [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
## [8] "rare"    "dish."  
## 
## [[5]]
## [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

Because each component might contain a different number of pieces, this returns a list.
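If a matrix is easier to work with than a list, str_split() can also pad the pieces into equal-length rows with simplify = TRUE (a brief sketch):

sentences %>%
  head(5) %>% 
  str_split(" ", simplify = TRUE)

Shorter sentences are padded with empty strings so that every row has the same number of columns.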

The splitting above is somewhat unsatisfactory, since the punctuation is included. So we can refine the pattern:

sentences1 <- sentences

str_sub(sentences1, -1, -1) <- ""  # Remove the last character which is a period

sentences1 %>%
  head(5) %>% 
  str_split("[^A-Za-z]+")
## [[1]]
## [1] "The"    "birch"  "canoe"  "slid"   "on"     "the"    "smooth" "planks"
## 
## [[2]]
## [1] "Glue"       "the"        "sheet"      "to"         "the"       
## [6] "dark"       "blue"       "background"
## 
## [[3]]
##  [1] "It"    "s"     "easy"  "to"    "tell"  "the"   "depth" "of"    "a"    
## [10] "well" 
## 
## [[4]]
## [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
## [8] "rare"    "dish"   
## 
## [[5]]
## [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls"

Here "a-z" and "A-Z" inside [] in a regular expression represent all lower-case and upper-case letters in English. So we are splitting the sentence by any non-letter character of length one or more (+ represents one time or more).

Actually, there is a simpler way to do this, using the boundary() function.

sentences %>%
  head(5) %>%
  str_split(boundary("word"))
## [[1]]
## [1] "The"    "birch"  "canoe"  "slid"   "on"     "the"    "smooth" "planks"
## 
## [[2]]
## [1] "Glue"       "the"        "sheet"      "to"         "the"       
## [6] "dark"       "blue"       "background"
## 
## [[3]]
## [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well" 
## 
## [[4]]
## [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
## [8] "rare"    "dish"   
## 
## [[5]]
## [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls"

Here boundary("word") tells str_split() to split at word boundaries. We can also split by “line_break”, “character”, or “sentence”.
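For example, boundary("character") splits a string into its individual characters (a small sketch):

str_split("apple", boundary("character"))
## [[1]]
## [1] "a" "p" "p" "l" "e"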


Other uses of regular expressions

There are two useful functions in base R that also use regular expressions:

  • apropos() searches all objects available from the global environment and returns those whose names match the given regular expression. This is useful if you can’t quite remember the name of a function.
apropos("replace")
## [1] "%+replace%"       "replace"          "replace_na"       "setReplaceMethod"
## [5] "str_replace"      "str_replace_all"  "str_replace_na"   "theme_replace"
  • dir() lists all the files in a directory. The pattern argument takes a regular expression and only returns file names that match the pattern. For example, you can find all the R Markdown files in the current directory with:
dir(pattern = "\\.Rmd$")
##  [1] "03-transformation-1.Rmd"                       
##  [2] "03-transformation-2.Rmd"                       
##  [3] "1-Introduction.Rmd"                            
##  [4] "2-Data Visualization1.Rmd"                     
##  [5] "3-Data Visualization2.Rmd"                     
##  [6] "4-Data Visualization3.Rmd"                     
##  [7] "Data Visualization Examples for Self Study.Rmd"
##  [8] "Example_R_Markdown_homework.Rmd"               
##  [9] "Homework_Recitation1.Rmd"                      
## [10] "Homework1.Rmd"                                 
## [11] "Homework2.Rmd"                                 
## [12] "Homework3.Rmd"                                 
## [13] "Homework4.Rmd"                                 
## [14] "Homework5.Rmd"                                 
## [15] "Midterm_project_EDA.Rmd"                       
## [16] "R Markdown Basics.Rmd"                         
## [17] "R Markdown Guideline.Rmd"                      
## [18] "R_Shiny_test.Rmd"                              
## [19] "RMD_1-introduction.Rmd"                        
## [20] "RMD_10-EDA2.Rmd"                               
## [21] "RMD_11-TidyData1.Rmd"                          
## [22] "RMD_12-TidyData2.Rmd"                          
## [23] "RMD_13-TidyData3.Rmd"                          
## [24] "RMD_14-Import_Data.Rmd"                        
## [25] "RMD_15-Strings.Rmd"                            
## [26] "RMD_16-Strings2.Rmd"                           
## [27] "RMD_2-Data Visualization1.Rmd"                 
## [28] "RMD_3-Data Visualization2.Rmd"                 
## [29] "RMD_4-Data Visualization3.Rmd"                 
## [30] "RMD_5-Data Visualization for Self Study.Rmd"   
## [31] "RMD_6-Data Transformation1.Rmd"                
## [32] "RMD_7-DataTransformation2.Rmd"                 
## [33] "RMD_8-ThePipe.Rmd"                             
## [34] "RMD_9-EDA1.Rmd"                                
## [35] "setup.Rmd"                                     
## [36] "Solution_to_HW5.Rmd"                           
## [37] "Solution_to_Lab_Exercises.Rmd"

In real applications, we frequently use regular expressions to search for file and folder names in coding projects.
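For example, a pattern like the following (the “Homework” prefix is just an assumption for illustration) would list only the homework notebooks from the directory above:

dir(pattern = "^Homework.*\\.Rmd$")

Based on the listing above, this would return the six Homework*.Rmd files.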