String Processing

library(tidyverse)
library(rvest)

One of the most common data wrangling challenges involves extracting numeric data contained in character strings and converting it into the numeric representations required to make plots, summarize, or fit models in R. Also common is processing unorganized text into meaningful variable names or categorical variables.

In the Web Scraping section we encountered an example related to creating the murders dataset after successfully extracting the raw data from a web page into a table:

url <- "https://en.wikipedia.org/wiki/Murder_in_the_United_States_by_state"
murders_raw <- read_html(url) %>% html_nodes("table") %>% .[2] %>% 
               html_table %>% .[[1]] %>%
               setNames(c("state", "population", "murder_manslaughter_total", 
                          "murder_total", "gun_murder_total", "ownership", 
                          "murder_manslaughter_rate", "murder_rate", "gun_rate"))
head(murders_raw)
## # A tibble: 6 x 9
##   state      population murder_manslaug~ murder_total gun_murder_total ownership
##   <chr>      <chr>      <chr>            <chr>        <chr>                <dbl>
## 1 Alabama    4,853,875  348              —[a]         —[a]                  48.9
## 2 Alaska     737,709    59               57           39                    61.7
## 3 Arizona    6,817,565  306              278          171                   32.3
## 4 Arkansas   2,977,853  181              164          110                   57.9
## 5 California 38,993,940 1,861            1,861        1,275                 20.1
## 6 Colorado   5,448,819  176              176          115                   34.3
## # ... with 3 more variables: murder_manslaughter_rate <dbl>, murder_rate <chr>,
## #   gun_rate <chr>

We realized that two of the columns that we expected to contain numbers actually contained characters:

class(murders_raw$population)
## [1] "character"
class(murders_raw$murder_total)
## [1] "character"

This is a very common occurrence when web scraping, since webpages and other formal documents use commas in numbers to improve readability; for example, 4,853,875 is easier to read than 4853875. Because this is such a common task, there is already a function, parse_number, that readily does the conversion. However, many of the string processing challenges a data scientist faces are unique and often unexpected. It is therefore quite ambitious to write a comprehensive section on this topic, so here we use case studies to demonstrate how string processing is a powerful tool needed for many data wrangling challenges. Specifically, we show the original raw data used to create the data frames we have studied in this course and describe the string processing that was needed.
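
As a quick preview of parse_number, which comes with readr (loaded as part of the tidyverse), here is what it does with a comma-formatted number:

parse_number("4,853,875")
## [1] 4853875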

By going over these case studies, we will cover some of the most common tasks in string processing.

Defining strings: single and double quotes and how to escape

To define strings in R we can use either double quotes:

s <- "Hello!"

or single quotes:

s <- 'Hello!'

But be careful to choose the correct single quote, since using the backquote instead will give you an error:

s <- `Hello`
Error: object 'Hello' not found

Now, what happens if our string includes double quotes? For example, if we want to write 10 inches like this: 10". For this you can use single quotes.

You can’t use

s <- "10""

because this is just the string 10 followed by a double quote. If you type this into R you get an error because of the unclosed double quote. To avoid this, we can use single quotes:

s <- '10"'

In R, the function cat lets us see what the string actually looks like:

cat(s)
## 10"

Now, what if we want our string to be 5 feet written like this 5'? In this case we can use the double quotes:

s <- "5'"
cat(s)
## 5'

So we've learned how to write 5 feet and 10 inches separately, but what if we want to write them together to represent 5 feet and 10 inches, like this: 5'10"? In this case neither single nor double quotes will work. This

s <- '5'10"'

closes the string after 5 and this

s <- "5'10""

closes the string after 10. Note that if we type either of the above code snippets into R, it will get stuck waiting for us to close the open quote, and we will have to press the Esc key to get out. In this situation we need to escape the function of the quotes. This is done with the backslash \. You can escape either character like this:

s <- '5\'10"'
cat(s)
## 5'10"

or

s <- "5'10\""
cat(s)
## 5'10"

Escaping a character is something we often have to use when processing strings.
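
For example, to include an actual backslash in a string we have to escape the backslash itself with another backslash:

s <- "The backslash: \\"
cat(s)
## The backslash: \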

The stringr package

In general, string processing involves a string and a pattern. In R we usually store strings in a character vector such as murders_raw$population. Above we scraped the web to obtain the murders_raw data. The first three strings in the population variable are:

murders_raw$population[1:3]
## [1] "4,853,875" "737,709"   "6,817,565"

Note that the usual coercion does not work here:

as.numeric(murders_raw$population[1:3])
## Warning: NAs introduced by coercion
## [1] NA NA NA

This is because of the commas ,. The string processing we want to do here is remove the pattern , from the strings in murders_raw$population and then coerce to numbers.

In general, string processing tasks can be divided into detecting, locating, extracting, or replacing patterns in strings. In our example, we need to locate the commas and replace them with the empty string "".

Base R includes functions to perform all these tasks. However, they don't follow a unifying convention, which makes them a bit harder to memorize and use. The stringr package basically repackages this functionality, but uses a more consistent approach to naming functions and ordering their arguments. For example, in stringr all the string processing functions start with str_, so if you type this prefix and hit tab, R will auto-complete and show you the available functions, meaning we don't have to memorize all the function names. Another advantage is that the string is always the first argument, which means we can more easily use the pipe. So we will be focusing on the stringr package. However, because the R-base equivalents are so widely used, the table below describes the stringr functions and includes the R-base equivalent when available.
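
As a quick illustration of the argument order (using a small made-up vector), base R's grepl takes the pattern first, while str_detect takes the strings first, which is what makes it pipe-friendly:

s <- c("4,853,875", "737,709")
grepl(",", s)         # base R: pattern, then strings
## [1] TRUE TRUE
s %>% str_detect(",") # stringr: strings, then pattern
## [1] TRUE TRUE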

The str_ functions

library(stringr)

The table below includes the functions available to you in the stringr package. We split them by task.

All these functions take a character vector as first argument. Also, for each function, operations are vectorized: the operation gets applied to each string in the vector.
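
For example, str_length (described in the table below) is vectorized and returns one result per string:

str_length(c("one", "three", ""))
## [1] 3 5 0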

Finally note that in this table we mention groups. These will be explained later.

Task Description stringr R-base
Detect Is the pattern in the string? str_detect grepl
Detect Returns the index of entries that contain the pattern. str_which grep
Detect Returns the subset of strings that contain the pattern. str_subset grep with value = TRUE
Locate Returns positions of first occurrence of pattern in a string. str_locate regexpr
Locate Returns positions of all occurrences of pattern in a string. str_locate_all gregexpr
Locate Show the first part of the string that matches pattern. str_view
Locate Show me all the parts of the string that match the pattern. str_view_all
Extract Extract the first part of the string that matches the pattern. str_extract
Extract Extract all parts of the string that match the pattern. str_extract_all
Extract Extract the first part of the string that matches the pattern, along with the values defined by the groups. str_match
Extract Extract all parts of the string that match the pattern, along with the values defined by the groups. str_match_all
Extract Extract a substring. str_sub substring
Extract Split a string into a list with parts separated by pattern. str_split strsplit
Extract Split a string into a matrix with parts separated by pattern. str_split_fixed strsplit with fixed = TRUE
Describe Count number of times a pattern appears in a string. str_count
Describe Number of characters in string. str_length nchar
Replace Replace first part of string matching a pattern with another. str_replace
Replace Replace all parts of string matching a pattern with another. str_replace_all gsub
Replace Change all characters to upper case. str_to_upper toupper
Replace Change all characters to lower case. str_to_lower tolower
Replace Change the first letter of each word to upper case and the rest to lower case. str_to_title
Replace Replace all NAs with a new value. str_replace_na
Replace Remove white space from start and end of string. str_trim
Manipulate Join multiple strings. str_c paste0
Manipulate Change the encoding of the string. str_conv
Manipulate Sort the vector in alphabetical order. str_sort sort
Manipulate Index needed to order the vector in alphabetical order. str_order order
Manipulate Truncate a string to a fixed size. str_trunc
Manipulate Add white space to string to make it a fixed size. str_pad
Manipulate Repeat a string. str_dup rep then paste
Manipulate Wrap things into formatted paragraphs. str_wrap
Manipulate String interpolation. str_interp sprintf

Case Study 1: US murders data

Before, we used code to scrape the web and create the murders_raw table. We noted that four columns (two of which we care about) need to be parsed from characters into numbers, but that commas were making this difficult. We can use the str_detect function to see that four of the nine columns have commas in their entries:

commas <- function(x) any(str_detect(x, ","))
murders_raw %>% summarize_all(funs(commas))
## # A tibble: 1 x 9
##   state population murder_manslaughter_~ murder_total gun_murder_total ownership
##   <lgl> <lgl>      <lgl>                 <lgl>        <lgl>            <lgl>    
## 1 FALSE TRUE       TRUE                  TRUE         TRUE             FALSE    
## # ... with 3 more variables: murder_manslaughter_rate <lgl>, murder_rate <lgl>,
## #   gun_rate <lgl>

We can then use the str_replace_all function to remove them:

test_1 <- str_replace_all(murders_raw$population, ",", "")
test_1 <- as.numeric(test_1)
## Warning: NAs introduced by coercion

But we get a warning that NAs are produced. What is wrong? Look at the entry for Illinois:

murders_raw[13,]
## # A tibble: 1 x 9
##   state    population  murder_manslaugh~ murder_total gun_murder_total ownership
##   <chr>    <chr>       <chr>             <chr>        <chr>                <dbl>
## 1 Illinois 12,859,995~ 744               497          440                   26.2
## # ... with 3 more variables: murder_manslaughter_rate <dbl>, murder_rate <chr>,
## #   gun_rate <chr>

The population value for this row doesn't just contain commas, it also contains [5], a footnote reference added for this value in the Wikipedia table. We first need to remove these three characters, then the commas, and then convert the values to numbers.

We can use the same function, str_replace_all, to replace the [5] for this entry. We’ll see why we have to use \\ for each square bracket in a moment.

murders_raw <- murders_raw %>% mutate(population = str_replace_all(population, "\\[5\\]", ""))
murders_raw[10:14,]
## # A tibble: 5 x 9
##   state    population murder_manslaught~ murder_total gun_murder_total ownership
##   <chr>    <chr>      <chr>              <chr>        <chr>                <dbl>
## 1 Georgia  10,199,398 615                565          464                   31.6
## 2 Hawaii   1,425,157  19                 19           4                     45.1
## 3 Idaho    1,652,828  32                 30           24                    56.9
## 4 Illinois 12,859,995 744                497          440                   26.2
## 5 Indiana  6,612,768  373                272          209                   33.8
## # ... with 3 more variables: murder_manslaughter_rate <dbl>, murder_rate <chr>,
## #   gun_rate <chr>

Now we can run the code from above and can see that it works:

test_1 <- str_replace_all(murders_raw$population, ",", "")
test_1 <- as.numeric(test_1)
test_1[10:14]
## [1] 10199398  1425157  1652828 12859995  6612768

We can then use mutate_all to apply this operation to each column, since it won't affect the columns without commas. It turns out that this operation is so common that readr includes the function parse_number, specifically meant to remove non-numeric characters before coercing:

test_2 <- parse_number(murders_raw$population)
identical(test_1, test_2)
## [1] TRUE

So we can obtain our desired table using:

murders_new <- murders_raw %>% mutate_at(2:5, parse_number)
murders_new %>% head
## # A tibble: 6 x 9
##   state      population murder_manslaug~ murder_total gun_murder_total ownership
##   <chr>           <dbl>            <dbl>        <dbl>            <dbl>     <dbl>
## 1 Alabama       4853875              348           NA               NA      48.9
## 2 Alaska         737709               59           57               39      61.7
## 3 Arizona       6817565              306          278              171      32.3
## 4 Arkansas      2977853              181          164              110      57.9
## 5 California   38993940             1861         1861             1275      20.1
## 6 Colorado      5448819              176          176              115      34.3
## # ... with 3 more variables: murder_manslaughter_rate <dbl>, murder_rate <chr>,
## #   gun_rate <chr>

We still get NAs, but these are for truly missing values. This is actually due to the fact that the raw table we are using is for the year 2015, and not 2010 like the data available in dslabs.

This case is relatively simple compared to the string processing challenges that we typically face in data science. The next example is a rather complex one.

Case Study 2: Reported Heights

The dslabs package includes the raw data from which the heights dataset was obtained. You can load it like this:

library(dslabs)
data(reported_heights)

These heights were obtained using a web form in which students were asked to enter their heights. They could enter anything, but the instructions asked for height in inches, a number. We compiled 1095 submissions, but unfortunately the column vector with the reported heights had several non-numeric entries and as a result became a character vector:

class(reported_heights$height)
## [1] "character"

If we try to parse it into numbers we get a warning:

x <- as.numeric(reported_heights$height)
## Warning: NAs introduced by coercion

Although most values appear to be height in inches as requested,

head(x)
## [1] 75 70 68 74 61 65

we do end up with many NAs:

sum(is.na(x))
## [1] 81

We can see some of the entries that are not successfully converted by using filter to keep only the entries that result in NAs:

reported_heights %>% mutate(new_height = as.numeric(height)) %>%
                     filter(is.na(new_height)) %>% 
                     head(n = 10)
##             time_stamp    sex                 height new_height
## 1  2014-09-02 15:16:28   Male                  5' 4"         NA
## 2  2014-09-02 15:16:37 Female                  165cm         NA
## 3  2014-09-02 15:16:52   Male                    5'7         NA
## 4  2014-09-02 15:16:56   Male                  >9000         NA
## 5  2014-09-02 15:16:56   Male                   5'7"         NA
## 6  2014-09-02 15:17:09 Female                   5'3"         NA
## 7  2014-09-02 15:18:00   Male 5 feet and 8.11 inches         NA
## 8  2014-09-02 15:19:48   Male                   5'11         NA
## 9  2014-09-04 00:46:45   Male                  5'9''         NA
## 10 2014-09-04 10:29:44   Male                 5'10''         NA

We immediately see what is happening. Some of the students did not report their heights in inches as requested. We could discard these data and continue. However, many of the entries follow patterns that, in principle, we can easily convert to inches. For example, in the output above we see various cases that use the format x'y'' with x and y representing feet and inches respectively. Each one of these cases can be read and converted to inches by a human, for example 5'4'' is 5*12 + 4 = 64. So we could fix all the problematic entries by hand. However, humans are prone to making mistakes. Also, because we plan on continuing to collect data it will be convenient to write code that automatically does this.
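
To get a sense of what such code might look like for one well-behaved entry, here is a minimal sketch (purely illustrative; the general approach developed below handles many more formats):

x <- str_split("5'4", "'", simplify = TRUE)  # split into feet and inches
12*as.numeric(x[,1]) + as.numeric(x[,2])     # convert to inches
## [1] 64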

A first step in this type of task is to survey the problematic entries and try to define specific patterns followed by large groups of entries. The larger these groups the more entries we can fix with a single programmatic approach. We want to find patterns that can be accurately described with a rule: such as “a digit, followed by a feet symbol followed by one or two digits followed by an inches symbol”.

To look for such patterns it helps to remove the entries that are consistent with being inches and view only the problematic entries. We write a function to automatically do this. We keep entries that either result in NAs when applying as.numeric or are outside a range of plausible heights. We permit a range that covers about 99.9999% of the adult population. We also use suppressWarnings to avoid the warning message we know as.numeric will give us.

not_inches <- function(x, smallest = 50, tallest = 84){
  inches <- suppressWarnings(as.numeric(x))
  ind <- is.na(inches) | inches < smallest | inches > tallest
  ind
}

We apply this function and find that there are

problems <- reported_heights %>% 
  filter(not_inches(height)) %>%
  .$height
length(problems)
## [1] 292

problematic entries!

We can now view all the cases by simply printing them. We don’t do that here because there are 292, but after surveying them carefully we see three patterns that are followed by three large groups.

  1. A pattern of the form x'y or x' y'' or x'y" with x and y representing feet and inches respectively. Here are ten examples:
## 5' 4" 5'7 5'7" 5'3" 5'11 5'9'' 5'10'' 5' 10 5'5" 5'2"
  2. A pattern of the form x.y or x,y with x feet and y inches. Here are ten examples:
## 5.3 5.5 6.5 5.8 5.6 5,3 5.9 6,8 5.5 6.2
  3. Entries that were reported in centimeters rather than inches. Here are ten examples:
## 150 175 177 178 163 175 178 165 165 180

Once we see these large groups following specific patterns, we can develop a plan of attack. Keep in mind that there is rarely just one way to perform these tasks. Here we pick one that helps us teach several useful techniques, but surely there is a more efficient way of performing them.

Plan of attack

We will convert entries fitting the first two patterns into a standardized one. We will then leverage the standardization to extract the feet and inches and convert them to inches. We will then define a procedure for identifying entries that are in centimeters and convert them to inches as well. After applying these steps, we will check again to see what entries were not fixed and see if we can tweak our approach to be more comprehensive.

At the end we hope to have a script that makes a web-based data collection method robust to the most common user mistakes.

To achieve our goal we will use a technique that enables us to accurately detect patterns and extract the parts we want: regular expressions (regex).

Regular Expressions

library(stringr)

A regular expression (regex) is a way to describe a specific pattern of characters or text. It can be used to determine if a given string matches the pattern. A set of rules has been defined to do this efficiently and precisely, and here we show some examples. You can learn more about these rules in a detailed regex tutorial; the stringr cheat sheet is also very useful.

The patterns supplied to the stringr functions can be a regex rather than a standard string. We will learn how this works through a series of examples.

A string

Technically, any string is a regex; perhaps the simplest example is a single character. So the comma , used here

pattern <- ","
str_detect(murders_raw$murder_total, pattern) 
##  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE

is a simple example of searching with regex.

Above we noted that an entry included cm. This is also a simple example of a regex. We can show all the entries that used cm like this:

str_subset(reported_heights$height, "cm")
## [1] "165cm"  "170 cm"

Special characters

Now let’s consider a slightly more complicated example. Let’s ask: which of the following strings

yes <- c("180 cm", "70 inches")
no  <- c("180", "70''")
s   <- c(yes, no)

have the pattern cm or inches in it? We could call str_detect twice:

str_detect(s, "cm") | str_detect(s, "inches")
## [1]  TRUE  TRUE FALSE FALSE

However, we don't need to do this. The main feature that distinguishes the regex language from plain strings is that we can use special characters. These are characters with a special meaning. We start by introducing |, which means or. So if we want to know if either cm or inches appears in the strings, we can use the regex cm|inches:

str_detect(s, "cm|inches")
## [1]  TRUE  TRUE FALSE FALSE

and obtain the correct answer.

Another special character that will be useful for identifying feet and inches values is \d which means any digit: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. The backslash is used to distinguish it from the character d. In R we have to escape the backslash \ so we actually have to use \\d to represent digits. Here is an example:

yes <- c("5", "6", "5'10", "5 feet", "4'11")
no  <- c("", ".", "Five", "six")
s   <- c(yes, no)
pattern <- "\\d"
str_detect(s, pattern)
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

We take this opportunity to introduce the str_view function, which is helpful for troubleshooting as it shows us the first match for each string:

str_view(s, pattern)

and str_view_all shows us all the matches, so, for example, 5'10 and 4'11 each have three matches:

str_view_all(s, pattern)

There are many other special characters. We will learn some others below but you can see most or all of them in the cheat sheet mentioned earlier.

Testing

Throughout this section you will see that we create strings to test our regex. To do this we define patterns that we know should match and also patterns that we know should not. We will call them yes and no respectively. This permits us to check for the two types of errors: failing to match and incorrectly matching.
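
For instance, if we were testing a simple digit pattern, the check might look like this (the pattern and test strings here are just for illustration):

pattern <- "\\d"
yes <- c("5", "5'10")  # should match
no  <- c("five", "")   # should not match
str_detect(c(yes, no), pattern)
## [1]  TRUE  TRUE FALSE FALSE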

Character classes

Character classes are used to define a series of characters that can be matched. We define character classes with square brackets []. So, for example, if we want the pattern to match only if we have a 5 or a 6, we use the regex [56]:

str_view(s, "[56]")

Suppose we want to match values between 4 and 7. A common way to define character classes is with ranges. So, for example, [0-9] is equivalent to \\d. The pattern we want is [4-7].

yes <- as.character(4:7)
no  <- as.character(1:3)
s   <- c(yes, no)
str_detect(s, "[4-7]")
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

However, it is important to know that in regex everything is a character; there are no numbers. So 4 is the character 4, not the number four. Note, for example, that [1-20] does not mean 1 through 20; it means the characters 1 through 2 and the character 0. So [1-20] simply means the character class composed of 0, 1 and 2.
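
We can check this with single-character strings (chosen just for illustration):

str_detect(c("0", "1", "2", "3", "5"), "[1-20]")
## [1]  TRUE  TRUE  TRUE FALSE FALSE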

Note that characters do have an order and the digits follow the numeric order, so 0 comes before 1, which comes before 2, and so on. For the same reason we can define ranges of letters: [a-z] matches the lowercase letters, [A-Z] the uppercase letters, and [a-zA-Z] both.
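
For example, with illustrative strings:

str_detect(c("a", "Z", "5"), "[a-z]")
## [1]  TRUE FALSE FALSE
str_detect(c("a", "Z", "5"), "[a-zA-Z]")
## [1]  TRUE  TRUE FALSE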

Anchors

What if we want a match when we have exactly 1 digit? This will be useful in our case study since feet are never more than 1 digit, so this restriction will help us. One way to do this with regex is by using anchors, which let us define patterns that must start or end at specific places. The two most common anchors are ^ and $, which represent the beginning and end of a string respectively. So the pattern ^\d$ is read as "start of the string, followed by one digit, followed by the end of the string". Note how this pattern now only detects the strings with exactly one digit:

pattern <- "^\\d$"
yes <- c("1", "5", "9")
no  <- c("12", "123", " 1", "a4", "b")
s   <- c(yes, no)
str_view(s, pattern)

Note that " 1" does not match because it does not start with a digit but rather with a space.

Quantifiers

For the inches part, we can have one or two digits. This can be specified in regex with quantifiers. This is done by following the pattern with curly brackets containing the number of times the previous entry can repeat. So the pattern for one or two digits is:

pattern <- "^\\d{1,2}$"
yes     <- c("1", "5", "9", "12")
no      <- c("123", "a4", "b")
str_view(c(yes, no), pattern)

In this case 123 does not match, but 12 does. So now to look for our feet and inches pattern we can add the symbols for feet ' and inches " after the digits.

A pattern for feet and inches

With what we have learned we can now construct an example for the pattern x'y\" with x feet and y inches.

pattern <- "^[4-7]'\\d{1,2}\"$"

The pattern is now getting complex, but we can look at it carefully and break it down:

  • ^ = start of the string
  • [4-7] = one digit, either 4,5,6 or 7
  • ' = feet symbol
  • \\d{1,2} = one or two digits
  • \" = inches symbol
  • $ = end of the string

Let’s test it out:

yes <- c("5'7\"", "6'2\"",  "5'12\"")
no  <- c("6,2\"", "6.2\"","I am 5'11\"", "3'2\"", "64")
str_detect(yes, pattern)
## [1] TRUE TRUE TRUE
str_detect(no, pattern)
## [1] FALSE FALSE FALSE FALSE FALSE

Note that, for now, we are permitting the inches to be 12 or larger. We will add a restriction later as the regex for this is a bit more complex than we are ready to show.

Search and replace

Earlier we defined the object problems containing the strings that do not appear to be in inches. We can see that only

pattern <- "^[4-7]'\\d{1,2}\"$"
sum(str_detect(problems, pattern))
## [1] 14

match the pattern we defined. To see why, here are some examples that expose the reasons we don't have more matches:

problems[c(2, 10, 11, 12, 15)] %>% str_view(pattern)

A first problem we see immediately is that some students wrote out the words feet and inches. We can see the entries that did this with the str_subset function:

str_subset(problems, "inches")
## [1] "5 feet and 8.11 inches" "Five foot eight inches" "5 feet 7inches"        
## [4] "5ft 9 inches"           "5 ft 9 inches"          "5 feet 6 inches"

We also see that some entries used two single quotes '' instead of a double quote ".

str_subset(problems, "''")
##  [1] "5'9''"   "5'10''"  "5'10''"  "5'3''"   "5'7''"   "5'6''"   "5'7.5''"
##  [8] "5'7.5''" "5'10''"  "5'11''"  "5'10''"  "5'5''"

The first thing we can do, then, is replace the different ways of representing inches and feet with a uniform symbol. We will use ' for feet. For inches, we will simply not use a symbol, since some entries were of the form x'y. Now, since we no longer use the inches symbol, we have to change our pattern accordingly:

pattern <- "^[4-7]'\\d{1,2}$"

If we do this replacement before the matching we get many more matches:

problems %>% 
  str_replace("feet|ft|foot", "'") %>%   # replace feet, ft, foot with ' 
  str_replace("inches|in|''|\"", "") %>% # remove all inches symbols
  str_detect(pattern) %>% 
  sum
## [1] 48

However, we still have many cases to go. Note that in the code above we leveraged the stringr consistency and used the pipe.

White space \s

Another problem we have is spaces. The pattern 5' 4\" does not match because there is a space between ' and 4, which our pattern does not permit. Spaces are characters and R does not ignore them:

identical("Hi", "Hi ")
## [1] FALSE

In regex, \s represents white space. So to find patterns like 5' 4, we can change our pattern to

pattern_2 <- "^[4-7]'\\s\\d{1,2}\"$"
str_subset(problems, pattern_2)
## [1] "5' 4\""  "5' 11\"" "5' 7\""

So do we need more than one regex pattern? It turns out we can use a quantifier for this as well.

Quantifiers: *, ?, +

We want the pattern to permit spaces but not require them. Even if there are several spaces, like this 5'   4, we still want it to match. There is a quantifier for exactly this purpose. In regex, the character * means zero or more instances of the previous character. Here is an example:

yes <- c("AB", "A1B", "A11B", "A111B", "A1111B")
no  <- c("A2B", "A21B")
str_detect(yes, "A1*B")
## [1] TRUE TRUE TRUE TRUE TRUE
str_detect(no, "A1*B")
## [1] FALSE FALSE

Note that it matches the first string, which has zero 1s, and all the strings with one or more 1s. So we can improve our pattern by adding * after the space character \s.
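
For example, adding * after \s in pattern_2 gives a pattern (named pattern_3 here just for illustration) that matches with or without spaces:

pattern_3 <- "^[4-7]'\\s*\\d{1,2}\"$"
str_detect(c("5'4\"", "5' 4\"", "5'  4\""), pattern_3)
## [1] TRUE TRUE TRUE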

There are two other similar quantifiers. For none or once we can use ? and for one or more we can use +. You can see how they differ:

data.frame(string = c("AB", "A1B", "A11B", "A111B", "A1111B"),
           none_or_more = str_detect(yes, "A1*B"),
           none_or_once = str_detect(yes, "A1?B"),
           once_or_more = str_detect(yes, "A1+B"))
##   string none_or_more none_or_once once_or_more
## 1     AB         TRUE         TRUE        FALSE
## 2    A1B         TRUE         TRUE         TRUE
## 3   A11B         TRUE        FALSE         TRUE
## 4  A111B         TRUE        FALSE         TRUE
## 5 A1111B         TRUE        FALSE         TRUE

We will actually use all three in our reported heights example.

To improve our pattern we can add \\s* in front and after the feet symbol ' to permit space between the feet symbol and the numbers. Now we match a few more entries:

pattern <- "^[4-7]\\s*'\\s*\\d{1,2}$"
problems %>% 
  str_replace("feet|ft|foot", "'") %>% # replace feet, ft, foot with ' 
  str_replace("inches|in|''|\"", "") %>% # remove all inches symbols
  str_detect(pattern) %>% 
  sum
## [1] 53

We might be tempted to avoid doing this by removing all the spaces with str_replace_all. However, when doing such an operation we need to make sure that it does not have some unintended effect. In our reported heights example this would be a problem, because some entries are of the form x y, with a space separating the feet from the inches. If we remove all spaces, we will incorrectly turn x y into xy, which implies that a 6 1 would turn into 61 inches instead of 73.
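
For example, with a made-up entry:

str_replace_all("6 1", "\\s", "") # 6 feet 1 inch becomes the string "61"
## [1] "61"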

Groups

The second large group of problematic entries were of the form x.y, x,y and x y. We want to change all these to our common format x'y. But we can’t just do a search and replace because we would change values such as 70.5 into 70'5. Our strategy will therefore be to search for a very specific pattern that assures us feet and inches are being provided, then for those that match, replace appropriately.

Groups are a powerful aspect of regex that permits the extraction of values. Groups are defined using parentheses. They don't affect the pattern matching per se. Instead, they permit tools to identify specific parts of the pattern so we can extract them.

We want to change heights written like 5.6 to 5'6.

To avoid changing patterns such as 70.2 we will require that the first digit be between 4 and 7 [4-7] and that the second be none or more digits \\d*. Let’s start by defining a simple pattern that matches this

pattern_without_groups <- "^[4-7],\\d*$"

We want to extract the digits so we can then form the new version using an apostrophe. These are our two groups, so we encapsulate them with parentheses:

pattern_with_groups <-  "^([4-7]),(\\d*)$"

Note that we encapsulate the part of the pattern that matches the parts we want. Adding groups does not affect the detection, since it only signals that we want to save what is captured by the groups:

yes <- c("5,9", "5,11", "6,", "6,1")
no  <- c("5'9", ",", "2,8", "6.1.1")
s   <- c(yes, no)
str_detect(s, pattern_without_groups)
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
str_detect(s, pattern_with_groups)
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

Once we define groups, we can use the function str_match to extract the values these groups define:

str_match(s, pattern_with_groups)
##      [,1]   [,2] [,3]
## [1,] "5,9"  "5"  "9" 
## [2,] "5,11" "5"  "11"
## [3,] "6,"   "6"  ""  
## [4,] "6,1"  "6"  "1" 
## [5,] NA     NA   NA  
## [6,] NA     NA   NA  
## [7,] NA     NA   NA  
## [8,] NA     NA   NA

Note that the second and third columns contain feet and inches respectively. The first column is the part of the string matching the pattern. If no match occurred, we see an NA.

Now we can understand the difference between the functions str_extract and str_match: str_extract extracts only strings that match a pattern, not the values defined by groups:

str_extract(s, pattern_with_groups)
## [1] "5,9"  "5,11" "6,"   "6,1"  NA     NA     NA     NA

Search and replace using groups

Another powerful aspect of groups is that you can refer to the extracted values in a regex when searching and replacing.

The regex special character for the i-th group is \\i. So \\1 is the value extracted from the first group, \\2 the value from the second, and so on. As a simple example, note that the following code will replace a comma with an apostrophe, but only if it is between two digits:

pattern_with_groups <-  "^([4-7]),(\\d*)$"
yes <- c("5,9", "5,11", "6,", "6,1")
no  <- c("5'9", ",", "2,8", "6.1.1")
s   <- c(yes, no)
str_replace(s, pattern_with_groups, "\\1'\\2")
## [1] "5'9"   "5'11"  "6'"    "6'1"   "5'9"   ","     "2,8"   "6.1.1"

We can use this to convert cases in our reported heights.

Now we are ready to define a pattern that helps us convert all the x.y, x,y and x y to our preferred format. We need to adapt pattern_with_groups to be a bit more flexible and capture all these cases.

pattern_with_groups <-"^([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$"

Let’s break this one down:

  • ^ = start of the string
  • [4-7] = one digit, either 4,5,6 or 7
  • \\s* = none or more white space
  • [,\\.\\s+] = feet symbol is either ,, . or at least one space.
  • \\s* = none or more white space
  • \\d* = none or more digits
  • $ = end of the string

We can see that it appears to be working:

str_subset(problems, pattern_with_groups) %>% head
## [1] "5.3"  "5.25" "5.5"  "6.5"  "5.8"  "5.6"

and will be able to perform the search and replace:

str_subset(problems, pattern_with_groups) %>% 
  str_replace(pattern_with_groups, "\\1'\\2") %>% head
## [1] "5'3"  "5'25" "5'5"  "6'5"  "5'8"  "5'6"

Again, we will deal with inches larger than twelve later.

Test and improve

We have developed a powerful string processing technique that can help us catch many of the problematic entries. Now it's time to test our approach, search for further problems, and tweak our approach for possible improvements. Let's write a function that captures all the entries that can't be converted into numbers, keeping in mind that some are in centimeters (we will deal with those later).

not_inches_or_cm <- function(x, smallest = 50, tallest = 84){
  inches <- suppressWarnings(as.numeric(x))
  ind <- !is.na(inches) & 
    ((inches >= smallest & inches <= tallest) |
     (inches/2.54 >= smallest & inches/2.54 <= tallest))
  !ind
}
problems <- reported_heights %>% 
  filter(not_inches_or_cm(height)) %>%
  .$height
length(problems)
## [1] 200

Let’s see how many of these we can make fit our pattern after we implement the processing we developed. Here we leverage the pipe, one of the advantages of using stringr:

converted <- problems %>% 
  str_replace("feet|foot|ft", "'") %>% #convert feet symbols to '
  str_replace("inches|in|''|\"", "") %>%  #remove inches symbols
  str_replace("^([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$", "\\1'\\2") ##change format
pattern <- "^[4-7]\\s*'\\s*\\d{1,2}$"
index <- str_detect(converted, pattern)
sum(index)
## [1] 123
mean(index)
## [1] 0.615

We have well over half now. Let’s examine the remaining cases:

converted[!index]
##  [1] "6"             "165cm"         "511"           "6"            
##  [5] "2"             ">9000"         "5 ' and 8.11 " "11111"        
##  [9] "6"             "103.2"         "19"            "5"            
## [13] "300"           "6'"            "6"             "Five ' eight "
## [17] "7"             "214"           "6"             "0.7"          
## [21] "6"             "2'33"          "612"           "1,70"         
## [25] "87"            "5'7.5"         "5'7.5"         "111"          
## [29] "5' 7.78"       "12"            "6"             "yyy"          
## [33] "89"            "34"            "25"            "6"            
## [37] "6"             "22"            "684"           "6"            
## [41] "1"             "1"             "6*12"          "87"           
## [45] "6"             "1.6"           "120"           "120"          
## [49] "23"            "1.7"           "6"             "5"            
## [53] "69"            "5' 9 "         "5 ' 9 "        "6"            
## [57] "6"             "86"            "708,661"       "5 ' 6 "       
## [61] "6"             "649,606"       "10000"         "1"            
## [65] "728,346"       "0"             "6"             "6"            
## [69] "6"             "100"           "88"            "6"            
## [73] "170 cm"        "7,283,465"     "5"             "5"            
## [77] "34"

Several clear patterns arise, along with some other minor problems:

  1. Many students measuring exactly 5 or 6 feet did not enter any inches, for example 6' and our pattern requires that inches be included.
  2. Some students measuring exactly 5 or 6 feet entered just that number.
  3. Some of the inches were entered with decimal points. For example 5'7.5''. Our pattern only permits digits, not decimals.
  4. Some entries have spaces at the end, for example 5 ' 9
  5. Some entries are in meters and some of these use European decimals: 1.6, 1,7.
  6. Two students added cm.
  7. A student spelled out the numbers: Five foot eight inches

It is not necessarily clear that it is worth writing code to handle all these cases since they might be rare enough. However, some give us an opportunity to learn more regex techniques so we will build a fix.

Using groups and quantifiers

For case 2, if we add a '0 at the end, for example converting all 6 to 6'0, then our pattern will match. This can be done using groups:

yes <- c("5", "6", "5")
no  <- c("5'", "5''", "5'4")
s   <- c(yes, no)
s
## [1] "5"   "6"   "5"   "5'"  "5''" "5'4"
str_replace(s, "^([56])$", "\\1'0")
## [1] "5'0" "6'0" "5'0" "5'"  "5''" "5'4"

The pattern says it has to start (^), be followed by a 5 or a 6, and end there ($). The parentheses define the group that we pass as \\1 to the replacement regex.

We can adapt this code slightly to handle case 1 as well, which covers entries like 5'. Note that above, 5' was left untouched. This is because the extra ' makes the pattern not match, since we have to end with a 5 or 6. We want to permit the 5 or 6 to be followed by none or one feet symbol. We could add '{0,1} after the group to do this, but it is simpler to use the none or once special character ?. As we saw above, this is different from *, which is none or more. We now see that the fourth entry is also converted:

str_replace(s, "^([56])'?$", "\\1'0")
## [1] "5'0" "6'0" "5'0" "5'0" "5''" "5'4"

Note that here we only permit 5 and 6 but not 4 and 7. This is because 5 and 6 feet tall is quite common, so we assume those that typed 5 or 6 really meant 60 or 72 inches. However, 4 and 7 feet tall are so rare that, although we accept 84 as a valid entry, we assume 7 was entered in error.

We can use quantifiers to deal with case 3. These entries are not matched because the inches include decimals and our pattern does not permit this. We need to allow the second group to include decimals, not just digits. This means we must permit zero or one period . followed by zero or more digits, so we will use both ? and *. Also remember that, for this particular case, the period needs to be escaped since it is a special character (it means any character except a line break).
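
Here is a minimal sketch of how ? and * permit decimals in the inches part (the test strings are made up, and white space is ignored for simplicity):

yes <- c("5'7.5", "5'11")
no  <- c("5'7,5", "5'7.5.3")
str_detect(c(yes, no), "^[4-7]'\\d+\\.?\\d*$")
## [1]  TRUE  TRUE FALSE FALSE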

So we can adapt our pattern, currently ^[4-7]\\s*'\\s*\\d{1,2}$, to permit a decimal at the end:

pattern <- "^[4-7]\\s*'\\s*(\\d+\\.?\\d*)$"

We can approach case 5, meters using commas, similarly to how we converted x.y to x'y. A difference is that we require that the first digit be 1 or 2:

yes <- c("1,7", "1, 8", "2, " )
no  <- c("5,8", "5,3,2", "1.7")
s   <- c(yes, no)
str_replace(s, "^([12])\\s*,\\s*(\\d*)$", "\\1\\.\\2")
## [1] "1.7"   "1.8"   "2."    "5,8"   "5,3,2" "1.7"

We will later check if the entries are meters using their numeric values.

Trimming

In general, spaces at the start or end of the string are uninformative. These can be particularly deceptive because sometimes they can be hard to see:

s <- "Hi "
cat(s)
## Hi
identical(s, "Hi")
## [1] FALSE

This is a general enough problem that there is a function dedicated to removing them: str_trim.

str_trim("       5 ' 9 ")
## [1] "5 ' 9"

To upper and to lower case

One of the entries writes out numbers as words: Five foot eight inches. Although not efficient, we could add 12 extra str_replace calls to convert zero to 0, one to 1, and so on. To avoid having to write two separate operations for Zero and zero, One and one, etc., we can use the str_to_lower function to make all words lower case first:

s <- c("Five feet eight inches")
str_to_lower(s)
## [1] "five feet eight inches"

We are now ready to define a procedure that can handle all the problematic cases.

Putting it all together

We now put all this together into a function that takes a string vector and tries to convert as many strings as possible to one format, using the steps developed above.

convert_format <- function(s){
  s %>%
    str_replace("feet|foot|ft", "'") %>% #convert feet symbols to '
    str_replace_all("inches|in|''|\"|cm|and", "") %>%  #remove inches and other symbols
    str_replace("^([4-7])\\s*[,\\.\\s+]\\s*(\\d*)$", "\\1'\\2") %>% #change x.y, x,y and x y to x'y
    str_replace("^([56])'?$", "\\1'0") %>% #add '0 to lone 5 or 6 (and 5' or 6')
    str_replace("^([12])\\s*,\\s*(\\d*)$", "\\1\\.\\2") %>% #change European decimal
    str_trim() #remove extra space
}

We can also write a function that converts words to numbers:

words_to_numbers <- function(s){
  str_to_lower(s) %>%  
    str_replace_all("zero", "0") %>%
    str_replace_all("one", "1") %>%
    str_replace_all("two", "2") %>%
    str_replace_all("three", "3") %>%
    str_replace_all("four", "4") %>%
    str_replace_all("five", "5") %>%
    str_replace_all("six", "6") %>%
    str_replace_all("seven", "7") %>%
    str_replace_all("eight", "8") %>%
    str_replace_all("nine", "9") %>%
    str_replace_all("ten", "10") %>%
    str_replace_all("eleven", "11")
}

Now we can see which problematic entries remain:

converted <- problems %>% words_to_numbers %>% convert_format
pattern <- "^[4-7]\\s*'\\s*\\d+\\.?\\d*$"
index <- str_detect(converted, pattern)
converted[!index]
##  [1] "165"       "511"       "2"         ">9000"     "11111"     "103.2"    
##  [7] "19"        "300"       "7"         "214"       "0.7"       "2'33"     
## [13] "612"       "1.70"      "87"        "111"       "12"        "yyy"      
## [19] "89"        "34"        "25"        "22"        "684"       "1"        
## [25] "1"         "6*12"      "87"        "1.6"       "120"       "120"      
## [31] "23"        "1.7"       "69"        "86"        "708,661"   "649,606"  
## [37] "10000"     "1"         "728,346"   "0"         "100"       "88"       
## [43] "170"       "7,283,465" "34"

They all seem to be cases that are not convertible.

separate and extract

In a previous section we constructed a regex that lets us identify which elements of a character vector match the feet and inches pattern. However, we want to do more: we want to extract and save the feet and inches values so that we can convert them to inches when appropriate.

If we have a simpler case like this

s   <- c("5'10", "6'1")
tab <- data.frame(x = s)
tab
##      x
## 1 5'10
## 2  6'1

we have already learned to use the separate function:

tab %>% separate(x, c("feet", "inches"), sep = "'")
##   feet inches
## 1    5     10
## 2    6      1

The extract function from the tidyr package lets us use regex groups to extract the desired values. Here is code using extract that is equivalent to the separate code above:

tab %>% extract(x, c("feet", "inches"), regex = "(\\d)'(\\d{1,2})")
##   feet inches
## 1    5     10
## 2    6      1

So why do we even need the new function extract? We have seen how small changes can throw off exact pattern matching. Groups in regex give us more flexibility. For example if we define:

s   <- c("5'10", "6'1\"","5'8inches")
tab <- data.frame(x = s)
tab
##           x
## 1      5'10
## 2      6'1"
## 3 5'8inches

and we want only the numbers, then separate fails:

tab %>% separate(x, c("feet","inches"), sep = "'", fill = "right")
##   feet  inches
## 1    5      10
## 2    6      1"
## 3    5 8inches

But we can use extract. Because only what the groups capture is returned, the \" and the word inches do not end up in the values:

tab %>% extract(x, c("feet", "inches"), regex = "(\\d)'(\\d{1,2})")
##   feet inches
## 1    5     10
## 2    6      1
## 3    5      8

Case study 2 (continued)

We are now ready to put it all together and wrangle our reported heights data to try to recover as many heights as possible. The code is complex but we will break it down into parts.

We start by cleaning up the height column so that the heights are closer to a feet'inches format, using the words_to_numbers and convert_format functions we wrote above. We also keep the original entries in a new column so we can compare before and after.

Putting it all together continued

Now we are ready to wrangle our reported heights dataset:

pattern <- "^([4-7])\\s*'\\s*(\\d+\\.?\\d*)$"
smallest <- 50
tallest  <- 84
new_heights <- reported_heights %>% 
  mutate(original = height, 
         height = words_to_numbers(height) %>% convert_format()) %>%
  extract(height, c("feet", "inches"), regex = pattern, remove = FALSE) %>% 
  mutate_at(c("height", "feet", "inches"), as.numeric) %>%
  mutate(guess = 12*feet + inches) %>%
  mutate(height = case_when(
    !is.na(height) & between(height, smallest, tallest) ~ height, #inches 
    !is.na(height) & between(height/2.54, smallest, tallest) ~ height/2.54, #centimeters
    !is.na(height) & between(height*100/2.54, smallest, tallest) ~ height*100/2.54, #meters
    !is.na(guess) & inches < 12 & between(guess, smallest, tallest) ~ guess, #feet'inches
    TRUE ~ as.numeric(NA))) %>%
  select(-guess)
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
head(new_heights)
##            time_stamp    sex height feet inches original
## 1 2014-09-02 13:40:36   Male     75   NA     NA       75
## 2 2014-09-02 13:46:59   Male     70   NA     NA       70
## 3 2014-09-02 13:59:20   Male     68   NA     NA       68
## 4 2014-09-02 14:51:53   Male     74   NA     NA       74
## 5 2014-09-02 15:16:15   Male     61   NA     NA       61
## 6 2014-09-02 15:16:16 Female     65   NA     NA       65

We can check all the entries we converted by typing:

new_heights %>%
  filter(not_inches(original)) %>%
  arrange(height) %>%
  head()
##            time_stamp    sex   height feet inches original
## 1 2014-09-04 09:24:41 Female 59.05512   NA     NA      150
## 2 2015-12-22 19:03:56   Male 59.05512   NA     NA      150
## 3 2016-04-21 06:15:43   Male 59.84252   NA     NA      152
## 4 2014-09-06 22:38:40 Female 60.00000    5      0        5
## 5 2016-01-26 10:00:09 Female 60.00000    5      0        5
## 6 2017-08-21 05:10:17   Male 60.00000    5      0        5

One last comment. If we look at the shortest students in our course:

new_heights %>% arrange(height) %>% head(n=7)
##            time_stamp    sex height feet inches original
## 1 2017-07-04 01:30:25   Male  50.00   NA     NA       50
## 2 2017-09-07 10:40:35   Male  50.00   NA     NA       50
## 3 2014-09-02 15:18:30 Female  51.00   NA     NA       51
## 4 2016-06-05 14:07:20 Female  52.00   NA     NA       52
## 5 2016-06-05 14:07:38 Female  52.00   NA     NA       52
## 6 2014-09-23 03:39:56 Female  53.00   NA     NA       53
## 7 2015-01-07 08:57:29   Male  53.77   NA     NA    53.77

We see heights of 50, 51, 52, and 53, and the original entries are these same values. These short heights are rare and it is likely that the students actually meant 5'0, 5'1, 5'2, and 5'3. Because we are not completely sure, we will leave them as reported.

String splitting

Another very common data wrangling operation is string splitting. To illustrate how this comes up, suppose we did not have the function read_csv available to us and instead had to read a csv file using the base R function readLines:

filename <- system.file("extdata/murders.csv", package = "dslabs")
lines    <- readLines(filename)

This function reads in the data line by line and creates a vector of strings, one for each row in the file. The first six lines are:

lines %>% head()
## [1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135"     
## [3] "Alaska,AK,West,710231,19"          "Arizona,AZ,West,6392017,232"      
## [5] "Arkansas,AR,South,2915918,93"      "California,CA,West,37253956,1257"

We want to extract the values that are separated by a comma for each string in the vector. The command str_split does exactly this:

x <- str_split(lines, ",") 
x %>% head()
## [[1]]
## [1] "state"      "abb"        "region"     "population" "total"     
## 
## [[2]]
## [1] "Alabama" "AL"      "South"   "4779736" "135"    
## 
## [[3]]
## [1] "Alaska" "AK"     "West"   "710231" "19"    
## 
## [[4]]
## [1] "Arizona" "AZ"      "West"    "6392017" "232"    
## 
## [[5]]
## [1] "Arkansas" "AR"       "South"    "2915918"  "93"      
## 
## [[6]]
## [1] "California" "CA"         "West"       "37253956"   "1257"

Note that the first entry has the column names so we can separate that out:

col_names <- x[[1]]
col_names
## [1] "state"      "abb"        "region"     "population" "total"
x <- x[-1]
head(x)
## [[1]]
## [1] "Alabama" "AL"      "South"   "4779736" "135"    
## 
## [[2]]
## [1] "Alaska" "AK"     "West"   "710231" "19"    
## 
## [[3]]
## [1] "Arizona" "AZ"      "West"    "6392017" "232"    
## 
## [[4]]
## [1] "Arkansas" "AR"       "South"    "2915918"  "93"      
## 
## [[5]]
## [1] "California" "CA"         "West"       "37253956"   "1257"      
## 
## [[6]]
## [1] "Colorado" "CO"       "West"     "5029196"  "65"

To convert our list into a data frame, we can use a shortcut provided by the map functions in the purrr package. The map function applies the same function to each element in a list. So if we want to extract the first entry of each element in x, we can write:

library(purrr)
map(x, function(y) y[1]) %>% head()
## [[1]]
## [1] "Alabama"
## 
## [[2]]
## [1] "Alaska"
## 
## [[3]]
## [1] "Arizona"
## 
## [[4]]
## [1] "Arkansas"
## 
## [[5]]
## [1] "California"
## 
## [[6]]
## [1] "Colorado"

However, because this is such a common task, purrr provides a shortcut: if the second argument is an integer instead of a function, map assumes we want that entry of each element:

map(x, 1) %>% head()
## [[1]]
## [1] "Alabama"
## 
## [[2]]
## [1] "Alaska"
## 
## [[3]]
## [1] "Arizona"
## 
## [[4]]
## [1] "Arkansas"
## 
## [[5]]
## [1] "California"
## 
## [[6]]
## [1] "Colorado"

For map to return a character vector instead of a list, we can use map_chr. Similarly, map_int returns integers. So to create our data frame we can use:

dat <- data.frame(map_chr(x, 1),  
                  map_chr(x, 2),
                  map_chr(x, 3),
                  map_chr(x, 4),
                  map_chr(x, 5),
                  stringsAsFactors = FALSE) %>%
  mutate_all(parse_guess) %>%
  setNames(col_names)
dat %>% head
##        state abb region population total
## 1    Alabama  AL  South    4779736   135
## 2     Alaska  AK   West     710231    19
## 3    Arizona  AZ   West    6392017   232
## 4   Arkansas  AR  South    2915918    93
## 5 California  CA   West   37253956  1257
## 6   Colorado  CO   West    5029196    65

Using other functions included in the purrr package we can accomplish this with more efficient code:

dat <- x %>% transpose() %>%
             map( ~ parse_guess(unlist(.))) %>%
             setNames(col_names) %>% 
             as.data.frame()