CIS 4730
Unstructured Data Management

Lab: Text processing

Rongen Zhang

Agenda

Package stringr

The stringr package provides a cohesive set of functions designed to make working with strings as easy as possible.

stringr is part of the core tidyverse, so library(tidyverse) loads it automatically. You can also load it on its own with a separate call to library(stringr).

library(tidyverse)
library(stringr) 

If you see any errors when loading those libraries:

install.packages("stringr")
library(stringr)

Why stringr?

Base R contains many functions to work with strings but we’ll avoid them because they can be inconsistent, which makes them hard to remember. Instead we’ll use functions from stringr for text processing. These have more intuitive names, and all start with str_.

The common str_ prefix is particularly useful if you use RStudio, because typing str_ triggers autocomplete, allowing you to see all stringr functions.

String basics

x = c("Georgia", "State", "University")
str_length(x)
## [1]  7  5 10

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
## [1] "App" "Ban" "Pea"
str_sub(x, -2, -1) # negative numbers count backwards from end
## [1] "le" "na" "ar"

str_c("Georgia", "State", "University")
## [1] "GeorgiaStateUniversity"
str_c("Georgia", "State", "University", sep=" ")
## [1] "Georgia State University"
str_c(c("Georgia", "State", "University"), sep=" ")
## [1] "Georgia"    "State"      "University"
str_c(c("Georgia", "State", "University"), collapse = " ") 
## [1] "Georgia State University"
#collapse:  optional string used to combine input vectors into single string.
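
The difference between sep and collapse is easiest to see when str_c receives more than one vector. The sketch below uses two made-up vectors (first and second are illustrative names, not part of the lab data): sep goes between the arguments element by element, and collapse then joins the resulting vector into one string.

```r
library(stringr)

# Hypothetical example vectors, chosen only for illustration
first  <- c("Georgia", "Ohio")
second <- c("State", "University")

# sep is inserted between the arguments, element by element
pairs <- str_c(first, second, sep = " ")
pairs
## [1] "Georgia State"   "Ohio University"

# collapse joins the elements of the result into a single string
joined <- str_c(first, second, sep = " ", collapse = "; ")
joined
## [1] "Georgia State; Ohio University"
```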

vec = c("Georgia", "State", "University")
str_to_upper(vec)
## [1] "GEORGIA"    "STATE"      "UNIVERSITY"
str_to_lower(vec)
## [1] "georgia"    "state"      "university"
str_sub(vec, 3, 5) <- str_to_upper(str_sub(vec, 3, 5)) #Extract and replace substrings from a character vector.
vec
## [1] "GeORGia"    "StATE"      "UnIVErsity"

Use the pipe operator to make the code more readable:

vec = c("Georgia", "State", "University")
str_sub(vec, 3, 5) = str_sub(vec, 3, 5) %>% 
  str_to_upper()
vec
## [1] "GeORGia"    "StATE"      "UnIVErsity"

x <- c("Apple", "Banana", "Pear")
str_split(x, pattern="a")
## [[1]]
## [1] "Apple"
## 
## [[2]]
## [1] "B" "n" "n" "" 
## 
## [[3]]
## [1] "Pe" "r"

x <- c("Apple", "Banana", "Pear")
str_split(x, pattern="o")
## [[1]]
## [1] "Apple"
## 
## [[2]]
## [1] "Banana"
## 
## [[3]]
## [1] "Pear"

x <- c("apple", "eggplant", "banana")
str_sort(x)
## [1] "apple"    "banana"   "eggplant"
str_sort(x, decreasing=TRUE)
## [1] "eggplant" "banana"   "apple"
y <- c("apple", "123", "Apple", "#")
str_sort(y)
## [1] "#"     "123"   "apple" "Apple"

Your turn

Given the input vector below:

x <- c("Atlanta", "New York", "Los Angeles")

Try to convert it to the following output using str_to_upper, str_sort and str_c:

## Cities: ATLANTA, LOS ANGELES, NEW YORK

x <- c("apple", "eggplant", "banana")
str_detect(x, "pl")
## [1]  TRUE  TRUE FALSE
str_locate(x, "pl")
##      start end
## [1,]     3   4
## [2,]     4   5
## [3,]    NA  NA

str_locate returns an integer matrix: the first column gives the start position of the match and the second column gives the end position. str_locate_all returns a list of integer matrices, one per input string.

x <- c("apple", "banana")
str_locate_all(x, "a") # notice [[1]] and [[2]]
## [[1]]
##      start end
## [1,]     1   1
## 
## [[2]]
##      start end
## [1,]     2   2
## [2,]     4   4
## [3,]     6   6
class(str_locate_all(x, "a")[[2]])
## [1] "matrix" "array"
str(str_locate_all(x, "a")[[2]])
##  int [1:3, 1:2] 2 4 6 2 4 6
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:2] "start" "end"
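
One way to use the matrix returned by str_locate is to feed its columns back into str_sub. This is a small sketch rather than a required pattern; loc and matched are illustrative variable names.

```r
library(stringr)

x <- c("apple", "eggplant", "banana")

# One row per input string; columns are named "start" and "end"
loc <- str_locate(x, "pl")

# Use the positions to pull out the matched text;
# strings without a match have NA positions and yield NA
matched <- str_sub(x, loc[, "start"], loc[, "end"])
matched
## [1] "pl" "pl" NA
```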

Your turn

R has a built-in vector of state names called state.name. Type state.name at the console to see it. Use this vector for the following:

x <- c("Atlanta, GA", "New York, NY", "Los Angeles, CA")
str_replace(x, "A", "@")
## [1] "@tlanta, GA"     "New York, NY"    "Los @ngeles, CA"
str_replace(x, "a", "@")
## [1] "Atl@nta, GA"     "New York, NY"    "Los Angeles, CA"
str_replace_all(x, "a", "@")
## [1] "Atl@nt@, GA"     "New York, NY"    "Los Angeles, CA"
str_replace_all(x, "a|A", "@")
## [1] "@tl@nt@, G@"     "New York, NY"    "Los @ngeles, C@"
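
As an alternative to listing both cases with "a|A", stringr's regex() helper accepts an ignore_case argument. The sketch below reuses the same city vector and should give the same result as the last call above; the name replaced is illustrative.

```r
library(stringr)

x <- c("Atlanta, GA", "New York, NY", "Los Angeles, CA")

# regex(..., ignore_case = TRUE) matches "a" and "A" alike
replaced <- str_replace_all(x, regex("a", ignore_case = TRUE), "@")
replaced
## [1] "@tl@nt@, G@"     "New York, NY"    "Los @ngeles, C@"
```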

Regular expression (regex)

Regular expression use cases

Regular expression syntax

Regular expressions typically specify characters (or character classes) to seek out, possibly with information about repeats and location within the string.

This is accomplished with the help of metacharacters that have specific meaning: $ * + . ? [ ] ^ { } | ( ) \.

We will use some small examples to introduce regular expression syntax and what these metacharacters mean.
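
Because metacharacters have special meaning, matching one literally requires escaping it with a backslash (written as \\ inside an R string) or wrapping the pattern in fixed(). The sketch below uses a made-up vector named prices to show the difference.

```r
library(stringr)

# Hypothetical example strings, chosen only for illustration
prices <- c("$5.00", "5000", "5+00")

# Unescaped, "." matches any character, so all three strings match
any_char <- str_detect(prices, "5.00")
any_char
## [1] TRUE TRUE TRUE

# "\\." matches only a literal period
literal_dot <- str_detect(prices, "5\\.00")
literal_dot
## [1]  TRUE FALSE FALSE

# fixed() treats the whole pattern as plain text, so "+" is not a quantifier
literal_plus <- str_detect(prices, fixed("5+00"))
literal_plus
## [1] FALSE FALSE  TRUE
```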

Quantifiers

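Quantifiers specify how many times the preceding character or group must repeat: * means zero or more, + means one or more, ? means zero or one, and {n}, {n,}, {n,m} give exact or bounded counts.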

x <- c("a", "ab", "acb", "accb", "acccb", "accccb")
str_extract(x, "ac*b")
## [1] NA       "ab"     "acb"    "accb"   "acccb"  "accccb"
str_extract(x, "ac+b")
## [1] NA       NA       "acb"    "accb"   "acccb"  "accccb"
str_extract(x, "ac?b")
## [1] NA    "ab"  "acb" NA    NA    NA

x <- c("a", "ab", "acb", "accb", "acccb", "accccb")
str_extract(x, "ac{2}b")
## [1] NA     NA     NA     "accb" NA     NA
str_extract(x, "ac{2,}b")
## [1] NA       NA       NA       "accb"   "acccb"  "accccb"
str_extract(x, "ac{2,3}b")
## [1] NA      NA      NA      "accb"  "acccb" NA
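
A quantifier applies only to the item immediately before it. To repeat a longer sequence, wrap it in parentheses. The following sketch uses a small made-up vector (y) to contrast the two cases.

```r
library(stringr)

# Hypothetical example strings, chosen only for illustration
y <- c("abab", "abb", "ab")

# {2} applies to "b" only: matches "a" followed by two "b"s
single <- str_extract(y, "ab{2}")
single
## [1] NA    "abb" NA

# {2} applies to the group "ab": matches "ab" repeated twice
grouped <- str_extract(y, "(ab){2}")
grouped
## [1] "abab" NA     NA
```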

Anchors

x <- c("abcd", "cdab", "cabd", "c abd")
x[str_detect(x, "ab")]
## [1] "abcd"  "cdab"  "cabd"  "c abd"
x[str_detect(x, "^ab")]
## [1] "abcd"
x[str_detect(x, "ab$")]
## [1] "cdab"
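
The two anchors can be combined to require that the pattern match the entire string. A minimal sketch, using a made-up vector named z:

```r
library(stringr)

# Hypothetical example strings, chosen only for illustration
z <- c("ab", "abcd", "cdab", "cab")

# ^ and $ together: the string must be exactly "ab"
whole <- z[str_detect(z, "^ab$")]
whole
## [1] "ab"
```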

Advanced operators

x <- c("abcd", "abc", "abe", "ab.")
x[str_detect(x, "..c")]
## [1] "abcd" "abc"
x[str_detect(x, "ab[ce]")]
## [1] "abcd" "abc"  "abe"
x[str_detect(x, "ab[^ce]")]
## [1] "ab."
x[str_detect(x, "abc|abe")]
## [1] "abcd" "abc"  "abe"
x[str_detect(x, "ab\\.")]
## [1] "ab."

Your turn

Continue with the previous state name vector. Find all state names that

Character classes

Character classes allow us to specify entire classes of characters, such as numbers, letters, etc. There are two flavors of character classes: one uses [: and :] around a predefined name inside square brackets, and the other uses \ followed by a special character. They are sometimes interchangeable.

x <- c("Atlanta, GA", "New York, NY")
str_split(x, ", ")
## [[1]]
## [1] "Atlanta" "GA"     
## 
## [[2]]
## [1] "New York" "NY"
str_split(x, "[:punct:][:blank:]")
## [[1]]
## [1] "Atlanta" "GA"     
## 
## [[2]]
## [1] "New York" "NY"
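
To illustrate the two flavors being interchangeable, the sketch below extracts runs of digits using the bracketed class [[:digit:]] and the backslash class \\d (the backslash is doubled inside an R string). The vector rooms is made up for illustration.

```r
library(stringr)

# Hypothetical example strings, chosen only for illustration
rooms <- c("Room 101", "Lab 7B", "Lobby")

# Predefined-name flavor
posix_style <- str_extract(rooms, "[[:digit:]]+")
posix_style
## [1] "101" "7"   NA

# Backslash flavor gives the same result here
backslash_style <- str_extract(rooms, "\\d+")
backslash_style
## [1] "101" "7"   NA
```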

Your turn

  1. Find all state names that start with South, North, or West

  2. Find all state names that contain the letter v (either upper or lower case) and end with the letter a.

  3. Complete the following code with your phone number regular expression:

x <- c("4044132000", "520-123-2000", "844.999.4500", 
       "10/30/2017", "100,000,000")
phone_number_regex_str = "your regex string goes here"
str_detect(x, phone_number_regex_str) # see below for the right output
## [1]  TRUE  TRUE  TRUE FALSE FALSE