CIS 4730 Unstructured Data Management

Agenda

String manipulation
Regular expression

Package stringr

The stringr package provides a cohesive set of functions designed to make working with strings as easy as possible.

stringr is part of the core tidyverse, so library(tidyverse) loads it automatically. You can also load it on its own with library(stringr).

library(tidyverse)
library(stringr)

If you see any errors when loading those libraries:

install.packages("stringr")
library(stringr)

Why stringr?

Base R contains many functions to work with strings but we’ll avoid them because they can be inconsistent, which makes them hard to remember. Instead we’ll use functions from stringr for text processing. These have more intuitive names, and all start with str_.

The common str_ prefix is particularly useful if you use RStudio, because typing str_ triggers autocomplete, which lets you see all stringr functions.

String basics - String length

x = c("Georgia", "State", "University")
str_length(x)

## [1]  7  5 10

String basics - Subsetting strings

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)

## [1] "App" "Ban" "Pea"

str_sub(x, -2, -1) # negative numbers count backwards from end

## [1] "le" "na" "ar"

String basics - Combining strings

str_c("Georgia", "State", "University")

## [1] "GeorgiaStateUniversity"

str_c("Georgia", "State", "University", sep=" ")

## [1] "Georgia State University"

str_c(c("Georgia", "State", "University"), sep=" ")

## [1] "Georgia"    "State"      "University"

str_c(c("Georgia", "State", "University"), collapse = " ")

## [1] "Georgia State University"

# collapse: optional string used to combine the elements of the result into a single string.
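sep and collapse can also be used together when str_c() receives several vectors: sep joins the vectors element by element, and collapse then joins the resulting elements into one string. A minimal sketch, using made-up vectors city and state that are not part of the slides:

```r
library(stringr)

# Hypothetical example vectors (not from the slides)
city  <- c("Atlanta", "Boston")
state <- c("GA", "MA")

# sep joins the two vectors element-wise
pairs <- str_c(city, state, sep = ", ")
pairs
## [1] "Atlanta, GA" "Boston, MA"

# collapse then merges those elements into a single string
one_string <- str_c(city, state, sep = ", ", collapse = "; ")
one_string
## [1] "Atlanta, GA; Boston, MA"
```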

String basics - To upper/lower case

vec = c("Georgia", "State", "University")
str_to_upper(vec)

## [1] "GEORGIA"    "STATE"      "UNIVERSITY"

str_to_lower(vec)

## [1] "georgia"    "state"      "university"

str_sub(vec, 3, 5) <- str_to_upper(str_sub(vec, 3, 5)) 
# Assigning to str_sub() replaces the selected substrings in place.
vec

## [1] "GeORGia"    "StATE"      "UnIVErsity"

Pipe %>%

Use the pipe operator to make the code more readable:

vec = c("Georgia", "State", "University")
str_sub(vec, 3, 5) = str_sub(vec, 3, 5) %>% 
  str_to_upper()
vec

## [1] "GeORGia"    "StATE"      "UnIVErsity"
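The benefit of the pipe grows with the number of steps. The sketch below compares a nested call with its piped equivalent; the vector fruits and the chosen steps are illustrative only, and magrittr is loaded explicitly on the assumption that the tidyverse is not attached.

```r
library(stringr)
library(magrittr)  # provides %>%; also attached by library(tidyverse)

fruits <- c("Apple", "Banana", "Pear")  # illustrative vector

# Nested calls have to be read from the inside out
nested <- str_c(str_to_lower(str_sub(fruits, 1, 3)), collapse = "-")

# The piped version reads top to bottom, one step per line
piped <- fruits %>%
  str_sub(1, 3) %>%
  str_to_lower() %>%
  str_c(collapse = "-")

piped
## [1] "app-ban-pea"
```

Both versions produce the same result; the pipe only changes how the steps are written.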

Splitting strings

str_split() splits a string into pieces. The pattern argument is the pattern to split on; by default it is interpreted as a regular expression.

x <- c("Apple", "Banana", "Pear")
str_split(x, pattern="a")

## [[1]]
## [1] "Apple"
## 
## [[2]]
## [1] "B" "n" "n" "" 
## 
## [[3]]
## [1] "Pe" "r"

Splitting strings (continued)

x <- c("Apple", "Banana", "Pear")
str_split(x, pattern="o")

## [[1]]
## [1] "Apple"
## 
## [[2]]
## [1] "Banana"
## 
## [[3]]
## [1] "Pear"
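Because the pattern is a regular expression by default, characters such as . carry a special meaning. One way to split on a literal character is to wrap the pattern in fixed(). The sketch below also uses simplify = TRUE to get a matrix instead of a list; the versions vector is made up for illustration.

```r
library(stringr)

versions <- c("1.2.3", "4.5.6")  # illustrative input

# fixed() treats "." as a literal dot rather than the regex "any character"
parts <- str_split(versions, fixed("."))
parts[[1]]
## [1] "1" "2" "3"

# simplify = TRUE returns a character matrix (one row per input string)
parts_matrix <- str_split(versions, fixed("."), simplify = TRUE)
dim(parts_matrix)
## [1] 2 3
```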

Sorting

x <- c("apple", "eggplant", "banana")
str_sort(x)

## [1] "apple"    "banana"   "eggplant"

str_sort(x, decreasing=TRUE)

## [1] "eggplant" "banana"   "apple"

y <- c("apple", "123", "Apple", "#")
str_sort(y)

## [1] "#"     "123"   "apple" "Apple"
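str_sort() compares strings character by character, so digits embedded in strings are not treated as numbers unless asked. A small sketch with the numeric argument, using made-up file names:

```r
library(stringr)

files <- c("file10", "file2", "file1")  # illustrative input

# Default: "file10" sorts before "file2" because "1" comes before "2"
default_order <- str_sort(files)
default_order
## [1] "file1"  "file10" "file2"

# numeric = TRUE compares runs of digits as numbers
numeric_order <- str_sort(files, numeric = TRUE)
numeric_order
## [1] "file1"  "file2"  "file10"
```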

Your turn

Given the input vector below:

x <- c("Atlanta", "New York", "Los Angeles")

Try to convert it to the following output using str_to_upper, str_sort and str_c:

## [1] "CITIES: ATLANTA, LOS ANGELES, NEW YORK"

Detecting and locating substrings

x <- c("apple", "eggplant", "banana")
str_detect(x, "pl")

## [1]  TRUE  TRUE FALSE

str_locate(x, "pl")

##      start end
## [1,]     3   4
## [2,]     4   5
## [3,]    NA  NA

str_locate() returns an integer matrix: the first column gives the start position of the match and the second column gives the end position. str_locate_all() returns a list of integer matrices, one per input string.

x <- c("apple", "banana")
str_locate_all(x, "a") # notice [[1]] and [[2]]

## [[1]]
##      start end
## [1,]     1   1
## 
## [[2]]
##      start end
## [1,]     2   2
## [2,]     4   4
## [3,]     6   6

class(str_locate_all(x, "a")[[2]])

## [1] "matrix" "array"

str(str_locate_all(x, "a")[[2]])

##  int [1:3, 1:2] 2 4 6 2 4 6
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:2] "start" "end"
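The start and end columns can be passed straight back to str_sub() to pull out the matched text. A minimal sketch, reusing the fruit vector from the earlier str_locate() example:

```r
library(stringr)

x <- c("apple", "eggplant", "banana")

# Locate the first "pl" in each string
loc <- str_locate(x, "pl")

# Use the located positions to extract the matched text;
# strings without a match (NA positions) give NA
found <- str_sub(x, loc[, "start"], loc[, "end"])
found
## [1] "pl" "pl" NA
```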

Your turn

R has a built-in vector of state names: type state.name to see it. Use this vector for the following:

Find all states with a space in their names
Find all states with ss in their names

Replacing substrings

x <- c("Atlanta, GA", "New York, NY", "Los Angeles, CA")
str_replace(x, "A", "@")

## [1] "@tlanta, GA"     "New York, NY"    "Los @ngeles, CA"

str_replace(x, "a", "@")

## [1] "Atl@nta, GA"     "New York, NY"    "Los Angeles, CA"

str_replace_all(x, "a", "@")

## [1] "Atl@nt@, GA"     "New York, NY"    "Los Angeles, CA"

str_replace_all(x, "a|A", "@")

## [1] "@tl@nt@, G@"     "New York, NY"    "Los @ngeles, C@"

Regular expression (regex)

A formal language for specifying text strings
Regular expression is a pattern that describes a specific set of strings with a common structure.
You may use online tools like https://regex101.com/ to help you test/validate your regular expressions
Regular expressions are also described in the official R documentation (see ?regex)

Regular expression use cases

identify match to a pattern: str_detect()
extract match to a pattern: str_extract(), str_extract_all()
locate pattern within a string, i.e. give the start position of matched patterns: str_locate(), str_locate_all()
replace a pattern: str_replace(), str_replace_all()
split a string using a pattern: str_split()
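To see how the use cases above relate to each other, the sketch below applies each function to the same string and the same pattern. The string s is made up, and the pattern "[0-9]+" (one or more digits) uses syntax that is explained in the following sections.

```r
library(stringr)

s <- "room 12, floor 3"   # illustrative string
digits <- "[0-9]+"        # one or more digits

str_detect(s, digits)            # is there a match?
## [1] TRUE

str_extract(s, digits)           # first match
## [1] "12"

str_extract_all(s, digits)[[1]]  # every match
## [1] "12" "3"

str_locate(s, digits)            # position of the first match
##      start end
## [1,]     6   7

str_replace_all(s, digits, "#")  # replace every match
## [1] "room #, floor #"

str_split(s, digits)[[1]]        # split at every match
## [1] "room "    ", floor " ""
```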

Regular expression syntax

Regular expressions typically specify characters (or character classes) to seek out, possibly with information about repeats and location within the string.

This is accomplished with the help of metacharacters that have specific meaning: $ * + . ? [ ] ^ { } | ( ) \.

We will use some small examples to introduce regular expression syntax and what these metacharacters mean.

Quantifiers

Quantifiers specify how many repetitions of the pattern.

*: matches zero or more times.
+: matches one or more times.
?: matches zero or one time.
{n}: matches exactly n times.
{n,}: matches at least n times.
{n,m}: matches between n and m times.

x <- c("a", "ab", "acb", "accb", "acccb", "accccb")
str_extract(x, "ac*b")

## [1] NA       "ab"     "acb"    "accb"   "acccb"  "accccb"

str_extract(x, "ac+b")

## [1] NA       NA       "acb"    "accb"   "acccb"  "accccb"

str_extract(x, "ac?b")

## [1] NA    "ab"  "acb" NA    NA    NA

x <- c("a", "ab", "acb", "accb", "acccb", "accccb")
str_extract(x, "ac{2}b")

## [1] NA     NA     NA     "accb" NA     NA

str_extract(x, "ac{2,}b")

## [1] NA       NA       NA       "accb"   "acccb"  "accccb"

str_extract(x, "ac{2,3}b")

## [1] NA      NA      NA      "accb"  "acccb" NA

Anchors

^: matches the start of the string.
$: matches the end of the string.

x <- c("abcd", "cdab", "cabd", "c abd")
x[str_detect(x, "ab")]

## [1] "abcd"  "cdab"  "cabd"  "c abd"

x[str_detect(x, "^ab")]

## [1] "abcd"

x[str_detect(x, "ab$")]

## [1] "cdab"
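Using both anchors in one pattern requires the whole string to match, not just part of it. A short sketch with a made-up vector:

```r
library(stringr)

x <- c("abcd", "cdab", "ab", "abab")  # illustrative input

# ^ab$ : the string must start with "ab" and end right after it
whole <- x[str_detect(x, "^ab$")]
whole
## [1] "ab"
```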

Advanced operators

.: matches any single character.
[...]: a character list, matches any one of the characters inside the square brackets. We can also use - inside the brackets to specify a range of characters.
[^...]: an inverted character list, similar to [...], but matches any characters except those inside the square brackets.
\: suppress the special meaning of metacharacters in regular expression, i.e. $ * + . ? [ ] ^ { } | ( ) \, similar to its usage in escape sequences. Since \ itself needs to be escaped in R, we need to escape these metacharacters with double backslash like \\$.
|: an “or” operator, matches patterns on either side of the |.

x <- c("abcd", "abc", "abe", "ab.")
x[str_detect(x, "..c")]

## [1] "abcd" "abc"

x[str_detect(x, "ab[ce]")]

## [1] "abcd" "abc"  "abe"

x[str_detect(x, "ab[^ce]")]

## [1] "ab."

x[str_detect(x, "abc|abe")]

## [1] "abcd" "abc"  "abe"

x[str_detect(x, "ab\\.")]

## [1] "ab."
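Two points from the list above are worth one more example: ranges inside brackets, and escaping a metacharacter other than the dot. The sketch below uses a made-up prices vector; fixed() is shown as an alternative that switches off regular expression interpretation altogether.

```r
library(stringr)

prices <- c("$5", "5 dollars", "cost: $12", "free")  # illustrative input

# A range inside brackets: [0-9] matches any single digit
has_digit <- prices[str_detect(prices, "[0-9]")]
has_digit
## [1] "$5"        "5 dollars" "cost: $12"

# "\\$" matches a literal dollar sign instead of the end-of-string anchor
has_dollar <- prices[str_detect(prices, "\\$")]
has_dollar
## [1] "$5"        "cost: $12"

# fixed("$") gives the same result without any escaping
has_dollar_fixed <- prices[str_detect(prices, fixed("$"))]
```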

Your turn

Using the built-in vector (s <- state.name), run two separate searches for state names that:

Start with “South”
End with “na”

## [1] "South Carolina" "South Dakota"

## [1] "Arizona"        "Indiana"        "Louisiana"      "Montana"       
## [5] "North Carolina" "South Carolina"

Character classes

Character classes allow us to specify entire classes of characters, such as digits or letters. There are two flavors of character classes: one uses [: and :] around a predefined name inside square brackets, and the other uses \ and a special character. They are sometimes interchangeable. The bracket equivalents below hold for ASCII text. Note that [A-z] is not a safe shorthand for letters, because it also matches the characters [ \ ] ^ _ ` that sit between Z and a.

[:digit:] or \d: digits, 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9].
\D: non-digits, equivalent to [^0-9].
[:lower:]: lower-case letters, equivalent to [a-z].
[:upper:]: upper-case letters, equivalent to [A-Z].
[:alpha:]: alphabetic characters, equivalent to [A-Za-z].
[:alnum:]: alphanumeric characters, equivalent to [A-Za-z0-9].
\w: word characters, equivalent to [[:alnum:]_] or [A-Za-z0-9_].
\W: non-word characters, equivalent to [^A-Za-z0-9_].
[:blank:]: blank characters, i.e. space and tab.
[:space:]: space characters: tab, newline, vertical tab, form feed, carriage return, space.
[:punct:]: punctuation characters, ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

x <- c("Atlanta, GA", "New York, NY")
str_split(x, ", ")

## [[1]]
## [1] "Atlanta" "GA"     
## 
## [[2]]
## [1] "New York" "NY"

str_split(x, "[:punct:][:blank:]")

## [[1]]
## [1] "Atlanta" "GA"     
## 
## [[2]]
## [1] "New York" "NY"
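The backslash shorthands follow the same escaping rule as metacharacters: inside an R string the backslash is doubled, so \d is written "\\d". A small sketch on a made-up string, also showing the bracketed form [[:digit:]] that works in both stringr and base R:

```r
library(stringr)

s <- "Order 66 shipped on 2017-10-30"  # illustrative string

# \d written as "\\d" in R: every run of digits
nums <- str_extract_all(s, "\\d+")[[1]]
nums
## [1] "66"   "2017" "10"   "30"

# The POSIX-style class gives the same result
nums_posix <- str_extract_all(s, "[[:digit:]]+")[[1]]

# An upper-case letter followed by lower-case letters
first_word <- str_extract(s, "[[:upper:]][[:lower:]]+")
first_word
## [1] "Order"
```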

Your turn

Find all state names that start with South, North, or West
Find all state names that contain the letter v (either upper or lower case) and end with the letter a.
Complete the following code with a regular expression that matches phone numbers:

x <- c("4044132000", "520-123-2000", "844.999.4500", 
       "10/30/2017", "100,000,000")
phone_number_regex_str = "your regex string goes here"
str_detect(x, phone_number_regex_str) # see below for the right output

## [1]  TRUE  TRUE  TRUE FALSE FALSE

Another exercise

Suppose the column names look like c1c5, c5c1, c4c3 … Retrieve all column names that start or end with c4 or c5.