Introduction

Introduction

Strings play a big role in many data cleaning and preparations tasks. R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages.

Fret not, stringr is here!

stringr

  • stringr is a lightweight package designed by Hadley Wickham to assist with string manipulation.
  • Interacts seamlessly with the pipe ( %>% ) operator from dplyr / magrittr.
  • Much like Hadley's other packages, stringr's function names are consistent and its arguments are easy to understand.

Before We Begin

Review of Strings

  • Character strings in R are wrapped with quotes " "
  • Character strings can be letters "a", numbers "1", symbols "&", or both "a1&"
  • While numbers can be both integers and characters, letters and symbols have no integer meaning and thus create NAs.
as.integer(c("a", "&"))
## Warning: NAs introduced by coercion
## [1] NA NA

Before We Begin

Review of Strings

  • Concatenating strings and integers with the c() function will convert the integers to characters.
  • By default, R converts objects to their lowest denomination.
  • Factors reduce to integers and integers reduce to characters
c(factor("a"), "b", "&", 1)
## [1] "1" "b" "&" "1"
c(as.character(factor("a")), "b", "&", 1)
## [1] "a" "b" "&" "1"

Agenda

  • Getting Started with stringr
  • Basic String Operators
  • Regular Expressions
  • Pattern Matching Functions
  • Final Exercise

Getting Started

Loading stringr

install.packages("stringr")
library(stringr)

Getting Started

Head over to github.com/UCIDataScienceInitiative/AdvancedRWorkshop, open Introduction to Stringr and download Variables.R

  • You can also download the slides to follow along but avoid looking ahead at the exercise answers.

Then open Variables.R and load strings, fruit, and movie_titles

movie_titles <- c("gold diggers of broadway", "gone baby gone", 
    "gone in 60 seconds", "gone with the wind", "good girl, the", 
    "good burger", "goodbye girl, the", "good bye lenin!", 
    "goodfellas", "good luck chuck", "good morning, vietnam", 
    "good night, and good luck.", "good son, the", "good will hunting")

strings <- c(" 219 733 8965", "329-293-8753 ", "banana", "595 794 7569",
  "387 287 6718", "apple", "233.398.9187  ", "482 952 3315", 
  "239 923 8115 and 842 566 4692", "Work: 579-499-7527", "$1000", "Home: 543.355.3679")

fruit <- c("apple", "banana", "pear", "pineapple")

Basic String Operators

Basic String Operators

  • String operators are basic string manipulation functions
  • Many of them have equivalent base R functions that are much slower and bulkier

Basic String Operators

str_to_upper(string)

- converts strings to uppercase
  - ex. Convert all movie_titles to uppercase and store them as movie_titles

str_to_lower(string)

- converts strings to lowercase
  - ex. Convert all movie_titles back to lowercase and save as movie_titles

str_to_title(string)

- converts strings to title case
  - ex. Convert all movie_titles to titlecase and store them as movie_titles
  

str_to_upper(string)

movie_titles <- str_to_upper(movie_titles)
movie_titles
##  [1] "GOLD DIGGERS OF BROADWAY"   "GONE BABY GONE"            
##  [3] "GONE IN 60 SECONDS"         "GONE WITH THE WIND"        
##  [5] "GOOD GIRL, THE"             "GOOD BURGER"               
##  [7] "GOODBYE GIRL, THE"          "GOOD BYE LENIN!"           
##  [9] "GOODFELLAS"                 "GOOD LUCK CHUCK"           
## [11] "GOOD MORNING, VIETNAM"      "GOOD NIGHT, AND GOOD LUCK."
## [13] "GOOD SON, THE"              "GOOD WILL HUNTING"

str_to_lower(string)

movie_titles <- str_to_lower(movie_titles)
movie_titles
##  [1] "gold diggers of broadway"   "gone baby gone"            
##  [3] "gone in 60 seconds"         "gone with the wind"        
##  [5] "good girl, the"             "good burger"               
##  [7] "goodbye girl, the"          "good bye lenin!"           
##  [9] "goodfellas"                 "good luck chuck"           
## [11] "good morning, vietnam"      "good night, and good luck."
## [13] "good son, the"              "good will hunting"

str_to_title(string)

movie_titles <- str_to_title(movie_titles)
movie_titles
##  [1] "Gold Diggers Of Broadway"   "Gone Baby Gone"            
##  [3] "Gone In 60 Seconds"         "Gone With The Wind"        
##  [5] "Good Girl, The"             "Good Burger"               
##  [7] "Goodbye Girl, The"          "Good Bye Lenin!"           
##  [9] "Goodfellas"                 "Good Luck Chuck"           
## [11] "Good Morning, Vietnam"      "Good Night, And Good Luck."
## [13] "Good Son, The"              "Good Will Hunting"

Basic String Operators

str_c(string, sep = "")

- Joins together multiple strings including integers
- Is the stringr equivalent to paste(sep = "") or paste0()

str_length(string)

- Returns the string length
- Similar to base function nchar()
- str_length() converts factors to strings and also preserves NA's
nchar(NA)
## [1] 2
str_length(NA)
## [1] NA

Basic Strings Operators

str_sub(string, start, end) <- value

- Subsets text within a string or vector of strings by specifying
  start and end positions.
- Base equivalent function is substr()
- By default, end goes to the end of the word
fruit
## [1] "apple"     "banana"    "pear"      "pineapple"
str_sub(fruit, start = 3)
## [1] "ple"     "nana"    "ar"      "neapple"

Basic Strings Operators

str_dup(string, times)

- Duplicates strings by a number of times.
- Essentially copy / paste function
str_dup(fruit, 3)
## [1] "appleappleapple"             "bananabananabanana"         
## [3] "pearpearpear"                "pineapplepineapplepineapple"

Basic Strings Operators

str_trim(string, side = c("both", "left", "right"))

- Removes leading and trailing whitespaces
- Side argument defaults to "both"
- ex. trim the whitespace from both sides of every string in "strings"

str_pad(string, width, side = c("left", "both", "right"), pad = " ")

- Pads strings with whitespace to make them a certain length
- Width argument lets users specify the width of the padding
- Side argument defaults to "left"
- ex. pad "movie_titles" with whitespace to the right such that each title 
  becomes 30 characters long.

str_trim(string, side = c("both", "left", "right"))

ex. trim the whitespace from both sides of every string in strings

str_trim(strings)
##  [1] "219 733 8965"                  "329-293-8753"                 
##  [3] "banana"                        "595 794 7569"                 
##  [5] "387 287 6718"                  "apple"                        
##  [7] "233.398.9187"                  "482 952 3315"                 
##  [9] "239 923 8115 and 842 566 4692" "Work: 579-499-7527"           
## [11] "$1000"                         "Home: 543.355.3679"

str_pad(string, width, side = c("left", "both", "right"), pad = " ")

ex. pad movie_titles with whitespace to the right so that each title becomes 30 characters long.

str_pad(movie_titles, side = "right", 30)
##  [1] "Gold Diggers Of Broadway      " "Gone Baby Gone                "
##  [3] "Gone In 60 Seconds            " "Gone With The Wind            "
##  [5] "Good Girl, The                " "Good Burger                   "
##  [7] "Goodbye Girl, The             " "Good Bye Lenin!               "
##  [9] "Goodfellas                    " "Good Luck Chuck               "
## [11] "Good Morning, Vietnam         " "Good Night, And Good Luck.    "
## [13] "Good Son, The                 " "Good Will Hunting             "

Regular Expressions

Regular Expressions

Pattern matching functions use patterns, otherwise known as "regular expressions" or "regex", to identify specific characteristics in strings.

Common expressions:

- "a"  = is the letter "a"
- "^a" = starts with the letter "a"
- "a$" = ends with the letter "a"
- "[ ]" = contains any letter (or number) within the brackets
- "[ - ]" = contains any letter (or number) within this range
- "[^ae]" = everything except these letters (or numbers)
- "{3}" = repeat the last regex 3 times.

For more expressions or examples, refer to http://www.regular-expressions.info/refquick.html

Compound Expressions

Regular expressions can be combined to form compound expressions.

- "a"  = is the letter "a"
- "^a" = starts with the letter "a"
- "a$" = ends with the letter "a"
- "[ ]" = contains any letter (or number) within the brackets
- "[ - ]" = contains any letter (or number) within this range
- "[^ae]" = everything except these letters (or numbers)
- "{3}" = repeat the last regex 3 times.

California license plate:

- California plates start with a number, followed by 3 letters, followed by   
  another 3 numbers.
- Regex expression: "^[0-9][A-Z]{3}[0-9]{3}$"

Compound Expressions

- "a"  = is the letter "a"
- "^a" = starts with the letter "a"
- "a$" = ends with the letter "a"
- "[ ]" = contains any letter (or number) within the brackets
- "[ - ]" = contains any letter (or number) within this range
- "[^ae]" = everything except these letters (or numbers)
- "{3}" = repeat the last regex 3 times.
  • Your turn: create a regex expression that would identify any social security number. Please do not write your own…
    • Format: SSS-SS-SSSS where S is any number between 0 and 9

Social Security Example

  • Format: SSS-SS-SSSS where S is any number between 0 and 9
  • Regex expression:
    • "^[0-9]{3}-[0-9]{2}-[0-9]{4}$"

Pattern Matching Functions

Pattern Matching Functions

  • Now that we know how to build regular expressions, we can leverage these skills to perform even more advanced functions.
  • Pattern matching functions in stringr take advantage of the regex syntax to perform helpful tasks.
  • The usual form of these pattern matching functions consists of:
    • function(string, pattern)
    • string = a character string or a vector of character strings
    • pattern = your regex request

Pattern Matching Functions

str_detect(string, pattern)

- detects the presence of a pattern within a string or vector of strings
- returns a boolean (TRUE FALSE) vector
- ex. use str_detect in a way that returns any string that contains "apple".
str_detect(fruit, pattern = "^apple$")
## [1]  TRUE FALSE FALSE FALSE
fruit[str_detect(fruit, "^apple$")]
## [1] "apple"

Pattern Matching Functions

str_detect(fruit, pattern = "apple")
## [1]  TRUE FALSE FALSE  TRUE
fruit[str_detect(fruit, "apple")]
## [1] "apple"     "pineapple"

Pattern Matching Functions

str_locate(string, pattern)

- locates and returns the start and end position of the first instance of the pattern.
   - to locate more than one within a string, use str_locate_all(string, pattern)
- ex. use str_locate to find every position of "apple"
fruit
## [1] "apple"     "banana"    "pear"      "pineapple"
# on the second word, this pattern exists from the first character to the sixth
str_locate(fruit, "banana")
##      start end
## [1,]    NA  NA
## [2,]     1   6
## [3,]    NA  NA
## [4,]    NA  NA

Pattern Matching Functions

fruit
## [1] "apple"     "banana"    "pear"      "pineapple"
str_locate(fruit, "apple")
##      start end
## [1,]     1   5
## [2,]    NA  NA
## [3,]    NA  NA
## [4,]     5   9

Pattern Matching Functions

str_extract(string, pattern) or str_extract_all()

- matches the exact pattern to the string
- mainly used to extract compound patterns

str_match(string, pattern) or str_match_all()

- equivalent to str_extract except that str_match returns a matrix.
- str_(m)atch(): remember "m" for matrix!
labels <- c("a99", "a92", "a93l", "b99", "b92", "b93l",
            "c99", "c92", "c93l", "e99", "e92", "e93l")

# extract everything that begins with an "a" or "e" and ends with two numbers
str_extract(labels, "^[ae][0-9]{2}$")
##  [1] "a99" "a92" NA    NA    NA    NA    NA    NA    NA    "e99" "e92"
## [12] NA

Pattern Matching Functions

Exercise:

  • Extract every phone number from the variable "strings" that is composed of spaces " ". Extract it in matrix form.
strings
##  [1] " 219 733 8965"                 "329-293-8753 "                
##  [3] "banana"                        "595 794 7569"                 
##  [5] "387 287 6718"                  "apple"                        
##  [7] "233.398.9187  "                "482 952 3315"                 
##  [9] "239 923 8115 and 842 566 4692" "Work: 579-499-7527"           
## [11] "$1000"                         "Home: 543.355.3679"

Pattern Matching Functions

str_match(strings, pattern = "[1-9]{3} [1-9]{3} [1-9]{4}")
##       [,1]          
##  [1,] "219 733 8965"
##  [2,] NA            
##  [3,] NA            
##  [4,] "595 794 7569"
##  [5,] "387 287 6718"
##  [6,] NA            
##  [7,] NA            
##  [8,] "482 952 3315"
##  [9,] "239 923 8115"
## [10,] NA            
## [11,] NA            
## [12,] NA

Pattern Matching Functions

Exercise:

str_match_all(strings, pattern = "[1-9]{3} [1-9]{3} [1-9]{4}")
## [[1]]
##      [,1]          
## [1,] "219 733 8965"
## 
## [[2]]
##      [,1]
## 
## [[3]]
##      [,1]
## 
## [[4]]
##      [,1]          
## [1,] "595 794 7569"
## 
## [[5]]
##      [,1]          
## [1,] "387 287 6718"
## 
## [[6]]
##      [,1]
## 
## [[7]]
##      [,1]
## 
## [[8]]
##      [,1]          
## [1,] "482 952 3315"
## 
## [[9]]
##      [,1]          
## [1,] "239 923 8115"
## [2,] "842 566 4692"
## 
## [[10]]
##      [,1]
## 
## [[11]]
##      [,1]
## 
## [[12]]
##      [,1]

Pattern Matching Functions

Exercise:

str_match_all(strings, pattern = "[1-9]{3} [1-9]{3} [1-9]{4}") %>% 
  unlist() %>% matrix()
##      [,1]          
## [1,] "219 733 8965"
## [2,] "595 794 7569"
## [3,] "387 287 6718"
## [4,] "482 952 3315"
## [5,] "239 923 8115"
## [6,] "842 566 4692"
# ALTERNATIVELY
matrix(unlist(str_match_all(strings, pattern = "[1-9]{3} [1-9]{3} [1-9]{4}")))

Pattern Matching Functions

Exercise:

str_match_all(strings, pattern = "[1-9]{3} [1-9]{3} [1-9]{4}")
## [[1]]
##      [,1]          
## [1,] "219 733 8965"
## 
## [[2]]
##      [,1]
## 
## [[3]]
##      [,1]
## 
## [[4]]
##      [,1]          
## [1,] "595 794 7569"
## 
## [[5]]
##      [,1]          
## [1,] "387 287 6718"
## 
## [[6]]
##      [,1]
## 
## [[7]]
##      [,1]
## 
## [[8]]
##      [,1]          
## [1,] "482 952 3315"
## 
## [[9]]
##      [,1]          
## [1,] "239 923 8115"
## [2,] "842 566 4692"
## 
## [[10]]
##      [,1]
## 
## [[11]]
##      [,1]
## 
## [[12]]
##      [,1]

Pattern Matching Functions

tip: command shift m = %>%
str_match_all(strings, pattern = "[1-9]{3} [1-9]{3} [1-9]{4}") %>% 
  unlist()
## [1] "219 733 8965" "595 794 7569" "387 287 6718" "482 952 3315"
## [5] "239 923 8115" "842 566 4692"
str_match_all(strings, pattern = "[1-9]{3} [1-9]{3} [1-9]{4}") %>% 
  unlist() %>% matrix()
##      [,1]          
## [1,] "219 733 8965"
## [2,] "595 794 7569"
## [3,] "387 287 6718"
## [4,] "482 952 3315"
## [5,] "239 923 8115"
## [6,] "842 566 4692"

Pattern Matching Functions

str_replace(string, pattern, replacement)

- replaces the first instance of the matched pattern with the replacement string
- str_replace_all replaces all instances of the pattern with the replacement string
- str_replace_na replaces all NA with "NA".
str_replace(fruit, pattern = "a", replacement = "e") # only the first instance
## [1] "epple"     "benana"    "peer"      "pineepple"
str_replace_all(fruit, pattern = "a", replacement = "e") # every instance
## [1] "epple"     "benene"    "peer"      "pineepple"

Your turn: In movie_titles, replace all instances of "Good" with "Bad".

str_replace(string, pattern, replacement)

str_replace_all(movie_titles, pattern = "Good", replacement = "Bad")
##  [1] "Gold Diggers Of Broadway" "Gone Baby Gone"          
##  [3] "Gone In 60 Seconds"       "Gone With The Wind"      
##  [5] "Bad Girl, The"            "Bad Burger"              
##  [7] "Badbye Girl, The"         "Bad Bye Lenin!"          
##  [9] "Badfellas"                "Bad Luck Chuck"          
## [11] "Bad Morning, Vietnam"     "Bad Night, And Bad Luck."
## [13] "Bad Son, The"             "Bad Will Hunting"

Pattern Matching Functions

str_split(string, pattern)

  • splits a string into a variable number of pieces and returns a list of character vectors.
str_split(movie_titles, "[ ,]")
## [[1]]
## [1] "Gold"     "Diggers"  "Of"       "Broadway"
## 
## [[2]]
## [1] "Gone" "Baby" "Gone"
## 
## [[3]]
## [1] "Gone"    "In"      "60"      "Seconds"
## 
## [[4]]
## [1] "Gone" "With" "The"  "Wind"
## 
## [[5]]
## [1] "Good" "Girl" ""     "The" 
## 
## [[6]]
## [1] "Good"   "Burger"
## 
## [[7]]
## [1] "Goodbye" "Girl"    ""        "The"    
## 
## [[8]]
## [1] "Good"   "Bye"    "Lenin!"
## 
## [[9]]
## [1] "Goodfellas"
## 
## [[10]]
## [1] "Good"  "Luck"  "Chuck"
## 
## [[11]]
## [1] "Good"    "Morning" ""        "Vietnam"
## 
## [[12]]
## [1] "Good"  "Night" ""      "And"   "Good"  "Luck."
## 
## [[13]]
## [1] "Good" "Son"  ""     "The" 
## 
## [[14]]
## [1] "Good"    "Will"    "Hunting"

Pattern Matching Functions

str_split_fixed(string, pattern, n)

  • splits the string into a fixed number of pieces and returns a character matrix.
str_split_fixed(movie_titles, "[ ,]", 5)
##       [,1]         [,2]      [,3]      [,4]       [,5]        
##  [1,] "Gold"       "Diggers" "Of"      "Broadway" ""          
##  [2,] "Gone"       "Baby"    "Gone"    ""         ""          
##  [3,] "Gone"       "In"      "60"      "Seconds"  ""          
##  [4,] "Gone"       "With"    "The"     "Wind"     ""          
##  [5,] "Good"       "Girl"    ""        "The"      ""          
##  [6,] "Good"       "Burger"  ""        ""         ""          
##  [7,] "Goodbye"    "Girl"    ""        "The"      ""          
##  [8,] "Good"       "Bye"     "Lenin!"  ""         ""          
##  [9,] "Goodfellas" ""        ""        ""         ""          
## [10,] "Good"       "Luck"    "Chuck"   ""         ""          
## [11,] "Good"       "Morning" ""        "Vietnam"  ""          
## [12,] "Good"       "Night"   ""        "And"      "Good Luck."
## [13,] "Good"       "Son"     ""        "The"      ""          
## [14,] "Good"       "Will"    "Hunting" ""         ""

Questions?

Final Exercise

Final Exercise

Instructions

  1. Extract all phone numbers from the variable strings
  2. Remove all "-" and "."
  3. Split the numbers into a matrix
    • First column contains area codes and the second column contains the rest of the phone number.

Matrix should contain 10 phone numbers (rows) and 2 columns

Final Exercise

strings %>% str_match_all(pattern = "[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}") %>% 
  unlist() %>% str_replace_all(pattern = "[-. ]", replacement = " ") %>% 
  str_split_fixed(pattern = " ", 2)
##       [,1]  [,2]      
##  [1,] "219" "733 8965"
##  [2,] "329" "293 8753"
##  [3,] "595" "794 7569"
##  [4,] "387" "287 6718"
##  [5,] "233" "398 9187"
##  [6,] "482" "952 3315"
##  [7,] "239" "923 8115"
##  [8,] "842" "566 4692"
##  [9,] "579" "499 7527"
## [10,] "543" "355 3679"

Thank you