## Introduction

Strings play a big role in many data cleaning and preparation tasks. R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages.

Fret not, stringr is here!

## stringr

• stringr is a lightweight package designed by Hadley Wickham to assist with string manipulation.
• Interacts seamlessly with the pipe ( %>% ) operator from dplyr / magrittr.
• Much like Hadley's other packages, stringr's function names are consistent and its arguments are easy to understand.

## Before We Begin

### Review of Strings

• Character strings in R are wrapped with quotes " "
• Character strings can be letters "a", numbers "1", symbols "&", or any combination of these "a1&"
• While numbers can be stored as either integers or characters, letters and symbols have no integer meaning, so coercing them to integer produces NAs.
as.integer(c("a", "&"))
## Warning: NAs introduced by coercion
## [1] NA NA

• Concatenating strings and integers with the c() function will convert the integers to characters.
• By default, R coerces every element of a vector to the most general type present.
• Factors fall back to their integer codes, and integers are converted to characters.
c(factor("a"), "b", "&", 1)
## [1] "1" "b" "&" "1"
c(as.character(factor("a")), "b", "&", 1)
## [1] "a" "b" "&" "1"
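
The coercion rule above can be sketched with a few base R calls. The vectors below are made-up illustrations, not part of the original slides.

```r
# Mixing types in c() coerces every element to the most general type present.
int_and_dbl <- c(2L, 3.5)       # integer + double    -> double
int_and_chr <- c(2L, "b")       # integer + character -> character

class(int_and_dbl)
# [1] "numeric"
class(int_and_chr)
# [1] "character"

# Going the other way needs an explicit conversion, and only works
# for strings that actually look like numbers.
as.integer(c("1", "22"))
# [1]  1 22
```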

## Agenda

• Getting Started with stringr
• Basic String Operators
• Regular Expressions
• Pattern Matching Functions
• Final Exercise

## Getting Started

install.packages("stringr")
library(stringr)

Then open Variables.R and load strings, fruit, and movie_titles

movie_titles <- c("gold diggers of broadway", "gone baby gone",
"gone in 60 seconds", "gone with the wind", "good girl, the",
"good burger", "goodbye girl, the", "good bye lenin!",
"goodfellas", "good luck chuck", "good morning, vietnam",
"good night, and good luck.", "good son, the", "good will hunting")

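strings <- c(" 219 733 8965", "329-293-8753 ", "banana", "595 794 7569",
"387 287 6718", "apple", "233.398.9187  ", "482 952 3315",
"239 923 8115 and 842 566 4692", "Work: 579-499-7527",
"$1000", "Home: 543.355.3679")

fruit <- c("apple", "banana", "pear", "pineapple")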

## str_pad(string, width, side = c("left", "both", "right"), pad = " ")

#### ex. pad movie_titles with whitespace to the right so that each title becomes 30 characters long.

str_pad(movie_titles, width = 30, side = "right")
##  [1] "Gold Diggers Of Broadway      " "Gone Baby Gone                "
##  [3] "Gone In 60 Seconds            " "Gone With The Wind            "
##  [5] "Good Girl, The                " "Good Burger                   "
##  [7] "Goodbye Girl, The             " "Good Bye Lenin!               "
##  [9] "Goodfellas                    " "Good Luck Chuck               "
## [11] "Good Morning, Vietnam         " "Good Night, And Good Luck.    "
## [13] "Good Son, The                 " "Good Will Hunting             "
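
The pad argument is not limited to whitespace. As a minimal sketch, assuming a hypothetical vector of ID strings (not from the slides), str_pad can zero-pad values to a common width:

```r
library(stringr)

# Hypothetical ID values, used only for illustration.
ids <- c("7", "42", "365")

# Left-pad with zeros so every ID is 5 characters wide.
padded <- str_pad(ids, width = 5, side = "left", pad = "0")
padded
# [1] "00007" "00042" "00365"

# Strings already longer than the target width are returned unchanged.
too_long <- str_pad("123456", width = 5, pad = "0")
too_long
# [1] "123456"
```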

## Regular Expressions

Pattern matching functions use patterns, otherwise known as "regular expressions" or "regex", to identify specific characteristics in strings.

#### Common expressions:

- "a"  = is the letter "a"
- "^a" = starts with the letter "a"
- "a$" = ends with the letter "a"
- "[ ]" = contains any letter (or number) within the brackets
- "[ - ]" = contains any letter (or number) within this range
- "[^ae]" = everything except these letters (or numbers)
- "{3}" = repeat the last regex 3 times.

For more expressions or examples, refer to http://www.regular-expressions.info/refquick.html

## Compound Expressions

Regular expressions can be combined to form compound expressions.

- California plates start with a number, followed by 3 letters, followed by another 3 numbers.
- Regex expression: "^[0-9][A-Z]{3}[0-9]{3}$"

• Your turn: create a regex expression that would identify any social security number. Please do not write your own…
• Format: SSS-SS-SSSS where S is any number between 0 and 9
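
To see a compound expression at work, the California plate pattern can be tested with str_detect. This is only a sketch: the plate strings below are invented for illustration.

```r
library(stringr)

# Hypothetical plate strings, not from the slides.
plates <- c("7ABC123", "ABC1234", "7abc123", "7ABC1234")

# One digit, three capital letters, three digits, and nothing else.
plate_pattern <- "^[0-9][A-Z]{3}[0-9]{3}$"

is_plate <- str_detect(plates, plate_pattern)
is_plate
# [1]  TRUE FALSE FALSE FALSE
```

The anchors ^ and $ matter here: without them, "7ABC1234" would also match because it contains a valid plate as a substring.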

## Social Security Example

• Format: SSS-SS-SSSS where S is any number between 0 and 9
• Regex expression:
• "^[0-9]{3}-[0-9]{2}-[0-9]{4}$"

## Pattern Matching Functions

• Now that we know how to build regular expressions, we can leverage these skills to perform even more advanced functions.
• Pattern matching functions in stringr take advantage of the regex syntax to perform helpful tasks.
• The usual form of these pattern matching functions consists of:
• function(string, pattern)
• string = a character string or a vector of character strings
• pattern = your regex request

## Pattern Matching Functions

#### str_detect(string, pattern)

- detects the presence of a pattern within a string or vector of strings
- returns a boolean (TRUE FALSE) vector
- ex. use str_detect in a way that returns any string that contains "apple".
str_detect(fruit, pattern = "^apple$")
## [1]  TRUE FALSE FALSE FALSE
fruit[str_detect(fruit, "^apple$")]
## [1] "apple"

## Pattern Matching Functions

str_detect(fruit, pattern = "apple")
## [1]  TRUE FALSE FALSE  TRUE
fruit[str_detect(fruit, "apple")]
## [1] "apple"     "pineapple"

## Pattern Matching Functions

#### str_locate(string, pattern)

- locates and returns the start and end position of the first instance of the pattern.
- to locate more than one within a string, use str_locate_all(string, pattern)
- ex. use str_locate to find every position of "apple"
fruit
## [1] "apple"     "banana"    "pear"      "pineapple"
# on the second word, this pattern exists from the first character to the sixth
str_locate(fruit, "banana")
##      start end
## [1,]    NA  NA
## [2,]     1   6
## [3,]    NA  NA
## [4,]    NA  NA

## Pattern Matching Functions

fruit
## [1] "apple"     "banana"    "pear"      "pineapple"
str_locate(fruit, "apple")
##      start end
## [1,]     1   5
## [2,]    NA  NA
## [3,]    NA  NA
## [4,]     5   9

## Pattern Matching Functions

#### str_extract(string, pattern) or str_extract_all()

- matches the exact pattern to the string
- mainly used to extract compound patterns

#### str_match(string, pattern) or str_match_all()

- equivalent to str_extract except that str_match returns a matrix.
- str_(m)atch(): remember "m" for matrix!
labels <- c("a99", "a92", "a93l", "b99", "b92", "b93l",
"c99", "c92", "c93l", "e99", "e92", "e93l")
# extract everything that begins with an "a" or "e" and ends with two numbers
str_extract(labels, "^[ae][0-9]{2}$")
##  [1] "a99" "a92" NA    NA    NA    NA    NA    NA    NA    "e99" "e92"
## [12] NA
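
One reason str_match returns a matrix is that it can also report capture groups, which are the parts of the pattern wrapped in parentheses. A minimal sketch, reusing a shortened version of the labels vector and a grouped variant of the pattern (the grouping is an addition for illustration):

```r
library(stringr)

labels <- c("a99", "a93l", "b92", "e92")

# Column 1 holds the full match; columns 2 and 3 hold the
# leading letter and the two digits captured by the parentheses.
parts <- str_match(labels, "^([ae])([0-9]{2})$")
parts
#      [,1]  [,2] [,3]
# [1,] "a99" "a"  "99"
# [2,] NA    NA   NA
# [3,] NA    NA   NA
# [4,] "e92" "e"  "92"
```

str_extract on the same input would return only the first column as a plain character vector.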

## Pattern Matching Functions

#### Exercise:

• Extract every phone number in the variable "strings" whose groups of digits are separated by spaces " ". Return the result in matrix form.
strings
##  [1] " 219 733 8965"                 "329-293-8753 "
##  [3] "banana"                        "595 794 7569"
##  [5] "387 287 6718"                  "apple"
##  [7] "233.398.9187  "                "482 952 3315"
##  [9] "239 923 8115 and 842 566 4692" "Work: 579-499-7527"
## [11] "$1000"                         "Home: 543.355.3679"

## Pattern Matching Functions

str_match(strings, pattern = "[0-9]{3} [0-9]{3} [0-9]{4}")
##       [,1]
##  [1,] "219 733 8965"
##  [2,] NA
##  [3,] NA
##  [4,] "595 794 7569"
##  [5,] "387 287 6718"
##  [6,] NA
##  [7,] NA
##  [8,] "482 952 3315"
##  [9,] "239 923 8115"
## [10,] NA
## [11,] NA
## [12,] NA

## Pattern Matching Functions

#### Exercise:

str_match_all(strings, pattern = "[0-9]{3} [0-9]{3} [0-9]{4}")
## [[1]]
##      [,1]
## [1,] "219 733 8965"
##
## [[2]]
##      [,1]
##
## [[3]]
##      [,1]
##
## [[4]]
##      [,1]
## [1,] "595 794 7569"
##
## [[5]]
##      [,1]
## [1,] "387 287 6718"
##
## [[6]]
##      [,1]
##
## [[7]]
##      [,1]
##
## [[8]]
##      [,1]
## [1,] "482 952 3315"
##
## [[9]]
##      [,1]
## [1,] "239 923 8115"
## [2,] "842 566 4692"
##
## [[10]]
##      [,1]
##
## [[11]]
##      [,1]
##
## [[12]]
##      [,1]

## Pattern Matching Functions

#### Exercise:

str_match_all(strings, pattern = "[0-9]{3} [0-9]{3} [0-9]{4}") %>%
  unlist() %>% matrix()
##      [,1]
## [1,] "219 733 8965"
## [2,] "595 794 7569"
## [3,] "387 287 6718"
## [4,] "482 952 3315"
## [5,] "239 923 8115"
## [6,] "842 566 4692"
# ALTERNATIVELY
matrix(unlist(str_match_all(strings, pattern = "[0-9]{3} [0-9]{3} [0-9]{4}")))

## Pattern Matching Functions

##### tip: in RStudio, Cmd/Ctrl + Shift + M inserts %>%
str_match_all(strings, pattern = "[0-9]{3} [0-9]{3} [0-9]{4}") %>%
  unlist()
## [1] "219 733 8965" "595 794 7569" "387 287 6718" "482 952 3315"
## [5] "239 923 8115" "842 566 4692"

## Pattern Matching Functions

#### str_replace(string, pattern, replacement)

- replaces the first instance of the matched pattern with the replacement string
- str_replace_all replaces all instances of the pattern with the replacement string
- str_replace_na replaces all NA with "NA".
str_replace(fruit, pattern = "a", replacement = "e") # only the first instance
## [1] "epple"     "benana"    "peer"      "pineepple"
str_replace_all(fruit, pattern = "a", replacement = "e") # every instance
## [1] "epple"     "benene"    "peer"      "pineepple"

## str_replace(string, pattern, replacement)

str_replace_all(movie_titles, pattern = "Good", replacement = "Bad")
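##  [1] "Gold Diggers Of Broadway" "Gone Baby Gone"
##  [3] "Gone In 60 Seconds"       "Gone With The Wind"
##  [5] "Bad Girl, The"            "Bad Burger"
##  [7] "Badbye Girl, The"         "Bad Bye Lenin!"
##  [9] "Badfellas"                "Bad Luck Chuck"
## [11] "Bad Morning, Vietnam"     "Bad Night, And Bad Luck."
## [13] "Bad Son, The"             "Bad Will Hunting"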
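
The replacement string can also refer back to parts of the match. As a sketch, assuming we want to move a trailing ", the" to the front of a title (a task not covered in the original slides), a capture group and the backreference \\1 do the job:

```r
library(stringr)

# A few lowercase titles in the same style as movie_titles.
titles <- c("good girl, the", "good son, the", "good burger")

# (.*) captures everything before ", the"; \\1 reinserts it after "the ".
fixed <- str_replace(titles, "^(.*), the$", "the \\1")
fixed
# [1] "the good girl" "the good son"  "good burger"
```

Titles that do not end in ", the" do not match the pattern and are returned unchanged.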

## Pattern Matching Functions

#### str_split(string, pattern)

• splits a string into a variable number of pieces and returns a list of character vectors.
str_split(movie_titles, "[ ,]")
## [[1]]
## [1] "Gold"     "Diggers"  "Of"       "Broadway"
##
## [[2]]
## [1] "Gone" "Baby" "Gone"
##
## [[3]]
## [1] "Gone"    "In"      "60"      "Seconds"
##
## [[4]]
## [1] "Gone" "With" "The"  "Wind"
##
## [[5]]
## [1] "Good" "Girl" ""     "The"
##
## [[6]]
## [1] "Good"   "Burger"
##
## [[7]]
## [1] "Goodbye" "Girl"    ""        "The"
##
## [[8]]
## [1] "Good"   "Bye"    "Lenin!"
##
## [[9]]
## [1] "Goodfellas"
##
## [[10]]
## [1] "Good"  "Luck"  "Chuck"
##
## [[11]]
## [1] "Good"    "Morning" ""        "Vietnam"
##
## [[12]]
## [1] "Good"  "Night" ""      "And"   "Good"  "Luck."
##
## [[13]]
## [1] "Good" "Son"  ""     "The"
##
## [[14]]
## [1] "Good"    "Will"    "Hunting"
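
The empty strings "" in the output above appear because a comma followed by a space counts as two separate delimiters. One possible way to avoid them, shown here as a sketch on two sample titles, is to add + so that a run of delimiters is treated as a single split point:

```r
library(stringr)

# Two lowercase titles in the same style as movie_titles.
titles <- c("good girl, the", "good night, and good luck.")

# "[ ,]+" matches one or more consecutive spaces or commas.
pieces <- str_split(titles, "[ ,]+")
pieces[[1]]
# [1] "good" "girl" "the"
pieces[[2]]
# [1] "good"  "night" "and"   "good"  "luck."
```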

## Pattern Matching Functions

#### str_split_fixed(string, pattern, n)

• splits the string into a fixed number of pieces and returns a character matrix.
str_split_fixed(movie_titles, "[ ,]", 5)
##       [,1]         [,2]      [,3]      [,4]       [,5]
##  [1,] "Gold"       "Diggers" "Of"      "Broadway" ""
##  [2,] "Gone"       "Baby"    "Gone"    ""         ""
##  [3,] "Gone"       "In"      "60"      "Seconds"  ""
##  [4,] "Gone"       "With"    "The"     "Wind"     ""
##  [5,] "Good"       "Girl"    ""        "The"      ""
##  [6,] "Good"       "Burger"  ""        ""         ""
##  [7,] "Goodbye"    "Girl"    ""        "The"      ""
##  [8,] "Good"       "Bye"     "Lenin!"  ""         ""
##  [9,] "Goodfellas" ""        ""        ""         ""
## [10,] "Good"       "Luck"    "Chuck"   ""         ""
## [11,] "Good"       "Morning" ""        "Vietnam"  ""
## [12,] "Good"       "Night"   ""        "And"      "Good Luck."
## [13,] "Good"       "Son"     ""        "The"      ""
## [14,] "Good"       "Will"    "Hunting" ""         ""

## Final Exercise

#### Instructions

1. Extract all phone numbers from the variable strings
2. Remove all "-" and "."
3. Split the numbers into a matrix
• First column contains area codes and the second column contains the rest of the phone number.

Matrix should contain 10 phone numbers (rows) and 2 columns

## Final Exercise

strings %>% str_match_all(pattern = "[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}") %>%
unlist() %>% str_replace_all(pattern = "[-. ]", replacement = " ") %>%
str_split_fixed(pattern = " ", 2)
##       [,1]  [,2]
##  [1,] "219" "733 8965"
##  [2,] "329" "293 8753"
##  [3,] "595" "794 7569"
##  [4,] "387" "287 6718"
##  [5,] "233" "398 9187"
##  [6,] "482" "952 3315"
##  [7,] "239" "923 8115"
##  [8,] "842" "566 4692"
##  [9,] "579" "499 7527"
## [10,] "543" "355 3679"