Strings play a big role in many data cleaning and preparation tasks. R provides a solid set of string operations, but because they have grown organically over time, they can be inconsistent and a little hard to learn. Additionally, they lag behind the string operations in other programming languages.
Fret not, stringr is here!
as.integer(c("a", "&"))
## Warning: NAs introduced by coercion
## [1] NA NA
c(factor("a"), "b", "&", 1)
## [1] "1" "b" "&" "1"
c(as.character(factor("a")), "b", "&", 1)
## [1] "a" "b" "&" "1"
install.packages("stringr")
library(stringr)
Head over to github.com/UCIDataScienceInitiative/AdvancedRWorkshop, open Introduction to Stringr and download Variables.R
Then open Variables.R and load strings, fruit, and movie_titles
movie_titles <- c("gold diggers of broadway", "gone baby gone", "gone in 60 seconds", "gone with the wind", "good girl, the", "good burger", "goodbye girl, the", "good bye lenin!", "goodfellas", "good luck chuck", "good morning, vietnam", "good night, and good luck.", "good son, the", "good will hunting")
strings <- c(" 219 733 8965", "329-293-8753 ", "banana", "595 794 7569", "387 287 6718", "apple", "233.398.9187 ", "482 952 3315", "239 923 8115 and 842 566 4692", "Work: 579-499-7527", "$1000", "Home: 543.355.3679")
fruit <- c("apple", "banana", "pear", "pineapple")
- str_to_upper() converts strings to uppercase
- ex. Convert all movie_titles to uppercase and store them as movie_titles
- str_to_lower() converts strings to lowercase
- ex. Convert all movie_titles back to lowercase and save them as movie_titles
- str_to_title() converts strings to title case
- ex. Convert all movie_titles to title case and store them as movie_titles
movie_titles <- str_to_upper(movie_titles)
movie_titles
##  [1] "GOLD DIGGERS OF BROADWAY"   "GONE BABY GONE"
##  [3] "GONE IN 60 SECONDS"         "GONE WITH THE WIND"
##  [5] "GOOD GIRL, THE"             "GOOD BURGER"
##  [7] "GOODBYE GIRL, THE"          "GOOD BYE LENIN!"
##  [9] "GOODFELLAS"                 "GOOD LUCK CHUCK"
## [11] "GOOD MORNING, VIETNAM"      "GOOD NIGHT, AND GOOD LUCK."
## [13] "GOOD SON, THE"              "GOOD WILL HUNTING"
movie_titles <- str_to_lower(movie_titles)
movie_titles
##  [1] "gold diggers of broadway"   "gone baby gone"
##  [3] "gone in 60 seconds"         "gone with the wind"
##  [5] "good girl, the"             "good burger"
##  [7] "goodbye girl, the"          "good bye lenin!"
##  [9] "goodfellas"                 "good luck chuck"
## [11] "good morning, vietnam"      "good night, and good luck."
## [13] "good son, the"              "good will hunting"
movie_titles <- str_to_title(movie_titles)
movie_titles
##  [1] "Gold Diggers Of Broadway"   "Gone Baby Gone"
##  [3] "Gone In 60 Seconds"         "Gone With The Wind"
##  [5] "Good Girl, The"             "Good Burger"
##  [7] "Goodbye Girl, The"          "Good Bye Lenin!"
##  [9] "Goodfellas"                 "Good Luck Chuck"
## [11] "Good Morning, Vietnam"      "Good Night, And Good Luck."
## [13] "Good Son, The"              "Good Will Hunting"
- str_c() joins multiple strings together, including integers
- Is the stringr equivalent of paste(sep = "") or paste0()
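The comparison with paste0() can be sketched as follows. This is a minimal illustration with made-up inputs, not an example from the workshop materials; the variable names are chosen here for clarity.

```r
library(stringr)

# Join strings and a number with no separator, just like paste0()
joined      <- str_c("good", "fellas", 1990)
joined_base <- paste0("good", "fellas", 1990)

# sep puts a separator between the pieces;
# collapse merges a whole vector into a single string
with_sep  <- str_c("gone", "baby", "gone", sep = " ")
collapsed <- str_c(c("apple", "banana", "pear"), collapse = ", ")

# One difference from paste0(): a missing value makes the result missing
with_na      <- str_c("apple", NA)   # NA
with_na_base <- paste0("apple", NA)  # "appleNA"
```

The NA behaviour is the main practical difference: str_c() propagates missing values instead of silently turning them into the text "NA".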
- str_length() returns the string length
- Similar to the base function nchar()
- str_length() converts factors to strings and also preserves NAs
nchar(NA)
## [1] 2
str_length(NA)
## [1] NA
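The factor handling mentioned above can be seen in a short sketch. The factor below is invented for illustration.

```r
library(stringr)

# A hypothetical factor with one missing value
fruit_factor <- factor(c("apple", "banana", NA))

# str_length() measures the labels of the factor and keeps NA as NA
label_lengths <- str_length(fruit_factor)

# nchar() raises an error when given a factor directly,
# so the labels have to be converted to character first
label_lengths_base <- nchar(as.character(fruit_factor)[1:2])
```

Here label_lengths holds 5, 6 and NA, one value per element of the factor.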
- str_sub() subsets text within a string or vector of strings by specifying start and end positions
- The base equivalent is substr()
- By default, end goes to the end of the string
fruit
## [1] "apple" "banana" "pear" "pineapple"
str_sub(fruit, start = 3)
## [1] "ple" "nana" "ar" "neapple"
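A further sketch of the start and end arguments, using the same fruit vector. Negative positions are a stringr feature that substr() does not offer.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# Explicit start and end: characters 1 through 3 of each string
first_three <- str_sub(fruit, start = 1, end = 3)
# "app" "ban" "pea" "pin"

# Negative positions count backwards from the end of the string
last_two <- str_sub(fruit, start = -2)
# "le" "na" "ar" "le"
```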
- str_dup() duplicates strings a given number of times
- Essentially a copy/paste function
str_dup(fruit, 3)
## [1] "appleappleapple"             "bananabananabanana"
## [3] "pearpearpear"                "pineapplepineapplepineapple"
- str_trim() removes leading and trailing whitespace
- The side argument defaults to "both"
- ex. Trim the whitespace from both sides of every string in strings
- str_pad() pads strings with whitespace to make them a certain length
- The width argument sets the minimum length of the padded string
- The side argument defaults to "left"
- ex. Pad movie_titles with whitespace on the right so that each title becomes 30 characters long.
str_trim(strings)
##  [1] "219 733 8965"                  "329-293-8753"
##  [3] "banana"                        "595 794 7569"
##  [5] "387 287 6718"                  "apple"
##  [7] "233.398.9187"                  "482 952 3315"
##  [9] "239 923 8115 and 842 566 4692" "Work: 579-499-7527"
## [11] "$1000"                         "Home: 543.355.3679"
str_pad(movie_titles, side = "right", 30)
##  [1] "Gold Diggers Of Broadway      " "Gone Baby Gone                "
##  [3] "Gone In 60 Seconds            " "Gone With The Wind            "
##  [5] "Good Girl, The                " "Good Burger                   "
##  [7] "Goodbye Girl, The             " "Good Bye Lenin!               "
##  [9] "Goodfellas                    " "Good Luck Chuck               "
## [11] "Good Morning, Vietnam         " "Good Night, And Good Luck.    "
## [13] "Good Son, The                 " "Good Will Hunting             "
Pattern matching functions use patterns, otherwise known as "regular expressions" or "regex", to identify specific characteristics in strings.
- "a" = is the letter "a"
- "^a" = starts with the letter "a"
- "a$" = ends with the letter "a"
- "[ ]" = contains any letter (or number) within the brackets
- "[ - ]" = contains any letter (or number) within this range
- "[^ae]" = everything except these letters (or numbers)
- "{3}" = repeat the last regex 3 times.
For more expressions or examples, refer to http://www.regular-expressions.info/refquick.html
Regular expressions can be combined to form compound expressions.
- California plates start with a number, followed by 3 letters, followed by another 3 numbers.
- Regular expression: "^[0-9][A-Z]{3}[0-9]{3}$"
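The plate pattern can be tried out with str_detect(). The plate strings below are hypothetical, invented only to show one match and three different ways of failing.

```r
library(stringr)

# Hypothetical plate strings, made up for illustration
plates <- c("7ABC123", "ABC1234", "7abc123", "12ABC345")

# One digit, three uppercase letters, three digits, and nothing else
plate_pattern <- "^[0-9][A-Z]{3}[0-9]{3}$"

is_plate <- str_detect(plates, plate_pattern)
# TRUE FALSE FALSE FALSE
```

The second string fails because it starts with a letter, the third because the letters are lowercase, and the fourth because a second digit appears where a letter is expected.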
- str_detect() detects the presence of a pattern within a string or vector of strings
- returns a logical (TRUE/FALSE) vector
- ex. use str_detect in a way that returns any string that contains "apple".
str_detect(fruit, pattern = "^apple$")
## [1] TRUE FALSE FALSE FALSE
fruit[str_detect(fruit, "^apple$")]
## [1] "apple"
str_detect(fruit, pattern = "apple")
## [1] TRUE FALSE FALSE TRUE
fruit[str_detect(fruit, "apple")]
## [1] "apple" "pineapple"
- str_locate() locates and returns the start and end position of the first instance of the pattern
- to locate more than one instance within a string, use str_locate_all(string, pattern)
- ex. use str_locate to find the position of "apple" in each string
fruit
## [1] "apple" "banana" "pear" "pineapple"
# on the second word, this pattern exists from the first character to the sixth
str_locate(fruit, "banana")
##      start end
## [1,]    NA  NA
## [2,]     1   6
## [3,]    NA  NA
## [4,]    NA  NA
fruit
## [1] "apple" "banana" "pear" "pineapple"
str_locate(fruit, "apple")
##      start end
## [1,]     1   5
## [2,]    NA  NA
## [3,]    NA  NA
## [4,]     5   9
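The difference between str_locate() and str_locate_all() shows up when a pattern occurs more than once in a string. A minimal sketch, searching for the single letter "a" (a pattern chosen here for illustration):

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# str_locate() reports only the first "a" in each string
first_a <- str_locate(fruit, "a")

# str_locate_all() returns a list with one matrix per input string,
# and each matrix has one row per match
all_a <- str_locate_all(fruit, "a")

# "banana" contains three matches, at positions 2, 4 and 6
banana_starts <- all_a[[2]][, "start"]
```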
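- str_extract() returns the first piece of each string that matches the pattern
- mainly used to extract compound patterns
- str_match() is equivalent to str_extract() except that it returns a matrix, with the full match in the first column and one extra column per capture group
- str_(m)atch(): remember "m" for matrix!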
labels <- c("a99", "a92", "a93l", "b99", "b92", "b93l", "c99", "c92", "c93l", "e99", "e92", "e93l")
# extract everything that begins with an "a" or "e" and ends with two numbers
str_extract(labels, "^[ae][0-9]{2}$")
##  [1] "a99" "a92" NA    NA    NA    NA    NA    NA    NA    "e99" "e92"
## [12] NA
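Why str_match() returns a matrix becomes clearer once the pattern contains capture groups. The sketch below uses a shortened label vector and a grouped pattern, both of which are assumptions made for illustration rather than part of the original exercise.

```r
library(stringr)

# A shortened, illustrative version of the labels vector
some_labels <- c("a99", "b92", "e93l")

# Parentheses mark two capture groups: the letter and the two digits
grouped_pattern <- "^([a-e])([0-9]{2})"

# str_extract() returns a character vector of full matches
full_matches <- str_extract(some_labels, grouped_pattern)
# "a99" "b92" "e93"

# str_match() returns a matrix: column 1 is the full match,
# columns 2 and 3 hold the two capture groups
parts <- str_match(some_labels, grouped_pattern)
letters_only <- parts[, 2]  # "a" "b" "e"
digits_only  <- parts[, 3]  # "99" "92" "93"
```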
strings
##  [1] " 219 733 8965"                 "329-293-8753 "
##  [3] "banana"                        "595 794 7569"
##  [5] "387 287 6718"                  "apple"
##  [7] "233.398.9187 "                 "482 952 3315"
##  [9] "239 923 8115 and 842 566 4692" "Work: 579-499-7527"
## [11] "$1000"                         "Home: 543.355.3679"
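# note: [1-9] excludes the digit 0, so a number containing a 0 would be missed; [0-9] is the safer class
str_match(strings, pattern = "[1-9]{3} [1-9]{3} [1-9]{4}")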
##       [,1]
##  [1,] "219 733 8965"
##  [2,] NA
##  [3,] NA
##  [4,] "595 794 7569"
##  [5,] "387 287 6718"
##  [6,] NA
##  [7,] NA
##  [8,] "482 952 3315"
##  [9,] "239 923 8115"
## [10,] NA
## [11,] NA
## [12,] NA
str_match_all(strings, pattern = "[1-9]{3} [1-9]{3} [1-9]{4}")
## [[1]]
##      [,1]
## [1,] "219 733 8965"
##
## [[2]]
##      [,1]
##
## [[3]]
##      [,1]
##
## [[4]]
##      [,1]
## [1,] "595 794 7569"
##
## [[5]]
##      [,1]
## [1,] "387 287 6718"
##
## [[6]]
##      [,1]
##
## [[7]]
##      [,1]
##
## [[8]]
##      [,1]
## [1,] "482 952 3315"
##
## [[9]]
##      [,1]
## [1,] "239 923 8115"
## [2,] "842 566 4692"
##
## [[10]]
##      [,1]
##
## [[11]]
##      [,1]
##
## [[12]]
##      [,1]
str_match_all(strings, pattern = "[1-9]{3} [1-9]{3} [1-9]{4}") %>% unlist() %>% matrix()
##      [,1]
## [1,] "219 733 8965"
## [2,] "595 794 7569"
## [3,] "387 287 6718"
## [4,] "482 952 3315"
## [5,] "239 923 8115"
## [6,] "842 566 4692"
# ALTERNATIVELY
matrix(unlist(str_match_all(strings, pattern = "[1-9]{3} [1-9]{3} [1-9]{4}")))
str_match_all(strings, pattern = "[1-9]{3} [1-9]{3} [1-9]{4}") %>% unlist()
## [1] "219 733 8965" "595 794 7569" "387 287 6718" "482 952 3315"
## [5] "239 923 8115" "842 566 4692"
- str_replace() replaces the first instance of the matched pattern with the replacement string
- str_replace_all() replaces all instances of the pattern with the replacement string
- str_replace_na() replaces all NA with "NA".
str_replace(fruit, pattern = "a", replacement = "e") # only the first instance
## [1] "epple" "benana" "peer" "pineepple"
str_replace_all(fruit, pattern = "a", replacement = "e") # every instance
## [1] "epple" "benene" "peer" "pineepple"
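str_replace_na() has no example above, so here is a small sketch. The vector with a missing value is invented for illustration.

```r
library(stringr)

# A hypothetical vector with one missing value
fruit_missing <- c("apple", NA, "pear")

# By default the missing value becomes the literal text "NA"
as_text <- str_replace_na(fruit_missing)
# "apple" "NA" "pear"

# The replacement argument substitutes a different placeholder
as_unknown <- str_replace_na(fruit_missing, replacement = "unknown")
# "apple" "unknown" "pear"
```

This is handy before joining strings, since a missing value would otherwise make the whole joined result missing.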
Your turn: In movie_titles, replace all instances of "Good" with "Bad".
str_replace_all(movie_titles, pattern = "Good", replacement = "Bad")
##  [1] "Gold Diggers Of Broadway" "Gone Baby Gone"
##  [3] "Gone In 60 Seconds"       "Gone With The Wind"
##  [5] "Bad Girl, The"            "Bad Burger"
##  [7] "Badbye Girl, The"         "Bad Bye Lenin!"
##  [9] "Badfellas"                "Bad Luck Chuck"
## [11] "Bad Morning, Vietnam"     "Bad Night, And Bad Luck."
## [13] "Bad Son, The"             "Bad Will Hunting"
str_split(movie_titles, "[ ,]")
## [[1]]
## [1] "Gold"     "Diggers"  "Of"       "Broadway"
##
## [[2]]
## [1] "Gone" "Baby" "Gone"
##
## [[3]]
## [1] "Gone"    "In"      "60"      "Seconds"
##
## [[4]]
## [1] "Gone" "With" "The"  "Wind"
##
## [[5]]
## [1] "Good" "Girl" ""     "The"
##
## [[6]]
## [1] "Good"   "Burger"
##
## [[7]]
## [1] "Goodbye" "Girl"    ""        "The"
##
## [[8]]
## [1] "Good"   "Bye"    "Lenin!"
##
## [[9]]
## [1] "Goodfellas"
##
## [[10]]
## [1] "Good"  "Luck"  "Chuck"
##
## [[11]]
## [1] "Good"    "Morning" ""        "Vietnam"
##
## [[12]]
## [1] "Good"  "Night" ""      "And"   "Good"  "Luck."
##
## [[13]]
## [1] "Good" "Son"  ""     "The"
##
## [[14]]
## [1] "Good" "Will" "Hunting"
str_split_fixed(movie_titles, "[ ,]", 5)
##       [,1]         [,2]      [,3]      [,4]       [,5]
##  [1,] "Gold"       "Diggers" "Of"      "Broadway" ""
##  [2,] "Gone"       "Baby"    "Gone"    ""         ""
##  [3,] "Gone"       "In"      "60"      "Seconds"  ""
##  [4,] "Gone"       "With"    "The"     "Wind"     ""
##  [5,] "Good"       "Girl"    ""        "The"      ""
##  [6,] "Good"       "Burger"  ""        ""         ""
##  [7,] "Goodbye"    "Girl"    ""        "The"      ""
##  [8,] "Good"       "Bye"     "Lenin!"  ""         ""
##  [9,] "Goodfellas" ""        ""        ""         ""
## [10,] "Good"       "Luck"    "Chuck"   ""         ""
## [11,] "Good"       "Morning" ""        "Vietnam"  ""
## [12,] "Good"       "Night"   ""        "And"      "Good Luck."
## [13,] "Good"       "Son"     ""        "The"      ""
## [14,] "Good"       "Will"    "Hunting" ""         ""
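The empty strings in the output above come from titles such as "Good Girl, The", where a comma is followed by a space and each of the two characters counts as its own split point. One possible way around this, offered here as a sketch rather than as part of the original exercise, is to add the "+" quantifier so that a run of separators is treated as a single split point.

```r
library(stringr)

# Two of the titles that produced empty strings
some_titles <- c("Good Girl, The", "Good Night, And Good Luck.")

# "[ ,]+" matches one or more spaces or commas in a row,
# so ", " is consumed as a single separator
pieces <- str_split(some_titles, "[ ,]+")
# pieces[[1]] is "Good" "Girl" "The"
# pieces[[2]] is "Good" "Night" "And" "Good" "Luck."
```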
Matrix should contain 10 phone numbers (rows) and 2 columns
strings %>%
  str_match_all(pattern = "[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}") %>%
  unlist() %>%
  str_replace_all(pattern = "[-. ]", replacement = " ") %>%
  str_split_fixed(pattern = " ", 2)
##       [,1]  [,2]
##  [1,] "219" "733 8965"
##  [2,] "329" "293 8753"
##  [3,] "595" "794 7569"
##  [4,] "387" "287 6718"
##  [5,] "233" "398 9187"
##  [6,] "482" "952 3315"
##  [7,] "239" "923 8115"
##  [8,] "842" "566 4692"
##  [9,] "579" "499 7527"
## [10,] "543" "355 3679"