Introduction

Vignette for the popular stringr functions from the Tidyverse packages.

The stringr library provides a suite of commonly used string manipulation functions to assist in data cleaning and data preparation tasks.

The 8 most popular stringr verbs:

  1. detect: Identifies a match to the pattern
  2. count: Counts the number of instances of the pattern
  3. subset: Extracts the strings with matching components to the pattern
  4. locate: Identifies the position index of the match in the string
  5. extract: Extracts the matching text to the pattern
  6. match: Extracts the parts of the match as defined in the parentheses
  7. replace: Replaces the matching text with the provided text
  8. split: Splits the string at the matching text

The stringr functions require a vector of strings as the first argument.
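
As a minimal sketch of what "a vector of strings as the first argument" means in practice, the example below applies two stringr functions to a small, hypothetical vector (`fruit` is an illustrative name and is not part of the tweet data). Each function returns one result per input element, and a missing input gives a missing result.

```r
library(stringr)

# Hypothetical toy vector (not part of the tweet data)
fruit <- c("apple", "banana", NA)

# One result per input element; NA inputs yield NA outputs
has_an  <- str_detect(fruit, "an")
n_chars <- str_length(fruit)

has_an
n_chars
```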

Load the Tidyverse package.

library(tidyverse)

Read in a CSV file from the FiveThirtyEight GitHub repository and convert it to an R dataframe. The CSV file contains tweets determined to be sent by Russian trolls. To examine use cases of the stringr library, this exercise focuses on the unstructured sentences in the tweets. The data is subset to the first 10 tweets, which are displayed below.

# Read the CSV from the fivethirtyeight GitHub repository
data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv")
head(data)
## # A tibble: 6 x 21
##   external_author~ author content region language publish_date harvested_date
##              <dbl> <chr>  <chr>   <chr>  <chr>    <chr>        <chr>         
## 1          9.06e17 10_GOP "\"We ~ Unkno~ English  10/1/2017 1~ 10/1/2017 19:~
## 2          9.06e17 10_GOP "Marsh~ Unkno~ English  10/1/2017 2~ 10/1/2017 22:~
## 3          9.06e17 10_GOP "Daugh~ Unkno~ English  10/1/2017 2~ 10/1/2017 22:~
## 4          9.06e17 10_GOP "JUST ~ Unkno~ English  10/1/2017 2~ 10/1/2017 23:~
## 5          9.06e17 10_GOP "19,00~ Unkno~ English  10/1/2017 2~ 10/1/2017 2:13
## 6          9.06e17 10_GOP "Dan B~ Unkno~ English  10/1/2017 2~ 10/1/2017 2:47
## # ... with 14 more variables: following <dbl>, followers <dbl>, updates <dbl>,
## #   post_type <chr>, account_type <chr>, retweet <dbl>, account_category <chr>,
## #   new_june_2018 <dbl>, alt_external_id <dbl>, tweet_id <dbl>,
## #   article_url <chr>, tco1_step1 <chr>, tco2_step1 <chr>, tco3_step1 <lgl>
tweets <- data[1:10,1:6]

df <- as.data.frame(tweets)

# Output the content column as this will be the string data used for stringr functions
df$content
##  [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
##  [2] "Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"                  
##  [3] "Daughter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear.  #BoycottNFL https://t.co/qDlFBGMeag"               
##  [4] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"             
##  [5] "19,000 RESPECTING our National Anthem! #StandForOurAnthem<U+0001F1FA><U+0001F1F8> https://t.co/czutyGaMQV"                                                     
##  [6] "Dan Bongino: \"Nobody trolls liberals better than Donald Trump.\" Exactly!  https://t.co/AigV93aC8J"                                                           
##  [7] "<U+0001F41D><U+0001F41D><U+0001F41D> https://t.co/MorL3AQW0z"                                                                                                  
##  [8] "'@SenatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't report on your crimes. This won't change the fact that you're going down.'"                 
##  [9] "As much as I hate promoting CNN article, here they are admitting EVERYTHING Trump said about PR relief two days ago. https://t.co/tZmSeA48oh"                  
## [10] "After the 'genocide' remark from San Juan Mayor the narrative has changed though. @CNN fixes it's reporting constantly."

Detect

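Detect whether the exact string “US” appears in each tweet. Output is “TRUE” or “FALSE” for each string. The match is case-sensitive and not limited to whole words, so tweet 4 matches because “JUST” contains “US”.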

detect_result <- str_detect(df$content, "US")

detect_result
##  [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
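
If only the standalone word “US” should count, one option is to wrap the pattern in word boundaries (`\\b`). The sketch below uses two shortened strings copied from the tweet sample; the variable names are illustrative only.

```r
library(stringr)

# Shortened copies of tweets 1 and 4 from the sample
x <- c("We have a sitting Democrat US Senator on trial",
       "JUST IN: President Trump dedicates Presidents Cup")

# Plain substring match: "JUST" also contains "US"
plain <- str_detect(x, "US")

# Word-boundary match: only the standalone word "US" counts
whole_word <- str_detect(x, "\\bUS\\b")

plain
whole_word
```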

Count

Count the number of occurrences of the letters ‘u’ or ‘s’, uppercase and lowercase for each string. Output is the integer count for each string.

count_result <- str_count(df$content, "[USus]")

count_result
##  [1]  9 14 11 12  6  4  1  9  9  7

Subset

Subset the initial vector of strings to only the strings containing the exact string “US”. Output is a vector of strings.

subset_result <- str_subset(df$content, "US")

subset_result
## [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
## [2] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"

Locate

Identify the start and end position of the first match of the exact string “the” in each string. Output is a matrix with one row per string; rows are ‘NA’ for strings that do not contain the pattern. The pattern also matches inside longer words, as in tweet 5, where the match falls within “Anthem”.

locate_result <- str_locate(df$content, "the")

locate_result
##       start end
##  [1,]   100 102
##  [2,]    82  84
##  [3,]    65  67
##  [4,]    77  79
##  [5,]    34  36
##  [6,]    NA  NA
##  [7,]    NA  NA
##  [8,]   109 111
##  [9,]    47  49
## [10,]     7   9
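
str_locate reports only the first match in each string. When every match is needed, str_locate_all returns a list with one matrix per input string. A small sketch on a shortened copy of tweet 10 (variable names are illustrative):

```r
library(stringr)

# Shortened copy of tweet 10 from the sample
x <- "After the 'genocide' remark from San Juan Mayor the narrative"

# First match only: a one-row matrix
first_hit <- str_locate(x, "the")

# All matches: a list of matrices, one per input string
all_hits <- str_locate_all(x, "the")[[1]]

first_hit
all_hits
```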

Extract

Extracts the first instance in each string that matches the letters ‘u’ or ‘s’, uppercase or lowercase. Output is a vector holding the first matching text from each input string, which in this case is a single letter.

extract_result <- str_extract(df$content, "[USus]")

extract_result
##  [1] "s" "s" "u" "U" "S" "s" "s" "S" "s" "S"
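
A more typical use is to pull out a token such as a hashtag. The sketch below assumes a hashtag is a ‘#’ followed by word characters, which is a simplification; the input vector is a hypothetical mix of a tweet fragment and made-up strings. str_extract returns the first match per string, while str_extract_all returns every match as a list.

```r
library(stringr)

# First element is a fragment of tweet 3; the others are made up
x <- c("gear.  #BoycottNFL https://t.co/qDlFBGMeag",
       "no tags",
       "#a and #b")

# First hashtag per string, NA when there is none
first_tag <- str_extract(x, "#\\w+")

# Every hashtag per string, as a list of character vectors
all_tags <- str_extract_all(x, "#\\w+")

first_tag
all_tags
```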

Match

Extracts the 5 characters that follow the exact string “US”. The 5 periods in the parentheses form a capture group that defines the part of the string to extract. Output is a matrix: the first column holds the full match, the second column holds the captured 5-character string, and rows without a match are ‘NA’.

match_result <- str_match(df$content, "US(.....)")

match_result
##       [,1]      [,2]   
##  [1,] "US Sena" " Sena"
##  [2,] NA        NA     
##  [3,] NA        NA     
##  [4,] "UST IN:" "T IN:"
##  [5,] NA        NA     
##  [6,] NA        NA     
##  [7,] NA        NA     
##  [8,] NA        NA     
##  [9,] NA        NA     
## [10,] NA        NA
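
Each pair of parentheses adds one column to the result. As a rough sketch, the pattern below splits a shortened link into its host and its short code; the pattern is an assumption tuned to the t.co links in this sample, not a general URL parser.

```r
library(stringr)

# First element is a shortened copy of tweet 2; the second is made up
x <- c("Judging by his pants https://t.co/mLH1i30LZZ",
       "no link here")

# Column 1: full match, column 2: host, column 3: short code
m <- str_match(x, "https://([a-z.]+)/(\\w+)")

m
```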

Replace

Replaces the first instance in each string that matches the letters ‘u’ or ‘s’, uppercase or lowercase, with a percent sign (‘%’). Output is the initial vector of strings with the replaced characters. To replace every instance rather than only the first, use str_replace_all.

replace_result <- str_replace(df$content, "[USus]", "%")

replace_result
##  [1] "\"We have a %itting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
##  [2] "Mar%hawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"                  
##  [3] "Da%ghter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear.  #BoycottNFL https://t.co/qDlFBGMeag"               
##  [4] "J%ST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"             
##  [5] "19,000 RE%PECTING our National Anthem! #StandForOurAnthem<U+0001F1FA><U+0001F1F8> https://t.co/czutyGaMQV"                                                     
##  [6] "Dan Bongino: \"Nobody troll% liberals better than Donald Trump.\" Exactly!  https://t.co/AigV93aC8J"                                                           
##  [7] "<U+0001F41D><U+0001F41D><U+0001F41D> http%://t.co/MorL3AQW0z"                                                                                                  
##  [8] "'@%enatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't report on your crimes. This won't change the fact that you're going down.'"                 
##  [9] "A% much as I hate promoting CNN article, here they are admitting EVERYTHING Trump said about PR relief two days ago. https://t.co/tZmSeA48oh"                  
## [10] "After the 'genocide' remark from %an Juan Mayor the narrative has changed though. @CNN fixes it's reporting constantly."
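
The difference between replacing the first match and replacing every match can be seen on two short fragments taken from tweets 4 and 9 (variable names are illustrative):

```r
library(stringr)

# Fragments of tweets 4 and 9 from the sample
x <- c("JUST IN", "As much as")

# Only the first matching character in each string is replaced
first_only <- str_replace(x, "[USus]", "%")

# Every matching character is replaced
every_match <- str_replace_all(x, "[USus]", "%")

first_only
every_match
```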

Split

Splits the strings at the hashtag (‘#’). Output is a list with one character vector per input string; strings containing a hashtag are split into multiple pieces and the ‘#’ itself is dropped.

split_result <- str_split(df$content, "#")

split_result
## [[1]]
## [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
## 
## [[2]]
## [1] "Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"
## 
## [[3]]
## [1] "Daughter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear.  "
## [2] "BoycottNFL https://t.co/qDlFBGMeag"                                                                          
## 
## [[4]]
## [1] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"
## 
## [[5]]
## [1] "19,000 RESPECTING our National Anthem! "    
## [2] "StandForOurAnthem<U+0001F1FA><U+0001F1F8> https://t.co/czutyGaMQV"
## 
## [[6]]
## [1] "Dan Bongino: \"Nobody trolls liberals better than Donald Trump.\" Exactly!  https://t.co/AigV93aC8J"
## 
## [[7]]
## [1] "<U+0001F41D><U+0001F41D><U+0001F41D> https://t.co/MorL3AQW0z"
## 
## [[8]]
## [1] "'@SenatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't report on your crimes. This won't change the fact that you're going down.'"
## 
## [[9]]
## [1] "As much as I hate promoting CNN article, here they are admitting EVERYTHING Trump said about PR relief two days ago. https://t.co/tZmSeA48oh"
## 
## [[10]]
## [1] "After the 'genocide' remark from San Juan Mayor the narrative has changed though. @CNN fixes it's reporting constantly."
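
When a rectangular result is easier to work with than a list, str_split accepts simplify = TRUE and returns a character matrix, padding shorter rows with empty strings. A small sketch with one tweet fragment and one made-up string:

```r
library(stringr)

# First element is a fragment of tweet 5; the second is made up
x <- c("our National Anthem! #StandForOurAnthem", "no hashtag here")

# Default: a list with one character vector per input string
parts <- str_split(x, "#")

# simplify = TRUE: a character matrix, padded with "" where needed
mat <- str_split(x, "#", simplify = TRUE)

parts
mat
```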

Conclusion

The stringr library provides easy-to-use string manipulation functions for data cleaning and preparation tasks. The functions are applied on vectors of strings which allows for straightforward manipulation of entire columns in a dataframe.

Extend: Adam Gersowitz

I’m going to demonstrate 5 additional stringr functions that can be useful when manipulating strings.

Length

Get the length of each string as its number of characters. Useful when data quality is unknown (e.g., for spotting a 5-digit phone number).

length_result <- str_length(df$content)

length_result
##  [1] 156 140 143 145  83  97  27 141 140 119
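
As a sketch of the data-quality use case, the example below flags phone numbers that are too short. The `phones` vector and the 10-digit expectation are assumptions made for illustration.

```r
library(stringr)

# Hypothetical phone numbers; 10 digits is the assumed valid length
phones <- c("5551234567", "55512", NA)

# TRUE marks entries shorter than expected; NA stays NA
too_short <- str_length(phones) < 10

too_short
```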

Upper

Convert the strings to all upper case. This is helpful when looking for a string match that may be case-sensitive, or when preparing data for a database whose users need consistent letter case.

upper_result <- str_to_upper(df$content)

upper_result
##  [1] "\"WE HAVE A SITTING DEMOCRAT US SENATOR ON TRIAL FOR CORRUPTION AND YOU'VE BARELY HEARD A PEEP FROM THE MAINSTREAM MEDIA.\" ~ @NEDRYUN HTTPS://T.CO/GH6G0D1OIC"
##  [2] "MARSHAWN LYNCH ARRIVES TO GAME IN ANTI-TRUMP SHIRT. JUDGING BY HIS SAGGING PANTS THE SHIRT SHOULD SAY LYNCH VS. BELT HTTPS://T.CO/MLH1I30LZZ"                  
##  [3] "DAUGHTER OF FALLEN NAVY SAILOR DELIVERS POWERFUL MONOLOGUE ON ANTHEM PROTESTS, BURNS HER NFL PACKERS GEAR.  #BOYCOTTNFL HTTPS://T.CO/QDLFBGMEAG"               
##  [4] "JUST IN: PRESIDENT TRUMP DEDICATES PRESIDENTS CUP GOLF TOURNAMENT TROPHY TO THE PEOPLE OF FLORIDA, TEXAS AND PUERTO RICO. HTTPS://T.CO/Z9WVA4DJAE"             
##  [5] "19,000 RESPECTING OUR NATIONAL ANTHEM! #STANDFOROURANTHEM<U+0001F1FA><U+0001F1F8> HTTPS://T.CO/CZUTYGAMQV"                                                     
##  [6] "DAN BONGINO: \"NOBODY TROLLS LIBERALS BETTER THAN DONALD TRUMP.\" EXACTLY!  HTTPS://T.CO/AIGV93AC8J"                                                           
##  [7] "<U+0001F41D><U+0001F41D><U+0001F41D> HTTPS://T.CO/MORL3AQW0Z"                                                                                                  
##  [8] "'@SENATORMENENDEZ @CARMENYULINCRUZ DOESN'T MATTER THAT CNN DOESN'T REPORT ON YOUR CRIMES. THIS WON'T CHANGE THE FACT THAT YOU'RE GOING DOWN.'"                 
##  [9] "AS MUCH AS I HATE PROMOTING CNN ARTICLE, HERE THEY ARE ADMITTING EVERYTHING TRUMP SAID ABOUT PR RELIEF TWO DAYS AGO. HTTPS://T.CO/TZMSEA48OH"                  
## [10] "AFTER THE 'GENOCIDE' REMARK FROM SAN JUAN MAYOR THE NARRATIVE HAS CHANGED THOUGH. @CNN FIXES IT'S REPORTING CONSTANTLY."

Trim

Trim whitespace from strings on the left, right, or both sides. Helpful when cleaning up raw data. The sample tweets have no leading or trailing whitespace, so the output here is unchanged.

trim_result <- str_trim(df$content, side = "both")

trim_result
##  [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
##  [2] "Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"                  
##  [3] "Daughter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear.  #BoycottNFL https://t.co/qDlFBGMeag"               
##  [4] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"             
##  [5] "19,000 RESPECTING our National Anthem! #StandForOurAnthem<U+0001F1FA><U+0001F1F8> https://t.co/czutyGaMQV"                                                     
##  [6] "Dan Bongino: \"Nobody trolls liberals better than Donald Trump.\" Exactly!  https://t.co/AigV93aC8J"                                                           
##  [7] "<U+0001F41D><U+0001F41D><U+0001F41D> https://t.co/MorL3AQW0z"                                                                                                  
##  [8] "'@SenatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't report on your crimes. This won't change the fact that you're going down.'"                 
##  [9] "As much as I hate promoting CNN article, here they are admitting EVERYTHING Trump said about PR relief two days ago. https://t.co/tZmSeA48oh"                  
## [10] "After the 'genocide' remark from San Juan Mayor the narrative has changed though. @CNN fixes it's reporting constantly."
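
Because the tweets carry no surrounding whitespace, the effect is easier to see on a made-up vector. The sketch also shows str_squish, which additionally collapses repeated whitespace inside the string.

```r
library(stringr)

# Hypothetical strings with stray whitespace
x <- c("  padded on both sides  ", "inner   gaps   stay")

# Remove whitespace at both ends; inner whitespace is untouched
both <- str_trim(x, side = "both")

# Remove whitespace on the left only
left <- str_trim(x, side = "left")

# Trim both ends and collapse inner runs of whitespace to one space
squished <- str_squish(x)

both
left
squished
```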

Truncate

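Truncate strings to a maximum width. Strings longer than the width are cut and end in an ellipsis (‘...’), which counts toward the width, so a width of 13 keeps the first 10 characters. Helpful when the strings follow a known pattern and, for example, you only need the first few characters.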

trunc_result <- str_trunc(df$content,13)

trunc_result
##  [1] "\"We have a..." "Marshawn L..."  "Daughter o..."  "JUST IN: P..." 
##  [5] "19,000 RES..."  "Dan Bongin..."  "<U+0001F41D><U+0001F41D><U+0001F41D> https:..." "'@SenatorM..." 
##  [9] "As much as..."  "After the ..."
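
str_trunc also takes side and ellipsis arguments. The sketch below, on a shortened copy of tweet 2, keeps the end of the string instead of the start, and then drops the ellipsis so that all 13 characters come from the original text.

```r
library(stringr)

# Shortened copy of tweet 2 from the sample
x <- "Marshawn Lynch arrives to game"

# Default: keep the start, ellipsis on the right
right <- str_trunc(x, 13)

# Keep the end instead, ellipsis on the left
left <- str_trunc(x, 13, side = "left")

# No ellipsis: the full width is used for the original text
bare <- str_trunc(x, 13, ellipsis = "")

right
left
bare
```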

Sub

Returns only the specified substring of characters. Useful when part of a code or string has significance.

sub_result <- str_sub(df$content,1,20)

sub_result
##  [1] "\"We have a sitting D" "Marshawn Lynch arriv"  "Daughter of fallen N" 
##  [4] "JUST IN: President T"  "19,000 RESPECTING ou"  "Dan Bongino: \"Nobody"
##  [7] "<U+0001F41D><U+0001F41D><U+0001F41D> https://t.co/Mor" "'@SenatorMenendez @C"  "As much as I hate pr" 
## [10] "After the 'genocide'"
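
Positions in str_sub can also be negative, in which case they count from the end of the string. A short sketch on a fragment of tweet 2, where the last 10 characters are the short code of the link:

```r
library(stringr)

# Fragment of tweet 2 from the sample
x <- "Lynch vs. belt https://t.co/mLH1i30LZZ"

# Characters 1 through 5, counted from the start
head_part <- str_sub(x, 1, 5)

# The last 10 characters, counted from the end
tail_part <- str_sub(x, -10, -1)

head_part
tail_part
```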

Extend Conclusion

The stringr library is extremely useful and easy to use when manipulating strings. All 13 of these functions would be extremely helpful to me in my day-to-day work.