Introduction

This vignette covers the popular stringr functions from the Tidyverse collection of packages.

The stringr library provides a suite of commonly used string manipulation functions to assist in data cleaning and data preparation tasks.

The 8 most popular stringr verbs:

  1. detect: Identifies whether each string contains a match to the pattern.
  2. count: Counts the number of matches to the pattern in each string.
  3. subset: Keeps only the strings that contain a match to the pattern.
  4. locate: Identifies the start and end positions of the match in the string.
  5. extract: Extracts the text that matches the pattern.
  6. match: Extracts the parts of the match defined by the parentheses (capture groups).
  7. replace: Replaces the matching text with the provided text.
  8. split: Splits the string at the matching text.

The stringr functions require a vector of strings as the first argument.
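As a minimal sketch of what taking a vector as the first argument means in practice, the snippet below applies two stringr functions to a small character vector. The vector `x` is a hypothetical toy example and is not taken from the tweet data.

```r
library(stringr)

# Hypothetical toy vector (not from the tweet data); the last element is missing
x <- c("apple pie", "banana", NA)

# Each function returns one result per element of the input vector
has_an  <- str_detect(x, "an")   # FALSE TRUE NA
lengths <- str_length(x)         # 9 6 NA

has_an
lengths
```

Missing values stay missing in the output, so the result always lines up element by element with the input.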

Load the Tidyverse package.

library(tidyverse)

Read in a CSV file from the fivethirtyeight GitHub repository and convert it to an R dataframe. The CSV file contains tweets determined to have been sent by Russian trolls. To examine use cases for the stringr library, this exercise focuses on the unstructured text of the tweets. The data is subset to the first 10 tweets, which are displayed below.

# Read CSV from the fivethirtyeight GitHub repository
data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv")
head(data)
## # A tibble: 6 x 21
##   external_author… author content region language publish_date
##              <dbl> <chr>  <chr>   <chr>  <chr>    <chr>       
## 1          9.06e17 10_GOP "\"We … Unkno… English  10/1/2017 1…
## 2          9.06e17 10_GOP Marsha… Unkno… English  10/1/2017 2…
## 3          9.06e17 10_GOP Daught… Unkno… English  10/1/2017 2…
## 4          9.06e17 10_GOP JUST I… Unkno… English  10/1/2017 2…
## 5          9.06e17 10_GOP 19,000… Unkno… English  10/1/2017 2…
## 6          9.06e17 10_GOP "Dan B… Unkno… English  10/1/2017 2…
## # … with 15 more variables: harvested_date <chr>, following <dbl>,
## #   followers <dbl>, updates <dbl>, post_type <chr>, account_type <chr>,
## #   retweet <dbl>, account_category <chr>, new_june_2018 <dbl>,
## #   alt_external_id <dbl>, tweet_id <dbl>, article_url <chr>,
## #   tco1_step1 <chr>, tco2_step1 <chr>, tco3_step1 <lgl>
tweets <- data[1:10,1:6]

df <- as.data.frame(tweets)

# Output the content column as this will be the string data used for stringr functions
df$content
##  [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
##  [2] "Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"                  
##  [3] "Daughter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear.  #BoycottNFL https://t.co/qDlFBGMeag"               
##  [4] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"             
##  [5] "19,000 RESPECTING our National Anthem! #StandForOurAnthem\U0001f1fa\U0001f1f8 https://t.co/czutyGaMQV"                                                         
##  [6] "Dan Bongino: \"Nobody trolls liberals better than Donald Trump.\" Exactly!  https://t.co/AigV93aC8J"                                                           
##  [7] "\U0001f41d\U0001f41d\U0001f41d https://t.co/MorL3AQW0z"                                                                                                        
##  [8] "'@SenatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't report on your crimes. This won't change the fact that you're going down.'"                 
##  [9] "As much as I hate promoting CNN article, here they are admitting EVERYTHING Trump said about PR relief two days ago. https://t.co/tZmSeA48oh"                  
## [10] "After the 'genocide' remark from San Juan Mayor the narrative has changed though. @CNN fixes it's reporting constantly."

Detect

Detect whether the exact, case-sensitive string “US” appears in each tweet. Output is TRUE or FALSE for each string. Note that the fourth tweet matches because “US” occurs inside the word “JUST”.

detect_result <- str_detect(df$content, "US")

detect_result
##  [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
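Because the pattern matches anywhere in the string, a substring hit such as “JUST” counts as a match. A minimal sketch of one way to restrict the match to the whole word, assuming the regex word-boundary anchor `\\b` is acceptable for the use case; the vector `x` is a hypothetical toy example echoing the tweets above.

```r
library(stringr)

# Hypothetical toy strings echoing the tweets above
x <- c("a sitting Democrat US Senator", "JUST IN: President Trump", "our National Anthem")

anywhere   <- str_detect(x, "US")        # TRUE TRUE FALSE: also matches inside "JUST"
whole_word <- str_detect(x, "\\bUS\\b")  # TRUE FALSE FALSE: whole word only
n_matches  <- sum(anywhere)              # 2 strings contain the pattern

anywhere
whole_word
n_matches
```

Since the result is a logical vector, it can be summed to count matching strings or used directly as a row filter.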

Count

Count the number of occurrences of the letters ‘u’ or ‘s’, uppercase or lowercase, in each string. Output is an integer count for each string.

count_result <- str_count(df$content, "[USus]")

count_result
##  [1]  9 14 11 12  6  4  1  9  9  7
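The character class above lists both cases explicitly. As a sketch of an alternative spelling, the same counts can be obtained by wrapping a lowercase class in stringr's `regex()` helper with `ignore_case = TRUE`. The vector `x` is a hypothetical toy example.

```r
library(stringr)

# Hypothetical toy strings (not from the tweet data)
x <- c("US Senator", "bus stops", "xyz")

explicit    <- str_count(x, "[USus]")                          # 3 4 0
insensitive <- str_count(x, regex("[us]", ignore_case = TRUE)) # 3 4 0

explicit
insensitive
```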

Subset

Subset the initial vector of strings to only the strings containing the exact string “US”. Output is a vector of strings.

subset_result <- str_subset(df$content, "US")

subset_result
## [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
## [2] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"
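A short sketch of how subsetting relates to detection: `str_subset()` gives the same result as indexing the vector with `str_detect()`, and its `negate` argument keeps the non-matching strings instead. The vector `x` is a hypothetical toy example.

```r
library(stringr)

# Hypothetical toy strings echoing the tweets above
x <- c("US Senator", "JUST IN", "National Anthem")

kept       <- str_subset(x, "US")                 # "US Senator" "JUST IN"
by_index   <- x[str_detect(x, "US")]              # same result via logical indexing
not_kept   <- str_subset(x, "US", negate = TRUE)  # "National Anthem"

kept
by_index
not_kept
```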

Locate

Identify the start and end positions of the first match of the exact string “the” in each string. Output is a matrix with one row per string, with ‘NA’ in both columns for strings that do not contain the pattern.

locate_result <- str_locate(df$content, "the")

locate_result
##       start end
##  [1,]   100 102
##  [2,]    82  84
##  [3,]    65  67
##  [4,]    77  79
##  [5,]    34  36
##  [6,]    NA  NA
##  [7,]    NA  NA
##  [8,]   109 111
##  [9,]    47  49
## [10,]     7   9
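`str_locate()` reports only the first match in each string. The sketch below, on a hypothetical toy vector `x`, shows `str_locate_all()` for every match and `str_sub()` for pulling out the located text by position.

```r
library(stringr)

# Hypothetical toy strings (not from the tweet data)
x <- c("the cat and the hat", "no match")

first_loc <- str_locate(x, "the")      # row 1: start 1, end 3; row 2: NA NA
all_loc   <- str_locate_all(x, "the")  # list with one matrix per input string

# Use the positions to extract the matched text; NA positions give NA
found <- str_sub(x, first_loc[, "start"], first_loc[, "end"])  # "the" NA

first_loc
all_loc[[1]]   # two rows: 1-3 and 13-15
found
```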

Extract

Extracts the first character in each string that matches the letters ‘u’ or ‘s’, uppercase or lowercase. Output is a character vector holding the first match from each input string, which in this case is a single letter.

extract_result <- str_extract(df$content, "[USus]")

extract_result
##  [1] "s" "s" "u" "U" "S" "s" "s" "S" "s" "S"
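For comparison, a minimal sketch of `str_extract_all()`, which returns every match rather than only the first. The vector `x` is a hypothetical toy example.

```r
library(stringr)

# Hypothetical toy strings (not from the tweet data)
x <- c("US Senator", "xyz")

first_match <- str_extract(x, "[USus]")      # "U" NA
all_matches <- str_extract_all(x, "[USus]")  # list: c("U", "S", "S") and character(0)

first_match
all_matches
```

The first-match version returns a vector with ‘NA’ where nothing matched, while the all-matches version returns a list with an empty character vector in that position.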

Match

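Extracts the 5 characters that follow the exact string “US”. The 5 periods inside the parentheses define the capture group, which is the part of the match to be extracted. Output is a character matrix: the first column holds the full match, the second column holds the captured 5-character string, and rows without a match are ‘NA’.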

match_result <- str_match(df$content, "US(.....)")

match_result
##       [,1]      [,2]   
##  [1,] "US Sena" " Sena"
##  [2,] NA        NA     
##  [3,] NA        NA     
##  [4,] "UST IN:" "T IN:"
##  [5,] NA        NA     
##  [6,] NA        NA     
##  [7,] NA        NA     
##  [8,] NA        NA     
##  [9,] NA        NA     
## [10,] NA        NA
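Each additional pair of parentheses adds one column to the result. The sketch below uses two capture groups to pull a handle and a link code out of text shaped like the end of the first tweet; the pattern and the vector `x` are illustrative assumptions rather than part of the original exercise.

```r
library(stringr)

# Hypothetical toy strings; the first echoes the end of tweet 1
x <- c("~ @nedryun https://t.co/gh6g0D1oiC", "no link here")

# Group 1: the handle after "@"; group 2: the code after "t.co/"
m <- str_match(x, "@(\\w+) https://t\\.co/(\\w+)")

m          # 2 x 3 matrix: full match, group 1, group 2
m[, 2]     # "nedryun" NA
m[, 3]     # "gh6g0D1oiC" NA
```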

Replace

Replaces the first instance in each string that matches the letters ‘u’ or ‘s’, uppercase or lowercase, with a percent sign (‘%’). Output is the initial vector of strings with the replaced characters. To replace every match rather than only the first, use str_replace_all.

replace_result <- str_replace(df$content, "[USus]", "%")

replace_result
##  [1] "\"We have a %itting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
##  [2] "Mar%hawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"                  
##  [3] "Da%ghter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear.  #BoycottNFL https://t.co/qDlFBGMeag"               
##  [4] "J%ST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"             
##  [5] "19,000 RE%PECTING our National Anthem! #StandForOurAnthem\U0001f1fa\U0001f1f8 https://t.co/czutyGaMQV"                                                         
##  [6] "Dan Bongino: \"Nobody troll% liberals better than Donald Trump.\" Exactly!  https://t.co/AigV93aC8J"                                                           
##  [7] "\U0001f41d\U0001f41d\U0001f41d http%://t.co/MorL3AQW0z"                                                                                                        
##  [8] "'@%enatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't report on your crimes. This won't change the fact that you're going down.'"                 
##  [9] "A% much as I hate promoting CNN article, here they are admitting EVERYTHING Trump said about PR relief two days ago. https://t.co/tZmSeA48oh"                  
## [10] "After the 'genocide' remark from %an Juan Mayor the narrative has changed though. @CNN fixes it's reporting constantly."
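A short sketch contrasting `str_replace()`, which changes only the first match in each string, with `str_replace_all()`, which changes every match. The vector `x` is a hypothetical toy example.

```r
library(stringr)

# Hypothetical toy strings (not from the tweet data)
x <- c("US Senator", "xyz")

first_only <- str_replace(x, "[USus]", "%")      # "%S Senator" "xyz"
every_one  <- str_replace_all(x, "[USus]", "%")  # "%% %enator" "xyz"

first_only
every_one
```

Strings without a match are returned unchanged by both functions.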

Split

Splits each string at the hashtag character (‘#’). Output is a list with one character vector per input string; strings that do not contain a hashtag are returned as a single unsplit element.

split_result <- str_split(df$content, "#")

split_result
## [[1]]
## [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
## 
## [[2]]
## [1] "Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"
## 
## [[3]]
## [1] "Daughter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear.  "
## [2] "BoycottNFL https://t.co/qDlFBGMeag"                                                                          
## 
## [[4]]
## [1] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"
## 
## [[5]]
## [1] "19,000 RESPECTING our National Anthem! "                      
## [2] "StandForOurAnthem\U0001f1fa\U0001f1f8 https://t.co/czutyGaMQV"
## 
## [[6]]
## [1] "Dan Bongino: \"Nobody trolls liberals better than Donald Trump.\" Exactly!  https://t.co/AigV93aC8J"
## 
## [[7]]
## [1] "\U0001f41d\U0001f41d\U0001f41d https://t.co/MorL3AQW0z"
## 
## [[8]]
## [1] "'@SenatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't report on your crimes. This won't change the fact that you're going down.'"
## 
## [[9]]
## [1] "As much as I hate promoting CNN article, here they are admitting EVERYTHING Trump said about PR relief two days ago. https://t.co/tZmSeA48oh"
## 
## [[10]]
## [1] "After the 'genocide' remark from San Juan Mayor the narrative has changed though. @CNN fixes it's reporting constantly."
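Two optional arguments of `str_split()` change the shape of the result: `n` caps the number of pieces per string, and `simplify = TRUE` returns a character matrix instead of a list. A minimal sketch on a hypothetical toy vector `x`:

```r
library(stringr)

# Hypothetical toy strings (not from the tweet data)
x <- c("gear. #BoycottNFL #NFL", "no hashtag")

pieces     <- str_split(x, "#")                   # list: 3 pieces, then 1 piece
two_pieces <- str_split(x, "#", n = 2)            # at most 2 pieces per string
as_matrix  <- str_split(x, "#", simplify = TRUE)  # 2 x 3 matrix padded with ""

pieces[[1]]      # "gear. " "BoycottNFL " "NFL"
two_pieces[[1]]  # "gear. " "BoycottNFL #NFL"
as_matrix
```

The matrix form can be convenient when the pieces are to become columns of a dataframe, while the list form preserves the differing number of pieces per string.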

Conclusion

The stringr library provides easy-to-use string manipulation functions for data cleaning and preparation tasks. The functions are applied to vectors of strings, which allows for straightforward manipulation of entire columns in a dataframe.