Vignette for the popular stringr functions from the Tidyverse packages.
The stringr library provides a suite of commonly used string manipulation functions to assist in data cleaning and data preparation tasks.
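The 8 most popular stringr verbs, each demonstrated below, are str_detect(), str_count(), str_subset(), str_locate(), str_extract(), str_match(), str_replace() and str_split().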
The stringr functions require a vector of strings as the first argument.
Load the Tidyverse package.
library(tidyverse)
Read in a CSV file from the fivethirtyeight GitHub repository and convert it to an R data frame. The CSV file contains tweets determined to be sent by Russian trolls. To examine the use cases of the stringr library, this exercise focuses on the unstructured sentences in the tweets. The input file is subset to the first 10 tweets and the first 6 columns, and the tweet text is displayed below.
# Read CSV from the fivethirtyeight GitHub repository
data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv")
head(data)
## # A tibble: 6 x 21
## external_author~ author content region language publish_date harvested_date
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 9.06e17 10_GOP "\"We ~ Unkno~ English 10/1/2017 1~ 10/1/2017 19:~
## 2 9.06e17 10_GOP "Marsh~ Unkno~ English 10/1/2017 2~ 10/1/2017 22:~
## 3 9.06e17 10_GOP "Daugh~ Unkno~ English 10/1/2017 2~ 10/1/2017 22:~
## 4 9.06e17 10_GOP "JUST ~ Unkno~ English 10/1/2017 2~ 10/1/2017 23:~
## 5 9.06e17 10_GOP "19,00~ Unkno~ English 10/1/2017 2~ 10/1/2017 2:13
## 6 9.06e17 10_GOP "Dan B~ Unkno~ English 10/1/2017 2~ 10/1/2017 2:47
## # ... with 14 more variables: following <dbl>, followers <dbl>, updates <dbl>,
## # post_type <chr>, account_type <chr>, retweet <dbl>, account_category <chr>,
## # new_june_2018 <dbl>, alt_external_id <dbl>, tweet_id <dbl>,
## # article_url <chr>, tco1_step1 <chr>, tco2_step1 <chr>, tco3_step1 <lgl>
tweets <- data[1:10,1:6]
df <- as.data.frame(tweets)
# Output the content column as this will be the string data used for stringr functions
df$content
## [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
## [2] "Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"
## [3] "Daughter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear. #BoycottNFL https://t.co/qDlFBGMeag"
## [4] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"
## [5] "19,000 RESPECTING our National Anthem! #StandForOurAnthem<U+0001F1FA><U+0001F1F8> https://t.co/czutyGaMQV"
## [6] "Dan Bongino: \"Nobody trolls liberals better than Donald Trump.\" Exactly! https://t.co/AigV93aC8J"
## [7] "<U+0001F41D><U+0001F41D><U+0001F41D> https://t.co/MorL3AQW0z"
## [8] "'@SenatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't report on your crimes. This won't change the fact that you're going down.'"
## [9] "As much as I hate promoting CNN article, here they are admitting EVERYTHING Trump said about PR relief two days ago. https://t.co/tZmSeA48oh"
## [10] "After the 'genocide' remark from San Juan Mayor the narrative has changed though. @CNN fixes it's reporting constantly."
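Detect whether the exact string “US” appears in each tweet. Output is “TRUE” or “FALSE” for each string. The match is case sensitive and is not limited to whole words, so the “US” inside “JUST” in the fourth tweet also counts.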
detect_result <- str_detect(df$content, "US")
detect_result
## [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
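The match on "JUST" can be avoided with a regular expression word boundary. The following is a minimal sketch, not part of the original exercise; the `sample_text` vector is a hypothetical stand-in modeled on the tweets.

```r
library(stringr)

# Hypothetical sample strings modeled on the tweets above
sample_text <- c("Democrat US Senator on trial",
                 "JUST IN: President Trump",
                 "our National Anthem")

# Plain pattern: matches "US" anywhere, including inside "JUST"
plain_match <- str_detect(sample_text, "US")

# Word-boundary pattern: matches "US" only as a standalone word
word_match <- str_detect(sample_text, "\\bUS\\b")

plain_match
word_match
```

With the boundary in place, only the first string should be reported as a match.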
Count the number of occurrences of the letters ‘u’ or ‘s’, uppercase or lowercase, in each string. Output is an integer count for each string.
count_result <- str_count(df$content, "[USus]")
count_result
## [1] 9 14 11 12 6 4 1 9 9 7
Subset the initial vector of strings to only the strings containing the exact string “US”. Output is a vector of strings.
subset_result <- str_subset(df$content, "US")
subset_result
## [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
## [2] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"
Identify the start and end position of the first match of the exact string “the” in each string. Output is a matrix with a start and an end column and one row per string, with ‘NA’ for strings that do not match the pattern.
locate_result <- str_locate(df$content, "the")
locate_result
## start end
## [1,] 100 102
## [2,] 82 84
## [3,] 65 67
## [4,] 77 79
## [5,] 34 36
## [6,] NA NA
## [7,] NA NA
## [8,] 109 111
## [9,] 47 49
## [10,] 7 9
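When every occurrence is needed rather than just the first, str_locate_all() can be used instead. A short sketch on a hypothetical string that contains “the” twice:

```r
library(stringr)

# Hypothetical sample string with two occurrences of "the"
sample_text <- "After the remark the narrative changed"

# str_locate() reports only the first match
first_match <- str_locate(sample_text, "the")

# str_locate_all() returns a list holding one matrix of matches per input string
all_matches <- str_locate_all(sample_text, "the")

first_match
all_matches[[1]]
```

The list structure is needed because each input string can have a different number of matches.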
Extract the first match of the letters ‘u’ or ‘s’, uppercase or lowercase, from each string. Output is a character vector holding the first match for each input string, which in this case is a single letter.
extract_result <- str_extract(df$content, "[USus]")
extract_result
## [1] "s" "s" "u" "U" "S" "s" "s" "S" "s" "S"
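Extract the 5 characters that follow the exact string “US”. The 5 periods in the parentheses form a capture group that defines the part of the string to be extracted. Output is a character matrix: the first column holds the full match, the second column holds the captured 5-character string, and rows without a match are ‘NA’.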
match_result <- str_match(df$content, "US(.....)")
match_result
## [,1] [,2]
## [1,] "US Sena" " Sena"
## [2,] NA NA
## [3,] NA NA
## [4,] "UST IN:" "T IN:"
## [5,] NA NA
## [6,] NA NA
## [7,] NA NA
## [8,] NA NA
## [9,] NA NA
## [10,] NA NA
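Capture groups are most useful when the captured part has meaning on its own. As an illustrative sketch (the sample strings and the pattern are assumptions, not taken from the exercise), the handle after an “@” can be pulled out like this:

```r
library(stringr)

# Hypothetical sample strings; only the first contains a mention
sample_text <- c("~ @nedryun https://t.co/x", "no mention here")

# Column 1 is the full match, column 2 is the capture group (handle without "@")
mention <- str_match(sample_text, "@(\\w+)")

mention
handles <- mention[, 2]
handles
```

Indexing the second column gives a plain character vector that can be stored as a new data frame column.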
Replace the first instance of the letters ‘u’ or ‘s’, uppercase or lowercase, in each string with a percent sign (‘%’). Output is the initial vector of strings with the replaced characters.
replace_result <- str_replace(df$content, "[USus]", "%")
replace_result
## [1] "\"We have a %itting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
## [2] "Mar%hawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"
## [3] "Da%ghter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear. #BoycottNFL https://t.co/qDlFBGMeag"
## [4] "J%ST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"
## [5] "19,000 RE%PECTING our National Anthem! #StandForOurAnthem<U+0001F1FA><U+0001F1F8> https://t.co/czutyGaMQV"
## [6] "Dan Bongino: \"Nobody troll% liberals better than Donald Trump.\" Exactly! https://t.co/AigV93aC8J"
## [7] "<U+0001F41D><U+0001F41D><U+0001F41D> http%://t.co/MorL3AQW0z"
## [8] "'@%enatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't report on your crimes. This won't change the fact that you're going down.'"
## [9] "A% much as I hate promoting CNN article, here they are admitting EVERYTHING Trump said about PR relief two days ago. https://t.co/tZmSeA48oh"
## [10] "After the 'genocide' remark from %an Juan Mayor the narrative has changed though. @CNN fixes it's reporting constantly."
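As the output shows, str_replace() changes only the first match in each string. To replace every match, stringr provides str_replace_all(). A minimal comparison on a hypothetical string:

```r
library(stringr)

# Hypothetical sample string modeled on the fourth tweet
sample_text <- "JUST IN: President Trump"

# Replaces only the first matching letter
first_only <- str_replace(sample_text, "[USus]", "%")

# Replaces every matching letter
every_match <- str_replace_all(sample_text, "[USus]", "%")

first_only
every_match
```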
Split the strings at the hashtag (‘#’), which is dropped from the result. Output is a list of character vectors, one per input string; strings without a hashtag come back as a single unsplit element.
split_result <- str_split(df$content, "#")
split_result
## [[1]]
## [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
##
## [[2]]
## [1] "Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"
##
## [[3]]
## [1] "Daughter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear. "
## [2] "BoycottNFL https://t.co/qDlFBGMeag"
##
## [[4]]
## [1] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"
##
## [[5]]
## [1] "19,000 RESPECTING our National Anthem! "
## [2] "StandForOurAnthem<U+0001F1FA><U+0001F1F8> https://t.co/czutyGaMQV"
##
## [[6]]
## [1] "Dan Bongino: \"Nobody trolls liberals better than Donald Trump.\" Exactly! https://t.co/AigV93aC8J"
##
## [[7]]
## [1] "<U+0001F41D><U+0001F41D><U+0001F41D> https://t.co/MorL3AQW0z"
##
## [[8]]
## [1] "'@SenatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't report on your crimes. This won't change the fact that you're going down.'"
##
## [[9]]
## [1] "As much as I hate promoting CNN article, here they are admitting EVERYTHING Trump said about PR relief two days ago. https://t.co/tZmSeA48oh"
##
## [[10]]
## [1] "After the 'genocide' remark from San Juan Mayor the narrative has changed though. @CNN fixes it's reporting constantly."
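A list can be awkward to work with inside a data frame. One option, sketched here on hypothetical strings, is the simplify argument of str_split(), which returns a character matrix padded with empty strings:

```r
library(stringr)

# Hypothetical sample strings; only the first contains a hashtag
sample_text <- c("burns her NFL gear. #BoycottNFL", "no hashtag here")

# simplify = TRUE returns a character matrix instead of a list
split_matrix <- str_split(sample_text, "#", simplify = TRUE)

split_matrix
```

Rows with fewer pieces are filled with "" so that every row has the same number of columns.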
The stringr library provides easy-to-use string manipulation functions for data cleaning and preparation tasks. The functions are applied to vectors of strings, which allows for straightforward manipulation of entire columns in a data frame.
I’m going to demonstrate 5 additional stringr functions that could come in useful when manipulating strings.
Get the length of each string as a number of characters. Useful when data quality is unknown (e.g. for spotting a 5-digit phone number).
length_result <- str_length(df$content)
length_result
## [1] 156 140 143 145 83 97 27 141 140 119
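The data quality check mentioned above could look like the following sketch. The phone numbers and the expected length of 10 are assumptions made for illustration.

```r
library(stringr)

# Hypothetical phone numbers; a valid entry is assumed to have exactly 10 digits
phone <- c("2125551234", "55512", "6465559876")

# Flag entries whose length differs from the expected 10 characters
bad_length <- str_length(phone) != 10

# Show the suspicious entries
suspect <- phone[bad_length]
suspect
```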
Convert the strings to all upper case. This is helpful when looking for a string match that may be case sensitive, or when preparing data for a database whose users need consistent letter case.
upper_result <- str_to_upper(df$content)
upper_result
## [1] "\"WE HAVE A SITTING DEMOCRAT US SENATOR ON TRIAL FOR CORRUPTION AND YOU'VE BARELY HEARD A PEEP FROM THE MAINSTREAM MEDIA.\" ~ @NEDRYUN HTTPS://T.CO/GH6G0D1OIC"
## [2] "MARSHAWN LYNCH ARRIVES TO GAME IN ANTI-TRUMP SHIRT. JUDGING BY HIS SAGGING PANTS THE SHIRT SHOULD SAY LYNCH VS. BELT HTTPS://T.CO/MLH1I30LZZ"
## [3] "DAUGHTER OF FALLEN NAVY SAILOR DELIVERS POWERFUL MONOLOGUE ON ANTHEM PROTESTS, BURNS HER NFL PACKERS GEAR. #BOYCOTTNFL HTTPS://T.CO/QDLFBGMEAG"
## [4] "JUST IN: PRESIDENT TRUMP DEDICATES PRESIDENTS CUP GOLF TOURNAMENT TROPHY TO THE PEOPLE OF FLORIDA, TEXAS AND PUERTO RICO. HTTPS://T.CO/Z9WVA4DJAE"
## [5] "19,000 RESPECTING OUR NATIONAL ANTHEM! #STANDFOROURANTHEM<U+0001F1FA><U+0001F1F8> HTTPS://T.CO/CZUTYGAMQV"
## [6] "DAN BONGINO: \"NOBODY TROLLS LIBERALS BETTER THAN DONALD TRUMP.\" EXACTLY! HTTPS://T.CO/AIGV93AC8J"
## [7] "<U+0001F41D><U+0001F41D><U+0001F41D> HTTPS://T.CO/MORL3AQW0Z"
## [8] "'@SENATORMENENDEZ @CARMENYULINCRUZ DOESN'T MATTER THAT CNN DOESN'T REPORT ON YOUR CRIMES. THIS WON'T CHANGE THE FACT THAT YOU'RE GOING DOWN.'"
## [9] "AS MUCH AS I HATE PROMOTING CNN ARTICLE, HERE THEY ARE ADMITTING EVERYTHING TRUMP SAID ABOUT PR RELIEF TWO DAYS AGO. HTTPS://T.CO/TZMSEA48OH"
## [10] "AFTER THE 'GENOCIDE' REMARK FROM SAN JUAN MAYOR THE NARRATIVE HAS CHANGED THOUGH. @CNN FIXES IT'S REPORTING CONSTANTLY."
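For the case-sensitive matching scenario, here is a small sketch with hypothetical strings. It shows two routes to a case-insensitive match: converting to upper case first, or wrapping the pattern in regex() with ignore_case = TRUE.

```r
library(stringr)

# Hypothetical sample strings with mixed case
sample_text <- c("President Trump", "anti-TRUMP shirt", "National Anthem")

# Route 1: normalize the case, then match an upper-case pattern
via_upper <- str_detect(str_to_upper(sample_text), "TRUMP")

# Route 2: leave the data unchanged and make the pattern case insensitive
via_regex <- str_detect(sample_text, regex("trump", ignore_case = TRUE))

via_upper
via_regex
```

The second route leaves the original strings untouched, which may be preferable when the original case needs to be kept.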
Trim whitespace from strings on the left, right or both sides. Helpful when cleaning up raw data. The tweets shown here have no leading or trailing whitespace, so the output is unchanged.
trim_result <- str_trim(df$content, side = "both")
trim_result
## [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
## [2] "Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"
## [3] "Daughter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear. #BoycottNFL https://t.co/qDlFBGMeag"
## [4] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"
## [5] "19,000 RESPECTING our National Anthem! #StandForOurAnthem<U+0001F1FA><U+0001F1F8> https://t.co/czutyGaMQV"
## [6] "Dan Bongino: \"Nobody trolls liberals better than Donald Trump.\" Exactly! https://t.co/AigV93aC8J"
## [7] "<U+0001F41D><U+0001F41D><U+0001F41D> https://t.co/MorL3AQW0z"
## [8] "'@SenatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't report on your crimes. This won't change the fact that you're going down.'"
## [9] "As much as I hate promoting CNN article, here they are admitting EVERYTHING Trump said about PR relief two days ago. https://t.co/tZmSeA48oh"
## [10] "After the 'genocide' remark from San Juan Mayor the narrative has changed though. @CNN fixes it's reporting constantly."
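Because the tweets are already clean, the effect of str_trim() is easier to see on padded input. A minimal sketch with a hypothetical vector:

```r
library(stringr)

# Hypothetical strings with leading and/or trailing spaces
padded <- c("  leading", "trailing  ", "  both  ")

left_trim  <- str_trim(padded, side = "left")   # removes leading whitespace only
right_trim <- str_trim(padded, side = "right")  # removes trailing whitespace only
both_trim  <- str_trim(padded, side = "both")   # removes both (the default)

left_trim
right_trim
both_trim
```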
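Truncate strings to a maximum width. Strings longer than the width are cut and an ellipsis (“...”) is appended; the ellipsis counts toward the width, so a width of 13 keeps the first 10 characters. Helpful when the strings follow a known pattern and only the first few characters are needed.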
trunc_result <- str_trunc(df$content,13)
trunc_result
## [1] "\"We have a..." "Marshawn L..." "Daughter o..." "JUST IN: P..."
## [5] "19,000 RES..." "Dan Bongin..." "<U+0001F41D><U+0001F41D><U+0001F41D> https:..." "'@SenatorM..."
## [9] "As much as..." "After the ..."
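The side and ellipsis arguments of str_trunc() control where the string is cut and what marks the cut. A short sketch on a hypothetical string:

```r
library(stringr)

# Hypothetical sample string (30 characters)
sample_text <- "Marshawn Lynch arrives to game"

# Default: keep the start, append "..." (10 characters + 3 for the ellipsis)
default_trunc <- str_trunc(sample_text, 13)

# Empty ellipsis: keep exactly the first 13 characters
no_ellipsis <- str_trunc(sample_text, 13, ellipsis = "")

# side = "left": keep the end of the string instead
left_trunc <- str_trunc(sample_text, 13, side = "left")

default_trunc
no_ellipsis
left_trunc
```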
Return only the substring between the given start and end positions, here characters 1 through 20. Useful when part of a code or string has significance.
sub_result <- str_sub(df$content,1,20)
sub_result
## [1] "\"We have a sitting D" "Marshawn Lynch arriv" "Daughter of fallen N"
## [4] "JUST IN: President T" "19,000 RESPECTING ou" "Dan Bongino: \"Nobody"
## [7] "<U+0001F41D><U+0001F41D><U+0001F41D> https://t.co/Mor" "'@SenatorMenendez @C" "As much as I hate pr"
## [10] "After the 'genocide'"
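str_sub() also accepts negative positions, which count from the end of the string. The sketch below uses hypothetical record codes whose first two and last four characters are assumed to carry meaning.

```r
library(stringr)

# Hypothetical record codes: state prefix, year, serial number
code <- c("NY-2017-0042", "TX-2018-0107")

# First two characters: the state prefix
state <- str_sub(code, 1, 2)

# Last four characters: negative positions count back from the end
serial <- str_sub(code, -4, -1)

state
serial
```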
The stringr library is extremely useful and easy to use when manipulating strings. All 13 of these functions would be extremely helpful to me in my day-to-day work.