Vignette for the popular stringr functions from the Tidyverse packages.
The stringr library provides a suite of commonly used string manipulation functions to assist in data cleaning and data preparation tasks.
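The 8 most popular stringr verbs, each demonstrated below: str_detect, str_count, str_subset, str_locate, str_extract, str_match, str_replace, and str_split.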
The stringr functions require a vector of strings as the first argument.
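As a minimal sketch of that calling convention, the example below passes a string vector first and a pattern second. The vector `x` is a hypothetical toy input, not part of the tweet data. Note that missing values stay missing in the result.

```r
library(stringr)

# Hypothetical toy vector for illustration (not taken from the tweet data)
x <- c("apple", "banana", NA)

# The string vector is the first argument; the pattern comes second
has_an <- str_detect(x, "an")   # FALSE TRUE NA
n_chars <- str_length(x)        # 5 6 NA
```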
Load the tidyverse, which attaches stringr.
library(tidyverse)
Read a CSV file from the FiveThirtyEight GitHub repository into an R data frame. The file contains tweets attributed to Russian troll accounts. To illustrate the stringr use cases, this exercise focuses on the unstructured text of the tweets. The data is subset to the first 10 tweets, which are displayed below.
# Read the CSV from the fivethirtyeight GitHub repository
data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/russian-troll-tweets/master/IRAhandle_tweets_1.csv")
head(data)
## # A tibble: 6 x 21
## external_author… author content region language publish_date
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 9.06e17 10_GOP "\"We … Unkno… English 10/1/2017 1…
## 2 9.06e17 10_GOP Marsha… Unkno… English 10/1/2017 2…
## 3 9.06e17 10_GOP Daught… Unkno… English 10/1/2017 2…
## 4 9.06e17 10_GOP JUST I… Unkno… English 10/1/2017 2…
## 5 9.06e17 10_GOP 19,000… Unkno… English 10/1/2017 2…
## 6 9.06e17 10_GOP "Dan B… Unkno… English 10/1/2017 2…
## # … with 15 more variables: harvested_date <chr>, following <dbl>,
## # followers <dbl>, updates <dbl>, post_type <chr>, account_type <chr>,
## # retweet <dbl>, account_category <chr>, new_june_2018 <dbl>,
## # alt_external_id <dbl>, tweet_id <dbl>, article_url <chr>,
## # tco1_step1 <chr>, tco2_step1 <chr>, tco3_step1 <lgl>
tweets <- data[1:10,1:6]
df <- as.data.frame(tweets)
# Output the content column as this will be the string data used for stringr functions
df$content
## [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
## [2] "Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"
## [3] "Daughter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear. #BoycottNFL https://t.co/qDlFBGMeag"
## [4] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"
## [5] "19,000 RESPECTING our National Anthem! #StandForOurAnthem\U0001f1fa\U0001f1f8 https://t.co/czutyGaMQV"
## [6] "Dan Bongino: \"Nobody trolls liberals better than Donald Trump.\" Exactly! https://t.co/AigV93aC8J"
## [7] "\U0001f41d\U0001f41d\U0001f41d https://t.co/MorL3AQW0z"
## [8] "'@SenatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't report on your crimes. This won't change the fact that you're going down.'"
## [9] "As much as I hate promoting CNN article, here they are admitting EVERYTHING Trump said about PR relief two days ago. https://t.co/tZmSeA48oh"
## [10] "After the 'genocide' remark from San Juan Mayor the narrative has changed though. @CNN fixes it's reporting constantly."
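Detect whether the exact string “US” appears in each tweet. The output is TRUE or FALSE for each string. The match is case-sensitive and is not limited to whole words, so tweet 4 matches because “JUST” contains “US”.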
detect_result <- str_detect(df$content, "US")
detect_result
## [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
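Because a plain pattern also matches inside longer words, a word-boundary pattern may be closer to what is intended when searching for the abbreviation. The following is a minimal sketch on a hypothetical two-element vector `x` built from fragments of the tweets.

```r
library(stringr)

# Hypothetical fragments for illustration
x <- c("a sitting Democrat US Senator", "JUST IN: President Trump")

# Plain pattern: also matches the "US" inside "JUST"
anywhere <- str_detect(x, "US")            # TRUE TRUE

# \\b marks a word boundary, so only the standalone word matches
whole_word <- str_detect(x, "\\bUS\\b")    # TRUE FALSE
```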
Count the occurrences of the letters ‘u’ and ‘s’, in either case, in each string. The output is an integer count for each string.
count_result <- str_count(df$content, "[USus]")
count_result
## [1] 9 14 11 12 6 4 1 9 9 7
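An alternative to listing both cases in the character class is the `regex()` helper with `ignore_case = TRUE`. This is a sketch on a hypothetical toy vector `x`, and the two calls should give the same counts.

```r
library(stringr)

# Hypothetical fragments for illustration
x <- c("JUST IN", "his sagging pants")

# Both cases listed explicitly in the character class
by_class <- str_count(x, "[USus]")                          # 2 3

# Same idea, with case-insensitivity set through regex()
by_flag <- str_count(x, regex("[us]", ignore_case = TRUE))  # 2 3
```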
Subset the initial vector of strings to only the strings containing the exact string “US”. Output is a vector of strings.
subset_result <- str_subset(df$content, "US")
subset_result
## [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
## [2] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"
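One way to think about str_subset is as logical indexing with str_detect; the related str_which returns the positions of the matching elements instead of the strings themselves. The sketch below uses a hypothetical toy vector `x` without missing values.

```r
library(stringr)

# Hypothetical fragments for illustration
x <- c("US Senator", "anti-Trump shirt", "JUST IN")

kept <- str_subset(x, "US")          # "US Senator" "JUST IN"
same <- x[str_detect(x, "US")]       # same result via logical indexing
positions <- str_which(x, "US")      # 1 3
```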
Identify the start and end positions of the first match of the exact string “the” in each string. The output is a matrix with a start and an end column, with NA for strings that do not contain the pattern. Matches inside longer words count, as with “anthem” in tweet 3.
locate_result <- str_locate(df$content, "the")
locate_result
## start end
## [1,] 100 102
## [2,] 82 84
## [3,] 65 67
## [4,] 77 79
## [5,] 34 36
## [6,] NA NA
## [7,] NA NA
## [8,] 109 111
## [9,] 47 49
## [10,] 7 9
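Since str_locate reports only the first match, str_locate_all can be used when every position is needed, and str_sub can pull the matched text back out by position. A minimal sketch on a hypothetical string `x`:

```r
library(stringr)

# Hypothetical string for illustration
x <- "the anthem"

# First match only: a one-row matrix with start and end columns
first_match <- str_locate(x, "the")            # start 1, end 3

# All matches: a list with one matrix per input string
all_matches <- str_locate_all(x, "the")[[1]]   # rows 1-3 and 7-9

# Recover the matched text from the positions
found <- str_sub(x, all_matches[, "start"], all_matches[, "end"])
```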
Extract the first character in each string that matches ‘u’ or ‘s’, in either case. The output is a vector holding the first match from each input string, which in this case is a single letter.
extract_result <- str_extract(df$content, "[USus]")
extract_result
## [1] "s" "s" "u" "U" "S" "s" "s" "S" "s" "S"
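The same function becomes more useful with a longer pattern. As a sketch, the example below pulls the first shortened link out of each string; the pattern `https://t\\.co/\\w+` and the toy vector `x` are assumptions for illustration.

```r
library(stringr)

# Hypothetical fragments for illustration
x <- c("gear. #BoycottNFL https://t.co/qDlFBGMeag", "no link here")

# First t.co link in each string, or NA when there is none
links <- str_extract(x, "https://t\\.co/\\w+")
```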
Extract the 5 characters that follow the exact string “US”. The 5 periods inside the parentheses form a capture group that defines the part of the string to extract. The output is a matrix: the first column holds the full match and the second column holds the captured 5-character string, with NA where there is no match.
match_result <- str_match(df$content, "US(.....)")
match_result
## [,1] [,2]
## [1,] "US Sena" " Sena"
## [2,] NA NA
## [3,] NA NA
## [4,] "UST IN:" "T IN:"
## [5,] NA NA
## [6,] NA NA
## [7,] NA NA
## [8,] NA NA
## [9,] NA NA
## [10,] NA NA
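The capture-group idea extends to other patterns. As a sketch, the example below captures the account name that follows an ‘@’; the pattern and the toy vector `x` are assumptions for illustration.

```r
library(stringr)

# Hypothetical fragments for illustration
x <- c("mainstream media. ~ @nedryun", "no mention here")

# Column 1 is the full match, column 2 is the text captured by the parentheses
m <- str_match(x, "@(\\w+)")
handles <- m[, 2]
```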
Replace the first instance in each string that matches ‘u’ or ‘s’, in either case, with a percent sign (‘%’). The output is the initial vector of strings with the replaced characters.
replace_result <- str_replace(df$content, "[USus]", "%")
replace_result
## [1] "\"We have a %itting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
## [2] "Mar%hawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"
## [3] "Da%ghter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear. #BoycottNFL https://t.co/qDlFBGMeag"
## [4] "J%ST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"
## [5] "19,000 RE%PECTING our National Anthem! #StandForOurAnthem\U0001f1fa\U0001f1f8 https://t.co/czutyGaMQV"
## [6] "Dan Bongino: \"Nobody troll% liberals better than Donald Trump.\" Exactly! https://t.co/AigV93aC8J"
## [7] "\U0001f41d\U0001f41d\U0001f41d http%://t.co/MorL3AQW0z"
## [8] "'@%enatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't report on your crimes. This won't change the fact that you're going down.'"
## [9] "A% much as I hate promoting CNN article, here they are admitting EVERYTHING Trump said about PR relief two days ago. https://t.co/tZmSeA48oh"
## [10] "After the 'genocide' remark from %an Juan Mayor the narrative has changed though. @CNN fixes it's reporting constantly."
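As the output shows, str_replace changes only the first match in each string. To replace every match, stringr provides str_replace_all. A minimal sketch on a hypothetical string `x`:

```r
library(stringr)

# Hypothetical fragment for illustration
x <- "JUST IN: Trump"

first_only <- str_replace(x, "[USus]", "%")      # "J%ST IN: Trump"
every_match <- str_replace_all(x, "[USus]", "%") # "J%%T IN: Tr%mp"
```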
Split the strings at the hashtag symbol (‘#’). The output is a list of character vectors, one per input string; a string without a ‘#’ is returned unsplit as a single element.
split_result <- str_split(df$content, "#")
split_result
## [[1]]
## [1] "\"We have a sitting Democrat US Senator on trial for corruption and you've barely heard a peep from the mainstream media.\" ~ @nedryun https://t.co/gh6g0D1oiC"
##
## [[2]]
## [1] "Marshawn Lynch arrives to game in anti-Trump shirt. Judging by his sagging pants the shirt should say Lynch vs. belt https://t.co/mLH1i30LZZ"
##
## [[3]]
## [1] "Daughter of fallen Navy Sailor delivers powerful monologue on anthem protests, burns her NFL packers gear. "
## [2] "BoycottNFL https://t.co/qDlFBGMeag"
##
## [[4]]
## [1] "JUST IN: President Trump dedicates Presidents Cup golf tournament trophy to the people of Florida, Texas and Puerto Rico. https://t.co/z9wVa4djAE"
##
## [[5]]
## [1] "19,000 RESPECTING our National Anthem! "
## [2] "StandForOurAnthem\U0001f1fa\U0001f1f8 https://t.co/czutyGaMQV"
##
## [[6]]
## [1] "Dan Bongino: \"Nobody trolls liberals better than Donald Trump.\" Exactly! https://t.co/AigV93aC8J"
##
## [[7]]
## [1] "\U0001f41d\U0001f41d\U0001f41d https://t.co/MorL3AQW0z"
##
## [[8]]
## [1] "'@SenatorMenendez @CarmenYulinCruz Doesn't matter that CNN doesn't report on your crimes. This won't change the fact that you're going down.'"
##
## [[9]]
## [1] "As much as I hate promoting CNN article, here they are admitting EVERYTHING Trump said about PR relief two days ago. https://t.co/tZmSeA48oh"
##
## [[10]]
## [1] "After the 'genocide' remark from San Juan Mayor the narrative has changed though. @CNN fixes it's reporting constantly."
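When a rectangular result is easier to work with than a list, str_split accepts `simplify = TRUE` and returns a character matrix padded with empty strings. The sketch below uses a hypothetical toy vector `x`.

```r
library(stringr)

# Hypothetical fragments for illustration
x <- c("burns her NFL gear. #BoycottNFL", "no hashtag here")

# Default: a list with one character vector per input string
parts <- str_split(x, "#")
n_parts <- lengths(parts)                          # 2 1

# simplify = TRUE: a character matrix, padded with "" where needed
parts_matrix <- str_split(x, "#", simplify = TRUE)
```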
The stringr library provides easy-to-use string manipulation functions for data cleaning and preparation tasks. The functions operate on vectors of strings, which allows straightforward manipulation of entire columns in a data frame.
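As a closing sketch of that column-wise use, the example below adds derived columns to a small data frame. The data frame `toy` and the column names `mentions_us` and `n_chars` are hypothetical and stand in for the tweet data.

```r
library(stringr)

# Hypothetical data frame standing in for the tweet data
toy <- data.frame(
  content = c("US Senator on trial", "JUST IN: Trump"),
  stringsAsFactors = FALSE
)

# Each stringr call takes the whole column and returns one value per row
toy$mentions_us <- str_detect(toy$content, "\\bUS\\b")
toy$n_chars <- str_length(toy$content)
```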