Working with strings using the stringr library

Introduction

In programming, a string is simply a sequence of characters, usually enclosed in either single or double quotation marks, depending on the programming language. Strings are an important data type in almost all programming languages, where they are used to store human-readable information, e.g. sentences, characters and alphanumeric data. This page provides a comprehensive introduction to the manipulation of strings in R. The main focus will be on regular expressions, which are useful in describing patterns in strings. Knowledge of string manipulation with regular expressions is particularly important since strings usually hold unstructured or semi-structured data.

Install required packages

The following packages are required and should be loaded first. If they are not already installed, begin by installing them with the install.packages() function, e.g. install.packages("stringr").

library(stringr) # string manipulation functions (built on top of the stringi package)
library(kableExtra) # display table formatting

Creating strings

Strings can be created in R by enclosing them in either single or double quotation marks (with the latter being recommended). The example below illustrates this.

s1 = "This is my first string in R."
s2 = "You're going to have a lot of fun with programming in R."
s1; s2

## [1] "This is my first string in R."

## [1] "You're going to have a lot of fun with programming in R."

If a string contains double quotation marks, then you will need to use the backslash (\) to escape these double quotation marks. This is illustrated in the code below.

s3 = "The students answered, \"I like R programming.\""
s3

## [1] "The students answered, \"I like R programming.\""

Another way to achieve the above is to enclose the string in single quotation marks as shown in the code below (I personally prefer this alternative).

s4 = 'The students answered, "I like R programming."'
s4

## [1] "The students answered, \"I like R programming.\""

Remark: The printed representation (as shown in the output for s3 and s4) is different from the string itself, since the printed representation includes the escape characters (the backslashes) in the output.
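To see the difference between the printed representation and the string itself, one option is to compare print() with the base R function writeLines(), which writes the raw contents of the string. The following is a minimal sketch that reuses s3 and s4 from the examples above; str_length() is used only to confirm that the backslashes are not stored in the string.

```r
library(stringr)

# Two ways of writing the same string (taken from the examples above)
s3 = "The students answered, \"I like R programming.\""
s4 = 'The students answered, "I like R programming."'

# print() shows the escaped representation; writeLines() shows the raw contents
print(s3)
writeLines(s3)

# Both ways of writing the string give exactly the same value
identical(s3, s4)

# The escaping backslashes are not part of the string, so they are not counted
str_length(s3) == str_length(s4)
```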

Concatenating strings

When there are multiple strings, it is common to concatenate them into a single character vector. Concatenation of multiple strings is achieved using the function c(). For example, the strings "Female", "Female", "Male", "Female", "Male", "Male", "Female", "Male" can be concatenated into a single character vector named gender as follows.

gender = c("Female", "Female", "Male", "Female", "Male", "Male", "Female", "Male")
gender

## [1] "Female" "Female" "Male"   "Female" "Male"   "Male"   "Female" "Male"

Remark: While base R contains numerous functions for working with strings, we will use those from the stringr package since they have intuitive names that make them easy to remember. All string functions from the stringr package start with str_, which is particularly useful for RStudio users, because typing str_ triggers the RStudio auto-complete functionality that displays the functions from the stringr package in a pop-up dropdown list.

Combining strings

Two or more strings can be combined by using the function str_c().

day = "01"; month = "04"; year = "2021"
# combine
date1 = str_c(year, month, day)
date1

## [1] "20210401"

The argument sep can be added to control how the individual strings are separated. For example, in the syntax below, the forward slash (/) is specified as the separating character.

day = "01"; month = "04"; year = "2021"
# combine, separating by forward slash
date2 = str_c(year, month, day, sep = "/")
date2

## [1] "2021/04/01"

There are numerous other ways of combining multiple strings. Below are three more examples on this. Results are displayed separately immediately after the code.

x1 = str_c("x", 1:5)
x2 = str_c("year", seq(from = 2000, to = 2020, by = 5))
x3 = str_c(LETTERS[1:5], 5)
x1; x2; x3

## [1] "x1" "x2" "x3" "x4" "x5"

## [1] "year2000" "year2005" "year2010" "year2015" "year2020"

## [1] "A5" "B5" "C5" "D5" "E5"
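Because str_c() is vectorised, the same call can also build one string per position from several vectors of equal length. The following is a minimal sketch; the variable names days, months, years and dates are illustrative and the values are the dates used later on this page.

```r
library(stringr)

days   = c("01", "24", "18")
months = c("04", "07", "05")
years  = c("2021", "2018", "2019")

# Elements are combined position by position, with sep placed between them
dates = str_c(days, months, years, sep = "/")
dates
```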

Collapsing a vector of strings

Apart from combining multiple strings into a single string, the str_c() function can also be used to collapse the values of a vector into a single string. This is achieved by adding the argument collapse and specifying the character(s) used to separate the values.

s1 = c("x1", "x2", "x3", "x4", "x5")
# collapse into single string
s2 = str_c(s1, collapse = ", ")
s1; s2

## [1] "x1" "x2" "x3" "x4" "x5"

## [1] "x1, x2, x3, x4, x5"

It's important to notice that s1 is a character vector with five elements "x1", "x2", "x3", "x4", "x5", while s2 is a single string in which x1, x2, x3, x4 and x5 are separated by a comma and a space (, ).
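The difference between the two objects can be checked with length(), and sep and collapse can be used in the same call. The sketch below is illustrative only; the name terms is not part of the original examples.

```r
library(stringr)

s1 = c("x1", "x2", "x3", "x4", "x5")
s2 = str_c(s1, collapse = ", ")

# A vector of five strings versus a vector holding a single string
length(s1)
length(s2)

# sep joins the arguments element by element first;
# collapse then joins the resulting strings into one
terms = str_c("x", 1:3, sep = "_", collapse = " + ")
terms
```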

Extract parts of a string

Parts of a string can be extracted using the function str_sub(). In addition to the string itself, it takes two arguments, start and end, which respectively specify the character index where the extraction begins and the character index where it ends. When negative values are used for the start and end arguments, the counting begins from the rightmost character of the string (i.e. the last character).

First character

gender = c("Female", "Male")
# extract first letter
gendersub1 = str_sub(gender, start = 1, end = 1)
gendersub1

## [1] "F" "M"

First 2 characters

This will extract the day from the given dates.

date1 = c("01/04/2021", "24/07/2018", "18/05/2019")
day = str_sub(date1, start = 1, end = 2)
day

## [1] "01" "24" "18"

4th to 5th character

This will extract the month from the given dates.

date1 = c("01/04/2021", "24/07/2018", "18/05/2019")
month = str_sub(date1, start = 4, end = 5)
month

## [1] "04" "07" "05"

7th to 10th characters

This will extract the year from the given dates.

date1 = c("01/04/2021", "24/07/2018", "18/05/2019")
year = str_sub(date1, start = 7, end = 10)
year

## [1] "2021" "2018" "2019"

The above can also be achieved by using the following code.

date1 = c("01/04/2021", "24/07/2018", "18/05/2019")
dateyear = str_sub(date1, start = -4, end = -1)
dateyear

## [1] "2021" "2018" "2019"

The arguments start = -4 and end = -1 are interpreted as the last four characters in the specified string (which is date1 in our case).
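Positive and negative positions can also be mixed, and end can be left out when the extraction should run to the end of the string. The following is a small sketch using an illustrative string; the names word, last3 and middle are not part of the original examples.

```r
library(stringr)

word = "stringr"

# With only start given, extraction runs to the end of the string
last3 = str_sub(word, start = -3)

# From the 2nd character up to the second-to-last character
middle = str_sub(word, start = 2, end = -2)

last3; middle
```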

Another use of the str_sub() function that can be very helpful in data management is to modify strings. In the following, this function is used to capitalize the first character of the names of the students.

students = c("james", "mark", "rose", "daniel", "christine")
str_sub(students, 1, 1) = str_to_upper(str_sub(students, 1, 1))
students

## [1] "James"     "Mark"      "Rose"      "Daniel"    "Christine"

Regular expressions for matching patterns in a string

Regular expressions are useful in describing patterns in strings. Knowledge of string manipulation with regular expressions is particularly important since strings usually hold unstructured or semi-structured data.

Exact match

students = c("James", "Mark", "Rose", "Daniel", "Christine", "Dancun", "Mary")
students = str_view(students, "ar")
students

The code displays the values of the character vector students and highlights the characters that exactly match the specified text (which in our case is ar). Depending on the installed version of stringr, str_view() may show all values or only those that contain a match.

Match any character

students = c("James", "Mark", "Rose", "Daniel", "Christine", "Dancun", "Mary")
students = str_view(students, ".r.")
students
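In a regular expression, the dot (.) stands for any single character, so the pattern .r. matches a lowercase r with one character on either side. Since the highlighted output of str_view() is not reproduced here, the sketch below uses str_extract() (a stringr function not otherwise covered on this page) to show what the pattern picks up in each name; the variable name matched is illustrative.

```r
library(stringr)

students = c("James", "Mark", "Rose", "Daniel", "Christine", "Dancun", "Mary")

# Returns the first match in each string, or NA where there is no match
matched = str_extract(students, ".r.")
matched
```

Names without a lowercase r that has a character on both sides (for example Rose, where the R is uppercase) give NA.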

Matching the start of a string

Strings in the character vector that start with the characters Dan.

students = c("James", "Mark", "Rose", "Daniel", "Christine", "Dancun", "Mary")
students = str_view(students, "^Dan")
students

Matching the end of a string

Strings in the character vector that end with the character e.

students = c("James", "Mark", "Rose", "Daniel", "Christine", "Dancun", "Mary")
students = str_view(students, "e$")
students
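The anchors ^ (start of the string) and $ (end of the string) can also be checked without the highlighted view. The following is a minimal sketch that uses str_detect(), described further down this page, to keep only the matching names; starts_dan and ends_e are illustrative names.

```r
library(stringr)

students = c("James", "Mark", "Rose", "Daniel", "Christine", "Dancun", "Mary")

# ^ anchors the pattern to the start of each string
starts_dan = students[str_detect(students, "^Dan")]

# $ anchors the pattern to the end of each string
ends_e = students[str_detect(students, "e$")]

starts_dan; ends_e
```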

Matching a complete string

Strings in the character vector that exactly match the characters Kenya.

students = c("Kenya James", "Uganda Mark", "Rusia Rose", 
             "USA Daniel", "Kenya", "UK Dancun", "China Mary")
students = str_view(students, "^Kenya$")
students

Note that failing to enclose the string Kenya with the special characters ^ and $ causes the function to highlight all strings that contain the characters Kenya. This is shown below.

students = c("Kenya James", "Uganda Mark", "Rusia Rose", 
             "USA Daniel", "Kenya", "UK Dancun", "China Mary")
students = str_view(students, "Kenya")
students

The second argument can be customized to specify multiple options. In the syntax below, for example, the value "col(ou|o)r" is specified to highlight the occurrence of both colour and color.

colours = c("colour", "color")
colours = str_view(colours, "col(ou|o)r")
colours
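The alternation (ou|o) accepts either ou or o at that position and nothing else. As a rough check, the sketch below tests the pattern against two correct spellings and two misspellings; the vector spellings is made up for illustration, and colou?r is shown as an equivalent pattern that makes the u optional.

```r
library(stringr)

spellings = c("colour", "color", "colr", "coloor")

# Alternation: "col", then either "ou" or "o", then "r"
with_alternation = str_detect(spellings, "col(ou|o)r")

# Equivalent pattern: the ? makes the preceding "u" optional
with_optional_u = str_detect(spellings, "colou?r")

with_alternation; with_optional_u
```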

Detect matches

The function str_detect() returns a logical vector (TRUE/FALSE) to indicate if a string in a character vector matches a specified pattern. The value TRUE indicates that the string matches a specified pattern and FALSE otherwise.

students = c("Kenya James", "Uganda Mark", "Rusia Rose", 
             "USA Daniel", "Kenya", "UK Dancun", "China Mary")
students = str_detect(students, "^Kenya")
students

## [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
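Since TRUE and FALSE are treated as 1 and 0 in arithmetic, the logical vector returned by str_detect() can be summarised directly. The following is a small sketch; the names is_kenya, n_kenya and share_kenya are illustrative.

```r
library(stringr)

students = c("Kenya James", "Uganda Mark", "Rusia Rose",
             "USA Daniel", "Kenya", "UK Dancun", "China Mary")

is_kenya = str_detect(students, "^Kenya")

# Number of strings that match the pattern
n_kenya = sum(is_kenya)

# Proportion of strings that match the pattern
share_kenya = mean(is_kenya)

n_kenya; share_kenya
```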

Extract matches

In the above syntax, we just returned a logical vector indicating whether (TRUE) or not (FALSE) the string matched the specified pattern. Ideally, you would like to return the particular string(s) that match the given pattern.

students = c("Kenya James", "Uganda Mark", "Rusia Rose", 
             "USA Daniel", "Kenya", "UK Dancun", "China Mary")
students = students[str_detect(students, "^Kenya")]
students

## [1] "Kenya James" "Kenya"

Detect non-matches

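Sometimes you may be interested in those strings that do not match a specified pattern. This is done by negating the match with the argument negate = TRUE, as shown in the code below. A pattern such as ^[^Kenya] is not suitable for this purpose, because [^Kenya] is a character class that only checks that the first character is not K, e, n, y or a.

students = c("Kenya James", "Uganda Mark", "Rusia Rose", 
             "USA Daniel", "Kenya", "UK Dancun", "China Mary")
students = str_detect(students, "^Kenya", negate = TRUE)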
students

## [1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

The code returns a logical vector of TRUE and FALSE as seen earlier. In the next section, we are going to return the strings that do not match the specified pattern.

Extract non-matches

students = c("Kenya James", "Uganda Mark", "Rusia Rose", 
             "USA Daniel", "Kenya", "UK Dancun", "China Mary")
students = students[str_detect(students, "^Kenya", negate = TRUE)]
students

## [1] "Uganda Mark" "Rusia Rose"  "USA Daniel"  "UK Dancun"   "China Mary"

Extract matches using str_subset() function

In the preceding sections, we have used the function str_detect() to detect matches and, combined with subsetting, to extract the strings that match or do not match the specified pattern. In this section, we introduce the function str_subset(), which returns the matches directly.

students = c("Kenya James", "Uganda Mark", "Rusia Rose", 
             "USA Daniel", "Kenya", "UK Dancun", "China Mary")
students = str_subset(students, "^Kenya")
students

## [1] "Kenya James" "Kenya"

Extract non-matches using str_subset() function

Similarly, strings that do not match the specified pattern can be extracted. This is illustrated in the following code.

students = c("Kenya James", "Uganda Mark", "Rusia Rose", 
             "USA Daniel", "Kenya", "UK Dancun", "China Mary")
students = str_subset(students, "^Kenya", negate = TRUE)
students

## [1] "Uganda Mark" "Rusia Rose"  "USA Daniel"  "UK Dancun"   "China Mary"
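A note on negation: inside square brackets, ^ creates a negated character class, so ^[^Kenya] means "the first character is none of K, e, n, y, a" rather than "does not start with Kenya". The two can give different answers. The sketch below compares them on a made-up vector (countries is illustrative and not part of the examples above), using the negate argument of str_detect().

```r
library(stringr)

countries = c("Kenya", "Kuwait", "Egypt", "eSwatini", "Uganda")

# Character class: TRUE when the first character is not one of K, e, n, y, a
by_class = str_detect(countries, "^[^Kenya]")

# Negated match: TRUE when the string does not start with "Kenya"
by_negate = str_detect(countries, "^Kenya", negate = TRUE)

data.frame(countries, by_class, by_negate)
```

Here Kuwait and eSwatini do not start with Kenya, yet the character class version returns FALSE for them because their first letters (K and e) belong to the class.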

Splitting strings

The function str_split() is used to split strings into parts.

date1 = c("01/04/2021", "24/07/2018", "18/05/2019")
datesplit = str_split(date1, pattern = "/")
datesplit

## [[1]]
## [1] "01"   "04"   "2021"
## 
## [[2]]
## [1] "24"   "07"   "2018"
## 
## [[3]]
## [1] "18"   "05"   "2019"

The results of splitting each of the elements of the vector at the forward slash (/) are stored in a list. This list can be converted into a data frame for enhanced presentation as follows.

datematrix = as.data.frame(t(as.data.frame(datesplit)))
datematrix = data.frame(date1, datematrix)
row.names(datematrix) = 1:nrow(datematrix)
names(datematrix) = c("date", "day", "month", "year")
datematrix

##         date day month year
## 1 01/04/2021  01    04 2021
## 2 24/07/2018  24    07 2018
## 3 18/05/2019  18    05 2019
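When every string splits into the same number of parts, a possible shortcut is str_split_fixed(), which returns a character matrix instead of a list. The following is a sketch of that alternative and not the approach used above; the names parts and datetable are illustrative.

```r
library(stringr)

date1 = c("01/04/2021", "24/07/2018", "18/05/2019")

# One row per date, one column per piece (n = 3 pieces expected)
parts = str_split_fixed(date1, pattern = "/", n = 3)

# Build the data frame directly from the matrix columns
datetable = data.frame(date = date1,
                       day = parts[, 1],
                       month = parts[, 2],
                       year = parts[, 3],
                       stringsAsFactors = FALSE)
datetable
```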

In this tutorial, we have covered the functions from the stringr library that you will most commonly need when manipulating strings in the R programming language. The list of functions we have discussed is certainly not exhaustive. The stringi library, on which stringr is built, offers a much wider range of string functions: it has a total of 250 functions (at the time of writing this tutorial). The code below displays all functions of the stringi package.

library(stringi)
ls("package:stringi")

##   [1] "%s!=%"                         "%s!==%"                       
##   [3] "%s$%"                          "%s*%"                         
##   [5] "%s+%"                          "%s<%"                         
##   [7] "%s<=%"                         "%s==%"                        
##   [9] "%s===%"                        "%s>%"                         
##  [11] "%s>=%"                         "%stri!=%"                     
##  [13] "%stri!==%"                     "%stri$%"                      
##  [15] "%stri*%"                       "%stri+%"                      
##  [17] "%stri<%"                       "%stri<=%"                     
##  [19] "%stri==%"                      "%stri===%"                    
##  [21] "%stri>%"                       "%stri>=%"                     
##  [23] "stri_c"                        "stri_c_list"                  
##  [25] "stri_cmp"                      "stri_cmp_eq"                  
##  [27] "stri_cmp_equiv"                "stri_cmp_ge"                  
##  [29] "stri_cmp_gt"                   "stri_cmp_le"                  
##  [31] "stri_cmp_lt"                   "stri_cmp_neq"                 
##  [33] "stri_cmp_nequiv"               "stri_coll"                    
##  [35] "stri_compare"                  "stri_conv"                    
##  [37] "stri_count"                    "stri_count_boundaries"        
##  [39] "stri_count_charclass"          "stri_count_coll"              
##  [41] "stri_count_fixed"              "stri_count_regex"             
##  [43] "stri_count_words"              "stri_datetime_add"            
##  [45] "stri_datetime_add<-"           "stri_datetime_create"         
##  [47] "stri_datetime_fields"          "stri_datetime_format"         
##  [49] "stri_datetime_fstr"            "stri_datetime_now"            
##  [51] "stri_datetime_parse"           "stri_datetime_symbols"        
##  [53] "stri_detect"                   "stri_detect_charclass"        
##  [55] "stri_detect_coll"              "stri_detect_fixed"            
##  [57] "stri_detect_regex"             "stri_dup"                     
##  [59] "stri_duplicated"               "stri_duplicated_any"          
##  [61] "stri_enc_detect"               "stri_enc_detect2"             
##  [63] "stri_enc_fromutf32"            "stri_enc_get"                 
##  [65] "stri_enc_info"                 "stri_enc_isascii"             
##  [67] "stri_enc_isutf16be"            "stri_enc_isutf16le"           
##  [69] "stri_enc_isutf32be"            "stri_enc_isutf32le"           
##  [71] "stri_enc_isutf8"               "stri_enc_list"                
##  [73] "stri_enc_mark"                 "stri_enc_set"                 
##  [75] "stri_enc_toascii"              "stri_enc_tonative"            
##  [77] "stri_enc_toutf32"              "stri_enc_toutf8"              
##  [79] "stri_encode"                   "stri_endswith"                
##  [81] "stri_endswith_charclass"       "stri_endswith_coll"           
##  [83] "stri_endswith_fixed"           "stri_escape_unicode"          
##  [85] "stri_extract"                  "stri_extract_all"             
##  [87] "stri_extract_all_boundaries"   "stri_extract_all_charclass"   
##  [89] "stri_extract_all_coll"         "stri_extract_all_fixed"       
##  [91] "stri_extract_all_regex"        "stri_extract_all_words"       
##  [93] "stri_extract_first"            "stri_extract_first_boundaries"
##  [95] "stri_extract_first_charclass"  "stri_extract_first_coll"      
##  [97] "stri_extract_first_fixed"      "stri_extract_first_regex"     
##  [99] "stri_extract_first_words"      "stri_extract_last"            
## [101] "stri_extract_last_boundaries"  "stri_extract_last_charclass"  
## [103] "stri_extract_last_coll"        "stri_extract_last_fixed"      
## [105] "stri_extract_last_regex"       "stri_extract_last_words"      
## [107] "stri_flatten"                  "stri_info"                    
## [109] "stri_isempty"                  "stri_join"                    
## [111] "stri_join_list"                "stri_length"                  
## [113] "stri_list2matrix"              "stri_locale_get"              
## [115] "stri_locale_info"              "stri_locale_list"             
## [117] "stri_locale_set"               "stri_locate"                  
## [119] "stri_locate_all"               "stri_locate_all_boundaries"   
## [121] "stri_locate_all_charclass"     "stri_locate_all_coll"         
## [123] "stri_locate_all_fixed"         "stri_locate_all_regex"        
## [125] "stri_locate_all_words"         "stri_locate_first"            
## [127] "stri_locate_first_boundaries"  "stri_locate_first_charclass"  
## [129] "stri_locate_first_coll"        "stri_locate_first_fixed"      
## [131] "stri_locate_first_regex"       "stri_locate_first_words"      
## [133] "stri_locate_last"              "stri_locate_last_boundaries"  
## [135] "stri_locate_last_charclass"    "stri_locate_last_coll"        
## [137] "stri_locate_last_fixed"        "stri_locate_last_regex"       
## [139] "stri_locate_last_words"        "stri_match"                   
## [141] "stri_match_all"                "stri_match_all_regex"         
## [143] "stri_match_first"              "stri_match_first_regex"       
## [145] "stri_match_last"               "stri_match_last_regex"        
## [147] "stri_na2empty"                 "stri_numbytes"                
## [149] "stri_omit_empty"               "stri_omit_empty_na"           
## [151] "stri_omit_na"                  "stri_opts_brkiter"            
## [153] "stri_opts_collator"            "stri_opts_fixed"              
## [155] "stri_opts_regex"               "stri_order"                   
## [157] "stri_pad"                      "stri_pad_both"                
## [159] "stri_pad_left"                 "stri_pad_right"               
## [161] "stri_paste"                    "stri_paste_list"              
## [163] "stri_rand_lipsum"              "stri_rand_shuffle"            
## [165] "stri_rand_strings"             "stri_read_lines"              
## [167] "stri_read_raw"                 "stri_remove_empty"            
## [169] "stri_remove_empty_na"          "stri_remove_na"               
## [171] "stri_replace"                  "stri_replace_all"             
## [173] "stri_replace_all_charclass"    "stri_replace_all_coll"        
## [175] "stri_replace_all_fixed"        "stri_replace_all_regex"       
## [177] "stri_replace_first"            "stri_replace_first_charclass" 
## [179] "stri_replace_first_coll"       "stri_replace_first_fixed"     
## [181] "stri_replace_first_regex"      "stri_replace_last"            
## [183] "stri_replace_last_charclass"   "stri_replace_last_coll"       
## [185] "stri_replace_last_fixed"       "stri_replace_last_regex"      
## [187] "stri_replace_na"               "stri_reverse"                 
## [189] "stri_sort"                     "stri_sort_key"                
## [191] "stri_split"                    "stri_split_boundaries"        
## [193] "stri_split_charclass"          "stri_split_coll"              
## [195] "stri_split_fixed"              "stri_split_lines"             
## [197] "stri_split_lines1"             "stri_split_regex"             
## [199] "stri_startswith"               "stri_startswith_charclass"    
## [201] "stri_startswith_coll"          "stri_startswith_fixed"        
## [203] "stri_stats_general"            "stri_stats_latex"             
## [205] "stri_sub"                      "stri_sub_all"                 
## [207] "stri_sub_all_replace"          "stri_sub_all<-"               
## [209] "stri_sub_replace"              "stri_sub_replace_all"         
## [211] "stri_sub<-"                    "stri_subset"                  
## [213] "stri_subset_charclass"         "stri_subset_charclass<-"      
## [215] "stri_subset_coll"              "stri_subset_coll<-"           
## [217] "stri_subset_fixed"             "stri_subset_fixed<-"          
## [219] "stri_subset_regex"             "stri_subset_regex<-"          
## [221] "stri_subset<-"                 "stri_timezone_get"            
## [223] "stri_timezone_info"            "stri_timezone_list"           
## [225] "stri_timezone_set"             "stri_trans_char"              
## [227] "stri_trans_general"            "stri_trans_isnfc"             
## [229] "stri_trans_isnfd"              "stri_trans_isnfkc"            
## [231] "stri_trans_isnfkc_casefold"    "stri_trans_isnfkd"            
## [233] "stri_trans_list"               "stri_trans_nfc"               
## [235] "stri_trans_nfd"                "stri_trans_nfkc"              
## [237] "stri_trans_nfkc_casefold"      "stri_trans_nfkd"              
## [239] "stri_trans_tolower"            "stri_trans_totitle"           
## [241] "stri_trans_toupper"            "stri_trim"                    
## [243] "stri_trim_both"                "stri_trim_left"               
## [245] "stri_trim_right"               "stri_unescape_unicode"        
## [247] "stri_unique"                   "stri_width"                   
## [249] "stri_wrap"                     "stri_write_lines"

STEM Research
https://stemresearchs.com