Automated Text Analysis: Week5-1

Basic String Operations

The stringr package provides functions for (1) basic string manipulation and (2) regular-expression operations. We start with the functions for basic manipulation.

The following table shows some stringr functions for basic string operations:

Function	Description	Similar Base Function
`str_length()`	number of characters in each string	`nchar()`
`str_split()`	splits a string into pieces	`strsplit()`
`str_c()`	concatenates strings	`paste()`, `paste0()`
`str_squish()`	removes redundant whitespace	(none)
`str_detect()`	tests whether a pattern occurs in a string	`grepl()`
`str_view_all()`	displays the matches on screen	(none)

Note that most functions in stringr start with "str_", followed by a term that describes the task they perform.

Count the number of characters in a string with `str_length()`

The function str_length() is equivalent to the base function nchar(). Both return an integer vector giving the number of characters in each string. Do not confuse this with length(), which returns the number of elements in a vector.

library(stringr)
str_length(c("abc", "adfds ", "1234343", ".!@#"))

## [1] 3 6 7 4

nchar(c("abc", "adfds ", "1234343", ".!@#"))

## [1] 3 6 7 4

length(c("abc", "adfds ", "1234343", ".!@#"))

## [1] 4

load("covid_text_word.RData")

covid_text_word[1:100] # First 100 elements of the character vector covid_text_word

##   [1] "COVID-19"                "vaccine\r\nA"           
##   [3] "COVID<U+2011>19"         "vaccine"                
##   [5] "is"                      "a"                      
##   [7] "vaccine"                 "intended"               
##   [9] "to"                      "provide"                
##  [11] "acquired"                "immunity\r\nagainst"    
##  [13] "severe"                  "acute"                  
##  [15] "respiratory"             "syndrome"               
##  [17] "coronavirus"             "2"                      
##  [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"           
##  [21] "causing"                 "coronavirus"            
##  [23] "disease"                 "2019"                   
##  [25] "(COVID<U+2011>19)."      "Prior"                  
##  [27] "to"                      "the\r\nCOVID<U+2011>19" 
##  [29] "pandemic,"               "there"                  
##  [31] "was"                     "an"                     
##  [33] "established"             "body"                   
##  [35] "of"                      "knowledge"              
##  [37] "about\r\nthe"            "structure"              
##  [39] "and"                     "function"               
##  [41] "of"                      "coronaviruses"          
##  [43] "causing"                 "diseases"               
##  [45] "like"                    "severe"                 
##  [47] "acute\r\nrespiratory"    "syndrome"               
##  [49] "(SARS)"                  "and"                    
##  [51] "Middle"                  "East"                   
##  [53] "respiratory"             "syndrome\r\n(MERS),"    
##  [55] "which"                   "enabled"                
##  [57] "accelerated"             "development"            
##  [59] "of"                      "various"                
##  [61] "vaccine\r\ntechnologies" "during"                 
##  [63] "early"                   "2020.[1]"               
##  [65] "On"                      "10"                     
##  [67] "January"                 "2020,"                  
##  [69] "the"                     "SARS-CoV-2\r\ngenetic"  
##  [71] "sequence"                "data"                   
##  [73] "was"                     "shared"                 
##  [75] "through"                 "GISAID,"                
##  [77] "and"                     "by"                     
##  [79] "19"                      "March,"                 
##  [81] "the\r\nglobal"           "pharmaceutical"         
##  [83] "industry"                "announced"              
##  [85] "a"                       "major"                  
##  [87] "commitment"              "to"                     
##  [89] "address\r\nCOVID-19.[2]" ""                       
##  [91] ""                        ""                       
##  [93] ""                        ""                       
##  [95] ""                        ""                       
##  [97] ""                        ""                       
##  [99] ""                        ""

str_length(covid_text_word[1:100]) # Some elements are empty strings (length 0)

##   [1]  8 10  8  7  2  1  7  8  2  7  8 17  6  5 11  8 11  1 13 10  7 11  7  4 11
##  [26]  5  2 13  9  5  3  2 11  4  2  9 10  9  3  8  2 13  7  8  4  6 18  8  6  3
##  [51]  6  4 11 17  5  7 11 11  2  7 21  6  5  8  2  2  7  5  3 19  8  4  3  6  7
##  [76]  7  3  2  2  6 11 14  8  9  1  5 10  2 21  0  0  0  0  0  0  0  0  0  0  0

str_length(covid_text_word[1:100])>0

##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [85]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE

covid_text_word[1:100][str_length(covid_text_word[1:100])>0]

##  [1] "COVID-19"                "vaccine\r\nA"           
##  [3] "COVID<U+2011>19"         "vaccine"                
##  [5] "is"                      "a"                      
##  [7] "vaccine"                 "intended"               
##  [9] "to"                      "provide"                
## [11] "acquired"                "immunity\r\nagainst"    
## [13] "severe"                  "acute"                  
## [15] "respiratory"             "syndrome"               
## [17] "coronavirus"             "2"                      
## [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"           
## [21] "causing"                 "coronavirus"            
## [23] "disease"                 "2019"                   
## [25] "(COVID<U+2011>19)."      "Prior"                  
## [27] "to"                      "the\r\nCOVID<U+2011>19" 
## [29] "pandemic,"               "there"                  
## [31] "was"                     "an"                     
## [33] "established"             "body"                   
## [35] "of"                      "knowledge"              
## [37] "about\r\nthe"            "structure"              
## [39] "and"                     "function"               
## [41] "of"                      "coronaviruses"          
## [43] "causing"                 "diseases"               
## [45] "like"                    "severe"                 
## [47] "acute\r\nrespiratory"    "syndrome"               
## [49] "(SARS)"                  "and"                    
## [51] "Middle"                  "East"                   
## [53] "respiratory"             "syndrome\r\n(MERS),"    
## [55] "which"                   "enabled"                
## [57] "accelerated"             "development"            
## [59] "of"                      "various"                
## [61] "vaccine\r\ntechnologies" "during"                 
## [63] "early"                   "2020.[1]"               
## [65] "On"                      "10"                     
## [67] "January"                 "2020,"                  
## [69] "the"                     "SARS-CoV-2\r\ngenetic"  
## [71] "sequence"                "data"                   
## [73] "was"                     "shared"                 
## [75] "through"                 "GISAID,"                
## [77] "and"                     "by"                     
## [79] "19"                      "March,"                 
## [81] "the\r\nglobal"           "pharmaceutical"         
## [83] "industry"                "announced"              
## [85] "a"                       "major"                  
## [87] "commitment"              "to"                     
## [89] "address\r\nCOVID-19.[2]"

length(covid_text_word[1:100])

## [1] 100
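The filtering step above can be tried without the data file. The following is a minimal sketch that uses a small made-up vector, `words`, as a stand-in for `covid_text_word`. It also shows that the base function `nzchar()` should give the same result as `str_length() > 0`.

```r
library(stringr)

# Hypothetical stand-in for covid_text_word: a few words plus empty strings
words <- c("COVID-19", "vaccine", "", "is", "", "")

# Keep only the elements that contain at least one character
non_empty <- words[str_length(words) > 0]

# Base R alternative: nzchar() is TRUE for every non-empty string
non_empty_base <- words[nzchar(words)]

length(words)      # number of elements before filtering
length(non_empty)  # number of elements after filtering
```

Comparing `length()` before and after filtering is a quick way to check how many empty strings were dropped.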

Split up a string into pieces with `str_split()`

The function str_split() is equivalent to the base function strsplit(). Both functions split a string into a variable number of pieces and return a list of character vectors.

strsplit("COVID-19 vaccine\r\nA COVID‑19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2), the\r\nvirus causing coronavirus disease 2019 (COVID‑19).", split=" ") # Split the string at every space

## [[1]]
##  [1] "COVID-19"                    "vaccine\r\nA"               
##  [3] "COVID<U+2011>19"             "vaccine"                    
##  [5] "is"                          "a"                          
##  [7] "vaccine"                     "intended"                   
##  [9] "to"                          "provide"                    
## [11] "acquired"                    "immunity\r\nagainst"        
## [13] "severe"                      "acute"                      
## [15] "respiratory"                 "syndrome"                   
## [17] "coronavirus"                 "2"                          
## [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"               
## [21] "causing"                     "coronavirus"                
## [23] "disease"                     "2019"                       
## [25] "(COVID<U+2011>19)."

str_split("COVID-19 vaccine\r\nA COVID‑19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2), the\r\nvirus causing coronavirus disease 2019 (COVID‑19).", pattern=" ") # Do the same

## [[1]]
##  [1] "COVID-19"                    "vaccine\r\nA"               
##  [3] "COVID<U+2011>19"             "vaccine"                    
##  [5] "is"                          "a"                          
##  [7] "vaccine"                     "intended"                   
##  [9] "to"                          "provide"                    
## [11] "acquired"                    "immunity\r\nagainst"        
## [13] "severe"                      "acute"                      
## [15] "respiratory"                 "syndrome"                   
## [17] "coronavirus"                 "2"                          
## [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"               
## [21] "causing"                     "coronavirus"                
## [23] "disease"                     "2019"                       
## [25] "(COVID<U+2011>19)."
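Both calls above leave `\r\n` inside pieces such as "vaccine\r\nA", because a single space is the only delimiter. One possible refinement, sketched below on a shortened made-up string, relies on the fact that `pattern` in `str_split()` is interpreted as a regular expression by default, so `\\s+` splits on any run of whitespace, including line breaks.

```r
library(stringr)

# Shortened example string (an assumption for illustration only)
txt <- "COVID-19 vaccine\r\nA vaccine is intended"

# Split on a single space: the line break stays inside one piece
by_space <- str_split(txt, pattern = " ")[[1]]

# Split on one or more whitespace characters, including \r and \n
by_whitespace <- str_split(txt, pattern = "\\s+")[[1]]

by_space
by_whitespace
```

The `[[1]]` extracts the character vector from the list that `str_split()` returns for a single input string.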

Concatenating with `str_c()`

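This function is similar to the base function paste(). The main difference is the default separator: str_c() joins its arguments with no separator, like paste0(), whereas paste() inserts a space. With the collapse argument, as used below, the two behave the same.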

covid_sent <- str_split("COVID-19 vaccine\r\nA COVID‑19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2), the\r\nvirus causing coronavirus disease 2019 (COVID‑19).", pattern=" ") # Split the opening sentences into words
class(covid_sent)

## [1] "list"

class(unlist(covid_sent))

## [1] "character"

paste(unlist(covid_sent), collapse = " ") # Concatenate the character vector into a single string

## [1] "COVID-19 vaccine\r\nA COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS<U+2011>CoV<U+2011>2), the\r\nvirus causing coronavirus disease 2019 (COVID<U+2011>19)."

str_c(unlist(covid_sent), collapse = " ")

## [1] "COVID-19 vaccine\r\nA COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS<U+2011>CoV<U+2011>2), the\r\nvirus causing coronavirus disease 2019 (COVID<U+2011>19)."
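The example above uses `collapse`, where `paste()` and `str_c()` agree. As a small sketch with made-up inputs, the calls below show where the two differ: the default separator and the treatment of missing values.

```r
library(stringr)

# Default separators differ: str_c() uses "", paste() uses " "
str_c("COVID", "19")             # "COVID19"
paste("COVID", "19")             # "COVID 19"
paste0("COVID", "19")            # "COVID19"

# With an explicit separator the two functions agree
str_c("COVID", "19", sep = "-")  # "COVID-19"
paste("COVID", "19", sep = "-")  # "COVID-19"

# Missing values: str_c() returns NA, paste() inserts the text "NA"
str_c("vaccine", NA_character_)  # NA
paste("vaccine", NA_character_)  # "vaccine NA"
```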

Trimming with `str_squish()`

A typical task in string processing is parsing a text into individual words. The resulting words often carry leading or trailing blanks (whitespace), and some sentences contain long runs of whitespace. In this situation, we can use the str_squish() function: it removes whitespace at both ends of a string and collapses any run of whitespace inside the string into a single space. Its usage is very simple:

str_squish(string)

The input is the string (or vector of strings) to be cleaned; any redundant whitespace is removed.

Consider the following string, which contains line breaks (\r\n) within the text.

covid_sent_trim <- str_squish("COVID-19 vaccine\r\nA COVID‑19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2), the\r\nvirus causing coronavirus disease 2019 (COVID‑19).") 
covid_sent_trim

## [1] "COVID-19 vaccine A COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity against severe acute respiratory syndrome coronavirus 2 (SARS<U+2011>CoV<U+2011>2), the virus causing coronavirus disease 2019 (COVID<U+2011>19)."
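To make the behavior of `str_squish()` more concrete, the sketch below compares it with `str_trim()`, another stringr function, on a made-up string that has whitespace at both ends and in the middle.

```r
library(stringr)

# Made-up string with leading, trailing, and repeated inner whitespace
messy <- "  COVID-19   vaccine \r\n is  a vaccine  "

trimmed  <- str_trim(messy)    # removes whitespace at both ends only
squished <- str_squish(messy)  # also collapses inner runs to a single space

trimmed
squished
```

`str_trim()` leaves the inner spacing and the line break untouched, so `str_squish()` is usually the better fit when the text will later be split into words.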

1. White space: solved!

But what about punctuation marks and numbers?

2. Punctuation marks: so far we cannot specify which punctuation marks appear and should be removed.
3. Numbers: so far we cannot specify which numbers appear and should be removed.

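To solve these problems, we need to take things to the next level: pattern matching with regular expressions, the second group of stringr functions mentioned at the start.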
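As a brief preview only, and assuming the next step is pattern matching with regular expressions, the sketch below removes punctuation and digits from a made-up string with `str_replace_all()` and the character classes `[[:punct:]]` and `[[:digit:]]`. It is one possible approach, not necessarily the one this course takes.

```r
library(stringr)

# Made-up example string (not taken from the data set)
txt <- "COVID-19 vaccine (2019)."

no_punct  <- str_replace_all(txt, "[[:punct:]]", "")       # drop punctuation
no_digits <- str_replace_all(no_punct, "[[:digit:]]", "")  # drop digits
cleaned   <- str_squish(no_digits)                         # tidy leftover spaces

cleaned
```

Removing digits can leave stray spaces behind, which is why `str_squish()` is applied last.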

Shin Lee

2021/3/29
