Characters and Strings

A character vector is a collection of characters (e.g. letters or symbols). A typical use is to store letters or words as a character vector, such as

CV1 <- c("T","e","x","t")

A piece of text can also be stored word by word, where each element is itself a sequence of characters (letters, numbers, and symbols):

CV2 <- c("Text","mining","is","interesting!")

The data type R provides for storing sequences of characters is the character vector. Formally, the class of an object that holds character strings in R is "character".
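As a quick check of the statement above, the following sketch inspects the two example vectors with base R functions only. The vectors are re-created here so the snippet runs on its own.

```r
# Re-create the two example vectors from above
CV1 <- c("T", "e", "x", "t")
CV2 <- c("Text", "mining", "is", "interesting!")

class(CV1)         # "character"
class(CV2)         # "character"
is.character(CV2)  # TRUE
length(CV2)        # 4 elements
nchar(CV2)         # characters per element: 4 6 2 12
```

Both objects have the same class, even though the elements of CV1 are single letters and the elements of CV2 are whole words.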

A string is the result of concatenating (pasting) the elements of a character vector into a single unit. In this sense, a string is a container for a piece of text, such as a word or a sentence. A vector of strings looks like this:

SV1 <- c("Text mining is interesting! We are going to learn about how to process text using R.", "Is RStudio easy to learn? Many people say so, indeed.")
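The idea of pasting the elements of a character vector into a single string can be sketched with the base function paste(). The object name S1 is only an illustrative choice.

```r
# A character vector of four words
CV2 <- c("Text", "mining", "is", "interesting!")

# Paste the elements into one string, separated by single spaces
S1 <- paste(CV2, collapse = " ")

S1          # "Text mining is interesting!"
length(S1)  # 1: the four words are now a single string
```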

In text pre-processing we mostly work with strings rather than with individual characters. Tokenization is the process of splitting a string into word units.
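A minimal sketch of tokenization, assuming that words are separated by single spaces. It uses the base function strsplit(); real text usually needs more careful handling of punctuation and line breaks.

```r
S1 <- "Text mining is interesting!"

# strsplit() returns a list with one element per input string,
# so [[1]] extracts the word vector for our single string
tokens <- strsplit(S1, split = " ")[[1]]

tokens          # "Text" "mining" "is" "interesting!"
length(tokens)  # 4
```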

We can express strings by surrounding text with double quotes:

string1 <- "a character string is surrounded by double quotes"

You can also surround text with single quotes:

string2 <- 'a character string is surrounded by single quotes'

Note that we must match the type of quotes that we are using: a starting single quote must have an ending single quote.
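The matching rule also explains how to place a quote character inside a string. The sketch below shows three common options; the object names string3 to string5 are illustrative only.

```r
# A single quote can appear inside a double-quoted string
string3 <- "It's easy to put a single quote inside double quotes"

# Double quotes can appear inside a single-quoted string
string4 <- 'She said "hello" inside single quotes'

# Alternatively, escape the inner quotes with a backslash
string5 <- "She said \"hello\" using escaped double quotes"

# Both quote styles create exactly the same kind of object
identical("text", 'text')  # TRUE

# cat() prints the string without the escape backslashes
cat(string5, "\n")
```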

Common use of strings in R

Typing character strings directly into R, as in the examples above, is not very useful in practice. Typically, we analyze objects or variables that contain such character strings.

In the word cloud task, our goal is to transform a character string into a character vector of words.

Such a string can be pre-processed for text mining with the stringr package.

What is stringr?

stringr is a package designed especially for text pre-processing. It provides three main families of functions that make processing strings more consistent, simpler, and easier:

Character manipulation: these functions manipulate individual characters within the strings of a character vector
Whitespace tools: these functions add, remove, and manipulate whitespace
Pattern matching: these functions mostly work with regular expressions
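One function from each family is sketched below to make the distinction concrete. The choice of str_sub(), str_pad(), and str_extract() is ours for illustration, and the snippet assumes stringr is already installed.

```r
library(stringr)

# Character manipulation: extract characters 1 to 4 of a string
str_sub("Text mining", start = 1, end = 4)      # "Text"

# Whitespace tools: pad a string on the left to a width of 5
str_pad("R", width = 5, side = "left")          # "    R"

# Pattern matching: extract the first run of digits (regular expression)
str_extract("COVID-19", "[0-9]+")               # "19"
```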

This package adds more functionality to the base functions for handling strings in R such as nchar(), paste(), and strsplit().

We can install the package by using the function install.packages() and load it into our current session with library().
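In practice this takes two calls. The installation line is commented out in this sketch because it only needs to be run once per machine and requires an internet connection.

```r
# Install once (uncomment on first use)
# install.packages("stringr")

# Load the package in every new R session
library(stringr)

# Confirm which version is loaded
packageVersion("stringr")
```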

Basic String Operations

The package stringr provides functions for both 1) basic manipulations and 2) regular expression operations. First, we are going to cover those functions that have to do with basic manipulations.

The following table shows some stringr functions for basic string operations:

Function	Description	Similar Base Functions
`str_length()`	number of characters	`nchar()`
`str_split()`	split up a string into pieces	`strsplit()`
`str_c()`	string concatenation	`paste()`
`str_trim()`	removes leading and trailing whitespace	`trimws()`
`str_squish()`	removes any redundant whitespace	none
`str_detect()`	finds a particular pattern of characters	`grepl()`
`str_view_all()`	shows the matching result on the screen	none

Note that almost all functions in stringr start with "str_" followed by a term related to the task they perform.
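The functions in the table can be tried on a small example. The input string x below is made up for illustration and deliberately contains extra spaces; the snippet assumes stringr is installed.

```r
library(stringr)

# A made-up string with leading, trailing, and repeated spaces
x <- "  Text   mining is  interesting!  "

str_length(x)                        # counts every character, spaces included
str_trim(x)                          # "Text   mining is  interesting!"

# str_squish() also collapses repeated spaces inside the string
clean <- str_squish(x)
clean                                # "Text mining is interesting!"

str_split(clean, " ")[[1]]           # "Text" "mining" "is" "interesting!"
str_c("Text", "mining", sep = " ")   # "Text mining"
str_detect(clean, "mining")          # TRUE
```

Note the difference between str_trim(), which only touches the two ends of the string, and str_squish(), which also cleans up the whitespace in between.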

Save the output of text preprocessing

load("covid_text_word.RData")

covid_text_word[1:100]

##   [1] "COVID-19"                "vaccine\r\nA"           
##   [3] "COVID<U+2011>19"         "vaccine"                
##   [5] "is"                      "a"                      
##   [7] "vaccine"                 "intended"               
##   [9] "to"                      "provide"                
##  [11] "acquired"                "immunity\r\nagainst"    
##  [13] "severe"                  "acute"                  
##  [15] "respiratory"             "syndrome"               
##  [17] "coronavirus"             "2"                      
##  [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"           
##  [21] "causing"                 "coronavirus"            
##  [23] "disease"                 "2019"                   
##  [25] "(COVID<U+2011>19)."      "Prior"                  
##  [27] "to"                      "the\r\nCOVID<U+2011>19" 
##  [29] "pandemic,"               "there"                  
##  [31] "was"                     "an"                     
##  [33] "established"             "body"                   
##  [35] "of"                      "knowledge"              
##  [37] "about\r\nthe"            "structure"              
##  [39] "and"                     "function"               
##  [41] "of"                      "coronaviruses"          
##  [43] "causing"                 "diseases"               
##  [45] "like"                    "severe"                 
##  [47] "acute\r\nrespiratory"    "syndrome"               
##  [49] "(SARS)"                  "and"                    
##  [51] "Middle"                  "East"                   
##  [53] "respiratory"             "syndrome\r\n(MERS),"    
##  [55] "which"                   "enabled"                
##  [57] "accelerated"             "development"            
##  [59] "of"                      "various"                
##  [61] "vaccine\r\ntechnologies" "during"                 
##  [63] "early"                   "2020.[1]"               
##  [65] "On"                      "10"                     
##  [67] "January"                 "2020,"                  
##  [69] "the"                     "SARS-CoV-2\r\ngenetic"  
##  [71] "sequence"                "data"                   
##  [73] "was"                     "shared"                 
##  [75] "through"                 "GISAID,"                
##  [77] "and"                     "by"                     
##  [79] "19"                      "March,"                 
##  [81] "the\r\nglobal"           "pharmaceutical"         
##  [83] "industry"                "announced"              
##  [85] "a"                       "major"                  
##  [87] "commitment"              "to"                     
##  [89] "address\r\nCOVID-19.[2]" ""                       
##  [91] ""                        ""                       
##  [93] ""                        ""                       
##  [95] ""                        ""                       
##  [97] ""                        ""                       
##  [99] ""                        ""

save(covid_text_word, file="covid_text_word.RData")
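How save() and load() work together can be sketched with a small round trip. The vector words and the temporary file are stand-ins for illustration; they are not the course data.

```r
# A small stand-in for a vector of pre-processed words
words <- c("COVID-19", "vaccine", "is", "a", "vaccine")

# Save the object to a temporary .RData file
tmp <- tempfile(fileext = ".RData")
save(words, file = tmp)

# Remove the object, then restore it from the file
rm(words)
load(tmp)

exists("words")  # TRUE: load() restores the object under its original name
words

# Clean up the temporary file
unlink(tmp)
```

Because load() restores objects under the names they had when saved, there is no need to assign its result to a new variable.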

Automated Text Analysis: Week4-2

Shin Lee

2021 3 23
