Characters and Strings

CV1 <- c("T","e","x","t")

But a piece of text can also be represented as a sequence of characters (letters, numbers, and symbols):

CV2 <- c("Text","mining","is","interesting!") 

The data type R provides for storing sequences of characters is a character vector. Formally, the class of an object that holds character strings in R is "character".
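As a quick check of the statement above, the following minimal sketch inspects the two example vectors with base R functions only. The values are repeated here so that the block runs on its own.

```r
# Two character vectors, repeated from the examples above
CV1 <- c("T", "e", "x", "t")
CV2 <- c("Text", "mining", "is", "interesting!")

class(CV1)         # "character"
is.character(CV2)  # TRUE

length(CV1)        # 4 elements
length(CV2)        # 4 elements

# nchar() counts the characters inside each element
nchar(CV2)         # 4 6 2 12
```

Both objects have the same class and the same length; they differ only in how many characters each element holds.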

SV1 <- c("Text mining is interesting! We are going to learn about how to process text using R.", "Is RStudio easy to learn? Many people say so, indeed.")

In text pre-processing we will mostly work with strings rather than with individual characters. Tokenization is the process of splitting a string into word units.
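To make the idea of tokenization concrete, here is a minimal sketch that splits one short string on whitespace with the base function strsplit(). The variable names s and tokens are chosen only for this illustration, and splitting on whitespace is just one simple way to define a word unit.

```r
# One string: a single element that contains many characters
s <- "Text mining is interesting!"

length(s)  # 1 (a single string)
nchar(s)   # 27 characters, counting spaces and punctuation

# Split on runs of whitespace. strsplit() returns a list with one
# element per input string, so we take the first element.
tokens <- strsplit(s, "\\s+")[[1]]

tokens          # "Text" "mining" "is" "interesting!"
length(tokens)  # 4 word units
```

Note that punctuation stays attached to the last token; removing it is a separate pre-processing step.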

We can express strings by surrounding text within double quotes:

string1 <- "a character string is surrounded by double quotes"

We can also surround text with single quotes:

string2 <- 'a character string is surrounded by single quotes'

Note that we must match the type of quotes that we are using: a starting single quote must have an ending single quote.
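The matching rule matters most when the text itself contains a quote. The sketch below shows the two usual ways of handling that case; q1, q2, and q3 are illustrative names.

```r
# Both quote styles create the same string
identical("text", 'text')   # TRUE

# Use one quote style on the outside when the text contains the other
q1 <- "It's a string"
q2 <- 'She said "hello"'

# Alternatively, escape the inner quote with a backslash
q3 <- "She said \"hello\""
identical(q2, q3)           # TRUE

# writeLines() displays the string without the escape characters
writeLines(q3)
```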

Common use of strings in R

Typing characters directly into R, as in the examples above, is not very useful. Typically, we are going to analyze objects or variables that contain such character strings.

For the word cloud task, our goal is to transform a character string into a character vector.

Such a string can be pre-processed for text mining with the stringr package.

What is stringr?

stringr is a package for string manipulation, which makes it well suited to text pre-processing. It provides three main families of functions that make string processing simpler, easier, and more consistent:

  1. Character manipulation: these functions manipulate individual characters within the strings in character vector objects
  2. Whitespace tools to add, remove, and manipulate whitespace
  3. Pattern matching functions: these functions mostly recognize regular expressions
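As a rough illustration of the three families listed above, the sketch below calls one or two functions from each. It assumes that stringr is already installed; the vector x is made up for this example.

```r
library(stringr)

x <- c("  text mining ", "RStudio")

# 1. Character manipulation
str_length(x)              # 14 7
str_sub("RStudio", 1, 1)   # "R" (characters from position 1 to 1)

# 2. Whitespace tools
str_trim(x)                              # "text mining" "RStudio"
str_pad("R", width = 3, side = "left")   # "  R"

# 3. Pattern matching (the pattern is a regular expression)
str_detect(x, "^R")        # FALSE TRUE
```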

This package adds more functionality to the base functions for handling strings in R such as nchar(), paste(), and strsplit().

We can install the package by using the function install.packages() and load it into our current session with library().
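A minimal sketch of these two steps follows. The requireNamespace() check is an addition for convenience, so that the package is installed only when it is missing; installation needs an internet connection and a configured CRAN mirror.

```r
# Install stringr only if it is not already available on this machine
if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr")
}

# Load the package; this has to be repeated in every new R session
library(stringr)

# Confirm that the package is attached to the search path
"package:stringr" %in% search()
```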

Basic String Operations

The package stringr provides functions for both 1) basic manipulations and 2) regular expression operations. First, we are going to cover those functions that have to do with basic manipulations.

The following table shows some stringr functions for basic string operations:

Function          Description                                 Similar base function
str_length()      number of characters                        nchar()
str_split()       splits a string into pieces                 strsplit()
str_c()           string concatenation                        paste()
str_trim()        removes leading and trailing whitespace     trimws()
str_squish()      removes any redundant whitespace            none
str_detect()      finds a particular pattern of characters    grepl()
str_view_all()    shows the matching result on the screen     none

Note that nearly all functions in stringr start with "str_", followed by a term related to the task they perform.
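The sketch below runs most of the functions from the table on one made-up string, assuming stringr is installed. The names s, clean, and words are chosen for this illustration. str_view_all() is left out because how it displays matches depends on the installed stringr version.

```r
library(stringr)

s <- "  Text   mining is  interesting!  "

# Number of characters, spaces included
str_length(s)

# str_trim() only removes whitespace at the two ends
str_trim(s)

# str_squish() also collapses repeated spaces inside the string
clean <- str_squish(s)
clean                          # "Text mining is interesting!"

# str_split() returns a list with one element per input string
words <- str_split(clean, " ")[[1]]
words                          # "Text" "mining" "is" "interesting!"

# str_c() joins the pieces back together
str_c(words, collapse = " ")   # "Text mining is interesting!"

# str_detect() returns TRUE or FALSE for each element
str_detect(words, "ing")       # FALSE TRUE FALSE TRUE
```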

Save the output of text preprocessing

load("covid_text_word.RData")

covid_text_word[1:100]
##   [1] "COVID-19"                "vaccine\r\nA"           
##   [3] "COVID<U+2011>19"         "vaccine"                
##   [5] "is"                      "a"                      
##   [7] "vaccine"                 "intended"               
##   [9] "to"                      "provide"                
##  [11] "acquired"                "immunity\r\nagainst"    
##  [13] "severe"                  "acute"                  
##  [15] "respiratory"             "syndrome"               
##  [17] "coronavirus"             "2"                      
##  [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"           
##  [21] "causing"                 "coronavirus"            
##  [23] "disease"                 "2019"                   
##  [25] "(COVID<U+2011>19)."      "Prior"                  
##  [27] "to"                      "the\r\nCOVID<U+2011>19" 
##  [29] "pandemic,"               "there"                  
##  [31] "was"                     "an"                     
##  [33] "established"             "body"                   
##  [35] "of"                      "knowledge"              
##  [37] "about\r\nthe"            "structure"              
##  [39] "and"                     "function"               
##  [41] "of"                      "coronaviruses"          
##  [43] "causing"                 "diseases"               
##  [45] "like"                    "severe"                 
##  [47] "acute\r\nrespiratory"    "syndrome"               
##  [49] "(SARS)"                  "and"                    
##  [51] "Middle"                  "East"                   
##  [53] "respiratory"             "syndrome\r\n(MERS),"    
##  [55] "which"                   "enabled"                
##  [57] "accelerated"             "development"            
##  [59] "of"                      "various"                
##  [61] "vaccine\r\ntechnologies" "during"                 
##  [63] "early"                   "2020.[1]"               
##  [65] "On"                      "10"                     
##  [67] "January"                 "2020,"                  
##  [69] "the"                     "SARS-CoV-2\r\ngenetic"  
##  [71] "sequence"                "data"                   
##  [73] "was"                     "shared"                 
##  [75] "through"                 "GISAID,"                
##  [77] "and"                     "by"                     
##  [79] "19"                      "March,"                 
##  [81] "the\r\nglobal"           "pharmaceutical"         
##  [83] "industry"                "announced"              
##  [85] "a"                       "major"                  
##  [87] "commitment"              "to"                     
##  [89] "address\r\nCOVID-19.[2]" ""                       
##  [91] ""                        ""                       
##  [93] ""                        ""                       
##  [95] ""                        ""                       
##  [97] ""                        ""                       
##  [99] ""                        ""
save(covid_text_word, file="covid_text_word.RData")
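To see how save() and load() work together without touching the real data file, here is a small round-trip sketch. The vector demo_words and the temporary file path are hypothetical stand-ins, not part of the COVID-19 data set.

```r
# Hypothetical stand-in for a tokenized text
demo_words <- c("COVID-19", "vaccine", "is", "a", "vaccine")

# Write to a temporary file so that nothing in the working directory changes
path <- tempfile(fileext = ".RData")
save(demo_words, file = path)

# Remove the object from the session, then restore it from the file
rm(demo_words)
loaded <- load(path)   # load() returns the names of the restored objects

loaded                 # "demo_words"
demo_words[1:2]        # "COVID-19" "vaccine"
```

Because load() restores objects under their original names, it silently overwrites any object of the same name that already exists in the session.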