CV1 <- c("T","e","x","t")
But a piece of text can also be represented as a sequence of words, where each word is itself a string of characters (letters, numbers, and symbols):
CV2 <- c("Text","mining","is","interesting!")
The data type R provides for storing sequences of characters is a character vector. Formally, the class of an object that holds character strings in R is "character".
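As a quick check of the statement above, the following sketch inspects the two example vectors with base R functions only. The values of CV1 and CV2 are copied from the examples above; nothing else is assumed.

```r
# Re-create the example vectors from the text
CV1 <- c("T", "e", "x", "t")
CV2 <- c("Text", "mining", "is", "interesting!")

# Both objects have class "character", whatever their elements contain
class(CV1)
class(CV2)

# length() counts the elements of the vector, not the letters
length(CV1)
length(CV2)

# is.character() tests the class directly
is.character(CV2)
```

Note that a vector of single letters and a vector of whole words have the same class; only their elements differ.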
SV1 <- c("Text mining is interesting! We are going to learn about how to process text using R.", "Is RStudio easy to learn? Many people say so, indeed.")
What we mostly work with in text pre-processing is a string, not an individual character. Tokenization is the process of splitting a string into word units.
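To make the idea of tokenization concrete, here is a minimal sketch using the base function strsplit(). The names `sentence` and `tokens` are illustrative, and splitting on a single space is a simplifying assumption: punctuation stays attached to the neighbouring word.

```r
# A single string (a character vector of length 1)
sentence <- "Text mining is interesting!"

# strsplit() returns a list with one element per input string,
# so [[1]] extracts the tokens for our single sentence
tokens <- strsplit(sentence, split = " ", fixed = TRUE)[[1]]

tokens
length(tokens)
```

The result is a character vector with one word per element, which is the shape that later steps such as word counting expect.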
We can express strings by surrounding text within double quotes:
string1 <- "a character string is surrounded by double quotes"
Alternatively, we can surround text within single quotes:
string2 <- 'a character string is surrounded by single quotes'
Note that we must match the type of quotes that we are using: a starting single quote must have an ending single quote.
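The matching rule matters most when the text itself contains quotes. The sketch below shows one way to handle that in base R; the variable names and sentences are made up for illustration.

```r
# Single quotes inside a double-quoted string need no escaping
apostrophe <- "It's easy to learn R"

# Double quotes inside a single-quoted string need no escaping either
quoted_single <- 'She said "text mining" twice'

# Inside a double-quoted string, double quotes must be escaped with a backslash
quoted_escaped <- "She said \"text mining\" twice"

# Both spellings produce the same string
identical(quoted_single, quoted_escaped)

# cat() prints the string as it is stored, without the escape characters
cat(quoted_escaped, "\n")
```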
Typing character strings directly into R, as in the examples above, is not very useful on its own. Typically, we analyze objects or variables that contain such character strings.
For the word cloud task, our goal is to transform a character string into a character vector of words.
Such a string can be pre-processed for text mining with the package stringr.
stringr is a package designed specifically for string processing. It provides three main families of functions that make string processing more consistent, simpler, and easier.
This package adds more functionality to the base functions for handling strings in R, such as nchar(), paste(), and strsplit().
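For reference, the three base functions mentioned above behave roughly as follows. This is a small sketch with made-up input; `words` is an illustrative name.

```r
words <- c("Text", "mining")

# nchar() is vectorized: it returns one count per element
nchar(words)

# paste() joins its arguments, separated by a space by default
paste("Text", "mining")

# collapse = " " joins the elements of one vector into a single string
paste(words, collapse = " ")

# strsplit() returns a list with one character vector per input string
strsplit(c("Text mining", "is interesting!"), split = " ")
```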
We can install the package with the function install.packages() and load it into our current session with library().
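A minimal sketch of these two steps is shown below. The requireNamespace() guard and the choice of CRAN mirror are assumptions added here so that the package is not reinstalled on every run; a plain install.packages("stringr") also works in an interactive session.

```r
# Install stringr only if it is not already available (needs internet access).
# The mirror URL is an assumption; any CRAN mirror can be used instead.
if (!requireNamespace("stringr", quietly = TRUE)) {
  install.packages("stringr", repos = "https://cloud.r-project.org")
}

# Attach the package so its functions can be called without the stringr:: prefix
library(stringr)
```

Installation is needed only once per machine, whereas library() must be called in every new R session.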
The package stringr provides functions for both 1) basic manipulations and 2) regular expression operations. First, we are going to cover those functions that have to do with basic manipulations.
The following table shows some stringr functions for basic string operations:
Function | Description | Similar Base Functions |
---|---|---|
str_length() | number of characters | nchar() |
str_split() | splits a string into pieces | strsplit() |
str_c() | string concatenation | paste() |
str_trim() | removes leading and trailing whitespace | trimws() |
str_squish() | removes leading, trailing, and repeated internal whitespace | none |
str_detect() | detects a particular pattern of characters | grepl() |
str_view_all() | shows the matches on screen | none |
Note that all functions in stringr start with "str_" followed by a term related to the task they perform.
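The sketch below tries several of the functions from the table on a small made-up string. It assumes stringr is installed; `messy` and `clean` are illustrative names, not objects from the lesson data.

```r
library(stringr)

messy <- "  Text   mining is  interesting!  "

# str_length() counts characters, like nchar()
str_length("Text")

# str_trim() strips whitespace from both ends only
str_trim(messy)

# str_squish() also collapses repeated internal whitespace
clean <- str_squish(messy)
clean

# str_split() returns a list, like strsplit(); [[1]] takes the first element
str_split(clean, " ")[[1]]

# str_c() concatenates strings, like paste() with an explicit separator
str_c("Text", "mining", sep = " ")

# str_detect() returns TRUE or FALSE for each element
str_detect(c("vaccine", "virus"), "vac")
```

Squishing before splitting avoids the empty tokens that repeated spaces would otherwise produce.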
load("covid_text_word.RData")
covid_text_word[1:100]
## [1] "COVID-19" "vaccine\r\nA"
## [3] "COVID<U+2011>19" "vaccine"
## [5] "is" "a"
## [7] "vaccine" "intended"
## [9] "to" "provide"
## [11] "acquired" "immunity\r\nagainst"
## [13] "severe" "acute"
## [15] "respiratory" "syndrome"
## [17] "coronavirus" "2"
## [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"
## [21] "causing" "coronavirus"
## [23] "disease" "2019"
## [25] "(COVID<U+2011>19)." "Prior"
## [27] "to" "the\r\nCOVID<U+2011>19"
## [29] "pandemic," "there"
## [31] "was" "an"
## [33] "established" "body"
## [35] "of" "knowledge"
## [37] "about\r\nthe" "structure"
## [39] "and" "function"
## [41] "of" "coronaviruses"
## [43] "causing" "diseases"
## [45] "like" "severe"
## [47] "acute\r\nrespiratory" "syndrome"
## [49] "(SARS)" "and"
## [51] "Middle" "East"
## [53] "respiratory" "syndrome\r\n(MERS),"
## [55] "which" "enabled"
## [57] "accelerated" "development"
## [59] "of" "various"
## [61] "vaccine\r\ntechnologies" "during"
## [63] "early" "2020.[1]"
## [65] "On" "10"
## [67] "January" "2020,"
## [69] "the" "SARS-CoV-2\r\ngenetic"
## [71] "sequence" "data"
## [73] "was" "shared"
## [75] "through" "GISAID,"
## [77] "and" "by"
## [79] "19" "March,"
## [81] "the\r\nglobal" "pharmaceutical"
## [83] "industry" "announced"
## [85] "a" "major"
## [87] "commitment" "to"
## [89] "address\r\nCOVID-19.[2]" ""
## [91] "" ""
## [93] "" ""
## [95] "" ""
## [97] "" ""
## [99] "" ""
save(covid_text_word, file="covid_text_word.RData")
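The save() and load() calls used here can be tried without the lesson's data file. The following sketch uses a hypothetical stand-in vector, `example_words`, and a temporary file, so it does not touch covid_text_word.RData.

```r
# Hypothetical stand-in for covid_text_word, so the sketch runs on its own
example_words <- c("COVID-19", "vaccine", "is", "a", "vaccine")

# Write the object to a temporary .RData file
tmp_file <- tempfile(fileext = ".RData")
save(example_words, file = tmp_file)

# Remove the object, then restore it under its original name with load()
rm(example_words)
load(tmp_file)

example_words[1:3]
```

load() restores objects under the names they had when saved, which is why no assignment is needed after loading.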