R Base Functions for Text Pre-Processing

Text mining begins with understanding text data in natural language, that is, with pre-processing text into data that are suitable for analysis.

Today, we will see how R can be used for text pre-processing. We will not install any package for text analysis, apart from a couple needed to produce a word cloud. That is, we will use only functions that are available in a default installation of R, namely those included in the R base package. Such base functions have both pros and cons. Let’s see the cons first.

Cons

  1. R base functions were not originally developed for text analysis.
  2. Many R packages specialize in text mining and go further than the base functions.
  3. Using the R base package requires some programming knowledge.

There are, of course, pros as well.

Pros

  1. R base functions do not require installing any additional packages.
  2. R base functions are in common use, so they are applicable to diverse areas.
  3. R base functions let us learn the basic scheme of pre-processing textual data.

Getting Help

You can access the help file of any function.

?c # Get help on a particular function, here c( )
## starting httpd help server ... done
?strsplit
help.search("split") # Search and return the help files for functions that include a word or phrase
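
Besides ? and help.search( ), a few other base functions are handy for exploring what is available. The lines below are a small sketch, not an exhaustive list; the object name split_functions is only an illustrative choice.

```r
# List objects on the search path whose names contain "split"
split_functions <- apropos("split")
split_functions

# Show the arguments of a function without opening its help page
args(strsplit)

# Run the examples from a help page (uncomment to try)
# example(strsplit)
```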

We can deal with string objects as follows.

strsplit("hello world", split = " ")
## [[1]]
## [1] "hello" "world"
unlist(strsplit("hello world", split = " "))
## [1] "hello" "world"
hello <- unlist(strsplit("hello world", split = " "))
hello
## [1] "hello" "world"
?paste
paste(hello, collapse = " ")
## [1] "hello world"

R functions in the Base Package for text analysis

  1. nchar(): counting the number of characters
nchar("text")
## [1] 4
# Blanks and punctuation marks are also counted
nchar("text mining") #11? 10? 
## [1] 11
nchar("text, mining") #12? 11?
## [1] 12
  2. strsplit(): Texts are composed of words, which consist of characters. In text mining, therefore, we segment texts into words (tokens) and pre-process those tokens. Afterwards, we might need to combine them again.
sent1 <- "Text mining begins with preprocessing and tokenization."
# Split sent1 into seven words, which are separated from each other by white space (blank)
sent1_split <- strsplit(sent1, split=" ") # The argument split=" " specifies the character (here, a blank) that separates words as tokens
sent1_split # returns a list holding a vector of the words segmented from sent1
## [[1]]
## [1] "Text"          "mining"        "begins"        "with"         
## [5] "preprocessing" "and"           "tokenization."
class(sent1)
## [1] "character"
class(sent1_split)
## [1] "list"
class(unlist(sent1_split))
## [1] "character"
sent1_split_vector <- unlist(sent1_split)
  3. paste(): We can combine the word segments back into a sentence in which the words are separated by blanks.
sent2 <- paste(sent1_split_vector, collapse=" ") # The argument collapse specifies the string placed between the elements of sent1_split_vector when they are combined. Here, we join the words into a sentence separated by blanks (" ").
sent2
## [1] "Text mining begins with preprocessing and tokenization."
paste(sent1_split_vector, collapse="/") 
## [1] "Text/mining/begins/with/preprocessing/and/tokenization."
sent2
## [1] "Text mining begins with preprocessing and tokenization."
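
The three functions above can be chained into a small pre-processing step. The following is a minimal sketch, assuming we also want to drop punctuation and lower-case the tokens; gsub( ) and tolower( ) are base functions not introduced above, and the name tokens_clean is our own.

```r
sent1 <- "Text mining begins with preprocessing and tokenization."

# 1) Split the sentence into tokens at each blank
tokens <- unlist(strsplit(sent1, split = " "))

# 2) Remove punctuation marks and convert to lower case
tokens_clean <- tolower(gsub("[[:punct:]]", "", tokens))

# 3) Count characters per token (nchar is vectorized)
nchar(tokens_clean)
## [1]  4  6  6  4 13  3 12

# 4) Combine the cleaned tokens back into one string
paste(tokens_clean, collapse = " ")
## [1] "text mining begins with preprocessing and tokenization"
```

Note that the final period is gone and every word is in lower case, so "Text" and "text" would now be counted as the same token.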

apply( ) and its derived functions

lapply( ): Applies a function to a list or a vector and returns a list object

lapply( ) applies a specified function to each element of a list or a vector and returns a new list of the same length as the input. Each element of the result is the output of applying the function to the corresponding element of the input.

The "l" in lapply( ) stands for "list".

sentence <- "Apply functions apply a specified function to each element of a list or a vector object."
List2 <- list(sent1, sentence)
List2
## [[1]]
## [1] "Text mining begins with preprocessing and tokenization."
## 
## [[2]]
## [1] "Apply functions apply a specified function to each element of a list or a vector object."
lapply(List2, nchar) # returns a list object of results from applying a function "nchar" to each element of List2 
## [[1]]
## [1] 55
## 
## [[2]]
## [1] 88
unlist(lapply(List2, nchar))
## [1] 55 88
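
lapply( ) is not limited to built-in functions such as nchar. As a sketch, we can pass a function of our own, here an anonymous function that counts the words in each sentence by splitting on blanks. The name word_counts is just an illustrative choice.

```r
sent1 <- "Text mining begins with preprocessing and tokenization."
sentence <- "Apply functions apply a specified function to each element of a list or a vector object."
List2 <- list(sent1, sentence)

# For each element: split on blanks, flatten to a vector, count the pieces
word_counts <- lapply(List2, function(x) length(unlist(strsplit(x, split = " "))))
unlist(word_counts)
## [1]  7 16
```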

sapply( ): Applies a function to a list and returns a vector object

sapply( ) applies a specified function to each element of a list and simplifies the result to a vector (or matrix) when possible. In the example below, it gives the same result as applying unlist( ) to the output of lapply( ).

List2
## [[1]]
## [1] "Text mining begins with preprocessing and tokenization."
## 
## [[2]]
## [1] "Apply functions apply a specified function to each element of a list or a vector object."
sapply(List2, nchar) # applies the function nchar to each element of the list List2
## [1] 55 88
unlist(lapply(List2, nchar)) #"s" in sapply stands for "simplified"
## [1] 55 88
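
The phrase "when possible" matters. Here is a sketch of a case where simplification is not possible: if the function returns vectors of different lengths, sapply( ) falls back to a list, just like lapply( ). vapply( ), another base function, is a stricter variant in which we declare the expected type and length of each result. The object names below are our own.

```r
sent1 <- "Text mining begins with preprocessing and tokenization."
sentence <- "Apply functions apply a specified function to each element of a list or a vector object."
List2 <- list(sent1, sentence)

# Character counts of every word: 7 values for the first sentence, 16 for the second
word_nchar <- sapply(List2, function(x) nchar(unlist(strsplit(x, split = " "))))
class(word_nchar)
## [1] "list"

# vapply() checks that each result is a single integer
vapply(List2, nchar, integer(1))
## [1] 55 88
```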

Special Packages for Text Pre-Processing

Package? Library?

R is open-source software, which means that many people all over the world develop functions for it. Using such functions, we can carry out text mining easily and effectively.

A package is a collection of such R functions (as well as data and compiled code). The location where packages are stored is called the library.

When we download a package with install.packages("package name"), it is stored in the library. To use the package, we then call library(package name), which makes the package available in the current session.
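
As a sketch of this workflow, the lines below show where the library is located and check whether a package is already installed before loading it. The package name pdftools is taken from the next section; requireNamespace( ) and .libPaths( ) are base functions, and the names pkg and is_installed are our own.

```r
# Folders that make up the library on this computer
.libPaths()

pkg <- "pdftools"

# TRUE if the package can be found in the library, FALSE otherwise
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  library(pkg, character.only = TRUE)  # make the package available
} else {
  message("Package '", pkg, "' is not installed; run install.packages(\"", pkg, "\") first.")
}
```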

Processing character strings from a PDF document

#install.packages("pdftools")
library(pdftools)
## Using poppler version 0.73.0
covid_text <- pdf_text("COVID-19_vaccine.pdf") # Extract the text from a PDF document; the result has one string per page
#covid_text
class(covid_text)
## [1] "character"
length(covid_text)
## [1] 93
covid_text[1]
## [1] "COVID-19 vaccine\r\nA COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS<U+2011>CoV<U+2011>2), the\r\nvirus causing coronavirus disease 2019 (COVID<U+2011>19). Prior to the\r\nCOVID<U+2011>19 pandemic, there was an established body of knowledge about\r\nthe structure and function of coronaviruses causing diseases like severe acute\r\nrespiratory syndrome (SARS) and Middle East respiratory syndrome\r\n(MERS), which enabled accelerated development of various vaccine\r\ntechnologies during early 2020.[1] On 10 January 2020, the SARS-CoV-2\r\ngenetic sequence data was shared through GISAID, and by 19 March, the\r\nglobal pharmaceutical industry announced a major commitment to address\r\nCOVID-19.[2]                                                                   COVID-19 vaccination doses administered\r\n                                                                               per 100 people\r\nIn Phase III trials, several COVID<U+2011>19 vaccines have demonstrated efficacy\r\nas high as 95% in preventing symptomatic COVID<U+2011>19 infections. 
As of\r\nMarch 2021, 12 vaccines were authorized by at least one national regulatory\r\nauthority for public use: two RNA vaccines (the Pfizer<U+2013>BioNTech vaccine\r\nand the Moderna vaccine), four conventional inactivated vaccines (BBIBP-\r\nCorV, CoronaVac, Covaxin, and CoviVac), four viral vector vaccines\r\n(Sputnik V, the Oxford<U+2013>AstraZeneca vaccine, Convidecia, and the Johnson\r\n& Johnson vaccine), and two protein subunit vaccines (EpiVacCorona and\r\nRBD-Dimer).[3] In total, as of March 2021, 308 vaccine candidates were in\r\nvarious stages of development, with 73 in clinical research, including 24 in\r\n                                                                               Map of countries by approval status\r\nPhase I trials, 33 in Phase I<U+2013>II trials, and 16 in Phase III development.[3]\r\n                                                                                   Approved for general use, mass\r\nMany countries have implemented phased distribution plans that prioritize      vaccination underway\r\nthose at highest risk of complications, such as the elderly, and those at high     EUA (or equivalent) granted, mass\r\nrisk of exposure and transmission, such as healthcare workers.[4] As of        vaccination underway\r\n20 March 2021, 436.37 million doses of COVID<U+2011>19 vaccine have been                  EUA granted, limited vaccination\r\nadministered worldwide based on official reports from national health              Approved for general use, mass\r\nagencies.[5] AstraZeneca-Oxford anticipates producing 3 billion doses in       vaccination planned\r\n2021, Pfizer-BioNTech 1.3 billion doses, and Sputnik V, Sinopharm,                 EUA granted, mass vaccination planned\r\nSinovac, and Johnson & Johnson 1 billion doses each. 
Moderna targets               EUA pending\r\nproducing 600 million doses and Convidecia 500 million doses in 2021.[6][7]\r\nBy December 2020, more than 10 billion vaccine doses had been preordered\r\nby countries,[8] with about half of the doses purchased by high-income countries comprising 14% of the world's\r\npopulation.[9]\r\n Contents\r\n Background\r\n Planning and development\r\n       Challenges\r\n       Organizations\r\n       History\r\n Vaccine types\r\n       RNA vaccines\r\n       Adenovirus vector vaccines\r\n       Inactivated virus vaccines\r\n"
covid_text[93]
## [1] "Retrieved from \"https://en.wikipedia.org/w/index.php?title=COVID-19_vaccine&oldid=1013541498\"\r\nThis page was last edited on 22 March 2021, at 05:04 (UTC).\r\nText is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. By using this site, you\r\nagree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit\r\norganization.\r\n"
save(covid_text,file="covid_text.RData")
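
The page text printed above contains \r\n line breaks and long runs of blanks left over from the PDF layout, so the choice of separator matters when we tokenize it. The following is a minimal sketch on a short made-up string (sample_page is hypothetical, not part of the PDF) that compares splitting on a single blank with splitting on a regular expression for one or more whitespace characters.

```r
# A made-up string imitating the page text: a line break and six blanks in a row
sample_page <- paste0("COVID-19 vaccine\r\nA vaccine is intended",
                      strrep(" ", 6),
                      "to provide immunity\r\n")

# Splitting on a single blank keeps "\r\n" inside tokens and yields empty strings
tokens_blank <- unlist(strsplit(sample_page, split = " "))
sum(tokens_blank == "")
## [1] 5

# Splitting on runs of whitespace handles blanks, "\r" and "\n" alike
tokens_ws <- unlist(strsplit(sample_page, split = "[[:space:]]+"))
tokens_ws <- tokens_ws[tokens_ws != ""]  # drop empty strings, if any remain
length(tokens_ws)
## [1] 9
```

Whether the whitespace pattern is the better choice depends on the analysis; it is shown here only to illustrate how much the separator affects the resulting tokens.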

covid_text_word <- strsplit(covid_text, split=" ") # Split each page into words, using a blank " " as the separator. What is the class of covid_text_word?
class(covid_text_word)
## [1] "list"
length(covid_text_word)
## [1] 93
covid_text_word[[1]]
##   [1] "COVID-19"                      "vaccine\r\nA"                 
##   [3] "COVID<U+2011>19"               "vaccine"                      
##   [5] "is"                            "a"                            
##   [7] "vaccine"                       "intended"                     
##   [9] "to"                            "provide"                      
##  [11] "acquired"                      "immunity\r\nagainst"          
##  [13] "severe"                        "acute"                        
##  [15] "respiratory"                   "syndrome"                     
##  [17] "coronavirus"                   "2"                            
##  [19] "(SARS<U+2011>CoV<U+2011>2),"   "the\r\nvirus"                 
##  [21] "causing"                       "coronavirus"                  
##  [23] "disease"                       "2019"                         
##  [25] "(COVID<U+2011>19)."            "Prior"                        
##  [27] "to"                            "the\r\nCOVID<U+2011>19"       
##  [29] "pandemic,"                     "there"                        
##  [31] "was"                           "an"                           
##  [33] "established"                   "body"                         
##  [35] "of"                            "knowledge"                    
##  [37] "about\r\nthe"                  "structure"                    
##  [39] "and"                           "function"                     
##  [41] "of"                            "coronaviruses"                
##  [43] "causing"                       "diseases"                     
##  [45] "like"                          "severe"                       
##  [47] "acute\r\nrespiratory"          "syndrome"                     
##  [49] "(SARS)"                        "and"                          
##  [51] "Middle"                        "East"                         
##  [53] "respiratory"                   "syndrome\r\n(MERS),"          
##  [55] "which"                         "enabled"                      
##  [57] "accelerated"                   "development"                  
##  [59] "of"                            "various"                      
##  [61] "vaccine\r\ntechnologies"       "during"                       
##  [63] "early"                         "2020.[1]"                     
##  [65] "On"                            "10"                           
##  [67] "January"                       "2020,"                        
##  [69] "the"                           "SARS-CoV-2\r\ngenetic"        
##  [71] "sequence"                      "data"                         
##  [73] "was"                           "shared"                       
##  [75] "through"                       "GISAID,"                      
##  [77] "and"                           "by"                           
##  [79] "19"                            "March,"                       
##  [81] "the\r\nglobal"                 "pharmaceutical"               
##  [83] "industry"                      "announced"                    
##  [85] "a"                             "major"                        
##  [87] "commitment"                    "to"                           
##  [89] "address\r\nCOVID-19.[2]"       ""                             
##  [91] ""                              ""                             
##  [93] ""                              ""                             
##  [95] ""                              ""                             
##  [97] ""                              ""                             
##  [99] ""                              ""                             
## [101] ""                              ""                             
## [103] ""                              ""                             
## [105] ""                              ""                             
## [107] ""                              ""                             
## [109] ""                              ""                             
## [111] ""                              ""                             
## [113] ""                              ""                             
## [115] ""                              ""                             
## [117] ""                              ""                             
## [119] ""                              ""                             
## [121] ""                              ""                             
## [123] ""                              ""                             
## [125] ""                              ""                             
## [127] ""                              ""                             
## [129] ""                              ""                             
## [131] ""                              ""                             
## [133] ""                              ""                             
## [135] ""                              ""                             
## [137] ""                              ""                             
## [139] ""                              ""                             
## [141] ""                              ""                             
## [143] ""                              ""                             
## [145] ""                              ""                             
## [147] ""                              ""                             
## [149] ""                              ""                             
## [151] ""                              ""                             
## [153] ""                              ""                             
## [155] ""                              "COVID-19"                     
## [157] "vaccination"                   "doses"                        
## [159] "administered\r\n"              ""                             
## [161] ""                              ""                             
## [163] ""                              ""                             
## [165] ""                              ""                             
## [167] ""                              ""                             
## [169] ""                              ""                             
## [171] ""                              ""                             
## [173] ""                              ""                             
## [175] ""                              ""                             
## [177] ""                              ""                             
## [179] ""                              ""                             
## [181] ""                              ""                             
## [183] ""                              ""                             
## [185] ""                              ""                             
## [187] ""                              ""                             
## [189] ""                              ""                             
## [191] ""                              ""                             
## [193] ""                              ""                             
## [195] ""                              ""                             
## [197] ""                              ""                             
## [199] ""                              ""                             
## [201] ""                              ""                             
## [203] ""                              ""                             
## [205] ""                              ""                             
## [207] ""                              ""                             
## [209] ""                              ""                             
## [211] ""                              ""                             
## [213] ""                              ""                             
## [215] ""                              ""                             
## [217] ""                              ""                             
## [219] ""                              ""                             
## [221] ""                              ""                             
## [223] ""                              ""                             
## [225] ""                              ""                             
## [227] ""                              ""                             
## [229] ""                              ""                             
## [231] ""                              ""                             
## [233] ""                              ""                             
## [235] ""                              ""                             
## [237] ""                              "per"                          
## [239] "100"                           "people\r\nIn"                 
## [241] "Phase"                         "III"                          
## [243] "trials,"                       "several"                      
## [245] "COVID<U+2011>19"               "vaccines"                     
## [247] "have"                          "demonstrated"                 
## [249] "efficacy\r\nas"                "high"                         
## [251] "as"                            "95%"                          
## [253] "in"                            "preventing"                   
## [255] "symptomatic"                   "COVID<U+2011>19"              
## [257] "infections."                   "As"                           
## [259] "of\r\nMarch"                   "2021,"                        
## [261] "12"                            "vaccines"                     
## [263] "were"                          "authorized"                   
## [265] "by"                            "at"                           
## [267] "least"                         "one"                          
## [269] "national"                      "regulatory\r\nauthority"      
## [271] "for"                           "public"                       
## [273] "use:"                          "two"                          
## [275] "RNA"                           "vaccines"                     
## [277] "(the"                          "Pfizer<U+2013>BioNTech"       
## [279] "vaccine\r\nand"                "the"                          
## [281] "Moderna"                       "vaccine),"                    
## [283] "four"                          "conventional"                 
## [285] "inactivated"                   "vaccines"                     
## [287] "(BBIBP-\r\nCorV,"              "CoronaVac,"                   
## [289] "Covaxin,"                      "and"                          
## [291] "CoviVac),"                     "four"                         
## [293] "viral"                         "vector"                       
## [295] "vaccines\r\n(Sputnik"          "V,"                           
## [297] "the"                           "Oxford<U+2013>AstraZeneca"    
## [299] "vaccine,"                      "Convidecia,"                  
## [301] "and"                           "the"                          
## [303] "Johnson\r\n&"                  "Johnson"                      
## [305] "vaccine),"                     "and"                          
## [307] "two"                           "protein"                      
## [309] "subunit"                       "vaccines"                     
## [311] "(EpiVacCorona"                 "and\r\nRBD-Dimer).[3]"        
## [313] "In"                            "total,"                       
## [315] "as"                            "of"                           
## [317] "March"                         "2021,"                        
## [319] "308"                           "vaccine"                      
## [321] "candidates"                    "were"                         
## [323] "in\r\nvarious"                 "stages"                       
## [325] "of"                            "development,"                 
## [327] "with"                          "73"                           
## [329] "in"                            "clinical"                     
## [331] "research,"                     "including"                    
## [333] "24"                            "in\r\n"                       
## [335] ""                              ""                             
## [337] ""                              ""                             
## [339] ""                              ""                             
## [341] ""                              ""                             
## [343] ""                              ""                             
## [345] ""                              ""                             
## [347] ""                              ""                             
## [349] ""                              ""                             
## [351] ""                              ""                             
## [353] ""                              ""                             
## [355] ""                              ""                             
## [357] ""                              ""                             
## [359] ""                              ""                             
## [361] ""                              ""                             
## [363] ""                              ""                             
## [365] ""                              ""                             
## [367] ""                              ""                             
## [369] ""                              ""                             
## [371] ""                              ""                             
## [373] ""                              ""                             
## [375] ""                              ""                             
## [377] ""                              ""                             
## [379] ""                              ""                             
## [381] ""                              ""                             
## [383] ""                              ""                             
## [385] ""                              ""                             
## [387] ""                              ""                             
## [389] ""                              ""                             
## [391] ""                              ""                             
## [393] ""                              ""                             
## [395] ""                              ""                             
## [397] ""                              ""                             
## [399] ""                              ""                             
## [401] ""                              ""                             
## [403] ""                              ""                             
## [405] ""                              ""                             
## [407] ""                              ""                             
## [409] ""                              ""                             
## [411] ""                              ""                             
## [413] "Map"                           "of"                           
## [415] "countries"                     "by"                           
## [417] "approval"                      "status\r\nPhase"              
## [419] "I"                             "trials,"                      
## [421] "33"                            "in"                           
## [423] "Phase"                         "I<U+2013>II"                  
## [425] "trials,"                       "and"                          
## [427] "16"                            "in"                           
## [429] "Phase"                         "III"                          
## [431] "development.[3]\r\n"           ""                             
## [433] ""                              ""                             
## [435] ""                              ""                             
## [437] ""                              ""                             
## [439] ""                              ""                             
## [441] ""                              ""                             
## [443] ""                              ""                             
## [445] ""                              ""                             
## [447] ""                              ""                             
## [449] ""                              ""                             
## [451] ""                              ""                             
## [453] ""                              ""                             
## [455] ""                              ""                             
## [457] ""                              ""                             
## [459] ""                              ""                             
## [461] ""                              ""                             
## [463] ""                              ""                             
## [465] ""                              ""                             
## [467] ""                              ""                             
## [469] ""                              ""                             
## [471] ""                              ""                             
## [473] ""                              ""                             
## [475] ""                              ""                             
## [477] ""                              ""                             
## [479] ""                              ""                             
## [481] ""                              ""                             
## [483] ""                              ""                             
## [485] ""                              ""                             
## [487] ""                              ""                             
## [489] ""                              ""                             
## [491] ""                              ""                             
## [493] ""                              ""                             
## [495] ""                              ""                             
## [497] ""                              ""                             
## [499] ""                              ""                             
## [501] ""                              ""                             
## [503] ""                              ""                             
## [505] ""                              ""                             
## [507] ""                              ""                             
## [509] ""                              ""                             
## [511] ""                              ""                             
## [513] ""                              "Approved"                     
## [515] "for"                           "general"                      
## [517] "use,"                          "mass\r\nMany"                 
## [519] "countries"                     "have"                         
## [521] "implemented"                   "phased"                       
## [523] "distribution"                  "plans"                        
## [525] "that"                          "prioritize"                   
## [527] ""                              ""                             
## [529] ""                              ""                             
## [531] ""                              "vaccination"                  
## [533] "underway\r\nthose"             "at"                           
## [535] "highest"                       "risk"                         
## [537] "of"                            "complications,"               
## [539] "such"                          "as"                           
## [541] "the"                           "elderly,"                     
## [543] "and"                           "those"                        
## [545] "at"                            "high"                         
## [547] ""                              ""                             
## [549] ""                              ""                             
## [551] "EUA"                           "(or"                          
## [553] "equivalent)"                   "granted,"                     
## [555] "mass\r\nrisk"                  "of"                           
## [557] "exposure"                      "and"                          
## [559] "transmission,"                 "such"                         
## [561] "as"                            "healthcare"                   
## [563] "workers.[4]"                   "As"                           
## [565] "of"                            ""                             
## [567] ""                              ""                             
## [569] ""                              ""                             
## [571] ""                              ""                             
## [573] "vaccination"                   "underway\r\n20"               
## [575] "March"                         "2021,"                        
## [577] "436.37"                        "million"                      
## [579] "doses"                         "of"                           
## [581] "COVID<U+2011>19"               "vaccine"                      
## [583] "have"                          "been"                         
## [585] ""                              ""                             
## [587] ""                              ""                             
## [589] ""                              ""                             
## [591] ""                              ""                             
## [593] ""                              ""                             
## [595] ""                              ""                             
## [597] ""                              ""                             
## [599] ""                              ""                             
## [601] ""                              "EUA"                          
## [603] "granted,"                      "limited"                      
## [605] "vaccination\r\nadministered"   "worldwide"                    
## [607] "based"                         "on"                           
## [609] "official"                      "reports"                      
## [611] "from"                          "national"                     
## [613] "health"                        ""                             
## [615] ""                              ""                             
## [617] ""                              ""                             
## [619] ""                              ""                             
## [621] ""                              ""                             
## [623] ""                              ""                             
## [625] ""                              ""                             
## [627] "Approved"                      "for"                          
## [629] "general"                       "use,"                         
## [631] "mass\r\nagencies.[5]"          "AstraZeneca-Oxford"           
## [633] "anticipates"                   "producing"                    
## [635] "3"                             "billion"                      
## [637] "doses"                         "in"                           
## [639] ""                              ""                             
## [641] ""                              ""                             
## [643] ""                              ""                             
## [645] "vaccination"                   "planned\r\n2021,"             
## [647] "Pfizer-BioNTech"               "1.3"                          
## [649] "billion"                       "doses,"                       
## [651] "and"                           "Sputnik"                      
## [653] "V,"                            "Sinopharm,"                   
## [655] ""                              ""                             
## [657] ""                              ""                             
## [659] ""                              ""                             
## [661] ""                              ""                             
## [663] ""                              ""                             
## [665] ""                              ""                             
## [667] ""                              ""                             
## [669] ""                              ""                             
## [671] "EUA"                           "granted,"                     
## [673] "mass"                          "vaccination"                  
## [675] "planned\r\nSinovac,"           "and"                          
## [677] "Johnson"                       "&"                            
## [679] "Johnson"                       "1"                            
## [681] "billion"                       "doses"                        
## [683] "each."                         "Moderna"                      
## [685] "targets"                       ""                             
## [687] ""                              ""                             
## [689] ""                              ""                             
## [691] ""                              ""                             
## [693] ""                              ""                             
## [695] ""                              ""                             
## [697] ""                              ""                             
## [699] ""                              "EUA"                          
## [701] "pending\r\nproducing"          "600"                          
## [703] "million"                       "doses"                        
## [705] "and"                           "Convidecia"                   
## [707] "500"                           "million"                      
## [709] "doses"                         "in"                           
## [711] "2021.[6][7]\r\nBy"             "December"                     
## [713] "2020,"                         "more"                         
## [715] "than"                          "10"                           
## [717] "billion"                       "vaccine"                      
## [719] "doses"                         "had"                          
## [721] "been"                          "preordered\r\nby"             
## [723] "countries,[8]"                 "with"                         
## [725] "about"                         "half"                         
## [727] "of"                            "the"                          
## [729] "doses"                         "purchased"                    
## [731] "by"                            "high-income"                  
## [733] "countries"                     "comprising"                   
## [735] "14%"                           "of"                           
## [737] "the"                           "world's\r\npopulation.[9]\r\n"
## [739] "Contents\r\n"                  "Background\r\n"               
## [741] "Planning"                      "and"                          
## [743] "development\r\n"               ""                             
## [745] ""                              ""                             
## [747] ""                              ""                             
## [749] ""                              "Challenges\r\n"               
## [751] ""                              ""                             
## [753] ""                              ""                             
## [755] ""                              ""                             
## [757] "Organizations\r\n"             ""                             
## [759] ""                              ""                             
## [761] ""                              ""                             
## [763] ""                              "History\r\n"                  
## [765] "Vaccine"                       "types\r\n"                    
## [767] ""                              ""                             
## [769] ""                              ""                             
## [771] ""                              ""                             
## [773] "RNA"                           "vaccines\r\n"                 
## [775] ""                              ""                             
## [777] ""                              ""                             
## [779] ""                              ""                             
## [781] "Adenovirus"                    "vector"                       
## [783] "vaccines\r\n"                  ""                             
## [785] ""                              ""                             
## [787] ""                              ""                             
## [789] ""                              "Inactivated"                  
## [791] "virus"                         "vaccines\r\n"
covid_text_word <- unlist(covid_text_word)
length(covid_text_word)
## [1] 95363
strsplit("      ", split=" ")
## [[1]]
## [1] "" "" "" "" "" ""
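The empty strings above appear because every single space is treated as its own separator. As a minimal sketch, assuming we would rather treat any run of white space (spaces as well as \r\n) as one separator, we can pass the regular expression "\\s+" to strsplit(). The string toy_text is made up for illustration and is not part of the COVID-19 text.

```r
# Hypothetical toy string with repeated spaces and a Windows line break
toy_text <- "COVID-19   vaccine\r\nA  vaccine"

# Splitting on a single space keeps empty strings and leaves "\r\n" inside a word
unlist(strsplit(toy_text, split = " "))

# Splitting on one or more white space characters avoids both problems
toy_words <- unlist(strsplit(toy_text, split = "\\s+"))
toy_words
```

Note that a string starting with white space would still produce one empty string at the beginning, so filtering with nchar() remains useful.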
covid_text_word[1:200]
##   [1] "COVID-19"                "vaccine\r\nA"           
##   [3] "COVID<U+2011>19"         "vaccine"                
##   [5] "is"                      "a"                      
##   [7] "vaccine"                 "intended"               
##   [9] "to"                      "provide"                
##  [11] "acquired"                "immunity\r\nagainst"    
##  [13] "severe"                  "acute"                  
##  [15] "respiratory"             "syndrome"               
##  [17] "coronavirus"             "2"                      
##  [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"           
##  [21] "causing"                 "coronavirus"            
##  [23] "disease"                 "2019"                   
##  [25] "(COVID<U+2011>19)."      "Prior"                  
##  [27] "to"                      "the\r\nCOVID<U+2011>19" 
##  [29] "pandemic,"               "there"                  
##  [31] "was"                     "an"                     
##  [33] "established"             "body"                   
##  [35] "of"                      "knowledge"              
##  [37] "about\r\nthe"            "structure"              
##  [39] "and"                     "function"               
##  [41] "of"                      "coronaviruses"          
##  [43] "causing"                 "diseases"               
##  [45] "like"                    "severe"                 
##  [47] "acute\r\nrespiratory"    "syndrome"               
##  [49] "(SARS)"                  "and"                    
##  [51] "Middle"                  "East"                   
##  [53] "respiratory"             "syndrome\r\n(MERS),"    
##  [55] "which"                   "enabled"                
##  [57] "accelerated"             "development"            
##  [59] "of"                      "various"                
##  [61] "vaccine\r\ntechnologies" "during"                 
##  [63] "early"                   "2020.[1]"               
##  [65] "On"                      "10"                     
##  [67] "January"                 "2020,"                  
##  [69] "the"                     "SARS-CoV-2\r\ngenetic"  
##  [71] "sequence"                "data"                   
##  [73] "was"                     "shared"                 
##  [75] "through"                 "GISAID,"                
##  [77] "and"                     "by"                     
##  [79] "19"                      "March,"                 
##  [81] "the\r\nglobal"           "pharmaceutical"         
##  [83] "industry"                "announced"              
##  [85] "a"                       "major"                  
##  [87] "commitment"              "to"                     
##  [89] "address\r\nCOVID-19.[2]" ""                       
##  [91] ""                        ""                       
##  [93] ""                        ""                       
##  [95] ""                        ""                       
##  [97] ""                        ""                       
##  [99] ""                        ""                       
## [101] ""                        ""                       
## [103] ""                        ""                       
## [105] ""                        ""                       
## [107] ""                        ""                       
## [109] ""                        ""                       
## [111] ""                        ""                       
## [113] ""                        ""                       
## [115] ""                        ""                       
## [117] ""                        ""                       
## [119] ""                        ""                       
## [121] ""                        ""                       
## [123] ""                        ""                       
## [125] ""                        ""                       
## [127] ""                        ""                       
## [129] ""                        ""                       
## [131] ""                        ""                       
## [133] ""                        ""                       
## [135] ""                        ""                       
## [137] ""                        ""                       
## [139] ""                        ""                       
## [141] ""                        ""                       
## [143] ""                        ""                       
## [145] ""                        ""                       
## [147] ""                        ""                       
## [149] ""                        ""                       
## [151] ""                        ""                       
## [153] ""                        ""                       
## [155] ""                        "COVID-19"               
## [157] "vaccination"             "doses"                  
## [159] "administered\r\n"        ""                       
## [161] ""                        ""                       
## [163] ""                        ""                       
## [165] ""                        ""                       
## [167] ""                        ""                       
## [169] ""                        ""                       
## [171] ""                        ""                       
## [173] ""                        ""                       
## [175] ""                        ""                       
## [177] ""                        ""                       
## [179] ""                        ""                       
## [181] ""                        ""                       
## [183] ""                        ""                       
## [185] ""                        ""                       
## [187] ""                        ""                       
## [189] ""                        ""                       
## [191] ""                        ""                       
## [193] ""                        ""                       
## [195] ""                        ""                       
## [197] ""                        ""                       
## [199] ""                        ""

Cleaning texts

  1. White space
  2. Punctuation marks
  3. Numbers
  4. Stop words
  5. Non-English text (characters outside ASCII): https://en.wikipedia.org/wiki/UTF-8
  6. Reference section
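The first three items of this list can be sketched with base functions only. The example below is one possible approach, not the only one: it assumes that line breaks should separate words and that all punctuation marks and digits can be dropped. The vector toy_words is a made-up stand-in for covid_text_word.

```r
# Hypothetical toy tokens imitating the problems seen in covid_text_word
toy_words <- c("vaccine\r\nA", "(SARS),", "2020.[1]", "", "Phase", "the")

# 1. White space: turn line breaks into spaces, then split again on spaces
step1 <- unlist(strsplit(gsub("[\r\n]+", " ", toy_words), split = " +"))

# 2. Punctuation marks: the class [[:punct:]] matches punctuation characters
step2 <- gsub("[[:punct:]]", "", step1)

# 3. Numbers: the class [[:digit:]] matches the digits 0-9
step3 <- gsub("[[:digit:]]", "", step2)

# Drop words that became empty and convert the rest to lower case
toy_clean <- tolower(step3[nchar(step3) > 0])
toy_clean
```

Removing every digit also destroys meaningful tokens such as "covid-19", so whether step 3 is appropriate depends on the analysis.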

Simple but Incomplete Cleaning

#install.packages("stringr")
library(stringr)

covid_text_word[1:100]
##   [1] "COVID-19"                "vaccine\r\nA"           
##   [3] "COVID<U+2011>19"         "vaccine"                
##   [5] "is"                      "a"                      
##   [7] "vaccine"                 "intended"               
##   [9] "to"                      "provide"                
##  [11] "acquired"                "immunity\r\nagainst"    
##  [13] "severe"                  "acute"                  
##  [15] "respiratory"             "syndrome"               
##  [17] "coronavirus"             "2"                      
##  [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"           
##  [21] "causing"                 "coronavirus"            
##  [23] "disease"                 "2019"                   
##  [25] "(COVID<U+2011>19)."      "Prior"                  
##  [27] "to"                      "the\r\nCOVID<U+2011>19" 
##  [29] "pandemic,"               "there"                  
##  [31] "was"                     "an"                     
##  [33] "established"             "body"                   
##  [35] "of"                      "knowledge"              
##  [37] "about\r\nthe"            "structure"              
##  [39] "and"                     "function"               
##  [41] "of"                      "coronaviruses"          
##  [43] "causing"                 "diseases"               
##  [45] "like"                    "severe"                 
##  [47] "acute\r\nrespiratory"    "syndrome"               
##  [49] "(SARS)"                  "and"                    
##  [51] "Middle"                  "East"                   
##  [53] "respiratory"             "syndrome\r\n(MERS),"    
##  [55] "which"                   "enabled"                
##  [57] "accelerated"             "development"            
##  [59] "of"                      "various"                
##  [61] "vaccine\r\ntechnologies" "during"                 
##  [63] "early"                   "2020.[1]"               
##  [65] "On"                      "10"                     
##  [67] "January"                 "2020,"                  
##  [69] "the"                     "SARS-CoV-2\r\ngenetic"  
##  [71] "sequence"                "data"                   
##  [73] "was"                     "shared"                 
##  [75] "through"                 "GISAID,"                
##  [77] "and"                     "by"                     
##  [79] "19"                      "March,"                 
##  [81] "the\r\nglobal"           "pharmaceutical"         
##  [83] "industry"                "announced"              
##  [85] "a"                       "major"                  
##  [87] "commitment"              "to"                     
##  [89] "address\r\nCOVID-19.[2]" ""                       
##  [91] ""                        ""                       
##  [93] ""                        ""                       
##  [95] ""                        ""                       
##  [97] ""                        ""                       
##  [99] ""                        ""
nchar(covid_text_word[1:100])
##   [1]  8 10  8  7  2  1  7  8  2  7  8 17  6  5 11  8 11  1 13 10  7 11  7  4 11
##  [26]  5  2 13  9  5  3  2 11  4  2  9 10  9  3  8  2 13  7  8  4  6 18  8  6  3
##  [51]  6  4 11 17  5  7 11 11  2  7 21  6  5  8  2  2  7  5  3 19  8  4  3  6  7
##  [76]  7  3  2  2  6 11 14  8  9  1  5 10  2 21  0  0  0  0  0  0  0  0  0  0  0
covid_text_word_main <- covid_text_word[nchar(covid_text_word)>0]
# which: a function to give the TRUE indices of a logical object
# str_detect: a function to detect the presence (TRUE) or absence (FALSE) of a pattern in a string
str_detect(covid_text_word_main, pattern="References")[1:100]
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE
which(str_detect(covid_text_word_main, pattern="References"))
## [1]  406 7877
str_which(covid_text_word_main, pattern="References")
## [1]  406 7877
# Keep only the words before the second "References" match (index 7877), i.e. drop the reference section
covid_text_word_main <- covid_text_word_main[1:7876]
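The same search can be done without stringr. As a small sketch, grepl() and grep() from the base package play the roles of str_detect() and str_which(). The vector toy_words is hypothetical, and taking the last match as the start of the reference section is an assumption that happens to fit the output above.

```r
# Hypothetical toy vector with two matches, as in the output above
toy_words <- c("Contents", "References", "vaccine", "trial", "References", "Smith")

# grepl(): TRUE or FALSE for each element, comparable to str_detect()
grepl("References", toy_words)

# grep(): indices of the matching elements, comparable to str_which()
ref_idx <- grep("References", toy_words)

# Keep everything before the last match instead of typing the index by hand
toy_main <- toy_words[seq_len(max(ref_idx) - 1)]
toy_main
```

Computing the cut-off from ref_idx avoids hard-coding a number such as 7876, which would silently become wrong if the text were downloaded again.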

# sort: a function to sort a vector into ascending or descending order
# table: a function to build a table of the counts at each observation (word)
covid_text_word_freq <- sort(table(covid_text_word_main), decreasing = TRUE)
covid_text_word_freq[1:50]
## covid_text_word_main
##             the              of             and           Phase              to 
##             212             195             177             130             113 
##              in         vaccine        <U+2013>             for               a 
##              97              95              92              87              75 
##            2020           2021,          United        vaccines               I 
##              57              48              48              43              42 
## COVID<U+2011>19 Preclinical\r\n              an              by            that 
##              40              37              32              32              32 
##            with        COVID-19     I<U+2013>II              or              as 
##              30              29              28              28              26 
## Randomized,\r\n             are              be              is       2021,\r\n 
##              26              25              25              25              24 
##              at           doses             RNA             III           South 
##              24              24              24              23              23 
##           2020,              on    placebo-\r\n         Subunit             The 
##              22              22              22              22              22 
##        clinical        efficacy            have          States           2022, 
##              21              21              20              20              19 
##         against           trial          vector             Nov         and\r\n 
##              19              19              19              18              17
covid_text_word_freq <- sort(table(tolower(covid_text_word_main)), decreasing = TRUE)
covid_text_word_freq[1:50]
## 
##             the              of             and           phase              to 
##             234             195             177             135             113 
##         vaccine              in        <U+2013>             for               a 
##             109             105              92              89              81 
##            2020        vaccines           2021,          united               i 
##              57              50              48              48              42 
## covid<U+2011>19 randomized,\r\n preclinical\r\n              an              as 
##              40              38              37              35              35 
##              by            that            with        covid-19    placebo-\r\n 
##              33              32              32              29              29 
##     i<U+2013>ii              or              on         subunit             are 
##              28              28              27              26              25 
##              be              is       2021,\r\n              at           doses 
##              25              25              24              24              24 
##             rna       emergency             iii           south           2020, 
##              24              23              23              23              22 
##        efficacy        clinical             may           trial            have 
##              22              21              21              21              20 
##          states           2022,         against          vector     development 
##              20              19              19              19              18
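  1. White space: Empty strings were removed, but line breaks (\r\n) still remain inside words.
  2. Punctuation marks: Not yet handled. We have not specified which punctuation marks appear and should be removed.
  3. Numbers: Not yet handled. We have not specified which numbers appear and should be removed.
  4. Stop words: Not yet handled. What are stop words? https://en.wikipedia.org/wiki/Stop_words
  5. Non-English text: Not yet handled. How can we detect non-English characters?
  6. Reference section and URLs: Cut off at the word "References", but the removal is incomplete.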
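For items 4 and 5, a rough sketch with base functions could look as follows. The stop word list toy_stopwords is tiny and made up (real lists are much longer), and treating every character outside ASCII as "non-English" is a simplifying assumption.

```r
# Hypothetical toy tokens; "\u2013" and "\u2011" are the characters
# printed as <U+2013> and <U+2011> in the output above
toy_words <- c("the", "vaccine", "of", "phase", "\u2013", "covid\u201119", "and")

# A tiny, made-up stop word list for illustration only
toy_stopwords <- c("the", "of", "and", "to", "in", "a")

# %in% flags the words that appear in the stop word list
no_stop <- toy_words[!(toy_words %in% toy_stopwords)]

# Flag words containing at least one character outside the ASCII range
has_non_ascii <- grepl("[^\\x01-\\x7F]", no_stop, perl = TRUE)
toy_ascii <- no_stop[!has_non_ascii]
toy_ascii
```

Dropping a whole word because of one non-ASCII character is a blunt choice. Replacing the character instead, for example with gsub(), would keep words such as "covid-19" in the analysis.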

Generating Wordcloud

length(covid_text_word_freq)
## [1] 3211
covid_text_word_freq[1:20]
## 
##             the              of             and           phase              to 
##             234             195             177             135             113 
##         vaccine              in        <U+2013>             for               a 
##             109             105              92              89              81 
##            2020        vaccines           2021,          united               i 
##              57              50              48              48              42 
## covid<U+2011>19 randomized,\r\n preclinical\r\n              an              as 
##              40              38              37              35              35
#install.packages("wordcloud")
library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(words = names(covid_text_word_freq), # Vector of unique words
          freq = covid_text_word_freq, # Frequency of each word
          min.freq = 10, # Words with a frequency below 10 are not plotted
          random.order = FALSE, # Plot words in decreasing frequency, so frequent words are placed in the middle
          rot.per = 0.1, # Proportion of words rotated by 90 degrees
          colors = brewer.pal(8, "Dark2"), # 8 colors from the "Dark2" palette
          scale = c(3, 0.5)) # Range of word sizes (largest, smallest)
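Before plotting, it can help to check which words pass the min.freq threshold. The sketch below uses a made-up frequency table, toy_freq, in place of covid_text_word_freq, and only mimics the filtering step. It does not draw anything.

```r
# Hypothetical toy frequency table standing in for covid_text_word_freq
toy_freq <- sort(table(c(rep("vaccine", 12), rep("phase", 10), rep("trial", 3))),
                 decreasing = TRUE)

# Words whose frequency reaches min.freq = 10 are candidates for plotting
toy_plotted <- toy_freq[toy_freq >= 10]
names(toy_plotted)
length(toy_plotted)
```

Running the same check on covid_text_word_freq shows how many of the 3211 unique words can appear in the word cloud, and that stop words such as "the" and "of" will dominate it unless they are removed first.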