Understand some of the basic text processing steps such as tokenization, stop word removal, stemming, and lemmatization
Automated text analysis always requires some form of text processing. Consider the following example of a tweet:
Today’s the day, ladies and gents. Mr. K will land in U.S. :)
If one wants to use information from this piece of text for any form of text mining, it is important to determine what the tokens in the text are:
today, ’s, the, day, ladies, and, gents, Mr., K, will, land, in, U.S., :)
This implies a process that understands that periods in abbreviations (e.g., Mr.) and acronyms (e.g., U.S.) need to be preserved as such, but also that punctuation needs to be separated from the nearby tokens (the comma after day or the period after gents).
Further, a text preprocessor often normalizes the text (e.g., it may expand ’s into is or the informal gents into gentlemen), it may try to identify the root or stem of the words (e.g., lady for ladies, or be for ’s), and it may even attempt to identify and possibly label special symbols such as the emoticon :).
Text (pre-)processing can consist of basic steps such as:
Removing the HTML (HyperText Markup Language) tags from documents collected from the web
Separating the punctuation from the words
Removing function words (very frequent words, https://en.wikipedia.org/wiki/Function_word) as stop words
Applying stemming or lemmatization (reducing words to their root form)
These text (pre-)processing steps result in a set of tokens that can be used to collect statistics or serve as input for more advanced applications such as sentiment analysis or text classification.
Note that the choice of text (pre-)processing steps is often application dependent: e.g., for analyzing the language of deception, stop words are useful and should be preserved. But to analyze the main theme of texts, stop words can be removed, and we also benefit from stemming all the input words. For identifying all the organizations that appear in a corpus, more advanced annotations are useful, such as those produced by a named entity recognition tool.
Tokenization is the process of identifying the words in the input sequence of characters, mainly by separating the punctuation marks but also by identifying contractions, abbreviations, and so forth to maintain their intended meaning.
This tokenization process also includes text normalization steps, such as lowercasing and removing HTML tags.
The process of tokenization assumes that white spaces and punctuation are used as explicit word boundaries. But this is not the case for languages such as Korean.
“Mr. Smith doesn’t like apples.” —Tokenization—> “Mr. Smith does not like apples”
Special Attention
End-of-sentence periods vs. markers of abbreviations (e.g., Mr., Dr., U.S.)
Contractions and abbreviations are language dependent: we need to compile a list of such words to make sure that the tokenization of the period is handled correctly. The same applies to apostrophes and hyphenation.
For an apostrophe, we often want to identify the contractions and separate them such that they form meaningful individual words. For instance, the possessive books’ should form two words: book and s’. The contractions aren’t and he’s should be separated into are and not and he and ’s.
For hyphenation, we often leave the hyphen in place to indicate a collocation, as in, e.g., state-of-the-art, although sometimes it may be useful to separate it to allow access to the individual words; e.g., separate Hewlett-Packard into Hewlett - Packard. A minimal tokenizer sketch follows below.
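To make this concrete, here is a minimal tokenizer sketch in R with stringr. The abbreviation list and the contraction rule below are made up for illustration; a real tokenizer would rely on properly compiled, language-specific lists, and note that this sketch keeps sentence punctuation as separate tokens.
library(stringr)
tweet <- "Today's the day, ladies and gents. Mr. K will land in U.S. :)"
abbrevs <- c("Mr.", "Dr.", "U.S.") # made-up abbreviation list for this example
tokenize_sketch <- function(x) {
  # protect periods inside known abbreviations with a placeholder
  for (a in abbrevs) {
    x <- str_replace_all(x, fixed(a), str_replace_all(a, fixed("."), "<DOT>"))
  }
  x <- str_replace_all(x, c("'s" = " 's"))       # split off the 's contraction
  x <- str_replace_all(x, "([.,;!?])", " \\1 ")  # pad the remaining punctuation with spaces
  x <- str_replace_all(x, fixed("<DOT>"), ".")   # restore the abbreviation periods
  unlist(str_split(str_squish(x), " "))
}
tokenize_sketch(tweet)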
Stop words, aka function words, consist of high-frequency words including pronouns (e.g., I, we, us), determiners (e.g., the, a), prepositions (e.g., in, on), and others. For some tasks, stop words provide meaningful information: e.g., they give significant insights into people’s personalities and behaviors (Pennebaker & King, 1999). But there are also tasks for which it is useful to remove them and focus the attention on content words such as nouns and verbs. In this case, we usually use a precompiled list of stop words.
But such a precompiled list of stop words can be unavailable for a language. We can then gather word statistics on a very large corpus of texts written in that language and take the top N most frequent words as candidate stop words, since stop words are generally high-frequency words.
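As a rough sketch of that idea in R, we can count word frequencies over a (here, toy) corpus and take the top N words as stop-word candidates; the corpus and the cutoff N below are placeholders:
library(stringr)
corpus <- c("the cat sat on the mat", "the dog sat on the log", "a cat and a dog") # toy corpus
N <- 5 # arbitrary cutoff
words <- unlist(str_split(str_to_lower(corpus), "\\s+")) # crude whitespace tokenization
word_freq <- sort(table(words), decreasing = TRUE)       # word frequency table
stopword_candidates <- names(word_freq)[1:N]             # top-N most frequent words
stopword_candidates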
Many words in natural language are related, yet they have different surface forms due to grammatical reasons, such as construction and construct or study and studies. These relations can be captured by identifying the common stem of multiple words, and this is called stemming. Stemming applies a set of rules to an input word to remove suffixes and prefixes and obtain its stem, which will now be shared with other related words. For instance, computer, computational, and computation will all be reduced to the same stem: compute. Put simply, stemming is a processing step that uses a set of rules to remove such inflections.
But stemming often produces stems that are not valid words since suffixes or prefixes were removed. Stemming the words study and studying transforms them into studi.
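In R, a Porter-style stemmer is available, for example, through the SnowballC package (a minimal sketch; assumes SnowballC is installed):
# install.packages("SnowballC") # uncomment if the package is not installed
library(SnowballC)
wordStem(c("study", "studies", "studying")) # all three are reduced to the stem "studi"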
The alternative to stemming is lemmatization, which reduces the inflectional forms of a word to its root form. For example, lemmatization transforms studies to study and am, are, or is to be. That is, lemmatization is the process of identifying the base form (or root form) of a word as found in a dictionary. So, unlike stemming, the output obtained from lemmatization is a valid word form; thus, its output is readable by humans, while this comes at a cost of a more computationally intensive process.
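In R, one option for dictionary-based lemmatization is the textstem package (a minimal sketch; assumes textstem and its lemma dictionary are installed):
# install.packages("textstem") # uncomment if the package is not installed
library(textstem)
lemmatize_words(c("studies", "am", "are", "is")) # expected: "study" "be" "be" "be", i.e., valid dictionary forms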
Let me remind you of the functions in the package stringr covered last time.
Function | Description | Similar Base Functions |
---|---|---|
str_length() | number of characters | nchar() |
str_split() | split up a string into pieces | strsplit() |
str_c() | string concatenation | paste() |
str_squish() | removes any redundant whitespace | |
str_detect() | finds a particular pattern of characters | |
str_view_all() | shows the matching result on the screen | |
All functions in stringr start with "str_", followed by a term related to the task they perform.
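As a quick refresher, these functions can be tried on a toy string; the string below and the results noted in the comments are only illustrative:
library(stringr)
s <- "  Mr. K   will land in U.S.  "
str_length(s)                                 # number of characters, including spaces
str_squish(s)                                 # "Mr. K will land in U.S."
str_c("Breaking:", str_squish(s), sep = " ")  # concatenate two strings with a space
str_split(str_squish(s), " ")                 # split the string into individual words
str_detect(s, "U\\.S\\.")                     # TRUE: the pattern "U.S." occurs in s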
Most string functions work with regex, a concise language for describing certain patterns of text. The following functions are useful for text pre-processing.
Function | Description |
---|---|
str_which() | Returns all positions of a matching pattern in a string vector |
str_subset() | Returns all elements that contain a matching pattern in a string vector |
str_trunc() | Truncates a string |
str_locate() | Locates the first position of a matching pattern in a string |
str_locate_all() | Locates all positions of a matching pattern in a string |
str_extract() | Extracts the first matching pattern from a string |
str_extract_all() | Extracts all matching patterns from a string |
str_replace() | Replaces the first matching pattern in a string |
str_replace_all() | Replaces all matching patterns in a string |
str_remove() | Removes the first matching pattern in a string |
str_remove_all() | Removes all matching patterns in a string |
library(stringr)
library(pdftools)
covid_text <- pdf_text("Coronavirus_disease_2019.pdf")
class(covid_text)
## [1] "character"
length(covid_text)
## [1] 43
covid_string <- str_c(covid_text, collapse = " ") # Collapse a character vector, covid_text, into a single string
length(covid_string)
## [1] 1
Now we have a single string in which text from a Wikipedia page about COVID-19 is concatenated.
First, we want to remove everything in the References section
str_locate_all(covid_string, "References") # Locate the start and end positions of the pattern "References" in the string
## [[1]]
## start end
## [1,] 5336 5345
## [2,] 44140 44149
str_trunc(covid_string, 100, "right") # Truncate the string: keep the first 100 characters and drop everything after them, to the right end
## [1] "Coronavirus disease 2019\nCoronavirus disease 2019 (COVID-19) is an infectious\ndisease caused by s..."
covid_trunc <- str_trunc(covid_string, 44139, "right")
str_locate_all(covid_trunc, "References")
## [[1]]
## start end
## [1,] 5336 5345
Now we know where the literal pattern "References" appears in the string covid_string, and we truncate the string by removing everything from the second occurrence (the start of the References section) onward.
Next, it seems we need to deal with whitespace (\n, \r\n, or multiple blanks). Remember how to remove all redundant whitespace characters, including line breaks: [:space:]
str_trunc(covid_trunc, 100, "right")
## [1] "Coronavirus disease 2019\nCoronavirus disease 2019 (COVID-19) is an infectious\ndisease caused by s..."
covid_tidy <- str_squish(covid_trunc) # Replace any redundant whitespace
str_trunc(covid_tidy, 100, "right")
## [1] "Coronavirus disease 2019 Coronavirus disease 2019 (COVID-19) is an infectious disease caused by s..."
It should look tidier than before. What do we need to do with the string object now? It seems we should deal with normalization (standardizing the text into either lower-case or upper-case letters).
So, we may want to use the function tolower to translate the characters of a string into lower-case ones. Before doing so, we may need to remove all non-English characters using the POSIX character class [:ascii:]. Let’s check what non-English characters are in the string.
str_extract_all(covid_tidy, "[^[:ascii:]]+") # Extract all non-English characters (matching the preceding character set at least one or more times); # Guess what {1,} in regex does
## [[1]]
## [1] "–" "əˈ" "ʊ" "əˌ" "ɪ" "ə" "ɪˈ" "ː" "ˈ" "ʊ" "ɪ" "–" "–" "–" "×"
## [16] "–" "–" "à" "–" "–" "–" "–" "–" "–" "–" "–" "–" "–" "–" "–"
## [31] "–" "–" "≥" "–" "–" "–" "–" "–" "–"
covid_eng <- str_replace_all(covid_tidy, "[^[:ascii:]]+", " ") # Replace any non-English character with a blank " ".
covid_eng_lower <- tolower(covid_eng) # Translate all characters into lower-case letters
# if you have an error message, you may try a stringr function, str_to_lower, instead.
str_trunc(covid_eng, 1000)
## [1] "Coronavirus disease 2019 Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome Coronavirus disease 2019 (COVID- coronavirus 2 (SARS-CoV-2).[7] The disease was first 19) identified in December 2019 in Wuhan, the capital of Other names 2019-nCoV acute China's Hubei province, and has since spread globally, resulting in the ongoing 2019 20 coronavirus respiratory disease pandemic.[8][9] Common symptoms include fever, cough, Novel coronavirus and shortness of breath.[10] Other symptoms may include pneumonia[1] muscle pain, sputum production, diarrhea, sore throat, Wuhan pneumonia[2][3] loss of smell, and abdominal pain.[4][11][12] While the Wuhan coronavirus majority of cases result in mild symptoms, some progress to viral pneumonia and multi-organ failure.[8][13] As of \"Coronavirus\" or other 28 March 2020, the overall rate of deaths per number of names for SARS-CoV-2 diagnosed cases is 4.7 percent; ranging from 0.2 percent to 15 percent..."
str_trunc(covid_eng_lower, 1000, "right")
## [1] "coronavirus disease 2019 coronavirus disease 2019 (covid-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus disease 2019 (covid- coronavirus 2 (sars-cov-2).[7] the disease was first 19) identified in december 2019 in wuhan, the capital of other names 2019-ncov acute china's hubei province, and has since spread globally, resulting in the ongoing 2019 20 coronavirus respiratory disease pandemic.[8][9] common symptoms include fever, cough, novel coronavirus and shortness of breath.[10] other symptoms may include pneumonia[1] muscle pain, sputum production, diarrhea, sore throat, wuhan pneumonia[2][3] loss of smell, and abdominal pain.[4][11][12] while the wuhan coronavirus majority of cases result in mild symptoms, some progress to viral pneumonia and multi-organ failure.[8][13] as of \"coronavirus\" or other 28 march 2020, the overall rate of deaths per number of names for sars-cov-2 diagnosed cases is 4.7 percent; ranging from 0.2 percent to 15 percent..."
covid_tidy <- str_to_lower(covid_tidy)
Now, let’s think about how to deal with punctuation and numbers.
# Check what punctuation marks are to be removed
unlist(str_extract_all(covid_eng_lower, "[^[:space:]]*[[:punct:]]{1,}[^[:space:]]*")) # Remember why we apply the unlist function to the result from str_extract_all
## [1] "(covid-19)" "(covid-"
## [3] "(sars-cov-2).[7]" "19)"
## [5] "wuhan," "2019-ncov"
## [7] "china's" "province,"
## [9] "globally," "pandemic.[8][9]"
## [11] "fever," "cough,"
## [13] "breath.[10]" "pneumonia[1]"
## [15] "pain," "production,"
## [17] "diarrhea," "throat,"
## [19] "pneumonia[2][3]" "smell,"
## [21] "pain.[4][11][12]" "symptoms,"
## [23] "multi-organ" "failure.[8][13]"
## [25] "\"coronavirus\"" "2020,"
## [27] "sars-cov-2" "4.7"
## [29] "percent;" "0.2"
## [31] "problems.[14]" "comparison,"
## [33] "3%" "5%.[15]"
## [35] "sneeze.[16][17]" "airborne.[16][18]"
## [37] "covid-19" "face.[16][17]"
## [39] "covid-19" "symptomatic,"
## [41] "/k" "z,"
## [43] "appear.[17]" "d/"
## [45] "hours.[19]" "days,"
## [47] "days.[10][20]" "fever,"
## [49] "cough," "breath[4]"
## [51] "(rrt-pcr)" "swab.[21]"
## [53] "pneumonia," "sepsis,"
## [55] "syndrome," "symptoms,"
## [57] "pneumonia.[22][23]" "(may"
## [59] "washing," "(maintaining"
## [61] "days)" "others,"
## [63] "symptoms)," "elbow,"
## [65] "face.[24][25]" "(sars-cov-2)"
## [67] "caregivers.[26]" "travel,"
## [69] "rrt-pcr" "testing,"
## [71] "vary," "use,"
## [73] "use," "use.[27][28][29]"
## [75] "currently," "washing,"
## [77] "quarantine," "covid-19."
## [79] "symptoms," "care,"
## [81] "isolation," "measures.[30]"
## [83] "859,032[5][6]" "2020,"
## [85] "(who)" "42,322[5][6]"
## [87] "(4.7%" "cases)"
## [89] "(pheic)" "[31][32]"
## [91] "2020." "[9]"
## [93] "regions.[33]" "anti-cytokine"
## [95] "symptom[34]" "%"
## [97] "flu-like" "symptoms,"
## [99] "fever," "cough,"
## [101] "fatigue," "breath.[4][38][39]"
## [103] "87.9" "breathing,"
## [105] "pressure," "67.7"
## [107] "confusion," "waking,"
## [109] "lips;" "38.1"
## [111] "present.[40]" "commonly,"
## [113] "symptoms," "sneezing,"
## [115] "33.4" "nose,"
## [117] "seen." "15[35]"
## [119] "30[12][36]" "nausea,"
## [121] "vomiting," "18.6"
## [123] "percentages.[37][41][42]" "palpitations.[43]"
## [125] "14.8" "13.9"
## [127] "(anosmia)" "13.6"
## [129] "disease,[12][36]" "reported.[35]"
## [131] "some," "pneumonia,"
## [133] "11.4" "multi-organ"
## [135] "failure," "death.[8][13]"
## [137] "5.0" "symptoms,"
## [139] "4.8" "days.[44]"
## [141] "3.7" "31[37]"
## [143] "infections," "0.9"
## [145] "symptoms." "0.8"
## [147] "period." "covid-19"
## [149] "days.[45][46]" "97.5%"
## [151] "11.5" "infection.[47]"
## [153] "symptoms," "unknown.[48]"
## [155] "disease.[49][50]" "studied,"
## [157] "korea's" "20%"
## [159] "stay.[50][51]" "(sars-cov-2).[52]"
## [161] "sneezes.[53]" "copper,"
## [163] "cardboard," "steel,"
## [165] "plastic." "however,"
## [167] "100%" "(limit"
## [169] "3.33" "100.5"
## [171] "aerosols," "100.5"
## [173] "plastic," "steel,"
## [175] "cardboard," "101.5"
## [177] "copper)." "copper,"
## [179] "cardboard," "steel,"
## [181] "plastic." "(three"
## [183] "hours).[54]" "faeces,"
## [185] "researched.[55][56]" "areas."
## [187] "2.35" "1.05,"
## [189] "manageable.[57]" "newborn.[58]"
## [191] "53%[55]" "infection.[59]"
## [193] "days," "samples,"
## [195] "fecal-oral" "tract.[55]"
## [197] "periods.[59]" "sars-cov-2."
## [199] "particle." "s,"
## [201] "protein;" "m,"
## [203] "protein;" "crown,"
## [205] "e," "protein;"
## [207] "n," "protein;"
## [209] "name." "coronavirus."
## [211] "structure." "covid-19"
## [213] "ace2," "lungs."
## [215] "\"spike\"" "(peplomer)"
## [217] "cell.[60]" "protective,[61][62]"
## [219] "tested.[63]" "progresses,"
## [221] "follow.[62]" "gastric,"
## [223] "epithelium[55]" "intestine.[64]"
## [225] "disease.[66]" "real-time"
## [227] "(rrt-pcr).[67]" "swab,"
## [229] "used.[21][68]" "days.[69][70]"
## [231] "used," "value.[71]"
## [233] "(pcr)" "rrt-pcr"
## [235] "covid-" "virus.[8][72][73]"
## [237] "2020,[74]" "19[65]"
## [239] "ongoing.[75]" "point-of-care"
## [241] "month.[76]" "risk."
## [243] "people:" "fever,"
## [245] "pneumonia," "count,"
## [247] "count.[22]" "x-rays"
## [249] "stages," "occur.[77]"
## [251] "ground-" "peripheral,"
## [253] "distribution.[77]" "dominance,"
## [255] "evolves.[78]" "2020,"
## [257] "\"ct" "first-line"
## [259] "covid-" "19\".[79]"
## [261] "covid-19.[80][81]" "are:"
## [263] "macroscopy:" "pleurisy,"
## [265] "pericarditis," "observed:"
## [267] "pneumonia:" "exudation,"
## [269] "pneumonia:" "oedema,"
## [271] "hyperplasia," "pneumocytes,"
## [273] "pneumonia:" "(dad)"
## [275] "exudates." "(ards)"
## [277] "disease." "pneumonia:"
## [279] "cavities," "bal[82]"
## [281] "liver:" "home,"
## [283] "places," "seconds,"
## [285] "eyes," "nose,"
## [287] "hands.[88][89][90]" "available.[88]"
## [289] "sneeze.[88]" "time,"
## [291] "curve;" "workplaces,"
## [293] "patients.[83][84][85]" "travel,"
## [295] "gatherings." "[91]"
## [297] "(about" "1.80"
## [299] "meters).[92]" "sars-cov-2"
## [301] "earliest,[93]" "covid-19"
## [303] "peak," "\"flattening"
## [305] "curve\"," "infections.[84]"
## [307] "curve[86][87]" "overwhelmed,"
## [309] "cases," "available.[84]"
## [311] "who," "infection.[94]"
## [313] "masks," "china,[95]"
## [315] "kong,[96]" "thailand,[97]"
## [317] "republic,[98]" "austria.[99]"
## [319] "masks," "40%."
## [321] "problem," "sixfold,"
## [323] "tripled," "doubled.[100]"
## [325] "non-medical" "noses,"
## [327] "non-medical" "person.[101]"
## [329] "covid-19" "care,"
## [331] "provider," "provider's"
## [333] "person," "tissue,"
## [335] "water," "items.[102][103]"
## [337] "seconds," "dirty,"
## [339] "one's" "nose,"
## [341] "coughing," "sneezing."
## [343] "alcohol-based" "60%"
## [345] "alcohol," "available.[88]"
## [347] "available," "production."
## [349] "formulations," "isopropanol."
## [351] "alcohol;" "\"not"
## [353] "antisepsis\"." "humectant.[104]"
## [355] "multiplicative," "spread."
## [357] "line," "tracks."
## [359] "care," "fluid,"
## [361] "support," "organs.[106][107][108]"
## [363] "mask.[26]" "(ecmo)"
## [365] "failure," "consideration.[109][110]"
## [367] "covid-19.[111][112]" "u.s."
## [369] "resource," "ibcc.[113][114]"
## [371] "(acetaminophen)" "first-line"
## [373] "use.[115][116][117]" "non-steroidal"
## [375] "anti-inflammatory" "(nsaids)"
## [377] "symptoms,[118]" "covid-19"
## [379] "symptoms.[119]" "blockers,"
## [381] "2020," "equipment[105]"
## [383] "medications.[120][121][122]" "syndrome.[123][124]"
## [385] "transmission," "aerosols,"
## [387] "ventilation.[125]" "(ppe)"
## [389] "pandemic." "includes:"
## [391] "facemask[126][127]" "gloves[128][129]"
## [393] "protection[130]" "available,"
## [395] "(instead" "facemasks)"
## [397] "preferred.[131]" "(eua)."
## [399] "off-label" "uses.[132]"
## [401] "shields," "masks.[133]"
## [403] "covid-19" "(artificial"
## [405] "breathing)," "do.[134][135]"
## [407] "vectors.[134]" "(those"
## [409] "years[134]" "years).[136]"
## [411] "capita," "system's"
## [413] "covid-19" "hospitalization.[137]"
## [415] "(to" "lower).[137]"
## [417] "5%" "units,"
## [419] "2.3%" "ventilation,"
## [421] "1.4%" "died.[109]"
## [423] "20-30%" "support.[138]"
## [425] "equipment.[139]" "covid-19"
## [427] "difficult.[140]" "peep[141]"
## [429] "ventilator-associated" "pneumothorax.[142]"
## [431] "ventilators." "ards[140]"
## [433] "high-flow" "<93%."
## [435] "4ml/kg" "(high"
## [437] "(35" "minute)"
## [439] "required)" "end-expiratory"
## [441] "1/2" "authorities.[143]"
## [443] "2020,[144]" "trials.[145][146]"
## [445] "develop,[147]" "uses,"
## [447] "testing.[143]" "disease.[106]"
## [449] "treatments.[148]" "2020,"
## [451] "outbreak.[149]" "number."
## [453] "'close" "contact'"
## [455] "infection." "users."
## [457] "detected," "self-quarantine,"
## [459] "officials.[150]" "data,"
## [461] "technology," "korea,"
## [463] "taiwan," "singapore.[151][152]"
## [465] "2020," "coronavirus."
## [467] "citizens.[153]" "2020,"
## [469] "agency," "institute,"
## [471] "virus.[154]" "breakers.[155]"
## [473] "\"40%" "anyway\".[156]"
## [475] "42.000" "participants.[157][158]"
## [477] "estonia," "kaljulaid,"
## [479] "coronavirus.[159]" "quarantine,"
## [481] "restrictions," "treatment,"
## [483] "itself." "concerns,"
## [485] "2020.[160][161]" "covid-19"
## [487] "varies." "symptoms,"
## [489] "cold." "weeks,"
## [491] "recover." "died,"
## [493] "weeks.[34]" "disease,"
## [495] "adults;" "years,"
## [497] "0.5%," "8%.[164][165]"
## [499] "covid-19" "viruses,"
## [501] "mers," "covid-19"
## [503] "lacking.[166][167]" "people,"
## [505] "covid-19" "pneumonia."
## [507] "affected," "covid-19"
## [509] "(ards)" "failure,"
## [511] "shock," "multi-organ"
## [513] "failure.[168][169]" "covid-19"
## [515] "sepsis," "clotting,"
## [517] "heart," "kidneys,"
## [519] "liver." "abnormalities,"
## [521] "time," "6%"
## [523] "covid-19," "4%"
## [525] "group.[170]" "cases.[171]"
## [527] "(nlr)" "illness.[172]"
## [529] "covid-19" "pre-existing"
## [531] "(underlying)" "conditions,"
## [533] "hypertension," "mellitus,"
## [535] "disease.[173]" "(iss)"
## [537] "88%" "comorbidity.[174]"
## [539] "10.4%" "review,"
## [541] "97.9%" "2.7"
## [543] "diseases.[175]" "report,"
## [545] "days," "hospitalised."
## [547] "however," "death.[175]"
## [549] "cases," "days,"
## [551] "days.[176]" "(nhc)"
## [553] "china," "covid-19"
## [555] "china[162]" "2.8%"
## [557] "1.7%.[177]" "post-"
## [559] "lungs." "pneumocytes."
## [561] "(ards).[34]" "11.8%"
## [563] "china," "arrest.[43]"
## [565] "china." "2020.[163]"
## [567] "mortality.[178]" "differences,[179]"
## [569] "difficulties." "under-counting"
## [571] "overestimated.[180]" "however,"
## [573] "underestimated.[181][182]" "antibodies,"
## [575] "problems." "hard."
## [577] "2020.[163]" "covid-19"
## [579] "noticing," "there."
## [581] "pressure," "attacks.[183]"
## [583] "long-term" "disease.[184]"
## [585] "likely," "coronaviruses,[185]"
## [587] "covid-19" "reported.[186][187]"
## [589] "reinfection," "relapse,"
## [591] "error." "long-term"
## [593] "disease." "20%"
## [595] "30%" "disease,"
## [597] "damage.[188]" "(%)"
## [599] "february[163]" "0.0"
## [601] "0.2" "0.2"
## [603] "0.2" "0.4"
## [605] "1.3" "3.6"
## [607] "8.0" "14.8"
## [609] "march[174]" "0.0"
## [611] "0.0" "0.0"
## [613] "0.3" "0.7"
## [615] "1.7" "5.7"
## [617] "16.9" "24.4"
## [619] "march[189]" "0.0"
## [621] "0.0" "0.0"
## [623] "0.0" "0.0"
## [625] "0.3" "3.7"
## [627] "9.3" "19.1"
## [629] "march[190]" "0.0"
## [631] "0.0" "0.0"
## [633] "0.1" "0.1"
## [635] "0.6" "1.7"
## [637] "7.0" "18.3"
## [639] "march[191]" "0.0"
## [641] "0.3" "0.2"
## [643] "0.2" "0.4"
## [645] "0.6" "2.1"
## [647] "5.7" "15.3"
## [649] "(%)" "march[192]"
## [651] "0.0" "0.1"
## [653] "0.2" "0.5"
## [655] "0.8" "1.4"
## [657] "2.6" "2.7"
## [659] "4.9" "4.3"
## [661] "10.5" "10.4"
## [663] "27.3" "note:"
## [665] "cases." "data."
## [667] "origin,[193][194]" "infection.[195]"
## [669] "human-to-" "transmission.[163][196]"
## [671] "wuhan," "china.[197]"
## [673] "mortality.[198]" "time,"
## [675] "testing," "quality,"
## [677] "options," "outbreak,"
## [679] "age," "sex,"
## [681] "health.[199]" "2019,"
## [683] "icd-10" "u07.1"
## [685] "lab-confirmed" "sars-cov-2"
## [687] "u07.2" "covid-19"
## [689] "lab-confirmed" "sars-cov-2"
## [691] "infection.[200]" "death-to-case"
## [693] "interval." "statistics,"
## [695] "death-to-case" "4.7%"
## [697] "(29,957" "/"
## [699] "634,835)" "march.[201]"
## [701] "region.[202]" "(cfr),"
## [703] "disease," "(ifr),"
## [705] "(diagnosed" "undiagnosed)"
## [707] "disease." "resolution."
## [709] "populations.[203]" "covid-19"
## [711] "covid-19" "people,"
## [713] "2020[204]" "people,"
## [715] "2020[205]" "covid-19"
## [717] "disease." "corona,"
## [719] "disease," "identified:"
## [721] "2019.[206]" "(i.e."
## [723] "china)," "species,"
## [725] "people," "stigmatisation.[207][208]"
## [727] "covid-19," "sars-cov-2.[209]"
## [729] "2019-ncov.[210]" "\"the"
## [731] "covid-19" "virus\""
## [733] "\"the" "covid-"
## [735] "19\"" "communications.[209]"
## [737] "corona," "latin.[211][212][213]"
## [739] "2020," "\"chinese"
## [741] "virus\"" "\"wuhan"
## [743] "virus\".[214][215][216][217]" "terms,"
## [745] "\"wuflu\"" "\"kung"
## [747] "flu\"," "(outside"
## [749] "professionals)" "covid-19."
## [751] "wuhan," "detected,"
## [753] "general," "arts,"
## [755] "fu." "(popularised"
## [757] "alt-right" "sources)"
## [759] "(when" "flu),"
## [761] "culture." "china's"
## [763] "world," "covid-19,"
## [765] "wuhan.[218][219]" "\"corona\""
## [767] "either.[220][221]" "sars-cov-2,"
## [769] "suggested.[62]" "hygiene,"
## [771] "immunity.[222]" "vaccine,"
## [773] "agencies." "sars-cov"
## [775] "sars-cov-2" "sars-cov"
## [777] "cells.[223]" "investigated."
## [779] "first," "vaccine."
## [781] "virus," "dead,"
## [783] "covid-19." "strategy,"
## [785] "vaccines," "virus."
## [787] "sars-cov-2," "s-spike"
## [789] "receptor." "(dna"
## [791] "vaccines," "vaccination)."
## [793] "efficacy.[224]" "2020,"
## [795] "seattle." "disease.[225]"
## [797] "covid-19" "trials.[143]"
## [799] "2020," "multi-country"
## [801] "\"solidarity\"" "covid-19"
## [803] "pandemic." "remdesivir,"
## [805] "hydroxychloroquine," "lopinavir/ritonavir"
## [807] "lopinavir/ritonavir" "trial.[226][227]"
## [809] "2020.[228]" "sars-cov-2"
## [811] "vitro.[229]" "u.s.,"
## [813] "china," "italy.[143][230][231]"
## [815] "chloroquine," "malaria,"
## [817] "2020," "results.[232]"
## [819] "however," "research.[233]"
## [821] "\"improves" "person's"
## [823] "stay\"" "mild,"
## [825] "pneumonia.[234]" "march,"
## [827] "covid-19.[235]" "chloroquine.[236][237]"
## [829] "however," "virology,"
## [831] "gram," "lethal."
## [833] "2020," "covid-19.[238][239]"
## [835] "interferon," "ribavirin,"
## [837] "covid-" "19.[237]"
## [839] "2020," "lopinavir/ritonavir"
## [841] "illness.[240]" "sars-cov-2.[229]"
## [843] "(tmprss2)" "sars-cov-2"
## [845] "receptor.[241][242]" "off-label"
## [847] "treatment.[241]" "2020,"
## [849] "covid-19" "disease.[243][244]"
## [851] "anti-cytokine" "storm,"
## [853] "life-threatening" "condition,"
## [855] "covid-19." "anti-cytokine"
## [857] "properties.[245]" "china's"
## [859] "completed.[246][247]" "disease.[235][248][249]"
## [861] "storms," "developments,"
## [863] "people.[250][251][252]" "interleukin-6"
## [865] "cause," "therapy,"
## [867] "2017.[253]" "\"a"
## [869] "activity\"" "il-6.[254]"
## [871] "covid-19" "immunisation.[255]"
## [873] "sars.[255]" "sars-cov-2."
## [875] "however," "antibody-dependent"
## [877] "and/or" "phagocytosis,"
## [879] "possible.[255]" "therapy,"
## [881] "example," "antibodies,"
## [883] "development.[255]" "'convalescent"
## [885] "serum'," "virus,"
## [887] "deployment.[256]" "diseases,"
## [889] "wenliang," "wuhan,"
## [891] "covid-" "virus."
## [893] "x," "te..."
It seems we have a few patterns involving punctuation that we want to remove from the text:
1) Citation marks: "\\[\\d+\\]"
2) Decimal numbers: "\\d+\\.\\d+"
3) The possessive ’s: "[']s[[:space:]]"
These patterns are to be replaced with a blank.
str_extract_all(covid_eng_lower, "\\[\\d+\\]|\\d+\\.\\d+|[']s[[:space:]]") # Check first the patterns are matched by our regex
## [[1]]
## [1] "[7]" "'s " "[8]" "[9]" "[10]" "[1]" "[2]" "[3]"
## [9] "[4]" "[11]" "[12]" "[8]" "[13]" "4.7" "0.2" "[14]"
## [17] "[15]" "[16]" "[17]" "[16]" "[18]" "[16]" "[17]" "[17]"
## [25] "[19]" "[10]" "[20]" "[4]" "[21]" "[22]" "[23]" "[24]"
## [33] "[25]" "[26]" "[27]" "[28]" "[29]" "[30]" "[5]" "[6]"
## [41] "[5]" "[6]" "4.7" "[31]" "[32]" "[9]" "[33]" "[34]"
## [49] "[4]" "[38]" "[39]" "87.9" "67.7" "38.1" "[40]" "33.4"
## [57] "[35]" "[12]" "[36]" "18.6" "[37]" "[41]" "[42]" "[43]"
## [65] "14.8" "13.9" "13.6" "[12]" "[36]" "[35]" "11.4" "[8]"
## [73] "[13]" "5.0" "4.8" "[44]" "3.7" "[37]" "0.9" "0.8"
## [81] "[45]" "[46]" "97.5" "11.5" "[47]" "[48]" "[49]" "[50]"
## [89] "'s " "[50]" "[51]" "[52]" "[53]" "3.33" "100.5" "100.5"
## [97] "101.5" "[54]" "[55]" "[56]" "2.35" "1.05" "[57]" "[58]"
## [105] "[55]" "[59]" "[55]" "[59]" "[60]" "[61]" "[62]" "[63]"
## [113] "[62]" "[55]" "[64]" "[66]" "[67]" "[21]" "[68]" "[69]"
## [121] "[70]" "[71]" "[8]" "[72]" "[73]" "[74]" "[65]" "[75]"
## [129] "[76]" "[22]" "[77]" "[77]" "[78]" "[79]" "[80]" "[81]"
## [137] "[82]" "[88]" "[89]" "[90]" "[88]" "[88]" "[83]" "[84]"
## [145] "[85]" "[91]" "1.80" "[92]" "[93]" "[84]" "[86]" "[87]"
## [153] "[84]" "[94]" "[95]" "[96]" "[97]" "[98]" "[99]" "[100]"
## [161] "[101]" "'s " "[102]" "[103]" "'s " "[88]" "[104]" "[106]"
## [169] "[107]" "[108]" "[26]" "[109]" "[110]" "[111]" "[112]" "[113]"
## [177] "[114]" "[115]" "[116]" "[117]" "[118]" "[119]" "[105]" "[120]"
## [185] "[121]" "[122]" "[123]" "[124]" "[125]" "[126]" "[127]" "[128]"
## [193] "[129]" "[130]" "[131]" "[132]" "[133]" "[134]" "[135]" "[134]"
## [201] "[134]" "[136]" "'s " "[137]" "[137]" "2.3" "1.4" "[109]"
## [209] "[138]" "[139]" "[140]" "[141]" "[142]" "[140]" "[143]" "[144]"
## [217] "[145]" "[146]" "[147]" "[143]" "[106]" "[148]" "[149]" "[150]"
## [225] "[151]" "[152]" "[153]" "[154]" "[155]" "[156]" "42.000" "[157]"
## [233] "[158]" "[159]" "[160]" "[161]" "[34]" "0.5" "[164]" "[165]"
## [241] "[166]" "[167]" "[168]" "[169]" "[170]" "[171]" "[172]" "[173]"
## [249] "[174]" "10.4" "97.9" "2.7" "[175]" "[175]" "[176]" "[162]"
## [257] "2.8" "1.7" "[177]" "[34]" "11.8" "[43]" "[163]" "[178]"
## [265] "[179]" "[180]" "[181]" "[182]" "[163]" "[183]" "[184]" "[185]"
## [273] "[186]" "[187]" "[188]" "[163]" "0.0" "0.2" "0.2" "0.2"
## [281] "0.4" "1.3" "3.6" "8.0" "14.8" "[174]" "0.0" "0.0"
## [289] "0.0" "0.3" "0.7" "1.7" "5.7" "16.9" "24.4" "[189]"
## [297] "0.0" "0.0" "0.0" "0.0" "0.0" "0.3" "3.7" "9.3"
## [305] "19.1" "[190]" "0.0" "0.0" "0.0" "0.1" "0.1" "0.6"
## [313] "1.7" "7.0" "18.3" "[191]" "0.0" "0.3" "0.2" "0.2"
## [321] "0.4" "0.6" "2.1" "5.7" "15.3" "[192]" "0.0" "0.1"
## [329] "0.2" "0.5" "0.8" "1.4" "2.6" "2.7" "4.9" "4.3"
## [337] "10.5" "10.4" "27.3" "[193]" "[194]" "[195]" "[163]" "[196]"
## [345] "[197]" "[198]" "[199]" "07.1" "07.2" "[200]" "4.7" "[201]"
## [353] "[202]" "[203]" "[204]" "[205]" "[206]" "[207]" "[208]" "[209]"
## [361] "[210]" "[209]" "[211]" "[212]" "[213]" "[214]" "[215]" "[216]"
## [369] "[217]" "'s " "[218]" "[219]" "[220]" "[221]" "[62]" "[222]"
## [377] "[223]" "[224]" "[225]" "[143]" "[226]" "[227]" "[228]" "[229]"
## [385] "[143]" "[230]" "[231]" "[232]" "[233]" "'s " "[234]" "[235]"
## [393] "[236]" "[237]" "[238]" "[239]" "[237]" "[240]" "[229]" "[241]"
## [401] "[242]" "[241]" "[243]" "[244]" "[245]" "'s " "[246]" "[247]"
## [409] "[235]" "[248]" "[249]" "[250]" "[251]" "[252]" "[253]" "[254]"
## [417] "[255]" "[255]" "[255]" "[255]" "[256]"
covid_nocite <- str_replace_all(covid_eng_lower, "\\[\\d+\\]|\\d+\\.\\d+|[']s[[:space:]]", " ")
unlist(str_extract_all(covid_nocite, "[^[:space:]]*[[:punct:]]{1,}[^[:space:]]*"))
## [1] "(covid-19)" "(covid-" "(sars-cov-2)."
## [4] "19)" "wuhan," "2019-ncov"
## [7] "province," "globally," "pandemic."
## [10] "fever," "cough," "breath."
## [13] "pain," "production," "diarrhea,"
## [16] "throat," "smell," "pain."
## [19] "symptoms," "multi-organ" "failure."
## [22] "\"coronavirus\"" "2020," "sars-cov-2"
## [25] "percent;" "problems." "comparison,"
## [28] "3%" "5%." "sneeze."
## [31] "airborne." "covid-19" "face."
## [34] "covid-19" "symptomatic," "/k"
## [37] "z," "appear." "d/"
## [40] "hours." "days," "days."
## [43] "fever," "cough," "(rrt-pcr)"
## [46] "swab." "pneumonia," "sepsis,"
## [49] "syndrome," "symptoms," "pneumonia."
## [52] "(may" "washing," "(maintaining"
## [55] "days)" "others," "symptoms),"
## [58] "elbow," "face." "(sars-cov-2)"
## [61] "caregivers." "travel," "rrt-pcr"
## [64] "testing," "vary," "use,"
## [67] "use," "use." "currently,"
## [70] "washing," "quarantine," "covid-19."
## [73] "symptoms," "care," "isolation,"
## [76] "measures." "859,032" "2020,"
## [79] "(who)" "42,322" "("
## [82] "%" "cases)" "(pheic)"
## [85] "2020." "regions." "anti-cytokine"
## [88] "%" "flu-like" "symptoms,"
## [91] "fever," "cough," "fatigue,"
## [94] "breath." "breathing," "pressure,"
## [97] "confusion," "waking," "lips;"
## [100] "present." "commonly," "symptoms,"
## [103] "sneezing," "nose," "seen."
## [106] "nausea," "vomiting," "percentages."
## [109] "palpitations." "(anosmia)" "disease,"
## [112] "reported." "some," "pneumonia,"
## [115] "multi-organ" "failure," "death."
## [118] "symptoms," "days." "infections,"
## [121] "symptoms." "period." "covid-19"
## [124] "days." "%" "infection."
## [127] "symptoms," "unknown." "disease."
## [130] "studied," "20%" "stay."
## [133] "(sars-cov-2)." "sneezes." "copper,"
## [136] "cardboard," "steel," "plastic."
## [139] "however," "100%" "(limit"
## [142] "aerosols," "plastic," "steel,"
## [145] "cardboard," "copper)." "copper,"
## [148] "cardboard," "steel," "plastic."
## [151] "(three" "hours)." "faeces,"
## [154] "researched." "areas." ","
## [157] "manageable." "newborn." "53%"
## [160] "infection." "days," "samples,"
## [163] "fecal-oral" "tract." "periods."
## [166] "sars-cov-2." "particle." "s,"
## [169] "protein;" "m," "protein;"
## [172] "crown," "e," "protein;"
## [175] "n," "protein;" "name."
## [178] "coronavirus." "structure." "covid-19"
## [181] "ace2," "lungs." "\"spike\""
## [184] "(peplomer)" "cell." "protective,"
## [187] "tested." "progresses," "follow."
## [190] "gastric," "intestine." "disease."
## [193] "real-time" "(rrt-pcr)." "swab,"
## [196] "used." "days." "used,"
## [199] "value." "(pcr)" "rrt-pcr"
## [202] "covid-" "virus." "2020,"
## [205] "ongoing." "point-of-care" "month."
## [208] "risk." "people:" "fever,"
## [211] "pneumonia," "count," "count."
## [214] "x-rays" "stages," "occur."
## [217] "ground-" "peripheral," "distribution."
## [220] "dominance," "evolves." "2020,"
## [223] "\"ct" "first-line" "covid-"
## [226] "19\"." "covid-19." "are:"
## [229] "macroscopy:" "pleurisy," "pericarditis,"
## [232] "observed:" "pneumonia:" "exudation,"
## [235] "pneumonia:" "oedema," "hyperplasia,"
## [238] "pneumocytes," "pneumonia:" "(dad)"
## [241] "exudates." "(ards)" "disease."
## [244] "pneumonia:" "cavities," "liver:"
## [247] "home," "places," "seconds,"
## [250] "eyes," "nose," "hands."
## [253] "available." "sneeze." "time,"
## [256] "curve;" "workplaces," "patients."
## [259] "travel," "gatherings." "(about"
## [262] "meters)." "sars-cov-2" "earliest,"
## [265] "covid-19" "peak," "\"flattening"
## [268] "curve\"," "infections." "overwhelmed,"
## [271] "cases," "available." "who,"
## [274] "infection." "masks," "china,"
## [277] "kong," "thailand," "republic,"
## [280] "austria." "masks," "40%."
## [283] "problem," "sixfold," "tripled,"
## [286] "doubled." "non-medical" "noses,"
## [289] "non-medical" "person." "covid-19"
## [292] "care," "provider," "person,"
## [295] "tissue," "water," "items."
## [298] "seconds," "dirty," "nose,"
## [301] "coughing," "sneezing." "alcohol-based"
## [304] "60%" "alcohol," "available."
## [307] "available," "production." "formulations,"
## [310] "isopropanol." "alcohol;" "\"not"
## [313] "antisepsis\"." "humectant." "multiplicative,"
## [316] "spread." "line," "tracks."
## [319] "care," "fluid," "support,"
## [322] "organs." "mask." "(ecmo)"
## [325] "failure," "consideration." "covid-19."
## [328] "u.s." "resource," "ibcc."
## [331] "(acetaminophen)" "first-line" "use."
## [334] "non-steroidal" "anti-inflammatory" "(nsaids)"
## [337] "symptoms," "covid-19" "symptoms."
## [340] "blockers," "2020," "medications."
## [343] "syndrome." "transmission," "aerosols,"
## [346] "ventilation." "(ppe)" "pandemic."
## [349] "includes:" "available," "(instead"
## [352] "facemasks)" "preferred." "(eua)."
## [355] "off-label" "uses." "shields,"
## [358] "masks." "covid-19" "(artificial"
## [361] "breathing)," "do." "vectors."
## [364] "(those" "years)." "capita,"
## [367] "covid-19" "hospitalization." "(to"
## [370] "lower)." "5%" "units,"
## [373] "%" "ventilation," "%"
## [376] "died." "20-30%" "support."
## [379] "equipment." "covid-19" "difficult."
## [382] "ventilator-associated" "pneumothorax." "ventilators."
## [385] "high-flow" "<93%." "4ml/kg"
## [388] "(high" "(35" "minute)"
## [391] "required)" "end-expiratory" "1/2"
## [394] "authorities." "2020," "trials."
## [397] "develop," "uses," "testing."
## [400] "disease." "treatments." "2020,"
## [403] "outbreak." "number." "'close"
## [406] "contact'" "infection." "users."
## [409] "detected," "self-quarantine," "officials."
## [412] "data," "technology," "korea,"
## [415] "taiwan," "singapore." "2020,"
## [418] "coronavirus." "citizens." "2020,"
## [421] "agency," "institute," "virus."
## [424] "breakers." "\"40%" "anyway\"."
## [427] "participants." "estonia," "kaljulaid,"
## [430] "coronavirus." "quarantine," "restrictions,"
## [433] "treatment," "itself." "concerns,"
## [436] "2020." "covid-19" "varies."
## [439] "symptoms," "cold." "weeks,"
## [442] "recover." "died," "weeks."
## [445] "disease," "adults;" "years,"
## [448] "%," "8%." "covid-19"
## [451] "viruses," "mers," "covid-19"
## [454] "lacking." "people," "covid-19"
## [457] "pneumonia." "affected," "covid-19"
## [460] "(ards)" "failure," "shock,"
## [463] "multi-organ" "failure." "covid-19"
## [466] "sepsis," "clotting," "heart,"
## [469] "kidneys," "liver." "abnormalities,"
## [472] "time," "6%" "covid-19,"
## [475] "4%" "group." "cases."
## [478] "(nlr)" "illness." "covid-19"
## [481] "pre-existing" "(underlying)" "conditions,"
## [484] "hypertension," "mellitus," "disease."
## [487] "(iss)" "88%" "comorbidity."
## [490] "%" "review," "%"
## [493] "diseases." "report," "days,"
## [496] "hospitalised." "however," "death."
## [499] "cases," "days," "days."
## [502] "(nhc)" "china," "covid-19"
## [505] "%" "%." "post-"
## [508] "lungs." "pneumocytes." "(ards)."
## [511] "%" "china," "arrest."
## [514] "china." "2020." "mortality."
## [517] "differences," "difficulties." "under-counting"
## [520] "overestimated." "however," "underestimated."
## [523] "antibodies," "problems." "hard."
## [526] "2020." "covid-19" "noticing,"
## [529] "there." "pressure," "attacks."
## [532] "long-term" "disease." "likely,"
## [535] "coronaviruses," "covid-19" "reported."
## [538] "reinfection," "relapse," "error."
## [541] "long-term" "disease." "20%"
## [544] "30%" "disease," "damage."
## [547] "(%)" "(%)" "note:"
## [550] "cases." "data." "origin,"
## [553] "infection." "human-to-" "transmission."
## [556] "wuhan," "china." "mortality."
## [559] "time," "testing," "quality,"
## [562] "options," "outbreak," "age,"
## [565] "sex," "health." "2019,"
## [568] "icd-10" "lab-confirmed" "sars-cov-2"
## [571] "covid-19" "lab-confirmed" "sars-cov-2"
## [574] "infection." "death-to-case" "interval."
## [577] "statistics," "death-to-case" "%"
## [580] "(29,957" "/" "634,835)"
## [583] "march." "region." "(cfr),"
## [586] "disease," "(ifr)," "(diagnosed"
## [589] "undiagnosed)" "disease." "resolution."
## [592] "populations." "covid-19" "covid-19"
## [595] "people," "people," "covid-19"
## [598] "disease." "corona," "disease,"
## [601] "identified:" "2019." "(i.e."
## [604] "china)," "species," "people,"
## [607] "stigmatisation." "covid-19," "sars-cov-2."
## [610] "2019-ncov." "\"the" "covid-19"
## [613] "virus\"" "\"the" "covid-"
## [616] "19\"" "communications." "corona,"
## [619] "latin." "2020," "\"chinese"
## [622] "virus\"" "\"wuhan" "virus\"."
## [625] "terms," "\"wuflu\"" "\"kung"
## [628] "flu\"," "(outside" "professionals)"
## [631] "covid-19." "wuhan," "detected,"
## [634] "general," "arts," "fu."
## [637] "(popularised" "alt-right" "sources)"
## [640] "(when" "flu)," "culture."
## [643] "world," "covid-19," "wuhan."
## [646] "\"corona\"" "either." "sars-cov-2,"
## [649] "suggested." "hygiene," "immunity."
## [652] "vaccine," "agencies." "sars-cov"
## [655] "sars-cov-2" "sars-cov" "cells."
## [658] "investigated." "first," "vaccine."
## [661] "virus," "dead," "covid-19."
## [664] "strategy," "vaccines," "virus."
## [667] "sars-cov-2," "s-spike" "receptor."
## [670] "(dna" "vaccines," "vaccination)."
## [673] "efficacy." "2020," "seattle."
## [676] "disease." "covid-19" "trials."
## [679] "2020," "multi-country" "\"solidarity\""
## [682] "covid-19" "pandemic." "remdesivir,"
## [685] "hydroxychloroquine," "lopinavir/ritonavir" "lopinavir/ritonavir"
## [688] "trial." "2020." "sars-cov-2"
## [691] "vitro." "u.s.," "china,"
## [694] "italy." "chloroquine," "malaria,"
## [697] "2020," "results." "however,"
## [700] "research." "\"improves" "stay\""
## [703] "mild," "pneumonia." "march,"
## [706] "covid-19." "chloroquine." "however,"
## [709] "virology," "gram," "lethal."
## [712] "2020," "covid-19." "interferon,"
## [715] "ribavirin," "covid-" "19."
## [718] "2020," "lopinavir/ritonavir" "illness."
## [721] "sars-cov-2." "(tmprss2)" "sars-cov-2"
## [724] "receptor." "off-label" "treatment."
## [727] "2020," "covid-19" "disease."
## [730] "anti-cytokine" "storm," "life-threatening"
## [733] "condition," "covid-19." "anti-cytokine"
## [736] "properties." "completed." "disease."
## [739] "storms," "developments," "people."
## [742] "interleukin-6" "cause," "therapy,"
## [745] "2017." "\"a" "activity\""
## [748] "il-6." "covid-19" "immunisation."
## [751] "sars." "sars-cov-2." "however,"
## [754] "antibody-dependent" "and/or" "phagocytosis,"
## [757] "possible." "therapy," "example,"
## [760] "antibodies," "development." "'convalescent"
## [763] "serum'," "virus," "deployment."
## [766] "diseases," "wenliang," "wuhan,"
## [769] "covid-" "virus." "x,"
## [772] "te..."
However, what about punctuation marks that form part of a word? For example, the hyphen in “covid-19” or “sars-cov-2”, or the percent sign in 20%. We may not want to remove the hyphen and the percent sign from the string, so for convenience’s sake we remove every punctuation mark except the hyphen and the percent sign.
# How can we form a regex that matches every punctuation character except the hyphen and the percent sign?
str_extract_all(covid_nocite, "[^[:alnum:][:space:]-%]") # A negation of any letter/number/whitespace/hyphen/percent character
## [[1]]
## [1] "(" ")" "(" "(" ")" "." ")" "," "," "," "." "," "," "." ","
## [16] "," "," "," "," "." "," "." "\"" "\"" "," ";" "." "," "." "."
## [31] "." "." "," "/" "," "." "/" "." "," "." "," "," "(" ")" "."
## [46] "," "," "," "," "." "(" "," "(" ")" "," ")" "," "," "." "("
## [61] ")" "." "," "," "," "," "," "." "," "," "," "." "," "," ","
## [76] "." "," "," "(" ")" "," "(" ")" "(" ")" "." "." "," "," ","
## [91] "," "." "," "," "," "," ";" "." "," "," "," "," "." "," ","
## [106] "." "." "(" ")" "," "." "," "," "," "." "," "." "," "." "."
## [121] "." "." "," "." "." "," "." "(" ")" "." "." "," "," "," "."
## [136] "," "(" "," "," "," "," ")" "." "," "," "," "." "(" ")" "."
## [151] "," "." "." "," "." "." "." "," "," "." "." "." "." "," ";"
## [166] "," ";" "," "," ";" "," ";" "." "." "." "," "." "\"" "\"" "("
## [181] ")" "." "," "." "," "." "," "." "." "(" ")" "." "," "." "."
## [196] "," "." "(" ")" "." "," "." "." "." ":" "," "," "," "." ","
## [211] "." "," "." "," "." "," "\"" "\"" "." "." ":" ":" "," "," ":"
## [226] ":" "," ":" "," "," "," ":" "(" ")" "." "(" ")" "." ":" ","
## [241] ":" "," "," "," "," "," "." "." "." "," ";" "," "." "," "."
## [256] "(" ")" "." "," "," "\"" "\"" "," "." "," "," "." "," "." ","
## [271] "," "," "," "," "." "," "." "," "," "," "." "," "." "," ","
## [286] "," "," "," "." "," "," "," "," "." "," "." "," "." "," "."
## [301] ";" "\"" "\"" "." "." "," "." "," "." "," "," "," "." "." "("
## [316] ")" "," "." "." "." "." "," "." "(" ")" "." "(" ")" "," "."
## [331] "," "," "." "." "," "," "." "(" ")" "." ":" "," "(" ")" "."
## [346] "(" ")" "." "." "," "." "(" ")" "," "." "." "(" ")" "." ","
## [361] "." "(" ")" "." "," "," "." "." "." "." "." "." "<" "." "/"
## [376] "(" "(" ")" ")" "/" "." "," "." "," "," "." "." "." "," "."
## [391] "." "'" "'" "." "." "," "," "." "," "," "," "," "." "," "."
## [406] "." "," "," "," "." "." "\"" "\"" "." "." "," "," "." "," ","
## [421] "," "." "," "." "." "," "." "," "." "," "." "," ";" "," ","
## [436] "." "," "," "." "," "." "," "(" ")" "," "," "." "," "," ","
## [451] "," "." "," "," "," "." "." "(" ")" "." "(" ")" "," "," ","
## [466] "." "(" ")" "." "," "." "," "," "." "," "." "," "," "." "("
## [481] ")" "," "." "." "." "(" ")" "." "," "." "." "." "." "," "."
## [496] "." "," "." "," "." "." "." "," "." "," "." "." "," "," "."
## [511] "," "," "." "." "," "." "(" ")" "+" "(" ")" ":" "." "." ","
## [526] "." "." "," "." "." "," "," "," "," "," "," "," "." "," "."
## [541] "." "," "(" "," "/" "," ")" "." "." "(" ")" "," "," "(" ")"
## [556] "," "(" ")" "." "." "." "," "," "." "," "," ":" "." "(" "."
## [571] "." ")" "," "," "," "." "," "." "." "\"" "\"" "\"" "\"" "." ","
## [586] "." "," "\"" "\"" "\"" "\"" "." "," "\"" "\"" "\"" "\"" "," "(" ")"
## [601] "." "," "," "," "," "." "(" ")" "(" ")" "," "." "," "," "."
## [616] "\"" "\"" "." "," "." "," "." "," "." "." "." "," "." "," ","
## [631] "." "," "," "." "," "." "(" "," ")" "." "." "," "." "." "."
## [646] "," "\"" "\"" "." "," "," "/" "/" "." "." "." "." "." "," ","
## [661] "." "," "," "," "." "," "." "\"" "\"" "," "." "," "." "." ","
## [676] "," "," "." "," "." "," "," "." "," "/" "." "." "(" ")" "."
## [691] "." "," "." "," "," "." "." "." "." "," "," "." "," "," "."
## [706] "\"" "\"" "." "." "." "." "," "/" "," "." "," "," "," "." "'"
## [721] "'" "," "," "." "," "," "," "." "," "." "." "."
covid_nopunct <- str_replace_all(covid_eng_lower, "[^[:alnum:][:space:]-%]", " ") # Replace the pattern with a single whitespace character
unlist(str_extract_all(covid_nopunct, "[^[:space:]]*[[:punct:]]{1,}[^[:space:]]*"))
## [1] "covid-19" "covid-" "sars-cov-2"
## [4] "2019-ncov" "multi-organ" "sars-cov-2"
## [7] "3%" "5%" "covid-19"
## [10] "covid-19" "rrt-pcr" "sars-cov-2"
## [13] "rrt-pcr" "covid-19" "7%"
## [16] "anti-cytokine" "%" "flu-like"
## [19] "multi-organ" "covid-19" "5%"
## [22] "20%" "sars-cov-2" "100%"
## [25] "53%" "fecal-oral" "sars-cov-2"
## [28] "covid-19" "real-time" "rrt-pcr"
## [31] "rrt-pcr" "covid-" "point-of-care"
## [34] "x-rays" "ground-" "first-line"
## [37] "covid-" "covid-19" "sars-cov-2"
## [40] "covid-19" "40%" "non-medical"
## [43] "non-medical" "covid-19" "alcohol-based"
## [46] "60%" "covid-19" "first-line"
## [49] "non-steroidal" "anti-inflammatory" "covid-19"
## [52] "off-label" "covid-19" "covid-19"
## [55] "5%" "3%" "4%"
## [58] "20-30%" "covid-19" "ventilator-associated"
## [61] "high-flow" "93%" "end-expiratory"
## [64] "self-quarantine" "40%" "covid-19"
## [67] "5%" "8%" "covid-19"
## [70] "covid-19" "covid-19" "covid-19"
## [73] "multi-organ" "covid-19" "6%"
## [76] "covid-19" "4%" "covid-19"
## [79] "pre-existing" "88%" "4%"
## [82] "9%" "covid-19" "8%"
## [85] "7%" "post-" "8%"
## [88] "under-counting" "covid-19" "long-term"
## [91] "covid-19" "long-term" "20%"
## [94] "30%" "%" "%"
## [97] "human-to-" "icd-10" "lab-confirmed"
## [100] "sars-cov-2" "covid-19" "lab-confirmed"
## [103] "sars-cov-2" "death-to-case" "death-to-case"
## [106] "7%" "covid-19" "covid-19"
## [109] "covid-19" "covid-19" "sars-cov-2"
## [112] "2019-ncov" "covid-19" "covid-"
## [115] "covid-19" "alt-right" "covid-19"
## [118] "sars-cov-2" "sars-cov" "sars-cov-2"
## [121] "sars-cov" "covid-19" "sars-cov-2"
## [124] "s-spike" "covid-19" "multi-country"
## [127] "covid-19" "sars-cov-2" "covid-19"
## [130] "covid-19" "covid-" "sars-cov-2"
## [133] "sars-cov-2" "off-label" "covid-19"
## [136] "anti-cytokine" "life-threatening" "covid-19"
## [139] "anti-cytokine" "interleukin-6" "il-6"
## [142] "covid-19" "sars-cov-2" "antibody-dependent"
## [145] "covid-"
Now we have removed all punctuation marks except the hyphen and the percent sign from the string.
str_trunc(covid_nopunct, 2000, "right") # Check the first 2000 characters in covid_nopunct
## [1] "coronavirus disease 2019 coronavirus disease 2019 covid-19 is an infectious disease caused by severe acute respiratory syndrome coronavirus disease 2019 covid- coronavirus 2 sars-cov-2 7 the disease was first 19 identified in december 2019 in wuhan the capital of other names 2019-ncov acute china s hubei province and has since spread globally resulting in the ongoing 2019 20 coronavirus respiratory disease pandemic 8 9 common symptoms include fever cough novel coronavirus and shortness of breath 10 other symptoms may include pneumonia 1 muscle pain sputum production diarrhea sore throat wuhan pneumonia 2 3 loss of smell and abdominal pain 4 11 12 while the wuhan coronavirus majority of cases result in mild symptoms some progress to viral pneumonia and multi-organ failure 8 13 as of coronavirus or other 28 march 2020 the overall rate of deaths per number of names for sars-cov-2 diagnosed cases is 4 7 percent ranging from 0 2 percent to 15 percent according to age group and other health problems 14 in comparison the mortality rate of the 1918 flu pandemic was approximately 3% to 5% 15 the virus is spread mainly through close contact and via respiratory droplets produced when people cough or sneeze 16 17 respiratory droplets may be produced during breathing but the virus is not generally airborne 16 18 people may also contract covid-19 by touching a contaminated surface and then their face 16 17 it is most contagious when people are symptoms of covid-19 symptomatic although spread may be possible before pronunciation k ro n va r s d zi z symptoms appear 17 the virus can survive on surfaces ko v d up to 72 hours 19 time from exposure to onset of symptoms is generally between two and fourteen days specialty infectious diseases with an average of five days 10 20 the standard method symptoms fever cough shortness of of diagnosis is by reverse transcription polymerase chain breath 4 reaction rrt-pcr fr..."
# But it seems we may want to remove any number that is not attached to a word and does not denote a year
str_extract_all(covid_nopunct, "[[:space:]]+[[:digit:]]{1,3}[[:space:]]+") # A number of one to three digits surrounded by whitespace
## [[1]]
## [1] " 2 " " 7 " " 19 " " 20 " " 8 " " 10 "
## [7] " 1 " " 2 " " 4 " " 12 " " 8 " " 28 "
## [13] " 4 " " 0 " " 15 " " 14 " " 15 " " 16 "
## [19] " 16 " " 16 " " 17 " " 72 " " 19 " " 10 "
## [25] " 4 " " 21 " " 22 " " 5 " " 2 " " 14 "
## [31] " 2 " " 24 " " 26 " " 27 " " 29 " " 30 "
## [37] " 859 " " 5 " " 30 " " 20 " " 42 " " 5 "
## [43] " 4 " " 31 " " 11 " " 9 " " 33 " " 34 "
## [49] " 4 " " 39 " " 87 " " 67 " " 38 " " 40 "
## [55] " 33 " " 15 " " 30 " " 36 " " 18 " " 37 "
## [61] " 42 " " 43 " " 14 " " 13 " " 13 " " 12 "
## [67] " 35 " " 11 " " 8 " " 5 " " 4 " " 44 "
## [73] " 3 " " 31 " " 0 " " 0 " " 14 " " 45 "
## [79] " 97 " " 11 " " 47 " " 48 " " 49 " " 50 "
## [85] " 2 " " 52 " " 53 " " 24 " " 72 " " 72 "
## [91] " 3 " " 100 " " 100 " " 101 " " 18 " " 55 "
## [97] " 90 " " 100 " " 54 " " 55 " " 2 " " 1 "
## [103] " 57 " " 58 " " 55 " " 59 " " 55 " " 59 "
## [109] " 60 " " 61 " " 63 " " 62 " " 55 " " 64 "
## [115] " 66 " " 67 " " 21 " " 69 " " 71 " " 8 "
## [121] " 73 " " 19 " " 74 " " 65 " " 75 " " 21 "
## [127] " 76 " " 22 " " 77 " " 77 " " 78 " " 19 "
## [133] " 80 " " 82 " " 20 " " 88 " " 90 " " 88 "
## [139] " 88 " " 83 " " 85 " " 91 " " 1 " " 92 "
## [145] " 93 " " 84 " " 86 " " 84 " " 94 " " 95 "
## [151] " 96 " " 97 " " 98 " " 99 " " 100 " " 101 "
## [157] " 102 " " 20 " " 88 " " 104 " " 106 " " 108 "
## [163] " 26 " " 109 " " 111 " " 113 " " 115 " " 117 "
## [169] " 118 " " 119 " " 19 " " 105 " " 120 " " 122 "
## [175] " 123 " " 125 " " 126 " " 128 " " 130 " " 131 "
## [181] " 132 " " 133 " " 134 " " 134 " " 60 " " 134 "
## [187] " 80 " " 136 " " 137 " " 137 " " 2 " " 1 "
## [193] " 109 " " 138 " " 139 " " 140 " " 141 " " 142 "
## [199] " 140 " " 30 " " 35 " " 5 " " 1 " " 143 "
## [205] " 144 " " 145 " " 147 " " 143 " " 106 " " 148 "
## [211] " 149 " " 150 " " 151 " " 153 " " 154 " " 155 "
## [217] " 156 " " 48 " " 42 " " 157 " " 159 " " 27 "
## [223] " 160 " " 34 " " 50 " " 0 " " 70 " " 164 "
## [229] " 166 " " 168 " " 170 " " 171 " " 172 " " 173 "
## [235] " 174 " " 10 " " 97 " " 2 " " 175 " " 175 "
## [241] " 14 " " 41 " " 176 " " 162 " " 2 " " 1 "
## [247] " 177 " " 34 " " 11 " " 43 " " 11 " " 163 "
## [253] " 178 " " 179 " " 180 " " 181 " " 11 " " 163 "
## [259] " 183 " " 184 " " 185 " " 186 " " 188 " " 0 "
## [265] " 10 " " 20 " " 30 " " 40 " " 50 " " 60 "
## [271] " 70 " " 80 " " 11 " " 163 " " 0 " " 2 "
## [277] " 2 " " 2 " " 4 " " 3 " " 6 " " 0 "
## [283] " 8 " " 26 " " 174 " " 0 " " 0 " " 0 "
## [289] " 3 " " 7 " " 7 " " 7 " " 9 " " 4 "
## [295] " 27 " " 189 " " 0 " " 0 " " 0 " " 0 "
## [301] " 0 " " 3 " " 7 " " 3 " " 1 " " 30 "
## [307] " 190 " " 0 " " 0 " " 0 " " 1 " " 1 "
## [313] " 6 " " 7 " " 0 " " 3 " " 26 " " 191 "
## [319] " 0 " " 3 " " 2 " " 2 " " 4 " " 6 "
## [325] " 1 " " 7 " " 3 " " 0 " " 20 " " 45 "
## [331] " 55 " " 65 " " 75 " " 85 " " 16 " " 192 "
## [337] " 0 " " 1 " " 2 " " 5 " " 8 " " 4 "
## [343] " 6 " " 7 " " 9 " " 3 " " 5 " " 4 "
## [349] " 3 " " 193 " " 195 " " 163 " " 17 " " 197 "
## [355] " 198 " " 199 " " 1 " " 2 " " 200 " " 4 "
## [361] " 29 " " 634 " " 29 " " 201 " " 202 " " 203 "
## [367] " 20 " " 204 " " 24 " " 205 " " 19 " " 31 "
## [373] " 206 " " 207 " " 2 " " 209 " " 210 " " 19 "
## [379] " 209 " " 211 " " 213 " " 214 " " 216 " " 218 "
## [385] " 220 " " 62 " " 222 " " 223 " " 224 " " 16 "
## [391] " 225 " " 143 " " 10 " " 226 " " 228 " " 229 "
## [397] " 3 " " 143 " " 231 " " 232 " " 233 " " 234 "
## [403] " 17 " " 235 " " 236 " " 28 " " 238 " " 19 "
## [409] " 240 " " 229 " " 2 " " 241 " " 241 " " 243 "
## [415] " 245 " " 246 " " 2 " " 235 " " 249 " " 250 "
## [421] " 252 " " 253 " " 254 " " 255 " " 255 " " 255 "
## [427] " 255 " " 256 " " 19 "
covid_nonum <- str_replace_all(covid_nopunct, "[[:space:]][[:digit:]]{1,3}[[:space:]]", " ")
str_trunc(covid_nonum, 2000) # some numbers remain because adjacent numbers shared a whitespace character that the first match consumed
## [1] "coronavirus disease 2019 coronavirus disease 2019 covid-19 is an infectious disease caused by severe acute respiratory syndrome coronavirus disease 2019 covid- coronavirus sars-cov-2 the disease was first identified in december 2019 in wuhan the capital of other names 2019-ncov acute china s hubei province and has since spread globally resulting in the ongoing 2019 coronavirus respiratory disease pandemic common symptoms include fever cough novel coronavirus and shortness of breath other symptoms may include pneumonia muscle pain sputum production diarrhea sore throat wuhan pneumonia loss of smell and abdominal pain while the wuhan coronavirus majority of cases result in mild symptoms some progress to viral pneumonia and multi-organ failure as of coronavirus or other march 2020 the overall rate of deaths per number of names for sars-cov-2 diagnosed cases is 7 percent ranging from 2 percent to percent according to age group and other health problems in comparison the mortality rate of the 1918 flu pandemic was approximately 3% to 5% the virus is spread mainly through close contact and via respiratory droplets produced when people cough or sneeze respiratory droplets may be produced during breathing but the virus is not generally airborne people may also contract covid-19 by touching a contaminated surface and then their face it is most contagious when people are symptoms of covid-19 symptomatic although spread may be possible before pronunciation k ro n va r s d zi z symptoms appear the virus can survive on surfaces ko v d up to hours time from exposure to onset of symptoms is generally between two and fourteen days specialty infectious diseases with an average of five days the standard method symptoms fever cough shortness of of diagnosis is by reverse transcription polymerase chain breath reaction rrt-pcr from a nasopharyngeal swab the complications pneumonia viral sepsis acute infection c..."
# Run the pattern matching a second time to catch the numbers that were skipped
covid_nonum <- str_replace_all(covid_nonum, "[[:space:]][[:digit:]]{1,3}[[:space:]]", " ")
str_trunc(covid_nonum, 2000)
## [1] "coronavirus disease 2019 coronavirus disease 2019 covid-19 is an infectious disease caused by severe acute respiratory syndrome coronavirus disease 2019 covid- coronavirus sars-cov-2 the disease was first identified in december 2019 in wuhan the capital of other names 2019-ncov acute china s hubei province and has since spread globally resulting in the ongoing 2019 coronavirus respiratory disease pandemic common symptoms include fever cough novel coronavirus and shortness of breath other symptoms may include pneumonia muscle pain sputum production diarrhea sore throat wuhan pneumonia loss of smell and abdominal pain while the wuhan coronavirus majority of cases result in mild symptoms some progress to viral pneumonia and multi-organ failure as of coronavirus or other march 2020 the overall rate of deaths per number of names for sars-cov-2 diagnosed cases is percent ranging from percent to percent according to age group and other health problems in comparison the mortality rate of the 1918 flu pandemic was approximately 3% to 5% the virus is spread mainly through close contact and via respiratory droplets produced when people cough or sneeze respiratory droplets may be produced during breathing but the virus is not generally airborne people may also contract covid-19 by touching a contaminated surface and then their face it is most contagious when people are symptoms of covid-19 symptomatic although spread may be possible before pronunciation k ro n va r s d zi z symptoms appear the virus can survive on surfaces ko v d up to hours time from exposure to onset of symptoms is generally between two and fourteen days specialty infectious diseases with an average of five days the standard method symptoms fever cough shortness of of diagnosis is by reverse transcription polymerase chain breath reaction rrt-pcr from a nasopharyngeal swab the complications pneumonia viral sepsis acute infection can a..."
Now our string is preprocessed: non-English characters, punctuation marks, and numbers have been removed. But the preprocessing has left runs of multiple whitespaces that we still need to collapse.
covid_nospace <- str_squish(covid_nonum) # We can repeat the whitespace deletion process
str_trunc(covid_nospace, 2000)
## [1] "coronavirus disease 2019 coronavirus disease 2019 covid-19 is an infectious disease caused by severe acute respiratory syndrome coronavirus disease 2019 covid- coronavirus sars-cov-2 the disease was first identified in december 2019 in wuhan the capital of other names 2019-ncov acute china s hubei province and has since spread globally resulting in the ongoing 2019 coronavirus respiratory disease pandemic common symptoms include fever cough novel coronavirus and shortness of breath other symptoms may include pneumonia muscle pain sputum production diarrhea sore throat wuhan pneumonia loss of smell and abdominal pain while the wuhan coronavirus majority of cases result in mild symptoms some progress to viral pneumonia and multi-organ failure as of coronavirus or other march 2020 the overall rate of deaths per number of names for sars-cov-2 diagnosed cases is percent ranging from percent to percent according to age group and other health problems in comparison the mortality rate of the 1918 flu pandemic was approximately 3% to 5% the virus is spread mainly through close contact and via respiratory droplets produced when people cough or sneeze respiratory droplets may be produced during breathing but the virus is not generally airborne people may also contract covid-19 by touching a contaminated surface and then their face it is most contagious when people are symptoms of covid-19 symptomatic although spread may be possible before pronunciation k ro n va r s d zi z symptoms appear the virus can survive on surfaces ko v d up to hours time from exposure to onset of symptoms is generally between two and fourteen days specialty infectious diseases with an average of five days the standard method symptoms fever cough shortness of of diagnosis is by reverse transcription polymerase chain breath reaction rrt-pcr from a nasopharyngeal swab the complications pneumonia viral sepsis acute infection can also be diagnosed from a combination of respiratory distress syndrome sympt..."
Finally, we are ready to tokenize the string object, covid_nospace, into words separated by " ".
covid_tidy_word <- unlist(str_split(covid_nospace, " "))
covid_tidy_word[1:50]
## [1] "coronavirus" "disease" "2019" "coronavirus" "disease"
## [6] "2019" "covid-19" "is" "an" "infectious"
## [11] "disease" "caused" "by" "severe" "acute"
## [16] "respiratory" "syndrome" "coronavirus" "disease" "2019"
## [21] "covid-" "coronavirus" "sars-cov-2" "the" "disease"
## [26] "was" "first" "identified" "in" "december"
## [31] "2019" "in" "wuhan" "the" "capital"
## [36] "of" "other" "names" "2019-ncov" "acute"
## [41] "china" "s" "hubei" "province" "and"
## [46] "has" "since" "spread" "globally" "resulting"
covid_tidy_word_freq <- sort(table(covid_tidy_word), decreasing = TRUE) # Create a table of word counts
covid_tidy_word_freq[1:50]
## covid_tidy_word
## the of and to in a
## 304 230 150 125 121 104
## is for with are or as
## 74 69 57 53 49 44
## disease that by covid-19 from virus
## 43 43 42 41 41 39
## people who may be on cases
## 34 34 33 31 31 30
## have symptoms 2020 not respiratory march
## 30 30 28 27 26 24
## also coronavirus health at china infection
## 23 23 23 21 21 21
## was been severe use those time
## 21 20 20 20 19 18
## it treatment but other rate sars-cov-2
## 17 17 16 16 16 16
## some these
## 16 16
One more step in tokenization is to remove stop words. A stop word lexicon is available in the tidytext package.
library(tidytext)
tidytext::stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # … with 1,139 more rows
stop_words %>% dplyr::count(lexicon)
## # A tibble: 3 x 2
## lexicon n
## <chr> <int>
## 1 onix 404
## 2 SMART 571
## 3 snowball 174
smart <- stop_words[stop_words$lexicon=="SMART",] # The stop_words dataset bundles several stop word lexicons in a data frame; here we subset the SMART lexicon
smart$word
## [1] "a" "a's" "able" "about"
## [5] "above" "according" "accordingly" "across"
## [9] "actually" "after" "afterwards" "again"
## [13] "against" "ain't" "all" "allow"
## [17] "allows" "almost" "alone" "along"
## [21] "already" "also" "although" "always"
## [25] "am" "among" "amongst" "an"
## [29] "and" "another" "any" "anybody"
## [33] "anyhow" "anyone" "anything" "anyway"
## [37] "anyways" "anywhere" "apart" "appear"
## [41] "appreciate" "appropriate" "are" "aren't"
## [45] "around" "as" "aside" "ask"
## [49] "asking" "associated" "at" "available"
## [53] "away" "awfully" "b" "be"
## [57] "became" "because" "become" "becomes"
## [61] "becoming" "been" "before" "beforehand"
## [65] "behind" "being" "believe" "below"
## [69] "beside" "besides" "best" "better"
## [73] "between" "beyond" "both" "brief"
## [77] "but" "by" "c" "c'mon"
## [81] "c's" "came" "can" "can't"
## [85] "cannot" "cant" "cause" "causes"
## [89] "certain" "certainly" "changes" "clearly"
## [93] "co" "com" "come" "comes"
## [97] "concerning" "consequently" "consider" "considering"
## [101] "contain" "containing" "contains" "corresponding"
## [105] "could" "couldn't" "course" "currently"
## [109] "d" "definitely" "described" "despite"
## [113] "did" "didn't" "different" "do"
## [117] "does" "doesn't" "doing" "don't"
## [121] "done" "down" "downwards" "during"
## [125] "e" "each" "edu" "eg"
## [129] "eight" "either" "else" "elsewhere"
## [133] "enough" "entirely" "especially" "et"
## [137] "etc" "even" "ever" "every"
## [141] "everybody" "everyone" "everything" "everywhere"
## [145] "ex" "exactly" "example" "except"
## [149] "f" "far" "few" "fifth"
## [153] "first" "five" "followed" "following"
## [157] "follows" "for" "former" "formerly"
## [161] "forth" "four" "from" "further"
## [165] "furthermore" "g" "get" "gets"
## [169] "getting" "given" "gives" "go"
## [173] "goes" "going" "gone" "got"
## [177] "gotten" "greetings" "h" "had"
## [181] "hadn't" "happens" "hardly" "has"
## [185] "hasn't" "have" "haven't" "having"
## [189] "he" "he's" "hello" "help"
## [193] "hence" "her" "here" "here's"
## [197] "hereafter" "hereby" "herein" "hereupon"
## [201] "hers" "herself" "hi" "him"
## [205] "himself" "his" "hither" "hopefully"
## [209] "how" "howbeit" "however" "i"
## [213] "i'd" "i'll" "i'm" "i've"
## [217] "ie" "if" "ignored" "immediate"
## [221] "in" "inasmuch" "inc" "indeed"
## [225] "indicate" "indicated" "indicates" "inner"
## [229] "insofar" "instead" "into" "inward"
## [233] "is" "isn't" "it" "it'd"
## [237] "it'll" "it's" "its" "itself"
## [241] "j" "just" "k" "keep"
## [245] "keeps" "kept" "know" "knows"
## [249] "known" "l" "last" "lately"
## [253] "later" "latter" "latterly" "least"
## [257] "less" "lest" "let" "let's"
## [261] "like" "liked" "likely" "little"
## [265] "look" "looking" "looks" "ltd"
## [269] "m" "mainly" "many" "may"
## [273] "maybe" "me" "mean" "meanwhile"
## [277] "merely" "might" "more" "moreover"
## [281] "most" "mostly" "much" "must"
## [285] "my" "myself" "n" "name"
## [289] "namely" "nd" "near" "nearly"
## [293] "necessary" "need" "needs" "neither"
## [297] "never" "nevertheless" "new" "next"
## [301] "nine" "no" "nobody" "non"
## [305] "none" "noone" "nor" "normally"
## [309] "not" "nothing" "novel" "now"
## [313] "nowhere" "o" "obviously" "of"
## [317] "off" "often" "oh" "ok"
## [321] "okay" "old" "on" "once"
## [325] "one" "ones" "only" "onto"
## [329] "or" "other" "others" "otherwise"
## [333] "ought" "our" "ours" "ourselves"
## [337] "out" "outside" "over" "overall"
## [341] "own" "p" "particular" "particularly"
## [345] "per" "perhaps" "placed" "please"
## [349] "plus" "possible" "presumably" "probably"
## [353] "provides" "q" "que" "quite"
## [357] "qv" "r" "rather" "rd"
## [361] "re" "really" "reasonably" "regarding"
## [365] "regardless" "regards" "relatively" "respectively"
## [369] "right" "s" "said" "same"
## [373] "saw" "say" "saying" "says"
## [377] "second" "secondly" "see" "seeing"
## [381] "seem" "seemed" "seeming" "seems"
## [385] "seen" "self" "selves" "sensible"
## [389] "sent" "serious" "seriously" "seven"
## [393] "several" "shall" "she" "should"
## [397] "shouldn't" "since" "six" "so"
## [401] "some" "somebody" "somehow" "someone"
## [405] "something" "sometime" "sometimes" "somewhat"
## [409] "somewhere" "soon" "sorry" "specified"
## [413] "specify" "specifying" "still" "sub"
## [417] "such" "sup" "sure" "t"
## [421] "t's" "take" "taken" "tell"
## [425] "tends" "th" "than" "thank"
## [429] "thanks" "thanx" "that" "that's"
## [433] "thats" "the" "their" "theirs"
## [437] "them" "themselves" "then" "thence"
## [441] "there" "there's" "thereafter" "thereby"
## [445] "therefore" "therein" "theres" "thereupon"
## [449] "these" "they" "they'd" "they'll"
## [453] "they're" "they've" "think" "third"
## [457] "this" "thorough" "thoroughly" "those"
## [461] "though" "three" "through" "throughout"
## [465] "thru" "thus" "to" "together"
## [469] "too" "took" "toward" "towards"
## [473] "tried" "tries" "truly" "try"
## [477] "trying" "twice" "two" "u"
## [481] "un" "under" "unfortunately" "unless"
## [485] "unlikely" "until" "unto" "up"
## [489] "upon" "us" "use" "used"
## [493] "useful" "uses" "using" "usually"
## [497] "uucp" "v" "value" "various"
## [501] "very" "via" "viz" "vs"
## [505] "w" "want" "wants" "was"
## [509] "wasn't" "way" "we" "we'd"
## [513] "we'll" "we're" "we've" "welcome"
## [517] "well" "went" "were" "weren't"
## [521] "what" "what's" "whatever" "when"
## [525] "whence" "whenever" "where" "where's"
## [529] "whereafter" "whereas" "whereby" "wherein"
## [533] "whereupon" "wherever" "whether" "which"
## [537] "while" "whither" "who" "who's"
## [541] "whoever" "whole" "whom" "whose"
## [545] "why" "will" "willing" "wish"
## [549] "with" "within" "without" "won't"
## [553] "wonder" "would" "would" "wouldn't"
## [557] "x" "y" "yes" "yet"
## [561] "you" "you'd" "you'll" "you're"
## [565] "you've" "your" "yours" "yourself"
## [569] "yourselves" "z" "zero"
covid_tidy_nostop <- covid_tidy_word[!covid_tidy_word %in% smart$word] # %in% is a matching operator; with the negation !, we keep only the elements of covid_tidy_word that do not appear in smart$word
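As a quick illustration of how the negated %in% filter behaves, here is a toy example with made-up vectors:
x <- c("virus", "the", "spread", "of")
x %in% c("the", "of")      # -> FALSE TRUE FALSE TRUE (which elements are stop words)
x[!x %in% c("the", "of")]  # -> "virus" "spread" (keep only the non-stop words)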
covid_tidy_word[1:50]
## [1] "coronavirus" "disease" "2019" "coronavirus" "disease"
## [6] "2019" "covid-19" "is" "an" "infectious"
## [11] "disease" "caused" "by" "severe" "acute"
## [16] "respiratory" "syndrome" "coronavirus" "disease" "2019"
## [21] "covid-" "coronavirus" "sars-cov-2" "the" "disease"
## [26] "was" "first" "identified" "in" "december"
## [31] "2019" "in" "wuhan" "the" "capital"
## [36] "of" "other" "names" "2019-ncov" "acute"
## [41] "china" "s" "hubei" "province" "and"
## [46] "has" "since" "spread" "globally" "resulting"
smart$word[1:50]
## [1] "a" "a's" "able" "about" "above"
## [6] "according" "accordingly" "across" "actually" "after"
## [11] "afterwards" "again" "against" "ain't" "all"
## [16] "allow" "allows" "almost" "alone" "along"
## [21] "already" "also" "although" "always" "am"
## [26] "among" "amongst" "an" "and" "another"
## [31] "any" "anybody" "anyhow" "anyone" "anything"
## [36] "anyway" "anyways" "anywhere" "apart" "appear"
## [41] "appreciate" "appropriate" "are" "aren't" "around"
## [46] "as" "aside" "ask" "asking" "associated"
covid_tidy_nostop[1:50]
## [1] "coronavirus" "disease" "2019" "coronavirus" "disease"
## [6] "2019" "covid-19" "infectious" "disease" "caused"
## [11] "severe" "acute" "respiratory" "syndrome" "coronavirus"
## [16] "disease" "2019" "covid-" "coronavirus" "sars-cov-2"
## [21] "disease" "identified" "december" "2019" "wuhan"
## [26] "capital" "names" "2019-ncov" "acute" "china"
## [31] "hubei" "province" "spread" "globally" "resulting"
## [36] "ongoing" "2019" "coronavirus" "respiratory" "disease"
## [41] "pandemic" "common" "symptoms" "include" "fever"
## [46] "cough" "coronavirus" "shortness" "breath" "symptoms"
covid_tidy_nostop_freq <- sort(table(covid_tidy_nostop), decreasing = TRUE)
names(covid_tidy_nostop_freq)[1:50]
## [1] "disease" "covid-19" "virus" "people" "cases"
## [6] "symptoms" "2020" "respiratory" "march" "coronavirus"
## [11] "health" "china" "infection" "severe" "time"
## [16] "treatment" "rate" "sars-cov-2" "pneumonia" "days"
## [21] "acute" "deaths" "hours" "number" "spread"
## [26] "syndrome" "vaccine" "2019" "data" "infected"
## [31] "recommended" "risk" "wuhan" "ace2" "death"
## [36] "develop" "found" "include" "medical" "case"
## [41] "diagnosed" "masks" "transmission" "ventilation" "blood"
## [46] "cdc" "chinese" "distress" "face" "february"
covid_tidy_nostop_freq[1:50]
## covid_tidy_nostop
## disease covid-19 virus people cases symptoms
## 43 41 39 34 30 30
## 2020 respiratory march coronavirus health china
## 28 26 24 23 23 21
## infection severe time treatment rate sars-cov-2
## 21 20 18 17 16 16
## pneumonia days acute deaths hours number
## 15 14 12 12 12 12
## spread syndrome vaccine 2019 data infected
## 12 12 12 11 11 11
## recommended risk wuhan ace2 death develop
## 11 11 11 10 10 10
## found include medical case diagnosed masks
## 10 10 10 9 9 9
## transmission ventilation blood cdc chinese distress
## 9 9 8 8 8 8
## face february
## 8 8
Let’s create a wordcloud from the table of word frequencies.
library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(words = names(covid_tidy_nostop_freq), # Sequence of unique words
freq = covid_tidy_nostop_freq, # Frequency of words
min.freq = 5, # Minimum frequency of words plotted
random.order = FALSE, # Highly frequent words placed in the middle
rot.per = 0.1, # Rate of words rotated in plot
scale = c(3, 0.3), # Range of words in size
colors = brewer.pal(8, "Dark2")) # Retrieve 8 colors from the list of "Dark2"
Now we have a much more informative wordcloud about COVID-19.
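If you also want to save the wordcloud to an image file, one option is to wrap the same call in a graphics device. A minimal sketch (the file name and dimensions are arbitrary):
png("covid_wordcloud.png", width = 800, height = 800)  # open a PNG device; file name is just an example
wordcloud(words = names(covid_tidy_nostop_freq),
          freq = covid_tidy_nostop_freq,
          min.freq = 5,
          random.order = FALSE,
          rot.per = 0.1,
          scale = c(3, 0.3),
          colors = brewer.pal(8, "Dark2"))
dev.off()  # close the device to write the file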
Extract all words that contain a hyphen (-) from the vector of words, covid_tidy_word, and sort the counts of the extracted words in descending order. Save the result to an object named “HyphenWordTable” and export it as a text file using the following R code. Then submit the file.
HyphenWordTable <- ???
write.table(HyphenWordTable, file="HyphenWordTable.txt", sep=",", quote = FALSE, row.names = FALSE)
1st Hint: We can modify the regex pattern (“[^[:space:]][[:punct:]]{1,}[^[:space:]]”) that was used earlier to match words containing a punctuation mark so that it extracts only the words containing a hyphen.
2nd Hint: Which stringr function extracts all substrings that match a specified regex pattern?
3rd Hint: Which function builds a table of counts for each word?
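For reference, one possible way to put the three hints together (a sketch only; check it against your own solution rather than copying it, and note that the regex simply requires a hyphen somewhere inside the word):
hyphen_words <- unlist(str_extract_all(covid_tidy_word, "[^[:space:]]*-[^[:space:]]*"))  # keep only the tokens that contain a hyphen
HyphenWordTable <- sort(table(hyphen_words), decreasing = TRUE)  # counts of hyphenated words in descending order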