Understand some of the basic text processing steps such as tokenization, stop word removal, stemming, and lemmatization
Automated text analysis always requires some form of text processing. Consider the following example of a tweet:
Today’s the day, ladies and gents. Mr. K will land in U.S. :)
If one wants to use information from this piece of text for any form of text mining, it is important to determine what the tokens in the text are:
today, ’s, the, day, ladies, and, gents, Mr., K, will, land, in, U.S., :)
This implies a process that understands that periods in abbreviations (e.g., Mr.) and acronyms (e.g., U.S.) need to be preserved as such, but also that punctuation needs to be separated from the nearby tokens (the comma after day or the period after gents).
Further, a text preprocessor often normalizes the text (e.g., it may expand ’s into is or the informal gents into gentlemen), it may try to identify the root or stem of the words (e.g., lady for ladies, or be for ’s), and it may even attempt to identify and possibly label special symbols such as the emoticon :).
Text (pre-)processing can consist of basic steps such as:
Removing the HTML (HyperText Markup Language) tags from documents collected from the web
Separating the punctuation from the words
Removing function words (very frequent words, https://en.wikipedia.org/wiki/Function_word) as stop words
Applying stemming or lemmatization (reducing words to their root form)
These text (pre-)processing steps result in a set of tokens that can be used to collect statistics or serve as input for more advanced applications such as sentiment analysis or text classification.
Note that the choice of text (pre-)processing steps is often application dependent: e.g., for analyzing the language of deception, stop words are useful and should be preserved. But to analyze the main theme of texts, stop words can be removed, and we also benefit from stemming all the input words. For identifying all the organizations that appear in a corpus, more advanced annotations are useful, such as those produced by a named entity recognition tool.
Tokenization is the process of identifying the words in the input sequence of characters, mainly by separating the punctuation marks but also by identifying contractions, abbreviations, and so forth to maintain their intended meaning.
This tokenization process also includes text normalization steps, such as lowercasing and removing HTML tags.
The process of tokenization assumes that white spaces and punctuation are used as explicit word boundaries. But this is not the case for languages such as Korean.
“Mr. Smith doesn’t like apples.” —Tokenization—> “Mr. Smith does not like apples”
Special Attention
End-of-sentence periods vs. markers of abbreviations (e.g., Mr., Dr., U.S.)
Contractions and abbreviations are language dependent: we need to compile a list of such words to make sure that the tokenization of the period is handled correctly. The same applies to apostrophes and hyphenation.
For an apostrophe, we often want to identify the contractions and separate them such that they form meaningful individual words. For instance, the possessive books’ should form two words: book and s’. The contractions aren’t and he’s should be separated into are and not and he and ’s.
For hyphenation, we often leave the hyphen in place to indicate a collocation, as in, e.g., state-of-the-art, although sometimes it may be useful to separate it to allow access to the individual words; e.g., separate Hewlett-Packard into Hewlett - Packard. A minimal tokenizer sketch follows below.
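To make this concrete, here is a minimal tokenizer sketch in R with stringr. The abbreviation list and the contraction rule below are made up for illustration; a real tokenizer would rely on properly compiled, language-specific lists, and note that this sketch keeps sentence punctuation as separate tokens.
library(stringr)
tweet <- "Today's the day, ladies and gents. Mr. K will land in U.S. :)"
abbrevs <- c("Mr.", "Dr.", "U.S.") # made-up abbreviation list for this example
tokenize_sketch <- function(x) {
  # protect periods inside known abbreviations with a placeholder
  for (a in abbrevs) {
    x <- str_replace_all(x, fixed(a), str_replace_all(a, fixed("."), "<DOT>"))
  }
  x <- str_replace_all(x, c("'s" = " 's"))       # split off the 's contraction
  x <- str_replace_all(x, "([.,;!?])", " \\1 ")  # pad the remaining punctuation with spaces
  x <- str_replace_all(x, fixed("<DOT>"), ".")   # restore the abbreviation periods
  unlist(str_split(str_squish(x), " "))
}
tokenize_sketch(tweet)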
Stop words, aka function words, consist of high-frequency words including pronouns (e.g., I, we, us), determiners (e.g., the, a), prepositions (e.g., in, on), and others. For some tasks, stop words provide meaningful information: e.g., they give significant insights into people’s personalities and behaviors (Pennebaker & King, 1999). But there are also tasks for which it is useful to remove them and focus the attention on content words such as nouns and verbs. In this case, we usually use a precompiled list of stop words.
But such a precompiled list of stop words can be unavailable for a language. We can then gather word statistics on a very large corpus of texts written in that language and take the top N most frequent words as candidate stop words, since stop words are generally high-frequency words.
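As a rough sketch of that idea in R, we can count word frequencies over a (here, toy) corpus and take the top N words as stop-word candidates; the corpus and the cutoff N below are placeholders:
library(stringr)
corpus <- c("the cat sat on the mat", "the dog sat on the log", "a cat and a dog") # toy corpus
N <- 5 # arbitrary cutoff
words <- unlist(str_split(str_to_lower(corpus), "\\s+")) # crude whitespace tokenization
word_freq <- sort(table(words), decreasing = TRUE)       # word frequency table
stopword_candidates <- names(word_freq)[1:N]             # top-N most frequent words
stopword_candidates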
Many words in natural language are related, yet they have different surface forms due to grammatical reasons, such as construction and construct or study and studies. These relations can be captured by identifying the common stem of multiple words, and this is called stemming. Stemming applies a set of rules to an input word to remove suffixes and prefixes and obtain its stem, which will now be shared with other related words. For instance, computer, computational, and computation will all be reduced to the same stem: compute. Put simply, stemming is a processing step that uses a set of rules to remove such inflections.
But stemming often produces stems that are not valid words since suffixes or prefixes were removed. Stemming the words study and studying transforms them into studi.
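In R, a Porter-style stemmer is available, for example, through the SnowballC package (a minimal sketch; assumes SnowballC is installed):
# install.packages("SnowballC") # uncomment if the package is not installed
library(SnowballC)
wordStem(c("study", "studies", "studying")) # all three are reduced to the stem "studi"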
The alternative to stemming is lemmatization, which reduces the inflectional forms of a word to its root form. For example, lemmatization transforms studies to study and am, are, or is to be. That is, lemmatization is the process of identifying the base form (or root form) of a word as found in a dictionary. So, unlike stemming, the output obtained from lemmatization is a valid word form; thus, its output is readable by humans, while this comes at a cost of a more computationally intensive process.
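In R, one option for dictionary-based lemmatization is the textstem package (a minimal sketch; assumes textstem and its lemma dictionary are installed):
# install.packages("textstem") # uncomment if the package is not installed
library(textstem)
lemmatize_words(c("studies", "am", "are", "is")) # expected: "study" "be" "be" "be", i.e., valid dictionary forms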
Let me remind you of the functions in the package stringr covered last time.
Function | Description | Similar Base Functions |
---|---|---|
str_length() | number of characters | nchar() |
str_split() | split up a string into pieces | strsplit() |
str_c() | string concatenation | paste() |
str_squish() | removes any redundant whitespace | |
str_detect() | finds a particular pattern of characters | |
str_view_all() | shows the matching result on the screen | |
All functions in stringr start with "str_", followed by a term related to the task they perform.
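As a quick refresher, these functions can be tried on a toy string; the string below and the results noted in the comments are only illustrative:
library(stringr)
s <- "  Mr. K   will land in U.S.  "
str_length(s)                                 # number of characters, including spaces
str_squish(s)                                 # "Mr. K will land in U.S."
str_c("Breaking:", str_squish(s), sep = " ")  # concatenate two strings with a space
str_split(str_squish(s), " ")                 # split the string into individual words
str_detect(s, "U\\.S\\.")                     # TRUE: the pattern "U.S." occurs in s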
Most string functions work with regex, a concise language for describing certain patterns of text. The following functions are useful for text pre-processing.
Function | Description |
---|---|
str_which() | Returns all positions of a matching pattern in a string vector |
str_subset() | Returns all elements that contain a matching pattern in a string vector |
str_trunc() | Truncates a string |
str_locate() | Locates the first position of a matching pattern in a string |
str_locate_all() | Locates all positions of a matching pattern in a string |
str_extract() | Extracts the first matching pattern from a string |
str_extract_all() | Extracts all matching patterns from a string |
str_replace() | Replaces the first matching pattern in a string |
str_replace_all() | Replaces all matching patterns in a string |
str_remove() | Removes the first matching pattern in a string |
str_remove_all() | Removes all matching patterns in a string |
library(stringr)
library(pdftools)
covid_text <- pdf_text("Coronavirus_disease_2019.pdf")
class(covid_text)
## [1] "character"
length(covid_text)
## [1] 43
covid_string <- str_c(covid_text, collapse = " ") # Collapse a character vector, covid_text, into a single string
length(covid_string)
## [1] 1
Now we have a single string in which text from a Wikipedia page about COVID-19 is concatenated.
First, we want to remove everything in the References section
str_locate_all(covid_string, "References") # Locate the start and end positions of the pattern "References" in the string
## [[1]]
## start end
## [1,] 5336 5345
## [2,] 44140 44149
str_trunc(covid_string, 100, "right") # Truncate the string: keep the first 100 characters and drop everything after them, to the right end
## [1] "Coronavirus disease 2019\nCoronavirus disease 2019 (COVID-19) is an infectious\ndisease caused by s..."
covid_trunc <- str_trunc(covid_string, 44139, "right")
str_locate_all(covid_trunc, "References")
## [[1]]
## start end
## [1,] 5336 5345
Now we know where the literal pattern "References" appears in the string covid_string, and we truncate the string by removing everything from the second occurrence (the start of the References section) onward.
Next, it seems we need to deal with whitespace (\n, \r\n, or multiple blanks). Remember how to remove all redundant whitespace characters, including line breaks: [:space:]
str_trunc(covid_trunc, 100, "right")
## [1] "Coronavirus disease 2019\nCoronavirus disease 2019 (COVID-19) is an infectious\ndisease caused by s..."
covid_tidy <- str_squish(covid_trunc) # Replace any redundant whitespace
str_trunc(covid_tidy, 100, "right")
## [1] "Coronavirus disease 2019 Coronavirus disease 2019 (COVID-19) is an infectious disease caused by s..."
It should look tidier than before. What do we need to do with the string object now? It seems we should deal with normalization (standardizing the text into either lower-case or upper-case letters).
So, we may want to use the function tolower to translate the characters of a string into lower-case ones. Before doing so, we may need to remove all non-English characters using the POSIX character class [:ascii:]. Let’s check what non-English characters are in the string.
str_extract_all(covid_tidy, "[^[:ascii:]]+") # Extract all non-English characters (matching the preceding character set at least one or more times); # Guess what {1,} in regex does
## [[1]]
## [1] "–" "əˈ" "ʊ" "əˌ" "ɪ" "ə" "ɪˈ" "ː" "ˈ" "ʊ" "ɪ" "–" "–" "–" "×"
## [16] "–" "–" "à" "–" "–" "–" "–" "–" "–" "–" "–" "–" "–" "–" "–"
## [31] "–" "–" "≥" "–" "–" "–" "–" "–" "–"
covid_eng <- str_replace_all(covid_tidy, "[^[:ascii:]]+", " ") # Replace any non-English character with a blank " ".
covid_eng_lower <- tolower(covid_eng) # Translate all characters into lower-case letters
# if you have an error message, you may try a stringr function, str_to_lower, instead.
str_trunc(covid_eng, 1000)
## [1] "Coronavirus disease 2019 Coronavirus disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome Coronavirus disease 2019 (COVID- coronavirus 2 (SARS-CoV-2).[7] The disease was first 19) identified in December 2019 in Wuhan, the capital of Other names 2019-nCoV acute China's Hubei province, and has since spread globally, resulting in the ongoing 2019 20 coronavirus respiratory disease pandemic.[8][9] Common symptoms include fever, cough, Novel coronavirus and shortness of breath.[10] Other symptoms may include pneumonia[1] muscle pain, sputum production, diarrhea, sore throat, Wuhan pneumonia[2][3] loss of smell, and abdominal pain.[4][11][12] While the Wuhan coronavirus majority of cases result in mild symptoms, some progress to viral pneumonia and multi-organ failure.[8][13] As of \"Coronavirus\" or other 28 March 2020, the overall rate of deaths per number of names for SARS-CoV-2 diagnosed cases is 4.7 percent; ranging from 0.2 percent to 15 percent..."
str_trunc(covid_eng_lower, 1000, "right")
## [1] "coronavirus disease 2019 coronavirus disease 2019 (covid-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus disease 2019 (covid- coronavirus 2 (sars-cov-2).[7] the disease was first 19) identified in december 2019 in wuhan, the capital of other names 2019-ncov acute china's hubei province, and has since spread globally, resulting in the ongoing 2019 20 coronavirus respiratory disease pandemic.[8][9] common symptoms include fever, cough, novel coronavirus and shortness of breath.[10] other symptoms may include pneumonia[1] muscle pain, sputum production, diarrhea, sore throat, wuhan pneumonia[2][3] loss of smell, and abdominal pain.[4][11][12] while the wuhan coronavirus majority of cases result in mild symptoms, some progress to viral pneumonia and multi-organ failure.[8][13] as of \"coronavirus\" or other 28 march 2020, the overall rate of deaths per number of names for sars-cov-2 diagnosed cases is 4.7 percent; ranging from 0.2 percent to 15 percent..."
covid_tidy <- str_to_lower(covid_tidy)
Now, let’s think about how to deal with punctuation and numbers.
# Check what punctuation marks are to be removed
unlist(str_extract_all(covid_eng_lower, "[^[:space:]]*[[:punct:]]{1,}[^[:space:]]*")) # Remember why we apply the unlist function to the result from str_extract_all
## [1] "(covid-19)" "(covid-"
## [3] "(sars-cov-2).[7]" "19)"
## [5] "wuhan," "2019-ncov"
## [7] "china's" "province,"
## [9] "globally," "pandemic.[8][9]"
## [11] "fever," "cough,"
## [13] "breath.[10]" "pneumonia[1]"
## [15] "pain," "production,"
## [17] "diarrhea," "throat,"
## [19] "pneumonia[2][3]" "smell,"
## [21] "pain.[4][11][12]" "symptoms,"
## [23] "multi-organ" "failure.[8][13]"
## [25] "\"coronavirus\"" "2020,"
## [27] "sars-cov-2" "4.7"
## [29] "percent;" "0.2"
## [31] "problems.[14]" "comparison,"
## [33] "3%" "5%.[15]"
## [35] "sneeze.[16][17]" "airborne.[16][18]"
## [37] "covid-19" "face.[16][17]"
## [39] "covid-19" "symptomatic,"
## [41] "/k" "z,"
## [43] "appear.[17]" "d/"
## [45] "hours.[19]" "days,"
## [47] "days.[10][20]" "fever,"
## [49] "cough," "breath[4]"
## [51] "(rrt-pcr)" "swab.[21]"
## [53] "pneumonia," "sepsis,"
## [55] "syndrome," "symptoms,"
## [57] "pneumonia.[22][23]" "(may"
## [59] "washing," "(maintaining"
## [61] "days)" "others,"
## [63] "symptoms)," "elbow,"
## [65] "face.[24][25]" "(sars-cov-2)"
## [67] "caregivers.[26]" "travel,"
## [69] "rrt-pcr" "testing,"
## [71] "vary," "use,"
## [73] "use," "use.[27][28][29]"
## [75] "currently," "washing,"
## [77] "quarantine," "covid-19."
## [79] "symptoms," "care,"
## [81] "isolation," "measures.[30]"
## [83] "859,032[5][6]" "2020,"
## [85] "(who)" "42,322[5][6]"
## [87] "(4.7%" "cases)"
## [89] "(pheic)" "[31][32]"
## [91] "2020." "[9]"
## [93] "regions.[33]" "anti-cytokine"
## [95] "symptom[34]" "%"
## [97] "flu-like" "symptoms,"
## [99] "fever," "cough,"
## [101] "fatigue," "breath.[4][38][39]"
## [103] "87.9" "breathing,"
## [105] "pressure," "67.7"
## [107] "confusion," "waking,"
## [109] "lips;" "38.1"
## [111] "present.[40]" "commonly,"
## [113] "symptoms," "sneezing,"
## [115] "33.4" "nose,"
## [117] "seen." "15[35]"
## [119] "30[12][36]" "nausea,"
## [121] "vomiting," "18.6"
## [123] "percentages.[37][41][42]" "palpitations.[43]"
## [125] "14.8" "13.9"
## [127] "(anosmia)" "13.6"
## [129] "disease,[12][36]" "reported.[35]"
## [131] "some," "pneumonia,"
## [133] "11.4" "multi-organ"
## [135] "failure," "death.[8][13]"
## [137] "5.0" "symptoms,"
## [139] "4.8" "days.[44]"
## [141] "3.7" "31[37]"
## [143] "infections," "0.9"
## [145] "symptoms." "0.8"
## [147] "period." "covid-19"
## [149] "days.[45][46]" "97.5%"
## [151] "11.5" "infection.[47]"
## [153] "symptoms," "unknown.[48]"
## [155] "disease.[49][50]" "studied,"
## [157] "korea's" "20%"
## [159] "stay.[50][51]" "(sars-cov-2).[52]"
## [161] "sneezes.[53]" "copper,"
## [163] "cardboard," "steel,"
## [165] "plastic." "however,"
## [167] "100%" "(limit"
## [169] "3.33" "100.5"
## [171] "aerosols," "100.5"
## [173] "plastic," "steel,"
## [175] "cardboard," "101.5"
## [177] "copper)." "copper,"
## [179] "cardboard," "steel,"
## [181] "plastic." "(three"
## [183] "hours).[54]" "faeces,"
## [185] "researched.[55][56]" "areas."
## [187] "2.35" "1.05,"
## [189] "manageable.[57]" "newborn.[58]"
## [191] "53%[55]" "infection.[59]"
## [193] "days," "samples,"
## [195] "fecal-oral" "tract.[55]"
## [197] "periods.[59]" "sars-cov-2."
## [199] "particle." "s,"
## [201] "protein;" "m,"
## [203] "protein;" "crown,"
## [205] "e," "protein;"
## [207] "n," "protein;"
## [209] "name." "coronavirus."
## [211] "structure." "covid-19"
## [213] "ace2," "lungs."
## [215] "\"spike\"" "(peplomer)"
## [217] "cell.[60]" "protective,[61][62]"
## [219] "tested.[63]" "progresses,"
## [221] "follow.[62]" "gastric,"
## [223] "epithelium[55]" "intestine.[64]"
## [225] "disease.[66]" "real-time"
## [227] "(rrt-pcr).[67]" "swab,"
## [229] "used.[21][68]" "days.[69][70]"
## [231] "used," "value.[71]"
## [233] "(pcr)" "rrt-pcr"
## [235] "covid-" "virus.[8][72][73]"
## [237] "2020,[74]" "19[65]"
## [239] "ongoing.[75]" "point-of-care"
## [241] "month.[76]" "risk."
## [243] "people:" "fever,"
## [245] "pneumonia," "count,"
## [247] "count.[22]" "x-rays"
## [249] "stages," "occur.[77]"
## [251] "ground-" "peripheral,"
## [253] "distribution.[77]" "dominance,"
## [255] "evolves.[78]" "2020,"
## [257] "\"ct" "first-line"
## [259] "covid-" "19\".[79]"
## [261] "covid-19.[80][81]" "are:"
## [263] "macroscopy:" "pleurisy,"
## [265] "pericarditis," "observed:"
## [267] "pneumonia:" "exudation,"
## [269] "pneumonia:" "oedema,"
## [271] "hyperplasia," "pneumocytes,"
## [273] "pneumonia:" "(dad)"
## [275] "exudates." "(ards)"
## [277] "disease." "pneumonia:"
## [279] "cavities," "bal[82]"
## [281] "liver:" "home,"
## [283] "places," "seconds,"
## [285] "eyes," "nose,"
## [287] "hands.[88][89][90]" "available.[88]"
## [289] "sneeze.[88]" "time,"
## [291] "curve;" "workplaces,"
## [293] "patients.[83][84][85]" "travel,"
## [295] "gatherings." "[91]"
## [297] "(about" "1.80"
## [299] "meters).[92]" "sars-cov-2"
## [301] "earliest,[93]" "covid-19"
## [303] "peak," "\"flattening"
## [305] "curve\"," "infections.[84]"
## [307] "curve[86][87]" "overwhelmed,"
## [309] "cases," "available.[84]"
## [311] "who," "infection.[94]"
## [313] "masks," "china,[95]"
## [315] "kong,[96]" "thailand,[97]"
## [317] "republic,[98]" "austria.[99]"
## [319] "masks," "40%."
## [321] "problem," "sixfold,"
## [323] "tripled," "doubled.[100]"
## [325] "non-medical" "noses,"
## [327] "non-medical" "person.[101]"
## [329] "covid-19" "care,"
## [331] "provider," "provider's"
## [333] "person," "tissue,"
## [335] "water," "items.[102][103]"
## [337] "seconds," "dirty,"
## [339] "one's" "nose,"
## [341] "coughing," "sneezing."
## [343] "alcohol-based" "60%"
## [345] "alcohol," "available.[88]"
## [347] "available," "production."
## [349] "formulations," "isopropanol."
## [351] "alcohol;" "\"not"
## [353] "antisepsis\"." "humectant.[104]"
## [355] "multiplicative," "spread."
## [357] "line," "tracks."
## [359] "care," "fluid,"
## [361] "support," "organs.[106][107][108]"
## [363] "mask.[26]" "(ecmo)"
## [365] "failure," "consideration.[109][110]"
## [367] "covid-19.[111][112]" "u.s."
## [369] "resource," "ibcc.[113][114]"
## [371] "(acetaminophen)" "first-line"
## [373] "use.[115][116][117]" "non-steroidal"
## [375] "anti-inflammatory" "(nsaids)"
## [377] "symptoms,[118]" "covid-19"
## [379] "symptoms.[119]" "blockers,"
## [381] "2020," "equipment[105]"
## [383] "medications.[120][121][122]" "syndrome.[123][124]"
## [385] "transmission," "aerosols,"
## [387] "ventilation.[125]" "(ppe)"
## [389] "pandemic." "includes:"
## [391] "facemask[126][127]" "gloves[128][129]"
## [393] "protection[130]" "available,"
## [395] "(instead" "facemasks)"
## [397] "preferred.[131]" "(eua)."
## [399] "off-label" "uses.[132]"
## [401] "shields," "masks.[133]"
## [403] "covid-19" "(artificial"
## [405] "breathing)," "do.[134][135]"
## [407] "vectors.[134]" "(those"
## [409] "years[134]" "years).[136]"
## [411] "capita," "system's"
## [413] "covid-19" "hospitalization.[137]"
## [415] "(to" "lower).[137]"
## [417] "5%" "units,"
## [419] "2.3%" "ventilation,"
## [421] "1.4%" "died.[109]"
## [423] "20-30%" "support.[138]"
## [425] "equipment.[139]" "covid-19"
## [427] "difficult.[140]" "peep[141]"
## [429] "ventilator-associated" "pneumothorax.[142]"
## [431] "ventilators." "ards[140]"
## [433] "high-flow" "<93%."
## [435] "4ml/kg" "(high"
## [437] "(35" "minute)"
## [439] "required)" "end-expiratory"
## [441] "1/2" "authorities.[143]"
## [443] "2020,[144]" "trials.[145][146]"
## [445] "develop,[147]" "uses,"
## [447] "testing.[143]" "disease.[106]"
## [449] "treatments.[148]" "2020,"
## [451] "outbreak.[149]" "number."
## [453] "'close" "contact'"
## [455] "infection." "users."
## [457] "detected," "self-quarantine,"
## [459] "officials.[150]" "data,"
## [461] "technology," "korea,"
## [463] "taiwan," "singapore.[151][152]"
## [465] "2020," "coronavirus."
## [467] "citizens.[153]" "2020,"
## [469] "agency," "institute,"
## [471] "virus.[154]" "breakers.[155]"
## [473] "\"40%" "anyway\".[156]"
## [475] "42.000" "participants.[157][158]"
## [477] "estonia," "kaljulaid,"
## [479] "coronavirus.[159]" "quarantine,"
## [481] "restrictions," "treatment,"
## [483] "itself." "concerns,"
## [485] "2020.[160][161]" "covid-19"
## [487] "varies." "symptoms,"
## [489] "cold." "weeks,"
## [491] "recover." "died,"
## [493] "weeks.[34]" "disease,"
## [495] "adults;" "years,"
## [497] "0.5%," "8%.[164][165]"
## [499] "covid-19" "viruses,"
## [501] "mers," "covid-19"
## [503] "lacking.[166][167]" "people,"
## [505] "covid-19" "pneumonia."
## [507] "affected," "covid-19"
## [509] "(ards)" "failure,"
## [511] "shock," "multi-organ"
## [513] "failure.[168][169]" "covid-19"
## [515] "sepsis," "clotting,"
## [517] "heart," "kidneys,"
## [519] "liver." "abnormalities,"
## [521] "time," "6%"
## [523] "covid-19," "4%"
## [525] "group.[170]" "cases.[171]"
## [527] "(nlr)" "illness.[172]"
## [529] "covid-19" "pre-existing"
## [531] "(underlying)" "conditions,"
## [533] "hypertension," "mellitus,"
## [535] "disease.[173]" "(iss)"
## [537] "88%" "comorbidity.[174]"
## [539] "10.4%" "review,"
## [541] "97.9%" "2.7"
## [543] "diseases.[175]" "report,"
## [545] "days," "hospitalised."
## [547] "however," "death.[175]"
## [549] "cases," "days,"
## [551] "days.[176]" "(nhc)"
## [553] "china," "covid-19"
## [555] "china[162]" "2.8%"
## [557] "1.7%.[177]" "post-"
## [559] "lungs." "pneumocytes."
## [561] "(ards).[34]" "11.8%"
## [563] "china," "arrest.[43]"
## [565] "china." "2020.[163]"
## [567] "mortality.[178]" "differences,[179]"
## [569] "difficulties." "under-counting"
## [571] "overestimated.[180]" "however,"
## [573] "underestimated.[181][182]" "antibodies,"
## [575] "problems." "hard."
## [577] "2020.[163]" "covid-19"
## [579] "noticing," "there."
## [581] "pressure," "attacks.[183]"
## [583] "long-term" "disease.[184]"
## [585] "likely," "coronaviruses,[185]"
## [587] "covid-19" "reported.[186][187]"
## [589] "reinfection," "relapse,"
## [591] "error." "long-term"
## [593] "disease." "20%"
## [595] "30%" "disease,"
## [597] "damage.[188]" "(%)"
## [599] "february[163]" "0.0"
## [601] "0.2" "0.2"
## [603] "0.2" "0.4"
## [605] "1.3" "3.6"
## [607] "8.0" "14.8"
## [609] "march[174]" "0.0"
## [611] "0.0" "0.0"
## [613] "0.3" "0.7"
## [615] "1.7" "5.7"
## [617] "16.9" "24.4"
## [619] "march[189]" "0.0"
## [621] "0.0" "0.0"
## [623] "0.0" "0.0"
## [625] "0.3" "3.7"
## [627] "9.3" "19.1"
## [629] "march[190]" "0.0"
## [631] "0.0" "0.0"
## [633] "0.1" "0.1"
## [635] "0.6" "1.7"
## [637] "7.0" "18.3"
## [639] "march[191]" "0.0"
## [641] "0.3" "0.2"
## [643] "0.2" "0.4"
## [645] "0.6" "2.1"
## [647] "5.7" "15.3"
## [649] "(%)" "march[192]"
## [651] "0.0" "0.1"
## [653] "0.2" "0.5"
## [655] "0.8" "1.4"
## [657] "2.6" "2.7"
## [659] "4.9" "4.3"
## [661] "10.5" "10.4"
## [663] "27.3" "note:"
## [665] "cases." "data."
## [667] "origin,[193][194]" "infection.[195]"
## [669] "human-to-" "transmission.[163][196]"
## [671] "wuhan," "china.[197]"
## [673] "mortality.[198]" "time,"
## [675] "testing," "quality,"
## [677] "options," "outbreak,"
## [679] "age," "sex,"
## [681] "health.[199]" "2019,"
## [683] "icd-10" "u07.1"
## [685] "lab-confirmed" "sars-cov-2"
## [687] "u07.2" "covid-19"
## [689] "lab-confirmed" "sars-cov-2"
## [691] "infection.[200]" "death-to-case"
## [693] "interval." "statistics,"
## [695] "death-to-case" "4.7%"
## [697] "(29,957" "/"
## [699] "634,835)" "march.[201]"
## [701] "region.[202]" "(cfr),"
## [703] "disease," "(ifr),"
## [705] "(diagnosed" "undiagnosed)"
## [707] "disease." "resolution."
## [709] "populations.[203]" "covid-19"
## [711] "covid-19" "people,"
## [713] "2020[204]" "people,"
## [715] "2020[205]" "covid-19"
## [717] "disease." "corona,"
## [719] "disease," "identified:"
## [721] "2019.[206]" "(i.e."
## [723] "china)," "species,"
## [725] "people," "stigmatisation.[207][208]"
## [727] "covid-19," "sars-cov-2.[209]"
## [729] "2019-ncov.[210]" "\"the"
## [731] "covid-19" "virus\""
## [733] "\"the" "covid-"
## [735] "19\"" "communications.[209]"
## [737] "corona," "latin.[211][212][213]"
## [739] "2020," "\"chinese"
## [741] "virus\"" "\"wuhan"
## [743] "virus\".[214][215][216][217]" "terms,"
## [745] "\"wuflu\"" "\"kung"
## [747] "flu\"," "(outside"
## [749] "professionals)" "covid-19."
## [751] "wuhan," "detected,"
## [753] "general," "arts,"
## [755] "fu." "(popularised"
## [757] "alt-right" "sources)"
## [759] "(when" "flu),"
## [761] "culture." "china's"
## [763] "world," "covid-19,"
## [765] "wuhan.[218][219]" "\"corona\""
## [767] "either.[220][221]" "sars-cov-2,"
## [769] "suggested.[62]" "hygiene,"
## [771] "immunity.[222]" "vaccine,"
## [773] "agencies." "sars-cov"
## [775] "sars-cov-2" "sars-cov"
## [777] "cells.[223]" "investigated."
## [779] "first," "vaccine."
## [781] "virus," "dead,"
## [783] "covid-19." "strategy,"
## [785] "vaccines," "virus."
## [787] "sars-cov-2," "s-spike"
## [789] "receptor." "(dna"
## [791] "vaccines," "vaccination)."
## [793] "efficacy.[224]" "2020,"
## [795] "seattle." "disease.[225]"
## [797] "covid-19" "trials.[143]"
## [799] "2020," "multi-country"
## [801] "\"solidarity\"" "covid-19"
## [803] "pandemic." "remdesivir,"
## [805] "hydroxychloroquine," "lopinavir/ritonavir"
## [807] "lopinavir/ritonavir" "trial.[226][227]"
## [809] "2020.[228]" "sars-cov-2"
## [811] "vitro.[229]" "u.s.,"
## [813] "china," "italy.[143][230][231]"
## [815] "chloroquine," "malaria,"
## [817] "2020," "results.[232]"
## [819] "however," "research.[233]"
## [821] "\"improves" "person's"
## [823] "stay\"" "mild,"
## [825] "pneumonia.[234]" "march,"
## [827] "covid-19.[235]" "chloroquine.[236][237]"
## [829] "however," "virology,"
## [831] "gram," "lethal."
## [833] "2020," "covid-19.[238][239]"
## [835] "interferon," "ribavirin,"
## [837] "covid-" "19.[237]"
## [839] "2020," "lopinavir/ritonavir"
## [841] "illness.[240]" "sars-cov-2.[229]"
## [843] "(tmprss2)" "sars-cov-2"
## [845] "receptor.[241][242]" "off-label"
## [847] "treatment.[241]" "2020,"
## [849] "covid-19" "disease.[243][244]"
## [851] "anti-cytokine" "storm,"
## [853] "life-threatening" "condition,"
## [855] "covid-19." "anti-cytokine"
## [857] "properties.[245]" "china's"
## [859] "completed.[246][247]" "disease.[235][248][249]"
## [861] "storms," "developments,"
## [863] "people.[250][251][252]" "interleukin-6"
## [865] "cause," "therapy,"
## [867] "2017.[253]" "\"a"
## [869] "activity\"" "il-6.[254]"
## [871] "covid-19" "immunisation.[255]"
## [873] "sars.[255]" "sars-cov-2."
## [875] "however," "antibody-dependent"
## [877] "and/or" "phagocytosis,"
## [879] "possible.[255]" "therapy,"
## [881] "example," "antibodies,"
## [883] "development.[255]" "'convalescent"
## [885] "serum'," "virus,"
## [887] "deployment.[256]" "diseases,"
## [889] "wenliang," "wuhan,"
## [891] "covid-" "virus."
## [893] "x," "te..."
It seems we have a few patterns involving punctuation that we want to remove from the text:
1) Citation marks: "\\[\\d+\\]"
2) Decimal numbers: "\\d+\\.\\d+"
3) The possessive ’s: "[']s[[:space:]]"
These patterns are to be replaced with a blank.
str_extract_all(covid_eng_lower, "\\[\\d+\\]|\\d+\\.\\d+|[']s[[:space:]]") # Check first the patterns are matched by our regex
## [[1]]
## [1] "[7]" "'s " "[8]" "[9]" "[10]" "[1]" "[2]" "[3]"
## [9] "[4]" "[11]" "[12]" "[8]" "[13]" "4.7" "0.2" "[14]"
## [17] "[15]" "[16]" "[17]" "[16]" "[18]" "[16]" "[17]" "[17]"
## [25] "[19]" "[10]" "[20]" "[4]" "[21]" "[22]" "[23]" "[24]"
## [33] "[25]" "[26]" "[27]" "[28]" "[29]" "[30]" "[5]" "[6]"
## [41] "[5]" "[6]" "4.7" "[31]" "[32]" "[9]" "[33]" "[34]"
## [49] "[4]" "[38]" "[39]" "87.9" "67.7" "38.1" "[40]" "33.4"
## [57] "[35]" "[12]" "[36]" "18.6" "[37]" "[41]" "[42]" "[43]"
## [65] "14.8" "13.9" "13.6" "[12]" "[36]" "[35]" "11.4" "[8]"
## [73] "[13]" "5.0" "4.8" "[44]" "3.7" "[37]" "0.9" "0.8"
## [81] "[45]" "[46]" "97.5" "11.5" "[47]" "[48]" "[49]" "[50]"
## [89] "'s " "[50]" "[51]" "[52]" "[53]" "3.33" "100.5" "100.5"
## [97] "101.5" "[54]" "[55]" "[56]" "2.35" "1.05" "[57]" "[58]"
## [105] "[55]" "[59]" "[55]" "[59]" "[60]" "[61]" "[62]" "[63]"
## [113] "[62]" "[55]" "[64]" "[66]" "[67]" "[21]" "[68]" "[69]"
## [121] "[70]" "[71]" "[8]" "[72]" "[73]" "[74]" "[65]" "[75]"
## [129] "[76]" "[22]" "[77]" "[77]" "[78]" "[79]" "[80]" "[81]"
## [137] "[82]" "[88]" "[89]" "[90]" "[88]" "[88]" "[83]" "[84]"
## [145] "[85]" "[91]" "1.80" "[92]" "[93]" "[84]" "[86]" "[87]"
## [153] "[84]" "[94]" "[95]" "[96]" "[97]" "[98]" "[99]" "[100]"
## [161] "[101]" "'s " "[102]" "[103]" "'s " "[88]" "[104]" "[106]"
## [169] "[107]" "[108]" "[26]" "[109]" "[110]" "[111]" "[112]" "[113]"
## [177] "[114]" "[115]" "[116]" "[117]" "[118]" "[119]" "[105]" "[120]"
## [185] "[121]" "[122]" "[123]" "[124]" "[125]" "[126]" "[127]" "[128]"
## [193] "[129]" "[130]" "[131]" "[132]" "[133]" "[134]" "[135]" "[134]"
## [201] "[134]" "[136]" "'s " "[137]" "[137]" "2.3" "1.4" "[109]"
## [209] "[138]" "[139]" "[140]" "[141]" "[142]" "[140]" "[143]" "[144]"
## [217] "[145]" "[146]" "[147]" "[143]" "[106]" "[148]" "[149]" "[150]"
## [225] "[151]" "[152]" "[153]" "[154]" "[155]" "[156]" "42.000" "[157]"
## [233] "[158]" "[159]" "[160]" "[161]" "[34]" "0.5" "[164]" "[165]"
## [241] "[166]" "[167]" "[168]" "[169]" "[170]" "[171]" "[172]" "[173]"
## [249] "[174]" "10.4" "97.9" "2.7" "[175]" "[175]" "[176]" "[162]"
## [257] "2.8" "1.7" "[177]" "[34]" "11.8" "[43]" "[163]" "[178]"
## [265] "[179]" "[180]" "[181]" "[182]" "[163]" "[183]" "[184]" "[185]"
## [273] "[186]" "[187]" "[188]" "[163]" "0.0" "0.2" "0.2" "0.2"
## [281] "0.4" "1.3" "3.6" "8.0" "14.8" "[174]" "0.0" "0.0"
## [289] "0.0" "0.3" "0.7" "1.7" "5.7" "16.9" "24.4" "[189]"
## [297] "0.0" "0.0" "0.0" "0.0" "0.0" "0.3" "3.7" "9.3"
## [305] "19.1" "[190]" "0.0" "0.0" "0.0" "0.1" "0.1" "0.6"
## [313] "1.7" "7.0" "18.3" "[191]" "0.0" "0.3" "0.2" "0.2"
## [321] "0.4" "0.6" "2.1" "5.7" "15.3" "[192]" "0.0" "0.1"
## [329] "0.2" "0.5" "0.8" "1.4" "2.6" "2.7" "4.9" "4.3"
## [337] "10.5" "10.4" "27.3" "[193]" "[194]" "[195]" "[163]" "[196]"
## [345] "[197]" "[198]" "[199]" "07.1" "07.2" "[200]" "4.7" "[201]"
## [353] "[202]" "[203]" "[204]" "[205]" "[206]" "[207]" "[208]" "[209]"
## [361] "[210]" "[209]" "[211]" "[212]" "[213]" "[214]" "[215]" "[216]"
## [369] "[217]" "'s " "[218]" "[219]" "[220]" "[221]" "[62]" "[222]"
## [377] "[223]" "[224]" "[225]" "[143]" "[226]" "[227]" "[228]" "[229]"
## [385] "[143]" "[230]" "[231]" "[232]" "[233]" "'s " "[234]" "[235]"
## [393] "[236]" "[237]" "[238]" "[239]" "[237]" "[240]" "[229]" "[241]"
## [401] "[242]" "[241]" "[243]" "[244]" "[245]" "'s " "[246]" "[247]"
## [409] "[235]" "[248]" "[249]" "[250]" "[251]" "[252]" "[253]" "[254]"
## [417] "[255]" "[255]" "[255]" "[255]" "[256]"
covid_nocite <- str_replace_all(covid_eng_lower, "\\[\\d+\\]|\\d+\\.\\d+|[']s[[:space:]]", " ")
unlist(str_extract_all(covid_nocite, "[^[:space:]]*[[:punct:]]{1,}[^[:space:]]*"))
## [1] "(covid-19)" "(covid-" "(sars-cov-2)."
## [4] "19)" "wuhan," "2019-ncov"
## [7] "province," "globally," "pandemic."
## [10] "fever," "cough," "breath."
## [13] "pain," "production," "diarrhea,"
## [16] "throat," "smell," "pain."
## [19] "symptoms," "multi-organ" "failure."
## [22] "\"coronavirus\"" "2020," "sars-cov-2"
## [25] "percent;" "problems." "comparison,"
## [28] "3%" "5%." "sneeze."
## [31] "airborne." "covid-19" "face."
## [34] "covid-19" "symptomatic," "/k"
## [37] "z," "appear." "d/"
## [40] "hours." "days," "days."
## [43] "fever," "cough," "(rrt-pcr)"
## [46] "swab." "pneumonia," "sepsis,"
## [49] "syndrome," "symptoms," "pneumonia."
## [52] "(may" "washing," "(maintaining"
## [55] "days)" "others," "symptoms),"
## [58] "elbow," "face." "(sars-cov-2)"
## [61] "caregivers." "travel," "rrt-pcr"
## [64] "testing," "vary," "use,"
## [67] "use," "use." "currently,"
## [70] "washing," "quarantine," "covid-19."
## [73] "symptoms," "care," "isolation,"
## [76] "measures." "859,032" "2020,"
## [79] "(who)" "42,322" "("
## [82] "%" "cases)" "(pheic)"
## [85] "2020." "regions." "anti-cytokine"
## [88] "%" "flu-like" "symptoms,"
## [91] "fever," "cough," "fatigue,"
## [94] "breath." "breathing," "pressure,"
## [97] "confusion," "waking," "lips;"
## [100] "present." "commonly," "symptoms,"
## [103] "sneezing," "nose," "seen."
## [106] "nausea," "vomiting," "percentages."
## [109] "palpitations." "(anosmia)" "disease,"
## [112] "reported." "some," "pneumonia,"
## [115] "multi-organ" "failure," "death."
## [118] "symptoms," "days." "infections,"
## [121] "symptoms." "period." "covid-19"
## [124] "days." "%" "infection."
## [127] "symptoms," "unknown." "disease."
## [130] "studied," "20%" "stay."
## [133] "(sars-cov-2)." "sneezes." "copper,"
## [136] "cardboard," "steel," "plastic."
## [139] "however," "100%" "(limit"
## [142] "aerosols," "plastic," "steel,"
## [145] "cardboard," "copper)." "copper,"
## [148] "cardboard," "steel," "plastic."
## [151] "(three" "hours)." "faeces,"
## [154] "researched." "areas." ","
## [157] "manageable." "newborn." "53%"
## [160] "infection." "days," "samples,"
## [163] "fecal-oral" "tract." "periods."
## [166] "sars-cov-2." "particle." "s,"
## [169] "protein;" "m," "protein;"
## [172] "crown," "e," "protein;"
## [175] "n," "protein;" "name."
## [178] "coronavirus." "structure." "covid-19"
## [181] "ace2," "lungs." "\"spike\""
## [184] "(peplomer)" "cell." "protective,"
## [187] "tested." "progresses," "follow."
## [190] "gastric," "intestine." "disease."
## [193] "real-time" "(rrt-pcr)." "swab,"
## [196] "used." "days." "used,"
## [199] "value." "(pcr)" "rrt-pcr"
## [202] "covid-" "virus." "2020,"
## [205] "ongoing." "point-of-care" "month."
## [208] "risk." "people:" "fever,"
## [211] "pneumonia," "count," "count."
## [214] "x-rays" "stages," "occur."
## [217] "ground-" "peripheral," "distribution."
## [220] "dominance," "evolves." "2020,"
## [223] "\"ct" "first-line" "covid-"
## [226] "19\"." "covid-19." "are:"
## [229] "macroscopy:" "pleurisy," "pericarditis,"
## [232] "observed:" "pneumonia:" "exudation,"
## [235] "pneumonia:" "oedema," "hyperplasia,"
## [238] "pneumocytes," "pneumonia:" "(dad)"
## [241] "exudates." "(ards)" "disease."
## [244] "pneumonia:" "cavities," "liver:"
## [247] "home," "places," "seconds,"
## [250] "eyes," "nose," "hands."
## [253] "available." "sneeze." "time,"
## [256] "curve;" "workplaces," "patients."
## [259] "travel," "gatherings." "(about"
## [262] "meters)." "sars-cov-2" "earliest,"
## [265] "covid-19" "peak," "\"flattening"
## [268] "curve\"," "infections." "overwhelmed,"
## [271] "cases," "available." "who,"
## [274] "infection." "masks," "china,"
## [277] "kong," "thailand," "republic,"
## [280] "austria." "masks," "40%."
## [283] "problem," "sixfold," "tripled,"
## [286] "doubled." "non-medical" "noses,"
## [289] "non-medical" "person." "covid-19"
## [292] "care," "provider," "person,"
## [295] "tissue," "water," "items."
## [298] "seconds," "dirty," "nose,"
## [301] "coughing," "sneezing." "alcohol-based"
## [304] "60%" "alcohol," "available."
## [307] "available," "production." "formulations,"
## [310] "isopropanol." "alcohol;" "\"not"
## [313] "antisepsis\"." "humectant." "multiplicative,"
## [316] "spread." "line," "tracks."
## [319] "care," "fluid," "support,"
## [322] "organs." "mask." "(ecmo)"
## [325] "failure," "consideration." "covid-19."
## [328] "u.s." "resource," "ibcc."
## [331] "(acetaminophen)" "first-line" "use."
## [334] "non-steroidal" "anti-inflammatory" "(nsaids)"
## [337] "symptoms," "covid-19" "symptoms."
## [340] "blockers," "2020," "medications."
## [343] "syndrome." "transmission," "aerosols,"
## [346] "ventilation." "(ppe)" "pandemic."
## [349] "includes:" "available," "(instead"
## [352] "facemasks)" "preferred." "(eua)."
## [355] "off-label" "uses." "shields,"
## [358] "masks." "covid-19" "(artificial"
## [361] "breathing)," "do." "vectors."
## [364] "(those" "years)." "capita,"
## [367] "covid-19" "hospitalization." "(to"
## [370] "lower)." "5%" "units,"
## [373] "%" "ventilation," "%"
## [376] "died." "20-30%" "support."
## [379] "equipment." "covid-19" "difficult."
## [382] "ventilator-associated" "pneumothorax." "ventilators."
## [385] "high-flow" "<93%." "4ml/kg"
## [388] "(high" "(35" "minute)"
## [391] "required)" "end-expiratory" "1/2"
## [394] "authorities." "2020," "trials."
## [397] "develop," "uses," "testing."
## [400] "disease." "treatments." "2020,"
## [403] "outbreak." "number." "'close"
## [406] "contact'" "infection." "users."
## [409] "detected," "self-quarantine," "officials."
## [412] "data," "technology," "korea,"
## [415] "taiwan," "singapore." "2020,"
## [418] "coronavirus." "citizens." "2020,"
## [421] "agency," "institute," "virus."
## [424] "breakers." "\"40%" "anyway\"."
## [427] "participants." "estonia," "kaljulaid,"
## [430] "coronavirus." "quarantine," "restrictions,"
## [433] "treatment," "itself." "concerns,"
## [436] "2020." "covid-19" "varies."
## [439] "symptoms," "cold." "weeks,"
## [442] "recover." "died," "weeks."
## [445] "disease," "adults;" "years,"
## [448] "%," "8%." "covid-19"
## [451] "viruses," "mers," "covid-19"
## [454] "lacking." "people," "covid-19"
## [457] "pneumonia." "affected," "covid-19"
## [460] "(ards)" "failure," "shock,"
## [463] "multi-organ" "failure." "covid-19"
## [466] "sepsis," "clotting," "heart,"
## [469] "kidneys," "liver." "abnormalities,"
## [472] "time," "6%" "covid-19,"
## [475] "4%" "group." "cases."
## [478] "(nlr)" "illness." "covid-19"
## [481] "pre-existing" "(underlying)" "conditions,"
## [484] "hypertension," "mellitus," "disease."
## [487] "(iss)" "88%" "comorbidity."
## [490] "%" "review," "%"
## [493] "diseases." "report," "days,"
## [496] "hospitalised." "however," "death."
## [499] "cases," "days," "days."
## [502] "(nhc)" "china," "covid-19"
## [505] "%" "%." "post-"
## [508] "lungs." "pneumocytes." "(ards)."
## [511] "%" "china," "arrest."
## [514] "china." "2020." "mortality."
## [517] "differences," "difficulties." "under-counting"
## [520] "overestimated." "however," "underestimated."
## [523] "antibodies," "problems." "hard."
## [526] "2020." "covid-19" "noticing,"
## [529] "there." "pressure," "attacks."
## [532] "long-term" "disease." "likely,"
## [535] "coronaviruses," "covid-19" "reported."
## [538] "reinfection," "relapse," "error."
## [541] "long-term" "disease." "20%"
## [544] "30%" "disease," "damage."
## [547] "(%)" "(%)" "note:"
## [550] "cases." "data." "origin,"
## [553] "infection." "human-to-" "transmission."
## [556] "wuhan," "china." "mortality."
## [559] "time," "testing," "quality,"
## [562] "options," "outbreak," "age,"
## [565] "sex," "health." "2019,"
## [568] "icd-10" "lab-confirmed" "sars-cov-2"
## [571] "covid-19" "lab-confirmed" "sars-cov-2"
## [574] "infection." "death-to-case" "interval."
## [577] "statistics," "death-to-case" "%"
## [580] "(29,957" "/" "634,835)"
## [583] "march." "region." "(cfr),"
## [586] "disease," "(ifr)," "(diagnosed"
## [589] "undiagnosed)" "disease." "resolution."
## [592] "populations." "covid-19" "covid-19"
## [595] "people," "people," "covid-19"
## [598] "disease." "corona," "disease,"
## [601] "identified:" "2019." "(i.e."
## [604] "china)," "species," "people,"
## [607] "stigmatisation." "covid-19," "sars-cov-2."
## [610] "2019-ncov." "\"the" "covid-19"
## [613] "virus\"" "\"the" "covid-"
## [616] "19\"" "communications." "corona,"
## [619] "latin." "2020," "\"chinese"
## [622] "virus\"" "\"wuhan" "virus\"."
## [625] "terms," "\"wuflu\"" "\"kung"
## [628] "flu\"," "(outside" "professionals)"
## [631] "covid-19." "wuhan," "detected,"
## [634] "general," "arts," "fu."
## [637] "(popularised" "alt-right" "sources)"
## [640] "(when" "flu)," "culture."
## [643] "world," "covid-19," "wuhan."
## [646] "\"corona\"" "either." "sars-cov-2,"
## [649] "suggested." "hygiene," "immunity."
## [652] "vaccine," "agencies." "sars-cov"
## [655] "sars-cov-2" "sars-cov" "cells."
## [658] "investigated." "first," "vaccine."
## [661] "virus," "dead," "covid-19."
## [664] "strategy," "vaccines," "virus."
## [667] "sars-cov-2," "s-spike" "receptor."
## [670] "(dna" "vaccines," "vaccination)."
## [673] "efficacy." "2020," "seattle."
## [676] "disease." "covid-19" "trials."
## [679] "2020," "multi-country" "\"solidarity\""
## [682] "covid-19" "pandemic." "remdesivir,"
## [685] "hydroxychloroquine," "lopinavir/ritonavir" "lopinavir/ritonavir"
## [688] "trial." "2020." "sars-cov-2"
## [691] "vitro." "u.s.," "china,"
## [694] "italy." "chloroquine," "malaria,"
## [697] "2020," "results." "however,"
## [700] "research." "\"improves" "stay\""
## [703] "mild," "pneumonia." "march,"
## [706] "covid-19." "chloroquine." "however,"
## [709] "virology," "gram," "lethal."
## [712] "2020," "covid-19." "interferon,"
## [715] "ribavirin," "covid-" "19."
## [718] "2020," "lopinavir/ritonavir" "illness."
## [721] "sars-cov-2." "(tmprss2)" "sars-cov-2"
## [724] "receptor." "off-label" "treatment."
## [727] "2020," "covid-19" "disease."
## [730] "anti-cytokine" "storm," "life-threatening"
## [733] "condition," "covid-19." "anti-cytokine"
## [736] "properties." "completed." "disease."
## [739] "storms," "developments," "people."
## [742] "interleukin-6" "cause," "therapy,"
## [745] "2017." "\"a" "activity\""
## [748] "il-6." "covid-19" "immunisation."
## [751] "sars." "sars-cov-2." "however,"
## [754] "antibody-dependent" "and/or" "phagocytosis,"
## [757] "possible." "therapy," "example,"
## [760] "antibodies," "development." "'convalescent"
## [763] "serum'," "virus," "deployment."
## [766] "diseases," "wenliang," "wuhan,"
## [769] "covid-" "virus." "x,"
## [772] "te..."
However, what about punctuation marks that form part of a word? For example, the hyphen in “covid-19” or “sars-cov-2”, or the percent sign in 20%. We may not want to remove the hyphen and the percent sign from the string, so for convenience’s sake we remove every punctuation mark except the hyphen and the percent sign.
# How can we form a regex that matches every punctuation character except the hyphen and the percent sign?
str_extract_all(covid_nocite, "[^[:alnum:][:space:]-%]") # A negation of any letter/number/whitespace/hyphen/percent character
## [[1]]
## [1] "(" ")" "(" "(" ")" "." ")" "," "," "," "." "," "," "." ","
## [16] "," "," "," "," "." "," "." "\"" "\"" "," ";" "." "," "." "."
## [31] "." "." "," "/" "," "." "/" "." "," "." "," "," "(" ")" "."
## [46] "," "," "," "," "." "(" "," "(" ")" "," ")" "," "," "." "("
## [61] ")" "." "," "," "," "," "," "." "," "," "," "." "," "," ","
## [76] "." "," "," "(" ")" "," "(" ")" "(" ")" "." "." "," "," ","
## [91] "," "." "," "," "," "," ";" "." "," "," "," "," "." "," ","
## [106] "." "." "(" ")" "," "." "," "," "," "." "," "." "," "." "."
## [121] "." "." "," "." "." "," "." "(" ")" "." "." "," "," "," "."
## [136] "," "(" "," "," "," "," ")" "." "," "," "," "." "(" ")" "."
## [151] "," "." "." "," "." "." "." "," "," "." "." "." "." "," ";"
## [166] "," ";" "," "," ";" "," ";" "." "." "." "," "." "\"" "\"" "("
## [181] ")" "." "," "." "," "." "," "." "." "(" ")" "." "," "." "."
## [196] "," "." "(" ")" "." "," "." "." "." ":" "," "," "," "." ","
## [211] "." "," "." "," "." "," "\"" "\"" "." "." ":" ":" "," "," ":"
## [226] ":" "," ":" "," "," "," ":" "(" ")" "." "(" ")" "." ":" ","
## [241] ":" "," "," "," "," "," "." "." "." "," ";" "," "." "," "."
## [256] "(" ")" "." "," "," "\"" "\"" "," "." "," "," "." "," "." ","
## [271] "," "," "," "," "." "," "." "," "," "," "." "," "." "," ","
## [286] "," "," "," "." "," "," "," "," "." "," "." "," "." "," "."
## [301] ";" "\"" "\"" "." "." "," "." "," "." "," "," "," "." "." "("
## [316] ")" "," "." "." "." "." "," "." "(" ")" "." "(" ")" "," "."
## [331] "," "," "." "." "," "," "." "(" ")" "." ":" "," "(" ")" "."
## [346] "(" ")" "." "." "," "." "(" ")" "," "." "." "(" ")" "." ","
## [361] "." "(" ")" "." "," "," "." "." "." "." "." "." "<" "." "/"
## [376] "(" "(" ")" ")" "/" "." "," "." "," "," "." "." "." "," "."
## [391] "." "'" "'" "." "." "," "," "." "," "," "," "," "." "," "."
## [406] "." "," "," "," "." "." "\"" "\"" "." "." "," "," "." "," ","
## [421] "," "." "," "." "." "," "." "," "." "," "." "," ";" "," ","
## [436] "." "," "," "." "," "." "," "(" ")" "," "," "." "," "," ","
## [451] "," "." "," "," "," "." "." "(" ")" "." "(" ")" "," "," ","
## [466] "." "(" ")" "." "," "." "," "," "." "," "." "," "," "." "("
## [481] ")" "," "." "." "." "(" ")" "." "," "." "." "." "." "," "."
## [496] "." "," "." "," "." "." "." "," "." "," "." "." "," "," "."
## [511] "," "," "." "." "," "." "(" ")" "+" "(" ")" ":" "." "." ","
## [526] "." "." "," "." "." "," "," "," "," "," "," "," "." "," "."
## [541] "." "," "(" "," "/" "," ")" "." "." "(" ")" "," "," "(" ")"
## [556] "," "(" ")" "." "." "." "," "," "." "," "," ":" "." "(" "."
## [571] "." ")" "," "," "," "." "," "." "." "\"" "\"" "\"" "\"" "." ","
## [586] "." "," "\"" "\"" "\"" "\"" "." "," "\"" "\"" "\"" "\"" "," "(" ")"
## [601] "." "," "," "," "," "." "(" ")" "(" ")" "," "." "," "," "."
## [616] "\"" "\"" "." "," "." "," "." "," "." "." "." "," "." "," ","
## [631] "." "," "," "." "," "." "(" "," ")" "." "." "," "." "." "."
## [646] "," "\"" "\"" "." "," "," "/" "/" "." "." "." "." "." "," ","
## [661] "." "," "," "," "." "," "." "\"" "\"" "," "." "," "." "." ","
## [676] "," "," "." "," "." "," "," "." "," "/" "." "." "(" ")" "."
## [691] "." "," "." "," "," "." "." "." "." "," "," "." "," "," "."
## [706] "\"" "\"" "." "." "." "." "," "/" "," "." "," "," "," "." "'"
## [721] "'" "," "," "." "," "," "," "." "," "." "." "."
covid_nopunct <- str_replace_all(covid_eng_lower, "[^[:alnum:][:space:]-%]", " ") # Replace the pattern with a single whitespace character
unlist(str_extract_all(covid_nopunct, "[^[:space:]]*[[:punct:]]{1,}[^[:space:]]*"))
## [1] "covid-19" "covid-" "sars-cov-2"
## [4] "2019-ncov" "multi-organ" "sars-cov-2"
## [7] "3%" "5%" "covid-19"
## [10] "covid-19" "rrt-pcr" "sars-cov-2"
## [13] "rrt-pcr" "covid-19" "7%"
## [16] "anti-cytokine" "%" "flu-like"
## [19] "multi-organ" "covid-19" "5%"
## [22] "20%" "sars-cov-2" "100%"
## [25] "53%" "fecal-oral" "sars-cov-2"
## [28] "covid-19" "real-time" "rrt-pcr"
## [31] "rrt-pcr" "covid-" "point-of-care"
## [34] "x-rays" "ground-" "first-line"
## [37] "covid-" "covid-19" "sars-cov-2"
## [40] "covid-19" "40%" "non-medical"
## [43] "non-medical" "covid-19" "alcohol-based"
## [46] "60%" "covid-19" "first-line"
## [49] "non-steroidal" "anti-inflammatory" "covid-19"
## [52] "off-label" "covid-19" "covid-19"
## [55] "5%" "3%" "4%"
## [58] "20-30%" "covid-19" "ventilator-associated"
## [61] "high-flow" "93%" "end-expiratory"
## [64] "self-quarantine" "40%" "covid-19"
## [67] "5%" "8%" "covid-19"
## [70] "covid-19" "covid-19" "covid-19"
## [73] "multi-organ" "covid-19" "6%"
## [76] "covid-19" "4%" "covid-19"
## [79] "pre-existing" "88%" "4%"
## [82] "9%" "covid-19" "8%"
## [85] "7%" "post-" "8%"
## [88] "under-counting" "covid-19" "long-term"
## [91] "covid-19" "long-term" "20%"
## [94] "30%" "%" "%"
## [97] "human-to-" "icd-10" "lab-confirmed"
## [100] "sars-cov-2" "covid-19" "lab-confirmed"
## [103] "sars-cov-2" "death-to-case" "death-to-case"
## [106] "7%" "covid-19" "covid-19"
## [109] "covid-19" "covid-19" "sars-cov-2"
## [112] "2019-ncov" "covid-19" "covid-"
## [115] "covid-19" "alt-right" "covid-19"
## [118] "sars-cov-2" "sars-cov" "sars-cov-2"
## [121] "sars-cov" "covid-19" "sars-cov-2"
## [124] "s-spike" "covid-19" "multi-country"
## [127] "covid-19" "sars-cov-2" "covid-19"
## [130] "covid-19" "covid-" "sars-cov-2"
## [133] "sars-cov-2" "off-label" "covid-19"
## [136] "anti-cytokine" "life-threatening" "covid-19"
## [139] "anti-cytokine" "interleukin-6" "il-6"
## [142] "covid-19" "sars-cov-2" "antibody-dependent"
## [145] "covid-"
Now we have removed all punctuation marks except the hyphen and the percent sign from the string.
str_trunc(covid_nopunct, 2000, "right") # Check the first 2000 characters in covid_nopunct
## [1] "coronavirus disease 2019 coronavirus disease 2019 covid-19 is an infectious disease caused by severe acute respiratory syndrome coronavirus disease 2019 covid- coronavirus 2 sars-cov-2 7 the disease was first 19 identified in december 2019 in wuhan the capital of other names 2019-ncov acute china s hubei province and has since spread globally resulting in the ongoing 2019 20 coronavirus respiratory disease pandemic 8 9 common symptoms include fever cough novel coronavirus and shortness of breath 10 other symptoms may include pneumonia 1 muscle pain sputum production diarrhea sore throat wuhan pneumonia 2 3 loss of smell and abdominal pain 4 11 12 while the wuhan coronavirus majority of cases result in mild symptoms some progress to viral pneumonia and multi-organ failure 8 13 as of coronavirus or other 28 march 2020 the overall rate of deaths per number of names for sars-cov-2 diagnosed cases is 4 7 percent ranging from 0 2 percent to 15 percent according to age group and other health problems 14 in comparison the mortality rate of the 1918 flu pandemic was approximately 3% to 5% 15 the virus is spread mainly through close contact and via respiratory droplets produced when people cough or sneeze 16 17 respiratory droplets may be produced during breathing but the virus is not generally airborne 16 18 people may also contract covid-19 by touching a contaminated surface and then their face 16 17 it is most contagious when people are symptoms of covid-19 symptomatic although spread may be possible before pronunciation k ro n va r s d zi z symptoms appear 17 the virus can survive on surfaces ko v d up to 72 hours 19 time from exposure to onset of symptoms is generally between two and fourteen days specialty infectious diseases with an average of five days 10 20 the standard method symptoms fever cough shortness of of diagnosis is by reverse transcription polymerase chain breath 4 reaction rrt-pcr fr..."
# But it seems we may want to remove any number that is not attached to a word and does not denote a year
str_extract_all(covid_nopunct, "[[:space:]]+[[:digit:]]{1,3}[[:space:]]+") # A number of one to three digits surrounded by whitespace
## [[1]]
## [1] " 2 " " 7 " " 19 " " 20 " " 8 " " 10 "
## [7] " 1 " " 2 " " 4 " " 12 " " 8 " " 28 "
## [13] " 4 " " 0 " " 15 " " 14 " " 15 " " 16 "
## [19] " 16 " " 16 " " 17 " " 72 " " 19 " " 10 "
## [25] " 4 " " 21 " " 22 " " 5 " " 2 " " 14 "
## [31] " 2 " " 24 " " 26 " " 27 " " 29 " " 30 "
## [37] " 859 " " 5 " " 30 " " 20 " " 42 " " 5 "
## [43] " 4 " " 31 " " 11 " " 9 " " 33 " " 34 "
## [49] " 4 " " 39 " " 87 " " 67 " " 38 " " 40 "
## [55] " 33 " " 15 " " 30 " " 36 " " 18 " " 37 "
## [61] " 42 " " 43 " " 14 " " 13 " " 13 " " 12 "
## [67] " 35 " " 11 " " 8 " " 5 " " 4 " " 44 "
## [73] " 3 " " 31 " " 0 " " 0 " " 14 " " 45 "
## [79] " 97 " " 11 " " 47 " " 48 " " 49 " " 50 "
## [85] " 2 " " 52 " " 53 " " 24 " " 72 " " 72 "
## [91] " 3 " " 100 " " 100 " " 101 " " 18 " " 55 "
## [97] " 90 " " 100 " " 54 " " 55 " " 2 " " 1 "
## [103] " 57 " " 58 " " 55 " " 59 " " 55 " " 59 "
## [109] " 60 " " 61 " " 63 " " 62 " " 55 " " 64 "
## [115] " 66 " " 67 " " 21 " " 69 " " 71 " " 8 "
## [121] " 73 " " 19 " " 74 " " 65 " " 75 " " 21 "
## [127] " 76 " " 22 " " 77 " " 77 " " 78 " " 19 "
## [133] " 80 " " 82 " " 20 " " 88 " " 90 " " 88 "
## [139] " 88 " " 83 " " 85 " " 91 " " 1 " " 92 "
## [145] " 93 " " 84 " " 86 " " 84 " " 94 " " 95 "
## [151] " 96 " " 97 " " 98 " " 99 " " 100 " " 101 "
## [157] " 102 " " 20 " " 88 " " 104 " " 106 " " 108 "
## [163] " 26 " " 109 " " 111 " " 113 " " 115 " " 117 "
## [169] " 118 " " 119 " " 19 " " 105 " " 120 " " 122 "
## [175] " 123 " " 125 " " 126 " " 128 " " 130 " " 131 "
## [181] " 132 " " 133 " " 134 " " 134 " " 60 " " 134 "
## [187] " 80 " " 136 " " 137 " " 137 " " 2 " " 1 "
## [193] " 109 " " 138 " " 139 " " 140 " " 141 " " 142 "
## [199] " 140 " " 30 " " 35 " " 5 " " 1 " " 143 "
## [205] " 144 " " 145 " " 147 " " 143 " " 106 " " 148 "
## [211] " 149 " " 150 " " 151 " " 153 " " 154 " " 155 "
## [217] " 156 " " 48 " " 42 " " 157 " " 159 " " 27 "
## [223] " 160 " " 34 " " 50 " " 0 " " 70 " " 164 "
## [229] " 166 " " 168 " " 170 " " 171 " " 172 " " 173 "
## [235] " 174 " " 10 " " 97 " " 2 " " 175 " " 175 "
## [241] " 14 " " 41 " " 176 " " 162 " " 2 " " 1 "
## [247] " 177 " " 34 " " 11 " " 43 " " 11 " " 163 "
## [253] " 178 " " 179 " " 180 " " 181 " " 11 " " 163 "
## [259] " 183 " " 184 " " 185 " " 186 " " 188 " " 0 "
## [265] " 10 " " 20 " " 30 " " 40 " " 50 " " 60 "
## [271] " 70 " " 80 " " 11 " " 163 " " 0 " " 2 "
## [277] " 2 " " 2 " " 4 " " 3 " " 6 " " 0 "
## [283] " 8 " " 26 " " 174 " " 0 " " 0 " " 0 "
## [289] " 3 " " 7 " " 7 " " 7 " " 9 " " 4 "
## [295] " 27 " " 189 " " 0 " " 0 " " 0 " " 0 "
## [301] " 0 " " 3 " " 7 " " 3 " " 1 " " 30 "
## [307] " 190 " " 0 " " 0 " " 0 " " 1 " " 1 "
## [313] " 6 " " 7 " " 0 " " 3 " " 26 " " 191 "
## [319] " 0 " " 3 " " 2 " " 2 " " 4 " " 6 "
## [325] " 1 " " 7 " " 3 " " 0 " " 20 " " 45 "
## [331] " 55 " " 65 " " 75 " " 85 " " 16 " " 192 "
## [337] " 0 " " 1 " " 2 " " 5 " " 8 " " 4 "
## [343] " 6 " " 7 " " 9 " " 3 " " 5 " " 4 "
## [349] " 3 " " 193 " " 195 " " 163 " " 17 " " 197 "
## [355] " 198 " " 199 " " 1 " " 2 " " 200 " " 4 "
## [361] " 29 " " 634 " " 29 " " 201 " " 202 " " 203 "
## [367] " 20 " " 204 " " 24 " " 205 " " 19 " " 31 "
## [373] " 206 " " 207 " " 2 " " 209 " " 210 " " 19 "
## [379] " 209 " " 211 " " 213 " " 214 " " 216 " " 218 "
## [385] " 220 " " 62 " " 222 " " 223 " " 224 " " 16 "
## [391] " 225 " " 143 " " 10 " " 226 " " 228 " " 229 "
## [397] " 3 " " 143 " " 231 " " 232 " " 233 " " 234 "
## [403] " 17 " " 235 " " 236 " " 28 " " 238 " " 19 "
## [409] " 240 " " 229 " " 2 " " 241 " " 241 " " 243 "
## [415] " 245 " " 246 " " 2 " " 235 " " 249 " " 250 "
## [421] " 252 " " 253 " " 254 " " 255 " " 255 " " 255 "
## [427] " 255 " " 256 " " 19 "
covid_nonum <- str_replace_all(covid_nopunct, "[[:space:]][[:digit:]]{1,3}[[:space:]]", " ")
str_trunc(covid_nonum, 2000) # some numbers remain because adjacent numbers shared a whitespace character that the first match consumed
## [1] "coronavirus disease 2019 coronavirus disease 2019 covid-19 is an infectious disease caused by severe acute respiratory syndrome coronavirus disease 2019 covid- coronavirus sars-cov-2 the disease was first identified in december 2019 in wuhan the capital of other names 2019-ncov acute china s hubei province and has since spread globally resulting in the ongoing 2019 coronavirus respiratory disease pandemic common symptoms include fever cough novel coronavirus and shortness of breath other symptoms may include pneumonia muscle pain sputum production diarrhea sore throat wuhan pneumonia loss of smell and abdominal pain while the wuhan coronavirus majority of cases result in mild symptoms some progress to viral pneumonia and multi-organ failure as of coronavirus or other march 2020 the overall rate of deaths per number of names for sars-cov-2 diagnosed cases is 7 percent ranging from 2 percent to percent according to age group and other health problems in comparison the mortality rate of the 1918 flu pandemic was approximately 3% to 5% the virus is spread mainly through close contact and via respiratory droplets produced when people cough or sneeze respiratory droplets may be produced during breathing but the virus is not generally airborne people may also contract covid-19 by touching a contaminated surface and then their face it is most contagious when people are symptoms of covid-19 symptomatic although spread may be possible before pronunciation k ro n va r s d zi z symptoms appear the virus can survive on surfaces ko v d up to hours time from exposure to onset of symptoms is generally between two and fourteen days specialty infectious diseases with an average of five days the standard method symptoms fever cough shortness of of diagnosis is by reverse transcription polymerase chain breath reaction rrt-pcr from a nasopharyngeal swab the complications pneumonia viral sepsis acute infection c..."
# Run the pattern matching a second time to catch the numbers that were skipped
covid_nonum <- str_replace_all(covid_nonum, "[[:space:]][[:digit:]]{1,3}[[:space:]]", " ")
str_trunc(covid_nonum, 2000)
## [1] "coronavirus disease 2019 coronavirus disease 2019 covid-19 is an infectious disease caused by severe acute respiratory syndrome coronavirus disease 2019 covid- coronavirus sars-cov-2 the disease was first identified in december 2019 in wuhan the capital of other names 2019-ncov acute china s hubei province and has since spread globally resulting in the ongoing 2019 coronavirus respiratory disease pandemic common symptoms include fever cough novel coronavirus and shortness of breath other symptoms may include pneumonia muscle pain sputum production diarrhea sore throat wuhan pneumonia loss of smell and abdominal pain while the wuhan coronavirus majority of cases result in mild symptoms some progress to viral pneumonia and multi-organ failure as of coronavirus or other march 2020 the overall rate of deaths per number of names for sars-cov-2 diagnosed cases is percent ranging from percent to percent according to age group and other health problems in comparison the mortality rate of the 1918 flu pandemic was approximately 3% to 5% the virus is spread mainly through close contact and via respiratory droplets produced when people cough or sneeze respiratory droplets may be produced during breathing but the virus is not generally airborne people may also contract covid-19 by touching a contaminated surface and then their face it is most contagious when people are symptoms of covid-19 symptomatic although spread may be possible before pronunciation k ro n va r s d zi z symptoms appear the virus can survive on surfaces ko v d up to hours time from exposure to onset of symptoms is generally between two and fourteen days specialty infectious diseases with an average of five days the standard method symptoms fever cough shortness of of diagnosis is by reverse transcription polymerase chain breath reaction rrt-pcr from a nasopharyngeal swab the complications pneumonia viral sepsis acute infection can a..."
Now our string is preprocessed: non-English characters, punctuation marks, and numbers have been removed. But the preprocessing has left runs of multiple whitespaces that we still need to collapse.
covid_nospace <- str_squish(covid_nonum) # We can repeat the whitespace deletion process
str_trunc(covid_nospace, 2000)
## [1] "coronavirus disease 2019 coronavirus disease 2019 covid-19 is an infectious disease caused by severe acute respiratory syndrome coronavirus disease 2019 covid- coronavirus sars-cov-2 the disease was first identified in december 2019 in wuhan the capital of other names 2019-ncov acute china s hubei province and has since spread globally resulting in the ongoing 2019 coronavirus respiratory disease pandemic common symptoms include fever cough novel coronavirus and shortness of breath other symptoms may include pneumonia muscle pain sputum production diarrhea sore throat wuhan pneumonia loss of smell and abdominal pain while the wuhan coronavirus majority of cases result in mild symptoms some progress to viral pneumonia and multi-organ failure as of coronavirus or other march 2020 the overall rate of deaths per number of names for sars-cov-2 diagnosed cases is percent ranging from percent to percent according to age group and other health problems in comparison the mortality rate of the 1918 flu pandemic was approximately 3% to 5% the virus is spread mainly through close contact and via respiratory droplets produced when people cough or sneeze respiratory droplets may be produced during breathing but the virus is not generally airborne people may also contract covid-19 by touching a contaminated surface and then their face it is most contagious when people are symptoms of covid-19 symptomatic although spread may be possible before pronunciation k ro n va r s d zi z symptoms appear the virus can survive on surfaces ko v d up to hours time from exposure to onset of symptoms is generally between two and fourteen days specialty infectious diseases with an average of five days the standard method symptoms fever cough shortness of of diagnosis is by reverse transcription polymerase chain breath reaction rrt-pcr from a nasopharyngeal swab the complications pneumonia viral sepsis acute infection can also be diagnosed from a combination of respiratory distress syndrome sympt..."
Finally, we are ready to tokenize the string object, covid_nospace, into words separated by " ".
covid_tidy_word <- unlist(str_split(covid_nospace, " "))
covid_tidy_word[1:50]
## [1] "coronavirus" "disease" "2019" "coronavirus" "disease"
## [6] "2019" "covid-19" "is" "an" "infectious"
## [11] "disease" "caused" "by" "severe" "acute"
## [16] "respiratory" "syndrome" "coronavirus" "disease" "2019"
## [21] "covid-" "coronavirus" "sars-cov-2" "the" "disease"
## [26] "was" "first" "identified" "in" "december"
## [31] "2019" "in" "wuhan" "the" "capital"
## [36] "of" "other" "names" "2019-ncov" "acute"
## [41] "china" "s" "hubei" "province" "and"
## [46] "has" "since" "spread" "globally" "resulting"
covid_tidy_word_freq <- sort(table(covid_tidy_word), decreasing = TRUE) # Create a table of word counts
covid_tidy_word_freq[1:50]
## covid_tidy_word
## the of and to in a
## 304 230 150 125 121 104
## is for with are or as
## 74 69 57 53 49 44
## disease that by covid-19 from virus
## 43 43 42 41 41 39
## people who may be on cases
## 34 34 33 31 31 30
## have symptoms 2020 not respiratory march
## 30 30 28 27 26 24
## also coronavirus health at china infection
## 23 23 23 21 21 21
## was been severe use those time
## 21 20 20 20 19 18
## it treatment but other rate sars-cov-2
## 17 17 16 16 16 16
## some these
## 16 16
One more step in tokenization is to remove stop words. A stop word lexicon is available in the tidytext package.
library(tidytext)
tidytext::stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # … with 1,139 more rows
stop_words %>% dplyr::count(lexicon)
## # A tibble: 3 x 2
## lexicon n
## <chr> <int>
## 1 onix 404
## 2 SMART 571
## 3 snowball 174
smart <- stop_words[stop_words$lexicon=="SMART",] # The stop_words dataset bundles several stop word lexicons in a data frame; here we subset the SMART lexicon
smart$word
## [1] "a" "a's" "able" "about"
## [5] "above" "according" "accordingly" "across"
## [9] "actually" "after" "afterwards" "again"
## [13] "against" "ain't" "all" "allow"
## [17] "allows" "almost" "alone" "along"
## [21] "already" "also" "although" "always"
## [25] "am" "among" "amongst" "an"
## [29] "and" "another" "any" "anybody"
## [33] "anyhow" "anyone" "anything" "anyway"
## [37] "anyways" "anywhere" "apart" "appear"
## [41] "appreciate" "appropriate" "are" "aren't"
## [45] "around" "as" "aside" "ask"
## [49] "asking" "associated" "at" "available"
## [53] "away" "awfully" "b" "be"
## [57] "became" "because" "become" "becomes"
## [61] "becoming" "been" "before" "beforehand"
## [65] "behind" "being" "believe" "below"
## [69] "beside" "besides" "best" "better"
## [73] "between" "beyond" "both" "brief"
## [77] "but" "by" "c" "c'mon"
## [81] "c's" "came" "can" "can't"
## [85] "cannot" "cant" "cause" "causes"
## [89] "certain" "certainly" "changes" "clearly"
## [93] "co" "com" "come" "comes"
## [97] "concerning" "consequently" "consider" "considering"
## [101] "contain" "containing" "contains" "corresponding"
## [105] "could" "couldn't" "course" "currently"
## [109] "d" "definitely" "described" "despite"
## [113] "did" "didn't" "different" "do"
## [117] "does" "doesn't" "doing" "don't"
## [121] "done" "down" "downwards" "during"
## [125] "e" "each" "edu" "eg"
## [129] "eight" "either" "else" "elsewhere"
## [133] "enough" "entirely" "especially" "et"
## [137] "etc" "even" "ever" "every"
## [141] "everybody" "everyone" "everything" "everywhere"
## [145] "ex" "exactly" "example" "except"
## [149] "f" "far" "few" "fifth"
## [153] "first" "five" "followed" "following"
## [157] "follows" "for" "former" "formerly"
## [161] "forth" "four" "from" "further"
## [165] "furthermore" "g" "get" "gets"
## [169] "getting" "given" "gives" "go"
## [173] "goes" "going" "gone" "got"
## [177] "gotten" "greetings" "h" "had"
## [181] "hadn't" "happens" "hardly" "has"
## [185] "hasn't" "have" "haven't" "having"
## [189] "he" "he's" "hello" "help"
## [193] "hence" "her" "here" "here's"
## [197] "hereafter" "hereby" "herein" "hereupon"
## [201] "hers" "herself" "hi" "him"
## [205] "himself" "his" "hither" "hopefully"
## [209] "how" "howbeit" "however" "i"
## [213] "i'd" "i'll" "i'm" "i've"
## [217] "ie" "if" "ignored" "immediate"
## [221] "in" "inasmuch" "inc" "indeed"
## [225] "indicate" "indicated" "indicates" "inner"
## [229] "insofar" "instead" "into" "inward"
## [233] "is" "isn't" "it" "it'd"
## [237] "it'll" "it's" "its" "itself"
## [241] "j" "just" "k" "keep"
## [245] "keeps" "kept" "know" "knows"
## [249] "known" "l" "last" "lately"
## [253] "later" "latter" "latterly" "least"
## [257] "less" "lest" "let" "let's"
## [261] "like" "liked" "likely" "little"
## [265] "look" "looking" "looks" "ltd"
## [269] "m" "mainly" "many" "may"
## [273] "maybe" "me" "mean" "meanwhile"
## [277] "merely" "might" "more" "moreover"
## [281] "most" "mostly" "much" "must"
## [285] "my" "myself" "n" "name"
## [289] "namely" "nd" "near" "nearly"
## [293] "necessary" "need" "needs" "neither"
## [297] "never" "nevertheless" "new" "next"
## [301] "nine" "no" "nobody" "non"
## [305] "none" "noone" "nor" "normally"
## [309] "not" "nothing" "novel" "now"
## [313] "nowhere" "o" "obviously" "of"
## [317] "off" "often" "oh" "ok"
## [321] "okay" "old" "on" "once"
## [325] "one" "ones" "only" "onto"
## [329] "or" "other" "others" "otherwise"
## [333] "ought" "our" "ours" "ourselves"
## [337] "out" "outside" "over" "overall"
## [341] "own" "p" "particular" "particularly"
## [345] "per" "perhaps" "placed" "please"
## [349] "plus" "possible" "presumably" "probably"
## [353] "provides" "q" "que" "quite"
## [357] "qv" "r" "rather" "rd"
## [361] "re" "really" "reasonably" "regarding"
## [365] "regardless" "regards" "relatively" "respectively"
## [369] "right" "s" "said" "same"
## [373] "saw" "say" "saying" "says"
## [377] "second" "secondly" "see" "seeing"
## [381] "seem" "seemed" "seeming" "seems"
## [385] "seen" "self" "selves" "sensible"
## [389] "sent" "serious" "seriously" "seven"
## [393] "several" "shall" "she" "should"
## [397] "shouldn't" "since" "six" "so"
## [401] "some" "somebody" "somehow" "someone"
## [405] "something" "sometime" "sometimes" "somewhat"
## [409] "somewhere" "soon" "sorry" "specified"
## [413] "specify" "specifying" "still" "sub"
## [417] "such" "sup" "sure" "t"
## [421] "t's" "take" "taken" "tell"
## [425] "tends" "th" "than" "thank"
## [429] "thanks" "thanx" "that" "that's"
## [433] "thats" "the" "their" "theirs"
## [437] "them" "themselves" "then" "thence"
## [441] "there" "there's" "thereafter" "thereby"
## [445] "therefore" "therein" "theres" "thereupon"
## [449] "these" "they" "they'd" "they'll"
## [453] "they're" "they've" "think" "third"
## [457] "this" "thorough" "thoroughly" "those"
## [461] "though" "three" "through" "throughout"
## [465] "thru" "thus" "to" "together"
## [469] "too" "took" "toward" "towards"
## [473] "tried" "tries" "truly" "try"
## [477] "trying" "twice" "two" "u"
## [481] "un" "under" "unfortunately" "unless"
## [485] "unlikely" "until" "unto" "up"
## [489] "upon" "us" "use" "used"
## [493] "useful" "uses" "using" "usually"
## [497] "uucp" "v" "value" "various"
## [501] "very" "via" "viz" "vs"
## [505] "w" "want" "wants" "was"
## [509] "wasn't" "way" "we" "we'd"
## [513] "we'll" "we're" "we've" "welcome"
## [517] "well" "went" "were" "weren't"
## [521] "what" "what's" "whatever" "when"
## [525] "whence" "whenever" "where" "where's"
## [529] "whereafter" "whereas" "whereby" "wherein"
## [533] "whereupon" "wherever" "whether" "which"
## [537] "while" "whither" "who" "who's"
## [541] "whoever" "whole" "whom" "whose"
## [545] "why" "will" "willing" "wish"
## [549] "with" "within" "without" "won't"
## [553] "wonder" "would" "would" "wouldn't"
## [557] "x" "y" "yes" "yet"
## [561] "you" "you'd" "you'll" "you're"
## [565] "you've" "your" "yours" "yourself"
## [569] "yourselves" "z" "zero"
covid_tidy_nostop <- covid_tidy_word[!covid_tidy_word %in% smart$word] # %in% is a matching operator; with the negation !, we keep only the elements of covid_tidy_word that do not appear in smart$word
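As a quick illustration of how the negated %in% filter behaves, here is a toy example with made-up vectors:
x <- c("virus", "the", "spread", "of")
x %in% c("the", "of")      # -> FALSE TRUE FALSE TRUE (which elements are stop words)
x[!x %in% c("the", "of")]  # -> "virus" "spread" (keep only the non-stop words)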
covid_tidy_word[1:50]
## [1] "coronavirus" "disease" "2019" "coronavirus" "disease"
## [6] "2019" "covid-19" "is" "an" "infectious"
## [11] "disease" "caused" "by" "severe" "acute"
## [16] "respiratory" "syndrome" "coronavirus" "disease" "2019"
## [21] "covid-" "coronavirus" "sars-cov-2" "the" "disease"
## [26] "was" "first" "identified" "in" "december"
## [31] "2019" "in" "wuhan" "the" "capital"
## [36] "of" "other" "names" "2019-ncov" "acute"
## [41] "china" "s" "hubei" "province" "and"
## [46] "has" "since" "spread" "globally" "resulting"
smart$word[1:50]
## [1] "a" "a's" "able" "about" "above"
## [6] "according" "accordingly" "across" "actually" "after"
## [11] "afterwards" "again" "against" "ain't" "all"
## [16] "allow" "allows" "almost" "alone" "along"
## [21] "already" "also" "although" "always" "am"
## [26] "among" "amongst" "an" "and" "another"
## [31] "any" "anybody" "anyhow" "anyone" "anything"
## [36] "anyway" "anyways" "anywhere" "apart" "appear"
## [41] "appreciate" "appropriate" "are" "aren't" "around"
## [46] "as" "aside" "ask" "asking" "associated"
covid_tidy_nostop[1:50]
## [1] "coronavirus" "disease" "2019" "coronavirus" "disease"
## [6] "2019" "covid-19" "infectious" "disease" "caused"
## [11] "severe" "acute" "respiratory" "syndrome" "coronavirus"
## [16] "disease" "2019" "covid-" "coronavirus" "sars-cov-2"
## [21] "disease" "identified" "december" "2019" "wuhan"
## [26] "capital" "names" "2019-ncov" "acute" "china"
## [31] "hubei" "province" "spread" "globally" "resulting"
## [36] "ongoing" "2019" "coronavirus" "respiratory" "disease"
## [41] "pandemic" "common" "symptoms" "include" "fever"
## [46] "cough" "coronavirus" "shortness" "breath" "symptoms"
covid_tidy_nostop_freq <- sort(table(covid_tidy_nostop), decreasing = TRUE)
names(covid_tidy_nostop_freq)[1:50]
## [1] "disease" "covid-19" "virus" "people" "cases"
## [6] "symptoms" "2020" "respiratory" "march" "coronavirus"
## [11] "health" "china" "infection" "severe" "time"
## [16] "treatment" "rate" "sars-cov-2" "pneumonia" "days"
## [21] "acute" "deaths" "hours" "number" "spread"
## [26] "syndrome" "vaccine" "2019" "data" "infected"
## [31] "recommended" "risk" "wuhan" "ace2" "death"
## [36] "develop" "found" "include" "medical" "case"
## [41] "diagnosed" "masks" "transmission" "ventilation" "blood"
## [46] "cdc" "chinese" "distress" "face" "february"
covid_tidy_nostop_freq[1:50]
## covid_tidy_nostop
## disease covid-19 virus people cases symptoms
## 43 41 39 34 30 30
## 2020 respiratory march coronavirus health china
## 28 26 24 23 23 21
## infection severe time treatment rate sars-cov-2
## 21 20 18 17 16 16
## pneumonia days acute deaths hours number
## 15 14 12 12 12 12
## spread syndrome vaccine 2019 data infected
## 12 12 12 11 11 11
## recommended risk wuhan ace2 death develop
## 11 11 11 10 10 10
## found include medical case diagnosed masks
## 10 10 10 9 9 9
## transmission ventilation blood cdc chinese distress
## 9 9 8 8 8 8
## face february
## 8 8
Let’s create a wordcloud from the table of word frequencies.
library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(words = names(covid_tidy_nostop_freq), # Sequence of unique words
freq = covid_tidy_nostop_freq, # Frequency of words
min.freq = 5, # Minimum frequency of words plotted
random.order = FALSE, # Highly frequent words placed in the middle
rot.per = 0.1, # Rate of words rotated in plot
scale = c(3, 0.3), # Range of words in size
colors = brewer.pal(8, "Dark2")) # Retrieve 8 colors from the list of "Dark2"
Now we have a much more informative wordcloud about COVID-19.
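If you also want to save the wordcloud to an image file, one option is to wrap the same call in a graphics device. A minimal sketch (the file name and dimensions are arbitrary):
png("covid_wordcloud.png", width = 800, height = 800)  # open a PNG device; file name is just an example
wordcloud(words = names(covid_tidy_nostop_freq),
          freq = covid_tidy_nostop_freq,
          min.freq = 5,
          random.order = FALSE,
          rot.per = 0.1,
          scale = c(3, 0.3),
          colors = brewer.pal(8, "Dark2"))
dev.off()  # close the device to write the file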
Extract all words that contain a hyphen (-) from the vector of words, covid_tidy_word, and sort the counts of the extracted words in descending order. Save the result to an object named “HyphenWordTable” and export it as a text file using the following R code. Then submit the file.
HyphenWordTable <- ???
write.table(HyphenWordTable, file="HyphenWordTable.txt", sep=",", quote = FALSE, row.names = FALSE)
1st Hint: We can modify the regex pattern (“[^[:space:]][[:punct:]]{1,}[^[:space:]]”) that was used earlier to match words containing a punctuation mark so that it extracts only the words containing a hyphen.
2nd Hint: Which stringr function extracts all substrings that match a specified regex pattern?
3rd Hint: Which function builds a table of counts for each word?
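For reference, one possible way to put the three hints together (a sketch only; check it against your own solution rather than copying it, and note that the regex simply requires a hyphen somewhere inside the word):
hyphen_words <- unlist(str_extract_all(covid_tidy_word, "[^[:space:]]*-[^[:space:]]*"))  # keep only the tokens that contain a hyphen
HyphenWordTable <- sort(table(hyphen_words), decreasing = TRUE)  # counts of hyphenated words in descending order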