Let me remind you of the functions in the package stringr covered last time.
Function | Description | Similar Base Function |
---|---|---|
str_length() | Number of characters | nchar() |
str_split() | Splits a string into pieces | strsplit() |
str_c() | String concatenation | paste() |
str_squish() | Removes any redundant whitespace | |
str_detect() | Finds a particular pattern of characters | |
str_view_all() | Shows the matching result on the screen | |
All functions in stringr start with "str_", followed by a term related to the task they perform.
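As a quick refresher, the review functions can be tried on a toy string. This is only a minimal sketch: the string `x` and the variable names are made up for illustration and are not part of the lecture data.

```r
library(stringr)

# Hypothetical toy input with leading, trailing, and repeated blanks
x <- "  text   mining  with stringr "

n_chars  <- str_length("stringr")                  # same idea as nchar("stringr")
squished <- str_squish(x)                          # trims the ends, collapses inner whitespace
pieces   <- str_split(squished, " ")               # a list with one character vector per input
joined   <- str_c("text", "mining", sep = "-")     # compare paste("text", "mining", sep = "-")
has_vac  <- str_detect(c("vaccine", "virus"), "vac")  # one TRUE/FALSE per element

squished
pieces[[1]]
```

Note that `str_split()` returns a list even for a single input string, which is why `[[1]]` is needed to get at the pieces.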
Most string functions work with regex, a concise language for describing patterns of text. The following functions are useful for text pre-processing.
Function | Description |
---|---|
str_which() | Returns the positions of all elements that match a pattern in a string vector |
str_subset() | Returns all elements that contain a matching pattern in a string vector |
str_trunc() | Truncates a string |
str_locate() | Locates the first position of a matching pattern in a string |
str_locate_all() | Locates all positions of a matching pattern in a string |
str_extract() | Extracts the first matching pattern from a string |
str_extract_all() | Extracts all matching patterns from a string |
str_replace() | Replaces the first matching pattern in a string |
str_replace_all() | Replaces all matching patterns in a string |
str_remove() | Removes the first matching pattern in a string |
str_remove_all() | Removes all matching patterns in a string |
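Before applying these functions to a long document, it may help to see them on a small vector. The sketch below uses a hypothetical three-element vector `x` (not taken from the PDF) and only illustrates the difference between the "first match" and "all matches" variants.

```r
library(stringr)

# Hypothetical toy vector, not taken from the PDF used below
x <- c("covid-19 vaccine", "flu shot", "sars-cov-2 and covid-19")

idx      <- str_which(x, "covid")             # indices of the matching elements
hits     <- str_subset(x, "covid")            # the matching elements themselves
first    <- str_locate(x[3], "cov")           # start/end of the first match only
all_loc  <- str_locate_all(x[3], "cov")       # list holding a matrix with every match
ext      <- str_extract(x, "[a-z]+-\\d+")     # first match per element, NA where none
ext_all  <- str_extract_all(x[3], "\\d+")     # all matches, returned as a list
rep_one  <- str_replace(x[3], "-", "_")       # replaces the first hyphen only
rep_all  <- str_replace_all(x[3], "-", "_")   # replaces every hyphen
rem_all  <- str_remove_all(x[1], "\\d")       # like str_replace_all(x[1], "\\d", "")
short    <- str_trunc(x[3], width = 10)       # total width 10, including the "..."

rep_one
rep_all
```

The `_all` variants return lists (or replace every match), so their results often need `[[1]]` or `unlist()` before further processing.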
library(stringr)
library(pdftools)
## Using poppler version 0.73.0
cv_text <- pdf_text("COVID-19_Vaccine.pdf")
class(cv_text)
## [1] "character"
length(cv_text)
## [1] 93
cv_string <- str_c(cv_text, collapse = " ") # Collapse a character vector, cv_text, into a single string
length(cv_string)
## [1] 1
Now we have a single string in which text from a Wikipedia page about COVID-19 vaccines is concatenated.
First, we want to remove the References section and everything after it.
str_locate_all(cv_string, "References") # Locate every position of the literal pattern "References" in the string
## [[1]]
## start end
## [1,] 3702 3711
## [2,] 123159 123168
str_trunc(cv_string, width=100, side="right") # Truncate a character string to a total width of 100 characters (including the trailing "..."), removing the characters to the right
## [1] "COVID-19 vaccine\r\nA COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst ..."
cv_trunc <- str_trunc(cv_string, width=123158, side="right")
str_locate_all(cv_trunc, "References")
## [[1]]
## start end
## [1,] 3702 3711
str_length(cv_trunc)
## [1] 123158
str_length(cv_string)
## [1] 286792
Now we know where the literal pattern "References" appears in the string cv_string, and we have truncated the string by removing everything from the second match onward.
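One detail worth keeping in mind is that str_trunc() counts its "..." marker as part of the width, so the last three characters of the truncated string are an ellipsis rather than original text. If that matters, a possible alternative is to cut with str_sub() at the located position. The sketch below assumes a hypothetical toy string `doc` standing in for cv_string.

```r
library(stringr)

# Hypothetical toy document standing in for cv_string
doc <- "Intro. References are cited below. Body text. References [1] Some paper."

hits <- str_locate_all(doc, "References")[[1]]   # matrix with columns start and end
last_start <- hits[nrow(hits), "start"]          # start of the final occurrence

# Keep everything before the final "References"; no "..." is appended
body_only <- str_sub(doc, 1, last_start - 1)
body_only
```

Taking the last row of the match matrix avoids hard-coding a position such as 123158, which would change whenever the source PDF changes.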
Next, it seems we need to deal with whitespace (\n, \r\n, or multiple blanks). Recall that the character class [[:space:]] matches all whitespace characters, including line breaks, and that str_squish() removes the redundant ones.
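To make the connection between [[:space:]] and str_squish() concrete, here is a small sketch on a made-up string (`messy` is hypothetical). It shows that collapsing runs of whitespace with a regex and then trimming gives the same result as a single str_squish() call.

```r
library(stringr)

# Hypothetical input mixing \r\n, \n, and repeated blanks
messy <- "COVID-19 vaccine\r\nA  vaccine   intended\nto provide immunity  "

# Replace each run of whitespace with one space, then trim both ends
by_regex <- str_trim(str_replace_all(messy, "[[:space:]]+", " "))

# str_squish() performs both steps in one call
by_squish <- str_squish(messy)

identical(by_regex, by_squish)
```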
str_trunc(cv_trunc, 100, "right")
## [1] "COVID-19 vaccine\r\nA COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst ..."
str_squish(str_trunc(cv_trunc, 100, "right"))
## [1] "COVID-19 vaccine A COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity against ..."
cv_tidy <- str_squish(cv_trunc) # Remove any redundant whitespace
str_trunc(cv_tidy, 100, "right")
## [1] "COVID-19 vaccine A COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity against se..."
It should look tidier than before. What do we need to do with the string object now? The next step is normalization, that is, converting all letters to a single case (either lower or upper).
So, we may want to use the function tolower to translate the characters of a string into lower case.
Before doing so, we may need to remove all non-ASCII characters, which can be matched by negating the POSIX-style character class [[:ascii:]].
Let’s check what non-ASCII characters are in the string.
str_extract_all(cv_tidy, " [[:word:]]+[^[:ascii:]]+[[:word:]]+") # Extract all words that contain non-ASCII characters ("+" matches the preceding character set one or more times)
## [[1]]
## [1] " COVID<U+2011>19" " COVID<U+2011>19" " COVID<U+2011>19"
## [4] " COVID<U+2011>19" " Pfizer<U+2013>BioNTech" " Oxford<U+2013>AstraZeneca"
## [7] " I<U+2013>II" " COVID<U+2011>19" " COVID<U+2011>19"
## [10] " years<U+2014>and" " COVID<U+2011>19" " COVID<U+2011>19"
## [13] " COVID<U+2011>19" " II<U+2013>III" " COVID<U+2011>19"
## [16] " COVID<U+2011>19" " 3<U+2013>6" " COVID<U+2011>19"
## [19] " COVID<U+2011>19" " COVID<U+2011>19" " COVID<U+2011>19"
## [22] " COVID<U+2011>19" " COVID<U+2011>19" " Pfizer<U+2013>BioNTech"
## [25] " Pfizer<U+2013>BioNTech" " COVID<U+2011>19" " Pfizer<U+2013>BioNTech"
## [28] " COVID<U+2011>19" " COVID<U+2011>19" " COVID<U+2011>19"
## [31] " COVID<U+2011>19" " SARS<U+2011>CoV" " COVID<U+2011>19"
## [34] " COVID<U+2011>19" " COVID<U+2011>19" " COVID<U+2011>19"
## [37] " SARS<U+2011>CoV" " Oxford<U+2013>AstraZeneca" " SARS<U+2011>CoV"
## [40] " COVID<U+2011>19" " COVID<U+2011>19" " I<U+2013>II"
## [43] " COVID<U+2011>19" " II<U+2013>III" " Pfizer<U+2013>BioNTech"
## [46] " Oxford<U+2013>AstraZeneca" " Pfizer<U+2013>BioNTech" " 3<U+2013>4"
## [49] " Oxford<U+2013>AstraZeneca" " 2<U+2013>8" " 4<U+2013>12"
## [52] " 2<U+2013>8" " SARS<U+2011>CoV" " 3<U+2013>4"
## [55] " 2<U+2013>8" " SARS<U+2011>CoV" " 2<U+2013>8"
## [58] " 2<U+2013>8" " Mar<U+2013>Dec" " 2<U+2013>8"
## [61] " SARS<U+2011>CoV" " 2<U+2013>8" " 3<U+2013>4"
## [64] " 2<U+2013>8" " SARS<U+2011>CoV" " COVID<U+2011>19"
## [67] " I<U+2013>III" " I<U+2013>II" " I<U+2013>II"
## [70] " I<U+2013>II" " SARS<U+2011>CoV" " I<U+2013>II"
## [73] " SARS<U+2011>CoV" " I<U+2013>II" " I<U+2013>II"
## [76] " SARS<U+2011>CoV" " II<U+2013>III" " II<U+2013>III"
## [79] " II<U+2013>III" " 2020<U+2013>Jan" " 2021<U+2013>Mar"
## [82] " II<U+2013>III" " 18<U+2013>55" " 65<U+2013>85"
## [85] " Mar<U+2013>May" " Aug<U+2013>Dec" " 2020<U+2013>Jan"
## [88] " SARS<U+2011>CoV" " II<U+2013>III" " I<U+2013>II"
## [91] " II<U+2013>III" " I<U+2013>II" " I<U+2013>II"
## [94] " I<U+2013>II" " I<U+2013>II" " Duke<U+2013>NUS"
## [97] " I<U+2013>II" " I<U+2013>II" " I<U+2013>II"
## [100] " I<U+2013>II" " SARS<U+2011>CoV" " I<U+2013>II"
## [103] " I<U+2013>II" " I<U+2013>II" " I<U+2013>II"
## [106] " COVID<U+2011>19" " I<U+2013>II" " SARS<U+2011>CoV"
## [109] " I<U+2013>II" " SARS<U+2011>CoV" " I<U+2013>II"
## [112] " I<U+2013>II" " I<U+2013>II" " I<U+2013>II"
## [115] " SARS<U+2011>CoV" " I<U+2013>II" " I<U+2013>II"
## [118] " I<U+2013>II" " SARS<U+2011>CoV" " Jul<U+2013>Oct"
## [121] " Pfizer<U+2013>BioNTech" " Pfizer<U+2013>BioNTech" " I<U+2013>IIa"
## [124] " II<U+2013>III" " COVID<U+2011>19" " COVID<U+2011>19"
## [127] " Pfizer<U+2013>BioNTech" " SARS<U+2011>CoV" " COVID<U+2011>19"
## [130] " Pfizer<U+2013>BioNTech" " Oxford<U+2013>AstraZeneca" " 72<U+2013>100"
## [133] " SARS<U+2011>CoV" " Oxford<U+2013>AstraZeneca" " 42<U+2013>89"
## [136] " 71<U+2013>91" " COVID<U+2011>19" " COVID<U+2011>19"
## [139] " COVID<U+2011>19" " COVID<U+2011>19" " COVID<U+2011>19"
## [142] " COVID<U+2011>19" " COVID<U+2011>19" " COVID<U+2011>19"
## [145] " COVID<U+2011>19" " COVID<U+2011>19" " COVID<U+2011>19"
## [148] " COVID<U+2011>19" " COVID<U+2011>19" " COVID<U+2011>19"
## [151] " COVID<U+2011>19" " COVID<U+2011>19" " COVID<U+2011>19"
## [154] " COVID<U+2011>19" " COVID<U+2011>19"
str_extract_all(cv_tidy, " [[:word:]]+[-‑–—]+[[:word:]]+") # Words joined by a hyphen, non-breaking hyphen, en dash, or em dash
## [[1]]
## [1] " 2019" " 2020" " 2020"
## [4] " SARS-CoV" " COVID-19" " COVID-19"
## [7] " 100" " 2021" " RBD-Dimer"
## [10] " 2021" " 308" " EUA"
## [13] " 2021" " 436" " EUA"
## [16] " AstraZeneca-Oxford" " 2021" " Pfizer-BioNTech"
## [19] " EUA" " EUA" " 600"
## [22] " 500" " 2021" " 2020"
## [25] " high-income" " 501" " 2003"
## [28] " non-human" " 2005" " 2006"
## [31] " 2020" " COVID-19" " MERS-CoV"
## [34] " 2020" " viral-vectored" " adenoviral-vectored"
## [37] " BVRS-GamVac" " MVA-vectored" " 2020"
## [40] " COVID-19" " COVID-19" " 2020"
## [43] " COVID-19" " G20" " 2020"
## [46] " cross-discipline" " 2020" " multi-site"
## [49] " low-rate" " 2020" " disease-fighting"
## [52] " 2020" " fast-track" " 2019"
## [55] " 2020" " COVID-19" " 2020"
## [58] " 2020" " 2020" " 2020"
## [61] " 2020" " 2020" " 2020"
## [64] " high-risk" " 2020" " EUA"
## [67] " BNT162b2" " 2020" " 2020"
## [70] " BBIBP-CorV" " 2020" " EUA"
## [73] " mRNA-1273" " 2021" " 2020"
## [76] " non-replicating" " nucleoside-modified" " COVID-19"
## [79] " 2021" " Pfizer-BioNTech" " COVID-19"
## [82] " 2021" " non-replicating" " vector-based"
## [85] " COVID-19" " non-replicating" " 2021"
## [88] " COVID-19" " COVID-19" " one-shot"
## [91] " Ad26" " 2021" " BBIBP-CorV"
## [94] " COVID-19" " 2021" " COVID-19"
## [97] " RBD-Dimer" " V451" " placebo-controlled"
## [100] " WHO-recognized" " RBD-Dimer" " COVID-19"
## [103] " placebo-controlled" " 2020" " 2020"
## [106] " 2020" " COVID-19" " Ad26"
## [109] " 2020" " 2021" " COVID-19"
## [112] " placebo-controlled" " 2020" " 2021"
## [115] " BBIBP-CorV" " double-blind" " placebo-controlled"
## [118] " 2020" " 2021" " Double-blind"
## [121] " placebo-controlled" " 100" " 2020"
## [124] " 2021" " 2020" " 2021"
## [127] " COVID-19" " placebo-controlled" " 2020"
## [130] " 2020" " 2020" " 2022"
## [133] " COVID-19" " double-blinded" " Ad26"
## [136] " placebo-controlled" " 2021" " 2020"
## [139] " 2023" " Ad5-nCoV" " multi-center"
## [142] " 2021" " COVID-19" " 2020"
## [145] " 2020" " 2021" " 2020"
## [148] " BBV152" " observer-blinded" " peer-reviewed"
## [151] " 2020" " 2021" " 2020"
## [154] " 2021" " ZF2001" " double-blind"
## [157] " 2020" " 2022" " Double-blind"
## [160] " placebo-controlled" " COVID-19" " 2020"
## [163] " 2021" " 2020" " 2021"
## [166] " FINLAY-FR" " Non-randomized" " parallel-group"
## [169] " double-blind" " 2021" " 2020"
## [172] " 2021" " COVID-19" " 2020"
## [175] " 2021" " observer-blind" " 2020"
## [178] " 2021" " double-blinded" " double-blinded"
## [181] " single-center" " single-center" " 2020"
## [184] " 2021" " QazCovid-in" " 2020"
## [187] " 2020" " 2021" " ZyCoV-D"
## [190] " double-blind" " 2021" " 2020"
## [193] " 2021" " Virus-like" " plant-based"
## [196] " AS03" " Event-driven" " 10x"
## [199] " 2020" " COVID-19" " 2021"
## [202] " 2020" " 2021" " SCB-2019"
## [205] " 2020" " AS03" " 2021"
## [208] " double-blind" " 2021" " 2022"
## [211] " UB-612" " Open-label" " 2020"
## [214] " 2021" " Observer-blind" " IIb-III"
## [217] " Double-Blind" " 2021" " 2023"
## [220] " GRAd-COV2" " observer-blind" " 2021"
## [223] " 24-week" " GRAd-COV2" " 2020"
## [226] " Single-center" " Double-blind" " Double-Blinded"
## [229] " 2020" " Double-Blinded" " 2020"
## [232] " 2021" " MVC-COV1901" " open-labeled"
## [235] " double-blinded" " single-center" " multi-center"
## [238] " 2020" " 2021" " multi-regional"
## [241] " 2020" " 2021" " double-blind"
## [244] " double-blind" " 2020" " 2021"
## [247] " 2020" " 2021" " double-blind"
## [250] " 2020" " 2021" " 2021"
## [253] " 2021" " 2020" " 2022"
## [256] " 2021" " INO-4800" " Open-label"
## [259] " Ib-IIa" " 2020" " 2020"
## [262] " 2021" " 2022" " Ib-IIa"
## [265] " AG0302-COVID" " double-blind" " single-center"
## [268] " 2020" " 2020" " 2021"
## [271] " I-IIa" " SARS-CoV" " SARS-CoV"
## [274] " AS03" " 2020" " 2022"
## [277] " IIBR-100" " 2020" " 2021"
## [280] " ARCT-021" " COV19" " double-blinded"
## [283] " observer-blind" " 2020" " 2022"
## [286] " VBI-2902a" " Virus-like" " observer-blind"
## [289] " 2021" " 2022" " MRT5500"
## [292] " First-in" " CoV-2" " 2021"
## [295] " 2022" " EuCorVac-19" " observer-blind"
## [298] " 2021" " 2022" " GX-19"
## [301] " I-II" " 210" " double-blind"
## [304] " 2020" " 2021" " VLA2001"
## [307] " multi-center" " double-blinded" " 2020"
## [310] " 2021" " TAK-919" " observer-blind"
## [313] " 2021" " 2022" " TAK-019"
## [316] " observer-blind" " 2021" " 2022"
## [319] " COVID-eVax" " in-human" " 2021"
## [322] " 2020" " 2023" " LV-SMENP"
## [325] " 2020" " 2023" " ChulaCov19"
## [328] " Dose-finding" " 2021" " LNP-nCoVsaRNA"
## [331] " 200" " 2020" " 2021"
## [334] " COVAX-19" " 2020" " 2021"
## [337] " HGC019" " 2021" " COVID-19"
## [340] " 2020" " 2021" " 2021"
## [343] " 2022" " PTX-COVID19" " 2021"
## [346] " COVAC-2" " 2021" " 2022"
## [349] " COVI-VAC" " First-in" " double-blind"
## [352] " dose-escalation" " 2020" " 2021"
## [355] " Open-label" " 2021" " double-blind"
## [358] " 2020" " 2021" " Double-blind"
## [361] " dose-ranging" " 2021" " 2022"
## [364] " BBV154" " double-blinded" " 2021"
## [367] " MV-014-212" " double-blinded" " 2021"
## [370] " 2022" " S-268019" " double-blind"
## [373] " parallel-group" " 2020" " 2022"
## [376] " GBP510" " 2021" " CIGB-66"
## [379] " double-blind" " 2020" " 2021"
## [382] " KBP-201" " First-in" " 2020"
## [385] " 2021" " AdimrSC-2f" " open-label"
## [388] " dose-finding" " 2020" " ERUCOV-VAC"
## [391] " COVID-19" " ERUCOV-VAC" " 2020"
## [394] " 2021" " AKS-452" " Single-center"
## [397] " open-label" " 2021" " GLS-5310"
## [400] " Double-blind" " 2020" " 2022"
## [403] " VAX-001" " observer-blind" " 2021"
## [406] " COH04S1" " 2020" " 2022"
## [409] " 2021" " NBP2001" " 2020"
## [412] " 2021" " CoVac-1" " 2020"
## [415] " 2021" " bacTRL-Spike" " observer-blind"
## [418] " 2020" " 2022" " 2021"
## [421] " CORVax12" " Open-label" " 2020"
## [424] " 2021" " ChAdV68-S" " Open-label"
## [427] " 2021" " 2022" " Double-blind"
## [430] " in-Human" " 2021" " 2022"
## [433] " VXA-CoV2" " Double-blind" " in-Human"
## [436] " 2020" " SARS-CoV" " double-blind"
## [439] " dose-ranging" " 2020" " SARS-CoV"
## [442] " CoV-2" " Long-term" " COVID-19"
## [445] " Gam-COVID" " Gam-COVID" " 2-8"
## [448] " nCoV-19" " middle-income" " nCoV-19"
## [451] " Virus-like" " double-blind" " COVID-19"
## [454] " COVID-19" " COVID-19" " SARS-CoV"
## [457] " COVID-19" " COVID-19" " BBIBP-CorV"
## [460] " 100" " SARS-CoV" " COVID-19"
## [463] " COVID-19" " 2021" " SARS-CoV"
## [466] " 2020" " non-B" " 501"
## [469] " 501" " 2021" " two-thirds"
## [472] " 501" " Ad26" " COVID-19"
## [475] " 2021" " COVID-19" " 501"
## [478] " AZD1222" " COVID-19" " 2021"
## [481] " 2020" " protein-based" " vector-based"
## [484] " 2021" " 436" " 2020"
## [487] " first-service" " taxpayer-funded" " 2020"
## [490] " late-stage" " densely-populated" " 2020"
## [493] " SARS-CoV" " COVID-19" " 2024"
## [496] " side-effects" " COVID-19" " high-income"
## [499] " 2020" " pre-sold" " high-income"
## [502] " 2021" " Director-General" " higher-income"
## [505] " lowest-income" " COVID-19" " 2020"
## [508] " long-standing" " COVID-19" " 100"
## [511] " 200" " 910" " 902"
## [514] " 824" " 744" " 732"
## [517] " 631" " 2021" " 549"
## [520] " COVID-19" " 505" " Anti-vaccination"
## [523] " 436" " COVID-19" " 420"
## [526] " 417" " COVID-19" " 402"
## [529] " 334" " 322" " 2020"
## [532] " 311" " 309" " 2021"
## [535] " 240" " 212" " 209"
## [538] " 2009" " 209"
cv_hyphen <- str_replace_all(cv_tidy, "[-‑–—]", "-") # Replace every hyphen, non-breaking hyphen, en dash, and em dash with the ASCII hyphen "-"
str_extract_all(cv_hyphen, " [[:word:]]+[^[:ascii:]]+[[:word:]]+")
## [[1]]
## [1] " Pfizer<U+2013>BioNTech" " Oxford<U+2013>AstraZeneca" " I<U+2013>II"
## [4] " years<U+2014>and" " II<U+2013>III" " Pfizer<U+2013>BioNTech"
## [7] " Pfizer<U+2013>BioNTech" " Pfizer<U+2013>BioNTech" " SARS<U+2011>CoV"
## [10] " SARS<U+2011>CoV" " Oxford<U+2013>AstraZeneca" " SARS<U+2011>CoV"
## [13] " I<U+2013>II" " II<U+2013>III" " Pfizer<U+2013>BioNTech"
## [16] " Oxford<U+2013>AstraZeneca" " Pfizer<U+2013>BioNTech" " Oxford<U+2013>AstraZeneca"
## [19] " SARS<U+2011>CoV" " SARS<U+2011>CoV" " Mar<U+2013>Dec"
## [22] " SARS<U+2011>CoV" " SARS<U+2011>CoV" " I<U+2013>III"
## [25] " I<U+2013>II" " I<U+2013>II" " I<U+2013>II"
## [28] " SARS<U+2011>CoV" " I<U+2013>II" " SARS<U+2011>CoV"
## [31] " I<U+2013>II" " I<U+2013>II" " SARS<U+2011>CoV"
## [34] " II<U+2013>III" " II<U+2013>III" " II<U+2013>III"
## [37] " II<U+2013>III" " 65<U+2013>85" " Mar<U+2013>May"
## [40] " Aug<U+2013>Dec" " SARS<U+2011>CoV" " II<U+2013>III"
## [43] " I<U+2013>II" " II<U+2013>III" " I<U+2013>II"
## [46] " I<U+2013>II" " I<U+2013>II" " I<U+2013>II"
## [49] " Duke<U+2013>N" " I<U+2013>II" " I<U+2013>II"
## [52] " I<U+2013>II" " I<U+2013>II" " SARS<U+2011>CoV"
## [55] " I<U+2013>II" " I<U+2013>II" " I<U+2013>II"
## [58] " I<U+2013>II" " I<U+2013>II" " SARS<U+2011>CoV"
## [61] " I<U+2013>II" " SARS<U+2011>CoV" " I<U+2013>II"
## [64] " I<U+2013>II" " I<U+2013>II" " I<U+2013>II"
## [67] " SARS<U+2011>CoV" " I<U+2013>II" " I<U+2013>II"
## [70] " I<U+2013>II" " SARS<U+2011>CoV" " Jul<U+2013>Oct"
## [73] " Pfizer<U+2013>BioNTech" " Pfizer<U+2013>BioNTech" " I<U+2013>IIa"
## [76] " II<U+2013>III" " Pfizer<U+2013>BioNTech" " SARS<U+2011>CoV"
## [79] " Pfizer<U+2013>BioNTech" " Oxford<U+2013>AstraZeneca" " SARS<U+2011>CoV"
## [82] " Oxford<U+2013>AstraZeneca"
str_extract_all(cv_hyphen, "[^[:ascii:]]+")
## [[1]]
## [1] "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2014>" "<U+2011>" "<U+2011>"
## [16] "<U+2013>" "<U+2013>" "<U+2011>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>"
## [31] "<U+2013>" "<U+2013>" "<U+2011>" "<U+2013>" "<U+2011>" "<U+2013>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>"
## [46] "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2013>" "<U+2013>"
## [61] "<U+2013>" "<U+2013>" "±" "°" "<U+2013>" "<U+2013>" "≤" "°" "<U+2013>" "<U+2013>" "<U+2013>" "°" "<U+2013>" "<U+2013>" "<U+2013>"
## [76] "°" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "°" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "±" "°" "<U+2013>" "<U+2013>"
## [91] "°" "<U+2013>" "<U+2013>" "°" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "°" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "°" "<U+2013>"
## [106] "<U+2013>" "<U+2013>" "<U+2013>" "°" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>"
## [121] "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>"
## [136] "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>"
## [151] "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>"
## [166] "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>"
## [181] "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>"
## [196] "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>"
## [211] "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>"
## [226] "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>"
## [241] "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>"
## [256] "<U+2013>" "<U+2013>" "<U+2212>" "<U+2212>" "°" "<U+2212>" "°" "°" "°" "°" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>"
## [271] "<U+2011>" "<U+2011>" "<U+2011>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2013>" "<U+2248>"
## [286] "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2248>" "<U+2248>" "<U+2248>" "<U+2248>" "<U+2248>" "<U+2013>"
## [301] "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2212>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>"
## [316] "<U+2248>" "<U+2248>" "°" "°" "<U+2248>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2011>"
## [331] "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2011>"
## [346] "<U+2011>" "<U+2011>" "<U+2011>"
cv_ascii <- str_replace_all(cv_hyphen, "[^[:ascii:]]+", " ") # Replace any run of non-ASCII characters with a blank " ".
cv_ascii_lower <- tolower(cv_ascii) # Translate all characters into lower-case letters
# if you have an error message, you may try a stringr function, str_to_lower, instead.
str_trunc(cv_ascii, 1000)
## [1] "COVID--9 vaccine A COVID -9 vaccine is a vaccine intended to provide acquired immunity against severe acute respiratory syndrome coronavirus - (SARS CoV -), the virus causing coronavirus disease ---9 (COVID -9). Prior to the COVID -9 pandemic, there was an established body of knowledge about the structure and function of coronaviruses causing diseases like severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS), which enabled accelerated development of various vaccine technologies during early ----.[-] On -- January ----, the SARS-CoV-- genetic sequence data was shared through GISAID, and by -9 March, the global pharmaceutical industry announced a major commitment to address COVID--9.[-] COVID--9 vaccination doses administered per --- people In Phase III trials, several COVID -9 vaccines have demonstrated efficacy as high as 95% in preventing symptomatic COVID -9 infections. As of March ----, -- vaccines were authorized by at least one national regulator..."
str_trunc(cv_ascii_lower, 1000, "right")
## [1] "covid--9 vaccine a covid -9 vaccine is a vaccine intended to provide acquired immunity against severe acute respiratory syndrome coronavirus - (sars cov -), the virus causing coronavirus disease ---9 (covid -9). prior to the covid -9 pandemic, there was an established body of knowledge about the structure and function of coronaviruses causing diseases like severe acute respiratory syndrome (sars) and middle east respiratory syndrome (mers), which enabled accelerated development of various vaccine technologies during early ----.[-] on -- january ----, the sars-cov-- genetic sequence data was shared through gisaid, and by -9 march, the global pharmaceutical industry announced a major commitment to address covid--9.[-] covid--9 vaccination doses administered per --- people in phase iii trials, several covid -9 vaccines have demonstrated efficacy as high as 95% in preventing symptomatic covid -9 infections. as of march ----, -- vaccines were authorized by at least one national regulator..."
cv_tidy <- str_to_lower(cv_tidy)
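Note that in the truncated output above, "COVID-19" appears as "COVID--9" and years appear as "----", so the dash character class seems to have matched some digits as well. One plausible cause (an assumption, not confirmed here) is that the literal dash characters typed into the pattern were affected by the script's file encoding. A possible way to avoid this is to name the dash variants by Unicode code point, so the pattern itself contains only ASCII. The sketch below uses a hypothetical toy string `x`; the choice of code points (U+2010 to U+2015 plus the minus sign U+2212) is also an assumption about which dashes should be normalized.

```r
library(stringr)

# Hypothetical toy string: U+2011 non-breaking hyphen, U+2013 en dash, U+2014 em dash
x <- "COVID\u201119 and Pfizer\u2013BioNTech, 2020\u20142021"

# Dash variants written as regex escapes, so the pattern is plain ASCII
dashes <- "[\\u2010-\\u2015\\u2212]"

x_ascii <- str_replace_all(x, dashes, "-")
x_ascii

# Check that no non-ASCII character is left and that digits were not touched
any_non_ascii <- str_detect(x_ascii, "[^[:ascii:]]")
any_non_ascii
```

With this pattern the digits stay intact, so a later step that looks for numbers such as "95%" or "65.7" would see the original values.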
Now, let’s think about how to deal with punctuation and numbers.
# Check which punctuation marks are to be removed
unlist(str_extract_all(cv_ascii_lower, "[^[:space:]]*[[:punct:]]{1,}[^[:space:]]*"))[1:100] # Remember why we apply the unlist function to the result from str_extract_all
## [1] "covid--9" "-9" "-"
## [4] "(sars" "-)," "---9"
## [7] "(covid" "-9)." "-9"
## [10] "pandemic," "(sars)" "(mers),"
## [13] "----.[-]" "--" "----,"
## [16] "sars-cov--" "gisaid," "-9"
## [19] "march," "covid--9.[-]" "covid--9"
## [22] "---" "trials," "-9"
## [25] "95%" "-9" "infections."
## [28] "----," "--" "use:"
## [31] "(the" "vaccine)," "(bbibp-"
## [34] "corv," "coronavac," "covaxin,"
## [37] "covivac)," "(sputnik" "v,"
## [40] "vaccine," "convidecia," "&"
## [43] "vaccine)," "(epivaccorona" "rbd-dimer).[-]"
## [46] "total," "----," "--8"
## [49] "development," "7-" "research,"
## [52] "--" "trials," "--"
## [55] "trials," "-6" "development.[-]"
## [58] "use," "complications," "elderly,"
## [61] "e-a" "(or" "equivalent)"
## [64] "granted," "transmission," "workers.[-]"
## [67] "--" "----," "--6.-7"
## [70] "-9" "e-a" "granted,"
## [73] "use," "agencies.[5]" "astrazeneca-oxford"
## [76] "-" "----," "pfizer-biontech"
## [79] "-.-" "doses," "v,"
## [82] "sinopharm," "e-a" "granted,"
## [85] "sinovac," "&" "-"
## [88] "each." "e-a" "6--"
## [91] "5--" "----.[6][7]" "----,"
## [94] "--" "countries,[8]" "high-income"
## [97] "--%" "world's" "population.[9]"
## [100] "5--.v-"
It seems we have several patterns with punctuation that we want to remove from the text:
1) Citation mark: "\\[\\d+\\]"
2) Number, point, number (a decimal): "\\d+\\.\\d+"
3) Apostrophe followed by s: "[[:word:]]+[']s"
These three patterns are to be replaced with a blank.
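The effect of combining the three patterns with "|" can be sketched on a single made-up sentence (`x` is hypothetical). For the apostrophe case this sketch matches only the "'s" suffix followed by whitespace, so the word itself is kept.

```r
library(stringr)

# Hypothetical sentence containing all three patterns
x <- "the world's supply rose 65.7 percent.[12] pfizer's share fell."

# Citation mark | decimal number | possessive 's followed by whitespace
pattern <- "\\[\\d+\\]|\\d+\\.\\d+|'s[[:space:]]"

matched <- str_extract_all(x, pattern)[[1]]              # what will be replaced
cleaned <- str_squish(str_replace_all(x, pattern, " "))  # replace, then tidy the blanks

matched
cleaned
```

Replacing with a blank rather than an empty string keeps neighbouring words from being glued together; str_squish() then removes the extra blanks.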
str_extract_all(cv_ascii_lower, "\\[\\d+\\]|\\d+\\.\\d+|[']s[[:space:]]") # First check that our regex matches the intended patterns
## [[1]]
## [1] "[5]" "[6]" "[7]" "[8]" "'s " "[9]" "'s " "[55]" "[56]" "[57]"
## [11] "[55]" "[58]" "[59]" "[57]" "[65]" "[66]" "[67]" "'s " "[68]" "[69]"
## [21] "[75]" "[76]" "[77]" "[78]" "[79]" "[85]" "[86]" "[87]" "[88]" "[89]"
## [31] "[95]" "[96]" "[97]" "[98]" "[99]" "'s " "'s " "'s " "[87]" "[88]"
## [41] "[89]" "[87]" "[95]" "[96]" "'s " "[98]" "[99]" "65.7" "'s " "97.5"
## [51] "'s " "9.5" "8.6" "8.9" "59.8" "8.8" "'s " "8.8" "8.6" "5.6"
## [61] "8.9" "6.5" "9.7" "8.9" "8.6" "9.9" "7.5" "6.8" "'s "
cv_nocite <- str_replace_all(cv_ascii_lower, "\\[\\d+\\]|\\d+\\.\\d+|[']s[[:space:]]", " ")
unlist(str_extract_all(cv_nocite, "[^[:space:]]*[[:punct:]]{1,}[^[:space:]]*"))[1:100]
## [1] "covid--9" "-9" "-"
## [4] "(sars" "-)," "---9"
## [7] "(covid" "-9)." "-9"
## [10] "pandemic," "(sars)" "(mers),"
## [13] "----.[-]" "--" "----,"
## [16] "sars-cov--" "gisaid," "-9"
## [19] "march," "covid--9.[-]" "covid--9"
## [22] "---" "trials," "-9"
## [25] "95%" "-9" "infections."
## [28] "----," "--" "use:"
## [31] "(the" "vaccine)," "(bbibp-"
## [34] "corv," "coronavac," "covaxin,"
## [37] "covivac)," "(sputnik" "v,"
## [40] "vaccine," "convidecia," "&"
## [43] "vaccine)," "(epivaccorona" "rbd-dimer).[-]"
## [46] "total," "----," "--8"
## [49] "development," "7-" "research,"
## [52] "--" "trials," "--"
## [55] "trials," "-6" "development.[-]"
## [58] "use," "complications," "elderly,"
## [61] "e-a" "(or" "equivalent)"
## [64] "granted," "transmission," "workers.[-]"
## [67] "--" "----," "--6.-7"
## [70] "-9" "e-a" "granted,"
## [73] "use," "agencies." "astrazeneca-oxford"
## [76] "-" "----," "pfizer-biontech"
## [79] "-.-" "doses," "v,"
## [82] "sinopharm," "e-a" "granted,"
## [85] "sinovac," "&" "-"
## [88] "each." "e-a" "6--"
## [91] "5--" "----." "----,"
## [94] "--" "countries," "high-income"
## [97] "--%" "population." "5--.v-"
## [100] "-9,"
However, what about punctuation marks that form part of a word, such as the hyphen in “covid-19”, “sars-cov-2”, or “pfizer-biontech”? And what about a percentage like 95%? We may not want to remove the hyphen and the percent sign from the string, so for convenience we remove every punctuation mark except those two.
# How can we form a regex that matches every punctuation character except the hyphen and %?
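As a small illustration of this idea, the sketch below removes punctuation from a made-up sentence (`x` is hypothetical) while keeping hyphens and percent signs. Escaping the hyphen inside the character class is a defensive choice made here, so that it cannot be read as a range or set operator.

```r
library(stringr)

# Hypothetical sentence with hyphenated terms, a percentage, and other punctuation
x <- "covid-19 vaccines (e.g., pfizer-biontech) were 95% effective; see \"trials\"."

# Anything that is not a letter, digit, whitespace, percent sign, or hyphen
keep_hyphen_percent <- "[^[:alnum:][:space:]%\\-]"

removed <- str_extract_all(x, keep_hyphen_percent)[[1]]   # characters that will be dropped
kept    <- str_replace_all(x, keep_hyphen_percent, "")    # text with those characters removed

removed
kept
```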
str_extract_all(cv_nocite, "[^[:alnum:][:space:]-%]") # A negation of any letter/number/whitespace/hyphen/percent character
## [[1]]
## [1] "(" ")" "," "(" ")" "." "," "(" ")" "(" ")" "," "." "["
## [15] "]" "," "," "," "." "[" "]" "," "." "," ":" "(" ")" ","
## [29] "(" "," "," "," ")" "," "(" "," "," "," "&" ")" "," "("
## [43] ")" "." "[" "]" "," "," "," "," "," "," "." "[" "]" ","
## [57] "," "," "(" ")" "," "," "." "[" "]" "," "." "," "," "."
## [71] "," "." "," "," "," "," "," "&" "." "." "," "," "." "."
## [85] "," "." "[" "]" "," "," "(" ")" "," "," "." "[" "]" "("
## [99] ")" "(" ")" "." "[" "]" "[" "]" "." "," "." "[" "]" "["
## [113] "]" "[" "]" "," "." "[" "]" "[" "]" "." "[" "]" "," "."
## [127] "[" "]" "[" "]" "," "(" ")" "[" "]" "," ":" "(" "," ")"
## [141] "(" ")" "." "[" "]" "," "." "[" "]" "(" ")" "," "," ","
## [155] "." "[" "]" "[" "]" "," ":" "[" "]" "[" "]" "(" ")" ","
## [169] "," "," "," "(" "," "." ")" "," "," "," "," "," "." "."
## [183] "(" "," ")" "." "," "," "." "[" "]" "," "," "," "." "["
## [197] "]" "[" "]" "," "," "[" "]" "," "," "." "[" "]" "[" "]"
## [211] "," "." "[" "]" "[" "]" "[" "]" "," "." "[" "]" "[" "]"
## [225] "," "," "," "." "[" "]" "[" "]" "\"" "\"" "," ";" "[" "]"
## [239] "[" "]" "." "[" "]" "," "." "[" "]" "[" "]" "[" "]" ","
## [253] "(" ")" "." "[" "]" "[" "]" "." "," ":" "(" "\"" "\"" ")"
## [267] "," "," "," "." "[" "]" "\"" "&" "(" ")" "\"" "\"" "," ","
## [281] "," "\"" "\"" "." "\"" "," "." "[" "]" "." "," "\"" "\"" ","
## [295] "[" "]" "." "[" "]" "\"" "\"" "." "[" "]" "," "," "." "["
## [309] "]" "," "," "." "." "[" "]" "," "(" ")" "," "," "." "["
## [323] "]" "[" "]" "," "." "[" "]" "[" "]" "," "&" "," "," "("
## [337] ")" "," "," "," "." "[" "]" "[" "]" "," "[" "]" "," "."
## [351] "[" "]" "," "." "," "," "," "," "." "[" "]" "(" ")" ","
## [365] "." "(" ")" "," "," "." "[" "]" "," "(" ")" "," "," "."
## [379] "[" "]" "," "[" "]" "(" "," ")" "." "[" "]" "[" "]" ","
## [393] "." "[" "]" "," "," "." "(" ")" "." "," "(" ")" "," "."
## [407] "[" "]" "[" "]" "[" "]" "," "[" "]" "." "," "." "[" "]"
## [421] "," "(" ")" "(" ")" "." "," "," "." "," "." "[" "]" "["
## [435] "]" "." "[" "]" "(" ":" "(" ")" ")" "," "," "," "," "("
## [449] ")" "," "(" ")" "," "," "." "[" "]" "[" "]" "[" "]" "["
## [463] "]" "," "\"" "\"" "." "[" "]" "[" "]" "[" "]" "," "," ","
## [477] "," "," "." "[" "]" "[" "]" "," "," "(" ")" "." "," ","
## [491] "." "." "[" "]" "[" "]" "[" "]" "[" "]" "." "," "." "["
## [505] "]" "[" "]" "," "." "[" "]" "." "," "." "," "." "[" "]"
## [519] "," "," "." "[" "]" "," "," "[" "]" "," "," "&" "." "&"
## [533] ";" "." "[" "]" "[" "]" "&" "." "," "." "[" "]" "," ","
## [547] "[" "]" "[" "]" "[" "]" "[" "]" "," "." "[" "]" "." "["
## [561] "]" "[" "]" "." "," "." "[" "]" "," "." "[" "]" "[" "]"
## [575] "." "[" "]" "," "." "[" "]" "[" "]" "," "[" "]" "[" "]"
## [589] "[" "]" "[" "]" "[" "]" "[" "]" "," "[" "]" "[" "]" ","
## [603] "." "[" "]" "." "[" "]" "," "." "[" "]" "," "," "(" ")"
## [617] "," "." "[" "]" "[" "]" "," "," "," "," "." "[" "]" ","
## [631] "," "(" "\"" "\"" "\"" "\"" ")" "," "." "[" "]" "[" "]" ","
## [645] "," "," "," "," "." "[" "]" "[" "]" "[" "]" "\"" "\"" "."
## [659] "[" "]" "[" "]" "," "," "," "." "[" "]" "[" "]" "." "."
## [673] "&" "," "," "/" "(" ")" "(" ")" "[" "]" "(" "," ")" "("
## [687] ")" "," "(" "(" ")" "," "(" ")" "(" ")" "." "," ")" "["
## [701] "]" "[" "]" "[" "]" "." "[" "]" "[" "]" "," "[" "]" "["
## [715] "]" "," "[" "]" "(" "," ")" "(" ")" "(" ")" "(" ")" "("
## [729] "[" "]" "," ")" "[" "]" "," "," "." "[" "]" "," "." "."
## [743] "[" "]" "," "," "," "[" "]" "," "[" "]" "[" "]" "," "["
## [757] "]" "[" "]" "[" "]" "[" "]" "(" "," ")" "(" ")" "," ";"
## [771] "(" ")" "(" ")" "[" "]" "[" "]" "[" "]" "(" ")" "," ","
## [785] "[" "]" "[" "]" "," "," "," "." "[" "]" "." "[" "]" ","
## [799] "(" "," ")" "," "[" "]" "," "[" "]" "[" "]" "[" "]" "("
## [813] "," ")" "(" ")" ":" "," "(" ")" "(" ")" "," "," "[" "]"
## [827] "[" "]" "," "." "." "[" "]" "," "," "," "," "[" "]" ","
## [841] "[" "]" "," "[" "]" "[" "]" "[" "]" "[" "]" "[" "]" "["
## [855] "]" "(" "," ")" "(" ")" "," "(" ")" "(" ")" "[" "]" ","
## [869] "[" "]" "." "." "." "[" "]" "." "." "[" "]" "." "," ","
## [883] "." "[" "]" "," "(" "," ")" ";" "[" "]" "," "(" "," ")"
## [897] ";" "(" "," ")" ";" "[" "]" "(" "," ")" "[" "]" "[" "]"
## [911] "(" "," ")" "(" ")" "[" "]" "[" "]" "(" "(" ")" ";" "("
## [925] ")" "," "," "[" "]" "[" "]" "," "," ")" "[" "]" "," ","
## [939] "." "[" "]" "." "[" "]" "," "&" "[" "]" "(" "," ")" "("
## [953] ")" "," "[" "]" "," "(" ")" "(" "," ")" "[" "]" "(" "&"
## [967] ")" "," "." "&" "," "." "," "." "[" "]" "[" "]" "," ","
## [981] "," "," "," "," "," "," "," "," "(" ")" "[" "]" "(" ","
## [995] ")" "(" ")" "," "[" "]" "," "(" ")" "(" "," ")" "[" "]"
## [1009] "," "," "." "," "." "." "[" "]" "," ";" "," ";" "," ","
## [1023] "[" "]" "," "," ";" "[" "]" ";" "[" "]" ";" "[" "]" "["
## [1037] "]" "[" "]" "(" ")" "[" "]" "(" "," ")" "(" ")" "," ","
## [1051] "(" ")" "(" ")" "[" "]" "," "[" "]" "[" "]" "." "[" "]"
## [1065] "." "[" "]" "," "." "[" "]" "[" "]" "(" "," ")" "(" ")"
## [1079] "(" ")" "[" "]" "(" ")" "," "[" "]" "," "," "," "[" "]"
## [1093] "(" ")" "[" "]" "(" "," ")" "(" ")" "(" ")" "," "(" ")"
## [1107] "." "[" "]" "[" "]" "," "." "[" "]" "," "," "," "," ","
## [1121] "," "[" "]" "[" "]" "[" "]" "[" "]" "(" "," ")" "(" ")"
## [1135] "," "(" ")" "[" "]" "[" "]" "," "." "[" "]" "[" "]" "["
## [1149] "]" "," "[" "]" "(" ")" "(" ")" "(" ")" "," "[" "]" "["
## [1163] "]" "[" "]" "/" "(" ")" "(" ")" "[" "]" "(" "," ")" "["
## [1177] "]" "[" "]" "[" "]" "," "(" "," "[" "]" "," "[" "]" "["
## [1191] "]" ")" "." "[" "]" "[" "]" "," "(" "," ")" ";" "," ","
## [1205] "," "(" "," ")" "[" "]" "(" ")" "(" ")" "(" "," ")" "["
## [1219] "]" "[" "]" "[" "]" "[" "]" "(" ")" ":" "," "," "." ","
## [1233] ":" "." "," ":" "." "," ":" "," "," "(" ")" ":" "." ":"
## [1247] "." ":" "." ":" "." "," "(" "(" ")" "(" ")" "(" ")" ")"
## [1261] "[" "]" "(" "," ")" "[" "]" "[" "]" "[" "]" "," "[" "]"
## [1275] "/" ":" "(" ")" ":" "," "," "," "," "," "," "," "." ","
## [1289] "," "(" ")" ":" "," "," "," "," "," "," "," "," "." ","
## [1303] "(" ")" "," "(" ")" "," "(" ")" "," "(" ")" "[" "]" "["
## [1317] "]" "(" ")" "(" "," ")" "," "," "," "," "," "," "," ","
## [1331] "," "[" "]" "(" "," ")" "(" "[" "]" ")" "," "," "," "["
## [1345] "]" "," "[" "]" "[" "]" "(" ")" "(" "," ")" "[" "]" "("
## [1359] "," ")" "," ";" "," "," "," "[" "]" "?" "," "[" "]" "["
## [1373] "]" "[" "]" "," "[" "]" "," "[" "]" "(" ")" "," "(" ","
## [1387] "(" "," ")" ")" "," "," "(" "," ")" "[" "]" "," "." ","
## [1401] "[" "]" "[" "]" "[" "]" "," "(" "(" ")" "(" "," ")" ","
## [1415] "[" "]" "[" "]" ")" "," "," "," "," "," "," "," "," ","
## [1429] "," "," "," "," "," "," "(" ")" "[" "]" "," "," "(" ","
## [1443] ")" "," "[" "]" "[" "]" "," "(" "," ")" ":" "," "," "."
## [1457] "(" "," ")" ":" "," "," "," "," "." "," "[" "]" "[" "]"
## [1471] "(" ")" "[" "]" "," "(" "(" "," ")" "(" "," ")" "[" "]"
## [1485] "[" "]" ":" "," "," ")" "," "." "," "," "." "." "," "("
## [1499] "(" ")" "[" "]" ")" "(" "," ")" "[" "]" "[" "]" "," ","
## [1513] "," "(" ")" ":" "," "," "," "." "," "," "." "(" "," ")"
## [1527] ":" "," "," "," "." "," "(" ")" "(" ")" "[" "]" "(" ","
## [1541] ")" "[" "]" "," "," "," "," "," "," "," "," "(" ")" "["
## [1555] "]" "." "(" ")" "(" "," ")" "[" "]" "," "," "," "," "."
## [1569] "." "," "," "[" "]" "(" "(" ")" "(" ")" "[" "]" "[" "]"
## [1583] "," "," ")" "[" "]" "[" "]" "," "," "," "," "[" "]" "("
## [1597] ")" "(" ")" "," "," "[" "]" "," "[" "]" "[" "]" "[" "]"
## [1611] "(" "(" ")" "," "," "," "(" ")" "(" ")" ":" "," ")" ","
## [1625] "." "," "(" ")" ":" "." "[" "]" "." "[" "]" "," "," ","
## [1639] "(" "[" "]" ")" "[" "]" "[" "]" "(" ")" "(" ")" "." ","
## [1653] "[" "]" "(" ")" "/" "," "," "," "," "[" "]" "," "," ","
## [1667] "," "(" "," ")" "(" ")" ":" "(" ")" "." "[" "]" "(" ")"
## [1681] ":" "." "[" "]" "," "(" ")" "[" "]" "(" ")" "(" "," ")"
## [1695] "[" "]" "," "(" ")" "[" "]" "[" "]" "," "(" ")" "," "/"
## [1709] "(" ")" ":" "," "," "(" ")" ":" "[" "]" "(" ")" ":" ","
## [1723] "," "[" "]" "," "," "(" ")" "[" "]" "(" ")" "," "," ","
## [1737] "," "[" "]" "(" ")" "[" "]" "." "," "[" "]" "(" ")" ","
## [1751] "," "," "," "(" ")" "[" "]" "[" "]" "[" "]" "," "[" "]"
## [1765] "(" ")" "(" ")" ":" "," "," "," "," "[" "]" "[" "]" "("
## [1779] ")" "," "," "," "[" "]" "(" ")" "[" "]" "," "," "," "["
## [1793] "]" "(" ")" "[" "]" "," "," "." "," "(" ")" "[" "]" ":"
## [1807] "," "." ":" "," "," "." "," "/" "[" "]" "(" "(" ")" ","
## [1821] "[" "]" ")" "[" "]" "(" "(" ")" ")" "," "[" "]" "(" ")"
## [1835] "[" "]" "." "," "[" "]" "(" ")" "," "(" ")" "(" ")" ","
## [1849] "[" "]" "(" "(" ")" "[" "]" ")" "," "[" "]" "," "(" ")"
## [1863] "[" "]" "," "?" "," "[" "]" "[" "]" "[" "]" "," "." ","
## [1877] "(" ")" "(" ")" "[" "]" "," "[" "]" "," "," "[" "]" "["
## [1891] "]" "(" ")" "." "[" "]" "[" "]" "," "," "[" "]" "[" "]"
## [1905] "(" ")" "[" "]" "," "[" "]" "(" ")" "(" "[" "]" ")" ","
## [1919] "(" ")" "." "[" "]" "," "," "," "," "," "(" ")" "(" ")"
## [1933] "[" "]" "," "," "," "[" "]" "(" ")" "," "," "," "," "."
## [1947] "[" "]" "," "(" ")" "[" "]" "," "," "," "." "," "[" "]"
## [1961] "(" ")" "(" ")" "[" "]" "[" "]" "," "," "." "," "[" "]"
## [1975] "(" ")" "." "[" "]" "," "," "." "," "(" ")" "[" "]" ","
## [1989] "," "," "." "," "." "." "(" ")" "[" "]" "," "," "," "."
## [2003] "," "(" ")" "[" "]" "," "," "," "." "," "(" ")" "[" "]"
## [2017] "," "," "," "," "," "(" ")" "[" "]" "," "," "," "." ","
## [2031] "(" ")" "[" "]" "," "," "." "," "(" ")" "[" "]" "," ","
## [2045] "," "." "," "." "(" ")" "[" "]" "," "," "." "," "(" ")"
## [2059] "." "[" "]" "," "," "," "." "," "(" ")" "[" "]" "." ","
## [2073] "(" ")" "(" ")" "[" "]" "," "," "." "," "(" ")" "." "."
## [2087] "[" "]" "," "," "," "." "," "(" ")" "[" "]" "," "," ","
## [2101] "." "," "(" ")" "[" "]" "," "," "." "," "(" ")" "[" "]"
## [2115] "," "," "." "," "(" ")" "&" "[" "]" "." "," "(" ")" ","
## [2129] "[" "]" "," "," "." "," "(" ")" "." "[" "]" "," "," ","
## [2143] "." "," "(" ")" "[" "]" "," "," "," "." "," "/" "(" "?"
## [2157] "[" "]" "[" "]" "(" ")" "," "," "," ")" "," "," "," "."
## [2171] "." "," "[" "]" "/" "[" "]" "&" "," "[" "]" "/" "," "."
## [2185] "(" ")" "," "[" "]" "," "(" ")" "," "/" "." "[" "]" "."
## [2199] "." "." "[" "]" "[" "]" "." "." "(" ")" "," "(" ")" "."
## [2213] "[" "]" "[" "]" "." "." "." "[" "]" "." "[" "]" "." "["
## [2227] "]" "." ":" "." "." "[" "]" "." "." "." "[" "]" "." "."
## [2241] "[" "]" "." "[" "]" "(" ")" "." "." "," "," "." "[" "]"
## [2255] "," "," "," "." "," "[" "]" "(" ")" "." "(" ")" "." "("
## [2269] ":" ";" ":" "(" ")" ")" "[" "]" "." "[" "]" "[" "]" ","
## [2283] "," "," "." "[" "]" "," "," "." "." "," "." "," "." "["
## [2297] "]" ":" "[" "]" "[" "]" "(" ")" "[" "]" "[" "]" "(" ")"
## [2311] "[" "]" "(" ")" "(" ")" "[" "]" "(" ")" "[" "]" "(" ","
## [2325] ")" "[" "]" "(" ")" "[" "]" "[" "]" "[" "]" "(" ")" "["
## [2339] "]" "[" "]" "[" "]" "[" "]" "[" "]" "[" "]" "(" ")" "["
## [2353] "]" "(" ")" "(" ")" "[" "]" "(" ")" "[" "]" "&" "[" "]"
## [2367] "(" ")" "[" "]" "(" ")" "(" ")" "[" "]" "(" ")" "[" "]"
## [2381] "[" "]" "[" "]" "." ":" "," "," "," "," "," "," "," ","
## [2395] "," "," "," "," "," "," "," "," "," "." ":" "." "." ":"
## [2409] "," "," "," "," "," "(" ")" "," "," "," "," "," "." "."
## [2423] "." "." "." "," "(" ")" "." "." "." "." "." "." "[" "]"
## [2437] "." "[" "]" "," "." "[" "]" "," "," "." "." "." "," "."
## [2451] "[" "]" "." "[" "]" "." "." "." "," "." "." "." "." "["
## [2465] "]" "~" "," "~" "." "." "." "," "~" "\"" "\"" "." "." "."
## [2479] "[" "]" "." "." "(" "." "." ")" "." "[" "]" "," "." ","
## [2493] "." "[" "]" "," "&" "," "." "." "," "." "[" "]" "," "."
## [2507] "." "[" "]" "," "\"" "\"" "." "[" "]" "," "." "[" "]" "["
## [2521] "]" "," "." "[" "]" "," "." "[" "]" "," "." "[" "]" "["
## [2535] "]" "." "[" "]" "," "\"" "\"" "," "," "." "[" "]" "," "."
## [2549] "[" "]" "[" "]" "," "." "." "[" "]" "," "," "," "," "("
## [2563] ")" "." "[" "]" "." "[" "]" "[" "]" "[" "]" "[" "]" "."
## [2577] "[" "]" "[" "]" "[" "]" "," "," "." "[" "]" "," "." "["
## [2591] "]" ":" "\"" "," "," "," "," "," "," "." "\"" "[" "]" ","
## [2605] "," "." "[" "]" "[" "]" "[" "]" "[" "]" "," "," "." "["
## [2619] "]" "[" "]" "," "," "," "." "[" "]" "[" "]" "[" "]" ","
## [2633] "," "\"" "," "," "," "," "," "," "\"" "," "\"" "," "," "("
## [2647] "[" "]" ")" "[" "]" "\"" "." "[" "]" "[" "]" "." "[" "]"
## [2661] "." "[" "]" "[" "]" "," "," "." "," "," "," "." "[" "]"
## [2675] "," "," "." "[" "]" "," "," "." "," "," "," "." "." "["
## [2689] "]" "," "," "." "," "," "." "," "," "[" "]" "," "," ","
## [2703] "," "." "," "," "." "," "," "." "," "," "." "," "," "."
## [2717] "," "," "." "." "[" "]" "," "," "," "," "," ":" "\"" ","
## [2731] "," "." "," "," "." "." "." "," "," "." ";" ";" "." "\""
## [2745] "[" "]" "," "," "," "," "." "[" "]" "[" "]" "," "," "["
## [2759] "]" "," "," "," "," ";" "," "," "," "[" "]" "," "," "."
## [2773] "[" "]" "[" "]" "[" "]" "," "," "," "," "." "(" ")" ","
## [2787] "," "." "." "," "," "." "," "," "." "." "," "," "." "."
## [2801] "," "." "," "," "." "," "," "." "." "[" "]" "," "," "["
## [2815] "]" "," ":" "\"" "," "," "," "." "." "\"" "[" "]" "," "."
## [2829] "'" "\"" "\"" "," "," "." "," "." "," "." "[" "]" "[" "]"
## [2843] "," "," "," "," "," "." "," "," "." "," "." "," "." "."
## [2857] "[" "]" "," "." "," "," "," "." "," "," "." "," "," ","
## [2871] "." "." "," "." "," "." "," "." "," "[" "]" "," "[" "]"
## [2885] "," "." "." "[" "]" "," "," "." "." "," "." "," "[" "]"
## [2899] "," "," "," "," "." "[" "]" "," "," "." "," "." "[" "]"
## [2913] "[" "]" "," "." "," "," "[" "]" "," "." "[" "]" "[" "]"
## [2927] "[" "]" "," "." "[" "]" "," "," "." "[" "]" "," "," "."
## [2941] "." "." "."
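# Replace every character that is not a letter, digit, whitespace, hyphen, or percent sign with a single space
cv_nopunct <- str_replace_all(cv_nocite, "[^[:alnum:][:space:]-%]", " ")
# Check the result: list every whitespace-delimited token that still contains punctuation (only "-" and "%" should remain)
unlist(str_extract_all(cv_nopunct, "[^[:space:]]*[[:punct:]]{1,}[^[:space:]]*"))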
## [1] "covid--9" "-9" "-"
## [4] "-" "---9" "-9"
## [7] "-9" "----" "-"
## [10] "--" "----" "sars-cov--"
## [13] "-9" "covid--9" "-"
## [16] "covid--9" "---" "-9"
## [19] "95%" "-9" "----"
## [22] "--" "bbibp-" "rbd-dimer"
## [25] "-" "----" "--8"
## [28] "7-" "--" "--"
## [31] "-6" "-" "e-a"
## [34] "-" "--" "----"
## [37] "--6" "-7" "-9"
## [40] "e-a" "astrazeneca-oxford" "-"
## [43] "----" "pfizer-biontech" "-"
## [46] "-" "e-a" "-"
## [49] "e-a" "6--" "5--"
## [52] "----" "----" "--"
## [55] "high-income" "--%" "5--"
## [58] "v-" "-9" "--"
## [61] "----" "--" "--"
## [64] "--" "non-human" "---5"
## [67] "---6" "--" "-5"
## [70] "-6" "----" "-7"
## [73] "-8" "-9" "covid--9"
## [76] "mers-cov" "-7" "--"
## [79] "----" "--" "viral-vectored"
## [82] "adenoviral-vectored" "chadox--mers" "bvrs-gamvac"
## [85] "mva-vectored" "mva-mers-s" "--"
## [88] "----" "--" "-9"
## [91] "--%" "--%" "-6%"
## [94] "--" "--" "--"
## [97] "-5" "-s" "covid--9"
## [100] "covid--9" "covid-" "-9"
## [103] "bbibp-corv" "-9" "-6"
## [106] "-6" "-7" "----"
## [109] "-8" "-nited" "-6"
## [112] "-9" "-9" "--"
## [115] "--" "--" "--"
## [118] "--" "--" "-5"
## [121] "-6" "-7" "-8"
## [124] "-9" "-9" "-9"
## [127] "--" "--" "covid--9"
## [130] "g--" "----" "--"
## [133] "--" "cross-discipline" "--"
## [136] "----" "multi-site" "-"
## [139] "-5" "-6" "-7"
## [142] "low-rate" "-7" "-8"
## [145] "-9" "----" "-9"
## [148] "-nited" "disease-fighting" "-9"
## [151] "-7" "5-" "----"
## [154] "fast-track" "5-" "5-"
## [157] "-7" "-6" "---9"
## [160] "5-" "--" "----"
## [163] "covid--9" "5-" "----"
## [166] "----" "-9" "6-"
## [169] "-9" "--" "-9"
## [172] "----" "--%" "--%"
## [175] "-6%" "5-" "----"
## [178] "-" "-" "-8"
## [181] "6-" "-9" "----"
## [184] "6-" "----" "5-"
## [187] "6-" "--" "----"
## [190] "high-risk" "6-" "--"
## [193] "----" "-" "e-a"
## [196] "bnt-6-b-" "--" "----"
## [199] "-" "----" "-nited"
## [202] "-9" "7-" "7-"
## [205] "7-" "--" "-nion"
## [208] "7-" "-9" "-nited"
## [211] "bbibp-corv" "7-" "--"
## [214] "----" "-nited" "-se"
## [217] "e-a" "-9" "e-a"
## [220] "mrna---7-" "----" "-9"
## [223] "-" "8-" "-9"
## [226] "8-" "----" "nucleoside-modified"
## [229] "-" "-" "non-replicating"
## [232] "-" "-" "--"
## [235] "8-" "--" "--"
## [238] "-9" "next-generation" "-9"
## [241] "8-" "--" "--"
## [244] "-9" "8-" "--"
## [247] "nucleoside-modified" "8-" "8-"
## [250] "8-" "8-" "covid--9"
## [253] "-nited" "-nion" "----"
## [256] "pfizer-biontech" "-9" "covid--9"
## [259] "9-" "9-" "----"
## [262] "e-" "9-" "non-replicating"
## [265] "-" "9-" "vector-based"
## [268] "covid--9" "non-replicating" "9-"
## [271] "----" "covid--9" "9-"
## [274] "covid--9" "one-shot" "---"
## [277] "---" "ad-6" "-nd"
## [280] "---" "----" "---"
## [283] "---" "--5" "bbibp-corv"
## [286] "--6" "--7" "covid--9"
## [289] "--8" "--9" "---"
## [292] "----" "---" "covid--9"
## [295] "---" "rbd-dimer" "-"
## [298] "v-5-" "---" "---"
## [301] "--5" "--6" "--7"
## [304] "--8" "--9" "---"
## [307] "---" "---" "-"
## [310] "---" "-9" "---"
## [313] "non-" "-9" "--5"
## [316] "--" "-5" "placebo-controlled"
## [319] "-5" "--" "-5"
## [322] "-9" "-6" "--6"
## [325] "--7" "--8" "--9"
## [328] "--8" "---" "who-recognized"
## [331] "rbd-dimer" "-nited" "-"
## [334] "-7-" "--" "--"
## [337] "--8" "-6" "covid--9"
## [340] "-" "-" "-lt"
## [343] "--" "placebo-controlled" "---"
## [346] "-8" "----" "--6"
## [349] "--" "----" "95%"
## [352] "--7" "--8" "----"
## [355] "--9" "---" "-nited"
## [358] "covid--9" "-" "--8"
## [361] "--" "---" "-5"
## [364] "-" "double-" "--"
## [367] "---" "placebo-" "ad-6"
## [370] "---" "---" "9-"
## [373] "6%" "--5" "----"
## [376] "----" "--6" "--7"
## [379] "--8" "--9" "-5-"
## [382] "-ae" "-5-" "-nited"
## [385] "-" "-" "-56"
## [388] "--" "---" "--"
## [391] "covid--9" "-" "--"
## [394] "--" "9-" "chadox-"
## [397] "-niversity" "9-" "-55"
## [400] "placebo-controlled" "-57" "76%"
## [403] "8-%" "--" "-58"
## [406] "----" "----" "---"
## [409] "-59" "-nited" "-6-"
## [412] "bbibp-corv" "--6" "-"
## [415] "-" "-6-" "-8"
## [418] "---" "-5" "-"
## [421] "-" "-" "-"
## [424] "double-blind" "--6" "-6-"
## [427] "placebo-controlled" "79%" "-6-"
## [430] "----" "----" "-nited"
## [433] "-6-" "-65" "-66"
## [436] "-67" "---" "---"
## [439] "--5" "-" "-"
## [442] "-69" "--" "6--"
## [445] "--" "-" "-"
## [448] "double-blind" "-" "-68"
## [451] "---" "placebo-controlled" "8-"
## [454] "5%" "-7-" "-%"
## [457] "-7-" "5-" "-%"
## [460] "78%" "---%" "-7-"
## [463] "----" "----" "-5"
## [466] "---" "-7-" "----"
## [469] "----" "-" "6--"
## [472] "-" "---" "-7-"
## [475] "--" "---" "-75"
## [478] "covid--9" "-nited" "-"
## [481] "---" "-78" "--"
## [484] "---" "--" "9-"
## [487] "9-" "-" "-77"
## [490] "placebo-controlled" "-76" "-5"
## [493] "----" "-79" "--"
## [496] "----" "9-%" "-8-"
## [499] "----" "----" "-nited"
## [502] "-nited" "-" "-"
## [505] "-8-" "--" "---"
## [508] "covid--9" "-8-" "double-blinded"
## [511] "ad-6" "-8-" "placebo-controlled"
## [514] "-9" "----" "66%"
## [517] "85%" "6-%" "7-%"
## [520] "-nited" "-8-" "-8-"
## [523] "----" "----" "-nited"
## [526] "-kraine" "ad5-ncov" "-"
## [529] "-" "-86" "--"
## [532] "---" "-" "-86"
## [535] "multi-center" "-" "double-"
## [538] "-85" "placebo-" "----"
## [541] "%" "covid--9" "9-"
## [544] "98%" "-86" "----"
## [547] "----" "----" "----"
## [550] "-87" "-88" "-89"
## [553] "-9-" "-9-" "-9-"
## [556] "bbv-5-" "-" "-"
## [559] "-9-" "-5" "8--"
## [562] "-" "-" "-"
## [565] "-9-" "observer-blinded" "-9-"
## [568] "placebo-" "-95" "8-%"
## [571] "-96" "peer-reviewed" "-97"
## [574] "----" "----" "---"
## [577] "-" "-" "-98"
## [580] "--" "---" "-"
## [583] "---" "-" "-"
## [586] "double-" "-" "placebo-"
## [589] "---" "----" "----"
## [592] "-99" "zf----" "rbd-dimer"
## [595] "-" "-" "-9"
## [598] "---" "-" "--"
## [601] "-" "---" "---"
## [604] "double-blind" "placebo-" "---"
## [607] "----" "----" "-zbekistan"
## [610] "---" "---" "---"
## [613] "-" "-" "--6"
## [616] "-" "---" "-"
## [619] "-" "-" "double-blind"
## [622] "-" "--5" "--6"
## [625] "placebo-controlled" "-9" "-"
## [628] "--7" "--8" "covid--9"
## [631] "-nited" "--9" "---"
## [634] "---" "virus-" "---"
## [637] "-" "---" "-5"
## [640] "---" "---" "---"
## [643] "--7" "-" "observer-"
## [646] "e-" "--8" "-kraine"
## [649] "--9" "---" "placebo-"
## [652] "--6" "---" "----"
## [655] "----" "-k" "-5"
## [658] "---" "----" "----"
## [661] "-s" "--" "---"
## [664] "--5" "finlay-fr--" "--"
## [667] "95-" "--" "---"
## [670] "---" "---" "---"
## [673] "---" "--" "non-randomized"
## [676] "parallel-group" "placebo-" "-ncontrolled"
## [679] "double-blind" "----" "9--"
## [682] "----" "----" "covid--9"
## [685] "9--" "-" "--5"
## [688] "-6" "5--" "--6"
## [691] "--7" "--8" "e-"
## [694] "9-" "-b" "-"
## [697] "-8-" "dose-" "----"
## [700] "----" "66-" "observer-blind"
## [703] "dose-" "----" "----"
## [706] "-nnamed" "--9" "---"
## [709] "-" "9--" "--"
## [712] "---" "double-blinded" "double-blinded"
## [715] "single-center" "single-center" "placebo-"
## [718] "placebo-" "----" "----"
## [721] "qazcovid-in" "---" "-"
## [724] "-" "---" "---"
## [727] "---" "----" "placebo-"
## [730] "---" "----" "----"
## [733] "---" "zycov-d" "--5"
## [736] "-" "-6" "---"
## [739] "---" "-" "---"
## [742] "placebo-" "double-blind" "--5"
## [745] "placebo-" "----" "--7"
## [748] "--5" "--6" "----"
## [751] "----" "--8" "virus-like"
## [754] "-8-" "-nited" "plant-based"
## [757] "--" "9-8" "as--"
## [760] "event-driven" "--" "--"
## [763] "placebo-" "---" "--x"
## [766] "----" "covid--9" "----"
## [769] "----" "----" "---"
## [772] "scb----9" "---" "---"
## [775] "-5-" "-nited" "--"
## [778] "---" "----" "---"
## [781] "--5" "as--" "----"
## [784] "double-blind" "----" "----"
## [787] "-b-6--" "6-" "--8"
## [790] "-nited" "-nited" "--"
## [793] "-7-" "open-label" "--6"
## [796] "--7" "----" "----"
## [799] "-" "85-" "placebo-"
## [802] "observer-blind" "iib-iii" "---"
## [805] "double-blind" "dose-" "----"
## [808] "----" "grad-cov-" "--9"
## [811] "-5-" "9-" "-5-"
## [814] "--" "---" "-5-"
## [817] "-5-" "-8" "observer-blind"
## [820] "placebo-" "grad-" "----"
## [823] "cov-" "---week" "9-%"
## [826] "grad-cov-" "anti-" "----"
## [829] "-nnamed" "-5" "-56"
## [832] "-" "96-" "-5-"
## [835] "-55" "single-center" "96-"
## [838] "single-" "placebo-" "double-blind"
## [841] "double-blinded" "----" "placebo-"
## [844] "-" "---" "single-"
## [847] "double-blinded" "placebo-" "----"
## [850] "----" "mvc-cov-9--" "-5"
## [853] "-58" "-" "7--"
## [856] "-57" "open-labeled" "double-blinded"
## [859] "single-center" "multi-center" "----"
## [862] "----" "multi-regional" "----"
## [865] "----" "-nnamed" "-"
## [868] "-8-" "-6-" "-"
## [871] "---" "-59" "double-blind"
## [874] "double-blind" "parallel-" "parallel-"
## [877] "----" "----" "----"
## [880] "----" "-6-" "-"
## [883] "56-" "6-" "-6-"
## [886] "-6-" "-6-" "-6-"
## [889] "double-blind" "----" "----"
## [892] "placebo-" "----" "-nnamed"
## [895] "-65" "---" "-68"
## [898] "----" "----" "----"
## [901] "-67" "----" "-66"
## [904] "ino--8--" "--6" "--7"
## [907] "-8-" "---" "---"
## [910] "-nited" "open-label" "placebo-"
## [913] "ib-iia" "-6-" "dose-"
## [916] "multi-" "-69" "-68"
## [919] "----" "----" "----"
## [922] "-nited" "----" "-nited"
## [925] "ib-iia" "ag-----covid" "-9"
## [928] "--8" "-7-" "--"
## [931] "-7-" "5--" "non-"
## [934] "double-blind" "single-center" "-7-"
## [937] "----" "----" "----"
## [940] "-nnamed" "-nited" "-"
## [943] "-6-" "i-iia" "---"
## [946] "sars-cov--" "-8" "-7-"
## [949] "7--" "sars-cov--" "as--"
## [952] "-8" "-7-" "----"
## [955] "----" "-nited" "iibr----"
## [958] "---" "-" "---"
## [961] "-75" "----" "----"
## [964] "arct----" "lunar-" "-nited"
## [967] "cov-9" "-76" "-77"
## [970] "9-" "n-s" "double-blinded"
## [973] "--6" "-78" "6--"
## [976] "observer-blind" "placebo-" "-79"
## [979] "----" "----" "-nited"
## [982] "vbi--9--a" "-8-" "-nited"
## [985] "virus-like" "78-" "observer-blind"
## [988] "dose-" "placebo-" "----"
## [991] "----" "mrt55--" "-8-"
## [994] "--5" "-8-" "first-in-"
## [997] "sars-" "cov--" "-8"
## [1000] "----" "----" "-nited"
## [1003] "eucorvac--9" "-8-" "-8-"
## [1006] "dose-" "observer-blind" "placebo-"
## [1009] "----" "----" "gx--9"
## [1012] "gx--9n" "--9" "-8-"
## [1015] "---" "-85" "-8-"
## [1018] "i-ii" "-7--" "---"
## [1021] "multi-" "double-blind" "placebo-"
## [1024] "----" "----" "vla----"
## [1027] "--8" "--9" "-"
## [1030] "-5-" "multi-center" "double-blinded"
## [1033] "----" "----" "-nited"
## [1036] "tak-9-9" "-86" "---"
## [1039] "-87" "observer-blind" "placebo-"
## [1042] "----" "----" "tak---9"
## [1045] "-88" "---" "-89"
## [1048] "observer-blind" "placebo-" "----"
## [1051] "----" "covid-evax" "-6-"
## [1054] "-9-" "first-" "in-human"
## [1057] "open-" "----" "-9"
## [1060] "---" "---" "----"
## [1063] "----" "-9-" "lv-smenp-dc"
## [1066] "---" "---" "----"
## [1069] "----" "-9-" "chulacov-9"
## [1072] "-niversity" "-9-" "dose-finding"
## [1075] "----" "lnp-ncovsarna" "-9-"
## [1078] "-nited" "--5" "-5"
## [1081] "---" "----" "----"
## [1084] "-nited" "covax--9" "-9-"
## [1087] "--" "-95" "----"
## [1090] "----" "hgc--9" "-96"
## [1093] "---" "-nited" "-98"
## [1096] "----" "-97" "covid--9"
## [1099] "-99" "---" "-"
## [1102] "-nited" "-6-" "---"
## [1105] "---" "----" "----"
## [1108] "---" "---" "---"
## [1111] "---" "--5" "----"
## [1114] "----" "--6" "ptx-covid-9-b"
## [1117] "--7" "6-" "--7"
## [1120] "----" "covac--" "--8"
## [1123] "--8" "-niversity" "--8"
## [1126] "----" "----" "covi-vac"
## [1129] "-nited" "-8" "--9"
## [1132] "first-in-human" "double-blind" "placebo-"
## [1135] "dose-escalation" "----" "----"
## [1138] "-nited" "cov-" "-nited"
## [1141] "-8" "---" "open-label"
## [1144] "non-" "----" "-nited"
## [1147] "---" "-" "double-blind"
## [1150] "placebo-" "---" "----"
## [1153] "----" "-5-" "-niversity"
## [1156] "---" "double-blind" "dose-ranging"
## [1159] "placebo-" "----" "----"
## [1162] "bbv-5-" "---" "-75"
## [1165] "---" "--5" "double-blinded"
## [1168] "----" "mv--------" "--6"
## [1171] "-nited" "---" "--7"
## [1174] "double-blinded" "----" "----"
## [1177] "-nited" "s--68--9" "---"
## [1180] "--8" "double-blind" "placebo-"
## [1183] "parallel-group" "----" "----"
## [1186] "gbp5--" "-6-" "--9"
## [1189] "placebo-" "observer-" "dose-"
## [1192] "----" "cigb-66" "---"
## [1195] "---" "double-blind" "placebo-"
## [1198] "----" "----" "kbp----"
## [1201] "-nited" "-8-" "---"
## [1204] "first-in-human" "observer-" "placebo-"
## [1207] "----" "----" "-nited"
## [1210] "adimrsc--f" "7-" "---"
## [1213] "open-label" "dose-finding" "----"
## [1216] "er-cov-vac" "-" "--"
## [1219] "---" "covid--9" "er-cov-vac"
## [1222] "----" "----" "aks--5-"
## [1225] "-niversity" "---" "---"
## [1228] "non-" "single-center" "open-label"
## [1231] "----" "gls-5---" "--5"
## [1234] "--5" "dose-" "double-blind"
## [1237] "----" "----" "vax----"
## [1240] "7-" "--6" "placebo-"
## [1243] "observer-blind" "----" "coh--s-"
## [1246] "-nited" "--9" "--7"
## [1249] "----" "----" "-"
## [1252] "--5" "--8" "----"
## [1255] "nbp----" "5-" "--9"
## [1258] "placebo-" "observer-" "dose-"
## [1261] "----" "----" "covac--"
## [1264] "-6" "-niversity" "---"
## [1267] "placebo-" "observer-" "dose-"
## [1270] "----" "----" "bactrl-spike"
## [1273] "--" "---" "observer-blind"
## [1276] "placebo-" "----" "----"
## [1279] "---" "---" "----"
## [1282] "corvax--" "-nited" "-6"
## [1285] "---" "open-label" "----"
## [1288] "----" "-nited" "chadv68-s"
## [1291] "-nited" "---" "---"
## [1294] "open-label" "----" "----"
## [1297] "-nited" "-nited" "-8-"
## [1300] "--5" "double-blind" "placebo-"
## [1303] "first-" "in-human" "----"
## [1306] "----" "-nited" "vxa-cov---"
## [1309] "-nited" "-5" "--6"
## [1312] "double-blind" "placebo-" "first-"
## [1315] "in-human" "----" "-nited"
## [1318] "sars-cov--" "v-5-" "---"
## [1321] "---" "---" "-q"
## [1324] "double-blind" "placebo-" "dose-ranging"
## [1327] "----" "v59-" "--7"
## [1330] "v59-" "mv-" "-nited"
## [1333] "sars-cov--" "--8" "--9"
## [1336] "---" "-niversity" "sars-"
## [1339] "cov--" "covid-" "-9"
## [1342] "---" "---" "---"
## [1345] "long-term" "covid--9" "-5"
## [1348] "-5" "--" "-"
## [1351] "-6" "-6" "---"
## [1354] "--5" "gam-covid-vac" "gam-covid-vac-lyo"
## [1357] "--8" "---" "chadox-"
## [1360] "ncov--9" "-5-" "low-"
## [1363] "middle-income" "-5-" "chadox-"
## [1366] "ncov--9" "-5-" "virus-like"
## [1369] "--9" "-s" "---"
## [1372] "---" "-%" "5-%"
## [1375] "---" "-9" "67%"
## [1378] "-9" "-" "---"
## [1381] "double-blind" "-s" "5-%"
## [1384] "---" "-9" "--5"
## [1387] "--6" "75%" "covid--9"
## [1390] "7-%" "8-%" "--7"
## [1393] "covid--9" "covid--9" "sars-cov--"
## [1396] "95%" "covid--9" "--8"
## [1399] "covid--9" "9-%" "97%"
## [1402] "---%" "-nited" "--9"
## [1405] "95%" "9-" "98%"
## [1408] "-5-" "9-%" "95%"
## [1411] "---%" "9-" "---%"
## [1414] "-5-" "8-%" "6-"
## [1417] "9-%" "---%" "%"
## [1420] "7-" "---%" "-58"
## [1423] "89%" "95%" "---%"
## [1426] "-nited" "-5-" "-5-"
## [1429] "6-%" "--" "8-%"
## [1432] "---%" "bbibp-corv" "79%"
## [1435] "---%" "-5-" "-55"
## [1438] "78%" "---%" "-56"
## [1441] "-57" "-58" "66%"
## [1444] "75%" "85%" "5-"
## [1447] "97%" "7-%" "8-%"
## [1450] "86%" "---%" "-nited"
## [1453] "-8-" "68%" "-9"
## [1456] "8-%" "88%" "---%"
## [1459] "6-%" "--" "79%"
## [1462] "8-%" "-6" "95%"
## [1465] "8-%" "-59" "-6-"
## [1468] "66%" "9-%" "-86"
## [1471] "-8" "---" "--"
## [1474] "55%" "--" "7-%"
## [1477] "sars-cov--" "covid--9" "-6-"
## [1480] "covid--9" "-6-" "----"
## [1483] "-s" "sars-cov--" "-6-"
## [1486] "----" "-" "-"
## [1489] "-" "-k" "-6-"
## [1492] "-k" "-6-" "--"
## [1495] "89%" "-" "-"
## [1498] "7-" "9-%" "non-b"
## [1501] "-" "-" "-65"
## [1504] "96%" "86%" "-"
## [1507] "-" "6-%" "-"
## [1510] "-5-" "-66" "5--"
## [1513] "v-" "5--" "v-"
## [1516] "-" "-5-" "-67"
## [1519] "-7" "----" "two-thirds"
## [1522] "5--" "v-" "-68"
## [1525] "ad-6" "cov-" "covid--9"
## [1528] "7-%" "-nited" "57%"
## [1531] "-69" "----" "-niversity"
## [1534] "-niversity" "covid--9" "5--"
## [1537] "v-" "-7-" "-"
## [1540] "---" "azd----" "covid--9"
## [1543] "-7-" "----" "-"
## [1546] "-7-" "-7-" "----"
## [1549] "8-" "-9" "-7-"
## [1552] "-9" "-9" "-7-"
## [1555] "-75" "-9" "-9"
## [1558] "protein-based" "vector-based" "-75"
## [1561] "8-%" "-75" "-7-"
## [1564] "-75" "--" "----"
## [1567] "--6" "-7" "-9"
## [1570] "-8-" "-9" "----"
## [1573] "-8-" "first-service" "-8-"
## [1576] "-8-" "-8-" "-85"
## [1579] "-8-" "-8-" "-85"
## [1582] "-k" "--" "-k"
## [1585] "-9" "taxpayer-funded" "-k"
## [1588] "-86" "-85" "----"
## [1591] "late-stage" "low-" "-87"
## [1594] "-9" "-8-" "-88"
## [1597] "-89" "-9-" "-9"
## [1600] "-9" "densely-populated" "-9-"
## [1603] "-9-" "-9" "-87"
## [1606] "-89" "-9-" "-"
## [1609] "----" "-s" "-9"
## [1612] "-9" "sars-cov--" "covid--9"
## [1615] "--" "-76" "-9-"
## [1618] "-nited" "-" "%"
## [1621] "----" "-9-" "-6-"
## [1624] "7-9" "-58" "-"
## [1627] "-%" "-nion" "-9"
## [1630] "-nited" "8-" "--5"
## [1633] "--" "-%" "7-"
## [1636] "---" "--" "-9-"
## [1639] "e-" "--" "--9"
## [1642] "8-8" "-%" "side-effects"
## [1645] "-7" "57-" "---"
## [1648] "-" "7%" "-95"
## [1651] "-nited" "-7" "6--"
## [1654] "97-" "--" "7%"
## [1657] "--" "-68" "97-"
## [1660] "-" "8%" "--7"
## [1663] "%" "-nited" "-98"
## [1666] "--" "--7" "%"
## [1669] "covid--9" "--6" "-6-"
## [1672] "-%" "high-income" "--%"
## [1675] "6--" "--9" "-"
## [1678] "8%" "-5" "----"
## [1681] "5-%" "pre-sold" "6--"
## [1684] "-9-" "-9" "-%"
## [1687] "high-income" "-8-" "-"
## [1690] "-%" "-96" "-5-"
## [1693] "--8" "%" "-8"
## [1696] "----" "director-general" "-75"
## [1699] "98-" "%" "-9"
## [1702] "-" "5--" "-"
## [1705] "8%" "-" "76-"
## [1708] "7-7" "-" "9%"
## [1711] "-9" "higher-income" "-5"
## [1714] "lowest-income" "-" "-6-"
## [1717] "-68" "--" "6%"
## [1720] "-5" "-5" "-5"
## [1723] "-97" "-" "--6"
## [1726] "55-" "%" "covid--9"
## [1729] "-" "---" "--9"
## [1732] "%" "-nited" "----"
## [1735] "-98" "-99" "-"
## [1738] "---" "%" "long-standing"
## [1741] "-" "-89" "---"
## [1744] "--" "-" "5--"
## [1747] "-8-" "%" "-"
## [1750] "7--" "--8" "%"
## [1753] "-" "6--" "---"
## [1756] "--" "---" "---"
## [1759] "---" "-" "6--"
## [1762] "-" "%" "covid--9"
## [1765] "-" "-9-" "6--"
## [1768] "-%" "-7" "---"
## [1771] "-" "--8" "--6"
## [1774] "-9" "-%" "---"
## [1777] "-" "--7" "57-"
## [1780] "-" "-%" "-"
## [1783] "98-" "97-" "-%"
## [1786] "--" "--5" "-%"
## [1789] "-5" "-7-" "-%"
## [1792] "96-" "--" "7%"
## [1795] "9--" "--5" "-%"
## [1798] "---" "9--" "5-7"
## [1801] "%" "8--" "5--"
## [1804] "--" "7--" "68-"
## [1807] "%" "7--" "---"
## [1810] "--" "-%" "---"
## [1813] "6-7" "-" "-%"
## [1816] "me-first" "---" "-%"
## [1819] "6--" "86-" "--"
## [1822] "9%" "--5" "59-"
## [1825] "6--" "--" "--"
## [1828] "----" "-nited" "55-"
## [1831] "-5-" "-" "-%"
## [1834] "-nion" "5-9" "-5-"
## [1837] "-%" "8-" "covid--9"
## [1840] "5-5" "9-5" "-%"
## [1843] "-97" "---" "-%"
## [1846] "--6" "-8-" "66-"
## [1849] "-" "8%" "-87"
## [1852] "-66" "%" "anti-vaccination"
## [1855] "--6" "-87" "-"
## [1858] "-%" "covid--9" "---"
## [1861] "---" "-" "-%"
## [1864] "--7" "-6-" "--"
## [1867] "5%" "covid--9" "---"
## [1870] "65-" "-" "-%"
## [1873] "-8-" "---" "-"
## [1876] "7%" "-5-" "---"
## [1879] "-" "7%" "--%"
## [1882] "-5-" "---" "--"
## [1885] "--7" "---" "5-8"
## [1888] "-" "6%" "-9"
## [1891] "-7" "---" "---"
## [1894] "%" "----" "67%"
## [1897] "8-%" "-" "-ruguay"
## [1900] "---" "-%" "-9"
## [1903] "--9" "---" "--"
## [1906] "--8" "----" "-9%"
## [1909] "-s" "-97" "68-"
## [1912] "-" "-%" "5-%"
## [1915] "-8-" "6--" "%"
## [1918] "-s" "--9" "---"
## [1921] "-75" "--8" "--"
## [1924] "-%" "-56" "78-"
## [1927] "--" "---" "---"
## [1930] "---" "---" "-97"
## [1933] "-" "-%" "---"
## [1936] "7--" "--" "--9"
## [1939] "-78" "-" "-%"
## [1942] "---9" "-9" "--9"
## [1945] "-78" "--" "-9"
## [1948] "-9-" "-88" "-"
## [1951] "7%"
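The replacement and inspection steps above can be tried on a small string to see which characters survive. This is only a minimal sketch: the sentence `toy` and the names `toy_clean` and `leftover` are made up for illustration and are not part of the vaccine text. The hyphen is written as `\\-` here so that it cannot be read as a range inside the character class; that is a stylistic assumption, not a change to the code above.

```r
library(stringr)

# Hypothetical toy sentence (an assumption for illustration only)
toy <- "Efficacy was 95% (high-income sites); see COVID-19, e.g. mRNA-1273!"

# Keep letters, digits, whitespace, "%" and "-"; replace anything else with a space
toy_clean <- str_replace_all(toy, "[^[:alnum:][:space:]%\\-]", " ")
toy_clean

# List the whitespace-delimited tokens that still contain punctuation
leftover <- unlist(str_extract_all(toy_clean, "[^[:space:]]*[[:punct:]]+[^[:space:]]*"))
leftover
```

Parentheses, commas, periods, semicolons and exclamation marks are turned into spaces, while hyphenated terms and percentages stay intact. That is why the listing above consists only of tokens containing "-" or "%".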
# Find stand-alone runs of hyphens and percent signs, i.e. runs with a space on both sides
str_extract_all(cv_nopunct, " [-%]+ ")
## [[1]]
## [1] " - " " - " " ---- " " - " " -- " " ---- " " - " " --- "
## [9] " ---- " " -- " " - " " ---- " " -- " " -- " " - " " - "
## [17] " -- " " ---- " " - " " ---- " " - " " - " " ---- " " ---- "
## [25] " -- " " --% " " -- " " ---- " " -- " " -- " " -- " " -- "
## [33] " ---- " " -- " " ---- " " -- " " -- " " ---- " " -- " " --% "
## [41] " --% " " -- " " -- " " -- " " ---- " " -- " " -- " " -- "
## [49] " -- " " -- " " -- " " -- " " -- " " ---- " " -- " " -- "
## [57] " -- " " ---- " " - " " ---- " " ---- " " -- " " ---- " " ---- "
## [65] " ---- " " -- " " ---- " " --% " " --% " " ---- " " - " " - "
## [73] " ---- " " ---- " " -- " " ---- " " -- " " ---- " " - " " -- "
## [81] " ---- " " - " " ---- " " -- " " -- " " ---- " " ---- " " - "
## [89] " ---- " " - " " - " " - " " - " " -- " " -- " " -- "
## [97] " -- " " -- " " -- " " ---- " " ---- " " - " " ---- " " --- "
## [105] " --- " " --- " " ---- " " --- " " --- " " --- " " ---- " " --- "
## [113] " --- " " - " " --- " " --- " " --- " " --- " " --- " " - "
## [121] " --- " " --- " " -- " " -- " " --- " " - " " -- " " -- "
## [129] " - " " -- " " --- " " ---- " " -- " " ---- " " ---- " " --- "
## [137] " - " " -- " " - " " -- " " --- " " --- " " --- " " ---- "
## [145] " ---- " " - " " - " " -- " " -- " " - " " -- " " -- "
## [153] " ---- " " ---- " " --- " " - " " - " " --- " " - " " - "
## [161] " - " " ---- " " ---- " " --- " " --- " " - " " - " " -- "
## [169] " -- " " - " " - " " --- " " -% " " -% " " ---% " " ---- "
## [177] " ---- " " --- " " ---- " " ---- " " - " " - " " -- " " - "
## [185] " --- " " -- " " -- " " - " " ---- " " -- " " ---- " " ---- "
## [193] " ---- " " - " " -- " " ---- " " ---- " " ---- " " - " " -- "
## [201] " - " " - " " ---- " " % " " ---- " " ---- " " ---- " " ---- "
## [209] " - " " - " " - " " - " " ---- " " ---- " " --- " " - "
## [217] " - " " -- " " - " " --- " " - " " - " " --- " " ---- "
## [225] " ---- " " - " " - " " --- " " - " " -- " " - " " --- "
## [233] " --- " " --- " " ---- " " ---- " " --- " " --- " " --- " " - "
## [241] " - " " - " " - " " - " " - " " - " " --- " " --- "
## [249] " --- " " - " " --- " " --- " " --- " " --- " " - " " --- "
## [257] " --- " " ---- " " ---- " " --- " " ---- " " ---- " " -- " " -- "
## [265] " -- " " --- " " --- " " --- " " --- " " -- " " ---- " " ---- "
## [273] " ---- " " - " " - " " ---- " " ---- " " ---- " " ---- " " --- "
## [281] " - " " -- " " ---- " " ---- " " --- " " - " " - " " --- "
## [289] " ---- " " --- " " ---- " " ---- " " --- " " - " " --- " " --- "
## [297] " - " " ---- " " ---- " " ---- " " -- " " -- " " -- " " --- "
## [305] " ---- " " ---- " " ---- " " ---- " " --- " " --- " " --- " " -- "
## [313] " ---- " " --- " " ---- " " ---- " " ---- " " -- " " ---- " " ---- "
## [321] " - " " --- " " ---- " " ---- " " -- " " ---- " " ---- " " - "
## [329] " ---- " " - " " ---- " " ---- " " - " " ---- " " ---- " " ---- "
## [337] " ---- " " - " " - " " ---- " " ---- " " ---- " " ---- " " - "
## [345] " ---- " " ---- " " ---- " " --- " " ---- " " ---- " " ---- " " ---- "
## [353] " --- " " --- " " ---- " " ---- " " ---- " " ---- " " -- " " ---- "
## [361] " ---- " " ---- " " - " " --- " " ---- " " ---- " " --- " " - "
## [369] " ---- " " ---- " " ---- " " ---- " " ---- " " ---- " " ---- " " ---- "
## [377] " ---- " " ---- " " --- " " --- " " ---- " " ---- " " - " " ---- "
## [385] " ---- " " --- " " ---- " " ---- " " --- " " ---- " " ---- " " ---- "
## [393] " --- " " --- " " ---- " " ---- " " --- " " --- " " ---- " " ---- "
## [401] " ---- " " --- " " ---- " " ---- " " -- " " ---- " " ---- " " --- "
## [409] " ---- " " --- " " - " " --- " " --- " " ---- " " ---- " " --- "
## [417] " --- " " --- " " --- " " ---- " " ---- " " ---- " " ---- " " ---- "
## [425] " ---- " " ---- " " --- " " ---- " " --- " " - " " --- " " ---- "
## [433] " ---- " " --- " " ---- " " ---- " " --- " " --- " " ---- " " --- "
## [441] " ---- " " ---- " " --- " " ---- " " ---- " " ---- " " --- " " --- "
## [449] " ---- " " ---- " " --- " " ---- " " ---- " " --- " " ---- " " - "
## [457] " -- " " --- " " ---- " " ---- " " --- " " --- " " ---- " " ---- "
## [465] " ---- " " ---- " " ---- " " ---- " " - " " ---- " " ---- " " ---- "
## [473] " --- " " ---- " " ---- " " -- " " --- " " ---- " " ---- " " --- "
## [481] " --- " " ---- " " --- " " ---- " " ---- " " --- " " --- " " ---- "
## [489] " ---- " " ---- " " ---- " " ---- " " --- " " --- " " --- " " ---- "
## [497] " --- " " --- " " --- " " --- " " -- " " - " " --- " " --- "
## [505] " --- " " --- " " -% " " --- " " - " " --- " " --- " " ---% "
## [513] " ---% " " ---% " " ---% " " % " " ---% " " ---% " " -- " " ---% "
## [521] " ---% " " ---% " " ---% " " ---% " " -- " " --- " " -- " " -- "
## [529] " ---- " " ---- " " - " " - " " -- " " - " " - " " - "
## [537] " - " " - " " ---- " " ---- " " - " " ---- " " - " " ---- "
## [545] " -- " " ---- " " ---- " " -- " " ---- " " - " " ---- " " -- "
## [553] " - " " % " " ---- " " - " " -- " " --- " " -- " " -% "
## [561] " --- " " -- " " -- " " - " " % " " -- " " % " " -% "
## [569] " --% " " - " " ---- " " -% " " - " " % " " ---- " " % "
## [577] " - " " - " " - " " - " " - " " -- " " - " " % "
## [585] " - " " % " " ---- " " - " " % " " - " " --- " " - "
## [593] " % " " - " " % " " - " " --- " " --- " " --- " " --- "
## [601] " - " " - " " - " " -% " " --- " " - " " -% " " --- "
## [609] " - " " - " " - " " -% " " -- " " -% " " -% " " -- "
## [617] " -% " " --- " " % " " -- " " % " " --- " " -% " " --- "
## [625] " - " " --- " " -% " " -- " " -- " " -- " " ---- " " - "
## [633] " -% " " -% " " --- " " -% " " - " " % " " - " " --- "
## [641] " - " " -- " " --- " " - " " --- " " --- " " --% " " --- "
## [649] " --- " " - " " --- " " % " " - " " --- " " -% " " --- "
## [657] " ---- " " - " " % " " --- " " -- " " -- " " --- " " --- "
## [665] " --- " " --- " " - " " --- " " -- " " - " " -- " " - "
cv_nopunct <- str_replace_all(cv_nopunct, " [-%]+ ", " ")
We have now removed every punctuation mark from the string except hyphens and percent signs that are attached to a word or number.
str_trunc(cv_nopunct, 2000, "right") # Check the first 2000 characters in cv_nopunct
## [1] "covid--9 vaccine a covid -9 vaccine is a vaccine intended to provide acquired immunity against severe acute respiratory syndrome coronavirus sars cov the virus causing coronavirus disease ---9 covid -9 prior to the covid -9 pandemic there was an established body of knowledge about the structure and function of coronaviruses causing diseases like severe acute respiratory syndrome sars and middle east respiratory syndrome mers which enabled accelerated development of various vaccine technologies during early on january the sars-cov-- genetic sequence data was shared through gisaid and by -9 march the global pharmaceutical industry announced a major commitment to address covid--9 covid--9 vaccination doses administered per people in phase iii trials several covid -9 vaccines have demonstrated efficacy as high as 95% in preventing symptomatic covid -9 infections as of march vaccines were authorized by at least one national regulatory authority for public use two rna vaccines the pfizer biontech vaccine and the moderna vaccine four conventional inactivated vaccines bbibp- corv coronavac covaxin and covivac four viral vector vaccines sputnik v the oxford astrazeneca vaccine convidecia and the johnson johnson vaccine and two protein subunit vaccines epivaccorona and rbd-dimer in total as of march --8 vaccine candidates were in various stages of development with 7- in clinical research including in map of countries by approval status phase i trials in phase i ii trials and -6 in phase iii development approved for general use mass many countries have implemented phased distribution plans that prioritize vaccination underway those at highest risk of complications such as the elderly and those at high e-a or equivalent granted mass risk of exposure and transmission such as healthcare workers as of vaccination underway march --6 -7 million doses of covid -9 vaccine have been e-a granted limited vaccination admini..."
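To see what the pattern " [-%]+ " does in isolation, here is a minimal sketch on a made-up sentence (the object toy is an illustrative assumption, not text from the PDF). Only runs of hyphens or percent signs with a space on both sides are replaced; a hyphen or percent sign attached to a word or number is left alone.

```r
library(stringr)

# Hypothetical toy sentence, not taken from the PDF
toy <- "efficacy of -- 95% in phase - iii double-blind % trials"

# Replace standalone runs of "-" or "%" (space on both sides) with one space
cleaned <- str_replace_all(toy, " [-%]+ ", " ")
cleaned
## [1] "efficacy of 95% in phase iii double-blind trials"
```

"95%" and "double-blind" survive because their symbols are not surrounded by spaces.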
# We may also want to remove standalone numbers that are neither attached to a word nor a year
str_extract_all(cv_nopunct, "[[:space:]]+[[:digit:]]{1,3}[[:space:]]+") # Whitespace, then a number of one to three digits, then whitespace
## [[1]]
## [1] " 6 " " 96 " " 8 " " 5 " " 8 " " 8 "
## [7] " 65 " " 5 " " 6 " " 8 " " 8 " " 6 "
## [13] " 8 " " 8 " " 5 " " 8 " " 8 " " 7 "
## [19] " 55 " " 65 " " 798 " " 96 " " 56 " " 6 "
## [25] " 5 " " 8 " " 89 " " 86 " " 75 " " 55 "
## [31] " 58 " " 9 " " 8 " " 6 " " 9 " " 7 "
## [37] " 7 " " 7 " " 7 " " 6 " " 7 " " 769 "
## [43] " 956 " " 9 " " 8 " " 698 " " 7 " " 768 "
## [49] " 7 " " 867 " " 6 " " 9 " " 5 " " 5 "
## [55] " 5 " " 5 " " 5 " " 895 " " 998 " " 589 "
## [61] " 8 " " 9 " " 5 " " 977 " " 9 " " 965 "
## [67] " 8 " " 695 " " 9 " " 676 " " 675 " " 9 "
## [73] " 9 " " 9 " " 797 "
cv_nonum <- str_replace_all(cv_nopunct, "[[:space:]][[:digit:]]{1,3}[[:space:]]", " ")
str_trunc(cv_nonum, 2000) # Some numbers may remain: matches cannot overlap, so of two numbers separated by a single space only the first is removed
## [1] "covid--9 vaccine a covid -9 vaccine is a vaccine intended to provide acquired immunity against severe acute respiratory syndrome coronavirus sars cov the virus causing coronavirus disease ---9 covid -9 prior to the covid -9 pandemic there was an established body of knowledge about the structure and function of coronaviruses causing diseases like severe acute respiratory syndrome sars and middle east respiratory syndrome mers which enabled accelerated development of various vaccine technologies during early on january the sars-cov-- genetic sequence data was shared through gisaid and by -9 march the global pharmaceutical industry announced a major commitment to address covid--9 covid--9 vaccination doses administered per people in phase iii trials several covid -9 vaccines have demonstrated efficacy as high as 95% in preventing symptomatic covid -9 infections as of march vaccines were authorized by at least one national regulatory authority for public use two rna vaccines the pfizer biontech vaccine and the moderna vaccine four conventional inactivated vaccines bbibp- corv coronavac covaxin and covivac four viral vector vaccines sputnik v the oxford astrazeneca vaccine convidecia and the johnson johnson vaccine and two protein subunit vaccines epivaccorona and rbd-dimer in total as of march --8 vaccine candidates were in various stages of development with 7- in clinical research including in map of countries by approval status phase i trials in phase i ii trials and -6 in phase iii development approved for general use mass many countries have implemented phased distribution plans that prioritize vaccination underway those at highest risk of complications such as the elderly and those at high e-a or equivalent granted mass risk of exposure and transmission such as healthcare workers as of vaccination underway march --6 -7 million doses of covid -9 vaccine have been e-a granted limited vaccination admini..."
# Run the same replacement once more to catch the numbers skipped in the first pass
cv_nonum <- str_replace_all(cv_nonum, "[[:space:]][[:digit:]]{1,3}[[:space:]]", " ")
str_trunc(cv_nonum, 2000)
## [1] "covid--9 vaccine a covid -9 vaccine is a vaccine intended to provide acquired immunity against severe acute respiratory syndrome coronavirus sars cov the virus causing coronavirus disease ---9 covid -9 prior to the covid -9 pandemic there was an established body of knowledge about the structure and function of coronaviruses causing diseases like severe acute respiratory syndrome sars and middle east respiratory syndrome mers which enabled accelerated development of various vaccine technologies during early on january the sars-cov-- genetic sequence data was shared through gisaid and by -9 march the global pharmaceutical industry announced a major commitment to address covid--9 covid--9 vaccination doses administered per people in phase iii trials several covid -9 vaccines have demonstrated efficacy as high as 95% in preventing symptomatic covid -9 infections as of march vaccines were authorized by at least one national regulatory authority for public use two rna vaccines the pfizer biontech vaccine and the moderna vaccine four conventional inactivated vaccines bbibp- corv coronavac covaxin and covivac four viral vector vaccines sputnik v the oxford astrazeneca vaccine convidecia and the johnson johnson vaccine and two protein subunit vaccines epivaccorona and rbd-dimer in total as of march --8 vaccine candidates were in various stages of development with 7- in clinical research including in map of countries by approval status phase i trials in phase i ii trials and -6 in phase iii development approved for general use mass many countries have implemented phased distribution plans that prioritize vaccination underway those at highest risk of complications such as the elderly and those at high e-a or equivalent granted mass risk of exposure and transmission such as healthcare workers as of vaccination underway march --6 -7 million doses of covid -9 vaccine have been e-a granted limited vaccination admini..."
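The reason a second pass can be needed is that regex matches never overlap: when two numbers are separated by a single space, that space is consumed by the first match and is no longer available to the second. The sketch below shows this on a made-up string (toy_num is an illustrative assumption). It also shows one possible single-pass alternative using lookarounds, which check for the surrounding whitespace without consuming it; this is offered as an option, not as the approach used in this tutorial.

```r
library(stringr)

# Hypothetical toy string with three adjacent standalone numbers
toy_num <- "a 1 2 3 b"
pat <- "[[:space:]][[:digit:]]{1,3}[[:space:]]"

pass1 <- str_replace_all(toy_num, pat, " ") # " 1 " and " 3 " match; "2" is skipped
pass1
## [1] "a 2 b"

pass2 <- str_replace_all(pass1, pat, " ")   # The second pass removes the leftover " 2 "
pass2
## [1] "a b"

# Alternative sketch: lookbehind/lookahead leave the whitespace in place,
# so every number is matched in one pass; str_squish() then tidies the spaces
look <- "(?<=[[:space:]])[[:digit:]]{1,3}(?=[[:space:]])"
one_pass <- str_squish(str_replace_all(toy_num, look, ""))
one_pass
## [1] "a b"
```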
The string is now preprocessed: non-ASCII characters, punctuation marks, and numbers have been removed. However, the replacements may have left runs of multiple whitespace characters behind.
cv_nospace <- str_squish(cv_nonum) # Collapse repeated whitespace again with str_squish()
str_trunc(cv_nospace, 2000)
## [1] "covid--9 vaccine a covid -9 vaccine is a vaccine intended to provide acquired immunity against severe acute respiratory syndrome coronavirus sars cov the virus causing coronavirus disease ---9 covid -9 prior to the covid -9 pandemic there was an established body of knowledge about the structure and function of coronaviruses causing diseases like severe acute respiratory syndrome sars and middle east respiratory syndrome mers which enabled accelerated development of various vaccine technologies during early on january the sars-cov-- genetic sequence data was shared through gisaid and by -9 march the global pharmaceutical industry announced a major commitment to address covid--9 covid--9 vaccination doses administered per people in phase iii trials several covid -9 vaccines have demonstrated efficacy as high as 95% in preventing symptomatic covid -9 infections as of march vaccines were authorized by at least one national regulatory authority for public use two rna vaccines the pfizer biontech vaccine and the moderna vaccine four conventional inactivated vaccines bbibp- corv coronavac covaxin and covivac four viral vector vaccines sputnik v the oxford astrazeneca vaccine convidecia and the johnson johnson vaccine and two protein subunit vaccines epivaccorona and rbd-dimer in total as of march --8 vaccine candidates were in various stages of development with 7- in clinical research including in map of countries by approval status phase i trials in phase i ii trials and -6 in phase iii development approved for general use mass many countries have implemented phased distribution plans that prioritize vaccination underway those at highest risk of complications such as the elderly and those at high e-a or equivalent granted mass risk of exposure and transmission such as healthcare workers as of vaccination underway march --6 -7 million doses of covid -9 vaccine have been e-a granted limited vaccination administered worldwide based on official reports from national 
health..."
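As a quick reminder of what str_squish() does, here is a small sketch on a made-up string (toy_ws is an illustrative assumption): it trims whitespace at both ends and collapses every internal run of whitespace, including line breaks, into a single space.

```r
library(stringr)

# Hypothetical toy string with leading, trailing, and repeated whitespace
toy_ws <- "  covid   vaccine \n doses  "

squished <- str_squish(toy_ws)
squished
## [1] "covid vaccine doses"
```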
Finally, we are ready to tokenize the string object, cv_nospace, into words separated by " ".
cv_tidy_word <- unlist(str_split(cv_nospace, " "))
class(cv_tidy_word)
## [1] "character"
length(cv_tidy_word)
## [1] 8370
cv_tidy_word[1:50]
## [1] "covid--9" "vaccine" "a" "covid"
## [5] "-9" "vaccine" "is" "a"
## [9] "vaccine" "intended" "to" "provide"
## [13] "acquired" "immunity" "against" "severe"
## [17] "acute" "respiratory" "syndrome" "coronavirus"
## [21] "sars" "cov" "the" "virus"
## [25] "causing" "coronavirus" "disease" "---9"
## [29] "covid" "-9" "prior" "to"
## [33] "the" "covid" "-9" "pandemic"
## [37] "there" "was" "an" "established"
## [41] "body" "of" "knowledge" "about"
## [45] "the" "structure" "and" "function"
## [49] "of" "coronaviruses"
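The unlist() call above matters because str_split() always returns a list, with one element per input string, even when the input is a single string. A minimal sketch on a made-up sentence (toy_sentence is an illustrative assumption):

```r
library(stringr)

# Hypothetical toy sentence
toy_sentence <- "covid vaccine is a vaccine"

pieces <- str_split(toy_sentence, " ") # A list of length 1 holding a character vector
class(pieces)
## [1] "list"

tokens <- unlist(pieces)               # Flatten the list into a plain character vector
tokens
## [1] "covid"   "vaccine" "is"      "a"       "vaccine"
```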
cv_tidy_word_freq <- sort(table(cv_tidy_word), decreasing = TRUE) # Create a table of word counts
cv_tidy_word_freq[1:50]
## cv_tidy_word
## the of and vaccine phase to
## 276 221 219 173 146 126
## in a for vaccines i -9
## 125 100 98 85 79 69
## -nited randomized covid controlled as ii
## 63 62 54 53 50 49
## efficacy preclinical states covid--9 an placebo-
## 46 46 45 44 42 40
## trial by with iii or that
## 40 39 37 35 35 35
## doses -8- on are trials rna
## 34 33 33 31 31 30
## --8 against subunit at be is
## 29 29 29 28 28 28
## mar may sars --6 --5 -5
## 28 28 28 27 26 26
## -6- china
## 26 26
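The same counting step can be followed more easily on a handful of tokens. In this sketch (toy_tokens is an illustrative assumption), table() counts each distinct word and sort() orders the counts; the words themselves are stored as the names of the resulting table.

```r
# Hypothetical toy vector of tokens
toy_tokens <- c("the", "vaccine", "the", "phase", "vaccine", "the")

freq <- sort(table(toy_tokens), decreasing = TRUE) # Counts per word, largest first
names(freq)     # The distinct words, in order of frequency
## [1] "the"     "vaccine" "phase"
as.vector(freq) # The counts without their names
## [1] 3 2 1
```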
One more step in tokenization is to remove stop words. Stop word lexicons are available in the tidytext package.
#install.packages("tidytext")
library(tidytext)
tidytext::stop_words
## # A tibble: 1,149 x 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # ... with 1,139 more rows
stop_words %>% dplyr::count(lexicon)
## # A tibble: 3 x 2
## lexicon n
## * <chr> <int>
## 1 onix 404
## 2 SMART 571
## 3 snowball 174
smart <- stop_words[stop_words$lexicon == "SMART", ] # Keep only the rows of the stop_words data frame that belong to the SMART lexicon
class(smart$word)
## [1] "character"
cv_tidy_nostop <- cv_tidy_word[!cv_tidy_word %in% smart$word] # %in% tests whether each element of cv_tidy_word appears in smart$word; the leading ! keeps only the elements that do not
cv_tidy_word[1:50]
## [1] "covid--9" "vaccine" "a" "covid"
## [5] "-9" "vaccine" "is" "a"
## [9] "vaccine" "intended" "to" "provide"
## [13] "acquired" "immunity" "against" "severe"
## [17] "acute" "respiratory" "syndrome" "coronavirus"
## [21] "sars" "cov" "the" "virus"
## [25] "causing" "coronavirus" "disease" "---9"
## [29] "covid" "-9" "prior" "to"
## [33] "the" "covid" "-9" "pandemic"
## [37] "there" "was" "an" "established"
## [41] "body" "of" "knowledge" "about"
## [45] "the" "structure" "and" "function"
## [49] "of" "coronaviruses"
smart$word[1:50]
## [1] "a" "a's" "able" "about" "above"
## [6] "according" "accordingly" "across" "actually" "after"
## [11] "afterwards" "again" "against" "ain't" "all"
## [16] "allow" "allows" "almost" "alone" "along"
## [21] "already" "also" "although" "always" "am"
## [26] "among" "amongst" "an" "and" "another"
## [31] "any" "anybody" "anyhow" "anyone" "anything"
## [36] "anyway" "anyways" "anywhere" "apart" "appear"
## [41] "appreciate" "appropriate" "are" "aren't" "around"
## [46] "as" "aside" "ask" "asking" "associated"
cv_tidy_nostop[1:50]
## [1] "covid--9" "vaccine" "covid" "-9"
## [5] "vaccine" "vaccine" "intended" "provide"
## [9] "acquired" "immunity" "severe" "acute"
## [13] "respiratory" "syndrome" "coronavirus" "sars"
## [17] "cov" "virus" "causing" "coronavirus"
## [21] "disease" "---9" "covid" "-9"
## [25] "prior" "covid" "-9" "pandemic"
## [29] "established" "body" "knowledge" "structure"
## [33] "function" "coronaviruses" "causing" "diseases"
## [37] "severe" "acute" "respiratory" "syndrome"
## [41] "sars" "middle" "east" "respiratory"
## [45] "syndrome" "mers" "enabled" "accelerated"
## [49] "development" "vaccine"
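The filtering step can be broken into two parts to make the logic visible. The sketch below uses a made-up token vector and a tiny hand-written stop word list (toy_words and toy_stop are illustrative assumptions, not the SMART lexicon): %in% returns one TRUE/FALSE per token, and ! flips it so that only non-stop-words are kept.

```r
# Hypothetical toy tokens and a hypothetical three-word stop list
toy_words <- c("a", "covid", "vaccine", "is", "a", "vaccine", "intended", "to", "provide")
toy_stop  <- c("a", "is", "to")

is_stop <- toy_words %in% toy_stop # TRUE where the token is a stop word
is_stop
## [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE

kept <- toy_words[!is_stop]        # Keep only the tokens that are not stop words
kept
## [1] "covid"    "vaccine"  "vaccine"  "intended" "provide"
```

Duplicates such as "vaccine" are preserved, which is what the later frequency table relies on.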
cv_tidy_nostop_freq <- sort(table(cv_tidy_nostop), decreasing = TRUE)
names(cv_tidy_nostop_freq)[1:50]
## [1] "vaccine" "phase" "vaccines" "-9" "-nited"
## [6] "randomized" "covid" "controlled" "ii" "efficacy"
## [11] "preclinical" "states" "covid--9" "placebo-" "trial"
## [16] "iii" "doses" "-8-" "trials" "rna"
## [21] "--8" "subunit" "mar" "sars" "--6"
## [26] "--5" "-5" "-6-" "china" "dose"
## [31] "south" "---" "--9" "-5-" "clinical"
## [36] "development" "double-blind" "emergency" "safety" "cov"
## [41] "vector" "--7" "-9-" "johnson" "research"
## [46] "inactivated" "-7-" "dec" "nov" "virus"
cv_tidy_nostop_freq[1:50]
## cv_tidy_nostop
## vaccine phase vaccines -9 -nited randomized
## 173 146 85 69 63 62
## covid controlled ii efficacy preclinical states
## 54 53 49 46 46 45
## covid--9 placebo- trial iii doses -8-
## 44 40 40 35 34 33
## trials rna --8 subunit mar sars
## 31 30 29 29 28 28
## --6 --5 -5 -6- china dose
## 27 26 26 26 26 26
## south --- --9 -5- clinical development
## 26 25 25 25 25 25
## double-blind emergency safety cov vector --7
## 24 24 24 23 23 22
## -9- johnson research inactivated -7- dec
## 22 22 22 21 20 20
## nov virus
## 20 20
Let’s create a wordcloud from the table of word frequencies.
library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(words = names(cv_tidy_nostop_freq), # Unique words
          freq = cv_tidy_nostop_freq, # Frequency of each word
          min.freq = 10, # Words with a frequency below 10 are not plotted
          random.order = FALSE, # Plot words in decreasing frequency, so the most frequent sit in the middle
          rot.per = 0.1, # Proportion of words rotated by 90 degrees
          scale = c(3, 0.3), # Range of word sizes (largest, smallest)
          colors = brewer.pal(8, "Dark2")) # Retrieve 8 colors from the "Dark2" palette
We now have a much better wordcloud that conveys more information about COVID-19 vaccines.
Extract all the tokens in which a hyphen (-) connects two or more words from the vector of words, cv_tidy_word, and sort the counts of the extracted tokens in descending order. Save the result to an object named “HyphenWordTable” and export it as a text file using the following R input. Then submit the file.
HyphenWordTable <- ???
write.table(HyphenWordTable, file="HyphenWordTable.txt", sep=",", quote = FALSE, row.names = FALSE)
1st Hint: We can modify a regex pattern used above, the one that matched words containing a punctuation mark, so that it extracts words containing a hyphen.
2nd Hint: What’s the stringr function that extracts the string patterns matched by a given regex?
3rd Hint: What’s the function that builds a table of counts for each word?