R practice for Text (Pre-)Processing

Package "stringr"

Let me remind you of the functions in the package stringr covered last time.

Function	Description	Similar Base Functions
`str_length()`	number of characters	`nchar()`
`str_split()`	split up a string into pieces	`strsplit()`
`str_c()`	string concatenation	`paste()`
`str_squish()`	removes any redundant whitespace
`str_detect()`	finds a particular pattern of characters
`str_view_all()`	show the matching result on the actual screen

All functions in stringr start with "str_", followed by a term that describes the task they perform.
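As a quick refresher, the functions in the table can be tried on a toy string. This is only a minimal sketch; the string `x` is made up for illustration and is not part of the Wikipedia text used later.

```r
library(stringr)

# Hypothetical example string with redundant blanks
x <- "  Text   pre-processing  in R  "

str_length(x)                        # number of characters, blanks included
nchar(x) == str_length(x)            # TRUE: same count as the base R function
str_squish(x)                        # "Text pre-processing in R"
str_split(str_squish(x), " ")        # list holding one character vector of words
str_c("text", "mining", sep = "_")   # "text_mining"
str_detect(x, "R")                   # TRUE
```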

Useful stringr functions for pattern matching

Most string functions work with regex, a concise language for describing patterns of text. The following functions are useful for text pre-processing.

Function	Description
`str_which()`	Returns all positions of a matching pattern in a string vector
`str_subset()`	Returns all elements that contain a matching pattern in a string vector
`str_trunc()`	Truncates a string
`str_locate()`	Locates the first position of a matching pattern from a string
`str_locate_all()`	Locates all positions of a matching pattern from a string
`str_extract()`	Extracts the first matching pattern from a string
`str_extract_all()`	Extracts all matching patterns from a string
`str_replace()`	Replaces the first matching pattern in a string
`str_replace_all()`	Replaces all matching patterns in a string
`str_remove()`	Removes the first matching pattern in a string
`str_remove_all()`	Removes all matching patterns in a string
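The difference between the "first match" and "all matches" variants is easiest to see on a small vector. The sketch below uses a made-up vector `trials`; it is not taken from the PDF.

```r
library(stringr)

# Hypothetical character vector for illustration
trials <- c("Phase I-II trial", "no phase given", "Phase III trial, Phase II follow-up")

str_which(trials, "Phase")             # 1 3: positions of the matching elements
str_subset(trials, "Phase")            # the matching elements themselves
str_locate(trials[1], "I+")            # start and end of the first run of "I"
str_extract(trials, "I+")              # "I" NA "III": first match per element
str_extract_all(trials, "I+")          # list with every match per element
str_replace(trials, "Phase", "phase")  # only the first "Phase" in each element changes
str_remove_all(trials, "[[:punct:]]")  # drops every hyphen and comma
```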

Re-work on retrieving text from Wikipedia

library(stringr)
library(pdftools)

## Using poppler version 0.73.0

cv_text <- pdf_text("COVID-19_Vaccine.pdf")
class(cv_text)

## [1] "character"

length(cv_text)

## [1] 93

cv_string <- str_c(cv_text, collapse = " ") # Collapse a character vector, cv_text, into a single string
length(cv_string)

## [1] 1

Now we have a single string in which text from a Wikipedia page about COVID-19 vaccines is concatenated.

Let’s preprocess the string

First, we want to remove everything in the References section

str_locate_all(cv_string, "References") # Locate every position of the pattern "References" in the string

## [[1]]
##       start    end
## [1,]   3702   3711
## [2,] 123159 123168

str_trunc(cv_string, width=100, side="right") # Truncate a character string to a total width of 100 characters (the trailing "..." counts toward the width) and drop everything to the right

## [1] "COVID-19 vaccine\r\nA COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst ..."

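cv_trunc <- str_trunc(cv_string, width=123158, side="right", ellipsis="") # Keep the first 123158 characters; ellipsis="" stops str_trunc from overwriting the last three with "..."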
str_locate_all(cv_trunc, "References")

## [[1]]
##      start  end
## [1,]  3702 3711

str_length(cv_trunc)

## [1] 123158

str_length(cv_string)

## [1] 286792

Now we know where the literal pattern "References" appears in the string cv_string, and we have truncated the string by removing everything from the second match (the heading of the References section) onward.
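The cut-off position does not have to be typed in by hand. The following is a minimal sketch on a made-up string `doc`, assuming that the last occurrence of "References" is the section heading; the names `doc`, `hits`, `cut_at` and `body` are hypothetical.

```r
library(stringr)

# Toy stand-in for cv_string: "References" occurs in the contents list and as a heading
doc <- "Contents History References Body text about vaccines. References [1] Some source"

hits   <- str_locate_all(doc, "References")[[1]]  # matrix with columns "start" and "end"
cut_at <- hits[nrow(hits), "start"] - 1           # last character before the final match
body   <- str_sub(doc, 1, cut_at)                 # keep everything up to that position

str_count(body, "References")  # 1: only the mention in the contents list remains
str_detect(body, "\\[1\\]")    # FALSE: the reference list is gone
```

Unlike str_trunc, str_sub never appends an ellipsis, so the result contains only original characters.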

Next, it seems we need to deal with whitespace (\n, \r\n, or multiple blanks). Remember that the character class [[:space:]] matches all whitespace characters, including line breaks.
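To see how [[:space:]] relates to str_squish, the sketch below collapses whitespace by hand on a made-up string `raw` and compares the result with str_squish. It is only an illustration of the idea, not a description of how str_squish is implemented.

```r
library(stringr)

# Hypothetical string with a Windows line break, a Unix line break and repeated blanks
raw <- "COVID-19 vaccine\r\nA vaccine   intended to\n provide immunity "

# Replace each run of whitespace with a single blank, then trim both ends
manual <- str_trim(str_replace_all(raw, "[[:space:]]+", " "))

manual
identical(manual, str_squish(raw))  # TRUE for this input
```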

str_trunc(cv_trunc, 100, "right")

## [1] "COVID-19 vaccine\r\nA COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst ..."

str_squish(str_trunc(cv_trunc, 100, "right"))

## [1] "COVID-19 vaccine A COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity against ..."

cv_tidy <- str_squish(cv_trunc) # Remove any redundant whitespace
str_trunc(cv_tidy, 100, "right")

## [1] "COVID-19 vaccine A COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity against se..."

It should look tidier than before. What do we need to do with the string object now? It seems we should normalize it, that is, standardize all letters to either lower case or upper case.

So, we may want to use the function tolower to convert the characters of a string to lower case.

Before doing so, we may need to remove all non-ASCII characters using the negated character class [^[:ascii:]].

Let’s check what non-ASCII characters are in the string.

str_extract_all(cv_tidy, " [[:word:]]+[^[:ascii:]]+[[:word:]]+") # Extract every word that contains one or more non-ASCII characters between word characters; guess what "+" in regex does

## [[1]]
##   [1] " COVID<U+2011>19"    " COVID<U+2011>19"    " COVID<U+2011>19"   
##   [4] " COVID<U+2011>19"    " Pfizer<U+2013>BioNTech" " Oxford<U+2013>AstraZeneca"
##   [7] " I<U+2013>II"        " COVID<U+2011>19"    " COVID<U+2011>19"   
##  [10] " years<U+2014>and"   " COVID<U+2011>19"    " COVID<U+2011>19"   
##  [13] " COVID<U+2011>19"    " II<U+2013>III"      " COVID<U+2011>19"   
##  [16] " COVID<U+2011>19"    " 3<U+2013>6"         " COVID<U+2011>19"   
##  [19] " COVID<U+2011>19"    " COVID<U+2011>19"    " COVID<U+2011>19"   
##  [22] " COVID<U+2011>19"    " COVID<U+2011>19"    " Pfizer<U+2013>BioNTech"
##  [25] " Pfizer<U+2013>BioNTech" " COVID<U+2011>19"    " Pfizer<U+2013>BioNTech"
##  [28] " COVID<U+2011>19"    " COVID<U+2011>19"    " COVID<U+2011>19"   
##  [31] " COVID<U+2011>19"    " SARS<U+2011>CoV"    " COVID<U+2011>19"   
##  [34] " COVID<U+2011>19"    " COVID<U+2011>19"    " COVID<U+2011>19"   
##  [37] " SARS<U+2011>CoV"    " Oxford<U+2013>AstraZeneca" " SARS<U+2011>CoV"   
##  [40] " COVID<U+2011>19"    " COVID<U+2011>19"    " I<U+2013>II"       
##  [43] " COVID<U+2011>19"    " II<U+2013>III"      " Pfizer<U+2013>BioNTech"
##  [46] " Oxford<U+2013>AstraZeneca" " Pfizer<U+2013>BioNTech" " 3<U+2013>4"        
##  [49] " Oxford<U+2013>AstraZeneca" " 2<U+2013>8"         " 4<U+2013>12"       
##  [52] " 2<U+2013>8"         " SARS<U+2011>CoV"    " 3<U+2013>4"        
##  [55] " 2<U+2013>8"         " SARS<U+2011>CoV"    " 2<U+2013>8"        
##  [58] " 2<U+2013>8"         " Mar<U+2013>Dec"     " 2<U+2013>8"        
##  [61] " SARS<U+2011>CoV"    " 2<U+2013>8"         " 3<U+2013>4"        
##  [64] " 2<U+2013>8"         " SARS<U+2011>CoV"    " COVID<U+2011>19"   
##  [67] " I<U+2013>III"       " I<U+2013>II"        " I<U+2013>II"       
##  [70] " I<U+2013>II"        " SARS<U+2011>CoV"    " I<U+2013>II"       
##  [73] " SARS<U+2011>CoV"    " I<U+2013>II"        " I<U+2013>II"       
##  [76] " SARS<U+2011>CoV"    " II<U+2013>III"      " II<U+2013>III"     
##  [79] " II<U+2013>III"      " 2020<U+2013>Jan"    " 2021<U+2013>Mar"   
##  [82] " II<U+2013>III"      " 18<U+2013>55"       " 65<U+2013>85"      
##  [85] " Mar<U+2013>May"     " Aug<U+2013>Dec"     " 2020<U+2013>Jan"   
##  [88] " SARS<U+2011>CoV"    " II<U+2013>III"      " I<U+2013>II"       
##  [91] " II<U+2013>III"      " I<U+2013>II"        " I<U+2013>II"       
##  [94] " I<U+2013>II"        " I<U+2013>II"        " Duke<U+2013>NUS"   
##  [97] " I<U+2013>II"        " I<U+2013>II"        " I<U+2013>II"       
## [100] " I<U+2013>II"        " SARS<U+2011>CoV"    " I<U+2013>II"       
## [103] " I<U+2013>II"        " I<U+2013>II"        " I<U+2013>II"       
## [106] " COVID<U+2011>19"    " I<U+2013>II"        " SARS<U+2011>CoV"   
## [109] " I<U+2013>II"        " SARS<U+2011>CoV"    " I<U+2013>II"       
## [112] " I<U+2013>II"        " I<U+2013>II"        " I<U+2013>II"       
## [115] " SARS<U+2011>CoV"    " I<U+2013>II"        " I<U+2013>II"       
## [118] " I<U+2013>II"        " SARS<U+2011>CoV"    " Jul<U+2013>Oct"    
## [121] " Pfizer<U+2013>BioNTech" " Pfizer<U+2013>BioNTech" " I<U+2013>IIa"      
## [124] " II<U+2013>III"      " COVID<U+2011>19"    " COVID<U+2011>19"   
## [127] " Pfizer<U+2013>BioNTech" " SARS<U+2011>CoV"    " COVID<U+2011>19"   
## [130] " Pfizer<U+2013>BioNTech" " Oxford<U+2013>AstraZeneca" " 72<U+2013>100"     
## [133] " SARS<U+2011>CoV"    " Oxford<U+2013>AstraZeneca" " 42<U+2013>89"      
## [136] " 71<U+2013>91"       " COVID<U+2011>19"    " COVID<U+2011>19"   
## [139] " COVID<U+2011>19"    " COVID<U+2011>19"    " COVID<U+2011>19"   
## [142] " COVID<U+2011>19"    " COVID<U+2011>19"    " COVID<U+2011>19"   
## [145] " COVID<U+2011>19"    " COVID<U+2011>19"    " COVID<U+2011>19"   
## [148] " COVID<U+2011>19"    " COVID<U+2011>19"    " COVID<U+2011>19"   
## [151] " COVID<U+2011>19"    " COVID<U+2011>19"    " COVID<U+2011>19"   
## [154] " COVID<U+2011>19"    " COVID<U+2011>19"
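The "+" quantifier matches the preceding item one or more times. A small sketch on a made-up string `s` shows the effect; it uses `\\w` for word characters and Unicode escapes for the non-ASCII symbols, both of which are choices made for this example only.

```r
library(stringr)

# Hypothetical string: em dash, en dash, then "almost equal" and "plus-minus" side by side
s <- "years\u2014and Pfizer\u2013BioNTech \u2248\u00b15"

single <- str_extract_all(s, "[^[:ascii:]]")[[1]]   # one element per non-ASCII character
runs   <- str_extract_all(s, "[^[:ascii:]]+")[[1]]  # one element per run of adjacent non-ASCII characters

length(single)  # 4
length(runs)    # 3: the two adjacent symbols form a single match

# Words that have non-ASCII characters between word characters
str_extract_all(s, "\\w+[^[:ascii:]]+\\w+")[[1]]
```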

str_extract_all(cv_tidy, " [[:word:]]+[-‑–—]+[[:word:]]+") # Words joined by a hyphen, non-breaking hyphen, en dash, or em dash

## [[1]]
##   [1] " 2019"                " 2020"                " 2020"               
##   [4] " SARS-CoV"            " COVID-19"            " COVID-19"           
##   [7] " 100"                 " 2021"                " RBD-Dimer"          
##  [10] " 2021"                " 308"                 " EUA"                
##  [13] " 2021"                " 436"                 " EUA"                
##  [16] " AstraZeneca-Oxford"  " 2021"                " Pfizer-BioNTech"    
##  [19] " EUA"                 " EUA"                 " 600"                
##  [22] " 500"                 " 2021"                " 2020"               
##  [25] " high-income"         " 501"                 " 2003"               
##  [28] " non-human"           " 2005"                " 2006"               
##  [31] " 2020"                " COVID-19"            " MERS-CoV"           
##  [34] " 2020"                " viral-vectored"      " adenoviral-vectored"
##  [37] " BVRS-GamVac"         " MVA-vectored"        " 2020"               
##  [40] " COVID-19"            " COVID-19"            " 2020"               
##  [43] " COVID-19"            " G20"                 " 2020"               
##  [46] " cross-discipline"    " 2020"                " multi-site"         
##  [49] " low-rate"            " 2020"                " disease-fighting"   
##  [52] " 2020"                " fast-track"          " 2019"               
##  [55] " 2020"                " COVID-19"            " 2020"               
##  [58] " 2020"                " 2020"                " 2020"               
##  [61] " 2020"                " 2020"                " 2020"               
##  [64] " high-risk"           " 2020"                " EUA"                
##  [67] " BNT162b2"            " 2020"                " 2020"               
##  [70] " BBIBP-CorV"          " 2020"                " EUA"                
##  [73] " mRNA-1273"           " 2021"                " 2020"               
##  [76] " non-replicating"     " nucleoside-modified" " COVID-19"           
##  [79] " 2021"                " Pfizer-BioNTech"     " COVID-19"           
##  [82] " 2021"                " non-replicating"     " vector-based"       
##  [85] " COVID-19"            " non-replicating"     " 2021"               
##  [88] " COVID-19"            " COVID-19"            " one-shot"           
##  [91] " Ad26"                " 2021"                " BBIBP-CorV"         
##  [94] " COVID-19"            " 2021"                " COVID-19"           
##  [97] " RBD-Dimer"           " V451"                " placebo-controlled" 
## [100] " WHO-recognized"      " RBD-Dimer"           " COVID-19"           
## [103] " placebo-controlled"  " 2020"                " 2020"               
## [106] " 2020"                " COVID-19"            " Ad26"               
## [109] " 2020"                " 2021"                " COVID-19"           
## [112] " placebo-controlled"  " 2020"                " 2021"               
## [115] " BBIBP-CorV"          " double-blind"        " placebo-controlled" 
## [118] " 2020"                " 2021"                " Double-blind"       
## [121] " placebo-controlled"  " 100"                 " 2020"               
## [124] " 2021"                " 2020"                " 2021"               
## [127] " COVID-19"            " placebo-controlled"  " 2020"               
## [130] " 2020"                " 2020"                " 2022"               
## [133] " COVID-19"            " double-blinded"      " Ad26"               
## [136] " placebo-controlled"  " 2021"                " 2020"               
## [139] " 2023"                " Ad5-nCoV"            " multi-center"       
## [142] " 2021"                " COVID-19"            " 2020"               
## [145] " 2020"                " 2021"                " 2020"               
## [148] " BBV152"              " observer-blinded"    " peer-reviewed"      
## [151] " 2020"                " 2021"                " 2020"               
## [154] " 2021"                " ZF2001"              " double-blind"       
## [157] " 2020"                " 2022"                " Double-blind"       
## [160] " placebo-controlled"  " COVID-19"            " 2020"               
## [163] " 2021"                " 2020"                " 2021"               
## [166] " FINLAY-FR"           " Non-randomized"      " parallel-group"     
## [169] " double-blind"        " 2021"                " 2020"               
## [172] " 2021"                " COVID-19"            " 2020"               
## [175] " 2021"                " observer-blind"      " 2020"               
## [178] " 2021"                " double-blinded"      " double-blinded"     
## [181] " single-center"       " single-center"       " 2020"               
## [184] " 2021"                " QazCovid-in"         " 2020"               
## [187] " 2020"                " 2021"                " ZyCoV-D"            
## [190] " double-blind"        " 2021"                " 2020"               
## [193] " 2021"                " Virus-like"          " plant-based"        
## [196] " AS03"                " Event-driven"        " 10x"                
## [199] " 2020"                " COVID-19"            " 2021"               
## [202] " 2020"                " 2021"                " SCB-2019"           
## [205] " 2020"                " AS03"                " 2021"               
## [208] " double-blind"        " 2021"                " 2022"               
## [211] " UB-612"              " Open-label"          " 2020"               
## [214] " 2021"                " Observer-blind"      " IIb-III"            
## [217] " Double-Blind"        " 2021"                " 2023"               
## [220] " GRAd-COV2"           " observer-blind"      " 2021"               
## [223] " 24-week"             " GRAd-COV2"           " 2020"               
## [226] " Single-center"       " Double-blind"        " Double-Blinded"     
## [229] " 2020"                " Double-Blinded"      " 2020"               
## [232] " 2021"                " MVC-COV1901"         " open-labeled"       
## [235] " double-blinded"      " single-center"       " multi-center"       
## [238] " 2020"                " 2021"                " multi-regional"     
## [241] " 2020"                " 2021"                " double-blind"       
## [244] " double-blind"        " 2020"                " 2021"               
## [247] " 2020"                " 2021"                " double-blind"       
## [250] " 2020"                " 2021"                " 2021"               
## [253] " 2021"                " 2020"                " 2022"               
## [256] " 2021"                " INO-4800"            " Open-label"         
## [259] " Ib-IIa"              " 2020"                " 2020"               
## [262] " 2021"                " 2022"                " Ib-IIa"             
## [265] " AG0302-COVID"        " double-blind"        " single-center"      
## [268] " 2020"                " 2020"                " 2021"               
## [271] " I-IIa"               " SARS-CoV"            " SARS-CoV"           
## [274] " AS03"                " 2020"                " 2022"               
## [277] " IIBR-100"            " 2020"                " 2021"               
## [280] " ARCT-021"            " COV19"               " double-blinded"     
## [283] " observer-blind"      " 2020"                " 2022"               
## [286] " VBI-2902a"           " Virus-like"          " observer-blind"     
## [289] " 2021"                " 2022"                " MRT5500"            
## [292] " First-in"            " CoV-2"               " 2021"               
## [295] " 2022"                " EuCorVac-19"         " observer-blind"     
## [298] " 2021"                " 2022"                " GX-19"              
## [301] " I-II"                " 210"                 " double-blind"       
## [304] " 2020"                " 2021"                " VLA2001"            
## [307] " multi-center"        " double-blinded"      " 2020"               
## [310] " 2021"                " TAK-919"             " observer-blind"     
## [313] " 2021"                " 2022"                " TAK-019"            
## [316] " observer-blind"      " 2021"                " 2022"               
## [319] " COVID-eVax"          " in-human"            " 2021"               
## [322] " 2020"                " 2023"                " LV-SMENP"           
## [325] " 2020"                " 2023"                " ChulaCov19"         
## [328] " Dose-finding"        " 2021"                " LNP-nCoVsaRNA"      
## [331] " 200"                 " 2020"                " 2021"               
## [334] " COVAX-19"            " 2020"                " 2021"               
## [337] " HGC019"              " 2021"                " COVID-19"           
## [340] " 2020"                " 2021"                " 2021"               
## [343] " 2022"                " PTX-COVID19"         " 2021"               
## [346] " COVAC-2"             " 2021"                " 2022"               
## [349] " COVI-VAC"            " First-in"            " double-blind"       
## [352] " dose-escalation"     " 2020"                " 2021"               
## [355] " Open-label"          " 2021"                " double-blind"       
## [358] " 2020"                " 2021"                " Double-blind"       
## [361] " dose-ranging"        " 2021"                " 2022"               
## [364] " BBV154"              " double-blinded"      " 2021"               
## [367] " MV-014-212"          " double-blinded"      " 2021"               
## [370] " 2022"                " S-268019"            " double-blind"       
## [373] " parallel-group"      " 2020"                " 2022"               
## [376] " GBP510"              " 2021"                " CIGB-66"            
## [379] " double-blind"        " 2020"                " 2021"               
## [382] " KBP-201"             " First-in"            " 2020"               
## [385] " 2021"                " AdimrSC-2f"          " open-label"         
## [388] " dose-finding"        " 2020"                " ERUCOV-VAC"         
## [391] " COVID-19"            " ERUCOV-VAC"          " 2020"               
## [394] " 2021"                " AKS-452"             " Single-center"      
## [397] " open-label"          " 2021"                " GLS-5310"           
## [400] " Double-blind"        " 2020"                " 2022"               
## [403] " VAX-001"             " observer-blind"      " 2021"               
## [406] " COH04S1"             " 2020"                " 2022"               
## [409] " 2021"                " NBP2001"             " 2020"               
## [412] " 2021"                " CoVac-1"             " 2020"               
## [415] " 2021"                " bacTRL-Spike"        " observer-blind"     
## [418] " 2020"                " 2022"                " 2021"               
## [421] " CORVax12"            " Open-label"          " 2020"               
## [424] " 2021"                " ChAdV68-S"           " Open-label"         
## [427] " 2021"                " 2022"                " Double-blind"       
## [430] " in-Human"            " 2021"                " 2022"               
## [433] " VXA-CoV2"            " Double-blind"        " in-Human"           
## [436] " 2020"                " SARS-CoV"            " double-blind"       
## [439] " dose-ranging"        " 2020"                " SARS-CoV"           
## [442] " CoV-2"               " Long-term"           " COVID-19"           
## [445] " Gam-COVID"           " Gam-COVID"           " 2-8"                
## [448] " nCoV-19"             " middle-income"       " nCoV-19"            
## [451] " Virus-like"          " double-blind"        " COVID-19"           
## [454] " COVID-19"            " COVID-19"            " SARS-CoV"           
## [457] " COVID-19"            " COVID-19"            " BBIBP-CorV"         
## [460] " 100"                 " SARS-CoV"            " COVID-19"           
## [463] " COVID-19"            " 2021"                " SARS-CoV"           
## [466] " 2020"                " non-B"               " 501"                
## [469] " 501"                 " 2021"                " two-thirds"         
## [472] " 501"                 " Ad26"                " COVID-19"           
## [475] " 2021"                " COVID-19"            " 501"                
## [478] " AZD1222"             " COVID-19"            " 2021"               
## [481] " 2020"                " protein-based"       " vector-based"       
## [484] " 2021"                " 436"                 " 2020"               
## [487] " first-service"       " taxpayer-funded"     " 2020"               
## [490] " late-stage"          " densely-populated"   " 2020"               
## [493] " SARS-CoV"            " COVID-19"            " 2024"               
## [496] " side-effects"        " COVID-19"            " high-income"        
## [499] " 2020"                " pre-sold"            " high-income"        
## [502] " 2021"                " Director-General"    " higher-income"      
## [505] " lowest-income"       " COVID-19"            " 2020"               
## [508] " long-standing"       " COVID-19"            " 100"                
## [511] " 200"                 " 910"                 " 902"                
## [514] " 824"                 " 744"                 " 732"                
## [517] " 631"                 " 2021"                " 549"                
## [520] " COVID-19"            " 505"                 " Anti-vaccination"   
## [523] " 436"                 " COVID-19"            " 420"                
## [526] " 417"                 " COVID-19"            " 402"                
## [529] " 334"                 " 322"                 " 2020"               
## [532] " 311"                 " 309"                 " 2021"               
## [535] " 240"                 " 212"                 " 209"                
## [538] " 2009"                " 209"

cv_hyphen <- str_replace_all(cv_tidy, "[-‑–—]", "-") # Replace every hyphen-like character (non-breaking hyphen, en dash, em dash) with an ASCII hyphen

str_extract_all(cv_hyphen, " [[:word:]]+[^[:ascii:]]+[[:word:]]+")

## [[1]]
##  [1] " Pfizer<U+2013>BioNTech" " Oxford<U+2013>AstraZeneca" " I<U+2013>II"       
##  [4] " years<U+2014>and"   " II<U+2013>III"      " Pfizer<U+2013>BioNTech"
##  [7] " Pfizer<U+2013>BioNTech" " Pfizer<U+2013>BioNTech" " SARS<U+2011>CoV"   
## [10] " SARS<U+2011>CoV"    " Oxford<U+2013>AstraZeneca" " SARS<U+2011>CoV"   
## [13] " I<U+2013>II"        " II<U+2013>III"      " Pfizer<U+2013>BioNTech"
## [16] " Oxford<U+2013>AstraZeneca" " Pfizer<U+2013>BioNTech" " Oxford<U+2013>AstraZeneca"
## [19] " SARS<U+2011>CoV"    " SARS<U+2011>CoV"    " Mar<U+2013>Dec"    
## [22] " SARS<U+2011>CoV"    " SARS<U+2011>CoV"    " I<U+2013>III"      
## [25] " I<U+2013>II"        " I<U+2013>II"        " I<U+2013>II"       
## [28] " SARS<U+2011>CoV"    " I<U+2013>II"        " SARS<U+2011>CoV"   
## [31] " I<U+2013>II"        " I<U+2013>II"        " SARS<U+2011>CoV"   
## [34] " II<U+2013>III"      " II<U+2013>III"      " II<U+2013>III"     
## [37] " II<U+2013>III"      " 65<U+2013>85"       " Mar<U+2013>May"    
## [40] " Aug<U+2013>Dec"     " SARS<U+2011>CoV"    " II<U+2013>III"     
## [43] " I<U+2013>II"        " II<U+2013>III"      " I<U+2013>II"       
## [46] " I<U+2013>II"        " I<U+2013>II"        " I<U+2013>II"       
## [49] " Duke<U+2013>N"      " I<U+2013>II"        " I<U+2013>II"       
## [52] " I<U+2013>II"        " I<U+2013>II"        " SARS<U+2011>CoV"   
## [55] " I<U+2013>II"        " I<U+2013>II"        " I<U+2013>II"       
## [58] " I<U+2013>II"        " I<U+2013>II"        " SARS<U+2011>CoV"   
## [61] " I<U+2013>II"        " SARS<U+2011>CoV"    " I<U+2013>II"       
## [64] " I<U+2013>II"        " I<U+2013>II"        " I<U+2013>II"       
## [67] " SARS<U+2011>CoV"    " I<U+2013>II"        " I<U+2013>II"       
## [70] " I<U+2013>II"        " SARS<U+2011>CoV"    " Jul<U+2013>Oct"    
## [73] " Pfizer<U+2013>BioNTech" " Pfizer<U+2013>BioNTech" " I<U+2013>IIa"      
## [76] " II<U+2013>III"      " Pfizer<U+2013>BioNTech" " SARS<U+2011>CoV"   
## [79] " Pfizer<U+2013>BioNTech" " Oxford<U+2013>AstraZeneca" " SARS<U+2011>CoV"   
## [82] " Oxford<U+2013>AstraZeneca"

str_extract_all(cv_hyphen, "[^[:ascii:]]+")

## [[1]]
##   [1] "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2014>" "<U+2011>" "<U+2011>"
##  [16] "<U+2013>" "<U+2013>" "<U+2011>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>"
##  [31] "<U+2013>" "<U+2013>" "<U+2011>" "<U+2013>" "<U+2011>" "<U+2013>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>"
##  [46] "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2013>" "<U+2013>"
##  [61] "<U+2013>" "<U+2013>" "±" "°" "<U+2013>" "<U+2013>" "≤" "°" "<U+2013>" "<U+2013>" "<U+2013>" "°" "<U+2013>" "<U+2013>" "<U+2013>"
##  [76] "°" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "°" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "±" "°" "<U+2013>" "<U+2013>"
##  [91] "°" "<U+2013>" "<U+2013>" "°" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "°" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "°" "<U+2013>"
## [106] "<U+2013>" "<U+2013>" "<U+2013>" "°" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>"
## [121] "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>"
## [136] "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>"
## [151] "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>"
## [166] "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>"
## [181] "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>"
## [196] "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>"
## [211] "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>"
## [226] "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>"
## [241] "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>"
## [256] "<U+2013>" "<U+2013>" "<U+2212>" "<U+2212>" "°" "<U+2212>" "°" "°" "°" "°" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>"
## [271] "<U+2011>" "<U+2011>" "<U+2011>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2013>" "<U+2248>"
## [286] "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2248>" "<U+2248>" "<U+2248>" "<U+2248>" "<U+2248>" "<U+2013>"
## [301] "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2212>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>" "<U+2013>" "<U+2248>"
## [316] "<U+2248>" "<U+2248>" "°" "°" "<U+2248>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2013>" "<U+2011>" "<U+2011>" "<U+2011>"
## [331] "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2011>" "<U+2013>" "<U+2013>" "<U+2011>"
## [346] "<U+2011>" "<U+2011>" "<U+2011>"

cv_ascii <- str_replace_all(cv_hyphen, "[^[:ascii:]]+", " ") # Replace every run of non-ASCII characters with a blank " "

cv_ascii_lower <- tolower(cv_ascii) # Convert all characters to lower case
# If tolower() gives an error message, try the stringr function str_to_lower() instead.

str_trunc(cv_ascii, 1000)

## [1] "COVID--9 vaccine A COVID -9 vaccine is a vaccine intended to provide acquired immunity against severe acute respiratory syndrome coronavirus - (SARS CoV -), the virus causing coronavirus disease ---9 (COVID -9). Prior to the COVID -9 pandemic, there was an established body of knowledge about the structure and function of coronaviruses causing diseases like severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS), which enabled accelerated development of various vaccine technologies during early ----.[-] On -- January ----, the SARS-CoV-- genetic sequence data was shared through GISAID, and by -9 March, the global pharmaceutical industry announced a major commitment to address COVID--9.[-] COVID--9 vaccination doses administered per --- people In Phase III trials, several COVID -9 vaccines have demonstrated efficacy as high as 95% in preventing symptomatic COVID -9 infections. As of March ----, -- vaccines were authorized by at least one national regulator..."

str_trunc(cv_ascii_lower, 1000, "right")

## [1] "covid--9 vaccine a covid -9 vaccine is a vaccine intended to provide acquired immunity against severe acute respiratory syndrome coronavirus - (sars cov -), the virus causing coronavirus disease ---9 (covid -9). prior to the covid -9 pandemic, there was an established body of knowledge about the structure and function of coronaviruses causing diseases like severe acute respiratory syndrome (sars) and middle east respiratory syndrome (mers), which enabled accelerated development of various vaccine technologies during early ----.[-] on -- january ----, the sars-cov-- genetic sequence data was shared through gisaid, and by -9 march, the global pharmaceutical industry announced a major commitment to address covid--9.[-] covid--9 vaccination doses administered per --- people in phase iii trials, several covid -9 vaccines have demonstrated efficacy as high as 95% in preventing symptomatic covid -9 infections. as of march ----, -- vaccines were authorized by at least one national regulator..."
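In the output above, digits such as the "1" in "COVID-19" have turned into "-" ("covid--9"), and the en dashes were still present after the replacement step. One plausible explanation, which is an assumption and not confirmed by anything in this session, is that the dash characters typed into the pattern were not preserved in the native encoding, so the character class ended up matching other characters. A more portable sketch writes the dashes as Unicode escapes; the names `dash_class`, `x` and `y` are hypothetical.

```r
library(stringr)

# ASCII hyphen plus non-breaking hyphen (U+2011), en dash (U+2013) and em dash (U+2014)
dash_class <- "[-\u2011\u2013\u2014]"

# Toy string with all three non-ASCII dashes and some digits that must stay untouched
x <- "COVID\u201119, Pfizer\u2013BioNTech, years\u2014and, SARS-CoV-2 in 2019"
y <- str_replace_all(x, dash_class, "-")

y
str_detect(y, "[^[:ascii:]]")  # FALSE: no non-ASCII dash is left
```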

cv_tidy <- str_to_lower(cv_tidy)

Now, let’s think about how to deal with punctuation and numbers.

# Check which punctuation marks need to be removed
unlist(str_extract_all(cv_ascii_lower, "[^[:space:]]*[[:punct:]]{1,}[^[:space:]]*"))[1:100] # Remember why we apply the unlist function to the result from str_extract_all

##   [1] "covid--9"           "-9"                 "-"                 
##   [4] "(sars"              "-),"                "---9"              
##   [7] "(covid"             "-9)."               "-9"                
##  [10] "pandemic,"          "(sars)"             "(mers),"           
##  [13] "----.[-]"           "--"                 "----,"             
##  [16] "sars-cov--"         "gisaid,"            "-9"                
##  [19] "march,"             "covid--9.[-]"       "covid--9"          
##  [22] "---"                "trials,"            "-9"                
##  [25] "95%"                "-9"                 "infections."       
##  [28] "----,"              "--"                 "use:"              
##  [31] "(the"               "vaccine),"          "(bbibp-"           
##  [34] "corv,"              "coronavac,"         "covaxin,"          
##  [37] "covivac),"          "(sputnik"           "v,"                
##  [40] "vaccine,"           "convidecia,"        "&"                 
##  [43] "vaccine),"          "(epivaccorona"      "rbd-dimer).[-]"    
##  [46] "total,"             "----,"              "--8"               
##  [49] "development,"       "7-"                 "research,"         
##  [52] "--"                 "trials,"            "--"                
##  [55] "trials,"            "-6"                 "development.[-]"   
##  [58] "use,"               "complications,"     "elderly,"          
##  [61] "e-a"                "(or"                "equivalent)"       
##  [64] "granted,"           "transmission,"      "workers.[-]"       
##  [67] "--"                 "----,"              "--6.-7"            
##  [70] "-9"                 "e-a"                "granted,"          
##  [73] "use,"               "agencies.[5]"       "astrazeneca-oxford"
##  [76] "-"                  "----,"              "pfizer-biontech"   
##  [79] "-.-"                "doses,"             "v,"                
##  [82] "sinopharm,"         "e-a"                "granted,"          
##  [85] "sinovac,"           "&"                  "-"                 
##  [88] "each."              "e-a"                "6--"               
##  [91] "5--"                "----.[6][7]"        "----,"             
##  [94] "--"                 "countries,[8]"      "high-income"       
##  [97] "--%"                "world's"            "population.[9]"    
## [100] "5--.v-"
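As a reminder of why unlist() is applied above: str_extract_all() returns a list with one element per input string, so the result has to be flattened before it can be indexed as a plain character vector. A minimal sketch, using a made-up two-element vector (toy_text is a hypothetical name, not part of the Wikipedia data):

```r
library(stringr)

# Hypothetical toy input: two short strings instead of the full document
toy_text <- c("pandemic, (sars)", "95% of cases")

# Same pattern as above: any whitespace-delimited token containing punctuation
punct_token <- "[^[:space:]]*[[:punct:]]{1,}[^[:space:]]*"

res <- str_extract_all(toy_text, punct_token)
class(res)   # a list, one element per input string
length(res)  # 2, because toy_text has two elements

# unlist() flattens the list into a single character vector
flat <- unlist(res)
flat
```

With the single string cv_ascii_lower the list has only one element, but it is still a list, which is why the `[1:100]` index is applied after unlist().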

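It seems we have several patterns of strings with punctuation that we want to remove from the text:

1) Citation marks: "\\[\\d+\\]"
2) Decimal numbers (number, point, number): "\\d+\\.\\d+"
3) Possessive apostrophes: "[[:word:]]+[']s" (the code below matches only the "'s" followed by whitespace, "[']s[[:space:]]", so the word itself is kept)

These three patterns are to be replaced with a single space.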
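Before running the combined regex on the whole document, the patterns can be tried on a short string. This is only a sketch: toy_sentence is an invented example, and str_squish() is added at the end purely to make the result easier to read.

```r
library(stringr)

# Hypothetical toy sentence containing all three patterns
toy_sentence <- "the who's plan [12] covers 65.7 percent of adults.[3]"

# Citation marks | decimal numbers | possessive 's followed by whitespace
to_blank <- "\\[\\d+\\]|\\d+\\.\\d+|[']s[[:space:]]"

# Matches are returned in the order they occur in the string
matched <- str_extract_all(toy_sentence, to_blank)[[1]]
matched

# Replace every match with a single space, then collapse repeated spaces
cleaned <- str_squish(str_replace_all(toy_sentence, to_blank, " "))
cleaned
```

Note that the sentence-final period survives, because none of the three patterns targets it.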

str_extract_all(cv_ascii_lower, "\\[\\d+\\]|\\d+\\.\\d+|[']s[[:space:]]") # Check first the patterns are matched by our regex

## [[1]]
##  [1] "[5]"  "[6]"  "[7]"  "[8]"  "'s "  "[9]"  "'s "  "[55]" "[56]" "[57]"
## [11] "[55]" "[58]" "[59]" "[57]" "[65]" "[66]" "[67]" "'s "  "[68]" "[69]"
## [21] "[75]" "[76]" "[77]" "[78]" "[79]" "[85]" "[86]" "[87]" "[88]" "[89]"
## [31] "[95]" "[96]" "[97]" "[98]" "[99]" "'s "  "'s "  "'s "  "[87]" "[88]"
## [41] "[89]" "[87]" "[95]" "[96]" "'s "  "[98]" "[99]" "65.7" "'s "  "97.5"
## [51] "'s "  "9.5"  "8.6"  "8.9"  "59.8" "8.8"  "'s "  "8.8"  "8.6"  "5.6" 
## [61] "8.9"  "6.5"  "9.7"  "8.9"  "8.6"  "9.9"  "7.5"  "6.8"  "'s "

cv_nocite <- str_replace_all(cv_ascii_lower, "\\[\\d+\\]|\\d+\\.\\d+|[']s[[:space:]]", " ")
unlist(str_extract_all(cv_nocite, "[^[:space:]]*[[:punct:]]{1,}[^[:space:]]*"))[1:100]

##   [1] "covid--9"           "-9"                 "-"                 
##   [4] "(sars"              "-),"                "---9"              
##   [7] "(covid"             "-9)."               "-9"                
##  [10] "pandemic,"          "(sars)"             "(mers),"           
##  [13] "----.[-]"           "--"                 "----,"             
##  [16] "sars-cov--"         "gisaid,"            "-9"                
##  [19] "march,"             "covid--9.[-]"       "covid--9"          
##  [22] "---"                "trials,"            "-9"                
##  [25] "95%"                "-9"                 "infections."       
##  [28] "----,"              "--"                 "use:"              
##  [31] "(the"               "vaccine),"          "(bbibp-"           
##  [34] "corv,"              "coronavac,"         "covaxin,"          
##  [37] "covivac),"          "(sputnik"           "v,"                
##  [40] "vaccine,"           "convidecia,"        "&"                 
##  [43] "vaccine),"          "(epivaccorona"      "rbd-dimer).[-]"    
##  [46] "total,"             "----,"              "--8"               
##  [49] "development,"       "7-"                 "research,"         
##  [52] "--"                 "trials,"            "--"                
##  [55] "trials,"            "-6"                 "development.[-]"   
##  [58] "use,"               "complications,"     "elderly,"          
##  [61] "e-a"                "(or"                "equivalent)"       
##  [64] "granted,"           "transmission,"      "workers.[-]"       
##  [67] "--"                 "----,"              "--6.-7"            
##  [70] "-9"                 "e-a"                "granted,"          
##  [73] "use,"               "agencies."          "astrazeneca-oxford"
##  [76] "-"                  "----,"              "pfizer-biontech"   
##  [79] "-.-"                "doses,"             "v,"                
##  [82] "sinopharm,"         "e-a"                "granted,"          
##  [85] "sinovac,"           "&"                  "-"                 
##  [88] "each."              "e-a"                "6--"               
##  [91] "5--"                "----."              "----,"             
##  [94] "--"                 "countries,"         "high-income"       
##  [97] "--%"                "population."        "5--.v-"            
## [100] "-9,"

However, what about punctuation marks that are part of a word? For example, the hyphen in “covid-19”, “sars-cov-2”, or “pfizer-biontech”. And what about percentages such as 95%? We may not want to remove the hyphen or the percent sign from these strings, so for convenience we remove every punctuation mark except the hyphen and %.

# How can we form a regex that matches every punctuation character except the hyphen and %?
str_extract_all(cv_nocite, "[^[:alnum:][:space:]-%]") # A negation of any letter/number/whitespace/hyphen/percent character

## [[1]]
##    [1] "("  ")"  ","  "("  ")"  "."  ","  "("  ")"  "("  ")"  ","  "."  "[" 
##   [15] "]"  ","  ","  ","  "."  "["  "]"  ","  "."  ","  ":"  "("  ")"  "," 
##   [29] "("  ","  ","  ","  ")"  ","  "("  ","  ","  ","  "&"  ")"  ","  "(" 
##   [43] ")"  "."  "["  "]"  ","  ","  ","  ","  ","  ","  "."  "["  "]"  "," 
##   [57] ","  ","  "("  ")"  ","  ","  "."  "["  "]"  ","  "."  ","  ","  "." 
##   [71] ","  "."  ","  ","  ","  ","  ","  "&"  "."  "."  ","  ","  "."  "." 
##   [85] ","  "."  "["  "]"  ","  ","  "("  ")"  ","  ","  "."  "["  "]"  "(" 
##   [99] ")"  "("  ")"  "."  "["  "]"  "["  "]"  "."  ","  "."  "["  "]"  "[" 
##  [113] "]"  "["  "]"  ","  "."  "["  "]"  "["  "]"  "."  "["  "]"  ","  "." 
##  [127] "["  "]"  "["  "]"  ","  "("  ")"  "["  "]"  ","  ":"  "("  ","  ")" 
##  [141] "("  ")"  "."  "["  "]"  ","  "."  "["  "]"  "("  ")"  ","  ","  "," 
##  [155] "."  "["  "]"  "["  "]"  ","  ":"  "["  "]"  "["  "]"  "("  ")"  "," 
##  [169] ","  ","  ","  "("  ","  "."  ")"  ","  ","  ","  ","  ","  "."  "." 
##  [183] "("  ","  ")"  "."  ","  ","  "."  "["  "]"  ","  ","  ","  "."  "[" 
##  [197] "]"  "["  "]"  ","  ","  "["  "]"  ","  ","  "."  "["  "]"  "["  "]" 
##  [211] ","  "."  "["  "]"  "["  "]"  "["  "]"  ","  "."  "["  "]"  "["  "]" 
##  [225] ","  ","  ","  "."  "["  "]"  "["  "]"  "\"" "\"" ","  ";"  "["  "]" 
##  [239] "["  "]"  "."  "["  "]"  ","  "."  "["  "]"  "["  "]"  "["  "]"  "," 
##  [253] "("  ")"  "."  "["  "]"  "["  "]"  "."  ","  ":"  "("  "\"" "\"" ")" 
##  [267] ","  ","  ","  "."  "["  "]"  "\"" "&"  "("  ")"  "\"" "\"" ","  "," 
##  [281] ","  "\"" "\"" "."  "\"" ","  "."  "["  "]"  "."  ","  "\"" "\"" "," 
##  [295] "["  "]"  "."  "["  "]"  "\"" "\"" "."  "["  "]"  ","  ","  "."  "[" 
##  [309] "]"  ","  ","  "."  "."  "["  "]"  ","  "("  ")"  ","  ","  "."  "[" 
##  [323] "]"  "["  "]"  ","  "."  "["  "]"  "["  "]"  ","  "&"  ","  ","  "(" 
##  [337] ")"  ","  ","  ","  "."  "["  "]"  "["  "]"  ","  "["  "]"  ","  "." 
##  [351] "["  "]"  ","  "."  ","  ","  ","  ","  "."  "["  "]"  "("  ")"  "," 
##  [365] "."  "("  ")"  ","  ","  "."  "["  "]"  ","  "("  ")"  ","  ","  "." 
##  [379] "["  "]"  ","  "["  "]"  "("  ","  ")"  "."  "["  "]"  "["  "]"  "," 
##  [393] "."  "["  "]"  ","  ","  "."  "("  ")"  "."  ","  "("  ")"  ","  "." 
##  [407] "["  "]"  "["  "]"  "["  "]"  ","  "["  "]"  "."  ","  "."  "["  "]" 
##  [421] ","  "("  ")"  "("  ")"  "."  ","  ","  "."  ","  "."  "["  "]"  "[" 
##  [435] "]"  "."  "["  "]"  "("  ":"  "("  ")"  ")"  ","  ","  ","  ","  "(" 
##  [449] ")"  ","  "("  ")"  ","  ","  "."  "["  "]"  "["  "]"  "["  "]"  "[" 
##  [463] "]"  ","  "\"" "\"" "."  "["  "]"  "["  "]"  "["  "]"  ","  ","  "," 
##  [477] ","  ","  "."  "["  "]"  "["  "]"  ","  ","  "("  ")"  "."  ","  "," 
##  [491] "."  "."  "["  "]"  "["  "]"  "["  "]"  "["  "]"  "."  ","  "."  "[" 
##  [505] "]"  "["  "]"  ","  "."  "["  "]"  "."  ","  "."  ","  "."  "["  "]" 
##  [519] ","  ","  "."  "["  "]"  ","  ","  "["  "]"  ","  ","  "&"  "."  "&" 
##  [533] ";"  "."  "["  "]"  "["  "]"  "&"  "."  ","  "."  "["  "]"  ","  "," 
##  [547] "["  "]"  "["  "]"  "["  "]"  "["  "]"  ","  "."  "["  "]"  "."  "[" 
##  [561] "]"  "["  "]"  "."  ","  "."  "["  "]"  ","  "."  "["  "]"  "["  "]" 
##  [575] "."  "["  "]"  ","  "."  "["  "]"  "["  "]"  ","  "["  "]"  "["  "]" 
##  [589] "["  "]"  "["  "]"  "["  "]"  "["  "]"  ","  "["  "]"  "["  "]"  "," 
##  [603] "."  "["  "]"  "."  "["  "]"  ","  "."  "["  "]"  ","  ","  "("  ")" 
##  [617] ","  "."  "["  "]"  "["  "]"  ","  ","  ","  ","  "."  "["  "]"  "," 
##  [631] ","  "("  "\"" "\"" "\"" "\"" ")"  ","  "."  "["  "]"  "["  "]"  "," 
##  [645] ","  ","  ","  ","  "."  "["  "]"  "["  "]"  "["  "]"  "\"" "\"" "." 
##  [659] "["  "]"  "["  "]"  ","  ","  ","  "."  "["  "]"  "["  "]"  "."  "." 
##  [673] "&"  ","  ","  "/"  "("  ")"  "("  ")"  "["  "]"  "("  ","  ")"  "(" 
##  [687] ")"  ","  "("  "("  ")"  ","  "("  ")"  "("  ")"  "."  ","  ")"  "[" 
##  [701] "]"  "["  "]"  "["  "]"  "."  "["  "]"  "["  "]"  ","  "["  "]"  "[" 
##  [715] "]"  ","  "["  "]"  "("  ","  ")"  "("  ")"  "("  ")"  "("  ")"  "(" 
##  [729] "["  "]"  ","  ")"  "["  "]"  ","  ","  "."  "["  "]"  ","  "."  "." 
##  [743] "["  "]"  ","  ","  ","  "["  "]"  ","  "["  "]"  "["  "]"  ","  "[" 
##  [757] "]"  "["  "]"  "["  "]"  "["  "]"  "("  ","  ")"  "("  ")"  ","  ";" 
##  [771] "("  ")"  "("  ")"  "["  "]"  "["  "]"  "["  "]"  "("  ")"  ","  "," 
##  [785] "["  "]"  "["  "]"  ","  ","  ","  "."  "["  "]"  "."  "["  "]"  "," 
##  [799] "("  ","  ")"  ","  "["  "]"  ","  "["  "]"  "["  "]"  "["  "]"  "(" 
##  [813] ","  ")"  "("  ")"  ":"  ","  "("  ")"  "("  ")"  ","  ","  "["  "]" 
##  [827] "["  "]"  ","  "."  "."  "["  "]"  ","  ","  ","  ","  "["  "]"  "," 
##  [841] "["  "]"  ","  "["  "]"  "["  "]"  "["  "]"  "["  "]"  "["  "]"  "[" 
##  [855] "]"  "("  ","  ")"  "("  ")"  ","  "("  ")"  "("  ")"  "["  "]"  "," 
##  [869] "["  "]"  "."  "."  "."  "["  "]"  "."  "."  "["  "]"  "."  ","  "," 
##  [883] "."  "["  "]"  ","  "("  ","  ")"  ";"  "["  "]"  ","  "("  ","  ")" 
##  [897] ";"  "("  ","  ")"  ";"  "["  "]"  "("  ","  ")"  "["  "]"  "["  "]" 
##  [911] "("  ","  ")"  "("  ")"  "["  "]"  "["  "]"  "("  "("  ")"  ";"  "(" 
##  [925] ")"  ","  ","  "["  "]"  "["  "]"  ","  ","  ")"  "["  "]"  ","  "," 
##  [939] "."  "["  "]"  "."  "["  "]"  ","  "&"  "["  "]"  "("  ","  ")"  "(" 
##  [953] ")"  ","  "["  "]"  ","  "("  ")"  "("  ","  ")"  "["  "]"  "("  "&" 
##  [967] ")"  ","  "."  "&"  ","  "."  ","  "."  "["  "]"  "["  "]"  ","  "," 
##  [981] ","  ","  ","  ","  ","  ","  ","  ","  "("  ")"  "["  "]"  "("  "," 
##  [995] ")"  "("  ")"  ","  "["  "]"  ","  "("  ")"  "("  ","  ")"  "["  "]" 
## [1009] ","  ","  "."  ","  "."  "."  "["  "]"  ","  ";"  ","  ";"  ","  "," 
## [1023] "["  "]"  ","  ","  ";"  "["  "]"  ";"  "["  "]"  ";"  "["  "]"  "[" 
## [1037] "]"  "["  "]"  "("  ")"  "["  "]"  "("  ","  ")"  "("  ")"  ","  "," 
## [1051] "("  ")"  "("  ")"  "["  "]"  ","  "["  "]"  "["  "]"  "."  "["  "]" 
## [1065] "."  "["  "]"  ","  "."  "["  "]"  "["  "]"  "("  ","  ")"  "("  ")" 
## [1079] "("  ")"  "["  "]"  "("  ")"  ","  "["  "]"  ","  ","  ","  "["  "]" 
## [1093] "("  ")"  "["  "]"  "("  ","  ")"  "("  ")"  "("  ")"  ","  "("  ")" 
## [1107] "."  "["  "]"  "["  "]"  ","  "."  "["  "]"  ","  ","  ","  ","  "," 
## [1121] ","  "["  "]"  "["  "]"  "["  "]"  "["  "]"  "("  ","  ")"  "("  ")" 
## [1135] ","  "("  ")"  "["  "]"  "["  "]"  ","  "."  "["  "]"  "["  "]"  "[" 
## [1149] "]"  ","  "["  "]"  "("  ")"  "("  ")"  "("  ")"  ","  "["  "]"  "[" 
## [1163] "]"  "["  "]"  "/"  "("  ")"  "("  ")"  "["  "]"  "("  ","  ")"  "[" 
## [1177] "]"  "["  "]"  "["  "]"  ","  "("  ","  "["  "]"  ","  "["  "]"  "[" 
## [1191] "]"  ")"  "."  "["  "]"  "["  "]"  ","  "("  ","  ")"  ";"  ","  "," 
## [1205] ","  "("  ","  ")"  "["  "]"  "("  ")"  "("  ")"  "("  ","  ")"  "[" 
## [1219] "]"  "["  "]"  "["  "]"  "["  "]"  "("  ")"  ":"  ","  ","  "."  "," 
## [1233] ":"  "."  ","  ":"  "."  ","  ":"  ","  ","  "("  ")"  ":"  "."  ":" 
## [1247] "."  ":"  "."  ":"  "."  ","  "("  "("  ")"  "("  ")"  "("  ")"  ")" 
## [1261] "["  "]"  "("  ","  ")"  "["  "]"  "["  "]"  "["  "]"  ","  "["  "]" 
## [1275] "/"  ":"  "("  ")"  ":"  ","  ","  ","  ","  ","  ","  ","  "."  "," 
## [1289] ","  "("  ")"  ":"  ","  ","  ","  ","  ","  ","  ","  ","  "."  "," 
## [1303] "("  ")"  ","  "("  ")"  ","  "("  ")"  ","  "("  ")"  "["  "]"  "[" 
## [1317] "]"  "("  ")"  "("  ","  ")"  ","  ","  ","  ","  ","  ","  ","  "," 
## [1331] ","  "["  "]"  "("  ","  ")"  "("  "["  "]"  ")"  ","  ","  ","  "[" 
## [1345] "]"  ","  "["  "]"  "["  "]"  "("  ")"  "("  ","  ")"  "["  "]"  "(" 
## [1359] ","  ")"  ","  ";"  ","  ","  ","  "["  "]"  "?"  ","  "["  "]"  "[" 
## [1373] "]"  "["  "]"  ","  "["  "]"  ","  "["  "]"  "("  ")"  ","  "("  "," 
## [1387] "("  ","  ")"  ")"  ","  ","  "("  ","  ")"  "["  "]"  ","  "."  "," 
## [1401] "["  "]"  "["  "]"  "["  "]"  ","  "("  "("  ")"  "("  ","  ")"  "," 
## [1415] "["  "]"  "["  "]"  ")"  ","  ","  ","  ","  ","  ","  ","  ","  "," 
## [1429] ","  ","  ","  ","  ","  ","  "("  ")"  "["  "]"  ","  ","  "("  "," 
## [1443] ")"  ","  "["  "]"  "["  "]"  ","  "("  ","  ")"  ":"  ","  ","  "." 
## [1457] "("  ","  ")"  ":"  ","  ","  ","  ","  "."  ","  "["  "]"  "["  "]" 
## [1471] "("  ")"  "["  "]"  ","  "("  "("  ","  ")"  "("  ","  ")"  "["  "]" 
## [1485] "["  "]"  ":"  ","  ","  ")"  ","  "."  ","  ","  "."  "."  ","  "(" 
## [1499] "("  ")"  "["  "]"  ")"  "("  ","  ")"  "["  "]"  "["  "]"  ","  "," 
## [1513] ","  "("  ")"  ":"  ","  ","  ","  "."  ","  ","  "."  "("  ","  ")" 
## [1527] ":"  ","  ","  ","  "."  ","  "("  ")"  "("  ")"  "["  "]"  "("  "," 
## [1541] ")"  "["  "]"  ","  ","  ","  ","  ","  ","  ","  ","  "("  ")"  "[" 
## [1555] "]"  "."  "("  ")"  "("  ","  ")"  "["  "]"  ","  ","  ","  ","  "." 
## [1569] "."  ","  ","  "["  "]"  "("  "("  ")"  "("  ")"  "["  "]"  "["  "]" 
## [1583] ","  ","  ")"  "["  "]"  "["  "]"  ","  ","  ","  ","  "["  "]"  "(" 
## [1597] ")"  "("  ")"  ","  ","  "["  "]"  ","  "["  "]"  "["  "]"  "["  "]" 
## [1611] "("  "("  ")"  ","  ","  ","  "("  ")"  "("  ")"  ":"  ","  ")"  "," 
## [1625] "."  ","  "("  ")"  ":"  "."  "["  "]"  "."  "["  "]"  ","  ","  "," 
## [1639] "("  "["  "]"  ")"  "["  "]"  "["  "]"  "("  ")"  "("  ")"  "."  "," 
## [1653] "["  "]"  "("  ")"  "/"  ","  ","  ","  ","  "["  "]"  ","  ","  "," 
## [1667] ","  "("  ","  ")"  "("  ")"  ":"  "("  ")"  "."  "["  "]"  "("  ")" 
## [1681] ":"  "."  "["  "]"  ","  "("  ")"  "["  "]"  "("  ")"  "("  ","  ")" 
## [1695] "["  "]"  ","  "("  ")"  "["  "]"  "["  "]"  ","  "("  ")"  ","  "/" 
## [1709] "("  ")"  ":"  ","  ","  "("  ")"  ":"  "["  "]"  "("  ")"  ":"  "," 
## [1723] ","  "["  "]"  ","  ","  "("  ")"  "["  "]"  "("  ")"  ","  ","  "," 
## [1737] ","  "["  "]"  "("  ")"  "["  "]"  "."  ","  "["  "]"  "("  ")"  "," 
## [1751] ","  ","  ","  "("  ")"  "["  "]"  "["  "]"  "["  "]"  ","  "["  "]" 
## [1765] "("  ")"  "("  ")"  ":"  ","  ","  ","  ","  "["  "]"  "["  "]"  "(" 
## [1779] ")"  ","  ","  ","  "["  "]"  "("  ")"  "["  "]"  ","  ","  ","  "[" 
## [1793] "]"  "("  ")"  "["  "]"  ","  ","  "."  ","  "("  ")"  "["  "]"  ":" 
## [1807] ","  "."  ":"  ","  ","  "."  ","  "/"  "["  "]"  "("  "("  ")"  "," 
## [1821] "["  "]"  ")"  "["  "]"  "("  "("  ")"  ")"  ","  "["  "]"  "("  ")" 
## [1835] "["  "]"  "."  ","  "["  "]"  "("  ")"  ","  "("  ")"  "("  ")"  "," 
## [1849] "["  "]"  "("  "("  ")"  "["  "]"  ")"  ","  "["  "]"  ","  "("  ")" 
## [1863] "["  "]"  ","  "?"  ","  "["  "]"  "["  "]"  "["  "]"  ","  "."  "," 
## [1877] "("  ")"  "("  ")"  "["  "]"  ","  "["  "]"  ","  ","  "["  "]"  "[" 
## [1891] "]"  "("  ")"  "."  "["  "]"  "["  "]"  ","  ","  "["  "]"  "["  "]" 
## [1905] "("  ")"  "["  "]"  ","  "["  "]"  "("  ")"  "("  "["  "]"  ")"  "," 
## [1919] "("  ")"  "."  "["  "]"  ","  ","  ","  ","  ","  "("  ")"  "("  ")" 
## [1933] "["  "]"  ","  ","  ","  "["  "]"  "("  ")"  ","  ","  ","  ","  "." 
## [1947] "["  "]"  ","  "("  ")"  "["  "]"  ","  ","  ","  "."  ","  "["  "]" 
## [1961] "("  ")"  "("  ")"  "["  "]"  "["  "]"  ","  ","  "."  ","  "["  "]" 
## [1975] "("  ")"  "."  "["  "]"  ","  ","  "."  ","  "("  ")"  "["  "]"  "," 
## [1989] ","  ","  "."  ","  "."  "."  "("  ")"  "["  "]"  ","  ","  ","  "." 
## [2003] ","  "("  ")"  "["  "]"  ","  ","  ","  "."  ","  "("  ")"  "["  "]" 
## [2017] ","  ","  ","  ","  ","  "("  ")"  "["  "]"  ","  ","  ","  "."  "," 
## [2031] "("  ")"  "["  "]"  ","  ","  "."  ","  "("  ")"  "["  "]"  ","  "," 
## [2045] ","  "."  ","  "."  "("  ")"  "["  "]"  ","  ","  "."  ","  "("  ")" 
## [2059] "."  "["  "]"  ","  ","  ","  "."  ","  "("  ")"  "["  "]"  "."  "," 
## [2073] "("  ")"  "("  ")"  "["  "]"  ","  ","  "."  ","  "("  ")"  "."  "." 
## [2087] "["  "]"  ","  ","  ","  "."  ","  "("  ")"  "["  "]"  ","  ","  "," 
## [2101] "."  ","  "("  ")"  "["  "]"  ","  ","  "."  ","  "("  ")"  "["  "]" 
## [2115] ","  ","  "."  ","  "("  ")"  "&"  "["  "]"  "."  ","  "("  ")"  "," 
## [2129] "["  "]"  ","  ","  "."  ","  "("  ")"  "."  "["  "]"  ","  ","  "," 
## [2143] "."  ","  "("  ")"  "["  "]"  ","  ","  ","  "."  ","  "/"  "("  "?" 
## [2157] "["  "]"  "["  "]"  "("  ")"  ","  ","  ","  ")"  ","  ","  ","  "." 
## [2171] "."  ","  "["  "]"  "/"  "["  "]"  "&"  ","  "["  "]"  "/"  ","  "." 
## [2185] "("  ")"  ","  "["  "]"  ","  "("  ")"  ","  "/"  "."  "["  "]"  "." 
## [2199] "."  "."  "["  "]"  "["  "]"  "."  "."  "("  ")"  ","  "("  ")"  "." 
## [2213] "["  "]"  "["  "]"  "."  "."  "."  "["  "]"  "."  "["  "]"  "."  "[" 
## [2227] "]"  "."  ":"  "."  "."  "["  "]"  "."  "."  "."  "["  "]"  "."  "." 
## [2241] "["  "]"  "."  "["  "]"  "("  ")"  "."  "."  ","  ","  "."  "["  "]" 
## [2255] ","  ","  ","  "."  ","  "["  "]"  "("  ")"  "."  "("  ")"  "."  "(" 
## [2269] ":"  ";"  ":"  "("  ")"  ")"  "["  "]"  "."  "["  "]"  "["  "]"  "," 
## [2283] ","  ","  "."  "["  "]"  ","  ","  "."  "."  ","  "."  ","  "."  "[" 
## [2297] "]"  ":"  "["  "]"  "["  "]"  "("  ")"  "["  "]"  "["  "]"  "("  ")" 
## [2311] "["  "]"  "("  ")"  "("  ")"  "["  "]"  "("  ")"  "["  "]"  "("  "," 
## [2325] ")"  "["  "]"  "("  ")"  "["  "]"  "["  "]"  "["  "]"  "("  ")"  "[" 
## [2339] "]"  "["  "]"  "["  "]"  "["  "]"  "["  "]"  "["  "]"  "("  ")"  "[" 
## [2353] "]"  "("  ")"  "("  ")"  "["  "]"  "("  ")"  "["  "]"  "&"  "["  "]" 
## [2367] "("  ")"  "["  "]"  "("  ")"  "("  ")"  "["  "]"  "("  ")"  "["  "]" 
## [2381] "["  "]"  "["  "]"  "."  ":"  ","  ","  ","  ","  ","  ","  ","  "," 
## [2395] ","  ","  ","  ","  ","  ","  ","  ","  ","  "."  ":"  "."  "."  ":" 
## [2409] ","  ","  ","  ","  ","  "("  ")"  ","  ","  ","  ","  ","  "."  "." 
## [2423] "."  "."  "."  ","  "("  ")"  "."  "."  "."  "."  "."  "."  "["  "]" 
## [2437] "."  "["  "]"  ","  "."  "["  "]"  ","  ","  "."  "."  "."  ","  "." 
## [2451] "["  "]"  "."  "["  "]"  "."  "."  "."  ","  "."  "."  "."  "."  "[" 
## [2465] "]"  "~"  ","  "~"  "."  "."  "."  ","  "~"  "\"" "\"" "."  "."  "." 
## [2479] "["  "]"  "."  "."  "("  "."  "."  ")"  "."  "["  "]"  ","  "."  "," 
## [2493] "."  "["  "]"  ","  "&"  ","  "."  "."  ","  "."  "["  "]"  ","  "." 
## [2507] "."  "["  "]"  ","  "\"" "\"" "."  "["  "]"  ","  "."  "["  "]"  "[" 
## [2521] "]"  ","  "."  "["  "]"  ","  "."  "["  "]"  ","  "."  "["  "]"  "[" 
## [2535] "]"  "."  "["  "]"  ","  "\"" "\"" ","  ","  "."  "["  "]"  ","  "." 
## [2549] "["  "]"  "["  "]"  ","  "."  "."  "["  "]"  ","  ","  ","  ","  "(" 
## [2563] ")"  "."  "["  "]"  "."  "["  "]"  "["  "]"  "["  "]"  "["  "]"  "." 
## [2577] "["  "]"  "["  "]"  "["  "]"  ","  ","  "."  "["  "]"  ","  "."  "[" 
## [2591] "]"  ":"  "\"" ","  ","  ","  ","  ","  ","  "."  "\"" "["  "]"  "," 
## [2605] ","  "."  "["  "]"  "["  "]"  "["  "]"  "["  "]"  ","  ","  "."  "[" 
## [2619] "]"  "["  "]"  ","  ","  ","  "."  "["  "]"  "["  "]"  "["  "]"  "," 
## [2633] ","  "\"" ","  ","  ","  ","  ","  ","  "\"" ","  "\"" ","  ","  "(" 
## [2647] "["  "]"  ")"  "["  "]"  "\"" "."  "["  "]"  "["  "]"  "."  "["  "]" 
## [2661] "."  "["  "]"  "["  "]"  ","  ","  "."  ","  ","  ","  "."  "["  "]" 
## [2675] ","  ","  "."  "["  "]"  ","  ","  "."  ","  ","  ","  "."  "."  "[" 
## [2689] "]"  ","  ","  "."  ","  ","  "."  ","  ","  "["  "]"  ","  ","  "," 
## [2703] ","  "."  ","  ","  "."  ","  ","  "."  ","  ","  "."  ","  ","  "." 
## [2717] ","  ","  "."  "."  "["  "]"  ","  ","  ","  ","  ","  ":"  "\"" "," 
## [2731] ","  "."  ","  ","  "."  "."  "."  ","  ","  "."  ";"  ";"  "."  "\""
## [2745] "["  "]"  ","  ","  ","  ","  "."  "["  "]"  "["  "]"  ","  ","  "[" 
## [2759] "]"  ","  ","  ","  ","  ";"  ","  ","  ","  "["  "]"  ","  ","  "." 
## [2773] "["  "]"  "["  "]"  "["  "]"  ","  ","  ","  ","  "."  "("  ")"  "," 
## [2787] ","  "."  "."  ","  ","  "."  ","  ","  "."  "."  ","  ","  "."  "." 
## [2801] ","  "."  ","  ","  "."  ","  ","  "."  "."  "["  "]"  ","  ","  "[" 
## [2815] "]"  ","  ":"  "\"" ","  ","  ","  "."  "."  "\"" "["  "]"  ","  "." 
## [2829] "'"  "\"" "\"" ","  ","  "."  ","  "."  ","  "."  "["  "]"  "["  "]" 
## [2843] ","  ","  ","  ","  ","  "."  ","  ","  "."  ","  "."  ","  "."  "." 
## [2857] "["  "]"  ","  "."  ","  ","  ","  "."  ","  ","  "."  ","  ","  "," 
## [2871] "."  "."  ","  "."  ","  "."  ","  "."  ","  "["  "]"  ","  "["  "]" 
## [2885] ","  "."  "."  "["  "]"  ","  ","  "."  "."  ","  "."  ","  "["  "]" 
## [2899] ","  ","  ","  ","  "."  "["  "]"  ","  ","  "."  ","  "."  "["  "]" 
## [2913] "["  "]"  ","  "."  ","  ","  "["  "]"  ","  "."  "["  "]"  "["  "]" 
## [2927] "["  "]"  ","  "."  "["  "]"  ","  ","  "."  "["  "]"  ","  ","  "." 
## [2941] "."  "."  "."
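The same idea is easier to check on a short string than in the full output. The sketch below uses an invented sentence (toy_phrase is hypothetical) and escapes the hyphen inside the character class as "\\-". The escape is a defensive choice for this example and not what the code above does: an unescaped hyphen between other characters in a class can be read as a range or set operator.

```r
library(stringr)

# Hypothetical toy string with hyphenated terms, a percentage, and other punctuation
toy_phrase <- "covid-19 (sars-cov-2): 95% efficacy, pfizer-biontech & others."

# Negated class: anything that is NOT a letter/digit, whitespace, %, or a literal hyphen
not_kept <- "[^[:alnum:][:space:]%\\-]"

# Characters that would be removed
removed <- str_extract_all(toy_phrase, not_kept)[[1]]
removed

# Replace them with a space and tidy up the extra whitespace
kept <- str_squish(str_replace_all(toy_phrase, not_kept, " "))
kept
```

Hyphenated terms and the percentage stay intact, while brackets, colons, commas, ampersands, and periods are dropped.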

cv_nopunct <- str_replace_all(cv_nocite, "[^[:alnum:][:space:]-%]", " ") # Replace the pattern with a single whitespace character
unlist(str_extract_all(cv_nopunct, "[^[:space:]]*[[:punct:]]{1,}[^[:space:]]*"))

##    [1] "covid--9"            "-9"                  "-"                  
##    [4] "-"                   "---9"                "-9"                 
##    [7] "-9"                  "----"                "-"                  
##   [10] "--"                  "----"                "sars-cov--"         
##   [13] "-9"                  "covid--9"            "-"                  
##   [16] "covid--9"            "---"                 "-9"                 
##   [19] "95%"                 "-9"                  "----"               
##   [22] "--"                  "bbibp-"              "rbd-dimer"          
##   [25] "-"                   "----"                "--8"                
##   [28] "7-"                  "--"                  "--"                 
##   [31] "-6"                  "-"                   "e-a"                
##   [34] "-"                   "--"                  "----"               
##   [37] "--6"                 "-7"                  "-9"                 
##   [40] "e-a"                 "astrazeneca-oxford"  "-"                  
##   [43] "----"                "pfizer-biontech"     "-"                  
##   [46] "-"                   "e-a"                 "-"                  
##   [49] "e-a"                 "6--"                 "5--"                
##   [52] "----"                "----"                "--"                 
##   [55] "high-income"         "--%"                 "5--"                
##   [58] "v-"                  "-9"                  "--"                 
##   [61] "----"                "--"                  "--"                 
##   [64] "--"                  "non-human"           "---5"               
##   [67] "---6"                "--"                  "-5"                 
##   [70] "-6"                  "----"                "-7"                 
##   [73] "-8"                  "-9"                  "covid--9"           
##   [76] "mers-cov"            "-7"                  "--"                 
##   [79] "----"                "--"                  "viral-vectored"     
##   [82] "adenoviral-vectored" "chadox--mers"        "bvrs-gamvac"        
##   [85] "mva-vectored"        "mva-mers-s"          "--"                 
##   [88] "----"                "--"                  "-9"                 
##   [91] "--%"                 "--%"                 "-6%"                
##   [94] "--"                  "--"                  "--"                 
##   [97] "-5"                  "-s"                  "covid--9"           
##  [100] "covid--9"            "covid-"              "-9"                 
##  [103] "bbibp-corv"          "-9"                  "-6"                 
##  [106] "-6"                  "-7"                  "----"               
##  [109] "-8"                  "-nited"              "-6"                 
##  [112] "-9"                  "-9"                  "--"                 
##  [115] "--"                  "--"                  "--"                 
##  [118] "--"                  "--"                  "-5"                 
##  [121] "-6"                  "-7"                  "-8"                 
##  [124] "-9"                  "-9"                  "-9"                 
##  [127] "--"                  "--"                  "covid--9"           
##  [130] "g--"                 "----"                "--"                 
##  [133] "--"                  "cross-discipline"    "--"                 
##  [136] "----"                "multi-site"          "-"                  
##  [139] "-5"                  "-6"                  "-7"                 
##  [142] "low-rate"            "-7"                  "-8"                 
##  [145] "-9"                  "----"                "-9"                 
##  [148] "-nited"              "disease-fighting"    "-9"                 
##  [151] "-7"                  "5-"                  "----"               
##  [154] "fast-track"          "5-"                  "5-"                 
##  [157] "-7"                  "-6"                  "---9"               
##  [160] "5-"                  "--"                  "----"               
##  [163] "covid--9"            "5-"                  "----"               
##  [166] "----"                "-9"                  "6-"                 
##  [169] "-9"                  "--"                  "-9"                 
##  [172] "----"                "--%"                 "--%"                
##  [175] "-6%"                 "5-"                  "----"               
##  [178] "-"                   "-"                   "-8"                 
##  [181] "6-"                  "-9"                  "----"               
##  [184] "6-"                  "----"                "5-"                 
##  [187] "6-"                  "--"                  "----"               
##  [190] "high-risk"           "6-"                  "--"                 
##  [193] "----"                "-"                   "e-a"                
##  [196] "bnt-6-b-"            "--"                  "----"               
##  [199] "-"                   "----"                "-nited"             
##  [202] "-9"                  "7-"                  "7-"                 
##  [205] "7-"                  "--"                  "-nion"              
##  [208] "7-"                  "-9"                  "-nited"             
##  [211] "bbibp-corv"          "7-"                  "--"                 
##  [214] "----"                "-nited"              "-se"                
##  [217] "e-a"                 "-9"                  "e-a"                
##  [220] "mrna---7-"           "----"                "-9"                 
##  [223] "-"                   "8-"                  "-9"                 
##  [226] "8-"                  "----"                "nucleoside-modified"
##  [229] "-"                   "-"                   "non-replicating"    
##  [232] "-"                   "-"                   "--"                 
##  [235] "8-"                  "--"                  "--"                 
##  [238] "-9"                  "next-generation"     "-9"                 
##  [241] "8-"                  "--"                  "--"                 
##  [244] "-9"                  "8-"                  "--"                 
##  [247] "nucleoside-modified" "8-"                  "8-"                 
##  [250] "8-"                  "8-"                  "covid--9"           
##  [253] "-nited"              "-nion"               "----"               
##  [256] "pfizer-biontech"     "-9"                  "covid--9"           
##  [259] "9-"                  "9-"                  "----"               
##  [262] "e-"                  "9-"                  "non-replicating"    
##  [265] "-"                   "9-"                  "vector-based"       
##  [268] "covid--9"            "non-replicating"     "9-"                 
##  [271] "----"                "covid--9"            "9-"                 
##  [274] "covid--9"            "one-shot"            "---"                
##  [277] "---"                 "ad-6"                "-nd"                
##  [280] "---"                 "----"                "---"                
##  [283] "---"                 "--5"                 "bbibp-corv"         
##  [286] "--6"                 "--7"                 "covid--9"           
##  [289] "--8"                 "--9"                 "---"                
##  [292] "----"                "---"                 "covid--9"           
##  [295] "---"                 "rbd-dimer"           "-"                  
##  [298] "v-5-"                "---"                 "---"                
##  [301] "--5"                 "--6"                 "--7"                
##  [304] "--8"                 "--9"                 "---"                
##  [307] "---"                 "---"                 "-"                  
##  [310] "---"                 "-9"                  "---"                
##  [313] "non-"                "-9"                  "--5"                
##  [316] "--"                  "-5"                  "placebo-controlled" 
##  [319] "-5"                  "--"                  "-5"                 
##  [322] "-9"                  "-6"                  "--6"                
##  [325] "--7"                 "--8"                 "--9"                
##  [328] "--8"                 "---"                 "who-recognized"     
##  [331] "rbd-dimer"           "-nited"              "-"                  
##  [334] "-7-"                 "--"                  "--"                 
##  [337] "--8"                 "-6"                  "covid--9"           
##  [340] "-"                   "-"                   "-lt"                
##  [343] "--"                  "placebo-controlled"  "---"                
##  [346] "-8"                  "----"                "--6"                
##  [349] "--"                  "----"                "95%"                
##  [352] "--7"                 "--8"                 "----"               
##  [355] "--9"                 "---"                 "-nited"             
##  [358] "covid--9"            "-"                   "--8"                
##  [361] "--"                  "---"                 "-5"                 
##  [364] "-"                   "double-"             "--"                 
##  [367] "---"                 "placebo-"            "ad-6"               
##  [370] "---"                 "---"                 "9-"                 
##  [373] "6%"                  "--5"                 "----"               
##  [376] "----"                "--6"                 "--7"                
##  [379] "--8"                 "--9"                 "-5-"                
##  [382] "-ae"                 "-5-"                 "-nited"             
##  [385] "-"                   "-"                   "-56"                
##  [388] "--"                  "---"                 "--"                 
##  [391] "covid--9"            "-"                   "--"                 
##  [394] "--"                  "9-"                  "chadox-"            
##  [397] "-niversity"          "9-"                  "-55"                
##  [400] "placebo-controlled"  "-57"                 "76%"                
##  [403] "8-%"                 "--"                  "-58"                
##  [406] "----"                "----"                "---"                
##  [409] "-59"                 "-nited"              "-6-"                
##  [412] "bbibp-corv"          "--6"                 "-"                  
##  [415] "-"                   "-6-"                 "-8"                 
##  [418] "---"                 "-5"                  "-"                  
##  [421] "-"                   "-"                   "-"                  
##  [424] "double-blind"        "--6"                 "-6-"                
##  [427] "placebo-controlled"  "79%"                 "-6-"                
##  [430] "----"                "----"                "-nited"             
##  [433] "-6-"                 "-65"                 "-66"                
##  [436] "-67"                 "---"                 "---"                
##  [439] "--5"                 "-"                   "-"                  
##  [442] "-69"                 "--"                  "6--"                
##  [445] "--"                  "-"                   "-"                  
##  [448] "double-blind"        "-"                   "-68"                
##  [451] "---"                 "placebo-controlled"  "8-"                 
##  [454] "5%"                  "-7-"                 "-%"                 
##  [457] "-7-"                 "5-"                  "-%"                 
##  [460] "78%"                 "---%"                "-7-"                
##  [463] "----"                "----"                "-5"                 
##  [466] "---"                 "-7-"                 "----"               
##  [469] "----"                "-"                   "6--"                
##  [472] "-"                   "---"                 "-7-"                
##  [475] "--"                  "---"                 "-75"                
##  [478] "covid--9"            "-nited"              "-"                  
##  [481] "---"                 "-78"                 "--"                 
##  [484] "---"                 "--"                  "9-"                 
##  [487] "9-"                  "-"                   "-77"                
##  [490] "placebo-controlled"  "-76"                 "-5"                 
##  [493] "----"                "-79"                 "--"                 
##  [496] "----"                "9-%"                 "-8-"                
##  [499] "----"                "----"                "-nited"             
##  [502] "-nited"              "-"                   "-"                  
##  [505] "-8-"                 "--"                  "---"                
##  [508] "covid--9"            "-8-"                 "double-blinded"     
##  [511] "ad-6"                "-8-"                 "placebo-controlled" 
##  [514] "-9"                  "----"                "66%"                
##  [517] "85%"                 "6-%"                 "7-%"                
##  [520] "-nited"              "-8-"                 "-8-"                
##  [523] "----"                "----"                "-nited"             
##  [526] "-kraine"             "ad5-ncov"            "-"                  
##  [529] "-"                   "-86"                 "--"                 
##  [532] "---"                 "-"                   "-86"                
##  [535] "multi-center"        "-"                   "double-"            
##  [538] "-85"                 "placebo-"            "----"               
##  [541] "%"                   "covid--9"            "9-"                 
##  [544] "98%"                 "-86"                 "----"               
##  [547] "----"                "----"                "----"               
##  [550] "-87"                 "-88"                 "-89"                
##  [553] "-9-"                 "-9-"                 "-9-"                
##  [556] "bbv-5-"              "-"                   "-"                  
##  [559] "-9-"                 "-5"                  "8--"                
##  [562] "-"                   "-"                   "-"                  
##  [565] "-9-"                 "observer-blinded"    "-9-"                
##  [568] "placebo-"            "-95"                 "8-%"                
##  [571] "-96"                 "peer-reviewed"       "-97"                
##  [574] "----"                "----"                "---"                
##  [577] "-"                   "-"                   "-98"                
##  [580] "--"                  "---"                 "-"                  
##  [583] "---"                 "-"                   "-"                  
##  [586] "double-"             "-"                   "placebo-"           
##  [589] "---"                 "----"                "----"               
##  [592] "-99"                 "zf----"              "rbd-dimer"          
##  [595] "-"                   "-"                   "-9"                 
##  [598] "---"                 "-"                   "--"                 
##  [601] "-"                   "---"                 "---"                
##  [604] "double-blind"        "placebo-"            "---"                
##  [607] "----"                "----"                "-zbekistan"         
##  [610] "---"                 "---"                 "---"                
##  [613] "-"                   "-"                   "--6"                
##  [616] "-"                   "---"                 "-"                  
##  [619] "-"                   "-"                   "double-blind"       
##  [622] "-"                   "--5"                 "--6"                
##  [625] "placebo-controlled"  "-9"                  "-"                  
##  [628] "--7"                 "--8"                 "covid--9"           
##  [631] "-nited"              "--9"                 "---"                
##  [634] "---"                 "virus-"              "---"                
##  [637] "-"                   "---"                 "-5"                 
##  [640] "---"                 "---"                 "---"                
##  [643] "--7"                 "-"                   "observer-"          
##  [646] "e-"                  "--8"                 "-kraine"            
##  [649] "--9"                 "---"                 "placebo-"           
##  [652] "--6"                 "---"                 "----"               
##  [655] "----"                "-k"                  "-5"                 
##  [658] "---"                 "----"                "----"               
##  [661] "-s"                  "--"                  "---"                
##  [664] "--5"                 "finlay-fr--"         "--"                 
##  [667] "95-"                 "--"                  "---"                
##  [670] "---"                 "---"                 "---"                
##  [673] "---"                 "--"                  "non-randomized"     
##  [676] "parallel-group"      "placebo-"            "-ncontrolled"       
##  [679] "double-blind"        "----"                "9--"                
##  [682] "----"                "----"                "covid--9"           
##  [685] "9--"                 "-"                   "--5"                
##  [688] "-6"                  "5--"                 "--6"                
##  [691] "--7"                 "--8"                 "e-"                 
##  [694] "9-"                  "-b"                  "-"                  
##  [697] "-8-"                 "dose-"               "----"               
##  [700] "----"                "66-"                 "observer-blind"     
##  [703] "dose-"               "----"                "----"               
##  [706] "-nnamed"             "--9"                 "---"                
##  [709] "-"                   "9--"                 "--"                 
##  [712] "---"                 "double-blinded"      "double-blinded"     
##  [715] "single-center"       "single-center"       "placebo-"           
##  [718] "placebo-"            "----"                "----"               
##  [721] "qazcovid-in"         "---"                 "-"                  
##  [724] "-"                   "---"                 "---"                
##  [727] "---"                 "----"                "placebo-"           
##  [730] "---"                 "----"                "----"               
##  [733] "---"                 "zycov-d"             "--5"                
##  [736] "-"                   "-6"                  "---"                
##  [739] "---"                 "-"                   "---"                
##  [742] "placebo-"            "double-blind"        "--5"                
##  [745] "placebo-"            "----"                "--7"                
##  [748] "--5"                 "--6"                 "----"               
##  [751] "----"                "--8"                 "virus-like"         
##  [754] "-8-"                 "-nited"              "plant-based"        
##  [757] "--"                  "9-8"                 "as--"               
##  [760] "event-driven"        "--"                  "--"                 
##  [763] "placebo-"            "---"                 "--x"                
##  [766] "----"                "covid--9"            "----"               
##  [769] "----"                "----"                "---"                
##  [772] "scb----9"            "---"                 "---"                
##  [775] "-5-"                 "-nited"              "--"                 
##  [778] "---"                 "----"                "---"                
##  [781] "--5"                 "as--"                "----"               
##  [784] "double-blind"        "----"                "----"               
##  [787] "-b-6--"              "6-"                  "--8"                
##  [790] "-nited"              "-nited"              "--"                 
##  [793] "-7-"                 "open-label"          "--6"                
##  [796] "--7"                 "----"                "----"               
##  [799] "-"                   "85-"                 "placebo-"           
##  [802] "observer-blind"      "iib-iii"             "---"                
##  [805] "double-blind"        "dose-"               "----"               
##  [808] "----"                "grad-cov-"           "--9"                
##  [811] "-5-"                 "9-"                  "-5-"                
##  [814] "--"                  "---"                 "-5-"                
##  [817] "-5-"                 "-8"                  "observer-blind"     
##  [820] "placebo-"            "grad-"               "----"               
##  [823] "cov-"                "---week"             "9-%"                
##  [826] "grad-cov-"           "anti-"               "----"               
##  [829] "-nnamed"             "-5"                  "-56"                
##  [832] "-"                   "96-"                 "-5-"                
##  [835] "-55"                 "single-center"       "96-"                
##  [838] "single-"             "placebo-"            "double-blind"       
##  [841] "double-blinded"      "----"                "placebo-"           
##  [844] "-"                   "---"                 "single-"            
##  [847] "double-blinded"      "placebo-"            "----"               
##  [850] "----"                "mvc-cov-9--"         "-5"                 
##  [853] "-58"                 "-"                   "7--"                
##  [856] "-57"                 "open-labeled"        "double-blinded"     
##  [859] "single-center"       "multi-center"        "----"               
##  [862] "----"                "multi-regional"      "----"               
##  [865] "----"                "-nnamed"             "-"                  
##  [868] "-8-"                 "-6-"                 "-"                  
##  [871] "---"                 "-59"                 "double-blind"       
##  [874] "double-blind"        "parallel-"           "parallel-"          
##  [877] "----"                "----"                "----"               
##  [880] "----"                "-6-"                 "-"                  
##  [883] "56-"                 "6-"                  "-6-"                
##  [886] "-6-"                 "-6-"                 "-6-"                
##  [889] "double-blind"        "----"                "----"               
##  [892] "placebo-"            "----"                "-nnamed"            
##  [895] "-65"                 "---"                 "-68"                
##  [898] "----"                "----"                "----"               
##  [901] "-67"                 "----"                "-66"                
##  [904] "ino--8--"            "--6"                 "--7"                
##  [907] "-8-"                 "---"                 "---"                
##  [910] "-nited"              "open-label"          "placebo-"           
##  [913] "ib-iia"              "-6-"                 "dose-"              
##  [916] "multi-"              "-69"                 "-68"                
##  [919] "----"                "----"                "----"               
##  [922] "-nited"              "----"                "-nited"             
##  [925] "ib-iia"              "ag-----covid"        "-9"                 
##  [928] "--8"                 "-7-"                 "--"                 
##  [931] "-7-"                 "5--"                 "non-"               
##  [934] "double-blind"        "single-center"       "-7-"                
##  [937] "----"                "----"                "----"               
##  [940] "-nnamed"             "-nited"              "-"                  
##  [943] "-6-"                 "i-iia"               "---"                
##  [946] "sars-cov--"          "-8"                  "-7-"                
##  [949] "7--"                 "sars-cov--"          "as--"               
##  [952] "-8"                  "-7-"                 "----"               
##  [955] "----"                "-nited"              "iibr----"           
##  [958] "---"                 "-"                   "---"                
##  [961] "-75"                 "----"                "----"               
##  [964] "arct----"            "lunar-"              "-nited"             
##  [967] "cov-9"               "-76"                 "-77"                
##  [970] "9-"                  "n-s"                 "double-blinded"     
##  [973] "--6"                 "-78"                 "6--"                
##  [976] "observer-blind"      "placebo-"            "-79"                
##  [979] "----"                "----"                "-nited"             
##  [982] "vbi--9--a"           "-8-"                 "-nited"             
##  [985] "virus-like"          "78-"                 "observer-blind"     
##  [988] "dose-"               "placebo-"            "----"               
##  [991] "----"                "mrt55--"             "-8-"                
##  [994] "--5"                 "-8-"                 "first-in-"          
##  [997] "sars-"               "cov--"               "-8"                 
## [1000] "----"                "----"                "-nited"             
## [1003] "eucorvac--9"         "-8-"                 "-8-"                
## [1006] "dose-"               "observer-blind"      "placebo-"           
## [1009] "----"                "----"                "gx--9"              
## [1012] "gx--9n"              "--9"                 "-8-"                
## [1015] "---"                 "-85"                 "-8-"                
## [1018] "i-ii"                "-7--"                "---"                
## [1021] "multi-"              "double-blind"        "placebo-"           
## [1024] "----"                "----"                "vla----"            
## [1027] "--8"                 "--9"                 "-"                  
## [1030] "-5-"                 "multi-center"        "double-blinded"     
## [1033] "----"                "----"                "-nited"             
## [1036] "tak-9-9"             "-86"                 "---"                
## [1039] "-87"                 "observer-blind"      "placebo-"           
## [1042] "----"                "----"                "tak---9"            
## [1045] "-88"                 "---"                 "-89"                
## [1048] "observer-blind"      "placebo-"            "----"               
## [1051] "----"                "covid-evax"          "-6-"                
## [1054] "-9-"                 "first-"              "in-human"           
## [1057] "open-"               "----"                "-9"                 
## [1060] "---"                 "---"                 "----"               
## [1063] "----"                "-9-"                 "lv-smenp-dc"        
## [1066] "---"                 "---"                 "----"               
## [1069] "----"                "-9-"                 "chulacov-9"         
## [1072] "-niversity"          "-9-"                 "dose-finding"       
## [1075] "----"                "lnp-ncovsarna"       "-9-"                
## [1078] "-nited"              "--5"                 "-5"                 
## [1081] "---"                 "----"                "----"               
## [1084] "-nited"              "covax--9"            "-9-"                
## [1087] "--"                  "-95"                 "----"               
## [1090] "----"                "hgc--9"              "-96"                
## [1093] "---"                 "-nited"              "-98"                
## [1096] "----"                "-97"                 "covid--9"           
## [1099] "-99"                 "---"                 "-"                  
## [1102] "-nited"              "-6-"                 "---"                
## [1105] "---"                 "----"                "----"               
## [1108] "---"                 "---"                 "---"                
## [1111] "---"                 "--5"                 "----"               
## [1114] "----"                "--6"                 "ptx-covid-9-b"      
## [1117] "--7"                 "6-"                  "--7"                
## [1120] "----"                "covac--"             "--8"                
## [1123] "--8"                 "-niversity"          "--8"                
## [1126] "----"                "----"                "covi-vac"           
## [1129] "-nited"              "-8"                  "--9"                
## [1132] "first-in-human"      "double-blind"        "placebo-"           
## [1135] "dose-escalation"     "----"                "----"               
## [1138] "-nited"              "cov-"                "-nited"             
## [1141] "-8"                  "---"                 "open-label"         
## [1144] "non-"                "----"                "-nited"             
## [1147] "---"                 "-"                   "double-blind"       
## [1150] "placebo-"            "---"                 "----"               
## [1153] "----"                "-5-"                 "-niversity"         
## [1156] "---"                 "double-blind"        "dose-ranging"       
## [1159] "placebo-"            "----"                "----"               
## [1162] "bbv-5-"              "---"                 "-75"                
## [1165] "---"                 "--5"                 "double-blinded"     
## [1168] "----"                "mv--------"          "--6"                
## [1171] "-nited"              "---"                 "--7"                
## [1174] "double-blinded"      "----"                "----"               
## [1177] "-nited"              "s--68--9"            "---"                
## [1180] "--8"                 "double-blind"        "placebo-"           
## [1183] "parallel-group"      "----"                "----"               
## [1186] "gbp5--"              "-6-"                 "--9"                
## [1189] "placebo-"            "observer-"           "dose-"              
## [1192] "----"                "cigb-66"             "---"                
## [1195] "---"                 "double-blind"        "placebo-"           
## [1198] "----"                "----"                "kbp----"            
## [1201] "-nited"              "-8-"                 "---"                
## [1204] "first-in-human"      "observer-"           "placebo-"           
## [1207] "----"                "----"                "-nited"             
## [1210] "adimrsc--f"          "7-"                  "---"                
## [1213] "open-label"          "dose-finding"        "----"               
## [1216] "er-cov-vac"          "-"                   "--"                 
## [1219] "---"                 "covid--9"            "er-cov-vac"         
## [1222] "----"                "----"                "aks--5-"            
## [1225] "-niversity"          "---"                 "---"                
## [1228] "non-"                "single-center"       "open-label"         
## [1231] "----"                "gls-5---"            "--5"                
## [1234] "--5"                 "dose-"               "double-blind"       
## [1237] "----"                "----"                "vax----"            
## [1240] "7-"                  "--6"                 "placebo-"           
## [1243] "observer-blind"      "----"                "coh--s-"            
## [1246] "-nited"              "--9"                 "--7"                
## [1249] "----"                "----"                "-"                  
## [1252] "--5"                 "--8"                 "----"               
## [1255] "nbp----"             "5-"                  "--9"                
## [1258] "placebo-"            "observer-"           "dose-"              
## [1261] "----"                "----"                "covac--"            
## [1264] "-6"                  "-niversity"          "---"                
## [1267] "placebo-"            "observer-"           "dose-"              
## [1270] "----"                "----"                "bactrl-spike"       
## [1273] "--"                  "---"                 "observer-blind"     
## [1276] "placebo-"            "----"                "----"               
## [1279] "---"                 "---"                 "----"               
## [1282] "corvax--"            "-nited"              "-6"                 
## [1285] "---"                 "open-label"          "----"               
## [1288] "----"                "-nited"              "chadv68-s"          
## [1291] "-nited"              "---"                 "---"                
## [1294] "open-label"          "----"                "----"               
## [1297] "-nited"              "-nited"              "-8-"                
## [1300] "--5"                 "double-blind"        "placebo-"           
## [1303] "first-"              "in-human"            "----"               
## [1306] "----"                "-nited"              "vxa-cov---"         
## [1309] "-nited"              "-5"                  "--6"                
## [1312] "double-blind"        "placebo-"            "first-"             
## [1315] "in-human"            "----"                "-nited"             
## [1318] "sars-cov--"          "v-5-"                "---"                
## [1321] "---"                 "---"                 "-q"                 
## [1324] "double-blind"        "placebo-"            "dose-ranging"       
## [1327] "----"                "v59-"                "--7"                
## [1330] "v59-"                "mv-"                 "-nited"             
## [1333] "sars-cov--"          "--8"                 "--9"                
## [1336] "---"                 "-niversity"          "sars-"              
## [1339] "cov--"               "covid-"              "-9"                 
## [1342] "---"                 "---"                 "---"                
## [1345] "long-term"           "covid--9"            "-5"                 
## [1348] "-5"                  "--"                  "-"                  
## [1351] "-6"                  "-6"                  "---"                
## [1354] "--5"                 "gam-covid-vac"       "gam-covid-vac-lyo"  
## [1357] "--8"                 "---"                 "chadox-"            
## [1360] "ncov--9"             "-5-"                 "low-"               
## [1363] "middle-income"       "-5-"                 "chadox-"            
## [1366] "ncov--9"             "-5-"                 "virus-like"         
## [1369] "--9"                 "-s"                  "---"                
## [1372] "---"                 "-%"                  "5-%"                
## [1375] "---"                 "-9"                  "67%"                
## [1378] "-9"                  "-"                   "---"                
## [1381] "double-blind"        "-s"                  "5-%"                
## [1384] "---"                 "-9"                  "--5"                
## [1387] "--6"                 "75%"                 "covid--9"           
## [1390] "7-%"                 "8-%"                 "--7"                
## [1393] "covid--9"            "covid--9"            "sars-cov--"         
## [1396] "95%"                 "covid--9"            "--8"                
## [1399] "covid--9"            "9-%"                 "97%"                
## [1402] "---%"                "-nited"              "--9"                
## [1405] "95%"                 "9-"                  "98%"                
## [1408] "-5-"                 "9-%"                 "95%"                
## [1411] "---%"                "9-"                  "---%"               
## [1414] "-5-"                 "8-%"                 "6-"                 
## [1417] "9-%"                 "---%"                "%"                  
## [1420] "7-"                  "---%"                "-58"                
## [1423] "89%"                 "95%"                 "---%"               
## [1426] "-nited"              "-5-"                 "-5-"                
## [1429] "6-%"                 "--"                  "8-%"                
## [1432] "---%"                "bbibp-corv"          "79%"                
## [1435] "---%"                "-5-"                 "-55"                
## [1438] "78%"                 "---%"                "-56"                
## [1441] "-57"                 "-58"                 "66%"                
## [1444] "75%"                 "85%"                 "5-"                 
## [1447] "97%"                 "7-%"                 "8-%"                
## [1450] "86%"                 "---%"                "-nited"             
## [1453] "-8-"                 "68%"                 "-9"                 
## [1456] "8-%"                 "88%"                 "---%"               
## [1459] "6-%"                 "--"                  "79%"                
## [1462] "8-%"                 "-6"                  "95%"                
## [1465] "8-%"                 "-59"                 "-6-"                
## [1468] "66%"                 "9-%"                 "-86"                
## [1471] "-8"                  "---"                 "--"                 
## [1474] "55%"                 "--"                  "7-%"                
## [1477] "sars-cov--"          "covid--9"            "-6-"                
## [1480] "covid--9"            "-6-"                 "----"               
## [1483] "-s"                  "sars-cov--"          "-6-"                
## [1486] "----"                "-"                   "-"                  
## [1489] "-"                   "-k"                  "-6-"                
## [1492] "-k"                  "-6-"                 "--"                 
## [1495] "89%"                 "-"                   "-"                  
## [1498] "7-"                  "9-%"                 "non-b"              
## [1501] "-"                   "-"                   "-65"                
## [1504] "96%"                 "86%"                 "-"                  
## [1507] "-"                   "6-%"                 "-"                  
## [1510] "-5-"                 "-66"                 "5--"                
## [1513] "v-"                  "5--"                 "v-"                 
## [1516] "-"                   "-5-"                 "-67"                
## [1519] "-7"                  "----"                "two-thirds"         
## [1522] "5--"                 "v-"                  "-68"                
## [1525] "ad-6"                "cov-"                "covid--9"           
## [1528] "7-%"                 "-nited"              "57%"                
## [1531] "-69"                 "----"                "-niversity"         
## [1534] "-niversity"          "covid--9"            "5--"                
## [1537] "v-"                  "-7-"                 "-"                  
## [1540] "---"                 "azd----"             "covid--9"           
## [1543] "-7-"                 "----"                "-"                  
## [1546] "-7-"                 "-7-"                 "----"               
## [1549] "8-"                  "-9"                  "-7-"                
## [1552] "-9"                  "-9"                  "-7-"                
## [1555] "-75"                 "-9"                  "-9"                 
## [1558] "protein-based"       "vector-based"        "-75"                
## [1561] "8-%"                 "-75"                 "-7-"                
## [1564] "-75"                 "--"                  "----"               
## [1567] "--6"                 "-7"                  "-9"                 
## [1570] "-8-"                 "-9"                  "----"               
## [1573] "-8-"                 "first-service"       "-8-"                
## [1576] "-8-"                 "-8-"                 "-85"                
## [1579] "-8-"                 "-8-"                 "-85"                
## [1582] "-k"                  "--"                  "-k"                 
## [1585] "-9"                  "taxpayer-funded"     "-k"                 
## [1588] "-86"                 "-85"                 "----"               
## [1591] "late-stage"          "low-"                "-87"                
## [1594] "-9"                  "-8-"                 "-88"                
## [1597] "-89"                 "-9-"                 "-9"                 
## [1600] "-9"                  "densely-populated"   "-9-"                
## [1603] "-9-"                 "-9"                  "-87"                
## [1606] "-89"                 "-9-"                 "-"                  
## [1609] "----"                "-s"                  "-9"                 
## [1612] "-9"                  "sars-cov--"          "covid--9"           
## [1615] "--"                  "-76"                 "-9-"                
## [1618] "-nited"              "-"                   "%"                  
## [1621] "----"                "-9-"                 "-6-"                
## [1624] "7-9"                 "-58"                 "-"                  
## [1627] "-%"                  "-nion"               "-9"                 
## [1630] "-nited"              "8-"                  "--5"                
## [1633] "--"                  "-%"                  "7-"                 
## [1636] "---"                 "--"                  "-9-"                
## [1639] "e-"                  "--"                  "--9"                
## [1642] "8-8"                 "-%"                  "side-effects"       
## [1645] "-7"                  "57-"                 "---"                
## [1648] "-"                   "7%"                  "-95"                
## [1651] "-nited"              "-7"                  "6--"                
## [1654] "97-"                 "--"                  "7%"                 
## [1657] "--"                  "-68"                 "97-"                
## [1660] "-"                   "8%"                  "--7"                
## [1663] "%"                   "-nited"              "-98"                
## [1666] "--"                  "--7"                 "%"                  
## [1669] "covid--9"            "--6"                 "-6-"                
## [1672] "-%"                  "high-income"         "--%"                
## [1675] "6--"                 "--9"                 "-"                  
## [1678] "8%"                  "-5"                  "----"               
## [1681] "5-%"                 "pre-sold"            "6--"                
## [1684] "-9-"                 "-9"                  "-%"                 
## [1687] "high-income"         "-8-"                 "-"                  
## [1690] "-%"                  "-96"                 "-5-"                
## [1693] "--8"                 "%"                   "-8"                 
## [1696] "----"                "director-general"    "-75"                
## [1699] "98-"                 "%"                   "-9"                 
## [1702] "-"                   "5--"                 "-"                  
## [1705] "8%"                  "-"                   "76-"                
## [1708] "7-7"                 "-"                   "9%"                 
## [1711] "-9"                  "higher-income"       "-5"                 
## [1714] "lowest-income"       "-"                   "-6-"                
## [1717] "-68"                 "--"                  "6%"                 
## [1720] "-5"                  "-5"                  "-5"                 
## [1723] "-97"                 "-"                   "--6"                
## [1726] "55-"                 "%"                   "covid--9"           
## [1729] "-"                   "---"                 "--9"                
## [1732] "%"                   "-nited"              "----"               
## [1735] "-98"                 "-99"                 "-"                  
## [1738] "---"                 "%"                   "long-standing"      
## [1741] "-"                   "-89"                 "---"                
## [1744] "--"                  "-"                   "5--"                
## [1747] "-8-"                 "%"                   "-"                  
## [1750] "7--"                 "--8"                 "%"                  
## [1753] "-"                   "6--"                 "---"                
## [1756] "--"                  "---"                 "---"                
## [1759] "---"                 "-"                   "6--"                
## [1762] "-"                   "%"                   "covid--9"           
## [1765] "-"                   "-9-"                 "6--"                
## [1768] "-%"                  "-7"                  "---"                
## [1771] "-"                   "--8"                 "--6"                
## [1774] "-9"                  "-%"                  "---"                
## [1777] "-"                   "--7"                 "57-"                
## [1780] "-"                   "-%"                  "-"                  
## [1783] "98-"                 "97-"                 "-%"                 
## [1786] "--"                  "--5"                 "-%"                 
## [1789] "-5"                  "-7-"                 "-%"                 
## [1792] "96-"                 "--"                  "7%"                 
## [1795] "9--"                 "--5"                 "-%"                 
## [1798] "---"                 "9--"                 "5-7"                
## [1801] "%"                   "8--"                 "5--"                
## [1804] "--"                  "7--"                 "68-"                
## [1807] "%"                   "7--"                 "---"                
## [1810] "--"                  "-%"                  "---"                
## [1813] "6-7"                 "-"                   "-%"                 
## [1816] "me-first"            "---"                 "-%"                 
## [1819] "6--"                 "86-"                 "--"                 
## [1822] "9%"                  "--5"                 "59-"                
## [1825] "6--"                 "--"                  "--"                 
## [1828] "----"                "-nited"              "55-"                
## [1831] "-5-"                 "-"                   "-%"                 
## [1834] "-nion"               "5-9"                 "-5-"                
## [1837] "-%"                  "8-"                  "covid--9"           
## [1840] "5-5"                 "9-5"                 "-%"                 
## [1843] "-97"                 "---"                 "-%"                 
## [1846] "--6"                 "-8-"                 "66-"                
## [1849] "-"                   "8%"                  "-87"                
## [1852] "-66"                 "%"                   "anti-vaccination"   
## [1855] "--6"                 "-87"                 "-"                  
## [1858] "-%"                  "covid--9"            "---"                
## [1861] "---"                 "-"                   "-%"                 
## [1864] "--7"                 "-6-"                 "--"                 
## [1867] "5%"                  "covid--9"            "---"                
## [1870] "65-"                 "-"                   "-%"                 
## [1873] "-8-"                 "---"                 "-"                  
## [1876] "7%"                  "-5-"                 "---"                
## [1879] "-"                   "7%"                  "--%"                
## [1882] "-5-"                 "---"                 "--"                 
## [1885] "--7"                 "---"                 "5-8"                
## [1888] "-"                   "6%"                  "-9"                 
## [1891] "-7"                  "---"                 "---"                
## [1894] "%"                   "----"                "67%"                
## [1897] "8-%"                 "-"                   "-ruguay"            
## [1900] "---"                 "-%"                  "-9"                 
## [1903] "--9"                 "---"                 "--"                 
## [1906] "--8"                 "----"                "-9%"                
## [1909] "-s"                  "-97"                 "68-"                
## [1912] "-"                   "-%"                  "5-%"                
## [1915] "-8-"                 "6--"                 "%"                  
## [1918] "-s"                  "--9"                 "---"                
## [1921] "-75"                 "--8"                 "--"                 
## [1924] "-%"                  "-56"                 "78-"                
## [1927] "--"                  "---"                 "---"                
## [1930] "---"                 "---"                 "-97"                
## [1933] "-"                   "-%"                  "---"                
## [1936] "7--"                 "--"                  "--9"                
## [1939] "-78"                 "-"                   "-%"                 
## [1942] "---9"                "-9"                  "--9"                
## [1945] "-78"                 "--"                  "-9"                 
## [1948] "-9-"                 "-88"                 "-"                  
## [1951] "7%"

str_extract_all(cv_nopunct, " [-%]+ ")

## [[1]]
##   [1] " - "    " - "    " ---- " " - "    " -- "   " ---- " " - "    " --- " 
##   [9] " ---- " " -- "   " - "    " ---- " " -- "   " -- "   " - "    " - "   
##  [17] " -- "   " ---- " " - "    " ---- " " - "    " - "    " ---- " " ---- "
##  [25] " -- "   " --% "  " -- "   " ---- " " -- "   " -- "   " -- "   " -- "  
##  [33] " ---- " " -- "   " ---- " " -- "   " -- "   " ---- " " -- "   " --% " 
##  [41] " --% "  " -- "   " -- "   " -- "   " ---- " " -- "   " -- "   " -- "  
##  [49] " -- "   " -- "   " -- "   " -- "   " -- "   " ---- " " -- "   " -- "  
##  [57] " -- "   " ---- " " - "    " ---- " " ---- " " -- "   " ---- " " ---- "
##  [65] " ---- " " -- "   " ---- " " --% "  " --% "  " ---- " " - "    " - "   
##  [73] " ---- " " ---- " " -- "   " ---- " " -- "   " ---- " " - "    " -- "  
##  [81] " ---- " " - "    " ---- " " -- "   " -- "   " ---- " " ---- " " - "   
##  [89] " ---- " " - "    " - "    " - "    " - "    " -- "   " -- "   " -- "  
##  [97] " -- "   " -- "   " -- "   " ---- " " ---- " " - "    " ---- " " --- " 
## [105] " --- "  " --- "  " ---- " " --- "  " --- "  " --- "  " ---- " " --- " 
## [113] " --- "  " - "    " --- "  " --- "  " --- "  " --- "  " --- "  " - "   
## [121] " --- "  " --- "  " -- "   " -- "   " --- "  " - "    " -- "   " -- "  
## [129] " - "    " -- "   " --- "  " ---- " " -- "   " ---- " " ---- " " --- " 
## [137] " - "    " -- "   " - "    " -- "   " --- "  " --- "  " --- "  " ---- "
## [145] " ---- " " - "    " - "    " -- "   " -- "   " - "    " -- "   " -- "  
## [153] " ---- " " ---- " " --- "  " - "    " - "    " --- "  " - "    " - "   
## [161] " - "    " ---- " " ---- " " --- "  " --- "  " - "    " - "    " -- "  
## [169] " -- "   " - "    " - "    " --- "  " -% "   " -% "   " ---% " " ---- "
## [177] " ---- " " --- "  " ---- " " ---- " " - "    " - "    " -- "   " - "   
## [185] " --- "  " -- "   " -- "   " - "    " ---- " " -- "   " ---- " " ---- "
## [193] " ---- " " - "    " -- "   " ---- " " ---- " " ---- " " - "    " -- "  
## [201] " - "    " - "    " ---- " " % "    " ---- " " ---- " " ---- " " ---- "
## [209] " - "    " - "    " - "    " - "    " ---- " " ---- " " --- "  " - "   
## [217] " - "    " -- "   " - "    " --- "  " - "    " - "    " --- "  " ---- "
## [225] " ---- " " - "    " - "    " --- "  " - "    " -- "   " - "    " --- " 
## [233] " --- "  " --- "  " ---- " " ---- " " --- "  " --- "  " --- "  " - "   
## [241] " - "    " - "    " - "    " - "    " - "    " - "    " --- "  " --- " 
## [249] " --- "  " - "    " --- "  " --- "  " --- "  " --- "  " - "    " --- " 
## [257] " --- "  " ---- " " ---- " " --- "  " ---- " " ---- " " -- "   " -- "  
## [265] " -- "   " --- "  " --- "  " --- "  " --- "  " -- "   " ---- " " ---- "
## [273] " ---- " " - "    " - "    " ---- " " ---- " " ---- " " ---- " " --- " 
## [281] " - "    " -- "   " ---- " " ---- " " --- "  " - "    " - "    " --- " 
## [289] " ---- " " --- "  " ---- " " ---- " " --- "  " - "    " --- "  " --- " 
## [297] " - "    " ---- " " ---- " " ---- " " -- "   " -- "   " -- "   " --- " 
## [305] " ---- " " ---- " " ---- " " ---- " " --- "  " --- "  " --- "  " -- "  
## [313] " ---- " " --- "  " ---- " " ---- " " ---- " " -- "   " ---- " " ---- "
## [321] " - "    " --- "  " ---- " " ---- " " -- "   " ---- " " ---- " " - "   
## [329] " ---- " " - "    " ---- " " ---- " " - "    " ---- " " ---- " " ---- "
## [337] " ---- " " - "    " - "    " ---- " " ---- " " ---- " " ---- " " - "   
## [345] " ---- " " ---- " " ---- " " --- "  " ---- " " ---- " " ---- " " ---- "
## [353] " --- "  " --- "  " ---- " " ---- " " ---- " " ---- " " -- "   " ---- "
## [361] " ---- " " ---- " " - "    " --- "  " ---- " " ---- " " --- "  " - "   
## [369] " ---- " " ---- " " ---- " " ---- " " ---- " " ---- " " ---- " " ---- "
## [377] " ---- " " ---- " " --- "  " --- "  " ---- " " ---- " " - "    " ---- "
## [385] " ---- " " --- "  " ---- " " ---- " " --- "  " ---- " " ---- " " ---- "
## [393] " --- "  " --- "  " ---- " " ---- " " --- "  " --- "  " ---- " " ---- "
## [401] " ---- " " --- "  " ---- " " ---- " " -- "   " ---- " " ---- " " --- " 
## [409] " ---- " " --- "  " - "    " --- "  " --- "  " ---- " " ---- " " --- " 
## [417] " --- "  " --- "  " --- "  " ---- " " ---- " " ---- " " ---- " " ---- "
## [425] " ---- " " ---- " " --- "  " ---- " " --- "  " - "    " --- "  " ---- "
## [433] " ---- " " --- "  " ---- " " ---- " " --- "  " --- "  " ---- " " --- " 
## [441] " ---- " " ---- " " --- "  " ---- " " ---- " " ---- " " --- "  " --- " 
## [449] " ---- " " ---- " " --- "  " ---- " " ---- " " --- "  " ---- " " - "   
## [457] " -- "   " --- "  " ---- " " ---- " " --- "  " --- "  " ---- " " ---- "
## [465] " ---- " " ---- " " ---- " " ---- " " - "    " ---- " " ---- " " ---- "
## [473] " --- "  " ---- " " ---- " " -- "   " --- "  " ---- " " ---- " " --- " 
## [481] " --- "  " ---- " " --- "  " ---- " " ---- " " --- "  " --- "  " ---- "
## [489] " ---- " " ---- " " ---- " " ---- " " --- "  " --- "  " --- "  " ---- "
## [497] " --- "  " --- "  " --- "  " --- "  " -- "   " - "    " --- "  " --- " 
## [505] " --- "  " --- "  " -% "   " --- "  " - "    " --- "  " --- "  " ---% "
## [513] " ---% " " ---% " " ---% " " % "    " ---% " " ---% " " -- "   " ---% "
## [521] " ---% " " ---% " " ---% " " ---% " " -- "   " --- "  " -- "   " -- "  
## [529] " ---- " " ---- " " - "    " - "    " -- "   " - "    " - "    " - "   
## [537] " - "    " - "    " ---- " " ---- " " - "    " ---- " " - "    " ---- "
## [545] " -- "   " ---- " " ---- " " -- "   " ---- " " - "    " ---- " " -- "  
## [553] " - "    " % "    " ---- " " - "    " -- "   " --- "  " -- "   " -% "  
## [561] " --- "  " -- "   " -- "   " - "    " % "    " -- "   " % "    " -% "  
## [569] " --% "  " - "    " ---- " " -% "   " - "    " % "    " ---- " " % "   
## [577] " - "    " - "    " - "    " - "    " - "    " -- "   " - "    " % "   
## [585] " - "    " % "    " ---- " " - "    " % "    " - "    " --- "  " - "   
## [593] " % "    " - "    " % "    " - "    " --- "  " --- "  " --- "  " --- " 
## [601] " - "    " - "    " - "    " -% "   " --- "  " - "    " -% "   " --- " 
## [609] " - "    " - "    " - "    " -% "   " -- "   " -% "   " -% "   " -- "  
## [617] " -% "   " --- "  " % "    " -- "   " % "    " --- "  " -% "   " --- " 
## [625] " - "    " --- "  " -% "   " -- "   " -- "   " -- "   " ---- " " - "   
## [633] " -% "   " -% "   " --- "  " -% "   " - "    " % "    " - "    " --- " 
## [641] " - "    " -- "   " --- "  " - "    " --- "  " --- "  " --% "  " --- " 
## [649] " --- "  " - "    " --- "  " % "    " - "    " --- "  " -% "   " --- " 
## [657] " ---- " " - "    " % "    " --- "  " -- "   " -- "   " --- "  " --- " 
## [665] " --- "  " --- "  " - "    " --- "  " -- "   " - "    " -- "   " - "

cv_nopunct <- str_replace_all(cv_nopunct, " [-%]+ ", " ")

We have now removed all punctuation marks from the string except hyphens and percent signs, and have also stripped runs of those two symbols that stand alone between spaces.
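To make the pattern " [-%]+ " more concrete, here is a minimal sketch on a made-up string (toy is a hypothetical example, not part of the Wikipedia text). Only runs of hyphens and percent signs with a space on both sides are matched; a hyphen or percent sign attached to a word or number is left alone.

```r
library(stringr)

# Hypothetical toy string, not taken from cv_nopunct
toy <- "covid-19 was -- at 95% efficacy -% reported"

# Runs of "-" or "%" that have a space on both sides
toy_matches <- str_extract_all(toy, " [-%]+ ")[[1]]
toy_matches
# [1] " -- " " -% "

# Replace each matched run (together with its two spaces) by one space
toy_clean <- str_replace_all(toy, " [-%]+ ", " ")
toy_clean
# [1] "covid-19 was at 95% efficacy reported"
```

The hyphen in "covid-19" and the percent sign in "95%" survive because neither is preceded by a space.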

str_trunc(cv_nopunct, 2000, "right") # Check the first 2000 characters in cv_nopunct

## [1] "covid--9 vaccine a covid -9 vaccine is a vaccine intended to provide acquired immunity against severe acute respiratory syndrome coronavirus  sars cov   the virus causing coronavirus disease ---9  covid -9   prior to the covid -9 pandemic  there was an established body of knowledge about the structure and function of coronaviruses causing diseases like severe acute respiratory syndrome  sars  and middle east respiratory syndrome  mers   which enabled accelerated development of various vaccine technologies during early   on january  the sars-cov-- genetic sequence data was shared through gisaid  and by -9 march  the global pharmaceutical industry announced a major commitment to address covid--9   covid--9 vaccination doses administered per people in phase iii trials  several covid -9 vaccines have demonstrated efficacy as high as 95% in preventing symptomatic covid -9 infections  as of march  vaccines were authorized by at least one national regulatory authority for public use  two rna vaccines  the pfizer biontech vaccine and the moderna vaccine   four conventional inactivated vaccines  bbibp- corv  coronavac  covaxin  and covivac   four viral vector vaccines  sputnik v  the oxford astrazeneca vaccine  convidecia  and the johnson   johnson vaccine   and two protein subunit vaccines  epivaccorona and rbd-dimer    in total  as of march  --8 vaccine candidates were in various stages of development  with 7- in clinical research  including in map of countries by approval status phase i trials  in phase i ii trials  and -6 in phase iii development   approved for general use  mass many countries have implemented phased distribution plans that prioritize vaccination underway those at highest risk of complications  such as the elderly  and those at high e-a  or equivalent  granted  mass risk of exposure and transmission  such as healthcare workers   as of vaccination underway march  --6 -7 million doses of covid -9 vaccine have been e-a granted  limited vaccination 
admini..."

# We may also want to remove any stand-alone number that is neither attached to a word nor a year

str_extract_all(cv_nopunct, "[[:space:]]+[[:digit:]]{1,3}[[:space:]]+") # One or more whitespace characters, then a number of one to three digits, then one or more whitespace characters

## [[1]]
##  [1] " 6 "      " 96 "     " 8  "     "  5   "   " 8  "     " 8  "    
##  [7] " 65  "    " 5  "     "  6  "    " 8  "     "  8  "    "  6  "   
## [13] " 8  "     " 8  "     "  5  "    " 8  "     " 8  "     "  7   "  
## [19] " 55 "     " 65 "     "   798  " "  96  "   "  56  "   " 6 "     
## [25] " 5  "     " 8  "     "  89 "    "  86 "    "  75 "    "  55 "   
## [31] "  58 "    "   9 "    "  8  "    " 6 "      " 9  "     " 7  "    
## [37] " 7 "      " 7 "      " 7  "     " 6 "      " 7 "      " 769 "   
## [43] " 956 "    " 9 "      " 8 "      " 698  "   "  7 "     " 768 "   
## [49] " 7 "      " 867  "   " 6 "      " 9 "      " 5 "      " 5 "     
## [55] " 5 "      " 5 "      " 5 "      " 895 "    " 998  "   " 589 "   
## [61] " 8 "      " 9 "      " 5 "      " 977 "    " 9 "      " 965 "   
## [67] " 8 "      " 695 "    " 9 "      " 676 "    " 675 "    " 9 "     
## [73] " 9 "      " 9 "      " 797 "

cv_nonum <- str_replace_all(cv_nopunct, "[[:space:]][[:digit:]]{1,3}[[:space:]]", " ")
str_trunc(cv_nonum, 2000) # Some numbers can remain: matches cannot overlap, so of two numbers separated by a single space only the first is removed

## [1] "covid--9 vaccine a covid -9 vaccine is a vaccine intended to provide acquired immunity against severe acute respiratory syndrome coronavirus  sars cov   the virus causing coronavirus disease ---9  covid -9   prior to the covid -9 pandemic  there was an established body of knowledge about the structure and function of coronaviruses causing diseases like severe acute respiratory syndrome  sars  and middle east respiratory syndrome  mers   which enabled accelerated development of various vaccine technologies during early   on january  the sars-cov-- genetic sequence data was shared through gisaid  and by -9 march  the global pharmaceutical industry announced a major commitment to address covid--9   covid--9 vaccination doses administered per people in phase iii trials  several covid -9 vaccines have demonstrated efficacy as high as 95% in preventing symptomatic covid -9 infections  as of march  vaccines were authorized by at least one national regulatory authority for public use  two rna vaccines  the pfizer biontech vaccine and the moderna vaccine   four conventional inactivated vaccines  bbibp- corv  coronavac  covaxin  and covivac   four viral vector vaccines  sputnik v  the oxford astrazeneca vaccine  convidecia  and the johnson   johnson vaccine   and two protein subunit vaccines  epivaccorona and rbd-dimer    in total  as of march  --8 vaccine candidates were in various stages of development  with 7- in clinical research  including in map of countries by approval status phase i trials  in phase i ii trials  and -6 in phase iii development   approved for general use  mass many countries have implemented phased distribution plans that prioritize vaccination underway those at highest risk of complications  such as the elderly  and those at high e-a  or equivalent  granted  mass risk of exposure and transmission  such as healthcare workers   as of vaccination underway march  --6 -7 million doses of covid -9 vaccine have been e-a granted  limited vaccination 
admini..."

# Rerun the replacement to catch the numbers skipped in the first pass
cv_nonum <- str_replace_all(cv_nonum, "[[:space:]][[:digit:]]{1,3}[[:space:]]", " ")
str_trunc(cv_nonum, 2000)

## [1] "covid--9 vaccine a covid -9 vaccine is a vaccine intended to provide acquired immunity against severe acute respiratory syndrome coronavirus  sars cov   the virus causing coronavirus disease ---9  covid -9   prior to the covid -9 pandemic  there was an established body of knowledge about the structure and function of coronaviruses causing diseases like severe acute respiratory syndrome  sars  and middle east respiratory syndrome  mers   which enabled accelerated development of various vaccine technologies during early   on january  the sars-cov-- genetic sequence data was shared through gisaid  and by -9 march  the global pharmaceutical industry announced a major commitment to address covid--9   covid--9 vaccination doses administered per people in phase iii trials  several covid -9 vaccines have demonstrated efficacy as high as 95% in preventing symptomatic covid -9 infections  as of march  vaccines were authorized by at least one national regulatory authority for public use  two rna vaccines  the pfizer biontech vaccine and the moderna vaccine   four conventional inactivated vaccines  bbibp- corv  coronavac  covaxin  and covivac   four viral vector vaccines  sputnik v  the oxford astrazeneca vaccine  convidecia  and the johnson   johnson vaccine   and two protein subunit vaccines  epivaccorona and rbd-dimer    in total  as of march  --8 vaccine candidates were in various stages of development  with 7- in clinical research  including in map of countries by approval status phase i trials  in phase i ii trials  and -6 in phase iii development   approved for general use  mass many countries have implemented phased distribution plans that prioritize vaccination underway those at highest risk of complications  such as the elderly  and those at high e-a  or equivalent  granted  mass risk of exposure and transmission  such as healthcare workers   as of vaccination underway march  --6 -7 million doses of covid -9 vaccine have been e-a granted  limited vaccination 
admini..."
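The reason a second pass is needed can be shown on a small made-up string. This is only a sketch: toy is hypothetical, and the lookaround pattern at the end is an alternative that is not used elsewhere in this practice. It relies on the lookbehind `(?<=...)` and lookahead `(?=...)` constructs supported by the regex engine behind stringr, which check for surrounding whitespace without consuming it.

```r
library(stringr)

# Hypothetical toy string with three adjacent stand-alone numbers
toy <- "dose 5 6 7 given"
pattern <- "[[:space:]][[:digit:]]{1,3}[[:space:]]"

# First pass: " 5 " and " 7 " match, but "6" is skipped because the
# space in front of it was already consumed by the " 5 " match
pass1 <- str_replace_all(toy, pattern, " ")
pass1
# [1] "dose 6 given"

# Second pass removes the number that was left over
pass2 <- str_replace_all(pass1, pattern, " ")
pass2
# [1] "dose given"

# Alternative: lookarounds test for whitespace without consuming it,
# so all three numbers are removed in one pass; str_squish() then
# collapses the spaces that are left behind
one_pass <- str_squish(str_replace_all(toy, "(?<=\\s)\\d{1,3}(?=\\s)", ""))
one_pass
# [1] "dose given"
```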

Our string is now preprocessed to the extent that non-ASCII characters, punctuation marks, and stand-alone numbers have been removed. However, we can still see runs of multiple whitespace characters that the preprocessing steps left behind.

cv_nospace <- str_squish(cv_nonum) # Repeat the whitespace clean-up: trim both ends and collapse repeated whitespace into single spaces
str_trunc(cv_nospace, 2000)

## [1] "covid--9 vaccine a covid -9 vaccine is a vaccine intended to provide acquired immunity against severe acute respiratory syndrome coronavirus sars cov the virus causing coronavirus disease ---9 covid -9 prior to the covid -9 pandemic there was an established body of knowledge about the structure and function of coronaviruses causing diseases like severe acute respiratory syndrome sars and middle east respiratory syndrome mers which enabled accelerated development of various vaccine technologies during early on january the sars-cov-- genetic sequence data was shared through gisaid and by -9 march the global pharmaceutical industry announced a major commitment to address covid--9 covid--9 vaccination doses administered per people in phase iii trials several covid -9 vaccines have demonstrated efficacy as high as 95% in preventing symptomatic covid -9 infections as of march vaccines were authorized by at least one national regulatory authority for public use two rna vaccines the pfizer biontech vaccine and the moderna vaccine four conventional inactivated vaccines bbibp- corv coronavac covaxin and covivac four viral vector vaccines sputnik v the oxford astrazeneca vaccine convidecia and the johnson johnson vaccine and two protein subunit vaccines epivaccorona and rbd-dimer in total as of march --8 vaccine candidates were in various stages of development with 7- in clinical research including in map of countries by approval status phase i trials in phase i ii trials and -6 in phase iii development approved for general use mass many countries have implemented phased distribution plans that prioritize vaccination underway those at highest risk of complications such as the elderly and those at high e-a or equivalent granted mass risk of exposure and transmission such as healthcare workers as of vaccination underway march --6 -7 million doses of covid -9 vaccine have been e-a granted limited vaccination administered worldwide based on official reports from national 
health..."
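As a quick reminder of what str_squish() does compared with str_trim(), here is a minimal sketch on a made-up string (toy is hypothetical):

```r
library(stringr)

# Hypothetical toy string with leading, trailing, and repeated spaces
toy <- "  covid   vaccine  phase "

# str_trim() only removes whitespace at the two ends
trimmed <- str_trim(toy)
trimmed
# [1] "covid   vaccine  phase"

# str_squish() also collapses internal runs of whitespace
squished <- str_squish(toy)
squished
# [1] "covid vaccine phase"
```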

Finally, we are ready to tokenize the string object, cv_nospace, into words separated by " ".

cv_tidy_word <- unlist(str_split(cv_nospace, " "))
class(cv_tidy_word)

## [1] "character"

length(cv_tidy_word)

## [1] 8370

cv_tidy_word[1:50]

##  [1] "covid--9"      "vaccine"       "a"             "covid"        
##  [5] "-9"            "vaccine"       "is"            "a"            
##  [9] "vaccine"       "intended"      "to"            "provide"      
## [13] "acquired"      "immunity"      "against"       "severe"       
## [17] "acute"         "respiratory"   "syndrome"      "coronavirus"  
## [21] "sars"          "cov"           "the"           "virus"        
## [25] "causing"       "coronavirus"   "disease"       "---9"         
## [29] "covid"         "-9"            "prior"         "to"           
## [33] "the"           "covid"         "-9"            "pandemic"     
## [37] "there"         "was"           "an"            "established"  
## [41] "body"          "of"            "knowledge"     "about"        
## [45] "the"           "structure"     "and"           "function"     
## [49] "of"            "coronaviruses"
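The tokenization step can be tried on a small made-up string (a sketch; toy is hypothetical). It also shows why the whitespace was squished first: splitting on a single space turns every extra space into an empty token.

```r
library(stringr)

# Hypothetical toy string with a double space between two words
toy <- "covid vaccine  phase"

# str_split() returns a list with one element per input string,
# so unlist() is used to obtain a plain character vector
raw_tokens <- unlist(str_split(toy, " "))
raw_tokens
# [1] "covid"   "vaccine" ""        "phase"

# After squishing, no empty tokens are produced
clean_tokens <- unlist(str_split(str_squish(toy), " "))
clean_tokens
# [1] "covid"   "vaccine" "phase"
```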

cv_tidy_word_freq <- sort(table(cv_tidy_word), decreasing = TRUE) # Create a table of word counts 
cv_tidy_word_freq[1:50]

## cv_tidy_word
##         the          of         and     vaccine       phase          to 
##         276         221         219         173         146         126 
##          in           a         for    vaccines           i          -9 
##         125         100          98          85          79          69 
##      -nited  randomized       covid  controlled          as          ii 
##          63          62          54          53          50          49 
##    efficacy preclinical      states    covid--9          an    placebo- 
##          46          46          45          44          42          40 
##       trial          by        with         iii          or        that 
##          40          39          37          35          35          35 
##       doses         -8-          on         are      trials         rna 
##          34          33          33          31          31          30 
##         --8     against     subunit          at          be          is 
##          29          29          29          28          28          28 
##         mar         may        sars         --6         --5          -5 
##          28          28          28          27          26          26 
##         -6-       china 
##          26          26
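The same counting idiom, shown on a small made-up token vector (a sketch; toy_tokens is hypothetical): table() counts each unique word, and sort() with decreasing = TRUE puts the most frequent word first.

```r
# Hypothetical toy tokens
toy_tokens <- c("vaccine", "phase", "vaccine", "trial", "phase", "vaccine")

# Count each unique token, then order from most to least frequent
toy_freq <- sort(table(toy_tokens), decreasing = TRUE)

names(toy_freq)      # the unique words, most frequent first
# [1] "vaccine" "phase"   "trial"

as.integer(toy_freq) # their counts
# [1] 3 2 1
```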

One more step after tokenization is to remove stop words. Stop word lexicons are available in the stop_words dataset of the tidytext package.

#install.packages("tidytext")
library(tidytext)
tidytext::stop_words

## # A tibble: 1,149 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 a           SMART  
##  2 a's         SMART  
##  3 able        SMART  
##  4 about       SMART  
##  5 above       SMART  
##  6 according   SMART  
##  7 accordingly SMART  
##  8 across      SMART  
##  9 actually    SMART  
## 10 after       SMART  
## # ... with 1,139 more rows

stop_words %>% dplyr::count(lexicon)

## # A tibble: 3 x 2
##   lexicon      n
## * <chr>    <int>
## 1 onix       404
## 2 SMART      571
## 3 snowball   174

smart <- stop_words[stop_words$lexicon == "SMART", ] # Keep only the rows of the stop_words data frame that belong to the SMART lexicon

class(smart$word)

## [1] "character"

cv_tidy_nostop <- cv_tidy_word[!cv_tidy_word %in% smart$word] # %in% tests whether each element of cv_tidy_word appears in smart$word; negating it with ! keeps only the words that are not stop words

cv_tidy_word[1:50]

##  [1] "covid--9"      "vaccine"       "a"             "covid"        
##  [5] "-9"            "vaccine"       "is"            "a"            
##  [9] "vaccine"       "intended"      "to"            "provide"      
## [13] "acquired"      "immunity"      "against"       "severe"       
## [17] "acute"         "respiratory"   "syndrome"      "coronavirus"  
## [21] "sars"          "cov"           "the"           "virus"        
## [25] "causing"       "coronavirus"   "disease"       "---9"         
## [29] "covid"         "-9"            "prior"         "to"           
## [33] "the"           "covid"         "-9"            "pandemic"     
## [37] "there"         "was"           "an"            "established"  
## [41] "body"          "of"            "knowledge"     "about"        
## [45] "the"           "structure"     "and"           "function"     
## [49] "of"            "coronaviruses"

smart$word[1:50]

##  [1] "a"           "a's"         "able"        "about"       "above"      
##  [6] "according"   "accordingly" "across"      "actually"    "after"      
## [11] "afterwards"  "again"       "against"     "ain't"       "all"        
## [16] "allow"       "allows"      "almost"      "alone"       "along"      
## [21] "already"     "also"        "although"    "always"      "am"         
## [26] "among"       "amongst"     "an"          "and"         "another"    
## [31] "any"         "anybody"     "anyhow"      "anyone"      "anything"   
## [36] "anyway"      "anyways"     "anywhere"    "apart"       "appear"     
## [41] "appreciate"  "appropriate" "are"         "aren't"      "around"     
## [46] "as"          "aside"       "ask"         "asking"      "associated"

cv_tidy_nostop[1:50]

##  [1] "covid--9"      "vaccine"       "covid"         "-9"           
##  [5] "vaccine"       "vaccine"       "intended"      "provide"      
##  [9] "acquired"      "immunity"      "severe"        "acute"        
## [13] "respiratory"   "syndrome"      "coronavirus"   "sars"         
## [17] "cov"           "virus"         "causing"       "coronavirus"  
## [21] "disease"       "---9"          "covid"         "-9"           
## [25] "prior"         "covid"         "-9"            "pandemic"     
## [29] "established"   "body"          "knowledge"     "structure"    
## [33] "function"      "coronaviruses" "causing"       "diseases"     
## [37] "severe"        "acute"         "respiratory"   "syndrome"     
## [41] "sars"          "middle"        "east"          "respiratory"  
## [45] "syndrome"      "mers"          "enabled"       "accelerated"  
## [49] "development"   "vaccine"
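The filtering logic can be checked on a few made-up tokens. This is a sketch only: toy_tokens and toy_stop are hypothetical, and toy_stop is a hand-made stand-in for smart$word so that the example runs without tidytext.

```r
# Hypothetical toy tokens and a hand-made stop word list
toy_tokens <- c("a", "covid", "vaccine", "is", "a", "vaccine")
toy_stop   <- c("a", "is", "the")

# TRUE where a token is a stop word
is_stop <- toy_tokens %in% toy_stop
is_stop
# [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE

# Negate the logical vector to keep only the non-stop words
toy_nostop <- toy_tokens[!is_stop]
toy_nostop
# [1] "covid"   "vaccine" "vaccine"
```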

cv_tidy_nostop_freq <- sort(table(cv_tidy_nostop), decreasing = TRUE)
names(cv_tidy_nostop_freq)[1:50]

##  [1] "vaccine"      "phase"        "vaccines"     "-9"           "-nited"      
##  [6] "randomized"   "covid"        "controlled"   "ii"           "efficacy"    
## [11] "preclinical"  "states"       "covid--9"     "placebo-"     "trial"       
## [16] "iii"          "doses"        "-8-"          "trials"       "rna"         
## [21] "--8"          "subunit"      "mar"          "sars"         "--6"         
## [26] "--5"          "-5"           "-6-"          "china"        "dose"        
## [31] "south"        "---"          "--9"          "-5-"          "clinical"    
## [36] "development"  "double-blind" "emergency"    "safety"       "cov"         
## [41] "vector"       "--7"          "-9-"          "johnson"      "research"    
## [46] "inactivated"  "-7-"          "dec"          "nov"          "virus"
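The idiom `sort(table(x), decreasing = TRUE)` used above can be easier to follow on a tiny input. The following is a minimal sketch in base R; `toy_tokens` and `toy_freq` are made-up names and data for illustration, not part of the COVID-19 example.

```r
# Hypothetical token vector standing in for cv_tidy_nostop
toy_tokens <- c("vaccine", "phase", "vaccine", "trial", "phase", "vaccine")

# table() counts each unique token; sort(..., decreasing = TRUE) ranks the counts
toy_freq <- sort(table(toy_tokens), decreasing = TRUE)

names(toy_freq)      # unique tokens, most frequent first
as.vector(toy_freq)  # the matching counts
```

The names of the sorted table are the unique words and its values are their counts, which is why `names(cv_tidy_nostop_freq)` lists the words in order of frequency.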

cv_tidy_nostop_freq[1:50]

## cv_tidy_nostop
##      vaccine        phase     vaccines           -9       -nited   randomized 
##          173          146           85           69           63           62 
##        covid   controlled           ii     efficacy  preclinical       states 
##           54           53           49           46           46           45 
##     covid--9     placebo-        trial          iii        doses          -8- 
##           44           40           40           35           34           33 
##       trials          rna          --8      subunit          mar         sars 
##           31           30           29           29           28           28 
##          --6          --5           -5          -6-        china         dose 
##           27           26           26           26           26           26 
##        south          ---          --9          -5-     clinical  development 
##           26           25           25           25           25           25 
## double-blind    emergency       safety          cov       vector          --7 
##           24           24           24           23           23           22 
##          -9-      johnson     research  inactivated          -7-          dec 
##           22           22           22           21           20           20 
##          nov        virus 
##           20           20
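The table still contains tokens such as "-9", "--8", and "---" that have no letters at all. If such tokens are not wanted in the plot, one option is to drop them before counting. The following is a minimal sketch using `str_subset()` from stringr; `toy_tokens` and `toy_clean` are hypothetical names with made-up data, and whether this filter suits the real `cv_tidy_nostop` vector is an assumption to check against your own data.

```r
library(stringr)

# Hypothetical tokens mimicking the kind of output shown above
toy_tokens <- c("vaccine", "-9", "covid--9", "--8", "placebo-", "---")

# Keep only the tokens that contain at least one lowercase letter
toy_clean <- str_subset(toy_tokens, "[a-z]")
toy_clean
```

Tokens like "covid--9" and "placebo-" survive this filter because they contain letters; cleaning those up would need an additional step.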

Let’s create a wordcloud from the word frequency table.

library(wordcloud)

## Loading required package: RColorBrewer

wordcloud(words = names(cv_tidy_nostop_freq), # Unique words, most frequent first
          freq = cv_tidy_nostop_freq, # Frequency of each word
          min.freq = 10, # Words with frequency below 10 are not plotted
          random.order = FALSE, # Plot in decreasing frequency, so frequent words sit near the center
          rot.per = 0.1, # Proportion of words rotated by 90 degrees
          scale = c(3, 0.3), # Range of word sizes (largest, smallest)
          colors = brewer.pal(8, "Dark2")) # 8 colors from the "Dark2" palette of RColorBrewer

Now we have a much better wordcloud that gives more information about COVID-19 vaccines.
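To see what `min.freq = 10` does, here is a minimal sketch on a toy frequency table. The wordcloud documentation states that words with frequency below `min.freq` are not plotted, so its effect is roughly that of subsetting the table first. The names `toy_tokens`, `toy_freq`, and `toy_kept` and the counts are made up for illustration.

```r
# Made-up tokens: "vaccine" x 12, "phase" x 10, "trial" x 4
toy_tokens <- rep(c("vaccine", "phase", "trial"), times = c(12, 10, 4))
toy_freq <- sort(table(toy_tokens), decreasing = TRUE)

# Words that would survive min.freq = 10
toy_kept <- toy_freq[toy_freq >= 10]
names(toy_kept)
```

Raising `min.freq` gives a sparser plot with only the dominant words, while lowering it adds rarer words at the cost of clutter.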

Week6-3: Basic text processing 2

Shin Lee

2021/4/8
