Text mining begins with understanding text data written in natural language: raw text is pre-processed into data that are suitable for analysis.
Today, we will see how R can be used for text pre-processing. We will not install any package dedicated to text analysis, only a few helper packages such as the one for producing a word cloud. In other words, we will rely mainly on functions that ship with a default R installation, that is, the base packages. Such base functions have pros and cons. Let’s see the cons first.
There are, of course, pros as well.
You can access the help files about functions.
?c # Get help of a particular function c( )
## starting httpd help server ... done
?strsplit
help.search("split") # Search and return the help files for functions that include a word or phrase
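Besides `?` and `help.search()`, base R offers other ways to look up a function from the console. The short sketch below uses `apropos()` and `formals()`; the search term "split" and the variable names are only illustrations.

```r
# List objects on the search path whose names contain "split"
split_functions <- apropos("split")
split_functions

# Inspect the argument names of strsplit() without opening its help page
strsplit_args <- names(formals(strsplit))
strsplit_args
```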
strsplit("hello world", split = " ")
## [[1]]
## [1] "hello" "world"
unlist(strsplit("hello world", split = " "))
## [1] "hello" "world"
hello <- unlist(strsplit("hello world", split = " "))
hello
## [1] "hello" "world"
?paste
paste(hello, collapse = " ")
## [1] "hello world"
nchar(): counting the number of characters
nchar("text")
## [1] 4
# Blank and punctuation marks are also counted
nchar("text mining") #11? 10?
## [1] 11
nchar("text, mining") #12? 11?
## [1] 12
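Because nchar() is vectorised, the three calls above can also be made in one step. The sketch below additionally counts letters only, by removing every non-letter character first; the variable names are made up for this illustration.

```r
words <- c("text", "text mining", "text, mining")

# One count per element; blanks and punctuation marks are included
char_counts <- nchar(words)
char_counts

# Letters only: delete everything that is not a letter, then count
letter_counts <- nchar(gsub("[^[:alpha:]]", "", words))
letter_counts
```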
strsplit(): Texts are composed of words, which in turn consist of characters. In text mining, therefore, we segment texts into words (tokens) and pre-process those tokens. Afterwards, we may need to combine them again.
sent1 <- "Text mining begins with preprocessing and tokenization."
# Split sent1 into seven words that are separated from each other by white space (blank)
sent1_split <- strsplit(sent1, split=" ") # Parameter (split=" ") specifies a character (here, blank) that separates words as tokens
sent1_split # returns a list of vectors of words segmented from sent1
## [[1]]
## [1] "Text" "mining" "begins" "with"
## [5] "preprocessing" "and" "tokenization."
class(sent1)
## [1] "character"
class(sent1_split)
## [1] "list"
class(unlist(sent1_split))
## [1] "character"
sent1_split_vector <- unlist(sent1_split)
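The reason strsplit() returns a list becomes clearer when it receives more than one string: each input string gets its own list element. A small sketch with two made-up sentences:

```r
sentences <- c("Text mining begins with preprocessing.",
               "Tokens are words.")

tokens <- strsplit(sentences, split = " ")

length(tokens)    # one list element per input string
lengths(tokens)   # number of tokens in each sentence
tokens[[2]]       # tokens of the second sentence only
```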
paste(): We can combine the word segments back into a sentence in which the words are separated by blanks.
sent2 <- paste(sent1_split_vector, collapse=" ") # Parameter (collapse) specifies the string placed between the elements of sent1_split_vector when they are joined. Here, we join the words into one sentence with a blank (" ") between them.
sent2
## [1] "Text mining begins with preprocessing and tokenization."
paste(sent1_split_vector, collapse="/")
## [1] "Text/mining/begins/with/preprocessing/and/tokenization."
sent2
## [1] "Text mining begins with preprocessing and tokenization."
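paste() has two separator arguments that are easy to confuse: `sep` goes between the separate arguments, while `collapse` goes between the elements of the resulting vector. The sketch below contrasts them on a made-up two-word vector.

```r
words <- c("text", "mining")

# collapse: join the elements of one vector into a single string
joined <- paste(words, collapse = " ")

# sep: join separate arguments, element by element
pair <- paste("text", "mining", sep = "_")
tagged <- paste(words, "token", sep = "_")

# paste0() is paste() with sep = ""
glued <- paste0(words, collapse = "")

joined
pair
tagged
glued
```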
lapply( ) applies a specified function to each element of a list or a vector and returns a new list of the same length as the input. Each element of the result is the value obtained by applying the function to the corresponding element of the input.
The l in lapply( ) stands for list.
sentence <- "Apply functions apply a specified function to each element of a list or a vector object."
List2 <- list(sent1, sentence)
List2
## [[1]]
## [1] "Text mining begins with preprocessing and tokenization."
##
## [[2]]
## [1] "Apply functions apply a specified function to each element of a list or a vector object."
lapply(List2, nchar) # returns a list object of results from applying a function "nchar" to each element of List2
## [[1]]
## [1] 55
##
## [[2]]
## [1] 88
unlist(lapply(List2, nchar))
## [1] 55 88
sapply( ) applies a specified function to each element of a list and simplifies the result to a vector when possible. When every result is a single value, as in the example here, it gives the same values as applying unlist( ) to the result of lapply( ).
List2
## [[1]]
## [1] "Text mining begins with preprocessing and tokenization."
##
## [[2]]
## [1] "Apply functions apply a specified function to each element of a list or a vector object."
sapply(List2, nchar) # applies the function "nchar" to each element of the list List2 and returns a vector
## [1] 55 88
unlist(lapply(List2, nchar)) #"s" in sapply stands for "simplified"
## [1] 55 88
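The phrase "when possible" matters: if the results have different lengths, sapply( ) cannot simplify them and falls back to a list. The sketch below shows both cases on made-up strings and adds vapply( ), a base R relative that makes the expected result type explicit.

```r
texts <- list("Text mining", "R")

# Each result is a single number, so sapply() returns a vector
counts <- sapply(texts, nchar)
counts

# Split each string into words first
tokens <- lapply(texts, function(s) unlist(strsplit(s, split = " ")))

# Results of different lengths cannot be simplified: a list comes back
token_chars <- sapply(tokens, nchar)
token_chars

# vapply() declares the expected result: one integer per element
counts_checked <- vapply(texts, nchar, integer(1))
counts_checked
```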
R is open-source software, and people all over the world develop and share functions for it. Using such functions, we can carry out text mining tasks easily and effectively.
A package is a collection of such R functions (as well as data and compiled code). The location where packages are stored is called the library.
When we download a package with the function install.packages("package name"), it is stored in the library. To use the package, we then call the function library(package name), which makes the package available in the current session.
#install.packages("pdftools")
library(pdftools)
## Using poppler version 0.73.0
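A common pattern is to install a package only when it is missing, so that a script can be re-run without downloading it again. The sketch below is one way to do that; the names in `pkgs` are stand-ins that ship with R, to be replaced with packages such as "pdftools".

```r
# Stand-in package names (always available); replace with e.g. "pdftools"
pkgs <- c("stats", "utils")

# TRUE for every package that is already in the library
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# Install only what is missing
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```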
covid_text <- pdf_text("COVID-19_vaccine.pdf") # Extracting texts from PDF document
#covid_text
class(covid_text)
## [1] "character"
length(covid_text)
## [1] 93
covid_text[1]
## [1] "COVID-19 vaccine\r\nA COVID<U+2011>19 vaccine is a vaccine intended to provide acquired immunity\r\nagainst severe acute respiratory syndrome coronavirus 2 (SARS<U+2011>CoV<U+2011>2), the\r\nvirus causing coronavirus disease 2019 (COVID<U+2011>19). Prior to the\r\nCOVID<U+2011>19 pandemic, there was an established body of knowledge about\r\nthe structure and function of coronaviruses causing diseases like severe acute\r\nrespiratory syndrome (SARS) and Middle East respiratory syndrome\r\n(MERS), which enabled accelerated development of various vaccine\r\ntechnologies during early 2020.[1] On 10 January 2020, the SARS-CoV-2\r\ngenetic sequence data was shared through GISAID, and by 19 March, the\r\nglobal pharmaceutical industry announced a major commitment to address\r\nCOVID-19.[2] COVID-19 vaccination doses administered\r\n per 100 people\r\nIn Phase III trials, several COVID<U+2011>19 vaccines have demonstrated efficacy\r\nas high as 95% in preventing symptomatic COVID<U+2011>19 infections. 
As of\r\nMarch 2021, 12 vaccines were authorized by at least one national regulatory\r\nauthority for public use: two RNA vaccines (the Pfizer<U+2013>BioNTech vaccine\r\nand the Moderna vaccine), four conventional inactivated vaccines (BBIBP-\r\nCorV, CoronaVac, Covaxin, and CoviVac), four viral vector vaccines\r\n(Sputnik V, the Oxford<U+2013>AstraZeneca vaccine, Convidecia, and the Johnson\r\n& Johnson vaccine), and two protein subunit vaccines (EpiVacCorona and\r\nRBD-Dimer).[3] In total, as of March 2021, 308 vaccine candidates were in\r\nvarious stages of development, with 73 in clinical research, including 24 in\r\n Map of countries by approval status\r\nPhase I trials, 33 in Phase I<U+2013>II trials, and 16 in Phase III development.[3]\r\n Approved for general use, mass\r\nMany countries have implemented phased distribution plans that prioritize vaccination underway\r\nthose at highest risk of complications, such as the elderly, and those at high EUA (or equivalent) granted, mass\r\nrisk of exposure and transmission, such as healthcare workers.[4] As of vaccination underway\r\n20 March 2021, 436.37 million doses of COVID<U+2011>19 vaccine have been EUA granted, limited vaccination\r\nadministered worldwide based on official reports from national health Approved for general use, mass\r\nagencies.[5] AstraZeneca-Oxford anticipates producing 3 billion doses in vaccination planned\r\n2021, Pfizer-BioNTech 1.3 billion doses, and Sputnik V, Sinopharm, EUA granted, mass vaccination planned\r\nSinovac, and Johnson & Johnson 1 billion doses each. 
Moderna targets EUA pending\r\nproducing 600 million doses and Convidecia 500 million doses in 2021.[6][7]\r\nBy December 2020, more than 10 billion vaccine doses had been preordered\r\nby countries,[8] with about half of the doses purchased by high-income countries comprising 14% of the world's\r\npopulation.[9]\r\n Contents\r\n Background\r\n Planning and development\r\n Challenges\r\n Organizations\r\n History\r\n Vaccine types\r\n RNA vaccines\r\n Adenovirus vector vaccines\r\n Inactivated virus vaccines\r\n"
covid_text[93]
## [1] "Retrieved from \"https://en.wikipedia.org/w/index.php?title=COVID-19_vaccine&oldid=1013541498\"\r\nThis page was last edited on 22 March 2021, at 05:04 (UTC).\r\nText is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. By using this site, you\r\nagree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit\r\norganization.\r\n"
save(covid_text,file="covid_text.RData")
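An object stored with save() can be restored in a later session with load(), which recreates it under its original name. The round trip below is a minimal sketch on a made-up vector written to a temporary file rather than to covid_text.RData.

```r
notes <- c("first page", "second page")

# Write the object to a temporary .RData file
rdata_path <- tempfile(fileext = ".RData")
save(notes, file = rdata_path)

# Remove it from the workspace, then restore it from the file
rm(notes)
loaded_names <- load(rdata_path)  # returns the names of the restored objects
loaded_names
notes
```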
covid_text_word <- strsplit(covid_text, split=" ") # Parsing words: within each page, words are split at blanks " ". What is the class of covid_text_word?
class(covid_text_word)
## [1] "list"
length(covid_text_word)
## [1] 93
covid_text_word[[1]]
## [1] "COVID-19" "vaccine\r\nA"
## [3] "COVID<U+2011>19" "vaccine"
## [5] "is" "a"
## [7] "vaccine" "intended"
## [9] "to" "provide"
## [11] "acquired" "immunity\r\nagainst"
## [13] "severe" "acute"
## [15] "respiratory" "syndrome"
## [17] "coronavirus" "2"
## [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"
## [21] "causing" "coronavirus"
## [23] "disease" "2019"
## [25] "(COVID<U+2011>19)." "Prior"
## [27] "to" "the\r\nCOVID<U+2011>19"
## [29] "pandemic," "there"
## [31] "was" "an"
## [33] "established" "body"
## [35] "of" "knowledge"
## [37] "about\r\nthe" "structure"
## [39] "and" "function"
## [41] "of" "coronaviruses"
## [43] "causing" "diseases"
## [45] "like" "severe"
## [47] "acute\r\nrespiratory" "syndrome"
## [49] "(SARS)" "and"
## [51] "Middle" "East"
## [53] "respiratory" "syndrome\r\n(MERS),"
## [55] "which" "enabled"
## [57] "accelerated" "development"
## [59] "of" "various"
## [61] "vaccine\r\ntechnologies" "during"
## [63] "early" "2020.[1]"
## [65] "On" "10"
## [67] "January" "2020,"
## [69] "the" "SARS-CoV-2\r\ngenetic"
## [71] "sequence" "data"
## [73] "was" "shared"
## [75] "through" "GISAID,"
## [77] "and" "by"
## [79] "19" "March,"
## [81] "the\r\nglobal" "pharmaceutical"
## [83] "industry" "announced"
## [85] "a" "major"
## [87] "commitment" "to"
## [89] "address\r\nCOVID-19.[2]" ""
## [91] "" ""
## [93] "" ""
## [95] "" ""
## [97] "" ""
## [99] "" ""
## [101] "" ""
## [103] "" ""
## [105] "" ""
## [107] "" ""
## [109] "" ""
## [111] "" ""
## [113] "" ""
## [115] "" ""
## [117] "" ""
## [119] "" ""
## [121] "" ""
## [123] "" ""
## [125] "" ""
## [127] "" ""
## [129] "" ""
## [131] "" ""
## [133] "" ""
## [135] "" ""
## [137] "" ""
## [139] "" ""
## [141] "" ""
## [143] "" ""
## [145] "" ""
## [147] "" ""
## [149] "" ""
## [151] "" ""
## [153] "" ""
## [155] "" "COVID-19"
## [157] "vaccination" "doses"
## [159] "administered\r\n" ""
## [161] "" ""
## [163] "" ""
## [165] "" ""
## [167] "" ""
## [169] "" ""
## [171] "" ""
## [173] "" ""
## [175] "" ""
## [177] "" ""
## [179] "" ""
## [181] "" ""
## [183] "" ""
## [185] "" ""
## [187] "" ""
## [189] "" ""
## [191] "" ""
## [193] "" ""
## [195] "" ""
## [197] "" ""
## [199] "" ""
## [201] "" ""
## [203] "" ""
## [205] "" ""
## [207] "" ""
## [209] "" ""
## [211] "" ""
## [213] "" ""
## [215] "" ""
## [217] "" ""
## [219] "" ""
## [221] "" ""
## [223] "" ""
## [225] "" ""
## [227] "" ""
## [229] "" ""
## [231] "" ""
## [233] "" ""
## [235] "" ""
## [237] "" "per"
## [239] "100" "people\r\nIn"
## [241] "Phase" "III"
## [243] "trials," "several"
## [245] "COVID<U+2011>19" "vaccines"
## [247] "have" "demonstrated"
## [249] "efficacy\r\nas" "high"
## [251] "as" "95%"
## [253] "in" "preventing"
## [255] "symptomatic" "COVID<U+2011>19"
## [257] "infections." "As"
## [259] "of\r\nMarch" "2021,"
## [261] "12" "vaccines"
## [263] "were" "authorized"
## [265] "by" "at"
## [267] "least" "one"
## [269] "national" "regulatory\r\nauthority"
## [271] "for" "public"
## [273] "use:" "two"
## [275] "RNA" "vaccines"
## [277] "(the" "Pfizer<U+2013>BioNTech"
## [279] "vaccine\r\nand" "the"
## [281] "Moderna" "vaccine),"
## [283] "four" "conventional"
## [285] "inactivated" "vaccines"
## [287] "(BBIBP-\r\nCorV," "CoronaVac,"
## [289] "Covaxin," "and"
## [291] "CoviVac)," "four"
## [293] "viral" "vector"
## [295] "vaccines\r\n(Sputnik" "V,"
## [297] "the" "Oxford<U+2013>AstraZeneca"
## [299] "vaccine," "Convidecia,"
## [301] "and" "the"
## [303] "Johnson\r\n&" "Johnson"
## [305] "vaccine)," "and"
## [307] "two" "protein"
## [309] "subunit" "vaccines"
## [311] "(EpiVacCorona" "and\r\nRBD-Dimer).[3]"
## [313] "In" "total,"
## [315] "as" "of"
## [317] "March" "2021,"
## [319] "308" "vaccine"
## [321] "candidates" "were"
## [323] "in\r\nvarious" "stages"
## [325] "of" "development,"
## [327] "with" "73"
## [329] "in" "clinical"
## [331] "research," "including"
## [333] "24" "in\r\n"
## [335] "" ""
## [337] "" ""
## [339] "" ""
## [341] "" ""
## [343] "" ""
## [345] "" ""
## [347] "" ""
## [349] "" ""
## [351] "" ""
## [353] "" ""
## [355] "" ""
## [357] "" ""
## [359] "" ""
## [361] "" ""
## [363] "" ""
## [365] "" ""
## [367] "" ""
## [369] "" ""
## [371] "" ""
## [373] "" ""
## [375] "" ""
## [377] "" ""
## [379] "" ""
## [381] "" ""
## [383] "" ""
## [385] "" ""
## [387] "" ""
## [389] "" ""
## [391] "" ""
## [393] "" ""
## [395] "" ""
## [397] "" ""
## [399] "" ""
## [401] "" ""
## [403] "" ""
## [405] "" ""
## [407] "" ""
## [409] "" ""
## [411] "" ""
## [413] "Map" "of"
## [415] "countries" "by"
## [417] "approval" "status\r\nPhase"
## [419] "I" "trials,"
## [421] "33" "in"
## [423] "Phase" "I<U+2013>II"
## [425] "trials," "and"
## [427] "16" "in"
## [429] "Phase" "III"
## [431] "development.[3]\r\n" ""
## [433] "" ""
## [435] "" ""
## [437] "" ""
## [439] "" ""
## [441] "" ""
## [443] "" ""
## [445] "" ""
## [447] "" ""
## [449] "" ""
## [451] "" ""
## [453] "" ""
## [455] "" ""
## [457] "" ""
## [459] "" ""
## [461] "" ""
## [463] "" ""
## [465] "" ""
## [467] "" ""
## [469] "" ""
## [471] "" ""
## [473] "" ""
## [475] "" ""
## [477] "" ""
## [479] "" ""
## [481] "" ""
## [483] "" ""
## [485] "" ""
## [487] "" ""
## [489] "" ""
## [491] "" ""
## [493] "" ""
## [495] "" ""
## [497] "" ""
## [499] "" ""
## [501] "" ""
## [503] "" ""
## [505] "" ""
## [507] "" ""
## [509] "" ""
## [511] "" ""
## [513] "" "Approved"
## [515] "for" "general"
## [517] "use," "mass\r\nMany"
## [519] "countries" "have"
## [521] "implemented" "phased"
## [523] "distribution" "plans"
## [525] "that" "prioritize"
## [527] "" ""
## [529] "" ""
## [531] "" "vaccination"
## [533] "underway\r\nthose" "at"
## [535] "highest" "risk"
## [537] "of" "complications,"
## [539] "such" "as"
## [541] "the" "elderly,"
## [543] "and" "those"
## [545] "at" "high"
## [547] "" ""
## [549] "" ""
## [551] "EUA" "(or"
## [553] "equivalent)" "granted,"
## [555] "mass\r\nrisk" "of"
## [557] "exposure" "and"
## [559] "transmission," "such"
## [561] "as" "healthcare"
## [563] "workers.[4]" "As"
## [565] "of" ""
## [567] "" ""
## [569] "" ""
## [571] "" ""
## [573] "vaccination" "underway\r\n20"
## [575] "March" "2021,"
## [577] "436.37" "million"
## [579] "doses" "of"
## [581] "COVID<U+2011>19" "vaccine"
## [583] "have" "been"
## [585] "" ""
## [587] "" ""
## [589] "" ""
## [591] "" ""
## [593] "" ""
## [595] "" ""
## [597] "" ""
## [599] "" ""
## [601] "" "EUA"
## [603] "granted," "limited"
## [605] "vaccination\r\nadministered" "worldwide"
## [607] "based" "on"
## [609] "official" "reports"
## [611] "from" "national"
## [613] "health" ""
## [615] "" ""
## [617] "" ""
## [619] "" ""
## [621] "" ""
## [623] "" ""
## [625] "" ""
## [627] "Approved" "for"
## [629] "general" "use,"
## [631] "mass\r\nagencies.[5]" "AstraZeneca-Oxford"
## [633] "anticipates" "producing"
## [635] "3" "billion"
## [637] "doses" "in"
## [639] "" ""
## [641] "" ""
## [643] "" ""
## [645] "vaccination" "planned\r\n2021,"
## [647] "Pfizer-BioNTech" "1.3"
## [649] "billion" "doses,"
## [651] "and" "Sputnik"
## [653] "V," "Sinopharm,"
## [655] "" ""
## [657] "" ""
## [659] "" ""
## [661] "" ""
## [663] "" ""
## [665] "" ""
## [667] "" ""
## [669] "" ""
## [671] "EUA" "granted,"
## [673] "mass" "vaccination"
## [675] "planned\r\nSinovac," "and"
## [677] "Johnson" "&"
## [679] "Johnson" "1"
## [681] "billion" "doses"
## [683] "each." "Moderna"
## [685] "targets" ""
## [687] "" ""
## [689] "" ""
## [691] "" ""
## [693] "" ""
## [695] "" ""
## [697] "" ""
## [699] "" "EUA"
## [701] "pending\r\nproducing" "600"
## [703] "million" "doses"
## [705] "and" "Convidecia"
## [707] "500" "million"
## [709] "doses" "in"
## [711] "2021.[6][7]\r\nBy" "December"
## [713] "2020," "more"
## [715] "than" "10"
## [717] "billion" "vaccine"
## [719] "doses" "had"
## [721] "been" "preordered\r\nby"
## [723] "countries,[8]" "with"
## [725] "about" "half"
## [727] "of" "the"
## [729] "doses" "purchased"
## [731] "by" "high-income"
## [733] "countries" "comprising"
## [735] "14%" "of"
## [737] "the" "world's\r\npopulation.[9]\r\n"
## [739] "Contents\r\n" "Background\r\n"
## [741] "Planning" "and"
## [743] "development\r\n" ""
## [745] "" ""
## [747] "" ""
## [749] "" "Challenges\r\n"
## [751] "" ""
## [753] "" ""
## [755] "" ""
## [757] "Organizations\r\n" ""
## [759] "" ""
## [761] "" ""
## [763] "" "History\r\n"
## [765] "Vaccine" "types\r\n"
## [767] "" ""
## [769] "" ""
## [771] "" ""
## [773] "RNA" "vaccines\r\n"
## [775] "" ""
## [777] "" ""
## [779] "" ""
## [781] "Adenovirus" "vector"
## [783] "vaccines\r\n" ""
## [785] "" ""
## [787] "" ""
## [789] "" "Inactivated"
## [791] "virus" "vaccines\r\n"
covid_text_word <- unlist(covid_text_word)
length(covid_text_word)
## [1] 95363
strsplit(strrep(" ", 6), split=" ") # six consecutive blanks produce six empty strings
## [[1]]
## [1] "" "" "" "" "" ""
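Splitting at a single blank leaves two kinds of debris in the tokens: empty strings from runs of blanks, and line breaks ("\r\n") glued between words. One possible remedy, shown here as a sketch on a made-up string, is to split at any run of white space with a regular expression. Note that leading white space would still produce one empty first token.

```r
line <- "vaccine\r\nA COVID-19   vaccine"

# Splitting at a single blank keeps "\r\n" and creates empty strings
by_blank <- strsplit(line, split = " ")[[1]]
by_blank

# Splitting at one or more white-space characters avoids both
by_space <- strsplit(line, split = "[[:space:]]+")[[1]]
by_space
```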
covid_text_word[1:200]
## [1] "COVID-19" "vaccine\r\nA"
## [3] "COVID<U+2011>19" "vaccine"
## [5] "is" "a"
## [7] "vaccine" "intended"
## [9] "to" "provide"
## [11] "acquired" "immunity\r\nagainst"
## [13] "severe" "acute"
## [15] "respiratory" "syndrome"
## [17] "coronavirus" "2"
## [19] "(SARS<U+2011>CoV<U+2011>2)," "the\r\nvirus"
## [21] "causing" "coronavirus"
## [23] "disease" "2019"
## [25] "(COVID<U+2011>19)." "Prior"
## [27] "to" "the\r\nCOVID<U+2011>19"
## [29] "pandemic," "there"
## [31] "was" "an"
## [33] "established" "body"
## [35] "of" "knowledge"
## [37] "about\r\nthe" "structure"
## [39] "and" "function"
## [41] "of" "coronaviruses"
## [43] "causing" "diseases"
## [45] "like" "severe"
## [47] "acute\r\nrespiratory" "syndrome"
## [49] "(SARS)" "and"
## [51] "Middle" "East"
## [53] "respiratory" "syndrome\r\n(MERS),"
## [55] "which" "enabled"
## [57] "accelerated" "development"
## [59] "of" "various"
## [61] "vaccine\r\ntechnologies" "during"
## [63] "early" "2020.[1]"
## [65] "On" "10"
## [67] "January" "2020,"
## [69] "the" "SARS-CoV-2\r\ngenetic"
## [71] "sequence" "data"
## [73] "was" "shared"
## [75] "through" "GISAID,"
## [77] "and" "by"
## [79] "19" "March,"
## [81] "the\r\nglobal" "pharmaceutical"
## [83] "industry" "announced"
## [85] "a" "major"
## [87] "commitment" "to"
## [89] "address\r\nCOVID-19.[2]" ""
## [91] "" ""
## [93] "" ""
## [95] "" ""
## [97] "" ""
## [99] "" ""
## [101] "" ""
## [103] "" ""
## [105] "" ""
## [107] "" ""
## [109] "" ""
## [111] "" ""
## [113] "" ""
## [115] "" ""
## [117] "" ""
## [119] "" ""
## [121] "" ""
## [123] "" ""
## [125] "" ""
## [127] "" ""
## [129] "" ""
## [131] "" ""
## [133] "" ""
## [135] "" ""
## [137] "" ""
## [139] "" ""
## [141] "" ""
## [143] "" ""
## [145] "" ""
## [147] "" ""
## [149] "" ""
## [151] "" ""
## [153] "" ""
## [155] "" "COVID-19"
## [157] "vaccination" "doses"
## [159] "administered\r\n" ""
## [161] "" ""
## [163] "" ""
## [165] "" ""
## [167] "" ""
## [169] "" ""
## [171] "" ""
## [173] "" ""
## [175] "" ""
## [177] "" ""
## [179] "" ""
## [181] "" ""
## [183] "" ""
## [185] "" ""
## [187] "" ""
## [189] "" ""
## [191] "" ""
## [193] "" ""
## [195] "" ""
## [197] "" ""
## [199] "" ""
#install.packages("stringr")
library(stringr)
covid_text_word[1:100]
nchar(covid_text_word[1:100])
## [1] 8 10 8 7 2 1 7 8 2 7 8 17 6 5 11 8 11 1 13 10 7 11 7 4 11
## [26] 5 2 13 9 5 3 2 11 4 2 9 10 9 3 8 2 13 7 8 4 6 18 8 6 3
## [51] 6 4 11 17 5 7 11 11 2 7 21 6 5 8 2 2 7 5 3 19 8 4 3 6 7
## [76] 7 3 2 2 6 11 14 8 9 1 5 10 2 21 0 0 0 0 0 0 0 0 0 0 0
covid_text_word_main <- covid_text_word[nchar(covid_text_word)>0]
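The same filter can be written with nzchar( ), a base R function that returns TRUE for non-empty strings. A small sketch on a made-up token vector, showing that both forms keep the same elements:

```r
tokens <- c("COVID-19", "", "", "vaccine", "")

keep_nchar  <- tokens[nchar(tokens) > 0]  # the form used above
keep_nzchar <- tokens[nzchar(tokens)]     # equivalent and a little shorter

keep_nzchar
identical(keep_nchar, keep_nzchar)
```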
# which: a function to give the TRUE indices of a logical object
# str_detect: a function to detect the presence (TRUE) or absence (FALSE) of a pattern in a string
str_detect(covid_text_word_main, pattern="References")[1:100]
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE
which(str_detect(covid_text_word_main, pattern="References"))
## [1] 406 7877
str_which(covid_text_word_main, pattern="References")
## [1] 406 7877
covid_text_word_main <- covid_text_word_main[1:7876] #selecting only words before the word "references"
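The cut-off index 7876 is typed in by hand from the output above. It can also be computed, which keeps the script valid if the document changes. The sketch below uses base R grep( ) on a made-up token vector and assumes, as in the example above, that the last match marks the start of the reference list.

```r
tokens <- c("Contents", "References", "Background", "History",
            "References", "Smith", "2020")

# Positions of tokens that contain "References"
hits <- grep("References", tokens, fixed = TRUE)
hits

# Assumption: the last hit is the heading of the reference list
cutoff <- max(hits) - 1
body_tokens <- tokens[seq_len(cutoff)]
body_tokens
```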
# sort: a function to sort a vector into ascending or descending order
# table: a function to build a table of the counts at each observation (word)
covid_text_word_freq <- sort(table(covid_text_word_main), decreasing = TRUE)
covid_text_word_freq[1:50]
## covid_text_word_main
## the of and Phase to
## 212 195 177 130 113
## in vaccine <U+2013> for a
## 97 95 92 87 75
## 2020 2021, United vaccines I
## 57 48 48 43 42
## COVID<U+2011>19 Preclinical\r\n an by that
## 40 37 32 32 32
## with COVID-19 I<U+2013>II or as
## 30 29 28 28 26
## Randomized,\r\n are be is 2021,\r\n
## 26 25 25 25 24
## at doses RNA III South
## 24 24 24 23 23
## 2020, on placebo-\r\n Subunit The
## 22 22 22 22 22
## clinical efficacy have States 2022,
## 21 21 20 20 19
## against trial vector Nov and\r\n
## 19 19 19 18 17
covid_text_word_freq <- sort(table(tolower(covid_text_word_main)), decreasing = TRUE)
covid_text_word_freq[1:50]
##
## the of and phase to
## 234 195 177 135 113
## vaccine in <U+2013> for a
## 109 105 92 89 81
## 2020 vaccines 2021, united i
## 57 50 48 48 42
## covid<U+2011>19 randomized,\r\n preclinical\r\n an as
## 40 38 37 35 35
## by that with covid-19 placebo-\r\n
## 33 32 32 29 29
## i<U+2013>ii or on subunit are
## 28 28 27 26 25
## be is 2021,\r\n at doses
## 25 25 24 24 24
## rna emergency iii south 2020,
## 24 23 23 23 22
## efficacy clinical may trial have
## 22 21 21 21 20
## states 2022, against vector development
## 20 19 19 19 18
length(covid_text_word_freq)
## [1] 3211
covid_text_word_freq[1:20]
##
## the of and phase to
## 234 195 177 135 113
## vaccine in <U+2013> for a
## 109 105 92 89 81
## 2020 vaccines 2021, united i
## 57 50 48 48 42
## covid<U+2011>19 randomized,\r\n preclinical\r\n an as
## 40 38 37 35 35
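The frequency table still counts "2021," and "2021,\r\n" as different words, because trailing punctuation and line breaks are part of the tokens. A possible next step, sketched here on a made-up token vector, is to strip such trailing characters before counting. The regular expression is one choice among many and leaves punctuation inside a word (as in "covid-19") untouched.

```r
tokens <- c("Vaccine", "vaccine,", "vaccine\r\n", "2021,", "2021,\r\n", "the")

# Remove trailing white space and punctuation, then lower-case
clean <- tolower(gsub("[[:space:][:punct:]]+$", "", tokens))
clean

# Count and sort as before
freq <- sort(table(clean), decreasing = TRUE)
freq
```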
library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(words = names(covid_text_word_freq),  # Unique words
          freq = covid_text_word_freq,          # Frequency of each word
          min.freq = 10,                        # Words with a lower frequency are not plotted
          random.order = FALSE,                 # Highly frequent words are placed in the middle
          rot.per = 0.1,                        # Proportion of words rotated in the plot
          colors = brewer.pal(8, "Dark2"),      # Retrieve 8 colors from the "Dark2" palette
          scale = c(3, 0.5))                    # Range of word sizes (largest, smallest)