Background & Data:
This is part of an ongoing research project: a systematic literature review. The text data are article titles and abstracts containing key terms of interest, downloaded from Web of Science. In this demonstration, I show how to extract a combined frequency table of individual words and n-grams. I rely mostly on base R commands. I use the stop words list built into WordDistance, and I use the str_view_all command in the stringr package to highlight matched patterns in a long text.
In text analysis, it may be necessary to create a project-specific stop words list.
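As a minimal sketch of what a project-specific list could look like, the snippet below extends a general stop words vector with extra terms and filters a tokenized text. The vectors base_stopwords and project_stopwords and the sample sentence are hypothetical, made up for illustration; they are not the WordDistance list or the project data.

```r
# Hypothetical general-purpose list; in practice this could be any stop words vector.
base_stopwords <- c("this", "the", "of", "and", "a", "in", "to", "is")
# Hypothetical project-specific additions: frequent in abstracts but uninformative here.
project_stopwords <- c("paper", "study", "article", "findings")
all_stopwords <- unique(c(base_stopwords, project_stopwords))

text <- "This paper examines the role of collaborative governance in a study of networks"

# Lowercase, then split on any run of non-letter characters.
tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
tokens <- tokens[tokens != ""]

# Keep only the tokens that are not stop words.
kept <- tokens[!tokens %in% all_stopwords]
print(kept)
```

Keeping the project-specific terms in their own vector makes it easy to revise them as the review progresses without touching the general list.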
Packages Preparation
library(readr) # import .csv
library(dplyr) # data manipulation; note that %in% itself is base R
library(DT) # show tables in R markdown
library(WordDistance) # currently available for macOS (still being tested on Windows and Linux); contains a relatively comprehensive stop words list
data("stopwords") # a list of stop words
library(writexl) # export multiple excel (.xlsx) worksheets
library(stringr) # str_view_all highlights all searched terms in a string
CGN <- read_csv("CGN.csv")
PADM <- CGN[grepl("public administration", CGN$`WoS Categories`, ignore.case = TRUE), ]
#names(CGN)
Index a list of abstracts by keywords.
PADM_text <- paste(PADM$`Article Title`, PADM$Abstract)
PADM$text <- PADM_text
ID.collab <- PADM$ID[grepl("collaborative governance", PADM$text, ignore.case = TRUE)]
ID.netgov <- PADM$ID[grepl("governance network|network(|ed) governance", PADM$text, ignore.case = TRUE)] # the previous excel did not capture the spelling variation of "networked governance"
ID.complex <- PADM$ID[grepl("complex(|ity) (theory|network|governance)|complex(|ity)(| adaptive) system", PADM$text, ignore.case = TRUE)] # search multiple patterns
ID.self <- PADM$ID[grepl("self(| |-)govern|self(| |-)organi", PADM$text, ignore.case = TRUE)]
ID.collabnet <- PADM$ID[grepl("collaborative network", PADM$text, ignore.case = TRUE)]
ID.collabPAPM <- PADM$ID[grepl("collaborative public (administration|management)", PADM$text, ignore.case = TRUE)]
It is important to search for as many spelling variations as possible. When searching for multiple terms or spelling variations in a string, use parentheses (), the alternation bar |, and the wildcard .* to express the search as an R pattern. For example, the pattern "governance network|network(|ed) governance" matches “governance network(s)”, “network governance”, and “networked governance”.
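To check how such a pattern behaves before running it on the full data, it can help to try it on a few made-up strings. The sketch below uses hypothetical example titles (not from the Web of Science data) and also compares the pattern with an equivalent spelling that uses (ed)? for the optional suffix.

```r
pattern <- "governance network|network(|ed) governance"

# Hypothetical titles, made up to exercise each spelling variation.
examples <- c(
  "Governance networks in local policy",
  "Network governance and accountability",
  "Networked governance of public services",
  "Networking for governance reform"   # should not match
)

matched <- grepl(pattern, examples, ignore.case = TRUE)
print(data.frame(text = examples, matched = matched))

# An equivalent way to write the optional "ed": (ed)? instead of (|ed).
matched_alt <- grepl("governance network|network(ed)? governance",
                     examples, ignore.case = TRUE)
print(identical(matched, matched_alt))
```

Including at least one string that should not match is a cheap way to catch a pattern that is too permissive.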
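setdiff() shows the discrepancies between two sets of search results. Note that setdiff(x, y) returns only the elements of x that are not in y, so check both directions. If both calls return an empty vector such as numeric(0), the two coding methods produced the same results.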
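A small sketch of that comparison, using two hypothetical ID vectors (ids_method_a and ids_method_b are invented for illustration, not the project's actual results):

```r
# Hypothetical IDs returned by two coding methods.
ids_method_a <- c(3, 7, 12, 18)       # e.g. from the regex search
ids_method_b <- c(3, 7, 12, 18, 25)   # e.g. from an earlier manual coding

# setdiff() is asymmetric, so look in both directions.
only_in_a <- setdiff(ids_method_a, ids_method_b)
only_in_b <- setdiff(ids_method_b, ids_method_a)

print(only_in_a)   # empty: everything in A is also in B
print(only_in_b)   # the ID that A missed

# setequal() gives a single TRUE/FALSE answer to "same set of IDs?"
same_results <- setequal(ids_method_a, ids_method_b)
print(same_results)
```

Here one direction is empty while the other is not, which is exactly the case a single setdiff() call could hide.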
Count all occurrences of a pattern, then aggregate them.
If a pattern occurs multiple times in the same text, count only once.
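The two counting rules can be sketched in base R with gregexpr() and regmatches(). This is only an illustration under my own assumptions: the texts vector is made up, and freq_raw and freq_adjusted are hypothetical names, not necessarily how the frequency tables exported later were built.

```r
# Hypothetical texts standing in for title + abstract strings.
texts <- c(
  "Complexity theory and complex adaptive systems: complexity theory revisited",
  "Complex systems in public administration",
  "No relevant term here"
)
pattern <- "complex(|ity) (theory|network|governance)|complex(|ity)(| adaptive) system"

# Extract every match from every text; lowercase first so case variants are pooled.
lower_texts <- tolower(texts)
matches_per_text <- regmatches(lower_texts, gregexpr(pattern, lower_texts))

# Raw: every occurrence counts, then all occurrences are aggregated.
freq_raw <- table(unlist(matches_per_text))

# Adjusted: a term that occurs several times in the same text counts only once.
freq_adjusted <- table(unlist(lapply(matches_per_text, unique)))

print(freq_raw)
print(freq_adjusted)
```

In the first text "complexity theory" appears twice, so it contributes 2 to the raw table but only 1 to the adjusted table.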
network(ed) governance/governance network(s)
If a pattern occurs multiple times in the same text, count only once.
Count all occurrences of a pattern, then aggregate them.
If a pattern occurs multiple times in the same text, count only once.
Write Multiple Sheets to .xlsx
# Each named list element becomes one worksheet in the output file
write_xlsx(
  list(
    "Complex_Term_Raw" = Complex_Term_Freq_Raw,
    "Complex_Term_Adjusted" = Complex_Term_Freq_Adjusted,
    "Collaborative_Term_Raw" = Collab_Term_Freq_Raw,
    "Collaborative_Term_Adjusted" = Collab_Term_Freq_Adjusted,
    "NetGov_Term_Raw" = netgov_Term_Freq_Raw,
    "NetGov_Term_Adjusted" = netgov_Term_Freq_Adjusted,
    "Self_Term_Raw" = self_Term_Freq_Raw,
    "Self_Term_Adjusted" = self_Term_Freq_Adjusted
  ),
  "Term_Freq_CGN.xlsx"
)