This guide offers a detailed, step-by-step process for creating a word cloud that visualizes the main topics from various scientific papers. To get started, ensure you have PDF versions of your research papers saved on your local drive. By following the instructions provided here, you can swiftly and effectively showcase the scope of your research to others.
(This guide also includes access to a web app, which I recommend using if you’re primarily interested in seeing the final word cloud.)
To begin, we need to specify the path to the folder on our local drive that contains all of the PDF files. (Note: Make sure the path matches the folder on your own computer.)
# Note: These are the file paths in my local drive. They will be different for your computer.
directory_path <- ("/Users/zakwitkower/Dropbox/Data Science Portfolio/Wordcloud/My Papers/First-author publications")
Next, we will identify the full path for each unique file in the selected folder.
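The code for this step is not shown in the guide, so here is a minimal sketch of one way to do it, assuming base R's list.files() and that every paper's file name ends in .pdf. The object name pdf_files is taken from the loop used later in this guide; the fallback to the working directory is my own addition so that the snippet also runs on its own.

```r
# Assumption: directory_path was set in the previous step. The fallback to the
# current working directory is only here so the snippet runs on its own.
if (!exists("directory_path")) directory_path <- "."

# Full path of every PDF in the folder (file names ending in .pdf or .PDF)
pdf_files <- list.files(directory_path,
                        pattern = "\\.pdf$",
                        full.names = TRUE,
                        ignore.case = TRUE)

# Quick check: how many papers were found?
length(pdf_files)
```

Setting full.names = TRUE matters here: without it, list.files() returns bare file names, which cannot be opened from a different working directory.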
Before we extract text from the PDF documents, we need to decide which sections to include. Sections like the references typically do not convey the core arguments of a document and are therefore unsuitable for a word cloud. In contrast, sections such as the abstract and introduction are ideal. To capture them efficiently, I will set up a function that first collects all text from a document and keeps only the text that comes before the “References” header. If a “Results” header exists, the function then also drops everything from that header onward, keeping only the text before it. This two-step approach is necessary because not every manuscript includes a “Results” section (e.g., theory papers, review papers, commentaries). Consequently, the text for the word cloud will include the sections that precede the results (the abstract and introduction, along with any methods text that comes before the results) for empirical papers, and everything except the references for non-empirical papers.
require(pdftools) # PDF text extraction (provides pdf_text())
require(stringr) # Text editing
# Create a function to specify the final text being analyzed
extract_text_before_method <- function(file_path) {
  # Extract all text from the PDF (one string per page, joined with newlines)
  full_text <- paste(pdf_text(file_path), collapse = "\n")
  # Retain text up until the "References" header
  extracted_text <- sub("(.*?)References\\n.*", "\\1", full_text)
  # Retain text up until the "Results" header, if there is one
  extracted_text <- sub("(.*?)Results\\n.*", "\\1", extracted_text)
  return(extracted_text)
}
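# Create an empty list to hold the extracted text of each document
pdf.files <- list()
# Loop through each file and extract the text using the function specified above
for (file in pdf_files) {
  # Named doc_text so the variable does not mask pdftools::pdf_text()
  doc_text <- extract_text_before_method(file)
  # Clean data by replacing any remaining line breaks with spaces
  # (deleting them outright would fuse the last word of one line with the first word of the next)
  doc_text <- gsub("\n", " ", doc_text)
  pdf.files[[file]] <- doc_text
}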
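To see what the two sub() calls keep and discard, it can help to run them on a short made-up string instead of a real PDF. The sketch below is illustrative only: truncate_sections() is a hypothetical helper that repeats the two patterns from extract_text_before_method(), and both toy strings are invented.

```r
# Hypothetical helper: the same two truncation steps, applied to plain text
truncate_sections <- function(text) {
  text <- sub("(.*?)References\\n.*", "\\1", text) # drop "References" onward
  text <- sub("(.*?)Results\\n.*", "\\1", text)    # drop "Results" onward, if present
  text
}

# Invented empirical paper: has both a Results and a References header
empirical <- "Abstract\nWe test X.\nIntroduction\nX matters.\nResults\nt = 2.1\nReferences\nSmith (2020)"
# Invented review paper: no Results header
review <- "Abstract\nWe review X.\nDiscussion\nX is broad.\nReferences\nSmith (2020)"

kept_empirical <- truncate_sections(empirical)
kept_review <- truncate_sections(review)

# Print what survives for each toy paper
cat(kept_empirical)
cat(kept_review)
```

Because each pattern looks for the header word immediately followed by a line break, a header such as “Results and Discussion” would not be matched, and the word “Results” at the very end of an earlier line would be. It is worth spot-checking the extracted text for a few of your own papers.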
Now that we’ve isolated the appropriate sections of text, the next step is to process this text data by transforming it into a Term Document Matrix (TDM). A TDM is a numerical representation of the text data, enabling us to perform quantitative analysis and create a WordCloud. In the creation of our TDM, we’ll remove any unnecessary words, numbers, and punctuation, such as “the,” “it,” “who,” and statistical notations like “p < .05.” This step will ensure we have a cleaner and more focused set of terms to analyze.
require(tm) # Text mining (provides TermDocumentMatrix())
pdf.files <- TermDocumentMatrix(pdf.files, # The list of extracted texts created above
                                control =
                                  list(removePunctuation = TRUE, # Remove punctuation
                                       stopwords = TRUE, # Remove stopwords
                                       tolower = TRUE, # Convert to lowercase
                                       stemming = FALSE, # Retain full words
                                       removeNumbers = TRUE)) # Remove numbers
print(pdf.files)
## <<TermDocumentMatrix (terms: 7176, documents: 19)>>
## Non-/sparse entries: 15127/121217
## Sparsity : 89%
## Maximal term length: 46
## Weighting : term frequency (tf)
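For a sense of what the matrix holds, the sketch below builds a TDM from two invented one-sentence “documents” and prints it as an ordinary matrix, with one row per term and one column per document. It assumes the tm package is installed; the sentences and the names toy_docs and toy_tdm are made up for illustration, and the text is wrapped in VCorpus(VectorSource()) here only to make the corpus step explicit.

```r
library(tm)

# Two invented "documents" (not taken from any real paper)
toy_docs <- c(paper_a = "The head tilt changes the perceived dominance, p < .05.",
              paper_b = "Dominance is communicated by the head and the body.")

# Same control options as in the guide, applied to the toy corpus
toy_tdm <- TermDocumentMatrix(VCorpus(VectorSource(toy_docs)),
                              control = list(removePunctuation = TRUE,
                                             stopwords = TRUE,
                                             tolower = TRUE,
                                             stemming = FALSE,
                                             removeNumbers = TRUE))

# Rows are terms, columns are documents, cells are counts
as.matrix(toy_tdm)
```

Stopwords such as “the” and fragments of the statistical notation should be absent from the rows, while content words such as “dominance” and “head” are counted once in each document.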
Our Term Document Matrix (TDM) is large, containing more than 7,000 terms extracted from 19 documents. That is too much information for a single visualization. Consequently, our next step is to distill it down to the 100 most frequently used terms. To accomplish this, we will use the tidytext package.
require(tidytext) # Text mining
require(tidyverse) # Data wrangling
# Clean the data
pdf.files <- tidy(pdf.files) %>%
  group_by(term) %>% # For each term
  summarise(count = sum(count)) %>% # Sum its counts across all documents in the TDM
  arrange(desc(count)) # And sort the terms in descending order of frequency
# Extract the 100 most frequently used words
pdf.files <- head(pdf.files, 100)
print(pdf.files)
## # A tibble: 100 × 2
## term count
## <chr> <dbl>
## 1 head 469
## 2 expressions 368
## 3 emotion 291
## 4 bodily 287
## 5 tracy 243
## 6 facial 240
## 7 expression 239
## 8 dominance 228
## 9 tilt 228
## 10 study 197
## # ℹ 90 more rows
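The aggregation step can be checked on a small hand-made table. This sketch assumes only the dplyr and tibble packages; the toy counts are invented, and the three columns mirror what tidy() returns for a TDM (term, document, count).

```r
library(dplyr)

# Invented long-format counts: one row per term-document pair
toy_counts <- tibble::tibble(
  term     = c("emotion", "emotion", "head", "head", "tilt"),
  document = c("paper_a", "paper_b", "paper_a", "paper_b", "paper_b"),
  count    = c(3, 2, 4, 2, 1)
)

# Sum each term's counts across documents, sort, and keep the top two terms
toy_top <- toy_counts %>%
  group_by(term) %>%
  summarise(count = sum(count)) %>%
  arrange(desc(count)) %>%
  head(2)

print(toy_top)
```

With a single grouping variable, summarise() returns one row per term, so the totals can be sorted and truncated directly.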
Now we’re ready to create a word cloud.
(Note: The font weight, rotation, font family, and size of the word cloud can be customized by modifying the code below.)
library(wordcloud2)
library(randomcoloR)
wordcloud2(pdf.files,
           fontWeight = "bold",
           rotateRatio = 0, # Share of words to rotate; 0 keeps every word horizontal
           size = 0.6,
           fontFamily = "Tahoma",
           color = randomColor(nrow(pdf.files), # Using the randomcoloR package
                               hue = "random",
                               luminosity = "dark"))
As a reminder, if you just want the wordcloud, check out my web app, which will automate this process for you!
Alternatively, you may wish to create a word cloud that visualizes your research using your Google Scholar profile and your name. This approach is described in more detail on my website. It requires coding in R, because the scholar package is not compatible with R Shiny.
If you have any questions, comments, or concerns, please get in touch with me. I’m happy to help!
Email: Zakwitkower@gmail.com
Website: www.ZakWitkower.com
LinkedIn: Zak Witkower
Twitter: @Zakwitkower