This guide offers a detailed, step-by-step process for creating a word cloud that visualizes the main topics from various scientific papers. To get started, ensure you have PDF versions of your research papers saved on your local drive. By following the instructions provided here, you can swiftly and effectively showcase the scope of your research to others.

(This guide also includes access to a web app, which I recommend using if you’re primarily interested in seeing the final word cloud.)

Load PDF files into R

To begin, we need to specify the path to the folder on our local drive that contains all of the PDF files. (Note: Make sure the path points to a folder on your own drive.)

# Note: This is the file path on my local drive. It will be different on your computer.
directory_path <- "/Users/zakwitkower/Dropbox/Data Science Portfolio/Wordcloud/My Papers/First-author publications"

Next, we will identify the full path for each unique file in the selected folder.

library(pdftools) # Read PDFs

# List all PDF files in the directory
pdf_files <- list.files(path = directory_path, # insert path (from above)
                        pattern = "\\.pdf$",   # all files ending in ".pdf"
                        full.names = TRUE)     # return entire path
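
Before moving on, it can help to confirm that the pattern picks up only the files you expect. The sketch below runs the same kind of match on a few made-up file names (they are illustrative, not files from this project). It suggests that an escaped dot combined with `ignore.case = TRUE` also catches upper-case extensions, while skipping names that merely end in the letters "pdf".

```r
# Hypothetical file names, used only to illustrate the pattern
example_names <- c("paper_one.pdf", "paper_two.PDF", "notes.txt", "list_of_pdf")

# "pdf$" matches any name that ends in the lower-case letters "pdf"
loose_match <- grepl("pdf$", example_names)

# "\\.pdf$" with ignore.case = TRUE requires a literal ".pdf" or ".PDF" ending
strict_match <- grepl("\\.pdf$", example_names, ignore.case = TRUE)

# Compare the two patterns side by side
data.frame(file = example_names, loose_match, strict_match)
```

`list.files()` accepts the same `ignore.case` argument, so the stricter pattern can be used there too if your folder mixes ".pdf" and ".PDF" files.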

Identify and select terms

Before we extract text from the PDF documents, we need to decide which sections to include. Sections like the references typically do not convey the core arguments of a paper and are therefore unsuitable for a WordCloud, whereas the abstract and introduction are ideal. To capture them efficiently, I will set up a function that first collects all text from a document and then keeps only the text that precedes the “References” header. If a “Results” header exists, the function also drops everything from that header onward. This two-step approach is needed because not every manuscript includes a “Results” section (e.g., theory papers, review papers, commentaries). Consequently, the text for the WordCloud will include the sections that come before the results (chiefly the abstract and introduction) for empirical papers, and everything except the references for non-empirical papers.

require(stringr)  #Text editing

#Create a function to specify final text being analyzed
extract_text_before_method <- function(file_path) {
  # Extract all text from the PDF
  full_text <- paste(pdf_text(file_path), collapse = "\n")
  # Retain text up until "References"
  extracted_text <- sub("(.*?)References\\n.*", "\\1", full_text)
  # Retain text up until "Results"
  extracted_text <- sub("(.*?)Results\\n.*", "\\1", extracted_text)
  
  return(extracted_text)
}
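
To see what the two truncation steps do without needing a PDF, here is a minimal sketch on made-up strings. The helper name `truncate_sections` and the sample texts are hypothetical. It also differs from the function above in one respect: it uses `perl = TRUE` with the `(?s)` flag so that `.` explicitly matches line breaks.

```r
# Hypothetical helper: applies the two truncation steps to a plain string
truncate_sections <- function(text) {
  # (?s) lets "." match newlines; perl = TRUE is needed for that flag
  text <- sub("(?s)(.*?)References\\n.*", "\\1", text, perl = TRUE)
  # If no "Results" header is found, sub() leaves the text unchanged
  text <- sub("(?s)(.*?)Results\\n.*", "\\1", text, perl = TRUE)
  text
}

# Made-up stand-ins for an empirical paper and a theory paper
empirical <- "Abstract\nIntroduction text\nResults\nStatistics\nReferences\nCited work"
theory    <- "Abstract\nTheory text\nReferences\nCited work"

truncate_sections(empirical) # keeps the text before "Results"
truncate_sections(theory)    # keeps the text before "References"
```

Because each pattern stops at the first place where the word is followed by a line break, a sentence that happens to end a line with "Results" or "References" would trigger the cut early. It may be worth spot-checking a few extracted texts.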

# Create an empty list to fill inside the loop
pdf.files <- list()

# Loop through each file and extract the text using the function specified above
for (file in pdf_files) {
  # Named doc_text so that it does not mask pdftools::pdf_text()
  doc_text <- extract_text_before_method(file)
  # Clean data by removing new lines if still necessary
  doc_text <- gsub("\n", "", doc_text)
  pdf.files[[file]] <- doc_text
}
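
One detail of the newline-cleaning step is worth checking on your own files. The sketch below uses a made-up snippet (not text from any paper) to compare removing line breaks outright with replacing them by a space.

```r
# Made-up three-line snippet standing in for text extracted from a PDF
snippet <- "facial\nexpressions of\npride"

joined_no_space <- gsub("\n", "", snippet)  # line breaks removed outright
joined_space    <- gsub("\n", " ", snippet) # line breaks replaced by spaces

joined_no_space
joined_space
```

Removing the line breaks outright joins the last word of one line to the first word of the next. If unusually long or fused terms show up in your term list, the space variant may be worth trying.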

Now that we’ve isolated the appropriate sections of text, the next step is to process this text data by transforming it into a Term Document Matrix (TDM). A TDM is a numerical representation of the text data, enabling us to perform quantitative analysis and create a WordCloud. In the creation of our TDM, we’ll remove any unnecessary words, numbers, and punctuation, such as “the,” “it,” “who,” and statistical notations like “p < .05.” This step will ensure we have a cleaner and more focused set of terms to analyze.

library(tm) # Text mining (Term Document Matrix)

pdf.files <- TermDocumentMatrix(pdf.files, # list of extracted texts, one per PDF
                   control =
                     list(removePunctuation = TRUE, # Remove punctuation
                          stopwords = TRUE,         # Remove stopwords
                          tolower = TRUE,           # Convert to lowercase
                          stemming = FALSE,         # Retain full words
                          removeNumbers = TRUE))    # Remove numbers


print(pdf.files)

## <<TermDocumentMatrix (terms: 7176, documents: 19)>>
## Non-/sparse entries: 15127/121217
## Sparsity           : 89%
## Maximal term length: 46
## Weighting          : term frequency (tf)
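
To make the printed summary easier to read: a TDM has one row per term and one column per document, and "Non-/sparse entries" counts the non-zero and zero cells (here 15127 + 121217 = 7176 × 19). The base R sketch below builds a tiny matrix of the same shape by hand. The two sentences and the one-word stopword list are made up for illustration, and this is not how the tm package works internally.

```r
# Two made-up "documents"
docs <- c(paper_a = "The pride expression signals status.",
          paper_b = "Head tilt changes facial expression; the tilt matters.")

# Illustrative stopword list (tm ships a much longer one)
stop_words_demo <- c("the")

# Lower-case, split on anything that is not a letter, drop stopwords
tokens <- lapply(docs, function(d) {
  words <- strsplit(tolower(d), "[^a-z]+")[[1]]
  words[nzchar(words) & !(words %in% stop_words_demo)]
})

# All distinct terms, then one column of counts per document
terms <- sort(unique(unlist(tokens)))
tdm <- vapply(tokens,
              function(tok) as.integer(table(factor(tok, levels = terms))),
              integer(length(terms)))
rownames(tdm) <- terms

tdm                 # rows are terms, columns are documents
non_zero <- sum(tdm > 0)
zero     <- sum(tdm == 0)
sparsity <- zero / length(tdm) # share of empty cells
```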

Our Term Document Matrix (TDM) is large, containing more than 7,000 terms extracted from 19 documents. That is too much information for a single visualization, so our next step is to reduce it to the 100 most frequently used terms. To accomplish this, we will use the tidytext package.

library(tidytext)  # Text mining
library(tidyverse) # Data wrangling

# Clean the data
pdf.files <- tidy(pdf.files) %>%
  group_by(term) %>%                 # For each term
  summarise(count = sum(count)) %>%  # Sum its counts across all documents
  arrange(desc(count))               # And sort the terms in descending order

# Extract 100 most frequently used words
pdf.files <- head(pdf.files, 100)

print(pdf.files)

## # A tibble: 100 × 2
##    term        count
##    <chr>       <dbl>
##  1 head          469
##  2 expressions   368
##  3 emotion       291
##  4 bodily        287
##  5 tracy         243
##  6 facial        240
##  7 expression    239
##  8 dominance     228
##  9 tilt          228
## 10 study         197
## # ℹ 90 more rows
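
If you would rather avoid the tidyverse for this step, the same sum-sort-trim logic can be sketched in base R. The data frame below is made up and simply mimics the long format (one row per term per document) that the step above works with. The names `tidy_counts`, `term_totals`, and `top_terms` are illustrative.

```r
# Made-up long-format counts: one row per term per document
tidy_counts <- data.frame(
  term     = c("head", "tilt", "head", "emotion", "tilt", "head"),
  document = c("a", "a", "b", "b", "c", "c"),
  count    = c(3, 1, 2, 4, 2, 1)
)

# Sum the counts for each term across documents
term_totals <- aggregate(count ~ term, data = tidy_counts, FUN = sum)

# Sort from most to least frequent and keep the top n terms (n = 2 here)
term_totals <- term_totals[order(-term_totals$count), ]
top_terms <- head(term_totals, 2)
top_terms
```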

Create wordcloud

Now we’re ready to create a WordCloud.

(Note: The font style, rotation, family, and size of the wordcloud can be customized by modifying the code below.)

library(wordcloud2)
library(randomcoloR)

wordcloud2(pdf.files,
           fontWeight = "bold",
           rotateRatio = 0, # Share of words that are rotated (0 = none)
           size = 0.6,
           fontFamily = "Tahoma",
           color = randomColor(nrow(pdf.files), # Using the randomcoloR package
                               hue = "random",
                               luminosity = "dark"))

As a reminder, if you just want the wordcloud, check out my web app, which will automate this process for you!

Alternatively, you may wish to create a wordcloud that visualizes your research using your name and your Google Scholar profile. This approach, which requires coding in R (because the scholar package is not compatible with R Shiny), is described in more detail on my website.

If you have any questions, comments, or concerns, please get in touch with me. I’m happy to help!

Email: Zakwitkower@gmail.com
Website: www.ZakWitkower.com
LinkedIn: Zak Witkower
Twitter: @Zakwitkower

Visualizing Research Topics with a WordCloud using PDF Files

Zak Witkower

2022-12-22
