It can be challenging to explain the key aspects of your research, especially if you study a diverse range of topics. To assist with this task, I developed a step-by-step guide on how to create a wordcloud that summarizes the main themes in your published and ongoing research papers. To use this guide, all you need are PDF versions of your research papers and some basic skills with the R programming language. By following the steps outlined below, you will be able to quickly and easily communicate the breadth and depth of your research to others.

Load PDF files into R

To begin, we will locate and save the paths to two folders on our local drive. The first folder should contain PDF versions of all published research papers, while the second folder should contain PDF versions of all ongoing research papers. (Note: The specific paths should correspond to your own local drive.)

# Note: These are the file paths in my local drive. They will be different for your computer. 
Published.path<-"~/Dropbox/Data Science Portfolio/Wordcloud/My Papers/First-author publications"
InPrep.path<-"~/Dropbox/Data Science Portfolio/Wordcloud/My Papers/Research in preparation"

Next, we will identify the full path for each unique file in each folder, and create separate corpuses for the published research papers and the ongoing research papers.

library(tm) 

#List full path for all PDF files in "First-Author publications" folder
Published.file.names <- list.files(path = Published.path, #Path (from above)
                                   pattern = "\\.pdf$",   #Files ending in ".pdf"
                                   full.names = TRUE)     #Pull entire path

#List full path for all PDF files in "Research in preparation" folder
InPrep.file.names <- list.files(path = InPrep.path, 
                                pattern = "\\.pdf$", 
                                full.names = TRUE)


#Create separate corpuses for "First-Author publications" and "Research in preparation"
Published.corp <- Corpus(URISource(Published.file.names),
                         readerControl = list(reader = readPDF))

InPrep.corp <- Corpus(URISource(InPrep.file.names), 
                      readerControl = list(reader = readPDF)) 
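
Before building the corpuses, it can help to confirm that list.files() picks up only the files you expect. The sketch below is just an illustration: it uses a temporary folder with made-up, empty placeholder files (not real papers), and it assumes you want to match the ".pdf" extension regardless of case.

```r
# Hypothetical demo folder with empty placeholder files (not real PDFs)
demo.path <- file.path(tempdir(), "wordcloud-demo")
dir.create(demo.path, showWarnings = FALSE)
file.create(file.path(demo.path,
                      c("paper1.pdf", "paper2.PDF", "notes.txt", "pdf")))

# "\\.pdf$" requires a literal dot before "pdf", so the file named "pdf"
# and the text file are skipped; ignore.case = TRUE also catches ".PDF"
demo.file.names <- list.files(path = demo.path,
                              pattern = "\\.pdf$",
                              ignore.case = TRUE,
                              full.names = TRUE)

basename(demo.file.names)   #names of the files that matched
length(demo.file.names)     #how many files would enter the corpus
```

If the count is lower than the number of papers in your folder, check the file extensions before moving on.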

Identify and select terms

With the two corpuses established, the next step is to convert each corpus into a Term Document Matrix (TDM). A TDM outlines the frequency of all terms present in each document. After creating the TDM for all relevant terms, we will remove any unnecessary words, numbers, and punctuation (e.g., “the,” “it,” “who,” and “p < .05”). This will provide us with a cleaner and more focused set of terms to work with.

#Note: Here I create a function to generate a TDM from each corpus
extract.TDM<-function(data){
  TermDocumentMatrix(data, # data run through the function will go here.
                     control = 
                       list(removePunctuation = TRUE, #Remove punctuation
                            stopwords = TRUE,         #Remove stopwords 
                            tolower = TRUE,           #Converts to lowercase
                            stemming = FALSE,         #Retain full words
                            removeNumbers = TRUE))    #Remove numbers
}

#Create Term Document Matrix (TDM) for each corpus
Published.tdm<- extract.TDM(Published.corp) 
InPrep.tdm <- extract.TDM(InPrep.corp)

print(Published.tdm)
## <<TermDocumentMatrix (terms: 6747, documents: 13)>>
## Non-/sparse entries: 16535/71176
## Sparsity           : 81%
## Maximal term length: 48
## Weighting          : term frequency (tf)
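
To read the printed summary, it helps to redo the arithmetic by hand. The short check below only uses the numbers shown in the output above: every term-by-document cell is either empty (sparse) or not, and the sparsity is the share of empty cells.

```r
# Numbers copied from the printed summary above
n.terms    <- 6747
n.docs     <- 13
non.sparse <- 16535   #cells where a term appears at least once
sparse     <- 71176   #cells where a term never appears in that document

total.cells <- n.terms * n.docs            #one cell per term-document pair
total.cells == non.sparse + sparse         #TRUE
round(100 * sparse / total.cells)          #81 (percent)
```

In other words, most terms appear in only a few of the 13 papers, which is one reason to focus on the most frequent terms in the next step.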

It’s worth noting that our TDM for published papers includes a large number of terms (nearly 7,000) extracted from 13 unique documents. While this may provide a detailed overview of our research, it is too much information to include in a single image. To create a more manageable and visually appealing wordcloud, we will use the tidytext package to distill this list down to the terms that are used most frequently. Specifically, I will convert each TDM into a dataframe that lists the key terms and their total frequencies across all documents, for both published and ongoing research papers, and exclude any terms that occur 100 times or fewer. This will allow us to effectively convey the main themes of our research without overwhelming the viewer.

library(tidytext) #text mining
library(tidyverse) #data wrangling

#Wrap data cleaning in a function so it can be reused for each corpus
cleaning<-function(data){
  tidy(data) %>%                      #transform with tidytext
    group_by(term) %>%                #for each term...
    summarise(count = sum(count)) %>% #sum term frequency across documents
    filter(count > 100) %>%           #keep terms that occur more than 100 times
    arrange(desc(count)) %>%          #sort based on word frequency 
    ungroup()                         #good habit to "ungroup" :) 
}

#Clean each TDM using the function generated above
Published.tdm<-cleaning(Published.tdm)
InPrep.tdm<-cleaning(InPrep.tdm)

#Add a new column to each TDM with its origin
Published.tdm$Document.type<-"Published"
InPrep.tdm$Document.type<-"in preparation"

#now we bind these two TDMs together.
Final.data<-rbind(Published.tdm,
                  InPrep.tdm)

head(Final.data, 5)
## # A tibble: 5 × 3
##   term        count Document.type
##   <chr>       <dbl> <chr>        
## 1 head          689 Published    
## 2 emotion       680 Published    
## 3 expressions   642 Published    
## 4 dominance     587 Published    
## 5 prestige      438 Published
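
To see what the cleaning step does, here is a small sketch on made-up counts (the terms, file names, and numbers are invented for illustration). It assumes the long format that tidy() returns for a TDM, with one row per term per document.

```r
library(dplyr)

# Made-up counts: one row per term per document
toy <- tibble(term     = c("emotion", "emotion", "pride", "pride", "method"),
              document = c("a.pdf", "b.pdf", "a.pdf", "b.pdf", "a.pdf"),
              count    = c(90, 40, 30, 90, 15))

toy.summary <- toy %>%
  group_by(term) %>%                 #for each term...
  summarise(count = sum(count)) %>%  #add up counts across documents
  filter(count > 100) %>%            #keep terms used more than 100 times
  arrange(desc(count))               #most frequent first

toy.summary
```

Here "emotion" (90 + 40 = 130) and "pride" (30 + 90 = 120) pass the threshold, while "method" (15) is dropped, even though no single document uses either surviving term more than 100 times.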

Create wordcloud

Now that we have our final TDM, it’s time to create the wordcloud. (Note: The font style, rotation, family, and size of the wordcloud can be customized by modifying the code below.)

library(wordcloud2)

wordcloud2(Final.data, 
           color = "black",      #color of wordcloud
           fontWeight = "bold",  #bold all items
           rotateRatio = 0,      #share of words to rotate (0 = none)
           size=.30,             #size of wordcloud
           fontFamily = "Times") #font of wordcloud   

To identify published and ongoing research, we will use different colors for the corresponding terms in the wordcloud. (Note: The font colors for published and ongoing research can be customized by modifying the code below.)

library(randomcoloR) #Retrieve color palettes

#count the number of terms that are "published" and "in prep" in our final TDM
n.Published.terms<-length(Final.data$term[Final.data$Document.type == "Published"]) 
n.InPrep.terms<-length(Final.data$term[Final.data$Document.type == "in preparation"])

#create a vector of colors with one entry per term, in the same row order as Final.data
pub.colors<-randomColor(n.Published.terms, 
                        hue = "blue", 
                        luminosity = "dark")  #random dark blue colors
prep.colors<-rep("#bebebe", n.InPrep.terms)   #grey
colorvector<-c(pub.colors, prep.colors)       #published rows come first

final.wordcloud<-wordcloud2(Final.data, 
           color = colorvector,   #Use the color vector created above
           fontWeight = "bold",   
           rotateRatio = 0,       
           size=.30,              
           fontFamily = "Times") 
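
The color vector lines up with the terms only because all published rows come before the in-preparation rows in Final.data. If you ever re-sort the data, one option is to derive each row's color from its own Document.type instead. The sketch below uses made-up toy data and a placeholder blue, so treat it as an illustration rather than a drop-in replacement.

```r
# Made-up data with the two document types deliberately interleaved
toy.data <- data.frame(term  = c("emotion", "pride", "status", "posture"),
                       count = c(680, 438, 150, 120),
                       Document.type = c("Published", "in preparation",
                                         "Published", "in preparation"))

# One color per row, chosen from that row's document type
toy.colors <- ifelse(toy.data$Document.type == "Published",
                     "#1f3a93",   #placeholder dark blue
                     "#bebebe")   #grey

toy.colors
```

Because each color is tied to its row, the result stays correct no matter how the rows are ordered.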

Unfortunately, wordclouds are output as an “htmlwidget,” which can be challenging to annotate and edit. To edit and annotate our wordcloud more easily, we will convert it to a PNG image. To do this, I will create a new folder on my computer, set the working directory to that folder, and export the htmlwidget as a PNG file into the folder. This will allow us to easily edit and annotate the image later.

#Set working directory to the new folder I created
setwd("~/Dropbox/Data Science Portfolio/Wordcloud/Final Wordcloud/")

#save html widget 
htmlwidgets::saveWidget(final.wordcloud, "tmp.html", selfcontained = FALSE,
                        knitrOptions = list(results = "hide")) 
#screenshot html widget to PNG
webshot::webshot("tmp.html","Wordcloud_only.png", 
                 vwidth = 1195.2, vheight = 1046.4, delay = 5) 

Now we can store the path to our PNG image so that it can be drawn onto a plot.

img<-"Wordcloud_only.png"

Annotate wordcloud using ggplot

To distinguish between terms from published and ongoing research, we will determine the mathematical average of the colors used to represent these terms in the wordcloud. These average colors will be incorporated into a legend that indicates whether a term corresponds to published or ongoing research.

#Extract the average color used for terms in each TDM
library(devtools)
# install_github("BenaroyaResearch/miscHelpers")
library("miscHelpers") 

#This function calculates the average color from a vector of colors
average.published.color<-average_colors(pub.colors)
average.prep.color<-average_colors(prep.colors)
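
If you would rather not install a GitHub package, a rough equivalent can be written in base R. The helper below is hypothetical (the name average.hex is my own) and assumes a simple per-channel average in RGB space; the miscHelpers function may compute its average differently.

```r
# Hypothetical helper, not the miscHelpers implementation
average.hex <- function(colors){
  rgb.matrix <- col2rgb(colors)             #3 x n matrix: red, green, blue (0-255)
  mean.rgb   <- round(rowMeans(rgb.matrix)) #average each channel separately
  rgb(mean.rgb[["red"]], mean.rgb[["green"]], mean.rgb[["blue"]],
      maxColorValue = 255)                  #convert back to a hex string
}

average.hex(c("#000000", "#FEFEFE"))  #midpoint between black and near-white
average.hex(rep("#bebebe", 5))        #identical greys average to the same grey
```

For the in-preparation terms, which all share one grey, the average is simply that grey; the averaging only matters for the randomly generated blues.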

Now, let’s add a custom legend and a title to the wordcloud.

library(cowplot) #for draw_image()

final.plot<-ggplot(data.frame(x = 0:10, y = 0:10), aes(x, y)) + geom_blank() +  #blank canvas (qplot() is deprecated)
  draw_image(img, x = 0, y = 0, width = 10.5, height = 10.5) + # paste wordcloud
  scale_y_continuous(limits = c(0, 10), expand = c(0,0)) +
  scale_x_continuous(limits = c(0, NA), expand = c(0,0)) +
  theme_void()+  
  annotate(geom = "label",          #create first annotation layer
           label = "Published",     #what the label will actually say
           x = 3.5, y = 1.25,      
           color = average.published.color, #average color identified above
           size = 8, family = "Times",fontface = 'bold.italic') +
  annotate(geom = "label",          #create second annotation layer
           label = "In Prep",
           x = 6.5, y = 1.25, 
           color = average.prep.color, #average color identified above
           size = 8,family = "Times", fontface = 'bold.italic') + 
  annotate(geom = "text",           #create third annotation layer (title)
           label = "What do I research?",
           x = 5, y = 9.5, 
           color = "black", size = 12, family = "Times", fontface = 'bold') 

Our final wordcloud

Here is the final version of our wordcloud!

print(final.plot)

If you have any questions, comments, or concerns, please get in touch with me. I’m happy to help:
Email: Zakwitkower@gmail.com
Website: www.ZakWitkower.com
LinkedIn: Zak Witkower
Twitter: @Zakwitkower