It can be challenging to explain the key aspects of your research, especially if you study a diverse range of topics. To assist with this task, I developed a step-by-step guide on how to create a wordcloud that summarizes the main themes in your published and ongoing research papers. To use this guide, all you will need is the PDF versions of your research papers, and some basic skills with the R programming language. By following the steps outlined below, you will be able to quickly and easily communicate the breadth and depth of your research to others.
To begin, we will locate and save the paths to two folders on our local drive. The first folder should contain PDF versions of all published research papers, while the second folder should contain PDF versions of all ongoing research papers. (Note: The specific paths should correspond to folders on your own local drive.)
# Note: These are the file paths in my local drive. They will be different for your computer.
Published.path <- "~/Dropbox/Data Science Portfolio/Wordcloud/My Papers/First-author publications"
InPrep.path <- "~/Dropbox/Data Science Portfolio/Wordcloud/My Papers/Research in preparation"
Next, we will identify the full path for each unique file in each folder, and create separate corpuses for the published research papers and the ongoing research papers.
library(tm)
#List full path for all PDF files in "First-Author publications" folder
Published.file.names <- list.files(path = Published.path, #Path (from above)
                                   pattern = "pdf$",      #All PDF files
                                   full.names = TRUE)     #Pull entire path
#List full path for all PDF files in "Research in preparation" folder
InPrep.file.names <- list.files(path = InPrep.path,
                                pattern = "pdf$",
                                full.names = TRUE)
#Create separate corpuses for "First-Author publications" and "Research in preparation"
Published.corp <- Corpus(URISource(Published.file.names),
                         readerControl = list(reader = readPDF))
InPrep.corp <- Corpus(URISource(InPrep.file.names),
                      readerControl = list(reader = readPDF))
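One detail worth checking before building the corpuses: the `pattern` argument of `list.files()` is a regular expression, so `"pdf$"` matches any file name that ends in the letters "pdf" and skips files with an uppercase `.PDF` extension. The sketch below uses a temporary folder and made-up file names (all hypothetical) to compare it with a stricter pattern. Treat it as an optional variation rather than a required change.

```r
# Build a temporary folder with a few empty, hypothetical files
demo.dir <- file.path(tempdir(), "wordcloud-demo")
dir.create(demo.dir, showWarnings = FALSE)
demo.files <- c("paper1.pdf", "paper2.PDF", "notes.txt", "draftpdf")
invisible(file.create(file.path(demo.dir, demo.files)))

# Pattern used in this guide: any name ending in "pdf" (case-sensitive)
loose.match <- list.files(path = demo.dir, pattern = "pdf$")

# Stricter pattern: a literal ".pdf" extension, upper or lower case
strict.match <- list.files(path = demo.dir, pattern = "\\.pdf$",
                           ignore.case = TRUE)

print(loose.match)
print(strict.match)
```

If your folders only contain files named like `something.pdf`, both patterns return the same list.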
With the two corpuses established, the next step is to convert each corpus into a Term Document Matrix (TDM). A TDM records how often each term appears in each document. While building each TDM, we will strip out content that carries no thematic meaning, such as stopwords, numbers, and punctuation (e.g., “the,” “it,” “who,” and “p < .05”). This will provide us with a cleaner and more focused set of terms to work with.
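To make the idea of a TDM concrete, here is a minimal base R sketch that builds one by hand from two made-up sentences. The text, the `tokenize` helper, and the object names are all hypothetical, and this is not how the tm package builds its matrix internally. It only shows the shape of the result: one row per term, one column per document, and a frequency in each cell.

```r
# Two made-up "documents" (hypothetical text, for illustration only)
docs <- c(doc1 = "Pride and shame are emotions",
          doc2 = "Pride predicts status; status predicts pride")

# Lowercase the text and split it into words on anything that is not a letter
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words[nzchar(words)]
}
tokens <- lapply(docs, tokenize)

# Rows are terms, columns are documents, cells are term frequencies
terms <- sort(unique(unlist(tokens)))
toy.tdm <- vapply(tokens,
                  function(words) as.integer(table(factor(words, levels = terms))),
                  integer(length(terms)))
rownames(toy.tdm) <- terms

print(toy.tdm)
```

Notice that filler words such as "and" and "are" still get their own rows in this toy version. Removing rows like these is the job of the stopword option in the real code.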
#Note: Here I create a function to generate a TDM from each corpus
extract.TDM <- function(data){
  TermDocumentMatrix(data, # data run through the function will go here.
                     control =
                       list(removePunctuation = TRUE, #Remove punctuation
                            stopwords = TRUE,         #Remove stopwords
                            tolower = TRUE,           #Converts to lowercase
                            stemming = FALSE,         #Retain full words
                            removeNumbers = TRUE))    #Remove numbers
}
#Create Term Document Matrix (TDM) for each corpus
Published.tdm <- extract.TDM(Published.corp)
InPrep.tdm <- extract.TDM(InPrep.corp)
print(Published.tdm)
## <<TermDocumentMatrix (terms: 6747, documents: 13)>>
## Non-/sparse entries: 16535/71176
## Sparsity : 81%
## Maximal term length: 48
## Weighting : term frequency (tf)
It’s worth noting that our TDM for published papers includes a large number of terms (nearly 7,000) extracted from 13 unique documents. While this may provide a detailed overview of our research, it is too much information to include in a single image. To create a more manageable and visually appealing visualization, we will use the tidytext package to distill this list down to the terms that are used most frequently. Specifically, I will convert each TDM into a dataframe that lists the key terms and their total frequencies across all documents, separately for published and ongoing research papers, and keep only the terms that occur more than 100 times. This will allow us to effectively convey the main themes of our research without overwhelming the viewer.
library(tidytext) #text mining
library(tidyverse) #data wrangling
#Wrap data cleaning in function to parsimoniously execute with each corpus
cleaning <- function(data){
  tidy(data) %>%                       #transform with tidytext
    group_by(term) %>%                 #for each term...
    summarise(count = sum(count)) %>%  #sum term frequency across documents
    subset(count > 100) %>%            #keep terms that occur more than 100 times
    arrange(desc(count)) %>%           #sort based on word frequency
    ungroup()                          #good habit to "ungroup" :)
}
#Clean each TDM using the function generated above
Published.tdm <- cleaning(Published.tdm)
InPrep.tdm <- cleaning(InPrep.tdm)
#Add a new column to each TDM with its origin
Published.tdm$Document.type <- "Published"
InPrep.tdm$Document.type <- "in preparation"
#now we bind these two TDMs together.
Final.data <- rbind(Published.tdm,
                    InPrep.tdm)
head(Final.data, 5)
## # A tibble: 5 × 3
## term count Document.type
## <chr> <dbl> <chr>
## 1 head 689 Published
## 2 emotion 680 Published
## 3 expressions 642 Published
## 4 dominance 587 Published
## 5 prestige 438 Published
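If the cleaning step feels opaque, the same logic can be traced on a tiny made-up table using only base R. The sketch below assumes a hypothetical long-format table with one row per term per document (the shape that tidying a TDM produces) and a toy threshold of 1 instead of 100. It is an illustration of the sum-then-filter idea, not a replacement for the tidyverse code above.

```r
# Hypothetical term counts: one row per term per document
toy.counts <- data.frame(
  term     = c("emotion", "emotion", "pride", "pride", "method"),
  document = c("paper1.pdf", "paper2.pdf", "paper1.pdf", "paper2.pdf", "paper1.pdf"),
  count    = c(3, 2, 1, 1, 1)
)

# Sum each term's frequency across all documents
toy.totals <- aggregate(count ~ term, data = toy.counts, FUN = sum)

# Keep only terms above the (toy) threshold, then sort by frequency
toy.totals <- toy.totals[toy.totals$count > 1, ]
toy.totals <- toy.totals[order(-toy.totals$count), ]

print(toy.totals)
```

Here "method" appears only once in total, so it is dropped, while "emotion" and "pride" survive and are listed from most to least frequent.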
Now that we have our final TDM, it’s time to create the wordcloud. (Note: The font style, rotation, family, and size of the wordcloud can be customized by modifying the code below.)
library(wordcloud2)
wordcloud2(Final.data,
           color = "black",       #color of wordcloud
           fontWeight = "bold",   #bold all items
           rotateRatio = 0,       #proportion of words to rotate (0 = none)
           size = .30,            #size of wordcloud
           fontFamily = "Times")  #font of wordcloud
To distinguish published from ongoing research, we will use different colors for the corresponding terms in the wordcloud. (Note: The font colors for published and ongoing research can be customized by modifying the code below.)
library(randomcoloR) #Retrieve color palettes
#count the number of terms that are "published" and "in prep" in our final TDM
n.Published.terms <- length(Final.data$term[Final.data$Document.type == "Published"])
n.InPrep.terms <- length(Final.data$term[Final.data$Document.type == "in preparation"])
#create a vector of colors, using the number of terms from "published" and "in prep" TDM
pub.colors <- randomColor(n.Published.terms,
                          hue = "blue",
                          luminosity = "dark")    #random dark blue colors
prep.colors <- rep("#bebebe", n.InPrep.terms)     #grey
colorvector <- c(pub.colors, prep.colors)         #published rows come first in Final.data
final.wordcloud <- wordcloud2(Final.data,
                              color = colorvector, #Use the color vector created above
                              fontWeight = "bold",
                              rotateRatio = 0,
                              size = .30,
                              fontFamily = "Times")
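The color vector above works because the published rows come before the in-preparation rows in the combined data, so the colors line up with the terms by position. If you ever reorder the rows, a safer pattern (as far as I understand how wordcloud2 pairs a color vector with rows) is to derive each color directly from the `Document.type` column. The sketch below uses a made-up four-row table and a fixed blue instead of random blues. Both are assumptions for illustration.

```r
# Hypothetical stand-in for the combined data, deliberately shuffled
toy.data <- data.frame(
  term          = c("status", "head", "pride", "emotion"),
  count         = c(150, 689, 120, 680),
  Document.type = c("in preparation", "Published", "in preparation", "Published")
)

# One color per row, chosen from that row's own document type
toy.colors <- ifelse(toy.data$Document.type == "Published",
                     "#1f3b73",   # fixed dark blue (assumed stand-in for random blues)
                     "#bebebe")   # grey

print(data.frame(term = toy.data$term, color = toy.colors))

# The vector could then be passed along as: wordcloud2(toy.data, color = toy.colors)
```

Because each color is computed from its own row, the mapping stays correct no matter how the rows are sorted.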
Unfortunately, wordclouds are output as an “htmlwidget,” which can be challenging to annotate and edit. To edit and annotate our wordcloud more easily, we will convert it to a PNG image. To do this, I will create a new folder on my computer, set the working directory to that folder, and export the htmlwidget as a PNG file into the folder. This will allow us to easily edit and annotate the image later.
#Set working directory to the new folder I created
setwd("~/Dropbox/Data Science Portfolio/Wordcloud/Final Wordcloud/")
#save html widget
htmlwidgets::saveWidget(final.wordcloud, "tmp.html", selfcontained = FALSE,
                        knitrOptions = list(results = "hide"))
#screenshot html widget to PNG
webshot::webshot("tmp.html", "Wordcloud_only.png",
                 vwidth = 1195.2, vheight = 1046.4, delay = 5)
Now we can store the path to our PNG image so that it can be drawn back into a plot in R.
img <- "Wordcloud_only.png"
To distinguish between terms from published and ongoing research, we will determine the mathematical average of the colors used to represent these terms in the wordcloud. These average colors will be incorporated into a legend that indicates whether a term corresponds to published or ongoing research.
#Extract the average color used for terms in each TDM
library(devtools)
# install_github("BenaroyaResearch/miscHelpers")
library("miscHelpers")
#This function calculates the average color from a vector of colors
average.published.color <- average_colors(pub.colors)
average.prep.color <- average_colors(prep.colors)
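For readers who would rather not install a package from GitHub, the idea of an "average color" can be approximated in base R by averaging the red, green, and blue channels separately. The helper below, `average.hex`, is a hypothetical name of my own and is not the miscHelpers implementation, which may compute its average differently.

```r
# Average a vector of colors channel by channel (hypothetical helper)
average.hex <- function(colors) {
  rgb.values <- col2rgb(colors)             # 3 x n matrix of red, green, blue (0-255)
  mean.rgb <- round(rowMeans(rgb.values))   # average each channel across the colors
  rgb(mean.rgb[["red"]], mean.rgb[["green"]], mean.rgb[["blue"]],
      maxColorValue = 255)
}

# Black and a near-pure blue average to a mid-dark blue
print(average.hex(c("#000000", "#0000FE")))

# Averaging identical greys returns the same grey
print(average.hex(rep("#bebebe", 3)))
```

A vector of identical colors, like the grey used for in-preparation terms, averages to itself, so the legend label will match those terms exactly.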
Now, let’s add a custom legend and a title to the wordcloud.
library(cowplot) #for draw_image()
final.plot <- qplot(0:10, 0:10, geom="blank") + #create quick plot
draw_image(img, x = 0, y = 0, width = 10.5, height = 10.5) + # paste wordcloud
scale_y_continuous(limits = c(0, 10), expand = c(0,0)) +
scale_x_continuous(limits = c(0, NA), expand = c(0,0)) +
theme_void()+
annotate(geom = "label", #create first annotation layer
label = "Published", #what the label will actually say
x = 3.5, y = 1.25,
color = average.published.color, #average color identified above
size = 8, family = "Times",fontface = 'bold.italic') +
annotate(geom = "label", #create second annotation layer
label = "In Prep",
x = 6.5, y = 1.25,
color = average.prep.color, #average color identified above
size = 8,family = "Times", fontface = 'bold.italic') +
annotate(geom = "text", #create third annotation layer (title)
label = "What do I research?",
x = 5, y = 9.5,
color = "black", size = 12, family = "Times", fontface = 'bold')
Here is the final version of our wordcloud!
print(final.plot)
If you have any questions, comments, or concerns, please get in touch with me. I’m happy to help:
Email: Zakwitkower@gmail.com
Website: www.ZakWitkower.com
LinkedIn: Zak Witkower
Twitter: @Zakwitkower