• Load abstracts from Google Scholar
• Transform and clean data
• Create wordcloud
• Limitations
• Wrapping everything in a function

This guide explains how to visualize key themes from a Google Scholar profile using a wordcloud. The tutorial is designed to be accessible to users with only a basic understanding of the R programming language.

(The entire code for this tutorial is consolidated into a single function and presented at the end of this document. If you’re only interested in generating the wordcloud, and don’t care about the procedure, feel free to skip to the end of this document.)

Load abstracts from Google Scholar

First, we need to specify a Google Scholar profile using a researcher’s name. To do this, assign an author’s first and last name to the objects “FirstName” and “LastName” (respectively). For the purposes of this tutorial, I’ll use my own name.

FirstName<-"Zak"  
LastName<-"Witkower"

Next, retrieve all items indexed by Google Scholar (e.g., publications) for the specified researcher, using the scholar package.

library(scholar)
library(tibble) #for data cleaning

# Retrieve unique Google Scholar identifier using first and last name objects created above. 
IDnumber<-get_scholar_id(first_name = FirstName, 
                         last_name = LastName)

# Generate list of articles indexed in Google Scholar
Mypublications<-tibble(get_publications(IDnumber))

The resulting dataframe has one row for each unique item indexed by Google Scholar for the specified researcher, and one column for each variable or attribute (e.g., title, journal name, number of citations, etc.).

head(Mypublications, 10)
## # A tibble: 10 × 8
##    title                           author journal number cites  year cid   pubid
##    <chr>                           <chr>  <chr>   <chr>  <dbl> <dbl> <chr> <chr>
##  1 Two signals of social rank: Pr… Z Wit… Journa… 118 (…   148  2020 1283… qjMa…
##  2 Bodily communication of emotio… Z Wit… Emotio… 11 (2…   125  2019 1582… UeHW…
##  3 Predicting cyberbullying perpe… C Bar… Aggres… 43 (2…   102  2017 5427… 2osO…
##  4 The evolution of pride and soc… JL Tr… Advanc… 62, 5…    85  2020 7726… hqOj…
##  5 A facial-action imposter: How … Z Wit… Psycho… 30 (6…    47  2019 3325… W7OE…
##  6 The psychological structure, s… E Mer… Curren… 39, 1…    39  2021 8498… ULOm…
##  7 How affect shapes status: Dist… Z Wit… Curren… 33, 1…    37  2020 1574… YsMS…
##  8 Breaking the link between prov… C Bar… Aggres… 42 (6…    19  2016 5389… 9yKS…
##  9 Evidence for distinct facial s… JD Ma… Affect… 2, 14…    18  2021 1827… KlAt…
## 10 Beyond face value: Evidence fo… Z Wit… Affect… 2 (3)…    17  2021 1677… _kc_…
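
Because the publication list is an ordinary dataframe, it can be subset before any abstracts are requested, which also reduces the number of queries sent to Google Scholar. The sketch below shows one possible way to do this. It uses a small made-up dataframe rather than real output, the `min_year` cutoff is purely hypothetical, and it assumes that items without a venue have an empty or missing `journal` field.

```r
# Toy stand-in for Mypublications; these rows are invented for illustration.
toy_pubs <- data.frame(
  title   = c("Paper A", "Paper B", "Supplement C"),
  journal = c("Journal X", "Journal Y", ""),
  year    = c(2015, 2021, 2021),
  stringsAsFactors = FALSE
)

min_year <- 2018  # hypothetical cutoff, not part of the original tutorial

# Keep rows that have a non-empty journal name and are recent enough
has_journal   <- !is.na(toy_pubs$journal) & nzchar(toy_pubs$journal)
is_recent     <- !is.na(toy_pubs$year) & toy_pubs$year >= min_year
filtered_pubs <- toy_pubs[has_journal & is_recent, ]

print(filtered_pubs$title)
```

The same logical indexing could be applied to `Mypublications` itself, using whichever columns matter for your purposes.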

Next, we will retrieve the abstract for each publication using a loop, and append each abstract to our dataframe.

# initialize new column to be filled with abstracts
# (an empty string means no abstract was found, so it adds no words to the wordcloud)
Mypublications$abstract<-rep("", nrow(Mypublications))

# "for" loop to add the abstract for each publication
for (i in seq_len(nrow(Mypublications))) {
  Abstracts <- get_publication_abstract(id = IDnumber, 
                                        pub_id = Mypublications$pubid[i])
  
  # Only store the abstract if one was returned; this avoids the
  # "replacement has length zero" error when no abstract can be found
  if (length(Abstracts) > 0) {
    Mypublications$abstract[i] <- paste(Abstracts, collapse = " ")
  }
}

Transform and clean data

Now we are ready to turn our dataframe into a corpus, which we will later use to create a Term Document Matrix (TDM). A TDM is a numerical representation of the text data that will allow us to perform quantitative analysis and create a wordcloud.

library(tm)
library(tidytext)
library(tidyverse)

#create corpus from abstracts
MyAbstracts.corpus<-VCorpus(VectorSource(Mypublications$abstract))

#create Term Document Matrix (TDM) from corpus
MyAbstracts.TDM<-TermDocumentMatrix(MyAbstracts.corpus, 
                                    control =  #Data editing:
                                      list(removePunctuation = TRUE, 
                                           stopwords = TRUE, 
                                           tolower = TRUE,
                                           stemming = FALSE, 
                                           removeNumbers = TRUE))
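
To make the idea of a Term Document Matrix concrete, here is a minimal base R sketch that builds one by hand for two tiny, invented "documents". It does not use tm and skips the cleaning options above (stopwords, punctuation, numbers); it only illustrates the layout, with one row per term and one column per document.

```r
# Two tiny "documents" (invented text, for illustration only)
docs <- c(doc1 = "pride signals status",
          doc2 = "status shapes emotion and status")

# Split each document into lowercase words
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Collect the vocabulary, then count each term within each document
vocab   <- sort(unique(unlist(tokens)))
toy_tdm <- sapply(tokens, function(words) table(factor(words, levels = vocab)))

print(toy_tdm)
```

Each cell holds the number of times a term appears in a document, so "status" has a count of 2 in the second column and "pride" has a count of 0 there.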

Let’s identify the 100 most frequently used terms. I do this by summing the frequency of each unique term across all documents.

tidy.TDM<-tidy(MyAbstracts.TDM) %>% 
  group_by(term) %>% 
  summarise(count = sum(count)) %>% #sum frequency of each term across documents
  arrange(desc(count)) #rearrange based on frequency of term

#only retain the top 100 most frequently used terms
tidy.TDM<-head(tidy.TDM, 100) 

head(tidy.TDM)
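
For readers who want to see the summing step in isolation, the sketch below reproduces the same logic in base R on a small invented table of counts (one row per term per document). The data and the `top_n_terms` value are hypothetical; the tutorial itself keeps the top 100.

```r
# Invented long-format counts: one row per term per document
toy_tidy <- data.frame(
  term     = c("status", "pride", "status", "emotion"),
  document = c("1", "1", "2", "2"),
  count    = c(1, 1, 2, 1),
  stringsAsFactors = FALSE
)

# Sum the counts for each term across all documents
term_totals <- aggregate(count ~ term, data = toy_tidy, FUN = sum)

# Sort from most to least frequent and keep the top n terms
top_n_terms <- 2  # hypothetical; the tutorial keeps 100
term_totals <- term_totals[order(-term_totals$count), ]
top_terms   <- head(term_totals, top_n_terms)

print(top_terms)
```

Here "status" appears once in document 1 and twice in document 2, so its summed count is 3 and it sorts to the top.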

Create wordcloud

Finally, let’s create a wordcloud.

(Note: You can adjust the aesthetics of the wordcloud using the code below.)

library(wordcloud2)
library(randomcoloR) 

#create wordcloud
wordcloud2(tidy.TDM, 
           color = randomColor(nrow(tidy.TDM), #create color palette with randomcoloR
                               hue = "random", 
                               luminosity = "dark"),
           fontWeight = "bold",  #bold all items
           rotateRatio = 0,      #proportion of words to rotate (0 = none)
           size=.75,             #font size scaling (larger = bigger words)
           fontFamily = "Times") #font of wordcloud   
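
The wordcloud produced by wordcloud2 is an HTML widget, so one way to keep a copy is to write it to an HTML file with the htmlwidgets package. The following is a minimal sketch under that assumption. The term frequencies are invented, and the output file name and location are hypothetical.

```r
library(wordcloud2)
library(htmlwidgets)

# Invented term frequencies; in the tutorial this would be tidy.TDM
toy_terms <- data.frame(
  word = c("status", "pride", "emotion", "signal"),
  freq = c(12, 9, 7, 4)
)

cloud <- wordcloud2(toy_terms, size = 0.75, rotateRatio = 0)

# Write the widget to an HTML file (file name is hypothetical).
# selfcontained = FALSE avoids needing pandoc, but writes a folder of
# supporting files next to the HTML file.
out_file <- file.path(tempdir(), "wordcloud.html")
saveWidget(cloud, file = out_file, selfcontained = FALSE)
```

Opening the saved file in a web browser should display the same interactive wordcloud.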

Limitations

If the specified author includes their middle initial in their Google Scholar profile, or has three names, you may encounter problems generating their wordcloud. There are a few possible solutions:
* Input their first and last name exclusively (e.g., input “Jessica L. Tracy” as “Jessica Tracy”, or “Gerben van Kleef” as “Gerben Kleef”)
* Input their first and middle initials as their first name (e.g., input “Friedrich M. Götz” as “FM Götz”)
* If all else fails, you can retrieve a researcher’s Google Scholar ID manually from the URL of their Google Scholar page, instead of retrieving it based on their first and last name (i.e., the method used here). Look for the text “user=” embedded within the URL, and extract the 12 characters that follow. You can assign their ID as a character string to the object “IDnumber”, and skip the first step of retrieving the Google Scholar ID with the “get_scholar_id” function.
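
The manual approach in the last bullet can also be scripted. The sketch below is one possible way to pull the ID out of a profile URL with a regular expression. The helper name `extract_scholar_id` and the URL are hypothetical, with "XXXXXXXXXXXX" standing in for a real ID.

```r
# Hypothetical profile URL; "XXXXXXXXXXXX" stands in for a real ID
profile_url <- "https://scholar.google.com/citations?user=XXXXXXXXXXXX&hl=en"

# Capture everything after "user=" up to the next "&" or "#"
extract_scholar_id <- function(url) {
  match <- regmatches(url, regexec("[?&]user=([^&#]+)", url))[[1]]
  if (length(match) < 2) return(NA_character_)  # no "user=" parameter found
  match[2]
}

IDnumber <- extract_scholar_id(profile_url)
print(IDnumber)
```

The resulting string can be used in place of the value returned by get_scholar_id.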

This method relies heavily on querying Google Scholar, which may introduce some complications. For example:
* An internet connection is required
* The specified researcher must have an active Google Scholar profile
* The method will extract all files indexed by Google Scholar for the specified profile, which may include non-peer-reviewed files or supplemental materials
* Rate limits imposed by Google Scholar may restrict the total number and speed of your requests. If you try to generate a wordcloud for a particularly productive researcher, which requires you to request all of the publications and abstracts of that author, you may exceed Google’s rate limit (i.e., too many requests in a short period of time), resulting in an error.
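
One way to soften the rate-limit problem is to pause between requests and to keep going when a single request fails. The sketch below illustrates that pattern with a hypothetical helper, `fetch_with_pause`, and a fake fetch function so that it runs offline. The pause length is a guess; I don’t know of a documented safe value.

```r
# Call fetch_fun once per id, pausing between calls and returning NA
# for any call that fails or comes back empty.
fetch_with_pause <- function(ids, fetch_fun, pause_seconds = 2) {
  results <- rep(NA_character_, length(ids))
  for (i in seq_along(ids)) {
    value <- tryCatch(fetch_fun(ids[i]), error = function(e) NA_character_)
    if (length(value) > 0 && !is.na(value[1])) {
      results[i] <- as.character(value[1])
    }
    if (i < length(ids)) Sys.sleep(pause_seconds)  # wait before the next request
  }
  results
}

# Stand-in for a real request so the sketch runs without a connection
fake_fetch <- function(id) {
  if (id == "bad") stop("simulated request failure")
  paste("abstract for", id)
}

demo <- fetch_with_pause(c("a", "bad", "b"), fake_fetch, pause_seconds = 0)
print(demo)
```

In the tutorial, `fetch_fun` could be a small wrapper such as `function(p) get_publication_abstract(id = IDnumber, pub_id = p)`, applied to `Mypublications$pubid`.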

Alternatively, you may wish to create a wordcloud that visualizes your research from PDF files saved on your local drive. This approach is described in more detail on my website.

If you have any questions, comments, or concerns, please get in touch with me. I’m happy to help!

Email: Zakwitkower@gmail.com
Website: www.ZakWitkower.com
LinkedIn: Zak Witkower
Twitter: @Zakwitkower

Wrapping everything in a function

For simplicity, I’ve consolidated all the steps into a single function that automates the creation of a wordcloud using just your first and last name. Enter the first and last name of the researcher whose research you want to visualize at the top of the code chunk (assigned to objects “FirstName” and “LastName”), and then run the rest of the code. The function will handle all additional steps, and output a wordcloud.

#####################################################################
####### Please input the first and last name of a researcher ########
#####################################################################

FirstName<-"Zak"  
LastName<-"Witkower"

########################################################################################
#### After assigning a name above, run the rest of the code to create the function: ####
########################################################################################

#load libraries
require(scholar)
require(tibble) 
require(tm)
require(tidytext)
require(tidyverse)
require(wordcloud2)
require(randomcoloR) 

#Create function
GoogleScholarWordCloud<-function(first,last){
  
  #Get ID number from name
  IDnumber<-get_scholar_id(first_name = first, 
                           last_name = last)
  
  #Get publications
  Mypublications<-tibble(get_publications(IDnumber))
  
  #Add abstracts
  Mypublications$abstract<-rep("", nrow(Mypublications)) #empty string = no abstract found
  for (i in seq_len(nrow(Mypublications))) {
    Abstracts <- get_publication_abstract(id = IDnumber, 
                                          pub_id = Mypublications$pubid[i])
    #only store the abstract if one was returned
    if (length(Abstracts) > 0) {
      Mypublications$abstract[i] <- paste(Abstracts, collapse = " ")
    }
  }
  
  #Create Corpus
  MyAbstracts.corpus<-VCorpus(VectorSource(Mypublications$abstract))
  
  #Create TDM from corpus
  MyAbstracts.TDM<-TermDocumentMatrix(MyAbstracts.corpus,
                                      control = 
                                        list(removePunctuation = TRUE,  
                                             stopwords = TRUE,
                                             tolower = TRUE,
                                             stemming = FALSE,
                                             removeNumbers = TRUE))
  #Tidy TDM
  MyAbstracts.TDM<-tidy(MyAbstracts.TDM) %>% 
    group_by(term) %>% 
    summarise(count = sum(count)) %>% #sum frequency of each term across documents
    arrange(desc(count))              #sort from most to least frequent
  
  #retain top 100 most frequently used words
  MyAbstracts.TDM<-head(MyAbstracts.TDM, 100) 
  
  #create wordcloud
  wordcloud<-wordcloud2(MyAbstracts.TDM,
             fontWeight = "bold",  #bold all items
             rotateRatio = 0,      #proportion of words to rotate (0 = none)
             size=.75,             #font size scaling (larger = bigger words)
             fontFamily = "Times", #font of wordcloud   
             color = randomColor(nrow(MyAbstracts.TDM), #create color palette
                                 hue = "random", 
                                 luminosity = "dark")) 
  
  return(wordcloud)
  }

##################################################################
### Run the function we just created, using the assigned name ###
##################################################################

GoogleScholarWordCloud(FirstName, LastName)

############################ End. ###################################
