Google Scholar Data with R

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(tidyverse)
library(vroom)
library(ggsci)
library(scholar)
load("gs.RData")

Introduction

The R package scholar allows you to access Google Scholar citation data using the Google Scholar ID. I’m still exploring the possibilities, but the package is very cool.

Google Scholar has no official API; the package scrapes profile pages, and Google monitors the number of requests, so be careful how you structure your code. Many of my chunks are “commented-out” so that they will not run. The data has been saved and loaded into the R environment so the queries do not need to be repeated.
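One way to avoid repeating a query is to cache each result on disk and only contact Google Scholar when the cache file is missing. The sketch below is an assumption about how that could look, not something the scholar package provides: cached_query() and the file name are hypothetical, and a stand-in function takes the place of a real call such as get_profile().

```r
# Hypothetical helper: run query_fn() only if no cached copy exists at `path`.
cached_query <- function(path, query_fn) {
  if (file.exists(path)) {
    return(readRDS(path))   # reuse the saved result, no new request
  }
  result <- query_fn()      # first run: perform the (expensive) query
  saveRDS(result, path)
  result
}

# Stand-in for a real query; counts how many times it is actually called.
n_calls <- 0
fake_query <- function() {
  n_calls <<- n_calls + 1
  list(name = "Example Scholar", total_cites = 0)
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached_query(cache_file, fake_query)
second <- cached_query(cache_file, fake_query)  # served from the cache
```

With the real package, the call might look like cached_query("jbr_profile.rds", function() get_profile(gs_id)), which would replace the comment-out-and-reload pattern for a single object.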

Get Scholar Profile Information

We can use Jason B. Reed as an example.

Find their ID in the URL of their Google Scholar profile page: it is the value of the user parameter. The package can also look up an ID by name.

gs_id <- "CpBa1V4AAAAJ"
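If you have the profile URL on hand, the ID can be pulled out with a regular expression instead of copying it by eye. This is a minimal sketch; the URL below is assembled from the ID above and an illustrative hl=en suffix, and profile_url and extracted_id are names introduced only for this example.

```r
# Example profile URL; the `hl=en` suffix is just illustrative.
profile_url <- "https://scholar.google.com/citations?user=CpBa1V4AAAAJ&hl=en"

# Capture everything after "user=" up to the next "&" (or the end of the string).
extracted_id <- sub(".*[?&]user=([^&]+).*", "\\1", profile_url)
extracted_id
```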

Get the basic profile data for Jason. (Avoid running this chunk more than once.)

# jbr_profile <- get_profile(gs_id)

The result is a list.

jbr_profile

## $id
## [1] "CpBa1V4AAAAJ"
## 
## $name
## [1] "Jason B Reed"
## 
## $affiliation
## [1] "Associate Professor, Purdue University"
## 
## $total_cites
## [1] 425
## 
## $h_index
## [1] 13
## 
## $i10_index
## [1] 18
## 
## $fields
## [1] "Systematic Reviews"               "Library Management"              
## [3] "Library Professional Development"
## 
## $homepage
## character(0)
## 
## $coauthors
## [1] "Benjamin Jahre"       "Alexander J. Carroll" "Leo S. Lo"           
## [4] "Sort by citations"    "Sort by year"         "Sort by title"       
## [7] "About Scholar"        "Search help"         
## 
## $available
## [1] 3
## 
## $not_available
## [1] 0
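Notice that the coauthors element in the output above picked up page labels ("Sort by citations", "About Scholar", and so on) along with the three real names. A quick way to drop them is sketched below; the list of labels is taken from this one output, so treat it as an assumption that may need adjusting for other profiles.

```r
# Coauthor vector as printed in the profile output above.
coauthors <- c(
  "Benjamin Jahre", "Alexander J. Carroll", "Leo S. Lo",
  "Sort by citations", "Sort by year", "Sort by title",
  "About Scholar", "Search help"
)

# Page labels seen in this output that are not people.
page_labels <- c(
  "Sort by citations", "Sort by year", "Sort by title",
  "About Scholar", "Search help"
)

# Keep only entries that are not known page labels.
clean_coauthors <- setdiff(coauthors, page_labels)
clean_coauthors
```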

Get Scholar Publications

You can use another function to extract the publications from Google Scholar.

# jbr_pubs <- get_publications(gs_id)

The result is a data frame, but I prefer a tibble. The variables are common bibliometric fields, including each article's total citations to date (cites).

jbr_pubs <- jbr_pubs %>%
  as_tibble()
jbr_pubs

## # A tibble: 55 × 8
##    title                           author journal number cites  year cid   pubid
##    <chr>                           <chr>  <chr>   <chr>  <dbl> <dbl> <chr> <chr>
##  1 Pharmacists’ impact on older a… JL Ne… Vaccine "38 (…    40  2020 1473… roLk…
##  2 The effects of serious gaming … L van… Climat… "170 …    35  2022 1820… qxL8…
##  3 Poultry consumption and human … G Con… Advanc… "13 (…    33  2022 1593… TQgY…
##  4 Reviewing the current state of… JB Re… Collec… "44 (…    32  2019 2445… W7OE…
##  5 Meat consumption and gut micro… Y Wan… Advanc… "14 (…    23  2023 1174… QIV2…
##  6 Examining positive youth devel… E Maj… The Jo… "42 (…    23  2022 1711… YOwf…
##  7 A scoping review of engineerin… M Phi… Journa… "113 …    22  2024 1761… r0Bp…
##  8 Not just playing: The politics… JM Ve… Geofor… "137,…    21  2022 1178… 4DMP…
##  9 Effect of pharmacy-led interve… M Har… Journa… "62 (…    21  2022 4429… Wp0g…
## 10 The gender wage gap in researc… HA Ho… Colleg… ""        19  2020 8504… Se3i…
## # ℹ 45 more rows
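The h_index and i10_index in the profile are summaries of this cites column: the h-index is the largest h such that h articles have at least h citations each, and the i10-index counts articles with at least 10 citations. The sketch below computes both from only the ten rows printed above, so it cannot reproduce the profile values (13 and 18); the full 55-row table would be needed for that.

```r
# Citation counts for the ten articles shown in the printed tibble.
cites <- c(40, 35, 33, 32, 23, 23, 22, 21, 21, 19)

# h-index: rank articles by citations, count ranks where cites >= rank.
ranked  <- sort(cites, decreasing = TRUE)
h_index <- sum(ranked >= seq_along(ranked))

# i10-index: number of articles with at least 10 citations.
i10_index <- sum(cites >= 10)

c(h_index = h_index, i10_index = i10_index)
```

Both values come out as 10 here simply because all ten articles have at least 10 citations.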

Get Author Google Citation History

You can retrieve author citations by year.

# jbr_history <- get_citation_history(gs_id)

You can determine the cumulative citations.

jbr_history <- jbr_history %>%
  as_tibble() %>%
  mutate(cumal_cites = cumsum(cites))
jbr_history

## # A tibble: 10 × 3
##     year cites cumal_cites
##    <dbl> <dbl>       <dbl>
##  1  2016     3           3
##  2  2017     1           4
##  3  2018     2           6
##  4  2019     3           9
##  5  2020    11          20
##  6  2021    36          56
##  7  2022    48         104
##  8  2023   102         206
##  9  2024   180         386
## 10  2025    31         417
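As a sanity check, the cumulative column can be rebuilt from the printed yearly counts and compared with total_cites from the profile. The sketch below uses only numbers shown above; the names history, profile_total, and gap are introduced for illustration. The two totals differ slightly, and nothing in these outputs explains why.

```r
# Yearly citations as printed above.
history <- data.frame(
  year  = 2016:2025,
  cites = c(3, 1, 2, 3, 11, 36, 48, 102, 180, 31)
)

# Running total, same calculation as the mutate() step.
history$cumal_cites <- cumsum(history$cites)

# total_cites reported in the profile output.
profile_total <- 425

# Difference between the profile total and the last cumulative value.
gap <- profile_total - tail(history$cumal_cites, 1)
gap
```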

Plot the result.

jbr_history %>%
  ggplot() +
  aes(x=year, y=cumal_cites) +
  geom_line(lwd=1) +
  scale_x_continuous(breaks=seq(from=2016, to=2025, by=1)) +
  theme_bw()

Get Article Citation History

You can get historical citations for specific publications. Below, I retrieve the history for JBR's 10 most-cited articles. I’m not sure what the request limits are for this.

# pubids <- pull(jbr_pubs, pubid)[1:10]
# jbr_ach <- get_article_cite_history(gs_id, pubids[1])
# for(i in 2:10){
#   ach <- get_article_cite_history(gs_id, pubids[i])
#   jbr_ach <- bind_rows(jbr_ach, ach)
# }
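Because each article is a separate request, it may be worth pausing between calls. The sketch below shows the pattern with lapply() and a short Sys.sleep(); fake_cite_history() is a hypothetical stand-in for get_article_cite_history(), the pubids are placeholders, and the pause length is a guess rather than a documented limit.

```r
# Hypothetical stand-in for get_article_cite_history(); returns a tiny fake history.
fake_cite_history <- function(id, pubid) {
  data.frame(year = c(2023L, 2024L), cites = c(1, 2), pubid = pubid)
}

pubids <- c("pubA", "pubB", "pubC")   # placeholder IDs, not real ones

# One request per article, with a pause before each.
histories <- lapply(pubids, function(p) {
  Sys.sleep(0.1)                      # pause between requests; the interval is a guess
  fake_cite_history("some_id", p)
})

# Stack the per-article data frames into one.
all_history <- do.call(rbind, histories)
all_history
```

Collecting the pieces in a list and binding once also avoids growing the data frame inside the loop.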

Each row in the result is tied to its article only by pubid. You could join it back to the publications table to attach fields such as the journal or publication year.

jbr_ach <- jbr_ach %>%
  as_tibble()
jbr_ach

## # A tibble: 41 × 3
##     year cites pubid       
##    <int> <dbl> <chr>       
##  1  2020     1 roLk4NBRz8UC
##  2  2021    12 roLk4NBRz8UC
##  3  2022     5 roLk4NBRz8UC
##  4  2023    11 roLk4NBRz8UC
##  5  2024     8 roLk4NBRz8UC
##  6  2025     2 roLk4NBRz8UC
##  7  2022     3 qxL8FJ1GzNcC
##  8  2023    13 qxL8FJ1GzNcC
##  9  2024    15 qxL8FJ1GzNcC
## 10  2025     4 qxL8FJ1GzNcC
## # ℹ 31 more rows
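Here is a minimal sketch of such a join, using a few rows copied from the outputs above. It attaches each article's publication year (the year column of jbr_pubs, renamed pub_year here to avoid a clash) and derives years since publication. The object names ach, pubs, and merged are introduced only for this example.

```r
# A few rows of the article citation history, as printed above.
ach <- data.frame(
  year  = c(2020, 2021, 2022, 2023),
  cites = c(1, 12, 3, 13),
  pubid = c("roLk4NBRz8UC", "roLk4NBRz8UC", "qxL8FJ1GzNcC", "qxL8FJ1GzNcC")
)

# Publication years for the same two articles, from the publications table.
pubs <- data.frame(
  pubid    = c("roLk4NBRz8UC", "qxL8FJ1GzNcC"),
  pub_year = c(2020, 2022)
)

# Join on pubid, then compute how long after publication each citation year falls.
merged <- merge(ach, pubs, by = "pubid")
merged$years_since_pub <- merged$year - merged$pub_year
merged
```

With the full tibbles, the same idea could be written with left_join() from dplyr.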

Visualize the yearly citations.

jbr_ach %>%
  ggplot() +
  aes(x=year, y=cites, color=pubid) +
  geom_line(lwd=1) +
  scale_color_aaas() +
  theme_bw()

Determine cumulative citations and visualize.

jbr_ach %>%
  group_by(pubid) %>%
  mutate(cumal_cites = cumsum(cites)) %>%
  ggplot() +
  aes(x=year, y=cumal_cites, color=pubid) +
  geom_line(lwd=1) +
  scale_color_aaas() +
  theme_bw()

Get Journal Rank

I have questions about this function. Journal names are matched fuzzily, but the output does not show which query matched which result.

# jbr_journals <- jbr_pubs %>%
#   pull(journal) %>%
#   unique()
# jbr_jr <- get_journalrank(journals=jbr_journals, max.distance=0.1)

jbr_jr <- jbr_jr %>%
  as_tibble()

jbr_jr

## # A tibble: 29 × 20
##     Rank    Sourceid Journal        Type  Issn     SJR SJR.Best.Quartile H.index
##    <int>       <dbl> <chr>          <chr> <chr>  <dbl> <chr>               <int>
##  1  2384       21376 Vaccine        jour… 0264…  1.39  Q1                    191
##  2  2513       12177 Climatic Chan… jour… 1573…  1.36  Q1                    198
##  3  1107 21100202730 Advances in N… jour… 2156…  2.15  Q1                    103
##  4  9327  4700152789 Collection Ma… jour… 0146…  0.536 Q1                     19
##  5  4671       25675 Journal of Ea… jour… 1552…  0.917 Q1                     73
##  6  2735       12481 Journal of En… jour… 1069…  1.29  Q1                    113
##  7  2287       28611 Geoforum       jour… 0016…  1.42  Q1                    125
##  8     1       28773 Ca-A Cancer J… jour… 1542… 56.2   Q1                    182
##  9  3440       14238 College and R… jour… 0010…  1.11  Q1                     55
## 10  9310 19500157042 Currents in P… jour… 1877…  0.537 Q1                     23
## # ℹ 19 more rows
## # ℹ 12 more variables: Total.Docs...2021. <int>, Total.Docs...3years. <int>,
## #   Total.Refs. <int>, Total.Cites..3years. <int>,
## #   Citable.Docs...3years. <int>, Cites...Doc...2years. <dbl>,
## #   Ref....Doc. <dbl>, Country <chr>, Region <chr>, Publisher <chr>,
## #   Coverage <chr>, Categories <chr>
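To see which query produced which match, you could run the fuzzy matching yourself and keep the pairs. The sketch below uses base R's agrep() with the same max.distance value; whether get_journalrank() matches in exactly this way is an assumption, and the query and candidate vectors are toy data (including a deliberate misspelling and a nonsense string).

```r
# Toy queries and candidate journal names.
queries    <- c("Geoforum", "Vacine", "Zzzzzz")
candidates <- c("Vaccine", "Geoforum", "Climatic Change")

# For one query, return the query alongside its first approximate match (or NA).
match_one <- function(q) {
  hit <- agrep(q, candidates, max.distance = 0.1, value = TRUE)
  data.frame(
    query   = q,
    matched = if (length(hit) > 0) hit[1] else NA_character_
  )
}

# One row per query, so unmatched queries stay visible.
match_tbl <- do.call(rbind, lapply(queries, match_one))
match_tbl
```

Note that hit[1] is simply the first candidate within the distance threshold, not necessarily the closest one.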

Compare Scholars

This is a cool feature. Be careful not to unnecessarily repeat queries.

# compare_ids <- c("CpBa1V4AAAAJ", "5qP2Sl0AAAAJ", "2xGm54gAAAAJ", "2Zj8gKoAAAAJ", "zac0dKsAAAAJ", "zdI9WLAAAAAJ", "aTWTBZMAAAAJ", "ZPJBrW4AAAAJ", "1SW-xYYAAAAJ")
# compare_tb <- compare_scholar_careers(compare_ids) %>%
#   as_tibble()

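The result is yearly citations for each author, plus a career_year column that counts years from the first year in that author's citation history.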

compare_tb

## # A tibble: 181 × 5
##    id            year cites career_year name             
##    <chr>        <dbl> <dbl>       <dbl> <chr>            
##  1 1SW-xYYAAAAJ  2018    15           0 Margaret Phillips
##  2 1SW-xYYAAAAJ  2019    40           1 Margaret Phillips
##  3 1SW-xYYAAAAJ  2020    61           2 Margaret Phillips
##  4 1SW-xYYAAAAJ  2021    85           3 Margaret Phillips
##  5 1SW-xYYAAAAJ  2022   124           4 Margaret Phillips
##  6 1SW-xYYAAAAJ  2023   196           5 Margaret Phillips
##  7 1SW-xYYAAAAJ  2024   203           6 Margaret Phillips
##  8 1SW-xYYAAAAJ  2025    48           7 Margaret Phillips
##  9 2xGm54gAAAAJ  2001    15           0 Michael Fosmire  
## 10 2xGm54gAAAAJ  2002     9           1 Michael Fosmire  
## # ℹ 171 more rows

Compare yearly citations.

compare_tb %>%
  ggplot() +
  aes(x=year, y=cites, color=name) +
  geom_line(lwd=1) +
  scale_x_continuous(limits=c(2012, 2026), breaks=seq(from=2012, to=2026, by=2)) +
  scale_color_simpsons() +
  theme_bw() +
  ggtitle("Yearly Citations for Select Libraries' Faculty")

Compare cumulative citations.

compare_tb %>%
  group_by(name) %>%
  mutate(cumal_cites = cumsum(cites)) %>%
  ggplot() +
  aes(x=year, y=cumal_cites, color=name) +
  geom_line(lwd=1) +
  scale_x_continuous(limits=c(2012, 2026), breaks=seq(from=2012, to=2026, by=2)) +
  scale_color_simpsons() +
  theme_bw() +
  ggtitle("Cumulative Citations for Select Libraries' Faculty")

It seems unfair to compare shorter careers to longer ones. One option is to limit the years, although long-established scholars will still have more publications accruing citations.

compare_tb %>%
  filter(year >= 2018) %>%
  group_by(name) %>%
  mutate(cumal_cites = cumsum(cites)) %>%
  ggplot() +
  aes(x=year, y=cumal_cites, color=name) +
  geom_line(lwd=1) +
  scale_x_continuous(limits=c(2018, 2026), breaks=seq(from=2018, to=2026, by=2)) +
  scale_color_simpsons() +
  theme_bw() +
  ggtitle("Cumulative Citations for Select Libraries' Faculty")
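Another option is to line careers up on career_year rather than calendar year, so every scholar starts at zero. The sketch below rebuilds that column as year minus the author's first year, which is an assumption that happens to match the rows printed earlier, and accumulates citations per author. It uses a handful of those printed rows; careers is a name introduced for this example.

```r
# A few rows from the comparison table printed above.
careers <- data.frame(
  name  = c(rep("Margaret Phillips", 3), rep("Michael Fosmire", 2)),
  year  = c(2018, 2019, 2020, 2001, 2002),
  cites = c(15, 40, 61, 15, 9)
)

# Make sure rows are in year order within each author before accumulating.
careers <- careers[order(careers$name, careers$year), ]

# Years since each author's first recorded year.
careers$career_year <- careers$year - ave(careers$year, careers$name, FUN = min)

# Running citation total within each author.
careers$cumal_cites <- ave(careers$cites, careers$name, FUN = cumsum)
careers
```

Plotting with aes(x=career_year, y=cumal_cites, color=name) would then compare authors at the same career stage.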

Save the data in the environment so the Google Scholar queries do not need to be repeated.

# save.image("gs.RData")