PubMed Counts of Articles

Overview

What are the rates of publication of papers on various topics?

R Markdown

This is an R Markdown file, which is a way to interlace three things:

Code in R (or another!) programming language
Statistical or scientific reasoning about the code we write, and
The output of R code

… all in one place. You can export this interweaving of human and computer language as well as the code output to various formats – pdf, Microsoft Word, or html. Are you new to R Markdown? Here are some great resources:

R Markdown was developed by RStudio. Read their descriptions and examples on their site.
The RStudio gurus wrote a great book about it – check it out!

Load Packages

Here we add the packages we’ll use. tidyverse helps reshape data, easyPubMed simplifies the use of the PubMed API, and printr allows us to show data frames more attractively.

library(tidyverse)
library(easyPubMed)
library(printr)

A Single Search

I’d like to search across several years, for several different search terms. Let’s start simply, however, with a single search. What can PubMed tell me about articles published in 2015 that mention “medicine” and “disparity”?

The results tell me the overall count (600) and give me the first page of matching article IDs: RetMax (20) IDs, starting at offset RetStart (0).

results <- get_pubmed_ids("2015[Date - Publication] AND medicine disparity")
results

## $Count
## [1] "600"
## 
## $RetMax
## [1] "20"
## 
## $RetStart
## [1] "0"
## 
## $QueryKey
## [1] "1"
## 
## $WebEnv
## [1] "MCID_63000f04892cbf6ded64d325"
## 
## $IdList
## $IdList$Id
## [1] "28149430"
## 
## $IdList$Id
## [1] "27927559"
## 
## $IdList$Id
## [1] "26289634"
## 
## $IdList$Id
## [1] "26489876"
## 
## $IdList$Id
## [1] "27330533"
## 
## $IdList$Id
## [1] "27294764"
## 
## $IdList$Id
## [1] "27294752"
## 
## $IdList$Id
## [1] "27271069"
## 
## $IdList$Id
## [1] "27271063"
## 
## $IdList$Id
## [1] "27269496"
## 
## $IdList$Id
## [1] "27250705"
## 
## $IdList$Id
## [1] "26997937"
## 
## $IdList$Id
## [1] "26985407"
## 
## $IdList$Id
## [1] "26939507"
## 
## $IdList$Id
## [1] "26928630"
## 
## $IdList$Id
## [1] "26896117"
## 
## $IdList$Id
## [1] "26896110"
## 
## $IdList$Id
## [1] "26896100"
## 
## $IdList$Id
## [1] "26863559"
## 
## $IdList$Id
## [1] "26863333"
## 
## 
## $TranslationSet
## $TranslationSet$From
## [1] "medicine"
## 
## $TranslationSet$To
## [1] "\"medicine\"[MeSH Terms] OR \"medicine\"[All Fields]"
## 
## 
## $QueryTranslation
## [1] "2015[Date - Publication] AND ((\"medicine\"[MeSH Terms] OR \"medicine\"[All Fields]) AND disparity[All Fields])"
## 
## $OriginalQuery
## [1] "2015[Date+-+Publication]+AND+medicine+disparity"

I can also extract just the count of articles:

results$Count

## [1] "600"

But I don’t want to run that by hand for every combination of year and search term just to get the article counts!

I’ll start by creating my search terms.

Create Search Terms

Let’s create a data frame which contains all the combinations of the variables we want to search for. First, we’ll define the two variables that we’ll combine.

years will be the series from 2012 to 2023.
terms will be the terms we want to search on.

We’ll build every combination of the two using expand_grid() from tidyr, which is part of the tidyverse.

years <- 2012:2023

terms <- c(
  'medicine disparity', 
  'medicine racism', 
  'medicine racial bias')

search_terms <- expand_grid("year" = years,
                       "term" = terms)

Let’s peek!

head(search_terms, 10)

year	term
2012	medicine disparity
2012	medicine racism
2012	medicine racial bias
2013	medicine disparity
2013	medicine racism
2013	medicine racial bias
2014	medicine disparity
2014	medicine racism
2014	medicine racial bias
2015	medicine disparity
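
The full grid is larger than the ten rows shown here: expand_grid() returns one row per combination, so 12 years and 3 terms should give 36 rows. A minimal, self-contained sketch of that check (it loads only tidyr and redefines the same two vectors; grid is just an illustrative name):

```r
library(tidyr)

years <- 2012:2023
terms <- c("medicine disparity", "medicine racism", "medicine racial bias")

grid <- expand_grid(year = years, term = terms)

# One row per (year, term) pair: length(years) * length(terms)
nrow(grid)

# The first argument varies slowest, so the first three rows are all 2012
grid$year[1:3]
```

Checking the row count before calling the API is cheap, and it tells us how many requests the search step will make.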

OK, now we’ll pad those search terms with the text that the API requires:

search_terms <- search_terms %>%
  mutate(final = paste(year, 
                       "[Date - Publication]",
                       " AND ",
                       term,
                       sep = ""
                       ))

And let’s look again:

head(search_terms, 20)

year	term	final
2012	medicine disparity	2012[Date - Publication] AND medicine disparity
2012	medicine racism	2012[Date - Publication] AND medicine racism
2012	medicine racial bias	2012[Date - Publication] AND medicine racial bias
2013	medicine disparity	2013[Date - Publication] AND medicine disparity
2013	medicine racism	2013[Date - Publication] AND medicine racism
2013	medicine racial bias	2013[Date - Publication] AND medicine racial bias
2014	medicine disparity	2014[Date - Publication] AND medicine disparity
2014	medicine racism	2014[Date - Publication] AND medicine racism
2014	medicine racial bias	2014[Date - Publication] AND medicine racial bias
2015	medicine disparity	2015[Date - Publication] AND medicine disparity
2015	medicine racism	2015[Date - Publication] AND medicine racism
2015	medicine racial bias	2015[Date - Publication] AND medicine racial bias
2016	medicine disparity	2016[Date - Publication] AND medicine disparity
2016	medicine racism	2016[Date - Publication] AND medicine racism
2016	medicine racial bias	2016[Date - Publication] AND medicine racial bias
2017	medicine disparity	2017[Date - Publication] AND medicine disparity
2017	medicine racism	2017[Date - Publication] AND medicine racism
2017	medicine racial bias	2017[Date - Publication] AND medicine racial bias
2018	medicine disparity	2018[Date - Publication] AND medicine disparity
2018	medicine racism	2018[Date - Publication] AND medicine racism
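
The padding step can also be wrapped in a small helper, which makes the query format easy to check on a single input. This is only a sketch: build_query() is a hypothetical name used for illustration, not part of easyPubMed.

```r
# Hypothetical helper: builds one PubMed query string per (year, term) pair,
# in the same format as the `final` column above.
build_query <- function(year, term) {
  paste0(year, "[Date - Publication] AND ", term)
}

# A single query
build_query(2015, "medicine disparity")

# paste0() is vectorized, so the helper also works on whole columns
build_query(2012:2013, "medicine racism")
```

Because the helper is vectorized, it could be used inside mutate() in place of the paste() call.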

Search in PubMed

Now we’ll make a short function that returns the count of results for a given term:

count_results <- function(term) {
  results <- get_pubmed_ids(term)
  count <- as.integer(results$Count)
  return(count)
}
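
With dozens of requests in a row, a single failed request would stop the whole run. One possible way to guard against that is sketched below. safe_count() and its fetch argument are hypothetical names introduced here, and the stubs only imitate the Count field seen in the output earlier; none of this comes from easyPubMed itself.

```r
# Hypothetical variant of count_results(): returns NA instead of stopping
# when a request fails. `fetch` defaults to easyPubMed::get_pubmed_ids but
# can be swapped for a stub when testing offline.
safe_count <- function(query, fetch = easyPubMed::get_pubmed_ids) {
  tryCatch(
    as.integer(fetch(query)$Count),
    error = function(e) NA_integer_
  )
}

# Offline check with stubs standing in for the API (assumed response shape)
ok_stub  <- function(query) list(Count = "600")
bad_stub <- function(query) stop("network unavailable")

safe_count("2015[Date - Publication] AND medicine disparity", fetch = ok_stub)
safe_count("2015[Date - Publication] AND medicine disparity", fetch = bad_stub)
```

The rest of this document continues with the simpler count_results().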

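And now we’ll use count_results() to populate a new column. Note that we’re wrapping each search in a small function that pauses for half a second first, so that we stay under the “anonymous” API rate supported by PubMed (the rate allowed for requests made without an API key). We use sapply rather than lapply so that num_results comes back as a plain integer vector rather than a list column, which is easier to plot and summarize.

search_terms <- search_terms %>% 
  mutate(num_results = sapply(final, function(f) {
    Sys.sleep(0.5)  # pause between requests to respect the rate limit
    count_results(f)
    }, USE.NAMES = FALSE))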
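
The pause between requests can also be expressed with slowly() from purrr, which wraps a function so that successive calls are spaced out. The sketch below is an alternative, not what this document does: fake_count() is a stand-in so that the example runs without network access, and the 0.5 second delay simply mirrors the pause used above.

```r
library(purrr)

# Stand-in for count_results() so the sketch runs without network access
fake_count <- function(query) nchar(query)

# slowly() returns a version of the function that waits between calls
slow_count <- slowly(fake_count, rate = rate_delay(0.5))

queries <- c("a", "bb", "ccc")

# map_int() insists on one integer per query, so the result is a plain vector
counts <- map_int(queries, slow_count)
counts
```

In the real pipeline, count_results would take the place of fake_count and the final column would take the place of queries.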

Visualize the Data

ggplot lets us take a look at our results graphically:

ggplot(search_terms, 
       aes(x=year, y=num_results)) +
  geom_col() +
  facet_wrap(~ term) +
  xlab("Year") +
  ylab("Count") + 
  ggtitle("Healthcare Disparity Articles")