What are the rates of publication of papers on various topics?
This is an R Markdown file, which is a way to interlace three things: human language, computer code, and the output of that code, all in one place. You can export this interweaving of human and computer language, as well as the code output, to various formats: PDF, Microsoft Word, or HTML. Are you new to R Markdown? Here are some great resources:
Here we add the packages we’ll use. tidyverse helps reshape data, easyPubMed simplifies the use of the PubMed API, and printr allows us to show data frames more attractively.
library(tidyverse)
library(easyPubMed)
library(printr)
I’d like to search across several years, for several different search terms. Let’s start simply, however, with a single search. What can PubMed tell me about articles published in 2015 that mention “medicine” and “disparity”?
The results tell me the overall count (600) and give me the first batch of matching article IDs: RetStart (0) is the offset of the first ID returned, and RetMax (20) is the maximum number of IDs returned.
results <- get_pubmed_ids("2015[Date - Publication] AND medicine disparity")
results
## $Count
## [1] "600"
##
## $RetMax
## [1] "20"
##
## $RetStart
## [1] "0"
##
## $QueryKey
## [1] "1"
##
## $WebEnv
## [1] "MCID_63000f04892cbf6ded64d325"
##
## $IdList
## $IdList$Id
## [1] "28149430"
##
## $IdList$Id
## [1] "27927559"
##
## $IdList$Id
## [1] "26289634"
##
## $IdList$Id
## [1] "26489876"
##
## $IdList$Id
## [1] "27330533"
##
## $IdList$Id
## [1] "27294764"
##
## $IdList$Id
## [1] "27294752"
##
## $IdList$Id
## [1] "27271069"
##
## $IdList$Id
## [1] "27271063"
##
## $IdList$Id
## [1] "27269496"
##
## $IdList$Id
## [1] "27250705"
##
## $IdList$Id
## [1] "26997937"
##
## $IdList$Id
## [1] "26985407"
##
## $IdList$Id
## [1] "26939507"
##
## $IdList$Id
## [1] "26928630"
##
## $IdList$Id
## [1] "26896117"
##
## $IdList$Id
## [1] "26896110"
##
## $IdList$Id
## [1] "26896100"
##
## $IdList$Id
## [1] "26863559"
##
## $IdList$Id
## [1] "26863333"
##
##
## $TranslationSet
## $TranslationSet$From
## [1] "medicine"
##
## $TranslationSet$To
## [1] "\"medicine\"[MeSH Terms] OR \"medicine\"[All Fields]"
##
##
## $QueryTranslation
## [1] "2015[Date - Publication] AND ((\"medicine\"[MeSH Terms] OR \"medicine\"[All Fields]) AND disparity[All Fields])"
##
## $OriginalQuery
## [1] "2015[Date+-+Publication]+AND+medicine+disparity"
I can also extract just the count of articles:
results$Count
## [1] "600"
But I don’t want to do that by hand for every combination of year and search term just to get the count of articles!
I’ll start by creating my search terms.
Let’s create a data frame which contains all the combinations of the variables we want to search for. First, we’ll define the two variables that we’ll combine. years will be the series from 2012 to 2023. terms will be the terms we want to search on. We’ll build every combination of the two using expand_grid() from tidyverse.
years <- 2012:2023
terms <- c(
  'medicine disparity',
  'medicine racism',
  'medicine racial bias')
search_terms <- expand_grid("year" = years,
                            "term" = terms)
Let’s peek!
head(search_terms, 10)
| year | term |
|---|---|
| 2012 | medicine disparity |
| 2012 | medicine racism |
| 2012 | medicine racial bias |
| 2013 | medicine disparity |
| 2013 | medicine racism |
| 2013 | medicine racial bias |
| 2014 | medicine disparity |
| 2014 | medicine racism |
| 2014 | medicine racial bias |
| 2015 | medicine disparity |
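As a quick sanity check on the grid, the number of rows should equal the number of years times the number of terms. The sketch below rebuilds the same combinations with base R's expand.grid(), swapped in here only so the snippet runs without tidyverse. The name grid_check is illustrative and not part of the original analysis.

```r
# Same inputs as the code above
years <- 2012:2023
terms <- c("medicine disparity", "medicine racism", "medicine racial bias")

# expand.grid() varies its first argument fastest, so list `term` first
# and then reorder the columns to match the year/term layout shown above.
grid_check <- expand.grid(term = terms, year = years,
                          stringsAsFactors = FALSE)[, c("year", "term")]

# One row per (year, term) pair: this comparison should be TRUE
nrow(grid_check) == length(years) * length(terms)
head(grid_check, 4)
```

If the comparison is FALSE, a year or a term was probably dropped or duplicated before the grid was built.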
OK, now we’ll pad those search terms with the text that the API requires:
search_terms <- search_terms %>%
  mutate(final = paste(year,
                       "[Date - Publication]",
                       " AND ",
                       term,
                       sep = ""
  ))
And let’s look again:
head(search_terms, 20)
| year | term | final |
|---|---|---|
| 2012 | medicine disparity | 2012[Date - Publication] AND medicine disparity |
| 2012 | medicine racism | 2012[Date - Publication] AND medicine racism |
| 2012 | medicine racial bias | 2012[Date - Publication] AND medicine racial bias |
| 2013 | medicine disparity | 2013[Date - Publication] AND medicine disparity |
| 2013 | medicine racism | 2013[Date - Publication] AND medicine racism |
| 2013 | medicine racial bias | 2013[Date - Publication] AND medicine racial bias |
| 2014 | medicine disparity | 2014[Date - Publication] AND medicine disparity |
| 2014 | medicine racism | 2014[Date - Publication] AND medicine racism |
| 2014 | medicine racial bias | 2014[Date - Publication] AND medicine racial bias |
| 2015 | medicine disparity | 2015[Date - Publication] AND medicine disparity |
| 2015 | medicine racism | 2015[Date - Publication] AND medicine racism |
| 2015 | medicine racial bias | 2015[Date - Publication] AND medicine racial bias |
| 2016 | medicine disparity | 2016[Date - Publication] AND medicine disparity |
| 2016 | medicine racism | 2016[Date - Publication] AND medicine racism |
| 2016 | medicine racial bias | 2016[Date - Publication] AND medicine racial bias |
| 2017 | medicine disparity | 2017[Date - Publication] AND medicine disparity |
| 2017 | medicine racism | 2017[Date - Publication] AND medicine racism |
| 2017 | medicine racial bias | 2017[Date - Publication] AND medicine racial bias |
| 2018 | medicine disparity | 2018[Date - Publication] AND medicine disparity |
| 2018 | medicine racism | 2018[Date - Publication] AND medicine racism |
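The padding step can also be wrapped in a small helper, which makes the query format easy to check in isolation. This is a sketch only: build_query is a hypothetical name, not part of easyPubMed, and it is meant to produce the same strings as the paste() call above.

```r
# Hypothetical helper: combine a publication year and free-text terms
# into the query string format used in this document.
build_query <- function(year, term) {
  paste0(year, "[Date - Publication] AND ", term)
}

# paste0() is vectorised, so this works on whole columns at once
build_query(c(2012, 2013), "medicine racism")
```

With a helper like this, the mutate() step could be written as mutate(final = build_query(year, term)), assuming the same column names as above.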
Now we’ll make a short function that returns the count of results for a given term:
count_results <- function(term) {
  results <- get_pubmed_ids(term)
  # Count comes back as a character string, so convert it
  count <- as.integer(results$Count)
  return(count)
}
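Because this function calls a remote service, a single failed request would otherwise stop the whole run. Below is a minimal sketch of a more defensive variant, assuming we are happy to record NA for any query that fails. The name safe_count and its fetch argument are hypothetical. How get_pubmed_ids() behaves on failure may depend on the installed easyPubMed version, so this handles both an error and a missing Count.

```r
# Hypothetical defensive variant of count_results().
# `fetch` defaults to easyPubMed's get_pubmed_ids(); passing it as an
# argument lets us substitute a stand-in when testing offline.
safe_count <- function(query, fetch = easyPubMed::get_pubmed_ids) {
  tryCatch({
    results <- fetch(query)
    if (is.null(results$Count)) {
      NA_integer_                # no count in the response
    } else {
      as.integer(results$Count)  # Count arrives as a character string
    }
  }, error = function(e) NA_integer_)  # the request itself failed
}

# Offline check with a stand-in that mimics the shape of the real result
fake_fetch <- function(query) list(Count = "600")
safe_count("2015[Date - Publication] AND medicine disparity",
           fetch = fake_fetch)
```

Returning NA rather than 0 keeps a failed lookup distinguishable from a query that genuinely matched no articles.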
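And now we’ll use that function to populate a new column. Note that we’re using map_int() from purrr (loaded with tidyverse) rather than lapply(): it returns a plain integer vector instead of a list, which is easier to plot and summarise, and it still lets us put a pause between searches in order to not go over the “anonymous” API rate supported by PubMed.
search_terms <- search_terms %>%
  mutate(num_results = map_int(final, function(f) {
    Sys.sleep(0.5)  # pause between requests
    count_results(f)
  }))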
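The throttled loop above can also be pulled out into a reusable base R helper. The sketch below uses vapply(), which checks that every result is a single integer and returns an ordinary integer vector. The name throttled_counts and its arguments are hypothetical. For context, NCBI documents a limit of three requests per second without an API key, so a 0.5 second pause leaves some headroom.

```r
# Hypothetical helper: apply `counter` to each query, pausing between calls.
# vapply() enforces that every result is a single integer.
throttled_counts <- function(queries, counter, pause = 0.5) {
  vapply(queries, function(q) {
    Sys.sleep(pause)  # wait before each request
    counter(q)
  }, integer(1), USE.NAMES = FALSE)
}

# Offline demonstration with a stand-in counter (number of characters),
# so no requests are sent to PubMed
demo <- throttled_counts(c("a", "bcd"), counter = nchar, pause = 0)
demo
```

In the real pipeline this would be called as throttled_counts(search_terms$final, count_results), assuming count_results() is defined as above.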
ggplot lets us take a look at our results graphically:
ggplot(search_terms,
       aes(x = year, y = num_results)) +
  geom_col() +
  facet_wrap(term ~ .) +
  xlab("Year") +
  ylab("Count") +
  ggtitle("Healthcare Disparity Articles")