Introduction

Kristianna Pettibone is working on a project to evaluate grantee approaches to data sharing.

She is submitting a proposal for a Data Fellow, and one idea included in the proposal is to have that person work on this activity and figure out how to automate it. To do that, they need a manually coded data file that can be used to train a machine learning model.

(Note) I believe that natural-language processing would be the best approach here, but even that may not be particularly necessary. Many of the questions that are being asked can be answered with regex and string parsing.

Through a search in PubMed Central (a full-text open access publication repository), we’ve identified around 1600 publications that have statements declaring data availability, code availability, and source data availability. These are a mix of standard text and free text.

PMC search to limit to grantee publications with Data Availability Statements (DAS) in the past five years is: niehs[gr] AND (has data avail[filter] OR has data citations[filter]) AND "last 5 years"[PDat]

At a minimum, to minimize manual curation, they want a script to extract the text from XML in the data, code, and source data sections (if they exist) and acknowledgments section for grantee info and populate the appropriate columns in an Excel file.

Even better would be more precise extraction to:

identify and extract the NIEHS grant numbers only
identify and extract specific repository names mentioned in the data and code availability sections
categorize DAS content by
- not available due to IRB
- available by request
- source data provided
- deposited in repository
- others?

Would need to support having multiple answers (e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7648103/).
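
As a rough illustration of the regex approach mentioned in the note above, the sketch below assigns every matching category to a statement, so a single DAS can carry multiple answers. The pattern list and the function name `categorize_das` are assumptions made for this example, not a validated coding scheme, and base R is used to keep it dependency-free.

```r
# Hypothetical category patterns; illustrative only, not a validated coding scheme.
das_patterns <- c(
  not_available_irb  = "\\bIRB\\b|institutional review board|confidential|privacy",
  available_request  = "(up)?on (reasonable )?request|from the corresponding author",
  source_data        = "source data|supporting information|supplementary",
  repository_deposit = "deposited|repository|accession|github|figshare|dryad|zenodo|\\bGEO\\b"
)

# Return every category whose pattern matches, or "other" when none does.
categorize_das <- function(text) {
  hits <- vapply(das_patterns,
                 function(p) grepl(p, text, ignore.case = TRUE, perl = TRUE),
                 logical(1))
  if (any(hits)) names(das_patterns)[hits] else "other"
}

example_das <- c(
  "Data are available from the corresponding author on reasonable request.",
  "Sequencing data were deposited in GEO; source data are provided with this paper.",
  "Not applicable."
)
das_categories <- lapply(example_das, categorize_das)
```

Returning a character vector per statement, rather than a single label, is what allows a paper such as the one linked above to fall into several categories at once.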

Below are links to three test publications as well as the XML for all three pubs.

Questions

What do authors say in their Data Availability Statements? We’ll review this information and assess where they are on the data sharing spectrum, from least accessible (call me to get the data) to most accessible (data are in an easily findable, searchable, publicly accessible data repository).
What data sets get used/cited? Can we see if there is a relationship between how the data are made available and how many people cite or use them?
We’d also like to learn about how people cite the data – but I’m not sure that’s available from this data. Do they re-analyze it? Do they pool it with other data? Do they just reference the data as background to whatever it is they are studying? Something else?
Regarding the Supplementary Data – mostly I’m just curious what authors list there.
For all these analyses, we’ll likely analyze whether there are certain grant types (R01s, P30s, U01s, etc.) or programs or universities that tend to be better at data sharing, so I’ll want to be able to analyze all of the above in relation to the grant numbers.
I’m also interested in any apparent relationship between data sharing and whether publications cite intramural grants, extramural grants, or a combination. And also whether there is any relationship between level of data sharing and whether a publication cites funding from multiple ICs.
Will probably also look to see if there is a relationship between data sharing and a higher RCR value.

Information desired

PMID
PMCID
Data Availability Statement
Source Data
Code Availability
Data Citations
Acknowledgements or Funding (for grant number)
Supplementary Data Information
Number of Citations (Scopus would be the better place to look)

Getting XML Data

PubMed/PMC API Query

To get the XML from the PubMed/PMC API, we chain a few calls together. This process was run only once; the result was saved to an XML file, and all later work starts from that file.

First we save all of the query parameters into a list to be used when we make a GET call to the API.

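library(httr)  # GET(), content()
library(xml2)  # xml_find_first(), xml_text(), write_xml(), read_xml()
library(here)  # here()

query_url <- list(db = "pmc",
                  tool = "niehs-lit-scraper",
                  email = "trey.saddler",
                  term = 'niehs[gr] AND (has data avail[filter] OR has data citations[filter]) AND "last 5 years"[PDat]',
                  retmax = 10000,
                  usehistory = "y")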

Then we run the query against the E-utilities esearch endpoint to get the list of matching PMC record IDs for the next API call.

pubmed_get <- GET(url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
                  query = query_url)

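In the next step, we parse the response into XML so that we can extract the WebEnv for the next API call. The query key is hard-coded to 1 further down, on the assumption that this is the first search stored in the history session.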

parsed_pubmed_get <- content(pubmed_get, as = "parsed")
webenv <- xml_text(xml_find_first(parsed_pubmed_get, "//WebEnv"))
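
A minimal sketch of reading the query key and the hit count from the same response, rather than assuming the key is 1. The XML below is a toy stand-in shaped like an esearch result returned with usehistory = "y"; its values, including the count and the WebEnv string, are made up for illustration.

```r
library(xml2)

# Toy esearch-style response; all values are invented for illustration.
esearch_xml <- read_xml(
  "<eSearchResult>
     <Count>1600</Count><RetMax>2</RetMax><RetStart>0</RetStart>
     <QueryKey>1</QueryKey><WebEnv>EXAMPLE_WEBENV</WebEnv>
     <IdList><Id>7648103</Id><Id>6538374</Id></IdList>
   </eSearchResult>"
)

# Collect the pieces the follow-up efetch call needs.
history <- list(
  count     = xml_integer(xml_find_first(esearch_xml, "//Count")),
  query_key = xml_integer(xml_find_first(esearch_xml, "//QueryKey")),
  webenv    = xml_text(xml_find_first(esearch_xml, "//WebEnv"))
)
```

Comparing `history$count` with `retmax` is a cheap way to notice when a search returns more records than a single request would fetch.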

Now we can build our query parameters to grab the full XML output for all entries using the information from the previous API call.

pmc_query <- list(db = "pmc",
                  tool = "niehs-lit-scraper",
                  email = "trey.saddler",
                  retmax = 10000,
                  usehistory = "y",
                  query_key = 1,
                  WebEnv = webenv)

pmc_get <- GET(url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
               query = pmc_query)

parsed_pmc <- content(pmc_get, as = "parsed")

And finally we can write this XML document to file for future work, creating a snapshot in time.

write_xml(parsed_pmc, here("output", "pmc_output.xml"))

Import XML

Here we import the previous XML document to have a clean working slate for future operations.

parsed_pmc <- read_xml(here("output", "pmc_output.xml"))

Data Availability Statements

DASs appear in several places in the XML returned from a PubMed Central search. To extract all information about data availability, we need to track down every location where a statement may be stored.

Notes

TODO

Titles of DA sections in the notes took many different forms, and not all of these DA statements carried the attribute notes-type='data-availability'. Therefore, we need to manually select which titles meet our criteria and use those titles for future searches.

The full list of titles of the notes sections is:

Author contribution, Author Contribution, Author contributions, Author Contributions, Author’s contributions, Authors’ contributions, Authors’ Contributions, Authors’ information, Authors’s contributions, Availability of Data and Code, Availability of data and materials, Code availability, Code Availability, Competing interests, Competing Interests, Compliance with ethical standards, Compliance with Ethical Standards, Conflict of interest, Conflict of Interest, Conflicts of Interest, Consent for publication, Data availability, Data Availability, Data availability and ethical considerations, Data Availability Statement, Data availability:, Data Availbility Statements, Disclaimer, Ethical approval, Ethical Approval, Ethics approval, Ethics approval and consent to participate, Funding, Funding Information, Informed consent, Notes, Publisher’s Note, Resource sharing, Supporting Information Available

From these, the following subset of titles was selected:

das_notes_titles <- c("Availability of Data and Code",
                     "Availability of data and materials",
                     "Code availability",
                     "Code Availability",
                     "Data availability",
                     "Data Availability",
                     "Data availability and ethical considerations",
                     "Data Availability Statement",
                     "Data availability:",
                     "Data Availbility Statements",
                     "Resource sharing",
                     "Supporting Information Available"
                     )
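
Because the selected titles differ mostly in case, punctuation, and spelling, a normalised pattern match could be an alternative to enumerating each variant. The sketch below is one way to do that in base R; the helper `normalize_title` and the pattern `avail|resource sharing` are assumptions for illustration and would need checking against the full title list.

```r
# Titles taken from the list above, plus two that should not match.
note_titles <- c("Data availability", "Data Availability Statement",
                 "Data availability:", "Data Availbility Statements",
                 "Code Availability", "Resource sharing",
                 "Funding", "Competing interests")

# Lower-case, trim whitespace, and drop trailing punctuation.
normalize_title <- function(x) {
  sub("[[:punct:]]+$", "", tolower(trimws(x)))
}

# Assumed pattern: "avail" also catches the misspelled "Availbility".
das_title_pattern <- "avail|resource sharing"
is_das_title   <- grepl(das_title_pattern, normalize_title(note_titles))
matched_titles <- note_titles[is_das_title]

# Distinct normalised forms among the matches.
normalized_forms <- unique(normalize_title(matched_titles))
```

A pattern this loose can also pick up unrelated titles, so the matches are best reviewed by eye before being used as a filter.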

Using these titles, we can search for any titles that match.

library(dplyr)
library(stringr)
library(tibble)
library(readr)

# Collect every <title> that is a direct child of a <notes> element, with its XPath
notes_title_nodes <- xml_find_all(parsed_pmc, ".//notes/title")
das_notes_no_attributes_title <- xml_text(notes_title_nodes)
das_notes_no_attributes_title_xpath <- xml_path(notes_title_nodes)

das_notes_title_noa <- tibble(das_notes_no_attributes_title,
                              xpath = das_notes_no_attributes_title_xpath)

das_notes_noa_filtered <- das_notes_title_noa %>% 
  filter(das_notes_no_attributes_title %in% das_notes_titles) %>% 
  # Trim each title XPath back to its enclosing <notes> node
  mutate(xpath = str_extract(xpath, "/pmc-articleset/article\\[\\d+\\]/(back|front)/notes(\\[\\d+\\]|)"))

das_notes_noa_text <- lapply(das_notes_noa_filtered$xpath, function(x){xml_find_all(parsed_pmc, x)})

das_notes_noa_text_extracted <- lapply(das_notes_noa_text, xml_text) %>% 
  unlist()

das_notes_noa_filtered <- das_notes_noa_filtered %>% 
                            add_column(raw_text = das_notes_noa_text_extracted)

The total number of entries with DAS’s with a proper XPath attribute in the notes is 296. The number of entries with multiple statements with proper XPath attributes in the notes is 32.
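
The code behind these two counts is not shown, so the following is only a sketch of how they could be derived from XPaths shaped like the `xpath` column above. The example paths are hypothetical.

```r
# Hypothetical XPaths in the same shape as the `xpath` column above.
example_xpaths <- c(
  "/pmc-articleset/article[1]/back/notes[2]",
  "/pmc-articleset/article[1]/back/notes[3]",
  "/pmc-articleset/article[4]/front/notes",
  "/pmc-articleset/article[9]/back/notes[1]"
)

# Reduce each path to its article prefix, then count statements per article.
article_prefix <- regmatches(
  example_xpaths,
  regexpr("^/pmc-articleset/article\\[[0-9]+\\]", example_xpaths)
)
statements_per_article <- table(article_prefix)

n_entries          <- length(statements_per_article)   # articles with at least one statement
n_multiple_entries <- sum(statements_per_article > 1)  # articles with more than one statement
```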

Generating CSV

# Adding additional columns for export
das_notes_noa_csv <- das_notes_noa_filtered %>% 
  rename(node_xpath = xpath)

das_notes_xpath_pmid <- das_notes_noa_csv$node_xpath %>% 
  str_extract("/pmc-articleset/article\\[\\d+\\]") %>%
  str_c("/front/article-meta/article-id[@pub-id-type='pmid']") %>% 
  # xml_find_first() yields NA for articles without a PMID, keeping one value per row
  lapply(function(x){xml_find_first(parsed_pmc, x)}) %>% 
  lapply(xml_integer) %>% 
  unlist()

das_notes_xpath_pmc <- das_notes_noa_csv$node_xpath %>% 
  str_extract("/pmc-articleset/article\\[\\d+\\]") %>%
  str_c("/front/article-meta/article-id[@pub-id-type='pmc']") %>% 
  lapply(function(x){xml_find_first(parsed_pmc, x)}) %>% 
  lapply(xml_integer) %>% 
  unlist()

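# Strip the leading title from each statement. Longest titles go first so that
# "Data Availability Statement" is removed whole, not as "Data Availability".
titles_longest_first <- das_notes_titles[order(nchar(das_notes_titles), decreasing = TRUE)]
rm_title_raw_text <- paste0("^\\s*(", paste(titles_longest_first, collapse = "|"), ")")

das_notes_noa_csv_text <- das_notes_noa_csv %>% 
   mutate(raw_text = str_remove(raw_text, rm_title_raw_text))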
  
das_notes_noa_filtered <- das_notes_noa_csv_text %>% 
                            add_column(pmid = das_notes_xpath_pmid,
                                       pmc = das_notes_xpath_pmc,
                                       appears_in = "notes") %>%
                            rename(note_title = das_notes_no_attributes_title) %>% 
                            select(pmid, pmc, note_title, everything())

write_csv(das_notes_noa_filtered, here("output", "das_notes_raw_text.csv"))

tally_notes <- das_notes_noa_filtered %>% 
  count(raw_text) %>% 
  top_n(5, n) %>% 
  arrange(desc(n))

raw_text	n
All relevant data are within the paper and its Supporting Information files.	107
All relevant data are within the manuscript and its Supporting Information files.	25
All relevant data are within the paper.	24
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.	17
Not applicable.	4

Sections

The total number of entries with a data-availability statement as a section in the paper with an XPath of .//sec[@sec-type='data-availability'] is 191.

One paper had data availability statements in both the ‘Materials and Methods’ section and after the ‘Supplementary Material’ section. This paper had the PMC ID of 6538374.
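
No code is shown for the section-based statements, so here is a minimal sketch of how they might be pulled out with the XPath quoted above. The two-article XML is a toy stand-in invented for this example, not data from the real file.

```r
library(xml2)

# Toy article set; structure and text are invented for illustration.
toy_set <- read_xml(
  "<pmc-articleset>
     <article><body>
       <sec sec-type='data-availability'><title>Data Availability</title><p>Data are in GEO.</p></sec>
     </body></article>
     <article><body>
       <sec sec-type='methods'><title>Methods</title><p>No statement here.</p></sec>
     </body></article>
   </pmc-articleset>"
)

das_secs <- xml_find_all(toy_set, ".//sec[@sec-type='data-availability']")

# Text of the paragraphs only, so the section title is not prepended.
das_sec_text <- vapply(
  das_secs,
  function(node) paste(xml_text(xml_find_all(node, "./p")), collapse = " "),
  character(1)
)

# Number of distinct articles that contain such a section.
n_sec_entries <- length(unique(sub("(article\\[[0-9]+\\]).*", "\\1", xml_path(das_secs))))
```

Counting distinct article prefixes rather than nodes keeps a paper with two such sections from being counted twice.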

Custom Meta

da_custom <- parsed_pmc %>% 
  xml_find_all(".//custom-meta[@id='data-availability']") %>% 
  xml_path() %>% 
  str_extract("/pmc-articleset/article\\[\\d+\\]") %>% 
  length()

da_custom_unique <- parsed_pmc %>% 
  xml_find_all(".//custom-meta[@id='data-availability']") %>% 
  xml_path() %>% 
  str_extract("/pmc-articleset/article\\[\\d+\\]") %>% 
  unique() %>% 
  length()

398 entries had DASs in ‘custom-meta’ elements, and all 398 were unique, meaning that no entry had more than one ‘custom-meta’ DAS.

Checking for Missing XPaths

TODO By combining all of the previously searched for XPaths and filtering to unique entries, we can check for any missing XML paths containing DA statements.

total_xpath_entries <- parsed_pmc %>% 
  xml_find_all("//notes[@notes-type='data-availability']|//sec[@sec-type='data-availability']|//custom-meta[@id='data-availability']") %>% 
  xml_path() %>% 
  str_extract("/pmc-articleset/article\\[\\d+\\]") %>% 
  unique() %>% 
  length()
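
One possible way to list the articles that none of these XPaths reach is to compare the covered article paths with all article paths. The sketch below does this on a toy three-article set invented for illustration; the helper `article_of` is hypothetical.

```r
library(xml2)

# Toy set: two articles use a recognised location, the third does not.
toy_set <- read_xml(
  "<pmc-articleset>
     <article><back><notes notes-type='data-availability'><p>On request.</p></notes></back></article>
     <article><body><sec sec-type='data-availability'><p>In Dryad.</p></sec></body></article>
     <article><body><sec><title>Data sharing</title><p>See repository.</p></sec></body></article>
   </pmc-articleset>"
)

das_xpath <- paste(
  "//notes[@notes-type='data-availability']",
  "//sec[@sec-type='data-availability']",
  "//custom-meta[@id='data-availability']",
  sep = "|"
)

# Reduce a node path to the path of its enclosing article.
article_of <- function(paths) {
  sub("^(/pmc-articleset/article(\\[[0-9]+\\])?).*", "\\1", paths)
}

all_articles     <- xml_path(xml_find_all(toy_set, "/pmc-articleset/article"))
covered_articles <- unique(article_of(xml_path(xml_find_all(toy_set, das_xpath))))
missing_articles <- setdiff(all_articles, covered_articles)
```

The paths in `missing_articles` can then be inspected by hand to find where those papers keep their statements.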

Code Repositories

This section will look at the usage of code repositories (e.g., GitHub, GitLab) in these publications.
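
As a starting point, repository mentions could be detected with a small dictionary of patterns. The list below is an assumed, non-exhaustive set, and the example statement and its URL are invented for illustration.

```r
# Assumed, non-exhaustive repository names and the patterns used to spot them.
repo_patterns <- c(
  GitHub   = "github",
  GitLab   = "gitlab",
  Zenodo   = "zenodo",
  Figshare = "figshare",
  Dryad    = "dryad",
  GEO      = "\\bGEO\\b|gene expression omnibus",
  dbGaP    = "dbgap"
)

# Return the names of all repositories mentioned in a statement.
find_repositories <- function(text) {
  hits <- vapply(repo_patterns,
                 function(p) grepl(p, text, ignore.case = TRUE, perl = TRUE),
                 logical(1))
  names(repo_patterns)[hits]
}

# Invented statement; the URL is a placeholder.
example_statement <- paste(
  "Code is available at https://github.com/example/project and",
  "expression data were deposited in the Gene Expression Omnibus (GEO)."
)
repos_found <- find_repositories(example_statement)
```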

Citation Information

This section will contain information both about the number of citations that each entry has as well as information about other publications that reference the original entries.

Supplementary Data

This section will contain information about the number of supplementary files and the file types that each publication contains.
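
A possible sketch of that count, assuming supplementary files appear as supplementary-material elements whose file name sits in an xlink:href attribute. The toy article and file names are invented, and real records may place the attribute differently.

```r
library(xml2)

# Toy article; element layout and file names are assumptions for illustration.
toy_article <- read_xml(
  "<article xmlns:xlink='http://www.w3.org/1999/xlink'><body>
     <supplementary-material><media xlink:href='tables.xlsx'/></supplementary-material>
     <supplementary-material><media xlink:href='figures.pdf'/></supplementary-material>
     <supplementary-material><media xlink:href='methods.pdf'/></supplementary-material>
   </body></article>"
)

# Select href attributes by local name to sidestep namespace prefixes.
supp_files <- xml_text(
  xml_find_all(toy_article, ".//supplementary-material//@*[local-name()='href']")
)

n_supp_files <- length(supp_files)
supp_types   <- table(tolower(tools::file_ext(supp_files)))
```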

Funding Information

This section will cover funding information.

Grant Types

This section will explore the various grant types that are cited in the publications (R01, P30, U01, etc).
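
A minimal sketch of pulling grant types out of acknowledgment text, assuming grant numbers are written as an activity code, a two-letter institute code, and a six-digit serial (for example "R01 ES012345"). The pattern, the function `parse_grants`, and the grant numbers in the example are all placeholders; real acknowledgments use many other spellings.

```r
# Assumed pattern: activity code + institute code + six-digit serial,
# optionally separated by a space or hyphen.
grant_pattern <- "\\b([A-Z][0-9A-Z]{2})[ -]?([A-Z]{2})[ -]?([0-9]{6})\\b"

parse_grants <- function(text) {
  found <- regmatches(text, gregexpr(grant_pattern, text, perl = TRUE))[[1]]
  parts <- regmatches(found, regexec(grant_pattern, found, perl = TRUE))
  data.frame(
    grant     = found,
    activity  = vapply(parts, `[`, character(1), 2),
    institute = vapply(parts, `[`, character(1), 3),
    serial    = vapply(parts, `[`, character(1), 4),
    stringsAsFactors = FALSE
  )
}

# Made-up acknowledgment text; the grant numbers are placeholders.
ack <- "Supported by NIH grants R01 ES000001, P30ES000002 and U01-CA000003."
grants <- parse_grants(ack)

# Keep NIEHS grants only (institute code "ES") and tally the grant types.
niehs_grants   <- grants[grants$institute == "ES", ]
activity_tally <- table(niehs_grants$activity)
```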

Programs/Universities

This section will explore the different programs and universities that are generating publications with DAS’s.

Funding Source Influence

This section will explore whether there is any relationship between data sharing and whether publications cite intramural grants, extramural grants, or a combination, and whether the level of data sharing is related to a publication citing funding from multiple ICs.
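
One way this could be set up, once grants have been parsed per publication, is sketched below. It assumes that intramural projects carry activity codes beginning with "Z" (for example ZIA), which should be confirmed before use; the table of publications and grants is invented for illustration.

```r
# One row per (publication, grant); values are invented for illustration.
pub_grants <- data.frame(
  pmid      = c(1, 1, 2, 3, 3),
  activity  = c("R01", "ZIA", "P30", "R01", "U01"),
  institute = c("ES", "ES", "ES", "ES", "CA"),
  stringsAsFactors = FALSE
)

# Assumption: intramural projects have activity codes starting with "Z".
pub_grants$intramural <- startsWith(pub_grants$activity, "Z")

# Label a publication by the mix of grants it cites.
funding_source <- function(intramural) {
  if (all(intramural)) {
    "intramural"
  } else if (!any(intramural)) {
    "extramural"
  } else {
    "combination"
  }
}

by_pub <- split(pub_grants, pub_grants$pmid)
pub_summary <- data.frame(
  pmid           = as.numeric(names(by_pub)),
  funding_source = unname(vapply(by_pub, function(d) funding_source(d$intramural), character(1))),
  n_institutes   = unname(vapply(by_pub, function(d) length(unique(d$institute)), integer(1))),
  stringsAsFactors = FALSE
)
```

The resulting `pub_summary` has one row per publication and can be joined to the DAS categories by PMID.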

RCR Value

This section will explore if there is a relationship between data sharing and a higher RCR value.

NIEHS Publications on PubMed Central with Data Availability Information