Kristianna Pettibone is working on a project to evaluate grantee approaches to data sharing.
She is submitting a proposal for a Data Fellow, and one idea included in the proposal is to have that person work on this activity and figure out how to automate it. To do that, they need a manually coded data file that can be used to train a machine learning model.
(Note) I believe that natural-language processing would be the best approach here, but even that may not be necessary. Many of the questions being asked can be answered with regex and string parsing.
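As a rough illustration of the regex route, the sketch below pulls every URL and DOI out of a single statement with stringr. The sample statement (including the repository name and the DOI suffix), the patterns, and the helper name `extract_links` are all made up for illustration and are not part of the project code.

```r
library(stringr)

# Hypothetical helper: return all URLs and DOIs found in one statement.
extract_links <- function(das_text) {
  # URLs: "http(s)://" followed by non-space characters
  urls <- str_extract_all(das_text, "https?://[^\\s<>\"]+")[[1]]
  # Drop sentence punctuation that was captured at the end of a match
  urls <- str_remove(urls, "[.,;)]+$")
  # Bare DOIs: "10.", a 4-9 digit prefix, a slash, then a suffix
  dois <- str_extract_all(das_text, "\\b10\\.\\d{4,9}/[^\\s<>\"]+")[[1]]
  dois <- str_remove(dois, "[.,;)]+$")
  list(urls = unique(urls), dois = unique(dois))
}

# Made-up statement; the GitHub repository and DOI suffix are placeholders.
example_das <- paste(
  "Sequencing data are available at https://www.ncbi.nlm.nih.gov/geo/.",
  "Analysis code is at https://github.com/example-lab/example-repo",
  "and the dataset is archived under doi 10.5061/dryad.abc123."
)
links <- extract_links(example_das)
```

Because `str_extract_all()` returns every match, a statement that points to several locations yields several values rather than only the first.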
Through a search in PubMed Central (a full-text open access publication repository), we’ve identified around 1600 publications that have statements declaring data availability, code availability, and source data availability. These are a mix of standard text and free text.
The PMC search that limits results to grantee publications with Data Availability Statements (DAS) in the past five years is: niehs[gr] AND (has data avail[filter] OR has data citations[filter]) AND "last 5 years"[PDat]
At a minimum, to reduce manual curation, they want a script that extracts the text of the data, code, and source data sections (where they exist) from the XML, along with the acknowledgments section for grantee information, and populates the appropriate columns in an Excel file.
Even better, if some precise extraction could be done to:
Would need to support having multiple answers (e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7648103/).
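A minimal sketch of the baseline script described above, assuming the sections can be located by XPath. The inline XML is a hand-written stand-in for the PMC output, and the helper name `all_text`, the column names, and the `" | "` separator are illustrative choices. When an article has more than one matching section, all of them are kept.

```r
library(xml2)

# Hand-written stand-in for the PMC XML; the contents are made up.
example_set <- read_xml(
  "<pmc-articleset>
     <article>
       <front><article-meta><article-id pub-id-type='pmc'>1</article-id></article-meta></front>
       <body>
         <sec sec-type='data-availability'><p>Data are in the paper.</p></sec>
         <sec sec-type='data-availability'><p>Code is available on request.</p></sec>
       </body>
       <back><ack><p>Funded by NIEHS.</p></ack></back>
     </article>
     <article>
       <front><article-meta><article-id pub-id-type='pmc'>2</article-id></article-meta></front>
       <back><ack><p>We thank the participants.</p></ack></back>
     </article>
   </pmc-articleset>"
)

# Hypothetical helper: for each article, collapse the text of every node
# matching `xpath` into one string, or NA when the article has no match.
all_text <- function(articles, xpath) {
  vapply(articles, function(article) {
    hits <- xml_text(xml_find_all(article, xpath))
    if (length(hits) == 0) NA_character_ else paste(hits, collapse = " | ")
  }, character(1))
}

articles <- xml_find_all(example_set, "/pmc-articleset/article")
das_table <- data.frame(
  pmc = all_text(articles, "./front/article-meta/article-id[@pub-id-type='pmc']"),
  data_availability = all_text(articles, ".//sec[@sec-type='data-availability']"),
  acknowledgments = all_text(articles, "./back/ack"),
  stringsAsFactors = FALSE
)
```

The resulting data frame has one row per article and could then be written to a spreadsheet with whichever export package the project settles on.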
Below are links to three test publications as well as the XML for all three pubs.
To get the XML from the PubMed/PMC API, we use a few functions to query the NCBI E-utilities. This process was run only once, and the result was saved to an XML file from which the rest of the work was done.
First we save all of the query parameters into a list to be used when we make a GET call to the API.
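# Packages used throughout this document
library(httr)       # GET(), content()
library(xml2)       # read_xml(), xml_find_all(), xml_text(), ...
library(here)       # here()
library(tidyverse)  # dplyr, tibble, stringr, readr
query_url <- list(db = "pmc",
                  tool = "niehs-lit-scraper",
                  email = "trey.saddler",
                  term = 'niehs[gr] AND (has data avail[filter] OR has data citations[filter]) AND "last 5 years"[PDat]',
                  retmax = 10000,
                  usehistory = "y")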
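Then we run the query against the E-utilities esearch endpoint to get the list of matching PMC IDs. Because usehistory is set to "y", the result set is also stored on the NCBI history server for the next API call.
pubmed_get <- GET(url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
                  query = query_url)
In the next step, we parse the response into XML so that we can extract the WebEnv for the next API call. The query_key is hard-coded to 1 in that call, which is the key assigned to the first search in a new history session.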
parsed_pubmed_get <- content(pubmed_get, as = "parsed")
webenv <- xml_text(xml_find_first(parsed_pubmed_get, "//WebEnv"))
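The same response also reports the total hit count and the query key, so both can be read from the XML rather than assumed. The sketch below runs against a hand-written stand-in for an esearch response; the WebEnv token is a placeholder, and the element layout is our understanding of the esearch output rather than something captured from the original run.

```r
library(xml2)

# Hand-written stand-in for an esearch response; the token is a placeholder.
esearch_xml <- read_xml(
  "<eSearchResult>
     <Count>1600</Count>
     <RetMax>2</RetMax>
     <QueryKey>1</QueryKey>
     <WebEnv>EXAMPLE_WEBENV_TOKEN</WebEnv>
     <IdList><Id>7648103</Id><Id>6538374</Id></IdList>
   </eSearchResult>"
)

# Absolute paths avoid picking up nested elements with the same name.
total_hits <- xml_integer(xml_find_first(esearch_xml, "/eSearchResult/Count"))
query_key  <- xml_integer(xml_find_first(esearch_xml, "/eSearchResult/QueryKey"))
webenv     <- xml_text(xml_find_first(esearch_xml, "/eSearchResult/WebEnv"))
```

Comparing `total_hits` with `retmax` is a quick check that the search was not truncated.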
Now we can build our query parameters to grab the full XML output for all entries using the information from the previous API call.
pmc_query <- list(db = "pmc",
                  tool = "niehs-lit-scraper",
                  email = "trey.saddler",
                  retmax = 10000,
                  usehistory = "y",
                  query_key = 1,
                  WebEnv = webenv)
pmc_get <- GET(url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
               query = pmc_query)
parsed_pmc <- content(pmc_get, as = "parsed")
And finally we can write this XML document to file for future work, creating a snapshot in time.
write_xml(parsed_pmc, here("output", "pmc_output.xml"))
Here we re-import the saved XML document so that later operations start from a clean slate.
parsed_pmc <- read_xml(here("output", "pmc_output.xml"))
DASs appear in several places in the XML returned from a PubMed Central search. To extract all information about data availability, we need to track down every location where a statement may be stored.
TODO
Titles of DA sections in the notes took many different forms. Not all of these DA statements carried the XML attribute notes-type="data-availability", so they cannot be found by attribute alone. Therefore, we need to manually select which titles meet our criteria and use those titles for later searches.
The full list of titles of the notes sections is:
Author contribution, Author Contribution, Author contributions, Author Contributions, Author’s contributions, Authors’ contributions, Authors’ Contributions, Authors’ information, Authors’s contributions, Availability of Data and Code, Availability of data and materials, Code availability, Code Availability, Competing interests, Competing Interests, Compliance with ethical standards, Compliance with Ethical Standards, Conflict of interest, Conflict of Interest, Conflicts of Interest, Consent for publication, Data availability, Data Availability, Data availability and ethical considerations, Data Availability Statement, Data availability:, Data Availbility Statements, Disclaimer, Ethical approval, Ethical Approval, Ethics approval, Ethics approval and consent to participate, Funding, Funding Information, Informed consent, Notes, Publisher’s Note, Resource sharing, Supporting Information Available
From these, we selected the following subset of titles:
das_notes_titles <- c("Availability of Data and Code",
                      "Availability of data and materials",
                      "Code availability",
                      "Code Availability",
                      "Data availability",
                      "Data Availability",
                      "Data availability and ethical considerations",
                      "Data Availability Statement",
                      "Data availability:",
                      "Data Availbility Statements",
                      "Resource sharing",
                      "Supporting Information Available"
)
Using these titles, we can search for any titles that match.
das_notes_no_attributes_title <- parsed_pmc %>%
  xml_find_all(".//notes/title") %>%
  xml_text()
das_notes_no_attributes_title_xpath <- parsed_pmc %>%
  xml_find_all(".//notes/title") %>%
  xml_path()
das_notes_title_noa <- tibble(das_notes_no_attributes_title,
                              xpath = das_notes_no_attributes_title_xpath)
# Keep only the selected titles and trim each XPath back to its parent notes node
das_notes_noa_filtered <- das_notes_title_noa %>%
  filter(das_notes_no_attributes_title %in% das_notes_titles) %>%
  mutate(xpath = stringr::str_extract(xpath, "/pmc-articleset/article\\[\\d+\\]/(back|front)/notes(\\[\\d+\\]|)"))
das_notes_noa_text <- lapply(das_notes_noa_filtered$xpath, function(x){xml_find_all(parsed_pmc, x)})
das_notes_noa_text_extracted <- lapply(das_notes_noa_text, xml_text) %>%
  unlist()
das_notes_noa_filtered <- das_notes_noa_filtered %>%
  add_column(raw_text = das_notes_noa_text_extracted)
The total number of entries with DASs with a proper notes-type attribute in the notes is 296. The number of entries with multiple statements with proper notes-type attributes in the notes is 32.
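Counts of this kind can be derived from the node XPaths alone by reducing each path to its article and tallying. The paths below are made up, in the same shape as the xpath column, and the variable names are illustrative.

```r
library(stringr)

# Made-up XPaths in the same shape as the xpath column used in this document.
example_xpaths <- c(
  "/pmc-articleset/article[1]/back/notes",
  "/pmc-articleset/article[2]/back/notes[1]",
  "/pmc-articleset/article[2]/back/notes[2]",
  "/pmc-articleset/article[5]/front/notes"
)

# Reduce each path to its article node, then count statements per article.
article_nodes <- str_extract(example_xpaths, "/pmc-articleset/article\\[\\d+\\]")
statements_per_article <- table(article_nodes)

n_articles <- length(statements_per_article)            # articles with any statement
n_articles_multiple <- sum(statements_per_article > 1)  # articles with several
```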
# Adding additional columns for export
das_notes_noa_csv <- das_notes_noa_filtered %>%
  rename(node_xpath = xpath)
# xml_find_first() yields a missing node (NA after xml_integer()) when an article
# has no PMID, which keeps this vector the same length as the table
das_notes_xpath_pmid <- das_notes_noa_csv$node_xpath %>%
  str_extract("/pmc-articleset/article\\[\\d+\\]") %>%
  str_c("/front/article-meta/article-id[@pub-id-type='pmid']") %>%
  lapply(function(x){xml_find_first(parsed_pmc, x)}) %>%
  lapply(xml_integer) %>%
  unlist()
das_notes_xpath_pmc <- das_notes_noa_csv$node_xpath %>%
  str_extract("/pmc-articleset/article\\[\\d+\\]") %>%
  str_c("/front/article-meta/article-id[@pub-id-type='pmc']") %>%
  lapply(function(x){xml_find_first(parsed_pmc, x)}) %>%
  lapply(xml_integer) %>%
  unlist()
# Lookbehind pattern that keeps only the text following the section title
rm_title_raw_text <- paste0("(?<=(", paste(das_notes_titles, collapse = "|"), ")).*")
das_notes_noa_csv_text <- das_notes_noa_csv %>%
  mutate_at("raw_text", ~str_extract(., pattern = rm_title_raw_text))
das_notes_noa_filtered <- das_notes_noa_csv_text %>%
  add_column(pmid = das_notes_xpath_pmid,
             pmc = das_notes_xpath_pmc,
             appears_in = "notes") %>%
  rename(note_title = das_notes_no_attributes_title) %>%
  select(pmid, pmc, note_title, everything())
write_csv(das_notes_noa_filtered, here("output", "das_notes_raw_text.csv"))
tally_notes <- das_notes_noa_filtered %>%
  count(raw_text) %>%
  top_n(5, n) %>%
  arrange(desc(n))
| raw_text | n |
|---|---|
| All relevant data are within the paper and its Supporting Information files. | 107 |
| All relevant data are within the manuscript and its Supporting Information files. | 25 |
| All relevant data are within the paper. | 24 |
| The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. | 17 |
| Not applicable. | 4 |
The total number of entries with a data-availability statement as a section in the paper with an XPath of .//sec[@sec-type='data-availability'] is 191.
One paper had data availability statements in both the ‘Materials and Methods’ section and after the ‘Supplementary Material’ section. This paper had the PMC ID of 6538374.
da_custom <- parsed_pmc %>%
  xml_find_all(".//custom-meta[@id='data-availability']") %>%
  xml_path() %>%
  str_extract("/pmc-articleset/article\\[\\d+\\]") %>%
  length()
da_custom_unique <- parsed_pmc %>%
  xml_find_all(".//custom-meta[@id='data-availability']") %>%
  xml_path() %>%
  str_extract("/pmc-articleset/article\\[\\d+\\]") %>%
  unique() %>%
  length()
398 entries had DASs in ‘custom-meta’ sections. All 398 were unique, meaning that no entry had multiple ‘custom-meta’ DASs.
TODO By combining all of the XPaths searched so far and filtering to unique entries, we can check for XML paths containing DA statements that we have missed.
total_xpath_entries <- parsed_pmc %>%
  xml_find_all("//notes[@notes-type='data-availability']|//sec[@sec-type='data-availability']|//custom-meta[@id='data-availability']") %>%
  xml_path() %>%
  str_extract("/pmc-articleset/article\\[\\d+\\]") %>%
  unique() %>%
  length()
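Going one step beyond the count, a set difference against the full list of articles shows which entries none of the known XPaths reach. This is only a sketch against a three-article stand-in document; the XML content and the variable names are made up.

```r
library(xml2)
library(stringr)

# Hand-written stand-in: the third article keeps its statement in an
# untagged section, so none of the known XPaths find it.
example_set <- read_xml(
  "<pmc-articleset>
     <article><back><notes notes-type='data-availability'><p>A</p></notes></back></article>
     <article><body><sec sec-type='data-availability'><p>B</p></sec></body></article>
     <article><body><sec><title>Data sharing</title><p>C</p></sec></body></article>
   </pmc-articleset>"
)

das_xpath <- paste(
  "//notes[@notes-type='data-availability']",
  "//sec[@sec-type='data-availability']",
  "//custom-meta[@id='data-availability']",
  sep = " | "
)
article_pattern <- "/pmc-articleset/article\\[\\d+\\]"

# Paths of every article, and of the articles reached by the known XPaths.
all_articles <- xml_path(xml_find_all(example_set, "/pmc-articleset/article"))
covered_articles <- unique(
  str_extract(xml_path(xml_find_all(example_set, das_xpath)), article_pattern)
)

# Articles that still need a manual look for an unrecognized DAS location.
missing_articles <- setdiff(all_articles, covered_articles)
```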
What do authors say in their Data Availability Statements? We’ll review this information and assess where they are on the data sharing spectrum – from least accessible to most accessible (e.g., call me to get the data vs. data are in an easily findable, searchable, publicly accessible data repository).
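As a first pass at placing statements on that spectrum, simple keyword rules may already separate the most common cases. The category labels, the keyword patterns, and the function name `classify_das` below are assumptions for illustration, not an agreed coding scheme. The three sample statements are taken from the tally table in this document.

```r
library(stringr)

# Hypothetical rule-based classifier, checked from most to least accessible.
classify_das <- function(das_text) {
  text <- str_to_lower(das_text)
  if (str_detect(text, "https?://|doi\\.org|accession|repository|deposited")) {
    "public repository"
  } else if (str_detect(text, "within the (paper|manuscript|article)|supporting information|supplementary")) {
    "in paper or supplement"
  } else if (str_detect(text, "(on|upon) (reasonable )?request|from the corresponding author")) {
    "on request"
  } else {
    "unclassified"
  }
}

examples <- c(
  "All relevant data are within the paper and its Supporting Information files.",
  "The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.",
  "Not applicable."
)
categories <- vapply(examples, classify_das, character(1), USE.NAMES = FALSE)
```

Statements that land in "unclassified" are the ones that would still need manual coding.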
This section will look at the usage of code repositories (e.g., GitHub, GitLab) in these publications.
This section will contain information both about the number of citations that each entry has as well as information about other publications that reference the original entries.
This section will contain information about the number of supplementary files and the file types that each publication contains.
This section will cover funding information.
This section will explore the various grant types that are cited in the publications (R01, P30, U01, etc).
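Grant types could be tallied by pulling the activity code off the front of each cited grant number. The sketch below assumes grant numbers are written as an activity code followed by the NIEHS institute code "ES" and a six-digit serial number; the funding text and grant numbers are placeholders, not real awards.

```r
library(stringr)

# Made-up funding text; the grant numbers are placeholders.
funding_text <- "Supported by NIH grants R01 ES000001, P30ES000002 and U01-ES000003."

# Activity code: one letter followed by two letters or digits. Requiring
# "ES" plus six digits right after it limits false matches.
activity_codes <- str_match_all(
  funding_text,
  "\\b([A-Z][A-Z0-9]{2})[ -]?ES\\d{6}\\b"
)[[1]][, 2]

grant_type_counts <- table(activity_codes)
```

Grants from other institutes would need their own institute codes added to the pattern.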
This section will explore the different programs and universities that are generating publications with DAS’s.
This section will explore whether there is any relationship between data sharing and whether publications cite intramural grants, extramural grants, or a combination. It will also explore whether there is any relationship between the level of data sharing and whether a publication cites funding from multiple ICs.
This section will explore if there is a relationship between data sharing and a higher RCR value.