Example: Parsing PubMed Records Using Custom Functions

Preparation

If you have an Entrez API key, save it in the variable ENTREZ_KEY. This example uses the easyPubMed package to retrieve one article.

# Replace NULL with your own Entrez API key if you have one
ENTREZ_KEY <- NULL
library(easyPubMed)
library(xml2)

# Copy the functions into your working directory and source them.
# If the functions are stored elsewhere, specify their path.
source('getMetadata.R')
source('getAuthors.R')
source('makeDfFromXmlTxt.R')
source('nodeList2Df.R')
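
Rather than typing the key into the script, it could be read from an environment variable. The following is a minimal sketch in base R; the helper read_entrez_key and the variable name ENTREZ_API_KEY are hypothetical names chosen for illustration, not something easyPubMed requires.

```r
# Hypothetical helper: read an Entrez API key from an environment variable.
# The variable name "ENTREZ_API_KEY" is an assumption made for this sketch.
read_entrez_key <- function(var = "ENTREZ_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  # Return NULL when the variable is missing or empty,
  # which matches the default used in this example.
  if (nzchar(key)) key else NULL
}

ENTREZ_KEY <- read_entrez_key()
```

With this approach the script still runs when no key is set, because ENTREZ_KEY simply falls back to NULL.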

The custom functions depend on the xml2 package, so it must be loaded before they are called.

Next, we need at least one PubMed record.

query <- 'cancer[TI] AND "2020"[dp] AND "journal article"[pt] AND "free full text"[sb] AND adolescent[mh]'

search <- get_pubmed_ids(query, api_key = ENTREZ_KEY)

oneRecordCancer2020 <- fetch_pubmed_data(search, retmax = 1)

doc <- xml2::read_xml(oneRecordCancer2020) |>
  xml2::xml_find_all(".//PubmedArticle")

The oneRecordCancer2020 object contains one PubMed record, which is the record used for parsing. Because the getMetadata function works with xml2 objects, the record is first read as an xml2 document and its PubmedArticle nodes are extracted into doc.
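
To see what this conversion produces without querying PubMed, here is a small sketch that applies the same two xml2 calls to a hand-written XML string. The toy string is an assumption made for illustration: it only imitates the nesting of a PubMed export and is not real PubMed data.

```r
library(xml2)

# Toy XML that imitates the outer structure of a PubMed export.
# The PMIDs "1" and "2" are placeholders, not real records.
toy <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>1</PMID></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>2</PMID></MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

# Parse the string and keep one node per article
articles <- xml_find_all(read_xml(toy), ".//PubmedArticle")

# One PMID per article node
pmids <- xml_text(xml_find_first(articles, ".//PMID"))
```

Here articles is a node set with one element per PubmedArticle, which is the same kind of object as doc above.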

Parse the PubMed record

To extract the abstract, pass 'abstract' to the metadata.list argument.

recordAbs <-
  getMetadata(doc, metadata.list = 'abstract')

knitr::kable(recordAbs[["abstract"]])

text	Label	id
We previously developed an age-scalable 3D computational phantom that has been widely used for retrospective whole-body dose reconstructions of conventional two-dimensional historic radiation therapy (RT) treatments in late effects studies of childhood cancer survivors. This phantom is modeled in the FORTRAN programming language and is not readily applicable for dose reconstructions for survivors treated with contemporary RT whose treatment plans were designed using computed tomography images and complex treatment fields. The goal of this work was to adapt the current FORTRAN model of our age-scalable computational phantom into Digital Imaging and Communications in Medicine (DICOM) standard so that it can be used with any treatment planning system (TPS) to reconstruct contemporary RT. Additionally, we report a detailed description of the phantom’s age-based scaling functions, information that was not previously published.	PURPOSE	34584772
We developed a Python script that adapts our phantom model from FORTRAN to DICOM. To validate the conversion, we compared geometric parameters for the phantom modeled in FORTRAN and DICOM scaled to ages 1 month, 6 months, 1, 2, 3, 5, 8, 10, 15, and 18 years. Specifically, we calculated the percent differences between the corner points and volume of each body region and the normalized mean square distance (NMSD) between each of the organs. In addition, we also calculated the percent difference between the heights of our DICOM age-scaled phantom and the heights (50th percentile) reported by the World Health Organization (WHO) and Centers for Disease Control and Prevention (CDC) for male and female children of the same ages. Additionally, we calculated the difference between the organ masses for our DICOM phantom and the organ masses for two reference phantoms (from International Comission on Radiation Protection (ICRP) 89 and the University of Florida/National Cancer Institute reference hybrid voxel phantoms) for ages newborn, 1, 5, 10, 15 and adult. Lastly, we conducted a feasibility study using our DICOM phantom for organ dose calculations in a commercial TPS. Specifically, we simulated a 6 MV photon right-sided flank field RT plan for our DICOM phantom scaled to age 3.9 years; treatment field parameters and age were typical of a Wilms tumor RT treatment in the Childhood Cancer Survivor Study. For comparison, the same treatment was simulated using our in-house dose calculation system with our FORTRAN phantom. The percent differences (between FORTRAN and DICOM) in mean dose and percent of volume receiving dose ⩾5 Gy were calculated for two organs at risk, liver and pancreas.	METHOD	34584772
The percent differences in corner points and the volumes of head, neck, and trunk body regions between our phantom modeled in FORTRAN and DICOM agreed within 3%. For all of the ages, the NMSDs were negliglible with a maximum NMSD of 7.80 × 10-2 mm for occiptital lobe of 1 month. The heights of our age-scaled phantom agreed with WHO/CDC data within 7% from infant to adult, and within 2% agreement for ages 5 years and older. We observed that organ masses in our phantom are less than the organ masses for other reference phantoms. Dose calculations done with our in-house calculation system (with FORTRAN phantom) and commercial TPS (with DICOM phantom) agreed within 7%.	RESULTS	34584772
We successfully adapted our phantom model from the FORTRAN language to DICOM standard and validated its geometric consistency. We also demonstrated that our phantom model is representative of population height data for infant to adult, but that the organ masses are smaller than in other reference phantoms and need further refinement. Our age-scalable computational phantom modeled in DICOM standard can be scaled to any age at RT and used within a commercial TPS to retrospectively reconstruct doses from contemporary RT in childhood cancer survivors.	CONCLUSION	34584772
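
A table with this shape (text, Label, id) can also be built by hand with xml2. The sketch below is not the implementation of getMetadata, which is not shown in this example; it only illustrates one way to arrive at the same columns, using a toy record with placeholder text and a placeholder PMID.

```r
library(xml2)

# Toy record with a structured abstract; text and PMID are placeholders.
toy <- '<PubmedArticle><MedlineCitation><PMID>0</PMID><Article><Abstract>
  <AbstractText Label="PURPOSE">First toy sentence.</AbstractText>
  <AbstractText Label="RESULTS">Second toy sentence.</AbstractText>
</Abstract></Article></MedlineCitation></PubmedArticle>'

article <- read_xml(toy)
sections <- xml_find_all(article, ".//AbstractText")

# One row per abstract section; the single PMID is recycled across rows
abstract_df <- data.frame(
  text = xml_text(sections),
  Label = xml_attr(sections, "Label"),
  id = xml_text(xml_find_first(article, ".//PMID")),
  stringsAsFactors = FALSE
)
```

Each AbstractText node becomes one row, which is why a structured abstract yields several rows for a single id.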

You can request other metadata fields, or several at once, by passing a character vector to metadata.list.

recordMetadata <-
  getMetadata(doc,
              metadata.list = c('keywordTerms','journal'))

knitr::kable(recordMetadata$keywordTerms)

text	MajorTopicYN	id
computational phantoms	N	34584772
dose reconstruction	N	34584772
late effects	N	34584772
pediatric phantoms	N	34584772

knitr::kable(recordMetadata$journal)

Title	id
Biomedical physics & engineering express	34584772
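
Because every table carries an id column, the fields can be joined afterwards. The sketch below rebuilds the two tables shown above as plain data frames (so it runs without the custom functions) and merges them with base R; the object names are assumptions made for illustration.

```r
# Data frames rebuilt from the two tables printed above
keywordTerms <- data.frame(
  text = c("computational phantoms", "dose reconstruction",
           "late effects", "pediatric phantoms"),
  MajorTopicYN = "N",
  id = "34584772",
  stringsAsFactors = FALSE
)
journal <- data.frame(
  Title = "Biomedical physics & engineering express",
  id = "34584772",
  stringsAsFactors = FALSE
)

# Join on the shared PubMed id: one row per keyword, with the journal title added
keywordsWithJournal <- merge(keywordTerms, journal, by = "id")
```

The result keeps one row per keyword, so fields with several values per record repeat the single-valued fields such as the journal title.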

Currently, the following metadata fields can be passed to the metadata.list argument:

abstract
meshDescrip
meshQualifier
keywordTerms
pubType
journal
country
language
title
year
authors
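
A typo in a field name is easy to make, so it can help to check a request against this list before calling getMetadata. The following is a minimal sketch in base R; check_fields and available_fields are hypothetical names introduced here and are not part of the custom functions.

```r
# Field names copied from the list above
available_fields <- c("abstract", "meshDescrip", "meshQualifier",
                      "keywordTerms", "pubType", "journal", "country",
                      "language", "title", "year", "authors")

# Hypothetical helper: stop early if a requested field is not in the list
check_fields <- function(fields) {
  unknown <- setdiff(fields, available_fields)
  if (length(unknown) > 0) {
    stop("Unknown metadata field(s): ", paste(unknown, collapse = ", "))
  }
  invisible(fields)
}

requested <- check_fields(c("keywordTerms", "journal"))
```

The checked vector can then be passed as metadata.list.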

Due to the structure of the PubMed record, each metadata field requires one of the following functions: getAuthors, makeDfFromXmlTxt, or nodeList2Df.

## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Spanish_Mexico.utf8  LC_CTYPE=Spanish_Mexico.utf8   
## [3] LC_MONETARY=Spanish_Mexico.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=Spanish_Mexico.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] xml2_1.3.3      easyPubMed_2.22 DT_0.27        
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.29     R6_2.5.1          jsonlite_1.8.4    magrittr_2.0.3   
##  [5] evaluate_0.20     cachem_1.0.6      rlang_1.0.6       cli_3.3.0        
##  [9] rstudioapi_0.13   jquerylib_0.1.4   bslib_0.4.2       rmarkdown_2.20.1 
## [13] tools_4.2.1       htmlwidgets_1.5.4 xfun_0.36         yaml_2.3.7       
## [17] fastmap_1.1.0     compiler_4.2.1    htmltools_0.5.4   knitr_1.42       
## [21] sass_0.4.5

juarpasi

2023-05-11
