Libraries

You will need the following libraries installed and loaded for this analysis.
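The original list of libraries is not reproduced here. The functions used throughout this walkthrough (xmlParse, xmlRoot, xmlName, xmlSize and xmlValue) all come from the XML package, so a minimal setup, assuming nothing else is required for this section, might look like this:

```r
# Install the XML package once if it is not already available
if (!requireNamespace("XML", quietly = TRUE)) {
  install.packages("XML")
}

# Load it for the current session
library(XML)
```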

Data

For this exploration of XML with R, I am going to use .xml files downloaded from PubMed, a database of bibliographic citations for scientific articles published in the life sciences and medicine. It is maintained by the National Library of Medicine. You can learn more about this citation repository and download files here.

To begin, let us load some data. There are three files that we will use. The first file is a single article citation for an article on Kaposi’s sarcoma. This file will be used to explore an xml file that contains a single record. The second file is another single article citation, this time for an article about POEMS syndrome. We will use it when we want to execute the same code across multiple single-record files. The third file is an xml file containing 655 article citations for articles about Castleman disease. This file will be useful when we want to execute code across multiple records stored in one file.

Using the xmlParse function, we can read each xml file into a variable.

xml_1 <- xmlParse('kaposis_sarcoma.xml')
xml_2 <- xmlParse('poems_syndrome.xml')
xml_3 <- xmlParse('pubmed_results.xml')
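If you do not have the PubMed files to hand, xmlParse can also read XML from a character string when asText = TRUE is set. The sketch below uses a small, made-up document (toy_xml and toy_doc are hypothetical names, and the content is placeholder data rather than a real citation) that mimics the structure of the files above:

```r
library(XML)

# A tiny, made-up stand-in for a PubMed export (placeholder data only)
toy_xml <- paste0(
  "<PubmedArticleSet>",
  "<PubmedArticle>",
  "<MedlineCitation>",
  "<PMID Version=\"1\">00000001</PMID>",
  "<Article>",
  "<ArticleTitle>A placeholder title.</ArticleTitle>",
  "</Article>",
  "</MedlineCitation>",
  "</PubmedArticle>",
  "</PubmedArticleSet>"
)

# asText = TRUE tells xmlParse that the input is XML content, not a file path
toy_doc <- xmlParse(toy_xml, asText = TRUE)
class(toy_doc)
```

The result has the same class as a document parsed from a file, so the commands in the rest of this section can be tried against it.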

With the xml files parsed, we can now begin to explore the xml files with R. We will begin with the single record files. The following are some basic commands that will allow you to navigate the xml file.

class(xml_1)  # provides the class of the parsed file

## [1] "XMLInternalDocument" "XMLAbstractDocument"
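When several single-record files share the same structure, the same call can be applied to all of them at once. The following is a rough sketch of that idea, assuming the files sit together in one directory. Here the directory and the two placeholder files (record_a.xml and record_b.xml) are created on the fly purely for illustration:

```r
library(XML)

# Write two tiny placeholder files to a temporary directory (made-up content)
tmp_dir <- tempfile("xml_demo_")
dir.create(tmp_dir)
writeLines("<PubmedArticleSet><PubmedArticle/></PubmedArticleSet>",
           file.path(tmp_dir, "record_a.xml"))
writeLines("<PubmedArticleSet><PubmedArticle/></PubmedArticleSet>",
           file.path(tmp_dir, "record_b.xml"))

# Parse every .xml file in the directory into a named list of documents
xml_files <- list.files(tmp_dir, pattern = "\\.xml$", full.names = TRUE)
xml_docs <- lapply(xml_files, xmlParse)
names(xml_docs) <- basename(xml_files)

# Run the same check on each parsed document
doc_classes <- vapply(xml_docs, function(doc) class(doc)[1], character(1))
doc_classes
```

Keeping the parsed documents in a named list means any later function can be applied to every file with lapply or vapply.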

Exploring XML Files

xml files are constructed from a hierarchy of parent and child nodes. If we identify the root node of the xml, we can view the entire xml file or use the root to navigate our way through the hierarchy. Let’s begin by looking at the entire xml file.

xmlRoot(xml_1)

## <PubmedArticleSet>
##   <PubmedArticle>
##     <MedlineCitation Status="In-Data-Review" Owner="NLM">
##       <PMID Version="1">31059555</PMID>
##       <DateRevised>
##         <Year>2019</Year>
##         <Month>05</Month>
##         <Day>16</Day>
##       </DateRevised>
##       <Article PubModel="Electronic-eCollection">
##         <Journal>
##           <ISSN IssnType="Electronic">1553-7374</ISSN>
##           <JournalIssue CitedMedium="Internet">
##             <Volume>15</Volume>
##             <Issue>5</Issue>
##             <PubDate>
##               <Year>2019</Year>
##               <Month>May</Month>
##             </PubDate>
##           </JournalIssue>
##           <Title>PLoS pathogens</Title>
##           <ISOAbbreviation>PLoS Pathog.</ISOAbbreviation>
##         </Journal>
##         <ArticleTitle>Kaposi's sarcoma-associated herpesvirus vIRF2 protein utilizes an IFN-dependent pathway to regulate viral early gene expression.</ArticleTitle>
##         <Pagination>
##           <MedlinePgn>e1007743</MedlinePgn>
##         </Pagination>
##         <ELocationID EIdType="doi" ValidYN="Y">10.1371/journal.ppat.1007743</ELocationID>
##         <Abstract>
##           <AbstractText>Kaposi's sarcoma-associated herpesvirus (KSHV; human herpesvirus 8) belongs to the subfamily of Gammaherpesvirinae and is the etiological agent of Kaposi's sarcoma as well as of two lymphoproliferative diseases: primary effusion lymphoma and multicentric Castleman disease. The KSHV life cycle is divided into a latent and a lytic phase and is highly regulated by viral immunomodulatory proteins which control the host antiviral immune response. Among them is a group of proteins with homology to cellular interferon regulatory factors, the viral interferon regulatory factors 1-4. The KSHV vIRFs are known as inhibitors of cellular interferon signaling and are involved in different oncogenic pathways. Here we characterized the role of the second vIRF protein, vIRF2, during the KSHV life cycle. We found the vIRF2 protein to be expressed in different KSHV positive cells with early lytic kinetics. Importantly, we observed that vIRF2 suppresses the expression of viral early lytic genes in both newly infected and reactivated persistently infected endothelial cells. This vIRF2-dependent regulation of the KSHV life cycle might involve the increased expression of cellular interferon-induced genes such as the IFIT proteins 1, 2 and 3, which antagonize the expression of early KSHV lytic proteins. Our findings suggest a model in which the viral protein vIRF2 allows KSHV to harness an IFN-dependent pathway to regulate KSHV early gene expression.</AbstractText>
##         </Abstract>
##         <AuthorList CompleteYN="Y">
##           <Author ValidYN="Y">
##             <LastName>Koch</LastName>
##             <ForeName>Sandra</ForeName>
##             <Initials>S</Initials>
##             <Identifier Source="ORCID">http://orcid.org/0000-0003-3759-9996</Identifier>
##             <AffiliationInfo>
##               <Affiliation>Hannover Medical School, Institute of Virology, Hannover, Germany.</Affiliation>
##             </AffiliationInfo>
##             <AffiliationInfo>
##               <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##             </AffiliationInfo>
##           </Author>
##           <Author ValidYN="Y">
##             <LastName>Damas</LastName>
##             <ForeName>Modester</ForeName>
##             <Initials>M</Initials>
##             <AffiliationInfo>
##               <Affiliation>Hannover Medical School, Institute of Virology, Hannover, Germany.</Affiliation>
##             </AffiliationInfo>
##             <AffiliationInfo>
##               <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##             </AffiliationInfo>
##           </Author>
##           <Author ValidYN="Y">
##             <LastName>Freise</LastName>
##             <ForeName>Anika</ForeName>
##             <Initials>A</Initials>
##             <Identifier Source="ORCID">http://orcid.org/0000-0003-3312-888X</Identifier>
##             <AffiliationInfo>
##               <Affiliation>Hannover Medical School, Institute of Virology, Hannover, Germany.</Affiliation>
##             </AffiliationInfo>
##             <AffiliationInfo>
##               <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##             </AffiliationInfo>
##           </Author>
##           <Author ValidYN="Y">
##             <LastName>Hage</LastName>
##             <ForeName>Elias</ForeName>
##             <Initials>E</Initials>
##             <AffiliationInfo>
##               <Affiliation>Hannover Medical School, Institute of Virology, Hannover, Germany.</Affiliation>
##             </AffiliationInfo>
##             <AffiliationInfo>
##               <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##             </AffiliationInfo>
##           </Author>
##           <Author ValidYN="Y">
##             <LastName>Dhingra</LastName>
##             <ForeName>Akshay</ForeName>
##             <Initials>A</Initials>
##             <AffiliationInfo>
##               <Affiliation>Hannover Medical School, Institute of Virology, Hannover, Germany.</Affiliation>
##             </AffiliationInfo>
##             <AffiliationInfo>
##               <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##             </AffiliationInfo>
##           </Author>
##           <Author ValidYN="Y">
##             <LastName>Rückert</LastName>
##             <ForeName>Jessica</ForeName>
##             <Initials>J</Initials>
##             <AffiliationInfo>
##               <Affiliation>Hannover Medical School, Institute of Virology, Hannover, Germany.</Affiliation>
##             </AffiliationInfo>
##             <AffiliationInfo>
##               <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##             </AffiliationInfo>
##           </Author>
##           <Author ValidYN="Y">
##             <LastName>Gallo</LastName>
##             <ForeName>Antonio</ForeName>
##             <Initials>A</Initials>
##             <AffiliationInfo>
##               <Affiliation>Heinrich-Pette-Institute, Leibniz Institute for Experimental Virology, Hamburg, Germany.</Affiliation>
##             </AffiliationInfo>
##             <AffiliationInfo>
##               <Affiliation>German Centre for Infection Research, Hamburg Site, Germany.</Affiliation>
##             </AffiliationInfo>
##           </Author>
##           <Author ValidYN="Y">
##             <LastName>Kremmer</LastName>
##             <ForeName>Elisabeth</ForeName>
##             <Initials>E</Initials>
##             <Identifier Source="ORCID">http://orcid.org/0000-0002-8133-8374</Identifier>
##             <AffiliationInfo>
##               <Affiliation>Institute of Molecular Immunology, Helmholtz Centre Munich, German Research Center for Environmental Health, Munich, Germany.</Affiliation>
##             </AffiliationInfo>
##           </Author>
##           <Author ValidYN="Y">
##             <LastName>Tegge</LastName>
##             <ForeName>Werner</ForeName>
##             <Initials>W</Initials>
##             <Identifier Source="ORCID">http://orcid.org/0000-0003-1088-974X</Identifier>
##             <AffiliationInfo>
##               <Affiliation>Helmholtz Centre for Infection Research, Braunschweig, Germany.</Affiliation>
##             </AffiliationInfo>
##           </Author>
##           <Author ValidYN="Y">
##             <LastName>Brönstrup</LastName>
##             <ForeName>Mark</ForeName>
##             <Initials>M</Initials>
##             <Identifier Source="ORCID">http://orcid.org/0000-0002-8971-7045</Identifier>
##             <AffiliationInfo>
##               <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##             </AffiliationInfo>
##             <AffiliationInfo>
##               <Affiliation>Helmholtz Centre for Infection Research, Braunschweig, Germany.</Affiliation>
##             </AffiliationInfo>
##           </Author>
##           <Author ValidYN="Y">
##             <LastName>Brune</LastName>
##             <ForeName>Wolfram</ForeName>
##             <Initials>W</Initials>
##             <Identifier Source="ORCID">http://orcid.org/0000-0002-6078-5255</Identifier>
##             <AffiliationInfo>
##               <Affiliation>Heinrich-Pette-Institute, Leibniz Institute for Experimental Virology, Hamburg, Germany.</Affiliation>
##             </AffiliationInfo>
##             <AffiliationInfo>
##               <Affiliation>German Centre for Infection Research, Hamburg Site, Germany.</Affiliation>
##             </AffiliationInfo>
##           </Author>
##           <Author ValidYN="Y">
##             <LastName>Schulz</LastName>
##             <ForeName>Thomas F</ForeName>
##             <Initials>TF</Initials>
##             <Identifier Source="ORCID">http://orcid.org/0000-0001-8792-5345</Identifier>
##             <AffiliationInfo>
##               <Affiliation>Hannover Medical School, Institute of Virology, Hannover, Germany.</Affiliation>
##             </AffiliationInfo>
##             <AffiliationInfo>
##               <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##             </AffiliationInfo>
##           </Author>
##         </AuthorList>
##         <Language>eng</Language>
##         <PublicationTypeList>
##           <PublicationType UI="D016428">Journal Article</PublicationType>
##         </PublicationTypeList>
##         <ArticleDate DateType="Electronic">
##           <Year>2019</Year>
##           <Month>05</Month>
##           <Day>06</Day>
##         </ArticleDate>
##       </Article>
##       <MedlineJournalInfo>
##         <Country>United States</Country>
##         <MedlineTA>PLoS Pathog</MedlineTA>
##         <NlmUniqueID>101238921</NlmUniqueID>
##         <ISSNLinking>1553-7366</ISSNLinking>
##       </MedlineJournalInfo>
##       <CoiStatement>The authors have declared that no competing interests exist.</CoiStatement>
##     </MedlineCitation>
##     <PubmedData>
##       <History>
##         <PubMedPubDate PubStatus="received">
##           <Year>2018</Year>
##           <Month>08</Month>
##           <Day>25</Day>
##         </PubMedPubDate>
##         <PubMedPubDate PubStatus="accepted">
##           <Year>2019</Year>
##           <Month>03</Month>
##           <Day>31</Day>
##         </PubMedPubDate>
##         <PubMedPubDate PubStatus="revised">
##           <Year>2019</Year>
##           <Month>05</Month>
##           <Day>16</Day>
##         </PubMedPubDate>
##         <PubMedPubDate PubStatus="pubmed">
##           <Year>2019</Year>
##           <Month>5</Month>
##           <Day>7</Day>
##           <Hour>6</Hour>
##           <Minute>0</Minute>
##         </PubMedPubDate>
##         <PubMedPubDate PubStatus="medline">
##           <Year>2019</Year>
##           <Month>5</Month>
##           <Day>7</Day>
##           <Hour>6</Hour>
##           <Minute>0</Minute>
##         </PubMedPubDate>
##         <PubMedPubDate PubStatus="entrez">
##           <Year>2019</Year>
##           <Month>5</Month>
##           <Day>7</Day>
##           <Hour>6</Hour>
##           <Minute>0</Minute>
##         </PubMedPubDate>
##       </History>
##       <PublicationStatus>epublish</PublicationStatus>
##       <ArticleIdList>
##         <ArticleId IdType="pubmed">31059555</ArticleId>
##         <ArticleId IdType="doi">10.1371/journal.ppat.1007743</ArticleId>
##         <ArticleId IdType="pii">PPATHOGENS-D-18-01676</ArticleId>
##       </ArticleIdList>
##     </PubmedData>
##   </PubmedArticle>
## </PubmedArticleSet>

Next, let’s use this root node to learn more about the xml file. Running the code below tells us that the root node is named ‘PubmedArticleSet’ and that it has one child node named ‘PubmedArticle’.

# Look at the root node
xmltop <- xmlRoot(xml_1)  # the root node, with everything beneath it
xmlName(xmltop)           # name of the root node

## [1] "PubmedArticleSet"

xmlSize(xmltop)           # number of child nodes under the root

## [1] 1

xmlName(xmltop[[1]])      # name of the root's first child

## [1] "PubmedArticle"
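The same functions can be applied one level down to see what a single citation contains. The sketch below uses a made-up document (toy_doc, toy_root and child_names are illustrative names) and lists the names of every child of the first article with xmlChildren:

```r
library(XML)

# Made-up document with one article that has two children
toy_doc <- xmlParse(paste0(
  "<PubmedArticleSet>",
  "<PubmedArticle>",
  "<MedlineCitation><PMID>00000001</PMID></MedlineCitation>",
  "<PubmedData><PublicationStatus>epublish</PublicationStatus></PubmedData>",
  "</PubmedArticle>",
  "</PubmedArticleSet>"
), asText = TRUE)

toy_root <- xmlRoot(toy_doc)

xmlName(toy_root)       # name of the root node
xmlSize(toy_root)       # number of children under the root
xmlSize(toy_root[[1]])  # number of children under the first article

# Names of every child of the first article
child_names <- unname(vapply(xmlChildren(toy_root[[1]]), xmlName, character(1)))
child_names
```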

Compare these results with those for our file that holds multiple citation records. The nodes have the same names, and we can see that this one file contains 655 article citations.

# Look at the root node of the multi-record file
xmltop3 <- xmlRoot(xml_3)  # the root node, with everything beneath it
xmlName(xmltop3)           # name of the root node

## [1] "PubmedArticleSet"

xmlSize(xmltop3)           # number of child nodes under the root

## [1] 655

xmlName(xmltop3[[1]])      # name of the root's first child

## [1] "PubmedArticle"
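Because xmlSize reports how many records sit under the root, it can drive a loop over every record in a multi-record file. The following is a small sketch on a made-up, three-record document (the variable names are illustrative), counting the children of each record:

```r
library(XML)

# Made-up multi-record document: three articles with differing contents
toy_doc <- xmlParse(paste0(
  "<PubmedArticleSet>",
  "<PubmedArticle><MedlineCitation/><PubmedData/></PubmedArticle>",
  "<PubmedArticle><MedlineCitation/></PubmedArticle>",
  "<PubmedArticle><MedlineCitation/><PubmedData/></PubmedArticle>",
  "</PubmedArticleSet>"
), asText = TRUE)
toy_root <- xmlRoot(toy_doc)

# Number of records under the root
n_records <- xmlSize(toy_root)

# Visit each record by index and count its child nodes
children_per_record <- vapply(
  seq_len(n_records),
  function(i) as.integer(xmlSize(toy_root[[i]])),
  integer(1)
)
children_per_record
```

A check like this is a quick way to spot records that are missing a section before trying to extract values from them.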

Notice that the last line of code we ran references an index number: [[1]]. In that line, we used the xmlName function to get the name of the node at that index. We can also use index numbers to view the xml content at that position. For example, here is the xml for the first child of our multi-record xml file, which is its first article citation.

xmltop3[[1]]

## <PubmedArticle>
##   <MedlineCitation Status="In-Data-Review" Owner="NLM">
##     <PMID Version="1">31059555</PMID>
##     <DateRevised>
##       <Year>2019</Year>
##       <Month>05</Month>
##       <Day>16</Day>
##     </DateRevised>
##     <Article PubModel="Electronic-eCollection">
##       <Journal>
##         <ISSN IssnType="Electronic">1553-7374</ISSN>
##         <JournalIssue CitedMedium="Internet">
##           <Volume>15</Volume>
##           <Issue>5</Issue>
##           <PubDate>
##             <Year>2019</Year>
##             <Month>May</Month>
##           </PubDate>
##         </JournalIssue>
##         <Title>PLoS pathogens</Title>
##         <ISOAbbreviation>PLoS Pathog.</ISOAbbreviation>
##       </Journal>
##       <ArticleTitle>Kaposi's sarcoma-associated herpesvirus vIRF2 protein utilizes an IFN-dependent pathway to regulate viral early gene expression.</ArticleTitle>
##       <Pagination>
##         <MedlinePgn>e1007743</MedlinePgn>
##       </Pagination>
##       <ELocationID EIdType="doi" ValidYN="Y">10.1371/journal.ppat.1007743</ELocationID>
##       <Abstract>
##         <AbstractText>Kaposi's sarcoma-associated herpesvirus (KSHV; human herpesvirus 8) belongs to the subfamily of Gammaherpesvirinae and is the etiological agent of Kaposi's sarcoma as well as of two lymphoproliferative diseases: primary effusion lymphoma and multicentric Castleman disease. The KSHV life cycle is divided into a latent and a lytic phase and is highly regulated by viral immunomodulatory proteins which control the host antiviral immune response. Among them is a group of proteins with homology to cellular interferon regulatory factors, the viral interferon regulatory factors 1-4. The KSHV vIRFs are known as inhibitors of cellular interferon signaling and are involved in different oncogenic pathways. Here we characterized the role of the second vIRF protein, vIRF2, during the KSHV life cycle. We found the vIRF2 protein to be expressed in different KSHV positive cells with early lytic kinetics. Importantly, we observed that vIRF2 suppresses the expression of viral early lytic genes in both newly infected and reactivated persistently infected endothelial cells. This vIRF2-dependent regulation of the KSHV life cycle might involve the increased expression of cellular interferon-induced genes such as the IFIT proteins 1, 2 and 3, which antagonize the expression of early KSHV lytic proteins. Our findings suggest a model in which the viral protein vIRF2 allows KSHV to harness an IFN-dependent pathway to regulate KSHV early gene expression.</AbstractText>
##       </Abstract>
##       <AuthorList CompleteYN="Y">
##         <Author ValidYN="Y">
##           <LastName>Koch</LastName>
##           <ForeName>Sandra</ForeName>
##           <Initials>S</Initials>
##           <Identifier Source="ORCID">http://orcid.org/0000-0003-3759-9996</Identifier>
##           <AffiliationInfo>
##             <Affiliation>Hannover Medical School, Institute of Virology, Hannover, Germany.</Affiliation>
##           </AffiliationInfo>
##           <AffiliationInfo>
##             <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##         <Author ValidYN="Y">
##           <LastName>Damas</LastName>
##           <ForeName>Modester</ForeName>
##           <Initials>M</Initials>
##           <AffiliationInfo>
##             <Affiliation>Hannover Medical School, Institute of Virology, Hannover, Germany.</Affiliation>
##           </AffiliationInfo>
##           <AffiliationInfo>
##             <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##         <Author ValidYN="Y">
##           <LastName>Freise</LastName>
##           <ForeName>Anika</ForeName>
##           <Initials>A</Initials>
##           <Identifier Source="ORCID">http://orcid.org/0000-0003-3312-888X</Identifier>
##           <AffiliationInfo>
##             <Affiliation>Hannover Medical School, Institute of Virology, Hannover, Germany.</Affiliation>
##           </AffiliationInfo>
##           <AffiliationInfo>
##             <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##         <Author ValidYN="Y">
##           <LastName>Hage</LastName>
##           <ForeName>Elias</ForeName>
##           <Initials>E</Initials>
##           <AffiliationInfo>
##             <Affiliation>Hannover Medical School, Institute of Virology, Hannover, Germany.</Affiliation>
##           </AffiliationInfo>
##           <AffiliationInfo>
##             <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##         <Author ValidYN="Y">
##           <LastName>Dhingra</LastName>
##           <ForeName>Akshay</ForeName>
##           <Initials>A</Initials>
##           <AffiliationInfo>
##             <Affiliation>Hannover Medical School, Institute of Virology, Hannover, Germany.</Affiliation>
##           </AffiliationInfo>
##           <AffiliationInfo>
##             <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##         <Author ValidYN="Y">
##           <LastName>Rückert</LastName>
##           <ForeName>Jessica</ForeName>
##           <Initials>J</Initials>
##           <AffiliationInfo>
##             <Affiliation>Hannover Medical School, Institute of Virology, Hannover, Germany.</Affiliation>
##           </AffiliationInfo>
##           <AffiliationInfo>
##             <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##         <Author ValidYN="Y">
##           <LastName>Gallo</LastName>
##           <ForeName>Antonio</ForeName>
##           <Initials>A</Initials>
##           <AffiliationInfo>
##             <Affiliation>Heinrich-Pette-Institute, Leibniz Institute for Experimental Virology, Hamburg, Germany.</Affiliation>
##           </AffiliationInfo>
##           <AffiliationInfo>
##             <Affiliation>German Centre for Infection Research, Hamburg Site, Germany.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##         <Author ValidYN="Y">
##           <LastName>Kremmer</LastName>
##           <ForeName>Elisabeth</ForeName>
##           <Initials>E</Initials>
##           <Identifier Source="ORCID">http://orcid.org/0000-0002-8133-8374</Identifier>
##           <AffiliationInfo>
##             <Affiliation>Institute of Molecular Immunology, Helmholtz Centre Munich, German Research Center for Environmental Health, Munich, Germany.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##         <Author ValidYN="Y">
##           <LastName>Tegge</LastName>
##           <ForeName>Werner</ForeName>
##           <Initials>W</Initials>
##           <Identifier Source="ORCID">http://orcid.org/0000-0003-1088-974X</Identifier>
##           <AffiliationInfo>
##             <Affiliation>Helmholtz Centre for Infection Research, Braunschweig, Germany.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##         <Author ValidYN="Y">
##           <LastName>Brönstrup</LastName>
##           <ForeName>Mark</ForeName>
##           <Initials>M</Initials>
##           <Identifier Source="ORCID">http://orcid.org/0000-0002-8971-7045</Identifier>
##           <AffiliationInfo>
##             <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##           </AffiliationInfo>
##           <AffiliationInfo>
##             <Affiliation>Helmholtz Centre for Infection Research, Braunschweig, Germany.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##         <Author ValidYN="Y">
##           <LastName>Brune</LastName>
##           <ForeName>Wolfram</ForeName>
##           <Initials>W</Initials>
##           <Identifier Source="ORCID">http://orcid.org/0000-0002-6078-5255</Identifier>
##           <AffiliationInfo>
##             <Affiliation>Heinrich-Pette-Institute, Leibniz Institute for Experimental Virology, Hamburg, Germany.</Affiliation>
##           </AffiliationInfo>
##           <AffiliationInfo>
##             <Affiliation>German Centre for Infection Research, Hamburg Site, Germany.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##         <Author ValidYN="Y">
##           <LastName>Schulz</LastName>
##           <ForeName>Thomas F</ForeName>
##           <Initials>TF</Initials>
##           <Identifier Source="ORCID">http://orcid.org/0000-0001-8792-5345</Identifier>
##           <AffiliationInfo>
##             <Affiliation>Hannover Medical School, Institute of Virology, Hannover, Germany.</Affiliation>
##           </AffiliationInfo>
##           <AffiliationInfo>
##             <Affiliation>German Centre for Infection Research, Hannover-Braunschweig Site, Germany.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##       </AuthorList>
##       <Language>eng</Language>
##       <PublicationTypeList>
##         <PublicationType UI="D016428">Journal Article</PublicationType>
##       </PublicationTypeList>
##       <ArticleDate DateType="Electronic">
##         <Year>2019</Year>
##         <Month>05</Month>
##         <Day>06</Day>
##       </ArticleDate>
##     </Article>
##     <MedlineJournalInfo>
##       <Country>United States</Country>
##       <MedlineTA>PLoS Pathog</MedlineTA>
##       <NlmUniqueID>101238921</NlmUniqueID>
##       <ISSNLinking>1553-7366</ISSNLinking>
##     </MedlineJournalInfo>
##     <CoiStatement>The authors have declared that no competing interests exist.</CoiStatement>
##   </MedlineCitation>
##   <PubmedData>
##     <History>
##       <PubMedPubDate PubStatus="received">
##         <Year>2018</Year>
##         <Month>08</Month>
##         <Day>25</Day>
##       </PubMedPubDate>
##       <PubMedPubDate PubStatus="accepted">
##         <Year>2019</Year>
##         <Month>03</Month>
##         <Day>31</Day>
##       </PubMedPubDate>
##       <PubMedPubDate PubStatus="revised">
##         <Year>2019</Year>
##         <Month>05</Month>
##         <Day>16</Day>
##       </PubMedPubDate>
##       <PubMedPubDate PubStatus="pubmed">
##         <Year>2019</Year>
##         <Month>5</Month>
##         <Day>7</Day>
##         <Hour>6</Hour>
##         <Minute>0</Minute>
##       </PubMedPubDate>
##       <PubMedPubDate PubStatus="medline">
##         <Year>2019</Year>
##         <Month>5</Month>
##         <Day>7</Day>
##         <Hour>6</Hour>
##         <Minute>0</Minute>
##       </PubMedPubDate>
##       <PubMedPubDate PubStatus="entrez">
##         <Year>2019</Year>
##         <Month>5</Month>
##         <Day>7</Day>
##         <Hour>6</Hour>
##         <Minute>0</Minute>
##       </PubMedPubDate>
##     </History>
##     <PublicationStatus>epublish</PublicationStatus>
##     <ArticleIdList>
##       <ArticleId IdType="pubmed">31059555</ArticleId>
##       <ArticleId IdType="doi">10.1371/journal.ppat.1007743</ArticleId>
##       <ArticleId IdType="pii">PPATHOGENS-D-18-01676</ArticleId>
##     </ArticleIdList>
##   </PubmedData>
## </PubmedArticle>

And here is the 13th child node, the 13th article citation, in our multi-record file.

xmltop3[[13]]

## <PubmedArticle>
##   <MedlineCitation Status="PubMed-not-MEDLINE" Owner="NLM">
##     <PMID Version="1">30915224</PMID>
##     <DateRevised>
##       <Year>2019</Year>
##       <Month>03</Month>
##       <Day>29</Day>
##     </DateRevised>
##     <Article PubModel="Electronic-eCollection">
##       <Journal>
##         <ISSN IssnType="Print">2051-3380</ISSN>
##         <JournalIssue CitedMedium="Print">
##           <Volume>7</Volume>
##           <Issue>4</Issue>
##           <PubDate>
##             <Year>2019</Year>
##             <Month>May</Month>
##           </PubDate>
##         </JournalIssue>
##         <Title>Respirology case reports</Title>
##         <ISOAbbreviation>Respirol Case Rep</ISOAbbreviation>
##       </Journal>
##       <ArticleTitle>Unusual presentation of Castleman's disease mimicking lung cancer.</ArticleTitle>
##       <Pagination>
##         <MedlinePgn>e00416</MedlinePgn>
##       </Pagination>
##       <ELocationID EIdType="doi" ValidYN="Y">10.1002/rcr2.416</ELocationID>
##       <Abstract>
##         <AbstractText>Castleman's disease (CD) is an uncommon lymphoproliferative disorder characterized as either unicentric or multicentric presentation based on the involving sites. The most frequent presentation of CD is a solitary mediastinal mass. We reported a patient with a history of heavy smoking with particular image features of CD, which presented as mediastinal lymphadenopathy and peribronchovascular interstitial thickening mimicking lung cancer or sarcoidosis initially.</AbstractText>
##       </Abstract>
##       <AuthorList CompleteYN="Y">
##         <Author ValidYN="Y">
##           <LastName>Chen</LastName>
##           <ForeName>Ming-Tsung</ForeName>
##           <Initials>MT</Initials>
##           <Identifier Source="ORCID">https://orcid.org/0000-0002-2659-8595</Identifier>
##           <AffiliationInfo>
##             <Affiliation>Division of Pulmonary and Critical Care, Department of Internal Medicine Tri-Service General Hospital, National Defense Medical Center Taipei Taiwan.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##         <Author ValidYN="Y">
##           <LastName>Lee</LastName>
##           <ForeName>Shih-Chun</ForeName>
##           <Initials>SC</Initials>
##           <AffiliationInfo>
##             <Affiliation>Division of Thoracic Surgery, Department of Surgery Tri-Service General Hospital, National Defense Medical Center Taipei Taiwan.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##         <Author ValidYN="Y">
##           <LastName>Lu</LastName>
##           <ForeName>Chun-Chi</ForeName>
##           <Initials>CC</Initials>
##           <AffiliationInfo>
##             <Affiliation>Division of Rheumatology/Immunology/Allergy, Department of Internal Medicine Tri-Service General Hospital, National Defense Medical Center Taipei Taiwan.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##         <Author ValidYN="Y">
##           <LastName>Tsai</LastName>
##           <ForeName>Chen-Liang</ForeName>
##           <Initials>CL</Initials>
##           <Identifier Source="ORCID">https://orcid.org/0000-0001-9783-5423</Identifier>
##           <AffiliationInfo>
##             <Affiliation>Division of Pulmonary and Critical Care, Department of Internal Medicine Tri-Service General Hospital, National Defense Medical Center Taipei Taiwan.</Affiliation>
##           </AffiliationInfo>
##         </Author>
##       </AuthorList>
##       <Language>eng</Language>
##       <PublicationTypeList>
##         <PublicationType UI="D002363">Case Reports</PublicationType>
##       </PublicationTypeList>
##       <ArticleDate DateType="Electronic">
##         <Year>2019</Year>
##         <Month>03</Month>
##         <Day>14</Day>
##       </ArticleDate>
##     </Article>
##     <MedlineJournalInfo>
##       <Country>United States</Country>
##       <MedlineTA>Respirol Case Rep</MedlineTA>
##       <NlmUniqueID>101631052</NlmUniqueID>
##       <ISSNLinking>2051-3380</ISSNLinking>
##     </MedlineJournalInfo>
##     <KeywordList Owner="NOTNLM">
##       <Keyword MajorTopicYN="N">Castleman's disease</Keyword>
##       <Keyword MajorTopicYN="N">lymphadenopathy</Keyword>
##       <Keyword MajorTopicYN="N">peribronchovascular interstitial thickening</Keyword>
##     </KeywordList>
##   </MedlineCitation>
##   <PubmedData>
##     <History>
##       <PubMedPubDate PubStatus="received">
##         <Year>2019</Year>
##         <Month>01</Month>
##         <Day>14</Day>
##       </PubMedPubDate>
##       <PubMedPubDate PubStatus="revised">
##         <Year>2019</Year>
##         <Month>02</Month>
##         <Day>17</Day>
##       </PubMedPubDate>
##       <PubMedPubDate PubStatus="accepted">
##         <Year>2019</Year>
##         <Month>02</Month>
##         <Day>23</Day>
##       </PubMedPubDate>
##       <PubMedPubDate PubStatus="entrez">
##         <Year>2019</Year>
##         <Month>3</Month>
##         <Day>28</Day>
##         <Hour>6</Hour>
##         <Minute>0</Minute>
##       </PubMedPubDate>
##       <PubMedPubDate PubStatus="pubmed">
##         <Year>2019</Year>
##         <Month>3</Month>
##         <Day>28</Day>
##         <Hour>6</Hour>
##         <Minute>0</Minute>
##       </PubMedPubDate>
##       <PubMedPubDate PubStatus="medline">
##         <Year>2019</Year>
##         <Month>3</Month>
##         <Day>28</Day>
##         <Hour>6</Hour>
##         <Minute>1</Minute>
##       </PubMedPubDate>
##     </History>
##     <PublicationStatus>epublish</PublicationStatus>
##     <ArticleIdList>
##       <ArticleId IdType="pubmed">30915224</ArticleId>
##       <ArticleId IdType="doi">10.1002/rcr2.416</ArticleId>
##       <ArticleId IdType="pii">RCR2416</ArticleId>
##       <ArticleId IdType="pmc">PMC6417363</ArticleId>
##     </ArticleIdList>
##   </PubmedData>
## </PubmedArticle>

If you are wondering whether you can use index numbers to navigate through the xml hierarchy, you can. Using the xmlName function, we can get the name of the node at different parts of the xml.

xmlName(xmltop[[1]])

## [1] "PubmedArticle"

xmlName(xmltop[[1]][[1]])

## [1] "MedlineCitation"

xmlName(xmltop[[1]][[1]][[1]])

## [1] "PMID"

Using the index values is useful, but is not always easy to read. You can also use node names to walk the hierarchy. For example, here is how we can get to the node with the article’s title.

xmlName(xmltop[['PubmedArticle']][['MedlineCitation']][['Article']][['ArticleTitle']])

## [1] "ArticleTitle"

Dropping the xmlName function, we can view the actual content at this node.

xmltop[[1]][[1]][[1]] #view PMID

## <PMID Version="1">31059555</PMID>

xmltop[['PubmedArticle']][['MedlineCitation']][['Article']][['ArticleTitle']] # view article title

## <ArticleTitle>Kaposi's sarcoma-associated herpesvirus vIRF2 protein utilizes an IFN-dependent pathway to regulate viral early gene expression.</ArticleTitle>
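
To see index-based and name-based navigation side by side without the downloaded files, here is a minimal sketch on a toy document. The XML string, the PMID 11111111, and the variable names (toy_xml, toy_top, by_index, by_name) are invented for illustration; only the element names mirror the Pubmed layout shown above.

```r
library(XML)

# Toy stand-in for a Pubmed record; the content is invented for illustration.
toy_xml <- paste0(
  '<PubmedArticleSet><PubmedArticle><MedlineCitation>',
  '<PMID Version="1">11111111</PMID>',
  '<Article><ArticleTitle>A toy title</ArticleTitle></Article>',
  '</MedlineCitation></PubmedArticle></PubmedArticleSet>'
)
toy_top <- xmlRoot(xmlParse(toy_xml, asText = TRUE))

# Names of the children of MedlineCitation, in document order
child_names <- unname(sapply(xmlChildren(toy_top[[1]][[1]]), xmlName))
child_names

# The same ArticleTitle node reached by position and by name
by_index <- toy_top[[1]][[1]][[2]][[1]]
by_name  <- toy_top[['PubmedArticle']][['MedlineCitation']][['Article']][['ArticleTitle']]
same_node <- identical(xmlName(by_index), xmlName(by_name))
same_node
```

Listing the child names first is a handy way to find out which index to use before you commit to a chain of double brackets.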

Extracting XML Content for Analysis

So far we have looked at how we can read an XML file into a variable, get the root node for that XML, navigate the child nodes through index values or node names, and view the content at a specific node. If you look closely at the last lines of code we ran, we can see the content at a particular node but we are also seeing the xml tags. This is great for looking at our data, but if we want to analyze our data presumably we don’t want the tags. Fortunately, there is another function to extract just the value at a given node: xmlValue. We will work with this function in order to get the content we want.

Let’s start simple. How can I get the article title for a single-record file?

xmlValue(xmltop[['PubmedArticle']][['MedlineCitation']][['Article']][['ArticleTitle']])

## [1] "Kaposi's sarcoma-associated herpesvirus vIRF2 protein utilizes an IFN-dependent pathway to regulate viral early gene expression."

How about the PMID?

xmlValue(xmltop[[1]][[1]][[1]])

## [1] "31059555"

How about the PMID for the 13th article in the multi-record file?

xmlValue(xmltop3[[13]][[1]][[1]])

## [1] "30915224"
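
One behavior of xmlValue is worth knowing before going further: on a node that has children, it returns the text of all descendants pasted together. The sketch below uses an invented author record (the names Doe and Jane are made up) to show the difference between a leaf node and a parent node.

```r
library(XML)

# Invented author record, used only to show how xmlValue behaves.
toy_doc <- xmlParse(
  '<Author><LastName>Doe</LastName><ForeName>Jane</ForeName><Initials>J</Initials></Author>',
  asText = TRUE
)
author_node <- xmlRoot(toy_doc)

# On a leaf node, xmlValue returns that node's text
leaf_value <- xmlValue(author_node[['LastName']])

# On a parent node, xmlValue concatenates the text of every descendant
parent_value <- xmlValue(author_node)

leaf_value    # "Doe"
parent_value  # "DoeJaneJ"
```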

Pretty cool, but writing out all the node names or indices is a bit cumbersome. Fortunately, there is another way to navigate XML that has nothing to do with R and everything to do with XML itself: XPath. XPath is a syntax for describing a navigation path through XML elements. It looks very similar to a filepath. For example, we can navigate to the PMID element with this path: /PubmedArticleSet/PubmedArticle/MedlineCitation/PMID. The XML package provides a function, xpathSApply, that lets us use an XPath to go directly to the matching nodes and then apply a function to each of them.

xpathSApply(xml_1, '/PubmedArticleSet/PubmedArticle/MedlineCitation/PMID', xmlValue)

## [1] "31059555"

We can also use an abbreviated XPath to go directly to the PMID node:

xpathSApply(xmltop, '//PMID', xmlValue)

## [1] "31059555"

This is very useful when you want to extract all the occurrences of a single element from an XML file. We can use the same syntax as above to get all of the PMIDs from our multi-record file. Here xpathSApply returns a character vector, so I will wrap the call in as.data.frame to convert the results to a dataframe.

pmids <- as.data.frame(xpathSApply(xmltop3, '//PMID', xmlValue))
head(pmids)

##   xpathSApply(xmltop3, "//PMID", xmlValue)
## 1                                 31059555
## 2                                 31045763
## 3                                 31040538
## 4                                 31016730
## 5                                 31012139
## 6                                 31004430
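
The column header in that output is the whole deparsed call, which is awkward to work with. A small variation, sketched here on an invented two-record document (the Set and Rec element names and both PMIDs are made up), is to name the column when building the dataframe.

```r
library(XML)

# Invented two-record document for illustration only.
toy_doc <- xmlParse(
  '<Set><Rec><PMID>11111111</PMID></Rec><Rec><PMID>22222222</PMID></Rec></Set>',
  asText = TRUE
)

# Naming the column up front avoids a header built from the deparsed call
pmid_df <- data.frame(pmid = xpathSApply(toy_doc, '//PMID', xmlValue),
                      stringsAsFactors = FALSE)
names(pmid_df)
nrow(pmid_df)
```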

Notice that xpathSApply will also work directly on our parsed xml, xml_3, and not just on its root node.

pmids <- as.data.frame(xpathSApply(xml_3, '//PMID', xmlValue))
head(pmids)

##   xpathSApply(xml_3, "//PMID", xmlValue)
## 1                               31059555
## 2                               31045763
## 3                               31040538
## 4                               31016730
## 5                               31012139
## 6                               31004430

nrow(pmids)

## [1] 693

Notice that the number of PMIDs (693) is greater than the number of articles (655). PMIDs are unique to an article, so the PMID element must appear in more than one place in the XML (for example, in a citations section). One way to avoid getting values for elements that are used in multiple places is to provide the full XPath:

pmids2 <- as.data.frame(xpathSApply(xml_3, '/PubmedArticleSet/PubmedArticle/MedlineCitation/PMID', xmlValue))
head(pmids2)

##   xpathSApply(xml_3, "/PubmedArticleSet/PubmedArticle/MedlineCitation/PMID", xmlValue)
## 1                                                                             31059555
## 2                                                                             31045763
## 3                                                                             31040538
## 4                                                                             31016730
## 5                                                                             31012139
## 6                                                                             31004430

nrow(pmids2)

## [1] 655
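
The difference between the two counts is easier to see on a small case. In this sketch the record is invented, and placing the second PMID inside a CommentsCorrectionsList element is my assumption about where such extra PMIDs can live; the point is only that //PMID matches at any depth while the full path matches one location.

```r
library(XML)

# Invented record: the article's own PMID plus one PMID nested in a citation list.
toy_doc <- xmlParse(paste0(
  '<PubmedArticleSet><PubmedArticle>',
  '<MedlineCitation><PMID>11111111</PMID>',
  '<CommentsCorrectionsList><CommentsCorrections><PMID>22222222</PMID>',
  '</CommentsCorrections></CommentsCorrectionsList>',
  '</MedlineCitation></PubmedArticle></PubmedArticleSet>'
), asText = TRUE)

# '//PMID' matches every PMID element, wherever it sits in the tree
all_pmids <- xpathSApply(toy_doc, '//PMID', xmlValue)

# The full path matches only PMIDs that are direct children of MedlineCitation
main_pmids <- xpathSApply(toy_doc,
                          '/PubmedArticleSet/PubmedArticle/MedlineCitation/PMID',
                          xmlValue)

length(all_pmids)   # 2
length(main_pmids)  # 1
```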

Now we can extract the value of all occurrences of a node in a single-record file and in a multi-record file. How about getting the PMIDs from multiple, single-record files?

# create list of file names
mylist <- as.list(c('kaposis_sarcoma.xml', 'poems_syndrome.xml'))

# function to parse each file and extract PMID value
get_pmids <- function(x){
  xml_data <- xmlParse(x)
  pmid <- xpathSApply(xml_data, '//PMID', xmlValue)
  pmid
}

# apply the function to a single file
get_pmids('kaposis_sarcoma.xml')

## [1] "31059555"

# apply the function to a list of files
data_list <- lapply(mylist, get_pmids)

# combine the results and convert to a dataframe
pmid_all <- as.data.frame(do.call("rbind", data_list), stringsAsFactors = FALSE)
pmid_all

##         V1
## 1 31059555
## 2 31012139

What if your set of files is too long to type or copy all the filenames? There is a command for that too. In the next code chunk we use a base R function, list.files, to get a list of files in the working directory whose names match an xml pattern. The rest of the code is the same as above. This won't work for a combination of single-record files and multi-record files, but it is a good approach when you have hundreds of single-record files all in one directory.

mylist2 <- list.files(pattern = '\\.xml$')  # pattern is a regular expression: names ending in .xml
mylist2

## [1] "kaposis_sarcoma.xml" "poems_syndrome.xml"  "pubmed_results.xml"
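
Here is a self-contained sketch of the whole directory workflow. It writes two invented single-record files into a scratch folder under tempdir(), so the folder name xml_demo, the file names, the helper make_record, and the PMIDs are all hypothetical. The get_pmids function is the same idea as the one defined above.

```r
library(XML)

# Hypothetical helper that builds a one-record document around an invented PMID
make_record <- function(pmid) {
  paste0('<PubmedArticleSet><PubmedArticle><MedlineCitation><PMID>', pmid,
         '</PMID></MedlineCitation></PubmedArticle></PubmedArticleSet>')
}

# Write two toy files, plus a non-XML file that the pattern should skip
scratch <- file.path(tempdir(), 'xml_demo')
dir.create(scratch, showWarnings = FALSE)
writeLines(make_record('11111111'), file.path(scratch, 'a.xml'))
writeLines(make_record('22222222'), file.path(scratch, 'b.xml'))
writeLines('not xml', file.path(scratch, 'notes_xml.txt'))

# '\\.xml$' keeps only names ending in .xml; full.names keeps the directory part
files <- list.files(scratch, pattern = '\\.xml$', full.names = TRUE)

# parse one file and return its PMID values
get_pmids <- function(x) {
  xpathSApply(xmlParse(x), '//PMID', xmlValue)
}

# unlist flattens the per-file results into one character vector
pmid_all <- data.frame(pmid = unlist(lapply(files, get_pmids)),
                       stringsAsFactors = FALSE)
pmid_all
```

Using unlist instead of rbind is a small departure from the code above. It flattens whatever each file returns, so a file that yields more than one PMID simply contributes more rows.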

Conditional XPaths

We can be more exacting in how we navigate through the XML in order to get the data we want. For example, what if we wanted to get the abstract for a specific PMID in our multi-record file? We can put that condition right in the XPath. In the code below we are asking for the xml value of the abstract at the XPath where the PMID element is equal to 31012139.

xpathSApply(xml_3,
            '//PubmedArticle/MedlineCitation[PMID=31012139]/Article/Abstract',
            xmlValue)

## [1] "POEMS syndrome is a paraneoplastic syndrome due to an underlying plasma cell neoplasm. The major criteria for the syndrome are polyradiculoneuropathy, clonal plasma cell disorder (PCD), sclerotic bone lesions, elevated vascular endothelial growth factor, and the presence of Castleman disease. Minor features include organomegaly, endocrinopathy, characteristic skin changes, papilledema, extravascular volume overload, and thrombocytosis. Diagnoses are often delayed because the syndrome is rare and can be mistaken for other neurologic disorders, most commonly chronic inflammatory demyelinating polyradiculoneuropathy. POEMS syndrome should be distinguished from the Castleman disease variant of POEMS syndrome, which has no clonal PCD and typically little to no peripheral neuropathy but has several of the minor diagnostic criteria for POEMS syndrome.The diagnosis of POEMS syndrome is made with 3 of the major criteria, two of which must include polyradiculoneuropathy and clonal plasma cell disorder, and at least one of the minor criteria.Because the pathogenesis of the syndrome is not well understood, risk stratification is limited to clinical phenotype rather than specific molecular markers. Risk factors include low serum albumin, age, pleural effusion, pulmonary hypertension, and reduced eGFR.For those patients with a dominant sclerotic plasmacytoma, first line therapy is irradiation. Patients with diffuse sclerotic lesions or disseminated bone marrow involvement and for those who have progression of their disease 3 to 6 months after completing radiation therapy should receive systemic therapy. Corticosteroids are temporizing, but alkylators are the mainstay of treatment, either in the form of low dose conventional therapy or high dose with stem cell transplantation. Lenalidomide shows promise with manageable toxicity. 
Thalidomide and bortezomib also have activity, but their benefit needs to be weighed against their risk of exacerbating the peripheral neuropathy. Prompt recognition and institution of both supportive care measures and therapy directed against the plasma cell result in the best outcomes. This article is protected by copyright. All rights reserved.This article is protected by copyright. All rights reserved."

The above example works because the PMID and the Abstract both sit beneath the same MedlineCitation node (the Abstract via Article), so the condition can be placed on MedlineCitation. But what if the node upon which you want to base your condition is underneath a different parent than the node for which you want the value? Move the condition up to a common ancestor and put more of the XPath inside the brackets. For example, in the following code I get the list of identification numbers for PMID 31059555.

xpathSApply(xml_3,
            '//PubmedArticle[MedlineCitation/PMID=31059555]/PubmedData/ArticleIdList/ArticleId',
            xmlValue)

## [1] "31059555"                     "10.1371/journal.ppat.1007743"
## [3] "PPATHOGENS-D-18-01676"

Finally, if we want to look for a value where a particular attribute is equal to some value, we can use similar syntax. This time I will extract the DOI value for PMID 31059555.

xpathSApply(xml_3,
            '//PubmedArticle[MedlineCitation/PMID=31059555]/PubmedData/ArticleIdList/ArticleId[@IdType="doi"]',
            xmlValue)

## [1] "10.1371/journal.ppat.1007743"
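
The sketch below pulls these predicate ideas together on an invented two-article document (the PMIDs and the DOI 10.0000/toy.1 are made up). It also shows two things not covered above: reading an attribute with xmlGetAttr, and what comes back when a condition matches nothing.

```r
library(XML)

# Invented two-article document; only the first article has a DOI.
toy_doc <- xmlParse(paste0(
  '<PubmedArticleSet>',
  '<PubmedArticle><MedlineCitation><PMID>11111111</PMID></MedlineCitation>',
  '<PubmedData><ArticleIdList>',
  '<ArticleId IdType="pubmed">11111111</ArticleId>',
  '<ArticleId IdType="doi">10.0000/toy.1</ArticleId>',
  '</ArticleIdList></PubmedData></PubmedArticle>',
  '<PubmedArticle><MedlineCitation><PMID>22222222</PMID></MedlineCitation>',
  '<PubmedData><ArticleIdList>',
  '<ArticleId IdType="pubmed">22222222</ArticleId>',
  '</ArticleIdList></PubmedData></PubmedArticle>',
  '</PubmedArticleSet>'
), asText = TRUE)

id_path <- '//PubmedArticle[MedlineCitation/PMID=11111111]/PubmedData/ArticleIdList/ArticleId'

# Values and IdType attributes of every ArticleId for one article
id_values <- xpathSApply(toy_doc, id_path, xmlValue)
id_types  <- xpathSApply(toy_doc, id_path, xmlGetAttr, 'IdType')

# Attribute predicate: keep only the DOI
doi <- xpathSApply(toy_doc, paste0(id_path, '[@IdType="doi"]'), xmlValue)

# A condition that matches nothing gives an empty result, not an error
missing_doi <- xpathSApply(
  toy_doc,
  '//PubmedArticle[MedlineCitation/PMID=22222222]/PubmedData/ArticleIdList/ArticleId[@IdType="doi"]',
  xmlValue
)
length(missing_doi)  # 0
```

Checking the length of the result is therefore a simple way to detect records that lack an optional element.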

Extracting Multiple Elements at a Time

Everything we have done above extracts the value of a single element. We did this for a single file, we did this for files with multiple records, we did this for multiple files, and we did this based on certain conditions. What if we want to extract multiple elements at the same time? For example, what if we want to build a dataframe that lists all the PMIDs and all the authors for each PMID? XML doesn't make this particularly easy because not all elements are required (e.g. an article record might not have an author element) and some records might have repeating elements (e.g. an article might have multiple authors). Putting together everything we stepped through above, we can write a function to do just this.

# create a list of main article PMIDs
pmid_list <- as.numeric(xpathSApply(xml_3, "//PubmedArticle/MedlineCitation/PMID", xmlValue))

# function to extract the PMID and the authors for one article
author_df <- function(pmid.value){
  # XPath to this article's MedlineCitation node
  base_path <- paste0('//PubmedArticle/MedlineCitation[PMID=', pmid.value, ']')
  PMID <- xpathSApply(xml_3, paste0(base_path, '/PMID'), xmlValue)

  # run the author query once; an empty result means the record has no authors
  author <- xpathSApply(xml_3, paste0(base_path, '/Article/AuthorList/Author'), xmlValue)
  if(length(author) == 0){
    author <- 'no author provided'
  }
  # PMID is recycled so that each author gets its own row
  as.data.frame(cbind(PMID=PMID, author=author))
}

#loop through this function with a list of PMIDs
data.list <- lapply(pmid_list, author_df)
authors <- as.data.frame(do.call("rbind", data.list), stringsAsFactors = FALSE)

tail(authors)

##          PMID            author
## 4023 23211339   TakedaYukihikoY
## 4024 23211339   ArakawaAtsushiA
## 4025 23211339        OsawaIsaoI
## 4026 23211339 HorikoshiSatoshiS
## 4027 23211339       YaoTakashiT
## 4028 23211339   TominoYasuhikoY
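
The author values above run together (last name, first name, and initials with no separator) because xmlValue is applied to the whole Author node. As an alternative to the function above, and not a replacement for it, here is a sketch that grabs each article node once with getNodeSet and then queries it with relative XPaths. The toy document, the names in it, and the function one_article are invented, and the sketch assumes every Author has both a LastName and a ForeName, which real Pubmed records do not guarantee.

```r
library(XML)

# Invented document: one article with two authors, one article with none.
toy_doc <- xmlParse(paste0(
  '<PubmedArticleSet>',
  '<PubmedArticle><MedlineCitation><PMID>11111111</PMID><Article><AuthorList>',
  '<Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>',
  '<Author><LastName>Roe</LastName><ForeName>Sam</ForeName></Author>',
  '</AuthorList></Article></MedlineCitation></PubmedArticle>',
  '<PubmedArticle><MedlineCitation><PMID>22222222</PMID><Article/>',
  '</MedlineCitation></PubmedArticle>',
  '</PubmedArticleSet>'
), asText = TRUE)

# One node per article; each is then queried with a relative XPath ('./...')
articles <- getNodeSet(toy_doc, '/PubmedArticleSet/PubmedArticle/MedlineCitation')

# Hypothetical helper: build the PMID/author rows for a single article node
one_article <- function(node) {
  pmid <- xpathSApply(node, './PMID', xmlValue)
  last <- xpathSApply(node, './Article/AuthorList/Author/LastName', xmlValue)
  fore <- xpathSApply(node, './Article/AuthorList/Author/ForeName', xmlValue)
  if (length(last) == 0) {
    return(data.frame(PMID = pmid, author = 'no author provided',
                      stringsAsFactors = FALSE))
  }
  # assumes last and fore line up one-to-one (see the caveat above)
  data.frame(PMID = pmid, author = paste(last, fore, sep = ', '),
             stringsAsFactors = FALSE)
}

authors_toy <- do.call(rbind, lapply(articles, one_article))
authors_toy
```

Working from the article node also avoids searching the whole document once per PMID, which may matter for files much larger than the one used here.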

Exploring and Extracting Data from XML in R