Objective

In this tutorial, we shall take a journey together to explore the structure of the DrugBank database. We will observe how the drugs information is structured within DrugBank’s XML database and see how this information can be retrieved using R. Our main purpose here is parsing the database from its containing XML file. Let us begin!


What is DrugBank?

DrugBank is a comprehensive, freely accessible, online database containing information on drugs and their targets. As both a bioinformatics and a cheminformatics resource, DrugBank combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information.


The DrugBank XML file

Below is what the XML file looks like on the inside. As we can see, there is a single <drugbank> node and, within it, lie thousands of <drug> nodes (which we talk about in greater detail in the following section). These nodes contain the information of the many drugs that constitute the DrugBank database.

<?xml version="1.0" encoding="UTF-8"?>
<drugbank xmlns="http://www.drugbank.ca" 
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
          xsi:schemaLocation="http://www.drugbank.ca http://www.drugbank.ca/docs/drugbank.xsd" 
          version="5.1" 
          exported-on="2018-07-03">

    <drug>  ...  </drug>
    <drug>  ...  </drug>
    <drug>  ...  </drug>
    <drug>  ...  </drug>
    <drug>  ...  </drug>
              
             .
             .
             .
             
    <drug>  ...  </drug>
    <drug>  ...  </drug>
    <drug>  ...  </drug>

</drugbank>

To follow along with the code in this tutorial, you may download and use this XML file. It is a dummy XML database file that contains only a single drug record (i.e. a single <drug> node). Alternatively, you may instead download and use the entire DrugBank XML database file which is available here.


The <drug> node (and its children)

Each drug in the database is represented by a drug node which contains, along with its associated attributes, many children nodes with varying names, properties and (more importantly) structures.

Below is a portion of a <drug> node as an example.

<drug type="biotech" created="2005-06-13" updated="2018-07-02">
    <drugbank-id primary="true">DB00001</drugbank-id>
    <drugbank-id>BTD00024</drugbank-id>
    <drugbank-id>BIOD00024</drugbank-id>
    <name>Lepirudin</name>
    <description>
        Lepirudin is identical to natural hirudin except for substitution of 
        leucine for isoleucine at the N-terminal end of the molecule and the 
        absence of a sulfate group on the tyrosine at position 63. It is 
        produced via yeast cells. Bayer ceased the production of lepirudin 
        (Refludan) effective May 31, 2012.
    </description>
    <cas-number>138068-37-8</cas-number>
    <unii>Y43GF64R34</unii>
    <state>liquid</state>
    <groups>
        <group>approved</group>
    </groups>
        
        .
        .
        .
        
    <!-- many other children -->
        
        .
        .
        .
      
</drug>  

What we see in the example above is the following:

Right now, our current conceptual understanding of the structure of a <drug> node looks something like the figure below.


The <drug> node parser

We shall now create a function that takes a drug record as input (in the form of an XML <drug> node) and parses it. We will first focus on extracting the simpler children of the <drug> node (i.e. the ones that consist of a single value). To that end, we will use the following functions of the dplyr, purrr and XML packages:

library(dplyr)    ## for tibble() function
library(XML)      ## for xmlValue() and xmlGetAttr() functions

## Extract drug df
drug_df <- function(drug) {    ## drug = provided <drug> node
    tibble(
        ## drug attribues
        type = xmlGetAttr(node = drug, name = "type"),
        created = as.Date(xmlGetAttr(node = drug, name = "created")),
        updated = as.Date(xmlGetAttr(node = drug, name = "updated")),
        
        ## Each drug has 3 keys: one primary (that always exists) and two optional
        primary_key = xmlValue(drug["drugbank-id"][[1]]),
        secondary_key = ifelse(length(drug["drugbank-id"]) > 1, xmlValue(drug["drugbank-id"][[2]]), NA),
        third_key = ifelse(length(drug["drugbank-id"]) > 2, xmlValue(drug["drugbank-id"][[3]]), NA),
        
        ## drug name
        name = xmlValue(drug[["name"]]),
        
        ## drug description
        description = xmlValue(drug[["description"]]),
        
        ## CAS no.
        cas_number = xmlValue(drug[["cas-number"]]),
        
        ## UNII: unique ingredient identifier
        unii = xmlValue(drug[["unii"]]),
        
        ## state (e.g. solid, liquid)
        state = xmlValue(drug[["state"]])
        
        ##
        ## Getting other attributes...
        ##     
    )
}

The above code parses information from a single <drug> node. With the help of the map_df() function, we can automate the information extraction from all the <drug> records. The extracted information will be returned in a data frame (tibble) object containing all the drugs data. Before extracting the data though, we need to load the data first.

drugbank <- xmlParse('drugbank_record.xml')    ## load XML file
drugbank_root <- xmlRoot(drugbank)             ## get XML root node
all_drugs <- xmlChildren(drugbank_root)        ## get root's children (drugs)

Now that XML file has been loaded, let’s extract the drugs’ data. The map_df() function takes two parameters:

library(purrr)    ## for map_df() function

## apply the 'drug_df' function to all the drug records
drugs <- map_df(all_drugs, ~drug_df(.x))

Common parser (One for All)

So far, we have written code for parsing a <drug> node, followed by code for automating the data extraction for all such nodes. However, we have only extracted data from some of the children (of the <drug> nodes).

We have also noticed that many of these children are similar in structure. We shall make use of that similarity by building a generic parser that could be reused to parse these children. Here goes…

drug_sub_df <- function(base_node,                 ## base node
                        child_node,                ## child node to be parsed
                        sub_child_node = NULL,     ## sub-child (if any)
                        id = "drugbank-id") {      ## identifier to be attached to parsed data
    
    ## get desired XML content
    if (!is.null(base_node[[child_node]])) {
        ## if no sub-child specified...
        if (is.null(sub_child_node)) {
            ## get content of child
            df <- xmlToDataFrame(base_node[[child_node]], 
                                 stringsAsFactors = FALSE)
        ## if sub-child specified
        } else {
            ## get content of sub-child
            df <- xmlToDataFrame(base_node[[child_node]][[sub_child_node]], 
                                 stringsAsFactors = FALSE)
        }
    } else {
        df <- NULL
    }
    
    ## assign the parent key to the resultant data frame, if any
    if (!is.null(df) && nrow(df) > 0) {
        parent_key <- NULL
      
        ## get id of base node
        if (!is.null(id)) {
            parent_key <- xmlValue(base_node[id][[1]])
        }
        
        ## set parent key
        if (!is.null(parent_key)) {
            df$parent_key <- parent_key
        }
    }
    
    return(df)
}

We break down the details of the above function in the following few sections to show how it would be used in a number of different cases.

Mandatory paramters

The drug_sub_df() method takes two mandatory paramaters:

  • base_node: a base XML node
  • child_node: a child of the base node to be parsed

To give a couple of examples, let’s say we want to parse the <groups> and <products> information of all the drugs. The following code would get us the information we want.

## get drugs' groups and products info
drug_groups <- map_df(all_drugs, ~drug_sub_df(.x, "groups"))
drug_products <- map_df(all_drugs, ~drug_sub_df(.x, "products"))

Note that the XML structure being parsed is as follows.

<drug type="biotech" created="2005-06-13" updated="2018-07-02">
        
        .
        .
        .
        
    <groups>
        <group>approved</group>
    </groups>
        
        .
        .
        .
        
    <products>
        <product>
            <name>Refludan</name>
            <labeller>Bayer</labeller>
            <ndc-id/>
            <ndc-product-code/>
            <dpd-id>02240996</dpd-id>
            <ema-product-code/>
            <ema-ma-number/>
            <started-marketing-on>2000-01-31</started-marketing-on>
            <ended-marketing-on>2013-07-26</ended-marketing-on>
            <dosage-form>Powder, for solution</dosage-form>
            <strength>50 mg</strength>
            <route>Intravenous</route>
            <fda-application-number/>
            <generic>false</generic>
            <over-the-counter>false</over-the-counter>
            <approved>true</approved>
            <country>Canada</country>
            <source>DPD</source>
        </product>
        <product>  ...  </product>
            
            .
            .
            
        <product>  ...  </product>
        <product>  ...  </product>
    </products>  
        
        .
        .
        .
      
</drug>  

In the above example, the nodes that we parsed consist of a group of similar homogeneous nodes; that is:

  • a <groups> node consisting of multiple <group> nodes
  • a <products> node consisting of multiple <product> nodes

Moreover, the parents of the parsed <groups> and <products> nodes are their corresponding <drug> nodes. The IDs of these parents are then attached to their respective parsed data.

Optional parameters

The drug_sub_df() method has two optional parameters: sub_child_node and id.

The sub_child_node parameter is used when we access a node using two names. Below is an example.

## get just the articles from the drugs' general references
drug_articles <- map_df(all_drugs, ~drug_sub_df(.x, "general-references", sub_child_node = "articles"))

The above code parses the following XML snippet. Note that there are other nodes beneath <general-references> that would not be considered by the above code (i.e. <textbooks> and <links>).

<drug type="biotech" created="2005-06-13" updated="2018-07-02">
        
        .
        .
        .
        
    <general-references>
        <articles>
            <article>
                <pubmed-id>16244762</pubmed-id>
                <citation>
                    Smythe MA, Stephens JL, Koerber JM, Mattson JC: 
                    A comparison of lepirudin and argatroban outcomes. 
                    Clin Appl Thromb Hemost. 2005 Oct;11(4):371-4.
                </citation>
            </article>
            <article>
                <pubmed-id>16690967</pubmed-id>
                <citation>
                    Tardy B, Lecompte T, Boelhen F, Tardy-Poncet B, Elalamy I, 
                    Morange P, Gruel Y, Wolf M, Francois D, Racadot E, Camarasa P, 
                    Blouch MT, Nguyen F, Doubine S, Dutrillaux F, Alhenc-Gelas M, 
                    Martin-Toutain I, Bauters A, Ffrench P, de Maistre E, 
                    Grunebaum L, Mouton C, Huisse MG, Gouault-Heilmann M, Lucke V: 
                    Predictive factors for thrombosis and major bleeding in an 
                    observational study in 181 patients with heparin-induced 
                    thrombocytopenia treated with lepirudin. 
                    Blood. 2006 Sep 1;108(5):1492-6. Epub 2006 May 11.
                </citation>
            </article>
            <article>
                <pubmed-id>16241940</pubmed-id>
                <citation>
                    Lubenow N, Eichler P, Lietz T, Greinacher A: 
                    Lepirudin in patients with heparin-induced thrombocytopenia - 
                    results of the third prospective study (HAT-3) and a combined 
                    analysis of HAT-1, HAT-2, and HAT-3. 
                    J Thromb Haemost. 2005 Nov;3(11):2428-36.
                </citation>
            </article>
        </articles>
        <textbooks/>
        <links>
          <link>
            <title>Google books</title>
            <url>http://books.google.com/books?id=iadLoXoQkWEC&amp;pg=PA440</url>
          </link>
        </links>
    </general-references>
        
        .
        .
        .
      
</drug>

As for the id, it is used to specify the parent_key field (by default, drugbank-id) to attach to the extracted data. If we have a look at the previously extracted groups, products and articles data, we will find each of the returned items has the drugbank-id of its corresponding drug attached to it in the parent_key column.

glimpse(drug_groups)
Observations: 1
Variables: 2
$ text       <chr> "approved"
$ parent_key <chr> "DB00001"
glimpse(drug_products)
Observations: 5
Variables: 19
$ name                     <chr> "Refludan", "Refludan", "Refludan", "Refludan", "Refludan"
$ labeller                 <chr> "Bayer", "Celgene Europe Limited", "Celgene Europe Limited", "Celgene Euro...
$ `ndc-id`                 <chr> "", "", "", "", ""
$ `ndc-product-code`       <chr> "", "", "", "", ""
$ `dpd-id`                 <chr> "02240996", "", "", "", ""
$ `ema-product-code`       <chr> "", "EMEA/H/C/000122", "EMEA/H/C/000122", "EMEA/H/C/000122", "EMEA/H/C/000...
$ `ema-ma-number`          <chr> "", "EU/1/97/035/001", "EU/1/97/035/002", "EU/1/97/035/003", "EU/1/97/035/...
$ `started-marketing-on`   <chr> "2000-01-31", "1997-03-13", "1997-03-13", "1997-03-13", "1997-03-13"
$ `ended-marketing-on`     <chr> "2013-07-26", "2012-07-27", "2012-07-27", "2012-07-27", "2012-07-27"
$ `dosage-form`            <chr> "Powder, for solution", "Injection, solution, concentrate", "Injection, so...
$ strength                 <chr> "50 mg", "50 mg", "50 mg", "20 mg", "20 mg"
$ route                    <chr> "Intravenous", "Intravenous", "Intravenous", "Intravenous", "Intravenous"
$ `fda-application-number` <chr> "", "", "", "", ""
$ generic                  <chr> "false", "false", "false", "false", "false"
$ `over-the-counter`       <chr> "false", "false", "false", "false", "false"
$ approved                 <chr> "true", "false", "false", "false", "false"
$ country                  <chr> "Canada", "EU", "EU", "EU", "EU"
$ source                   <chr> "DPD", "EMA", "EMA", "EMA", "EMA"
$ parent_key               <chr> "DB00001", "DB00001", "DB00001", "DB00001", "DB00001"
glimpse(drug_articles)
Observations: 3
Variables: 3
$ `pubmed-id` <chr> "16244762", "16690967", "16241940"
$ citation    <chr> "Smythe MA, Stephens JL, Koerber JM, Mattson JC: A comparison of lepirudin and argatrob...
$ parent_key  <chr> "DB00001", "DB00001", "DB00001"

In the above examples, we see that the IDs (i.e. parent_key’s) attached to each of the items is that of the corresponding drug, DB00001. This is the case because the dummy XML file used in this tutorial has only a single drug. When extracting the data from the full DrugBank XML database, other drug IDs would appear.

Furthermore, the drugbank-id is the default value used to fill the parent_key in the returned result. Below, we show an example where we specify another field to use for filling the parent_key column.

## get the synonyms and experimental properties (with modified attached IDs)
drug_synonyms <- map_df(all_drugs, ~drug_sub_df(.x, "synonyms", id = "name"))
drug_exp_props <- map_df(all_drugs, ~drug_sub_df(.x, "experimental-properties", id = "cas-number"))
drug_synonyms
  text                  parent_key
1 Hirudin variant-1     Lepirudin
2 Lepirudin recombinant Lepirudin
drug_exp_props
  kind              value             source                                                   parent_key
1 Melting Point     65 °C             Otto, A. & Seckler, R. Eur. J. Biochem. 202:67-73 (1991) 138068-37-8
2 Hydrophobicity    -0.777                                                                     138068-37-8
3 Isoelectric Point 4.04                                                                       138068-37-8
4 Molecular Weight  6963.425                                                                   138068-37-8
5 Molecular Formula C287H440N80O110S6                                                          138068-37-8

Special Cases (All for one)

There are a number of nodes that require special parsing in the DrugBank database. In the remainder of the article, we are going to explore some of them.

Reaction (Left, right, enzymes, what??)

The <reactions> node is a relatively complex child that contain a variety of children and sub-children nodes. The following figure shows the structure of a <reactions> node.

An example of the corresponding XML is given below. Note that the XML file provided at the beginning of the tutorial does not contain the example below.

<drug type="..." created="..." updated="...">
        
        .
        .
        .
        
    <reactions>
        <reaction>
            <sequence>1</sequence>
            <left-element>
                <drugbank-id>DB00091</drugbank-id>
                <name>Cyclosporine</name>
            </left-element>
            <right-element>
                <drugbank-id>DBMET00359</drugbank-id>
                <name>Metabolite AM1</name>
            </right-element>
            <enzymes>
                <enzyme>
                    <drugbank-id>BE0002638</drugbank-id>
                    <name>Cytochrome P450 3A4</name>
                    <uniprot-id>P08684</uniprot-id>
                </enzyme>
            </enzymes>
        </reaction>
    </reactions>
        
        .
        .
        .
      
</drug>

To parse the above XML, we write the functions below.

## parse <enzymes> child
get_enzymes_df <- function(drug) {
    return(map_df(xmlChildren(drug[["reactions"]]),
                  ~drug_sub_df(.x, "enzymes", id = NULL)))
    # reactions_enzymes <- drug_sub_df(.x, "enzymes", id = NULL)
}

## parse rest of children
get_reactions_rec <- function(r, drug_key) {
    tibble(
        sequence            = xmlValue(r[["sequence"]]),
        
        left_drugbank_id    = xmlValue(r[["left-element"]][["drugbank-id"]]),
        left_drugbank_name  = xmlValue(r[["left-element"]][["name"]]),
        
        right_drugbank_id   = xmlValue(r[["right-element"]][["drugbank-id"]]),
        right_drugbank_name = xmlValue(r[["right-element"]][["name"]]),
        
        parent_key = drug_key
    )
}

get_reactions_df <- function(drug) {
    return(map_df(xmlChildren(drug[["reactions"]]),
                  ~ get_reactions_rec(., xmlValue(drug["drugbank-id"][[1]]))))
}

Below is a demonstration of how to use the above functions. First, we will show how to apply the functions for a single drug, and then provide the code for automating this operation for all drugs.

## extract reactions info from a single drug (the first one)
first_drug_reactions_enzymes <- get_enzymes_df(all_drugs[[1]])
first_drug_reactions_remaining <- get_reactions_df(all_drugs[[1]])
## extract reactions info from all drugs
drug_reactions_enzymes <- map_df(all_drugs, ~get_enzymes_df(.x))
drug_reactions_remaining <- map_df(all_drugs, ~get_reactions_df(.x))

Let’s see the returned data.

drug_reactions_enzymes
  drugbank-id name                uniprot-id
1 BE0002638   Cytochrome P450 3A4 P08684
drug_reactions_remaining
  sequence left_drugbank_id left_drugbank_name right_drugbank_id right_drugbank_name parent_key
1 1        DB00091          Cyclosporine       DBMET00359        Metabolite AM1      DB00001  

ATC (twins, twins, ..)

<atc-codes> is another child of the <drug> node that also requires special attention. Observe the following XML snippet that shows thse structure of the <atc-codes> node.

<drug type="..." created="..." updated="...">
        
        .
        .
        .
        
    <atc-codes>
        <atc-code code="B01AE02">
            <level code="B01AE">Direct thrombin inhibitors</level>
            <level code="B01A">ANTITHROMBOTIC AGENTS</level>
            <level code="B01">ANTITHROMBOTIC AGENTS</level>
            <level code="B">BLOOD AND BLOOD FORMING ORGANS</level>
        </atc-code>
    </atc-codes>
        
        .
        .
        .
      
</drug>

we can parse it as follows.

get_atc_codes_rec <- function(r, drug_key) {
    tibble(
        atc_code = xmlGetAttr(r, name = "code"),
        
        level_1 = xmlValue(r[[1]]),
        code_1 = xmlGetAttr(r[[1]], name = "code"),
        
        level_2 = xmlValue(r[[2]]),
        code_2 = xmlGetAttr(r[[2]], name = "code"),
        
        level_3 = xmlValue(r[[3]]),
        code_3 = xmlGetAttr(r[[3]], name = "code"),
        
        level_4 = xmlValue(r[[4]]),
        code_4 = xmlGetAttr(r[[4]], name = "code"),
        
        parent_key = drug_key
    )
}

get_atc_codes_df <- function(drug) {
    return (map_df(xmlChildren(drug[["atc-codes"]]),
                   ~ get_atc_codes_rec(.x, xmlValue(drug["drugbank-id"][[1]]))))
}
## extract ATC codes info from all drugs
drug_atc_codes <- map_df(all_drugs, ~get_atc_codes_df(.x))

Let’s have a look at the returned results.

glimpse(drug_atc_codes)
Observations: 1
Variables: 10
$ atc_code   <chr> "B01AE02"
$ level_1    <chr> "Direct thrombin inhibitors"
$ code_1     <chr> "B01AE"
$ level_2    <chr> "ANTITHROMBOTIC AGENTS"
$ code_2     <chr> "B01A"
$ level_3    <chr> "ANTITHROMBOTIC AGENTS"
$ code_3     <chr> "B01"
$ level_4    <chr> "BLOOD AND BLOOD FORMING ORGANS"
$ code_4     <chr> "B"
$ parent_key <chr> "DB00001"

We have covered a couple of examples of special nodes that require separate treatment for parsing them. While there are more instances of such nodes, we will stop here with these special nodes.


Conclusion

Our journey within the DrugBank XML database comes to an end here. Do check out our dbparser package that was built for parsing all the data within DrugBank.

We hope that you enjoyed our journey together.


About the Authors

We are a team of data scientists called Dainanahan. Our aim is to provide practitioners in the field of drug discovery with useful tools that would make their life easier and help increase their productivity. This suite of tools that we are planning to make is codenamed DrugVerse, and dbparser is just the beginning, so stay tuned!