We start by loading the XML package and defining a few functions that we will use later on.
require(XML)
## Loading required package: XML
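# Count the nodes in doctree that match an XPath query.
count.xml.matches <- function(xpath, doctree) {
    length(getNodeSet(doctree, xpath))
}
# Return the value of a node's "name" attribute (NULL if it has none).
get.name.attr <- function(node) {xmlGetAttr(node, "name")}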
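Note that the “name attribute” is not the same thing as an “xmlName”. xmlName() returns the element's tag (for example, "region"), whereas the name attribute is the value of name="..." on that element (for example, "USA").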
The xmlParse() function will read a file and parse it as XML. The argument can be the name of a local XML file, a compressed XML file, or the URL of an XML file. If the file is compressed, you don't have to uncompress it first; the parser will decompress it on the fly. Likewise, if it's a URL, you don't have to download the file or save it locally.
The returned object is opaque from the user's point of view. Mostly it's just good for passing to other functions in the XML package.
gcam.config <- xmlParse("output-ctax.xml")
str(gcam.config)
## Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>
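xmlParse() can also parse XML held in a string when asText = TRUE, which is handy for experimenting without the GCAM file. A minimal sketch, assuming a made-up two-region document (toy.xml and its contents are hypothetical, not taken from output-ctax.xml):

```r
library(XML)

# Hypothetical toy document; not part of the GCAM output.
toy.xml <- '<scenario name="toy">
  <region name="USA"/>
  <region name="Canada"/>
</scenario>'

# asText = TRUE tells xmlParse() that the argument is XML content, not a file name.
toy.doc <- xmlParse(toy.xml, asText = TRUE)

# xmlRoot() returns the top-level element of the parsed document.
toy.root <- xmlRoot(toy.doc)
root.tag <- xmlName(toy.root)                       # tag name of the root element
n.regions <- length(getNodeSet(toy.doc, "//region"))  # number of region elements
```

The same functions used on gcam.config work on toy.doc, so a small document like this is a convenient place to try out queries before running them on a large file.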
The getNodeSet() function fetches all the nodes that match a specified criterion. The following snippet gets all of the regions in the output. The result returned is a list of all the nodes that match the query string. The length of the list tells you how many matches you got. You can access the individual elements of the list using the usual list indexing notation.
The print method for a node prints out the textual XML representation for the node. If the node is reasonably sized, this can be a good way to get a peek at what's in it. Don't try it with a GCAM region, though; they're huge.
rgns <- getNodeSet(gcam.config, "//region")
length(rgns)
## [1] 14
get.name.attr(rgns[[2]])
## [1] "Canada"
mm <- getNodeSet(gcam.config,'//MagiccModel')
mm[[1]]
## <MagiccModel>
## <ghgInputFileName>../input/nonco2/Historical Emissions/Default Emissions Module/Hist_to_2008_Annual.csv</ghgInputFileName>
## <last-historical-year>2005</last-historical-year>
## <bc-unit-forcing>0</bc-unit-forcing>
## <carbon-model-start-year>1705</carbon-model-start-year>
## </MagiccModel>
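Because getNodeSet() returns an ordinary list, the usual list tools apply to it. The sketch below illustrates this on a hypothetical three-region document (the region names are placeholders chosen for the example); it also shows xpathSApply(), an XML package function that runs the query and applies a function to each match in one call.

```r
library(XML)

# Hypothetical toy document standing in for the GCAM output.
toy.doc <- xmlParse('<scenario>
  <region name="USA"/>
  <region name="Canada"/>
  <region name="Africa"/>
</scenario>', asText = TRUE)

rgns <- getNodeSet(toy.doc, "//region")
n.rgns <- length(rgns)                     # number of matching nodes
second <- xmlGetAttr(rgns[[2]], "name")    # name attribute of the second match

# sapply() applies a function to every node in the returned list.
rgn.names <- sapply(rgns, xmlGetAttr, "name")

# xpathSApply() combines the query and the sapply() step.
rgn.names2 <- xpathSApply(toy.doc, "//region", xmlGetAttr, "name")
```

Both rgn.names and rgn.names2 hold the region names in document order.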
You use XPath to specify the nodes you are looking for. Here are a few examples where we look at the delivered biomass and regional biomass sectors. The '@' sign specifies an attribute, in this case the name of the sector, so these searches return only the supply sectors that have the specified name attribute.
We'd also like to know whether any of those sectors have more than one technology. In a predicate, position() gives a node's position among the siblings selected by the same step, here the technology elements that share a parent. By asking for position() > 1 we are asking for all the technology nodes that aren't the first technology under their parent. Any such node must be part of a subsector with more than one technology. Conversely, if we don't find any such nodes, then we know that every subsector under the delivered biomass sector has at most one technology.
count.xml.matches('//supplysector[@name="delivered biomass"]', gcam.config)
## [1] 14
count.xml.matches('//supplysector[@name="delivered biomass"]//technology', gcam.config)
## [1] 14
count.xml.matches('//supplysector[@name="delivered biomass"]//technology[position()>1]', gcam.config)
## [1] 0
count.xml.matches('//supplysector[@name="regional biomass"]//technology[position()>1]', gcam.config)
## [1] 0
count.xml.matches('//supplysector[@name="regional biomassOil"]//technology', gcam.config)
## [1] 4
Hmm. It seems not all regions have a biomassOil sector. I wonder which ones do. It turns out that XPath has axes, such as ancestor::, that allow you to find nodes that bear some specified relationship to another node you've identified. In this case we'll find all of the ancestors of a biomassOil sector that are region elements. We can use sapply on the list we get back to run the get.name.attr function on each item in the list.
rbo <- getNodeSet(gcam.config, '//supplysector[@name="regional biomassOil"]/ancestor::region')
sapply(rbo, get.name.attr)
## [1] "USA" "Africa" "Southeast Asia" "India"
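The ancestor query is easy to check on a small example. The sketch below assumes a hypothetical document in which only two of three regions contain a sector named "b" (the region and sector names are placeholders). It also shows an alternative phrasing that filters the regions directly with a nested predicate.

```r
library(XML)

# Hypothetical toy document: regions R1 and R3 contain sector "b"; R2 does not.
toy.doc <- xmlParse('<scenario>
  <region name="R1"><supplysector name="a"/><supplysector name="b"/></region>
  <region name="R2"><supplysector name="a"/></region>
  <region name="R3"><supplysector name="b"/></region>
</scenario>', asText = TRUE)

# ancestor::region walks up from each matching sector to its enclosing region.
via.ancestor <- getNodeSet(toy.doc, '//supplysector[@name="b"]/ancestor::region')
names.ancestor <- sapply(via.ancestor, xmlGetAttr, "name")   # "R1" "R3"

# Alternative: select the regions that have a matching descendant.
via.predicate <- getNodeSet(toy.doc, '//region[.//supplysector[@name="b"]]')
names.predicate <- sapply(via.predicate, xmlGetAttr, "name") # "R1" "R3"
```

The two forms select the same regions here; which one reads better is largely a matter of taste.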
Here's another example. We can see that the industrial feedstocks sectors have several technologies apiece, but there are no technologies with position > 1. That tells us that each sector has several subsectors, each with just one technology. We can narrow it down to the USA region and see what the subsectors are called.
count.xml.matches('//supplysector[@name="industrial feedstocks"]//technology', gcam.config)
## [1] 56
count.xml.matches('//supplysector[@name="industrial feedstocks"]//technology[position()>1]', gcam.config)
## [1] 0
count.xml.matches('//supplysector[@name="industrial feedstocks"]//subsector', gcam.config)
## [1] 56
ifs.subsec.usa <- getNodeSet(gcam.config, '//region[@name="USA"]//supplysector[@name="industrial feedstocks"]//subsector')
sapply(ifs.subsec.usa, get.name.attr)
## [1] "biomass" "coal" "gas" "refined liquids"
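To see why a zero count for technology[position()>1] implies one technology per subsector, it helps to run the same query on a case where the count is not zero. A minimal sketch, assuming a hypothetical sector in which one subsector has two technologies and another has one (all names are placeholders):

```r
library(XML)

# Hypothetical toy sector: subsector s1 has two technologies, s2 has one.
toy.doc <- xmlParse('<supplysector name="x">
  <subsector name="s1"><technology name="t1"/><technology name="t2"/></subsector>
  <subsector name="s2"><technology name="t3"/></subsector>
</supplysector>', asText = TRUE)

# All technologies, regardless of position.
n.tech <- length(getNodeSet(toy.doc, "//technology"))          # 3

# Technologies that are not the first technology under their parent.
extra <- getNodeSet(toy.doc, "//technology[position()>1]")
n.extra <- length(extra)                                       # 1
extra.names <- sapply(extra, xmlGetAttr, "name")               # "t2"
```

Only t2 is returned: t1 and t3 are each the first technology under their own subsector, so position() is 1 for both.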
There are functions that allow you to get the parent of a node or a list of its children:
usa.rgn <- getNodeSet(gcam.config,'//region[@name="USA"]')
usa.children <- xmlChildren(usa.rgn[[1]])
length(usa.children)
## [1] 164
get.name.attr(usa.children[[42]])
## [1] "comm cooling"
pchild <- xmlParent(usa.children[[23]])
xmlName(pchild)
## [1] "region"
get.name.attr(pchild)
## [1] "USA"
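The same navigation can be tried on a small document. The sketch below assumes a hypothetical USA region with two child sectors (the sector names are placeholders) and contrasts xmlName(), which returns the tag, with the name attribute.

```r
library(XML)

# Hypothetical toy region with two child elements.
toy.doc <- xmlParse(
  '<region name="USA"><supplysector name="comm cooling"/><supplysector name="comm heating"/></region>',
  asText = TRUE)

usa <- getNodeSet(toy.doc, '//region[@name="USA"]')[[1]]

# xmlChildren() returns a list of the node's children.
kids <- xmlChildren(usa)
n.kids <- length(kids)
kid.tags <- unname(sapply(kids, xmlName))              # element tags
kid.names <- unname(sapply(kids, xmlGetAttr, "name"))  # name attributes

# xmlParent() goes back up from a child to the node that contains it.
parent <- xmlParent(kids[[1]])
parent.tag <- xmlName(parent)
parent.name <- xmlGetAttr(parent, "name")
```

Here every child has the tag "supplysector", while the name attributes differ, which is why the two notions of "name" need to be kept apart.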
For a leaf node, you might want to get the value:
xmlValue(xmlChildren(mm[[1]])[[2]])
## [1] "2005"
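Indexing children by position is fragile if the order of elements changes, so it can be safer to select the leaf by its tag. A minimal sketch, assuming a cut-down, hypothetical copy of the MagiccModel node printed earlier; note that xmlValue() returns a character string, so a numeric value has to be converted explicitly.

```r
library(XML)

# Hypothetical cut-down copy of the MagiccModel node shown above.
toy.doc <- xmlParse(
  '<MagiccModel><last-historical-year>2005</last-historical-year><bc-unit-forcing>0</bc-unit-forcing></MagiccModel>',
  asText = TRUE)

# Select the leaf by tag and extract its text content.
lhy <- xpathSApply(toy.doc, "//MagiccModel/last-historical-year", xmlValue)

# xmlValue() yields a string; convert it before doing arithmetic.
lhy.num <- as.integer(lhy)
```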
There are a whole slew of other functions in the XML package, but you'll get a lot of mileage out of the ones above. The list below has links to the documentation for the XML package and to a brief tutorial on XPath queries.