Source file ⇒ lec31.Rmd
Last time I showed you how to create an XML file in R. Now I will show you how to read an XML file and make a data table in R.
Please copy the following XML document to Sublime and save it in your home directory as plant_catalog.xml.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited with XML Spy v2006 (http://www.altova.com) -->
<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Columbine</COMMON>
<BOTANICAL>Aquilegia canadensis</BOTANICAL>
<ZONE>3</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$9.37</PRICE>
<AVAILABILITY>030699</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Marsh Marigold</COMMON>
<BOTANICAL>Caltha palustris</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Sunny</LIGHT>
<PRICE>$6.81</PRICE>
<AVAILABILITY>051799</AVAILABILITY>
</PLANT>
</CATALOG>
R allows for object-oriented (OO) programming. We’re not going to do any of this style of programming ourselves (i.e. defining S4 classes in R, and their slots and methods), but it’s helpful to know how to interpret it when we see it.
For example, you can define a dog class consisting of all types of dogs. A class is the blueprint from which all objects are created. An object is an instance of a class, for example a particular dog from the dog class. The dog blueprint says what dogs are and what they can do. Dogs come in different types and have different names and ages. Dogs can also do things like bark, play fetch, roll over and even skateboard. My dog is an object of the dog class. He is a bulldog, his name is Max, and he is three years old. He can bark and skateboard.
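To make the dog example concrete, here is a minimal sketch of an S4 class in R. The class name `Dog`, its slots, and the generic `bark` are hypothetical names chosen for illustration; they do not come from any package.

```r
library(methods)  # S4 tools; attached by default in interactive R sessions

# The blueprint: every Dog has a name, a breed, and an age (the slots).
setClass("Dog", representation(name = "character",
                               breed = "character",
                               age = "numeric"))

# Something dogs can do: a generic function plus a method for the Dog class.
setGeneric("bark", function(dog) standardGeneric("bark"))
setMethod("bark", "Dog", function(dog) paste(dog@name, "says woof"))

# An object is one particular instance of the class.
my_dog <- new("Dog", name = "Max", breed = "bulldog", age = 3)
is(my_dog, "Dog")
bark(my_dog)
```

Slots are read with `@`, as in `my_dog@name`. You will see this notation when you inspect S4 objects from packages.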
Data frames are an example of a class in R. An object is a particular data frame you are working with. Data frames come in different sizes, have different types of entries, and have names (called the slots of the class). Data frames also have methods (i.e. functions) that operate on them, like `ncol`, `names`, or `head`.
To check the class of an object in R, just use `class(objectname)`. The result may be a character vector with more than one element when the class inherits methods from other classes.
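As a quick sketch of what `class()` returns, the example below compares a data frame with a made-up object that carries two class names. The object `obj` and the class names `"bulldog"` and `"dog"` are assumptions for illustration only.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
class(df)                      # a single class name

# Assumption for illustration: a made-up object with two class names,
# ordered from most specific to most general.
obj <- structure(list(), class = c("bulldog", "dog"))
class(obj)
inherits(obj, "dog")           # TRUE: "dog" appears in the class vector
```

`inherits()` is usually a safer test than comparing `class(x)` to a single string, because it works whether the class vector has one element or several.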
To read an XML file into R, use `xmlTreeParse`.
library(XML)
doc <- xmlTreeParse("/Users/Adam/Desktop/plant_catalog.xml")
class(doc)
## [1] "XMLDocument" "XMLAbstractDocument"
The XML package defines a special class, called `XMLDocument`, to represent an XML tree.
`xmlTreeParse` implements what is called the DOM (Document Object Model) parser. It reads the entire file into memory.
We don’t have time to cover it, but you should be aware of another parsing model called SAX (Simple API for XML). It reads the document incrementally and is more memory efficient, but it is trickier to use. This might be necessary for large XML files.
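For a rough idea of what SAX-style parsing looks like, here is a small sketch using `xmlEventParse` from the XML package. The XML string, the counter `n_plants`, and the handler list are made up for illustration; a real use would pass a file name instead of text.

```r
library(XML)

# A tiny stand-in document, supplied as text rather than as a file.
xml_text <- "<CATALOG><PLANT><COMMON>Bloodroot</COMMON></PLANT><PLANT><COMMON>Columbine</COMMON></PLANT></CATALOG>"

n_plants <- 0

# The parser calls startElement each time it meets an opening tag.
# No tree is built; we only update a counter as the tags stream past.
handlers <- list(
  startElement = function(name, attrs, ...) {
    if (name == "PLANT") n_plants <<- n_plants + 1
  }
)

invisible(xmlEventParse(xml_text, handlers = handlers, asText = TRUE,
                        useTagName = FALSE, addContext = FALSE))
n_plants
```

The trade-off is visible even here: you have to decide in advance what to record, because the document is never held in memory as a whole.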
The `XMLDocument` class has a method called `xmlRoot()` which gets the top-level XML node as a list of lists.
root <- xmlRoot(doc)
class(root)
## [1] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode"
## [4] "oldClass"
root
## <CATALOG>
## <PLANT>
## <COMMON>Bloodroot</COMMON>
## <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
## <ZONE>4</ZONE>
## <LIGHT>Mostly Shady</LIGHT>
## <PRICE>$2.44</PRICE>
## <AVAILABILITY>031599</AVAILABILITY>
## </PLANT>
## <PLANT>
## <COMMON>Columbine</COMMON>
## <BOTANICAL>Aquilegia canadensis</BOTANICAL>
## <ZONE>3</ZONE>
## <LIGHT>Mostly Shady</LIGHT>
## <PRICE>$9.37</PRICE>
## <AVAILABILITY>030699</AVAILABILITY>
## </PLANT>
## <PLANT>
## <COMMON>Marsh Marigold</COMMON>
## <BOTANICAL>Caltha palustris</BOTANICAL>
## <ZONE>4</ZONE>
## <LIGHT>Mostly Sunny</LIGHT>
## <PRICE>$6.81</PRICE>
## <AVAILABILITY>051799</AVAILABILITY>
## </PLANT>
## </CATALOG>
`root` is an object of the XMLNode class.
Each PLANT node in `root` is a list. We can access an element within a node (i.e., a child) using the usual [[ ]] indexing for lists. The node classes also have methods that give you information about the child nodes of the root.
# Look at the first plant node
oneplant <- root[[1]]
class(oneplant)
## [1] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode"
## [4] "oldClass"
oneplant
## <PLANT>
## <COMMON>Bloodroot</COMMON>
## <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
## <ZONE>4</ZONE>
## <LIGHT>Mostly Shady</LIGHT>
## <PRICE>$2.44</PRICE>
## <AVAILABILITY>031599</AVAILABILITY>
## </PLANT>
We can drill down further into the list:
oneplant[['COMMON']]
## <COMMON>Bloodroot</COMMON>
Note that this doesn’t remove the markup, though. To do this, use the function `xmlValue`.
xmlValue(oneplant[['COMMON']])
## [1] "Bloodroot"
xmlValue(oneplant[['BOTANICAL']])
## [1] "Sanguinaria canadensis"
Calling `names(oneplant)` returns a named character vector of the names of its children.
names(oneplant)
## COMMON BOTANICAL ZONE LIGHT PRICE
## "COMMON" "BOTANICAL" "ZONE" "LIGHT" "PRICE"
## AVAILABILITY
## "AVAILABILITY"
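The node-inspection functions used so far can be tried on a tiny document without any file on disk. This is a sketch: the XML string and the names `small_root` and `first_plant` are made up for illustration, and the document is parsed from text with `asText = TRUE`.

```r
library(XML)

# A tiny stand-in for plant_catalog.xml, parsed from a string.
xml_text <- "<CATALOG><PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT><PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT></CATALOG>"
small_root <- xmlRoot(xmlTreeParse(xml_text, asText = TRUE))

xmlName(small_root)                   # tag name of the root node
xmlSize(small_root)                   # number of children
first_plant <- small_root[[1]]
names(first_plant)                    # tag names of the children
xmlValue(first_plant[["ZONE"]])       # text content, always a character string
```

Note that `xmlValue` returns text even for numeric-looking fields such as ZONE, so any conversion to numbers has to be done explicitly.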
To illustrate how we manipulate an XML object in R, we’ll take this data and reformat it into a data frame with one row for each plant.
There are special XML versions of `lapply` and `sapply`, named `xmlApply` and `xmlSApply`. Each takes an `XMLNode` object as its primary argument. They iterate over the node’s child nodes, invoking the given function on each.
Like `lapply`, `xmlApply` returns a list. Like `sapply`, `xmlSApply` returns a simpler data structure if possible.
First, a quick review of `sapply` and `lapply`. Remember: `lapply` and `sapply` can operate on either a list or a vector.
1:3 %>% lapply( function(x){x^2})
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 9
1:3 %>% sapply( function(x){x^2})
## [1] 1 4 9
myList <- list(a=1, b=2, c=3)
myList %>% lapply( function(x){x^2})
## $a
## [1] 1
##
## $b
## [1] 4
##
## $c
## [1] 9
myList %>% sapply( function(x){x^2})
## a b c
## 1 4 9
myList %>% sapply(log)
## a b c
## 0.0000000 0.6931472 1.0986123
myList %>% sapply(log, base = 10)
## a b c
## 0.0000000 0.3010300 0.4771213
myList %>% sapply(function(x, pow){x^pow}, pow = 3)
## a b c
## 1 8 27
myList <- list(a=1:2, b=3:5, c=6)
myList %>% lapply( function(x){x^2})
## $a
## [1] 1 4
##
## $b
## [1] 9 16 25
##
## $c
## [1] 36
myList %>% sapply( function(x){x^2})
## $a
## [1] 1 4
##
## $b
## [1] 9 16 25
##
## $c
## [1] 36
(All of these are just examples to illustrate the differences between `sapply` and `lapply`. Of course, if you have a vector, for simple arithmetic you should just use vectorized operations, e.g. `vec^2`.)
The functions `xmlApply` and `xmlSApply` work like `lapply` and `sapply`, except their arguments are XML nodes.
`xmlApply` returns a list (the elements may themselves be XML nodes).
If it can, `xmlSApply` returns a vector or matrix. If not, it also returns a list.
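The difference in return types can be checked directly. The sketch below uses a made-up two-plant document in which the second plant deliberately has fewer children, so that one call cannot be simplified; the names `small_root`, `as_list`, `as_vec`, and `ragged` are illustrative.

```r
library(XML)

# Second PLANT has no ZONE child, so the plants have different numbers of children.
xml_text <- "<CATALOG><PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT><PLANT><COMMON>Columbine</COMMON></PLANT></CATALOG>"
small_root <- xmlRoot(xmlTreeParse(xml_text, asText = TRUE))

get_common <- function(x) xmlValue(x[["COMMON"]])

as_list <- xmlApply(small_root, get_common)    # always a list
as_vec  <- xmlSApply(small_root, get_common)   # one string per plant: simplifies to a vector
ragged  <- xmlSApply(small_root, names)        # results of unequal length: stays a list

is.list(as_list)
is.character(as_vec)
is.list(ragged)
```

When a later step expects a vector, it is worth checking what `xmlSApply` actually returned, since irregular documents silently fall back to a list.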
With all of these functions, always ask yourself
In our plants example, we can use xmlSApply to extract the common names of all the plants.
root
## <CATALOG>
## <PLANT>
## <COMMON>Bloodroot</COMMON>
## <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
## <ZONE>4</ZONE>
## <LIGHT>Mostly Shady</LIGHT>
## <PRICE>$2.44</PRICE>
## <AVAILABILITY>031599</AVAILABILITY>
## </PLANT>
## <PLANT>
## <COMMON>Columbine</COMMON>
## <BOTANICAL>Aquilegia canadensis</BOTANICAL>
## <ZONE>3</ZONE>
## <LIGHT>Mostly Shady</LIGHT>
## <PRICE>$9.37</PRICE>
## <AVAILABILITY>030699</AVAILABILITY>
## </PLANT>
## <PLANT>
## <COMMON>Marsh Marigold</COMMON>
## <BOTANICAL>Caltha palustris</BOTANICAL>
## <ZONE>4</ZONE>
## <LIGHT>Mostly Sunny</LIGHT>
## <PRICE>$6.81</PRICE>
## <AVAILABILITY>051799</AVAILABILITY>
## </PLANT>
## </CATALOG>
commons <- root %>% xmlSApply (function(x){xmlValue(x[['COMMON']])})
commons
## PLANT PLANT PLANT
## "Bloodroot" "Columbine" "Marsh Marigold"
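Discuss with your neighbor how the command above works.
The children of the root node are all PLANT nodes (like `oneplant`), so `xmlSApply` applies the function to each plant in turn and collects the COMMON values into a vector.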
Using the same strategy, we can create a full data frame with all variables.
root[[1]]
## <PLANT>
## <COMMON>Bloodroot</COMMON>
## <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
## <ZONE>4</ZONE>
## <LIGHT>Mostly Shady</LIGHT>
## <PRICE>$2.44</PRICE>
## <AVAILABILITY>031599</AVAILABILITY>
## </PLANT>
getvar <- function(x, var) xmlValue(x[[var]])
res <- names(root[[1]]) %>% lapply(function(var){ root %>% xmlSApply(getvar,var)})
plants <- data.frame(res)
plants
COMMON | BOTANICAL | ZONE | LIGHT | PRICE | AVAILABILITY |
---|---|---|---|---|---|
Bloodroot | Sanguinaria canadensis | 4 | Mostly Shady | $2.44 | 031599 |
Columbine | Aquilegia canadensis | 3 | Mostly Shady | $9.37 | 030699 |
Marsh Marigold | Caltha palustris | 4 | Mostly Sunny | $6.81 | 051799 |
Discuss with your neighbor what the commands above are doing.
Recognize that root %>% xmlSApply(getvar,var) is the earlier command when var='COMMON' (tag names are case-sensitive). Here, however, we let var run over all of the tags. The names of the tags are given by names(root[[1]]). This gives us a named list res, which the last command converts to a data frame.
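To see the two nested loops separately, here is a sketch on a made-up two-plant document: first one column at a time with the tag name fixed, then the same thing for every tag. The names `small_root`, `res_small`, and `small_plants` are illustrative, and the document is parsed from text rather than from a file.

```r
library(XML)

xml_text <- "<CATALOG><PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT><PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT></CATALOG>"
small_root <- xmlRoot(xmlTreeParse(xml_text, asText = TRUE))

getvar <- function(x, var) xmlValue(x[[var]])

# Inner loop only: fix the tag name and loop over the plants.
common <- xmlSApply(small_root, getvar, "COMMON")
zone   <- xmlSApply(small_root, getvar, "ZONE")

# Outer loop added: for each tag name, run the inner loop over the plants.
tags <- names(small_root[[1]])
res_small <- lapply(tags, function(var) xmlSApply(small_root, getvar, var))

# One list element per tag becomes one column per tag.
small_plants <- data.frame(res_small)
small_plants
```

This approach takes the tag names from the first plant only, so it assumes every plant has the same children as the first one.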
names(root[[1]])
## COMMON BOTANICAL ZONE LIGHT PRICE
## "COMMON" "BOTANICAL" "ZONE" "LIGHT" "PRICE"
## AVAILABILITY
## "AVAILABILITY"
res
## $COMMON
## PLANT PLANT PLANT
## "Bloodroot" "Columbine" "Marsh Marigold"
##
## $BOTANICAL
## PLANT PLANT PLANT
## "Sanguinaria canadensis" "Aquilegia canadensis" "Caltha palustris"
##
## $ZONE
## PLANT PLANT PLANT
## "4" "3" "4"
##
## $LIGHT
## PLANT PLANT PLANT
## "Mostly Shady" "Mostly Shady" "Mostly Sunny"
##
## $PRICE
## PLANT PLANT PLANT
## "$2.44" "$9.37" "$6.81"
##
## $AVAILABILITY
## PLANT PLANT PLANT
## "031599" "030699" "051799"
Steps:
1. Load the XML document from the web or your computer.
2. Convert it to a data table.
3. Extract the information you need from the data table.
Here is the full plant_catalog data
http://www.xmlfiles.com/examples/plant_catalog.xml
We can load it into R with
xml <- 'http://www.xmlfiles.com/examples/plant_catalog.xml'
doc1 <- xmlTreeParse(xml)
Find the three cheapest plants requiring Mostly Shady light.
# make XML into a data table
root <- 'http://www.xmlfiles.com/examples/plant_catalog.xml' %>%
xmlTreeParse() %>%
xmlRoot()
getvar <- function(x, var) xmlValue(x[[var]])
res <- names(root[[1]]) %>% lapply(function(var){ root %>% xmlSApply(getvar,var)})
plants <- data.frame(res)
plants
COMMON | BOTANICAL | ZONE | LIGHT | PRICE | AVAILABILITY |
---|---|---|---|---|---|
Bloodroot | Sanguinaria canadensis | 4 | Mostly Shady | $2.44 | 031599 |
Columbine | Aquilegia canadensis | 3 | Mostly Shady | $9.37 | 030699 |
Marsh Marigold | Caltha palustris | 4 | Mostly Sunny | $6.81 | 051799 |
Cowslip | Caltha palustris | 4 | Mostly Shady | $9.90 | 030699 |
Dutchman’s-Breeches | Diecentra cucullaria | 3 | Mostly Shady | $6.44 | 012099 |
Ginger, Wild | Asarum canadense | 3 | Mostly Shady | $9.03 | 041899 |
Hepatica | Hepatica americana | 4 | Mostly Shady | $4.45 | 012699 |
Liverleaf | Hepatica americana | 4 | Mostly Shady | $3.99 | 010299 |
Jack-In-The-Pulpit | Arisaema triphyllum | 4 | Mostly Shady | $3.23 | 020199 |
Mayapple | Podophyllum peltatum | 3 | Mostly Shady | $2.98 | 060599 |
Phlox, Woodland | Phlox divaricata | 3 | Sun or Shade | $2.80 | 012299 |
Phlox, Blue | Phlox divaricata | 3 | Sun or Shade | $5.59 | 021699 |
Spring-Beauty | Claytonia Virginica | 7 | Mostly Shady | $6.59 | 020199 |
Trillium | Trillium grandiflorum | 5 | Sun or Shade | $3.90 | 042999 |
Wake Robin | Trillium grandiflorum | 5 | Sun or Shade | $3.20 | 022199 |
Violet, Dog-Tooth | Erythronium americanum | 4 | Shade | $9.04 | 020199 |
Trout Lily | Erythronium americanum | 4 | Shade | $6.94 | 032499 |
Adder’s-Tongue | Erythronium americanum | 4 | Shade | $9.58 | 041399 |
Anemone | Anemone blanda | 6 | Mostly Shady | $8.86 | 122698 |
Grecian Windflower | Anemone blanda | 6 | Mostly Shady | $9.16 | 071099 |
Bee Balm | Monarda didyma | 4 | Shade | $4.59 | 050399 |
Bergamont | Monarda didyma | 4 | Shade | $7.16 | 042799 |
Black-Eyed Susan | Rudbeckia hirta | Annual | Sunny | $9.80 | 061899 |
Buttercup | Ranunculus | 4 | Shade | $2.57 | 061099 |
Crowfoot | Ranunculus | 4 | Shade | $9.34 | 040399 |
Butterfly Weed | Asclepias tuberosa | Annual | Sunny | $2.78 | 063099 |
Cinquefoil | Potentilla | Annual | Shade | $7.06 | 052599 |
Primrose | Oenothera | 3 - 5 | Sunny | $6.56 | 013099 |
Gentian | Gentiana | 4 | Sun or Shade | $7.81 | 051899 |
Blue Gentian | Gentiana | 4 | Sun or Shade | $8.56 | 050299 |
Jacob’s Ladder | Polemonium caeruleum | Annual | Shade | $9.26 | 022199 |
Greek Valerian | Polemonium caeruleum | Annual | Shade | $4.36 | 071499 |
California Poppy | Eschscholzia californica | Annual | Sun | $7.89 | 032799 |
Shooting Star | Dodecatheon | Annual | Mostly Shady | $8.60 | 051399 |
Snakeroot | Cimicifuga | Annual | Shade | $5.63 | 071199 |
Cardinal Flower | Lobelia cardinalis | 2 | Shade | $3.02 | 022299 |
new_plants <- plants %>% filter(LIGHT == "Mostly Shady")
# Strip the "$" so prices sort numerically, then keep the three cheapest
new_plants_clean <- new_plants %>%
  mutate(cost = as.numeric(gsub("\\$", "", PRICE))) %>%
  arrange(cost) %>%
  select(COMMON, PRICE) %>%
  head(3)
new_plants_clean
COMMON | PRICE |
---|---|
Bloodroot | $2.44 |
Mayapple | $2.98 |
Jack-In-The-Pulpit | $3.23 |
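The price-cleaning step is worth isolating, because values extracted from XML are always text. Sorted as text, a price like "$10.00" would come before "$9.37". The sketch below uses three prices copied from the table and base R only; the variable names `prices` and `cost` are illustrative.

```r
prices <- c("$2.44", "$9.37", "$6.81")

# fixed = TRUE treats "$" as a literal character, so no regex escaping is needed.
cost <- as.numeric(sub("$", "", prices, fixed = TRUE))
cost

# Order the original strings by their numeric value.
prices[order(cost)]
```

Using `fixed = TRUE` is an alternative to escaping the dollar sign as `"\\$"`. Both remove the same character.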