Source file ⇒ lec31.Rmd

Today

  1. processing XML using R
  2. scraping XML data

Last time I showed you how to create an XML file in R. Now I will show you how to read an XML file and make a data table in R.

1. Processing XML using R

Please copy the following XML document into Sublime and save it in your home directory as plant_catalog.xml.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited with XML Spy v2006 (http://www.altova.com) -->
<CATALOG>
     <PLANT>
          <COMMON>Bloodroot</COMMON>
          <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
          <ZONE>4</ZONE>
          <LIGHT>Mostly Shady</LIGHT>
          <PRICE>$2.44</PRICE>
          <AVAILABILITY>031599</AVAILABILITY>
     </PLANT>
     <PLANT>
          <COMMON>Columbine</COMMON>
          <BOTANICAL>Aquilegia canadensis</BOTANICAL>
          <ZONE>3</ZONE>
          <LIGHT>Mostly Shady</LIGHT>
          <PRICE>$9.37</PRICE>
          <AVAILABILITY>030699</AVAILABILITY>
     </PLANT>
     <PLANT>
          <COMMON>Marsh Marigold</COMMON>
          <BOTANICAL>Caltha palustris</BOTANICAL>
          <ZONE>4</ZONE>
          <LIGHT>Mostly Sunny</LIGHT>
          <PRICE>$6.81</PRICE>
          <AVAILABILITY>051799</AVAILABILITY>
     </PLANT>
</CATALOG>

Aside:

R allows for object-oriented (OO) programming. We’re not going to do any of this style of programming ourselves (i.e. defining S4 classes in R, and their slots and methods), but it’s helpful to know how to interpret it when we see it.

For example, you can define a dog class that covers all types of dogs. A class is the blueprint from which objects are created. An object is an instance of a class, for example a particular dog from the dog class. The dog blueprint says what dogs are and what they can do. Dogs come in different types and have different names and ages. Dogs can also do things like bark, play fetch, roll over and even skateboard. My dog is an object of the dog class. He is a bulldog, his name is Max, and he is three years old. He can bark and skateboard.
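The dog analogy can be written down as a small S4 class. This is only a sketch to make the vocabulary concrete: the class name Dog, its slots, and the bark generic are made up for illustration and do not come from any package.

```r
library(methods)  # S4 machinery; attached by default in most R sessions

# Hypothetical class: the blueprint says which slots every dog has
setClass("Dog", representation(name = "character",
                               breed = "character",
                               age = "numeric"))

# Hypothetical generic and method: something every Dog can do
setGeneric("bark", function(dog) standardGeneric("bark"))
setMethod("bark", "Dog", function(dog) paste(dog@name, "says woof"))

# An object is one particular instance of the class
max_dog <- new("Dog", name = "Max", breed = "bulldog", age = 3)

class(max_dog)       # the class the object was created from
slotNames(max_dog)   # the slots defined by the blueprint
bark(max_dog)        # calling a method on the object
```

Slots are read with the @ operator (for example max_dog@age), which is the notation to look for when reading S4 code written by others.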

Data frames are an example of a class in R. An object is a particular data frame you are working with. Data frames come in different sizes, have different types of entries, and have names (in the analogy, these play the role of the slots of the class). Data frames also have methods (i.e. functions) that operate on them, such as ncol, names or head.

To check the class of an object in R, just use class(objectname). The result may be a character vector with several class names when the class inherits methods from other classes.
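A short illustration of class() may help here. The data frame and the pet object below are made up for this example; the second one simply shows what a multi-element class vector looks like.

```r
# A data frame is an object whose class is "data.frame"
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
class(df)

# Functions that operate on data frames
ncol(df)
names(df)

# An object can carry several classes; class() then returns a vector.
# The class names here are hypothetical.
pet <- structure(list(name = "Max"), class = c("bulldog", "dog"))
class(pet)
inherits(pet, "dog")   # TRUE: "dog" is one of the classes of pet
```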

Back to XML:

To read an XML file into R, use xmlTreeParse.

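library(XML)
library(dplyr)  # provides %>%, filter, mutate, arrange and select used below
# Adjust the path to wherever you saved plant_catalog.xml
doc <- xmlTreeParse("/Users/Adam/Desktop/plant_catalog.xml")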
class(doc)
## [1] "XMLDocument"         "XMLAbstractDocument"

The XML package defines a special class, called XMLDocument, to represent an XML tree.

xmlTreeParse implements what is called the DOM (Document Object Model) parser. It reads the entire file into memory.

We don’t have time to cover it, but you should be aware of another parsing model called SAX (Simple API for XML). It reads the document incrementally and is more memory efficient, but it is trickier to use. SAX may be necessary for large XML files.
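For the curious, here is a minimal sketch of the SAX style using xmlEventParse from the XML package. It counts PLANT elements in a small inline document (a cut-down stand-in for the catalog, not the real file). The handler and the counter n_plants are names chosen for this example.

```r
library(XML)

# A small inline document so the example does not depend on a file
txt <- "<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON></PLANT>
  <PLANT><COMMON>Columbine</COMMON></PLANT>
  <PLANT><COMMON>Marsh Marigold</COMMON></PLANT>
</CATALOG>"

n_plants <- 0

# The handler is called once for every opening tag as the parser streams
# through the text; no tree is built in memory
handlers <- list(
  startElement = function(name, attrs, ...) {
    if (name == "PLANT") n_plants <<- n_plants + 1
  }
)

invisible(xmlEventParse(txt, handlers = handlers, asText = TRUE))
n_plants
```

The price of the lower memory use is visible even here: we have to keep our own state (the counter) instead of indexing into a tree.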

The XMLDocument class has a method called xmlRoot() which gets the top-level XML node as a list of lists.

root <- xmlRoot(doc)
class(root)
## [1] "XMLNode"          "RXMLAbstractNode" "XMLAbstractNode" 
## [4] "oldClass"
root
## <CATALOG>
##  <PLANT>
##   <COMMON>Bloodroot</COMMON>
##   <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
##   <ZONE>4</ZONE>
##   <LIGHT>Mostly Shady</LIGHT>
##   <PRICE>$2.44</PRICE>
##   <AVAILABILITY>031599</AVAILABILITY>
##  </PLANT>
##  <PLANT>
##   <COMMON>Columbine</COMMON>
##   <BOTANICAL>Aquilegia canadensis</BOTANICAL>
##   <ZONE>3</ZONE>
##   <LIGHT>Mostly Shady</LIGHT>
##   <PRICE>$9.37</PRICE>
##   <AVAILABILITY>030699</AVAILABILITY>
##  </PLANT>
##  <PLANT>
##   <COMMON>Marsh Marigold</COMMON>
##   <BOTANICAL>Caltha palustris</BOTANICAL>
##   <ZONE>4</ZONE>
##   <LIGHT>Mostly Sunny</LIGHT>
##   <PRICE>$6.81</PRICE>
##   <AVAILABILITY>051799</AVAILABILITY>
##  </PLANT>
## </CATALOG>

root is an object of the XMLNode class.

Each PLANT node in root is a list. We can access an element within a node (i.e., a child) using the usual [[ ]] indexing for lists. The node also has methods that give information about its children.

# Look at the first plant node
oneplant <- root[[1]]
class(oneplant)
## [1] "XMLNode"          "RXMLAbstractNode" "XMLAbstractNode" 
## [4] "oldClass"
oneplant
## <PLANT>
##  <COMMON>Bloodroot</COMMON>
##  <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
##  <ZONE>4</ZONE>
##  <LIGHT>Mostly Shady</LIGHT>
##  <PRICE>$2.44</PRICE>
##  <AVAILABILITY>031599</AVAILABILITY>
## </PLANT>

We can drill down further into the list:

oneplant[['COMMON']]
## <COMMON>Bloodroot</COMMON>

Note that this doesn’t remove the markup, though. To do this, use the function xmlValue.

xmlValue(oneplant[['COMMON']])
## [1] "Bloodroot"
xmlValue(oneplant[['BOTANICAL']])
## [1] "Sanguinaria canadensis"

names(oneplant) returns a named character vector containing the names of its children.

names(oneplant)
##         COMMON      BOTANICAL           ZONE          LIGHT          PRICE 
##       "COMMON"    "BOTANICAL"         "ZONE"        "LIGHT"        "PRICE" 
##   AVAILABILITY 
## "AVAILABILITY"
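The accessors used so far can be tried on a small inline document, so the example runs without the catalog file. The string txt is a shortened stand-in for the catalog; xmlName and xmlSize are two further functions from the XML package that report a node's tag name and its number of children.

```r
library(XML)

txt <- "<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>
</CATALOG>"

# asText = TRUE tells the parser that txt holds XML, not a file name
root <- xmlRoot(xmlTreeParse(txt, asText = TRUE))

xmlName(root)                          # tag name of the node itself
xmlSize(root)                          # number of children
first <- root[[1]]                     # first child, by position
xmlValue(first[["COMMON"]])            # child by tag name, markup stripped
as.numeric(xmlValue(first[["ZONE"]]))  # values are text until converted
```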

Making a data frame from an XML file

To illustrate how we manipulate an XML object in R, we’ll take this data and reformat it into a data frame with one row for each plant.

There are special XML versions of lapply and sapply, named xmlApply and xmlSApply. Each takes an XMLNode object as its primary argument and iterates over the node’s children, invoking the given function on each one.

Like lapply, xmlApply returns a list. Like sapply, xmlSApply returns a simpler data structure if possible.

First, a quick review of sapply and lapply. Remember:

  1. lapply and sapply can operate on either a list or a vector.
1:3 %>% lapply( function(x){x^2}) 
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9
1:3 %>% sapply( function(x){x^2}) 
## [1] 1 4 9
myList <- list(a=1, b=2, c=3) 
myList %>% lapply( function(x){x^2}) 
## $a
## [1] 1
## 
## $b
## [1] 4
## 
## $c
## [1] 9
myList %>% sapply( function(x){x^2})
## a b c 
## 1 4 9
  2. You can always include additional arguments. Examples:
myList %>% sapply(log)
##         a         b         c 
## 0.0000000 0.6931472 1.0986123
myList %>% sapply(log, base = 10)
##         a         b         c 
## 0.0000000 0.3010300 0.4771213
myList %>% sapply(function(x, pow){x^pow}, pow = 3)
##  a  b  c 
##  1  8 27
  3. If the results of sapply cannot be simplified, then sapply and lapply will return the same thing.
myList <- list(a=1:2, b=3:5, c=6)
myList %>% lapply( function(x){x^2})
## $a
## [1] 1 4
## 
## $b
## [1]  9 16 25
## 
## $c
## [1] 36
myList %>% sapply( function(x){x^2})
## $a
## [1] 1 4
## 
## $b
## [1]  9 16 25
## 
## $c
## [1] 36

(All of these are just examples to illustrate the differences between sapply and lapply. Of course if you have a vector, for simple arithmetic you should just use vectorized operations, e.g. vec^2.)
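To make the remark about vectorized operations concrete, here is the squaring example written both ways. The variable names are chosen just for this illustration.

```r
vec <- 1:3

# Vectorized arithmetic: no explicit iteration needed
vec^2

# The sapply version from the review gives the same result
squares <- sapply(vec, function(x) x^2)
identical(squares, vec^2)
```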

The functions xmlApply and xmlSApply work like lapply and sapply, except their arguments are XML nodes.

xmlApply returns a list (the elements may themselves be XML nodes).

If it can, xmlSApply returns a vector or matrix. If not, it also returns a list.
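A small side-by-side comparison, using an inline document so it runs on its own. The string txt is a shortened stand-in for the catalog and get_common is a helper named for this example.

```r
library(XML)

txt <- "<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>
</CATALOG>"
root <- xmlRoot(xmlTreeParse(txt, asText = TRUE))

# Hypothetical helper: pull the common name out of one PLANT node
get_common <- function(node) xmlValue(node[["COMMON"]])

as_list <- xmlApply(root, get_common)    # always a list
as_vec  <- xmlSApply(root, get_common)   # simplified to a character vector

class(as_list)
class(as_vec)
unname(as_vec)   # drop the repeated "PLANT" names to see just the values
```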

With all of these functions, always ask yourself

  1. What do I want to operate on (iterate over)?
  2. What do I want to produce?

In our plants example, we can use xmlSApply to extract the common names of all the plants.

root
## <CATALOG>
##  <PLANT>
##   <COMMON>Bloodroot</COMMON>
##   <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
##   <ZONE>4</ZONE>
##   <LIGHT>Mostly Shady</LIGHT>
##   <PRICE>$2.44</PRICE>
##   <AVAILABILITY>031599</AVAILABILITY>
##  </PLANT>
##  <PLANT>
##   <COMMON>Columbine</COMMON>
##   <BOTANICAL>Aquilegia canadensis</BOTANICAL>
##   <ZONE>3</ZONE>
##   <LIGHT>Mostly Shady</LIGHT>
##   <PRICE>$9.37</PRICE>
##   <AVAILABILITY>030699</AVAILABILITY>
##  </PLANT>
##  <PLANT>
##   <COMMON>Marsh Marigold</COMMON>
##   <BOTANICAL>Caltha palustris</BOTANICAL>
##   <ZONE>4</ZONE>
##   <LIGHT>Mostly Sunny</LIGHT>
##   <PRICE>$6.81</PRICE>
##   <AVAILABILITY>051799</AVAILABILITY>
##  </PLANT>
## </CATALOG>
commons <- root %>% xmlSApply(function(x){xmlValue(x[['COMMON']])})
commons
##            PLANT            PLANT            PLANT 
##      "Bloodroot"      "Columbine" "Marsh Marigold"

Task for you:

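Discuss with your neighbor how the command above works.

The children of the root node are all PLANT nodes (like oneplant). xmlSApply applies the function to each of them in turn, and the function extracts the value of that plant’s COMMON child.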

Using the same strategy, we can create a full data frame with all variables.

root[[1]]
## <PLANT>
##  <COMMON>Bloodroot</COMMON>
##  <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
##  <ZONE>4</ZONE>
##  <LIGHT>Mostly Shady</LIGHT>
##  <PRICE>$2.44</PRICE>
##  <AVAILABILITY>031599</AVAILABILITY>
## </PLANT>
getvar <- function(x, var) xmlValue(x[[var]])
res <- names(root[[1]])  %>% lapply(function(var){ root %>% xmlSApply(getvar,var)})
plants <- data.frame(res)
plants
COMMON          BOTANICAL               ZONE  LIGHT         PRICE  AVAILABILITY
Bloodroot       Sanguinaria canadensis  4     Mostly Shady  $2.44  031599
Columbine       Aquilegia canadensis    3     Mostly Shady  $9.37  030699
Marsh Marigold  Caltha palustris        4     Mostly Sunny  $6.81  051799

Task for you:

Discuss with your neighbor what the commands above are doing.

Recognize that root %>% xmlSApply(getvar, var) is the earlier command for commons when var = 'COMMON' (tag names are case-sensitive). Here, however, we let var run over all of the tags, whose names are given by names(root[[1]]). Because lapply keeps those names, the result res is a named list, which the last command converts to a data frame.
names(root[[1]])
##         COMMON      BOTANICAL           ZONE          LIGHT          PRICE 
##       "COMMON"    "BOTANICAL"         "ZONE"        "LIGHT"        "PRICE" 
##   AVAILABILITY 
## "AVAILABILITY"
res
## $COMMON
##            PLANT            PLANT            PLANT 
##      "Bloodroot"      "Columbine" "Marsh Marigold" 
## 
## $BOTANICAL
##                    PLANT                    PLANT                    PLANT 
## "Sanguinaria canadensis"   "Aquilegia canadensis"       "Caltha palustris" 
## 
## $ZONE
## PLANT PLANT PLANT 
##   "4"   "3"   "4" 
## 
## $LIGHT
##          PLANT          PLANT          PLANT 
## "Mostly Shady" "Mostly Shady" "Mostly Sunny" 
## 
## $PRICE
##   PLANT   PLANT   PLANT 
## "$2.44" "$9.37" "$6.81" 
## 
## $AVAILABILITY
##    PLANT    PLANT    PLANT 
## "031599" "030699" "051799"
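The same steps can be collected into one helper and tried on an inline document. This is a sketch under the assumption that every child of the root has the same tags as the first child. The function name node_to_df and the string txt are made up for this example.

```r
library(XML)

txt <- "<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE><PRICE>$2.44</PRICE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE><PRICE>$9.37</PRICE></PLANT>
</CATALOG>"
root <- xmlRoot(xmlTreeParse(txt, asText = TRUE))

# Hypothetical helper: one row per child of root, one column per tag.
# Assumes all children share the tags of the first child.
node_to_df <- function(root) {
  getvar <- function(x, var) xmlValue(x[[var]])
  tags <- names(root[[1]])   # named character vector of tag names
  # Outer loop: one pass per tag; inner loop: one value per child node
  res <- lapply(tags, function(var) xmlSApply(root, getvar, var))
  data.frame(res, stringsAsFactors = FALSE)
}

plants_small <- node_to_df(root)
dim(plants_small)
sapply(plants_small, class)   # every column is character at this point
```

Note that every column comes back as text, including ZONE and PRICE, so any numeric work needs an explicit conversion first.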

2. Scraping XML documents

Steps:

  1. Load the XML document from the web or your computer.
  2. Convert it to a data table.
  3. Extract the information you need from the data table.

Here is the full plant_catalog data

http://www.xmlfiles.com/examples/plant_catalog.xml

We can load it into R with

xml <- 'http://www.xmlfiles.com/examples/plant_catalog.xml'
doc1 <- xmlTreeParse(xml)

Task for you:

Find the three cheapest plants that require Mostly Shady light.

# make XML into a data table
root <- 'http://www.xmlfiles.com/examples/plant_catalog.xml' %>%
  xmlTreeParse() %>%
  xmlRoot()
getvar <- function(x, var) xmlValue(x[[var]])
res <- names(root[[1]])  %>% lapply(function(var){ root %>% xmlSApply(getvar,var)})
plants <- data.frame(res)
plants
COMMON               BOTANICAL                 ZONE    LIGHT         PRICE  AVAILABILITY
Bloodroot            Sanguinaria canadensis    4       Mostly Shady  $2.44  031599
Columbine            Aquilegia canadensis      3       Mostly Shady  $9.37  030699
Marsh Marigold       Caltha palustris          4       Mostly Sunny  $6.81  051799
Cowslip              Caltha palustris          4       Mostly Shady  $9.90  030699
Dutchman’s-Breeches  Diecentra cucullaria      3       Mostly Shady  $6.44  012099
Ginger, Wild         Asarum canadense          3       Mostly Shady  $9.03  041899
Hepatica             Hepatica americana        4       Mostly Shady  $4.45  012699
Liverleaf            Hepatica americana        4       Mostly Shady  $3.99  010299
Jack-In-The-Pulpit   Arisaema triphyllum       4       Mostly Shady  $3.23  020199
Mayapple             Podophyllum peltatum      3       Mostly Shady  $2.98  060599
Phlox, Woodland      Phlox divaricata          3       Sun or Shade  $2.80  012299
Phlox, Blue          Phlox divaricata          3       Sun or Shade  $5.59  021699
Spring-Beauty        Claytonia Virginica       7       Mostly Shady  $6.59  020199
Trillium             Trillium grandiflorum     5       Sun or Shade  $3.90  042999
Wake Robin           Trillium grandiflorum     5       Sun or Shade  $3.20  022199
Violet, Dog-Tooth    Erythronium americanum    4       Shade         $9.04  020199
Trout Lily           Erythronium americanum    4       Shade         $6.94  032499
Adder’s-Tongue       Erythronium americanum    4       Shade         $9.58  041399
Anemone              Anemone blanda            6       Mostly Shady  $8.86  122698
Grecian Windflower   Anemone blanda            6       Mostly Shady  $9.16  071099
Bee Balm             Monarda didyma            4       Shade         $4.59  050399
Bergamont            Monarda didyma            4       Shade         $7.16  042799
Black-Eyed Susan     Rudbeckia hirta           Annual  Sunny         $9.80  061899
Buttercup            Ranunculus                4       Shade         $2.57  061099
Crowfoot             Ranunculus                4       Shade         $9.34  040399
Butterfly Weed       Asclepias tuberosa        Annual  Sunny         $2.78  063099
Cinquefoil           Potentilla                Annual  Shade         $7.06  052599
Primrose             Oenothera                 3 - 5   Sunny         $6.56  013099
Gentian              Gentiana                  4       Sun or Shade  $7.81  051899
Blue Gentian         Gentiana                  4       Sun or Shade  $8.56  050299
Jacob’s Ladder       Polemonium caeruleum      Annual  Shade         $9.26  022199
Greek Valerian       Polemonium caeruleum      Annual  Shade         $4.36  071499
California Poppy     Eschscholzia californica  Annual  Sun           $7.89  032799
Shooting Star        Dodecatheon               Annual  Mostly Shady  $8.60  051399
Snakeroot            Cimicifuga                Annual  Shade         $5.63  071199
Cardinal Flower      Lobelia cardinalis        2       Shade         $3.02  022299
new_plants <- plants %>% filter(LIGHT == "Mostly Shady")
# Strip the dollar sign so prices sort as numbers, then keep the three cheapest
new_plants_clean <- new_plants %>%
  mutate(cost = as.numeric(gsub("\\$", "", PRICE))) %>%
  arrange(cost) %>%
  select(COMMON, PRICE) %>%
  head(3)
new_plants_clean 
COMMON              PRICE
Bloodroot           $2.44
Mayapple            $2.98
Jack-In-The-Pulpit  $3.23
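The price-cleaning step is worth looking at on its own. The sketch below uses three prices copied from the catalog; the variable names are chosen for this example. It shows why the dollar sign has to be removed before converting, and why the pattern is written "\\$": a bare $ in a regular expression means "end of string", so it must be escaped to match a literal dollar sign.

```r
prices <- c("$2.44", "$9.37", "$2.98")

# Converting directly fails: "$" is not part of a number, so we get NA
raw <- suppressWarnings(as.numeric(prices))
raw

# Remove the literal dollar sign first, then convert
cost <- as.numeric(gsub("\\$", "", prices))
cost

# Positions of the prices from cheapest to most expensive
order(cost)
```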