lec31

Today

processing XML using R
data scraping XML data

Last time I showed you how to create an XML file in R. Now I will show you how to read an XML file and make a data table in R.

1. Processing XML using R

Please copy the following XML document to Sublime and save it in your home directory as plant_catalog.xml.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited with XML Spy v2006 (http://www.altova.com) -->
<CATALOG>
     <PLANT>
          <COMMON>Bloodroot</COMMON>
          <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
          <ZONE>4</ZONE>
          <LIGHT>Mostly Shady</LIGHT>
          <PRICE>$2.44</PRICE>
          <AVAILABILITY>031599</AVAILABILITY>
     </PLANT>
     <PLANT>
          <COMMON>Columbine</COMMON>
          <BOTANICAL>Aquilegia canadensis</BOTANICAL>
          <ZONE>3</ZONE>
          <LIGHT>Mostly Shady</LIGHT>
          <PRICE>$9.37</PRICE>
          <AVAILABILITY>030699</AVAILABILITY>
     </PLANT>
     <PLANT>
          <COMMON>Marsh Marigold</COMMON>
          <BOTANICAL>Caltha palustris</BOTANICAL>
          <ZONE>4</ZONE>
          <LIGHT>Mostly Sunny</LIGHT>
          <PRICE>$6.81</PRICE>
          <AVAILABILITY>051799</AVAILABILITY>
     </PLANT>
</CATALOG>

Aside:

R allows for object-oriented (OO) programming. We’re not going to do any of this style of programming ourselves (i.e. defining S4 classes in R, and their slots and methods), but it’s helpful to know how to interpret it when we see it.

For example, you can define a dog class consisting of all types of dogs. A class is the blueprint from which all objects are created. An object is an instance of a class, for example a particular dog from the dog class. The dog blueprint says what dogs are and what they can do. Dogs come in different types, and have different names and ages. Dogs can also do things like bark, play fetch, roll over and even skateboard. My dog is an object of the dog class. He is a bulldog, his name is Max, and he is three years old. He can bark and skateboard.
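To make the dog analogy concrete, here is a minimal S4 sketch. The class name Dog, its slots, and the bark generic are all made up for illustration; we won't write code like this ourselves, but it shows what slots and methods look like when you meet them.

```r
library(methods)  # S4 machinery (attached by default in an interactive session)

# Hypothetical class: the slots describe what a dog *is*.
setClass("Dog", representation(name  = "character",
                               breed = "character",
                               age   = "numeric"))

# A generic plus a method describe what a dog can *do*.
setGeneric("bark", function(dog) standardGeneric("bark"))
setMethod("bark", "Dog", function(dog) paste(dog@name, "says woof"))

# An object is one particular instance of the class.
my_dog <- new("Dog", name = "Max", breed = "bulldog", age = 3)

is(my_dog, "Dog")   # TRUE
my_dog@age          # slots are read with @, here 3
bark(my_dog)        # "Max says woof"
```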

Data frames are an example of a class in R. An object is a particular data frame you are working with. Data frames come in different sizes, have different types of entries, and have names (properties like these play the role that slots play in an S4 class). Data frames also have methods (i.e. functions) which operate on them, like ncol, names or head.

To check the class of an object in R, just use class(objectname). The result may be a character vector with more than one element when the object's class inherits methods from other classes.
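A quick base R sketch of both cases: a data frame with a single class, and a date-time object whose class vector has two entries. The variable names df and now are just for illustration.

```r
# A data frame has a single class.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
class(df)                # "data.frame"
ncol(df)                 # 2
names(df)                # "x" "y"

# A date-time object reports more than one class.
now <- Sys.time()
class(now)               # "POSIXct" "POSIXt"
inherits(now, "POSIXt")  # TRUE
```

inherits() is a convenient way to ask whether a given class appears anywhere in that vector.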

Back to XML:

To read an XML file into R, use xmlTreeParse.

library(XML)
doc <- xmlTreeParse("/Users/Adam/Desktop/plant_catalog.xml")
class(doc)

## [1] "XMLDocument"         "XMLAbstractDocument"

The XML package defines a special class, called XMLDocument, to represent an XML tree.

xmlTreeParse implements what is called the DOM (Document Object Model) parser. It reads the entire file into memory.

We don’t have time to cover it, but you should be aware of another parsing model called SAX (Simple API for XML). It reads the document incrementally and is more memory efficient, but it is trickier to use. This might be necessary for large XML files.
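For the curious, here is a rough sketch of what SAX-style parsing can look like with xmlEventParse from the XML package. The helper collect_common and the small inline catalog are made up for illustration, and the handlers assume each COMMON element arrives as a single piece of text, which should hold for short values like these.

```r
library(XML)

# Small inline catalog so the sketch runs without a file.
xml_text <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>",
  "</CATALOG>")

# Hypothetical helper: collect the text of every COMMON element
# as the parser streams past it, without building the whole tree.
collect_common <- function(txt) {
  found  <- character(0)
  inside <- FALSE
  handlers <- list(
    startElement = function(name, attrs, ...) if (name == "COMMON") inside <<- TRUE,
    text         = function(content, ...) if (inside) found <<- c(found, content),
    endElement   = function(name, ...) if (name == "COMMON") inside <<- FALSE
  )
  xmlEventParse(txt, handlers = handlers, asText = TRUE, addContext = FALSE)
  found
}

collect_common(xml_text)
```

Notice that we have to track by hand whether we are inside a COMMON element. That bookkeeping is what makes SAX trickier than the DOM approach.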

The XMLDocument class has a method called xmlRoot() which gets the top-level XML node as a list of lists.

root <- xmlRoot(doc)
class(root)

## [1] "XMLNode"          "RXMLAbstractNode" "XMLAbstractNode" 
## [4] "oldClass"

root

## <CATALOG>
##  <PLANT>
##   <COMMON>Bloodroot</COMMON>
##   <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
##   <ZONE>4</ZONE>
##   <LIGHT>Mostly Shady</LIGHT>
##   <PRICE>$2.44</PRICE>
##   <AVAILABILITY>031599</AVAILABILITY>
##  </PLANT>
##  <PLANT>
##   <COMMON>Columbine</COMMON>
##   <BOTANICAL>Aquilegia canadensis</BOTANICAL>
##   <ZONE>3</ZONE>
##   <LIGHT>Mostly Shady</LIGHT>
##   <PRICE>$9.37</PRICE>
##   <AVAILABILITY>030699</AVAILABILITY>
##  </PLANT>
##  <PLANT>
##   <COMMON>Marsh Marigold</COMMON>
##   <BOTANICAL>Caltha palustris</BOTANICAL>
##   <ZONE>4</ZONE>
##   <LIGHT>Mostly Sunny</LIGHT>
##   <PRICE>$6.81</PRICE>
##   <AVAILABILITY>051799</AVAILABILITY>
##  </PLANT>
## </CATALOG>

root is an object of the XMLNode class.

Each PLANT node in root is a list. We can access an element within a node (i.e., a child) using the usual [[ ]] indexing for lists. The XMLNode class also has methods that give you information about the child nodes of the root.

# Look at the first plant node
oneplant <- root[[1]]
class(oneplant)

## [1] "XMLNode"          "RXMLAbstractNode" "XMLAbstractNode" 
## [4] "oldClass"

oneplant

## <PLANT>
##  <COMMON>Bloodroot</COMMON>
##  <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
##  <ZONE>4</ZONE>
##  <LIGHT>Mostly Shady</LIGHT>
##  <PRICE>$2.44</PRICE>
##  <AVAILABILITY>031599</AVAILABILITY>
## </PLANT>

We can drill down further into the list:

oneplant[['COMMON']]

## <COMMON>Bloodroot</COMMON>

Note that this doesn’t remove the markup, though. To do this, use the function xmlValue.

xmlValue(oneplant[['COMMON']])

## [1] "Bloodroot"

xmlValue(oneplant[['BOTANICAL']])

## [1] "Sanguinaria canadensis"

Calling names(oneplant) returns a named character vector of the names of its children.

names(oneplant)

##         COMMON      BOTANICAL           ZONE          LIGHT          PRICE 
##       "COMMON"    "BOTANICAL"         "ZONE"        "LIGHT"        "PRICE" 
##   AVAILABILITY 
## "AVAILABILITY"
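The accessors so far can be collected into one short sketch. It uses a small inline two-plant catalog (made up from the example above, parsed with asText = TRUE) so that it runs without the file on disk, and adds xmlName, xmlSize and xmlChildren, which are also part of the XML package.

```r
library(XML)

# Inline two-plant catalog, so no file is needed.
xml_text <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>",
  "</CATALOG>")
root <- xmlRoot(xmlTreeParse(xml_text, asText = TRUE))

xmlName(root)                # "CATALOG": the tag name of the node
xmlSize(root)                # 2: the number of children

first <- root[[1]]           # first PLANT node
unname(names(first))         # "COMMON" "ZONE"
xmlValue(first[["ZONE"]])    # "4": text content, markup removed
length(xmlChildren(first))   # 2: the children as a plain list
```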

Making a data frame from an XML file

To illustrate how we manipulate an XML object in R, we’ll take this data and reformat it into a data frame with one row for each plant.

There are special XML versions of lapply and sapply, named xmlApply and xmlSApply. Each takes an XMLNode object as its primary argument. They iterate over the node’s children, invoking the given function on each one.

Like lapply, xmlApply returns a list. Like sapply, xmlSApply returns a simpler data structure if possible.

First, a quick review of sapply and lapply. Remember:

lapply and sapply can operate on either a list or a vector.

library(dplyr)  # provides the %>% pipe (and filter, mutate, arrange, select used later)
1:3 %>% lapply( function(x){x^2})

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9

1:3 %>% sapply( function(x){x^2})

## [1] 1 4 9

myList <- list(a=1, b=2, c=3) 
myList %>% lapply( function(x){x^2})

## $a
## [1] 1
## 
## $b
## [1] 4
## 
## $c
## [1] 9

myList %>% sapply( function(x){x^2})

## a b c 
## 1 4 9

You can always include additional arguments. Examples:

myList %>% sapply(log)

##         a         b         c 
## 0.0000000 0.6931472 1.0986123

myList %>% sapply(log, base = 10)

##         a         b         c 
## 0.0000000 0.3010300 0.4771213

myList %>% sapply(function(x, pow){x^pow}, pow = 3)

##  a  b  c 
##  1  8 27

If the results of sapply cannot be simplified, then sapply and lapply will return the same thing.

myList <- list(a=1:2, b=3:5, c=6)
myList %>% lapply( function(x){x^2})

## $a
## [1] 1 4
## 
## $b
## [1]  9 16 25
## 
## $c
## [1] 36

myList %>% sapply( function(x){x^2})

## $a
## [1] 1 4
## 
## $b
## [1]  9 16 25
## 
## $c
## [1] 36

(All of these are just examples to illustrate the differences between sapply and lapply. Of course if you have a vector, for simple arithmetic you should just use vectorized operations, e.g. vec^2.)

The functions xmlApply and xmlSApply work like lapply and sapply, except their arguments are XML nodes.

xmlApply returns a list (the elements may themselves be XML nodes).

If it can, xmlSApply returns a vector or matrix. If not, it also returns a list.
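Here is a small sketch of the three possible shapes, again on a made-up inline catalog rather than the file: xmlApply gives a list, xmlSApply gives a named vector when each result is a single value, and a matrix when each result is a vector of the same length.

```r
library(XML)

xml_text <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>",
  "</CATALOG>")
root <- xmlRoot(xmlTreeParse(xml_text, asText = TRUE))

# xmlApply: always a list, one element per child of root.
as_list <- xmlApply(root, xmlSize)

# xmlSApply: one number per child simplifies to a named vector.
as_vec <- xmlSApply(root, xmlSize)

# One vector of values per child simplifies to a matrix
# (rows are the tags, columns are the plants).
m <- xmlSApply(root, function(p) xmlSApply(p, xmlValue))

is.list(as_list)    # TRUE
as_vec              # 2 2, both named PLANT
dim(m)              # 2 2
m["COMMON", 2]      # "Columbine"
```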

With all of these functions, always ask yourself

What do I want to operate on (iterate over)?
What do I want to produce?

In our plants example, we can use xmlSApply to extract the common names of all the plants.

root

## <CATALOG>
##  <PLANT>
##   <COMMON>Bloodroot</COMMON>
##   <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
##   <ZONE>4</ZONE>
##   <LIGHT>Mostly Shady</LIGHT>
##   <PRICE>$2.44</PRICE>
##   <AVAILABILITY>031599</AVAILABILITY>
##  </PLANT>
##  <PLANT>
##   <COMMON>Columbine</COMMON>
##   <BOTANICAL>Aquilegia canadensis</BOTANICAL>
##   <ZONE>3</ZONE>
##   <LIGHT>Mostly Shady</LIGHT>
##   <PRICE>$9.37</PRICE>
##   <AVAILABILITY>030699</AVAILABILITY>
##  </PLANT>
##  <PLANT>
##   <COMMON>Marsh Marigold</COMMON>
##   <BOTANICAL>Caltha palustris</BOTANICAL>
##   <ZONE>4</ZONE>
##   <LIGHT>Mostly Sunny</LIGHT>
##   <PRICE>$6.81</PRICE>
##   <AVAILABILITY>051799</AVAILABILITY>
##  </PLANT>
## </CATALOG>

commons <- root %>% xmlSApply (function(x){xmlValue(x[['COMMON']])})
commons

##            PLANT            PLANT            PLANT 
##      "Bloodroot"      "Columbine" "Marsh Marigold"

Task for you:

Discuss with your neighbor how the command above works.

The elements of the root node are all plant nodes (like `oneplant`), so the function is applied to each plant in turn and pulls out the text of its COMMON child.
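If it helps the discussion, the xmlSApply call can be unrolled into an explicit loop over the children. This sketch uses a made-up inline catalog so it runs on its own; commons_loop is just an illustrative name.

```r
library(XML)

xml_text <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>",
  "</CATALOG>")
root <- xmlRoot(xmlTreeParse(xml_text, asText = TRUE))

# The one-liner: apply the function to every child of root.
commons <- xmlSApply(root, function(x) xmlValue(x[["COMMON"]]))

# The same thing written out: visit each PLANT node by position.
commons_loop <- character(xmlSize(root))
for (i in seq_len(xmlSize(root))) {
  plant <- root[[i]]                              # i-th PLANT node
  commons_loop[i] <- xmlValue(plant[["COMMON"]])  # text of its COMMON child
}

identical(unname(commons), commons_loop)  # TRUE
```

The only difference is that xmlSApply labels each result with the name of the child it came from (PLANT).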

Using the same strategy, we can create a full dataframe with all variables.

root[[1]]

## <PLANT>
##  <COMMON>Bloodroot</COMMON>
##  <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
##  <ZONE>4</ZONE>
##  <LIGHT>Mostly Shady</LIGHT>
##  <PRICE>$2.44</PRICE>
##  <AVAILABILITY>031599</AVAILABILITY>
## </PLANT>

getvar <- function(x, var) xmlValue(x[[var]])
res <- names(root[[1]])  %>% lapply(function(var){ root %>% xmlSApply(getvar,var)})
plants <- data.frame(res)
plants

COMMON	BOTANICAL	ZONE	LIGHT	PRICE	AVAILABILITY
Bloodroot	Sanguinaria canadensis	4	Mostly Shady	$2.44	031599
Columbine	Aquilegia canadensis	3	Mostly Shady	$9.37	030699
Marsh Marigold	Caltha palustris	4	Mostly Sunny	$6.81	051799

Task for you:

Discuss with your neighbor what the commands above are doing.

Recognize that root %>% xmlSApply(getvar, var) is the earlier command that extracted the common names when var = 'COMMON' (tag names are case sensitive). Here, however, we let var range over all of the tags, whose names are given by names(root[[1]]). Because lapply keeps those names, this gives us a named list res, which the last command converts to a data frame.

names(root[[1]])

##         COMMON      BOTANICAL           ZONE          LIGHT          PRICE 
##       "COMMON"    "BOTANICAL"         "ZONE"        "LIGHT"        "PRICE" 
##   AVAILABILITY 
## "AVAILABILITY"

res

## $COMMON
##            PLANT            PLANT            PLANT 
##      "Bloodroot"      "Columbine" "Marsh Marigold" 
## 
## $BOTANICAL
##                    PLANT                    PLANT                    PLANT 
## "Sanguinaria canadensis"   "Aquilegia canadensis"       "Caltha palustris" 
## 
## $ZONE
## PLANT PLANT PLANT 
##   "4"   "3"   "4" 
## 
## $LIGHT
##          PLANT          PLANT          PLANT 
## "Mostly Shady" "Mostly Shady" "Mostly Sunny" 
## 
## $PRICE
##   PLANT   PLANT   PLANT 
## "$2.44" "$9.37" "$6.81" 
## 
## $AVAILABILITY
##    PLANT    PLANT    PLANT 
## "031599" "030699" "051799"
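For flat catalogs like this one, the XML package also has a convenience function, xmlToDataFrame, that can serve as a cross-check. This is an alternative, not the approach used above: the sketch parses a made-up inline catalog with xmlParse (which builds an internal document) and assumes every PLANT has the same simple children.

```r
library(XML)

xml_text <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE><PRICE>$2.44</PRICE></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE><PRICE>$9.37</PRICE></PLANT>",
  "</CATALOG>")

# One row per child of the root, one column per tag.
doc <- xmlParse(xml_text, asText = TRUE)
plants_alt <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

names(plants_alt)   # "COMMON" "ZONE" "PRICE"
nrow(plants_alt)    # 2
```

Building the data frame by hand, as we did above, is still worth knowing: it keeps working when the nodes are nested or you only want some of the tags.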

Data Scraping XML documents

Steps:

1. Load the XML document from the web or your computer.
2. Convert it to a data table.
3. Extract the information you need from the data table.

Here is the full plant_catalog data:

http://www.xmlfiles.com/examples/plant_catalog.xml

We can load it into R with

xml <- 'http://www.xmlfiles.com/examples/plant_catalog.xml'
doc1 <- xmlTreeParse(xml)

Task for you:

Find the top 3 cheapest plants requiring Mostly Shady light.

# make XML into a data table
root <- 'http://www.xmlfiles.com/examples/plant_catalog.xml' %>%
  xmlTreeParse() %>%
  xmlRoot()
getvar <- function(x, var) xmlValue(x[[var]])
res <- names(root[[1]])  %>% lapply(function(var){ root %>% xmlSApply(getvar,var)})
plants <- data.frame(res)
plants

COMMON	BOTANICAL	ZONE	LIGHT	PRICE	AVAILABILITY
Bloodroot	Sanguinaria canadensis	4	Mostly Shady	$2.44	031599
Columbine	Aquilegia canadensis	3	Mostly Shady	$9.37	030699
Marsh Marigold	Caltha palustris	4	Mostly Sunny	$6.81	051799
Cowslip	Caltha palustris	4	Mostly Shady	$9.90	030699
Dutchman’s-Breeches	Diecentra cucullaria	3	Mostly Shady	$6.44	012099
Ginger, Wild	Asarum canadense	3	Mostly Shady	$9.03	041899
Hepatica	Hepatica americana	4	Mostly Shady	$4.45	012699
Liverleaf	Hepatica americana	4	Mostly Shady	$3.99	010299
Jack-In-The-Pulpit	Arisaema triphyllum	4	Mostly Shady	$3.23	020199
Mayapple	Podophyllum peltatum	3	Mostly Shady	$2.98	060599
Phlox, Woodland	Phlox divaricata	3	Sun or Shade	$2.80	012299
Phlox, Blue	Phlox divaricata	3	Sun or Shade	$5.59	021699
Spring-Beauty	Claytonia Virginica	7	Mostly Shady	$6.59	020199
Trillium	Trillium grandiflorum	5	Sun or Shade	$3.90	042999
Wake Robin	Trillium grandiflorum	5	Sun or Shade	$3.20	022199
Violet, Dog-Tooth	Erythronium americanum	4	Shade	$9.04	020199
Trout Lily	Erythronium americanum	4	Shade	$6.94	032499
Adder’s-Tongue	Erythronium americanum	4	Shade	$9.58	041399
Anemone	Anemone blanda	6	Mostly Shady	$8.86	122698
Grecian Windflower	Anemone blanda	6	Mostly Shady	$9.16	071099
Bee Balm	Monarda didyma	4	Shade	$4.59	050399
Bergamont	Monarda didyma	4	Shade	$7.16	042799
Black-Eyed Susan	Rudbeckia hirta	Annual	Sunny	$9.80	061899
Buttercup	Ranunculus	4	Shade	$2.57	061099
Crowfoot	Ranunculus	4	Shade	$9.34	040399
Butterfly Weed	Asclepias tuberosa	Annual	Sunny	$2.78	063099
Cinquefoil	Potentilla	Annual	Shade	$7.06	052599
Primrose	Oenothera	3 - 5	Sunny	$6.56	013099
Gentian	Gentiana	4	Sun or Shade	$7.81	051899
Blue Gentian	Gentiana	4	Sun or Shade	$8.56	050299
Jacob’s Ladder	Polemonium caeruleum	Annual	Shade	$9.26	022199
Greek Valerian	Polemonium caeruleum	Annual	Shade	$4.36	071499
California Poppy	Eschscholzia californica	Annual	Sun	$7.89	032799
Shooting Star	Dodecatheon	Annual	Mostly Shady	$8.60	051399
Snakeroot	Cimicifuga	Annual	Shade	$5.63	071199
Cardinal Flower	Lobelia cardinalis	2	Shade	$3.02	022299

new_plants <- plants %>% filter(LIGHT == "Mostly Shady")
new_plants_clean <- new_plants %>%
  mutate(cost = as.numeric(gsub("\\$", "", PRICE))) %>%  # strip "$" so prices sort as numbers
  arrange(cost) %>% select(COMMON, PRICE) %>% head(3)
new_plants_clean

COMMON	PRICE
Bloodroot	$2.44
Mayapple	$2.98
Jack-In-The-Pulpit	$3.23
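The same filter, clean and sort steps can be sketched in base R, which may help if dplyr is not loaded. The small data frame below is typed in by hand from a few rows of the table above, so that the sketch runs without the download.

```r
# A few rows copied from the catalog, entered by hand.
plants <- data.frame(
  COMMON = c("Bloodroot", "Columbine", "Marsh Marigold",
             "Mayapple", "Jack-In-The-Pulpit"),
  LIGHT  = c("Mostly Shady", "Mostly Shady", "Mostly Sunny",
             "Mostly Shady", "Mostly Shady"),
  PRICE  = c("$2.44", "$9.37", "$6.81", "$2.98", "$3.23"),
  stringsAsFactors = FALSE)

# 1. Keep only the Mostly Shady plants.
shady <- plants[plants$LIGHT == "Mostly Shady", ]

# 2. Strip the dollar sign so the prices sort as numbers, not text.
shady$cost <- as.numeric(sub("$", "", shady$PRICE, fixed = TRUE))

# 3. Sort by cost and keep the three cheapest.
top3 <- head(shady[order(shady$cost), c("COMMON", "PRICE")], 3)
top3$COMMON   # "Bloodroot" "Mayapple" "Jack-In-The-Pulpit"
```

The conversion in step 2 matters: as text, a price such as "$10.00" would sort before "$2.44".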