Source file ⇒ 2017-lec23.Rmd
Last time I showed you how to create an XML file in R. Now I will show you how to read an XML file and make a data table in R.
The basic idea of object oriented programming (OOP) is to view a complex system as the interaction of simpler objects. You can think of an object as a sort of active data type that combines data and operations. To put it simply, objects know stuff (they contain data) and they can do stuff (they have operations called methods). Every object is an instance of some class.
For example, you can define a dog class consisting of a blueprint of what a dog is. Dogs come in different types and have different names and ages (these are things dogs know, called slots). Dogs can also do things like bark, play fetch, roll over, and even skateboard (these are things dogs do, called methods). My dog is an object of the dog class. He is a bulldog, his name is Max, and he is three years old. He can bark, drool, and skateboard.
Data frames are an example of a class in R. Data frames come in different sizes, have different types of entries, and have names (these are things that data frames know). Data frames also have methods (i.e., functions) that operate on them, like `ncol`, `names`, or `head` (these are things that data frames do). An object is a particular data frame you are working with.
mydf <- data.frame(c(1,2),c(2,3)) # the object mydf is an instance of the data.frame class
names(mydf) <- c("x","y")
mydf
x | y |
---|---|
1 | 2 |
2 | 3 |
ncol(mydf)
## [1] 2
class(mydf) # the name of the class of the object mydf
## [1] "data.frame"
typeof(mydf) # how R internally stores the object mydf
## [1] "list"
mydf[["x"]]
## [1] 1 2
The function `class()` returns a character vector, which allows an object to inherit from multiple classes. The vector is ordered from the most specific class to the most general one, so a new subclass goes first.
class(mydf) <- c("super.data.frame", class(mydf))
class(mydf)
## [1] "super.data.frame" "data.frame"
Now an object of class “super.data.frame” knows the things that class `data.frame` knows and can do the things that class `data.frame` does: when no method is defined for “super.data.frame”, R falls back to the `data.frame` method. We say that the class “super.data.frame” inherits from the class “data.frame”.
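A minimal sketch of how this inheritance plays out in S3 method dispatch, using only base R. The class name `super.data.frame` comes from the text; the `summary` method defined here is a hypothetical illustration, not something R provides.

```r
# Build a data frame and put the (hypothetical) subclass first in the class vector
mydf <- data.frame(x = c(1, 2), y = c(2, 3))
class(mydf) <- c("super.data.frame", class(mydf))

# A method written only for the subclass. NextMethod() hands off to the
# data.frame method, so the parent's behaviour is reused rather than rewritten.
summary.super.data.frame <- function(object, ...) {
  cat("A super data frame with", nrow(object), "rows\n")
  NextMethod()
}

inherits(mydf, "data.frame")   # the object still counts as a data frame
ncol(mydf)                     # functions written for data frames keep working
s <- summary(mydf)             # dispatches to summary.super.data.frame first
s
```

The order of the class vector matters: R tries the classes from left to right, so the subclass method runs first and the `data.frame` method serves as the fallback.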
R has a very sloppy system of object-oriented (OO) programming called `S3` and a more formal system (closer to classes in Python) called `S4`. A graduate course in R such as Stats 243 teaches this material, or you can read about it here.
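To make the dog example concrete, here is a rough sketch of the same class written in both systems. The names `new_dog`, `bark`, `Dog`, and `speak` are hypothetical and chosen only for illustration.

```r
# --- S3: a class is just an attribute attached to an ordinary list ---
new_dog <- function(name, breed, age) {
  structure(list(name = name, breed = breed, age = age), class = "dog")
}
bark <- function(x, ...) UseMethod("bark")            # generic function
bark.dog <- function(x, ...) paste(x$name, "says woof")

max_s3 <- new_dog("Max", "bulldog", 3)
bark(max_s3)

# --- S4: slots and their types are declared up front and checked ---
library(methods)
setClass("Dog", representation(name = "character", breed = "character", age = "numeric"))
setGeneric("speak", function(x, ...) standardGeneric("speak"))
setMethod("speak", "Dog", function(x, ...) paste(x@name, "says woof"))

max_s4 <- new("Dog", name = "Max", breed = "bulldog", age = 3)
speak(max_s4)

# S4 refuses a slot of the wrong type; S3 would have accepted it silently
bad <- tryCatch(
  new("Dog", name = 1, breed = "bulldog", age = 3),
  error = function(e) "rejected"
)
bad
```

Nothing stops you from building an S3 "dog" with nonsense fields, which is the sense in which S3 is informal. The S4 version trades some extra typing for that validation.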
Our knowledge of classes and OOP will help us speak about the next topic.
Please copy the following XML document into Sublime and save it in your home directory or on your Desktop as plant.xml.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited with XML Spy v2006 (http://www.altova.com) -->
<CATALOG>
<PLANT>
<COMMON>Bloodroot</COMMON>
<BOTANICAL>Sanguinaria canadensis</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$2.44</PRICE>
<AVAILABILITY>031599</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Columbine</COMMON>
<BOTANICAL>Aquilegia canadensis</BOTANICAL>
<ZONE>3</ZONE>
<LIGHT>Mostly Shady</LIGHT>
<PRICE>$9.37</PRICE>
<AVAILABILITY>030699</AVAILABILITY>
</PLANT>
<PLANT>
<COMMON>Marsh Marigold</COMMON>
<BOTANICAL>Caltha palustris</BOTANICAL>
<ZONE>4</ZONE>
<LIGHT>Mostly Sunny</LIGHT>
<PRICE>$6.81</PRICE>
<AVAILABILITY>051799</AVAILABILITY>
</PLANT>
</CATALOG>
To read an XML file into R, use `xmlTreeParse`.
library(XML)
doc <- xmlTreeParse("/Users/Adam/Desktop/Stat133_S17/lectures/plant.xml")
class(doc)
## [1] "XMLDocument" "XMLAbstractDocument"
The XML package defines a special class, `XMLDocument`, to represent an XML tree.
`xmlTreeParse` implements what is called the `DOM` (Document Object Model) parser. It reads the entire file into memory.
We don’t have time to cover it, but you should be aware of another parsing model called `SAX` (Simple API for XML). It reads the document incrementally and is more memory efficient, but it is trickier to use. It may be necessary for large XML files.
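For the curious, here is a small sketch of what event-driven parsing looks like with the XML package's `xmlEventParse`. It counts `PLANT` elements without ever building the tree. The inline document is an abridged stand-in for plant.xml, and the variable names are hypothetical.

```r
library(XML)

# A small inline document so the sketch is self-contained (abridged from plant.xml)
txt <- "<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>
  <PLANT><COMMON>Marsh Marigold</COMMON><ZONE>4</ZONE></PLANT>
</CATALOG>"

# The handler is called once per opening tag. The counter is the only
# state we keep, so memory use does not grow with the size of the file.
n_plants <- 0
handlers <- list(
  startElement = function(name, attrs, ...) {
    if (name == "PLANT") n_plants <<- n_plants + 1
  }
)

invisible(xmlEventParse(txt, handlers = handlers, asText = TRUE, addContext = FALSE))
n_plants
```

The trade-off is visible even in this toy: you get no tree to index into afterwards, so all bookkeeping has to happen inside the handlers.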
The `XMLDocument` class has a method called `xmlRoot()`, which returns the top-level XML node as a list of lists.
root <- xmlRoot(doc)
class(root)
## [1] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode"
## [4] "oldClass"
typeof(root)
## [1] "list"
root
## <CATALOG>
## <PLANT>
## <COMMON>Bloodroot</COMMON>
## <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
## <ZONE>4</ZONE>
## <LIGHT>Mostly Shady</LIGHT>
## <PRICE>$2.44</PRICE>
## <AVAILABILITY>031599</AVAILABILITY>
## </PLANT>
## <PLANT>
## <COMMON>Columbine</COMMON>
## <BOTANICAL>Aquilegia canadensis</BOTANICAL>
## <ZONE>3</ZONE>
## <LIGHT>Mostly Shady</LIGHT>
## <PRICE>$9.37</PRICE>
## <AVAILABILITY>030699</AVAILABILITY>
## </PLANT>
## <PLANT>
## <COMMON>Marsh Marigold</COMMON>
## <BOTANICAL>Caltha palustris</BOTANICAL>
## <ZONE>4</ZONE>
## <LIGHT>Mostly Sunny</LIGHT>
## <PRICE>$6.81</PRICE>
## <AVAILABILITY>051799</AVAILABILITY>
## </PLANT>
## </CATALOG>
`root` is an object of the `XMLNode` class.
Each PLANT node in `root` is stored in R as a list. We can access an element within a node (i.e., a child) using the usual `[[ ]]` indexing for lists. The `XMLNode` class also has methods that give you information about the children of a node.
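A few of those methods, sketched on a small inline document (an abridged stand-in for plant.xml; the names `demo_txt`, `demo_root`, and `kids` are hypothetical):

```r
library(XML)

demo_txt <- "<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>
</CATALOG>"
demo_root <- xmlRoot(xmlTreeParse(demo_txt, asText = TRUE))

xmlName(demo_root)               # tag name of the node
xmlSize(demo_root)               # number of children
kids <- xmlChildren(demo_root)   # the children as a plain R list
names(kids)                      # tag names of the children
xmlName(demo_root[[2]][[1]])     # [[ ]] indexing can be chained
```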
# Look at the first plant node
oneplant <- root[[1]]
class(oneplant)
## [1] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode"
## [4] "oldClass"
oneplant
## <PLANT>
## <COMMON>Bloodroot</COMMON>
## <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
## <ZONE>4</ZONE>
## <LIGHT>Mostly Shady</LIGHT>
## <PRICE>$2.44</PRICE>
## <AVAILABILITY>031599</AVAILABILITY>
## </PLANT>
We can drill down further into the list:
oneplant[['COMMON']]
## <COMMON>Bloodroot</COMMON>
Note that this doesn’t remove the markup, though. To do that, use the function `xmlValue`.
xmlValue(oneplant[['COMMON']])
## [1] "Bloodroot"
xmlValue(oneplant[['BOTANICAL']])
## [1] "Sanguinaria canadensis"
Calling `names()` on `oneplant` returns a named character vector of the names of its children.
names(oneplant)
## COMMON BOTANICAL ZONE LIGHT PRICE
## "COMMON" "BOTANICAL" "ZONE" "LIGHT" "PRICE"
## AVAILABILITY
## "AVAILABILITY"
To illustrate how we manipulate an XML object in R, we’ll take this data and reformat it into a data frame with one row for each plant.
There are special XML versions of `lapply` and `sapply`, named `xmlApply` and `xmlSApply`. Each takes an `XMLNode` object as its primary argument. They iterate over the node’s children, invoking the given function on each child.
Like `lapply`, `xmlApply` returns a list. Like `sapply`, `xmlSApply` returns a simpler data structure if possible.
First, a quick review of `sapply` and `lapply`. Remember: `lapply` and `sapply` can operate on a vector or a list.
library(dplyr) # provides the %>% pipe used below
myList <- list(a=1, b=2, c=3)
myList
## $a
## [1] 1
##
## $b
## [1] 2
##
## $c
## [1] 3
length(myList)
## [1] 3
myList %>% lapply( function(x){x^2})
## $a
## [1] 1
##
## $b
## [1] 4
##
## $c
## [1] 9
myList %>% sapply( function(x){x^2})
## a b c
## 1 4 9
myList %>% sapply(log)
## a b c
## 0.0000000 0.6931472 1.0986123
myList %>% sapply(log, base = 10)
## a b c
## 0.0000000 0.3010300 0.4771213
myList %>% sapply(function(x, pow){x^pow}, pow = 3)
## a b c
## 1 8 27
myList <- list(a=1:2, b=3:5, c=6)
myList %>% lapply( function(x){x^2})
## $a
## [1] 1 4
##
## $b
## [1] 9 16 25
##
## $c
## [1] 36
myList %>% sapply( function(x){x^2})
## $a
## [1] 1 4
##
## $b
## [1] 9 16 25
##
## $c
## [1] 36
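The last call shows that `sapply` quietly falls back to a list when the results have different lengths. When the shape of the result matters, base R's `vapply` is a stricter alternative. The sketch below is an aside using only base R; the variable names are hypothetical.

```r
myList <- list(a = 1:2, b = 3:5, c = 6)

# Reduce each element to a single number, so a plain vector is possible
sums <- sapply(myList, function(x) sum(x^2))
sums

# vapply does the same but makes you state the expected result type;
# it stops with an error instead of quietly returning a list
sums_checked <- vapply(myList, function(x) sum(x^2), numeric(1))
sums_checked

outcome <- tryCatch(
  vapply(myList, function(x) x^2, numeric(1)),
  error = function(e) "error: results are not all of length 1"
)
outcome
```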
The functions `xmlApply` and `xmlSApply` work like `lapply` and `sapply`, except their input is an object of class `XMLNode`.
`xmlApply` returns a list (the elements may themselves be XML nodes).
If it can, `xmlSApply` returns a vector or array. If not, it also returns a list.
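A short sketch of the difference, applying the same function with both. The inline document is an abridged stand-in for plant.xml, and the names `cat_txt`, `cat_root`, `as_list`, and `as_vec` are hypothetical.

```r
library(XML)

cat_txt <- "<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>
  <PLANT><COMMON>Marsh Marigold</COMMON><ZONE>4</ZONE></PLANT>
</CATALOG>"
cat_root <- xmlRoot(xmlTreeParse(cat_txt, asText = TRUE))

# Count the children of each PLANT node in two ways
as_list <- xmlApply(cat_root, xmlSize)    # always a list, one entry per child
as_vec  <- xmlSApply(cat_root, xmlSize)   # simplified to a named vector

is.list(as_list)
is.list(as_vec)
as_vec
```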
With all of these functions, always ask yourself
In our plants example, we can use xmlSApply to extract the common names of all the plants.
root
## <CATALOG>
## <PLANT>
## <COMMON>Bloodroot</COMMON>
## <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
## <ZONE>4</ZONE>
## <LIGHT>Mostly Shady</LIGHT>
## <PRICE>$2.44</PRICE>
## <AVAILABILITY>031599</AVAILABILITY>
## </PLANT>
## <PLANT>
## <COMMON>Columbine</COMMON>
## <BOTANICAL>Aquilegia canadensis</BOTANICAL>
## <ZONE>3</ZONE>
## <LIGHT>Mostly Shady</LIGHT>
## <PRICE>$9.37</PRICE>
## <AVAILABILITY>030699</AVAILABILITY>
## </PLANT>
## <PLANT>
## <COMMON>Marsh Marigold</COMMON>
## <BOTANICAL>Caltha palustris</BOTANICAL>
## <ZONE>4</ZONE>
## <LIGHT>Mostly Sunny</LIGHT>
## <PRICE>$6.81</PRICE>
## <AVAILABILITY>051799</AVAILABILITY>
## </PLANT>
## </CATALOG>
commons <- root %>% xmlSApply (function(x){xmlValue(x[['COMMON']])})
commons
## PLANT PLANT PLANT
## "Bloodroot" "Columbine" "Marsh Marigold"
We can write this as:
getvar <- function(x) xmlValue(x[['COMMON']])
commons <- root %>% xmlSApply(getvar)
commons
## PLANT PLANT PLANT
## "Bloodroot" "Columbine" "Marsh Marigold"
or as this:
getvar <- function(x,var) xmlValue(x[[var]])
commons <- root %>% xmlSApply(getvar,'COMMON')
commons
## PLANT PLANT PLANT
## "Bloodroot" "Columbine" "Marsh Marigold"
or as this:
getvar <- function(x,var) xmlValue(x[[var]])
commons <- 'COMMON' %>% lapply( function(var) root %>% xmlSApply(getvar,var))
commons
## [[1]]
## PLANT PLANT PLANT
## "Bloodroot" "Columbine" "Marsh Marigold"
Using the same strategy, we can create a full data frame with all variables.
root[[1]]
## <PLANT>
## <COMMON>Bloodroot</COMMON>
## <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
## <ZONE>4</ZONE>
## <LIGHT>Mostly Shady</LIGHT>
## <PRICE>$2.44</PRICE>
## <AVAILABILITY>031599</AVAILABILITY>
## </PLANT>
names(root[[1]])
## COMMON BOTANICAL ZONE LIGHT PRICE
## "COMMON" "BOTANICAL" "ZONE" "LIGHT" "PRICE"
## AVAILABILITY
## "AVAILABILITY"
getvar <- function(x, var) xmlValue(x[[var]])
named_list <- names(root[[1]]) %>% lapply(function(var){ root %>% xmlSApply(getvar,var)})
named_list
## $COMMON
## PLANT PLANT PLANT
## "Bloodroot" "Columbine" "Marsh Marigold"
##
## $BOTANICAL
## PLANT PLANT PLANT
## "Sanguinaria canadensis" "Aquilegia canadensis" "Caltha palustris"
##
## $ZONE
## PLANT PLANT PLANT
## "4" "3" "4"
##
## $LIGHT
## PLANT PLANT PLANT
## "Mostly Shady" "Mostly Shady" "Mostly Sunny"
##
## $PRICE
## PLANT PLANT PLANT
## "$2.44" "$9.37" "$6.81"
##
## $AVAILABILITY
## PLANT PLANT PLANT
## "031599" "030699" "051799"
plants <- data.frame(named_list)
plants
COMMON | BOTANICAL | ZONE | LIGHT | PRICE | AVAILABILITY |
---|---|---|---|---|---|
Bloodroot | Sanguinaria canadensis | 4 | Mostly Shady | $2.44 | 031599 |
Columbine | Aquilegia canadensis | 3 | Mostly Shady | $9.37 | 030699 |
Marsh Marigold | Caltha palustris | 4 | Mostly Sunny | $6.81 | 051799 |
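The steps above can be wrapped into a small reusable function. This is a sketch under the assumption that every child of the root has the same fields as the first child; the function name `children_to_df` and the inline test document are hypothetical.

```r
library(XML)

# Turn the children of an XMLNode into the rows of a data frame.
# Assumes every child has the same fields as the first child.
children_to_df <- function(root) {
  vars <- unname(names(root[[1]]))
  cols <- lapply(vars, function(var) {
    # one column: the value of field `var` in every child
    unname(xmlSApply(root, function(x) xmlValue(x[[var]])))
  })
  names(cols) <- vars
  data.frame(cols, stringsAsFactors = FALSE)
}

small_txt <- "<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>
  <PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>
</CATALOG>"
small_df <- children_to_df(xmlRoot(xmlTreeParse(small_txt, asText = TRUE)))
small_df
```

Every column comes back as character, since XML content is text. Converting columns such as `ZONE` or `PRICE` to numbers is a separate cleaning step.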
Steps:
Here is the full plant_catalog data
http://www.xmlfiles.com/examples/plant_catalog.xml
We can load it into R with
xml <- 'http://www.xmlfiles.com/examples/plant_catalog.xml'
doc1 <- xmlTreeParse(xml)
Find the 3 cheapest plants requiring Mostly Shady light.
# make XML into a data table
root <- 'http://www.xmlfiles.com/examples/plant_catalog.xml' %>%
xmlTreeParse() %>%
xmlRoot()
getvar <- function(x, var) xmlValue(x[[var]])
res <- names(root[[1]]) %>% lapply(function(var){ root %>% xmlSApply(getvar,var)})
plants <- data.frame(res)
plants
COMMON | BOTANICAL | ZONE | LIGHT | PRICE | AVAILABILITY |
---|---|---|---|---|---|
Bloodroot | Sanguinaria canadensis | 4 | Mostly Shady | $2.44 | 031599 |
Columbine | Aquilegia canadensis | 3 | Mostly Shady | $9.37 | 030699 |
Marsh Marigold | Caltha palustris | 4 | Mostly Sunny | $6.81 | 051799 |
Cowslip | Caltha palustris | 4 | Mostly Shady | $9.90 | 030699 |
Dutchman’s-Breeches | Diecentra cucullaria | 3 | Mostly Shady | $6.44 | 012099 |
Ginger, Wild | Asarum canadense | 3 | Mostly Shady | $9.03 | 041899 |
Hepatica | Hepatica americana | 4 | Mostly Shady | $4.45 | 012699 |
Liverleaf | Hepatica americana | 4 | Mostly Shady | $3.99 | 010299 |
Jack-In-The-Pulpit | Arisaema triphyllum | 4 | Mostly Shady | $3.23 | 020199 |
Mayapple | Podophyllum peltatum | 3 | Mostly Shady | $2.98 | 060599 |
Phlox, Woodland | Phlox divaricata | 3 | Sun or Shade | $2.80 | 012299 |
Phlox, Blue | Phlox divaricata | 3 | Sun or Shade | $5.59 | 021699 |
Spring-Beauty | Claytonia Virginica | 7 | Mostly Shady | $6.59 | 020199 |
Trillium | Trillium grandiflorum | 5 | Sun or Shade | $3.90 | 042999 |
Wake Robin | Trillium grandiflorum | 5 | Sun or Shade | $3.20 | 022199 |
Violet, Dog-Tooth | Erythronium americanum | 4 | Shade | $9.04 | 020199 |
Trout Lily | Erythronium americanum | 4 | Shade | $6.94 | 032499 |
Adder’s-Tongue | Erythronium americanum | 4 | Shade | $9.58 | 041399 |
Anemone | Anemone blanda | 6 | Mostly Shady | $8.86 | 122698 |
Grecian Windflower | Anemone blanda | 6 | Mostly Shady | $9.16 | 071099 |
Bee Balm | Monarda didyma | 4 | Shade | $4.59 | 050399 |
Bergamont | Monarda didyma | 4 | Shade | $7.16 | 042799 |
Black-Eyed Susan | Rudbeckia hirta | Annual | Sunny | $9.80 | 061899 |
Buttercup | Ranunculus | 4 | Shade | $2.57 | 061099 |
Crowfoot | Ranunculus | 4 | Shade | $9.34 | 040399 |
Butterfly Weed | Asclepias tuberosa | Annual | Sunny | $2.78 | 063099 |
Cinquefoil | Potentilla | Annual | Shade | $7.06 | 052599 |
Primrose | Oenothera | 3 - 5 | Sunny | $6.56 | 013099 |
Gentian | Gentiana | 4 | Sun or Shade | $7.81 | 051899 |
Blue Gentian | Gentiana | 4 | Sun or Shade | $8.56 | 050299 |
Jacob’s Ladder | Polemonium caeruleum | Annual | Shade | $9.26 | 022199 |
Greek Valerian | Polemonium caeruleum | Annual | Shade | $4.36 | 071499 |
California Poppy | Eschscholzia californica | Annual | Sun | $7.89 | 032799 |
Shooting Star | Dodecatheon | Annual | Mostly Shady | $8.60 | 051399 |
Snakeroot | Cimicifuga | Annual | Shade | $5.63 | 071199 |
Cardinal Flower | Lobelia cardinalis | 2 | Shade | $3.02 | 022299 |
new_plants <- plants %>% filter(LIGHT == "Mostly Shady")
# strip the dollar sign, convert to numeric, then sort by cost
new_plants_clean <- new_plants %>%
  mutate(cost = as.numeric(gsub("\\$", "", PRICE))) %>%
  arrange(cost) %>%
  select(COMMON, PRICE) %>%
  head(3)
new_plants_clean
COMMON | PRICE |
---|---|
Bloodroot | $2.44 |
Mayapple | $2.98 |
Jack-In-The-Pulpit | $3.23 |
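Because everything read from XML arrives as text, columns usually need converting before analysis. The sketch below uses base R on a hand-typed copy of three catalog rows. It assumes that `AVAILABILITY` is a date in MMDDYY form, which the values suggest but the source does not state. The names `plants_demo`, `cost`, and `available` are hypothetical.

```r
# Three rows typed in by hand so the sketch needs no file or network access
plants_demo <- data.frame(
  COMMON = c("Bloodroot", "Columbine", "Marsh Marigold"),
  PRICE = c("$2.44", "$9.37", "$6.81"),
  AVAILABILITY = c("031599", "030699", "051799"),
  stringsAsFactors = FALSE
)

# "$2.44" -> 2.44 (fixed = TRUE treats "$" as a literal character, not a regex anchor)
plants_demo$cost <- as.numeric(sub("$", "", plants_demo$PRICE, fixed = TRUE))

# "031599" -> a Date, ASSUMING the format is month, day, two-digit year
plants_demo$available <- as.Date(plants_demo$AVAILABILITY, format = "%m%d%y")

str(plants_demo)
plants_demo[order(plants_demo$cost), c("COMMON", "cost", "available")]
```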
Recall: The basic unit of XML code is called an element or node. It is made up of both markup and content. Markup consists of tags, attributes, and comments.
Example (movies.xml):
<?xml version="1.0"?>
<movies>
<movie mins="126" lang="eng">
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1998</year>
<genre>drama</genre>
</movie>
<movie mins="106" lang="spa">
<title>Y tu mama tambien</title>
<director>
<first_name>Alfonso</first_name>
<last_name>Cuaron</last_name>
</director>
<year>2001</year>
<genre>drama</genre>
</movie>
</movies>
Below we identify examples of an element, an attribute and the content of movies.xml.
#movie is an element (or node):
<movie mins="126" lang="eng">
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1998</year>
<genre>drama</genre>
</movie>
<movie mins="126" lang="eng"> </movie> #tag for the `movie` element.
mins="126" # an attribute
#The content of movie is:
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1998</year>
<genre>drama</genre>
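In R, attributes are read with `xmlAttrs` and `xmlGetAttr` rather than with `xmlValue`. A brief sketch on an abridged inline copy of movies.xml (the names `movie_txt`, `movies_root`, and `first_movie` are hypothetical):

```r
library(XML)

movie_txt <- '<movies>
  <movie mins="126" lang="eng"><title>Good Will Hunting</title><year>1998</year></movie>
  <movie mins="106" lang="spa"><title>Y tu mama tambien</title><year>2001</year></movie>
</movies>'
movies_root <- xmlRoot(xmlTreeParse(movie_txt, asText = TRUE))
first_movie <- movies_root[[1]]

xmlName(first_movie)                          # the tag name
xmlAttrs(first_movie)                         # all attributes, as a named character vector
xmlGetAttr(first_movie, "lang")               # a single attribute, by name
as.numeric(xmlGetAttr(first_movie, "mins"))   # attributes are text until converted

# The same lookup for every movie at once
langs <- xmlSApply(movies_root, xmlGetAttr, "lang")
langs
```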
Last time we used `xmlTreeParse`, `xmlRoot`, and `xmlValue` to parse an XML document and get its values.
Together with `lapply` and `xmlSApply`, these allowed us to convert an XML document into a data table.
library(XML)
root <- xmlTreeParse("/Users/Adam/Desktop/stat133lectures_hw_lab/movies.xml") %>%
xmlRoot()
getvar <- function(x, var) xmlValue(x[[var]])
res <- names(root[[1]]) %>% lapply(function(var){ root %>% xmlSApply(getvar,var)})
movies <- data.frame(res)
head(movies)
title | director | year | genre |
---|---|---|---|
Good Will Hunting | GusVan Sant | 1998 | drama |
Y tu mama tambien | AlfonsoCuaron | 2001 | drama |
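Notice that the `director` column runs the first and last names together. This happens because `xmlValue` concatenates all the text found below a node, and `director` has two children. One possible workaround, sketched here on an inline copy of the file, is to drill into the nested children explicitly. The helper name `get_director` is hypothetical.

```r
library(XML)

nested_txt <- '<movies>
  <movie mins="126" lang="eng">
    <title>Good Will Hunting</title>
    <director><first_name>Gus</first_name><last_name>Van Sant</last_name></director>
  </movie>
  <movie mins="106" lang="spa">
    <title>Y tu mama tambien</title>
    <director><first_name>Alfonso</first_name><last_name>Cuaron</last_name></director>
  </movie>
</movies>'
nested_root <- xmlRoot(xmlTreeParse(nested_txt, asText = TRUE))

# Text of all descendants, run together with no separator
xmlValue(nested_root[[1]][["director"]])

# Pull the two nested fields out separately and join them with a space
get_director <- function(movie) {
  d <- movie[["director"]]
  paste(xmlValue(d[["first_name"]]), xmlValue(d[["last_name"]]))
}
directors <- unname(xmlSApply(nested_root, get_director))
directors
```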
We provide another way to scrape XML files using the idea of an XPath.
XPath is a language for navigating through the elements and attributes of an XML/HTML document.
It uses path expressions to select nodes in an XML document and identifies patterns to match data or content.
The key concept is knowing how to write XPath expressions. XPath expressions have a syntax similar to the way files are located in a hierarchy of directories/folders in a computer file system. For instance, consider again the XML file `movies.xml`.
<?xml version="1.0"?>
<movies>
<movie mins="126" lang="eng">
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1998</year>
<genre>drama</genre>
</movie>
<movie mins="106" lang="spa">
<title>Y tu mama tambien</title>
<director>
<first_name>Alfonso</first_name>
<last_name>Cuaron</last_name>
</director>
<year>2001</year>
<genre>drama</genre>
</movie>
</movies>
`/movies/movie` is the XPath expression that selects every `movie` element that is a child of the root `movies` element.
The main path expressions (i.e. symbols) are:
Symbol | Description |
---|---|
/ | selects from the root node |
// | selects nodes anywhere in the document |
. | selects the current node |
.. | selects the parent of the current node |
@ | selects attributes |
[ ] | square brackets enclose a predicate (a condition on the selected nodes, e.g. on an attribute value) |
XPath wildcards can be used to select unknown XML elements:
Symbol | Description |
---|---|
* | matches any element |
node( ) | matches any node of any kind |
@* | matches any attribute |
To work with XPath expressions we use the function `getNodeSet()`, which accepts XPath expressions in order to select node sets. Its main usage is:
getNodeSet(doc, path)
where `doc` is an object of class `XMLInternalDocument` and `path` is a string giving the XPath expression to be evaluated.
Example:
movies_xml <- xmlTreeParse("/Users/Adam/Desktop/stat133lectures_hw_lab/movies.xml", useInternalNodes = TRUE)
class(movies_xml)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
typeof(movies_xml)
## [1] "externalptr"
getNodeSet(movies_xml, "//movie")
## [[1]]
## <movie mins="126" lang="eng">
## <title>Good Will Hunting</title>
## <director>
## <first_name>Gus</first_name>
## <last_name>Van Sant</last_name>
## </director>
## <year>1998</year>
## <genre>drama</genre>
## </movie>
##
## [[2]]
## <movie mins="106" lang="spa">
## <title>Y tu mama tambien</title>
## <director>
## <first_name>Alfonso</first_name>
## <last_name>Cuaron</last_name>
## </director>
## <year>2001</year>
## <genre>drama</genre>
## </movie>
##
## attr(,"class")
## [1] "XMLNodeSet"
getNodeSet(movies_xml, "//movie[@lang='eng']")
## [[1]]
## <movie mins="126" lang="eng">
## <title>Good Will Hunting</title>
## <director>
## <first_name>Gus</first_name>
## <last_name>Van Sant</last_name>
## </director>
## <year>1998</year>
## <genre>drama</genre>
## </movie>
##
## attr(,"class")
## [1] "XMLNodeSet"
getNodeSet(movies_xml, "//year")
## [[1]]
## <year>1998</year>
##
## [[2]]
## <year>2001</year>
##
## attr(,"class")
## [1] "XMLNodeSet"
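The symbols and wildcards from the tables above can be combined in the same way. The sketch below tries several of them with `getNodeSet` on an inline copy of movies.xml, so it does not depend on a file path. The names `xp_txt` and `xp_doc` are hypothetical.

```r
library(XML)

xp_txt <- '<movies>
  <movie mins="126" lang="eng">
    <title>Good Will Hunting</title>
    <director><first_name>Gus</first_name><last_name>Van Sant</last_name></director>
    <year>1998</year>
    <genre>drama</genre>
  </movie>
  <movie mins="106" lang="spa">
    <title>Y tu mama tambien</title>
    <director><first_name>Alfonso</first_name><last_name>Cuaron</last_name></director>
    <year>2001</year>
    <genre>drama</genre>
  </movie>
</movies>'
xp_doc <- xmlParse(xp_txt, asText = TRUE)   # xmlParse builds an internal document

# "/" : absolute path starting at the root
n_movies <- length(getNodeSet(xp_doc, "/movies/movie"))

# "//" : match anywhere, however deeply nested
last_names <- sapply(getNodeSet(xp_doc, "//last_name"), xmlValue)

# "@" : select the attribute itself rather than the element
langs <- as.character(unlist(getNodeSet(xp_doc, "//movie/@lang")))

# "[ ]" : keep only nodes satisfying a condition
spa_title <- sapply(getNodeSet(xp_doc, "//movie[@lang='spa']/title"), xmlValue)

# ".." : step up to the parent of each match
parents <- sapply(getNodeSet(xp_doc, "//year/.."), xmlName)

# "*" : every child element of the first movie, whatever its name
fields <- sapply(getNodeSet(xp_doc, "/movies/movie[1]/*"), xmlName)
```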
One can also use the functions `xpathApply()` or `xpathSApply()` to find matching nodes in an internal XML tree.
Syntax:
xpathSApply(doc, path, fun, ... )
The output is a vector (or, in the case of `xpathApply`, a list).
xpathSApply(movies_xml, "//year", xmlValue)
## [1] "1998" "2001"
xpathSApply(movies_xml, "//movie[@lang='eng']/year", xmlValue)
## [1] "1998"
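Putting these pieces together, one way to build a data table is one XPath query per column. This is a sketch on an inline copy of movies.xml. It assumes every movie has every field, since otherwise the columns would have different lengths or fall out of alignment. The names `tbl_txt`, `tbl_doc`, and `movies_df` are hypothetical.

```r
library(XML)

tbl_txt <- '<movies>
  <movie mins="126" lang="eng">
    <title>Good Will Hunting</title>
    <director><first_name>Gus</first_name><last_name>Van Sant</last_name></director>
    <year>1998</year>
    <genre>drama</genre>
  </movie>
  <movie mins="106" lang="spa">
    <title>Y tu mama tambien</title>
    <director><first_name>Alfonso</first_name><last_name>Cuaron</last_name></director>
    <year>2001</year>
    <genre>drama</genre>
  </movie>
</movies>'
tbl_doc <- xmlParse(tbl_txt, asText = TRUE)

movies_df <- data.frame(
  title    = xpathSApply(tbl_doc, "//movie/title", xmlValue),
  # nested fields are queried separately, then pasted with a space
  director = paste(xpathSApply(tbl_doc, "//director/first_name", xmlValue),
                   xpathSApply(tbl_doc, "//director/last_name", xmlValue)),
  year     = as.integer(xpathSApply(tbl_doc, "//movie/year", xmlValue)),
  # extra arguments after the function are passed on to it, here the attribute name
  mins     = as.integer(xpathSApply(tbl_doc, "//movie", xmlGetAttr, "mins")),
  lang     = xpathSApply(tbl_doc, "//movie", xmlGetAttr, "lang"),
  stringsAsFactors = FALSE
)
movies_df
```

Compared with the `xmlSApply` approach, this version picks up the `mins` and `lang` attributes and keeps the director's first and last names apart.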