Source file ⇒ 2017-lec23.Rmd

Announcement

  1. Last homework on b-courses. Due Thursday.
  2. For part of class today and Thursday you will have time to meet with your group members and discuss your project. I will use this time to check in.
  3. A one page summary of your project is due on b-courses Tuesday April 18 at 8pm. It is not graded but will be checked off for extra credit. All group members must submit to b-courses. Here is a description of what is expected: final project

Today

  1. Classes and OOP (object oriented programming)
  2. Processing XML using R
  3. Scraping XML data
  4. XPath (data scraping XML and HTML documents)

Last time I showed you how to create an XML file in R. Now I will show you how to read an XML file and make a data table in R.

0. Classes and OOP (object oriented programming)

The basic idea of object oriented programming (OOP) is to view a complex system as the interaction of simpler objects. You can think of an object as a sort of active data type that combines data and operations. To put it simply, objects know stuff (they contain data) and they can do stuff (they have operations called methods). Every object is an instance of some class.

For example, you can define a dog class consisting of a blueprint of what a dog is. Dogs come in different types and have different names and ages (these are things dogs know, called slots). Dogs can also do things like bark, play fetch, roll over and even skateboard (these are things dogs do, called methods). My dog is an object of the dog class. He is a bulldog, his name is Max, and he is three years old. He can bark, drool, and skateboard.
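The dog example can be sketched in R. This is only an illustration: it uses R's S4 system because S4 happens to use the word "slot", and the class name Dog, the generic bark, and the object my_dog are hypothetical names invented here.

```r
library(methods)  # S4 machinery; usually attached already, loaded here to be explicit

# Hypothetical class "Dog": the slots are the things a dog knows.
setClass("Dog", slots = c(name = "character", breed = "character", age = "numeric"))

# A generic function plus a method for Dog: something a dog can do.
setGeneric("bark", function(x) standardGeneric("bark"))
setMethod("bark", "Dog", function(x) paste(x@name, "says woof"))

# An object is one particular instance of the class.
my_dog <- new("Dog", name = "Max", breed = "bulldog", age = 3)
my_dog@breed       # read a slot
bark(my_dog)       # call a method
is(my_dog, "Dog")  # check the class of the object
```

The blueprint (setClass) is written once; every call to new() creates another object with its own slot values.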

Data Frames are examples of a class in R. Data Frames come in different sizes, have different types of entries, and names (these are things that data frames know). Data Frames also have methods (i.e. functions) which operate on them like ncol, or names or head (these are things that data frames do). An object is a particular Data Frame you are working with.

mydf <- data.frame(c(1,2),c(2,3))  # the object mydf is an instance of the data.frame class
names(mydf) <- c("x","y")  
mydf
##   x y
## 1 1 2
## 2 2 3
ncol(mydf)
## [1] 2
class(mydf)  # the name of the class of the object mydf
## [1] "data.frame"
typeof(mydf)  # how R internally stores the object mydf
## [1] "list"
## [1] "list"
mydf[["x"]] 
## [1] 1 2

The function class() returns a character vector, which allows an object to inherit from multiple classes. R searches this vector from left to right when it looks for a method, so the most specific class goes first.

class(mydf) <- c("super.data.frame", class(mydf))
class(mydf)
## [1] "super.data.frame" "data.frame"

Now mydf has class “super.data.frame”, but it still knows the things that class data.frame knows and can do the things that class data.frame does. We say that the class “super.data.frame” inherits from the class “data.frame”.
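A minimal sketch of what inheritance buys you in S3, using base R only. The classes "dog" and "animal" and the generics speak and describe are hypothetical names made up for this illustration. R walks the class vector from left to right and uses the first method it finds.

```r
# Hypothetical generics and classes, invented for this illustration.
speak <- function(x, ...) UseMethod("speak")
speak.animal <- function(x, ...) paste(x$name, "makes a sound")
speak.dog    <- function(x, ...) paste(x$name, "barks")

describe <- function(x, ...) UseMethod("describe")
describe.animal <- function(x, ...) paste(x$name, "is", x$age, "years old")

# "dog" comes first, so it is the more specific class; "animal" is the parent.
my_dog <- structure(list(name = "Max", age = 3), class = c("dog", "animal"))

speak(my_dog)               # uses speak.dog, the more specific method
describe(my_dog)            # there is no describe.dog, so describe.animal is used
inherits(my_dog, "animal")  # TRUE: a dog is also an animal
```

The same mechanism explains why functions written for data frames keep working on an object whose class vector still contains "data.frame".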

R has an informal system of object-oriented (OO) programming called S3 and a more formal system (closer to what you see in Python) called S4. A graduate course in R such as Stats 243 teaches this material or you can read about it here.

Our knowledge of classes and OOP will help us speak about the next topic.

1. Processing XML using R

Please copy the following XML document to Sublime and save it in your home directory or Desktop as plant.xml.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Edited with XML Spy v2006 (http://www.altova.com) -->
<CATALOG>
     <PLANT>
          <COMMON>Bloodroot</COMMON>
          <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
          <ZONE>4</ZONE>
          <LIGHT>Mostly Shady</LIGHT>
          <PRICE>$2.44</PRICE>
          <AVAILABILITY>031599</AVAILABILITY>
     </PLANT>
     <PLANT>
          <COMMON>Columbine</COMMON>
          <BOTANICAL>Aquilegia canadensis</BOTANICAL>
          <ZONE>3</ZONE>
          <LIGHT>Mostly Shady</LIGHT>
          <PRICE>$9.37</PRICE>
          <AVAILABILITY>030699</AVAILABILITY>
     </PLANT>
     <PLANT>
          <COMMON>Marsh Marigold</COMMON>
          <BOTANICAL>Caltha palustris</BOTANICAL>
          <ZONE>4</ZONE>
          <LIGHT>Mostly Sunny</LIGHT>
          <PRICE>$6.81</PRICE>
          <AVAILABILITY>051799</AVAILABILITY>
     </PLANT>
</CATALOG>

To read an XML file into R, use xmlTreeParse.

library(XML)
doc <- xmlTreeParse("/Users/Adam/Desktop/Stat133_S17/lectures/plant.xml")
class(doc)
## [1] "XMLDocument"         "XMLAbstractDocument"

The XML package defines a special class, called XMLDocument, to represent an XML tree.

xmlTreeParse implements what is called the DOM (Document Object Model) parser. It reads the entire file into memory.

We don’t have time to cover it, but you should be aware of another parsing model called SAX (Simple API for XML). It reads the document incrementally and is more memory efficient, but it is trickier to use. It may be necessary for large XML files.
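For the curious, here is a rough sketch of the SAX style using xmlEventParse() from the XML package. Instead of building a tree, the parser calls a handler function each time it meets an opening tag. The inline document plant_xml and the counter n_plants are assumptions made for this example, not part of the lecture files.

```r
library(XML)

# Small inline document so the sketch is self-contained.
plant_xml <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>",
  "</CATALOG>")

n_plants <- 0
handlers <- list(
  # Called once for every opening tag the parser encounters.
  startElement = function(name, attrs, ...) {
    if (name == "PLANT") n_plants <<- n_plants + 1
  }
)

# asText = TRUE says the first argument is XML content, not a file name.
invisible(xmlEventParse(plant_xml, handlers = handlers, asText = TRUE))
n_plants  # number of PLANT elements seen while streaming through the document
```

Nothing but the counter is kept in memory, which is why this style scales to very large files.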

The XMLDocument class has a method called xmlRoot() which gets the top-level XML node as a list of lists.

root <- xmlRoot(doc)
class(root)
## [1] "XMLNode"          "RXMLAbstractNode" "XMLAbstractNode" 
## [4] "oldClass"
typeof(root)
## [1] "list"
root
## <CATALOG>
##  <PLANT>
##   <COMMON>Bloodroot</COMMON>
##   <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
##   <ZONE>4</ZONE>
##   <LIGHT>Mostly Shady</LIGHT>
##   <PRICE>$2.44</PRICE>
##   <AVAILABILITY>031599</AVAILABILITY>
##  </PLANT>
##  <PLANT>
##   <COMMON>Columbine</COMMON>
##   <BOTANICAL>Aquilegia canadensis</BOTANICAL>
##   <ZONE>3</ZONE>
##   <LIGHT>Mostly Shady</LIGHT>
##   <PRICE>$9.37</PRICE>
##   <AVAILABILITY>030699</AVAILABILITY>
##  </PLANT>
##  <PLANT>
##   <COMMON>Marsh Marigold</COMMON>
##   <BOTANICAL>Caltha palustris</BOTANICAL>
##   <ZONE>4</ZONE>
##   <LIGHT>Mostly Sunny</LIGHT>
##   <PRICE>$6.81</PRICE>
##   <AVAILABILITY>051799</AVAILABILITY>
##  </PLANT>
## </CATALOG>

root is an object of the XMLNode class.

Each PLANT node in root is stored in R as a list. We can access an element within a node (i.e., a child) using the usual [[ ]] indexing for lists. The XMLNode class also has methods that give you information about the children of a node.

# Look at the first plant node
oneplant <- root[[1]]
class(oneplant)
## [1] "XMLNode"          "RXMLAbstractNode" "XMLAbstractNode" 
## [4] "oldClass"
oneplant
## <PLANT>
##  <COMMON>Bloodroot</COMMON>
##  <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
##  <ZONE>4</ZONE>
##  <LIGHT>Mostly Shady</LIGHT>
##  <PRICE>$2.44</PRICE>
##  <AVAILABILITY>031599</AVAILABILITY>
## </PLANT>

We can drill down further into the list:

oneplant[['COMMON']]
## <COMMON>Bloodroot</COMMON>

Note that this doesn’t remove the markup, though. To do this, use the function xmlValue.

xmlValue(oneplant[['COMMON']])
## [1] "Bloodroot"
xmlValue(oneplant[['BOTANICAL']])
## [1] "Sanguinaria canadensis"

Calling names() on oneplant returns a named character vector with the names of its children.

names(oneplant)
##         COMMON      BOTANICAL           ZONE          LIGHT          PRICE 
##       "COMMON"    "BOTANICAL"         "ZONE"        "LIGHT"        "PRICE" 
##   AVAILABILITY 
## "AVAILABILITY"
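A few more functions from the XML package are handy for inspecting a node. The sketch below uses a shortened inline copy of the catalog (the variable plant_xml is an assumption made here) so that it runs without plant.xml on disk.

```r
library(XML)

# Shortened inline copy of the catalog, only for illustration.
plant_xml <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>",
  "</CATALOG>")

root <- xmlRoot(xmlTreeParse(plant_xml, asText = TRUE))

xmlName(root)   # tag name of the node: "CATALOG"
xmlSize(root)   # number of children: 2 PLANT nodes

first <- root[[1]]
names(first)                            # names of the children of the first PLANT
xmlValue(first[["ZONE"]])               # content comes back as a string: "4"
as.numeric(xmlValue(first[["ZONE"]]))   # convert it yourself when you need a number
```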

Making a data frame from an XML file

To illustrate how we manipulate an XML object in R, we’ll take this data and reformat it into a data frame with one row for each plant.

There are special XML versions of lapply, and sapply, named xmlApply, xmlSApply. Each takes an XMLNode object as its primary argument. They iterate over the node’s children nodes, invoking the given function.

Like lapply, xmlApply returns a list. Like sapply, xmlSApply returns a simpler data structure if possible.

First, a quick review of sapply and lapply. Remember:

  1. lapply and sapply can operate on a vector or a list.
library(dplyr)  # provides the %>% pipe (and filter, mutate, arrange, select used later)
myList <- list(a=1, b=2, c=3) 
myList
## $a
## [1] 1
## 
## $b
## [1] 2
## 
## $c
## [1] 3
length(myList)
## [1] 3
myList %>% lapply( function(x){x^2}) 
## $a
## [1] 1
## 
## $b
## [1] 4
## 
## $c
## [1] 9
myList %>% sapply( function(x){x^2})
## a b c 
## 1 4 9
  2. You can always include additional arguments. Examples:
myList %>% sapply(log)
##         a         b         c 
## 0.0000000 0.6931472 1.0986123
myList %>% sapply(log, base = 10)
##         a         b         c 
## 0.0000000 0.3010300 0.4771213
myList %>% sapply(function(x, pow){x^pow}, pow = 3)
##  a  b  c 
##  1  8 27
  3. If the results of sapply cannot be simplified, then sapply and lapply will return the same thing.
myList <- list(a=1:2, b=3:5, c=6)
myList %>% lapply( function(x){x^2})
## $a
## [1] 1 4
## 
## $b
## [1]  9 16 25
## 
## $c
## [1] 36
myList %>% sapply( function(x){x^2})
## $a
## [1] 1 4
## 
## $b
## [1]  9 16 25
## 
## $c
## [1] 36

The functions xmlApply and xmlSApply work like lapply and sapply, except their input is an object of class XMLNode.

xmlApply returns a list (the elements may themselves be XML nodes).

If it can, xmlSApply returns a vector or array. If not, it also returns a list.
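To see the difference side by side, here is a small sketch on an inline two-plant catalog. The variable names plant_xml, common_list and common_vec are made up for this example.

```r
library(XML)

# Inline two-plant catalog, only for illustration.
plant_xml <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE></PLANT>",
  "</CATALOG>")
root <- xmlRoot(xmlTreeParse(plant_xml, asText = TRUE))

get_common <- function(node) xmlValue(node[["COMMON"]])

common_list <- xmlApply(root, get_common)   # always a list, one element per child
common_vec  <- xmlSApply(root, get_common)  # simplified to a character vector

is.list(common_list)
is.character(common_vec)
unname(common_vec)
```

Both calls iterate over the children of root (the PLANT nodes); only the shape of the result differs.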

With all of these functions, always ask yourself

  1. What do I want to operate on (iterate over)?
  2. What do I want to produce?

In our plants example, we can use xmlSApply to extract the common names of all the plants.

root
## <CATALOG>
##  <PLANT>
##   <COMMON>Bloodroot</COMMON>
##   <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
##   <ZONE>4</ZONE>
##   <LIGHT>Mostly Shady</LIGHT>
##   <PRICE>$2.44</PRICE>
##   <AVAILABILITY>031599</AVAILABILITY>
##  </PLANT>
##  <PLANT>
##   <COMMON>Columbine</COMMON>
##   <BOTANICAL>Aquilegia canadensis</BOTANICAL>
##   <ZONE>3</ZONE>
##   <LIGHT>Mostly Shady</LIGHT>
##   <PRICE>$9.37</PRICE>
##   <AVAILABILITY>030699</AVAILABILITY>
##  </PLANT>
##  <PLANT>
##   <COMMON>Marsh Marigold</COMMON>
##   <BOTANICAL>Caltha palustris</BOTANICAL>
##   <ZONE>4</ZONE>
##   <LIGHT>Mostly Sunny</LIGHT>
##   <PRICE>$6.81</PRICE>
##   <AVAILABILITY>051799</AVAILABILITY>
##  </PLANT>
## </CATALOG>
commons <- root %>% xmlSApply (function(x){xmlValue(x[['COMMON']])})
commons
##            PLANT            PLANT            PLANT 
##      "Bloodroot"      "Columbine" "Marsh Marigold"

We can write this as:

getvar <- function(x) xmlValue(x[['COMMON']])
commons <- root %>% xmlSApply(getvar)
commons
##            PLANT            PLANT            PLANT 
##      "Bloodroot"      "Columbine" "Marsh Marigold"

or as this:

getvar <- function(x,var) xmlValue(x[[var]])
commons <- root %>% xmlSApply(getvar,'COMMON')
commons
##            PLANT            PLANT            PLANT 
##      "Bloodroot"      "Columbine" "Marsh Marigold"

or as this:

getvar <- function(x,var) xmlValue(x[[var]])
commons <-  'COMMON' %>% lapply( function(var)  root %>% xmlSApply(getvar,var))
commons
## [[1]]
##            PLANT            PLANT            PLANT 
##      "Bloodroot"      "Columbine" "Marsh Marigold"

Using the same strategy, we can create a full data frame with all variables.

root[[1]]
## <PLANT>
##  <COMMON>Bloodroot</COMMON>
##  <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
##  <ZONE>4</ZONE>
##  <LIGHT>Mostly Shady</LIGHT>
##  <PRICE>$2.44</PRICE>
##  <AVAILABILITY>031599</AVAILABILITY>
## </PLANT>
names(root[[1]])
##         COMMON      BOTANICAL           ZONE          LIGHT          PRICE 
##       "COMMON"    "BOTANICAL"         "ZONE"        "LIGHT"        "PRICE" 
##   AVAILABILITY 
## "AVAILABILITY"
getvar <- function(x, var) xmlValue(x[[var]])
named_list <- names(root[[1]])  %>% lapply(function(var){ root %>% xmlSApply(getvar,var)})
named_list
## $COMMON
##            PLANT            PLANT            PLANT 
##      "Bloodroot"      "Columbine" "Marsh Marigold" 
## 
## $BOTANICAL
##                    PLANT                    PLANT                    PLANT 
## "Sanguinaria canadensis"   "Aquilegia canadensis"       "Caltha palustris" 
## 
## $ZONE
## PLANT PLANT PLANT 
##   "4"   "3"   "4" 
## 
## $LIGHT
##          PLANT          PLANT          PLANT 
## "Mostly Shady" "Mostly Shady" "Mostly Sunny" 
## 
## $PRICE
##   PLANT   PLANT   PLANT 
## "$2.44" "$9.37" "$6.81" 
## 
## $AVAILABILITY
##    PLANT    PLANT    PLANT 
## "031599" "030699" "051799"
plants <- data.frame(named_list)
plants
COMMON          BOTANICAL               ZONE  LIGHT         PRICE  AVAILABILITY
Bloodroot       Sanguinaria canadensis  4     Mostly Shady  $2.44  031599
Columbine       Aquilegia canadensis    3     Mostly Shady  $9.37  030699
Marsh Marigold  Caltha palustris        4     Mostly Sunny  $6.81  051799
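Everything that comes out of xmlValue is text, so the columns usually need cleaning before you can compute with them. The sketch below rebuilds a small version of the data frame from an inline document and converts the columns. It assumes that AVAILABILITY is a date written as MMDDYY, which is a guess from the values and not something the file states.

```r
library(XML)

# Inline two-plant catalog, only for illustration.
plant_xml <- paste0(
  "<CATALOG>",
  "<PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE>",
  "<PRICE>$2.44</PRICE><AVAILABILITY>031599</AVAILABILITY></PLANT>",
  "<PLANT><COMMON>Columbine</COMMON><ZONE>3</ZONE>",
  "<PRICE>$9.37</PRICE><AVAILABILITY>030699</AVAILABILITY></PLANT>",
  "</CATALOG>")
root <- xmlRoot(xmlTreeParse(plant_xml, asText = TRUE))

# Same strategy as above: one character vector per variable.
getvar <- function(x, var) xmlValue(x[[var]])
vars <- unname(names(root[[1]]))
cols <- lapply(vars, function(var) unname(xmlSApply(root, getvar, var)))
names(cols) <- vars
plants <- data.frame(cols, stringsAsFactors = FALSE)

# Convert text to useful types.
plants$ZONE  <- as.integer(plants$ZONE)  # the full catalog also has "Annual", which would become NA
plants$PRICE <- as.numeric(sub("$", "", plants$PRICE, fixed = TRUE))
plants$AVAILABILITY <- as.Date(plants$AVAILABILITY, format = "%m%d%y")  # assumed MMDDYY
str(plants)
```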

In Class exercise

Do example 1a,b

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-23-collection/

2. Data Scraping XML documents

Steps:

  1. Load the XML document from the web or your computer
  2. Convert to a data table
  3. Extract the information you need from the data table.

Here is the full plant_catalog data

http://www.xmlfiles.com/examples/plant_catalog.xml

We can load it into R with

xml <- 'http://www.xmlfiles.com/examples/plant_catalog.xml'
doc1 <- xmlTreeParse(xml)

Find the three cheapest plants requiring Mostly Shady light.

# make XML into a data table
root <- 'http://www.xmlfiles.com/examples/plant_catalog.xml' %>%
  xmlTreeParse() %>%
  xmlRoot()
getvar <- function(x, var) xmlValue(x[[var]])
res <- names(root[[1]])  %>% lapply(function(var){ root %>% xmlSApply(getvar,var)})
plants <- data.frame(res)
plants
COMMON               BOTANICAL                 ZONE    LIGHT         PRICE  AVAILABILITY
Bloodroot            Sanguinaria canadensis    4       Mostly Shady  $2.44  031599
Columbine            Aquilegia canadensis      3       Mostly Shady  $9.37  030699
Marsh Marigold       Caltha palustris          4       Mostly Sunny  $6.81  051799
Cowslip              Caltha palustris          4       Mostly Shady  $9.90  030699
Dutchman’s-Breeches  Diecentra cucullaria      3       Mostly Shady  $6.44  012099
Ginger, Wild         Asarum canadense          3       Mostly Shady  $9.03  041899
Hepatica             Hepatica americana        4       Mostly Shady  $4.45  012699
Liverleaf            Hepatica americana        4       Mostly Shady  $3.99  010299
Jack-In-The-Pulpit   Arisaema triphyllum       4       Mostly Shady  $3.23  020199
Mayapple             Podophyllum peltatum      3       Mostly Shady  $2.98  060599
Phlox, Woodland      Phlox divaricata          3       Sun or Shade  $2.80  012299
Phlox, Blue          Phlox divaricata          3       Sun or Shade  $5.59  021699
Spring-Beauty        Claytonia Virginica       7       Mostly Shady  $6.59  020199
Trillium             Trillium grandiflorum     5       Sun or Shade  $3.90  042999
Wake Robin           Trillium grandiflorum     5       Sun or Shade  $3.20  022199
Violet, Dog-Tooth    Erythronium americanum    4       Shade         $9.04  020199
Trout Lily           Erythronium americanum    4       Shade         $6.94  032499
Adder’s-Tongue       Erythronium americanum    4       Shade         $9.58  041399
Anemone              Anemone blanda            6       Mostly Shady  $8.86  122698
Grecian Windflower   Anemone blanda            6       Mostly Shady  $9.16  071099
Bee Balm             Monarda didyma            4       Shade         $4.59  050399
Bergamont            Monarda didyma            4       Shade         $7.16  042799
Black-Eyed Susan     Rudbeckia hirta           Annual  Sunny         $9.80  061899
Buttercup            Ranunculus                4       Shade         $2.57  061099
Crowfoot             Ranunculus                4       Shade         $9.34  040399
Butterfly Weed       Asclepias tuberosa        Annual  Sunny         $2.78  063099
Cinquefoil           Potentilla                Annual  Shade         $7.06  052599
Primrose             Oenothera                 3 - 5   Sunny         $6.56  013099
Gentian              Gentiana                  4       Sun or Shade  $7.81  051899
Blue Gentian         Gentiana                  4       Sun or Shade  $8.56  050299
Jacob’s Ladder       Polemonium caeruleum      Annual  Shade         $9.26  022199
Greek Valerian       Polemonium caeruleum      Annual  Shade         $4.36  071499
California Poppy     Eschscholzia californica  Annual  Sun           $7.89  032799
Shooting Star        Dodecatheon               Annual  Mostly Shady  $8.60  051399
Snakeroot            Cimicifuga                Annual  Shade         $5.63  071199
Cardinal Flower      Lobelia cardinalis        2       Shade         $3.02  022299
new_plants <- plants %>% filter(LIGHT == "Mostly Shady")
new_plants_clean <- new_plants %>%
  mutate(cost = as.numeric(gsub("\\$", "", PRICE))) %>%  # strip "$" so prices sort as numbers
  arrange(cost) %>%
  select(COMMON, PRICE) %>%
  head(3)
new_plants_clean
COMMON              PRICE
Bloodroot           $2.44
Mayapple            $2.98
Jack-In-The-Pulpit  $3.23

3. XPath (data scraping XML and HTML documents)

Recall: The basic unit of XML code is called an element or node. It is made up of both markup and content. Markup consists of tags, attributes, and comments.

Example (movies.xml):

<?xml version="1.0"?>
<movies>
    <movie mins="126" lang="eng">
        <title>Good Will Hunting</title> 
        <director>
            <first_name>Gus</first_name>
            <last_name>Van Sant</last_name>
        </director>
        <year>1998</year> 
        <genre>drama</genre>
    </movie>
    <movie mins="106" lang="spa"> 
        <title>Y tu mama tambien</title>
         <director>
            <first_name>Alfonso</first_name>
            <last_name>Cuaron</last_name> 
        </director>
        <year>2001</year>
        <genre>drama</genre>
    </movie>
</movies>

Below we identify examples of an element, an attribute and the content of movies.xml.

# movie is an element (or node):

<movie mins="126" lang="eng">
    <title>Good Will Hunting</title>
    <director>
        <first_name>Gus</first_name>
        <last_name>Van Sant</last_name>
    </director>
    <year>1998</year>
    <genre>drama</genre>
</movie>
<movie mins="126" lang="eng"> </movie>  # opening and closing tags of the `movie` element
mins="126"  # an attribute
# The content of movie is:

<title>Good Will Hunting</title> 
<director>
    <first_name>Gus</first_name>
    <last_name>Van Sant</last_name>
</director>
<year>1998</year> 
<genre>drama</genre>
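In R, the same pieces can be pulled out of a node with functions from the XML package. This is a small sketch on a shortened inline copy of movies.xml (the variable movies_text is an assumption made here so the code runs without the file).

```r
library(XML)

# Shortened inline copy of movies.xml, only for illustration.
movies_text <- paste0(
  '<movies>',
  '<movie mins="126" lang="eng"><title>Good Will Hunting</title><year>1998</year></movie>',
  '<movie mins="106" lang="spa"><title>Y tu mama tambien</title><year>2001</year></movie>',
  '</movies>')

root  <- xmlRoot(xmlTreeParse(movies_text, asText = TRUE))
movie <- root[[1]]

xmlName(movie)              # the element name: "movie"
xmlAttrs(movie)             # all attributes as a named character vector
xmlGetAttr(movie, "mins")   # a single attribute: "126"
names(movie)                # the child elements that make up the content
xmlValue(movie[["title"]])  # text content of one child
```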

Last time we used xmlTreeParse, xmlRoot, and xmlValue to parse an xml document to get its values.

These, together with lapply and xmlSApply, allowed us to convert an XML document into a data table.

library(XML)
root <- xmlTreeParse("/Users/Adam/Desktop/stat133lectures_hw_lab/movies.xml") %>%
  xmlRoot()
getvar <- function(x, var) xmlValue(x[[var]])
res <- names(root[[1]])  %>% lapply(function(var){ root %>% xmlSApply(getvar,var)})
movies <- data.frame(res)
head(movies)
title              director       year  genre
Good Will Hunting  GusVan Sant    1998  drama
Y tu mama tambien  AlfonsoCuaron  2001  drama

XPath gives us another way to scrape XML files.

XPath Language

XPath is a language to navigate through elements and attributes in an XML/HTML document.

It uses path expressions to select nodes in an XML document and identifies patterns to match data or content.

XPath Syntax

The key concept is knowing how to write XPath expressions. XPath expressions have a syntax similar to the way files are located in a hierarchy of directories/folders in a computer file system.

For instance, consider again the XML file movies.xml.

<?xml version="1.0"?>
<movies>
    <movie mins="126" lang="eng">
        <title>Good Will Hunting</title> 
        <director>
            <first_name>Gus</first_name>
            <last_name>Van Sant</last_name>
        </director>
        <year>1998</year> 
        <genre>drama</genre>
    </movie>
    <movie mins="106" lang="spa"> 
        <title>Y tu mama tambien</title>
         <director>
            <first_name>Alfonso</first_name>
            <last_name>Cuaron</last_name> 
        </director>
        <year>2001</year>
        <genre>drama</genre>
    </movie>
</movies>

/movies/movie

is the XPath expression that locates every movie element that is a child of the movies element.

The main path expressions (i.e. symbols) are:

Symbol   Description
/        selects from the root node
//       selects nodes anywhere in the document
.        selects the current node
..       selects the parent of the current node
@        selects attributes
[ ]      encloses a predicate, i.e. a condition such as a test on an attribute

XPath wildcards can be used to select unknown XML elements.

Symbol   Description
*        matches any element
node()   matches any node of any kind
@*       matches any attribute

To work with XPath expressions we use the function getNodeSet() that accepts XPath expressions in order to select node-sets. Its main usage is:

getNodeSet(doc, path) where doc is an object of class XMLInternalDocument and path is a string giving the XPath expression to be evaluated.

Example:

movies_xml <- xmlTreeParse("/Users/Adam/Desktop/stat133lectures_hw_lab/movies.xml", useInternalNodes = TRUE)
class(movies_xml)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
typeof(movies_xml)
## [1] "externalptr"
getNodeSet(movies_xml, "//movie")
## [[1]]
## <movie mins="126" lang="eng">
##   <title>Good Will Hunting</title>
##   <director>
##     <first_name>Gus</first_name>
##     <last_name>Van Sant</last_name>
##   </director>
##   <year>1998</year>
##   <genre>drama</genre>
## </movie> 
## 
## [[2]]
## <movie mins="106" lang="spa">
##   <title>Y tu mama tambien</title>
##   <director>
##     <first_name>Alfonso</first_name>
##     <last_name>Cuaron</last_name>
##   </director>
##   <year>2001</year>
##   <genre>drama</genre>
## </movie> 
## 
## attr(,"class")
## [1] "XMLNodeSet"
getNodeSet(movies_xml, "//movie[@lang='eng']")
## [[1]]
## <movie mins="126" lang="eng">
##   <title>Good Will Hunting</title>
##   <director>
##     <first_name>Gus</first_name>
##     <last_name>Van Sant</last_name>
##   </director>
##   <year>1998</year>
##   <genre>drama</genre>
## </movie> 
## 
## attr(,"class")
## [1] "XMLNodeSet"
getNodeSet(movies_xml, "//year")
## [[1]]
## <year>1998</year> 
## 
## [[2]]
## <year>2001</year> 
## 
## attr(,"class")
## [1] "XMLNodeSet"
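The remaining symbols from the tables can be tried the same way. The sketch below parses an inline copy of movies.xml (the variable movies_text is an assumption made so the code runs without the file) and counts how many nodes each expression matches.

```r
library(XML)

# Inline copy of movies.xml, only for illustration.
movies_text <- paste0(
  '<movies>',
  '<movie mins="126" lang="eng"><title>Good Will Hunting</title>',
  '<director><first_name>Gus</first_name><last_name>Van Sant</last_name></director>',
  '<year>1998</year><genre>drama</genre></movie>',
  '<movie mins="106" lang="spa"><title>Y tu mama tambien</title>',
  '<director><first_name>Alfonso</first_name><last_name>Cuaron</last_name></director>',
  '<year>2001</year><genre>drama</genre></movie>',
  '</movies>')
doc <- xmlTreeParse(movies_text, asText = TRUE, useInternalNodes = TRUE)

length(getNodeSet(doc, "/movies/movie"))         # absolute path from the root: 2 nodes
length(getNodeSet(doc, "//last_name"))           # last_name anywhere in the document: 2 nodes
length(getNodeSet(doc, "//movie/*"))             # wildcard, every child of a movie: 8 nodes
length(getNodeSet(doc, "//movie[@mins='106']"))  # predicate on an attribute: 1 node

# ".." steps up to the parent of each first_name node.
parents <- getNodeSet(doc, "//first_name/..")
sapply(parents, xmlName)                         # "director" "director"
```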

One can also use the functions xpathApply() or xpathSApply() to find matching nodes in an internal XML tree.

Syntax:

xpathSApply(doc, path, fun, ... )

The output is a vector (or in the case of xpathApply, a list).

xpathSApply(movies_xml, "//year", xmlValue)
## [1] "1998" "2001"
xpathSApply(movies_xml, "//movie[@lang='eng']/year", xmlValue)
## [1] "1998"
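Putting the pieces together, XPath can build a data frame one column at a time. This is a sketch, not the only way to do it. It assumes every movie has exactly one title, one director and one year, so that all columns have the same length. The inline document movies_text is again an assumption made so the code is self-contained.

```r
library(XML)

# Inline copy of movies.xml, only for illustration.
movies_text <- paste0(
  '<movies>',
  '<movie mins="126" lang="eng"><title>Good Will Hunting</title>',
  '<director><first_name>Gus</first_name><last_name>Van Sant</last_name></director>',
  '<year>1998</year><genre>drama</genre></movie>',
  '<movie mins="106" lang="spa"><title>Y tu mama tambien</title>',
  '<director><first_name>Alfonso</first_name><last_name>Cuaron</last_name></director>',
  '<year>2001</year><genre>drama</genre></movie>',
  '</movies>')
doc <- xmlTreeParse(movies_text, asText = TRUE, useInternalNodes = TRUE)

movies <- data.frame(
  title      = xpathSApply(doc, "//movie/title", xmlValue),
  first_name = xpathSApply(doc, "//movie/director/first_name", xmlValue),
  last_name  = xpathSApply(doc, "//movie/director/last_name", xmlValue),
  year       = as.integer(xpathSApply(doc, "//movie/year", xmlValue)),
  # Extra arguments after the function are passed on to it, here the attribute name.
  lang       = xpathSApply(doc, "//movie", xmlGetAttr, "lang"),
  stringsAsFactors = FALSE
)
movies
```

Because each path reaches directly into the nested director element, the first and last names land in separate columns instead of being glued together as in GusVan Sant.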