Source file ⇒ lec32.Rmd

Today

  1. XPath (scraping data from XML and HTML documents)
  2. Group project info

Recall: The basic unit of XML code is called an element or node. It is made up of both markup and content. Markup consists of tags, attributes, and comments.

Example (movies.xml):

<?xml version="1.0"?>
<movies>
    <movie mins="126" lang="eng">
        <title>Good Will Hunting</title> 
        <director>
            <first_name>Gus</first_name>
            <last_name>Van Sant</last_name>
        </director>
        <year>1998</year> 
        <genre>drama</genre>
    </movie>
    <movie mins="106" lang="spa"> 
        <title>Y tu mama tambien</title>
         <director>
            <first_name>Alfonso</first_name>
            <last_name>Cuaron</last_name> 
        </director>
        <year>2001</year>
        <genre>drama</genre>
    </movie>
</movies>

Task for you

In movies.xml, identify an example of an element, of an attribute, and of content.

`movie` is an element (or node):

<movie mins="126" lang="eng">
    <title>Good Will Hunting</title> 
    <director>
        <first_name>Gus</first_name>
        <last_name>Van Sant</last_name>
    </director>
    <year>1998</year> 
    <genre>drama</genre>
</movie>
<movie mins="126" lang="eng"> and </movie> are the start and end tags of the `movie` element.
mins="126" and lang="eng" are attributes. An attribute value is always in quotes.
The content of movie is:

<title>Good Will Hunting</title> 
<director>
    <first_name>Gus</first_name>
    <last_name>Van Sant</last_name>
</director>
<year>1998</year> 
<genre>drama</genre>
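
The same three pieces (element name, attributes, content) can be read off in R. This is a minimal sketch, assuming the XML package is installed; `movie_txt` is an inline copy of the first movie introduced here only so the example needs no file path.

```r
library(XML)

# Inline copy of the first movie from movies.xml (an assumption for this
# sketch, so the example runs without a file on disk)
movie_txt <- '<movie mins="126" lang="eng">
  <title>Good Will Hunting</title>
  <director>
    <first_name>Gus</first_name>
    <last_name>Van Sant</last_name>
  </director>
  <year>1998</year>
  <genre>drama</genre>
</movie>'

movie <- xmlRoot(xmlParse(movie_txt, asText = TRUE))

xmlName(movie)                                 # element name: "movie"
xmlAttrs(movie)                                # all attributes, as a named character vector
xmlGetAttr(movie, "mins")                      # one attribute value; always a string: "126"
xmlValue(movie[["title"]])                     # content of a child element: "Good Will Hunting"
xmlValue(movie[["director"]][["last_name"]])   # content of a nested child: "Van Sant"
```

Note that attribute values come back as character strings, so `mins` needs `as.numeric()` before doing arithmetic with it.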

Last time we used xmlTreeParse, xmlRoot, and xmlValue to parse an XML document and extract its values.

Together with lapply and xmlSApply, these allowed us to convert an XML document into a data frame.

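library(XML)
library(magrittr)  # provides the %>% pipe used below
# Parse the file and take the root node (<movies>)
root <- xmlTreeParse("/Users/Adam/Desktop/stat133lectures_hw_lab/movies.xml") %>%
  xmlRoot()
# Text value of the child named `var` of node `x`; for a child that has its own
# children (director), xmlValue pastes their text together with no separator
getvar <- function(x, var) xmlValue(x[[var]])
# One column per child name of the first movie, one row per movie
res <- names(root[[1]]) %>% lapply(function(var) root %>% xmlSApply(getvar, var))
movies <- data.frame(res)
head(movies)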
title              director       year  genre
Good Will Hunting  GusVan Sant    1998  drama
Y tu mama tambien  AlfonsoCuaron  2001  drama

Today we look at another way to scrape XML files: XPath.

XPath Language

XPath is a language for navigating through the elements and attributes of an XML/HTML document.

It uses path expressions to select nodes in a document, and those expressions can include conditions that match on data or content.

XPath Syntax

The key concept is knowing how to write XPath expressions. XPath expressions have a syntax similar to the way files are located in a hierarchy of directories/folders in a computer file system. For instance:

/movies/movie

is the XPath expression that locates every movie element that is a child of the root movies element. To select only the first one, add a position in square brackets: /movies/movie[1].
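
To see how such a path behaves, here is a small sketch, assuming the XML package and a shortened inline copy of movies.xml (`movies_txt` is introduced only for this example). A path returns every node that matches it, and a positional predicate such as [1] narrows the match to a single node.

```r
library(XML)

# Shortened inline copy of movies.xml (an assumption for this sketch)
movies_txt <- '<movies>
  <movie mins="126" lang="eng"><title>Good Will Hunting</title><year>1998</year></movie>
  <movie mins="106" lang="spa"><title>Y tu mama tambien</title><year>2001</year></movie>
</movies>'
doc <- xmlParse(movies_txt, asText = TRUE)

all_movies  <- getNodeSet(doc, "/movies/movie")     # every movie under the root
first_movie <- getNodeSet(doc, "/movies/movie[1]")  # only the first one

length(all_movies)   # 2
length(first_movie)  # 1

# Paths extend one step at a time, like nested folders
titles <- xpathSApply(doc, "/movies/movie/title", xmlValue)
titles
```

XPath positions start at 1, not 0, which happens to match R's own indexing.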

The main path expressions (i.e. symbols) are:

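Symbol  Description
/       selects from the root node
//      selects matching nodes at any depth (anywhere in the document when it starts the expression)
.       selects the current node
..      selects the parent of the current node
@       selects attributes
[ ]     encloses a predicate: a condition on attributes, child values, or position that filters the selected nodes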
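
Each symbol in the table can be tried on a small document. The following is a sketch rather than a reference, assuming the XML package and a shortened inline copy of movies.xml; the variable names are made up for illustration.

```r
library(XML)

# Shortened inline copy of movies.xml (an assumption for this sketch)
movies_txt <- '<movies>
  <movie mins="126" lang="eng"><title>Good Will Hunting</title><year>1998</year></movie>
  <movie mins="106" lang="spa"><title>Y tu mama tambien</title><year>2001</year></movie>
</movies>'
doc <- xmlParse(movies_txt, asText = TRUE)

# /   absolute path, starting at the root
abs_titles <- xpathSApply(doc, "/movies/movie/title", xmlValue)

# //  match at any depth, no need to spell out the full path
any_years <- xpathSApply(doc, "//year", xmlValue)

# @   attribute values of the matched elements ("eng" and "spa")
langs <- xpathSApply(doc, "//movie/@lang")

# [ ] predicates: keep only the nodes that satisfy a condition
spanish <- xpathSApply(doc, "//movie[@lang='spa']/title", xmlValue)
long    <- xpathSApply(doc, "//movie[@mins > 120]/title", xmlValue)

# . and ..  find the year equal to 2001, step up to its parent movie,
# then down to that movie's title
title_2001 <- xpathSApply(doc, "//year[. = '2001']/../title", xmlValue)
```

The last expression is written with `..` to illustrate the symbol; `//movie[year = '2001']/title` selects the same node with a predicate on the child.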

XPath wildcards can be used to select unknown XML elements:

Symbol  Description
*       matches any element
node()  matches any node of any kind
@*      matches any attribute
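
The wildcards are easiest to tell apart by running them side by side. This is a minimal sketch, assuming the XML package and an inline one-movie document made up for the example. The difference between * and node() shows up on title, which contains text but no child elements.

```r
library(XML)

# Inline one-movie document (an assumption for this sketch)
movies_txt <- '<movies><movie mins="126" lang="eng"><title>Good Will Hunting</title><director><first_name>Gus</first_name><last_name>Van Sant</last_name></director><year>1998</year><genre>drama</genre></movie></movies>'
doc <- xmlParse(movies_txt, asText = TRUE)

# *    every child element of the movie, whatever it is called
child_names <- xpathSApply(doc, "/movies/movie/*", xmlName)

# @*   every attribute of the movie, whatever it is called
attr_values <- xpathSApply(doc, "/movies/movie/@*")

# * only matches elements, so it finds nothing inside <title> ...
n_elements <- length(getNodeSet(doc, "//title/*"))

# ... while node() also matches the text node that holds the title
node_kinds <- xpathSApply(doc, "//title/node()", xmlName)

child_names   # "title" "director" "year" "genre"
n_elements    # 0
node_kinds    # "text"
```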

To evaluate XPath expressions in R we use the function getNodeSet() from the XML package, which takes an XPath expression and returns the set of matching nodes. Its main usage is:

getNodeSet(doc, path), where doc is an object of class XMLInternalDocument and path is a string giving the XPath expression to be evaluated.

Example:

movies_xml <- xmlTreeParse("/Users/Adam/Desktop/stat133lectures_hw_lab/movies.xml", useInternalNodes = TRUE)
class(movies_xml)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
getNodeSet(movies_xml, "//movie")
## [[1]]
## <movie mins="126" lang="eng">
##   <title>Good Will Hunting</title>
##   <director>
##     <first_name>Gus</first_name>
##     <last_name>Van Sant</last_name>
##   </director>
##   <year>1998</year>
##   <genre>drama</genre>
## </movie> 
## 
## [[2]]
## <movie mins="106" lang="spa">
##   <title>Y tu mama tambien</title>
##   <director>
##     <first_name>Alfonso</first_name>
##     <last_name>Cuaron</last_name>
##   </director>
##   <year>2001</year>
##   <genre>drama</genre>
## </movie> 
## 
## attr(,"class")
## [1] "XMLNodeSet"
getNodeSet(movies_xml, "//movie[@lang='eng']")
## [[1]]
## <movie mins="126" lang="eng">
##   <title>Good Will Hunting</title>
##   <director>
##     <first_name>Gus</first_name>
##     <last_name>Van Sant</last_name>
##   </director>
##   <year>1998</year>
##   <genre>drama</genre>
## </movie> 
## 
## attr(,"class")
## [1] "XMLNodeSet"
getNodeSet(movies_xml, "//year")
## [[1]]
## <year>1998</year> 
## 
## [[2]]
## <year>2001</year> 
## 
## attr(,"class")
## [1] "XMLNodeSet"
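
getNodeSet() returns nodes; to get values out of them we still apply xmlValue (or xmlGetAttr for attributes). The sketch below does this in one step with xpathSApply() and rebuilds a data frame like the one from last time. It assumes the XML package and an inline copy of movies.xml, and the choice of columns is just for illustration.

```r
library(XML)

# Inline copy of movies.xml (an assumption for this sketch)
movies_txt <- '<movies>
  <movie mins="126" lang="eng">
    <title>Good Will Hunting</title>
    <director><first_name>Gus</first_name><last_name>Van Sant</last_name></director>
    <year>1998</year>
    <genre>drama</genre>
  </movie>
  <movie mins="106" lang="spa">
    <title>Y tu mama tambien</title>
    <director><first_name>Alfonso</first_name><last_name>Cuaron</last_name></director>
    <year>2001</year>
    <genre>drama</genre>
  </movie>
</movies>'
doc <- xmlParse(movies_txt, asText = TRUE)

# One XPath expression per column; xpathSApply applies the function to
# every matching node and simplifies the result to a vector
movies <- data.frame(
  title    = xpathSApply(doc, "//movie/title", xmlValue),
  director = xpathSApply(doc, "//movie/director/last_name", xmlValue),
  year     = as.integer(xpathSApply(doc, "//movie/year", xmlValue)),
  mins     = as.integer(xpathSApply(doc, "//movie", xmlGetAttr, "mins")),
  stringsAsFactors = FALSE
)
movies
```

This column-by-column approach assumes every movie has every field. If one movie were missing its year, the year vector would be shorter than the others and the rows would no longer line up.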