Source file ⇒ lec32.Rmd
Recall: The basic unit of XML code is called an element or node. It is made up of both markup and content. Markup consists of tags, attributes, and comments.
Example (movies.xml):
<?xml version="1.0"?>
<movies>
<movie mins="126" lang="eng">
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1998</year>
<genre>drama</genre>
</movie>
<movie mins="106" lang="spa">
<title>Y tu mama tambien</title>
<director>
<first_name>Alfonso</first_name>
<last_name>Cuaron</last_name>
</director>
<year>2001</year>
<genre>drama</genre>
</movie>
</movies>
Identify examples of an element, an attribute and the content of movies.xml.
`movie` is an element (or node):
<movie mins="126" lang="eng">
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1998</year>
<genre>drama</genre>
</movie>
<movie mins="126" lang="eng"> and </movie> are the start and end tags of the `movie` element.
mins="126" is an attribute. An attribute value is always in quotes.
The content of movie is:
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1998</year>
<genre>drama</genre>
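The same pieces can be inspected from R. The following is a minimal sketch, assuming the XML package is installed; it parses a shortened inline copy of the first movie with xmlParse(asText = TRUE) instead of reading movies.xml, and the variable names are illustrative only.

```r
library(XML)

# Shortened inline copy of the first movie (assumption: not read from movies.xml)
xml_text <- '<movie mins="126" lang="eng"><title>Good Will Hunting</title><year>1998</year></movie>'
doc   <- xmlParse(xml_text, asText = TRUE)
movie <- xmlRoot(doc)

elem_name  <- xmlName(movie)              # name of the element: "movie"
attrs      <- xmlAttrs(movie)             # named character vector of all attributes
mins       <- xmlGetAttr(movie, "mins")   # one attribute value, returned as a string
child_tags <- names(xmlChildren(movie))   # names of the child elements (the content)
title      <- xmlValue(movie[["title"]])  # text content of the <title> child
```

Attribute values come back as character strings, so mins would need as.numeric() before any arithmetic.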
Last time we used xmlTreeParse, xmlRoot, and xmlValue to parse an XML document and extract its values.
Together with lapply and xmlSApply, these allowed us to convert an XML document into a data frame.
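library(XML)
library(magrittr)  # provides the %>% pipe
root <- xmlTreeParse("/Users/Adam/Desktop/stat133lectures_hw_lab/movies.xml") %>%
  xmlRoot()
# Return the text value of the child named `var` of node `x`
getvar <- function(x, var) xmlValue(x[[var]])
# For each child name of the first movie, collect that value from every movie
vars <- names(root[[1]])
res <- vars %>% lapply(function(var) { root %>% xmlSApply(getvar, var) })
names(res) <- vars  # name the columns after the child elements
movies <- data.frame(res)
head(movies)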
title | director | year | genre |
---|---|---|---|
Good Will Hunting | GusVan Sant | 1998 | drama |
Y tu mama tambien | AlfonsoCuaron | 2001 | drama |
We now look at another way to scrape XML files, based on XPath.
XPath is a language for navigating through elements and attributes in an XML/HTML document.
It uses path expressions to select nodes in an XML document and identifies patterns to match data or content.
The key concept is knowing how to write XPath expressions. XPath expressions have a syntax similar to the way files are located in a hierarchy of directories/folders in a computer file system. For instance:
/movies/movie
is the XPath expression that selects every movie element that is a child of the root movies element.
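To make the directory analogy concrete, here is a small sketch, assuming the XML package and a trimmed inline copy of the document (parsed with xmlParse(asText = TRUE) rather than from the file). It shows that /movies/movie matches both movies, and that a positional predicate such as [1] is needed to pick out only the first one.

```r
library(XML)

# Trimmed inline copy of movies.xml (assumption: titles only)
xml_text <- paste0(
  '<movies>',
  '<movie lang="eng"><title>Good Will Hunting</title></movie>',
  '<movie lang="spa"><title>Y tu mama tambien</title></movie>',
  '</movies>'
)
doc <- xmlParse(xml_text, asText = TRUE)

all_movies  <- getNodeSet(doc, "/movies/movie")     # every movie child of the root
first_movie <- getNodeSet(doc, "/movies/movie[1]")  # only the first movie child
titles      <- xpathSApply(doc, "/movies/movie/title", xmlValue)  # one level deeper

length(all_movies)   # number of matched movie nodes
length(first_movie)
titles
```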
The main path expressions (i.e. symbols) are:
Symbol | Description |
---|---|
/ | selects from the root node |
// | selects matching nodes anywhere in the document, at any depth |
. | selects the current node |
.. | selects the parent of the current node |
@ | selects attributes |
[ ] | encloses a predicate (a condition) that filters nodes, e.g. by attribute value |
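Each symbol in the table can be tried out directly. The sketch below assumes the XML package and an inline copy of movies.xml; the variable names are hypothetical, and xpathSApply() is used instead of getNodeSet() only so that the results come back as plain vectors.

```r
library(XML)

# Inline copy of movies.xml (assumption: parsed from text, not from the file)
xml_text <- paste0(
  '<movies>',
  '<movie mins="126" lang="eng"><title>Good Will Hunting</title>',
  '<director><first_name>Gus</first_name><last_name>Van Sant</last_name></director>',
  '<year>1998</year><genre>drama</genre></movie>',
  '<movie mins="106" lang="spa"><title>Y tu mama tambien</title>',
  '<director><first_name>Alfonso</first_name><last_name>Cuaron</last_name></director>',
  '<year>2001</year><genre>drama</genre></movie>',
  '</movies>'
)
doc <- xmlParse(xml_text, asText = TRUE)

# "/"  : absolute path, starting at the root
root_titles <- xpathSApply(doc, "/movies/movie/title", xmlValue)
# "//" : match last_name elements at any depth
last_names <- xpathSApply(doc, "//last_name", xmlValue)
# "@"  : select the lang attribute of every movie
langs <- unname(xpathSApply(doc, "//movie/@lang"))
# "[ ]": predicate keeping only movies whose lang attribute equals 'spa'
spa_title <- xpathSApply(doc, "//movie[@lang='spa']/title", xmlValue)
# "." and "..": find the year equal to 2001, step up to its parent movie,
# then down to that movie's title
title_2001 <- xpathSApply(doc, "//year[. = '2001']/../title", xmlValue)
```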
XPath wildcards can be used to select unknown XML elements:
Symbol | Description |
---|---|
* | matches any element |
node() | matches any node of any kind |
@* | matches any attribute |
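The wildcards can be sketched in the same way, again assuming the XML package and an inline copy of movies.xml; the variable names are illustrative.

```r
library(XML)

# Inline copy of movies.xml (assumption: parsed from text, not from the file)
xml_text <- paste0(
  '<movies>',
  '<movie mins="126" lang="eng"><title>Good Will Hunting</title>',
  '<director><first_name>Gus</first_name><last_name>Van Sant</last_name></director>',
  '<year>1998</year><genre>drama</genre></movie>',
  '<movie mins="106" lang="spa"><title>Y tu mama tambien</title>',
  '<director><first_name>Alfonso</first_name><last_name>Cuaron</last_name></director>',
  '<year>2001</year><genre>drama</genre></movie>',
  '</movies>'
)
doc <- xmlParse(xml_text, asText = TRUE)

# "*": every child element of the first movie, whatever its name
child_names <- xpathSApply(doc, "/movies/movie[1]/*", xmlName)
# "@*": every attribute of the first movie
first_attrs <- unname(xpathSApply(doc, "/movies/movie[1]/@*"))
# "node()": any node under the first title, here its single text node
title_text <- xpathSApply(doc, "/movies/movie[1]/title/node()", xmlValue)
```

The difference between * and node() is that * matches only element nodes, while node() also matches text nodes and comments.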
To work with XPath expressions we use the function getNodeSet(), which accepts an XPath expression and returns the matching node set. Its main usage is:
getNodeSet(doc, path)
where doc is an object of class XMLInternalDocument and path is a string giving the XPath expression to be evaluated.
Example:
movies_xml <- xmlTreeParse("/Users/Adam/Desktop/stat133lectures_hw_lab/movies.xml", useInternalNodes = TRUE)
class(movies_xml)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
getNodeSet(movies_xml, "//movie")
## [[1]]
## <movie mins="126" lang="eng">
## <title>Good Will Hunting</title>
## <director>
## <first_name>Gus</first_name>
## <last_name>Van Sant</last_name>
## </director>
## <year>1998</year>
## <genre>drama</genre>
## </movie>
##
## [[2]]
## <movie mins="106" lang="spa">
## <title>Y tu mama tambien</title>
## <director>
## <first_name>Alfonso</first_name>
## <last_name>Cuaron</last_name>
## </director>
## <year>2001</year>
## <genre>drama</genre>
## </movie>
##
## attr(,"class")
## [1] "XMLNodeSet"
getNodeSet(movies_xml, "//movie[@lang='eng']")
## [[1]]
## <movie mins="126" lang="eng">
## <title>Good Will Hunting</title>
## <director>
## <first_name>Gus</first_name>
## <last_name>Van Sant</last_name>
## </director>
## <year>1998</year>
## <genre>drama</genre>
## </movie>
##
## attr(,"class")
## [1] "XMLNodeSet"
getNodeSet(movies_xml, "//year")
## [[1]]
## <year>1998</year>
##
## [[2]]
## <year>2001</year>
##
## attr(,"class")
## [1] "XMLNodeSet"
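Node sets still have to be turned into values before they are useful for analysis. As a sketch of one possible approach (not necessarily the one used in this course), xpathSApply() applies a function such as xmlValue or xmlGetAttr to each matched node, and the resulting vectors can be assembled into a data frame. The example assumes the XML package and an inline copy of movies.xml; the column names are chosen for illustration.

```r
library(XML)

# Inline copy of movies.xml (assumption: parsed from text, not from the file)
xml_text <- paste0(
  '<movies>',
  '<movie mins="126" lang="eng"><title>Good Will Hunting</title>',
  '<director><first_name>Gus</first_name><last_name>Van Sant</last_name></director>',
  '<year>1998</year><genre>drama</genre></movie>',
  '<movie mins="106" lang="spa"><title>Y tu mama tambien</title>',
  '<director><first_name>Alfonso</first_name><last_name>Cuaron</last_name></director>',
  '<year>2001</year><genre>drama</genre></movie>',
  '</movies>'
)
doc <- xmlParse(xml_text, asText = TRUE)

movies <- data.frame(
  title = xpathSApply(doc, "//movie/title", xmlValue),
  # Join first and last names with a space instead of relying on xmlValue(director)
  director = paste(
    xpathSApply(doc, "//director/first_name", xmlValue),
    xpathSApply(doc, "//director/last_name", xmlValue)
  ),
  year = as.integer(xpathSApply(doc, "//movie/year", xmlValue)),
  # Extra arguments to xpathSApply are passed on to the function (here xmlGetAttr)
  lang = unname(xpathSApply(doc, "//movie", xmlGetAttr, "lang")),
  mins = as.numeric(xpathSApply(doc, "//movie", xmlGetAttr, "mins")),
  stringsAsFactors = FALSE
)
movies
```

Building each column from its own XPath expression assumes every movie has exactly one of each child; if an element were missing from some movie, the columns would have different lengths and would no longer line up.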