Loading and representing the contents of HTML/XML files in an R session
Inspecting content on the Web: browser to display HTML content nicely
Importing HTML files into R and extracting information from them: a parser in R constructs useful representations of HTML documents
Reading vs. Parsing
Reading does not try to understand the formal grammar that underlies HTML; it merely recognizes the sequence of symbols in the HTML file. In other words, reading simply loads the content of an HTML file into an R session.
library(httr)
url <- "http://www.r-datacollection.com/materials/ch-4-xpath/fortunes/fortunes.html"
fortune <- httr::GET(url)
fortune
## Response [http://www.r-datacollection.com/materials/ch-4-xpath/fortunes/fortunes.html]
## Date: 2020-10-05 15:50
## Status: 200
## Content-Type: text/html; charset=UTF-8
## Size: 776 B
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
## <html> <head>
## <title>Collected R wisdoms</title>
## </head>
##
## <body>
## <div id="R Inventor" lang="english" date="June/2003">
## <h1>Robert Gentleman</h1>
## <p><i>'What we have is nice, but we need something very different'</i></p>
## <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## ...
class(fortune)
## [1] "response"
GET() is agnostic about the different tag elements (name, attribute, values, etc.) and produces results that do not reflect the document’s internal hierarchy, as implied by the nested tags, in any sensible way.
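To make the point concrete, here is a minimal sketch of what "reading" amounts to. It uses base R only and a short inline snippet (a hypothetical stand-in for the downloaded page, not the real file), so it runs without a network connection.

```r
# Hypothetical inline snippet standing in for the downloaded fortunes page
html <- '<html><body><div date="June/2003"><h1>Robert Gentleman</h1></div></body></html>'

# Write it to a temporary file and read it back as plain text
path <- tempfile(fileext = ".html")
writeLines(html, path)
raw_lines <- readLines(path)

# The result is an ordinary character vector: no nodes, no hierarchy
class(raw_lines)
length(raw_lines)

# Tags are just characters, so pattern matching is the only tool available
grepl("<h1>", raw_lines)
```

Nothing in `raw_lines` records that the `<h1>` element is nested inside a `<div>`; that information only becomes accessible after parsing.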
To achieve a useful representation of HTML files, we need to employ a program that understands the special meaning of the markup structures and reconstructs the implied hierarchy of an HTML file within some R-specific data structure.
Transformation from any HTML file to a queryable Document Object Model: Parsing using the XML package in two steps
1. `htmlParse()` first parses the entire target document and creates the DOM in a tree-like data structure of the C language.
2. The C-level node structure is converted into an object of the R language through handler functions.
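The sketch below illustrates the outcome of parsing on a small inline document. The HTML string is an abbreviated, hypothetical stand-in for the fortunes page, and `asText = TRUE` tells `htmlParse()` to treat its first argument as content rather than as a file name or URL.

```r
library(XML)

# Abbreviated inline stand-in for the fortunes page (an assumption for illustration)
html <- '<html><head><title>Collected R wisdoms</title></head><body><div date="June/2003"><h1>Robert Gentleman</h1></div></body></html>'

# Parse the string into a document object that R can query
doc <- htmlParse(html, asText = TRUE)
class(doc)

# Unlike a plain character vector, the parsed object knows its hierarchy
root <- xmlRoot(doc)
xmlName(root)                                   # "html"
child_names <- sapply(xmlChildren(root), xmlName)
child_names                                     # the children of <html>: head and body
```

The parsed object keeps the nesting implied by the tags, which is what makes the XPath queries in the rest of these notes possible.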
library(XML)
parsed_fortune <- htmlParse(fortune)
parsed_fortune <- htmlParse(fortune, encoding = "UTF-8")
class(parsed_fortune)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument"
## [4] "XMLAbstractDocument"
parsed_fortune
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
## <html>
## <head><title>Collected R wisdoms</title></head>
## <body>
## <div id="R Inventor" lang="english" date="June/2003">
## <h1>Robert Gentleman</h1>
## <p><i>'What we have is nice, but we need something very different'</i></p>
## <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
##
## <div lang="english" date="October/2011">
## <h1>Rolf Turner</h1>
## <p><i>'R is wonderful, but it cannot work magic'</i> <br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
## <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
##
## <address>
## <a href="http://www.r-datacollectionbook.com"><i>The book homepage</i></a><a></a>
## </address>
##
## </body>
## </html>
##
Asking what information we are interested in and identifying where the information is located in a specific document
Tailoring a query to the document and obtaining the desired information
(Re)casting the extracted values into a format that facilitates further analysis
XPath is a query language that is useful for addressing and extracting parts from HTML/XML documents.
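The three steps above (locate, query, recast) can be sketched as follows. The inline HTML is an abbreviated, hypothetical copy of the fortunes page, and the choice to turn the `date` attribute into an integer year is only one possible recasting.

```r
library(XML)

# Abbreviated inline stand-in for the fortunes page (an assumption for illustration)
html <- '<html><body>
<div id="R Inventor" lang="english" date="June/2003"><h1>Robert Gentleman</h1></div>
<div lang="english" date="October/2011"><h1>Rolf Turner</h1></div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Step 1: locate. The publication date is stored in the date attribute of each <div>.
# Step 2: query. Address the <div> nodes and pull out that attribute.
dates <- xpathSApply(doc, "//div", xmlGetAttr, "date")

# Step 3: recast. Strip the month and convert the year to an integer.
years <- as.integer(sub(".*/", "", dates))
years   # 2003 2011
```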
library(XML)
parsed_fortune <- htmlParse(file=fortune)
print(parsed_fortune)
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
## <html>
## <head><title>Collected R wisdoms</title></head>
## <body>
## <div id="R Inventor" lang="english" date="June/2003">
## <h1>Robert Gentleman</h1>
## <p><i>'What we have is nice, but we need something very different'</i></p>
## <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
##
## <div lang="english" date="October/2011">
## <h1>Rolf Turner</h1>
## <p><i>'R is wonderful, but it cannot work magic'</i> <br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
## <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
##
## <address>
## <a href="http://www.r-datacollectionbook.com"><i>The book homepage</i></a><a></a>
## </address>
##
## </body>
## </html>
##
A tree perspective on parsed_fortune
xpathSApply(doc = parsed_fortune, path = "/html/body/div/p/i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i>
##
## [[2]]
## <i>'R is wonderful, but it cannot work magic'</i>
xpathSApply(parsed_fortune, "//body//p/i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i>
##
## [[2]]
## <i>'R is wonderful, but it cannot work magic'</i>
xpathSApply(parsed_fortune, "//p/i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i>
##
## [[2]]
## <i>'R is wonderful, but it cannot work magic'</i>
Relative path statements result in complete traversals of the document tree, which is computationally expensive and slows the query. If execution speed matters, it is advisable to express node locations with absolute paths.
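As a rough check, the sketch below confirms that an absolute and a relative path return the same nodes, and then times repeated runs of each. The inline HTML is an abbreviated, hypothetical stand-in for the fortunes page; on a document this small any timing difference is likely negligible, so treat the timings as a template for larger files rather than as evidence.

```r
library(XML)

# Abbreviated inline stand-in for the fortunes page (an assumption for illustration)
html <- '<html><body>
<div><h1>Robert Gentleman</h1><p><i>What we have is nice</i></p></div>
<div><h1>Rolf Turner</h1><p><i>R is wonderful</i></p></div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Same target nodes, addressed in two ways
abs_quotes <- xpathSApply(doc, "/html/body/div/p/i", xmlValue)
rel_quotes <- xpathSApply(doc, "//p/i", xmlValue)
identical(abs_quotes, rel_quotes)   # TRUE

# Time repeated evaluation of each expression (results vary by machine)
system.time(for (k in 1:200) xpathSApply(doc, "/html/body/div/p/i", xmlValue))
system.time(for (k in 1:200) xpathSApply(doc, "//p/i", xmlValue))
```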
xpathSApply(parsed_fortune, "/html/body/div/*/i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i>
##
## [[2]]
## <i>'R is wonderful, but it cannot work magic'</i>
xpathSApply(parsed_fortune, "//title/..")
## [[1]]
## <head>
## <title>Collected R wisdoms</title>
## </head>
xpathSApply(parsed_fortune, "//address | //title")
## [[1]]
## <title>Collected R wisdoms</title>
##
## [[2]]
## <address>
## <a href="http://www.r-datacollectionbook.com">
## <i>The book homepage</i>
## </a>
## <a/>
## </address>
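Node relations are expressed with the syntax node1/relation::node2, where relation names the family relationship (for example ancestor, parent, or preceding-sibling) in which node2 stands to node1.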
xpathSApply(parsed_fortune, "//a/ancestor::div")
## [[1]]
## <div lang="english" date="October/2011">
## <h1>Rolf Turner</h1>
## <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
## <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
xpathSApply(parsed_fortune, "//a/ancestor::div//i")
## [[1]]
## <i>'R is wonderful, but it cannot work magic'</i>
xpathSApply(parsed_fortune, "//p/preceding-sibling::h1")
## [[1]]
## <h1>Robert Gentleman</h1>
##
## [[2]]
## <h1>Rolf Turner</h1>
xpathSApply(parsed_fortune, "//title/parent::*")
## [[1]]
## <head>
## <title>Collected R wisdoms</title>
## </head>
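The same node1/relation::node2 pattern works with other standard XPath axes. The sketch below tries `following-sibling` and `child` on an abbreviated, hypothetical copy of the fortunes page; the axis names are part of XPath 1.0, while the inline HTML is an assumption made for illustration.

```r
library(XML)

# Abbreviated inline stand-in for the fortunes page (an assumption for illustration)
html <- '<html><body>
<div><h1>Robert Gentleman</h1><p><i>What we have is nice</i></p><p>Source: Reisensburg</p></div>
<div><h1>Rolf Turner</h1><p><i>R is wonderful</i></p><p>Source: <a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p></div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# All <p> nodes that come after an <h1> under the same parent
later_paragraphs <- xpathSApply(doc, "//h1/following-sibling::p", xmlValue)
length(later_paragraphs)   # 4

# Chain two relations: from the link up to its <div>, then down to the child <h1>
linked_author <- xpathSApply(doc, "//a/ancestor::div/child::h1", xmlValue)
linked_author              # "Rolf Turner"
```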
Visualizing node relations
Predicates are simple functions that are applied to a node’s name, value, or attribute, and which evaluate whether a condition is true or false.
After a node (or node set), we specify the predicate in square brackets: node1[predicate]. We select all nodes in the node set for which the predicate evaluates to true.
xpathSApply(parsed_fortune, "//div/p[position()=1]")
## [[1]]
## <p>
## <i>'What we have is nice, but we need something very different'</i>
## </p>
##
## [[2]]
## <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
xpathSApply(parsed_fortune, "//div/p[1]")
## [[1]]
## <p>
## <i>'What we have is nice, but we need something very different'</i>
## </p>
##
## [[2]]
## <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
xpathSApply(parsed_fortune, "//div/p[last()]")
## [[1]]
## <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
##
## [[2]]
## <p>
## <b>Source: </b>
## <a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a>
## </p>
xpathSApply(parsed_fortune, "//div/p[last()-1]")
## [[1]]
## <p>
## <i>'What we have is nice, but we need something very different'</i>
## </p>
##
## [[2]]
## <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
xpathSApply(parsed_fortune, "//div[count(.//a)>0]")
## [[1]]
## <div lang="english" date="October/2011">
## <h1>Rolf Turner</h1>
## <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
## <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
The @ element retrieves the attributes from a selected node. The ./@* expression returns all the attributes, regardless of their name, from the currently selected nodes.
xpathSApply(parsed_fortune, "//div[count(./@*)>2]")
## [[1]]
## <div id="R Inventor" lang="english" date="June/2003">
## <h1>Robert Gentleman</h1>
## <p><i>'What we have is nice, but we need something very different'</i></p>
## <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
xpathSApply(parsed_fortune, "//div[not(count(./@*)>2)]")
## [[1]]
## <div lang="english" date="October/2011">
## <h1>Rolf Turner</h1>
## <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
## <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
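The attribute counts that these predicates rely on can be cross-checked from R. The sketch below counts attributes per `<div>` with `xmlAttrs()` and compares that with what the `count(./@*)` predicate selects; the inline HTML is an abbreviated, hypothetical stand-in for the fortunes page.

```r
library(XML)

# Abbreviated inline stand-in for the fortunes page (an assumption for illustration)
html <- '<html><body>
<div id="R Inventor" lang="english" date="June/2003"><h1>Robert Gentleman</h1></div>
<div lang="english" date="October/2011"><h1>Rolf Turner</h1></div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Count the attributes of every <div> on the R side
attr_counts <- xpathSApply(doc, "//div", function(node) length(xmlAttrs(node)))
attr_counts     # 3 2

# The XPath predicate keeps only the <div> with more than two attributes
many_attrs <- xpathSApply(doc, "//div[count(./@*)>2]/h1", xmlValue)
many_attrs      # "Robert Gentleman"
```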
xpathSApply(parsed_fortune, "//div[@date='October/2011']")
## [[1]]
## <div lang="english" date="October/2011">
## <h1>Rolf Turner</h1>
## <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
## <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
xpathSApply(parsed_fortune, "//*[contains(text(), 'magic')]")
## [[1]]
## <i>'R is wonderful, but it cannot work magic'</i>
xpathSApply(parsed_fortune, "//div[starts-with(./@id, 'R')]")
## [[1]]
## <div id="R Inventor" lang="english" date="June/2003">
## <h1>Robert Gentleman</h1>
## <p><i>'What we have is nice, but we need something very different'</i></p>
## <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
xpathSApply(parsed_fortune, "//div[substring-after(./@date, '/')='2003']//i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i>
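Conditions inside a predicate can also be combined with the XPath operators `and` and `or`. The sketch below is one way to do so on an abbreviated, hypothetical copy of the fortunes page; the particular combinations are chosen only for illustration.

```r
library(XML)

# Abbreviated inline stand-in for the fortunes page (an assumption for illustration)
html <- '<html><body>
<div id="R Inventor" lang="english" date="June/2003"><h1>Robert Gentleman</h1><p>Source: Reisensburg</p></div>
<div lang="english" date="October/2011"><h1>Rolf Turner</h1><p>Source: <a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p></div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Both conditions must hold: English-language <div> that contains a link
with_link <- xpathSApply(doc, "//div[@lang='english' and count(.//a)>0]/h1", xmlValue)
with_link     # "Rolf Turner"

# Either condition may hold: has an id attribute, or is dated October/2011
either <- xpathSApply(doc, "//div[@id or @date='October/2011']/h1", xmlValue)
either        # "Robert Gentleman" "Rolf Turner"
```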
xpathSApply(parsed_fortune, "//title", fun = xmlValue)
## [1] "Collected R wisdoms"
xpathSApply(parsed_fortune, "//div", xmlAttrs)
## [[1]]
## id lang date
## "R Inventor" "english" "June/2003"
##
## [[2]]
## lang date
## "english" "October/2011"
xpathSApply(parsed_fortune, "//div", xmlGetAttr, "lang")
## [1] "english" "english"
The fun argument
lowerCaseFun <- function(x) {
  x <- tolower(xmlValue(x))
  x
}
xpathSApply(parsed_fortune, "//div//i", fun = lowerCaseFun)
## [1] "'what we have is nice, but we need something very different'"
## [2] "'r is wonderful, but it cannot work magic'"
dateFun <- function(x) {
  require(stringr)
  date <- xmlGetAttr(node = x, name = "date")
  year <- str_extract(date, "[0-9]{4}")
  year
}
xpathSApply(parsed_fortune, "//div", dateFun)
## Loading required package: stringr
## [1] "2003" "2011"
idFun <- function(x) {
  id <- xmlGetAttr(x, "id")
  id <- ifelse(is.null(id), "not specified", id)
  return(id)
}
xpathSApply(parsed_fortune, "//div", idFun)
## [1] "R Inventor" "not specified"
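The extractor functions above can be combined to recast the document into a data frame. The sketch below is one possible arrangement, not the only one: it uses the `default` argument of `xmlGetAttr()` as an alternative to the `is.null()` check in `idFun`, and it runs on an abbreviated, hypothetical copy of the fortunes page. The column names are assumptions chosen for illustration.

```r
library(XML)

# Abbreviated inline stand-in for the fortunes page (an assumption for illustration)
html <- '<html><body>
<div id="R Inventor" lang="english" date="June/2003"><h1>Robert Gentleman</h1><p><i>What we have is nice</i></p></div>
<div lang="english" date="October/2011"><h1>Rolf Turner</h1><p><i>R is wonderful</i></p></div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# One query per column
authors <- xpathSApply(doc, "//div/h1", xmlValue)
quotes  <- xpathSApply(doc, "//div/p/i", xmlValue)
dates   <- xpathSApply(doc, "//div", xmlGetAttr, "date")

# xmlGetAttr() can supply a fallback value when the attribute is missing
ids <- xpathSApply(doc, "//div", function(node) {
  xmlGetAttr(node, "id", default = "not specified")
})

# Recast everything into a rectangular structure for further analysis
fortunes <- data.frame(
  author = authors,
  quote  = quotes,
  year   = as.integer(sub(".*/", "", dates)),
  id     = ids,
  stringsAsFactors = FALSE
)
fortunes
```

Building each column with its own query assumes that every `<div>` contains exactly one `<h1>` and one `<i>`; if that does not hold, the vectors will differ in length and `data.frame()` will raise an error.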