library(httr)
url <- "http://www.r-datacollection.com/materials/ch-4-xpath/fortunes/fortunes.html"
fortune <- httr::GET(url)
fortune
## Response [http://www.r-datacollection.com/materials/ch-4-xpath/fortunes/fortunes.html]
## Date: 2021-10-13 06:56
## Status: 200
## Content-Type: text/html; charset=UTF-8
## Size: 776 B
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
## <html> <head>
## <title>Collected R wisdoms</title>
## </head>
##
## <body>
## <div id="R Inventor" lang="english" date="June/2003">
## <h1>Robert Gentleman</h1>
## <p><i>'What we have is nice, but we need something very different'</i></p>
## <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## ...
class(fortune)
## [1] "response"
library(XML)
parsed_fortune <- htmlParse(fortune)
class(parsed_fortune)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument"
## [4] "XMLAbstractDocument"
parsed_fortune
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
## <html>
## <head><title>Collected R wisdoms</title></head>
## <body>
## <div id="R Inventor" lang="english" date="June/2003">
## <h1>Robert Gentleman</h1>
## <p><i>'What we have is nice, but we need something very different'</i></p>
## <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
##
## <div lang="english" date="October/2011">
## <h1>Rolf Turner</h1>
## <p><i>'R is wonderful, but it cannot work magic'</i> <br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
## <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
##
## <address>
## <a href="http://www.r-datacollectionbook.com"><i>The book homepage</i></a><a></a>
## </address>
##
## </body>
## </html>
##
XPath is a query language that is useful for addressing and extracting parts from HTML/XML documents.
Predicates are simple functions that are applied to a node’s name, value, or attribute, and which evaluate whether a condition is true or false.
A predicate is written in square brackets after a node (or node set): node1[predicate]. The expression selects all nodes in node1 for which the predicate evaluates to true.
Textual properties of the document are useful predicates for node selection.
parsed_fortune
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
## <html>
## <head><title>Collected R wisdoms</title></head>
## <body>
## <div id="R Inventor" lang="english" date="June/2003">
## <h1>Robert Gentleman</h1>
## <p><i>'What we have is nice, but we need something very different'</i></p>
## <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
##
## <div lang="english" date="October/2011">
## <h1>Rolf Turner</h1>
## <p><i>'R is wonderful, but it cannot work magic'</i> <br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
## <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
##
## <address>
## <a href="http://www.r-datacollectionbook.com"><i>The book homepage</i></a><a></a>
## </address>
##
## </body>
## </html>
##
xpathSApply(parsed_fortune, "//div[@date='October/2011']")
## [[1]]
## <div lang="english" date="October/2011">
## <h1>Rolf Turner</h1>
## <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
## <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
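The predicate syntax above can be tried without a network connection. The following is a minimal sketch, assuming a shortened inline copy of the fortunes page (the `html` string and the variable names are made up for illustration) and the XML package's `htmlParse(asText = TRUE)`:

```r
library(XML)

# Hypothetical, shortened stand-in for the fortunes page
html <- '<html><head><title>Collected R wisdoms</title></head>
<body>
<div id="R Inventor" lang="english" date="June/2003">
  <h1>Robert Gentleman</h1>
  <p><i>What we have is nice, but we need something very different</i></p>
</div>
<div lang="english" date="October/2011">
  <h1>Rolf Turner</h1>
  <p><i>R is wonderful, but it cannot work magic</i></p>
</div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Attribute-existence predicate: only <div> nodes that carry an id attribute
with_id <- xpathSApply(doc, "//div[@id]/h1", xmlValue)

# Numeric predicate: the second <div> among its siblings
second <- xpathSApply(doc, "//div[2]/h1", xmlValue)

# Attribute-value predicate, as in the date example above
by_date <- xpathSApply(doc, "//div[@date='June/2003']/h1", xmlValue)

print(with_id)
print(second)
print(by_date)
```

Each predicate narrows the `//div` node set before the path continues to `h1`, so the extractor function only ever sees the nodes that passed the condition.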
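XPath string functions used in predicates have the general form string_method(text1, 'text2'), where text1 refers to a text element of the document and text2 is the string to compare it against.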
xpathSApply(parsed_fortune, "//*[contains(text(), 'magic')]")
## [[1]]
## <i>'R is wonderful, but it cannot work magic'</i>
xpathSApply(parsed_fortune, "//div[starts-with(./@id, 'R')]")
## [[1]]
## <div id="R Inventor" lang="english" date="June/2003">
## <h1>Robert Gentleman</h1>
## <p><i>'What we have is nice, but we need something very different'</i></p>
## <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
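To see how the string functions behave, here is a small sketch on an inline document (the `html` string is a made-up, shortened copy of the page, not the original file). It also contrasts `text()` with `.`: `text()` looks only at a node's own text children, while `.` uses the node's full string value, including descendants.

```r
library(XML)

# Hypothetical, shortened stand-in for the fortunes page
html <- '<html><head><title>Collected R wisdoms</title></head>
<body>
<div id="R Inventor" date="June/2003">
  <h1>Robert Gentleman</h1>
  <p><i>What we have is nice, but we need something very different</i></p>
</div>
<div date="October/2011">
  <h1>Rolf Turner</h1>
  <p><i>R is wonderful, but it cannot work magic</i></p>
</div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# contains() on text(): matches the node whose own text holds 'magic'
own_text <- xpathSApply(doc, "//*[contains(text(), 'magic')]", xmlName)

# contains() on '.': matches a <div> because a descendant holds 'magic'
via_descendant <- xpathSApply(doc, "//div[contains(., 'magic')]/h1", xmlValue)

# starts-with() on an attribute value
starts_r <- xpathSApply(doc, "//div[starts-with(@id, 'R')]/h1", xmlValue)

print(own_text)
print(via_descendant)
print(starts_r)
```

Using `.` is often the more forgiving choice when the text of interest is wrapped in inline markup such as `<i>` or `<b>`.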
xpathSApply(parsed_fortune, "//title")
## [[1]]
## <title>Collected R wisdoms</title>
xpathSApply(parsed_fortune, "//title", fun = xmlValue)
## [1] "Collected R wisdoms"
xpathSApply(parsed_fortune, "//h1", fun = xmlValue)
## [1] "Robert Gentleman" "Rolf Turner"
xpathSApply(parsed_fortune, "//div", xmlAttrs)
## [[1]]
## id lang date
## "R Inventor" "english" "June/2003"
##
## [[2]]
## lang date
## "english" "October/2011"
xpathSApply(parsed_fortune, "//div", xmlGetAttr, "lang")
## [1] "english" "english"
The fun argument of xpathSApply() applies an extractor function to every node in the returned node set.
xpathSApply(parsed_fortune, "//div//i", xmlValue)
## [1] "'What we have is nice, but we need something very different'"
## [2] "'R is wonderful, but it cannot work magic'"
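Any function that accepts a node as its first argument can serve as an extractor, and further arguments are passed along after it. The sketch below illustrates this on an inline document; the `html` string and the helper `first_chars` are hypothetical names introduced only for this example.

```r
library(XML)

# Hypothetical, shortened stand-in for the fortunes page
html <- '<html><body>
<div lang="english" date="June/2003">
  <h1>Robert Gentleman</h1>
  <p><i>What we have is nice, but we need something very different</i></p>
</div>
<div lang="english" date="October/2011">
  <h1>Rolf Turner</h1>
  <p><i>R is wonderful, but it cannot work magic</i></p>
</div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# xmlName returns the element name of each selected node
child_names <- xpathSApply(doc, "//div/*", xmlName)

# Hypothetical helper: keep only the first n characters of a node's value
first_chars <- function(node, n) substr(xmlValue(node), 1, n)

# The extra argument (10) is handed to first_chars after the node
previews <- xpathSApply(doc, "//div//i", first_chars, 10)

print(child_names)
print(previews)
```

This is the same mechanism that lets `xmlGetAttr` receive the attribute name as an additional argument.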
lowerCaseFun <- function(x) {
x <- tolower(xmlValue(x))
return(x)
}
xpathSApply(parsed_fortune, "//div//i", fun = lowerCaseFun)
## [1] "'what we have is nice, but we need something very different'"
## [2] "'r is wonderful, but it cannot work magic'"
xpathSApply(parsed_fortune, "//div", xmlGetAttr, "date")
## [1] "June/2003" "October/2011"
dateFun <- function(x) {
require(stringr)
date <- xmlGetAttr(node = x, name = "date")
year <- str_extract(date, "[0-9]{4}")
year
}
xpathSApply(parsed_fortune, "//div", dateFun)
## Loading required package: stringr
## [1] "2003" "2011"
xpathSApply(parsed_fortune, "//div", xmlGetAttr, "id")
## [[1]]
## [1] "R Inventor"
##
## [[2]]
## NULL
idFun <- function(x) {
id <- xmlGetAttr(x, "id")
id <- ifelse(is.null(id), "not specified", id)
return(id)
}
xpathSApply(parsed_fortune, "//div", idFun)
## [1] "R Inventor" "not specified"
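The extractor functions above can be combined into one table. The following is a sketch under a few assumptions: the inline `html` string is a made-up copy of the page, `get_id` is a hypothetical helper, and every `<div>` is assumed to hold exactly one `<h1>` so that the columns line up. It uses the `default` argument of `xmlGetAttr` as an alternative to testing for `NULL` by hand, and base R `sub()` in place of stringr.

```r
library(XML)

# Hypothetical, shortened stand-in for the fortunes page
html <- '<html><body>
<div id="R Inventor" lang="english" date="June/2003">
  <h1>Robert Gentleman</h1>
</div>
<div lang="english" date="October/2011">
  <h1>Rolf Turner</h1>
</div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: return NA instead of NULL when the id is missing
get_id <- function(node) xmlGetAttr(node, "id", default = NA_character_)

dates <- xpathSApply(doc, "//div", xmlGetAttr, "date")

quotes <- data.frame(
  author = xpathSApply(doc, "//div/h1", xmlValue),
  year   = as.integer(sub(".*/", "", dates)),  # keep the part after the slash
  id     = xpathSApply(doc, "//div", get_id),
  stringsAsFactors = FALSE
)

print(quotes)
```

Returning `NA` rather than `NULL` keeps each result at length one, which allows `xpathSApply` to simplify the output to a vector instead of a list.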
Chrome is a browser that provides a suite of developer tools to help inspect elements in the webpage and create valid XPath statements that can be passed to XML’s node retrieval functions.
http://www.r-datacollection.com/materials/ch-4-xpath/fortunes/fortunes.html
'/html/body/div[2]/p[1]/i'
## [1] "/html/body/div[2]/p[1]/i"
xpathSApply(parsed_fortune, '/html/body/div[2]/p[1]/i', fun=xmlValue)
## [1] "'R is wonderful, but it cannot work magic'"
'/html/body/div[2]/p[2]/a'
## [1] "/html/body/div[2]/p[2]/a"
xpathSApply(parsed_fortune, '/html/body/div[2]/p[2]/a', fun=xmlGetAttr, "href")
## [1] "https://stat.ethz.ch/mailman/listinfo/r-help"
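Absolute paths copied from a browser depend on the exact position of every node. The sketch below, which uses a made-up inline document and a hypothetical banner `<div>` inserted at the top of the page, compares such a path with a hand-written one that selects by attribute.

```r
library(XML)

# Hypothetical, shortened stand-in for the fortunes page
html <- '<html><body>
<div date="June/2003">
  <p><i>What we have is nice, but we need something very different</i></p>
</div>
<div date="October/2011">
  <p><i>R is wonderful, but it cannot work magic</i></p>
</div>
</body></html>'

abs_path <- "/html/body/div[2]/p[1]/i"        # as copied from the browser
rel_path <- "//div[@date='October/2011']//i"  # written by hand

doc <- htmlParse(html, asText = TRUE)
abs_before <- xpathSApply(doc, abs_path, xmlValue)
rel_before <- xpathSApply(doc, rel_path, xmlValue)

# Simulate a page change: a new <div> appears before the existing ones
html_shifted <- sub("<body>", "<body><div><p><i>Banner</i></p></div>",
                    html, fixed = TRUE)
doc_shifted <- htmlParse(html_shifted, asText = TRUE)
abs_after <- xpathSApply(doc_shifted, abs_path, xmlValue)
rel_after <- xpathSApply(doc_shifted, rel_path, xmlValue)

print(identical(abs_before, rel_before))  # both paths agree on the original page
print(abs_after)                          # absolute path now hits another quote
print(rel_after)                          # attribute-based path is unaffected
```

The positional path silently returns a different node once the layout shifts, which is one reason to prefer expressions anchored on attributes or text.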
Even with browser tools that generate paths automatically, knowing how to build XPath expressions from scratch remains a necessary skill.
XPath language for querying HTML documents
How to build XPath expressions from scratch
Iterative learning process for constructing an applicable XPath statement