Source file ⇒ lec33.Rmd
Here is a good resource: http://www.w3schools.com/html/default.asp
HTML is a markup language for describing web documents (web pages).
HTML stands for Hyper Text Markup Language
A markup language is a set of markup tags
HTML documents are described by predefined HTML tags
Each HTML tag describes different document content
Well-formed HTML (XHTML) can be treated as a special case of XML
The DOCTYPE declaration defines the document type to be HTML
The text between <html> and </html> describes an HTML document
The text between <head> and </head> provides information about the document
The text between <title> and </title> provides a title for the document
The text between <body> and </body> describes the visible page content
The text between <h1> and </h1> describes a heading
The text between <p> and </p> describes a paragraph
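As a minimal sketch of how these tags fit together, the snippet below builds a tiny page as a string and pulls out its title, heading and paragraph. The page text is made up for illustration; `htmlParse()` with `asText = TRUE` and `xpathSApply()` come from the XML package used later in these notes.

```r
library(XML)

# A made-up page containing the tags described above
page <- paste0(
  "<!DOCTYPE html>",
  "<html>",
  "<head><title>Page Title</title></head>",
  "<body><h1>My First Heading</h1><p>My first paragraph.</p></body>",
  "</html>"
)

# asText = TRUE tells htmlParse() that 'page' holds HTML, not a file name
doc <- htmlParse(page, asText = TRUE)

# Each XPath expression picks out one of the elements described above
page_title <- xpathSApply(doc, "//head/title", xmlValue)
heading    <- xpathSApply(doc, "//body/h1", xmlValue)
paragraph  <- xpathSApply(doc, "//body/p", xmlValue)

print(c(title = page_title, h1 = heading, p = paragraph))
```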
Well-formed HTML means:
<!DOCTYPE html>
<div id="contentSub"> some content here </div>
HTML links are defined with the <a> tag:
<a href="http://www.w3schools.com">This is a link</a>
The link’s destination is specified in the href attribute.
Attributes are used to provide additional information about HTML elements.
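To see the difference between an element's content and its attributes from R, here is a small sketch that parses the link above from a string. The variable names are only for illustration; `xmlValue()`, `xmlGetAttr()` and `xmlAttrs()` are the XML package functions summarized later in these notes.

```r
library(XML)

link_html <- '<a href="http://www.w3schools.com">This is a link</a>'
doc <- htmlParse(link_html, asText = TRUE)

# getNodeSet() returns a list of matching nodes; take the first <a>
a_node <- getNodeSet(doc, "//a")[[1]]

link_text <- xmlValue(a_node)            # text between <a> and </a>
link_href <- xmlGetAttr(a_node, "href")  # value of the href attribute
all_attrs <- xmlAttrs(a_node)            # named character vector of all attributes

print(link_text)
print(link_href)
```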
<pre> displays any text content exactly as it appears in the source code. It is useful for displaying computer code or computer output, or for formatting a table a certain way.
<!DOCTYPE html>
<html>
<body>
<p>The pre tag preserves both spaces and line breaks:</p>
<pre>
My Bonnie lies over the ocean.
My Bonnie lies over the sea.
My Bonnie lies over the ocean.
Oh, bring back my Bonnie to me.
</pre>
</body>
</html>
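Because `<pre>` keeps line breaks, its content can be split back into lines once it is read into R. The following is a rough sketch using the page above pasted into a string; the variable names are assumptions made for this example.

```r
library(XML)

# The <pre> example from above, written as one string with explicit newlines
page <- paste(
  "<html><body><pre>",
  "My Bonnie lies over the ocean.",
  "My Bonnie lies over the sea.",
  "My Bonnie lies over the ocean.",
  "Oh, bring back my Bonnie to me.",
  "</pre></body></html>",
  sep = "\n"
)
doc <- htmlParse(page, asText = TRUE)

# xmlValue() returns the <pre> content as a single string, newlines included
pre_text <- xpathSApply(doc, "//pre", xmlValue)

# Split on newlines and drop empty pieces at the start or end
song_lines <- strsplit(pre_text, "\n", fixed = TRUE)[[1]]
song_lines <- song_lines[nzchar(song_lines)]

print(song_lines)
```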
<table style="width:100%">
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
Tables are defined with the <table> tag.
Tables are divided into table rows with the <tr> tag.
Table rows are divided into table data with the <td> tag.
A table row can also be divided into table headings with the <th> tag.
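The row and cell structure maps naturally onto a data frame. Below is one possible sketch: it selects every `<tr>`, reads the `<td>` values inside each row, and stacks the rows. The column names `first`, `last` and `points` are assumptions, since the table above has no `<th>` headings.

```r
library(XML)

table_html <- paste0(
  '<table style="width:100%">',
  "<tr><td>Jill</td><td>Smith</td><td>50</td></tr>",
  "<tr><td>Eve</td><td>Jackson</td><td>94</td></tr>",
  "</table>"
)
doc <- htmlParse(table_html, asText = TRUE)

# One node per table row
rows <- getNodeSet(doc, "//tr")

# For each row, collect the text of its <td> children ("./td" is relative to the row)
cells <- lapply(rows, function(row) xpathSApply(row, "./td", xmlValue))

# Stack the rows into a data frame; the column names are made up for this example
people <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(people) <- c("first", "last", "points")
people$points <- as.numeric(people$points)

print(people)
```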
Since well-formed HTML is a special case of XML, we can extract latitude and longitude from this HTML file using the above tools.
Summary of XML functions:
Function | Description |
---|---|
xmlTreeParse() or xmlParse() | reads xml file, returns class XMLDocument or XMLInternalDocument |
htmlTreeParse() or htmlParse() | reads html file, returns class XMLDocument or XMLInternalDocument |
xmlRoot() | gets access to the root node and its elements |
xmlChildren() | gets access to the child elements of a given node |
xmlName() | name of the node |
xmlSize() | number of subnodes |
xmlAttrs() | named character vector of all attributes |
xmlGetAttr() | value of a single attribute |
xmlValue() | contents of a leaf node |
xmlParent() | parent node |
xmlAncestors() | ancestor nodes |
getSibling() | siblings to the right or to the left |
xmlNamespace() | the namespace (if there’s one) |
xmlApply() | lapply function applied to nodes of a tree |
xmlSApply() | sapply function applied to nodes of a tree |
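`xmlSApply()` applies a function to each child of a node, which is handy for collecting one value per child. Here is a brief sketch on a shortened version of the movies file, typed in as a string so that it runs without any file on disk (the shortening is mine, not part of the lecture file).

```r
library(XML)

# Shortened movies document, supplied as text instead of a file path
movies_txt <- paste0(
  '<?xml version="1.0"?>',
  "<movies>",
  '<movie mins="126" lang="eng"><title>Good Will Hunting</title><year>1998</year></movie>',
  '<movie mins="106" lang="spa"><title>Y tu mama tambien</title><year>2001</year></movie>',
  "</movies>"
)
doc  <- xmlParse(movies_txt, asText = TRUE)
root <- xmlRoot(doc)

n_movies <- xmlSize(root)                        # number of <movie> children
langs    <- xmlSApply(root, xmlGetAttr, "lang")  # lang attribute of each child
n_fields <- xmlSApply(root, xmlSize)             # number of subnodes of each child

print(n_movies)
print(langs)
print(n_fields)
```

The results carry the child node names (`movie`) as vector names; wrap them in `unname()` if plain vectors are preferred.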
Examples:
<?xml version="1.0"?>
<movies>
<movie mins="126" lang="eng">
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1998</year>
<genre>drama</genre>
</movie>
<movie mins="106" lang="spa">
<title>Y tu mama tambien</title>
<director>
<first_name>Alfonso</first_name>
<last_name>Cuaron</last_name>
</director>
<year>2001</year>
<genre>drama</genre>
</movie>
</movies>
library(XML)
movies_xml <- xmlTreeParse("/Users/Adam/Desktop/stat133lectures_hw_lab/movies.xml", useInternalNodes = TRUE)
class(movies_xml)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
root = xmlRoot(movies_xml)
movie_child = xmlChildren(root)
movie_child
## $movie
## <movie mins="126" lang="eng">
## <title>Good Will Hunting</title>
## <director>
## <first_name>Gus</first_name>
## <last_name>Van Sant</last_name>
## </director>
## <year>1998</year>
## <genre>drama</genre>
## </movie>
##
## $movie
## <movie mins="106" lang="spa">
## <title>Y tu mama tambien</title>
## <director>
## <first_name>Alfonso</first_name>
## <last_name>Cuaron</last_name>
## </director>
## <year>2001</year>
## <genre>drama</genre>
## </movie>
##
## attr(,"class")
## [1] "XMLInternalNodeList" "XMLNodeList"
goodwill = movie_child[[1]]
goodwill
## <movie mins="126" lang="eng">
## <title>Good Will Hunting</title>
## <director>
## <first_name>Gus</first_name>
## <last_name>Van Sant</last_name>
## </director>
## <year>1998</year>
## <genre>drama</genre>
## </movie>
# node name
xmlName(goodwill)
## [1] "movie"
# number of children
xmlSize(goodwill)
## [1] 4
# node attributes
xmlAttrs(goodwill)
## mins lang
## "126" "eng"
# get specific attribute value
xmlGetAttr(goodwill, name = 'lang')
## [1] "eng"
# node content (as character string)
xmlValue(goodwill)
## [1] "Good Will HuntingGusVan Sant1998drama"
# child nodes of goodwill node
xmlChildren(goodwill)
## $title
## <title>Good Will Hunting</title>
##
## $director
## <director>
## <first_name>Gus</first_name>
## <last_name>Van Sant</last_name>
## </director>
##
## $year
## <year>1998</year>
##
## $genre
## <genre>drama</genre>
##
## attr(,"class")
## [1] "XMLInternalNodeList" "XMLNodeList"
# parent
xmlParent(goodwill)
## <movies>
## <movie mins="126" lang="eng">
## <title>Good Will Hunting</title>
## <director>
## <first_name>Gus</first_name>
## <last_name>Van Sant</last_name>
## </director>
## <year>1998</year>
## <genre>drama</genre>
## </movie>
## <movie mins="106" lang="spa">
## <title>Y tu mama tambien</title>
## <director>
## <first_name>Alfonso</first_name>
## <last_name>Cuaron</last_name>
## </director>
## <year>2001</year>
## <genre>drama</genre>
## </movie>
## </movies>
# sibling of goodwill node
getSibling(goodwill)
## <movie mins="106" lang="spa">
## <title>Y tu mama tambien</title>
## <director>
## <first_name>Alfonso</first_name>
## <last_name>Cuaron</last_name>
## </director>
## <year>2001</year>
## <genre>drama</genre>
## </movie>
To work with XPath expressions we use the function getNodeSet() that accepts XPath expressions in order to select node-sets. Its main usage is:
getNodeSet(doc, path)
where doc is an object of class XMLInternalDocument and path is a string giving the XPath expression to be evaluated. Its output is an XMLNodeSet.
Example:
getNodeSet(movies_xml, "//movie")
## [[1]]
## <movie mins="126" lang="eng">
## <title>Good Will Hunting</title>
## <director>
## <first_name>Gus</first_name>
## <last_name>Van Sant</last_name>
## </director>
## <year>1998</year>
## <genre>drama</genre>
## </movie>
##
## [[2]]
## <movie mins="106" lang="spa">
## <title>Y tu mama tambien</title>
## <director>
## <first_name>Alfonso</first_name>
## <last_name>Cuaron</last_name>
## </director>
## <year>2001</year>
## <genre>drama</genre>
## </movie>
##
## attr(,"class")
## [1] "XMLNodeSet"
getNodeSet(movies_xml, "//movie[@lang='eng']")
## [[1]]
## <movie mins="126" lang="eng">
## <title>Good Will Hunting</title>
## <director>
## <first_name>Gus</first_name>
## <last_name>Van Sant</last_name>
## </director>
## <year>1998</year>
## <genre>drama</genre>
## </movie>
##
## attr(,"class")
## [1] "XMLNodeSet"
getNodeSet(movies_xml, "//year")
## [[1]]
## <year>1998</year>
##
## [[2]]
## <year>2001</year>
##
## attr(,"class")
## [1] "XMLNodeSet"
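XPath predicates can also compare child values and attributes as numbers. The sketch below re-creates the movies document from a string (so it does not depend on the file path used above) and tries a few more expressions. The particular queries are illustrative choices, not part of the lecture.

```r
library(XML)

movies_txt <- paste0(
  '<?xml version="1.0"?>',
  "<movies>",
  '<movie mins="126" lang="eng"><title>Good Will Hunting</title>',
  "<director><first_name>Gus</first_name><last_name>Van Sant</last_name></director>",
  "<year>1998</year><genre>drama</genre></movie>",
  '<movie mins="106" lang="spa"><title>Y tu mama tambien</title>',
  "<director><first_name>Alfonso</first_name><last_name>Cuaron</last_name></director>",
  "<year>2001</year><genre>drama</genre></movie>",
  "</movies>"
)
doc <- xmlParse(movies_txt, asText = TRUE)

# Titles of movies whose <year> child is greater than 2000
recent <- sapply(getNodeSet(doc, "//movie[year > 2000]/title"), xmlValue)

# Last names of all directors, wherever <director> appears
directors <- sapply(getNodeSet(doc, "//director/last_name"), xmlValue)

# Titles of movies whose mins attribute is below 120
short <- sapply(getNodeSet(doc, "//movie[@mins < 120]/title"), xmlValue)

print(recent)
print(directors)
print(short)
```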
One can also use the functions xpathApply() or xpathSApply() to find matching nodes in an internal XML tree.
Syntax:
xpathSApply(doc, path, fun, ... )
The output is a vector (or, in the case of xpathApply(), a list).
xpathSApply(movies_xml, "//year", xmlValue)
## [1] "1998" "2001"
xpathSApply(movies_xml, "//movie[@lang='eng']/year", xmlValue)
## [1] "1998"
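Extra arguments after `fun` are passed on to it, so `xpathSApply()` can collect attributes as well as values. As a rough sketch, the pieces can be combined into one data frame per movie; the shortened document and the column names below are assumptions for illustration.

```r
library(XML)

movies_txt <- paste0(
  '<?xml version="1.0"?>',
  "<movies>",
  '<movie mins="126" lang="eng"><title>Good Will Hunting</title><year>1998</year></movie>',
  '<movie mins="106" lang="spa"><title>Y tu mama tambien</title><year>2001</year></movie>',
  "</movies>"
)
doc <- xmlParse(movies_txt, asText = TRUE)

# One column per XPath query; "lang" and "mins" are passed through to xmlGetAttr()
movies <- data.frame(
  title = xpathSApply(doc, "//movie/title", xmlValue),
  year  = as.integer(xpathSApply(doc, "//movie/year", xmlValue)),
  lang  = xpathSApply(doc, "//movie", xmlGetAttr, "lang"),
  mins  = as.integer(xpathSApply(doc, "//movie", xmlGetAttr, "mins")),
  stringsAsFactors = FALSE
)

print(movies)
```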
Locations for countries are available from many places on the Web. MaxMind provides these data within an HTML page available at:
http://dev.maxmind.com/geoip/legacy/codes/country_latlon/
Since well-formed HTML is a special case of XML, we can extract latitude and longitude from this HTML file using the above tools.
Examine the HTML source, and notice that the data are simply placed as plain text within a <pre> node in the document. If we can extract the contents of this <pre> node, then we can place this information in a data frame.
Parse the page with htmlParse() and access the root of the document using xmlRoot().
htmlParse(): this is a wrapper for htmlTreeParse(), so you don’t have to write the argument useInternalNodes = TRUE.
doc <- htmlParse("http://dev.maxmind.com/geoip/legacy/codes/country_latlon/")
root <- xmlRoot(doc)
Extract the contents of the <pre> node in the document.
root %>% xpathSApply("//pre", xmlValue)
## [1] "\r\n\"iso 3166 country\",\"latitude\",\"longitude\"\r\nAD,42.5000,1.5000\r\nAE,24.0000,54.0000\r\nAF,33.0000,65.0000\r\nAG,17.0500,-61.8000\r\nAI,18.2500,-63.1667\r\nAL,41.0000,20.0000\r\nAM,40.0000,45.0000\r\nAN,12.2500,-68.7500\r\nAO,-12.5000,18.5000\r\nAP,35.0000,105.0000\r\nAQ,-90.0000,0.0000\r\nAR,-34.0000,-64.0000\r\nAS,-14.3333,-170.0000\r\nAT,47.3333,13.3333\r\nAU,-27.0000,133.0000\r\nAW,12.5000,-69.9667\r\nAZ,40.5000,47.5000\r\nBA,44.0000,18.0000\r\nBB,13.1667,-59.5333\r\nBD,24.0000,90.0000\r\nBE,50.8333,4.0000\r\nBF,13.0000,-2.0000\r\nBG,43.0000,25.0000\r\nBH,26.0000,50.5500\r\nBI,-3.5000,30.0000\r\nBJ,9.5000,2.2500\r\nBM,32.3333,-64.7500\r\nBN,4.5000,114.6667\r\nBO,-17.0000,-65.0000\r\nBR,-10.0000,-55.0000\r\nBS,24.2500,-76.0000\r\nBT,27.5000,90.5000\r\nBV,-54.4333,3.4000\r\nBW,-22.0000,24.0000\r\nBY,53.0000,28.0000\r\nBZ,17.2500,-88.7500\r\nCA,60.0000,-95.0000\r\nCC,-12.5000,96.8333\r\nCD,0.0000,25.0000\r\nCF,7.0000,21.0000\r\nCG,-1.0000,15.0000\r\nCH,47.0000,8.0000\r\nCI,8.0000,-5.0000\r\nCK,-21.2333,-159.7667\r\nCL,-30.0000,-71.0000\r\nCM,6.0000,12.0000\r\nCN,35.0000,105.0000\r\nCO,4.0000,-72.0000\r\nCR,10.0000,-84.0000\r\nCU,21.5000,-80.0000\r\nCV,16.0000,-24.0000\r\nCX,-10.5000,105.6667\r\nCY,35.0000,33.0000\r\nCZ,49.7500,15.5000\r\nDE,51.0000,9.0000\r\nDJ,11.5000,43.0000\r\nDK,56.0000,10.0000\r\nDM,15.4167,-61.3333\r\nDO,19.0000,-70.6667\r\nDZ,28.0000,3.0000\r\nEC,-2.0000,-77.5000\r\nEE,59.0000,26.0000\r\nEG,27.0000,30.0000\r\nEH,24.5000,-13.0000\r\nER,15.0000,39.0000\r\nES,40.0000,-4.0000\r\nET,8.0000,38.0000\r\nEU,47.0000,8.0000\r\nFI,64.0000,26.0000\r\nFJ,-18.0000,175.0000\r\nFK,-51.7500,-59.0000\r\nFM,6.9167,158.2500\r\nFO,62.0000,-7.0000\r\nFR,46.0000,2.0000\r\nGA,-1.0000,11.7500\r\nGB,54.0000,-2.0000\r\nGD,12.1167,-61.6667\r\nGE,42.0000,43.5000\r\nGF,4.0000,-53.0000\r\nGH,8.0000,-2.0000\r\nGI,36.1833,-5.3667\r\nGL,72.0000,-40.0000\r\nGM,13.4667,-16.5667\r\nGN,11.0000,-10.0000\r\nGP,16.2500,-61.5833\r\nGQ,2.0000,10.0000\r\nGR,39.0000,22
.0000\r\nGS,-54.5000,-37.0000\r\nGT,15.5000,-90.2500\r\nGU,13.4667,144.7833\r\nGW,12.0000,-15.0000\r\nGY,5.0000,-59.0000\r\nHK,22.2500,114.1667\r\nHM,-53.1000,72.5167\r\nHN,15.0000,-86.5000\r\nHR,45.1667,15.5000\r\nHT,19.0000,-72.4167\r\nHU,47.0000,20.0000\r\nID,-5.0000,120.0000\r\nIE,53.0000,-8.0000\r\nIL,31.5000,34.7500\r\nIN,20.0000,77.0000\r\nIO,-6.0000,71.5000\r\nIQ,33.0000,44.0000\r\nIR,32.0000,53.0000\r\nIS,65.0000,-18.0000\r\nIT,42.8333,12.8333\r\nJM,18.2500,-77.5000\r\nJO,31.0000,36.0000\r\nJP,36.0000,138.0000\r\nKE,1.0000,38.0000\r\nKG,41.0000,75.0000\r\nKH,13.0000,105.0000\r\nKI,1.4167,173.0000\r\nKM,-12.1667,44.2500\r\nKN,17.3333,-62.7500\r\nKP,40.0000,127.0000\r\nKR,37.0000,127.5000\r\nKW,29.3375,47.6581\r\nKY,19.5000,-80.5000\r\nKZ,48.0000,68.0000\r\nLA,18.0000,105.0000\r\nLB,33.8333,35.8333\r\nLC,13.8833,-61.1333\r\nLI,47.1667,9.5333\r\nLK,7.0000,81.0000\r\nLR,6.5000,-9.5000\r\nLS,-29.5000,28.5000\r\nLT,56.0000,24.0000\r\nLU,49.7500,6.1667\r\nLV,57.0000,25.0000\r\nLY,25.0000,17.0000\r\nMA,32.0000,-5.0000\r\nMC,43.7333,7.4000\r\nMD,47.0000,29.0000\r\nME,42.0000,19.0000\r\nMG,-20.0000,47.0000\r\nMH,9.0000,168.0000\r\nMK,41.8333,22.0000\r\nML,17.0000,-4.0000\r\nMM,22.0000,98.0000\r\nMN,46.0000,105.0000\r\nMO,22.1667,113.5500\r\nMP,15.2000,145.7500\r\nMQ,14.6667,-61.0000\r\nMR,20.0000,-12.0000\r\nMS,16.7500,-62.2000\r\nMT,35.8333,14.5833\r\nMU,-20.2833,57.5500\r\nMV,3.2500,73.0000\r\nMW,-13.5000,34.0000\r\nMX,23.0000,-102.0000\r\nMY,2.5000,112.5000\r\nMZ,-18.2500,35.0000\r\nNA,-22.0000,17.0000\r\nNC,-21.5000,165.5000\r\nNE,16.0000,8.0000\r\nNF,-29.0333,167.9500\r\nNG,10.0000,8.0000\r\nNI,13.0000,-85.0000\r\nNL,52.5000,5.7500\r\nNO,62.0000,10.0000\r\nNP,28.0000,84.0000\r\nNR,-0.5333,166.9167\r\nNU,-19.0333,-169.8667\r\nNZ,-41.0000,174.0000\r\nOM,21.0000,57.0000\r\nPA,9.0000,-80.0000\r\nPE,-10.0000,-76.0000\r\nPF,-15.0000,-140.0000\r\nPG,-6.0000,147.0000\r\nPH,13.0000,122.0000\r\nPK,30.0000,70.0000\r\nPL,52.0000,20.0000\r\nPM,46.8333,-56.3333\r\nPR,18.2500,
-66.5000\r\nPS,32.0000,35.2500\r\nPT,39.5000,-8.0000\r\nPW,7.5000,134.5000\r\nPY,-23.0000,-58.0000\r\nQA,25.5000,51.2500\r\nRE,-21.1000,55.6000\r\nRO,46.0000,25.0000\r\nRS,44.0000,21.0000\r\nRU,60.0000,100.0000\r\nRW,-2.0000,30.0000\r\nSA,25.0000,45.0000\r\nSB,-8.0000,159.0000\r\nSC,-4.5833,55.6667\r\nSD,15.0000,30.0000\r\nSE,62.0000,15.0000\r\nSG,1.3667,103.8000\r\nSH,-15.9333,-5.7000\r\nSI,46.0000,15.0000\r\nSJ,78.0000,20.0000\r\nSK,48.6667,19.5000\r\nSL,8.5000,-11.5000\r\nSM,43.7667,12.4167\r\nSN,14.0000,-14.0000\r\nSO,10.0000,49.0000\r\nSR,4.0000,-56.0000\r\nST,1.0000,7.0000\r\nSV,13.8333,-88.9167\r\nSY,35.0000,38.0000\r\nSZ,-26.5000,31.5000\r\nTC,21.7500,-71.5833\r\nTD,15.0000,19.0000\r\nTF,-43.0000,67.0000\r\nTG,8.0000,1.1667\r\nTH,15.0000,100.0000\r\nTJ,39.0000,71.0000\r\nTK,-9.0000,-172.0000\r\nTM,40.0000,60.0000\r\nTN,34.0000,9.0000\r\nTO,-20.0000,-175.0000\r\nTR,39.0000,35.0000\r\nTT,11.0000,-61.0000\r\nTV,-8.0000,178.0000\r\nTW,23.5000,121.0000\r\nTZ,-6.0000,35.0000\r\nUA,49.0000,32.0000\r\nUG,1.0000,32.0000\r\nUM,19.2833,166.6000\r\nUS,38.0000,-97.0000\r\nUY,-33.0000,-56.0000\r\nUZ,41.0000,64.0000\r\nVA,41.9000,12.4500\r\nVC,13.2500,-61.2000\r\nVE,8.0000,-66.0000\r\nVG,18.5000,-64.5000\r\nVI,18.3333,-64.8333\r\nVN,16.0000,106.0000\r\nVU,-16.0000,167.0000\r\nWF,-13.3000,-176.2000\r\nWS,-13.5833,-172.3333\r\nYE,15.0000,48.0000\r\nYT,-12.8333,45.1667\r\nZA,-29.0000,24.0000\r\nZM,-15.0000,30.0000\r\nZW,-20.0000,30.0000\r\n\r\n"
#or equivalently:
root %>% getNodeSet("//pre") %>% sapply(xmlValue) #here need sapply since getNodeSet returns a set of nodes (even if the set is just one node)
## [1] "\r\n\"iso 3166 country\",\"latitude\",\"longitude\"\r\nAD,42.5000,1.5000\r\nAE,24.0000,54.0000\r\nAF,33.0000,65.0000\r\nAG,17.0500,-61.8000\r\nAI,18.2500,-63.1667\r\nAL,41.0000,20.0000\r\nAM,40.0000,45.0000\r\nAN,12.2500,-68.7500\r\nAO,-12.5000,18.5000\r\nAP,35.0000,105.0000\r\nAQ,-90.0000,0.0000\r\nAR,-34.0000,-64.0000\r\nAS,-14.3333,-170.0000\r\nAT,47.3333,13.3333\r\nAU,-27.0000,133.0000\r\nAW,12.5000,-69.9667\r\nAZ,40.5000,47.5000\r\nBA,44.0000,18.0000\r\nBB,13.1667,-59.5333\r\nBD,24.0000,90.0000\r\nBE,50.8333,4.0000\r\nBF,13.0000,-2.0000\r\nBG,43.0000,25.0000\r\nBH,26.0000,50.5500\r\nBI,-3.5000,30.0000\r\nBJ,9.5000,2.2500\r\nBM,32.3333,-64.7500\r\nBN,4.5000,114.6667\r\nBO,-17.0000,-65.0000\r\nBR,-10.0000,-55.0000\r\nBS,24.2500,-76.0000\r\nBT,27.5000,90.5000\r\nBV,-54.4333,3.4000\r\nBW,-22.0000,24.0000\r\nBY,53.0000,28.0000\r\nBZ,17.2500,-88.7500\r\nCA,60.0000,-95.0000\r\nCC,-12.5000,96.8333\r\nCD,0.0000,25.0000\r\nCF,7.0000,21.0000\r\nCG,-1.0000,15.0000\r\nCH,47.0000,8.0000\r\nCI,8.0000,-5.0000\r\nCK,-21.2333,-159.7667\r\nCL,-30.0000,-71.0000\r\nCM,6.0000,12.0000\r\nCN,35.0000,105.0000\r\nCO,4.0000,-72.0000\r\nCR,10.0000,-84.0000\r\nCU,21.5000,-80.0000\r\nCV,16.0000,-24.0000\r\nCX,-10.5000,105.6667\r\nCY,35.0000,33.0000\r\nCZ,49.7500,15.5000\r\nDE,51.0000,9.0000\r\nDJ,11.5000,43.0000\r\nDK,56.0000,10.0000\r\nDM,15.4167,-61.3333\r\nDO,19.0000,-70.6667\r\nDZ,28.0000,3.0000\r\nEC,-2.0000,-77.5000\r\nEE,59.0000,26.0000\r\nEG,27.0000,30.0000\r\nEH,24.5000,-13.0000\r\nER,15.0000,39.0000\r\nES,40.0000,-4.0000\r\nET,8.0000,38.0000\r\nEU,47.0000,8.0000\r\nFI,64.0000,26.0000\r\nFJ,-18.0000,175.0000\r\nFK,-51.7500,-59.0000\r\nFM,6.9167,158.2500\r\nFO,62.0000,-7.0000\r\nFR,46.0000,2.0000\r\nGA,-1.0000,11.7500\r\nGB,54.0000,-2.0000\r\nGD,12.1167,-61.6667\r\nGE,42.0000,43.5000\r\nGF,4.0000,-53.0000\r\nGH,8.0000,-2.0000\r\nGI,36.1833,-5.3667\r\nGL,72.0000,-40.0000\r\nGM,13.4667,-16.5667\r\nGN,11.0000,-10.0000\r\nGP,16.2500,-61.5833\r\nGQ,2.0000,10.0000\r\nGR,39.0000,22
.0000\r\nGS,-54.5000,-37.0000\r\nGT,15.5000,-90.2500\r\nGU,13.4667,144.7833\r\nGW,12.0000,-15.0000\r\nGY,5.0000,-59.0000\r\nHK,22.2500,114.1667\r\nHM,-53.1000,72.5167\r\nHN,15.0000,-86.5000\r\nHR,45.1667,15.5000\r\nHT,19.0000,-72.4167\r\nHU,47.0000,20.0000\r\nID,-5.0000,120.0000\r\nIE,53.0000,-8.0000\r\nIL,31.5000,34.7500\r\nIN,20.0000,77.0000\r\nIO,-6.0000,71.5000\r\nIQ,33.0000,44.0000\r\nIR,32.0000,53.0000\r\nIS,65.0000,-18.0000\r\nIT,42.8333,12.8333\r\nJM,18.2500,-77.5000\r\nJO,31.0000,36.0000\r\nJP,36.0000,138.0000\r\nKE,1.0000,38.0000\r\nKG,41.0000,75.0000\r\nKH,13.0000,105.0000\r\nKI,1.4167,173.0000\r\nKM,-12.1667,44.2500\r\nKN,17.3333,-62.7500\r\nKP,40.0000,127.0000\r\nKR,37.0000,127.5000\r\nKW,29.3375,47.6581\r\nKY,19.5000,-80.5000\r\nKZ,48.0000,68.0000\r\nLA,18.0000,105.0000\r\nLB,33.8333,35.8333\r\nLC,13.8833,-61.1333\r\nLI,47.1667,9.5333\r\nLK,7.0000,81.0000\r\nLR,6.5000,-9.5000\r\nLS,-29.5000,28.5000\r\nLT,56.0000,24.0000\r\nLU,49.7500,6.1667\r\nLV,57.0000,25.0000\r\nLY,25.0000,17.0000\r\nMA,32.0000,-5.0000\r\nMC,43.7333,7.4000\r\nMD,47.0000,29.0000\r\nME,42.0000,19.0000\r\nMG,-20.0000,47.0000\r\nMH,9.0000,168.0000\r\nMK,41.8333,22.0000\r\nML,17.0000,-4.0000\r\nMM,22.0000,98.0000\r\nMN,46.0000,105.0000\r\nMO,22.1667,113.5500\r\nMP,15.2000,145.7500\r\nMQ,14.6667,-61.0000\r\nMR,20.0000,-12.0000\r\nMS,16.7500,-62.2000\r\nMT,35.8333,14.5833\r\nMU,-20.2833,57.5500\r\nMV,3.2500,73.0000\r\nMW,-13.5000,34.0000\r\nMX,23.0000,-102.0000\r\nMY,2.5000,112.5000\r\nMZ,-18.2500,35.0000\r\nNA,-22.0000,17.0000\r\nNC,-21.5000,165.5000\r\nNE,16.0000,8.0000\r\nNF,-29.0333,167.9500\r\nNG,10.0000,8.0000\r\nNI,13.0000,-85.0000\r\nNL,52.5000,5.7500\r\nNO,62.0000,10.0000\r\nNP,28.0000,84.0000\r\nNR,-0.5333,166.9167\r\nNU,-19.0333,-169.8667\r\nNZ,-41.0000,174.0000\r\nOM,21.0000,57.0000\r\nPA,9.0000,-80.0000\r\nPE,-10.0000,-76.0000\r\nPF,-15.0000,-140.0000\r\nPG,-6.0000,147.0000\r\nPH,13.0000,122.0000\r\nPK,30.0000,70.0000\r\nPL,52.0000,20.0000\r\nPM,46.8333,-56.3333\r\nPR,18.2500,
-66.5000\r\nPS,32.0000,35.2500\r\nPT,39.5000,-8.0000\r\nPW,7.5000,134.5000\r\nPY,-23.0000,-58.0000\r\nQA,25.5000,51.2500\r\nRE,-21.1000,55.6000\r\nRO,46.0000,25.0000\r\nRS,44.0000,21.0000\r\nRU,60.0000,100.0000\r\nRW,-2.0000,30.0000\r\nSA,25.0000,45.0000\r\nSB,-8.0000,159.0000\r\nSC,-4.5833,55.6667\r\nSD,15.0000,30.0000\r\nSE,62.0000,15.0000\r\nSG,1.3667,103.8000\r\nSH,-15.9333,-5.7000\r\nSI,46.0000,15.0000\r\nSJ,78.0000,20.0000\r\nSK,48.6667,19.5000\r\nSL,8.5000,-11.5000\r\nSM,43.7667,12.4167\r\nSN,14.0000,-14.0000\r\nSO,10.0000,49.0000\r\nSR,4.0000,-56.0000\r\nST,1.0000,7.0000\r\nSV,13.8333,-88.9167\r\nSY,35.0000,38.0000\r\nSZ,-26.5000,31.5000\r\nTC,21.7500,-71.5833\r\nTD,15.0000,19.0000\r\nTF,-43.0000,67.0000\r\nTG,8.0000,1.1667\r\nTH,15.0000,100.0000\r\nTJ,39.0000,71.0000\r\nTK,-9.0000,-172.0000\r\nTM,40.0000,60.0000\r\nTN,34.0000,9.0000\r\nTO,-20.0000,-175.0000\r\nTR,39.0000,35.0000\r\nTT,11.0000,-61.0000\r\nTV,-8.0000,178.0000\r\nTW,23.5000,121.0000\r\nTZ,-6.0000,35.0000\r\nUA,49.0000,32.0000\r\nUG,1.0000,32.0000\r\nUM,19.2833,166.6000\r\nUS,38.0000,-97.0000\r\nUY,-33.0000,-56.0000\r\nUZ,41.0000,64.0000\r\nVA,41.9000,12.4500\r\nVC,13.2500,-61.2000\r\nVE,8.0000,-66.0000\r\nVG,18.5000,-64.5000\r\nVI,18.3333,-64.8333\r\nVN,16.0000,106.0000\r\nVU,-16.0000,167.0000\r\nWF,-13.3000,-176.2000\r\nWS,-13.5833,-172.3333\r\nYE,15.0000,48.0000\r\nYT,-12.8333,45.1667\r\nZA,-29.0000,24.0000\r\nZM,-15.0000,30.0000\r\nZW,-20.0000,30.0000\r\n\r\n"
The function read.table() reads values from a file or from text and returns a data frame (use parameters text, skip, header, and sep).
pre <- root %>% getNodeSet("//pre") %>% sapply(xmlValue)
read.table(text=pre, skip=1,header=TRUE,sep=",") %>% head()
iso.3166.country | latitude | longitude |
---|---|---|
AD | 42.50 | 1.5000 |
AE | 24.00 | 54.0000 |
AF | 33.00 | 65.0000 |
AG | 17.05 | -61.8000 |
AI | 18.25 | -63.1667 |
AL | 41.00 | 20.0000 |
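The same call can be tried offline on a small stand-in for the `<pre>` contents. The sketch below copies three rows from the output above. The `na.strings = ""` argument is an addition of mine rather than part of the lecture code: by default `read.table()` treats the string `NA` as a missing value, which would turn the country code for Namibia into a missing value.

```r
# Stand-in for the <pre> contents: a leading blank line, a quoted header,
# then one row per country (values copied from the output above)
pre_text <- paste0(
  "\r\n",
  "\"iso 3166 country\",\"latitude\",\"longitude\"\r\n",
  "AD,42.5000,1.5000\r\n",
  "NA,-22.0000,17.0000\r\n",
  "ZW,-20.0000,30.0000\r\n"
)

# skip = 1 drops the leading blank line; na.strings = "" keeps the code "NA" as text
latlon <- read.table(text = pre_text, skip = 1, header = TRUE, sep = ",",
                     na.strings = "", stringsAsFactors = FALSE)

print(latlon)
```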
Get the names of all the faculty members in the Statistics Department:
http://statistics.berkeley.edu/people/faculty
doc <- htmlParse('http://statistics.berkeley.edu/people/faculty')
root <- xmlRoot(doc)
name_nodes <- getNodeSet(doc, '//div[@class="views-field views-field-title"]')
sapply(name_nodes, xmlValue)
## [1] " Ani Adhikari " " David Aldous "
## [3] " Peter Bartlett " " Peter Bickel "
## [5] " David R Brillinger " " Ben Brown "
## [7] " Joan Bruna " " Ching-Shui Cheng "
## [9] " Peng Ding " " Kjell Doksum "
## [11] " Sandrine Dudoit " " Noureddine El Karoui "
## [13] " Steven Evans " " Will Fithian "
## [15] " Lisa Goldberg " " Leo Goodman "
## [17] " Adityanand Guntuboyina " " Alan Hammond "
## [19] " Haiyan Huang " " Fletcher Ibser "
## [21] " Nicholas P. Jewell " " Michael Jordan "
## [23] " Michael Klass " " Adam Lucas "
## [25] " Michael Mahoney " " Jon McAuliffe "
## [27] " Warry Millar " " Elchanan Mossel "
## [29] " Rasmus Nielsen " " Deborah Nolan "
## [31] " Christopher Paciorek " " Jim Pitman "
## [33] " Helmut Pitters " " Elizabeth Purdom "
## [35] " Roger Purves " " Nusrat Rabbee "
## [37] " Benjamin Recht " " John Rice "
## [39] " Gaston Sanchez " " Jasjeet Sekhon "
## [41] " Juliet Shaffer " " Alistair Sinclair "
## [43] " Allan Sly " " Yun S. Song "
## [45] " Terry Speed " " Philip B. Stark "
## [47] " Chuck Stone " " Shobhana Stoyanov "
## [49] " Bernd Sturmfels " " Nike Sun "
## [51] " Yuekai Sun " " Aram Thomasian "
## [53] " Mark van der Laan " " Ken Wachter "
## [55] " Martin Wainwright " " Bin Yu "
#or
names <- xpathSApply(doc, '//div[@class="views-field views-field-title"]', xmlValue)
names
## [1] " Ani Adhikari " " David Aldous "
## [3] " Peter Bartlett " " Peter Bickel "
## [5] " David R Brillinger " " Ben Brown "
## [7] " Joan Bruna " " Ching-Shui Cheng "
## [9] " Peng Ding " " Kjell Doksum "
## [11] " Sandrine Dudoit " " Noureddine El Karoui "
## [13] " Steven Evans " " Will Fithian "
## [15] " Lisa Goldberg " " Leo Goodman "
## [17] " Adityanand Guntuboyina " " Alan Hammond "
## [19] " Haiyan Huang " " Fletcher Ibser "
## [21] " Nicholas P. Jewell " " Michael Jordan "
## [23] " Michael Klass " " Adam Lucas "
## [25] " Michael Mahoney " " Jon McAuliffe "
## [27] " Warry Millar " " Elchanan Mossel "
## [29] " Rasmus Nielsen " " Deborah Nolan "
## [31] " Christopher Paciorek " " Jim Pitman "
## [33] " Helmut Pitters " " Elizabeth Purdom "
## [35] " Roger Purves " " Nusrat Rabbee "
## [37] " Benjamin Recht " " John Rice "
## [39] " Gaston Sanchez " " Jasjeet Sekhon "
## [41] " Juliet Shaffer " " Alistair Sinclair "
## [43] " Allan Sly " " Yun S. Song "
## [45] " Terry Speed " " Philip B. Stark "
## [47] " Chuck Stone " " Shobhana Stoyanov "
## [49] " Bernd Sturmfels " " Nike Sun "
## [51] " Yuekai Sun " " Aram Thomasian "
## [53] " Mark van der Laan " " Ken Wachter "
## [55] " Martin Wainwright " " Bin Yu "
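The extracted names carry leading and trailing spaces. One simple way to clean them, shown here on three values copied from the output above, is base R's `trimws()`; the variable names are only for illustration.

```r
# Three values copied from the output above, spaces included
raw_names <- c(" Ani Adhikari ", " David Aldous ", " Peter Bartlett ")

# trimws() removes whitespace at both ends of each string
clean_names <- trimws(raw_names)

print(clean_names)
```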
Get the links of all the faculty members in the Statistics Department. Hint: use xmlAttrs()
doc <- htmlParse("http://statistics.berkeley.edu/people/faculty")
root <- xmlRoot(doc)
links <- root %>% xpathSApply( '//div[@class="views-field views-field-title"]/h2/a',xmlAttrs)
links
## href href
## "/people/ani-adhikari" "/people/david-aldous"
## href href
## "/people/peter-bartlett" "/people/peter-bickel"
## href href
## "/people/david-brillinger" "/people/ben-brown"
## href href
## "/people/joan-bruna" "/people/ching-shui-cheng"
## href href
## "/people/peng-ding" "/people/kjell-doksum"
## href href
## "/people/sandrine-dudoit-0" "/people/noureddine-el-karoui"
## href href
## "/people/steven-evans" "/people/william-fithian"
## href href
## "/people/lisa-goldberg" "/people/leo-goodman"
## href href
## "/people/adityanand-guntuboyina" "/people/alan-hammond-0"
## href href
## "/people/haiyan-huang" "/people/fletcher-ibser"
## href href
## "/people/nick-jewell" "/people/michael-jordan"
## href href
## "/people/michael-klass" "/people/adam-lucas"
## href href
## "/people/michael-mahoney" "/people/jon-mcauliffe-0"
## href href
## "/people/warry-millar" "/people/elchanan-mossel"
## href href
## "/people/rasmus-nielsen" "/people/deborah-nolan"
## href href
## "/people/christopher-paciorek" "/people/jim-pitman"
## href href
## "/people/helmut-pitters" "/people/elizabeth-purdom"
## href href
## "/people/roger-purves" "/people/nusrat-rabbee"
## href href
## "/people/benjamin-recht" "/people/john-rice"
## href href
## "/people/gaston-sanchez" "/people/jasjeet-sekhon"
## href href
## "/people/juliet-shaffer" "/people/alistair-sinclair"
## href href
## "/people/allan-sly-0" "/people/yun-song"
## href href
## "/people/terry-speed" "/people/philip-stark"
## href href
## "/people/chuck-stone" "/people/shobhana-stoyanov"
## href href
## "/people/bernd-sturmfels" "/people/nike-sun"
## href href
## "/people/yuekai-sun" "/people/aram-thomasian"
## href href
## "/people/mark-van-der-laan" "/people/ken-wachter"
## href href
## "/people/martin-wainwright" "/people/bin-yu-0"
Make a data frame of each faculty member’s name and his/her link
df=data.frame(names, links)
df %>% head()
names | links |
---|---|
Ani Adhikari | /people/ani-adhikari |
David Aldous | /people/david-aldous |
Peter Bartlett | /people/peter-bartlett |
Peter Bickel | /people/peter-bickel |
David R Brillinger | /people/david-brillinger |
Ben Brown | /people/ben-brown |
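The links are relative to the department site, so a possible refinement is to prepend the host and trim the names in the same step. The sketch below runs on a made-up HTML fragment that imitates the structure implied by the XPath expressions above (a `div` with class `views-field views-field-title` containing `h2/a`); the live page may differ.

```r
library(XML)

# Made-up fragment imitating the structure assumed by the XPath above
page <- paste0(
  "<html><body>",
  '<div class="views-field views-field-title"><h2>',
  '<a href="/people/ani-adhikari"> Ani Adhikari </a></h2></div>',
  '<div class="views-field views-field-title"><h2>',
  '<a href="/people/david-aldous"> David Aldous </a></h2></div>',
  "</body></html>"
)
doc <- htmlParse(page, asText = TRUE)

anchor_path <- '//div[@class="views-field views-field-title"]/h2/a'

faculty <- data.frame(
  # link text, with surrounding spaces removed
  name = trimws(xpathSApply(doc, anchor_path, xmlValue)),
  # xmlGetAttr() returns just the href value; prepend the site host
  link = paste0("http://statistics.berkeley.edu",
                xpathSApply(doc, anchor_path, xmlGetAttr, "href")),
  stringsAsFactors = FALSE
)

print(faculty)
```

Using `xmlGetAttr()` with `"href"` gives a plain character vector, whereas `xmlAttrs()` returns all attributes with their names attached.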
Sometimes htmlParse() cannot read a page, for example because the HTML isn’t well formed or because the URL cannot be loaded directly. In this case you will get an error:
doc <- htmlParse("http://en.wikipedia.org/wiki/Mile_run_world_record_progression")
root <- xmlRoot(doc)
Error: failed to load external entity "http://en.wikipedia.org/wiki/Mile_run_world_record_progression"
You can either try to tidy your HTML or, if it is an HTML table, just use XPath and rvest (for html_nodes and html_table) as we did in lecture 25: http://rpubs.com/alucas/162044
library(rvest)
SetOfTables <-
"http://en.wikipedia.org/wiki/Mile_run_world_record_progression" %>%
read_html() %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/table') %>%
html_table(fill=TRUE)
head( SetOfTables[[1]] )
Time | Athlete | Nationality | Date | Venue |
---|---|---|---|---|
4:28 | Charles Westhall | United Kingdom | 26 July 1855 | London |
4:28 | Thomas Horspool | United Kingdom | 28 September 1857 | Manchester |
4:23 | Thomas Horspool | United Kingdom | 12 July 1858 | Manchester |
4:22¼ | Siah Albison | United Kingdom | 27 October 1860 | Manchester |
4:21¾ | William Lang | United Kingdom | 11 July 1863 | Manchester |
4:20½ | Edward Mills | United Kingdom | 23 April 1864 | Manchester |
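The same rvest pipeline can be tried without a network connection by handing `read_html()` a string. The fragment below is a made-up miniature of the page: a `div` with id `mw-content-text` containing one table, with two rows copied from the output above. The `fill` argument is left out because the toy table has no missing cells.

```r
library(rvest)

# Made-up miniature of the page: one table inside the content div
html <- paste0(
  '<div id="mw-content-text"><table>',
  "<tr><th>Time</th><th>Athlete</th></tr>",
  "<tr><td>4:28</td><td>Charles Westhall</td></tr>",
  "<tr><td>4:23</td><td>Thomas Horspool</td></tr>",
  "</table></div>"
)

# Same steps as above: parse, select tables by XPath, convert to data frames
tables <- read_html(html) %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/table') %>%
  html_table()

first_table <- tables[[1]]
print(first_table)
```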