Source file ⇒ lec33.Rmd

Today

  1. Crash course in HTML
  2. XPath (data scraping XML and HTML documents)

1. Crash course in HTML

Here is a good resource: http://www.w3schools.com/html/default.asp

HTML is a markup language for describing web documents (web pages).

HTML stands for HyperText Markup Language.
A markup language is a set of markup tags.

HTML documents are described by predefined HTML tags.

Each HTML tag describes different document content.

Well-formed HTML closely resembles XML (strictly speaking, only XHTML is an XML dialect), so the same parsing tools can handle both.

In a basic HTML document:

The DOCTYPE declaration defines the document type to be HTML.
The text between <html> and </html> describes an HTML document.
The text between <head> and </head> provides information about the document.
The text between <title> and </title> provides a title for the document.
The text between <body> and </body> describes the visible page content.
The text between <h1> and </h1> describes a heading.
The text between <p> and </p> describes a paragraph.
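The elements listed above can be checked from R. This is a minimal sketch, assuming the XML package is installed; the page text is made up for illustration and is passed as a string rather than read from a file or URL.

```r
library(XML)

# A made-up page containing the elements described above
page <- '<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>'

# asText = TRUE says that `page` holds HTML text, not a file name or URL
doc <- htmlParse(page, asText = TRUE)

# Pull out the text stored between each pair of tags
page_title <- xpathSApply(doc, "//head/title", xmlValue)
heading    <- xpathSApply(doc, "//body/h1", xmlValue)
paragraph  <- xpathSApply(doc, "//body/p", xmlValue)
```

Each result is a character vector with one entry per matching element.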

Well-formed HTML means:

  1. It has a DOCTYPE header. For example: <!DOCTYPE html>
  2. Every opening tag has a closing tag. For example: <div id="contentSub"> some content here </div>

HTML links are defined with the <a> tag:

<a href="http://www.w3schools.com">This is a link</a>

The link’s destination is specified in the href attribute.

Attributes are used to provide additional information about HTML elements.
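As a small sketch of how an attribute differs from an element's text (assuming the XML package is installed), the link above can be parsed from a string and its href read with xmlGetAttr():

```r
library(XML)

# The link from the example above, wrapped in a paragraph
snippet <- '<p><a href="http://www.w3schools.com">This is a link</a></p>'
doc <- htmlParse(snippet, asText = TRUE)

# xmlValue() returns the text between <a> and </a>
link_text <- xpathSApply(doc, "//a", xmlValue)

# xmlGetAttr() returns the value of one named attribute
link_href <- xpathSApply(doc, "//a", xmlGetAttr, "href")
```

The extra argument "href" is passed by xpathSApply() on to xmlGetAttr().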

<pre> displays any text content exactly as it appears in the source code. It is useful for displaying computer code or computer output, or for formatting a table a certain way.

<!DOCTYPE html>
<html>
<body>

<p>The pre tag preserves both spaces and line breaks:</p>

<pre>
   My Bonnie lies over the ocean.

   My Bonnie lies over the sea.

   My Bonnie lies over the ocean.
   
   Oh, bring back my Bonnie to me.
</pre>

</body>
</html>

<table style="width:100%">
  <tr>
    <td>Jill</td>
    <td>Smith</td> 
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td> 
    <td>94</td>
  </tr>
</table>

Tables are defined with the <table> tag.

Tables are divided into table rows with the <tr> tag.

Table rows are divided into table data with the <td> tag.

A table row can also be divided into table headings with the <th> tag.
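One way to turn the <tr>/<td> structure into a data frame is sketched below, assuming the XML package is installed. The column names first, last and score are hypothetical; the table itself has no <th> headings.

```r
library(XML)

# The two-row table from the example above, as a string
tbl_html <- '<table style="width:100%">
  <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
  <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>'
doc <- htmlParse(tbl_html, asText = TRUE)

# All <td> cells in document order, and the number of <tr> rows
cells  <- xpathSApply(doc, "//tr/td", xmlValue)
n_rows <- length(getNodeSet(doc, "//tr"))

# Fill a matrix row by row, then convert it to a data frame
people <- as.data.frame(matrix(cells, nrow = n_rows, byrow = TRUE),
                        stringsAsFactors = FALSE)
names(people) <- c("first", "last", "score")  # hypothetical column names
people$score  <- as.numeric(people$score)
```

This only works when every row has the same number of cells, which holds for this small table.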

Because well-formed HTML can be parsed much like XML, the XML tools in the next section work on HTML pages too.

2. Data Scraping XML and HTML with XPath

Summary of XML functions:

Function                          Description
xmlTreeParse() or xmlParse()      reads an XML file; returns an object of class XMLDocument or XMLInternalDocument
htmlTreeParse() or htmlParse()    reads an HTML file; returns an object of class XMLDocument or XMLInternalDocument
xmlRoot()                         gets access to the root node and its elements
xmlChildren()                     gets access to the child elements of a given node
xmlName()                         name of the node
xmlSize()                         number of child nodes
xmlAttrs()                        named character vector of all attributes
xmlGetAttr()                      value of a single attribute
xmlValue()                        text content of a node (for a non-leaf node, the text of all its descendants, concatenated)
xmlParent()                       parent node
xmlAncestors()                    ancestor nodes
getSibling()                      sibling to the right or to the left
xmlNamespace()                    the namespace (if there’s one)
xmlApply()                        lapply() applied to the children of a node
xmlSApply()                       sapply() applied to the children of a node

Examples:

<?xml version="1.0"?>
<movies>
    <movie mins="126" lang="eng">
        <title>Good Will Hunting</title> 
        <director>
            <first_name>Gus</first_name>
            <last_name>Van Sant</last_name>
        </director>
        <year>1998</year> 
        <genre>drama</genre>
    </movie>
    <movie mins="106" lang="spa"> 
        <title>Y tu mama tambien</title>
         <director>
            <first_name>Alfonso</first_name>
            <last_name>Cuaron</last_name> 
        </director>
        <year>2001</year>
        <genre>drama</genre>
    </movie>
</movies>

library(XML)  # provides xmlTreeParse(), xmlRoot(), getNodeSet(), ...
movies_xml <- xmlTreeParse("/Users/Adam/Desktop/stat133lectures_hw_lab/movies.xml", useInternalNodes = TRUE)
class(movies_xml)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
root = xmlRoot(movies_xml)
movie_child = xmlChildren(root)
movie_child
## $movie
## <movie mins="126" lang="eng">
##   <title>Good Will Hunting</title>
##   <director>
##     <first_name>Gus</first_name>
##     <last_name>Van Sant</last_name>
##   </director>
##   <year>1998</year>
##   <genre>drama</genre>
## </movie> 
## 
## $movie
## <movie mins="106" lang="spa">
##   <title>Y tu mama tambien</title>
##   <director>
##     <first_name>Alfonso</first_name>
##     <last_name>Cuaron</last_name>
##   </director>
##   <year>2001</year>
##   <genre>drama</genre>
## </movie> 
## 
## attr(,"class")
## [1] "XMLInternalNodeList" "XMLNodeList"
goodwill = movie_child[[1]] 
goodwill
## <movie mins="126" lang="eng">
##   <title>Good Will Hunting</title>
##   <director>
##     <first_name>Gus</first_name>
##     <last_name>Van Sant</last_name>
##   </director>
##   <year>1998</year>
##   <genre>drama</genre>
## </movie>
# node name
xmlName(goodwill)
## [1] "movie"
# number of children
xmlSize(goodwill)
## [1] 4
# node attributes
xmlAttrs(goodwill)
##  mins  lang 
## "126" "eng"
# get specific attribute value
xmlGetAttr(goodwill, name = 'lang') 
## [1] "eng"
# node content (as character string)
xmlValue(goodwill)
## [1] "Good Will HuntingGusVan Sant1998drama"
# child nodes of goodwill node
xmlChildren(goodwill)
## $title
## <title>Good Will Hunting</title> 
## 
## $director
## <director>
##   <first_name>Gus</first_name>
##   <last_name>Van Sant</last_name>
## </director> 
## 
## $year
## <year>1998</year> 
## 
## $genre
## <genre>drama</genre> 
## 
## attr(,"class")
## [1] "XMLInternalNodeList" "XMLNodeList"
# parent
xmlParent(goodwill)
## <movies>
##   <movie mins="126" lang="eng">
##     <title>Good Will Hunting</title>
##     <director>
##       <first_name>Gus</first_name>
##       <last_name>Van Sant</last_name>
##     </director>
##     <year>1998</year>
##     <genre>drama</genre>
##   </movie>
##   <movie mins="106" lang="spa">
##     <title>Y tu mama tambien</title>
##     <director>
##       <first_name>Alfonso</first_name>
##       <last_name>Cuaron</last_name>
##     </director>
##     <year>2001</year>
##     <genre>drama</genre>
##   </movie>
## </movies>
# sibling of goodwill node
getSibling(goodwill)
## <movie mins="106" lang="spa">
##   <title>Y tu mama tambien</title>
##   <director>
##     <first_name>Alfonso</first_name>
##     <last_name>Cuaron</last_name>
##   </director>
##   <year>2001</year>
##   <genre>drama</genre>
## </movie>
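xmlSApply() appears in the summary table but not in the examples above. A minimal sketch of its use follows, assuming the XML package is installed. The movies document is repeated as a string, so the block runs without the movies.xml file.

```r
library(XML)

# The movies document from above, as text
movies_txt <- '<?xml version="1.0"?>
<movies>
  <movie mins="126" lang="eng">
    <title>Good Will Hunting</title>
    <director><first_name>Gus</first_name><last_name>Van Sant</last_name></director>
    <year>1998</year>
    <genre>drama</genre>
  </movie>
  <movie mins="106" lang="spa">
    <title>Y tu mama tambien</title>
    <director><first_name>Alfonso</first_name><last_name>Cuaron</last_name></director>
    <year>2001</year>
    <genre>drama</genre>
  </movie>
</movies>'

doc  <- xmlParse(movies_txt, asText = TRUE)
root <- xmlRoot(doc)

# Apply xmlName() to each child of the first <movie> node
child_names <- xmlSApply(root[[1]], xmlName)

# Apply a function to each <movie> child of the root
titles <- xmlSApply(root, function(m) xmlValue(m[["title"]]))

# Extra arguments are passed on to the function, here xmlGetAttr(node, "lang")
langs <- xmlSApply(root, xmlGetAttr, "lang")
```

The results are named vectors, with names taken from the child nodes; unname() drops the names if they get in the way.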

To work with XPath expressions we use the function getNodeSet(), which takes an XPath expression and returns the matching node-set. Its main usage is:

getNodeSet(doc, path), where doc is an object of class XMLInternalDocument and path is a string giving the XPath expression to be evaluated. Its output is an XMLNodeSet.

Example:

getNodeSet(movies_xml, "//movie")
## [[1]]
## <movie mins="126" lang="eng">
##   <title>Good Will Hunting</title>
##   <director>
##     <first_name>Gus</first_name>
##     <last_name>Van Sant</last_name>
##   </director>
##   <year>1998</year>
##   <genre>drama</genre>
## </movie> 
## 
## [[2]]
## <movie mins="106" lang="spa">
##   <title>Y tu mama tambien</title>
##   <director>
##     <first_name>Alfonso</first_name>
##     <last_name>Cuaron</last_name>
##   </director>
##   <year>2001</year>
##   <genre>drama</genre>
## </movie> 
## 
## attr(,"class")
## [1] "XMLNodeSet"
getNodeSet(movies_xml, "//movie[@lang='eng']")
## [[1]]
## <movie mins="126" lang="eng">
##   <title>Good Will Hunting</title>
##   <director>
##     <first_name>Gus</first_name>
##     <last_name>Van Sant</last_name>
##   </director>
##   <year>1998</year>
##   <genre>drama</genre>
## </movie> 
## 
## attr(,"class")
## [1] "XMLNodeSet"
getNodeSet(movies_xml, "//year")
## [[1]]
## <year>1998</year> 
## 
## [[2]]
## <year>2001</year> 
## 
## attr(,"class")
## [1] "XMLNodeSet"
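XPath predicates can also test child elements and positions, not only attributes. The sketch below assumes the XML package is installed and repeats the movies document as a string so that it runs on its own.

```r
library(XML)

movies_txt <- '<?xml version="1.0"?>
<movies>
  <movie mins="126" lang="eng">
    <title>Good Will Hunting</title>
    <year>1998</year>
    <genre>drama</genre>
  </movie>
  <movie mins="106" lang="spa">
    <title>Y tu mama tambien</title>
    <year>2001</year>
    <genre>drama</genre>
  </movie>
</movies>'
doc <- xmlParse(movies_txt, asText = TRUE)

# Predicate on the text of a child element
dramas <- getNodeSet(doc, "//movie[genre='drama']")
n_dramas <- length(dramas)

# Numeric comparison on a child element, then step down to <title>
recent_titles <- sapply(getNodeSet(doc, "//movie[year > 2000]/title"), xmlValue)

# Positional predicate: the first <movie> under <movies>
first_movie <- getNodeSet(doc, "/movies/movie[1]")[[1]]
first_lang  <- xmlGetAttr(first_movie, "lang")
```

A node-set behaves like a list, so length(), [[ ]] and sapply() all work on it.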

One can also use the functions xpathApply() or xpathSApply() to find matching nodes in an internal XML tree.

Syntax:

xpathSApply(doc, path, fun, ... )

The output is a vector when the results can be simplified (or, in the case of xpathApply(), a list).

xpathSApply(movies_xml, "//year", xmlValue)
## [1] "1998" "2001"
xpathSApply(movies_xml, "//movie[@lang='eng']/year", xmlValue)
## [1] "1998"
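Several xpathSApply() calls can be combined to build a data frame with one row per movie. This is a sketch, assuming the XML package is installed; the column names are chosen here for illustration.

```r
library(XML)

movies_txt <- '<?xml version="1.0"?>
<movies>
  <movie mins="126" lang="eng">
    <title>Good Will Hunting</title>
    <year>1998</year>
  </movie>
  <movie mins="106" lang="spa">
    <title>Y tu mama tambien</title>
    <year>2001</year>
  </movie>
</movies>'
doc <- xmlParse(movies_txt, asText = TRUE)

# Each call returns one value per <movie>, in document order
movies <- data.frame(
  title = xpathSApply(doc, "//movie/title", xmlValue),
  year  = as.integer(xpathSApply(doc, "//movie/year", xmlValue)),
  mins  = as.integer(xpathSApply(doc, "//movie", xmlGetAttr, "mins")),
  stringsAsFactors = FALSE
)
```

xmlValue() and xmlGetAttr() return character strings, so numeric columns need an explicit conversion. The columns line up only if every movie has exactly one title and one year.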

Data Scraping HTML

Locations for countries are available from many places on the Web. MaxMind provides these data within an HTML page available at:

http://dev.maxmind.com/geoip/legacy/codes/country_latlon/

Since well-formed HTML can be parsed much like XML, we can extract latitude and longitude from this HTML file using the above tools.

Examine the HTML source, and notice that the data are simply placed as plain text within a <pre> node in the document. If we can extract the contents of this <pre> node, then we can place this information in a data frame.

Step 1: Begin by parsing the document with htmlParse() and access the root of the document using xmlRoot().

htmlParse() is a wrapper for htmlTreeParse(), so you don’t have to write the argument useInternalNodes = TRUE.
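The relationship between the two parsers can be checked on a tiny made-up page. This is a sketch, assuming the XML package is installed.

```r
library(XML)

# A made-up page with a <pre> block, passed as text
html_txt <- "<html><body><pre>a,b\n1,2</pre></body></html>"

doc_a <- htmlParse(html_txt, asText = TRUE)
doc_b <- htmlTreeParse(html_txt, asText = TRUE, useInternalNodes = TRUE)
doc_c <- htmlTreeParse(html_txt, asText = TRUE)  # default: no internal nodes

# Which results are internal documents?
uses_internal <- c(
  a = inherits(doc_a, "XMLInternalDocument"),
  b = inherits(doc_b, "XMLInternalDocument"),
  c = inherits(doc_c, "XMLInternalDocument")
)

# XPath queries need an internal document, such as doc_a or doc_b
pre_text <- xpathSApply(doc_a, "//pre", xmlValue)
```

Here doc_a and doc_b are internal documents and doc_c is not, which is why htmlParse() is the convenient choice before using getNodeSet() or xpathSApply().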

doc <- htmlParse("http://dev.maxmind.com/geoip/legacy/codes/country_latlon/")
root <- xmlRoot(doc)

Step 2: Use an XPath expression to locate the <pre> node in the document.

library(magrittr)  # provides the %>% pipe; not needed if dplyr is already loaded
root %>% xpathSApply("//pre", xmlValue)
## [1] "\r\n\"iso 3166 country\",\"latitude\",\"longitude\"\r\nAD,42.5000,1.5000\r\nAE,24.0000,54.0000\r\nAF,33.0000,65.0000\r\nAG,17.0500,-61.8000\r\nAI,18.2500,-63.1667\r\nAL,41.0000,20.0000\r\nAM,40.0000,45.0000\r\nAN,12.2500,-68.7500\r\nAO,-12.5000,18.5000\r\nAP,35.0000,105.0000\r\nAQ,-90.0000,0.0000\r\nAR,-34.0000,-64.0000\r\nAS,-14.3333,-170.0000\r\nAT,47.3333,13.3333\r\nAU,-27.0000,133.0000\r\nAW,12.5000,-69.9667\r\nAZ,40.5000,47.5000\r\nBA,44.0000,18.0000\r\nBB,13.1667,-59.5333\r\nBD,24.0000,90.0000\r\nBE,50.8333,4.0000\r\nBF,13.0000,-2.0000\r\nBG,43.0000,25.0000\r\nBH,26.0000,50.5500\r\nBI,-3.5000,30.0000\r\nBJ,9.5000,2.2500\r\nBM,32.3333,-64.7500\r\nBN,4.5000,114.6667\r\nBO,-17.0000,-65.0000\r\nBR,-10.0000,-55.0000\r\nBS,24.2500,-76.0000\r\nBT,27.5000,90.5000\r\nBV,-54.4333,3.4000\r\nBW,-22.0000,24.0000\r\nBY,53.0000,28.0000\r\nBZ,17.2500,-88.7500\r\nCA,60.0000,-95.0000\r\nCC,-12.5000,96.8333\r\nCD,0.0000,25.0000\r\nCF,7.0000,21.0000\r\nCG,-1.0000,15.0000\r\nCH,47.0000,8.0000\r\nCI,8.0000,-5.0000\r\nCK,-21.2333,-159.7667\r\nCL,-30.0000,-71.0000\r\nCM,6.0000,12.0000\r\nCN,35.0000,105.0000\r\nCO,4.0000,-72.0000\r\nCR,10.0000,-84.0000\r\nCU,21.5000,-80.0000\r\nCV,16.0000,-24.0000\r\nCX,-10.5000,105.6667\r\nCY,35.0000,33.0000\r\nCZ,49.7500,15.5000\r\nDE,51.0000,9.0000\r\nDJ,11.5000,43.0000\r\nDK,56.0000,10.0000\r\nDM,15.4167,-61.3333\r\nDO,19.0000,-70.6667\r\nDZ,28.0000,3.0000\r\nEC,-2.0000,-77.5000\r\nEE,59.0000,26.0000\r\nEG,27.0000,30.0000\r\nEH,24.5000,-13.0000\r\nER,15.0000,39.0000\r\nES,40.0000,-4.0000\r\nET,8.0000,38.0000\r\nEU,47.0000,8.0000\r\nFI,64.0000,26.0000\r\nFJ,-18.0000,175.0000\r\nFK,-51.7500,-59.0000\r\nFM,6.9167,158.2500\r\nFO,62.0000,-7.0000\r\nFR,46.0000,2.0000\r\nGA,-1.0000,11.7500\r\nGB,54.0000,-2.0000\r\nGD,12.1167,-61.6667\r\nGE,42.0000,43.5000\r\nGF,4.0000,-53.0000\r\nGH,8.0000,-2.0000\r\nGI,36.1833,-5.3667\r\nGL,72.0000,-40.0000\r\nGM,13.4667,-16.5667\r\nGN,11.0000,-10.0000\r\nGP,16.2500,-61.5833\r\nGQ,2.0000,10.0000\r\nGR,39.0000,22
.0000\r\nGS,-54.5000,-37.0000\r\nGT,15.5000,-90.2500\r\nGU,13.4667,144.7833\r\nGW,12.0000,-15.0000\r\nGY,5.0000,-59.0000\r\nHK,22.2500,114.1667\r\nHM,-53.1000,72.5167\r\nHN,15.0000,-86.5000\r\nHR,45.1667,15.5000\r\nHT,19.0000,-72.4167\r\nHU,47.0000,20.0000\r\nID,-5.0000,120.0000\r\nIE,53.0000,-8.0000\r\nIL,31.5000,34.7500\r\nIN,20.0000,77.0000\r\nIO,-6.0000,71.5000\r\nIQ,33.0000,44.0000\r\nIR,32.0000,53.0000\r\nIS,65.0000,-18.0000\r\nIT,42.8333,12.8333\r\nJM,18.2500,-77.5000\r\nJO,31.0000,36.0000\r\nJP,36.0000,138.0000\r\nKE,1.0000,38.0000\r\nKG,41.0000,75.0000\r\nKH,13.0000,105.0000\r\nKI,1.4167,173.0000\r\nKM,-12.1667,44.2500\r\nKN,17.3333,-62.7500\r\nKP,40.0000,127.0000\r\nKR,37.0000,127.5000\r\nKW,29.3375,47.6581\r\nKY,19.5000,-80.5000\r\nKZ,48.0000,68.0000\r\nLA,18.0000,105.0000\r\nLB,33.8333,35.8333\r\nLC,13.8833,-61.1333\r\nLI,47.1667,9.5333\r\nLK,7.0000,81.0000\r\nLR,6.5000,-9.5000\r\nLS,-29.5000,28.5000\r\nLT,56.0000,24.0000\r\nLU,49.7500,6.1667\r\nLV,57.0000,25.0000\r\nLY,25.0000,17.0000\r\nMA,32.0000,-5.0000\r\nMC,43.7333,7.4000\r\nMD,47.0000,29.0000\r\nME,42.0000,19.0000\r\nMG,-20.0000,47.0000\r\nMH,9.0000,168.0000\r\nMK,41.8333,22.0000\r\nML,17.0000,-4.0000\r\nMM,22.0000,98.0000\r\nMN,46.0000,105.0000\r\nMO,22.1667,113.5500\r\nMP,15.2000,145.7500\r\nMQ,14.6667,-61.0000\r\nMR,20.0000,-12.0000\r\nMS,16.7500,-62.2000\r\nMT,35.8333,14.5833\r\nMU,-20.2833,57.5500\r\nMV,3.2500,73.0000\r\nMW,-13.5000,34.0000\r\nMX,23.0000,-102.0000\r\nMY,2.5000,112.5000\r\nMZ,-18.2500,35.0000\r\nNA,-22.0000,17.0000\r\nNC,-21.5000,165.5000\r\nNE,16.0000,8.0000\r\nNF,-29.0333,167.9500\r\nNG,10.0000,8.0000\r\nNI,13.0000,-85.0000\r\nNL,52.5000,5.7500\r\nNO,62.0000,10.0000\r\nNP,28.0000,84.0000\r\nNR,-0.5333,166.9167\r\nNU,-19.0333,-169.8667\r\nNZ,-41.0000,174.0000\r\nOM,21.0000,57.0000\r\nPA,9.0000,-80.0000\r\nPE,-10.0000,-76.0000\r\nPF,-15.0000,-140.0000\r\nPG,-6.0000,147.0000\r\nPH,13.0000,122.0000\r\nPK,30.0000,70.0000\r\nPL,52.0000,20.0000\r\nPM,46.8333,-56.3333\r\nPR,18.2500,
-66.5000\r\nPS,32.0000,35.2500\r\nPT,39.5000,-8.0000\r\nPW,7.5000,134.5000\r\nPY,-23.0000,-58.0000\r\nQA,25.5000,51.2500\r\nRE,-21.1000,55.6000\r\nRO,46.0000,25.0000\r\nRS,44.0000,21.0000\r\nRU,60.0000,100.0000\r\nRW,-2.0000,30.0000\r\nSA,25.0000,45.0000\r\nSB,-8.0000,159.0000\r\nSC,-4.5833,55.6667\r\nSD,15.0000,30.0000\r\nSE,62.0000,15.0000\r\nSG,1.3667,103.8000\r\nSH,-15.9333,-5.7000\r\nSI,46.0000,15.0000\r\nSJ,78.0000,20.0000\r\nSK,48.6667,19.5000\r\nSL,8.5000,-11.5000\r\nSM,43.7667,12.4167\r\nSN,14.0000,-14.0000\r\nSO,10.0000,49.0000\r\nSR,4.0000,-56.0000\r\nST,1.0000,7.0000\r\nSV,13.8333,-88.9167\r\nSY,35.0000,38.0000\r\nSZ,-26.5000,31.5000\r\nTC,21.7500,-71.5833\r\nTD,15.0000,19.0000\r\nTF,-43.0000,67.0000\r\nTG,8.0000,1.1667\r\nTH,15.0000,100.0000\r\nTJ,39.0000,71.0000\r\nTK,-9.0000,-172.0000\r\nTM,40.0000,60.0000\r\nTN,34.0000,9.0000\r\nTO,-20.0000,-175.0000\r\nTR,39.0000,35.0000\r\nTT,11.0000,-61.0000\r\nTV,-8.0000,178.0000\r\nTW,23.5000,121.0000\r\nTZ,-6.0000,35.0000\r\nUA,49.0000,32.0000\r\nUG,1.0000,32.0000\r\nUM,19.2833,166.6000\r\nUS,38.0000,-97.0000\r\nUY,-33.0000,-56.0000\r\nUZ,41.0000,64.0000\r\nVA,41.9000,12.4500\r\nVC,13.2500,-61.2000\r\nVE,8.0000,-66.0000\r\nVG,18.5000,-64.5000\r\nVI,18.3333,-64.8333\r\nVN,16.0000,106.0000\r\nVU,-16.0000,167.0000\r\nWF,-13.3000,-176.2000\r\nWS,-13.5833,-172.3333\r\nYE,15.0000,48.0000\r\nYT,-12.8333,45.1667\r\nZA,-29.0000,24.0000\r\nZM,-15.0000,30.0000\r\nZW,-20.0000,30.0000\r\n\r\n"
# or equivalently:
root %>% getNodeSet("//pre") %>% sapply(xmlValue)  # sapply() is needed because getNodeSet() returns a set of nodes (even if the set is just one node)

Step 3: Read the plain text in your character vector into a data frame.

The function read.table() reads values of a file or text and returns a data frame (use parameters text, skip, header, and sep).
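One pitfall with these parameters is the country code NA (Namibia), which read.table() treats as a missing value by default. The sketch below uses a short made-up stand-in for the <pre> text, with plain "\n" line endings. Setting na.strings = "" is a suggested workaround, not part of the original analysis.

```r
# Simplified stand-in for the <pre> text: a leading blank line,
# a quoted header, and three rows (NA is the code for Namibia)
pre_sample <- paste(
  "",
  '"iso 3166 country","latitude","longitude"',
  "AD,42.5000,1.5000",
  "NA,-22.0000,17.0000",
  "NC,-21.5000,165.5000",
  sep = "\n"
)

# Default settings: the string "NA" is read as a missing value
default_read <- read.table(text = pre_sample, skip = 1, header = TRUE, sep = ",")

# na.strings = "" keeps "NA" as an ordinary country code
safe_read <- read.table(text = pre_sample, skip = 1, header = TRUE, sep = ",",
                        na.strings = "", stringsAsFactors = FALSE)
```

skip = 1 drops the empty first line, header = TRUE takes the column names from the quoted header (spaces become dots), and sep = "," splits the fields.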

pre <- root %>% getNodeSet("//pre") %>% sapply(xmlValue)
read.table(text=pre, skip=1,header=TRUE,sep=",") %>% head()
iso.3166.country  latitude  longitude
AD                   42.50     1.5000
AE                   24.00    54.0000
AF                   33.00    65.0000
AG                   17.05   -61.8000
AI                   18.25   -63.1667
AL                   41.00    20.0000

Task for you:

Get the names of all the faculty members in the Statistics Department:

http://statistics.berkeley.edu/people/faculty

doc <- htmlParse('http://statistics.berkeley.edu/people/faculty')
root <- xmlRoot(doc)

name_nodes <- getNodeSet(doc, '//div[@class="views-field views-field-title"]')
sapply(name_nodes, xmlValue)   
##  [1] "        Ani Adhikari  "           "        David Aldous  "          
##  [3] "        Peter Bartlett  "         "        Peter Bickel  "          
##  [5] "        David R Brillinger  "     "        Ben Brown  "             
##  [7] "        Joan Bruna  "             "        Ching-Shui Cheng  "      
##  [9] "        Peng Ding  "              "        Kjell Doksum  "          
## [11] "        Sandrine Dudoit  "        "        Noureddine El Karoui  "  
## [13] "        Steven Evans  "           "        Will Fithian  "          
## [15] "        Lisa Goldberg  "          "        Leo Goodman  "           
## [17] "        Adityanand Guntuboyina  " "        Alan Hammond  "          
## [19] "        Haiyan Huang  "           "        Fletcher Ibser  "        
## [21] "        Nicholas P. Jewell  "     "        Michael Jordan  "        
## [23] "        Michael Klass  "          "        Adam Lucas  "            
## [25] "        Michael Mahoney  "        "        Jon McAuliffe  "         
## [27] "        Warry Millar  "           "        Elchanan Mossel  "       
## [29] "        Rasmus Nielsen  "         "        Deborah Nolan  "         
## [31] "        Christopher Paciorek  "   "        Jim Pitman  "            
## [33] "        Helmut Pitters  "         "        Elizabeth Purdom  "      
## [35] "        Roger Purves  "           "        Nusrat Rabbee  "         
## [37] "        Benjamin Recht  "         "        John Rice  "             
## [39] "        Gaston Sanchez  "         "        Jasjeet Sekhon  "        
## [41] "        Juliet Shaffer  "         "        Alistair Sinclair  "     
## [43] "        Allan Sly  "              "        Yun S. Song  "           
## [45] "        Terry Speed  "            "        Philip B. Stark  "       
## [47] "        Chuck Stone  "            "        Shobhana Stoyanov  "     
## [49] "        Bernd Sturmfels  "        "        Nike Sun  "              
## [51] "        Yuekai Sun  "             "        Aram Thomasian  "        
## [53] "        Mark van der Laan  "      "        Ken Wachter  "           
## [55] "        Martin Wainwright  "      "        Bin Yu  "
#or
names <- xpathSApply(doc, '//div[@class="views-field views-field-title"]', xmlValue)
names
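The scraped names carry leading and trailing spaces. A minimal sketch of cleaning them with base R's trimws(), using three values copied from the output above:

```r
# Three names as they come back from xmlValue(), padding included
raw_names <- c("        Ani Adhikari  ",
               "        David Aldous  ",
               "        Peter Bartlett  ")

# trimws() strips whitespace from both ends of each string
clean_names <- trimws(raw_names)
```

Spaces inside a name are left alone, so "Ani Adhikari" keeps its single space.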

Task for you:

Get the links of all the faculty members in the Statistics Department. Hint: use xmlAttrs()

doc <-  htmlParse("http://statistics.berkeley.edu/people/faculty")
root <- xmlRoot(doc)
links <- root %>% xpathSApply( '//div[@class="views-field views-field-title"]/h2/a',xmlAttrs)
links
##                             href                             href 
##           "/people/ani-adhikari"           "/people/david-aldous" 
##                             href                             href 
##         "/people/peter-bartlett"           "/people/peter-bickel" 
##                             href                             href 
##       "/people/david-brillinger"              "/people/ben-brown" 
##                             href                             href 
##             "/people/joan-bruna"       "/people/ching-shui-cheng" 
##                             href                             href 
##              "/people/peng-ding"           "/people/kjell-doksum" 
##                             href                             href 
##      "/people/sandrine-dudoit-0"   "/people/noureddine-el-karoui" 
##                             href                             href 
##           "/people/steven-evans"        "/people/william-fithian" 
##                             href                             href 
##          "/people/lisa-goldberg"            "/people/leo-goodman" 
##                             href                             href 
## "/people/adityanand-guntuboyina"         "/people/alan-hammond-0" 
##                             href                             href 
##           "/people/haiyan-huang"         "/people/fletcher-ibser" 
##                             href                             href 
##            "/people/nick-jewell"         "/people/michael-jordan" 
##                             href                             href 
##          "/people/michael-klass"             "/people/adam-lucas" 
##                             href                             href 
##        "/people/michael-mahoney"        "/people/jon-mcauliffe-0" 
##                             href                             href 
##           "/people/warry-millar"        "/people/elchanan-mossel" 
##                             href                             href 
##         "/people/rasmus-nielsen"          "/people/deborah-nolan" 
##                             href                             href 
##   "/people/christopher-paciorek"             "/people/jim-pitman" 
##                             href                             href 
##         "/people/helmut-pitters"       "/people/elizabeth-purdom" 
##                             href                             href 
##           "/people/roger-purves"          "/people/nusrat-rabbee" 
##                             href                             href 
##         "/people/benjamin-recht"              "/people/john-rice" 
##                             href                             href 
##         "/people/gaston-sanchez"         "/people/jasjeet-sekhon" 
##                             href                             href 
##         "/people/juliet-shaffer"      "/people/alistair-sinclair" 
##                             href                             href 
##            "/people/allan-sly-0"               "/people/yun-song" 
##                             href                             href 
##            "/people/terry-speed"           "/people/philip-stark" 
##                             href                             href 
##            "/people/chuck-stone"      "/people/shobhana-stoyanov" 
##                             href                             href 
##        "/people/bernd-sturmfels"               "/people/nike-sun" 
##                             href                             href 
##             "/people/yuekai-sun"         "/people/aram-thomasian" 
##                             href                             href 
##      "/people/mark-van-der-laan"            "/people/ken-wachter" 
##                             href                             href 
##      "/people/martin-wainwright"               "/people/bin-yu-0"
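The links returned above are relative paths. The sketch below turns them into full URLs, assuming the XML package is installed. The HTML snippet is a hypothetical reconstruction of the page structure implied by the XPath expression, not a copy of the real page.

```r
library(XML)

# Hypothetical fragment shaped like the faculty listing
snippet <- '<div class="views-field views-field-title"><h2><a href="/people/ani-adhikari">Ani Adhikari</a></h2></div>
<div class="views-field views-field-title"><h2><a href="/people/david-aldous">David Aldous</a></h2></div>'
doc <- htmlParse(snippet, asText = TRUE)

path <- '//div[@class="views-field views-field-title"]/h2/a'

# xmlGetAttr() returns just the href value, without the "href" names
links <- xpathSApply(doc, path, xmlGetAttr, "href")

# Prepend the site address to make the links absolute
full_links <- paste0("http://statistics.berkeley.edu", links)
```

Using xmlGetAttr() instead of xmlAttrs() avoids the repeated href labels seen in the output above.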

Task for you:

Make a data frame of each faculty member’s name and link.

df=data.frame(names, links)
df %>% head()
names               links
Ani Adhikari        /people/ani-adhikari
David Aldous        /people/david-aldous
Peter Bartlett      /people/peter-bartlett
Peter Bickel        /people/peter-bickel
David R Brillinger  /people/david-brillinger
Ben Brown           /people/ben-brown
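A slightly tidier version of the same data frame is sketched below, using two rows of values copied from the outputs above. The column names name and link are chosen here for illustration.

```r
# Values as returned by the scraping steps: padded names, href-labelled links
raw_names <- c("        Ani Adhikari  ", "        David Aldous  ")
raw_links <- c(href = "/people/ani-adhikari", href = "/people/david-aldous")

faculty <- data.frame(
  name = trimws(raw_names),   # strip the surrounding spaces
  link = unname(raw_links),   # drop the repeated "href" labels
  stringsAsFactors = FALSE    # keep both columns as character
)
```

Keeping the columns as character makes later steps, such as pasting a site address in front of each link, straightforward.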

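Dealing with HTML that is not well formed (common)

Sometimes htmlParse() fails on a page. Badly formed HTML is one possible cause; another is that the page cannot be retrieved at all (for example, when the site redirects to HTTPS, which htmlParse() may not be able to fetch). In this case you will get an error: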

doc <- htmlParse("http://en.wikipedia.org/wiki/Mile_run_world_record_progression")
root <- xmlRoot(doc)

Error: failed to load external entity "http://en.wikipedia.org/wiki/Mile_run_world_record_progression".
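A small experiment, assuming the XML package is installed, suggests that htmlParse() itself tolerates unclosed tags. An error like the one above may therefore come from retrieving the page rather than from the markup. The messy string below is made up.

```r
library(XML)

# Made-up HTML with no closing </p>, </b> or </html> tags
messy <- "<html><body><p>First paragraph<p>Second <b>bold text</body>"

# The parser repairs the structure instead of stopping
doc <- htmlParse(messy, asText = TRUE)

paragraphs <- xpathSApply(doc, "//p", xmlValue)
bold       <- xpathSApply(doc, "//b", xmlValue)
```

The parser closes the first <p> when it meets the second one, so two paragraphs are found.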

You can either try to tidy your HTML or, if the data are in an HTML table, use XPath with rvest (html_nodes() and html_table()), as we did in lecture 25: http://rpubs.com/alucas/162044

library(rvest)
SetOfTables <- 
  "http://en.wikipedia.org/wiki/Mile_run_world_record_progression" %>%
  read_html() %>% 
  html_nodes(xpath = '//*[@id="mw-content-text"]/table') %>%
  html_table(fill=TRUE)
head( SetOfTables[[1]] )
Time   Athlete           Nationality     Date               Venue
4:28   Charles Westhall  United Kingdom  26 July 1855       London
4:28   Thomas Horspool   United Kingdom  28 September 1857  Manchester
4:23   Thomas Horspool   United Kingdom  12 July 1858       Manchester
4:22¼  Siah Albison      United Kingdom  27 October 1860    Manchester
4:21¾  William Lang      United Kingdom  11 July 1863       Manchester
4:20½  Edward Mills      United Kingdom  23 April 1864      Manchester
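The same rvest pipeline can be tried offline on a small HTML string. This is a sketch, assuming the rvest package is installed. The fragment is a made-up, two-column excerpt shaped like the table above, not the real Wikipedia markup.

```r
library(rvest)

# Made-up fragment: a table directly inside the element with id "mw-content-text"
page <- read_html('<div id="mw-content-text"><table>
  <tr><th>Time</th><th>Athlete</th></tr>
  <tr><td>4:28</td><td>Charles Westhall</td></tr>
  <tr><td>4:23</td><td>Thomas Horspool</td></tr>
</table></div>')

# Same steps as above: select the table nodes by XPath, then convert them
tables <- page %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/table') %>%
  html_table()

records <- tables[[1]]
```

html_table() returns one table per matched node, so the result is a list even when only one table matches. The <th> row supplies the column names.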