Source file ⇒ 2017-lec24.Rmd
Here is a good resource: http://www.w3schools.com/html/default.asp
HTML is a markup language for describing web documents (web pages).
HTML stands for Hyper Text Markup Language
A markup language is a set of markup tags
HTML documents are described by predefined HTML tags
Each HTML tag describes different document content
Well-formed HTML is a special case of XML
The DOCTYPE declaration defines the document type to be HTML
The text between <html> and </html> describes an HTML document
The text between <head> and </head> provides information about the document
The text between <title> and </title> provides a title for the document
The text between <body> and </body> describes the visible page content
The text between <h1> and </h1> describes a heading
The text between <p> and </p> describes a paragraph
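As a minimal sketch of how these tags fit together, the R snippet below parses a small page with htmlParse() from the XML package and pulls out the title, heading, and paragraph. The page text is made up for illustration and is not taken from any real site.

```r
library(XML)

# A tiny, made-up HTML document using the tags described above
page <- '<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>'

# asText = TRUE tells htmlParse() the string is HTML, not a file name or URL
doc <- htmlParse(page, asText = TRUE)

page_title <- xpathSApply(doc, "//head/title", xmlValue)  # text inside <title>
heading    <- xpathSApply(doc, "//body/h1", xmlValue)     # text inside <h1>
paragraphs <- xpathSApply(doc, "//body/p", xmlValue)      # text inside each <p>
```

Each XPath expression mirrors the nesting of the tags, so //head/title reads as "a title element directly inside a head element".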
Well-formed HTML means, among other things, that every opening tag has a matching closing tag:
<!DOCTYPE html>
<div id="contentSub"> some content here </div>
HTML links are defined with the <a> tag:
<a href="http://www.w3schools.com">This is a link</a>
The link’s destination is specified in the href attribute.
Attributes are used to provide additional information about HTML elements.
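To see how an attribute differs from an element's text, here is a small sketch using the XML package. It reads the link text with xmlValue() and the href attribute with xmlGetAttr(); the wrapping <p> tag is only added here to make the snippet a complete fragment.

```r
library(XML)

# The link from the text above, wrapped in a paragraph for illustration
snippet <- '<p><a href="http://www.w3schools.com">This is a link</a></p>'
doc <- htmlParse(snippet, asText = TRUE)

# The text between <a> and </a>
link_text <- xpathSApply(doc, "//a", xmlValue)

# The value of the href attribute; "href" is passed on to xmlGetAttr()
link_href <- xpathSApply(doc, "//a", xmlGetAttr, "href")
```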
<pre> displays any text content exactly as it appears in the source code. It is useful for displaying computer code or computer output, or for formatting a table a certain way.
<!DOCTYPE html>
<html>
<body>
<p>The pre tag preserves both spaces and line breaks:</p>
<pre>
My Bonnie lies over the ocean.
My Bonnie lies over the sea.
My Bonnie lies over the ocean.
Oh, bring back my Bonnie to me.
</pre>
</body>
</html>
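Because <pre> keeps line breaks, text scraped from it can be split back into lines. The following is a rough sketch with the XML package, using the song from the example above as an inline string.

```r
library(XML)

page <- '<html><body>
<pre>
My Bonnie lies over the ocean.
My Bonnie lies over the sea.
My Bonnie lies over the ocean.
Oh, bring back my Bonnie to me.
</pre>
</body></html>'

doc <- htmlParse(page, asText = TRUE)

# xmlValue() returns the whole <pre> block as one string, newlines included
pre_text <- xpathSApply(doc, "//pre", xmlValue)

# Split on newlines, then drop empty strings left by leading/trailing breaks
song_lines <- strsplit(pre_text, "\n", fixed = TRUE)[[1]]
song_lines <- song_lines[nzchar(song_lines)]
```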
<table style="width:100%">
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
Tables are defined with the <table> tag.
Tables are divided into table rows with the <tr> tag.
Table rows are divided into table data with the <td> tag.
A table row can also be divided into table headings with the <th> tag.
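The row and cell structure maps naturally onto a data frame. As a sketch, the snippet below parses the table above with the XML package and rebuilds it in R. The header row (First, Last, Score) is a hypothetical addition to show <th>; the original table has no headings.

```r
library(XML)

tbl_html <- '<table style="width:100%">
<tr><th>First</th><th>Last</th><th>Score</th></tr>
<tr><td>Jill</td><td>Smith</td><td>50</td></tr>
<tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>'

doc <- htmlParse(tbl_html, asText = TRUE)

header <- xpathSApply(doc, "//tr/th", xmlValue)  # heading cells
cells  <- xpathSApply(doc, "//tr/td", xmlValue)  # data cells, in row order

# Fill a matrix row by row, one column per heading, then convert
scores <- as.data.frame(
  matrix(cells, ncol = length(header), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(scores) <- header
scores$Score <- as.integer(scores$Score)  # scraped values arrive as text
```

Filling the matrix with byrow = TRUE matters because the cells are returned in document order, that is, one full row after another.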
Since well-formed HTML is a special case of XML, we can extract latitude and longitude from this HTML file using the above tools.
Summary of XML and HTML functions:
| Function | Description |
|---|---|
| xmlTreeParse() or xmlParse() | reads xml file, returns class XMLDocument or XMLInternalDocument |
| htmlTreeParse() or htmlParse() | reads html file, returns class XMLDocument or XMLInternalDocument |
| xmlAttrs() | named character vector of all attributes |
| xmlValue() | contents of a leaf node |
| xmlSApply() | sapply function applied to nodes of a tree |
XML example:
<?xml version="1.0"?>
<movies>
<movie mins="126" lang="eng">
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1998</year>
<genre>drama</genre>
</movie>
<movie mins="106" lang="spa">
<title>Y tu mama tambien</title>
<director>
<first_name>Alfonso</first_name>
<last_name>Cuaron</last_name>
</director>
<year>2001</year>
<genre>drama</genre>
</movie>
</movies>
library(XML)  # provides xmlParse(), xpathSApply(), xmlValue(), xmlAttrs()
doc <- xmlParse("/Users/Adam/Desktop/Stat133_S17/lectures/movies.xml")
class(doc)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
One can use the function xpathSApply() to find matching nodes in an internal XML tree.
Syntax:
xpathSApply(doc, path, fun, ... )
The output is simplified to a vector where possible (or to a matrix when fun returns several values per node).
xpathSApply(doc, "//year", xmlValue)
## [1] "1998" "2001"
xpathSApply(doc, "//movie[@lang='eng']/year", xmlValue)
## [1] "1998"
xpathSApply(doc, "//movie[@lang='eng']", xmlAttrs)
| mins | 126 |
| lang | eng |
Note: This produces a matrix rather than a plain character vector like the previous examples, because xmlAttrs() returns a named vector with one entry per attribute (here mins and lang) for each matching node.
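Putting these calls together, here is a sketch that builds a data frame from the movies document. To avoid depending on a local file path, it assumes a shortened copy of the XML pasted in as a string (the director fields and genre are omitted), and it reads single attributes with xmlGetAttr() rather than xmlAttrs().

```r
library(XML)

# Shortened inline copy of the movies example
movies_xml <- '<?xml version="1.0"?>
<movies>
  <movie mins="126" lang="eng">
    <title>Good Will Hunting</title>
    <year>1998</year>
  </movie>
  <movie mins="106" lang="spa">
    <title>Y tu mama tambien</title>
    <year>2001</year>
  </movie>
</movies>'

doc <- xmlParse(movies_xml, asText = TRUE)

# One column per XPath query; each query returns one value per movie
movies <- data.frame(
  title = xpathSApply(doc, "//movie/title", xmlValue),
  year  = as.integer(xpathSApply(doc, "//movie/year", xmlValue)),
  mins  = as.integer(xpathSApply(doc, "//movie", xmlGetAttr, "mins")),
  lang  = xpathSApply(doc, "//movie", xmlGetAttr, "lang"),
  stringsAsFactors = FALSE
)
movies
```

Asking for one attribute at a time keeps every column a simple vector, which sidesteps the matrix result seen with xmlAttrs().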
We will examine the Statistics Faculty web page at UCLA: http://directory.stat.ucla.edu/active_faculty
Since well-formed HTML is a special case of XML, we can extract information from this HTML file using the above tools.
Suppose we wish to make a data table summarizing the UCLA Statistics faculty names and their research interests.
htmlParse() is a wrapper for htmlTreeParse(), so you don’t have to write the argument useInternalNodes = TRUE.
doc <- htmlParse("http://directory.stat.ucla.edu/active_faculty")
class(doc)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument"
## [4] "XMLAbstractDocument"
Use xpathSApply() to extract the faculty names:
names <- xpathSApply(doc, '//div[@class="entity"]/a', xmlValue)
names
## [1] "Arash Ali Amini" "Peter Bentler"
## [3] "Alyson (Allie) Fletcher" "Rob Gould"
## [5] "Mark S. Handcock" "Erin Hartman"
## [7] "Chad Hazlett" "Jingyi Jessica Li"
## [9] "Ker-Chau Li" "Rick Paik Schoenberg"
## [11] "Ying Nian Wu" "Hongquan Xu"
## [13] "Alan Yuille" "Qing Zhou"
## [15] "Song Chun Zhu"
research <- xpathSApply(doc, '//p', xmlValue)
research
## [1] "High-dimensional inference, machine learning, optimization, networks"
## [2] "Multivariate analysis, with special emphasis on latent variable models."
## [3] "Machine learning, Statistical inference for high-dimensional data with applications in neuroscience, signal processing, information theory"
## [4] "Stochastic modeling of social networks, Environmental and spatial statistics, Demography, Computational statistics, Survey sampling, and Epidemiology."
## [5] "Causal Inference, Sampling Design, Survey Weighting, applications to social sciences"
## [6] "Causal inference, high-dimensional regression and classification, applications in political science."
## [7] "Applied Statistics and Statistical Modeling, as well as their interface with Statistical Genomics, Bioinformatics, and Computational Biology."
## [8] "Dimension reduction, data visualization, time series, images, and gene expression."
## [9] "Point processes, Image analysis, Time series, and applications especially in seismology and fire ecology."
## [10] "Statistical modeling and learning"
## [11] "Experimental design, functional linear model, computer experiment."
## [12] "Computer Vision, Bayesian Statistics, Pattern Recognition."
## [13] "Computational biology, Statistical learning, Monte Carlo methods, Energy landscapes"
## [14] "Computer Vision, Machine Learning, MCMC computing, Cognition, and Visual Arts"
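# Notice Rob Gould (entry 4 in names) doesn't list his research interests, so we need to wrangle the data a bit:
# insert an empty string after position 3 so that research lines up with names
research <- append(research, "", after = 3)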
research
## [1] "High-dimensional inference, machine learning, optimization, networks"
## [2] "Multivariate analysis, with special emphasis on latent variable models."
## [3] "Machine learning, Statistical inference for high-dimensional data with applications in neuroscience, signal processing, information theory"
## [4] ""
## [5] "Stochastic modeling of social networks, Environmental and spatial statistics, Demography, Computational statistics, Survey sampling, and Epidemiology."
## [6] "Causal Inference, Sampling Design, Survey Weighting, applications to social sciences"
## [7] "Causal inference, high-dimensional regression and classification, applications in political science."
## [8] "Applied Statistics and Statistical Modeling, as well as their interface with Statistical Genomics, Bioinformatics, and Computational Biology."
## [9] "Dimension reduction, data visualization, time series, images, and gene expression."
## [10] "Point processes, Image analysis, Time series, and applications especially in seismology and fire ecology."
## [11] "Statistical modeling and learning"
## [12] "Experimental design, functional linear model, computer experiment."
## [13] "Computer Vision, Bayesian Statistics, Pattern Recognition."
## [14] "Computational biology, Statistical learning, Monte Carlo methods, Energy landscapes"
## [15] "Computer Vision, Machine Learning, MCMC computing, Cognition, and Visual Arts"
Use data.frame() to make a data frame from your character vectors:
data.frame(names, research)
| names | research |
|---|---|
| Arash Ali Amini | High-dimensional inference, machine learning, optimization, networks |
| Peter Bentler | Multivariate analysis, with special emphasis on latent variable models. |
| Alyson (Allie) Fletcher | Machine learning, Statistical inference for high-dimensional data with applications in neuroscience, signal processing, information theory |
| Rob Gould | |
| Mark S. Handcock | Stochastic modeling of social networks, Environmental and spatial statistics, Demography, Computational statistics, Survey sampling, and Epidemiology. |
| Erin Hartman | Causal Inference, Sampling Design, Survey Weighting, applications to social sciences |
| Chad Hazlett | Causal inference, high-dimensional regression and classification, applications in political science. |
| Jingyi Jessica Li | Applied Statistics and Statistical Modeling, as well as their interface with Statistical Genomics, Bioinformatics, and Computational Biology. |
| Ker-Chau Li | Dimension reduction, data visualization, time series, images, and gene expression. |
| Rick Paik Schoenberg | Point processes, Image analysis, Time series, and applications especially in seismology and fire ecology. |
| Ying Nian Wu | Statistical modeling and learning |
| Hongquan Xu | Experimental design, functional linear model, computer experiment. |
| Alan Yuille | Computer Vision, Bayesian Statistics, Pattern Recognition. |
| Qing Zhou | Computational biology, Statistical learning, Monte Carlo methods, Energy landscapes |
| Song Chun Zhu | Computer Vision, Machine Learning, MCMC computing, Cognition, and Visual Arts |
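The append() fix above relies on spotting by eye which person has no entry. A sketch of a more robust pattern is to loop over each entity node and look for the name and the research text inside it, returning NA when a piece is missing. The HTML below is a made-up stand-in: it assumes each research paragraph sits inside its div of class "entity", which may not match the real UCLA page, and the names and interests are invented. The helper get_first() is also hypothetical.

```r
library(XML)

# Hypothetical page structure: one <div class="entity"> per person
html <- '<html><body>
<div class="entity"><a href="#">Ann Lee</a><p>Time series</p></div>
<div class="entity"><a href="#">Bo Chen</a></div>
<div class="entity"><a href="#">Cy Diaz</a><p>Survey sampling</p></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)
entities <- getNodeSet(doc, '//div[@class="entity"]')

# Return the first match of `path` below `node`, or NA if there is none
get_first <- function(node, path) {
  hit <- xpathSApply(node, path, xmlValue)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

faculty <- data.frame(
  names    = sapply(entities, get_first, path = "./a"),
  research = sapply(entities, get_first, path = "./p"),
  stringsAsFactors = FALSE
)
faculty
```

Because both columns come from the same node, a missing paragraph can never shift the remaining rows out of alignment.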
Now you are going to scrape info off of the Cal Statistics website: http://statistics.berkeley.edu/people/faculty
Do example 1a,b,c
https://scf.berkeley.edu:3838/shiny/alucas/Lecture-24-collection/
Sometimes your HTML isn’t well formed (for example it is missing closing tags). In this case you will get an error:
doc <- htmlParse("http://en.wikipedia.org/wiki/Mile_run_world_record_progression")
Error: failed to load external entity "http://en.wikipedia.org/wiki/Mile_run_world_record_progression".
There are tools in R to tidy HTML.
If you are scraping data from inside a table tag in HTML, then it might be easiest to use the html_nodes() and html_table() functions in the rvest package in R, as we did in lecture 18: http://rpubs.com/lucasadam7/259162
#type install.packages("rvest") in console
library(rvest)
SetOfTables <-
"http://en.wikipedia.org/wiki/Mile_run_world_record_progression" %>%
read_html() %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/table') %>%
html_table(fill=TRUE)
head( SetOfTables[[1]] )
| Time | Athlete | Nationality | Date | Venue |
|---|---|---|---|---|
| 4:28 | Charles Westhall | United Kingdom | 26 July 1855 | London |
| 4:28 | Thomas Horspool | United Kingdom | 28 September 1857 | Manchester |
| 4:23 | Thomas Horspool | United Kingdom | 12 July 1858 | Manchester |
| 4:22¼ | Siah Albison | United Kingdom | 27 October 1860 | Manchester |
| 4:21¾ | William Lang | United Kingdom | 11 July 1863 | Manchester |
| 4:20½ | Edward Mills | United Kingdom | 23 April 1864 | Manchester |
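The Time column comes back as text. As a rough sketch of the next wrangling step, the snippet below rebuilds two rows of the table above as inline HTML (so it runs without network access), reads them with html_table(), and converts the times to seconds. It only handles whole-second times; entries with fractions such as 4:22¼ would need extra cleaning. The column name Seconds is an illustrative choice.

```r
library(rvest)

# Two rows copied from the table above, as a stand-in for the live page
page <- read_html('<html><body>
<table>
<tr><th>Time</th><th>Athlete</th></tr>
<tr><td>4:28</td><td>Charles Westhall</td></tr>
<tr><td>4:23</td><td>Thomas Horspool</td></tr>
</table>
</body></html>')

tables <- page %>%
  html_nodes(xpath = "//table") %>%
  html_table()
records <- tables[[1]]

# Split "m:ss" into minutes and seconds, then combine into total seconds
parts <- strsplit(as.character(records$Time), ":", fixed = TRUE)
records$Seconds <- sapply(parts, function(p) {
  as.numeric(p[1]) * 60 + as.numeric(p[2])
})
```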