Source file ⇒ 2017-lec24.Rmd
Here is a good resource: http://www.w3schools.com/html/default.asp
HTML is a markup language for describing web documents (web pages).
HTML stands for Hyper Text Markup Language
A markup language is a set of markup tags
HTML documents are described by predefined HTML tags
Each HTML tag describes different document content
Well-formed HTML is a special case of XML
The DOCTYPE declaration defines the document type to be HTML
The text between <html> and </html> describes an HTML document
The text between <head> and </head> provides information about the document
The text between <title> and </title> provides a title for the document
The text between <body> and </body> describes the visible page content
The text between <h1> and </h1> describes a heading
The text between <p> and </p> describes a paragraph
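The element descriptions above can be checked from R. The sketch below assumes the XML package (used later in these notes) is installed; the page text is made up for illustration. It parses a minimal document from a string and pulls out the title, heading, and paragraph.

```r
library(XML)

# A made-up minimal page containing the elements described above
page <- '<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>'

# asText = TRUE says the input is HTML text, not a file name or URL
doc <- htmlParse(page, asText = TRUE)

# Each XPath walks from the enclosing element down to the one we want
page_title <- xpathSApply(doc, "//head/title", xmlValue)
heading    <- xpathSApply(doc, "//body/h1", xmlValue)
paragraph  <- xpathSApply(doc, "//body/p", xmlValue)

print(c(title = page_title, h1 = heading, p = paragraph))
```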
Well formed HTML means:
<!DOCTYPE html>
<div id="contentSub"> some content here </div>
HTML links are defined with the <a> tag:
<a href="http://www.w3schools.com">This is a link</a>
The link’s destination is specified in the href attribute.
Attributes are used to provide additional information about HTML elements.
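As a small illustration of reading attributes from R, the sketch below reuses the link and div shown above inside a made-up wrapper page. It assumes the XML package is installed and uses xmlGetAttr() to fetch one attribute by name.

```r
library(XML)

# Made-up wrapper page around the two elements shown above
snippet <- '<html><body>
<a href="http://www.w3schools.com">This is a link</a>
<div id="contentSub"> some content here </div>
</body></html>'

doc <- htmlParse(snippet, asText = TRUE)

# xmlValue() gives the text between the tags
link_text <- xpathSApply(doc, "//a", xmlValue)

# xmlGetAttr() gives the value of a single named attribute
link_href <- xpathSApply(doc, "//a", xmlGetAttr, "href")
div_id    <- xpathSApply(doc, "//div", xmlGetAttr, "id")

print(c(text = link_text, href = link_href, id = div_id))
```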
<pre> displays any text content exactly as it appears in the source code. It is useful for displaying computer code or computer output, or for formatting a table a certain way.
<!DOCTYPE html>
<html>
<body>
<p>The pre tag preserves both spaces and line breaks:</p>
<pre>
My Bonnie lies over the ocean.
My Bonnie lies over the sea.
My Bonnie lies over the ocean.
Oh, bring back my Bonnie to me.
</pre>
</body>
</html>
<table style="width:100%">
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
Tables are defined with the <table> tag.
Tables are divided into table rows with the <tr> tag.
Table rows are divided into table data with the <td> tag.
A table row can also be divided into table headings with the <th> tag.
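The row and cell structure maps naturally onto a data frame. The following is a sketch only, assuming the XML package is installed: it takes the table above, adds a hypothetical <th> heading row (the column names First, Last, and Points are invented for illustration), and rebuilds the table in R from the <th> and <td> values.

```r
library(XML)

# The table from above, plus an invented heading row
tbl_html <- '<table style="width:100%">
<tr><th>First</th><th>Last</th><th>Points</th></tr>
<tr><td>Jill</td><td>Smith</td><td>50</td></tr>
<tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>'

doc <- htmlParse(tbl_html, asText = TRUE)

# Headings come from <th>, data cells from <td>, both in document order
headers <- xpathSApply(doc, "//tr/th", xmlValue)
cells   <- xpathSApply(doc, "//tr/td", xmlValue)

# Cells arrive row by row, so fill the matrix by row
tbl <- as.data.frame(matrix(cells, ncol = length(headers), byrow = TRUE),
                     stringsAsFactors = FALSE)
names(tbl) <- headers
tbl$Points <- as.numeric(tbl$Points)

print(tbl)
```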
Since well-formed HTML is a special case of XML, we can extract latitude and longitude from this HTML file using the above tools.
Summary of XML and HTML functions:
Function | Description |
---|---|
xmlTreeParse() or xmlParse() | reads xml file, returns class XMLDocument or XMLInternalDocument |
htmlTreeParse() or htmlParse() | reads html file, returns class XMLDocument or XMLInternalDocument |
xmlAttrs() | named character vector of all attributes |
xmlValue() | contents of a leaf node |
xmlSApply() | sapply function applied to nodes of a tree |
XML example:
<?xml version="1.0"?>
<movies>
<movie mins="126" lang="eng">
<title>Good Will Hunting</title>
<director>
<first_name>Gus</first_name>
<last_name>Van Sant</last_name>
</director>
<year>1998</year>
<genre>drama</genre>
</movie>
<movie mins="106" lang="spa">
<title>Y tu mama tambien</title>
<director>
<first_name>Alfonso</first_name>
<last_name>Cuaron</last_name>
</director>
<year>2001</year>
<genre>drama</genre>
</movie>
</movies>
library(XML)  # provides xmlParse(), xpathSApply(), xmlValue(), xmlAttrs()
doc <- xmlParse("/Users/Adam/Desktop/Stat133_S17/lectures/movies.xml")
class(doc)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
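To try the functions from the summary table without depending on a local file path, the same kind of document can be parsed from a string. This is a sketch under that assumption, using a shortened copy of the movies example and the XML package.

```r
library(XML)

# Shortened copy of the movies example, held in a string
movies_xml <- '<?xml version="1.0"?>
<movies>
<movie mins="126" lang="eng">
<title>Good Will Hunting</title>
<year>1998</year>
</movie>
<movie mins="106" lang="spa">
<title>Y tu mama tambien</title>
<year>2001</year>
</movie>
</movies>'

doc  <- xmlParse(movies_xml, asText = TRUE)
root <- xmlRoot(doc)                 # the <movies> node

root_name   <- xmlName(root)
first_movie <- root[["movie"]]       # first child named "movie"

first_attrs <- xmlAttrs(first_movie)             # named character vector
first_title <- xmlValue(first_movie[["title"]])  # contents of a leaf node
child_names <- xmlSApply(first_movie, xmlName)   # applied to each child node

print(first_attrs)
print(first_title)
print(child_names)
```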
One can use the function xpathSApply() to find matching nodes in an internal XML tree.
Syntax:
xpathSApply(doc, path, fun, ... )
The output is simplified to a vector when each match yields a single value; otherwise it is a matrix or a list.
xpathSApply(doc, "//year", xmlValue)
## [1] "1998" "2001"
xpathSApply(doc, "//movie[@lang='eng']/year", xmlValue)
## [1] "1998"
xpathSApply(doc, "//movie[@lang='eng']", xmlAttrs)
mins | 126 |
lang | eng |
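Note: This doesn’t produce a character vector like the previous examples because xmlAttrs() returns a named vector with one entry per attribute, and this tag has two attributes (mins and lang), so the result is simplified to a matrix instead.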
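The shape of the result is easier to see with more than one matching node. The sketch below, assuming the XML package and a shortened copy of the movies example held in a string, compares xmlAttrs() (all attributes, giving a matrix) with xmlGetAttr() (one named attribute, giving a plain vector).

```r
library(XML)

# Shortened copy of the movies example, held in a string
movies_xml <- '<?xml version="1.0"?>
<movies>
<movie mins="126" lang="eng"><year>1998</year></movie>
<movie mins="106" lang="spa"><year>2001</year></movie>
</movies>'

doc <- xmlParse(movies_xml, asText = TRUE)

# All attributes: one column per matching node, one row per attribute
attrs <- xpathSApply(doc, "//movie", xmlAttrs)

# One attribute at a time: a plain vector, one entry per matching node
langs <- xpathSApply(doc, "//movie", xmlGetAttr, "lang")
mins  <- as.numeric(xpathSApply(doc, "//movie", xmlGetAttr, "mins"))

print(attrs)
print(langs)
print(mins)
```

Attribute values are always read as character strings, which is why mins is converted with as.numeric().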
We will examine the Statistics Faculty web page at UCLA: http://directory.stat.ucla.edu/active_faculty
Since well-formed HTML is a special case of XML, we can extract information from this HTML file using the above tools.
Suppose we wish to make a data table summarizing the UCLA Statistics faculty names and their research interests.
htmlParse()
htmlParse() is a wrapper for htmlTreeParse(), so you don’t have to write the argument useInternalNodes = TRUE.
doc <- htmlParse("http://directory.stat.ucla.edu/active_faculty")
class(doc)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument"
## [4] "XMLAbstractDocument"
xpathSApply
names <- xpathSApply(doc, '//div[@class="entity"]/a', xmlValue)
names
## [1] "Arash Ali Amini" "Peter Bentler"
## [3] "Alyson (Allie) Fletcher" "Rob Gould"
## [5] "Mark S. Handcock" "Erin Hartman"
## [7] "Chad Hazlett" "Jingyi Jessica Li"
## [9] "Ker-Chau Li" "Rick Paik Schoenberg"
## [11] "Ying Nian Wu" "Hongquan Xu"
## [13] "Alan Yuille" "Qing Zhou"
## [15] "Song Chun Zhu"
research <- xpathSApply(doc, '//p', xmlValue)
research
## [1] "High-dimensional inference, machine learning, optimization, networks"
## [2] "Multivariate analysis, with special emphasis on latent variable models."
## [3] "Machine learning, Statistical inference for high-dimensional data with applications in neuroscience, signal processing, information theory"
## [4] "Stochastic modeling of social networks, Environmental and spatial statistics, Demography, Computational statistics, Survey sampling, and Epidemiology."
## [5] "Causal Inference, Sampling Design, Survey Weighting, applications to social sciences"
## [6] "Causal inference, high-dimensional regression and classification, applications in political science."
## [7] "Applied Statistics and Statistical Modeling, as well as their interface with Statistical Genomics, Bioinformatics, and Computational Biology."
## [8] "Dimension reduction, data visualization, time series, images, and gene expression."
## [9] "Point processes, Image analysis, Time series, and applications especially in seismology and fire ecology."
## [10] "Statistical modeling and learning"
## [11] "Experimental design, functional linear model, computer experiment."
## [12] "Computer Vision, Bayesian Statistics, Pattern Recognition."
## [13] "Computational biology, Statistical learning, Monte Carlo methods, Energy landscapes"
## [14] "Computer Vision, Machine Learning, MCMC computing, Cognition, and Visual Arts"
# Notice Rob Gould (entry 4 in names) doesn't list his research interests,
# so insert an empty string after position 3 to keep the two vectors aligned
research <- append(research, "", after=3)
research
## [1] "High-dimensional inference, machine learning, optimization, networks"
## [2] "Multivariate analysis, with special emphasis on latent variable models."
## [3] "Machine learning, Statistical inference for high-dimensional data with applications in neuroscience, signal processing, information theory"
## [4] ""
## [5] "Stochastic modeling of social networks, Environmental and spatial statistics, Demography, Computational statistics, Survey sampling, and Epidemiology."
## [6] "Causal Inference, Sampling Design, Survey Weighting, applications to social sciences"
## [7] "Causal inference, high-dimensional regression and classification, applications in political science."
## [8] "Applied Statistics and Statistical Modeling, as well as their interface with Statistical Genomics, Bioinformatics, and Computational Biology."
## [9] "Dimension reduction, data visualization, time series, images, and gene expression."
## [10] "Point processes, Image analysis, Time series, and applications especially in seismology and fire ecology."
## [11] "Statistical modeling and learning"
## [12] "Experimental design, functional linear model, computer experiment."
## [13] "Computer Vision, Bayesian Statistics, Pattern Recognition."
## [14] "Computational biology, Statistical learning, Monte Carlo methods, Energy landscapes"
## [15] "Computer Vision, Machine Learning, MCMC computing, Cognition, and Visual Arts"
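Inserting an empty string by position works here, but it relies on spotting which entry is missing. An alternative sketch is to visit each faculty entry in turn and look for the name and the research text inside that entry, recording NA when one is absent. The markup below is made up: it assumes each person sits in a div of class "entity" holding an <a> and an optional <p>, which may not match the real page, and first_or_na() is a hypothetical helper.

```r
library(XML)

# Made-up page: the second entry has no research paragraph
html <- '<html><body>
<div class="entity"><a href="#">Ann Example</a><p>Time series</p></div>
<div class="entity"><a href="#">Bob Example</a></div>
<div class="entity"><a href="#">Cy Example</a><p>Causal inference</p></div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# One node per faculty entry
entities <- getNodeSet(doc, '//div[@class="entity"]')

# Hypothetical helper: first match of a relative XPath under one node, or NA
first_or_na <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

faculty <- data.frame(
  names    = sapply(entities, first_or_na, path = "./a"),
  research = sapply(entities, first_or_na, path = "./p"),
  stringsAsFactors = FALSE
)

print(faculty)
```

Because both columns are built from the same list of entries, they stay aligned no matter which entry lacks a paragraph.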
Use data.frame() to make a data frame from your character vectors:
data.frame(names, research)
names | research |
---|---|
Arash Ali Amini | High-dimensional inference, machine learning, optimization, networks |
Peter Bentler | Multivariate analysis, with special emphasis on latent variable models. |
Alyson (Allie) Fletcher | Machine learning, Statistical inference for high-dimensional data with applications in neuroscience, signal processing, information theory |
Rob Gould | |
Mark S. Handcock | Stochastic modeling of social networks, Environmental and spatial statistics, Demography, Computational statistics, Survey sampling, and Epidemiology. |
Erin Hartman | Causal Inference, Sampling Design, Survey Weighting, applications to social sciences |
Chad Hazlett | Causal inference, high-dimensional regression and classification, applications in political science. |
Jingyi Jessica Li | Applied Statistics and Statistical Modeling, as well as their interface with Statistical Genomics, Bioinformatics, and Computational Biology. |
Ker-Chau Li | Dimension reduction, data visualization, time series, images, and gene expression. |
Rick Paik Schoenberg | Point processes, Image analysis, Time series, and applications especially in seismology and fire ecology. |
Ying Nian Wu | Statistical modeling and learning |
Hongquan Xu | Experimental design, functional linear model, computer experiment. |
Alan Yuille | Computer Vision, Bayesian Statistics, Pattern Recognition. |
Qing Zhou | Computational biology, Statistical learning, Monte Carlo methods, Energy landscapes |
Song Chun Zhu | Computer Vision, Machine Learning, MCMC computing, Cognition, and Visual Arts |
Now you are going to scrape info off of the Cal Statistics website: http://statistics.berkeley.edu/people/faculty
Do example 1a,b,c
https://scf.berkeley.edu:3838/shiny/alucas/Lecture-24-collection/
table
Sometimes htmlParse() fails on a page, for example because the HTML isn’t well formed (it is missing closing tags) or because the URL cannot be loaded directly. In this case you will get an error:
doc <- htmlParse("http://en.wikipedia.org/wiki/Mile_run_world_record_progression")
Error: failed to load external entity "http://en.wikipedia.org/wiki/Mile_run_world_record_progression"
There are tools in R to tidy HTML.
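As one illustration, the HTML parser behind htmlParse() already repairs some problems itself when it is handed the page text. The sketch below assumes the XML package; the messy markup is made up, and parse_page() is a hypothetical helper (not part of any package) that downloads a page as text with readLines() before parsing, which may help when a URL cannot be loaded directly. It is defined here but not called, since it needs a network connection.

```r
library(XML)

# Hypothetical helper: fetch the raw page text first, then parse it as a string.
# Assumes readLines() can reach the URL; encoding issues are not handled.
parse_page <- function(url) {
  txt <- paste(readLines(url, warn = FALSE), collapse = "\n")
  htmlParse(txt, asText = TRUE)
}

# Offline demo: neither <p> tag is closed in this made-up snippet
messy <- "<html><body><p>first<p>second</body></html>"
doc   <- htmlParse(messy, asText = TRUE)

# The parser closes each <p> before the next one opens
paras <- xpathSApply(doc, "//p", xmlValue)
print(paras)
```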
If you are scraping data from inside a table tag in HTML, then it might be easiest to use the html_nodes() and html_table() functions in the rvest package in R, as we did in lecture 18 http://rpubs.com/lucasadam7/259162 .
# type install.packages("rvest") in the console first
library(rvest)
SetOfTables <-
  "http://en.wikipedia.org/wiki/Mile_run_world_record_progression" %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/table') %>%
  html_table(fill = TRUE)
head(SetOfTables[[1]])
Time | Athlete | Nationality | Date | Venue |
---|---|---|---|---|
4:28 | Charles Westhall | United Kingdom | 26 July 1855 | London |
4:28 | Thomas Horspool | United Kingdom | 28 September 1857 | Manchester |
4:23 | Thomas Horspool | United Kingdom | 12 July 1858 | Manchester |
4:22¼ | Siah Albison | United Kingdom | 27 October 1860 | Manchester |
4:21¾ | William Lang | United Kingdom | 11 July 1863 | Manchester |
4:20½ | Edward Mills | United Kingdom | 23 April 1864 | Manchester |
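The same pipeline can be tried offline on a small page held in a string, which makes it easier to see what each step returns. This is a sketch assuming the rvest package is installed; the markup is made up to mimic the structure the XPath above expects, with two rows borrowed from the output shown.

```r
library(rvest)

# Made-up page mimicking the structure targeted by the XPath
page <- '<html><body><div id="mw-content-text">
<table>
<tr><th>Time</th><th>Athlete</th></tr>
<tr><td>4:28</td><td>Charles Westhall</td></tr>
<tr><td>4:23</td><td>Thomas Horspool</td></tr>
</table>
</div></body></html>'

tables <- page %>%
  read_html() %>%                                            # parse the HTML text
  html_nodes(xpath = '//*[@id="mw-content-text"]/table') %>% # select <table> nodes
  html_table()                                               # one data frame per table

first <- tables[[1]]
print(first)
```

html_table() returns a list with one element per selected table, which is why the result is indexed with [[1]].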