Source file ⇒ 2017-lec24.Rmd

Today

  1. Crash course in HTML
  2. Webscraping
  3. Check in with project groups

1. Crash course in HTML

Here is a good resource: http://www.w3schools.com/html/default.asp

HTML is a markup language for describing web documents (web pages).

HTML stands for Hyper Text Markup Language
A markup language is a set of markup tags

HTML documents are described by predefined HTML tags

Each HTML tag describes different document content

Well-formed HTML is a special case of XML

The DOCTYPE declaration defines the document type to be HTML
The text between <html> and </html> describes an HTML document
The text between <head> and </head> provides information about the document
The text between <title> and </title> provides a title for the document
The text between <body> and </body> describes the visible page content
The text between <h1> and </h1> describes a heading
The text between <p> and </p> describes a paragraph

Well-formed HTML means:

  1. it has a DOCTYPE declaration, for example: <!DOCTYPE html>
  2. every opening tag has a matching closing tag, for example: <div id="contentSub"> some content here </div>

HTML links are defined with the <a> tag:

<a href="http://www.w3schools.com">This is a link</a>

The link’s destination is specified in the href attribute.

Attributes are used to provide additional information about HTML elements.

The <pre> tag displays text content exactly as it appears in the source code. It is useful for displaying computer code or computer output, or for formatting a table a certain way.

<!DOCTYPE html>
<html>
<body>

<p>The pre tag preserves both spaces and line breaks:</p>

<pre>
   My Bonnie lies over the ocean.

   My Bonnie lies over the sea.

   My Bonnie lies over the ocean.
   
   Oh, bring back my Bonnie to me.
</pre>

</body>
</html>

Here is an example of an HTML table:

<table style="width:100%">
  <tr>
    <td>Jill</td>
    <td>Smith</td> 
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td> 
    <td>94</td>
  </tr>
</table>

Tables are defined with the <table> tag.

Tables are divided into table rows with the <tr> tag.

Table rows are divided into table data with the <td> tag.

A table row can also be divided into table headings with the <th> tag.

Since well-formed HTML is a special case of XML, we can extract data, such as latitude and longitude, from an HTML file using the XML tools summarized in the next section.

2. Webscraping

Summary of XML and HTML functions:

  Function                          Description
  xmlTreeParse() or xmlParse()      reads an XML file; returns an object of class XMLDocument or XMLInternalDocument
  htmlTreeParse() or htmlParse()    reads an HTML file; htmlParse() returns an object of class HTMLInternalDocument
  xmlAttrs()                        named character vector of all attributes of a node
  xmlValue()                        text contents of a leaf node
  xmlSApply()                       applies a function, sapply-style, to the child nodes of a node

XML example:

<?xml version="1.0"?>
<movies>
    <movie mins="126" lang="eng">
        <title>Good Will Hunting</title> 
        <director>
            <first_name>Gus</first_name>
            <last_name>Van Sant</last_name>
        </director>
        <year>1998</year> 
        <genre>drama</genre>
    </movie>
    <movie mins="106" lang="spa"> 
        <title>Y tu mama tambien</title>
         <director>
            <first_name>Alfonso</first_name>
            <last_name>Cuaron</last_name> 
        </director>
        <year>2001</year>
        <genre>drama</genre>
    </movie>
</movies>

We can read this file into R with xmlParse():

doc <- xmlParse("/Users/Adam/Desktop/Stat133_S17/lectures/movies.xml")
class(doc)
## [1] "XMLInternalDocument" "XMLAbstractDocument"

One can use the function xpathSApply() to find the nodes in an internal XML tree that match an XPath expression and apply a function to each of them.

Syntax:

xpathSApply(doc, path, fun, ... )

The output is a vector.

xpathSApply(doc, "//year", xmlValue)
## [1] "1998" "2001"
xpathSApply(doc, "//movie[@lang='eng']/year", xmlValue)
## [1] "1998"
xpathSApply(doc, "//movie[@lang='eng']", xmlAttrs)
##      [,1] 
## mins "126"
## lang "eng"

Note: This doesn't produce a simple character vector like the previous examples. A tag can have more than one attribute, so xmlAttrs() returns all of the attributes of each matching node, and the result is simplified to a matrix with one column per matching <movie>.

Example

We will examine the Statistics Faculty web page at UCLA: http://directory.stat.ucla.edu/active_faculty

Since well-formed HTML is a special case of XML, we can extract information from this HTML file using the above tools.

Suppose we wish to make a data table summarizing the UCLA Statistics faculty names and their research interests.

Step 1: Begin by parsing the page with htmlParse().

htmlParse() is a wrapper for htmlTreeParse() that sets useInternalNodes = TRUE, so you don't have to write that argument yourself.
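
In other words, the call in Step 1 below could equivalently be written with htmlTreeParse() itself (a sketch of the equivalence):

# equivalent to htmlParse("http://directory.stat.ucla.edu/active_faculty")
doc <- htmlTreeParse("http://directory.stat.ucla.edu/active_faculty",
                     useInternalNodes = TRUE)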

doc <- htmlParse("http://directory.stat.ucla.edu/active_faculty")
class(doc)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" 
## [4] "XMLAbstractDocument"

Step 2: Use XPath expressions with xpathSApply() to locate the faculty names and research interests.

names <- xpathSApply(doc, '//div[@class="entity"]/a', xmlValue)
names
##  [1] "Arash Ali Amini"         "Peter Bentler"          
##  [3] "Alyson (Allie) Fletcher" "Rob Gould"              
##  [5] "Mark S. Handcock"        "Erin Hartman"           
##  [7] "Chad Hazlett"            "Jingyi Jessica Li"      
##  [9] "Ker-Chau Li"             "Rick Paik Schoenberg"   
## [11] "Ying Nian Wu"            "Hongquan Xu"            
## [13] "Alan Yuille"             "Qing Zhou"              
## [15] "Song Chun Zhu"
research <- xpathSApply(doc, '//p', xmlValue)
research
##  [1] "High-dimensional inference, machine learning, optimization, networks"                                                                                  
##  [2] "Multivariate analysis, with special emphasis on latent variable models."                                                                               
##  [3] "Machine learning, Statistical inference for high-dimensional data with applications in neuroscience, signal processing, information theory"            
##  [4] "Stochastic modeling of social networks, Environmental and spatial statistics, Demography, Computational statistics, Survey sampling, and Epidemiology."
##  [5] "Causal Inference, Sampling Design, Survey Weighting, applications to social sciences"                                                                  
##  [6] "Causal inference, high-dimensional regression and classification, applications in political science."                                                  
##  [7] "Applied Statistics and Statistical Modeling, as well as their interface with Statistical Genomics, Bioinformatics, and Computational Biology."         
##  [8] "Dimension reduction, data visualization, time series, images, and gene expression."                                                                    
##  [9] "Point processes, Image analysis, Time series, and applications especially in seismology and fire ecology."                                             
## [10] "Statistical modeling and learning"                                                                                                                     
## [11] "Experimental design, functional linear model, computer experiment."                                                                                    
## [12] "Computer Vision, Bayesian Statistics, Pattern Recognition."                                                                                            
## [13] "Computational biology, Statistical learning, Monte Carlo methods, Energy landscapes"                                                                   
## [14] "Computer Vision, Machine Learning, MCMC computing, Cognition, and Visual Arts"
# Notice Rob Gould doesn't list his research interests, so we need to wrangle the data a bit

research <- append(research, "", after=3)
research
##  [1] "High-dimensional inference, machine learning, optimization, networks"                                                                                  
##  [2] "Multivariate analysis, with special emphasis on latent variable models."                                                                               
##  [3] "Machine learning, Statistical inference for high-dimensional data with applications in neuroscience, signal processing, information theory"            
##  [4] ""                                                                                                                                                      
##  [5] "Stochastic modeling of social networks, Environmental and spatial statistics, Demography, Computational statistics, Survey sampling, and Epidemiology."
##  [6] "Causal Inference, Sampling Design, Survey Weighting, applications to social sciences"                                                                  
##  [7] "Causal inference, high-dimensional regression and classification, applications in political science."                                                  
##  [8] "Applied Statistics and Statistical Modeling, as well as their interface with Statistical Genomics, Bioinformatics, and Computational Biology."         
##  [9] "Dimension reduction, data visualization, time series, images, and gene expression."                                                                    
## [10] "Point processes, Image analysis, Time series, and applications especially in seismology and fire ecology."                                             
## [11] "Statistical modeling and learning"                                                                                                                     
## [12] "Experimental design, functional linear model, computer experiment."                                                                                    
## [13] "Computer Vision, Bayesian Statistics, Pattern Recognition."                                                                                            
## [14] "Computational biology, Statistical learning, Monte Carlo methods, Energy landscapes"                                                                   
## [15] "Computer Vision, Machine Learning, MCMC computing, Cognition, and Visual Arts"

Step 3: Use data.frame() to make a data frame from your character vectors.

data.frame(names,research)
names research
Arash Ali Amini High-dimensional inference, machine learning, optimization, networks
Peter Bentler Multivariate analysis, with special emphasis on latent variable models.
Alyson (Allie) Fletcher Machine learning, Statistical inference for high-dimensional data with applications in neuroscience, signal processing, information theory
Rob Gould
Mark S. Handcock Stochastic modeling of social networks, Environmental and spatial statistics, Demography, Computational statistics, Survey sampling, and Epidemiology.
Erin Hartman Causal Inference, Sampling Design, Survey Weighting, applications to social sciences
Chad Hazlett Causal inference, high-dimensional regression and classification, applications in political science.
Jingyi Jessica Li Applied Statistics and Statistical Modeling, as well as their interface with Statistical Genomics, Bioinformatics, and Computational Biology.
Ker-Chau Li Dimension reduction, data visualization, time series, images, and gene expression.
Rick Paik Schoenberg Point processes, Image analysis, Time series, and applications especially in seismology and fire ecology.
Ying Nian Wu Statistical modeling and learning
Hongquan Xu Experimental design, functional linear model, computer experiment.
Alan Yuille Computer Vision, Bayesian Statistics, Pattern Recognition.
Qing Zhou Computational biology, Statistical learning, Monte Carlo methods, Energy landscapes
Song Chun Zhu Computer Vision, Machine Learning, MCMC computing, Cognition, and Visual Arts

Now you are going to scrape information from the Cal Statistics website: http://statistics.berkeley.edu/people/faculty

In-class exercise

Do examples 1a, 1b, and 1c:

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-24-collection/

Dealing with HTML that is not well formed, or with an HTML table

Sometimes your HTML isn't well formed (for example, it is missing closing tags). In this case you may get an error like this one:

doc <- htmlParse("http://en.wikipedia.org/wiki/Mile_run_world_record_progression")

Error: failed to load external entity "http://en.wikipedia.org/wiki/Mile_run_world_record_progression".

There are tools in R for tidying HTML and for working around errors like this one.
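
One workaround is to download the page source yourself and hand the text to htmlParse(); the sketch below assumes your version of R is recent enough for readLines() to read an https URL:

library(XML)
# fetch the raw HTML as a character vector, one element per line
page <- readLines("https://en.wikipedia.org/wiki/Mile_run_world_record_progression",
                  warn = FALSE)
# parse the downloaded text (rather than the URL) into an internal document
doc <- htmlParse(paste(page, collapse = "\n"), asText = TRUE)
class(doc)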

If you are scraping data from inside a <table> tag in HTML, it might be easiest to use the html_nodes() and html_table() functions in the rvest package, as we did in lecture 18: http://rpubs.com/lucasadam7/259162

# type install.packages("rvest") in the console
library(rvest)
SetOfTables <- 
  "http://en.wikipedia.org/wiki/Mile_run_world_record_progression" %>%
  read_html() %>% 
  html_nodes(xpath = '//*[@id="mw-content-text"]/table') %>%
  html_table(fill=TRUE)
head( SetOfTables[[1]] )
Time    Athlete           Nationality      Date               Venue
4:28    Charles Westhall  United Kingdom   26 July 1855       London
4:28    Thomas Horspool   United Kingdom   28 September 1857  Manchester
4:23    Thomas Horspool   United Kingdom   12 July 1858       Manchester
4:22¼   Siah Albison      United Kingdom   27 October 1860    Manchester
4:21¾   William Lang      United Kingdom   11 July 1863       Manchester
4:20½   Edward Mills      United Kingdom   23 April 1864      Manchester