Source file ⇒ 2017-lec24.Rmd

Today

  1. Crash course in HTML
  2. Webscraping
  3. Check in with project groups

1. Crash course in HTML

Here is a good resource: http://www.w3schools.com/html/default.asp

HTML is a markup language for describing web documents (web pages).

HTML stands for Hyper Text Markup Language
A markup language is a set of markup tags

HTML documents are described by predefined HTML tags

Each HTML tag describes different document content

Well-formed HTML is a special case of XML

The DOCTYPE declaration defines the document type to be HTML
The text between <html> and </html> describes an HTML document
The text between <head> and </head> provides information about the document
The text between <title> and </title> provides a title for the document
The text between <body> and </body> describes the visible page content
The text between <h1> and </h1> describes a heading
The text between <p> and </p> describes a paragraph

Well-formed HTML means:

  1. it has a DOCTYPE declaration, for example: <!DOCTYPE html>
  2. every opening tag has a matching closing tag, for example: <div id="contentSub"> some content here </div>

HTML links are defined with the <a> tag:

<a href="http://www.w3schools.com">This is a link</a>

The link’s destination is specified in the href attribute.

Attributes are used to provide additional information about HTML elements.

The <pre> tag displays text content exactly as it appears in the source code. It is useful for displaying computer code or computer output, or for formatting a table a certain way.

<!DOCTYPE html>
<html>
<body>

<p>The pre tag preserves both spaces and line breaks:</p>

<pre>
   My Bonnie lies over the ocean.

   My Bonnie lies over the sea.

   My Bonnie lies over the ocean.
   
   Oh, bring back my Bonnie to me.
</pre>

</body>
</html>

Here is an example of an HTML table:

<table style="width:100%">
  <tr>
    <td>Jill</td>
    <td>Smith</td> 
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td> 
    <td>94</td>
  </tr>
</table>

Tables are defined with the <table> tag.

Tables are divided into table rows with the <tr> tag.

Table rows are divided into table data with the <td> tag.

A table row can also be divided into table headings with the <th> tag.

Since well-formed HTML is a special case of XML, we can extract data, such as latitude and longitude, from an HTML file using the XML tools summarized in the next section.

2. Webscraping

Summary of XML and HTML functions:

  Function                          Description
  xmlTreeParse() or xmlParse()      reads an XML file; returns an object of class XMLDocument or XMLInternalDocument
  htmlTreeParse() or htmlParse()    reads an HTML file; htmlParse() returns an object of class HTMLInternalDocument
  xmlAttrs()                        named character vector of all attributes of a node
  xmlValue()                        text contents of a leaf node
  xmlSApply()                       applies a function, sapply-style, to the child nodes of a node

XML example:

<?xml version="1.0"?>
<movies>
    <movie mins="126" lang="eng">
        <title>Good Will Hunting</title> 
        <director>
            <first_name>Gus</first_name>
            <last_name>Van Sant</last_name>
        </director>
        <year>1998</year> 
        <genre>drama</genre>
    </movie>
    <movie mins="106" lang="spa"> 
        <title>Y tu mama tambien</title>
         <director>
            <first_name>Alfonso</first_name>
            <last_name>Cuaron</last_name> 
        </director>
        <year>2001</year>
        <genre>drama</genre>
    </movie>
</movies>

We can read this file into R with xmlParse():

doc <- xmlParse("/Users/Adam/Desktop/Stat133_S17/lectures/movies.xml")
class(doc)
## [1] "XMLInternalDocument" "XMLAbstractDocument"

One can use the function xpathSApply() to find the nodes in an internal XML tree that match an XPath expression and apply a function to each of them.

Syntax:

xpathSApply(doc, path, fun, ... )

The output is a vector.

xpathSApply(doc, "//year", xmlValue)
## [1] "1998" "2001"
xpathSApply(doc, "//movie[@lang='eng']/year", xmlValue)
## [1] "1998"
xpathSApply(doc, "//movie[@lang='eng']", xmlAttrs)
##      [,1] 
## mins "126"
## lang "eng"

Note: This doesn't produce a simple character vector like the previous examples. A tag can have more than one attribute, so xmlAttrs() returns all of the attributes of each matching node, and the result is simplified to a matrix with one column per matching <movie>.

Example

We will examine the Statistics Faculty web page at UCLA: http://directory.stat.ucla.edu/active_faculty

Since well-formed HTML is a special case of XML, we can extract information from this HTML file using the above tools.

Suppose we wish to make a data table summarizing the UCLA Statistics faculty names and their research interests.

Step 1: Begin by parsing the page with htmlParse().

htmlParse() is a wrapper for htmlTreeParse() that sets useInternalNodes = TRUE, so you don't have to write that argument yourself.
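
In other words, the call in Step 1 below could equivalently be written with htmlTreeParse() itself (a sketch of the equivalence):

# equivalent to htmlParse("http://directory.stat.ucla.edu/active_faculty")
doc <- htmlTreeParse("http://directory.stat.ucla.edu/active_faculty",
                     useInternalNodes = TRUE)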

doc <- htmlParse("http://directory.stat.ucla.edu/active_faculty")
class(doc)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" 
## [4] "XMLAbstractDocument"

Step 2: Use XPath expressions with xpathSApply() to locate the faculty names and research interests.

names <- xpathSApply(doc, '//div[@class="entity"]/a', xmlValue)
names
##  [1] "Arash Ali Amini"         "Peter Bentler"          
##  [3] "Alyson (Allie) Fletcher" "Rob Gould"              
##  [5] "Mark S. Handcock"        "Erin Hartman"           
##  [7] "Chad Hazlett"            "Jingyi Jessica Li"      
##  [9] "Ker-Chau Li"             "Rick Paik Schoenberg"   
## [11] "Ying Nian Wu"            "Hongquan Xu"            
## [13] "Alan Yuille"             "Qing Zhou"              
## [15] "Song Chun Zhu"
research <- xpathSApply(doc, '//p', xmlValue)
research
##  [1] "High-dimensional inference, machine learning, optimization, networks"                                                                                  
##  [2] "Multivariate analysis, with special emphasis on latent variable models."                                                                               
##  [3] "Machine learning, Statistical inference for high-dimensional data with applications in neuroscience, signal processing, information theory"            
##  [4] "Stochastic modeling of social networks, Environmental and spatial statistics, Demography, Computational statistics, Survey sampling, and Epidemiology."
##  [5] "Causal Inference, Sampling Design, Survey Weighting, applications to social sciences"                                                                  
##  [6] "Causal inference, high-dimensional regression and classification, applications in political science."                                                  
##  [7] "Applied Statistics and Statistical Modeling, as well as their interface with Statistical Genomics, Bioinformatics, and Computational Biology."         
##  [8] "Dimension reduction, data visualization, time series, images, and gene expression."                                                                    
##  [9] "Point processes, Image analysis, Time series, and applications especially in seismology and fire ecology."                                             
## [10] "Statistical modeling and learning"                                                                                                                     
## [11] "Experimental design, functional linear model, computer experiment."                                                                                    
## [12] "Computer Vision, Bayesian Statistics, Pattern Recognition."                                                                                            
## [13] "Computational biology, Statistical learning, Monte Carlo methods, Energy landscapes"                                                                   
## [14] "Computer Vision, Machine Learning, MCMC computing, Cognition, and Visual Arts"
# Notice Rob Gould doesn't list his research interests, so we need to wrangle the data a bit

research <- append(research, "", after=3)
research
##  [1] "High-dimensional inference, machine learning, optimization, networks"                                                                                  
##  [2] "Multivariate analysis, with special emphasis on latent variable models."                                                                               
##  [3] "Machine learning, Statistical inference for high-dimensional data with applications in neuroscience, signal processing, information theory"            
##  [4] ""                                                                                                                                                      
##  [5] "Stochastic modeling of social networks, Environmental and spatial statistics, Demography, Computational statistics, Survey sampling, and Epidemiology."
##  [6] "Causal Inference, Sampling Design, Survey Weighting, applications to social sciences"                                                                  
##  [7] "Causal inference, high-dimensional regression and classification, applications in political science."                                                  
##  [8] "Applied Statistics and Statistical Modeling, as well as their interface with Statistical Genomics, Bioinformatics, and Computational Biology."         
##  [9] "Dimension reduction, data visualization, time series, images, and gene expression."                                                                    
## [10] "Point processes, Image analysis, Time series, and applications especially in seismology and fire ecology."                                             
## [11] "Statistical modeling and learning"                                                                                                                     
## [12] "Experimental design, functional linear model, computer experiment."                                                                                    
## [13] "Computer Vision, Bayesian Statistics, Pattern Recognition."                                                                                            
## [14] "Computational biology, Statistical learning, Monte Carlo methods, Energy landscapes"                                                                   
## [15] "Computer Vision, Machine Learning, MCMC computing, Cognition, and Visual Arts"

Step 3: Use data.frame() to make a data frame from your character vectors.

data.frame(names,research)
names research
Arash Ali Amini High-dimensional inference, machine learning, optimization, networks
Peter Bentler Multivariate analysis, with special emphasis on latent variable models.
Alyson (Allie) Fletcher Machine learning, Statistical inference for high-dimensional data with applications in neuroscience, signal processing, information theory
Rob Gould
Mark S. Handcock Stochastic modeling of social networks, Environmental and spatial statistics, Demography, Computational statistics, Survey sampling, and Epidemiology.
Erin Hartman Causal Inference, Sampling Design, Survey Weighting, applications to social sciences
Chad Hazlett Causal inference, high-dimensional regression and classification, applications in political science.
Jingyi Jessica Li Applied Statistics and Statistical Modeling, as well as their interface with Statistical Genomics, Bioinformatics, and Computational Biology.
Ker-Chau Li Dimension reduction, data visualization, time series, images, and gene expression.
Rick Paik Schoenberg Point processes, Image analysis, Time series, and applications especially in seismology and fire ecology.
Ying Nian Wu Statistical modeling and learning
Hongquan Xu Experimental design, functional linear model, computer experiment.
Alan Yuille Computer Vision, Bayesian Statistics, Pattern Recognition.
Qing Zhou Computational biology, Statistical learning, Monte Carlo methods, Energy landscapes
Song Chun Zhu Computer Vision, Machine Learning, MCMC computing, Cognition, and Visual Arts

Now you are going to scrape information from the Cal Statistics website: http://statistics.berkeley.edu/people/faculty

In-class exercise

Do examples 1a, 1b, and 1c:

https://scf.berkeley.edu:3838/shiny/alucas/Lecture-24-collection/

Dealing with HTML that is not well formed, or with an HTML table

Sometimes your HTML isn't well formed (for example, it is missing closing tags). In this case you may get an error like this one:

doc <- htmlParse("http://en.wikipedia.org/wiki/Mile_run_world_record_progression")

Error: failed to load external entity "http://en.wikipedia.org/wiki/Mile_run_world_record_progression".

There are tools in R for tidying HTML and for working around errors like this one.
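
One workaround is to download the page source yourself and hand the text to htmlParse(); the sketch below assumes your version of R is recent enough for readLines() to read an https URL:

library(XML)
# fetch the raw HTML as a character vector, one element per line
page <- readLines("https://en.wikipedia.org/wiki/Mile_run_world_record_progression",
                  warn = FALSE)
# parse the downloaded text (rather than the URL) into an internal document
doc <- htmlParse(paste(page, collapse = "\n"), asText = TRUE)
class(doc)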

If you are scraping data from inside a <table> tag in HTML, it might be easiest to use the html_nodes() and html_table() functions in the rvest package, as we did in lecture 18: http://rpubs.com/lucasadam7/259162

# type install.packages("rvest") in the console
library(rvest)
SetOfTables <- 
  "http://en.wikipedia.org/wiki/Mile_run_world_record_progression" %>%
  read_html() %>% 
  html_nodes(xpath = '//*[@id="mw-content-text"]/table') %>%
  html_table(fill=TRUE)
head( SetOfTables[[1]] )
Time    Athlete           Nationality      Date               Venue
4:28    Charles Westhall  United Kingdom   26 July 1855       London
4:28    Thomas Horspool   United Kingdom   28 September 1857  Manchester
4:23    Thomas Horspool   United Kingdom   12 July 1858       Manchester
4:22¼   Siah Albison      United Kingdom   27 October 1860    Manchester
4:21¾   William Lang      United Kingdom   11 July 1863       Manchester
4:20½   Edward Mills      United Kingdom   23 April 1864      Manchester