Introduction to Web Scraping

What is web scraping?

Web scraping is the process of extracting a structured representation of data from a website.

Example

  1. Collecting user comments on an online news article

  2. Targeting the markup of the web page

  3. Parsing the web page into a tree representation

  4. Running an R script automatically

1. Data on the internet

  1. The World Wide Web (WWW) contains all kinds of information from diverse sources, much of which is useful for research.

  2. The data may be spread not only across multiple websites but also across multiple pages and sections within a single website.

  3. Web scraping is used to extract data from one or more websites and to transform unstructured data from the internet into structured data in a local database.

  4. When a website offers an API, data collection is fast and can be performed without any web scraping technique.

  5. However, many websites do not provide APIs. Web scraping is the solution in that case.

  6. One of the most useful representations of a web page is a labeled, hierarchical rooted tree. In such a tree, the labels are the tags of the Hypertext Markup Language (HTML) syntax, and the tree hierarchy represents the different nesting levels of the elements that make up the web page.

  7. The representation of a web page as an ordered, labeled rooted tree is referred to as the Document Object Model (DOM). The web page itself is plain text marked up with HTML tags, which the browser interprets to render web-specific items; the DOM exposes that markup as a tree.

  8. HTML tags can be nested in a hierarchical structure. The document tree of the DOM captures this hierarchy, with one node for each HTML element.

Document Object Model (DOM)

2. HTML

HyperText Markup Language

HTML


2.1 Browser presentation and source code

  1. HTML's marked-up structure

    • Markup definitions: the tags

  2. Web content is an interpreted version of the source code

    • The source code defines how the document is structured and the function of its various parts: headlines, links, tables, etc.

    • The browser's element inspector shows the source code behind the rendered page

2.2 Syntax rules

1. Tags, elements, and attributes

Elements

<title>First HTML</title>

Attributes

<a href="https://en.wikipedia.org/wiki/Main_Page">Link to Wikipedia!</a>

https://en.wikipedia.org/wiki/Main_Page
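To make the distinction between tags, elements, and attributes concrete, here is a minimal Python sketch that walks through the link example above with the standard library's html.parser module. The class name TagLogger and the events list are illustrative assumptions, not part of the original notes.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record start tags (with attributes), end tags, and element text."""

    def __init__(self):
        super().__init__()
        self.events = []  # hypothetical container for what the parser sees

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags
        if data.strip():
            self.events.append(("text", data))


snippet = '<a href="https://en.wikipedia.org/wiki/Main_Page">Link to Wikipedia!</a>'
logger = TagLogger()
logger.feed(snippet)
logger.close()

for event in logger.events:
    print(event)
```

The start tag carries the attribute (href and its value), the text between the tags is the element's content, and the end tag closes the element.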

2. Tree structure

<html>
  <head>
    <title>First HTML</title>         
  </head>
  <body>
    <p>I am your first HTML file!</p>
  </body>
</html>
A tree perspective on HTML

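The tree structure of the document above can be made visible with a short Python sketch. It uses the standard library's html.parser and tracks nesting depth by hand; the class name TreePrinter is an assumption for illustration, and the depth counter only works for markup in which every opened tag is also closed, as in this example.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collect one indented line per element or text node."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # current nesting level
        self.lines = []  # indented outline of the document

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1  # children of this element are one level deeper

    def handle_endtag(self, tag):
        # Assumes well-formed markup: every start tag has a matching end tag
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


source = """<html>
  <head>
    <title>First HTML</title>
  </head>
  <body>
    <p>I am your first HTML file!</p>
  </body>
</html>"""

printer = TreePrinter()
printer.feed(source)
printer.close()
print("\n".join(printer.lines))
```

Each level of indentation in the printed outline corresponds to one nesting level in the source: html is the root, head and body are its children, and the text sits at the leaves.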

3. Comments

HTML offers the possibility to insert comments into the code that are not evaluated and therefore not displayed in the browser.

<!-- I am a comment.
  I can span several lines and I am able to store additional 
    content that is not displayed by the browser. -->
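Although comments are not displayed, they remain part of the source code and are therefore visible to a scraper. The following Python sketch, using the standard library's html.parser, separates comments from visible text. The class name CommentCollector and the sample page string are assumptions made for this illustration.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Keep comments and visible text in separate lists."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.text = []

    def handle_comment(self, data):
        # data is everything between <!-- and -->
        self.comments.append(data.strip())

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


page = "<p>Visible text</p><!-- I am a comment. -->"
collector = CommentCollector()
collector.feed(page)
collector.close()

print(collector.text)
print(collector.comments)
```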
     

4. Reserved and special characters

  <p> &lt;p&gt; </p>

HTML entities

| Character | Entity name | Explanation        |
| "         | &quot;      | quotation mark     |
| '         | &apos;      | apostrophe         |
| &         | &amp;       | ampersand          |
| <         | &lt;        | less than          |
| >         | &gt;        | greater than       |
|           | &nbsp;      | non-breaking space |

<p>
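When scraped text is processed in code, entities usually need to be converted in one direction or the other. As a minimal sketch, Python's standard html module provides html.escape and html.unescape for this; the sample strings are assumptions chosen to mirror the table above.

```python
import html

# Escaping: reserved characters become entities so they are shown, not interpreted
raw = '<p> & "quotes"'
escaped = html.escape(raw)
print(escaped)  # &lt;p&gt; &amp; &quot;quotes&quot;

# Unescaping: entities found in scraped source become plain characters again
print(html.unescape("&lt;p&gt;"))  # <p>

# &nbsp; decodes to the non-breaking space character U+00A0, not a regular space
print(html.unescape("&nbsp;") == "\xa0")  # True
```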

5. Spaces and line breaks

  <p>Writing     code        is           poetry</p>

  <p>Writing&nbsp;code&nbsp;is&nbsp;poetry</p>
  
  <p>Writing<br>code<br>is<br>poetry</p>
  
  

Writing code is poetry

Writing code is poetry

Writing
code
is
poetry
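The three renderings above can be imitated in a few lines of Python. This is a simplified sketch of default browser behavior, not a full implementation of whitespace handling (which in practice also depends on CSS); the helper name collapse_whitespace is an assumption for illustration.

```python
import html
import re


def collapse_whitespace(text):
    # Browsers collapse runs of ordinary whitespace into a single space.
    # The explicit character class leaves the non-breaking space (U+00A0)
    # untouched, which str.split() would not.
    return re.sub(r"[ \t\n\r\f]+", " ", text)


# Case 1: repeated spaces collapse into one
spaced = "Writing     code        is           poetry"
print(collapse_whitespace(spaced))  # Writing code is poetry

# Case 2: non-breaking spaces survive collapsing
nbsp = html.unescape("Writing&nbsp;code&nbsp;is&nbsp;poetry")
collapsed = collapse_whitespace(nbsp)
print(collapsed == nbsp)  # True

# Case 3: <br> forces a line break
broken = "Writing<br>code<br>is<br>poetry".replace("<br>", "\n")
print(broken)
```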

Summary

  1. What we see when surfing the web is an interpreted version of the marked-up source code that holds the content.

  2. Tags form the core of the markup used in HTML and can be used to define structure, appearance, and content.