Introduction to Web Scraping

What is web scraping?

Web scraping is the process of extracting data from a website and turning it into a structured representation.

Example

  1. Collecting user comments on an online news article

  2. Targeting the markup of the web page

  3. Parsing the web page into a tree representation

  4. Running an R script automatically
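The steps above can be sketched in a few lines. This is only an illustration in Python using the standard library, not the R workflow of the course; the page fragment and the class name "comment" are assumptions made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the class name "comment" is an assumption.
PAGE = """
<html><body>
  <h1>News article</h1>
  <div class="comment">Great article!</div>
  <div class="ad">Buy now</div>
  <div class="comment">I disagree with the author.</div>
</body></html>
"""

class CommentCollector(HTMLParser):
    """Collect the text of every <div class="comment"> element."""

    def __init__(self):
        super().__init__()
        self.in_comment = False
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "div" and ("class", "comment") in attrs:
            self.in_comment = True
            self.comments.append("")

    def handle_endtag(self, tag):
        # Simplification: assumes comment <div>s contain no nested <div>
        if tag == "div":
            self.in_comment = False

    def handle_data(self, data):
        if self.in_comment:
            self.comments[-1] += data

def collect_comments(html_text):
    parser = CommentCollector()
    parser.feed(html_text)
    parser.close()
    return [c.strip() for c in parser.comments]

print(collect_comments(PAGE))
```

The sketch targets the markup (the class attribute) rather than the visible text, which is why the advertisement block is skipped.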

1. Data on the internet

  1. The World Wide Web (WWW) contains all kinds of information from diverse sources, much of which is useful for research.

  2. The data may be spread not only across multiple websites but also across multiple pages and sections within a single website.

  3. Web scraping is used to extract data from websites and to transform unstructured data from the internet into structured data in a local database.

  4. When a website provides an API, data collection is fast and can be performed without any web scraping technique.

  5. However, many websites do not provide APIs. Web scraping is the solution in that case.

  6. One of the most useful representations of a web page is a labeled, ordered rooted tree. In this tree, the Hypertext Markup Language (HTML) tags serve as the node labels, and the tree hierarchy represents the different nesting levels of the elements that make up the web page.

  7. Representing a web page as a labeled, ordered rooted tree is referred to as the Document Object Model (DOM). The page itself is plain text marked up with HTML tags, which the browser interprets to render web-specific items.

  8. HTML tags can be nested hierarchically. The DOM captures this hierarchy as a document tree whose nodes correspond to the HTML elements.
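To make the tree idea concrete, the following sketch prints one line per element, indented by its nesting level. It is a minimal illustration in Python (standard library only) that assumes well-formed input; the sample document DOC is hypothetical, assembled for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not increase depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class TreeOutline(HTMLParser):
    """Record each opening tag, indented by its depth in the document tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

def outline(html_text):
    parser = TreeOutline()
    parser.feed(html_text)
    parser.close()
    return parser.lines

# Hypothetical minimal document used only for illustration.
DOC = ("<html><head><title>First HTML</title></head>"
       "<body><p>I am your first HTML file!</p></body></html>")

print("\n".join(outline(DOC)))
```

Each level of indentation in the printed outline corresponds to one nesting level of the tags, with html as the root.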

Document Object Model (DOM)

2. HTML

2.1 Browser presentation and source code

HTML

  1. HTML’s marked-up structure

    • Markup definitions: the tags
  2. Web content is an interpreted version of the source code

    • How the document is structured and the function of its various parts: headlines, links, tables, etc.

    • Element inspector

2.2 Syntax rules

  1. Tags, elements, and attributes

Elements

<title>First HTML</title>

Attributes

<a href="https://en.wikipedia.org/wiki/Main_Page">Link to Wikipedia!</a>

https://en.wikipedia.org/wiki/Main_Page
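The distinction between tag, attribute, and content can be seen by taking the link above apart. The sketch below is one possible way to do this in Python with the standard library; the class name LinkCollector is made up for the example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute and the enclosed text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []       # one dict per link: {"href": ..., "text": ...}
        self._current = None  # link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs holds the attributes as (name, value) pairs
            self._current = {"href": dict(attrs).get("href"), "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append(self._current)
            self._current = None

def collect_links(html_text):
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links

SNIPPET = '<a href="https://en.wikipedia.org/wiki/Main_Page">Link to Wikipedia!</a>'
print(collect_links(SNIPPET))
```

Here "a" is the tag, href is the attribute, and "Link to Wikipedia!" is the content of the element.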

  2. Tree structure

    <p>First HTML</p>

    I am your first HTML file!

A tree perspective on HTML

  3. Comments

HTML offers the possibility to insert comments into the code that are not evaluated and therefore not displayed in the browser.

<!-- I am a comment.
  I can span several lines and I am able to store additional 
    content that is not displayed by the browser. -->
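Although comments are not displayed, they are still part of the source code and therefore visible to a scraper. A minimal Python sketch (standard library only; the sample snippet is hypothetical) that separates comments from visible text:

```python
from html.parser import HTMLParser

class CommentFinder(HTMLParser):
    """Separate HTML comments from the text a browser would display."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.text = []

    def handle_comment(self, data):
        # Called with the content between <!-- and -->
        self.comments.append(data.strip())

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def split_comments(html_text):
    parser = CommentFinder()
    parser.feed(html_text)
    parser.close()
    return parser.comments, parser.text

# Hypothetical snippet for illustration.
SNIPPET = "<p>Visible text</p><!-- I am a comment. -->"
comments, text = split_comments(SNIPPET)
print(comments, text)
```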
     
  4. Reserved and special characters

    <p>5 &lt; 6 but 7 &gt; 3 </p>

5 < 6 but 7 > 3
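The conversion between reserved characters and their entities can be tried out directly. The following sketch uses the functions escape and unescape from Python's standard html module; it is meant as an illustration, not as part of the course's R workflow.

```python
import html

source = "5 &lt; 6 but 7 &gt; 3"

# Entities -> characters, as a browser would display them
displayed = html.unescape(source)
print(displayed)

# Characters -> entities, as they must be written in the source code
escaped = html.escape(displayed)
print(escaped)
```

Scraped text often still contains entities, so decoding them is a common cleaning step before further analysis.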

HTML entities

Character   Entity name   Explanation
"           &quot;        quotation mark
'           &apos;        apostrophe
&           &amp;         ampersand
<           &lt;          less than
>           &gt;          greater than
(space)     &nbsp;        non-breaking space

  5. Spaces and line breaks

    Writing code is poetry

    Writing
     code
     is
     poetry

    Writing
        code
          is
     poetry
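All three variants above are displayed identically, because browsers collapse runs of spaces and line breaks in ordinary text into a single space. The Python sketch below approximates this behavior with a regular expression; it ignores special cases such as the <pre> tag, and the function name is made up for the example.

```python
import re

def collapse_whitespace(text):
    """Approximate browser rendering: runs of ASCII whitespace become one space."""
    return re.sub(r"[ \t\n\r\f]+", " ", text).strip()

variants = [
    "Writing code is poetry",
    "Writing\n code\n is\n poetry",
    "Writing\n    code\n      is\n poetry",
]

rendered = {collapse_whitespace(v) for v in variants}
print(rendered)
```

The non-breaking space is deliberately not matched by the pattern, which mirrors why &nbsp; can be used to force additional spacing.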

2.3 Tags and attributes

  1. The anchor tag <a>

The tag <a> is what turns HTML from just a markup language into a hypertext markup language by enabling HTML documents to link to other documents.

- Linking to another document
  2. The metadata tag <meta>

The <meta> tag provides meta information on the HTML document.

- Specifying keywords

<meta name="keywords" content="Automation, Data, R">

- Declaring character encoding

<meta charset="ISO-8859-1"/>

- Defining character encodings

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
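Meta information is useful when scraping, for example to find out which encoding a page declares. The sketch below reads the three kinds of <meta> tags shown above; it is a simplified Python illustration (standard library only), and the class name MetaReader is an assumption of this example.

```python
from html.parser import HTMLParser

class MetaReader(HTMLParser):
    """Collect keywords and the declared character encoding from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        if "charset" in a:
            # <meta charset="...">
            self.meta["charset"] = a["charset"]
        elif (a.get("http-equiv") or "").lower() == "content-type":
            # <meta http-equiv="content-type" content="text/html; charset=...">
            for part in (a.get("content") or "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.lower() == "charset":
                    self.meta["charset"] = value
        elif a.get("name") and "content" in a:
            # <meta name="keywords" content="...">
            self.meta[a["name"].lower()] = a["content"]

def read_meta(html_text):
    parser = MetaReader()
    parser.feed(html_text)
    parser.close()
    return parser.meta

HEAD = """
<meta name="keywords" content="Automation, Data, R">
<meta charset="ISO-8859-1"/>
"""
print(read_meta(HEAD))
```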
  3. The external reference tag <link>

The <link> tag is used to link to and include information and external files.

- Specifying style sheets to use

<link rel="stylesheet" href="htmlresources/awesomestyle.css" type="text/css"/>

- Specifying the icon associated with the website

<link rel="shortcut icon" href="htmlresources/favicon.ico" type="image/x-icon"/>
  4. Emphasizing tags <b>, <i>, <strong>

    • <b>: text with bold type setting
    • <i>: text set in italics
    • <strong>: text defined as important
  5. The paragraph tag <p>

  6. Heading tags <h1>, <h2>, <h3>, …

  7. Listing content with <ul>, <ol>, and <dl>

  8. The organizational tags <div> and <span>

While <div> and <span> themselves do not change the appearance of the content they enclose, these tags are used to group parts of the document.

  • <div> defines groups across lines, tags, and paragraphs

  • <span> is used for in-line grouping

  • CSS

      div.happy {color:pink; font-family:"Comic Sans MS"; font-size:120%}
      span.happy {color:pink; font-family:"Comic Sans MS"; font-size:120%}

      <link href="htmlresources/awesomestyle.css" rel="stylesheet" type="text/css"/>

The purpose of CSS is to separate content from layout to improve the document’s accessibility. Defining styles outside of an HTML document and assigning them via the class attribute enables the web designer to reuse styles across elements and documents. A style can then be changed in one single place, within the CSS file, with effects on all elements and documents using this style.

  9. Table tags <table>, <tr>, <td>, and <th>

  • <tr> for new rows
  • <td> for defining cells
  • <th> for header cells
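Because the table tags describe rows and cells explicitly, an HTML table maps naturally onto a list of rows. A minimal Python sketch (standard library only) under the assumption of a simple, well-formed table without nested tables; the sample table is hypothetical.

```python
from html.parser import HTMLParser

class TableReader(HTMLParser):
    """Read a simple HTML table into a list of rows, each a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text of the cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # <tr> starts a new row
        elif tag in ("td", "th"):
            self._cell = ""           # <td>/<th> start a new cell

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append(self._cell.strip())
            self._cell = None

def read_table(html_text):
    parser = TableReader()
    parser.feed(html_text)
    parser.close()
    return parser.rows

# Hypothetical table for illustration.
TABLE = """
<table>
  <tr><th>Tag</th><th>Purpose</th></tr>
  <tr><td>tr</td><td>row</td></tr>
  <tr><td>td</td><td>cell</td></tr>
</table>
"""
print(read_table(TABLE))
```

The first row holds the header cells, so it can serve as column names when the result is stored as a data table.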

Summary

  1. What we are presented with when surfing the web is an interpreted version of the marked-up source code that holds the content.

  2. Tags form the core of the markup used in HTML and can be used to define structure, appearance, and content.

Assignment

Write a basic HTML document. You may consult the examples of HTML documents available on our E-Class page. Detailed instructions will be announced on E-Class.