W5-1: RWS Ch.1 Intro to Web Scraping

Introduction to Web Scraping

What is web scraping?

Web scraping is the process of extracting a structured representation of data from a website.

Example

To collect user comments on an online news article
Targeting the markup of the web page
Parsing the web page into a tree representation
Running an R script automatically
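
The steps of this example can be sketched in a few lines. This is only a minimal illustration, written in Python rather than the R used in the course, and it makes several assumptions: the page source in `PAGE` is invented, and the idea that comments sit in `<div class="comment">` elements is hypothetical (a real site will use its own markup). Python's standard-library parser walks the markup as a stream of tags instead of building a full tree, which is enough for this small case.

```python
from html.parser import HTMLParser

# Hypothetical page source; a real article page would be downloaded first.
PAGE = """
<html><body>
  <h1>Some news article</h1>
  <div class="comment">Great article!</div>
  <div class="comment">I disagree with the second point.</div>
  <div class="footer">Contact us</div>
</body></html>
"""


class CommentCollector(HTMLParser):
    """Collects the text inside <div class="comment"> elements.

    Assumes comment <div>s are not nested inside each other.
    """

    def __init__(self):
        super().__init__()
        self.in_comment = False  # True while inside a targeted <div>
        self.comments = []

    def handle_starttag(self, tag, attrs):
        # Target the markup: only <div> tags whose class is "comment".
        if tag == "div" and dict(attrs).get("class") == "comment":
            self.in_comment = True
            self.comments.append("")

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_comment = False

    def handle_data(self, data):
        if self.in_comment:
            self.comments[-1] += data


def collect_comments(page_source):
    """Return the text of every targeted comment element."""
    parser = CommentCollector()
    parser.feed(page_source)
    parser.close()
    return [comment.strip() for comment in parser.comments]


print(collect_comments(PAGE))
```

Running the script again on a fresh copy of the page is what makes the collection automatic.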

1. Data on the internet

The World Wide Web (WWW) contains all kinds of information from diverse sources, much of which is useful for research.
The data may be spread not only across multiple websites but also across multiple pages and sections within a single website.
Web scraping is used to extract data from one or more websites and to transform unstructured data from the internet into a local database.
When a website offers an API, data collection is fast and can be performed without any web scraping technique.
However, many websites do not provide APIs. Web scraping is the solution in that case.
One of the most useful features of a web page is that it can be viewed as a hierarchical, rooted, labeled tree. In this tree, the labels are the tags of the Hypertext Markup Language (HTML) syntax, and the tree hierarchy represents the different nesting levels of the elements that make up the web page.
The representation of a web page as an ordered, rooted, labeled tree is referred to as the Document Object Model (DOM). The starting point is that an HTML web page is plain text with HTML tags, which the browser interprets to represent web-specific items.
HTML tags can be nested in a hierarchical structure. In the DOM, this hierarchy is captured by the document tree, whose nodes correspond to the HTML elements.
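
To make the tree view concrete, the following sketch turns a small HTML string into a labeled tree and prints one tag per line, indented by nesting level. It is a simplified stand-in for a real DOM, not a browser's implementation: the `Node` and `TreeBuilder` classes are hypothetical helpers, and the code assumes well-formed HTML in which every opening tag has a closing tag.

```python
from html.parser import HTMLParser


class Node:
    """One node of the tree, labeled with its tag name."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a labeled tree from well-formed HTML (assumption: no unclosed tags)."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        self.current = node  # descend one nesting level

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent  # climb back up


def build_tree(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as a list of lines, indented by nesting level."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


SOURCE = (
    "<html><head><title>First HTML</title></head>"
    "<body><p>I am your first HTML file!</p></body></html>"
)
print("\n".join(outline(build_tree(SOURCE))))
```

The printed outline shows `html` at the top, `head` and `body` one level below it, and `title` and `p` nested inside those.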

Document Object Model (DOM)

2. HTML

2.1 Browser presentation and source code

HTML

HTML’s marked-up structure
- Markup definitions: the tags
Web content is an interpreted version of the source code
- How the document is structured and the function of its various parts: headlines, links, tables, etc…
- Element inspector

2.2 Syntax rules

Tags, elements, and attributes

Elements

<title>First HTML</title>

Attributes

<a href="https://en.wikipedia.org/wiki/Main_Page">Link to Wikipedia!</a>

https://en.wikipedia.org/wiki/Main_Page

Tree structure
<p>First HTML</p>

I am your first HTML file!

A tree perspective on HTML

Comments

HTML offers the possibility to insert comments into the code that are not evaluated and therefore not displayed in the browser.

<!-- I am a comment.
  I can span several lines and I am able to store additional 
    content that is not displayed by the browser. -->
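
Although comments are not displayed, they are still part of the source code and a scraper can read them. The sketch below separates comments from visible text using Python's standard-library parser; the class name `CommentFinder` and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class CommentFinder(HTMLParser):
    """Keeps HTML comments and visible text in separate lists."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.visible_text = []

    def handle_comment(self, data):
        # Called with the text between <!-- and -->.
        self.comments.append(data.strip())

    def handle_data(self, data):
        if data.strip():
            self.visible_text.append(data.strip())


def split_comments(source):
    finder = CommentFinder()
    finder.feed(source)
    finder.close()
    return finder.comments, finder.visible_text


SOURCE = "<p>Visible text</p><!-- I am a comment. -->"
comments, text = split_comments(SOURCE)
print(comments)
print(text)
```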

Reserved and special characters
```
<p>5 &lt; 6 but 7 &gt; 3 </p>
```

5 < 6 but 7 > 3
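
The same translation between entities and characters is needed when scraped text is processed outside the browser. As a small sketch, Python's standard `html` module converts in both directions; the sample strings are taken from the example above.

```python
import html

# Decode entities found in the source code into plain characters.
decoded = html.unescape("5 &lt; 6 but 7 &gt; 3")
print(decoded)

# Encode reserved characters so they can be placed safely inside HTML.
encoded = html.escape("5 < 6 but 7 > 3")
print(encoded)
```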

HTML entities

Character	Entity name	Explanation
"	&quot;	quotation mark
'	&apos;	apostrophe
&	&amp;	ampersand
<	&lt;	less than
>	&gt;	greater than
	&nbsp;	non-breaking space

Spaces and line breaks

Writing code is poetry

Writing
code
is
poetry

Writing
code
is
poetry

2.3 Tags and attributes

The anchor tag <a>

The tag <a> is what turns HTML from just a markup language into a hypertext markup language by enabling HTML documents to link to other documents.

- Linking to another document
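
For scraping, the interesting part of an anchor is usually its `href` attribute, since it tells a script which page to visit next. The following is a minimal sketch that lists link targets and link texts; `LinkCollector` is a hypothetical helper, and the sample anchor is the Wikipedia link used earlier in these notes.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(source):
    collector = LinkCollector()
    collector.feed(source)
    collector.close()
    return collector.links


SOURCE = '<a href="https://en.wikipedia.org/wiki/Main_Page">Link to Wikipedia!</a>'
print(extract_links(SOURCE))
```

Anchors without an `href` attribute are skipped in this sketch.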

The metadata tag <meta>

The <meta> tag provides meta information on the HTML document.

- Specifying keywords

<meta name="keywords" content="Automation, Data, R">

- Declaring character encoding

<meta charset="ISO-8859-1"/>

- Defining character encodings

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
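
Meta information is not displayed in the browser, but a script can read it, for example to find out which character encoding to use before processing the text. Below is a rough sketch under the assumption that only the `keywords` and `charset` forms shown above are of interest; `MetaCollector` and `read_meta` are illustrative names.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Stores the attributes of every <meta> tag as a dictionary."""

    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            self.meta.append(dict(attrs))


def read_meta(source):
    """Return the keywords and the declared charset, where present."""
    collector = MetaCollector()
    collector.feed(source)
    collector.close()
    info = {}
    for attrs in collector.meta:
        if attrs.get("name") == "keywords":
            content = attrs.get("content", "")
            info["keywords"] = [word.strip() for word in content.split(",")]
        if "charset" in attrs:
            info["charset"] = attrs["charset"]
    return info


HEAD = (
    '<head><meta name="keywords" content="Automation, Data, R">'
    '<meta charset="ISO-8859-1"/></head>'
)
print(read_meta(HEAD))
```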

The external reference tag <link>

The <link> tag is used to link to and include information and external files.

- Specifying style sheets to use

<link rel="stylesheet" href="htmlresources/awesomestyle.css" type="text/css"/>

- Specifying the icon associated with the website

<link rel="shortcut icon" href="htmlresources/favicon.ico" type="image/x-icon"/>

Emphasizing tags <b>, <i>, <strong>
- Text with bold type setting
- Text set in italics
- Text defined as important
The paragraph tag <p>
Heading tags <h1>, <h2>, <h3>, …
Listing content with <ul>, <ol>, and <dl>
The organizational tags <div> and <span>

While <div> and <span> themselves do not change the appearance of the content they enclose, these tags are used to group parts of the document.

<div> defines groups across lines, tags, and paragraphs
<span> is used for in-line grouping
CSS
div.happy {color:pink;font-family:"Comic Sans MS";font-size:120%}
span.happy {color:pink;font-family:"Comic Sans MS";font-size:120%}
```
  <link href="htmlresources/awesomestyle.css" rel="stylesheet" type="text/css"/>
```

The purpose of CSS is to separate content from layout to improve the document’s accessibility. Defining styles outside of an HTML document and assigning them via the class attribute enables the web designer to reuse styles across elements and documents. This enables developers to change a style in one single place, within the CSS file, with effects on all elements and documents using this style.

Table tags <table>, <tr>, <td>, and <th>

<tr> for new rows (lines)
<td> for defining cells
<th> for header cells
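
Because tables follow this fixed row-and-cell structure, they are among the easiest parts of a page to scrape. The sketch below reads a table into a list of rows; the table content is invented for illustration, `TableReader` is a hypothetical helper, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Reads <tr>, <th>, and <td> elements into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # a new row starts
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")  # a new cell in the current row

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def read_table(source):
    reader = TableReader()
    reader.feed(source)
    reader.close()
    return [[cell.strip() for cell in row] for row in reader.rows]


TABLE = (
    "<table>"
    "<tr><th>Tag</th><th>Purpose</th></tr>"
    "<tr><td>tr</td><td>row</td></tr>"
    "<tr><td>td</td><td>cell</td></tr>"
    "</table>"
)
for row in read_table(TABLE):
    print(row)
```

The first row holds the header cells, so it can serve as column names when the rows are stored locally.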

Summary

What we get presented when surfing the web is an interpreted version of the marked up source code that holds the content.
Tags form the core of the markup used in HTML and can be used to define structure, appearance, and content.

Assignment

Write a basic HTML document. You may consult the examples of HTML documents available on our E-Class page. Detailed instructions will be announced on E-Class.