What is web scraping?
Web scraping is the process of extracting a structured representation of data from a website.
To collect user comments on an online news article
Targeting the markup of the web page
Parsing the web page into a tree representation
Running an R script automatically
The World Wide Web (WWW) contains all kinds of information from diverse sources, much of which is useful for research.
The data may be spread not only across multiple websites but also across multiple pages and sections within a single website.
Web scraping is used to extract data from one or more websites and to transform unstructured data from the internet into structured data in a local database.
When a website provides an API, data collection is fast and can be performed without any web scraping technique.
However, many websites do not provide APIs. Web scraping is the solution in that case.
One of the most useful representations of a web page is a labeled, hierarchical rooted tree. In such a tree, the labels are the tags of the Hypertext Markup Language (HTML) syntax, and the tree hierarchy represents the different nesting levels of the elements that make up the web page.
The representation of a web page as an ordered, labeled rooted tree is referred to as the Document Object Model (DOM). The HTML source itself is plain text with HTML tags; the browser interprets it to build this tree and to display the elements of the page.
HTML tags can be nested in a hierarchical structure. The document tree captures this hierarchy: each element in the DOM is a node, and the tags nested inside it become its children.
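The tree view of a page can be made concrete with a few lines of code. The following is a minimal sketch in Python using only the standard library; the sample page, the class name `TreeOutline`, and the function `outline_of` are assumptions made for illustration, not part of the course material. It records each opening tag together with its nesting depth.

```python
from html.parser import HTMLParser

# Hypothetical sample page, assumed for illustration only.
SAMPLE_HTML = """<html>
<head><title>First HTML</title></head>
<body><p>I am your first HTML file!</p></body>
</html>"""


class TreeOutline(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        # A new element starts one level below its parent.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # Closing a tag moves back up one level.
        # Note: this sketch assumes every tag is closed explicitly,
        # so void elements such as <br> are not handled.
        self.depth -= 1


def outline_of(html_text):
    parser = TreeOutline()
    parser.feed(html_text)
    parser.close()
    return parser.outline


if __name__ == "__main__":
    # Print the tags indented by their depth in the tree.
    for depth, tag in outline_of(SAMPLE_HTML):
        print("  " * depth + tag)
```

The indentation of the printed tags mirrors the nesting levels of the document tree: `head` and `body` are children of `html`, and `title` and `p` sit one level deeper.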
Document Object Model (DOM)
HTML
HTML's marked-up structure
Web content is an interpreted version of the source code
How the document is structured and the function of its various parts: headlines, links, tables, etc.
Element inspector
Elements
<title>First HTML</title>
Attributes
<a href="https://en.wikipedia.org/wiki/Main_Page">Link to Wikipedia!</a>
https://en.wikipedia.org/wiki/Main_Page
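Reading an attribute value out of an element is a typical scraping step. The sketch below, in Python with the standard library only, pulls the `href` attribute and the link text from the anchor element shown above; the names `LinkCollector` and `collect_links` are hypothetical and chosen for this example.

```python
from html.parser import HTMLParser

# The anchor element from the example above.
SNIPPET = '<a href="https://en.wikipedia.org/wiki/Main_Page">Link to Wikipedia!</a>'


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a> with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def collect_links(html_text):
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for href, text in collect_links(SNIPPET):
        print(text, "->", href)
```

Anchors without an `href` attribute are skipped in this sketch, which is a simplification rather than a rule of HTML.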
Tree structure
First HTML
I am your first HTML file!
A tree perspective on HTML
HTML offers the possibility to insert comments into the code that are not evaluated and therefore not displayed in the browser.
<!-- I am a comment.
I can span several lines and I am able to store additional
content that is not displayed by the browser. -->
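Although comments are not displayed, they are still part of the source code and therefore visible to a scraper. As a rough illustration, the following Python sketch (standard library only; the sample markup and the names `CommentSplitter` and `split_comments` are assumptions) separates displayed text from comment text.

```python
from html.parser import HTMLParser

# Hypothetical sample: one visible paragraph followed by the comment above.
SAMPLE = """<p>Visible text</p>
<!-- I am a comment.
I can span several lines and I am able to store additional
content that is not displayed by the browser. -->"""


class CommentSplitter(HTMLParser):
    """Separate displayed text from HTML comments."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.comments = []

    def handle_data(self, data):
        # Ignore whitespace-only text between elements.
        if data.strip():
            self.visible.append(data.strip())

    def handle_comment(self, data):
        # data is the text between <!-- and -->.
        self.comments.append(data.strip())


def split_comments(html_text):
    parser = CommentSplitter()
    parser.feed(html_text)
    parser.close()
    return parser.visible, parser.comments


if __name__ == "__main__":
    visible, comments = split_comments(SAMPLE)
    print("Displayed:", visible)
    print("Comments:", comments)
```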
Reserved and special characters
<p>5 &lt; 6 but 7 &gt; 3 </p>
5 < 6 but 7 > 3
HTML entities
Character | Entity name | Explanation |
---|---|---|
" | &quot; | quotation mark |
' | &apos; | apostrophe |
& | &amp; | ampersand |
< | &lt; | less than |
> | &gt; | greater than |
(space) | &nbsp; | non-breaking space |
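When scraped text is processed, the reserved characters listed above usually have to be converted between their literal and entity forms. A minimal sketch using Python's standard `html` module follows; the variable names are chosen for illustration.

```python
import html

# Literal text containing reserved characters.
raw = "5 < 6 but 7 > 3"

# Encode reserved characters as entities so the text is safe inside HTML.
escaped = html.escape(raw)

# Decode entities back into the characters they stand for.
decoded = html.unescape(escaped)

# Decode each entity name from the table into its character.
entity_names = ["&quot;", "&apos;", "&amp;", "&lt;", "&gt;", "&nbsp;"]
entity_examples = {name: html.unescape(name) for name in entity_names}

if __name__ == "__main__":
    print(escaped)
    print(decoded)
    for name, char in entity_examples.items():
        print(name, "->", repr(char))
```

Note that `&nbsp;` decodes to the non-breaking space character (U+00A0), which looks like an ordinary space but is a different character when strings are compared.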
Spaces and line breaks
Writing code is poetry
Writing
code
is
poetry
Writing
code
is
poetry
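In ordinary text content, browsers collapse runs of spaces and line breaks in the source into a single space, and an explicit `<br>` tag is needed to force a new line. The Python sketch below imitates that behavior in a simplified way; the function names `collapse_whitespace` and `render_with_breaks` are hypothetical, and the regular expression is a rough approximation, not a full HTML renderer.

```python
import re


def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and newlines with one space."""
    return " ".join(text.split())


def render_with_breaks(source):
    """Collapse whitespace, but start a new line at each <br> tag."""
    parts = re.split(r"<br\s*/?>", source)
    return "\n".join(collapse_whitespace(part) for part in parts)


if __name__ == "__main__":
    # Line breaks in the source alone do not produce line breaks in the output.
    print(collapse_whitespace("Writing\n   code    is\npoetry"))
    # Explicit <br> tags do.
    print(render_with_breaks("Writing<br>code<br>is<br>poetry"))
```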
What we get presented when surfing the web is an interpreted version of the marked up source code that holds the content.
Tags form the core of the markup used in HTML and can be used to define structure, appearance, and content.
Write a basic HTML document. You may consult the examples of HTML documents available on our E-Class page. Detailed instructions will be announced on E-Class.