Week4

Ch2. HTML

2.1 Browser presentation and source code

HTML

HTML’s marked-up structure
- Markup definitions: the tags
Web content is an interpreted version of the source code
- How the document is structured and the function of its various parts: headlines, links, tables, etc…
- Element inspector

2.2 Syntax rules

Tags, elements, and attributes

Elements

<title>First HTML</title>

Attributes

<a href="http://www.r-datacollection.com/">Link to Homepage</a>

http://www.r-datacollection.com/bookmaterials.html

Tree structure
<p>First HTML</p>
I am your first HTML-file!

A tree perspective on HTML

Comments
Reserved and special characters
```
<p>5 &lt; 6 but 7 &gt; 3 </p>
```

HTML entities

Character	Entity name	Explanation
"	&quot;	quotation mark
'	&apos;	apostrophe
&	&amp;	ampersand
<	&lt;	less than
>	&gt;	greater than
 	&nbsp;	non-breaking space
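
The entity replacement can be sketched in a few lines of base R. This is only an illustration: `escape_html()` is a hypothetical helper written for these notes, not a function from any package, and it covers only four of the reserved characters listed above. The ampersand is replaced first so that the entities produced by the later steps are not escaped a second time.

```r
# Hypothetical helper: replace reserved characters by their HTML entities.
escape_html <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)   # must come first
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x <- gsub("\"", "&quot;", x, fixed = TRUE)
  x
}

# Build the paragraph from the example above out of plain text
escaped <- escape_html("5 < 6 but 7 > 3")
paragraph <- paste0("<p>", escaped, "</p>")
paragraph
```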

Document type definition
```
<!DOCTYPE html>
```
Spaces and line breaks

Writing code is poetry

Writing
code
is
poetry

Writing
code
is
poetry
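
The three variants above are displayed identically because browsers collapse runs of spaces and line breaks in ordinary text into a single space. The base R sketch below only approximates that behavior with a regular expression; `collapse_whitespace()` is a hypothetical helper, and it is not what a browser actually runs.

```r
# Three source variants of the same sentence with different whitespace
variants <- c(
  "Writing code is poetry",
  "Writing\ncode\nis\npoetry",
  "Writing\n\n   code   is\n  poetry"
)

# Hypothetical helper: squeeze any run of whitespace into one space
collapse_whitespace <- function(x) {
  trimws(gsub("[[:space:]]+", " ", x))
}

rendered <- collapse_whitespace(variants)

# All variants reduce to the same displayed text
unique(rendered)
```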

2.3 Tags and attributes

The anchor tag <a>

The tag <a> is what turns HTML from just a markup language into a hypertext markup language by enabling HTML documents to link to other documents.

- Linking to another document

The metadata tag <meta>

The <meta> tag provides meta information on the HTML document.

- Specifying keywords
- Asking robots not to index the page or to follow its links
- Declaring character encoding
- Defining character encodings

The external reference tag <link>

The <link> tag is used to link to and include information and external files.

- Specifying style sheets to use
- Specifying the icon associated with the website

Emphasizing tags <b>, <i>, <strong>
- Text with bold type setting
- Text set in italics
- Text defined as important
The paragraph tag <p>
Heading tags <h1>, <h2>, <h3>, …
Listing content with <ul>, <ol>, and <dl>
The organizational tags <div> and <span>

While <div> and <span> themselves do not change the appearance of the content they enclose, these tags are used to group parts of the document.

- <div> defines groups across lines, tags, and paragraphs
- <span> is used for in-line grouping
- CSS
     div.happy {color:pink;font-family:"Comic Sans MS";font-size:120%}
     span.happy {color:pink;font-family:"Comic Sans MS";font-size:120%} 
     
     <link href="htmlresources/awesomestyle.css" rel="stylesheet" type="text/css"/>

The purpose of CSS is to separate content from layout, which improves the document’s accessibility. Defining styles outside of an HTML document and assigning them via the class attribute lets the web designer reuse styles across elements and documents. A style can then be changed in one single place (the CSS file), and the change takes effect in all elements and documents that use it.

The <form> tag and its companions
The foreign script tag <script>

HTTP

Table tags <table>, <tr>, <td>, and <th>
- <tr> for new rows
- <td> for defining cells
- <th> for header cells

2.4 Parsing

Loading and representing the contents of HTML/XML files in an R session

Inspecting content on the Web: browser to display HTML content nicely
Importing HTML files into R and extracting information from them: a parser in R to construct useful representations of HTML documents

What is parsing?

Reading vs. Parsing

Reading does not try to understand the formal grammar that underlies HTML; it merely recognizes the sequence of symbols in the HTML file. In practice, it just loads the content of an HTML file into an R session.

url <- "https://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=102&oid=001&aid=0011898989"
example <- readLines(url)

## Warning in readLines(url): incomplete final line found on
## 'https://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=102&oid=001&aid=0011898989'

example <- paste0(example, collapse = " ")

class(example)

## [1] "character"
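
The same point can be reproduced offline. The sketch below writes a small HTML document (made up for these notes, reusing the "First HTML" example from Section 2.2) to a temporary file and reads it back with `readLines()`. The result is one plain character string, so the only operations available are string operations such as `grepl()`.

```r
# A small, made-up HTML document written to a temporary file
html_lines <- c(
  "<!DOCTYPE html>",
  "<html>",
  "<head><title>First HTML</title></head>",
  "<body><p>I am your first HTML-file!</p></body>",
  "</html>"
)
tmp_file <- tempfile(fileext = ".html")
writeLines(html_lines, tmp_file)

# Reading: load the symbols, one element per line, then join them
raw_html <- readLines(tmp_file)
raw_html <- paste0(raw_html, collapse = " ")

class(raw_html)    # a character vector, nothing HTML-specific
length(raw_html)   # a single string without any tree structure

# Only string matching is possible on the read-in content
has_title <- grepl("<title>", raw_html, fixed = TRUE)
has_title
```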

library(httr)
url <- "https://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=102&oid=001&aid=0011898989"
example <- httr::GET(url)
example

## Response [https://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=102&oid=001&aid=0011898989]
##   Date: 2020-09-23 08:26
##   Status: 200
##   Content-Type: text/html;charset=EUC-KR
##   Size: 123 kB
## <!DOCTYPE HTML> 
## <html lang="ko"> 
## <head>
## <meta charset="euc-kr">
## <meta http-equiv="X-UA-Compatible" content="IE=edge">
## <meta name="referrer" contents="always">
## <meta name="viewport" content="width=1106" />
## <title>질병청 "'상온 노출' 백신 문제없으면 즉시 접종 재개"(종합) : 네이버 뉴스</title>
## 
## 
## ...

class(example)

## [1] "response"

GET() is agnostic about the different tag elements (name, attribute, values, etc.) and produces results that do not reflect the document’s internal hierarchy as implied by the nested tags in any sensible way.

To achieve a useful representation of HTML files, we need to employ a program that understands the special meaning of the markup structures and reconstructs the implied hierarchy of an HTML file within some R-specific data structure.

Transformation from any HTML file to a queryable Document Object Model: Parsing using the XML package in two steps

  1. ```htmlParse()``` first parses the entire target document and creates the DOM in a tree-like data structure of the C language.
  2. The C-level node structure is converted into an object of the R language through handler functions.

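library(XML)
# `example` is an httr response object, so extract the body as text before parsing
example_text <- httr::content(example, as = "text", encoding = "EUC-KR")
parsed_example <- htmlParse(example_text, asText = TRUE, encoding = "UTF-8")
class(parsed_example)
parsed_example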
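
For a version that runs without a network connection, the following sketch parses a small in-memory document with the XML package. The HTML string is made up for these notes from the examples in Section 2.2, and the XPath queries are just one way to show that the parsed object can be queried by element and attribute. It assumes the XML package is installed.

```r
library(XML)

# A small, made-up HTML document held in a character string
html_string <- paste(c(
  "<!DOCTYPE html>",
  "<html>",
  "<head><title>First HTML</title></head>",
  "<body>",
  "<p>I am your first HTML-file!</p>",
  '<a href="http://www.r-datacollection.com/">Link to Homepage</a>',
  "</body>",
  "</html>"
), collapse = "\n")

# asText = TRUE tells the parser that the input is content, not a file name
parsed_doc <- htmlParse(html_string, asText = TRUE)
class(parsed_doc)

# Unlike the raw string, the parsed document knows about elements and attributes
title_text  <- xpathSApply(parsed_doc, "//title", xmlValue)
link_target <- xpathSApply(parsed_doc, "//a", xmlGetAttr, "href")
title_text
link_target
```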

Discarding nodes

Discarding unnecessary parts of web documents in the parsing stage can help mitigate memory issues and enhance extraction speed. We can specify handlers as a list of named functions, where the name corresponds to a node name and the function specifies what should happen with the node.

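library(XML)
# Handler that returns NULL for every <body> node, which drops it from the tree
h1 <- list("body" = function(x){NULL})

# `example` is an httr response object, so pass its body as text
example_text <- httr::content(example, as = "text", encoding = "EUC-KR")
parsed_example_body <- htmlTreeParse(example_text, asText = TRUE, handlers = h1, asTree = TRUE, encoding = "UTF-8")
parsed_example_body$children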
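
The effect of such a handler is easier to see on a tiny document. The sketch below uses a made-up HTML string and a handler list named `drop_body` (a name chosen for these notes). After parsing, the `<html>` node should keep its `<head>` child but no `<body>`. It assumes the XML package is installed.

```r
library(XML)

# A small, made-up HTML document held in a character string
html_string <- paste0(
  "<html><head><title>First HTML</title></head>",
  "<body><p>I am your first HTML-file!</p></body></html>"
)

# Handler list: the name matches the node name, the function decides its fate.
# Returning NULL discards the node.
drop_body <- list(body = function(node) NULL)

parsed_head_only <- htmlTreeParse(html_string, asText = TRUE,
                                  handlers = drop_body, asTree = TRUE)

# Names of the nodes that remain directly under <html>
remaining <- names(xmlChildren(parsed_head_only$children$html))
remaining
```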

Shin Lee

2020 9 23