Ch2. HTML

2.1 Browser presentation and source code

HTML

  1. HTML’s marked-up structure

    • Markup definitions: the tags
  2. Web content is an interpreted version of the source code

    • How the document is structured and the function of its various parts: headlines, links, tables, etc.

    • Element inspector

2.2 Syntax rules

  1. Tags, elements, and attributes

Elements

<title>First HTML</title>

Attributes

<a href="http://www.r-datacollection.com/">Link to Homepage</a>

http://www.r-datacollection.com/bookmaterials.html
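
The three building blocks can be seen by parsing the link element above and pulling it apart. This is a minimal sketch that assumes the XML package is installed; the variable names are illustrative only.

```r
library(XML)

# The anchor element from the text, supplied as a string.
# asText = TRUE tells htmlParse() that the input is markup, not a file path.
snippet <- '<a href="http://www.r-datacollection.com/">Link to Homepage</a>'
doc <- htmlParse(snippet, asText = TRUE)

# Locate the <a> node and separate its parts
link_node <- getNodeSet(doc, "//a")[[1]]
tag_name  <- xmlName(link_node)             # element name: "a"
href      <- xmlGetAttr(link_node, "href")  # value of the href attribute
content   <- xmlValue(link_node)            # text between start tag and end tag
```

The element consists of the start tag, the content, and the end tag; the attribute lives inside the start tag as a name-value pair.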

  2. Tree structure

    <p>First HTML</p>

    I am your first HTML-file!

A tree perspective on HTML
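
The tree view can be explored directly in R. The sketch below assumes the XML package is installed and uses a minimal page assembled from the title and body text shown above; the exact markup is an assumption, not the original file.

```r
library(XML)

# Assumed minimal page (reconstructed for illustration only)
html <- paste0(
  "<!DOCTYPE html><html>",
  "<head><title>First HTML</title></head>",
  "<body>I am your first HTML-file!</body>",
  "</html>"
)
doc <- htmlParse(html, asText = TRUE)

# The root node is <html>; its children are the branches <head> and <body>
root        <- xmlRoot(doc)
root_name   <- xmlName(root)
child_names <- names(xmlChildren(root))

# Walk down one branch: html -> head -> title
title_text <- xmlValue(root[["head"]][["title"]])
```

Each nested tag becomes a child node of the element that encloses it, which is why the document can be navigated like a family tree.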

  3. Comments

  4. Reserved and special characters

    <p>5 &lt; 6 but 7 &gt; 3 </p>
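
Escaping reserved characters can be sketched in base R. The helper name `escape_html()` is hypothetical and covers only four common entities; it is an illustration, not a complete encoder.

```r
# Hypothetical helper: replace reserved characters with their HTML entities.
# The ampersand must be replaced first, otherwise the "&" in entities
# produced by the later steps would be escaped again.
escape_html <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x <- gsub('"', "&quot;", x, fixed = TRUE)
  x
}

escaped <- escape_html("5 < 6 but 7 > 3")
paragraph <- paste0("<p>", escaped, "</p>")
```

Without the escaping step, a parser would try to read `< 6 but 7 >` as a tag.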

HTML entities

Character    Entity name    Explanation
"            &quot;         quotation mark
'            &apos;         apostrophe
&            &amp;          ampersand
<            &lt;           less than
>            &gt;           greater than
(space)      &nbsp;         non-breaking space

  5. Document type definition

    <!DOCTYPE html>
  6. Spaces and line breaks

    Writing code is poetry

    Writing
     code
     is
     poetry

    Writing
        code
          is
     poetry
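
All three versions above display the same way because runs of spaces and line breaks in the source are treated as a single space. A rough base R approximation of that behavior, with a hypothetical helper name, might look like this:

```r
# Hypothetical helper: collapse every run of whitespace (spaces, tabs,
# line breaks) into one space, roughly as a browser does for normal text.
collapse_whitespace <- function(x) {
  trimws(gsub("[[:space:]]+", " ", x))
}

one_line   <- "Writing code is poetry"
line_break <- "Writing\n code\n is\n poetry"
indented   <- "Writing\n    code\n      is\n poetry"

collapsed <- vapply(c(one_line, line_break, indented),
                    collapse_whitespace, character(1), USE.NAMES = FALSE)
```

This only mimics default text flow; it ignores elements that preserve whitespace.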

2.3 Tags and attributes

  1. The anchor tag <a>

The tag <a> is what turns HTML from just a markup language into a hypertext markup language by enabling HTML documents to link to other documents.

- Linking to another document
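
Because every link is an `<a>` element with an `href` attribute, all links on a page can be collected in one pass. The sketch assumes the XML package is installed; the page markup and the label "Book materials" are made up for illustration.

```r
library(XML)

# Illustrative page with two links (markup is an assumption)
page <- paste0(
  "<html><body>",
  '<a href="http://www.r-datacollection.com/">Link to Homepage</a>',
  '<a href="http://www.r-datacollection.com/bookmaterials.html">Book materials</a>',
  "</body></html>"
)
doc <- htmlParse(page, asText = TRUE)

# One XPath query per piece of information: the target and the visible label
hrefs  <- xpathSApply(doc, "//a", xmlGetAttr, "href")
labels <- xpathSApply(doc, "//a", xmlValue)

links <- data.frame(label = labels, href = hrefs, stringsAsFactors = FALSE)
```
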
  2. The metadata tag <meta>

The <meta> tag provides meta information on the HTML document.

- Specifying keywords
- Asking robots not to index the page or to follow its links
- Declaring character encoding
- Defining character encodings
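
The uses listed above can be read back out of a page once it is parsed. This sketch assumes the XML package is installed; the `<meta>` values are invented for illustration.

```r
library(XML)

# Illustrative <head> with keyword, robots, and encoding declarations
page <- paste0(
  "<html><head>",
  '<meta charset="utf-8">',
  '<meta name="keywords" content="html, parsing, r">',
  '<meta name="robots" content="noindex, nofollow">',
  "</head><body></body></html>"
)
doc <- htmlParse(page, asText = TRUE)

# Select each <meta> element by its attributes and read the relevant value
keywords <- xpathSApply(doc, "//meta[@name='keywords']", xmlGetAttr, "content")
robots   <- xpathSApply(doc, "//meta[@name='robots']", xmlGetAttr, "content")
charset  <- xpathSApply(doc, "//meta[@charset]", xmlGetAttr, "charset")
```

Checking the robots value before collecting data from a page is one practical use of this information.
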
  3. The external reference tag <link>

The <link> tag is used to link to and include information and external files.

- Specifying style sheets to use
- Specifying the icon associated with the website
  4. Emphasizing tags <b>, <i>, <strong>

    • <b>: text set in bold type
    • <i>: text set in italics
    • <strong>: text defined as important
  5. The paragraph tag <p>

  6. Heading tags <h1>, <h2>, <h3>, …

  7. Listing content with <ul>, <ol>, and <dl>

  8. The organizational tags <div> and <span>

While <div> and <span> themselves do not change the appearance of the content they enclose, these tags are used to group parts of the document.

- <div> defines groups across lines, tags, and paragraphs
- <span> is used for in-line grouping
- CSS
     div.happy {color:pink;font-family:"Comic Sans MS";font-size:120%}
     span.happy {color:pink;font-family:"Comic Sans MS";font-size:120%} 
     
     <link href="htmlresources/awesomestyle.css" rel="stylesheet" type="text/css"/>
     

The purpose of CSS is to separate content from layout, which improves the document’s accessibility. Defining styles outside of an HTML document and assigning them via the class attribute lets the web designer reuse styles across elements and documents. A style can then be changed in one single place (the CSS file), and the change takes effect in all elements and documents that use it.
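
The class attribute that ties an element to a style is also a convenient handle for selecting elements after parsing. The sketch below assumes the XML package is installed; the page content is invented, and only the class name `happy` comes from the style rules above.

```r
library(XML)

# Illustrative page: two elements carry class="happy", one does not
page <- paste0(
  "<html><body>",
  '<div class="happy"><p>I am a happy paragraph.</p></div>',
  '<p>I am <span class="happy">partly happy</span> and partly plain.</p>',
  '<div class="plain"><p>No style here.</p></div>',
  "</body></html>"
)
doc <- htmlParse(page, asText = TRUE)

# Select every element, whatever its tag, whose class is exactly "happy"
happy_nodes <- getNodeSet(doc, "//*[@class='happy']")
happy_tags  <- sapply(happy_nodes, xmlName)
happy_text  <- sapply(happy_nodes, xmlValue)
```

The `<div>` match returns the text of everything it encloses, while the `<span>` match returns only the in-line fragment.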

  9. The <form> tag and its companions

  10. The foreign script tag <script>

HTTP

  11. Table tags <table>, <tr>, <td>, and <th>

    • <tr> for starting a new row
    • <td> for defining cells
    • <th> for header cells
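
The row and cell tags map naturally onto a data frame. This is a minimal sketch that assumes the XML package is installed and that every row has the same number of cells; the table content is invented.

```r
library(XML)

# Illustrative table: one header row (<th>) and two data rows (<td>)
page <- paste0(
  "<html><body><table>",
  "<tr><th>Rank</th><th>Name</th></tr>",
  "<tr><td>1</td><td>Alice</td></tr>",
  "<tr><td>2</td><td>Bob</td></tr>",
  "</table></body></html>"
)
doc <- htmlParse(page, asText = TRUE)

# Header cells give the column names; data cells are read in document order
header <- xpathSApply(doc, "//table//th", xmlValue)
cells  <- xpathSApply(doc, "//table//td", xmlValue)

# Refill the cells row by row into a rectangular structure
table_df <- as.data.frame(
  matrix(cells, ncol = length(header), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(table_df) <- header
```

The XML package also provides `readHTMLTable()` for this task; the manual version is shown here only to make the role of each tag visible.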

2.4 Parsing

Loading and representing the contents of HTML/XML files in an R session

  1. Inspecting content on the Web: a browser displays HTML content nicely

  2. Importing HTML files into R and extracting information from them: a parser in R constructs useful representations of HTML documents

What is parsing?

Reading vs. Parsing

Reading does not attempt to understand the formal grammar that underlies HTML; it merely recognizes the sequence of symbols in the HTML file. In practice, reading means simply loading the content of an HTML file into an R session.

url <- "https://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=102&oid=001&aid=0011898989"
example <- readLines(url)
## Warning in readLines(url): incomplete final line found on
## 'https://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=102&oid=001&aid=0011898989'
example <- paste0(example, collapse = " ")

class(example)
## [1] "character"
library(httr)
url <- "https://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=102&oid=001&aid=0011898989"
example <- httr::GET(url)
example
## Response [https://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=102&oid=001&aid=0011898989]
##   Date: 2020-09-23 08:26
##   Status: 200
##   Content-Type: text/html;charset=EUC-KR
##   Size: 123 kB
## <!DOCTYPE HTML> 
## <html lang="ko"> 
## <head>
## <meta charset="euc-kr">
## <meta http-equiv="X-UA-Compatible" content="IE=edge">
## <meta name="referrer" contents="always">
## <meta name="viewport" content="width=1106" />
## <title>질병청 "'상온 노출' 백신 문제없으면 즉시 접종 재개"(종합) : 네이버 뉴스</title>
## 
## 
## ...
class(example)
## [1] "response"

GET() is agnostic about the components of a tag (name, attributes, values, etc.) and returns a result that does not reflect the document’s internal hierarchy, as implied by the nested tags, in any sensible way.

To achieve a useful representation of HTML files, we need to employ a program that understands the special meaning of the markup structures and reconstructs the implied hierarchy of an HTML file within some R-specific data structure.

Transformation from any HTML file to a queryable Document Object Model (DOM): parsing with the XML package takes two steps

  1. htmlParse() first parses the entire target document and creates the DOM in a tree-like data structure of the C language.
  2. The C-level node structure is converted into an object of the R language through handler functions.
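
library(XML)
# GET() returns a response object, so extract the body as text before parsing.
# The page is served as EUC-KR; content() converts it to a UTF-8 string.
example_text <- httr::content(example, as = "text", encoding = "EUC-KR")
# asText = TRUE marks the input as markup rather than a file name or URL
parsed_example <- htmlParse(example_text, asText = TRUE, encoding = "UTF-8")
class(parsed_example)
parsed_example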
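
The difference between reading and parsing can be checked without a network connection. The sketch below assumes the XML package is installed and uses an invented one-line page in place of the news article.

```r
library(XML)

# Illustrative page held in a single string, as paste0(readLines(...)) would give
raw_html <- paste0(
  "<html><head><title>First HTML</title></head>",
  "<body><p>I am your first HTML-file!</p></body></html>"
)

# Reading: the content is just characters; the tags carry no meaning yet
is_plain_text <- is.character(raw_html)

# Parsing: the same characters become a document that can be queried
doc <- htmlParse(raw_html, asText = TRUE)
is_internal_doc <- inherits(doc, "XMLInternalDocument")
title <- xpathSApply(doc, "//title", xmlValue)
paragraph <- xpathSApply(doc, "//body/p", xmlValue)
```

The parsed object is a reference to the C-level tree, which is why it is queried through functions such as `xpathSApply()` rather than indexed like a character vector.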

Discarding nodes

Discarding unnecessary parts of web documents in the parsing stage can help mitigate memory issues and enhance extraction speed. We can specify handlers as a list of named functions, where the name corresponds to a node name and the function specifies what should happen with the node.

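# Handler list: each name matches a node name; returning NULL discards that node
h1 <- list("body" = function(x) NULL)

# Parse the response body as text (the page is served as EUC-KR)
example_text <- httr::content(example, as = "text", encoding = "EUC-KR")
parsed_example_body <- XML::htmlTreeParse(example_text, asText = TRUE, handlers = h1,
                                          asTree = TRUE, encoding = "UTF-8")
parsed_example_body$children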
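
The effect of a discarding handler is easier to see on a small page. This sketch assumes the XML package is installed; the page and the name `drop_body` are illustrative.

```r
library(XML)

# Illustrative page with a <head> and a <body>
page <- paste0(
  "<html><head><title>First HTML</title></head>",
  "<body><p>I am your first HTML-file!</p></body></html>"
)

# Handler named after the node it applies to; returning NULL drops the node
drop_body <- list(body = function(node) NULL)

tree <- htmlTreeParse(page, asText = TRUE, handlers = drop_body, asTree = TRUE)

# Inspect which children of <html> survived the parsing stage
root <- tree$children$html
remaining <- names(xmlChildren(root))
```

Only the `<head>` branch is kept, so everything under `<body>` never has to be held in the resulting R object.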

Extracting information in the building process

Assignment

  1. Select a NAVER news webpage of interest in your browser

  2. Have a look at the source code

  3. Inspect various elements in the Inspect Elements tool of your browser

  4. Copy and paste the elements for the outlet, the upload date, the headline, any highlight, the body content, and the comments (댓글)

  5. Check and report the structure of the elements.