HTML
HTML’s marked-up structure
Web content is an interpreted version of the source code
How the document is structured and the function of its various parts: headlines, links, tables, etc…
Element inspector
Elements
<title>First HTML</title>
Attributes
<a href="http://www.r-datacollection.com/">Link to Homepage</a>
http://www.r-datacollection.com/bookmaterials.html
Tree structure
First HTML
I am your first HTML-file!
A tree perspective on HTML
Comments
Reserved and special characters
<p>5 < 6 but 7 > 3 </p>
HTML entities
Character | Entity name | Explanation |
---|---|---|
" | &quot; | quotation mark |
' | &apos; | apostrophe |
& | &amp; | ampersand |
< | &lt; | less than |
> | &gt; | greater than |
(space) | &nbsp; | non-breaking space |
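As a rough illustration of why entities matter, the sketch below escapes the reserved characters in a string before wrapping it in a tag. The helper name escape_html() is hypothetical (it is not part of base R or of any package used here), and it handles only the four entities &amp;, &lt;, &gt;, and &quot;.

```r
# Hypothetical helper: replace reserved HTML characters with their entities.
# The ampersand must be handled first, otherwise the "&" inside the
# entities produced by the later steps would be escaped a second time.
escape_html <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x <- gsub("\"", "&quot;", x, fixed = TRUE)
  x
}

escaped <- escape_html("5 < 6 but 7 > 3")
paragraph <- paste0("<p>", escaped, "</p>")
paragraph
```

A browser would render the resulting paragraph as the literal comparison signs instead of trying to interpret them as the start or end of a tag.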
Document type definition
<!DOCTYPE html>
Spaces and line breaks
Writing code is poetry
Writing
code
is
poetry
Loading and representing the contents of HTML/XML files in an R session
Inspecting content on the Web: a browser displays HTML content nicely
Importing HTML files into R and extracting information from them: a parser in R constructs useful representations of HTML documents
Reading vs. Parsing
Reading does not try to understand the formal grammar that underlies HTML; it merely recognizes the sequence of symbols in the HTML file. In other words, reading just loads the content of an HTML file into an R session.
url <- "https://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=102&oid=001&aid=0011898989"
example <- readLines(url)
## Warning in readLines(url): incomplete final line found on
## 'https://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=102&oid=001&aid=0011898989'
example <- paste0(example, collapse = " ")
class(example)
## [1] "character"
library(httr)
url <- "https://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=102&oid=001&aid=0011898989"
example <- httr::GET(url)
example
## Response [https://news.naver.com/main/read.nhn?mode=LSD&mid=shm&sid1=102&oid=001&aid=0011898989]
## Date: 2020-09-23 08:26
## Status: 200
## Content-Type: text/html;charset=EUC-KR
## Size: 123 kB
## <!DOCTYPE HTML>
## <html lang="ko">
## <head>
## <meta charset="euc-kr">
## <meta http-equiv="X-UA-Compatible" content="IE=edge">
## <meta name="referrer" contents="always">
## <meta name="viewport" content="width=1106" />
## <title>질병청 "'상온 노출' 백신 문제없으면 즉시 접종 재개"(종합) : 네이버 뉴스</title>
##
##
## ...
class(example)
## [1] "response"
Like readLines(), GET() is agnostic about the different parts of a tag element (name, attributes, values, etc.) and produces results that do not reflect the document’s internal hierarchy, as implied by the nested tags, in any sensible way.
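A small offline sketch may make the limitation concrete. It uses a made-up five-line HTML document (an assumption for illustration, not the NAVER page) and base R only. Once the text is read in, it is a single flat string, so retrieving the title requires string pattern matching instead of a query on the document structure.

```r
# Assumed sample document, stored as plain lines of text
html_lines <- c(
  "<!DOCTYPE html>",
  "<html>",
  "  <head><title>First HTML</title></head>",
  "  <body>I am your first HTML-file!</body>",
  "</html>"
)

# Same collapsing step as in the readLines() example
raw_text <- paste0(html_lines, collapse = " ")

# One character string: R knows nothing about <head>, <title>, or <body>
class(raw_text)
length(raw_text)

# Extracting the title means matching on the literal tag text
title <- sub(".*<title>(.*?)</title>.*", "\\1", raw_text, perl = TRUE)
title
```

Pattern matching of this kind breaks as soon as tags carry attributes or are nested in unexpected ways, which is the motivation for parsing.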
To achieve a useful representation of HTML files, we need to employ a program that understands the special meaning of the markup structures and reconstructs the implied hierarchy of an HTML file within some R-specific data structure.
Transforming an HTML file into a queryable Document Object Model (DOM): parsing with the XML package takes two steps
1. ```htmlParse()``` first parses the entire target document and creates the DOM as a tree-like data structure in the C language.
2. The C-level node structure is converted into an object of the R language through handler functions.
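A minimal sketch of parsing, assuming the XML package is installed and using a small inline document instead of the NAVER page. The variable names are illustrative. It shows that, once parsed, the document can be queried by its structure (here with XPath) instead of by string matching.

```r
library(XML)

# Assumed sample document, passed as a string
html_text <- paste0(
  "<!DOCTYPE html><html><head><title>First HTML</title></head>",
  "<body>I am your first HTML-file!</body></html>"
)

# asText = TRUE tells the parser that the input is content, not a file path or URL
parsed_doc <- htmlParse(html_text, asText = TRUE)

# The parsed object can be queried by node name
page_title <- xpathSApply(parsed_doc, "//title", xmlValue)
body_text <- xpathSApply(parsed_doc, "//body", xmlValue)

page_title
body_text
```

Here the tree stays in its C-level form and xpathSApply() with xmlValue pulls the requested values into ordinary R character vectors.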
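library(XML)
# GET() returns a response object, so extract the page text first (the page is served as EUC-KR)
html_text <- httr::content(example, as = "text", encoding = "EUC-KR")
parsed_example <- htmlParse(html_text, asText = TRUE, encoding = "UTF-8")
class(parsed_example)
parsed_example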
Discarding unnecessary parts of web documents in the parsing stage can help mitigate memory issues and enhance extraction speed. We can specify handlers as a list of named functions, where the name corresponds to a node name and the function specifies what should happen with the node.
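h1 <- list("body" = function(x){NULL})
# Pass the page text (not the response object); the handler drops every <body> node while parsing
page_text <- httr::content(example, as = "text", encoding = "EUC-KR")
parsed_example_body <- htmlTreeParse(page_text, asText = TRUE, handlers = h1, asTree = TRUE, encoding = "UTF-8")
parsed_example_body$children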
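The same handler idea can be tried offline on a small inline document. This is a sketch under the assumption that the XML package is installed. The sample document and the names drop_body and parsed_head_only are made up for illustration.

```r
library(XML)

# Assumed sample document
html_text <- paste0(
  "<html><head><title>First HTML</title></head>",
  "<body><p>I am your first HTML-file!</p></body></html>"
)

# A handler named "body": returning NULL discards that node from the tree
drop_body <- list("body" = function(x) NULL)

parsed_head_only <- htmlTreeParse(html_text, asText = TRUE,
                                  handlers = drop_body, asTree = TRUE)

# Inspect which child nodes of <html> survived the parsing stage
root <- parsed_head_only$children$html
remaining <- names(xmlChildren(root))
remaining
```

Because the body is discarded while the document is being parsed, it never has to be held in the resulting R object, which is where the memory and speed benefit comes from.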
1. Select a NAVER news webpage of your interest in your browser.
2. Have a look at the source code.
3. Inspect various elements with the Inspect Elements tool of your browser.
4. Copy and paste the elements for the outlet, the upload date, the headline, any highlight, the body content, and the comments (댓글).
5. Check and report the structure of the elements.