2. Navigating the HTML structure with rvest
rvest provides several functions for navigating HTML. Let's look at some of them using the example from the previous class.
library(rvest)
library(tidyverse)
url <- "https://www.york.ac.uk/teaching/cws/wws/webpage1.html"
html <- url %>% read_html()
- A first approach is the html_children() function, which takes a node and returns its children as an object of class xml_nodeset.
html %>% html_children()
{xml_nodeset (1)}
[1] <body><hmtl><title>webpage1</title>\n<table width="75%" align="center"><tr>\ ...
If we convert this node set directly to text, we get the following result.
html %>% html_children() %>% html_text()
[1] "webpage1\nSTARTING . . . \n\n\nThere are lots of ways to create web pages using already coded programmes. These lessons will teach you how to use the underlying HyperText Markup Language - HTML. \nHTML isn't computer code, but is a language that uses US English to enable texts (words, images, sounds) to be inserted and formatting such as colo(u)r and centre/ering to be written in. The process is fairly simple; the main difficulties often lie in small mistakes - if you slip up while word processing your reader may pick up your typos, but the page will still be legible. However, if your HTML is inaccurate the page may not appear - writing web pages is, at the least, very good practice for proof reading!\n\nLearning HTML will enable you to:\ncreate your own simple pages\nread and appreciate pages created by others\ndevelop an understanding of the creative and literary implications of web-texts\nhave the confidence to branch out into more complex web design \nA HTML web page is made up of tags. Tags are placed in brackets like this < tag > . A tag tells the browser how to display information. Most tags need to be opened < tag > and closed < /tag >.\n\n To make a simple web page you need to know only four tags:\n< HTML > tells the browser your page is written in HTML format\n< HEAD > this is a kind of preface of vital information that doesn't appear on the screen. \n< TITLE >Write the title of the web page here - this is the information that viewers see on the upper bar of their screen. (I've given this page the title 'webpage1').\n< BODY >This is where you put the content of your page, the words and pictures that people read on the screen. \nAll these tags need to be closed.\n\nEXERCISE\n\nWrite a simple web page.\n Copy out exactly the HTML below, using a WP program such as Notepad.\nInformation in italics indicates where you can insert your own text, other information is HTML and needs to be exact. 
However, make sure there are no spaces between the tag brackets and the text inside.\n(Find Notepad by going to the START menu\\ PROGRAMS\\ ACCESSORIES\\ NOTEPAD). \n\n< HTML >< HEAD >< TITLE > title of page< /TITLE >< /HEAD >< BODY> write what you like here: 'my first web page', or a piece about what you are reading, or a few thoughts on the course, or copy out a few words from a book or cornflake packet. Just type in your words using no extras such as bold, or italics, as these have special HTML tags, although you may use upper and lower case letters and single spaces. < /BODY >< /HTML >Save the file as 'first.html' (ie. call the file anything at all) It's useful if you start a folder - just as you would for word-processing - and call it something like WEBPAGES, and put your first.html file in the folder.\n\nNOW - open your browser.\nOn Netscape the process is: \nTop menu; FILE\\ OPEN PAGE\\ CHOOSE FILE \nClick on your WEBPAGES folder\\ FIRST file\nClick 'open' and your page should appear.\nOn Internet Explorer: \nTop menu; FILE\\ OPEN\\ BROWSE \nClick on your WEBPAGES folder\\ FIRST file\nClick 'open' and your page should appear.If the page doesn't open, go back over your notepad typing and make sure that all the HTML tags are correct. Check there are no spaces between tags and internal text; check that all tags are closed; check that you haven't written < HTLM > or < BDDY >. Your page will work eventually. \n\nMake another page. Call it somethingdifferent.html and place it in the same WEBPAGES folder as detailed above.\nstart formatting in lesson two\nback to wws index \n\n \n \n\n\n\n\n"
As you can see, html_text() returns the text of every tag contained in the node set it is applied to.
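The same behaviour is easier to see on a small document. The sketch below uses a hypothetical toy page (the markup and the names doc, body and kids are assumptions for illustration, not part of the class example) and compares the text of each child with the text of the parent node.

```r
library(rvest)

# Hypothetical toy document, parsed from a literal string instead of a URL
doc <- read_html("<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>")

# Navigate to <body> and list its direct children
body <- doc %>% html_node("body")
kids <- body %>% html_children()

# Tag name of each child
child_names <- html_name(kids)

# One string per child node
child_text <- html_text(kids)

# A single string that concatenates the text of everything inside <body>
body_text <- html_text(body)

print(child_names)
print(child_text)
print(body_text)
```

Applied to a node set, html_text() gives one string per node; applied to a single parent node, it collapses the text of all its descendants into one string, which is why the output above comes back as a single long block.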
- Another way to navigate is with selectors, through the functions html_nodes() and html_node(). While html_nodes() returns every node that matches the chosen selector, html_node() returns only the first match. Both functions take a selector as their argument. A selector is simply a string that describes the path to follow through the HTML tree to reach the object we are looking for. These selectors follow a specific syntax: CSS selectors by default, or XPath expressions through the xpath argument. Let's try to get only the text of the p (paragraph) nodes.
html %>% html_nodes("p") %>% html_text()
[1] "There are lots of ways to create web pages using already coded programmes. These lessons will teach you how to use the underlying HyperText Markup Language - HTML. \n"
[2] "HTML isn't computer code, but is a language that uses US English to enable texts (words, images, sounds) to be inserted and formatting such as colo(u)r and centre/ering to be written in. The process is fairly simple; the main difficulties often lie in small mistakes - if you slip up while word processing your reader may pick up your typos, but the page will still be legible. However, if your HTML is inaccurate the page may not appear - writing web pages is, at the least, very good practice for proof reading!"
[3] "Learning HTML will enable you to:\n"
[4] "A HTML web page is made up of tags. Tags are placed in brackets like this < tag > . A tag tells the browser how to display information. Most tags need to be opened < tag > and closed < /tag >.\n\n"
[5] " To make a simple web page you need to know only four tags:\n"
[6] "All these tags need to be closed.\n\n"
[7] "Write a simple web page."
[8] " Copy out exactly the HTML below, using a WP program such as Notepad.\nInformation in italics indicates where you can insert your own text, other information is HTML and needs to be exact. However, make sure there are no spaces between the tag brackets and the text inside.\n(Find Notepad by going to the START menu\\ PROGRAMS\\ ACCESSORIES\\ NOTEPAD). \n"
[9] "\n< HTML >< HEAD >< TITLE > title of page< /TITLE >< /HEAD >< BODY> write what you like here: 'my first web page', or a piece about what you are reading, or a few thoughts on the course, or copy out a few words from a book or cornflake packet. Just type in your words using no extras such as bold, or italics, as these have special HTML tags, although you may use upper and lower case letters and single spaces. < /BODY >< /HTML >"
[10] "Save the file as 'first.html' (ie. call the file anything at all) It's useful if you start a folder - just as you would for word-processing - and call it something like WEBPAGES, and put your first.html file in the folder.\n\n"
[11] "NOW - open your browser.\nOn Netscape the process is: \nTop menu; FILE\\ OPEN PAGE\\ CHOOSE FILE \nClick on your WEBPAGES folder\\ FIRST file\nClick 'open' and your page should appear.\n"
[12] "On Internet Explorer: \nTop menu; FILE\\ OPEN\\ BROWSE \nClick on your WEBPAGES folder\\ FIRST file\nClick 'open' and your page should appear."
[13] "If the page doesn't open, go back over your notepad typing and make sure that all the HTML tags are correct. Check there are no spaces between tags and internal text; check that all tags are closed; check that you haven't written < HTLM > or < BDDY >. Your page will work eventually. \n"
[14] "\nMake another page. Call it somethingdifferent.html and place it in the same WEBPAGES folder as detailed above.\n"
[15] "start formatting in lesson two\nback to wws index "
Note that using only p returned every element of that type in the whole document. To navigate to a specific node, we need a more detailed selector. We can find one with the Inspect option available in almost every browser, or with an extension such as SelectorGadget for Google Chrome.
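As a rough illustration of how a more detailed selector narrows the result, the sketch below runs a few CSS selectors against a hypothetical toy page. The markup, the class name intro and the id last are assumptions made up for this example; the selectors a real page needs will depend on its own structure.

```r
library(rvest)

# Hypothetical toy document with paragraphs inside and outside a table
doc <- read_html('
<html><body>
  <p class="intro">Outside the table</p>
  <table><tr><td>
    <p>Inside, first</p>
    <p id="last">Inside, second</p>
  </td></tr></table>
</body></html>')

# html_nodes() returns every <p> in the document
n_all <- doc %>% html_nodes("p") %>% length()

# html_node() returns only the first match
first_p <- doc %>% html_node("p") %>% html_text()

# Descendant selector: only <p> elements located inside a <table>
in_table <- doc %>% html_nodes("table p") %>% html_text()

# Class selector and id selector
by_class <- doc %>% html_nodes("p.intro") %>% html_text()
by_id <- doc %>% html_nodes("#last") %>% html_text()

print(n_all)
print(first_p)
print(in_table)
print(by_class)
print(by_id)
```

Selectors copied from the browser's Inspect panel or from SelectorGadget are usually combinations of these same pieces: tag names, classes, ids and descendant relationships.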
- Finally, once we have navigated to the desired nodes, we can extract specific attributes from them, such as hyperlinks, with the html_attr() function. It takes the name of the attribute we want to extract as its argument. The html_attrs() function instead returns all the attributes of the selected node as a named vector.
html %>% html_nodes("table p") %>% pluck(15) %>% html_node("a") %>% html_attrs()
href
"webpage2.html"
html %>% html_nodes("table p") %>% pluck(15) %>% html_node("a") %>% html_attr("href")
[1] "webpage2.html"
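The href extracted above is a relative link. A minimal sketch of how the two attribute functions behave on several nodes at once, and of one possible way to turn relative links into full URLs with url_absolute() from xml2, follows. The toy markup is an assumption: the title attribute and the index.html link are hypothetical and are not taken from the real page.

```r
library(rvest)
library(xml2)

# Address of the page from the class example, used as the base for relative links
base_url <- "https://www.york.ac.uk/teaching/cws/wws/webpage1.html"

# Hypothetical toy document with two links
doc <- read_html('<html><body>
  <p><a href="webpage2.html" title="next">start formatting in lesson two</a></p>
  <p><a href="index.html">back to wws index</a></p>
</body></html>')

links <- doc %>% html_nodes("a")

# html_attr(): one value per node, NA where the attribute is missing
hrefs <- links %>% html_attr("href")
titles <- links %>% html_attr("title")

# html_attrs() on a node set: a list with one named character vector per node
all_attrs <- links %>% html_attrs()

# Resolve the relative links against the base address
absolute <- url_absolute(hrefs, base_url)

print(hrefs)
print(titles)
print(all_attrs)
print(absolute)
```

Resolving links this way is useful when the extracted addresses are meant to be passed back to read_html() to scrape the linked pages.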
