library(tidyverse)
library(rvest)
Module 1 Lesson 4 Application
Star Wars
rvest includes a very simple example in vignette("starwars"). The page uses minimal HTML, so it’s a good place to start. I’d encourage you to navigate to that page now and use “Inspect Element” to inspect one of the headings that is the title of a Star Wars movie. Use the keyboard or mouse to explore the hierarchy of the HTML and see if you can get a sense of the structure shared by each movie.
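You should be able to see that each movie has a shared structure: a section element that holds an h2 heading with the title, a p paragraph with the release date, an element with class director, and an element with class crawl for the intro text.
We’ll start by reading the HTML and extracting all the section elements: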
url <- "https://rvest.tidyverse.org/articles/starwars.html"
html <- read_html(url)
section <- html |> html_elements("section")
section
{xml_nodeset (7)}
[1] <section><h2 data-id="1">\nThe Phantom Menace\n</h2>\n<p>\nReleased: 1999 ...
[2] <section><h2 data-id="2">\nAttack of the Clones\n</h2>\n<p>\nReleased: 20 ...
[3] <section><h2 data-id="3">\nRevenge of the Sith\n</h2>\n<p>\nReleased: 200 ...
[4] <section><h2 data-id="4">\nA New Hope\n</h2>\n<p>\nReleased: 1977-05-25\n ...
[5] <section><h2 data-id="5">\nThe Empire Strikes Back\n</h2>\n<p>\nReleased: ...
[6] <section><h2 data-id="6">\nReturn of the Jedi\n</h2>\n<p>\nReleased: 1983 ...
[7] <section><h2 data-id="7">\nThe Force Awakens\n</h2>\n<p>\nReleased: 2015- ...
This retrieves seven elements, matching the seven movies found on that page, which suggests that section is a good selector. Extracting the individual elements is straightforward since the data is always found in the text. It’s just a matter of finding the right selector:
section |> html_element("h2") |> html_text2()
[1] "The Phantom Menace" "Attack of the Clones"
[3] "Revenge of the Sith" "A New Hope"
[5] "The Empire Strikes Back" "Return of the Jedi"
[7] "The Force Awakens"
section |> html_element(".director") |> html_text2()
[1] "George Lucas" "George Lucas" "George Lucas" "George Lucas"
[5] "Irvin Kershner" "Richard Marquand" "J. J. Abrams"
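The calls above use html_element() rather than html_elements(), and the difference matters when a field is missing. As a minimal offline sketch, the example below builds a made-up two-movie page with rvest’s minimal_html(); the HTML and the names page, movies, directors and all_directors are illustrative assumptions, not taken from the vignette. It shows that html_element() returns exactly one result per section, with NA where nothing matches:

```r
library(rvest)

# Hypothetical HTML that mimics the structure of the vignette page;
# the second movie deliberately has no director element.
page <- minimal_html('
  <section>
    <h2 data-id="1">First Movie</h2>
    <p>Released: 2001-01-01</p>
    <p>Director: <span class="director">Some Director</span></p>
  </section>
  <section>
    <h2 data-id="2">Second Movie</h2>
    <p>Released: 2002-02-02</p>
  </section>
')

movies <- page |> html_elements("section")

# html_element() gives one result per section, NA for the missing director
directors <- movies |> html_element(".director") |> html_text2()

# html_elements() keeps only the matches it finds, so rows no longer line up
all_directors <- movies |> html_elements(".director") |> html_text2()

length(directors)      # one entry per section
length(all_directors)  # one entry per match
```

Because html_element() keeps the results aligned with the sections, its output can be placed side by side in a data frame without mixing up rows.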
Once we’ve done that for each component, we can wrap all the results up into a tibble:
starwars <- tibble(
  title = section |>
    html_element("h2") |>
    html_text2(),
  released = section |>
    html_element("p") |>
    html_text2() |>
    str_remove("Released: ") |>
    parse_date(),
  director = section |>
    html_element(".director") |>
    html_text2(),
  intro = section |>
    html_element(".crawl") |>
    html_text2()
)
starwars
# A tibble: 7 × 4
  title                   released   director         intro
  <chr>                   <date>     <chr>            <chr>
1 The Phantom Menace      1999-05-19 George Lucas     "Turmoil has engulfed the…
2 Attack of the Clones    2002-05-16 George Lucas     "There is unrest in the G…
3 Revenge of the Sith     2005-05-19 George Lucas     "War! The Republic is cru…
4 A New Hope              1977-05-25 George Lucas     "It is a period of civil …
5 The Empire Strikes Back 1980-05-17 Irvin Kershner   "It is a dark time for th…
6 Return of the Jedi      1983-05-25 Richard Marquand "Luke Skywalker has retur…
7 The Force Awakens       2015-12-11 J. J. Abrams     "Luke Skywalker has vanis…
We did a little more processing of released to get a variable that will be easy to use later in our analysis.
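To see what that extra processing does in isolation, here is a small sketch on two sample strings written in the page’s "Released: YYYY-MM-DD" format. The vector names raw and dates are illustrative assumptions; str_remove() comes from stringr and parse_date() from readr, both loaded by the tidyverse:

```r
library(readr)
library(stringr)

# Sample strings in the same format as the scraped paragraph text
raw <- c("Released: 1999-05-19", "Released: 1977-05-25")

# Strip the label, then convert the remaining ISO 8601 text to Date values
dates <- raw |>
  str_remove("Released: ") |>
  parse_date()

class(dates)         # a Date vector rather than character
format(dates, "%Y")  # date components are now easy to extract
dates[1] - dates[2]  # and date arithmetic works
```

Leaving the column as text would make sorting, filtering by year, or computing gaps between releases much more awkward.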
We can then save our scraped data as a CSV file in the current working directory.
write_csv(starwars, "starwars.csv")
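It can be worth checking that the saved file reads back as expected. The sketch below is one way to do that, under a few assumptions: it uses a small stand-in tibble named films (two rows copied from the table above) and a temporary file instead of starwars.csv, so it runs without the scraped data and overwrites nothing:

```r
library(readr)
library(tibble)

# Stand-in for the scraped data; values copied from the table above
films <- tibble(
  title = c("A New Hope", "The Empire Strikes Back"),
  released = as.Date(c("1977-05-25", "1980-05-17"))
)

# Write to a temporary file so no existing file is touched
path <- tempfile(fileext = ".csv")
write_csv(films, path)

# Declare the column types explicitly instead of relying on type guessing
films_back <- read_csv(
  path,
  col_types = cols(title = col_character(), released = col_date())
)

films_back
```

Declaring col_types keeps released as a date when the file is read in a later session, since a CSV file stores only text.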