Module 1 Lesson 4 Application

Author

Jamal Rogers

Published

May 16, 2023

library(tidyverse)
library(rvest)

Star Wars

rvest includes a simple example in vignette("starwars"). The page uses minimal HTML, so it’s a good place to start. I’d encourage you to navigate to that page now and use “Inspect Element” to inspect one of the headings that’s the title of a Star Wars movie. Use the keyboard or mouse to explore the hierarchy of the HTML and see if you can get a sense of the shared structure used by each movie.

You should be able to see that each movie has a shared structure that looks like this:
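
A minimal sketch of that structure, built with minimal_html() so it can be explored without a network connection. The markup is a reconstruction assumed from the selectors used later in this lesson (section, h2, p, .director, .crawl), not a verbatim copy of the page, and the crawl text is elided.

```r
library(rvest)

# Hypothetical reconstruction of one movie's markup. Tag and class names come
# from the selectors used in this lesson; the crawl text is left elided.
movie <- minimal_html('
  <section>
    <h2 data-id="1">The Phantom Menace</h2>
    <p>Released: 1999-05-19</p>
    <p>Director: <span class="director">George Lucas</span></p>
    <div class="crawl">
      <p>...</p>
    </div>
  </section>
')

# Select the section first, then the pieces inside it
one_section <- movie |> html_element("section")
title <- one_section |> html_element("h2") |> html_text2()
director <- one_section |> html_element(".director") |> html_text2()

title
director
```

Every movie on the page repeats this pattern, which is why a single set of selectors can be applied to all of them at once.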

Our goal is to turn this data into a 7 row data frame with variables title, released, director, and intro. We’ll start by reading the HTML and extracting all the <section> elements:

url <- "https://rvest.tidyverse.org/articles/starwars.html"
html <- read_html(url)

section <- html |> html_elements("section")
section
{xml_nodeset (7)}
[1] <section><h2 data-id="1">\nThe Phantom Menace\n</h2>\n<p>\nReleased: 1999 ...
[2] <section><h2 data-id="2">\nAttack of the Clones\n</h2>\n<p>\nReleased: 20 ...
[3] <section><h2 data-id="3">\nRevenge of the Sith\n</h2>\n<p>\nReleased: 200 ...
[4] <section><h2 data-id="4">\nA New Hope\n</h2>\n<p>\nReleased: 1977-05-25\n ...
[5] <section><h2 data-id="5">\nThe Empire Strikes Back\n</h2>\n<p>\nReleased: ...
[6] <section><h2 data-id="6">\nReturn of the Jedi\n</h2>\n<p>\nReleased: 1983 ...
[7] <section><h2 data-id="7">\nThe Force Awakens\n</h2>\n<p>\nReleased: 2015- ...

This retrieves seven elements, one for each of the seven movies on that page, which suggests that section is a good selector. Extracting the individual elements is straightforward since the data is always found in the text. It’s just a matter of finding the right selector:

section |> html_element("h2") |> html_text2()
[1] "The Phantom Menace"      "Attack of the Clones"   
[3] "Revenge of the Sith"     "A New Hope"             
[5] "The Empire Strikes Back" "Return of the Jedi"     
[7] "The Force Awakens"      
section |> html_element(".director") |> html_text2()
[1] "George Lucas"     "George Lucas"     "George Lucas"     "George Lucas"    
[5] "Irvin Kershner"   "Richard Marquand" "J. J. Abrams"    
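
Both calls above use html_element() rather than html_elements() on the set of sections. A small sketch of why that matters, using two hypothetical sections (the movie and person names are placeholders) where the second has no .director element:

```r
library(rvest)

# Two hypothetical sections; the second one has no .director element
page <- minimal_html('
  <section>
    <h2>Movie A</h2>
    <p>Director: <span class="director">Person A</span></p>
  </section>
  <section>
    <h2>Movie B</h2>
  </section>
')
sections <- page |> html_elements("section")

# html_element() returns one result per section, so a missing match
# shows up as NA and the vector stays aligned with the titles
directors <- sections |> html_element(".director") |> html_text2()
directors

# html_elements() returns only the matches it finds, so the result can be
# shorter than the number of sections
matched <- sections |> html_elements(".director") |> html_text2()
length(matched)
```

Keeping one value per section is what lets the results be combined column by column in a tibble.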

Once we’ve done that for each component, we can wrap all the results up into a tibble:

starwars <- tibble(
  title = section |> 
    html_element("h2") |> 
    html_text2(),
  released = section |> 
    html_element("p") |> 
    html_text2() |> 
    str_remove("Released: ") |> 
    parse_date(),
  director = section |> 
    html_element(".director") |> 
    html_text2(),
  intro = section |> 
    html_element(".crawl") |> 
    html_text2()
)

starwars
# A tibble: 7 × 4
  title                   released   director         intro                     
  <chr>                   <date>     <chr>            <chr>                     
1 The Phantom Menace      1999-05-19 George Lucas     "Turmoil has engulfed the…
2 Attack of the Clones    2002-05-16 George Lucas     "There is unrest in the G…
3 Revenge of the Sith     2005-05-19 George Lucas     "War! The Republic is cru…
4 A New Hope              1977-05-25 George Lucas     "It is a period of civil …
5 The Empire Strikes Back 1980-05-17 Irvin Kershner   "It is a dark time for th…
6 Return of the Jedi      1983-05-25 Richard Marquand "Luke Skywalker has retur…
7 The Force Awakens       2015-12-11 J. J. Abrams     "Luke Skywalker has vanis…

We did a little more processing of released to get a variable that will be easy to use later in our analysis.
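
The processing of released can be tried on its own. This sketch assumes the paragraph text has the form "Released: YYYY-MM-DD", as in the output above, and uses two of the dates from the table:

```r
library(stringr)
library(readr)

# Text as it would come out of html_text2() for the first <p> of two movies
raw <- c("Released: 1999-05-19", "Released: 1977-05-25")

# Drop the label, then parse the remaining ISO 8601 string into a Date
released <- raw |>
  str_remove("Released: ") |>
  parse_date()

class(released)

# Dates sort chronologically and can be reformatted, unlike raw strings
sort(released)
format(released, "%Y")
```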

We can then save the scraped data to a CSV file in the working directory.

write_csv(starwars, "starwars.csv")
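
To check that a saved file can be read back with the same column types, a round trip along these lines may help. This is a sketch on a two-row stand-in for the scraped data; the temporary path and the explicit col_types are illustrative choices, not part of the lesson.

```r
library(tibble)
library(readr)

# Two-row stand-in for the scraped tibble
movies <- tibble(
  title = c("A New Hope", "The Empire Strikes Back"),
  released = as.Date(c("1977-05-25", "1980-05-17"))
)

# Write to a temporary file so the working directory is left untouched
path <- tempfile(fileext = ".csv")
write_csv(movies, path)

# Read it back, stating the column types instead of relying on guessing
restored <- read_csv(
  path,
  col_types = cols(title = col_character(), released = col_date())
)

# TRUE when the dates survive the round trip
all(movies$released == restored$released)
```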