library(tidyverse)
library(rjson)
library(XML)
library(xml2)
library(rvest)
library(jsonlite)
library(knitr)
Reading Books
Introduction
In this notebook we will introduce reading less-than-tidy data commonly found on the web, and demonstrate the steps needed to get each format into proper tidy shape.
We’ll review a JSON, an XML, and an HTML file, each containing data on a different book.
JSON
Reading JSON is relatively straightforward with the jsonlite library; however, the result comes back as a list of lists. We can turn it into a tibble by unnesting each level of nesting into its own set of rows or columns.
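# Read the JSON file into a nested list
dune_list <- jsonlite::fromJSON("dune.json", flatten = TRUE)

# Wrap the book entry in a list-column so it can be unnested
dune_tibble <- tibble(book = list(dune_list$book))

# Spread the book fields into columns, then give each character its own row
dune_tibble_flat <- dune_tibble |>
  unnest_wider(book) |>
  unnest_longer(characters) |>
  unnest_wider(characters)

kable(dune_tibble_flat)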
title | author | genre | firstname | lastname |
---|---|---|---|---|
Dune | Herbert, Frank | Science Fiction | Paul | Atreides |
Dune | Herbert, Frank | Science Fiction | Vladmir | Harkonnen |
Dune | Herbert, Frank | Science Fiction | Chani | NA |
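The unnesting steps above can be tried without the original file. The following is a minimal sketch, assuming dune.json has roughly the nested shape shown in the output table; the inline JSON string and the names json_text, book_list, and book_flat are illustrative and not taken from the notebook. It reads with simplifyVector = FALSE so that everything stays a plain nested list, which makes each unnest step easy to follow.

```r
library(jsonlite)
library(tibble)
library(tidyr)

# Hypothetical stand-in for dune.json; the real file's structure is assumed
json_text <- '{
  "book": {
    "title": "Dune",
    "author": "Herbert, Frank",
    "characters": [
      {"firstname": "Paul", "lastname": "Atreides"},
      {"firstname": "Chani"}
    ]
  }
}'

# simplifyVector = FALSE keeps arrays and objects as nested lists
book_list <- fromJSON(json_text, simplifyVector = FALSE)

book_flat <- tibble(book = list(book_list$book)) |>
  unnest_wider(book) |>          # one column per top-level book field
  unnest_longer(characters) |>   # one row per character
  unnest_wider(characters)       # one column per character field

print(book_flat)
```

A character with no lastname field ends up with NA in that column, which matches the behaviour seen for Chani in the table above.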
XML
Reading XML is straightforward using read_xml; however, we’ll need to extract each element individually as its own column in order to create a tidy data frame.
# Read in the raw XML
hobbit_xml <- read_xml("hobbit.xml")

# The xml2 functions below work directly on the object returned by read_xml,
# so no separate XML::xmlParse step is needed

# Extract each section into new objects
title <- xml_text(xml_find_all(hobbit_xml, "//title"))
author <- xml_text(xml_find_all(hobbit_xml, "//author"))
genre <- xml_text(xml_find_all(hobbit_xml, "//genre"))

# Assemble the column objects into a tibble
hobbit_df <- tibble(title, author, genre)

kable(hobbit_df)
title | author | genre |
---|---|---|
The Hobbit | Tolkien, John Ronald Reuel | High fantasy |
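When a file has many elements, repeating the same extraction line per column becomes tedious. The following is a minimal sketch of one way to loop over the element names instead, assuming each element appears exactly once; the inline XML string and the names xml_input, fields, and book_df are illustrative, and the real hobbit.xml may be structured differently.

```r
library(xml2)
library(tibble)

# Hypothetical stand-in for hobbit.xml; the real file's structure is assumed
xml_input <- "<book>
  <title>The Hobbit</title>
  <author>Tolkien, John Ronald Reuel</author>
  <genre>High fantasy</genre>
</book>"

# read_xml accepts a literal XML string as well as a file path
doc <- read_xml(xml_input)

# Element names to pull out as columns
fields <- c("title", "author", "genre")

# Extract the text of the first match for each element name
values <- lapply(fields, function(f) {
  xml_text(xml_find_first(doc, paste0("//", f)))
})
names(values) <- fields

# A named list of length-one vectors becomes a one-row tibble
book_df <- as_tibble(values)
print(book_df)
```

This sketch uses xml_find_first rather than xml_find_all, so it would silently keep only the first match if an element were repeated.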
HTML
HTML is the least structured of the three formats. As with XML, we must extract each “column” individually, here from the attributes of the HTML div element, then combine the resulting objects into a single tibble.
# Read in the raw HTML
html <- read_html("khaneman.html")

# Pull each attribute of the first div into its own object
title <- html_element(html, "div") %>% html_attr("title")
author <- html_element(html, "div") %>% html_attr("author")
publisher <- html_element(html, "div") %>% html_attr("publisher")
edition <- html_element(html, "div") %>% html_attr("edition")
publishing_date <- html_element(html, "div") %>% html_attr("publishing_date")
isbn <- html_element(html, "div") %>% html_attr("isbn")

# Assemble the objects into a tibble with display-friendly column names
khaneman <- tibble(
  Title = title,
  Author = author,
  Publisher = publisher,
  Edition = edition,
  Publishing_Date = publishing_date,
  ISBN = isbn
)

kable(khaneman)
Title | Author | Publisher | Edition | Publishing_Date | ISBN |
---|---|---|---|---|---|
Thinking, Fast and Slow | Khaneman, Daniel | Farrar, Straus, and Giroux | First Edition | 2013 | 0374533555 |
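Since every column here comes from an attribute of the same div, the attributes can also be collected in one call. The following is a minimal sketch using rvest's html_attrs, assuming the data is stored as attributes on a single div as the code above suggests; the inline HTML string (with only three of the attributes) and the names html_input, page, attrs, and book_df are illustrative.

```r
library(rvest)
library(tibble)

# Hypothetical stand-in for khaneman.html; the real file's markup is assumed
html_input <- '<html><body>
  <div title="Thinking, Fast and Slow" author="Khaneman, Daniel" isbn="0374533555"></div>
</body></html>'

# read_html accepts a literal HTML string as well as a file path
page <- read_html(html_input)

# html_attrs returns every attribute of the node as a named character vector
attrs <- html_attrs(html_element(page, "div"))

# Convert the named vector into a one-row tibble, one column per attribute
book_df <- as_tibble(as.list(attrs))
print(book_df)
```

The column names come straight from the attribute names, so renaming them to Title, Author, and so on would be a separate step.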