Reading Books

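library(tidyverse)  # tibble, tidyr (unnest_*), and the pipe
library(XML)
library(xml2)       # read_xml(), xml_find_all(), xml_text()
library(rvest)      # read_html(), html_element(), html_attr()
library(jsonlite)   # fromJSON()
library(knitr)      # kable()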

Introduction

In this notebook we will introduce reading less-than-tidy data commonly found on the web, and demonstrate the steps needed to get each format into proper tidy shape.

We’ll work through a JSON file, an XML file, and an HTML file, each containing data on a different book.

JSON

Reading JSON is relatively straightforward using the jsonlite library; however, the result comes back as a nested list rather than a table. We can turn it into a tibble by unnesting each level of the list into its own set of columns.

dune_list <- jsonlite::fromJSON("dune.json", flatten = TRUE)

dune_tibble <- tibble(book = list(dune_list$book))

dune_tibble_flat <- dune_tibble |> 
  unnest_wider(book) |> 
  unnest_longer(characters) |> 
  unnest_wider(characters)

kable(dune_tibble_flat)

title	author	genre	firstname	lastname
Dune	Herbert, Frank	Science Fiction	Paul	Atreides
Dune	Herbert, Frank	Science Fiction	Vladmir	Harkonnen
Dune	Herbert, Frank	Science Fiction	Chani	NA
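
To see what each unnesting step contributes without needing the file on disk, here is a minimal sketch that runs the same pipeline on an inline JSON string. The string is a hypothetical stand-in whose layout is assumed from the output above, not the actual contents of dune.json. It also passes simplifyVector = FALSE (a choice made for this sketch, not the call used above) so that every level stays a plain list.

```r
library(jsonlite)
library(tibble)
library(tidyr)

# Hypothetical stand-in for dune.json; the real file's layout is assumed
json_text <- '{
  "book": {
    "title": "Dune",
    "author": "Herbert, Frank",
    "genre": "Science Fiction",
    "characters": [
      {"firstname": "Paul", "lastname": "Atreides"},
      {"firstname": "Chani"}
    ]
  }
}'

# Keep every level as a plain list instead of letting jsonlite simplify it
book_list <- fromJSON(json_text, simplifyVector = FALSE)

book_flat <- tibble(book = list(book_list$book)) |>
  unnest_wider(book) |>        # one column per top-level field
  unnest_longer(characters) |> # one row per character
  unnest_wider(characters)     # one column per character field

book_flat
```

A character with no lastname key ends up with NA in that column, which matches the Chani row in the table above.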

XML

Reading XML is straightforward using read_xml; however, we’ll need to extract each element individually as its own column in order to create a tidy data frame.

# Read in the XML; read_xml() already returns a parsed document,
# so no separate parsing step is needed before querying it
hobbit_xml <- read_xml("hobbit.xml")

# Extract each section into new objects
title <- xml_text(xml_find_all(hobbit_xml, "//title"))
author <- xml_text(xml_find_all(hobbit_xml, "//author"))
genre <- xml_text(xml_find_all(hobbit_xml, "//genre"))

# Assemble the column objects into a tibble
hobbit_df <- tibble(title, author, genre)

kable(hobbit_df)

title	author	genre
The Hobbit	Tolkien, John Ronald Reuel	High fantasy
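
When every child of the root element should become a column, the per-field extraction can also be done in one pass. The sketch below is one possible variation, not the approach used above. It assumes a flat layout with a single root element (called book here, a hypothetical name) whose children are title, author, and genre, and it reads an inline string instead of hobbit.xml.

```r
library(xml2)
library(tibble)

# Hypothetical stand-in for hobbit.xml; the root element name is assumed
xml_string <- '<book>
  <title>The Hobbit</title>
  <author>Tolkien, John Ronald Reuel</author>
  <genre>High fantasy</genre>
</book>'

doc <- read_xml(xml_string)

# Child elements of the root node, one per field
fields <- xml_children(doc)

# Use each element's tag name as the column name and its text as the value
values <- setNames(as.list(xml_text(fields)), xml_name(fields))
hobbit_df <- as_tibble(values)

hobbit_df
```

This only works cleanly when the fields are simple leaf elements; nested or repeated elements would still need their own XPath queries.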

HTML

HTML is the least structured of the three formats. As with XML, we must extract each “column” individually, here from the attributes of an HTML element, and then combine the pieces into a full tibble.

html <- read_html("khaneman.html")

# Select the first <div> once; the book's fields are stored as its attributes
book_div <- html_element(html, "div")

title <- html_attr(book_div, "title")
author <- html_attr(book_div, "author")
publisher <- html_attr(book_div, "publisher")
edition <- html_attr(book_div, "edition")
publishing_date <- html_attr(book_div, "publishing_date")
isbn <- html_attr(book_div, "isbn")

khaneman <- tibble(
  Title = title,
  Author = author,
  Publisher = publisher,
  Edition = edition,
  Publishing_Date = publishing_date,
  ISBN = isbn
)

kable(khaneman)

Title	Author	Publisher	Edition	Publishing_Date	ISBN
Thinking, Fast and Slow	Khaneman, Daniel	Farrar, Straus, and Giroux	First Edition	2013	0374533555
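
Since every field here lives in an attribute of the same element, rvest's html_attrs() can pull them all at once instead of one html_attr() call per field. The sketch below is an alternative under stated assumptions. The inline markup is a hypothetical stand-in with only three of the attributes, copied from the output above, and it keeps the attribute names as column names instead of renaming them.

```r
library(rvest)
library(tibble)

# Hypothetical stand-in for khaneman.html; the real markup is assumed
html_string <- '<html><body>
  <div title="Thinking, Fast and Slow" edition="First Edition" isbn="0374533555"></div>
</body></html>'

page <- read_html(html_string)
book_div <- html_element(page, "div")

# html_attrs() returns a named character vector of all attributes on the node
attrs <- html_attrs(book_div)

# Turn the named vector into a one-row tibble, one column per attribute
book_df <- as_tibble(as.list(attrs))

book_df
```

The explicit version above is still the better fit when the columns need different names or only some attributes are wanted.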