# Read the HTML file, creating a list of length 1 table (html file contained a single table)
html_tables_list <- read_html("https://raw.githubusercontent.com/naomibuell/DATA607/main/favorite_books.html") |>
html_table(fill = TRUE) # Extract the table using html_table()
html_books_df <-
html_tables_list[[1]] |> # Access table from list of tables from HTML
as.data.frame() |> # Convert the table to a data frame
clean_names()Assignment – Working with XML and JSON in R
Introduction
In this assignment, I pick three of my favorite books on Economics and store information on these books in HTML, XML, and JSON formats. I will load the information from each of the three formats into separate R data frames, and determine whether the three data frames are identical.
Book and Attribute Selection
I chose the following three books and collected title, authors, number of pages, pace, and mood according to StoryGraph, a website for reviewing books, below:
“Freakonomics: A Rogue Economist Explores the Hidden Side of Everything” by Steven D. Levitt and Stephen J. Dubner. It is 320 pages long, fast-paced, and has an informative mood.
“The Big Short: Inside the Doomsday Machine” by Michael Lewis. This book is 315 pages long, slow-paced, and has an informative mood.
“Manias, Panics, and Crashes: A History of Financial Crises” by Charles P. Kindleberger. This book is 304 pages long, is slow-paced, and has a challenging mood.
Next, I will take the information above, and separately create three files which store the book’s information in HTML, XML, and JSON formats.
HTML
I wrote HTML code (HTML table rendered below) and saved it online as file called “favorite_books.html” on my GitHub.
| Title | Authors | Pages | Pace | Mood |
| Freakonomics | Steven D. Levitt, Stephen J. Dubner | 315 | Slow-paced | Informative |
| The Big Short | Michael Lewis | 320 | Fast-paced | Informative |
| Manias, Panics, and Crashes | Charles P. Kindleberger | 304 | Slow-paced | Challenging |
XML
Next, I do the same thing, but create an XML file. The XML content is displayed below, but is saved as a separate file on GitHub.
<books>
<book>
<title>Freakonomics</title>
<authors>Steven D. Levitt, Stephen J. Dubner</authors>
<pages>315</pages>
<pace>Slow-paced</pace>
<mood>Informative</mood>
</book>
<book>
<title>The Big Short</title>
<authors>Michael Lewis</authors>
<pages>320</pages>
<pace>Fast-paced</pace>
<mood>Informative</mood>
</book>
<book>
<title>Manias, Panics, and Crashes</title>
<authors>Charles P. Kindleberger</authors>
<pages>304</pages>
<pace>Slow-paced</pace>
<mood>Challenging</mood>
</book>
</books>JSON
Lastly, I created the same table as a JSON file and saved it online.
{
"books": [
{
"title": "Freakonomics",
"authors": "Steven D. Levitt, Stephen J. Dubner",
"pages": 315,
"pace": "Slow-paced",
"mood": "Informative"
},
{
"title": "The Big Short",
"authors": "Michael Lewis",
"pages": 320,
"pace": "Fast-paced",
"mood": "Informative"
},
{
"title": "Manias, Panics, and Crashes",
"authors": "Charles P. Kindleberger",
"pages": 304,
"pace": "Slow-paced",
"mood": "Challenging"
}
]
}Load Files in R
The following R code uses rvest, xml2, and jsonlite packages to load the information from each of the three file sources into separate R data frames.
First, I load the HTML file using xml2::read_html() and rvest::html_table() commands. I save it as a data frame in R called html_books_df.
Next, I use the xml2 package to lad the XML file to an R data frame called xml_books_df:
xml_books_data <- read_xml("https://raw.githubusercontent.com/naomibuell/DATA607/main/favorite_books.xml")
# Extract data from the XML file to create a data frame. Each column is extracted from the XML one at a time as a character vector, then combined as one df.
xml_books_df <- data.frame(
title = xml_books_data |> # For the tittle column,
xml_find_all("//book/title") |> # get all title elements (within book elements),
xml_text(), # and extract text (as string vector).
authors = xml_books_data |>
xml_find_all("//book/authors") |>
xml_text(),
pages = xml_books_data |>
xml_find_all("//book/pages") |>
xml_text() |>
as.integer(), # change to integer vector.
pace = xml_books_data |>
xml_find_all("//book/pace") |>
xml_text(),
mood = xml_books_data |>
xml_find_all("//book/mood") |>
xml_text()
)Lastly, I do the same for the JSON file using the jsonlite::fromJSON command. The new data frame is called json_books_df.
# Read the JSON file into a data frame
json_books_df <- fromJSON("https://raw.githubusercontent.com/naomibuell/DATA607/main/favorite_books.json") |>
as.data.frame() |>
clean_names()Conclusion
Here are the final three data frames:
head(html_books_df) title authors pages
1 Freakonomics Steven D. Levitt, Stephen J. Dubner 315
2 The Big Short Michael Lewis 320
3 Manias, Panics, and Crashes Charles P. Kindleberger 304
pace mood
1 Slow-paced Informative
2 Fast-paced Informative
3 Slow-paced Challenging
head(xml_books_df) title authors pages
1 Freakonomics Steven D. Levitt, Stephen J. Dubner 315
2 The Big Short Michael Lewis 320
3 Manias, Panics, and Crashes Charles P. Kindleberger 304
pace mood
1 Slow-paced Informative
2 Fast-paced Informative
3 Slow-paced Challenging
head(json_books_df) books_title books_authors books_pages
1 Freakonomics Steven D. Levitt, Stephen J. Dubner 315
2 The Big Short Michael Lewis 320
3 Manias, Panics, and Crashes Charles P. Kindleberger 304
books_pace books_mood
1 Slow-paced Informative
2 Fast-paced Informative
3 Slow-paced Challenging
The three data frames weren’t all identical when loaded from GitHub as their original file types into R. I had to use different techniques on each to get them to have the proper shape, variable names, and variable types, depending on original file type. Ultimately, they were all the same shape (3 obs. of 5 variables), but I manually created this shape for the XML data frame column by column (by compiling 5 vectors of length 3, one for each title column, authors column, etc.), while the other 2 files had that shape immediately when loading using the HTML or JSON packages. Variable names differed–the JSON command automatically added the “book” element as a prefix to each variable name, while XML and HTML variables did not include any prefixes like this. For the pages variable, this was originally a string variable when extracted from XML, which I updated to integer using the as.integer() command, but for HTML and JSON files, these were automatically loaded as integers.