DATA 607, HW 07 – XML and JSON in R

Description

In this assignment, I recorded information for three different books, stored the information in an HTML file, XML file, and JSON file, and read the files into R to see how they compare.

These are the three books I chose (from my bookshelf):

The Dictator’s Handbook, by Bruce Bueno de Mesquita and Alastair Smith (2011)
King Leopold’s Ghost, by Adam Hochschild (1998)
The Complete Maus (Maus #1-2), by Art Spiegelman (1986)

I recorded each book’s title, author(s), year of publication, genre, and Goodreads rating in three files: books.html, books.xml, and books.json.

I found this article from DataCamp very helpful in completing the assignment.

library(XML)         # To read HTML and XML files
library(RCurl)       # To get URL data
library(jsonlite)    # To read JSON files
library(knitr)       # To build HTML tables with kable()
library(kableExtra)  # To style the tables with kable_styling()

Reading HTML into R

# Link to HTML file
u1 <- "https://raw.githubusercontent.com/koffeeya/msds/master/DATA%20607%20Data%20Acquisition%20and%20Management/Assignments/Assignment%2007/books.html"

# Get URL data
urldata <- getURL(u1)

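# Read in the HTML file. readHTMLTable() returns a list with one data frame per table;
# this file contains a single table, so the list has one element
raw_html <- readHTMLTable(urldata, stringsAsFactors = FALSE)

# Convert the one-element list into an R data frame
# (the list element's name, "NULL", becomes a prefix on the column names)
html <- data.frame(raw_html)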

html %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

NULL.id	NULL.Title	NULL.Author	NULL.Year	NULL.Genre	NULL.Rating
1	The Dictator’s Handbook	Bruce Bueno de Mesquita, Alastair Smith	2011	Politics	4.27
2	King Leopold’s Ghost	Adam Hochschild	1998	History	4.15
3	The Complete Maus (Maus #1-2)	Art Spiegelman	1986	Graphic Novel	4.53
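
The NULL. prefix can be avoided by pulling the data frame out of the list instead of wrapping the whole list in data.frame(). The following is a minimal sketch of that idea, assuming a small inline table as a stand-in for books.html (the HTML string is illustrative, not the actual file contents):

```r
library(XML)

# Illustrative stand-in for books.html (not the actual file contents)
html_text <- "<html><body><table>
  <tr><th>id</th><th>Title</th><th>Year</th></tr>
  <tr><td>1</td><td>King Leopold's Ghost</td><td>1998</td></tr>
</table></body></html>"

doc <- htmlParse(html_text, asText = TRUE)

# readHTMLTable() returns a list with one element per table in the document
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Extracting the first element keeps the table's own column names
html_df <- tables[[1]]
names(html_df)

# Alternatively, ask for a single table up front with `which`
html_df2 <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
names(html_df2)
```

Either form should give column names without a prefix, which makes the HTML result easier to compare with the other two formats.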

Reading XML into R

# Link to XML file
u2 <- "https://raw.githubusercontent.com/koffeeya/msds/master/DATA%20607%20Data%20Acquisition%20and%20Management/Assignments/Assignment%2007/books.xml"

# Get URL data
raw_xml <- getURL(u2)

# Parse the XML data from the URL
xml_file <- xmlParse(raw_xml)

# Convert the XML data into an R dataframe
xml_df <- xmlToDataFrame(xml_file, stringsAsFactors = FALSE)

xml_df %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

title	author1	author2	year	genre	rating
The Dictator’s Handbook	Bruce Bueno de Mesquita	Alastair Smith	2011	Politics	4.27
King Leopold’s Ghost	Adam Hochschild		1998	History	4.15
The Complete Maus (Maus #1-2)	Art Spiegelman		1986	Graphic Novel	4.53
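
With stringsAsFactors = FALSE, xmlToDataFrame() returns every column as character unless colClasses is supplied, so year and rating need converting before any numeric work. Here is a short sketch, assuming a simplified inline document that reuses the element names from the table above (the root and record element names, books and book, are hypothetical):

```r
library(XML)

# Simplified stand-in for books.xml; <books> and <book> are assumed names
xml_text <- paste0(
  "<books>",
  "<book><title>King Leopold's Ghost</title><author1>Adam Hochschild</author1>",
  "<year>1998</year><rating>4.15</rating></book>",
  "<book><title>The Complete Maus (Maus #1-2)</title><author1>Art Spiegelman</author1>",
  "<year>1986</year><rating>4.53</rating></book>",
  "</books>"
)

doc <- xmlParse(xml_text, asText = TRUE)
df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Inspect the column types before conversion
sapply(df, class)

# Convert the numeric-looking columns explicitly
df$year <- as.integer(df$year)
df$rating <- as.numeric(df$rating)
sapply(df, class)
```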

Reading JSON into R

# Link to JSON file
u3 <- "https://raw.githubusercontent.com/koffeeya/msds/master/DATA%20607%20Data%20Acquisition%20and%20Management/Assignments/Assignment%2007/books.json"

# Get the JSON data from the URL
json_file <- fromJSON(u3)

# Convert the JSON data into an R dataframe
json_df <- data.frame(json_file, stringsAsFactors = FALSE)

json_df %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

books.Title	books.Author1	books.Author2	books.Year	books.Genre	books.Rating
The Dictator’s Handbook	Bruce Bueno de Mesquita	Alastair Smith	2011	Politics	4.27
King Leopold’s Ghost	Adam Hochschild		1998	History	4.15
The Complete Maus (Maus #1-2)	Art Spiegelman		1986	Graphic Novel	4.53
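
The books. prefix suggests that the JSON file wraps its records in a top-level "books" key, so fromJSON() returns a list whose single element is already a data frame. Below is a minimal sketch under that assumption (the inline JSON is a shortened stand-in, not the actual file):

```r
library(jsonlite)

# Shortened stand-in for books.json; the top-level "books" key is an assumption
json_text <- '{"books": [
  {"Title": "King Leopold’s Ghost", "Year": 1998, "Rating": 4.15},
  {"Title": "The Complete Maus (Maus #1-2)", "Year": 1986, "Rating": 4.53}
]}'

parsed <- fromJSON(json_text)

# Wrapping the whole list in data.frame() prefixes each column with the key
names(data.frame(parsed))

# Extracting the element directly keeps the original field names
books_df <- parsed$books
names(books_df)
sapply(books_df, class)
```

Unlike the HTML and XML readers, fromJSON() keeps JSON numbers as numeric columns, so no type conversion is needed here.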

Comparison

The three data frames ended up very similar but not identical, and each format required slightly different steps to read into R.

The JSON file required the fewest steps to read, because fromJSON() accepts the URL directly. The HTML and XML files each needed three steps: download with getURL(), parse, and convert to a data frame. XML also had stricter rules regarding special characters and naming, so my file needed a few rounds of corrections before R could read it.
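
The exact corrections the XML file needed are not recorded here, but two common causes of parse failures are an unescaped ampersand and an element name containing a space. The sketch below checks both with a hypothetical helper, is_well_formed(), which is not part of the XML package:

```r
library(XML)

# Hypothetical helper: TRUE if xmlParse() accepts the text, FALSE if it errors
is_well_formed <- function(xml_text) {
  tryCatch({
    xmlParse(xml_text, asText = TRUE)
    TRUE
  }, error = function(e) FALSE)
}

# Unescaped ampersand in element text
bad_amp <- is_well_formed("<book><title>Q & A</title></book>")

# Same text with the ampersand escaped as &amp;
good_amp <- is_well_formed("<book><title>Q &amp; A</title></book>")

# Space inside an element name
bad_name <- is_well_formed("<book><pub year>1998</pub year></book>")

c(bad_amp = bad_amp, good_amp = good_amp, bad_name = bad_name)
```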

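The column names differ by file, reflecting the structure of each format. The HTML columns carry a NULL. prefix (the name readHTMLTable() gives a table without an id), the XML columns use the lowercase element names, and the JSON columns carry a books. prefix (the name of the list element returned by fromJSON()). The HTML table also keeps an id column and a single Author column, whereas the XML and JSON versions split the authors into two columns.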
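
To compare the data frames directly, the column names can be normalized first. Here is a sketch with a hypothetical helper, clean_names(), applied to toy data frames that only mimic the naming patterns from the three outputs:

```r
# Hypothetical helper: drop everything up to the first dot, then lower-case
clean_names <- function(df) {
  names(df) <- tolower(sub("^[^.]*\\.", "", names(df)))
  df
}

# Toy data frames that mimic the three naming patterns
html_like <- data.frame(NULL.Title = "Maus", NULL.Year = "1986", stringsAsFactors = FALSE)
xml_like  <- data.frame(title = "Maus", year = "1986", stringsAsFactors = FALSE)
json_like <- data.frame(books.Title = "Maus", books.Year = 1986, stringsAsFactors = FALSE)

frames <- lapply(list(html = html_like, xml = xml_like, json = json_like), clean_names)

# All three now share the same column names
sapply(frames, names)
```

The real data frames would still differ after renaming, since the HTML version has an id column and a single Author column, and only the JSON version stores year and rating as numbers.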