Working with HTML, XML and JSON in R

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books3.xml”, and “books3.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Links to books.html, books3.xml, and books3.json:

books.html: https://raw.githubusercontent.com/jcp9010/MSDA/master/books.html

books3.xml: https://raw.githubusercontent.com/jcp9010/MSDA/master/books3.xml

books3.json: https://raw.githubusercontent.com/jcp9010/MSDA/master/books3.json

The raw GitHub files were served through http://www.rawgit.com, which delivers them with the proper Content-Type headers.

Load libraries:

library(RCurl)
library(XML)
library(jsonlite)
library(data.table)

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames.

# Assigning html, xml, json addresses to local variables
html.url <- "http://cdn.rawgit.com/jcp9010/MSDA/master/books.html"
xml.url <- "http://cdn.rawgit.com/jcp9010/MSDA/master/books3.xml"
json.url <- "http://cdn.rawgit.com/jcp9010/MSDA/master/books3.json"

# Parsing the data
comic.html <- htmlParse(file = html.url)
comic.xml <- xmlParse(xml.url)
comic.json <- fromJSON(json.url)

# Converting a parsed HTML file into a data.frame
comic.html.df <- as.data.frame(readHTMLTable(comic.html))
comic.html.df
##                                                    NULL.Book.Name
## 1                                  The Complete Calvin and Hobbes
## 2                              The Walking Dead Compendium Vol. 1
## 3 Peanuts 2000: The 50th Year of the World's Favorite Comic Strip
##                                          NULL.Author
## 1                                     Bill Watterson
## 2 Robert Kirkman, Charlie Adlard, and Cliff Rathburn
## 3                                  Charles M. Schulz
##              NULL.Publisher NULL.Year.Published NULL.Pages NULL.Language
## 1 Andrews McMeel Publishing                2012       1456       English
## 2                  Skybound                2016       1088       English
## 3          Ballantine Books                2000        176       English
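
The `NULL.` prefix in the column names above appears because `readHTMLTable()` returns a list of tables when given a whole document, and `as.data.frame()` then flattens that list. A minimal sketch of one way around it is to request the first table directly with `which = 1`. The inline HTML below is a hypothetical two-column stand-in for books.html (it assumes the header row uses `<th>` cells), and `stringsAsFactors = FALSE` is an optional extra that keeps text as character.

```r
library(XML)

# Hypothetical stand-in for books.html (two columns only), built as one string
html.text <- paste0(
  "<html><body><table>",
  "<tr><th>Book Name</th><th>Pages</th></tr>",
  "<tr><td>The Complete Calvin and Hobbes</td><td>1456</td></tr>",
  "<tr><td>The Walking Dead Compendium Vol. 1</td><td>1088</td></tr>",
  "<tr><td>Peanuts 2000: The 50th Year of the World's Favorite Comic Strip</td><td>176</td></tr>",
  "</table></body></html>"
)

# Parse the string itself rather than treating it as a file name or URL
doc <- htmlParse(html.text, asText = TRUE)

# which = 1 returns the first table as a data frame instead of a list of tables
books.html.df <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
names(books.html.df)
```

With a single data frame returned, there is no list element name left to be pasted in front of each column name.
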
# Converting a parsed XML file into a data.frame
root <- xmlRoot(comic.xml)
comic.xml.df <- xmlToDataFrame(root)
comic.xml.df
##                                                         Book_Name
## 1                                  The Complete Calvin and Hobbes
## 2                              The Walking Dead Compendium Vol. 1
## 3 Peanuts 2000: The 50th Year of the World's Favorite Comic Strip
##                                           Author                 Publisher
## 1                                 Bill Watterson Andrews McMeel Publishing
## 2 Robert Kirkman, Charlie Adlard, Cliff Rathburn                  Skybound
## 3                              Charles M. Schulz          Ballantine Books
##   Year_Published Pages Language
## 1           2012  1456  English
## 2           2016  1088  English
## 3           2000   176  English
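
`xmlToDataFrame()` reads every field as text, so numeric fields such as Pages need an explicit conversion. The following is a rough sketch under assumptions: the element names `books` and `book` are hypothetical (the original books3.xml layout is not shown here), only two columns are reproduced, and the parsed document is passed directly rather than its root node.

```r
library(XML)

# Hypothetical stand-in for books3.xml; <books> and <book> are assumed names
xml.text <- paste0(
  "<books>",
  "<book><Book_Name>The Complete Calvin and Hobbes</Book_Name><Pages>1456</Pages></book>",
  "<book><Book_Name>The Walking Dead Compendium Vol. 1</Book_Name><Pages>1088</Pages></book>",
  "</books>"
)

# Parse the string itself rather than treating it as a file name or URL
doc <- xmlParse(xml.text, asText = TRUE)

# Keep text columns as character instead of factor
books.xml.df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Every field arrives as text, so convert the numeric column explicitly
books.xml.df$Pages <- as.integer(books.xml.df$Pages)
str(books.xml.df)
```
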
# Converting comic.json (list) into a data.frame
comic.json.df <- as.data.frame(comic.json)
comic.json.df
##                                                   books.Book_Name
## 1                                               Calvin and Hobbes
## 2                              The Walking Dead Compendium Vol. 1
## 3 Peanuts 2000: The 50th Year of the World's Favorite Comic Strip
##                                     books.Author           books.Publisher
## 1                                 Bill Watterson Andrews McMeel Publishing
## 2 Robert Kirkman, Charlie Adlard, Cliff Rathburn                  Skybound
## 3                              Charles M. Schulz          Ballantine Books
##   books.Year books.Pages books.Language
## 1       2012        1456        English
## 2       2016        1088        English
## 3       2000         176        English
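
The `books.` prefix in the column names suggests that the JSON file wraps its records in a top-level `books` array, so `fromJSON()` returns a list holding one data frame. As a small sketch under that assumption, extracting the list element directly gives unprefixed column names. The JSON text below is a hypothetical excerpt with three of the six fields.

```r
library(jsonlite)

# Hypothetical excerpt of books3.json; the top-level "books" key is assumed
json.text <- '{"books": [
  {"Book_Name": "Calvin and Hobbes", "Year": 2012, "Pages": 1456},
  {"Book_Name": "The Walking Dead Compendium Vol. 1", "Year": 2016, "Pages": 1088}
]}'

parsed <- fromJSON(json.text)

# The array of records is simplified to a data frame stored under $books
books.json.df <- parsed$books
str(books.json.df)
```

Because JSON distinguishes numbers from strings, Year and Pages arrive as integers without any extra conversion.
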

Are the three data frames identical?

We can compare the structure of each data frame with str() to see whether they are identical.

str(comic.html.df)
## 'data.frame':    3 obs. of  6 variables:
##  $ NULL.Book.Name     : Factor w/ 3 levels "Peanuts 2000: The 50th Year of the World's Favorite Comic Strip",..: 2 3 1
##  $ NULL.Author        : Factor w/ 3 levels "Bill Watterson",..: 1 3 2
##  $ NULL.Publisher     : Factor w/ 3 levels "Andrews McMeel Publishing",..: 1 3 2
##  $ NULL.Year.Published: Factor w/ 3 levels "2000","2012",..: 2 3 1
##  $ NULL.Pages         : Factor w/ 3 levels "1088","1456",..: 2 1 3
##  $ NULL.Language      : Factor w/ 1 level "English": 1 1 1
str(comic.xml.df)
## 'data.frame':    3 obs. of  6 variables:
##  $ Book_Name     : Factor w/ 3 levels "Peanuts 2000: The 50th Year of the World's Favorite Comic Strip",..: 2 3 1
##  $ Author        : Factor w/ 3 levels "Bill Watterson",..: 1 3 2
##  $ Publisher     : Factor w/ 3 levels "Andrews McMeel Publishing",..: 1 3 2
##  $ Year_Published: Factor w/ 3 levels "2000","2012",..: 2 3 1
##  $ Pages         : Factor w/ 3 levels "1088","1456",..: 2 1 3
##  $ Language      : Factor w/ 1 level "English": 1 1 1
str(comic.json.df)
## 'data.frame':    3 obs. of  6 variables:
##  $ books.Book_Name: chr  "Calvin and Hobbes" "The Walking Dead Compendium Vol. 1" "Peanuts 2000: The 50th Year of the World's Favorite Comic Strip"
##  $ books.Author   : chr  "Bill Watterson" "Robert Kirkman, Charlie Adlard, Cliff Rathburn" "Charles M. Schulz"
##  $ books.Publisher: chr  "Andrews McMeel Publishing" "Skybound" "Ballantine Books"
##  $ books.Year     : int  2012 2016 2000
##  $ books.Pages    : int  1456 1088 176
##  $ books.Language : chr  "English" "English" "English"

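The structures show that the HTML and XML data frames match each other apart from their column names (e.g. NULL.Book.Name vs. Book_Name): every column in both is a factor. The JSON data frame differs: its text columns are chr, its Year and Pages columns are int, and its column names carry a books. prefix. The printed values also differ slightly (the JSON title reads "Calvin and Hobbes", and the HTML author list includes "and"). The three data frames are therefore not identical.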
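
The comparison can also be done in code instead of by eye. The sketch below rebuilds two of the columns by hand from the printed output (it does not reload the files), and `normalize()` is a hypothetical helper that aligns column names and types before `identical()` is applied.

```r
# Rebuilt by hand from the printed output above (two columns only)
xml.like <- data.frame(
  Book_Name = c("The Complete Calvin and Hobbes",
                "The Walking Dead Compendium Vol. 1"),
  Pages = c("1456", "1088"),
  stringsAsFactors = TRUE
)
json.like <- data.frame(
  books.Book_Name = c("Calvin and Hobbes",
                      "The Walking Dead Compendium Vol. 1"),
  books.Pages = c(1456L, 1088L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: factors to character, shared names, integer page counts
normalize <- function(df, col.names) {
  df[] <- lapply(df, as.character)
  names(df) <- col.names
  df$Pages <- as.integer(df$Pages)
  df
}

shared.names <- c("Book_Name", "Pages")
a <- normalize(xml.like, shared.names)
b <- normalize(json.like, shared.names)

identical(a, b)              # compares names, types, and values together
a$Book_Name == b$Book_Name   # element-wise check of the titles
```

Once names and types are aligned, any remaining mismatch comes from the values themselves, such as the shortened title in the JSON file.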