The packages I will be using for this assignment are XML, rvest, jsonlite, and DT.
XML, rvest, and jsonlite parse the XML, HTML, and JSON data respectively. The DT package displays the dataframes at the end.
library(XML)
library(rvest)
library(jsonlite)
library(DT)
The data will be loaded from three separate files I created for the assignment, one per format. I have tried to replicate the same data as closely as possible across all three formats. Let’s see how that goes.
Loading the HTML file was pretty straightforward. The rvest package has a lot of functionality for reading HTML tables into dataframes.
The first line of code parses the HTML document into an xml_document object that rvest can query. The second line selects the first table in the document and converts it into a dataframe. The third line displays the dataframe.
parsedHTML = read_html(x = "books.html")
booksHTML = html_table(html_nodes(parsedHTML,"table")[[1]])
booksHTML
## Title Authors
## 1 The Moon Is Down John Steinbeck
## 2 Veterinary Guide for Farmers G. W. Stamm, Dallas S. Burch
## 3 Me Talk Pretty One Day David Sedaris
## Publishing Year
## 1 1942
## 2 1950
## 3 2000
## Dedication Text
## 1 TO PAT COVICI A GREAT EDITOR AND A GREAT FRIEND
## 2 Dedicated to the GI Joes who, having fought for our nation, will now help to feed and clothe it better as a result of their "on the farm" training under the auspices of the Veterans Farm Training program of the United States Veterans Administration and Vocational Education leaders of the various states
## 3 For my father, Lou
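To try the same steps without a file on disk, here is a minimal sketch that parses a small inline HTML string. The table contents and the name sample_html are made up for illustration and are not the contents of books.html. It suggests that header text containing a space comes through as the column name unchanged.

```r
library(rvest)

# Hypothetical inline table (an assumption, not the real books.html)
sample_html <- '
<html><body>
<table>
  <tr><th>Title</th><th>Publishing Year</th></tr>
  <tr><td>The Moon Is Down</td><td>1942</td></tr>
  <tr><td>Me Talk Pretty One Day</td><td>2000</td></tr>
</table>
</body></html>'

# Same two steps as above: parse, then convert the first table
parsed <- read_html(sample_html)
books <- html_table(html_nodes(parsed, "table")[[1]])

# The header text, including the space, becomes the column name
names(books)
```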
The HTML table has a few important syntax characteristics that differentiate it from the other formats:
Loading the XML file was similar to loading the HTML file, since the XML package offers functionality comparable to rvest. The code below parses the XML file using the xmlParse function and then transforms the parsed data into a dataframe using the xmlToDataFrame function. The last line displays the dataframe.
parsedXML = xmlParse("books.xml")
booksXML = xmlToDataFrame(parsedXML)
booksXML
## Title Authors PublishingYear
## 1 The Moon is Down John Steinbeck 1942
## 2 Veterinary Guide for Farmers G. W. Stamm, Dallas S. Burch 1950
## 3 Me Talk Pretty One Day David Sedaris 2000
## DedicationText
## 1 TO PAT COVICI A GREAT EDITOR AND A GREAT FRIEND
## 2 Dedicated to the GI Joes who, having fought for our nation, will now help to feed and clothe it better as a result of their "on the farm" training under the auspices of the Veterans Farm Training program of the United States Veterans Administration and Vocational Education leaders of the various states
## 3 For my father, Lou
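The two calls above can also be tried on an inline string. This is a minimal sketch under the assumption of a simple Books/Book layout; the sample records and the name sample_xml are made up for illustration and are not the contents of books.xml. Each child element name of a record ends up as a column name.

```r
library(XML)

# Hypothetical inline document (an assumption, not the real books.xml)
sample_xml <- '<Books>
  <Book><Title>The Moon is Down</Title><PublishingYear>1942</PublishingYear></Book>
  <Book><Title>Me Talk Pretty One Day</Title><PublishingYear>2000</PublishingYear></Book>
</Books>'

# asText = TRUE tells xmlParse the argument is XML content, not a file name
parsed <- xmlParse(sample_xml, asText = TRUE)
books <- xmlToDataFrame(parsed)

# Column names are taken from the element names
names(books)
```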
While the XML table may look the same as the HTML table, there are slight differences between them. For the most part, the HTML and XML formats have the same syntax restrictions, but the way the data is stored differs. In HTML, column names are stored as text between header tags, while in XML, column names are the tag names themselves. Because XML tag names cannot contain spaces, the column names of the XML dataframe cannot have spaces, while the column names of the HTML dataframe can. The syntax characteristics for XML are stated below:
Loading the JSON file only took one line of code, thanks to the jsonlite package. The code below reads in the JSON file, and jsonlite automatically simplifies the array of records stored under the Books key into a dataframe, which $Books extracts. The last line displays the dataframe.
booksJSON = fromJSON(txt = "books.json")$Books
booksJSON
## Title Authors
## 1 The Moon is Down John Steinbeck
## 2 Veterinary Guide for Farmers G. W. Stamm, Dallas S. Burch
## 3 Me Talk Pretty One Day David Sedaris
## Publishing Year
## 1 1942
## 2 1950
## 3 2000
## Dedication Text
## 1 TO PAT COVICI A GREAT EDITOR AND A GREAT FRIEND
## 2 Dedicated to the GI Joes who, having fought for our nation, will now help to feed and clothe it better as a result of their "on the farm" training under the auspices of the Veterans Farm Training program of the United States Veterans Administration and Vocational Education leaders of the various states
## 3 For my father, Lou
The JSON table is probably the most different of the three formats. The main similarity between JSON, HTML, and XML is that the data is stored in a tree structure. However, JSON uses JavaScript object notation (key-value objects and arrays) instead of the tag format used by HTML and XML. This notation allows nested data to be stored very easily, without having to create nested tables as in the tag format. When the data is converted into an R dataframe, the nested elements are displayed separated by commas.
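As a minimal sketch of how nested data comes through, assume the authors are stored as a JSON array inside each record. The records and the name sample_json below are made up for illustration and are not the contents of books.json. With arrays of different lengths, jsonlite keeps them as a list column, which R prints as comma-separated values.

```r
library(jsonlite)

# Hypothetical inline JSON (an assumption, not the real books.json)
sample_json <- '{"Books": [
  {"Title": "Veterinary Guide for Farmers",
   "Authors": ["G. W. Stamm", "Dallas S. Burch"]},
  {"Title": "The Moon is Down",
   "Authors": ["John Steinbeck"]}
]}'

books <- fromJSON(sample_json)$Books

# Authors is a list column: one character vector per row
class(books$Authors)
lengths(books$Authors)
```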
Another small difference between JSON and HTML/XML is that double quotes inside data elements must be escaped. JSON strings are delimited by double quotes, so an unescaped double quote inside a string is interpreted as the end of that string. To avoid this problem, each inner quote is preceded by a backslash escape character.
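The escaping rule can be checked with a short sketch. The one-field records below are made up for illustration, borrowing a phrase from the dedication text; the names good_json and bad_json are hypothetical. The R strings use single quotes, and each doubled backslash in the R source reaches the JSON parser as a single backslash.

```r
library(jsonlite)

# Inner double quotes escaped with a backslash: valid JSON
good_json <- '{"Dedication Text": "their \\"on the farm\\" training"}'

# Same text without the escapes: the string ends too early
bad_json <- '{"Dedication Text": "their "on the farm" training"}'

parsed <- fromJSON(good_json)
parsed[["Dedication Text"]]

# validate() reports whether a string is well-formed JSON
validate(good_json)
validate(bad_json)
```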
The syntax characteristics of JSON are stated below:
The following question was posed in the assignment instructions: Are the three dataframes identical?
The short answer is no.
The dataframes are very similar, but each format leaves its own mark on the result. HTML uses the tag format and stores column names as text between header tags, which allows column names to have spaces. XML also uses the tag format, but stores column names as the tag names themselves, which prevents column names from having spaces. JSON uses JavaScript object notation, where column names and values are stored as quoted strings. These strings allow spaces and commas, but need escape characters for double quotes. JSON also stores nested data very easily compared to HTML and XML, where I would have needed nested tables.
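The comparison can also be made programmatically. The sketch below uses small hand-built stand-ins rather than the real booksHTML, booksXML, and booksJSON objects; the names from_html, from_xml, and from_json are hypothetical, and the column types are assumptions about what each parser returns.

```r
# Hypothetical one-row stand-ins for the three dataframes
from_html <- data.frame(Title = "The Moon Is Down",
                        `Publishing Year` = 1942L,
                        check.names = FALSE, stringsAsFactors = FALSE)
from_xml <- data.frame(Title = "The Moon is Down",
                       PublishingYear = "1942",
                       stringsAsFactors = FALSE)
from_json <- data.frame(Title = "The Moon is Down",
                        `Publishing Year` = 1942L,
                        check.names = FALSE, stringsAsFactors = FALSE)

# identical() is strict: names, types, and values must all match
identical(from_html, from_xml)
identical(from_html, from_json)

# Column names match between HTML and JSON, but not XML
identical(names(from_html), names(from_json))
setdiff(names(from_xml), names(from_html))
```

Since identical() only returns a single TRUE or FALSE, comparing names() and individual columns separately is one way to see where the dataframes actually diverge.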
The code below creates a function that displays a dataframe as an interactive DT table. Here you can see the slight differences between the formats.
ShowDataFrame = function(dataframe) {
  # Render the dataframe as a DT table, showing up to 10 rows per page
  return(DT::datatable(dataframe, options = list(pageLength = 10)))
}
ShowDataFrame(booksHTML)
ShowDataFrame(booksXML)
ShowDataFrame(booksJSON)