This assignment demonstrates converting three common data file formats (XML, JSON, and HTML) into R dataframes. The approach draws largely on code examples found through Google searches and on the R documentation for the libraries that perform each conversion. The assignment’s author created each source file by hand; all three files contain bibliographic descriptions of the same three books.
The XML-to-dataframe conversion below uses the ‘XML’ library. According to its documentation, another R library, ‘xml2’, could also perform the conversion, but after some trial and error the ‘XML’ library code proved simpler.
The resulting dataframe represents the three book entries as three rows (observations) with seven columns (variables). For the book with two authors, Freakonomics, and the book with two publishers, Capital, the separate values were concatenated into a single string with no separator. For example, the authors column for Freakonomics contains “Steven D. LevittStephen J. Dubner”, the concatenation of the two authors’ names. To tidy this data, I would recommend introducing a delimiter in the author and publisher fields so that tidyr functions can separate the entries appropriately. Otherwise, the conversion of XML to an R dataframe proved relatively simple and accurate.
library(XML)
library(RCurl)
## Loading required package: bitops
url <- "https://raw.githubusercontent.com/ptanofsky/data607/master/Week07_Assignment/books.xml"
xml_url <- getURL(url)
xml_doc <- xmlParse(xml_url)
xml_df <- xmlToDataFrame(xml_doc)
dim(xml_df)
## [1] 3 7
print(xml_df)
## title
## 1 Moneyball
## 2 Freakonomics
## 3 Capital in the Twenty-First Century
## subtitle
## 1 The Art of Winning an Unfair Game
## 2 A Rogue Economist Explores the Hidden Side of Everything
## 3
## authors
## 1 Michael Lewis
## 2 Steven D. LevittStephen J. Dubner
## 3 Thomas Piketty
## publishers year_published pages
## 1 W. W. Norton & Company 2003 288
## 2 William Morrow 2005 336
## 3 Éditions du SeuilHarvard University Press 2013 696
## isbn
## 1 978-0-393-05765-2
## 2 0-06-123400-1
## 3 978-0674430006
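The delimiter recommendation for the XML conversion can be sketched as follows. This is a minimal illustration, not the code used for the output above: the inline XML and its element names (`<book>`, `<authors>`, `<author>`) are assumptions about how books.xml might be structured, and the "; " delimiter is an arbitrary choice.

```r
library(XML)

# Hypothetical inline XML; the element names are assumptions, not the
# confirmed structure of the books.xml file used above.
xml_text <- '<books>
  <book>
    <title>Moneyball</title>
    <authors><author>Michael Lewis</author></authors>
  </book>
  <book>
    <title>Freakonomics</title>
    <authors>
      <author>Steven D. Levitt</author>
      <author>Stephen J. Dubner</author>
    </authors>
  </book>
</books>'

doc <- xmlParse(xml_text, asText = TRUE)
book_nodes <- getNodeSet(doc, "//book")

# Build one row per book, joining multiple author nodes with "; "
# instead of letting their text run together.
authors_df <- data.frame(
  title = sapply(book_nodes, function(b) xpathSApply(b, "./title", xmlValue)),
  authors = sapply(book_nodes, function(b) {
    paste(xpathSApply(b, "./authors/author", xmlValue), collapse = "; ")
  }),
  stringsAsFactors = FALSE
)
print(authors_df)
```

With a delimiter in place, the authors column could later be split into separate rows or columns with tidyr.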
The JSON-to-dataframe conversion below uses the ‘rjson’ library. Based on the output below, the JSON conversion would require the most transformation and tidying to produce a usable R dataframe. The resulting dataframe consists of just two rows and 21 columns. Instead of translating the array of books in the JSON file into separate rows, the conversion spread the seven fields of each of the three books across their own columns (21 in total). It also appears to have created a second row to accommodate the second author of Freakonomics and the second publisher of Capital, repeating the single-valued fields in both rows. The conversion appears to read each list element as a column, whereas the input JSON file holds each book entry as a separate array element. Tidying techniques, including melting, could transform the resulting dataframe into a more analysis-friendly structure.
library(rjson)
json_url <- "https://raw.githubusercontent.com/ptanofsky/data607/master/Week07_Assignment/books.json"
books_json_inp <- fromJSON(file = json_url)
json_df <- as.data.frame(books_json_inp)
dim(json_df)
## [1] 2 21
print(json_df)
## books.title books.subtitle books.authors
## 1 Moneyball The Art of Winning an Unfair Game Michael Lewis
## 2 Moneyball The Art of Winning an Unfair Game Michael Lewis
## books.publishers books.year_published books.pages
## 1 W. W. Norton & Company 2003 288
## 2 W. W. Norton & Company 2003 288
## books.isbn books.title.1
## 1 978-0-393-05765-2 Freakonomics
## 2 978-0-393-05765-2 Freakonomics
## books.subtitle.1
## 1 A Rogue Economist Explores the Hidden Side of Everything
## 2 A Rogue Economist Explores the Hidden Side of Everything
## books.authors.1 books.publishers.1 books.year_published.1
## 1 Steven D. Levitt William Morrow 2005
## 2 Stephen J. Dubner William Morrow 2005
## books.pages.1 books.isbn.1 books.title.2
## 1 336 0-06-123400-1 Capital in the Twenty-First Century
## 2 336 0-06-123400-1 Capital in the Twenty-First Century
## books.subtitle.2 books.authors.2 books.publishers.2
## 1 Thomas Piketty Éditions du Seuil
## 2 Thomas Piketty Harvard University Press
## books.year_published.2 books.pages.2 books.isbn.2
## 1 2013 696 978-0674430006
## 2 2013 696 978-0674430006
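One possible way to get one row per book from the JSON input is to build a small dataframe for each array element and then stack them. The sketch below is an illustration under assumptions: the inline JSON is a shortened, hypothetical stand-in for books.json (only the top-level `books` key and the field names are taken from the column names in the output above), and multi-valued fields are joined with an arbitrary "; " delimiter.

```r
library(rjson)

# Hypothetical, shortened stand-in for books.json; whether authors are
# stored as arrays in the real file is an assumption.
json_text <- '{"books": [
  {"title": "Moneyball",
   "authors": ["Michael Lewis"],
   "pages": 288},
  {"title": "Freakonomics",
   "authors": ["Steven D. Levitt", "Stephen J. Dubner"],
   "pages": 336}
]}'

parsed <- fromJSON(json_str = json_text)

# Convert each book (one array element) into a one-row dataframe,
# collapsing multi-valued fields into a single delimited string.
book_rows <- lapply(parsed$books, function(b) {
  data.frame(
    title = b$title,
    authors = paste(unlist(b$authors), collapse = "; "),
    pages = b$pages,
    stringsAsFactors = FALSE
  )
})

# Stack the one-row dataframes so that each book is one observation.
json_books_df <- do.call(rbind, book_rows)
print(json_books_df)
```

This keeps each book as a single observation, so the result has the same shape as the XML and HTML conversions rather than a wide dataframe with repeated column sets.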
The HTML-to-dataframe conversion below uses the ‘rvest’ library. Based on the output below, the HTML conversion appears to be on par with the XML conversion. As in the XML example, the result has three rows (observations) and seven columns (variables), just as the file denotes. Again, the fields with two authors and two publishers prove a bit problematic in the resulting R dataframe. The two strings are joined together with the whitespace characters from the HTML source (a newline followed by tabs) included in the output. The overall structure of the R dataframe meets expectations but does require transformation to parse the fields with two concatenated values. A delimiter could be used in these scenarios to allow for easier transformations with the ‘dplyr’ library. The resulting R dataframe is nearly analysis-ready after the initial conversion.
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:XML':
##
## xml
html_url <- "https://raw.githubusercontent.com/ptanofsky/data607/master/Week07_Assignment/books.html"
html_df <- as.data.frame(read_html(html_url) %>% html_table(fill=TRUE))
dim(html_df)
## [1] 3 7
print(html_df)
## Title
## 1 Moneyball
## 2 Freakonomics
## 3 Capital in the Twenty-First Century
## Subtitle
## 1 The Art of Winning an Unfair Game
## 2 A Rogue Economist Explores the Hidden Side of Everything
## 3
## Author
## 1 Michael Lewis
## 2 Steven D. Levitt\n\t\t\t\t\tStephen J. Dubner
## 3 Thomas Piketty
## Publisher Publication.Date
## 1 W. W. Norton & Company 2003
## 2 William Morrow 2005
## 3 Editions du Seuil\n\t\t\t\t\tHarvard University Press 2013
## Pages ISBN
## 1 288 978-0-393-05765-2
## 2 336 0-06-123400-1
## 3 696 978-0674430006
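The cleanup described for the HTML result can be sketched as below. This is a minimal example under assumptions: the small dataframe is typed in by hand to mimic the Author column printed above, the "; " delimiter is an arbitrary choice, and `separate_rows()` from tidyr is only one of several ways to split the values.

```r
library(tidyr)

# Hand-built stand-in that mimics the whitespace seen in the Author
# column of html_df above.
html_like_df <- data.frame(
  Title = c("Moneyball", "Freakonomics"),
  Author = c("Michael Lewis",
             "Steven D. Levitt\n\t\t\t\t\tStephen J. Dubner"),
  stringsAsFactors = FALSE
)

# Replace each newline and the whitespace around it with a delimiter.
html_like_df$Author <- gsub("[[:space:]]*\n[[:space:]]*", "; ",
                            html_like_df$Author)

# Split the delimited values so that each author gets its own row.
long_df <- separate_rows(html_like_df, Author, sep = "; ")
print(long_df)
```

Whether one row per author or one delimited string per book is preferable depends on the analysis; the same approach would apply to the Publisher column.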
The assignment proved an interesting exercise in data conversion with R. Although most raw input files arrive as CSV, raw data can come in many other common formats, including XML, JSON, and HTML. R libraries exist that make converting this raw input to dataframes relatively straightforward, but the conversion itself doesn’t guarantee a tidy dataframe. The exercise shows both the capabilities of these libraries and the inconsistent results of simply converting each file to a dataframe. Additional transformation and tidying would be required to prepare these sample files for data analysis and meaningful plots.