Introduction

This assignment converts three common data file formats (XML, JSON, and HTML) into R dataframes. The approach draws largely on code examples found through Google searches and on the R documentation for the libraries capable of performing each conversion. Each source file contains bibliographic descriptions of the same three books and was created by hand by the assignment’s author.

Read XML

The XML-to-dataframe conversion below uses the ‘XML’ library. According to its documentation, another R library, ‘xml2’, could also perform the conversion, but after several rounds of trial and error the ‘XML’ library code proved simpler.

The conversion turned the three book entries into three rows (observations) with seven columns (variables). For the book with two authors, Freakonomics, and the book with two publishers, Capital, the separate values were concatenated into a single string with no separator. For example, the authors column for Freakonomics came out as “Steven D. LevittStephen J. Dubner”, the concatenation of the two authors’ names. To tidy this data, I would recommend introducing a delimiter in the authors and publishers fields so that tidyr functions can separate the entries. Otherwise, the conversion of XML to an R dataframe proved relatively simple and accurate.

library(XML)
library(RCurl)

## Loading required package: bitops

# Download the raw XML text, parse it, and convert the parsed document to a dataframe
url <- "https://raw.githubusercontent.com/ptanofsky/data607/master/Week07_Assignment/books.xml"
xml_url <- getURL(url)
xml_doc <- xmlParse(xml_url)
xml_df <- xmlToDataFrame(xml_doc)

dim(xml_df)

## [1] 3 7

print(xml_df)

##                                 title
## 1                           Moneyball
## 2                        Freakonomics
## 3 Capital in the Twenty-First Century
##                                                   subtitle
## 1                        The Art of Winning an Unfair Game
## 2 A Rogue Economist Explores the Hidden Side of Everything
## 3                                                         
##                             authors
## 1                     Michael Lewis
## 2 Steven D. LevittStephen J. Dubner
## 3                    Thomas Piketty
##                                  publishers year_published pages
## 1                    W. W. Norton & Company           2003   288
## 2                            William Morrow           2005   336
## 3 Éditions du SeuilHarvard University Press           2013   696
##                isbn
## 1 978-0-393-05765-2
## 2     0-06-123400-1
## 3    978-0674430006
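
The delimiter recommendation above can be sketched without editing the source file, by collapsing repeated child nodes with a separator while building the dataframe. This is only a sketch under stated assumptions: the inline XML and its element names (`book`, `authors`, `author`) are guesses at the layout of books.xml, which is not shown here, and `collapse_children` is a hypothetical helper, not part of the ‘XML’ library.

```r
library(XML)

# Assumed layout: each <book> holds an <authors> element with one or more
# <author> children. The real books.xml may be structured differently.
books_xml <- '<books>
  <book>
    <title>Moneyball</title>
    <authors><author>Michael Lewis</author></authors>
  </book>
  <book>
    <title>Freakonomics</title>
    <authors><author>Steven D. Levitt</author><author>Stephen J. Dubner</author></authors>
  </book>
</books>'

doc <- xmlParse(books_xml, asText = TRUE)
book_nodes <- getNodeSet(doc, "//book")

# Hypothetical helper: collect the text of every node matching `path`
# (relative to one book node) and join the values with a delimiter.
collapse_children <- function(node, path, sep = "; ") {
  paste(xpathSApply(node, path, xmlValue), collapse = sep)
}

xml_tidy_df <- data.frame(
  title   = vapply(book_nodes, collapse_children, character(1), path = "./title"),
  authors = vapply(book_nodes, collapse_children, character(1), path = "./authors/author"),
  stringsAsFactors = FALSE
)

print(xml_tidy_df)
```

With the values joined by "; ", a function such as tidyr's `separate_rows()` could later split the authors column into one row per author.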

Read JSON

The JSON-to-dataframe conversion below uses the ‘rjson’ library. Based on the output below, the JSON conversion would require the most transformation and tidying to produce a usable R dataframe. The resulting dataframe has just two rows and 21 columns. Instead of translating the JSON array of books into one row per book, the conversion spread the three books across the columns (seven fields for each of three books gives the 21 columns) and created a second row to accommodate the second author of Freakonomics and the second publisher of Capital; every single-valued field is simply repeated in both rows. The conversion treats each list element as a column, whereas the input JSON file stores each book as a separate array element. Tidying techniques, including melting, could transform the result into a more analysis-friendly structure.

library(rjson)

# Read the JSON file into a nested R list, then coerce that list to a dataframe
json_url <- "https://raw.githubusercontent.com/ptanofsky/data607/master/Week07_Assignment/books.json"
books_json_inp <- fromJSON(file = json_url)
json_df <- as.data.frame(books_json_inp)

dim(json_df)

## [1]  2 21

print(json_df)

##   books.title                    books.subtitle books.authors
## 1   Moneyball The Art of Winning an Unfair Game Michael Lewis
## 2   Moneyball The Art of Winning an Unfair Game Michael Lewis
##         books.publishers books.year_published books.pages
## 1 W. W. Norton & Company                 2003         288
## 2 W. W. Norton & Company                 2003         288
##          books.isbn books.title.1
## 1 978-0-393-05765-2  Freakonomics
## 2 978-0-393-05765-2  Freakonomics
##                                           books.subtitle.1
## 1 A Rogue Economist Explores the Hidden Side of Everything
## 2 A Rogue Economist Explores the Hidden Side of Everything
##     books.authors.1 books.publishers.1 books.year_published.1
## 1  Steven D. Levitt     William Morrow                   2005
## 2 Stephen J. Dubner     William Morrow                   2005
##   books.pages.1  books.isbn.1                       books.title.2
## 1           336 0-06-123400-1 Capital in the Twenty-First Century
## 2           336 0-06-123400-1 Capital in the Twenty-First Century
##   books.subtitle.2 books.authors.2       books.publishers.2
## 1                   Thomas Piketty        Éditions du Seuil
## 2                   Thomas Piketty Harvard University Press
##   books.year_published.2 books.pages.2   books.isbn.2
## 1                   2013           696 978-0674430006
## 2                   2013           696 978-0674430006
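
One possible way to get one row per book, instead of the wide two-row result above, is to convert each array element separately and stack the rows. The sketch below assumes a top-level `books` array (suggested by the `books.` column prefix in the output) and uses a shortened inline JSON string rather than the real file; `book_to_row` is a hypothetical helper.

```r
library(rjson)

# Shortened stand-in for books.json; the field names and the top-level
# "books" key are assumptions based on the column names printed above.
books_json <- '{"books": [
  {"title": "Moneyball", "authors": ["Michael Lewis"], "pages": 288},
  {"title": "Freakonomics",
   "authors": ["Steven D. Levitt", "Stephen J. Dubner"], "pages": 336}
]}'

parsed <- fromJSON(json_str = books_json)

# Hypothetical helper: collapse multi-valued fields into one delimited
# string so that every book yields exactly one row.
book_to_row <- function(book) {
  collapsed <- lapply(book, function(field) paste(field, collapse = "; "))
  as.data.frame(collapsed, stringsAsFactors = FALSE)
}

json_tidy_df <- do.call(rbind, lapply(parsed$books, book_to_row))

print(json_tidy_df)
```

Because `paste()` is applied to every field, all columns come back as character, so numeric fields such as pages would need converting afterwards (for example with `as.integer()`).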

Read HTML

The HTML-to-dataframe conversion below uses the ‘rvest’ library. Based on the output below, the HTML conversion is on par with the XML conversion: it produces three rows (observations) with seven columns (variables), just as the file denotes. Again, the two authors and two publishers prove problematic. The two strings are concatenated, this time with the whitespace characters from the HTML source (a newline and tabs) included in the output. The overall structure meets expectations but requires transformation to parse the fields holding two concatenated values. A delimiter could be used in these cases to allow easier transformations with the ‘tidyr’ and ‘dplyr’ libraries. Apart from those fields, the resulting dataframe is nearly analysis-ready straight from the initial conversion.

library(rvest)

## Loading required package: xml2

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:XML':
## 
##     xml

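# Parse the page and extract its table; html_table() returns a list of tables,
# and as.data.frame() turns the single table into a dataframe
html_url <- "https://raw.githubusercontent.com/ptanofsky/data607/master/Week07_Assignment/books.html"
html_df <- as.data.frame(read_html(html_url) %>% html_table(fill=TRUE))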

dim(html_df)

## [1] 3 7

print(html_df)

##                                 Title
## 1                           Moneyball
## 2                        Freakonomics
## 3 Capital in the Twenty-First Century
##                                                   Subtitle
## 1                        The Art of Winning an Unfair Game
## 2 A Rogue Economist Explores the Hidden Side of Everything
## 3                                                         
##                                          Author
## 1                                 Michael Lewis
## 2 Steven D. Levitt\n\t\t\t\t\tStephen J. Dubner
## 3                                Thomas Piketty
##                                               Publisher Publication.Date
## 1                                W. W. Norton & Company             2003
## 2                                        William Morrow             2005
## 3 Editions du Seuil\n\t\t\t\t\tHarvard University Press             2013
##   Pages              ISBN
## 1   288 978-0-393-05765-2
## 2   336     0-06-123400-1
## 3   696    978-0674430006
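
Since the HTML output keeps a newline and tabs between the two values, that whitespace can itself serve as the split point. The following is a minimal sketch on a small hand-built dataframe that mimics the Author column above (it does not re-read the HTML file); `html_like_df` and `authors_long` are illustrative names.

```r
library(tidyr)

# Hand-built stand-in that mimics the Author column printed above
html_like_df <- data.frame(
  Title  = c("Moneyball", "Freakonomics"),
  Author = c("Michael Lewis",
             "Steven D. Levitt\n\t\t\t\t\tStephen J. Dubner"),
  stringsAsFactors = FALSE
)

# Replace each run of whitespace containing a newline with a delimiter
html_like_df$Author <- gsub("\\s*\\n\\s*", "; ", html_like_df$Author)

# Optionally split the delimited values into one row per author
authors_long <- separate_rows(html_like_df, Author, sep = "; ")

print(authors_long)
```

Whether one row per author or a single delimited string is preferable depends on the analysis; the long form is usually easier to count and join on.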

Conclusion

The assignment proved an interesting exercise in data conversion with R. Although raw input files often arrive as CSV, raw data can come in many other common forms, including XML, JSON, and HTML. R libraries provide relatively straightforward conversions of such input to R dataframes, but the conversion itself doesn’t guarantee a tidy dataframe. The exercise shows the different capabilities of the R libraries and the inconsistent results of simply converting each file to a dataframe. Additional transformation and tidying would be required to prepare these sample files for data analysis and meaningful plots.

DATA 607 Week 07 Assignment

Philip Tanofsky

3/14/2020
