Introduction

Three separate XML, JSON, and HTML files where created and uploaded to githubs. Each file was read into R and converted into a dataframe.



XMl File to Data Frame Conversion

The source xml file is located at: https://raw.githubusercontent.com/johnnydrodriguez/data607_assignment8/main/books.xml

This conversion used the RCurl and XML packages.

Note: The RCurl was required to properly read the XML format.
# Install the packages
install.packages("XML", repos = "http://cran.us.r-project.org", quiet = T)
install.packages("RCurl", repos = "http://cran.us.r-project.org", quiet = T)

# Load the libraries
library(XML)
library(RCurl)

# RCurl ensures that the XML file is formatted properly when read. 
myxml = 'https://raw.githubusercontent.com/johnnydrodriguez/data607_assignment8/main/books.xml'
mybooksxml <- getURL(myxml)

# Parse the read XML file and view the XML structure
mybooksxml <- xmlTreeParse(mybooksxml)
xmlRoot(mybooksxml)
## <books>
##  <book form="paperback" USprice="14">
##   <isbn>0887308856</isbn>
##   <title>Hidden Order The Economics of Everyday Life</title>
##   <publisher>Harper Collins</publisher>
##   <authors>
##    <authorname1>David Friedman</authorname1>
##   </authors>
##  </book>
##  <book form="paperback" USprice="15">
##   <isbn>0142000280</isbn>
##   <title>Getting Things Done</title>
##   <publisher>Penguin Books</publisher>
##   <authors>
##    <authorname1>David Allen</authorname1>
##   </authors>
##  </book>
##  <book form="paperback" USprice="11.95">
##   <isbn>9781583671559</isbn>
##   <title>the Politics of Immigration</title>
##   <publisher>Monthly Review Press</publisher>
##   <authors>
##    <authorname1>Jane Guskin</authorname1>
##    <authorname2>David Wilson</authorname2>
##   </authors>
##  </book>
## </books>
# Convert the XML to a dataframe.  The DF is transposed to transform it into the preferable long format.
mybooksxml_df <- data.frame(t(xmlToList(mybooksxml)))

# Confirm the class and view the resulting dataframe
class(mybooksxml_df)
## [1] "data.frame"
knitr::kable(mybooksxml_df)
isbn title publisher authors.authorname1 .attrs
book 0887308856 Hidden Order The Economics of Everyday Life Harper Collins David Friedman paperback, 14
book.1 0142000280 Getting Things Done Penguin Books David Allen paperback, 15
book.2 9781583671559 the Politics of Immigration Monthly Review Press Jane Guskin , David Wilson paperback, 11.95



JSON File to Dataframe Conversion

The source file is located at: https://raw.githubusercontent.com/johnnydrodriguez/data607_assignment8/main/books.json

This conversion used the jsonlite package. Please note the json conversion included separate columns for the attribute. The Kable table print truncates the table columns.

# Install package
install.packages("jsonlite", repos = "http://cran.us.r-project.org", quiet = T)

# Load the library
library("jsonlite")

# Read the the json file and verify the format
mybooksjs <- fromJSON("https://raw.githubusercontent.com/johnnydrodriguez/data607_assignment8/main/books.json")
toJSON(mybooksjs, pretty = TRUE)
## {
##   "books": {
##     "book": [
##       {
##         "isbn": "0887308856",
##         "title": "Hidden Order The Economics of Everyday Life",
##         "publisher": "Harper Collins",
##         "authors": {
##           "authorname1": "David Friedman"
##         },
##         "_form": "paperback",
##         "_USprice": "14"
##       },
##       {
##         "isbn": "0142000280",
##         "title": "Getting Things Done",
##         "publisher": "Penguin Books",
##         "authors": {
##           "authorname1": "David Allen"
##         },
##         "_form": "paperback",
##         "_USprice": "15"
##       },
##       {
##         "isbn": "9781583671559",
##         "title": "the Politics of Immigration",
##         "publisher": "Monthly Review Press",
##         "authors": {
##           "authorname1": "Jane Guskin",
##           "authorname2": "David Wilson"
##         },
##         "_form": "paperback",
##         "_USprice": "11.95"
##       }
##     ]
##   }
## }
# Convert the json to data frame
mybooksjs_df<- as.data.frame(mybooksjs)

# Confirm the class and view the resulting dataframe.  
class(mybooksjs_df)
## [1] "data.frame"
knitr::kable(mybooksjs_df)
books.book.isbn books.book.title books.book.publisher books.book.authors books.book._form books.book._USprice
0887308856 Hidden Order The Economics of Everyday Life Harper Collins David Friedman paperback 14
0142000280 Getting Things Done Penguin Books David Allen paperback 15
9781583671559 the Politics of Immigration Monthly Review Press Jane Guskin paperback 11.95



HTML File to Dataframe Conversion

The source file is located at: https://raw.githubusercontent.com/johnnydrodriguez/data607_assignment8/main/books.html

This conversion uses the same packages as the XML conversion - RCurl and XML

# RCurl ensures that the HTML file is formatted properly when read. 
myhtml = 'https://raw.githubusercontent.com/johnnydrodriguez/data607_assignment8/main/books.html'
mybookshtml <- getURL(myhtml)

# Parse the read HTML file and view the HTML structure
mybookshtmlstr<- htmlTreeParse(mybookshtml)
xmlRoot(mybookshtmlstr)
## <html>
##  <head>
##   <meta charset="UTF-8"/>
##   <title>Books I Have on My Shelf</title>
##  </head>
##  <body>
##   <table border="1">
##    <thead>
##     <tr>
##      <th>isbn</th>
##      <th>title</th>
##      <th>publisher</th>
##      <th>authors</th>
##      <th>_form</th>
##      <th>_USprice</th>
##     </tr>
##    </thead>
##    <tbody>
##     <tr>
##      <td>0887308856</td>
##      <td>Hidden Order The Economics of Everyday Life</td>
##      <td>Harper Collins</td>
##      <td>David Friedman</td>
##      <td>paperback</td>
##      <td>14</td>
##     </tr>
##     <tr>
##      <td>0142000280</td>
##      <td>Getting Things Done</td>
##      <td>Penguin Books</td>
##      <td>David Allen</td>
##      <td>paperback</td>
##      <td>15</td>
##     </tr>
##     <tr>
##      <td>9781583671559</td>
##      <td>the Politics of Immigration</td>
##      <td>Monthly Review Press</td>
##      <td>Jane Guskin|David Wilson</td>
##      <td>paperback</td>
##      <td>11.95</td>
##     </tr>
##    </tbody>
##   </table>
##  </body>
## </html>
# Convert the HTML to a dataframe.    
mybookshtml_df <- data.frame(readHTMLTable(mybookshtml))

# Confirm the class and view the resulting dataframe
class(mybookshtml_df)
## [1] "data.frame"
knitr::kable(mybookshtml_df)
NULL.isbn NULL.title NULL.publisher NULL.authors NULL._form NULL._USprice
0887308856 Hidden Order The Economics of Everyday Life Harper Collins David Friedman paperback 14
0142000280 Getting Things Done Penguin Books David Allen paperback 15
9781583671559 the Politics of Immigration Monthly Review Press Jane Guskin|David Wilson paperback 11.95



Conclusions

Even with clean, manually created XML and HTML files, these file formats are very sensitive to reading and parsing using various packages. Initially, I assumed the reading, parsing and dataframe conversion would be interchangeable between packages. The XML2 package was able to read the XML directly from the url but the differing parsing technique of XMLParser was not ideal for df transformation. The XML package required RCurl before the format could be properly. The difference between XML, XMl2 and XMLParser packages made it particularly challenging to get the attributes from node and adding them to dataframe columns.

To get the XML in the long format one read, it was transposed. This df would require additional transformation via tidyverse to further clean up the dataframe to split and rename columns.


The JSON conversion was the most straight forward for the (admittedly) very json clean file. It automatically read the node attributes and transformed them to columns joined to their corresponding nodes in the dataframe. The final df was relatively clean.


The HTML behaved very similarly to the XML. The extracted HTML table did not need to be transposed but it would need additional tidyverse transformation to split and rename columns.