Three separate XML, JSON, and HTML files where created and uploaded to githubs. Each file was read into R and converted into a dataframe.
The source xml file is located at: https://raw.githubusercontent.com/johnnydrodriguez/data607_assignment8/main/books.xml
This conversion used the RCurl and XML packages.
# Install the packages
install.packages("XML", repos = "http://cran.us.r-project.org", quiet = T)
install.packages("RCurl", repos = "http://cran.us.r-project.org", quiet = T)
# Load the libraries
library(XML)
library(RCurl)
# RCurl ensures that the XML file is formatted properly when read.
myxml = 'https://raw.githubusercontent.com/johnnydrodriguez/data607_assignment8/main/books.xml'
mybooksxml <- getURL(myxml)
# Parse the read XML file and view the XML structure
mybooksxml <- xmlTreeParse(mybooksxml)
xmlRoot(mybooksxml)
## <books>
## <book form="paperback" USprice="14">
## <isbn>0887308856</isbn>
## <title>Hidden Order The Economics of Everyday Life</title>
## <publisher>Harper Collins</publisher>
## <authors>
## <authorname1>David Friedman</authorname1>
## </authors>
## </book>
## <book form="paperback" USprice="15">
## <isbn>0142000280</isbn>
## <title>Getting Things Done</title>
## <publisher>Penguin Books</publisher>
## <authors>
## <authorname1>David Allen</authorname1>
## </authors>
## </book>
## <book form="paperback" USprice="11.95">
## <isbn>9781583671559</isbn>
## <title>the Politics of Immigration</title>
## <publisher>Monthly Review Press</publisher>
## <authors>
## <authorname1>Jane Guskin</authorname1>
## <authorname2>David Wilson</authorname2>
## </authors>
## </book>
## </books>
# Convert the XML to a dataframe. The DF is transposed to transform it into the preferable long format.
mybooksxml_df <- data.frame(t(xmlToList(mybooksxml)))
# Confirm the class and view the resulting dataframe
class(mybooksxml_df)
## [1] "data.frame"
knitr::kable(mybooksxml_df)
isbn | title | publisher | authors.authorname1 | .attrs | |
---|---|---|---|---|---|
book | 0887308856 | Hidden Order The Economics of Everyday Life | Harper Collins | David Friedman | paperback, 14 |
book.1 | 0142000280 | Getting Things Done | Penguin Books | David Allen | paperback, 15 |
book.2 | 9781583671559 | the Politics of Immigration | Monthly Review Press | Jane Guskin , David Wilson | paperback, 11.95 |
The source file is located at: https://raw.githubusercontent.com/johnnydrodriguez/data607_assignment8/main/books.json
This conversion used the jsonlite package. Please note the json conversion included separate columns for the attribute. The Kable table print truncates the table columns.
# Install package
install.packages("jsonlite", repos = "http://cran.us.r-project.org", quiet = T)
# Load the library
library("jsonlite")
# Read the the json file and verify the format
mybooksjs <- fromJSON("https://raw.githubusercontent.com/johnnydrodriguez/data607_assignment8/main/books.json")
toJSON(mybooksjs, pretty = TRUE)
## {
## "books": {
## "book": [
## {
## "isbn": "0887308856",
## "title": "Hidden Order The Economics of Everyday Life",
## "publisher": "Harper Collins",
## "authors": {
## "authorname1": "David Friedman"
## },
## "_form": "paperback",
## "_USprice": "14"
## },
## {
## "isbn": "0142000280",
## "title": "Getting Things Done",
## "publisher": "Penguin Books",
## "authors": {
## "authorname1": "David Allen"
## },
## "_form": "paperback",
## "_USprice": "15"
## },
## {
## "isbn": "9781583671559",
## "title": "the Politics of Immigration",
## "publisher": "Monthly Review Press",
## "authors": {
## "authorname1": "Jane Guskin",
## "authorname2": "David Wilson"
## },
## "_form": "paperback",
## "_USprice": "11.95"
## }
## ]
## }
## }
# Convert the json to data frame
mybooksjs_df<- as.data.frame(mybooksjs)
# Confirm the class and view the resulting dataframe.
class(mybooksjs_df)
## [1] "data.frame"
knitr::kable(mybooksjs_df)
books.book.isbn | books.book.title | books.book.publisher | books.book.authors | books.book._form | books.book._USprice |
---|---|---|---|---|---|
0887308856 | Hidden Order The Economics of Everyday Life | Harper Collins | David Friedman | paperback | 14 |
0142000280 | Getting Things Done | Penguin Books | David Allen | paperback | 15 |
9781583671559 | the Politics of Immigration | Monthly Review Press | Jane Guskin | paperback | 11.95 |
The source file is located at: https://raw.githubusercontent.com/johnnydrodriguez/data607_assignment8/main/books.html
This conversion uses the same packages as the XML conversion - RCurl and XML
# RCurl ensures that the HTML file is formatted properly when read.
myhtml = 'https://raw.githubusercontent.com/johnnydrodriguez/data607_assignment8/main/books.html'
mybookshtml <- getURL(myhtml)
# Parse the read HTML file and view the HTML structure
mybookshtmlstr<- htmlTreeParse(mybookshtml)
xmlRoot(mybookshtmlstr)
## <html>
## <head>
## <meta charset="UTF-8"/>
## <title>Books I Have on My Shelf</title>
## </head>
## <body>
## <table border="1">
## <thead>
## <tr>
## <th>isbn</th>
## <th>title</th>
## <th>publisher</th>
## <th>authors</th>
## <th>_form</th>
## <th>_USprice</th>
## </tr>
## </thead>
## <tbody>
## <tr>
## <td>0887308856</td>
## <td>Hidden Order The Economics of Everyday Life</td>
## <td>Harper Collins</td>
## <td>David Friedman</td>
## <td>paperback</td>
## <td>14</td>
## </tr>
## <tr>
## <td>0142000280</td>
## <td>Getting Things Done</td>
## <td>Penguin Books</td>
## <td>David Allen</td>
## <td>paperback</td>
## <td>15</td>
## </tr>
## <tr>
## <td>9781583671559</td>
## <td>the Politics of Immigration</td>
## <td>Monthly Review Press</td>
## <td>Jane Guskin|David Wilson</td>
## <td>paperback</td>
## <td>11.95</td>
## </tr>
## </tbody>
## </table>
## </body>
## </html>
# Convert the HTML to a dataframe.
mybookshtml_df <- data.frame(readHTMLTable(mybookshtml))
# Confirm the class and view the resulting dataframe
class(mybookshtml_df)
## [1] "data.frame"
knitr::kable(mybookshtml_df)
NULL.isbn | NULL.title | NULL.publisher | NULL.authors | NULL._form | NULL._USprice |
---|---|---|---|---|---|
0887308856 | Hidden Order The Economics of Everyday Life | Harper Collins | David Friedman | paperback | 14 |
0142000280 | Getting Things Done | Penguin Books | David Allen | paperback | 15 |
9781583671559 | the Politics of Immigration | Monthly Review Press | Jane Guskin|David Wilson | paperback | 11.95 |
Even with clean, manually created XML and HTML files, these file formats are very sensitive to reading and parsing using various packages. Initially, I assumed the reading, parsing and dataframe conversion would be interchangeable between packages. The XML2 package was able to read the XML directly from the url but the differing parsing technique of XMLParser was not ideal for df transformation. The XML package required RCurl before the format could be properly. The difference between XML, XMl2 and XMLParser packages made it particularly challenging to get the attributes from node and adding them to dataframe columns.
To get the XML in the long format one read, it was transposed. This df would require additional transformation via tidyverse to further clean up the dataframe to split and rename columns.
The JSON conversion was the most straight forward for the (admittedly) very json clean file. It automatically read the node attributes and transformed them to columns joined to their corresponding nodes in the dataframe. The final df was relatively clean.
The HTML behaved very similarly to the XML. The extracted HTML table did not need to be transposed but it would need additional tidyverse transformation to split and rename columns.