Parse XML to data frame
Since xmlToDataFrame does not handle duplicate nodes properly (for example, a book with two author nodes), let's create a function that returns one row per node.
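# Flatten one record into long format: one row per child node, keyed by `keyname`.
xmlNodeValue=function(root,keyname="Id"){
x=xmlToDataFrame(root,collectNames = FALSE)
x=as.data.frame(t(x)) # transpose so each child node becomes a row
x$node=rownames(x) # keep the node names as a column
colnames(x)[1]="value"
rownames(x)=NULL
x$Id=x[1,1] # the first node's value (here the title) identifies the record
colnames(x)[3]=keyname
x=x[2:nrow(x),] # drop the first row, whose value now lives in the key column
x[,c(3,2,1)] # reorder to key, node, value
}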
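To make the duplicate-node case concrete, here is a minimal sketch that walks the children of a single record directly with xmlChildren, xmlName and xmlValue. The inline sample and the names `sample_xml`, `kids` and `long` are made up for illustration and are not taken from books.xml; this is an alternative way to look at the problem, not the function used in this write-up.

```r
library(XML)

# Hypothetical inline sample: one book with two <author> nodes.
# Written without whitespace between tags so no blank text nodes appear.
sample_xml <- paste0(
  "<books><book>",
  "<title>Book A</title>",
  "<author>A One</author>",
  "<author>B Two</author>",
  "<genre>Programming</genre>",
  "</book></books>"
)

doc  <- xmlParse(sample_xml, asText = TRUE)
book <- xmlRoot(doc)[[1]]

# One row per child node, so each repeated <author> keeps its own row.
kids <- xmlChildren(book)
long <- data.frame(
  node  = unname(sapply(kids, xmlName)),
  value = unname(sapply(kids, xmlValue)),
  stringsAsFactors = FALSE
)
print(long)
```

Because every child node becomes its own row, both author nodes survive instead of being collapsed into a single column.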
Now convert the XML to a data frame, one book at a time.
library(XML)
library(knitr)
bookxml=xmlParse(readLines("https://raw.githubusercontent.com/mkds/MSDA/master/IS607/data/books.xml"))
root=xmlRoot(bookxml)
xmlbooks=NULL
for (i in 1:xmlSize(root)) xmlbooks=rbind(xmlbooks,xmlNodeValue(root[i],"title"))
rownames(xmlbooks)=NULL
kable(xmlbooks)
| title | node | value |
|---|---|---|
| Calvin and Hobbes | author | Bill Watterson |
| Calvin and Hobbes | publisher | Andrew McMeel Publishing |
| Calvin and Hobbes | genre | Humor |
| Calvin and Hobbes | type | comic |
| Harry Potter | author | J.K. Rowling |
| Harry Potter | publisher | Arthur A. Levine Book |
| Harry Potter | genre | Fantasy |
| Harry Potter | type | Fiction |
| Head First Java | author | Berts Bates |
| Head First Java | author | Kathy Sierra |
| Head First Java | publisher | O’Relly |
| Head First Java | genre | Programming |
| Head First Java | type | Text Book |
Convert HTML table to data frame
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.2.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.2
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
htmltext=readLines("https://raw.githubusercontent.com/mkds/MSDA/d47fd834bbb37c1cfa7b48cd5613aa46e5cbc6aa/IS607/data/books.html")
hbooks=as.data.frame(readHTMLTable(htmltext)[[1]])
hbooks[hbooks==""]=NA
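htmlbooks = hbooks %>%
  fill(1) %>% # carry each title down into the blank rows below it
  gather(node,value,-title) %>% # reshape to long: one row per title/node/value
  filter(value!="") %>% # drops the NA values introduced above
  arrange(title)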
## Warning: attributes are not identical across measure variables; they will
## be dropped
kable(htmlbooks)
| title | node | value |
|---|---|---|
| Calvin and Hobbes | author | Bill Watterson |
| Calvin and Hobbes | publisher | Andrew McMeel Publishing |
| Calvin and Hobbes | genre | Humor |
| Calvin and Hobbes | type | comic |
| Harry Potter | author | J.K. Rowling |
| Harry Potter | publisher | Arthur A. Levine Tr |
| Harry Potter | genre | Fantasy |
| Harry Potter | type | Fiction |
| Head First Java | author | Berts Bates |
| Head First Java | author | Kathy Sierra |
| Head First Java | publisher | O’Relly |
| Head First Java | genre | Software |
| Head First Java | type | Text Book |
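The fill-then-reshape step can be tried on a small data frame without fetching the HTML file. The sketch below uses made-up data (the `wide` frame is hypothetical) and swaps in tidyr's pivot_longer, which assumes tidyr 1.0 or later, in place of gather; it is an illustration of the same idea rather than the exact pipeline above.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide table: the second row continues "Book A" with a second author,
# so its title and genre cells are empty, much like a continuation row in an HTML table.
wide <- data.frame(
  title  = c("Book A", NA, "Book B"),
  author = c("A One", "B Two", "C Three"),
  genre  = c("Programming", NA, "Humor"),
  stringsAsFactors = FALSE
)

long <- wide %>%
  fill(title) %>%                       # carry the title down into the continuation row
  pivot_longer(-title,
               names_to = "node",
               values_to = "value",
               values_drop_na = TRUE) %>% # skip the empty cells
  arrange(title)

print(long)
```

Filling the title first matters: without it, the continuation row would have no key once the table is reshaped.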
Convert JSON to data frame
I found a package (jsonlite) that converts JSON to a data frame directly. However, the result is in wide format, with one row per book, unlike the long title/node/value structure I came up with for XML and HTML.
library(jsonlite)
## Warning: package 'jsonlite' was built under R version 3.2.2
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
jsontext=readLines("https://raw.githubusercontent.com/mkds/MSDA/master/IS607/data/books.json")
booksjson= data.frame(fromJSON(jsontext)[[1]][[1]])
kable(booksjson)
| title | author | publisher | genre | type |
|---|---|---|---|---|
| Calvin and Hobbes | Bill Watterson | Andrew McMeel Publishing | Humor | comic |
| Harry Potter | J.K. Rowling | Arthur A. Levine Book | Fantasy | Fiction |
| Head First Java | Berts Bates, Kathy Sierra | O’Relly | Programming | Text Book |
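If the same long layout is wanted for JSON, one possible approach is to expand each field of each record into its own rows. The sketch below works on a made-up inline JSON string (not books.json) and assumes that fields with several values, such as author, arrive from fromJSON as a list column; the names `json_text`, `fields` and `long` are hypothetical.

```r
library(jsonlite)

# Hypothetical inline sample: author is an array, so jsonlite returns it as a list column.
json_text <- '[
  {"title": "Book A", "author": ["A One", "B Two"], "genre": "Programming"},
  {"title": "Book B", "author": ["C Three"], "genre": "Humor"}
]'

books  <- fromJSON(json_text)
fields <- setdiff(names(books), "title")

# For every book and every field, emit one row per value.
long <- do.call(rbind, lapply(seq_len(nrow(books)), function(i) {
  do.call(rbind, lapply(fields, function(f) {
    vals <- unlist(books[[f]][i]) # works for plain and list columns alike
    data.frame(title = books$title[i], node = f, value = vals,
               stringsAsFactors = FALSE)
  }))
}))
rownames(long) <- NULL

print(long)
```

This keeps one row per author, which matches the XML and HTML results more closely than the comma-joined author cell in the wide table.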