Parse XML, HTML and JSON

Parse XML to data frame
Because xmlToDataFrame does not handle duplicate nodes properly (for example, a book with two author nodes), let’s write a helper function that returns one row per child node.

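# Flatten one record into (key, node, value) rows, keeping duplicate child nodes.
xmlNodeValue=function(root,keyname="Id"){
  x=xmlToDataFrame(root,collectNames = FALSE)
  x=as.data.frame(t(x))      # transpose: one row per child node
  x$node=rownames(x)         # child node names become a column
  colnames(x)[1]="value"
  rownames(x)=NULL
  x$Id=x[1,1]                # the first child's value is the record key
  colnames(x)[3]=keyname
  x=x[2:nrow(x),]            # drop the row that holds the key itself
  x[,c(3,2,1)]               # reorder to key, node, value
}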

Now convert the XML to a data frame, one book node at a time.

library(XML)
library(knitr)
bookxml=xmlParse(readLines("https://raw.githubusercontent.com/mkds/MSDA/master/IS607/data/books.xml"))
root=xmlRoot(bookxml)
xmlbooks=NULL
for (i in 1:xmlSize(root)) xmlbooks=rbind(xmlbooks,xmlNodeValue(root[i],"title"))
rownames(xmlbooks)=NULL
kable(xmlbooks)

title	node	value
Calvin and Hobbes	author	Bill Watterson
Calvin and Hobbes	publisher	Andrew McMeel Publishing
Calvin and Hobbes	genre	Humor
Calvin and Hobbes	type	comic
Harry Potter	author	J.K. Rowling
Harry Potter	publisher	Arthur A. Levine Book
Harry Potter	genre	Fantasy
Harry Potter	type	Fiction
Head First Java	author	Berts Bates
Head First Java	author	Kathy Sierra
Head First Java	publisher	O’Relly
Head First Java	genre	Programming
Head First Java	type	Text Book
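
As a cross-check on the loop above, the same long layout can be sketched by reading each child node directly with xmlSApply and xmlValue instead of going through xmlToDataFrame. This is a minimal sketch, not the method used above: the inline XML string and the names demo_xml, demo_book and demo_long are made up for illustration, and it assumes each book element holds flat child nodes with the title first, as the table suggests.

```r
library(XML)

# Hypothetical inline sample; the real books.xml may be laid out differently.
demo_xml <- paste0(
  "<books><book>",
  "<title>Head First Java</title>",
  "<author>Berts Bates</author>",
  "<author>Kathy Sierra</author>",
  "<publisher>O'Relly</publisher>",
  "</book></books>"
)

demo_book <- xmlRoot(xmlParse(demo_xml, asText = TRUE))[[1]]

# One value per child node; repeated names such as author are kept.
vals <- xmlSApply(demo_book, xmlValue)

demo_long <- data.frame(
  title = unname(vals[1]),   # assumes the first child is the title
  node  = names(vals)[-1],   # remaining child node names
  value = unname(vals[-1]),  # remaining child node values
  stringsAsFactors = FALSE
)
demo_long
```

Both author nodes appear as separate rows, which is the behaviour the helper function above was written to get.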

Convert HTML table to data frame

library(tidyr)

library(dplyr)

htmltext=readLines("https://raw.githubusercontent.com/mkds/MSDA/d47fd834bbb37c1cfa7b48cd5613aa46e5cbc6aa/IS607/data/books.html")
hbooks=as.data.frame(readHTMLTable(htmltext)[[1]])
hbooks[hbooks==""]=NA
htmlbooks= hbooks %>% fill(1) %>% gather(node,value,-title) %>% filter(value!="") %>% arrange(title)

## Warning: attributes are not identical across measure variables; they will
## be dropped

kable(htmlbooks)

title	node	value
Calvin and Hobbes	author	Bill Watterson
Calvin and Hobbes	publisher	Andrew McMeel Publishing
Calvin and Hobbes	genre	Humor
Calvin and Hobbes	type	comic
Harry Potter	author	J.K. Rowling
Harry Potter	publisher	Arthur A. Levine Tr
Harry Potter	genre	Fantasy
Harry Potter	type	Fiction
Head First Java	author	Berts Bates
Head First Java	author	Kathy Sierra
Head First Java	publisher	O’Relly
Head First Java	genre	Software
Head First Java	type	Text Book
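
To see what the fill and reshape steps in the pipeline above are doing, here is a minimal sketch on a small hand-made table. The data frame hbooks_demo is hypothetical; it only assumes, as the hbooks[hbooks==""]=NA and fill(1) steps suggest, that continuation rows for a book leave the title cell empty. It uses pivot_longer, tidyr's newer replacement for gather (tidyr 1.0.0 or later), rather than the gather call above.

```r
library(tidyr)

# Hypothetical stand-in for hbooks after empty cells were set to NA.
hbooks_demo <- data.frame(
  title     = c("Head First Java", NA),
  author    = c("Berts Bates", "Kathy Sierra"),
  publisher = c("O'Relly", NA),
  stringsAsFactors = FALSE
)

# Carry the title down into the continuation row.
filled <- fill(hbooks_demo, title)

# Wide to long: one row per (title, node, value), dropping empty cells.
long_demo <- pivot_longer(
  filled,
  cols = -title,
  names_to = "node",
  values_to = "value",
  values_drop_na = TRUE
)
long_demo
```

The second author ends up on its own row with the title filled in, matching the two author rows for Head First Java in the table above.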

Convert JSON to data frame

The jsonlite package converts JSON to a data frame directly. However, the result is a wide table with one row per book, which differs from the long title/node/value structure I came up with for XML and HTML.

library(jsonlite)

jsontext=readLines("https://raw.githubusercontent.com/mkds/MSDA/master/IS607/data/books.json")
booksjson= data.frame(fromJSON(jsontext)[[1]][[1]])
kable(booksjson)

title	author	publisher	genre	type
Calvin and Hobbes	Bill Watterson	Andrew McMeel Publishing	Humor	comic
Harry Potter	J.K. Rowling	Arthur A. Levine Book	Fantasy	Fiction
Head First Java	Berts Bates, Kathy Sierra	O’Relly	Programming	Text Book
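
If the same long layout as the XML and HTML results is wanted, the wide table above could be reshaped along these lines. This is only a sketch in base R: to_long and booksjson_demo are hypothetical names, and it assumes the author column comes back from fromJSON as a list column (which would explain the two authors shown in one cell); the real booksjson may differ.

```r
# Hypothetical stand-in for booksjson, with author assumed to be a list column.
booksjson_demo <- data.frame(
  title     = c("Calvin and Hobbes", "Head First Java"),
  publisher = c("Andrew McMeel Publishing", "O'Relly"),
  stringsAsFactors = FALSE
)
booksjson_demo$author <- list("Bill Watterson", c("Berts Bates", "Kathy Sierra"))

# Reshape a wide data frame into (key, node, value) rows.
# Cells holding several values (list columns) yield one row per value.
to_long <- function(df, key = "title") {
  fields <- setdiff(names(df), key)
  rows <- lapply(seq_len(nrow(df)), function(i) {
    do.call(rbind, lapply(fields, function(f) {
      vals <- unlist(df[[f]][i])
      if (length(vals) == 0) return(NULL)  # skip empty cells
      data.frame(key = df[[key]][i], node = f,
                 value = as.character(vals), stringsAsFactors = FALSE)
    }))
  })
  out <- do.call(rbind, rows)
  names(out)[1] <- key
  rownames(out) <- NULL
  out
}

json_long <- to_long(booksjson_demo)
json_long
```

Each author of Head First Java gets its own row, so the output can be compared row by row with the XML and HTML tables.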

Mohan Kandaraj

October 18, 2015