Table capture from html, xml and json files
______________________________________
Table from html document:
| Title | Author_1 | Author_2 | Person of note_1 | Person of note_2 | Person of note_3 | Person of note_4 | Person of note_5 | Time period |
|---|---|---|---|---|---|---|---|---|
| Newton and the Counterfeiter | Thomas Levenson | | Isaac Newton | Robert Hooke | William Chaloner | | | 17th Century |
| Seven Ideas that Shook the Universe | Bryon D.Anderson | Nathan Spielberg | Max Planck | Galileo | Albert Einstein | Isaac Newton | Lord Kelvin | 20th Century |
| The Swerve | Stephen Greenblatt | | Poggio Bracciolini | Lucretius | Galileo | | | 15th Century |
______________________________________
Table from XML document:
| title | author_1 | author_2 | Person_of_note_1 | Person_of_note_2 | Person_of_note_3 | Person_of_note_4 | Person_of_note_5 | Time_period |
|---|---|---|---|---|---|---|---|---|
| Newton and the Counterfeiter | Thomas Levenson | | Isaac Newton | Robert Hooke | William Chaloner | | | 17th Century |
| The Swerve | Stephen Greenblatt | | Poggio Bracciolini | Lucretius | Galileo | | | 17th Century |
| Seven Ideas that Shook the Universe | Bryon D. Anderson | Nathan Spielberg | Max Planck | Galileo | Albert Einstein | Isaac Newton | Lord Kelvin | 20th Century |
______________________________________
Table from json document:
| title |
|---|
| Newton and the Counterfeiter |
| The Swerve |
| Seven Ideas that Shook the Universe |

| 1 | 2 |
|---|---|
| Thomas Levenson | NA |
| Stephen Greenblatt | NA |
| Bryon D. Anderson | Nathan Spielberg |

| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| Isaac Newton | Robert Hooke | William Chaloner | NA | NA |
| Poggio Bracciolini | Lucretius | Galileo | NA | NA |
| Max Planck | Galileo | Albert Einstein | Isaac Newton | Lord Kelvin |
______________________________________
Structure for json-sourced data frame:
## 'data.frame': 3 obs. of 4 variables:
## $ title : chr "Newton and the Counterfeiter" "The Swerve" "Seven Ideas that Shook the Universe"
## $ authors :'data.frame': 3 obs. of 2 variables:
## ..$ 1: chr "Thomas Levenson" "Stephen Greenblatt" "Bryon D. Anderson"
## ..$ 2: chr NA NA "Nathan Spielberg"
## $ persons_of_note:'data.frame': 3 obs. of 5 variables:
## ..$ 1: chr "Isaac Newton" "Poggio Bracciolini" "Max Planck"
## ..$ 2: chr "Robert Hooke" "Lucretius" "Galileo"
## ..$ 3: chr "William Chaloner" "Galileo" "Albert Einstein"
## ..$ 4: chr NA NA "Isaac Newton"
## ..$ 5: chr NA NA "Lord Kelvin"
## $ Century : int 17 15 20
## [1] "Column names:"
## [1] "title" "authors" "persons_of_note" "Century"
______________________________________
Structure for html-sourced data frame:
## 'data.frame': 3 obs. of 9 variables:
## $ Title : Factor w/ 3 levels "Newton and the Counterfeiter",..: 1 2 3
## $ Author_1 : Factor w/ 3 levels "Bryon D.Anderson",..: 3 1 2
## $ Author_2 : Factor w/ 2 levels "","Nathan Spielberg": 1 2 1
## $ Person of note_1: Factor w/ 3 levels "Isaac Newton",..: 1 2 3
## $ Person of note_2: Factor w/ 3 levels "Galileo","Lucretius",..: 3 1 2
## $ Person of note_3: Factor w/ 3 levels "Albert Einstein",..: 3 1 2
## $ Person of note_4: Factor w/ 2 levels "","Isaac Newton": 1 2 1
## $ Person of note_5: Factor w/ 2 levels "","Lord Kelvin": 1 2 1
## $ Time period : Factor w/ 3 levels "15th Century",..: 2 3 1
______________________________________
Structure for XML-sourced data frame:
## 'data.frame': 3 obs. of 9 variables:
## $ title : Factor w/ 3 levels "Newton and the Counterfeiter",..: 1 3 2
## $ author_1 : Factor w/ 3 levels "Bryon D. Anderson",..: 3 2 1
## $ author_2 : Factor w/ 2 levels "","Nathan Spielberg": 1 1 2
## $ Person_of_note_1: Factor w/ 3 levels "Isaac Newton",..: 1 3 2
## $ Person_of_note_2: Factor w/ 3 levels "Galileo","Lucretius",..: 3 2 1
## $ Person_of_note_3: Factor w/ 3 levels "Albert Einstein",..: 3 2 1
## $ Person_of_note_4: Factor w/ 2 levels "","Isaac Newton": 1 1 2
## $ Person_of_note_5: Factor w/ 2 levels "","Lord Kelvin": 1 1 2
## $ Time_period : Factor w/ 2 levels "17th Century",..: 1 1 2
The three data frames started from different source formats and ended up with different shapes. The most notable difference is the JSON data. JSON lets a single field, such as authors, hold a variable number of entries, and the resulting data frame stores that field as a nested data frame inside one column (in the structure output above, authors has 2 columns and persons_of_note has 5). There is no need to set aside space for five authors on the off chance that one book has five authors. The trade-off is that the JSON-based data frame has an extra level of nesting, so reporting on the data may require an extra step to separate the nested columns. On the other hand, one does not have to create multiple top-level columns for something like author, which is naturally a single field.
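The extra separating step mentioned above can be sketched with `jsonlite::flatten()`, which expands nested data frame columns into ordinary top-level columns. This is only an illustration: the inline JSON string is an assumed stand-in for books.json (whose exact layout is not shown here), reduced to two books and two fields.

```r
library(jsonlite)

# Assumed, simplified stand-in for books.json: each book has a title and
# an "authors" object with a variable number of numbered entries.
json_text <- '[
  {"title": "Newton and the Counterfeiter",
   "authors": {"1": "Thomas Levenson"}},
  {"title": "Seven Ideas that Shook the Universe",
   "authors": {"1": "Bryon D. Anderson", "2": "Nathan Spielberg"}}
]'

# fromJSON() simplifies the array of records to a data frame in which
# "authors" is itself a data frame (missing entries become NA).
nested <- fromJSON(json_text)
str(nested)

# flatten() pulls the nested columns up to the top level, giving one
# column per author slot next to the title column.
flat <- jsonlite::flatten(nested)
str(flat)
```

Calling `flatten()` with the `jsonlite::` prefix avoids clashes with functions of the same name in other packages.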
Another important difference among the sources was the difficulty of parsing the HTML data. The tools for XML and JSON were fairly straightforward and produced a data frame in a few lines of code, while the HTML table had to be rebuilt cell by cell. The parse tree from the XML package provides a DOM that describes the relationships among the parts of the data, but it is not very user-friendly and requires extra code to access individual elements.
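To give a sense of the extra code the parse tree requires, here is a minimal sketch that reads individual elements with XPath and node indexing from the XML package. The inline XML and its tag names (`books`, `book`, `title`, `author_1`) are assumptions made for illustration, not the actual contents of books.xml.

```r
library(XML)

# Assumed, simplified stand-in for books.xml.
xml_text <- '<books>
  <book><title>The Swerve</title><author_1>Stephen Greenblatt</author_1></book>
  <book><title>Newton and the Counterfeiter</title><author_1>Thomas Levenson</author_1></book>
</books>'

# Parse the string into an internal document tree.
doc <- xmlParse(xml_text, asText = TRUE)

# Pull every title with an XPath query.
titles <- xpathSApply(doc, "//book/title", xmlValue)

# Walk the tree by hand: first <book> under the root, then its author.
first_book <- xmlRoot(doc)[[1]]
first_author <- xmlValue(first_book[["author_1"]])

print(titles)
print(first_author)
```

Compared with `xmlToDataFrame()`, this gives finer control over which nodes are read, at the cost of writing the traversal explicitly.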
The code used to create our tables follows:
options(width = 190)
library(RCurl)
library(XML)
library(xml2)
library(jsonlite)
library(rvest)
library(ggplot2)
library(knitr)
# XML: parse the document, take its root node, and convert it to a data frame.
Books.from.XML <- xmlParse(getURL("https://raw.githubusercontent.com/WigodskyD/data-sets/master/books.xml"))
root.1 <- xmlRoot(Books.from.XML)
my.xml.data.frame <- xmlToDataFrame(root.1)
# JSON: read the document, then bind its top-level elements into one data frame.
Books.from.json <- fromJSON(getURL("https://raw.githubusercontent.com/WigodskyD/data-sets/master/books.json"))
Books.from.json <- do.call("rbind", lapply(Books.from.json, data.frame, stringsAsFactors = FALSE))
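# HTML: parse the page into an internal tree (not used further; Books.from.HTML is rebuilt as a data frame below).
Books.from.HTML <- htmlTreeParse(file = getURL("https://raw.githubusercontent.com/WigodskyD/data-sets/master/books3.html"), useInternalNodes = TRUE)
# Read the same page with rvest and extract the table rows, header cells, and data cells.
book <- read_html(getURL("https://raw.githubusercontent.com/WigodskyD/data-sets/master/books3.html"))
# Text of each table row (not used below).
book.a <- book %>%
  html_nodes("table tr") %>%
  html_text()
# Header cells, used later as column names.
book.b <- book %>%
  html_nodes("table th") %>%
  html_text()
# Data cells, in row-by-row order.
book.c <- book %>%
  html_nodes("table td") %>%
  html_text()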
# Arrange the first 27 data cells into 3 rows of 9 columns, filling row by row.
book.matrix <- matrix(book.c[1:27], nrow = 3, ncol = 9, byrow = TRUE)
# Convert the matrix to a data frame; the column names are assigned from the header cells next.
Books.from.HTML <- data.frame(book.matrix)
colnames(Books.from.HTML)<-book.b
kable(Books.from.HTML[1:3,])
kable(my.xml.data.frame)
kable(Books.from.json[,1])
kable(Books.from.json[,2])
kable(Books.from.json[,3])
kable(Books.from.json[,4])
str(Books.from.json)
str(Books.from.HTML)
str(my.xml.data.frame)
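As a possible shortcut for the manual matrix fill above, rvest's `html_table()` can often read a well-formed table directly into a data frame. The sketch below runs on a small inline table; the HTML string is a made-up, three-column stand-in for books3.html, whose markup is not shown here, so whether this works unchanged on the real page is an assumption.

```r
library(rvest)

# Assumed, simplified stand-in for the table in books3.html.
html_source <- '<table>
  <tr><th>Title</th><th>Author_1</th><th>Time period</th></tr>
  <tr><td>The Swerve</td><td>Stephen Greenblatt</td><td>15th Century</td></tr>
  <tr><td>Newton and the Counterfeiter</td><td>Thomas Levenson</td><td>17th Century</td></tr>
</table>'

# read_html() accepts a literal HTML string as well as a URL or file path.
page <- read_html(html_source)

# html_table() on a document returns a list with one entry per <table>;
# the header cells become the column names.
books <- html_table(page)[[1]]

print(books)
```

This avoids hard-coding the row and column counts, which the loop-based approach needs to know in advance.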