Table capture from HTML, XML and JSON files

______________________________________

Table from HTML document:

Title	Author_1	Author_2	Person of note_1	Person of note_2	Person of note_3	Person of note_4	Person of note_5	Time period
Newton and the Counterfeiter	Thomas Levenson		Isaac Newton	Robert Hooke	William Chaloner			17th Century
Seven Ideas that Shook the Universe	Bryon D.Anderson	Nathan Spielberg	Max Planck	Galileo	Albert Einstein	Isaac Newton	Lord Kelvin	20th Century
The Swerve	Stephen Greenblatt		Poggio Bracciolini	Lucretius	Galileo			15th Century

______________________________________

Table from XML document:

title	author_1	author_2	Person_of_note_1	Person_of_note_2	Person_of_note_3	Person_of_note_4	Person_of_note_5	Time_period
Newton and the Counterfeiter	Thomas Levenson		Isaac Newton	Robert Hooke	William Chaloner			17th Century
The Swerve	Stephen Greenblatt		Poggio Bracciolini	Lucretius	Galileo			17th Century
Seven Ideas that Shook the Universe	Bryon D. Anderson	Nathan Spielberg	Max Planck	Galileo	Albert Einstein	Isaac Newton	Lord Kelvin	20th Century

______________________________________

Table from JSON document:

x
Newton and the Counterfeiter
The Swerve
Seven Ideas that Shook the Universe

1	2
Thomas Levenson	NA
Stephen Greenblatt	NA
Bryon D. Anderson	Nathan Spielberg

1	2	3	4	5
Isaac Newton	Robert Hooke	William Chaloner	NA	NA
Poggio Bracciolini	Lucretius	Galileo	NA	NA
Max Planck	Galileo	Albert Einstein	Isaac Newton	Lord Kelvin

x
17
15
20

______________________________________

Structure for JSON-sourced data frame:

## 'data.frame':    3 obs. of  4 variables:
##  $ title          : chr  "Newton and the Counterfeiter" "The Swerve" "Seven Ideas that Shook the Universe"
##  $ authors        :'data.frame': 3 obs. of  2 variables:
##   ..$ 1: chr  "Thomas Levenson" "Stephen Greenblatt" "Bryon D. Anderson"
##   ..$ 2: chr  NA NA "Nathan Spielberg"
##  $ persons_of_note:'data.frame': 3 obs. of  5 variables:
##   ..$ 1: chr  "Isaac Newton" "Poggio Bracciolini" "Max Planck"
##   ..$ 2: chr  "Robert Hooke" "Lucretius" "Galileo"
##   ..$ 3: chr  "William Chaloner" "Galileo" "Albert Einstein"
##   ..$ 4: chr  NA NA "Isaac Newton"
##   ..$ 5: chr  NA NA "Lord Kelvin"
##  $ Century        : int  17 15 20

## [1] "Column names:"

## [1] "title"           "authors"         "persons_of_note" "Century"

______________________________________

Structure for HTML-sourced data frame:

## 'data.frame':    3 obs. of  9 variables:
##  $ Title           : Factor w/ 3 levels "Newton and the Counterfeiter",..: 1 2 3
##  $ Author_1        : Factor w/ 3 levels "Bryon D.Anderson",..: 3 1 2
##  $ Author_2        : Factor w/ 2 levels "","Nathan Spielberg": 1 2 1
##  $ Person of note_1: Factor w/ 3 levels "Isaac Newton",..: 1 2 3
##  $ Person of note_2: Factor w/ 3 levels "Galileo","Lucretius",..: 3 1 2
##  $ Person of note_3: Factor w/ 3 levels "Albert Einstein",..: 3 1 2
##  $ Person of note_4: Factor w/ 2 levels "","Isaac Newton": 1 2 1
##  $ Person of note_5: Factor w/ 2 levels "","Lord Kelvin": 1 2 1
##  $ Time period     : Factor w/ 3 levels "15th Century",..: 2 3 1

______________________________________

Structure for XML-sourced data frame:

## 'data.frame':    3 obs. of  9 variables:
##  $ title           : Factor w/ 3 levels "Newton and the Counterfeiter",..: 1 3 2
##  $ author_1        : Factor w/ 3 levels "Bryon D. Anderson",..: 3 2 1
##  $ author_2        : Factor w/ 2 levels "","Nathan Spielberg": 1 1 2
##  $ Person_of_note_1: Factor w/ 3 levels "Isaac Newton",..: 1 3 2
##  $ Person_of_note_2: Factor w/ 3 levels "Galileo","Lucretius",..: 3 2 1
##  $ Person_of_note_3: Factor w/ 3 levels "Albert Einstein",..: 3 2 1
##  $ Person_of_note_4: Factor w/ 2 levels "","Isaac Newton": 1 1 2
##  $ Person_of_note_5: Factor w/ 2 levels "","Lord Kelvin": 1 1 2
##  $ Time_period     : Factor w/ 2 levels "17th Century",..: 1 1 2

Our data frames started in different forms and ended in different forms. The most notable difference is the JSON-sourced data frame. JSON allows a single field, such as authors, to hold a variable number of entries. The entries are grouped together, so one doesn’t need to reserve space for five authors on the off chance that a book has five authors. The cost is that the JSON-based data frame has an extra dimension: its authors and persons_of_note columns are themselves data frames, so reporting on the data may require an extra step to separate them. On the other hand, one doesn’t have to create multiple columns for something like author, which is naturally a single field.
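
The extra separating step mentioned above can be sketched with jsonlite's flatten(), which spreads nested data frame columns into ordinary ones. This is a minimal illustration, not the code used for the report: the inline JSON text is a hypothetical stand-in for books.json, whose exact layout is an assumption here.

```r
library(jsonlite)

# Hypothetical two-book sample; the real books.json layout is assumed, not known
json_text <- '[
  {"title": "Newton and the Counterfeiter",
   "authors": {"1": "Thomas Levenson"},
   "Century": 17},
  {"title": "Seven Ideas that Shook the Universe",
   "authors": {"1": "Bryon D. Anderson", "2": "Nathan Spielberg"},
   "Century": 20}
]'

# fromJSON() turns the nested "authors" objects into a data frame column
books <- fromJSON(json_text)
str(books)

# flatten() replaces the nested column with one plain column per author slot
flat <- jsonlite::flatten(books)
names(flat)
```

After flattening, each author slot is an ordinary column (named with the parent column as a prefix), with NA where a book has fewer authors, so the result can be passed to kable() in a single call.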

Another important difference among the sources was the difficulty of parsing the HTML data. The tools for XML and JSON were fairly straightforward and produced output with only a few lines of code, whereas the HTML table had to be rebuilt cell by cell. The parse tree from the XML package provides a DOM that efficiently describes the relationships among the parts of the data. However, it is not very user-friendly and requires extra code to access the individual elements.
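
As an illustration of the extra code needed to reach elements of the parse tree, the sketch below queries a small XML document with XPath. The inline XML is a hypothetical, shortened stand-in for books.xml (the element names follow the column names shown earlier, but the file's real structure is an assumption).

```r
library(XML)

# Hypothetical sample with the same element names as the XML-sourced table
xml_text <- paste0(
  "<books>",
  "<book><title>Newton and the Counterfeiter</title>",
  "<author_1>Thomas Levenson</author_1></book>",
  "<book><title>The Swerve</title>",
  "<author_1>Stephen Greenblatt</author_1></book>",
  "</books>"
)

# Parse the text into an internal DOM tree
doc <- xmlParse(xml_text, asText = TRUE)

# XPath queries pull values out of the tree without walking it by hand
titles <- xpathSApply(doc, "//book/title", xmlValue)
first_author <- xpathSApply(doc, "//book[1]/author_1", xmlValue)

# For flat records like these, xmlToDataFrame() converts the whole tree at once
df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
```

XPath is convenient when only a few fields are needed; xmlToDataFrame() is the shorter route when every record has the same flat set of child elements.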

The code used to create our tables follows:

options(width = 190)
library(RCurl)
library(XML)
library(xml2)
library(jsonlite)
library(rvest)
library(ggplot2)
library(knitr)
# Parse the XML file and convert the tree to a data frame
Books.from.XML <- xmlParse(getURL("https://raw.githubusercontent.com/WigodskyD/data-sets/master/books.xml"))
root.1 <- xmlRoot(Books.from.XML)
my.xml.data.frame <- xmlToDataFrame(root.1)

# Parse the JSON file and bind its parts into one data frame
Books.from.json <- fromJSON(getURL("https://raw.githubusercontent.com/WigodskyD/data-sets/master/books.json"))
Books.from.json <- do.call("rbind", lapply(Books.from.json, data.frame, stringsAsFactors = FALSE))

# Read the HTML file and extract the header cells (th) and data cells (td)
book <- read_html(getURL("https://raw.githubusercontent.com/WigodskyD/data-sets/master/books3.html"))
book.b <- book %>%
  html_nodes("table th") %>%
  html_text()
book.c <- book %>%
  html_nodes("table td") %>%
  html_text()

# Arrange the 27 cell values row by row into a 3 x 9 matrix
book.matrix <- matrix(book.c, nrow = 3, ncol = 9, byrow = TRUE)

# Convert the matrix to a data frame and label the columns with the header text
Books.from.HTML <- data.frame(book.matrix)
colnames(Books.from.HTML) <- book.b

kable(Books.from.HTML[1:3,])

kable(my.xml.data.frame)

kable(Books.from.json[,1])

kable(Books.from.json[,2])

kable(Books.from.json[,3])

kable(Books.from.json[,4])

str(Books.from.json)

str(Books.from.HTML)

str(my.xml.data.frame)
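
The cell-by-cell rebuild of the HTML table in the listing above could probably be shortened with rvest's html_table(), which converts a table node to a data frame directly. The sketch below is an alternative, not the method used for this report, and the inline HTML is a hypothetical stand-in for books3.html.

```r
library(rvest)

# Hypothetical one-row table using the same header style as the HTML source
html_sample <- paste0(
  "<table>",
  "<tr><th>Title</th><th>Author_1</th></tr>",
  "<tr><td>The Swerve</td><td>Stephen Greenblatt</td></tr>",
  "</table>"
)

# read_html() accepts literal markup as well as a path or URL
page <- read_html(html_sample)

# html_table() reads the header row and the data cells in one step
books_tbl <- html_table(html_node(page, "table"), header = TRUE)
books_tbl
```

This relies on the table being well formed, with the same number of cells in every row; for irregular markup, extracting the cells by hand as in the listing above gives more control.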

Data_607_Week7.Rmd

Dan Wigodsky

March 16, 2018
