I created 3 separate files to store the information from my favorite children’s books.
* HTML
* JSON
* XML
The information will now be loaded into 3 separate data frames.
library(RCurl)
## Loading required package: bitops
library(XML)
## Warning: package 'XML' was built under R version 3.2.2
library(jsonlite)
## Warning: package 'jsonlite' was built under R version 3.2.2
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
#1st the html file
url_html <- getURL("https://raw.githubusercontent.com/ncapofari/IS_607/IS_607/favorite_books.html")
html_frame <- as.data.frame(readHTMLTable(url_html), stringsAsFactors = FALSE)
#2nd the xml file
url_xml <- getURL("https://raw.githubusercontent.com/ncapofari/IS_607/IS_607/favorite_books.xml")
xml_frame <- xmlToDataFrame(url_xml, stringsAsFactors = FALSE)
#3rd the json file
url_json <- getURL("https://raw.githubusercontent.com/ncapofari/IS_607/IS_607/favorite_books.json")
json_frame <- as.data.frame(fromJSON(url_json), stringsAsFactors = FALSE)
Here are the HTML, XML, and JSON files as data frames:
class(html_frame)
## [1] "data.frame"
html_frame
##                                    NULL.Title                NULL.Author
## 1             The Day the Babies Crawled Away              Peggy Rathman
## 2                                  Locomotive                Brian Floca
## 3                              Cha-cha Chimps              Julia Durango
## 4 The Juniper Tree and Other Tales from Grimm Jacob Grimm, Wilhelm Grimm
##   NULL.Illustrator NULL.Date NULL.Pages
## 1    Peggy Rathman      2003         40
## 2      Brian Floca      2013         64
## 3    Elanor Taylor      2010         32
## 4   Maurice Sendak      1973        352
class(xml_frame)
## [1] "data.frame"
xml_frame
##                                         title        author    illustrator
## 1             The Day the Babies Crawled Away Peggy Rathman  Peggy Rathman
## 2                                  Locomotive   Brian Floca    Brian Floca
## 3                              Cha-cha Chimps Julia Durango  Elanor Taylor
## 4 The Juniper Tree and Other Tales from Grimm               Maurice Sendak
##   date pages
## 1 2003    40
## 2 2013    64
## 3 2010    32
## 4 1973   352
class(json_frame)
## [1] "data.frame"
json_frame
##                 favorite_children_books.title
## 1             The Day the Babies Crawled Away
## 2                                  Locomotive
## 3                              Cha-cha Chimps
## 4 The Juniper Tree and Other Tales from Grimm
##   favorite_children_books.author favorite_children_books.illustrator
## 1                  Peggy Rathman                       Peggy Rathman
## 2                    Brian Floca                         Brian Floca
## 3                  Peggy Rathman                       Julia Durango
## 4     Jacob Grimm, Wilhelm Grimm                      Maurice Sendak
##   favorite_children_books.date favorite_children_books.pages
## 1                         2003                            40
## 2                         2013                            64
## 3                         2010                            32
## 4                         1973                           352
identical(html_frame, xml_frame)
## [1] FALSE
identical(html_frame, json_frame)
## [1] FALSE
identical(xml_frame, json_frame)
## [1] FALSE
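None of the 3 data frames are identical. They describe the same four books, but identical() also compares column names, and each format produced different ones: a NULL. prefix for HTML, a favorite_children_books. prefix for JSON, and plain lowercase names for XML.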
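As a rough illustration of why the names differ and how the frames could still be compared, the sketch below uses small stand-in data (two books, two columns) rather than the real files. It relies on the fact that as.data.frame() on a named list holding a multi-column data frame prefixes each column with the list element's name, which appears to be where the NULL. and favorite_children_books. prefixes come from. The helper normalize_names is a hypothetical name, not part of any package.

```r
# Stand-in for one parsed table (not the real file contents)
books <- data.frame(title  = c("Locomotive", "Cha-cha Chimps"),
                    author = c("Brian Floca", "Julia Durango"),
                    stringsAsFactors = FALSE)

# Converting a named list that holds a multi-column data frame
# prefixes every column with the list element's name
html_like <- as.data.frame(list(`NULL` = books), stringsAsFactors = FALSE)
json_like <- as.data.frame(list(favorite_children_books = books),
                           stringsAsFactors = FALSE)
names(html_like)
names(json_like)

# Hypothetical helper: drop everything up to the last "." and lowercase
normalize_names <- function(df) {
  names(df) <- tolower(sub("^.*\\.", "", names(df)))
  df
}

identical(html_like, json_like)                                    # FALSE: names differ
identical(normalize_names(html_like), normalize_names(json_like))  # TRUE
```

Applied to the real frames, the same comparison would still report a difference wherever the values themselves differ, for example the empty author field in xml_frame.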
The XML data frame has a further problem: for the book with multiple authors, the author field is empty.
This is the line in the XML document that holds the author information that did not show up:
<author author1="Jacob Grimm" author2="Wilhelm Grimm"/>
I will try to add the authors back into the data frame.
xml_doc <- htmlParse(url_xml)
#this returns a list with the attributes of each author node
#when I created the xml doc, the book with 2 authors stored them as attributes
multiple_authors <- xpathSApply(xml_doc, "//author", fun = xmlAttrs)
temp <- unlist(multiple_authors[4])
temp <- paste(temp[1], temp[2], sep = ", ")
xml_frame$author[4] <- temp
xml_frame
##                                         title                     author
## 1             The Day the Babies Crawled Away              Peggy Rathman
## 2                                  Locomotive                Brian Floca
## 3                              Cha-cha Chimps              Julia Durango
## 4 The Juniper Tree and Other Tales from Grimm Jacob Grimm, Wilhelm Grimm
##      illustrator date pages
## 1  Peggy Rathman 2003    40
## 2    Brian Floca 2013    64
## 3  Elanor Taylor 2010    32
## 4 Maurice Sendak 1973   352
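The fix above hard-codes row 4 and exactly two attributes. A slightly more general sketch follows, assuming xpathSApply() with xmlAttrs returns a list holding NULL for author nodes without attributes (which is what the indexing above suggests). The objects attr_list and authors are stand-ins for multiple_authors and xml_frame$author, so the block runs without the XML file.

```r
# Stand-in for multiple_authors: NULL where an <author> node has no attributes
attr_list <- list(NULL, NULL, NULL,
                  c(author1 = "Jacob Grimm", author2 = "Wilhelm Grimm"))

# Stand-in for xml_frame$author before the fix
authors <- c("Peggy Rathman", "Brian Floca", "Julia Durango", "")

# Find the nodes that carry attributes, whatever row they are in
has_attrs <- !vapply(attr_list, is.null, logical(1))

# Join however many author attributes each of those nodes has
authors[has_attrs] <- vapply(attr_list[has_attrs], paste,
                             character(1), collapse = ", ")
authors
```

With the real objects, the last assignment would target xml_frame$author[has_attrs] instead, so a second multi-author book would be picked up without editing the row index.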