First, let’s load the necessary packages. The XML package parses XML and HTML files, jsonlite parses JSON files, and RCurl provides getURL() for downloading them.
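The chunk that loads the packages is not shown in this write-up. Judging from the startup messages that follow, it was probably close to the sketch below; the exact set and order of the library() calls is an assumption.

```r
# Assumed package list, inferred from the startup messages; the original chunk is not shown.
library(RCurl)     # getURL()
library(XML)       # readHTMLTable(), xmlParse(), getNodeSet(), xmlToDataFrame()
library(jsonlite)  # fromJSON()
library(tidyr)
library(dplyr)
```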
## Loading required package: bitops
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Let’s read the HTML file from GitHub using the getURL() function and then read the HTML table with readHTMLTable(). Since the class of html_book is a list, I used the data.frame() function to convert it to a data frame.
html <- getURL("https://raw.githubusercontent.com/ekhahm/datascience/master/week7/books.html")
html_book <- readHTMLTable(html)
html_book
## $`NULL`
## Title
## 1 Harry potter and the philosopher's stone
## 2 Hitchhiker's guide to the galaxy book
## 3 Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch
## Author ISBN Genre
## 1 J. K. Rowling 0747532699 Fantasy
## 2 Douglas Adams 0330258648 Comic science fiction
## 3 Terry Pratchetth,Neil Gaiman 057504800X Horror
class(html_book)
## [1] "list"
Now let’s take a look at the HTML data table.
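The conversion from list to data frame described above is not shown in the code. A minimal sketch of that step might look like this, assuming the page contains a single table; the inline HTML and the name html_df are stand-ins for illustration, not the original file or variable.

```r
library(XML)

# Small inline table standing in for books.html (assumed structure).
html_text <- "<html><body><table>
<tr><th>Author</th><th>ISBN</th></tr>
<tr><td>J. K. Rowling</td><td>0747532699</td></tr>
<tr><td>Douglas Adams</td><td>0330258648</td></tr>
</table></body></html>"

# Parse the text explicitly as HTML content rather than as a file name.
doc <- htmlParse(html_text, asText = TRUE)

# readHTMLTable() returns a list with one element per <table> in the page.
tables <- readHTMLTable(doc)
class(tables)

# Pull out the first (and only) table and make sure it is a data frame.
html_df <- data.frame(tables[[1]])
class(html_df)
```

Indexing with [[1]] avoids having to refer to the list element by its unhelpful name (`NULL` in the output above).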
Let’s get the XML file from GitHub using the getURL() function and then parse it with xmlParse(). I used the getNodeSet() function to find the nodes matching each field in the XML tree, converted each node set to a data frame with xmlToDataFrame(), and named the resulting column with setNames(). Lastly, cbind() combines the four columns into one data frame.
xml <- getURL("https://raw.githubusercontent.com/ekhahm/datascience/master/week7/books.xml")
xml_book <- xmlParse(xml)
# Extract each field as a one-column data frame and name the column
a <- setNames(xmlToDataFrame(nodes = getNodeSet(xml_book, "//root/book/Title")), "Title")
b <- setNames(xmlToDataFrame(nodes = getNodeSet(xml_book, "//root/book/Author")), "Author")
c <- setNames(xmlToDataFrame(nodes = getNodeSet(xml_book, "//root/book/ISBN")), "ISBN")
d <- setNames(xmlToDataFrame(nodes = getNodeSet(xml_book, "//root/book/Genre")), "Genre")
# Bind the four columns into a single data frame
xml_book <- cbind(a, b, c, d)
xml_book
Now let’s take a look at the XML data table.
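As an alternative to extracting each field separately, xmlToDataFrame() can be given the book nodes themselves, in which case each child element becomes a column. This is only a sketch: the inline XML is a stand-in that assumes the root/book/field layout implied by the XPath expressions above, and xml_df is a hypothetical name.

```r
library(XML)

# Inline stand-in for books.xml, assuming a <root><book>...</book></root> layout.
xml_text <- "<root>
  <book><Title>Hitchhiker's guide to the galaxy book</Title><Author>Douglas Adams</Author><ISBN>0330258648</ISBN><Genre>Comic science fiction</Genre></book>
  <book><Title>Good Omens</Title><Author>Neil Gaiman</Author><ISBN>057504800X</ISBN><Genre>Horror</Genre></book>
</root>"

doc <- xmlParse(xml_text, asText = TRUE)

# One node per book; each child element (Title, Author, ...) becomes a column.
book_nodes <- getNodeSet(doc, "//root/book")
xml_df <- xmlToDataFrame(nodes = book_nodes, stringsAsFactors = FALSE)
xml_df
```

This relies on every book having the same child elements; the field-by-field approach is more explicit if the records are irregular.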
Let’s read the JSON file from GitHub using the getURL() function and then convert the JSON to an R object with fromJSON(). Since the class of json_book is a list, I used the data.frame() function to convert it to a data frame.
json <- getURL("https://raw.githubusercontent.com/ekhahm/datascience/master/week7/books.json")
json_book <- fromJSON(json)
json_book
Now let’s look at the json_df data table.
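The step that turns the parsed JSON into json_df is not shown. A minimal sketch, assuming the records sit under a single top-level key (the key name "books" and the inline JSON are hypothetical stand-ins for the real file), could be:

```r
library(jsonlite)

# Inline stand-in for books.json; the top-level key "books" is an assumption.
json_text <- '{"books": [
  {"Author": "J. K. Rowling", "ISBN": "0747532699"},
  {"Author": "Douglas Adams", "ISBN": "0330258648"}
]}'

# With a top-level object, fromJSON() returns a list wrapping the records.
json_list <- fromJSON(json_text)
class(json_list)

# The array of records has already been simplified to a data frame; extract it.
json_df <- data.frame(json_list[[1]], stringsAsFactors = FALSE)
json_df
```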
The three data frames from the HTML, XML, and JSON files look identical.
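Whether the frames are actually identical, rather than just printing the same, can be checked programmatically. The sketch below uses small hand-built frames as stand-ins for the three results, so the names and values are illustrative only.

```r
# Hand-built stand-ins for the three parsed results (illustrative values only).
html_df <- data.frame(Author = c("J. K. Rowling", "Douglas Adams"),
                      ISBN = c("0747532699", "0330258648"),
                      stringsAsFactors = FALSE)
xml_df  <- html_df
json_df <- html_df

# identical() is strict: column types and attributes must match exactly.
same_html_xml <- identical(html_df, xml_df)

# all.equal() returns TRUE or a description of the differences.
same_xml_json <- isTRUE(all.equal(xml_df, json_df))

c(same_html_xml, same_xml_json)
```

With the real data, frames that print the same can still differ in column type (for example factor versus character), in which case identical() returns FALSE until the columns are converted to a common type.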