Week 7: Working with XML and JSON in R

Task Description

Create three separate files that store information about a set of books in HTML, XML, and JSON formats. Include the title, authors, and two or three other attributes that you find interesting. Using R and any packages, load the information from the three sources into separate data frames. Check whether all three are identical.

Overview of Approach

I created each file by hand and uploaded it to GitHub so that it can be read and the results reproduced. I then converted each one into a data frame. Lastly, I compared the three data frames to check whether they were identical.
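To make the hand-written files more concrete, here is a minimal sketch of what one book record might look like in each format. The tag names, key names, and the `Example Book` record are assumptions for illustration only; the actual files on GitHub may be laid out differently.

```r
# Hypothetical layouts for one book record in each format.
# Tag and key names are assumptions, not the contents of the files on GitHub.
html_txt <- c(
  '<table>',
  '  <tr><th>Title</th><th>Authors</th><th>Year</th></tr>',
  '  <tr><td>Example Book</td><td>A. Author, B. Author</td><td>2013</td></tr>',
  '</table>'
)
xml_txt <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<books>',
  '  <book>',
  '    <Title>Example Book</Title>',
  '    <Authors>A. Author, B. Author</Authors>',
  '    <Year>2013</Year>',
  '  </book>',
  '</books>'
)
json_txt <- c(
  '[',
  '  {"Title": "Example Book", "Authors": "A. Author, B. Author", "Year": 2013}',
  ']'
)

# Write the three files to a temporary directory so nothing is overwritten.
paths <- file.path(tempdir(), c("books.html", "books.xml", "books.json"))
writeLines(html_txt, paths[1])
writeLines(xml_txt,  paths[2])
writeLines(json_txt, paths[3])
```

Keeping the same field names in all three files makes the later comparison simpler, since the resulting data frames should end up with matching column names.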

Results

HTML

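library(rvest)  # read_html(), html_nodes(), html_table(); also re-exports %>%
library(knitr)  # kable()

# Read the page, select the table node, and convert it to a data frame
books_html <- read_html("https://raw.githubusercontent.com/okhaimova/DATA-607/master/Week7/books.html") %>%
  html_nodes("table") %>%
  html_table() %>%
  as.data.frame()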

kable(books_html)

Title	Authors	Publisher	Year	Pages	ISBN
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking	Foster Provost, Tom Fawcett	O’Reilly Media, Inc.	2013	414	978-1449361327
Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining	Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis	Wiley	2015	474	978-1118834817
Customers who viewed R for Data Science: Import, Tidy, Transform, Visualize, and Model Data	Hadley Wickham, Garrett Grolemund	O’Reilly Media, Inc.	2017	520	978-1491910399

XML

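library(XML)    # xmlParse(), xmlToDataFrame()
library(RCurl)  # getURL(), used to fetch the file over HTTPS

books_xml <- xmlParse(getURL("https://raw.githubusercontent.com/okhaimova/DATA-607/master/Week7/books.xml")) %>%
  xmlToDataFrame()
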
kable(books_xml)

Title	Authors	Publisher	Year	Pages	ISBN
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking	Foster Provost, Tom Fawcett	O’Reilly Media, Inc.	2013	414	978-1449361327
Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining	Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis	Wiley	2015	474	978-1118834817
Customers who viewed R for Data Science: Import, Tidy, Transform, Visualize, and Model Data	Hadley Wickham, Garrett Grolemund	O’Reilly Media, Inc.	2017	520	978-1491910399

JSON

Note: I had to remove the row names, and at first I also had to replace a string of garbled characters because one author has a German letter (ß) in their name. However, if the file contains Unicode data, you can simply pass encoding = "UTF-8" when loading it. I also transposed the data frame so that each book is a row.

books_json <- fromJSON("https://raw.githubusercontent.com/okhaimova/DATA-607/master/Week7/books.json", encoding = "UTF-8") %>%
  as.data.frame() %>%
  t() %>%
  as.data.frame() %>%
  remove_rownames()

kable(books_json)

Title	Authors	Publisher	Year	Pages	ISBN
Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking	Foster Provost, Tom Fawcett	O’Reilly Media, Inc.	2013	414	978-1449361327
Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining	Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis	Wiley	2015	474	978-1118834817
Customers who viewed R for Data Science: Import, Tidy, Transform, Visualize, and Model Data	Hadley Wickham, Garrett Grolemund	O’Reilly Media, Inc.	2017	520	978-1491910399
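Whether the transposition step is needed depends on how the JSON is laid out and which parser is used. As a hedged sketch, assuming the `jsonlite` package (not necessarily the package used above) and a hypothetical file structured as an array of objects with one object per book, the result is already a data frame with one row per book:

```r
library(jsonlite)  # assumption: jsonlite is installed

# Hypothetical JSON: an array of objects, one object per book.
# The ß is written as a JSON \u escape so the example does not depend on file encoding.
json_txt <- '[
  {"Title": "Book A", "Authors": "Peter Mei\\u00dfner", "Year": 2015},
  {"Title": "Book B", "Authors": "Another Author", "Year": 2017}
]'

# jsonlite simplifies an array of same-shaped objects into a data frame,
# so no t() or remove_rownames() step is needed for this layout.
books <- fromJSON(json_txt)
str(books)
```

With this layout the numeric field is parsed as a number rather than text, which matters for a strict comparison against data frames built from the other formats.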

Comparison & Findings

Using the transitive property, I compared books_html to books_json and to books_xml: if both comparisons hold, then books_json and books_xml must match as well. The end result was that all three data frames were identical.

all(books_html == books_json) && all(books_html == books_xml)

## [1] TRUE
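One caveat about this kind of check: `==` on data frames compares values cell by cell and coerces types along the way, so it can report a match even when column types differ between parsers. The sketch below uses two made-up data frames (`a` and `b` are illustrative, not the book data) to show how a value comparison and a strict `identical()` check can disagree, and one possible way to reconcile them.

```r
# Two toy data frames with the same values but different column types,
# mimicking one parser that guesses types and one that keeps everything as text.
a <- data.frame(Title = c("Book A", "Book B"), Year = c(2013L, 2015L),
                stringsAsFactors = FALSE)
b <- data.frame(Title = c("Book A", "Book B"), Year = c("2013", "2015"),
                stringsAsFactors = FALSE)

# Element-wise comparison coerces types, so every cell matches.
cells_match <- all(a == b)

# identical() also checks column types and attributes, so it reports a difference.
strictly_identical <- identical(a, b)

# Converting every column to character puts both data frames on a common footing.
as_text <- function(df) {
  df[] <- lapply(df, as.character)
  df
}
identical_as_text <- identical(as_text(a), as_text(b))
```

Here `cells_match` is TRUE, `strictly_identical` is FALSE, and `identical_as_text` is TRUE. Whether the three book data frames pass the stricter check depends on the column types each parser produced, which `str()` would reveal.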