Create three separate files that store the same book information in HTML, XML, and JSON formats. Include the title, authors, and two or three other attributes that you find interesting. Using R and any packages, load the information from the three sources into separate data frames, then check whether all three are identical.
I created each file by hand and uploaded it to GitHub so that the results can be read and reproduced. I then converted each one into a data frame. Lastly, I compared the three data frames to see whether they were identical.
books_html <- read_html("https://raw.githubusercontent.com/okhaimova/DATA-607/master/Week7/books.html") %>%
html_nodes("table") %>%
html_table() %>%
as.data.frame()
kable(books_html)
| Title | Authors | Publisher | Year | Pages | ISBN |
|---|---|---|---|---|---|
| Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | Foster Provost, Tom Fawcett | O’Reilly Media, Inc. | 2013 | 414 | 978-1449361327 |
| Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining | Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis | Wiley | 2015 | 474 | 978-1118834817 |
| Customers who viewed R for Data Science: Import, Tidy, Transform, Visualize, and Model Data | Hadley Wickham, Garrett Grolemund | O’Reilly Media, Inc. | 2017 | 520 | 978-1491910399 |
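As a minimal sketch of the same HTML step without a network connection, the snippet below parses an inline table with rvest. The markup and the names `html_text_input` and `books_demo_html` are made up for illustration and are not the contents of books.html. It also picks the first table explicitly, which avoids surprises if a page ever contains more than one table.

```r
library(rvest)

# Hypothetical stand-in for books.html: one table with a header row.
html_text_input <- '
<html><body>
<table>
  <tr><th>Title</th><th>Year</th><th>Pages</th></tr>
  <tr><td>Book A</td><td>2013</td><td>414</td></tr>
  <tr><td>Book B</td><td>2015</td><td>474</td></tr>
</table>
</body></html>'

page <- read_html(html_text_input)

# html_table() on a node set returns a list with one element per table.
tables <- html_table(html_nodes(page, "table"))

# Take the first (and here only) table and convert it to a plain data frame.
books_demo_html <- as.data.frame(tables[[1]])

str(books_demo_html)
```

Note that `html_table()` guesses column types, so numeric-looking columns such as Year may not come back as character.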
books_xml <- xmlParse(getURL("https://raw.githubusercontent.com/okhaimova/DATA-607/master/Week7/books.xml")) %>%
xmlToDataFrame()
kable(books_xml)
| Title | Authors | Publisher | Year | Pages | ISBN |
|---|---|---|---|---|---|
| Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | Foster Provost, Tom Fawcett | O’Reilly Media, Inc. | 2013 | 414 | 978-1449361327 |
| Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining | Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis | Wiley | 2015 | 474 | 978-1118834817 |
| Customers who viewed R for Data Science: Import, Tidy, Transform, Visualize, and Model Data | Hadley Wickham, Garrett Grolemund | O’Reilly Media, Inc. | 2017 | 520 | 978-1491910399 |
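The XML step can be sketched the same way on an inline string, assuming a layout with one record element per book and one child element per attribute. The element names and the variables `xml_text_input` and `books_demo_xml` are hypothetical, since the structure of books.xml is not shown here.

```r
library(XML)

# Hypothetical stand-in for books.xml: one <book> element per row.
xml_text_input <- '<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book><Title>Book A</Title><Year>2013</Year><Pages>414</Pages></book>
  <book><Title>Book B</Title><Year>2015</Year><Pages>474</Pages></book>
</books>'

# asText = TRUE tells xmlParse() that the input is XML content, not a file name.
doc <- xmlParse(xml_text_input, asText = TRUE)

# Each child of the root becomes a row; each grandchild becomes a column.
books_demo_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

str(books_demo_xml)
```

With this approach every column arrives as character, which matters later when the data frames are compared.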
Note: I had to remove the row names and transpose the data frame. I also had to replace a string of characters because one author's name contains a German letter (ß). However, if the file contains Unicode data, you can simply pass encoding = "UTF-8" when loading it.
books_json <- fromJSON("https://raw.githubusercontent.com/okhaimova/DATA-607/master/Week7/books.json", encoding = "UTF-8") %>%
as.data.frame() %>%
t() %>%
as.data.frame() %>%
remove_rownames()
kable(books_json)
| Title | Authors | Publisher | Year | Pages | ISBN |
|---|---|---|---|---|---|
| Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking | Foster Provost, Tom Fawcett | O’Reilly Media, Inc. | 2013 | 414 | 978-1449361327 |
| Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining | Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis | Wiley | 2015 | 474 | 978-1118834817 |
| Customers who viewed R for Data Science: Import, Tidy, Transform, Visualize, and Model Data | Hadley Wickham, Garrett Grolemund | O’Reilly Media, Inc. | 2017 | 520 | 978-1491910399 |
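For comparison, here is a hedged sketch of an alternative JSON route using the jsonlite package, which may differ from the package used above. It assumes the JSON is an array of records (one object per book), a layout that jsonlite turns into a data frame directly, so no transposing is needed. The data and the names `json_text_input` and `books_demo_json` are invented for illustration.

```r
library(jsonlite)

# Hypothetical stand-in for books.json: an array with one object per book.
# The ß in the author name is written as a JSON \u escape.
json_text_input <- '[
  {"Title": "Book A", "Authors": "Peter Mei\\u00dfner", "Year": 2013},
  {"Title": "Book B", "Authors": "Jane Doe", "Year": 2015}
]'

# An array of flat objects is simplified to a data frame, one row per object.
books_demo_json <- fromJSON(json_text_input)

str(books_demo_json)
```

Whether this applies to the real file depends on how books.json is structured, which is an assumption here.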
Using the transitive property, I compared books_html to books_json and to books_xml; if both comparisons hold, books_json and books_xml must match as well. The end result was that all three data frames were identical.
# == compares cell by cell; all() collapses each result to a single TRUE/FALSE
all(books_html == books_json) && all(books_html == books_xml)
## [1] TRUE
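The `==` operator compares data frames by value, cell by cell, so it can report a match even when column types differ. The base R sketch below illustrates the difference between that check and the stricter `identical()`. The data frames `a` and `b` are toy examples, not the book data.

```r
# Two toy data frames with the same values but different column types.
a <- data.frame(Title = c("Book A", "Book B"), Year = c(2013L, 2015L),
                stringsAsFactors = FALSE)
b <- data.frame(Title = c("Book A", "Book B"), Year = c("2013", "2015"),
                stringsAsFactors = FALSE)

cell_equal <- a == b   # logical matrix, one entry per cell
all(cell_equal)        # TRUE: integers are coerced to character before comparing
identical(a, b)        # FALSE: Year is integer in a but character in b

# Converting every column to character aligns the types before a strict check.
a_chr <- as.data.frame(lapply(a, as.character), stringsAsFactors = FALSE)
all.equal(a_chr, b)
```

Which check is appropriate depends on whether "identical" is meant as equal values or as equal values and types.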