Assignment
The assignment is to use a text editor to create a catalog of 3 books, each with a title, author(s), and 2-3 other attributes, and to store that catalog in 3 popular web data exchange formats: HTML, XML, and JSON.
Write code in R to load the information from the 3 sources into separate R data frames.
Compare the three data frames.
Collaborators: Magnus Skonberg
Acquire Data
Read the three book catalog files (HTML, XML, and JSON) from GitHub. Each file is fetched with getURL, then coerced into a data frame with a format-specific parser: xmlToDataFrame for XML, fromJSON followed by as.data.frame for JSON, and readHTMLTable followed by as.data.frame for HTML.
Then show a glimpse of each resulting data frame.
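The three parsing paths can be tried offline on small inline strings before pointing them at the real files. The sketch below assumes the XML and rjson packages are installed. The two-book catalog and its layout (a catalog/book element tree, column-oriented JSON arrays, a single HTML table with a header row) are hypothetical and are not the contents of the GitHub files.

```r
library(XML)    # xmlParse(), xmlToDataFrame(), htmlParse(), readHTMLTable()
library(rjson)  # fromJSON()

# Hypothetical two-book catalog in each format (assumed layouts, for illustration only)
xml_text  <- "<catalog><book><title>Book A</title><price>10.00</price></book><book><title>Book B</title><price>12.50</price></book></catalog>"
json_text <- '{"title": ["Book A", "Book B"], "price": ["10.00", "12.50"]}'
html_text <- "<html><body><table><tr><th>Title</th><th>Price</th></tr><tr><td>Book A</td><td>10.00</td></tr><tr><td>Book B</td><td>12.50</td></tr></table></body></html>"

# XML: one row per <book>, one column per child element
xml_df <- xmlToDataFrame(xmlParse(xml_text, asText = TRUE), stringsAsFactors = FALSE)

# JSON: each array becomes a vector, and the list of vectors becomes the columns
json_df <- as.data.frame(fromJSON(json_text), stringsAsFactors = FALSE)

# HTML: readHTMLTable() returns a list holding one data frame per <table>
html_tables <- readHTMLTable(htmlParse(html_text, asText = TRUE), stringsAsFactors = FALSE)
html_df <- html_tables[[1]]

str(xml_df)
str(json_df)
str(html_df)

# Calling as.data.frame() on the whole list, as the report does, is the pattern
# that produced the NULL.-prefixed column names in the report's HTML glimpse.
names(as.data.frame(html_tables))
```

Taking the first list element with `[[1]]` keeps the header names from the table itself, which may be preferable when the column names matter downstream.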
# Packages used below (loaded in the report's hidden setup chunk)
library(tidyverse)  # glimpse()
library(XML)        # xmlToDataFrame(), readHTMLTable()
library(RCurl)      # getURL()
library(rjson)      # fromJSON()
library(useful)     # compare.list()
# Read in XML file then store as data frame
books_xml <- getURL("https://raw.githubusercontent.com/CUNYSPS-RickRN/DATA607/master/D607_A07_xml_books.xml")
books_xml_df <- xmlToDataFrame(books_xml)
# Read in JSON file then store as data frame
#books_json_file <- getURL("https://raw.githubusercontent.com/CUNYSPS-RickRN/DATA607/master/D607_A07_JSON_books.json")
books_json_file <- getURL("https://raw.githubusercontent.com/CUNYSPS-RickRN/DATA607/master/D607_A07_jsonV2_books.json")
books_json_interim <- fromJSON(books_json_file)
books_json_df <- as.data.frame(books_json_interim)
# Read in HTML file then store as data frame
books_html <- getURL("https://raw.githubusercontent.com/CUNYSPS-RickRN/DATA607/master/D607_A07_html_books.html")
books_html_df <- as.data.frame(readHTMLTable(books_html))
glimpse(books_html_df)
## Rows: 3
## Columns: 5
## $ NULL.Title <chr> "R for Data Science", "Data Science for Business", ...
## $ NULL.Author <chr> "Hadley Wickham,Garrett Grolemund", "Foster Provost...
## $ NULL.Price <chr> "24.99", "43.49", "31.92"
## $ NULL.Pub.Format <chr> "Kindle", "Paperback", "Paperback"
## $ NULL.Pub.Year <chr> "2017", "2013", "2017"
glimpse(books_json_df)
## Rows: 3
## Columns: 5
## $ title <chr> "R for Data Science", "Data Science for Business", "R for...
## $ author <chr> "Hadley Wickham,Garrett Grolemund", "Foster Provost,Tom F...
## $ price <chr> "24.99", "43.49", "31.92"
## $ pubformat <chr> "Kindle", "Paperback", "Paperback"
## $ pubyear <chr> "2017", "2013", "2017"
glimpse(books_xml_df)
## Rows: 3
## Columns: 5
## $ title <chr> "R for Data Science", "Data Science for Business", "R for...
## $ author <chr> "Hadley Wickham,Garrett Grolemund", "Foster Provost,Tom F...
## $ price <chr> "24.99", "43.49", "31.92"
## $ pubformat <chr> "Kindle", "Paperback", "Paperback"
## $ pubyear <chr> "2017", "2013", "2017"
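Every column in the glimpse output is character (`<chr>`), including price and publication year. A minimal base R sketch of converting those columns for later numeric work follows. The `books` frame is a hypothetical stand-in that holds only the columns printed in full above, and the author string is the one shown in full for the first book.

```r
# Hypothetical stand-in for one of the loaded frames (only fully printed columns)
books <- data.frame(
  price     = c("24.99", "43.49", "31.92"),
  pubformat = c("Kindle", "Paperback", "Paperback"),
  pubyear   = c("2017", "2013", "2017"),
  stringsAsFactors = FALSE
)

# Convert text columns to the types they represent
books$price   <- as.numeric(books$price)
books$pubyear <- as.integer(books$pubyear)

str(books)         # price is now num, pubyear is now int
mean(books$price)  # arithmetic works once price is numeric

# Multiple authors are stored as one comma-separated string; split when needed
authors <- strsplit("Hadley Wickham,Garrett Grolemund", ",", fixed = TRUE)[[1]]
authors
```

Whether to convert before or after comparing the frames is a design choice. Converting all three the same way keeps them comparable.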
Compare
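In this step, the compare.list function from the useful package compares the HTML, JSON, and XML data frames pairwise. compare.list checks its two inputs element by element, which for data frames means column by column in position order, so the NULL. prefix on the HTML frame's column names does not affect the result. All 3 data frames matched each other in every column.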
# Use compare.list() from the useful package to compare the elements of two equal-length lists.
compare.list(as.data.frame(books_html_df), as.data.frame(books_xml_df))
## [1] TRUE TRUE TRUE TRUE TRUE
compare.list(as.data.frame(books_html_df), as.data.frame(books_json_df))
## [1] TRUE TRUE TRUE TRUE TRUE
compare.list(as.data.frame(books_xml_df), as.data.frame(books_json_df))
## [1] TRUE TRUE TRUE TRUE TRUE
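The five TRUE values per call are one result per column. The base R sketch below illustrates the same positional idea on two hypothetical frames whose values agree but whose column names differ, as the HTML frame's names do. It is an illustration of the idea and not the useful package's own code.

```r
# Two hypothetical frames: same values, different column names
a <- data.frame(NULL.Price = c("24.99", "43.49"),
                NULL.Pub.Year = c("2017", "2013"),
                stringsAsFactors = FALSE)
b <- data.frame(price = c("24.99", "43.49"),
                pubyear = c("2017", "2013"),
                stringsAsFactors = FALSE)

# Column-by-column check in position order; column names play no part
same_values <- unname(mapply(identical, a, b))
same_values

# Comparing the frames as whole objects also compares the names
whole_before <- identical(a, b)

# Aligning the names first lets a whole-frame comparison succeed
names(a) <- names(b)
whole_after <- identical(a, b)

c(before = whole_before, after = whole_after)
```

A positional check answers "do the values agree?", while a whole-object check also requires matching names and attributes. Which one is appropriate depends on whether the column names are part of what should match.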
Summary
This assignment covered 3 popular formats for exchanging data over the web: HTML, XML, and JSON. The same sample catalog of 3 books was written in each format with a text editor, and each file was then loaded into an R data frame for subsequent processing. Because the three files hold the same catalog, the resulting data frames should match each other, and the comparison confirmed that they do.