Overview
I used Dreamweaver to create three files in HTML, XML, and JSON formats. Each file holds a table with exactly the same information: the title, authors, edition, publisher, and publication year of three of my favorite books for learning data science. Note that two of the three books have two authors.
I will load these files into three separate data frames in R, compare them pairwise to check for similarities and differences, and share the findings.
Libraries
library(XML)
library(jsonlite)
library(RCurl)
library(tidyverse)
library(kableExtra)
## -- Attaching packages --------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter() masks stats::filter()
## x purrr::flatten() masks jsonlite::flatten()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
Data & Data frames
Get URL
html_url <- "https://raw.githubusercontent.com/jnataky/DATA-607/master/Web_technologies/books_html.html"
xml_url <- "https://raw.githubusercontent.com/jnataky/DATA-607/master/Web_technologies/books_xml.xml"
json_url <- "https://raw.githubusercontent.com/jnataky/DATA-607/master/Web_technologies/books_json.json"
Load files
HTML
file_html <- getURL(html_url)
df_html <- readHTMLTable(file_html, which = 1)
# Kable for tidy table
df_html %>%
  kbl(caption = "My Favorite books") %>%
  kable_material(c("striped", "hover")) %>%
  row_spec(0, color = "indigo")
My Favorite books
| Title | Authors | Edition | Publisher | Publication_Year |
|---|---|---|---|---|
| Understandable Statistics | Charles H. Brase and Corrinne P. Brase | Twelfth | Cengage Learning | 2018 |
| Hands-On Machine Learning with Scikit-Learn and TensorFlow | Aurelien Geron | First | O’Reilly | 2017 |
| Data Science for Business | Foster Provost and Tom Fawcett | First | O’Reilly | 2013 |
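To see what `readHTMLTable()` does without fetching the file, the sketch below parses a small inline HTML table. The markup, the two sample rows, and the name `df_demo_html` are illustrative assumptions, not the contents of `books_html.html`; `which = 1` selects the first table in the document, as in the code above.

```r
library(XML)

# Illustrative markup (an assumption; the real books_html.html may differ)
html_text <- '
<html><body>
<table>
  <tr><th>Title</th><th>Publication_Year</th></tr>
  <tr><td>Understandable Statistics</td><td>2018</td></tr>
  <tr><td>Data Science for Business</td><td>2013</td></tr>
</table>
</body></html>'

# Parse the string explicitly as text, then read the first table it contains
doc <- htmlParse(html_text, asText = TRUE)
df_demo_html <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)

# Inspect column names and types
str(df_demo_html)
```

With no `colClasses` supplied, the cells should come back as character strings, including the year, so any numeric column would need an explicit conversion.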
XML
file_xml <- getURL(xml_url)
df_xml <- xmlToDataFrame(file_xml)
# Kable for tidy table
df_xml %>%
  kbl(caption = "My Favorite books") %>%
  kable_material(c("striped", "hover")) %>%
  row_spec(0, color = "indigo")
My Favorite books
| Title | Authors | Edition | Publisher | Publication_Year |
|---|---|---|---|---|
| Understandable Statistics | Charles H. Brase and Corrinne P. Brase | Twelfth | Cengage Learning | 2018 |
| Hands-On Machine Learning with Scikit-Learn and TensorFlow | Aurelien Geron | First | O’Reilly | 2017 |
| Data Science for Business | Foster Provost and Tom Fawcett | First | O’Reilly | 2013 |
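The same idea can be tried offline for XML. In this minimal sketch, `xmlToDataFrame()` treats each child of the root node as a row and each grandchild as a column. The element names `books` and `book`, the sample rows, and the name `df_demo_xml` are assumptions made for illustration; the layout of `books_xml.xml` itself is not shown in this report.

```r
library(XML)

# Illustrative document (an assumption; the real books_xml.xml may differ)
xml_text <- '
<books>
  <book>
    <Title>Understandable Statistics</Title>
    <Publication_Year>2018</Publication_Year>
  </book>
  <book>
    <Title>Data Science for Business</Title>
    <Publication_Year>2013</Publication_Year>
  </book>
</books>'

# Parse the string as text, then flatten one level of nesting into rows
doc <- xmlParse(xml_text, asText = TRUE)
df_demo_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Inspect column names and types
str(df_demo_xml)
```

This shallow, regular shape is what `xmlToDataFrame()` is designed for; a more deeply nested document would likely need XPath queries instead.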
JSON
df_json <- as.data.frame(fromJSON(json_url))
# Change columns names to match others
names(df_json) <- c("Title", "Authors", "Edition", "Publisher", "Publication_Year")
# Kable for tidy table
df_json %>%
  kbl(caption = "My Favorite books") %>%
  kable_material(c("striped", "hover")) %>%
  row_spec(0, color = "indigo")
My Favorite books
| Title | Authors | Edition | Publisher | Publication_Year |
|---|---|---|---|---|
| Understandable Statistics | Charles H. Brase and Corrinne P. Brase | Twelfth | Cengage Learning | 2018 |
| Hands-On Machine Learning with Scikit-Learn and TensorFlow | Aurelien Geron | First | O’Reilly | 2017 |
| Data Science for Business | Foster Provost and Tom Fawcett | First | O’Reilly | 2013 |
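One plausible reason the JSON column names needed changing is sketched below. If the array of books sits under a top-level key, `fromJSON()` returns a list holding one data frame, and `as.data.frame()` prefixes every column with that key. The key `books`, the lowercase field names, and the sample rows are hypothetical; the actual layout of `books_json.json` is not shown in this report.

```r
library(jsonlite)

# Hypothetical layout: the array of records sits under a top-level key "books"
json_text <- '{
  "books": [
    {"title": "Understandable Statistics", "year": "2018"},
    {"title": "Data Science for Business", "year": "2013"}
  ]
}'

parsed <- fromJSON(json_text)          # a list containing one data frame
df_demo_json <- as.data.frame(parsed)  # column names gain the "books." prefix
prefixed_names <- names(df_demo_json)
print(prefixed_names)

# Rename the columns so they line up with the HTML and XML frames
names(df_demo_json) <- c("Title", "Publication_Year")
```

Extracting the element directly, for example `parsed$books`, would avoid the prefix, although the field names would still have to match the other frames before a comparison.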
Comparison
# HTML & XML
all.equal(df_html, df_xml)
## [1] TRUE
# HTML & JSON
all.equal(df_html, df_json)
## [1] TRUE
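Both comparisons return TRUE: the HTML frame matches the XML frame and the JSON frame, so the XML and JSON frames match each other as well. All three data frames therefore have the same contents, once the JSON data frame's columns are renamed (its column names initially differed from the others). The frames look the same, but the internal structure of the underlying formats might be different.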
Find more info on those formats in the Findings section below.
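The report checks two of the three possible pairs. As a minimal sketch, all pairs can be compared in one pass with base R. The frames below are small stand-ins with assumed values, and one of them deliberately stores the year as a number so that a mismatch is visible. Because `all.equal()` returns a character vector describing the differences rather than `FALSE`, it is wrapped in `isTRUE()`.

```r
# Stand-in frames (assumed values, not the real book data)
a <- data.frame(Title = c("Book A", "Book B"),
                Year  = c("2018", "2013"),
                stringsAsFactors = FALSE)
b <- a
c_num <- transform(a, Year = as.numeric(Year))  # same values, numeric year

frames <- list(html = a, xml = b, json = c_num)

# Every unordered pair of frame names: html/xml, html/json, xml/json
pairs <- combn(names(frames), 2, simplify = FALSE)

# TRUE only where all.equal() reports no differences
results <- vapply(
  pairs,
  function(p) isTRUE(all.equal(frames[[p[1]]], frames[[p[2]]])),
  logical(1)
)
names(results) <- vapply(pairs, paste, character(1), collapse = " vs ")

print(results)
```

Here only the `html vs xml` pair matches, because the numeric year column in the third frame differs in type from the character columns in the other two.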
Findings
Working with these data exchange formats leads to the following conclusions:
The native structure of HTML does not naturally map onto R objects. We can import HTML files as raw text, but this deprives us of the most useful features of these documents. XML is a more generic counterpart to HTML and a frequently used format for exchanging data on the Web. JSON, on the other hand, is more lightweight thanks to its less verbose syntax, and it allows only a limited set of data types that are compatible with many programming languages.
Source: Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2015). Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Chichester: Wiley.
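The point about JSON's limited set of data types can be illustrated with a short sketch. The record below is made up for the example (the field names `title`, `year`, and `in_print` are assumptions); it shows that JSON strings, numbers, and booleans arrive in R already typed.

```r
library(jsonlite)

# Made-up record using three JSON value types: string, number, boolean
typed_json <- '[
  {"title": "Data Science for Business", "year": 2013, "in_print": true}
]'

typed <- fromJSON(typed_json)

# Each column carries the type implied by the JSON value
col_classes <- sapply(typed, class)
print(col_classes)
```

By contrast, values read from HTML or XML markup are plain text, so numeric columns usually need an explicit conversion step after loading.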