Overview

I used Dreamweaver to create three files in HTML, XML, and JSON formats. Each file is a table containing exactly the same information: the title, authors, edition, publisher, and publication year of three of my favorite books for learning data science. Note that two of the three books have two authors.

I will load these files into three separate data frames in R, compare them pairwise to check for any similarities and differences, and share the findings.

Libraries

library(XML)
library(jsonlite)
library(RCurl)

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter()   masks stats::filter()
## x purrr::flatten()  masks jsonlite::flatten()
## x dplyr::lag()      masks stats::lag()
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows

Data & Data frames

Get URL

html_url <- "https://raw.githubusercontent.com/jnataky/DATA-607/master/Web_technologies/books_html.html"

xml_url <- "https://raw.githubusercontent.com/jnataky/DATA-607/master/Web_technologies/books_xml.xml"

json_url <- "https://raw.githubusercontent.com/jnataky/DATA-607/master/Web_technologies/books_json.json"

Load files

HTML

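# Download the raw HTML as a character string
file_html <- getURL(html_url)

# Parse the first table in the document into a data frame
df_html <- readHTMLTable(file_html, which = 1)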

# Kable for tidy table

df_html %>%
  kbl(caption = "My Favorite books") %>%
  kable_material(c("striped", "hover")) %>%
  row_spec(0, color = "indigo")
My Favorite books

| Title | Authors | Edition | Publisher | Publication_Year |
| --- | --- | --- | --- | --- |
| Understandable Statistics | Charles H. Brase and Corrinne P. Brase | Twelfth | Cengage Learning | 2018 |
| Hands-On Machine Learning with Scikit-Learn and TensorFlow | Aurelien Geron | First | O’Reilly | 2017 |
| Data Science for Business | Foster Provost and Tom Fawcett | First | O’Reilly | 2013 |
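To make the HTML step easier to try without network access, here is a minimal sketch that parses a small inline table. The `html_text` string and the `df_demo` name are hypothetical stand-ins and are not the contents of `books_html.html`; the sketch also parses the string explicitly with `htmlParse()` before calling `readHTMLTable()`.

```r
library(XML)

# Hypothetical inline HTML (an assumption, not the real books_html.html)
html_text <- '<html><body>
<table>
<tr><th>Title</th><th>Publication_Year</th></tr>
<tr><td>Data Science for Business</td><td>2013</td></tr>
<tr><td>Understandable Statistics</td><td>2018</td></tr>
</table>
</body></html>'

# Parse the string into an HTML document, then extract the first table
doc <- htmlParse(html_text, asText = TRUE)
df_demo <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)

# Inspect the column names and types of the result
str(df_demo)
```

Without a `colClasses` argument, every cell is read as text, so a year column will likely need an explicit conversion before any numeric work.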

XML

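# Download the raw XML as a character string
file_xml <- getURL(xml_url)

# Convert each child of the root node into one data frame row
df_xml <- xmlToDataFrame(file_xml)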


# Kable for tidy table

df_xml %>%
  kbl(caption = "My Favorite books") %>%
  kable_material(c("striped", "hover")) %>%
  row_spec(0, color = "indigo")
My Favorite books

| Title | Authors | Edition | Publisher | Publication_Year |
| --- | --- | --- | --- | --- |
| Understandable Statistics | Charles H. Brase and Corrinne P. Brase | Twelfth | Cengage Learning | 2018 |
| Hands-On Machine Learning with Scikit-Learn and TensorFlow | Aurelien Geron | First | O’Reilly | 2017 |
| Data Science for Business | Foster Provost and Tom Fawcett | First | O’Reilly | 2013 |
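The XML step can be sketched the same way on an inline document. The element names below (`books`, `book`) and the `xml_text` and `df_demo` names are assumptions chosen for illustration; the actual layout of `books_xml.xml` may differ.

```r
library(XML)

# Hypothetical inline XML (an assumption, not the real books_xml.xml)
xml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book>
    <Title>Data Science for Business</Title>
    <Publication_Year>2013</Publication_Year>
  </book>
  <book>
    <Title>Understandable Statistics</Title>
    <Publication_Year>2018</Publication_Year>
  </book>
</books>'

# Parse the string, then turn each <book> node into one row
doc <- xmlParse(xml_text, asText = TRUE)
df_demo <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Child element names become column names; values are read as text
str(df_demo)
```

This works because the document is flat: every record has the same simple child elements. Deeper nesting would probably need XPath queries instead of `xmlToDataFrame()`.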

JSON

# Read the JSON file straight from the URL and coerce it to a data frame
df_json <- as.data.frame(fromJSON(json_url))
# Rename columns to match the other data frames

names(df_json) <- c("Title", "Authors", "Edition", "Publisher", "Publication_Year")

# Kable for tidy table

df_json %>%
  kbl(caption = "My Favorite books") %>%
  kable_material(c("striped", "hover")) %>%
  row_spec(0, color = "indigo")
My Favorite books

| Title | Authors | Edition | Publisher | Publication_Year |
| --- | --- | --- | --- | --- |
| Understandable Statistics | Charles H. Brase and Corrinne P. Brase | Twelfth | Cengage Learning | 2018 |
| Hands-On Machine Learning with Scikit-Learn and TensorFlow | Aurelien Geron | First | O’Reilly | 2017 |
| Data Science for Business | Foster Provost and Tom Fawcett | First | O’Reilly | 2013 |
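One plausible reason the JSON columns needed renaming is a top-level key wrapping the array of books. The sketch below assumes such a wrapper, here called `favorite_books`; that key, the `json_text` string, and the `wrapped` and `direct` names are all hypothetical, and the real `books_json.json` may be laid out differently.

```r
library(jsonlite)

# Hypothetical inline JSON with a wrapping key (an assumption)
json_text <- '{
  "favorite_books": [
    {"Title": "Data Science for Business", "Publication_Year": "2013"},
    {"Title": "Understandable Statistics", "Publication_Year": "2018"}
  ]
}'

# Coercing the whole parsed list prefixes each column with the wrapper key
wrapped <- as.data.frame(fromJSON(json_text))
names(wrapped)

# Extracting the inner element keeps the field names from the file
direct <- fromJSON(json_text)$favorite_books
names(direct)
```

If the file does follow this layout, extracting the inner element would avoid renaming the columns by hand.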

Comparison

# HTML & XML

all.equal(df_html, df_xml)
## [1] TRUE
# HTML & JSON

all.equal(df_html, df_json)
## [1] TRUE

Since both comparisons return TRUE, the HTML data frame matches both the XML and the JSON data frames, so all three data frames have the same contents. The JSON data frame matched only after its columns were renamed, because its column names initially differed from the others. The data frames look the same, but their internal structure might still differ.

More information on these formats is given in the Findings section below.
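To check whether two data frames that print alike also share the same internal structure, column classes can be compared directly. The sketch below uses two small hypothetical frames, `df_a` and `df_b`, rather than the data frames loaded above.

```r
# Two hypothetical frames with the same printed values but different column types
df_a <- data.frame(Title = "Data Science for Business",
                   Publication_Year = "2013",
                   stringsAsFactors = FALSE)
df_b <- data.frame(Title = "Data Science for Business",
                   Publication_Year = 2013L,
                   stringsAsFactors = FALSE)

# all.equal() returns TRUE or a character vector describing the differences,
# so wrap it in isTRUE() before using it as a condition
same_before <- isTRUE(all.equal(df_a, df_b))

# Column classes make the internal difference explicit
sapply(df_a, class)
sapply(df_b, class)

# After converting the year column, the two frames match exactly
df_a$Publication_Year <- as.integer(df_a$Publication_Year)
same_after <- identical(df_a, df_b)
```

`identical()` is the stricter of the two checks, so it is a reasonable final test once column types have been aligned.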

Findings

Working with these data exchange formats leads to the following conclusions:

The native structure of HTML does not naturally map onto R objects. We can import HTML files as raw text, but this deprives us of the most useful features of these documents. XML is a more generic counterpart to HTML and a frequently used format for exchanging data on the Web. JSON, on the other hand, is more lightweight thanks to its less verbose syntax, and it allows only a limited set of data types that are compatible with many programming languages.

Source: Munzert, S., Rubba, C., Meißner, P., & Nyhuis, D. (2015). Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Chichester: Wiley.
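The points about verbosity and data types can be illustrated with a single record written by hand in both formats. The record and its `In_Print` field are hypothetical and are not taken from the book files.

```r
library(jsonlite)

# One hypothetical record, serialized by hand in both formats
json_rec <- '{"Title":"Data Science for Business","Publication_Year":2013,"In_Print":true}'
xml_rec <- paste0(
  "<book><Title>Data Science for Business</Title>",
  "<Publication_Year>2013</Publication_Year>",
  "<In_Print>true</In_Print></book>"
)

# JSON has no closing tags, so the same record takes fewer characters
nchar(json_rec)
nchar(xml_rec)

# JSON strings, numbers, and booleans arrive in R as typed values
rec <- fromJSON(json_rec)
sapply(rec, class)
```

XML element content, by contrast, is plain text unless a schema or explicit conversion says otherwise, which is why XML-derived columns are usually read as character.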
