JSON, HTML, and XML are three classic document formats. Each conforms to its own standard and suits a particular purpose, and most web technologies touch all three at some point. We will examine the same set of data in each of them.
library(magrittr)  # provides the %>% pipe used below

file = "data/topology"
url = "https://raw.githubusercontent.com/Anthogonyst/607-Acquisition/master/"

### All three files share the same basename, hence the shortcut:
### use the local copy if it exists, otherwise fall back to the URL
AppendExtension <- function(fp, ext, web) {
  paste0(fp, ext) %>%
    ifelse(file.exists(.), ., paste0(web, fp, ext))
}
### JSON
jsonData = AppendExtension(file, ".json", url) %>%
  jsonlite::read_json()

### HTML
htmlData = AppendExtension(file, ".html", url) %>%
  XML::readHTMLTable() %>%
  .[[1]]

### XML
xmlData = AppendExtension(file, ".xml", url) %>%
  XML::xmlToList()
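To see the fallback logic of the helper in isolation, here is a minimal sketch. It restates the helper in base R under the hypothetical name `ResolvePath` so it runs without the pipe; the temporary file and the `example.com` root are placeholders for illustration, not part of the data used here.

```r
# Hypothetical base-R restatement of the helper: prefer a local copy,
# otherwise fall back to the same path under a web root.
ResolvePath <- function(fp, ext, web) {
  localPath <- paste0(fp, ext)
  if (file.exists(localPath)) localPath else paste0(web, fp, ext)
}

# A temporary file stands in for a local copy of the data.
localBase <- tempfile()
file.create(paste0(localBase, ".json"))

foundLocal <- ResolvePath(localBase, ".json", "https://example.com/")
fellBack <- ResolvePath("data/missing", ".json", "https://example.com/")

print(foundLocal)  # the temporary path, since that file exists
print(fellBack)    # the URL, assuming data/missing.json is not on disk
```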
The most common form is probably HTML. It is a markup language, so software such as a browser interprets the tags in the text and renders them. Since the web runs on it, it has built-in concepts like tables and paragraphs. There’s probably a better way to do the following, but I didn’t want to nest a table inside a table, so each author gets its own column.
htmlData
##                                                  title   author/0    author/1
## 1                                  Elementary Topology     O Viro    O Ivanov
## 2                          Vector Bundles and K-Theory
## 3 Automorphisms of Surfaces after Nielsen and Thurston A J Casson S A Bleiler
##      author/2     author/3                                     website
## 1 V Kharlamov N Netsvetaev http://www.math.uu.se/~oleg/educ-texts.html
## 2                                 http://www.math.cornell.edu/~hatcher
## 3
##      author                  publisher year price
## 1
## 2 A Hatcher
## 3           Cambridge University Press 1988   $15
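One possible alternative to spreading authors across columns, sketched here as an assumption rather than the layout of the actual file, is to keep all authors in a single cell and split them after parsing. The inline markup and the semicolon separator are made up for illustration.

```r
# Hypothetical markup: authors share one cell, separated by semicolons.
page <- paste0(
  "<table>",
  "<tr><th>title</th><th>author</th></tr>",
  "<tr><td>Elementary Topology</td>",
  "<td>O Viro; O Ivanov; V Kharlamov; N Netsvetaev</td></tr>",
  "<tr><td>Vector Bundles and K-Theory</td><td>A Hatcher</td></tr>",
  "</table>"
)

# Parse the string as HTML and take the first (only) table.
doc <- XML::htmlParse(page, asText = TRUE)
flatTable <- XML::readHTMLTable(doc, stringsAsFactors = FALSE)[[1]]

# Split the joined cell back into one character vector per book.
authors <- strsplit(flatTable$author, "; ", fixed = TRUE)
print(authors[[1]])
```

The trade-off is that the separator must never appear inside a name, which is why nested structures are usually a better fit for JSON or XML.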
JSON is probably the easiest form of data to understand. There are really only two container types available, arrays (lists) and objects (dictionaries), plus simple values like strings and numbers. Nesting them makes tree structures that can store data very flexibly.
jsonData
## [[1]]
## [[1]]$title
## [1] "Elementary Topology"
##
## [[1]]$author
## [[1]]$author[[1]]
## [1] "O Viro"
##
## [[1]]$author[[2]]
## [1] "O Ivanov"
##
## [[1]]$author[[3]]
## [1] "V Kharlamov"
##
## [[1]]$author[[4]]
## [1] "N Netsvetaev"
##
##
## [[1]]$website
## [1] "http://www.math.uu.se/~oleg/educ-texts.html"
##
##
## [[2]]
## [[2]]$title
## [1] "Vector Bundles and K-Theory"
##
## [[2]]$author
## [1] "A Hatcher"
##
## [[2]]$website
## [1] "http://www.math.cornell.edu/~hatcher"
##
##
## [[3]]
## [[3]]$title
## [1] "Automorphisms of Surfaces after Nielsen and Thurston"
##
## [[3]]$author
## [[3]]$author[[1]]
## [1] "A J Casson"
##
## [[3]]$author[[2]]
## [1] "S A Bleiler"
##
##
## [[3]]$publisher
## [1] "Cambridge University Press"
##
## [[3]]$year
## [1] "1988"
##
## [[3]]$price
## [1] "$15"
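As a small sketch of how that tree is navigated in R, the snippet below parses an inline JSON string (a shortened stand-in for the file, so it runs on its own) and pulls values out by position and by name. The variable names are illustrative.

```r
# Inline JSON standing in for the file; parse_json keeps the tree as nested lists.
txt <- '[
  {"title": "Elementary Topology",
   "author": ["O Viro", "O Ivanov", "V Kharlamov", "N Netsvetaev"]},
  {"title": "Vector Bundles and K-Theory", "author": "A Hatcher"}
]'
books <- jsonlite::parse_json(txt)

# A JSON array becomes an unnamed list, a JSON object becomes a named list.
secondAuthor <- books[[1]]$author[[2]]
titles <- vapply(books, function(b) b$title, character(1))
authorCounts <- lengths(lapply(books, function(b) b$author))

print(secondAuthor)
print(titles)
print(authorCounts)
```

Note that a single author arrives as a plain string while several authors arrive as a list, so code that walks the tree has to handle both shapes.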
The third form is XML. It is basically JSON’s tree written with HTML-style tags, and it is often paired with a schema that describes the allowed structure. If you’re doing a job that suits JSON but need to distinguish attributes from child elements, this is the format for it.
xmlData
## $book
## $book$title
## [1] "Elementary Topology"
##
## $book$author
## [1] "O Viro"
##
## $book$author
## [1] "O Ivanov"
##
## $book$author
## [1] "V Kharlamov"
##
## $book$author
## [1] "N Netsvetaev"
##
## $book$website
## [1] "http://www.math.uu.se/~oleg/educ-texts.html"
##
##
## $book
## $book$title
## [1] "Vector Bundles and K-Theory"
##
## $book$author
## [1] "A Hatcher"
##
## $book$website
## [1] "http://www.math.cornell.edu/~hatcher"
##
##
## $book
## $book$title
## [1] "Automorphisms of Surfaces after Nielsen and Thurston"
##
## $book$author
## [1] "A J Casson"
##
## $book$author
## [1] "S A Bleiler"
##
## $book$publisher
## [1] "Cambridge University Press"
##
## $book$year
## [1] "1988"
##
## $book$price
## [1] "$15"
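To illustrate the attribute versus element distinction, here is a minimal sketch using the XML package. The `id` and `currency` attributes are invented for this example; the data file above stores everything as child elements.

```r
# Hypothetical markup: 'id' and 'currency' are attributes added for illustration.
txt <- '<library>
  <book id="b3">
    <title>Automorphisms of Surfaces after Nielsen and Thurston</title>
    <price currency="USD">15</price>
  </book>
</library>'

doc <- XML::xmlParse(txt, asText = TRUE)
book <- XML::xmlRoot(doc)[["book"]]

# Attributes describe a node; child elements hold its content.
bookId <- XML::xmlGetAttr(book, "id")
title <- XML::xmlValue(book[["title"]])
currency <- XML::xmlGetAttr(book[["price"]], "currency")
price <- XML::xmlValue(book[["price"]])

print(c(id = bookId, title = title, currency = currency, price = price))
```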
As much as I criticize the HTML data for the disorderly form I wrote it in, the other two can look a lot like it. If you flatten the JSON data, you’ll find that its names look strikingly similar to the HTML column names. Additionally, the XML data fits in exactly the same places. I don’t recommend doing this for any practical usage, but it makes you think.
rbind(
  unlist(jsonData),
  unlist(xmlData)
)
## title author1 author2 author3 author4
## [1,] "Elementary Topology" "O Viro" "O Ivanov" "V Kharlamov" "N Netsvetaev"
## [2,] "Elementary Topology" "O Viro" "O Ivanov" "V Kharlamov" "N Netsvetaev"
## website
## [1,] "http://www.math.uu.se/~oleg/educ-texts.html"
## [2,] "http://www.math.uu.se/~oleg/educ-texts.html"
## title author
## [1,] "Vector Bundles and K-Theory" "A Hatcher"
## [2,] "Vector Bundles and K-Theory" "A Hatcher"
## website
## [1,] "http://www.math.cornell.edu/~hatcher"
## [2,] "http://www.math.cornell.edu/~hatcher"
## title author1
## [1,] "Automorphisms of Surfaces after Nielsen and Thurston" "A J Casson"
## [2,] "Automorphisms of Surfaces after Nielsen and Thurston" "A J Casson"
## author2 publisher year price
## [1,] "S A Bleiler" "Cambridge University Press" "1988" "$15"
## [2,] "S A Bleiler" "Cambridge University Press" "1988" "$15"
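The numbered names in that output come from how `unlist` flattens nested lists: it appends a running index to the parent name for each unnamed child. A tiny sketch with a hand-built list (shortened from the first book) shows the effect.

```r
# A hand-built list shaped like one parsed JSON book.
book <- list(
  title = "Elementary Topology",
  author = list("O Viro", "O Ivanov"),
  website = "http://www.math.uu.se/~oleg/educ-texts.html"
)

# Unnamed children of 'author' get the parent name plus an index.
flat <- unlist(book)
print(names(flat))
```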
They’re pretty similar, generally speaking. There’s probably a correct way to do the HTML table, but a frontend dev is meant to write that. The XML tree is a bit stricter, so every array element needs a named tag. Hypothetically, there are ways to make cleaner tables in R, but I never figured out where a clean enumerator for that is. Personally, I think navigating a tree structure is R’s major weakness, but I digress.
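For what it is worth, one way to get a tidier table out of such a tree is to build one row per book and join repeated values into a single cell. This is only a sketch in base R under my own assumptions: the helper name `ToRow`, the `fields` vector, and the semicolon separator are hypothetical choices, and the list is a shortened hand-built copy of the data.

```r
# Hand-built stand-in for the parsed tree (two books, a subset of the fields).
books <- list(
  list(title = "Elementary Topology",
       author = list("O Viro", "O Ivanov", "V Kharlamov", "N Netsvetaev")),
  list(title = "Automorphisms of Surfaces after Nielsen and Thurston",
       author = list("A J Casson", "S A Bleiler"),
       year = "1988")
)
fields <- c("title", "author", "year")

# Hypothetical helper: one data frame row per book, NA for missing fields,
# repeated values joined with "; ".
ToRow <- function(book) {
  cells <- vapply(fields, function(f) {
    value <- book[[f]]
    if (is.null(value)) NA_character_ else paste(unlist(value), collapse = "; ")
  }, character(1))
  as.data.frame(as.list(cells), stringsAsFactors = FALSE)
}

bookTable <- do.call(rbind, lapply(books, ToRow))
print(bookTable)
```

Fixing the set of columns up front is what keeps the rows compatible for `rbind`; the cost is that any field not listed in `fields` is silently dropped.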