Abstract

JSON, HTML, and XML are three classic document formats. Each conforms to its own standard and is designed for a specific purpose. Generally speaking, most web technologies use at least one of them, and often all three. We will examine the same set of data in each format.

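library(magrittr)  # provides the %>% pipe used below

file <- "data/topology"
url <- "https://raw.githubusercontent.com/Anthogonyst/607-Acquisition/master/"

### All three files share the same basename, hence the shortcut.
### Use the local copy when it exists, otherwise fall back to the web copy.
AppendExtension <- function(fp, ext, web) {
  paste0(fp, ext) %>%
    ifelse(file.exists(.), ., paste0(web, fp, ext))
}

### JSON
jsonData <- AppendExtension(file, ".json", url) %>%
  jsonlite::read_json()

### HTML (keep the first table in the document)
### Note: the XML package may not fetch https URLs itself, so download
### the file first if the local copy is missing.
htmlData <- AppendExtension(file, ".html", url) %>%
  XML::readHTMLTable() %>%
  .[[1]]

### XML
xmlData <- AppendExtension(file, ".xml", url) %>%
  XML::xmlToList()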

HTML Data

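HTML is probably the most common of the three. It is a markup language: a browser reads the tags and decides how to render the text between them. Since the web runs on it, it has built-in concepts like tables and paragraphs. There’s probably a better way to lay out the following, but I didn’t want to nest a table inside a table, so each author of a multi-author book gets a separate column (author/0 through author/3) and any field a book lacks is left blank.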

htmlData
##                                                  title   author/0    author/1
## 1                                  Elementary Topology     O Viro    O Ivanov
## 2                          Vector Bundles and K-Theory                       
## 3 Automorphisms of Surfaces after Nielsen and Thurston A J Casson S A Bleiler
##      author/2     author/3                                     website
## 1 V Kharlamov N Netsvetaev http://www.math.uu.se/~oleg/educ-texts.html
## 2                                 http://www.math.cornell.edu/~hatcher
## 3                                                                     
##      author                  publisher year price
## 1                                                
## 2 A Hatcher                                      
## 3           Cambridge University Press 1988   $15
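
One way to avoid both the nested table and the sparse author columns is to collapse them into a single column after reading. The following is only a minimal sketch, assuming a data frame shaped like htmlData above. CollapseAuthors is a hypothetical helper, and the small books frame is rebuilt by hand from the printed output rather than read from the file.

```r
# Hand-built stand-in for htmlData, trimmed to the title and author columns.
books <- data.frame(
  title = c("Elementary Topology", "Vector Bundles and K-Theory"),
  `author/0` = c("O Viro", ""),
  `author/1` = c("O Ivanov", ""),
  author = c("", "A Hatcher"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: merge every column whose name starts with "author"
# into one semicolon-separated column, skipping empty cells.
CollapseAuthors <- function(df) {
  isAuthor <- grepl("^author", names(df))
  merged <- apply(df[, isAuthor, drop = FALSE], 1, function(row) {
    paste(row[nzchar(row)], collapse = "; ")
  })
  out <- df[, !isAuthor, drop = FALSE]
  out$authors <- unname(merged)
  out
}

tidy <- CollapseAuthors(books)
tidy
```

The trade-off is that the authors are no longer individually addressable without splitting the string again, which is fine for display but less so for analysis.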

JSON Data

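JSON is probably the easiest of the three to understand. There are really only two container types, arrays (lists) and objects (dictionaries), and everything else is a plain value such as a string or a number. Nesting them produces a tree structure, which makes the format very flexible for storing data.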

jsonData
## [[1]]
## [[1]]$title
## [1] "Elementary Topology"
## 
## [[1]]$author
## [[1]]$author[[1]]
## [1] "O Viro"
## 
## [[1]]$author[[2]]
## [1] "O Ivanov"
## 
## [[1]]$author[[3]]
## [1] "V Kharlamov"
## 
## [[1]]$author[[4]]
## [1] "N Netsvetaev"
## 
## 
## [[1]]$website
## [1] "http://www.math.uu.se/~oleg/educ-texts.html"
## 
## 
## [[2]]
## [[2]]$title
## [1] "Vector Bundles and K-Theory"
## 
## [[2]]$author
## [1] "A Hatcher"
## 
## [[2]]$website
## [1] "http://www.math.cornell.edu/~hatcher"
## 
## 
## [[3]]
## [[3]]$title
## [1] "Automorphisms of Surfaces after Nielsen and Thurston"
## 
## [[3]]$author
## [[3]]$author[[1]]
## [1] "A J Casson"
## 
## [[3]]$author[[2]]
## [1] "S A Bleiler"
## 
## 
## [[3]]$publisher
## [1] "Cambridge University Press"
## 
## [[3]]$year
## [1] "1988"
## 
## [[3]]$price
## [1] "$15"
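
Walking a tree like this in R mostly comes down to applying a function over the outer list. Here is a rough sketch, using a hand-built list that mirrors the first two records printed above so that it runs without the file. GetField is a hypothetical helper and not part of jsonlite.

```r
# Hand-built copy of the first two records, shaped like the jsonData output.
records <- list(
  list(title = "Elementary Topology",
       author = list("O Viro", "O Ivanov", "V Kharlamov", "N Netsvetaev"),
       website = "http://www.math.uu.se/~oleg/educ-texts.html"),
  list(title = "Vector Bundles and K-Theory",
       author = "A Hatcher",
       website = "http://www.math.cornell.edu/~hatcher")
)

# Hypothetical helper: pull one field from every record, joining arrays
# into a single string and returning NA when the field is absent.
GetField <- function(recs, field) {
  vapply(recs, function(rec) {
    value <- rec[[field]]
    if (is.null(value)) NA_character_ else paste(unlist(value), collapse = "; ")
  }, character(1))
}

titles  <- GetField(records, "title")
authors <- GetField(records, "author")
prices  <- GetField(records, "price")  # neither record has a price, so all NA
```

Because the author field is sometimes an array and sometimes a single string, the helper passes every value through unlist() so both cases are handled the same way.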

XML Data

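The third form is XML. It is basically a cross between HTML and JSON: tag-based markup like HTML, used to hold tree-structured data like JSON, and it is often paired with a schema that describes which elements are allowed. Basically, if you’re doing a job that fits JSON but you need to distinguish attributes from fields, XML is available.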

xmlData
## $book
## $book$title
## [1] "Elementary Topology"
## 
## $book$author
## [1] "O Viro"
## 
## $book$author
## [1] "O Ivanov"
## 
## $book$author
## [1] "V Kharlamov"
## 
## $book$author
## [1] "N Netsvetaev"
## 
## $book$website
## [1] "http://www.math.uu.se/~oleg/educ-texts.html"
## 
## 
## $book
## $book$title
## [1] "Vector Bundles and K-Theory"
## 
## $book$author
## [1] "A Hatcher"
## 
## $book$website
## [1] "http://www.math.cornell.edu/~hatcher"
## 
## 
## $book
## $book$title
## [1] "Automorphisms of Surfaces after Nielsen and Thurston"
## 
## $book$author
## [1] "A J Casson"
## 
## $book$author
## [1] "S A Bleiler"
## 
## $book$publisher
## [1] "Cambridge University Press"
## 
## $book$year
## [1] "1988"
## 
## $book$price
## [1] "$15"
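
To illustrate the attribute versus field distinction, here is a small sketch using the XML package. The fragment is an assumption: the id attribute is invented for illustration and does not appear in the topology data, which uses child elements only.

```r
# Hypothetical fragment: the id attribute is made up for this example.
xmlText <- paste0(
  '<books><book id="b1">',
  '<title>Elementary Topology</title>',
  '<author>O Viro</author><author>O Ivanov</author>',
  '</book></books>'
)

doc  <- XML::xmlParse(xmlText, asText = TRUE)
book <- XML::xmlRoot(doc)[["book"]]

# An attribute lives on the element itself.
bookId <- XML::xmlAttrs(book)[["id"]]

# A field is a child element with its own text content.
bookTitle <- XML::xmlValue(book[["title"]])

# Repeated child elements play the role of a JSON array.
bookAuthors <- XML::xpathSApply(doc, "//book/author", XML::xmlValue)
```

JSON has no equivalent of the attribute slot, so the same information would have to become just another key in the object.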

Unlisted XML/JSON Looks Like HTML

As much as I criticize the disorderly HTML table I wrote, the other two can look a lot like it. If you flatten the JSON data with unlist(), you’ll find that its names look strikingly similar to the HTML column names. Additionally, the XML data fits in exactly the same places. I don’t recommend doing this for any practical use, but it makes you think.

rbind(
  unlist(jsonData),
  unlist(xmlData)
)
##      title                 author1  author2    author3       author4       
## [1,] "Elementary Topology" "O Viro" "O Ivanov" "V Kharlamov" "N Netsvetaev"
## [2,] "Elementary Topology" "O Viro" "O Ivanov" "V Kharlamov" "N Netsvetaev"
##      website                                      
## [1,] "http://www.math.uu.se/~oleg/educ-texts.html"
## [2,] "http://www.math.uu.se/~oleg/educ-texts.html"
##      title                         author     
## [1,] "Vector Bundles and K-Theory" "A Hatcher"
## [2,] "Vector Bundles and K-Theory" "A Hatcher"
##      website                               
## [1,] "http://www.math.cornell.edu/~hatcher"
## [2,] "http://www.math.cornell.edu/~hatcher"
##      title                                                  author1     
## [1,] "Automorphisms of Surfaces after Nielsen and Thurston" "A J Casson"
## [2,] "Automorphisms of Surfaces after Nielsen and Thurston" "A J Casson"
##      author2       publisher                    year   price
## [1,] "S A Bleiler" "Cambridge University Press" "1988" "$15"
## [2,] "S A Bleiler" "Cambridge University Press" "1988" "$15"
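
The numbered names in that output come from unlist() itself: when a named element holds several unnamed values, it appends a running index to the parent name, and when it holds just one value, the name is kept as is. A tiny sketch with two hand-built records (not read from the files) shows the pattern:

```r
# One multi-author record and one single-author record, as nested lists.
multi  <- list(title = "Elementary Topology", author = list("O Viro", "O Ivanov"))
single <- list(title = "Vector Bundles and K-Theory", author = "A Hatcher")

multiNames  <- names(unlist(multi))   # "title" "author1" "author2"
singleNames <- names(unlist(single))  # "title" "author"
```

This is why the first and third books show author1, author2, and so on, while the second book shows a plain author column.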

Conclusions

The three formats are pretty similar, generally speaking. There’s probably a correct way to do the HTML table, but that is a job for a frontend dev. The XML tree is a bit more strict: every element needs a name, so an array becomes a run of repeated tags such as author. Hypothetically, there are ways to make cleaner tables in R, but I never found a clean way to enumerate the records for that. Personally, I think navigating a tree structure is R’s major weakness, but I digress.
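
For what it's worth, one possible route to a cleaner table is to normalize each record to the same set of fields and bind the rows together. This is only a base R sketch under stated assumptions: the records are typed in by hand from the output above, and RecordToRow is a hypothetical helper.

```r
# Hand-built copies of the second and third records.
records <- list(
  list(title = "Vector Bundles and K-Theory", author = "A Hatcher",
       website = "http://www.math.cornell.edu/~hatcher"),
  list(title = "Automorphisms of Surfaces after Nielsen and Thurston",
       author = list("A J Casson", "S A Bleiler"),
       publisher = "Cambridge University Press", year = "1988", price = "$15")
)

# Every field name that appears in any record, in order of first appearance.
fields <- unique(unlist(lapply(records, names)))

# Hypothetical helper: turn one record into a one-row data frame,
# joining arrays with "; " and filling absent fields with NA.
RecordToRow <- function(rec, fields) {
  cells <- lapply(fields, function(f) {
    if (is.null(rec[[f]])) NA_character_ else paste(unlist(rec[[f]]), collapse = "; ")
  })
  as.data.frame(setNames(cells, fields), stringsAsFactors = FALSE)
}

booksTable <- do.call(rbind, lapply(records, RecordToRow, fields = fields))
booksTable
```

Missing fields show up as NA instead of blank strings, which makes it easier to tell an absent value from an empty one.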