Setup environment

knitr::opts_chunk$set(echo = TRUE)
if("XML" %in% rownames(installed.packages()) == FALSE) {install.packages("XML")}
library(XML)
if("RCurl" %in% rownames(installed.packages()) == FALSE) {install.packages("RCurl")}
library(RCurl)

## Loading required package: bitops

if("rvest" %in% rownames(installed.packages()) == FALSE) {install.packages("rvest")}
library(rvest)

## Warning: package 'rvest' was built under R version 3.3.3

## Loading required package: xml2

## Warning: package 'xml2' was built under R version 3.3.3

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:XML':
## 
##     xml

if("DT" %in% rownames(installed.packages()) == FALSE) {install.packages("DT")}
library(DT)

## Warning: package 'DT' was built under R version 3.3.3

if("jsonlite" %in% rownames(installed.packages()) == FALSE) {install.packages("jsonlite")}
library(jsonlite)
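The install-then-load pattern repeated above could be folded into one small helper. The following is only a sketch: `load_or_install` is a hypothetical name, not part of any package, and it assumes a CRAN mirror is already configured for `install.packages()`.

```r
# Hypothetical helper (not from any package): install a package only if it
# is missing, then attach it.
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    # requireNamespace() returns FALSE instead of erroring when pkg is absent
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)  # assumes a CRAN mirror is already set
    }
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}
```

Calling `load_or_install(c("XML", "RCurl", "rvest", "DT", "jsonlite"))` would then stand in for the five separate checks.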

Initialise filenames

I enjoy reading and love literature, so the books chosen are some of my favourites from high school.

Each file was created by hand.

html_url <- "https://raw.githubusercontent.com/NNedd/DATA607-Submissions/master/Week%207%20Assignment/books.html"

xml_url <- "https://raw.githubusercontent.com/NNedd/DATA607-Submissions/master/Week%207%20Assignment/books.xml"

json_url <- "https://raw.githubusercontent.com/NNedd/DATA607-Submissions/master/Week%207%20Assignment/books.json"

HTML document

htmlContent <- read_html(html_url)
htmlTable <- html_table(htmlContent, fill = TRUE)
final_html <- as.data.frame(htmlTable[1])
datatable(final_html)
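To see why the first element of the result is selected, it may help to run the same steps on a small inline table. This is a sketch only: the HTML string below is a made-up stand-in for books.html, whose actual markup is not shown here.

```r
library(rvest)

# Hypothetical inline table standing in for books.html
html_src <- "<table>
  <tr><th>title</th><th>year</th></tr>
  <tr><td>A Brighter Sun</td><td>1953</td></tr>
  <tr><td>A man for all seasons</td><td>1960</td></tr>
</table>"

page <- read_html(html_src)

# Called on a whole document, html_table() returns a list with one element
# per <table> found, even when there is only one table.
tables <- html_table(page)

books <- as.data.frame(tables[[1]])
str(books)
```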

XML document

xmlFile <- getURL(xml_url)
xmlContent <- xmlParse(xmlFile)
xmlroot <- xmlRoot(xmlContent)
final_xml <- xmlToDataFrame(xmlroot)
datatable(final_xml)

# Include author names: convert each book to its own data frame, then stack them
finalList <- xmlToList(xmlContent)
final_xml2 <- as.data.frame(finalList[1])
final_xml2 <- rbind(final_xml2, as.data.frame(finalList[2]))
final_xml2 <- rbind(final_xml2, as.data.frame(finalList[3]))
datatable(final_xml2)
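An alternative way to pull out the author names is to query each field by XPath, which gives one row per book. The sketch below swaps in the xml2 package (attached by rvest) for the XML package used above, and the element names (`book`, `author`, `first`, `second`) are assumptions, since the layout of books.xml is not shown here.

```r
library(xml2)

# Hypothetical XML standing in for books.xml; element names are assumed
xml_src <- "<books>
  <book>
    <title>A World of Poetry</title>
    <author><first>Mark McWatt</first><second>Hazel Simmons McDonald</second></author>
    <year>1994</year>
  </book>
  <book>
    <title>A Brighter Sun</title>
    <author><first>Samuel Selvon</first></author>
    <year>1953</year>
  </book>
</books>"

doc <- read_xml(xml_src)
books <- xml_find_all(doc, ".//book")

# xml_find_first() yields a missing node where a book has no match, and
# xml_text() turns that into NA, so each column has one value per book.
books_df <- data.frame(
  title         = xml_text(xml_find_first(books, "./title")),
  author_first  = xml_text(xml_find_first(books, "./author/first")),
  author_second = xml_text(xml_find_first(books, "./author/second")),
  year          = as.integer(xml_text(xml_find_first(books, "./year"))),
  stringsAsFactors = FALSE
)
books_df
```

Because a missing second author becomes `NA` rather than an extra row, the result keeps the same shape as the HTML table.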

JSON document

jsonFile <- getURL(json_url)
jsonContent <- fromJSON(jsonFile)
json_table <- jsonContent$`Literature Books`
json_table

##                   title  author.first          author.second year
## 1     A World of Poetry   Mark McWatt Hazel Simmons McDonald 1994
## 2        A Brighter Sun Samuel Selvon                   <NA> 1953
## 3 A man for all seasons   Robert Bolt                   <NA> 1960
##   goodreads rating No of Pages
## 1             4.13         196
## 2             4.04         240
## 3             3.89         192

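# Include author names: bind the nested author columns onto the table,
# then drop the original nested author column (column 2)
final_json <- cbind(json_table, json_table$author)
final_json <- final_json[-2]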
datatable(final_json)
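The nested `author` column can also be expanded with `jsonlite::flatten()`, which avoids dropping a column by position. This is a sketch under assumptions: the JSON string below is a cut-down stand-in for books.json, with its structure inferred from the printed `author.first` and `author.second` columns.

```r
library(jsonlite)

# Hypothetical, shortened stand-in for books.json
json_src <- '{"Literature Books": [
  {"title": "A World of Poetry",
   "author": {"first": "Mark McWatt", "second": "Hazel Simmons McDonald"},
   "year": 1994},
  {"title": "A Brighter Sun",
   "author": {"first": "Samuel Selvon"},
   "year": 1953}
]}'

books <- fromJSON(json_src)$`Literature Books`

# The author column is itself a data frame; flatten() promotes its columns
# to top-level ones named author.first and author.second.
flat <- jsonlite::flatten(books)
names(flat)
```

The call is written as `jsonlite::flatten()` because other packages, such as purrr, export a function with the same name.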

Conclusion

Each document type has its own “personality”. HTML files are simple, but the table feature allows only for simple data organisation. The XML and JSON formats allow for more complex data organisation, such as tagging the first and second authors.

The dataframes are also different. The HTML dataframe displayed clearly with no complex manipulation necessary. The XML and JSON dataframes both required extra manipulation to display the names of the authors. After that manipulation, the JSON dataframe looked more like the HTML one, whereas the XML dataframe was displayed differently, with repeated rows for each author.

Working with XML and JSON in R

N Nedd

March 19, 2017