Intro

For this assignment, we were tasked with creating HTML, XML, and JSON files describing 3 of our favourite books on one of our favourite topics. At least one of the books must have more than one author. Each of the file structures is then loaded into an R data frame. This is a primer for further work with these structures later in the semester.

Load Libraries

library("tidyverse")
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library("rvest")
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library("XML")
## 
## Attaching package: 'XML'
## The following object is masked from 'package:rvest':
## 
##     xml
library("methods")
library("jsonlite")
## 
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
## 
##     flatten
library("stringr")
library("DT")
library ("RCurl")
## Loading required package: bitops
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete

Preview HTML File

(Screenshot of the HTML file)

Load HTML Data & Create Data Frame

# load HTML data into data frame
url <- "https://raw.githubusercontent.com/albert-gilharry/DATA607-Assignment-5/master/data/books.html"
htmlBooks <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table()

htmlBooks <- htmlBooks[[1]]

View HTML Data Frame

datatable(htmlBooks, options = list(filter = FALSE), filter = "none")

Preview XML File

(Screenshot of the XML file)

Load XML Data & Create Data Frame

The books with multiple authors posed a problem because the built-in functionality to convert XML to a data frame concatenates the author nodes without a delimiter. For this reason, I looped through the data to format the authors' names properly. This may not be the most efficient way of doing so, but this is a very small data set, so it is fine.

url <- getURL("https://raw.githubusercontent.com/albert-gilharry/DATA607-Assignment-5/master/data/books.xml")
doc <- xmlParse(url)

# extract the AUTHORS node of each book and build a comma-separated string per book
data <- xpathSApply(doc, "//BOOKS/BOOK/AUTHORS", xmlChildren, simplify = TRUE)
authors <- c()
for (i in 1:length(data)) {
  bookAuthors <- c()
  for (j in 1:length(data[[i]])) {
    bookAuthors <- append(bookAuthors, xmlValue(data[[i]][[j]]))
  }
  authors <- append(authors, paste(unlist(bookAuthors), collapse = ", "))
}

# use the built in function to create the data frame
xmlBooks <- xmlToDataFrame(url, stringsAsFactors = FALSE)

# fix the authors
xmlBooks$AUTHORS <- authors
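
An alternative is to collapse each book's authors in a single XPath pass. This is a minimal sketch, untested against the live file, that should produce the same comma-separated strings as the loop above:

# sketch: apply a function to each AUTHORS node instead of looping explicitly
authorsAlt <- xpathSApply(doc, "//BOOKS/BOOK/AUTHORS", function(node) {
  paste(xmlSApply(node, xmlValue), collapse = ", ")
})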

View XML Data Table

datatable(xmlBooks, options = list(filter = FALSE), filter = "none")

Preview JSON File

(Screenshot of the JSON file)

Load JSON Data & Create Data Frame

The books with multiple authors posed a problem again because the built-in functionality to create a data frame from JSON data stores the authors as a nested list. I looped through the data to format the authors as comma-separated strings. This may not be the most efficient way of doing so, but this is a very small data set, so it is fine.

# load JSON data into data frame
url <- getURL("https://raw.githubusercontent.com/albert-gilharry/DATA607-Assignment-5/master/data/books.json")
jsonBooks <- fromJSON(url)
authors <- c()
jsonBooks <- jsonBooks$books

# create a comma separated list for authors of each book
for(i in 1:nrow(jsonBooks)){
 authors <- append(authors, paste(unlist( jsonBooks$author[i] ),collapse = ", "))
}

# update authors
jsonBooks$author <- authors
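
The loop could also be replaced with a vectorised call. This is a minimal sketch, assuming jsonBooks$author is the list column of character vectors returned by fromJSON():

# sketch: collapse each author list into a comma-separated string (equivalent to the loop)
jsonBooks$author <- sapply(jsonBooks$author, paste, collapse = ", ")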

View JSON Data Table

datatable(jsonBooks, options = list(filter = FALSE), filter = "none")

Conclusion

In conclusion, R packages make it relatively easy to parse and load HTML, XML, and JSON data into data frames. The resulting data frames were not initially identical because of how the packages handle one-to-many relationships in XML and JSON. The processing described above eventually led to all three data frames being identical.
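
A quick sanity check, assuming the three frames ended up with matching column names and ordering after the clean-up (rename the columns first if the HTML/XML headers differ from the JSON keys):

# sketch: compare the data frames, ignoring attributes such as row names
all.equal(htmlBooks, xmlBooks, check.attributes = FALSE)
all.equal(htmlBooks, jsonBooks, check.attributes = FALSE)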