DS607-HW7

Data 607 Homework 7

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

HTML Data

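library(XML)    # readHTMLTable(), XML parsing helpers
library(RCurl)  # getURL()

# Download the page and parse its tables; the books are in the first table
html_source <- readHTMLTable(getURL("https://raw.githubusercontent.com/georg4re/DS607/master/data/books.html"))
html_source <- lapply(html_source[[1]], function(x) {unlist(x)})
html_data <- as.data.frame(html_source)
head(html_data)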

##                                  Title                    Author        Country
## 1           The Fellowship of the Ring            J.R.R. Tolkien United Kingdom
## 2 The Hitchhiker's Guide to the Galaxy             Douglas Adams United Kingdom
## 3                             Watchmen Allan Moore, Dave Gibbons            USA
##                         Genre     Publisher Publication.Date
## 1          Fantasy, Adventure Allen & Unwin       07/29/1954
## 2       Comic Science Fiction     Pan Books       10/12/1979
## 3 Comic Book, Science Fiction     DC Comics       09/01/1986
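The `Publication.Date` name in the output above contains a dot that the other formats do not have. One likely explanation, assuming the header cell in books.html reads "Publication Date" (the file itself is not reproduced here), is that `as.data.frame()` sanitises column names with `make.names()`, which replaces characters that are not valid in R names with dots. A minimal sketch with a made-up one-row list:

```r
# Hypothetical stand-in for the parsed HTML table;
# the header text "Publication Date" is an assumption.
parsed <- list(Title = "Watchmen", `Publication Date` = "09/01/1986")

# as.data.frame() checks and repairs names by default (check.names = TRUE)
df <- as.data.frame(parsed)
names(df)

# The same rule applied directly to the header text
make.names("Publication Date")
```

If the header really does contain a space, this would account for the dot without any change to the HTML file.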

XML Data

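# Requires the XML and RCurl packages
xml_source <- xmlInternalTreeParse(getURL("https://raw.githubusercontent.com/georg4re/DS607/master/data/books.xml"))
# For each child of the root (one per book), collect the text of its fields
xml_source <- xmlSApply(xmlRoot(xml_source), function(x) xmlSApply(x, xmlValue))
# The result has one column per book, so transpose it to get one row per book
xml_source <- t(xml_source)
xml_data <- data.frame(xml_source, row.names = NULL)
head(xml_data)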

##                                  title                    author        country
## 1           The Fellowship of the Ring            J.R.R. Tolkien United Kingdom
## 2 The Hitchhiker's Guide to the Galaxy             Douglas Adams United Kingdom
## 3                             Watchmen Allan Moore, Dave Gibbons            USA
##                         genre     publisher publication_date
## 1          Fantasy, Adventure Allen & Unwin       07/29/1954
## 2       Comic Science Fiction     Pan Books       10/12/1979
## 3 Comic Book, Science Fiction     DC Comics       09/01/1986
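The transpose step in the XML code is easier to see on a small example. `xmlSApply()` is a wrapper around `sapply()`, so the sketch below uses base `sapply()` on made-up records (the titles and names are placeholders, not the real data) to show why the fields first come out as rows:

```r
# Two made-up records standing in for the parsed XML nodes
books <- list(
  list(title = "Book A", author = "Author One", country = "Country X"),
  list(title = "Book B", author = "Author Two", country = "Country Y")
)

# sapply() simplifies to a matrix with one row per field, one column per record
m <- sapply(books, function(b) unlist(b))
dim(m)
rownames(m)

# Transposing gives the usual layout: one row per record, one column per field
flipped <- data.frame(t(m), row.names = NULL)
names(flipped)
nrow(flipped)
```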

JSON Data

library(rjson)  # fromJSON(file = ...)

json_source <- fromJSON(file = "https://raw.githubusercontent.com/georg4re/DS607/master/data/books.json")
# Flatten each book into a named character vector
json_source <- lapply(json_source, unlist)
# Stack the vectors as rows, one per book
json_data <- as.data.frame(do.call("rbind", json_source))
head(json_data)

##                                  title                    author        country
## 1           The Fellowship of the Ring            J.R.R. Tolkien United Kingdom
## 2 The Hitchhiker's Guide to the Galaxy             Douglas Adams United Kingdom
## 3                             Watchmen Allan Moore, Dave Gibbons            USA
##                         genre     publisher publication_date
## 1          Fantasy, Adventure Allen & Unwin       07/29/1954
## 2       Comic Science Fiction     Pan Books       10/12/1979
## 3 Comic Book, Science Fiction     DC Comics       09/01/1986
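Judging by the output, books.json stores the two Watchmen authors as a single string, which keeps every record the same length for `rbind`. If the authors were stored as a JSON array instead (a hypothetical variation, not what the file above appears to contain), `unlist()` would split them into separate `author1` and `author2` entries and the rows would no longer line up. A minimal sketch of one way to handle that, using made-up records and a hypothetical helper named `flatten`:

```r
# Made-up records as nested lists; the second has two authors
books <- list(
  list(title = "Book A", author = "Author One"),
  list(title = "Book B", author = c("Author Two", "Author Three"))
)

# Collapse any multi-valued field to one comma-separated string
flatten <- function(book) {
  sapply(book, function(field) paste(unlist(field), collapse = ", "))
}

rows <- lapply(books, flatten)
json_like <- as.data.frame(do.call("rbind", rows))
json_like$author
```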

Comparing the Data

library(arsenal)  # comparedf()

comparedf(json_data, xml_data)

## Compare Object
## 
## Function Call: 
## comparedf(x = json_data, y = xml_data)
## 
## Shared: 6 non-by variables and 3 observations.
## Not shared: 0 variables and 0 observations.
## 
## Differences found in 0/6 variables compared.
## 0 variables compared have non-identical attributes.

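When comparing the data frames, we see that the variable names and values are the same between JSON and XML, because we gave the fields the same lowercase names in both files.

The HTML variable names are different because we used title case for the column headers. comparedf() matches variables by name, and the match is case-sensitive by default, so it finds no shared variables and reports all 12 (6 from each data frame) as not shared. The three data frames are therefore not identical as loaded, even though the printed values are the same.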

comparedf(html_data, json_data)

## Compare Object
## 
## Function Call: 
## comparedf(x = html_data, y = json_data)
## 
## Shared: 0 non-by variables and 3 observations.
## Not shared: 12 variables and 0 observations.
## 
## Differences found in 0/0 variables compared.
## 0 variables compared have non-identical attributes.
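Since the HTML data frame differs only in its column names, bringing the names into the same style should make the frames comparable. The sketch below uses small stand-ins whose values are copied from the printed output above; `normalise_names` is a hypothetical helper, not part of any package:

```r
# Stand-ins for html_data and json_data with two of the six columns
html_like <- data.frame(
  Title = c("Watchmen", "The Fellowship of the Ring"),
  Publication.Date = c("09/01/1986", "07/29/1954"),
  stringsAsFactors = FALSE
)
json_like <- data.frame(
  title = c("Watchmen", "The Fellowship of the Ring"),
  publication_date = c("09/01/1986", "07/29/1954"),
  stringsAsFactors = FALSE
)

# Lower-case the names and replace dots with underscores
normalise_names <- function(df) {
  names(df) <- gsub(".", "_", tolower(names(df)), fixed = TRUE)
  df
}

html_fixed <- normalise_names(html_like)
names(html_fixed)
identical(html_fixed, json_like)
```

With the real data, the equivalent call would be `comparedf(normalise_names(html_data), json_data)`. Column types may still differ between the loaders (for example factor versus character, depending on the R version), so a remaining difference would not necessarily mean the values disagree. arsenal also documents a `tol.vars` option in `comparedf.control()` for tolerant name matching, which may be an alternative to renaming.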