Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
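library(RCurl) # getURL
library(XML)   # readHTMLTable, xmlInternalTreeParse
# Download the HTML page and parse its tables into a list of data frames
html_source <- readHTMLTable(getURL("https://raw.githubusercontent.com/georg4re/DS607/master/data/books.html"))
# Take the first table and flatten each column to a plain vector
html_source <- lapply(html_source[[1]], function(x) {unlist(x)})
html_data <- as.data.frame(html_source)
head(html_data)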
##                                  Title                    Author        Country
## 1           The Fellowship of the Ring            J.R.R. Tolkien United Kingdom
## 2 The Hitchhiker's Guide to the Galaxy             Douglas Adams United Kingdom
## 3                             Watchmen Allan Moore, Dave Gibbons            USA
##                         Genre     Publisher Publication.Date
## 1          Fantasy, Adventure Allen & Unwin       07/29/1954
## 2       Comic Science Fiction     Pan Books       10/12/1979
## 3 Comic Book, Science Fiction     DC Comics       09/01/1986
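# Download the XML file and parse it into an internal document tree
xml_source <- xmlInternalTreeParse(getURL("https://raw.githubusercontent.com/georg4re/DS607/master/data/books.xml"))
# For each book node, extract the text of every child element (fields in rows, books in columns)
xml_source <- xmlSApply(xmlRoot(xml_source), function(x) xmlSApply(x, xmlValue))
# Transpose so that each book is a row and each field is a column
xml_source <- t(xml_source)
xml_data <- data.frame(xml_source, row.names = NULL)
head(xml_data)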
##                                  title                    author        country
## 1           The Fellowship of the Ring            J.R.R. Tolkien United Kingdom
## 2 The Hitchhiker's Guide to the Galaxy             Douglas Adams United Kingdom
## 3                             Watchmen Allan Moore, Dave Gibbons            USA
##                         genre     publisher publication_date
## 1          Fantasy, Adventure Allen & Unwin       07/29/1954
## 2       Comic Science Fiction     Pan Books       10/12/1979
## 3 Comic Book, Science Fiction     DC Comics       09/01/1986
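library(rjson) # fromJSON(file = ...)
# Read the JSON file into a list with one element per book
json_source <- fromJSON(file = "https://raw.githubusercontent.com/georg4re/DS607/master/data/books.json")
# Flatten each book record to a named character vector
json_source <- lapply(json_source, function(x) {
  unlist(x)
})
# Stack the records row-wise into a data frame
json_data <- as.data.frame(do.call("rbind", json_source))
head(json_data)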
##                                  title                    author        country
## 1           The Fellowship of the Ring            J.R.R. Tolkien United Kingdom
## 2 The Hitchhiker's Guide to the Galaxy             Douglas Adams United Kingdom
## 3                             Watchmen Allan Moore, Dave Gibbons            USA
##                         genre     publisher publication_date
## 1          Fantasy, Adventure Allen & Unwin       07/29/1954
## 2       Comic Science Fiction     Pan Books       10/12/1979
## 3 Comic Book, Science Fiction     DC Comics       09/01/1986
library(arsenal) # comparedf
comparedf(json_data, xml_data)
## Compare Object
##
## Function Call:
## comparedf(x = json_data, y = xml_data)
##
## Shared: 6 non-by variables and 3 observations.
## Not shared: 0 variables and 0 observations.
##
## Differences found in 0/6 variables compared.
## 0 variables compared have non-identical attributes.
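When comparing the data frames, we see that the variable names and values are the same between JSON and XML, because both files use the same lowercase field names.
The HTML variable names are different because the table headers are in title case (for example, Title and Publication.Date instead of title and publication_date), so the comparison below finds no shared variables: it reports 12 unshared variables, 6 from each data frame.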
comparedf(html_data, json_data)
## Compare Object
##
## Function Call:
## comparedf(x = html_data, y = json_data)
##
## Shared: 0 non-by variables and 3 observations.
## Not shared: 12 variables and 0 observations.
##
## Differences found in 0/0 variables compared.
## 0 variables compared have non-identical attributes.
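Since the HTML data frame appears to differ from the other two only in its column names, one possible follow-up is to normalize the names before comparing. The sketch below is an assumption rather than part of the original analysis: `normalize_names` is a hypothetical helper, and the two one-row data frames are small stand-ins for `html_data` and `json_data` so that the example runs without downloading anything.

```r
# Hypothetical one-row stand-ins for html_data and json_data (not the real downloads)
html_data <- data.frame(Title = "Watchmen", Publication.Date = "09/01/1986",
                        stringsAsFactors = FALSE)
json_data <- data.frame(title = "Watchmen", publication_date = "09/01/1986",
                        stringsAsFactors = FALSE)

# Hypothetical helper: lowercase the column names and replace dots with underscores
normalize_names <- function(df) {
  names(df) <- gsub(".", "_", tolower(names(df)), fixed = TRUE)
  df
}

html_norm <- normalize_names(html_data)

# After normalization the column names line up, so base R can compare the contents
identical(names(html_norm), names(json_data)) # TRUE
all.equal(html_norm, json_data)               # TRUE
```

With the real data frames, the same normalized result could be passed to `comparedf(html_norm, json_data)`. Whether that comparison reports zero differences depends on the column types each loader produced, which this sketch does not check.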