For this assignment, we have to pick three books of our choice, create a small data frame with attributes about them, and then try to load it into R from JSON, XML, and HTML files. I picked Behave by Robert Sapolsky, Growth by Vaclav Smil, and A Thousand Brains by Jeff Hawkins and Richard Dawkins. All three are amazing books that I could talk about for hours. I created the files by hand on my laptop and then uploaded them to GitHub for online access.
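library(rvest)
# Read the raw HTML file from GitHub
html = read_html("https://raw.githubusercontent.com/lucasweyrich958/DATA607/main/books.html")
# Collect all <table> nodes on the page
table_nodes = html_nodes(html, "table")
# Convert the first table node into a data frame
html_table = html_table(table_nodes[[1]], fill = TRUE)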
This code loads the HTML file from my GitHub repository using the rvest package, extracts the table nodes into a new variable, and finally builds the data frame (a tibble) from the first of those nodes.
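The same steps can be tried offline on an inline HTML string, which is handy for testing without a network connection. This is a minimal sketch, assuming a table shaped like the one in books.html; the markup below is made up for illustration and is not copied from the actual file. It uses html_element(), which returns the first matching node directly, so no indexing with [[1]] is needed.

```r
library(rvest)

# Hypothetical stand-in for books.html; the real file's markup may differ.
html_src <- "<html><body><table>
  <tr><th>Title</th><th>Authors</th><th>Pages</th><th>Publisher</th></tr>
  <tr><td>Behave</td><td>Robert Sapolsky</td><td>800</td><td>Penguin Press</td></tr>
  <tr><td>Growth</td><td>Vaclav Smil</td><td>655</td><td>MIT Press</td></tr>
</table></body></html>"

# read_html() accepts literal HTML as well as a URL or file path
page <- read_html(html_src)

# Select the first <table> node and convert it to a data frame
first_table <- html_element(page, "table")
books_html <- html_table(first_table)

print(books_html)
```

The header row is picked up from the th cells, so no extra work is needed to get proper column names.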
library(XML)
github = "https://raw.githubusercontent.com/lucasweyrich958/DATA607/main/books.xml"
# Download the file first, then parse the local copy
download.file(github, destfile = "books.xml", mode = "wb")
xml = xmlParse("books.xml", useInternalNodes = TRUE)
# Flatten the parsed document into a data frame
xml = xmlToDataFrame(xml, stringsAsFactors = FALSE)
Importing the XML file locally was no problem, but once I uploaded it to GitHub, the XML package had trouble loading it directly from the URL (most likely because xmlParse does not fetch HTTPS URLs itself), so I download the XML file into the working directory first and then import it from there. Make sure to delete this file from your working directory later. :) I also tried to get the first row to be the column header, but xmlToDataFrame does not seem to offer an option for that.
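Both issues can be worked around with a few lines of base R. This is a minimal sketch, not a feature of the XML package: it writes the file to a temporary location so nothing is left behind in the working directory, and it promotes the first row to column names by hand. The XML string is a hypothetical stand-in for books.xml, with element names guessed from the column1 to column4 names in the imported data frame.

```r
library(XML)

# Hypothetical stand-in for books.xml; the real file's structure is assumed.
xml_src <- "<books>
  <row><column1>Title</column1><column2>Authors</column2><column3>Pages</column3><column4>Publisher</column4></row>
  <row><column1>Behave</column1><column2>Robert Sapolsky</column2><column3>800</column3><column4>Penguin Press</column4></row>
</books>"

# Simulate the downloaded file in a temporary location.
# With the real file this would be:
#   download.file(github, destfile = tmp, mode = "wb")
tmp <- tempfile(fileext = ".xml")
writeLines(xml_src, tmp)

doc <- xmlParse(tmp)
raw_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
unlink(tmp)  # the parsed document lives in memory, so the file can go

# Promote the first row to column names, then drop it from the data
names(raw_df) <- as.character(unlist(raw_df[1, ]))
books_xml <- raw_df[-1, , drop = FALSE]
rownames(books_xml) <- NULL

# xmlToDataFrame returns character columns here, so convert Pages by hand
books_xml$Pages <- as.integer(books_xml$Pages)

print(books_xml)
```

A cleaner long-term fix would be to restructure the XML file so that each book element uses the attribute names as tags, which removes the need for a header row altogether.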
library(jsonlite)
json = fromJSON("https://raw.githubusercontent.com/lucasweyrich958/DATA607/main/books.json")
The JSON file can be loaded into R with the jsonlite package, and this was the easiest of the three, both to import and to create. All three final data frames are printed below.
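Part of why JSON is so easy is that fromJSON() simplifies an array of objects into a data frame by default. This is a minimal offline sketch, assuming books.json is laid out as an array with one object per book; the string below is a hypothetical stand-in for the actual file.

```r
library(jsonlite)

# Hypothetical stand-in for books.json: an array with one object per book.
json_src <- '[
  {"Title": "Behave", "Authors": "Robert Sapolsky", "Pages": 800, "Publisher": "Penguin Press"},
  {"Title": "Growth", "Authors": "Vaclav Smil", "Pages": 688, "Publisher": "MIT Press"}
]'

# fromJSON() accepts a JSON string as well as a URL or file path
books_json <- fromJSON(json_src)

# Keys become column names and numbers stay numeric
str(books_json)
```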
html_table
## # A tibble: 3 × 4
## Title Authors Pages Publisher
## <chr> <chr> <int> <chr>
## 1 Behave Robert Sapolsky 800 Penguin Press
## 2 Growth Vaclav Smil 655 MIT Press
## 3 A Thousand Brains Jeff Hawkins, Richard Dawkins 288 Basic Books
xml
## column1 column2 column3 column4
## 1 Title Authors Pages Publisher
## 2 Behave Robert Sapolsky 800 Penguin Press
## 3 Growth Vaclav Smil 655 MIT Press
## 4 A Thousand Brains Jeff Hawkins, Richard Dawkins 288 Basic Books
json
## Title Authors Pages Publisher
## 1 Behave Robert Sapolsky 800 Penguin Press
## 2 Growth Vaclav Smil 688 MIT Press
## 3 A Thousand Brains Jeff Hawkins, Richard Dawkins 255 Basic Books
As can be seen, all three file types were successfully imported, with a few extra steps required for the HTML and XML files. The header in the XML file does not want to cooperate: it is imported as the first data row instead of as column names. One caveat is that the page counts for Growth and A Thousand Brains differ between the JSON output (688 and 255) and the HTML and XML outputs (655 and 288), so the hand-written files are not fully consistent with each other. I would also like to mention that the XML file, and likely the other file types too, are simple examples. At work I once wanted to work with a large XML file with several hundred columns and rows, and it did not cooperate as nicely as here.
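Since the three files were typed by hand, a quick consistency check between the imported frames can catch transcription slips. This is a minimal sketch that uses the Pages values copied from the printed outputs above rather than the live files; the object names are made up for illustration.

```r
# Pages values copied by hand from the three printed outputs
pages <- data.frame(
  title = c("Behave", "Growth", "A Thousand Brains"),
  html  = c(800L, 655L, 288L),
  xml   = c(800L, 655L, 288L),
  json  = c(800L, 688L, 255L),
  stringsAsFactors = FALSE
)

# A row is consistent only if all three sources agree
pages$consistent <- pages$html == pages$xml & pages$xml == pages$json

print(pages)

# Titles whose page counts differ between sources
mismatched <- pages$title[!pages$consistent]
print(mismatched)
```

With the real data frames, the same comparison could be run on the Pages columns directly once the XML header has been fixed and its Pages column converted to integer.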