For this assignment, we were asked to store information about three books, at least one of which has more than one author, in an XML file, an HTML table, and a JSON file. I have chosen the following three books.
| Title | Author1 | Author2 | Genre | ISBN-10 | Language | Publisher |
|---|---|---|---|---|---|---|
| Cosmic Queries: StarTalk’s Guide to Who We Are, How We Got Here, and Where We’re Going | Neil DeGrasse Tyson | James Trefil | Philosophy Metaphysics | 1426221770 | English | National Geographic |
| How to Avoid a Climate Disaster: The Solutions We Have and the Breakthroughs We Need | Bill Gates | | Environmental Policy | 385546130 | English | Knopf |
| The Code Breaker: Jennifer Doudna, Gene Editing, and the Future of the Human Race | Walter Isaacson | | Scientist Biographies | 1982115858 | English | Simon and Schuster |
I have uploaded the XML version of the data to GitHub. I will now fetch the file with getURL and parse it with xmlParse from the XML package, which was loaded earlier.
xml.url <- "https://raw.githubusercontent.com/bharbans/DATA607_HW/main/Week%207/books.xml"
xml.raw <- xmlParse(getURL(xml.url))
I will now use the xmlToDataFrame function to convert the parsed document into an R data frame, casting the ISBN10 column to integer.
books.df.fromXML <- xmlToDataFrame(xml.raw) %>%
  mutate(ISBN10 = as.integer(ISBN10))
datatable(books.df.fromXML)
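To illustrate what xmlToDataFrame does with a record that has no second author, here is a minimal sketch that parses an inline XML string instead of the hosted file. The element names and book values are hypothetical stand-ins, since the layout of books.xml is not shown here. My understanding is that a record lacking a child element gets NA in that column, which would explain why no extra NA step was needed for the XML version.

```r
library(XML)

# Hypothetical stand-in for books.xml; the real file's element names may differ.
xml_text <- paste0(
  "<books>",
  "<book><Title>Hypothetical Book A</Title><Author1>First Author</Author1>",
  "<Author2>Second Author</Author2><ISBN10>1426221770</ISBN10></book>",
  "<book><Title>Hypothetical Book B</Title><Author1>Solo Author</Author1>",
  "<ISBN10>1982115858</ISBN10></book>",
  "</books>"
)

# asText = TRUE tells xmlParse that the argument is XML content, not a file name.
doc <- xmlParse(xml_text, asText = TRUE)

# Each <book> node becomes one row; child elements become columns.
demo_xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Values arrive as character, so the ISBN column is cast explicitly.
demo_xml$ISBN10 <- as.integer(demo_xml$ISBN10)
print(demo_xml)
```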
I have uploaded the HTML table version of the data to GitHub.
html.url <- "https://raw.githubusercontent.com/bharbans/DATA607_HW/main/Week%207/books.html"
html.raw <- read_html(html.url)
I will now use the html_table function from the rvest library to import the table into a data frame. Since my file has only one table, I can safely take the first element of the list returned by html_table. I have also used the mutate_all function to replace all empty strings with NA.
allTablesFromHTML <- html.raw %>% html_table(fill = TRUE, header = TRUE, trim = TRUE)
books.df.fromHTML <- allTablesFromHTML[[1]] %>%
  mutate_all(list(~na_if(., "")))
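The same steps can be tried on a small inline table. This is only a sketch under assumptions: the HTML string and its values are hypothetical, and the empty-string cleanup is written in base R, limited to character columns, rather than with mutate_all.

```r
library(rvest)

# Hypothetical stand-in for books.html: one table, one empty Author2 cell.
html_text_demo <- "<table>
  <tr><th>Title</th><th>Author2</th><th>ISBN10</th></tr>
  <tr><td>Hypothetical Book A</td><td>Second Author</td><td>1426221770</td></tr>
  <tr><td>Hypothetical Book B</td><td></td><td>1982115858</td></tr>
</table>"

page <- read_html(html_text_demo)

# html_table() on a whole document returns a list with one entry per <table>.
tables <- html_table(page, header = TRUE)
demo_html <- tables[[1]]

# Replace empty strings with NA, touching character columns only.
for (col in names(demo_html)) {
  if (is.character(demo_html[[col]])) {
    empty <- which(demo_html[[col]] == "")
    demo_html[[col]][empty] <- NA_character_
  }
}
print(demo_html)
```

Restricting the replacement to character columns avoids comparing the numeric ISBN column with a string, which is the column html_table converts on its own.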
I have uploaded the JSON version of the data to GitHub. In my file, the information about the three books is stored in a JSON array. I will use the fromJSON function from the jsonlite package to parse the file. I have again used the mutate_all function to replace all empty strings with NA.
json.url <- "https://raw.githubusercontent.com/bharbans/DATA607_HW/main/Week%207/books.json"
json.raw <- jsonlite::fromJSON(getURL(json.url))
books.df.fromJSON <- json.raw$books %>%
  mutate_all(list(~na_if(., ""))) %>%
  mutate(ISBN10 = as.integer(ISBN10))
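As a small illustration of why json.raw$books is already a data frame, the sketch below parses an inline JSON string with the same overall shape: an object holding a "books" array of records. The field values are hypothetical, and I am assuming here that the ISBN is stored as a string. It also shows one side effect of the integer cast worth keeping in mind: a leading zero is dropped.

```r
library(jsonlite)

# Hypothetical stand-in for books.json; ISBN10 is assumed to be stored as a string.
json_text <- '{"books": [
  {"Title": "Hypothetical Book A", "Author2": "Second Author", "ISBN10": "1426221770"},
  {"Title": "Hypothetical Book B", "Author2": "", "ISBN10": "0385546130"}
]}'

# fromJSON simplifies an array of records with matching keys into a data frame.
parsed <- fromJSON(json_text)
demo_json <- parsed$books

# Replace the empty author with NA so it matches the other formats.
demo_json$Author2[which(demo_json$Author2 == "")] <- NA_character_

# Casting to integer drops the leading zero of the second ISBN.
demo_json$ISBN10_int <- as.integer(demo_json$ISBN10)
print(demo_json)
```

If the leading zero matters, keeping ISBN10 as character in all three data frames would be the safer choice.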
There are many ways to read data into R; this assignment covers JSON files, XML files, and HTML tables. I will use the all_equal function to show that the three data frames are in fact equal. Bear in mind that I introduced NAs for the JSON and HTML formats where the information was empty, and that I cast the ISBN10 column to integer for the XML and JSON formats; the HTML parser did this on its own.
all_equal(books.df.fromHTML,books.df.fromJSON)
## [1] TRUE
all_equal(books.df.fromHTML,books.df.fromXML)
## [1] TRUE
all_equal(books.df.fromJSON,books.df.fromXML)
## [1] TRUE
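To show why the NA step matters for these comparisons, here is a minimal base R sketch with two hypothetical data frames that differ only in how a missing author is encoded. It uses base all.equal rather than dplyr's all_equal, so unlike all_equal it is sensitive to row and column order.

```r
# Two hypothetical frames: one encodes the missing author as NA, the other as "".
with_na <- data.frame(
  Title = c("Hypothetical Book A", "Hypothetical Book B"),
  Author2 = c("Second Author", NA),
  stringsAsFactors = FALSE
)
with_empty <- data.frame(
  Title = c("Hypothetical Book A", "Hypothetical Book B"),
  Author2 = c("Second Author", ""),
  stringsAsFactors = FALSE
)

# all.equal returns TRUE or a description of the differences, hence isTRUE().
same_before <- isTRUE(all.equal(with_na, with_empty))

# Harmonise the encoding, then compare again.
with_empty$Author2[which(with_empty$Author2 == "")] <- NA_character_
same_after <- isTRUE(all.equal(with_na, with_empty))

print(c(before = same_before, after = same_after))
```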