Introduction

For this assignment we have been asked to store information about three books, at least one of which has more than one author , into an XML file, an HTML table and a JSON file. For this assignment I have chosen the following three books.

Title Author1 Author2 Genre ISBN-10 Language Publisher
Cosmic Queries: StarTalk’s Guide to Who We Are, How We Got Here, and Where We’re Going Neil DeGrasse Tyson James Trefil Philosophy Metaphysics 1426221770 English National Geographic 
How to Avoid a Climate Disaster: The Solutions We Have and the Breakthroughs We Need Bill Gates Environmental Policy 385546130 English Knopf
The Code Breaker: Jennifer Doudna, Gene Editing, and the Future of the Human Race Walter Isaacson Scientist Biographies 1982115858 English Simon and Schuster

Read XML Format

I have copied the XML form of the data into github. I will now import the file using the XML library, which has been previously loaded.

xml.url<- "https://raw.githubusercontent.com/bharbans/DATA607_HW/main/Week%207/books.xml"
xml.raw <- xmlParse(getURL(xml.url))

I will now use the xmlToDataFrame to convert my file to an R DataFrame.

books.df.fromXML <- xmlToDataFrame(xml.raw)%>% 
  mutate(ISBN10  = as.integer(ISBN10) )
datatable(books.df.fromXML) 

Read HTML Table Format

I have copied the HTML table form of the data into github.

html.url<- "https://raw.githubusercontent.com/bharbans/DATA607_HW/main/Week%207/books.html"
html.raw <- read_html(html.url)

I will now use the html_table function from the rvest library to import the table into a DataFrame. SInce my file has only one table, I can safely choose the first element returned from the html_table function. I have also used the mutate_all function to replace all empty spaces with NA.

allTablesFromHTML<- html.raw %>%  html_table(fill = T,header = T, trim = T)
books.df.fromHTML <- allTablesFromHTML[[1]] %>% 
  mutate_all(list(~na_if(.,""))) 

Read JSON Format

I have copied the JSON form of the data into github. In my file I stored the information about the three books in a JSON array. I will use the fromJSON function from the jsonlite package to parse the file. I have also used the mutate_all function to replace all empty spaces with NA.

json.url<- "https://raw.githubusercontent.com/bharbans/DATA607_HW/main/Week%207/books.json"
json.raw <- jsonlite::fromJSON(getURL(json.url))
books.df.fromJSON <- json.raw$books %>% 
  mutate_all(list(~na_if(.,""))) %>% 
  mutate( ISBN10  = as.integer(ISBN10) )

Conclusion and Extra Credit

There are many different ways to obtain data in R, in this assignment we are concerned with JSON, XML files and HTML Tables. I will use the all_equals function to show that the three data frames are in fact equal, bear in mind I did introduce NAs for the JSON and HTML formats where the information was empty. I also cast the ISBN10 column as integers for the XML and JSON formats, the HTML parser did this on its own.

all_equal(books.df.fromHTML,books.df.fromJSON)
## [1] TRUE
all_equal(books.df.fromHTML,books.df.fromXML)
## [1] TRUE
all_equal(books.df.fromJSON,books.df.fromXML)
## [1] TRUE