Week 7 Assignment
Description
In this assignment I selected three books that I enjoyed reading and entered information about each of them in a separate file: one in HTML, one in XML, and one in JSON. I am to load the data into R and determine whether the three resulting data frames are identical.
Reading HTML File into R
Here’s what the HTML file looks like:
Table 1. Preview of books.html
<!DOCTYPE HTML>
<html>
<body>
<table>
<tbody>
<tr><th>Title</th><td>Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets</td></tr>
<tr><th>Author</th><td>Nassim Nicholas Taleb </td></tr>
<tr><th>Pages</th><td>368</td></tr>
<tr><th>Publisher</th><td> Random House Trade Paperbacks</td></tr>
<tr><th>ISBN</th><td>9780812975215</td></tr>
</tbody>
</table>
</body>
</html>
I will use rvest to extract the data from the table.
library(rvest)
html_df <- read_html("https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/books.html") %>%
  html_node("table") %>%
  html_table()
Here’s what the HTML data looks like after being processed by rvest:
Table 2. Data From books.html
X1 | X2 |
---|---|
Title | Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets |
Author | Nassim Nicholas Taleb |
Pages | 368 |
Publisher | Random House Trade Paperbacks |
ISBN | 9780812975215 |
Reading XML File into R
Now it’s time to work on the XML file. Here’s what the XML file looks like:
Table 3. Preview of books.xml
<?xml version="1.0" encoding="UTF-8"?>
<Book>
<Title>The Signal and the Noise: Why So Many Predictions Fail - But Some Don't</Title>
<Author>Nate Silver</Author>
<Pages>544</Pages>
<Publisher>Penguin</Publisher>
<ISBN>9781594204111</ISBN>
</Book>
I will use the xml2 package to extract the data from the XML file.
library(xml2)
books_xml <- read_xml("https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/books.xml")
# Pull out the element names (%>% is available because rvest was loaded above)
X1 <- books_xml %>%
  xml_children() %>%
  xml_name()
# Pull out the values
X2 <- books_xml %>%
  xml_children() %>%
  xml_text()
# Put it all together in a data frame, keeping the text as character rather than factor
xml_df <- data.frame(X1, X2, stringsAsFactors = FALSE)
Here’s what the XML data looks like after passing through the above process:
Table 4. Data Frame of books.xml
X1 | X2 |
---|---|
Title | The Signal and the Noise: Why So Many Predictions Fail - But Some Don’t |
Author | Nate Silver |
Pages | 544 |
Publisher | Penguin |
ISBN | 9781594204111 |
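The same extraction can be tried without a network connection. The following is a minimal sketch, assuming the xml2 package is installed; the inline string `books_src` is a hand-typed copy of the file preview above (not the downloaded file), and `fields` is an illustrative name. It relies on xml_name() and xml_text() returning parallel vectors, one entry per child element:

```r
library(xml2)

# Hand-typed copy of books.xml (an assumption standing in for the remote file)
books_src <- "<Book>
  <Title>The Signal and the Noise: Why So Many Predictions Fail - But Some Don't</Title>
  <Author>Nate Silver</Author>
  <Pages>544</Pages>
  <Publisher>Penguin</Publisher>
  <ISBN>9781594204111</ISBN>
</Book>"

# Grab the child elements of the root <Book> node once and reuse them
fields <- xml_children(read_xml(books_src))

# Element names become X1, element text becomes X2
xml_df <- data.frame(
  X1 = xml_name(fields),
  X2 = xml_text(fields),
  stringsAsFactors = FALSE
)

xml_df
```

Fetching the children once avoids walking the document twice, and it makes it clearer that both columns come from the same set of nodes.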
Reading JSON File into R
Finally, I will work on the JSON file. Here’s what it looks like:
Table 5. Preview of books.json
{
"Title": "Why Nations Fail: The Origins of Power, Prosperity, and Poverty",
"Author": [
"Daron Acemoğlu",
"James A. Robinson"
],
"Pages": 544,
"Publisher": "Crown Business",
"ISBN": 9780307719218
}
I will use the jsonlite package to extract the data from the file.
library(jsonlite)
json_df <- fromJSON("https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/books.json") %>%
  unlist() %>%
  as.data.frame(stringsAsFactors = FALSE)
# Rename the single column
names(json_df) <- "X2"
# Pull the row names into X1
json_df <- data.frame(X1 = row.names(json_df), json_df, stringsAsFactors = FALSE)
# Drop the row names
rownames(json_df) <- NULL
Here’s what the JSON data looks like after my wrangling:
Table 6. Data Frame of books.json
X1 | X2 |
---|---|
Title | Why Nations Fail: The Origins of Power, Prosperity, and Poverty |
Author1 | Daron Acemoğlu |
Author2 | James A. Robinson |
Pages | 544 |
Publisher | Crown Business |
ISBN | 9780307719218 |
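The Author1 and Author2 rows in Table 6 come from the way unlist() flattens a nested list. The following is a minimal base-R sketch, assuming fromJSON() returns a named list shaped like the hand-built `book` below; the list is typed in by hand so the example runs without a network connection:

```r
# Hand-built stand-in for the parsed JSON (an assumption, not actual fromJSON() output)
book <- list(
  Title     = "Why Nations Fail: The Origins of Power, Prosperity, and Poverty",
  Author    = c("Daron Acemoğlu", "James A. Robinson"),
  Pages     = 544,
  Publisher = "Crown Business",
  ISBN      = 9780307719218
)

# unlist() flattens the list into one named vector; the two-element
# Author entry is split into Author1 and Author2
flat <- unlist(book)
names(flat)

# Mixing text and numbers coerces every value to character
class(flat)
flat[["Pages"]]
```

Because every value ends up as text, Pages would need as.numeric() before being used in any arithmetic.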
Are the Three Data Frames Identical?
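They now share the same layout (an X1 column of field names and an X2 column of values), but that was intentional. If I had simply read in each file and converted it to a data frame, the layouts would not match. Even so, the data frames are not identical row for row: each describes a different book, and the JSON data frame has six rows instead of five because its two authors become Author1 and Author2.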
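One way to check this is to compare the data frames directly. The sketch below is a minimal base-R illustration, assuming data frames shaped like Tables 2, 4, and 6. They are typed in by hand with shortened titles rather than read from the files, and the helper `same_layout()` and the list `book` are hypothetical:

```r
fields <- c("Title", "Author", "Pages", "Publisher", "ISBN")

# Hand-typed stand-ins for the three data frames (titles shortened)
html_df <- data.frame(
  X1 = fields,
  X2 = c("Fooled by Randomness", "Nassim Nicholas Taleb", "368",
         "Random House Trade Paperbacks", "9780812975215"),
  stringsAsFactors = FALSE
)
xml_df <- data.frame(
  X1 = fields,
  X2 = c("The Signal and the Noise", "Nate Silver", "544",
         "Penguin", "9781594204111"),
  stringsAsFactors = FALSE
)
json_df <- data.frame(
  X1 = c("Title", "Author1", "Author2", "Pages", "Publisher", "ISBN"),
  X2 = c("Why Nations Fail", "Daron Acemoğlu", "James A. Robinson",
         "544", "Crown Business", "9780307719218"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when column names and column types match
same_layout <- function(a, b) {
  identical(names(a), names(b)) &&
    identical(sapply(a, class), sapply(b, class))
}

same_layout(html_df, xml_df)      # layouts match
same_layout(xml_df, json_df)      # layouts match
identical(html_df, xml_df)        # contents differ, so not identical
nrow(json_df)                     # one extra row for the second author

# Converting a parsed-JSON-like list directly gives a wide layout instead:
# one column per field and one row per author
book <- list(Title = "Why Nations Fail",
             Author = c("Daron Acemoğlu", "James A. Robinson"),
             Pages = 544)
wide <- as.data.frame(book, stringsAsFactors = FALSE)
dim(wide)
```

Note that identical() compares contents as well as structure, so it only returns TRUE for data frames holding exactly the same values. Comparing names and column types separately is what shows that the layouts agree.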