Week 7 Assignment

Description
Reading HTML File into R
Reading XML File into R
Reading JSON File into R
Are the Three Data Frames Identical?

Description

In this assignment I selected three books that I enjoyed reading and entered information about each one in a separate, hand-written file. One book is encoded in HTML, one in XML and one in JSON. I then pull the data into R and answer whether the three resulting data frames are identical.

Book Files

I created the files by hand and saved the HTML, XML and JSON files to my GitHub repository. I have also loaded dplyr into my R environment, which, among other things, provides the %>% pipe used in the code below.

Reading HTML File into R

Here’s what the HTML file looks like:

Table 1. Preview of books.html

<!DOCTYPE HTML>
<html>
<body>
    <table>
        <tbody>
            <tr><th>Title</th><td>Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets</td></tr>
            <tr><th>Author</th><td>Nassim Nicholas Taleb </td></tr>
            <tr><th>Pages</th><td>368</td></tr>
            <tr><th>Publisher</th><td> Random House Trade Paperbacks</td></tr>
            <tr><th>ISBN</th><td>9780812975215</td></tr>
        </tbody>
    </table>
</body>
</html>

I will use rvest to extract the data from the table.

library(rvest)

# Parse the page, select the first <table> element, and convert it to a data frame
html_df <- read_html("https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/books.html") %>%
  html_node("table") %>%
  html_table()

Here’s what the HTML data looks like after being processed by rvest:

Table 2. Data From books.html

X1	X2
Title	Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets
Author	Nassim Nicholas Taleb
Pages	368
Publisher	Random House Trade Paperbacks
ISBN	9780812975215
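
To try this step without network access, the same pipeline can be run on an inline copy of the table. This is only a minimal sketch: the `books_html` string is a hypothetical, abbreviated stand-in for books.html, and `demo_html_df` is a name introduced here for illustration. The columns come out as X1 and X2 because the first row mixes `<th>` and `<td>` cells, so rvest does not treat it as a header row.

```r
library(rvest)

# Hypothetical inline stand-in for books.html (abbreviated to three rows)
books_html <- "<html><body><table>
  <tr><th>Title</th><td>Fooled by Randomness</td></tr>
  <tr><th>Pages</th><td>368</td></tr>
  <tr><th>ISBN</th><td>9780812975215</td></tr>
</table></body></html>"

# Same steps as above: parse, select the table, convert to a data frame
demo_html_df <- read_html(books_html) %>%
  html_node("table") %>%
  html_table()

# Newer rvest versions return a tibble; coerce to a plain data frame
demo_html_df <- as.data.frame(demo_html_df)
print(demo_html_df)
```

Because the X2 column mixes text and numbers, it should stay a character column, so the page count is the string "368" rather than a number.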

Reading XML File into R

Now it’s time to work on the XML file. Here’s what the XML file looks like:

Table 3. Preview of books.xml

<?xml version="1.0" encoding="UTF-8"?>
<Book>
  <Title>The Signal and the Noise: Why So Many Predictions Fail - But Some Don't</Title>
  <Author>Nate Silver</Author>
  <Pages>544</Pages>
  <Publisher>Penguin</Publisher>
  <ISBN>9781594204111</ISBN>
</Book>

I will use the xml2 package to extract the data from the XML file.

library(xml2)
books_xml <- read_xml("https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/books.xml")
# Pull out the element names
X1 <- books_xml %>%
  xml_children() %>%
  xml_name()
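# Pull out the element values
X2 <- books_xml %>%
  xml_children() %>%
  xml_text()
# Put it all together in a data frame, keeping both columns as character
# (stringsAsFactors = FALSE is already the default in R 4.0 and later)
xml_df <- data.frame(X1, X2, stringsAsFactors = FALSE)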

Here’s what the XML data looks like after passing through the above process:

Table 4. Data Frame of books.xml

X1	X2
Title	The Signal and the Noise: Why So Many Predictions Fail - But Some Don’t
Author	Nate Silver
Pages	544
Publisher	Penguin
ISBN	9781594204111
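
The same name/value extraction can be sketched on an inline XML string, which makes it easy to experiment without the file. The `books_xml_text` string is a hypothetical, abbreviated stand-in for books.xml, and `fields` and `demo_xml_df` are illustrative names. The sketch also collects the child nodes once instead of walking the document twice.

```r
library(xml2)

# Hypothetical inline stand-in for books.xml (abbreviated to three elements)
books_xml_text <- "<Book>
  <Title>The Signal and the Noise</Title>
  <Author>Nate Silver</Author>
  <Pages>544</Pages>
</Book>"

# Collect the child elements of the root <Book> node once
fields <- xml_children(read_xml(books_xml_text))

# Element names become X1 and element text becomes X2
demo_xml_df <- data.frame(
  X1 = xml_name(fields),
  X2 = xml_text(fields),
  stringsAsFactors = FALSE
)
print(demo_xml_df)
```

Since `xml_text()` always returns character values, the page count is stored as the string "544".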

Reading JSON File into R

Finally, I will work on the JSON file. Here’s what it looks like:

Table 5. Preview of books.json

{
    "Title": "Why Nations Fail: The Origins of Power, Prosperity, and Poverty",
    "Author": [
        "Daron Acemoğlu",
        "James A. Robinson"
    ],
    "Pages": 544,
    "Publisher": "Crown Business",
    "ISBN": 9780307719218
}

I will use the jsonlite package to extract the data from the file.

library(jsonlite)
json_df <- fromJSON("https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/books.json") %>%
  unlist() %>%
  as.data.frame()
# Name the single value column X2
names(json_df) <- "X2"
# Copy the row names (the JSON keys) into a new X1 column
json_df <- data.frame(X1 = row.names(json_df), json_df)
# Drop the row names so the rows are simply numbered
rownames(json_df) <- NULL

Here’s what the JSON data looks like after my wrangling:

Table 6. Data Frame of books.json

X1	X2
Title	Why Nations Fail: The Origins of Power, Prosperity, and Poverty
Author1	Daron Acemoğlu
Author2	James A. Robinson
Pages	544
Publisher	Crown Business
ISBN	9780307719218
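
The Author1 and Author2 rows come from the `unlist()` step, which flattens the parsed list and numbers the elements of any multi-valued field. The sketch below illustrates this on an inline JSON string. The `book_json` string is a hypothetical, abbreviated stand-in for books.json, and `parsed`, `wide_df`, `flat` and `demo_json_df` are illustrative names. It also shows what a direct conversion gives without `unlist()`.

```r
library(jsonlite)

# Hypothetical inline stand-in for books.json (abbreviated to three fields)
book_json <- '{
  "Title": "Why Nations Fail",
  "Author": ["Daron Acemoğlu", "James A. Robinson"],
  "Pages": 544
}'

# fromJSON() returns a list; Author is a character vector of length 2
parsed <- fromJSON(book_json)

# Direct conversion: one column per key and one row per author,
# with the single-valued fields recycled across both rows
wide_df <- as.data.frame(parsed)
print(dim(wide_df))  # 2 rows, 3 columns

# unlist() flattens everything to a named character vector and
# numbers the two authors as Author1 and Author2
flat <- unlist(parsed)
demo_json_df <- data.frame(
  X1 = names(flat),
  X2 = unname(flat),
  stringsAsFactors = FALSE
)
print(demo_json_df)
```

Flattening also coerces every value to character, so the numeric page count ends up as the string "544", matching the HTML and XML results.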

Are the Three Data Frames Identical?

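They now share the same two-column layout (X1 holds the field name, X2 the value), but only because I deliberately wrangled them that way. If I had simply read each file and converted the result to a data frame, the three would not match. The contents still differ, of course: each file describes a different book, and the JSON data frame has one more row because its two authors become Author1 and Author2.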
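
One way to make the comparison concrete is to check it in code. The sketch below uses small stand-in data frames with values copied from Tables 2, 4 and 6, abbreviated to the Author and Pages rows, so it runs without the files. The names `html_demo`, `xml_demo`, `json_demo` and the helper `same_layout()` are hypothetical and not part of the original analysis. Since each file describes a different book, the sketch compares layout (column names and column types) separately from full equality.

```r
# Stand-in data frames (values copied from Tables 2, 4 and 6, abbreviated)
html_demo <- data.frame(X1 = c("Author", "Pages"),
                        X2 = c("Nassim Nicholas Taleb", "368"),
                        stringsAsFactors = FALSE)
xml_demo  <- data.frame(X1 = c("Author", "Pages"),
                        X2 = c("Nate Silver", "544"),
                        stringsAsFactors = FALSE)
json_demo <- data.frame(X1 = c("Author1", "Author2", "Pages"),
                        X2 = c("Daron Acemoğlu", "James A. Robinson", "544"),
                        stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when two data frames have the same
# column names and the same column classes
same_layout <- function(a, b) {
  identical(names(a), names(b)) &&
    identical(vapply(a, class, character(1)), vapply(b, class, character(1)))
}

same_layout(html_demo, xml_demo)       # TRUE: both have character columns X1 and X2
same_layout(html_demo, json_demo)      # TRUE: same columns, even though the row count differs
identical(html_demo, xml_demo)         # FALSE: the books are different
identical(html_demo$X1, xml_demo$X1)   # TRUE: same field names in the same order
identical(html_demo$X1, json_demo$X1)  # FALSE: JSON splits Author into Author1 and Author2
```

The helper only looks at columns, so it would not flag the extra author row in the JSON data; comparing the X1 columns directly catches that difference.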