Week 7 Assignment

Description

In this assignment I have selected three books that I enjoyed reading and entered information about them in three separately created files, one each in HTML, XML and JSON format. The task is to pull the data into R and answer whether the three resulting data frames are identical.

Book Files

I created the files by hand and saved the HTML, XML and JSON files to my GitHub repository. I have also loaded dplyr into my R environment, which supplies the %>% pipe used in the code that follows.

Reading HTML File into R

Here’s what the HTML file looks like:

Table 1. Preview of books.html
 <!DOCTYPE HTML>                                                                                                       
 <html>                                                                                                                
 <body>                                                                                                                
     <table>                                                                                                           
         <tbody>                                                                                                       
             <tr><th>Title</th><td>Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets</td></tr>
             <tr><th>Author</th><td>Nassim Nicholas Taleb </td></tr>                                                   
             <tr><th>Pages</th><td>368</td></tr>                                                                       
             <tr><th>Publisher</th><td> Random House Trade Paperbacks</td></tr>                                        
             <tr><th>ISBN</th><td>9780812975215</td></tr>                                                              
         </tbody>                                                                                                      
     </table>                                                                                                          
 </body>                                                                                                               
 </html>

I will use rvest to extract the data from the table.

library(rvest)

html_df <- read_html("https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/books.html") %>%
  html_node("table") %>%
  html_table()

Here’s what the HTML data looks like after being processed by rvest:

Table 2. Data From books.html
X1         X2
Title      Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets
Author     Nassim Nicholas Taleb
Pages      368
Publisher  Random House Trade Paperbacks
ISBN       9780812975215

Reading XML File into R

Now it’s time to work on the XML file. Here’s what the XML file looks like:

Table 3. Preview of books.xml
 <?xml version="1.0" encoding="UTF-8"?>                                                  
 <Book>                                                                                  
   <Title>The Signal and the Noise: Why So Many Predictions Fail - But Some Don't</Title>
   <Author>Nate Silver</Author>                                                          
   <Pages>544</Pages>                                                                    
   <Publisher>Penguin</Publisher>                                                        
   <ISBN>9781594204111</ISBN>                                                            
 </Book>

I will use the xml2 package to extract the data from the XML file.

library(xml2)
books_xml <- read_xml("https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/books.xml")
# Pull out the element names
X1 <- books_xml %>%
  xml_children() %>%
  xml_name()
# Pull out the value
X2 <- books_xml %>%
  xml_children() %>%
  xml_text()
# Put it all together in a data frame
xml_df <- data.frame(X1, X2)

Here’s what the XML data looks like after passing through the above process:

Table 4. Data Frame of books.xml
X1         X2
Title      The Signal and the Noise: Why So Many Predictions Fail - But Some Don’t
Author     Nate Silver
Pages      544
Publisher  Penguin
ISBN       9781594204111
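
The extraction above walks the child elements twice, once for names and once for values. As a minimal sketch of the same idea, the children can be stored once and reused. This assumes the xml2 package is installed; the XML is an inline, shortened copy of the file (read_xml() accepts literal XML as well as a URL), and the variable names book and fields are illustrative rather than taken from the original code.

```r
library(xml2)

# Inline, shortened stand-in for books.xml so the sketch runs without network access
book <- read_xml(
  "<Book><Title>The Signal and the Noise</Title><Author>Nate Silver</Author><Pages>544</Pages></Book>"
)

# Collect the child elements once, then read names and text from the same node set
fields <- xml_children(book)

xml_df <- data.frame(
  X1 = xml_name(fields),   # element names: Title, Author, Pages
  X2 = xml_text(fields),   # element text, always returned as character
  stringsAsFactors = FALSE # keep plain character columns on R versions before 4.0
)

str(xml_df)
```

Note that every value, including Pages, arrives as text, which matters later when comparing this frame with the others.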

Reading JSON File into R

Finally, I will work on the JSON file. Here’s what it looks like:

Table 5. Preview of books.json
 {                                                                              
     "Title": "Why Nations Fail: The Origins of Power, Prosperity, and Poverty",
     "Author": [                                                                
         "Daron Acemoğlu",                                                      
         "James A. Robinson"                                                    
     ],                                                                         
     "Pages": 544,                                                              
     "Publisher": "Crown Business",                                             
     "ISBN": 9780307719218                                                      
 }

I will use the jsonlite package to extract the data from the file.

library(jsonlite)
json_df <- fromJSON("https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/books.json") %>%
  unlist() %>%
  as.data.frame()
# Change the name
names(json_df) <- c("X2")
# Pull the row names into X1
json_df <- data.frame(X1 = row.names(json_df), json_df)
# Drop the row names
rownames(json_df) <- c()

Here’s what the JSON data looks like after my wrangling:

Table 6. Data Frame of books.json
X1         X2
Title      Why Nations Fail: The Origins of Power, Prosperity, and Poverty
Author1    Daron Acemoğlu
Author2    James A. Robinson
Pages      544
Publisher  Crown Business
ISBN       9780307719218
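
The Author1 and Author2 rows come from unlist(), which appends a numeric suffix to the name of any list element holding more than one value. The base-R sketch below illustrates that behaviour. It assumes the parsed JSON is a named list shaped like the hand-built book list shown here (it is typed in rather than read with fromJSON()), and the author-collapsing step at the end is a hypothetical alternative, not part of the workflow above.

```r
# Hand-built stand-in for the list that fromJSON() is assumed to return
book <- list(
  Title = "Why Nations Fail: The Origins of Power, Prosperity, and Poverty",
  Author = c("Daron Acemoğlu", "James A. Robinson"),
  Pages = 544,
  Publisher = "Crown Business",
  ISBN = 9780307719218
)

# unlist() flattens the list, coerces everything to character,
# and splits the two authors into Author1 and Author2
flat <- unlist(book)
names(flat)

# Hypothetical alternative: collapse the authors into one string first,
# so the data frame keeps one row per field
book$Author <- paste(book$Author, collapse = "; ")
json_df <- data.frame(
  X1 = names(book),
  X2 = unlist(book, use.names = FALSE),
  stringsAsFactors = FALSE
)
nrow(json_df)
```

Whether to keep one row per author or one row per field is a design choice; the collapsed form lines up row for row with the HTML and XML frames.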

Are the Three Data Frames Identical?

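They now share the same layout, a field name in X1 and its value in X2, but that was intentional. The contents still differ because each file describes a different book, and the JSON data frame has one extra row because its two authors are split into Author1 and Author2. If I had simply read in each file and converted it to a data frame without any wrangling, even the layouts would not match.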
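
One way to make the comparison concrete is to check column names, dimensions and full equality separately. This is a minimal sketch that assumes data frames shaped like Tables 2, 4 and 6 with character columns. The frames are rebuilt by hand with shortened titles so the example runs without network access, and the helper names make_df and same_layout are hypothetical.

```r
# Hypothetical helper: build a two-column frame like the ones produced above
make_df <- function(fields, values) {
  data.frame(X1 = fields, X2 = values, stringsAsFactors = FALSE)
}

fields <- c("Title", "Author", "Pages", "Publisher", "ISBN")

html_df <- make_df(fields, c("Fooled by Randomness", "Nassim Nicholas Taleb",
                             "368", "Random House Trade Paperbacks", "9780812975215"))
xml_df <- make_df(fields, c("The Signal and the Noise", "Nate Silver",
                            "544", "Penguin", "9781594204111"))
json_df <- make_df(c("Title", "Author1", "Author2", "Pages", "Publisher", "ISBN"),
                   c("Why Nations Fail", "Daron Acemoğlu", "James A. Robinson",
                     "544", "Crown Business", "9780307719218"))

# Hypothetical helper: same column names in the same order
same_layout <- function(a, b) identical(names(a), names(b))

same_layout(html_df, xml_df)            # column names match
same_layout(html_df, json_df)           # column names match
identical(dim(html_df), dim(xml_df))    # both are 5 rows by 2 columns
identical(dim(html_df), dim(json_df))   # JSON frame has 6 rows, so this is FALSE
identical(html_df, xml_df)              # different books, so this is FALSE
```

identical() is strict: it returns TRUE only when names, types and every value agree, so it answers the "identical" question literally, while the layout and dimension checks show where the frames do line up.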

Mike Silva

October 8, 2018