Hazal Gunduz

Introduction

In this assignment, we are asked to create one (1) HTML file, one (1) XML file and one (1) JSON file, each holding information about our favorite books on one favorite subject. At least one of the books must have more than one author, and each book entry should include the title, the author(s) and two or three additional attributes. Once these three files are created, we load them into R and check whether the three resulting data frames are identical.

Load Libraries

library('XML')
library('RCurl')
library('jsonlite')

Create each file

Each file was created separately and can be found in the GitHub repositories below.

HTML File : https://github.com/Gunduzhazal/html

XML File: https://github.com/Gunduzhazal/.xml

JSON File: https://github.com/Gunduzhazal/json

Read HTML File and understand file structure

# create the url
html_url <- "https://github.com/Gunduzhazal/html"
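The chunk above only stores the URL. As a rough sketch of the parsing step, the following reads an HTML table into a data frame with the XML package. It assumes the book information sits in a single HTML table; the inline `html_text` string is a hypothetical stand-in for the real file (only the book shown elsewhere in this report is used), and `html_doc` and `book_html_df` are illustrative names.

```r
library(XML)

# Hypothetical inline stand-in for the HTML file (assumption: one <table> of books)
html_text <- '<html><head><title>Favorite Books</title></head><body>
<table>
<tr><th>title</th><th>author</th><th>publisher</th><th>pages</th></tr>
<tr><td>Harry Potter and the Order of the Phoenix</td><td>J. K. Rowling</td><td>Scholastic</td><td>870</td></tr>
</table>
</body></html>'

# parse the HTML text into an internal document
html_doc <- htmlParse(html_text, asText = TRUE)

# read the first table; keep strings as character instead of factor
book_html_df <- readHTMLTable(html_doc, which = 1, header = TRUE,
                              stringsAsFactors = FALSE)

# look at the structure
str(book_html_df)
```

Every column comes back as text here, including `pages`, because no column classes are requested.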

Read XML File and understand file structure

# create the url
book_xml_url <- "https://raw.githubusercontent.com//working-with-xml-json/master/book.xml"

# collect the source of the xml 
book_xml <- getURLContent(book_xml_url)

# print the raw response (it is not parsed into a data frame at this point)
book_xml

[1] "<a href=\"/working-with-xml-json/master/book.xml\">Moved Permanently</a>.\n\n"
attr(,"Content-Type")
                charset 
"text/html"     "utf-8"

# A tibble: 6 × 1
  `<!DOCTYPE html>`              
  <chr>                          
1 "<html>"                       
2 "<head>"                       
3 "<title>Favorite Books</title>"
4 "<meta charset=\"UTF-8\">"     
5 "</head>"                      
6 "<body>"

# A tibble: 6 × 1
  `<?xml version="1.0" encoding="UTF-8"?>`                
  <chr>                                                   
1 <favbooks>                                              
2 <book>                                                  
3 <title>Harry Potter and the Order of the Phoenix</title>
4 <author>J. K. Rowling</author>                          
5 <publisher>Scholastic</publisher>                       
6 <pages>870</pages>

# look at the structure
str(book_xml)

 chr "<a href=\"/working-with-xml-json/master/book.xml\">Moved Permanently</a>.\n\n"
 - attr(*, "Content-Type")= Named chr [1:2] "text/html" "utf-8"
  ..- attr(*, "names")= chr [1:2] "" "charset"
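The output above is a redirect message rather than the XML itself, so no data frame was built from it. As a minimal sketch of the parsing step, assuming the file follows the `<favbooks>`/`<book>` layout shown in the preview, the XML package can turn each `<book>` node into one row. The inline `xml_text` string stands in for the downloaded content, and `xml_doc` and `book_xml_df` are illustrative names.

```r
library(XML)

# Inline stand-in for the XML file, mirroring the preview shown above
xml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<favbooks>
  <book>
    <title>Harry Potter and the Order of the Phoenix</title>
    <author>J. K. Rowling</author>
    <publisher>Scholastic</publisher>
    <pages>870</pages>
  </book>
</favbooks>'

# parse the XML text into an internal document
xml_doc <- xmlParse(xml_text, asText = TRUE)

# each child of the root node (<book>) becomes one row
book_xml_df <- xmlToDataFrame(xml_doc, stringsAsFactors = FALSE)

# look at the structure
str(book_xml_df)
```

With `stringsAsFactors = FALSE` every column is character; with `stringsAsFactors = TRUE` the same columns would be factors.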

Read JSON File and Understand its structure

# create the url
book_json_url <- "https://raw.githubusercontent.com//working-with-xml-json/master/book.json"

# print the url (the json source is not downloaded or parsed at this point)
book_json_url

[1] "https://raw.githubusercontent.com//working-with-xml-json/master/book.json"

# A tibble: 6 × 1
  `{"favbooks" :[`                                  
  <chr>                                             
1 {                                                 
2 title : Harry Potter and the Order of the Phoenix,
3 author : J. K. Rowling,                           
4 publisher : Scholastic,                           
5 pages : 870,                                      
6 price : 12.99,

# look at the structure
str(book_json_url)

 chr "https://raw.githubusercontent.com//working-with-xml-json/master/book.json"
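The `str()` call above describes the URL string, not the JSON data. As a small sketch of how the table could be built, `jsonlite::fromJSON()` converts an array of objects into a data frame. The inline `json_text` string is a stand-in based on the preview above, and `book_json_df` is an illustrative name.

```r
library(jsonlite)

# Inline stand-in for the JSON file, mirroring the preview shown above
json_text <- '{"favbooks": [
  {
    "title": "Harry Potter and the Order of the Phoenix",
    "author": "J. K. Rowling",
    "publisher": "Scholastic",
    "pages": 870,
    "price": 12.99
  }
]}'

# parse the JSON text; the "favbooks" array is simplified to a data frame
book_json_df <- fromJSON(json_text)$favbooks

# look at the structure
str(book_json_df)
```

Unlike the HTML and XML routes, JSON carries type information, so `pages` arrives as an integer and `price` as a number without any conversion.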

Conclusion

After loading the HTML, XML and JSON files in R, we see that the XML and HTML tables are more similar to each other than either is to the JSON table. Both the XML and HTML tables store every variable as a factor, whereas the JSON table stores some variables as character and some as integer. Structurally, the HTML and XML data frames loaded in R are almost identical, while the JSON data frame differs in its data types.
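The comparison can be sketched as follows. The two small data frames are hypothetical stand-ins for the XML and JSON results (text-only columns versus typed columns), and `same_before` and `same_after` are illustrative names. The sketch shows that the difference lies in the column types rather than in the values.

```r
# Hypothetical stand-ins: XML/HTML parsing yields text, JSON parsing yields typed values
book_xml_df <- data.frame(
  title = "Harry Potter and the Order of the Phoenix",
  pages = "870",
  stringsAsFactors = FALSE
)
book_json_df <- data.frame(
  title = "Harry Potter and the Order of the Phoenix",
  pages = 870L,
  stringsAsFactors = FALSE
)

# compare the column classes side by side
sapply(book_xml_df, class)
sapply(book_json_df, class)

# the data frames differ only because 'pages' is character in one and integer in the other
same_before <- identical(book_xml_df, book_json_df)

# after converting the text column, the two data frames match
book_xml_df$pages <- as.integer(book_xml_df$pages)
same_after <- identical(book_xml_df, book_json_df)

same_before
same_after
```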

Rpubs => https://rpubs.com/gunduzhazal/821706

Github => https://github.com/Gunduzhazal/html

Github => https://github.com/Gunduzhazal/.xml

Github => https://github.com/Gunduzhazal/json