In this assignment, we are asked to create one HTML file, one XML file, and one JSON file, each holding information on our favorite books on one favorite subject. At least one of the books must have more than one author, and each book entry should include the title, the author(s), and two or three additional attributes. Once the three files are created, we load them into R and check whether the three resulting data frames are identical.
library('XML')
library('RCurl')
library('jsonlite')
Each file was created separately and can be found in the GitHub repositories below:
HTML File : https://github.com/Gunduzhazal/html
XML File: https://github.com/Gunduzhazal/.xml
JSON File: https://github.com/Gunduzhazal/json
# create the url
html_url <- "https://github.com/Gunduzhazal/html"
# create the url
book_xml_url <- "https://raw.githubusercontent.com//working-with-xml-json/master/book.xml"
# collect the source of the xml
book_xml <- getURLContent(book_xml_url)
# print the raw response (nothing is parsed into a data frame at this step)
book_xml
[1] "<a href=\"/working-with-xml-json/master/book.xml\">Moved Permanently</a>.\n\n"
attr(,"Content-Type")
charset
"text/html" "utf-8"
# A tibble: 6 × 1
`<!DOCTYPE html>`
<chr>
1 "<html>"
2 "<head>"
3 "<title>Favorite Books</title>"
4 "<meta charset=\"UTF-8\">"
5 "</head>"
6 "<body>"
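The tibble above holds the raw lines of the HTML file rather than a parsed table. As a minimal sketch of how an HTML table could be turned into a data frame with the `XML` package, the example below parses an inline string. The table layout and the `html_text` and `book_html_df` names are assumptions made for illustration; the actual file may be structured differently.

```r
library(XML)

# Hypothetical inline HTML standing in for the real file (layout is assumed).
html_text <- '<html><body><table>
<tr><th>title</th><th>author</th><th>publisher</th><th>pages</th></tr>
<tr><td>Harry Potter and the Order of the Phoenix</td><td>J. K. Rowling</td><td>Scholastic</td><td>870</td></tr>
</table></body></html>'

# Parse the string as HTML, then read the first <table> into a data frame.
html_doc <- htmlParse(html_text, asText = TRUE)
book_html_df <- readHTMLTable(html_doc, which = 1, header = TRUE,
                              stringsAsFactors = FALSE)

# Every column comes back as text, including the page count.
str(book_html_df)
```

With `stringsAsFactors = FALSE` the columns are character vectors; in R versions before 4.0 the default would have produced factors instead.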
# A tibble: 6 × 1
`<?xml version="1.0" encoding="UTF-8"?>`
<chr>
1 <favbooks>
2 <book>
3 <title>Harry Potter and the Order of the Phoenix</title>
4 <author>J. K. Rowling</author>
5 <publisher>Scholastic</publisher>
6 <pages>870</pages>
# look at the structure
str(book_xml)
chr "<a href=\"/working-with-xml-json/master/book.xml\">Moved Permanently</a>.\n\n"
- attr(*, "Content-Type")= Named chr [1:2] "text/html" "utf-8"
..- attr(*, "names")= chr [1:2] "" "charset"
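The `str()` output above shows that `book_xml` holds a "Moved Permanently" notice rather than the XML document, so there is nothing to parse yet. As a sketch of the parsing step, assuming the XML source is available as text, `xmlParse()` and `xmlToDataFrame()` from the `XML` package could be used as follows. The inline `xml_text` mirrors the first record shown earlier and is only a stand-in for the real file.

```r
library(XML)

# Hypothetical inline XML; it mirrors the first record shown above and is
# not the complete file.
xml_text <- '<?xml version="1.0" encoding="UTF-8"?>
<favbooks>
  <book>
    <title>Harry Potter and the Order of the Phoenix</title>
    <author>J. K. Rowling</author>
    <publisher>Scholastic</publisher>
    <pages>870</pages>
  </book>
</favbooks>'

# Parse the text into an XML document, then flatten each <book> into one row.
xml_doc <- xmlParse(xml_text, asText = TRUE)
book_xml_df <- xmlToDataFrame(xml_doc, stringsAsFactors = FALSE)

# All columns are read as text, so pages is a character column here.
str(book_xml_df)
```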
# create the url
book_json_url <- "https://raw.githubusercontent.com//working-with-xml-json/master/book.json"
# print the URL (the JSON source itself is not fetched at this step)
book_json_url
[1] "https://raw.githubusercontent.com//working-with-xml-json/master/book.json"
# A tibble: 6 × 1
`{"favbooks" :[`
<chr>
1 {
2 title : Harry Potter and the Order of the Phoenix,
3 author : J. K. Rowling,
4 publisher : Scholastic,
5 pages : 870,
6 price : 12.99,
# look at the structure (of the URL string, not of parsed JSON)
str(book_json_url)
chr "https://raw.githubusercontent.com//working-with-xml-json/master/book.json"
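The `str()` call above describes only the URL string. As a sketch of how the JSON content itself could become a data frame, `fromJSON()` from `jsonlite` can parse a JSON string directly. The inline `json_text` below is an assumption that mirrors the first record shown earlier; it is not the complete file.

```r
library(jsonlite)

# Hypothetical inline JSON mirroring the first record shown above.
json_text <- '{"favbooks": [
  {"title": "Harry Potter and the Order of the Phoenix",
   "author": "J. K. Rowling",
   "publisher": "Scholastic",
   "pages": 870,
   "price": 12.99}
]}'

# fromJSON() simplifies an array of records into a data frame.
book_json_df <- fromJSON(json_text)$favbooks

# Unlike the XML route, numeric fields keep numeric types.
str(book_json_df)
```

Because JSON distinguishes numbers from strings, `pages` and `price` arrive as numeric columns while `title`, `author`, and `publisher` stay character.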
After loading the HTML, XML, and JSON files in R, we see that the XML and HTML results are more similar to each other than to the JSON result. Both the XML and HTML tables store every variable as a factor, whereas the JSON table stores some variables as character and others as integer. Structurally, the HTML and XML data frames loaded in R are almost identical; the JSON data frame differs in its data types.
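The data-type difference described above can be checked directly in base R. The sketch below uses two small hand-built data frames, `from_xml` and `from_json`, which are hypothetical stand-ins for the loaded tables; the text columns are built as character here, though factors would behave the same way in the comparison.

```r
# Hypothetical stand-ins: the XML route yields text, the JSON route keeps integers.
from_xml <- data.frame(
  title = "Harry Potter and the Order of the Phoenix",
  pages = "870",
  stringsAsFactors = FALSE
)
from_json <- data.frame(
  title = "Harry Potter and the Order of the Phoenix",
  pages = 870L,
  stringsAsFactors = FALSE
)

# Compare column classes side by side.
classes <- rbind(xml = sapply(from_xml, class), json = sapply(from_json, class))
print(classes)

# The values look alike, but the differing types make the frames non-identical.
same_before <- identical(from_xml, from_json)

# Converting the text column to integer removes the difference.
from_xml$pages <- as.integer(from_xml$pages)
same_after <- identical(from_xml, from_json)

print(c(before = same_before, after = same_after))
```

`identical()` is strict about types, which is why the comparison fails until the column classes agree.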
Rpubs => https://rpubs.com/gunduzhazal/821706
Github => https://github.com/Gunduzhazal/html
Github => https://github.com/Gunduzhazal/.xml
Github => https://github.com/Gunduzhazal/json