Introduction
In this assignment, we are asked to create one HTML file, one XML file and one JSON file, each holding information on our favorite books on one favorite subject. At least one of the books must have more than one author, and each book entry should include the title, the author(s) and two or three additional attributes. Once these three files are created, we load them into R as data frames and check whether the three data frames are identical.
Create each file
Each file was created separately and can be found in the GitHub repository below.
HTML File : https://raw.githubusercontent.com/anilak1978/working-with-xml-json/master/book.html
XML File: https://raw.githubusercontent.com/anilak1978/working-with-xml-json/master/book.xml
JSON File: https://raw.githubusercontent.com/anilak1978/working-with-xml-json/master/book.json
Read HTML File and understand file structure
# load the packages that provide getURLContent() and readHTMLTable()
library(RCurl)
library(XML)
# create the url
book_html_url <- "https://raw.githubusercontent.com/anilak1978/working-with-xml-json/master/book.html"
# collect the source of the html page
html <- getURLContent(book_html_url)
# parse the html and keep the first table as the data frame
book_html <- readHTMLTable(html)
book_html <- book_html[[1]]
book_html
                                      Title                        Author
1 Harry Potter and the Order of the Phoenix                 J. K. Rowling
2                         The Last Olympian   Rick Riordan, Percy Jackson
3                          Hour of the Bees Lindsay Eagar, Kristina Closs
4                             Timmy Failure                Stephan Pastis
         Publisher Pages Price
1       Scholastic   870 12.99
2           Disney   377 19.99
3 Candlewick Press   360  8.99
4 Candlewick Press   330 14.99
                                                                            Website
1                              https://kids.scholastic.com/kids/books/harry-potter/
2             https://www.readriordan.com/book/the-last-olympian-the-graphic-novel/
3                     https://www.goodreads.com/book/show/22453777-hour-of-the-bees
4 https://www.goodreads.com/book/show/37976815-it-s-the-end-when-i-say-it-s-the-end
# inspect the structure of the data frame
str(book_html)
'data.frame': 4 obs. of 6 variables:
 $ Title    : Factor w/ 4 levels "Harry Potter and the Order of the Phoenix",..: 1 3 2 4
 $ Author   : Factor w/ 4 levels "J. K. Rowling",..: 1 3 2 4
 $ Publisher: Factor w/ 3 levels "Candlewick Press",..: 3 2 1 1
 $ Pages    : Factor w/ 4 levels "330","360","377",..: 4 3 2 1
 $ Price    : Factor w/ 4 levels "12.99","14.99",..: 1 3 4 2
 $ Website  : Factor w/ 4 levels "https://kids.scholastic.com/kids/books/harry-potter/",..: 1 4 2 3
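The structure above shows Pages and Price stored as factors, so any arithmetic on them would work on the level codes rather than on the printed values. Below is a minimal sketch of one way to recover the numbers. It uses a small hand-built stand-in, book_html_demo (a hypothetical name), filled with values copied from the printed table rather than read from the live file.

```r
# Stand-in for the parsed HTML table, built by hand from the printed values.
# stringsAsFactors = TRUE is set explicitly because the default is FALSE
# from R 4.0.0 onward; it reproduces the all-factor structure shown by str().
book_html_demo <- data.frame(
  Title = c("Harry Potter and the Order of the Phoenix", "The Last Olympian",
            "Hour of the Bees", "Timmy Failure"),
  Pages = c("870", "377", "360", "330"),
  Price = c("12.99", "19.99", "8.99", "14.99"),
  stringsAsFactors = TRUE
)

# as.integer() applied directly to a factor returns the level codes,
# not the page counts
pages_codes <- as.integer(book_html_demo$Pages)

# converting to character first recovers the printed values
book_html_demo$Pages <- as.integer(as.character(book_html_demo$Pages))
book_html_demo$Price <- as.numeric(as.character(book_html_demo$Price))
book_html_demo$Title <- as.character(book_html_demo$Title)

str(book_html_demo)
```

The level codes in pages_codes are the same integers that str() prints after the factor levels, which is why the intermediate as.character() step matters.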
Read XML File and understand file structure
# create the url
book_xml_url <- "https://raw.githubusercontent.com/anilak1978/working-with-xml-json/master/book.xml"
# collect the source of the xml
book_xml <- getURLContent(book_xml_url)
# parse the xml and create the data frame
book_xml <- xmlParse(book_xml)
book_xml <- xmlToDataFrame(book_xml)
book_xml
                                      title                        author
1 Harry Potter and the Order of the Phoenix                 J. K. Rowling
2                         The Last Olympian   Rick Riordan, Percy Jackson
3                          Hour of the Bees Lindsay Eagar, Kristina Closs
4                             Timmy Failure                Stephan Pastis
         publisher pages price
1       Scholastic   870 12.99
2           Disney   377 19.99
3 Candlewick Press   360  8.99
4 Candlewick Press   330 14.99
                                                                            website
1                              https://kids.scholastic.com/kids/books/harry-potter/
2             https://www.readriordan.com/book/the-last-olympian-the-graphic-novel/
3                     https://www.goodreads.com/book/show/22453777-hour-of-the-bees
4 https://www.goodreads.com/book/show/37976815-it-s-the-end-when-i-say-it-s-the-end
# inspect the structure of the data frame
str(book_xml)
'data.frame': 4 obs. of 6 variables:
 $ title    : Factor w/ 4 levels "Harry Potter and the Order of the Phoenix",..: 1 3 2 4
 $ author   : Factor w/ 4 levels "J. K. Rowling",..: 1 3 2 4
 $ publisher: Factor w/ 3 levels "Candlewick Press",..: 3 2 1 1
 $ pages    : Factor w/ 4 levels "330","360","377",..: 4 3 2 1
 $ price    : Factor w/ 4 levels "12.99","14.99",..: 1 3 4 2
 $ website  : Factor w/ 4 levels "https://kids.scholastic.com/kids/books/harry-potter/",..: 1 4 2 3
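The HTML and XML data frames print the same values, but their column names differ in case (Title versus title), and that alone is enough for identical() to report a difference. Here is a rough sketch of that check, using two hand-built stand-ins (html_demo and xml_demo are hypothetical names) limited to two of the six columns:

```r
# Two stand-ins with the same values; only the column-name case differs.
# stringsAsFactors = TRUE mirrors the all-factor structure of both results.
titles <- c("Harry Potter and the Order of the Phoenix", "The Last Olympian",
            "Hour of the Bees", "Timmy Failure")
pages <- c("870", "377", "360", "330")

html_demo <- data.frame(Title = titles, Pages = pages, stringsAsFactors = TRUE)
xml_demo  <- data.frame(title = titles, pages = pages, stringsAsFactors = TRUE)

# identical() compares names as well as values
same_before <- identical(html_demo, xml_demo)

# lower-casing the HTML column names removes the only difference
names(html_demo) <- tolower(names(html_demo))
same_after <- identical(html_demo, xml_demo)

print(c(before = same_before, after = same_after))
```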
Read JSON File and understand file structure
# load jsonlite, assumed here to be the package providing fromJSON()
library(jsonlite)
# create the url
book_json_url <- "https://raw.githubusercontent.com/anilak1978/working-with-xml-json/master/book.json"
# collect the json source and keep the first element as the data frame
book_json <- fromJSON(book_json_url)
book_json <- book_json[[1]]
book_json
                                      title                        author
1 Harry Potter and the Order of the Phoenix                 J. K. Rowling
2                         The Last Olympian   Rick Riordan, Percy Jackson
3                          Hour of the Bees Lindsay Eagar, Kristina Closs
4                             Timmy Failure                Stephan Pastis
         publisher pages price
1       Scholastic   870 12.99
2           Disney   377 19.99
3 Candlewick Press   360  8.99
4 Candlewick Press   330 14.99
                                                                            website
1                              https://kids.scholastic.com/kids/books/harry-potter/
2             https://www.readriordan.com/book/the-last-olympian-the-graphic-novel/
3                     https://www.goodreads.com/book/show/22453777-hour-of-the-bees
4 https://www.goodreads.com/book/show/37976815-it-s-the-end-when-i-say-it-s-the-end
# inspect the structure of the data frame
str(book_json)
'data.frame': 4 obs. of 6 variables:
 $ title    : chr "Harry Potter and the Order of the Phoenix" "The Last Olympian" "Hour of the Bees" "Timmy Failure"
 $ author   : chr "J. K. Rowling" "Rick Riordan, Percy Jackson" "Lindsay Eagar, Kristina Closs" "Stephan Pastis"
 $ publisher: chr "Scholastic" "Disney" "Candlewick Press" "Candlewick Press"
 $ pages    : int 870 377 360 330
 $ price    : num 12.99 19.99 8.99 14.99
 $ website  : chr "https://kids.scholastic.com/kids/books/harry-potter/" "https://www.readriordan.com/book/the-last-olympian-the-graphic-novel/" "https://www.goodreads.com/book/show/22453777-hour-of-the-bees" "https://www.goodreads.com/book/show/37976815-it-s-the-end-when-i-say-it-s-the-end"
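In all three data frames, books with more than one author keep the names in a single comma-separated string. If one row per author were needed, the strings could be split. The sketch below assumes that a comma always separates two authors and never appears inside a name. The names author_list, n_authors and book_authors are hypothetical.

```r
# Title and author values copied from the printed data frame
title <- c("Harry Potter and the Order of the Phoenix", "The Last Olympian",
           "Hour of the Bees", "Timmy Failure")
author <- c("J. K. Rowling", "Rick Riordan, Percy Jackson",
            "Lindsay Eagar, Kristina Closs", "Stephan Pastis")

# split on a comma followed by optional whitespace; gives one character
# vector of names per book
author_list <- strsplit(author, ",\\s*")

# number of authors per book
n_authors <- lengths(author_list)

# long format: repeat each title once per author
book_authors <- data.frame(
  title = rep(title, n_authors),
  author = unlist(author_list),
  stringsAsFactors = FALSE
)

print(book_authors)
```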
Conclusion
Upon loading the HTML, XML and JSON files in R, we see that the three data frames are not identical. The HTML and XML data frames are more similar to each other than either is to the JSON data frame: both store every variable as a factor, whereas the JSON data frame stores title, author, publisher and website as character, pages as integer and price as numeric. From a structure perspective, the HTML and XML data frames are almost identical, differing only in the capitalization of their column names (Title versus title), while the JSON data frame holds the same values with different data types.
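The comparison described here can also be done programmatically. The sketch below puts the column classes of an all-factor data frame and a JSON-style data frame side by side, then checks whether the two agree once names and types are harmonized. The stand-ins html_demo and json_demo and the helper harmonize_books() are hypothetical, and only three of the six columns are used.

```r
# Hand-built stand-ins using values printed in the report.
# html_demo mimics the all-factor HTML/XML result; json_demo mimics the JSON result.
titles <- c("Harry Potter and the Order of the Phoenix", "The Last Olympian",
            "Hour of the Bees", "Timmy Failure")
html_demo <- data.frame(Title = titles,
                        Pages = c("870", "377", "360", "330"),
                        Price = c("12.99", "19.99", "8.99", "14.99"),
                        stringsAsFactors = TRUE)
json_demo <- data.frame(title = titles,
                        pages = c(870L, 377L, 360L, 330L),
                        price = c(12.99, 19.99, 8.99, 14.99),
                        stringsAsFactors = FALSE)

# side-by-side view of the column classes
class_table <- data.frame(column = names(json_demo),
                          html = unname(sapply(html_demo, class)),
                          json = unname(sapply(json_demo, class)),
                          stringsAsFactors = FALSE)
print(class_table)

# hypothetical helper: lower-case the names, turn factors into character,
# then coerce the two numeric columns
harmonize_books <- function(df) {
  names(df) <- tolower(names(df))
  df[] <- lapply(df, function(col) if (is.factor(col)) as.character(col) else col)
  df$pages <- as.integer(df$pages)
  df$price <- as.numeric(df$price)
  df
}

# TRUE when the harmonized data frames agree in names, types and values
frames_match <- isTRUE(all.equal(harmonize_books(html_demo),
                                 harmonize_books(json_demo)))
print(frames_match)
```

Using all.equal() rather than identical() for the final check tolerates tiny floating-point differences in price, which seems a safer choice when one side starts out as text.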