Introduction

In this assignment, we create one HTML file, one XML file, and one JSON file, each holding information on our favorite books on a single subject. At least one book must have more than one author, and each book record should include the title, the author(s), and two or three additional attributes. Once the three files are created, we load them into R and check whether the resulting data frames are identical.

Load Libraries

library("XML")
library("RCurl")
library("jsonlite")

Create each file

Each file was created separately and can be found in the GitHub repository below.

HTML File : https://raw.githubusercontent.com/anilak1978/working-with-xml-json/master/book.html

XML File: https://raw.githubusercontent.com/anilak1978/working-with-xml-json/master/book.xml

JSON File: https://raw.githubusercontent.com/anilak1978/working-with-xml-json/master/book.json

Read HTML File and understand file structure

# create the url
book_html_url <- "https://raw.githubusercontent.com/anilak1978/working-with-xml-json/master/book.html"

# collect the source of the html page
html <- getURLContent(book_html_url)

# parse html table
book_html <- readHTMLTable(html)
book_html <- book_html[[1]]
book_html

                                      Title                        Author
1 Harry Potter and the Order of the Phoenix                 J. K. Rowling
2                         The Last Olympian   Rick Riordan, Percy Jackson
3                          Hour of the Bees Lindsay Eagar, Kristina Closs
4                             Timmy Failure                Stephan Pastis
         Publisher Pages Price
1       Scholastic   870 12.99
2           Disney   377 19.99
3 Candlewick Press   360  8.99
4 Candlewick Press   330 14.99
                                                                            Website
1                              https://kids.scholastic.com/kids/books/harry-potter/
2             https://www.readriordan.com/book/the-last-olympian-the-graphic-novel/
3                     https://www.goodreads.com/book/show/22453777-hour-of-the-bees
4 https://www.goodreads.com/book/show/37976815-it-s-the-end-when-i-say-it-s-the-end

# Look at the structure
str(book_html)

'data.frame':   4 obs. of  6 variables:
 $ Title    : Factor w/ 4 levels "Harry Potter and the Order of the Phoenix",..: 1 3 2 4
 $ Author   : Factor w/ 4 levels "J. K. Rowling",..: 1 3 2 4
 $ Publisher: Factor w/ 3 levels "Candlewick Press",..: 3 2 1 1
 $ Pages    : Factor w/ 4 levels "330","360","377",..: 4 3 2 1
 $ Price    : Factor w/ 4 levels "12.99","14.99",..: 1 3 4 2
 $ Website  : Factor w/ 4 levels "https://kids.scholastic.com/kids/books/harry-potter/",..: 1 4 2 3
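
Every column above is a factor, including Pages and Price. As a minimal sketch, the table could instead be read as character columns and converted explicitly. This assumes that `readHTMLTable()` passes `stringsAsFactors` on to the data frame construction, which is how the argument is commonly used. The inline HTML is a hypothetical two-row stand-in for book.html, not the actual file.

```r
library("XML")

# Hypothetical two-row stand-in for book.html (not the real file)
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>Title</th><th>Pages</th><th>Price</th></tr>",
  "<tr><td>Hour of the Bees</td><td>360</td><td>8.99</td></tr>",
  "<tr><td>Timmy Failure</td><td>330</td><td>14.99</td></tr>",
  "</table></body></html>"
)

# Parse the text, then read the first table without creating factors
doc <- htmlParse(html_text, asText = TRUE)
books <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)[[1]]

# Convert the numeric-looking columns explicitly
books$Pages <- as.integer(books$Pages)
books$Price <- as.numeric(books$Price)
str(books)
```

Converting from character avoids the usual pitfall with factors, where `as.integer()` returns the level codes rather than the printed values.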

Read XML File and understand file structure

# create the url
book_xml_url <- "https://raw.githubusercontent.com/anilak1978/working-with-xml-json/master/book.xml"

# collect the source of the xml
book_xml <- getURLContent(book_xml_url)

# parse the xml and create the dataframe
book_xml <- xmlParse(book_xml)
book_xml <- xmlToDataFrame(book_xml)
book_xml

                                      title                        author
1 Harry Potter and the Order of the Phoenix                 J. K. Rowling
2                         The Last Olympian   Rick Riordan, Percy Jackson
3                          Hour of the Bees Lindsay Eagar, Kristina Closs
4                             Timmy Failure                Stephan Pastis
         publisher pages price
1       Scholastic   870 12.99
2           Disney   377 19.99
3 Candlewick Press   360  8.99
4 Candlewick Press   330 14.99
                                                                            website
1                              https://kids.scholastic.com/kids/books/harry-potter/
2             https://www.readriordan.com/book/the-last-olympian-the-graphic-novel/
3                     https://www.goodreads.com/book/show/22453777-hour-of-the-bees
4 https://www.goodreads.com/book/show/37976815-it-s-the-end-when-i-say-it-s-the-end

# look at the structure
str(book_xml)

'data.frame':   4 obs. of  6 variables:
 $ title    : Factor w/ 4 levels "Harry Potter and the Order of the Phoenix",..: 1 3 2 4
 $ author   : Factor w/ 4 levels "J. K. Rowling",..: 1 3 2 4
 $ publisher: Factor w/ 3 levels "Candlewick Press",..: 3 2 1 1
 $ pages    : Factor w/ 4 levels "330","360","377",..: 4 3 2 1
 $ price    : Factor w/ 4 levels "12.99","14.99",..: 1 3 4 2
 $ website  : Factor w/ 4 levels "https://kids.scholastic.com/kids/books/harry-potter/",..: 1 4 2 3
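
`xmlToDataFrame()` works here because the file is assumed to have a flat layout: a root element holding one child per book, each with simple child elements for the fields. The sketch below illustrates that mapping on a hypothetical inline document. The element names `books` and `book` are our own assumptions and may differ from those in book.xml.

```r
library("XML")

# Hypothetical flat layout: root -> one record element per book -> fields
xml_text <- paste0(
  "<books>",
  "<book><title>Hour of the Bees</title><pages>360</pages><price>8.99</price></book>",
  "<book><title>Timmy Failure</title><pages>330</pages><price>14.99</price></book>",
  "</books>"
)

doc <- xmlParse(xml_text, asText = TRUE)

# Each record element becomes a row, each field element becomes a column
books <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
names(books)
nrow(books)
```

XML carries no type information for these fields, so they arrive as text and need an explicit conversion before any arithmetic.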

Read JSON File and understand file structure

# create the url
book_json_url <- "https://raw.githubusercontent.com/anilak1978/working-with-xml-json/master/book.json"

# collect the json source and create the table
book_json <- fromJSON(book_json_url)
book_json <- book_json[[1]]
book_json

                                      title                        author
1 Harry Potter and the Order of the Phoenix                 J. K. Rowling
2                         The Last Olympian   Rick Riordan, Percy Jackson
3                          Hour of the Bees Lindsay Eagar, Kristina Closs
4                             Timmy Failure                Stephan Pastis
         publisher pages price
1       Scholastic   870 12.99
2           Disney   377 19.99
3 Candlewick Press   360  8.99
4 Candlewick Press   330 14.99
                                                                            website
1                              https://kids.scholastic.com/kids/books/harry-potter/
2             https://www.readriordan.com/book/the-last-olympian-the-graphic-novel/
3                     https://www.goodreads.com/book/show/22453777-hour-of-the-bees
4 https://www.goodreads.com/book/show/37976815-it-s-the-end-when-i-say-it-s-the-end

# Look at structure
str(book_json)

'data.frame':   4 obs. of  6 variables:
 $ title    : chr  "Harry Potter and the Order of the Phoenix" "The Last Olympian" "Hour of the Bees" "Timmy Failure"
 $ author   : chr  "J. K. Rowling" "Rick Riordan, Percy Jackson" "Lindsay Eagar, Kristina Closs" "Stephan Pastis"
 $ publisher: chr  "Scholastic" "Disney" "Candlewick Press" "Candlewick Press"
 $ pages    : int  870 377 360 330
 $ price    : num  12.99 19.99 8.99 14.99
 $ website  : chr  "https://kids.scholastic.com/kids/books/harry-potter/" "https://www.readriordan.com/book/the-last-olympian-the-graphic-novel/" "https://www.goodreads.com/book/show/22453777-hour-of-the-bees" "https://www.goodreads.com/book/show/37976815-it-s-the-end-when-i-say-it-s-the-end"
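
The `book_json[[1]]` step suggests that the file wraps its records in a single top-level key, so `fromJSON()` returns a list whose first element is the data frame. A minimal sketch of that behavior follows. The key name `books` is our assumption and may not match the one in book.json.

```r
library("jsonlite")

# Hypothetical JSON with one top-level key holding an array of records
json_text <- '{
  "books": [
    {"title": "Hour of the Bees", "pages": 360, "price": 8.99},
    {"title": "Timmy Failure",    "pages": 330, "price": 14.99}
  ]
}'

parsed <- fromJSON(json_text)

# The array of records is simplified into a data frame inside the list
books <- parsed[[1]]
str(books)
```

Unlike HTML and XML, JSON distinguishes strings from numbers, which is why pages and price arrive as numeric types without any conversion.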

Conclusion

After loading the HTML, XML, and JSON files into R, we see that the HTML and XML data frames are more similar to each other than either is to the JSON data frame. Both the HTML and XML tables store every variable as a factor, whereas the JSON table stores title, author, publisher, and website as character, pages as integer, and price as numeric. Structurally, the HTML and XML data frames are almost identical, differing only in the capitalization of the column names. The JSON data frame holds the same values but with different data types, so the three data frames are not identical.
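
The comparison can also be made programmatically. The sketch below uses small hypothetical stand-ins for the parsed tables and a helper of our own, `normalize_books()`, which lower-cases the column names and re-infers the column types. It is one possible way to reconcile the frames, not part of the original analysis.

```r
# Hypothetical stand-ins for the parsed tables (two books, three columns)
from_html <- data.frame(
  Title = c("Hour of the Bees", "Timmy Failure"),
  Pages = c("360", "330"),
  Price = c("8.99", "14.99"),
  stringsAsFactors = TRUE   # mimics the factor columns shown by str()
)
from_json <- data.frame(
  title = c("Hour of the Bees", "Timmy Failure"),
  pages = c(360L, 330L),
  price = c(8.99, 14.99),
  stringsAsFactors = FALSE
)

# Helper (our own name): lower-case headers and re-infer column types
normalize_books <- function(df) {
  names(df) <- tolower(names(df))
  df[] <- lapply(df, function(col) {
    type.convert(as.character(col), as.is = TRUE)
  })
  df
}

# Raw frames differ in names and types
same_before <- identical(from_html, from_json)

# After normalization the frames can be compared on their values
same_after <- isTRUE(all.equal(normalize_books(from_html),
                               normalize_books(from_json)))
```

Going through `as.character()` first matters for factors, since converting a factor directly to a number would return its level codes.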

Working with XML and JSON in R

2019-10-10