Homework 7: Investigating XML, HTML and JSON

Introduction

For this assignment, I recorded three pieces of information about each of three books (the authors' names, whether the writing had clarity of style, and whether it was brief) and stored them in an XML, a JSON, and an HTML file. In some cases, individual authors wrote specific chapters; in those instances, each author was judged separately on clarity and brevity. Each file was then read into R, and the resulting objects were compared.

Reading HTML file

library(rvest)  # provides read_html()
bookshtml <- read_html("https://raw.githubusercontent.com/greerda/Data607/main/books.html")
bookshtml

## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n\t\t\t<table border="1">\n<thead><tr>\n<th>Author</th>\n\t\t\t\t\ ...
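
The object printed above is a parsed document, not yet a dataframe. A minimal sketch of the extra step, assuming the rvest package and using a small inline table as a stand-in for books.html (the Clarity and Brevity column names are my assumption; only the Author header is visible in the output above):

```r
library(rvest)

# Hypothetical stand-in for books.html; the real file's columns may differ.
html_src <- '
<html><body>
  <table border="1">
    <thead><tr><th>Author</th><th>Clarity</th><th>Brevity</th></tr></thead>
    <tbody>
      <tr><td>Freeman</td><td>Yes</td><td>Yes</td></tr>
      <tr><td>Lippman</td><td>No</td><td>No</td></tr>
    </tbody>
  </table>
</body></html>'

doc <- read_html(html_src)

# Select the first <table> node and convert it to a plain data frame.
books_html_df <- as.data.frame(html_table(html_element(doc, "table")))
print(books_html_df)
```

The same two calls should work on the document read from the URL, provided the page contains a single table.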

Reading XML file

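# read_html() applies the HTML parser, so the XML is wrapped in <html><body>;
# xml2::read_xml() would parse the file as XML instead.
booksxml <- read_html("https://raw.githubusercontent.com/greerda/Data607/main/books.xml")
booksxml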

## {html_document}
## <html>
## [1] <body><root><books title="Design Patterns"><author author="Freeman"><attr ...
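
To turn XML like this into a dataframe, the attributes have to be pulled out node by node. The sketch below assumes the xml2 package and a hypothetical document shaped like the truncated output above; the `clarity` and `brevity` attribute names are assumptions, since the real file's schema is cut off:

```r
library(xml2)

# Hypothetical XML; only <root>, <books title=...> and <author author=...>
# appear in the real output, the other attributes are illustrative.
xml_src <- '<root>
  <books title="Design Patterns">
    <author author="Freeman" clarity="Yes" brevity="Yes"/>
    <author author="Friedman" clarity="Yes" brevity="Yes"/>
  </books>
  <books title="JQuery">
    <author author="York" clarity="No" brevity="No"/>
  </books>
</root>'

doc <- read_xml(xml_src)
authors <- xml_find_all(doc, "//author")

# Look up each author's parent <books> node one at a time so that the
# title vector has exactly one entry per author.
titles <- vapply(authors,
                 function(a) xml_attr(xml_parent(a), "title"),
                 character(1))

books_xml_df <- data.frame(
  title   = titles,
  author  = xml_attr(authors, "author"),
  clarity = xml_attr(authors, "clarity"),
  brevity = xml_attr(authors, "brevity"),
  stringsAsFactors = FALSE
)
print(books_xml_df)
```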

Reading JSON file

library(jsonlite)  # provides fromJSON()
booksjson <- fromJSON("https://raw.githubusercontent.com/greerda/Data607/main/books.json")
booksjson

##             title                                Author
## 1 Design Patterns Freeman, Friedman, Yes, Yes, Yes, Yes
## 2      C++ Primer        Lippman, Lajoe, No, No, No, No
## 3          JQuery         York, Smith, No, Yes, No, Yes
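
The Author column above holds a nested table for each book rather than a single value. A rough sketch of flattening it to one row per author, assuming the jsonlite package and a hypothetical JSON layout (the `name`, `clarity`, and `brevity` keys are my guesses; the real file may use different ones):

```r
library(jsonlite)

# Hypothetical JSON with the same nesting as the output above.
json_src <- '[
  {"title": "Design Patterns",
   "Author": [{"name": "Freeman",  "clarity": "Yes", "brevity": "Yes"},
              {"name": "Friedman", "clarity": "Yes", "brevity": "Yes"}]},
  {"title": "JQuery",
   "Author": [{"name": "York",  "clarity": "No",  "brevity": "No"},
              {"name": "Smith", "clarity": "Yes", "brevity": "Yes"}]}
]'

nested <- fromJSON(json_src)

# nested$Author is a list with one small data frame per book; repeat the
# title alongside each author row, then stack the pieces.
rows <- lapply(seq_len(nrow(nested)), function(i) {
  data.frame(title = nested$title[i], nested$Author[[i]],
             stringsAsFactors = FALSE)
})
books_json_df <- do.call(rbind, rows)
print(books_json_df)
```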

Comparison of JSON and XML Dataframes

all.equal(booksjson,booksxml)

##  [1] "Names: 2 string mismatches"                                                            
##  [2] "Attributes: < Length mismatch: comparison on first 1 components >"                     
##  [3] "Attributes: < Component \"class\": Lengths (1, 2) differ (string compare on first 1) >"
##  [4] "Attributes: < Component \"class\": 1 string mismatch >"                                
##  [5] "Component 1: Modes: character, externalptr"                                            
##  [6] "Component 1: Lengths: 3, 1"                                                            
##  [7] "Component 1: target is character, current is externalptr"                              
##  [8] "Component 2: Modes: list, externalptr"                                                 
##  [9] "Component 2: Lengths: 3, 1"                                                            
## [10] "Component 2: current is not list-like"

Comparison of JSON and HTML Dataframes

all.equal(booksjson,bookshtml)

##  [1] "Names: 2 string mismatches"                                                            
##  [2] "Attributes: < Length mismatch: comparison on first 1 components >"                     
##  [3] "Attributes: < Component \"class\": Lengths (1, 2) differ (string compare on first 1) >"
##  [4] "Attributes: < Component \"class\": 1 string mismatch >"                                
##  [5] "Component 1: Modes: character, externalptr"                                            
##  [6] "Component 1: Lengths: 3, 1"                                                            
##  [7] "Component 1: target is character, current is externalptr"                              
##  [8] "Component 2: Modes: list, externalptr"                                                 
##  [9] "Component 2: Lengths: 3, 1"                                                            
## [10] "Component 2: current is not list-like"

Comparison of HTML and XML Dataframes

all.equal(bookshtml,booksxml)

## [1] TRUE
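
The earlier comparisons report the parsed documents' components as mode externalptr, so comparing the document objects directly may not say much about what they contain. One possible alternative, sketched here with the xml2 package and two deliberately different, hypothetical documents, is to compare the values extracted from them:

```r
library(xml2)

# Two small hypothetical documents with different author attributes.
doc_a <- read_xml('<root><author author="Freeman"/></root>')
doc_b <- read_xml('<root><author author="Lippman"/></root>')

# Extract plain character vectors and compare those instead of the
# pointer-holding document objects.
authors_a <- xml_attr(xml_find_all(doc_a, "//author"), "author")
authors_b <- xml_attr(xml_find_all(doc_b, "//author"), "author")

same_content <- isTRUE(all.equal(authors_a, authors_b))
print(same_content)
```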

Conclusion: Are the Dataframes Identical?

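all.equal() reports the HTML and XML objects as equal, but that does not make them identical. Both files were read with read_html(), so both are parsed documents of the same class whose contents sit behind external pointers, and the TRUE result most likely reflects that shared structure rather than matching contents; the printed documents do differ (the HTML file has a <head>, the XML file does not). It still makes sense that the two formats can be treated alike, because XML and HTML both derive from the Standard Generalized Markup Language (SGML) ISO standard, so the same tactics and techniques can be used to extract the data from each file type. They aren't identical because of the inherent differences between XML and HTML.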
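
The difference between "equal" and "identical" in R can be seen with a small base R sketch. The values are arbitrary and chosen only to fall inside all.equal()'s default numeric tolerance:

```r
x <- 1
y <- 1 + 1e-10  # differs from x by far less than the default tolerance

# all.equal() tests approximate equality; identical() tests exact sameness.
near_equal   <- isTRUE(all.equal(x, y))
exactly_same <- identical(x, y)

print(c(near_equal = near_equal, exactly_same = exactly_same))
```

Wrapping all.equal() in isTRUE() matters because, when objects differ, it returns a character vector of messages (as in the comparisons above) rather than FALSE.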

The XML and HTML objects are not equal to the JSON object. JSON has a very different syntax and structure from either XML or HTML, and fromJSON() returned a dataframe rather than a parsed document.