Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). Create each of these files “by hand” unless you’re already very comfortable with the file formats. Your deliverable is the three source files and the R code.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
Subject | Title | Author | Publisher | ISBN | Pages | Attributes |
---|---|---|---|---|---|---|
Mathematics | Applied Linear Statistical Models | Michael Kutner, William Li, Christopher Nachtsheim, John Neter | McGraw Hill | 9780073108742 | 1396 | Exercises, Illustrations, Readability |
Mathematics | Mathematical Proofs: A Transition to Advanced Mathematics | Gary Chartrand, Ping Zhang, Albert Polimeni | Pearson | 9780321390530 | 365 | Exercises, Readability |
Mathematics | Mathematical Statistics with Resampling and R | Laura Chihara, Tim Hesterberg | Wiley | 9781118029855 | 418 | Exercises, Illustrations, Readability |
library(httr)
library(XML)
library(jsonlite)
HTML is usually obtained through web scraping and is the least structured of the three formats. An HTML table is defined with the <table> tag, each table row with the <tr> tag, each header cell with the <th> tag, and each data cell with the <td> tag. By default, browsers render header cells bold and centered.
<table>
<tr>
<th>Subject</th>
<th>Title</th>
<th>Author</th>
<th>Publisher</th>
<th>ISBN</th>
<th>Pages</th>
<th>Attributes</th>
</tr>
<tr>
<td>Mathematics</td>
<td>Applied Linear Statistical Models</td>
<td>Michael Kutner, William Li, Christopher Nachtsheim, John Neter</td>
<td>McGraw Hill</td>
<td>9780073108742</td>
<td>1396</td>
<td>Exercises, Illustrations, Readability</td>
</tr>
<tr>
<td>Mathematics</td>
<td>Mathematical Proofs: A Transition to Advanced Mathematics</td>
<td>Gary Chartrand, Ping Zhang, Albert Polimeni</td>
<td>Pearson</td>
<td>9780321390530</td>
<td>365</td>
<td>Exercises, Readability</td>
</tr>
<tr>
<td>Mathematics</td>
<td>Mathematical Statistics with Resampling and R</td>
<td>Laura Chihara, Tim Hesterberg</td>
<td>Wiley</td>
<td>9781118029855</td>
<td>418</td>
<td>Exercises, Illustrations, Readability</td>
</tr>
</table>
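# Download the HTML file, decode the response body, and read the table
html <- "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20607/books.html"
html <- GET(html)                # HTTP response object
html <- rawToChar(html$content)  # raw bytes -> character string
html <- htmlParse(html)          # parsed HTML document
html <- readHTMLTable(html)      # list with one data frame per <table>
HTML <- data.frame(html)         # the unnamed table is listed as "NULL", hence the NULL. column prefix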
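The NULL. prefix on the column names appears to come from wrapping the whole list returned by readHTMLTable() in data.frame(). A minimal sketch, assuming the same table layout but with the markup supplied inline (the html_text and HTML_clean names and the reduced set of columns are for illustration only), that extracts the first table directly:

```r
library(XML)

# Inline copy of part of the table above, so the example runs without network access
html_text <- '
<table>
  <tr><th>Title</th><th>Publisher</th><th>Pages</th></tr>
  <tr><td>Applied Linear Statistical Models</td><td>McGraw Hill</td><td>1396</td></tr>
  <tr><td>Mathematical Proofs: A Transition to Advanced Mathematics</td><td>Pearson</td><td>365</td></tr>
  <tr><td>Mathematical Statistics with Resampling and R</td><td>Wiley</td><td>418</td></tr>
</table>'

doc <- htmlParse(html_text, asText = TRUE)

# readHTMLTable() returns a list with one element per <table>;
# taking element [[1]] keeps the header cells as plain column names
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
HTML_clean <- tables[[1]]

# Cell values are read as text, so numeric columns need an explicit conversion
HTML_clean$Pages <- as.integer(HTML_clean$Pages)

str(HTML_clean)
```

Selecting the table by position is the simplest choice when a page holds a single table; a page with several tables would need a more specific selection.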
XML is typically delivered by a web API and is more structured than HTML. XML stands for eXtensible Markup Language. XML does not do anything by itself: it is just information wrapped in tags, and someone must write software to send, receive, store, or display it. XML and HTML were designed with different goals. XML was designed to carry data - with a focus on what the data is. HTML was designed to display data - with a focus on how the data looks. XML tags are not predefined the way HTML tags are. Many computer systems contain data in incompatible formats, and exchanging data between incompatible (or upgraded) systems is a time-consuming task for web developers: large amounts of data must be converted, and incompatible data is often lost. XML stores data in plain text, which provides a software- and hardware-independent way of storing, transporting, and sharing data. XML also makes it easier to expand or upgrade to new operating systems, new applications, or new browsers without losing data.
<textbooks>
<area id="1">
<subject>Mathematics</subject>
<book id="1">
<title>Applied Linear Statistical Models</title>
<publisher>McGraw Hill</publisher>
<isbn>9780073108742</isbn>
<pages>1396</pages>
<author id="1">Michael Kutner</author>
<author id="2">William Li</author>
<author id="3">Christopher Nachtsheim</author>
<author id="4">John Neter</author>
<attribute id="1">Exercises</attribute>
<attribute id="2">Illustrations</attribute>
<attribute id="3">Readability</attribute>
</book>
<book id="2">
<title>Mathematical Proofs: A Transition to Advanced Mathematics</title>
<publisher>Pearson</publisher>
<isbn>9780321390530</isbn>
<pages>365</pages>
<author id="1">Gary Chartrand</author>
<author id="2">Ping Zhang</author>
<author id="3">Albert Polimeni</author>
<attribute id="1">Exercises</attribute>
<attribute id="2">Readability</attribute>
</book>
<book id="3">
<title>Mathematical Statistics with Resampling and R</title>
<publisher>Wiley</publisher>
<isbn>9781118029855</isbn>
<pages>418</pages>
<author id="1">Laura Chihara</author>
<author id="2">Tim Hesterberg</author>
<attribute id="1">Exercises</attribute>
<attribute id="2">Illustrations</attribute>
<attribute id="3">Readability</attribute>
</book>
</area>
</textbooks>
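# Download the XML file, decode the response body, and convert it to a list
xml <- "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20607/books.xml"
xml <- GET(xml)                # HTTP response object
xml <- rawToChar(xml$content)  # raw bytes -> character string
xml <- xmlParse(xml)           # parsed XML document
xml <- xmlToList(xml)          # nested list mirroring the element tree
XML <- data.frame(xml)         # flattens the nested list into one wide row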
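Because data.frame() flattens the nested list into a single wide row, a one-row-per-book layout needs explicit extraction. A minimal sketch, assuming the same element names as the file above but with a shortened inline copy (the xml_text and XML_tidy names and the child_text() helper are hypothetical, introduced only for this example), that uses XPath to build one row per <book>:

```r
library(XML)

# Shortened inline copy of the XML above, so the example runs without network access
xml_text <- '
<textbooks>
  <area id="1">
    <subject>Mathematics</subject>
    <book id="1">
      <title>Applied Linear Statistical Models</title>
      <pages>1396</pages>
      <author id="1">Michael Kutner</author>
      <author id="2">William Li</author>
      <author id="3">Christopher Nachtsheim</author>
      <author id="4">John Neter</author>
    </book>
    <book id="2">
      <title>Mathematical Proofs: A Transition to Advanced Mathematics</title>
      <pages>365</pages>
      <author id="1">Gary Chartrand</author>
      <author id="2">Ping Zhang</author>
      <author id="3">Albert Polimeni</author>
    </book>
    <book id="3">
      <title>Mathematical Statistics with Resampling and R</title>
      <pages>418</pages>
      <author id="1">Laura Chihara</author>
      <author id="2">Tim Hesterberg</author>
    </book>
  </area>
</textbooks>'

doc <- xmlParse(xml_text, asText = TRUE)
book_nodes <- getNodeSet(doc, "//book")

# Hypothetical helper: join the text of all child elements called `name` with ", "
child_text <- function(node, name) {
  paste(xpathSApply(node, paste0("./", name), xmlValue), collapse = ", ")
}

# One row per book; repeated <author> elements collapse into a single string
XML_tidy <- data.frame(
  title  = sapply(book_nodes, child_text, name = "title"),
  pages  = as.integer(sapply(book_nodes, child_text, name = "pages")),
  author = sapply(book_nodes, child_text, name = "author"),
  stringsAsFactors = FALSE
)

str(XML_tidy)
```

Collapsing the authors into one comma-separated string mirrors the HTML table; keeping them as a list column would preserve more of the original structure.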
JSON is also typically delivered by a web API and is more structured than HTML. JSON stands for JavaScript Object Notation. Its syntax is derived from the notation for creating JavaScript objects, so a JavaScript program can convert JSON text into native JavaScript objects with the built-in JSON.parse() function, whereas XML has to be parsed with a separate XML parser. For AJAX applications, JSON is generally faster and easier to work with than XML.
{"Mathematics":
{"book": [
{
"title": "Applied Linear Statistical Models",
"publisher": "McGraw Hill",
"isbn": "9780073108742",
"pages": "1396",
"author": [
"Michael Kutner",
"William Li",
"Christopher Nachtsheim",
"John Neter"
],
"attribute": [
"Exercises",
"Illustrations",
"Readability"
]
},
{
"title": "Mathematical Proofs: A Transition to Advanced Mathematics",
"publisher": "Pearson",
"isbn": "9780321390530",
"pages": "365",
"author": [
"Gary Chartrand",
"Ping Zhang",
"Albert Polimeni"
],
"attribute": [
"Exercises",
"Readability"
]
},
{
"title": "Mathematical Statistics with Resampling and R",
"publisher": "Wiley",
"isbn": "9781118029855",
"pages": "418",
"author": [
"Laura Chihara",
"Tim Hesterberg"
],
"attribute": [
"Exercises",
"Illustrations",
"Readability"
]
}
]}
}
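# Download the JSON file, decode the response body, and convert it to R objects
json <- "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20607/books.json"
json <- GET(json)                # HTTP response object
json <- rawToChar(json$content)  # raw bytes -> character string
json <- fromJSON(json)           # nested list; the "book" array is simplified to a data frame
JSON <- data.frame(json)         # author and attribute remain list columns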
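The author and attribute arrays have different lengths per book, so fromJSON() keeps them as list columns. A minimal sketch, assuming the same JSON layout but with a shortened inline copy (the json_text and JSON_flat names are for illustration only), that collapses the list column into plain text so the result resembles the HTML table:

```r
library(jsonlite)

# Shortened inline copy of the JSON above, so the example runs without network access
json_text <- '{"Mathematics": {"book": [
  {"title": "Applied Linear Statistical Models", "pages": "1396",
   "author": ["Michael Kutner", "William Li", "Christopher Nachtsheim", "John Neter"]},
  {"title": "Mathematical Proofs: A Transition to Advanced Mathematics", "pages": "365",
   "author": ["Gary Chartrand", "Ping Zhang", "Albert Polimeni"]},
  {"title": "Mathematical Statistics with Resampling and R", "pages": "418",
   "author": ["Laura Chihara", "Tim Hesterberg"]}
]}}'

parsed <- fromJSON(json_text)

# The array of book objects is simplified to a data frame;
# author is a list column holding one character vector per book
JSON_flat <- parsed$Mathematics$book

# Collapse each author vector into a single comma-separated string
JSON_flat$author <- sapply(JSON_flat$author, paste, collapse = ", ")

# Pages are stored as JSON strings, so convert them explicitly
JSON_flat$pages <- as.integer(JSON_flat$pages)

str(JSON_flat)
```

Indexing into parsed$Mathematics$book also avoids the Mathematics.book. prefix that data.frame() adds when it is applied to the whole nested list.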
HTML
## NULL.Subject NULL.Title
## 1 Mathematics Applied Linear Statistical Models
## 2 Mathematics Mathematical Proofs: A Transition to Advanced Mathematics
## 3 Mathematics Mathematical Statistics with Resampling and R
## NULL.Author
## 1 Michael Kutner, William Li, Christopher Nachtsheim, John Neter
## 2 Gary Chartrand, Ping Zhang, Albert Polimeni
## 3 Laura Chihara, Tim Hesterberg
## NULL.Publisher NULL.ISBN NULL.Pages
## 1 McGraw Hill 9780073108742 1396
## 2 Pearson 9780321390530 365
## 3 Wiley 9781118029855 418
## NULL.Attributes
## 1 Exercises, Illustrations, Readability
## 2 Exercises, Readability
## 3 Exercises, Illustrations, Readability
XML
## area.subject area.book.title area.book.publisher
## id Mathematics Applied Linear Statistical Models McGraw Hill
## area.book.isbn area.book.pages area.book.author.text
## id 9780073108742 1396 Michael Kutner
## area.book.author..attrs area.book.author.text.1
## id 1 William Li
## area.book.author..attrs.1 area.book.author.text.2
## id 2 Christopher Nachtsheim
## area.book.author..attrs.2 area.book.author.text.3
## id 3 John Neter
## area.book.author..attrs.3 area.book.attribute.text
## id 4 Exercises
## area.book.attribute..attrs area.book.attribute.text.1
## id 1 Illustrations
## area.book.attribute..attrs.1 area.book.attribute.text.2
## id 2 Readability
## area.book.attribute..attrs.2 area.book..attrs
## id 3 1
## area.book.title.1
## id Mathematical Proofs: A Transition to Advanced Mathematics
## area.book.publisher.1 area.book.isbn.1 area.book.pages.1
## id Pearson 9780321390530 365
## area.book.author.text.4 area.book.author..attrs.4
## id Gary Chartrand 1
## area.book.author.text.5 area.book.author..attrs.5
## id Ping Zhang 2
## area.book.author.text.6 area.book.author..attrs.6
## id Albert Polimeni 3
## area.book.attribute.text.3 area.book.attribute..attrs.3
## id Exercises 1
## area.book.attribute.text.4 area.book.attribute..attrs.4
## id Readability 2
## area.book..attrs.1 area.book.title.2
## id 2 Mathematical Statistics with Resampling and R
## area.book.publisher.2 area.book.isbn.2 area.book.pages.2
## id Wiley 9781118029855 418
## area.book.author.text.7 area.book.author..attrs.7
## id Laura Chihara 1
## area.book.author.text.8 area.book.author..attrs.8
## id Tim Hesterberg 2
## area.book.attribute.text.5 area.book.attribute..attrs.5
## id Exercises 1
## area.book.attribute.text.6 area.book.attribute..attrs.6
## id Illustrations 2
## area.book.attribute.text.7 area.book.attribute..attrs.7
## id Readability 3
## area.book..attrs.2 area..attrs
## id 3 1
JSON
## Mathematics.book.title
## 1 Applied Linear Statistical Models
## 2 Mathematical Proofs: A Transition to Advanced Mathematics
## 3 Mathematical Statistics with Resampling and R
## Mathematics.book.publisher Mathematics.book.isbn Mathematics.book.pages
## 1 McGraw Hill 9780073108742 1396
## 2 Pearson 9780321390530 365
## 3 Wiley 9781118029855 418
## Mathematics.book.author
## 1 Michael Kutner, William Li, Christopher Nachtsheim, John Neter
## 2 Gary Chartrand, Ping Zhang, Albert Polimeni
## 3 Laura Chihara, Tim Hesterberg
## Mathematics.book.attribute
## 1 Exercises, Illustrations, Readability
## 2 Exercises, Readability
## 3 Exercises, Illustrations, Readability
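The data frames are not identical. The HTML frame has three rows and seven columns whose names carry a NULL. prefix, the XML frame is a single row with one column per leaf element and id attribute, and the JSON frame has three rows but no subject column and stores the authors and attributes as list columns. According to Automated Data Collection with R: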
XML and other data exchange formats like JSON can store much more complicated data structures. This is what makes them so powerful for data exchange over the Web. Forcing such structures into one common data frame comes at a certain cost - complicated data transformation tasks or the loss of information. xmlToDataFrame() is not an almighty function to achieve the task for which it is named. Rather, we are typically forced to develop and apply own extraction functions.
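The comparison can also be checked programmatically with base R's identical() and all.equal(). The sketch below uses two small hand-built data frames (hypothetical stand-ins for parsed results, not the actual HTML, XML, and JSON objects above) to show that a difference in column type alone is enough to make two frames non-identical:

```r
# Hypothetical stand-ins for two parsed results: same values, different column types
from_html <- data.frame(
  title = c("Applied Linear Statistical Models",
            "Mathematical Statistics with Resampling and R"),
  pages = c("1396", "418"),   # text, as read from markup
  stringsAsFactors = FALSE
)
from_json <- data.frame(
  title = c("Applied Linear Statistical Models",
            "Mathematical Statistics with Resampling and R"),
  pages = c(1396L, 418L),     # already converted to integer
  stringsAsFactors = FALSE
)

# identical() is a strict yes/no test
same_before <- identical(from_html, from_json)

# all.equal() returns a description of the differences when there are any
differences <- all.equal(from_html, from_json)

# Harmonise the column type, then compare again
from_html$pages <- as.integer(from_html$pages)
same_after <- isTRUE(all.equal(from_html, from_json))

print(same_before)
print(differences)
print(same_after)
```

Here same_before is FALSE because of the type mismatch in pages, while same_after is TRUE once the column types agree. The same idea applies to the three parsed frames: they only become comparable after their column names, types, and nesting have been harmonised.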