Homework 7

I created three files containing information on books, one in each of three formats: HTML, XML and JSON. The goal is to load each file into R and parse it into a data frame.

HTML

Read the HTML from the URL as text.

Parse the HTML document and keep only the body section.
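A minimal sketch of these two steps, assuming the XML package (the `XMLNodeList` class in the output suggests it). The URL is not shown in this write-up, so the inline string `html_text` is a hypothetical stand-in for the downloaded page.

```r
library(XML)

# Stand-in for the page text; the real URL is not shown in this write-up.
# With a live URL, something like paste(readLines(url), collapse = "\n")
# would produce an equivalent string.
html_text <- paste0(
  "<html><head><title>Books</title></head><body><table>",
  "<tr><th>title</th><th>author</th></tr>",
  "<tr><td>The Art of Seduction</td><td>Robert Greene</td></tr>",
  "</table></body></html>"
)

# asText = TRUE marks the input as content rather than a file name.
# The result is an R-level tree; keep only the <body> node.
doc  <- htmlTreeParse(html_text, asText = TRUE)
root <- xmlRoot(doc)
body <- root[["body"]]

# Names of the children of <body>
names(xmlChildren(body))
```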

Navigate to the table child and inspect it

## $table
## <table>
##  <tr>
##   <th>title</th>
##   <th>author</th>
##   <th>yearpub</th>
##   <th>pages</th>
##   <th>price</th>
##  </tr>
##  <tr>
##   <td>The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life</td>
##   <td>Mark Mason</td>
##   <td>2016</td>
##   <td>224</td>
##   <td>23.95</td>
##  </tr>
##  <tr>
##   <td>The Art of Seduction</td>
##   <td>Robert Greene</td>
##   <td>2001</td>
##   <td>468</td>
##   <td>24.49</td>
##  </tr>
##  <tr>
##   <td>Disarming the Narcissist: Surviving and Thriving with the Self-Absorbed</td>
##   <td>Wendy T. Behary, Daniel J. Siegel</td>
##   <td>2013</td>
##   <td>249</td>
##   <td>20.99</td>
##  </tr>
## </table>
## 
## attr(,"class")
## [1] "XMLNodeList"

Get the column headers from the first tr element and the values from the others
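The row-by-row extraction could look roughly like the sketch below. It assumes the XML package and a shortened, hypothetical copy of the table (three columns, two books); the helper `row_values` is a name introduced here for illustration only.

```r
library(XML)

# Hypothetical, shortened stand-in for the books table
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>title</th><th>yearpub</th><th>price</th></tr>",
  "<tr><td>The Art of Seduction</td><td>2001</td><td>24.49</td></tr>",
  "<tr><td>Disarming the Narcissist: Surviving and Thriving with ",
  "the Self-Absorbed</td><td>2013</td><td>20.99</td></tr>",
  "</table></body></html>"
)

body  <- xmlRoot(htmlTreeParse(html_text, asText = TRUE))[["body"]]
table <- body[["table"]]
rows  <- unname(xmlChildren(table))   # one entry per <tr>

# Text of every cell in a row, as an unnamed character vector
row_values <- function(tr) unname(sapply(xmlChildren(tr), xmlValue))

headers <- row_values(rows[[1]])                      # <th> cells, first row
values  <- do.call(rbind, lapply(rows[-1], row_values))  # <td> cells, the rest

books_html <- as.data.frame(values, stringsAsFactors = FALSE)
names(books_html) <- headers

# Every cell arrives as text; convert the numeric columns explicitly
books_html$yearpub <- as.integer(books_html$yearpub)
books_html$price   <- as.numeric(books_html$price)
```

Binding the rows with `rbind` keeps one book per row, so the header vector can be assigned directly as the column names.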

Table

Books Extracted from HTML

| title | author | yearpub | pages | price |
|---|---|---|---|---|
| The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life | Mark Mason | 2016 | 224 | 23.95 |
| The Art of Seduction | Robert Greene | 2001 | 468 | 24.49 |
| Disarming the Narcissist: Surviving and Thriving with the Self-Absorbed | Wendy T. Behary, Daniel J. Siegel | 2013 | 249 | 20.99 |

XPath

Create the HTML data frame via XPath.

Parse the HTML document. Note that we are not using htmlTreeParse() this time.

Get the column headers from the path to /th and the values from the path to /td
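A sketch of the XPath variant, again assuming the XML package and a hypothetical inline table. The exact paths used in the homework are not shown, so `//table//th` and `//table//td` are assumptions.

```r
library(XML)

# Hypothetical, shortened stand-in for the books table
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>title</th><th>yearpub</th><th>price</th></tr>",
  "<tr><td>The Art of Seduction</td><td>2001</td><td>24.49</td></tr>",
  "<tr><td>Disarming the Narcissist: Surviving and Thriving with ",
  "the Self-Absorbed</td><td>2013</td><td>20.99</td></tr>",
  "</table></body></html>"
)

# htmlParse() builds an internal (C-level) document, which XPath queries need
doc <- htmlParse(html_text, asText = TRUE)

headers <- xpathSApply(doc, "//table//th", xmlValue)
cells   <- xpathSApply(doc, "//table//td", xmlValue)

# Cells come back in document order, so fill the matrix row by row
books_xpath <- as.data.frame(
  matrix(cells, ncol = length(headers), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(books_xpath) <- headers
```

Because the `td` cells are returned as one flat vector, this approach relies on every row having the same number of cells as the header row.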

Check that the data loaded correctly.

Books Extracted from HTML via XPath

| title | author | yearpub | pages | price |
|---|---|---|---|---|
| The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life | Mark Mason | 2016 | 224 | 23.95 |
| The Art of Seduction | Robert Greene | 2001 | 468 | 24.49 |
| Disarming the Narcissist: Surviving and Thriving with the Self-Absorbed | Wendy T. Behary, Daniel J. Siegel | 2013 | 249 | 20.99 |

XML

Read and parse the XML document and inspect the contents of the root node

## <books>
##  <book>
##   <title>The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life</title>
##   <author>Mark Manson</author>
##   <yearpub>2016</yearpub>
##   <pages>224</pages>
##   <price>23.95</price>
##  </book>
##  <book>
##   <title>The Art of Seduction</title>
##   <author>Robert Greene</author>
##   <yearpub>2001</yearpub>
##   <pages>468</pages>
##   <price>24.49</price>
##  </book>
##  <book>
##   <title>Disarming the Narcissist: Surviving and Thriving with the Self-Absorbed</title>
##   <author>
##    <first>Wendy T. Behary</first>
##    <second>Daniel J. Siegel</second>
##   </author>
##   <yearpub>2013</yearpub>
##   <pages>249</pages>
##   <price>20.99</price>
##  </book>
## </books>

Extract data from nodes. The names of the elements are retained.
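The sketch below shows one way this extraction might be done with the XML package, using a hypothetical, shortened copy of the file. It also illustrates why a nested author element loses its separator: `xmlValue()` returns all the text beneath a node run together. The last step is one possible repair, not part of the original homework.

```r
library(XML)

# Hypothetical stand-in for the XML file, written without whitespace
# between tags so that no blank text nodes are created
xml_text <- paste0(
  "<books>",
  "<book><title>The Art of Seduction</title>",
  "<author>Robert Greene</author><yearpub>2001</yearpub></book>",
  "<book><title>Disarming the Narcissist</title>",
  "<author><first>Wendy T. Behary</first>",
  "<second>Daniel J. Siegel</second></author>",
  "<yearpub>2013</yearpub></book>",
  "</books>"
)

doc   <- xmlParse(xml_text, asText = TRUE)
books <- unname(xmlChildren(xmlRoot(doc)))   # one node per <book>

# Each record is a character vector named after the child elements
records   <- lapply(books, function(b) sapply(xmlChildren(b), xmlValue))
books_xml <- as.data.frame(do.call(rbind, records), stringsAsFactors = FALSE)

# The two names under the second <author> are concatenated with no separator
author_raw <- books_xml$author

# Possible repair: join the values of each child of <author> with ", "
author_nodes <- getNodeSet(doc, "//book/author")
books_xml$author <- sapply(author_nodes, function(a) {
  paste(sapply(xmlChildren(a), xmlValue), collapse = ", ")
})
```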

Table

Books Extracted from XML

| title | author | yearpub | pages | price |
|---|---|---|---|---|
| The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life | Mark Manson | 2016 | 224 | 23.95 |
| The Art of Seduction | Robert Greene | 2001 | 468 | 24.49 |
| Disarming the Narcissist: Surviving and Thriving with the Self-Absorbed | Wendy T. BeharyDaniel J. Siegel | 2013 | 249 | 20.99 |

JSON

Books Extracted from JSON

| books.title | books.author | books.published | books.pages | books.cost |
|---|---|---|---|---|
| The Subtle Art of Not Giving a F*ck: A Counterintuitive Approach to Living a Good Life | Mark Manson | 2016 | 224 | 23.95 |
| Disarming the Narcissist: Surviving and Thriving with the Self-Absorbed | c("Wendy T. Behary", "Daniel J. Siegel") | NA | 249 | 20.99 |
| The Art of Seduction | Robert Greene | 2001 | 468 | 24.49 |

Conclusion

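The data frames are not all identical. The XML table is missing the comma between the names where a book has two authors. The JSON table shows the two author names as a vector, and its column headers carry the “books.” prefix. The JSON table also uses the column names published and cost instead of yearpub and price, lists the books in a different order, and has NA for the publication year of one book. Finally, the HTML source spells the first author's name “Mark Mason”, while the XML and JSON sources have “Mark Manson”.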
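The two JSON issues could be handled roughly as follows. This is a sketch that assumes the jsonlite package and a hypothetical JSON layout; the key names are inferred from the column headers in the JSON table, and the prefix is presumed to come from converting the whole parsed list to a data frame.

```r
library(jsonlite)

# Hypothetical JSON; the real file is not shown in this write-up
json_text <- '{"books": [
  {"title": "The Art of Seduction", "author": "Robert Greene",
   "published": 2001, "pages": 468, "cost": 24.49},
  {"title": "Disarming the Narcissist",
   "author": ["Wendy T. Behary", "Daniel J. Siegel"],
   "published": 2013, "pages": 249, "cost": 20.99}
]}'

# Taking the "books" element directly leaves the column names unprefixed
books_json <- fromJSON(json_text)$books

# author is a list column because one record holds two names;
# collapse each entry to a single comma-separated string
books_json$author <- sapply(books_json$author, paste, collapse = ", ")

# Optionally rename columns to match the HTML and XML tables
names(books_json)[names(books_json) == "published"] <- "yearpub"
names(books_json)[names(books_json) == "cost"]      <- "price"
```

After these steps the author column is plain text, which makes the JSON data frame directly comparable with the other two.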