library(data.table)
library(ggplot2)
library(janitor)
library(tidyverse)
library(knitr)
library(XML)
library(kableExtra)
library(htmltools)
library(rjson)
dir.home = '~/git/CUNY.MDS/DATA607/'
setwd(dir.home)
For reference, here is the book data we'll be working with:
fread('books.csv') %>% kable() %>% kable_styling(bootstrap_options = "basic")
| Title | Author | Attribute.1 | Attribute.2 | Attribute.3 |
|---|---|---|---|---|
| And Then There Were None | Agatha Christie | Thrilling | Suspenseful | |
| Dreamland | Sam Quinones | Informative | Saddening | Detailed |
| The Great Gatsby | Scott Fitzgerald | Thoughtful | Tragic | |
I started by manually creating an XML file:
## <?xml version="1.0" encoding="UTF-8"?>
## <bookstore>
## <book>
## <title>And Then There Were None</title>
## <Author>Agatha Christie</Author>
## <Attribute.1>Thrilling</Attribute.1>
## <Attribute.2>Suspenseful</Attribute.2>
## </book>
## <book>
## <title>Dreamland</title>
## <Author>Sam Quinones</Author>
## <Attribute.1>Informative</Attribute.1>
## <Attribute.2>Saddening</Attribute.2>
## <Attribute.3>Detailed</Attribute.3>
## </book>
## <book>
## <title>The Great Gatsby</title>
## <Author>Scott Fitzgerald</Author>
## <Attribute.1>Thoughtful</Attribute.1>
## <Attribute.2>Tragic</Attribute.2>
## </book>
## </bookstore>
From here, we'll use xmlToDataFrame() from the XML package (it is not part of base R) to convert the data to a data frame:
xmlToDataFrame(vec.xml) %>% kable() %>% kable_styling(bootstrap_options = "basic")
| title | Author | Attribute.1 | Attribute.2 | Attribute.3 |
|---|---|---|---|---|
| And Then There Were None | Agatha Christie | Thrilling | Suspenseful | NA |
| Dreamland | Sam Quinones | Informative | Saddening | Detailed |
| The Great Gatsby | Scott Fitzgerald | Thoughtful | Tragic | NA |
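As a minimal, self-contained sketch of the same conversion, the snippet below parses a shortened copy of the XML from a string rather than from a file. The names xml.txt, doc, and df.xml are illustrative assumptions and do not appear in the original workflow. It is meant to show that a book without an Attribute.3 element ends up with NA in that column.

```r
library(XML)

# Hypothetical inline copy of two <book> records (the original reads a file).
xml.txt <- '<bookstore>
  <book>
    <title>And Then There Were None</title>
    <Author>Agatha Christie</Author>
    <Attribute.1>Thrilling</Attribute.1>
    <Attribute.2>Suspenseful</Attribute.2>
  </book>
  <book>
    <title>Dreamland</title>
    <Author>Sam Quinones</Author>
    <Attribute.1>Informative</Attribute.1>
    <Attribute.2>Saddening</Attribute.2>
    <Attribute.3>Detailed</Attribute.3>
  </book>
</bookstore>'

# asText = TRUE tells the parser that the argument is XML content, not a path.
doc <- xmlParse(xml.txt, asText = TRUE)

# Each child of the root node becomes one row; missing elements become NA.
df.xml <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
print(df.xml)
```

Parsing from a string keeps the example independent of the working directory set at the top of the document.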
Similar to the XML file, I manually created an HTML file:
## <html>
## <table>
## <tr>
## <td>Title</td>
## <td>Author</td>
## <td>Attribute.1</td>
## <td>Attribute.2</td>
## <td>Attribute.3</td>
## </tr>
## <tr>
## <td>And Then There Were None</td>
## <td>Agatha Christie</td>
## <td>Thrilling</td>
## <td>Suspenseful</td>
## </tr>
## <tr>
## <td>Dreamland</td>
## <td>Sam Quinones</td>
## <td>Informative</td>
## <td>Saddening</td>
## <td>Detailed</td>
## </tr>
## <tr>
## <td>The Great Gatsby</td>
## <td>Scott Fitzgerald</td>
## <td>Thoughtful</td>
## </tr>
## </table>
## </html>
From here, we'll use the XML package to parse the file and extract the table:
html.tbl = htmlParse(vec.html) %>% readHTMLTable()
html.tbl$`NULL` %>% kable() %>% kable_styling(bootstrap_options = "basic")
| Title | Author | Attribute.1 | Attribute.2 | Attribute.3 |
|---|---|---|---|---|
| And Then There Were None | Agatha Christie | Thrilling | Suspenseful | NA |
| Dreamland | Sam Quinones | Informative | Saddening | Detailed |
| The Great Gatsby | Scott Fitzgerald | Thoughtful | NA | NA |
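The lookup by the name `NULL` works because the table has no id attribute, so readHTMLTable() returns it as an unnamed list entry. A possible alternative, sketched below under the assumption that the page contains a single table, is to ask for the table by position with the which argument. The names html.txt, doc, and tbl are illustrative, and the HTML is a shortened inline copy rather than the original file.

```r
library(XML)

# Hypothetical inline copy of the table; the first data row has only four cells.
html.txt <- '<html><table>
<tr><td>Title</td><td>Author</td><td>Attribute.1</td><td>Attribute.2</td><td>Attribute.3</td></tr>
<tr><td>And Then There Were None</td><td>Agatha Christie</td><td>Thrilling</td><td>Suspenseful</td></tr>
<tr><td>Dreamland</td><td>Sam Quinones</td><td>Informative</td><td>Saddening</td><td>Detailed</td></tr>
</table></html>'

# asText = TRUE tells the parser that the argument is HTML content, not a path.
doc <- htmlParse(html.txt, asText = TRUE)

# which = 1 returns the first table directly as a data frame instead of a list.
tbl <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
print(tbl)
```

Selecting by position avoids depending on how unnamed tables are labelled in the returned list. Rows with fewer cells than the header are padded with NA, as in the output above.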
I manually created a JSON file:
## [
## {
## "Title": "And Then There Were None",
## "Author": "Agatha Christie",
## "Attribute.1": "Thrilling",
## "Attribute.2": "Suspenseful",
## "Attribute.3": ""
## },
## {
## "Title": "Dreamland",
## "Author": "Sam Quinones",
## "Attribute.1": "Informative",
## "Attribute.2": "Saddening",
## "Attribute.3": "Detailed"
## },
## {
## "Title": "The Great Gatsby",
## "Author": "Scott Fitzgerald",
## "Attribute.1": "Thoughtful",
## "Attribute.2": "Tragic",
## "Attribute.3": ""
## }
## ]
lst.json = fromJSON(vec.json)
do.call(rbind, lst.json) %>% kable() %>% kable_styling(bootstrap_options = "basic")
| Title | Author | Attribute.1 | Attribute.2 | Attribute.3 |
|---|---|---|---|---|
| And Then There Were None | Agatha Christie | Thrilling | Suspenseful | |
| Dreamland | Sam Quinones | Informative | Saddening | Detailed |
| The Great Gatsby | Scott Fitzgerald | Thoughtful | Tragic | |
NA values
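The JSON table above shows empty cells where the XML and HTML tables show NA, because the JSON file stores a missing attribute as an empty string. One way to bring the three results in line, sketched here as an assumption rather than as the original author's method, is to build a regular data frame from the parsed list and recode empty strings as NA. The names json.txt, lst, and df.json are illustrative, and the records are shortened.

```r
library(rjson)

# Hypothetical inline copy of two records (the original reads a file).
json.txt <- '[
  {"Title": "And Then There Were None", "Author": "Agatha Christie", "Attribute.3": ""},
  {"Title": "Dreamland", "Author": "Sam Quinones", "Attribute.3": "Detailed"}
]'

lst <- fromJSON(json_str = json.txt)

# Build one single-row data frame per record, then stack the rows.
df.json <- do.call(rbind, lapply(lst, function(rec) {
  as.data.frame(as.list(rec), stringsAsFactors = FALSE)
}))

# Recode empty strings as NA to match the XML and HTML results.
df.json[df.json == ""] <- NA
print(df.json)
```

Unlike do.call(rbind, lst.json), which returns a matrix, this produces a data frame with one character column per field, so is.na() and similar tools behave as they do for the other two formats.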