Import Libraries

library(data.table)
library(ggplot2)
library(janitor)
library(tidyverse)
library(knitr)
library(XML)
library(kableExtra)
library(htmltools)
library(rjson)

dir.home = '~/git/CUNY.MDS/DATA607/'
setwd(dir.home)

Data Preview

For reference, here is the book data we'll be working with:

fread('books.csv') %>% kable() %>% kable_styling(bootstrap_options = "basic")

Title	Author	Attribute.1	Attribute.2	Attribute.3
And Then There Were None	Agatha Christie	Thrilling	Suspenseful
Dreamland	Sam Quinones	Informative	Saddening	Detailed
The Great Gatsby	Scott Fitzgerald	Thoughtful	Tragic

XML File

I started by manually creating an XML file:

## <?xml version="1.0" encoding="UTF-8"?>
## <bookstore>
##   <book>
##     <title>And Then There Were None</title>
##     <Author>Agatha Christie</Author>
##     <Attribute.1>Thrilling</Attribute.1>
##     <Attribute.2>Suspenseful</Attribute.2>
##   </book>
##   <book>
##     <title>Dreamland</title>
##     <Author>Sam Quinones</Author>
##     <Attribute.1>Informative</Attribute.1>
##     <Attribute.2>Saddening</Attribute.2>
##     <Attribute.3>Detailed</Attribute.3>
##   </book>
##   <book>
##     <title>The Great Gatsby</title>
##     <Author>Scott Fitzgerald</Author>
##     <Attribute.1>Thoughtful</Attribute.1>
##     <Attribute.2>Tragic</Attribute.2>
##   </book>
## </bookstore>

From here, we'll use xmlToDataFrame() from the XML package to convert the data:

xmlToDataFrame(vec.xml) %>% kable() %>% kable_styling(bootstrap_options = "basic")

title	Author	Attribute.1	Attribute.2	Attribute.3
And Then There Were None	Agatha Christie	Thrilling	Suspenseful	NA
Dreamland	Sam Quinones	Informative	Saddening	Detailed
The Great Gatsby	Scott Fitzgerald	Thoughtful	Tragic	NA
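
As a self-contained check of this step, the sketch below parses a shortened inline copy of the file instead of vec.xml, whose definition is not shown above. The names txt.xml, doc.xml, and df.xml are assumptions made for illustration. Because the first record has no Attribute.3 element, the conversion should leave that cell as NA, matching the table above.

```r
library(XML)

# Assumption: an inline, two-record copy of the XML file shown earlier
txt.xml <- '<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book>
    <title>And Then There Were None</title>
    <Author>Agatha Christie</Author>
    <Attribute.1>Thrilling</Attribute.1>
    <Attribute.2>Suspenseful</Attribute.2>
  </book>
  <book>
    <title>Dreamland</title>
    <Author>Sam Quinones</Author>
    <Attribute.1>Informative</Attribute.1>
    <Attribute.2>Saddening</Attribute.2>
    <Attribute.3>Detailed</Attribute.3>
  </book>
</bookstore>'

# Parse the string (asText = TRUE) and flatten each <book> into one row
doc.xml <- xmlParse(txt.xml, asText = TRUE)
df.xml <- xmlToDataFrame(doc.xml, stringsAsFactors = FALSE)

# Elements missing from a record come back as NA
print(df.xml)
```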

HTML File

Similar to the XML file, I manually created an HTML file:

## <html>
## <table>
##  <tr>
##   <td>Title</td>
##   <td>Author</td>
##   <td>Attribute.1</td>
##   <td>Attribute.2</td>
##   <td>Attribute.3</td>
##  </tr>
##  <tr>
##   <td>And Then There Were None</td>
##   <td>Agatha Christie</td>
##   <td>Thrilling</td>
##   <td>Suspenseful</td>
##  </tr>
##  <tr>
##   <td>Dreamland</td>
##   <td>Sam Quinones</td>
##   <td>Informative</td>
##   <td>Saddening</td>
##   <td>Detailed</td>
##  </tr>
##  <tr>
##   <td>The Great Gatsby</td>
##   <td>Scott Fitzgerald</td>
##   <td>Thoughtful</td>
##  </tr>
## </table>
## </html>

From here, we'll use the XML package to parse the file:

html.tbl = htmlParse(vec.html) %>% readHTMLTable()
html.tbl$`NULL` %>% kable() %>%  kable_styling(bootstrap_options = "basic")

Title	Author	Attribute.1	Attribute.2	Attribute.3
And Then There Were None	Agatha Christie	Thrilling	Suspenseful	NA
Dreamland	Sam Quinones	Informative	Saddening	Detailed
The Great Gatsby	Scott Fitzgerald	Thoughtful	NA	NA
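
The same behavior can be reproduced on a smaller table. The sketch below is only an illustration: txt.html, doc.html, tbls, and df.html are assumed names, and the inline table is a trimmed copy of the file above. The last row has fewer cells than the header row, so the missing cell should be padded with NA.

```r
library(XML)

# Assumption: a trimmed inline copy of the HTML table shown earlier
txt.html <- '<html>
<table>
 <tr><td>Title</td><td>Author</td><td>Attribute.1</td></tr>
 <tr><td>Dreamland</td><td>Sam Quinones</td><td>Informative</td></tr>
 <tr><td>The Great Gatsby</td><td>Scott Fitzgerald</td></tr>
</table>
</html>'

# Parse the string, then extract every <table> as a data frame
doc.html <- htmlParse(txt.html, asText = TRUE)
tbls <- readHTMLTable(doc.html, header = TRUE, stringsAsFactors = FALSE)

# Take the first (and only) table by position
df.html <- tbls[[1]]
print(df.html)
```

Indexing with [[1]] avoids depending on the `NULL` list name, which appears when the table has no id attribute.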

JSON File

I manually created a JSON file, using empty strings for the missing attributes:

## [
##  {
##    "Title": "And Then There Were None",
##    "Author": "Agatha Christie",
##    "Attribute.1": "Thrilling",
##    "Attribute.2": "Suspenseful",
##    "Attribute.3": ""
##  },
##  {
##    "Title": "Dreamland",
##    "Author": "Sam Quinones",
##    "Attribute.1": "Informative",
##    "Attribute.2": "Saddening",
##    "Attribute.3": "Detailed"
##  },
##  {
##    "Title": "The Great Gatsby",
##    "Author": "Scott Fitzgerald",
##    "Attribute.1": "Thoughtful",
##    "Attribute.2": "Tragic",
##    "Attribute.3": ""
##  }
## ]

lst.json = fromJSON(vec.json)
do.call(rbind, lst.json) %>% kable() %>%  kable_styling(bootstrap_options = "basic")

Title	Author	Attribute.1	Attribute.2	Attribute.3
And Then There Were None	Agatha Christie	Thrilling	Suspenseful
Dreamland	Sam Quinones	Informative	Saddening	Detailed
The Great Gatsby	Scott Fitzgerald	Thoughtful	Tragic
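
For later processing, a regular data frame may be more convenient than the rbind() result. The sketch below is one possible approach, not the method used above: txt.json, lst.json, and df.json are assumed names, and the inline string is a shortened copy of the file. Each record is converted to a one-row data frame before stacking, and the empty string is kept as is.

```r
library(rjson)

# Assumption: a shortened inline copy of the JSON file shown earlier
txt.json <- '[
  {"Title": "Dreamland", "Author": "Sam Quinones", "Attribute.3": "Detailed"},
  {"Title": "The Great Gatsby", "Author": "Scott Fitzgerald", "Attribute.3": ""}
]'

# Parse into a list with one element per book
lst.json <- fromJSON(json_str = txt.json)

# Turn each record into a one-row data frame, then stack the rows
df.json <- do.call(
  rbind,
  lapply(lst.json, function(rec) {
    as.data.frame(as.list(rec), stringsAsFactors = FALSE)
  })
)

# The blank attribute is still an empty string, not NA
print(df.json)
```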

Takeaways

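The XML and HTML parsers filled the missing attributes with NA values.
The JSON conversion preserved the blank cells, because the file stores missing attributes as empty strings instead of omitting them.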
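
If the three results need to be compared directly, one option is to convert empty strings to NA after loading. The sketch below assumes a hypothetical helper, blank_to_na(), and two small stand-in data frames (df.from.json and df.from.xml) built by hand from the records above. It is not part of the original workflow.

```r
# Hypothetical helper: treat empty strings as missing values
blank_to_na <- function(df) {
  df[] <- lapply(df, function(col) {
    col <- as.character(col)
    col[!is.na(col) & col == ""] <- NA_character_
    col
  })
  df
}

# Stand-in for the JSON result: the blank cell is an empty string
df.from.json <- data.frame(
  Title = c("Dreamland", "The Great Gatsby"),
  Attribute.3 = c("Detailed", ""),
  stringsAsFactors = FALSE
)

# Stand-in for the XML/HTML result: the blank cell is NA
df.from.xml <- data.frame(
  Title = c("Dreamland", "The Great Gatsby"),
  Attribute.3 = c("Detailed", NA),
  stringsAsFactors = FALSE
)

# After normalizing, the two representations can be compared
df.norm <- blank_to_na(df.from.json)
print(all.equal(df.norm, df.from.xml))
```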

Links

Raw XML File

Raw HTML File

Raw JSON File

Assignment 5

Deepika Dilip

3/19/2022