I began by writing a function to process each of the data frames. It adds a few extra tidying steps to the project and helps ensure uniformity in my data.
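library(tidyverse)  # provides %>%, separate(), map_df(), str_replace_all() and arrange()

Format.Raw.Data <- function(df){
  df %>%
    # Split "3 of 7" into the book's position and the length of its series
    separate('book_number', c('book_number', 'series_length'), sep=' of ') %>%
    # Move a leading article to the end of every value: "The Waste Lands" -> "Waste Lands, The"
    map_df(~str_replace_all(., '^(The|A) (.*)','\\2, \\1')) %>%
    arrange(title)
}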
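The two transformations inside Format.Raw.Data can be tried on their own. The following is a minimal sketch that uses base R stand-ins (sub and strsplit rather than str_replace_all and separate), so it runs without any packages. The title "A Game of Thrones" is an illustrative value that does not appear in my data.

```r
# Illustrative titles; the third has no leading article and should pass through unchanged
titles <- c("The Waste Lands", "A Game of Thrones", "Babylon's Ashes")

# Same pattern as in Format.Raw.Data: capture the article, then the rest, and swap them
moved <- sub("^(The|A) (.*)", "\\2, \\1", titles)
print(moved)

# Splitting a "3 of 7" style value on the literal separator " of "
parts <- strsplit("3 of 7", " of ", fixed = TRUE)[[1]]
print(parts)
```

Because the replacement is mapped over every column, it also rewrites the series column, and the split pieces stay character values rather than numbers.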
{"favorite_books":[
{
"title": "The Waste Lands",
"authors": ["Stephen King"],
"series": "The Dark Tower",
"book_number": "3 of 7",
"release_year": 1992,
"pages": 422
},
{
"title": "The Hitchhiker's Guide to the Galaxy",
"authors": ["Douglas Adams"],
"series": "The Hitchhiker's Guide to the Galaxy",
"book_number": "1 of 5",
"release_year": 1979,
"pages": 144
},
{
"title": "Babylon's Ashes",
"authors": ["Daniel Abraham", "Ty Franck"],
"series": "The Expanse",
"book_number": "7 of 9",
"release_year": 2017,
"pages": 540
}]
}
The JSON data was the simplest to work with. It was read in and passed directly to the Format.Raw.Data function.
json.raw.data <- fromJSON('https://raw.githubusercontent.com/brian-cuny/607assignment5/master/authors.json')[[1]]
json.data.frame <- json.raw.data %>%
  Format.Raw.Data()
json.data.frame %>% kable()
| title | authors | series | book_number | series_length | release_year | pages |
|---|---|---|---|---|---|---|
| Babylon’s Ashes | c(“Daniel Abraham”, “Ty Franck”) | Expanse, The | 7 | 9 | 2017 | 540 |
| Hitchhiker’s Guide to the Galaxy, The | Douglas Adams | Hitchhiker’s Guide to the Galaxy, The | 1 | 5 | 1979 | 144 |
| Waste Lands, The | Stephen King | Dark Tower, The | 3 | 7 | 1992 | 422 |
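To see why the JSON needed so little work, here is a small sketch that parses an inline, cut-down copy of the file instead of the URL. It assumes the jsonlite package is installed; the variable names json.text and books are only for illustration.

```r
library(jsonlite)

# A cut-down, inline version of the JSON file (two books, two fields)
json.text <- '{"favorite_books":[
  {"title": "The Waste Lands", "authors": ["Stephen King"]},
  {"title": "Babylon\'s Ashes", "authors": ["Daniel Abraham", "Ty Franck"]}
]}'

# fromJSON turns the array of objects into a data frame; [[1]] pulls it out of the wrapper
books <- fromJSON(json.text)[[1]]

print(class(books))          # the parsed result is already a data frame
print(class(books$authors))  # authors arrays of different lengths are kept as a list column
print(books$authors[[2]])
```

The array of objects arrives as a data frame without any reshaping, which is why this format could go straight into Format.Raw.Data.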
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>My Favorite Authors</title>
</head>
<body>
<h1>My Favorite Books!</h1>
<h2>Check them out!</h2>
<p>In the below table you will find a list of my favorite fiction books.</p>
<table id="books">
<thead>
<tr>
<td id='title'>Title</td>
<td id='authors'>Authors</td>
<td id='series'>Series</td>
<td id='book_number'>Book Number</td>
<td id='release_year'>Release Year</td>
<td id='pages'>Pages</td>
</tr>
</thead>
<tbody>
<tr>
<td>The Waste Lands</td>
<td>Stephen King</td>
<td>The Dark Tower</td>
<td>3 of 7</td>
<td>1992</td>
<td>422</td>
</tr>
<tr>
<td>The Hitchhiker's Guide to the Galaxy</td>
<td>Douglas Adams</td>
<td>The Hitchhiker's Guide to the Galaxy</td>
<td>1 of 5</td>
<td>1979</td>
<td>144</td>
</tr>
<tr>
<td>Babylon's Ashes</td>
<td>Daniel Abraham, Ty Franck</td>
<td>The Expanse</td>
<td>6 of 9</td>
<td>2017</td>
<td>540</td>
</tr>
</tbody>
</table>
</body>
</html>
I created a function called Html.Trimmer to trim the excess white space around each value grabbed from the HTML.
Html.Trimmer <- function(value){
  value %>%
    xmlValue() %>%
    str_trim()
}
The raw data was read in. I began by grabbing the id attributes of the header cells, which hold the column names for the eventual data frame. Then I grabbed all the needed values, arranged them in a matrix with one row per book, and converted that to a data frame before calling the Format.Raw.Data function.
html.raw.data <- getURL('https://raw.githubusercontent.com/brian-cuny/607assignment5/master/authors.html') %>%
  htmlParse()
html.names <- html.raw.data %>%
  xpathSApply('//table//thead//td', xmlGetAttr, 'id')
html.data.frame <- html.raw.data %>%
  xpathSApply('//table//tbody//td', Html.Trimmer) %>%
  matrix(ncol=html.names %>% length(), byrow=TRUE) %>%
  as.data.frame() %>%
  setNames(html.names) %>%
  Format.Raw.Data()
html.data.frame %>% kable()
| title | authors | series | book_number | series_length | release_year | pages |
|---|---|---|---|---|---|---|
| Babylon’s Ashes | Daniel Abraham, Ty Franck | Expanse, The | 6 | 9 | 2017 | 540 |
| Hitchhiker’s Guide to the Galaxy, The | Douglas Adams | Hitchhiker’s Guide to the Galaxy, The | 1 | 5 | 1979 | 144 |
| Waste Lands, The | Stephen King | Dark Tower, The | 3 | 7 | 1992 | 422 |
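The cell-by-cell approach above can be reproduced on a tiny inline table. This is a sketch only: it assumes the XML package is installed, uses a made-up two-column table with deliberate stray spaces, and swaps in base R's trimws for str_trim.

```r
library(XML)

# A two-column stand-in for the real page, with extra white space around some values
html.text <- '<table><thead><tr><td id="title">Title</td><td id="pages">Pages</td></tr></thead>
<tbody><tr><td>  The Waste Lands </td><td>422</td></tr>
<tr><td>Babylon\'s Ashes</td><td> 540 </td></tr></tbody></table>'

doc <- htmlParse(html.text, asText = TRUE)

# Column names come from the id attributes of the header cells
col.names <- xpathSApply(doc, '//table//thead//td', xmlGetAttr, 'id')

# Body cells come back as one flat vector, so they are refilled row by row
cells <- trimws(xpathSApply(doc, '//table//tbody//td', xmlValue))
books <- as.data.frame(matrix(cells, ncol = length(col.names), byrow = TRUE),
                       stringsAsFactors = FALSE)
names(books) <- col.names
print(books)
```

Filling the matrix by row relies on every row having the same number of cells as the header; a missing cell would silently shift the values that follow it.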
<?xml version='1.0' encoding='ISO-8859-1'?>
<!DOCTYPE favorite_books [
<!ELEMENT favorite_books (book+)>
<!ELEMENT book (title,authors,series,book_number,release_year,pages)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT authors (author+)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT series (#PCDATA)>
<!ELEMENT book_number (#PCDATA)>
<!ELEMENT release_year (#PCDATA)>
<!ELEMENT pages (#PCDATA)>
]>
<favorite_books>
<book id="1">
<title>The Waste Lands</title>
<authors>
<author>Stephen King</author>
</authors>
<series>The Dark Tower</series>
<book_number>3 of 7</book_number>
<release_year>1992</release_year>
<pages>422</pages>
</book>
<book id="2">
<title>The Hitchhiker's Guide to the Galaxy</title>
<authors>
<author>Douglas Adams</author>
</authors>
<series>The Hitchhiker's Guide to the Galaxy</series>
<book_number>1 of 5</book_number>
<release_year>1979</release_year>
<pages>144</pages>
</book>
<book id="3">
<title>Babylon's Ashes</title>
<authors>
<author>Daniel Abraham</author>
<author>Ty Franck</author>
</authors>
<series>The Expanse</series>
<book_number>7 of 9</book_number>
<release_year>2017</release_year>
<pages>540</pages>
</book>
</favorite_books>
I wrote a function that finds nodes with several children (i.e., multiple authors) and collects their values into a list. If there is only a single node, it returns its value.
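Xml.Processor <- function(value){
  if (xmlSize(value) > 1) {
    # Several children (e.g. two authors): wrap their values in a one-element list
    value %>%
      xmlChildren() %>%
      map(xmlValue) %>%
      unlist() %>%
      list()
  } else {
    # A single node: return its text value
    xmlValue(value)
  }
}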
As with the HTML code, I created a names variable and then processed the XML file in a fashion similar to the HTML file.
xml.raw.data <- getURL('https://raw.githubusercontent.com/brian-cuny/607assignment5/master/authors.xml') %>%
  xmlParse()
xml.names <- xml.raw.data %>%
  xpathSApply('//book/*', xmlName) %>%
  unique()
xml.data.frame <- xml.raw.data %>%
  xpathSApply('//book/*', Xml.Processor) %>%
  matrix(ncol=xml.names %>% length(), byrow=TRUE) %>%
  as.data.frame() %>%
  setNames(xml.names) %>%
  Format.Raw.Data()
xml.data.frame %>% kable()
| title | authors | series | book_number | series_length | release_year | pages |
|---|---|---|---|---|---|---|
| Babylon’s Ashes | c(“Daniel Abraham”, “Ty Franck”) | Expanse, The | 7 | 9 | 2017 | 540 |
| Hitchhiker’s Guide to the Galaxy, The | Douglas Adams | Hitchhiker’s Guide to the Galaxy, The | 1 | 5 | 1979 | 144 |
| Waste Lands, The | Stephen King | Dark Tower, The | 3 | 7 | 1992 | 422 |
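The child-count test that Xml.Processor relies on is easier to see on a small inline document. The sketch below assumes the XML package is installed and uses a cut-down copy of the file with only the title and authors elements; the names author.nodes, child.counts and authors are illustrative.

```r
library(XML)

# Cut-down inline copy of the XML file: one single-author and one two-author book
xml.text <- '<favorite_books>
<book><title>The Waste Lands</title><authors><author>Stephen King</author></authors></book>
<book><title>Babylon\'s Ashes</title><authors><author>Daniel Abraham</author><author>Ty Franck</author></authors></book>
</favorite_books>'

doc <- xmlParse(xml.text, asText = TRUE)
author.nodes <- getNodeSet(doc, '//book/authors')

# xmlSize counts the children of each authors node
child.counts <- sapply(author.nodes, xmlSize)
print(child.counts)

# Collect the author names for each book as a plain character vector
authors <- lapply(author.nodes, function(node) unname(sapply(xmlChildren(node), xmlValue)))
print(authors)
```

A size above one is what separates a multi-author node from everything else, since a plain element such as title has a single text child.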
book:
  - title : The Waste Lands
    authors :
      - Stephen King
    series : The Dark Tower
    book_number : 3 of 7
    relase_year : 1992
    pages : 422
  - title: The Hitchhiker's Guide to the Galaxy
    authors:
      - Douglas Adams
    series: The Hitchhiker's Guide to the Galaxy
    book_number: 1 of 5
    release_year: 1979
    pages: 144
  - title: Babylon's Ashes
    authors:
      - Daniel Abraham
      - Ty Franck
    series: The Expanse
    book_number: 7 of 9
    release_year: 2017
    pages: 540
I decided to try working with YAML as well, since I had never used the format before and wanted to see how it compared to HTML, XML and JSON. I followed a similar pattern as with the other formats: I took the column names from the first record and then built a data frame out of the list of information.
yaml.raw.data <- read_yaml('https://raw.githubusercontent.com/brian-cuny/607assignment5/master/authors.yaml')[[1]]
yaml.names <- yaml.raw.data[[1]] %>%
  names() %>%
  unique()
yaml.data.frame <- yaml.raw.data %>%
  unlist(recursive=FALSE) %>%
  matrix(ncol=yaml.names %>% length(), byrow=TRUE) %>%
  as.data.frame() %>%
  setNames(yaml.names) %>%
  Format.Raw.Data()
yaml.data.frame %>% kable()
| title | authors | series | book_number | series_length | relase_year | pages |
|---|---|---|---|---|---|---|
| Babylon’s Ashes | c(“Daniel Abraham”, “Ty Franck”) | Expanse, The | 7 | 9 | 2017 | 540 |
| Hitchhiker’s Guide to the Galaxy, The | Douglas Adams | Hitchhiker’s Guide to the Galaxy, The | 1 | 5 | 1979 | 144 |
| Waste Lands, The | Stephen King | Dark Tower, The | 3 | 7 | 1992 | 422 |
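Since the column names are taken from the first record only, a misspelled key in that record (relase_year in my file) carries through to the data frame. A quick consistency check could catch this before the matrix is built. The sketch below assumes the yaml package is installed and uses a two-field inline sample; the names records, key.sets, all.same and odd.keys are illustrative.

```r
library(yaml)

# Inline sample that mirrors the key mismatch in the YAML file
yaml.text <- "
book:
  - title: The Waste Lands
    relase_year: 1992
  - title: Babylon's Ashes
    release_year: 2017
"

records <- yaml.load(yaml.text)[[1]]

# Compare every record's keys against those of the first record
key.sets <- lapply(records, names)
all.same <- all(sapply(key.sets, identical, key.sets[[1]]))
print(all.same)

# Keys that appear somewhere but not in the first record
odd.keys <- setdiff(unique(unlist(key.sets)), key.sets[[1]])
print(odd.keys)
```

Because the values are placed by position rather than by key, the data still lines up; only the column label is affected.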
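All four input types have been successfully turned into data frames. The main structural difference among them is that authors is built from a list of values in JSON, XML and YAML (displayed in c(...) form), while in HTML it is a single comma-separated string. The HTML could be modified further to bring it in line with the others if needed. Two smaller discrepancies come from the source files themselves: the HTML file lists Babylon's Ashes as book 6 of 9 while the other three files list it as 7 of 9, and the first YAML record spells its key relase_year, which becomes that column's name.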
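One way the HTML authors could be brought in line is to split each string on the comma separator. This is a minimal base R sketch on a hand-built data frame, not the html.data.frame from above; html.books is a hypothetical name.

```r
# Hand-built stand-in for the title and authors columns of the HTML data frame
html.books <- data.frame(
  title   = c("Babylon's Ashes", "Waste Lands, The"),
  authors = c("Daniel Abraham, Ty Franck", "Stephen King"),
  stringsAsFactors = FALSE
)

# Split on the literal ", " so each row holds a character vector of authors
html.books$authors <- strsplit(html.books$authors, ", ", fixed = TRUE)

print(class(html.books$authors))
print(html.books$authors[[1]])
```

Only the authors column is split here; titles such as "Waste Lands, The" also contain a comma and would be broken apart if the same step were applied to them.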