Week 5 Assignment

By Brian Weinfeld

March 12, 2018

Set Up

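I began by creating a function to process each of the data frames. It adds a few extra tidying steps to the project and helps ensure uniformity across my data: it splits book_number into book_number and series_length, moves a leading "The" or "A" to the end of each value, and sorts the rows by title.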

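library(tidyverse)  # provides %>%, separate, map_df, str_replace_all, arrange

Format.Raw.Data <- function(df){
  df %>%
    # Split '3 of 7' into book_number = '3' and series_length = '7'
    separate('book_number', c('book_number', 'series_length'), sep=' of ') %>%
    # Move a leading article to the end ('The Expanse' -> 'Expanse, The');
    # str_replace_all also returns every column as character
    map_df(~str_replace_all(., '^(The|A) (.*)','\\2, \\1')) %>%
    arrange(title)
}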
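As a quick illustration of what the function does, the sketch below applies it to a small hand-built data frame. The two rows and the names toy.books and toy.result are made up for the example; only Format.Raw.Data comes from the assignment.

```r
library(tidyverse)

# Same function as above, repeated so the example is self-contained
Format.Raw.Data <- function(df){
  df %>%
    separate('book_number', c('book_number', 'series_length'), sep=' of ') %>%
    map_df(~str_replace_all(., '^(The|A) (.*)','\\2, \\1')) %>%
    arrange(title)
}

# Hypothetical input, not part of the assignment data
toy.books <- tibble(
  title = c('The Waste Lands', 'A Game of Thrones'),
  book_number = c('3 of 7', '1 of 5')
)

toy.result <- toy.books %>%
  Format.Raw.Data()
print(toy.result)
```

Both titles should come back with the article moved to the end, the rows sorted by the rewritten title, and book_number split into two character columns.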

JSON

{"favorite_books":[
  {
    "title": "The Waste Lands",
    "authors": ["Stephen King"],
    "series": "The Dark Tower",
    "book_number": "3 of 7",
    "release_year": 1992,
    "pages": 422
  },
  {
    "title": "The Hitchhiker's Guide to the Galaxy",
    "authors": ["Douglas Adams"],
    "series": "The Hitchhiker's Guide to the Galaxy",
    "book_number": "1 of 5",
    "release_year": 1979,
    "pages": 144
  },
  {
    "title": "Babylon's Ashes",
    "authors": ["Daniel Abraham", "Ty Franck"],
    "series": "The Expanse",
    "book_number": "7 of 9",
    "release_year": 2017,
    "pages": 540
  }]
}

The JSON data was the simplest to work with. It was read in and passed directly to the Format.Raw.Data function.

json.raw.data <- fromJSON('https://raw.githubusercontent.com/brian-cuny/607assignment5/master/authors.json')[[1]]

json.data.frame <- json.raw.data %>%
  Format.Raw.Data()
json.data.frame %>% kable()

title	authors	series	book_number	series_length	release_year	pages
Babylon’s Ashes	c(“Daniel Abraham”, “Ty Franck”)	Expanse, The	7	9	2017	540
Hitchhiker’s Guide to the Galaxy, The	Douglas Adams	Hitchhiker’s Guide to the Galaxy, The	1	5	1979	144
Waste Lands, The	Stephen King	Dark Tower, The	3	7	1992	422
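The c(...) entry in the authors column reflects that each cell holds a whole character vector rather than a single string. The sketch below shows this with a small inline JSON string. It assumes fromJSON comes from the jsonlite package, which the assignment does not state explicitly; the JSON text and the names json.text, json.demo and authors.flat are made up for illustration.

```r
library(jsonlite)

# Hypothetical inline JSON mirroring the shape of authors.json
json.text <- '{"favorite_books":[
  {"title": "The Waste Lands", "authors": ["Stephen King"]},
  {"title": "Babylon\'s Ashes", "authors": ["Daniel Abraham", "Ty Franck"]}
]}'

json.demo <- fromJSON(json.text)[[1]]

# authors is a list column: one character vector per book
str(json.demo$authors)

# One possible way to flatten it into a single readable string per book
json.demo$authors.flat <- sapply(json.demo$authors, paste, collapse = ', ')
print(json.demo$authors.flat)
```

Flattening like this would make the JSON result look like the HTML one, at the cost of losing the separate author entries.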

HTML

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>My Favorite Authors</title>
</head>
<body>
  <h1>My Favorite Books!</h1>
  <h2>Check them out!</h2>
  <p>In the below table you will find a list of my favorite fiction books.</p>
  <table id="books">
    <thead>
      <tr>
        <td id='title'>Title</td>
        <td id='authors'>Authors</td>
        <td id='series'>Series</td>
        <td id='book_number'>Book Number</td>
        <td id='release_year'>Release Year</td>
        <td id='pages'>Pages</td>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>The Waste Lands</td>
        <td>Stephen King</td>
        <td>The Dark Tower</td>
        <td>3 of 7</td>
        <td>1992</td>
        <td>422</td>
      </tr>
      <tr>
        <td>The Hitchhiker's Guide to the Galaxy</td>
        <td>Douglas Adams</td>
        <td>The Hitchhiker's Guide to the Galaxy</td>
        <td>1 of 5</td>
        <td>1979</td>
        <td>144</td>
      </tr>
      <tr>
        <td>Babylon's Ashes</td>
        <td>Daniel Abraham, Ty Franck</td>
        <td>The Expanse</td>
        <td>6 of 9</td>
        <td>2017</td>
        <td>540</td>
      </tr>
    </tbody>
  </table>
</body>
</html>

I created a function called Html.Trimmer to trim the excess white space around each value pulled from the HTML.

Html.Trimmer <- function(value){
  value %>%
    xmlValue() %>%
    str_trim()
}
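To see the trimming in isolation, here is a minimal sketch that runs Html.Trimmer over a tiny inline table instead of the remote file. The HTML snippet and the names html.demo and html.values are invented for the example, and the cells are padded on purpose.

```r
library(XML)
library(tidyverse)

# Same function as above, repeated so the example is self-contained
Html.Trimmer <- function(value){
  value %>%
    xmlValue() %>%
    str_trim()
}

# Hypothetical snippet with deliberately padded cells
html.demo <- htmlParse('<table><tbody><tr>
  <td>  The Waste Lands  </td>
  <td>
    Stephen King
  </td>
</tr></tbody></table>', asText = TRUE)

# Apply the trimmer to every cell, as the assignment does with the real file
html.values <- html.demo %>%
  xpathSApply('//td', Html.Trimmer)
print(html.values)
```

Each cell's text should come back without the surrounding spaces and line breaks.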

The raw data was read in. I began by grabbing the id attributes from the header cells, which supply the column names for the eventual data frame. Then I grabbed all the cell values, arranged them row by row in a matrix, and converted the matrix into a data frame before calling the Format.Raw.Data function.

html.raw.data <- getURL('https://raw.githubusercontent.com/brian-cuny/607assignment5/master/authors.html') %>% 
  htmlParse()

html.names <- html.raw.data %>%
  xpathSApply('//table//thead//td', xmlGetAttr, 'id')

html.data.frame <- html.raw.data %>%
  xpathSApply('//table//tbody//td', Html.Trimmer) %>% 
  matrix(ncol=html.names %>% length(), byrow=T) %>%
  as.data.frame() %>%
  setNames(html.names) %>%
  Format.Raw.Data()
html.data.frame %>% kable()

title	authors	series	book_number	series_length	release_year	pages
Babylon’s Ashes	Daniel Abraham, Ty Franck	Expanse, The	6	9	2017	540
Hitchhiker’s Guide to the Galaxy, The	Douglas Adams	Hitchhiker’s Guide to the Galaxy, The	1	5	1979	144
Waste Lands, The	Stephen King	Dark Tower, The	3	7	1992	422

XML

<?xml version='1.0' encoding='ISO-8859-1'?>
<!DOCTYPE favorite_books [
  <!ELEMENT favorite_books (book+)>
  <!ELEMENT book (title,authors,series,book_number,release_year,pages)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT authors (author+)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT series (#PCDATA)>
  <!ELEMENT book_number (#PCDATA)>
  <!ELEMENT release_year (#PCDATA)>
  <!ELEMENT pages (#PCDATA)>
]>
<favorite_books>
  <book id="1">
    <title>The Waste Lands</title>
    <authors>
      <author>Stephen King</author>
    </authors>
    <series>The Dark Tower</series>
    <book_number>3 of 7</book_number>
    <release_year>1992</release_year>
    <pages>422</pages>
  </book>
  <book id="2">
    <title>The Hitchhiker's Guide to the Galaxy</title>
    <authors>
      <author>Douglas Adams</author>
    </authors>
    <series>The Hitchhiker's Guide to the Galaxy</series>
    <book_number>1 of 5</book_number>
    <release_year>1979</release_year>
    <pages>144</pages>
  </book>
  <book id="3">
    <title>Babylon's Ashes</title>
    <authors>
      <author>Daniel Abraham</author>
      <author>Ty Franck</author>
    </authors>
    <series>The Expanse</series>
    <book_number>7 of 9</book_number>
    <release_year>2017</release_year>
    <pages>540</pages>
  </book>
</favorite_books>

I wrote a function that looks for nodes with multiple children (e.g., multiple authors) and returns their values as a list. If a node has only a single child, the function returns its value.

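Xml.Processor <- function(value){
  # More than one child (e.g. several <author> nodes): collect every child's
  # text into one character vector, wrapped in a list so it stays a single cell
  ifelse(value %>% xmlSize() > 1,
    value %>% xmlChildren() %>%
      map(. %>% xmlValue()) %>%
      unlist() %>%
      list(),
    # Otherwise return the node's own text
    value %>% xmlValue()
  )
}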
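The following sketch runs Xml.Processor on a small inline document to show the two cases side by side. The XML snippet and the names xml.demo and xml.values are made up for illustration. The author elements are written without white space between them so that the child count does not depend on how the parser treats blank text nodes.

```r
library(XML)
library(tidyverse)

# Same function as above, repeated so the example is self-contained
Xml.Processor <- function(value){
  ifelse(value %>% xmlSize() > 1,
    value %>% xmlChildren() %>%
      map(. %>% xmlValue()) %>%
      unlist() %>%
      list(),
    value %>% xmlValue()
  )
}

# Hypothetical one-book document
xml.demo <- xmlParse('<book>
  <title>Babylon\'s Ashes</title>
  <authors><author>Daniel Abraham</author><author>Ty Franck</author></authors>
</book>', asText = TRUE)

xml.values <- xml.demo %>%
  xpathSApply('//book/*', Xml.Processor)
str(xml.values)
```

The title should come back as a single string, while the authors node should come back as one element holding both names. Wrapping the vector in list() is what keeps the two authors together in a single cell when the results are later laid out in a matrix.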

As with the HTML, I created a names variable and then processed the XML file in a fashion similar to the HTML file.

xml.raw.data <- getURL('https://raw.githubusercontent.com/brian-cuny/607assignment5/master/authors.xml') %>%
  xmlParse()

xml.names <- xml.raw.data %>%
  xpathSApply('//book/*', xmlName) %>%
  unique()

xml.data.frame <- xml.raw.data %>%
  xpathSApply('//book/*', Xml.Processor) %>% 
  matrix(ncol=xml.names %>% length(), byrow=T) %>%
  as.data.frame() %>%
  setNames(xml.names) %>%
  Format.Raw.Data()
xml.data.frame %>% kable()

title	authors	series	book_number	series_length	release_year	pages
Babylon’s Ashes	c(“Daniel Abraham”, “Ty Franck”)	Expanse, The	7	9	2017	540
Hitchhiker’s Guide to the Galaxy, The	Douglas Adams	Hitchhiker’s Guide to the Galaxy, The	1	5	1979	144
Waste Lands, The	Stephen King	Dark Tower, The	3	7	1992	422

YAML

book:
 - title : The Waste Lands
   authors :
    - Stephen King
   series : The Dark Tower
   book_number : 3 of 7
   relase_year : 1992
   pages : 422
 - title: The Hitchhiker's Guide to the Galaxy
   authors:
    - Douglas Adams
   series: The Hitchhiker's Guide to the Galaxy
   book_number: 1 of 5
   release_year: 1979
   pages: 144
 - title: Babylon's Ashes
   authors:
    - Daniel Abraham
    - Ty Franck
   series: The Expanse
   book_number: 7 of 9
   release_year: 2017
   pages: 540

I decided to try working with YAML as well, since I had never used the format before and wanted to see how it compared to HTML, XML, and JSON. I followed a similar pattern as with the other formats by creating a names variable and then creating a data frame out of the list of information.

yaml.raw.data <- read_yaml('https://raw.githubusercontent.com/brian-cuny/607assignment5/master/authors.yaml')[[1]]

yaml.names <- yaml.raw.data[[1]] %>% 
  names() %>% 
  unique()

yaml.data.frame <- yaml.raw.data %>%
  unlist(recursive=F) %>%
  matrix(ncol=yaml.names %>% length(), byrow=T) %>%
  as.data.frame() %>%
  setNames(yaml.names) %>%
  Format.Raw.Data()
yaml.data.frame %>% kable()

title	authors	series	book_number	series_length	relase_year	pages
Babylon’s Ashes	c(“Daniel Abraham”, “Ty Franck”)	Expanse, The	7	9	2017	540
Hitchhiker’s Guide to the Galaxy, The	Douglas Adams	Hitchhiker’s Guide to the Galaxy, The	1	5	1979	144
Waste Lands, The	Stephen King	Dark Tower, The	3	7	1992	422
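The reshaping step can be hard to picture, so here is a minimal sketch of it on an inline YAML string. The YAML text and the names yaml.text, yaml.demo, yaml.fields and yaml.matrix are invented for the example. It uses yaml.load from the yaml package to parse a string rather than read_yaml on a URL.

```r
library(yaml)

# Hypothetical inline YAML with the same shape as authors.yaml
yaml.text <- "
book:
 - title: The Waste Lands
   authors:
    - Stephen King
   pages: 422
 - title: Babylon's Ashes
   authors:
    - Daniel Abraham
    - Ty Franck
   pages: 540
"

yaml.demo <- yaml.load(yaml.text)[[1]]
yaml.demo.names <- names(yaml.demo[[1]])

# Flatten one level: the two records become one long list of six fields,
# and each authors entry stays intact as a character vector
yaml.fields <- unlist(yaml.demo, recursive = FALSE)

# Refill the fields row by row, one row per book
yaml.matrix <- matrix(yaml.fields, ncol = length(yaml.demo.names), byrow = TRUE)
colnames(yaml.matrix) <- yaml.demo.names

print(dim(yaml.matrix))
print(yaml.matrix[[2, 'authors']])
```

This layout only lines up if every record lists the same fields in the same order, since the matrix is filled purely by position.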

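All four input types have been successfully turned into data frames. The main structural difference among them is that authors is a list in JSON, XML, and YAML, while in HTML it is a single comma-separated value. The HTML could be modified further to bring it in line with the others if needed. Two smaller differences come from the source files rather than the processing: the HTML file lists Babylon's Ashes as book 6 of 9 where the other files say 7 of 9, and the first YAML record spells its key relase_year, which becomes the column name because the names are taken from the first record.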
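One possible way to bring the HTML authors column in line with the other formats is to split it on the comma. The sketch below is only an illustration on a hand-built data frame; html.demo.frame and html.split are hypothetical names, and the rows simply imitate the shape of html.data.frame.

```r
library(tidyverse)

# Hypothetical rows shaped like html.data.frame, with authors as one string
html.demo.frame <- tibble(
  title = c("Babylon's Ashes", 'Waste Lands, The'),
  authors = c('Daniel Abraham, Ty Franck', 'Stephen King')
)

# Split on the comma so each cell holds a character vector,
# matching the list column produced for the other formats
html.split <- html.demo.frame %>%
  mutate(authors = str_split(authors, ',\\s*'))
print(html.split$authors)
```

This would misfire on an author whose name itself contains a comma, so it assumes the comma is used only as a separator.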