Capofari_Week_8_Assignment

The Information
The Data Frames
Are the three data frames identical?

The Information

I created 3 separate files to store the information about my favorite children’s books.
* HTML
* JSON
* XML

The Data Frames

The information will now be loaded into 3 separate data frames.

library(RCurl)

## Loading required package: bitops

library(XML)

## Warning: package 'XML' was built under R version 3.2.2

library(jsonlite)

## Warning: package 'jsonlite' was built under R version 3.2.2

## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:utils':
## 
##     View

#1st the html file
url_html <- getURL("https://raw.githubusercontent.com/ncapofari/IS_607/IS_607/favorite_books.html")
html_frame <- as.data.frame(readHTMLTable(url_html), stringsAsFactors = FALSE)

#2nd the xml file
url_xml <- getURL("https://raw.githubusercontent.com/ncapofari/IS_607/IS_607/favorite_books.xml")
xml_frame <- xmlToDataFrame(url_xml, stringsAsFactors = FALSE)

#3rd the json file
url_json <- getURL("https://raw.githubusercontent.com/ncapofari/IS_607/IS_607/favorite_books.json")
json_frame <- as.data.frame(fromJSON(url_json), stringsAsFactors = FALSE)

Here are the HTML, XML, and JSON files as data frames:

class(html_frame)

## [1] "data.frame"

html_frame

##                                    NULL.Title                NULL.Author
## 1             The Day the Babies Crawled Away              Peggy Rathman
## 2                                  Locomotive                Brian Floca
## 3                              Cha-cha Chimps              Julia Durango
## 4 The Juniper Tree and Other Tales from Grimm Jacob Grimm, Wilhelm Grimm
##   NULL.Illustrator NULL.Date NULL.Pages
## 1    Peggy Rathman      2003         40
## 2      Brian Floca      2013         64
## 3    Elanor Taylor      2010         32
## 4   Maurice Sendak      1973        352

class(xml_frame)

## [1] "data.frame"

xml_frame

##                                         title        author    illustrator
## 1             The Day the Babies Crawled Away Peggy Rathman  Peggy Rathman
## 2                                  Locomotive   Brian Floca    Brian Floca
## 3                              Cha-cha Chimps Julia Durango  Elanor Taylor
## 4 The Juniper Tree and Other Tales from Grimm               Maurice Sendak
##   date pages
## 1 2003    40
## 2 2013    64
## 3 2010    32
## 4 1973   352

class(json_frame)

## [1] "data.frame"

json_frame

##                 favorite_children_books.title
## 1             The Day the Babies Crawled Away
## 2                                  Locomotive
## 3                              Cha-cha Chimps
## 4 The Juniper Tree and Other Tales from Grimm
##   favorite_children_books.author favorite_children_books.illustrator
## 1                  Peggy Rathman                       Peggy Rathman
## 2                    Brian Floca                         Brian Floca
## 3                  Peggy Rathman                       Julia Durango
## 4     Jacob Grimm, Wilhelm Grimm                      Maurice Sendak
##   favorite_children_books.date favorite_children_books.pages
## 1                         2003                            40
## 2                         2013                            64
## 3                         2010                            32
## 4                         1973                           352
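The three printouts above use different column names for the same fields (`NULL.Title`, `title`, `favorite_children_books.title`). A minimal sketch of one way to bring them into line, using base R only. The frames below are small stand-ins that copy the printed column names, and `clean_names` is a hypothetical helper, not part of any package used in this report.

```r
# Stand-in frames that mimic the column names printed above (one book only).
html_like <- setNames(
  data.frame("Locomotive", "64", stringsAsFactors = FALSE),
  c("NULL.Title", "NULL.Pages")
)
json_like <- setNames(
  data.frame("Locomotive", "64", stringsAsFactors = FALSE),
  c("favorite_children_books.title", "favorite_children_books.pages")
)

# Hypothetical helper: drop everything up to the last dot, then lower-case.
clean_names <- function(df) {
  names(df) <- tolower(sub("^.*\\.", "", names(df)))
  df
}

html_clean <- clean_names(html_like)
json_clean <- clean_names(json_like)

names(html_clean)
names(json_clean)
identical(html_clean, json_clean)
```

Names without a dot, such as the XML frame's `title`, pass through unchanged apart from the lower-casing.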

Are the three data frames identical?

identical(html_frame, xml_frame)

## [1] FALSE

identical(html_frame, json_frame)

## [1] FALSE

identical(xml_frame, json_frame)

## [1] FALSE
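`identical()` only answers yes or no. To see where two frames disagree, one option is to compare them cell by cell once they have the same shape. The sketch below uses base R; `diff_cells` is a hypothetical helper, and the two small frames are stand-ins whose values are copied from rows 2 and 3 of the XML and JSON printouts above.

```r
# Hypothetical helper: list the cells where two same-shaped frames disagree.
diff_cells <- function(a, b) {
  stopifnot(identical(dim(a), dim(b)))
  # Compare as character matrices so factor vs. character columns do not matter.
  to_chr <- function(df) {
    as.matrix(as.data.frame(lapply(df, as.character), stringsAsFactors = FALSE))
  }
  a_chr <- to_chr(a)
  b_chr <- to_chr(b)
  idx <- which(a_chr != b_chr, arr.ind = TRUE)
  data.frame(
    row    = idx[, "row"],
    column = names(a)[idx[, "col"]],
    a      = a_chr[idx],
    b      = b_chr[idx],
    stringsAsFactors = FALSE
  )
}

# Stand-ins for rows 2 and 3 of the XML and JSON frames.
xml_like <- data.frame(
  title       = c("Locomotive", "Cha-cha Chimps"),
  author      = c("Brian Floca", "Julia Durango"),
  illustrator = c("Brian Floca", "Elanor Taylor"),
  stringsAsFactors = FALSE
)
json_like <- data.frame(
  title       = c("Locomotive", "Cha-cha Chimps"),
  author      = c("Brian Floca", "Peggy Rathman"),
  illustrator = c("Brian Floca", "Julia Durango"),
  stringsAsFactors = FALSE
)

diffs <- diff_cells(xml_like, json_like)
diffs
```

Column names are taken from the first frame, so this assumes the columns are already in the same order in both frames.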

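None of the 3 data frames are identical. They describe the same four books, but the column names differ, and row 3 of the JSON data frame lists a different author and illustrator than the HTML and XML data frames.
In the XML data frame, the two authors of the fourth book did not appear at all. This is most likely because xmlToDataFrame() reads the text inside each element, while these names were stored as attributes.
This is the line in the XML doc that holds the author information that did not show up: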
<author author1="Jacob Grimm" author2="Wilhelm Grimm"/>

I will try to add the authors back into the data frame.

xml_doc <- htmlParse(url_xml)
#this returns the attributes of every author node (NULL when a node has none)
#when I created the xml doc, the 2 authors of book 4 were stored as attributes
multiple_authors <- xpathSApply(xml_doc, "//author", fun = xmlAttrs)
temp <- unlist(multiple_authors[4])
temp <- paste(temp[1], temp[2], sep = ", ")
xml_frame$author[4] <- temp
xml_frame

##                                         title                     author
## 1             The Day the Babies Crawled Away              Peggy Rathman
## 2                                  Locomotive                Brian Floca
## 3                              Cha-cha Chimps              Julia Durango
## 4 The Juniper Tree and Other Tales from Grimm Jacob Grimm, Wilhelm Grimm
##      illustrator date pages
## 1  Peggy Rathman 2003    40
## 2    Brian Floca 2013    64
## 3  Elanor Taylor 2010    32
## 4 Maurice Sendak 1973   352
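The fix above joins exactly two attributes by position. A small sketch of a variant that joins however many author attributes a node has, using `paste()` with `collapse`. The vector below is a stand-in for the attributes of the Grimm author node (a named character vector, built here by hand so the example runs on its own).

```r
# Stand-in for the attributes of the Grimm <author> node:
# a named character vector, one element per attribute.
author_attrs <- c(author1 = "Jacob Grimm", author2 = "Wilhelm Grimm")

# collapse joins every element into one string, whatever the count.
combined <- paste(author_attrs, collapse = ", ")
combined
```

The result could then be assigned to `xml_frame$author[4]` in the same way as `temp` above.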

Nicholas Capofari

October 16, 2015