Week 8

XML, HTML and JSON Homework

For this project, I created three documents in XML, HTML and JSON using Notepad. I have never done this before, so my apologies if these documents are not properly written; I copied the basic formats from the slide deck by Gaston Sanchez and from Google searches on HTML tables.

I then loaded the data into R using the following packages:

library(XML)
library(RJSONIO)
library(rvest)

## Loading required package: xml2
## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:XML':
## 
##     xml

library(RCurl)

## Loading required package: bitops

For the Week 8 assignment I created the following files, which I read from their raw GitHub URLs:

url_html <- getURL("https://raw.githubusercontent.com/mfarris9505/Week-8-Hwk/master/Books.HTML")
url_json <- getURL("https://raw.githubusercontent.com/mfarris9505/Week-8-Hwk/master/Books.JSON")
url_xml <- getURL("https://raw.githubusercontent.com/mfarris9505/Week-8-Hwk/master/Books.xml")

Data_XML <- xmlToDataFrame(url_xml)
#Data_JSON <- fromJSON(url_json)
Data_HTML <- readHTMLTable(url_html)
Data_XML

##                                   title
## 1                 Count of Monte Cristo
## 2      Automated Data Collection with R
## 3 Goosebumps #13: Welcome to Dead House
##                                                   authors
## 1                                         Alexander Dumas
## 2 Simon MunzertChristian RubbaPeter MeibnerDominic Nyhuis
## 3                                              R.L. Stine
##               publisher            genre published
## 1       Penguin Classic Historical Novel      1844
## 2                 Wiley         Textbook      2015
## 3 Scholastic Paperbacks           Horror      2010

# I cannot get my hand-written JSON file to read properly from the web; I am quite new to this.
# In the appendix I show the output from a JSON file that I generated from the HTML data set.
# It produced a similar data frame and identical-looking source code.
# For the life of me I can't figure out why my own file does not read.
#Data_JSON
Data_HTML

## $`NULL`
##                                   Title         Author1         Author2
## 1                Counte of Monte Cristo Alexander Dumas              NA
## 2      Automated Data Collection with R   Simon Munzert Christian Rubba
## 3 Goosebumps #13: Welcome to Dead House      R.L. Stine              NA
##         Author3        Author4             Publisher            Genre
## 1            NA             NA       Penquin Classic Historical Novel
## 2 Peter Meibner Dominic Nyhuis                 Wiley         Textbook
## 3            NA             NA Scholastic Paperbacks           Horror
##   Published
## 1      1844
## 2      2015
## 3      2010
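The $`NULL` line at the top of the HTML output suggests that readHTMLTable returned a list of tables (here with a single unnamed entry) rather than a bare data frame. A minimal sketch of pulling out the table itself follows; the inline HTML is a hypothetical stand-in for Books.HTML, and only a few of its columns are reproduced.

```r
library(XML)

# Hypothetical stand-in for Books.HTML: one table with a header row.
html_text <- paste0(
  "<html><body><table>",
  "<tr><th>Title</th><th>Author1</th><th>Published</th></tr>",
  "<tr><td>Count of Monte Cristo</td><td>Alexander Dumas</td><td>1844</td></tr>",
  "<tr><td>Goosebumps #13: Welcome to Dead House</td><td>R.L. Stine</td><td>2010</td></tr>",
  "</table></body></html>"
)

doc <- htmlParse(html_text, asText = TRUE)

# Default call: a list with one entry per <table> found in the page.
all_tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Either index into that list, or ask for the first table directly.
books_a <- all_tables[[1]]
books_b <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)
```

With the real file, Data_HTML[[1]] should give the same kind of result without re-reading the page.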

The two outputs show that the data is read differently depending on the package (and because I wrote each file slightly differently). One problem visible in both sources is the handling of multiple authors. In the XML file the authors element has child elements, one per author, but xmlToDataFrame ran them together into a single cell. For HTML, I couldn’t figure out a way to use the child-element method, so I created four separate author columns. I did the same in the JSON file, but as stated previously, I could not get that file to load (see below).
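The run-together author names in the XML output appear to come from xmlToDataFrame taking the full text of each top-level field, nested elements included. One possible workaround is to walk the book nodes with XPath and join the authors with a separator. This is only a sketch: the inline XML and its element names (books, book, authors, author) are assumptions about how Books.xml is laid out, not the actual file.

```r
library(XML)

# Hypothetical stand-in for Books.xml; element names are assumed.
xml_text <- paste(
  "<books>",
  "  <book>",
  "    <title>Count of Monte Cristo</title>",
  "    <authors><author>Alexander Dumas</author></authors>",
  "  </book>",
  "  <book>",
  "    <title>Automated Data Collection with R</title>",
  "    <authors>",
  "      <author>Simon Munzert</author>",
  "      <author>Christian Rubba</author>",
  "    </authors>",
  "  </book>",
  "</books>",
  sep = "\n"
)

doc <- xmlParse(xml_text, asText = TRUE)
books <- getNodeSet(doc, "//book")

# One row per book; authors are joined with "; " so the names stay readable.
books_df <- data.frame(
  title = sapply(books, function(b) xpathSApply(b, "./title", xmlValue)),
  authors = sapply(books, function(b) {
    paste(xpathSApply(b, "./authors/author", xmlValue), collapse = "; ")
  }),
  stringsAsFactors = FALSE
)
```

Joining with a separator keeps one row per book; splitting the authors into Author1 to Author4 columns, as in the HTML version, would be an alternative design.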

Appendix

As I couldn’t get my hand-written JSON file to load, I took the data frame I created from the HTML and wrote a JSON file from it using the RJSONIO package. See below:

JSON_file  <- toJSON(Data_HTML)
write(JSON_file, "json_file.json")

This created json_file.json, which I then loaded. (Its contents looked exactly like my hand-written file; maybe I am just tired and missed a comma somewhere.)

url_json <- getURL("https://raw.githubusercontent.com/mfarris9505/Week-8-Hwk/master/json_file.json")
Data_JSON <- fromJSON(url_json)
Data_JSON

## $`NULL`
## $`NULL`$Title
## [1] "Counte of Monte Cristo"               
## [2] "Automated Data Collection with R"     
## [3] "Goosebumps #13: Welcome to Dead House"
## 
## $`NULL`$Author1
## [1] "Alexander Dumas" "Simon Munzert"   "R.L. Stine"     
## 
## $`NULL`$Author2
## [1] "NA"              "Christian Rubba" "NA"             
## 
## $`NULL`$Author3
## [1] "NA"            "Peter Meibner" "NA"           
## 
## $`NULL`$Author4
## [1] "NA"             "Dominic Nyhuis" "NA"            
## 
## $`NULL`$Publisher
## [1] "Penquin Classic"       "Wiley"                 "Scholastic Paperbacks"
## 
## $`NULL`$Genre
## [1] "Historical Novel" "Textbook"         "Horror"          
## 
## $`NULL`$Published
## [1] "1844" "2015" "2010"

The JSON output is structured differently from the other two sources. xmlToDataFrame returned a data frame and readHTMLTable returned a data frame wrapped in a list, so both printed as tables. fromJSON instead returned a nested list of character vectors, one per column, which is less organized. I suspect this has more to do with the JSON package used than with the file itself.
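A minimal sketch of two follow-up steps: checking that JSON text is well formed before parsing it, and turning the column-wise list from fromJSON into a data frame. The inline JSON is a hypothetical, shortened stand-in for json_file.json, and the "books" key is an assumed name used in place of the "NULL" key seen in the output.

```r
library(RJSONIO)

# Hypothetical JSON text mirroring the column-wise layout that toJSON() wrote.
json_text <- '{
  "books": {
    "Title": ["Count of Monte Cristo", "Automated Data Collection with R"],
    "Author1": ["Alexander Dumas", "Simon Munzert"],
    "Published": ["1844", "2015"]
  }
}'

# isValidJSON() reports whether the text parses as JSON at all, which can
# help rule a syntax slip in a hand-written file in or out.
is_ok <- isValidJSON(json_text, asText = TRUE)

parsed <- fromJSON(json_text, asText = TRUE)

# Each column is a character vector of equal length, so the inner list
# converts directly to a data frame.
books_df <- as.data.frame(parsed[[1]], stringsAsFactors = FALSE)
books_df$Published <- as.integer(books_df$Published)
```

With the real data, the same as.data.frame call on Data_JSON[[1]] should give a table comparable to the HTML one, though the "NA" strings in the author columns would still need converting to missing values.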

Section 2

October 18, 2015
