For this project, I wrote three documents by hand in a text editor, one each in XML, HTML, and JSON. I had never done this before, so my apologies if the documents are not properly written; I copied the basic formats from Gaston Sanchez's slide deck and from Google searches on HTML tables.
I then loaded the data into R using the following packages:
library(XML)
library(RJSONIO)
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:XML':
##
## xml
library(RCurl)
## Loading required package: bitops
For the week 8 assignment I created the following files:
url_html <- getURL("https://raw.githubusercontent.com/mfarris9505/Week-8-Hwk/master/Books.HTML")
url_json <- getURL("https://raw.githubusercontent.com/mfarris9505/Week-8-Hwk/master/Books.JSON")
url_xml <- getURL("https://raw.githubusercontent.com/mfarris9505/Week-8-Hwk/master/Books.xml")
Data_XML <- xmlToDataFrame(url_xml)
#Data_JSON <- fromJSON(url_json)
Data_HTML <- readHTMLTable(url_html)
Data_XML
## title
## 1 Count of Monte Cristo
## 2 Automated Data Collection with R
## 3 Goosebumps #13: Welcome to Dead House
## authors
## 1 Alexander Dumas
## 2 Simon MunzertChristian RubbaPeter MeibnerDominic Nyhuis
## 3 R.L. Stine
## publisher genre published
## 1 Penguin Classic Historical Novel 1844
## 2 Wiley Textbook 2015
## 3 Scholastic Paperbacks Horror 2010
# I cannot get my hand-written JSON file to read properly from the web; I am still quite new to this. I wrote an appendix section to show the output of a JSON file that I "created" from the HTML data set. It produced a similar data frame and identical source code, yet for the life of me I can't figure out why my original file won't read.
#Data_JSON
Data_HTML
## $`NULL`
## Title Author1 Author2
## 1 Counte of Monte Cristo Alexander Dumas NA
## 2 Automated Data Collection with R Simon Munzert Christian Rubba
## 3 Goosebumps #13: Welcome to Dead House R.L. Stine NA
## Author3 Author4 Publisher Genre
## 1 NA NA Penquin Classic Historical Novel
## 2 Peter Meibner Dominic Nyhuis Wiley Textbook
## 3 NA NA Scholastic Paperbacks Horror
## Published
## 1 1844
## 2 2015
## 3 2010
As you can see, the data is read differently depending on the package (and because I wrote it slightly differently in each format). One problem I ran into, visible in the two outputs above, is that the XML file has a sub-child element holding multiple authors, and xmlToDataFrame ran them together into a single field. For HTML, I couldn't figure out a way to use the sub-child method, so I created four separate author columns (I did this in the JSON file as well but, as stated previously, I could not get that file to load; see below).
Since I couldn't get my own JSON file to load, I took the data frame I created from the HTML and wrote it out as a JSON file using the RJSONIO package. See below:
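One possible way around the run-together authors is to walk the book nodes yourself and join each book's author names with a separator. The sketch below is only an illustration: the inline XML and its element names (books, book, authors, author) are assumptions about how Books.xml might be laid out, not a copy of the actual file.

```r
library(XML)

# Hypothetical inline XML; the element names are assumed, not taken from Books.xml
xml_text <- '<books>
  <book>
    <title>Automated Data Collection with R</title>
    <authors>
      <author>Simon Munzert</author>
      <author>Christian Rubba</author>
    </authors>
  </book>
  <book>
    <title>Count of Monte Cristo</title>
    <authors>
      <author>Alexander Dumas</author>
    </authors>
  </book>
</books>'

# Parse the string and collect one node per book
doc <- xmlParse(xml_text, asText = TRUE)
books <- getNodeSet(doc, "//book")

# Title of each book, read relative to that book's node
titles <- vapply(books, function(b) {
  xpathSApply(b, "./title", xmlValue)[1]
}, character(1))

# All author names of each book, joined with "; " instead of run together
authors <- vapply(books, function(b) {
  paste(xpathSApply(b, "./authors/author", xmlValue), collapse = "; ")
}, character(1))

Data_XML_sketch <- data.frame(title = titles, authors = authors,
                              stringsAsFactors = FALSE)
```

Keeping the authors in one delimited column preserves one row per book; splitting them into Author1 through Author4 columns, as in the HTML table, would be an equally reasonable choice.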
JSON_file <- toJSON(Data_HTML)
write(JSON_file, "json_file.json")
This created json_file.json, which I then attempted to load. (It looked exactly like mine; maybe I am just tired and missed a comma somewhere.)
url_json <- getURL("https://raw.githubusercontent.com/mfarris9505/Week-8-Hwk/master/json_file.json")
Data_JSON <- fromJSON(url_json)
Data_JSON
## $`NULL`
## $`NULL`$Title
## [1] "Counte of Monte Cristo"
## [2] "Automated Data Collection with R"
## [3] "Goosebumps #13: Welcome to Dead House"
##
## $`NULL`$Author1
## [1] "Alexander Dumas" "Simon Munzert" "R.L. Stine"
##
## $`NULL`$Author2
## [1] "NA" "Christian Rubba" "NA"
##
## $`NULL`$Author3
## [1] "NA" "Peter Meibner" "NA"
##
## $`NULL`$Author4
## [1] "NA" "Dominic Nyhuis" "NA"
##
## $`NULL`$Publisher
## [1] "Penquin Classic" "Wiley" "Scholastic Paperbacks"
##
## $`NULL`$Genre
## [1] "Historical Novel" "Textbook" "Horror"
##
## $`NULL`$Published
## [1] "1844" "2015" "2010"
As you can see, this output is different from the other two data sources. The other files produced data frames that were much more tabular, while this one produced a less organized nested list. I wonder whether this is because of the JSON package used: the other packages appear to return data frames, whereas fromJSON returned a list of vectors.
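Since every vector in that list has the same length, it can probably be turned back into a table with base R. The sketch below uses a small hand-built list as a stand-in for Data_JSON (an assumption made so the example runs without downloading anything); the names books_list and books_df are made up for illustration.

```r
# Hypothetical stand-in for the list printed above: a wrapper element
# named "NULL" that holds one character vector per column
books_list <- list(
  "NULL" = list(
    Title     = c("Counte of Monte Cristo", "Automated Data Collection with R"),
    Author1   = c("Alexander Dumas", "Simon Munzert"),
    Author2   = c("NA", "Christian Rubba"),
    Published = c("1844", "2015")
  )
)

# Take the inner list and bind its equal-length vectors into columns
books_df <- as.data.frame(books_list[[1]], stringsAsFactors = FALSE)

# Missing values came back as the string "NA"; replace them with real NA
books_df[books_df == "NA"] <- NA

# The year was stored as text, so convert it to an integer
books_df$Published <- as.integer(books_df$Published)
```

The same `[[1]]` step also applies to Data_HTML, which printed under the same `NULL` heading: readHTMLTable returned a list containing one data frame rather than a bare data frame.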