Week8

Week 8 Assignment: Working with XML and JSON in R:

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author.
For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web.]
=========================================================================================================================================

First, I created three files with details of three data science books and stored them on GitHub in .html, .xml and .csv (for JSON) formats. My goal is to fetch these files with R code and display their data in each format. To start, I read the CSV file into a data frame and converted it to a tbl with tbl_df(). My next step was to convert this data frame to JSON, which I did with the toJSON() function from the RJSONIO package. The resulting JSON can be displayed with cat().

library(xml2)

## Warning: package 'xml2' was built under R version 3.2.2

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.2.2

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## Warning: package 'tidyr' was built under R version 3.2.2

library(rvest)

## Warning: package 'rvest' was built under R version 3.2.2

library(RCurl)

## Warning: package 'RCurl' was built under R version 3.2.2

## Loading required package: bitops
## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:tidyr':
## 
##     complete

library(RJSONIO)

## Warning: package 'RJSONIO' was built under R version 3.2.2

library(XML)

## Warning: package 'XML' was built under R version 3.2.2

## 
## Attaching package: 'XML'
## 
## The following object is masked from 'package:rvest':
## 
##     xml

# Read the CSV from GitHub, wrap it as a tbl, then serialise it to JSON text
csv_data <- read.csv(file = "https://raw.githubusercontent.com/ksanju0/IS607/master/bookscsvtojson.csv", header = TRUE)
conv_DF <- tbl_df(csv_data)
conv_json <- toJSON(conv_DF)
cat(conv_json)

## {
##  "Title": [ "Data Science from Scratch: First Principles with Python", "Data Science for Business: What you need to know about data mining and data-analytic thinking", "Data Smart: Using Data Science to Transform Information into Insight" ],
## "Author": [ "Joel Grus", "Foster Provost, Tom Fawcett", "John Foreman" ],
## "ISBN": [ "978-1491901427", "978-1449361327", "978-1118661468" ],
## "Price": [ "$28.51 ", "$39.99 ", "$27.48 " ],
## "Language": [ "English", "English", "English" ] 
## }
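
The step above stops at JSON text, so here is a rough sketch of how that text might be read back into a data frame with fromJSON() from the same RJSONIO package. The small `books` data frame and its values are stand-ins I made up so the example runs without the GitHub file, and `books_df` is just an illustrative name.

```r
library(RJSONIO)

# Small stand-in for the CSV data (assumed values, so no download is needed)
books <- data.frame(
  Title  = c("Data Science from Scratch", "Data Science for Business", "Data Smart"),
  Author = c("Joel Grus", "Foster Provost, Tom Fawcett", "John Foreman"),
  stringsAsFactors = FALSE
)

# Data frame -> JSON text (one array per column, as in the output above)
books_json <- toJSON(books)

# JSON text -> named list of columns -> data frame
parsed   <- fromJSON(books_json)
books_df <- as.data.frame(lapply(parsed, unlist), stringsAsFactors = FALSE)

print(books_df)
```

Because the JSON is column-oriented, each parsed element maps directly onto one data frame column.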

For the HTML format, I created three tables, one per book, and fetched the page from its URL in R. I then read the HTML tables with readHTMLTable() and displayed the content with head().

# Download the page as text, then parse every <table> in it into a data frame
fetchURL <- getURL("https://raw.githubusercontent.com/ksanju0/IS607/master/bookshtml.html")
tab_content <- readHTMLTable(fetchURL)
head(tab_content)

## $`NULL`
##   Book Title: Data Science from Scratch: First Principles with Python
## 1     Author:                                               Joel Grus
## 2       ISBN:                                          978-1491901427
## 3      Price:                                                  $28.51
## 4   Language:                                                 English
## 
## $`NULL`
##   Book Title:
## 1    Authors:
## 2 Tom Fawcett
## 3       ISBN:
## 4      Price:
## 5   Language:
##   What you need to know about data mining and data-analytic thinking
## 1                                                     Forest Provost
## 2                                                               <NA>
## 3                                                     978-1449361327
## 4                                                             $39.99
## 5                                                            English
## 
## $`NULL`
##   Book Title:
## 1     Author:
## 2       ISBN:
## 3      Price:
## 4   Language:
##   Data Smart: Using Data Science to Transform Information into Insight
## 1                                                         John Foreman
## 2                                                       978-1118661468
## 3                                                               $27.48
## 4                                                              English
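
readHTMLTable() returns a list with one label/value table per book, not a single data frame. One possible way to combine them into one row per book is sketched below. The inline HTML is a simplified stand-in for bookshtml.html (its real markup is not shown here), and `to_row` is a hypothetical helper, so treat this as an illustration only.

```r
library(XML)

# Hypothetical stand-in for bookshtml.html: one two-column table per book
html_text <- paste0(
  "<html><body>",
  "<table><tr><td>Book Title:</td><td>Data Smart</td></tr>",
  "<tr><td>Author:</td><td>John Foreman</td></tr></table>",
  "<table><tr><td>Book Title:</td><td>Data Science from Scratch</td></tr>",
  "<tr><td>Author:</td><td>Joel Grus</td></tr></table>",
  "</body></html>"
)

doc    <- htmlParse(html_text, asText = TRUE)
tables <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)

# Turn one label/value table into a one-row data frame
to_row <- function(tab) {
  labels <- make.names(sub(":$", "", trimws(as.character(tab[[1]]))))
  values <- trimws(as.character(tab[[2]]))
  as.data.frame(as.list(setNames(values, labels)), stringsAsFactors = FALSE)
}

# Stack the one-row data frames (works when every table has the same labels)
books_html <- do.call(rbind, lapply(unname(tables), to_row))
print(books_html)
```

This only works when every table uses the same labels; a table that says "Authors:" where the others say "Author:" would need its labels harmonised first.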

For the XML format, we first fetch the HTTPS URL with getURL() and then parse the XML. Next we get the XML root node and its children, and then extract all the values with sapply().

# Download the XML as text, parse it, then read the text of each <book> child
fetchURL <- getURL("https://raw.githubusercontent.com/ksanju0/IS607/master/booksxml.xml")
parseXML <- xmlParse(fetchURL, useInternalNodes = FALSE)
getroot <- xmlRoot(parseXML)
getchild <- xmlChildren(getroot)
sapply(getchild, xmlValue)

##                                                                                                                                              book 
##                                                      "Data Science from Scratch: First Principles with PythonJoelGrus978-1491901427$28.51English" 
##                                                                                                                                              book 
## "Data Science for Business: What you need to know about data mining and data-analytic thinkingFosterProvostTomFawcett978-1449361327$39.99English" 
##                                                                                                                                              book 
##                                      "Data Smart: Using Data Science to Transform Information into InsightJohnForeman978-1118661468$27.48English"
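
Calling xmlValue() on a whole book node joins the text of all its children into one string, as the output above shows. A possible alternative, sketched here, is to query each field separately with xpathSApply(). The inline XML and its element names are my assumptions, since the real booksxml.xml layout is not shown.

```r
library(XML)

# Hypothetical stand-in for booksxml.xml (element names are assumptions)
xml_text <- "<books>
  <book><title>Data Smart</title><author>John Foreman</author><price>$27.48</price></book>
  <book><title>Data Science from Scratch</title><author>Joel Grus</author><price>$28.51</price></book>
</books>"

doc <- xmlParse(xml_text, asText = TRUE)

# Pull each field with its own XPath query so the values stay separate
books_xml <- data.frame(
  title  = xpathSApply(doc, "//book/title",  xmlValue),
  author = xpathSApply(doc, "//book/author", xmlValue),
  price  = xpathSApply(doc, "//book/price",  xmlValue),
  stringsAsFactors = FALSE
)
print(books_xml)
```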

Another way to get the XML values is to take the XML root and convert it to a data frame with xmlToDataFrame().

# Parse the XML with internal nodes and convert the root straight to a data frame
fetchURL <- getURL("https://raw.githubusercontent.com/ksanju0/IS607/master/booksxml.xml")
parseXML <- xmlParse(fetchURL, useInternalNodes = TRUE)
getroot <- xmlRoot(parseXML)
tbl_DF <- xmlToDataFrame(getroot)
tbl_DF

##                                                                                            title
## 1                                       \tData Science from Scratch: First Principles with Python
## 2 \tData Science for Business: What you need to know about data mining and data-analytic thinking
## 3                          \tData Smart: Using Data Science to Transform Information into Insight
##          author           isbn  price language  Co_Author
## 1      JoelGrus 978-1491901427 $28.51  English       <NA>
## 2 FosterProvost 978-1449361327 $39.99  English TomFawcett
## 3   JohnForeman 978-1118661468 $27.48  English       <NA>
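
The titles above still carry a leading tab and the prices are text with a dollar sign. A small clean-up along the following lines might help before comparing sources. The data frame here is a made-up stand-in that mimics those quirks and is not the real tbl_DF.

```r
# Stand-in for tbl_DF with the same quirks as the output above (values assumed)
tbl_DF <- data.frame(
  title = c("\tData Smart", "\tData Science from Scratch"),
  price = c("$27.48", "$28.51"),
  stringsAsFactors = FALSE
)

# Strip leading/trailing whitespace (including the tab) from the titles
tbl_DF$title <- trimws(tbl_DF$title)

# Drop the dollar sign and convert the price to a number
tbl_DF$price <- as.numeric(sub("$", "", tbl_DF$price, fixed = TRUE))

str(tbl_DF)
```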

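Conclusion: The three formats carry the same book information. What differs is the structure in which R returns it: the JSON result is a column-oriented text string (one array per attribute), while the HTML and XML sources are both read as tables, a list with one table per book for HTML and a single data frame with one row per book for XML.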
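
To answer the "are they identical?" question directly, base R's identical() and all.equal() could be applied once each source is in a data frame. The sketch below uses two made-up data frames (`from_csv` and `from_xml` are illustrative names) that differ only by a trailing space in the price, similar to the JSON output earlier.

```r
# Two made-up data frames standing in for results from different sources
from_csv <- data.frame(title = c("Data Smart", "Data Science from Scratch"),
                       price = c("$27.48 ", "$28.51 "),
                       stringsAsFactors = FALSE)
from_xml <- data.frame(title = c("Data Smart", "Data Science from Scratch"),
                       price = c("$27.48", "$28.51"),
                       stringsAsFactors = FALSE)

# identical() is strict: any difference in values, types or attributes gives FALSE
same_raw <- identical(from_csv, from_xml)

# After trimming the stray whitespace the two sources agree
from_csv$price <- trimws(from_csv$price)
same_clean <- identical(from_csv, from_xml)

print(same_raw)
print(same_clean)

# all.equal() returns TRUE, or a description of what differs
print(all.equal(from_csv, from_xml))
```

identical() answers only yes or no, while all.equal() reports which columns differ, which is usually more helpful when tracking down small formatting differences.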


sanjivek

October 17, 2015
