DATA 607, Week 7: Web Technologies

Goal of Assignment:

First, create three files (XML, HTML, JSON) containing the same data: your favorite books. For each book, include the title, authors, and two or three other attributes that you find interesting.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

#Load required packages
library("rjson")
library(RCurl)

## Loading required package: bitops

library(XML)

## Warning: package 'XML' was built under R version 3.4.3

library(stringr)

# Let's read in the XML file from GitHub
xml.url <- getURL("https://raw.githubusercontent.com/rickidonsingh/Data607/master/books.xml")
file.xml <- xmlParse(file = xml.url)

# Next, let's convert the parsed XML into a df
df.xml <- xmlToDataFrame(file.xml)

#Let's see what it looks like
df.xml

##                    Title          Author
## 1 The Catcher in the Rye     JD Salinger
## 2              Moby Dick Herman Melville
## 3    The Grapes of Wrath  John Steinbeck
##                                                        Brief       ISBN
## 1                Salinger’s sort of autobiographical account 7543321726
## 2     Captain Ahab on the hunt for the monstrous white whale 1503280780
## 3 Poor family driven from their land in the Great Depression 0143039431
##   Pages
## 1   240
## 2   378
## 3   464

# Let's read in the HTML file from GitHub
html.url <- getURL("https://raw.githubusercontent.com/rickidonsingh/Data607/master/books.html")
# readHTMLTable() returns a list with one element per table found in the page
df.html <- readHTMLTable(html.url, header = TRUE, as.data.frame = TRUE)

#Let's see what it looks like
df.html

## $`NULL`
##                    Title          Author
## 1 The Catcher in the Rye     JD Salinger
## 2              Moby Dick Herman Melville
## 3    The Grapes of Wrath  John Steinbeck
##                                                        Brief       ISBN
## 1    Salingerâ\u0080\u0099s sort of autobiographical account 7543321726
## 2     Captain Ahab on the hunt for the monstrous white whale 1503280780
## 3 Poor family driven from their land in the Great Depression 0143039431
##   Pages
## 1   240
## 2   378
## 3   464
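The Brief column for the first book shows "Salingerâ\u0080\u0099s" here, while the XML version printed "Salinger’s". That pattern is what UTF-8 bytes look like when they are read as Latin-1. Here is a minimal base R sketch of the effect on a stand-in string. The variable names are made up for illustration, and this only assumes (it does not confirm) that the same mismatch happened inside the HTML parse.

```r
# Stand-in for the text as stored in the file: a curly apostrophe (U+2019)
original <- "Salinger\u2019s"

# Take the raw UTF-8 bytes and (wrongly) declare them to be Latin-1
garbled <- rawToChar(charToRaw(original))
Encoding(garbled) <- "latin1"

# Declaring the same bytes as UTF-8 again repairs the string
repaired <- garbled
Encoding(repaired) <- "UTF-8"

identical(garbled, original)   # the mislabeled string no longer matches
identical(repaired, original)  # the relabeled string matches again
```

If this is the cause, re-marking the affected column with Encoding() might be enough to bring the HTML version in line with the XML one.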

# Let's read in the JSON file from GitHub
json.url <- getURL("https://raw.githubusercontent.com/rickidonsingh/Data607/master/books.json")
file.json <- json.url

# Next, let's parse the JSON string into a nested list and coerce it to a df
data.json <- fromJSON(file.json)
df.json <- as.data.frame(data.json)

#Let's see what it looks like
df.json

##    book_table.book.Title book_table.book.Author
## 1 The Catcher in the Rye            JD Salinger
##                         book_table.book.Brief book_table.book.ISBN
## 1 Salinger’s sort of autobiographical account           7543321726
##   book_table.book.Pages book_table.book.Title.1 book_table.book.Author.1
## 1                   240               Moby Dick          Herman Melville
##                                  book_table.book.Brief.1
## 1 Captain Ahab on the hunt for the monstrous white whale
##   book_table.book.ISBN.1 book_table.book.Pages.1 book_table.book.Title.2
## 1             1503280780                     378     The Grapes of Wrath
##   book_table.book.Author.2
## 1           John Steinbeck
##                                      book_table.book.Brief.2
## 1 Poor family driven from their land in the Great Depression
##   book_table.book.ISBN.2 book_table.book.Pages.2
## 1             0143039431                     464

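No. The three sources hold the same book data, but the objects loaded into R are not identical, as the str() output below shows. df.xml is a data frame with 3 observations of 5 variables. df.html is a list whose single element is a 3 x 5 data frame, and its Brief column shows the apostrophe in "Salinger’s" garbled by an encoding mismatch. df.json is a single row of 15 columns, because as.data.frame() flattened the nested list into one wide row.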

str(df.xml)

## 'data.frame':    3 obs. of  5 variables:
##  $ Title : Factor w/ 3 levels "Moby Dick","The Catcher in the Rye",..: 2 1 3
##  $ Author: Factor w/ 3 levels "Herman Melville",..: 2 1 3
##  $ Brief : Factor w/ 3 levels "Captain Ahab on the hunt for the monstrous white whale",..: 3 1 2
##  $ ISBN  : Factor w/ 3 levels "0143039431","1503280780",..: 3 2 1
##  $ Pages : Factor w/ 3 levels "240","378","464": 1 2 3
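Every column came in as a factor, including Pages and ISBN. Here is a minimal sketch of converting such columns, using small stand-in vectors rather than the real df.xml (pages.f and isbn.f are hypothetical names). Calling as.integer() directly on a factor returns the level codes, so going through as.character() first is the usual route. ISBN is probably best kept as text so the leading zero survives.

```r
# Stand-in for a factor column like df.xml$Pages (hypothetical name)
pages.f <- factor(c("240", "378", "464"))

# as.integer() on a factor gives the internal level codes, not the labels
codes <- as.integer(pages.f)

# Convert via character to recover the numbers that were actually stored
pages.int <- as.integer(as.character(pages.f))

# Stand-in for df.xml$ISBN; keep it as character so "0143039431" keeps its zero
isbn.f <- factor(c("7543321726", "1503280780", "0143039431"))
isbn.chr <- as.character(isbn.f)

print(codes)
print(pages.int)
print(isbn.chr)
```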

str(df.html)

## List of 1
##  $ NULL:'data.frame':    3 obs. of  5 variables:
##   ..$ Title : Factor w/ 3 levels "Moby Dick","The Catcher in the Rye",..: 2 1 3
##   ..$ Author: Factor w/ 3 levels "Herman Melville",..: 2 1 3
##   ..$ Brief : Factor w/ 3 levels "Captain Ahab on the hunt for the monstrous white whale",..: 3 1 2
##   ..$ ISBN  : Factor w/ 3 levels "0143039431","1503280780",..: 3 2 1
##   ..$ Pages : Factor w/ 3 levels "240","378","464": 1 2 3
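Since df.html is a list of one data frame and not a data frame itself, a direct comparison with df.xml cannot succeed until the table is pulled out of the list. A small sketch with stand-in data (the names books, tables, and first.table are made up for illustration):

```r
# Small stand-in data frame (two of the books from above)
books <- data.frame(
  Title = c("Moby Dick", "The Grapes of Wrath"),
  Pages = c("378", "464"),
  stringsAsFactors = FALSE
)

# Stand-in for what readHTMLTable() gave us: a list holding one data frame
tables <- list(books)

# The list itself is not a data frame, so this comparison fails
identical(books, tables)

# Pulling out the first element gives a data frame that does match
first.table <- tables[[1]]
identical(books, first.table)
```

With the real objects, the equivalent step would be comparing df.xml against df.html[[1]]. Whether that comparison passes also depends on the encoding of the Brief column.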

str(df.json)

## 'data.frame':    1 obs. of  15 variables:
##  $ book_table.book.Title   : Factor w/ 1 level "The Catcher in the Rye": 1
##  $ book_table.book.Author  : Factor w/ 1 level "JD Salinger": 1
##  $ book_table.book.Brief   : Factor w/ 1 level "Salinger’s sort of autobiographical account": 1
##  $ book_table.book.ISBN    : Factor w/ 1 level "7543321726": 1
##  $ book_table.book.Pages   : Factor w/ 1 level "240": 1
##  $ book_table.book.Title.1 : Factor w/ 1 level "Moby Dick": 1
##  $ book_table.book.Author.1: Factor w/ 1 level "Herman Melville": 1
##  $ book_table.book.Brief.1 : Factor w/ 1 level "Captain Ahab on the hunt for the monstrous white whale": 1
##  $ book_table.book.ISBN.1  : Factor w/ 1 level "1503280780": 1
##  $ book_table.book.Pages.1 : Factor w/ 1 level "378": 1
##  $ book_table.book.Title.2 : Factor w/ 1 level "The Grapes of Wrath": 1
##  $ book_table.book.Author.2: Factor w/ 1 level "John Steinbeck": 1
##  $ book_table.book.Brief.2 : Factor w/ 1 level "Poor family driven from their land in the Great Depression": 1
##  $ book_table.book.ISBN.2  : Factor w/ 1 level "0143039431": 1
##  $ book_table.book.Pages.2 : Factor w/ 1 level "464": 1
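The JSON result is one wide row because as.data.frame() flattens a nested list. One way to get one row per book is to build a small data frame for each book and stack them. The sketch below uses a hand-built list as a stand-in for data.json. The nesting (book_table, then book, then one entry per book) is an assumption inferred from the column names above, and only four of the five fields are included to keep it short.

```r
# Stand-in for the list returned by fromJSON(); the nesting is an assumption
data.json <- list(book_table = list(book = list(
  list(Title = "The Catcher in the Rye", Author = "JD Salinger",
       ISBN = "7543321726", Pages = "240"),
  list(Title = "Moby Dick", Author = "Herman Melville",
       ISBN = "1503280780", Pages = "378"),
  list(Title = "The Grapes of Wrath", Author = "John Steinbeck",
       ISBN = "0143039431", Pages = "464")
)))

# Pull out the list of books
books <- data.json$book_table$book

# Turn each book into a one-row data frame, then stack the rows
rows <- lapply(books, function(b) as.data.frame(b, stringsAsFactors = FALSE))
df.json.long <- do.call(rbind, rows)

dim(df.json.long)
names(df.json.long)
```

This gives the same one-row-per-book layout as the XML and HTML versions, which makes a column-by-column comparison possible.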

RSingh

3/18/2018
