Assignment – Working with XML and JSON in R

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an HTML table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Load the libraries.

library(RCurl)

## Loading required package: bitops

library(plyr)
library(XML)
library(jsonlite)
library(knitr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

I created tables describing three books that I have read and stored them in HTML, JSON, and XML formats. Now we will parse each file into an R data frame.

Parsing HTML

Get the HTML file from GitHub.

baseURL <- "https://raw.githubusercontent.com/Sizzlo/books/master/3Books.html"
txt <- getURL(url=baseURL)

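# Parse the downloaded text as HTML and collect every row in the table body
xmltext <- htmlParse(txt, asText=TRUE)
xmltable <- xpathApply(xmltext, "//table//tbody//tr")

# For each row, keep the children at positions 1, 3, 5 and 7, which hold the four cell values in this file
html_books <- as.data.frame(t(sapply(xmltable, function(x) unname(xmlSApply(x, xmlValue))[c(1, 3, 5, 7)])))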

# Rename the columns
colnames(html_books) <- c("Title", "Author(s)", "Publisher", "Genre")
html_books

##                                    Title         Author(s)
## 1  Harry Potter and the Sorcerer's Stone       J.K.Rowling
## 2 Homo Deus: A Brief History of Tomorrow Yuval Noah Harari
## 3                     Rich Dad, Poor Dad   Robert Kiyosaki
##                Publisher              Genre
## 1 Scholastic Corporation Fantasy Literature
## 2        Harvill Secker         Non-fiction
## 3        Warner Books Ed   Personal Finance
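The positional indices `c(1, 3, 5, 7)` above depend on how the rows happen to be laid out in the file. A minimal sketch of an alternative, assuming a table whose data rows contain `<td>` cells, is to select those cells by XPath instead. The inline HTML string, the book values, and the name `sketch_books` are hypothetical stand-ins, not the contents of 3Books.html.

```r
library(XML)

# Hypothetical two-row table standing in for the real file (values invented for illustration)
html_txt <- '<table>
<thead><tr><th>Title</th><th>Authors</th></tr></thead>
<tbody>
<tr>
  <td>Book A</td>
  <td>Author One</td>
</tr>
<tr>
  <td>Book B</td>
  <td>Author Two, Author Three</td>
</tr>
</tbody>
</table>'

doc <- htmlParse(html_txt, asText = TRUE)

# Keep only rows that contain <td> cells, which skips the header row
rows <- getNodeSet(doc, "//table//tr[td]")

# For each row, read the text of its <td> children directly
cells <- lapply(rows, function(row) xpathSApply(row, "./td", xmlValue))

# Stack the rows into a data frame and take the column names from the <th> cells
sketch_books <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
colnames(sketch_books) <- xpathSApply(doc, "//table//th", xmlValue)
sketch_books
```

Selecting `./td` ignores any whitespace or text nodes between cells, so the same code should keep working if the indentation of the HTML file changes.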

Parsing JSON

Load the JSON file from GitHub.

baseURL2 <- "https://raw.githubusercontent.com/Sizzlo/books/master/3books.json"
txt2 <- getURL(url=baseURL2)

json_books <- fromJSON(txt2)
json_books

##                                    Title           Authors
## 1  Harry Potter and the Sorcerer's Stone       J.K.Rowling
## 2 Homo Deus: A Brief History of Tomorrow Yuval Noah Harari
## 3                     Rich Dad, Poor Dad   Robert Kiyosaki
##                Publisher              Genre
## 1 Scholastic Corporation Fantasy Literature
## 2         Harvill Secker        Non-fiction
## 3        Warner Books Ed   Personal Finance

The jsonlite package seems like the simplest way to convert the file: a single fromJSON() call returns a data frame directly.
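The assignment asks for at least one book with more than one author. A minimal sketch of how that case might be handled, assuming the authors are stored as a JSON array per book: `fromJSON()` should then return `Authors` as a list column, which can be collapsed to one string per book. The inline JSON and the name `sketch_json` are hypothetical and do not reflect the structure of 3books.json.

```r
library(jsonlite)

# Hypothetical records with authors stored as arrays (values invented for illustration)
json_txt <- '[
  {"Title": "Book A", "Authors": ["Author One"]},
  {"Title": "Book B", "Authors": ["Author Two", "Author Three"]}
]'

sketch_json <- fromJSON(json_txt)

# Authors holds one character vector per book; collapse each into a single string
sketch_json$Authors <- vapply(sketch_json$Authors, paste, character(1), collapse = ", ")
sketch_json
```

Collapsing the list column keeps the data frame flat, which makes it easier to compare with the HTML and XML versions.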

Parsing XML

Load the XML file from GitHub.

baseURL3 <- "https://raw.githubusercontent.com/Sizzlo/books/master/3books.xml"
txt3 <- getURL(url=baseURL3)

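# Parse the downloaded text without DTD validation
xml_books <- xmlParse(txt3, validate = FALSE)

# Convert the parsed document to a list, then bind each book into one data frame row
xml_books1 <- ldply(xmlToList(xml_books), data.frame)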
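As a possible alternative to going through a list, the XML package's `xmlToDataFrame()` can build the data frame from a set of nodes. The sketch below assumes each book is a `<book>` element whose children are simple text fields. The element names, values, and the name `sketch_xml` are hypothetical and may not match 3books.xml.

```r
library(XML)

# Hypothetical XML with one <book> element per record (values invented for illustration)
xml_txt <- '<books>
  <book><Title>Book A</Title><Authors>Author One</Authors></book>
  <book><Title>Book B</Title><Authors>Author Two</Authors></book>
</books>'

doc <- xmlParse(xml_txt, asText = TRUE)

# Each <book> node becomes one row; its child element names become the columns
sketch_xml <- xmlToDataFrame(nodes = getNodeSet(doc, "//book"), stringsAsFactors = FALSE)
sketch_xml
```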

Select the desired columns: Title, Authors, Publisher, and Genre.

xml_books1 <- xml_books1 %>%
  select(Title, Authors, Publisher, Genre)
xml_books1

##                                    Title           Authors
## 1  Harry Potter and the Sorcerer's Stone       J.K.Rowling
## 2 Homo Deus: A Brief History of Tomorrow Yuval Noah Harari
## 3                     Rich Dad, Poor Dad   Robert Kiyosaki
##                Publisher              Genre
## 1 Scholastic Corporation Fantasy Literature
## 2         Harvill Secker        Non-fiction
## 3        Warner Books Ed   Personal Finance
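The assignment also asks whether the three data frames are identical. The printed results look the same, but the HTML version names its second column "Author(s)" rather than "Authors". Its row 2 Publisher also prints one position to the left, which would be consistent with a trailing space in the source file (an inference from the printout, not something verified here). A minimal sketch of a comparison that accounts for such differences follows. The data frames `html_like` and `json_like` and the helper `normalize_books()` are hypothetical stand-ins, not the objects created above.

```r
# Hypothetical stand-ins for two of the parsed data frames (values invented for illustration)
html_like <- data.frame(
  Title = c("Book A", "Book B"),
  `Author(s)` = c("Author One", "Author Two"),
  Publisher = c("Pub X", "Pub Y "),  # note the trailing space
  check.names = FALSE,
  stringsAsFactors = TRUE
)
json_like <- data.frame(
  Title = c("Book A", "Book B"),
  Authors = c("Author One", "Author Two"),
  Publisher = c("Pub X", "Pub Y"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert every column to trimmed character and apply shared names
normalize_books <- function(df, col_names) {
  out <- as.data.frame(
    lapply(df, function(col) trimws(as.character(col))),
    stringsAsFactors = FALSE
  )
  names(out) <- col_names
  out
}

shared_names <- c("Title", "Authors", "Publisher")

# Strict comparison of the raw data frames, then of the normalized ones
raw_identical <- identical(html_like, json_like)
clean_identical <- identical(
  normalize_books(html_like, shared_names),
  normalize_books(json_like, shared_names)
)
c(raw = raw_identical, clean = clean_identical)
```

In this sketch the raw comparison fails on column names, column types, and whitespace, while the normalized comparison succeeds. The same helper could be applied to `html_books`, `json_books`, and `xml_books1` with the four shared column names.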

Data607 HW7

Tony Mei

10/13/2019
