Data 607: Working with XML and JSON in R

David Quarshie

October 10, 2017

Assignment

Make three files (HTML, XML, and JSON) that each contain information on three books. Load the files into R and compare the resulting data frames to see whether there are any differences.

Load Libraries

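library(knitr)     # kable() for table output
library(RCurl)     # getURL() to fetch the raw files over HTTPS
library(XML)       # readHTMLTable() and xmlToDataFrame()
library(jsonlite)  # fromJSON()
library(plyr)      # rename()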

Load HTML File

htmlurl <- getURL("https://raw.githubusercontent.com/dquarshie89/Data607/master/books.html")
html <- readHTMLTable(htmlurl, header=TRUE, which=1)
knitr::kable(html)
| Title | Author | Publisher | Pages | Rating |
|:------|:-------|:----------|:------|:-------|
| R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics | Paul Teetor | O’Reilly Media | 438 | 4.5 |
| R for Data Science: Import, Tidy, Transform, Visualize, and Model Data | Hadley Wickham | O’Reilly Media | 522 | 5 |
| R Graphics Cookbook: Practical Recipes for Visualizing Data | Winston Chang | O’Reilly Media | 416 | 4.5 |

Load XML File

xmlurl <- getURL("https://raw.githubusercontent.com/dquarshie89/Data607/master/books.xml")
xml <- xmlToDataFrame(xmlurl)
knitr::kable(xml)
| Title | Author | Publisher | Pages | Rating |
|:------|:-------|:----------|:------|:-------|
| R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics | Paul Teetor | O’Reilly Media | 438 | 4.5 |
| R for Data Science: Import, Tidy, Transform, Visualize, and Model Data | Hadley Wickham | O’Reilly Media | 522 | 5 |
| R Graphics Cookbook: Practical Recipes for Visualizing Data | Winston Chang | O’Reilly Media | 416 | 4.5 |

Load JSON File

jsonurl <- getURL("https://raw.githubusercontent.com/dquarshie89/Data607/master/books.json")
json <- fromJSON(jsonurl)
json <- data.frame(json)
json
##                                                   books.table.book.Title
## 1 R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics
## 2 R for Data Science: Import, Tidy, Transform, Visualize, and Model Data
## 3            R Graphics Cookbook: Practical Recipes for Visualizing Data
##   books.table.book.Author books.table.book.Publisher
## 1             Paul Teetor             O'Reilly Media
## 2          Hadley Wickham             O'Reilly Media
## 3           Winston Chang             O'Reilly Media
##   books.table.book.Pages books.table.book.Rating
## 1                    438                     4.5
## 2                    522                       5
## 3                    416                     4.5
json <- rename(json, c("books.table.book.Title"="Title",
               "books.table.book.Author"="Author",
               "books.table.book.Publisher"="Publisher",
               "books.table.book.Pages"="Pages",
               "books.table.book.Rating"="Rating"
               ))
knitr::kable(json)
| Title | Author | Publisher | Pages | Rating |
|:------|:-------|:----------|:------|:-------|
| R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics | Paul Teetor | O’Reilly Media | 438 | 4.5 |
| R for Data Science: Import, Tidy, Transform, Visualize, and Model Data | Hadley Wickham | O’Reilly Media | 522 | 5 |
| R Graphics Cookbook: Practical Recipes for Visualizing Data | Winston Chang | O’Reilly Media | 416 | 4.5 |
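
The renaming step lists each column by hand. As a minimal sketch of an alternative that does not need plyr, the shared prefix can be stripped from the column names with base R's sub(). The data frame json_demo below is a hand-built stand-in with shortened titles (an assumption for illustration, not the file loaded from GitHub); only the "books.table.book." prefix is taken from the console output shown in this section.

```r
# Hypothetical stand-in for the flattened JSON data frame; the column
# names mimic the "books.table.book." prefix seen in the console output.
json_demo <- data.frame(
  books.table.book.Title  = c("R Cookbook", "R for Data Science"),
  books.table.book.Author = c("Paul Teetor", "Hadley Wickham"),
  books.table.book.Pages  = c("438", "522"),
  stringsAsFactors = FALSE
)

# Remove the common prefix from every column name in one step.
# fixed = TRUE treats the dots literally instead of as regex wildcards.
names(json_demo) <- sub("books.table.book.", "", names(json_demo), fixed = TRUE)

print(names(json_demo))
```

This keeps working if more fields are added to each book, whereas an explicit mapping has to be extended by hand.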

Compare JSON, XML, and HTML

knitr::kable(json == xml)
| Title | Author | Publisher | Pages | Rating |
|:------|:-------|:----------|:------|:-------|
| TRUE | TRUE | TRUE | TRUE | TRUE |
| TRUE | TRUE | TRUE | TRUE | TRUE |
| TRUE | TRUE | TRUE | TRUE | TRUE |
knitr::kable(json == html)
| Title | Author | Publisher | Pages | Rating |
|:------|:-------|:----------|:------|:-------|
| TRUE | TRUE | TRUE | TRUE | TRUE |
| TRUE | TRUE | TRUE | TRUE | TRUE |
| TRUE | TRUE | TRUE | TRUE | TRUE |
knitr::kable(html == xml)
| Title | Author | Publisher | Pages | Rating |
|:------|:-------|:----------|:------|:-------|
| TRUE | TRUE | TRUE | TRUE | TRUE |
| TRUE | TRUE | TRUE | TRUE | TRUE |
| TRUE | TRUE | TRUE | TRUE | TRUE |
str(json)
## 'data.frame':    3 obs. of  5 variables:
##  $ Title    : chr  "R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics" "R for Data Science: Import, Tidy, Transform, Visualize, and Model Data" "R Graphics Cookbook: Practical Recipes for Visualizing Data"
##  $ Author   : chr  "Paul Teetor" "Hadley Wickham" "Winston Chang"
##  $ Publisher: chr  "O'Reilly Media" "O'Reilly Media" "O'Reilly Media"
##  $ Pages    : chr  "438" "522" "416"
##  $ Rating   : chr  "4.5" "5" "4.5"
str(xml)
## 'data.frame':    3 obs. of  5 variables:
##  $ Title    : Factor w/ 3 levels "R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics",..: 1 2 3
##  $ Author   : Factor w/ 3 levels "Hadley Wickham",..: 2 1 3
##  $ Publisher: Factor w/ 1 level "O'Reilly Media": 1 1 1
##  $ Pages    : Factor w/ 3 levels "416","438","522": 2 3 1
##  $ Rating   : Factor w/ 2 levels "4.5","5": 1 2 1
str(html)
## 'data.frame':    3 obs. of  5 variables:
##  $ Title    : Factor w/ 3 levels "R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics",..: 1 2 3
##  $ Author   : Factor w/ 3 levels "Hadley Wickham",..: 2 1 3
##  $ Publisher: Factor w/ 1 level "O'Reilly Media": 1 1 1
##  $ Pages    : Factor w/ 3 levels "416","438","522": 2 3 1
##  $ Rating   : Factor w/ 2 levels "4.5","5": 1 2 1
sapply(json, typeof)
##       Title      Author   Publisher       Pages      Rating 
## "character" "character" "character" "character" "character"
sapply(xml, typeof)
##     Title    Author Publisher     Pages    Rating 
## "integer" "integer" "integer" "integer" "integer"
sapply(html, typeof)
##     Title    Author Publisher     Pages    Rating 
## "integer" "integer" "integer" "integer" "integer"
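
The "integer" result for the XML and HTML columns follows from how R stores factors: a factor is an integer vector of codes together with a character vector of levels. A minimal sketch with a hand-built vector (the name pages_factor is illustrative, not an object from the analysis) shows how typeof() and class() differ, and why a numeric-looking factor should be converted through character first.

```r
# A factor built from the three page counts in the tables.
pages_factor <- factor(c("438", "522", "416"))

typeof(pages_factor)      # storage type of the underlying codes
class(pages_factor)       # the class R dispatches on
levels(pages_factor)      # sorted unique labels
as.integer(pages_factor)  # the codes, i.e. positions within levels()

# Go through character before converting to numeric; otherwise the
# codes are returned instead of the page counts.
pages_numeric <- as.numeric(as.character(pages_factor))
print(pages_numeric)
```

The codes match the "2 3 1" shown for the Pages column in the str() output, since the levels are sorted as "416", "438", "522".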

Conclusion

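All values in the three files are equal, but the str() and typeof() output shows that the resulting data frames differ in column type. The JSON import stores every column as character, while the HTML and XML imports store every column as a factor, which is why typeof() reports "integer" for them. The == comparison checks only the values, so it returns TRUE everywhere despite the type difference.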
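
One way to act on this finding, sketched under the assumption that the goal is data frames with identical column types, is to convert factor columns to character and then compare with identical(), which checks types as well as values. The data frames from_json and from_xml and the helper factors_to_character() are illustrative stand-ins, not part of the original analysis.

```r
# Stand-ins for two of the imports: same values, different column types.
from_json <- data.frame(Author = c("Paul Teetor", "Hadley Wickham"),
                        Pages  = c("438", "522"),
                        stringsAsFactors = FALSE)
from_xml  <- data.frame(Author = c("Paul Teetor", "Hadley Wickham"),
                        Pages  = c("438", "522"),
                        stringsAsFactors = TRUE)

# Element-wise comparison looks only at values, so every cell is TRUE.
all(from_json == from_xml)

# identical() also compares column types, so it is FALSE here.
identical(from_json, from_xml)

# Hypothetical helper: turn every factor column into a character column.
factors_to_character <- function(df) {
  df[] <- lapply(df, function(col) if (is.factor(col)) as.character(col) else col)
  df
}

from_xml_chr <- factors_to_character(from_xml)
identical(from_json, from_xml_chr)
```

xmlToDataFrame() takes a stringsAsFactors argument, and readHTMLTable() is commonly called with stringsAsFactors = FALSE as well, so setting it at import time would likely avoid the conversion step altogether.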