Working with XML and JSON in R

This assignment requires us to create objects in HTML, XML, and JSON, then load each one into its own data frame and compare the results.

I chose woodworking as the subject and selected the top three books from an Amazon search.

I created the HTML, XML, and JSON as text files, saved them, and uploaded them to the web.

Let's load some packages and take a look:

Packages

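library(RCurl)     # getURLContent() to download the raw files
library(XML)       # readHTMLTable(), xmlParse(), xmlToDataFrame()
library(jsonlite)  # fromJSON()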

HTML

html_o <- 'https://raw.githubusercontent.com/dataconsumer101/data607/master/books.html'
html_c <- getURLContent(html_o)
html_l <- readHTMLTable(html_c)
html_df <- html_l[[1]]
head(html_df)
##                                Title
## 1 The Complete Manual of Woodworking
## 2                  Essentail Joinery
## 3     Great Book of Woodworking Tips
##                                                                                            Subtitle
## 1                     A Detailed Guide to Design, Techniques, and Tools for the Beginner and Expert
## 2                                           The Fundamental Techniques Every Woodworker Should Know
## 3 Over 650 Ingenious Workshop Tips, Techniques, and Secrets from the Experts at American Woodworker
##                      Author Year     Cover Edition Pages Price
## 1 Albert Jackson, David Day 1996 Paperback       1   320 19.89
## 2            Marc Spagnuolo 2019 Paperback       1   216 17.75
## 3             Randy Johnson 2012 Paperback       1   336 15.99
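The column types are worth checking after the import. Depending on the R and XML package versions, readHTMLTable() may return text columns as factors, and it generally leaves numbers as text. Here is a minimal sketch of one way to handle that, using a small inline table as a stand-in for books.html. The inline HTML and the names html_txt, toy_doc, and toy_html_df are made up for illustration, and the table is cut down to two columns.

```r
library(XML)

# Stand-in for the downloaded page: two rows, two columns (illustrative only)
html_txt <- paste0(
  "<html><body><table>",
  "<tr><th>Title</th><th>Price</th></tr>",
  "<tr><td>The Complete Manual of Woodworking</td><td>19.89</td></tr>",
  "<tr><td>Great Book of Woodworking Tips</td><td>15.99</td></tr>",
  "</table></body></html>"
)

# Parse the string as HTML text rather than treating it as a file name
toy_doc <- htmlParse(html_txt, asText = TRUE)

# stringsAsFactors = FALSE is passed on to as.data.frame(), so text stays character
toy_html_df <- readHTMLTable(toy_doc, stringsAsFactors = FALSE)[[1]]

# Numbers still arrive as text, so convert them explicitly
toy_html_df$Price <- as.numeric(toy_html_df$Price)

str(toy_html_df)
```

The same two steps should carry over to html_df, with Year, Edition, Pages, and Price as the columns to convert.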

XML

xml_o <- 'https://raw.githubusercontent.com/dataconsumer101/data607/master/books.xml'
xml_c <- getURLContent(xml_o)
xml_p <- xmlParse(xml_c)
xml_df <- xmlToDataFrame(xml_p)
head(xml_df)
##                                TITLE
## 1 The Complete Manual of Woodworking
## 2                  Essential Joinery
## 3     Great Book of Woodworking Tips
##                                                                                            SUBTITLE
## 1                     A Detailed Guide to Design, Techniques, and Tools for the Beginner and Expert
## 2                                           The Fundamental Techniques Every Woodworker Should Know
## 3 Over 650 Ingenious Workshop Tips, Techniques, and Secrets from the Experts at American Woodworker
##          AUTHOR  AUTHOR2 YEAR     COVER EDITION PAGES PRICE
## 1 AlbertJackson DavidDay 1996 Paperback       1   320 19.89
## 2 MarcSpagnuolo     <NA> 2019 Paperback       1   216 17.75
## 3  RandyJohnson     <NA> 2012 Paperback       1   336 15.99
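Two things stand out in the XML result: the second author sits in its own AUTHOR2 column, and the names have lost their spaces. The XML file itself is not shown here, so the following is only a guess at one structure that would produce this kind of output. It relies on xmlValue() joining the text of nested child elements without a separator. The FIRST and LAST tags and the names xml_txt, toy_xml, and toy_xml_df are assumptions for illustration, not the actual contents of books.xml.

```r
library(XML)

# Hypothetical structure: each author name is split into nested FIRST/LAST elements
xml_txt <- paste0(
  "<BOOKS>",
  "<BOOK><TITLE>The Complete Manual of Woodworking</TITLE>",
  "<AUTHOR><FIRST>Albert</FIRST><LAST>Jackson</LAST></AUTHOR>",
  "<AUTHOR2><FIRST>David</FIRST><LAST>Day</LAST></AUTHOR2></BOOK>",
  "<BOOK><TITLE>Great Book of Woodworking Tips</TITLE>",
  "<AUTHOR><FIRST>Randy</FIRST><LAST>Johnson</LAST></AUTHOR></BOOK>",
  "</BOOKS>"
)

# Parse the string directly instead of reading from a URL
toy_xml <- xmlParse(xml_txt, asText = TRUE)

# One row per BOOK, one column per distinct child element name
toy_xml_df <- xmlToDataFrame(toy_xml, stringsAsFactors = FALSE)

toy_xml_df$AUTHOR   # text of the nested FIRST and LAST elements, joined together
toy_xml_df$AUTHOR2  # missing for the book that has no second author
```

If the real file is built this way, storing each name as plain text inside AUTHOR would keep the space.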

JSON

json_o <- 'https://raw.githubusercontent.com/dataconsumer101/data607/master/books.json'
json_r <- fromJSON(json_o)
json_df <- json_r[[1]]
head(json_df)
##                                Title
## 1 The Complete Manual of Woodworking
## 2                  Essentail Joinery
## 3     Great Book of Woodworking Tips
##                                                                                            Subtitle
## 1                     A Detailed Guide to Design, Techniques, and Tools for the Beginner and Expert
## 2                                           The Fundamental Techniques Every Woodworker Should Know
## 3 Over 650 Ingenious Workshop Tips, Techniques, and Secrets from the Experts at American Woodworker
##                      Author Year     Cover Edition Pages Price
## 1 Albert Jackson, David Day 1996 Paperback       1   320 19.89
## 2            Marc Spagnuolo 2019 Paperback       1   216 17.75
## 3             Randy Johnson 2012 Paperback       1   336 15.99
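Unlike the other two imports, fromJSON() can keep a JSON array as a vector inside a single cell, which makes that column a list. A quick way to spot such columns is sketched below, with an inline string standing in for books.json. The JSON text and the names json_txt, toy_books, and list_cols are illustrative assumptions, and the real file may be laid out differently.

```r
library(jsonlite)

# Stand-in for books.json: authors stored as arrays of different lengths
json_txt <- '{"books": [
  {"Title": "The Complete Manual of Woodworking",
   "Author": ["Albert Jackson", "David Day"]},
  {"Title": "Great Book of Woodworking Tips",
   "Author": ["Randy Johnson"]}
]}'

# The top-level object has one element, which holds the data frame
toy_books <- fromJSON(json_txt)[[1]]

# Names of the columns that are lists rather than plain vectors
list_cols <- names(toy_books)[vapply(toy_books, is.list, logical(1))]
list_cols

# Number of authors recorded for each book
lengths(toy_books$Author)
```

List columns print fine but can trip up functions such as write.csv(), so it helps to know which ones they are before going further.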

Differences in Author

html_df[1,]$Author
## [1] Albert Jackson, David Day
## Levels: Albert Jackson, David Day Marc Spagnuolo Randy Johnson
xml_df[1,]$AUTHOR
## [1] "AlbertJackson"
xml_df[1,]$AUTHOR2
## [1] "DavidDay"
json_df[1,]$Author
## [[1]]
## [1] "Albert Jackson" "David Day"
  • For the most part, the data looks the same in each format, as long as there is only one value per field.
  • The first book has two authors, and here the formats differ: HTML gives a single comma-separated string (stored as a factor), XML gives separate AUTHOR and AUTHOR2 columns, and JSON gives a list column holding a character vector. These differences also reflect how the values were entered into each source file.
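To check that the three versions really hold the same authors, each one can be reshaped into a list of character vectors, one vector per book. This is only a sketch in base R: the small data frames below are typed in by hand to mirror the printed output above, and the names html_a, xml_a, json_a, and squash() are hypothetical helpers, not part of the assignment code.

```r
# Hand-typed copies of the author columns as printed above (illustrative only)
html_a <- data.frame(
  Author = c("Albert Jackson, David Day", "Marc Spagnuolo", "Randy Johnson"),
  stringsAsFactors = TRUE
)
xml_a <- data.frame(
  AUTHOR  = c("AlbertJackson", "MarcSpagnuolo", "RandyJohnson"),
  AUTHOR2 = c("DavidDay", NA, NA),
  stringsAsFactors = FALSE
)
json_a <- data.frame(id = 1:3)
json_a$Author <- list(c("Albert Jackson", "David Day"),
                      "Marc Spagnuolo", "Randy Johnson")

# HTML: convert the factor to text, then split on the comma
from_html <- strsplit(as.character(html_a$Author), ",\\s*")

# XML: combine the two columns row by row and drop the missing second author
from_xml <- unname(Map(function(a, b) {
  both <- c(a, b)
  both[!is.na(both)]
}, xml_a$AUTHOR, xml_a$AUTHOR2))

# JSON: already a list of character vectors
from_json <- json_a$Author

# The XML names have no spaces, so strip whitespace before comparing
squash <- function(authors) lapply(authors, function(a) gsub("\\s+", "", a))

identical(squash(from_html), from_xml)
identical(squash(from_json), from_xml)
```

Both comparisons come back TRUE for these hand-typed copies, which suggests the formats differ in shape rather than in content.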