Homework 5 - Working with XML and JSON in R

Nick Oliver

Books

Here is a table of the three books I chose and their properties.

| Title | Authors | Page count | Publication year | Current price on Amazon |
|---|---|---|---|---|
| Real World Haskell | Bryan O’Sullivan, John Goerzen, Don Stewart | 670 | 2008 | $49.99 |
| Programming in Haskell | Graham Hutton | 171 | 2007 | $18.97 |
| Learn You a Haskell for Great Good!: A Beginner’s Guide | Miran Lipovača | 400 | 2011 | $42.21 |

JSON

Use the jsonlite library's fromJSON function and cast the result to a data frame.

library(jsonlite)
jsonDf <- as.data.frame(fromJSON("https://raw.githubusercontent.com/nolivercuny/data607/master/homework5/books.json"))
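To illustrate what fromJSON produces, here is a minimal sketch on an inline JSON string. The top-level "books" key and the field names are assumptions inferred from the resulting column names, not copied from books.json, and only two of the books are included. It shows that an array-valued field such as the authors comes back as a list column, and one way to collapse it into plain strings.

```r
library(jsonlite)

# Hypothetical inline JSON; the structure is an assumption, not the real file.
jsonText <- '{
  "books": [
    {"title": "Real World Haskell",
     "authors": ["Bryan O\'Sullivan", "John Goerzen", "Don Stewart"],
     "pageCount": 670},
    {"title": "Programming in Haskell",
     "authors": ["Graham Hutton"],
     "pageCount": 171}
  ]
}'

# fromJSON simplifies the array of objects into a data frame
booksDf <- fromJSON(jsonText)$books

# authors is a list column: one character vector per book
str(booksDf$authors)

# Collapse each author vector into a single comma-separated string
booksDf$authors <- vapply(booksDf$authors, paste, character(1), collapse = ", ")
```

Flattening the list column this way is optional, but it makes the JSON result easier to compare with sources that store the authors as one string.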

XML

Use the read_xml function of the xml2 library to read the document.

Convert the XML to a list with as_list.

Unnest wider by the root node of the XML (books), so that each child element becomes a column.

At this point the structure looks like the data frame you want, but because of the XML nesting each value is a list of lists wrapping the actual value.

Unnest every column two more times to extract the values from the lists of lists, then let type_convert infer the column types.

library(xml2)
library(tidyverse)
xmlDoc <- as_list(read_xml("https://raw.githubusercontent.com/nolivercuny/data607/master/homework5/books.xml"))

# put list into a one column tibble
xmlDf<-as_tibble(xmlDoc) %>%
  unnest_wider(books) %>%
  # unnest same length list cols
  unnest(cols = names(.)) %>%
  # unnest again because each column value is still a list
  unnest(cols = names(.)) %>%
  type_convert()
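As an alternative to the repeated unnesting, the same kind of table can be built by querying the nodes directly with XPath. This is only a sketch on inline XML, not the approach used above: the `<book>` element name is an assumption (only the root `books` and the field names such as `page-count` are known from the output), and just two fields are extracted.

```r
library(xml2)

# Hypothetical inline XML; the <book> element name is an assumption.
xmlText <- "<books>
  <book><title>Programming in Haskell</title><page-count>171</page-count></book>
  <book><title>Real World Haskell</title><page-count>670</page-count></book>
</books>"

doc <- read_xml(xmlText)

# One node per book
bookNodes <- xml_find_all(doc, "//book")

# xml_find_first returns one match per book node, so the columns line up
xmlSketchDf <- data.frame(
  title = xml_text(xml_find_first(bookNodes, "./title")),
  pageCount = as.integer(xml_text(xml_find_first(bookNodes, "./page-count"))),
  stringsAsFactors = FALSE
)
```

With this approach the column types are chosen explicitly (as.integer here) instead of being inferred by type_convert.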

HTML

Use the html_table function of the web scraping library rvest and cast the result to a data frame.

library(rvest)
htmlDf <- read_html("https://raw.githubusercontent.com/nolivercuny/data607/master/homework5/books.html")

htmlDf <- as.data.frame(html_table(htmlDf))
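When html_table is called on a whole document it returns a list with one table per `<table>` element, which is why the result is cast afterwards. The sketch below shows this on a small inline HTML string; the markup is hypothetical and only stands in for books.html.

```r
library(rvest)

# Hypothetical inline HTML standing in for the real page
htmlText <- "<table>
  <tr><th>Title</th><th>Page Count</th></tr>
  <tr><td>Programming in Haskell</td><td>171</td></tr>
  <tr><td>Real World Haskell</td><td>670</td></tr>
</table>"

page <- read_html(htmlText)

# A list with one element per <table> found in the document
tables <- html_table(page)
tableCount <- length(tables)

# Pick the first (and here only) table and convert it to a data frame
htmlSketchDf <- as.data.frame(tables[[1]])
```

Selecting the table with `[[1]]` before converting makes it explicit which table is used if the page ever contains more than one.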

Are the three data frames identical?

No. Compared as parsed, the data frames are not equal because their column names differ.

After setting the same column names on all three, they are still not identical, because each library parses the data into different column types. The XML page count and publication year columns are doubles where the HTML and JSON ones are integers, and the JSON authors column is a list where the XML and HTML ones are character vectors.

(all_equal(xmlDf, htmlDf))
## [1] "not compatible: \n- Cols in y but not x: `Title`, `Authors`, `Page.Count`, `Publication.Year`, `Price.On.Amazon`.\n- Cols in x but not y: `title`, `authors`, `page-count`, `publication-year`, `price-on-amazon`.\n"
(all_equal(xmlDf, jsonDf))
## [1] "not compatible: \n- Cols in y but not x: `books.title`, `books.authors`, `books.pageCount`, `books.publicationYear`, `books.priceOnAmazon`.\n- Cols in x but not y: `title`, `authors`, `page-count`, `publication-year`, `price-on-amazon`.\n"
(all_equal(jsonDf, htmlDf))
## [1] "not compatible: \n- Cols in y but not x: `Title`, `Authors`, `Page.Count`, `Publication.Year`, `Price.On.Amazon`.\n- Cols in x but not y: `books.title`, `books.authors`, `books.pageCount`, `books.publicationYear`, `books.priceOnAmazon`.\n"
colNames <- c("title","authors","page count","publication year","current price on amazon")
names(xmlDf) <- colNames
names(htmlDf) <- colNames
names(jsonDf) <- colNames

(all_equal(xmlDf, htmlDf))
## [1] "- Different types for column `page count`: double vs integer\n- Different types for column `publication year`: double vs integer\n"
(all_equal(xmlDf, jsonDf))
## [1] "- Different types for column `authors`: character vs list\n- Different types for column `page count`: double vs integer\n- Different types for column `publication year`: double vs integer\n"
(all_equal(jsonDf, htmlDf))
## [1] "- Different types for column `authors`: list vs character\n"
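The remaining differences are only in column types, so the frames could be reconciled by coercing those columns. The sketch below demonstrates the idea on small hand-built stand-ins rather than the real parsed frames; the frame names, the short column names, and the use of base R's identical are choices made for illustration. Separately, all_equal is deprecated as of dplyr 1.1.0, so base comparisons may be the safer choice with newer versions.

```r
# Hand-built stand-ins for the parsed frames (hypothetical, two books only)
titles <- c("Real World Haskell", "Programming in Haskell")

xmlLike <- data.frame(title = titles, pages = c(670, 171),
                      stringsAsFactors = FALSE)      # pages parsed as double
xmlLike$authors <- c("A, B", "C")

htmlLike <- data.frame(title = titles, pages = c(670L, 171L),
                       stringsAsFactors = FALSE)     # pages parsed as integer
htmlLike$authors <- c("A, B", "C")

jsonLike <- data.frame(title = titles, pages = c(670L, 171L),
                       stringsAsFactors = FALSE)
jsonLike$authors <- list(c("A", "B"), "C")           # authors parsed as a list

# Before coercion the frames differ in type only
equalBefore <- identical(xmlLike, htmlLike)

# Coerce the double column to integer and flatten the list column
xmlLike$pages <- as.integer(xmlLike$pages)
jsonLike$authors <- vapply(jsonLike$authors, paste, character(1), collapse = ", ")

xmlMatchesHtml <- identical(xmlLike, htmlLike)
jsonMatchesHtml <- identical(jsonLike, htmlLike)
```

identical is used here because it is strict about integer versus double, which is exactly the difference reported above.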