Books

Here is a table with the 3 books I chose and their properties

title	authors	page count	publication year	current price on amazon
Real World Haskell	Bryan O’Sullivan, John Goerzen, Don Stewart	670	2008	$49.99
Programming in Haskell	Graham Hutton	171	2007	$18.97
Learn You a Haskell for Great Good!: A Beginner’s Guide	Miran Lipovača	400	2011	$42.21

JSON

Use the jsonlite libraries fromJSON function and simply cast as dataframe

library(jsonlite)
jsonDf <- as.data.frame(fromJSON("https://raw.githubusercontent.com/nolivercuny/data607/master/homework5/books.json"))

XML

Use the read_xml function of the xml2 library.

Convert the XML to a list.

Unnest by the root node of the XML

Now you have a structure that looks like the dataframe you want but the values are lists of lists of the actual value due to the XML structure.

Unnest every column two more times to extract the values from the list of lists

library(xml2)
library(tidyverse)
xmlDoc <- as_list(read_xml("https://raw.githubusercontent.com/nolivercuny/data607/master/homework5/books.xml"))

# put list into a one column tibble
xmlDf<-as_tibble(xmlDoc) %>%
  unnest_wider(books) %>%
  # unnest same length list cols
  unnest(cols = names(.)) %>%
  # unnest again because each column value is still a list
  unnest(cols = names(.)) %>%
  type_convert()

HTML

Use the web scraping libraryrvest’s html_table function and cast the result as a dataframe

library(rvest)
htmlDf <- read_html("https://raw.githubusercontent.com/nolivercuny/data607/master/homework5/books.html")

htmlDf <- as.data.frame(html_table(htmlDf))

Are the three data frames identical?

First compare all dataframes and they are not equal because the column names are different.

Set all the column names the same and they are still not identical because the values of each column are different types from each other due to the different ways the libraries are parsing the data into dataframes.

(all_equal(xmlDf, htmlDf))

## [1] "not compatible: \n- Cols in y but not x: `Title`, `Authors`, `Page.Count`, `Publication.Year`, `Price.On.Amazon`.\n- Cols in x but not y: `title`, `authors`, `page-count`, `publication-year`, `price-on-amazon`.\n"

(all_equal(xmlDf, jsonDf))

## [1] "not compatible: \n- Cols in y but not x: `books.title`, `books.authors`, `books.pageCount`, `books.publicationYear`, `books.priceOnAmazon`.\n- Cols in x but not y: `title`, `authors`, `page-count`, `publication-year`, `price-on-amazon`.\n"

(all_equal(jsonDf, htmlDf))

## [1] "not compatible: \n- Cols in y but not x: `Title`, `Authors`, `Page.Count`, `Publication.Year`, `Price.On.Amazon`.\n- Cols in x but not y: `books.title`, `books.authors`, `books.pageCount`, `books.publicationYear`, `books.priceOnAmazon`.\n"

colNames <- c("title","authors","page count","publication year","current price on amazon")
names(xmlDf) <- colNames
names(htmlDf) <- colNames
names(jsonDf) <- colNames

(all_equal(xmlDf, htmlDf))

## [1] "- Different types for column `page count`: double vs integer\n- Different types for column `publication year`: double vs integer\n"

(all_equal(xmlDf, jsonDf))

## [1] "- Different types for column `authors`: character vs list\n- Different types for column `page count`: double vs integer\n- Different types for column `publication year`: double vs integer\n"

(all_equal(jsonDf, htmlDf))

## [1] "- Different types for column `authors`: list vs character\n"

Homework 5 - Working with XML and JSON in R

Nick Oliver

Books

JSON

XML

HTML

Are the three data frames identical?