Books
The table below lists the three books I chose and their properties.
| title | authors | page count | publication year | current price on amazon |
|---|---|---|---|---|
| Real World Haskell | Bryan O’Sullivan, John Goerzen, Don Stewart | 670 | 2008 | $49.99 |
| Programming in Haskell | Graham Hutton | 171 | 2007 | $18.97 |
| Learn You a Haskell for Great Good!: A Beginner’s Guide | Miran Lipovača | 400 | 2011 | $42.21 |
JSON
Use the jsonlite library's fromJSON function and cast the result to a data frame.
library(jsonlite)
jsonDf <- as.data.frame(fromJSON("https://raw.githubusercontent.com/nolivercuny/data607/master/homework5/books.json"))
XML
Use the read_xml function of the xml2 library.
Convert the XML to a list.
Unnest by the root node of the XML.
Now you have a structure that looks like the data frame you want, but each value is a list of lists wrapping the actual value because of the XML structure.
Unnest every column two more times to extract the values from the lists of lists.
library(xml2)
library(tidyverse)
xmlDoc <- as_list(read_xml("https://raw.githubusercontent.com/nolivercuny/data607/master/homework5/books.xml"))
# put list into a one column tibble
xmlDf <- as_tibble(xmlDoc) %>%
  unnest_wider(books) %>%
  # unnest same length list cols
  unnest(cols = names(.)) %>%
  # unnest again because each column value is still a list
  unnest(cols = names(.)) %>%
  type_convert()
HTML
Use the html_table function from the web scraping library rvest and cast the result to a data frame.
library(rvest)
htmlDoc <- read_html("https://raw.githubusercontent.com/nolivercuny/data607/master/homework5/books.html")
htmlDf <- as.data.frame(html_table(htmlDoc))
Are the three data frames identical?
A first comparison shows the data frames are not equal because the column names differ.
After setting all the column names to the same values, they are still not identical because the columns have different types, a result of the different ways each library parses the data into a data frame.
(all_equal(xmlDf, htmlDf))
## [1] "not compatible: \n- Cols in y but not x: `Title`, `Authors`, `Page.Count`, `Publication.Year`, `Price.On.Amazon`.\n- Cols in x but not y: `title`, `authors`, `page-count`, `publication-year`, `price-on-amazon`.\n"
(all_equal(xmlDf, jsonDf))
## [1] "not compatible: \n- Cols in y but not x: `books.title`, `books.authors`, `books.pageCount`, `books.publicationYear`, `books.priceOnAmazon`.\n- Cols in x but not y: `title`, `authors`, `page-count`, `publication-year`, `price-on-amazon`.\n"
(all_equal(jsonDf, htmlDf))
## [1] "not compatible: \n- Cols in y but not x: `Title`, `Authors`, `Page.Count`, `Publication.Year`, `Price.On.Amazon`.\n- Cols in x but not y: `books.title`, `books.authors`, `books.pageCount`, `books.publicationYear`, `books.priceOnAmazon`.\n"
colNames <- c("title","authors","page count","publication year","current price on amazon")
names(xmlDf) <- colNames
names(htmlDf) <- colNames
names(jsonDf) <- colNames
(all_equal(xmlDf, htmlDf))
## [1] "- Different types for column `page count`: double vs integer\n- Different types for column `publication year`: double vs integer\n"
(all_equal(xmlDf, jsonDf))
## [1] "- Different types for column `authors`: character vs list\n- Different types for column `page count`: double vs integer\n- Different types for column `publication year`: double vs integer\n"
(all_equal(jsonDf, htmlDf))
## [1] "- Different types for column `authors`: list vs character\n"
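The remaining differences are only in column types, so the frames could probably be made to match by coercing those columns. Below is a minimal sketch of that idea in base R. It uses two small made-up data frames (fromList and fromText are hypothetical names, and the values are placeholders, not the book data) so it runs without downloading anything. It assumes the list column should be collapsed into one comma-separated string per row, which may not match how the XML and HTML files format authors.

```r
# Hypothetical stand-ins for two parsed data frames (placeholder values only).
fromList <- data.frame(title = c("A", "B"), pages = c(100L, 200L),
                       stringsAsFactors = FALSE)
# A list column, similar to what a JSON array of authors turns into
fromList$authors <- list(c("X", "Y"), "Z")

fromText <- data.frame(title = c("A", "B"), pages = c(100, 200),
                       authors = c("X, Y", "Z"), stringsAsFactors = FALSE)

# Collapse each list element into a single comma-separated string
fromList$authors <- vapply(fromList$authors, paste, character(1), collapse = ", ")
# Store the integer column as double so both frames use the same type
fromList$pages <- as.numeric(fromList$pages)

# Put the columns in the same order before comparing
fromList <- fromList[names(fromText)]

# Compare column classes first, then the values
sameTypes <- identical(sapply(fromList, class), sapply(fromText, class))
sameValues <- isTRUE(all.equal(fromList, fromText))
```

The same two coercions (list to character for authors, integer to double for the numeric columns) are the ones the all_equal messages above point at, but whether the real frames become identical depends on how each source file writes the author names.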