DATA607 Assignment 7

0. Packages

I used the jsonlite package to work with JSON files and the httr2, rvest, xml2, and xmltools packages to work with XML and HTML files. I also used the kableExtra package to create tables. If needed, you can install them using the commands below.

install.packages("httr2")
install.packages("jsonlite")
install.packages("kableExtra")
install.packages("rvest")
install.packages("xml2")
devtools::install_github('ultinomics/xmltools')

1. Introduction

The goals of this assignment were to import JSON, XML, and HTML files of the same information about 3 books into dataframes and test whether they are identical. The information included the books’ title, authors, publication date, publisher, and the price on Amazon as of the date of this assignment.

2. Data files

I manually created JSON, XML, and HTML files about the 3 books using the Visual Studio code editor, which provided helpful syntax highlighting for each file type. I also used JSON, XML, and HTML linters to validate my files.

I saved these files to my GitHub repository.

3. Data import

HTML

books_html <- read_html("https://raw.githubusercontent.com/alexandersimon1/Data607/main/Assignment7/books.html")

XML

books_xml <- read_xml("https://raw.githubusercontent.com/alexandersimon1/Data607/main/Assignment7/books.xml")

JSON

books_json <- fromJSON("https://raw.githubusercontent.com/alexandersimon1/Data607/main/Assignment7/books.json", flatten = TRUE)

4. Data transformation

4.1. HTML

I started with the HTML table since there is a function in the rvest package to do the dataframe conversion.

books_html_df <- html_table(books_html)[[1]]

The data types and content appear correct.

glimpse(books_html_df)

## Rows: 3
## Columns: 5
## $ title        <chr> "Joy of Cooking", "The Art and Science of Foodpairing: 10…
## $ author       <chr> "Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker, …
## $ publish_date <chr> "2019-11-12", "2020-09-10", "2023-01-10"
## $ publisher    <chr> "Scribner", "Firefly Books", "Simon & Schuster"
## $ price        <dbl> 21.99, 30.99, 20.49

kbl_display(books_html_df, "100%")

title	author	publish_date	publisher	price
Joy of Cooking	Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker, John Becker, Megan Scott	2019-11-12	Scribner	21.99
The Art and Science of Foodpairing: 10,000 flavour matches that will transform the way you eat	Peter Couquyt, Bernard Lahousse, Johan Langenbick	2020-09-10	Firefly Books	30.99
The Good Life: Lessons from the World’s Longest Scientific Study of Happiness	Robert Waldinger, M.D., Marc Schulz, Ph.D.	2023-01-10	Simon & Schuster	20.49

I used this dataframe as a first-pass comparator for the XML and JSON dataframes.

4.2. XML

I used the xmltools package to convert the XML object to a dataframe. The code below is adapted from the package readme file.

The first step is to identify the “terminal parent”, which is the parent node that does not have any children.

terminal_parent <- books_xml %>%
  xml_get_paths(only_terminal_parent = TRUE)

Second, determine the terminal XPATHs (XML path query language to extract nodes).

terminal_xpaths <- terminal_parent %>%
  unlist() %>%
  unique()

Then use the terminal XPATHs to extract the data. Finally, bind the rows and columns into a dataframe.

terminal_nodesets <- lapply(terminal_xpaths, xml2::xml_find_all, x = books_xml)
books_xml_df <- terminal_nodesets %>%
  purrr::map(xml_dig_df) %>% 
  purrr::map(dplyr::bind_rows) %>%
  dplyr::bind_cols()

A glimpse of the resulting dataframe shows that the authors are in individual columns and the price is a character data type rather than numeric.

glimpse(books_xml_df)

## Rows: 3
## Columns: 9
## $ title        <chr> "Joy of Cooking", "The Art and Science of Foodpairing: 10…
## $ author...2   <chr> "Irma S. Rombauer", "Peter Couquyt", "Robert Waldinger, M…
## $ author...3   <chr> "Marion Rombauer Becker", "Bernard Lahousse", "Marc Schul…
## $ author...4   <chr> "Ethan Becker", "Johan Langenbick", NA
## $ author...5   <chr> "John Becker", NA, NA
## $ author...6   <chr> "Megan Scott", NA, NA
## $ publish_date <chr> "2019-11-12", "2020-09-10", "2023-01-10"
## $ publisher    <chr> "Scribner", "Firefly Books", "Simon & Schuster"
## $ price        <chr> "21.99", "30.99", "20.49"

So I combined the author columns and coerced the price column to be numeric.

books_xml_df <- books_xml_df %>%
  unite(author, contains("author"), sep = ", ", na.rm = TRUE)

books_xml_df <- books_xml_df %>%
  mutate(
    price = as.numeric(price)    
  )

Now the dataframe looks better.

glimpse(books_xml_df)

## Rows: 3
## Columns: 5
## $ title        <chr> "Joy of Cooking", "The Art and Science of Foodpairing: 10…
## $ author       <chr> "Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker, …
## $ publish_date <chr> "2019-11-12", "2020-09-10", "2023-01-10"
## $ publisher    <chr> "Scribner", "Firefly Books", "Simon & Schuster"
## $ price        <dbl> 21.99, 30.99, 20.49

kbl_display(books_xml_df, "100%")

title	author	publish_date	publisher	price
Joy of Cooking	Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker, John Becker, Megan Scott	2019-11-12	Scribner	21.99
The Art and Science of Foodpairing: 10,000 flavour matches that will transform the way you eat	Peter Couquyt, Bernard Lahousse, Johan Langenbick	2020-09-10	Firefly Books	30.99
The Good Life: Lessons from the World’s Longest Scientific Study of Happiness	Robert Waldinger, M.D., Marc Schulz, Ph.D.	2023-01-10	Simon & Schuster	20.49

4.3. JSON

The input object is a list of lists.

str(books_json)

## List of 1
##  $ catalog:List of 1
##   ..$ books:'data.frame':    3 obs. of  5 variables:
##   .. ..$ title       : chr [1:3] "Joy of Cooking" "The Art and Science of Foodpairing: 10,000 flavour matches that will transform the way you eat" "The Good Life: Lessons from the World's Longest Scientific Study of Happiness"
##   .. ..$ author      :List of 3
##   .. .. ..$ : chr [1:5] "Irma S. Rombauer" "Marion Rombauer Becker" "Ethan Becker" "John Becker" ...
##   .. .. ..$ : chr [1:3] "Peter Couquyt" "Bernard Lahousse" "Johan Langenbick"
##   .. .. ..$ : chr [1:2] "Robert Waldinger, M.D." "Marc Schulz, Ph.D."
##   .. ..$ publish_date: chr [1:3] "2019-11-12" "2020-09-10" "2023-01-10"
##   .. ..$ publisher   : chr [1:3] "Scribner" "Firefly Books" "Simon & Schuster"
##   .. ..$ price       : num [1:3] 22 31 20.5

First, I binded the columns into a dataframe.

books_json_df <- bind_cols(books_json[[1]])

Then I converted each author list to a string. I use ungroup() at the end because rowwise() groups by row. (I realized this when comparing dataframes in section 4.4.)

books_json_df <- books_json_df %>%
  rowwise() %>%
  mutate(
    author = toString(unlist(author)),
  ) %>%
  ungroup()

Now the dataframe looks better.

glimpse(books_json_df)

## Rows: 3
## Columns: 5
## $ title        <chr> "Joy of Cooking", "The Art and Science of Foodpairing: 10…
## $ author       <chr> "Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker, …
## $ publish_date <chr> "2019-11-12", "2020-09-10", "2023-01-10"
## $ publisher    <chr> "Scribner", "Firefly Books", "Simon & Schuster"
## $ price        <dbl> 21.99, 30.99, 20.49

kbl_display(books_json_df, "100%")

title	author	publish_date	publisher	price
Joy of Cooking	Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker, John Becker, Megan Scott	2019-11-12	Scribner	21.99
The Art and Science of Foodpairing: 10,000 flavour matches that will transform the way you eat	Peter Couquyt, Bernard Lahousse, Johan Langenbick	2020-09-10	Firefly Books	30.99
The Good Life: Lessons from the World’s Longest Scientific Study of Happiness	Robert Waldinger, M.D., Marc Schulz, Ph.D.	2023-01-10	Simon & Schuster	20.49

4.4. Compare dataframes

Pairwise comparisons using all.equal showed that all 3 dataframes are identical.

JSON dataframe vs HTML dataframe

all.equal(books_json_df, books_html_df)

## [1] TRUE

XML dataframe vs HTML dataframe

all.equal(books_xml_df, books_html_df)

## [1] TRUE

JSON dataframe vs XML dataframe

all.equal(books_json_df, books_xml_df)

## [1] TRUE

5. Conclusions

I successfully imported JSON, XML, and HTML versions of the same information into dataframes. All 3 dataframes were identical.