I used the jsonlite package to work with JSON files and the httr2, rvest, xml2, and xmltools packages to work with XML and HTML files. I also used the kableExtra package to create tables. If needed, you can install them using the commands below.
install.packages("httr2")
install.packages("jsonlite")
install.packages("kableExtra")
install.packages("rvest")
install.packages("xml2")
devtools::install_github('ultinomics/xmltools')
The goals of this assignment were to import JSON, XML, and HTML files of the same information about 3 books into dataframes and test whether they are identical. The information included the books’ title, authors, publication date, publisher, and the price on Amazon as of the date of this assignment.
I manually created JSON, XML, and HTML files about the 3 books using the Visual Studio code editor, which provided helpful syntax highlighting for each file type. I also used JSON, XML, and HTML linters to validate my files.
I saved these files to my GitHub repository.
HTML
books_html <- read_html("https://raw.githubusercontent.com/alexandersimon1/Data607/main/Assignment7/books.html")
XML
books_xml <- read_xml("https://raw.githubusercontent.com/alexandersimon1/Data607/main/Assignment7/books.xml")
JSON
books_json <- fromJSON("https://raw.githubusercontent.com/alexandersimon1/Data607/main/Assignment7/books.json", flatten = TRUE)
I started with the HTML table since there is a function in the
rvest package to do the dataframe conversion.
books_html_df <- html_table(books_html)[[1]]
The data types and content appear correct.
glimpse(books_html_df)
## Rows: 3
## Columns: 5
## $ title <chr> "Joy of Cooking", "The Art and Science of Foodpairing: 10…
## $ author <chr> "Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker, …
## $ publish_date <chr> "2019-11-12", "2020-09-10", "2023-01-10"
## $ publisher <chr> "Scribner", "Firefly Books", "Simon & Schuster"
## $ price <dbl> 21.99, 30.99, 20.49
kbl_display(books_html_df, "100%")
| title | author | publish_date | publisher | price |
|---|---|---|---|---|
| Joy of Cooking | Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker, John Becker, Megan Scott | 2019-11-12 | Scribner | 21.99 |
| The Art and Science of Foodpairing: 10,000 flavour matches that will transform the way you eat | Peter Couquyt, Bernard Lahousse, Johan Langenbick | 2020-09-10 | Firefly Books | 30.99 |
| The Good Life: Lessons from the World’s Longest Scientific Study of Happiness | Robert Waldinger, M.D., Marc Schulz, Ph.D. | 2023-01-10 | Simon & Schuster | 20.49 |
I used this dataframe as a first-pass comparator for the XML and JSON dataframes.
I used the xmltools
package to convert the XML object to a dataframe. The code below is
adapted from the package readme file.
The first step is to identify the “terminal parent”, which is the parent node that does not have any children.
terminal_parent <- books_xml %>%
xml_get_paths(only_terminal_parent = TRUE)
Second, determine the terminal XPATHs (XML path query language to extract nodes).
terminal_xpaths <- terminal_parent %>%
unlist() %>%
unique()
Then use the terminal XPATHs to extract the data. Finally, bind the rows and columns into a dataframe.
terminal_nodesets <- lapply(terminal_xpaths, xml2::xml_find_all, x = books_xml)
books_xml_df <- terminal_nodesets %>%
purrr::map(xml_dig_df) %>%
purrr::map(dplyr::bind_rows) %>%
dplyr::bind_cols()
A glimpse of the resulting dataframe shows that the authors are in individual columns and the price is a character data type rather than numeric.
glimpse(books_xml_df)
## Rows: 3
## Columns: 9
## $ title <chr> "Joy of Cooking", "The Art and Science of Foodpairing: 10…
## $ author...2 <chr> "Irma S. Rombauer", "Peter Couquyt", "Robert Waldinger, M…
## $ author...3 <chr> "Marion Rombauer Becker", "Bernard Lahousse", "Marc Schul…
## $ author...4 <chr> "Ethan Becker", "Johan Langenbick", NA
## $ author...5 <chr> "John Becker", NA, NA
## $ author...6 <chr> "Megan Scott", NA, NA
## $ publish_date <chr> "2019-11-12", "2020-09-10", "2023-01-10"
## $ publisher <chr> "Scribner", "Firefly Books", "Simon & Schuster"
## $ price <chr> "21.99", "30.99", "20.49"
So I combined the author columns and coerced the price column to be numeric.
books_xml_df <- books_xml_df %>%
unite(author, contains("author"), sep = ", ", na.rm = TRUE)
books_xml_df <- books_xml_df %>%
mutate(
price = as.numeric(price)
)
Now the dataframe looks better.
glimpse(books_xml_df)
## Rows: 3
## Columns: 5
## $ title <chr> "Joy of Cooking", "The Art and Science of Foodpairing: 10…
## $ author <chr> "Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker, …
## $ publish_date <chr> "2019-11-12", "2020-09-10", "2023-01-10"
## $ publisher <chr> "Scribner", "Firefly Books", "Simon & Schuster"
## $ price <dbl> 21.99, 30.99, 20.49
kbl_display(books_xml_df, "100%")
| title | author | publish_date | publisher | price |
|---|---|---|---|---|
| Joy of Cooking | Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker, John Becker, Megan Scott | 2019-11-12 | Scribner | 21.99 |
| The Art and Science of Foodpairing: 10,000 flavour matches that will transform the way you eat | Peter Couquyt, Bernard Lahousse, Johan Langenbick | 2020-09-10 | Firefly Books | 30.99 |
| The Good Life: Lessons from the World’s Longest Scientific Study of Happiness | Robert Waldinger, M.D., Marc Schulz, Ph.D. | 2023-01-10 | Simon & Schuster | 20.49 |
The input object is a list of lists.
str(books_json)
## List of 1
## $ catalog:List of 1
## ..$ books:'data.frame': 3 obs. of 5 variables:
## .. ..$ title : chr [1:3] "Joy of Cooking" "The Art and Science of Foodpairing: 10,000 flavour matches that will transform the way you eat" "The Good Life: Lessons from the World's Longest Scientific Study of Happiness"
## .. ..$ author :List of 3
## .. .. ..$ : chr [1:5] "Irma S. Rombauer" "Marion Rombauer Becker" "Ethan Becker" "John Becker" ...
## .. .. ..$ : chr [1:3] "Peter Couquyt" "Bernard Lahousse" "Johan Langenbick"
## .. .. ..$ : chr [1:2] "Robert Waldinger, M.D." "Marc Schulz, Ph.D."
## .. ..$ publish_date: chr [1:3] "2019-11-12" "2020-09-10" "2023-01-10"
## .. ..$ publisher : chr [1:3] "Scribner" "Firefly Books" "Simon & Schuster"
## .. ..$ price : num [1:3] 22 31 20.5
First, I binded the columns into a dataframe.
books_json_df <- bind_cols(books_json[[1]])
Then I converted each author list to a string. I use
ungroup() at the end because rowwise() groups
by row. (I realized this when comparing dataframes in section 4.4.)
books_json_df <- books_json_df %>%
rowwise() %>%
mutate(
author = toString(unlist(author)),
) %>%
ungroup()
Now the dataframe looks better.
glimpse(books_json_df)
## Rows: 3
## Columns: 5
## $ title <chr> "Joy of Cooking", "The Art and Science of Foodpairing: 10…
## $ author <chr> "Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker, …
## $ publish_date <chr> "2019-11-12", "2020-09-10", "2023-01-10"
## $ publisher <chr> "Scribner", "Firefly Books", "Simon & Schuster"
## $ price <dbl> 21.99, 30.99, 20.49
kbl_display(books_json_df, "100%")
| title | author | publish_date | publisher | price |
|---|---|---|---|---|
| Joy of Cooking | Irma S. Rombauer, Marion Rombauer Becker, Ethan Becker, John Becker, Megan Scott | 2019-11-12 | Scribner | 21.99 |
| The Art and Science of Foodpairing: 10,000 flavour matches that will transform the way you eat | Peter Couquyt, Bernard Lahousse, Johan Langenbick | 2020-09-10 | Firefly Books | 30.99 |
| The Good Life: Lessons from the World’s Longest Scientific Study of Happiness | Robert Waldinger, M.D., Marc Schulz, Ph.D. | 2023-01-10 | Simon & Schuster | 20.49 |
Pairwise comparisons using all.equal showed that all 3
dataframes are identical.
JSON dataframe vs HTML dataframe
all.equal(books_json_df, books_html_df)
## [1] TRUE
XML dataframe vs HTML dataframe
all.equal(books_xml_df, books_html_df)
## [1] TRUE
JSON dataframe vs XML dataframe
all.equal(books_json_df, books_xml_df)
## [1] TRUE
I successfully imported JSON, XML, and HTML versions of the same information into dataframes. All 3 dataframes were identical.