Working with XML and JSON in R

Getting Started

Instructions

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Load Libraries

The code below uses the tidyverse, with rvest, xml2, and rjson for reading the HTML, XML, and JSON files.

# Install new packages
#install.packages('rjson')

# Load packages --------------------------------------
library(tidyverse)
library(rvest)
library(xml2)
library(rjson)

Writing Tables

The contents of the three hand-written files (HTML, XML, and JSON) are shown below.

Information I’m trying to represent

Book: The Information
Author: James Gleick
Published: 2011
Description: Redundancy and patterns in language and entropy

Book: Hold Me Tight
Author: Dr. Sue Johnson
Published: 2008
Description: Explains childhood attachment theory as it applies to adult relationships

Book: Chasing the Scream
Author: Johann Hari and Second Author
Published: 2015
Description: Has an arc from the war on drugs, a scientific experiment, to a possible peace with drugs

in HTML code

<table>
  <tr>
    <th>Book</th>
    <th>Author</th>
    <th>Published</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>The Information</td>
    <td>James Gleick</td>
    <td>2011</td>
    <td>Redundancy and patterns in language and entropy</td>
  </tr>
  <tr>
    <td>Hold Me Tight</td>
    <td>Dr. Sue Johnson</td>
    <td>2008</td>
    <td>Explains childhood attachment theory as it applies to adult relationships</td>
  </tr>
  <tr>
    <td>Chasing the Scream</td>
    <td>Johann Hari and Second Author</td>
    <td>2015</td>
    <td>Has an arc from the war on drugs, a scientific experiment, to a possible peace with drugs</td>
  </tr>
</table>

in XML code

<root>

  <books>
    <book>The Information</book>
    <author>James Gleick</author>
    <published>2011</published>
    <description>Redundancy and patterns in language and entropy</description>
  </books>

  <books>
    <book>Hold Me Tight</book>
    <author>Dr. Sue Johnson</author>
    <published>2008</published>
    <description>Explains childhood attachment theory as it applies to adult relationships</description>
  </books>

  <books>
    <book>Chasing the Scream</book>
    <author>Johann Hari and Second Author</author>
    <published>2015</published>
    <description>Has an arc from the war on drugs, a scientific experiment, to a possible peace with drugs</description>
  </books>

</root>

in JSON code

{
  "book":["The Information","Hold Me Tight","Chasing the Scream"],
  "author":["James Gleick","Dr. Sue Johnson","Johann Hari and Second Author"],
  "published":[2011,2008,2015],
  "description":["Redundancy and patterns in language and entropy",
    "Explains childhood attachment theory as it applies to adult relationships",
    "Has an arc from the war on drugs, a scientific experiment, to a possible peace with drugs"]
}

Reading Tables

Here we read the tables above, which were saved in their respective file formats and uploaded to GitHub.

from HTML

html_address <- "https://raw.githubusercontent.com/pkofy/DATA607/main/WK7Assignment/DATA607WK7Assignment.html"
dfhtml <- read_html(html_address)
dfhtml <- html_table(dfhtml)
dfhtml <- as.data.frame(dfhtml)
dfhtml

##                 Book                        Author Published
## 1    The Information                  James Gleick      2011
## 2      Hold Me Tight               Dr. Sue Johnson      2008
## 3 Chasing the Scream Johann Hari and Second Author      2015
##                                                                                 Description
## 1                                           Redundancy and patterns in language and entropy
## 2                 Explains childhood attachment theory as it applies to adult relationships
## 3 Has an arc from the war on drugs, a scientific experiment, to a possible peace with drugs
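One detail worth spelling out: html_table() applied to a whole document returns a list with one table per <table> element, which is why the code above still needs as.data.frame(). The sketch below illustrates this on a small inline HTML string (a hypothetical, trimmed copy of the table, so it runs without network access) and shows the alternative of selecting the table node first. It assumes rvest 1.0 or later, where html_element() is available.

```r
library(rvest)

# Hypothetical inline snippet standing in for the hosted file
html_snippet <- '<table>
  <tr><th>Book</th><th>Published</th></tr>
  <tr><td>The Information</td><td>2011</td></tr>
  <tr><td>Hold Me Tight</td><td>2008</td></tr>
</table>'

page <- read_html(html_snippet)

# On a whole document, html_table() returns a list: one element per <table>
tables <- html_table(page)
length(tables)

# Selecting the <table> node first yields a single table directly
books_html <- as.data.frame(html_table(html_element(page, "table")))
str(books_html)
```

Selecting the node explicitly also makes the code less fragile if the page ever contains more than one table.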

from XML

xml_address <- "https://raw.githubusercontent.com/pkofy/DATA607/main/WK7Assignment/DATA607WK7Assignment.xml"

# Read xml as list
step1 <- as_list(read_xml(xml_address))

# Unnest so each row holds one child tag; root_id records the tag name
step2 <- as_tibble(step1) %>%
  unnest_longer(root)

# For each tag name, keep its rows and spread the text value into a column
step3a <- step2 %>%
  filter(root_id == "book") %>%
  unnest_wider(root)

step3b <- step2 %>%
  filter(root_id == "author") %>%
  unnest_wider(root)

step3c <- step2 %>%
  filter(root_id == "published") %>%
  unnest_wider(root)

step3d <- step2 %>%
  filter(root_id == "description") %>%
  unnest_wider(root)

# Convert tibbles to data frames
step3a <- as.data.frame(step3a)
step3b <- as.data.frame(step3b)
step3c <- as.data.frame(step3c)
step3d <- as.data.frame(step3d)

# Assemble data frame (...1 is the name unnest_wider() gives the unnamed text value)
dfxml <- data.frame(matrix(ncol = 0, nrow = 3))
dfxml$book <- step3a$...1
dfxml$author <- step3b$...1
dfxml$published <- step3c$...1
dfxml$description <- step3d$...1
dfxml

##                 book                        author published
## 1    The Information                  James Gleick      2011
## 2      Hold Me Tight               Dr. Sue Johnson      2008
## 3 Chasing the Scream Johann Hari and Second Author      2015
##                                                                                 description
## 1                                           Redundancy and patterns in language and entropy
## 2                 Explains childhood attachment theory as it applies to adult relationships
## 3 Has an arc from the war on drugs, a scientific experiment, to a possible peace with drugs
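The same XML layout can probably be read more compactly with XPath queries from xml2 alone. This is a minimal sketch, not the approach used above: it works on an inline string (a hypothetical, trimmed copy of the file), and field() is a helper name introduced here purely for illustration.

```r
library(xml2)

# Hypothetical inline snippet with the same structure as the hosted file
xml_snippet <- '<root>
  <books>
    <book>The Information</book>
    <author>James Gleick</author>
    <published>2011</published>
  </books>
  <books>
    <book>Hold Me Tight</book>
    <author>Dr. Sue Johnson</author>
    <published>2008</published>
  </books>
</root>'

doc <- read_xml(xml_snippet)

# One node per record
records <- xml_find_all(doc, "//books")

# Illustrative helper: text of the first child named `tag` within each record
field <- function(tag) xml_text(xml_find_first(records, paste0("./", tag)))

books_xml <- data.frame(
  book      = field("book"),
  author    = field("author"),
  published = as.integer(field("published")),  # XML text is character; convert explicitly
  stringsAsFactors = FALSE
)
books_xml
```

Because XML carries no type information, every field arrives as text, so the year has to be converted explicitly if a numeric column is wanted.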

from JSON

json_address <- "https://raw.githubusercontent.com/pkofy/DATA607/main/WK7Assignment/DATA607WK7Assignment.json"
dfjson <- fromJSON(file=json_address)
dfjson <- as.data.frame(dfjson)
dfjson

##                 book                        author published
## 1    The Information                  James Gleick      2011
## 2      Hold Me Tight               Dr. Sue Johnson      2008
## 3 Chasing the Scream Johann Hari and Second Author      2015
##                                                                                 description
## 1                                           Redundancy and patterns in language and entropy
## 2                 Explains childhood attachment theory as it applies to adult relationships
## 3 Has an arc from the war on drugs, a scientific experiment, to a possible peace with drugs
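The JSON file above stores one array per column, which is why as.data.frame() works on the parsed list directly. If the same data were stored row-wise, as an array of one object per book, an extra step would be needed. The sketch below shows one way to handle that case with rjson; the row-wise string is a hypothetical variant written for this example, not one of the assignment files.

```r
library(rjson)

# Hypothetical row-oriented variant: one JSON object per book
json_rows <- '[
  {"book": "The Information", "published": 2011},
  {"book": "Hold Me Tight",   "published": 2008}
]'

# Parses to a list containing one named list per book
records <- fromJSON(json_str = json_rows)

# Turn each record into a one-row data frame, then stack the rows
books_json <- do.call(
  rbind,
  lapply(records, as.data.frame, stringsAsFactors = FALSE)
)
books_json
```

The column-oriented layout used in this assignment avoids the stacking step, at the cost of being less natural for records with differing fields.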

Conclusion

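All three data frames hold the same values when compared elementwise with ==. That comparison coerces types and ignores column names, so it does not flag that the HTML headers are capitalized (Book vs. book) or that values parsed from XML are character strings; a stricter check such as identical() would treat those as differences.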

# The data frames from the HTML and XML route are the same
dfhtml == dfxml

##      Book Author Published Description
## [1,] TRUE   TRUE      TRUE        TRUE
## [2,] TRUE   TRUE      TRUE        TRUE
## [3,] TRUE   TRUE      TRUE        TRUE

# The data frames from the XML and JSON route are the same
dfxml == dfjson

##      book author published description
## [1,] TRUE   TRUE      TRUE        TRUE
## [2,] TRUE   TRUE      TRUE        TRUE
## [3,] TRUE   TRUE      TRUE        TRUE
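To make the difference between elementwise equality and strict identity concrete, here is a small sketch using two hand-built stand-ins (df_a and df_b are hypothetical, not the data frames loaded above). One has capitalized names and an integer year, the other lowercase names and a character year.

```r
# Hypothetical stand-ins mimicking the HTML-style and XML-style results
df_a <- data.frame(Book = c("The Information", "Hold Me Tight"),
                   Published = c(2011L, 2008L),
                   stringsAsFactors = FALSE)
df_b <- data.frame(book = c("The Information", "Hold Me Tight"),
                   published = c("2011", "2008"),
                   stringsAsFactors = FALSE)

# Elementwise comparison coerces types and ignores column names
all(df_a == df_b)        # TRUE

# Strict comparison sees the differing names and column types
identical(df_a, df_b)    # FALSE

# Harmonize names and types, then compare again with a stricter check
names(df_a) <- tolower(names(df_a))
df_b$published <- as.integer(df_b$published)
all.equal(df_a, df_b)
```

all.equal() reports the specific differences when there are any, which makes it a convenient way to see what would need harmonizing.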

Resources

The following link was helpful in parsing the XML file: urbandatapalette.com/post/2021-03-xml-dataframe-r/

DATA607WK7Assignment

PK O’Flaherty

2022-03-20