Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
We are coding in the tidyverse.
# Install new packages
#install.packages('rjson')
# Load packages --------------------------------------
library(tidyverse)
library(rvest)
library(xml2)
library(rjson)
Here is the book information, followed by what was written to the HTML, XML, and JSON files.
Book: The Information
Author: James Gleick
Published: 2011
Description: Talks about redundancy and patterns in language and
entropy
Book: Hold Me Tight
Author: Dr. Sue Johnson
Published: 2008
Description: Explains childhood attachment theory as it applies to adult
relationships
Book: Chasing the Scream
Author: Johann Hari and Second Author
Published: 2015
Description: Has an arc from the war on drugs, a scientific experiment,
to a possible peace with drugs
<table>
<tr>
<th>Book</th>
<th>Author</th>
<th>Published</th>
<th>Description</th>
</tr>
<tr>
<td>The Information</td>
<td>James Gleick</td>
<td>2011</td>
<td>Redundancy and patterns in language and entropy</td>
</tr>
<tr>
<td>Hold Me Tight</td>
<td>Dr. Sue Johnson</td>
<td>2008</td>
<td>Explains childhood attachment theory as it applies to adult relationships</td>
</tr>
<tr>
<td>Chasing the Scream</td>
<td>Johann Hari and Second Author</td>
<td>2015</td>
<td>Has an arc from the war on drugs, a scientific experiment, to a possible peace with drugs</td>
</tr>
</table>
<root>
<books>
<book>The Information</book>
<author>James Gleick</author>
<published>2011</published>
<description>Redundancy and patterns in language and entropy</description>
</books>
<books>
<book>Hold Me Tight</book>
<author>Dr. Sue Johnson</author>
<published>2008</published>
<description>Explains childhood attachment theory as it applies to adult relationships</description>
</books>
<books>
<book>Chasing the Scream</book>
<author>Johann Hari and Second Author</author>
<published>2015</published>
<description>Has an arc from the war on drugs, a scientific experiment, to a possible peace with drugs</description>
</books>
</root>
{
"book":["The Information","Hold Me Tight","Chasing the Scream"],
"author":["James Gleick","Dr. Sue Johnson","Johann Hari and Second Author"],
"published":[2011,2008,2015],
"description":["Redundancy and patterns in language and entropy",
"Explains childhood attachment theory as it applies to adult relationships",
"Has an arc from the war on drugs, a scientific experiment, to a possible peace with drugs"]
}
Here we read each of the files shown above, which were saved in their respective formats and uploaded to our GitHub account, into its own data frame.
html_address <- "https://raw.githubusercontent.com/pkofy/DATA607/main/WK7Assignment/DATA607WK7Assignment.html"
# Parse the page, extract its table, and convert it to a data frame
dfhtml <- read_html(html_address)
dfhtml <- html_table(dfhtml)
dfhtml <- as.data.frame(dfhtml)
dfhtml
## Book Author Published
## 1 The Information James Gleick 2011
## 2 Hold Me Tight Dr. Sue Johnson 2008
## 3 Chasing the Scream Johann Hari and Second Author 2015
## Description
## 1 Redundancy and patterns in language and entropy
## 2 Explains childhood attachment theory as it applies to adult relationships
## 3 Has an arc from the war on drugs, a scientific experiment, to a possible peace with drugs
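The HTML route can also be tried without a network connection. The following is a minimal sketch, assuming rvest 1.0.0 or later (for `html_element()`); the inline string `html_text_input` is a shortened, hypothetical stand-in for the real file, not a copy of it.

```r
library(rvest)

# Hypothetical, shortened stand-in for the HTML file shown earlier
html_text_input <- "<table>
  <tr><th>Book</th><th>Published</th></tr>
  <tr><td>The Information</td><td>2011</td></tr>
  <tr><td>Hold Me Tight</td><td>2008</td></tr>
</table>"

# read_html() accepts literal markup as well as a URL or file path
page <- read_html(html_text_input)

# Select the single <table> element and convert it to a data frame
book_table <- as.data.frame(html_table(html_element(page, "table")))
book_table
```

Selecting one table with `html_element()` returns a single table rather than a list of tables, which may be easier to reason about when a page holds more than one.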
xml_address <- "https://raw.githubusercontent.com/pkofy/DATA607/main/WK7Assignment/DATA607WK7Assignment.xml"
# Read xml as list
step1 <- as_list(read_xml(xml_address))
# Expand the data to multiple rows by tags
step2 <- as_tibble(step1) %>%
  unnest_longer(root)
# Make a tibble with the data from each tag
step3a <- step2 %>%
  filter(root_id == "book") %>%
  unnest_wider(root)
step3b <- step2 %>%
  filter(root_id == "author") %>%
  unnest_wider(root)
step3c <- step2 %>%
  filter(root_id == "published") %>%
  unnest_wider(root)
step3d <- step2 %>%
  filter(root_id == "description") %>%
  unnest_wider(root)
# Convert tibbles to data frames
step3a <- as.data.frame(step3a)
step3b <- as.data.frame(step3b)
step3c <- as.data.frame(step3c)
step3d <- as.data.frame(step3d)
# Assemble data frame
dfxml <- data.frame(matrix(ncol = 0, nrow = 3))
dfxml$book <- step3a$...1
dfxml$author <- step3b$...1
dfxml$published <- step3c$...1
dfxml$description <- step3d$...1
dfxml
## book author published
## 1 The Information James Gleick 2011
## 2 Hold Me Tight Dr. Sue Johnson 2008
## 3 Chasing the Scream Johann Hari and Second Author 2015
## description
## 1 Redundancy and patterns in language and entropy
## 2 Explains childhood attachment theory as it applies to adult relationships
## 3 Has an arc from the war on drugs, a scientific experiment, to a possible peace with drugs
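As a possible alternative to the list-and-unnest steps, the same kind of XML can be read with XPath queries from xml2. This is only a sketch under stated assumptions: `xml_text_input` is a shortened, hypothetical stand-in for the real file, and the helper `field()` and the result name `dfxml_alt` are illustrative names, not part of the original solution.

```r
library(xml2)

# Hypothetical, shortened stand-in for the XML file shown earlier
xml_text_input <- "<root>
  <books><book>The Information</book><author>James Gleick</author><published>2011</published></books>
  <books><book>Hold Me Tight</book><author>Dr. Sue Johnson</author><published>2008</published></books>
</root>"

doc <- read_xml(xml_text_input)

# One node per <books> record
records <- xml_find_all(doc, "//books")

# Illustrative helper: text of the named child tag, one value per record
field <- function(name) xml_text(xml_find_first(records, paste0("./", name)))

dfxml_alt <- data.frame(
  book = field("book"),
  author = field("author"),
  published = as.integer(field("published")),  # XML text is character by default
  stringsAsFactors = FALSE
)
dfxml_alt
```

Because `xml_find_first()` returns one result per record, the columns stay aligned even if a tag were missing from one record (the value would come back as `NA`).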
json_address <- "https://raw.githubusercontent.com/pkofy/DATA607/main/WK7Assignment/DATA607WK7Assignment.json"
dfjson <- fromJSON(file=json_address)
dfjson <- as.data.frame(dfjson)
dfjson
## book author published
## 1 The Information James Gleick 2011
## 2 Hold Me Tight Dr. Sue Johnson 2008
## 3 Chasing the Scream Johann Hari and Second Author 2015
## description
## 1 Redundancy and patterns in language and entropy
## 2 Explains childhood attachment theory as it applies to adult relationships
## 3 Has an arc from the war on drugs, a scientific experiment, to a possible peace with drugs
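The JSON file is column-oriented (one array per attribute), which is why the parsed list converts to a data frame in a single step. A minimal sketch of that idea, using rjson on an inline string; `json_text` is a shortened, hypothetical stand-in for the real file.

```r
library(rjson)

# Hypothetical, shortened stand-in for the JSON file shown earlier
json_text <- '{"book":["The Information","Hold Me Tight"],"published":[2011,2008]}'

# Each JSON array becomes one vector in a named list
parsed <- fromJSON(json_str = json_text)

# Equal-length named vectors map directly onto data frame columns
df_from_json <- as.data.frame(parsed, stringsAsFactors = FALSE)
df_from_json
```

A row-oriented file (an array of one object per book) would need an extra step to bind the records together, so the column-oriented layout keeps the R side short.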
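All three data frames are equal element by element. The only visible difference is that the HTML data frame's column names are capitalized (Book, Author, Published, Description), while the XML and JSON column names are lowercase.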
# The data frames from the HTML and XML route are the same
dfhtml == dfxml
## Book Author Published Description
## [1,] TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE
# The data frames from the XML and JSON route are the same
dfxml == dfjson
## book author published description
## [1,] TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE
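Since the assignment asks whether the data frames are identical, a stricter check than `==` may be worth a look: `==` compares values after type coercion and does not compare column names, whereas `identical()` checks both. The sketch below uses toy data frames (`df_a` and `df_b` are hypothetical stand-ins, not the real data) and base R only.

```r
# Toy stand-ins: same values, different column names and column types
df_a <- data.frame(Book = c("A", "B"), Published = c(2011L, 2008L),
                   stringsAsFactors = FALSE)
df_b <- data.frame(book = c("A", "B"), published = c("2011", "2008"),
                   stringsAsFactors = FALSE)

# Elementwise comparison: values are coerced, column names are not compared
all_equal_values <- all(df_a == df_b)

# Strict comparison: column names and column types must match as well
strictly_identical <- identical(df_a, df_b)

# Column classes show where the two frames differ
classes_a <- sapply(df_a, class)
classes_b <- sapply(df_b, class)

all_equal_values
strictly_identical
```

Running the same `sapply(df, class)` inspection on `dfhtml`, `dfxml`, and `dfjson` would show whether the real data frames differ in column types as well as in column names.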
The following link was helpful in parsing the xml file: urbandatapalette.com/post/2021-03-xml-dataframe-r/