Assignment 7 - HTML, XML, JSON

Prompt

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Response

My favorite subject is comedy and comedians, so I’ll pick two books that I’ve read plus one with multiple authors.

The additional attributes will be page count, publication year, and any awards the book has won.

Egghead by Bo Burnham
Dril Official “Mr. Ten Years” Anniversary Collection by Dril
Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch by Terry Pratchett & Neil Gaiman

I’ve worked with these three file types before, so I’m going to build a data frame here, write it out to each format, upload the files to GitHub, and then read them back in and display them.

Data files

df <- data.frame(
  Titles = c("Egghead; or, You Can't Survive on Ideas Alone", 'Dril Official "Mr. Ten Years" Anniversary Collection', "Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch"),
  Authors = c("Bo Burnham", "Dril","Terry Pratchett & Neil Gaiman"),
  Page_Number = c("240","420","491"),
  Publish_Year = c("2013", "2018", "1990"),
  Awards = c("Goodreads Choice Award Nominee for Poetry (2013)", "", "Locus Award Nominee for Best Fantasy Novel (1991), World Fantasy Award Nominee for Best Novel (1991), Audie Award Nominee for Audio Drama and for Fantasy (2023)")
)
knitr::kable(head(df))

Titles	Authors	Page_Number	Publish_Year	Awards
Egghead; or, You Can’t Survive on Ideas Alone	Bo Burnham	240	2013	Goodreads Choice Award Nominee for Poetry (2013)
Dril Official “Mr. Ten Years” Anniversary Collection	Dril	420	2018
Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch	Terry Pratchett & Neil Gaiman	491	1990	Locus Award Nominee for Best Fantasy Novel (1991), World Fantasy Award Nominee for Best Novel (1991), Audie Award Nominee for Audio Drama and for Fantasy (2023)

Write to html

#install.packages('tableHTML')
library(tableHTML)
write_tableHTML(tableHTML(df, rownames = FALSE), file = 'Ass7_Book_Table.html')

Uploaded the HTML file to GitHub.

Import HTML and write to XML

#install.packages('xml2')
library('xml2')
h <- read_html("https://raw.githubusercontent.com/jacshap/Data607/refs/heads/main/Ass7_Book_Table.html")
tmp <- tempfile(fileext = ".xml")
write_xml(h, tmp, options = "as_xml")
#readLines(tmp)
writeLines(readLines(tmp), "Ass7_Book_Table.xml")

The approach above didn’t work :( so I made the XML file manually. I’ll import it below.
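For reference, here is a minimal sketch of one way the XML file could be generated from a data frame instead of written by hand. It is not what I actually ran. It uses xml2's xml_new_root() and xml_add_child(), only two of the three books, and it skips the Awards column to stay short. The helper name books_to_xml, the data frame name books_df, and the temp file are all made up for illustration; the element names mirror the ones in my hand-written file.

```r
library(xml2)

# Two rows of the book data, enough to show the pattern (illustrative subset).
books_df <- data.frame(
  Titles = c("Egghead; or, You Can't Survive on Ideas Alone",
             "Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch"),
  Authors = c("Bo Burnham", "Terry Pratchett & Neil Gaiman"),
  Page_Number = c("240", "491"),
  Publish_Year = c("2013", "1990"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one <book> element per row, one child element per column.
books_to_xml <- function(d) {
  root <- xml_new_root("books")
  for (i in seq_len(nrow(d))) {
    # Named arguments become attributes, unnamed ones become the node's text.
    book <- xml_add_child(root, "book", id = as.character(i))
    xml_add_child(book, "title", d$Titles[i])
    xml_add_child(book, "author", d$Authors[i])
    xml_add_child(book, "page_number", d$Page_Number[i])
    xml_add_child(book, "publish_year", d$Publish_Year[i])
  }
  root
}

doc <- books_to_xml(books_df)

# Write to a temporary file and read it back to confirm the file parses.
out_file <- tempfile(fileext = ".xml")
write_xml(doc, out_file)
reparsed <- read_xml(out_file)
authors_back <- xml_text(xml_find_all(reparsed, "//author"))
print(authors_back)
```

Building the nodes through xml2 rather than pasting strings together means special characters such as the & in the Authors column should be escaped when the file is written, so the file reads back cleanly.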

Export to JSON

#install.packages("jsonlite")
library(jsonlite)

## 
## Attaching package: 'jsonlite'

## The following object is masked from 'package:purrr':
## 
##     flatten

write_json(df,"Ass7_Book_Table.json", pretty = TRUE, auto_unbox = TRUE)

Read files in

Read HTML

library("rvest")

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

html_book_table <- read_html("https://raw.githubusercontent.com/jacshap/Data607/refs/heads/main/Ass7_Book_Table.html") %>% html_table() %>% .[[1]]
# html_table() returns a list of tables, so .[[1]] pulls out the first one
knitr::kable(head(html_book_table))

Titles	Authors	Page_Number	Publish_Year	Awards
Egghead; or, You Can’t Survive on Ideas Alone	Bo Burnham	240	2013	Goodreads Choice Award Nominee for Poetry (2013)
Dril Official “Mr. Ten Years” Anniversary Collection	Dril	420	2018
Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch	Terry Pratchett & Neil Gaiman	491	1990	Locus Award Nominee for Best Fantasy Novel (1991), World Fantasy Award Nominee for Best Novel (1991), Audie Award Nominee for Audio Drama and for Fantasy (2023)

Read XML (manually created file)

xml_book_table <- xml2::read_xml('https://raw.githubusercontent.com/jacshap/Data607/refs/heads/main/Ass7_XML_file_of_books.xml')
print(xml_book_table)

This still does not work. Trying the solution from here: https://stackoverflow.com/questions/17198658/how-to-parse-an-xml-file-to-an-r-data-frame

#install.packages("XML")
library(XML)
#data <- xmlParse("https://raw.githubusercontent.com/jacshap/Data607/refs/heads/main/Ass7_XML_file_of_books.xml")
#Getting an "Error: XML content does not seem to be XML: ''"

#Trying RCurl per https://stackoverflow.com/questions/23584514/error-xml-content-does-not-seem-to-be-xml-r-3-1-0
#install.packages("RCurl")
library(RCurl)

## 
## Attaching package: 'RCurl'

## The following object is masked from 'package:tidyr':
## 
##     complete

fileURL <- "https://raw.githubusercontent.com/jacshap/Data607/refs/heads/main/Ass7_XML_file_of_books.xml"
xData <- getURL(fileURL, ssl.verifyPeer=FALSE)
doc <- xmlParse(xData)

# OK, apparently the XML file contained characters that have to be escaped. After replacing them in the file, parsing works.

xml_data <- xmlToList(doc)

authors <- as.list(xml_data[["book"]][["author"]])
# This only reaches the first <book> element, so it isn't working either. Let's go back to the xml2 package.
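To illustrate the escape-character problem: I didn't record exactly which characters I had to replace, but the & in "Terry Pratchett & Neil Gaiman" is a likely candidate, because a bare & is not allowed in XML text. The sketch below is an assumption-laden illustration, not the fix I applied. It uses xml2 rather than the XML package from this section, and escape_xml and parses are hypothetical helper names.

```r
library(xml2)

# Hypothetical helper: escape the characters that are special in XML text.
# The ampersand must be replaced first, otherwise the & in "&lt;" would be
# escaped a second time.
escape_xml <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x
}

author <- "Terry Pratchett & Neil Gaiman"
raw_xml  <- paste0("<book><author>", author, "</author></book>")
safe_xml <- paste0("<book><author>", escape_xml(author), "</author></book>")

# Hypothetical helper: TRUE if the string parses as XML, FALSE if read_xml() errors.
parses <- function(txt) {
  !inherits(try(read_xml(txt), silent = TRUE), "try-error")
}

raw_ok  <- parses(raw_xml)   # the bare & makes the parser reject this string
safe_ok <- parses(safe_xml)  # the escaped version is well-formed

# The parser turns &amp; back into & when the text is read.
author_back <- xml_text(xml_find_first(read_xml(safe_xml), "//author"))
print(c(raw_ok = raw_ok, safe_ok = safe_ok))
print(author_back)
```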

Back to xml2, following the solution from https://stackoverflow.com/questions/33446888/r-convert-xml-data-to-data-frame

library(xml2)
xml_data <- xml2::read_xml('https://raw.githubusercontent.com/jacshap/Data607/refs/heads/main/Ass7_XML_file_of_books.xml')
print(xml_data)

## {xml_document}
## <books>
## [1] <book id="1">\n  <title>Egghead; or, You Can't Survive on Ideas Alone</ti ...
## [2] <book id="2">\n  <title>Dril Official "Mr. Ten Years" Anniversary Collect ...
## [3] <book id="3">\n  <title>Good Omens: The Nice and Accurate Prophecies of A ...

#good!
titles <- xml_find_all(xml_data, "//title")
titles <- trimws(xml_text(titles)) # returns a character vector
# Could write a function to do this, but there are only 5 fields, so I'll just repeat it manually
authors <- xml_find_all(xml_data, "//author")
authors <- trimws(xml_text(authors))
page_num <- xml_find_all(xml_data, "//page_number")
page_num <- trimws(xml_text(page_num))
pub_year <- xml_find_all(xml_data, "//publish_year")
pub_year <- trimws(xml_text(pub_year))
awards <- xml_find_all(xml_data, "//awards")
awards <- trimws(xml_text(awards))

# Now build the data frame from the character vectors
xml_dataframe <- data.frame(Titles = titles, Authors = authors, Page_Numbers = page_num, Publish_Year = pub_year, Awards = awards)
knitr::kable(head(xml_dataframe))

Titles	Authors	Page_Numbers	Publish_Year	Awards
Egghead; or, You Can’t Survive on Ideas Alone	Bo Burnham	240	2013	Goodreads Choice Award Nominee for Poetry (2013)
Dril Official “Mr. Ten Years” Anniversary Collection	Dril	420	2018
Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch	Terry Pratchett & Neil Gaiman	491	1990	Locus Award Nominee for Best Fantasy Novel (1991), World Fantasy Award Nominee for Best Novel (1991), Audie Award Nominee for Audio Drama and for Fantasy (2023)
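The function I mentioned skipping could look something like the sketch below. It is only an illustration: it runs on a small inline XML string (two of the three books, one award only) instead of the file on GitHub, and extract_field is a hypothetical helper name. Looking up each field per book with xml_find_first() also keeps the rows aligned if a book has an empty or missing element.

```r
library(xml2)

# Small inline stand-in for the hand-written XML file (illustrative subset).
books_xml <- '<books>
  <book id="2">
    <title>Dril Official "Mr. Ten Years" Anniversary Collection</title>
    <author>Dril</author>
    <page_number>420</page_number>
    <publish_year>2018</publish_year>
    <awards></awards>
  </book>
  <book id="3">
    <title>Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch</title>
    <author>Terry Pratchett &amp; Neil Gaiman</author>
    <page_number>491</page_number>
    <publish_year>1990</publish_year>
    <awards>Locus Award Nominee for Best Fantasy Novel (1991)</awards>
  </book>
</books>'

doc <- read_xml(books_xml)
books <- xml_find_all(doc, "//book")

# Data frame column name -> XML element name.
fields <- c(Titles = "title", Authors = "author", Page_Number = "page_number",
            Publish_Year = "publish_year", Awards = "awards")

# Hypothetical helper: trimmed text of one child element for every book node.
extract_field <- function(tag, nodes) {
  xml_text(xml_find_first(nodes, paste0("./", tag)), trim = TRUE)
}

# lapply() keeps the names of `fields`, which become the column names.
xml_dataframe <- as.data.frame(
  lapply(fields, extract_field, nodes = books),
  stringsAsFactors = FALSE
)
print(xml_dataframe)
```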

Read JSON

json_book_table <- fromJSON("https://raw.githubusercontent.com/jacshap/Data607/refs/heads/main/Ass7_Book_Table.json")
knitr::kable(head(json_book_table))

Titles	Authors	Page_Number	Publish_Year	Awards
Egghead; or, You Can’t Survive on Ideas Alone	Bo Burnham	240	2013	Goodreads Choice Award Nominee for Poetry (2013)
Dril Official “Mr. Ten Years” Anniversary Collection	Dril	420	2018
Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch	Terry Pratchett & Neil Gaiman	491	1990	Locus Award Nominee for Best Fantasy Novel (1991), World Fantasy Award Nominee for Best Novel (1991), Audie Award Nominee for Audio Drama and for Fantasy (2023)

Recap of the HTML table

knitr::kable(head(html_book_table))

Titles	Authors	Page_Number	Publish_Year	Awards
Egghead; or, You Can’t Survive on Ideas Alone	Bo Burnham	240	2013	Goodreads Choice Award Nominee for Poetry (2013)
Dril Official “Mr. Ten Years” Anniversary Collection	Dril	420	2018
Good Omens: The Nice and Accurate Prophecies of Agnes Nutter, Witch	Terry Pratchett & Neil Gaiman	491	1990	Locus Award Nominee for Best Fantasy Novel (1991), World Fantasy Award Nominee for Best Novel (1991), Audie Award Nominee for Audio Drama and for Fantasy (2023)

Conclusion

I got all three files to read in and display the same way. There was one small difference in how the tables were read in, involving the quotes around “Mr. Ten Years”. In the original data frame I had used single quotes inside a double-quoted string, and writing to the HTML and JSON files changed the quote character to `, which looked weird when printed in a knitr::kable table. I fixed it by putting the double quotes inside a single-quoted string instead.
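The prompt asks whether the three data frames are identical, and the printed tables alone can't fully answer that. Below is a minimal sketch of how the check could be done in code. It uses small stand-in data frames, not the real imports. The stand-ins assume that html_table() guessed an integer type for the page column (which I believe is its default behavior in recent rvest versions) and they reuse the Page_Numbers column name from the XML step. The normalize helper is hypothetical.

```r
# Stand-ins for the three imported data frames (shortened titles, two columns).
html_df <- data.frame(Titles = c("Egghead", "Good Omens"),
                      Page_Number = c(240L, 491L),
                      stringsAsFactors = FALSE)
xml_df  <- data.frame(Titles = c("Egghead", "Good Omens"),
                      Page_Numbers = c("240", "491"),
                      stringsAsFactors = FALSE)
json_df <- data.frame(Titles = c("Egghead", "Good Omens"),
                      Page_Number = c("240", "491"),
                      stringsAsFactors = FALSE)

# Strict comparison: column types and column names both have to match.
html_vs_json <- identical(html_df, json_df)  # integer vs character column
xml_vs_json  <- identical(xml_df, json_df)   # Page_Numbers vs Page_Number

# Hypothetical helper: plain data frame, all columns as character, shared names.
normalize <- function(d, col_names) {
  d <- as.data.frame(d, stringsAsFactors = FALSE)
  d[] <- lapply(d, as.character)
  names(d) <- col_names
  d
}

cols <- names(json_df)
frames <- lapply(list(html = html_df, xml = xml_df, json = json_df),
                 normalize, col_names = cols)

# After normalizing, the contents can be compared directly.
all_identical <- identical(frames$html, frames$json) &&
  identical(frames$xml, frames$json)
print(c(html_vs_json = html_vs_json, xml_vs_json = xml_vs_json,
        all_identical = all_identical))
```

With the real imports, a FALSE from identical() would point to differences in type or naming like these rather than in the book data itself, and all.equal() could be used to see which columns differ.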

Assignment 7 - HTML, XML, JSON - DATA 607

Jacob Shapiro

2025-10-08
