DATA607

Overview: Working with XML and JSON in R

For this assignment we created individual xml, JSON, and html files to store information related to books we have read and enjoyed. After uploading these files to the internet, we developed R scripts to retrieve the files and then the information into separate dataframes.

This exercise facilitates our understanding of xml, JSON, and html file structures as well as our ablity to manipulate this information.

library(xml2)
library(flatxml)
library(methods)
library(htmltab)
library(kableExtra)
library(tidyverse)
library(magrittr)
library(jsonlite)

HTML

Read html table and save into dataframe. In this case, I used the htmltab function.

#Read in html table from github and save as dataframe

url <- "https://raw.githubusercontent.com/sconnin/DATA607_HW7/main/HW7_html_table.html?token=APC5QM2CW2P3H5W6W6QXZSTAK6PXI"

df_books<-htmltab(doc=url)

#Print table without assigned row numbers

row.names(df_books) <- NULL

df_books%>%kbl%>%kable_material(c("striped"))

ID	Title	Author	ISBN	Publisher	Pages	Copyright Year
1	Their Eyes Were Watching God	Zora Neale Hurston	9780060916503	HARPERCOLLINS PUBLISHERS	207	1990
2	Sand County Almanac With Essays on Conservation from Round River	Aldo Leopold	9780345345059	PENGUIN RANDOM HOUSE	320	1966
3	Desert Solitaire	Edward Abbey	9780671695880	POCKET BOOKS	288	1990

XML

Read xml file from Github and convert to dataframe. I used the flatxml library for this purpose.

# Convert the input xml file to a data frame.

df_xml <- fxml_importXMLFlat("https://raw.githubusercontent.com/sconnin/DATA607_HW7/main/books.xml")


df_xml%>%kbl%>%kable_material(c("striped"))

elem.	elemid.	attr.	value.	level1	level2	level3
Books	1	NA	NA	Books	NA	NA
Book	2	NA	NA	Books	Book	NA
ID	3	NA	1	Books	Book	ID
Title	4	NA	Their Eyes Were Watching God	Books	Book	Title
Author	5	NA	Zora Neale Hurston	Books	Book	Author
ISBN	6	NA	9780060916503	Books	Book	ISBN
Publisher	7	NA	HARPERCOLLINS PUBLISHERS	Books	Book	Publisher
Pages	8	NA	207	Books	Book	Pages
Copyright	9	NA	1990	Books	Book	Copyright
Book	10	NA	NA	Books	Book	NA
ID	11	NA	2	Books	Book	ID
Title	12	NA	Sand County Almanac With Essays on Conservation from Round River	Books	Book	Title
Author	13	NA	Aldo Leopold	Books	Book	Author
ISBN	14	NA	9780345345059	Books	Book	ISBN
Publisher	15	NA	PENGUIN RANDOM HOUSE	Books	Book	Publisher
Pages	16	NA	320	Books	Book	Pages
Copyright	17	NA	1966	Books	Book	Copyright
Book	18	NA	NA	Books	Book	NA
ID	19	NA	3	Books	Book	ID
Title	20	NA	Desert Solitaire	Books	Book	Title
Author	21	NA	Edward Abbey	Books	Book	Author
ISBN	22	NA	9780671695880	Books	Book	ISBN
Publisher	23	NA	POCKET BOOKS	Books	Book	Publisher
Pages	24	NA	288	Books	Book	Pages
Copyright	25	NA	1990	Books	Book	Copyright

JSON

Read Json file from Github and convert to dataframe. I used the JSONlite library for this purpose.

url <- "https://raw.githubusercontent.com/sconnin/DATA607_HW7/main/HW7_JSON.json"

books <- jsonlite::fromJSON(url, simplifyVector = TRUE)

df_bks <- data.frame(books)

row.names(df_bks)

## [1] "1" "2" "3"

df_bks%>%kbl%>%kable_material(c("striped"))

ID	Title	Author	ISBN	Pages	Copyright.Year
1	Their Eyes Were Watching God	Zora Neale Hurston	9780060916503	207	1990
2	Sand County Almanac With Essays on Conservation from Round River	Aldo Leopold	9780345345059	320	1996
3	Desert Solitaire	Edward Abbey	9780671695880	288	1990

Assessment

The dataframes produced from html and JSON file formats are essentially identical. The dataframe produced from the xml file format has been flattened but still retains the hierarchical relationships (e.g., root, path) inherent to xml. Additional manipulation is required to remove these features.

DATA607_HW7

Sean Connin

Overview: Working with XML and JSON in R

HTML

XML

JSON

Assessment