Assignment – Working with XML and JSON in R

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

DATA 607 LAB 7

I will begin by creating a CSV file in Google Sheets, exporting it, and importing it into R. This gives me a view of how my data should look once I convert it to the other formats.

csv_link <- "https://raw.githubusercontent.com/Zcash95/DATA607-7/main/DATA607%20LAB%207%20Books%20CSV%20-%20Sheet1.csv"
books_data <- read.csv(url(csv_link))
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on
## 'https://raw.githubusercontent.com/Zcash95/DATA607-7/main/DATA607%20LAB%207%20Books%20CSV%20-%20Sheet1.csv'
head(books_data)
##            Book.Title                     Author Year.Published           ISBN
## 1   Rich Dad Poor Dad            Robert Kiyosaki           1997 978-1612680194
## 2 Think and Grow Rich              Napolean Hill           1937 978-1612680195
## 3         Zero to One Peter Thiel, Blake Masters           2014 978-1612680196
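The dots in `Book.Title` and `Year.Published` above appear because `read.csv()` converts headers into syntactically valid R names. The following is a minimal sketch, assuming the sheet's headers contain spaces; the inline `csv_text` string and the variable names are illustrative stand-ins, not the actual file. It shows how `check.names = FALSE` keeps the headers as written:

```r
# Illustrative inline copy of the CSV (assumed headers); the real file is on GitHub.
csv_text <- paste(
  'Book Title,Author,Year Published,ISBN',
  'Rich Dad Poor Dad,Robert Kiyosaki,1997,978-1612680194',
  'Zero to One,"Peter Thiel, Blake Masters",2014,978-1612680196',
  sep = "\n"
)

# Default behaviour: headers are converted to valid R names (spaces become dots)
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the headers exactly as they appear in the file
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(raw_names)
```

Keeping the original headers can make the CSV easier to compare with the HTML table, whose columns also contain spaces.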

HTML Format

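library(rvest)  # read_html(), html_table(), and the %>% pipe

# Parse the HTML file straight from GitHub
html_data <- read_html("https://raw.githubusercontent.com/Zcash95/DATA607-7/main/books.html")

# html_table() returns a list with one tibble per <table> on the page
html_data1 <- html_data %>% html_table(fill = TRUE)

head(html_data1)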
## [[1]]
## # A tibble: 3 × 4
##   Title               Author                     `Year Published` ISBN          
##   <chr>               <chr>                                 <int> <chr>         
## 1 Rich Dad Poor Dad   Robert Kiyosaki                        1997 978-1612680194
## 2 Think and Grow Rich Napoleon Hill                          1937 978-1612680195
## 3 Zero to One         Peter Thiel, Blake Masters             2014 978-1612680196

HTML Source Code

<!DOCTYPE html>
Books Information

Title                Author                      Year Published  ISBN
Rich Dad Poor Dad    Robert Kiyosaki             1997            978-1612680194
Think and Grow Rich  Napoleon Hill               1937            978-1612680195
Zero to One          Peter Thiel, Blake Masters  2014            978-1612680196
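The table tags are not visible in the listing above, so here is a rough sketch of what a hand-written table of this shape might look like and how `rvest` reads it. The markup in `html_string` is an assumption for illustration (a `<th>` header row followed by one `<tr>` per book), not a copy of the actual `books.html`:

```r
library(rvest)

# Hypothetical markup: one header row of <th> cells, then one <tr> per book.
html_string <- '
<table>
  <tr><th>Title</th><th>Author</th><th>Year Published</th><th>ISBN</th></tr>
  <tr><td>Rich Dad Poor Dad</td><td>Robert Kiyosaki</td><td>1997</td><td>978-1612680194</td></tr>
  <tr><td>Zero to One</td><td>Peter Thiel, Blake Masters</td><td>2014</td><td>978-1612680196</td></tr>
</table>'

# html_table() returns a list with one element per <table> in the document
tables <- html_table(read_html(html_string))

# Pull out the single table so we work with a data frame rather than a list
books_html <- tables[[1]]
print(books_html)
```

Extracting the first element with `[[1]]` is what turns the list printed under `## [[1]]` into a single data frame that can be compared with the XML and JSON results.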

XML Format

library(xml2)   # read_xml(), xml_find_all(), xml_text()
library(purrr)  # map_df() and the %>% pipe

xml_link <- "https://raw.githubusercontent.com/Zcash95/DATA607-7/main/book.xml"

# Read the raw lines and collapse them into a single string before parsing
xml_content <- readLines(xml_link, warn = FALSE)
xml_string <- paste(xml_content, collapse = "\n")

xml_data <- read_xml(xml_string)

# Build one data frame row per <book> node
xml_df <- xml_data %>%
  xml_find_all(".//book") %>%
  map_df(~{
    title <- xml_text(xml_find_all(., ".//title"))
    author <- xml_text(xml_find_all(., ".//author"))
    year <- xml_text(xml_find_all(., ".//year"))    # xml_text() returns character
    isbn <- xml_text(xml_find_all(., ".//isbn"))

    data.frame(title, author, year, isbn)
  })


print(xml_df)
##                 title                     author year           isbn
## 1   Rich Dad Poor Dad            Robert Kiyosaki 1997 978-1612680194
## 2 Think and Grow Rich              Napoleon Hill 1937 978-1612680195
## 3         Zero to One Peter Thiel, Blake Masters 2014 978-1612680196

XML Source Code

<book>
  <title>Rich Dad Poor Dad</title>
  <author>Robert Kiyosaki</author>
  <year>1997</year>
  <isbn>978-1612680194</isbn>
</book>
<book>
  <title>Think and Grow Rich</title>
  <author>Napoleon Hill</author>
  <year>1937</year>
  <isbn>978-1612680195</isbn>
</book>
<book>
  <title>Zero to One</title>
  <author>Peter Thiel, Blake Masters</author>
  <year>2014</year>
  <isbn>978-1612680196</isbn>
</book>
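As an alternative sketch, a document of this shape can be turned into a data frame without `map_df()`, converting the year to an integer along the way. The `<library>` root element and the inline `xml_string_demo` are assumptions made so the example is self-contained; the root name used in the actual file is not shown above:

```r
library(xml2)

# Hypothetical document: the <library> root name is an assumption.
xml_string_demo <- '<library>
  <book>
    <title>Rich Dad Poor Dad</title>
    <author>Robert Kiyosaki</author>
    <year>1997</year>
    <isbn>978-1612680194</isbn>
  </book>
  <book>
    <title>Zero to One</title>
    <author>Peter Thiel, Blake Masters</author>
    <year>2014</year>
    <isbn>978-1612680196</isbn>
  </book>
</library>'

books <- xml_find_all(read_xml(xml_string_demo), ".//book")

# xml_find_first() returns one node per book, so the columns line up row for row
xml_books <- data.frame(
  title  = xml_text(xml_find_first(books, "./title")),
  author = xml_text(xml_find_first(books, "./author")),
  year   = as.integer(xml_text(xml_find_first(books, "./year"))),  # character to integer
  isbn   = xml_text(xml_find_first(books, "./isbn")),
  stringsAsFactors = FALSE
)
print(xml_books)
```

XML has no native number type, so every value arrives as text; the explicit `as.integer()` is what would bring the year column in line with the HTML and JSON results.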

JSON Format

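library(jsonlite)  # fromJSON()

# fromJSON() accepts a URL and simplifies the "books" array into a data frame
json_data <- fromJSON("https://raw.githubusercontent.com/Zcash95/DATA607-7/main/books.json")

# Flattening the whole list prefixes each column name with "books."
json_df <- as.data.frame(json_data)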

print(json_df)
##           books.title               books.author books.year     books.isbn
## 1   Rich Dad Poor Dad            Robert Kiyosaki       1997 978-1612680194
## 2 Think and Grow Rich              Napoleon Hill       1937 978-1612680195
## 3         Zero to One Peter Thiel, Blake Masters       2014 978-1612680196

JSON Source Code

{
  "books": [
    {
      "title": "Rich Dad Poor Dad",
      "author": "Robert Kiyosaki",
      "year": 1997,
      "isbn": "978-1612680194"
    },
    {
      "title": "Think and Grow Rich",
      "author": "Napoleon Hill",
      "year": 1937,
      "isbn": "978-1612680195"
    },
    {
      "title": "Zero to One",
      "author": "Peter Thiel, Blake Masters",
      "year": 2014,
      "isbn": "978-1612680196"
    }
  ]
}
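The `books.` prefix in the column names above comes from converting the whole parsed list with `as.data.frame()`. A small sketch, using an inline `json_string` as an illustrative stand-in for the file, shows that selecting the `books` element directly gives plain column names:

```r
library(jsonlite)

# Illustrative inline JSON with the same structure as the file
json_string <- '{"books": [
  {"title": "Rich Dad Poor Dad", "author": "Robert Kiyosaki", "year": 1997, "isbn": "978-1612680194"},
  {"title": "Zero to One", "author": "Peter Thiel, Blake Masters", "year": 2014, "isbn": "978-1612680196"}
]}'

# fromJSON() simplifies an array of objects into a data frame;
# taking $books keeps the column names as title, author, year, isbn
json_books <- fromJSON(json_string)$books
print(json_books)
```

Because the years are written without quotes in the JSON, they are parsed as numbers rather than strings.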

Conclusion:

All three files were imported into R, and at a glance their data frames look the same. I had trouble with the XML file, but with some tidying I was able to import it into the same tabular form as the JSON and HTML data. The difference I saw among the three data frames is that HTML and JSON store the year column as an integer, while XML stores it as a character. The column names also differ, as the printed output shows (Title, title, and books.title). The source code for all three files is available in the GitHub repo.

print(html_data1)
## [[1]]
## # A tibble: 3 × 4
##   Title               Author                     `Year Published` ISBN          
##   <chr>               <chr>                                 <int> <chr>         
## 1 Rich Dad Poor Dad   Robert Kiyosaki                        1997 978-1612680194
## 2 Think and Grow Rich Napoleon Hill                          1937 978-1612680195
## 3 Zero to One         Peter Thiel, Blake Masters             2014 978-1612680196
print(xml_df)
##                 title                     author year           isbn
## 1   Rich Dad Poor Dad            Robert Kiyosaki 1997 978-1612680194
## 2 Think and Grow Rich              Napoleon Hill 1937 978-1612680195
## 3         Zero to One Peter Thiel, Blake Masters 2014 978-1612680196
print(json_df)
##           books.title               books.author books.year     books.isbn
## 1   Rich Dad Poor Dad            Robert Kiyosaki       1997 978-1612680194
## 2 Think and Grow Rich              Napoleon Hill       1937 978-1612680195
## 3         Zero to One Peter Thiel, Blake Masters       2014 978-1612680196
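One way to answer the "are they identical?" question programmatically is with `identical()`. The sketch below uses small stand-in data frames (`html_like` and `xml_like`, both hypothetical) that mimic the name and type differences described above, and a helper `harmonise()` that is an illustrative assumption rather than part of the original solution:

```r
# Stand-in data frames that mimic the differences reported above
html_like <- data.frame(
  Title = c("Rich Dad Poor Dad", "Zero to One"),
  `Year Published` = c(1997L, 2014L),          # integer, as in the HTML result
  check.names = FALSE, stringsAsFactors = FALSE
)
xml_like <- data.frame(
  title = c("Rich Dad Poor Dad", "Zero to One"),
  year = c("1997", "2014"),                    # character, as in the XML result
  stringsAsFactors = FALSE
)

# Column names and types differ, so the frames are not identical as they stand
same_before <- identical(html_like, xml_like)

# Hypothetical helper: give both frames the same names and column types
harmonise <- function(df) {
  names(df) <- c("title", "year")
  df$year <- as.integer(df$year)
  df
}

same_after <- identical(harmonise(html_like), harmonise(xml_like))
print(c(before = same_before, after = same_after))
```

The same idea would extend to the real `html_data1[[1]]`, `xml_df`, and `json_df` objects once their four columns are renamed consistently.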