Assignment 7 - Working with XML and JSON in R

Books

As a Christian, I try to read a lot of Christian literature. These are some of my favorites:

“Mere Christianity” by C.S. Lewis - encouraging and easy to read
“The Reason for God” by Timothy Keller - logical and mentions good sources
The Bible by James, Paul, King David, and others - epic, dense, and great plot twist

Technically the Bible is a collection of books, but since it is published and referred to as one book, I will use that to my advantage.

Load Libraries

library(xml2)
library(jsonlite)
library(rvest)
library(RCurl)
library(dplyr)
library(tidyverse)
library(purrr)

Reading the book files

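# getURL() downloads each file as text, and the pipe passes that text to the
# parser as its input, so no file name argument is needed.
books_xml <- getURL("https://raw.githubusercontent.com/Ryungje/DATA607/refs/heads/main/Assignment%207/books.xml") %>% 
  read_xml() %>%
  xml_find_all(".//book")

# simplifyVector = FALSE keeps the nested list structure that the
# processing step below iterates over book by book.
books_json <- getURL("https://raw.githubusercontent.com/Ryungje/DATA607/refs/heads/main/Assignment%207/books.json") %>%
  fromJSON(simplifyVector = FALSE)

books_html <- getURL("https://raw.githubusercontent.com/Ryungje/DATA607/refs/heads/main/Assignment%207/books.html") %>%
  read_html()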

The Raw Imports

Each read returns a list-like object rather than a true data frame: it simply holds all of the file's syntax and data. For example, the JSON version presents itself as a nested list with each book stored as its own list of fields.
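To see what each parser hands back, it can help to check the class of the result. The sketch below uses small inline strings as hypothetical stand-ins for the three book files (the names `xml_raw`, `json_raw`, and `html_raw` are illustrative, not from the assignment), so it runs without downloading anything.

```r
library(xml2)
library(jsonlite)

# Hypothetical one-book stand-ins for books.xml, books.json, and books.html
xml_raw <- xml_find_all(
  read_xml("<books><book><title>Mere Christianity</title></book></books>"),
  ".//book"
)
json_raw <- fromJSON(
  '{"books":[{"title":"Mere Christianity","author":"C.S. Lewis"}]}',
  simplifyVector = FALSE  # keep the plain nested list
)
html_raw <- read_html(
  "<table><tr><th>Book Title</th></tr><tr><td>Mere Christianity</td></tr></table>"
)

# None of these is a data frame yet
class(xml_raw)   # a set of XML nodes
class(json_raw)  # a plain list
class(html_raw)  # a parsed document

# str() shows the nesting of the JSON import
str(json_raw)
```

In this sketch only the JSON import is an ordinary R list; the XML and HTML imports are xml2 objects that have to be queried with functions such as `xml_find_all()` or `html_element()`.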

Processing Imports

I am not entirely sure if this step is necessary, but I am opting to process the raw imports a little further to extract the relevant information into a data frame.

# HTML
html_df <- books_html %>% 
  html_element("table") %>% 
  html_table()

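# JSON
# Build one row per book; authors and attributes stay as list columns
json_df <- books_json$books %>%
  map_dfr(function(book) {
    tibble(
      title = book$title,
      # a book has either a single "author" or several "authors"
      authors = list(book$author %||% book$authors),
      attributes = list(book$attributes)
    )
  })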


# XML
# "./author | ./authors" matches whichever element a book uses and returns
# the matches in book order, so each author entry lines up with its title.
xml_df <- data.frame(title = xml_text(xml_find_all(books_xml, "./title")),
                     authors = xml_text(xml_find_all(books_xml, "./author | ./authors")),
                     attributes = xml_text(xml_find_all(books_xml, "./attributes")))
html_df
## # A tibble: 3 × 3
##   `Book Title`             `Author(s)`                         Attributes       
##   <chr>                    <chr>                               <chr>            
## 1 "\"Mere Christianity\""  C.S. Lewis                          Encouraging and …
## 2 "\"The Reason for God\"" Timothy Keller                      Logical and ment…
## 3 "\"The Bible\""          James, Paul, King David, and others Epic, dense, and…
json_df
## # A tibble: 3 × 3
##   title              authors    attributes
##   <chr>              <list>     <list>    
## 1 Mere Christianity  <chr [1]>  <list [2]>
## 2 The Reason for God <chr [1]>  <list [2]>
## 3 The Bible          <list [4]> <list [3]>
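
The list columns in json_df hide their contents when printed. One possible way to make them readable is to collapse each cell into a comma-separated string. This is only a sketch: `demo` is a hypothetical tibble shaped like json_df, and `collapse_col` is an illustrative helper, not part of any package.

```r
library(tibble)

# Hypothetical rows shaped like json_df (list columns of varying depth)
demo <- tibble(
  title = c("Mere Christianity", "The Bible"),
  authors = list("C.S. Lewis", list("James", "Paul")),
  attributes = list(list("Encouraging", "Easy to read"), list("Epic", "Dense"))
)

# Flatten each cell and join its values with ", "
collapse_col <- function(col) {
  vapply(col, function(x) paste(unlist(x), collapse = ", "), character(1))
}

demo_flat <- demo
demo_flat$authors <- collapse_col(demo$authors)
demo_flat$attributes <- collapse_col(demo$attributes)
demo_flat
```

After this step both columns are plain character vectors, which makes the JSON result easier to compare with the HTML table.
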
xml_df
##                title                   authors                   attributes
## 1  Mere Christianity                C.S. Lewis      EncouragingEasy to read
## 2 The Reason for God            Timothy Keller LogicalMentions good sources
## 3          The Bible JamesPaulKing DavidOthers    EpicDenseGreat plot twist
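
The XML values run together (for example, JamesPaulKing DavidOthers) because xml_text() on a parent element concatenates the text of everything inside it. A minimal sketch of one way around this is to read the child elements of each field and join them with a separator. The inline XML and its element names are assumptions standing in for books.xml, and `collapse_field` is a hypothetical helper.

```r
library(xml2)

# Hypothetical stand-in for books.xml; the real file's element names may differ
doc <- read_xml("
<books>
  <book>
    <title>Mere Christianity</title>
    <author>C.S. Lewis</author>
    <attributes><attribute>Encouraging</attribute><attribute>Easy to read</attribute></attributes>
  </book>
  <book>
    <title>The Bible</title>
    <authors><author>James</author><author>Paul</author></authors>
    <attributes><attribute>Epic</attribute><attribute>Dense</attribute></attributes>
  </book>
</books>")

books <- xml_find_all(doc, ".//book")

# Join the text of a field's child elements with ", ". When the field has
# no child elements (a single <author>), use the field's own text instead.
collapse_field <- function(book, xpath) {
  node <- xml_find_first(book, xpath)
  kids <- xml_children(node)
  if (length(kids) == 0) xml_text(node) else paste(xml_text(kids), collapse = ", ")
}

xml_tidy <- data.frame(
  title = xml_text(xml_find_first(books, "./title")),
  authors = vapply(seq_along(books), function(i) {
    collapse_field(books[[i]], "./author | ./authors")
  }, character(1)),
  attributes = vapply(seq_along(books), function(i) {
    collapse_field(books[[i]], "./attributes")
  }, character(1)),
  stringsAsFactors = FALSE
)
xml_tidy
```

Because the helper works on whatever children a field has, it does not need to know the child element names in advance.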