Overview

We are working with data on three fantasy books. We chose to store this information in three separate files: HTML, XML, and JSON. The idea is to familiarize ourselves with these different file types and structures, and to use packages to load the data from each format for downstream use in R.

Loading data and libraries

library(dplyr)    # data manipulation: mutate(), rename(); also re-exports tibble()
library(xml2)     # read raw XML from a URL and parse it
library(XML)      # view the XML file as plain text
library(rvest)    # read HTML as rendered on the web and parse HTML tables
library(jsonlite) # read JSON files into R from a URL

Reading data into R

HTML

Here we will read in the HTML-formatted table from its URL. We store the public path to the raw file in url_html and then parse it into the object html_raw with read_html().

url_html <- "https://raw.githubusercontent.com/isaias-soto/CUNY_DAT607/refs/heads/main/Assignment%207/Books.html"
html_raw <- read_html(url_html) # get html text from url 
html_raw 
## {html_document}
## <html>
## [1] <body><table>\n<tr>\n<th>Book id</th>\n<th>Title</th>\n<th>Author</th>\n< ...

You can see that the output still carries the raw HTML markup. Next, we extract the table with the html_table() function from the rvest package, which parses it as a browser would render it, and then coerce the resulting list element into a data frame with base R's as.data.frame() for downstream use in R.

html_books <- html_raw |>
  html_table() |> # get table from raw html text
  as.data.frame() # coerce list element to df 
html_books  
##   Book.id                             Title                  Author
## 1       1    A Court of Thorns and Roses #1           Sarah J. Maas
## 2       2 The Invisible Life of Addie LaRue      Victoria E. Schwab
## 3       3 A Thousand Beginnings and Endings Ellen Oh, Elsie Chapman
##   Goodreads.Rating Number.of.Pages
## 1             4.17             419
## 2             4.18             448
## 3             3.77             336
##                                             Highlights
## 1                 Romance, Magic, Adventure, Retelling
## 2                           Mystery, Romance, Mystical
## 3 Short Stories, Mythology, Science-Fiction, Retelling
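
As an aside, this page contains a single table, so calling html_table() on the whole document is enough. If a page held several tables, we could first target a specific node with a CSS selector before converting it. The snippet below is only a sketch of that approach (html_books_alt is a hypothetical name; none of this is needed for this file).

# Sketch: pick out one <table> node explicitly before converting it.
html_books_alt <- html_raw |>
  html_element("table") |> # first <table> node, selected via CSS
  html_table() |>          # parse the node into a tibble
  as.data.frame()          # coerce to a data frame, as above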

XML

Now we want to read in and prepare an XML file for downstream use in R. We begin by reading the XML file from its URL into a convenient format with read_xml().

url_xml <- "https://raw.githubusercontent.com/isaias-soto/CUNY_DAT607/refs/heads/main/Assignment%207/Books.xml"
xml_raw <- read_xml(url_xml) # read raw file into R from url
xml_raw
## {xml_document}
## <fantasy_books>
## [1] <book id="1">\n  <book_id>1</book_id>\n  <title>A Court of Thorns and Ros ...
## [2] <book id="2">\n  <book_id>2</book_id>\n  <title>The Invisible Life of Add ...
## [3] <book id="3">\n  <book_id>3</book_id>\n  <title>A Thousand Beginnings and ...

To see what our original XML file looks like as plain text, we can use xmlParse from the XML package.

xml_raw |>
  as.character() |>       # serialize the parsed document back to plain text
  xmlParse(asText = TRUE) # re-parse with XML so it prints as plain text
## <?xml version="1.0" encoding="UTF-8"?>
## <fantasy_books>
##   <book id="1">
##     <book_id>1</book_id>
##     <title>A Court of Thorns and Roses #1</title>
##     <author>Sarah J. Maas</author>
##     <goodreads_rating>4.17</goodreads_rating>
##     <number_of_pages>419</number_of_pages>
##     <highlights>Romance, Magic, Adventure, Retelling</highlights>
##   </book>
##   <book id="2">
##     <book_id>2</book_id>
##     <title>The Invisible Life of Addie LaRue</title>
##     <author>Victoria E. Schwab</author>
##     <goodreads_rating>4.18</goodreads_rating>
##     <number_of_pages>448</number_of_pages>
##     <highlights>Mystery, Romance, Mystical</highlights>
##   </book>
##   <book id="3">
##     <book_id>3</book_id>
##     <title>A Thousand Beginnings and Endings</title>
##     <author>Ellen Oh, Elsie Chapman</author>
##     <goodreads_rating>3.77</goodreads_rating>
##     <number_of_pages>336</number_of_pages>
##     <highlights>Short Stories, Mythology, Science-Fiction, Retelling</highlights>
##   </book>
## </fantasy_books>
## 
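
As an aside, xml2 itself can summarize the document's layout with xml_structure(), which prints the element and attribute names without their values. This is only a sketch of an alternative inspection step, not something the rest of the workflow depends on.

# Sketch: print the tag hierarchy (element/attribute names, no values)
# using xml2 alone, without the XML package.
xml_structure(xml_raw)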

Our next step is to take the conveniently formatted XML document and parse it into a usable tibble. We use xml_find_all() to collect all nodes sharing a tag, and xml_text(), xml_integer(), and xml_double() to extract just the values from those nodes.

## EXTRACT INFO

book_id <- xml_raw |> 
  xml_find_all(".//book_id") |> # identify nodes
  xml_text() # extract text from nodes

title <- xml_raw |> 
  xml_find_all(".//title") |> # identify nodes
  xml_text() # extract text from nodes

author <- xml_raw |> 
  xml_find_all(".//author") |> # identify nodes
  xml_text() # extract text from nodes

goodreads_rating <- xml_raw |> 
  xml_find_all(".//goodreads_rating") |> # identify nodes
  xml_double() # extract numeric from nodes

number_of_pages <- xml_raw |> 
  xml_find_all(".//number_of_pages") |> # identify nodes
  xml_integer() # extract integers from nodes

highlights <- xml_raw |> 
  xml_find_all(".//highlights") |> # identify nodes
  xml_text() # extract text from nodes


## FORMAT AS TIBBLE

xml_books <- tibble(book_id = book_id, title = title, author = author,
                    goodreads_rating = goodreads_rating, 
                    number_of_pages = number_of_pages, highlights = highlights)
xml_books
## # A tibble: 3 × 6
##   book_id title               author goodreads_rating number_of_pages highlights
##   <chr>   <chr>               <chr>             <dbl>           <int> <chr>     
## 1 1       A Court of Thorns … Sarah…             4.17             419 Romance, …
## 2 2       The Invisible Life… Victo…             4.18             448 Mystery, …
## 3 3       A Thousand Beginni… Ellen…             3.77             336 Short Sto…
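
As an aside, the six near-identical pipelines above could be collapsed into a single loop over the tag names. The sketch below builds the same tibble under that assumption (tags and xml_books_alt are hypothetical names introduced here, not part of the workflow above).

# Sketch: loop over the tag names once, then fix the two numeric columns.
tags <- c("book_id", "title", "author", "goodreads_rating",
          "number_of_pages", "highlights")
xml_books_alt <- tags |>
  lapply(\(tag) xml_text(xml_find_all(xml_raw, paste0(".//", tag)))) |>
  setNames(tags) |> # name the list elements after their tags
  as_tibble() |>    # a named list of equal-length vectors becomes a tibble
  mutate(goodreads_rating = as.numeric(goodreads_rating),
         number_of_pages = as.integer(number_of_pages))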

JSON

Next, we will apply the same ideas as above to our JSON file. We begin with fromJSON() from the jsonlite package, which here returns the raw contents as a character matrix, and then drop its first row before continuing.

url_json <- "https://raw.githubusercontent.com/isaias-soto/CUNY_DAT607/refs/heads/main/Assignment%207/Books.JSON"
json_raw <- fromJSON(url_json) # get raw json file from url
json_raw <- json_raw[-1,]
json_raw
##      [,1] [,2]                                [,3]                      [,4]  
## [1,] "1"  "A Court of Thorns and Roses #1"    "Sarah J. Maas"           "4.17"
## [2,] "2"  "The Invisible Life of Addie LaRue" "Victoria E. Schwab"      "4.18"
## [3,] "3"  "A Thousand Beginnings and Endings" "Ellen Oh, Elsie Chapman" "3.77"
##      [,5]  [,6]                                                  
## [1,] "419" "Romance, Magic, Adventure, Retelling"                
## [2,] "448" "Mystery, Romance, Mystical"                          
## [3,] "336" "Short Stories, Mythology, Science-Fiction, Retelling"

Finally, we can use base R's as.data.frame() to coerce the matrix into a data frame. We'll also rename the columns and coerce the appropriate variables to their correct data types.

json_books <- as.data.frame(json_raw) |>
  mutate(V1 = as.integer(V1), V4 = as.numeric(V4), V5 = as.integer(V5)) |>
  rename(book_id = V1, title = V2, author = V3, goodreads_rating = V4, 
         number_of_pages = V5, highlights = V6)
json_books
##   book_id                             title                  author
## 1       1    A Court of Thorns and Roses #1           Sarah J. Maas
## 2       2 The Invisible Life of Addie LaRue      Victoria E. Schwab
## 3       3 A Thousand Beginnings and Endings Ellen Oh, Elsie Chapman
##   goodreads_rating number_of_pages
## 1             4.17             419
## 2             4.18             448
## 3             3.77             336
##                                             highlights
## 1                 Romance, Magic, Adventure, Retelling
## 2                           Mystery, Romance, Mystical
## 3 Short Stories, Mythology, Science-Fiction, Retelling
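
As an aside, the amount of manual cleanup depends on how the JSON is laid out. If the file had instead stored the books as an array of named objects (a records-style layout, which is an assumption about an alternative format and not how Books.JSON above is written), fromJSON() would simplify it straight into a data frame with no renaming or row-dropping needed. A minimal sketch:

# Sketch: a records-style JSON array parses directly into a data frame.
# The two inline records below are illustrative, not the contents of Books.JSON.
json_books_alt <- fromJSON('[
  {"book_id": 1, "title": "A Court of Thorns and Roses #1"},
  {"book_id": 2, "title": "The Invisible Life of Addie LaRue"}
]')
str(json_books_alt) # a data frame with book_id and title columns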

Conclusion

In summary, we read three different file types (HTML, XML, and JSON) into R and prepared them as data frames for downstream use. Each file type required a different preparation method, and the raw files looked quite different from one another. After preparation, however, all three objects contain the same data, with only minor differences left over (for example, html_books keeps the original column headers, and xml_books stores book_id as character).
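
As a final sanity check, we can harmonize the HTML column names and compare the objects directly. This is only a sketch using the objects built above; xml_books would additionally need its character book_id converted before an equivalent comparison.

# Sketch: harmonize the HTML column names, then compare with the JSON version.
html_check <- html_books |>
  rename(book_id = Book.id, title = Title, author = Author,
         goodreads_rating = Goodreads.Rating,
         number_of_pages = Number.of.Pages, highlights = Highlights)
all.equal(html_check, json_books) # should report TRUE if the two agree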