We are working with data containing information on 3 fantasy books. We chose to store this information in three separate files: html, xml, and json. The idea is to familiarize ourselves with these different file types and structures, and using packages to load the data from these different formats for downstream use in R.
library(dplyr) # to use piping function `|>`
library(xml2) #reading raw data from url and parsing xml file
library(XML) # view xml file as its plain text
library(rvest) # read html as seen on web, parse html text
library(jsonlite) # read json files into R from url
HTML
Here we will read in the url for the html formatted table. We are
taking a public access point, or path to the raw file, and inserting it
into the object html_raw
.
url_html <- "https://raw.githubusercontent.com/isaias-soto/CUNY_DAT607/refs/heads/main/Assignment%207/Books.html"
html_raw <- read_html(url_html) # get html text from url
html_raw
## {html_document}
## <html>
## [1] <body><table>\n<tr>\n<th>Book id</th>\n<th>Title</th>\n<th>Author</th>\n< ...
You can see how the output of the raw file still has holdover
formatting of the html file. Next, we want to coerce this into a data
frame using the html_table
function from the
rvest
package to first coerce this into a table as if the
html code were run on the web. Then, using base
R package
we’ll coerce this into a dataframe for downstream use in R.
html_books <- html_raw |>
html_table() |> # get table from raw html text
as.data.frame() # coerce list element to df
html_books
## Book.id Title Author
## 1 1 A Court of Thorns and Roses #1 Sarah J. Maas
## 2 2 The Invisible Life of Addie LaRue Victoria E. Schwab
## 3 3 A Thousand Beginnings and Endings Ellen Oh, Elsie Chapman
## Goodreads.Rating Number.of.Pages
## 1 4.17 419
## 2 4.18 448
## 3 3.77 336
## Highlights
## 1 Romance, Magic, Adventure, Retelling
## 2 Mystery, Romance, Mystical
## 3 Short Stories, Mythology, Science-Fiction, Retelling
XML
Now we want to read in and prepare an xml file for downstream use in R. We begin by getting the xml file from the url in a convenient format.
url_xml <- "https://raw.githubusercontent.com/isaias-soto/CUNY_DAT607/refs/heads/main/Assignment%207/Books.xml"
xml_raw <- read_xml(url_xml) # read raw file into R from url
xml_raw
## {xml_document}
## <fantasy_books>
## [1] <book id="1">\n <book_id>1</book_id>\n <title>A Court of Thorns and Ros ...
## [2] <book id="2">\n <book_id>2</book_id>\n <title>The Invisible Life of Add ...
## [3] <book id="3">\n <book_id>3</book_id>\n <title>A Thousand Beginnings and ...
To see how our original xml file looks like as plain text we can use
xmlParse
from the XML
package.
xml_raw |>
xmlParse(url_xml)
## <?xml version="1.0" encoding="UTF-8"?>
## <fantasy_books>
## <book id="1">
## <book_id>1</book_id>
## <title>A Court of Thorns and Roses #1</title>
## <author>Sarah J. Maas</author>
## <goodreads_rating>4.17</goodreads_rating>
## <number_of_pages>419</number_of_pages>
## <highlights>Romance, Magic, Adventure, Retelling</highlights>
## </book>
## <book id="2">
## <book_id>2</book_id>
## <title>The Invisible Life of Addie LaRue</title>
## <author>Victoria E. Schwab</author>
## <goodreads_rating>4.18</goodreads_rating>
## <number_of_pages>448</number_of_pages>
## <highlights>Mystery, Romance, Mystical</highlights>
## </book>
## <book id="3">
## <book_id>3</book_id>
## <title>A Thousand Beginnings and Endings</title>
## <author>Ellen Oh, Elsie Chapman</author>
## <goodreads_rating>3.77</goodreads_rating>
## <number_of_pages>336</number_of_pages>
## <highlights>Short Stories, Mythology, Science-Fiction, Retelling</highlights>
## </book>
## </fantasy_books>
##
Our next step is to take the conveniently formatted xml file and
parse through to get a usable tibble. We’ll be using
xml_find_all
to extract nodes with the same tag and use
xml_text
, xml_integer
, and
xml_double
to extract only the data from the nodes.
## EXTRACT INFO
book_id <- xml_raw |>
xml_find_all(".//book_id") |> # identify nodes
xml_text() # extract text from nodes
title <- xml_raw |>
xml_find_all(".//title") |> # identify nodes
xml_text() # extract text from nodes
author <- xml_raw |>
xml_find_all(".//author") |> # identify nodes
xml_text() # extract text from nodes
goodreads_rating <- xml_raw |>
xml_find_all(".//goodreads_rating") |> # identify nodes
xml_double() # extract numeric from nodes
number_of_pages <- xml_raw |>
xml_find_all(".//number_of_pages") |> # identify nodes
xml_integer() # extract integers from nodes
highlights <- xml_raw |>
xml_find_all(".//highlights") |> # identify nodes
xml_text() # extract text from nodes
## FORMAT AS TIBBLE
xml_books <- tibble(book_id = book_id, title = title, author = author,
goodreads_rating = goodreads_rating,
number_of_pages = number_of_pages, highlights = highlights)
xml_books
## # A tibble: 3 × 6
## book_id title author goodreads_rating number_of_pages highlights
## <chr> <chr> <chr> <dbl> <int> <chr>
## 1 1 A Court of Thorns … Sarah… 4.17 419 Romance, …
## 2 2 The Invisible Life… Victo… 4.18 448 Mystery, …
## 3 3 A Thousand Beginni… Ellen… 3.77 336 Short Sto…
JSON
Next, we will apply the same ideas as above to our json file. We’ll
begin by using fromJSON
from the jsonlite
package to get a raw output of a list-like dataframe.
url_json <- "https://raw.githubusercontent.com/isaias-soto/CUNY_DAT607/refs/heads/main/Assignment%207/Books.JSON"
json_raw <- fromJSON(url_json) # get raw json file from url
json_raw <- json_raw[-1,]
json_raw
## [,1] [,2] [,3] [,4]
## [1,] "1" "A Court of Thorns and Roses #1" "Sarah J. Maas" "4.17"
## [2,] "2" "The Invisible Life of Addie LaRue" "Victoria E. Schwab" "4.18"
## [3,] "3" "A Thousand Beginnings and Endings" "Ellen Oh, Elsie Chapman" "3.77"
## [,5] [,6]
## [1,] "419" "Romance, Magic, Adventure, Retelling"
## [2,] "448" "Mystery, Romance, Mystical"
## [3,] "336" "Short Stories, Mythology, Science-Fiction, Retelling"
Finally, we can use the base
package
as.data.frame
to coerce the output into a dataframe. We’ll
also have to rename columns and coerce the appropriate variables into
the correct data types.
json_books <- as.data.frame(json_raw) |>
mutate(V1 = as.integer(V1), V4 = as.numeric(V4), V5 = as.integer(V5)) |>
rename(book_id = V1, title = V2, author = V3, goodreads_rating = V4,
number_of_pages = V5, highlights = V6)
json_books
## book_id title author
## 1 1 A Court of Thorns and Roses #1 Sarah J. Maas
## 2 2 The Invisible Life of Addie LaRue Victoria E. Schwab
## 3 3 A Thousand Beginnings and Endings Ellen Oh, Elsie Chapman
## goodreads_rating number_of_pages
## 1 4.17 419
## 2 4.18 448
## 3 3.77 336
## highlights
## 1 Romance, Magic, Adventure, Retelling
## 2 Mystery, Romance, Mystical
## 3 Short Stories, Mythology, Science-Fiction, Retelling
In summary, we took read three different file types html, xml, and json into R and prepared them as dataframes for downstream use. Each file type used different preparation methods and the raw files looked different from each other. After preparation, each dataframe/tibble looks exactly like each other with the same data types.