Data 607 Assignment 7

Introduction

For this assignment I chose the following three books: Rabid: A Cultural History of the World’s Most Diabolical Virus, The Haunting of Hill House, and Everything Is Tuberculosis: The History and Persistence of Our Deadliest Infection. For these books I created a table that includes the title, author(s), genre and Goodreads rating of each one. I stored this table in three formats: a JSON file, an HTML file and an XML file.

In working with these data files we will use the following libraries:

The xml2 library
The rvest library
The jsonlite library
The dplyr library
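
The chunks in this report call functions from these packages without showing where they are attached. A minimal setup sketch, assuming all four packages are already installed, could look like this:

```r
# Attach the packages used in this report.
# Assumption: all four are already installed (e.g. via install.packages()).
library(xml2)      # read_xml(), xml_find_all(), xml_text()
library(rvest)     # read_html(), html_element(), html_table()
library(jsonlite)  # fromJSON()
library(dplyr)     # the %>% pipe
```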

The HTML File

To start, let’s read in the HTML file, check whether it is stored as a data frame, and then print it.

hurl <- "https://raw.githubusercontent.com/WendyR20/DATA-607-Assignment-7/refs/heads/main/books.html"
html_books <- read_html(hurl)

is.data.frame(html_books)

## [1] FALSE

html_books

## {html_document}
## <html>
## [1] <body>\n\t<table border="1">\n<tr>\n<th>Title</th>\n\t  <th>Authors</th>\ ...

As we can see, the read_html function does not return a data frame: the file is parsed into an html_document object. I’d like the data to be stored in a data frame and to look more like the original table in the HTML file, so let us use the html_element and html_table functions from the rvest package.

html_df <- html_books %>%
  html_element("table") %>%
  html_table()

Let’s print!

html_df

## # A tibble: 3 × 4
##   Title                                           Authors Genre Goodreads_Rating
##   <chr>                                           <chr>   <chr>            <dbl>
## 1 Rabid: A Cultural History of the World's Most … Bill W… Non-…             3.72
## 2 The Haunting of Hill House                      Shirle… Horr…             3.81
## 3 Everything Is Tuberculosis: The History and Pe… John G… Non-…             4.4
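
The same html_element and html_table pipeline can be tried offline on a small inline HTML string. This is only an illustrative sketch: the table contents and the names demo_html and demo_df are made up and are not part of the assignment files.

```r
library(rvest)

# A tiny, made-up HTML table used only for illustration.
demo_html <- read_html(paste0(
  "<html><body><table>",
  "<tr><th>Title</th><th>Goodreads_Rating</th></tr>",
  "<tr><td>Book A</td><td>3.72</td></tr>",
  "<tr><td>Book B</td><td>4.40</td></tr>",
  "</table></body></html>"
))

demo_df <- demo_html %>%
  html_element("table") %>%  # select the first <table> node
  html_table()               # header row becomes the column names

is.data.frame(demo_df)            # the result is a tibble, which is a data frame
class(demo_df$Goodreads_Rating)   # numeric text is converted to a number
```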

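Our table is currently stored as a tibble. A tibble already counts as a data frame, but let’s convert it to a plain data frame so that it can be compared directly with the other two later on.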

html_df <- as.data.frame(html_df)
is.data.frame(html_df)

## [1] TRUE
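
Why the conversion matters can be seen in a small sketch. The tibble below is made up for illustration, and the tibble package is assumed to be installed (it comes along with dplyr and rvest).

```r
library(tibble)

# Made-up example data, not the assignment's table.
tb <- tibble(Title = c("Book A", "Book B"), Rating = c(3.72, 4.40))

is.data.frame(tb)  # TRUE: a tibble inherits from data.frame
class(tb)          # "tbl_df" "tbl" "data.frame"

plain <- as.data.frame(tb)
class(plain)       # "data.frame"

# identical() also compares the class attribute, so the tibble and the
# plain data frame are not identical even though they hold the same values.
identical(tb, plain)  # FALSE
```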

Now let us print the data frame as a table.

knitr::kable(html_df,
             format = "markdown",
             caption = "HTML Dataframe")

HTML Dataframe
Title	Authors	Genre	Goodreads_Rating
Rabid: A Cultural History of the World’s Most Diabolical Virus	Bill Wasik, Monica Murphy	Non-Fiction	3.72
The Haunting of Hill House	Shirley Jackson	Horror	3.81
Everything Is Tuberculosis: The History and Persistence of Our Deadliest Infection	John Green	Non-Fiction	4.40

Great!

The XML file

Now let us move on to the XML file and read in our book table data.

xurl <- "https://raw.githubusercontent.com/WendyR20/DATA-607-Assignment-7/refs/heads/main/books.xml"
xml_books <- read_xml(xurl)

is.data.frame(xml_books)

## [1] FALSE

And now let’s print our data.

xml_books

## {xml_document}
## <books>
## [1] <book>\n  <Title>Rabid: A Cultural History of the World's Most Diabolical ...
## [2] <book>\n  <Title>The Haunting of Hill House</Title>\n  <Authors>Shirley J ...
## [3] <book>\n  <Title>Everything Is Tuberculosis: The History and Persistence  ...

As with the HTML file, read_xml returns a parsed xml_document object rather than a data frame, and the printed output is hard to read. Let’s try to create a neater-looking data frame from our XML data.

# Finding all book nodes
xrows <- xml_find_all(xml_books, "//book") 

# Extracting the data for each column
title <- xml_text(xml_find_all(xrows, "Title"))
authors <- xml_text(xml_find_all(xrows, "Authors"))
genre <- xml_text(xml_find_all(xrows, "Genre"))
goodreads <- as.double(xml_text(xml_find_all(xrows, "Goodreads_Rating")))

Okay, let’s save our extracted information as a data frame.

# Create a data frame
xml_df <- data.frame(Title = title, Authors = authors, Genre = genre, 
                     Goodreads_Rating = goodreads, stringsAsFactors = FALSE)

Let’s see what the data looks like now when we print it.

xml_df

##                                                                                Title
## 1                     Rabid: A Cultural History of the World's Most Diabolical Virus
## 2                                                         The Haunting of Hill House
## 3 Everything Is Tuberculosis: The History and Persistence of Our Deadliest Infection
##                     Authors       Genre Goodreads_Rating
## 1 Bill Wasik, Monica Murphy Non-Fiction             3.72
## 2           Shirley Jackson      Horror             3.81
## 3                John Green Non-Fiction             4.40

And now let’s print it in table format!

knitr::kable(xml_df,
             format = "markdown",
             caption = "XML Dataframe")

XML Dataframe
Title	Authors	Genre	Goodreads_Rating
Rabid: A Cultural History of the World’s Most Diabolical Virus	Bill Wasik, Monica Murphy	Non-Fiction	3.72
The Haunting of Hill House	Shirley Jackson	Horror	3.81
Everything Is Tuberculosis: The History and Persistence of Our Deadliest Infection	John Green	Non-Fiction	4.40

Great, this is much easier to read than earlier!
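
One caveat with extracting each column through xml_find_all: if a book were missing a field, that column would come back shorter than the others. That does not happen with this file, but a hedged alternative is xml_find_first, which returns one result per book and yields NA where a field is absent. The inline XML and the names below are made up for illustration.

```r
library(xml2)

# Made-up XML in which the second book has no <Genre> element.
demo_xml <- read_xml("<books>
  <book><Title>Book A</Title><Genre>Horror</Genre></book>
  <book><Title>Book B</Title></book>
</books>")

books <- xml_find_all(demo_xml, "//book")

# xml_find_all() would return only one Genre node for two books.
length(xml_find_all(books, "./Genre"))

# xml_find_first() keeps one entry per book; a missing field becomes NA.
titles <- xml_text(xml_find_first(books, "./Title"))
genres <- xml_text(xml_find_first(books, "./Genre"))

demo_df <- data.frame(Title = titles, Genre = genres,
                      stringsAsFactors = FALSE)
demo_df
```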

The JSON file

Finally, let’s read in the JSON file, check whether it is stored as a data frame, and then print it.

#reading in json file table
jurl <- "https://raw.githubusercontent.com/WendyR20/DATA-607-Assignment-7/refs/heads/main/books.json"
json_books <- fromJSON(jurl)

#checking to see if our data was read into a data frame
is.data.frame(json_books)

## [1] TRUE
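
The reason no extra work is needed is that fromJSON, by default, simplifies a JSON array of objects with matching keys into a data frame; books.json presumably has that shape. A small sketch with made-up inline JSON shows the default and what happens when that simplification is turned off:

```r
library(jsonlite)

# Made-up JSON: an array of records that share the same keys.
demo_json <- '[
  {"Title": "Book A", "Goodreads_Rating": 3.72},
  {"Title": "Book B", "Goodreads_Rating": 4.40}
]'

# Default behaviour: the records are simplified into a data frame.
demo_books <- fromJSON(demo_json)
is.data.frame(demo_books)

# With data frame simplification switched off, the result is a list of records.
demo_list <- fromJSON(demo_json, simplifyDataFrame = FALSE)
is.data.frame(demo_list)
```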

Thankfully, the jsonlite package read our data straight into a data frame. Let’s print it and see whether it is easy to read.

json_books

##                                                                                Title
## 1                     Rabid: A Cultural History of the World's Most Diabolical Virus
## 2                                                         The Haunting of Hill House
## 3 Everything Is Tuberculosis: The History and Persistence of Our Deadliest Infection
##                     Authors       Genre Goodreads_Rating
## 1 Bill Wasik, Monica Murphy Non-Fiction             3.72
## 2           Shirley Jackson      Horror             3.81
## 3                John Green Non-Fiction             4.40

Okay, it seems we don’t have to take any extra steps to make our data look neat. Let’s print our data again in table format.

knitr::kable(json_books,
             format = "markdown",
             caption = "JSON Dataframe")

JSON Dataframe
Title	Authors	Genre	Goodreads_Rating
Rabid: A Cultural History of the World’s Most Diabolical Virus	Bill Wasik, Monica Murphy	Non-Fiction	3.72
The Haunting of Hill House	Shirley Jackson	Horror	3.81
Everything Is Tuberculosis: The History and Persistence of Our Deadliest Infection	John Green	Non-Fiction	4.40

Identical Data Frames

Let’s check whether the three data frames we built are identical.

identical(html_df, xml_df)

## [1] TRUE

identical(xml_df, json_books)

## [1] TRUE

identical(html_df, json_books)

## [1] TRUE
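
identical() is a strict test: values, types and attributes must all match exactly. If one of these checks had returned FALSE, all.equal() could help diagnose it, since it tolerates tiny numeric differences and describes any mismatch it finds. A minimal sketch on made-up data:

```r
# Made-up data frames that differ only by a tiny floating-point amount.
a <- data.frame(Rating = c(3.72, 4.40))
b <- a
b$Rating[1] <- b$Rating[1] + 1e-12

identical(a, b)          # FALSE: the values are not bit-for-bit equal
isTRUE(all.equal(a, b))  # TRUE: the difference is within the default tolerance
```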

Comparing The Process of Reading in the HTML, XML and JSON File Data

When the files were first read in, the HTML and XML data came back as parsed document objects (from read_html and read_xml) rather than data frames, so the values had to be extracted and reshaped before they could be stored in a data frame. The JSON data, by contrast, was read directly into a neat data frame by the jsonlite package. The three objects were therefore not identical at first (two were not even data frames), but after these steps all three data frames were identical.