DATA 607-Assignment #7

Assignment Week 7

knitr::include_graphics("C:/Users/para2/Documents/R_Working_Directory/pittsburgh+bridges/Assignment #7 DATA 607/Assignment - Working with XML and JSON in R.png")

For this assignment, I needed to create three (3) source files, in HTML, JSON, and XML, to be imported into R as separate data frames. The files are named Books.html, books.json, and books.xml and are located in my GitHub repository under Assignment7.

Reading in the Files

In this section we read in the three (3) file formats so that each one can then be set up in its own data frame.

# Packages that provide fromJSON(), read_xml(), and read_html()
library(jsonlite)
library(xml2)
library(rvest)
#JSON read in
books_json <- fromJSON("https://raw.githubusercontent.com/Aconrard/DATA607/main/Assignment7/books.json")
#XML read in
books_xml <- read_xml("https://raw.githubusercontent.com/Aconrard/DATA607/main/Assignment7/books.xml")
#HTML read in
books_html <- read_html("https://raw.githubusercontent.com/Aconrard/DATA607/main/Assignment7/Books.html")

Looking at the Entry

In this section we take a look at the object that each read-in function returns. We can clearly see that the structures are quite different: the JSON file already arrives as a data frame, while the XML and HTML files arrive as document objects that need special handling to make them usable.

#JSON
head(books_json,6)

##                                                              title
## 1 Descision in Philadelphia: The Constitutional Convention of 1787
## 2                                            The Mismeasure of Man
## 3                   A Christmas Carol and Other Christmas Writings
## 4              The Politically Incorrect Guide to American History
## 5                Applied Spatial Statistics for Public Health Data
## 6                              The Southwest: New American Cooking
##                                       author_s              topic
## 1 Collier, Christopher; Collier, James Lincoln   American History
## 2                           Gould, Stephen Jay        Non-fiction
## 3                             Dickens, Charles            Fiction
## 4                           Woods Jr, Thomas E             Parody
## 5           Waller, Lance A.; Gotway, Carol A. Spatial Statistics
## 6                                  Long, Kathi            Cooking
##                isbn num_pages      type
## 1     0-345-34652-1       434 paperback
## 2 978-0-393-31425-0       432 paperback
## 3 978-0-14-043905-2       289 paperback
## 4  978-089526-047-5       270 paperback
## 5     0-471-38771-1       494 hardcover
## 6     0-7370-2047-4       144 hardcover

#XML
books_xml

## {xml_document}
## <books>
## [1] <book>\n  <title>Descision in Philadelphia: The Constitutional Convention ...
## [2] <book>\n  <title>The Mismeasure of Man</title>\n  <authors>\n    <author> ...
## [3] <book>\n  <title>A Christmas Carol and Other Christmas Writings</title>\n ...
## [4] <book>\n  <title>The Politically Incorrect Guide to American History</tit ...
## [5] <book>\n  <title>Applied Spatial Statistics for Public Health Data</title ...
## [6] <book>\n  <title>The Southwest: New American Cooking</title>\n  <authors> ...

#HTML
books_html

## {html_document}
## <html lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <table>\n<caption>Book Information</caption>\n        <thead> ...
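To make the difference between the three objects concrete, the following is a minimal sketch that checks the class of each one and prints the XML nesting. It uses small inline strings as hypothetical stand-ins for the real files, so the contents are assumptions and not the assignment data.

```r
library(jsonlite)
library(xml2)

# Hypothetical inline samples standing in for books.json, books.xml, and Books.html
json_obj <- fromJSON('[{"title": "Book A"}, {"title": "Book B"}]')
xml_obj  <- read_xml("<books><book><title>Book A</title></book></books>")
html_obj <- read_html("<table><tr><th>Title</th></tr><tr><td>Book A</td></tr></table>")

# The JSON result is already tabular; the other two are parsed document trees
class(json_obj)
class(xml_obj)
class(html_obj)

# Print the element nesting of the XML document without its text content
xml_structure(xml_obj)
```

Checking the class first is a quick way to decide whether an object can go straight into analysis or needs its nodes extracted.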

A Little Transformation

In this section we alter the structures so that they resemble something we can transform and analyze. Luckily, the JSON library (jsonlite) brings the file over in a form that already resembles the original table when we use fromJSON. If we had instead used the read_json function from the same package with its default settings, the file would come over as a list that would need to be restructured into a data frame. For the purposes of this assignment, we will consider the JSON data frame sufficient and move on to the XML and HTML files.
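The difference between the two jsonlite readers can be sketched as follows. The two-record JSON string is a hypothetical sample, not the contents of books.json, and the row-binding step at the end is only one of several ways to rebuild a data frame from the list.

```r
library(jsonlite)

# Hypothetical two-record sample standing in for books.json
sample_json <- '[
  {"title": "Book A", "num_pages": 100},
  {"title": "Book B", "num_pages": 200}
]'

# fromJSON() simplifies an array of records into a data frame by default
as_df <- fromJSON(sample_json)
class(as_df)

# read_json() reads from a file and keeps the nested list structure by default
tmp <- tempfile(fileext = ".json")
writeLines(sample_json, tmp)
as_list <- read_json(tmp)
class(as_list)

# One way to rebuild a data frame from the list: bind the records row by row
rebuilt <- do.call(rbind, lapply(as_list, as.data.frame))
rebuilt
```

Calling read_json(tmp, simplifyVector = TRUE) should give the same data frame as fromJSON, which is why fromJSON is the more convenient choice here.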

Let’s look at the HTML file first. Printing books_html shows that something was read in, but it is difficult to see exactly what. The file is parsed into an HTML document object (a tree of nodes) rather than a data frame. To get something more usable, we select the “nodes” that hold the data we need: the table cells for the values and the header cells for the variable names.

# th is the node for the table header cells
header_text <- books_html |> html_nodes("th") |> html_text()
# td is the node for the individual table cells
row_text <- books_html |> html_nodes("td") |> html_text()
# Number of rows in the data frame: total cells divided by number of columns
num_rows <- length(row_text) / length(header_text)
# Fill a matrix row by row with the cell values, then convert it to a data frame
books_html_df <- data.frame(matrix(row_text, nrow = num_rows, ncol = length(header_text), byrow = TRUE))
# Put the column variable names back on
colnames(books_html_df) <- header_text
head(books_html_df, 6)

##                                                              Title
## 1 Descision in Philadelphia: The Constitutional Convention of 1787
## 2                                            The Mismeasure of Man
## 3                   A Christmas Carol and Other Christmas Writings
## 4              The Politically Incorrect Guide to American History
## 5                Applied Spatial Statistics for Public Health Data
## 6                              The Southwest: New American Cooking
##                                Authors              Topic              ISBN
## 1 Collier, Chrsitopher; Collier, James   American History     0-345-34652-1
## 2                   Gould, Stephen Jay        Non-fiction 978-0-393-31425-0
## 3                     Dickens, Charles            Fiction 978-0-14-043905-2
## 4                        Woods, Thomas             Parody  978-089526-047-5
## 5         Waller, Lance; Gotway, Carol Spatial Statistics     0-471-38771-1
## 6                          Long, Kathi            Cooking     0-7370-2047-4
##   Number of Pages      Type
## 1             434 Paperback
## 2             432 Paperback
## 3             289 Paperback
## 4             270 Paperback
## 5             484 Hardcover
## 6             144 Hardcover
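As an alternative to rebuilding the table cell by cell, rvest also provides html_table(), which parses a table node directly. The sketch below uses a hypothetical two-row table instead of Books.html, and it assumes a simple table with one header row and no merged cells.

```r
library(rvest)

# Hypothetical miniature table standing in for Books.html
sample_html <- read_html('
<table>
  <thead><tr><th>Title</th><th>Number of Pages</th></tr></thead>
  <tbody>
    <tr><td>Book A</td><td>100</td></tr>
    <tr><td>Book B</td><td>200</td></tr>
  </tbody>
</table>')

# html_table() uses the <th> cells as column names and converts column types
books_tbl <- sample_html |> html_element("table") |> html_table()
books_tbl
```

Unlike the matrix approach, which leaves every column as character, html_table() converts numeric-looking columns such as the page count to numbers.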

Now let’s look at the XML file. Once again, the data is stored as a parsed document tree from which we have to extract the values before they are useful. This structure took more work than the others to make manageable, because each book holds its authors in nested <author> elements.

# Extract specific XML elements (e.g., <book> elements)
books <- xml_find_all(books_xml, "//book")

# Create empty lists to store data
title <- list()
author <- list()
topic <- list()
ISBN <- list()
num_pages <- list()
type <- list()

# Loop through each book element
for (i in seq_along(books)) {
  # Extract data from XML elements
  title[i] <- xml_text(xml_find_first(books[i], ".//title"))
  authors <- xml_find_all(books[i], ".//author")
  author[i] <- paste(xml_text(authors), collapse = "; ") # Concatenate multiple authors into a single string
  topic[i] <- xml_text(xml_find_first(books[i], ".//topic"))
  ISBN[i] <- xml_text(xml_find_first(books[i], ".//ISBN"))
  num_pages[i] <- xml_text(xml_find_first(books[i], ".//num_pages"))
  type[i] <- xml_text(xml_find_first(books[i], ".//type"))
}

# Create a data frame
books_xml_df <- data.frame(
  title = unlist(title),
  author = unlist(author),
  topic = unlist(topic),
  ISBN = unlist(ISBN),
  num_pages = unlist(num_pages),
  type = unlist(type)
)
head(books_xml_df)

##                                                              title
## 1 Descision in Philadelphia: The Constitutional Convention of 1787
## 2                                            The Mismeasure of Man
## 3                   A Christmas Carol and Other Christmas Writings
## 4              The Politically Incorrect Guide to American History
## 5                Applied Spatial Statistics for Public Health Data
## 6                              The Southwest: New American Cooking
##                                      author              topic
## 1 CollierChristopher;  CollierJames Lincoln   American History
## 2                          GouldStephen Jay        Non-fiction
## 3                            DickensCharles            Fiction
## 4                          Woods JrThomas E             Parody
## 5           WallerLance A.;  GotwayCarol A. Spatial Statistics
## 6                                 LongKathi            Cooking
##                ISBN num_pages      type
## 1     0-345-34652-1       434 paperback
## 2 978-0-393-31425-0       432 paperback
## 3 978-0-14-043905-2       289 paperback
## 4  978-089526-047-5       270 paperback
## 5     0-471-38771-1       494 hardcover
## 6     0-7370-2047-4       144 hardcover
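In the output above, the parts of each author name run together (for example "CollierChristopher"), because xml_text() concatenates the text of every child element inside <author>. One possible fix is to join the child elements with a separator. The sketch below assumes each <author> holds separate child elements for the name parts; the tag names <last_name> and <first_name> and the sample records are hypothetical, not taken from books.xml.

```r
library(xml2)

# Hypothetical sample; the child tag names inside <author> are assumptions
sample_xml <- read_xml('
<books>
  <book>
    <title>Book A</title>
    <authors>
      <author><last_name>Doe</last_name><first_name>Jane</first_name></author>
      <author><last_name>Roe</last_name><first_name>Richard</first_name></author>
    </authors>
  </book>
  <book>
    <title>Book B</title>
    <authors>
      <author><last_name>Poe</last_name><first_name>Edgar</first_name></author>
    </authors>
  </book>
</books>')

books <- xml_find_all(sample_xml, "//book")

# Join the child elements of one <author> with ", " instead of letting
# xml_text() run them together
format_author <- function(author_node) {
  paste(xml_text(xml_children(author_node)), collapse = ", ")
}

# Build one "Last, First; Last, First" string per book
author <- vapply(books, function(book) {
  authors <- xml_find_all(book, ".//author")
  paste(vapply(authors, format_author, character(1)), collapse = "; ")
}, character(1))

# xml_find_first() is vectorized over the node set, so no loop is needed here
books_df <- data.frame(
  title = xml_text(xml_find_first(books, ".//title")),
  author = author
)
books_df
```

Because format_author() joins whatever child elements it finds, it does not depend on the exact tag names, only on the name parts being stored as separate children.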

Conclusion

This was a challenging activity since I have very little experience working with these file types or platforms. The data frame structures are not perfect (for example, the author names from the XML file run together without a separator), but they are functional at this point for further analysis and transformation.