Objective

The goal of this assignment is to identify three books and for each book capture fields including, title and author. I have also chosen to capture the book genre and type. One of the books is required to have more than one author. I decided to store all authors in one variable (authors) so that a book with multiple authors had one “authors” column where the names of the multiple authors were separated by commas within that column. If necessary, it would be easy to separate the authors by splitting the string by comma.

Once this information is gathered, it must be saved in three different files types: HTML, XML, and JSON.

The goal is to then read these file types into R and save the data into a data frame.

I have saved these files into my Github repo to allow for code replication.

Importing the HTML file

Below I import the HTML file. I start with loading the required packages. Then, I save the file URL (from Github) and read in the data.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(XML)
library(RCurl)
## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:tidyr':
## 
##     complete
url <- getURL("https://raw.githubusercontent.com/stoybis/DATA607Repo/main/bookData.html")

bookDataHTML <-readHTMLTable(url)

The data came across as a list of one that has three observations of four variables. Below, I convert the list into a data frame.

bookDataHTMLDF <- as.data.frame(bookDataHTML)

head(bookDataHTMLDF)
##                                                                NULL.Title
## 1                                                  The Catcher in the Rye
## 2                                                 Thinking, Fast and Slow
## 3 Trading Fixed Income and FX in Emerging Markets: A Practitioner's Guide
##                                  NULL.Authors  NULL.Genre     NULL.Type
## 1                               J.D. Salinger     Fiction Coming of Age
## 2                             Daniel Kahneman Non-fiction    Psychology
## 3 Dirk Willer, Ram Bala Chandran, Kenneth Lam Non-fiction     Investing

I remove “NULL” from the column names.

colnames(bookDataHTMLDF) <- sub("NULL.","",colnames(bookDataHTMLDF))

head(bookDataHTMLDF)
##                                                                     Title
## 1                                                  The Catcher in the Rye
## 2                                                 Thinking, Fast and Slow
## 3 Trading Fixed Income and FX in Emerging Markets: A Practitioner's Guide
##                                       Authors       Genre          Type
## 1                               J.D. Salinger     Fiction Coming of Age
## 2                             Daniel Kahneman Non-fiction    Psychology
## 3 Dirk Willer, Ram Bala Chandran, Kenneth Lam Non-fiction     Investing

Importing the XML file

Below I import the XML file. I start with loading the required packages.

The XML content is stored via the read_xml function. This captures the encoding.

The xmlParse function then parses the XML file to create an R structure that represents the XML tree.

getNodeSet is then used to capture the nodes in the file. In my file, each book was entered as a “book” node, with that note containing the four observations about the book. These are the nodes I am interested in getting into a data frame.

Last, I use xmlToDataFrame based on the “book” nodes to convert the data into a data frame.

library(xml2)

bookDataXML <- read_xml('https://raw.githubusercontent.com/stoybis/DATA607Repo/main/bookData.xml')

bookDataXMLParsed <-xmlParse(bookDataXML)

bookNodes <- getNodeSet(bookDataXMLParsed, '//book')

bookDataXMLDF <-xmlToDataFrame(nodes = bookNodes)

head(bookDataXMLDF)
##                                                                     title
## 1                                                  The Catcher in the Rye
## 2                                                 Thinking, Fast and Slow
## 3 Trading Fixed Income and FX in Emerging Markets: A Practitioner's Guide
##                                       authors       genre          type
## 1                               J.D. Salinger     Fiction Coming of Age
## 2                             Daniel Kahneman Non-fiction    Psychology
## 3 Dirk Willer, Ram Bala Chandran, Kenneth Lam Non-fiction     Investing

Importing the JSON file

Below I import the JSON file. I start with loading the required packages.

library(rjson)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
bookDataJSON <- fromJSON(file = 'https://raw.githubusercontent.com/stoybis/DATA607Repo/main/bookData.json')

bookDataJSONDF <- as.data.frame(bookDataJSON)

head(bookDataJSONDF)
##          Book.Data.Title Book.Data.Authors Book.Data.Genre Book.Data.Type
## 1 The Catcher in the Rye     J.D. Salinger         Fiction  Coming of Age
##         Book.Data.Title.1 Book.Data.Authors.1 Book.Data.Genre.1
## 1 Thinking, Fast and Slow     Daniel Kahneman       Non-fiction
##   Book.Data.Type.1
## 1       Psychology
##                                                         Book.Data.Title.2
## 1 Trading Fixed Income and FX in Emerging Markets: A Practitioner's Guide
##                           Book.Data.Authors.2 Book.Data.Genre.2
## 1 Dirk Willer, Ram Bala Chandran, Kenneth Lam       Non-fiction
##   Book.Data.Type.2
## 1        Investing

The data came across as one row with 12 observations. This data needs to be tidies so that each row is an observation (a book), that has four features (title, authors, genre, and type).

First, I pivot the data to a long format.

bookDataJSONDF <- pivot_longer(bookDataJSONDF, cols = everything())

head(bookDataJSONDF, n = 10)
## # A tibble: 10 × 2
##    name                value                                                    
##    <chr>               <chr>                                                    
##  1 Book.Data.Title     The Catcher in the Rye                                   
##  2 Book.Data.Authors   J.D. Salinger                                            
##  3 Book.Data.Genre     Fiction                                                  
##  4 Book.Data.Type      Coming of Age                                            
##  5 Book.Data.Title.1   Thinking, Fast and Slow                                  
##  6 Book.Data.Authors.1 Daniel Kahneman                                          
##  7 Book.Data.Genre.1   Non-fiction                                              
##  8 Book.Data.Type.1    Psychology                                               
##  9 Book.Data.Title.2   Trading Fixed Income and FX in Emerging Markets: A Pract…
## 10 Book.Data.Authors.2 Dirk Willer, Ram Bala Chandran, Kenneth Lam

Next, I add a column that has the number from each observation in “name”. This will allow me to group the appropriate fields together by book. Since the first book doesn’t have a number, I replace the NAs with 0.

bookDataJSONDF$bookID <- str_match(bookDataJSONDF$name, '\\d')

bookDataJSONDF$bookID <- as.numeric(bookDataJSONDF$bookID)

bookDataJSONDF$bookID <- replace(bookDataJSONDF$bookID, is.na(bookDataJSONDF$bookID), 0)


head(bookDataJSONDF, n=10)
## # A tibble: 10 × 3
##    name                value                                              bookID
##    <chr>               <chr>                                               <dbl>
##  1 Book.Data.Title     The Catcher in the Rye                                  0
##  2 Book.Data.Authors   J.D. Salinger                                           0
##  3 Book.Data.Genre     Fiction                                                 0
##  4 Book.Data.Type      Coming of Age                                           0
##  5 Book.Data.Title.1   Thinking, Fast and Slow                                 1
##  6 Book.Data.Authors.1 Daniel Kahneman                                         1
##  7 Book.Data.Genre.1   Non-fiction                                             1
##  8 Book.Data.Type.1    Psychology                                              1
##  9 Book.Data.Title.2   Trading Fixed Income and FX in Emerging Markets: …      2
## 10 Book.Data.Authors.2 Dirk Willer, Ram Bala Chandran, Kenneth Lam             2

Next, I remove the numbers from the name column so that I can create common columns in the wider format. I also replace periods to make the observations in “name” common.

bookDataJSONDF$name <- str_replace_all(bookDataJSONDF$name, '\\d','')
bookDataJSONDF$name <-str_replace_all(bookDataJSONDF$name, "\\.", "")

head(bookDataJSONDF, n=10)
## # A tibble: 10 × 3
##    name            value                                                  bookID
##    <chr>           <chr>                                                   <dbl>
##  1 BookDataTitle   The Catcher in the Rye                                      0
##  2 BookDataAuthors J.D. Salinger                                               0
##  3 BookDataGenre   Fiction                                                     0
##  4 BookDataType    Coming of Age                                               0
##  5 BookDataTitle   Thinking, Fast and Slow                                     1
##  6 BookDataAuthors Daniel Kahneman                                             1
##  7 BookDataGenre   Non-fiction                                                 1
##  8 BookDataType    Psychology                                                  1
##  9 BookDataTitle   Trading Fixed Income and FX in Emerging Markets: A Pr…      2
## 10 BookDataAuthors Dirk Willer, Ram Bala Chandran, Kenneth Lam                 2

Last, I pivot the data to a wider format and drop the ID column. I also drop “BookData” from the column names.

bookDataJSONDF <- bookDataJSONDF |> pivot_wider(id_cols = bookID, names_from = name, values_from = value)


bookDataJSONDF <- select(bookDataJSONDF, -bookID)

colnames(bookDataJSONDF) <- str_replace_all(colnames(bookDataJSONDF), "BookData", "")

head(bookDataJSONDF)
## # A tibble: 3 × 4
##   Title                                                      Authors Genre Type 
##   <chr>                                                      <chr>   <chr> <chr>
## 1 The Catcher in the Rye                                     J.D. S… Fict… Comi…
## 2 Thinking, Fast and Slow                                    Daniel… Non-… Psyc…
## 3 Trading Fixed Income and FX in Emerging Markets: A Practi… Dirk W… Non-… Inve…

Conclusion

I was ultimately able to get the data frames to be essentially the same (one of the data frames has column names starting in lower case, but for our purposes this doesn’t make a difference).

The HTML file was easiest to import into a data frame, as it came across as a list that was easy to convert to a data frame. The XML file was similar, however there were more steps to decipher the nodes in order to convert the file into a data frame. The JSON file took the most editing and required tidying the data. It is possible that I missed some fields in creating the JSON file that would have made it more easily convertible to an R data frame.