The goal of this assignment is to identify three books and for each book capture fields including, title and author. I have also chosen to capture the book genre and type. One of the books is required to have more than one author. I decided to store all authors in one variable (authors) so that a book with multiple authors had one “authors” column where the names of the multiple authors were separated by commas within that column. If necessary, it would be easy to separate the authors by splitting the string by comma.
Once this information is gathered, it must be saved in three different files types: HTML, XML, and JSON.
The goal is to then read these file types into R and save the data into a data frame.
I have saved these files into my Github repo to allow for code replication.
Below I import the HTML file. I start with loading the required packages. Then, I save the file URL (from Github) and read in the data.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(XML)
library(RCurl)
##
## Attaching package: 'RCurl'
##
## The following object is masked from 'package:tidyr':
##
## complete
url <- getURL("https://raw.githubusercontent.com/stoybis/DATA607Repo/main/bookData.html")
bookDataHTML <-readHTMLTable(url)
The data came across as a list of one that has three observations of four variables. Below, I convert the list into a data frame.
bookDataHTMLDF <- as.data.frame(bookDataHTML)
head(bookDataHTMLDF)
## NULL.Title
## 1 The Catcher in the Rye
## 2 Thinking, Fast and Slow
## 3 Trading Fixed Income and FX in Emerging Markets: A Practitioner's Guide
## NULL.Authors NULL.Genre NULL.Type
## 1 J.D. Salinger Fiction Coming of Age
## 2 Daniel Kahneman Non-fiction Psychology
## 3 Dirk Willer, Ram Bala Chandran, Kenneth Lam Non-fiction Investing
I remove “NULL” from the column names.
colnames(bookDataHTMLDF) <- sub("NULL.","",colnames(bookDataHTMLDF))
head(bookDataHTMLDF)
## Title
## 1 The Catcher in the Rye
## 2 Thinking, Fast and Slow
## 3 Trading Fixed Income and FX in Emerging Markets: A Practitioner's Guide
## Authors Genre Type
## 1 J.D. Salinger Fiction Coming of Age
## 2 Daniel Kahneman Non-fiction Psychology
## 3 Dirk Willer, Ram Bala Chandran, Kenneth Lam Non-fiction Investing
Below I import the XML file. I start with loading the required packages.
The XML content is stored via the read_xml function. This captures the encoding.
The xmlParse function then parses the XML file to create an R structure that represents the XML tree.
getNodeSet is then used to capture the nodes in the file. In my file, each book was entered as a “book” node, with that note containing the four observations about the book. These are the nodes I am interested in getting into a data frame.
Last, I use xmlToDataFrame based on the “book” nodes to convert the data into a data frame.
library(xml2)
bookDataXML <- read_xml('https://raw.githubusercontent.com/stoybis/DATA607Repo/main/bookData.xml')
bookDataXMLParsed <-xmlParse(bookDataXML)
bookNodes <- getNodeSet(bookDataXMLParsed, '//book')
bookDataXMLDF <-xmlToDataFrame(nodes = bookNodes)
head(bookDataXMLDF)
## title
## 1 The Catcher in the Rye
## 2 Thinking, Fast and Slow
## 3 Trading Fixed Income and FX in Emerging Markets: A Practitioner's Guide
## authors genre type
## 1 J.D. Salinger Fiction Coming of Age
## 2 Daniel Kahneman Non-fiction Psychology
## 3 Dirk Willer, Ram Bala Chandran, Kenneth Lam Non-fiction Investing
Below I import the JSON file. I start with loading the required packages.
library(rjson)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
bookDataJSON <- fromJSON(file = 'https://raw.githubusercontent.com/stoybis/DATA607Repo/main/bookData.json')
bookDataJSONDF <- as.data.frame(bookDataJSON)
head(bookDataJSONDF)
## Book.Data.Title Book.Data.Authors Book.Data.Genre Book.Data.Type
## 1 The Catcher in the Rye J.D. Salinger Fiction Coming of Age
## Book.Data.Title.1 Book.Data.Authors.1 Book.Data.Genre.1
## 1 Thinking, Fast and Slow Daniel Kahneman Non-fiction
## Book.Data.Type.1
## 1 Psychology
## Book.Data.Title.2
## 1 Trading Fixed Income and FX in Emerging Markets: A Practitioner's Guide
## Book.Data.Authors.2 Book.Data.Genre.2
## 1 Dirk Willer, Ram Bala Chandran, Kenneth Lam Non-fiction
## Book.Data.Type.2
## 1 Investing
The data came across as one row with 12 observations. This data needs to be tidies so that each row is an observation (a book), that has four features (title, authors, genre, and type).
First, I pivot the data to a long format.
bookDataJSONDF <- pivot_longer(bookDataJSONDF, cols = everything())
head(bookDataJSONDF, n = 10)
## # A tibble: 10 × 2
## name value
## <chr> <chr>
## 1 Book.Data.Title The Catcher in the Rye
## 2 Book.Data.Authors J.D. Salinger
## 3 Book.Data.Genre Fiction
## 4 Book.Data.Type Coming of Age
## 5 Book.Data.Title.1 Thinking, Fast and Slow
## 6 Book.Data.Authors.1 Daniel Kahneman
## 7 Book.Data.Genre.1 Non-fiction
## 8 Book.Data.Type.1 Psychology
## 9 Book.Data.Title.2 Trading Fixed Income and FX in Emerging Markets: A Pract…
## 10 Book.Data.Authors.2 Dirk Willer, Ram Bala Chandran, Kenneth Lam
Next, I add a column that has the number from each observation in “name”. This will allow me to group the appropriate fields together by book. Since the first book doesn’t have a number, I replace the NAs with 0.
bookDataJSONDF$bookID <- str_match(bookDataJSONDF$name, '\\d')
bookDataJSONDF$bookID <- as.numeric(bookDataJSONDF$bookID)
bookDataJSONDF$bookID <- replace(bookDataJSONDF$bookID, is.na(bookDataJSONDF$bookID), 0)
head(bookDataJSONDF, n=10)
## # A tibble: 10 × 3
## name value bookID
## <chr> <chr> <dbl>
## 1 Book.Data.Title The Catcher in the Rye 0
## 2 Book.Data.Authors J.D. Salinger 0
## 3 Book.Data.Genre Fiction 0
## 4 Book.Data.Type Coming of Age 0
## 5 Book.Data.Title.1 Thinking, Fast and Slow 1
## 6 Book.Data.Authors.1 Daniel Kahneman 1
## 7 Book.Data.Genre.1 Non-fiction 1
## 8 Book.Data.Type.1 Psychology 1
## 9 Book.Data.Title.2 Trading Fixed Income and FX in Emerging Markets: … 2
## 10 Book.Data.Authors.2 Dirk Willer, Ram Bala Chandran, Kenneth Lam 2
Next, I remove the numbers from the name column so that I can create common columns in the wider format. I also replace periods to make the observations in “name” common.
bookDataJSONDF$name <- str_replace_all(bookDataJSONDF$name, '\\d','')
bookDataJSONDF$name <-str_replace_all(bookDataJSONDF$name, "\\.", "")
head(bookDataJSONDF, n=10)
## # A tibble: 10 × 3
## name value bookID
## <chr> <chr> <dbl>
## 1 BookDataTitle The Catcher in the Rye 0
## 2 BookDataAuthors J.D. Salinger 0
## 3 BookDataGenre Fiction 0
## 4 BookDataType Coming of Age 0
## 5 BookDataTitle Thinking, Fast and Slow 1
## 6 BookDataAuthors Daniel Kahneman 1
## 7 BookDataGenre Non-fiction 1
## 8 BookDataType Psychology 1
## 9 BookDataTitle Trading Fixed Income and FX in Emerging Markets: A Pr… 2
## 10 BookDataAuthors Dirk Willer, Ram Bala Chandran, Kenneth Lam 2
Last, I pivot the data to a wider format and drop the ID column. I also drop “BookData” from the column names.
bookDataJSONDF <- bookDataJSONDF |> pivot_wider(id_cols = bookID, names_from = name, values_from = value)
bookDataJSONDF <- select(bookDataJSONDF, -bookID)
colnames(bookDataJSONDF) <- str_replace_all(colnames(bookDataJSONDF), "BookData", "")
head(bookDataJSONDF)
## # A tibble: 3 × 4
## Title Authors Genre Type
## <chr> <chr> <chr> <chr>
## 1 The Catcher in the Rye J.D. S… Fict… Comi…
## 2 Thinking, Fast and Slow Daniel… Non-… Psyc…
## 3 Trading Fixed Income and FX in Emerging Markets: A Practi… Dirk W… Non-… Inve…
I was ultimately able to get the data frames to be essentially the same (one of the data frames has column names starting in lower case, but for our purposes this doesn’t make a difference).
The HTML file was easiest to import into a data frame, as it came across as a list that was easy to convert to a data frame. The XML file was similar, however there were more steps to decipher the nodes in order to convert the file into a data frame. The JSON file took the most editing and required tidying the data. It is possible that I missed some fields in creating the JSON file that would have made it more easily convertible to an R data frame.