Andrew Bowen Data 607 HW 5

TODO: - [x] Ensure file paths are github urls - [x] HTML and XML read-in - Publish to Rpubs

Using a few libraries to parse the various data formats

library(rvest)
library(XML)
library(xml2)
library(rjson)
library(dplyr)
library(tidyr)

First, let’s lay out where we’re

json_url <- "https://raw.githubusercontent.com/andrewbowen19/cunyDATA607/main/data/books.json"
html_url <- "https://raw.githubusercontent.com/andrewbowen19/cunyDATA607/main/data/books.html"
xml_url <- "https://raw.githubusercontent.com/andrewbowen19/cunyDATA607/main/data/books.xml"

JSON read-in

json_data <- fromJSON(file=json_url, simplify=TRUE)

json_df <- as.data.frame(json_data)

json_df

##                                title           author       ISBN rating
## 1 Funny Weather: Art in an Emergency     Olivia Laing 132400570X      5
## 2        Weapons of Math Destruction     Cathy O'Neil 0553418815      4
## 3         Capitalism Without Capital Jonathen Haskell 0691175039      5
##            genre
## 1    Non-fiction
## 2    Mathematics
## 3 Social Science

This table looks pretty neatly formatted. Since we want one “item”/“observation” to constitute a row in our books table, it makes sense that

XML Read-in

We put the same data into our XML file, but in a different format. We still want each book to constitute a row in our dataframe

xml_data <- xml2::read_xml(xml_url, as_html=FALSE)

# Parse XML and convert into R dataframe
book_xml <- XML::xmlParse(xml_data)
xml_df <- XML::xmlToDataFrame(book_xml)

xml_df

##                                title           author       ISBN rating
## 1 Funny Weather: Art in an Emergency     Olivia Laing 132400570X      5
## 2        Weapons of Math Destruction     Cathy O'Neil 0553418815      4
## 3         Capitalism Without Capital Jonathan Haskell 0691175039      5
##            genre
## 1    Non-fiction
## 2    Mathematics
## 3 Social Science

Let’s convert our rating column to a double, to be consistent with our json_df

xml_df$rating <- as.double(xml_df$rating)

HTML read-in

Going to use the XML library from above and its readHTMLTable method in order to grab data from out html table.

html_df <- (rvest::read_html(html_url) %>% html_table)[[1]]

# Read in using XML::readHTMLTable method
# html <- readHTMLTable(html_url), as.data.frame=TRUE) # "../data/books.html
# html_df <- as.data.frame(html)

html_df

## # A tibble: 3 × 5
##   title                              author           ISBN       rating genre   
##   <chr>                              <chr>            <chr>       <int> <chr>   
## 1 Funny Weather: Art in an Emergency Olivia Laing     132400570X      5 Non-fic…
## 2 Weapons of Math Destruction        Cathy O'Neil     0553418815      4 Mathema…
## 3 Capitalism without Capital         Jonathan Haskell 0691175039      5 Social …

This dataframe is close to the ones we generated vai XML and JSON, with the rating type as a character rather than an double, and the column names prefixed by NULL.. Let’s

html_df <- (rvest::read_html(html_url) %>% html_table)[[1]]

html_df$rating <- as.double(html_df$rating)

html_df

## # A tibble: 3 × 5
##   title                              author           ISBN       rating genre   
##   <chr>                              <chr>            <chr>       <dbl> <chr>   
## 1 Funny Weather: Art in an Emergency Olivia Laing     132400570X      5 Non-fic…
## 2 Weapons of Math Destruction        Cathy O'Neil     0553418815      4 Mathema…
## 3 Capitalism without Capital         Jonathan Haskell 0691175039      5 Social …

This looks consistent with our above XML and JSON dataframes!

Andrew Bowen Data 607 HW 5

Andrew Bowen

2022-10-13

JSON read-in

XML Read-in

HTML read-in