TODO: - [x] Ensure file paths are github urls - [x] HTML and XML read-in - Publish to Rpubs
We’ll use a few libraries to parse the various data formats:
library(rvest)
library(XML)
library(xml2)
library(rjson)
library(dplyr)
library(tidyr)
First, let’s lay out where we’re reading our data from:
json_url <- "https://raw.githubusercontent.com/andrewbowen19/cunyDATA607/main/data/books.json"
html_url <- "https://raw.githubusercontent.com/andrewbowen19/cunyDATA607/main/data/books.html"
xml_url <- "https://raw.githubusercontent.com/andrewbowen19/cunyDATA607/main/data/books.xml"
json_data <- fromJSON(file=json_url, simplify=TRUE)
json_df <- as.data.frame(json_data)
json_df
## title author ISBN rating
## 1 Funny Weather: Art in an Emergency Olivia Laing 132400570X 5
## 2 Weapons of Math Destruction Cathy O'Neil 0553418815 4
## 3 Capitalism Without Capital Jonathen Haskell 0691175039 5
## genre
## 1 Non-fiction
## 2 Mathematics
## 3 Social Science
This table looks pretty neatly formatted: each “item”/“observation” (a book) constitutes one row in our books table, which is what we want.
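To see why as.data.frame gives one row per book, here is a minimal sketch on a small inline JSON string. The layout (one array per field) is an assumption about how books.json is organized, and the titles and ratings are made-up placeholders rather than our real data.

```r
library(rjson)

# Hypothetical, column-oriented JSON: one array per field (assumed layout)
json_str <- '{"title": ["Book A", "Book B"], "rating": [5, 4]}'

# fromJSON returns a named list; with simplify=TRUE each array becomes a vector
parsed <- fromJSON(json_str = json_str, simplify = TRUE)

# Each list element becomes a column, so each book ends up as one row
example_json_df <- as.data.frame(parsed)
str(example_json_df)
```

If the file were instead an array of per-book objects, the list would need to be reshaped (for example, bound row by row) before it looked like this table.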
We put the same data into our XML file, but in a different format. We still want each book to constitute a row in our dataframe.
xml_data <- xml2::read_xml(xml_url, as_html=FALSE)
# Parse XML and convert into R dataframe
book_xml <- XML::xmlParse(xml_data)
xml_df <- XML::xmlToDataFrame(book_xml)
xml_df
## title author ISBN rating
## 1 Funny Weather: Art in an Emergency Olivia Laing 132400570X 5
## 2 Weapons of Math Destruction Cathy O'Neil 0553418815 4
## 3 Capitalism Without Capital Jonathan Haskell 0691175039 5
## genre
## 1 Non-fiction
## 2 Mathematics
## 3 Social Science
Let’s convert our rating column to a double, to be consistent with our json_df:
xml_df$rating <- as.double(xml_df$rating)
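The conversion is needed because xmlToDataFrame reads every field as text. A minimal sketch on an inline XML string shows this; the `<books>`/`<book>` element names and the values are hypothetical stand-ins, not necessarily the structure of our books.xml.

```r
library(XML)

# Hypothetical XML with one <book> node per observation (assumed structure)
xml_str <- paste0(
  "<books>",
  "<book><title>Book A</title><rating>5</rating></book>",
  "<book><title>Book B</title><rating>4</rating></book>",
  "</books>"
)

# asText = TRUE tells xmlParse the argument is XML content, not a file name
doc <- xmlParse(xml_str, asText = TRUE)
example_xml_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Every column comes back as text, so numeric fields need an explicit cast
rating_class_before <- class(example_xml_df$rating)
example_xml_df$rating <- as.double(example_xml_df$rating)
str(example_xml_df)
```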
We’ll use rvest’s read_html and html_table functions to grab the data from our HTML table. An earlier attempt with the XML library’s readHTMLTable method is left commented out below.
html_df <- (rvest::read_html(html_url) %>% html_table)[[1]]
# Alternative (not run): read in using the XML::readHTMLTable method
# html <- readHTMLTable(html_url, as.data.frame=TRUE) # "../data/books.html"
# html_df <- as.data.frame(html)
html_df
## # A tibble: 3 × 5
## title author ISBN rating genre
## <chr> <chr> <chr> <int> <chr>
## 1 Funny Weather: Art in an Emergency Olivia Laing 132400570X 5 Non-fic…
## 2 Weapons of Math Destruction Cathy O'Neil 0553418815 4 Mathema…
## 3 Capitalism without Capital Jonathan Haskell 0691175039 5 Social …
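The `[[1]]` is there because html_table, when called on a whole page, returns a list with one table per `<table>` element. Here is a minimal sketch on an inline HTML string; the markup and values are hypothetical placeholders, not the contents of our books.html.

```r
library(rvest)

# Hypothetical HTML page containing a single table
html_str <- paste0(
  "<table>",
  "<tr><th>title</th><th>rating</th></tr>",
  "<tr><td>Book A</td><td>5</td></tr>",
  "<tr><td>Book B</td><td>4</td></tr>",
  "</table>"
)

page <- read_html(html_str)

# One list element per <table> found in the page
tables <- html_table(page)
example_html_df <- tables[[1]]

# Column types are guessed from the cell text, so cast rating explicitly
example_html_df$rating <- as.double(example_html_df$rating)
str(example_html_df)
```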
This dataframe is close to the ones we generated via XML and JSON. The column names already match, but the rating column was read in as an integer rather than a double. Let’s convert it so the types line up:
html_df <- (rvest::read_html(html_url) %>% html_table)[[1]]
html_df$rating <- as.double(html_df$rating)
html_df
## # A tibble: 3 × 5
## title author ISBN rating genre
## <chr> <chr> <chr> <dbl> <chr>
## 1 Funny Weather: Art in an Emergency Olivia Laing 132400570X 5 Non-fic…
## 2 Weapons of Math Destruction Cathy O'Neil 0553418815 4 Mathema…
## 3 Capitalism without Capital Jonathan Haskell 0691175039 5 Social …
This looks consistent with our above XML and JSON dataframes!
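Rather than comparing the printouts by eye, we could check consistency in code. The sketch below uses small made-up data frames in place of json_df, xml_df and html_df, and the helpers same_structure and same_values are hypothetical names introduced only for this illustration.

```r
# Hypothetical stand-ins for the three data frames (placeholder values)
json_like <- data.frame(title = c("Book A", "Book B"), rating = c(5, 4),
                        stringsAsFactors = FALSE)
xml_like  <- data.frame(title = c("Book A", "Book B"), rating = c("5", "4"),
                        stringsAsFactors = FALSE)
xml_like$rating <- as.double(xml_like$rating)
html_like <- data.frame(title = c("Book A", "Book b"), rating = c(5L, 4L),
                        stringsAsFactors = FALSE)
html_like$rating <- as.double(html_like$rating)

# TRUE when column names and column classes both match
same_structure <- function(a, b) {
  identical(names(a), names(b)) &&
    identical(sapply(a, class), sapply(b, class))
}

# TRUE only when the cell values match as well
same_values <- function(a, b) {
  isTRUE(all.equal(as.data.frame(a), as.data.frame(b),
                   check.attributes = FALSE))
}

structure_ok <- same_structure(json_like, xml_like) &&
  same_structure(json_like, html_like)
values_json_xml  <- same_values(json_like, xml_like)
values_json_html <- same_values(json_like, html_like)
```

On our real data the structure check is the relevant one: the printed outputs above show small spelling differences between the source files (“Jonathen” vs “Jonathan”, “Without” vs “without”), so a strict value comparison would flag those.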