It was fairly easy to create the book formats using a text editor tool in each of the 3 structures requested which are typical of web data. It is interesting to see the default package behavior for loading nested data (authors) and a few different syntax formats within R
xml_fp <- 'https://raw.githubusercontent.com/jforster19/DATA607/main/books.xml'
xmldata_pull <- RCurl::getURL(xml_fp)
xml_input <- XML::xmlParse(xmldata_pull)
xml_df <- XML::xmlToDataFrame(xmldata_pull)
head(xml_df)
## title authors pages
## 1 The Hot Hand Ben Cohen 304
## 2 Basketball: A Love Story Rafe BartholomewDan KloresJackie MacMullan 448
## 3 The Book of Basketball Bill Simmons 752
## release_year price
## 1 2019 15.15
## 2 2018 18.00
## 3 2009 15.99
As the output demonstrates there is a slight difference in the generated dataframe which can easily be processing via string manipulation. All of the author names are concatenated together without any spaces, but perhaps this is driven off the design of my XML input.
html_fp <- 'https://raw.githubusercontent.com/jforster19/DATA607/main/books.html'
html_download <- RCurl::getURL(html_fp)
html_df <- XML::readHTMLTable(html_download,header=TRUE,as.data.frame = TRUE)
head(html_df)
## $`NULL`
## title authors pages release_year price
## 1 The Hot Hand Ben Cohen 304 2019 15.15
## 2 Basketball: A Love Story Rafe Bartholomew 448 2018 18.0
## 3 Basketball: A Love Story Dan Klores 448 2018 18.0
## 4 Basketball: A Love Story Jackie MacMullan 448 2018 18.0
## 5 The Book of Basketball Bill Simmons 752 2009 15.99
With the HTML Table I specifically added the authors as separate rows within the table expecting them to be parsed and if the books need to be uniquely identified by row there could easily be aggregation with a concatenated author field to account for this 1 to many relationship.
json_fp <- 'https://raw.githubusercontent.com/jforster19/DATA607/main/books.json'
json_download <- RCurl::getURL(json_fp)
json_df <- jsonlite::fromJSON(json_download)
json_df2 <- tidyr::unnest(json_df,col=authors)
head(json_df2)
## # A tibble: 3 × 7
## title author1 author2 author3 pages relea…¹ price
## <chr> <chr> <chr> <chr> <int> <chr> <dbl>
## 1 The Hot Hand Ben Cohen <NA> <NA> 304 2019 15.2
## 2 Basketball: A Love Story Rafe Bartholomew Dan Klo… Jackie… 448 2018 18
## 3 The Book of Basketball Bill Simmons <NA> <NA> 752 2009 16.0
## # … with abbreviated variable name ¹release_year
The major differences with the JSON input file is that by default the nested lists are not treated the same way as XML although Tidy packages have some easy ways to parse out that information. When there are different numbers of values in these nested lists it appears it is safer to first parse wider and then pivot the data, but if there are large datasets that can be computationally expensive. JSON files make it easier with imputed data types of columns, only format where it recognized numeric type of price field, and that can be much easier when you have more complex JSON structure.