Assignment – Working with XML and JSON in R

library(abind)
library(gtable)
library(markdown)
library(prettyunits)
library(promises)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(RCurl)
library(tidyverse)

## -- Attaching packages ---------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v stringr 1.4.0
## v tidyr   1.1.2     v forcats 0.5.0
## v readr   1.4.0

## -- Conflicts ------------------------------------------------------------------- tidyverse_conflicts() --
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()

library(XML)
library(knitr)
library(rjson)
library(plyr)

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following object is masked from 'package:purrr':
## 
##     compact

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

#JSON

library(jsonlite)

## 
## Attaching package: 'jsonlite'

## The following objects are masked from 'package:rjson':
## 
##     fromJSON, toJSON

## The following object is masked from 'package:purrr':
## 
##     flatten

books_json <- fromJSON("https://raw.githubusercontent.com/Darstolk/DATA607_07/main/books_jason")

books_json <- bind_rows(books_json, .id = 'Author')
books_json

## # A tibble: 3 x 4
##      ID Title                               Author             ISBN      
##   <int> <chr>                               <chr>              <chr>     
## 1     1 Data Wrangling with R               Bradley C. Boehmke 0135133106
## 2     2 Learning Web Design                 Jennifer Robbins   3319455982
## 3     3 Programming Skills for Data Science Michael Freeman    1491960205
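
The same kind of table can be reproduced without the hosted file by parsing an inline JSON string. This is a minimal sketch that assumes the data is laid out as an array of records; the actual layout of `books_jason` is not shown in this report, so the layout and the names `books_txt` and `books_df` are illustrative. The values are copied from the tibble printed above.

```r
library(jsonlite)

# Assumed layout: a JSON array holding one object per book.
# ISBNs are quoted so that leading zeros are kept.
books_txt <- '[
  {"ID": 1, "Title": "Data Wrangling with R", "Author": "Bradley C. Boehmke", "ISBN": "0135133106"},
  {"ID": 2, "Title": "Learning Web Design", "Author": "Jennifer Robbins", "ISBN": "3319455982"},
  {"ID": 3, "Title": "Programming Skills for Data Science", "Author": "Michael Freeman", "ISBN": "1491960205"}
]'

# jsonlite simplifies an array of flat objects into a data frame.
books_df <- jsonlite::fromJSON(books_txt)
str(books_df)
```

With this layout no `bind_rows()` step is needed, because `fromJSON()` already returns one row per record.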

#HTML

dasbuch_html <- readHTMLTable(
    getURL("https://raw.githubusercontent.com/Darstolk/DATA607_07/main/dasbuch.html"), header = TRUE, which = 1)

class(dasbuch_html)

## [1] "data.frame"

knitr::kable(dasbuch_html)

|ID |Title                               |Author             |ISBN       |
|:--|:-----------------------------------|:------------------|:----------|
|1  |Data Wrangling with R               |Bradley C. Boehmke |3319455982 |
|2  |Learning Web Design                 |Jennifer Robbins   |1491960205 |
|3  |Programming Skills for Data Science |Michael Freeman    |0135133106 |
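
The HTML step can also be tried offline. The sketch below assumes a plain table whose first row uses `<th>` cells; the markup of the real `dasbuch.html` is not shown in this report, so the string `html_txt` is a stand-in built from the values in the table above.

```r
library(XML)

# Stand-in markup (assumed): one header row of <th> cells, then one <tr> per book.
html_txt <- paste0(
  "<html><body><table>",
  "<tr><th>ID</th><th>Title</th><th>Author</th><th>ISBN</th></tr>",
  "<tr><td>1</td><td>Data Wrangling with R</td><td>Bradley C. Boehmke</td><td>3319455982</td></tr>",
  "<tr><td>2</td><td>Learning Web Design</td><td>Jennifer Robbins</td><td>1491960205</td></tr>",
  "<tr><td>3</td><td>Programming Skills for Data Science</td><td>Michael Freeman</td><td>0135133106</td></tr>",
  "</table></body></html>"
)

# Parse the string as HTML, then pull out the first table.
doc <- htmlParse(html_txt, asText = TRUE)
html_df <- readHTMLTable(doc, header = TRUE, which = 1)

# Store every column as character so ISBNs are kept as text.
html_df[] <- lapply(html_df, as.character)
html_df
```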

#XML

dasbuch_zwei <- ldply(xmlToList(getURL("https://raw.githubusercontent.com/Darstolk/DATA607_07/main/dasbuch_xml.xml")), data.frame) %>%
    select(-.id)

class(dasbuch_zwei)

## [1] "data.frame"

knitr::kable(dasbuch_zwei)

|id |title                               |author             |isbn       |
|:--|:-----------------------------------|:------------------|:----------|
|1  |Data Wrangling with R               |Bradley C. Boehmke |3319455982 |
|2  |Learning Web Design                 |Jennifer Robbins   |1491960205 |
|3  |Programming Skills for Data Science |Michael Freeman    |0135133106 |
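
As an alternative to the `ldply(xmlToList(...))` approach used above, the XML package offers `xmlToDataFrame()`, which avoids the plyr dependency. The sketch below is not the method used in this report; the element names `books` and `book` are assumptions, while the child element names come from the column names printed above.

```r
library(XML)

# Assumed structure: a <books> root with one <book> element per record.
xml_txt <- paste0(
  "<books>",
  "<book><id>1</id><title>Data Wrangling with R</title><author>Bradley C. Boehmke</author><isbn>3319455982</isbn></book>",
  "<book><id>2</id><title>Learning Web Design</title><author>Jennifer Robbins</author><isbn>1491960205</isbn></book>",
  "<book><id>3</id><title>Programming Skills for Data Science</title><author>Michael Freeman</author><isbn>0135133106</isbn></book>",
  "</books>"
)

# Parse the string, then turn each child of the root into one row.
doc <- xmlParse(xml_txt, asText = TRUE)
xml_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
xml_df
```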

#Conclusion

Each file format stores the same data a little differently. It took me a while to learn the differences and to realize that the HTML format is not so different from the XML format: both are tag-based markup in which nested elements wrap each value, whereas JSON uses key-value pairs and arrays. The subject of data is vast, and in my experience it takes years of practice just to find your bearings with even the most basic techniques for analyzing data in a meaningful and useful way, so that the results can later serve as a foundation for more complex things.

JSON is yet another part of this technology stack. I spent a good share of time working out how to build this type of file. My attempts to squeeze in more information beyond title, author, and ISBN did not bear fruit, and I gave up after a prolonged effort. All I can say in the end is that these files are no joke to work with; they call for serious technical knowledge backed by a solid education. I used the books listed here merely as titles for this exercise, although I do read them as reference, and even so their content takes a long time to digest.
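
One possible way to carry more information per book in JSON is a nested object, which `jsonlite::fromJSON()` can spread into ordinary columns with `flatten = TRUE`. The sketch below is only an illustration: the `Details` object and its `Publisher` and `Format` fields are hypothetical, and the value "unknown" is a placeholder, not real data about these books.

```r
library(jsonlite)

# Hypothetical extra fields grouped under a nested "Details" object.
# All "unknown" values are placeholders.
books_nested_txt <- '[
  {"Title": "Data Wrangling with R", "Author": "Bradley C. Boehmke",
   "Details": {"Publisher": "unknown", "Format": "unknown"}},
  {"Title": "Learning Web Design", "Author": "Jennifer Robbins",
   "Details": {"Publisher": "unknown", "Format": "unknown"}},
  {"Title": "Programming Skills for Data Science", "Author": "Michael Freeman",
   "Details": {"Publisher": "unknown", "Format": "unknown"}}
]'

# flatten = TRUE turns the nested object into Details.* columns.
books_nested <- jsonlite::fromJSON(books_nested_txt, flatten = TRUE)
names(books_nested)
```

Without `flatten = TRUE` the `Details` column would itself be a data frame, which is harder to print and compare.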

dasbuch_html == dasbuch_zwei

##        ID Title Author ISBN
## [1,] TRUE  TRUE   TRUE TRUE
## [2,] TRUE  TRUE   TRUE TRUE
## [3,] TRUE  TRUE   TRUE TRUE
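
Note that `==` on two data frames compares cell by cell and does not check column names, so `ID` versus `id` goes unnoticed. The sketch below rebuilds the three tables by hand from the outputs printed in this report (the names `json_df`, `html_df`, and `xml_df` are illustrative) and shows a stricter check. Going by those printed outputs, the JSON table pairs the ISBNs with different titles than the HTML and XML tables do, which a cell-wise comparison makes visible.

```r
titles  <- c("Data Wrangling with R", "Learning Web Design",
             "Programming Skills for Data Science")
authors <- c("Bradley C. Boehmke", "Jennifer Robbins", "Michael Freeman")

# Values copied from the printed outputs in this report.
json_df <- data.frame(ID = 1:3, Title = titles, Author = authors,
                      ISBN = c("0135133106", "3319455982", "1491960205"),
                      stringsAsFactors = FALSE)
html_df <- data.frame(ID = 1:3, Title = titles, Author = authors,
                      ISBN = c("3319455982", "1491960205", "0135133106"),
                      stringsAsFactors = FALSE)
xml_df <- html_df
names(xml_df) <- tolower(names(xml_df))   # the XML columns are lowercase

# Cell-wise equality passes even though the column names differ.
cells_equal <- all(html_df == xml_df)

# identical() also checks names, so it fails until they are aligned.
strict_before <- identical(html_df, xml_df)
names(xml_df) <- names(html_df)
strict_after <- identical(html_df, xml_df)

# Locate the cells where the JSON table disagrees with the HTML table.
mismatch <- which(json_df != html_df, arr.ind = TRUE)
mismatch
```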


Dariusz Siergiejuk

10/10/2020