Assignment – Working with XML and JSON in R

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(RCurl)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v stringr 1.4.0
## v tidyr   1.1.2     v forcats 0.5.0
## v readr   1.3.1
## -- Conflicts --------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
library(XML)
library(knitr)
library(rjson)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:purrr':
## 
##     compact
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

JSON

library(jsonlite)
## 
## Attaching package: 'jsonlite'
## The following objects are masked from 'package:rjson':
## 
##     fromJSON, toJSON
## The following object is masked from 'package:purrr':
## 
##     flatten
books_json <- fromJSON("https://raw.githubusercontent.com/hrensimin05/Data_607/master/books.json")
books_json <- bind_rows(books_json, .id = 'Author')
books_json
## # A tibble: 3 x 4
##      ID Title                                             Author       ISBN     
##   <int> <chr>                                             <chr>        <chr>    
## 1     1 Innumeracy : mathematical illiteracy and its con~ John Paulos  08090584~
## 2     2 The Rosie Project                                 Graeme Sims~ 14767290~
## 3     3 Is Everyone Hanging Out Without Me? (And Other C~ Mindy Kaling 03078862~
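The need for bind_rows() here comes from how the JSON file is nested. A minimal sketch of that situation, assuming a hypothetical inline JSON object whose keys (`group_a`, `group_b`) each hold an array of book records; this layout is an illustration and not necessarily the structure of books.json:

```r
library(jsonlite)
library(dplyr)

# Hypothetical JSON (not the assignment file): an object whose keys each
# hold an array of book records. Titles are taken from the tables above.
json_text <- '{
  "group_a": [{"ID": 1, "Title": "Innumeracy", "ISBN": "0809058405"}],
  "group_b": [{"ID": 2, "Title": "The Rosie Project", "ISBN": "1476729093"}]
}'

# fromJSON() simplifies each array of objects into its own data frame,
# so the result is a named list of data frames, not a single table.
parsed <- jsonlite::fromJSON(json_text)

# bind_rows() stacks the list into one table; .id stores the list names
# in a new column (here called "group", a name chosen for this sketch).
books <- dplyr::bind_rows(parsed, .id = "group")
books
```

If the JSON file were instead a plain top-level array of records, fromJSON() would return a single data frame and the bind_rows() step would not be needed.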

HTML

books_df <- readHTMLTable(
    getURL("https://raw.githubusercontent.com/hrensimin05/Data_607/master/books.htm"),
    header = TRUE, which = 1)

class(books_df)
## [1] "data.frame"
knitr::kable(books_df)
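|ID |Title                                                     |Author         |ISBN       |
|:--|:---------------------------------------------------------|:--------------|:----------|
|1  |Innumeracy : mathematical illiteracy and its consequences |John Paulos    |0809058405 |
|2  |The Rosie Project                                         |Graeme Simsion |1476729093 |
|3  |Is Everyone Hanging Out Without Me? (And Other Concerns)  |Mindy Kaling   |0307886271 |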
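The same reading step can be tried without a network call. A minimal sketch, assuming a small hypothetical HTML string (not the assignment's books.htm) that is parsed with htmlParse() before being passed to readHTMLTable():

```r
library(XML)

# Hypothetical two-row table for illustration; titles shortened.
html_text <- '<html><body><table>
<tr><th>ID</th><th>Title</th></tr>
<tr><td>1</td><td>Innumeracy</td></tr>
<tr><td>2</td><td>The Rosie Project</td></tr>
</table></body></html>'

# asText = TRUE tells the parser the string is HTML content, not a file path.
doc <- htmlParse(html_text, asText = TRUE)

# which = 1 returns the first table as a data frame instead of a list of tables.
tbl <- readHTMLTable(doc, header = TRUE, which = 1)

# Cell values are read from text, so numeric columns need explicit conversion.
# as.character() first keeps this safe whether the column is character or factor.
tbl$ID <- as.integer(as.character(tbl$ID))
str(tbl)
```

The explicit conversion matters when comparing against the JSON result, where ID is already an integer column.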

XML

books2 <- ldply(xmlToList(getURL("https://raw.githubusercontent.com/hrensimin05/Data_607/master/books.xml")), data.frame) %>%
    select(-.id)

class(books2)
## [1] "data.frame"
knitr::kable(books2)
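|id |title                                                     |author         |isbn       |
|:--|:---------------------------------------------------------|:--------------|:----------|
|1  |Innumeracy : mathematical illiteracy and its consequences |John Paulos    |0809058405 |
|2  |The Rosie Project                                         |Graeme Simsion |1476729093 |
|3  |Is Everyone Hanging Out Without Me? (And Other Concerns)  |Mindy Kaling   |0307886271 |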
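For flat XML like this, where every record has the same simple child elements, the XML package's xmlToDataFrame() is a possible alternative to the ldply(xmlToList(...)) approach used above. A minimal sketch, assuming a hypothetical inline XML string with `book` records (the element names are illustrative and may not match books.xml):

```r
library(XML)

# Hypothetical XML for illustration; titles shortened.
xml_text <- '<books>
  <book><id>1</id><title>Innumeracy</title></book>
  <book><id>2</id><title>The Rosie Project</title></book>
</books>'

# asText = TRUE marks the string as XML content rather than a file name.
doc <- xmlParse(xml_text, asText = TRUE)

# Each child of the root becomes one row; each grandchild becomes a column.
books_alt <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
books_alt
```

This avoids the extra `.id` column that ldply() adds, but it only suits records without nested elements.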

Conclusion

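All three files store the same information in slightly different ways. The HTML and XML data frames contain identical values (their column names differ only in capitalization), as the comparison below confirms. The JSON data frame was the most difficult for me to implement: I had structured the JSON file a bit differently from the XML and HTML files, so the parsed result had to be transformed into rows with bind_rows().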

# Compare the HTML and XML data frames cell by cell
books_df == books2
##        ID Title Author ISBN
## [1,] TRUE  TRUE   TRUE TRUE
## [2,] TRUE  TRUE   TRUE TRUE
## [3,] TRUE  TRUE   TRUE TRUE
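Note that `==` on two data frames compares cells by position and labels the result with the first data frame's column names, which is why the output above shows `ID`, `Title`, `Author`, `ISBN` even though the XML data frame uses lowercase names. A small sketch with two hypothetical stand-in data frames (`html_like` and `xml_like` are illustrative names) shows the difference between that check and a stricter one:

```r
# Stand-ins with the same values but differently cased column names.
html_like <- data.frame(ID = c("1", "2"), Title = c("A", "B"),
                        stringsAsFactors = FALSE)
xml_like  <- data.frame(id = c("1", "2"), title = c("A", "B"),
                        stringsAsFactors = FALSE)

# Cell-by-cell comparison ignores column names entirely.
values_match <- all(html_like == xml_like)

# The names themselves are not identical.
names_match <- identical(names(html_like), names(xml_like))

# After aligning the names, all.equal() checks names, types, and values.
names(xml_like) <- names(html_like)
fully_equal <- isTRUE(all.equal(html_like, xml_like))

c(values_match = values_match, names_match = names_match,
  fully_equal = fully_equal)
```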