Assignment – Working with XML and JSON in R
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(RCurl)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v stringr 1.4.0
## v tidyr 1.1.2 v forcats 0.5.0
## v readr 1.3.1
## -- Conflicts --------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(XML)
library(knitr)
library(rjson)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following object is masked from 'package:purrr':
##
## compact
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
JSON
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following objects are masked from 'package:rjson':
##
## fromJSON, toJSON
## The following object is masked from 'package:purrr':
##
## flatten
books_json <- fromJSON("https://raw.githubusercontent.com/hrensimin05/Data_607/master/books.json")
books_json <- bind_rows(books_json, .id = 'Author')
books_json
## # A tibble: 3 x 4
## ID Title Author ISBN
## <int> <chr> <chr> <chr>
## 1 1 Innumeracy : mathematical illiteracy and its con~ John Paulos 08090584~
## 2 2 The Rosie Project Graeme Sims~ 14767290~
## 3 3 Is Everyone Hanging Out Without Me? (And Other C~ Mindy Kaling 03078862~
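To show how `fromJSON` turns JSON into a data frame, here is a minimal sketch using an inline string instead of the GitHub file. The layout of `books_txt` (an array of objects, one per book) is an assumption made for illustration and may not match the actual books.json; `books_txt` and `books_demo` are hypothetical names. The call is written as `jsonlite::fromJSON` so that it does not matter whether rjson or jsonlite was attached last.

```r
library(jsonlite)

# Hypothetical inline JSON: an array with one object per book.
# The real books.json may be structured differently.
books_txt <- '[
  {"ID": 1, "Title": "Innumeracy : mathematical illiteracy and its consequences",
   "Author": "John Paulos", "ISBN": "0809058405"},
  {"ID": 2, "Title": "The Rosie Project",
   "Author": "Graeme Simsion", "ISBN": "1476729093"}
]'

# An array of objects that share the same keys is simplified to a data frame
books_demo <- jsonlite::fromJSON(books_txt)

# Inspect the column types; ISBN is quoted in the JSON, so it stays a string
str(books_demo)
```

Keeping the ISBN as a quoted string matters because several of these ISBNs start with a zero, which would be lost if the value were stored as a number.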
HTML
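# Download the page and parse its first table (which = 1);
# header = TRUE takes the column names from the first row
books_df <- readHTMLTable(
  getURL("https://raw.githubusercontent.com/hrensimin05/Data_607/master/books.htm"),
  header = TRUE, which = 1)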
class(books_df)
## [1] "data.frame"
knitr::kable(books_df)
| ID | Title | Author | ISBN |
|:--|:------|:-------|:-----|
| 1 | Innumeracy : mathematical illiteracy and its consequences | John Paulos | 0809058405 |
| 2 | The Rosie Project | Graeme Simsion | 1476729093 |
| 3 | Is Everyone Hanging Out Without Me? (And Other Concerns) | Mindy Kaling | 0307886271 |
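The same parsing step can be tried without a network connection. The sketch below assumes a simple table whose first row holds the column names; `html_txt` and `books_html_demo` are hypothetical names, and the markup is illustrative rather than a copy of books.htm.

```r
library(XML)

# Hypothetical inline HTML with a header row followed by two data rows
html_txt <- '<html><body><table>
<tr><th>ID</th><th>Title</th><th>Author</th><th>ISBN</th></tr>
<tr><td>1</td><td>Innumeracy : mathematical illiteracy and its consequences</td><td>John Paulos</td><td>0809058405</td></tr>
<tr><td>2</td><td>The Rosie Project</td><td>Graeme Simsion</td><td>1476729093</td></tr>
</table></body></html>'

# asText = TRUE tells the parser that the argument is markup, not a file name
doc <- htmlParse(html_txt, asText = TRUE)

# Read the first table; keep the cells as character strings, not factors
books_html_demo <- readHTMLTable(doc, header = TRUE, which = 1,
                                 stringsAsFactors = FALSE)
str(books_html_demo)
```

Every column comes back as character here, so an ISBN with a leading zero is preserved as written.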
XML
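# Parse the XML into a nested list, convert each record to a one-row data frame
# and stack them with ldply(), then drop the .id column that ldply() adds
books2 <- ldply(xmlToList(getURL("https://raw.githubusercontent.com/hrensimin05/Data_607/master/books.xml")), data.frame) %>%
  select(-.id)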
class(books2)
## [1] "data.frame"
knitr::kable(books2)
| ID | Title | Author | ISBN |
|:--|:------|:-------|:-----|
| 1 | Innumeracy : mathematical illiteracy and its consequences | John Paulos | 0809058405 |
| 2 | The Rosie Project | Graeme Simsion | 1476729093 |
| 3 | Is Everyone Hanging Out Without Me? (And Other Concerns) | Mindy Kaling | 0307886271 |
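For a flat file with one record per child element, the XML package also provides `xmlToDataFrame`, which could replace the `ldply` step and so avoid loading plyr after dplyr. This is only a sketch of that alternative, not the method used above. The element names in `xml_txt` are assumptions and may differ from those in books.xml; `xml_txt` and `books_xml_demo` are hypothetical names.

```r
library(XML)

# Hypothetical inline XML: a root element with one child per book
xml_txt <- '<books>
  <book>
    <ID>1</ID>
    <Title>Innumeracy : mathematical illiteracy and its consequences</Title>
    <Author>John Paulos</Author>
    <ISBN>0809058405</ISBN>
  </book>
  <book>
    <ID>2</ID>
    <Title>The Rosie Project</Title>
    <Author>Graeme Simsion</Author>
    <ISBN>1476729093</ISBN>
  </book>
</books>'

# asText = TRUE marks the argument as XML content rather than a file name
doc <- xmlParse(xml_txt, asText = TRUE)

# Each child of the root becomes a row; each grandchild becomes a column
books_xml_demo <- xmlToDataFrame(doc, stringsAsFactors = FALSE)
str(books_xml_demo)
```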
Conclusion
All three files store the information slightly differently. The HTML and XML data frames are identical, but the JSON data frame, which was the most difficult for me to implement, had to be transformed into rows. I also created the JSON file a bit differently compared to the XML and HTML files.
# Compare the HTML and XML data frames cell by cell
books_df == books2
## ID Title Author ISBN
## [1,] TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE
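A cell-by-cell `==` comparison checks values only, after R coerces the columns to a common type. The sketch below, built on small hypothetical data frames (`a`, `a_copy`, `b_int`), shows how `identical()` and `all.equal()` can be used alongside `==` to detect type differences as well. This is relevant here because the JSON output above lists `ID` as `<int>`.

```r
# Hypothetical data frames for illustration only
a <- data.frame(ID = c("1", "2"),
                ISBN = c("0809058405", "1476729093"),
                stringsAsFactors = FALSE)
a_copy <- a

# Same values, but ID stored as integer instead of character
b_int <- a
b_int$ID <- as.integer(b_int$ID)

# == coerces before comparing, so the type difference goes unnoticed
values_match <- all(a == b_int)

# identical() also compares column types, so it reports a difference
same_object <- identical(a, b_int)

# all.equal() returns TRUE or a description of what differs
all.equal(a, b_int)

print(values_match)
print(same_object)
print(identical(a, a_copy))
```

So a matrix of `TRUE` from `==` shows that the values agree, while `identical()` is the stricter check that the two data frames are the same in every respect.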