Load general libraries:
This document explores loading JSON, XML, and HTML table representations of the same data into R data frames. The data consists of three records, each describing a popular young-adult book.
For reference, this is what the JSON data looks like:
## [1] "[{"
## [2] "\t\t\"id\": 1,"
## [3] "\t\t\"title\": \"The Hunger Games\","
## [4] "\t\t\"authors\": \"Suzanne Collins\","
## [5] "\t\t\"langauge\": \"eng\","
## [6] "\t\t\"avg_rating\": 4.34,"
## [7] "\t\t\"n_ratings\": 4780653"
## [8] "\t},"
## [9] "\t{"
## [10] "\t\t\"id\": 2,"
## [11] "\t\t\"title\": \"Twilight\","
## [12] "\t\t\"authors\": [\"Stephenie Meyer\", \"Illustrator Jones\"],"
## [13] "\t\t\"langauge\": \"eng\","
## [14] "\t\t\"avg_rating\": 3.57,"
## [15] "\t\t\"n_ratings\": 3866839"
## [16] "\t},"
## [17] "\t{"
## [18] "\t\t\"id\": 3,"
## [19] "\t\t\"title\": \"The Fellowship of the Ring\","
## [20] "\t\t\"authors\": \"J.R.R. Tolkien\","
## [21] "\t\t\"langauge\": \"eng\","
## [22] "\t\t\"avg_rating\": 4.34,"
## [23] "\t\t\"n_ratings\": 1766803"
## [24] "\t}"
## [25] "]"
Note that the second record lists two authors; see the Remarks section toward the end of the document.
JSON
## id title authors
## 1 1 The Hunger Games Suzanne Collins
## 2 2 Twilight Stephenie Meyer, Illustrator Jones
## 3 3 The Fellowship of the Ring J.R.R. Tolkien
## langauge avg_rating n_ratings
## 1 eng 4.34 4780653
## 2 eng 3.57 3866839
## 3 eng 4.34 1766803
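The chunk that produced this data frame is not shown above. A minimal sketch of one way to load such data, assuming the jsonlite package and using an abbreviated inline copy of the records in place of the file (the name books_json is hypothetical):

```r
library(jsonlite)

# Abbreviated inline copy of the data shown earlier (assumption: the real
# file holds the same records); a file path could be passed to fromJSON() instead.
books_json <- '[
  {"id": 1, "title": "The Hunger Games", "authors": "Suzanne Collins"},
  {"id": 2, "title": "Twilight", "authors": ["Stephenie Meyer", "Illustrator Jones"]},
  {"id": 3, "title": "The Fellowship of the Ring", "authors": "J.R.R. Tolkien"}
]'

# fromJSON() simplifies an array of records into a data frame
from_json <- fromJSON(books_json)

# Number of authors stored for each book
lengths(from_json$authors)
```

Because the second record holds an array of authors, the authors column would be expected to come back as a list column rather than a plain character vector, which is consistent with the comma-separated display above.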
XML
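library(XML)
library(dplyr)  # provides %>% and select()
# Parse the XML file and get its root node
xml_doc <- xmlParse('books.xml')
root <- xmlRoot(xml_doc)
# Extract the text of every field of every book (fields as rows, books as columns)
data <- xmlSApply(root, function(x) xmlSApply(x, xmlValue))
# Transpose so each book is a row, then order the columns as in from_json
from_xml <- data.frame(t(data), row.names=NULL) %>%
  select(colnames(from_json))
head(from_xml)
## id title authors langauge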
## 1 1 The Hunger Games Suzanne Collins eng
## 2 2 Twilight Stephenie MeyerIllustrator Jones eng
## 3 3 The Fellowship of the Ring J.R.R. Tolkien eng
## avg_rating n_ratings
## 1 4.34 4780653
## 2 3.57 3866839
## 3 4.34 1766803
Note the authors column; see Remarks below.
HTML
Using the rvest package by our dear Hadley Wickham:
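library(rvest)
# Read the page and convert every <table> element to a data frame
html_doc <- read_html('books.html')
html_tbl <- html_doc %>%
  html_nodes('table') %>%
  html_table()
# Take the first table found on the page
from_html <- as.data.frame(html_tbl[[1]])
head(from_html)
## id title authors language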
## 1 1 The Hunger Games Suzanne Collins eng
## 2 2 Twilight Stephenie Meyer, Illustrator Jones eng
## 3 3 J.R.R. Tolkien The Fellowship of the Ring eng
## avg_rating n_ratings
## 1 4.34 4780653
## 2 3.57 3866839
## 3 4.34 1766803
Remarks
These three data frames are almost identical. In the XML-derived data frame, the two authors of Twilight are run together as a single string, Stephenie MeyerIllustrator Jones, which could become a problem for some analyses of this data. In the other two data frames, they are appropriately represented with a separator character: Stephenie Meyer, Illustrator Jones.
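One likely explanation, assuming books.xml nests each author in its own child element (its layout is not shown here), is that xmlValue() concatenates the text of all descendant nodes without a separator. A minimal sketch with a hypothetical inline XML fragment:

```r
library(XML)

# Hypothetical fragment; the element names are assumptions, not the real books.xml
xml_text <- '<books><book><title>Twilight</title><authors><author>Stephenie Meyer</author><author>Illustrator Jones</author></authors></book></books>'
doc <- xmlParse(xml_text, asText = TRUE)

# xmlValue() on the parent node runs the child text together
joined <- xpathSApply(doc, "//authors", xmlValue)

# Taking xmlValue() of each child keeps the names apart,
# so they can be pasted back together with an explicit separator
authors <- xpathSApply(doc, "//authors/author", xmlValue)
authors_sep <- paste(authors, collapse = "; ")

print(joined)
print(authors_sep)
```

Extracting the child nodes individually, rather than the parent, is what lets the separator be chosen explicitly.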
In the data recording stage, it would probably be a better design decision to put quotation marks around each author's name and use a semicolon (rather than a comma) to separate them. This would reduce problems when trying to separate out the individual authors in the authors column of the R data frame.
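As a rough illustration of why a semicolon helps, here is a minimal base R sketch that splits a semicolon-separated authors column into individual names. The books data frame below is a made-up stand-in, not one of the data frames loaded above:

```r
# Hypothetical data frame with authors recorded using "; " as the separator
books <- data.frame(
  title = c("The Hunger Games", "Twilight"),
  authors = c("Suzanne Collins", "Stephenie Meyer; Illustrator Jones"),
  stringsAsFactors = FALSE
)

# Split on a semicolon followed by optional whitespace;
# the result is a list with one character vector per book
author_list <- strsplit(books$authors, ";\\s*")

# Number of authors per book
n_authors <- lengths(author_list)
print(author_list)
```

A comma would be riskier as a separator, since names written in "Surname, Given name" form would be split in the wrong place.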
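A final observation: the XML-derived data frame stores much of the data as factors, whereas rvest does a good job of using the more appropriate integer, numeric, and character data types. The factors most likely come from the data.frame() call, which converted strings to factors by default before R 4.0, rather than from the XML package itself.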
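If factor columns do appear, they can be converted after the fact. A minimal base R sketch, using a small made-up data frame (xml_like is a hypothetical name) that mimics the all-factor result:

```r
# Stand-in for a data frame whose columns all arrived as factors
xml_like <- data.frame(
  id = c("1", "2", "3"),
  title = c("The Hunger Games", "Twilight", "The Fellowship of the Ring"),
  avg_rating = c("4.34", "3.57", "4.34"),
  n_ratings = c("4780653", "3866839", "1766803"),
  stringsAsFactors = TRUE
)

# Convert each column to character first, then let type.convert()
# choose integer, numeric, or character as appropriate
fixed <- xml_like
fixed[] <- lapply(xml_like, function(col) type.convert(as.character(col), as.is = TRUE))

# Inspect the resulting column types
str(fixed)
```

Going through as.character() first matters: calling as.numeric() directly on a factor returns its internal level codes rather than the values it displays.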