Load general libraries:

library(dplyr)
# install.packages(c('jsonlite', 'XML', 'rvest'))

This document explores loading JSON, XML, and HTML table representations of the same data into R data frames. The data consist of three records, each describing a popular young adult book.

For reference, this is what the JSON data looks like:

##  [1] "[{"                                                            
##  [2] "\t\t\"id\": 1,"                                                
##  [3] "\t\t\"title\": \"The Hunger Games\","                          
##  [4] "\t\t\"authors\": \"Suzanne Collins\","                         
##  [5] "\t\t\"langauge\": \"eng\","                                    
##  [6] "\t\t\"avg_rating\": 4.34,"                                     
##  [7] "\t\t\"n_ratings\": 4780653"                                    
##  [8] "\t},"                                                          
##  [9] "\t{"                                                           
## [10] "\t\t\"id\": 2,"                                                
## [11] "\t\t\"title\": \"Twilight\","                                  
## [12] "\t\t\"authors\": [\"Stephenie Meyer\", \"Illustrator Jones\"],"
## [13] "\t\t\"langauge\": \"eng\","                                    
## [14] "\t\t\"avg_rating\": 3.57,"                                     
## [15] "\t\t\"n_ratings\": 3866839"                                    
## [16] "\t},"                                                          
## [17] "\t{"                                                           
## [18] "\t\t\"id\": 3,"                                                
## [19] "\t\t\"title\": \"The Fellowship of the Ring\","                
## [20] "\t\t\"authors\": \"J.R.R. Tolkien\","                          
## [21] "\t\t\"langauge\": \"eng\","                                    
## [22] "\t\t\"avg_rating\": 4.34,"                                     
## [23] "\t\t\"n_ratings\": 1766803"                                    
## [24] "\t}"                                                           
## [25] "]"

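Note that the second record lists two authors as an array, while the other records give a single string; see the Remarks section toward the end of the document.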

JSON

library(jsonlite)
from_json <- fromJSON('books.json', flatten=TRUE)
head(from_json)

##   id                      title                            authors
## 1  1           The Hunger Games                    Suzanne Collins
## 2  2                   Twilight Stephenie Meyer, Illustrator Jones
## 3  3 The Fellowship of the Ring                     J.R.R. Tolkien
##   langauge avg_rating n_ratings
## 1      eng       4.34   4780653
## 2      eng       3.57   3866839
## 3      eng       4.34   1766803
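
Because the Twilight record stores authors as an array while the other records use a plain string, the authors column may come back as a list column rather than a character vector. The following is a minimal sketch of collapsing each entry into a single string. It assumes an inline JSON string in place of books.json, and the names json_text, books, and authors_flat are hypothetical.

```r
library(jsonlite)

# Inline stand-in for books.json, trimmed to the fields needed here.
json_text <- '[
  {"id": 1, "authors": "Suzanne Collins"},
  {"id": 2, "authors": ["Stephenie Meyer", "Illustrator Jones"]}
]'

books <- fromJSON(json_text)

# Collapse every entry to one string; this works whether authors is a
# list column or an ordinary character vector.
books$authors_flat <- vapply(books$authors, paste, character(1), collapse = "; ")

unname(books$authors_flat)
```

The "; " separator is an arbitrary choice for this sketch; any string that cannot occur inside an author name would do.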

XML

library(XML)
xml_doc <- xmlParse('books.xml')
root <- xmlRoot(xml_doc)
data <- xmlSApply(root, function(x) xmlSApply(x, xmlValue))
from_xml <- data.frame(t(data), row.names=NULL) %>%
    select(colnames(from_json))
head(from_xml)

##   id                      title                          authors langauge
## 1  1           The Hunger Games                  Suzanne Collins      eng
## 2  2                   Twilight Stephenie MeyerIllustrator Jones      eng
## 3  3 The Fellowship of the Ring                   J.R.R. Tolkien      eng
##   avg_rating n_ratings
## 1       4.34   4780653
## 2       3.57   3866839
## 3       4.34   1766803

Note that the two authors in row 2 are run together in the authors column; see Remarks below.
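
One possible way to keep multiple authors apart is to join the values of each field's child nodes with an explicit separator instead of calling xmlValue on the whole field. The sketch below assumes a hypothetical layout in which each author sits in its own author element; the real books.xml may be structured differently, and the names xml_text, join_children, and from_xml_sep are illustrative only.

```r
library(XML)

# Hypothetical stand-in for books.xml; the real file's tag names may differ.
xml_text <- '<books>
  <book>
    <id>2</id>
    <title>Twilight</title>
    <authors><author>Stephenie Meyer</author><author>Illustrator Jones</author></authors>
  </book>
  <book>
    <id>3</id>
    <title>The Fellowship of the Ring</title>
    <authors>J.R.R. Tolkien</authors>
  </book>
</books>'

root <- xmlRoot(xmlParse(xml_text, asText = TRUE))

# Join the values of a field's child nodes with "; " instead of letting
# xmlValue() concatenate them with no separator.
join_children <- function(field) {
  paste(xmlSApply(field, xmlValue), collapse = "; ")
}

rows <- xmlSApply(root, function(book) xmlSApply(book, join_children))
from_xml_sep <- data.frame(t(rows), row.names = NULL, stringsAsFactors = FALSE)
from_xml_sep$authors
```

Fields with a single text value pass through unchanged, since they have only one child node to join.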

HTML

Using the rvest package by our dear Hadley Wickham:

library(rvest)
html_doc <- read_html('books.html')
html_tbl <- html_doc %>%
    html_nodes('table') %>%
    html_table()
from_html <- as.data.frame(html_tbl[[1]])
head(from_html)

##   id            title                            authors language
## 1  1 The Hunger Games                    Suzanne Collins      eng
## 2  2         Twilight Stephenie Meyer, Illustrator Jones      eng
## 3  3   J.R.R. Tolkien         The Fellowship of the Ring      eng
##   avg_rating n_ratings
## 1       4.34   4780653
## 2       3.57   3866839
## 3       4.34   1766803

Remarks

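These three data frames are almost identical. The author data in the XML-derived data frame is printed as one author, Stephenie MeyerIllustrator Jones, which could become a problem for some analyses of this data. In the other two data frames, it is appropriately represented with a separator character: Stephenie Meyer, Illustrator Jones. The HTML-derived data frame also differs slightly in the output above: its column is named language rather than langauge, and the title and authors values in row 3 appear swapped, which likely reflects the contents of books.html rather than the parsing.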
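
One way to check how close two of the data frames are is to compare column names first and values second. This is a minimal sketch with toy stand-in frames (the names a and b are hypothetical), using the langauge and language column names seen in the outputs above.

```r
# Toy stand-ins for two of the data frames, one row each.
a <- data.frame(id = 1L, langauge = "eng", stringsAsFactors = FALSE)
b <- data.frame(id = 1L, language = "eng", stringsAsFactors = FALSE)

# Column names present in one frame but not the other.
only_in_a <- setdiff(names(a), names(b))
only_in_b <- setdiff(names(b), names(a))

# Values can be compared once the names are aligned.
names(b) <- names(a)
same_values <- isTRUE(all.equal(a, b))
```

The same calls could be applied to from_json, from_xml, and from_html, although column types would probably need to be aligned before all.equal reports a match.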

In the data recording stage, it would probably be a better design decision to put quotation marks around each author's name and use a semicolon (rather than a comma) to separate them. This would reduce problems when trying to separate out the individual authors in the authors column of the R data frame.
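
To illustrate why a semicolon separator helps, here is a minimal sketch of splitting a semicolon-delimited authors column into one row per author. The data frame is a toy stand-in, and the names books, author_list, and long are hypothetical.

```r
# Toy data frame with authors recorded using a semicolon separator.
books <- data.frame(
  title = c("The Hunger Games", "Twilight"),
  authors = c("Suzanne Collins", "Stephenie Meyer; Illustrator Jones"),
  stringsAsFactors = FALSE
)

# Split on semicolons, allowing optional whitespace after each one.
author_list <- strsplit(books$authors, ";\\s*")

# Repeat each title once per author to get one row per (title, author) pair.
long <- data.frame(
  title = rep(books$title, lengths(author_list)),
  author = trimws(unlist(author_list)),
  stringsAsFactors = FALSE
)
long
```

A comma would be a riskier separator because names written in "Last, First" form contain commas themselves.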

A final observation: the XML-derived data frame stores much of the data as factors, whereas rvest does a good job of using the more appropriate integer, numeric, and character data types.
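
If factor columns are a problem, they can be converted after the fact. The following is a minimal sketch using base R's type.convert on a toy all-factor data frame; the names raw and typed are hypothetical, and the values are copied from the outputs above.

```r
# Toy stand-in for a data frame whose columns were all read as factors.
raw <- data.frame(
  id = c("1", "2", "3"),
  title = c("The Hunger Games", "Twilight", "The Fellowship of the Ring"),
  avg_rating = c("4.34", "3.57", "4.34"),
  n_ratings = c("4780653", "3866839", "1766803"),
  stringsAsFactors = TRUE
)

# Convert each column via its character form so that factor codes are
# not mistaken for the underlying values.
typed <- raw
typed[] <- lapply(raw, function(col) type.convert(as.character(col), as.is = TRUE))

sapply(typed, class)
```

Passing as.is = TRUE keeps text columns as character rather than turning them back into factors.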

DATA 607—Homework No. 7

Ben Horvath

October 14, 2018
