Introduction

This week’s assignment is about reading three file formats into R:

  • HTML
  • XML
  • JSON

I have manually created the three files describing the authors and titles of the books on decision making and have uploaded them on GitHub.

Reading HTML Table

In a document with a simple structure, reading an HTML table is easy using the readHTMLTable function.

library(XML)
library(knitr)
library(RCurl)
## Loading required package: bitops
url_HTML = getURL("https://raw.githubusercontent.com/datafeelings/data607/master/week_7/books.html", ssl.verifypeer = FALSE)

books_HTML = readHTMLTable(url_HTML, 
                           header = T, 
                           which=1, 
                           isHTML = TRUE, 
                           stringsAsFactors = F)

The output is directly converted to a dataframe

kable(books_HTML)
ID Title Authors Year ISBN
1 Thinking, Fast and Slow Daniel Kahneman 2013 0374533555
2 The Power of Habit: Why We Do What We Do, and How to Change Charles Duhigg 2013 1847946240
3 The Organized Mind: Thinking Straight in the Age of Information Overload Daniel J. Levitin 2015 0147516315
3 Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts Carol Tavris, Elliot Aronson 2007 0151010986

Reading XML

First, parse the XML tree and have a look at the document structure

url_XML = getURL("https://raw.githubusercontent.com/datafeelings/data607/master/week_7/books.xml", ssl.verifypeer = FALSE)

books_XML = xmlParse(url_XML)

book_child = xmlChildren(books_XML)

head(book_child)
## $books
## <books lang="en">
##   <book>
##     <id>1</id>
##     <title>Thinking, Fast and Slow</title>
##     <authors>Daniel Kahneman</authors>
##     <year>2013</year>
##     <ISBN>0374533555</ISBN>
##   </book>
##   <book>
##     <id>2</id>
##     <title>The Power of Habit: Why We Do What We Do, and How to Change</title>
##     <authors>Charles Duhigg</authors>
##     <year>2013</year>
##     <ISBN>1847946240</ISBN>
##   </book>
##   <book>
##     <id>3</id>
##     <title>The Organized Mind: Thinking Straight in the Age of Information Overload</title>
##     <authors>Daniel J. Levitin</authors>
##     <year>2015</year>
##     <ISBN>0147516315</ISBN>
##   </book>
##   <book>
##     <id>4</id>
##     <title>Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts</title>
##     <authors>Carol Tavris, Elliot Aronson</authors>
##     <year>2007</year>
##     <ISBN>0151010986</ISBN>
##   </book>
## </books>

We can see that each element that we need has a list of tags with id, title, authors, year, and ISBN.
Now we can populate a dataframe with the contents of each tag.

books_XML_df = data.frame()

books_XML_df = cbind(xpathSApply(books_XML, "//id", xmlValue),
                     xpathSApply(books_XML, "//title", xmlValue),
                     xpathSApply(books_XML, "//authors", xmlValue),
                     xpathSApply(books_XML, "//year", xmlValue),
                     xpathSApply(books_XML, "//ISBN", xmlValue))

books_XML_df = data.frame(books_XML_df)
names(books_XML_df) = c("id", "title", "authors", "year", "ISBN")

kable(books_XML_df)
id title authors year ISBN
1 Thinking, Fast and Slow Daniel Kahneman 2013 0374533555
2 The Power of Habit: Why We Do What We Do, and How to Change Charles Duhigg 2013 1847946240
3 The Organized Mind: Thinking Straight in the Age of Information Overload Daniel J. Levitin 2015 0147516315
4 Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts Carol Tavris, Elliot Aronson 2007 0151010986

Reading a JSON

Similarly to the XML file, the JSON first should be parsed first, which is easily accomplished by the jsonlite package. However, as I have picked a different node structure than in the previous formats, the parser interprets the node of “authors” as a list of character vectors (as one of the records is an array of two authors).

library(jsonlite)

url_JSON = getURL("https://raw.githubusercontent.com/datafeelings/data607/master/week_7/books.json", ssl.verifypeer = FALSE)

books_JSON = fromJSON(txt = url_JSON)

str(books_JSON)
## List of 5
##  $ id    : int [1:4] 1 2 3 4
##  $ Title : chr [1:4] "Thinking, Fast and Slow" "The Power of Habit: Why We Do What We Do, and How to Change" "The Organized Mind: Thinking Straight in the Age of Information Overload" "Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts"
##  $ Autors:List of 4
##   ..$ : chr "Daniel Kahneman"
##   ..$ : chr "Charles Duhigg"
##   ..$ : chr "Daniel J. Levitin"
##   ..$ : chr [1:2] "Carol Tavris" "Elliot Aronson"
##  $ Year  : int [1:4] 2013 2013 2015 2007
##  $ ISBN  : chr [1:4] "0374533555" "1847946240" "0147516315" "0151010986"

Forcing this file into a dataframe results in an incorrect data frame structure

books_JSON_df = as.data.frame(books_JSON)

kable(books_JSON_df)
id Title Autors..Daniel.Kahneman. Autors..Charles.Duhigg. Autors..Daniel.J..Levitin. Autors.c..Carol.Tavris….Elliot.Aronson.. Year ISBN
1 Thinking, Fast and Slow Daniel Kahneman Charles Duhigg Daniel J. Levitin Carol Tavris 2013 0374533555
2 The Power of Habit: Why We Do What We Do, and How to Change Daniel Kahneman Charles Duhigg Daniel J. Levitin Elliot Aronson 2013 1847946240
3 The Organized Mind: Thinking Straight in the Age of Information Overload Daniel Kahneman Charles Duhigg Daniel J. Levitin Carol Tavris 2015 0147516315
4 Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts Daniel Kahneman Charles Duhigg Daniel J. Levitin Elliot Aronson 2007 0151010986

One solution is to collapse any embedded lists into characters first so that the “authors” column could be turned into a simple character vector.

# Function to collapse any list longer than 1 into a character vector
collapselist = function(x) { 
  x = ifelse(length(unlist(x))>1,
             yes = paste(unlist(x),collapse = ", "),
             no = as.character(x))
}

books_JSON$Autors = sapply(books_JSON$Autors,FUN = collapselist)

books_JSON_df = as.data.frame(books_JSON)

kable(books_JSON_df)
id Title Autors Year ISBN
1 Thinking, Fast and Slow Daniel Kahneman 2013 0374533555
2 The Power of Habit: Why We Do What We Do, and How to Change Charles Duhigg 2013 1847946240
3 The Organized Mind: Thinking Straight in the Age of Information Overload Daniel J. Levitin 2015 0147516315
4 Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts Carol Tavris, Elliot Aronson 2007 0151010986

Now the data frame is correct.