Data 607 Week 7 Assignment

Introduction

This week’s assignment is about reading three file formats into R:

HTML
XML
JSON

I have manually created the three files describing the authors and titles of the books on decision making and have uploaded them on GitHub.

Reading HTML Table

In a document with a simple structure, reading an HTML table is easy using the readHTMLTable function.

library(XML)
library(knitr)
library(RCurl)

## Loading required package: bitops

url_HTML = getURL("https://raw.githubusercontent.com/datafeelings/data607/master/week_7/books.html", ssl.verifypeer = FALSE)

books_HTML = readHTMLTable(url_HTML, 
                           header = T, 
                           which=1, 
                           isHTML = TRUE, 
                           stringsAsFactors = F)

The output is directly converted to a dataframe

kable(books_HTML)

ID	Title	Authors	Year	ISBN
1	Thinking, Fast and Slow	Daniel Kahneman	2013	0374533555
2	The Power of Habit: Why We Do What We Do, and How to Change	Charles Duhigg	2013	1847946240
3	The Organized Mind: Thinking Straight in the Age of Information Overload	Daniel J. Levitin	2015	0147516315
3	Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts	Carol Tavris, Elliot Aronson	2007	0151010986

Reading XML

First, parse the XML tree and have a look at the document structure

url_XML = getURL("https://raw.githubusercontent.com/datafeelings/data607/master/week_7/books.xml", ssl.verifypeer = FALSE)

books_XML = xmlParse(url_XML)

book_child = xmlChildren(books_XML)

head(book_child)

## $books
## <books lang="en">
##   <book>
##     <id>1</id>
##     <title>Thinking, Fast and Slow</title>
##     <authors>Daniel Kahneman</authors>
##     <year>2013</year>
##     <ISBN>0374533555</ISBN>
##   </book>
##   <book>
##     <id>2</id>
##     <title>The Power of Habit: Why We Do What We Do, and How to Change</title>
##     <authors>Charles Duhigg</authors>
##     <year>2013</year>
##     <ISBN>1847946240</ISBN>
##   </book>
##   <book>
##     <id>3</id>
##     <title>The Organized Mind: Thinking Straight in the Age of Information Overload</title>
##     <authors>Daniel J. Levitin</authors>
##     <year>2015</year>
##     <ISBN>0147516315</ISBN>
##   </book>
##   <book>
##     <id>4</id>
##     <title>Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts</title>
##     <authors>Carol Tavris, Elliot Aronson</authors>
##     <year>2007</year>
##     <ISBN>0151010986</ISBN>
##   </book>
## </books>

We can see that each element that we need has a list of tags with id, title, authors, year, and ISBN.
Now we can populate a dataframe with the contents of each tag.

books_XML_df = data.frame()

books_XML_df = cbind(xpathSApply(books_XML, "//id", xmlValue),
                     xpathSApply(books_XML, "//title", xmlValue),
                     xpathSApply(books_XML, "//authors", xmlValue),
                     xpathSApply(books_XML, "//year", xmlValue),
                     xpathSApply(books_XML, "//ISBN", xmlValue))

books_XML_df = data.frame(books_XML_df)
names(books_XML_df) = c("id", "title", "authors", "year", "ISBN")

kable(books_XML_df)

id	title	authors	year	ISBN
1	Thinking, Fast and Slow	Daniel Kahneman	2013	0374533555
2	The Power of Habit: Why We Do What We Do, and How to Change	Charles Duhigg	2013	1847946240
3	The Organized Mind: Thinking Straight in the Age of Information Overload	Daniel J. Levitin	2015	0147516315
4	Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts	Carol Tavris, Elliot Aronson	2007	0151010986

Reading a JSON

Similarly to the XML file, the JSON first should be parsed first, which is easily accomplished by the jsonlite package. However, as I have picked a different node structure than in the previous formats, the parser interprets the node of “authors” as a list of character vectors (as one of the records is an array of two authors).

library(jsonlite)

url_JSON = getURL("https://raw.githubusercontent.com/datafeelings/data607/master/week_7/books.json", ssl.verifypeer = FALSE)

books_JSON = fromJSON(txt = url_JSON)

str(books_JSON)

## List of 5
##  $ id    : int [1:4] 1 2 3 4
##  $ Title : chr [1:4] "Thinking, Fast and Slow" "The Power of Habit: Why We Do What We Do, and How to Change" "The Organized Mind: Thinking Straight in the Age of Information Overload" "Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts"
##  $ Autors:List of 4
##   ..$ : chr "Daniel Kahneman"
##   ..$ : chr "Charles Duhigg"
##   ..$ : chr "Daniel J. Levitin"
##   ..$ : chr [1:2] "Carol Tavris" "Elliot Aronson"
##  $ Year  : int [1:4] 2013 2013 2015 2007
##  $ ISBN  : chr [1:4] "0374533555" "1847946240" "0147516315" "0151010986"

Forcing this file into a dataframe results in an incorrect data frame structure

books_JSON_df = as.data.frame(books_JSON)

kable(books_JSON_df)

id	Title	Autors..Daniel.Kahneman.	Autors..Charles.Duhigg.	Autors..Daniel.J..Levitin.	Autors.c..Carol.Tavris….Elliot.Aronson..	Year	ISBN
1	Thinking, Fast and Slow	Daniel Kahneman	Charles Duhigg	Daniel J. Levitin	Carol Tavris	2013	0374533555
2	The Power of Habit: Why We Do What We Do, and How to Change	Daniel Kahneman	Charles Duhigg	Daniel J. Levitin	Elliot Aronson	2013	1847946240
3	The Organized Mind: Thinking Straight in the Age of Information Overload	Daniel Kahneman	Charles Duhigg	Daniel J. Levitin	Carol Tavris	2015	0147516315
4	Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts	Daniel Kahneman	Charles Duhigg	Daniel J. Levitin	Elliot Aronson	2007	0151010986

One solution is to collapse any embedded lists into characters first so that the “authors” column could be turned into a simple character vector.

# Function to collapse any list longer than 1 into a character vector
collapselist = function(x) { 
  x = ifelse(length(unlist(x))>1,
             yes = paste(unlist(x),collapse = ", "),
             no = as.character(x))
}

books_JSON$Autors = sapply(books_JSON$Autors,FUN = collapselist)

books_JSON_df = as.data.frame(books_JSON)

kable(books_JSON_df)

id	Title	Autors	Year	ISBN
1	Thinking, Fast and Slow	Daniel Kahneman	2013	0374533555
2	The Power of Habit: Why We Do What We Do, and How to Change	Charles Duhigg	2013	1847946240
3	The Organized Mind: Thinking Straight in the Age of Information Overload	Daniel J. Levitin	2015	0147516315
4	Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts	Carol Tavris, Elliot Aronson	2007	0151010986

Now the data frame is correct.

Data 607 Week 7 Assignment

by Dmitriy Vecheruk

15 October 2016

Introduction

Reading HTML Table

Reading XML

Reading a JSON