This week’s assignment is about reading three file formats into R:
I have manually created the three files describing the authors and titles of the books on decision making and have uploaded them on GitHub.
In a document with a simple structure, reading an HTML table is easy using the readHTMLTable function.
library(XML)
library(knitr)
library(RCurl)
## Loading required package: bitops
url_HTML = getURL("https://raw.githubusercontent.com/datafeelings/data607/master/week_7/books.html", ssl.verifypeer = FALSE)
books_HTML = readHTMLTable(url_HTML,
header = T,
which=1,
isHTML = TRUE,
stringsAsFactors = F)
The output is directly converted to a dataframe
kable(books_HTML)
| ID | Title | Authors | Year | ISBN |
|---|---|---|---|---|
| 1 | Thinking, Fast and Slow | Daniel Kahneman | 2013 | 0374533555 |
| 2 | The Power of Habit: Why We Do What We Do, and How to Change | Charles Duhigg | 2013 | 1847946240 |
| 3 | The Organized Mind: Thinking Straight in the Age of Information Overload | Daniel J. Levitin | 2015 | 0147516315 |
| 3 | Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts | Carol Tavris, Elliot Aronson | 2007 | 0151010986 |
First, parse the XML tree and have a look at the document structure
url_XML = getURL("https://raw.githubusercontent.com/datafeelings/data607/master/week_7/books.xml", ssl.verifypeer = FALSE)
books_XML = xmlParse(url_XML)
book_child = xmlChildren(books_XML)
head(book_child)
## $books
## <books lang="en">
## <book>
## <id>1</id>
## <title>Thinking, Fast and Slow</title>
## <authors>Daniel Kahneman</authors>
## <year>2013</year>
## <ISBN>0374533555</ISBN>
## </book>
## <book>
## <id>2</id>
## <title>The Power of Habit: Why We Do What We Do, and How to Change</title>
## <authors>Charles Duhigg</authors>
## <year>2013</year>
## <ISBN>1847946240</ISBN>
## </book>
## <book>
## <id>3</id>
## <title>The Organized Mind: Thinking Straight in the Age of Information Overload</title>
## <authors>Daniel J. Levitin</authors>
## <year>2015</year>
## <ISBN>0147516315</ISBN>
## </book>
## <book>
## <id>4</id>
## <title>Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts</title>
## <authors>Carol Tavris, Elliot Aronson</authors>
## <year>2007</year>
## <ISBN>0151010986</ISBN>
## </book>
## </books>
We can see that each element that we need has a list of tags with id, title, authors, year, and ISBN.
Now we can populate a dataframe with the contents of each tag.
books_XML_df = data.frame()
books_XML_df = cbind(xpathSApply(books_XML, "//id", xmlValue),
xpathSApply(books_XML, "//title", xmlValue),
xpathSApply(books_XML, "//authors", xmlValue),
xpathSApply(books_XML, "//year", xmlValue),
xpathSApply(books_XML, "//ISBN", xmlValue))
books_XML_df = data.frame(books_XML_df)
names(books_XML_df) = c("id", "title", "authors", "year", "ISBN")
kable(books_XML_df)
| id | title | authors | year | ISBN |
|---|---|---|---|---|
| 1 | Thinking, Fast and Slow | Daniel Kahneman | 2013 | 0374533555 |
| 2 | The Power of Habit: Why We Do What We Do, and How to Change | Charles Duhigg | 2013 | 1847946240 |
| 3 | The Organized Mind: Thinking Straight in the Age of Information Overload | Daniel J. Levitin | 2015 | 0147516315 |
| 4 | Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts | Carol Tavris, Elliot Aronson | 2007 | 0151010986 |
Similarly to the XML file, the JSON first should be parsed first, which is easily accomplished by the jsonlite package. However, as I have picked a different node structure than in the previous formats, the parser interprets the node of “authors” as a list of character vectors (as one of the records is an array of two authors).
library(jsonlite)
url_JSON = getURL("https://raw.githubusercontent.com/datafeelings/data607/master/week_7/books.json", ssl.verifypeer = FALSE)
books_JSON = fromJSON(txt = url_JSON)
str(books_JSON)
## List of 5
## $ id : int [1:4] 1 2 3 4
## $ Title : chr [1:4] "Thinking, Fast and Slow" "The Power of Habit: Why We Do What We Do, and How to Change" "The Organized Mind: Thinking Straight in the Age of Information Overload" "Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts"
## $ Autors:List of 4
## ..$ : chr "Daniel Kahneman"
## ..$ : chr "Charles Duhigg"
## ..$ : chr "Daniel J. Levitin"
## ..$ : chr [1:2] "Carol Tavris" "Elliot Aronson"
## $ Year : int [1:4] 2013 2013 2015 2007
## $ ISBN : chr [1:4] "0374533555" "1847946240" "0147516315" "0151010986"
Forcing this file into a dataframe results in an incorrect data frame structure
books_JSON_df = as.data.frame(books_JSON)
kable(books_JSON_df)
| id | Title | Autors..Daniel.Kahneman. | Autors..Charles.Duhigg. | Autors..Daniel.J..Levitin. | Autors.c..Carol.Tavris….Elliot.Aronson.. | Year | ISBN |
|---|---|---|---|---|---|---|---|
| 1 | Thinking, Fast and Slow | Daniel Kahneman | Charles Duhigg | Daniel J. Levitin | Carol Tavris | 2013 | 0374533555 |
| 2 | The Power of Habit: Why We Do What We Do, and How to Change | Daniel Kahneman | Charles Duhigg | Daniel J. Levitin | Elliot Aronson | 2013 | 1847946240 |
| 3 | The Organized Mind: Thinking Straight in the Age of Information Overload | Daniel Kahneman | Charles Duhigg | Daniel J. Levitin | Carol Tavris | 2015 | 0147516315 |
| 4 | Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts | Daniel Kahneman | Charles Duhigg | Daniel J. Levitin | Elliot Aronson | 2007 | 0151010986 |
One solution is to collapse any embedded lists into characters first so that the “authors” column could be turned into a simple character vector.
# Function to collapse any list longer than 1 into a character vector
collapselist = function(x) {
x = ifelse(length(unlist(x))>1,
yes = paste(unlist(x),collapse = ", "),
no = as.character(x))
}
books_JSON$Autors = sapply(books_JSON$Autors,FUN = collapselist)
books_JSON_df = as.data.frame(books_JSON)
kable(books_JSON_df)
| id | Title | Autors | Year | ISBN |
|---|---|---|---|---|
| 1 | Thinking, Fast and Slow | Daniel Kahneman | 2013 | 0374533555 |
| 2 | The Power of Habit: Why We Do What We Do, and How to Change | Charles Duhigg | 2013 | 1847946240 |
| 3 | The Organized Mind: Thinking Straight in the Age of Information Overload | Daniel J. Levitin | 2015 | 0147516315 |
| 4 | Mistakes Were Made (But Not by Me): Why We Justify Foolish Beliefs, Bad Decisions, and Hurtful Acts | Carol Tavris, Elliot Aronson | 2007 | 0151010986 |
Now the data frame is correct.