The books I used for this exercise don’t share a subject, but I think it is safe to assume that the content of the books is not really relevant to the purpose of the assignment.
Loading Our Libraries
Let’s load all the libraries that we’ll use to load our XML, HTML, and JSON files into dataframes and clean them up so that they are exactly the same.
library(XML)
library(httr)
library(RCurl)
library(plyr)
library(dplyr)
library(tidyr)
library(jsonlite)
library(zoo)
Now that our libraries are loaded, let’s start first with getting our XML file into a data frame.
XML
This file type was the simplest of the three file types. For this, we’ll use the xmlToList function to convert our XML file to a list. We’ll then use ldply from the plyr library to convert to a data frame. But first, we’ll get the file from my github page (apologies, if you visit my site, it’s ugly…I’ll find time at some point to make this nice).
xmldoc <- rawToChar(GET('http://chesterpoon8.github.io/books.xml')$content)
xml <- ldply(xmlToList(xmldoc), data.frame)
knitr::kable(xml, format = "html")
| .id | title | author | subject | publisher | comments | .attrs |
|---|---|---|---|---|---|---|
| book | Automate the Boring Stuff with Python | Al Sweigart | Python Programming | No Starch Press | Great book on how to automate yourself into a new job! I used this book to do just that. | 1 |
| book | A Long Way Gone: Memoirs of a Boy Soldier | Ishmael Beah | Memoir | Sarah Crichton Books | I went to college with the author and played intramural soccer with him not having the slightest idea of his past. | 2 |
| book | Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | Badia Ahad-Legardy | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays | 3 |
| book | Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | OiYan A. Poon | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays | 3 |
| book | Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | Lori Patton Davis | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays | 3 |
Getting closer. Let’s get rid of the .id and .attrs columns and then replace the column names with capitalized versions of those names.
books_xml <- xml %>%
  select(-c(.id, .attrs))
colnames(books_xml) <- c('Title','Author','Subject','Publisher','Comments')
knitr::kable(books_xml, format = "html")
| Title | Author | Subject | Publisher | Comments |
|---|---|---|---|---|
| Automate the Boring Stuff with Python | Al Sweigart | Python Programming | No Starch Press | Great book on how to automate yourself into a new job! I used this book to do just that. |
| A Long Way Gone: Memoirs of a Boy Soldier | Ishmael Beah | Memoir | Sarah Crichton Books | I went to college with the author and played intramural soccer with him not having the slightest idea of his past. |
| Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | Badia Ahad-Legardy | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays |
| Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | OiYan A. Poon | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays |
| Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | Lori Patton Davis | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays |
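The xmlToList and ldply step can be tried without a network connection. The following is a minimal sketch on an inline XML string; the sample document, its two books, and the variable names are made up for illustration and are not the contents of books.xml.

```r
library(XML)
library(plyr)

# Hypothetical stand-in for books.xml; the real file has more fields.
xml_text <- '<books>
  <book id="1"><title>Book A</title><author>Ann</author></book>
  <book id="2"><title>Book B</title><author>Bob</author></book>
</books>'

# Parse the string, then turn the tree into a nested list (one element per <book>).
book_list <- xmlToList(xmlParse(xml_text, asText = TRUE))

# Each list element becomes a one-row data frame; ldply stacks them.
sample_xml <- ldply(book_list, data.frame, stringsAsFactors = FALSE)

# Keep only the columns that came from child elements,
# dropping the .id and .attrs columns that ldply and xmlToList add.
sample_xml <- sample_xml[, c("title", "author")]
print(sample_xml)
```

Selecting columns by name, as in the last step, avoids depending on the position of the helper columns.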
Our data now looks pretty clean. Let’s move on to our HTML file.
HTML
We’ll continue to use the XML package, this time reading our HTML file with the readHTMLTable function. Like xmlToList, this function returns a list, so we’ll again use ldply to convert it to a data frame and see what this looks like. But first, we’ll parse the HTML file with htmlParse for use with readHTMLTable.
html_b <- getURL('https://chesterpoon8.github.io/books.html')
htmldoc <- htmlParse(html_b,encoding = "UTF-8")
html <- readHTMLTable(htmldoc)
html <- ldply(html, data.frame)
knitr::kable(html, format = "html")
| .id | Title | Author | Subject | Publisher | Comments |
|---|---|---|---|---|---|
| NULL | Automate the Boring Stuff with Python | Al Sweigart | Python Programming | No Starch Press | Great book on how to automate yourself into a new job! I used this book to do just that. |
| NULL | A Long Way Gone: Memoirs of a Boy Soldier | Ishmael Beah | Memoir | Sarah Crichton Books | I went to college with the author and played intramural soccer with him not having the slightest idea of his past. |
| NULL | Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | Badia Ahad-Legardy, OiYan A. Poon, & Lori Patton Davis | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays |
We have quite a bit of cleaning to do. First we have to get rid of our .id column. Then we’ll need to separate out the multiple authors for one of our books into their own columns and gather them back into a single Author column, one row per author, to keep our data clean. We can then filter out the NA values this leaves for books that have only one author. We’ll also need to remove the ampersand in front of the third author of one of our books.
books_html <- html %>%
  select(-.id) %>%
  separate(Author, sep = ", ", c('a1','a2','a3')) %>%
  gather("x","Author",2:4) %>%
  filter(!is.na(Author)) %>%
  select(-x)
books_html$Author <- gsub("\\&[[:blank:]]","", books_html$Author)
books_html <- books_html[,c('Title','Author','Subject','Publisher','Comments')]
knitr::kable(books_html, format = "html")
| Title | Author | Subject | Publisher | Comments |
|---|---|---|---|---|
| Automate the Boring Stuff with Python | Al Sweigart | Python Programming | No Starch Press | Great book on how to automate yourself into a new job! I used this book to do just that. |
| A Long Way Gone: Memoirs of a Boy Soldier | Ishmael Beah | Memoir | Sarah Crichton Books | I went to college with the author and played intramural soccer with him not having the slightest idea of his past. |
| Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | Badia Ahad-Legardy | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays |
| Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | OiYan A. Poon | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays |
| Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | Lori Patton Davis | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays |
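The separate, gather, and gsub steps can also be collapsed into one call. The sketch below swaps in tidyr’s separate_rows, which is not what the code above uses, and runs it on made-up rows shaped like the HTML table; the sample data and the regular expression are my own assumptions.

```r
library(tidyr)

# Made-up rows that mirror the shape of the HTML table.
sample_html <- data.frame(
  Title  = c("Book A", "Book B"),
  Author = c("Ann", "Bob, Cy, & Dee"),
  stringsAsFactors = FALSE
)

# Split on a comma, optional whitespace, and an optional "& ",
# producing one row per author and dropping the ampersand in the same step.
long_authors <- separate_rows(sample_html, Author, sep = ",\\s*(&\\s*)?")
print(long_authors)
```

This approach does not need to know the maximum number of authors in advance, whereas separate requires one target column per author.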
As you can see, this data frame is exactly the same as the data frame we created from our XML file. Let’s try working with our JSON file now.
JSON
The most challenging file to work with is our JSON file. We’ll use jsonlite to try to convert this to a dataframe.
First, we’ll unlist the elements and store them in a variable. This next part is pretty tricky, and my strategy below mimics the one in our textbook, Automated Data Collection with R. They explain it best:
First, we transpose each list element, turn them into data frames, and finally make use of the rbind.fill() function of the plyr package to combine the data frames into one single data frame, taking care of the fact that some variables do not exist in some data frames.
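To see what rbind.fill contributes in that recipe, here is a small sketch on two hand-written records, one of which lacks a field. The records and variable names are invented for illustration; only t, data.frame, and plyr’s rbind.fill come from the approach described above.

```r
library(plyr)

# Two made-up records as named character vectors; the second has no subject.
records <- list(
  c(title = "Book A", author = "Ann", subject = "Math"),
  c(title = "Book B", author = "Bob")
)

# Transpose each vector into a one-row matrix, then make it a data frame.
frames <- lapply(records, function(rec) {
  data.frame(t(rec), stringsAsFactors = FALSE)
})

# rbind.fill stacks the frames and fills columns a frame lacks with NA.
combined <- rbind.fill(frames)
print(combined)
```

A plain rbind would stop with an error here because the two data frames do not have the same columns.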
Following those steps, we end up with a data frame that has to be transposed once more to resemble our final data frame more closely. Let’s take a look to see what we have so far.
json_doc <- rawToChar(GET('http://chesterpoon8.github.io/books.json')$content)
json <- fromJSON(json_doc)
json_unlist <- sapply(json[[1]],unlist)
books_json <- t(do.call("rbind.fill", lapply(lapply(json_unlist, t),
                        data.frame, stringsAsFactors = FALSE)))
knitr::kable(books_json, format = "html")
| X1 | Automate the Boring Stuff with Python | Al Sweigart | Python Programming | No Starch Press | Great book on how to automate yourself into a new job! I used this book to do just that. |
| X2 | A Long Way Gone: Memoirs of a Boy Soldier | Ishmael Beah | Memoir | Sarah Crichton Books | I went to college with the author and played intramural soccer with him not having the slightest idea of his past. |
| X3 | Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | Badia Ahad-Legardy | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays |
| X4 | NA | OiYan A. Poon | NA | NA | NA |
| X5 | NA | Lori Patton Davis | NA | NA | NA |
As we can see, there are no column names in our table. However, there are row names in our table that we should remove. Also, our “data frame” is not being recognized as such.
class(books_json)
## [1] "matrix"
Let’s remove the row names and add the appropriate column names. We’ll also convert the matrix to be a data frame.
rownames(books_json) <- c()
colnames(books_json) <- c('Title','Author','Subject','Publisher','Comments')
books_json <- data.frame(books_json)
knitr::kable(books_json, format = "html")
| Title | Author | Subject | Publisher | Comments |
|---|---|---|---|---|
| Automate the Boring Stuff with Python | Al Sweigart | Python Programming | No Starch Press | Great book on how to automate yourself into a new job! I used this book to do just that. |
| A Long Way Gone: Memoirs of a Boy Soldier | Ishmael Beah | Memoir | Sarah Crichton Books | I went to college with the author and played intramural soccer with him not having the slightest idea of his past. |
| Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | Badia Ahad-Legardy | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays |
| NA | OiYan A. Poon | NA | NA | NA |
| NA | Lori Patton Davis | NA | NA | NA |
Much better, but every column except Author is NA for the 2nd and 3rd authors of our 3rd book. In this case, we need to replace each NA with the previous value in its column to fill out the data frame properly.
books_json[books_json$Author=="","Author"] <- NA
books_json <- books_json %>%
  do(na.locf(.))
knitr::kable(books_json, format = "html")
| Title | Author | Subject | Publisher | Comments |
|---|---|---|---|---|
| Automate the Boring Stuff with Python | Al Sweigart | Python Programming | No Starch Press | Great book on how to automate yourself into a new job! I used this book to do just that. |
| A Long Way Gone: Memoirs of a Boy Soldier | Ishmael Beah | Memoir | Sarah Crichton Books | I went to college with the author and played intramural soccer with him not having the slightest idea of his past. |
| Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | Badia Ahad-Legardy | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays |
| Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | OiYan A. Poon | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays |
| Difficult Subjects: Insights and Strategies for Teaching About Race, Sexuality, and Gender | Lori Patton Davis | Education | Stylus Publishing | My sister is one of the editors to this anthology of academic essays |
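The same forward fill can be written with tidyr, which is already loaded. This is an alternative sketch rather than the na.locf call used above, and it runs on a made-up data frame with the same pattern of gaps.

```r
library(tidyr)

# Made-up rows: the last two authors belong to "Book B" but lack its details.
sample_json <- data.frame(
  Title   = c("Book A", "Book B", NA, NA),
  Author  = c("Ann", "Bob", "Cy", "Dee"),
  Subject = c("Math", "Art", NA, NA),
  stringsAsFactors = FALSE
)

# Carry the last non-missing value downward in the named columns only.
filled <- fill(sample_json, Title, Subject, .direction = "down")
print(filled)
```

Naming the columns to fill makes it explicit that Author is never overwritten.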
Now our data frame looks just like our first two.
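Comparing the tables by eye works for five rows, but the claim can also be checked in code. The sketch below uses a hypothetical helper, normalize_books, and two made-up data frames; it assumes the only differences worth ignoring are column types (factor versus character, which depends on the R version and on stringsAsFactors) and row names.

```r
# Hypothetical helper: coerce every column to character and reset row names
# so that only the cell values and column names are compared.
normalize_books <- function(df) {
  out <- data.frame(lapply(df, as.character), stringsAsFactors = FALSE)
  rownames(out) <- NULL
  out
}

# Made-up frames with the same values but different column types.
frame_a <- data.frame(Title = c("Book A", "Book B"),
                      Author = c("Ann", "Bob"),
                      stringsAsFactors = TRUE)
frame_b <- data.frame(Title = c("Book A", "Book B"),
                      Author = c("Ann", "Bob"),
                      stringsAsFactors = FALSE)

# all.equal returns TRUE or a description of the differences.
same <- isTRUE(all.equal(normalize_books(frame_a), normalize_books(frame_b)))
print(same)
```

The same check could be applied to books_xml, books_html, and books_json in pairs.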