Select books on your favorite subject; one of the books should be by multiple authors. Use title, author, and two or three other attributes to describe the books. Create an XML, an HTML, and a JSON file, each with a table of the books and attributes. Using R, read, load, and compare the tables.
For my book selections, I took the reading list from my undergrad course “Workshop in Urban Studies,” which used L.A. as our lab. It was a transformative course for me and led me down my eventual career path working within the homeless response system.
Initially, I had a modest set of libraries. As the project went on, I added more and more to fix individual issues. With so few records to manage, there’s no real performance cost to loading the extras.
library(dplyr)     # data manipulation: rename() and the %>% pipe
library(jsonlite)  # fromJSON() for the JSON table
library(rvest)     # HTML scraping helpers
library(XML)       # htmlParse(), xmlParse(), readHTMLTable(), xmlToDataFrame()
library(xml2)      # read_html() and read_xml() for pulling files from the web
library(lemon)     # nicer table printing in the knitted document

# Make lemon's renderer the default for printing data frames
knit_print.data.frame <- lemon_print
I was adamant with myself about accessing the files directly from the web, since we’ll need to do that in future assignments and projects; might as well get used to it now. After creating each table by hand (the HTML was my favorite), I loaded them onto my GitHub.
books_jsonURL <- "https://raw.githubusercontent.com/iscostello/Data607/master/bookTable.json"
books_HTMLURL <- "https://raw.githubusercontent.com/iscostello/Data607/master/bookTable.html"
books_XMLURL <- "https://raw.githubusercontent.com/iscostello/Data607/master/bookTable.xml"
First up was the JSON, and it proved the easiest to deal with in a few respects. Using the jsonlite package, fromJSON() was the only function I really needed. After a bit of trial and error, as.data.frame() from base R worked just fine to coerce the result.
One issue I ran into when I later compared the XML, JSON, and HTML tables is that the JSON version came through with the nesting baked into its column names (Books.List.Title and so on). I had to rename the columns after the fact to line them up with the XML and HTML tables.
books_json <- fromJSON(books_jsonURL) %>% as.data.frame()

books_json2 <- books_json %>%
  rename(Title     = Books.List.Title,
         Author    = Books.List.Author,
         Published = Books.List.Published,
         Publisher = Books.List.Publisher,
         Subject   = Books.List.Subject)
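A shortcut I only noticed afterwards: assuming the file nests the records under a Books > List element (which the flattened Books.List.* column names suggest), indexing into the parsed list directly would return the columns with their original names and skip the rename entirely. A sketch:

# Sketch: drill into the nested list so no rename is needed (this assumes a
# Books > List structure in the JSON, as the Books.List.* names suggest)
books_json_alt <- fromJSON(books_jsonURL)$Books$List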
Printing the tables with the defaults looked awful, so I hunted around for a way to print sensible-looking tables and found the lemon package, along with instructions on getting it working. Another option was the DT package, but I found it too bulky for just four rows of data; still, it’s good to keep in my back pocket.
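For completeness, knitr’s own kable() is another lightweight way to render a small data frame as a markdown table. A minimal sketch:

# Sketch: rendering the same table with knitr::kable() instead of lemon
knitr::kable(books_json2)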
head(books_json2)
| Title | Author | Published | Publisher | Subject |
|---|---|---|---|---|
| City of Quartz | Mike Davis | 1990 | Verso Books | History |
| The Reluctant Metropolis: The Politics of Urban Growth in Los Angeles | William Fulton | 1997 | The Johns Hopkins University Press | Politics |
| The Next Los Angeles: The Struggle for a Livable City | Robert Gottlieb, Regina Freer, Eric Garcetti, Peter Dreier | 2006 | University of California Press | Urban Planning |
| The Death and Life of Great American Cities | Jane Jacobs | 1992 | Vintage Books | Urban Planning |
Next up was HTML. I tried to pipe the steps together but could not get it working properly, so I eventually went the long way and declared each step in the process. (A pipeable alternative I found later is sketched after the table below.)
With both HTML and XML, I noticed late that the parsing and table functions expect actual files or parsed documents; passing them the raw GitHub URLs did not work. Once I caught the read, then parse, then table pattern, I was able to manage both HTML and XML.
books_htmlread <- read_html(books_HTMLURL)        # fetch the page from GitHub
books_htmlparsed <- htmlParse(books_htmlread)     # parse it into an HTML document
books_htmltable <- readHTMLTable(books_htmlparsed, stringsAsFactors = FALSE)  # extract the table(s)
books_htmltable <- books_htmltable[[1]]           # keep the first (and only) table
head(books_htmltable)
| Title | Author | Published | Publisher | Subject |
|---|---|---|---|---|
| City of Quartz | Mike Davis | 1990 | Verso Books | History |
| The Reluctant Metropolis: The Politics of Urban Growth in Los Angeles | William Fulton | 1997 | The Johns Hopkins University Press | Politics |
| The Next Los Angeles: The Struggle for a Livable City | Robert Gottlieb, Regina Freer, Eric Garcetti, Peter Dreier | 2006 | University of California Press | Urban Planning |
| The Death and Life of Great American Cities | Jane Jacobs | 1992 | Vintage Books | Urban Planning |
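The pipeable alternative mentioned above: rvest on its own can likely do this in one chain with html_node() and html_table(). This is a sketch of what I would try, not what I actually ran, and the column types may come out slightly differently.

# Sketch: a single rvest pipe from URL to data frame (an alternative to the
# read/parse/table steps above)
books_html_alt <- read_html(books_HTMLURL) %>%
  html_node("table") %>%   # grab the first <table> element on the page
  html_table()             # convert it to a data frame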
I was least familiar with the structure of XML. When I started getting errors loading it into R, I assumed the problem was how I had built the XML table; I had copied the structure from the book Automated Data Collection with R, so naturally I must have botched something in the source. An online XML validator didn’t catch any errors, which was reassuring for my XML skills and ruled out the file itself as the source of the problem.
It’s here I noticed that the parsing function was asking for a document or file. I put the read function in front of it and everything worked with no trouble. I was also able to pipe these functions together, which was really exciting. I tried to get the HTML section to do the same, emulating the XML structure, but couldn’t get it working.
books_XML <- read_xml(books_XMLURL) %>%  # fetch the XML from GitHub
  xmlParse() %>%                         # parse it into an XML document
  xmlToDataFrame()                       # flatten the records into a data frame
head(books_XML)
| Title | Author | Published | Publisher | Subject |
|---|---|---|---|---|
| City of Quartz | Mike Davis | 1990 | Verso Books | History |
| The Reluctant Metropolis: The Politics of Urban Growth in Los Angeles | William Fulton | 1997 | The Johns Hopkins University Press | Politics |
| The Next Los Angeles: The Struggle for a Livable City | Robert Gottlieb, Regina Freer, Eric Garcetti, Peter Dreier | 2006 | University of California Press | Urban Planning |
| The Death and Life of Great American Cities | Jane Jacobs | 1992 | Vintage Books | Urban Planning |
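As a side note, the same table can probably be pulled without xml2 at all by fetching the raw text first and telling xmlParse() to treat it as text. A sketch, assuming readLines() can reach the https URL on your setup:

# Sketch: an XML-package-only route; fetch the raw text, then parse it as text
books_xml_text <- paste(readLines(books_XMLURL), collapse = "\n")
books_XML_alt <- xmlToDataFrame(xmlParse(books_xml_text, asText = TRUE))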
Base R has a handy comparison function, all.equal(). I used it to compare the XML table against both the HTML and JSON tables, and it returned TRUE for both (eventually, after I renamed the JSON columns). By the transitive property, since XML = HTML and HTML = JSON, then XML = JSON as well.
all.equal(books_XML, books_htmltable)
## [1] TRUE
all.equal(books_htmltable, books_json2)
## [1] TRUE
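Just to close the loop without leaning on the transitive property, the remaining pair can be checked directly the same way:

# Direct check of the remaining pair (redundant, given the two results above)
all.equal(books_XML, books_json2)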
I learned a few lessons about bringing in data from the web, particularly the read, parse, table order of operations. Most importantly, I learned to read the fine print on a function’s arguments; it would have saved me some time hunting for bugs when all I needed to do was turn my URL into something R could read. I also got a bit more exposure to the power of dplyr and to additional formatting options.