CUNY DATA607_Wk7_Zherold

In this assignment, I load HTML, XML, and JSON files, which were written in Notepad and uploaded to GitHub, into R as dataframes.

library(XML)
library(xml2)
library(jsonlite)
library(httr)
library(knitr)

HTML Table >> R, using httr’s GET()…

dfhtml <- GET("https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_html.html")
dfhtml

## Response [https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_html.html]
##   Date: 2018-10-14 16:41
##   Status: 200
##   Content-Type: text/plain; charset=utf-8
##   Size: 449 B
## <U+FEFF><table>
## <tr>
## <th>book </th>
## <th>first_name</th>
## <th>last_name</th>
## <th>year</th>
## </tr>
## <tr>
## <td>Imperialism and World Economy</td>
## <td>Nikolai</td>
## ...

class(dfhtml)

## [1] "response"

… then converting it into a dataframe with readHTMLTable()

I am not entirely sure why I had to repeat the GET() call in this chunk. Without it, I get: Error in rawToChar(dfhtml$content) : argument ‘x’ must be a raw vector

## Load HTML Table
dfhtml <- GET("https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_html.html")
dfhtml <- as.data.frame(readHTMLTable(rawToChar(dfhtml$content), stringsAsFactors = FALSE))
names(dfhtml) <- c("Title", "Author_First", "Author_Last", "DoP")
kable(dfhtml)

Title	Author_First	Author_Last	DoP
Imperialism and World Economy	Nikolai	Bukharin	1972
Why Marx Was Right	Terry	Eagleton	2011
Ruling America: A History of Wealth and Power in a Democracy	Steve, Gary	Fraser, Gerstle	2005

class(dfhtml)

## [1] "data.frame"
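A possible explanation for the rawToChar() error mentioned above: the chunk reassigns dfhtml from an httr response to a data frame, so running it a second time without GET() finds no raw content to convert. The sketch below reproduces the message offline. It is only an illustration under that assumption; fake_response and overwritten are hypothetical stand-ins, not real httr objects.

```r
# Stand-in for an httr response: a list whose `content` element is a raw vector.
# (fake_response is a hypothetical name used only for this illustration.)
fake_response <- list(content = charToRaw("<table><tr><td>x</td></tr></table>"))

# First run: `content` is raw, so rawToChar() succeeds.
html_text <- rawToChar(fake_response$content)

# Simulate the chunk overwriting the response with a data frame.
overwritten <- data.frame(Title = "x", stringsAsFactors = FALSE)

# Second run: the data frame has no `content` column, so `$content` is NULL
# and rawToChar() raises an error, which is captured here as a string.
err <- tryCatch(
  rawToChar(overwritten$content),
  error = function(e) conditionMessage(e)
)
print(err)
```

If this is indeed the cause, keeping the response and the data frame under separate names would make the chunk safe to re-run.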

Further manipulation is required to separate the multiple authors. One possible approach: first create empty Author_First2 and Author_Last2 columns, then split the Author columns at the comma with strsplit() (or stringr’s str_split()), unlisting the result to fill the new columns.
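The splitting idea described above could look roughly like this in base R. This is a minimal sketch on a toy data frame that mirrors two rows of the table; the helper nth_piece is a hypothetical name, and it assumes at most two authors per book, separated by commas.

```r
# Toy data frame mirroring two rows of the table above.
books <- data.frame(
  Title        = c("Why Marx Was Right",
                   "Ruling America: A History of Wealth and Power in a Democracy"),
  Author_First = c("Terry", "Steve, Gary"),
  Author_Last  = c("Eagleton", "Fraser, Gerstle"),
  stringsAsFactors = FALSE
)

# Split each cell at the comma (and any spaces after it).
# The result is a list with one character vector per row.
first_parts <- strsplit(books$Author_First, ",\\s*")
last_parts  <- strsplit(books$Author_Last,  ",\\s*")

# Hypothetical helper: take the i-th piece of each split, NA when it is absent.
nth_piece <- function(parts, i) {
  vapply(parts,
         function(p) if (length(p) >= i) p[i] else NA_character_,
         character(1))
}

# Fill the second-author columns first, then overwrite the originals.
books$Author_First2 <- nth_piece(first_parts, 2)
books$Author_Last2  <- nth_piece(last_parts, 2)
books$Author_First  <- nth_piece(first_parts, 1)
books$Author_Last   <- nth_piece(last_parts, 1)

print(books)
```

Books with a single author end up with NA in the second-author columns, which keeps the table rectangular.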

XML Document >> R, using GET() and xmlParse()…

The httr package proved more reliable than RCurl when knitting the R Markdown file to HTML. With getURL(), readHTMLTable() reported missing function calls (handlers, etc.).

## Load XML File
url.xml <- GET("https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_xml.xml")
dfxml <- xmlParse(url.xml)
dfxml

## <?xml version="1.0" encoding="UTF-8"?>
## <books>
##   <book pages="178" publisher="Cornell University">
##     <title>Imperialism and World Economy</title>
##     <author>
##       <first_name>Nikolai</first_name>
##       <last_name>Bukharin</last_name>
##     </author>
##     <year>1972</year>
##   </book>
##   <book pages="258" publisher="Yale University">
##     <title>Why Marx Was Right</title>
##     <author>
##       <first_name>Terry</first_name>
##       <last_name>Eagleton</last_name>
##     </author>
##     <year>2011</year>
##   </book>
##   <book pages="368" publisher="Harvard University">
##     <title>Ruling America: A History of Wealth and Power in a Democracy</title>
##     <author>
##       <first_name>Steve</first_name>
##       <last_name>Fraser</last_name>
##       <first_name>Gary</first_name>
##       <last_name>Gerstle</last_name>
##     </author>
##     <year>2005</year>
##   </book>
## </books>
##

…and then xmlToDataFrame()

books.df <- xmlToDataFrame(dfxml, stringsAsFactors = FALSE)
kable(books.df)

title	author	year
Imperialism and World Economy	NikolaiBukharin	1972
Why Marx Was Right	TerryEagleton	2011
Ruling America: A History of Wealth and Power in a Democracy	SteveFraserGaryGerstle	2005

The XML file should be corrected so that each author gets its own <author> element, as follows:

<author>
  <first_name>Steve</first_name>
  <last_name>Fraser</last_name>
</author>  
<author>
  <first_name>Gary</first_name>
  <last_name>Gerstle</last_name>
</author>
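With one <author> element per person, the authors can be pulled out individually instead of being pasted together. The sketch below is one way to do that with the XML package's XPath helpers, using an inline XML string rather than the GitHub file; the attributes are omitted and the one-row-per-author layout is my own choice for illustration.

```r
library(XML)

# Inline copy of two books, using the corrected structure shown above.
xml_text <- '<books>
  <book>
    <title>Why Marx Was Right</title>
    <author><first_name>Terry</first_name><last_name>Eagleton</last_name></author>
    <year>2011</year>
  </book>
  <book>
    <title>Ruling America: A History of Wealth and Power in a Democracy</title>
    <author><first_name>Steve</first_name><last_name>Fraser</last_name></author>
    <author><first_name>Gary</first_name><last_name>Gerstle</last_name></author>
    <year>2005</year>
  </book>
</books>'

doc <- xmlParse(xml_text, asText = TRUE)

# One node per <author>; build one row per author, looking up the
# title and year on the parent <book> with relative XPath expressions.
author_nodes <- getNodeSet(doc, "//book/author")
authors_df <- do.call(rbind, lapply(author_nodes, function(a) {
  data.frame(
    title      = xpathSApply(a, "../title", xmlValue),
    first_name = xpathSApply(a, "./first_name", xmlValue),
    last_name  = xpathSApply(a, "./last_name", xmlValue),
    year       = xpathSApply(a, "../year", xmlValue),
    stringsAsFactors = FALSE
  )
}))

print(authors_df)
```

The two-author book then appears as two rows sharing the same title and year.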

JSON Document >> R, using httr’s GET()

## Load JSON File
url.json <- GET("https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_json.json")
url.json

## Response [https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_json.json]
##   Date: 2018-10-14 16:41
##   Status: 200
##   Content-Type: text/plain; charset=utf-8
##   Size: 597 B
## <U+FEFF>{"social economy books" :[
## {
## "title" : "Imperialism and World Economy",
## "author_first" : "Nikolai",
## "author_last": "Bukharin",
## "year" : "1972",
## "pages": 178,
## "publisher" : "Cornell University"
## },
## {
## ...

… then readLines(), overwriting the first line to edit out the garbage (the <U+FEFF> byte order mark shown above) at the beginning of the JSON string…

url.json <- "https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_json.json"
json.data <- readLines(url.json)

## Warning in readLines(url.json): incomplete final line found on
## 'https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/
## books_json.json'

json.data[[1]] <- "{\"social economy books\" :["
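Overwriting the first line by hand works for this file, but it hard-codes the file's contents. A more general alternative is to strip the byte order mark itself. The sketch below does that at the byte level in base R; strip_bom is a hypothetical helper name, and json.lines stands in for the readLines() output so the example runs offline.

```r
# Drop a leading UTF-8 byte order mark (bytes EF BB BF) from a single string.
# strip_bom is a hypothetical helper, not part of jsonlite or httr.
strip_bom <- function(x) {
  bytes <- charToRaw(x)
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  if (length(bytes) >= 3 && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  out <- rawToChar(bytes)
  Encoding(out) <- "UTF-8"  # the remaining bytes are still UTF-8 text
  out
}

# Stand-in for readLines() output whose first line starts with a BOM.
json.lines <- c("\ufeff{\"social economy books\" :[", "]}")
json.lines[1] <- strip_bom(json.lines[1])
print(json.lines[1])
```

Strings without a byte order mark pass through unchanged, so the helper can be applied unconditionally to the first line.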

Checking the class of the readLines() output. I had difficulty using fromJSON() without first going through readLines().

class(json.data)

## [1] "character"

… and finally coercing to a dataframe with the as.data.frame() function.

dfjson <- fromJSON(json.data)
dfjson <- as.data.frame(dfjson)
names(dfjson) <- c("Title","Author_First","Author_Last","Year","Pages","Publisher")
kable(dfjson)

Title	Author_First	Author_Last	Year	Pages	Publisher
Imperialism and World Economy	Nikolai	Bukharin	1972	178	Cornell University
Why Marx Was Right	Terry	Eagleton	2011	258	Yale University
Ruling America: A History of Wealth and Power in a Democracy	c("Steve", "Gary")	c("Fraser", "Gerstle")	2005	368	Harvard University
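The c("Steve", "Gary") cells in the last row suggest that the author fields came through as list columns. A possible cleanup is to collapse each list element into a single comma-separated string. The sketch below assumes the two-author book stores its names as JSON arrays (I have not reproduced the full file here) and uses an inline JSON string with only the author fields.

```r
library(jsonlite)

# Inline JSON mirroring two of the books; the array-valued author fields
# are an assumption about how the original file encodes multiple authors.
json_text <- '{"social economy books": [
  {"title": "Why Marx Was Right",
   "author_first": "Terry", "author_last": "Eagleton"},
  {"title": "Ruling America: A History of Wealth and Power in a Democracy",
   "author_first": ["Steve", "Gary"], "author_last": ["Fraser", "Gerstle"]}
]}'

books <- fromJSON(json_text)[["social economy books"]]

# Find the list columns and collapse each cell to one string.
is_list_col <- vapply(books, is.list, logical(1))
books[is_list_col] <- lapply(books[is_list_col], function(col) {
  vapply(col, paste, character(1), collapse = ", ")
})

print(books)
```

After this step every column is an ordinary character vector, so the comma-splitting approach discussed for the HTML table applies here as well.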