In this assignment, I load HTML, XML, and JSON files, which were written in Notepad and uploaded to GitHub, into R as data frames.

library(XML)
library(xml2)
library(jsonlite)
library(httr)
library(knitr)

HTML Table >> R, using httr’s GET()…

dfhtml <- GET("https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_html.html")
dfhtml 
## Response [https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_html.html]
##   Date: 2018-10-14 16:41
##   Status: 200
##   Content-Type: text/plain; charset=utf-8
##   Size: 449 B
## <U+FEFF><table>
## <tr>
## <th>book </th>
## <th>first_name</th>
## <th>last_name</th>
## <th>year</th>
## </tr>
## <tr>
## <td>Imperialism and World Economy</td>
## <td>Nikolai</td>
## ...
class(dfhtml)
## [1] "response"

… then converting it into a dataframe with readHTMLTable()

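I was initially unsure why the GET() call has to be repeated in this chunk. Without it, I get Error in rawToChar(dfhtml$content) : argument 'x' must be a raw vector. The most likely cause is that this chunk overwrites dfhtml with the data frame, so when the chunk is re-run, dfhtml$content is NULL rather than the raw response body.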

## Load HTML Table
dfhtml <- GET("https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_html.html")
dfhtml <- as.data.frame(readHTMLTable(rawToChar(dfhtml$content), stringsAsFactors = FALSE))
names(dfhtml) <- c("Title", "Author_First", "Author_Last", "DoP")
kable(dfhtml) 
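|Title                                                        |Author_First |Author_Last     |DoP  |
|:------------------------------------------------------------|:------------|:---------------|:----|
|Imperialism and World Economy                                |Nikolai      |Bukharin        |1972 |
|Why Marx Was Right                                           |Terry        |Eagleton        |2011 |
|Ruling America: A History of Wealth and Power in a Democracy |Steve, Gary  |Fraser, Gerstle |2005 |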
class(dfhtml)
## [1] "data.frame"
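
The rawToChar() error mentioned before this chunk can be reproduced without any network access. The sketch below assumes the cause is that dfhtml already holds a data frame when the chunk is re-run; `already_converted` and `resp_bytes` are hypothetical stand-ins, not objects from the assignment.

```r
# A data frame has no `content` element, so `$content` returns NULL.
already_converted <- data.frame(Title = "Why Marx Was Right")
content_missing <- is.null(already_converted$content)

# rawToChar() only accepts raw vectors, so passing NULL raises an error.
msg <- tryCatch(
  rawToChar(already_converted$content),
  error = function(e) conditionMessage(e)
)

# Keeping the response bytes and the data frame under different names avoids
# the problem. `resp_bytes` stands in for the `content` field of a response.
resp_bytes <- charToRaw("<table><tr><th>book</th></tr></table>")
html_text  <- rawToChar(resp_bytes)
```

Storing the GET() result under its own name (for example a separate response variable) would let the conversion chunk be re-run without fetching the file again.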

Further manipulation is required to separate the multiple authors. One possible approach is to first create empty Author_First2 and Author_Last2 columns, then split the Author columns at the comma with strsplit() (base R) or stringr’s str_split(), unlisting the result as needed.
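
A minimal sketch of that split, using base R only. The `books` data frame is a shortened toy copy of the table above and the `pick()` helper is hypothetical; this is one way to do it, not necessarily the best one.

```r
# Toy copy of the table above; in the assignment this would be dfhtml.
books <- data.frame(
  Title        = c("Why Marx Was Right", "Ruling America"),
  Author_First = c("Terry", "Steve, Gary"),
  Author_Last  = c("Eagleton", "Fraser, Gerstle"),
  stringsAsFactors = FALSE
)

# Split each cell at the comma (plus any following spaces). The result is a
# list holding one character vector per row.
first <- strsplit(books$Author_First, ",\\s*")
last  <- strsplit(books$Author_Last,  ",\\s*")

# Take the i-th element of every vector; rows with fewer authors get NA.
pick <- function(parts, i) vapply(parts, function(p) p[i], character(1))

books$Author_First  <- pick(first, 1)
books$Author_Last   <- pick(last, 1)
books$Author_First2 <- pick(first, 2)
books$Author_Last2  <- pick(last, 2)
```

Single-author rows end up with NA in the second pair of columns, which keeps the table rectangular.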

XML Document >> R, using GET() and xmlParse()…

## Load XML File
url.xml <- GET("https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_xml.xml")
dfxml <- xmlParse(url.xml)
dfxml
## <?xml version="1.0" encoding="UTF-8"?>
## <books>
##   <book pages="178" publisher="Cornell University">
##     <title>Imperialism and World Economy</title>
##     <author>
##       <first_name>Nikolai</first_name>
##       <last_name>Bukharin</last_name>
##     </author>
##     <year>1972</year>
##   </book>
##   <book pages="258" publisher="Yale University">
##     <title>Why Marx Was Right</title>
##     <author>
##       <first_name>Terry</first_name>
##       <last_name>Eagleton</last_name>
##     </author>
##     <year>2011</year>
##   </book>
##   <book pages="368" publisher="Harvard University">
##     <title>Ruling America: A History of Wealth and Power in a Democracy</title>
##     <author>
##       <first_name>Steve</first_name>
##       <last_name>Fraser</last_name>
##       <first_name>Gary</first_name>
##       <last_name>Gerstle</last_name>
##     </author>
##     <year>2005</year>
##   </book>
## </books>
## 

…and then xmlToDataFrame()

books.df <- xmlToDataFrame(dfxml, stringsAsFactors = FALSE)
kable(books.df)
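|title                                                        |author                 |year |
|:------------------------------------------------------------|:----------------------|:----|
|Imperialism and World Economy                                |NikolaiBukharin        |1972 |
|Why Marx Was Right                                           |TerryEagleton          |2011 |
|Ruling America: A History of Wealth and Power in a Democracy |SteveFraserGaryGerstle |2005 |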

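The concatenated author names indicate that the XML file should be corrected so that each author has a separate <author> element, as follows: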

<author>
  <first_name>Steve</first_name>
  <last_name>Fraser</last_name>
</author>  
<author>
  <first_name>Gary</first_name>
  <last_name>Gerstle</last_name>
</author>
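
With one <author> element per person, the authors can be read out one row at a time. The sketch below is an assumption about how that might look: it uses xml2 (loaded at the top, but not otherwise used in this assignment) instead of xmlToDataFrame(), and `fixed_xml` is an inline copy of the third book with the corrected structure.

```r
library(xml2)

# Inline copy of the third book with one <author> element per person.
fixed_xml <- '<books>
  <book pages="368" publisher="Harvard University">
    <title>Ruling America: A History of Wealth and Power in a Democracy</title>
    <author><first_name>Steve</first_name><last_name>Fraser</last_name></author>
    <author><first_name>Gary</first_name><last_name>Gerstle</last_name></author>
    <year>2005</year>
  </book>
</books>'

doc     <- read_xml(fixed_xml)
authors <- xml_find_all(doc, "//book/author")

# One row per <author>; title and year are looked up on the parent <book>.
authors.df <- data.frame(
  title      = xml_text(xml_find_first(authors, "../title")),
  first_name = xml_text(xml_find_first(authors, "first_name")),
  last_name  = xml_text(xml_find_first(authors, "last_name")),
  year       = xml_text(xml_find_first(authors, "../year")),
  stringsAsFactors = FALSE
)
```

This gives a long table (one row per author) rather than extra Author_First2 and Author_Last2 columns; which layout is preferable depends on what the data is used for.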

JSON Document >> R, using httr’s GET()…

## Load JSON File
url.json <- GET("https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_json.json")
url.json
## Response [https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_json.json]
##   Date: 2018-10-14 16:41
##   Status: 200
##   Content-Type: text/plain; charset=utf-8
##   Size: 597 B
## <U+FEFF>{"social economy books" :[
## {
## "title" : "Imperialism and World Economy",
## "author_first" : "Nikolai",
## "author_last": "Bukharin",
## "year" : "1972",
## "pages": 178,
## "publisher" : "Cornell University"
## },
## {
## ...

… then readLines(), replacing the first line to drop the byte order mark (the <U+FEFF> visible above) at the beginning of the JSON string…

url.json <- "https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_json.json"
json.data <- readLines(url.json)
## Warning in readLines(url.json): incomplete final line found on
## 'https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/
## books_json.json'
json.data[[1]] <- "{\"social economy books\" :["

Checking the class of the readLines() output. I had difficulty getting fromJSON() to work without going through readLines() first.

class(json.data)
## [1] "character"
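
Overwriting the first line by hand works here, but it depends on knowing the file's exact first line. A hedged alternative is to let R drop the byte order mark while reading, via the "UTF-8-BOM" encoding accepted by file(). The sketch below runs on a small temporary file with made-up contents rather than the GitHub URL, and assumes the mark is the UTF-8 one (bytes EF BB BF), presumably added by Notepad.

```r
library(jsonlite)

# Write a small JSON file that starts with a UTF-8 byte order mark.
path <- tempfile(fileext = ".json")
bom  <- as.raw(c(0xEF, 0xBB, 0xBF))
body <- charToRaw('{"books": [{"title": "Why Marx Was Right", "pages": 258}]}')
writeBin(c(bom, body), path)

# Opening the connection with encoding "UTF-8-BOM" removes the mark on read.
con   <- file(path, encoding = "UTF-8-BOM")
lines <- readLines(con, warn = FALSE)
close(con)

# Collapse the lines into one string before parsing.
parsed <- fromJSON(paste(lines, collapse = "\n"))
```

Here `warn = FALSE` also silences the "incomplete final line" warning seen above, which only reports that the file does not end with a newline.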

… and finally parsing with fromJSON() and coercing to a data frame with the as.data.frame() function.

dfjson <- fromJSON(json.data)
dfjson <- as.data.frame(dfjson)
names(dfjson) <- c("Title","Author_First","Author_Last","Year","Pages","Publisher")
kable(dfjson)
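|Title                                                        |Author_First        |Author_Last            |Year |Pages |Publisher          |
|:------------------------------------------------------------|:-------------------|:----------------------|:----|:-----|:------------------|
|Imperialism and World Economy                                |Nikolai             |Bukharin               |1972 |178   |Cornell University |
|Why Marx Was Right                                           |Terry               |Eagleton               |2011 |258   |Yale University    |
|Ruling America: A History of Wealth and Power in a Democracy |c("Steve", "Gary")  |c("Fraser", "Gerstle") |2005 |368   |Harvard University |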
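
The c("Steve", "Gary") entries show that fromJSON() stored the author fields as list columns. One possible cleanup, sketched below on a shortened toy JSON string (an assumption about the file's layout, not a copy of it), collapses each list element into a single comma-separated string.

```r
library(jsonlite)

# Toy JSON: the second book lists its authors as arrays.
json_txt <- '{"social economy books": [
  {"title": "Why Marx Was Right",
   "author_first": "Terry", "author_last": "Eagleton"},
  {"title": "Ruling America",
   "author_first": ["Steve", "Gary"], "author_last": ["Fraser", "Gerstle"]}
]}'

books <- fromJSON(json_txt)[[1]]

# Find the list columns and collapse each element to one string per row.
is_list_col <- vapply(books, is.list, logical(1))
books[is_list_col] <- lapply(
  books[is_list_col],
  function(col) vapply(col, paste, character(1), collapse = ", ")
)
```

After this step every column is a plain character vector, so the table prints as "Steve, Gary" instead of c("Steve", "Gary").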