In this assignment, I load HTML, XML, and JSON files, written by hand in Notepad and uploaded to GitHub, into R as data frames.
library(XML)
library(xml2)
library(jsonlite)
library(httr)
library(knitr)
dfhtml <- GET("https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_html.html")
dfhtml
## Response [https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_html.html]
## Date: 2018-10-14 16:41
## Status: 200
## Content-Type: text/plain; charset=utf-8
## Size: 449 B
## <U+FEFF><table>
## <tr>
## <th>book </th>
## <th>first_name</th>
## <th>last_name</th>
## <th>year</th>
## </tr>
## <tr>
## <td>Imperialism and World Economy</td>
## <td>Nikolai</td>
## ...
class(dfhtml)
## [1] "response"
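I had to repeat the GET() call in this chunk. Without it, re-running the chunk fails with Error in rawToChar(dfhtml$content) : argument ‘x’ must be a raw vector. The likely cause is that the chunk overwrites dfhtml with a data frame, so on a second run dfhtml$content no longer holds the raw response body.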
## Load HTML Table
dfhtml <- GET("https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_html.html")
# Decode the raw response body to text, parse the table, and coerce the result to a data frame
dfhtml <- as.data.frame(readHTMLTable(rawToChar(dfhtml$content), stringsAsFactors = FALSE))
names(dfhtml) <- c("Title", "Author_First", "Author_Last", "DoP")
kable(dfhtml)
| Title | Author_First | Author_Last | DoP |
|---|---|---|---|
| Imperialism and World Economy | Nikolai | Bukharin | 1972 |
| Why Marx Was Right | Terry | Eagleton | 2011 |
| Ruling America: A History of Wealth and Power in a Democracy | Steve, Gary | Fraser, Gerstle | 2005 |
class(dfhtml)
## [1] "data.frame"
Further manipulation is required to separate the multiple authors. One possible approach is to create empty Author_First2 and Author_Last2 columns, split the Author columns at the comma with strsplit() (or stringr's str_split()), and then unlist the pieces into the new columns.
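The splitting step described above might look roughly like the following. This is only a sketch in base R on a toy copy of the table; the helper `pick()` and the `books` data frame are hypothetical names introduced for illustration, not part of the original analysis.

```r
# Toy copy of two rows of the table above (values taken from the kable output)
books <- data.frame(
  Title        = c("Why Marx Was Right",
                   "Ruling America: A History of Wealth and Power in a Democracy"),
  Author_First = c("Terry", "Steve, Gary"),
  Author_Last  = c("Eagleton", "Fraser, Gerstle"),
  stringsAsFactors = FALSE
)

# Split each cell at the comma (plus any spaces that follow it)
first <- strsplit(books$Author_First, ",\\s*")
last  <- strsplit(books$Author_Last,  ",\\s*")

# Hypothetical helper: take the i-th piece of each split, NA when there is none
pick <- function(parts, i) vapply(parts, function(p) p[i], character(1))

# Fill the second-author columns first, then overwrite the originals
books$Author_First2 <- pick(first, 2)
books$Author_Last2  <- pick(last, 2)
books$Author_First  <- pick(first, 1)
books$Author_Last   <- pick(last, 1)
books
```

Books with a single author end up with NA in the second-author columns, which keeps the table rectangular.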
## Load XML File
url.xml <- GET("https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_xml.xml")
dfxml <- xmlParse(url.xml)
dfxml
## <?xml version="1.0" encoding="UTF-8"?>
## <books>
## <book pages="178" publisher="Cornell University">
## <title>Imperialism and World Economy</title>
## <author>
## <first_name>Nikolai</first_name>
## <last_name>Bukharin</last_name>
## </author>
## <year>1972</year>
## </book>
## <book pages="258" publisher="Yale University">
## <title>Why Marx Was Right</title>
## <author>
## <first_name>Terry</first_name>
## <last_name>Eagleton</last_name>
## </author>
## <year>2011</year>
## </book>
## <book pages="368" publisher="Harvard University">
## <title>Ruling America: A History of Wealth and Power in a Democracy</title>
## <author>
## <first_name>Steve</first_name>
## <last_name>Fraser</last_name>
## <first_name>Gary</first_name>
## <last_name>Gerstle</last_name>
## </author>
## <year>2005</year>
## </book>
## </books>
##
books.df <- xmlToDataFrame(dfxml, stringsAsFactors = FALSE)
kable(books.df)
| title | author | year |
|---|---|---|
| Imperialism and World Economy | NikolaiBukharin | 1972 |
| Why Marx Was Right | TerryEagleton | 2011 |
| Ruling America: A History of Wealth and Power in a Democracy | SteveFraserGaryGerstle | 2005 |
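The author names run together because xmlToDataFrame() concatenates the text of nested nodes into a single cell. The XML file itself should also be corrected so that each author of the third book sits in its own author element: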
<author>
<first_name>Steve</first_name>
<last_name>Fraser</last_name>
</author>
<author>
<first_name>Gary</first_name>
<last_name>Gerstle</last_name>
</author>
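With one author element per author, the names can be pulled out with XPath instead of relying on xmlToDataFrame(). The following is a minimal sketch using the xml2 package loaded earlier. It parses an inline copy of the corrected third book rather than the file on GitHub, and the object names (`xml.txt`, `authors.df`) are illustrative assumptions.

```r
library(xml2)

# Inline copy of the corrected third <book> entry (one <author> node per author)
xml.txt <- '<books>
  <book pages="368" publisher="Harvard University">
    <title>Ruling America: A History of Wealth and Power in a Democracy</title>
    <author><first_name>Steve</first_name><last_name>Fraser</last_name></author>
    <author><first_name>Gary</first_name><last_name>Gerstle</last_name></author>
    <year>2005</year>
  </book>
</books>'

doc     <- read_xml(xml.txt)
authors <- xml_find_all(doc, "//book/author")

# One row per author; title and year are looked up on the parent <book> node
authors.df <- data.frame(
  title      = xml_text(xml_find_first(authors, "../title")),
  first_name = xml_text(xml_find_first(authors, "./first_name")),
  last_name  = xml_text(xml_find_first(authors, "./last_name")),
  year       = xml_text(xml_find_first(authors, "../year")),
  stringsAsFactors = FALSE
)
authors.df
```

This yields a long table (one row per book-author pair), which avoids the concatenated names without needing extra author columns.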
## Load JSON File
url.json <- GET("https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_json.json")
url.json
## Response [https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_json.json]
## Date: 2018-10-14 16:41
## Status: 200
## Content-Type: text/plain; charset=utf-8
## Size: 597 B
## <U+FEFF>{"social economy books" :[
## {
## "title" : "Imperialism and World Economy",
## "author_first" : "Nikolai",
## "author_last": "Bukharin",
## "year" : "1972",
## "pages": 178,
## "publisher" : "Cornell University"
## },
## {
## ...
url.json <- "https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/books_json.json"
json.data <- readLines(url.json)
## Warning in readLines(url.json): incomplete final line found on
## 'https://raw.githubusercontent.com/ZacharyHerold/CUNY-DATA607/master/
## books_json.json'
# Overwrite the first line to drop the byte-order mark (<U+FEFF>) seen at the start of the file
json.data[[1]] <- "{\"social economy books\" :["
Checking the class of the readLines() output. I had difficulty getting fromJSON() to work without first reading the file with readLines().
class(json.data)
## [1] "character"
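The `<U+FEFF>` at the start of the response is a byte-order mark, and it is plausibly what fromJSON() stumbled over (an assumption, since the original error is not shown). Rather than retyping the first line by hand, the mark can be stripped with a regular expression. The sketch below uses an inline, shortened stand-in for the downloaded text; `json.txt`, `json.clean`, and `parsed` are hypothetical names.

```r
library(jsonlite)

# Stand-in for the downloaded file: a JSON string that starts with a BOM (U+FEFF).
# Only one abbreviated record is included here for brevity.
json.txt <- paste0(
  "\uFEFF",
  '{"social economy books" :[{"title" : "Imperialism and World Economy", "pages": 178}]}'
)

# Drop a leading BOM if present, leaving the rest of the text untouched
json.clean <- sub("^\uFEFF", "", json.txt)

parsed <- fromJSON(json.clean)
parsed[["social economy books"]]
```

The same `sub()` call could be applied to the first element of the readLines() output, so the fix does not depend on knowing what the first line contains.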
dfjson <- fromJSON(json.data)
dfjson <- as.data.frame(dfjson)
names(dfjson) <- c("Title","Author_First","Author_Last","Year","Pages","Publisher")
kable(dfjson)
| Title | Author_First | Author_Last | Year | Pages | Publisher |
|---|---|---|---|---|---|
| Imperialism and World Economy | Nikolai | Bukharin | 1972 | 178 | Cornell University |
| Why Marx Was Right | Terry | Eagleton | 2011 | 258 | Yale University |
| Ruling America: A History of Wealth and Power in a Democracy | c("Steve", "Gary") | c("Fraser", "Gerstle") | 2005 | 368 | Harvard University |
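The `c("Steve", "Gary")` cells show that the author columns came back as list-columns, since the third book stores its authors as JSON arrays. One way to tidy this for display is to collapse each list element into a single string. This is a sketch on a toy data frame that mimics that structure; `books` and `is.listcol` are illustrative names, not objects from the analysis above.

```r
# Toy data frame with list-columns, mimicking the structure shown above
books <- data.frame(
  Title = c("Why Marx Was Right",
            "Ruling America: A History of Wealth and Power in a Democracy"),
  stringsAsFactors = FALSE
)
books$Author_First <- list("Terry", c("Steve", "Gary"))
books$Author_Last  <- list("Eagleton", c("Fraser", "Gerstle"))

# Find the list-columns, then collapse each element into one comma-separated string
is.listcol <- vapply(books, is.list, logical(1))
books[is.listcol] <- lapply(books[is.listcol], function(col) {
  vapply(col, paste, character(1), collapse = ", ")
})
books
```

After this step the author columns are ordinary character vectors, matching the "Steve, Gary" form produced by the HTML table.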