week7

Load in Libraries

I added a fake second author, "Author B", to Understanding Power so that one book has two authors

library(knitr)

## Warning: package 'knitr' was built under R version 3.4.4

library(rvest)
library(RCurl)
library(XML)
library(htmltab)

## Warning: package 'htmltab' was built under R version 3.4.4

library(kableExtra)
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.4.4

## Warning: package 'ggplot2' was built under R version 3.4.4

library(dplyr)

Load in HTML File

Because I wrote the HTML file as a table, it was very easy to load
The result should be reproducible because the file is read from my GitHub repository

url <- "https://raw.githubusercontent.com/justinherman42/Justin-Data-607/master/tidy_data_week4/Politicalbooks.html"
my_books<- htmltab(doc = url, which = "/html/body/table")

## Neither <thead> nor <th> information found. Taking first table row for the header. If incorrect, specifiy header argument.

kable(my_books)

	Book	Author	Publication Date	Publisher
2	Understanding Power	Noam Chomsky	2002	The New Press
3	Understanding Power	Author B	2002	The New Press
4	Blackwater	Jeremy Scahill	2008	Nation Books
5	Shock Doctrine	Naomi Klein	2008	Picador
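
For comparison, a table with no header markup can also be read with rvest, which is loaded above but not otherwise used. This is only a minimal sketch and not the method used in this report: the inline HTML string is a made-up stand-in for Politicalbooks.html, and html_element() assumes rvest 1.0.0 or later.

```r
library(rvest)

# Hypothetical stand-in for the real file: a table with no <thead> or <th>,
# so the first row has to be treated as the header.
html_text_in <- paste0(
    "<html><body><table>",
    "<tr><td>Book</td><td>Author</td><td>Publication Date</td><td>Publisher</td></tr>",
    "<tr><td>Blackwater</td><td>Jeremy Scahill</td><td>2008</td><td>Nation Books</td></tr>",
    "<tr><td>Shock Doctrine</td><td>Naomi Klein</td><td>2008</td><td>Picador</td></tr>",
    "</table></body></html>"
)

books_html <- read_html(html_text_in) %>%
    html_element("table") %>%       # pick the first <table> in the page
    html_table(header = TRUE)       # use the first row as column names

books_html
```

Setting header = TRUE makes the same choice that htmltab reported in its message above, namely taking the first table row as the header.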

Load in XML File

Using xmlParse
- I initially received a lot of errors when attempting to load the file from GitHub and found workarounds on Stack Overflow
- Tidy the data and create a Coauthor column

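fileURL <- "https://raw.githubusercontent.com/justinherman42/Justin-Data-607/master/tidy_data_week4/Politicalbooks2.xml"
# Download the raw XML text first; ssl.verifypeer = FALSE skips certificate verification
books_xml <- getURL(fileURL,ssl.verifypeer = FALSE)
books_xml %>% 
    xmlParse(.,useInternal = TRUE) %>%    # parse the text into an internal XML document
    xmlToList(.) %>%                      # one list element per <Book>
    plyr::ldply(., data.frame) %>%        # one row per book; missing fields become NA
    select(-.id) %>%
    mutate(Coauthor=Author.1) %>%         # the second <Author> arrives as Author.1
    select(-Author.1) %>% 
    kable(.)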

Title	Author	Publication_Date	Publisher	Coauthor
Understanding Power	Noam Chomsky	2002	The New Press	Author B
Blackwater	Jeremy Scahill	2008	Nation Books	NA
The Shock Doctrine	Naomi Klein	2008	Picador	NA

#books_xml <- xmlParse(books_xml ,useInternal = TRUE)
#xL <- xmlToList(books_xml)
#kable(ldply(xL, data.frame))
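
The workaround above downloads the XML as text and then parses the string. A minimal sketch of that second step, without any network access, might look like the following. The two-book document is a hypothetical stand-in for Politicalbooks2.xml, and asText = TRUE is passed explicitly so that the string is not mistaken for a file name.

```r
library(XML)

# Hypothetical two-book document; the first book has two <Author> elements.
xml_text <- paste0(
    "<Political_Books>",
    "<Book><Title>Understanding Power</Title>",
    "<Author>Noam Chomsky</Author><Author>Author B</Author></Book>",
    "<Book><Title>Blackwater</Title><Author>Jeremy Scahill</Author></Book>",
    "</Political_Books>"
)

doc <- xmlParse(xml_text, asText = TRUE)   # parse from a string, not a file or URL
books <- xmlToList(doc)                    # one list element per <Book>

# The repeated <Author> element shows up as a repeated name in the first book.
names(books[[1]])
```

The repeated name is what data.frame() later has to make unique, which is where a column such as Author.1 comes from.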

Alternate example of loading in xml

I couldn't get this example to work with GitHub, so it reads a local file and the output will only display on RPubs
There are no second authors in this example
The data frame needed to be transposed

xml.url <- "file:///C:/Users/JN/Documents/No%20interenet/Politicalbooks.XML"
xmlfile <- xmlTreeParse(xml.url)
class(xmlfile)

## [1] "XMLDocument"         "XMLAbstractDocument"

xmlfile = xmlRoot(xmlfile)
print(xmlfile)[1:2]

## <Political_Books>
##  <Book>
##   <Title>Understanding Power</Title>
##   <Author>Noam Chomsky</Author>
##   <Publication_Date>2002</Publication_Date>
##   <Publisher>The New Press</Publisher>
##  </Book>
##  <Book>
##   <Title>Blackwater</Title>
##   <Author>Jeremy Scahill</Author>
##   <Publication_Date>2008</Publication_Date>
##   <Publisher>Nation Books</Publisher>
##  </Book>
##  <Book>
##   <Title>The Shock Doctrine</Title>
##   <Author>Naomi Klein</Author>
##   <Publication_Date>2008</Publication_Date>
##   <Publisher>Picador</Publisher>
##  </Book>
## </Political_Books>

## NULL

my_xml <- xmlSApply(xmlfile, function(x) xmlSApply(x, xmlValue))
my_xml_df <- t(as_data_frame(my_xml))
colnames(my_xml_df) <-c("Book", "Author", "Publication Date","Publisher")
kable(my_xml_df)

	Book	Author	Publication Date	Publisher
Book	Understanding Power	Noam Chomsky	2002	The New Press
Book1	Blackwater	Jeremy Scahill	2008	Nation Books
Book2	The Shock Doctrine	Naomi Klein	2008	Picador
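
To see why the transpose is needed, here is a small base R sketch. It assumes that xmlSApply() simplifies its result the way sapply() does, and the books list is a hypothetical, shortened stand-in for the parsed XML.

```r
# Hypothetical stand-in for the parsed XML: one named vector per book.
books <- list(
    Book  = c(Title = "Understanding Power", Author = "Noam Chomsky"),
    Book1 = c(Title = "Blackwater", Author = "Jeremy Scahill"),
    Book2 = c(Title = "The Shock Doctrine", Author = "Naomi Klein")
)

# sapply() binds the vectors as columns: fields are rows, books are columns.
by_field <- sapply(books, identity)
dim(by_field)   # 2 rows (fields) by 3 columns (books)

# t() swaps the axes so that each book becomes a row.
by_book <- as.data.frame(t(by_field), stringsAsFactors = FALSE)
dim(by_book)    # 3 rows (books) by 2 columns (fields)
```

After the transpose the row names are the list names (Book, Book1, Book2), which matches the row labels in the table above.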

Load in JSON file

library(curl)

## 
## Attaching package: 'curl'

## The following object is masked from 'package:readr':
## 
##     parse_date

library(rjson)
library(jsonlite)

## 
## Attaching package: 'jsonlite'

## The following objects are masked from 'package:rjson':
## 
##     fromJSON, toJSON

## The following object is masked from 'package:purrr':
## 
##     flatten

json_file <- 'https://raw.githubusercontent.com/justinherman42/Justin-Data-607/master/tidy_data_week4/Politicalbooks.json'
text <- readLines(curl(json_file))
text %>% 
    jsonlite::fromJSON(.,flatten = TRUE ) %>% 
    kable(.)

Title	Author	Author_2	Publication_Date	Publisher
Understanding Power	Noam Chomsky	Author B	2002	The New Press
Blackwater	Jeremy Scahill	NA	2008	Nation Books
The Shock Doctrine	Naomi Klein	NA	2008	Picador
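
The NA values in Author_2 come from how jsonlite::fromJSON turns an array of records into a data frame. The sketch below uses a hypothetical inline JSON string and assumes that the second author is simply absent for single-author books; the real Politicalbooks.json may encode this differently, for example with null.

```r
library(jsonlite)

# Hypothetical records: only the first book has an Author_2 field.
json_text <- '[
  {"Title": "Understanding Power", "Author": "Noam Chomsky", "Author_2": "Author B"},
  {"Title": "Blackwater", "Author": "Jeremy Scahill"}
]'

books_json <- jsonlite::fromJSON(json_text, flatten = TRUE)

# Records without the field get NA in that column.
is.na(books_json$Author_2)
```

Calling the function as jsonlite::fromJSON also sidesteps the masking of rjson::fromJSON reported when the packages were loaded.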

warnings()

## NULL

#text
#dd  <-  as.data.frame(t(matrix(unlist(json_data), nrow=4)))
#colnames(dd) <- c("Book", "Author", "Publication Date","Publisher")
#kable(dd)

There were some minor differences between the data frames depending on how I loaded the data.
The HTML came in very clean because I created the table with my intended design in mind.
The XML data was converted with plyr::ldply(., data.frame), which I took from Stack Overflow, and the JSON data was read with jsonlite::fromJSON.
The ldply method was very helpful because it handles uneven entries, which in this dataset means the one book with two authors.
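
A minimal sketch of how this handles uneven entries is shown below. The nested list is a hypothetical stand-in for the output of xmlToList, and stringsAsFactors = FALSE is an addition that is not in the original pipeline.

```r
# Hypothetical stand-in for xmlToList() output: the first book has two authors.
books <- list(
    Book = list(Title = "Understanding Power",
                Author = "Noam Chomsky", Author = "Author B"),
    Book = list(Title = "Blackwater", Author = "Jeremy Scahill")
)

# Each book becomes a one-row data frame; data.frame() renames the repeated
# Author to Author.1, and ldply() fills columns a book lacks with NA.
books_df <- plyr::ldply(books, data.frame, stringsAsFactors = FALSE)

names(books_df)
books_df$Author.1
```

The .id column holds the list names, which is why the pipeline above drops it with select(-.id).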

Justin Herman

March 15, 2018
