Silverio_Vasquez-HW-books

Create books.html, books.xml, and books.json. Read data into R.

Read from HTML

Download data from respective files hosted on my GitHub page:

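library(rvest)  # read_html(), html_nodes(), html_table()

url <- 'https://sjv1030.github.io/books.html'
q1_url <- read_html(url)
# html_table() returns a list of tables; as.data.frame() flattens the single table
df <- as.data.frame(html_table(html_nodes(q1_url, "table")))
df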

##                Title          Author.s.
## 1        Investments Bodi, Kane, Marcus
## 2         The Quants    Scott Patterson
## 3 When Genius Failed   Roger Lowenstein
##                                                                                                  Attribute.s.
## 1                                                         Finance 101, comprehensive, finance bible, textbook
## 2                         quant, quantitative analyst, page turner, well written, algos, Wall Street, finance
## 3 Hedge Funds, finance, Wall Street, quants, crisis, Long Term Capital Management, liquidity, bailout, genius

Alternatively, with readHTMLTable() from the XML package:

library(XML)    # readHTMLTable() and the xml* functions
library(RCurl)  # getURL()
df1 <- readHTMLTable(getURL(url))[[1]]
df1

##                Title          Author(s)
## 1        Investments Bodi, Kane, Marcus
## 2         The Quants    Scott Patterson
## 3 When Genius Failed   Roger Lowenstein
##                                                                                                  Attribute(s)
## 1                                                         Finance 101, comprehensive, finance bible, textbook
## 2                         quant, quantitative analyst, page turner, well written, algos, Wall Street, finance
## 3 Hedge Funds, finance, Wall Street, quants, crisis, Long Term Capital Management, liquidity, bailout, genius
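The two HTML results above differ in their headers: `Author.s.` in the first and `Author(s)` in the second. One likely cause (an assumption, not something the source confirms) is that `as.data.frame()` runs column names through `make.names()`, which replaces characters that are not valid in R identifiers with dots. The sketch below uses a made-up one-row table, not the downloaded data, to show the effect in base R.

```r
# Column names as they appear in the HTML table header
raw_names <- c("Title", "Author(s)", "Attribute(s)")

# make.names() is what data.frame() applies when check.names = TRUE (the default)
mangled <- make.names(raw_names)

# Build the same illustrative row with and without name checking
checked <- data.frame(Title = "Investments",
                      `Author(s)` = "Bodi, Kane, Marcus",
                      stringsAsFactors = FALSE)
unchecked <- data.frame(Title = "Investments",
                        `Author(s)` = "Bodi, Kane, Marcus",
                        check.names = FALSE,
                        stringsAsFactors = FALSE)

names(checked)    # "Title" "Author.s."
names(unchecked)  # "Title" "Author(s)"
```

Passing `check.names = FALSE` keeps the original header text, at the cost of needing backticks to refer to such columns by name.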

Read from XML

url2 <- 'https://sjv1030.github.io/books.xml'
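xml_file <- xmlTreeParse(getURL(url2))
top <- xmlRoot(xml_file)
# For each book node, collect the text of each child element
topxml <- xmlSApply(top,
                    function(x) xmlSApply(x, xmlValue))
# Transpose so that each row is one book
df2 <- data.frame(t(topxml), row.names = NULL)
df2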

##                title            authors
## 1        Investments Bodi, Kane, Marcus
## 2         The Quants    Scott Patterson
## 3 When Genius Failed   Roger Lowenstein
##                                                                                                    attributes
## 1                                                         Finance 101, comprehensive, finance bible, textbook
## 2                         quant, quantitative analyst, page turner, well written, algos, Wall Street, finance
## 3 Hedge Funds, finance, Wall Street, quants, crisis, Long Term Capital Management, liquidity, bailout, genius

Alternatively, with xmlToDataFrame():

df3 <- xmlToDataFrame(getURL(url2))

## Warning in names(x) == varNames: longer object length is not a multiple of
## shorter object length

## Warning in names(x) == varNames: longer object length is not a multiple of
## shorter object length

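library(tidyr)  # replace_na(), unite()
library(dplyr)  # %>%
# Merge the author and authors columns into one, treating NA as an empty string
df3 <- df3 %>%
        replace_na(list(author = "", authors = "")) %>%
        unite(author, author, authors, remove = TRUE, sep = "")
df3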

##                title             author
## 1        Investments Bodi, Kane, Marcus
## 2         The Quants    Scott Patterson
## 3 When Genius Failed   Roger Lowenstein
##                                                                                                    attributes
## 1                                                         Finance 101, comprehensive, finance bible, textbook
## 2                         quant, quantitative analyst, page turner, well written, algos, Wall Street, finance
## 3 Hedge Funds, finance, Wall Street, quants, crisis, Long Term Capital Management, liquidity, bailout, genius
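The clean-up step above merges two partially filled columns into one. A minimal base R sketch of the same idea follows. The `parsed` data frame is hypothetical: it assumes the parse left an `author` and an `authors` column with `NA` wherever a book used the other tag, and which book uses which tag is a guess made for illustration.

```r
# Hypothetical stand-in for the parsed XML: each name sits in one of two columns
parsed <- data.frame(
  title   = c("Investments", "The Quants", "When Genius Failed"),
  authors = c("Bodi, Kane, Marcus", NA, NA),
  author  = c(NA, "Scott Patterson", "Roger Lowenstein"),
  stringsAsFactors = FALSE
)

# Take `author` where it is present, otherwise fall back to `authors`
parsed$author <- ifelse(is.na(parsed$author), parsed$authors, parsed$author)

# Drop the now redundant column
parsed$authors <- NULL

parsed
```

Unlike pasting the two columns together, this picks one value per row, so a row with both columns filled would keep only `author`.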

Read from JSON

url3 <- 'https://sjv1030.github.io/books.json'
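jdata <- fromJSON(getURL(url3))
# Flatten the parsed records and fill a 3-column matrix row by row (one book per row)
df4 <- as.data.frame(matrix(unlist(jdata), ncol = 3, byrow = TRUE))
colnames(df4) <- c('title','authors','attributes')
df4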

##                title            authors
## 1        Investments Bodi, Kane, Marcus
## 2         The Quants    Scott Patterson
## 3 When Genius Failed   Roger Lowenstein
##                                                                                                    attributes
## 1                                                         Finance 101, comprehensive, finance bible, textbook
## 2                         quant, quantitative analyst, page turner, well written, algos, Wall Street, finance
## 3 Hedge Funds, finance, Wall Street, quants, crisis, Long Term Capital Management, liquidity, bailout, genius
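The reshaping step above depends on the parsed JSON being a list of records whose fields arrive in a fixed order. The sketch below uses a hand-built list in place of the downloaded file (an assumption about its shape) and adds a made-up fourth record to show that filling the matrix with `byrow = TRUE` still gives one book per row when the number of books differs from the number of fields.

```r
# Hypothetical parsed JSON: one list element per book, fields in a fixed order
records <- list(
  list(title = "Investments", authors = "Bodi, Kane, Marcus", attributes = "textbook"),
  list(title = "The Quants", authors = "Scott Patterson", attributes = "page turner"),
  list(title = "When Genius Failed", authors = "Roger Lowenstein", attributes = "crisis"),
  list(title = "Hypothetical Fourth Book", authors = "A. N. Author", attributes = "made up")
)

# unlist() yields title, authors, attributes for book 1, then book 2, and so on
flat <- unlist(records)

# Fill three columns row by row so that each row holds one book
books <- as.data.frame(matrix(flat, ncol = 3, byrow = TRUE),
                       stringsAsFactors = FALSE)
colnames(books) <- c("title", "authors", "attributes")

books
```

This only works when every record has exactly the same three fields. A record with a missing or extra field would shift all the values after it.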

Are they all equal?

Is the HTML dataframe equal to the XML dataframe?

all.equal(df,df3)

## [1] "Names: 3 string mismatches"

identical(df,df3)

## [1] FALSE

Is the HTML dataframe equal to the JSON dataframe?

all.equal(df1,df4)

## [1] "Names: 3 string mismatches"

identical(df1,df4)

## [1] FALSE

Is the XML dataframe equal to the JSON dataframe?

all.equal(df2,df4)

## [1] "Component \"title\": names for target but not for current"     
## [2] "Component \"authors\": names for target but not for current"   
## [3] "Component \"attributes\": names for target but not for current"

identical(df2,df4)

## [1] FALSE

Although the data frames look nearly identical when printed, R treats them all as different. The mismatches reported by all.equal() concern column names and the names attached to the columns, not the cell values.
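Since the reported mismatches are about names, one way to compare only the contents is to put both data frames into a common form first. The following is a minimal sketch with two small made-up frames. The `normalize()` helper is hypothetical and simply converts every column to plain character, drops element names, and applies a shared set of column names.

```r
# Two illustrative frames with the same values but different names and column types
a <- data.frame(Title = c("Investments", "The Quants"),
                `Author(s)` = c("Bodi, Kane, Marcus", "Scott Patterson"),
                check.names = FALSE, stringsAsFactors = FALSE)
b <- data.frame(title = c("Investments", "The Quants"),
                authors = c("Bodi, Kane, Marcus", "Scott Patterson"),
                stringsAsFactors = TRUE)

# Hypothetical helper: plain character columns, no element names, shared headers
normalize <- function(d, new_names) {
  d[] <- lapply(d, function(col) unname(as.character(col)))
  names(d) <- new_names
  rownames(d) <- NULL
  d
}

common <- c("title", "authors")

before <- all.equal(a, b)  # a character vector describing the mismatches
after  <- all.equal(normalize(a, common), normalize(b, common))

isTRUE(before)
isTRUE(after)
```

`all.equal()` returns `TRUE` only when it finds no differences, so wrapping it in `isTRUE()` gives a plain logical result that is safe to use in a condition.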

Silverio Vasquez

October 12, 2017
