Create books.html, books.xml, and books.json, then read each file into R.
Download the data from the respective files hosted on my GitHub page:
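library(rvest)  # read_html(), html_nodes(), html_table()
url <- 'https://sjv1030.github.io/books.html'
q1_url <- read_html(url)
# Convert the page's table node(s) to a data frame
df <- as.data.frame(html_table(html_nodes(q1_url, "table")))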
df
## Title Author.s.
## 1 Investments Bodi, Kane, Marcus
## 2 The Quants Scott Patterson
## 3 When Genius Failed Roger Lowenstein
## Attribute.s.
## 1 Finance 101, comprehensive, finance bible, textbook
## 2 quant, quantitative analyst, page turner, well written, algos, Wall Street, finance
## 3 Hedge Funds, finance, Wall Street, quants, crisis, Long Term Capital Management, liquidity, bailout, genius
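Alternatively, parse the table with readHTMLTable() from the XML package:
library(XML)    # readHTMLTable(), xmlTreeParse(), xmlToDataFrame()
library(RCurl)  # getURL()
df1 <- readHTMLTable(getURL(url))[[1]]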
df1
## Title Author(s)
## 1 Investments Bodi, Kane, Marcus
## 2 The Quants Scott Patterson
## 3 When Genius Failed Roger Lowenstein
## Attribute(s)
## 1 Finance 101, comprehensive, finance bible, textbook
## 2 quant, quantitative analyst, page turner, well written, algos, Wall Street, finance
## 3 Hedge Funds, finance, Wall Street, quants, crisis, Long Term Capital Management, liquidity, bailout, genius
url2 <- 'https://sjv1030.github.io/books.xml'
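# Parse the XML and take the root node
xml_file <- xmlTreeParse(getURL(url2))
top <- xmlRoot(xml_file)
# Collect the text of each record's child elements
topxml <- xmlSApply(top,
                    function(x) xmlSApply(x, xmlValue))
# Transpose so that each record becomes a row
df2 <- data.frame(t(topxml), row.names = NULL)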
df2
## title authors
## 1 Investments Bodi, Kane, Marcus
## 2 The Quants Scott Patterson
## 3 When Genius Failed Roger Lowenstein
## attributes
## 1 Finance 101, comprehensive, finance bible, textbook
## 2 quant, quantitative analyst, page turner, well written, algos, Wall Street, finance
## 3 Hedge Funds, finance, Wall Street, quants, crisis, Long Term Capital Management, liquidity, bailout, genius
Alternatively, with xmlToDataFrame():
df3 <- xmlToDataFrame(getURL(url2))
## Warning in names(x) == varNames: longer object length is not a multiple of
## shorter object length
## Warning in names(x) == varNames: longer object length is not a multiple of
## shorter object length
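library(tidyr)  # replace_na(), unite(), and the %>% pipe
# df3 has both an `author` and an `authors` column containing NAs;
# blank out the NAs and merge the two columns into one
df3 <- df3 %>%
  replace_na(list(author = "", authors = "")) %>%
  unite(author, author, authors, remove = TRUE, sep = "")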
df3
## title author
## 1 Investments Bodi, Kane, Marcus
## 2 The Quants Scott Patterson
## 3 When Genius Failed Roger Lowenstein
## attributes
## 1 Finance 101, comprehensive, finance bible, textbook
## 2 quant, quantitative analyst, page turner, well written, algos, Wall Street, finance
## 3 Hedge Funds, finance, Wall Street, quants, crisis, Long Term Capital Management, liquidity, bailout, genius
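The warnings and the replace_na()/unite() step suggest that the XML records do not all use the same tag name for the author field. That is an inference from the code, since the file itself is not shown here. As a rough base-R sketch of the same merge, using a made-up data frame (`toy` and its values are hypothetical, not the downloaded data):

```r
# Hypothetical stand-in for a result where some records filled `author`
# and others filled `authors` (values are made up for illustration).
toy <- data.frame(
  title   = c("Book A", "Book B", "Book C"),
  author  = c("Ann", NA, NA),
  authors = c(NA, "Bob, Cy", "Dee"),
  stringsAsFactors = FALSE
)

# Keep `author` where it is present, otherwise fall back to `authors`.
toy$author <- ifelse(is.na(toy$author), toy$authors, toy$author)

# Drop the now-redundant column.
toy$authors <- NULL
toy
```

Unlike pasting two blanked columns together, this picks one value per row, so it would not concatenate the fields if a record happened to fill both.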
url3 <- 'https://sjv1030.github.io/books.json'
jdata <- fromJSON(getURL(url3))
# Each record contributes three consecutive values, so fill the matrix row by row
df4 <- as.data.frame(matrix(unlist(jdata), ncol = 3, byrow = TRUE))
colnames(df4) <- c('title','authors','attributes')
df4
## title authors
## 1 Investments Bodi, Kane, Marcus
## 2 The Quants Scott Patterson
## 3 When Genius Failed Roger Lowenstein
## attributes
## 1 Finance 101, comprehensive, finance bible, textbook
## 2 quant, quantitative analyst, page turner, well written, algos, Wall Street, finance
## 3 Hedge Funds, finance, Wall Street, quants, crisis, Long Term Capital Management, liquidity, bailout, genius
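Reshaping through unlist() and matrix() depends on every record having exactly three fields in the same order. A possible alternative, sketched here under the assumption that the parsed JSON is a list of records with named fields (`books` below is a hypothetical stand-in, not the actual file contents), is to build one row per record and bind them:

```r
# Hypothetical parsed JSON: a list of records, each with named fields.
books <- list(
  list(title = "Book A", authors = "Ann", attributes = "x, y"),
  list(title = "Book B", authors = "Bob", attributes = "z")
)

# Turn each record into a one-row data frame, then stack the rows.
rows <- lapply(books, function(b) as.data.frame(b, stringsAsFactors = FALSE))
df_toy <- do.call(rbind, rows)
df_toy
```

Here the column names come from the field names in each record, so no separate colnames() assignment is needed.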
Is the HTML dataframe equal to the XML dataframe?
all.equal(df,df3)
## [1] "Names: 3 string mismatches"
identical(df,df3)
## [1] FALSE
Is the HTML dataframe equal to the JSON dataframe?
all.equal(df1,df4)
## [1] "Names: 3 string mismatches"
identical(df1,df4)
## [1] FALSE
Is the XML dataframe equal to the JSON dataframe?
all.equal(df2,df4)
## [1] "Component \"title\": names for target but not for current"
## [2] "Component \"authors\": names for target but not for current"
## [3] "Component \"attributes\": names for target but not for current"
identical(df2,df4)
## [1] FALSE
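Although the data frames look nearly identical when printed, R treats them all as different: all.equal() reports mismatched column names in both HTML comparisons and a names attribute on the columns of the XML data frame, and identical() returns FALSE in every case.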
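One way to check whether only the metadata differs is to normalize the column names and strip any names attributes before comparing. A minimal sketch with small made-up data frames (`a`, `b`, and their values are hypothetical stand-ins, not the downloaded data):

```r
# Hypothetical stand-ins for two of the data frames (values are made up).
a <- data.frame(Title = c("Book A", "Book B"),
                Author.s. = c("Ann", "Bob"),
                stringsAsFactors = FALSE)
b <- data.frame(title = c("Book A", "Book B"),
                authors = c("Ann", "Bob"),
                stringsAsFactors = FALSE)

# Attach a names attribute to one column, mimicking the XML-derived result.
b$title <- setNames(b$title, c("book", "book"))

# Comparison before any cleanup.
before <- isTRUE(all.equal(a, b))

# Normalize: shared column names, and no names attribute on any column.
names(a) <- names(b)
a[] <- lapply(a, unname)
b[] <- lapply(b, unname)

# Comparison after cleanup.
after <- isTRUE(all.equal(a, b))
c(before = before, after = after)
```

If `after` is TRUE while `before` is FALSE, the remaining differences were in names and attributes only, not in the cell values.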