We have been asked to pick three of our favorite books on one of our favorite subjects. At least one of the books should have more than one author. For each book, we include the title, authors, and two or three other attributes that we find interesting. We then store this information in three separate files, in HTML (using an HTML table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”), and write R code, using our packages of choice, to load each file into a separate R data frame. Are the three data frames identical?
Loading necessary packages
library(rvest)
library(RCurl)
library(XML)
library(xml2)
library(jsonlite)
library(DT)
library(tidyverse)
Download the HTML file from its URL
# getURL() returns the page contents as a string, so `url` holds the HTML itself
url <- getURL("https://raw.githubusercontent.com/waheeb123/Assignment_7_607/main/books.html")
Read the HTML file and extract the table
table <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(header = TRUE)
Convert the object to a data frame
# html_table() returns a list with one element per <table>; flatten it to a data frame
table <- as.data.frame(table)
Print the resulting data frame
knitr::kable(table)

| Title | Authors | Genre | Year | Pages | Language |
|---|---|---|---|---|---|
| An Introduction to Statistical Learning: with Applications in R | Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani | Machine Learning, Statistics | 2013 | 426 | English |
| Information Systems for Managers: Text and Cases | Gabe Piccoli and Federico Pigni | Information Systems, Management | 2018 | 448 | English |
| The Elements of Statistical Learning: Data Mining, Inference, and Prediction | Trevor Hastie, Robert Tibshirani, and Jerome Friedman | Machine Learning, Statistics | 2001 | 536 | English |
Set the path of the XML file
xml_file <- "books.xml"
Parse the XML file
books.xml <- xmlParse(xml_file)
Get the root node
books.xml.root <- xmlRoot(books.xml)
Extract information from each node into a matrix
# One column per book, one row per field; every value is a character string
books.xml.matrix <- xmlSApply(books.xml.root, function(x) xmlSApply(x, xmlValue))
Transpose the matrix and convert to a data frame
books.xml.df <- data.frame(t(books.xml.matrix), row.names = NULL)
Print the resulting data frame
knitr::kable(books.xml.df)

| title | authors | genre | year | pages | language |
|---|---|---|---|---|---|
| An Introduction to Statistical Learning: with Applications in R | Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani | Machine Learning, Statistics | 2013 | 426 | English |
| Information Systems for Managers: Text and Cases | Gabe Piccoli and Federico Pigni | Information Systems, Management | 2018 | 448 | English |
| The Elements of Statistical Learning: Data Mining, Inference, and Prediction | Trevor Hastie, Robert Tibshirani, and Jerome Friedman | Machine Learning, Statistics | 2001 | 536 | English |
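Because xmlValue returns text, every column built from the XML matrix is a character column, including year and pages. The sketch below illustrates this in base R only. The matrix `m` is a hand-made stand-in for `t(books.xml.matrix)` with shortened, hypothetical values, and using `type.convert()` is one possible way to recover numeric types, not something the original workflow does.

```r
# Stand-in for t(books.xml.matrix): every cell is a character string
# (hypothetical, shortened values; not read from books.xml)
m <- matrix(c("A", "2013", "426",
              "B", "2018", "448"),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("title", "year", "pages")))

# Same conversion step as in the text above
chr_df <- data.frame(m, row.names = NULL, stringsAsFactors = FALSE)
sapply(chr_df, class)    # every column is "character"

# Let R re-infer the type of each column from its text
typed_df <- type.convert(chr_df, as.is = TRUE)
sapply(typed_df, class)  # title stays character; year and pages become integer
```

Passing `as.is = TRUE` keeps the text columns as character vectors instead of turning them into factors.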
Load JSON data from URL
json_url <- "https://raw.githubusercontent.com/waheeb123/Assignment_7_607/main/books.json"
json_data <- fromJSON(json_url)
Convert JSON data to a data frame
books.df <- as.data.frame(json_data)
Print the resulting data frame
knitr::kable(books.df)

| books.title | books.authors | books.genre | books.year | books.pages | books.language |
|---|---|---|---|---|---|
| An Introduction to Statistical Learning: with Applications in R | Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani | Machine Learning, Statistics | 2013 | 426 | English |
| Information Systems for Managers: Text and Cases | Gabe Piccoli and Federico Pigni | Information Systems, Management | 2018 | 448 | English |
| The Elements of Statistical Learning: Data Mining, Inference, and Prediction | Trevor Hastie, Robert Tibshirani, and Jerome Friedman | Machine Learning, Statistics | 2001 | 536 | English |
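The `books.` prefix on every column name suggests that the JSON file wraps its records in a top-level `books` key, so `fromJSON()` returns a list and `as.data.frame()` adds the key name to each column. The sketch below reproduces that behaviour on a small inline string. The JSON text is a hypothetical two-record sample, and the top-level key is an assumption inferred from the column names, not a copy of books.json.

```r
library(jsonlite)

# Hypothetical two-record sample; the top-level "books" key is an assumption
json_text <- '{"books": [
  {"title": "A", "year": 2013},
  {"title": "B", "year": 2018}
]}'

parsed <- fromJSON(json_text)

# Converting the whole list prefixes every column with the key name
prefixed <- as.data.frame(parsed)
names(prefixed)

# Taking the list element directly keeps the original field names
books <- parsed$books
names(books)
```

Extracting the element by name avoids a later renaming step when the column names need to match the other sources.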
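Not as R objects. The three data frames hold the same books and the same attribute values, but the printed tables show different column names (Title, title, and books.title), and the XML values are all read as character strings. identical() compares names and column types as well as contents, so the check has to be made across the sources rather than on each data frame against itself:
identical(table, books.xml.df)
identical(table, books.df)
identical(books.xml.df, books.df)
With the column names as printed above, none of these comparisons can return TRUE until the names and types are aligned.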
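A comparison across sources is only meaningful once the column names and column types have been brought into a common form. The sketch below shows one way to do that in base R. The three toy data frames are hypothetical stand-ins that mimic the name and type differences visible in the printed tables, and `normalize()` is a helper introduced here for illustration only.

```r
# Toy stand-ins for the three data frames (hypothetical values, not the real files)
html_df <- data.frame(Title = c("A", "B"), Year = c(2013L, 2018L))
xml_df  <- data.frame(title = c("A", "B"), year = c("2013", "2018"))
json_df <- data.frame(books.title = c("A", "B"), books.year = c(2013L, 2018L))

# Hypothetical helper: lower-case the names, drop a leading "books." prefix,
# then let R re-infer each column's type from its text
normalize <- function(df) {
  names(df) <- sub("^books\\.", "", tolower(names(df)))
  df[] <- lapply(df, as.character)
  type.convert(df, as.is = TRUE)
}

# Raw comparison across sources: the names differ, so this is FALSE
raw_same <- identical(html_df, xml_df)

# Comparisons after normalization
html_vs_xml <- identical(normalize(html_df), normalize(xml_df))
xml_vs_json <- identical(normalize(xml_df), normalize(json_df))
```

Converting every column to character first and then re-inferring the types puts all three sources through the same path, so a remaining difference would point to the data rather than to the parser.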
In this task, I loaded data from three different sources (HTML, XML, and JSON) into separate R data frames using the rvest, RCurl, XML, and jsonlite packages. The resulting data frames contain the same books and the same attribute values, but they differ in column names and, for the XML source at least, in column types, so identical() does not treat them as the same object without some cleanup. This shows that R can extract the same information from each format, while the exact structure of the result depends on the source format and the parser used.