I’m interested in Artificial Intelligence. Here are the three books that I chose:
Title: “Artificial Intelligence: A Modern Approach”
Authors: Stuart Russell, Peter Norvig
Attributes: Publication date: December 11, 2009; Publisher: Pearson Education
Title: “Deep Learning”
Authors: Ian Goodfellow, Yoshua Bengio, Aaron Courville
Attributes: Publication date: November 18, 2016; Publisher: MIT Press
Title: “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems”
Author: Aurelien Geron
Attributes: Publication date: March 2019; Publisher: O’Reilly Media
I’ve created three files: books.html, books.xml, and books.json.
This is the content of each file, in that order:
<table>
<tr>
<th>Title</th>
<th>Authors</th>
<th>Publication Date</th>
<th>Publisher</th>
</tr>
<tr>
<td>Artificial Intelligence: A Modern Approach</td>
<td>Stuart Russell, Peter Norvig</td>
<td>December 11, 2009</td>
<td>Pearson Education</td>
</tr>
<tr>
<td>Deep Learning</td>
<td>Ian Goodfellow, Yoshua Bengio, Aaron Courville</td>
<td>November 18, 2016</td>
<td>MIT Press</td>
</tr>
<tr>
<td>Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems</td>
<td>Aurelien Geron</td>
<td>March 2019</td>
<td>O'Reilly Media</td>
</tr>
</table>
<books>
<book>
<title>Artificial Intelligence: A Modern Approach</title>
<authors>
<author>Stuart Russell</author>
<author>Peter Norvig</author>
</authors>
<publication_date>December 11, 2009</publication_date>
<publisher>Pearson Education</publisher>
</book>
<book>
<title>Deep Learning</title>
<authors>
<author>Ian Goodfellow</author>
<author>Yoshua Bengio</author>
<author>Aaron Courville</author>
</authors>
<publication_date>November 18, 2016</publication_date>
<publisher>MIT Press</publisher>
</book>
<book>
<title>Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems</title>
<authors>
<author>Aurelien Geron</author>
</authors>
<publication_date>March 2019</publication_date>
<publisher>O'Reilly Media</publisher>
</book>
</books>
[
{
"title": "Artificial Intelligence: A Modern Approach",
"authors": [
"Stuart Russell",
"Peter Norvig"
],
"publication_date": "December 11, 2009",
"publisher": "Pearson Education"
},
{
"title": "Deep Learning",
"authors": [
"Ian Goodfellow",
"Yoshua Bengio",
"Aaron Courville"
],
"publication_date": "November 18, 2016",
"publisher": "MIT Press"
},
{
"title": "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems",
"authors": [
"Aurelien Geron"
],
"publication_date": "March 2019",
"publisher": "O'Reilly Media"
}
]
Load required packages:
library(XML)
library(jsonlite)
library(rvest)
library(dplyr)
library(janitor)
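Load books.html into a data frame:
html_file <- "books.html"
# html_table() returns a list with one tibble per <table> element in the page
html_tables <- read_html(html_file) %>% html_table()
# keep the first (and only) table and convert the header names to snake_case
df_html <- html_tables[[1]] %>%
  as.data.frame() %>%
  clean_names()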
Data frame from HTML:
Load books.xml into a data frame:
xml_file <- "books.xml"
# xmlToList() turns each <book> element into a nested list
doc_xml <- xmlParse(xml_file) %>% xmlToList()
# bind_rows() gives one row per author, so collapse back to one row per book
df_xml <- bind_rows(doc_xml) %>%
  group_by(title) %>%
  summarise(
    authors = paste0(authors, collapse = ", "),
    publication_date = unique(publication_date),
    publisher = unique(publisher)
  ) %>%
  as.data.frame()
Data frame from XML:
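A note on row order in the XML step above: `group_by()` followed by `summarise()` returns one row per group sorted by the grouping key, not in the order the books appear in the file. The three titles here already happen to be in alphabetical order, which is presumably why the result lines up with the HTML table. The sketch below uses two made-up titles (they are not part of the book list, and the names `long`, `collapsed`, and `restored` are only illustrative) to show the reordering and one possible way to restore first-appearance order with `match()`.

```r
library(dplyr)

# Hypothetical long-format data: one row per (title, author) pair,
# deliberately NOT in alphabetical order by title.
long <- data.frame(
  title = c("Zebra Book", "Zebra Book", "Apple Book"),
  author = c("Author One", "Author Two", "Author Three"),
  stringsAsFactors = FALSE
)

# summarise() returns the groups sorted by the grouping key.
collapsed <- long %>%
  group_by(title) %>%
  summarise(authors = paste0(author, collapse = ", ")) %>%
  as.data.frame()

# Restore the order in which each title first appeared in the input.
original_order <- unique(long$title)
restored <- collapsed[match(original_order, collapsed$title), ]
rownames(restored) <- NULL
```

With a book list that is not alphabetical, a reordering step like this would be needed before `identical()` could match the HTML and JSON data frames.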
Load books.json into a data frame:
json_file <- "books.json"
# fromJSON() simplifies the JSON array of objects into a data frame
df_json <- fromJSON(json_file) %>%
  clean_names()
# convert the authors column from a list to comma-separated strings:
df_json$authors <- lapply(df_json$authors, function(x) {
  paste0(x, collapse = ", ")
}) %>%
  unlist()
Data frame from JSON:
Compare the three data frames:
identical(df_html, df_xml)
## [1] TRUE
identical(df_html, df_json)
## [1] TRUE
All three data frames are identical, so the HTML, XML, and JSON files load into the same structure.
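Had one of the `identical()` checks returned FALSE, `all.equal()` would be a convenient way to find out why, since it describes the differences instead of returning a bare logical. Here is a minimal base R sketch using a hypothetical one-row data frame (the names `df_flat`, `df_list`, `before`, `diff_report`, and `after` are illustrative only). It reproduces the case the JSON step guards against: an authors column that is still a list.

```r
# Authors already collapsed into a single string, as in df_html.
df_flat <- data.frame(
  title = "Deep Learning",
  authors = "Ian Goodfellow, Yoshua Bengio, Aaron Courville",
  stringsAsFactors = FALSE
)

# Same book, but authors kept as a list column, as JSON parsing produces.
df_list <- data.frame(title = "Deep Learning", stringsAsFactors = FALSE)
df_list$authors <- list(c("Ian Goodfellow", "Yoshua Bengio", "Aaron Courville"))

# The column types differ, so the data frames do not match yet.
before <- identical(df_flat, df_list)
diff_report <- all.equal(df_flat, df_list)  # description of the mismatch, not TRUE

# Collapse the list column to one string per row, then compare again.
df_list$authors <- vapply(df_list$authors, paste0, character(1), collapse = ", ")
after <- identical(df_flat, df_list)
```

Printing `diff_report` shows which column differs, which is usually enough to spot a leftover list column or a header name that was not cleaned.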