Selecting Books

I’m interested in Artificial Intelligence. Here are the three books that I chose:

  1. Title: “Artificial Intelligence: A Modern Approach”

    Authors: Stuart Russell, Peter Norvig

    Attributes: Publication date: December 11, 2009; Publisher: Pearson Education

  2. Title: “Deep Learning”

    Authors: Ian Goodfellow, Yoshua Bengio, Aaron Courville

    Attributes: Publication date: November 18, 2016; Publisher: MIT Press

  3. Title: “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems”

    Author: Aurelien Geron

    Attributes: Publication date: March 2019; Publisher: O’Reilly Media

Creating files by hand

I’ve created three files by hand: books.html, books.xml, and books.json.

This is the content of each of the files:

books.html

<table>
  <tr>
    <th>Title</th>
    <th>Authors</th>
    <th>Publication Date</th>
    <th>Publisher</th>
  </tr>
  <tr>
    <td>Artificial Intelligence: A Modern Approach</td>
    <td>Stuart Russell, Peter Norvig</td>
    <td>December 11, 2009</td>
    <td>Pearson Education</td>
  </tr>
  <tr>
    <td>Deep Learning</td>
    <td>Ian Goodfellow, Yoshua Bengio, Aaron Courville</td>
    <td>November 18, 2016</td>
    <td>MIT Press</td>
  </tr>
  <tr>
    <td>Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems</td>
    <td>Aurelien Geron</td>
    <td>March 2019</td>
    <td>O'Reilly Media</td>
  </tr>
</table>

books.xml

<books>
  <book>
    <title>Artificial Intelligence: A Modern Approach</title>
    <authors>
      <author>Stuart Russell</author>
      <author>Peter Norvig</author>
    </authors>
    <publication_date>December 11, 2009</publication_date>
    <publisher>Pearson Education</publisher>
  </book>
  <book>
    <title>Deep Learning</title>
    <authors>
      <author>Ian Goodfellow</author>
      <author>Yoshua Bengio</author>
      <author>Aaron Courville</author>
    </authors>
    <publication_date>November 18, 2016</publication_date>
    <publisher>MIT Press</publisher>
  </book>
  <book>
    <title>Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems</title>
    <authors>
      <author>Aurelien Geron</author>
    </authors>
    <publication_date>March 2019</publication_date>
    <publisher>O'Reilly Media</publisher>
  </book>
</books>

books.json

[
  {
    "title": "Artificial Intelligence: A Modern Approach",
    "authors": [
      "Stuart Russell",
      "Peter Norvig"
    ],
    "publication_date": "December 11, 2009",
    "publisher": "Pearson Education"
  },
  {
    "title": "Deep Learning",
    "authors": [
      "Ian Goodfellow",
      "Yoshua Bengio",
      "Aaron Courville"
    ],
    "publication_date": "November 18, 2016",
    "publisher": "MIT Press"
  },
  {
    "title": "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems",
    "authors": [
      "Aurelien Geron"
    ],
    "publication_date": "March 2019",
    "publisher": "O'Reilly Media"
  }
]
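Because the files are written by hand, a quick syntax check can catch a missing bracket or tag before loading. The sketch below is only an illustration: it uses small inline strings (hypothetical stand-ins for books.json and books.xml) rather than the real files, and the names json_text, xml_text, json_ok and xml_ok are made up for this example.

```r
library(jsonlite)
library(XML)

# Hypothetical miniature inputs standing in for books.json and books.xml.
json_text <- '[{"title": "Deep Learning", "authors": ["Ian Goodfellow", "Yoshua Bengio"]}]'
xml_text  <- "<books><book><title>Deep Learning</title></book></books>"

# jsonlite::validate() returns TRUE when the string is syntactically valid JSON.
json_ok <- isTRUE(jsonlite::validate(json_text))

# xmlParse() signals an error on malformed XML, so tryCatch() turns that into FALSE.
xml_ok <- tryCatch({
  xmlParse(xml_text, asText = TRUE)
  TRUE
}, error = function(e) FALSE)
```

For the real files, the same idea should apply after reading their contents into a string, for example with `paste(readLines("books.json"), collapse = "\n")`.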

Loading information into R data frames

Load required packages:

library(XML)
library(jsonlite)
library(rvest)
library(dplyr)
library(janitor)

HTML

Load books.html into a data frame:

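html_file <- "books.html"
# html_table() returns a list with one table per <table> element
html_table <- read_html(html_file) %>% html_table()
# Take the first (and only) table and convert the headers to snake_case
df_html <- html_table[[1]] %>% 
  as.data.frame() %>%
  clean_names()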
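To see what html_table() does without needing the file on disk, here is a minimal sketch on an inline HTML string. The string html_text and the two-column table are made up for illustration; only read_html() and html_table() come from the code above.

```r
library(rvest)

# Hypothetical two-column table, much smaller than books.html.
html_text <- "<table>
  <tr><th>Title</th><th>Publisher</th></tr>
  <tr><td>Deep Learning</td><td>MIT Press</td></tr>
</table>"

# read_html() also accepts literal HTML; html_table() returns a list of tables.
tables <- html_table(read_html(html_text))

# The <th> row becomes the column names; the <td> row becomes the data.
df_sketch <- as.data.frame(tables[[1]])
```

The column names come out as written in the header row (Title, Publisher), which is why the real code passes the result through clean_names() to get lower-case names matching the XML and JSON versions.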

Data frame from HTML:

XML

Load books.xml into a data frame:

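xml_file <- "books.xml"
# Parse the XML and convert it to a nested list with one element per <book>
doc_xml <- xmlParse(xml_file) %>% xmlToList()

# Bind the books into rows, then reduce to one row per title,
# joining each book's authors into a single comma-separated string
df_xml <- bind_rows(doc_xml) %>%
  group_by(title) %>%
  summarise(
    authors = paste0(authors, collapse = ", "),
    publication_date = unique(publication_date),
    publisher = unique(publisher)
  ) %>%
  as.data.frame()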
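An alternative that makes the author handling more explicit is to walk the book nodes with XPath and build one row per book. This is a sketch of a different technique, not the approach used above; the inline xml_text and the helper book_to_row() are hypothetical names introduced for the example.

```r
library(XML)

# Hypothetical inline XML with the same structure as books.xml.
xml_text <- "<books>
  <book>
    <title>Deep Learning</title>
    <authors>
      <author>Ian Goodfellow</author>
      <author>Yoshua Bengio</author>
    </authors>
    <publication_date>November 18, 2016</publication_date>
    <publisher>MIT Press</publisher>
  </book>
  <book>
    <title>Artificial Intelligence: A Modern Approach</title>
    <authors>
      <author>Stuart Russell</author>
      <author>Peter Norvig</author>
    </authors>
    <publication_date>December 11, 2009</publication_date>
    <publisher>Pearson Education</publisher>
  </book>
</books>"

doc <- xmlParse(xml_text, asText = TRUE)
books <- getNodeSet(doc, "//book")

# Turn one <book> node into a one-row data frame.
book_to_row <- function(book) {
  authors <- xpathSApply(book, "./authors/author", xmlValue)
  data.frame(
    title = xmlValue(book[["title"]]),
    authors = paste(authors, collapse = ", "),
    publication_date = xmlValue(book[["publication_date"]]),
    publisher = xmlValue(book[["publisher"]]),
    stringsAsFactors = FALSE
  )
}

# Stack the rows in document order.
df_sketch <- do.call(rbind, lapply(books, book_to_row))
```

Unlike group_by(title), this keeps the books in the order they appear in the file instead of sorting them by title.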

Data frame from XML:

JSON

Load books.json into a data frame:

json_file <- "books.json"
df_json <- fromJSON(json_file) %>%
  clean_names()

# fromJSON() stores authors as a list column (one character vector per book);
# collapse each vector into a single comma-separated string:
df_json$authors <- vapply(df_json$authors, paste, character(1), collapse = ", ")
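The reason the authors column needs this extra step is that fromJSON() turns each JSON array of names into a character vector and keeps them together as a list column. A minimal sketch on an inline string shows this; json_text is made up, and the second title is shortened for the example.

```r
library(jsonlite)

# Hypothetical inline JSON with the same shape as books.json.
json_text <- '[
  {"title": "Deep Learning",
   "authors": ["Ian Goodfellow", "Yoshua Bengio", "Aaron Courville"]},
  {"title": "Hands-On Machine Learning",
   "authors": ["Aurelien Geron"]}
]'

df_sketch <- fromJSON(json_text)

# authors is a list column: one character vector per book.
authors_is_list <- is.list(df_sketch$authors)

# Collapse each vector to a single string so the column is plain character.
df_sketch$authors <- vapply(df_sketch$authors, paste, character(1), collapse = ", ")
```

After the collapse, the column holds one string per book, which matches how the authors appear in the HTML table.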

Data frame from JSON:

Check if all data frames are identical

identical(df_html, df_xml)
## [1] TRUE
identical(df_html, df_json)
## [1] TRUE

Both comparisons return TRUE, so all three data frames are identical.
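It is worth knowing that identical() is a strict test: it compares column types and attributes as well as values. The sketch below, with made-up data frames df_a and df_b, shows a case where the values match but the types do not, and how all.equal() from base R offers a more tolerant comparison.

```r
# Two hypothetical data frames with the same values but different column types.
df_a <- data.frame(n = 1:2)      # integer column
df_b <- data.frame(n = c(1, 2))  # double column

# identical() is FALSE because integer and double are different types.
strict_match <- identical(df_a, df_b)

# all.equal() compares numeric values within a tolerance; wrap it in isTRUE()
# because it returns a description of the differences instead of FALSE.
loose_match <- isTRUE(all.equal(df_a, df_b))
```

If one of the comparisons above had returned FALSE, calling all.equal() on the same pair would be one way to see what differs.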