Week7AssignmentCode

Author

Jonnathan Zuna Largo

Introduction

My goal is to create an HTML file and a JSON file by hand, both holding the same book data in the same structure. After creating the files, I will use R to load each one into a data frame and check that both give identical results. The assignment requires three books; my choices are “Astrophysics for people in a hurry,” “Atomic Habits,” and “An Introduction to Statistical Learning.” For each book I will look up the authors, publication year, publisher, and ISBN, then create the files and compare the results.

Body

Files

In this section I’ll create the HTML file and the JSON file using the information for each book I selected.

<!DOCTYPE html>
<html>
  <head>
    <title>Books</title>
  </head>
  <body>
    <table>
      <tr>
        <th>Title</th>
        <th>Authors</th>
        <th>Year</th>
        <th>Publisher</th>
        <th>ISBN</th>
      </tr>
      <tr>
        <td>Astrophysics for people in a hurry</td>
        <td>Neil deGrasse Tyson</td>
        <td>2017</td>
        <td>W. W. Norton & Company</td>
        <td>978-0393609394</td>
      </tr>
      <tr>
        <td>Atomic Habits</td>
        <td>James Clear</td>
        <td>2018</td>
        <td>Random House Business Books</td>
        <td>978-1847941848</td>
      </tr>
      <tr>
        <td>An Introduction to Statistical Learning</td>
        <td>Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani</td>
        <td>2021</td>
        <td>Springer</td>
        <td>978-1071614174</td>
      </tr>
    </table>
  </body>
</html>

[
  {
    "Title": "Astrophysics for people in a hurry",
    "Authors": "Neil deGrasse Tyson",
    "Year": 2017,
    "Publisher": "W. W. Norton & Company",
    "ISBN": "978-0393609394"
  },
  {
    "Title": "Atomic Habits",
    "Authors": "James Clear",
    "Year": 2018,
    "Publisher": "Random House Business Books",
    "ISBN": "978-1847941848"
  },
  {
    "Title": "An Introduction to Statistical Learning",
    "Authors": "Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani",
    "Year": 2021,
    "Publisher": "Springer",
    "ISBN": "978-1071614174"
  }
]

Code

For the HTML file, the rvest package is used. The function read_html() parses the file, and html_table() extracts every table it finds and returns them as a list, so [[1]] selects the first (and here the only) table.

library(rvest)
html_raw <- read_html("BookInformation.html")

df_html <- html_table(html_raw)[[1]]
head(df_html)

# A tibble: 3 × 5
  Title                                   Authors           Year Publisher ISBN 
  <chr>                                   <chr>            <int> <chr>     <chr>
1 Astrophysics for people in a hurry      Neil deGrasse T…  2017 W. W. No… 978-…
2 Atomic Habits                           James Clear       2018 Random H… 978-…
3 An Introduction to Statistical Learning Gareth James, D…  2021 Springer  978-…
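
As a small variation on the step above, the table can also be picked out by a CSS selector before it is converted, which avoids indexing into a list. This is only a sketch: the inline page built with minimal_html() is a hypothetical stand-in for BookInformation.html, trimmed to two columns and one row.

```r
library(rvest)

# Hypothetical inline page standing in for BookInformation.html
page <- minimal_html('
  <table>
    <tr><th>Title</th><th>Year</th></tr>
    <tr><td>Atomic Habits</td><td>2018</td></tr>
  </table>')

# html_element() returns the first node that matches the selector,
# so html_table() gives back a single tibble instead of a list
books_tbl <- html_table(html_element(page, "table"))
str(books_tbl)
```

If a page held several tables, a more specific selector (for example an id on the table) would be needed; the file in this report has only one.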

For the JSON file, the jsonlite package is used. The function fromJSON() reads the file and converts the array of objects into a data frame.

library(jsonlite)
df_json <- fromJSON("BookInformation.json")

head(df_json)

                                    Title
1      Astrophysics for people in a hurry
2                           Atomic Habits
3 An Introduction to Statistical Learning
                                                         Authors Year
1                                            Neil deGrasse Tyson 2017
2                                                    James Clear 2018
3 Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani 2021
                    Publisher           ISBN
1      W. W. Norton & Company 978-0393609394
2 Random House Business Books 978-1847941848
3                    Springer 978-1071614174
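
To see how fromJSON() maps JSON values to column types, here is a minimal sketch that parses a JSON string directly instead of a file. The string json_txt is a hypothetical, shortened copy of the book data (two records, three fields), not the full BookInformation.json.

```r
library(jsonlite)

# Hypothetical, shortened copy of the book data as a JSON string
json_txt <- '[
  {"Title": "Atomic Habits", "Year": 2018, "ISBN": "978-1847941848"},
  {"Title": "An Introduction to Statistical Learning", "Year": 2021, "ISBN": "978-1071614174"}
]'

# An array of objects with the same keys is simplified into a data frame
books <- fromJSON(json_txt)

# Quoted values such as ISBN stay character; unquoted numbers become numeric
str(books)
```

Keeping the ISBN in quotes in the JSON file is what keeps it a character column, so the hyphen is preserved.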

First, I want to compare the column names of both data frames side by side.

names(df_html)

[1] "Title"     "Authors"   "Year"      "Publisher" "ISBN"

names(df_json)

[1] "Title"     "Authors"   "Year"      "Publisher" "ISBN"

Both data frames have the same column names in the same order, so now I can compare their classes.

class(df_html)

[1] "tbl_df"     "tbl"        "data.frame"

class(df_json)

[1] "data.frame"
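
class() on the whole object only reports the container class. The type of each column can be checked separately; the sketch below does that for two small data frames. df_a and df_b are hypothetical stand-ins, built so that one column differs in type on purpose.

```r
# Hypothetical stand-ins: Year is integer in df_a and double in df_b
df_a <- data.frame(Title = "Atomic Habits", Year = 2018L, stringsAsFactors = FALSE)
df_b <- data.frame(Title = "Atomic Habits", Year = 2018, stringsAsFactors = FALSE)

# Collect the first class of every column from each data frame
types <- data.frame(
  column = names(df_a),
  a = vapply(df_a, function(col) class(col)[1], character(1)),
  b = vapply(df_b, function(col) class(col)[1], character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Flag the columns whose types agree
types$match <- types$a == types$b
print(types)
```

The same table could be built from df_html and df_json to confirm that every column, not just the container, has the same type.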

As shown above, df_html comes back as a tibble, while df_json is a plain data frame. To compare them, I’ll convert df_html to a plain data frame, the same class as df_json.

df_html <- as.data.frame(df_html)

Now that both tables are in the same format, I’ll compare them; they should be identical.

all.equal(df_html, df_json)

[1] TRUE

identical(df_html, df_json)

[1] TRUE
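
The two checks above are not interchangeable: all.equal() compares numeric values with a small tolerance and treats integers and doubles alike, while identical() requires an exact match, including storage type. A minimal sketch with two hypothetical data frames, x and y, that differ only in how Year is stored:

```r
# Hypothetical data frames: same values, different storage type for Year
x <- data.frame(Year = c(2017L, 2018L))  # integer
y <- data.frame(Year = c(2017, 2018))    # double

# all.equal() returns TRUE or a description of the differences,
# so it is wrapped in isTRUE() to get a plain logical
values_match <- isTRUE(all.equal(x, y))

# identical() is strict and also compares the storage type
exact_match <- identical(x, y)

print(c(values_match = values_match, exact_match = exact_match))
```

Getting TRUE from both checks, as in this report, therefore says that the values and the column types agree.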

Conclusion

Working on this exercise gave me a clearer understanding of how the same data can live in two different formats and what it takes to bring both into a consistent, comparable form. Loading the files was straightforward; however, the comparison showed a subtle structural difference (a tibble versus a plain data frame) that I wouldn’t have noticed if I had not inspected the data frame classes directly. Moving forward, checking type and class consistency when loading from multiple sources will be a crucial step before working on web scraping and API projects.