Assignment #7

Approach

This assignment is designed to give experience with the HTML and JSON file formats. It calls for three books and their information. In my approach I choose the books myself and manually create the raw HTML file and the JSON file. I'll use these three books on the topic of strength training and powerlifting:

  1. Starting Strength by Mark Rippetoe. Year: 2011. Publisher: Aasgaard Company. Genre: Fitness.
  2. Practical Programming for Strength Training by Mark Rippetoe and Andy Baker. Year: 2013. Publisher: Aasgaard Company. Genre: Strength Training.
  3. The New Encyclopedia of Modern Bodybuilding by Arnold Schwarzenegger and Bill Dobbins. Year: 1998. Publisher: Simon & Schuster. Genre: Bodybuilding.

Code Base

html_content <- '<!DOCTYPE html>
<html>
  <head><title>Books</title></head>
  <body>
    <table>
      <thead>
        <tr>
          <th>Title</th><th>Authors</th><th>Year</th><th>Publisher</th><th>Genre</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td>Starting Strength</td>
          <td>Mark Rippetoe</td>
          <td>2011</td>
          <td>Aasgaard Company</td>
          <td>Fitness</td>
        </tr>
        <tr>
          <td>Practical Programming for Strength Training</td>
          <td>Mark Rippetoe, Andy Baker</td>
          <td>2013</td>
          <td>Aasgaard Company</td>
          <td>Strength Training</td>
        </tr>
        <tr>
          <td>The New Encyclopedia of Modern Bodybuilding</td>
          <td>Arnold Schwarzenegger, Bill Dobbins</td>
          <td>1998</td>
          <td>Simon &amp; Schuster</td>
          <td>Bodybuilding</td>
        </tr>
      </tbody>
    </table>
  </body>
</html>'

writeLines(html_content, "books.html")

json_content <- '{
  "books": [
    {
      "title": "Starting Strength",
      "authors": "Mark Rippetoe",
      "year": 2011,
      "publisher": "Aasgaard Company",
      "genre": "Fitness"
    },
    {
      "title": "Practical Programming for Strength Training",
      "authors": "Mark Rippetoe, Andy Baker",
      "year": 2013,
      "publisher": "Aasgaard Company",
      "genre": "Strength Training"
    },
    {
      "title": "The New Encyclopedia of Modern Bodybuilding",
      "authors": "Arnold Schwarzenegger, Bill Dobbins",
      "year": 1998,
      "publisher": "Simon & Schuster",
      "genre": "Bodybuilding"
    }
  ]
}'

writeLines(json_content, "books.json")
library(rvest)

Load HTML into a data frame

df_html <- read_html("books.html") |>
  html_element("table") |>
  html_table()

colnames(df_html) <- tolower(colnames(df_html))

df_html
# A tibble: 3 × 5
  title                                       authors       year publisher genre
  <chr>                                       <chr>        <int> <chr>     <chr>
1 Starting Strength                           Mark Rippet…  2011 Aasgaard… Fitn…
2 Practical Programming for Strength Training Mark Rippet…  2013 Aasgaard… Stre…
3 The New Encyclopedia of Modern Bodybuilding Arnold Schw…  1998 Simon & … Body…

Load JSON into a data frame

library(jsonlite)
df_json <- fromJSON("books.json")$books
df_json$year <- as.integer(df_json$year)

df_json
                                        title
1                           Starting Strength
2 Practical Programming for Strength Training
3 The New Encyclopedia of Modern Bodybuilding
                              authors year        publisher             genre
1                       Mark Rippetoe 2011 Aasgaard Company           Fitness
2           Mark Rippetoe, Andy Baker 2013 Aasgaard Company Strength Training
3 Arnold Schwarzenegger, Bill Dobbins 1998 Simon & Schuster      Bodybuilding
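
Both files store multiple authors as a single comma-separated string. If individual authors are needed later, one possible approach is to split that string in base R. This is a minimal sketch; the `authors` vector below is a hypothetical copy of the column printed above, and `author_list` and `author_counts` are illustrative names that are not part of the assignment code.

```r
# Hypothetical copy of the 'authors' column shown above.
authors <- c(
  "Mark Rippetoe",
  "Mark Rippetoe, Andy Baker",
  "Arnold Schwarzenegger, Bill Dobbins"
)

# Split on a comma followed by optional whitespace.
# strsplit() returns a list with one character vector per book.
author_list <- strsplit(authors, ",\\s*")

# Count the authors of each book.
author_counts <- lengths(author_list)
```

The same call could be applied to `df_json$authors` or `df_html$authors`, since both columns hold the same strings.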

Compare the two data frames

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
df_json <- df_json[, colnames(df_html)]

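# Are they identical? identical() also compares attributes such as class,
# so a tibble and a base data.frame differ even when every value matches.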
identical(df_html, df_json)
[1] FALSE
# What differs (if anything)?
all.equal(df_html, df_json)
[1] "Attributes: < Component \"class\": Lengths (3, 1) differ (string compare on first 1) >"
[2] "Attributes: < Component \"class\": 1 string mismatch >"                                
# Side by side
bind_rows(
  mutate(df_html, source = "HTML"),
  mutate(df_json, source = "JSON")
)
# A tibble: 6 × 6
  title                                     authors  year publisher genre source
  <chr>                                     <chr>   <int> <chr>     <chr> <chr> 
1 Starting Strength                         Mark R…  2011 Aasgaard… Fitn… HTML  
2 Practical Programming for Strength Train… Mark R…  2013 Aasgaard… Stre… HTML  
3 The New Encyclopedia of Modern Bodybuild… Arnold…  1998 Simon & … Body… HTML  
4 Starting Strength                         Mark R…  2011 Aasgaard… Fitn… JSON  
5 Practical Programming for Strength Train… Mark R…  2013 Aasgaard… Stre… JSON  
6 The New Encyclopedia of Modern Bodybuild… Arnold…  1998 Simon & … Body… JSON
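
The `all.equal()` messages above mention only the `class` attribute: `html_table()` returns a tibble (three classes) while `fromJSON()` returns a base data frame (one class). One way to compare the contents alone is to reduce both objects to plain data frames first. The sketch below uses only base R. `same_data()` is a hypothetical helper name, and the small example objects are stand-ins that mimic the class difference rather than the real `df_html` and `df_json`.

```r
# Hypothetical helper: compare two data frames by content, ignoring extra
# classes such as tibble's "tbl_df" and "tbl".
same_data <- function(a, b) {
  a <- as.data.frame(a)                 # drop classes layered over data.frame
  b <- as.data.frame(b)
  if (!setequal(colnames(a), colnames(b))) return(FALSE)
  b <- b[, colnames(a), drop = FALSE]   # align column order with 'a'
  isTRUE(all.equal(a, b))
}

# Stand-in objects: identical values, different class attribute.
base_df <- data.frame(title = "Starting Strength", year = 2011L,
                      stringsAsFactors = FALSE)
tbl_like <- base_df
class(tbl_like) <- c("tbl_df", "tbl", "data.frame")

strict_match  <- identical(base_df, tbl_like)  # compares attributes too
content_match <- same_data(base_df, tbl_like)  # compares values only
```

For the data in this assignment the analogous call would be `same_data(df_html, df_json)`.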