I am strongly considering probability as the topic for my books. It is my favorite subject in math, and I believe I can find probability books with multiple authors, unlike my initial choice of fantasy books. I love fantasy, but those books usually have only one author. I am going to finalize my choice of books on Friday and then start sorting out the information from each book that I want to include in the HTML and JSON files. I will then create the HTML and JSON files manually, making sure the table structure in each is correct and is handled properly in R. After that, I will load both files into two separate data frames and compare them to see if they are identical, looking at their structure, their stored information, and how they behave with different R functions.
This will be my first time making a table in HTML and JSON, so there may be a learning curve. Making sure that the structure in both files is correct may also be a challenge. Another challenge will be spotting any differences between the two when analyzing them in R.
library(jsonlite)
## Warning: package 'jsonlite' was built under R version 4.4.3
library(rvest)
## Warning: package 'rvest' was built under R version 4.4.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tibble' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.3
## Warning: package 'forcats' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Bookhtml <- read_html("C:/Users/typem/Documents/GitHub/Data607_HTML_JSON/books.html")
htmldf <- html_table(Bookhtml)[[1]]
BookJson <- fromJSON("C:/Users/typem/Documents/GitHub/Data607_HTML_JSON/books.json")
JsonDF <- as.data.frame(BookJson$rows)
colnames(JsonDF) <- BookJson$columns
dim(htmldf)
## [1] 3 5
dim(JsonDF)
## [1] 3 5
The two data frames have the same dimensions: 3 rows and 5 columns.
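Matching dimensions do not guarantee matching column types, so a column-by-column class check is a useful next step. The sketch below uses small stand-in data frames (html_like and json_like are hypothetical names with simplified columns, not the actual htmldf and JsonDF) and assumes the JSON side was read in with every column as character.

```r
# Stand-in for a table parsed from HTML, where types were already guessed
html_like <- data.frame(
  id     = 1:3,
  title  = c("A", "B", "C"),
  year   = c(2001L, 2008L, 2003L),
  rating = c(4.08, 3.95, 4.41)
)

# Stand-in for the same table with every column stored as character (assumption)
json_like <- data.frame(
  id     = c("1", "2", "3"),
  title  = c("A", "B", "C"),
  year   = c("2001", "2008", "2003"),
  rating = c("4.08", "3.95", "4.41")
)

# Take the first class of each column so every column yields one string
first_class <- function(col) class(col)[1]

class_table <- data.frame(
  column = names(html_like),
  html   = vapply(html_like, first_class, character(1)),
  json   = vapply(json_like, first_class, character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Columns whose class differs between the two sources
mismatched <- class_table$column[class_table$html != class_table$json]

print(class_table)
print(mismatched)
```

On the real objects, the same idea would apply by swapping in htmldf and JsonDF, provided both have the same column names in the same order.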
htmldf
## # A tibble: 3 × 5
## `#` Title `Author(s)` Year Rating
## <int> <chr> <chr> <int> <dbl>
## 1 1 Fooled by Randomness: The Hidden Role of Chanc… Nassim Nic… 2001 4.08
## 2 2 The Drunkard's Walk: How Randomness Rules Our … Leonard Ml… 2008 3.95
## 3 3 Probability Theory: The Logic of Science E.T. Jayne… 2003 4.41
JsonDF
## # Title
## 1 1 Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets
## 2 2 The Drunkard's Walk: How Randomness Rules Our Lives
## 3 3 Probability Theory: The Logic of Science
## Author(s) Year Rating
## 1 Nassim Nicholas Taleb 2001 4.08
## 2 Leonard Mlodinow 2008 3.95
## 3 E.T. Jaynes, G. Larry Bretthorst 2003 4.41
Comparing the two tables side by side, I notice that the numbering column (#) has different types: integer in the HTML version and character in the JSON version. The same holds for Year (integer versus character) and Rating (double versus character). Other than that, the two tables are essentially the same.
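One possible way to reconcile the type differences is to let R re-guess the column types on the JSON side and then test for equality. This is only a sketch under stated assumptions: rows stands in for BookJson$rows and is assumed to be a character matrix, the titles are shortened, the Author(s) column is left out for brevity, and json_df, json_fixed, and html_df are hypothetical names.

```r
# Stand-in for BookJson$rows: assumed to be a character matrix
rows <- rbind(
  c("1", "Fooled by Randomness", "2001", "4.08"),
  c("2", "The Drunkard's Walk",  "2008", "3.95"),
  c("3", "Probability Theory",   "2003", "4.41")
)
cols <- c("#", "Title", "Year", "Rating")

# Same construction as in the report: matrix -> data frame -> set names
json_df <- as.data.frame(rows, stringsAsFactors = FALSE)
colnames(json_df) <- cols

# Re-guess each column's type from its text;
# as.is = TRUE keeps non-numeric text as character instead of factor
json_fixed <- type.convert(json_df, as.is = TRUE)

# Stand-in for the HTML table, with the types reported in the tibble header
html_df <- data.frame(
  `#`    = 1:3,
  Title  = c("Fooled by Randomness", "The Drunkard's Walk", "Probability Theory"),
  Year   = c(2001L, 2008L, 2003L),
  Rating = c(4.08, 3.95, 4.41),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

print(sapply(json_fixed, class))
print(all.equal(json_fixed, html_df))
```

Because html_table() returned a tibble here (as the printed header shows), the real htmldf would probably need as.data.frame(htmldf) before a comparison like this, since the object classes would otherwise differ even when the values match.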