HTML Table Load
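# html_table() returns a list of tables; as.data.frame() flattens the single table into a data frame
html_table <- as.data.frame(read_html("books.html") |> html_table(fill = TRUE))
kable(html_table)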
| Linear Algebra and Its Applications | David C. Lay, Steven R. Lay, Judi J. McDonald | Simple Explinations, Transition to Advance topics |
| Calculus Illustrated. Volume 2: Differential Calculus | Peter Saveliev | Visuals, Format of questions, Great background information on each toppic |
| Statistics: Principles and Methods | Richard A. Johnson, Gouri K. Bhattacharyya | Practical, Basics coverage |
is.data.frame(html_table)
## [1] TRUE
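The load step above can be tried without the original file. The following is a minimal sketch, assuming a small inline table whose markup and values are hypothetical (the real books.html is not shown here). It illustrates that `html_table()` returns a list of tables, so a single table can also be picked out by position before converting it.

```r
library(rvest)

# Hypothetical stand-in for books.html; the real file's columns may differ.
html_input <- '<table>
  <tr><th>title</th><th>authors</th></tr>
  <tr><td>Book A</td><td>Ann, Bob</td></tr>
  <tr><td>Book B</td><td>Cy</td></tr>
</table>'

# html_table() on a whole document returns a list with one tibble per <table>.
tables <- read_html(html_input) |> html_table()

# Take the first (and only) table and convert it to a base data frame.
first_table <- as.data.frame(tables[[1]])
```

Selecting `tables[[1]]` explicitly is one way to make the intent clear if a page ever contains more than one table.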
XML Table Load
The XML file loads as an xml_document object rather than a data frame, so it must be converted.
xml_file <- "books.xml"
books_xml <- read_xml(xml_file)
is.data.frame(books_xml)
## [1] FALSE
Convert XML List to dataframe
## Reload with as_list() so the nested nodes become nested lists as well
books_xml <- as_list(read_xml(xml_file))
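# Build a tibble from the nested list, number the books, and unnest one level
xml_book_df <- tibble::as_tibble(books_xml) |>
  mutate(number = row_number()) |>
  unnest_longer(books)
# Unnest a second level and keep three columns by position
df_unt_1 <- xml_book_df |>
  unnest_longer(col = books, names_repair = "minimal") |>
  select(c(1, 3, 4))
# Non-title entries need one more level of unnesting
df_unt_2 <- df_unt_1 |>
  filter(books_id != "title") |>
  unnest_longer(col = books, names_repair = "minimal")
# Stack the title rows back together with the unnested rows
book_df <- rbind(
  df_unt_1 |> filter(books_id == "title"),
  df_unt_2
)
# Pivot so each books_id value becomes its own column
books_df <- book_df |>
  pivot_wider(
    names_from = books_id,
    values_from = books
  )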
## Warning: Values from `books` are not uniquely identified; output will contain list-cols.
## • Use `values_fn = list` to suppress this warning.
## • Use `values_fn = {summary_fun}` to summarise duplicates.
## • Use the following dplyr code to identify duplicates.
## {data} |>
## dplyr::summarise(n = dplyr::n(), .by = c(number, books_id)) |>
## dplyr::filter(n > 1L)
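The warning above appears because several values share the same combination of `number` and `books_id`, so `pivot_wider()` falls back to list-columns. As a rough sketch, assuming a tiny hypothetical data frame with the same column names, passing `values_fn = list` asks for list-columns explicitly and silences the warning:

```r
library(tidyr)

# Hypothetical long-format data: one book with one title and two authors.
long_df <- data.frame(
  number = c(1, 1, 1),
  books_id = c("title", "authors", "authors"),
  books = c("Book A", "Ann", "Bob"),
  stringsAsFactors = FALSE
)

# values_fn = list wraps each group of values in a list cell, with no warning.
wide_df <- pivot_wider(
  long_df,
  names_from = books_id,
  values_from = books,
  values_fn = list
)
```

The list-columns produced this way can then be expanded with `unnest_longer()`, in the same way as the unnesting that follows.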
books_df <- books_df |>
  unnest_longer(col = c(title)) |>
  unnest_longer(col = c(authors)) |>
  unnest_longer(col = c(favoriteAttributes)) |>
  select(2, 4, 6)
kable(books_df)
| Linear Algebra and Its Applications | David C. Lay | Simple Explanations |
| Linear Algebra and Its Applications | David C. Lay | Transition to Advanced Topics |
| Linear Algebra and Its Applications | Steven R. Lay | Simple Explanations |
| Linear Algebra and Its Applications | Steven R. Lay | Transition to Advanced Topics |
| Linear Algebra and Its Applications | Judi J. McDonald | Simple Explanations |
| Linear Algebra and Its Applications | Judi J. McDonald | Transition to Advanced Topics |
| Calculus Illustrated. Volume 2: Differential Calculus | Peter Saveliev | Visuals |
| Calculus Illustrated. Volume 2: Differential Calculus | Peter Saveliev | Format of Questions |
| Calculus Illustrated. Volume 2: Differential Calculus | Peter Saveliev | Great Background Information on Each Topic |
| Statistics: Principles and Methods | Richard A. Johnson | Practical |
| Statistics: Principles and Methods | Richard A. Johnson | Basics Coverage |
| Statistics: Principles and Methods | Gouri K. Bhattacharyya | Practical |
| Statistics: Principles and Methods | Gouri K. Bhattacharyya | Basics Coverage |
is.data.frame(books_df)
## [1] TRUE
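As an alternative to the nested-list unnesting used above (not the approach taken in this report), the same kind of file can be read with XPath queries from xml2. This is only a sketch: the inline XML below and its tag names (`book`, `author`, `attribute`) are assumptions, since the structure of the real books.xml is not shown.

```r
library(xml2)

# Hypothetical miniature of books.xml; the real file's tag names may differ.
xml_input <- '<books>
  <book>
    <title>Book A</title>
    <authors><author>Ann</author><author>Bob</author></authors>
    <favoriteAttributes><attribute>Clear</attribute></favoriteAttributes>
  </book>
  <book>
    <title>Book B</title>
    <authors><author>Cy</author></authors>
    <favoriteAttributes><attribute>Short</attribute><attribute>Visual</attribute></favoriteAttributes>
  </book>
</books>'

doc <- read_xml(xml_input)
book_nodes <- xml_find_all(doc, "//book")

# Collapse all nodes matching an XPath under one book into a single string.
collapse_children <- function(node, xpath) {
  paste(xml_text(xml_find_all(node, xpath)), collapse = ", ")
}

# One row per book, with multi-valued fields joined by commas.
xpath_books_df <- data.frame(
  title = xml_text(xml_find_first(book_nodes, "./title")),
  authors = vapply(book_nodes, collapse_children, character(1),
                   xpath = "./authors/author"),
  favoriteAttributes = vapply(book_nodes, collapse_children, character(1),
                              xpath = "./favoriteAttributes/attribute"),
  stringsAsFactors = FALSE
)
```

This yields one row per book, similar in shape to the HTML and JSON results, rather than one row per author and attribute combination.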
JSON Table Load
books_data <- fromJSON("books.json")
json_books_df <- as.data.frame(books_data)
kable(json_books_df)
| Linear Algebra and Its Applications | David C. Lay, Steven R. Lay, Judi J. McDonald | Simple Explanations, Transition to Advanced Topics |
| Calculus Illustrated. Volume 2: Differential Calculus | Peter Saveliev | Visuals, Format of Questions, Great Background Information on Each Topic |
| Statistics: Principles and Methods | Richard A. Johnson, Gouri K. Bhattacharyya | Practical, Basics Coverage |
is.data.frame(json_books_df)
## [1] TRUE
Conclusion
JSON and HTML each had a library that loads the file contents directly
into an R data frame. The result is not perfect: the books still need
some tidying, because many of the authors and favorite attributes share
a single row, and separating them onto their own rows can be beneficial
for analysis. XML, on the other hand, differs more from a data frame in
structure, which makes loading it into R a little more complicated:
many of the values arrive as nested lists, so they have to be
unnested.
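The tidying step mentioned for the HTML and JSON results could look roughly like the sketch below. It assumes the multi-valued cells are plain comma-separated strings and uses a small hypothetical data frame rather than the real books data. If a column is instead a list-column, `unnest_longer()` would be the matching tool. `separate_longer_delim()` requires tidyr 1.3.0 or later.

```r
library(tidyr)

# Hypothetical stand-in: multi-valued cells stored as comma-separated strings.
wide_books <- data.frame(
  title = c("Book A", "Book B"),
  authors = c("Ann, Bob", "Cy"),
  favoriteAttributes = c("Clear", "Short, Visual"),
  stringsAsFactors = FALSE
)

# Split each multi-valued column into one value per row.
long_books <- wide_books |>
  separate_longer_delim(authors, delim = ", ") |>
  separate_longer_delim(favoriteAttributes, delim = ", ")
```

Each book then contributes one row per author and attribute combination, which matches the shape of the XML result.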