Assignment 6

HTML Table Load

library(rvest)
library(knitr)
html_table <- as.data.frame(read_html("books.html") |> html_table(fill = TRUE))
kable(html_table)

Title	Authors	Favorite.Attributes
Linear Algebra and Its Applications	David C. Lay, Steven R. Lay, Judi J. McDonald	Simple Explinations, Transition to Advance topics
Calculus Illustrated. Volume 2: Differential Calculus	Peter Saveliev	Visuals, Format of questions, Great background information on each toppic
Statistics: Principles and Methods	Richard A. Johnson, Gouri K. Bhattacharyya	Practical, Basics coverage

is.data.frame(html_table)

## [1] TRUE

XML Table Load

read_xml() returns an XML document object rather than a data frame, so the result must be converted.

library(xml2)
xml_file <- "books.xml"
books_xml <- read_xml(xml_file)
is.data.frame(books_xml)

## [1] FALSE

Convert XML List to Data Frame

## Re-read the file as a nested list so that child nodes become list elements too
library(dplyr)
library(tidyr)
books_xml <- as_list(read_xml(xml_file))

xml_book_df <- tibble::as_tibble(books_xml) |>
               mutate(number = row_number()) |>
               unnest_longer(books)

df_unt_1 <- xml_book_df  |>
            unnest_longer( col = books, names_repair = "minimal") |>
            select(c(1,3,4)) 

df_unt_2 <- df_unt_1  |> 
            filter(books_id != "title") |>
            unnest_longer( col = books, names_repair = "minimal") 

book_df <- rbind(df_unt_1 |>
           filter(books_id == "title"), df_unt_2)

books_df <- book_df |> 
  pivot_wider(
    names_from = books_id,
    values_from = books
  )

## Warning: Values from `books` are not uniquely identified; output will contain list-cols.
## • Use `values_fn = list` to suppress this warning.
## • Use `values_fn = {summary_fun}` to summarise duplicates.
## • Use the following dplyr code to identify duplicates.
##   {data} |>
##   dplyr::summarise(n = dplyr::n(), .by = c(number, books_id)) |>
##   dplyr::filter(n > 1L)

books_df <- books_df |>
              unnest_longer(col = c(title)) |> 
              unnest_longer(col = c(authors)) |> 
              unnest_longer(col = c(favoriteAttributes)) |>
              select(2,4,6)

kable(books_df)

title	authors	favoriteAttributes
Linear Algebra and Its Applications	David C. Lay	Simple Explanations
Linear Algebra and Its Applications	David C. Lay	Transition to Advanced Topics
Linear Algebra and Its Applications	Steven R. Lay	Simple Explanations
Linear Algebra and Its Applications	Steven R. Lay	Transition to Advanced Topics
Linear Algebra and Its Applications	Judi J. McDonald	Simple Explanations
Linear Algebra and Its Applications	Judi J. McDonald	Transition to Advanced Topics
Calculus Illustrated. Volume 2: Differential Calculus	Peter Saveliev	Visuals
Calculus Illustrated. Volume 2: Differential Calculus	Peter Saveliev	Format of Questions
Calculus Illustrated. Volume 2: Differential Calculus	Peter Saveliev	Great Background Information on Each Topic
Statistics: Principles and Methods	Richard A. Johnson	Practical
Statistics: Principles and Methods	Richard A. Johnson	Basics Coverage
Statistics: Principles and Methods	Gouri K. Bhattacharyya	Practical
Statistics: Principles and Methods	Gouri K. Bhattacharyya	Basics Coverage

is.data.frame(books_df)

## [1] TRUE
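
As a point of comparison with the nested-list conversion above, here is a minimal sketch that pulls the fields out with XPath queries from xml2. It is not the method used in this assignment. The inline XML is a hypothetical stand-in for books.xml, and the child tag names (`author`, `attribute`) are assumptions; the `*` wildcard in the XPath means the real child names do not have to match.

```r
library(xml2)

# Hypothetical XML that mirrors the assumed layout of books.xml.
xml_input <- '<books>
  <book>
    <title>Linear Algebra and Its Applications</title>
    <authors><author>David C. Lay</author><author>Steven R. Lay</author></authors>
    <favoriteAttributes><attribute>Simple Explanations</attribute></favoriteAttributes>
  </book>
  <book>
    <title>Statistics: Principles and Methods</title>
    <authors><author>Richard A. Johnson</author></authors>
    <favoriteAttributes><attribute>Practical</attribute><attribute>Basics Coverage</attribute></favoriteAttributes>
  </book>
</books>'

doc <- read_xml(xml_input)
book_nodes <- xml_find_all(doc, ".//book")

# Join the text of every node matched by `xpath` under one book into one string.
collapse_children <- function(node, xpath) {
  paste(xml_text(xml_find_all(node, xpath)), collapse = ", ")
}

# One row per book; multi-valued fields become comma-separated strings.
books_from_xml <- data.frame(
  title = xml_text(xml_find_first(book_nodes, "./title")),
  authors = vapply(book_nodes, collapse_children, character(1),
                   xpath = "./authors/*"),
  favoriteAttributes = vapply(book_nodes, collapse_children, character(1),
                              xpath = "./favoriteAttributes/*"),
  stringsAsFactors = FALSE
)
```

This keeps one row per book, like the HTML and JSON results, instead of one row per author and attribute combination.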

JSON Table Load

library(jsonlite)
books_data <- fromJSON("books.json")
json_books_df <- as.data.frame(books_data)
kable(json_books_df)

books.title	books.authors	books.favoriteAttributes
Linear Algebra and Its Applications	David C. Lay , Steven R. Lay , Judi J. McDonald	Simple Explanations , Transition to Advanced Topics
Calculus Illustrated. Volume 2: Differential Calculus	Peter Saveliev	Visuals , Format of Questions , Great Background Information on Each Topic
Statistics: Principles and Methods	Richard A. Johnson , Gouri K. Bhattacharyya	Practical , Basics Coverage

is.data.frame(json_books_df)

## [1] TRUE
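
The " , " separators in the table above suggest that the authors and favorite attributes are stored as list-columns, although that is an inference from the printed output rather than something checked here. Under that assumption, a minimal sketch of expanding one list-column into one row per value with tidyr's unnest_longer() could look like this. The `json_like` tibble is a hypothetical stand-in for json_books_df.

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for json_books_df: authors is a list-column.
json_like <- tibble(
  title = c("Calculus Illustrated. Volume 2: Differential Calculus",
            "Statistics: Principles and Methods"),
  authors = list("Peter Saveliev",
                 c("Richard A. Johnson", "Gouri K. Bhattacharyya"))
)

# Expand the list-column so each author gets a row; title is repeated.
authors_long <- unnest_longer(json_like, authors)
```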

Conclusion

JSON and HTML both had libraries that load their contents almost directly into an R data frame. The result is not perfect: the books still need some tidying, since several authors and favorite attributes share a single row, and separating them can be beneficial for analysis. XML, on the other hand, differs more from a data frame in structure, which made it a little more complicated to load into R: many of the values arrive as nested lists, so they have to be unnested.
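
The tidying step mentioned above can be sketched for the HTML case, where multi-valued fields arrive as comma-separated strings. This is only an illustration on a hypothetical two-row stand-in (`books_wide`), and it assumes tidyr 1.3.0 or later, where separate_longer_delim() is available.

```r
library(tidyr)

# Hypothetical stand-in for the HTML result: one row per book,
# with several authors packed into a single comma-separated string.
books_wide <- data.frame(
  Title = c("Linear Algebra and Its Applications",
            "Statistics: Principles and Methods"),
  Authors = c("David C. Lay, Steven R. Lay, Judi J. McDonald",
              "Richard A. Johnson, Gouri K. Bhattacharyya"),
  stringsAsFactors = FALSE
)

# Split each string on ", " so that every author gets a separate row.
books_long <- separate_longer_delim(books_wide, Authors, delim = ", ")
```

The same call could be repeated on the favorite attributes column, at the cost of producing one row per author and attribute combination, as in the XML result.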