Week 7 Data 607

Several packages had to be installed before the HTML, XML, and JSON files could be imported, and the installation required setting the CRAN mirror manually. Once installed, the packages could be loaded.
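The setup step described above might look like the sketch below. The report does not name the packages or the mirror, so `rvest`, `xml2`, `jsonlite`, and the `https://cloud.r-project.org` mirror are assumptions inferred from the functions used later (`read_html`, `read_xml`, `fromJSON`, `%>%`). `missing_packages` is a hypothetical helper introduced only for this example.

```r
# Assumption: the report does not list its packages or CRAN mirror; these are guesses.
options(repos = c(CRAN = "https://cloud.r-project.org"))   # set the CRAN mirror manually
pkgs <- c("rvest", "xml2", "jsonlite")

# Hypothetical helper: return the packages in `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
if (length(to_install) > 0) install.packages(to_install)   # install only what is missing

invisible(lapply(pkgs, library, character.only = TRUE))     # attach every package
```

Installing only the missing packages keeps the document quick to re-knit once everything is in place.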

The three files are loaded in a similar way, but the parsed objects differ in structure. The HTML file is parsed into a document whose top-level nodes are the “head” and “body” tags, the same way content is organized on a webpage. The XML file is parsed into a document with a single top-level node; each book is a tagged element whose details are stored as attributes, so it reads more like rows of structured data in a CSV file. Lastly, the JSON file becomes a nested list once it is parsed, reflecting its hierarchical key-value format; at this stage only its URL is stored.
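The differences can be seen without downloading anything by parsing small inline strings. This is only a sketch: the three strings are hypothetical one-book stand-ins whose layout is inferred from the output in this report, and `fromJSON` is assumed to come from `jsonlite`.

```r
library(xml2)      # read_html(), read_xml(), xml_name(), xml_attr()
library(jsonlite)  # fromJSON() (assumed to be the package used in this report)

# Hypothetical one-book stand-ins for the three files
html_txt <- "<html><head><title>Books</title></head><body><table><tr><td>Sula</td></tr></table></body></html>"
xml_txt  <- '<data><Sheet1><Sula Author="Toni Morrison" Chapters="11" Pages="192"/></Sheet1></data>'
json_txt <- '{"Sheet1": {"Sula": {"Author": "Toni Morrison", "Chapters": 11, "Pages": 192}}}'

html_doc <- read_html(html_txt)   # parsed HTML document
xml_doc  <- read_xml(xml_txt)     # parsed XML document
json_lst <- fromJSON(json_txt)    # nested list

html_nodes_top <- xml_name(xml_children(html_doc))   # top-level HTML nodes: head, body
xml_root_name  <- xml_name(xml_root(xml_doc))        # single XML root: data
sula_author    <- xml_attr(xml_find_first(xml_doc, ".//Sula"), "Author")  # details live in attributes

print(html_nodes_top)
print(xml_root_name)
print(sula_author)
print(names(json_lst$Sheet1))     # in the JSON, book titles are keys
```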

books.html <- read_html("https://raw.githubusercontent.com/crystaliquezada/week7_data607/refs/heads/main/Books.html")   # read the HTML file
books.html                                                                                                              # view the parsed document

## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8 ...
## [2] <body><div class="ritz grid-container" dir="ltr"><table class="waffle" ce ...

books.xml <- "https://raw.githubusercontent.com/crystaliquezada/week7_data607/refs/heads/main/Books.xml"   # store the XML URL
readbooksxml <- read_xml(books.xml)                                                                         # read the XML file
readbooksxml                                                                                                # view the parsed document

## {xml_document}
## <data>
## [1] <Sheet1>\n  <Harry_Potter_and_the_Philosopher_s_Stone Author="J.K Rowling ...

json.url <- "https://raw.githubusercontent.com/crystaliquezada/week7_data607/refs/heads/main/Books.json"  # store the JSON URL (parsed later with fromJSON)
json.url                                                                                                   # view the URL

## [1] "https://raw.githubusercontent.com/crystaliquezada/week7_data607/refs/heads/main/Books.json"

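Transforming each file into a data frame is a separate challenge. Because each format structures the data differently, each one needs its own approach: the HTML table can be extracted directly, the XML values have to be pulled from element names and attributes, and the nested JSON list has to be bound together row by row.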

html_df <- books.html %>% 
  html_table() %>%   # parse every <table> in the page into a list of tibbles
  .[[1]]             # keep the first (and only) table
print(html_df)

## # A tibble: 4 × 5
##      `` A                                        B             C        D    
##   <int> <chr>                                    <chr>         <chr>    <chr>
## 1     1 Books                                    Author        Chapters Pages
## 2     2 Harry Potter and the Philosopher's Stone J.K Rowling   17       309  
## 3     3 Sula                                     Toni Morrison 11       192  
## 4     4 The Stranger                             Albert Camus  11       123
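In the output above, the real column names (Books, Author, Chapters, Pages) arrive as the first data row, and the spreadsheet's row numbers occupy the first column. One way to tidy this is sketched below in base R. The `raw` data frame is a stand-in rebuilt from the printed output (with `X` as a placeholder name for the unnamed first column), not the parsed object itself.

```r
# Stand-in for the parsed HTML table, with values copied from the printed output
raw <- data.frame(
  X = 1:4,
  A = c("Books", "Harry Potter and the Philosopher's Stone", "Sula", "The Stranger"),
  B = c("Author", "J.K Rowling", "Toni Morrison", "Albert Camus"),
  C = c("Chapters", "17", "11", "11"),
  D = c("Pages", "309", "192", "123"),
  stringsAsFactors = FALSE
)

clean <- raw[-1, -1]                                # drop the header row and the row-number column
names(clean) <- as.character(unlist(raw[1, -1]))    # the first row supplies the column names
clean$Chapters <- as.numeric(clean$Chapters)        # counts were read as text
clean$Pages    <- as.numeric(clean$Pages)
rownames(clean) <- NULL                             # renumber rows from 1
print(clean)
```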

book_nodes <- xml_find_all(readbooksxml, ".//Sheet1/*")    # one node per book
xml_df_nested <- data.frame(
  Book = xml_name(book_nodes),                             # the element name holds the title
  Author = xml_attr(book_nodes, "Author"),                 # the remaining fields are attributes
  Chapters = as.numeric(xml_attr(book_nodes, "Chapters")),
  Pages = as.numeric(xml_attr(book_nodes, "Pages")),
  stringsAsFactors = FALSE
)
print(xml_df_nested)

##                                       Book        Author Chapters Pages
## 1 Harry_Potter_and_the_Philosopher_s_Stone   J.K Rowling       17   309
## 2                                     Sula Toni Morrison       11   192
## 3                             The_Stranger  Albert Camus       11   123
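The titles above contain underscores because XML element names cannot include spaces or apostrophes. A small sketch of making them readable again follows; the titles are copied from the printed output. Note that the apostrophe in "Philosopher's" was also turned into an underscore and cannot be told apart from a space, so it comes back as a space.

```r
# Titles as they appear in the XML-derived data frame
xml_titles <- c("Harry_Potter_and_the_Philosopher_s_Stone", "Sula", "The_Stranger")

# Replace every underscore with a space (the lost apostrophe stays lost)
readable_titles <- gsub("_", " ", xml_titles, fixed = TRUE)
print(readable_titles)
```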

books_json <- fromJSON(json.url)                                       # parse the JSON into a nested list
json_df <- do.call(rbind, lapply(books_json$Sheet1, as.data.frame))    # one row per book; titles become row names
colnames(json_df) <- c("Author", "Chapters", "Pages")
print(json_df)

##                                                 Author Chapters Pages
## Harry Potter and the Philosopher's Stone   J.K Rowling       17   309
## Sula                                     Toni Morrison       11   192
## The Stranger                              Albert Camus       11   123
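Here the book titles end up as row names rather than as a column, unlike the HTML and XML results. If the three data frames should share one layout, the row names can be moved into a `Book` column. A minimal sketch, using a stand-in data frame rebuilt from the printed output:

```r
# Stand-in for the JSON-derived data frame, with values copied from the printed output
json_like <- data.frame(
  Author = c("J.K Rowling", "Toni Morrison", "Albert Camus"),
  Chapters = c(17, 11, 11),
  Pages = c(309, 192, 123),
  row.names = c("Harry Potter and the Philosopher's Stone", "Sula", "The Stranger"),
  stringsAsFactors = FALSE
)

# Move the row names into an ordinary first column, then reset the row names
tidy_json <- cbind(Book = rownames(json_like), json_like, stringsAsFactors = FALSE)
rownames(tidy_json) <- NULL
print(tidy_json)
```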

Crystal Quezada

2024-10-12