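library(rvest) # parse HTML and extract tables
library(httr) # HTTP requests (GET, content)
library(xml2) # read and query XML
library(dplyr) # provides the %>% pipe
library(jsonlite) # parse JSON
library(formattable) # render data frames as formatted tables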
For this assignment, I created three files containing the same information on three books: “I, Who Did Not Die”, “Handle with Care”, and “After the Wind”. The files are in HTML, JSON, and XML format, and each contains the book title, subtitle, authors, and genre. I uploaded the files to a publicly available GitHub repository.
I loaded the HTML file into R using the httr package, which let me fetch the file from its GitHub URL. I retrieved the response body as a plain text string using the content() function from httr, and then parsed the string as HTML using read_html() from the rvest package. The rvest package also let me convert the HTML table into a dataframe.
html_file_github <- "https://raw.githubusercontent.com/koonkimb/Data607/main/Assignment%207/book.html"
response <- GET(html_file_github)
html_content_github <- content(response, as = "text", encoding = "UTF-8")
html_content <- read_html(html_content_github)
html_df <- html_content %>%
  html_node("table") %>%
  html_table(fill = TRUE)
formattable(html_df)
| Title | Subtitle | Author | Genre |
|---|---|---|---|
| I, Who Did Not Die | A Sweeping Story of Loss, Redemption, and Fate | Zahed Haftlang Najah Aboud Meredith May | Autobiography |
| Handle with Care | | Jodi Picoult | Realistic Fiction |
| After the Wind | Tragedy on Everest One Survivor’s Story | Lou Kasischke | Autobiography |
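The table extraction step can be tried without a network connection by parsing a literal HTML string. This is a minimal sketch: the `html_text_sample` markup is a hypothetical stand-in for book.html, not the actual file contents, and the object names are made up for illustration.

```r
library(rvest)

# Hypothetical stand-in for book.html: one table with a header row.
html_text_sample <- '<table>
  <tr><th>Title</th><th>Subtitle</th><th>Author</th><th>Genre</th></tr>
  <tr><td>Handle with Care</td><td></td><td>Jodi Picoult</td><td>Realistic Fiction</td></tr>
  <tr><td>After the Wind</td><td>Tragedy on Everest</td><td>Lou Kasischke</td><td>Autobiography</td></tr>
</table>'

# read_html() accepts a literal HTML string as well as a URL or file path.
sample_page <- read_html(html_text_sample)

# Select the first <table> node and convert it; the <th> row becomes the column names.
sample_df <- html_table(html_node(sample_page, "table"))
```

Because the header cells are `<th>` elements, html_table() should use them as column names rather than as a data row, which is why the dataframe has two rows here.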
I read the XML file into R using the xml2 package. Its XPath functions let me extract the text of specific XML tags and assemble those values into a dataframe.
xml_file <- read_xml("https://raw.githubusercontent.com/koonkimb/Data607/main/Assignment%207/book.xml")
xml_content <- xml_file %>% xml_find_all("//Book")
xml_df <- data.frame(
  title = xml_text(xml_find_all(xml_content, ".//Title")),
  subtitle = xml_text(xml_find_all(xml_content, ".//Subtitle")),
  author = xml_text(xml_find_all(xml_content, ".//Author")),
  genre = xml_text(xml_find_all(xml_content, ".//Genre")),
  stringsAsFactors = FALSE
)
formattable(xml_df)
| title | subtitle | author | genre |
|---|---|---|---|
| I, Who Did Not Die | A Sweeping Story of Loss, Redemption, and Fate | Zahed HaftlangNajah AboudMeredith May | Autobiography |
| Handle with Care | | Jodi Picoult | Realistic Fiction |
| After the Wind | Tragedy on Everest One Survivor’s Story | Lou Kasischke | Autobiography |
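In the table above, the three authors of the first book run together with no separator. That is what xml_text() produces when a single node contains several child elements, since it concatenates all the text beneath the node. One possible way to keep the names apart is to collect them per book and join them explicitly. The sketch below assumes a hypothetical layout in which each `<Author>` wraps one `<Name>` per person; the real book.xml may be structured differently.

```r
library(xml2)

# Hypothetical XML layout (an assumption, not the actual book.xml).
xml_sample <- read_xml('<Books>
  <Book>
    <Title>I, Who Did Not Die</Title>
    <Author><Name>Zahed Haftlang</Name><Name>Najah Aboud</Name><Name>Meredith May</Name></Author>
  </Book>
  <Book>
    <Title>Handle with Care</Title>
    <Author><Name>Jodi Picoult</Name></Author>
  </Book>
</Books>')

books <- xml_find_all(xml_sample, "//Book")

# For each book, find its <Name> nodes and join their text with ", ".
authors <- vapply(
  books,
  function(book) paste(xml_text(xml_find_all(book, ".//Name")), collapse = ", "),
  character(1)
)

xml_sample_df <- data.frame(
  title = xml_text(xml_find_first(books, ".//Title")),
  author = authors,
  stringsAsFactors = FALSE
)
```

Working book by book also keeps every column the same length, which a document-wide xml_find_all() on a repeated tag would not guarantee.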
Retrieving the file and loading the data into a dataframe was simplest with the JSON file, as I was able to do both in two lines using the jsonlite package.
json_file <- fromJSON("https://raw.githubusercontent.com/koonkimb/Data607/main/Assignment%207/book.json")
json_df <- as.data.frame(json_file)
formattable(json_df)
| title | subtitle | authors | genre |
|---|---|---|---|
| I, Who Did Not Die | A Sweeping Story of Loss, Redemption, and Fate | Zahed Haftlang, Najah Aboud , Meredith May | Autobiography |
| Handle with Care | | Jodi Picoult | Realistic Fiction |
| After the Wind | Tragedy on Everest One Survivor’s Story | Lou Kasischke | Autobiography |
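If the authors were stored as a JSON array rather than a single string, fromJSON() would return them as a list column, which does not print as cleanly in a table. The sketch below shows one way to flatten such a column. The `json_sample` string is a hypothetical structure chosen for illustration and is not the contents of book.json.

```r
library(jsonlite)

# Hypothetical JSON (an assumption): an array of book objects with an authors array.
json_sample <- '[
  {"title": "I, Who Did Not Die",
   "authors": ["Zahed Haftlang", "Najah Aboud", "Meredith May"],
   "genre": "Autobiography"},
  {"title": "Handle with Care",
   "authors": ["Jodi Picoult"],
   "genre": "Realistic Fiction"}
]'

# fromJSON() simplifies an array of objects into a data frame.
json_sample_df <- fromJSON(json_sample)

# authors is a list column (one character vector per book);
# collapse each vector into a single comma-separated string.
json_sample_df$authors <- vapply(
  json_sample_df$authors, paste, character(1), collapse = ", "
)
```

After the collapse, every column is a plain character vector, so the dataframe can be passed to formattable() like the others.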
This assignment introduced me to several data formats that come up in data acquisition, and the experience carried over to Project 3. Familiarity with HTML was crucial for the web scraping portion: understanding how HTML tags are structured allowed me to pull the relevant links from a webpage. JSON was just as useful in the data cleaning step, since importing a JSON file and using it as a data dictionary made the data much easier to read and parse.