In this assignment, I recorded information for three different books, stored it in an HTML file, an XML file, and a JSON file, and read each file into R to see how they compare.
These are the three books I chose (from my bookshelf): The Dictator’s Handbook, King Leopold’s Ghost, and The Complete Maus.
I recorded each book’s title, author(s), year of publication, genre, and Goodreads rating to create these three files: books.html, books.xml, and books.json.
I found this article from DataCamp very helpful in completing the assignment.
```r
library(XML)        # To read HTML and XML files
library(RCurl)      # To get URL data
library(jsonlite)   # To read JSON files
library(knitr)      # To create responsive HTML tables
library(kableExtra) # To create responsive HTML tables
```
```r
# Link to HTML file
u1 <- "https://raw.githubusercontent.com/koffeeya/msds/master/DATA%20607%20Data%20Acquisition%20and%20Management/Assignments/Assignment%2007/books.html"

# Get URL data
urldata <- getURL(u1)

# Read in the HTML file. Since the entire file is one table, we do not have to specify which table we want to use
raw_html <- readHTMLTable(urldata, stringsAsFactors = FALSE)

# Convert the HTML into an R dataframe
html <- data.frame(raw_html)

html %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```
NULL.id | NULL.Title | NULL.Author | NULL.Year | NULL.Genre | NULL.Rating |
---|---|---|---|---|---|
1 | The Dictator’s Handbook | Bruce Bueno de Mesquita, Alastair Smith | 2011 | Politics | 4.27 |
2 | King Leopold’s Ghost | Adam Hochschild | 1998 | History | 4.15 |
3 | The Complete Maus (Maus #1-2) | Art Spiegelman | 1986 | Graphic Novel | 4.53 |
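The `NULL.` prefix on the column names above appears to come from calling `data.frame()` on the whole list that `readHTMLTable()` returns, rather than on the single table inside it. Here is a minimal sketch of one way to avoid the prefix, assuming the file contains exactly one table. The inline `html_text` is a made-up stand-in for books.html so the example runs without a network connection.

```r
library(XML)

# Hypothetical stand-in for books.html: one table with a header row
html_text <- "<html><body><table>
  <tr><th>id</th><th>Title</th><th>Year</th></tr>
  <tr><td>1</td><td>King Leopold's Ghost</td><td>1998</td></tr>
  <tr><td>2</td><td>The Complete Maus (Maus #1-2)</td><td>1986</td></tr>
</table></body></html>"

# Parse the string as HTML, then read every table into a list of data frames
doc <- htmlParse(html_text, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Take the first (and only) table out of the list instead of
# converting the whole list, so no list-element prefix is added
html_df <- tables[[1]]
names(html_df)
```

With the real file, the same idea would be `html <- raw_html[[1]]` in place of `data.frame(raw_html)`.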
```r
# Link to XML file
u2 <- "https://raw.githubusercontent.com/koffeeya/msds/master/DATA%20607%20Data%20Acquisition%20and%20Management/Assignments/Assignment%2007/books.xml"

# Get URL data
raw_xml <- getURL(u2)

# Parse the XML data from the URL
xml_file <- xmlParse(raw_xml)

# Convert the XML data into an R dataframe
xml_df <- xmlToDataFrame(xml_file, stringsAsFactors = FALSE)

xml_df %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```
title | author1 | author2 | year | genre | rating |
---|---|---|---|---|---|
The Dictator’s Handbook | Bruce Bueno de Mesquita | Alastair Smith | 2011 | Politics | 4.27 |
King Leopold’s Ghost | Adam Hochschild | | 1998 | History | 4.15 |
The Complete Maus (Maus #1-2) | Art Spiegelman | | 1986 | Graphic Novel | 4.53 |
```r
# Link to JSON file
u3 <- "https://raw.githubusercontent.com/koffeeya/msds/master/DATA%20607%20Data%20Acquisition%20and%20Management/Assignments/Assignment%2007/books.json"

# Get the JSON data from the URL
json_file <- fromJSON(u3)

# Convert the JSON data into an R dataframe
json_df <- data.frame(json_file, stringsAsFactors = FALSE)

json_df %>%
  kable("html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
```
books.Title | books.Author1 | books.Author2 | books.Year | books.Genre | books.Rating |
---|---|---|---|---|---|
The Dictator’s Handbook | Bruce Bueno de Mesquita | Alastair Smith | 2011 | Politics | 4.27 |
King Leopold’s Ghost | Adam Hochschild | | 1998 | History | 4.15 |
The Complete Maus (Maus #1-2) | Art Spiegelman | | 1986 | Graphic Novel | 4.53 |
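The `books.` prefix suggests that the JSON file wraps its records in a top-level `books` key, so `fromJSON()` returns a list holding one data frame. The sketch below illustrates that behavior under this assumption. The inline `json_text` is a hypothetical, shortened stand-in for books.json.

```r
library(jsonlite)

# Hypothetical stand-in for books.json: an array of records under a "books" key
json_text <- '{"books": [
  {"Title": "King Leopold\'s Ghost", "Year": 1998},
  {"Title": "The Complete Maus (Maus #1-2)", "Year": 1986}
]}'

# fromJSON() simplifies the array of objects into a data frame
# and stores it as the list element named "books"
parsed <- fromJSON(json_text)

# Converting the whole list prefixes every column with the element name
prefixed_df <- data.frame(parsed, stringsAsFactors = FALSE)
names(prefixed_df)

# Selecting the element directly keeps the original field names
books_df <- parsed$books
names(books_df)
```

If the real file has this shape, `json_file$books` would give the same data as `json_df` without the prefix.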
The three data frames ended up very similar but not identical, and each format needed slightly different steps to get into R.
The JSON file required the fewest steps to read, while the XML required the most. XML also has stricter rules about special characters and naming, so my file needed a few rounds of corrections before R could read it.
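As an illustration of the kind of special-character rule involved, here is a small sketch showing that a bare `&` in XML text has to be written as `&amp;`. The `<genre>` values are invented for the example and are not taken from my files. I expect the unescaped version to be rejected by the parser, so it is wrapped in `tryCatch()`.

```r
library(XML)

# Hypothetical snippets: the same element with and without an escaped ampersand
bad_xml  <- "<book><genre>Politics & History</genre></book>"
good_xml <- "<book><genre>Politics &amp; History</genre></book>"

# A bare "&" is not well-formed XML; catch the parse error instead of stopping
bad_result <- tryCatch(
  xmlParse(bad_xml, asText = TRUE),
  error = function(e) "parse failed"
)

# The escaped version parses, and the text comes back with a plain "&"
good_doc <- xmlParse(good_xml, asText = TRUE)
genre <- xpathSApply(good_doc, "//genre", xmlValue)
genre
```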
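The column names of each data frame differ by source, reflecting the structure of the format used to create the file: the HTML columns carry a `NULL.` prefix, the XML columns use the lowercase tag names, and the JSON columns carry a `books.` prefix. The HTML table also has an `id` column and a single `Author` column, while the XML and JSON files split the authors into two fields.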
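To compare the three data frames more directly, the names could be put on a common footing first. This is a rough sketch, not something I ran as part of the assignment. `clean_names()` is a hypothetical helper that drops everything up to the first dot and lowercases the rest, so it would also truncate any name that contains a dot for other reasons.

```r
# Hypothetical helper: strip a leading "prefix." and lowercase what remains
clean_names <- function(x) tolower(sub("^[^.]*\\.", "", x))

# Column names copied from the three output tables
html_names <- c("NULL.id", "NULL.Title", "NULL.Author",
                "NULL.Year", "NULL.Genre", "NULL.Rating")
xml_names  <- c("title", "author1", "author2", "year", "genre", "rating")
json_names <- c("books.Title", "books.Author1", "books.Author2",
                "books.Year", "books.Genre", "books.Rating")

# After cleaning, the XML and JSON names match exactly
same_xml_json <- identical(clean_names(xml_names), clean_names(json_names))

# The HTML table still differs: it has an id column and one author column
html_only <- setdiff(clean_names(html_names), clean_names(json_names))
html_only
```

Applied to a real data frame, this would look like `names(json_df) <- clean_names(names(json_df))`.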