Description

In this assignment, I recorded information for three different books, stored it in an HTML file, an XML file, and a JSON file, and then read each file into R to see how the results compare.

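These are the three books I chose (from my bookshelf): The Dictator’s Handbook by Bruce Bueno de Mesquita and Alastair Smith, King Leopold’s Ghost by Adam Hochschild, and The Complete Maus by Art Spiegelman.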

I recorded each book’s title, author(s), year of publication, genre, and Goodreads rating to create these three files: books.html, books.xml, and books.json.

I found this article from DataCamp very helpful in completing the assignment.


library(XML)         # To read HTML and XML files
library(RCurl)       # To get URL data
library(jsonlite)    # To read JSON files
library(knitr)       # Provides kable() to render data frames as HTML tables
library(kableExtra)  # Provides kable_styling() for striped, responsive table styling

Reading HTML into R

# Link to HTML file
u1 <- "https://raw.githubusercontent.com/koffeeya/msds/master/DATA%20607%20Data%20Acquisition%20and%20Management/Assignments/Assignment%2007/books.html"

# Get URL data
urldata <- getURL(u1)

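# Read the tables in the HTML file. readHTMLTable() returns a list with one data frame per table; this file contains a single table, so we do not have to specify which one we want
raw_html <- readHTMLTable(urldata, stringsAsFactors = FALSE)

# Collapse the one-element list into a single R data frame. The list element is labelled NULL, so data.frame() prefixes every column name with "NULL."
html <- data.frame(raw_html)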

html %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| NULL.id | NULL.Title | NULL.Author | NULL.Year | NULL.Genre | NULL.Rating |
|---|---|---|---|---|---|
| 1 | The Dictator’s Handbook | Bruce Bueno de Mesquita, Alastair Smith | 2011 | Politics | 4.27 |
| 2 | King Leopold’s Ghost | Adam Hochschild | 1998 | History | 4.15 |
| 3 | The Complete Maus (Maus #1-2) | Art Spiegelman | 1986 | Graphic Novel | 4.53 |
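
The NULL. prefix on the column names comes from passing the whole list returned by readHTMLTable() to data.frame(). A minimal sketch of one way to avoid it is to pull the first table out of the list instead. The inline sample_html string is a hypothetical stand-in for books.html, and html_clean is an illustrative name that is not part of the original analysis.

```r
library(XML)

# Hypothetical inline stand-in for books.html (the real file is fetched from GitHub)
sample_html <- paste0(
  "<html><body><table>",
  "<tr><th>id</th><th>Title</th><th>Year</th></tr>",
  "<tr><td>3</td><td>The Complete Maus (Maus #1-2)</td><td>1986</td></tr>",
  "</table></body></html>"
)

# Parse the string as HTML and read every table into a list of data frames
doc <- htmlParse(sample_html, asText = TRUE)
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# Select the first (and only) table rather than wrapping the list in data.frame()
html_clean <- tables[[1]]
names(html_clean)
```

Selecting the list element directly keeps the header names exactly as they appear in the table, so no prefix needs to be stripped afterwards.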

Reading XML into R

# Link to XML file
u2 <- "https://raw.githubusercontent.com/koffeeya/msds/master/DATA%20607%20Data%20Acquisition%20and%20Management/Assignments/Assignment%2007/books.xml"

# Get URL data
raw_xml <- getURL(u2)

# Parse the XML data from the URL
xml_file <- xmlParse(raw_xml)

# Convert the XML data into an R dataframe
xml_df <- xmlToDataFrame(xml_file, stringsAsFactors = FALSE)

xml_df %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| title | author1 | author2 | year | genre | rating |
|---|---|---|---|---|---|
| The Dictator’s Handbook | Bruce Bueno de Mesquita | Alastair Smith | 2011 | Politics | 4.27 |
| King Leopold’s Ghost | Adam Hochschild | | 1998 | History | 4.15 |
| The Complete Maus (Maus #1-2) | Art Spiegelman | | 1986 | Graphic Novel | 4.53 |
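
xmlToDataFrame() treats each child of the root node as one row and uses the names of that child's elements as column names, which is why the columns here are title, author1, and so on. The sketch below illustrates this on a small inline string. The books/book layout is an assumption about how books.xml is organized, since the file itself is not reproduced here, and sample_xml and xml_sketch are illustrative names.

```r
library(XML)

# Hypothetical stand-in for books.xml: one <book> element per record (assumed layout)
sample_xml <- paste0(
  "<books>",
  "<book>",
  "<title>The Complete Maus (Maus #1-2)</title>",
  "<author1>Art Spiegelman</author1>",
  "<year>1986</year>",
  "</book>",
  "</books>"
)

# Parse the string, then turn each <book> into one row
doc <- xmlParse(sample_xml, asText = TRUE)
xml_sketch <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

# Column names are taken from the element names inside <book>
names(xml_sketch)
```

Every value arrives as character data, so numeric fields such as the year would still need converting, for example with as.integer().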

Reading JSON into R

# Link to JSON file
u3 <- "https://raw.githubusercontent.com/koffeeya/msds/master/DATA%20607%20Data%20Acquisition%20and%20Management/Assignments/Assignment%2007/books.json"

# Get the JSON data from the URL
json_file <- fromJSON(u3)

# Convert the JSON data into an R dataframe
json_df <- data.frame(json_file, stringsAsFactors = FALSE)

json_df %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| books.Title | books.Author1 | books.Author2 | books.Year | books.Genre | books.Rating |
|---|---|---|---|---|---|
| The Dictator’s Handbook | Bruce Bueno de Mesquita | Alastair Smith | 2011 | Politics | 4.27 |
| King Leopold’s Ghost | Adam Hochschild | | 1998 | History | 4.15 |
| The Complete Maus (Maus #1-2) | Art Spiegelman | | 1986 | Graphic Novel | 4.53 |
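
The books. prefix appears because fromJSON() returns a list whose single element, books, is itself a data frame, and data.frame() prepends the element name when it flattens the list. As a rough sketch, extracting that element directly keeps the original field names. The sample_json string is a hypothetical stand-in for books.json with an assumed top-level books key.

```r
library(jsonlite)

# Hypothetical stand-in for books.json (assumed structure: an array under "books")
sample_json <- '{"books":[{"Title":"The Complete Maus (Maus #1-2)","Year":1986}]}'

# fromJSON() simplifies the array of records into a data frame inside a list
parsed <- fromJSON(sample_json)

# Wrapping the whole list adds the "books." prefix to every column
names(data.frame(parsed, stringsAsFactors = FALSE))

# Extracting the element keeps the field names unchanged
names(parsed$books)
```

Unlike the HTML and XML readers, fromJSON() also preserves the JSON number type, so Year is read as a numeric column rather than as text.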

Comparison

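The three data frames ended up very similar but not identical. The HTML version keeps both authors in a single column and carries an extra id column, while the XML and JSON versions split the authors into two columns. Each format also required slightly different steps to read into R.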

The JSON file required the fewest steps to read, while the XML file required the most. XML also has stricter rules about special characters and element names, so my file needed a few rounds of corrections before R could parse it.
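
To illustrate the kind of strictness involved, the sketch below checks whether a string parses as XML. A bare ampersand in element text is not well-formed XML and has to be written as &amp;amp; instead. The helper name parses_ok and the sample title are hypothetical and are not taken from the assignment files.

```r
library(XML)

# Hypothetical helper: TRUE if the text parses as well-formed XML, FALSE otherwise
parses_ok <- function(txt) {
  tryCatch({
    xmlParse(txt, asText = TRUE)
    TRUE
  }, error = function(e) FALSE)
}

# Hypothetical record with a bare ampersand, which XML does not allow in text
unescaped <- "<book><title>Politics & Power</title></book>"

# The same record with the ampersand written as an entity
escaped <- "<book><title>Politics &amp; Power</title></book>"

parses_ok(unescaped)
parses_ok(escaped)
```

Running a check like this on each record is one way to find the offending line before handing the whole file to xmlToDataFrame().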

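The column names differ from one data frame to the next, reflecting the structure of each format: the HTML columns carry a NULL. prefix from the unnamed table, the XML columns take the element names as written, and the JSON columns carry a books. prefix from the top-level object.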
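
If the three data frames need to be compared column by column, the names can be brought into a common form first. This is only a sketch under the assumption that the only differences are the NULL. and books. prefixes and letter case; normalize_names is a hypothetical helper, and the two small data frames stand in for the real ones.

```r
# Hypothetical helper: drop a leading "NULL." or "books." prefix and lower-case the names
normalize_names <- function(df) {
  names(df) <- tolower(sub("^(NULL|books)\\.", "", names(df)))
  df
}

# Small stand-ins for the HTML and JSON data frames
html_like <- data.frame(a = "The Complete Maus (Maus #1-2)", b = "1986",
                        stringsAsFactors = FALSE)
names(html_like) <- c("NULL.Title", "NULL.Year")

json_like <- data.frame(a = "The Complete Maus (Maus #1-2)", b = "1986",
                        stringsAsFactors = FALSE)
names(json_like) <- c("books.Title", "books.Year")

# After normalizing, both data frames share the same column names
names(normalize_names(html_like))
names(normalize_names(json_like))
identical(normalize_names(html_like), normalize_names(json_like))
```

The author columns would still need separate handling, because the HTML file stores both authors in one field.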