Loading Libraries

library(tidyverse)
library(RCurl)
library(XML)
library(rjson)
library(knitr)

Introduction

We were asked to create a XML file, a Json file, and a Html file. We were then asked to load them into R and create a data frame from the data. I decided to use the tidyverse and Rcurl libraries for loading in data,XML for loading in the XML and Html files,rjson to load in the json file, and knitr for table purposes, Below are three tabs each explaining how to load in one of the files and transform it into a data frame.

How to Extract Data from different file types

1. Extracting data from XML file

Use xmlParse to read in the file as an XML class document

rawXML <-
  xmlParse(getURL("https://raw.githubusercontent.com/sbiguzzi/data607assignment7/main/book-list-xml.xml"))

Then we use xmlRoot to allow R to interact with the XML file so we can call the first node (element) of the file

rootXML <- xmlRoot(rawXML)

When we call the first element of rootXML we see that we have the first book

$book1
<book1>
  <title>The Name of the Wind</title>
  <author1>Patrick Ruthfuss</author1>
  <author2> </author2>
  <genre>Sci-Fi</genre>
  <year>2007</year>
</book1> 

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"

Then we can use the xmlSApply function to iterate through each header and value within the XML node and return a matrix.

dataXML <- xmlSApply(rootXML, function(x) xmlSApply(x, xmlValue))

	book1	book2	book3
title	The Name of the Wind	Salt: A World History	Punching the Air
author1	Patrick Ruthfuss	Mark Kurlansky	Ibi Zoboi
author2			Yusef Salaam
genre	Sci-Fi	Nonfiction	Young Adult
year	2007	2003	2020

Note: *If we only used xmlSApply once, R only reads the book header and then the values, as seen below, rather than reading both the headers and sub-headers

                                                 book1 
     "The Name of the WindPatrick Ruthfuss Sci-Fi2007" 
                                                 book2 
  "Salt: A World HistoryMark Kurlansky Nonfiction2003" 
                                                 book3 
"Punching the AirIbi ZoboiYusef SalaamYoung Adult2020"

Finally we use the data.frame function to transpose the matrix and convert it to a data frame.

book.xml.df <- data.frame(t(dataXML),row.names=NULL)

title	author1	author2	genre	year
The Name of the Wind	Patrick Ruthfuss		Sci-Fi	2007
Salt: A World History	Mark Kurlansky		Nonfiction	2003
Punching the Air	Ibi Zoboi	Yusef Salaam	Young Adult	2020

2. Extracting data from JSON file

First load the Json file using the fromJSON function.

rawJSON <- 
  fromJSON(getURL("https://raw.githubusercontent.com/sbiguzzi/data607assignment7/main/book-list-json.json"))

Create a table that stores the book list data as the row

tibJSON <- rawJSON %>%
  map_if(is.data.frame,list) %>%
  as_tibble() %>%
  unnest(cols = c('book1', 'book2', 'book3'))

Rename the rows and transpose the data frame

#Transpose data
book.json.df <- as.data.frame(t(as.matrix(tibJSON)))

#Rename columns
names(book.json.df) <-
  c("title","author1","author2","genre","year")

#Reset row names
rownames(book.json.df) <- NULL

title	author1	author2	genre	year
The Name of the Wind	Patrick Rothfuss	NA	Sci-Fi	2007
Salt: A World History	Mark Kurlansky	NA	Nonfiction	2003
Punching the Air	Ibi Zoboi	Yusef Salaam	Young Adult	2020

3. Extracting data from HTML file

Load the data

rawHTML <- getURL("https://raw.githubusercontent.com/sbiguzzi/data607assignment7/main/book-list-html.html")

Write it to a data frame and rename columns

#Create data frame
book.html.df <- as.data.frame(readHTMLTable(rawHTML, stringAsFactors = FALSE))

#Rename columns
names(book.html.df) <-
  tolower(gsub("NULL.","",names(book.html.df)))

title	author1	author2	genre	year
The Name of the Wind	Patrick Ruthfuss	null	Sci-Fi	2007
Salt: A World History	Mark Kurlansky	null	History	2003
Punching the Air	Ibi Zoboi	Yusef Salaam	Young Adult	2020

Conclusion

All three of the files can be read into R and transformed into the same data frame, see table below, using different functions and libraries. However, it would seem that HTML is the easiest file format to read in as it does not require any transformation of the data. This is because with both Json and XML files you have to read in the data and tell R what the column names and values are, whereas with HTML there is code to tell R which is which.

Compare data types for columns in each dataframe

	sapply(book.xml.df, class)
title	character
author1	character
author2	character
genre	character
year	character

	sapply(book.json.df, class)
title	character
author1	character
author2	character
genre	character
year	character

	sapply(book.html.df, class)
title	character
author1	character
author2	character
genre	character
year	character

Assignment 7 - Working with XML, JSON, and HTML in R

Stefano Biguzzi