Loading Libraries

library(tidyverse)
library(RCurl)
library(XML)
library(rjson)
library(knitr)

Introduction

We were asked to create a XML file, a Json file, and a Html file. We were then asked to load them into R and create a data frame from the data. I decided to use the tidyverse and Rcurl libraries for loading in data,XML for loading in the XML and Html files,rjson to load in the json file, and knitr for table purposes, Below are three tabs each explaining how to load in one of the files and transform it into a data frame.

How to Extract Data from different file types

1. Extracting data from XML file

  1. Use xmlParse to read in the file as an XML class document
rawXML <-
  xmlParse(getURL("https://raw.githubusercontent.com/sbiguzzi/data607assignment7/main/book-list-xml.xml"))
  1. Then we use xmlRoot to allow R to interact with the XML file so we can call the first node (element) of the file
rootXML <- xmlRoot(rawXML)

When we call the first element of rootXML we see that we have the first book

$book1
<book1>
  <title>The Name of the Wind</title>
  <author1>Patrick Ruthfuss</author1>
  <author2> </author2>
  <genre>Sci-Fi</genre>
  <year>2007</year>
</book1> 

attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"        
  1. Then we can use the xmlSApply function to iterate through each header and value within the XML node and return a matrix.
dataXML <- xmlSApply(rootXML, function(x) xmlSApply(x, xmlValue))
book1 book2 book3
title The Name of the Wind Salt: A World History Punching the Air
author1 Patrick Ruthfuss Mark Kurlansky Ibi Zoboi
author2 Yusef Salaam
genre Sci-Fi Nonfiction Young Adult
year 2007 2003 2020

Note: *If we only used xmlSApply once, R only reads the book header and then the values, as seen below, rather than reading both the headers and sub-headers

                                                 book1 
     "The Name of the WindPatrick Ruthfuss Sci-Fi2007" 
                                                 book2 
  "Salt: A World HistoryMark Kurlansky Nonfiction2003" 
                                                 book3 
"Punching the AirIbi ZoboiYusef SalaamYoung Adult2020" 
  1. Finally we use the data.frame function to transpose the matrix and convert it to a data frame.
book.xml.df <- data.frame(t(dataXML),row.names=NULL)
title author1 author2 genre year
The Name of the Wind Patrick Ruthfuss Sci-Fi 2007
Salt: A World History Mark Kurlansky Nonfiction 2003
Punching the Air Ibi Zoboi Yusef Salaam Young Adult 2020

2. Extracting data from JSON file

  1. First load the Json file using the fromJSON function.
rawJSON <- 
  fromJSON(getURL("https://raw.githubusercontent.com/sbiguzzi/data607assignment7/main/book-list-json.json"))
  1. Create a table that stores the book list data as the row
tibJSON <- rawJSON %>%
  map_if(is.data.frame,list) %>%
  as_tibble() %>%
  unnest(cols = c('book1', 'book2', 'book3'))
  1. Rename the rows and transpose the data frame
#Transpose data
book.json.df <- as.data.frame(t(as.matrix(tibJSON)))

#Rename columns
names(book.json.df) <-
  c("title","author1","author2","genre","year")

#Reset row names
rownames(book.json.df) <- NULL
title author1 author2 genre year
The Name of the Wind Patrick Rothfuss NA Sci-Fi 2007
Salt: A World History Mark Kurlansky NA Nonfiction 2003
Punching the Air Ibi Zoboi Yusef Salaam Young Adult 2020

3. Extracting data from HTML file

  1. Load the data
rawHTML <- getURL("https://raw.githubusercontent.com/sbiguzzi/data607assignment7/main/book-list-html.html")
  1. Write it to a data frame and rename columns
#Create data frame
book.html.df <- as.data.frame(readHTMLTable(rawHTML, stringAsFactors = FALSE))

#Rename columns
names(book.html.df) <-
  tolower(gsub("NULL.","",names(book.html.df)))
title author1 author2 genre year
The Name of the Wind Patrick Ruthfuss null Sci-Fi 2007
Salt: A World History Mark Kurlansky null History 2003
Punching the Air Ibi Zoboi Yusef Salaam Young Adult 2020

Conclusion

All three of the files can be read into R and transformed into the same data frame, see table below, using different functions and libraries. However, it would seem that HTML is the easiest file format to read in as it does not require any transformation of the data. This is because with both Json and XML files you have to read in the data and tell R what the column names and values are, whereas with HTML there is code to tell R which is which.

Compare data types for columns in each dataframe
sapply(book.xml.df, class)
title character
author1 character
author2 character
genre character
year character
sapply(book.json.df, class)
title character
author1 character
author2 character
genre character
year character
sapply(book.html.df, class)
title character
author1 character
author2 character
genre character
year character