We were asked to create a XML file, a Json file, and a Html file. We were then asked to load them into R and create a data frame from the data. I decided to use the tidyverse and Rcurl libraries for loading in data,XML for loading in the XML and Html files,rjson to load in the json file, and knitr for table purposes, Below are three tabs each explaining how to load in one of the files and transform it into a data frame.
xmlParse to read in the file as an XML class documentrawXML <-
xmlParse(getURL("https://raw.githubusercontent.com/sbiguzzi/data607assignment7/main/book-list-xml.xml"))xmlRoot to allow R to interact with the XML file so we can call the first node (element) of the fileWhen we call the first element of rootXML we see that we have the first book
$book1
<book1>
<title>The Name of the Wind</title>
<author1>Patrick Ruthfuss</author1>
<author2> </author2>
<genre>Sci-Fi</genre>
<year>2007</year>
</book1>
attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
xmlSApply function to iterate through each header and value within the XML node and return a matrix.| book1 | book2 | book3 | |
|---|---|---|---|
| title | The Name of the Wind | Salt: A World History | Punching the Air |
| author1 | Patrick Ruthfuss | Mark Kurlansky | Ibi Zoboi |
| author2 | Yusef Salaam | ||
| genre | Sci-Fi | Nonfiction | Young Adult |
| year | 2007 | 2003 | 2020 |
Note: *If we only used xmlSApply once, R only reads the book header and then the values, as seen below, rather than reading both the headers and sub-headers
book1
"The Name of the WindPatrick Ruthfuss Sci-Fi2007"
book2
"Salt: A World HistoryMark Kurlansky Nonfiction2003"
book3
"Punching the AirIbi ZoboiYusef SalaamYoung Adult2020"
data.frame function to transpose the matrix and convert it to a data frame.| title | author1 | author2 | genre | year |
|---|---|---|---|---|
| The Name of the Wind | Patrick Ruthfuss | Sci-Fi | 2007 | |
| Salt: A World History | Mark Kurlansky | Nonfiction | 2003 | |
| Punching the Air | Ibi Zoboi | Yusef Salaam | Young Adult | 2020 |
fromJSON function.rawJSON <-
fromJSON(getURL("https://raw.githubusercontent.com/sbiguzzi/data607assignment7/main/book-list-json.json"))tibJSON <- rawJSON %>%
map_if(is.data.frame,list) %>%
as_tibble() %>%
unnest(cols = c('book1', 'book2', 'book3'))#Transpose data
book.json.df <- as.data.frame(t(as.matrix(tibJSON)))
#Rename columns
names(book.json.df) <-
c("title","author1","author2","genre","year")
#Reset row names
rownames(book.json.df) <- NULL| title | author1 | author2 | genre | year |
|---|---|---|---|---|
| The Name of the Wind | Patrick Rothfuss | NA | Sci-Fi | 2007 |
| Salt: A World History | Mark Kurlansky | NA | Nonfiction | 2003 |
| Punching the Air | Ibi Zoboi | Yusef Salaam | Young Adult | 2020 |
rawHTML <- getURL("https://raw.githubusercontent.com/sbiguzzi/data607assignment7/main/book-list-html.html")#Create data frame
book.html.df <- as.data.frame(readHTMLTable(rawHTML, stringAsFactors = FALSE))
#Rename columns
names(book.html.df) <-
tolower(gsub("NULL.","",names(book.html.df)))| title | author1 | author2 | genre | year |
|---|---|---|---|---|
| The Name of the Wind | Patrick Ruthfuss | null | Sci-Fi | 2007 |
| Salt: A World History | Mark Kurlansky | null | History | 2003 |
| Punching the Air | Ibi Zoboi | Yusef Salaam | Young Adult | 2020 |
All three of the files can be read into R and transformed into the same data frame, see table below, using different functions and libraries. However, it would seem that HTML is the easiest file format to read in as it does not require any transformation of the data. This is because with both Json and XML files you have to read in the data and tell R what the column names and values are, whereas with HTML there is code to tell R which is which.
|
|
|