Loading the Libraries

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(openxlsx)
library(dplyr)
library(zoo)

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(varhandle)
library(rvest)

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(RSelenium)
library(xml2)
library(xmlconvert)

## Warning: package 'xmlconvert' was built under R version 4.1.3

library(jsonlite)

## 
## Attaching package: 'jsonlite'

## The following object is masked from 'package:purrr':
## 
##     flatten

The HTML Table

We’re going to start with a basic, boring HTML table. First we read in the HTML document that contains the table.

html <- read_html("https://raw.githubusercontent.com/Amantux/Data607_Assignment7/main/table_html2.html")

From there we grab the HTML node that holds the table and turn it into a dataframe, telling html_table() to treat the first row as the header and to fill in any missing cells.

html %>% html_node("table") %>% html_table(header = TRUE, fill = TRUE)

And as you can see, it’s perfectly functional, generating a simple dataframe of fantasy books.
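To make the node-then-table step easier to try without a network connection, here is a minimal sketch that runs the same pipeline on a small inline HTML string. The table contents and the `books` name are made up for illustration and are not the data from the assignment file.

```r
library(rvest)

# Hypothetical stand-in for the remote file: a tiny table with a header row.
html_txt <- "<table>
  <tr><th>Title</th><th>Genre</th></tr>
  <tr><td>Book A</td><td>Fantasy</td></tr>
  <tr><td>Book B</td><td>Fantasy</td></tr>
</table>"

# read_html() accepts literal HTML as well as a URL or file path.
page <- read_html(html_txt)

# Select the first <table> node and convert it, using the first row as the header.
books <- page %>% html_node("table") %>% html_table(header = TRUE)
books
```

Swapping `html_txt` for the URL used above should give the same kind of result, assuming the remote file holds a single table.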

XML Read in

I’ll admit up front that I’m not very familiar with XML; I’ve rarely come across XML tables. First, let’s read in the file and look at the structure of the XML. From there, we use XPath to pull out the nodes for each field and store their text in one vector per field. Next we build a dataframe with one column per field, named after the table’s column names. Lastly, we drop the first row, because the first match for each field is just the column header (as the Genre output below shows), not a book.

xml_address = "https://raw.githubusercontent.com/Amantux/Data607_Assignment7/main/Table_XML.xml"

xml_Book_List <- read_xml(xml_address)
#xml_structure(xml_Book_List)
#xml_text(xml_Book_List)
xml_find_all(xml_Book_List, xpath = "//Genre")

## {xml_nodeset (4)}
## [1] <Genre index="3">Genre</Genre>
## [2] <Genre>Fantasy </Genre>
## [3] <Genre>Fantasy </Genre>
## [4] <Genre>Fantasy </Genre>

xml_text(xml_find_all(xml_Book_List, xpath = "//Genre"))

## [1] "Genre"    "Fantasy " "Fantasy " "Fantasy "

Genre <- xml_text(xml_find_all(xml_Book_List, xpath = "//Genre"))
Title <- xml_text(xml_find_all(xml_Book_List, xpath = "//Title"))
Source <- xml_text(xml_find_all(xml_Book_List, xpath = "//Source"))
Language <- xml_text(xml_find_all(xml_Book_List, xpath = "//Language"))
Series <- xml_text(xml_find_all(xml_Book_List, xpath = "//Series"))
df <- tibble("Genre"=Genre, "Source"=Source, "Title"=Title, "Language"=Language, "Series"=Series)
df <- df[-1,]
df

#f <- xmlToDataFrame(nodes = getNodeSet(xml_Book_List, "//Sheet1"))

And as you can see, the table matches the prior table and the source Excel document.
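One possible alternative to collecting each field with its own document-wide XPath is to select the record nodes first and then look up each field inside every record. The sketch below assumes a hypothetical layout in which each book is a `<Book>` element; the real file’s element names may differ. Because `xml_find_first()` returns one result per record, a missing field becomes `NA` instead of shifting the column out of alignment, and `trimws()` removes the trailing spaces seen in values like "Fantasy ".

```r
library(xml2)

# Hypothetical XML, loosely shaped like a spreadsheet export.
# The second record deliberately has no <Language> element.
xml_txt <- "<Books>
  <Book><Title>Book A</Title><Genre>Fantasy </Genre><Language>English</Language></Book>
  <Book><Title>Book B</Title><Genre>Fantasy </Genre></Book>
</Books>"

doc <- read_xml(xml_txt)

# One node per record (assumed element name: Book).
records <- xml_find_all(doc, "//Book")

# For each field, take the first matching child of every record.
fields <- c("Title", "Genre", "Language")
cols <- lapply(fields, function(f) {
  trimws(xml_text(xml_find_first(records, paste0("./", f))))
})
names(cols) <- fields

books_xml <- as.data.frame(cols, stringsAsFactors = FALSE)
books_xml
```

With this approach there is no header row to drop, provided the header is not itself stored as a record in the file.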

Finally, JSON

As usual, we first read in the JSON. From there, we use the as.data.frame function to convert the result into a dataframe. We set col.names to an empty string so that the column names match the formatting of the prior tables.

Json_address = "https://raw.githubusercontent.com/Amantux/Data607_Assignment7/main/Table_Json.txt"
json_raw <- fromJSON(Json_address)
json_table <- as.data.frame(json_raw, col.names = c(""))
json_table
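As a small illustration of what `fromJSON()` hands back, the sketch below parses an inline JSON string instead of the remote file. It assumes a hypothetical layout in which the records sit in an array under a single top-level key, here called `Sheet1`; the actual file may be structured differently. In that layout `fromJSON()` already simplifies the array of records into a dataframe, so it can also be pulled out of the list directly.

```r
library(jsonlite)

# Hypothetical JSON: one top-level key holding an array of records.
json_txt <- '{"Sheet1": [
  {"Title": "Book A", "Genre": "Fantasy"},
  {"Title": "Book B", "Genre": "Fantasy"}
]}'

parsed <- fromJSON(json_txt)

# The array of objects has been simplified to a data frame inside the list.
books_json <- parsed[["Sheet1"]]
books_json
```

Extracting the element by name keeps the original column names without needing to touch `col.names`, at the cost of hard-coding the key.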

Conclusion

Clearly, I prefer JSON and HTML tables over XML for parsing into dataframes. I thought my implementations of those two were quite slick. However, I think my XML parsing could be improved, and I would love to see better examples of it.

Data 607 Project-2

Alex Moyse
