library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(openxlsx)
library(dplyr)
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(varhandle)
library(rvest)
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
library(RSelenium)
library(xml2)
library(xmlconvert)
## Warning: package 'xmlconvert' was built under R version 4.1.3
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
We’ll start with a basic, boring HTML table. First, we read in the HTML document that contains the table.
html <- read_html("https://raw.githubusercontent.com/Amantux/Data607_Assignment7/main/table_html2.html")
From there, we grab the HTML node that holds the table and turn it into a data frame, telling html_table to treat the first row as the header and to fill any missing cells.
html %>% html_node("table") %>% html_table(header = TRUE, fill = TRUE)
As you can see, it’s perfectly functional and generates a simple data frame of fantasy books.
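For readers who want to try the same pattern without hitting the network, here is a minimal sketch that parses a small inline table. The HTML string and the book titles in it are made-up placeholders, not the contents of the file above.

```r
library(rvest)

# Hypothetical inline HTML standing in for the remote file (assumed content).
page <- read_html("<table>
  <tr><th>Title</th><th>Genre</th></tr>
  <tr><td>Book A</td><td>Fantasy</td></tr>
  <tr><td>Book B</td><td>Fantasy</td></tr>
</table>")

# Select the first <table> node and convert it, using the first row as the header.
books_html <- page %>%
  html_node("table") %>%
  html_table(header = TRUE)

books_html
```

The same two calls, html_node and html_table, do all the work here; only the source of the HTML differs.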
Let me start by saying that I am not very familiar with XML; I have rarely come across XML tables. First, let’s read in the file and look at the structure of the XML. From there, we use an XPath expression for each field to pull its text into its own vector. Then we build a data frame with one column per field, named after the table’s columns. Lastly, we drop the first row: as the Genre output below shows, the first matching node holds the header label rather than a book.
xml_address <- "https://raw.githubusercontent.com/Amantux/Data607_Assignment7/main/Table_XML.xml"
xml_Book_List <- read_xml(xml_address)
#xml_structure(xml_Book_List)
#xml_text(xml_Book_List)
xml_find_all(xml_Book_List, xpath = "//Genre")
## {xml_nodeset (4)}
## [1] <Genre index="3">Genre</Genre>
## [2] <Genre>Fantasy </Genre>
## [3] <Genre>Fantasy </Genre>
## [4] <Genre>Fantasy </Genre>
xml_text(xml_find_all(xml_Book_List, xpath = "//Genre"))
## [1] "Genre" "Fantasy " "Fantasy " "Fantasy "
Genre <- xml_text(xml_find_all(xml_Book_List, xpath = "//Genre"))
Title <- xml_text(xml_find_all(xml_Book_List, xpath = "//Title"))
Source <- xml_text(xml_find_all(xml_Book_List, xpath = "//Source"))
Language <- xml_text(xml_find_all(xml_Book_List, xpath = "//Language"))
Series <- xml_text(xml_find_all(xml_Book_List, xpath = "//Series"))
df <- tibble("Genre"=Genre, "Source"=Source, "Title"=Title, "Language"=Language, "Series"=Series)
df <- df[-1,]
df
#f <- xmlToDataFrame(nodes = getNodeSet(xml_Book_List, "//Sheet1"))
As you can see, the table matches the prior table and the source Excel document.
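One possible alternative to running a separate document-wide XPath query per field is to select the record nodes first and then look up each field relative to its record. The sketch below assumes a layout with one element per book; the inline XML, the element names (Books, Sheet1) and the titles are illustrative guesses, not the confirmed structure of the file above.

```r
library(xml2)
library(tibble)

# Hypothetical inline XML; the real file's structure may differ.
xml_doc <- read_xml("<Books>
  <Sheet1><Title>Book A</Title><Genre>Fantasy </Genre></Sheet1>
  <Sheet1><Title>Book B</Title><Genre>Fantasy </Genre></Sheet1>
</Books>")

# One node per book record.
records <- xml_find_all(xml_doc, "//Sheet1")

# For every field, take the first matching child of each record.
# xml_find_first keeps one slot per record, so a missing field becomes NA
# instead of silently shifting the column.
fields <- c("Title", "Genre")
columns <- lapply(fields, function(field) {
  trimws(xml_text(xml_find_first(records, paste0("./", field))))
})
names(columns) <- fields

books_xml <- as_tibble(columns)
books_xml
```

Because every column is built from the same set of record nodes, the columns always have equal length, and trimws removes trailing spaces such as the one in "Fantasy ".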
As before, we first read in the JSON. From there, we convert the parsed data into a data frame with the as.data.frame function, setting col.names to an empty string so that the column names match the formatting of the prior tables.
Json_address = "https://raw.githubusercontent.com/Amantux/Data607_Assignment7/main/Table_Json.txt"
json_raw <- fromJSON(Json_address)
json_table <- as.data.frame(json_raw, col.names = c(""))
json_table
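As a small offline illustration of why the JSON route is so short, the sketch below parses an inline JSON string. It assumes the records sit in an array under a top-level key; the key name Sheet1 and the titles are hypothetical, not taken from the file above.

```r
library(jsonlite)

# Hypothetical inline JSON; the real file's layout may differ.
json_text <- '{"Sheet1": [
  {"Title": "Book A", "Genre": "Fantasy"},
  {"Title": "Book B", "Genre": "Fantasy"}
]}'

# fromJSON simplifies an array of objects into a data frame by default.
parsed <- fromJSON(json_text)

# Pulling the element out by name gives a data frame with plain column names.
books_json <- parsed$Sheet1
books_json
```

Extracting the element by name is one way to get un-prefixed column names, as an alternative to passing col.names to as.data.frame.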
Clearly, I prefer JSON and HTML tables over XML for parsing into data frames. I thought my implementations of those two were quite slick. However, I think my XML parsing could be improved, and I would love to see better examples of it.