Observations - Working with XML and JSON in R

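The tables and data frames differ slightly. While creating the XML file I replaced the spaces in the column names with underscores. The JSON file wraps its records in a top-level "Graphic Novels" key, and the HTML table titles its first column "Graphic Novel" rather than "Name". Dates are also formatted differently (yyyy-mm-dd in the XML file, m/d/yy in the JSON and HTML files), and the dates listed for Watchmen and The Swamp Thing are swapped between the XML file and the other two. After creating the XML file manually as suggested (the version used here), I also tested the XML export function in Access; its default settings produced complications, but the exercise was informative.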
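The naming and date differences described above could be reconciled after import along the following lines. This is a minimal sketch in base R, not part of the original workflow; the data frame `gn` is a hypothetical stand-in for a table imported from the JSON or HTML file, and only three columns are shown.

```r
# Hypothetical stand-in for a data frame imported from the HTML or JSON file.
gn <- data.frame(
  Graphic.Novel    = c("The Sandman", "Watchmen"),
  First.Issue.Date = c("1/1/89", "1/1/86"),
  Last.Issue.Date  = c("3/1/96", "12/1/87"),
  stringsAsFactors = FALSE
)

# Match the XML naming: underscores instead of dots, "Name" for the title column.
names(gn) <- gsub(".", "_", names(gn), fixed = TRUE)
names(gn)[names(gn) == "Graphic_Novel"] <- "Name"

# Parse the m/d/yy strings into Date objects, which print as yyyy-mm-dd.
date_cols <- c("First_Issue_Date", "Last_Issue_Date")
gn[date_cols] <- lapply(gn[date_cols], as.Date, format = "%m/%d/%y")

print(gn)
```

Note that `%y` maps two-digit years 69 to 99 onto the 1900s and 00 to 68 onto the 2000s, which happens to suit these publication dates but would need checking for other data.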

XML

Link: https://raw.githubusercontent.com/sigmasigmaiota/GraphicNovels/master/GraphicNovels.xml

Import

RCurl downloads the file from GitHub and XML parses it; I then convert the parsed document to a list and bind the records into a data frame. kableExtra displays the result.

library(XML)
library(RCurl)
library(kableExtra)

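#Set the URL of the raw XML file.
xmlurl<-"https://raw.githubusercontent.com/sigmasigmaiota/GraphicNovels/master/GraphicNovels.xml"

#Download the file contents as a character string with RCurl.
xmlfile<-getURL(xmlurl)

#Parse the XML string into an internal document, then convert it to a nested list.
xml.table <- xmlParse(xmlfile,useInternalNodes=TRUE)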
xml.table2<-xmlToList(xml.table)

#Convert to dataframe.
GN.xml<-do.call(rbind.data.frame, xml.table2)
rownames(GN.xml)<-NULL

kable(GN.xml)%>%
kable_styling()
Name             Publisher  First_Issue_Date  Last_Issue_Date  Author       CoAuthor_1        CoAuthor_2
The Sandman      DC         1989-01-01        1996-03-01       Neil Gaiman  Sam Kieth         Mike Dringenberg
Watchmen         DC         1984-02-01        1987-09-01       Allen Moore  Dave Gibbons      John Higgins
The Swamp Thing  DC         1986-01-01        1987-12-01       Allen Moore  Stephen Bissette  Jon Totleben
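For flat records like these, the XML package also offers `xmlToDataFrame()`, which may replace the list-and-bind step. The sketch below is an alternative, not the method used above; it parses a small inline sample (an assumption, trimmed to two fields and two records) rather than the GitHub file so that it runs without a network connection.

```r
library(XML)

# Inline sample with the same structure as GraphicNovels.xml (two fields only).
xml_text <- "<GraphicNovels>
  <Graphic_Novel><Name>The Sandman</Name><Publisher>DC</Publisher></Graphic_Novel>
  <Graphic_Novel><Name>Watchmen</Name><Publisher>DC</Publisher></Graphic_Novel>
</GraphicNovels>"

# asText = TRUE tells the parser the input is XML content, not a file name.
doc <- xmlParse(xml_text, asText = TRUE)

# One row per child of the root node, one column per field.
gn_df <- xmlToDataFrame(doc, stringsAsFactors = FALSE)

print(gn_df)
```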

File

The file, as it looks after download from GitHub and parsing.

print(xml.table)
## <?xml version="1.0"?>
## <GraphicNovels>
##   <Graphic_Novel>
##     <Name>The Sandman</Name>
##     <Publisher>DC</Publisher>
##     <First_Issue_Date>1989-01-01</First_Issue_Date>
##     <Last_Issue_Date>1996-03-01</Last_Issue_Date>
##     <Author>Neil Gaiman</Author>
##     <CoAuthor_1>Sam Kieth</CoAuthor_1>
##     <CoAuthor_2>Mike Dringenberg</CoAuthor_2>
##   </Graphic_Novel>
##   <Graphic_Novel>
##     <Name>Watchmen</Name>
##     <Publisher>DC</Publisher>
##     <First_Issue_Date>1984-02-01</First_Issue_Date>
##     <Last_Issue_Date>1987-09-01</Last_Issue_Date>
##     <Author>Allen Moore</Author>
##     <CoAuthor_1>Dave Gibbons</CoAuthor_1>
##     <CoAuthor_2>John Higgins</CoAuthor_2>
##   </Graphic_Novel>
##   <Graphic_Novel>
##     <Name>The Swamp Thing</Name>
##     <Publisher>DC</Publisher>
##     <First_Issue_Date>1986-01-01</First_Issue_Date>
##     <Last_Issue_Date>1987-12-01</Last_Issue_Date>
##     <Author>Allen Moore</Author>
##     <CoAuthor_1>Stephen Bissette</CoAuthor_1>
##     <CoAuthor_2>Jon Totleben</CoAuthor_2>
##   </Graphic_Novel>
## </GraphicNovels>
## 

JSON

Link: https://raw.githubusercontent.com/sigmasigmaiota/GraphicNovels/master/GraphicNovels.json

Import

RCurl downloads the JSON file and jsonlite parses it.

library(jsonlite)

#Set the URL of the raw JSON file.
jsonurl<-"https://raw.githubusercontent.com/sigmasigmaiota/GraphicNovels/master/GraphicNovels.json"

#Download the file contents as a character string with RCurl.
jsonfile<-getURL(jsonurl)

#Parse the JSON string; the result is a list holding one data frame.
jsontable<- fromJSON(jsonfile)

#Unname the list so the top-level key is not prefixed to every column name.
GN.json<-as.data.frame(unname(jsontable))

kable(GN.json)%>%
kable_styling()
Name             Publisher  First.Issue.Date  Last.Issue.Date  Author       CoAuthor.1        CoAuthor.2
The Sandman      DC         1/1/89            3/1/96           Neil Gaiman  Sam Kieth         Mike Dringenberg
Watchmen         DC         1/1/86            12/1/87          Allen Moore  Dave Gibbons      John Higgins
The Swamp Thing  DC         2/1/84            9/1/87           Allen Moore  Stephen Bissette  Jon Totleben
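The dots in the column names above come from `as.data.frame()`, not from the JSON file. The sketch below compares that route with extracting the data frame by its top-level key, which leaves the names as written in the file. It uses a shortened inline JSON string (an assumption, two fields and two records) in place of the downloaded file.

```r
library(jsonlite)

# Inline sample shaped like GraphicNovels.json (two fields only).
json_text <- '{"Graphic Novels": [
  {"Name": "The Sandman", "First Issue Date": "1/1/89"},
  {"Name": "Watchmen", "First Issue Date": "1/1/86"}
]}'

parsed <- fromJSON(json_text)

# Extract by key: the column names keep their spaces.
gn_named <- parsed[["Graphic Novels"]]

# as.data.frame() on the unnamed list makes the names syntactic (dots).
gn_unnamed <- as.data.frame(unname(parsed))

print(names(gn_named))
print(names(gn_unnamed))
```

Which form is preferable depends on what follows: names with spaces match the source file but need backticks in most R expressions.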

File

The file, as it looks after download from GitHub, before parsing.

print(jsonfile)
## [1] "{\r\n\t\"Graphic Novels\": [{\r\n\t\t\t\"Name\": \"The Sandman\",\r\n\t\t\t\"Publisher\": \"DC\",\r\n\t\t\t\"First Issue Date\": \"1/1/89\",\r\n\t\t\t\"Last Issue Date\": \"3/1/96\",\r\n\t\t\t\"Author\": \"Neil Gaiman\",\r\n\t\t\t\"CoAuthor 1\": \"Sam Kieth\",\r\n\t\t\t\"CoAuthor 2\": \"Mike Dringenberg\"\r\n\t\t},\r\n\t\t{\r\n\t\t\t\"Name\": \"Watchmen\",\r\n\t\t\t\"Publisher\": \"DC\",\r\n\t\t\t\"First Issue Date\": \"1/1/86\",\r\n\t\t\t\"Last Issue Date\": \"12/1/87\",\r\n\t\t\t\"Author\": \"Allen Moore\",\r\n\t\t\t\"CoAuthor 1\": \"Dave Gibbons\",\r\n\t\t\t\"CoAuthor 2\": \"John Higgins\"\r\n\t\t},\r\n\t\t{\r\n\t\t\t\"Name\": \"The Swamp Thing\",\r\n\t\t\t\"Publisher\": \"DC\",\r\n\t\t\t\"First Issue Date\": \"2/1/84\",\r\n\t\t\t\"Last Issue Date\": \"9/1/87\",\r\n\t\t\t\"Author\": \"Allen Moore\",\r\n\t\t\t\"CoAuthor 1\": \"Stephen Bissette\",\r\n\t\t\t\"CoAuthor 2\": \"Jon Totleben\"\r\n\t\t}\r\n\t]\r\n}"

HTML

Link: https://raw.githubusercontent.com/sigmasigmaiota/GraphicNovels/master/GraphicNovels.html

Import

RCurl downloads the HTML file and XML parses its table; rlist removes any empty entries from the result.

library(rlist)

htmlurl<-"https://raw.githubusercontent.com/sigmasigmaiota/GraphicNovels/master/GraphicNovels.html"

htmlfile<-getURL(htmlurl)

#Read every table in the HTML into a list of data frames, then drop NULL entries.
htmltable2<- readHTMLTable(htmlfile)
htmltable2<-list.clean(htmltable2,fun=is.null,recursive=FALSE)

#Unname the table to avoid column name changes.
GN.html<-as.data.frame(unname(htmltable2))

kable(GN.html)%>%
kable_styling()
Graphic.Novel    Publisher  First.Issue.Date  Last.Issue.Date  Author       CoAuthor.1        CoAuthor.2
The Sandman      DC         1/1/89            3/1/96           Neil Gaiman  Sam Kieth         Mike Dringenberg
Watchmen         DC         1/1/86            12/1/87          Allen Moore  Dave Gibbons      John Higgins
The Swamp Thing  DC         2/1/84            9/1/87           Allen Moore  Stephen Bissette  Jon Totleben
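When a page holds a single table, the `which` argument of `readHTMLTable()` can return that table directly as a data frame, which may make the list cleaning and `unname()` steps unnecessary. This is a sketch of that alternative under the assumption of a small inline HTML string (two columns, two records) standing in for the downloaded file.

```r
library(XML)

# Inline sample shaped like the table in GraphicNovels.html (two columns only).
html_text <- "<html><body><table>
  <tr><th>Graphic Novel</th><th>Publisher</th></tr>
  <tr><td>The Sandman</td><td>DC</td></tr>
  <tr><td>Watchmen</td><td>DC</td></tr>
</table></body></html>"

# Parse the string as HTML content rather than as a file name.
doc <- htmlParse(html_text, asText = TRUE)

# which = 1 selects the first table and returns it as a single data frame.
gn_html <- readHTMLTable(doc, which = 1, stringsAsFactors = FALSE)

print(gn_html)
```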

File

The file, as it looks after download from GitHub, before parsing.

print(htmlfile)
## [1] "<!DOCTYPE html>\r\n<html lang=\"en\"> \r\n<head>\r\n<meta charset=\"utf-8\"/>\r\n<title>Graphic Novels</title>\r\n<style>\r\ntable {\r\n  font-family: arial, sans-serif;\r\n  border-collapse: collapse;\r\n  width: 100%;\r\n}\r\ntd, th {\r\n  border: 1px solid #dddddd;\r\n  text-align: left;\r\n  padding: 8px;\r\n}\r\ntr:nth-child(even) {\r\n  background-color: #dddddd;\r\n}\r\n</style>\r\n</head>\r\n<body>\r\n<h2>Graphic Novels</h2>\r\n<table>\r\n  <tr>\r\n    <th>Graphic Novel</th>\r\n    <th>Publisher</th>\r\n    <th>First Issue Date</th>\r\n    <th>Last Issue Date</th>\r\n    <th>Author</th>\r\n    <th>CoAuthor 1</th>\r\n    <th>CoAuthor 2</th>\r\n  </tr>\r\n  <tr>\r\n    <td>The Sandman</td>\r\n    <td>DC</td>\r\n    <td>1/1/89</td>\r\n    <td>3/1/96</td>\r\n    <td>Neil Gaiman</td>\r\n    <td>Sam Kieth</td>\r\n    <td>Mike Dringenberg</td>\r\n  </tr>\r\n  <tr>\r\n    <td>Watchmen</td>\r\n    <td>DC</td>\r\n    <td>1/1/86</td>\r\n    <td>12/1/87</td>\r\n    <td>Allen Moore</td>\r\n    <td>Dave Gibbons</td>\r\n    <td>John Higgins</td>\r\n  </tr>\r\n  <tr>\r\n    <td>The Swamp Thing</td>\r\n    <td>DC</td>\r\n    <td>2/1/84</td>\r\n    <td>9/1/87</td>\r\n    <td>Allen Moore</td>\r\n    <td>Stephen Bissette</td>\r\n    <td>Jon Totleben</td>\r\n  </tr>\r\n</table>\r\n</body>\r\n</html>"