Summary of Assignment
This assignment involves loading and transforming HTML, XML, and JSON data.
This assignment requires the following:
1). RStudio
2). The following R packages: XML, RCurl, RJSONIO (install any that are missing with install.packages(), for example install.packages("XML"))
Steps to reproduce:
1). Place the following three files in the directory "C:": books_dt.html, books_dt.xml, books_dt.JSON
2). Run the R Markdown file in RStudio: R_607_Week_7a_Hmk_HTML_XML_JSON__Daniel_Thonn.Rmd
Setting up and Preparing the Environment
#install.packages("XML")
library(XML)
#install.packages("RCurl")
library(RCurl)
## Loading required package: bitops
#install.packages("RJSONIO")
library(RJSONIO)
Load the HTML file and check structure
# Load the HTML file
html_1 <- getURL("https://raw.githubusercontent.com/danthonn/607_Week_7/master/books_dt.html")
html_1
## [1] "<!DOCTYPE html>\n<html>\n<head>\n<title>Books</title>\n</head>\n<body>\n<table border=\"1\">\n<tr>\n<th>Title</th>\n<th>Author</th>\n<th>Year-First_Edition</th>\n<th>Year-First-Film</th>\n<th>Original-Publisher</th>\n</tr>\n<tr>\n<td>The Virginian</td>\n<td>Owen Wister</td>\n<td>1902</td>\n<td>1914</td>\n<td>Macmillan Publishers</td>\n</tr>\n<tr>\n<td>Riders of the Purple Sage</td>\n<td>Zane Grey</td>\n<td>1912</td>\n<td>na</td>\n<td>Harper and Brothers</td>\n</tr>\n<tr>\n<td>Crooked Trails</td>\n<td>Frederick Remington</td>\n<td>1898</td>\n<td>1918</td>\n<td>Harper and Brothers</td>\n</tr>\n</table>\n</body>\n</html>"
# Read into an HTML table
html_2 <- readHTMLTable(html_1, header=TRUE)
html_2
## $`NULL`
##                       Title              Author Year-First_Edition
## 1             The Virginian         Owen Wister               1902
## 2 Riders of the Purple Sage           Zane Grey               1912
## 3            Crooked Trails Frederick Remington               1898
##   Year-First-Film   Original-Publisher
## 1            1914 Macmillan Publishers
## 2              na  Harper and Brothers
## 3            1918  Harper and Brothers
# Read into a Data Frame and check output
html_3 <- as.data.frame(html_2)
html_3
##                  NULL.Title         NULL.Author NULL.Year.First_Edition
## 1             The Virginian         Owen Wister                    1902
## 2 Riders of the Purple Sage           Zane Grey                    1912
## 3            Crooked Trails Frederick Remington                    1898
##   NULL.Year.First.Film NULL.Original.Publisher
## 1                 1914    Macmillan Publishers
## 2                   na     Harper and Brothers
## 3                 1918     Harper and Brothers
head(html_3)
##                  NULL.Title         NULL.Author NULL.Year.First_Edition
## 1             The Virginian         Owen Wister                    1902
## 2 Riders of the Purple Sage           Zane Grey                    1912
## 3            Crooked Trails Frederick Remington                    1898
##   NULL.Year.First.Film NULL.Original.Publisher
## 1                 1914    Macmillan Publishers
## 2                   na     Harper and Brothers
## 3                 1918     Harper and Brothers
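The `NULL.` prefix on the column names above comes from converting the whole list returned by `readHTMLTable` instead of the single table inside it. A minimal sketch of one way to avoid that is shown here. It assumes a shortened inline copy of the table in place of books_dt.html, and the names `html_txt`, `doc`, `tables`, `books_html`, and `film_year` are illustrative rather than part of the assignment code.

```r
library(XML)

# Shortened inline stand-in for books_dt.html (assumption: same table layout)
html_txt <- '<html><body><table border="1">
<tr><th>Title</th><th>Author</th><th>Year-First-Film</th></tr>
<tr><td>The Virginian</td><td>Owen Wister</td><td>1914</td></tr>
<tr><td>Riders of the Purple Sage</td><td>Zane Grey</td><td>na</td></tr>
</table></body></html>'

# Parse the string explicitly as HTML text, then read every table in it
doc <- htmlParse(html_txt, asText = TRUE)
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# readHTMLTable returns a list of tables; take the first one directly
# so the column names are not prefixed with the list element name
books_html <- tables[[1]]

# Treat the "na" placeholder in the third column (Year-First-Film) as
# missing, then store the year as an integer
film_year <- books_html[[3]]
film_year[film_year == "na"] <- NA
books_html[[3]] <- as.integer(film_year)

str(books_html)
```

Indexing the list with `[[1]]` keeps the original column headings, and converting `"na"` to `NA` before calling `as.integer` avoids a coercion warning.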
Load the XML file and check structure
# Load the XML file
xml_1 <- getURL("https://raw.githubusercontent.com/danthonn/607_Week_7/master/books_dt.xml")
xml_1
## [1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<BOOKS>\n <HEADINGS>\n <TITLE>Title</TITLE>\n <AUTHORS>Authors</AUTHORS>\n <YEAR-FIRST-EDITION>Year-First-Edition</YEAR-FIRST-EDITION>\n <YEAR-FIRST-FILM>Year-First-Film</YEAR-FIRST-FILM>\n <ORIGINAL-PUBLISHER>Original-Publisher</ORIGINAL-PUBLISHER>\t\n </HEADINGS>\n <BOOK>\n <TITLE>The Virginian</TITLE>\n <AUTHORS>Owen Wister</AUTHORS>\n <YEAR-FIRST-EDITION>1902</YEAR-FIRST-EDITION>\n <YEAR-FIRST-FILM>1914</YEAR-FIRST-FILM>\n <ORIGINAL-PUBLISHER>Macmillan Publishers</ORIGINAL-PUBLISHER>\t\n </BOOK>\n <BOOK>\n <TITLE>Riders of the Purple Sage</TITLE>\n <AUTHORS>Zane Grey</AUTHORS>\n <YEAR-FIRST-EDITION>1912</YEAR-FIRST-EDITION>\n <YEAR-FIRST-FILM>na</YEAR-FIRST-FILM>\n <ORIGINAL-PUBLISHER>Harper and Brothers</ORIGINAL-PUBLISHER>\t\n </BOOK>\n <BOOK>\n <TITLE>Crooked Trails</TITLE>\n <AUTHORS>Frederick Remington</AUTHORS>\n <YEAR-FIRST-EDITION>1898</YEAR-FIRST-EDITION>\n <YEAR-FIRST-FILM>1918</YEAR-FIRST-FILM>\n <ORIGINAL-PUBLISHER>Harper and Brothers</ORIGINAL-PUBLISHER>\t\n </BOOK>\n</BOOKS>\n"
# Parse the XML file
xml_2 <- xmlTreeParse(xml_1)
# Get the root node and extract the text content of each child node
root_1 <- xmlRoot(xml_2)
xml_3 <- xmlSApply(root_1, xmlValue)
xml_3
## HEADINGS
## "TitleAuthorsYear-First-EditionYear-First-FilmOriginal-Publisher"
## BOOK
## "The VirginianOwen Wister19021914Macmillan Publishers"
## BOOK
## "Riders of the Purple SageZane Grey1912naHarper and Brothers"
## BOOK
## "Crooked TrailsFrederick Remington18981918Harper and Brothers"
# Load the XML file into a Data Frame
xml_4 <- as.data.frame(xml_3)
xml_4
## xml_3
## 1 TitleAuthorsYear-First-EditionYear-First-FilmOriginal-Publisher
## 2 The VirginianOwen Wister19021914Macmillan Publishers
## 3 Riders of the Purple SageZane Grey1912naHarper and Brothers
## 4 Crooked TrailsFrederick Remington18981918Harper and Brothers
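`xmlSApply(root_1, xmlValue)` concatenates the text of every child element, so each book collapses into a single string and the data frame has only one column. One possible alternative, sketched here under the assumption that every `BOOK` node has the same child elements, is `xmlToDataFrame` from the same XML package, which produces one column per child element. The inline `xml_txt` is a shortened stand-in for books_dt.xml, and the names `doc`, `book_nodes`, and `books_xml` are illustrative.

```r
library(XML)

# Shortened inline stand-in for books_dt.xml (assumption: same node layout)
xml_txt <- '<?xml version="1.0" encoding="UTF-8"?>
<BOOKS>
<HEADINGS><TITLE>Title</TITLE><AUTHORS>Authors</AUTHORS><YEAR-FIRST-FILM>Year-First-Film</YEAR-FIRST-FILM></HEADINGS>
<BOOK><TITLE>The Virginian</TITLE><AUTHORS>Owen Wister</AUTHORS><YEAR-FIRST-FILM>1914</YEAR-FIRST-FILM></BOOK>
<BOOK><TITLE>Riders of the Purple Sage</TITLE><AUTHORS>Zane Grey</AUTHORS><YEAR-FIRST-FILM>na</YEAR-FIRST-FILM></BOOK>
</BOOKS>'

# Parse into an internal document so XPath queries can be used
doc <- xmlParse(xml_txt, asText = TRUE)

# Select only the BOOK nodes; the HEADINGS node just repeats the field names
book_nodes <- getNodeSet(doc, "//BOOK")

# Build a data frame with one row per BOOK and one column per child element
books_xml <- xmlToDataFrame(nodes = book_nodes, stringsAsFactors = FALSE)
books_xml
```

Selecting `//BOOK` with XPath leaves the `HEADINGS` node out, so the field names do not appear as a data row.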
Load the JSON file and check structure
# Load the JSON file
json_1 <- getURL("https://raw.githubusercontent.com/danthonn/607_Week_7/master/books_dt.JSON")
json_1
## [1] "[\n {\n \"Title\": \"Title\",\n \"Authors\": \"Authors\",\n \"Year-First-Edition\": \"Year-First-Edition\",\n \"Year-First-Film\": \"Year-First-Film\",\n \"Original-Publisher\": \"Original-Publisher\"\n \n }, \n {\n \"Title\": \"The Virginian\",\n \"Authors\": \"Owen Wister\",\n \"Year-First-Edition\": \"1902\",\n \"Year-First-Film\": \"1914\",\n \"Original-Publisher\": \"Macmillan Publishers\"\n \n },\n {\n \"Title\": \"Riders of the Purple Sage\",\n \"Authors\": \"Zane Grey\",\n \"Year-First-Edition\": \"1912\",\n \"Year-First-Film\": \"na\",\n \"Original-Publisher\": \"Harper and Brothers\"\n \n },\n {\n \"Title\": \"Crooked Trails\",\n \"Authors\": \"Frederick Remington\",\n \"Year-First-Edition\": \"1898\",\n \"Year-First-Film\": \"1918\",\n \"Original-Publisher\": \"Harper and Brothers\" \n }\n \n]\n"
# Parse the raw JSON text into an R list (one element per record)
json_2 <- RJSONIO::fromJSON(json_1)
json_2
## [[1]]
## Title Authors Year-First-Edition
## "Title" "Authors" "Year-First-Edition"
## Year-First-Film Original-Publisher
## "Year-First-Film" "Original-Publisher"
##
## [[2]]
## Title Authors Year-First-Edition
## "The Virginian" "Owen Wister" "1902"
## Year-First-Film Original-Publisher
## "1914" "Macmillan Publishers"
##
## [[3]]
## Title Authors
## "Riders of the Purple Sage" "Zane Grey"
## Year-First-Edition Year-First-Film
## "1912" "na"
## Original-Publisher
## "Harper and Brothers"
##
## [[4]]
## Title Authors Year-First-Edition
## "Crooked Trails" "Frederick Remington" "1898"
## Year-First-Film Original-Publisher
## "1918" "Harper and Brothers"
# Place in a data frame
json_3 <- as.data.frame(json_2)
json_3
## structure.c..Title....Authors....Year.First.Edition....Year.First.Film...
## Title Title
## Authors Authors
## Year-First-Edition Year-First-Edition
## Year-First-Film Year-First-Film
## Original-Publisher Original-Publisher
## structure.c..The.Virginian....Owen.Wister....1902....1914....Macmillan.Publishers.
## Title The Virginian
## Authors Owen Wister
## Year-First-Edition 1902
## Year-First-Film 1914
## Original-Publisher Macmillan Publishers
## structure.c..Riders.of.the.Purple.Sage....Zane.Grey....1912...
## Title Riders of the Purple Sage
## Authors Zane Grey
## Year-First-Edition 1912
## Year-First-Film na
## Original-Publisher Harper and Brothers
## structure.c..Crooked.Trails....Frederick.Remington....1898...
## Title Crooked Trails
## Authors Frederick Remington
## Year-First-Edition 1898
## Year-First-Film 1918
## Original-Publisher Harper and Brothers
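`as.data.frame(json_2)` treats each list element as a column, so the books end up as columns and the fields as rows. A minimal sketch of one way to get one row per book is to drop the heading record and stack the remaining records with `rbind`. The inline `json_txt` is a shortened stand-in for books_dt.JSON, and the names `records` and `books_json` are illustrative.

```r
library(RJSONIO)

# Shortened inline stand-in for books_dt.JSON (assumption: same record layout)
json_txt <- '[
  {"Title": "Title", "Authors": "Authors", "Year-First-Film": "Year-First-Film"},
  {"Title": "The Virginian", "Authors": "Owen Wister", "Year-First-Film": "1914"},
  {"Title": "Riders of the Purple Sage", "Authors": "Zane Grey", "Year-First-Film": "na"}
]'

# Each JSON object becomes a named character vector
records <- RJSONIO::fromJSON(json_txt, asText = TRUE)

# The first record only repeats the field names, so drop it
records <- records[-1]

# Stack the records row-wise into a character matrix, then convert it
# to a data frame with one row per book
books_json <- as.data.frame(do.call(rbind, records), stringsAsFactors = FALSE)
books_json
```

Stacking with `rbind` works here because every record has the same fields in the same order; records with differing fields would need to be aligned by name first.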
Conclusion: The HTML, XML, and JSON files contain the same information, but each loads into R with a different structure. The HTML table becomes a data frame with one row per book and `NULL.`-prefixed column names, the XML values are concatenated into a single string per node, and the JSON records become columns rather than rows. Further processing would therefore need format-specific handling to extract the same information from each.