Assignment 607_Homework-7: Hmw_607_Working_with_XML_and_JSON_in_R_Daniel_Thonn

Summary of Assignment: This assignment involves loading and transforming HTML, XML, and JSON data in R.

This assignment requires the following:

1). RStudio

The following R packages are used (uncomment the install lines if a package is not yet installed):
#install.packages("XML")
#install.packages("RCurl")
#install.packages("RJSONIO")

Steps to reproduce:

1). Place the following three files in the directory "C:" (the code below reads the same files from their GitHub raw URLs):
books_dt.html
books_dt.xml
books_dt.JSON

2). Run the R Markdown file in RStudio: R_607_Week_7a_Hmk_HTML_XML_JSON__Daniel_Thonn.Rmd

Setting up and Preparing the Environment

#install.packages("XML")
library(XML)
#install.packages("RCurl")
library(RCurl)
## Loading required package: bitops
#install.packages("RJSONIO")
library(RJSONIO)
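
The commented-out install.packages() calls above have to be run by hand the first time. Below is a minimal sketch of one way to check which of the three packages are missing before loading them. The variable names pkgs, installed, and missing_pkgs are illustrative and not part of the original script, and the install step assumes a CRAN mirror is configured.

```r
# Packages used in this assignment
pkgs <- c("XML", "RCurl", "RJSONIO")

# requireNamespace() returns FALSE (instead of raising an error) when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install before continuing: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install; assumes a CRAN mirror is set
}
```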

Load the HTML file and check structure

# Load the HTML file
html_1 <- getURL("https://raw.githubusercontent.com/danthonn/607_Week_7/master/books_dt.html")
html_1
## [1] "<!DOCTYPE html>\n<html>\n<head>\n<title>Books</title>\n</head>\n<body>\n<table border=\"1\">\n<tr>\n<th>Title</th>\n<th>Author</th>\n<th>Year-First_Edition</th>\n<th>Year-First-Film</th>\n<th>Original-Publisher</th>\n</tr>\n<tr>\n<td>The Virginian</td>\n<td>Owen Wister</td>\n<td>1902</td>\n<td>1914</td>\n<td>Macmillan Publishers</td>\n</tr>\n<tr>\n<td>Riders of the Purple Sage</td>\n<td>Zane Grey</td>\n<td>1912</td>\n<td>na</td>\n<td>Harper and Brothers</td>\n</tr>\n<tr>\n<td>Crooked Trails</td>\n<td>Frederick Remington</td>\n<td>1898</td>\n<td>1918</td>\n<td>Harper and Brothers</td>\n</tr>\n</table>\n</body>\n</html>"
# Read into an HTML table
html_2 <- readHTMLTable(html_1, header=TRUE)
html_2
## $`NULL`
##                       Title              Author Year-First_Edition
## 1             The Virginian         Owen Wister               1902
## 2 Riders of the Purple Sage           Zane Grey               1912
## 3            Crooked Trails Frederick Remington               1898
##   Year-First-Film   Original-Publisher
## 1            1914 Macmillan Publishers
## 2              na  Harper and Brothers
## 3            1918  Harper and Brothers
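# Convert the list to a data frame and check output; the table has no id, so its
# list element is named "NULL" and that name is prefixed to every column name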
html_3 <- as.data.frame(html_2)
html_3
##                  NULL.Title         NULL.Author NULL.Year.First_Edition
## 1             The Virginian         Owen Wister                    1902
## 2 Riders of the Purple Sage           Zane Grey                    1912
## 3            Crooked Trails Frederick Remington                    1898
##   NULL.Year.First.Film NULL.Original.Publisher
## 1                 1914    Macmillan Publishers
## 2                   na     Harper and Brothers
## 3                 1918     Harper and Brothers
head(html_3)
##                  NULL.Title         NULL.Author NULL.Year.First_Edition
## 1             The Virginian         Owen Wister                    1902
## 2 Riders of the Purple Sage           Zane Grey                    1912
## 3            Crooked Trails Frederick Remington                    1898
##   NULL.Year.First.Film NULL.Original.Publisher
## 1                 1914    Macmillan Publishers
## 2                   na     Harper and Brothers
## 3                 1918     Harper and Brothers
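
The "NULL." prefix in the column names above appears because readHTMLTable() returns a list of tables, and as.data.frame() prepends the list element's name to each column. The following is a minimal base-R sketch of tidying such a data frame. It rebuilds a stand-in for html_3 by hand so that it runs without a network connection. The names books_html and year_cols are illustrative, and treating the string "na" as a missing value is an assumption about the intended meaning.

```r
# Stand-in for html_3, typed in from the output above
books_html <- data.frame(
  NULL.Title              = c("The Virginian", "Riders of the Purple Sage", "Crooked Trails"),
  NULL.Author             = c("Owen Wister", "Zane Grey", "Frederick Remington"),
  NULL.Year.First_Edition = c("1902", "1912", "1898"),
  NULL.Year.First.Film    = c("1914", "na", "1918"),
  NULL.Original.Publisher = c("Macmillan Publishers", "Harper and Brothers", "Harper and Brothers"),
  stringsAsFactors = FALSE
)

# Drop the "NULL." prefix from every column name
names(books_html) <- sub("^NULL\\.", "", names(books_html))

# Treat the literal string "na" as missing, then convert the year columns to integers
year_cols <- c("Year.First_Edition", "Year.First.Film")
for (col in year_cols) {
  values <- books_html[[col]]
  values[values == "na"] <- NA
  books_html[[col]] <- as.integer(values)
}

str(books_html)
```

In the original pipeline, extracting the first list element with html_2[[1]] instead of calling as.data.frame() on the whole list should avoid the prefix in the first place.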

Load the XML file and check structure

# Load the XML file
xml_1 <- getURL("https://raw.githubusercontent.com/danthonn/607_Week_7/master/books_dt.xml")
xml_1
## [1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<BOOKS>\n   <HEADINGS>\n      <TITLE>Title</TITLE>\n      <AUTHORS>Authors</AUTHORS>\n      <YEAR-FIRST-EDITION>Year-First-Edition</YEAR-FIRST-EDITION>\n      <YEAR-FIRST-FILM>Year-First-Film</YEAR-FIRST-FILM>\n      <ORIGINAL-PUBLISHER>Original-Publisher</ORIGINAL-PUBLISHER>\t\n   </HEADINGS>\n   <BOOK>\n      <TITLE>The Virginian</TITLE>\n      <AUTHORS>Owen Wister</AUTHORS>\n      <YEAR-FIRST-EDITION>1902</YEAR-FIRST-EDITION>\n      <YEAR-FIRST-FILM>1914</YEAR-FIRST-FILM>\n      <ORIGINAL-PUBLISHER>Macmillan Publishers</ORIGINAL-PUBLISHER>\t\n   </BOOK>\n   <BOOK>\n      <TITLE>Riders of the Purple Sage</TITLE>\n      <AUTHORS>Zane Grey</AUTHORS>\n      <YEAR-FIRST-EDITION>1912</YEAR-FIRST-EDITION>\n      <YEAR-FIRST-FILM>na</YEAR-FIRST-FILM>\n      <ORIGINAL-PUBLISHER>Harper and Brothers</ORIGINAL-PUBLISHER>\t\n   </BOOK>\n   <BOOK>\n      <TITLE>Crooked Trails</TITLE>\n      <AUTHORS>Frederick Remington</AUTHORS>\n      <YEAR-FIRST-EDITION>1898</YEAR-FIRST-EDITION>\n      <YEAR-FIRST-FILM>1918</YEAR-FIRST-FILM>\n      <ORIGINAL-PUBLISHER>Harper and Brothers</ORIGINAL-PUBLISHER>\t\n   </BOOK>\n</BOOKS>\n"
# Parse the XML file
xml_2 <- xmlTreeParse(xml_1)

# Get the root node and extract the text content of each child node
root_1 <- xmlRoot(xml_2)
xml_3 <- xmlSApply(root_1, xmlValue)
xml_3
##                                                          HEADINGS 
## "TitleAuthorsYear-First-EditionYear-First-FilmOriginal-Publisher" 
##                                                              BOOK 
##            "The VirginianOwen Wister19021914Macmillan Publishers" 
##                                                              BOOK 
##     "Riders of the Purple SageZane Grey1912naHarper and Brothers" 
##                                                              BOOK 
##    "Crooked TrailsFrederick Remington18981918Harper and Brothers"
# Load the extracted values into a data frame (a single column, with each record's fields concatenated)
xml_4 <- as.data.frame(xml_3)
xml_4
##                                                             xml_3
## 1 TitleAuthorsYear-First-EditionYear-First-FilmOriginal-Publisher
## 2            The VirginianOwen Wister19021914Macmillan Publishers
## 3     Riders of the Purple SageZane Grey1912naHarper and Brothers
## 4    Crooked TrailsFrederick Remington18981918Harper and Brothers
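
Because xmlValue() returns all the text beneath a node, the fields of each book are run together in the output above. The sketch below shows one alternative that keeps one column per child element, using xmlParse(), getNodeSet(), and xmlToDataFrame() from the same XML package. The inline xml_text string is a shortened stand-in for the downloaded file (two fields per book, no HEADINGS node), and the names xml_text, doc, book_nodes, and books_xml are illustrative.

```r
library(XML)

# Shortened stand-in for the contents of books_dt.xml
xml_text <- paste0(
  "<BOOKS>",
  "<BOOK><TITLE>The Virginian</TITLE><AUTHORS>Owen Wister</AUTHORS></BOOK>",
  "<BOOK><TITLE>Riders of the Purple Sage</TITLE><AUTHORS>Zane Grey</AUTHORS></BOOK>",
  "<BOOK><TITLE>Crooked Trails</TITLE><AUTHORS>Frederick Remington</AUTHORS></BOOK>",
  "</BOOKS>"
)

# Parse the string into an internal document so XPath queries can be used
doc <- xmlParse(xml_text, asText = TRUE)

# Select only the BOOK nodes; on the real file this would also skip the HEADINGS node
book_nodes <- getNodeSet(doc, "//BOOK")

# One row per BOOK node, one column per child element
books_xml <- xmlToDataFrame(nodes = book_nodes, stringsAsFactors = FALSE)
books_xml
```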

Load the JSON file and check structure

# Load the JSON file
json_1 <- getURL("https://raw.githubusercontent.com/danthonn/607_Week_7/master/books_dt.JSON")
json_1
## [1] "[\n   {\n        \"Title\": \"Title\",\n        \"Authors\": \"Authors\",\n        \"Year-First-Edition\": \"Year-First-Edition\",\n        \"Year-First-Film\": \"Year-First-Film\",\n        \"Original-Publisher\": \"Original-Publisher\"\n       \n    },   \n   {\n        \"Title\": \"The Virginian\",\n        \"Authors\": \"Owen Wister\",\n        \"Year-First-Edition\": \"1902\",\n        \"Year-First-Film\": \"1914\",\n        \"Original-Publisher\": \"Macmillan Publishers\"\n       \n    },\n    {\n        \"Title\": \"Riders of the Purple Sage\",\n        \"Authors\": \"Zane Grey\",\n        \"Year-First-Edition\": \"1912\",\n        \"Year-First-Film\": \"na\",\n        \"Original-Publisher\": \"Harper and Brothers\"\n       \n    },\n    {\n        \"Title\": \"Crooked Trails\",\n        \"Authors\": \"Frederick Remington\",\n        \"Year-First-Edition\": \"1898\",\n        \"Year-First-Film\": \"1918\",\n        \"Original-Publisher\": \"Harper and Brothers\" \n    }\n    \n]\n"
# Parse the raw JSON text into an R list (one named character vector per record)
json_2 <- RJSONIO::fromJSON(json_1)
json_2
## [[1]]
##                Title              Authors   Year-First-Edition 
##              "Title"            "Authors" "Year-First-Edition" 
##      Year-First-Film   Original-Publisher 
##    "Year-First-Film" "Original-Publisher" 
## 
## [[2]]
##                  Title                Authors     Year-First-Edition 
##        "The Virginian"          "Owen Wister"                 "1902" 
##        Year-First-Film     Original-Publisher 
##                 "1914" "Macmillan Publishers" 
## 
## [[3]]
##                       Title                     Authors 
## "Riders of the Purple Sage"                 "Zane Grey" 
##          Year-First-Edition             Year-First-Film 
##                      "1912"                        "na" 
##          Original-Publisher 
##       "Harper and Brothers" 
## 
## [[4]]
##                 Title               Authors    Year-First-Edition 
##      "Crooked Trails" "Frederick Remington"                "1898" 
##       Year-First-Film    Original-Publisher 
##                "1918" "Harper and Brothers"
# Place in a data frame (each record becomes a column rather than a row)
json_3 <- as.data.frame(json_2)
json_3
##                    structure.c..Title....Authors....Year.First.Edition....Year.First.Film...
## Title                                                                                  Title
## Authors                                                                              Authors
## Year-First-Edition                                                        Year-First-Edition
## Year-First-Film                                                              Year-First-Film
## Original-Publisher                                                        Original-Publisher
##                    structure.c..The.Virginian....Owen.Wister....1902....1914....Macmillan.Publishers.
## Title                                                                                   The Virginian
## Authors                                                                                   Owen Wister
## Year-First-Edition                                                                               1902
## Year-First-Film                                                                                  1914
## Original-Publisher                                                               Macmillan Publishers
##                    structure.c..Riders.of.the.Purple.Sage....Zane.Grey....1912...
## Title                                                   Riders of the Purple Sage
## Authors                                                                 Zane Grey
## Year-First-Edition                                                           1912
## Year-First-Film                                                                na
## Original-Publisher                                            Harper and Brothers
##                    structure.c..Crooked.Trails....Frederick.Remington....1898...
## Title                                                             Crooked Trails
## Authors                                                      Frederick Remington
## Year-First-Edition                                                          1898
## Year-First-Film                                                             1918
## Original-Publisher                                           Harper and Brothers
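
as.data.frame() treats each element of the list as a column, which is why the records above appear as columns with generated names. The base-R sketch below binds the records as rows instead. It uses a hand-built stand-in for json_2 (the list of named character vectors printed above, shortened to three fields) so that it runs without RJSONIO or a network connection. The names records, book_matrix, and books_json are illustrative, and dropping the first element assumes it is the header record, as it is in books_dt.JSON.

```r
# Stand-in for json_2: a list with one named character vector per record
records <- list(
  c(Title = "Title", Authors = "Authors", `Year-First-Film` = "Year-First-Film"),
  c(Title = "The Virginian", Authors = "Owen Wister", `Year-First-Film` = "1914"),
  c(Title = "Riders of the Purple Sage", Authors = "Zane Grey", `Year-First-Film` = "na"),
  c(Title = "Crooked Trails", Authors = "Frederick Remington", `Year-First-Film` = "1918")
)

# Drop the header record, then stack the remaining vectors as rows of a character matrix
book_matrix <- do.call(rbind, records[-1])

# Convert the matrix to a data frame: one row per book, one column per field
books_json <- as.data.frame(book_matrix, stringsAsFactors = FALSE)
books_json
```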

Conclusion: The HTML, XML, and JSON files contain the same information, but each format loads into R with a different structure. With the steps used here, the HTML table becomes a data frame with one column per field, the XML values are concatenated into a single column, and the JSON records become columns rather than rows. Further processing therefore needs format-specific handling to extract the same information from each file.

END