Here is a list of the packages required for this assignment:
library(RCurl)
library(XML)
library(jsonlite)
library(DT)
Each of the tabs represents a particular file type. In each case the data was read from the file and transformed into a data frame using R. All three files contain the same information; they differ only in the file type in which it is saved.
For the HTML file, the readHTMLTable function was used to parse the data. It returns a “list,” and this list contains a data frame, “ChemEngBooks,” as shown below:
# Download the raw HTML as a character string
books.html.url <- getURL("https://raw.githubusercontent.com/rg563/DATA607/master/Assignments/Assignment%205/books.html")
# Parse every table in the HTML into a list of data frames
books.html <- readHTMLTable(books.html.url,header=TRUE)
class(books.html)
[1] "list"
class(books.html$ChemEngBooks)
[1] "data.frame"
books.html$ChemEngBooks
Title
1 Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design
2 Chemical, Biochemical, and Engineering Thermodynamics
3 Applied Mathematics And Modeling For Chemical Engineers
Author Publisher Year Published Edition Price
1 Gavin Towler, Ray Sinnott Elsevier 2013 Second $94.86
2 Stanley I. Sandler Wiley 2006 Fourth $114.00
3 Richard G. Rice, Duong D. Do Wiley 2012 Second $66.50
ISBN-13
1 978-0080966595
2 978-0471661740
3 978-1118024720
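To see why readHTMLTable returns a list rather than a single data frame, the following minimal sketch parses a small inline HTML string instead of the remote file. The table contents and the names demo.html, demo.doc, demo.tables and demo.df are made up for illustration, and stringsAsFactors = FALSE is passed on the assumption that character columns are preferred over factors.

```r
library(XML)

# Hypothetical inline HTML with one small table (not the assignment file).
demo.html <- paste0(
  "<html><body><table id='Demo'>",
  "<tr><th>Title</th><th>Year</th></tr>",
  "<tr><td>Book A</td><td>2001</td></tr>",
  "<tr><td>Book B</td><td>2002</td></tr>",
  "</table></body></html>"
)

# Parse the string as HTML, then extract every <table> in the document.
demo.doc <- htmlParse(demo.html, asText = TRUE)
demo.tables <- readHTMLTable(demo.doc, header = TRUE, stringsAsFactors = FALSE)

# The result is a list with one entry per table found in the page.
class(demo.tables)

# Pull out the first (and here only) table by position.
demo.df <- demo.tables[[1]]
demo.df
```

Indexing by position with [[1]] avoids depending on how the list entries are named; the report itself indexes by the name ChemEngBooks.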
Next, we extract the ChemEngBooks element from “books.html,” since this element is the data frame.
books.html.df <- books.html$ChemEngBooks
books.html.df
Title
1 Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design
2 Chemical, Biochemical, and Engineering Thermodynamics
3 Applied Mathematics And Modeling For Chemical Engineers
Author Publisher Year Published Edition Price
1 Gavin Towler, Ray Sinnott Elsevier 2013 Second $94.86
2 Stanley I. Sandler Wiley 2006 Fourth $114.00
3 Richard G. Rice, Duong D. Do Wiley 2012 Second $66.50
ISBN-13
1 978-0080966595
2 978-0471661740
3 978-1118024720
Finally, we can display the data frame as a data table using the datatable function.
datatable(books.html.df)
For the XML file, we used xmlParse to read the data into a variable. This time the function did not return a “list” but an “XMLInternalDocument.” As you can see below, printing this object simply reproduces the entire XML file.
books.xml.url <- getURL("https://raw.githubusercontent.com/rg563/DATA607/master/Assignments/Assignment%205/books.xml")
books.xml <- xmlParse(books.xml.url)
class(books.xml)
[1] "XMLInternalDocument" "XMLAbstractDocument"
books.xml
<?xml version="1.0" encoding="UTF-8"?>
<ChemEngBooks>
<Book>
<Title>Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design</Title>
<Author>Gavin Towler, Ray Sinnott</Author>
<Publisher>Elsevier</Publisher>
<Year_Published>2013</Year_Published>
<Edition>Second</Edition>
<Price>$94.86</Price>
<ISBN13>978-0080966595</ISBN13>
</Book>
<Book>
<Title>Chemical, Biochemical, and Engineering Thermodynamics</Title>
<Author>Stanley I. Sandler</Author>
<Publisher>Wiley</Publisher>
<Year_Published>2006</Year_Published>
<Edition>Fourth</Edition>
<Price>$114.00</Price>
<ISBN13>978-0471661740</ISBN13>
</Book>
<Book>
<Title>Applied Mathematics And Modeling For Chemical Engineers</Title>
<Author>Richard G. Rice, Duong D. Do</Author>
<Publisher>Wiley</Publisher>
<Year_Published>2012</Year_Published>
<Edition>Second</Edition>
<Price>$66.50</Price>
<ISBN13>978-1118024720</ISBN13>
</Book>
</ChemEngBooks>
We can use the xmlRoot function to skip past the XML declaration and return the top-level node, which in this case is the ChemEngBooks element itself.
books.xml.root <- xmlRoot(books.xml)
books.xml.root
<ChemEngBooks>
<Book>
<Title>Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design</Title>
<Author>Gavin Towler, Ray Sinnott</Author>
<Publisher>Elsevier</Publisher>
<Year_Published>2013</Year_Published>
<Edition>Second</Edition>
<Price>$94.86</Price>
<ISBN13>978-0080966595</ISBN13>
</Book>
<Book>
<Title>Chemical, Biochemical, and Engineering Thermodynamics</Title>
<Author>Stanley I. Sandler</Author>
<Publisher>Wiley</Publisher>
<Year_Published>2006</Year_Published>
<Edition>Fourth</Edition>
<Price>$114.00</Price>
<ISBN13>978-0471661740</ISBN13>
</Book>
<Book>
<Title>Applied Mathematics And Modeling For Chemical Engineers</Title>
<Author>Richard G. Rice, Duong D. Do</Author>
<Publisher>Wiley</Publisher>
<Year_Published>2012</Year_Published>
<Edition>Second</Edition>
<Price>$66.50</Price>
<ISBN13>978-1118024720</ISBN13>
</Book>
</ChemEngBooks>
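As a rough sketch of what the root node gives access to, the snippet below builds a tiny XML document inline and inspects its root. The element names Shelf, Book and Title and the variables demo.doc and demo.root are illustrative assumptions, not part of the assignment files.

```r
library(XML)

# Hypothetical XML with the same shape as the books file: root > Book > fields.
demo.xml <- "<Shelf><Book><Title>Book A</Title></Book><Book><Title>Book B</Title></Book></Shelf>"

demo.doc <- xmlParse(demo.xml, asText = TRUE)
demo.root <- xmlRoot(demo.doc)

xmlName(demo.root)                      # name of the top-level element
xmlSize(demo.root)                      # number of child nodes under the root
xmlValue(demo.root[[1]][["Title"]])     # text inside the first Book's Title
```

The root node can be indexed like a list, which is what makes it possible to apply a function over its children.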
We can then use the xmlValue function to retrieve the text content of each node. Nesting two xmlSApply calls collects the values for every book into a matrix. The nesting is needed because ChemEngBooks is the top-level element, each Book node sits beneath it, and the values of interest sit within each Book node.
books.xml.matrix <- xmlSApply(books.xml.root, function(x) xmlSApply(x, xmlValue))
books.xml.matrix
Book
Title "Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design"
Author "Gavin Towler, Ray Sinnott"
Publisher "Elsevier"
Year_Published "2013"
Edition "Second"
Price "$94.86"
ISBN13 "978-0080966595"
Book
Title "Chemical, Biochemical, and Engineering Thermodynamics"
Author "Stanley I. Sandler"
Publisher "Wiley"
Year_Published "2006"
Edition "Fourth"
Price "$114.00"
ISBN13 "978-0471661740"
Book
Title "Applied Mathematics And Modeling For Chemical Engineers"
Author "Richard G. Rice, Duong D. Do"
Publisher "Wiley"
Year_Published "2012"
Edition "Second"
Price "$66.50"
ISBN13 "978-1118024720"
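The nested xmlSApply pattern can be tried on a small inline document, as in the sketch below. The element names and demo.* variables are hypothetical. The last step uses xmlToDataFrame from the same XML package as a possible shortcut for flat, record-like XML; it is shown as an alternative under that assumption, not as the approach taken in this report.

```r
library(XML)

# Hypothetical XML: each Book holds the same set of simple child fields.
demo.xml <- paste0(
  "<Shelf>",
  "<Book><Title>Book A</Title><Year>2001</Year></Book>",
  "<Book><Title>Book B</Title><Year>2002</Year></Book>",
  "</Shelf>"
)
demo.root <- xmlRoot(xmlParse(demo.xml, asText = TRUE))

# Outer call walks the Book nodes; inner call extracts each field's text.
# The result has one column per Book and one row per field.
demo.matrix <- xmlSApply(demo.root, function(book) xmlSApply(book, xmlValue))
dim(demo.matrix)

# Transpose so that each Book becomes a row, then convert to a data frame.
demo.df <- data.frame(t(demo.matrix), row.names = NULL, stringsAsFactors = FALSE)
demo.df

# Possible shortcut for flat XML like this (assumes every record has simple fields).
demo.df2 <- xmlToDataFrame(demo.root, stringsAsFactors = FALSE)
demo.df2
```

Both routes return every column as character, so numeric fields such as the year would still need converting afterwards.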
Finally, we transpose the matrix so that each book becomes a row, and then convert it into a data frame.
books.xml.matrix.t <- t(books.xml.matrix)
books.xml.df <- data.frame(books.xml.matrix.t, row.names = NULL)
books.xml.df
Title
1 Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design
2 Chemical, Biochemical, and Engineering Thermodynamics
3 Applied Mathematics And Modeling For Chemical Engineers
Author Publisher Year_Published Edition Price
1 Gavin Towler, Ray Sinnott Elsevier 2013 Second $94.86
2 Stanley I. Sandler Wiley 2006 Fourth $114.00
3 Richard G. Rice, Duong D. Do Wiley 2012 Second $66.50
ISBN13
1 978-0080966595
2 978-0471661740
3 978-1118024720
The final data table from the XML file is shown below:
datatable(books.xml.df)
The procedure for reading the JSON file is almost identical to the one for the HTML file; the only difference is that the fromJSON function is used to read the file. The rest of the procedure and the structure of the R code are the same.
books.json.url <- getURL("https://raw.githubusercontent.com/rg563/DATA607/master/Assignments/Assignment%205/books.json")
books.json <- fromJSON(books.json.url)
class(books.json)
[1] "list"
class(books.json$ChemEngBooks)
[1] "data.frame"
books.json$ChemEngBooks
Title
1 Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design
2 Chemical, Biochemical, and Engineering Thermodynamics
3 Applied Mathematics And Modeling For Chemical Engineers
Author Publisher Year Published Edition Price
1 Gavin Towler, Ray Sinnott Elsevier 2013 Second $94.86
2 Stanley I. Sandler Wiley 2006 Fourth $114.00
3 Richard G. Rice, Duong D. Do Wiley 2012 Second $66.50
ISBN-13
1 978-0080966595
2 978-0471661740
3 978-1118024720
books.json.df <- books.json$ChemEngBooks
books.json.df
Title
1 Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design
2 Chemical, Biochemical, and Engineering Thermodynamics
3 Applied Mathematics And Modeling For Chemical Engineers
Author Publisher Year Published Edition Price
1 Gavin Towler, Ray Sinnott Elsevier 2013 Second $94.86
2 Stanley I. Sandler Wiley 2006 Fourth $114.00
3 Richard G. Rice, Duong D. Do Wiley 2012 Second $66.50
ISBN-13
1 978-0080966595
2 978-0471661740
3 978-1118024720
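The reason fromJSON yields a list holding a data frame can be seen with a small inline JSON string, sketched below. The key Shelf, the field names and the demo.* variables are invented for illustration. The sketch relies on jsonlite's default behaviour of simplifying an array of objects into a data frame.

```r
library(jsonlite)

# Hypothetical JSON: one top-level key holding an array of book records.
demo.json <- '{"Shelf": [{"Title": "Book A", "Year": 2001},
                         {"Title": "Book B", "Year": 2002}]}'

demo.list <- fromJSON(demo.json)

class(demo.list)          # the outer JSON object becomes a list
class(demo.list$Shelf)    # the array of records becomes a data frame

# Extract the data frame, as done for the HTML file.
demo.df <- demo.list$Shelf
str(demo.df)
```

Unlike the HTML and XML routes, JSON numbers arrive as numeric columns, whereas values stored as JSON strings stay character.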
datatable(books.json.df)
The major findings and the differences between the file types are summarized below: