Loading Libraries

Here is a list of the packages required for this assignment:

  • RCurl - Fetch the contents of a URL
  • XML - Parse HTML and XML files
  • jsonlite - Parse JSON files
  • DT - Create interactive data tables
library(RCurl)
library(XML)
library(jsonlite)
library(DT)
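
Before loading the libraries, it can help to confirm that all four packages are installed. The following is a minimal sketch using only base R; the variable names are illustrative and the install step is left commented out on purpose.

```r
# Packages this report depends on
required <- c("RCurl", "XML", "jsonlite", "DT")

# requireNamespace() returns TRUE/FALSE without attaching the package
is.installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that could not be found
missing.pkgs <- required[!is.installed]

if (length(missing.pkgs) > 0) {
  message("Missing packages: ", paste(missing.pkgs, collapse = ", "))
  # install.packages(missing.pkgs)  # uncomment to install them
}
```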

HTML, XML and JSON Web Scraping

Each of the tabs covers one file type and shows how its data was scraped and transformed into a data frame using R. All three files contain the same information; only the file format differs.

HTML

For the HTML file, the readHTMLTable function was used to parse the data. It returns a “list,” and this list contains a data frame, “ChemEngBooks,” as shown below:

books.html.url <- getURL("https://raw.githubusercontent.com/rg563/DATA607/master/Assignments/Assignment%205/books.html")
books.html <- readHTMLTable(books.html.url,header=TRUE)
class(books.html)
[1] "list"
class(books.html$ChemEngBooks)
[1] "data.frame"
books.html
$ChemEngBooks
                                                                                        Title
1 Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design
2                                       Chemical, Biochemical, and Engineering Thermodynamics
3                                     Applied Mathematics And Modeling For Chemical Engineers
                        Author Publisher Year Published Edition   Price
1    Gavin Towler, Ray Sinnott  Elsevier           2013  Second  $94.86
2           Stanley I. Sandler     Wiley           2006  Fourth $114.00
3 Richard G. Rice, Duong D. Do     Wiley           2012  Second  $66.50
         ISBN-13
1 978-0080966595
2 978-0471661740
3 978-1118024720
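
The same behavior can be reproduced without a network connection by parsing a small HTML string. This is only a sketch: the table, its id "DemoBooks", and the variable names are made up for illustration, and stringsAsFactors = FALSE is added so that the columns come back as character vectors. As far as I can tell, the list element is named after the table's id attribute, but treat that as an assumption and check names() on your own data.

```r
library(XML)

# Hypothetical two-row table; the id attribute plays the role of "ChemEngBooks"
demo.html <- paste0(
  "<html><body><table id='DemoBooks'>",
  "<tr><th>Title</th><th>Price</th></tr>",
  "<tr><td>Book A</td><td>$10.00</td></tr>",
  "<tr><td>Book B</td><td>$20.00</td></tr>",
  "</table></body></html>"
)

# asText = TRUE says the string is the document itself, not a file path or URL
demo.doc <- htmlParse(demo.html, asText = TRUE)

# readHTMLTable returns one list element per <table> in the document
demo.tables <- readHTMLTable(demo.doc, header = TRUE, stringsAsFactors = FALSE)

class(demo.tables)   # the container is a list
names(demo.tables)   # inspect how the element was named

# Selecting by position avoids relying on the element name
demo.df <- demo.tables[[1]]
demo.df
```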

Next, we extract the ChemEngBooks element from “books.html,” since that element is the data frame.

books.html.df <- books.html$ChemEngBooks
books.html.df
                                                                                        Title
1 Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design
2                                       Chemical, Biochemical, and Engineering Thermodynamics
3                                     Applied Mathematics And Modeling For Chemical Engineers
                        Author Publisher Year Published Edition   Price
1    Gavin Towler, Ray Sinnott  Elsevier           2013  Second  $94.86
2           Stanley I. Sandler     Wiley           2006  Fourth $114.00
3 Richard G. Rice, Duong D. Do     Wiley           2012  Second  $66.50
         ISBN-13
1 978-0080966595
2 978-0471661740
3 978-1118024720

Finally, we can display the data frame as a data table using the datatable function.

datatable(books.html.df)

XML

For the XML file, we used xmlParse to read the data into a variable. This time the function did not return a “list”; it returned an “XMLInternalDocument.” As you can see below, printing the object simply reproduces the entire XML file.

books.xml.url <- getURL("https://raw.githubusercontent.com/rg563/DATA607/master/Assignments/Assignment%205/books.xml")
books.xml <- xmlParse(books.xml.url)
class(books.xml)
[1] "XMLInternalDocument" "XMLAbstractDocument"
books.xml
<?xml version="1.0" encoding="UTF-8"?>
<ChemEngBooks>
  <Book>
    <Title>Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design</Title>
    <Author>Gavin Towler, Ray Sinnott</Author>
    <Publisher>Elsevier</Publisher>
    <Year_Published>2013</Year_Published>
    <Edition>Second</Edition>
    <Price>$94.86</Price>
    <ISBN13>978-0080966595</ISBN13>
  </Book>
  <Book>
    <Title>Chemical, Biochemical, and Engineering Thermodynamics</Title>
    <Author>Stanley I. Sandler</Author>
    <Publisher>Wiley</Publisher>
    <Year_Published>2006</Year_Published>
    <Edition>Fourth</Edition>
    <Price>$114.00</Price>
    <ISBN13>978-0471661740</ISBN13>
  </Book>
  <Book>
    <Title>Applied Mathematics And Modeling For Chemical Engineers</Title>
    <Author>Richard G. Rice, Duong D. Do</Author>
    <Publisher>Wiley</Publisher>
    <Year_Published>2012</Year_Published>
    <Edition>Second</Edition>
    <Price>$66.50</Price>
    <ISBN13>978-1118024720</ISBN13>
  </Book>
</ChemEngBooks>
 

We can use the xmlRoot function to return the top-level XMLNode, which in this case is the ChemEngBooks element itself. Printing the root node no longer shows the XML declaration.

books.xml.root <- xmlRoot(books.xml)
books.xml.root
<ChemEngBooks>
  <Book>
    <Title>Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design</Title>
    <Author>Gavin Towler, Ray Sinnott</Author>
    <Publisher>Elsevier</Publisher>
    <Year_Published>2013</Year_Published>
    <Edition>Second</Edition>
    <Price>$94.86</Price>
    <ISBN13>978-0080966595</ISBN13>
  </Book>
  <Book>
    <Title>Chemical, Biochemical, and Engineering Thermodynamics</Title>
    <Author>Stanley I. Sandler</Author>
    <Publisher>Wiley</Publisher>
    <Year_Published>2006</Year_Published>
    <Edition>Fourth</Edition>
    <Price>$114.00</Price>
    <ISBN13>978-0471661740</ISBN13>
  </Book>
  <Book>
    <Title>Applied Mathematics And Modeling For Chemical Engineers</Title>
    <Author>Richard G. Rice, Duong D. Do</Author>
    <Publisher>Wiley</Publisher>
    <Year_Published>2012</Year_Published>
    <Edition>Second</Edition>
    <Price>$66.50</Price>
    <ISBN13>978-1118024720</ISBN13>
  </Book>
</ChemEngBooks> 
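
Rather than printing the whole tree, the root node can also be inspected piece by piece. The sketch below uses a small made-up XML string in place of books.xml; the element names and values are illustrative only.

```r
library(XML)

# Hypothetical miniature version of the books file
demo.xml <- paste0(
  "<ChemEngBooks>",
  "<Book><Title>Book A</Title><Price>$10.00</Price></Book>",
  "<Book><Title>Book B</Title><Price>$20.00</Price></Book>",
  "</ChemEngBooks>"
)

demo.doc <- xmlParse(demo.xml, asText = TRUE)
demo.root <- xmlRoot(demo.doc)

root.name <- xmlName(demo.root)            # name of the top-level element
n.books <- xmlSize(demo.root)              # number of child nodes under the root
child.names <- names(xmlChildren(demo.root))

# Drill down: first Book node, then its Title child
first.title <- xmlValue(demo.root[[1]][["Title"]])

root.name
n.books
child.names
first.title
```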

We can then use the xmlValue function to retrieve the text content of each node. With a nested xmlSApply call, we can collect the information from every book into a matrix. The nesting is needed because ChemEngBooks is the top-level element, each Book node sits below it, and the values we want are the children of each Book node.

books.xml.matrix <- xmlSApply(books.xml.root, function(x) xmlSApply(x, xmlValue))
books.xml.matrix
               Book                                                                                         
Title          "Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design"
Author         "Gavin Towler, Ray Sinnott"                                                                  
Publisher      "Elsevier"                                                                                   
Year_Published "2013"                                                                                       
Edition        "Second"                                                                                     
Price          "$94.86"                                                                                     
ISBN13         "978-0080966595"                                                                             
               Book                                                   
Title          "Chemical, Biochemical, and Engineering Thermodynamics"
Author         "Stanley I. Sandler"                                   
Publisher      "Wiley"                                                
Year_Published "2006"                                                 
Edition        "Fourth"                                               
Price          "$114.00"                                              
ISBN13         "978-0471661740"                                       
               Book                                                     
Title          "Applied Mathematics And Modeling For Chemical Engineers"
Author         "Richard G. Rice, Duong D. Do"                           
Publisher      "Wiley"                                                  
Year_Published "2012"                                                   
Edition        "Second"                                                 
Price          "$66.50"                                                 
ISBN13         "978-1118024720"                                         
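
To see why the nesting produces a matrix with one column per book, it helps to run the inner and outer steps separately. This is a sketch on a made-up two-book document, not the assignment file; all names are illustrative.

```r
library(XML)

# Hypothetical miniature document with two Book nodes
demo.xml <- paste0(
  "<Books>",
  "<Book><Title>Book A</Title><Price>$10.00</Price></Book>",
  "<Book><Title>Book B</Title><Price>$20.00</Price></Book>",
  "</Books>"
)
demo.root <- xmlRoot(xmlParse(demo.xml, asText = TRUE))

# Inner step: one Book node becomes a named character vector of its fields
one.book <- xmlSApply(demo.root[[1]], xmlValue)
one.book

# Outer step: repeat for every Book; the equal-length vectors are bound as columns
demo.matrix <- xmlSApply(demo.root, function(x) xmlSApply(x, xmlValue))

dim(demo.matrix)       # fields in rows, books in columns
rownames(demo.matrix)  # field names taken from the child element names
```

Because each book ends up as a column, the matrix has to be transposed before it matches the one-row-per-book layout of a data frame.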

Finally, we need to transpose the matrix so that each book becomes a row, and then turn it into a data frame.

books.xml.matrix.t <- t(books.xml.matrix)
books.xml.df <- data.frame(books.xml.matrix.t, row.names = NULL)
books.xml.df
                                                                                        Title
1 Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design
2                                       Chemical, Biochemical, and Engineering Thermodynamics
3                                     Applied Mathematics And Modeling For Chemical Engineers
                        Author Publisher Year_Published Edition   Price
1    Gavin Towler, Ray Sinnott  Elsevier           2013  Second  $94.86
2           Stanley I. Sandler     Wiley           2006  Fourth $114.00
3 Richard G. Rice, Duong D. Do     Wiley           2012  Second  $66.50
          ISBN13
1 978-0080966595
2 978-0471661740
3 978-1118024720
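
As an alternative to the nested xmlSApply and transpose steps, the XML package also provides xmlToDataFrame, which handles simple, flat record structures like this one in a single call. This is not the approach used in this report; the sketch below only illustrates it on a made-up document, under the assumption that every record has the same child elements.

```r
library(XML)

# Hypothetical miniature document; each Book is one record
demo.xml <- paste0(
  "<Books>",
  "<Book><Title>Book A</Title><Price>$10.00</Price></Book>",
  "<Book><Title>Book B</Title><Price>$20.00</Price></Book>",
  "</Books>"
)
demo.doc <- xmlParse(demo.xml, asText = TRUE)

# Each child of the root becomes a row; its children become the columns
demo.df <- xmlToDataFrame(demo.doc, stringsAsFactors = FALSE)
demo.df
```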

The final data table from the XML file is shown below:

datatable(books.xml.df)

JSON

The procedure for scraping a JSON file is almost identical to that for the HTML file; the only difference is that the fromJSON function is used to read the file. The rest of the R code has the same structure as in the HTML case.

books.json.url <- getURL("https://raw.githubusercontent.com/rg563/DATA607/master/Assignments/Assignment%205/books.json")
books.json <- fromJSON(books.json.url)
class(books.json)
[1] "list"
class(books.json$ChemEngBooks)
[1] "data.frame"
books.json
$ChemEngBooks
                                                                                        Title
1 Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design
2                                       Chemical, Biochemical, and Engineering Thermodynamics
3                                     Applied Mathematics And Modeling For Chemical Engineers
                        Author Publisher Year Published Edition   Price
1    Gavin Towler, Ray Sinnott  Elsevier           2013  Second  $94.86
2           Stanley I. Sandler     Wiley           2006  Fourth $114.00
3 Richard G. Rice, Duong D. Do     Wiley           2012  Second  $66.50
         ISBN-13
1 978-0080966595
2 978-0471661740
3 978-1118024720
books.json.df <- books.json$ChemEngBooks
books.json.df
                                                                                        Title
1 Chemical Engineering Design: Principles, Practice and Economics of Plant and Process Design
2                                       Chemical, Biochemical, and Engineering Thermodynamics
3                                     Applied Mathematics And Modeling For Chemical Engineers
                        Author Publisher Year Published Edition   Price
1    Gavin Towler, Ray Sinnott  Elsevier           2013  Second  $94.86
2           Stanley I. Sandler     Wiley           2006  Fourth $114.00
3 Richard G. Rice, Duong D. Do     Wiley           2012  Second  $66.50
         ISBN-13
1 978-0080966595
2 978-0471661740
3 978-1118024720
datatable(books.json.df)
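
The reason fromJSON returns a list holding a data frame is that the file is a JSON object whose single key maps to an array of records, and jsonlite simplifies such an array into a data frame by default. A minimal sketch with a made-up JSON string (the key and values are illustrative, not the contents of books.json):

```r
library(jsonlite)

# Hypothetical JSON: one top-level key holding an array of book records
demo.json <- paste0(
  '{"DemoBooks":[',
  '{"Title":"Book A","Price":"$10.00"},',
  '{"Title":"Book B","Price":"$20.00"}',
  ']}'
)

demo.parsed <- fromJSON(demo.json)

class(demo.parsed)             # the outer JSON object becomes a list
class(demo.parsed$DemoBooks)   # the array of records becomes a data frame

demo.df <- demo.parsed$DemoBooks
demo.df
```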

Conclusions

The major findings and differences between each file type are summarized below:

  • Despite their very different file structures, the HTML and JSON files were read into R with the same code, apart from the reading function itself (readHTMLTable versus fromJSON).
  • Scraping the XML file was more involved and somewhat more challenging than the other file types. However, once the structure of the XML file was understood, the process was relatively straightforward.
  • After all of the scraping was complete, all three files produced data tables with identical contents. Only two column names differ in the XML version (Year_Published and ISBN13 rather than Year Published and ISBN-13).
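
One way to check the last point in code is to compare the cell values of two data frames while ignoring their row and column names. The helper same.content below is a hypothetical function written for this sketch, and the two data frames are small stand-ins rather than the scraped tables.

```r
# Hypothetical helper: TRUE when two data frames hold the same cell values
# in the same positions, regardless of row or column names
same.content <- function(a, b) {
  identical(unname(as.matrix(a)), unname(as.matrix(b)))
}

# Stand-in data frames that differ only in one column name
demo.html.df <- data.frame(
  Title = c("Book A", "Book B"),
  `Year Published` = c("2013", "2006"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
demo.xml.df <- data.frame(
  Title = c("Book A", "Book B"),
  Year_Published = c("2013", "2006"),
  stringsAsFactors = FALSE
)

contents.match <- same.content(demo.html.df, demo.xml.df)
names.match <- identical(names(demo.html.df), names(demo.xml.df))

contents.match
names.match
```

In the report itself, the same helper could be applied to books.html.df, books.xml.df, and books.json.df, assuming all three store their columns as the same type.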