DA607_homework7_web_scraping

Yun Mai
March 18, 2017

Working with XML and JSON in R

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting. Take the information that you've selected about these three books, and separately create three files which store the book's information in HTML (using an html table), XML, and JSON formats (e.g. "books.html", "books.xml", and "books.json").

To help you better understand the different file structures, I'd prefer that you create each of these files "by hand" unless you're already very comfortable with the file formats. Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical? Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].

Load packages

library(RCurl)
## Loading required package: bitops
library(XML)
library(jsonlite)
library(knitr)
library(plyr)

HTML format

A table with information on three books was written by hand in HTML and hosted on GitHub. The code fetches the file from its URL and parses it in R. The parsed HTML is printed after the code.

html_url <- "https://raw.githubusercontent.com/YunMai-SPS/DA607-homework/master/DA607week7/week7hw_book_info_as_html.html"
fetch_html <- getURL(html_url)
parsed.book.html <- htmlParse(fetch_html)
print(parsed.book.html)
## <!DOCTYPE html>
## <html><body>
## <p>&gt;
##  
##    </p>
##     <title>Three Immunology Books</title>
## <table>
## <tr>
## <th>book id</th> <th>name</th> <th>authors</th> <th>eidtion</th> <th>pulisher</th> <th>language</th> <th>year published</th> <th>ISBN-13</th> <th>paperback</th> <th>Amazon Best Sellers Rank</th> </tr>
## <tr>
## <th>1</th> <th>KUBY Immunology</th> <th>Richard A. Goldsby, Thomas J. Kindt and Barbara A. Osborne</th> <th>7th</th> <th>W. H. Freeman</th> <th>English</th> <th>2013</th> <th>978-1464119910</th> <th>670 pages</th> <th>#8</th> </tr>
## <tr>
## <th>2</th> <th>Cellular and Molecular Immunology</th> <th>Abul K. Abbas MBBS, Andrew H. H. Lichtman MD PhD, Shiv Pillai MBBS PhD</th> <th>8th</th> <th>Saunders</th> <th>English</th> <th>2014</th> <th>978-0323222754</th> <th>544 pages</th> <th>#59</th> </tr>
## <tr>
## <th>3</th> <th>How the Immune System Works</th> <th>Lauren M. Sompayrac</th> <th>5th</th> <th>Wiley-Blackwell</th> <th>English</th> <th>2015</th> <th>978-1118997772</th> <th>160 pages</th> <th>#17</th> </tr>
## </table>
## </body></html>
## 

Next, read the table from the fetched HTML. The readHTMLTable() function returns a list with one element for each table it finds in the document.

book_html <- readHTMLTable(fetch_html)
class(book_html)
## [1] "list"

View the structure of the list. As the output shows, it contains a single element, which is a data frame.

str(book_html)
## List of 1
##  $ NULL:'data.frame':    3 obs. of  10 variables:
##   ..$ book id                 : Factor w/ 3 levels "1","2","3": 1 2 3
##   ..$ name                    : Factor w/ 3 levels "Cellular and Molecular Immunology",..: 3 1 2
##   ..$ authors                 : Factor w/ 3 levels "Abul K. Abbas MBBS, Andrew H. H. Lichtman MD PhD, Shiv Pillai MBBS PhD",..: 3 1 2
##   ..$ eidtion                 : Factor w/ 3 levels "5th","7th","8th": 2 3 1
##   ..$ pulisher                : Factor w/ 3 levels "Saunders","W. H. Freeman",..: 2 1 3
##   ..$ language                : Factor w/ 1 level "English": 1 1 1
##   ..$ year published          : Factor w/ 3 levels "2013","2014",..: 1 2 3
##   ..$ ISBN-13                 : Factor w/ 3 levels "978-0323222754",..: 3 1 2
##   ..$ paperback               : Factor w/ 3 levels "160 pages","544 pages",..: 3 2 1
##   ..$ Amazon Best Sellers Rank: Factor w/ 3 levels "#17","#59","#8": 3 2 1
kable(book_html)
| book id | name | authors | eidtion | pulisher | language | year published | ISBN-13 | paperback | Amazon Best Sellers Rank | |:--------|:----------------------------------|:-----------------------------------------------------------------------|:--------|:----------------|:---------|:---------------|:---------------|:----------|:-------------------------| | 1 | KUBY Immunology | Richard A. Goldsby, Thomas J. Kindt and Barbara A. Osborne | 7th | W. H. Freeman | English | 2013 | 978-1464119910 | 670 pages | \#8 | | 2 | Cellular and Molecular Immunology | Abul K. Abbas MBBS, Andrew H. H. Lichtman MD PhD, Shiv Pillai MBBS PhD | 8th | Saunders | English | 2014 | 978-0323222754 | 544 pages | \#59 | | 3 | How the Immune System Works | Lauren M. Sompayrac | 5th | Wiley-Blackwell | English | 2015 | 978-1118997772 | 160 pages | \#17 |

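Passing the whole list to kable() produces the unrendered output above. Passing the data frame itself, book_html[[1]], renders a proper table, shown after the code.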

kable(book_html[[1]])
| book id | name | authors | eidtion | pulisher | language | year published | ISBN-13 | paperback | Amazon Best Sellers Rank |
|:--------|:-----|:--------|:--------|:---------|:---------|:---------------|:--------|:----------|:-------------------------|
| 1 | KUBY Immunology | Richard A. Goldsby, Thomas J. Kindt and Barbara A. Osborne | 7th | W. H. Freeman | English | 2013 | 978-1464119910 | 670 pages | #8 |
| 2 | Cellular and Molecular Immunology | Abul K. Abbas MBBS, Andrew H. H. Lichtman MD PhD, Shiv Pillai MBBS PhD | 8th | Saunders | English | 2014 | 978-0323222754 | 544 pages | #59 |
| 3 | How the Immune System Works | Lauren M. Sompayrac | 5th | Wiley-Blackwell | English | 2015 | 978-1118997772 | 160 pages | #17 |
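
The list wrapper and the factor columns seen in this section can both be avoided at read time. The following is a minimal sketch rather than the code used for this assignment: it assumes a small inline table in place of the hosted file, uses `<td>` for the data cells (the hand-written file uses `<th>` throughout), and the names `sample_html`, `parsed_sample`, and `book_df` are made up for illustration. The `which` and `stringsAsFactors` arguments belong to `readHTMLTable()`.

```r
library(XML)

# Hypothetical inline table standing in for the hosted HTML file
sample_html <- paste0(
  "<html><body><table>",
  "<tr><th>book id</th><th>name</th></tr>",
  "<tr><td>1</td><td>KUBY Immunology</td></tr>",
  "<tr><td>3</td><td>How the Immune System Works</td></tr>",
  "</table></body></html>"
)

parsed_sample <- htmlParse(sample_html, asText = TRUE)

# which = 1 asks for the first table itself rather than a list of tables;
# stringsAsFactors = FALSE keeps text columns as character vectors
book_df <- readHTMLTable(parsed_sample, which = 1, stringsAsFactors = FALSE)

class(book_df)
sapply(book_df, class)
```

With the data frame returned directly, `kable(book_df)` can be called without the `[[1]]` step.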

XML format

An XML file containing the book information was created by hand. The parsed file is shown after the code.

xml_url <- "https://raw.githubusercontent.com/YunMai-SPS/DA607-homework/master/DA607week7/week7hw_book_info_as_xml.xml"
fetch_xml <- getURL(xml_url)
parsed.book.xml <- xmlParse(fetch_xml)
parsed.book.xml
## <?xml version="1.0" encoding="ISO-8859-1"?>
## <immunology_books>
##   <book>
##     <bookid>1</bookid>
##     <name>KUBY Immunology</name>
##     <authors> Richard A. Goldsby, Thomas J. Kindt, Barbara A. Osborne</authors>
##     <eidtion>7th</eidtion>
##     <pulisher>W. H. Freeman</pulisher>
##     <Language>English</Language>
##     <yearpublished>2013</yearpublished>
##     <ISBN-13>978-1464119910</ISBN-13>
##     <paperback>670 pages</paperback>
##     <AmazonBestSellersRank>#8</AmazonBestSellersRank>
##   </book>
##   <book>
##     <bookid>2</bookid>
##     <name>Cellular and Molecular Immunology</name>
##     <authors> Abul K. Abbas MBBS, Andrew H. H. Lichtman MD PhD, Shiv Pillai MBBS PhD</authors>
##     <eidtion>8th</eidtion>
##     <pulisher>Saunders</pulisher>
##     <Language>English</Language>
##     <yearpublished>2014</yearpublished>
##     <ISBN-13>978-0323222754</ISBN-13>
##     <paperback>544 pages</paperback>
##     <AmazonBestSellersRank>#59</AmazonBestSellersRank>
##   </book>
##   <book>
##     <bookid>3</bookid>
##     <name>How the Immune System Works</name>
##     <authors> Lauren M. Sompayrac</authors>
##     <eidtion>5th</eidtion>
##     <pulisher>Wiley-Blackwell</pulisher>
##     <Language>English</Language>
##     <yearpublished>2015</yearpublished>
##     <ISBN-13>978-1118997772</ISBN-13>
##     <paperback>160 pages</paperback>
##     <AmazonBestSellersRank>#17</AmazonBestSellersRank>
##   </book>
## </immunology_books>
## 

Next, the top-level node of the XML document is extracted with xmlRoot() and converted with xmlToDataFrame(). This function maps each child node of the root (here each book) to a row and each of that node's child elements to a column, so it returns a data frame directly.

root <- xmlRoot(parsed.book.xml)
book_xml <- xmlToDataFrame(root)
class(book_xml)
## [1] "data.frame"

View the data frame.

kable(book_xml)
| bookid | name | authors | eidtion | pulisher | Language | yearpublished | ISBN-13 | paperback | AmazonBestSellersRank |
|:-------|:-----|:--------|:--------|:---------|:---------|:--------------|:--------|:----------|:----------------------|
| 1 | KUBY Immunology | Richard A. Goldsby, Thomas J. Kindt, Barbara A. Osborne | 7th | W. H. Freeman | English | 2013 | 978-1464119910 | 670 pages | #8 |
| 2 | Cellular and Molecular Immunology | Abul K. Abbas MBBS, Andrew H. H. Lichtman MD PhD, Shiv Pillai MBBS PhD | 8th | Saunders | English | 2014 | 978-0323222754 | 544 pages | #59 |
| 3 | How the Immune System Works | Lauren M. Sompayrac | 5th | Wiley-Blackwell | English | 2015 | 978-1118997772 | 160 pages | #17 |
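
The printed XML shows a space right after each opening `<authors>` tag, and that whitespace may carry over into the data frame even though the rendered table hides it. The sketch below shows one way to strip it, assuming a shortened two-book XML string in place of the hosted file; `sample_xml`, `parsed_sample`, and `xml_df` are illustrative names, and the parsed document is passed to `xmlToDataFrame()` here instead of the root node.

```r
library(XML)

# Hypothetical two-book XML string modelled on the file above
sample_xml <- paste0(
  "<immunology_books>",
  "<book><bookid>1</bookid>",
  "<authors> Richard A. Goldsby, Thomas J. Kindt, Barbara A. Osborne</authors></book>",
  "<book><bookid>3</bookid>",
  "<authors> Lauren M. Sompayrac</authors></book>",
  "</immunology_books>"
)

parsed_sample <- xmlParse(sample_xml, asText = TRUE)
xml_df <- xmlToDataFrame(parsed_sample)

# Convert every column to character and strip leading/trailing whitespace
xml_df[] <- lapply(xml_df, function(col) trimws(as.character(col)))

xml_df$authors
```

Converting with `as.character()` first means the cleanup works whether the columns arrive as factors or as character vectors.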

JSON format

A JSON file containing the book information was created by hand. The raw text is shown after the code.

json_url <- "https://raw.githubusercontent.com/YunMai-SPS/DA607-homework/master/DA607week7/week7hw_book_info_as_json.json"
fetch_json <- getURL(json_url)
fetch_json
## [1] "{\"Immunology books\" :[\n    {\n    \"book id\": 1,\n    \"name\": \"KUBY Immunology\",\n    \"authors\": \"Richard A. Goldsby, Thomas J. Kindt, Barbara A. Osborne\",\n    \"eidtion\": \"7th\",\n    \"pulisher\": \"W. H. Freeman\",\n    \"language\": \"English\",\n    \"year_published\": 2013,\n    \"ISBN-13\": \"978-1464119910\",\n    \"paperback\": \"670 pages\",\n    \"Amazon Best Sellers Rank\": \"#8\"\n    },\n    {\n    \"book id\": 2,\n    \"name\": \"Cellular and Molecular Immunology\",\n    \"authors\": \"Abul K. Abbas MBBS, Andrew H. H. Lichtman MD PhD, Shiv Pillai MBBS PhD\",\n    \"eidtion\": \"8th\",\n    \"pulisher\": \"Saunders\",\n    \"language\": \"English\",\n    \"year_published\": 2014,\n    \"ISBN-13\": \"978-0323222754\",\n    \"paperback\": \"544 pages\",\n    \"Amazon Best Sellers Rank\": \"#59\"\n    },\n    {\n    \"book id\": 3,\n    \"name\": \"How the Immune System Works\",\n    \"authors\": \"Lauren M. Sompayrac\",\n    \"eidtion\": \"5th\",\n    \"pulisher\": \"Wiley-Blackwell\",\n    \"language\": \"English\",\n    \"year_published\": 2015,\n    \"ISBN-13\": \"978-1118997772\",\n    \"paperback\": \"160 pages\",\n    \"Amazon Best Sellers Rank\": \"#17\"\n    }]\n}\n\n"

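Parse the JSON text with the fromJSON() function. jsonlite simplifies a JSON array of objects into a data frame. Here, however, the array is the value of the top-level key "Immunology books", so fromJSON() returns a list whose single element is that data frame.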

parsed.book.json <- fromJSON(fetch_json)
class(parsed.book.json)
## [1] "list"
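
The difference between the two JSON shapes can be seen in a short sketch. It assumes two shortened inline JSON strings in place of the hosted file, and the names `array_json`, `wrapped_json`, `wrapped`, and `book_json` are made up for illustration.

```r
library(jsonlite)

# Top-level array of objects: fromJSON() simplifies it to a data frame
array_json <- '[{"book id": 1, "name": "KUBY Immunology"},
                {"book id": 3, "name": "How the Immune System Works"}]'
class(fromJSON(array_json))

# Top-level object wrapping the same array: the result is a named list
# whose single element holds the data frame
wrapped_json <- paste0('{"Immunology books": ', array_json, '}')
wrapped <- fromJSON(wrapped_json)
class(wrapped)

# Pull the data frame out by its key
book_json <- wrapped[["Immunology books"]]
class(book_json)
```

Extracting by key name, as in the last step, is equivalent to the positional `[[1]]` when the list has a single element.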

View the structure of the list: it contains one element, which is a data frame.

str(parsed.book.json)
## List of 1
##  $ Immunology books:'data.frame':    3 obs. of  10 variables:
##   ..$ book id                 : int [1:3] 1 2 3
##   ..$ name                    : chr [1:3] "KUBY Immunology" "Cellular and Molecular Immunology" "How the Immune System Works"
##   ..$ authors                 : chr [1:3] "Richard A. Goldsby, Thomas J. Kindt, Barbara A. Osborne" "Abul K. Abbas MBBS, Andrew H. H. Lichtman MD PhD, Shiv Pillai MBBS PhD" "Lauren M. Sompayrac"
##   ..$ eidtion                 : chr [1:3] "7th" "8th" "5th"
##   ..$ pulisher                : chr [1:3] "W. H. Freeman" "Saunders" "Wiley-Blackwell"
##   ..$ language                : chr [1:3] "English" "English" "English"
##   ..$ year_published          : int [1:3] 2013 2014 2015
##   ..$ ISBN-13                 : chr [1:3] "978-1464119910" "978-0323222754" "978-1118997772"
##   ..$ paperback               : chr [1:3] "670 pages" "544 pages" "160 pages"
##   ..$ Amazon Best Sellers Rank: chr [1:3] "#8" "#59" "#17"
kable(parsed.book.json[[1]])
| book id | name | authors | eidtion | pulisher | language | year_published | ISBN-13 | paperback | Amazon Best Sellers Rank |
|:--------|:-----|:--------|:--------|:---------|:---------|:---------------|:--------|:----------|:-------------------------|
| 1 | KUBY Immunology | Richard A. Goldsby, Thomas J. Kindt, Barbara A. Osborne | 7th | W. H. Freeman | English | 2013 | 978-1464119910 | 670 pages | #8 |
| 2 | Cellular and Molecular Immunology | Abul K. Abbas MBBS, Andrew H. H. Lichtman MD PhD, Shiv Pillai MBBS PhD | 8th | Saunders | English | 2014 | 978-0323222754 | 544 pages | #59 |
| 3 | How the Immune System Works | Lauren M. Sompayrac | 5th | Wiley-Blackwell | English | 2015 | 978-1118997772 | 160 pages | #17 |

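Conclusion: readHTMLTable() and fromJSON() return the HTML and JSON tables as list objects, while xmlToDataFrame() returns the XML table as a data.frame object. In both list cases the table is mapped to a data frame with well-defined variables and observations, and that data frame is stored as the single element of the list. The three data frames are not identical: the str() output shows factor columns for the HTML table but integer and character columns for the JSON table, and some column names differ between the files (for example "year published", "yearpublished", and "year_published"). The kable() function from the knitr package renders a clean table from a data.frame object but not from a list, as the HTML section shows, so the data frame must first be extracted with [[1]].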
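
One way to check whether the results agree once names and types are aligned is sketched below. It is an illustration under stated assumptions, not part of the original solution: the three small data frames are hand-built stand-ins that mimic the column types and names seen above, and `normalise()` and `shared_names` are hypothetical helpers. In the real data, the authors value for the first book would still differ after this step, because the HTML file uses "and" where the XML and JSON files use a comma.

```r
# Hypothetical miniature versions of the three results
html_df <- data.frame(
  `book id` = factor(c("1", "3")),
  name = factor(c("KUBY Immunology", "How the Immune System Works")),
  check.names = FALSE
)
xml_df <- data.frame(
  bookid = c("1", "3"),
  name = c("KUBY Immunology", "How the Immune System Works"),
  stringsAsFactors = FALSE
)
json_df <- data.frame(
  `book id` = c(1L, 3L),
  name = c("KUBY Immunology", "How the Immune System Works"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Raw comparison: column types (and, for XML, names) differ
identical(html_df, json_df)
identical(xml_df, json_df)

# Normalise: every column as trimmed character, shared column names
normalise <- function(df, col_names) {
  df[] <- lapply(df, function(col) trimws(as.character(col)))
  names(df) <- col_names
  df
}

shared_names <- c("book_id", "name")
html_norm <- normalise(html_df, shared_names)
xml_norm  <- normalise(xml_df, shared_names)
json_norm <- normalise(json_df, shared_names)

# Compare again after normalisation
identical(html_norm, xml_norm)
identical(html_norm, json_norm)
```

Converting everything to character is a deliberately blunt choice: it sacrifices the integer typing that jsonlite inferred in exchange for a like-for-like comparison of the stored values.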