Goal

Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.

Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.

Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?

Packages

Load the packages used below:

library(XML)
## Warning: package 'XML' was built under R version 3.4.3
library(RCurl)
## Loading required package: bitops
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.3
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.1     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## Warning: package 'tibble' was built under R version 3.4.3
## Warning: package 'tidyr' was built under R version 3.4.3
## Warning: package 'readr' was built under R version 3.4.3
## Warning: package 'purrr' was built under R version 3.4.3
## Warning: package 'dplyr' was built under R version 3.4.2
## Warning: package 'forcats' was built under R version 3.4.3
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x tidyr::complete() masks RCurl::complete()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
library(knitr)
library(jsonlite)
## 
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
## 
##     flatten

Books (HTML)

First, we download the HTML file from GitHub using the getURL function from the RCurl package and store the result in books_html_web.

books_html_web <- getURL("https://raw.githubusercontent.com/LilesB/Data-607---Week-7-Assignment/master/books.html")

Next, we parse the downloaded text with xmlParse (isHTML=TRUE selects the HTML parser) and check the class of the result with the class function.

books_html <- xmlParse(books_html_web, isHTML=TRUE)
class(books_html)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" 
## [4] "XMLAbstractDocument"

Next, we view the contents of books_html.

books_html
## <!DOCTYPE html>
## <html>
## <head><style>
## table {
##  font-family: arial, sans-serif;
##  border-collapse: collapse;
##  width: 100%;
## }
## 
## td, th {
##  text-align: left;
##  padding: 4px;
## }
## </style></head>
## <body>
## <h2>Books<h2>
## 
## </h2>
## </h2>
## <table>
## <tr>
## <th>Title</th>
##  <th>Authors</th>
##  <th>Publisher</th>
##  <th>Pages</th>
##  <th>Price</th>
##  <th>New Release Date</th>
##  <th>ISBN-10</th>
##  </tr>
## <tr>
## <td>Street Players</td>
##  <td>Donald Goines</td>
##  <td>Holloway House</td>
##  <td>256</td>
##  <td>$7.70</td>
##  <td>2006</td>
##  <td>0870678841</td>
##  </tr>
## <tr>
## <td>War and Peace</td>
##  <td>Leo Tolstoy</td>
##  <td>Modern Library Classics</td>
##  <td>1424</td>
##  <td>$12.80</td>
##  <td>2002</td>
##  <td>037570644</td>
##  </tr>
## <tr>
## <td>Homicide: A Year on the Killing Streets</td>
##  <td>David Simon</td>
##  <td>Holt Paperbacks</td>
##  <td>672</td>
##  <td>$13.34</td>
##  <td>2006</td>
##  <td>0805080759</td>
##  </tr>
## <tr>
## <td>The Isis Papers: The Keys to the Colors</td>
##  <td>Dr. Frances Cress Welsing</td>
##  <td>Third World Press</td>
##  <td>300</td>
##  <td>$13.95</td>
##  <td>1991</td>
##  <td>0883781042</td>
##  </tr>
## <tr>
## <td>From Slavery to Freedom: A History of African Americans</td>
##  <td>John Hope Franklin and Evelyn Brooks Higginbotham</td>
##  <td>McGraw-Hill</td>
##  <td>736</td>
##  <td>$116.53</td>
##  <td>2010</td>
##  <td>0072963786</td>
##  </tr>
## </table>
## </body>
## </html>
## 
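
The parsed document can also be queried directly with XPath. The following is a minimal sketch, assuming a small inline table in place of books.html (the html_text string is made up for illustration); it uses xpathSApply from the XML package to pull out the header cells and the first column:

```r
library(XML)

# Hypothetical two-column table standing in for books.html
html_text <- "<html><body><table>
  <tr><th>Title</th><th>Pages</th></tr>
  <tr><td>War and Peace</td><td>1424</td></tr>
</table></body></html>"

# Same parsing call as above, applied to an in-memory string
doc <- xmlParse(html_text, isHTML = TRUE, asText = TRUE)

# Text of every header cell, and of the first cell in each data row
headers <- xpathSApply(doc, "//th", xmlValue)
titles  <- xpathSApply(doc, "//tr/td[1]", xmlValue)

headers
titles
```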

We use readHTMLTable to read the table and then view the result. Note that it returns a list of tables rather than a single data frame.

books_html <- readHTMLTable(books_html, header=TRUE)
books_html
## $`NULL`
##                                                     Title
## 1                                          Street Players
## 2                                           War and Peace
## 3                 Homicide: A Year on the Killing Streets
## 4                 The Isis Papers: The Keys to the Colors
## 5 From Slavery to Freedom: A History of African Americans
##                                             Authors
## 1                                     Donald Goines
## 2                                       Leo Tolstoy
## 3                                       David Simon
## 4                         Dr. Frances Cress Welsing
## 5 John Hope Franklin and Evelyn Brooks Higginbotham
##                 Publisher Pages   Price New Release Date    ISBN-10
## 1          Holloway House   256   $7.70             2006 0870678841
## 2 Modern Library Classics  1424  $12.80             2002  037570644
## 3         Holt Paperbacks   672  $13.34             2006 0805080759
## 4       Third World Press   300  $13.95             1991 0883781042
## 5             McGraw-Hill   736 $116.53             2010 0072963786
books_html <- as.data.frame(books_html)
names(books_html)
## [1] "NULL.Title"            "NULL.Authors"          "NULL.Publisher"       
## [4] "NULL.Pages"            "NULL.Price"            "NULL.New.Release.Date"
## [7] "NULL.ISBN.10"

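The header titles were not lost, but they were altered: because readHTMLTable returned a list whose only element is named NULL, as.data.frame prefixed every column name with "NULL." and replaced spaces and hyphens with dots. We restore clean names using rename from the dplyr package.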

books_html <- books_html %>% rename(Title = NULL.Title,
                      Authors = NULL.Authors,
                      Publisher = NULL.Publisher,
                      Pages = NULL.Pages,
                      Price = NULL.Price,
                      NewReleaseDate = NULL.New.Release.Date,
                      ISBN10 = NULL.ISBN.10)
kable(books_html, format = "html", caption = "Books in HTML Format" )
Books in HTML Format

| Title | Authors | Publisher | Pages | Price | NewReleaseDate | ISBN10 |
|---|---|---|---|---|---|---|
| Street Players | Donald Goines | Holloway House | 256 | $7.70 | 2006 | 0870678841 |
| War and Peace | Leo Tolstoy | Modern Library Classics | 1424 | $12.80 | 2002 | 037570644 |
| Homicide: A Year on the Killing Streets | David Simon | Holt Paperbacks | 672 | $13.34 | 2006 | 0805080759 |
| The Isis Papers: The Keys to the Colors | Dr. Frances Cress Welsing | Third World Press | 300 | $13.95 | 1991 | 0883781042 |
| From Slavery to Freedom: A History of African Americans | John Hope Franklin and Evelyn Brooks Higginbotham | McGraw-Hill | 736 | $116.53 | 2010 | 0072963786 |
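
The "NULL." prefix seen in this section can be avoided by asking readHTMLTable for one table instead of a list. The sketch below assumes a small inline table (html_text is made up for illustration) and uses the which argument to return the first table as a data frame, so no renaming is needed afterwards:

```r
library(XML)

# Hypothetical stand-in for books.html
html_text <- "<html><body><table>
  <tr><th>Title</th><th>Pages</th></tr>
  <tr><td>Street Players</td><td>256</td></tr>
  <tr><td>War and Peace</td><td>1424</td></tr>
</table></body></html>"
doc <- xmlParse(html_text, isHTML = TRUE, asText = TRUE)

# Default behaviour: a list with one element per <table> in the page
tables <- readHTMLTable(doc, header = TRUE)

# which = 1 returns the first table itself as a data frame;
# stringsAsFactors = FALSE keeps the cells as character strings
books <- readHTMLTable(doc, header = TRUE, which = 1,
                       stringsAsFactors = FALSE)

names(books)  # "Title" "Pages"
```

Extracting the element with tables[[1]] should give the same column names; which = 1 simply skips the intermediate list.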

Books (XML)

First, we download the XML file from GitHub using the getURL function from the RCurl package and store the result in books_xml_web.

books_xml_web <- getURL("https://raw.githubusercontent.com/LilesB/Data-607---Week-7-Assignment/master/books.xml")

Next, we parse the data using xmlParse and check the class of the result with the class function.

books_xml <- xmlParse(books_xml_web)
class(books_xml)
## [1] "XMLInternalDocument" "XMLAbstractDocument"

Next, we view the contents of books_xml.

books_xml
## <?xml version="1.0" encoding="UTF-8"?>
## <Books>
##   <book>
##     <Title lang="en">Street Players</Title>
##     <Authors>Donald Goines</Authors>
##     <Publisher>Holloway House</Publisher>
##     <Pages>256</Pages>
##     <Price>$7.70</Price>
##     <New_Release_Date>2006</New_Release_Date>
##     <ISBN-10>0870678841</ISBN-10>
##   </book>
##   <book>
##     <Title lang="en">War and Peace</Title>
##     <Authors>Leo Tolstoy</Authors>
##     <Publisher>Modern Library Classics</Publisher>
##     <Pages>1424</Pages>
##     <Price>$12.80</Price>
##     <New_Release_Date>2002</New_Release_Date>
##     <ISBN-10>037570644</ISBN-10>
##   </book>
##   <book>
##     <Title lang="en">Homicide: A Year on the Killing Streets</Title>
##     <Authors>David Simon</Authors>
##     <Publisher>Holt Paperbacks</Publisher>
##     <Pages>672</Pages>
##     <Price>$13.34</Price>
##     <New_Release_Date>2006</New_Release_Date>
##     <ISBN-10>0805080759</ISBN-10>
##   </book>
##   <book>
##     <Title lang="en">The Isis Papers: The Keys to the Colors</Title>
##     <Authors>Dr. Frances Cress Welsing</Authors>
##     <Publisher>Third World Press</Publisher>
##     <Pages>300</Pages>
##     <Price>$13.95</Price>
##     <New_Release_Date>1991</New_Release_Date>
##     <ISBN-10>0883781042</ISBN-10>
##   </book>
##   <book>
##     <Title lang="en">From Slavery to Freedom: A History of African Americans</Title>
##     <Authors>John Hope Franklin and Evelyn Brooks Higginbotham</Authors>
##     <Publisher>McGraw-Hill</Publisher>
##     <Pages>736</Pages>
##     <Price>$116.53</Price>
##     <New_Release_Date>2010</New_Release_Date>
##     <ISBN-10>0072963786</ISBN-10>
##   </book>
## </Books>
## 

We now convert books_xml to a data frame using xmlToDataFrame.

books_xml <- xmlToDataFrame(books_xml)

Using the glimpse function, we inspect the structure of books_xml and then display the data frame. Every column, including Pages and Price, was read as a factor.

glimpse(books_xml)
## Observations: 5
## Variables: 7
## $ Title            <fctr> Street Players, War and Peace, Homicide: A Y...
## $ Authors          <fctr> Donald Goines, Leo Tolstoy, David Simon, Dr....
## $ Publisher        <fctr> Holloway House, Modern Library Classics, Hol...
## $ Pages            <fctr> 256, 1424, 672, 300, 736
## $ Price            <fctr> $7.70, $12.80, $13.34, $13.95, $116.53
## $ New_Release_Date <fctr> 2006, 2002, 2006, 1991, 2010
## $ `ISBN-10`        <fctr> 0870678841, 037570644, 0805080759, 088378104...
kable(books_xml, format = "html", caption = "Books in XML Format")
Books in XML Format

| Title | Authors | Publisher | Pages | Price | New_Release_Date | ISBN-10 |
|---|---|---|---|---|---|---|
| Street Players | Donald Goines | Holloway House | 256 | $7.70 | 2006 | 0870678841 |
| War and Peace | Leo Tolstoy | Modern Library Classics | 1424 | $12.80 | 2002 | 037570644 |
| Homicide: A Year on the Killing Streets | David Simon | Holt Paperbacks | 672 | $13.34 | 2006 | 0805080759 |
| The Isis Papers: The Keys to the Colors | Dr. Frances Cress Welsing | Third World Press | 300 | $13.95 | 1991 | 0883781042 |
| From Slavery to Freedom: A History of African Americans | John Hope Franklin and Evelyn Brooks Higginbotham | McGraw-Hill | 736 | $116.53 | 2010 | 0072963786 |
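
Since xmlToDataFrame reads element text without inferring types, numeric columns such as Pages and Price need an explicit conversion before they can be used in calculations. The following is a minimal sketch, assuming a cut-down inline version of the XML (xml_text is made up for illustration) and that prices always carry a leading dollar sign:

```r
library(XML)

# Hypothetical stand-in for books.xml with three of the seven fields
xml_text <- '<Books>
  <book><Title>Street Players</Title><Pages>256</Pages><Price>$7.70</Price></book>
  <book><Title>War and Peace</Title><Pages>1424</Pages><Price>$12.80</Price></book>
</Books>'

# stringsAsFactors = FALSE keeps every column as character
books <- xmlToDataFrame(xmlParse(xml_text, asText = TRUE),
                        stringsAsFactors = FALSE)

# Convert the numeric columns; strip the literal "$" before parsing Price
books$Pages <- as.integer(books$Pages)
books$Price <- as.numeric(sub("$", "", books$Price, fixed = TRUE))

str(books)
```

An identifier such as the ISBN is better left as character, because converting it to a number would drop its leading zero.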

Books (JSON)

First, we download the JSON file from GitHub using the getURL function from the RCurl package and store the result in books_json_web.

books_json_web <- getURL("https://raw.githubusercontent.com/LilesB/Data-607---Week-7-Assignment/master/books.json")

Next, we view the raw contents of books_json_web.

books_json_web
## [1] "{\"Books\": [\r\n\t{\"Title\":\"Street Players\",\"Authors\":\"Donald Goines\",\r\n\t\"Pubisher\":\"Holloway House\",\"Pages\":\"256\",\"Price\":\"$7.70\",\r\n\t\"New_Release_Date\":\"2006\",\"ISBN-10\":\"0870678841\"},\r\n\t{\"Title\":\"War and Peace\",\"Authors\":\"Leo Tolstoy\",\r\n\t\"Pubisher\":\"Modern Library Classics\",\"Pages\":\"1424\",\"Price\":\"$12.80\",\r\n\t\"New_Release_Date\":\"2002\",\"ISBN-10\":\"037570644\"},\r\n\t{\"Title\":\"Homicide: A Year on the Killing Streets\",\"Authors\":\"David Simon\",\r\n\t\"Pubisher\":\"Holt Paperbacks\",\"Pages\":\"672\",\"Price\":\"$13.34\",\r\n\t\"New_Release_Date\":\"2006\",\"ISBN-10\":\"0805080759\"},\r\n\t{\"Title\":\"The Isis Papers: The Keys to the Colors\",\"Authors\":\"Dr. Frances Cress Welsing\",\r\n\t\"Pubisher\":\"Third World Press\",\"Pages\":\"300\",\"Price\":\"$13.95\",\r\n\t\"New_Release_Date\":\"1991\",\"ISBN-10\":\"0883781042\"},\r\n\t{\"Title\":\"From Slavery to Freedom: A History of African Americans\",\"Authors\":\"John Hope Franklin and Evelyn Brooks Higginbotham\",\r\n\t\"Pubisher\":\"McGraw-Hill\",\"Pages\":\"736\",\"Price\":\"$116.53\",\r\n\t\"New_Release_Date\":\"2010\",\"ISBN-10\":\"0072963786\"}\r\n]}\t\r\n\r\n\t"

We then parse the JSON text using fromJSON.

books_json <- fromJSON(books_json_web)

Next, we convert books_json to a data frame and print the column names using the names function.

books_json <- as.data.frame(books_json)
names(books_json)
## [1] "Books.Title"            "Books.Authors"         
## [3] "Books.Pubisher"         "Books.Pages"           
## [5] "Books.Price"            "Books.New_Release_Date"
## [7] "Books.ISBN.10"

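As with the HTML conversion, the column names picked up a prefix, this time "Books." from the top-level key of the JSON file, so we rename them using the dplyr package. The rename also corrects the misspelled "Pubisher" key from the source file.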

books_json <- books_json %>% rename(Title = Books.Title,
                                    Authors = Books.Authors,
                                    Publisher = Books.Pubisher,
                                    Pages = Books.Pages,
                                    Price = Books.Price,
                                    NewReleaseDate = Books.New_Release_Date,
                                    ISBN10 = Books.ISBN.10)
kable(books_json, format = "html", caption = "Books in JSON Format")
Books in JSON Format

| Title | Authors | Publisher | Pages | Price | NewReleaseDate | ISBN10 |
|---|---|---|---|---|---|---|
| Street Players | Donald Goines | Holloway House | 256 | $7.70 | 2006 | 0870678841 |
| War and Peace | Leo Tolstoy | Modern Library Classics | 1424 | $12.80 | 2002 | 037570644 |
| Homicide: A Year on the Killing Streets | David Simon | Holt Paperbacks | 672 | $13.34 | 2006 | 0805080759 |
| The Isis Papers: The Keys to the Colors | Dr. Frances Cress Welsing | Third World Press | 300 | $13.95 | 1991 | 0883781042 |
| From Slavery to Freedom: A History of African Americans | John Hope Franklin and Evelyn Brooks Higginbotham | McGraw-Hill | 736 | $116.53 | 2010 | 0072963786 |
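
The "Books." prefix in this section comes from calling as.data.frame on the whole parsed list. As a sketch of an alternative, assuming a cut-down inline version of the JSON (json_text is made up for illustration), the Books element returned by fromJSON can be used directly, since jsonlite already simplifies an array of records into a data frame:

```r
library(jsonlite)

# Hypothetical stand-in for books.json with three of the seven fields
json_text <- '{"Books": [
  {"Title": "Street Players", "Pages": "256", "ISBN-10": "0870678841"},
  {"Title": "War and Peace", "Pages": "1424", "ISBN-10": "037570644"}
]}'

# fromJSON returns a list with one element per top-level key
parsed <- fromJSON(json_text)

# The "Books" element is already a data frame with the original key names
books <- parsed$Books

names(books)  # "Title" "Pages" "ISBN-10"
```

Because the values are quoted strings in the file, every column arrives as character, which also preserves the leading zero of each ISBN.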