Pick three of your favorite books on one of your favorite subjects. At least one of the books should have more than one author. For each book, include the title, authors, and two or three other attributes that you find interesting.
Take the information that you’ve selected about these three books, and separately create three files which store the book’s information in HTML (using an html table), XML, and JSON formats (e.g. “books.html”, “books.xml”, and “books.json”). To help you better understand the different file structures, I’d prefer that you create each of these files “by hand” unless you’re already very comfortable with the file formats.
Write R code, using your packages of choice, to load the information from each of the three sources into separate R data frames. Are the three data frames identical?
Your deliverable is the three source files and the R code. If you can, package your assignment solution up into an .Rmd file and publish to rpubs.com. [This will also require finding a way to make your three text files accessible from the web].
| Title | Author | Publisher | Year | Edition | ISBN |
|---|---|---|---|---|---|
| Automated Data Collection with R | Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis | John Wiley & Sons, Ltd | 2015 | 1st | 978-1-118-83481-7 |
| Data Science for Business | Foster Provost, Tom Fawcett | O’Reilly Media, Inc | 2013 | 1st | 978-1-449-36132-7 |
| Text Mining with R: A Tidy Approach | Julia Silge, David Robinson | O’Reilly Media, Inc | 2017 | 1st | 978-1-491-98165-8 |
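library(RCurl) # getURL()
library(XML) # htmlParse(), xmlParse(), xmlToDataFrame()
# Despite its name, url holds the downloaded file contents, not the link
url <- getURL('https://raw.githubusercontent.com/shirley-wong/Data-607/master/Three_Books.htm')
HTML_data <- htmlParse(url)
HTML_data
## <!DOCTYPE html>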
## <html>
## <head><title>Three Books</title></head>
## <body>
## <table>
## <tr>
## <th>Title</th>
## <th>Authors</th>
## <th>Publisher</th>
## <th>Year</th>
## <th>Edition</th>
## <th>ISBN</th>
## </tr>
## <tr>
## <td>Automated Data Collection with R</td>
## <td>Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis</td>
## <td>John Wiley & Sons, Ltd</td>
## <td>2015</td>
## <td>1st</td>
## <td>978-1-118-83481-7</td>
## </tr>
## <tr>
## <td>Data Science for Business</td>
## <td>Foster Provost, Tom Fawcett</td>
## <td>O’Reilly Media, Inc</td>
## <td>2013</td>
## <td>1st</td>
## <td>978-1-449-36132-7</td>
## </tr>
## <tr>
## <td>Text Mining with R: A Tidy Approach</td>
## <td>Julia Silge, David Robinson</td>
## <td>O’Reilly Media, Inc</td>
## <td>2017</td>
## <td>1st</td>
## <td>978-1-491-98165-8</td>
## </tr>
## </table>
## </body>
## </html>
##
rvest Package:
library(rvest) # read_html(), html_table(), %>%
library(knitr) # kable()
HTML_df <- url %>%
  read_html(encoding = 'UTF-8') %>% # parse the downloaded HTML text
  html_table(header = NA, trim = TRUE) %>% # convert every <table> to a data frame (returns a list)
  .[[1]] # keep the first (and only) table
kable(HTML_df)
| Title | Authors | Publisher | Year | Edition | ISBN |
|---|---|---|---|---|---|
| Automated Data Collection with R | Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis | John Wiley & Sons, Ltd | 2015 | 1st | 978-1-118-83481-7 |
| Data Science for Business | Foster Provost, Tom Fawcett | O’Reilly Media, Inc | 2013 | 1st | 978-1-449-36132-7 |
| Text Mining with R: A Tidy Approach | Julia Silge, David Robinson | O’Reilly Media, Inc | 2017 | 1st | 978-1-491-98165-8 |
## 'data.frame': 3 obs. of 6 variables:
## $ Title : chr "Automated Data Collection with R" "Data Science for Business" "Text Mining with R: A Tidy Approach"
## $ Authors : chr "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Julia Silge, David Robinson"
## $ Publisher: chr "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "O’Reilly Media, Inc"
## $ Year : int 2015 2013 2017
## $ Edition : chr "1st" "1st" "1st"
## $ ISBN : chr "978-1-118-83481-7" "978-1-449-36132-7" "978-1-491-98165-8"
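As a self-contained illustration of what html_table() does, the sketch below parses a small inline table instead of downloading Three_Books.htm. The names html_text and HTML_small are made up for this example, the table is trimmed to two columns, and the rvest package is assumed to be installed.

```r
library(rvest)

# Hypothetical inline stand-in for Three_Books.htm (two columns, two books)
html_text <- '<html><body><table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Data Science for Business</td><td>2013</td></tr>
  <tr><td>Text Mining with R: A Tidy Approach</td><td>2017</td></tr>
</table></body></html>'

# read_html() accepts literal HTML; html_table() returns one table per <table>
tables <- html_table(read_html(html_text))

# Keep the first table and coerce it to a plain data frame
HTML_small <- as.data.frame(tables[[1]])
str(HTML_small)
```

html_table() guesses column types, so Year arrives as a number here without any explicit conversion.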
url <- getURL('https://raw.githubusercontent.com/shirley-wong/Data-607/master/Three_Books.xml')
XML_data <- xmlParse(url)
XML_data
## <?xml version="1.0" encoding="UTF-8"?>
## <three_books>
## <book id="1">
## <Title>Automated Data Collection with R</Title>
## <Authors>
## <Author ID="1">Simon Munzert</Author>
## <Author ID="2">Christian Rubba</Author>
## <Author ID="3">Peter Meißner</Author>
## <Author ID="4">Dominic Nyhuis</Author>
## </Authors>
## <Publisher>John Wiley & Sons, Ltd</Publisher>
## <Year>2015</Year>
## <Edition>1st</Edition>
## <ISBN>978-1-118-83481-7</ISBN>
## </book>
## <book id="2">
## <Title>Data Science for Business</Title>
## <Authors>
## <Author ID="1">Foster Provost</Author>
## <Author ID="2">Tom Fawcett</Author>
## </Authors>
## <Publisher>O’Reilly Media, Inc</Publisher>
## <Year>2013</Year>
## <Edition>1st</Edition>
## <ISBN>978-1-449-36132-7</ISBN>
## </book>
## <book id="3">
## <Title>Text Mining with R: A Tidy Approach</Title>
## <Authors>
## <Author ID="1">Julia Silge</Author>
## <Author ID="2">David Robinson</Author>
## </Authors>
## <Publisher>O’Reilly Media, Inc</Publisher>
## <Year>2017</Year>
## <Edition>1st</Edition>
## <ISBN>978-1-491-98165-8</ISBN>
## </book>
## </three_books>
##
XML Package:
library(dplyr) # mutate(), %>%
XML_df <- url %>%
  xmlParse() %>% # parse the downloaded XML text
  xmlRoot() %>% # get the root node <three_books>
  xmlToDataFrame(stringsAsFactors = FALSE) %>% # one row per <book>; every column is read as character
  mutate(Year = as.integer(Year)) # convert Year to integer, as in the HTML data frame
kable(XML_df)
| Title | Authors | Publisher | Year | Edition | ISBN |
|---|---|---|---|---|---|
| Automated Data Collection with R | Simon MunzertChristian RubbaPeter MeißnerDominic Nyhuis | John Wiley & Sons, Ltd | 2015 | 1st | 978-1-118-83481-7 |
| Data Science for Business | Foster ProvostTom Fawcett | O’Reilly Media, Inc | 2013 | 1st | 978-1-449-36132-7 |
| Text Mining with R: A Tidy Approach | Julia SilgeDavid Robinson | O’Reilly Media, Inc | 2017 | 1st | 978-1-491-98165-8 |
## 'data.frame': 3 obs. of 6 variables:
## $ Title : chr "Automated Data Collection with R" "Data Science for Business" "Text Mining with R: A Tidy Approach"
## $ Authors : chr "Simon MunzertChristian RubbaPeter MeißnerDominic Nyhuis" "Foster ProvostTom Fawcett" "Julia SilgeDavid Robinson"
## $ Publisher: chr "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "O’Reilly Media, Inc"
## $ Year : int 2015 2013 2017
## $ Edition : chr "1st" "1st" "1st"
## $ ISBN : chr "978-1-118-83481-7" "978-1-449-36132-7" "978-1-491-98165-8"
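The Authors column above runs the names together because xmlToDataFrame() takes the text of the whole <Authors> element. One possible way around this, sketched here and not part of the original solution, is to pull the <Author> nodes out per book and join them with a delimiter. The inline xml_text is a trimmed stand-in for Three_Books.xml, and XML_fixed is a hypothetical name.

```r
library(XML)

# Hypothetical inline stand-in for Three_Books.xml (two books, two fields)
xml_text <- '<three_books>
  <book id="1">
    <Title>Data Science for Business</Title>
    <Authors>
      <Author ID="1">Foster Provost</Author>
      <Author ID="2">Tom Fawcett</Author>
    </Authors>
  </book>
  <book id="2">
    <Title>Text Mining with R: A Tidy Approach</Title>
    <Authors>
      <Author ID="1">Julia Silge</Author>
      <Author ID="2">David Robinson</Author>
    </Authors>
  </book>
</three_books>'

doc   <- xmlParse(xml_text, asText = TRUE)
books <- getNodeSet(doc, "//book")

# Build each column by hand so the authors can be joined with ", "
XML_fixed <- data.frame(
  Title = sapply(books, function(b) xpathSApply(b, "./Title", xmlValue)),
  Authors = sapply(books, function(b) {
    paste(xpathSApply(b, "./Authors/Author", xmlValue), collapse = ", ")
  }),
  stringsAsFactors = FALSE
)
XML_fixed
```

With the delimiter added at parse time, the Authors strings take the same "name, name" form that the HTML and JSON versions use.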
‘JSON Source File’
jsonlite Package:
library(jsonlite) # fromJSON()
library(stringr) # str_c()
url <- getURL("https://raw.githubusercontent.com/shirley-wong/Data-607/master/Three_Books.json")
JSON_df <- url %>%
  fromJSON() %>% # parse the downloaded JSON text
  .[[1]] %>% # the first list element is the data frame of books
  mutate(Authors = unlist(lapply(Authors, function(x) str_c(x, collapse = ', ')))) # collapse each book's author list into one comma-separated string
kable(JSON_df)
| Title | Authors | Publisher | Year | Edition | ISBN |
|---|---|---|---|---|---|
| Automated Data Collection with R | Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis | John Wiley & Sons, Ltd | 2015 | 1st | 978-1-118-83481-7 |
| Data Science for Business | Foster Provost, Tom Fawcett | O’Reilly Media, Inc | 2013 | 1st | 978-1-449-36132-7 |
| Text Mining with R: A Tidy Approach | Julia Silge, David Robinson | O’Reilly Media, Inc | 2017 | 1st | 978-1-491-98165-8 |
## 'data.frame': 3 obs. of 6 variables:
## $ Title : chr "Automated Data Collection with R" "Data Science for Business" "Text Mining with R: A Tidy Approach"
## $ Authors : chr "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis" "Foster Provost, Tom Fawcett" "Julia Silge, David Robinson"
## $ Publisher: chr "John Wiley & Sons, Ltd" "O’Reilly Media, Inc" "O’Reilly Media, Inc"
## $ Year : int 2015 2013 2017
## $ Edition : chr "1st" "1st" "1st"
## $ ISBN : chr "978-1-118-83481-7" "978-1-449-36132-7" "978-1-491-98165-8"
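The JSON file itself is not reproduced here, so the sketch below assumes a layout consistent with the code above: one top-level key holding an array of books, each with an "Authors" array. It shows why the collapse step is needed: fromJSON() returns Authors as a list column when the arrays differ in length. The names json_text and JSON_small and the key "three_books" are assumptions, and base paste() stands in for str_c().

```r
library(jsonlite)

# Hypothetical inline stand-in for Three_Books.json (three fields, two books).
# \\u00df is the JSON escape for the sharp s in "Meißner".
json_text <- '{"three_books": [
  {"Title": "Automated Data Collection with R",
   "Authors": ["Simon Munzert", "Christian Rubba", "Peter Mei\\u00dfner", "Dominic Nyhuis"],
   "Year": 2015},
  {"Title": "Data Science for Business",
   "Authors": ["Foster Provost", "Tom Fawcett"],
   "Year": 2013}
]}'

# The first list element is the data frame of books
JSON_small <- fromJSON(json_text)[[1]]

# Authors is a list column: one character vector per book
authors_was_list <- is.list(JSON_small$Authors)

# Collapse each vector into a single comma-separated string
JSON_small$Authors <- vapply(JSON_small$Authors, paste, character(1), collapse = ", ")
str(JSON_small)
```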
1. Between HTML and XML
The two data frames built from the HTML file and the XML file are not identical. The data in the HTML <table> element is parsed into the R data frame completely and accurately. In the XML file, however, each <Authors> element holds several <Author> children, and xmlToDataFrame() takes the text of the whole <Authors> element, so the author names are concatenated without a delimiter.
## [1] "Component \"Authors\": 3 string mismatches"
## [,1]
## [1,] "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis"
## [2,] "Simon MunzertChristian RubbaPeter MeißnerDominic Nyhuis"
## [,2] [,3]
## [1,] "Foster Provost, Tom Fawcett" "Julia Silge, David Robinson"
## [2,] "Foster ProvostTom Fawcett" "Julia SilgeDavid Robinson"
2. Between HTML and JSON
The two data frames are identical once the JSON Authors lists have been collapsed into comma-separated strings.
## [1] TRUE
3. Between XML and JSON
The two data frames built from the XML file and the JSON file are not identical either. The text of the XML <Authors> element is concatenated without a delimiter, whereas the values of the JSON “Authors” field are joined with ‘, ’ as the delimiter.
## [1] "Component \"Authors\": 3 string mismatches"
## [,1]
## [1,] "Simon MunzertChristian RubbaPeter MeißnerDominic Nyhuis"
## [2,] "Simon Munzert, Christian Rubba, Peter Meißner, Dominic Nyhuis"
## [,2] [,3]
## [1,] "Foster ProvostTom Fawcett" "Julia SilgeDavid Robinson"
## [2,] "Foster Provost, Tom Fawcett" "Julia Silge, David Robinson"
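The code behind the three comparisons is not shown. Output of this shape is what all.equal() and rbind() print, so the following is a guess at the approach, not the original code. HTML_like and XML_like are small hypothetical stand-ins for HTML_df and XML_df.

```r
# Hypothetical stand-ins: same titles, Authors with and without delimiters
HTML_like <- data.frame(
  Title = c("Data Science for Business", "Text Mining with R: A Tidy Approach"),
  Authors = c("Foster Provost, Tom Fawcett", "Julia Silge, David Robinson"),
  stringsAsFactors = FALSE
)
XML_like <- HTML_like
XML_like$Authors <- gsub(", ", "", XML_like$Authors, fixed = TRUE)

# identical() gives a strict yes/no answer
identical(HTML_like, XML_like)

# all.equal() returns TRUE or a description of what differs, column by column
all.equal(HTML_like, XML_like)

# Stack the two Authors columns to see the mismatches side by side
rbind(HTML_like$Authors, XML_like$Authors)
```

Because all.equal() returns a character vector when the inputs differ, wrap it in isTRUE() when the result feeds an if() condition.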